Yesterday I posted about my tests with Gemini and bigger tables. I realized, and Chris Webb also suggested, that my method of generating new records by simply duplicating existing ones probably affected my results. So I ran more tests with different data.
I generated a new table by adding random numbers to the existing numeric fields (dimension keys and amounts). From the new table I was able to load only about 17mln rows into Gemini; when I tried to load more records, I got the same memory error message as before. My load speed was about 3.5mln rows per minute. Saving the Excel workbook took 75 seconds this time, and the xlsx file was much larger: 448MB. While I was working with this data set, Excel was using 740MB of RAM. Opening the saved workbook took 85 seconds.
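To put those figures in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the numbers quoted above. Treating 1MB as 1,000,000 bytes is my assumption, and the RAM figure includes Excel itself, so the per-row values are rough estimates and not measurements of how Gemini stores data internally.

```python
# Figures quoted in this post
rows = 17_000_000              # rows loaded into Gemini
load_rate_per_min = 3_500_000  # rows loaded per minute
file_mb = 448                  # size of the saved xlsx file
ram_mb = 740                   # RAM used by Excel with the data loaded

BYTES_PER_MB = 1_000_000       # assumption: decimal megabytes

# Total load time implied by the load rate
load_minutes = rows / load_rate_per_min

# Average footprint per row on disk and in memory
file_bytes_per_row = file_mb * BYTES_PER_MB / rows
ram_bytes_per_row = ram_mb * BYTES_PER_MB / rows

print(f"load time: ~{load_minutes:.1f} min")
print(f"file size: ~{file_bytes_per_row:.0f} bytes per row")
print(f"RAM usage: ~{ram_bytes_per_row:.0f} bytes per row")
```

That works out to roughly 4.9 minutes for the full load, about 26 bytes per row in the saved file, and about 44 bytes per row in RAM.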
Although some operations were slower with random data, I was still able to confirm that once the data was loaded, all filtering and sorting operations were very fast and all pivot queries returned results almost instantly. So whether the data is duplicated or not, if you are able to fit it into memory, Gemini will handle it with amazing speed.
During my tests I realized that the amount of data you can load into Gemini depends entirely on your data. In my initial data “randomization” attempt I did not round my numeric results, so I had numbers like 1.234567890. With such data I was able to load just 4mln rows into Gemini, and the size of the Excel workbook was about 580MB. After applying rounding to the same fields I was able to load about 4 times (!) more data: 17mln rows. So when you build your Gemini models, make sure that for bigger fact tables you load only the fields that are necessary for your analysis, and make sure you round the numeric values you use in calculations. There are no miracles: every character uses memory space, and you need to minimize usage of that space as much as possible.
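One way to see why rounding might matter so much is to count how many distinct values a column holds before and after rounding. The sketch below is only an illustration in Python with made-up amounts. It assumes that Gemini's in-memory storage benefits from columns with fewer distinct values, which is my guess and not something I measured. The value range, row count, and two-decimal rounding are all hypothetical.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical amounts: a base value plus random noise with many decimals,
# similar in spirit to the unrounded "randomized" fields described above
raw = [100 + random.random() * 50 for _ in range(100_000)]

# The same amounts rounded to two decimal places
rounded = [round(value, 2) for value in raw]

# Count the distinct values in each version of the column
distinct_raw = len(set(raw))
distinct_rounded = len(set(rounded))

print(f"distinct values, unrounded: {distinct_raw}")
print(f"distinct values, rounded:   {distinct_rounded}")
```

With two decimals in the range 100 to 150 there can be at most 5,001 distinct rounded values, while almost every unrounded value is unique. If the assumption about distinct values holds, that difference would go a long way toward explaining the jump from 4mln to 17mln rows.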
I am still learning Gemini, and I am still impressed with the results.