On Wed, 17 Apr 2019, Logan Kelly wrote:
> I have students who are working with a very big dataset -- around
> 9 million observations. I had one student try to load a 4 GB CSV
> file into gretl, and gretl loaded it! But with some errors.
What sort of errors -- can you elaborate?
> So my questions are:
> 1. What is the largest dataset one should expect gretl to handle?
Well, that's going to depend on how much RAM you have.
> 2. Are there any suggestions for handling large datasets in gretl?
For one thing, with many millions of observations any tiny, tiny
effect will be "statistically significant"; it's probably a good
idea to down-sample (perhaps at random) to an n in the hundreds of
thousands.
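For instance, here's a minimal hansl sketch of random down-sampling
(the 5 percent fraction and the seed are just illustrative):

  # keep a random subset of roughly 5 percent of the observations
  set seed 20190417
  series u = uniform()
  smpl u < 0.05 --restrict
  # ... run your estimation on the reduced sample ...
  smpl full   # restore the full dataset when done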
> 3. Is there a better file type than CSV to import large datasets
> into gretl?
Not really; our CSV importer is about the most effective of our
various importers.
A general comment: In gretl, every data value is stored as a
"double" (a double-precision floating-point value, which occupies 64
bits or 8 bytes). But in some huge datasets many of the variables
may be representable in a much smaller data type, such as a single
byte (8 bits). If you're loading a 4 GB CSV file with a lot of 0s
and 1s as data values, those values will be expanded by a factor of
8 in gretl's in-memory version -- which may make the difference
between feasible and infeasible, for given RAM.
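To make the arithmetic concrete (the series count here is just for
illustration): 9 million observations of one series take
9,000,000 x 8 bytes, roughly 72 MB, so a dataset of, say, 60 such
series already needs over 4 GB of RAM in gretl's native
representation -- while the same values stored as single bytes
would fit in about 540 MB.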
This is something we may want to think about in future. It will not
be easy to allow smaller data types for series, but maybe that's
something we should aim for eventually.
Allin