On 13.12.20 at 15:35, Allin Cottrell wrote:
> On Sat, 12 Dec 2020, Artur Tarassow wrote:
>
>> Hi all,
>>
>> out of curiosity I tried to replicate a comparison between two Python
>> libraries, namely pandas and py-polars. I also added gretl to the
>> horse race.
>>
>> There were two simple tasks to do:
>> 1) Load a 360 MB CSV file with two columns and 26 million rows.
>> 2) Sort by one of the columns.

Thanks for the thorough look at this and for your reply, Allin. I am
going to update the GitHub summary page accordingly tomorrow or so.
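
For reference, a minimal hansl sketch of those two steps (the file name
is hypothetical, and this is just an illustration, not necessarily the
exact script I ran):

  # open the CSV and time the import
  set stopwatch
  open "large.csv" --quiet
  printf "loading took %g seconds\n", $stopwatch
  # sort the full dataset by the integer column "n" and time it
  dataset sortby n
  printf "sorting took %g seconds\n", $stopwatch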

> Here are some comments on Artur's experiment.
>
> Loading speed: The CSV file in question has two columns, "author"
> (strings) and "n" (integers). Since "author" is not recognized by gretl
> as indicating observation markers, the first column was being treated
> as a string-valued variable, hence requiring a numeric coding and a
> hash table. Since the author strings are actually unique IDs (that is,
> observation markers) the hash table was huge, and the work setting it
> up a huge waste of cycles. Things go quite a lot better if you rename
> the first column as "obs".

Wow, this has reduced reading time from 34 to 12 seconds! Very useful to
know.
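
Just to spell out what that looks like in practice, assuming the first
field of the CSV header has been changed from "author" to "obs" (a
minimal sketch, file name hypothetical):

  # header of the file is assumed to read "obs,n" rather than "author,n"
  set stopwatch
  open "large_obs.csv" --quiet
  printf "loading took %g seconds\n", $stopwatch
  # the author strings are now observation markers, not a string-valued series
  printf "first marker: %s\n", obslabel(1)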

> Sorting a dataset: This was not optimized for a huge number of
> observations. We were allocating more temporary storage than strictly
> necessary and moreover, at some points, calling malloc() per
> observation when it was possible to substitute a single call to get a
> big chunk of memory. Neither of these points was much of an issue with
> a normal-size dataset, but both became a serious problem in the huge
> case. That's now addressed in git.

As I already wrote to you privately, this change is a real boost:
sorting time went down from 14 to 7.5 seconds. Thanks for this!

By the way, does this increased speed in sorting also affect the
aggregate() function?
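
The kind of call I have in mind is roughly the following (the series
names y and g are purely illustrative, not from the actual dataset):

  # time a grouped statistic: mean of y within each value of g
  set stopwatch
  matrix m = aggregate(y, g, "mean")
  printf "aggregate() took %g seconds\n", $stopwatch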

> Memory usage: gretl stores all numeric data, including the coding for
> string-valued variables, as doubles. So when the CSV file was read
> as-is, besides storing the strings we were also storing two doubles per
> observation, instead of a single int ("n"). That's 16 - 4 = 12 extra
> bytes per observation, an additional 297 MB. Even once the first column
> is changed to "obs" the data expand by 100 MB in memory. If we truly
> want to handle data of this sort and magnitude, we'd have to introduce
> more economical internal storage types for data that don't really need
> 8 bytes per value. That would be a hefty redesign.
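
If I read those figures correctly, the back-of-the-envelope arithmetic
(row count taken as roughly 26 million, MB meaning 2^20 bytes) goes
something like this:

  # rough check of the extra-storage figures quoted above
  scalar nobs = 26e6
  # read as-is: two 8-byte doubles per row instead of a single 4-byte int
  printf "string-valued case: ~%.0f MB extra\n", nobs * (2*8 - 4) / 2^20
  # with "obs" markers only "n" is stored as data, as a double rather than an int
  printf "obs-marker case:    ~%.0f MB extra\n", nobs * (8 - 4) / 2^20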

I think you mentioned this 'issue' a while ago on the mailing list, when
I had to deal with some other large dataset that consumed a lot of
memory.

> Writing gdtb: Artur gave up on this as taking far too long. I suspect
> the big problem here is zip compression. A gdtb file is a zipfile
> containing numeric data in binary plus metadata in XML. I wonder,
> Artur, are you building gretl using libgsf? On running the configure
> script you'll see a line of output that says
>
>   Use libgsf for zip/unzip: <yes or no>
>
> If you're not using libgsf you might find that using it helps.

Yes, I already use libgsf as I have the appropriate dev-library installed.

> However, gdtb is not designed for a dataset like this. I guess you'd
> want an uncompressed pure-binary format for best read and write speed
> and minimal memory consumption.

I guess you're right, but how can I store data in a pure-binary format
without any compression? I can't find anything on this in the help text.

By the way, I stored the CSV file as a gdt file via "store
<FILENAME>.gdt". To my surprise, the gdt file is 843 MB in size and
hence almost 2.5 times larger than the CSV.

Also, trying to open that file consumes a massive amount of RAM here on
my Ubuntu 20.04 machine: usage increases from 1 GB before starting to
read the file to 15 GB (and actually slightly more, as the swap file
already starts to grow).

Thanks,
Artur