On Sat, 12 Dec 2020, Artur Tarassow wrote:
Hi all,
out of curiosity I tried to replicate a comparison between two Python
libraries, namely pandas and py-polars. I also added gretl to the
horse race. There are two simple tasks to do:
1) Load a 360 MB CSV file with two columns and 26 million rows.
2) Sort by one of the columns.
Here are some comments on Artur's experiment.
Loading speed: The CSV file in question has two columns, "author"
(strings) and "n" (integers). Since "author" is not recognized by
gretl as indicating observation markers, the first column was being
treated as a string-valued variable, hence requiring a numeric
coding and a hash table. Since the author strings are actually
unique IDs (that is, observation markers), the hash table was huge
and the work of setting it up a big waste of cycles. Things go quite
a lot better if you rename the first column as "obs".
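For concreteness, here's a minimal hansl sketch of the timed load
step ("authors.csv" is a stand-in name, and I'm assuming the
header's first field has been edited from "author" to "obs"):

  # with the first column named "obs", gretl takes the strings
  # as observation markers and builds no string table
  set stopwatch
  open "authors.csv" --quiet
  printf "load took %g seconds\n", $stopwatch
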
Sorting a dataset: This operation was not optimized for a huge
number of observations. We were allocating more temporary storage
than strictly necessary and, moreover, at some points calling
malloc() per observation when a single call for one big chunk of
memory would have done. Neither of these points was much of an
issue with a normal-sized dataset, but they became a serious problem
in the huge case. That's now addressed in git.
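Continuing the sketch above, the sort step can be timed in the
same way ("n" being the integer column):

  # sort the full dataset by the integer column
  set stopwatch
  dataset sortby n
  printf "sort took %g seconds\n", $stopwatch
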
Memory usage: gretl stores all numeric data, including the numeric
coding for string-valued variables, as doubles. So when the CSV file
was read as-is, besides storing the strings we were also storing two
doubles per observation instead of a single 4-byte int (for "n").
That's 16 - 4 = 12 extra bytes per observation, an additional
297 MB. Even once the first column is changed to "obs", the data
expand by 100 MB in memory. If we truly want to handle data of this
sort and magnitude, we'd have to introduce more economical internal
storage types for data that don't really need 8 bytes per value.
That would be a hefty redesign.
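To spell out the arithmetic over 26 million observations (taking
1 MB as 2^20 bytes):

  26,000,000 x 12 bytes = 312,000,000 bytes, about 297 MB extra
  26,000,000 x  4 bytes = 104,000,000 bytes, about  99 MB extra
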
Writing gdtb: Artur gave up on this as taking far too long. I
suspect the big problem here is zip compression: a gdtb file is a
zipfile containing the numeric data in binary plus metadata in XML.
I wonder, Artur, are you building gretl using libgsf? On running the
configure script you'll see a line of output that says:

  Use libgsf for zip/unzip: <yes or no>

If you're not using libgsf, you might find that using it helps.
However, gdtb is not designed for a dataset like this. I guess you'd
want an uncompressed pure-binary format for best read and write
speed and minimal memory consumption.
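For anyone who wants to quantify this, the write step can be timed
like the others (again with "authors.gdtb" as a stand-in name):

  # time writing the gretl binary data format
  set stopwatch
  store "authors.gdtb"
  printf "store took %g seconds\n", $stopwatch
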
Allin