On Sat, 12 Dec 2020, Artur Tarassow wrote:
Hi all,
out of curiosity I tried to replicate a comparison between two Python
libraries, namely pandas and py-polars. I also added gretl to the
horse race. There are two simple tasks to do:
1) Load a 360 MB CSV file with two columns and 26 million rows.
2) Sort by one of the columns.
Here are some comments on Artur's experiment.
Loading speed: The CSV file in question has two columns, "author"
(strings) and "n" (integers). Since "author" is not recognized by
gretl as indicating observation markers, the first column was being
treated as a string-valued variable, hence requiring a numeric
coding and a hash table. Since the author strings are actually
unique IDs (that is, observation markers), the hash table was huge
and the work of setting it up a big waste of cycles. Things go quite
a lot better if you rename the first column as "obs".
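For concreteness, here's a minimal hansl sketch of the timed load
step ("authors.csv" is a stand-in name, and I'm assuming the
header's first field has been edited from "author" to "obs"):

  # with the first column named "obs", gretl takes the strings
  # as observation markers and builds no string table
  set stopwatch
  open "authors.csv" --quiet
  printf "load took %g seconds\n", $stopwatch
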
Sorting a dataset: This operation was not optimized for a huge
number of observations. We were allocating more temporary storage
than strictly necessary and, moreover, at some points calling
malloc() per observation when a single call for one big chunk of
memory would have done. Neither of these points was much of an
issue with a normal-sized dataset, but they became a serious problem
in the huge case. That's now addressed in git.
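Continuing the sketch above, the sort step can be timed in the
same way ("n" being the integer column):

  # sort the full dataset by the integer column
  set stopwatch
  dataset sortby n
  printf "sort took %g seconds\n", $stopwatch
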
Memory usage: gretl stores all numeric data, including the numeric
coding for string-valued variables, as doubles. So when the CSV file
was read as-is, besides storing the strings we were also storing two
doubles per observation instead of a single 4-byte int (for "n").
That's 16 - 4 = 12 extra bytes per observation, an additional
297 MB. Even once the first column is changed to "obs", the data
expand by 100 MB in memory. If we truly want to handle data of this
sort and magnitude, we'd have to introduce more economical internal
storage types for data that don't really need 8 bytes per value.
That would be a hefty redesign.
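To spell out the arithmetic over 26 million observations (taking
1 MB as 2^20 bytes):

  26,000,000 x 12 bytes = 312,000,000 bytes, about 297 MB extra
  26,000,000 x  4 bytes = 104,000,000 bytes, about  99 MB extra
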
Writing gdtb: Artur gave up on this as taking far too long. I
suspect the big problem here is zip compression: a gdtb file is a
zipfile containing the numeric data in binary plus metadata in XML.
I wonder, Artur, are you building gretl using libgsf? On running the
configure script you'll see a line of output that says:

  Use libgsf for zip/unzip: <yes or no>

If you're not using libgsf, you might find that using it helps.
However, gdtb is not designed for a dataset like this. I guess you'd
want an uncompressed pure-binary format for best read and write
speed and minimal memory consumption.
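For anyone who wants to quantify this, the write step can be timed
like the others (again with "authors.gdtb" as a stand-in name):

  # time writing the gretl binary data format
  set stopwatch
  store "authors.gdtb"
  printf "store took %g seconds\n", $stopwatch
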
Allin