Hi all,
out of curiosity I tried to replicate a comparison between two Python
libraries, namely pandas and py-polars. I also added gretl to the horse
race.
There are two simple tasks to do (a rough sketch of the Python calls
follows the list):
1) Load a 360MB csv file with two columns and 26 million rows.
2) Sort by one of the columns.
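For the two Python libraries, both steps boil down to a couple of
calls; roughly like this -- a minimal sketch, not the exact script
from the repo, and the csv name and the column name are just
placeholders:

    import time
    import pandas as pd
    import polars as pl

    # task 1: load the csv (file name is a placeholder)
    t0 = time.perf_counter()
    df_pl = pl.read_csv("data.csv")
    print("polars load:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    df_pd = pd.read_csv("data.csv")
    print("pandas load:", time.perf_counter() - t0)

    # task 2: sort by one of the two columns ("x" is a placeholder)
    t0 = time.perf_counter()
    df_pl = df_pl.sort("x")
    print("polars sort:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    df_pd = df_pd.sort_values("x")
    print("pandas sort:", time.perf_counter() - t0)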
The summary is as follows:
1) Py-polars takes about 2.5 seconds to load the 360MB csv file as a
data frame. Pandas is about 5 times slower, while gretl needs 36 seconds
for this.
2) Sorting 26 million records takes py-polars about 4.6 seconds, which
is only slightly faster than pandas at 5.8 seconds -- okay, still about
20 percent. Gretl needs about 14 seconds for the same task.
I am not too bothered by gretl's performance -- okay, csv reading may
be faster elsewhere, but still: how often do you load such a big file?
The repo and readme can be accessed here:
https://github.com/atecon/gretl_pandas_pypolars
There is one weird thing, though: at the end of the README.md you can
see that I tried to store the loaded csv data set as a gdtb file, but
after an hour of computation I gave up... Maybe worth looking into.
Best,
Artur