On 13.12.20 at 15:35, Allin Cottrell wrote:
> On Sat, 12 Dec 2020, Artur Tarassow wrote:
>
>> Hi all,
>>
>> out of curiosity I tried to replicate a comparison between two Python
>> libraries, namely pandas and py-polars. I also added gretl to the
>> horse race.
>>
>> There were two simple tasks to do:
>> 1) Load a 360 MB CSV file with two columns and 26 million rows.
>> 2) Sort by one of the columns.

Thanks for the thorough look at this and for your reply, Allin. I am
going to update the GitHub summary page accordingly tomorrow or so.
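
For reference, a minimal hansl sketch of those two steps (the file name
is hypothetical, and this is just an illustration, not necessarily the
exact script I ran):

  # open the CSV and time the import
  set stopwatch
  open "large.csv" --quiet
  printf "loading took %g seconds\n", $stopwatch
  # sort the full dataset by the integer column "n" and time it
  dataset sortby n
  printf "sorting took %g seconds\n", $stopwatch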

> Here are some comments on Artur's experiment.
>
> Loading speed: The CSV file in question has two columns, "author"
> (strings) and "n" (integers). Since "author" is not recognized by gretl
> as indicating observation markers, the first column was being treated
> as a string-valued variable, hence requiring a numeric coding and a
> hash table. Since the author strings are actually unique IDs (that is,
> observation markers) the hash table was huge, and the work setting it
> up a huge waste of cycles. Things go quite a lot better if you rename
> the first column as "obs".

Wow, this has reduced reading time from 34 to 12 seconds! Very useful to
know.
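
Just to spell out what that looks like in practice, assuming the first
field of the CSV header has been changed from "author" to "obs" (a
minimal sketch, file name hypothetical):

  # header of the file is assumed to read "obs,n" rather than "author,n"
  set stopwatch
  open "large_obs.csv" --quiet
  printf "loading took %g seconds\n", $stopwatch
  # the author strings are now observation markers, not a string-valued series
  printf "first marker: %s\n", obslabel(1)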

> Sorting a dataset: This was not optimized for a huge number of
> observations. We were allocating more temporary storage than strictly
> necessary and moreover, at some points, calling malloc() per
> observation when it was possible to substitute a single call to get a
> big chunk of memory. Neither of these points was much of an issue with
> a normal-size dataset, but both became a serious problem in the huge
> case. That's now addressed in git.

As I already wrote to you privately, this change is a real boost:
sorting time went down from 14 to 7.5 seconds. Thanks for this!

By the way, does this increased speed in sorting also affect the
aggregate() function?
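
The kind of call I have in mind is roughly the following (the series
names y and g are purely illustrative, not from the actual dataset):

  # time a grouped statistic: mean of y within each value of g
  set stopwatch
  matrix m = aggregate(y, g, "mean")
  printf "aggregate() took %g seconds\n", $stopwatch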

> Memory usage: gretl stores all numeric data, including the coding for
> string-valued variables, as doubles. So when the CSV file was read
> as-is, besides storing the strings we were also storing two doubles per
> observation, instead of a single int ("n"). That's 16 - 4 = 12 extra
> bytes per observation, an additional 297 MB. Even once the first column
> is changed to "obs" the data expand by 100 MB in memory. If we truly
> want to handle data of this sort and magnitude, we'd have to introduce
> more economical internal storage types for data that don't really need
> 8 bytes per value. That would be a hefty redesign.
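
If I read those figures correctly, the back-of-the-envelope arithmetic
(row count taken as roughly 26 million, MB meaning 2^20 bytes) goes
something like this:

  # rough check of the extra-storage figures quoted above
  scalar nobs = 26e6
  # read as-is: two 8-byte doubles per row instead of a single 4-byte int
  printf "string-valued case: ~%.0f MB extra\n", nobs * (2*8 - 4) / 2^20
  # with "obs" markers only "n" is stored as data, as a double rather than an int
  printf "obs-marker case:    ~%.0f MB extra\n", nobs * (8 - 4) / 2^20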

I think you mentioned this 'issue' a while ago on the mailing list, when
I had to deal with some other large dataset that consumed a lot of
memory.

> Writing gdtb: Artur gave up on this as taking far too long. I suspect
> the big problem here is zip compression. A gdtb file is a zipfile
> containing numeric data in binary plus metadata in XML. I wonder,
> Artur, are you building gretl using libgsf? On running the configure
> script you'll see a line of output that says
>
>   Use libgsf for zip/unzip: <yes or no>
>
> If you're not using libgsf you might find that using it helps.

Yes, I already use libgsf as I have the appropriate dev-library installed.

> However, gdtb is not designed for a dataset like this. I guess you'd
> want an uncompressed pure-binary format for best read and write speed
> and minimal memory consumption.

I guess you're right, but how can I store data in a pure-binary format
without any compression? I can't find anything on this in the help text.

By the way, I stored the CSV file as a gdt file via "store
<FILENAME>.gdt". To my surprise, the gdt file is 843 MB in size and
hence almost 2.5 times larger than the CSV.

Also, trying to open that file consumes a massive amount of RAM here on
my Ubuntu 20.04 machine: usage increases from 1 GB before starting to
read the file to 15 GB (and actually slightly more, as the swap file
already starts to grow).

Thanks,
Artur