On 14.12.20 at 17:41 Allin Cottrell wrote:
On Mon, 14 Dec 2020, Artur Tarassow wrote:
[...]
> On 13.12.20 at 15:35 Allin Cottrell wrote:
>
>> Sorting a dataset: This was not optimized for a huge number of
>> observations. We were allocating more temporary storage than
>> strictly necessary and moreover, at some points, calling malloc()
>> per observation when it was possible to substitute a single call to
>> get a big chunk of memory. Neither of these points was much of an
>> issue with a normal-size dataset, but they became a serious problem
>> in the huge case. That's now addressed in git.
>
> As I already wrote to you privately: this change is a real boost, as
> sorting time dropped from 14 to 7.5 seconds. Thanks for this!
>
> By the way, does this speed-up in sorting also benefit the
> aggregate() function?
Right now it's specific to the case of sorting an entire dataset, but
it would be worth taking a look at the aggregate case too.
Ok, thanks.
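For reference, the kind of call I have in mind is something like the
following (a minimal sketch with made-up data, assuming I recall the
aggregate() signature correctly):

<hansl>
# made-up data: 4 groups, 1000 observations
nulldata 1000
series g = 1 + int(4 * uniform())   # grouping variable
series x = normal()
# group-wise means of x by g; presumably this involves
# sorting by the grouping variable internally
matrix m = aggregate(x, g, "mean")
print m
</hansl>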
>> gdtb is not designed for a dataset like this. I guess you'd want an
>> uncompressed pure-binary format for best read and write speed and
>> minimal memory consumption.
>
> I guess you're right, but how can I store data in a pure-binary
> format without any compression? I can't find anything on this in the
> help text.
Well, no. That's not a format that gretl has offered to date. But see
below.
> By the way, I stored the csv file as a gdt file via "store
> <FILENAME>.gdt". To my surprise, the gdt file is 843 MB, and hence
> almost 2.5 times larger than the csv.
That shouldn't be surprising if you've ever looked "inside" a gdt file.
The CSV file just contains the data. The structured XML file contains
per-observation tags. With a single data series (mostly smallish
integers), the tags occupy more bytes than the data.
I see. I guess that's a case where a JSON format might be beneficial,
since, if I'm right, it would not need so many tags. But I'm pretty
sure that switching from XML to JSON would involve complex changes
under the hood.
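As a quick illustration of the tag overhead, one could store the same
small integer series in both formats and compare the file sizes on
disk (a minimal sketch; the exact ratio will of course vary):

<hansl>
# hypothetical illustration: same data, two formats
nulldata 100000
series n = int(100 * uniform())   # smallish integers, as in my case
store tagcheck.csv n
store tagcheck.gdt n
# now compare the sizes of tagcheck.csv and tagcheck.gdt on disk:
# the XML version should come out several times larger
</hansl>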
Just one note: if the topic of large/big data should become (more)
relevant for the gretl project in future, the open-source Parquet
file format may be of interest (https://parquet.apache.org/). It is a
widely used format in the big-data context nowadays. Here is another
reference: https://en.wikipedia.org/wiki/Apache_Parquet
Ah, I just found out about the xmlreader API of libxml2
(http://xmlsoft.org/xmlreader.html). Instead of loading the whole
document into memory and exposing it as a tree, xmlreader allows one
to step through the document as a stream, which may be a more
efficient way to go. But still, as you said, this does not solve the
basic issue of structured XML files for large data.
> Also, trying to open the file consumes massive RAM here on my Ubuntu
> 20.04 machine: RAM increases from 1GB before starting to read the
> file to 15GB (and effectively slightly more, since the swap file
> already starts to grow).
Libxml2 is having to allocate a ton of memory to parse the entire
document. XML is just not the way to go for this volume of data.
There's an experiment ("proof of concept") in git which might be of
interest. I've enabled reading and writing "pure binary" data files,
with ".gbin" suffix. Here are my initial test scripts.
Read the original data and write gbin:
<hansl>
set stopwatch
# users.csv has first column named "obs"
open users.csv
printf "load time %gs\n", $stopwatch
summary n --simple
set stopwatch
dataset sortby n
printf "sort time %gs\n", $stopwatch
summary n --simple
set stopwatch
store users.gbin
printf "gbin store time %gs\n", $stopwatch
</hansl>
The times I'm seeing are:
read csv: 9.95s
sort by n: 6.09s
write gbin: 1.40s
Then restart and read the gbin:
<hansl>
set stopwatch
open users.gbin
printf "gbin open time %gs\n", $stopwatch
summary n --simple
</hansl>
I see an open time of 1.76s.
As I said, it's at proof-of-concept stage. String-valued series are
not handled and only minimal metadata are conveyed (variable names,
observation markers if present, basic time-series info).
That's already great, and it's very helpful to see how well such a
binary format performs.
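Out of curiosity, here is the kind of sanity check I would run on the
round trip (a minimal sketch; it assumes users.gbin was written from
the sorted dataset, as in your script above):

<hansl>
open users.csv --quiet
matrix v1 = sort({n})        # sort to match the stored order
open users.gbin --preserve   # --preserve keeps matrix v1 alive
matrix v2 = {n}
printf "max abs difference: %g\n", maxc(abs(v1 - v2))
</hansl>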
Thanks for the technical deep-dive, Allin!
Artur