On Mon, 14 Dec 2020, Artur Tarassow wrote:
[...]
> On 13.12.20 at 15:35, Allin Cottrell wrote:
>> Sorting a dataset: this was not optimized for a huge number of
>> observations. We were allocating more temporary storage than strictly
>> necessary and, at some points, calling malloc() per observation when a
>> single call for a big chunk of memory would have sufficed. Neither of
>> these points was much of an issue with a normal-size dataset, but they
>> became a serious problem in the huge case. That's now addressed in git.
> As I already wrote to you privately: this change is a real boost, as
> sorting time dropped from 14 to 7.5 seconds. Thanks for this!
>
> By the way, does this increased sorting speed also affect the
> aggregate() function?
Right now the speed-up is specific to sorting an entire dataset, but it
would be worth taking a look at the aggregate case too.
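For reference, the sort of aggregate() call in question would look
roughly like this (the series names "n" and "byvar" are just
placeholders, not taken from the test data):

<hansl>
# hypothetical sketch: per-group summary statistic of n
matrix m = aggregate(n, byvar, "mean")
print m
</hansl>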
>> gdtb is not designed for a dataset like this. I guess you'd want
>> an uncompressed pure-binary format for best read and write speed
>> and minimal memory consumption.
> I guess you're right, but how can I store data in a pure-binary
> format without any compression? I can't find anything on this in the
> help text.
Well, you can't (yet): that's not a format that gretl has offered to
date. But see below.
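For the record, the formats gretl currently offers via "store" are
selected by the filename suffix, along these lines:

<hansl>
# the suffix given to "store" selects the output format
# (the filename is just a placeholder)
store users.gdt    # gretl's XML-based data format
store users.gdtb   # the binary variant mentioned above
store users.csv    # plain text
</hansl>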
> By the way, I stored the csv file as a gdt file via "store
> <FILENAME>.gdt". To my surprise, the gdt file is 843 MB and hence
> almost 2.5 times larger than the csv.
That shouldn't be surprising if you've ever looked "inside" a gdt
file. The CSV file contains just the data, while the structured XML
file carries per-observation tags. With a single data series (mostly
smallish integers), the tags occupy more bytes than the data itself.
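A rough back-of-the-envelope comparison (the markup below is only
schematic, not the exact gdt layout):

<hansl>
# illustrative only: bytes for one observation as a bare CSV field
# versus the same value wrapped in XML-style tags
string csv_row = "12345"
string gdt_row = "<obs>12345 </obs>"
printf "csv-style: %d bytes, gdt-style: %d bytes\n", strlen(csv_row), strlen(gdt_row)
</hansl>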
> Also, trying to open the file consumes massive RAM here on my
> Ubuntu 20.04 machine: RAM usage increases from 1 GB before starting
> to read the file to 15 GB (actually slightly more, since the swap
> file already starts to grow).
Libxml2 is having to allocate a ton of memory to parse the entire
document. XML is just not the way to go for this volume of data.
There's an experiment ("proof of concept") in git which might be of
interest: I've enabled reading and writing "pure binary" data files
with a ".gbin" suffix. Here are my initial test scripts.
Read the original data and write gbin:
<hansl>
set stopwatch
# users.csv has first column named "obs"
open users.csv
printf "load time %gs\n", $stopwatch
summary n --simple
set stopwatch
dataset sortby n
printf "sort time %gs\n", $stopwatch
summary n --simple
set stopwatch
store users.gbin
printf "gbin store time %gs\n", $stopwatch
</hansl>
The times I'm seeing are:
read csv: 9.95s
sort by n: 6.09s
write gbin: 1.40s
Then restart and read the gbin:
<hansl>
set stopwatch
open users.gbin
printf "gbin open time %gs\n", $stopwatch
summary n --simple
</hansl>
I see an open time of 1.76s.
As I said, this is at the proof-of-concept stage. String-valued series
are not yet handled, and only minimal metadata are conveyed (variable
names, observation markers if present, basic time-series info).
Allin