On 14.12.20 at 17:41 Allin Cottrell wrote:
On Mon, 14 Dec 2020, Artur Tarassow wrote:
[...]
> On 13.12.20 at 15:35 Allin Cottrell wrote:
>
>> Sorting a dataset: This was not optimized for a huge number of
>> observations. We were allocating more temporary storage than
>> strictly necessary and moreover, at some points, calling malloc()
>> per observation when it was possible to substitute a single call to
>> get a big chunk of memory. Neither of these points was much of an
>> issue with a normal-size dataset, but they became a serious problem
>> in the huge case. That's now addressed in git.
>
> As I already wrote to you privately: this change is a real boost, as
> sorting time dropped from 14 to 7.5 seconds. Thanks for this!
>
> By the way, does this speed-up in sorting also benefit the
> aggregate() function?
Right now it's specific to the case of sorting an entire dataset, but
it would be worth taking a look at the aggregate case too.
Ok, thanks.
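For reference, the kind of call I have in mind is something like the
following (a minimal sketch with made-up data, assuming I recall the
aggregate() signature correctly):

<hansl>
# made-up data: 4 groups, 1000 observations
nulldata 1000
series g = 1 + int(4 * uniform())   # grouping variable
series x = normal()
# group-wise means of x by g; presumably this involves
# sorting by the grouping variable internally
matrix m = aggregate(x, g, "mean")
print m
</hansl>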
>> gdtb is not designed for a dataset like this. I guess you'd want an
>> uncompressed pure-binary format for best read and write speed and
>> minimal memory consumption.
>
> I guess you're right, but how can I store data in a pure-binary
> format without any compression? I can't find anything on this in the
> help text.
Well, no. That's not a format that gretl has offered to date. But see
below.
> By the way, I stored the csv file as a gdt file via "store
> <FILENAME>.gdt". To my surprise, the gdt file is 843 MB, and hence
> almost 2.5 times larger than the csv.
That shouldn't be surprising if you've ever looked "inside" a gdt file.
The CSV file just contains the data. The structured XML file contains
per-observation tags. With a single data series (mostly smallish
integers), the tags occupy more bytes than the data.
I see. I guess that's a case where a JSON format might be beneficial,
since, if I'm right, it would not need so many tags. But I'm pretty
sure that switching from XML to JSON would involve complex changes
under the hood.
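As a quick illustration of the tag overhead, one could store the same
small integer series in both formats and compare the file sizes on
disk (a minimal sketch; the exact ratio will of course vary):

<hansl>
# hypothetical illustration: same data, two formats
nulldata 100000
series n = int(100 * uniform())   # smallish integers, as in my case
store tagcheck.csv n
store tagcheck.gdt n
# now compare the sizes of tagcheck.csv and tagcheck.gdt on disk:
# the XML version should come out several times larger
</hansl>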
Just one note: if the topic of large/big data should become (more)
relevant for the gretl project in future, the open-source Parquet
file format may be of interest (https://parquet.apache.org/). It is a
widely used format in the big-data context nowadays. Here is another
reference: https://en.wikipedia.org/wiki/Apache_Parquet
Ah, I just found out about the xmlreader API of libxml2
(http://xmlsoft.org/xmlreader.html). Instead of loading the whole
document into memory and exposing it as a tree, xmlreader allows one
to step through the document as a stream, which may be a more
efficient way to go. But still, as you said, this does not solve the
basic issue of structured XML files for large data.
> Also, trying to open the file consumes massive RAM here on my Ubuntu
> 20.04 machine: RAM increases from 1GB before starting to read the
> file to 15GB (and effectively slightly more, since the swap file
> already starts to grow).
Libxml2 is having to allocate a ton of memory to parse the entire
document. XML is just not the way to go for this volume of data.
There's an experiment ("proof of concept") in git which might be of
interest. I've enabled reading and writing "pure binary" data files,
with ".gbin" suffix. Here are my initial test scripts.
Read the original data and write gbin:
<hansl>
set stopwatch
# users.csv has first column named "obs"
open users.csv
printf "load time %gs\n", $stopwatch
summary n --simple
set stopwatch
dataset sortby n
printf "sort time %gs\n", $stopwatch
summary n --simple
set stopwatch
store users.gbin
printf "gbin store time %gs\n", $stopwatch
</hansl>
The times I'm seeing are:
read csv: 9.95s
sort by n: 6.09s
write gbin: 1.40s
Then restart and read the gbin:
<hansl>
set stopwatch
open users.gbin
printf "gbin open time %gs\n", $stopwatch
summary n --simple
</hansl>
I see an open time of 1.76s.
As I said, it's at proof-of-concept stage. String-valued series are
not handled and only minimal metadata are conveyed (variable names,
observation markers if present, basic time-series info).
That's already great, and it's very helpful to see how well such a
binary format performs.
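Out of curiosity, here is the kind of sanity check I would run on the
round trip (a minimal sketch; it assumes users.gbin was written from
the sorted dataset, as in your script above):

<hansl>
open users.csv --quiet
matrix v1 = sort({n})        # sort to match the stored order
open users.gbin --preserve   # --preserve keeps matrix v1 alive
matrix v2 = {n}
printf "max abs difference: %g\n", maxc(abs(v1 - v2))
</hansl>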
Thanks for the technical deep-dive, Allin!
Artur