On Mon, 14 Dec 2020, Artur Tarassow wrote:
[...]
> On 13.12.20 at 15:35, Allin Cottrell wrote:
>> Sorting a dataset: this was not optimized for a huge number of
>> observations. We were allocating more temporary storage than strictly
>> necessary and, at some points, calling malloc() per observation when a
>> single call for a big chunk of memory would have sufficed. Neither of
>> these points was much of an issue with a normal-size dataset, but they
>> became a serious problem in the huge case. That's now addressed in git.
> As I already wrote to you privately: this change is a real boost, as
> sorting time dropped from 14 to 7.5 seconds. Thanks for this!
>
> By the way, does this increased sorting speed also affect the
> aggregate() function?
Right now the speed-up is specific to sorting an entire dataset, but it
would be worth taking a look at the aggregate case too.
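For reference, the sort of aggregate() call in question would look
roughly like this (the series names "n" and "byvar" are just
placeholders, not taken from the test data):

<hansl>
# hypothetical sketch: per-group summary statistic of n
matrix m = aggregate(n, byvar, "mean")
print m
</hansl>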
>> gdtb is not designed for a dataset like this. I guess you'd want
>> an uncompressed pure-binary format for best read and write speed
>> and minimal memory consumption.
> I guess you're right, but how can I store data in a pure-binary
> format without any compression? I can't find anything on this in the
> help text.
Well, you can't (yet): that's not a format that gretl has offered to
date. But see below.
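For the record, the formats gretl currently offers via "store" are
selected by the filename suffix, along these lines:

<hansl>
# the suffix given to "store" selects the output format
# (the filename is just a placeholder)
store users.gdt    # gretl's XML-based data format
store users.gdtb   # the binary variant mentioned above
store users.csv    # plain text
</hansl>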
> By the way, I stored the csv file as a gdt file via "store
> <FILENAME>.gdt". To my surprise, the gdt file is 843 MB and hence
> almost 2.5 times larger than the csv.
That shouldn't be surprising if you've ever looked "inside" a gdt
file. The CSV file contains just the data, while the structured XML
file carries per-observation tags. With a single data series (mostly
smallish integers), the tags occupy more bytes than the data itself.
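A rough back-of-the-envelope comparison (the markup below is only
schematic, not the exact gdt layout):

<hansl>
# illustrative only: bytes for one observation as a bare CSV field
# versus the same value wrapped in XML-style tags
string csv_row = "12345"
string gdt_row = "<obs>12345 </obs>"
printf "csv-style: %d bytes, gdt-style: %d bytes\n", strlen(csv_row), strlen(gdt_row)
</hansl>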
> Also, trying to open the file consumes massive RAM here on my
> Ubuntu 20.04 machine: RAM usage increases from 1 GB before starting
> to read the file to 15 GB (actually slightly more, since the swap
> file already starts to grow).
Libxml2 is having to allocate a ton of memory to parse the entire
document. XML is just not the way to go for this volume of data.
There's an experiment ("proof of concept") in git which might be of
interest: I've enabled reading and writing "pure binary" data files
with a ".gbin" suffix. Here are my initial test scripts.
Read the original data and write gbin:
<hansl>
set stopwatch
# users.csv has first column named "obs"
open users.csv
printf "load time %gs\n", $stopwatch
summary n --simple
set stopwatch
dataset sortby n
printf "sort time %gs\n", $stopwatch
summary n --simple
set stopwatch
store users.gbin
printf "gbin store time %gs\n", $stopwatch
</hansl>
The times I'm seeing are:
read csv: 9.95s
sort by n: 6.09s
write gbin: 1.40s
Then restart and read the gbin:
<hansl>
set stopwatch
open users.gbin
printf "gbin open time %gs\n", $stopwatch
summary n --simple
</hansl>
I see an open time of 1.76s.
As I said, this is at the proof-of-concept stage. String-valued series
are not yet handled, and only minimal metadata are conveyed (variable
names, observation markers if present, basic time-series info).
Allin