On Tue, 7 Jan 2014, Sven Schreiber wrote:
> On 06.01.2014 20:14, Allin Cottrell wrote:
> >
> > I've now done some testing for speed of "skip-padding" and
> > actually the improvement is not all that great. [...]
>
> Let's see if I have paid enough attention: the skip-padding takes
> time and is currently (in cvs) enabled and cannot be switched off.
> Instead the compression can be user-configured and basically
> substitutes for the skip-padding. Then it sounds like a good idea
> to undo the skip-padding in cvs, no? Then I should be able to
> switch off compression to get the fastest (but of course also
> biggest) result.
Well, not quite. In my testing on a panel dataset with the general
characteristics we've been talking about, skip-padding (other things
equal) always shaved about 10% off the write time and about
40% off the read time. However, those results have now been
overtaken by some new thinking. Sorry, this is a bit long but here
we go.
The basic gdt design (gzipped XML with a fairly simple but
extensible DTD) is IMO nice and clean, and it can in principle
handle data of any size. But it's certainly slower than some
alternatives, and with very big datasets speed can become a serious
bother.
With this in mind, I've been working on two things in CVS. The first
relates to the way we print numbers in a gdt file and the second
concerns the possibility of using a binary format.
1) Writing data as text
When we're writing out numerical data in text form, as in an XML
file, we need to preserve precision. Double-precision (64-bit)
floating point values are generally good for about 15 or 16
significant digits. To be on the safe side, we have for some time
now written data values to 17 significant digits. To be precise, we
use the printf format "%.17g", which does not write trailing zeros.
So if you open up a gdt file in a text editor you'll see that some
values are actually printed with 17 digits (e.g. generated logs and
pseudo-random values), while at the other end of the spectrum
dummies are just printed as 1s and 0s.
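To make that concrete, here's a bare-bones illustration in plain C
(not gretl code, just the printf behavior we rely on):

  #include <stdio.h>
  #include <math.h>

  int main (void)
  {
      /* "%.17g": up to 17 significant digits, trailing zeros dropped */
      printf("%.17g\n", 0.0);      /* a dummy prints as just "0" */
      printf("%.17g\n", 1.0);      /* ... or "1" */
      printf("%.17g\n", log(2.0)); /* a generated log uses the full width */
      return 0;
  }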
The question arises, what about primary data that are not just
dummies (or other small-integer codings)? Primary economic data are
most unlikely to carry more than 6 significant digits (maybe 9 at
the outside, but that's probably spurious) and often carry fewer.
However, if you print such data using %.17g you get nasty artifacts
arising from the limited precision of doubles: a value that has been
published as 8.98 (and therefore should really be recorded in the
gdt file as just that) appears as 8.9800000000000004, one that has
been published as 217.004 appears as 217.00399999999999, and so on.
Such artifacts increase the size and decrease the compressibility of
the XML file. So we try to get rid of them. When we're getting ready
to save as gdt a series whose values are in the "middling" size range
occupied by most primary data, we try to figure out a format that
will represent the series with full precision but cut out the
artifacts. This works OK, but it takes time, and with very big data
the time becomes non-trivial.
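The basic idea can be sketched as a round-trip test: print each value
at increasing precision and scan it back, then use the smallest
number of significant digits at which every value in the series
survives intact. What follows is only a naive illustration of that
principle, not the code in CVS (and not the faster method described
next):

  #include <stdio.h>

  /* Smallest number of significant digits (up to 17) at which every
     value in x round-trips exactly through text. Naive per-series
     scan, for illustration only. */
  static int min_sig_digits (const double *x, int n)
  {
      char fmt[8], buf[64];
      int d, i;

      /* start at 6: primary data rarely carry more than that */
      for (d = 6; d < 17; d++) {
          int ok = 1;

          sprintf(fmt, "%%.%dg", d); /* e.g. "%.6g" */
          for (i = 0; i < n && ok; i++) {
              double y;

              sprintf(buf, fmt, x[i]);
              sscanf(buf, "%lf", &y);
              ok = (y == x[i]);
          }
          if (ok) {
              return d;
          }
      }

      return 17;
  }

  int main (void)
  {
      double x[] = {8.98, 217.004, 1.0, 0.0};

      /* 8.98 and 217.004 survive at 6 digits, so "%.6g" lets us
         write "8.98" and "217.004" rather than the 17-digit forms */
      printf("min digits = %d\n", min_sig_digits(x, 4));
      return 0;
  }

Done naively like this, every observation gets formatted several
times over, which is where the time goes on big datasets.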
What's new here is that I've worked out a substantially faster way
of determining the appropriate format specification for each
series. In my tests this cuts several seconds off big data writes.
This improvement is independent of the compression level and the
skip-padding status.
2) Writing binary data
Tweaks to our writing of data in text form are useful, but there's
no question that if you want raw speed you're better off using C's
fwrite and fread to zap big swathes of bytes from RAM to disk or
vice versa. I've implemented a --binary option to "store" that
causes gretl to write out an XML .gdt file containing the metadata
plus a binary .bdt file containing doubles. If compression is
specified it applies to both files. The bdt file is written in host
endianness; the reading function should take care of conversion if
need be (but that's not tested yet since I don't have access to any
big-endian machine).
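To give a flavor of why this is so much faster: leaving compression
and the actual .bdt layout aside, the data block boils down to one
fwrite of the array of doubles, and reading it back is one fread.
The sketch below is only a rough approximation of the idea, not the
code in CVS:

  #include <stdio.h>

  /* Write nvals doubles in host endianness; the real code can also
     gzip the result and records what's needed in the XML metadata. */
  static int write_bin (const char *fname, const double *Z, size_t nvals)
  {
      FILE *fp = fopen(fname, "wb");
      int err = 0;

      if (fp == NULL) {
          return 1;
      }
      if (fwrite(Z, sizeof *Z, nvals, fp) != nvals) {
          err = 1;
      }
      fclose(fp);
      return err;
  }

  static int read_bin (const char *fname, double *Z, size_t nvals)
  {
      FILE *fp = fopen(fname, "rb");
      int err = 0;

      if (fp == NULL) {
          return 1;
      }
      if (fread(Z, sizeof *Z, nvals, fp) != nvals) {
          err = 1;
      }
      fclose(fp);
      /* a reader on a machine of the opposite endianness would have
         to byte-swap each double at this point */
      return err;
  }

  int main (void)
  {
      double out[3] = {1.0, 8.98, 217.004};
      double in[3] = {0};

      write_bin("test.bin", out, 3);
      read_bin("test.bin", in, 3);
      printf("%g %g %g\n", in[0], in[1], in[2]);
      return 0;
  }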
The speed-up via this approach is dramatic: 600 MB or so can be
written in 0.54 seconds and read back in 0.17 seconds. I've put a
table with my current results at
http://ricardo.ecn.wfu.edu/~cottrell/tmp/binzip.html
This shows the effects of binary vs text, compression vs none, and
skip-padding vs not. As you'll see, if speed is the only
concern (and disk space not an issue), a straight binary write/read
with no compression or skipping of padding wins the race. On the
other hand you can get quite nice performance using skip-padding and
compression level 1: write in 2.4 seconds, read in 1.2 seconds, and
use only 4 percent of the maximal disk space.
Allin