On Tue, 7 Jan 2014, Sven Schreiber wrote:
> On 06.01.2014 20:14, Allin Cottrell wrote:
> >
> > I've now done some testing for speed of "skip-padding" and
> > actually the improvement is not all that great. [...]
>
> Let's see if I have paid enough attention: the skip-padding takes
> time and is currently (in cvs) enabled and cannot be switched off.
> Instead the compression can be user-configured and basically
> substitutes for the skip-padding. Then it sounds like a good idea
> to undo the skip-padding in cvs, no? Then I should be able to
> switch off compression to get the fastest (but of course also
> biggest) result.
Well, not quite. In my testing on a panel dataset with the general
characteristics we've been talking about, skip-padding (other things
equal) always shaved about 10% off the write time and about
40% off the read time. However, those results have now been
overtaken by some new thinking. Sorry, this is a bit long but here
we go.
The basic gdt design (gzipped XML with a fairly simple but
extensible DTD) is IMO nice and clean, and it can in principle
handle data of any size. But it's certainly slower than some
alternatives, and with very big datasets speed can become a serious
bother.
With this in mind, I've been working on two things in CVS. The first
relates to the way we print numbers in a gdt file and the second
concerns the possibility of using a binary format.
1) Writing data as text
When we're writing out numerical data in text form, as in an XML
file, we need to preserve precision. Double-precision (64-bit)
floating point values are generally good for about 15 or 16
significant digits. To be on the safe side, we have for some time
now written data values to 17 significant digits. To be precise, we
use the printf format "%.17g", which does not write trailing zeros.
So if you open up a gdt file in a text editor you'll see that some
values are actually printed with 17 digits (e.g. generated logs and
pseudo-random values), while at the other end of the spectrum
dummies are just printed as 1s and 0s.
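To make that concrete, here's a bare-bones illustration in plain C
(not gretl code, just the printf behavior we rely on):

  #include <stdio.h>
  #include <math.h>

  int main (void)
  {
      /* "%.17g": up to 17 significant digits, trailing zeros dropped */
      printf("%.17g\n", 0.0);      /* a dummy prints as just "0" */
      printf("%.17g\n", 1.0);      /* ... or "1" */
      printf("%.17g\n", log(2.0)); /* a generated log uses the full width */
      return 0;
  }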
The question arises, what about primary data that are not just
dummies (or other small-integer codings)? Primary economic data are
most unlikely to carry more than 6 significant digits (maybe 9 at
the outside, but that's probably spurious) and often carry fewer.
However, if you print such data using %.17g you get nasty artifacts
arising from the limited precision of doubles: a value that has been
published as 8.98 (and therefore should really be recorded in the
gdt file as just that) appears as 8.9800000000000004, one that has
been published as 217.004 appears as 217.00399999999999, and so on.
Such artifacts increase the size and decrease the compressibility of
the XML file. So we try to get rid of them. When we're getting ready
to save as gdt a series whose values are in the "middling" size range
occupied by most primary data, we try to figure out a format that
will represent the series with full precision but cut out the
artifacts. This works OK, but it takes time, and with very big data
the time becomes non-trivial.
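The basic idea can be sketched as a round-trip test: print each value
at increasing precision and scan it back, then use the smallest
number of significant digits at which every value in the series
survives intact. What follows is only a naive illustration of that
principle, not the code in CVS (and not the faster method described
next):

  #include <stdio.h>

  /* Smallest number of significant digits (up to 17) at which every
     value in x round-trips exactly through text. Naive per-series
     scan, for illustration only. */
  static int min_sig_digits (const double *x, int n)
  {
      char fmt[8], buf[64];
      int d, i;

      /* start at 6: primary data rarely carry more than that */
      for (d = 6; d < 17; d++) {
          int ok = 1;

          sprintf(fmt, "%%.%dg", d); /* e.g. "%.6g" */
          for (i = 0; i < n && ok; i++) {
              double y;

              sprintf(buf, fmt, x[i]);
              sscanf(buf, "%lf", &y);
              ok = (y == x[i]);
          }
          if (ok) {
              return d;
          }
      }

      return 17;
  }

  int main (void)
  {
      double x[] = {8.98, 217.004, 1.0, 0.0};

      /* 8.98 and 217.004 survive at 6 digits, so "%.6g" lets us
         write "8.98" and "217.004" rather than the 17-digit forms */
      printf("min digits = %d\n", min_sig_digits(x, 4));
      return 0;
  }

Done naively like this, every observation gets formatted several
times over, which is where the time goes on big datasets.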
What's new here is that I've worked out a substantially faster way
of determining the appropriate format specification for each
series. In my tests this cuts several seconds off big data writes.
This improvement is independent of the compression level and the
skip-padding status.
2) Writing binary data
Tweaks to our writing of data in text form are useful, but there's
no question that if you want raw speed you're better off using C's
fwrite and fread to zap big swathes of bytes from RAM to disk or
vice versa. I've implemented a --binary option to "store" that
causes gretl to write out an XML .gdt file containing the metadata
plus a binary .bdt file containing doubles. If compression is
specified it applies to both files. The bdt file is written in host
endianness; the reading function should take care of conversion if
need be (but that's not tested yet since I don't have access to any
big-endian machine).
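To give a flavor of why this is so much faster: leaving compression
and the actual .bdt layout aside, the data block boils down to one
fwrite of the array of doubles, and reading it back is one fread.
The sketch below is only a rough approximation of the idea, not the
code in CVS:

  #include <stdio.h>

  /* Write nvals doubles in host endianness; the real code can also
     gzip the result and records what's needed in the XML metadata. */
  static int write_bin (const char *fname, const double *Z, size_t nvals)
  {
      FILE *fp = fopen(fname, "wb");
      int err = 0;

      if (fp == NULL) {
          return 1;
      }
      if (fwrite(Z, sizeof *Z, nvals, fp) != nvals) {
          err = 1;
      }
      fclose(fp);
      return err;
  }

  static int read_bin (const char *fname, double *Z, size_t nvals)
  {
      FILE *fp = fopen(fname, "rb");
      int err = 0;

      if (fp == NULL) {
          return 1;
      }
      if (fread(Z, sizeof *Z, nvals, fp) != nvals) {
          err = 1;
      }
      fclose(fp);
      /* a reader on a machine of the opposite endianness would have
         to byte-swap each double at this point */
      return err;
  }

  int main (void)
  {
      double out[3] = {1.0, 8.98, 217.004};
      double in[3] = {0};

      write_bin("test.bin", out, 3);
      read_bin("test.bin", in, 3);
      printf("%g %g %g\n", in[0], in[1], in[2]);
      return 0;
  }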
The speed-up via this approach is dramatic: 600 MB or so can be
written in 0.54 seconds and read back in 0.17 seconds. I've put a
table with my current results at
http://ricardo.ecn.wfu.edu/~cottrell/tmp/binzip.html
This shows the effects of binary vs text, compression vs none, and
skip-padding vs not. As you'll see, if speed is the only
concern (and disk space not an issue), a straight binary write/read
with no compression or skipping of padding wins the race. On the
other hand you can get quite nice performance using skip-padding and
compression level 1: write in 2.4 seconds, read in 1.2 seconds, and
use only 4 percent of the maximal disk space.
Allin