[Gretl-devel] binary-form gretl data files

Thursday, 16 January 2014

Here's the "RFC" I promised in response to Sven's posting at
http://lists.wfu.edu/pipermail/gretl-devel/2014-January/004867.html

Saving a gretl dataset in binary form

1. Why? Because for very large datasets it is _much_ faster to save
(and to reload) data in binary form than as text. For details see

http://ricardo.ecn.wfu.edu/~cottrell/tmp/binzip.html

2. How? At present (in CVS) this can be done only by passing the
--binary option to gretl's "store" command (where the name of the file
to be saved has the ".gdt" extension). There is not yet any means of
doing this via the GUI, pending further discussion of potential
problems.

3. File details: At present if you use the --binary option gretl
writes two files, a gdt file containing exactly the same sort of
metadata that is stored in a regular (pure XML) gdt file and a file
with extension ".bdt" containing double-precision floating-point
values written in the platform endianness. The gdt file contains an
attribute named "binary" in the "gretldata" element, which has value
either "little-endian" or "big-endian". This attribute is "IMPLIED"
(in XML jargon) and is omitted in pure XML gdt files.  When gretl
finds the "binary" attribute in a gdt file it knows to open the
companion bdt file, which must have the same name but with the ".gdt"
suffix replaced by ".bdt".
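
The pairing rule in item 3 can be sketched in a few lines of Python.
This is only an illustration, not gretl's own code: the function names
are made up for the example, and it assumes the gdt content is
available as uncompressed XML text.

```python
import os
import xml.etree.ElementTree as ET

def companion_path(gdt_path):
    """Return the bdt name paired with a gdt name (same name, suffix swapped)."""
    stem, ext = os.path.splitext(gdt_path)
    if ext != ".gdt":
        raise ValueError("expected a file name ending in .gdt")
    return stem + ".bdt"

def binary_endianness(gdt_xml_text):
    """Return the 'binary' attribute of the gretldata element,
    or None when the attribute is absent (a pure XML gdt file)."""
    root = ET.fromstring(gdt_xml_text)
    if root.tag != "gretldata":
        raise ValueError("root element is not gretldata")
    return root.get("binary")
```

A reader following this logic would open the companion file only when
binary_endianness() returns something other than None.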

4. Recent change: As of CVS of 2014-01-16 the bdt file contains a
small header, by way of sanity check. This is a string that says
either "gretl-bdt:little-endian" or "gretl-bdt:big-endian", in either
case padded to 24 bytes with nuls. Gretl will not proceed to read the
file if this header is missing, or if the endianness it indicates
disagrees with that stated in the gdt file. If anyone wants to read a
gretl binary data file created before this change, the shell script
fixbin.sh (attached) can be used to update the bdt file.
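
For illustration, here is a minimal Python sketch of the header
described in item 4 and of the agreement check against the gdt file.
It is not the code gretl uses and not the attached fixbin.sh; the names
make_header and check_header are hypothetical.

```python
HEADER_SIZE = 24  # bytes, as stated in item 4

def make_header(endianness):
    """Build the 24-byte bdt header: the tag string padded with nuls."""
    if endianness not in ("little-endian", "big-endian"):
        raise ValueError("endianness must be little-endian or big-endian")
    tag = ("gretl-bdt:" + endianness).encode("ascii")
    return tag.ljust(HEADER_SIZE, b"\0")

def check_header(header, gdt_endianness):
    """Return True only if the header is present, well formed, and
    agrees with the endianness stated in the gdt file."""
    return len(header) == HEADER_SIZE and header == make_header(gdt_endianness)
```

The longer of the two strings, "gretl-bdt:little-endian", is 23
characters, so both variants fit in 24 bytes with at least one
trailing nul.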

5. Discussion: When I first introduced this idea, Sven and Jack
remarked that it would be desirable to use an extension other than
".gdt" for the XML component of a metadata/binary pair of files on the
new pattern, so as to avoid potential confusion. I can see the case
for this, but I'm not sure it's a good idea. I explain my misgivings
below.

Internally, a gdt file is just a gdt file, regardless of whether it
has a binary companion file: it's XML conforming to a common DTD.  The
functions to read and write such files are in common. There are many
places in the gretl code where it's assumed that a native data file
has the ".gdt" extension and it would be a pain to go through all of
those and adjust for the possibility of another extension. In other
words, there's no internal rationale for a different extension; it
would be purely for users' convenience.

But would it in fact be convenient for users? So far as GUI use is
concerned, I don't see any reason why users should care. The format is
mostly "hidden": all you have to bother with is (say) marking a check
box saying "Use binary format" if you have a huge data file and
write/read speed is an issue. (And I'm thinking this box might not be
shown for datasets smaller than some reasonable threshold.)  It would
seem "fussy" to have a drop-down selector for different extensions in
file dialogs pertaining to native gretl data files.

It's true, there is some possibility of confusion in CLI use. The main
issue I see is that someone might save a dataset as binary, then later
decide to send it to a colleague or move it to another directory: in
that case she has to know to send/move the bdt file as well as the gdt.
Of course, she'd have to know to do that even if the XML component
were named differently, but there would be some visual clue if it
had, say, an ".mdt" suffix. On the other hand, it would be easy enough
to check the size of the gdt file: if you've stored tens or hundreds
of MB of data and the gdt file is 3K, there's your clue. We could also
provide a little command-line helper program that tests a gdt file and
tells you some stuff about it, including whether it has a binary
companion (this could also be provided as a GUI menu item).
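
As a rough idea of what such a helper might report, here is a Python
sketch. No such program exists yet; describe_gdt and the keys of its
result are invented for the example, and it assumes the gdt file is
stored as uncompressed XML.

```python
import os
import xml.etree.ElementTree as ET

def describe_gdt(gdt_path):
    """Summarize a gdt file: its size, its 'binary' attribute (if any),
    and whether the expected bdt companion is present."""
    root = ET.parse(gdt_path).getroot()
    endianness = root.get("binary")  # None for a pure XML gdt file
    bdt_path = os.path.splitext(gdt_path)[0] + ".bdt"
    return {
        "gdt_bytes": os.path.getsize(gdt_path),
        "binary": endianness,
        "companion": bdt_path if endianness else None,
        "companion_found": bool(endianness) and os.path.exists(bdt_path),
    }
```

Someone about to send or move a dataset could call
describe_gdt("mydata.gdt") and check the "companion" entry to see
whether a bdt file has to travel with it.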

Besides, there would be some possibility of confusion if we used
different extensions. Suppose you first have a dataset in pure XML
format, then you decide to save as XML + binary. Then maybe you make
some changes to the data and re-save. Now some time later you want to
reopen it: which file should you open? OK, it's not very hard to look
at the last modification dates of the files to see which is more
recent, but there's no uncertainty if a native data-save uses just one
filename extension.

Allin
