On Thu, 2 Jan 2014, Sven Schreiber wrote:
> On 02.01.2014 19:14, Allin Cottrell wrote:
>> On Wed, 1 Jan 2014, Sven Schreiber wrote:
>>
>>> I'd like to raise an issue which is probably quite fundamental in terms
>>> of data handling. I'm currently working on a large panel dataset,
>>> meaning that gretl occupies more than 600MB of memory with the data
>>> loaded. In terms of file sizes, the Stata file version occupies 42MB,
>>> the gretl workfile only about 3.5MB. This shows that gretl stores the
>>> data very efficiently (by zipping), but OTOH opening and saving takes
>>> quite some time. Actually it is much faster even in gretl to import the
>>> Stata file instead of the native gretl file.
>>
>> I'd like to experiment with this. Can you give a little more detail
>> on the characteristics of the data file? That is (roughly) how many
>> observations? And how many variables? And what sort of ratio of
>> quantitative variables to small-integer coded variables?
>
> First of all, I just found out that opening and saving the same data is
> much faster if the dataset is left as undated, as opposed to using panel
> index variables. On saving, gretl reports in the pop-up window 177052KB
> in the first case, versus 571712KB in the panel-structured case. Not
> sure if that's expected; the difference seems quite extreme.

Gretl requires that a panel dataset have equal T_i for all units i, even
if this means padding with NAs (which is done automatically when you use
"setobs" with the --panel-vars option). So if the dataset is very "holey",
it's not too surprising that the full panel could be a lot bigger than
the minimal undated, unpadded dataset.
However, it seems that we shouldn't necessarily save all the padding rows
when writing a gdt file; all we need to do is save all the rows that
actually contain some data, plus enough information to reconstitute the
panel on reading. There's now an experiment to this effect in CVS (it has
been tested, but more testing would be good). Here's the deal:
* When saving a panel dataset in native format we check the total size of
the panel. If this exceeds 10MB we then check what fraction of the data
rows are pure padding, and if this exceeds 30% we go into "skip-padding"
mode. (These parameters are of course somewhat arbitrary and could be
tweaked.)
* In skip-padding mode we skip the pure padding rows (duh!) but we record
the (1-based) unit index and time index for each row we write, these
appearing as two extra series at the end of the dataset; we also record
the fact that we're skipping.
* On re-reading the gdt file, gretl recognizes that a panel should be
reconstituted, and it uses the unit and time info to do that.
There's room for debate over what the extra, automatically added series
should be called. Right now they're called "unit__" and "time__"
(with trailing double underscores). These names seem unlikely to collide with
user-specified ones. Even if they do, that won't matter if the file is
read by current gretl, since if we get the skip-padding flag we know to
use the last two series regardless of what they're called (and delete them
after using them). The naming of these series is intended to ease sharing
of data with older versions of gretl: if need be, a "skip-padding" panel
dataset could be reconstituted manually by
  open skip-padding.gdt
  setobs unit__ time__ --panel-vars
In this case a name collision could be a problem -- in the (unlikely)
event that the dataset already contains series of these names, and they do
not constitute a valid basis for --panel-vars.
Related: The --gzipped option to the "store" command now has an optional
compression-level parameter. This should be an integer in the range 0-9,
with 0 indicating no compression and 9 maximal. If no parameter is given,
the default level is 6 (the zlib default), which in most cases seems to
offer a reasonable balance between speed and compressed size.
The GUI File save dialog for native gdt files now also has a compression
level selector.
Allin