On Thu, 2 Jan 2014, Sven Schreiber wrote:
> On 02.01.2014 19:14, Allin Cottrell wrote:
>> On Wed, 1 Jan 2014, Sven Schreiber wrote:
>>
>>> I'd like to raise an issue which is probably quite fundamental in terms
>>> of data handling. I'm currently working on a large panel dataset,
>>> meaning that gretl occupies more than 600MB of memory with the data
>>> loaded. In terms of file sizes, the Stata file version occupies 42MB,
>>> the gretl workfile only about 3.5MB. This shows that gretl stores the
>>> data very efficiently (by zipping), but OTOH opening and saving takes
>>> quite some time. Actually it is much faster even in gretl to import the
>>> Stata file instead of the native gretl file.
>>
>> I'd like to experiment with this. Can you give a little more detail
>> on the characteristics of the data file? That is (roughly) how many
>> observations? And how many variables? And what sort of ratio of
>> quantitative variables to small-integer coded variables?
>
> First of all, I just found out that opening and saving the same data is
> much faster if the dataset is left as undated, as opposed to using panel
> index variables. On saving, gretl reports in the pop-up window 177052KB
> in the first case, versus 571712KB in the panel-structured case. Not
> sure if that's expected; the difference seems quite extreme.

Gretl requires that a panel dataset have equal T_i for all units i, even
if this means padding with NAs (which is done automatically when you use
"setobs" with the --panel-vars option). So if the dataset is very "holey",
it's not too surprising that the full panel could be a lot bigger than
the minimal undated, unpadded dataset.
However, it seems that we shouldn't necessarily save all the padding rows
when writing a gdt file; all we need to do is save all the rows that
actually contain some data, plus enough information to reconstitute the
panel on reading. There's now an experiment to this effect in CVS (it has
been tested, but more testing would be good). Here's the deal:
* When saving a panel dataset in native format we check the total size of
the panel. If this exceeds 10MB we then check what fraction of the data
rows are pure padding, and if this exceeds 30% we go into "skip-padding"
mode. (These parameters are of course somewhat arbitrary and could be
tweaked.)
* In skip-padding mode we skip the pure padding rows (duh!) but we record
the (1-based) unit index and time index for each row we write, these
appearing as two extra series at the end of the dataset; we also record
the fact that we're skipping.
* On re-reading the gdt file, gretl recognizes that a panel should be
reconstituted, and it uses the unit and time info to do that.
There's room for debate over what the extra, automatically added series
should be called. Right now they're called "unit__" and "time__"
(with trailing double underscores). These names seem unlikely to collide with
user-specified ones. Even if they do, that won't matter if the file is
read by current gretl, since if we get the skip-padding flag we know to
use the last two series regardless of what they're called (and delete them
after using them). The naming of these series is intended to ease sharing
of data with older versions of gretl: if need be, a "skip-padding" panel
dataset could be reconstituted manually by
  open skip-padding.gdt
  setobs unit__ time__ --panel-vars
In this case a name collision could be a problem -- in the (unlikely)
event that the dataset already contains series of these names, and they do
not constitute a valid basis for --panel-vars.
Related: The --gzipped option to the "store" command now has an optional
compression-level parameter. This should be an integer in the range 0-9,
with 0 indicating no compression and 9 maximal. If no parameter is given,
the default level is 6 (the zlib default), which in most cases seems to
offer a reasonable balance between speed and compressed size.
The GUI File save dialog for native gdt files now also has a compression
level selector.
Allin