On 05.01.2014 20:56, Allin Cottrell wrote:
On Thu, 2 Jan 2014, Sven Schreiber wrote:
>
> First of all, I just found out that opening and saving the same data is
> much faster if the dataset is left as undated, as opposed to using panel
> index variables. On saving, gretl reports in the pop-up window 177052KB
> in the first case, versus 571712KB in the panel-structured case. Not
> sure if that's expected, the difference seems quite extreme.
Gretl requires that a panel dataset have equal T_i for all units i, even
if this means padding with NAs (which is done automatically when you use
"setobs" with the --panel-vars option). So if the dataset is very "holey",
it's not too surprising that the full panel could be a lot bigger than the
minimal undated, unpadded dataset.
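
To illustrate the padding point, here is a minimal Python sketch. It is not gretl code: the function name pad_panel and the toy data are hypothetical, and it assumes the padded panel spans every (unit, period) combination that occurs in the data, with None standing in for NA.

```python
# Illustrative sketch only; this is NOT how gretl is implemented.
# Assumption: the balanced panel covers every unit x every period seen in the data.

def pad_panel(rows):
    """Balance an unbalanced panel given as (unit, period, value) tuples.

    Missing (unit, period) combinations are filled with None (standing in for NA).
    """
    units = sorted({u for u, _, _ in rows})
    periods = sorted({t for _, t, _ in rows})
    observed = {(u, t): v for u, t, v in rows}
    return [(u, t, observed.get((u, t))) for u in units for t in periods]


# A very "holey" toy dataset: 3 units, 3 periods, only 4 actual observations.
rows = [(1, 2001, 3.5), (1, 2002, 3.7), (2, 2002, 1.2), (3, 2003, 9.9)]
padded = pad_panel(rows)
n_padding = sum(1 for _, _, v in padded if v is None)

print(f"unpadded rows: {len(rows)}")
print(f"padded rows:   {len(padded)}")
print(f"pure padding:  {n_padding}")
```

In this toy case more than half of the balanced panel is padding, which is the kind of blow-up described above.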
I must confess I'm not sure I understand how gretl internally stores the
undated version of the workfile in such a way that (a) the panel structure
can be restored at any time with 'setobs' and (b) much less memory needs to
be allocated. I had heard that series are all just in a big array of
doubles, but there seems to be more to it, because that would seem to
already include all the padding implicitly... But maybe this is going
too far, no need to explain the details.
However, it seems that we shouldn't necessarily save all the padding rows
when writing a gdt file; all we need to do is save all the rows that
actually contain some data, plus enough information to reconstitute the
panel on reading. There's now an experiment to this effect in CVS (it has
been tested, but more testing would be good). Here's the deal:
* When saving a panel dataset in native format we check the total size of
the panel. If this exceeds 10MB we then check what fraction of the data
rows are pure padding, and if this exceeds 30% we go into "skip-padding"
mode. (These parameters are of course somewhat arbitrary and could be
tweaked.)
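
The decision rule described here can be sketched in Python as follows. This is only an illustration of the two thresholds mentioned (10MB and 30%), not the actual gretl code. The function names are hypothetical, and the size estimate (rows x series x 8 bytes per double) and the reading of "10MB" as 10 * 1024 * 1024 bytes are assumptions.

```python
import math

# Thresholds as stated in the message; the exact byte value of "10MB" is an assumption.
SIZE_THRESHOLD_BYTES = 10 * 1024 * 1024
PADDING_THRESHOLD = 0.30


def is_padding_row(row):
    """A row counts as pure padding if every series is NA (NaN here)."""
    return all(math.isnan(x) for x in row)


def use_skip_padding(data, size_threshold=SIZE_THRESHOLD_BYTES,
                     padding_threshold=PADDING_THRESHOLD):
    """Decide whether to drop padding rows when saving.

    data: list of rows, each a list of floats (one per series).
    Assumption: panel size is estimated as rows * series * 8 bytes.
    """
    if not data:
        return False
    total_bytes = len(data) * len(data[0]) * 8
    if total_bytes <= size_threshold:
        return False  # small panel: not worth the bother
    n_padding = sum(1 for row in data if is_padding_row(row))
    return n_padding / len(data) > padding_threshold


nan = float("nan")
# 10 rows, 2 series: 6 rows with data, 4 rows of pure padding (40%).
toy = [[1.0, 2.0]] * 5 + [[nan, 3.0]] + [[nan, nan]] * 4

print(use_skip_padding(toy))                      # tiny panel, default size threshold
print(use_skip_padding(toy, size_threshold=100))  # artificially low size threshold
```

The thresholds are parameters here only so that the rule can be tried on a toy dataset.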
Sure, but sounds reasonable enough to me.
...
In this case a name collision could be a problem -- in the (unlikely)
event that the dataset already contains series of these names, and they do
not constitute a valid basis for --panel-vars.
I don't see this as a real problem; names with double trailing
underscores could simply be reserved so that end users cannot use them.
Related: The --gzipped option to the "store" command now has an optional
compression-level parameter. This should be an integer in the range 0-9,
with 0 indicating no compression and 9 maximal. If no parameter is given,
the default level is 6 (the zlib default), which in most cases seems to
offer a reasonable balance between speed and compressed size.
The GUI File save dialog for native gdt files now also has a compression
level selector.
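
For a feel of what the zlib levels do, here is a small Python sketch using the standard zlib module. It does not involve gretl at all: the payload is a made-up, highly repetitive stand-in for a data file, so the absolute numbers say nothing about real gdt files.

```python
import zlib

# Hypothetical stand-in for the text content of a data file (not a real gdt file).
payload = ("<obs>1.2345 NA 6.789 NA NA</obs>\n" * 20000).encode("ascii")

# Compress the same payload at every zlib level from 0 (none) to 9 (maximal).
sizes = {level: len(zlib.compress(payload, level)) for level in range(10)}

print(f"raw size: {len(payload)} bytes")
for level, size in sizes.items():
    print(f"level {level}: {size} bytes")

# Without an explicit level, zlib falls back to its default compression level.
default_size = len(zlib.compress(payload))
print(f"default level: {default_size} bytes")

# Whatever the level, decompression restores the original bytes.
restored = zlib.decompress(zlib.compress(payload, 1))
```

Level 0 only wraps the data without compressing it, so its output is slightly larger than the input.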
I experimented with this (Jan 5th snapshot, 32-bit gretl on 64-bit Win7).
First of all, the perceived speed unfortunately doesn't change and appears
independent of the compression level; saving always takes about 15 seconds.
BTW, there is a long delay before the progress bar in the save dialog
actually starts to fill, whatever that means.
Also, when I choose "save as" but pick the current filename for
overwriting, the file size stays the same, so the compression level
apparently isn't respected in that case.
Then about the sizes: the original .gdt file in question was 4471KB,
before the new features. The amount that gretl displays in the save
progress window is 45455KB.
Level 0 (GUI chosen) gives 20789KB file size (panel-structured).
Level 1: 2841KB
Level 2: 2692KB
Level 4: 2420KB
Level 6: 2204KB
Level 9: 2133KB
So at least in this case nothing beyond level 1 seems really necessary
IMO, although, as I said, the lower levels don't bring any speed gain.
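
The relative sizes can be worked out from the figures above with a few lines of Python. The numbers are exactly the file sizes reported in this message; levels 3, 5, 7 and 8 were not measured and are therefore absent.

```python
# File sizes in KB as reported above, keyed by compression level.
sizes_kb = {0: 20789, 1: 2841, 2: 2692, 4: 2420, 6: 2204, 9: 2133}

uncompressed = sizes_kb[0]

# Size of each file as a percentage of the level-0 (uncompressed) file.
pct_of_level0 = {lvl: 100.0 * kb / uncompressed for lvl, kb in sizes_kb.items()}

for lvl, pct in pct_of_level0.items():
    print(f"level {lvl}: {sizes_kb[lvl]:>6} KB = {pct:5.1f}% of level 0")

# Additional saving from going all the way from level 1 to level 9.
extra_saving_pct = 100.0 * (sizes_kb[1] - sizes_kb[9]) / sizes_kb[1]
print(f"level 9 is {extra_saving_pct:.1f}% smaller than level 1")
```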
Thanks,
sven