On Mon, 6 Jan 2014, Sven Schreiber wrote:
On 05.01.2014 20:56, Allin Cottrell wrote:
>
> However, it seems that we shouldn't necessarily save all the padding rows
> when writing a gdt file; all we need to do is save all the rows that
> actually contain some data, plus enough information to reconstitute the
> panel on reading. There's now an experiment to this effect in CVS (it has
> been tested, but more testing would be good). Here's the deal:
>
> * When saving a panel dataset in native format we check the total size of
> the panel. If this exceeds 10MB we then check what fraction of the data
> rows are pure padding, and if this exceeds 30% we go into "skip-padding"
> mode. (These parameters are of course somewhat arbitrary and could be
> tweaked.)
Sure, but sounds reasonable enough to me.
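
For anyone who wants to play with the thresholds, the rule quoted above can be sketched roughly as follows. This is an illustration in Python, not the actual gretl code (which is C): the function names are made up, "10MB" is read as 10 * 1024 * 1024 bytes, the panel size is assumed to be rows * variables * 8 bytes, and missing values are represented as NaN.

```python
import math

SIZE_THRESHOLD_BYTES = 10 * 1024 * 1024  # "10MB" (exact definition is an assumption)
PADDING_THRESHOLD = 0.30                 # "30%" of data rows
BYTES_PER_VALUE = 8                      # assumption: values are held as doubles


def is_padding_row(row):
    """A row counts as pure padding if every value in it is missing (NaN here)."""
    return all(math.isnan(x) for x in row)


def use_skip_padding(rows):
    """Return True if the panel would be saved in skip-padding mode (hypothetical helper)."""
    if not rows:
        return False
    nvars = len(rows[0])
    total_bytes = len(rows) * nvars * BYTES_PER_VALUE
    if total_bytes <= SIZE_THRESHOLD_BYTES:
        return False  # small panel: skip the (possibly slow) padding count
    n_pad = sum(1 for row in rows if is_padding_row(row))
    return n_pad / len(rows) > PADDING_THRESHOLD
```

Note that the padding count visits every row, and a padding row has to be scanned in full before it can be classified, so on a severely unbalanced panel this check is not free.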
I've now done some testing for speed of "skip-padding" and actually the
improvement is not all that great. It turns out that with a big and
severely unbalanced panel it takes quite a while to count the padding
rows. I'm showing my results below. In all cases the simulated datasets
had 1200 variables and about 70 percent of the rows were filled with NAs
to represent panel padding; the cases differ by NT (the number of rows in
the balanced panel) and the compression level used.
Overall I'm seeing a slight gain in (compress + write) speed, and a more
substantial but still not dramatic gain in (read + decompress) speed.
While the size of the gdt file on disk is smaller with skip-padding, when
compression is enabled the difference is not as great as one might
imagine. Evidently zlib does a pretty good job of shrinking the padding
rows to next-to-nothing.
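
The effect of zlib on padding can be checked in isolation with a small Python sketch. It assumes each padding row is written as 1200 "NA" tokens of text, which only approximates the real gdt layout; the number of rows and the function name are arbitrary choices for illustration.

```python
import zlib

NVARS = 1200        # as in the simulated datasets
N_PAD_ROWS = 1000   # arbitrary number of padding rows, for illustration only


def padding_compression_ratio(level):
    """Compressed size divided by raw size for a block of pure padding rows."""
    # Assumption: a padding row is NVARS space-separated "NA" tokens plus a newline.
    row = ("NA " * NVARS).strip() + "\n"
    raw = (row * N_PAD_ROWS).encode("ascii")
    packed = zlib.compress(raw, level)
    return len(packed) / len(raw)


if __name__ == "__main__":
    for level in (1, 6):
        print("gzip level", level, "ratio", padding_compression_ratio(level))
```

At both compression levels the padding shrinks to a small fraction of its raw size, which fits with the modest disk savings from skip-padding in the compressed cases.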

case 1 (NT = 10000, gzip = 0)

              original   with skip   ratio
store(sec)        2.97        2.80   0.943
open(sec)         0.76        0.45   0.591
disk(Kb)         37224       12836   0.345

case 2 (NT = 60000, gzip = 6)

              original   with skip   ratio
store(sec)       22.77       21.27   0.934
open(sec)         5.10        2.82   0.552
disk(Kb)         24161       22415   0.928

case 3 (NT = 60000, gzip = 1)

              original   with skip   ratio
store(sec)       18.55       16.90   0.911
open(sec)         5.11        2.77   0.542
disk(Kb)         28746       26126   0.909

Allin