On Tue, 19 Jun 2012, Riccardo (Jack) Lucchetti wrote:
> On Mon, 18 Jun 2012, Allin Cottrell wrote:
>> A few changes in recent gretl CVS address the issue of
>> handling very large datasets -- datasets that will not fit
>> into RAM in their entirety. [...]
>
> First of all, let me thank Allin for all the work he's done in this
> direction: I ran a few tests of the new CVS features and I can
> confirm that every test I tried works splendidly.
>
> That said, I have the feeling that, in order to make effective use of
> the datasets Allin is referring to, we need an extra ingredient
> (which, IMHO, is THE feature that made Stata the killer package in
> some quarters of the econometrics profession): the ability to extract
> data sensibly by performing those operations that, in database
> parlance, are called JOINs.
[...]
Thanks for the clear explanation, and I think you're right; handling
JOINs is something we should work towards.
I'm also thinking that it would be nice to offer a shortcut for the
kind of procedure I outlined, so that you could do something like
open huge.txt --cols=whatever --rowmask="gender==1"
In the background gretl would do what I described before: find and
use the full-length gender series to construct a rowmask, then read
the selected columns using the mask. This would not only be more
user-friendly, it would also be more efficient: libgretl could use a
simple byte array for the rowmask, and it wouldn't necessarily have
to read the whole gender series into memory, just scan it row by
row.
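To make the two-pass idea concrete, here is a minimal sketch in Python
(libgretl itself is C, and would do this inside its CSV reader; the
column names and values below are invented for illustration):

```python
import csv
import io

# Stand-in for a huge comma-separated datafile (contents hypothetical).
data = io.StringIO(
    "id,gender,income\n"
    "1,1,30000\n"
    "2,0,45000\n"
    "3,1,28000\n"
)

# Pass 1: scan the 'gender' column row by row, building a simple
# byte-array rowmask; the full series never needs to sit in memory.
reader = csv.DictReader(data)
rowmask = bytearray(int(row["gender"]) == 1 for row in reader)

# Pass 2: re-read the file, keeping only the selected column for
# rows where the mask is set.
data.seek(0)
reader = csv.DictReader(data)
selected = [row["income"] for keep, row in zip(rowmask, reader) if keep]
print(selected)  # ['30000', '28000']
```

The point of the byte array is that it costs one byte per observation
regardless of how many columns the file has, so the mask stays cheap
even for very wide datafiles.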
A question for users of big data (somewhat relevant to implementing
the above suggestion): do monster-size text datafiles typically come
in fixed format? That's my sense, but I don't have a lot of
experience in this area. (If that's right it makes sense
computationally: it's much quicker to read specific variables out of
a big file if you know in advance exactly where to find them.)
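The computational point can be illustrated with a toy fixed-format
file (the record layout here is invented): since every record has the
same byte length, the offset of any field is a single multiplication,
with no need to parse the intervening rows.

```python
import io

# Hypothetical fixed-format file: each record is exactly 13 bytes --
# id (4 chars) + gender (2) + income (6) + newline (1).
RECLEN, INCOME_OFF, INCOME_W = 13, 6, 6
raw = io.BytesIO(
    b"0001 1 30000\n"
    b"0002 0 45000\n"
    b"0003 1 28000\n"
)

def read_field(f, recno, offset, width):
    # Jump straight to the field; nothing before it is read or parsed.
    f.seek(recno * RECLEN + offset)
    return f.read(width)

# Income of the third record (0-based index 2):
print(int(read_field(raw, 2, INCOME_OFF, INCOME_W).strip()))  # 28000
```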
Allin