On Tue, 19 Jun 2012, Allin Cottrell wrote:
I'm also thinking that it would be nice to offer a shortcut for the kind of
procedure I outlined, so that you could do something like
open huge.txt --cols=whatever --rowmask="gender==1"
In the background gretl would do what I described before: find and use the
full-length gender series to construct a rowmask, then read the selected
columns using the mask. This would not only be more user-friendly, it would
also be more efficient: libgretl could use a simple byte array for the
rowmask, and it wouldn't necessarily have to read the whole gender series
into memory, just scan it row by row.
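To make the two-pass idea concrete, here is a minimal sketch in Python (not gretl code, and not how libgretl does or would do it). The function names, the sample data and the use of Python's csv module are all assumptions made for illustration; the --rowmask option itself is only a proposal at this point. The first pass scans a single column row by row and fills a byte array, and the second pass reads only the requested columns for the rows the mask keeps.

```python
import csv
import io

def build_rowmask(f, column, wanted):
    """Pass 1: scan one column row by row; one mask byte per data row."""
    reader = csv.reader(f)
    header = next(reader)
    idx = header.index(column)
    mask = bytearray()
    for row in reader:
        # Only the mask is stored; the scanned series itself is not kept.
        mask.append(1 if row[idx] == wanted else 0)
    return mask

def read_masked(f, columns, mask):
    """Pass 2: read the requested columns, keeping rows whose mask byte is 1."""
    reader = csv.reader(f)
    header = next(reader)
    idxs = [header.index(c) for c in columns]
    return [[row[i] for i in idxs]
            for keep, row in zip(mask, reader) if keep]

# Hypothetical stand-in for huge.txt
SAMPLE = "id,gender,income\n1,1,100\n2,2,200\n3,1,300\n"

mask = build_rowmask(io.StringIO(SAMPLE), "gender", "1")
rows = read_masked(io.StringIO(SAMPLE), ["id", "income"], mask)
print(list(mask), rows)
```

The comparison here is done on the raw text of the field, so the same sketch would also serve for an identifier that is not numeric.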
Yes, that would be very nice. One more thing we have to keep in mind:
There are some cases in which the "identifier" field may be non-numeric.
For example, the World Development Indicators. These are by no means a
database as huge as the ones we're potentially dealing with here (although
respectable in size: the zipped set of CSV files is a hefty 35.7 MB: see
http://data.worldbank.org/data-catalog/world-development-indicators/),
but make for an interesting test case: all the relevant items you may want
to select rows on are strings. That is, we may find ourselves
needing something like --rowmask="FOO==\"bar\"".
Moreover, we should also be prepared to find CSV or fixed-format files in
which the variable names are not valid gretl identifiers, because they are too
long, contain spaces, etc.
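As a rough illustration of what handling such names might involve, here is a small Python sketch (again, not gretl code). The function name to_identifier is hypothetical, and the rules it applies are assumptions: a name starts with a letter, contains only ASCII letters, digits and underscores, and is at most 31 characters long. The actual rules and length limit should be checked against the gretl documentation.

```python
import re

MAX_LEN = 31  # assumed limit, not taken from the gretl sources

def to_identifier(name, taken, max_len=MAX_LEN):
    """Map an arbitrary column heading to a unique identifier-like name."""
    # Replace anything that is not an ASCII letter, digit or underscore.
    s = re.sub(r"[^A-Za-z0-9_]", "_", name.strip())
    # Collapse runs of underscores and trim them from both ends.
    s = re.sub(r"_+", "_", s).strip("_")
    # Force a leading letter.
    if not s or not s[0].isalpha():
        s = "v" + s
    s = s[:max_len]
    # Resolve clashes with names already in use by adding a numeric suffix.
    candidate, n = s, 1
    while candidate in taken:
        suffix = f"_{n}"
        candidate = s[:max_len - len(suffix)] + suffix
        n += 1
    taken.add(candidate)
    return candidate

taken = set()
for heading in ["GDP per capita (current US$)", "2010", "x" * 40]:
    print(heading, "->", to_identifier(heading, taken))
```

Keeping the original heading somewhere (as a descriptive label, say) would presumably be desirable, since the mapping loses information.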
A question for users of big data (somewhat relevant to implementing the above
suggestion): do monster-size text datafiles typically come in fixed format?
That's my sense, but I don't have a lot of experience in this area. (If
that's right it makes sense computationally: it's much quicker to read
specific variables out of a big file if you know in advance exactly where to
find them.)
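The computational point can be sketched as follows, in Python rather than C. The record layout, the sample data and the function name read_fixed_field are made up for illustration. With fixed-length records, the position of any field in any row is just arithmetic, so a reader can seek straight to it without parsing the rest of the line.

```python
import io

def read_fixed_field(f, n_rows, record_len, start, width):
    """Read one field from every record of a fixed-format file by seeking."""
    values = []
    for i in range(n_rows):
        # Byte offset of the field: row number times record length,
        # plus the field's starting position within the record.
        f.seek(i * record_len + start)
        values.append(f.read(width).decode("ascii").strip())
    return values

# Hypothetical layout: id (3 bytes), gender (1 byte), income (6 bytes),
# then a newline, giving 11 bytes per record.
RECORD_LEN = 11
DATA = b"0011   100\n0022   200\n0031   300\n"

f = io.BytesIO(DATA)
n_rows = len(DATA) // RECORD_LEN
gender = read_fixed_field(f, n_rows, RECORD_LEN, start=3, width=1)
income = read_fixed_field(f, n_rows, RECORD_LEN, start=4, width=6)
print(gender, income)
```

With delimited text, by contrast, finding the k-th field of a row means scanning that row for delimiters, and finding the row itself means scanning for line breaks.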
In my experience, fixed format was the standard in the past. Nowadays
it's getting less common, and I think it is being supplanted by CSV. But I'm
not in a position to say anything authoritative.
--------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Economia
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti(a)univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
--------------------------------------------------