A few changes in recent gretl CVS address the issue of
handling very large datasets -- datasets that will not fit
into RAM in their entirety. These changes are not finalized --
though hopefully they will be by the end of the summer, once
Jack Lucchetti and I have worked on the matter together -- so
I haven't yet started to document them. But I'll explain the
current state of affairs here, and if people would like to test
and give their comments, that would be great.
First, let me note in passing that Jack and I have considered
the idea of introducing a new internal data-type in gretl -- a
type smaller than the double-precision floating-point type
that we currently use to represent all data series. Doubles
take up 8 bytes apiece, but some data (e.g. dummy variables)
could be represented correctly using a single byte. That way
we could cram more data into RAM. However, we've abandoned
that idea for the present; making the change would be a huge
amount of work, given the heavy dependence of the existing
gretl code on the assumption that all series live in a big
array of uniform type.
So what I'll describe here is not actually a way of fitting
more data into RAM; it's a way of pulling data for analysis
from a data source that is too big to handle in full.
To fix ideas, consider a census dataset with a million
observations on a thousand variables -- a billion data values
in all. In double precision this would occupy 8GB of memory.
To load such data in full, let alone run regressions on them,
you'd need substantially more than 8GB of RAM. But it's
unlikely you'd want to run regressions using the entire
dataset; more plausibly, you might want to use a subset of the
variables and/or a subset of the observations. The problem
we're addressing is this: how do you extract such a subset
using gretl, if you can't read the full data into memory to
start with?
The answer depends on the format of the full dataset. Gretl
can read specified individual series from various sorts of
databases (native gretl binary databases, RATS 4.0, PcGive),
and via ODBC it can extract both particular series and
particular observations as specified via SQL. But what if the
original data are not in any of these formats?
Very large public datasets are quite often available in plain
text format, either delimited (comma-separated or similar) or
in fixed format (where each variable has a known starting
column and a known width in bytes). To date, gretl has had a
mechanism for reading specified variables from a fixed-format
text datafile (the --cols option to the "open" command), but
there has been no mechanism for reading specified variables
from a delimited text file, nor has there been a way of asking
gretl to read only certain observations (rows) from such a
file.
So here are the changes in CVS:
* The --cols option has been generalized so that it can be
used on delimited text files as well as fixed-format ones.
* A --rowmask option has been added which enables you to read
specified rows from text datafiles (and also from native
binary databases).
I'll get to the (provisional) syntax in a moment, but first
let me give an overview of how one might proceed, starting
from a huge text datafile. I'll assume we want to subset both
the series and the observations.
(1) Open the data source using the --cols option to extract
the series that will be used to pick out the observations we
want. (For example, maybe we need a male/female dummy variable
to pick out observations on women.) I'm assuming here that the
number of series needed for this task is small enough that we
can afford to load all the observations.
(2) Create a (matrix) "mask" with 1s for observations we want
and 0s for those to be skipped. Clear the current dataset but
keep the matrix.
(3) Open the source again: this time use both the --cols and
--rowmask options to extract the particular data we want.
To give a sense of the current syntax, here's a hansl example
to carry out steps (1) to (3) above. I'll assume at first that
we're reading from a delimited text file.
<hansl>
# read a specified column
open huge.txt --cols=15 --delimited
# create the observations mask
matrix mask = (gender == 1)
# re-open and read
open huge.txt --cols=1,2,20,156 --rowmask=mask --delimited --preserve
</hansl>
In the first "open" we need the --delimited option to tell
gretl to interpret the --cols specification as giving a list
of one or more delimited columns. The "15" says to read the
15th data column, which I assume contains a series named
"gender". In the second "open" we specify the columns (series)
that we want for analysis along with the row mask we just
created. The --preserve option is needed so that the matrix we
want to use as a mask doesn't get destroyed.
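Of course the mask can depend on more than one variable; the
first "open" just has to pull in enough columns to construct
it. For example (the column number 23 and the "age" series are
invented here, purely for illustration):
<hansl>
# suppose column 15 holds gender and column 23 holds age
open huge.txt --cols=15,23 --delimited
# select women aged 25 or over
matrix mask = (gender == 1 && age >= 25)
# re-open and read the series of interest
open huge.txt --cols=1,2,20,156 --rowmask=mask --delimited --preserve
</hansl>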
Here's the equivalent, but this time supposing we're reading
from a fixed-format file:
<hansl>
# read a specified column
open fixed.txt --cols=15,1
# create the observations mask
matrix mask = (v1 == 1)
# re-open and read
matrix C = {1,6,7,8,32,6}
open fixed.txt --cols=C --rowmask=mask --preserve
</hansl>
In this example the --cols specification has its original
meaning (as documented in relation to "open"). That is, it is
made up of pairs (c,w) where c gives the starting byte and w
the width in bytes. So the series we want for the row mask is
a single byte starting at byte 15 on each line of the data.
With fixed-format data, variable names are not supported, but
the first variable read is automatically given the name "v1".
We then proceed to read specified observations on
three series: one starting at byte 1 and occupying 6 bytes,
one 8 bytes wide starting at byte 7, and one of 6 bytes
starting at byte 32. This example also illustrates the way you
can (now) use a named matrix with the --cols option.
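Since a long row vector of (start, width) pairs can be hard to
read at a glance, one could also build C pair by pair using
gretl's horizontal concatenation operator; the result is
exactly the same 1 x 6 row vector as above:
<hansl>
# (start, width) pairs, concatenated into a single row vector
matrix C = {1,6} ~ {7,8} ~ {32,6}
print C
</hansl>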
One more point. Note that a native gretl binary database can
be both read and written piece-wise, without ever holding the
whole thing in memory, so one could modify the above scenarios
to write out the huge data in native db format. The advantage
of that approach is that reading from a native gretl db is
much faster than reading from a text file, so if we want to go
back and read various subsets of the data, we'd get a big gain
in efficiency. In this context it would be nice to be
able to use gretl's "open" command in a loop. This is not
currently enabled (it may happen at some point, but it
wouldn't be trivial to implement). But you can use the shell
to drive the loop, as in the following pair of scripts.
Running the bash script will cause gretl to read a thousand
series from huge.txt, one at a time, and write them into the
database file huge.bin.
<hansl>
# writedb.inp: COL is a placeholder, substituted by the shell script below
open huge.txt --cols=COL --delimited --quiet
# write the series just read into the native database huge.bin
store huge.bin --database
</hansl>
<bash>
# for each of the 1000 columns: substitute the column number
# for COL, then run the resulting script in batch mode
for i in {1..1000}
do
  sed -e "s+COL+$i+" writedb.inp > tmp.inp
  gretlcli -b tmp.inp
done
</bash>
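For completeness, once huge.bin has been written, pulling
series back out is just the usual database drill: open the
binary database, then import series by name with the "data"
command. The names below are stand-ins for whatever names
huge.txt actually supplies:
<hansl>
# open the native database and import a few series by name
open huge.bin
data v12 v250 v873
</hansl>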
Allin