A few changes in recent gretl CVS address the issue of
handling very large datasets -- datasets that will not fit
into RAM in their entirety. These changes are not finalized --
though hopefully they will be by the end of the summer, once
Jack Lucchetti and I have worked on the matter together -- so
I haven't yet started to document them. But I'll explain the
current state of affairs here, and if people would like to test
and give their comments, that would be great.
First, let me note in passing that Jack and I have considered
the idea of introducing a new internal data-type in gretl -- a
type smaller than the double-precision floating-point type
that we currently use to represent all data series. Doubles
take up 8 bytes apiece, but some data (e.g. dummy variables)
could be represented correctly using a single byte. That way
we could cram more data into RAM. However, we've abandoned
that idea for the present; making the change would be a huge
amount of work, given the heavy dependence of the existing
gretl code on the assumption that all series live in a big
array of uniform type.
So what I'll describe here is not actually a way of fitting
more data into RAM; it's a way of pulling data for analysis
from a data source that is too big to handle in full.
To fix ideas, consider a census dataset with a million
observations on a thousand variables -- a billion data values
in all. In double precision this would occupy 8GB of memory.
To load such data in full, let alone run regressions on them,
you'd need substantially more than 8GB of RAM. But it's
unlikely you'd want to run regressions using the entire
dataset; more plausibly, you might want to use a subset of the
variables and/or a subset of the observations. The problem
we're addressing is this: how do you extract such a subset
using gretl, if you can't read the full data into memory to
start with?
The answer depends on the format of the full dataset. Gretl
can read specified individual series from various sorts of
databases (native gretl binary databases, RATS 4.0, PcGive),
and via ODBC it can extract both particular series and
particular observations as specified via SQL. But what if the
original data are not in any of these formats?
Very large public datasets are quite often available in plain
text format, either delimited (comma-separated or similar) or
in fixed format (where each variable has a known starting
column and a known width in bytes). To date, gretl has had a
mechanism for reading specified variables from a fixed-format
text datafile (the --cols option to the "open" command), but
there has been no mechanism for reading specified variables
from a delimited text file, nor has there been a way of asking
gretl to read only certain observations (rows) from such a
file.
So here are the changes in CVS:
* The --cols option has been generalized so that it can be
used on delimited text files as well as fixed-format ones.
* A --rowmask option has been added which enables you to read
specified rows from text datafiles (and also from native
binary databases).
I'll get to the (provisional) syntax in a moment, but first
let me give an overview of how one might proceed, starting
from a huge text datafile. I'll assume we want to subset both
the series and the observations.
(1) Open the data source using the --cols option to extract
the series that will be used to pick out the observations we
want. (For example, maybe we need a male/female dummy variable
to pick out observations on women.) I'm assuming here that the
number of series needed for this task is small enough that we
can afford to load all the observations.
(2) Create a (matrix) "mask" with 1s for observations we want
and 0s for those to be skipped. Clear the current dataset but
keep the matrix.
(3) Open the source again: this time use both the --cols and
--rowmask options to extract the particular data we want.
To give a sense of the current syntax, here's a hansl example
to carry out steps (1) to (3) above. I'll assume at first that
we're reading from a delimited text file.
<hansl>
# read a specified column
open huge.txt --cols=15 --delimited
# create the observations mask
matrix mask = (gender == 1)
# re-open and read
open huge.txt --cols=1,2,20,156 --rowmask=mask --delimited --preserve
</hansl>
In the first "open" we need the --delimited option to tell
gretl to interpret the --cols specification as giving a list
of one or more delimited columns. The "15" says to read the
15th data column, which I assume contains a series named
"gender". In the second "open" we specify the columns (series)
that we want for analysis along with the row mask we just
created. The --preserve option is needed so that the matrix we
want to use as a mask doesn't get destroyed.
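Of course the mask can depend on more than one variable; the
first "open" just has to pull in enough columns to construct
it. For example (the column number 23 and the "age" series are
invented here, purely for illustration):
<hansl>
# suppose column 15 holds gender and column 23 holds age
open huge.txt --cols=15,23 --delimited
# select women aged 25 or over
matrix mask = (gender == 1 && age >= 25)
# re-open and read the series of interest
open huge.txt --cols=1,2,20,156 --rowmask=mask --delimited --preserve
</hansl>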
Here's the equivalent, but this time supposing we're reading
from a fixed-format file:
<hansl>
# read a specified column
open fixed.txt --cols=15,1
# create the observations mask
matrix mask = (v1 == 1)
# re-open and read
matrix C = {1,6,7,8,32,6}
open fixed.txt --cols=C --rowmask=mask --preserve
</hansl>
In this example the --cols specification has its original
meaning (as documented in relation to "open"). That is, it is
made up of pairs (c,w) where c gives the starting byte and w
the width in bytes. So the series we want for the row mask is
a single byte starting at byte 15 on each line of the data.
With fixed-format data, variable names are not supported, but
the first variable read is automatically given the name "v1".
We then proceed to read specified observations on
three series: one starting at byte 1 and occupying 6 bytes,
one 8 bytes wide starting at byte 7, and one of 6 bytes
starting at byte 32. This example also illustrates the way you
can (now) use a named matrix with the --cols option.
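Since a long row vector of (start, width) pairs can be hard to
read at a glance, one could also build C pair by pair using
gretl's horizontal concatenation operator; the result is
exactly the same 1 x 6 row vector as above:
<hansl>
# (start, width) pairs, concatenated into a single row vector
matrix C = {1,6} ~ {7,8} ~ {32,6}
print C
</hansl>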
One more point. Note that a native gretl binary database can
be both read and written piece-wise, without ever holding the
whole thing in memory, so one could modify the above scenarios
to write out the huge data in native db format. The advantage
of that approach is that reading from a native gretl db is
much faster than reading from a text file, so if we want to go
back and read various subsets of the data, we'd get a big gain
in efficiency. In this context it would be nice to be
able to use gretl's "open" command in a loop. This is not
currently enabled (it may happen at some point, but it
wouldn't be trivial to implement). But you can use the shell
to drive the loop, as in the following pair of scripts.
Running the bash script will cause gretl to read a thousand
series from huge.txt, one at a time, and write them into the
database file huge.bin.
<hansl>
# writedb.inp: COL is a placeholder, substituted by the shell script below
open huge.txt --cols=COL --delimited --quiet
# write the series just read into the native database huge.bin
store huge.bin --database
</hansl>
<bash>
# for each of the 1000 columns: substitute the column number
# for COL, then run the resulting script in batch mode
for i in {1..1000}
do
  sed -e "s+COL+$i+" writedb.inp > tmp.inp
  gretlcli -b tmp.inp
done
</bash>
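For completeness, once huge.bin has been written, pulling
series back out is just the usual database drill: open the
binary database, then import series by name with the "data"
command. The names below are stand-ins for whatever names
huge.txt actually supplies:
<hansl>
# open the native database and import a few series by name
open huge.bin
data v12 v250 v873
</hansl>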
Allin