On Mon, 18 Jun 2012, Allin Cottrell wrote:
A few changes in recent gretl CVS address the issue of
handling very large datasets -- datasets that will not fit
into RAM in their entirety. [...]
First of all, let me thank Allin for all the work he's done in this
direction: I ran a few tests on the new CVS features and I can confirm
that every test I tried works splendidly.
That said, I have the feeling that, in order to use the datasets Allin is
referring to effectively, we need an extra ingredient (which, IMHO, is THE
feature that made Stata the killer package in some quarters of the
econometrics profession): the ability to extract data sensibly by
performing the operations that, in database parlance, are called JOINs.
Let me provide an example for the unfamiliar: suppose you have two
datafiles, one for individuals and one for households (example attached).
Of course, the two datafiles don't match by row, since there may be more
than one individual in a household; however, you may want to attach data
from one dataset to the other.
The following script should clarify what I have in mind:
<hansl>
# example 1 (easy): assigning a household variable to an individual
open hh_data.csv --quiet
matrix X = {HH_NO, AREA}
open ind_data.csv --quiet --preserve
series AREA = HH_NO
AREA = replace(AREA, X[,1], X[,2])
print -o
# example 2 (harder): going in the opposite direction
open ind_data.csv --quiet
matrix v = values(HH_NO)
matrix X = v ~ 0
loop i = 1..rows(v) --quiet
smpl HH_NO == v[i] --restrict --replace
X[i,2] = mean(AGE)
endloop
open hh_data.csv --quiet --preserve
series AVGAGE = X[,2]
print -o
</hansl>
In the first example, we attach to the individuals dataset the area where
the household lives, while in the second one we attach to the households
dataset the average age of its members; I'm sure you'll have no problem
imagining all the possible generalisations (and caveats: NAs and missing
rows, for example, have to be handled).
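For readers more at home outside gretl, the two operations above are just a left join and a group-by aggregation. A minimal Python sketch, with made-up toy data standing in for hh_data.csv and ind_data.csv (the values are my own invention, not the attached files):

```python
from collections import defaultdict

# Toy stand-ins for the two datafiles (values invented for illustration).
households = {1: "North", 2: "South"}   # HH_NO -> AREA
individuals = [                          # (HH_NO, AGE)
    (1, 34), (1, 36), (1, 8),
    (2, 71), (2, 69),
]

# Example 1: attach the household-level AREA to each individual
# (a left join on HH_NO).
ind_area = [(hh, age, households.get(hh)) for hh, age in individuals]

# Example 2: attach the mean AGE of its members to each household
# (a group-by aggregation on HH_NO).
ages = defaultdict(list)
for hh, age in individuals:
    ages[hh].append(age)
avg_age = {hh: sum(a) / len(a) for hh, a in ages.items()}
```

The dict lookup plays the role of the `replace()` trick in example 1, and the accumulation loop plays the role of the `smpl --restrict` loop in example 2.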
For very large bodies of data, this is typically a job for SQL-based
DBMSes, such as Oracle, Postgres, MySQL, et cetera, which do this kind of
job with amazing speed and accuracy through astonishingly clever
algorithms and tricks. However, you need (1) to have the data in that
format already, and (2) a decent knowledge of SQL, which very few applied
economists have.
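For concreteness, here is what the two operations look like in SQL, using Python's built-in sqlite3 module and the same invented toy data as above (table and column names are my own, mirroring the CSV files):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE hh  (HH_NO INTEGER PRIMARY KEY, AREA TEXT);
    CREATE TABLE ind (HH_NO INTEGER, AGE REAL);
""")
con.executemany("INSERT INTO hh VALUES (?, ?)", [(1, "North"), (2, "South")])
con.executemany("INSERT INTO ind VALUES (?, ?)",
                [(1, 34), (1, 36), (1, 8), (2, 71), (2, 69)])

# Example 1: household AREA attached to each individual (a LEFT JOIN).
rows = con.execute("""
    SELECT ind.HH_NO, ind.AGE, hh.AREA
    FROM ind LEFT JOIN hh USING (HH_NO)
""").fetchall()

# Example 2: average member AGE attached to each household (GROUP BY).
avg = con.execute("""
    SELECT hh.HH_NO, AVG(ind.AGE)
    FROM hh LEFT JOIN ind USING (HH_NO)
    GROUP BY hh.HH_NO
    ORDER BY hh.HH_NO
""").fetchall()
```

The point is that the DBMS expresses each operation in a single declarative statement, with the matching logic handled internally.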
It'd be extremely nice to be able to perform this kind of operation from
a gretl database: in the example above, I would imagine that you could
have a .bin file containing all the series in the two datasets, with some
mechanism in place for identifying which series is a household-level
series, which is an individual-level series, and so forth.
--------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Economia
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti(a)univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
--------------------------------------------------