The following script generates two test files whose size depends on the
two parameters ncountries and mean_n_hh (the mean number of households per
country). To get roughly the same size as your real data, you could set
ncountries to 30 and mean_n_hh to 7500 or so. Then a "join" is performed
and timed.
On my PC this takes about half a second with mean_n_hh=200, nearly a
minute with mean_n_hh=4000, and about 8 minutes with mean_n_hh=10000. From
some experimenting, it would seem that the time taken grows roughly
quadratically with mean_n_hh (a quick check on this is sketched after the
script); I suppose we could try something to make it less convex (although
I suspect it won't be easy to make it linear).
<hansl>
set echo off
set messages off
set seed 123456
ncountries = 30
mean_n_hh = 10000
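# n will accumulate the total number of households generated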
n = 0
# generate the outer dataset
printf "generating outer file\n"
outfile "outer.csv" --write
printf "cntry, hid, x\n"
loop i=1..ncountries --quiet
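# households in country i: one draw from N(mean_n_hh, sd = 10), rounded up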
nind = ceil(randgen1(z,mean_n_hh,10))
n += nind
loop j=1..nind --quiet
printf "%d,%d,%12.5f\n", i, j, randgen1(z,0,1)
endloop
endloop
outfile "outer.csv" --close
open outer.csv --quiet --preserve
# generate the inner dataset
printf "generating inner file\n"
outfile "inner.csv" --write
printf "hid, iid, cntry, y\n"
loop i=1..$nobs --quiet
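# size of household i: a uniform integer draw between 1 and 4 individuals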
nh = randgen1(i,1,4)
loop j=1..nh --quiet
printf "%d,%d,%d,%12.5f\n", hid[i], j, cntry[i], randgen1(z,0,1)
endloop
endloop
outfile "inner.csv" --close
# do the join
open inner.csv --preserve
printf "performing join\n"
set stopwatch
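# import series x from outer.csv, matching rows on the composite key (hid, cntry)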
join outer.csv x --ikey=hid,cntry
printf "individuals = %d (%d countries, %d households); time = %g
seconds\n", \
$nobs, ncountries, n, $stopwatch
smpl 1 30
print -o
smpl full
</hansl>
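
As a quick sanity check on the "roughly quadratic" impression, here is a
minimal sketch (not part of the script above) that fits a log-log
regression to the three approximate timings quoted earlier; the 0.5, 60
and 480 second figures are just my reading of "half a second", "nearly a
minute" and "about 8 minutes".

<hansl>
# rough order-of-growth check: regress log(time) on log(mean_n_hh)
matrix sz = {200; 4000; 10000}   # values of mean_n_hh
matrix tm = {0.5; 60; 480}       # approximate times in seconds
matrix X = ones(3,1) ~ log(sz)
matrix b = mols(log(tm), X)
printf "estimated exponent = %g\n", b[2]
</hansl>

With these particular numbers the slope comes out a bit above 1.7, which
is at least broadly consistent with the quadratic impression.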
-------------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Scienze Economiche e Sociali (DiSES)
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti(a)univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
-------------------------------------------------------