On 13.09.2013 18:47, Riccardo (Jack) Lucchetti wrote:
The following script will generate two testfiles whose size depends on
the two parameters ncountries and mean_n_hh (mean number of people per
household). In order to get nearly the same size as your real data, you
could set ncountries to 30 and mean_n_hh to 7500 (roughly). Then, a
"join" will be performed and timed.
On my PC this takes about half a second with mean_n_hh=200, nearly a
minute with mean_n_hh=4000 and about 8 minutes with mean_n_hh=10000.
From some experimenting, it would seem that time is approximately
quadratic; I suppose we could try something to make it less convex
(although I suspect it won't be easy to make it linear).
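As a rough check on the "approximately quadratic" reading, the three timings quoted above can be turned into an empirical exponent by taking the slope of log(time) against log(mean_n_hh). This is only a sketch: the timings are the rounded figures from the message (0.5 s, 60 s, 480 s), and it assumes the number of rows grows in proportion to mean_n_hh while ncountries stays fixed.

```python
import math

# Rounded timings from the message: mean_n_hh -> seconds.
# "about half a second", "nearly a minute", "about 8 minutes".
timings = [(200, 0.5), (4000, 60.0), (10000, 480.0)]


def local_exponent(p, q):
    """Slope of log(time) against log(size) between two measurements."""
    (n1, t1), (n2, t2) = p, q
    return math.log(t2 / t1) / math.log(n2 / n1)


# One exponent per pair of consecutive measurements.
exponents = [local_exponent(a, b) for a, b in zip(timings, timings[1:])]

for (a, b), k in zip(zip(timings, timings[1:]), exponents):
    print(f"{a[0]} -> {b[0]}: exponent ~ {k:.2f}")
```

With these rounded inputs the slope is below 2 on the first interval and somewhat above 2 on the second, so "approximately quadratic" looks like a fair summary for the larger sizes, though three data points cannot pin the growth rate down precisely.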
The "mean_n_hh=10000" case takes around 20 min. here. But interestingly,
I ran the following on STATA 11 using this "10000" case:
<STATA>
insheet using "outer.csv", clear
sort cntry hid
save "Z:\home\artur\gretl\outer.dta", replace
insheet using "inner.csv", clear
merge cntry hid using "outer.dta" /// Didn't work with csv
</STATA>
On STATA it only takes about 3 seconds or so.
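The Stata snippet sorts the outer file on cntry and hid before merging, which suggests a sort-then-merge strategy. The following is a minimal Python sketch of that general technique, not a description of what Stata or gretl actually do internally. The function name, the row layout (dicts) and the sample data are hypothetical, and it assumes each (cntry, hid) key occurs at most once in the outer data.

```python
def sort_merge_join(inner, outer, key):
    """Pair each inner row with the outer row that has the same key.

    Assumes keys are unique in `outer`; unmatched inner rows get None.
    Cost is dominated by the two sorts, O(n log n), instead of the
    O(n * m) cost of scanning all of `outer` for every inner row.
    """
    inner_sorted = sorted(inner, key=key)
    outer_sorted = sorted(outer, key=key)
    result = []
    j = 0  # cursor into outer_sorted; it only ever moves forward
    for row in inner_sorted:
        k = key(row)
        # Skip outer rows whose key is smaller than the current inner key.
        while j < len(outer_sorted) and key(outer_sorted[j]) < k:
            j += 1
        if j < len(outer_sorted) and key(outer_sorted[j]) == k:
            result.append((row, outer_sorted[j]))
        else:
            result.append((row, None))
    return result


def by_household(row):
    # Hypothetical key, mirroring the "cntry hid" pair in the Stata code.
    return (row["cntry"], row["hid"])


# Hypothetical sample data: several inner rows may share one household.
inner = [
    {"cntry": 2, "hid": 1, "pid": 1},
    {"cntry": 1, "hid": 5, "pid": 1},
    {"cntry": 1, "hid": 5, "pid": 2},
    {"cntry": 3, "hid": 9, "pid": 1},
]
outer = [
    {"cntry": 1, "hid": 5, "x": 10.0},
    {"cntry": 2, "hid": 1, "x": 20.0},
]

joined = sort_merge_join(inner, outer, by_household)
```

Because the outer cursor never moves backwards, the merge step itself is a single pass over both sorted inputs, so the total work grows roughly like n log n rather than quadratically.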
Artur