On Fri, 13 Sep 2013, Artur T. wrote:
On 13.09.2013 18:47, Riccardo (Jack) Lucchetti wrote:
>
> The following script will generate two testfiles whose size depends on
> the two parameters ncountries and mean_n_hh (mean number of people per
> household). In order to get nearly the same size as your real data, you
> could set ncountries to 30 and mean_n_hh to 7500 (roughly). Then, a
> "join" will be performed and timed.
>
> On my PC this takes about half a second with mean_n_hh=200, nearly a
> minute with mean_n_hh=4000 and about 8 minutes with mean_n_hh=10000.
> From some experimenting, it would seem that time is approximately
> quadratic; I suppose we could try something to make it less convex
> (although I suspect it won't be easy to make it linear).
>
The "mean_n_hh=10000" case takes around 20 minutes here. Interestingly,
I ran the following in Stata 11 on the same "10000" case:
<STATA>
insheet using "outer.csv", clear
sort cntry hid
save "Z:\home\artur\gretl\outer.dta", replace
insheet using "inner.csv", clear
merge cntry hid using "outer.dta" // didn't work with csv
</STATA>
In Stata this takes only about 3 seconds.
Yes. As Jack said, the time seems to be quadratic, and that should not be
the case. I'm working on it.
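
For illustration only, here is a generic Python sketch of why a keyed join
need not be quadratic. It is not gretl's actual join code, and it does not
claim to be what Stata does internally. The key names cntry and hid come from
the test files above; the other column names (pid, hsize) and both function
names are made up for the example. The first function compares every inner
row with every outer row; the second builds a lookup table on the outer rows
once, so each inner row is matched in roughly constant time.

```python
from collections import defaultdict


def nested_loop_join(inner, outer, keys):
    """Quadratic approach: scan all outer rows for every inner row."""
    result = []
    for row in inner:
        for cand in outer:
            if all(row[k] == cand[k] for k in keys):
                # Outer columns first, then inner columns on top of them.
                result.append({**cand, **row})
    return result


def hash_join(inner, outer, keys):
    """Roughly linear approach: index the outer rows by key, then look up."""
    index = defaultdict(list)
    for cand in outer:
        index[tuple(cand[k] for k in keys)].append(cand)

    result = []
    for row in inner:
        # .get() avoids inserting empty entries for unmatched keys.
        for cand in index.get(tuple(row[k] for k in keys), []):
            result.append({**cand, **row})
    return result


# Tiny hypothetical data set: persons (inner) and households (outer).
inner = [
    {"cntry": 1, "hid": 1, "pid": 1},
    {"cntry": 1, "hid": 1, "pid": 2},
    {"cntry": 2, "hid": 1, "pid": 1},
]
outer = [
    {"cntry": 1, "hid": 1, "hsize": 2},
    {"cntry": 2, "hid": 1, "hsize": 1},
    {"cntry": 2, "hid": 2, "hsize": 3},
]

joined = hash_join(inner, outer, ["cntry", "hid"])
```

Both functions return the same rows; only the number of key comparisons
differs, which is what matters once the files reach the sizes discussed above.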
Allin