The following script generates two test files whose size depends on the
two parameters ncountries and mean_n_hh (the mean number of households per
country). To get roughly the same size as your real data, you could set
ncountries to 30 and mean_n_hh to 7500 or so. Then a "join" is performed
and timed.
On my PC this takes about half a second with mean_n_hh=200, nearly a
minute with mean_n_hh=4000, and about 8 minutes with mean_n_hh=10000. From
some experimenting, it would seem that the time taken grows roughly
quadratically with mean_n_hh (a quick check on this is sketched after the
script); I suppose we could try something to make it less convex (although
I suspect it won't be easy to make it linear).
<hansl>
set echo off
set messages off
set seed 123456
ncountries = 30
mean_n_hh = 10000
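# n will accumulate the total number of households generated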
n = 0
# generate the outer dataset
printf "generating outer file\n"
outfile "outer.csv" --write
printf "cntry, hid, x\n"
loop i=1..ncountries --quiet
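# households in country i: one draw from N(mean_n_hh, sd = 10), rounded up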
nind = ceil(randgen1(z,mean_n_hh,10))
n += nind
loop j=1..nind --quiet
printf "%d,%d,%12.5f\n", i, j, randgen1(z,0,1)
endloop
endloop
outfile "outer.csv" --close
open outer.csv --quiet --preserve
# generate the inner dataset
printf "generating inner file\n"
outfile "inner.csv" --write
printf "hid, iid, cntry, y\n"
loop i=1..$nobs --quiet
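# size of household i: a uniform integer draw between 1 and 4 individuals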
nh = randgen1(i,1,4)
loop j=1..nh --quiet
printf "%d,%d,%d,%12.5f\n", hid[i], j, cntry[i], randgen1(z,0,1)
endloop
endloop
outfile "inner.csv" --close
# do the join
open inner.csv --preserve
printf "performing join\n"
set stopwatch
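# import series x from outer.csv, matching rows on the composite key (hid, cntry)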
join outer.csv x --ikey=hid,cntry
printf "individuals = %d (%d countries, %d households); time = %g
seconds\n", \
$nobs, ncountries, n, $stopwatch
smpl 1 30
print -o
smpl full
</hansl>
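
As a quick sanity check on the "roughly quadratic" impression, here is a
minimal sketch (not part of the script above) that fits a log-log
regression to the three approximate timings quoted earlier; the 0.5, 60
and 480 second figures are just my reading of "half a second", "nearly a
minute" and "about 8 minutes".

<hansl>
# rough order-of-growth check: regress log(time) on log(mean_n_hh)
matrix sz = {200; 4000; 10000}   # values of mean_n_hh
matrix tm = {0.5; 60; 480}       # approximate times in seconds
matrix X = ones(3,1) ~ log(sz)
matrix b = mols(log(tm), X)
printf "estimated exponent = %g\n", b[2]
</hansl>

With these particular numbers the slope comes out a bit above 1.7, which
is at least broadly consistent with the quadratic impression.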
-------------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Scienze Economiche e Sociali (DiSES)
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti(a)univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
-------------------------------------------------------