Hi all,
I have to work with a large panel dataset (the left-hand side, LHS) to
which I would like to join a couple of series from an RHS dataset. The
matching is done via two keys.
I did some performance checks, and it seems that the current
implementation runs the sorting/mapping separately for each series
joined, even though a single sorting/mapping pass should be sufficient
(if I am not wrong).
In a first experiment I join all series from the RHS dataset by means of
the wildcard operator:
<join "@NAME_RHS_DATA" * --ikey=datedim,unitdim>
which takes about 5 sec. here.
I then repeat the experiment, increasing the number of series to join
by one in each round:
<hansl>
loop i=1..nelem(RHS_SERIES_NAMES)
    printf "\nInfo: Start joining %d series.\n", $i
    flush
    # join only the first i series in this round
    strings tojoin = RHS_SERIES_NAMES[1:$i]
    set stopwatch
    join "@NAME_RHS_DATA" tojoin --ikey=datedim,unitdim
    printf "\nInfo: Joining took %.2f sec.\n", $stopwatch
    flush
    # drop the newly joined series (everything not in Base) again
    list New = dataset - Base
    delete New --force
endloop
</hansl>
The output is as follows:
<output>
Info: Joining all series took 4.91 sec.
Info: Start joining 1 series.
Info: Joining took 1.91 sec.
Info: Start joining 2 series.
Info: Joining took 2.88 sec.
Info: Start joining 3 series.
Info: Joining took 3.88 sec.
Info: Start joining 4 series.
Info: Joining took 4.84 sec.
Script done
</output>
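As a rough reading of these timings, the extra time per additional
series can be computed as below. This is only a minimal sketch: the
numbers are copied by hand from the output above, and the names t, inc,
mean_inc and fixed are made up for illustration. The increments come
out close to one second each, which is what I would expect if the
sorting/mapping is repeated for every series.
<hansl>
# timings in seconds for joining 1, 2, 3 and 4 series (copied from above)
matrix t = {1.91; 2.88; 3.88; 4.84}
# extra time incurred by each additional series
matrix inc = t[2:4] - t[1:3]
print inc
# mean increment, and the time left over once one increment is
# subtracted from the single-series case (a crude "fixed cost" guess)
scalar mean_inc = meanc(inc)
scalar fixed = t[1] - mean_inc
printf "Mean increment: %.2f sec., implied fixed cost: %.2f sec.\n", mean_inc, fixed
</hansl>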
Do you agree that the sorting or mapping overhead can in principle be
reduced when joining multiple series at once?
Thanks,
Artur