On Fri, 10 Sep 2021, Sven Schreiber wrote:
> [shows incomplete example from Artur, suggesting inefficiency in
> "join"]
>
> As some of you may have noticed, a feature-request ticket was
> created back in March and some discussion took place there
> (https://sourceforge.net/p/gretl/feature-requests/151/).
>
> I believe Allin has worked on fixing this in current git, but I
> haven't tested it myself yet; part of the reason is that Artur's
> script above was not self-contained.
Looking into the join code, it became apparent that there was at
least one possible source of inefficiency. That is, if you're
importing n series via a single invocation of the join command, we
were calculating the matching of inner and outer keys n times, where
in principle we could do it once, store the results and apply them
to each of the imports, since when you specify multiple series for
joining the keys (if any) must be the same for all of them. Note,
however, that this is not an entirely "free lunch": storing the
matching results requires allocating extra memory that is not
otherwise needed.
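
To make that concrete, here's a purely hypothetical hansl fragment
(the file, key and series names are made up): a multi-series join
carries a single key specification, so the matching of inner and
outer keys is identical for every imported series.

  # one invocation importing three series: one shared key spec
  join src.csv x1 x2 x3 --ikey=id --aggr=avg
  # semantically equivalent to three separate invocations, each
  # of which performs the same key matching from scratch
  join src.csv x1 --ikey=id --aggr=avg
  join src.csv x2 --ikey=id --aggr=avg
  join src.csv x3 --ikey=id --aggr=avg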
In current git we employ the "calculate once and store" method for
the first key (not yet for the second, if present, but most of the
matching work goes into the first one).
I've tested an example where the dataset has 20000 observations and
there are 20 series to import in one go, with a single matching key
and aggregation via the average of the matching values. What I found
was a speed-up of about 2 or 3 percent with the new method. So with
these parameters it appears that the key-matching code actually
takes a trivial proportion of the overall compute time, hardly worth
bothering with.
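
For reference, here's a self-contained sketch along those lines (the
file name, key name and the 5-to-1 matching ratio are my own
arbitrary choices, not necessarily what Artur had in mind):

  # outer file: 100000 rows, keys 1 to 20000 each appearing 5
  # times, plus 20 series to be imported
  nulldata 100000
  genr index
  series key = 1 + (index - 1) % 20000
  loop i=1..20
      series x$i = normal()
  endloop
  list X = x*
  store outer.csv key X

  # inner dataset: 20000 observations, keyed 1 to 20000
  nulldata 20000
  genr index
  series key = index

  # import all 20 series in one invocation, averaging the
  # matching values
  string vnames = ""
  loop i=1..20
      vnames = vnames ~ sprintf(" x%d", i)
  endloop
  set stopwatch
  join outer.csv @vnames --ikey=key --aggr=avg
  printf "20-series join: %g seconds\n", $stopwatch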
At this point it would be good to have an example that exhibits a
substantial (supra-linear) slowdown as the number of imported series
grows. Maybe that's the case with Artur's example. Anyway, a minimal
but informative test case would be very useful.
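
In case it's a useful starting point, here's a hypothetical scaling
check. It assumes outer.csv and the 20000-observation inner dataset
from the sketch above are in place; the values of n are arbitrary,
and re-importing a series on a later pass simply overwrites it.

  # time joins importing n = 1, 5, 10 and 20 series in one call,
  # to see whether the cost grows supra-linearly in n
  matrix sizes = {1, 5, 10, 20}
  loop j=1..4
      scalar n = sizes[j]
      string vn = ""
      loop i=1..n
          vn = vn ~ sprintf(" x%d", i)
      endloop
      set stopwatch
      join outer.csv @vn --ikey=key --aggr=avg
      printf "n = %2d series: %g seconds\n", n, $stopwatch
  endloop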
Allin