On Fri, 10 Sep 2021, Sven Schreiber wrote:
> [shows incomplete example from Artur, suggesting inefficiency in
> "join"]
>
> As some of you may have noticed, a feature-request ticket was
> created back in March and some discussion took place there
> (https://sourceforge.net/p/gretl/feature-requests/151/).
>
> I believe Allin has worked on fixing this in current git, but I
> haven't tested it myself yet; part of the reason is that Artur's
> script above was not self-contained.
Looking into the join code, it became apparent that there was at
least one possible source of inefficiency. That is, if you're
importing n series via a single invocation of the join command, we
were calculating the matching of inner and outer keys n times, where
in principle we could do it once, store the results and apply them
to each of the imports, since when you specify multiple series for
joining the keys (if any) must be the same for all of them. Note,
however, that this is not an entirely "free lunch": storing the
matching results requires allocating extra memory that is not
otherwise needed.
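
To make that concrete, here's a purely hypothetical hansl fragment
(the file, key and series names are made up): a multi-series join
carries a single key specification, so the matching of inner and
outer keys is identical for every imported series.

  # one invocation importing three series: one shared key spec
  join src.csv x1 x2 x3 --ikey=id --aggr=avg
  # semantically equivalent to three separate invocations, each
  # of which performs the same key matching from scratch
  join src.csv x1 --ikey=id --aggr=avg
  join src.csv x2 --ikey=id --aggr=avg
  join src.csv x3 --ikey=id --aggr=avg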
In current git we employ the "calculate once and store" method for
the first key (not yet for the second, if present, but most of the
matching work goes into the first one).
I've tested an example where the dataset has 20000 observations and
there are 20 series to import in one go, with a single matching key
and aggregation via the average of the matching values. What I found
was a speed-up of about 2 or 3 percent with the new method. So with
these parameters it appears that the key-matching code actually
takes a trivial proportion of the overall compute time, hardly worth
bothering with.
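
For reference, here's a self-contained sketch along those lines (the
file name, key name and the 5-to-1 matching ratio are my own
arbitrary choices, not necessarily what Artur had in mind):

  # outer file: 100000 rows, keys 1 to 20000 each appearing 5
  # times, plus 20 series to be imported
  nulldata 100000
  genr index
  series key = 1 + (index - 1) % 20000
  loop i=1..20
      series x$i = normal()
  endloop
  list X = x*
  store outer.csv key X

  # inner dataset: 20000 observations, keyed 1 to 20000
  nulldata 20000
  genr index
  series key = index

  # import all 20 series in one invocation, averaging the
  # matching values
  string vnames = ""
  loop i=1..20
      vnames = vnames ~ sprintf(" x%d", i)
  endloop
  set stopwatch
  join outer.csv @vnames --ikey=key --aggr=avg
  printf "20-series join: %g seconds\n", $stopwatch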
At this point it would be good to have an example that exhibits a
substantial (supra-linear) slowdown as the number of imported series
grows. Maybe that's the case with Artur's example. Anyway, a minimal
but informative test case would be very useful.
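
In case it's a useful starting point, here's a hypothetical scaling
check. It assumes outer.csv and the 20000-observation inner dataset
from the sketch above are in place; the values of n are arbitrary,
and re-importing a series on a later pass simply overwrites it.

  # time joins importing n = 1, 5, 10 and 20 series in one call,
  # to see whether the cost grows supra-linearly in n
  matrix sizes = {1, 5, 10, 20}
  loop j=1..4
      scalar n = sizes[j]
      string vn = ""
      loop i=1..n
          vn = vn ~ sprintf(" x%d", i)
      endloop
      set stopwatch
      join outer.csv @vn --ikey=key --aggr=avg
      printf "n = %2d series: %g seconds\n", n, $stopwatch
  endloop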
Allin