On 14.07.2023 at 22:05, Allin Cottrell wrote:
> Actually, I think it's worth fixing, even if partially, now.
>
> First, I've come to understand a point that I wasn't clear on before,
> namely that the expensive GUI-specific check is only about more
> explicit error-reporting; it won't catch any errors beyond those
> caught by the check run by "setobs". That's because if there are any
> duplicated pairs of values in the unit and period series, this is
> bound to show up as (total number of observations) > (number of
> distinct units) * (number of distinct periods), a condition which is
> checked by setobs, and also checked in the GUI _prior_ to doing the
> expensive thing.
>
> So we can immediately make the more expensive check (EC) conditional
> on an error exposed by the simpler check. If your index series are
> OK, we won't call EC. That's done in git.
OK, thanks. That sounds like the most relevant case.
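
For my own understanding, here's roughly how I picture the two-stage
logic; a minimal sketch in C with made-up names (count_distinct,
check_panel_indices, expensive_check), certainly not the actual gretl
code:

#include <stdlib.h>
#include <string.h>

/* comparator for qsort() on ints */
static int cmp_int (const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;

    return (x > y) - (x < y);
}

/* number of distinct values in x[0..n-1], assuming n > 0:
   sort a copy, then count the steps between neighbors */
static int count_distinct (const int *x, int n)
{
    int *tmp = malloc(n * sizeof *tmp);
    int i, d = 1;

    memcpy(tmp, x, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_int);
    for (i = 1; i < n; i++) {
        d += (tmp[i] != tmp[i-1]);
    }
    free(tmp);
    return d;
}

/* stand-in for the expensive check (EC): its job would be the
   detailed error report, e.g. naming a duplicated pair */
static int expensive_check (const int *unit, const int *period, int n)
{
    /* if n > U*P, the pigeonhole principle guarantees that at
       least one (unit, period) pair occurs twice */
    return 1;
}

/* run the cheap setobs-style check first; call EC only if it
   fails, purely for more explicit error reporting */
int check_panel_indices (const int *unit, const int *period, int n)
{
    int U = count_distinct(unit, n);
    int P = count_distinct(period, n);

    if (n <= U * P) {
        return 0;  /* index series OK: EC is not called */
    }
    return expensive_check(unit, period, n);
}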
> Second (not surprisingly, doh!) the expensive check can be reduced
> from O(n^2) to O(n log n). That too is in git.
>
> The remaining question: Is n log n still too complex for a jumbo
> dataset?
>
> ...
>
> It runs pretty fast for me with 10000 units and 20 periods.
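
I guess the reduction works by sorting the (unit, period) pairs and
then scanning adjacent entries, so the all-pairs comparison becomes an
O(n log n) sort plus an O(n) scan. A sketch of what I imagine, with
guessed names and not necessarily what's in git:

#include <stdlib.h>

typedef struct {
    int unit;
    int period;
} obs_pair;

/* lexicographic order: by unit, then by period */
static int cmp_pair (const void *a, const void *b)
{
    const obs_pair *pa = a, *pb = b;

    if (pa->unit != pb->unit) {
        return (pa->unit > pb->unit) - (pa->unit < pb->unit);
    }
    return (pa->period > pb->period) - (pa->period < pb->period);
}

/* return 1 if some (unit, period) pair occurs more than once:
   O(n log n) for the sort, O(n) for the neighbor scan */
int has_duplicate_pairs (const int *unit, const int *period, int n)
{
    obs_pair *p = malloc(n * sizeof *p);
    int i, dup = 0;

    for (i = 0; i < n; i++) {
        p[i].unit = unit[i];
        p[i].period = period[i];
    }
    qsort(p, n, sizeof *p, cmp_pair);
    for (i = 1; i < n && !dup; i++) {
        dup = (p[i].unit == p[i-1].unit &&
               p[i].period == p[i-1].period);
    }
    free(p);
    return dup;
}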
So if I understand the background correctly, we have n = 200K in that
test, and apart from some constant factor we're comparing 40bn (4e10)
to 2.4m (2.4e6). That sounds like a sufficient speedup factor to me!
My biggest dataset right now has something like 6000 units and close
to 900 periods, so almost n = 5.4m. There the comparison is 2.9e13 vs.
8.4e7, even more radical.
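
(In case anyone wants to verify my arithmetic: those figures come out
right with the natural log, e.g.:)

#include <stdio.h>
#include <math.h>

int main (void)
{
    double n1 = 2.0e5;  /* your test: 10000 units x 20 periods */
    double n2 = 5.4e6;  /* my dataset: ~6000 units x ~900 periods */

    printf("n^2:      %.1e vs %.1e\n", n1 * n1, n2 * n2);
    printf("n*log(n): %.1e vs %.1e\n", n1 * log(n1), n2 * log(n2));
    return 0;
}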
BTW, all that dataset shrinking didn't really pay off in the end. I
guess this is due to gretl's requirement of a "nominally balanced"
panel: gretl basically reinstates most of the dropped observations as
missings, because they are needed to fill up the rectangular grid. On
top of that, using panel index variables means the values of those
series have to be stored in any case.
cheers
sven