On Fri, 14 Jul 2023, Sven Schreiber wrote:
> Am 14.07.2023 um 19:07 schrieb Allin Cottrell:
>> ...
>>
>> * But also in the GUI we'd like to produce the most explicit possible
>> error message, so the (duplicated) check involves an extra, expensive
>> step. We check explicitly for the pathology whereby for one or more of
>> the (unit, period) pairs there are two or more observations in the
>> dataset. Errors of this sort will be caught by "setobs", but in the GUI
>> we go an extra step so we can provide details: for example, "The
>> combination of unit 12 and period 5 occurs at both observation 3445 and
>> observation 7981".
>
> OK, I see.
>
>> As things stand, the mechanism for doing this is of roughly quadratic
>> complexity in the total number of observations. [...]
>
> In any case, this is something for after the release, I'd say. Another
> option would be to display the progress, so the user knows something's
> happening, and can decide whether to abort. Or, skip the detailed check
> first, and only if there's an error, offer a deeper analysis.
Actually, I think it's worth fixing now, even if only partially.
First, I've come to understand a point that I wasn't clear on
before, namely that the expensive GUI-specific check is only about
more explicit error-reporting; it won't catch any errors beyond
those caught by the check run by "setobs". That's because if there
are any duplicated pairs of values in the unit and period series,
this is bound to show up as (total number of observations) >
(number of distinct units) * (number of distinct periods), a
condition that is checked by setobs, and also checked in the GUI
_prior_ to doing the expensive thing.
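For concreteness, the cheap comparison amounts to something like
the following minimal C sketch (illustrative only, with made-up
function names, not gretl's actual code): count the distinct
values in each index series and compare n with their product.

#include <stdlib.h>
#include <string.h>

static int cmp_int (const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;

    return (x > y) - (x < y);
}

/* number of distinct values in x, of length n */
static int n_distinct (const int *x, int n)
{
    int *tmp = malloc(n * sizeof *tmp);
    int i, m = 0;

    memcpy(tmp, x, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_int);
    for (i = 0; i < n; i++) {
        if (i == 0 || tmp[i] != tmp[i-1]) {
            m++;
        }
    }
    free(tmp);

    return m;
}

/* pigeonhole: n > nu * np means at least one (unit, period)
   pair must be duplicated */
int cheap_check (const int *unit, const int *period, int n)
{
    int nu = n_distinct(unit, n);
    int np = n_distinct(period, n);

    return n > nu * np; /* nonzero => duplicates exist */
}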
So we can immediately make the more expensive check (EC) conditional
on an error exposed by the simpler check. If your index series are
OK we won't call EC. That's done in git.
Second (not surprisingly, doh!), the expensive check can be reduced
from O(n^2) to O(n log n). That too is in git.
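In case a sketch helps, the O(n log n) approach boils down to
sorting the (unit, period, observation) triples and then making a
single pass to look for adjacent duplicates, which also yields the
two observation numbers for the GUI message. Again, this is
illustrative C with made-up names, not the code that's actually in
git.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int unit, period;
    int obs; /* original 1-based observation number */
} rec;

static int cmp_rec (const void *a, const void *b)
{
    const rec *ra = a, *rb = b;

    if (ra->unit != rb->unit) {
        return (ra->unit > rb->unit) - (ra->unit < rb->unit);
    } else {
        return (ra->period > rb->period) - (ra->period < rb->period);
    }
}

/* report the first duplicated (unit, period) pair, if any:
   the sort is O(n log n) and the scan is O(n) */
int find_duplicate (const int *unit, const int *period, int n)
{
    rec *r = malloc(n * sizeof *r);
    int i, found = 0;

    for (i = 0; i < n; i++) {
        r[i].unit = unit[i];
        r[i].period = period[i];
        r[i].obs = i + 1;
    }
    qsort(r, n, sizeof *r, cmp_rec);
    for (i = 1; i < n && !found; i++) {
        if (r[i].unit == r[i-1].unit && r[i].period == r[i-1].period) {
            printf("The combination of unit %d and period %d occurs "
                   "at both observation %d and observation %d\n",
                   r[i].unit, r[i].period, r[i-1].obs, r[i].obs);
            found = 1;
        }
    }
    free(r);

    return found;
}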
The remaining question: Is O(n log n) still too slow for a jumbo
dataset?
You could answer that by inserting a few commands at the point
where you've shrunk the dataset by skipping a lot of missing values
but haven't yet re-imposed a panel interpretation:
# append one artificial observation
dataset addobs 1
# give it a (unit, period) pair that's already present
STATIONS_ID[$nobs] = <x>
iso[$nobs] = <y>
where {x,y} is a pair of {STATIONS_ID, iso} values that already
occur in the sub-sampled dataset. Then you can see if the time taken
in the GUI is still prohibitive. Using current git, that is.
It runs pretty fast for me with 10000 units and 20 periods.
Allin