On Fri, 14 Jul 2023, Sven Schreiber wrote:
> Am 14.07.2023 um 19:07 schrieb Allin Cottrell:
>> ...
>>
>> * But also in the GUI we'd like to produce the most explicit possible
>> error message, so the (duplicated) check involves an extra, expensive
>> step. We check explicitly for the pathology whereby for one or more of
>> the (unit, period) pairs there are two or more observations in the
>> dataset. Errors of this sort will be caught by "setobs", but in the GUI
>> we go an extra step so we can provide details: for example, "The
>> combination of unit 12 and period 5 occurs at both observation 3445 and
>> observation 7981".
>
> OK, I see.
>
>> As things stand, the mechanism for doing this is of roughly quadratic
>> complexity in the total number of observations. [...]
>
> In any case, this is something for after the release, I'd say. Another
> option would be to display the progress, so the user knows something's
> happening, and can decide whether to abort. Or, skip the detailed check
> first, and only if there's an error, offer a deeper analysis.
Actually, I think it's worth fixing now, even if only partially.
First, I've come to understand a point that I wasn't clear on
before, namely that the expensive GUI-specific check is only about
more explicit error-reporting; it won't catch any errors beyond
those caught by the check run by "setobs". That's because if there
are any duplicated pairs of values in the unit and period series,
this is bound to show up as (total number of observations) >
(number of distinct units) * (number of distinct periods), a
condition that is checked by setobs, and also checked in the GUI
_prior_ to doing the expensive thing.
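For concreteness, the cheap comparison amounts to something like
the following minimal C sketch (illustrative only, with made-up
function names, not gretl's actual code): count the distinct
values in each index series and compare n with their product.

#include <stdlib.h>
#include <string.h>

static int cmp_int (const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;

    return (x > y) - (x < y);
}

/* number of distinct values in x, of length n */
static int n_distinct (const int *x, int n)
{
    int *tmp = malloc(n * sizeof *tmp);
    int i, m = 0;

    memcpy(tmp, x, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_int);
    for (i = 0; i < n; i++) {
        if (i == 0 || tmp[i] != tmp[i-1]) {
            m++;
        }
    }
    free(tmp);

    return m;
}

/* pigeonhole: n > nu * np means at least one (unit, period)
   pair must be duplicated */
int cheap_check (const int *unit, const int *period, int n)
{
    int nu = n_distinct(unit, n);
    int np = n_distinct(period, n);

    return n > nu * np; /* nonzero => duplicates exist */
}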
So we can immediately make the more expensive check (EC) conditional
on an error exposed by the simpler check. If your index series are
OK we won't call EC. That's done in git.
Second (not surprisingly, doh!), the expensive check can be reduced
from O(n^2) to O(n log n). That too is in git.
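In case a sketch helps, the O(n log n) approach boils down to
sorting the (unit, period, observation) triples and then making a
single pass to look for adjacent duplicates, which also yields the
two observation numbers for the GUI message. Again, this is
illustrative C with made-up names, not the code that's actually in
git.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int unit, period;
    int obs; /* original 1-based observation number */
} rec;

static int cmp_rec (const void *a, const void *b)
{
    const rec *ra = a, *rb = b;

    if (ra->unit != rb->unit) {
        return (ra->unit > rb->unit) - (ra->unit < rb->unit);
    } else {
        return (ra->period > rb->period) - (ra->period < rb->period);
    }
}

/* report the first duplicated (unit, period) pair, if any:
   the sort is O(n log n) and the scan is O(n) */
int find_duplicate (const int *unit, const int *period, int n)
{
    rec *r = malloc(n * sizeof *r);
    int i, found = 0;

    for (i = 0; i < n; i++) {
        r[i].unit = unit[i];
        r[i].period = period[i];
        r[i].obs = i + 1;
    }
    qsort(r, n, sizeof *r, cmp_rec);
    for (i = 1; i < n && !found; i++) {
        if (r[i].unit == r[i-1].unit && r[i].period == r[i-1].period) {
            printf("The combination of unit %d and period %d occurs "
                   "at both observation %d and observation %d\n",
                   r[i].unit, r[i].period, r[i-1].obs, r[i].obs);
            found = 1;
        }
    }
    free(r);

    return found;
}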
The remaining question: Is O(n log n) still too slow for a jumbo
dataset?
You could answer that by inserting a few commands at the point
where you've shrunk the dataset by skipping a lot of missing values
but haven't yet re-imposed a panel interpretation:
# append one artificial observation
dataset addobs 1
# give it a (unit, period) pair that's already present
STATIONS_ID[$nobs] = <x>
iso[$nobs] = <y>
where {x,y} is a pair of {STATIONS_ID, iso} values that already
occur in the sub-sampled dataset. Then you can see if the time taken
in the GUI is still prohibitive. Using current git, that is.
It runs pretty fast for me with 10000 units and 20 periods.
Allin