On Fri, 14 Jul 2023, Sven Schreiber wrote:
Am 14.07.2023 um 17:34 schrieb Cottrell, Allin:
> On Fri, Jul 14, 2023 at 8:59 AM Sven Schreiber
>
> <sven.schreiber(a)fu-berlin.de> wrote:
>
>> I'm seeing a problem handling a fairly large panel dataset in the GUI.
>> It started with a ~200MB binary gdtb file that was a recognized panel,
>> with index variables "iso" (time, iso dates) and "STATIONS_ID".
>>
>> This had lots of unneeded obs with mostly missings, so I saved a reduced
>> version of the dataset (after "smpl --no-all-missing <somelist>"),
>> yielding a file with about 100MB. As expected (?), the official panel
>> structure was lost there, so I wanted to re-instate it, using the nice
>> GUI dataset structure tool. Basically, this ended up in a non-responsive
>> gretl which had to be killed by the OS. (I waited several minutes before
>> doing that.)
>
> The GUI steps for doing this are:
>
> 1. Select "panel" as the target structure.
> 2. Select "Use index variables" as the organization.
> 3. Select Unit and Time variables from the candidates offered.
> 4. Specify the panel time dimension.
> 5. Confirm the selections.
>
> I tried mocking up a fairly hefty panel (though no doubt a good deal
> smaller than yours) and I got a pause of several seconds between steps
> 3 and 4. Is that the point at which gretl became unresponsive in your
> case?
Yes!
Good, so we're on the same page.
> There's a potentially expensive check going on at that stage; I'm
> not sure it's really needed but if so it can probably be made
> more efficient.
So how much longer should I wait to see the effect? The CPU load
seemed to amount to one full (hyper) thread, 1 of 12. Again, I'm
puzzled about the fact that in console-mode this doesn't take much
time, or so it seems.
I'm not 100% sure of the following diagnosis, but pretty close.
* In the GUI, we'd like to be sure that step 3 (see above) is fully
correct before proceeding to step 4 -- that is, flag an error as
soon as we can detect it, before asking the user for more input. So
immediately following step 3 we run a check on the validity of the
index variables, which will actually be repeated when we call the
relevant internal function to complete the transaction (the same
function that's called by the "setobs" command). By itself this
would account for only a modest slow-down relative to CLI mode: 2
times a short time is still a pretty short time.
* But also in the GUI we'd like to produce the most explicit
possible error message, so the (duplicated) check involves an extra,
expensive step. We check explicitly for the pathology whereby for
one or more of the (unit, period) pairs there are two or more
observations in the dataset. Errors of this sort will be caught by
"setobs", but in the GUI we go an extra step so we can provide
details: for example, "The combination of unit 12 and period 5
occurs at both observation 3445 and observation 7981".
As things stand, the mechanism for doing this is of roughly
quadratic complexity in the total number of observations.
Possibilities are: skip the extra code for N*T > some sane limit,
or, if possible, find a much more efficient implementation.
Allin