On Fri, 14 Jul 2023, Sven Schreiber wrote:
Am 14.07.2023 um 17:34 schrieb Cottrell, Allin:
> On Fri, Jul 14, 2023 at 8:59 AM Sven Schreiber
>
> <sven.schreiber(a)fu-berlin.de> wrote:
>
>> I'm seeing a problem handling a fairly large panel dataset in the GUI.
>> It started with a ~200MB binary gdtb file that was a recognized panel,
>> with index variables "iso" (time, iso dates) and "STATIONS_ID".
>>
>> This had lots of unneeded obs with mostly missings, so I saved a reduced
>> version of the dataset (after "smpl --no-all-missing <somelist>"),
>> yielding a file with about 100MB. As expected (?), the official panel
>> structure was lost there, so I wanted to re-instate it, using the nice
>> GUI dataset structure tool. Basically, this ended up in a non-responsive
>> gretl which had to be killed by the OS. (I waited several minutes before
>> doing that.)
>
> The GUI steps for doing this are:
>
> 1. Select "panel" as the target structure.
> 2. Select "Use index variables" as the organization.
> 3. Select Unit and Time variables from the candidates offered.
> 4. Specify the panel time dimension.
> 5. Confirm the selections.
>
> I tried mocking up a fairly hefty panel (though no doubt a good deal
> smaller than yours) and I got a pause of several seconds between steps
> 3 and 4. Is that the point at which gretl became unresponsive in your
> case?
Yes!
Good, so we're on the same page.
> There's a potentially expensive check going on at that stage; I'm
> not sure it's really needed but if so it can probably be made
> more efficient.
So how much longer should I wait to see the effect? The CPU load
seemed to amount to one full (hyper) thread, 1 of 12. Again, I'm
puzzled about the fact that in console-mode this doesn't take much
time, or so it seems.
I'm not 100% sure of the following diagnosis, but pretty close.
* In the GUI, we'd like to be sure that step 3 (see above) is fully
correct before proceeding to step 4 -- that is, flag an error as
soon as we can detect it, before asking the user for more input. So
immediately following step 3 we run a check on the validity of the
index variables, which will actually be repeated when we call the
relevant internal function to complete the transaction (the same
function that's called by the "setobs" command). By itself this
would account for only a modest slow-down relative to CLI mode: 2
times a short time is still a pretty short time.
* But also in the GUI we'd like to produce the most explicit
possible error message, so the (duplicated) check involves an extra,
expensive step. We check explicitly for the pathology whereby for
one or more of the (unit, period) pairs there are two or more
observations in the dataset. Errors of this sort will be caught by
"setobs", but in the GUI we go an extra step so we can provide
details: for example, "The combination of unit 12 and period 5
occurs at both observation 3445 and observation 7981".
As things stand, the mechanism for doing this is of roughly
quadratic complexity in the total number of observations.
Possibilities are: skip the extra code for N*T > some sane limit,
or, if possible, find a much more efficient implementation.
Allin