---------- Forwarded message ----------
Date: Sat, 1 Jul 2023 19:39:27 +0000
From: Allin Cottrell <cottrell(a)wfu.edu>
To: "Riccardo (Jack) Lucchetti" <r.lucchetti(a)univpm.it>
Cc: Josué Martínez-Castillo <jota3mc(a)gmail.com>,
Subject: Re: 100-based indices with panel data
On Sat, 1 Jul 2023, Riccardo (Jack) Lucchetti wrote:
On Sat, 1 Jul 2023, Allin Cottrell wrote:
> On Fri, 30 Jun 2023, Josué Martínez-Castillo wrote:
>
>> I'm a newbie in gretl, very excited to learn how to use the program for
>> learning econometrics on my own. However, right now I'm curious on how to
>> estimate 100-based indices when dealing with panel data. For example, what
>> if I want to estimate a 100-based index for each unit using as base year
>> the first year available of, say, real GDP.
>>
>> I was looking for the answer in the manual of the 2023 version of gretl. No
>> success. I was hoping maybe someone can help me with guidance.
>
> Good question. As things stand there isn't a built-in way to construct such
> indices for panel data using the graphical interface. But assuming you want
> the indices to work in the time dimension for each panel unit, it's actually
> not hard to do via scripting. Here's an example:
[...]
Here's another approach, which avoids the loop. The syntax is a bit too
terse, perhaps, but IMO instructive.
<hansl>
open abdata.gdt
base = cum(ok(EMP)) == 1 ? EMP : NA
EMP_b100 = EMP/pexpand({base}) * 100
print EMP EMP_b100 --byobs
</hansl>
Yes, quite instructive! In case anyone's interested, let's unpack Jack's
formulation.
First consider:
base = cum(ok(EMP)) == 1 ? EMP : NA
We're looking at what gretl calls series here.
The inner expression, "ok(EMP)", creates a series with value 1 for valid values
of its series argument and 0 for NAs (missing values).
This addresses a problem with the first variant I posted, where I just took the
base of the indices to be the first observation for each unit. That's OK with
the grunfeld data that I referenced because it has no missing values. But if
the first observation for a unit were NA, the whole index series for that unit
would be NA via my method (since NAs propagate in arithmetical calculation).
Not accidentally, Jack chose the supplied abdata dataset (Arellano and Bond),
which contains missing values, to illustrate his calculation, and I'll work
with it here.
Now cum() is gretl's cumulation function, and it works "properly" for panel
data: it cumulates in the time dimension, starting over for each unit. So
"cum(ok(EMP))" gives a series holding the count of valid values "to date" for
each unit. OK so far?
Then "cum(ok(EMP)) == 1 ? EMP : NA" is an instance of the very handy ternary
operator. It has the form:
result = condition ? one_thing : other_thing
which can be spelled out a bit as
if (condition is true) result is one_thing, otherwise result is other_thing
So, "cum(ok(EMP)) == 1 ? EMP : NA" gives a series holding the value of EMP at
the first valid observation for each panel unit, and NA at all other
observations.
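For anyone who'd like to try the ternary operator in isolation, here's a small
sketch on an artificial dataset (the names "s" and "big" are just made up for
illustration):
<hansl>
nulldata 6
series s = index
# keep s where it exceeds 3, otherwise set the result to NA
series big = s > 3 ? s : NA
print s big --byobs
</hansl>
The condition is evaluated observation by observation, so "big" is NA for the
first three observations and equal to "s" for the rest.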
To see what's happening up to this point one could open the abdata dataset in
gretl and execute these commands (in a script or via the console):
series eok = ok(EMP)
series cumeok = cum(eok)
series base = cum(ok(EMP)) == 1 ? EMP : NA
print EMP eok cumeok base --byobs
Next comes the line:
EMP_b100 = EMP/pexpand({base}) * 100
On the left-hand side is the final index series. On the right-hand side we're
using the original EMP (employment), multiplying by 100 (as per convention),
and dividing by "pexpand({base})". What the heck is this last thing?
Well, notice the curly brackets around "base". These turn a series into a
vector (special case of a matrix) and the thing you need to know here is that
in gretl by default this conversion skips any missing values. [Note: you can
prevent this via the command "set skip_missing off".] So in a panel with N
units {base} will be an N-vector holding just the first valid observation of
EMP for each unit.
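To check this on abdata, something like the following sketch should do. The
names "b", "b_all" and "N" are made up for illustration, and it assumes that
$pd returns the time-series length T for panel data, so that $nobs/$pd is the
number of units when the full sample is in force:
<hansl>
open abdata.gdt
series base = cum(ok(EMP)) == 1 ? EMP : NA
scalar N = $nobs / $pd
# default behavior: NAs are skipped, leaving one row per unit with a valid base
matrix b = {base}
printf "rows(b) = %d, N = %d\n", rows(b), N
# with skip_missing off every observation is retained, NAs included
set skip_missing off
matrix b_all = {base}
printf "rows(b_all) = %d, nobs = %d\n", rows(b_all), $nobs
# restore the default
set skip_missing on
</hansl>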
Then the pexpand ("panel-expand") function turns this N-vector into a series by
repeating each of the N values T times, so that every observation for a given
unit carries that unit's base value. Which is (probably) just what we want to
divide EMP by, to create the per-unit indices. In a panel dataset with no
missing values it's exactly equivalent to the more pedestrian formulations I
posted earlier.
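To see pexpand on its own, here's a toy sketch using an artificial panel with
N = 2 units and T = 3 periods (the dataset and the names "v" and "s" are
invented purely for illustration):
<hansl>
nulldata 6
# declare a panel: 6 observations as stacked time series of length 3
setobs 3 1:1 --stacked-time-series
# one value per unit
matrix v = {10; 20}
series s = pexpand(v)
print s --byobs
</hansl>
Here "s" should hold 10 for the three observations of the first unit and 20
for the three observations of the second.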
Now for a couple of missing-data complications we'd want to deal with in a
built-in version of this functionality.
1) What if some units have NO valid values for the variable we're working with?
Then {base} will not be an N-vector and Jack's method will not work unmodified.
2) What if the date of the first valid observation differs across units, but we
want a set of indices that start in the same period? Again, some fancier
footwork would be needed. We'd need to look for the first period with a common
non-missing observation across all units that had more than one non-missing
observation.
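As for point 1, one possible workaround (a sketch only, and not necessarily
what a built-in version would do) is to replace "pexpand({base})" with
"pmean(base)". Since "base" has at most one valid value per unit and pmean
skips missing values, the per-unit mean is just that unit's base value,
repeated over time. I'm assuming here that pmean gives NA for a unit with no
valid values at all, which is worth verifying. The series "uid", "E2", "base2"
and "E2_b100" are made up for the test, in which EMP is artificially wiped out
for the first unit:
<hansl>
open abdata.gdt
# number of units, assuming $pd holds the time-series length T
scalar N = $nobs / $pd
# unit index 1..N, each value repeated T times
series uid = pexpand(seq(1, N)')
# hypothetical test case: no valid values for unit 1
series E2 = uid == 1 ? NA : EMP
series base2 = cum(ok(E2)) == 1 ? E2 : NA
# {base2} now has fewer than N rows, so pexpand({base2}) would not work
matrix b2 = {base2}
printf "rows(b2) = %d, N = %d\n", rows(b2), N
# spread each unit's single base value over all its periods
series E2_b100 = E2 / pmean(base2) * 100
print E2 E2_b100 --byobs
</hansl>
On the unmodified abdata this should give the same result as Jack's version.
Point 2 is not addressed by this sketch.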
Allin Cottrell