panel data issues
by Sven Schreiber
Hello all panel-interested people,
while using gretl for teaching with panel data (which I hadn't done much
before) I noticed the following, let's say, interface nuisances compared
to the usual luxury gretl offers for time series:
1: The sample and/or range in the main window (bottom) are given as pure
index numbers, even if "panel data marker strings" (cf. user guide p.23)
are defined. At least for the time dimension it would be useful to show
the sample periods in a human-readable form (through the markers). Also,
I noticed that the period numbers shown do not always coincide with the
values of the "time" index variable, if subsampling is in effect. (Seen
in the CEL.gdt dataset after applying the sample restriction year>1970
for example.)
1b: A slightly more general suggestion, also for non-panel data: The
active sample restriction criterion could be shown next to the resulting
active sample in the main window. (At least for simple restrictions,
maybe not for complex, multiple ones.)
2: Menu Sample -> Set range: Only the group range can be chosen, not the
periods. Actually, given the often arbitrary ordering of groups, this is
really the less useful dimension to choose a contiguous range from. (I
know I can use "set sample based on criterion" for periods, but that's
not the point.)
3: About pshrink(): A version that returns a full panel series (with
repeated values like pmean() etc.) could be useful -- practical example:
in growth regressions one needs the initial value of output-per-worker
as a regressor. Also maybe it should be called "pfirst()" or something
instead.
4: Time-constant variables: I'm not sure how to create variables that
vary only along the cross-section, as is done by the built-in pmean()
etc. functions. Nor how to append them (like "adding a time series" in
the user guide, p. 114, but along the other panel dimension).
5: Constant in a fixed-effects regression: I don't understand what gretl
reports as the global constant term in a fixed-effects model, and it
doesn't seem to be defined in the guide. It's also confusing that gretl
complains if one wants to discard the constant in the specification
dialog (when fixed effects are selected). (Obviously gretl estimates
the right thing, as a comparison with an explicit LSDV regression
shows; it's just the constant that is mysterious -- even if it is the
average of the fixed effects, it's not clear where the standard
errors come from.)
6: Lags not showing in model spec dialog when sample is restricted to a
single period: If I restrict the CEL.gdt data with year==1985, I cannot
include any previously created lags (of y for example) in the
regression, because they don't show up in the variable selector. Because
the subsampled dataset is now treated as "undated", there's also no
"lags..." button in the dialog. -- Actually I don't understand why gretl
"temporarily forgets" the panel structure of the dataset when a single
period is active. It would seem less problematic to treat even a T=1
sample as a special case of panel data if the underlying dataset has a
panel structure; especially in conjunction with point 1 above about
showing the selected periods in the sample.
Ok, that was a long post, sorry, but still necessary I think.
Cheers,
Sven
our handling of daily data
by Allin Cottrell
Sven has raised the question of the handling of daily data in gretl;
see the threads starting from
http://lists.wfu.edu/pipermail/gretl-users/2014-May/010037.html
I'm glad of that: it's time we clarified what we do now, and what we
should do in future. (But please note, I'm mostly talking here about
5-day financial-market data; other sorts of daily data might require
different handling.)
Sorry, this is long, but I'd encourage those who work with daily
data to read on...
First a minor point in relation to Sven's example: I think the
Bundesbank is in fact unusual in including blank weekends in
business-day data files. At least, that's not the practice of the
Federal Reserve, the Bank of England, the Banque de France, the
Banca d'Italia, the Sveriges Riksbank... (at which point I got tired
of googling).
Anyway, it's (now) easy enough to strip out weekends, which leaves
the more interesting question of how to deal with holidays.
I think it's fair to say:
(a) most econometricians who wish to apply time-series methods to
daily financial market data will, most of the time, want to ignore
holidays as well as weekends, treating the data as if these days did
not exist and the actual trading days formed a continuous series,
but
(b) for some purposes it may be important to be able to recover
information on (e.g.) which days were Mondays or which days followed
holidays.
How are these needs best supported by econometric software? I can
see two possibilities:
(1) The storage for 5-day data includes rows for all Mondays to
Fridays (or even all days as per the Bundesbank) -- hence satisfying
point (b) automatically -- and the software provides a mechanism for
skipping non-trading days on demand when estimating models.
(2) The data storage includes only actual trading days -- hence
satisfying point (a) automatically -- but with a record of their
calendar dates, and the software provides means of retrieving the
information under point (b) on demand.
Currently gretl includes a half-hearted gesture towards approach (1)
but de facto mostly relies on approach (2). Let me explain.
When we first introduced support for daily data I initially assumed
that we'd want to store 5-day data including rows for all relevant
days, with NAs for holidays. So in view of point (a) above I put in
place a mechanism for skipping NAs in daily data when doing OLS. But
this never got properly documented, and it was never extended to
other estimators.
What happened? Well, as we started adding examples of daily data to
the gretl package it became apparent that approach (2) is quite
common in practice. See for example the "djclose" dataset from Stock
and Watson and the Bollerslev-Ghysels exchange-rate returns series
(b-g.gdt). Both of these have non-trading days squeezed out of them;
let's call this "compressed" daily data.
The Bollerslev-Ghysels dataset is not the best example, as the
authors did not record the actual dates of the included
observations, only the starting and ending dates. But djclose will
serve as a test case: although it excludes non-trading days the date
of each observation is recorded in its "marker" string and it's
straightforward to retrieve all the information one might want via
gretl's calendrical functions, as illustrated below.
<hansl>
/* analysis of compressed 5-day data */
open djclose.gdt
# get day of week and "epoch day" number
series wd = weekday($obsmajor, $obsminor, $obsmicro)
series ed = epochday($obsmajor, $obsminor, $obsmicro)
# maybe we want a dummy for Mondays?
series monday = wd == 1
# find the "delta days" between observations
series delta = diff(ed)
# the "standard" delta days in absence of holidays:
# three for Mondays, otherwise one
series std_delta = wd == 1 ? 3 : 1
# create a dummy for days following holidays
series posthol = delta > std_delta
# take a look...
print wd monday delta posthol --byobs
</hansl>
Here's a proposal for regularizing our handling of daily data. In
brief, it's this: scrap our gesture towards what I called approach
(1) above, and beef up our support for approach (2).
Why get rid of the mechanism for automatically skipping NAs in daily
data for OLS? Because it's anomalous that it only works for OLS, it
would be a lot of work to provide this mechanism for all estimators,
and anyway it probably should not be automatic: ignoring NAs when
they're present in the dataset should require some user
intervention.
By beefing up approach (2) I mean providing easy means of converting
between "uncompressed" and "compressed" daily data. We already
support both variants, but (a) given an uncompressed daily sequence
it should be easy for the user to squeeze out NAs if she thinks
that's appropriate for estimation purposes, and (b) it might be
useful in some contexts to be able to reconstitute the full calendar
sequence from a compressed dataset such as djclose.
Such conversion is possible via "low-level" hansl, but not
convenient. I've therefore added the following two things in
CVS/snapshots:
(1) If you apply an "smpl" restriction to a daily dataset, we try to
reconstitute a usable daily time series. If it has gaps, we record
the specific dates of the included observations. At present this is
subject to two conditions, which are open to discussion.
(i) Define the "delta" of a given daily observation as the epoch day
(= 1 for the first of January in 1 AD) of that observation minus the
epoch day of the previous one. So, for example, in the case of
complete 7-day data the delta will always be 1. With complete 5-day
data the delta will be 3 for Mondays and 1 for Tuesdays through
Fridays. The first condition on converting from "full" data to
something like djclose.gdt (dated daily data with gaps) is that the
maximum daily delta is less than 10.
(ii) The "smpl" restriction in question may involve throwing away
"empty" weekends; this will lose about 2/7 of the observations and
preserve about 5/7. Allowing for this, we then require that the
number of observations in the sub-sample is at least 90 percent of
the maximum possible. Or in other words we're allowing up to 10
percent loss of observations due to holidays. That's generous --
perhaps too generous?
(The point of these restrictions is to avoid "pretending" that a
seriously gappy daily sequence -- much gappier than could be
accounted for by trading holidays -- can be treated as if it were a
continuous time series for econometric purposes.)
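(For concreteness, here's a rough way of checking these two quantities
"by hand" on compressed data such as djclose; the 5/7 factor is only
an approximation to the maximum possible number of business days over
the calendar span.)
<hansl>
# rough "by hand" check of conditions (i) and (ii) on compressed data
open djclose.gdt -q
series ed = epochday($obsmajor, $obsminor, $obsmicro)
series delta = diff(ed)
printf "maximum daily delta = %g\n", max(delta)
scalar span = ed[$nobs] - ed[1] + 1   # calendar days covered
scalar maxobs = 5/7 * span            # approximate 5-day maximum
printf "share of maximum possible obs: %.3f\n", $nobs / maxobs
</hansl>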
(2) Second thing added: a new trope for the "dataset" command,
namely
dataset pad-daily <days-in-week>
This will pad out a dataset such as djclose, adding in NAs for
holidays and (if the days-in-week parameter is 7) for weekends too.
I'm not sure if this second thing is worth keeping and documenting,
but for now it permits a test of the whole apparatus by
round-tripping. Here's an example, supposing we're starting from
data on a complete 7-day calendar, but with empty weekends and
all-NA rows for holidays (as in Sven's Bundesbank data):
<hansl>
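# round trip: original 7-day data -> compressed -> padded back to 7 days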
open <seven-day-data>
outfile orig.txt --write
print --byobs
outfile --close
smpl --no-missing --permanent
outfile compressed.txt --write
print --byobs
outfile --close
dataset pad-daily 7
outfile reconstructed.txt --write
print --byobs
outfile --close
string diffstr = $(diff orig.txt reconstructed.txt)
printf "diffstr = '%s'\n", diffstr
</hansl>
So if the round trip is successful, diffstr should be empty. Ah, but
with Sven's data it's not quite empty. What's the problem? It's with
the logic of --no-missing, which excludes all rows on which there's
at least one NA. What we really want, to skip holidays, is to
exclude all and only those rows on which all of our daily variables
are NA. That's feasible via raw hansl, but not so convenient. So one
more modification to "smpl" in CVS: add an option --no-all-missing
(the name may be debatable). Substitute --no-all-missing for
--no-missing in the script above and the difference between orig.txt
and reconstructed.txt really is null.
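(For the record, the "raw hansl" version of that filter would be
something like the following; the series names x1 to x3 are just
placeholders, and with many series one would build the list
programmatically.)
<hansl fragment="true">
# drop only those rows on which ALL of the listed series are missing
list ALL = x1 x2 x3
series n_ok = 0
loop foreach i ALL
    n_ok += ok($i)
endloop
smpl n_ok > 0 --restrict
</hansl>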
If you don't have a handy Bundesbank-style data file (though it's
not hard to fake one), here's another round-trip test, in the other
direction: we pad out djclose then shrink it again.
<hansl>
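# reverse round trip: compressed djclose -> padded to 5-day -> re-compressed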
open djclose.gdt -q
outfile orig.txt --write
print --byobs
outfile --close
dataset pad-daily 5
outfile padded.txt --write
print --byobs
outfile --close
smpl --no-all-missing --permanent
outfile reconstructed.txt --write
print --byobs
outfile --close
string diffstr = $(diff orig.txt reconstructed.txt)
printf "diffstr = '%s'\n", diffstr
</hansl>
The use of the --permanent option in the round-trip scripts is just
to ensure that all vestiges of the original data are destroyed
before the reconstruction takes place. In "normal usage" one could
just do
<hansl fragment="true">
open <seven-day-data>
smpl --no-all-missing
</hansl>
then carry out econometric analysis without tripping over NAs.
Allin
long and gappy time series
by Sven Schreiber
Hello to all the data wizards out there,
today I hit the limit in the GUI whereby the earliest year that can be
set is 1500. But I was looking at the really historic time series from
here: http://www.ggdc.net/maddison/maddison-project/orihome.htm, which
actually starts at 1 A.D. It worked OK via script, but I think it
should also work via the dialog window.
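(For the record, the script route boils down to something like the
following -- the number of observations here is just illustrative:)
<hansl>
# annual dataset starting in 1 A.D. (the GUI spinner stops at 1500)
nulldata 2014
setobs 1 1 --time-series
</hansl>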
Now let's see if I manage to load that gappy data into the workfile...
no, there are problems, and I think some of them are bugs. (This is
1.9.90 on Win7.)
When I start with an empty annual dataset from 1 to 2100 and try to
append the Maddison data from an Excel worksheet (where I have named
the year column "date"), the rows/years are not properly matched
against the inner years ("inner" in the sense of 'join'). That's
because of the (huge) gaps in the source file. Strangely, when I use
"obs" instead of "date", gretl instead says that I must not use this
as a variable name.
I also have to rename many, many variables in the xls file before
gretl accepts them, which is really not the optimal way to handle
this because it's very time-consuming and dull; there should be some
automagic "mangling" of the names by gretl, perhaps accompanied by a
warning message, or the whole mangling could be a user-configurable
option.
Then I tried to treat the whole thing as a (country) panel structure --
but I'm noticing (for the first time, although it must have been there
for ages) that when I choose "new dataset" from the menu, the dialog
forces on me the detour of specifying the overall number of obs
(anybody got a calculator ready?), and only afterwards can I impose
the panel structure. Suggestion: why not have radio buttons for
cross-section/time-series/panel in that dialog, and in the panel case
let the user enter the numbers for both dimensions right away (plus
the periodicity for time series and panels as well)?
Another suggestion: why not allow the use of a time index variable for
time series the same way that index variables are allowed for panels?
I haven't succeeded with the import so far; the only solution I can
think of right now is to add hundreds of empty rows to the source
file to remove the gaps.
Hm.
cheers,
sven
weekdays (was Re: [Gretl-users] problems with daily data)
by Allin Cottrell
On Wed, 21 May 2014, Riccardo (Jack) Lucchetti wrote:
[Re. Sven's wish to convert 7-day daily data, with nothing but NAs
for Saturdays and Sundays, into 5-day data]
> This should do what [Sven wanted]: not the most elegant approach,
> but IMO quite clear and general. [...]
>
> <hansl>
> nulldata 28
> setobs 7 2014-04-01
> x = normal()
> print x -o
>
> /*
> trash weekends
> */
>
> # first, construct a "weekend" dummy series
>
> scalar y1 = $obsmajor[1]
> scalar m1 = $obsminor[1]
> scalar d1 = $obsmicro[1]
> scalar wd1 = weekday(y1, m1, d1)
> series wd = time + wd1 - 1
> series we = (wd%7)==6 || (wd%7)==0
>
> # clear periodicity
>
> setobs 1 1
> smpl we==0 --restrict
> </hansl>
[... and then it's pretty simple]
Nicely done! But I note that it's a bit of a struggle since (up till
now) the weekday() function has only accepted scalar arguments. In
today's CVS I've "upgraded" this via our usual overloading approach:
you can now use series instead of scalars with weekday(). The
relevant portion of the above could then read:
<hansl fragment="true">
# construct a "weekend" dummy series
series wday = weekday($obsmajor, $obsminor, $obsmicro)
series weekend = wday == 6 || wday == 0
</hansl>
As a general comment, I'd say it's pretty uncommon to have to do
this sort of thing: almost all 5-day daily data does not include
weekends stuffed with NAs. So while gretl should be able to deal
with it, I don't think we have to go to great lengths to make it a
unitary ("one stop shopping") operation.
Allin
genr (was Re: [Gretl-users] problems with daily data)
by Allin Cottrell
On Wed, 21 May 2014, Sven Schreiber wrote:
> On 21.05.2014, 16:35, Ignacio Diaz-Emparanza wrote:
>> On 21/05/14 16:15, Riccardo (Jack) Lucchetti wrote:
>>> On Wed, 21 May 2014, Sven Schreiber wrote:
>>>
>>>> Agreed (well, maybe deprecated and undocumented would be enough...); but
>>>> there should still be a script way of creating seasonal/periodic
>>>> dummies, and currently there is no alternative, or is there?
>>>
>>> <hansl>
>>> tmp = time % $pd
>>> list DUMS = dummify(tmp)
>>> </hansl>
>>
>> I prefer that the number of each dummy corresponds with the observation:
>>
>> tmp = (time-1)%$pd + 1
>> list DUMS = dummify(tmp)
>>
>
> Aren't you assuming that the workfile/sample actually starts with the
> "right" obs here?
>
> Anyway, thanks for all your suggestions, but what I really meant was a
> function (or command) that mirrors the menu entry like 'genr dummy'
> does, not some clever way to code it...
At present we have 7 "specials" with the form "genr <name>", for <name> =
dummy, timedum, unitdum, time, index, unit, weekday. I haven't checked
rigorously but I don't think "genr weekday" is documented (though there is
a documented weekday() function).
The "genr unit" special is shadowed by the accessor $unit. We could do the
same for "index" if that were thought worthwhile.
As for the ones that add several variables, I don't think a $-accessor
would work well. We _could_ have (e.g.) an accessor $dummy that adds a
bunch of series and returns a list but that would look a bit weird if you
didn't need to assign the list, having "$dummy" by itself on a line of
hansl. Maybe a function named "dummies" (that returns a list) with a
parameter to handle the panel cases (timedum, unitdum).
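To make that concrete, a user-level stand-in might look something like
this (name, signature and panel handling are all up for grabs; this
covers only the plain seasonal case):
<hansl fragment="true">
# hypothetical user-level stand-in for a built-in "dummies" function;
# plain seasonal/periodic case only (no timedum/unitdum handling)
function list dummies (void)
    genr time
    series tmp = (time - 1) % $pd + 1
    list DUMS = dummify(tmp)
    return DUMS
end function
</hansl>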
Allin
Suggestion: set termoption dash
by Logan Kelly
Hello,
I have a suggestion for the GUI interface for time-series plots. Could an option be added to include something like
set termoption dash
in the gnuplot script? It could be a check box in the main tab of the gretl plot control dialog.
Plotting with dashed lines can be done easily in hansl, but I don't think dashed lines are easily accomplished in the GUI, are they?
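(For reference, here's roughly what I mean by the hansl route -- the
literal block in braces passes raw gnuplot commands, and the dataset
and series are just for illustration:)
<hansl>
# dashed-line time-series plot via a literal gnuplot block
open djclose.gdt -q
gnuplot djclose --time-series --with-lines --output=display { set termoption dash; }
</hansl>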
Thanks
Logan
ARMA Interpolation
by GOO Creations
Hi,
I'm not sure if this is possible in gretl. I want to interpolate a gap
of samples with ARMA, using the values to the left and right of the
gap. If I have data like this:
Time lag: 1 2 3 4 5 6 7 8 9 10 11
Values: 0 0.1 0.2 0.3 * * * 0.3 0.2 0.1 0
The values marked with a star are the gap I want to interpolate (at
lags 5, 6 and 7 the values should be something like 0.4, 0.5, 0.4 after
interpolation). How would you go about creating a gretl DATASET with
these values? And will the get_forecast function work on "predicting"
the interpolated samples?
I've always used ARMA for out-of-sample forecasts, but never for
interpolation, so I'm not sure if this is possible.
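(For concreteness, something like the following is what I have in mind
for setting up the dataset, with the gap left as NAs -- though whether
arma/fcast will then fill the gap the way I want is exactly the part
I'm unsure about:)
<hansl>
# build an 11-observation series with NAs at lags 5, 6 and 7
nulldata 11
setobs 1 1 --time-series
series x = NA
matrix vals = {0, 0.1, 0.2, 0.3}
loop i=1..4
    x[i] = vals[i]       # left edge of the gap
    x[12-i] = vals[i]    # mirrored values on the right edge
endloop
print x --byobs
</hansl>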
Regards
Chris
built-in curl() function
by Allin Cottrell
This is a follow-up to earlier discussions of functions to retrieve data
from various servers, such as the BLS.
There's now a built-in curl() function in CVS (which uses the libcurl API
rather than relying on the presence of a curl executable). It's very
similar to the hansl function of the same name that Jack circulated. It's
documented in the Function reference but its details are not set in stone
at this point, so if anyone has comments/suggestions, please fire away.
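For orientation, here's a minimal usage sketch as I understand the
current interface (a bundle with a "URL" string on input, the result
returned under "output", integer error code on return); check the
Function reference for the authoritative details, since as said they
may still change.
<hansl>
# fetch a page via the built-in curl() and report how much came back
bundle req = null
req["URL"] = "http://www.gretl.org/"
scalar err = curl(&req)
if err == 0
    printf "retrieved %d bytes\n", strlen(req["output"])
endif
</hansl>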
Allin