Sven has raised the question of the handling of daily data in gretl;
see the threads starting from
http://lists.wfu.edu/pipermail/gretl-users/2014-May/010037.html
I'm glad of that: it's time we clarified what we do now, and what we
should do in future. (But please note, I'm mostly talking here about
5-day financial-market data; other sorts of daily data might require
different handling.)
Sorry, this is long, but I'd encourage those who work with daily
data to read on...
First a minor point in relation to Sven's example: I think the
Bundesbank is in fact unusual in including blank weekends in
business-day data files. At least, that's not the practice of the
Federal Reserve, the Bank of England, the Banque de France, the
Banca d'Italia, the Sveriges Riksbank... (at which point I got tired
of googling).
Anyway, it's (now) easy enough to strip out weekends, which leaves
the more interesting question of how to deal with holidays.
I think it's fair to say:
(a) most econometricians who wish to apply time-series methods to
daily financial market data will, most of the time, want to ignore
holidays as well as weekends, treating the data as if these days did
not exist and the actual trading days formed a continuous series,
but
(b) for some purposes it may be important to be able to recover
information on (e.g.) which days were Mondays or which days followed
holidays.
How are these needs best supported by econometric software? I can
see two possibilities:
(1) The storage for 5-day data includes rows for all Mondays to
Fridays (or even all days as per the Bundesbank) -- hence satisfying
point (b) automatically -- and the software provides a mechanism for
skipping non-trading days on demand when estimating models.
(2) The data storage includes only actual trading days -- hence
satisfying point (a) automatically -- but with a record of their
calendar dates, and the software provides means of retrieving the
information under point (b) on demand.
Currently gretl includes a half-hearted gesture towards approach (1)
but de facto mostly relies on approach (2). Let me explain.
When we first introduced support for daily data I initially assumed
that we'd want to store 5-day data including rows for all relevant
days, with NAs for holidays. So in view of point (a) above I put in
place a mechanism for skipping NAs in daily data when doing OLS. But
this never got properly documented, and it was never extended to
other estimators.
What happened? Well, as we started adding examples of daily data to
the gretl package it became apparent that approach (2) is quite
common in practice. See for example the "djclose" dataset from Stock
and Watson and the Bollerslev-Ghysels exchange-rate returns series
(b-g.gdt). Both of these have non-trading days squeezed out of them;
let's call this "compressed" daily data.
The Bollerslev-Ghysels dataset is not the best example, as the
authors did not record the actual dates of the included
observations, only the starting and ending dates. But djclose will
serve as a test case: although it excludes non-trading days, the
date of each observation is recorded in its "marker" string, and
it's straightforward to retrieve all the information one might want
via gretl's calendrical functions, as illustrated below.
<hansl>
/* analysis of compressed 5-day data */
open djclose.gdt
# get day of week and "epoch day" number
series wd = weekday($obsmajor, $obsminor, $obsmicro)
series ed = epochday($obsmajor, $obsminor, $obsmicro)
# maybe we want a dummy for Mondays?
series monday = wd == 1
# find the "delta days" between observations
series delta = diff(ed)
# the "standard" delta days in absence of holidays:
# three for Mondays, otherwise one
series std_delta = wd == 1 ? 3 : 1
# create a dummy for days following holidays
series posthol = delta > std_delta
# take a look...
print wd monday delta posthol --byobs
</hansl>
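Such dummies are then ready for use in estimation. As a minimal
sketch (the specification here is purely illustrative):
<hansl fragment="true">
# illustrative only: do Mondays, or days following holidays,
# show different mean returns?
series ret = ldiff(djclose)
ols ret const monday posthol
</hansl>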
Here's a proposal for regularizing our handling of daily data. In
brief, it's this: scrap our gesture towards what I called approach
(1) above, and beef up our support for approach (2).
Why get rid of the mechanism for automatically skipping NAs in daily
data for OLS? Because it's anomalous that it works only for OLS; it
would be a lot of work to extend the mechanism to all estimators;
and anyway it probably should not be automatic: ignoring NAs that
are present in the dataset should require some deliberate user
intervention.
By beefing up approach (2) I mean providing easy means of converting
between "uncompressed" and "compressed" daily data. We already
support both variants, but (a) given an uncompressed daily sequence
it should be easy for the user to squeeze out NAs if she thinks
that's appropriate for estimation purposes, and (b) it might be
useful in some contexts to be able to reconstitute the full calendar
sequence from a compressed dataset such as djclose.
Such conversion is possible via "low-level" hansl, but not
convenient. I've therefore added the following two things in
CVS/snapshots:
(1) If you apply an "smpl" restriction to a daily dataset, we try
to reconstitute a usable daily time series. If the restricted sample
has gaps, we record the specific dates of the included observations.
At present this is subject to two conditions, which are open to
discussion.
(i) Define the "delta" of a given daily observation as the epoch day
(= 1 for the first of January in 1 AD) of that observation minus the
epoch day of the previous one. So, for example, in the case of
complete 7-day data the delta will always be 1. With complete 5-day
data the delta will be 3 for Mondays and 1 for Tuesdays through
Fridays. The first condition on converting from "full" data to
something like djclose.gdt (dated daily data with gaps) is that the
maximum daily delta is less than 10.
(ii) The "smpl" restriction in question may involve throwing away
"empty" weekends; this will lose about 2/7 of the observations and
preserve about 5/7. Allowing for this, we then require that the
number of observations in the sub-sample is at least 90 percent of
the maximum possible. Or in other words we're allowing up to 10
percent loss of observations due to holidays. That's generous --
perhaps too generous?
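Rendered in rough hansl -- reusing the "ed" and "delta" series from
the djclose example above -- the two checks amount to something like
this (the 5/7 factor being the weekday share of calendar days in the
5-day case):
<hansl fragment="true">
# sketch of the two conditions for accepting a gappy daily
# sub-sample as a time series (5-day case)
scalar cond1 = max(delta) < 10              # condition (i)
scalar span = max(ed) - min(ed) + 1         # calendar span in days
scalar cond2 = $nobs >= 0.9 * (5/7) * span  # condition (ii)
printf "condition (i): %g, condition (ii): %g\n", cond1, cond2
</hansl>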
(The point of these restrictions is to avoid "pretending" that a
seriously gappy daily sequence -- much gappier than could be
accounted for by trading holidays -- can be treated as if it were a
continuous time series for econometric purposes.)
(2) Second thing added: a new trope for the "dataset" command,
namely
dataset pad-daily <days-in-week>
This will pad out a dataset such as djclose, adding in NAs for
holidays and (if the days-in-week parameter is 7) for weekends too.
I'm not sure if this second thing is worth keeping and documenting,
but for now it permits a test of the whole apparatus by
round-tripping. Here's an example, supposing we're starting from
data on a complete 7-day calendar, but with empty weekends and
all-NA rows for holidays (as in Sven's Bundesbank data):
<hansl>
open <seven-day-data>
outfile orig.txt --write
print --byobs
outfile --close
smpl --no-missing --permanent
outfile compressed.txt --write
print --byobs
outfile --close
dataset pad-daily 7
outfile reconstructed.txt --write
print --byobs
outfile --close
# note: $(...) captures the output of a shell command, which
# requires that shell access be enabled in gretl
string diffstr = $(diff orig.txt reconstructed.txt)
printf "diffstr = '%s'\n", diffstr
</hansl>
So if the round trip is successful, diffstr should be empty. Ah, but
with Sven's data it's not quite empty. What's the problem? It's with
the logic of --no-missing, which excludes all rows on which there's
at least one NA. What we really want, to skip holidays, is to
exclude all and only those rows on which all of our daily variables
are NA. That's feasible via raw hansl, but not so convenient. So one
more modification to "smpl" in CVS: add an option --no-all-missing
(the name may be debatable). Substitute --no-all-missing for
--no-missing in the script above and the difference between orig.txt
and reconstructed.txt really is null.
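For the record, the "raw hansl" version of that restriction looks
roughly like this (a sketch: x1 and x2 are placeholders for the
actual daily series):
<hansl fragment="true">
# drop only those rows on which all the daily series are NA
# (x1 and x2 stand in for the actual series names)
smpl !(missing(x1) && missing(x2)) --restrict
</hansl>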
If you don't have a handy Bundesbank-style data file (though it's
not hard to fake one), here's another round-trip test, in the other
direction: we pad out djclose then shrink it again.
<hansl>
open djclose.gdt -q
outfile orig.txt --write
print --byobs
outfile --close
dataset pad-daily 5
outfile padded.txt --write
print --byobs
outfile --close
smpl --no-all-missing --permanent
outfile reconstructed.txt --write
print --byobs
outfile --close
string diffstr = $(diff orig.txt reconstructed.txt)
printf "diffstr = '%s'\n", diffstr
</hansl>
The use of the --permanent option in the round-trip scripts is just
to ensure that all vestiges of the original data are destroyed
before the reconstruction takes place. In "normal usage" one could
just do
<hansl fragment="true">
open <seven-day-data>
smpl --no-all-missing
</hansl>
then carry out econometric analysis without tripping over NAs.
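For example (a sketch, with "y" standing in for whichever daily
series is of interest):
<hansl fragment="true">
# e.g., an AR(1) in log-differences, with the trading-day
# calendar treated as continuous ("y" is a placeholder)
series ret = ldiff(y)
ols ret const ret(-1)
</hansl>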
Allin