John Paravantis recently posed an interesting data-parsing
challenge. It came to me as list-owner and not to the list because
it had as attachment a very big dataset. I'll quote (most of) John's
query below and append my analysis.
John says:
I successfully imported a very big data file (almost 244 thousand
observations, attached in zipped text) in gretl BUT cannot make
gretl accept its unbalanced panel structure so that I may run
AR(1) regression or use other time series techniques. [...]
The data set contains half-hourly temperature measurements
(December 2012 to April 2013) and accompanying socioeconomic
variables of 50 homes in Athens, Greece. Not all homes contain the
same number of temperature measurements though, some started in
December, others in February.
So the data set structure resembles unbalanced panel data (I
think), where the time index variable is an integer counting the
number of half-hour periods (variable DAYS) and the panel group
(variable HOUSENUM) simply holding the house number.
I'm not quite sure what DAYS measures (it ranges from -5 to 118) but
it clearly is _not_ an index of the half-hour periods, of which
there are several thousand.
When I try to set the dataset structure in gretl though by
selecting "panel" and then "use index variables" with variable
HOUSENUM as the unit or group index variable and DATE as the time
index variable, I get the message "The selected index variables do
not represent a panel structure."
As elaborated below, DATE is a string variable identifying the
calendar day.
I also tried to use the DAYS variable as a time index but this one
does not even appear in the list of variables available as a time
index, don't know why not.
What am I doing wrong?
Is it OK to just have gretl accept the data set as a very long
time series with 243814 entries (even if it is not) just to get
access to the time series submenu items in the model menu?
To answer the last question first: No, it is not at all OK to
proceed as if a panel dataset were a continuous time series; any
results would be meaningless statistical garbage, as I'm sure John
knows.
Consider the symptom John mentions: gretl says that the index
variables he selected "do not represent a panel structure". That's
quite right. The "unit" variable must constitute a unique ID for (in
the case) the household, and he's OK in that respect, he has
HOUSENUM. But in addition the "time" variable must constitute a
unique identifier for the time period, and he's not OK in that
respect: there's no variable in the dataset that fulfils that role.
There's a bunch of time-related variables -- DAYS, DATE, MONTH, DAY,
TIME, HOUR, etc., too many in fact! -- but no unique period ID for
the half-hourly frequency. The variables that look most promising
for forming such a series are DATE and TIME. These are both string
variables, with formats
DATE: "DD-MM-YYYY"
TIME: "hh:mm"
I'm adding below a script that creates a period ID using these
variables. Perhaps it ought to be possible to do this more easily --
and maybe in fact it is (Jack?) but anyway it's something of a
showcase for gretl's current string-handling capabilities. The
strategy is to write out these two variables as .txt files, read
them back in as strings, and parse them to create a series that (a)
has distinct values for every half hour and (b) is actually
increasing in the true time dimension.
<hansl>
open <John's data file>
# we need to exceed the default max number of
# "while" iterations below
set loop_maxiter 0
# write DATE (string: DD-MM-YYY) out to text
# with no added "obs" column or header
store dates.txt DATE --omit-obs --no-header
# and read it back as a big string
string s = readfile("dates.txt")
string line
scalar y, m, d
series ymd
# Loop across the dates.txt lines, parsing out day,
# month and year and reassembling as a proper sequence.
# We scan line + 1 to skip the leading quote
scalar i = 1
loop while getline(s, line) --quiet
sscanf(line + 1, "%d-%d-%d", d, m, y)
ymd[i] = y*10000 + m*100 + d
i++
endloop
# now write out TIME (string: hh:mm) as text
store time.txt TIME --omit-obs --no-header
# and read back in as a string
s = readfile("time.txt")
scalar h
series hm
# loop across the time.txt lines, parsing out hour,
# minute and reassembling as a numerical sequence
i = 1
loop while getline(s, line) --quiet
sscanf(line + 1, "%d:%d", h, m)
hm[i] = h*100 + m
i++
endloop
# now we're ready to construct a proper "time" index
# for these half-hourly data
series realtime = 10000*ymd + hm
# take a look and check
print DATE ymd TIME hm realtime --byobs
# use "realtime" as the panel time variable: OK!
setobs HOUSENUM realtime --panel-vars
</hansl>
Allin Cottrell