New subject: cannot import unbalanced panel data

Friday, 13 June 2014

John Paravantis recently posed an interesting data-parsing 
challenge. It came to me as list-owner and not to the list because 
it had as attachment a very big dataset. I'll quote (most of) John's 
query below and append my analysis.

John says:

...
 I successfully imported a very big data file (almost 244 thousand 
 observations, attached in zipped text) in gretl BUT cannot make 
 gretl accept its unbalanced panel structure so that I may run 
 AR(1) regression or use other time series techniques. [...]

 The data set contains half-hourly temperature measurements 
 (December 2012 to April 2013) and accompanying socioeconomic 
 variables of 50 homes in Athens, Greece. Not all homes contain the 
 same number of temperature measurements though, some started in 
 December, others in February.

 So the data set structure resembles unbalanced panel data (I 
 think), where the time index variable is an integer counting the 
 number of half-hour periods (variable DAYS) and the panel group 
 (variable HOUSENUM) simply holding the house number. 
I'm not quite sure what DAYS measures (it ranges from -5 to 118) but 
it clearly is _not_ an index of the half-hour periods, of which 
there are several thousand.

...
 When I try to set the dataset structure in gretl though by 
 selecting "panel" and then "use index variables" with variable 
 HOUSENUM as the unit or group index variable and DATE as the time 
 index variable, I get the message "The selected index variables do 
 not represent a panel structure." 
As elaborated below, DATE is a string variable identifying the 
calendar day.

...
 I also tried to use the DAYS variable as a time index but this one 
 does not even appear in the list of variables available as a time 
 index, don't know why not.

 What am I doing wrong?

 Is it OK to just have gretl accept the data set as a very long 
 time series with 243814 entries (even if it is not) just to get 
 access to the time series submenu items in the model menu? 
To answer the last question first: No, it is not at all OK to 
proceed as if a panel dataset were a continuous time series; any
results would be meaningless statistical garbage, as I'm sure John 
knows.

Consider the symptom John mentions: gretl says that the index 
variables he selected "do not represent a panel structure". That's 
quite right. The "unit" variable must constitute a unique ID for (in 
the case) the household, and he's OK in that respect, he has 
HOUSENUM. But in addition the "time" variable must constitute a 
unique identifier for the time period, and he's not OK in that 
respect: there's no variable in the dataset that fulfils that role.

There's a bunch of time-related variables -- DAYS, DATE, MONTH, DAY, 
TIME, HOUR, etc., too many in fact! -- but no unique period ID for 
the half-hourly frequency. The variables that look most promising 
for forming such a series are DATE and TIME. These are both string 
variables, with formats

DATE: "DD-MM-YYYY"
TIME: "hh:mm"

I'm adding below a script that creates a period ID using these 
variables. Perhaps it ought to be possible to do this more easily -- 
and maybe in fact it is (Jack?) but anyway it's something of a 
showcase for gretl's current string-handling capabilities. The 
strategy is to write out these two variables as .txt files, read 
them back in as strings, and parse them to create a series that (a) 
has distinct values for every half hour and (b) is actually 
increasing in the true time dimension.

<hansl>
open <John's data file>

# we need to exceed the default max number of
# "while" iterations below
set loop_maxiter 0

# write DATE (string: DD-MM-YYY) out to text
# with no added "obs" column or header
store dates.txt DATE --omit-obs --no-header
# and read it back as a big string
string s = readfile("dates.txt")
string line
scalar y, m, d
series ymd
# Loop across the dates.txt lines, parsing out day,
# month and year and reassembling as a proper sequence.
# We scan line + 1 to skip the leading quote
scalar i = 1
loop while getline(s, line) --quiet
   sscanf(line + 1, "%d-%d-%d", d, m, y)
   ymd[i] = y*10000 + m*100 + d
   i++
endloop

# now write out TIME (string: hh:mm) as text
store time.txt TIME --omit-obs --no-header
# and read back in as a string
s = readfile("time.txt")
scalar h
series hm
# loop across the time.txt lines, parsing out hour,
# minute and reassembling as a numerical sequence
i = 1
loop while getline(s, line) --quiet
   sscanf(line + 1, "%d:%d", h, m)
   hm[i] = h*100 + m
   i++
endloop

# now we're ready to construct a proper "time" index
# for these half-hourly data
series realtime = 10000*ymd + hm
# take a look and check
print DATE ymd TIME hm realtime --byobs
# use "realtime" as the panel time variable: OK!
setobs HOUSENUM realtime --panel-vars
</hansl>

Allin Cottrell

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

Re: [Gretl-users] cannot import unbalanced panel data