gretl's CSV reader and categorical data

Monday, 24 July 2017

This is quite long. Impatient readers who are nonetheless disposed 
to help with gretl development: please skip to the "Requests" 
section at the end!

Gretl's CSV reader is, I think, pretty good at this point: it can 
handle most "CSV" data (in a broad sense) that you throw at it, so 
long as the input is not too badly broken. However, I've recently 
discovered that there is a potentially important problem, and I'm 
trying to fix it.

The problem: some time ago we decided to ease the task of parsing 
"CSV" by deleting quotation marks from each line of input. (We can 
and do recognize string-valued input, but only by determining that 
it cannot be parsed as numeric.)  Quotation is sometimes used 
inconsistently and arbitrarily in "CSV" files (in which case we 
don't lose anything by the policy just mentioned), but sometimes 
it's used in a systematic way to indicate columns that present 
categorical data (per R, "factors") even though the values are 
apparently numeric. I've come across such cases in the American 
Housing Survey (AHS), and R's write.csv() also quotes integer 
values to indicate factors.

If you're working with a fairly small, and adequately documented, 
dataset in CSV form this isn't a big problem; it's easy enough to 
figure out which columns are categorical, and treat them accordingly 
(e.g. by using "dummify"). But given a dataset with hundreds of 
variables (e.g. AHS), and maybe not all that well documented, 
figuring out what's categorical and what's not can be a real 
headache.

So, I've been working on a revision of our CSV reader in which we 
"respect" quotation in this sense: we do not delete quotation marks 
in CSV input, and if it turns out that all the values in a given 
column are quoted integers, we take that column to be an encoding of 
a categorical variable. To that end I've introduced a new attribute 
of gretl series, namely "coded". This is set on input from CSV, 
where applicable, and is preserved in write/read of gdt and gdtb 
data files.
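
To make the "all values are quoted integers" rule concrete, here is a minimal Python sketch. It is an illustration only, not gretl's actual reader (which is written in C): the function name, the sample data, and the simplifying assumption that no field contains an embedded comma are all hypothetical.

```python
import re

# A field counts as a "quoted integer" if it is an optionally signed
# run of digits wrapped in double quotes, e.g. "3" or "-12".
QUOTED_INT = re.compile(r'^"[+-]?[0-9]+"$')

def quoted_integer_columns(csv_text):
    """Return names of columns whose data values are all quoted integers."""
    lines = [ln for ln in csv_text.splitlines() if ln.strip()]
    # Naive split: assumes no commas inside quoted fields.
    header = [name.strip().strip('"') for name in lines[0].split(",")]
    rows = [ln.split(",") for ln in lines[1:]]
    coded = []
    for i, name in enumerate(header):
        column = [row[i].strip() for row in rows]
        if column and all(QUOTED_INT.match(v) for v in column):
            coded.append(name)
    return coded

sample = 'id,region,income\n1,"3",41000.5\n2,"1",38250\n3,"2",52900\n'
print(quoted_integer_columns(sample))  # ['region']
```

Here "id" is integer-valued but unquoted, so only "region" would be treated as an encoding.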

At present, the revised CSV reader is invoked if and only if you add 
the option flag --respect-quotes when opening a CSV file (which has 
to be done via console or scripting), as in

open foo.csv --respect-quotes

My goal is that, once this whole thing has stabilized, the option 
is removed and the new behavior becomes the default.

The "coded" attribute is represented in the gretl GUI, under "Edit 
attributes", via the checkbox labeled "Numeric values represent an 
encoding". It can also be set (or removed, in case it has been 
wrongly imputed) via the "setinfo" command:

setinfo <series-name> --coded   # add "coded" attribute
setinfo <series-name> --numeric # remove "coded" attribute

But note that the condition for adding the "coded" attribute is that 
(a) the series is purely integer-valued and (b) it's not just a 0/1 
dummy. That's because this attribute is intended to tell gretl that 
the series needs to be "dummified" for econometric use, which is 
obviously not the case for a series that's already a binary dummy.
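
Conditions (a) and (b) can be written out as a small check. The following Python sketch is hypothetical and only mirrors the rule as stated above; it assumes the series has no missing values and treats any series whose values all lie in {0, 1} as a dummy.

```python
def qualifies_as_coded(values):
    """True if values are all integers and not just a 0/1 dummy."""
    if not values:
        return False
    # (a) every value must be integer-valued
    if any(v != int(v) for v in values):
        return False
    # (b) the series must not consist solely of 0s and 1s
    return not set(values) <= {0, 1}

print(qualifies_as_coded([1, 2, 3, 2]))   # True
print(qualifies_as_coded([0, 1, 1, 0]))   # False: already a binary dummy
print(qualifies_as_coded([1.5, 2, 3]))    # False: not integer-valued
```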

What are the consequences when a series is taken to be "coded"? 
(Clearly, if there are no practical consequences this is all a waste 
of time.) Well, that's work in progress, but the idea is that you 
can modify a regression list to automatically "dummify" any coded 
series that it might contain. There's a first pass at this in gretl 
git which is illustrated by the following:

<hansl>
open foo.csv --respect-quotes # may contain coded series
list X = *   # complete list of series
X -= y       # remove the dependent variable "y" from X
X = dummify(X, -999)
ols y X
</hansl>

The use of a second argument of -999 to dummify() is, of course, 
just a temporary hack. What it means at present is: replace any 
"coded" series in the first (list) argument by their 
"dummifications" (lists of per-value binary dummies, omitting the 
first value in each encoding). For future reference, this hack 
should be replaced by either a new function, or an overloading of 
dummify() with a special second argument, or maybe an option for 
estimation commands which calls for pre-processing of the list of 
independent variables as described.
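
For readers unfamiliar with "dummification", this Python sketch shows the idea for a single coded series: one 0/1 indicator per distinct value, with the first value omitted as the reference category. It is an illustration only, and it assumes that "first" means the smallest value in the encoding; the function name is made up for the example.

```python
def dummify_sketch(values):
    """Map each distinct value except the smallest to a 0/1 indicator list."""
    levels = sorted(set(values))
    # Skip levels[0]: it serves as the omitted reference category.
    return {level: [1 if v == level else 0 for v in values]
            for level in levels[1:]}

region = [3, 1, 2, 3]
print(dummify_sketch(region))
# {2: [0, 0, 1, 0], 3: [1, 0, 0, 1]}
```

A series with three distinct values thus contributes two dummies to the regression list, which avoids perfect collinearity with the constant.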

Requests:

1) I'd be grateful if people who work with CSV data could try 
importing using the --respect-quotes option to "open" in git and 
snapshots. I'm particularly interested in any cases where the 
original CSV reader works OK but the new version fails; any such 
cases will have to be fixed before we can proceed.

2) I'd also be grateful if people could test the dummify(list, -999) 
hack. But note, this won't do anything (or at least, shouldn't do 
anything!) if list contains no "coded" series. And you won't have 
any such series in your dataset unless you import from suitable CSV, 
or mark any suitable series via the "setinfo" command or the GUI 
"Edit attributes" dialog (see above).

Allin
