This is quite long. Impatient readers who are nonetheless disposed
to help with gretl development: please skip to the "Requests"
section at the end!
Gretl's CSV reader is, I think, pretty good at this point: it can
handle most "CSV" data (in a broad sense) that you throw at it, so
long as the input is not too badly broken. However, I've recently
discovered that there is a potentially important problem, and I'm
trying to fix it.
The problem: some time ago we decided to ease the task of parsing
"CSV" by deleting quotation marks from each line of input. (We can
and do recognize string-valued input, but only by determining that
it cannot be parsed as numeric.) Quotation is sometimes used
inconsistently and arbitrarily in "CSV" files (in which case we
don't lose anything by the policy just mentioned), but sometimes
it's used in a systematic way to indicate columns that present
categorical data (per R, "factors") even though the values are
apparently numeric. I've come across such cases in the American
Housing Survey (AHS), and R's write.csv() likewise quotes integer
values to indicate factors.
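For instance, a CSV file written with this convention might look like
the following (invented data; the quoted "region" column holds
integer codes for a categorical variable, while "income" is genuinely
numeric):

  "id","region","income"
  1,"3",52000
  2,"1",47500
  3,"3",61200

Under the old policy the quotes around the region values were simply
stripped, and the information that region is categorical was lost.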
If you're working with a fairly small, and adequately documented,
dataset in CSV form this isn't a big problem; it's easy enough to
figure out which columns are categorical, and treat them accordingly
(e.g. by using "dummify"). But given a dataset with hundreds of
variables (e.g. AHS), and maybe not all that well documented,
figuring out what's categorical and what's not can be a real
headache.
So, I've been working on a revision of our CSV reader in which we
"respect" quotation in this sense: we do not delete quotation marks
in CSV input, and if it turns out that all the values in a given
column are quoted integers, we take that column to be an encoding of
a categorical variable. To that end I've introduced a new attribute
of gretl series, namely "coded". This is set on input from CSV,
where applicable, and is preserved in write/read of gdt and gdtb
data files.
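So, for example, a coded series imported from CSV should keep its
attribute across a save/reload cycle; a sketch, using a hypothetical
file name:

<hansl>
open foo.csv --respect-quotes  # may mark some series as "coded"
store foo.gdt                  # the attribute is written to gdt
open foo.gdt                   # ...and restored on re-reading
</hansl>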
At present, the revised CSV reader is invoked if and only if you add
the option flag --respect-quotes when opening a CSV file (which has
to be done via console or scripting), as in
open foo.csv --respect-quotes
My goal is that once this whole thing has stabilized, the flag will
be removed and respecting quotes will become the default behavior.
The "coded" attribute is represented in the gretl GUI, under "Edit
attributes", via the checkbox labeled "Numeric values represent an
encoding". It's also settable (and removable, in case it has been
wrongly imputed), via the "setinfo" command:
setinfo <series-name> --coded # add "coded" attribute
setinfo <series-name> --numeric # remove "coded" attribute
But note that the condition for adding the "coded" attribute is that
(a) the series is purely integer-valued and (b) it's not just a 0/1
dummy. That's because this attribute is intended to tell gretl that
the series needs to be "dummified" for econometric use, which is
obviously not the case for a series that's already a binary dummy.
What are the consequences when a series is taken to be "coded"?
(Clearly, if there are no practical consequences this is all a waste
of time.) Well, that's work in progress, but the idea is that you
can modify a regression list to automatically "dummify" any coded
series that it might contain. There's a first pass at this in gretl
git which is illustrated by the following:
<hansl>
open foo.csv # may contain coded series?
list X = * # complete list of series
X -= y # remove the dependent variable "y" from X
X = dummify(X, -999)
ols y X
</hansl>
The use of a second argument of -999 to dummify() is, of course,
just a temporary hack. What it means at present is: replace each
"coded" series in the first (list) argument with its
"dummification" (a list of per-value binary dummies, omitting the
first value in the encoding). For future reference, this hack
should be replaced by either a new function, or an overloading of
dummify() with a special second argument, or maybe an option for
estimation commands which calls for pre-processing of the list of
independent variables as described.
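To illustrate what "dummification" means here: if a coded series
takes the values 1, 2 and 3, its dummification is a list of two 0/1
dummies, for the values 2 and 3, the first value being omitted as
the reference category. A sketch, with a hypothetical series name:

<hansl>
# suppose "region" is a coded series taking values 1, 2 and 3
list R = region
list D = dummify(R)  # dummies for values 2 and 3; value 1 omitted
</hansl>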
Requests:
1) I'd be grateful if people who work with CSV data could try
importing it using the --respect-quotes option to "open", in gretl
git and current snapshots. I'm particularly interested in any cases
where the
original CSV reader works OK but the new version fails; any such
cases will have to be fixed before we can proceed.
2) I'd also be grateful if people could test the dummify(list, -999)
hack. But note, this won't do anything (or at least, shouldn't do
anything!) if list contains no "coded" series. And you won't have
any such series in your dataset unless you import from suitable CSV,
or mark any suitable series via the "setinfo" command or the GUI
"Edit attributes" dialog (see above).
Allin