On Thu, 25 Oct 2007, Allin Cottrell wrote:
When gretl encounters non-numeric data for a particular variable
in a CSV import, it treats the values of that variable as strings,
constructs a numeric coding, and creates a "string table" that
presents the coding to the user. BUT this is done only if
non-numeric data are encountered in the first data row for the
variable in question. That is, if we read (apparently) numeric
data on rows 1 to k-1, then encounter non-numeric data on row k,
we flag an error and stop reading.
The trouble is that some of the PUMS variables are codings, some
but not all values of which contain non-numeric characters. For
example, NAICSP, the "NAICS Industry Code", which has values
(among others) of 1133 and 113M.
Here's a solution, perhaps not permanent if we can think of
something better: I've added a new parameter to the "set" command,
namely "codevars". You can do, for example,
[...]
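Just to make the failure mode concrete: the check that trips over a
value like 113M is, in essence, whether the whole field parses as a
number. A minimal sketch (plain C, not the actual importer code)
could look like this:

#include <stdio.h>
#include <stdlib.h>

/* 1 if the whole field parses as a number, 0 otherwise:
   "1133" passes, "113M" does not */
int field_is_numeric (const char *s)
{
    char *endp;

    strtod(s, &endp);
    return *s != '\0' && *endp == '\0';
}

int main (void)
{
    const char *vals[] = { "1133", "113M" };
    int i;

    for (i = 0; i < 2; i++) {
        printf("%s -> %s\n", vals[i],
               field_is_numeric(vals[i]) ? "numeric" : "string");
    }

    return 0;
}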
The problem I see with the codevars approach is that one has to know
in advance which variables must be treated specially. With large
datasets, you may not; the improved debugging info does help, but IMO
only to an extent. A possible alternative would be the following:
first, read all the data as if they were strings. Then, with the data
already in RAM, convert to numeric whenever possible. This way, you
read the datafile only once, and the way stays open, if we want, to
flag some of the variables as dummies or discrete variables straight
away.
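Something along these lines, just as a rough illustration (the struct
and helper names below are made up for the example, not taken from
the gretl sources): hold each column as strings first, then either
convert the whole column to numbers or, failing that, code the
distinct strings as 1, 2, 3, ...

#include <stdlib.h>
#include <string.h>

typedef struct {
    char **sval;    /* raw string values, one per row */
    double *xval;   /* numeric values after conversion/coding */
    char **table;   /* distinct strings, if a coding was needed */
    int n;          /* number of rows */
    int ntab;       /* number of distinct coded strings */
} column;

/* 1 if the whole field parses as a number, writing the value to *px */
int parse_numeric (const char *s, double *px)
{
    char *endp;

    *px = strtod(s, &endp);
    return *s != '\0' && *endp == '\0';
}

/* return the 1-based code for s, appending it to the table if new */
double string_code (column *c, const char *s)
{
    int i;

    for (i = 0; i < c->ntab; i++) {
        if (!strcmp(c->table[i], s)) {
            return i + 1;
        }
    }

    /* not seen before: extend the table (error checks omitted) */
    c->table = realloc(c->table, (c->ntab + 1) * sizeof *c->table);
    c->table[c->ntab] = strdup(s);
    c->ntab += 1;

    return c->ntab;
}

/* second pass, in memory: make the column numeric if every value
   parses, otherwise build a string table and code the values;
   assumes sval and xval are allocated, table is NULL and ntab is 0 */
void convert_column (column *c)
{
    double x;
    int t, all_numeric = 1;

    /* first check whether every value parses as a number */
    for (t = 0; t < c->n && all_numeric; t++) {
        all_numeric = parse_numeric(c->sval[t], &x);
    }

    /* then either convert or code each value */
    for (t = 0; t < c->n; t++) {
        if (all_numeric) {
            parse_numeric(c->sval[t], &c->xval[t]);
        } else {
            c->xval[t] = string_code(c, c->sval[t]);
        }
    }
}

(Error checking and memory management are left out to keep the sketch
short.)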
What do you think?
Riccardo (Jack) Lucchetti
Dipartimento di Economia
Università Politecnica delle Marche
r.lucchetti(a)univpm.it
http://www.econ.univpm.it/lucchetti