Riccardo (Jack) Lucchetti wrote:
On Thu, 25 Oct 2007, Allin Cottrell wrote:
> When gretl encounters non-numeric data for a particular variable
> in a CSV import it treats the values of that variable as strings,
> constructs a numeric coding, and creates a "string table" that
> presents the coding to the user. BUT this is done only if
> non-numeric data are encountered in the first data row for the
> variable in question. That is, if we read (apparently) numeric
> data on rows 1 to k-1, then encounter non-numeric data on row k,
> we flag an error and stop reading.
>
> The trouble is that some of the PUMS variables are codings, some
> but not all values of which contain non-numeric characters. For
> example, NAICSP, the "NAICS Industry Code", which has values
> (among others) of 1133 and 113M.
>
> Here's a solution, perhaps not permanent if we can think of
> something better: I've added a new parameter to the "set" command,
> namely "codevars". You can do, for example,
[...]
The problem I see with this approach is that one has to know in advance
which variables must be treated specially. With large datasets, you may
not; the improved debugging info does help, but IMO only to an extent. A
possible alternative may be the following: first, read all the data as
if they were all strings. Then, with the data already in RAM, convert to
numeric whenever possible. This way, you read the datafile only once,
and the way stays open if we want, for instance, to flag some of the
variables as dummies or discrete variables straight away.
What do you think?
Not sure if the question was directed at people like me, but Jack's idea
sounds good. If I understand correctly, it's an approach to convert as
much as possible to usable variables and data, and inform the user about
the rest. (Rather than throwing errors and stopping.) That would be good.
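In rough, illustrative Python (not gretl's actual C internals), I imagine the two-pass idea would look something like this: read every cell as a string, then per column attempt a numeric conversion; only columns where some value fails to parse get a coding plus a string table. A mixed column like NAICSP (1133, 113M) then falls back to the coding automatically, no matter which row the first non-numeric value shows up in:

```python
import csv

def import_csv_lines(lines):
    """Two-pass import: read everything as strings first, then try a
    numeric conversion per column; columns that fail get a numeric
    coding plus a string table mapping codes back to the strings."""
    reader = csv.reader(lines)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            columns[name].append(cell)

    data, string_tables = {}, {}
    for name, cells in columns.items():
        try:
            data[name] = [float(c) for c in cells]  # all-numeric column
        except ValueError:
            # a non-numeric value occurred somewhere: code the distinct
            # strings as 1, 2, ... in order of first appearance
            codes = {}
            data[name] = [codes.setdefault(c, len(codes) + 1) for c in cells]
            string_tables[name] = {v: k for k, v in codes.items()}
    return data, string_tables
```

The point being that no per-variable declaration ("codevars") is needed up front, since the decision is made after all rows are in memory.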
BTW, on a (only loosely) related issue, it would be useful if gretl
could handle files like some I recently downloaded from the US BLS site;
they report quarterly data with an additional row for year averages,
like so:
1950Q01 3.5
1950Q02 4.2
1950Q03 9.4
1950Q04 5.3
1950Q05 <you do the calc ;-)>
Maybe an option to skip every n-th row would be a solution. Or a
condition to exclude obs labels matching a pattern like '*5'.
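The label-pattern idea could be as simple as a glob match against the observation labels; a sketch (illustrative Python, not a proposed gretl interface):

```python
import fnmatch

def drop_labeled_rows(rows, pattern):
    """Drop observations whose label matches a glob pattern --
    e.g. pattern '*Q05' removes the annual-average rows."""
    return [(label, value) for label, value in rows
            if not fnmatch.fnmatch(label, pattern)]
```

Applied to the BLS-style file above, `drop_labeled_rows(rows, "*Q05")` would keep the four genuine quarters and discard the average row.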
Apart from that, gretl detects the above file as monthly data, IIRC,
even though there is a 'Q' in the labels. Maybe the corresponding
heuristic could be made smarter.
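A smarter heuristic might look at the marker letter in the labels before counting sub-periods per year; again just an illustrative sketch, not what gretl currently does:

```python
import re

def guess_frequency(labels):
    """Guess the data frequency from observation labels: a 'Q'
    marker suggests quarterly, 'M' monthly, a bare year annual."""
    if all(re.fullmatch(r"\d{4}[Qq]\d{1,2}", lab) for lab in labels):
        return "quarterly"
    if all(re.fullmatch(r"\d{4}[Mm]\d{1,2}", lab) for lab in labels):
        return "monthly"
    if all(re.fullmatch(r"\d{4}", lab) for lab in labels):
        return "annual"
    return "unknown"
```

With that rule the BLS labels above would come out quarterly even though there are five rows per year.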
cheers,
sven