On Fri, 26 Oct 2007, Sven Schreiber wrote:
> Riccardo (Jack) Lucchetti wrote:
>>
>> A possible alternative may be the following: first, read all
>> the data as if they were all strings. Then, with the data
>> already in RAM, convert to numeric whenever possible. This
>> way, you read the datafile only once, and the door stays open
>> if we want, for instance, to flag some of the variables as
>> dummies or discrete variables straight away.
>
> Jack's idea sounds good. If I understand correctly, it's an
> approach that converts as much as possible to usable variables
> and data, and informs the user about the rest (rather than
> throwing errors and stopping). That would be good.
I like Jack's idea too, with a couple of reservations.
First, I'm not too keen on reading all the data into RAM as
strings. To ensure no data loss, these strings would have to be
fairly long -- say 32 characters. Now with something like PUMS
you can have tens or hundreds of thousands of observations on
hundreds of variables. This makes for a big memory chunk when
stored as doubles, and perhaps 4 times as big when stored as
strings. So I tend to favour two passes.
Second, I think that attempting to parse all non-numeric stuff as
coded data should probably be governed by an explicit option.
It'll work fine on a well-formed PUMS file, but it could make a
nasty mess of a very large data file that contains a few
extraneous non-numeric characters: 100% CPU for a long time.
Think of a file with 200,000 observations and a stray 'x' on the
last row.
> BTW, on an (only loosely) related issue, it would be useful if
> gretl could handle files like some I recently downloaded from
> the US BLS site; they report quarterly data with an additional
> row for year averages, like so:
>
> 1950Q01 3.5
> 1950Q02 4.2
> 1950Q03 9.4
> 1950Q04 5.3
> 1950Q05 <you do the calc ;-)>
Yes, I've seen data of that sort too. I'll think about that
issue.
Allin.