On Fri, 26 Oct 2007, Allin Cottrell wrote:
> On Fri, 26 Oct 2007, Sven Schreiber wrote:
>> Riccardo (Jack) Lucchetti wrote:
>>>
>>> A possible alternative may be the following: first, read all
>>> the data as if they were all strings. Then, with the data
>>> already in RAM, convert to numeric whenever possible. This
>>> way, you read the datafile only once, and the way stays open,
>>> if we want, to flag some of the variables as dummies or
>>> discrete variables straight away.
>>
>> Jack's idea sounds good. If I understand correctly, it's an
>> approach to convert as much as possible to usable variables and
>> data, and inform the user about the rest. (Rather than throwing
>> errors and stopping.) That would be good.
> I like Jack's idea too, with a couple of reservations.
>
> First, I'm not too keen on reading all the data into RAM as
> strings. To ensure no data loss, these strings would have to be
> fairly long -- say 32 characters. Now with something like PUMS
> you can have tens or hundreds of thousands of observations on
> hundreds of variables. This makes for a big memory chunk when
> stored as doubles, and perhaps 4 times as big when stored as
> strings. So I tend to favour two passes.
True. Still, it's not inconceivable to allow the in-RAM policy for
small files and the two-pass policy for larger files. Clearly, this
would require some heuristics, but...
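
To make the "convert whenever possible" step concrete, here is a
minimal sketch using plain strtod(); the function name and calling
convention are invented for illustration, and this is not gretl's
actual CSV-reading code. (On Allin's size worry: 200,000 observations
by 200 variables is roughly 320 MB as doubles but about 1.3 GB as
32-character strings, so a size cutoff between the in-RAM and
two-pass policies looks like the obvious heuristic.)

#include <stdio.h>
#include <stdlib.h>

/* Try to convert one column, already in RAM as strings, to
   doubles. Returns 1 if every cell parses cleanly as a number,
   0 otherwise (leave the column as strings, to be flagged as
   coded/discrete data later on). */
static int column_to_numeric (char **cells, int n, double *x)
{
    int t;

    for (t = 0; t < n; t++) {
        char *endp;

        x[t] = strtod(cells[t], &endp);
        if (endp == cells[t] || *endp != '\0') {
            return 0;
        }
    }

    return 1;
}

int main (void)
{
    char *good[] = { "3.5", "4.2", "9.4", "5.3" };
    char *bad[]  = { "3.5", "4.2", "x", "5.3" };
    double x[4];

    printf("good column numeric? %d\n", column_to_numeric(good, 4, x));
    printf("bad column numeric?  %d\n", column_to_numeric(bad, 4, x));

    return 0;
}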
> Second, I think that attempting to parse all non-numeric stuff as
> coded data should probably be governed by an explicit option.
> It'll work fine on a well-formed PUMS file, but could cause a
> nasty mess with a very large data file that has a few extraneous
> non-numeric characters in it -- 100% CPU for a long time. Think
> of a file with 200000 observations and a stray 'x' on the last
> row.
>> BTW, on a (only loosely) related issue, it would be useful if
>> gretl could handle files like some I recently downloaded from
>> the US BLS site; they report quarterly data with an additional
>> row for year averages, like so:
>>
>> 1950Q01 3.5
>> 1950Q02 4.2
>> 1950Q03 9.4
>> 1950Q04 5.3
>> 1950Q05 <you do the calc ;-)>
> Yes, I've seen data of that sort too. I'll think about that
> issue.
The last two points are related IMO. It's very nice from the user's
point of view to have gretl handle cases such as these sensibly, but
in the end it's the user's responsibility to feed a decently-formed
CSV file into gretl. No-one can reasonably complain if gretl (or any
other program, for that matter) refuses to read a CSV file which
contains a stray 'x' at the end. As for Sven's case, it'd be rather
easy to do a

  grep -v Q05 originalfile.csv > modifiedfile.csv

(pity those poor souls who lack Unix tools). My point is that we
should not try to cover internally all possible cases that occur in
practice; there's always going to be one more special case, and
there are tools for this.
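
And just to illustrate the "refuse and report" behaviour I'm arguing
for -- fail fast on the first bad cell instead of silently chewing
CPU on a stray 'x' -- another sketch, with an invented check_cell()
helper and an invented allow_coded flag, not gretl's API:

#include <stdio.h>
#include <stdlib.h>

/* Returns 0 if s parses cleanly as a number; 1 if it doesn't but
   the user explicitly allowed coded data; -1 for a hard error,
   reported with its position so the user can fix the file. */
static int check_cell (const char *s, int row, int col,
                       int allow_coded)
{
    char *endp;

    strtod(s, &endp);

    if (endp != s && *endp == '\0') {
        return 0;   /* clean numeric value */
    }

    if (allow_coded) {
        return 1;   /* candidate coded/discrete column */
    }

    fprintf(stderr, "non-numeric value '%s' at row %d, column %d\n",
            s, row, col);
    return -1;      /* refuse the file */
}

int main (void)
{
    /* row/column numbers here are just for the demo message */
    check_cell("3.5", 1, 1, 0);
    return check_cell("x", 200000, 1, 0) < 0;
}

With allow_coded unset, a 200000-observation file with a stray 'x'
on the last row dies at once with a useful message, rather than
triggering an expensive attempt to reinterpret the whole column.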
Riccardo (Jack) Lucchetti
Dipartimento di Economia
Università Politecnica delle Marche
r.lucchetti(a)univpm.it
http://www.econ.univpm.it/lucchetti