On Fri, 26 Oct 2007, Allin Cottrell wrote:
> On Fri, 26 Oct 2007, Sven Schreiber wrote:
>> Riccardo (Jack) Lucchetti wrote:
>>>
>>> A possible alternative may be the following: first, read all
>>> the data as if they were all strings. Then, with the data
>>> already in RAM, convert to numeric whenever possible. This
>>> way, you read the datafile only once, and the way stays open,
>>> if we want, to flag some of the variables as dummies or
>>> discrete variables straight away.
>>
>> Jack's idea sounds good. If I understand correctly, it's an
>> approach to convert as much as possible to usable variables and
>> data, and inform the user about the rest. (Rather than throwing
>> errors and stopping.) That would be good.
> I like Jack's idea too, with a couple of reservations.
>
> First, I'm not too keen on reading all the data into RAM as
> strings. To ensure no data loss, these strings would have to be
> fairly long -- say 32 characters. Now with something like PUMS
> you can have tens or hundreds of thousands of observations on
> hundreds of variables. This makes for a big memory chunk when
> stored as doubles, and perhaps 4 times as big when stored as
> strings. So I tend to favour two passes.
True. Still, it's not inconceivable to allow the in-RAM policy for
small files and the two-pass policy for larger files. Clearly, this
would require some heuristics, but...
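
To make the "convert whenever possible" step concrete, here is a
minimal sketch using plain strtod(); the function name and calling
convention are invented for illustration, and this is not gretl's
actual CSV-reading code. (On Allin's size worry: 200,000 observations
by 200 variables is roughly 320 MB as doubles but about 1.3 GB as
32-character strings, so a size cutoff between the in-RAM and
two-pass policies looks like the obvious heuristic.)

#include <stdio.h>
#include <stdlib.h>

/* Try to convert one column, already in RAM as strings, to
   doubles. Returns 1 if every cell parses cleanly as a number,
   0 otherwise (leave the column as strings, to be flagged as
   coded/discrete data later on). */
static int column_to_numeric (char **cells, int n, double *x)
{
    int t;

    for (t = 0; t < n; t++) {
        char *endp;

        x[t] = strtod(cells[t], &endp);
        if (endp == cells[t] || *endp != '\0') {
            return 0;
        }
    }

    return 1;
}

int main (void)
{
    char *good[] = { "3.5", "4.2", "9.4", "5.3" };
    char *bad[]  = { "3.5", "4.2", "x", "5.3" };
    double x[4];

    printf("good column numeric? %d\n", column_to_numeric(good, 4, x));
    printf("bad column numeric?  %d\n", column_to_numeric(bad, 4, x));

    return 0;
}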
> Second, I think that attempting to parse all non-numeric stuff as
> coded data should probably be governed by an explicit option.
> It'll work fine on a well-formed PUMS file, but could cause a
> nasty mess with a very large data file that has a few extraneous
> non-numeric characters in it -- 100% CPU for a long time. Think
> of a file with 200000 observations and a stray 'x' on the last
> row.
>> BTW, on a (only loosely) related issue, it would be useful if
>> gretl could handle files like some I recently downloaded from
>> the US BLS site; they report quarterly data with an additional
>> row for year averages, like so:
>>
>> 1950Q01 3.5
>> 1950Q02 4.2
>> 1950Q03 9.4
>> 1950Q04 5.3
>> 1950Q05 <you do the calc ;-)>
> Yes, I've seen data of that sort too. I'll think about that
> issue.
The last two points are related IMO. It's very nice from the user's
point of view to have gretl handle cases such as these sensibly, but
in the end it's the user's responsibility to feed a decently-formed
CSV file into gretl. No-one can reasonably complain if gretl (or any
other program, for that matter) refuses to read a CSV file which
contains a stray 'x' at the end. As for Sven's case, it'd be rather
easy to do a

  grep -v Q05 originalfile.csv > modifiedfile.csv

(pity those poor souls who lack Unix tools). My point is that we
should not try to cover internally all possible cases that occur in
practice; there's always going to be one more special case, and
there are tools for this.
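
And just to illustrate the "refuse and report" behaviour I'm arguing
for -- fail fast on the first bad cell instead of silently chewing
CPU on a stray 'x' -- another sketch, with an invented check_cell()
helper and an invented allow_coded flag, not gretl's API:

#include <stdio.h>
#include <stdlib.h>

/* Returns 0 if s parses cleanly as a number; 1 if it doesn't but
   the user explicitly allowed coded data; -1 for a hard error,
   reported with its position so the user can fix the file. */
static int check_cell (const char *s, int row, int col,
                       int allow_coded)
{
    char *endp;

    strtod(s, &endp);

    if (endp != s && *endp == '\0') {
        return 0;   /* clean numeric value */
    }

    if (allow_coded) {
        return 1;   /* candidate coded/discrete column */
    }

    fprintf(stderr, "non-numeric value '%s' at row %d, column %d\n",
            s, row, col);
    return -1;      /* refuse the file */
}

int main (void)
{
    /* row/column numbers here are just for the demo message */
    check_cell("3.5", 1, 1, 0);
    return check_cell("x", 200000, 1, 0) < 0;
}

With allow_coded unset, a 200000-observation file with a stray 'x'
on the last row dies at once with a useful message, rather than
triggering an expensive attempt to reinterpret the whole column.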
Riccardo (Jack) Lucchetti
Dipartimento di Economia
Università Politecnica delle Marche
r.lucchetti(a)univpm.it
http://www.econ.univpm.it/lucchetti