Riccardo (Jack) Lucchetti wrote:
On Thu, 25 Oct 2007, Allin Cottrell wrote:
> When gretl encounters non-numeric data for a particular variable
> in a CSV import it treats the values of that variable as strings,
> constructs a numeric coding, and creates a "string table" that
> presents the coding to the user. BUT this is done only if
> non-numeric data are encountered in the first data row for the
> variable in question. That is, if we read (apparently) numeric
> data on rows 1 to k-1, then encounter non-numeric data on row k,
> we flag an error and stop reading.
>
> The trouble is that some of the PUMS variables are codings, some
> but not all values of which contain non-numeric characters. For
> example, NAICSP, the "NAICS Industry Code", which has values
> (among others) of 1133 and 113M.
>
> Here's a solution, perhaps not permanent if we can think of
> something better: I've added a new parameter to the "set" command,
> namely "codevars". You can do, for example,
[...]
The problem I see with this approach is that one has to know in advance
which variables must be treated specially. With large datasets, you may
not; the improved debugging info does help, but IMO only to an extent. A
possible alternative may be the following: first, read all the data as
if they were all strings. Then, with the data already in RAM, convert to
numeric whenever possible. This way, you read the datafile only once,
and the way stays open if we want, for instance, to flag some of the
variables as dummies or discrete variables straight away.
What do you think?
Not sure if the question was directed at people like me, but Jack's idea
sounds good. If I understand correctly, it's an approach to convert as
much as possible to usable variables and data, and inform the user about
the rest. (Rather than throwing errors and stopping.) That would be good.
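In rough, illustrative Python (not gretl's actual C internals), I imagine the two-pass idea would look something like this: read every cell as a string, then per column attempt a numeric conversion; only columns where some value fails to parse get a coding plus a string table. A mixed column like NAICSP (1133, 113M) then falls back to the coding automatically, no matter which row the first non-numeric value shows up in:

```python
import csv

def import_csv_lines(lines):
    """Two-pass import: read everything as strings first, then try a
    numeric conversion per column; columns that fail get a numeric
    coding plus a string table mapping codes back to the strings."""
    reader = csv.reader(lines)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            columns[name].append(cell)

    data, string_tables = {}, {}
    for name, cells in columns.items():
        try:
            data[name] = [float(c) for c in cells]  # all-numeric column
        except ValueError:
            # a non-numeric value occurred somewhere: code the distinct
            # strings as 1, 2, ... in order of first appearance
            codes = {}
            data[name] = [codes.setdefault(c, len(codes) + 1) for c in cells]
            string_tables[name] = {v: k for k, v in codes.items()}
    return data, string_tables
```

The point being that no per-variable declaration ("codevars") is needed up front, since the decision is made after all rows are in memory.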
BTW, on a (only loosely) related issue, it would be useful if gretl
could handle files like some I recently downloaded from the US BLS site;
they report quarterly data with an additional row for year averages,
like so:
1950Q01 3.5
1950Q02 4.2
1950Q03 9.4
1950Q04 5.3
1950Q05 <you do the calc ;-)>
Maybe an option to skip every n-th row would be a solution. Or a
condition to exclude obs labels matching a pattern like '*5'.
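The label-pattern idea could be as simple as a glob match against the observation labels; a sketch (illustrative Python, not a proposed gretl interface):

```python
import fnmatch

def drop_labeled_rows(rows, pattern):
    """Drop observations whose label matches a glob pattern --
    e.g. pattern '*Q05' removes the annual-average rows."""
    return [(label, value) for label, value in rows
            if not fnmatch.fnmatch(label, pattern)]
```

Applied to the BLS-style file above, `drop_labeled_rows(rows, "*Q05")` would keep the four genuine quarters and discard the average row.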
Apart from that, gretl detects the above file as monthly data, IIRC,
even though there is a 'Q' in the labels. Maybe the corresponding
heuristic could be made smarter.
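A smarter heuristic might look at the marker letter in the labels before counting sub-periods per year; again just an illustrative sketch, not what gretl currently does:

```python
import re

def guess_frequency(labels):
    """Guess the data frequency from observation labels: a 'Q'
    marker suggests quarterly, 'M' monthly, a bare year annual."""
    if all(re.fullmatch(r"\d{4}[Qq]\d{1,2}", lab) for lab in labels):
        return "quarterly"
    if all(re.fullmatch(r"\d{4}[Mm]\d{1,2}", lab) for lab in labels):
        return "monthly"
    if all(re.fullmatch(r"\d{4}", lab) for lab in labels):
        return "annual"
    return "unknown"
```

With that rule the BLS labels above would come out quarterly even though there are five rows per year.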
cheers,
sven