I thought the following might be of interest to people who work in
microeconometrics.
Allin Cottrell
---------- Forwarded message ----------
Date: Thu, 25 Oct 2007 11:44:59 -0400 (EDT)
From: Allin Cottrell
To: Mohammad (Mitu) Ashraf
Subject: Re: Gretl and PUMS Data
On Wed, 24 Oct 2007, Mohammad (Mitu) Ashraf wrote:
I am an associate professor of economics at UNC-Pembroke. I just
started using Gretl for my research. I am switching from SAS.
First of all, I want to thank you and your colleagues for
developing such a wonderful tool.
Thanks!
I have been trying to figure out how to use Gretl for Public Use
Micro Data Sample (PUMS). I am wondering if you can point me in
the right direction. Your response is greatly appreciated.
I haven't made much use of PUMS data myself, but here's what I
found on quick experimentation. I went to
http://factfinder.census.gov/home/en/acs_pums_2006.html
and downloaded the 2006 Population Records for North Carolina in
CSV format. Gretl was close to being able to read this straight
off, but there was one problem.
When gretl encounters non-numeric data for a particular variable
in a CSV import it treats the values of that variable as strings,
constructs a numeric coding, and creates a "string table" that
presents the coding to the user. BUT this is done only if
non-numeric data are encountered in the first data row for the
variable in question. That is, if we read (apparently) numeric
data on rows 1 to k-1, then encounter non-numeric data on row k,
we flag an error and stop reading.
The trouble is that some of the PUMS variables are codings, some
but not all values of which contain non-numeric characters. For
example, NAICSP, the "NAICS Industry Code", which has values
(among others) of 1133 and 113M.
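To make the rule concrete, here is a rough Python model of the behaviour described above. It is only a sketch of the rule as stated in this message, not gretl's actual import code. The names looks_numeric and classify_column are made up for illustration, and skipping blank fields as missing values is my assumption.

```python
def looks_numeric(text):
    """Return True if text parses as a number (a simplifying assumption)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def classify_column(values):
    """Model of the rule described above; not gretl's actual code.

    The first non-blank value decides the column type. If that value is
    numeric, any later non-numeric value is treated as an error.
    """
    # Assumption: blank fields count as missing values and are skipped.
    present = [(row, v.strip()) for row, v in enumerate(values, start=1) if v.strip()]
    if not present:
        return "numeric"
    _, first_value = present[0]
    if not looks_numeric(first_value):
        # Non-numeric from the start: the column gets a string table.
        return "string-coded"
    for row, value in present[1:]:
        if not looks_numeric(value):
            # Numeric so far, then a string: the import stops here.
            raise ValueError(f"non-numeric value {value!r} on data row {row}")
    return "numeric"
```

With NAICSP-style values, a column that starts with 113M comes out as "string-coded", while one that starts with 1133 and later hits 113M raises the error.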
Here's a solution, perhaps not permanent if we can think of
something better: I've added a new parameter to the "set" command,
namely "codevars". You can do, for example,
set codevars NAICSP SOCP
prior to importing a CSV file. This tells gretl that the
variables NAICSP and SOCP should be interpreted as string-coded,
even if the first values look to be numeric.
(In general you say: "set codevars <varnames>", where <varnames>
is a space-separated list of names. You can say "set codevars
null" to clean out the list.)
For the North Carolina PUMS data, this now works to open the file
in gretl:
set codevars NAICSP SOCP
open ss06pnc.csv
This feature is in CVS gretl, and also in the current Windows
snapshot at
http://ricardo.ecn.wfu.edu/pub/gretl/gretl_install.exe
You may have to engage in some trial and error. I've beefed up
the error reporting a little. So, in relation to the example
above, if you do
set codevars NAICSP
open ss06pnc.csv
you then see:
Variable 106 (SOCP), observation 12, '434XXX':
Extraneous character 'X' in data
which in effect tells you that you need to add SOCP to the
"codevars" list -- if it seems to you that 434XXX is a legitimate
value for that variable.
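If you would rather cut down on the trial and error, one option is to scan the CSV file outside gretl first. The Python sketch below lists the columns whose first non-blank value looks numeric but which contain a non-numeric value further down, i.e. candidates for the "codevars" list. This is not part of gretl: the function names are hypothetical, and the sketch assumes a comma-separated file with a header row and that blank fields are missing values.

```python
import csv


def looks_numeric(text):
    """Return True if text parses as a number (a simplifying assumption)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def suggest_codevars(path):
    """Return names of columns that start out numeric but later hold strings."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)      # assumes the first row holds variable names
        first_is_numeric = {}      # column index -> type of first non-blank value
        flagged = set()            # column indices that need string coding
        for row in reader:
            for index, value in enumerate(row[:len(header)]):
                value = value.strip()
                if not value:
                    continue       # assumption: blank means missing
                numeric = looks_numeric(value)
                if index not in first_is_numeric:
                    first_is_numeric[index] = numeric
                elif first_is_numeric[index] and not numeric:
                    flagged.add(index)
    # Report in file order so the list is easy to check against the data.
    return [header[index] for index in sorted(flagged)]
```

Calling suggest_codevars("ss06pnc.csv") and joining the returned names with spaces gives the argument for a "set codevars" line. You should still look at the flagged values yourself to decide whether they are legitimate codes or data errors.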