I recently responded to this question from a gretl user:
I have been trying to figure out how to use Gretl for Public Use
Micro Data Sample (PUMS). I am wondering if you can point me in
the right direction. Your response is greatly appreciated.
My response is below (you may have seen it on gretl-users),
followed by a design question.
<initial response>
I haven't made much use of PUMS data myself, but here's what I
found on quick experimentation. I went to
http://factfinder.census.gov/home/en/acs_pums_2006.html
and downloaded the 2006 Population Records for North Carolina in
CSV format. Gretl was close to being able to read this straight
off, but there was one problem.
When gretl encounters non-numeric data for a particular variable
in a CSV import it treats the values of that variable as strings,
constructs a numeric coding, and creates a "string table" that
presents the coding to the user. BUT this is done only if
non-numeric data are encountered in the first data row for the
variable in question. That is, if we read (apparently) numeric
data on rows 1 to k-1, then encounter non-numeric data on row k,
we flag an error and stop reading.
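
To make the first-row rule concrete, here is a rough Python sketch.
It is not gretl's actual import code: the names classify_column and
NonNumericError are made up for illustration, and missing values
are ignored.

```python
# Illustrative sketch only, not gretl's real CSV importer.
# classify_column and NonNumericError are hypothetical names.

class NonNumericError(ValueError):
    """Raised when non-numeric data turn up after a numeric first row."""


def is_numeric(cell):
    """Return True if the cell parses as a floating-point number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def classify_column(name, values):
    """Apply the first-row rule to one column of raw CSV cells.

    The first data row alone decides whether the column is treated
    as string-valued. If that row looks numeric, any later
    non-numeric cell is an error.
    """
    if not is_numeric(values[0]):
        return "string"
    # Observations are numbered from 1, so the second row is 2.
    for obs, cell in enumerate(values[1:], start=2):
        if not is_numeric(cell):
            raise NonNumericError(
                f"Variable {name}, observation {obs}, '{cell}': "
                "non-numeric data"
            )
    return "numeric"
```

With this rule a column reading "NC", "VA" is classed as string
and one reading "34", "51" as numeric, while "1133" followed by
"113M" raises the error at observation 2.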
The trouble is that some of the PUMS variables are codings in which
some, but not all, values contain non-numeric characters. One
example is NAICSP, the "NAICS Industry Code", whose values include
1133 and 113M.
Here's a solution, perhaps not permanent if we can think of
something better: I've added a new parameter to the "set" command,
namely "codevars". You can do, for example,
set codevars NAICSP SOCP
prior to importing a CSV file. This tells gretl that the
variables NAICSP and SOCP should be interpreted as string-coded,
even if the first values look to be numeric.
(In general you say: "set codevars <varnames>", where <varnames>
is a space-separated list of names. You can say "set codevars
null" to clean out the list.)
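
As a rough picture of what the list does, here is a Python sketch
of a column reader that consults a set of forced names. This is
not gretl's code: read_column and encode_strings are hypothetical
names, and numbering the codes from 1 in order of first appearance
is my assumption for the sketch, not a statement of how gretl
assigns them.

```python
# Illustrative sketch only; names and coding scheme are assumptions.

def encode_strings(values):
    """Map each distinct string to an integer code.

    Codes start at 1 and follow order of first appearance (an
    assumption made for this sketch). Returns the coded values and
    the "string table" that records the mapping.
    """
    table = {}
    codes = [table.setdefault(v, len(table) + 1) for v in values]
    return codes, table


def read_column(name, values, codevars=frozenset()):
    """Read one column, honouring a list of forced string-coded names.

    A column named in codevars is string-coded even if every value
    looks numeric; other columns are converted to floats and have
    no string table.
    """
    if name in codevars:
        return encode_strings(values)
    return [float(v) for v in values], None


codes, table = read_column(
    "NAICSP", ["1133", "113M", "1133"], codevars={"NAICSP"}
)
```

Here codes comes out as [1, 2, 1] and table as
{"1133": 1, "113M": 2}, so the leading "1133" is kept as a code
rather than being read as the number 1133.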
For the North Carolina PUMS data, this now works to open the file
in gretl:
set codevars NAICSP SOCP
open ss06pnc.csv
This feature is in CVS gretl, and also in the current Windows
snapshot at
http://ricardo.ecn.wfu.edu/pub/gretl/gretl_install.exe
You may have to engage in some trial and error. I've beefed up
the error reporting a little. So, in relation to the example
above, if you do
set codevars NAICSP
open ss06pnc.csv
you then see:
Variable 106 (SOCP), observation 12, '434XXX':
Extraneous character 'X' in data
which in effect tells you that you need to add SOCP to the
"codevars" list -- if it seems to you that 434XXX is a legitimate
value for that variable.
</initial response>
Now here's my question. I wonder if it might be better (or
complementary, perhaps) to add an option flag to open/import that
forces gretl to treat all data columns containing non-numeric
values as legitimate codings. (There could be a corresponding
checkbox in the GUI.)
Internally, this would require two passes through the file, one to
assess which variables need special treatment, and a second to
actually read (and code) the data.
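
The two-pass idea might look something like the following Python
sketch. It is only meant to show the shape of the approach, not a
proposed implementation: two_pass_read is a hypothetical name, the
coding (from 1, in order of first appearance) is an assumption, and
missing values are not handled.

```python
# Illustrative sketch of a two-pass CSV read; not gretl code.
import csv
import io


def is_numeric(cell):
    """Return True if the cell parses as a floating-point number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def two_pass_read(f):
    """Read CSV from a seekable text stream in two passes."""
    # Pass 1: note every column holding at least one non-numeric cell.
    reader = csv.reader(f)
    header = next(reader)
    coded = set()
    for row in reader:
        for name, cell in zip(header, row):
            if not is_numeric(cell):
                coded.add(name)

    # Pass 2: rewind, then read the data, string-coding flagged columns.
    f.seek(0)
    reader = csv.reader(f)
    next(reader)  # skip the header row
    tables = {name: {} for name in coded}
    data = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            if name in coded:
                table = tables[name]
                data[name].append(table.setdefault(cell, len(table) + 1))
            else:
                data[name].append(float(cell))
    return data, tables


sample = io.StringIO("AGEP,NAICSP\n34,1133\n51,113M\n27,1133\n")
data, tables = two_pass_read(sample)
```

On this small sample, AGEP is read as plain numbers and NAICSP is
coded as [1, 2, 1], with no need to name NAICSP in advance. The
cost is reading the file twice.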
The general issue here is that non-numeric values are sometimes
legit, but sometimes reflect a screwed-up data file. It might be
useful for the user to be able to say, "I know that anything
non-numeric in this file is in fact legit".
Allin.