reading CSV files

Saturday, 27 October 2007

There are a couple of modifications in CVS and the Windows 
snapshot, based on recent discussions.

1) Non-numeric data.  Up till now we've treated a given data 
column as string-coded only if the first observation is 
non-numeric.  Now we're more generous, and treat any column that 
contains non-numeric values as a coding, subject to the following 
qualification (designed to catch genuine data errors):

* If there's only one non-numeric value in a given column, or if 
the non-numeric values amount to less than 1 percent of the total 
non-missing values, we give up on the coding and flag an error.

The user can override this qualification for specific named 
columns using "set codevars <list of names>" or can override it 
globally by adding the "--coded" flag to the "open" command.

Also, as suggested by Jack, we automatically flag variables 
treated in this way as discrete.
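
To make the rule in (1) concrete, here is a rough Python sketch of the decision as described above. It is not gretl's actual code. The function name classify_column, the force_coded parameter (standing in for "set codevars" or "--coded"), the test for missing values, and the reading of "only one non-numeric value" as "one non-numeric cell" are all assumptions made for illustration.

```python
def classify_column(values, force_coded=False):
    """Classify a list of raw CSV cell strings as 'numeric', 'coded' or 'error'.

    Illustrative only: the missing-value markers and the interpretation of
    "only one non-numeric value" as a single cell are assumptions.
    """
    def is_missing(cell):
        # Assumption: empty cells and "NA" count as missing.
        return cell.strip() in ("", "NA")

    def is_numeric(cell):
        try:
            float(cell)
            return True
        except ValueError:
            return False

    present = [v for v in values if not is_missing(v)]
    non_numeric = [v for v in present if not is_numeric(v)]

    if not non_numeric:
        return "numeric"
    if force_coded:
        # The user has overridden the qualification for this column.
        return "coded"
    # Qualification: a lone non-numeric cell, or fewer than 1 percent of the
    # non-missing cells, is taken to be a data error rather than a coding.
    if len(non_numeric) == 1 or len(non_numeric) < 0.01 * len(present):
        return "error"
    return "coded"
```

Under these assumptions a column such as 1, 2, x, 4 is flagged as an error, while the same column with force_coded=True is accepted as a coding.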

2) BLS-type data files with "five quarters" or "13 months": gretl 
will now read at least some such files correctly, disregarding the 
extra lines.  However, my feeling is that the BLS is playing silly 
buggers with this sort of file and that somebody should file a bug 
report with them.

If a file of this type is not recognized, then besides the simple 
"grep -v" method suggested by Jack, you can also clean up such 
files using gretl's data manipulation tools.  For example, for a 
"five-quarter" data file where the data start in 1950Q1:

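# read the raw file, extra "fifth quarter" rows included
open nonsense.csv
# create an observation index running 1, 2, 3, ...
genr index
# keep only the rows whose index is not a multiple of 5
smpl (index % 5 > 0) --restrict
# declare the remaining data as quarterly, starting in 1950Q1
setobs 4 1950:1
# save the cleaned dataset in gretl's native format
store sensible.gdt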
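
For readers who want to see what the index filter does, here is the same selection as a small Python sketch. It assumes, as the script above does, that the extra line is always the fifth row of each block of five. The function name drop_every_fifth is hypothetical.

```python
def drop_every_fifth(rows):
    """Keep the rows whose 1-based position is not a multiple of 5.

    This mirrors the condition index % 5 > 0, on the assumption that the
    spurious "fifth quarter" line is the last row of each block of five.
    """
    return [row for i, row in enumerate(rows, start=1) if i % 5 > 0]
```

Given ten rows covering two years, rows 5 and 10 are dropped and the eight genuine quarterly observations remain.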

Allin
