Re: [Gretl-devel] gretl's CSV reader and categorical data

Tuesday, 25 July 2017

On Tue, 25 Jul 2017, Sven Schreiber wrote:

...
 On 23.07.2017 at 21:44, Allin Cottrell wrote:

> The problem: some time ago we decided to ease the task of parsing "CSV" by
> deleting quotation marks from each line of input. (We can and do recognize 
> string-valued input, but only by determining that it cannot be parsed as 
> numeric.)  Quotation is sometimes used inconsistently and arbitrarily in 
> "CSV" files 

 I am absolutely no csv fundamentalist (like people who don't accept 
 semicolons or tabs as column separators), but could you remind us why coping 
 with CSV files with inconsistent quotation has to be done? Spontaneously I'd 
 say such files are really the problem of their creators.

> So, I've been working on a revision of our CSV reader in which we "respect" 
> quotation in this sense: we do not delete quotation marks in CSV input, and 
> if it turns out that all the values in a given column are quoted integers, 
> we take that column to be an encoding of a categorical variable. 

 Except if they're years, I hope... No seriously, doesn't this mess with a lot 
 of variables that may be only integers but that we usually treat as 
 quasi-continuous? 

Let me try to explain more clearly what I'm up to. Consider the 
following CSV fragment:

"x","y"
12,"1"
2,"0"
9,"3"
31,"1"
15,"2"

The data are all integers, but the values in the y column are quoted 
while those in the x column are not. As things stand we ignore this 
difference by default: both x and y will be considered "properly 
numeric" by gretl, the y-quotes being stripped out in a 
pre-processing step.

However, in CSV files from various sources, including R's 
write.csv(), the presence/absence of quotation in the data is 
semantically significant: we are supposed to read only unquoted 
values as "properly numeric" and the quoted ones as encodings (of 
"factor" variables). That's precisely what the new --respect-quotes 
option does.
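
To make the distinction concrete, here is a rough sketch in Python (not 
gretl's actual C code) of the rule applied to the fragment above. The names 
classify_columns and respect_quotes are hypothetical, chosen only for 
illustration, and the sketch assumes simple input: comma-separated fields 
with no embedded commas or escaped quotes.

```python
# Illustrative sketch only: classify CSV columns by whether their
# values are quoted. Assumes no embedded commas or escaped quotes.

SAMPLE = '''"x","y"
12,"1"
2,"0"
9,"3"
31,"1"
15,"2"
'''

def is_quoted(field):
    """True if the raw field is wrapped in double quotes."""
    return len(field) >= 2 and field[0] == '"' and field[-1] == '"'

def parses_as(convert, text):
    """True if convert(text) succeeds (convert is int or float)."""
    try:
        convert(text)
        return True
    except ValueError:
        return False

def classify_columns(text, respect_quotes=True):
    """Return {column name: 'numeric' | 'categorical' | 'string'}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = [f.strip().strip('"') for f in lines[0].split(",")]
    rows = [[f.strip() for f in ln.split(",")] for ln in lines[1:]]
    kinds = {}
    for j, name in enumerate(names):
        raw = [row[j] for row in rows]
        bare = [f.strip('"') for f in raw]  # values with quotes removed
        all_quoted = all(is_quoted(f) for f in raw)
        if respect_quotes and all_quoted and all(parses_as(int, v) for v in bare):
            # Every value is a quoted integer: treat as an encoding.
            kinds[name] = "categorical"
        elif all(parses_as(float, v) for v in bare):
            # Quotes (if any) are ignored and the values parse as numbers.
            kinds[name] = "numeric"
        else:
            kinds[name] = "string"
    return kinds

print(classify_columns(SAMPLE))                        # quotes respected
print(classify_columns(SAMPLE, respect_quotes=False))  # quotes ignored
```

With quotes respected, x comes out as numeric and y as categorical; with 
quotes ignored, both columns are read as numeric, which corresponds to the 
current default described above.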

Once we've shaken the bugs out of the new option I'd like to make 
respecting quotation in this way the default, but perhaps add an 
--ignore-quotes option to give the old behavior. Why might that be 
wanted? Because I'm pretty sure I've seen CSV files where quotation 
is used arbitrarily -- even on what are clearly "properly numeric" 
fields, and in that case you do want to ignore it.

If a "CSV" file contains truly broken use of quotation (quotes 
opened but not closed in a field, use of double-quotes in some 
fields and single-quotes in others) then I agree, it's not our job 
to try and fix such a mess. But I do think we should try to make 
sense of quoted versus unquoted numbers. There's no de jure standard 
here, but what R does (and various governmental sources also do) is 
a useful de facto standard.

Allin
