On Tue, 25 Jul 2017, Sven Schreiber wrote:
On 23.07.2017 at 21:44, Allin Cottrell wrote:
> The problem: some time ago we decided to ease the task of parsing "CSV" by
> deleting quotation marks from each line of input. (We can and do recognize
> string-valued input, but only by determining that it cannot be parsed as
> numeric.) Quotation is sometimes used inconsistently and arbitrarily in
> "CSV" files
I am absolutely no csv fundamentalist (like people who don't accept
semicolons or tabs as column separators), but could you remind us why coping
with CSV files with inconsistent quotation has to be done? Spontaneously I'd
say such files are really the problem of their creators.
> So, I've been working on a revision of our CSV reader in which we "respect"
> quotation in this sense: we do not delete quotation marks in CSV input, and
> if it turns out that all the values in a given column are quoted integers,
> we take that column to be an encoding of a categorical variable.
Except if they're years, I hope... No seriously, doesn't this mess with a lot
of variables that may be only integers but that we usually treat as
quasi-continuous?
Let me try to explain more clearly what I'm up to. Consider the
following CSV fragment:
"x","y"
12,"1"
2,"0"
9,"3"
31,"1"
15,"2"
The data are all integers, but the values in the y column are quoted
while those in the x column are not. As things stand we ignore this
difference by default: both x and y will be considered "properly
numeric" by gretl, the y-quotes being stripped out in a
pre-processing step.
However, in CSV files from various sources, including R's
write.csv(), the presence/absence of quotation in the data is
semantically significant: we are supposed to read only unquoted
values as "properly numeric" and the quoted ones as encodings (of
"factor" variables). That's precisely what the new --respect-quotes
option does.
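
To make the rule concrete, here is a minimal Python sketch of the
classification applied to the fragment above. It is only an illustration
of the idea, not gretl's actual reader: the names classify_columns and
respect_quotes are made up for this example, and it leans on Python's
csv.QUOTE_NONNUMERIC mode, which returns unquoted fields as floats and
quoted fields as strings.

```python
import csv
import io

# The CSV fragment from the message: x is unquoted, y is quoted.
FRAGMENT = '''"x","y"
12,"1"
2,"0"
9,"3"
31,"1"
15,"2"
'''


def is_integer_string(value):
    """Return True for a str that parses as an integer, e.g. "3"."""
    if not isinstance(value, str):
        return False
    try:
        int(value)
    except ValueError:
        return False
    return True


def classify_columns(text, respect_quotes=True):
    """Label each column "numeric", "categorical" or "string".

    Hypothetical helper: with respect_quotes=True, a column whose values
    are all quoted integers is treated as an encoding of a categorical
    variable; with respect_quotes=False the quotes are ignored.
    """
    # QUOTE_NONNUMERIC: unquoted fields come back as float, quoted as str.
    # (An unquoted non-numeric field would raise ValueError here.)
    reader = csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONNUMERIC)
    rows = list(reader)
    header, data = rows[0], rows[1:]

    kinds = {}
    for j, name in enumerate(header):
        column = [row[j] for row in data]
        if respect_quotes and column and all(is_integer_string(v) for v in column):
            kinds[name] = "categorical"
            continue
        try:
            for v in column:
                float(v)  # quoted or not, does it parse as a number?
            kinds[name] = "numeric"
        except ValueError:
            kinds[name] = "string"
    return kinds


print(classify_columns(FRAGMENT))
print(classify_columns(FRAGMENT, respect_quotes=False))
```

With quotes respected this reports x as numeric and y as categorical;
with respect_quotes=False both columns come out as numeric, which
corresponds to the current default of stripping the quotes.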
Once we've shaken the bugs out of the new option I'd like to make
respecting quotation in this way the default, but perhaps add an
--ignore-quotes option to give the old behavior. Why might that be
wanted? Because I'm pretty sure I've seen CSV files where quotation
is used arbitrarily -- even on what are clearly "properly numeric"
fields, and in that case you do want to ignore it.
If a "CSV" file contains truly broken use of quotation (quotes
opened but not closed in a field, use of double-quotes in some
fields and single-quotes in others) then I agree, it's not our job
to try and fix such a mess. But I do think we should try to make
sense of quoted versus unquoted numbers. There's no de jure standard
here, but what R does (and various governmental sources also do) is
a useful de facto standard.
Allin