The post from Thorsten Wingenroth at
http://lists.wfu.edu/pipermail/gretl-users/2015-January/010606.html
and some of the follow-ups raised the issue of thousands separators
in "CSV" data files.
I said that such separators (',' in English-speaking locales and '.'
in many others) should never appear in data files intended to be
read by computers. I stand by that, but there's something here that
needs attention.
If a supposedly numeric string such as "10,233.45" or "10.233,45"
simply raised an error in gretl's CSV reader that would be OK, in my
opinion, but the complication is that our reader can handle
"string-valued" variables, and almost-numeric fields of this sort
will be accepted as string values, which can be quite confusing.
(Gretl will state that such-and-such variables have been taken as
string-valued, but a hasty user could well miss that.)
I've therefore added some code to gretl's CSV reader which attempts
to figure out if a "non-numeric" field is really a numeric field
with thousands separators, and if so handles it as numeric. This is
in CVS and snapshots. If people could test it, that would be
appreciated.
Our initial heuristic for detecting such a field is that it contains
nothing but digits, '.' and ',' (possibly with a leading minus). If
both '.' and ',' appear in a given field, we conclude that only the
right-most of these non-digits could be the decimal character and
the other might be a thousands separator. In addition, if two or
more instances of comma appear in a given field then comma cannot be
the decimal character but might be a thousands separator, and the
same goes for '.'.
Having guessed at a possible thousands separator in this way, we
then check the guess: it's wrong unless every instance of this
character is followed by exactly 3 digits.
If and only if we get a consistent result from such guessing and
checking across all observations, we make a second pass though the
data, stripping out the presumed thousands separator.
I've tested this using the "DAX" data file that Thorsten posted, in
the four possible cases:
1) The locale decimal character is '.'; the CSV file uses '.' for
thousands and ',' for decimal.
2) The locale decimal character is '.'; the CSV file uses ',' for
thousands and '.' for decimal.
3) The locale decimal character is ','; the CSV file uses '.' for
thousands and ',' for decimal.
4) The locale decimal character is ','; the CSV file uses ',' for
thousands and '.' for decimal.
All these cases are working OK with Thorsten's data.
Allin