Thanks, Allin, for implementing the thousands-separator detection; I think that is a very
good idea.
What distinguishes Gretl from R, in my view, is its user interface. It means that even
novices, such as first-year students, can work with it and concentrate on statistics
instead of learning (and soon forgetting) R commands.
Especially convenient is the automated data import. Since thousands separators appear
very often in financial data, I think it is very useful for novices and experts alike
to have Gretl do the work of cleaning the data.
So, thanks for making the effort.
Kind regards,
Thorsten
-----Original Message-----
From: gretl-users-bounces(a)lists.wfu.edu [mailto:gretl-users-bounces@lists.wfu.edu] On
behalf of Allin Cottrell
Sent: Sunday, 1 February 2015 23:04
To: Gretl users
Subject: [Gretl-users] thousands separator in delimited-text data
The post from Thorsten Wingenroth at
http://lists.wfu.edu/pipermail/gretl-users/2015-January/010606.html
and some of the follow-ups raised the issue of thousands separators in "CSV"
data files.
I said that such separators (',' in English-speaking locales and '.'
in many others) should never appear in data files intended to be read by computers. I
stand by that, but there's something here that needs attention.
If a supposedly numeric string such as "10,233.45" or "10.233,45"
simply raised an error in gretl's CSV reader that would be OK, in my opinion, but the
complication is that our reader can handle "string-valued" variables, and
almost-numeric fields of this sort will be accepted as string values, which can be quite
confusing.
(Gretl will state that such-and-such variables have been taken as string-valued, but a
hasty user could well miss that.)
I've therefore added some code to gretl's CSV reader which attempts to figure out
if a "non-numeric" field is really a numeric field with thousands separators,
and if so handles it as numeric. This is in CVS and snapshots. If people could test it,
that would be appreciated.
Our initial heuristic for detecting such a field is that it contains nothing but digits,
'.' and ',' (possibly with a leading minus). If both '.' and
',' appear in a given field, we conclude that only the right-most of these
non-digits could be the decimal character and the other might be a thousands separator. In
addition, if two or more instances of comma appear in a given field then comma cannot be
the decimal character but might be a thousands separator, and the same goes for
'.'.
Having guessed at a possible thousands separator in this way, we then check the guess:
it's wrong unless every instance of this character is followed by exactly 3 digits.
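To make the guess-and-check logic concrete, here is a minimal sketch in Python of the heuristic as described above. This is not gretl's actual C code; the function name `guess_thousands_sep` is hypothetical:

```python
import re

def guess_thousands_sep(s):
    """Guess a possible thousands separator in a numeric-looking field.

    Hypothetical sketch of the heuristic described above, not gretl's
    actual implementation. Returns the candidate character, or None.
    """
    # Candidate fields contain nothing but digits, '.' and ',',
    # possibly with a leading minus.
    if not re.fullmatch(r"-?[0-9.,]+", s):
        return None
    body = s.lstrip('-')
    if '.' in body and ',' in body:
        # Only the right-most non-digit could be the decimal character;
        # the other might be a thousands separator.
        cand = '.' if body.rfind(',') > body.rfind('.') else ','
    elif body.count(',') >= 2:
        cand = ','  # two or more commas: ',' cannot be the decimal char
    elif body.count('.') >= 2:
        cand = '.'  # likewise for '.'
    else:
        return None
    # Check the guess: every instance of the candidate character must
    # be followed by exactly 3 digits.
    for chunk in body.split(cand)[1:]:
        digits = re.match(r"[0-9]*", chunk).group(0)
        if len(digits) != 3:
            return None
    return cand
```

For example, `guess_thousands_sep("10.233,45")` yields `'.'`, while a lone comma as in `"1,23"` stays ambiguous (it could be the decimal character) and yields `None`.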
If and only if we get a consistent result from such guessing and checking across all
observations, we make a second pass through the data, stripping out the presumed
thousands separator.
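The second pass might be sketched as follows (again hypothetical Python, assuming the separator has already been validated as consistent across all observations in the first pass; `strip_thousands` is not a gretl function):

```python
def strip_thousands(column, sep):
    """Strip the presumed thousands separator and parse each field.

    Hypothetical sketch of the second pass; assumes `sep` ('.' or ',')
    was validated on the first pass, so the other character, if
    present, is the decimal character.
    """
    out = []
    for field in column:
        cleaned = field.replace(sep, '')
        # Normalize a decimal comma to '.' so float() can parse it.
        cleaned = cleaned.replace(',', '.')
        out.append(float(cleaned))
    return out
```

So a column like `["10.233,45", "-1.000,5"]` with separator `'.'` comes out as the numeric values 10233.45 and -1000.5.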
I've tested this using the "DAX" data file that Thorsten posted, in the four
possible cases:
1) The locale decimal character is '.'; the CSV file uses '.' for
thousands and ',' for decimal.
2) The locale decimal character is '.'; the CSV file uses ',' for
thousands and '.' for decimal.
3) The locale decimal character is ','; the CSV file uses '.' for
thousands and ',' for decimal.
4) The locale decimal character is ','; the CSV file uses ',' for
thousands and '.' for decimal.
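A quick way to see why all four cases can work: the heuristic looks only at the contents of each field, so the locale decimal character never enters into the detection, and both file conventions resolve to the same value (hypothetical sketch, not gretl's code):

```python
# Both conventions for writing the same number, paired with the
# thousands separator the heuristic would identify in each.
cases = [
    ("10.233,45", '.'),  # '.' thousands, ',' decimal (cases 1 and 3)
    ("10,233.45", ','),  # ',' thousands, '.' decimal (cases 2 and 4)
]
for field, sep in cases:
    # Strip the thousands separator, normalize any decimal comma.
    value = float(field.replace(sep, '').replace(',', '.'))
    assert value == 10233.45
```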
All these cases are working OK with Thorsten's data.
Allin
_______________________________________________
Gretl-users mailing list
Gretl-users(a)lists.wfu.edu
http://lists.wfu.edu/mailman/listinfo/gretl-users