Thanks, Allin, for implementing the thousands-separator detection; I think that is a very
good idea.
What distinguishes Gretl from R, in my view, is its user interface. It means that even
novices, such as first-year students, can work with it and concentrate on statistics
instead of learning (and soon forgetting) R commands.
Especially convenient is the automated data import. Since thousands separators appear
very often in financial data, I think it is very useful for novices and experts alike
to have Gretl do the work of cleaning the data.
So, thanks for making the effort.
Kind regards,
Thorsten
-----Original Message-----
From: gretl-users-bounces(a)lists.wfu.edu [mailto:gretl-users-bounces@lists.wfu.edu] On
behalf of Allin Cottrell
Sent: Sunday, 1 February 2015 23:04
To: Gretl users
Subject: [Gretl-users] thousands separator in delimited-text data
The post from Thorsten Wingenroth at
http://lists.wfu.edu/pipermail/gretl-users/2015-January/010606.html
and some of the follow-ups raised the issue of thousands separators in "CSV"
data files.
I said that such separators (',' in English-speaking locales and '.'
in many others) should never appear in data files intended to be read by computers. I
stand by that, but there's something here that needs attention.
If a supposedly numeric string such as "10,233.45" or "10.233,45"
simply raised an error in gretl's CSV reader that would be OK, in my opinion, but the
complication is that our reader can handle "string-valued" variables, and
almost-numeric fields of this sort will be accepted as string values, which can be quite
confusing.
(Gretl will state that such-and-such variables have been taken as string-valued, but a
hasty user could well miss that.)
I've therefore added some code to gretl's CSV reader which attempts to figure out
if a "non-numeric" field is really a numeric field with thousands separators,
and if so handles it as numeric. This is in CVS and snapshots. If people could test it,
that would be appreciated.
Our initial heuristic for detecting such a field is that it contains nothing but digits,
'.' and ',' (possibly with a leading minus). If both '.' and
',' appear in a given field, we conclude that only the right-most of these
non-digits could be the decimal character and the other might be a thousands separator. In
addition, if two or more instances of comma appear in a given field then comma cannot be
the decimal character but might be a thousands separator, and the same goes for
'.'.
Having guessed at a possible thousands separator in this way, we then check the guess:
it's wrong unless every instance of this character is followed by exactly 3 digits.
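To make the guess-and-check logic concrete, here is a minimal sketch in Python of the heuristic as described above. This is not gretl's actual C code; the function name `guess_thousands_sep` is hypothetical:

```python
import re

def guess_thousands_sep(s):
    """Guess a possible thousands separator in a numeric-looking field.

    Hypothetical sketch of the heuristic described above, not gretl's
    actual implementation. Returns the candidate character, or None.
    """
    # Candidate fields contain nothing but digits, '.' and ',',
    # possibly with a leading minus.
    if not re.fullmatch(r"-?[0-9.,]+", s):
        return None
    body = s.lstrip('-')
    if '.' in body and ',' in body:
        # Only the right-most non-digit could be the decimal character;
        # the other might be a thousands separator.
        cand = '.' if body.rfind(',') > body.rfind('.') else ','
    elif body.count(',') >= 2:
        cand = ','  # two or more commas: ',' cannot be the decimal char
    elif body.count('.') >= 2:
        cand = '.'  # likewise for '.'
    else:
        return None
    # Check the guess: every instance of the candidate character must
    # be followed by exactly 3 digits.
    for chunk in body.split(cand)[1:]:
        digits = re.match(r"[0-9]*", chunk).group(0)
        if len(digits) != 3:
            return None
    return cand
```

For example, `guess_thousands_sep("10.233,45")` yields `'.'`, while a lone comma as in `"1,23"` stays ambiguous (it could be the decimal character) and yields `None`.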
If and only if we get a consistent result from such guessing and checking across all
observations, we make a second pass through the data, stripping out the presumed
thousands separator.
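The second pass might be sketched as follows (again hypothetical Python, assuming the separator has already been validated as consistent across all observations in the first pass; `strip_thousands` is not a gretl function):

```python
def strip_thousands(column, sep):
    """Strip the presumed thousands separator and parse each field.

    Hypothetical sketch of the second pass; assumes `sep` ('.' or ',')
    was validated on the first pass, so the other character, if
    present, is the decimal character.
    """
    out = []
    for field in column:
        cleaned = field.replace(sep, '')
        # Normalize a decimal comma to '.' so float() can parse it.
        cleaned = cleaned.replace(',', '.')
        out.append(float(cleaned))
    return out
```

So a column like `["10.233,45", "-1.000,5"]` with separator `'.'` comes out as the numeric values 10233.45 and -1000.5.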
I've tested this using the "DAX" data file that Thorsten posted, in the four
possible cases:
1) The locale decimal character is '.'; the CSV file uses '.' for
thousands and ',' for decimal.
2) The locale decimal character is '.'; the CSV file uses ',' for
thousands and '.' for decimal.
3) The locale decimal character is ','; the CSV file uses '.' for
thousands and ',' for decimal.
4) The locale decimal character is ','; the CSV file uses ',' for
thousands and '.' for decimal.
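A quick way to see why all four cases can work: the heuristic looks only at the contents of each field, so the locale decimal character never enters into the detection, and both file conventions resolve to the same value (hypothetical sketch, not gretl's code):

```python
# Both conventions for writing the same number, paired with the
# thousands separator the heuristic would identify in each.
cases = [
    ("10.233,45", '.'),  # '.' thousands, ',' decimal (cases 1 and 3)
    ("10,233.45", ','),  # ',' thousands, '.' decimal (cases 2 and 4)
]
for field, sep in cases:
    # Strip the thousands separator, normalize any decimal comma.
    value = float(field.replace(sep, '').replace(',', '.'))
    assert value == 10233.45
```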
All these cases are working OK with Thorsten's data.
Allin
_______________________________________________
Gretl-users mailing list
Gretl-users(a)lists.wfu.edu
http://lists.wfu.edu/mailman/listinfo/gretl-users