On Fri, 18 Oct 2013, Ignacio Diaz-Emparanza wrote:
> I read in the gretl changelog this
> "readfile() function: check for valid UTF-8 and recode if
> possible if the text is not UTF-8"
> But I see this is not properly working. I tried to open the attached file
> (ISO-8859) but in running the 'readline' command I obtain the error message
> "Invalid byte sequence in conversion input"
The question is, what codeset are we to assume we're converting from?
To date we have assumed that if text is not UTF-8 it will be in the
encoding of the current locale, and have used the GLib function
g_locale_to_utf8() on the imported text. This will fail if your locale
codeset is in fact UTF-8, since the function then treats the input as
UTF-8 and rejects the bytes that are not valid UTF-8. I presume that is
what's happening in your case.
I've now made this a little smarter in CVS. First we check if the current
locale codeset is UTF-8. If so, we avoid using g_locale_to_utf8() and
instead use g_convert(), guessing at ISO-8859-15 as the source codeset.
If the current locale codeset is _not_ UTF-8 we try using that as the
source encoding; but if that fails we try ISO-8859-15.
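For illustration, the fallback order just described can be sketched as
below. This is a Python analogue and not the C code in gretl: the names
to_utf8_text() and is_utf8_codeset() are made up for the example, and
Python's bytes.decode() stands in for g_locale_to_utf8() and g_convert().

```python
import codecs
import locale

FALLBACK = "iso-8859-15"  # the guess used when nothing better is known


def is_utf8_codeset(name):
    """Return True if the codeset name is some spelling of UTF-8."""
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        return False


def to_utf8_text(raw, locale_codeset=None):
    """Decode raw bytes, following the order described above.

    Hypothetical helper: valid UTF-8 is accepted as-is; otherwise the
    locale codeset is tried (unless it is itself UTF-8), and finally
    ISO-8859-15 is assumed.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    if locale_codeset is None:
        locale_codeset = locale.getpreferredencoding(False)
    if not is_utf8_codeset(locale_codeset):
        try:
            return raw.decode(locale_codeset)
        except (UnicodeDecodeError, LookupError):
            pass
    # Every byte value is mapped in Python's ISO-8859-15 codec,
    # so this last step cannot raise.
    return raw.decode(FALLBACK)
```

Note that the final step always "succeeds", which is exactly why it is
only a guess: text in some other 8-bit codeset will be converted without
error but may come out with the wrong characters.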
Perhaps we should offer an optional second argument to readfile(),
allowing the user to specify the source codeset.
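As a rough picture of what such an argument might look like, here is a
small Python sketch. The function read_text() and its "codeset" parameter
are hypothetical and say nothing about how readfile() would actually
spell this; the point is only that an explicit codeset bypasses the
guessing.

```python
def read_text(path, codeset=None):
    """Read a file as text (hypothetical helper for illustration).

    If codeset is given it is trusted and used directly; otherwise
    UTF-8 is tried first and ISO-8859-15 is assumed as a fallback.
    """
    with open(path, "rb") as f:
        raw = f.read()
    if codeset is not None:
        # The caller knows best: no guessing, and errors are not hidden.
        return raw.decode(codeset)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-15")
```

With an explicit codeset a wrong choice by the user shows up as an error
or as visibly wrong characters, rather than being papered over by the
fallback.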
By the way, the correspondence on this topic illustrates how tricky this
whole business is. Ignacio attached a file which he said was in ISO-8859,
but in fact what came across via email was an ASCII file with question
marks in place of accented characters. Helio gave an inline example of a
file that was again supposed to be in ISO-8859-15, but what came across
via email here was in fact UTF-8.
If you want to illustrate anything to do with codesets via email, it's
necessary to zip or tar the files in question; otherwise you have no idea
what your reader is going to see!
Allin