Re: [Gretl-devel] readline() and non utf8

Friday, 18 October 2013

On Fri, 18 Oct 2013, Ignacio Diaz-Emparanza wrote:

...
 I read in the gretl changelog this

 "readfile() function: check for valid UTF-8 and recode if
  possible if the text is not UTF-8"

 But I see this is not properly working. I tried to open the attached file 
 (ISO-8859) but in running the 'readline' command I obtain the error message

 "Invalid byte sequence in conversion input"

The question is, what codeset are we to assume we're converting from?

To date we have assumed that if text is not UTF-8 it will be in the 
encoding of the current locale, and have used the GLib function 
g_locale_to_utf8() on the imported text. This will fail if your locale 
codeset is in fact UTF-8, since the text has already been found not to 
be valid UTF-8. I presume that is what's happening in your case.
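
The failure can be reproduced outside gretl. What follows is a minimal 
Python sketch, not gretl's C code: it uses Python's own decoders in place 
of GLib, and the sample string and the helper name decodes_as() are 
assumptions made for illustration.

```python
# "España" encoded as ISO-8859-15: the n-tilde is the single byte 0xF1.
latin_bytes = "España".encode("iso8859-15")


def decodes_as(raw, codeset):
    """Return True if raw is a valid byte sequence in the given codeset."""
    try:
        raw.decode(codeset)
        return True
    except UnicodeDecodeError:
        return False


# Under a UTF-8 locale, "convert from the locale codeset" amounts to
# "interpret the bytes as UTF-8", which is exactly what this text is not.
utf8_ok = decodes_as(latin_bytes, "utf-8")
latin_ok = decodes_as(latin_bytes, "iso8859-15")
print(utf8_ok, latin_ok)
```

The byte 0xF1 announces a multi-byte UTF-8 sequence, but the byte after it 
is a plain ASCII letter, so a strict UTF-8 decoder has to reject the input.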

I've now made this a little smarter in CVS. First we check if the current 
locale codeset is UTF-8. If so, we avoid using g_locale_to_utf8() and 
instead use g_convert(), guessing at ISO-8859-15 as the source codeset.
If the current locale codeset is _not_ UTF-8 we try using that as the 
source encoding; but if that fails we try ISO-8859-15.
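
The order of attempts described above can be sketched as follows. This is 
a rough Python illustration of the logic only, not the code that went into 
CVS (which is C and uses GLib); the function names and the locale_codeset 
parameter are hypothetical.

```python
import codecs
import locale

FALLBACK_CODESET = "iso8859-15"  # the guess used when nothing better is known


def is_utf8_codeset(name):
    """Return True if name is some spelling of UTF-8 ("UTF-8", "utf8", ...)."""
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        return False


def recode_to_utf8(raw, locale_codeset=None):
    """Decode raw bytes, trying UTF-8, then the locale codeset, then the fallback."""
    # Step 0: if the bytes are already valid UTF-8, no recoding is needed.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    if locale_codeset is None:
        locale_codeset = locale.getpreferredencoding(False)

    # Step 1: only try the locale codeset if it is not UTF-8; we already
    # know that a UTF-8 interpretation fails.
    if not is_utf8_codeset(locale_codeset):
        try:
            return raw.decode(locale_codeset)
        except (UnicodeDecodeError, LookupError):
            pass

    # Step 2: last resort, guess ISO-8859-15.
    return raw.decode(FALLBACK_CODESET)
```

A possible call, passing the locale codeset explicitly so that the result 
does not depend on the machine it runs on:

```python
text = recode_to_utf8(b"Espa\xf1a", locale_codeset="UTF-8")
```

Note that the final step is still a guess: text in some other 8-bit 
encoding may well convert without error and yet come out with the wrong 
characters.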

Perhaps we should offer an optional second argument to readfile(), 
allowing the user to specify the source codeset.
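
To make the idea concrete, here is a small Python sketch of what such an 
optional argument might look like. It is purely illustrative: read_text() 
and its codeset parameter are hypothetical names, not an existing or 
planned gretl interface, and for simplicity the sketch insists on UTF-8 
when no codeset is given instead of guessing.

```python
def read_text(path, codeset=None):
    """Read a file as text, optionally converting from a stated codeset.

    Hypothetical stand-in for a readfile() with a second argument.
    """
    with open(path, "rb") as f:
        raw = f.read()
    # An explicit codeset takes the guesswork out of the conversion;
    # without one, this sketch accepts only valid UTF-8.
    return raw.decode(codeset if codeset is not None else "utf-8")
```

With an explicit codeset the caller, who is best placed to know where the 
file came from, takes responsibility for naming the encoding.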

By the way, the correspondence on this topic illustrates how tricky this 
whole business is. Ignacio attached a file which he said was in ISO-8859, 
but in fact what came across via email was an ASCII file with question 
marks in place of accented characters. Helio gave an inline example of a 
file that was again supposed to be in ISO-8859-15, but what came across 
via email here was in fact UTF-8.

If you want to illustrate anything to do with codesets via email, it's 
necessary to zip or tar the files in question; otherwise you have no idea 
what your reader is going to see!

Allin
