On 01.01.2014 02:57, Allin Cottrell wrote:
On Tue, 31 Dec 2013, Sven Schreiber wrote:
> Further information on this: in the original imported file, the variable
> with the next ID number after "nordwest" was called "südwest" (with
> u-umlaut), which was obviously not imported correctly; there was some
> strange sign instead of the umlaut. Possibly this got interpreted by
> gretl as something weird, which caused the mess-up.
Text encoding in Stata files is quite primitive. Stata doesn't support
UTF-8, and apparently encodes non-ASCII characters in the local MS-DOS
"code page" -- without, so far as I can tell, recording the specific
encoding used in the .dta file.
That said, I've now modified the Stata importer in CVS to use the
libgretl function iso_to_ascii() on Stata variable names. Hopefully, in
most cases this should manage to convert "südwest" to "sudwest" without
knowledge of the actual encoding used.
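
To make the idea concrete, here is a minimal Python sketch of converting a name to ASCII without knowing its encoding. It is not the libgretl iso_to_ascii() code; the function name to_ascii_guess is made up for illustration, and the sketch simply assumes the bytes can be read as ISO-8859-1.

```python
import unicodedata

def to_ascii_guess(raw: bytes) -> str:
    """Best-effort ASCII version of a name stored in an unknown 8-bit encoding."""
    # Assumption: read the bytes as ISO-8859-1, in which every byte value decodes.
    text = raw.decode("latin-1")
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep plain ASCII only; combining marks and everything else are dropped.
    return "".join(ch for ch in decomposed if ord(ch) < 128)

print(to_ascii_guess("südwest".encode("latin-1")))  # sudwest
print(to_ascii_guess("südwest".encode("cp850")))    # sdwest
```

The second call shows why this can only work "in most cases": in the DOS code page 850 the u-umlaut is stored as byte 0x81, which the ISO-8859-1 reading treats as a control character, so the letter is lost rather than converted.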
Thanks, this should help in most cases. According to the table on
http://en.wikipedia.org/wiki/Windows-1252 I guess that the remaining
problematic letters would be the "oe" ligature (Unicode U+0152 and U+0153),
and s and z with "inverted hats", i.e. carons (Unicode U+0160, U+0161,
U+017D, U+017E). I don't know if these are valid characters in the
respective local Stata versions.
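
A small Python check of the point about Windows-1252, offered as a sketch only: the six letters listed above have a byte in Windows-1252 but none in ISO-8859-1, because they sit in the 0x80-0x9F range that ISO-8859-1 reserves for control characters. The helper name in_cp1252_only is hypothetical.

```python
def in_cp1252_only(ch: str) -> bool:
    """True if ch has a Windows-1252 byte but no ISO-8859-1 byte."""
    try:
        ch.encode("latin-1")
        return False  # representable in ISO-8859-1 as well
    except UnicodeEncodeError:
        pass
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False  # in neither encoding

for ch in "ŒœŠšŽž":
    byte = ch.encode("cp1252")[0]
    print(f"U+{ord(ch):04X} is byte 0x{byte:02X} in cp1252:", in_cp1252_only(ch))
```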
Another primitive-but-all-purpose solution might be to replace any
non-ASCII characters with placeholders like "__"?
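
A rough Python sketch of that placeholder idea, working on the raw bytes so that no encoding needs to be known. The name placeholder_name and the default "__" are illustrative assumptions, not anything that exists in gretl.

```python
def placeholder_name(raw: bytes, placeholder: str = "__") -> str:
    """Replace every non-ASCII byte of a variable name with a placeholder."""
    # Bytes below 128 are plain ASCII and are kept unchanged.
    return "".join(chr(b) if b < 128 else placeholder for b in raw)

print(placeholder_name("südwest".encode("cp1252")))  # s__dwest
print(placeholder_name("südwest".encode("cp850")))   # s__dwest
```

One drawback: different names can collapse to the same result (for example "südwest" and "sädwest" would both become "s__dwest"), so an importer using this would still have to check for duplicate names.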
cheers,
sven