On Tue, 31 Dec 2013, Sven Schreiber wrote:
> Further information on this: in the original imported file, the
> variable with the next ID number after "nordwest" was called
> "südwest" (with u-umlaut), which was obviously not imported
> correctly; there was some strange sign instead of the umlaut.
> Possibly this got interpreted by gretl as something weird, which
> caused the mess-up.
Text encoding in Stata files is quite primitive. Stata doesn't
support UTF-8, and apparently encodes non-ASCII characters in the
local MS-DOS "code page" -- without, so far as I can tell,
recording the specific encoding used in the .dta file.
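
To illustrate why the missing encoding record matters, here is a
minimal Python sketch (not gretl or Stata code). The code pages listed
are only assumed examples of what a given Stata installation might
have used, and the helper name name_bytes is hypothetical.

```python
def name_bytes(name, codec):
    """Encode a variable name in a given single-byte code page."""
    return name.encode(codec)

name = "südwest"

# The same name yields different bytes depending on the code page.
for codec in ("cp437", "cp850", "cp1252", "latin-1"):
    print(f"{codec:8s} {name_bytes(name, codec).hex()}")

# Bytes written under one code page and read under another come out
# garbled, which is roughly the "strange sign" effect described above.
garbled = name_bytes(name, "cp1252").decode("cp437")
print(garbled)
```

The u-umlaut is byte 0x81 in the two DOS code pages but 0xFC in the
Windows and ISO ones, so a reader that is not told which was used can
only guess.
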
That said, I've now modified the Stata importer in CVS to use the
libgretl function iso_to_ascii() on Stata variable names. Hopefully,
in most cases this should manage to convert "südwest" to "sudwest"
without knowledge of the actual encoding used.
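
For readers who want to experiment, the general idea of folding a
name to plain ASCII can be sketched in Python as below. This is my
own illustration and not the code of iso_to_ascii(); the function
name fold_to_ascii and the default guess of Latin-1 are assumptions
made for the example.

```python
import unicodedata

def fold_to_ascii(raw, assumed_codec="latin-1"):
    """Hypothetical helper: reduce raw name bytes to ASCII letters,
    digits and underscores, guessing the encoding (an assumption)."""
    text = raw.decode(assumed_codec)
    # NFKD splits "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only ASCII characters that are valid in a simple identifier.
    return "".join(
        ch for ch in decomposed
        if ord(ch) < 128 and (ch.isalnum() or ch == "_")
    )

print(fold_to_ascii("südwest".encode("latin-1")))  # prints "sudwest"
print(fold_to_ascii("südwest".encode("cp437")))    # prints "sdwest"
```

The second call shows the limit of guessing: when the bytes come from
a different code page than the one assumed, the accented letter is
dropped rather than mapped to its base letter.
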
(Googling this issue reveals some frustration among Stata users in
non-English locales. It also reveals that a prominent Stata guru
apparently believes that UTF-8 performs the magic of representing
all Unicode characters in an 8-bit encoding, a misapprehension which
does not bode well for Stata's eventually supporting UTF-8.)
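
For the record, UTF-8 is a variable-width encoding: only ASCII fits
in a single byte, and other characters take two to four. A quick
Python check, with the helper name utf8_length being hypothetical:

```python
def utf8_length(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# ASCII letter, u-umlaut, euro sign, and an emoji outside the BMP.
for ch in ("u", "ü", "€", "\U0001F600"):
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```
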
Allin