On Tue, 8 Sep 2009, Allin Cottrell wrote:
> A word to the wise: if you want filenames, and names of objects
> inside files such as sheets within a "workbook", to be portable,
> then use ASCII characters only for such names...
Let me expand on this just a little.
For anyone who lives in a locale where accented Roman characters
-- or non-Roman characters for that matter -- are part of everyday
life, these characters will seem "natural" and limiting oneself to
the US ASCII character-set will seem artificial.
But the facts of life are that, for historical reasons, (a) every
modern computer "understands" ASCII just fine, while (b) there is
(sadly) no universal standard for the representation of non-ASCII
characters. For ASCII, each 'a', 'A', 'b', 'B', ... corresponds
to a well-known pattern of 0s and 1s in an 8-bit field, while for
non-ASCII characters -- i-acute, s-cedilla, ogonek, whatever --
the pattern of 0s and 1s, as well as the size of the field in
which they are to be found, depends on the operating system and
the application software in use.
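To make the point about bit patterns concrete, here is a small
Python sketch (my own illustration; the choice of character and of
encodings is just an example, nothing specific to gretl). It prints
the bytes that several encodings use for one and the same
s-cedilla, and checks that a plain 'a' comes out identical in all
of the 8-bit encodings tried.

```python
# Illustrative sketch: the character and the list of encodings are
# examples chosen for this note, not anything specific to gretl.
ch = "\u015f"  # s-cedilla

encodings = ["iso8859_2", "iso8859_9", "utf-8", "utf-16-le", "utf-32-le"]

# Map each encoding name to the raw bytes it produces for the same character.
patterns = {enc: ch.encode(enc) for enc in encodings}

for enc, raw in patterns.items():
    print(f"{enc:10s} {len(raw)} byte(s): {raw.hex()}")

# By contrast, an ASCII letter is the same single byte in each of these
# byte-oriented encodings.
ascii_bytes = {
    enc: "a".encode(enc)
    for enc in ["ascii", "iso8859_2", "iso8859_9", "utf-8"]
}
print(ascii_bytes)
```

The size of the field varies from one byte to four, and even the
two one-byte ISO encodings disagree on which byte means s-cedilla.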
There's "Unicode", but at the level of implementation in software
this is not so much a standard as a set of competing encodings:
UTF-8, UTF-16 and UTF-32, each with or without a byte-order mark
(BOM)... And not all software uses any variant of Unicode; older
ad hoc encodings are still in use, some of these ISO standards and
some conforming to no agreed standard (Microsoft).
In the view of many people (not all), the UTF-8 encoding
represents the best (most elegant and economical) way of extending
the binary encoding of characters beyond ASCII, to encompass all
of the world's languages. (UTF-8 was devised by Ken Thompson,
together with Rob Pike; Thompson also brought us Unix. It contains
ASCII as a subset.) UTF-8 is used on most modern Linux and Mac
systems, among others, but it is not the native encoding on
Microsoft Windows; moreover not all software on Linux and Mac
uses UTF-8.
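A short Python sketch of the "ASCII as a subset" property, and of
what goes wrong when UTF-8 bytes are read by software that assumes
a Windows code page. The names used are hypothetical, and cp1252
is just one example of a one-byte-per-character encoding.

```python
# Hypothetical names, used only for illustration.
plain = "Sheet1"
accented = "donn\u00e9es"  # "donnees" with an e-acute

# For pure ASCII text, the ASCII bytes and the UTF-8 bytes are identical.
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# A non-ASCII character takes more than one byte in UTF-8 ...
utf8_bytes = accented.encode("utf-8")

# ... so a reader that assumes a one-byte-per-character Windows code page
# sees two wrong characters in place of the single accented one.
misread = utf8_bytes.decode("cp1252")

print(same_bytes, len(accented), len(utf8_bytes), misread)
```

The ASCII-only name survives untouched whichever of the two
encodings the reader assumes; the accented one does not.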
It's a perennial struggle to get the encoding and re-coding
right when switching back and forth between UTF-8 and other
representations of non-ASCII characters. In gretl, we've devoted
a lot of time to trying to get this right for filenames and in the
context of gnuplot graphs. I think we're OK for the most part,
though every now and then a new encoding bug crops up.
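To give an idea of what such re-coding involves, here is a minimal
Python sketch. It is not gretl's actual code; the function name
recode_to_utf8 and the default fallback of cp1252 are assumptions
made for the example. The idea is: accept the bytes if they are
already valid UTF-8, otherwise guess a legacy encoding and convert.

```python
def recode_to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw as UTF-8 bytes, guessing `fallback` if it is not UTF-8.

    Hypothetical helper for illustration; the fallback is only a guess,
    and a wrong guess silently produces wrong characters.
    """
    try:
        raw.decode("utf-8")  # strict check: raises on invalid UTF-8
        return raw
    except UnicodeDecodeError:
        # Bytes undefined in the fallback encoding become U+FFFD.
        return raw.decode(fallback, errors="replace").encode("utf-8")


legacy = "donn\u00e9es".encode("latin-1")   # e-acute stored as one byte
already = "donn\u00e9es".encode("utf-8")    # e-acute stored as two bytes

print(recode_to_utf8(legacy) == already)    # legacy bytes get converted
print(recode_to_utf8(already) == already)   # valid UTF-8 passes through
print(recode_to_utf8(b"Sheet1"))            # ASCII is unaffected
```

The weak point is the fallback: the bytes alone do not say which
legacy encoding produced them, which is where the bugs come from.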
When it comes to getting this difficult task right for names of
objects inside spreadsheets, in my view we've reached something of
quite low priority. Just refer to sheets by number if the names
are not recognized.
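As a rough Python sketch of the two workarounds (ASCII-only names,
or selection by number): the helpers is_portable and pick_sheet
and the sheet names are hypothetical, and counting sheets from 1
is an assumption of the example.

```python
def is_portable(name: str) -> bool:
    """True if the name uses ASCII characters only (Python 3.7+)."""
    return name.isascii()


def pick_sheet(sheet_names, number):
    """Return the sheet at a 1-based position, whatever its name is."""
    if not 1 <= number <= len(sheet_names):
        raise IndexError(f"no sheet number {number}")
    return sheet_names[number - 1]


# Hypothetical workbook contents.
sheets = ["Data", "donn\u00e9es", "Notes"]

risky = [s for s in sheets if not is_portable(s)]  # names that may not travel
second = pick_sheet(sheets, 2)                     # no name matching needed

print(risky, second)
```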
Allin.