universal character names

Thu Apr 6 00:43:57 UTC 2000

>>>>> Martin von Loewis <loewis at informatik.hu-berlin.de> writes:

 > A) External names for identifiers containing unicode letters; e.g.

 >    namespace newmath{
 >       const long double \u03A0 = 3.14159265358979;
 >    }

 >    This is also an issue for C99, so it may be that the base ABI has a
 >    specification; we'd have to follow that at least for extern "C"
 >    names. If not, I propose that such names are encoded in UTF-8.

UTF-8 is inappropriate for mangled names, as it uses byte values > 127 to
encode non-ASCII characters.
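
To make the objection concrete, here is a minimal sketch of UTF-8 encoding
for a single BMP code point. The helper names utf8_encode_bmp and
is_ascii_only are made up for illustration and come from no existing
compiler; surrogate code units are not treated specially.

```cpp
#include <string>

// Hypothetical helper: encode one BMP code point (U+0000..U+FFFF) as UTF-8.
// Surrogates are not rejected; this is only an illustration.
std::string utf8_encode_bmp(char16_t c) {
    std::string out;
    if (c < 0x80) {
        out += static_cast<char>(c);                        // 1 byte, plain ASCII
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));          // lead byte, >= 0xC0
        out += static_cast<char>(0x80 | (c & 0x3F));        // continuation byte
    } else {
        out += static_cast<char>(0xE0 | (c >> 12));         // lead byte, >= 0xE0
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F)); // continuation byte
        out += static_cast<char>(0x80 | (c & 0x3F));        // continuation byte
    }
    return out;
}

// Hypothetical helper: true if every byte of s is in the ASCII range.
bool is_ascii_only(const std::string& s) {
    for (unsigned char b : s)
        if (b > 127) return false;
    return true;
}
```

For the \u03A0 from the example above, utf8_encode_bmp(0x03A0) gives the two
bytes 0xCE 0xA0, both above 127, so is_ascii_only() is false for the result.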

GNU Java encodes names in UTF-8 internally.  For the mangled name, if there
are non-ASCII characters, it adds a 'U' to the beginning and encodes each
such UCS-2 character as _%04x.  See gcc/java/mangle.c.

This assumes that all interesting characters fall within the Basic
Multilingual Plane (code points U+0000 through U+FFFF, i.e. those that fit
in 16 bits); that is a valid assumption for us, since all the extended
characters valid for use in C++ identifiers are part of the BMP.
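
A minimal sketch of the scheme described above, written from that
description only and not taken from gcc/java/mangle.c. The function name
mangle_ucs2_identifier is hypothetical, the input is assumed to be UCS-2
code units already, and the sketch does not deal with names that contain a
literal underscore followed by hex digits, which a real mangler would have
to keep unambiguous.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper, not the real GNU Java mangler.
// Input: an identifier as UCS-2 code units (BMP only).
// Output: an ASCII-only spelling; a leading 'U' is added only when the
// identifier contains at least one non-ASCII character.
std::string mangle_ucs2_identifier(const std::u16string& id) {
    bool has_non_ascii = false;
    for (char16_t c : id) {
        if (c > 127) { has_non_ascii = true; break; }
    }

    std::string out;
    if (has_non_ascii)
        out += 'U';                       // marker: escapes follow

    for (char16_t c : id) {
        if (c <= 127) {
            out += static_cast<char>(c);  // ASCII passes through unchanged
        } else {
            char buf[6];                  // '_' + 4 hex digits + NUL
            std::snprintf(buf, sizeof buf, "_%04x", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

With this sketch, the identifier \u03A0 comes out as "U_03a0", and a plain
ASCII identifier such as "pi" is returned unchanged.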

 > B) Object file representation of narrow and wide string literals
 >    containing such characters, eg.

 >       wchar_t MvL[]=L"Martin von L\u00F6wis";

 >    First, what is sizeof(wchar_t) in the base ABI? I'll assume 4 for
 >    the moment. Then, the question comes down to: What is the execution
 >    character set, and the wide execution character set? 2.2/3 says
 >    they are implementation-defined, so I guess we must define
 >    them. Typically, people expect this to be a run-time setting (which
 >    is a reasonable assumption), but it kind-of breaks for string
 >    literals.

 >    Proposal: The wide execution character set is UCS-4. The
 >    execution-character-set is "as-is", i.e. bytes from the source
 >    character set are copied unmodified to the object
 >    file. Universal-character-names appearing in narrow (ie. char)
 >    strings are not portable in this ABI (the other alternatives would
 >    be to say they are Latin-1, or encoded as UTF-8, I guess).

I have been told that it is inappropriate to assume that wchar_t is always
UCS-4; a suggestion was to convert from UCS-4 to the host locale character
set using iconv(), and then if we're in a wide string, convert to wchar_t
with mbtowc().  This makes sense to me, though of course it requires iconv
to know about UCS-4.
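
A rough sketch of that two-step conversion for a single character, assuming
a POSIX system. The function name ucs4_to_wchar is hypothetical, and the
encoding name "UCS-4BE" is an assumption: it is accepted by some iconv
implementations (glibc, for one) but is not guaranteed everywhere. Stateful
locale encodings would need extra handling that is left out here.

```cpp
#include <climits>
#include <cstdlib>
#include <iconv.h>
#include <langinfo.h>

// Hypothetical helper: convert one UCS-4 code point to wchar_t by going
// through the current locale's multibyte encoding.
// Returns false if any step fails (or if cp is 0).
bool ucs4_to_wchar(unsigned long cp, wchar_t* out) {
    // Serialize the code point as four big-endian bytes for iconv.
    char in[4] = {
        static_cast<char>((cp >> 24) & 0xFF),
        static_cast<char>((cp >> 16) & 0xFF),
        static_cast<char>((cp >> 8) & 0xFF),
        static_cast<char>(cp & 0xFF)
    };

    // Step 1: UCS-4 -> locale character set.
    // "UCS-4BE" is an assumed name; iconv_open fails if it is unknown.
    iconv_t cd = iconv_open(nl_langinfo(CODESET), "UCS-4BE");
    if (cd == (iconv_t)-1) return false;

    char mb[MB_LEN_MAX + 1] = {0};
    char* inp = in;
    char* outp = mb;
    size_t inleft = sizeof in;
    size_t outleft = sizeof mb - 1;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1) return false;   // e.g. not representable in locale

    // Step 2: locale multibyte sequence -> wchar_t.
    std::mbtowc(nullptr, nullptr, 0);     // reset any shift state
    int n = std::mbtowc(out, mb, static_cast<size_t>(outp - mb));
    return n > 0;
}
```

A caller would normally run setlocale(LC_ALL, "") first so that
nl_langinfo(CODESET) and mbtowc() reflect the user's locale. A character the
locale cannot represent (for example \u00F6 in a plain "C" locale) makes the
iconv step fail, which is one practical cost of this approach.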

Jason