universal character names
Jason Merrill
jason at cygnus.com
Thu Apr 6 00:43:57 UTC 2000
>>>>> Martin von Loewis <loewis at informatik.hu-berlin.de> writes:
> A) External names for identifiers containing unicode letters; e.g.
> namespace newmath{
> const long double \u03A0 = 3.14159265358979;
> }
> This is also an issue for C99, so it may be that the base ABI has a
> specification; we'd have to follow that at least for extern "C"
> names. If not, I propose that such names are encoded in UTF-8.
UTF-8 is inappropriate for mangled names, as it uses byte values > 127 to
encode non-ASCII characters.
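To make the objection concrete, here is a minimal sketch that encodes a code point as UTF-8 and checks for bytes above 127. The helper names utf8_encode_bmp and has_high_bytes are made up for illustration, and the encoder only handles the Basic Multilingual Plane without rejecting surrogates.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one BMP code point as UTF-8.
// Surrogate values are not rejected; this is only an illustration.
std::string utf8_encode_bmp(char32_t cp) {
    assert(cp <= 0xFFFF);  // sketch is limited to the BMP
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte, ASCII
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2-byte lead
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3-byte lead
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True if any byte of s lies outside the 7-bit ASCII range.
bool has_high_bytes(const std::string& s) {
    for (unsigned char b : s)
        if (b > 127) return true;
    return false;
}
```

For \u03A0 from the example above, this yields the two bytes 0xCE 0xA0, both above 127, so a symbol name carrying them would no longer be plain ASCII.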
GNU Java encodes names in UTF-8 internally. For the mangled name, if there
are non-ASCII characters, it adds a 'U' to the beginning and encodes each
such UCS-2 character as _%04x. See gcc/java/mangle.c.
This assumes that all interesting characters fall within the Basic
Multilingual Plane (the low 16 bits); that is a valid assumption for us,
since all the extended characters valid for use in C++ identifiers are part
of the BMP.
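A rough sketch of that scheme as described here, not a copy of gcc/java/mangle.c: the function name mangle_identifier is hypothetical, the input is taken as a sequence of code points assumed to lie in the BMP, and any length prefix or other decoration the real mangler adds is left out.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper following the description above: prefix 'U' if the
// name has any non-ASCII character, and write each such character as
// "_%04x". ASCII characters are copied through unchanged.
std::string mangle_identifier(const std::u32string& name) {
    bool has_non_ascii = false;
    for (char32_t c : name) {
        assert(c <= 0xFFFF);  // assumption from the text: BMP only
        if (c > 127) has_non_ascii = true;
    }

    std::string out;
    if (has_non_ascii) out += 'U';
    for (char32_t c : name) {
        if (c <= 127) {
            out += static_cast<char>(c);
        } else {
            char buf[8];
            std::snprintf(buf, sizeof buf, "_%04x", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

With this sketch, the identifier \u03A0 from the example comes out as U_03a0, which is pure ASCII. The sketch does not address how a literal underscore followed by hex digits in the source name would be told apart from an escape.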
> B) Object file representation of narrow and wide string literals
> containing such characters, e.g.
> wchar_t MvL[]=L"Martin von L\u00F6wis";
> First, what is sizeof(wchar_t) in the base ABI? I'll assume 4 for
> the moment. Then, the question comes down to: What is the execution
> character set, and the wide execution character set? 2.2/3 says
> they are implementation-defined, so I guess we must define
> them. Typically, people expect this to be a run-time setting (which
> is a reasonable assumption), but it kind-of breaks for string
> literals.
> Proposal: The wide execution character set is UCS-4. The
> execution-character-set is "as-is", i.e. bytes from the source
> character set are copied unmodified to the object
> file. Universal-character-names appearing in narrow (i.e. char)
> strings are not portable in this ABI (the other alternatives would
> be to say they are Latin-1, or encoded as UTF-8, I guess).
I have been told that it is inappropriate to assume that wchar_t is always
UCS-4; a suggestion was to convert from UCS-4 to the host locale character
set using iconv(), and then if we're in a wide string, convert to wchar_t
with mbtowc(). This makes sense to me, though of course it requires iconv
to know about UCS-4.
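The two steps might look roughly like the sketch below. The function names ucs4_to_charset and locale_bytes_to_wchar are made up for illustration, the target charset is passed in explicitly rather than taken from the locale, and the encoding name "UCS-4BE" is assumed to be known to the local iconv, which is exactly the dependency noted above.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <iconv.h>

// Step 1 (hypothetical helper): convert one UCS-4 code point to the
// charset named by tocode using iconv().
std::string ucs4_to_charset(char32_t cp, const char* tocode) {
    assert(tocode != nullptr);
    // Serialize big-endian so the source encoding name is unambiguous.
    char in[4] = { static_cast<char>((cp >> 24) & 0xFF),
                   static_cast<char>((cp >> 16) & 0xFF),
                   static_cast<char>((cp >> 8) & 0xFF),
                   static_cast<char>(cp & 0xFF) };
    iconv_t cd = iconv_open(tocode, "UCS-4BE");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    char out[16];
    char* inp = in;
    size_t inleft = sizeof in;
    char* outp = out;
    size_t outleft = sizeof out;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("character not convertible");
    return std::string(out, sizeof out - outleft);
}

// Step 2 (hypothetical helper), for wide strings only: decode the
// locale-encoded bytes to a wchar_t with mbtowc(), which interprets them
// according to the current LC_CTYPE locale.
wchar_t locale_bytes_to_wchar(const std::string& bytes) {
    wchar_t wc = 0;
    std::mbtowc(nullptr, nullptr, 0);  // reset any shift state
    int n = std::mbtowc(&wc, bytes.data(), bytes.size());
    if (n < 0)
        throw std::runtime_error("mbtowc failed");
    return wc;
}
```

In a compiler, tocode would presumably come from the host locale, for instance nl_langinfo(CODESET) after setlocale(LC_ALL, ""). A character the target charset cannot represent surfaces here as an exception, and what to do in that case is a policy question the sketch leaves open.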
Jason