universal character names

Thu Mar 23 07:53:05 UTC 2000

I could not find an existing issue for this, but I think it needs to
be one. 2.2 [lex.charset]/2 allows universal-character-names in
C++ programs, notably in identifiers and string literals. This raises
two issues:

A) External names for identifiers containing Unicode letters; e.g.

   namespace newmath{
      const long double \u03A0 = 3.14159265358979;
   }

   This is also an issue for C99, so the base ABI may already have a
   specification; if so, we'd have to follow it, at least for
   extern "C" names. If not, I propose that such names be encoded in
   UTF-8.
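
   To illustrate what the UTF-8 proposal would mean for the example
   above, here is a minimal sketch of the encoding step. The helper
   names encode_utf8 and check_capital_pi are hypothetical, not part of
   any ABI, and how the resulting bytes would be embedded in a mangled
   name is left open here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (an assumption for illustration): encode a single
// Unicode code point as UTF-8, the byte form proposed for external names.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes, up to U+10FFFF
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// The identifier \u03A0 would contribute the two bytes 0xCE 0xA0
// to its external name under this proposal.
void check_capital_pi() {
    assert(encode_utf8(0x03A0) == "\xCE\xA0");
}
```

   ASCII identifiers are unchanged by this encoding, so existing names
   would keep their current external form.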

B) Object file representation of narrow and wide string literals
   containing such characters, e.g.

      wchar_t MvL[]=L"Martin von L\u00F6wis";

   First, what is sizeof(wchar_t) in the base ABI? I'll assume 4 for
   the moment. Then the question comes down to: what are the execution
   character set and the wide execution character set? 2.2/3 says
   they are implementation-defined, so I guess we must define
   them. Typically, people expect this to be a run-time setting (which
   is a reasonable assumption), but that breaks down for string
   literals, whose bytes are fixed when the object file is written.

   Proposal: The wide execution character set is UCS-4. The
   execution character set is "as-is", i.e. bytes from the source
   character set are copied unmodified to the object file.
   Universal-character-names appearing in narrow (i.e. char) strings
   are not portable in this ABI. (The alternatives, I guess, would be
   to say they are Latin-1, or that they are encoded as UTF-8.)
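
   A rough sketch of what these choices mean for the example string.
   It assumes a C++11 compiler and uses char32_t as a stand-in for a
   4-byte wchar_t holding UCS-4; the names MvL32, encode_latin1,
   o_umlaut_utf8 and check_o_umlaut are hypothetical and exist only for
   illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: char32_t stands in for a 4-byte wchar_t holding UCS-4.
// Each character, including \u00F6, occupies exactly one 32-bit unit.
static const char32_t MvL32[] = U"Martin von L\u00F6wis";
static_assert(sizeof(MvL32) == 17 * sizeof(char32_t),
              "16 characters plus the terminating zero");

// Latin-1 alternative for narrow strings: code points up to U+00FF map
// to a single byte; anything above is not representable.
bool encode_latin1(std::uint32_t cp, unsigned char& out) {
    if (cp > 0xFF) return false;
    out = static_cast<unsigned char>(cp);
    return true;
}

// UTF-8 alternative for narrow strings: U+00F6 becomes two bytes.
static const unsigned char o_umlaut_utf8[] = {0xC3, 0xB6};

void check_o_umlaut() {
    unsigned char b = 0;
    assert(MvL32[12] == 0x00F6);                   // wide: one UCS-4 unit
    assert(encode_latin1(0x00F6, b) && b == 0xF6); // narrow, Latin-1
    assert(!encode_latin1(0x03A0, b));             // outside Latin-1
}
```

   The two narrow alternatives already disagree on the length of the
   literal, which is why the choice has to be pinned down somewhere.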

Martin