Implementing Universal Character Names in identifiers

Mon Oct 28 10:39:00 GMT 2002

On Mon, Oct 28, 2002 at 09:53:35AM +0100, Martin v. Löwis wrote:
> > My plan for general extended-character-encoding support is to
> > convert to UTF-8 and process that representation; that plus iconv
> > plus some glue and heuristics will get us most of the way there.
>
> Notice that this might be difficult to incorporate into the
> parser. Parsing extended characters will require maintenance of a
> shift state (mbstate_t); iconv does not directly expose the
> mbstate. So you have to carefully keep the mbstate_t and the iconv_t
> synchronized.
>
> Alternatively, you could even use iconv to split the input into
> individual characters, and then perform parsing on the iconv result
> (conversion to UTF-8 might be appropriate); but that would be a
> significant change.

The plan is to implement this /as if/ the entire file is run through
iconv(3) conversion to UTF-8 before the parser ever sees it. (The
actual implementation may do it on the fly when the parser encounters
nonwhitespace characters outside the 0x20-0x7f range.) I won't be
using any of the <wchar.h> interfaces.

http://gcc.gnu.org/projects/cpplib.html#charset contains some
discussion of the plan - comments would be appreciated.
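
A minimal sketch of the "as if" model described above, assuming the source charset name is already known. iconv_open, iconv and iconv_close are the standard iconv(3) calls; to_utf8 is a hypothetical helper name, not anything in cpplib, and a real implementation would convert incrementally rather than buffer a whole file.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert LEN bytes of IN from charset FROM into a
   freshly malloc'd, NUL-terminated UTF-8 buffer.  Returns NULL on any
   failure (unknown charset, invalid input, out of memory).  */
char *
to_utf8 (const char *from, const char *in, size_t len)
{
  assert (from != NULL && in != NULL);

  iconv_t cd = iconv_open ("UTF-8", from);
  if (cd == (iconv_t) -1)
    return NULL;

  size_t cap = len + 16;
  size_t used = 0;
  char *out = malloc (cap);
  char *inp = (char *) in;      /* iconv(3) takes a non-const char **.  */
  size_t inleft = len;

  while (out != NULL)
    {
      char *outp = out + used;
      size_t outleft = cap - used - 1;   /* Keep room for the NUL.  */
      size_t r = iconv (cd, &inp, &inleft, &outp, &outleft);
      used = (size_t) (outp - out);
      if (r != (size_t) -1)
        break;                  /* All input consumed.  */
      if (errno != E2BIG)
        {
          /* EILSEQ or EINVAL: the input is not valid in FROM.  */
          free (out);
          out = NULL;
          break;
        }
      /* Output buffer full: grow it and continue where we stopped.  */
      cap *= 2;
      char *tmp = realloc (out, cap);
      if (tmp == NULL)
        free (out);
      out = tmp;
    }

  /* UTF-8 is stateless, so no final flush call is needed for the output.  */
  if (out != NULL)
    out[used] = '\0';
  iconv_close (cd);
  return out;
}
```

The parser would then only ever see the UTF-8 bytes, which is why no mbstate_t needs to be kept in sync with the iconv_t.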

> > You want to look closely at what is currently done for UCNs in wide
> > character constants and string literals. I'm pretty sure it's wrong,
> > and I would appreciate suggestions.
>
> As for the preprocessor, it looks quite right to me; also, the output
> is right, assuming gcc implies ISO 10646 for wchar_t on all platforms
> (which is a sensible choice, and correct for GNU systems).

Glad to hear. ISO10646 for wchar_t is not universally correct, but we
can do no better at present.

> > We should normalize identifiers before entering them in the symbol
> > table, and for output; otherwise there will be great confusion.
> > That needs to happen as part of the initial patch.

What you wrote in response to this is interesting but doesn't address
the issue of Unicode normalization of identifiers. It sounds more
like an extended discussion of the previous point. I'm talking about
the process described in UAX 15 (http://www.unicode.org/unicode/reports/tr15/)
and in particular annex 7 of that document ("Programming Language
Identifiers").
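
To make the concern concrete, here is a small sketch (not taken from any patch) showing two spellings that a reader would see as the same identifier but that differ byte for byte in UTF-8. The function name identifiers_match_bytewise is hypothetical; it stands in for whatever comparison the symbol table does on unnormalized spellings.

```c
#include <assert.h>
#include <string.h>

/* "café" with U+00E9 (precomposed e-acute), UTF-8 encoded.  */
static const char precomposed[] = "caf\xC3\xA9";

/* "café" as 'e' followed by U+0301 (combining acute accent), UTF-8
   encoded.  */
static const char decomposed[] = "cafe\xCC\x81";

/* Hypothetical stand-in for a symbol-table comparison that looks only
   at the raw bytes of the spelling.  */
int
identifiers_match_bytewise (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  return strcmp (a, b) == 0;
}
```

Without normalization, identifiers_match_bytewise (precomposed, decomposed) returns 0, so the two spellings would become distinct symbols even though they render identically.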

> The C++ ABI left this open; the current recommendation (which is not
> normative) is to use UTF-8 unless something else is specified by the
> vendor. Encoding schemes don't really work for C, and add complexity
> for C++.
>
> I have now the opinion that encoding schemes (other than UTF-8) should
> not be used. Compatibility with Java might be an issue; it might be
> necessary to special-case extern "Java" identifiers in the C++
> front-end. I could add that to the patch - although I would prefer if
> the Java API would change.
>
> As for assemblers that don't allow UTF-8 in source code: I'd rather
> disable the feature for those assemblers than trying to find a
> solution - this allows for compatibility should the vendor decide on
> this matter later.
>
> The tricky part is how to determine whether UTF-8 is supported in
> assembler output: initially, I'd just assume that GNU as supports it,
> and no other assembler does; this can then be extended as support on
> other systems becomes possible.

This all seems entirely reasonable, but please do communicate with the
Java folks about their requirements. I've added java@ to the cc list.

> The next question is where to block unacceptable identifiers: in
> cpplib, or later? Later might be better since,
> at least for C++, supporting this in Java identifiers might be
> desirable, plus you could use it in macro names even if the assembler
> does not support it.

At the language level, yes, and perhaps only for identifiers that will
map to assembly symbols. I would suggest copying your NODE_UTF8 bit
into a flag on IDENTIFIER_NODEs and then checking that in make_decl_rtl.

> If UTF-8 identifiers must be rejected (or converted) in the language
> front-ends, how can I efficiently determine whether an identifier uses
> UTF-8? Can I use deprecated_flag on IDENTIFIERs for that?

If there is no other use for that bit in IDENTIFIER_NODEs, then yes
(make sure to document it appropriately).

I think it would be more appropriate to call the bit
"USES_EXTENDED_CHARACTERS" rather than "UTF8" as technically 7-bit
ASCII is UTF8 too. It will be referenced rarely enough that a long,
meaningful name is best.
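
A sketch of the test such a bit would cache, assuming identifier spellings are held as UTF-8. The function name is hypothetical and simply mirrors the suggested flag name; it is not an existing GCC or cpplib interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: true if the LEN-byte UTF-8 spelling contains any
   character outside 7-bit ASCII.  In UTF-8 every byte of such a
   character has its high bit set, so one pass over the bytes is
   enough.  Pure ASCII spellings (which are also valid UTF-8) yield
   false, which is why "UTF8" would be a misleading name for the bit.  */
bool
uses_extended_characters (const unsigned char *spelling, size_t len)
{
  assert (spelling != NULL || len == 0);
  for (size_t i = 0; i < len; i++)
    if (spelling[i] >= 0x80)
      return true;
  return false;
}
```

The result would be computed once when the identifier is created and stored in the flag, so later checks cost a single bit test.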

> > (1) This routine belongs in libiberty, as part of the safe-ctype.h
> > interface.
>
> Really? The list of characters is quite specific to the language (and
> perhaps even the language revision). I haven't even checked whether
> the lists of acceptable characters are the same in C++98 and C99.
>
> > (3) The ranges need to be updated from the latest Unicode standard,
> > and the standard version noted in commentary.
>
> No. They are mandated by the language specification. For C++, see
> Annex E. For C99, see Annex D (unfortunately, I can't, since I don't
> have the final copy of C99). C++ claims to have copied the table from
> PDTR 10176, C from TR 10176.
>
> *If* my C99 draft is accurate, then there are differences between
> these two tables: e.g. in C99, U+00AA (FEMININE ORDINAL INDICATOR)
> is acceptable in an identifier; in C++98, it is not.

Ugh. IMO, this is a defect in both standards - they should simply
reference UAX15a7 and be done with it. It's been around since 1998,
so they don't really have an excuse for not using it.

I suggest:
 - In libiberty, provide interfaces that implement UAX15. On
   reflection, this should be a new <unicode.h> interface set, not
   tacked onto <safe-ctype.h>.
 - In cpplib, provide routines that validate individual identifiers
   against the precise lists in C99 and C++98.
 - GCC enforces the precise lists in C99 and C++98 only in -pedantic
   mode.
 - We file a couple of Defect Reports.
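
The per-standard validation in the second item could look roughly like the sketch below. All names are hypothetical, and the two tables are tiny illustrative excerpts, not the real lists: the U+00AA difference is the one reported above (subject to the same caveat about the C99 draft), and the full ranges would have to be transcribed from C99 Annex D and C++98 Annex E.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ucn_range { unsigned int lo, hi; };

/* ILLUSTRATIVE EXCERPTS ONLY, sorted by code point.  */
static const struct ucn_range c99_excerpt[] = {
  { 0x00AA, 0x00AA },           /* FEMININE ORDINAL INDICATOR */
  { 0x00C0, 0x00D6 },
  { 0x00D8, 0x00F6 },
};
static const struct ucn_range cxx98_excerpt[] = {
  { 0x00C0, 0x00D6 },
  { 0x00D8, 0x00F6 },
};

enum ident_std { STD_C99, STD_CXX98 };

/* Binary search of a sorted, non-overlapping range table.  */
static bool
in_ranges (const struct ucn_range *t, size_t n, unsigned int c)
{
  assert (t != NULL);
  size_t lo = 0, hi = n;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (c < t[mid].lo)
        hi = mid;
      else if (c > t[mid].hi)
        lo = mid + 1;
      else
        return true;
    }
  return false;
}

/* Hypothetical entry point: is code point C allowed in an identifier
   under the given standard's list?  */
bool
ucn_valid_in_identifier (unsigned int c, enum ident_std std)
{
  if (std == STD_C99)
    return in_ranges (c99_excerpt,
                      sizeof c99_excerpt / sizeof c99_excerpt[0], c);
  return in_ranges (cxx98_excerpt,
                    sizeof cxx98_excerpt / sizeof cxx98_excerpt[0], c);
}
```

Keeping one table per standard makes the differences between the two lists explicit instead of hiding them in conditionals.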

> > Naturally, cpp_classify_number should categorize such numbers as
> > CPP_N_INVALID (allowing digits outside the basic source character
> > set strikes me as a bad idea).
>
> Please educate me: is this taking the target language into account? If
> not, there is nothing wrong with that token, as a pp-token.

cpp_classify_number is used in the conversion from pp-tokens to
tokens. While extended characters are valid nondigits and therefore
valid in pp-number tokens, they are not valid in phase 7's
integer-constants and floating-constants (see C99 6.4.4).
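
As a sketch of the check this implies (this is not the real cpp_classify_number, and the enum and function names are invented for illustration): a pp-number spelling that contains a raw extended character or a UCN can be rejected before any further classification, assuming spellings are held as UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; cpplib's own CPP_N_* flags are not
   reproduced here.  */
enum sketch_number_class { SKETCH_N_PLAUSIBLE, SKETCH_N_INVALID };

/* Reject a pp-number spelling that cannot be an integer or floating
   constant because it contains an extended character.  */
enum sketch_number_class
sketch_classify_number (const unsigned char *spelling, size_t len)
{
  assert (spelling != NULL || len == 0);
  for (size_t i = 0; i < len; i++)
    {
      /* Any byte of a non-ASCII character's UTF-8 encoding.  */
      if (spelling[i] >= 0x80)
        return SKETCH_N_INVALID;
      /* Within a pp-number a backslash can only begin a UCN.  */
      if (spelling[i] == '\\')
        return SKETCH_N_INVALID;
    }
  /* Still needs the usual digit, suffix and exponent checks.  */
  return SKETCH_N_PLAUSIBLE;
}
```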

Incidentally, references to standard sections in commentary should be
of the form "C++98 Annex E [extendid]" not just "[extendid]". If
you're referring to a specific paragraph, write e.g. "C99 6.4.2.1p1"
as "6.4.2.1.1" is ambiguous. (Alas, C99 doesn't have section name tags.)

> > Please find a more efficient way to accomplish this. This code is
> > already *the* bottleneck for textual preprocessing. (For instance, if
> > you implement support for raw UTF8 as input encoding, we can just
> > splat out the identifier as is.)
>
> Is that necessary? Few tokens will ever have the flag set, and the
> only part where I added overhead is the test for the flag.

I am not sure about "few" some years down the road, when people start
_using_ the ability to write identifiers in their own languages. In
any case, using "fwrite(ptr, 1, 1, file)" is just silly when
"putc(*ptr, file)" will do.
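
For illustration, a sketch of the two sensible ways to emit a spelling; the function names are hypothetical and this is not the code from the patch. If the identifier is already stored as raw UTF-8, the second form corresponds to the "splat it out as is" case mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Byte at a time: use putc, not a one-byte fwrite.  Returns 0 on
   success, -1 on a write error.  */
int
write_spelling_bytewise (const unsigned char *p, size_t len, FILE *fp)
{
  assert (fp != NULL && (p != NULL || len == 0));
  for (size_t i = 0; i < len; i++)
    if (putc (p[i], fp) == EOF)
      return -1;
  return 0;
}

/* Whole spelling in a single call, when no per-byte processing is
   needed.  Returns 0 on success, -1 on a short write.  */
int
write_spelling_bulk (const unsigned char *p, size_t len, FILE *fp)
{
  assert (fp != NULL && (p != NULL || len == 0));
  return fwrite (p, 1, len, fp) == len ? 0 : -1;
}
```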

zw