ISO/IEC JTC1/SC22/WG14 N628

					 DOC NUMBER: WG14/N628
							X3J11/96-092
					 Date: 07-Nov-96
					
The attached document WG21/N0886 was adopted by the WG21/X3J16 
C++ committee with the following two amendments. These amendments 
were taken from the WG21 C++ Stockholm minutes. 
==== C Compatibility (Nelson) ====
 
 17) Motion (to change the syntax for universal-character-name) by
 Nelson/Plum:
 
 Move we amend the WP by changing all occurrences of "??u" to "\u"
 and all occurrences of "??U" to "\U". (The affected clauses are 2.2
 [lex.charset], footnote 18 in 2.10 [lex.name], and A.2 [gram.lex].)
 
 Motion passed X3J16: 31 yes, 0 no.
 Motion passed WG21: 7 yes, 0 no, 0 abstain.
 
 18) Motion (to clarify phases of translation) by Nelson/Benito:
 
 Move we amend clause 2.1 [lex.phases] of the WP as follows:
 
 -- under phase 1, add to the end of sentence 1: "in an
 implementation-defined manner."
 
 -- under phase 2, after sentence 1, add the following new sentence:
 "If a character sequence results which matches the syntax of a
 universal-character-name, the behavior is undefined."
 
 -- under phase 4, after sentence 1, add the following new sentence:
 "If a character sequence is produced by token concatenation
 (_cpp.concat_) which matches the syntax of a
 universal-character-name, the behavior is undefined."
 
 Motion passed X3J16: 31 yes, 0 no.
 Motion passed WG21: 7 yes, 0 no, 0 abstain.

							 
							 WG21/N0886
 X3J16/96-0068
 1996-03-13
 
Extended Identifiers and Extended Literals
Thomas Plum, John Benito, Clark Nelson
 
 
Move that we revise the Working Paper as follows:
 
Item 1) In 2.1, Phases of Translation, add a new paragraph 1 to precede
the existing first paragraph. Then add onto "phase 1" a new sentence,
"Any source file character not in the basic source character set is replaced
by the _universal-character-name_ that designates that character." so that
the first two paragraphs would read as follows:
 
2.1 Phases of translation [lex.phases]
 
1 The _basic source character set_ consists of 96 characters: the space
 character, the control characters representing horizontal tab,
 vertical tab, form feed, and new-line, plus the following 91
 graphical characters:
 a b c d e f g h i j k l m n o p q r s t u v w x y z (26)
 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z (26)
 0 1 2 3 4 5 6 7 8 9 (10)
 _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " ' (29)
 
 The _universal-character-name_ construct provides a way to name other
 characters. The character designated by the _universal-character-name_
 ??UNNNNNNNN is that character whose encoding in ISO/IEC 10646 is the
 hexadecimal value NNNNNNNN; the character designated by the
 _universal-character-name_ ??uNNNN is that character whose encoding in
 ISO/IEC 10646 is the hexadecimal value 0000NNNN.
 
 hex-quad:
 hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
 
 universal-character-name:
 ??u hex-quad
 ??U hex-quad hex-quad
 
2 The precedence among the syntax rules of translation is specified by the
following phases.
 1 Physical source file characters are mapped to the source character set
 (introducing new-line characters for end-of-line indicators) if necessary.
 Trigraph sequences (2.2) are replaced by corresponding single-character
 internal representations. Any source file character not in the basic
 source character set is replaced by the _universal-character-name_ that
 designates that character. [Footnote -- The process of handling extended
 characters is specified in terms of mapping to an encoding that uses only the
 basic source character set, and, in the case of character literals and
 strings, further mapping to the execution character set. In practical terms,
 however, any internal encoding may be used, so long as an actual extended
 character encountered in the input, and the same extended character
 expressed in the input as a _universal-character-name_ (i.e. using the
 ??uXXXX notation), are handled equivalently.]
 
[end of quote from revised WP]
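
As a non-normative illustration, the grammar for _universal-character-name_
quoted above can be recognized by a short routine. The sketch below assumes
C++17 and the backslash spelling adopted by the Stockholm amendment (motion 17)
rather than the "??u"/"??U" spelling used in this attachment. The names
hex_value, ParsedUcn, and parse_ucn are hypothetical and do not appear in the WP.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Value of one hexadecimal-digit, or -1 if c is not a hexadecimal-digit.
constexpr int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

struct ParsedUcn {
    unsigned long code;   // ISO/IEC 10646 code value that was named
    std::size_t length;   // number of source characters consumed
};

// Hypothetical helper: recognizes a universal-character-name at the start
// of `text`. A lowercase u introduces one hex-quad, an uppercase U two.
constexpr std::optional<ParsedUcn> parse_ucn(std::string_view text) {
    if (text.size() < 2 || text[0] != '\\') return std::nullopt;

    std::size_t digits = 0;
    if (text[1] == 'u') {
        digits = 4;
    } else if (text[1] == 'U') {
        digits = 8;
    } else {
        return std::nullopt;
    }
    if (text.size() < 2 + digits) return std::nullopt;

    unsigned long code = 0;
    for (std::size_t i = 0; i < digits; ++i) {
        const int v = hex_value(text[2 + i]);
        if (v < 0) return std::nullopt;   // not a complete hex-quad
        code = code * 16 + static_cast<unsigned long>(v);
    }
    return ParsedUcn{code, 2 + digits};
}
```

Under these assumptions the short and long forms that name the same character
yield the same code value, which is the equivalence paragraph 1 describes
between NNNN and 0000NNNN.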
 
Item 2) In 2.1, Phases of Translation, revise "phase 5" as follows:
 
 5 Each source character set member, escape sequence, or
 _universal-character-name_ in character literals and string literals is
 converted to a member of the execution character set.
 
Item 3) In 2.8, Identifiers, add a new line into the definition of _nondigit_,
and modify paragraph 1, so that the revised text of 2.8 reads as follows:
 
2.8 Identifiers [lex.name]
 
 identifier:
 nondigit
 identifier nondigit
 identifier digit
 
 nondigit: one of
 _universal-character-name_
 _ a b c d e f g h i j k l m n o p q r s t u v w x y z
 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
 
 digit: one of
 0 1 2 3 4 5 6 7 8 9
 
1 An identifier is an arbitrarily long sequence of nondigits and digits.
Each _universal-character-name_ in an identifier shall designate a character
whose encoding in ISO 10646 falls into one of the ranges specified in Annex E.
Upper- and lower-case letters are different. All characters are significant. *
 
[*Footnote: On systems in which linkers cannot accept extended characters, an
encoding of the universal-character-name may be used in forming valid external
identifiers. For example, some otherwise unused character or sequence of characters
may be used to encode the "??u" in a universal-character-name. Extended characters
may produce a long external identifier, but C++ does not place a
translation limit on significant characters for external identifiers. In C++,
upper and lower case letters are considered different for all identifiers,
including external identifiers.]
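
As a non-normative illustration of the footnote, the sketch below rewrites an
identifier spelled with universal-character-names into a name a restricted
linker might accept. It assumes C++17, the backslash spelling from the
Stockholm amendment, and a linker that accepts "$" in external names; both the
marker and the function name encode_external_name are hypothetical choices,
not something the WP specifies.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical mangling step: a backslash can appear in an identifier only as
// the start of a universal-character-name, so replacing each backslash with
// an otherwise unused marker keeps the hexadecimal digits intact.
std::string encode_external_name(std::string_view identifier,
                                 std::string_view marker = "$") {
    assert(!marker.empty());   // an empty marker would lose information
    std::string out;
    out.reserve(identifier.size());
    for (char c : identifier) {
        if (c == '\\') {
            out.append(marker);
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

The result is longer than the source spelling, which is acceptable because, as
the footnote says, C++ places no translation limit on significant characters
in external identifiers.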
 
Item 4) Augment the definition of _c-char_ in 2.10.2, Character Literals, as
follows:
 
 c-char:
 any member of the source character set except
 the single-quote ', backslash \, or new-line character
 escape-sequence
 universal-character-name
 
Then add a new paragraph 5, as follows:
 
5 A _universal-character-name_ is translated to the encoding, in the
execution character set, of the character named. If there is no such
encoding, the _universal-character-name_ is translated to an
implementation-defined encoding. [Note: In translation phase 1 a
_universal-character-name_ is introduced whenever an actual extended
character is encountered in the source text. Therefore, all extended
characters are described in terms of _universal-character-names_.
However, the actual compiler implementation may
use its own native character set, so long as the same results are obtained.]
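
As a non-normative illustration of paragraph 5, the sketch below translates a
code value to one possible execution character set encoding. UTF-8 is purely
an assumption here, since the WP leaves the execution character set to the
implementation, and the sketch accepts only code values up to 10FFFF and does
not reject surrogates. The function name to_utf8 is hypothetical.

```cpp
#include <cassert>
#include <string>

// Encodes an ISO/IEC 10646 code value as UTF-8, the execution character set
// assumed for this sketch.
std::string to_utf8(unsigned long code) {
    assert(code <= 0x10FFFF);   // range assumed by this sketch
    std::string out;
    if (code < 0x80) {
        out += static_cast<char>(code);
    } else if (code < 0x800) {
        out += static_cast<char>(0xC0 | (code >> 6));
        out += static_cast<char>(0x80 | (code & 0x3F));
    } else if (code < 0x10000) {
        out += static_cast<char>(0xE0 | (code >> 12));
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (code >> 18));
        out += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
    }
    return out;
}
```

With this choice a single _universal-character-name_ may occupy from one to
four char elements, which is why a non-wide string literal can be longer than
the number of characters it names.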
 
Item 5) Augment the definition of _s-char_ in 2.10.4, String Literals, as
follows:
 
 s-char:
 any member of the source character set except
 the double-quote ", backslash \, or new-line character
 escape-sequence
 universal-character-name
 
Then, in paragraph 5, change "Escape sequences" to "Escape sequences and
_universal-character-names_". Change the last sentence to read as follows:
 
 In a non-wide string literal, a _universal-character-name_ may map to more
 than one char element. The size of a wide string literal is the total number
 of escape sequences, _universal-character-names_, and other characters, plus
 one for the terminating L'\0'. The size of a non-wide string literal is
 the total number of escape sequences and other characters, plus at least one
 for the multibyte encoding of each _universal-character-name_, plus one for
 the terminating '\0'.
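
As a non-normative illustration of the size rule, the sketch below measures
one wide and one non-wide literal that each contain an ordinary character and
a _universal-character-name_. It uses the backslash spelling from the
Stockholm amendment and assumes that the named character fits in a single
wchar_t and has some encoding in the execution character set. The names
literal_size, wide_size, and narrow_size are hypothetical.

```cpp
#include <cstddef>

// Number of elements, including the terminator, in a string literal's array.
template <typename CharT, std::size_t N>
constexpr std::size_t literal_size(const CharT (&)[N]) {
    return N;
}

// Wide literal: one element for 'a', one for the universal-character-name,
// and one for the terminating null wide character.
constexpr std::size_t wide_size = literal_size(L"a\u00e9");

// Non-wide literal: one element for 'a', at least one for the multibyte
// encoding of the universal-character-name, and one for the terminator.
constexpr std::size_t narrow_size = literal_size("a\u00e9");

static_assert(wide_size == 3, "one element per character plus terminator");
static_assert(narrow_size >= 3, "the encoding may need more than one char");
```

The exact value of narrow_size depends on the execution character set, so the
sketch asserts only the lower bound that the revised sentence guarantees.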
 
Item 6) Add an annex to list the universal-character-names for identifiers.
 
________________________________________________________________________________
 
Annex E (normative) Universal-character-names for Identifiers [extended-id]
________________________________________________________________________________
 
1 This Clause lists the hexadecimal code values that are valid in
_universal-character-names_ in C++ identifiers.
 
2 This table is reproduced unchanged from ISO/IEC PDTR 10176, produced by
ISO/IEC JTC1/SC22/WG20, except that the ranges 0041-005a and 0061-007a designate
the upper and lower case English alphabets, which are part of the basic source
character set, and are not repeated in the table below.
 
[Editorial Note: If PDTR 10176 is changed during its balloting and adoption as
a TR, then this table should be changed to match its changes.]
 
Latin: 00c0-00d6,00d8-00f6,00f8-01f5,01fa-0217,
 0250-02a8,1e00-1e9a,1ea0-1ef9
Greek: 0384,0388-038a,038c,038e-03a1,03a3-03ce,03d0-03d6,03da,03dc,03de,
 03e0,03e2-03f3,
 1f00-1f15,1f18-1f1d,1f20-1f45,1f48-1f4d,1f50-1f57,1f59,1f5b,1f5d,
 1f5f-1f7d,1f80-1fb4,1fb6-1fbc,1fc2-1fc4,1fc6-1fcc,1fd0-1fd3,
 1fd6-1fdb,1fe0-1fec,1ff2-1ff4,1ff6-1ffc,
Cyrillic: 0401-040d,040f-044f,0451-045c,045e-0481,0490-04c4,04c7-04c8,
 04cb-04cc,04d0-04eb,04ee-04f5,04f8-04f9
Armenian: 0531-0556,0561-0587
Hebrew: 05d0-05ea,05f0-05f4
Arabic: 0621-063a,0640-0652,0670-06b7,06ba-06be,06c0-06ce,06e5-06e7,
Devanagari: 0905-0939,0958-0962
Bengali: 0985-098c,098f-0990,0993-09a8,09aa-09b0,09b2,09b6-09b9,
 09dc-09dd,09df-09e1,09f0-09f1
Gurmukhi: 0a05-0a0a,0a0f-0a10,0a13-0a28,0a2a-0a30,0a32-0a33,
 0a35-0a36,0a38-0a39,0a59-0a5c,0a5e
Gujarati: 0a85-0a8b,0a8d,0a8f-0a91,0a93-0aa8,0aaa-0ab0,0ab2-0ab3,
 0ab5-0ab9,0ae0,
Oriya: 0b05-0b0c,0b0f-0b10,0b13-0b28,0b2a-0b30,0b32-0b33,0b36-0b39,
 0b5c-0b5d,0b5f-0b61,
Tamil: 0b85-0b8a,0b8e-0b90,0b92-0b95,0b99-0b9a,0b9c,0b9e-0b9f,0ba3-0ba4,
 0ba8-0baa,0bae-0bb5,0bb7-0bb9,
Telugu: 0c05-0c0c,0c0e-0c10,0c12-0c28,0c2a-0c33,0c35-0c39,0c60-0c61,
Kannada: 0c85-0c8c,0c8e-0c90,0c92-0ca8,0caa-0cb3,0cb5-0cb9,0ce0-0ce1,
Malayalam: 0d05-0d0c,0d0e-0d10,0d12-0d28,0d2a-0d39,0d60-0d61,
Thai: 0e01-0e30,0e32-0e33,0e40-0e46,0e4f-0e5b,
Lao: 0e81-0e82,0e84,0e87,0e88,0e8a,0e8d,0e94-0e97,0e99-0e9f,0ea1-0ea3,
 0ea5,0ea7,0eaa,0eab,0ead-0eb0,0eb2,0eb3,0ebd,0ec0-0ec4,0ec6,
Georgian: 10a0-10c5,10d0-10f6,
Hiragana: 3041-3094,309b-309e
Katakana: 30a1-30fe,
Bopomofo: 3105-312c,
Hangul: 1100-1159,1161-11a2,11a8-11f9
CJK Unified Ideographs: f900-fa2d,
 fb1f-fb36,fb38-fb3c,fb3e,fb40-fb41,fb42-fb44,fb46-fbb1,fbd3-fd3f,
 fd50-fd8f,fd92-fdc7,fdf0-fdfb,fe70-fe72,fe74,fe76-fefc,
 ff21-ff3a,ff41-ff5a,ff66-ffbe,ffc2-ffc7,ffca-ffcf,ffd2-ffd7,
 ffda-ffdc,4e00-9fa5
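
As a non-normative illustration of how the table might be consulted, the sketch
below tests a code value against a small excerpt of the ranges above (the
Latin, Armenian, Hebrew, and Hiragana rows, plus the 4e00-9fa5 range). It is
not the full table, and the names CodeRange, kAnnexEExcerpt, and
in_annex_e_excerpt are hypothetical.

```cpp
// One inclusive range of ISO/IEC 10646 code values.
struct CodeRange {
    unsigned long first;
    unsigned long last;
};

// Excerpt only: a few rows copied from the table above, not the whole annex.
constexpr CodeRange kAnnexEExcerpt[] = {
    {0x00c0, 0x00d6}, {0x00d8, 0x00f6}, {0x00f8, 0x01f5}, {0x01fa, 0x0217},
    {0x0250, 0x02a8}, {0x1e00, 0x1e9a}, {0x1ea0, 0x1ef9},   // Latin
    {0x0531, 0x0556}, {0x0561, 0x0587},                     // Armenian
    {0x05d0, 0x05ea}, {0x05f0, 0x05f4},                     // Hebrew
    {0x3041, 0x3094}, {0x309b, 0x309e},                     // Hiragana
    {0x4e00, 0x9fa5},                                       // ideographs
};

// True if `code` falls inside one of the excerpted ranges. A linear scan is
// enough for a sketch; a full table would more likely use a binary search.
constexpr bool in_annex_e_excerpt(unsigned long code) {
    for (const CodeRange& r : kAnnexEExcerpt) {
        if (r.first <= code && code <= r.last) return true;
    }
    return false;
}
```

Code values for the basic letters, such as 0041, are deliberately absent
because paragraph 2 excludes the ranges 0041-005a and 0061-007a from the
table; a complete identifier check would accept them through the _nondigit_
grammar instead.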