ISO/IEC JTC1/SC22/WG14 N717

WG14/N717
J11/97-080
1997-06-23
Thomas Plum
Wording for "Extended Identifiers" [Revision #4, after voting]
In the text below, lines that start with 6 spaces are quoted
intact from C9X draft 9. The lines at the left margin are
the proposed words to incorporate extended identifiers, 
taken generally verbatim from the second C++ CD 14882.
 5.1.1.2 Translation phases
 [#1] The precedence among the syntax rules of translation is
 specified by the following phases.
 1. Physical source file characters are mapped to the
 source character set (introducing new-line characters
 for end-of-line indicators) if necessary. Trigraph
 sequences are replaced by corresponding single-
 character internal representations.
Any source file character not in the basic source character set 
is replaced by the universal-character-name that designates that 
character.*)
---------------
*) The process of handling extended characters is specified in terms 
of mapping to an encoding that uses only the basic source character 
set, and, in the case of character literals and strings, further 
mapping to the execution character set. In practical terms, however, 
any internal encoding may be used, so long as an actual extended 
character encountered in the input, and the same extended character 
expressed in the input as a universal-character-name (i.e. using the 
\uXXXX notation), are handled equivalently.
---------------
 [...]
 4. Preprocessing directives are executed, macro
 invocations are expanded, and pragma unary operator
 expressions are executed. 
If a character sequence that matches the syntax of a 
universal-character-name is produced by token concatenation 
(16.3.3), the behavior is undefined.
 A #include preprocessing
 directive causes the named header or source file to be
 processed from phase 1 through phase 4, recursively.
 All preprocessing directives are then deleted.
 5. Each source character set member,
escape sequence, and universal-character-name
 in character constants and string literals is
 converted to a member of the execution character set.
 [etc as-is]
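As an illustration of the replacement described in phase 1, the following C sketch spells a character as a universal-character-name. It assumes the character has already been decoded to its ISO/IEC 10646 value and that the caller has already determined it is not in the basic source character set. The name format_ucn and its interface are hypothetical and are not part of the proposed wording.

```c
#include <stdio.h>

/*
 * Hypothetical helper: writes the universal-character-name spelling of the
 * ISO/IEC 10646 value cp into buf. Values that fit in four hexadecimal
 * digits use the \uNNNN form; larger values use the \UNNNNNNNN form.
 * Returns the number of characters written (6 or 10), or 0 if buf is too
 * small to hold the spelling and its terminating null character.
 */
int format_ucn(unsigned long cp, char *buf, size_t size)
{
    int n;

    if (cp <= 0xFFFFUL)
        n = snprintf(buf, size, "\\u%04lX", cp);
    else
        n = snprintf(buf, size, "\\U%08lX", cp);

    return (n > 0 && (size_t)n < size) ? n : 0;
}
```

Under these assumptions, the value 00C0 would be spelled \u00C0 and the value 00010000 would be spelled \U00010000. An implementation is free to use any other internal encoding, as the footnote to phase 1 points out.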
Constraints
A universal-character-name shall not specify a character short identifier
in the ranges 0000 through 0020 or 007F through 009F, inclusive. A 
universal-character-name shall not designate a character in the basic source character set.
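The two constraints above can be sketched as a single predicate in C. This is only an illustration: the function names are hypothetical, and the sketch assumes that the graphic members of the basic source character set correspond to the ISO/IEC 10646 values 0021 through 007E other than 0024, 0040 and 0060 (the space and control characters of the basic set already fall in the excluded range 0000 through 0020).

```c
#include <stdbool.h>

/* True if cp lies in one of the ranges excluded by the first constraint. */
static bool in_excluded_range(unsigned long cp)
{
    return cp <= 0x0020UL || (cp >= 0x007FUL && cp <= 0x009FUL);
}

/*
 * Assumption: the graphic basic source characters occupy 0021 through 007E,
 * except for 0024, 0040 and 0060, which are not in the basic set.
 */
static bool in_basic_source_set(unsigned long cp)
{
    if (cp < 0x0021UL || cp > 0x007EUL)
        return false;
    return cp != 0x0024UL && cp != 0x0040UL && cp != 0x0060UL;
}

/* True if a universal-character-name may specify the value cp. */
bool ucn_value_permitted(unsigned long cp)
{
    return !in_excluded_range(cp) && !in_basic_source_set(cp);
}
```

With this reading, \u0041 violates the second constraint because it designates a letter of the basic source character set, while \u00C0 satisfies both constraints.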
 5.2 Environmental considerations
 5.2.1 Character sets
 [#1] Two sets of characters and their associated collating
 sequences shall be defined: the set in which source files
 are written, and the set interpreted in the execution
 environment. The values of the members of the execution
 character set are implementation-defined; any additional
 members beyond those required by this subclause are locale-
 specific.
[etc as-is, to the last paragraph of 5.2.1, then add...]
The universal-character-name construct provides a way to name other 
characters.
hex-quad: hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
universal-character-name: \u hex-quad 
 \U hex-quad hex-quad
The character designated by the universal-character-name \UNNNNNNNN 
is that character whose character short identifier is
NNNNNNNN specified by ISO/IEC 10646 pDAM-9; 
the character designated by the 
universal-character-name \uNNNN is that character whose 
character short identifier is
0000NNNN specified by ISO/IEC 10646 pDAM-9.
[This wording reflects comments from Japan about C++ CD2.]
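The grammar above is small enough to recognize by hand. The following C sketch is one possible recognizer, given only to illustrate how \uNNNN and \UNNNNNNNN map to a character short identifier. The names hex_value and parse_ucn are hypothetical, and the sketch does not apply the constraints on permitted values.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns the value of a hexadecimal-digit, or -1 if c is not one. */
static int hex_value(char c)
{
    static const char digits[] = "0123456789abcdef";
    const char *p;

    if (c == '\0')
        return -1;
    p = strchr(digits, tolower((unsigned char)c));
    return p ? (int)(p - digits) : -1;
}

/*
 * Recognizes a universal-character-name at the start of the string s.
 * On success, stores the character short identifier in *cp and returns the
 * number of characters consumed: 6 for \u hex-quad, 10 for
 * \U hex-quad hex-quad. Returns 0 if s does not start with a
 * universal-character-name.
 */
size_t parse_ucn(const char *s, unsigned long *cp)
{
    size_t ndigits, i;
    unsigned long value = 0;

    if (s[0] != '\\')
        return 0;
    if (s[1] == 'u')
        ndigits = 4;            /* one hex-quad */
    else if (s[1] == 'U')
        ndigits = 8;            /* two hex-quads */
    else
        return 0;

    for (i = 0; i < ndigits; i++) {
        int d = hex_value(s[2 + i]);
        if (d < 0)
            return 0;
        value = value * 16 + (unsigned long)d;
    }
    *cp = value;
    return 2 + ndigits;
}
```

Both \u00c0 and \U000000c0 yield the same short identifier, 000000C0, which matches the rule that \uNNNN designates the character 0000NNNN.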
 Forward references: character constants (6.1.3.4),
 preprocessing directives (6.8), string literals (6.1.4),
 comments (6.1.9).
 
 [...]
 6.1.2 Identifiers
 Syntax
 [#1]
 identifier:
 nondigit
 identifier nondigit
 nondigit: one of
universal-character-name
 _ a b c d e f g h i j k l m
 n o p q r s t u v w x y z
 A B C D E F G H I J K L M
 N O P Q R S T U V W X Y Z
 [#2] An identifier is a sequence of nondigit characters
 (including the underscore _ and the lowercase and uppercase
 letters) and digits. 
Each universal-character-name in an identifier shall designate 
a character whose encoding in ISO 10646 
falls into one of the ranges specified in Annex xxx.*)
-----------------
*) On systems in which linkers cannot accept extended characters, 
an encoding of the universal-character-name may be used in forming 
valid external identifiers. For example, some otherwise unused 
character or sequence of characters may be used to encode the \u in 
a universal-character-name. Extended characters may produce a long 
external identifier. 
-----------------
 The first character shall be a nondigit character.
 [...]
 6.1.3.4 Character constants
 Syntax
 [#1]
 c-char:
 any member of the source character set except
 the single-quote ', backslash \, or 
 new-line character
 escape-sequence
universal-character-name
 6.1.4 String literals
 Syntax
 [#1]
 s-char:
 any member of the source character set except
 the double-quote ", backslash \, or 
 new-line character
 escape-sequence
universal-character-name
 ___________________________________________________________________
 Annex xxx (normative)
 Universal-character-names for identifiers
 ___________________________________________________________________
1 This Clause lists the hexadecimal code values that are valid in
 universal-character-names in identifiers.
2 This table is reproduced unchanged from ISO/IEC PDTR 10176, produced
 by ISO/IEC JTC1/SC22/WG20, except that the ranges 0041-005a and
 0061-007a designate the upper and lower case English alphabets, which
 are part of the basic source character set, and are not repeated in
 the table below.*)
--------------
*) If PDTR 10176 is changed during its balloting
 and adoption as a TR, then this table should be changed to match its
 changes.
--------------
 Latin: 00c0-00d6, 00d8-00f6, 00f8-01f5, 01fa-0217, 0250-02a8,
 1e00-1e9a, 1ea0-1ef9
 Greek: 0384, 0388-038a, 038c, 038e-03a1, 03a3-03ce, 03d0-03d6, 03da,
 03dc, 03de, 03e0, 03e2-03f3, 1f00-1f15, 1f18-1f1d, 1f20-1f45,
 1f48-1f4d, 1f50-1f57, 1f59, 1f5b, 1f5d, 1f5f-1f7d, 1f80-1fb4,
 1fb6-1fbc, 1fc2-1fc4, 1fc6-1fcc, 1fd0-1fd3, 1fd6-1fdb, 1fe0-1fec,
 1ff2-1ff4, 1ff6-1ffc
 Cyrillic: 0401-040d, 040f-044f, 0451-045c, 045e-0481, 0490-04c4,
 04c7-04c8, 04cb-04cc, 04d0-04eb, 04ee-04f5, 04f8-04f9
 Armenian: 0531-0556, 0561-0587
 Hebrew: 05d0-05ea, 05f0-05f4
 Arabic: 0621-063a, 0640-0652, 0670-06b7, 06ba-06be, 06c0-06ce,
 06e5-06e7
 Devanagari: 0905-0939, 0958-0962
 Bengali: 0985-098c, 098f-0990, 0993-09a8, 09aa-09b0, 09b2, 09b6-09b9,
 09dc-09dd, 09df-09e1, 09f0-09f1
 Gurmukhi: 0a05-0a0a, 0a0f-0a10, 0a13-0a28, 0a2a-0a30, 0a32-0a33,
 0a35-0a36, 0a38-0a39, 0a59-0a5c, 0a5e
 Gujarati: 0a85-0a8b, 0a8d, 0a8f-0a91, 0a93-0aa8, 0aaa-0ab0,
 0ab2-0ab3, 0ab5-0ab9, 0ae0
 Oriya: 0b05-0b0c, 0b0f-0b10, 0b13-0b28, 0b2a-0b30, 0b32-0b33,
 0b36-0b39, 0b5c-0b5d, 0b5f-0b61
 Tamil: 0b85-0b8a, 0b8e-0b90, 0b92-0b95, 0b99-0b9a, 0b9c, 0b9e-0b9f,
 0ba3-0ba4, 0ba8-0baa, 0bae-0bb5, 0bb7-0bb9
 Telugu: 0c05-0c0c, 0c0e-0c10, 0c12-0c28, 0c2a-0c33, 0c35-0c39,
 0c60-0c61
 Kannada: 0c85-0c8c, 0c8e-0c90, 0c92-0ca8, 0caa-0cb3, 0cb5-0cb9,
 0ce0-0ce1
 Malayalam: 0d05-0d0c, 0d0e-0d10, 0d12-0d28, 0d2a-0d39, 0d60-0d61
 Thai: 0e01-0e30, 0e32-0e33, 0e40-0e46, 0e4f-0e5b
 Lao: 0e81-0e82, 0e84, 0e87, 0e88, 0e8a, 0e8d, 0e94-0e97, 0e99-0e9f,
 0ea1-0ea3, 0ea5, 0ea7, 0eaa, 0eab, 0ead-0eb0, 0eb2, 0eb3, 0ebd,
 0ec0-0ec4, 0ec6
 Georgian: 10a0-10c5, 10d0-10f6
 Hiragana: 3041-3094, 309b-309e
 Katakana: 30a1-30fe
 Bopomofo: 3105-312c
 Hangul: 1100-1159, 1161-11a2, 11a8-11f9
 CJK Unified Ideographs: f900-fa2d, fb1f-fb36, fb38-fb3c, fb3e,
 fb40-fb41, fb42-fb44, fb46-fbb1, fbd3-fd3f, fd50-fd8f, fd92-fdc7,
 fdf0-fdfb, fe70-fe72, fe74, fe76-fefc, ff21-ff3a, ff41-ff5a,
 ff66-ffbe, ffc2-ffc7, ffca-ffcf, ffd2-ffd7, ffda-ffdc, 4e00-9fa5
[Denmark (Keld Simonsen) commented re C++ CD2: 
Due to the change in ISO/IEC 10646 of the encoding of Hangul characters,
we propose to change the allowable characters defined for extended
identifiers as follows:
Remove the range U3400..U4DFF
insert the range UAC00..UD7AF
This change has also been processed to DTR 10176.]
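A translator could test the Annex ranges with a sorted table and a binary search. The C sketch below illustrates the idea under stated assumptions: it carries only the Latin and Hebrew rows of the table rather than the full Annex, and the names ucn_range, ident_ranges and ucn_allowed_in_identifier are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct ucn_range {
    unsigned long first;
    unsigned long last;
};

/*
 * Illustrative subset of the Annex table (Latin and Hebrew rows only),
 * sorted by first value so that it can be searched by bisection.
 */
static const struct ucn_range ident_ranges[] = {
    {0x00c0UL, 0x00d6UL}, {0x00d8UL, 0x00f6UL}, {0x00f8UL, 0x01f5UL},
    {0x01faUL, 0x0217UL}, {0x0250UL, 0x02a8UL}, {0x05d0UL, 0x05eaUL},
    {0x05f0UL, 0x05f4UL}, {0x1e00UL, 0x1e9aUL}, {0x1ea0UL, 0x1ef9UL}
};

/* True if cp falls inside one of the ranges in ident_ranges. */
bool ucn_allowed_in_identifier(unsigned long cp)
{
    size_t lo = 0;
    size_t hi = sizeof ident_ranges / sizeof ident_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (cp < ident_ranges[mid].first)
            hi = mid;               /* cp lies below this range */
        else if (cp > ident_ranges[mid].last)
            lo = mid + 1;           /* cp lies above this range */
        else
            return true;            /* cp lies within this range */
    }
    return false;
}
```

With this subset, 00c0 is accepted and 00d7 is rejected, since 00d7 falls in the gap between the first two Latin ranges. A complete implementation would populate the table from every row of the Annex.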