Author tchrist
Recipients ezio.melotti, gvanrossum, lemburg, loewis, mrabarnett, tchrist, terry.reedy
Date 2011-10-02.05:33:36
Message-id <6829.1317533598@chthon>
In-reply-to <4E872F1E.6050604@v.loewis.de>
Content
>> Perl does not provide the old 1.0 names at all. We don't have a Unicode
>> 1.0 legacy to support, which makes this cleaner. However, we do provide
>> for the names of the C0 and C1 Control Codes, because apart from Unicode
>> 1.0, they don't condescend to name the ASCII or Latin1 control codes.
> If there would be a reasonably official source for these names, and one
> that guarantees that there is no collision with UCD names, I could
> accept doing so for Python as well.
The C0 and C1 control code names don't change. There is/was one stability
issue where they screwed up, because they ended up having a UAX (required)
and a UTS (not required) fighting because of the dumb stuff they did with
the Emoji names. They neglected to prefix them with "Emoji ..." or some
such, the way things like "GREEK ... LETTER ..." or "MATHEMATICAL ..." or
"MUSICAL ..." did. The problem is they stole BELL without calling it EMOJI
BELL. This is C0 name for Control-G. Dimwits.
The problem with official names is that they have things in them that you
would not expect in names. Do you really and truly mean to tell me you
think it is somehow **good** that people are forced to write
 \N{LINE FEED (LF)}
Rather than the more obvious pair of 
 \N{LINE FEED}
 \N{LF}
??
If so, then I don't understand that. Nobody in their right 
mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?
 % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED}"'
 U+000A
 % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LF}"'
 U+000A
 % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED (LF)}"'
 U+000A
 % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE}"'
 U+0085
 % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEL}"'
 U+0085
 % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE (NEL)}"'
 U+0085
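For comparison on the Python side, here is a minimal sketch of the same idea: a small alias table consulted before the official name lookup. The table `CONTROL_ALIASES` and the helper `resolve_name` are hypothetical names made up for this illustration; only `unicodedata.lookup` is a real standard-library call.

```python
import unicodedata

# Hypothetical alias table mirroring a few of the names shown above.
CONTROL_ALIASES = {
    "LINE FEED (LF)": 0x000A,
    "LINE FEED": 0x000A,
    "LF": 0x000A,
    "NEXT LINE (NEL)": 0x0085,
    "NEXT LINE": 0x0085,
    "NEL": 0x0085,
}


def resolve_name(name: str) -> str:
    """Return the character for an alias, else fall back to the UCD name."""
    if name in CONTROL_ALIASES:
        return chr(CONTROL_ALIASES[name])
    # unicodedata.lookup raises KeyError for names it does not know.
    return unicodedata.lookup(name)


for name in ("LINE FEED", "LF", "LINE FEED (LF)",
             "NEXT LINE", "NEL", "NEXT LINE (NEL)"):
    print(f"U+{ord(resolve_name(name)):04X}  {name}")
```

This is a runtime lookup, so it only approximates what the `\N{...}` escape does at compile time.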
>> We also provide for certain well known aliases from the Names file:
>> anything that says "* commonly abbreviated as ...", so things like LRO
>> and ZWJ and such.
> -1. Readability counts, writability not so much (I know this is
> different for Perl :-). 
I actually very strongly resent and rebuff that entire mindset in the most
extreme way possible. Well-written Perl code is perfectly readable by
people who speak that language. If you find Perl code that isn't readable,
it is by definition not well-written.
*PLEASE* don't start. 
Yes, I just got done driving 16 hours and am overtired, but it's 
something I've been fighting against all of my professional career.
It's a "leyenda negra".
> If there is too much aliasing, people will
> wonder what these codes actually mean.
There are 15 "commonly abbreviated as" aliases in the NamesList.txt file.
 * commonly abbreviated as NBSP
 * commonly abbreviated as SHY
 * commonly abbreviated as CGJ
 * commonly abbreviated ZWSP
 * commonly abbreviated ZWNJ
 * commonly abbreviated ZWJ
 * commonly abbreviated LRM
 * commonly abbreviated RLM
 * commonly abbreviated LRE
 * commonly abbreviated RLE
 * commonly abbreviated PDF
 * commonly abbreviated LRO
 * commonly abbreviated RLO
 * commonly abbreviated NNBSP
 * commonly abbreviated WJ
All of the standards documents *talk* about things like LRO and ZWNJ.
I guess the standards aren't "readable" then, right? :)
From the charnames manpage, which shows that we really don't just make
these up as we feel like (although we could; see below). They're all from
this or that standard:
 ALIASES
 A few aliases have been defined for convenience: instead
 of having to use the official names
 LINE FEED (LF)
 FORM FEED (FF)
 CARRIAGE RETURN (CR)
 NEXT LINE (NEL)
 (yes, with parentheses), one can use
 LINE FEED
 FORM FEED
 CARRIAGE RETURN
 NEXT LINE
 LF
 FF
 CR
 NEL
 All the other standard abbreviations for the controls,
 such as "ACK" for "ACKNOWLEDGE" also can be used.
 One can also use
 BYTE ORDER MARK
 BOM
 and these abbreviations
 Abbreviation    Full Name
 CGJ             COMBINING GRAPHEME JOINER
 FVS1            MONGOLIAN FREE VARIATION SELECTOR ONE
 FVS2            MONGOLIAN FREE VARIATION SELECTOR TWO
 FVS3            MONGOLIAN FREE VARIATION SELECTOR THREE
 LRE             LEFT-TO-RIGHT EMBEDDING
 LRM             LEFT-TO-RIGHT MARK
 LRO             LEFT-TO-RIGHT OVERRIDE
 MMSP            MEDIUM MATHEMATICAL SPACE
 MVS             MONGOLIAN VOWEL SEPARATOR
 NBSP            NO-BREAK SPACE
 NNBSP           NARROW NO-BREAK SPACE
 PDF             POP DIRECTIONAL FORMATTING
 RLE             RIGHT-TO-LEFT EMBEDDING
 RLM             RIGHT-TO-LEFT MARK
 RLO             RIGHT-TO-LEFT OVERRIDE
 SHY             SOFT HYPHEN
 VS1             VARIATION SELECTOR-1
 .
 .
 .
 VS256           VARIATION SELECTOR-256
 WJ              WORD JOINER
 ZWJ             ZERO WIDTH JOINER
 ZWNJ            ZERO WIDTH NON-JOINER
 ZWSP            ZERO WIDTH SPACE
 For backward compatibility one can use the old names for
 certain C0 and C1 controls
 old                                       new
 FILE SEPARATOR                            INFORMATION SEPARATOR FOUR
 GROUP SEPARATOR                           INFORMATION SEPARATOR THREE
 HORIZONTAL TABULATION                     CHARACTER TABULATION
 HORIZONTAL TABULATION SET                 CHARACTER TABULATION SET
 HORIZONTAL TABULATION WITH JUSTIFICATION  CHARACTER TABULATION
                                           WITH JUSTIFICATION
 PARTIAL LINE DOWN                         PARTIAL LINE FORWARD
 PARTIAL LINE UP                           PARTIAL LINE BACKWARD
 RECORD SEPARATOR                          INFORMATION SEPARATOR TWO
 REVERSE INDEX                             REVERSE LINE FEED
 UNIT SEPARATOR                            INFORMATION SEPARATOR ONE
 VERTICAL TABULATION                       LINE TABULATION
 VERTICAL TABULATION SET                   LINE TABULATION SET
 but the old names in addition to giving the character will
 also give a warning about being deprecated.
 And finally, certain published variants are usable,
 including some for controls that have no Unicode names:
 name                                 character
 END OF PROTECTED AREA                END OF GUARDED AREA, U+0097
 HIGH OCTET PRESET                    U+0081
 HOP                                  U+0081
 IND                                  U+0084
 INDEX                                U+0084
 PAD                                  U+0080
 PADDING CHARACTER                    U+0080
 PRIVATE USE 1                        PRIVATE USE ONE, U+0091
 PRIVATE USE 2                        PRIVATE USE TWO, U+0092
 SGC                                  U+0099
 SINGLE GRAPHIC CHARACTER INTRODUCER  U+0099
 SINGLE-SHIFT 2                       SINGLE SHIFT TWO, U+008E
 SINGLE-SHIFT 3                       SINGLE SHIFT THREE, U+008F
 START OF PROTECTED AREA              START OF GUARDED AREA, U+0096
 perl v5.14.0 2011-05-07 2
Those are the defaults. They are overridable. That's because we feel that
people should be able to name their character constants however they feel
makes sense for them. If they get tired of typing 
 \N{LATIN SMALL LETTER U WITH DIAERESIS}
let alone
 \N{LATIN CAPITAL LETTER THORN WITH STROKE THROUGH DESCENDER}
then they can, because there is a mechanism for making aliases:
 use charnames ":full", ":alias" => {
	U_uml => "LATIN CAPITAL LETTER U WITH DIAERESIS",
	u_uml => "LATIN SMALL LETTER U WITH DIAERESIS",
 };
That way you can do 
 s/\N{U_uml}/UE/;
 s/\N{u_uml}/ue/;
This is probably not as persuasive as the private-use case described below.
It is important to remember that all charname bindings in Perl are attached
to a *lexically-scoped* declaration. Each binding is constrained to
operate only within that lexical scope, and the compiler resolves the names
at compile time. Take something like
 use charnames ":full", ":alias" => {
	U_uml => "LATIN CAPITAL LETTER U WITH DIAERESIS",
	u_uml => "LATIN SMALL LETTER U WITH DIAERESIS",
 };
 my $find_u_uml = qr/\N{u_uml}/i;
 print "Search pattern is: $find_u_uml\n";
Which dutifully prints out:
 Search pattern is: (?^ui:\N{U+FC})
So charname bindings are never "hard to read" because the effect is
completely lexically constrained, and can never leak outside of the scope.
I realize (or at least, believe) that Python has no notion of nested
lexical scopes, and like many things, this sort of thing can therefore
never work there because of that.
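For readers on the Python side, the nearest rough analogue is an explicit alias table expanded at run time. The sketch below is only that: `ALIASES`, `NAMED_ESCAPE`, and `expand_named` are hypothetical names invented for this illustration, and nothing here is resolved at compile time or bound to a lexical scope the way the charnames pragma is.

```python
import re
import unicodedata

# Hypothetical short names, taken from the charnames example above.
ALIASES = {
    "U_uml": "LATIN CAPITAL LETTER U WITH DIAERESIS",
    "u_uml": "LATIN SMALL LETTER U WITH DIAERESIS",
}

# Matches a literal backslash, "N", and a brace-delimited name.
NAMED_ESCAPE = re.compile(r"\\N\{([^}]+)\}")


def expand_named(text: str, aliases: dict) -> str:
    """Replace each named escape in text with the character it names."""
    def replace(match):
        name = match.group(1)
        # Use the alias if one is defined, else treat it as an official name.
        return unicodedata.lookup(aliases.get(name, name))
    return NAMED_ESCAPE.sub(replace, text)


# Raw strings keep the escapes unexpanded until expand_named sees them.
print(expand_named(r"\N{U_uml}ber and \N{u_uml}ber", ALIASES))
```

Because the table is an ordinary dictionary passed as an argument, its reach is whatever the caller chooses, which is a looser guarantee than a lexical pragma.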
The most persuasive use-case for user-defined names is for private-use
area code points. These will never have an official name. But it is 
just fine to use them. Don't they deserve a better name, one that makes
sense within your own program that uses them? Of course they do.
For example, Apple has a bunch of private-use glyphs they use all the time.
In the 8-bit MacRoman encoding, the byte 0xF0 represents the Apple corporate
logo/glyph thingie of an apple with a bite taken out of it. (Microsoft
also has a bunch of these.) If you upgrade MacRoman to Unicode, you will
find that that 0xF0 maps to code point U+F8FF using the regular converter.
Now what are you supposed to do in your program when you want a named character
there? You certainly do not want to make users put an opaque magic number
as a Unicode escape. That is always really lame, because the whole reason 
we have \N{...} escapes is so we don't have to put mysterious unreadable magic
numbers in our code!!
So all you do is 
 use charnames ":alias" => {
 "APPLE LOGO" => 0xF8FF,
 };
and now you can use \N{APPLE LOGO} anywhere within that lexical scope. The
compiler will dutifully resolve it to U+F8FF, since all name lookups happen
at compile-time. And it cannot leak out of the scope.
I assert that this facility makes your program more readable, and its
absence makes your program less readable.
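The MacRoman mapping can be checked from Python, whose built-in `mac_roman` codec decodes byte 0xF0 to U+F8FF. The constant name `APPLE_LOGO` is a hypothetical choice for this sketch, standing in for what a `\N{APPLE LOGO}` alias would give you.

```python
import unicodedata

# Name the private-use code point once, instead of repeating the number.
APPLE_LOGO = "\uf8ff"

# Decode the single MacRoman byte discussed above.
decoded = b"\xf0".decode("mac_roman")

print(f"U+{ord(decoded):04X}")
print(decoded == APPLE_LOGO)
# General category "Co" marks a private-use code point.
print(unicodedata.category(decoded))
```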
Private use characters are important in Asian texts, but they are also
important for other things. For example, Unicode intends to get around
to allocating Tengwar up in the SMP. However, lots of stupid old code
can't use full Unicode, being constrained to UCS-2 only. So many Tengwar
fonts start at a different base, and put it in the private use area instead
of the SMP. Here are two constants:
 use constant {
     TB_CONSCRIPT_UNICODE_REGISTRY => 0x00_E000, # private use
     TB_UNICODE_CONSORTIUM         => 0x01_6080, # where it will really go
 };
I have an entire Tengwar module that makes heavy use of named 
private-use characters. All I do is this:
 use constant TENGWAR_BASE => TB_CONSCRIPT_UNICODE_REGISTRY;
 use charnames ":alias" => {
     reverse (
         (TENGWAR_BASE + 0x00) => "TENGWAR LETTER TINCO",
         (TENGWAR_BASE + 0x01) => "TENGWAR LETTER PARMA",
         (TENGWAR_BASE + 0x02) => "TENGWAR LETTER CALMA",
         (TENGWAR_BASE + 0x03) => "TENGWAR LETTER QUESSE",
         (TENGWAR_BASE + 0x04) => "TENGWAR LETTER ANDO",
         ....
     )
 };
Now you can write \N{TENGWAR LETTER TINCO} etc. See how slick that is?
Consider the alternative. Magic numbers. Worse, magic numbers with funny
calculations in them. That is just so wrong that it completely justifies
letting people name things how they want to, so long as they don't make
other people do the same. What people do in the privacy of their own
lexical scope is their own business.
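The same table-from-a-base idea can be sketched in Python. The letter list below covers only the five names shown above, the two base values are the ones quoted in this message, and `tengwar_names` is a hypothetical helper written for this illustration.

```python
TB_CONSCRIPT_UNICODE_REGISTRY = 0x00E000  # private use
TB_UNICODE_CONSORTIUM = 0x016080          # SMP location quoted in this message

# Only the first five letters, in the order listed above.
LETTERS = ["TINCO", "PARMA", "CALMA", "QUESSE", "ANDO"]


def tengwar_names(base: int) -> dict:
    """Map each 'TENGWAR LETTER x' name to the character at base + offset."""
    return {
        f"TENGWAR LETTER {letter}": chr(base + offset)
        for offset, letter in enumerate(LETTERS)
    }


names = tengwar_names(TB_CONSCRIPT_UNICODE_REGISTRY)
print(f"U+{ord(names['TENGWAR LETTER ANDO']):04X}")
```

Switching bases means changing one argument, and no offset arithmetic appears at the point of use.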
It gets better. Perl lets you define your own character properties, too.
Therefore I can write things like \p{Is_Tengwar_Decimal} and such.
Right now I have these properties:
 In_Tengwar, Is_Tengwar
 In_Tengwar_Alphanumerics
 In_Tengwar_Consonants, In_Tengwar_Vowels, In_Tengwar_Alphabetics
 In_Tengwar_Numerals, Is_Tengwar_Decimal, Is_Tengwar_Duodecimal
 In_Tengwar_Punctuation
 In_Tengwar_Marks 
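Python's `re` module has no `\p{...}` syntax, so the closest plain-Python sketch of a user-defined property is a named set of code point ranges plus a membership test. The ranges below are assumptions chosen for illustration (a block at the private-use base 0xE000), not the real contents of the Tengwar module, and `PROPERTIES` and `has_property` are hypothetical names.

```python
# Hypothetical code point ranges, assuming the private-use base 0xE000.
PROPERTIES = {
    "In_Tengwar": [(0xE000, 0xE07F)],
    "In_Tengwar_Marks": [(0xE040, 0xE05F)],
    "Is_Tengwar_Decimal": [(0xE070, 0xE079)],
}


def has_property(char: str, prop: str) -> bool:
    """Return True if char falls inside any range listed for prop."""
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in PROPERTIES[prop])


print(has_property("\ue045", "In_Tengwar_Marks"))
print(has_property("A", "In_Tengwar"))
```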
So I have code in my Tengwar module that does stuff like this, using
my own named characters (which again, are compile-time resolved and 
work only within this lexical scope):
 chr( $1 + ord("\N{TENGWAR DIGIT ZERO}") )
Not to mention this using my own properties:
 $TENGWAR_GRAPHEME_RX = qr/(?:(?=\p{In_Tengwar})\P{In_Tengwar_Marks}\p{In_Tengwar_Marks}*)|\p{In_Tengwar_Marks}/x;
Actually, I'm fibbing. I *never* write regexes all on one line like
that: they are abhorrent to me. The pattern really looks like this in
the code:
 $TENGWAR_GRAPHEME_RX = qr{
     (?:
         (?= \p{In_Tengwar} ) \P{In_Tengwar_Marks}   # Either one basechar...
         \p{In_Tengwar_Marks} *                      # ... plus 0 or more marks
     ) |
     \p{In_Tengwar_Marks}                            # or else a naked unpaired mark.
 }x;
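A rough Python rendering of the same pattern, using `re.VERBOSE` for the whitespace and comments. Since Python's `re` lacks user-defined properties, explicit character classes stand in for them; the two ranges are assumptions made for this sketch, not the module's real property definitions.

```python
import re

# Assumed private-use layout (hypothetical, for illustration only).
TENGWAR = "\ue000-\ue07f"  # the whole block
MARKS = "\ue040-\ue05f"    # combining marks within it

TENGWAR_GRAPHEME_RX = re.compile(
    rf"""
    (?:
        (?= [{TENGWAR}] ) [^{MARKS}]   # either one base character...
        [{MARKS}]*                     # ... plus zero or more marks
    )
    |
    [{MARKS}]                          # or else a naked unpaired mark
    """,
    re.VERBOSE,
)

# A base with two marks, a bare base, a non-Tengwar letter, a lone mark.
sample = "\ue000\ue040\ue041" + "\ue001" + "x" + "\ue042"
graphemes = TENGWAR_GRAPHEME_RX.findall(sample)
print(len(graphemes))
```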
People who write patterns without whitespace for cognitive chunking (plus
comments for explanation) are wicked wicked wicked. Frankly I'm surprised 
Python doesn't require it. :)/2
Anyway, do you see how much better that is than opaque unreadable magic
numbers? Can you just imagine the sheer horror of writing that sort of
code without the ability to define your own named characters *and* your 
own character properties? It's beautiful, simple, clean, and readable.
I'll even go so far as to call it intuitive.
No, I don't expect Python to do this sort of thing. You don't have proper
scoping, so you can't ever do it cleanly the way Perl can.
I just wanted to give a concrete example where flexibility leads to a 
much more readable program than inflexibility ever can. 
--tom
 "We hates magic numberses. We hates them forevers!"
 --Sméagol the Hacker
History
Date                 User     Action  Args
2011-10-02 05:33:42  tchrist  set     recipients: + tchrist, lemburg, gvanrossum, loewis, terry.reedy, ezio.melotti, mrabarnett
2011-10-02 05:33:41  tchrist  link    issue12753 messages
2011-10-02 05:33:36  tchrist  create