Author tchrist
Recipients Arfrever, ezio.melotti, gvanrossum, loewis, tchrist, terry.reedy, vstinner
Date 2011-09-30.12:37:56
Message-id <26418.1317386261@chthon>
In-reply-to <4E859BAF.2050505@v.loewis.de>
Content
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> "Split S into words. Change the first letter in a word to upper-case,
Except that I think you actually mean that the first "letter" is 
changed into titlecase not uppercase. 
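To make the titlecase/uppercase distinction concrete, here is a small Python sketch. The choice of U+01C6 (LATIN SMALL LETTER DZ WITH CARON) is only an illustration picked for this note; it is one of the digraphs whose titlecase and uppercase mappings differ.

```python
# U+01C6 has distinct uppercase and titlecase mappings, so it shows why
# "change the first letter to uppercase" and "to titlecase" are not the same.
dz = "\u01c6"          # LATIN SMALL LETTER DZ WITH CARON
upper = dz.upper()     # full uppercase mapping
title = dz.title()     # titlecase mapping for a one-character "word"

print(f"upper: U+{ord(upper):04X}  title: U+{ord(title):04X}")
```

For almost every other cased letter the two mappings coincide, which is why the difference is easy to overlook.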
One might also say *try* to change for all these, in that not
all cased code points in Unicode have casemaps that are different
from themselves. For example, a superscript lowercase a or b has
no distinct uppercase mapping, the way the non-superscript versions do:
 % (echo xyz; echo ab AB | unisupers) | uc
 XYZ
 ᵃᵇ ᴬᴮ
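The same experiment can be sketched in Python. This assumes that the superscripts in question are the modifier letters U+1D43, U+1D47, U+1D2C and U+1D2E; which code points unisupers actually emits is an assumption here.

```python
# Plain letters have an uppercase mapping; the superscript (modifier)
# letters below have none, so upper() hands them back unchanged.
plain = "xyz"
supers = "\u1d43\u1d47 \u1d2c\u1d2e"   # superscript a, b, space, superscript A, B

plain_upper = plain.upper()
supers_upper = supers.upper()

print(plain_upper)
print(supers_upper == supers)
```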
> and all subsequent letters to lower case. A word is a sequence that
> starts with a letter, followed by letter-related characters."
I don't like the way you have defined letters and letter-related
characters. The first already has a definition, which is not the
one you are using. Word characters also have a definition in Unicode,
and it is not the one you are using. I strongly advise against
redefining standard Unicode properties. Choose other, unused terms 
if you must. It is very confusing otherwise.
> Letters are all characters from the "Alphabetic" category, i.e.
> Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic.
Except that is exactly the definition of the Unicode Alphabetic property,
not the Unicode Letter property. It is a mistake to equate
Letter=Alphabetic, and very confusing too.
I agree that this is probably what you want, though. I just don't think you
should use "letter-related characters" when there is an existing formal
definition that works, or that you should redefine Letter.
> "letter-related" characters are letters + marks (Mn, Mc, Me).
That isn't quite right. 
 * Letters are Lu+Ll+Lt+Lm+Lo.
 * Alphabetic is Letters + Other_Alphabetic.
 * Other_Alphabetic is certain marks (like the iota subscript) and the
 letter numbers (Nl), as well as a few symbols.
 * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.
I think what you are looking for here is Word characters without
Nd + Pc, so just Alphabetic + Mn+Mc+Me.
Is that right?
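The general-category part of these definitions can be sketched in Python. Two caveats: the standard unicodedata module exposes general categories but not the Alphabetic or Other_Alphabetic properties, so the last two helpers are only approximations that miss the few Other_Alphabetic symbols; and all the function names are made up for this illustration.

```python
import unicodedata

LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo"}   # the Unicode Letter categories
MARK = {"Mn", "Mc", "Me"}                 # the Unicode Mark categories

def is_letter(ch):
    """True for Letter (L*) code points only; Nl is not a Letter."""
    return unicodedata.category(ch) in LETTER

def is_mark(ch):
    """True for nonspacing, spacing and enclosing marks."""
    return unicodedata.category(ch) in MARK

def is_word_char_approx(ch):
    """Approximates Alphabetic + Mn+Mc+Me + Nd + Pc using categories only."""
    return unicodedata.category(ch) in LETTER | MARK | {"Nl", "Nd", "Pc"}

def is_title_word_char_approx(ch):
    """Approximates Alphabetic + Mn+Mc+Me, i.e. Word characters minus Nd and Pc."""
    return unicodedata.category(ch) in LETTER | MARK | {"Nl"}

for ch in ("a", "\u2160", "\u0345", "5", "_"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          is_word_char_approx(ch), is_title_word_char_approx(ch))
```

An exact version would need the Alphabetic property itself, for example from DerivedCoreProperties.txt in the Unicode Character Database.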
--tom
PS: You can do union/intersection stuff with properties to see what
 the resulting sets look like using the unichars command-line tool.
 This is everything that is both alphabetic and also a mark:
 % unichars -gs '\p{Alphabetic}' '\pM'
 ‭ ◌ͅ U+0345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI
 ‭ ◌ְ U+05B0 GC=Mn SC=Hebrew HEBREW POINT SHEVA
 ‭ ◌ֱ U+05B1 GC=Mn SC=Hebrew HEBREW POINT HATAF SEGOL
 ‭ ◌ֲ U+05B2 GC=Mn SC=Hebrew HEBREW POINT HATAF PATAH
 ‭ ◌ֳ U+05B3 GC=Mn SC=Hebrew HEBREW POINT HATAF QAMATS
 ...
 ‭ ◌ं U+0902 GC=Mn SC=Devanagari DEVANAGARI SIGN ANUSVARA
 ‭ ः U+0903 GC=Mc SC=Devanagari DEVANAGARI SIGN VISARGA
 ‭ ा U+093E GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN AA
 ‭ ि U+093F GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN I
 ‭ ी U+0940 GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN II
 ‭ ◌ु U+0941 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN U
 ‭ ◌ू U+0942 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN UU
 ‭ ◌ृ U+0943 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN VOCALIC R
 ‭ ◌ॄ U+0944 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN VOCALIC RR
 ...
 While these are the NON-alphabetic marks, which are still Word
 characters though of course:
 % unichars -gs '\P{Alphabetic}' '\pM'
 ‭ ◌̀ U+0300 GC=Mn SC=Inherited COMBINING GRAVE ACCENT
 ‭ ◌́ U+0301 GC=Mn SC=Inherited COMBINING ACUTE ACCENT
 ‭ ◌̂ U+0302 GC=Mn SC=Inherited COMBINING CIRCUMFLEX ACCENT
 ‭ ◌̃ U+0303 GC=Mn SC=Inherited COMBINING TILDE
 ‭ ◌̄ U+0304 GC=Mn SC=Inherited COMBINING MACRON
 ‭ ◌̅ U+0305 GC=Mn SC=Inherited COMBINING OVERLINE
 ‭ ◌̆ U+0306 GC=Mn SC=Inherited COMBINING BREVE
 ‭ ◌̇ U+0307 GC=Mn SC=Inherited COMBINING DOT ABOVE
 ‭ ◌̈ U+0308 GC=Mn SC=Inherited COMBINING DIAERESIS
 ‭ ◌̉ U+0309 GC=Mn SC=Inherited COMBINING HOOK ABOVE
 ‭ ◌̊ U+030A GC=Mn SC=Inherited COMBINING RING ABOVE
 ‭ ◌̋ U+030B GC=Mn SC=Inherited COMBINING DOUBLE ACUTE ACCENT
 ‭ ◌̌ U+030C GC=Mn SC=Inherited COMBINING CARON
 ...
 And here are the Cased code points that do not change when
 upper-, title-, or lowercased:
 % unichars -gs '\p{Cased}' '[^\p{CWU}\p{CWT}\p{CWL}]'
 ‭ ª U+00AA GC=Ll SC=Latin FEMININE ORDINAL INDICATOR
 ‭ º U+00BA GC=Ll SC=Latin MASCULINE ORDINAL INDICATOR
 ‭ ĸ U+0138 GC=Ll SC=Latin LATIN SMALL LETTER KRA
 ‭ ƍ U+018D GC=Ll SC=Latin LATIN SMALL LETTER TURNED DELTA
 ‭ ƛ U+019B GC=Ll SC=Latin LATIN SMALL LETTER LAMBDA WITH STROKE
 ‭ ƪ U+01AA GC=Ll SC=Latin LATIN LETTER REVERSED ESH LOOP
 ‭ ƫ U+01AB GC=Ll SC=Latin LATIN SMALL LETTER T WITH PALATAL HOOK
 ‭ ƺ U+01BA GC=Ll SC=Latin LATIN SMALL LETTER EZH WITH TAIL
 ‭ ƾ U+01BE GC=Ll SC=Latin LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
 ‭ ȡ U+0221 GC=Ll SC=Latin LATIN SMALL LETTER D WITH CURL
 ‭ ȴ U+0234 GC=Ll SC=Latin LATIN SMALL LETTER L WITH CURL
 ‭ ȵ U+0235 GC=Ll SC=Latin LATIN SMALL LETTER N WITH CURL
 ‭ ȶ U+0236 GC=Ll SC=Latin LATIN SMALL LETTER T WITH CURL
 ‭ ȷ U+0237 GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS J
 ‭ ȸ U+0238 GC=Ll SC=Latin LATIN SMALL LETTER DB DIGRAPH
 ‭ ȹ U+0239 GC=Ll SC=Latin LATIN SMALL LETTER QP DIGRAPH
 ‭ ɕ U+0255 GC=Ll SC=Latin LATIN SMALL LETTER C WITH CURL
 ‭ ɘ U+0258 GC=Ll SC=Latin LATIN SMALL LETTER REVERSED E
 ‭ ɚ U+025A GC=Ll SC=Latin LATIN SMALL LETTER SCHWA WITH HOOK
 ‭ ɜ U+025C GC=Ll SC=Latin LATIN SMALL LETTER REVERSED OPEN E
 ‭ ɝ U+025D GC=Ll SC=Latin LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
 ‭ ɞ U+025E GC=Ll SC=Latin LATIN SMALL LETTER CLOSED REVERSED OPEN E
 ‭ ɟ U+025F GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS J WITH STROKE
 ‭ ɡ U+0261 GC=Ll SC=Latin LATIN SMALL LETTER SCRIPT G
 ‭ ɢ U+0262 GC=Ll SC=Latin LATIN LETTER SMALL CAPITAL G
 ‭ ɤ U+0264 GC=Ll SC=Latin LATIN SMALL LETTER RAMS HORN
 ‭ ɥ U+0265 GC=Ll SC=Latin LATIN SMALL LETTER TURNED H
 ‭ ɦ U+0266 GC=Ll SC=Latin LATIN SMALL LETTER H WITH HOOK
 ...
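A rough Python counterpart of that last query can be sketched as follows. Python has no direct test for the Cased property, so `is_cased` below is an approximation built from `str.islower()`, `str.isupper()` and the Lt category; the helper names and the scan limit are made up for this example. The result also depends on the Unicode version your Python ships with, since some of the letters listed above have gained uppercase forms in later versions.

```python
import unicodedata

def is_cased(ch):
    """Approximation of \\p{Cased}: lowercase, uppercase, or titlecase (Lt)."""
    return ch.islower() or ch.isupper() or unicodedata.category(ch) == "Lt"

def changes_under_casing(ch):
    """True if any of the upper, title, or lower mappings alters ch."""
    return ch.upper() != ch or ch.title() != ch or ch.lower() != ch

def cased_but_unchanged(limit=0x250):
    """Cased code points below `limit` that no case mapping changes."""
    return [ch for ch in map(chr, range(limit))
            if is_cased(ch) and not changes_under_casing(ch)]

for ch in cased_but_unchanged():
    print(f"{ch} U+{ord(ch):04X} GC={unicodedata.category(ch)} "
          f"{unicodedata.name(ch)}")
```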
 You can get unichars from http://training.perl.com/scripts/unichars
 where you might also care to get uniprops and perhaps uninames to go
 with it. There are other Unicode tools there (the directory is
 100% Unicode tools, not general scripts as its name suggests), but
 those are the important ones, I reckon.
History
Date                 User     Action  Args
2011-09-30 12:37:58  tchrist  set     recipients: + tchrist, gvanrossum, loewis, terry.reedy, vstinner, ezio.melotti, Arfrever
2011-09-30 12:37:57  tchrist  link    issue12737 messages
2011-09-30 12:37:56  tchrist  create
