Message 144688 - Python tracker


Author	tchrist
Recipients	Arfrever, ezio.melotti, gvanrossum, loewis, tchrist, terry.reedy, vstinner
Date	2011-09-30.12:37:56
SpamBayes Score	1.3228307e-13
Marked as misclassified	No
Message-id	<26418.1317386261@chthon>
In-reply-to	<4E859BAF.2050505@v.loewis.de>

Content

> Martin v. Löwis <martin@v.loewis.de> added the comment:
> "Split S into words. Change the first letter in a word to upper-case,
Except that I think you actually mean that the first "letter" is 
changed into titlecase not uppercase. 
One might also say *try* to change for all of these, since not
all cased code points in Unicode have case mappings that differ
from themselves. For example, a superscript lowercase a or b has
no distinct uppercase mapping, the way the non-superscript versions do:
 % (echo xyz; echo ab AB | unisupers) | uc
 XYZ
 ᵃᵇ ᴬᴮ
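For readers following along in Python, here is a minimal sketch of the two
points above: titlecase is not the same thing as uppercase, and some cased
characters have no uppercase mapping at all. The helper name case_forms is
made up for illustration, and the results depend on the Unicode version
bundled with the interpreter.

```python
# Sketch only: compares the three case forms of a character using str methods.
DZ_SMALL = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON
SUPER_A = "\u1d43"   # MODIFIER LETTER SMALL A (a superscript "a")


def case_forms(ch: str) -> dict:
    """Return the upper-, title-, and lowercase forms of a string."""
    return {"upper": ch.upper(), "title": ch.title(), "lower": ch.lower()}


dz = case_forms(DZ_SMALL)
sup = case_forms(SUPER_A)

# The digraph has three distinct forms: uppercase, titlecase, lowercase.
print([f"U+{ord(c):04X}" for c in dz.values()])

# The superscript letter has no uppercase mapping, so upper() returns it as is.
print(sup["upper"] == SUPER_A)
```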
> and all subsequent letters to lower case. A word is a sequence that
> starts with a letter, followed by letter-related characters."
I don't like the way you have defined letters and letter-related
characters. The first already has a definition, which is not the
one you are using. Word characters also have a definition in Unicode,
and it is not the one you are using. I strongly advise against
redefining standard Unicode properties. Choose other, unused terms 
if you must. It is very confusing otherwise.
> Letters are all characters from the "Alphabetic" category, i.e.
> Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic.
Except that is exactly the definition of the Unicode Alphabetic property,
not the Unicode Letter property. It is a mistake to equate
Letter=Alphabetic, and very confusing too.
I agree that this is probably what you want, though. I just don't think you
should use "letter-related characters" when there is an existing formal
definition that works, or that you should redefine Letter.
> "letter-related" characters are letters + marks (Mn, Mc, Me).
That isn't quite right. 
 * Letters are Lu+Ll+Lt+Lm+Lo.
 * Alphabetic is Letters + Other_Alphabetic.
 * Other_Alphabetic is certain marks (like the iota subscript) and the
 letter numbers (Nl), as well as a few symbols.
 * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.
I think what you are looking for here is Word characters without
Nd + Pc, so just Alphabetic + Mn+Mc+Me.
Is that right?
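The distinctions in the list above can be sketched in Python using general
categories. This is an approximation under stated assumptions: the standard
unicodedata module exposes the General_Category but not the Alphabetic or
Other_Alphabetic properties, so the hypothetical is_wordish helper below
stands in for "Alphabetic + Mn+Mc+Me" with "L* + Nl + M*" and misses the few
symbols that are Other_Alphabetic. All function names are invented for
illustration.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """Unicode Letter: general category Lu, Ll, Lt, Lm, or Lo."""
    return unicodedata.category(ch).startswith("L")


def is_mark(ch: str) -> bool:
    """Unicode Mark: general category Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")


def is_letter_number(ch: str) -> bool:
    """Letter numbers (Nl), such as the Roman numeral characters."""
    return unicodedata.category(ch) == "Nl"


def is_wordish(ch: str) -> bool:
    """Approximation of Alphabetic + marks, i.e. Word characters minus Nd and Pc."""
    return is_letter(ch) or is_letter_number(ch) or is_mark(ch)


# U+0345 is a mark, not a Letter, even though Unicode counts it as Alphabetic.
print(is_letter("\u0345"), is_mark("\u0345"))
# Digits and the underscore are Word characters but fall outside this subset.
print(is_wordish("5"), is_wordish("_"))
```

Note that str.isalpha() tests the Letter categories only, so it also differs
from the Unicode Alphabetic property.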
--tom
PS: You can do union/intersection stuff with properties to see what
 the resulting sets look like using the unichars command-line tool.
 This is everything that is both alphabetic and also a mark:
 % unichars -gs '\p{Alphabetic}' '\pM'
 ‭ ◌ͅ U+0345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI
 ‭ ◌ְ U+05B0 GC=Mn SC=Hebrew HEBREW POINT SHEVA
 ‭ ◌ֱ U+05B1 GC=Mn SC=Hebrew HEBREW POINT HATAF SEGOL
 ‭ ◌ֲ U+05B2 GC=Mn SC=Hebrew HEBREW POINT HATAF PATAH
 ‭ ◌ֳ U+05B3 GC=Mn SC=Hebrew HEBREW POINT HATAF QAMATS
 ...
 ‭ ◌ं U+0902 GC=Mn SC=Devanagari DEVANAGARI SIGN ANUSVARA
 ‭ ः U+0903 GC=Mc SC=Devanagari DEVANAGARI SIGN VISARGA
 ‭ ा U+093E GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN AA
 ‭ ि U+093F GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN I
 ‭ ी U+0940 GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN II
 ‭ ◌ु U+0941 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN U
 ‭ ◌ू U+0942 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN UU
 ‭ ◌ृ U+0943 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN VOCALIC R
 ‭ ◌ॄ U+0944 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN VOCALIC RR
 ...
 While these are the NON-alphabetic marks, which are still Word
 characters though of course:
 % unichars -gs '\P{Alphabetic}' '\pM'
 ‭ ◌̀ U+0300 GC=Mn SC=Inherited COMBINING GRAVE ACCENT
 ‭ ◌́ U+0301 GC=Mn SC=Inherited COMBINING ACUTE ACCENT
 ‭ ◌̂ U+0302 GC=Mn SC=Inherited COMBINING CIRCUMFLEX ACCENT
 ‭ ◌̃ U+0303 GC=Mn SC=Inherited COMBINING TILDE
 ‭ ◌̄ U+0304 GC=Mn SC=Inherited COMBINING MACRON
 ‭ ◌̅ U+0305 GC=Mn SC=Inherited COMBINING OVERLINE
 ‭ ◌̆ U+0306 GC=Mn SC=Inherited COMBINING BREVE
 ‭ ◌̇ U+0307 GC=Mn SC=Inherited COMBINING DOT ABOVE
 ‭ ◌̈ U+0308 GC=Mn SC=Inherited COMBINING DIAERESIS
 ‭ ◌̉ U+0309 GC=Mn SC=Inherited COMBINING HOOK ABOVE
 ‭ ◌̊ U+030A GC=Mn SC=Inherited COMBINING RING ABOVE
 ‭ ◌̋ U+030B GC=Mn SC=Inherited COMBINING DOUBLE ACUTE ACCENT
 ‭ ◌̌ U+030C GC=Mn SC=Inherited COMBINING CARON
 ...
 And here are the Cased code points that do not change when
 upper-, title-, or lowercased:
 % unichars -gs '\p{Cased}' '[^\p{CWU}\p{CWT}\p{CWL}]'
 ‭ ª U+00AA GC=Ll SC=Latin FEMININE ORDINAL INDICATOR
 ‭ º U+00BA GC=Ll SC=Latin MASCULINE ORDINAL INDICATOR
 ‭ ĸ U+0138 GC=Ll SC=Latin LATIN SMALL LETTER KRA
 ‭ ƍ U+018D GC=Ll SC=Latin LATIN SMALL LETTER TURNED DELTA
 ‭ ƛ U+019B GC=Ll SC=Latin LATIN SMALL LETTER LAMBDA WITH STROKE
 ‭ ƪ U+01AA GC=Ll SC=Latin LATIN LETTER REVERSED ESH LOOP
 ‭ ƫ U+01AB GC=Ll SC=Latin LATIN SMALL LETTER T WITH PALATAL HOOK
 ‭ ƺ U+01BA GC=Ll SC=Latin LATIN SMALL LETTER EZH WITH TAIL
 ‭ ƾ U+01BE GC=Ll SC=Latin LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
 ‭ ȡ U+0221 GC=Ll SC=Latin LATIN SMALL LETTER D WITH CURL
 ‭ ȴ U+0234 GC=Ll SC=Latin LATIN SMALL LETTER L WITH CURL
 ‭ ȵ U+0235 GC=Ll SC=Latin LATIN SMALL LETTER N WITH CURL
 ‭ ȶ U+0236 GC=Ll SC=Latin LATIN SMALL LETTER T WITH CURL
 ‭ ȷ U+0237 GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS J
 ‭ ȸ U+0238 GC=Ll SC=Latin LATIN SMALL LETTER DB DIGRAPH
 ‭ ȹ U+0239 GC=Ll SC=Latin LATIN SMALL LETTER QP DIGRAPH
 ‭ ɕ U+0255 GC=Ll SC=Latin LATIN SMALL LETTER C WITH CURL
 ‭ ɘ U+0258 GC=Ll SC=Latin LATIN SMALL LETTER REVERSED E
 ‭ ɚ U+025A GC=Ll SC=Latin LATIN SMALL LETTER SCHWA WITH HOOK
 ‭ ɜ U+025C GC=Ll SC=Latin LATIN SMALL LETTER REVERSED OPEN E
 ‭ ɝ U+025D GC=Ll SC=Latin LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
 ‭ ɞ U+025E GC=Ll SC=Latin LATIN SMALL LETTER CLOSED REVERSED OPEN E
 ‭ ɟ U+025F GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS J WITH STROKE
 ‭ ɡ U+0261 GC=Ll SC=Latin LATIN SMALL LETTER SCRIPT G
 ‭ ɢ U+0262 GC=Ll SC=Latin LATIN LETTER SMALL CAPITAL G
 ‭ ɤ U+0264 GC=Ll SC=Latin LATIN SMALL LETTER RAMS HORN
 ‭ ɥ U+0265 GC=Ll SC=Latin LATIN SMALL LETTER TURNED H
 ‭ ɦ U+0266 GC=Ll SC=Latin LATIN SMALL LETTER H WITH HOOK
 ...
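 A rough Python counterpart to the last query, offered as a sketch rather
 than an equivalent of unichars: the standard library does not expose the
 Cased, CWU, CWT, or CWL properties, so this assumes "cased" can be
 approximated by general category Lu, Ll, or Lt and tests the str case
 methods directly. The function name unchanged_cased is hypothetical, and
 the results vary with the Unicode version of the interpreter, so they
 will not match the listing above exactly.

```python
import unicodedata

CASED_CATEGORIES = ("Lu", "Ll", "Lt")  # assumption: stand-in for \p{Cased}


def unchanged_cased(start: int, stop: int) -> list:
    """Collect cased characters in [start, stop) that no case method alters."""
    result = []
    for cp in range(start, stop):
        ch = chr(cp)
        if unicodedata.category(ch) not in CASED_CATEGORIES:
            continue
        if ch.upper() == ch and ch.title() == ch and ch.lower() == ch:
            result.append(ch)
    return result


# Scan Latin Extended-A and print each hit in a unichars-like layout.
found = unchanged_cased(0x0100, 0x0180)
for ch in found:
    print(f"{ch} U+{ord(ch):04X} GC={unicodedata.category(ch)} {unicodedata.name(ch)}")
```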
 You can get unichars from http://training.perl.com/scripts/unichars
 where you might also care to get uniprops and perhaps uninames to go
 with it. There are other Unicode tools there (the directory is
 100% Unicode tools, not general scripts as its name suggests), but
 those are the important ones, I reckon.

History
Date	User	Action	Args
2011-09-30 12:37:58	tchrist	set	recipients: + tchrist, gvanrossum, loewis, terry.reedy, vstinner, ezio.melotti, Arfrever
2011-09-30 12:37:57	tchrist	link	issue12737 messages
2011-09-30 12:37:56	tchrist	create
