
This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation
Type: enhancement Stage:
Components: Interpreter Core, Unicode Versions: Python 3.6, Python 3.4, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, Jean-Michel.Fauth, Jim.Jewett, belopolsky, benjamin.peterson, ezio.melotti, mrabarnett, pitrou, python-dev, tchrist, Андрей Баксаляр
Priority: normal Keywords: patch

Created on 2011-08-11 21:39 by tchrist, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
mux.python tchrist, 2011-08-11 21:39 demo program showing all casemaps and casefolds for sample tricky dataset
casing-tests.py tchrist, 2011-08-26 23:55 test suite for casemapping functions, case checking functions, and casefolding of patterns, both simple and full
casing-results.txt ezio.melotti, 2011-08-28 05:54 results on 3.2/3.3 narrow/wide
full-casemapping.patch benjamin.peterson, 2012-01-08 03:54
full-casemapping.patch benjamin.peterson, 2012-01-10 03:49
full-casemapping.patch benjamin.peterson, 2012-01-11 03:37
full-casemapping.patch benjamin.peterson, 2012-01-11 20:20
pythonbug.png Андрей Баксаляр, 2016-03-10 20:21
Messages (27)
msg141928 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-11 21:39
Python's casemapping functions only use what Unicode calls simple casemaps. These are only appropriate for functions that operate on single characters alone, not for those that operate on strings. The reason for this is that you get much better results with full casemapping. Java, Ruby, and Perl all do full casemapping for their equivalent functions that do string mapping, and Python should, too.
I include a program that has a bunch of mappings and foldings both simple and full. Yes, it was machine-generated.
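As an illustration of what full casemapping means in practice, here is a minimal sketch. It assumes Python 3.3 or later, where `str.upper()` and `str.title()` apply the full mappings; `full_case_report` is a hypothetical helper name used only for this example.

```python
def full_case_report(s):
    """Return (uppercased, titlecased, length before, length after uppercasing)."""
    up = s.upper()
    return up, s.title(), len(s), len(up)

# Code points whose full uppercase is longer than the original:
# sharp s, the ffi ligature, and n preceded by apostrophe.
for ch in ("\u00df", "\ufb03", "\u0149"):
    up, title, before, after = full_case_report(ch)
    # ascii() keeps the output readable on terminals that are not UTF-8.
    print(f"U+{ord(ch):04X}: upper={ascii(up)} title={ascii(title)} "
          f"length {before} -> {after}")
```

A simple (one-to-one) mapping cannot produce these results, because the string has to grow.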
msg143036 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-26 21:11
I presume this applies to builtin str methods like .lower(), right? I think it is a good thing to do for Python 3.3.
We'd need to define what should happen in edge cases, e.g. when (against all odds) a string happens to contain a lone surrogate or some other code point or sequence of code points that the Unicode standard considers illegal. I think it should not fail but just leave those code points alone.
Does this require us to import more data files from the Unicode standard? By itself that doesn't scare me.
Would this also affect .islower() and friends?
msg143051 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-26 23:36
Guido van Rossum <report@bugs.python.org> wrote
 on 2011-08-26 21:11:24 -0000:
> Guido van Rossum <guido@python.org> added the comment:
> I presume this applies to builtin str methods like .lower(), right? I
> think it is a good thing to do for Python 3.3.
Yes, the full casemaps are for upper, title, and lowercase. There is 
also a full casefold and turkic case fold (which is full), but you
don't have a casefold function so I guess that doesn't matter.
> We'd need to define what should happen in edge cases, e.g. when
> (against all odds) a string happens to contain a lone surrogate or
> some other code point or sequence of code points that the Unicode
> standard considers illegal. I think it should not fail but just leave
> those code points alone.
Well, it's a funny thing. There are properties given for all
Unicode code points, even noncharacter code points. This
includes the casing properties, oddly enough.
From UnicodeData.txt, which has a few surrogate entries; notice
no casing is given:
 D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
 DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
 DB80;<Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
 DBFF;<Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
 DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
 DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;;
And in SpecialCasing.txt, which does not have surrogates but does have
a default clause:
 # This file is a supplement to the UnicodeData file.
 # It contains additional information about the casing of Unicode characters.
 # (For compatibility, the UnicodeData.txt file only contains case mappings for
 # characters where they are 1-1, and independent of context and language.
 # For more information, see the discussion of Case Mappings in the Unicode Standard.
 #
 # All code points not listed in this file that do not have a simple case mappings
 # in UnicodeData.txt map to themselves.
And in CaseFolding.txt, which also does not have surrogates but again does 
have a default clause:
 # The data supports both implementations that require simple case foldings
 # (where string lengths don't change), and implementations that allow full case folding
 # (where string lengths may grow). Note that where they can be supported, the
 # full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.
 #
 # All code points not listed in this file map to themselves.
Taken all together, it follows that the surrogates have case{map,fold}s
back to themselves, since they have no case{map,fold}s listed.
It's ok to have arbitrary code points in memory, including surrogates and
the 66 noncharacters. It just isn't legal to have them in a UTF stream
for "open interchange", whatever that means. 
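The conclusion that surrogates and noncharacters map to themselves can be checked with a short sketch. It assumes Python 3.3 or later (for `str.casefold`); `maps_to_itself` is a hypothetical helper, and the code points are compared rather than printed because lone surrogates cannot be encoded for output.

```python
def maps_to_itself(ch):
    """Return True if every str case operation leaves ch unchanged."""
    operations = (str.lower, str.upper, str.title, str.casefold)
    return all(op(ch) == ch for op in operations)

# First and last code points of the surrogate ranges listed above.
lone_surrogates = [chr(cp) for cp in (0xD800, 0xDBFF, 0xDC00, 0xDFFF)]
# A few of the 66 noncharacters.
noncharacters = [chr(cp) for cp in (0xFDD0, 0xFFFE, 0xFFFF, 0x10FFFF)]

all_unchanged = all(maps_to_itself(ch) for ch in lone_surrogates + noncharacters)
print("all unchanged:", all_unchanged)
```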
> Does this require us to import more data files from the Unicode
> standard? By itself that doesn't scare me.
One way or the other, yes, notably the SpecialCasing file for
casemapping and the CaseFolding file for casefolding (which you
should do anyway to fix re.I). But you can and should process the
new files into some tighter format optimized for your own lookups.
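One way to picture "processing the new files into a tighter format" is the sketch below. It assumes the SpecialCasing.txt field layout `code; lower; title; upper; (condition_list;) # comment` and embeds a four-line excerpt instead of reading the real file. `FullCasing`, `hexes_to_str`, and `parse_special_casing` are hypothetical names, and conditional entries are simply skipped, which a real implementation could not do.

```python
from typing import Dict, NamedTuple

class FullCasing(NamedTuple):
    lower: str
    title: str
    upper: str

# Excerpt in the SpecialCasing.txt layout; the last entry is conditional.
SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
"""

def hexes_to_str(field: str) -> str:
    """Turn a field such as '0053 0073' into the string 'Ss'."""
    return "".join(chr(int(h, 16)) for h in field.split())

def parse_special_casing(text: str) -> Dict[int, FullCasing]:
    table: Dict[int, FullCasing] = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if fields[-1] == "":  # the final ';' leaves an empty field
            fields.pop()
        if len(fields) != 4:
            continue  # conditional entry (context or language); skipped here
        code, lower, title, upper = fields
        table[int(code, 16)] = FullCasing(
            hexes_to_str(lower), hexes_to_str(title), hexes_to_str(upper)
        )
    return table

table = parse_special_casing(SAMPLE)
print(ascii(table[0xDF]))
```

A dictionary keyed by code point is only the simplest possible layout; a production table would presumably be packed far more densely.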
Oddly, Java doesn't provide for String methods that do full casing on
titlecase, even though they do so on lowercase and uppercase. On
titlecase they only expose the simple casemaps via the Character class,
which are the ones from UnicodeData. They recognize that this is a flaw,
but it was too late to fix it for Java 7.
> Would this also affect .islower() and friends?
Well, it shouldn't, but .islower() and friends are already mistaken.
They seem to be checking for GC=Ll and such, but they need to be
checking the Unicode binary property Lowercase and such. Watch:
 test 37 for string Ⅷ
 wanted <ⅷ> to be lowercase of <Ⅷ> but python disagrees
 wanted <Ⅷ> to be titlecase of <Ⅷ> but python disagrees
 wanted <Ⅷ> to be uppercase of <Ⅷ> but python disagrees
 test 37 failed 3 subtests
 test 39 for string Ⓚ
 wanted <ⓚ> to be lowercase of <Ⓚ> but python disagrees
 wanted <Ⓚ> to be titlecase of <Ⓚ> but python disagrees
 wanted <Ⓚ> to be uppercase of <Ⓚ> but python disagrees
 test 39 failed 3 subtests
That's because the Roman numerals are GC=Nl but still have
case and change case. Similarly for the circled letters which
are GC=So but have case and change case. Plus there's U+0345,
the iota subscript, which is GC=Mn but has case and changes case.
I don't remember whether I've sent in my full test suite or not. 
If I haven't yet, I should attach it to the bug report.
--tom
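The gap between general category and casing behaviour can be inspected from Python itself. The sketch below uses only `unicodedata.category`, `unicodedata.name`, and the `str` methods. What `islower()` and `isupper()` report depends on the interpreter version and its Unicode data, so no particular output is claimed here.

```python
import unicodedata

# Roman numeral eight (capital and small), circled K (capital and small),
# and the combining iota subscript: none of them is Lu, Ll, or Lt.
SAMPLES = ("\u2167", "\u2177", "\u24c0", "\u24da", "\u0345")

def code_points(s):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

def describe(ch):
    """Collect the general category, the case predicates, and the case mappings."""
    return {
        "name": unicodedata.name(ch),
        "gc": unicodedata.category(ch),
        "islower": ch.islower(),
        "isupper": ch.isupper(),
        "lower": code_points(ch.lower()),
        "upper": code_points(ch.upper()),
    }

for ch in SAMPLES:
    print(code_points(ch), describe(ch))
```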
msg143052 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-26 23:55
Here’s my casing test suite; I thought I sent it in but the mux file here isn’t the full thing.
 It does several things, including letting you run it with regex vs re. It also checks for the islower, etc functions. It has both simple and full (and turkic) maps and folds in it, but is configured to only check the simple versions for now. The islower and isupper etc functions seem to be checking the wrong Unicode property.
Yes, it has my quaint Unixisms in it, because it needs to run with UTF-8 output, or you can't read what's going on.
msg143072 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-27 14:48
Guido van Rossum <report@bugs.python.org> wrote
 on 2011-08-26 21:11:24 -0000:
> Would this also affect .islower() and friends?
SHORT VERSION: (7 lines)
 I don't believe so, but the relationship between lower() and islower()
 is not as clear to me as I would have thought, and more importantly,
 the code and the documentation for Python's islower() etc currently seem
 to disagree. For future releases, I recommend fixing the code, but if
 compatibility is an issue, then perhaps for previous releases still in
 maintenance mode fixing only the documentation would possibly be good
 enough--your call.
=======================================================================
MEDIUM VERSION: (87 lines)
I was initially confused with Python's islower() family because of the way
they are defined to operate on full strings. They don't check that
everything is lowercase even though they say they do.
 < http://docs.python.org/py3k/library/stdtypes.html#sequence-types-str-bytes-bytearray-list-tuple-range
 str.lower()
 Return a copy of the string with all the cased characters [4]
 converted to lowercase.
 str.islower()
 Return true if all cased characters [4] in the string are lowercase 
 and there is at least one cased character, false otherwise.
 [4] (1, 2, 3, 4) Cased characters are those with general category
 property being one of "Lu" (Letter, uppercase), "Ll" (Letter,
 lowercase), or "Lt" (Letter, titlecase).
This is strange in several ways. Of lesser importance is that
strings can be considered lowercase even if they don't match
 ^\p{lowercase}+$
Another is that the result of calling str.lower() may not be .islower().
I'm not sure what these are particularly for, since I myself would just use
a regex to get finer-grained control. (I suppose it is because re doesn't
give access to the needed Unicode properties that this approach never
gained any traction in the Python community.)
However, the worst of this is that the documentation defines both cased
characters and lowercase characters *differently* from how Unicode
defines those very same terms. This was quite confusing.
Unicode distinguishes Cased code points from Cased_*Letter* code points.
Python is using the Cased_Letter property but calling it Cased. Cased is
a proper superset of Cased_Letter. From the DerivedCoreProperties file in
the Unicode Character Database:
 # Derived Property: Cased (Cased)
 # As defined by Unicode Standard Definition D120
 # C has the Lowercase or Uppercase property or has a General_Category value of Titlecase_Letter.
In the same way, the Lowercase and Uppercase properties are not the same as
the Lowercase_*Letter* and Uppercase_*Letter* properties. Rather, the former
are respectively proper supersets of the latter. 
 # Derived Property: Lowercase
 # Generated from: Ll + Other_Lowercase
 [...]
 # Derived Property: Uppercase
 # Generated from: Lu + Other_Uppercase
In all these, you almost always want the superset versions not the
restricted subset versions you are using. If it were in the regex engine,
the user could select either.
Java used to miss all these, too. But in 1.7, they updated their character
methods to use the properties that they'd all along said they were using:
 < http://download.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)
 public static boolean isLowerCase(char ch)
 Determines if the specified character is a lowercase character. 
 A character is lowercase if its general category type, provided by
 Character.getType(ch), is LOWERCASE_LETTER, or it has contributory
-> property Other_Lowercase as defined by the Unicode Standard.
 Note: This method cannot handle supplementary characters. To
 support all Unicode characters, including supplementary
 characters, use the isLowerCase(int) method.
(And yes, that's where Java uses "character" to mean "code unit" 
 not "code point", alas. No wonder people get confused)
I'm pretty sure that Python needs to either update its documentation to
match its code, update its code to match its documentation, or both. Java
chose to update the code to match the documentation, and this is the course
I would recommend if at all possible. If you say you are checking for
cased code points, then you should use the Unicode definition of cased code
points not your own, and if you say you are checking for lowercase code
points, then you should use the Unicode definition not your own. Both of
these require access to contributory properties from the UCD and not 
just general categories alone.
--tom
=======================================================================
LONG VERSION: (222 lines)
Essential tools I use for inspecting Unicode code points and their 
properties include
 http://training.perl.com/scripts/unichars
 http://training.perl.com/scripts/uniprops
 http://training.perl.com/scripts/uninames
And over the course of the day, these get used a fair bit, too:
 http://training.perl.com/scripts/uniquote
 http://training.perl.com/scripts/ucsort
 http://training.perl.com/scripts/unifmt
Here for example are (some of) the *non*-Letter code points that
are nonetheless considered lowercase or uppercase because
they have the Other_{Lower,Upper}case properties:
 % unichars -gs '\PL' '[\p{upper}\p{lower}]'
 ◌ͅ U+00345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI
 Ⅰ U+02160 GC=Nl SC=Latin ROMAN NUMERAL ONE
 Ⅱ U+02161 GC=Nl SC=Latin ROMAN NUMERAL TWO
 Ⅲ U+02162 GC=Nl SC=Latin ROMAN NUMERAL THREE
 [...]
 ⅰ U+02170 GC=Nl SC=Latin SMALL ROMAN NUMERAL ONE
 ⅱ U+02171 GC=Nl SC=Latin SMALL ROMAN NUMERAL TWO
 ⅲ U+02172 GC=Nl SC=Latin SMALL ROMAN NUMERAL THREE
 [...]
 Ⓐ U+024B6 GC=So SC=Common CIRCLED LATIN CAPITAL LETTER A
 Ⓑ U+024B7 GC=So SC=Common CIRCLED LATIN CAPITAL LETTER B
 Ⓒ U+024B8 GC=So SC=Common CIRCLED LATIN CAPITAL LETTER C
 [...]
 ⓐ U+024D0 GC=So SC=Common CIRCLED LATIN SMALL LETTER A
 ⓑ U+024D1 GC=So SC=Common CIRCLED LATIN SMALL LETTER B
 ⓒ U+024D2 GC=So SC=Common CIRCLED LATIN SMALL LETTER C
 [...]
And here are (some of) the letters that are cased but which are
not Lu, Lt, or Ll (they're all Lm, in fact):
 % unichars -gs '\p{Lm}' '\p{cased}' | ucsort
 ᴭ U+1D2D GC=Lm SC=Latin MODIFIER LETTER CAPITAL AE
 ᴬ U+1D2C GC=Lm SC=Latin MODIFIER LETTER CAPITAL A
 ᵃ U+1D43 GC=Lm SC=Latin MODIFIER LETTER SMALL A
 ₐ U+2090 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER A
 ᵅ U+1D45 GC=Lm SC=Latin MODIFIER LETTER SMALL ALPHA
 ᴮ U+1D2E GC=Lm SC=Latin MODIFIER LETTER CAPITAL B
 ᵇ U+1D47 GC=Lm SC=Latin MODIFIER LETTER SMALL B
 [...]
 ʷ U+02B7 GC=Lm SC=Latin MODIFIER LETTER SMALL W
 ᵂ U+1D42 GC=Lm SC=Latin MODIFIER LETTER CAPITAL W
 ˣ U+02E3 GC=Lm SC=Latin MODIFIER LETTER SMALL X
 ₓ U+2093 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER X
 ʸ U+02B8 GC=Lm SC=Latin MODIFIER LETTER SMALL Y
 ᶻ U+1DBB GC=Lm SC=Latin MODIFIER LETTER SMALL Z
 ᵝ U+1D5D GC=Lm SC=Greek MODIFIER LETTER SMALL BETA
 ᵞ U+1D5E GC=Lm SC=Greek MODIFIER LETTER SMALL GREEK GAMMA
 ᵟ U+1D5F GC=Lm SC=Greek MODIFIER LETTER SMALL DELTA
 ᶿ U+1DBF GC=Lm SC=Greek MODIFIER LETTER SMALL THETA
 ͺ U+037A GC=Lm SC=Greek GREEK YPOGEGRAMMENI
 ᵠ U+1D60 GC=Lm SC=Greek MODIFIER LETTER SMALL GREEK PHI
 ᵡ U+1D61 GC=Lm SC=Greek MODIFIER LETTER SMALL CHI
 ᵸ U+1D78 GC=Lm SC=Cyrillic MODIFIER LETTER CYRILLIC EN
Perversely, here are some of the modifier letters which are *not* cased:
 % unichars -gs '\p{Lm}' '\P{CASED}' | ucsort
 ₕ U+2095 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER H
 ʻ U+02BB GC=Lm SC=Common MODIFIER LETTER TURNED COMMA
 ʽ U+02BD GC=Lm SC=Common MODIFIER LETTER REVERSED COMMA
 ⁱ U+2071 GC=Lm SC=Latin SUPERSCRIPT LATIN SMALL LETTER I
 ₖ U+2096 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER K
 ₗ U+2097 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER L
 ₘ U+2098 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER M
 ⁿ U+207F GC=Lm SC=Latin SUPERSCRIPT LATIN SMALL LETTER N
 ₙ U+2099 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER N
 ₚ U+209A GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER P
 ₛ U+209B GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER S
 ₜ U+209C GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER T
 ʹ U+02B9 GC=Lm SC=Common MODIFIER LETTER PRIME
 ʺ U+02BA GC=Lm SC=Common MODIFIER LETTER DOUBLE PRIME
 ˆ U+02C6 GC=Lm SC=Common MODIFIER LETTER CIRCUMFLEX ACCENT
 ˇ U+02C7 GC=Lm SC=Common CARON
 ˈ U+02C8 GC=Lm SC=Common MODIFIER LETTER VERTICAL LINE
 ˉ U+02C9 GC=Lm SC=Common MODIFIER LETTER MACRON
 ˊ U+02CA GC=Lm SC=Common MODIFIER LETTER ACUTE ACCENT
 ˋ U+02CB GC=Lm SC=Common MODIFIER LETTER GRAVE ACCENT
 ˌ U+02CC GC=Lm SC=Common MODIFIER LETTER LOW VERTICAL LINE
(Interesting how the commas sort as breath marks next to H.)
I cannot for the life of me figure out why Unicode deems these lowercase:
 ᵃ U+1D43 GC=Lm SC=Latin MODIFIER LETTER SMALL A
 ₐ U+2090 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER A
 ᵅ U+1D45 GC=Lm SC=Latin MODIFIER LETTER SMALL ALPHA
yet these *not* to be cased:
 ⁱ U+2071 GC=Lm SC=Latin SUPERSCRIPT LATIN SMALL LETTER I
 ₘ U+2098 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER M
 ⁿ U+207F GC=Lm SC=Latin SUPERSCRIPT LATIN SMALL LETTER N
All I know is that the tables tell me.
Here's a fair assortment of cased and noncased, case-changing and
non-casing code points. The variation in binary properties is pretty wide.
 $ uniprops x 00aa 1d4e 2071 2172 df 262 1d401 1d42d 2117 24c5
 U+0078 ‹x› \N{LATIN SMALL LETTER X}
 \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
 All Any Alnum Alpha Alphabetic ASCII Assigned Basic_Latin Cased Cased_Letter LC Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower Lowercase PerlWord POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print POSIX_Word Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
 U+00AA ‹ª› \N{FEMININE ORDINAL INDICATOR}
 \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
 All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
 U+1D4E <ᵎ> \N{MODIFIER LETTER SMALL TURNED I}
 \w \pL \p{L_} \p{Lm}
 All Any Alnum Alpha Alphabetic Assigned InPhoneticExtensions Case_Ignorable CI Cased Dia Diacritic L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Modifier_Letter Lower Lowercase Phonetic_Extensions Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
 U+2071 <ⁱ> \N{SUPERSCRIPT LATIN SMALL LETTER I}
 \w \pL \p{L_} \p{Lm}
 All Any Alnum Alpha Alphabetic Assigned InSuperscriptsAndSubscripts Case_Ignorable CI Changes_When_NFKC_Casefolded CWKCF L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Modifier_Letter Print SD Soft_Dotted Superscripts_And_Subscripts Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
 U+2172 <ⅲ> \N{SMALL ROMAN NUMERAL THREE}
 \w \pN \p{Nl}
 All Any Alnum Alpha Alphabetic Assigned InNumberForms Cased Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Nl N Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Latin Latn Letter_Number Lower Lowercase Number Number_Forms Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
 U+00DF <ß> \N{LATIN SMALL LETTER SHARP S}
 \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
 All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
 U+0262 <ɢ> \N{LATIN LETTER SMALL CAPITAL G}
 \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
 All Any Alnum Alpha Alphabetic Assigned InIPA_Extensions Cased Cased_Letter LC Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS IPA_Extensions Letter L_ Latin Latn Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
 U+1D401 <𝐁> \N{MATHEMATICAL BOLD CAPITAL B}
 \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
 All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Common Zyyy Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Math Mathematical_Alphanumeric_Symbols Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
 U+1D42D <𝐭> \N{MATHEMATICAL BOLD SMALL T}
 \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
 All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Common Zyyy Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter Lower Lowercase Math Mathematical_Alphanumeric_Symbols Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
 U+2117 ‹℗› \N{SOUND RECORDING COPYRIGHT}
 \pS \p{So}
 All Any Assigned InLetterlikeSymbols Common Zyyy So S Gr_Base Grapheme_Base Graph GrBase Letterlike_Symbols Other_Symbol Print Symbol X_POSIX_Graph X_POSIX_Print
 U+24C5 ‹Ⓟ› \N{CIRCLED LATIN CAPITAL LETTER P}
 \w \pS \p{So}
 All Any Alnum Alpha Alphabetic Assigned InEnclosedAlphanumerics Cased Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Common Zyyy Enclosed_Alphanumerics So S Gr_Base Grapheme_Base Graph GrBase Other_Symbol Print Symbol Upper Uppercase Word X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
Unicode also has a Case_Ignorable (CI) character property, which I haven't 
thought much about but which might be useful. 
 http://www.unicode.org/reports/tr44/#Case_Ignorable
 Characters which are ignored for casing purposes. For more information,
 see D121 in Section 3.13, Default Case Algorithms in [Unicode].
 Generated from: Mn + Me + Cf + Lm + Sk + Word_Break=MidLetter + Word_Break=MidNumLet
I'm not sure if you should think about these when doing your isupper()
test; maybe you should. That way you wouldn't fail just because you had
a code point that was technically lowercase, like if someone used
"LEONARD McCOY". That funny c wouldn't count as a spoiler then, so that
"Leonard McCoy".upper().isupper() could be true, as the c wouldn't
change but wouldn't count, either. I haven't thought about this enough
though. I'm not used to full string-based isupper() functions, so my
instincts may be wrong here.
The only code point that is both CWCM and also CI is the notorious
 ◌ͅ U+00345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI
Subscripts, superscripts, modifier letters, small capitals, and mathematical
letters *tend* to be cased code points that do not change when casemapped
or casefolded, although there are exceptions.
 % uninames small capital '\b\R\b'
 ʀ 0280 LATIN LETTER SMALL CAPITAL R
 * voiced uvular trill
 * Germanic, Old Norse
 * uppercase is 01A6
 ʁ 0281 LATIN LETTER SMALL CAPITAL INVERTED R
 * voiced uvular fricative or approximant
 x (modifier letter small capital inverted r - 02B6)
 ʁ 02B6 MODIFIER LETTER SMALL CAPITAL INVERTED R
 * preceding four used for r-coloring or r-offglides
 x (latin letter small capital inverted r - 0281)
 # <super> 0281
 ᴙ 1D19 LATIN LETTER SMALL CAPITAL REVERSED R
 ᴚ 1D1A LATIN LETTER SMALL CAPITAL TURNED R
 ᷢ 1DE2 COMBINING LATIN LETTER SMALL CAPITAL R
 % uniprops 280 1a6
 U+0280 <ʀ> \N{LATIN LETTER SMALL CAPITAL R}
 \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
 All Any Alnum Alpha Alphabetic Assigned InIPA_Extensions Cased Cased_Letter LC Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS IPA_Extensions Letter L_ Latin Latn Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
 U+01A6 <Ʀ> \N{LATIN LETTER YR}
 \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
 All Any Alnum Alpha Alphabetic Assigned InLatinExtendedB Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_Extended_B Uppercase_Letter Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
That's right: the uppercase of LATIN LETTER SMALL CAPITAL R is LATIN LETTER
YR, and I don't know why. No other small capital -- which are all considered
lowercase -- changes when casemapped. Only this one alone.
Note that code points like U+00DF LATIN SMALL LETTER SHARP S
have these binary properties true because the normal/default sense of these
terms in Unicode is the full/string sense not the simple/character sense:
 Changes_When_Casefolded (CWCF) 
 Changes_When_Casemapped (CWCM)
 Changes_When_Titlecased (CWT) 
 Changes_When_Uppercased (CWU)
Those are true because the full uppercase map of "ß" is "SS" 
and the full casefold of "ß" is "ss".
--tom
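The "MASSE" / "Maße" example from CaseFolding.txt and the sharp s mappings above can be reproduced with a short sketch. It assumes Python 3.3 or later, where `str.casefold()` exists and `str.upper()` uses full mappings; `caseless_equal` is a hypothetical helper that ignores normalization and the Turkic special cases.

```python
def caseless_equal(a, b):
    """Compare two strings using the full case folding provided by str.casefold."""
    return a.casefold() == b.casefold()

sharp_s = "\u00df"
# Full uppercase and full casefold both lengthen the string.
print(ascii(sharp_s.upper()), ascii(sharp_s.casefold()))

# Lowercasing alone does not make the two spellings match; casefolding does.
print("lower match:", "MASSE".lower() == "Ma\u00dfe".lower())
print("casefold match:", caseless_equal("MASSE", "Ma\u00dfe"))
```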
msg143083 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-27 16:15
Thank you very much. We should fix the behavior in 3.3 for sure. I'm
thinking that we may be able to backport the behavior fix to 2.7 and
3.2 as well, since it just makes the behavior generally "better" (and
for most folks it won't matter anyway).
I'm not sure where the somewhat odd rules for .islower() come from, I
think in part from the desire to have "".islower() be False but "a
b".islower() to be True. Intuitively, this means that .islower() means
both "there is at least one lower case character" and "there are no
upper case characters", but not "all characters are lowercase". I
forget what we do w.r.t. titlecase, but the intuitive meaning should
not change. Although personally I don't have much of an intuition for
what titlecase means (and why it's important), perhaps because I'm not
familiar with any language where there is a third case for some
letters.
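The intuitive rule described here ("at least one lowercase character and no uppercase characters") can be seen with plain ASCII strings. This is a minimal sketch; `islower_results` is just a name chosen for the example.

```python
cases = ["", "a b", "a B", "123", "a1"]

# Map each sample string to what str.islower() reports for it.
islower_results = {s: s.islower() for s in cases}

for s, result in islower_results.items():
    print(repr(s), result)

# Lowercasing a string with no cased characters does not make it "lowercase".
print("123".lower().islower())
```

The empty string and "123" are not lowercase because they contain no cased character at all, which is also why the result of `lower()` need not satisfy `islower()`.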
msg143084 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-27 19:17
Guido van Rossum <report@bugs.python.org> wrote
 on 2011-08-27 16:15:33 -0000:
> Although personally I don't have much of an intuition for what
> titlecase means (and why it's important), perhaps because I'm not
> familiar with any language where there is a third case for some
> letters.
Neither am I. Even in "old-style" English with ae and oe, one wrote
ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
*Aesir. Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.
 (BTW, in French you really shouldn't split up the œ into oe, 
 nor in Old English, Old Norse, or Icelandic the æ in ae;
 although in contemporary English, it's usually ok to do so.)
I believe that almost but not quite all the sticky situations with
Unicode casing involve compatibility characters for clean round-trips
with legacy encodings. Exceptions include the German sharp s (both of 
them now) and the two Greek lowercase sigmas. Thank goodness we don't
use the long s in English anymore. What is it with s's, anyway? :)
Most of the titlecase letters are in Greek, with a few in Armenian.
I know no Armenian (their letters all look the same to me :), and the
folks I talked to about the Greek are skeptical. The German sharp s is
a red herring, because you can never have it as the first letter
(although it needn't be the last, as in Rußland). That's no more
possible than having the old legacy ff ligature appear at the beginning
of an English word.
In any event, there are only 129 total code points that are
"problematic" in terms of their case, where by problematic 
I mean one or more of:
 --- titlecase differs from uppercase
 --- foldcase differs from lowercase
 --- any of fold/lower/title/uppercase yields more than one code point
Of all these, it's the (now two!) sharp s's and the Turkic i that are the most annoying.
It's really quite a lot of trouble to go through for so few code points of so little
(perceived) use. But I suppose you never know what new ones they'll uncover, either.
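The three criteria above translate fairly directly into code. The sketch below assumes Python 3.3 or later, so that the `str` methods expose full mappings and `casefold()` exists; `is_case_problematic` and `case_problematic_code_points` are hypothetical names. The count it yields depends on the Unicode version the interpreter ships with, so it need not come out at 129.

```python
import sys

def is_case_problematic(ch):
    """Apply the three criteria listed above to a single code point."""
    lower, upper, title, fold = ch.lower(), ch.upper(), ch.title(), ch.casefold()
    return (
        title != upper                       # titlecase differs from uppercase
        or fold != lower                     # foldcase differs from lowercase
        or any(len(m) > 1 for m in (lower, upper, title, fold))
    )

def case_problematic_code_points():
    """Scan every code point; this takes on the order of a second."""
    return [cp for cp in range(sys.maxunicode + 1) if is_case_problematic(chr(cp))]

if __name__ == "__main__":
    found = case_problematic_code_points()
    print(len(found), "case-problematic code points in this Unicode version")
    print(" ".join(f"U+{cp:04X}" for cp in found[:10]), "...")
```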
Here are those 129 case-problematicals arranged in UCA order. Some of these
have normalization forms that decompose into graphemes with four code points (not shown).
There are a few other oddities, like the Kelvin sign and other "singletons", but these
are most of the trouble. They're all in the BMP; I guess we learned our lesson. :)
--tom
 1: U+0345 ◌ͅ COMBINING GREEK YPOGEGRAMMENI
 fc=ι U+3B9 lc=◌ͅ U+345 tc=Ι U+399 uc=Ι U+399
 2: U+1E9A ẚ LATIN SMALL LETTER A WITH RIGHT HALF RING
 fc=aʾ U+61.2BE lc=ẚ U+1E9A tc=Aʾ U+41.2BE uc=Aʾ U+41.2BE
 3: U+01F3 ǳ LATIN SMALL LETTER DZ
 fc=ǳ U+1F3 lc=ǳ U+1F3 tc=ǲ U+1F2 uc=Ǳ U+1F1
 4: U+01F2 ǲ LATIN CAPITAL LETTER D WITH SMALL LETTER Z
 fc=ǳ U+1F3 lc=ǳ U+1F3 tc=ǲ U+1F2 uc=Ǳ U+1F1
 5: U+01F1 Ǳ LATIN CAPITAL LETTER DZ
 fc=ǳ U+1F3 lc=ǳ U+1F3 tc=ǲ U+1F2 uc=Ǳ U+1F1
 6: U+01C6 ǆ LATIN SMALL LETTER DZ WITH CARON
 fc=ǆ U+1C6 lc=ǆ U+1C6 tc=ǅ U+1C5 uc=Ǆ U+1C4
 7: U+01C5 ǅ LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
 fc=ǆ U+1C6 lc=ǆ U+1C6 tc=ǅ U+1C5 uc=Ǆ U+1C4
 8: U+01C4 Ǆ LATIN CAPITAL LETTER DZ WITH CARON
 fc=ǆ U+1C6 lc=ǆ U+1C6 tc=ǅ U+1C5 uc=Ǆ U+1C4
 9: U+FB00 ﬀ LATIN SMALL LIGATURE FF
 fc=ff U+66.66 lc=ﬀ U+FB00 tc=Ff U+46.66 uc=FF U+46.46
 10: U+FB03 ﬃ LATIN SMALL LIGATURE FFI
 fc=ffi U+66.66.69 lc=ﬃ U+FB03 tc=Ffi U+46.66.69 uc=FFI U+46.46.49
 11: U+FB04 ﬄ LATIN SMALL LIGATURE FFL
 fc=ffl U+66.66.6C lc=ﬄ U+FB04 tc=Ffl U+46.66.6C uc=FFL U+46.46.4C
 12: U+FB01 ﬁ LATIN SMALL LIGATURE FI
 fc=fi U+66.69 lc=ﬁ U+FB01 tc=Fi U+46.69 uc=FI U+46.49
 13: U+FB02 ﬂ LATIN SMALL LIGATURE FL
 fc=fl U+66.6C lc=ﬂ U+FB02 tc=Fl U+46.6C uc=FL U+46.4C
 14: U+1E96 ẖ LATIN SMALL LETTER H WITH LINE BELOW
 fc=ẖ U+68.331 lc=ẖ U+1E96 tc=H̱ U+48.331 uc=H̱ U+48.331 
 15: U+0130 İ LATIN CAPITAL LETTER I WITH DOT ABOVE
 fc=i̇ U+69.307 lc=i̇ U+69.307 tc=İ U+130 uc=İ U+130 
 16: U+01F0 ǰ LATIN SMALL LETTER J WITH CARON
 fc=ǰ U+6A.30C lc=ǰ U+1F0 tc=J̌ U+4A.30C uc=J̌ U+4A.30C 
 17: U+01C9 ǉ LATIN SMALL LETTER LJ
 fc=ǉ U+1C9 lc=ǉ U+1C9 tc=ǈ U+1C8 uc=Ǉ U+1C7
 18: U+01C8 ǈ LATIN CAPITAL LETTER L WITH SMALL LETTER J
 fc=ǉ U+1C9 lc=ǉ U+1C9 tc=ǈ U+1C8 uc=Ǉ U+1C7
 19: U+01C7 Ǉ LATIN CAPITAL LETTER LJ
 fc=ǉ U+1C9 lc=ǉ U+1C9 tc=ǈ U+1C8 uc=Ǉ U+1C7
 20: U+01CC ǌ LATIN SMALL LETTER NJ
 fc=ǌ U+1CC lc=ǌ U+1CC tc=ǋ U+1CB uc=Ǌ U+1CA
 21: U+01CB ǋ LATIN CAPITAL LETTER N WITH SMALL LETTER J
 fc=ǌ U+1CC lc=ǌ U+1CC tc=ǋ U+1CB uc=Ǌ U+1CA
 22: U+01CA Ǌ LATIN CAPITAL LETTER NJ
 fc=ǌ U+1CC lc=ǌ U+1CC tc=ǋ U+1CB uc=Ǌ U+1CA
 23: U+017F ſ LATIN SMALL LETTER LONG S
 fc=s U+73 lc=ſ U+17F tc=S U+53 uc=S U+53
 24: U+1E9B ẛ LATIN SMALL LETTER LONG S WITH DOT ABOVE
 fc=ṡ U+1E61 lc=ẛ U+1E9B tc=Ṡ U+1E60 uc=Ṡ U+1E60
 25: U+00DF ß LATIN SMALL LETTER SHARP S
 fc=ss U+73.73 lc=ß U+DF tc=Ss U+53.73 uc=SS U+53.53
 26: U+1E9E ẞ LATIN CAPITAL LETTER SHARP S
 fc=ss U+73.73 lc=ß U+DF tc=ẞ U+1E9E uc=ẞ U+1E9E
 27: U+FB06 ﬆ LATIN SMALL LIGATURE ST
 fc=st U+73.74 lc=ﬆ U+FB06 tc=St U+53.74 uc=ST U+53.54
 28: U+FB05 ﬅ LATIN SMALL LIGATURE LONG S T
 fc=st U+73.74 lc=ﬅ U+FB05 tc=St U+53.74 uc=ST U+53.54
 29: U+1E97 ẗ LATIN SMALL LETTER T WITH DIAERESIS
 fc=ẗ U+74.308 lc=ẗ U+1E97 tc=T̈ U+54.308 uc=T̈ U+54.308 
 30: U+1E98 ẘ LATIN SMALL LETTER W WITH RING ABOVE
 fc=ẘ U+77.30A lc=ẘ U+1E98 tc=W̊ U+57.30A uc=W̊ U+57.30A 
 31: U+1E99 ẙ LATIN SMALL LETTER Y WITH RING ABOVE
 fc=ẙ U+79.30A lc=ẙ U+1E99 tc=Y̊ U+59.30A uc=Y̊ U+59.30A 
 32: U+0149 ŉ LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
 fc=ʼn U+2BC.6E lc=ŉ U+149 tc=ʼN U+2BC.4E uc=ʼN U+2BC.4E
 33: U+1F84 ᾄ GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
 fc=ἄι U+1F04.3B9 lc=ᾄ U+1F84 tc=ᾌ U+1F8C uc=ἌΙ U+1F0C.399 
 34: U+1F8C ᾌ GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
 fc=ἄι U+1F04.3B9 lc=ᾄ U+1F84 tc=ᾌ U+1F8C uc=ἌΙ U+1F0C.399 
 35: U+1F82 ᾂ GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
 fc=ἂι U+1F02.3B9 lc=ᾂ U+1F82 tc=ᾊ U+1F8A uc=ἊΙ U+1F0A.399 
 36: U+1F8A ᾊ GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
 fc=ἂι U+1F02.3B9 lc=ᾂ U+1F82 tc=ᾊ U+1F8A uc=ἊΙ U+1F0A.399 
 37: U+1F86 ᾆ GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
 fc=ἆι U+1F06.3B9 lc=ᾆ U+1F86 tc=ᾎ U+1F8E uc=ἎΙ U+1F0E.399 
 38: U+1F8E ᾎ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
 fc=ἆι U+1F06.3B9 lc=ᾆ U+1F86 tc=ᾎ U+1F8E uc=ἎΙ U+1F0E.399 
 39: U+1F80 ᾀ GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
 fc=ἀι U+1F00.3B9 lc=ᾀ U+1F80 tc=ᾈ U+1F88 uc=ἈΙ U+1F08.399 
 40: U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
 fc=ἀι U+1F00.3B9 lc=ᾀ U+1F80 tc=ᾈ U+1F88 uc=ἈΙ U+1F08.399 
 41: U+1F85 ᾅ GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI
 fc=ἅι U+1F05.3B9 lc=ᾅ U+1F85 tc=ᾍ U+1F8D uc=ἍΙ U+1F0D.399 
 42: U+1F8D ᾍ GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
 fc=ἅι U+1F05.3B9 lc=ᾅ U+1F85 tc=ᾍ U+1F8D uc=ἍΙ U+1F0D.399 
 43: U+1F83 ᾃ GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
 fc=ἃι U+1F03.3B9 lc=ᾃ U+1F83 tc=ᾋ U+1F8B uc=ἋΙ U+1F0B.399 
 44: U+1F8B ᾋ GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
 fc=ἃι U+1F03.3B9 lc=ᾃ U+1F83 tc=ᾋ U+1F8B uc=ἋΙ U+1F0B.399 
 45: U+1F87 ᾇ GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
 fc=ἇι U+1F07.3B9 lc=ᾇ U+1F87 tc=ᾏ U+1F8F uc=ἏΙ U+1F0F.399 
 46: U+1F8F ᾏ GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
 fc=ἇι U+1F07.3B9 lc=ᾇ U+1F87 tc=ᾏ U+1F8F uc=ἏΙ U+1F0F.399 
 47: U+1F81 ᾁ GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
 fc=ἁι U+1F01.3B9 lc=ᾁ U+1F81 tc=ᾉ U+1F89 uc=ἉΙ U+1F09.399 
 48: U+1F89 ᾉ GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
 fc=ἁι U+1F01.3B9 lc=ᾁ U+1F81 tc=ᾉ U+1F89 uc=ἉΙ U+1F09.399 
 49: U+1FB4 ᾴ GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI
 fc=άι U+3AC.3B9 lc=ᾴ U+1FB4 tc=Άͅ U+386.345 uc=ΆΙ U+386.399 
 50: U+1FB2 ᾲ GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI
 fc=ὰι U+1F70.3B9 lc=ᾲ U+1FB2 tc=Ὰͅ U+1FBA.345 uc=ᾺΙ U+1FBA.399 
 51: U+1FB6 ᾶ GREEK SMALL LETTER ALPHA WITH PERISPOMENI
 fc=ᾶ U+3B1.342 lc=ᾶ U+1FB6 tc=Α͂ U+391.342 uc=Α͂ U+391.342 
 52: U+1FB7 ᾷ GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
 fc=ᾶι U+3B1.342.3B9 lc=ᾷ U+1FB7 tc=ᾼ͂ U+391.342.345 uc=Α͂Ι U+391.342.399 
 53: U+1FB3 ᾳ GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
 fc=αι U+3B1.3B9 lc=ᾳ U+1FB3 tc=ᾼ U+1FBC uc=ΑΙ U+391.399 
 54: U+1FBC ᾼ GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
 fc=αι U+3B1.3B9 lc=ᾳ U+1FB3 tc=ᾼ U+1FBC uc=ΑΙ U+391.399 
 55: U+03D0 ϐ GREEK BETA SYMBOL
 fc=β U+3B2 lc=ϐ U+3D0 tc=Β U+392 uc=Β U+392
 56: U+03F5 ϵ GREEK LUNATE EPSILON SYMBOL
 fc=ε U+3B5 lc=ϵ U+3F5 tc=Ε U+395 uc=Ε U+395
 57: U+1F94 ᾔ GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI
 fc=ἤι U+1F24.3B9 lc=ᾔ U+1F94 tc=ᾜ U+1F9C uc=ἬΙ U+1F2C.399 
 58: U+1F9C ᾜ GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
 fc=ἤι U+1F24.3B9 lc=ᾔ U+1F94 tc=ᾜ U+1F9C uc=ἬΙ U+1F2C.399 
 59: U+1F92 ᾒ GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI
 fc=ἢι U+1F22.3B9 lc=ᾒ U+1F92 tc=ᾚ U+1F9A uc=ἪΙ U+1F2A.399 
 60: U+1F9A ᾚ GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
 fc=ἢι U+1F22.3B9 lc=ᾒ U+1F92 tc=ᾚ U+1F9A uc=ἪΙ U+1F2A.399 
 61: U+1F96 ᾖ GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
 fc=ἦι U+1F26.3B9 lc=ᾖ U+1F96 tc=ᾞ U+1F9E uc=ἮΙ U+1F2E.399 
 62: U+1F9E ᾞ GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
 fc=ἦι U+1F26.3B9 lc=ᾖ U+1F96 tc=ᾞ U+1F9E uc=ἮΙ U+1F2E.399 
 63: U+1F90 ᾐ GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI
 fc=ἠι U+1F20.3B9 lc=ᾐ U+1F90 tc=ᾘ U+1F98 uc=ἨΙ U+1F28.399 
 64: U+1F98 ᾘ GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
 fc=ἠι U+1F20.3B9 lc=ᾐ U+1F90 tc=ᾘ U+1F98 uc=ἨΙ U+1F28.399 
 65: U+1F95 ᾕ GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI
 fc=ἥι U+1F25.3B9 lc=ᾕ U+1F95 tc=ᾝ U+1F9D uc=ἭΙ U+1F2D.399 
 66: U+1F9D ᾝ GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
 fc=ἥι U+1F25.3B9 lc=ᾕ U+1F95 tc=ᾝ U+1F9D uc=ἭΙ U+1F2D.399 
 67: U+1F93 ᾓ GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI
 fc=ἣι U+1F23.3B9 lc=ᾓ U+1F93 tc=ᾛ U+1F9B uc=ἫΙ U+1F2B.399 
 68: U+1F9B ᾛ GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
 fc=ἣι U+1F23.3B9 lc=ᾓ U+1F93 tc=ᾛ U+1F9B uc=ἫΙ U+1F2B.399 
 69: U+1F97 ᾗ GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
 fc=ἧι U+1F27.3B9 lc=ᾗ U+1F97 tc=ᾟ U+1F9F uc=ἯΙ U+1F2F.399 
 70: U+1F9F ᾟ GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
 fc=ἧι U+1F27.3B9 lc=ᾗ U+1F97 tc=ᾟ U+1F9F uc=ἯΙ U+1F2F.399 
 71: U+1F91 ᾑ GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI
 fc=ἡι U+1F21.3B9 lc=ᾑ U+1F91 tc=ᾙ U+1F99 uc=ἩΙ U+1F29.399 
 72: U+1F99 ᾙ GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
 fc=ἡι U+1F21.3B9 lc=ᾑ U+1F91 tc=ᾙ U+1F99 uc=ἩΙ U+1F29.399 
 73: U+1FC4 ῄ GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI
 fc=ήι U+3AE.3B9 lc=ῄ U+1FC4 tc=Ήͅ U+389.345 uc=ΉΙ U+389.399 
 74: U+1FC2 ῂ GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI
 fc=ὴι U+1F74.3B9 lc=ῂ U+1FC2 tc=Ὴͅ U+1FCA.345 uc=ῊΙ U+1FCA.399 
 75: U+1FC6 ῆ GREEK SMALL LETTER ETA WITH PERISPOMENI
 fc=ῆ U+3B7.342 lc=ῆ U+1FC6 tc=Η͂ U+397.342 uc=Η͂ U+397.342 
 76: U+1FC7 ῇ GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
 fc=ῆι U+3B7.342.3B9 lc=ῇ U+1FC7 tc=ῌ͂ U+397.342.345 uc=Η͂Ι U+397.342.399 
 77: U+1FC3 ῃ GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
 fc=ηι U+3B7.3B9 lc=ῃ U+1FC3 tc=ῌ U+1FCC uc=ΗΙ U+397.399 
 78: U+1FCC ῌ GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
 fc=ηι U+3B7.3B9 lc=ῃ U+1FC3 tc=ῌ U+1FCC uc=ΗΙ U+397.399 
 79: U+03D1 ϑ GREEK THETA SYMBOL
 fc=θ U+3B8 lc=ϑ U+3D1 tc=Θ U+398 uc=Θ U+398
 80: U+1FBE ι GREEK PROSGEGRAMMENI
 fc=ι U+3B9 lc=ι U+1FBE tc=Ι U+399 uc=Ι U+399 
 81: U+1FD6 ῖ GREEK SMALL LETTER IOTA WITH PERISPOMENI
 fc=ῖ U+3B9.342 lc=ῖ U+1FD6 tc=Ι͂ U+399.342 uc=Ι͂ U+399.342 
 82: U+0390 ΐ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
 fc=ΐ U+3B9.308.301 lc=ΐ U+390 tc=Ϊ́ U+399.308.301 uc=Ϊ́ U+399.308.301 
 83: U+1FD3 ΐ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
 fc=ΐ U+3B9.308.301 lc=ΐ U+1FD3 tc=Ϊ́ U+399.308.301 uc=Ϊ́ U+399.308.301 
 84: U+1FD2 ῒ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
 fc=ῒ U+3B9.308.300 lc=ῒ U+1FD2 tc=Ϊ̀ U+399.308.300 uc=Ϊ̀ U+399.308.300 
 85: U+1FD7 ῗ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
 fc=ῗ U+3B9.308.342 lc=ῗ U+1FD7 tc=Ϊ͂ U+399.308.342 uc=Ϊ͂ U+399.308.342 
 86: U+03F0 ϰ GREEK KAPPA SYMBOL
 fc=κ U+3BA lc=ϰ U+3F0 tc=Κ U+39A uc=Κ U+39A
 87: U+00B5 µ MICRO SIGN
 fc=μ U+3BC lc=µ U+B5 tc=Μ U+39C uc=Μ U+39C
 88: U+03D6 ϖ GREEK PI SYMBOL
 fc=π U+3C0 lc=ϖ U+3D6 tc=Π U+3A0 uc=Π U+3A0
 89: U+03F1 ϱ GREEK RHO SYMBOL
 fc=ρ U+3C1 lc=ϱ U+3F1 tc=Ρ U+3A1 uc=Ρ U+3A1
 90: U+1FE4 ῤ GREEK SMALL LETTER RHO WITH PSILI
 fc=ῤ U+3C1.313 lc=ῤ U+1FE4 tc=Ρ̓ U+3A1.313 uc=Ρ̓ U+3A1.313 
 91: U+03C2 ς GREEK SMALL LETTER FINAL SIGMA
 fc=σ U+3C3 lc=ς U+3C2 tc=Σ U+3A3 uc=Σ U+3A3 
 92: U+1F50 ὐ GREEK SMALL LETTER UPSILON WITH PSILI
 fc=ὐ U+3C5.313 lc=ὐ U+1F50 tc=Υ̓ U+3A5.313 uc=Υ̓ U+3A5.313 
 93: U+1F54 ὔ GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
 fc=ὔ U+3C5.313.301 lc=ὔ U+1F54 tc=Υ̓́ U+3A5.313.301 uc=Υ̓́ U+3A5.313.301 
 94: U+1F52 ὒ GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
 fc=ὒ U+3C5.313.300 lc=ὒ U+1F52 tc=Υ̓̀ U+3A5.313.300 uc=Υ̓̀ U+3A5.313.300 
 95: U+1F56 ὖ GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
 fc=ὖ U+3C5.313.342 lc=ὖ U+1F56 tc=Υ̓͂ U+3A5.313.342 uc=Υ̓͂ U+3A5.313.342 
 96: U+1FE6 ῦ GREEK SMALL LETTER UPSILON WITH PERISPOMENI
 fc=ῦ U+3C5.342 lc=ῦ U+1FE6 tc=Υ͂ U+3A5.342 uc=Υ͂ U+3A5.342 
 97: U+03B0 ΰ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
 fc=ΰ U+3C5.308.301 lc=ΰ U+3B0 tc=Ϋ́ U+3A5.308.301 uc=Ϋ́ U+3A5.308.301 
 98: U+1FE3 ΰ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
 fc=ΰ U+3C5.308.301 lc=ΰ U+1FE3 tc=Ϋ́ U+3A5.308.301 uc=Ϋ́ U+3A5.308.301 
 99: U+1FE2 ῢ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
 fc=ῢ U+3C5.308.300 lc=ῢ U+1FE2 tc=Ϋ̀ U+3A5.308.300 uc=Ϋ̀ U+3A5.308.300 
100: U+1FE7 ῧ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
 fc=ῧ U+3C5.308.342 lc=ῧ U+1FE7 tc=Ϋ͂ U+3A5.308.342 uc=Ϋ͂ U+3A5.308.342 
101: U+03D5 ϕ GREEK PHI SYMBOL
 fc=φ U+3C6 lc=ϕ U+3D5 tc=Φ U+3A6 uc=Φ U+3A6
102: U+1FA4 ᾤ GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI
 fc=ὤι U+1F64.3B9 lc=ᾤ U+1FA4 tc=ᾬ U+1FAC uc=ὬΙ U+1F6C.399 
103: U+1FAC ᾬ GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
 fc=ὤι U+1F64.3B9 lc=ᾤ U+1FA4 tc=ᾬ U+1FAC uc=ὬΙ U+1F6C.399 
104: U+1FA2 ᾢ GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
 fc=ὢι U+1F62.3B9 lc=ᾢ U+1FA2 tc=ᾪ U+1FAA uc=ὪΙ U+1F6A.399 
105: U+1FAA ᾪ GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
 fc=ὢι U+1F62.3B9 lc=ᾢ U+1FA2 tc=ᾪ U+1FAA uc=ὪΙ U+1F6A.399 
106: U+1FA6 ᾦ GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
 fc=ὦι U+1F66.3B9 lc=ᾦ U+1FA6 tc=ᾮ U+1FAE uc=ὮΙ U+1F6E.399 
107: U+1FAE ᾮ GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
 fc=ὦι U+1F66.3B9 lc=ᾦ U+1FA6 tc=ᾮ U+1FAE uc=ὮΙ U+1F6E.399 
108: U+1FA0 ᾠ GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
 fc=ὠι U+1F60.3B9 lc=ᾠ U+1FA0 tc=ᾨ U+1FA8 uc=ὨΙ U+1F68.399 
109: U+1FA8 ᾨ GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
 fc=ὠι U+1F60.3B9 lc=ᾠ U+1FA0 tc=ᾨ U+1FA8 uc=ὨΙ U+1F68.399 
110: U+1FA5 ᾥ GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI
 fc=ὥι U+1F65.3B9 lc=ᾥ U+1FA5 tc=ᾭ U+1FAD uc=ὭΙ U+1F6D.399 
111: U+1FAD ᾭ GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
 fc=ὥι U+1F65.3B9 lc=ᾥ U+1FA5 tc=ᾭ U+1FAD uc=ὭΙ U+1F6D.399 
112: U+1FA3 ᾣ GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI
 fc=ὣι U+1F63.3B9 lc=ᾣ U+1FA3 tc=ᾫ U+1FAB uc=ὫΙ U+1F6B.399 
113: U+1FAB ᾫ GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
 fc=ὣι U+1F63.3B9 lc=ᾣ U+1FA3 tc=ᾫ U+1FAB uc=ὫΙ U+1F6B.399 
114: U+1FA7 ᾧ GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
 fc=ὧι U+1F67.3B9 lc=ᾧ U+1FA7 tc=ᾯ U+1FAF uc=ὯΙ U+1F6F.399 
115: U+1FAF ᾯ GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
 fc=ὧι U+1F67.3B9 lc=ᾧ U+1FA7 tc=ᾯ U+1FAF uc=ὯΙ U+1F6F.399 
116: U+1FA1 ᾡ GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI
 fc=ὡι U+1F61.3B9 lc=ᾡ U+1FA1 tc=ᾩ U+1FA9 uc=ὩΙ U+1F69.399 
117: U+1FA9 ᾩ GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
 fc=ὡι U+1F61.3B9 lc=ᾡ U+1FA1 tc=ᾩ U+1FA9 uc=ὩΙ U+1F69.399 
118: U+1FF4 ῴ GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI
 fc=ώι U+3CE.3B9 lc=ῴ U+1FF4 tc=Ώͅ U+38F.345 uc=ΏΙ U+38F.399 
119: U+1FF2 ῲ GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI
 fc=ὼι U+1F7C.3B9 lc=ῲ U+1FF2 tc=Ὼͅ U+1FFA.345 uc=ῺΙ U+1FFA.399 
120: U+1FF6 ῶ GREEK SMALL LETTER OMEGA WITH PERISPOMENI
 fc=ῶ U+3C9.342 lc=ῶ U+1FF6 tc=Ω͂ U+3A9.342 uc=Ω͂ U+3A9.342 
121: U+1FF7 ῷ GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
 fc=ῶι U+3C9.342.3B9 lc=ῷ U+1FF7 tc=ῼ͂ U+3A9.342.345 uc=Ω͂Ι U+3A9.342.399 
122: U+1FF3 ῳ GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
 fc=ωι U+3C9.3B9 lc=ῳ U+1FF3 tc=ῼ U+1FFC uc=ΩΙ U+3A9.399 
123: U+1FFC ῼ GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
 fc=ωι U+3C9.3B9 lc=ῳ U+1FF3 tc=ῼ U+1FFC uc=ΩΙ U+3A9.399 
124: U+0587 և ARMENIAN SMALL LIGATURE ECH YIWN
 fc=եւ U+565.582 lc=և U+587 tc=Եւ U+535.582 uc=ԵՒ U+535.552
125: U+FB14 ﬔ ARMENIAN SMALL LIGATURE MEN ECH
 fc=մե U+574.565 lc=ﬔ U+FB14 tc=Մե U+544.565 uc=ՄԵ U+544.535
126: U+FB15 ﬕ ARMENIAN SMALL LIGATURE MEN INI
 fc=մի U+574.56B lc=ﬕ U+FB15 tc=Մի U+544.56B uc=ՄԻ U+544.53B
127: U+FB17 ﬗ ARMENIAN SMALL LIGATURE MEN XEH
 fc=մխ U+574.56D lc=ﬗ U+FB17 tc=Մխ U+544.56D uc=ՄԽ U+544.53D
128: U+FB13 ﬓ ARMENIAN SMALL LIGATURE MEN NOW
 fc=մն U+574.576 lc=ﬓ U+FB13 tc=Մն U+544.576 uc=ՄՆ U+544.546
129: U+FB16 ﬖ ARMENIAN SMALL LIGATURE VEW NOW
 fc=վն U+57E.576 lc=ﬖ U+FB16 tc=Վն U+54E.576 uc=ՎՆ U+54E.546
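Rows in the fc/lc/tc/uc layout used above can be regenerated with a few lines of Python. This is only a minimal sketch, assuming Python 3.3 or later, where the str methods apply the full mappings. The helper names codepoints and casing_row are invented for this illustration and are not taken from the attached mux.python program; results for some characters may also vary with the Unicode version bundled with the interpreter.

```python
def codepoints(s):
    """Format a string as U+ followed by dot-separated hex code points."""
    return "U+" + ".".join(format(ord(c), "X") for c in s)


def casing_row(ch):
    """Build one fc/lc/tc/uc row for a single character (hypothetical helper)."""
    forms = [
        ("fc", ch.casefold()),  # full case folding
        ("lc", ch.lower()),     # full lowercase mapping
        ("tc", ch.title()),     # full titlecase mapping
        ("uc", ch.upper()),     # full uppercase mapping
    ]
    return " ".join("{}={} {}".format(tag, s, codepoints(s)) for tag, s in forms)


print(casing_row("\N{LATIN SMALL LETTER SHARP S}"))
```

For U+00DF this should reproduce the row listed for entry 25 above.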
msg143085 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2011年08月27日 19:29
There are some oddities in Unicode case-folding.
Under full case-folding, both "\N{LATIN CAPITAL LETTER SHARP S}" and "\N{LATIN SMALL LETTER SHARP S}" fold to "ss", which means that those codepoints match each other.
However, under simple case-folding, they fold to themselves, which means that those codepoints _don't_ match each other.
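The full-folding half of this observation can be checked from Python itself. A small sketch, assuming Python 3.3 or later, where str.casefold() applies full case folding; Python does not expose simple case folding through a str method, so that half is not shown here.

```python
capital = "\N{LATIN CAPITAL LETTER SHARP S}"  # U+1E9E
small = "\N{LATIN SMALL LETTER SHARP S}"      # U+00DF

# Full case folding maps both code points to the two-character string "ss".
print(capital.casefold() == small.casefold() == "ss")

# The code points themselves are distinct, so an exact comparison fails.
print(capital == small)
```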
msg143086 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011年08月27日 20:04
> Neither am I. Even in "old-style" English with ae and oe, one wrote
> ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
> *Aesir. Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.
Trying to disprove you a bit:
http://ecx.images-amazon.com/images/I/51G6CH9XFFL._SL500_AA300_.jpg
http://ecx.images-amazon.com/images/I/51k7TmosPdL._SL500_AA300_.jpg
http://ecx.images-amazon.com/images/I/518UzMeLFCL._SL500_AA300_.jpg
but classical typographies seem to write either the uppercase Œ or the lowercase œ.
That said, I wonder why Unicode even includes ligatures like ff. Sounds like mission creep to me (and horrible annoyances for people like us).
msg143089 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011年08月28日 05:54
FTR, with the latest Python 3.2/3.3 (narrow) I get:
 Total failures: 58 / 500 ( 12%)
 Total successes: 442 / 500 ( 88%)
and with the latest Python 3.2/3.3 (wide) I get:
 Total failures: 52 / 500 ( 10%)
 Total successes: 448 / 500 ( 90%)
msg143110 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011年08月28日 17:27
Thanks Tom for such a clear explanation! I hope someone will implement
this. (Matthew, does this affect regex? I am guessing it does, for
case-insensitive matching?)
msg143119 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2011年08月28日 18:56
The regex module currently uses simple case-folding, although I'm working towards full case-folding, as listed in http://www.unicode.org/Public/UNIDATA/CaseFolding.txt.
msg143124 - (view) Author: Tom Christiansen (tchrist) Date: 2011年08月28日 21:01
Antoine Pitrou <report@bugs.python.org> wrote on 2011年8月27日 20:04:56 -0000: 
>> Neither am I. Even in "old-style" English with ae and oe, one wrote
>> ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
>> *Aesir. Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.
> Trying to disprove you a bit:
> http://ecx.images-amazon.com/images/I/51G6CH9XFFL._SL500_AA300_.jpg
> http://ecx.images-amazon.com/images/I/51k7TmosPdL._SL500_AA300_.jpg
> http://ecx.images-amazon.com/images/I/518UzMeLFCL._SL500_AA300_.jpg
> but classical typographies seem to write either the uppercase Œ or the
> lowercase œ.
That's what I meant: one only ever sees œufs or ŒUFS, never OEUFS.
French doesn't fit into ISO 8859-1. That's one of the changes to
ISO-8859-15 compared with ISO-8859-1 (and Unicode):
 iso-8859-1 A4 ⇔ U+00A4 < ¤ > \N{CURRENCY SIGN}
 iso-8859-15 A4 ⇒ U+20AC < € > \N{EURO SIGN}
 iso-8859-1 A6 ⇔ U+00A6 < ¦ > \N{BROKEN BAR}
 iso-8859-15 A6 ⇒ U+0160 < Š > \N{LATIN CAPITAL LETTER S WITH CARON}
 iso-8859-1 A8 ⇔ U+00A8 < ¨ > \N{DIAERESIS}
 iso-8859-15 A8 ⇒ U+0161 < š > \N{LATIN SMALL LETTER S WITH CARON}
 iso-8859-1 B4 ⇔ U+00B4 < ´ > \N{ACUTE ACCENT}
 iso-8859-15 B4 ⇒ U+017D < Ž > \N{LATIN CAPITAL LETTER Z WITH CARON}
 iso-8859-1 B8 ⇔ U+00B8 < ¸ > \N{CEDILLA}
 iso-8859-15 B8 ⇒ U+017E < ž > \N{LATIN SMALL LETTER Z WITH CARON}
 iso-8859-1 BC ⇔ U+00BC < ¼ > \N{VULGAR FRACTION ONE QUARTER}
 iso-8859-15 BC ⇒ U+0152 < Œ > \N{LATIN CAPITAL LIGATURE OE}
 iso-8859-1 BD ⇔ U+00BD < ½ > \N{VULGAR FRACTION ONE HALF}
 iso-8859-15 BD ⇒ U+0153 < œ > \N{LATIN SMALL LIGATURE OE}
 iso-8859-1 BE ⇔ U+00BE < ¾ > \N{VULGAR FRACTION THREE QUARTERS}
 iso-8859-15 BE ⇒ U+0178 < Ÿ > \N{LATIN CAPITAL LETTER Y WITH DIAERESIS}
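The eight differences listed above can be reproduced with Python's standard codecs. This is an illustrative sketch only; the variable names are arbitrary.

```python
# The eight byte values where ISO-8859-15 differs from ISO-8859-1.
raw = bytes([0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE])

latin1 = raw.decode("iso-8859-1")
latin9 = raw.decode("iso-8859-15")

# Print each byte with the code point it maps to under either encoding.
for byte, a, b in zip(raw, latin1, latin9):
    print("{:02X}  U+{:04X} {}  U+{:04X} {}".format(byte, ord(a), a, ord(b), b))
```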
> That said, I wonder why Unicode even includes ligatures like ff. Sounds
> like mission creep to me (and horrible annoyances for people like us).
I'm pretty sure that typographic ligatures are there for roundtripping
with legacy encodings. I believe that œ/Œ is the only code point
with ligature in its name that you're "supposed" to still use, and
that all others should be figured out by modern fonting software.
--tom
msg143145 - (view) Author: Jean-Michel Fauth (Jean-Michel.Fauth) Date: 2011年08月29日 13:13
Œ, œ or even & are historically ligatures or "ligatured forms".
In the French typography, they are "single plain letters" and
they belong to the group of the 42 letters used in the French
typography.
Typographically speaking, using "oe" instead of "œ" is considered
as a mistake, while not using the ligatured forms for the groups
of letters like ff, ffi, ffl, fj, et, st is acceptable.
Microsoft with cp1252, Apple with mac-roman, Adobe and all
foundries and now "Unicode" are working correctly.
It should be noted, when "TeX" moved from the ascii to iso-8859-1
(more precisely "CorkEncoding") as default encoding, "they" saw
the problem and introduced the \oe or \OE commands.
From my understanding and my point of view on the subject, ISO has
somehow recognized its mistake by introducing iso-8859-15.
Unfortunately, it was too late.
To the subject: Œdipe: correct, Oedipe, OEdipe: incorrect.
I am not an expert in that field, but the information
one can find on Wikipedia (French) regarding questions about
typography is generally correct.
msg143146 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011年08月29日 13:21
> Œ, œ or even & are historically ligatures or "ligatured forms".
> In the French typography, they are "single plain letters" and
> they belong to the group of the 42 letters used in the French
> typography.
> Typographically speaking, using "oe" instead of "œ" is considered
> as a mistake,
It's not only "typographically speaking", it's really a spelling error,
even in hand-written text :-)
msg143148 - (view) Author: Tom Christiansen (tchrist) Date: 2011年08月29日 14:16
Antoine Pitrou <report@bugs.python.org> wrote
 on 2011年8月29日 13:21:06 -0000: 
> It's not only "typographically speaking", it's really a spelling error,
> even in hand-written text :-)
Sure, and so too is omitting an accent mark or diaeresis. But—alas!—you’ll
never convince most monoglot anglophones of that, the ones who keep wanting to
strip them from résumé, façade, châteaux, crème brûlée, fête, tête-à-tête, 
à la française, or naïveté, not to mention José, jalapeño, the erstwhile
American Secretary of State Federico Peña, or nearby Cañon City, Colorado, 
where I have family. I think œnology has survived solely on its rarity, 
and the Encyclopædia Britannica is that way because the ligat(ur)ed letter
is in their actual trademark.
Cell phone users sending text messages have long suffered the grievous
injuries to their language(s) that naked ASCII imparts, but this is
nothing like the crossdressing nightmare called Greeklish, also variously
known as Grenglish, Latinoellinika/Λατινοελληνικά, or ASCII Greek.
 http://en.wikipedia.org/wiki/Greeklish
 [...] The reason for this is the fact that text written in Greeklish
 is considerably less aesthetically pleasing, and also much harder to
 read, compared to text written in the Greek alphabet. A non-Greek
 speaker/reader can guess this by this example: "δις ιζ χαρντ του
 ριντ" would be the way to write "this is hard to read" in English
 but utilizing the Greek alphabet.
I especially enjoy George Baloglou’s "Byzantine" Grenglish, wherein:
 Ὀδυσσεύς => Oducceus instead of Odysseus
 Ἀχιλλεύς => Axilleus instead of Achilleus
 Σίσυφος => Sicuphos instead of Sisyphus
 Περικλῆς => 5epiklhs instead of Pericles
 Χθονός => X8onos instead of Chthonos
 Οι Ατρείδες => Oi Atpeides instead of the Atreïdes
Terrible though the depredations committed upon the French language
by ASCII may have been, surely these go even further. :)
--tom
 Η Ιλιάδα H Iliada
Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλῆος Mhnin aeide, 8ea, 5hlhiadeo Axilhos
οὐλομένην, ἣ μυρί’ Ἀχαιοῖς ἄλγε’ ἔθηκε, oulomenhn, 'h mupi’ Axaiois alge’ e8hke,
πολλὰς δ’ ἰφθίμους ψυχὰς Ἄϊδι προῒαψεν nollas d’ iph8imous yuxas Aidi npoiayen
ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν 'hpoon, autous de elopia teuxe kuneccin
οἰωνοῖσί τε πᾶσι· Διὸς δ’ ἐτελείετο βουλή· oionoici te naci· Dios d’ eteleieto boulh·
ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε eks o'u dh ta npota diacththn epicante
Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς. Atpeidhs te anaks andpon kai dios Axilleus.
msg150844 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012年01月08日 03:54
Here is a patch. I only dealt with case mappings and not titlecase. Doing titlecase properly requires word segmentation, which I think should be another patch/issue. This patch fixes swapcase(), capitalize(), upper(), and lower(). It does not include the changes to Objects/unicodetype_db.h because those are huge. Regenerate the database if you want to test it. Please review.
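As an illustration of what full mappings change for upper() and lower(), here is a short sketch, assuming an interpreter that includes this change (Python 3.3 or later). The sample characters are taken from the data set earlier in this issue; the variable names are arbitrary.

```python
sharp_s = "\N{LATIN SMALL LETTER SHARP S}"                # U+00DF
fi = "\N{LATIN SMALL LIGATURE FI}"                        # U+FB01
dotted_i = "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}"    # U+0130

# Under full case mapping one input character can map to several characters.
print(sharp_s.upper())        # two characters: "SS"
print(fi.upper())             # two characters: "FI"
print(len(dotted_i.lower()))  # "i" followed by U+0307 COMBINING DOT ABOVE
```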
msg150998 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012年01月10日 03:49
New patch. I implemented it the way Antoine desired. It seems rather inefficient to be copying around so much data...
msg151016 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012年01月10日 14:03
__ap__'s implementation method is about 2x faster than mine.
msg151088 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012年01月11日 20:20
New patch with title casing mappings added.
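A few titlecase results that depend on the full mappings, sketched under the assumption of Python 3.3 or later. Note that str.title() uses a simple notion of word boundaries rather than Unicode word segmentation, so this only illustrates the per-character titlecase mapping.

```python
sharp_s = "\N{LATIN SMALL LETTER SHARP S}"        # U+00DF
dz = "\N{LATIN SMALL LETTER DZ WITH CARON}"       # U+01C6
fi_word = "\N{LATIN SMALL LIGATURE FI}sh"         # ligature fi + "sh"

print(sharp_s.title())             # titlecase of U+00DF is "Ss"
print(dz.title() == "\u01c5")      # the digraph has its own titlecase form
print(fi_word.title())             # the ligature expands to "Fi"
```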
msg151098 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012年01月11日 23:17
New changeset f7e05d205a52 by Benjamin Peterson in branch 'default':
use full unicode mappings for upper/lower/title case (#12736)
http://hg.python.org/cpython/rev/f7e05d205a52 
msg151141 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2012年01月12日 17:17
The currently applied patch ( http://hg.python.org/cpython/rev/f7e05d205a52 ) left some dead code in unicodeobject.c
function fixup ( http://hg.python.org/cpython/file/f7e05d205a52/Objects/unicodeobject.c#l9386 ) has a shortcut for when the fixer doesn't make any actual changes. The removed fixers (like fixupper ) returned 0 rather than maxchar to indicate that. The only remaining fixer, fix_decimal_and_space_to_ascii (line 8839), does not. (I think fix_decimal_and_space_to_ascii *should* add a touched flag, but until it does, the shortcut dedup code is dead.)
Also, around line 10502, there is an #if 0 section with code that relied on one of the removed fixers; is it time to remove that section?
msg151311 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2012年01月16日 00:24
Why was the delta-processing removed from the casing functions?
As best I can tell, the whole point of going through multiple levels of indirection (courtesy splitbins) is to maximize compression and minimize the amount of cache that unicode might occupy.
By using deltas, only one record is needed for each combination of (upper - lower, upper - title), which is generally only one or two combinations per script. 
Without deltas, nearly every cased letter needs its own record, and the index tables also get bigger. (It seems to be about 2.6 times as large, but cache effects may be worse, since letters from the same script will no longer be in the same record or the same index chain.)
If it is a concern about not enough room for flags, then the decimal/digit chars could be combined. They are always the same, unless the number isn't decimal (in which case the flag is enough).
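The compression argument can be illustrated with a toy count. This sketch is not how the CPython database generator works: it stores offsets from the code point itself rather than the (upper - lower, upper - title) pairs described above, and the function name count_case_records is made up for the example. It only shows why deltas collapse many letters into a handful of records.

```python
def count_case_records(codepoints):
    """Count distinct records using absolute targets versus deltas.

    Only characters whose upper() and lower() results are single
    characters are considered, to keep the sketch simple.
    """
    absolute = set()
    deltas = set()
    for cp in codepoints:
        ch = chr(cp)
        upper, lower = ch.upper(), ch.lower()
        if len(upper) != 1 or len(lower) != 1:
            continue  # multi-character (full) mappings need separate handling
        if upper == ch and lower == ch:
            continue  # uncased character, nothing to store
        u, l = ord(upper), ord(lower)
        absolute.add((u, l))            # one record per letter pair
        deltas.add((u - cp, l - cp))    # one record per offset pattern
    return len(absolute), len(deltas)


# Basic Latin letters A-Z and a-z (the punctuation in between is uncased).
print(count_case_records(range(0x41, 0x7B)))
```

For the basic Latin letters this gives 26 absolute records but only 2 delta records, one for the uppercase letters and one for the lowercase letters.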
msg151314 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012年01月16日 02:19
New changeset 03ea95e3b497 by Benjamin Peterson in branch 'default':
delta encoding of upper/lower/title makes a glorious return (#12736)
http://hg.python.org/cpython/rev/03ea95e3b497 
msg261517 - (view) Author: Андрей Баксаляр (Андрей Баксаляр) Date: 2016年03月10日 17:37
The same problem with the Unicode case mapping is still present in Python 3.4.3. You can reproduce the bug with this code, for instance:
'ΰ'.upper().lower() == 'ΰ'
The case swapping strangely leads to character replacement:
b'\xce\xb0' → b'\xcf\x85\xcc\x88\xcc\x81'
msg261522 - (view) Author: Андрей Баксаляр (Андрей Баксаляр) Date: 2016年03月10日 20:21
Interestingly, the bug is still reproducible in version 3.5.1, but fixed in 2.7.6.
msg261547 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2016年03月11日 07:39
The full case mappings do not preserve normalization form.
>>> import unicodedata
>>> for c in 'ΰ'.upper().lower(): print(unicodedata.name(c))
... 
GREEK SMALL LETTER UPSILON
COMBINING DIAERESIS
COMBINING ACUTE ACCENT
>>> unicodedata.normalize('NFC', 'ΰ'.upper().lower()) == 'ΰ'
True
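When strings have to be compared regardless of both case and normalization form, one option is to normalize around the case folding step. The sketch below follows the NFD(toCasefold(NFD(X))) pattern that the Unicode Standard gives for canonical caseless matching; the helper name canonical_caseless_key is an assumption made for this example, not an existing API.

```python
import unicodedata


def canonical_caseless_key(s):
    """Return NFD(casefold(NFD(s))), a key for canonical caseless comparison."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


original = "\N{GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS}"  # U+03B0
round_trip = original.upper().lower()

# The code point sequences differ after the round trip...
print(round_trip == original)
# ...but the strings still compare equal under the normalized, folded key.
print(canonical_caseless_key(round_trip) == canonical_caseless_key(original))
```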
History
Date User Action Args
2022年04月11日 14:57:20 admin set github: 56945
2016年03月11日 07:39:49 benjamin.peterson set messages: + msg261547
2016年03月10日 20:44:03 gvanrossum set nosy: - gvanrossum
2016年03月10日 20:42:37 SilentGhost set versions: + Python 3.4, Python 3.5, Python 3.6, - Python 2.7
2016年03月10日 20:21:51 Андрей Баксаляр set files: + pythonbug.png; messages: + msg261522; versions: + Python 2.7, - Python 3.4
2016年03月10日 17:37:31 Андрей Баксаляр set nosy: + Андрей Баксаляр; messages: + msg261517; versions: + Python 3.4, - Python 3.3
2013年06月23日 23:56:10 belopolsky link issue4610 superseder
2012年01月16日 02:19:31 python-dev set messages: + msg151314
2012年01月16日 00:24:46 Jim.Jewett set messages: + msg151311
2012年01月12日 17:17:24 Jim.Jewett set nosy: + Jim.Jewett; messages: + msg151141
2012年01月11日 23:23:51 benjamin.peterson set status: open -> closed; resolution: fixed
2012年01月11日 23:17:46 python-dev set nosy: + python-dev; messages: + msg151098
2012年01月11日 20:20:09 benjamin.peterson set files: + full-casemapping.patch; messages: + msg151088
2012年01月11日 03:38:21 benjamin.peterson set files: + full-casemapping.patch
2012年01月10日 14:03:39 benjamin.peterson set messages: + msg151016
2012年01月10日 03:49:31 benjamin.peterson set files: + full-casemapping.patch; messages: + msg150998
2012年01月08日 03:54:29 benjamin.peterson set files: + full-casemapping.patch; nosy: + benjamin.peterson; messages: + msg150844; keywords: + patch
2011年08月29日 14:16:04 tchrist set messages: + msg143148
2011年08月29日 13:21:06 pitrou set messages: + msg143146
2011年08月29日 13:13:57 Jean-Michel.Fauth set nosy: + Jean-Michel.Fauth; messages: + msg143145
2011年08月28日 21:01:49 tchrist set messages: + msg143124
2011年08月28日 18:56:35 mrabarnett set messages: + msg143119
2011年08月28日 17:27:28 gvanrossum set messages: + msg143110
2011年08月28日 05:54:35 ezio.melotti set files: + casing-results.txt; messages: + msg143089
2011年08月27日 20:04:56 pitrou set nosy: + pitrou; messages: + msg143086
2011年08月27日 19:29:28 mrabarnett set messages: + msg143085
2011年08月27日 19:17:30 tchrist set messages: + msg143084
2011年08月27日 16:15:33 gvanrossum set messages: + msg143083
2011年08月27日 14:48:38 tchrist set messages: + msg143072
2011年08月26日 23:55:58 tchrist set files: + casing-tests.py; messages: + msg143052
2011年08月26日 23:36:17 tchrist set messages: + msg143051
2011年08月26日 21:11:23 gvanrossum set nosy: + gvanrossum; messages: + msg143036
2011年08月13日 00:58:12 mrabarnett set nosy: + mrabarnett
2011年08月12日 18:05:57 Arfrever set nosy: + Arfrever
2011年08月12日 17:30:15 eric.araujo set components: + Interpreter Core, Unicode, - Library (Lib); versions: + Python 3.3, - Python 3.2
2011年08月12日 00:17:23 ezio.melotti set nosy: + belopolsky, ezio.melotti
2011年08月11日 21:39:44 tchrist create
