Message142720
| Author |
ezio.melotti |
| Recipients |
Arfrever, Rhamphoryncus, amaury.forgeotdarc, belopolsky, ezio.melotti, lemburg, tchrist, vstinner |
| Date |
2011-08-22.12:13:58 |
| Message-id |
<1314015239.36.0.592430578881.issue9200@psf.upfronthosting.co.za> |
| In-reply-to |
| Content |
It turned out that this can't be fixed in 2.7 unless we backport the patch in #5127 (it's in 3.2/3.3 but not in 2.7).
IIUC the macro works fine and joins surrogate pairs into a Py_UCS4 char, but since the Py_UNICODE_IS* macros still expect Py_UCS2 on narrow builds on 2.7, the higher bits get truncated and the macros return wrong results.
So, for example
>>> u'\ud800\udc42'.isupper()
True
because \ud800 + \udc42 = \U00010042 → \U00010042 gets truncated to \u0042 → \u0042 is LATIN CAPITAL LETTER B → .isupper() returns True.
The current behavior is instead broken in another way, because it checks u'\ud800'.isupper() and u'\udc42'.isupper() separately.
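For reference, the arithmetic can be checked with a rough sketch in pure Python. This is only an illustration, not the C code from the patch: the helper names join_surrogates and truncate_to_ucs2 are made up here, and the snippet assumes Python 3, where chr() accepts any code point regardless of build.

```python
import unicodedata


def join_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one code point (hypothetical helper)."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def truncate_to_ucs2(code_point):
    """Keep only the low 16 bits, mimicking a Py_UCS4 value squeezed into Py_UCS2."""
    return code_point & 0xFFFF


full = join_surrogates(0xD800, 0xDC42)
narrow = truncate_to_ucs2(full)

print(hex(full))                      # 0x10042
print(hex(narrow))                    # 0x42
print(unicodedata.name(chr(narrow)))  # LATIN CAPITAL LETTER B
print(chr(narrow).isupper())          # True: the wrong answer seen after truncation
print(chr(full).isupper())            # False: the answer for the real code point

# Checking each surrogate on its own, as the unpatched code does:
print(chr(0xD800).isupper(), chr(0xDC42).isupper())  # False False
```

The truncated value is the only one of the three checks that reports an uppercase character, which matches the True in the session above.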
Would it make sense to backport #5127 or should I just give up and leave it broken? |
|