Message81061
| Author |
vstinner |
| Recipients |
amaury.forgeotdarc, bupjae, ezio.melotti, lemburg, vstinner |
| Date |
2009-02-03.14:18:25 |
| SpamBayes Score |
5.1957577e-10 |
| Marked as misclassified |
No |
| Message-id |
<200902031518.19781.victor.stinner@haypocalc.com> |
| In-reply-to |
<4988441A.1010607@egenix.com> |
| Content |
lemburg> This is not possible for unichr() in Python 2.x, since applications
lemburg> always expect len(unichr(x)) == 1
Oh, ok.
lemburg> Changing ord() in Python 2.x would be easier, since
lemburg> this would only extend the range of returned values for UCS2
lemburg> builds.
ord() in Python 3 (narrow build) rejects surrogate characters:
>>> chr(0x10000)
'\U00010000'
>>> len(chr(0x10000))
2
>>> ord(0x10000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected string of length 1, but int found
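Note that the call above passes the integer 0x10000 rather than a string, which is what the TypeError reports. For reference, here is a minimal sketch of the arithmetic needed to turn a surrogate pair into one code point. The name combine_surrogates is hypothetical (it is not a Python API), and the function takes the two UTF-16 code units as plain integers so that it behaves the same on any build:

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair (two ints) into a single code point.

    Hypothetical helper for illustration only, not part of Python.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the value (code point - 0x10000).
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD800, 0xDC00)))  # 0x10000
```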
---
It looks like narrow builds with surrogates have some more problems...
Test with U+10000: "LINEAR B SYLLABLE B008 A", category: Letter, Other.
Correct result (Python 2.5, wide build):
$ python
Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
>>> unichr(0x10000)
u'\U00010000'
>>> unichr(0x10000).isalpha()
True
Error in Python3 (narrow build):
marge$ ./python
Python 3.1a0 (py3k:69105M, Feb 3 2009, 15:04:35)
>>> chr(0x10000).isalpha()
False
>>> list(chr(0x10000))
['\ud800', '\udc00']
>>> chr(0xd800).isalpha()
False
>>> chr(0xdc00).isalpha()
False
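The two-element list above is the UTF-16 surrogate pair for U+10000, and isalpha() is applied to each half separately. As a rough sketch, the split can be reproduced by hand and the categories checked with unicodedata. This assumes Python 3.3 or later, where chr(0x10000) is a single character; split_surrogates is a hypothetical helper name, not a Python API:

```python
import unicodedata


def split_surrogates(code_point):
    """Split a non-BMP code point into its (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only, not part of Python.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % code_point)
    offset = code_point - 0x10000
    # The top 10 bits go to the high surrogate, the low 10 bits to the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = split_surrogates(0x10000)
print(hex(high), hex(low))                 # 0xd800 0xdc00
print(unicodedata.category(chr(0x10000)))  # Lo (Letter, Other)
print(unicodedata.category(chr(high)))     # Cs (Other, Surrogate)
print(unicodedata.category(chr(low)))      # Cs (Other, Surrogate)
```

The whole character is in category Lo, while each half taken alone is in category Cs, which is consistent with isalpha() returning False for the halves.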
Unicode ranges, all in the category "Other, Surrogate":
- U+D800..U+DB7F: Non Private Use High Surrogate
- U+DB80..U+DBFF: Private Use High Surrogate
- U+DC00..U+DFFF: Low Surrogate |
|