Message 81061 - Python tracker

Author	vstinner
Recipients	amaury.forgeotdarc, bupjae, ezio.melotti, lemburg, vstinner
Date	2009-02-03.14:18:25
SpamBayes Score	5.1957577e-10
Marked as misclassified	No
Message-id	<200902031518.19781.victor.stinner@haypocalc.com>
In-reply-to	<4988441A.1010607@egenix.com>

Content

lemburg> This is not possible for unichr() in Python 2.x, since applications
lemburg> always expect len(unichr(x)) == 1
Oh, ok.
lemburg> Changing ord() would be possible in Python 2.x is easier, since
lemburg> this would only extend the range of returned values for UCS2
lemburg> builds.
ord() of Python3 (narrow build) rejects surrogate characters:
>>> chr(0x10000)
'\U00010000'
>>> len(chr(0x10000))
2
>>> ord(0x10000)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: ord() expected string of length 1, but int found
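
Extending ord() for UCS2 builds, as lemburg suggests, comes down to decoding a surrogate pair into a single code point. A minimal sketch of that arithmetic, working on integer code units rather than on a real narrow-build string; the helper names combine_surrogates and split_code_point are hypothetical and not part of any Python API:

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair (two ints) into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the code point offset from U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Split a non-BMP code point into its (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

With these helpers, the pair (0xD800, 0xDC00) maps to U+10000 and back again.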
---
It looks like narrow builds with surrogates have some more problems...
Test with U+10000: "LINEAR B SYLLABLE B008 A", category: Letter, Other.
Correct result (Python 2.5, wide build):
 $ python
 Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
 >>> unichr(0x10000)
 u'\U00010000'
 >>> unichr(0x10000).isalpha()
 True
Error in Python3 (narrow build):
 marge$ ./python
 Python 3.1a0 (py3k:69105M, Feb 3 2009, 15:04:35)
 >>> chr(0x10000).isalpha()
 False
 >>> list(chr(0x10000))
 ['\ud800', '\udc00']
 >>> chr(0xd800).isalpha()
 False
 >>> chr(0xdc00).isalpha()
 False
Unicode ranges, all in the category "Other, Surrogate":
 - U+D800..U+DB7F: Non Private Use High Surrogate
 - U+DB80..U+DBFF: Private Use High Surrogate
 - U+DC00..U+DFFF: Low Surrogate
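
The three ranges above can be written as a small lookup. This is an illustrative sketch; the function name surrogate_kind is hypothetical, and the labels are copied from the list.

```python
import unicodedata


def surrogate_kind(cp):
    """Return the surrogate range name for code point cp, or None if not a surrogate."""
    if 0xD800 <= cp <= 0xDB7F:
        return "Non Private Use High Surrogate"
    if 0xDB80 <= cp <= 0xDBFF:
        return "Private Use High Surrogate"
    if 0xDC00 <= cp <= 0xDFFF:
        return "Low Surrogate"
    return None


# The Unicode database reports every code point in these ranges as
# category "Cs" (Other, Surrogate), while U+10000 itself is "Lo" (Letter, Other).
assert unicodedata.category(chr(0xD800)) == "Cs"
assert unicodedata.category(chr(0x10000)) == "Lo"
```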

History
Date	User	Action	Args
2009-02-03 14:18:28	vstinner	set	recipients: + vstinner, lemburg, amaury.forgeotdarc, ezio.melotti, bupjae
2009-02-03 14:18:26	vstinner	link	issue5127 messages
2009-02-03 14:18:25	vstinner	create
