Created on 2009-10-03 03:46 by ArcRiley, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| t.py | ArcRiley, 2009-10-03 03:46 | |
| u.py | ArcRiley, 2009-10-03 23:04 | |
| Messages (10) | |||
|---|---|---|---|
| msg93475 | Author: Arc Riley (ArcRiley) * | Date: 2009-10-03 03:46 | |
The following is a minimal example which does not work under Python
3.1.1 but functions as expected on Python 2.6 and 3.0.
Python 3.1.1 treats the single UTF-8 encoded character as two entirely
different (and illegal) Unicode characters:
Traceback (most recent call last):
  File "t.py", line 2, in <module>
    print('𐑛')
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed
Test system is Ubuntu 9.10-beta 32-bit
|
|||
| msg93476 | Author: Arc Riley (ArcRiley) * | Date: 2009-10-03 04:09 | |
While t.py only fails on 3.1, the following happens with 3.0 as well:
>>> line = '𐑑𐑧𐑕𐑑𐑦𐑙'
>>> first = '𐑑'
>>> first
'𐑑'
>>> line[0]
'\ud801'
>>> line[0] == first
False
And with 2.6:
>>> line = u'𐑑𐑧𐑕𐑑𐑦𐑙'
>>> first = u'𐑑'
>>> first
u'\ud801\udc51'
|
|||
| msg93482 | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2009-10-03 09:12 | |
I can't reproduce that; it prints fine for me. Notice that it is perfectly fine for Python to represent this as two code units in UCS-2 mode (so that len(s)==2); this is called UTF-16. |
|||
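Editor's note: the UTF-16 representation Martin describes can be sketched with a little arithmetic. The helper name `to_surrogate_pair` is hypothetical and only for illustration; it applies the standard UTF-16 rule to U+10451, the character used in this report.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into its UTF-16 surrogates.

    Hypothetical helper for illustration; not part of any Python API.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x10451)
print(hex(high), hex(low))  # 0xd801 0xdc51
```

The result matches the `u'\ud801\udc51'` shown by Python 2.6 in the previous message, which is what a narrow (UCS-2) build is expected to store.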
| msg93486 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2009-10-03 10:21 | |
I can't reproduce it either on Ubuntu 9.04 32-bit. I tried both from the
terminal and from the file, using Py3.2a0.
As Martin said, the fact that in narrow builds of Python the code points
outside the BMP are represented with a surrogate pair (two surrogates) is
a known "issue". This is how UTF-16 works, even if it has some problematic
side-effects.
In your example 'line[0]' is not equal to 'first' because line[0] is only
the first surrogate of the pair, while 'first' is a scalar value that
represents the SHAVIAN LETTER TOT (U+10451).
Regarding the traceback you pasted in the first post, have you used
print('𐑑') or print(line[0])?
This is what I get using line[0]:
>>> line = '𐑑𐑧𐑕𐑑𐑦𐑙'
>>> first = '𐑑'
>>> print(line[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed
In this case you are getting an error because lone surrogates are
invalid and they can't be encoded. If you use line[:2] instead it works
because it takes both the surrogates:
>>> print(line[0:2])
𐑑
>>> first == line[0:2]
True
If you really got that error with print('𐑛'), then #3297 could be related.
Can you also try this and see what it prints?
>>> import sys
>>> sys.maxunicode
|
|||
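Editor's note: a minimal sketch of the point about lone surrogates, assuming Python 3.3 or later (where strings are indexed by code point and the narrow/wide build distinction no longer exists). The variable names are illustrative only.

```python
# A lone high surrogate is not a Unicode scalar value, so the UTF-8
# codec refuses to encode it.
lone = '\ud801'
reason = None
try:
    lone.encode('utf-8')
except UnicodeEncodeError as exc:
    reason = exc.reason
print(reason)  # surrogates not allowed

# The full character, written as a single code point, encodes normally.
full = '\U00010451'
encoded = full.encode('utf-8')
print(encoded)  # b'\xf0\x90\x91\x91'
```

On the narrow builds discussed in this thread the same character occupied two positions in the string, which is why `line[0:2]` was needed to get both halves.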
| msg93489 | Author: Arc Riley (ArcRiley) * | Date: 2009-10-03 13:15 | |
Python 3.1.1 (r311:74480, Sep 13 2009, 22:19:17)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> u = '𐑑'
>>> print(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 0: surrogates not allowed
|
|||
| msg93510 | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2009-10-03 22:14 | |
The page http://www.fileformat.info/info/unicode/char/d801/index.htm has a big warning saying that "U+D801 is not a valid unicode character." The problem is similar to issue6697, and leads to the same question: should Python validate UTF-8 input and refuse invalid Unicode characters? |
|||
| msg93513 | Author: Arc Riley (ArcRiley) * | Date: 2009-10-03 23:04 | |
Amaury, you are absolutely correct, \ud801 is not a valid Unicode
character. However, I am not giving Python \ud801, I am giving Python '𐑑'
(== '\U00010451').
I am attaching a different short example, u.py, which demonstrates that
Python is mishandling UTF-8 both on the interactive terminal and in scripts.
The output should be the same, but on Python 3.1.1 compiled for wide
Unicode it reports two different values. As someone on #python-dev
found, '𐑑'.encode('utf-16').decode('utf-16') outputs the correct value.
|
|||
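Editor's note: the UTF-16 round trip mentioned above can be sketched as follows. This assumes Python 3.3 or later and assumes the affected string held the two surrogates as separate code points; on current versions the encoder rejects surrogates unless the `surrogatepass` error handler is given, which was not required in the 3.1-era expression quoted in the message.

```python
# Simulate the mis-parsed literal: two surrogate code points on a build
# that can hold U+10451 directly.
broken = '\ud801\udc51'

# Encoding to UTF-16 writes the surrogates as ordinary code units;
# decoding then reads them back as one supplementary-plane character.
repaired = broken.encode('utf-16', 'surrogatepass').decode('utf-16')

print(len(broken), len(repaired))   # 2 1
print(repaired == '\U00010451')     # True
```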
| msg93514 | Author: Adam Olsen (Rhamphoryncus) | Date: 2009-10-03 23:37 | |
I believe this is a duplicate of issue #3297. When given a high Unicode scalar value directly in the source (rather than in escaped form), Python will split it into surrogates, even on a UTF-32 build where those surrogates are nonsensical and ill-formed. Patches for issue #3672 probably made this more visible. |
|||
| msg93515 | Author: Arc Riley (ArcRiley) * | Date: 2009-10-03 23:52 | |
This behavior is identical whether u.py or u.pyc is run on my systems, whereas that previous ticket concerns differing behavior between the two. It is obviously related, though. |
|||
| msg94596 | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009-10-28 01:07 | |
This is a duplicate of #3297, and Adam's patch there fixes it. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:53 | admin | set | github: 51294 |
| 2009-10-28 01:07:55 | benjamin.peterson | set | status: open -> closed nosy: + benjamin.peterson messages: + msg94596 superseder: Python interpreter uses Unicode surrogate pairs only before the pyc is created resolution: duplicate |
| 2009-10-03 23:52:55 | ArcRiley | set | messages: + msg93515 versions: - Python 2.6, Python 3.0 |
| 2009-10-03 23:37:58 | Rhamphoryncus | set | nosy: + Rhamphoryncus messages: + msg93514 |
| 2009-10-03 23:04:02 | ArcRiley | set | files: + u.py messages: + msg93513 |
| 2009-10-03 22:14:04 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc messages: + msg93510 |
| 2009-10-03 13:15:16 | ArcRiley | set | messages: + msg93489 |
| 2009-10-03 10:21:58 | ezio.melotti | set | priority: normal nosy: + ezio.melotti messages: + msg93486 |
| 2009-10-03 09:12:03 | loewis | set | nosy: + loewis messages: + msg93482 |
| 2009-10-03 04:09:16 | ArcRiley | set | messages: + msg93476 versions: + Python 2.6, Python 3.0 |
| 2009-10-03 03:46:46 | ArcRiley | create | |