Created on 2010-06-30 20:02 by Mike.Lewis, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Messages (4) | |||
|---|---|---|---|
| msg109010 | Author: Mike Lewis (Mike.Lewis) | Date: 2010-06-30 20:02 | |
When I do
codecs.encode(codecs.decode('\xed\xbc\xad', 'utf8'), 'utf8')
it's not throwing an exception. '\xed\xbc\xad' is an invalid UTF-8 byte sequence.
It maps to the value U+DF2D, which seems to be a surrogate code point (one half of a "surrogate pair").
http://tools.ietf.org/html/rfc3629#section-4
explains:
However, pairs of
UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
parlance), being actually UCS-4 characters transformed through
UTF-16, need special treatment: the UTF-16 transformation must be
undone, yielding a UCS-4 character that is then transformed as
above.
which would suggest that it is invalid.
However, I think Wikipedia's explanation is a bit clearer:
UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above.
Thanks,
Mike
|
|||
| msg109011 | Author: Mike Lewis (Mike.Lewis) | Date: 2010-06-30 20:07 | |
Sorry, I meant to add this part to the quote from the RFC: "This leads to different results for character numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT valid UTF-8." |
|||
| msg109012 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2010-06-30 20:11 | |
This is already fixed in Python 3. However, I think that for backward compatibility reasons it can't be fixed in Python 2, where it is possible to encode and decode every code point to/from UTF-8. See also http://bugs.python.org/issue8271#msg102209 for related discussion. I think this can be closed as wontfix. |
|||
| msg109017 | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010-06-30 20:38 | |
Ezio Melotti wrote: "I think this can be closed as wontfix." Agreed. I've already closed the ticket. |
|||
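
The behavior discussed in this thread can be checked with a short sketch. It assumes Python 3 (where, per the comments above, the strict decoder already rejects the sequence) and uses hypothetical helper names (`decode_three_byte`, `is_surrogate`, `strict_decode_fails`) that do not come from the report. It decodes the three bytes by hand to show why they correspond to U+DF2D, confirms that the strict `utf-8` codec refuses them, and shows that the `surrogatepass` error handler gives a round trip similar to what the reporter saw in Python 2.

```python
import codecs

RAW = b"\xed\xbc\xad"  # the byte sequence from the report


def decode_three_byte(seq: bytes) -> int:
    """Unpack the 1110xxxx 10xxxxxx 10xxxxxx bit pattern, with no validity checks."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """True for code points in the UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def strict_decode_fails(seq: bytes) -> bool:
    """Report whether the default (strict) UTF-8 decoder rejects the bytes."""
    try:
        codecs.decode(seq, "utf-8")
    except UnicodeDecodeError:
        return True
    return False


code_point = decode_three_byte(RAW)
print(f"U+{code_point:04X}", "surrogate" if is_surrogate(code_point) else "scalar value")
print("strict decode rejected:", strict_decode_fails(RAW))

# Opting in to 'surrogatepass' lets the lone surrogate through in both directions.
round_trip = RAW.decode("utf-8", "surrogatepass").encode("utf-8", "surrogatepass")
print("surrogatepass round trip preserved bytes:", round_trip == RAW)
```

The `surrogatepass` handler is an explicit opt-in, so code that relies on the default strict mode keeps rejecting such input.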
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:03 | admin | set | github: 53379 |
| 2014-03-12 21:14:09 | jmehnle | set | nosy: + jmehnle |
| 2010-06-30 22:05:21 | ezio.melotti | set | stage: resolved |
| 2010-06-30 20:38:28 | lemburg | set | messages: + msg109017 |
| 2010-06-30 20:25:37 | lemburg | set | status: pending -> closed; resolution: wont fix |
| 2010-06-30 20:11:50 | ezio.melotti | set | status: open -> pending; versions: + Python 2.7; nosy: + lemburg, vstinner, ezio.melotti; messages: + msg109012; type: behavior |
| 2010-06-30 20:07:17 | Mike.Lewis | set | messages: + msg109011 |
| 2010-06-30 20:02:53 | Mike.Lewis | create | |