Message 109010 - Python tracker


Author	Mike.Lewis
Recipients	Mike.Lewis
Date	2010-06-30.20:02:51
Message-id	<1277928174.53.0.503418777966.issue9133@psf.upfronthosting.co.za>

Content

When I do
codecs.encode(codecs.decode('\xed\xbc\xad', 'utf8'), 'utf8')
it does not throw an exception, even though '\xed\xbc\xad' is an invalid UTF-8 byte sequence.
It maps to the value U+DF2D, which is a lone low surrogate (one half of a UTF-16 "surrogate pair").
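
A minimal sketch of the arithmetic behind that mapping, assuming Python 3 (where indexing a bytes object yields integers). The names raw and code_point are only illustrative. It unpacks the three-byte UTF-8 form 1110xxxx 10xxxxxx 10xxxxxx by hand and checks where the result falls:

```python
# Hand-decode the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx.
# Assumes Python 3, where raw[i] is an int.
raw = b"\xed\xbc\xad"

code_point = (
    ((raw[0] & 0x0F) << 12)   # low 4 bits of the lead byte
    | ((raw[1] & 0x3F) << 6)  # low 6 bits of the first continuation byte
    | (raw[2] & 0x3F)         # low 6 bits of the second continuation byte
)

print(f"U+{code_point:04X}")           # U+DF2D
print(0xD800 <= code_point <= 0xDFFF)  # True: inside the surrogate range
print(0xDC00 <= code_point <= 0xDFFF)  # True: specifically a low surrogate
```
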
http://tools.ietf.org/html/rfc3629#section-4
explains:
 However, pairs of
 UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
 parlance), being actually UCS-4 characters transformed through
 UTF-16, need special treatment: the UTF-16 transformation must be
 undone, yielding a UCS-4 character that is then transformed as
 above.
which would suggest that it is invalid.
However, I think Wikipedia's explanation is a bit clearer:
UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above.
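
For comparison, here is a rough sketch of the behavior I would expect from a strict codec, written against Python 3 (an assumption: this report concerns the Python 2 str type, and Python 3's default UTF-8 codec does reject encoded surrogates). The helper name is_strict_utf8 is made up for illustration; "surrogatepass" is the standard error handler that opts back in to the lenient behavior.

```python
raw = b"\xed\xbc\xad"  # UTF-8-style encoding of the lone surrogate U+DF2D


def is_strict_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes under the strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_strict_utf8(raw))                  # False: encoded surrogate is rejected
print(is_strict_utf8("é".encode("utf-8")))  # True: ordinary scalar value

# Opting in to lenient handling reproduces the round trip described above.
lone = raw.decode("utf-8", "surrogatepass")
print(lone == "\udf2d")                               # True
print(lone.encode("utf-8", "surrogatepass") == raw)   # True

# The strict encoder refuses the lone surrogate as well.
try:
    lone.encode("utf-8")
    encoder_rejected = False
except UnicodeEncodeError:
    encoder_rejected = True
print(encoder_rejected)  # True
```
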
Thanks,
Mike

History
Date	User	Action	Args
2010-06-30 20:02:54	Mike.Lewis	set	recipients: + Mike.Lewis
2010-06-30 20:02:54	Mike.Lewis	set	messageid: <1277928174.53.0.503418777966.issue9133@psf.upfronthosting.co.za>
2010-06-30 20:02:53	Mike.Lewis	link	issue9133 messages
2010-06-30 20:02:51	Mike.Lewis	create
