Message 169886 - Python tracker


Author	wiml
Recipients	ezio.melotti, wiml
Date	2012-09-05 18:54:32
Message-id	<1346871274.13.0.653364801762.issue15866@psf.upfronthosting.co.za>

Content

Encoding a (well-formed) Unicode string containing a non-BMP character, using the xmlcharrefreplace error handler, produces two XML character references for the surrogate code points instead of one character reference for the actual character.
Here's a transcript (Python 2.7.3, x86_64):
 >>> b = '\xf0\x9f\x92\x9d'
 >>> u = b.decode('utf8')
 >>> u
 u'\U0001f49d'
 >>> u.encode('ascii', errors='xmlcharrefreplace')
 '&#55357;&#56477;'
 >>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace')
 '&#55357;&#56477;'
 >>> list(u)
 [u'\ud83d', u'\udc9d']
 >>> u.encode('utf8', errors='xmlcharrefreplace')
 '\xf0\x9f\x92\x9d'
The UTF-8 byte string is decoded correctly, and the printed representation shows a single Unicode character. Encoding using xmlcharrefreplace produces two XML character references, which is wrong[1]: a single non-BMP character should be represented in XML as a single numeric character reference, in this case presumably '&#128157;'.
As the list(u) output shows, I'm using a narrow build (so the unicode strings are represented internally in UTF-16, I guess). Converting the string back to UTF-8 does the right thing and emits a single UTF-8 sequence representing the supplementary-plane code point.
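For illustration, here is a minimal sketch of the behaviour I would expect: combine each surrogate pair into its real code point before writing the character reference. The helper name xmlcharref_escape is hypothetical, it is not how the codec is implemented, and it is written for Python 3, where the narrow-build situation can be simulated by passing the two surrogates as separate code points.

```python
def xmlcharref_escape(text):
    """Escape non-ASCII characters as XML numeric character references.

    Hypothetical helper (not part of the standard library): a UTF-16
    surrogate pair is combined into one code point before escaping.
    """
    out = []
    i = 0
    while i < len(text):
        cp = ord(text[i])
        # A high surrogate followed by a low surrogate encodes one
        # supplementary-plane character; rebuild its code point.
        if 0xD800 <= cp <= 0xDBFF and i + 1 < len(text):
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1  # skip the low surrogate we just consumed
        # ASCII passes through; everything else becomes &#N;
        out.append(chr(cp) if cp < 0x80 else "&#%d;" % cp)
        i += 1
    return "".join(out)


# The pair from the transcript: 0xD83D (55357) and 0xDC9D (56477).
print(xmlcharref_escape("\ud83d\udc9d"))
```

For the pair above the arithmetic is 0x10000 + (0x3D << 10) + 0x9D = 0x1F49D, i.e. 128157, so the helper emits one reference rather than two.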
(FWIW, the backslashreplace error handler also emits a surrogate pair, but I don't know if there is a complete specification for what that handler does, so it's possible that it's not wrong.)
[1] http://www.w3.org/International/questions/qa-escapes#bytheway

History
Date	User	Action	Args
2012-09-05 18:54:34	wiml	set	recipients: + wiml, ezio.melotti
2012-09-05 18:54:34	wiml	set	messageid: <1346871274.13.0.653364801762.issue15866@psf.upfronthosting.co.za>
2012-09-05 18:54:33	wiml	link	issue15866 messages
2012-09-05 18:54:32	wiml	create