Message 189685 - Python tracker

In-reply-to
Author	terry.reedy
Recipients	amaury.forgeotdarc, dongying, eli.bendersky, flox, terry.reedy, vstinner
Date	2013-05-20.19:53:02
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1369079583.21.0.416688402904.issue13612@psf.upfronthosting.co.za>

Content

3.3 shifted the wide-build problem to all builds ;-). I now get:
  File "C:\Python\mypy\tem.py", line 4, in <module>
    xmlet.fromstring(s)
  File "C:...33\lib\xml\etree\ElementTree.py", line 1356, in XML
    parser.feed(text)
  File "<string>", line None
xml.etree.ElementTree.ParseError: unknown encoding: line 1, column 30
I do not understand the 'unknown encoding' part. Replacing 'GBK' with a truly unknown encoding changes the last line to
LookupError: unknown encoding: xyz, so the lookup of 'GBK' must have succeeded.
I get the same two messages if I add a 'b' prefix to make s bytes, which it logically should be (and was in 2.7). (I presume .fromstring 'encodes' Unicode input to bytes with the ascii or latin-1 encoder and then decodes back to Unicode according to the declared encoding.)
With s so prefixed, s.decode(encoding="GBK") works and returns the original Unicode version of s, so Python does know "GBK". It is also in the official IANA list of charset names.
I don't know Unicode internals well enough to understand Amaury's comment. However, it almost reads to me as if this is a Unicode bug, not an ET bug.
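
The codec-level checks described above (that Python itself knows "GBK", while a made-up name such as "xyz" raises LookupError) can be sketched roughly as follows. This is only an illustration: the sample string is a hypothetical stand-in, not the s from the original report, and the sketch deliberately does not call ElementTree, whose behaviour differs between versions.

```python
import codecs

# Looking up "GBK" succeeds; codecs normalizes the name to lowercase.
info = codecs.lookup("GBK")
print(info.name)

# Hypothetical sample text (not the string from the report).
text = "中文"
data = text.encode("gbk")

# Decoding with the keyword form used in the message round-trips the text.
round_trip = data.decode(encoding="GBK")
print(round_trip == text)

# A name that really is unknown raises LookupError instead.
try:
    codecs.lookup("xyz")
    message = None
except LookupError as exc:
    message = str(exc)
print(message)
```

If this runs cleanly, the codec registry is not what rejects 'GBK', which is consistent with the 'unknown encoding' in the ParseError coming from somewhere else in the parsing path.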

History
Date	User	Action	Args
2013-05-20 19:53:03	terry.reedy	set	recipients: + terry.reedy, amaury.forgeotdarc, vstinner, eli.bendersky, flox, dongying
2013-05-20 19:53:03	terry.reedy	set	messageid: <1369079583.21.0.416688402904.issue13612@psf.upfronthosting.co.za>
2013-05-20 19:53:03	terry.reedy	link	issue13612 messages
2013-05-20 19:53:02	terry.reedy	create