Message 155576 - Python tracker

Author	scoder
Recipients	Arfrever, effbot, eli.bendersky, flox, philthompson10, scoder
Date	2012-03-13.07:24:14
Message-id	<1331623455.82.0.303366058574.issue14246@psf.upfronthosting.co.za>

Content

FWIW, lxml also has support for parsing Unicode strings. It doesn't encode the input, however, but parses straight from the underlying buffer (after detecting the buffer layout etc. at module init time - and yes, I still haven't fixed this up for PEP 393).
There is one problem that I see with encoding it to UTF-8 first, even disregarding the obvious inefficiency in terms of both memory and processing time. At least for string parsing, lxml has an additional check in place that rejects Unicode string input containing an encoding declaration, because that would be very unlikely to match the buffer encoding on a given platform.
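A minimal sketch of a check of that kind for `str` input, written with the standard `re` module. This is not lxml's actual implementation; the function name `reject_declared_encoding` and the error message are made up for illustration.

```python
import re

# Matches an XML declaration at the very start of the text that carries an
# encoding pseudo-attribute, e.g. <?xml version='1.0' encoding='latin1'?>.
_DECLARED_ENCODING = re.compile(
    r"<\?xml\s+version\s*=\s*(['\"])[^'\"]*\1"
    r"\s+encoding\s*=\s*(['\"])([^'\"]*)\2"
)


def reject_declared_encoding(text):
    """Return text unchanged, or raise if it declares an encoding.

    Hypothetical helper: a str is already decoded, so a byte-encoding
    declaration inside it cannot be trusted.
    """
    match = _DECLARED_ENCODING.match(text)
    if match:
        raise ValueError(
            "str input declares encoding %r; pass bytes instead"
            % match.group(3)
        )
    return text
```

This only works when the whole declaration is available as one string, which is why the same approach is awkward for file-like objects that hand out text in chunks.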
In your case, it would be good to have something similar. When you get a Unicode string with, say, an ISO-8859-1 encoding declaration, encoding it to UTF-8 and then passing that to pyexpat will silently generate incorrect content, unless you can safely enforce a specific encoding regardless of the declaration. I don't know how expat handles this (you are clearly not handling it in your patch, which, I take it, only adapts the behaviour to what pyET currently does).
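The failure mode can be reproduced with pyexpat directly. The sketch below feeds UTF-8 bytes that carry an ISO-8859-1 declaration to `xml.parsers.expat`, once as-is and once with the `encoding` argument of `ParserCreate`, which the Python documentation describes as overriding the document's declared encoding. The helper name `parse_text` is made up for illustration.

```python
from xml.parsers import expat


def parse_text(data, encoding=None):
    """Parse bytes with pyexpat and return the concatenated character data."""
    chunks = []
    parser = expat.ParserCreate(encoding)
    # Character data may arrive in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


# A str that claims ISO-8859-1, encoded to UTF-8 before parsing.
doc = "<?xml version='1.0' encoding='iso-8859-1'?><a>\u00e9</a>"
utf8_bytes = doc.encode("utf-8")

# The parser trusts the declaration and decodes the two UTF-8 bytes of
# U+00E9 as two Latin-1 characters.
mangled = parse_text(utf8_bytes)

# Forcing the encoding at parser creation ignores the declaration.
forced = parse_text(utf8_bytes, "utf-8")
```

Under these assumptions `mangled` comes back as two characters (U+00C3, U+00A9) with no error raised, while `forced` is the original single character.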
The problem here is that it's not so easy to do this for file-like objects, because they may return text that contains "<?xml version='1.0'", then a billion whitespace characters, and then "encoding='latin1'?>". The XML parser could handle this, but doing it in a preprocessing step would be some work.
In any case, silently returning broken data is not a good idea. Maybe it would work to check the encoding that the parser uses against the one we expect? If the parser switches encodings at some point even though we are sure it *must* be UTF-8 (in whatever spelling), we can still raise an error at that point. I'll consider letting lxml do this check as well; it sounds more efficient than what it currently does.
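One way to sketch that idea with pyexpat is to let the parser read the declaration itself and compare the declared name against UTF-8 in its `XmlDeclHandler` callback. This is only an illustration under stated assumptions, not what the patch or lxml does: the function name `parse_str_checked` is hypothetical, and `codecs.lookup` is used as a stand-in for "UTF-8 in whatever spelling".

```python
import codecs
from xml.parsers import expat


def parse_str_checked(text):
    """Parse a str via UTF-8 and refuse declarations that disagree.

    Hypothetical helper; returns the concatenated character data.
    """
    def check_declaration(version, encoding, standalone):
        if encoding is None:
            return  # no encoding declared, nothing to contradict
        try:
            normalized = codecs.lookup(encoding).name
        except LookupError:
            normalized = None  # unknown name, so certainly not UTF-8
        if normalized != "utf-8":
            raise ValueError(
                "declared encoding %r conflicts with UTF-8 input" % encoding
            )

    chunks = []
    # Force UTF-8 so the parser decodes the bytes we actually pass in.
    parser = expat.ParserCreate("utf-8")
    parser.XmlDeclHandler = check_declaration
    parser.CharacterDataHandler = chunks.append
    # An exception raised in a handler propagates out of Parse().
    parser.Parse(text.encode("utf-8"), True)
    return "".join(chunks)
```

Because the parser does the tokenizing, the amount of whitespace inside the declaration does not matter, and the same handler could be attached when feeding chunks from a file-like object.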
