Issue 14246: Accelerated ETree XMLParser cannot handle io.StringIO

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/58454

classification

Title:	Accelerated ETree XMLParser cannot handle io.StringIO
Type:	behavior	Stage:	resolved
Components:	XML	Versions:	Python 3.3

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	eli.bendersky	Nosy List:	Arfrever, effbot, eli.bendersky, flox, philthompson10, python-dev, scoder
Priority:	normal	Keywords:	3.2regression, patch

Created on 2012-03-10 11:04 by philthompson10, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded
issue14246.1.patch	eli.bendersky, 2012-03-13 04:32

Messages (9)
msg155301	Author: Phil Thompson (philthompson10)	Date: 2012-03-10 11:04
The old unaccelerated ETree XMLParser accepts input from an io.StringIO, but the accelerated version does not. Any code that relies on this is broken by Python v3.3.
msg155336	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-03-10 18:22
Hi Phil, could you please post the problematic code snippet that runs with 3.2 but not 3.3? Thanks in advance.
msg155339	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-03-10 18:28
Note that this code works fine:
---------------------------------------
import io
import xml.etree.ElementTree as ET

tree = ET.ElementTree()
stream = io.BytesIO()
stream.write(b'''<?xml version="1.0"?>
<site>
</site>
''')
stream.seek(0)
tree.parse(stream)
print(tree.getroot())
---------------------------------------
msg155355	Author: Phil Thompson (philthompson10)	Date: 2012-03-10 21:53
This variation of your test doesn't...
---------------------------------------
import io
from xml.etree.ElementTree import parse

stream = io.StringIO()
stream.write('''<?xml version="1.0"?>
<site>
</site>
''')
stream.seek(0)
parsed = parse(stream)
print(parsed)
---------------------------------------
Phil
msg155377	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-03-11 04:13
Both the Python ET and _elementtree use expat, but they reach its parsing function by different paths when given a file-like object (StringIO, BytesIO). Python ET reads the stream itself and passes the data to pyexpat's Parse method, which uses PyArg_ParseTuple to decode it. The latter turns a string into bytes when required, so the parsing of str streams is handled transparently. For _elementtree, on the other hand, ET directly calls the internal XMLParser._parse, which uses its own (C) loop to read from the stream. When it sees that it has read a string rather than bytes, it stops and falls back on parsing an empty document. The fix will have to be in the latter loop, probably just converting the read string to bytes before moving on.
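The read loop described above can be sketched in pure Python. This is only an illustration of the idea (read a chunk, convert str to bytes, feed the parser), not the C code in _elementtree; the helper name parse_stream and the chunk size are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def parse_stream(stream, chunk_size=65536):
    """Hypothetical helper: read a file-like object chunk by chunk and feed expat."""
    parser = ET.XMLParser()
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        # A text stream returns str; convert it to bytes before moving on.
        if isinstance(data, str):
            data = data.encode("utf-8")
        parser.feed(data)
    # close() finishes parsing and returns the root element.
    return parser.close()


text_root = parse_stream(io.StringIO('<?xml version="1.0"?>\n<site>\n</site>\n'))
bytes_root = parse_stream(io.BytesIO(b'<?xml version="1.0"?>\n<site>\n</site>\n'))
print(text_root.tag, bytes_root.tag)
```

With this conversion in place, a StringIO and a BytesIO holding the same document yield the same root element.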
msg155567	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-03-13 04:32
Attaching a patch that fixes the problem and adds a test. xmlparser_parse from _elementtree will now be able to handle Unicode, encoding it to bytes with UTF-8, parallel to the way it's done in pyexpat. I would appreciate a review for the patch.
msg155576	Author: Stefan Behnel (scoder) * (Python committer)	Date: 2012-03-13 07:24
FWIW, lxml also has support for parsing Unicode strings. It doesn't encode the input, however, but parses straight from the underlying buffer (after detecting the buffer layout etc. at module init time - and yes, I still haven't fixed this up for PEP 393).
There is one problem that I see with encoding it to UTF-8 first, even disregarding the obvious inefficiency in terms of both memory and processing time. At least for string parsing, lxml has an additional check in place that rejects Unicode string input containing an encoding declaration, because that would be very unlikely to match the buffer encoding on a given platform. In your case, it would be good to have something similar, because when you get a Unicode string with, say, an ISO8859-1 encoding declaration, encoding it to UTF-8 and then passing that to pyexpat will silently generate incorrect content, unless you can safely enforce a specific encoding regardless of the declaration. I don't know how expat handles this (you are clearly not handling it in your patch, which, I take it, only adapts the behaviour to what pyET currently does).
The problem here is that it's not so easy to do this for file-like objects, because they may return text that contains "<?xml version='1.0'", then a billion whitespace characters, and then "encoding='latin1'?>". The XML parser could handle this, but doing it in a preprocessing step would be some work. In any case, silently returning broken data is not a good idea.
Maybe it would work to check the encoding that the parser uses against the one we expect? If the parser switches encodings at some point even though we are sure it must be utf-8 (in whatever spelling), we can still raise an error at that point. I'll consider letting lxml do this check as well, it sounds more efficient than what it currently does.
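A minimal sketch of the failure mode described in this comment, written for illustration and not taken from the patch. It passes bytes explicitly, so it does not depend on how a given Python version treats str input: the document declares ISO-8859-1, but the bytes handed to the parser are UTF-8.

```python
import xml.etree.ElementTree as ET

# Text as a program would hold it in a str; the declaration claims ISO-8859-1.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><site>caf\u00e9</site>'

# Encode to UTF-8 regardless of what the declaration says.
data = text.encode("utf-8")

# expat trusts the declaration and decodes the UTF-8 bytes as ISO-8859-1,
# so the two bytes of the encoded U+00E9 come back as two separate characters.
root = ET.fromstring(data)
print(ascii(root.text))
```

No exception is raised; the parsed text simply differs from the original, which is the "silently incorrect content" concern.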
msg155579	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-03-13 08:24
Stefan, thanks a lot for taking the time to review the patch. As you correctly say, the current patch's goal is just to align with existing behavior in the Python implementation of ET. I understand the problem you are describing, but at least it's not a regression vs. previous behavior, while the original problem this issue complains about is a regression.
I propose to commit this to fix the regression and open a separate issue with the insight you provided. One easy solution could be to just require the encoding to be UTF-8 when passing unicode to the module, and to document it explicitly. Another solution would be to actually fix it in the module itself.
If there is a decision to fix it, the fix should then cover both the C and Python implementations, in all possible places (all functions reading XML from strings will also suffer from the same problem, since they get passed to xmlparse_Parse in pyexpat, which just uses PyArg_ParseTuple with the "s#" format - encoding unicode in utf-8 without looking at the XML encoding itself).
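The "require UTF-8 for unicode input" option could look roughly like the following. This is a hypothetical pre-check on a complete string, not something the patch or the module does; the helper name and the regular expression are assumptions, and it does not address the file-like case where the declaration may be split across reads.

```python
import re

# Matches an XML declaration at the very start of the text and captures
# the value of its encoding pseudo-attribute, if one is present.
_DECL_ENCODING = re.compile(r"<\?xml\s[^>]*?\bencoding\s*=\s*[\"']([^\"']+)[\"']")


def require_utf8_declaration(text):
    """Hypothetical check: reject str input that declares a non-UTF-8 encoding."""
    match = _DECL_ENCODING.match(text)
    if match and match.group(1).lower() not in ("utf-8", "utf8"):
        raise ValueError(
            "unicode input declares encoding %r; only UTF-8 is accepted"
            % match.group(1)
        )
    return text


def is_accepted(text):
    """Return True if the hypothetical check lets the text through."""
    try:
        require_utf8_declaration(text)
    except ValueError:
        return False
    return True


print(is_accepted('<?xml version="1.0"?><site/>'))
print(is_accepted("<?xml version='1.0' encoding='latin1'?><site/>"))
```

Input with no declaration, or with a UTF-8 declaration, passes through unchanged; anything else is rejected before it reaches the parser.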
msg155990	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-03-16 03:55
New changeset 7bdf5c96fdc0 by Eli Bendersky in branch 'default':
Closes Issue #14246: _elementtree parser will now handle io.StringIO
http://hg.python.org/cpython/rev/7bdf5c96fdc0

History
Date	User	Action	Args
2022-04-11 14:57:27	admin	set	github: 58454
2012-03-16 03:56:06	eli.bendersky	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2012-03-16 03:55:15	python-dev	set	nosy: + python-dev messages: + msg155990
2012-03-13 08:24:19	eli.bendersky	set	messages: + msg155579
2012-03-13 07:24:15	scoder	set	messages: + msg155576
2012-03-13 04:32:35	eli.bendersky	set	files: + issue14246.1.patch nosy: + effbot, scoder, flox messages: + msg155567 keywords: + patch stage: needs patch -> patch review
2012-03-11 04:13:28	eli.bendersky	set	messages: + msg155377
2012-03-11 03:59:14	eli.bendersky	set	keywords: + 3.2regression assignee: eli.bendersky stage: needs patch
2012-03-10 22:03:33	Arfrever	set	nosy: + Arfrever
2012-03-10 21:53:27	philthompson10	set	messages: + msg155355
2012-03-10 18:28:00	eli.bendersky	set	messages: + msg155339
2012-03-10 18:22:36	eli.bendersky	set	messages: + msg155336
2012-03-10 13:58:46	loewis	set	nosy: + eli.bendersky
2012-03-10 11:04:01	philthompson10	create
