
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: json and ElementTree parsers misbehave on streams containing more than a single object
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: Frederick.Ross, eli.bendersky, eric.araujo, ezio.melotti, pitrou, r.david.murray, rhettinger
Priority: normal Keywords:

Created on 2012-05-18 17:29 by Frederick.Ross, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (9)
msg161068 - Author: Frederick Ross (Frederick.Ross) Date: 2012-05-18 17:29
When parsing something like '<a>x</a><a>y</a>' with xml.etree.ElementTree, or '{}{}' with json, these parsers throw exceptions instead of reading a single element of the kind they understand off the stream (or raising an error if there is no such element) and leaving the stream in a sane state.
So I should be able to write
import xml.etree.ElementTree as et
import StringIO
s = StringIO.StringIO("<a>x</a><a>y</a>")
elem1 = et.parse(s)
elem2 = et.parse(s)
and have elem1 correspond to "<a>x</a>" and elem2 correspond to "<a>y</a>".
If the parsers refuse to parse partial streams, they should at least not destroy the streams.
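[Editor's note: for reference, here is what actually happens on Python 3, using io.StringIO in place of the Python 2 StringIO module from the snippet above.]

```python
import io
import xml.etree.ElementTree as ET

s = io.StringIO("<a>x</a><a>y</a>")
try:
    ET.parse(s)
except ET.ParseError as exc:
    # expat rejects the second document as trailing junk and reports
    # the position where parsing failed.
    print(exc)  # "junk after document element", plus line and column
```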
msg161599 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-05-25 18:52
I am not sure the parsers should be lenient. One could argue that it's the stream that is broken if it contains non-compliant XML or JSON. Can you say more about the use case?
msg161605 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-25 19:09
ElementTree supports incremental parsing with the iterparse() function; I'm not sure whether it fits your use case:
http://docs.python.org/dev/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
As for the json module, it doesn't have such a facility.
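[Editor's note: one caveat worth recording here is that the json module does expose json.JSONDecoder.raw_decode, which parses a single value out of a string and reports where it stopped. It only works on in-memory text, not on file-like streams, so it does not fully answer the request, but it is the closest building block in the stdlib.]

```python
import json

decoder = json.JSONDecoder()
buf = '{}{"a": 1}'

# raw_decode returns the decoded value plus the index where decoding
# stopped, leaving the rest of the string untouched.
obj1, end = decoder.raw_decode(buf)       # ({}, 2)
obj2, _ = decoder.raw_decode(buf, end)    # ({'a': 1}, 10)
```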
msg161607 - Author: Frederick Ross (Frederick.Ross) Date: 2012-05-25 19:26
Antoine: it's not iterative parsing I need; it's a sequence of XML documents or JSON objects.
Eric: the server I'm retrieving from, for real-time searches, steadily produces a stream of XML or JSON documents (each properly formed on its own) containing new search results. At the moment, however, I have to edit the stream on the fly to wrap an outer tag around it and remove any DTD in inner elements, or I can't use the XML parser at all. No such workaround is possible with the json parser, since it has no iterative parsing mode.
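[Editor's note: the wrapping workaround described above can be sketched as follows. This is an illustrative sketch, not code from the report: it buffers the data first (so it suits recorded data more than a live socket), and the iter_documents name and the synthetic wrap tag are invented for the example.]

```python
import io
import xml.etree.ElementTree as ET

def iter_documents(data, root=b'wrap'):
    # Surround the concatenated documents with a synthetic root element
    # so the whole stream parses as one well-formed document, then yield
    # each direct child of that root as it completes.
    wrapped = io.BytesIO(b'<' + root + b'>' + data + b'</' + root + b'>')
    depth = 0
    for event, elem in ET.iterparse(wrapped, events=('start', 'end')):
        if event == 'start':
            depth += 1
        else:
            depth -= 1
            if depth == 1:  # a direct child of the synthetic root closed
                yield elem

for doc in iter_documents(b'<a>x</a><a>y</a>'):
    print(doc.tag, doc.text)  # prints "a x", then "a y"
```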
msg161609 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-05-25 19:32
I think it is perfectly reasonable for a parser to leave the file pointer in some undefined further location into the file when it detects "extra stuff" and produces an error message. One can certainly argue that producing that error message is a feature ("detect badly formed documents"). 
I also think that your use case is a perfectly reasonable one, but I think a mode that supports your use case would be an enhancement.
msg161616 - Author: Frederick Ross (Frederick.Ross) Date: 2012-05-25 20:06
In the case of files, this is fine: the error gives me the offset, so I can seek back, pull out the offending part, and buffer it. Plus, XML is strict about having only one document per file.
For streams, none of this is applicable. I can't seek in a streaming network connection. If the parser leaves it in an unusable state, then I lose everything that may follow. It makes Python unusable in certain, not very rare, cases of network programming.
I'll just add that Haskell's Parsec does this right, and should be used as an example.
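[Editor's note: for completeness, the JSON side of this can be handled in user code with json.JSONDecoder.raw_decode plus a manual buffer, which keeps working on a non-seekable connection. A minimal sketch follows; the iter_json name is invented, and it assumes a text-mode stream on which each value eventually arrives in full.]

```python
import io
import json

def iter_json(stream, chunk_size=4096):
    # Accumulate text from the stream and peel complete JSON values off
    # the front of the buffer; an incomplete value just waits for more data.
    decoder = json.JSONDecoder()
    buf = ''
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                break  # need more data to finish this value
            yield obj
            buf = buf[end:]
        if not chunk:  # EOF
            break

stream = io.StringIO('{} {"a": 1} [1, 2]')
print(list(iter_json(stream)))  # [{}, {'a': 1}, [1, 2]]
```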
msg161617 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-05-25 20:12
Well, if the stream isn't seekable then I don't see how it can be left in any state other than the same one it leaves a file (read ahead as much as it read to generate the error). So unfortunately by our backward compatibility rules I still think this will be a new feature.
msg161762 - Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-05-28 09:41
I don't think this is an enhancement to ET, because ET was not designed to be a streaming parser, which is what is required here. ET was designed to read a whole valid XML document. There is 'iterparse', as Antoine mentioned, but it is designed to "track changes to the tree while it is being built" - mostly to save memory.
Python does have streaming XML parsers - for example, xml.sax. You can also relatively easily use xml.sax to find the end of your document and then parse the buffer with ET.
I don't see how a comparison with Parsec (a parser generator/DSL library) makes sense. There are tons of such libraries for Python - just pick one.
msg162060 - Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-06-01 08:40
I propose to close this issue. If the problem in json is real and someone thinks it has to be fixed, a separate issue specifically for json should be opened.
History
Date                 User             Action  Args
2022-04-11 14:57:30  admin            set     github: 59057
2012-06-08 12:31:38  eli.bendersky    set     status: open -> closed
                                              resolution: wont fix
                                              stage: resolved
2012-06-01 08:40:11  eli.bendersky    set     messages: + msg162060
2012-05-28 09:41:30  eli.bendersky    set     messages: + msg161762
2012-05-25 20:12:38  r.david.murray   set     messages: + msg161617
2012-05-25 20:06:38  Frederick.Ross   set     messages: + msg161616
2012-05-25 19:32:34  r.david.murray   set     versions: + Python 3.3, - Python 2.7
                                              nosy: + r.david.murray
                                              messages: + msg161609
                                              type: enhancement
2012-05-25 19:26:38  Frederick.Ross   set     messages: + msg161607
2012-05-25 19:09:34  pitrou           set     messages: + msg161605
2012-05-25 18:52:01  eric.araujo      set     nosy: + pitrou, ezio.melotti, rhettinger, eric.araujo, eli.bendersky
                                              messages: + msg161599
                                              versions: - Python 2.6
2012-05-18 17:29:21  Frederick.Ross   create
