This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: xml.sax.xmlreader does not support the InputSource protocol
Type: behavior Stage: resolved
Components: Library (Lib), XML Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: 17089 Superseder:
Assigned To: fdrake Nosy List: fdrake, serhiy.storchaka, ygale
Priority: low Keywords:

Created on 2008-02-24 13:52 by ygale, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (9)
msg62900 - (view) Author: Yitz Gale (ygale) Date: 2008-02-24 13:52
In the documentation for xml.sax.xmlreader.InputSource objects
(section 8.12.4 of the Library Reference) we find that
users of InputSource objects should use the following
sequence to get their input data:
1. If the InputSource has a character stream, use that.
2. Otherwise, if the InputSource has a byte stream, use that.
3. Otherwise, open a URI connection to the system ID.
The parse() method of IncrementalParser skips step 1.
In addition, we need to add a method
getSourceEncoding() to the XMLReader interface;
if non-null, it will indicate to the parser that
the input is a byte stream in the given encoding.
The documentation should indicate what the parser
should do if the XML itself announces that its
encoding is something else. I propose that the parser should
be required to raise an error in that case.
See also #1483.
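The three-step lookup order quoted above can be sketched roughly as follows. This is only an illustration, not the code in the standard library: the helper name resolve_stream is hypothetical, and step 3 assumes the system ID is a full URL that urllib can open.

```python
import io
import urllib.request
from xml.sax.xmlreader import InputSource


def resolve_stream(source):
    """Return (stream, is_text) using the documented lookup order.

    Hypothetical helper, not part of xml.sax.
    """
    # Step 1: prefer the character stream (already-decoded text).
    char_stream = source.getCharacterStream()
    if char_stream is not None:
        return char_stream, True
    # Step 2: otherwise fall back to the byte stream.
    byte_stream = source.getByteStream()
    if byte_stream is not None:
        return byte_stream, False
    # Step 3: otherwise open the system ID (assumed here to be a URL).
    system_id = source.getSystemId()
    if system_id is None:
        raise ValueError("InputSource has no stream and no system ID")
    return urllib.request.urlopen(system_id), False


# An InputSource carrying both streams: the character stream should win.
source = InputSource()
source.setCharacterStream(io.StringIO("<a/>"))
source.setByteStream(io.BytesIO(b"<b/>"))
stream, is_text = resolve_stream(source)
content = stream.read()
```

Returning the is_text flag alongside the stream is one way to let the caller know whether it still has to deal with an encoding.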
msg62904 - (view) Author: Yitz Gale (ygale) Date: 2008-02-24 14:09
See also: #1483 and #2175.
msg62907 - (view) Author: Yitz Gale (ygale) Date: 2008-02-24 14:18
Hmm. When getSourceEncoding() is None, there needs to be some
way for the parser to distinguish between the cases where it
is getting pre-decoded Unicode through a character stream,
or where it is getting a byte stream with an unspecified
encoding. In the latter case, it will have to look in the
XML for an encoding declaration (or use UTF-8 by default).
Note that expat can only handle the latter case.
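To make the "byte stream with an unspecified encoding" case concrete, here is a deliberately simplified sketch of looking for an encoding declaration and falling back to UTF-8. The function name declared_encoding is hypothetical, and the sketch assumes an ASCII-compatible encoding with no byte order mark; a real parser also has to handle BOMs and UTF-16 input.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data. Simplification: assumes an ASCII-compatible encoding.
_XML_DECL_ENCODING = re.compile(
    rb'''^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)


def declared_encoding(head, default="utf-8"):
    """Return the encoding named in the XML declaration, else the default.

    Hypothetical helper for illustration only.
    """
    match = _XML_DECL_ENCODING.match(head)
    if match is None:
        return default
    return match.group(1).decode("ascii")


with_decl = declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')
without_decl = declared_encoding(b"<a/>")
```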
msg62909 - (view) Author: Yitz Gale (ygale) Date: 2008-02-24 14:53
So I think there are two possibilities:
1. Use a special value for getSourceEncoding(),
like "unicode", to indicate that this is a
unicode character stream and not a byte stream.
2. Provide yet another method in the XMLReader
interface: sourceIsCharacterStream(), returning
a bool.
There is a more drastic option:
3. Since expat doesn't support this stuff
anyway, and perhaps not too many people
have written parsers that do support it,
dumb down the InputSource interface.
Specifically, deprecate setCharacterStream(),
getCharacterStream(), setEncoding() and
getEncoding(), none of which are used by
expat. Parsers should read the XML from
the byte stream and use that to determine
the encoding.
That may upset some implementors of XML
libraries though. They would each have to go
to some trouble to provide their own
proprietary and possibly incompatible
mechanisms for this, if they need it.
Perhaps a compromise fourth path would
be to have subclasses of InputSource for
the two cases of character stream and
byte stream.
msg62940 - (view) Author: Yitz Gale (ygale) Date: 2008-02-24 21:16
A subclass of XMLReader would be needed, not of InputSource.
msg64644 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2008-03-28 18:42
It's certainly arguable that the current behavior is a bug, though I
suspect it shouldn't be considered major since I've not seen any prior
complaints about this.
It should be easy to fix the bug you describe by taking the character
stream and encoding it before feeding it to the XML parser; Expat can
certainly be forced to take a known encoding, ignoring what's in the XML
declaration.
On the other hand, it's not at all clear that changing this is
worthwhile. This API borrows quite literally from the Java SAX APIs;
perhaps this separation of the character stream from the byte stream
makes sense for some of the Java XML parsers, but I don't know that
there are any Python parsers that benefit from that separation.
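The fix suggested in this message (encode the character stream, then force Expat to use that encoding) might look roughly like the sketch below. It is not the patch that was eventually applied; the function name parse_text_stream is hypothetical, and the sketch reads the whole stream at once for brevity.

```python
import io
import xml.parsers.expat


def parse_text_stream(char_stream):
    """Feed already-decoded text to Expat as UTF-8 bytes.

    Hypothetical helper; returns the concatenated character data.
    """
    chunks = []
    # Passing encoding= overrides the implicit or explicit encoding of the
    # document, so the XML declaration's encoding is not used.
    parser = xml.parsers.expat.ParserCreate(encoding="utf-8")
    # Expat may deliver character data in several pieces; collect them all.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(char_stream.read().encode("utf-8"), True)
    return "".join(chunks)


# The declaration claims ISO-8859-1, but the text is already decoded.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><a>caf\u00e9</a>'
result = parse_text_stream(io.StringIO(doc))
```

Because the bytes handed to Expat are known to be UTF-8, the declared encoding in the document is irrelevant here, which is the point of the suggestion.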
msg239312 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-26 07:29
Issue2175 has a patch that covers all three issues: issue1483, issue2174 and issue2175. I am unsure which parts of the patch are worth applying to maintained releases.
msg239939 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-04-02 18:12
Fixed in issue2175 (in 3.5 only).
msg240171 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2015-04-06 19:18
Given that this has languished this long, patching historical releases seems pointless.
History
Date                 User              Action  Args
2022-04-11 14:56:31  admin             set     github: 46427
2015-04-06 19:27:13  Arfrever          set     components: + XML
2015-04-06 19:26:18  Arfrever          set     stage: resolved
                                               resolution: fixed
                                               components: + Library (Lib), - Documentation, XML
                                               versions: + Python 3.5, - Python 3.1, Python 2.7, Python 3.2
2015-04-06 19:18:26  fdrake            set     status: open -> closed
                                               messages: + msg240171
2015-04-02 18:12:48  serhiy.storchaka  set     messages: + msg239939
2015-03-26 07:29:04  serhiy.storchaka  set     nosy: + serhiy.storchaka
                                               messages: + msg239312
2013-01-31 10:02:57  serhiy.storchaka  set     dependencies: + Expat parser parses strings only when XML encoding is UTF-8
2010-06-09 21:59:34  terry.reedy       set     versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6, Python 2.5, Python 3.0
2008-03-28 18:42:40  fdrake            set     priority: normal -> low
                                               messages: + msg64644
                                               components: - Library (Lib), Unicode
2008-03-20 02:52:31  jafo              set     priority: normal
                                               assignee: fdrake
                                               nosy: + fdrake
2008-02-24 21:16:40  ygale             set     messages: + msg62940
2008-02-24 14:53:29  ygale             set     messages: + msg62909
2008-02-24 14:18:28  ygale             set     messages: + msg62907
2008-02-24 14:09:57  ygale             set     messages: + msg62904
                                               components: + Unicode
2008-02-24 13:52:31  ygale             create
