Created on 2008-02-24 13:52 by ygale, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Messages (9) | |||
|---|---|---|---|
| msg62900 | Author: Yitz Gale (ygale) | Date: 2008-02-24 13:52 | |
In the documentation for xml.sax.xmlreader.InputSource objects (section 8.12.4 of the Library Reference) we find that users of InputSource objects should use the following sequence to get their input data: 1. If the InputSource has a character stream, use that. 2. Otherwise, if the InputSource has a byte stream, use that. 3. Otherwise, open a URI connection to the system ID. The parse() method of IncrementalParser skips step 1. In addition, we need to add a method getSourceEncoding() to the XMLReader interface; if non-null, it will indicate to the parser that the input is a byte stream in the given encoding. The documentation should indicate what the parser should do if the XML itself announces that its encoding is something else. I propose that the parser should be required to raise an error in that case. See also #1483. |
|||
| msg62904 | Author: Yitz Gale (ygale) | Date: 2008-02-24 14:09 | |
See also: #1483 and #2175. |
|||
| msg62907 | Author: Yitz Gale (ygale) | Date: 2008-02-24 14:18 | |
Hmm. When getSourceEncoding() is None, there needs to be some way for the parser to distinguish between two cases: it is getting pre-decoded Unicode through a character stream, or it is getting a byte stream with an unspecified encoding. In the latter case, it will have to look in the XML for an encoding declaration (or use UTF-8 by default). Note that expat can handle only the latter case. |
|||
| msg62909 | Author: Yitz Gale (ygale) | Date: 2008-02-24 14:53 | |
So I think there are two possibilities: 1. Use a special value for getSourceEncoding(), like "unicode", to indicate that this is a unicode character stream and not a byte stream. 2. Provide yet another method in the XMLReader interface: sourceIsCharacterStream(), returning a bool. There is a more drastic option: 3. Since expat doesn't support this stuff anyway, and perhaps not too many people have written parsers that do support it, dumb down the InputSource interface. Specifically, deprecate setCharacterStream(), getCharacterStream(), setEncoding() and getEncoding(), none of which are used by expat. Parsers should read the XML from the byte stream and use that to determine the encoding. That may upset some implementors of XML libraries though. They would each have to go to some trouble to provide their own proprietary and possibly incompatible mechanisms for this, if they need it. Perhaps a compromise fourth path would be to have subclasses of InputSource for the two cases of character stream and byte stream. |
|||
| msg62940 | Author: Yitz Gale (ygale) | Date: 2008-02-24 21:16 | |
A subclass of XMLReader would be needed, not of InputSource. |
|||
| msg64644 | Author: Fred Drake (fdrake) (Python committer) | Date: 2008-03-28 18:42 | |
It's certainly arguable that the current behavior is a bug, though I suspect it shouldn't be considered major since I've not seen any prior complaints about this. It should be easy to fix the bug you describe by taking the character stream and encoding it before feeding it to the XML parser; Expat can certainly be forced to take a known encoding, ignoring what's in the XML declaration. On the other hand, it's not at all clear that changing this is worthwhile. This API borrows quite literally from the Java SAX APIs; perhaps this separation of the character stream from the byte stream makes sense for some of the Java XML parsers, but I don't know that there are any Python parsers that benefit from that separation. |
|||
| msg239312 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015-03-26 07:29 | |
Issue2175 has a patch that covers all three issues: issue1483, issue2174 and issue2175. I am unsure which parts of the patch are worth applying to maintained releases. |
|||
| msg239939 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015-04-02 18:12 | |
Fixed in issue2175 (in 3.5 only). |
|||
| msg240171 | Author: Fred Drake (fdrake) (Python committer) | Date: 2015-04-06 19:18 | |
Given that this has languished this long, patching historical releases seems pointless. |
|||
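
The lookup order from msg62900 and the re-encoding workaround suggested in msg64644 can be sketched roughly as follows. This is an illustration only, not the patch that was applied in issue2175: `resolve_input`, `as_utf8_byte_source`, `TextCollector` and `parse_text` are hypothetical names, and step 3 assumes the system ID is a URL that `urllib.request.urlopen` can open.

```python
import io
import urllib.request
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import InputSource


def resolve_input(source):
    """Hypothetical helper: choose a stream in the documented order."""
    char_stream = source.getCharacterStream()
    if char_stream is not None:      # 1. a character stream wins
        return "characters", char_stream
    byte_stream = source.getByteStream()
    if byte_stream is not None:      # 2. otherwise use the byte stream
        return "bytes", byte_stream
    # 3. last resort: open the system ID (assumed here to be a URL)
    return "bytes", urllib.request.urlopen(source.getSystemId())


def as_utf8_byte_source(source):
    """Hypothetical helper: re-encode a character stream as UTF-8 bytes."""
    char_stream = source.getCharacterStream()
    if char_stream is None:
        return source                # bytes or system ID: leave untouched
    encoded = InputSource(source.getSystemId())
    encoded.setByteStream(io.BytesIO(char_stream.read().encode("utf-8")))
    # Force the encoding so the parser ignores the XML declaration.
    encoded.setEncoding("utf-8")
    return encoded


class TextCollector(ContentHandler):
    """Collects character data so the result can be inspected."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(source):
    """Parse `source` and return all character data as one string."""
    handler = TextCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(as_utf8_byte_source(source))
    return "".join(handler.chunks)
```

With an `InputSource` that carries both kinds of stream, `resolve_input` returns the character stream, and `parse_text` parses the already-decoded text even when its XML declaration names a different encoding, because the forced `utf-8` encoding takes precedence over the declaration.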
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:31 | admin | set | github: 46427 |
| 2015-04-06 19:27:13 | Arfrever | set | components: + XML |
| 2015-04-06 19:26:18 | Arfrever | set | stage: resolved resolution: fixed components: + Library (Lib), - Documentation, XML versions: + Python 3.5, - Python 3.1, Python 2.7, Python 3.2 |
| 2015-04-06 19:18:26 | fdrake | set | status: open -> closed messages: + msg240171 |
| 2015-04-02 18:12:48 | serhiy.storchaka | set | messages: + msg239939 |
| 2015-03-26 07:29:04 | serhiy.storchaka | set | nosy: + serhiy.storchaka messages: + msg239312 |
| 2013-01-31 10:02:57 | serhiy.storchaka | set | dependencies: + Expat parser parses strings only when XML encoding is UTF-8 |
| 2010-06-09 21:59:34 | terry.reedy | set | versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6, Python 2.5, Python 3.0 |
| 2008-03-28 18:42:40 | fdrake | set | priority: normal -> low messages: + msg64644 components: - Library (Lib), Unicode |
| 2008-03-20 02:52:31 | jafo | set | priority: normal assignee: fdrake nosy: + fdrake |
| 2008-02-24 21:16:40 | ygale | set | messages: + msg62940 |
| 2008-02-24 14:53:29 | ygale | set | messages: + msg62909 |
| 2008-02-24 14:18:28 | ygale | set | messages: + msg62907 |
| 2008-02-24 14:09:57 | ygale | set | messages: + msg62904 components: + Unicode |
| 2008-02-24 13:52:31 | ygale | create | |