This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.
Created on 2008-08-18 15:06 by edreamleo, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Messages (12) | |||
|---|---|---|---|
| msg71339 - (view) | Author: Edward K Ream (edreamleo) * | Date: 2008-08-18 15:06 | |
While porting Leo to Python 3.0, I found that passing any byte stream to
xml.sax.parser.parse will hang the parser.
My quick fix was to change:
    while buffer != "":
to:
    while buffer != "" and buffer != b"":
at line 123 of xmlreader.py.
Here is the entire function:
    def parse(self, source):
        from . import saxutils
        source = saxutils.prepare_input_source(source)
        self.prepareParser(source)
        file = source.getByteStream()
        buffer = file.read(self._bufsize)
        ### while buffer != "":
        while buffer != "" and buffer != b"": ### EKR
            self.feed(buffer)
            buffer = file.read(self._bufsize)
        self.close()
For reference, here is the code in Leo that was hanging:
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_external_ges,1)
    handler = saxContentHandler(c,inputFileName,silent,inClipboard)
    parser.setContentHandler(handler)
    parser.parse(theFile)
Looking at the test_expat_file function in test_sax.py, it appears that
the essential difference between the code that hangs and the successful
unit test is that Leo opens the file in 'rb' mode. (code not shown)
It's doubtful that 'rb' mode is correct: from the unit test I deduce that
the default 'r' mode would be better.
Anyway, it would be nice if parser.parse didn't hang on dubious streams.
HTH.
Edward
|
|||
| msg71340 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2008-09-18 15:09 | |
It should probably be changed to just while buffer != b"" since it requests a byte stream. |
|||
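The hang reported above can be reproduced without the parser. The following is a minimal sketch, not the code from xmlreader.py: the helper names and the `max_reads` safety cap are additions for this illustration. It relies on the fact that in Python 3 `b"" != ""` is always true, so a loop that waits for `""` never sees the end of a byte stream.

```python
import io

def count_chunks(stream, sentinel, bufsize=4, max_reads=10):
    """Count chunks read before stream.read() returns `sentinel`.

    max_reads is a safety cap added for this demo, so that a loop which
    would otherwise never terminate stops after a fixed number of reads.
    """
    chunks = 0
    buffer = stream.read(bufsize)
    while buffer != sentinel and chunks < max_reads:
        chunks += 1
        buffer = stream.read(bufsize)
    return chunks

def count_chunks_truthy(stream, bufsize=4):
    """Same loop, but stop on any empty result, whether "" or b""."""
    chunks = 0
    buffer = stream.read(bufsize)
    while buffer:
        chunks += 1
        buffer = stream.read(bufsize)
    return chunks

print(count_chunks(io.BytesIO(b"<doc/>"), b""))   # matching sentinel: 2
print(count_chunks(io.StringIO("<doc/>"), ""))    # matching sentinel: 2
print(count_chunks(io.BytesIO(b"<doc/>"), ""))    # mismatched: hits the cap, 10
print(count_chunks_truthy(io.BytesIO(b"<doc/>"))) # emptiness test: 2
```

An emptiness test such as `while buffer:` is one way to make the loop independent of the stream type; whether that is appropriate for the parser itself depends on what `feed` accepts.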
| msg71341 - (view) | Author: Edward K Ream (edreamleo) * | Date: 2008-08-18 15:39 | |
On Mon, Aug 18, 2008 at 10:09 AM, Benjamin Peterson <report@bugs.python.org> wrote:
> Benjamin Peterson <musiccomposition@gmail.com> added the comment:
>
> It should probably be changed to just while buffer != b"" since it
> requests a byte stream.
That was my guess as well. I added the extra test so as not to remove a
test that might, under some circumstances, be important.
Just to be clear, I am at present totally confused about io streams :-)
Especially as used by the sax parsers. In particular, opening a file in
'r' mode, that is, passing a *non*-byte stream to parser.parse, works,
while opening a file in 'rb' mode, that is, passing a *byte* stream to
parser.parse, hangs.
Anyway, opening the file passed to parser.parse with 'r' mode looks like
the (only) way to go when using Python 3.0. In Python 2.5, opening files
passed to parser.parse in 'rb' mode works. I don't recall whether I had
any reason for 'rb' mode: it may have been an historical accident, or
just a lucky accident :-)
Edward
--------------------------------------------------------------------
Edward K. Ream email: edreamleo@gmail.com
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
|
|||
| msg71345 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2008-08-18 16:00 | |
> Just to be clear, I am at present totally confused about io streams :-)
Python 3.0 distinguishes more clearly between unicode strings (called "str"
in 3.0) and bytes strings (called "bytes" in 3.0). The most important
point being that there is no longer any implicit conversion between the
two: you must explicitly use .encode() or .decode().
Files opened in binary ("rb") mode return byte strings, but files
opened in text ("r") mode return unicode strings, which means you can't
give a text file to 3.0 library expecting a binary file, or vice-versa.
What is more worrying is that XML, until decoded, should be considered a
byte stream, so sax.parser should accept binary files rather than text
files. I took a look at test_sax and indeed it considers XML as text
rather than bytes :-(
Bumping this as critical because it needs a decision very soon (ideally
before beta3).
|
|||
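The str/bytes distinction described in the message above can be checked directly. This is a small sketch under stated assumptions: the file name `sample.xml` and its contents are made up for the illustration, and the text-mode open passes an explicit `encoding` so the result does not depend on the platform default.

```python
import os
import tempfile

xml_text = '<?xml version="1.0" encoding="utf-8"?><doc>caf\u00e9</doc>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")  # hypothetical file for the demo
    with open(path, "wb") as f:
        f.write(xml_text.encode("utf-8"))   # explicit str -> bytes

    with open(path, "rb") as f:
        raw = f.read()                      # binary mode returns bytes
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()                     # text mode returns str

print(type(raw).__name__, type(text).__name__)

# No implicit conversion: mixing the two types raises TypeError.
try:
    raw + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The explicit conversion is what makes the two views agree.
round_trip_ok = raw.decode("utf-8") == text
```

A library that expects one kind of stream therefore has to check which kind it was given, or convert explicitly, rather than rely on the two comparing equal.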
| msg71361 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2008-08-18 18:51 | |
From the discussion on the python-3000 list, it looks like it would be nice if sax.parser handled both bytes and unicode streams. Edward, does your simple fix make sax.parser work entirely well with byte streams? |
|||
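For readers on a current Python 3 release, the sketch below shows what "handling both bytes and unicode streams" looks like from the caller's side. It is not the fix discussed in this issue and says nothing about how the 3.0 betas behaved; it assumes a recent CPython, and `TextCollector` and `collect_text` are names invented for the example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class TextCollector(ContentHandler):
    """Collect character data; chunks may arrive in several pieces."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        self.parts.append(content)

def collect_text(stream):
    """Parse a file-like object (bytes or text) and return its text content."""
    handler = TextCollector()
    xml.sax.parse(stream, handler)
    return "".join(handler.parts)

# A byte stream: the XML declaration tells the parser how to decode it.
byte_source = io.BytesIO(
    '<?xml version="1.0" encoding="utf-8"?><doc>caf\u00e9</doc>'.encode("utf-8")
)
# A text stream: already decoded, so no encoding declaration is given.
text_source = io.StringIO("<doc>caf\u00e9</doc>")

from_bytes = collect_text(byte_source)
from_text = collect_text(text_source)
print(from_bytes == from_text)
```

Passing the raw bytes lets the parser honor the encoding declared in the document, which is the behavior argued for earlier in this thread.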
| msg71373 - (view) | Author: Edward K Ream (edreamleo) * | Date: 2008-08-18 20:16 | |
On Mon, Aug 18, 2008 at 1:51 PM, Antoine Pitrou <report@bugs.python.org> wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> From the discussion on the python-3000, it looks like it would be nice
> if sax.parser handled both bytes and unicode streams.
>
> Edward, does your simple fix make sax.parser work entirely well with
> byte streams?
No. The sax.parser seems to have other problems. Here is what I *think* I know ;-)
1. A smallish .leo file (an xml file) containing a single non-ascii (utf-8) encoded character appears to have been read correctly with Python 3.0.
2. A larger .leo file fails as follows (it's possible that the duplicate error messages are a Leo problem):
Traceback (most recent call last):
Traceback (most recent call last):
  File "C:\leo.repo\leo-30\leo\core\leoFileCommands.py", line 1283, in parse_leo_file
    parser.parse(theFile) # expat does not support parseString
  File "C:\leo.repo\leo-30\leo\core\leoFileCommands.py", line 1283, in parse_leo_file
    parser.parse(theFile) # expat does not support parseString
  File "c:\python30\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "c:\python30\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "c:\python30\lib\xml\sax\xmlreader.py", line 121, in parse
    buffer = file.read(self._bufsize)
  File "c:\python30\lib\xml\sax\xmlreader.py", line 121, in parse
    buffer = file.read(self._bufsize)
  File "C:\Python30\lib\io.py", line 1670, in read
    eof = not self._read_chunk()
  File "C:\Python30\lib\io.py", line 1670, in read
    eof = not self._read_chunk()
  File "C:\Python30\lib\io.py", line 1499, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "C:\Python30\lib\io.py", line 1499, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "C:\Python30\lib\io.py", line 1236, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\Python30\lib\io.py", line 1236, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
  File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 74: character maps to <undefined>
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 74: character maps to <undefined>
The same calls to sax read the file correctly on Python 2.5.
It would be nice to have a message pinpoint the line and character offset of the problem.
My vote would be for the code to work on both kinds of input streams. This would save the users considerable confusion if sax does the (tricky) conversions automatically. Imo, now would be the most convenient time to attempt this--there is a certain freedom in having everything be partially broken :-)
Edward
--------------------------------------------------------------------
Edward K. Ream email: edreamleo@gmail.com
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
|
|||
| msg71375 - (view) | Author: Edward K Ream (edreamleo) * | Date: 2008-08-18 20:21 | |
On Mon, Aug 18, 2008 at 11:00 AM, Antoine Pitrou <report@bugs.python.org> wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> > Just to be clear, I am at present totally confused about io streams :-)
>
> Python 3.0 distinguishes more clearly between unicode strings (called "str"
> in 3.0) and bytes strings (called "bytes" in 3.0). The most important
> point being that there is no longer any implicit conversion between the
> two: you must explicitly use .encode() or .decode().
>
> Files opened in binary ("rb") mode return byte strings, but files
> opened in text ("r") mode return unicode strings, which means you can't
> give a text file to 3.0 library expecting a binary file, or vice-versa.
>
> What is more worrying is that XML, until decoded, should be considered a
> byte stream, so sax.parser should accept binary files rather than text
> files. I took a look at test_sax and indeed it considers XML as text
> rather than bytes :-(
Thanks for these remarks. They confirm what I suspected, but was unsure of, namely that it seems strange to be passing something other than a byte stream to parser.parse.
> Bumping this as critical because it needs a decision very soon (ideally
> before beta3).
Thanks for taking this seriously.
Edward
P.S. I love the new unicode plans. They are going to cause some pain at first for everyone (Python team and developers), but in the long run they are going to be a big plus for Python.
EKR
--------------------------------------------------------------------
Edward K. Ream email: edreamleo@gmail.com
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
|
|||
| msg71381 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2008-08-18 21:15 | |
> The same calls to sax read the file correctly on Python 2.5.
What are those calls exactly?
Why is "cp1252" used as an encoding? Is it what is specified in the XML file?
Or do you somehow feed stdin to the SAX parser?
(if the latter, you aren't testing bytes handling since
stdin/stdout/stderr are text streams in py3k)
|
|||
| msg71382 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2008-08-18 21:24 | |
I guess that the file is simply opened in text mode ("r"). This uses the
"preferred encoding", which is cp1252 on (western) Windows machines.
|
|||
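The UnicodeDecodeError in the earlier traceback is consistent with this explanation. The sketch below uses a made-up one-character payload, not the .leo file from the report: it encodes a character whose UTF-8 form contains byte 0x81 and then decodes it with cp1252, in which 0x81 is unassigned. `locale.getpreferredencoding(False)` reports the encoding a text-mode `open()` without an `encoding` argument would typically use on the current machine.

```python
import locale

# U+00C1 encodes to the two UTF-8 bytes 0xC3 0x81.
data = "\u00c1".encode("utf-8")

# Platform-dependent: often cp1252 on western Windows, UTF-8 elsewhere.
default_encoding = locale.getpreferredencoding(False)
print("preferred encoding:", default_encoding)

try:
    data.decode("cp1252")
    cp1252_failed = False
    failing_byte = None
except UnicodeDecodeError as exc:
    cp1252_failed = True
    failing_byte = data[exc.start]  # the byte the codec could not map

# Decoding with the encoding the data was actually written in succeeds.
decoded = data.decode("utf-8")
```

Opening the file in binary mode, or in text mode with an explicit `encoding="utf-8"`, avoids depending on the platform default.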
| msg71390 - (view) | Author: Edward K Ream (edreamleo) * | Date: 2008-08-18 22:04 | |
On Mon, Aug 18, 2008 at 4:15 PM, Antoine Pitrou <report@bugs.python.org> wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> > The same calls to sax read the file correctly on Python 2.5.
>
> What are those calls exactly?
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_external_ges,1)
    handler = saxContentHandler(c,inputFileName,silent,inClipboard)
    parser.setContentHandler(handler)
    parser.parse(theFile)
As discussed in http://bugs.python.org/issue3590, theFile is a file opened with 'rb' attributes.
Edward
--------------------------------------------------------------------
Edward K. Ream email: edreamleo@gmail.com
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
|
|||
| msg71391 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2008-08-18 22:07 | |
Ok, then xml.sax looks rather broken. (By the way, can you avoid sending HTML emails? Each time you send one, the bug tracker attaches a file named "unnamed". I've removed all 4 of them now.) |
|||
| msg72422 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2008-09-03 21:42 | |
This is a duplicate of #2501. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:37 | admin | set | github: 47840 |
| 2008-09-03 21:42:40 | benjamin.peterson | set | status: open -> closed resolution: duplicate messages: + msg72422 |
| 2008-08-21 14:58:15 | benjamin.peterson | set | priority: critical -> release blocker |
| 2008-08-18 22:07:04 | pitrou | set | messages: + msg71391 |
| 2008-08-18 22:05:40 | pitrou | set | files: - unnamed |
| 2008-08-18 22:05:37 | pitrou | set | files: - unnamed |
| 2008-08-18 22:05:34 | pitrou | set | files: - unnamed |
| 2008-08-18 22:05:31 | pitrou | set | files: - unnamed |
| 2008-08-18 22:04:00 | edreamleo | set | files: + unnamed messages: + msg71390 |
| 2008-08-18 21:24:56 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc messages: + msg71382 |
| 2008-08-18 21:15:28 | pitrou | set | messages: + msg71381 |
| 2008-08-18 20:21:43 | edreamleo | set | files: + unnamed messages: + msg71375 |
| 2008-08-18 20:16:09 | edreamleo | set | files: + unnamed messages: + msg71373 |
| 2008-08-18 18:51:08 | pitrou | set | messages: + msg71361 |
| 2008-08-18 16:00:57 | pitrou | set | priority: critical nosy: + pitrou messages: + msg71345 title: sax.parser hangs on byte streams -> sax.parser considers XML as text rather than bytes |
| 2008-08-18 15:39:38 | edreamleo | set | files: + unnamed messages: + msg71341 |
| 2008-08-18 15:09:18 | benjamin.peterson | set | nosy: + benjamin.peterson messages: + msg71340 |
| 2008-08-18 15:06:14 | edreamleo | create | |