This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2011年12月16日 09:04 by dongying, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| test.py | dongying, 2011年12月16日 09:04 | source code | ||
| expat_unknown_encoding_handler.patch | serhiy.storchaka, 2013年05月21日 03:31 | review | ||
| expat_unknown_encoding_handler_2.patch | serhiy.storchaka, 2013年05月22日 13:52 | review | ||
| expat_et_share_unknown_handler.1.patch | eli.bendersky, 2013年05月23日 12:48 | review | ||
| expat_buffer_overflow-2.7.patch | serhiy.storchaka, 2013年08月04日 10:24 | review | ||
| Messages (25) | |||
|---|---|---|---|
| msg149602 - (view) | Author: Dongying Zhang (dongying) | Date: 2011年12月16日 09:04 | |
I've been trying to parse xml string using python, codes following: #-*- coding: utf-8 -*- import xml.etree.ElementTree as xmlet s = '<?xml version="1.0" encoding="GBK"?><info></info>' xmlet.fromstring(s) Then: $ python2.6 test.py or: $ pypy test.py Traceback message came out like this: Traceback (most recent call last): File "test.py", line 4, in <module> xmlet.fromstring(s) File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 963, in XML parser.feed(text) File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 1245, in feed self._parser.Parse(data, 0) xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30 I've seen this in following cases: Python2.5.6 on Mac OS X Lion Python2.6.5 on Ubuntu 10.04 pypy1.7.0 on Mac OS X Lion But not in these cases: Python2.6.7 on Mac OS X Lion Python2.7.1 on Mac OS X Lion |
|||
| msg149612 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2011年12月16日 11:26 | |
Actually, this fails on 2.6 and 2.7 on wide unicode builds, and passes with narrow unicode builds (on my 64bit Linux box). In pyexpat.c, PyUnknownEncodingHandler accesses 256 characters of a unicode buffer, without checking its length... which happens to be 192 chars long. So buffers overflow, etc. The function has a comment "supports only 8bit encodings"; indeed. Versions 3.2 and 3.3 happen to pass the test, probably by pure luck. Supporting multibytes codecs won't be easy: pyexpat requires to fill an array which specifies the number of bytes needed by each start byte (for example, in utf-8, 0xc3 starts a 2-bytes sequence, 0xef starts a 3-bytes sequence). Our codecs framwork does not provide this information, and some codecs (gb18030 for example) need the second char to determine whether it will need 4 bytes. |
|||
| msg149652 - (view) | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2011年12月17日 01:25 | |
3.2, Win7 (a narrow build) it indeed works and returns <Element 'info' at 0x3389128> |
|||
| msg189660 - (view) | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2013年05月20日 13:17 | |
In 3.3+ there's no distinction between wide and narrow builds. Does anyone know how this should be affected? [I don't know much about unicode and encodings, unfortunately] |
|||
| msg189671 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2013年05月20日 15:35 | |
encoding="GBK" causes a buffer overflow in PyUnknownEncodingHandler, because the result of PyUnicode_Decode() is only 192 characters long. Exact behavior is not defined... |
|||
| msg189685 - (view) | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2013年05月20日 19:53 | |
3.3 shifted the wide-build problem to all builds ;-). I now get File "C:\Python\mypy\tem.py", line 4, in <module> xmlet.fromstring(s) File "C:...33\lib\xml\etree\ElementTree.py", line 1356, in XML parser.feed(text) File "<string>", line None xml.etree.ElementTree.ParseError: unknown encoding: line 1, column 30 I do not understand the 'unknown encoding' bit. Replacing 'GBK' with a truly unknown encoding changes the last line to LookupError: unknown encoding: xyz, so the lookup of 'GBK' succeeded. I get the same two messages if I add a 'b' prefix to make s be bytes, which it logically should be (and was in 2.7). (I presume .fromstring 'encodes' unicode input to bytes with the ascii or latin-1 encoder and then decodes back to unicode according to the announced encoding.) With s so prefixed, s.decode(encoding="GBK") works and returns the original unicode version of s, so Python does know "GBK". And it indeed is in the list of official IANA charset names. I don't know unicode internals to understand Amaury's comment. However, it almost reads to me as if this is a unicode bug, not ET bug. |
|||
| msg189694 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年05月20日 21:18 | |
PyUnknownEncodingHandler gets an encoding name and fills a special XML_Encoding structure which expat parser uses for decoding. Currently PyUnknownEncodingHandler works only with 8-bit encodings and I don't see an efficient method how extent it to handle general multibyte encoding. |
|||
| msg189720 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年05月21日 03:31 | |
I can propose only raise a specialized exception instead of general xml.etree.ElementTree.ParseError. Here is a patch. It also fixes a buffer overread bug mentioned by Amaury. |
|||
| msg189785 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2013年05月21日 20:59 | |
The patch goes in the right direction: consistently reject non-8bit encodings which the current implementation does not support. - please add a unit test - remove usage of PyUnicodeObject and the "Stupid to access directly" comment, they are outdated. Real support for GBK is more work, http://wang.yuxuan.org/blog/?itemid=63 shows how to set a conversion callback. It will difficult to implement in the general case, see my older comment. |
|||
| msg189796 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年05月22日 08:01 | |
For unit tests we first should fix issue16986. |
|||
| msg189809 - (view) | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2013年05月22日 12:27 | |
> For unit tests we first should fix issue16986. I did another round of code review on issue 16986 now. |
|||
| msg189810 - (view) | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2013年05月22日 12:29 | |
Looked at Serhiy's patch here too: LGTM with a unit test :) |
|||
| msg189818 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年05月22日 13:52 | |
Here is an updated patch. PyUnknownEncodingHandler() and expat_unknown_encoding_handler() are synchronized. Added tests. |
|||
| msg189845 - (view) | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2013年05月23日 04:02 | |
Serhiy, would it make sense to share the code somewhere instead of duplicating it? |
|||
| msg189847 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年05月23日 04:48 | |
May be. But it needs more work for building. It is simpler just duplicate a code (the function is small enough) and add comments: /* The code is duplicated as xx() in xxx.c. Keep both functions synchronized. */ |
|||
| msg189865 - (view) | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2013年05月23日 12:48 | |
How about this patch (not tested it too much - just as a proof of concept). We're pretty free in the C API exported by pyexpat through a capsule to _elementtree, so we can also add a default handler there. This API already has some general utilities like ErrorString. ET can then simply call the handler from pyexpat. |
|||
| msg189898 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年05月24日 06:19 | |
LGTM. But why not just inline expat_unknown_encoding_handler()? For 2.7 we perhaps should add SetStartDoctypeDeclHandler and SetEncoding in the exported C API. Shouldn't we update the C API version (for this issue and for issue16986)? |
|||
| msg189956 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2013年05月25日 12:27 | |
New changeset f7b47fb30169 by Eli Bendersky in branch '3.3': Issue #13612: handle unknown encodings without a buffer overflow. http://hg.python.org/cpython/rev/f7b47fb30169 New changeset 47e719b11c46 by Eli Bendersky in branch 'default': Issue #13612: handle unknown encodings without a buffer overflow. http://hg.python.org/cpython/rev/47e719b11c46 |
|||
| msg189957 - (view) | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2013年05月25日 12:32 | |
A few notes: 1. If by C API version you mean PyExpat_CAPI_MAGIC, I'm not sure what difference that makes. It has never been updated and it's also being used only in _elementtree. Since the latter is statically compiled against pyexpat, I don't see a reason to touch it. 2. I wouldn't backport it to 2.7; again, not critical enough. 3. Should we open a separate issue for adding support to GBK and similar encodings "sometime in the future"? This issue kind-of got derailed to fix the bug. |
|||
| msg189960 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年05月25日 13:44 | |
Tests were duplicated. There is no Misc/NEWS entry. I think a buffer overflow is critical enough for backporting. |
|||
| msg189961 - (view) | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2013年05月25日 14:18 | |
Oh, I didn't notice that your patches had duplication in the tests. Fixed now. I'll wait to see what unfolds for the Misc/NEWS discussion on python-committers. |
|||
| msg194061 - (view) | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2013年08月01日 12:48 | |
Serhiy, do you want to backport the buffer overflow fix to 2.7? |
|||
| msg194340 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年08月04日 10:24 | |
Here is a patch for 2.7. |
|||
| msg194364 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2013年08月04日 13:10 | |
New changeset b3efc140d8a6 by Eli Bendersky in branch '2.7': Issue #13612: Fix a buffer overflow in case of a multi-byte encoding. http://hg.python.org/cpython/rev/b3efc140d8a6 |
|||
| msg194365 - (view) | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2013年08月04日 13:11 | |
Thanks, Serhiy. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:24 | admin | set | github: 57821 |
| 2013年08月04日 13:11:32 | eli.bendersky | set | status: open -> closed resolution: fixed messages: + msg194365 stage: patch review -> resolved |
| 2013年08月04日 13:10:41 | python-dev | set | messages: + msg194364 |
| 2013年08月04日 10:24:34 | serhiy.storchaka | set | files:
+ expat_buffer_overflow-2.7.patch messages: + msg194340 |
| 2013年08月01日 12:48:56 | eli.bendersky | set | messages: + msg194061 |
| 2013年05月25日 14:18:49 | eli.bendersky | set | messages: + msg189961 |
| 2013年05月25日 13:44:25 | serhiy.storchaka | set | messages: + msg189960 |
| 2013年05月25日 12:32:56 | eli.bendersky | set | messages: + msg189957 |
| 2013年05月25日 12:27:32 | python-dev | set | nosy:
+ python-dev messages: + msg189956 |
| 2013年05月24日 06:19:08 | serhiy.storchaka | set | messages: + msg189898 |
| 2013年05月23日 12:48:20 | eli.bendersky | set | files:
+ expat_et_share_unknown_handler.1.patch messages: + msg189865 |
| 2013年05月23日 04:48:59 | serhiy.storchaka | set | messages: + msg189847 |
| 2013年05月23日 04:02:53 | eli.bendersky | set | messages: + msg189845 |
| 2013年05月22日 13:57:33 | serhiy.storchaka | set | dependencies: - ElementTree incorrectly parses strings with declared encoding not UTF-8 |
| 2013年05月22日 13:52:31 | serhiy.storchaka | set | files:
+ expat_unknown_encoding_handler_2.patch messages: + msg189818 |
| 2013年05月22日 12:29:22 | eli.bendersky | set | messages: + msg189810 |
| 2013年05月22日 12:27:27 | eli.bendersky | set | messages: + msg189809 |
| 2013年05月22日 08:01:25 | serhiy.storchaka | set | messages: + msg189796 |
| 2013年05月22日 07:59:11 | serhiy.storchaka | set | dependencies: + ElementTree incorrectly parses strings with declared encoding not UTF-8 |
| 2013年05月21日 20:59:41 | amaury.forgeotdarc | set | messages: + msg189785 |
| 2013年05月21日 03:31:10 | serhiy.storchaka | set | files:
+ expat_unknown_encoding_handler.patch components: + Extension Modules, Unicode, XML, - Library (Lib) versions: + Python 3.4 keywords: + patch nosy: + ezio.melotti messages: + msg189720 stage: patch review |
| 2013年05月20日 21:18:51 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages: + msg189694 |
| 2013年05月20日 19:53:03 | terry.reedy | set | messages:
+ msg189685 versions: - Python 3.2 |
| 2013年05月20日 15:35:55 | amaury.forgeotdarc | set | messages: + msg189671 |
| 2013年05月20日 13:17:22 | eli.bendersky | set | messages: + msg189660 |
| 2012年07月21日 15:27:22 | flox | set | nosy:
+ eli.bendersky |
| 2011年12月19日 15:15:16 | pitrou | set | nosy:
+ vstinner, flox versions: + Python 3.2, Python 3.3 |
| 2011年12月17日 01:25:36 | terry.reedy | set | nosy:
+ terry.reedy messages: + msg149652 versions: + Python 2.7, - Python 2.6, 3rd party |
| 2011年12月16日 11:26:44 | amaury.forgeotdarc | set | nosy:
+ amaury.forgeotdarc messages: + msg149612 |
| 2011年12月16日 09:05:08 | dongying | set | type: behavior |
| 2011年12月16日 09:04:34 | dongying | set | versions: + 3rd party |
| 2011年12月16日 09:04:10 | dongying | create | |