This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.
Created on 2009-06-07 21:30 by Neil Muller, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| issue6233_py3k.diff | Neil Muller, 2009-06-07 21:47 | Simple patch |
| issue6233_py3k_with_test.diff | Neil Muller, 2009-06-18 13:12 | Combined patch with test case |
| issue6233-escape_entities.diff | jcsalterego, 2009-06-23 05:37 | Excp handling in _encode + prev submitted test |
| issue6233-encode_cdata.diff | jcsalterego, 2009-06-24 23:56 | _encode & effbot's _escape_cdata, w/ test |
| Messages (16) | |||
|---|---|---|---|
| msg89058 - (view) | Author: Neil Muller (Neil Muller) | Date: 2009-06-07 21:30 | |
In py3k, ElementTree no longer correctly converts characters to entities
when they can't be represented in the requested output encoding.
Python 2:
>>> import xml.etree.ElementTree as ET
>>> e = ET.XML("<?xml version='1.0' encoding='iso-8859-1'?><body>t\xe3t</body>")
>>> ET.tostring(e, 'ascii')
"<?xml version='1.0' encoding='ascii'?>\n<body>t&#227;t</body>"
Python 3:
>>> import xml.etree.ElementTree as ET
>>> e = ET.XML("<?xml version='1.0' encoding='iso-8859-1'?><body>t\xe3t</body>")
>>> ET.tostring(e, 'ascii')
.....
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
It looks like _encode_entity isn't ever called inside ElementTree
anymore - it probably should be called as part of _encode for characters
that can't be represented.
|
|||
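The failure reported above comes down to which error handler is used when text is encoded. The following is a minimal sketch using only `str.encode`, not ElementTree's internal code; the variable names are illustrative assumptions.

```python
# 'ã' is U+00E3 (decimal 227), which ASCII cannot represent.
text = "t\xe3t"

# The default "strict" handler raises for unencodable characters.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "xmlcharrefreplace" substitutes a decimal character reference instead.
escaped = text.encode("ascii", "xmlcharrefreplace")

print(strict_failed)  # True
print(escaped)        # b't&#227;t'
```

The second form is the output the Python 2 session produces for the element text.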
| msg89059 - (view) | Author: Neil Muller (Neil Muller) | Date: 2009-06-07 21:47 | |
Simple possible patch uploaded. This doesn't give the expected answer for the test above, but does work when starting from an XML file in utf-8 encoding. I still need to determine why this happens. |
|||
| msg89276 - (view) | Author: Neil Muller (Neil Muller) | Date: 2009-06-12 12:50 | |
> This doesn't give the expected answer for the test above
Which is obviously due to not comparing apples with apples, as I should be using a byte-string in the py3k example.
>>> import xml.etree.ElementTree as ET
>>> e = ET.XML(b"<?xml version='1.0' encoding='iso-8859-1'?><body>t\xe3t</body>")
>>> ET.tostring(e, 'ascii')
Fails without the patch, behaves as expected with the patch.
|
|||
| msg89504 - (view) | Author: Neil Muller (Neil Muller) | Date: 2009-06-18 13:12 | |
Updated patch - adds a test for this. |
|||
| msg89505 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2009-06-18 13:25 | |
This regression is probably annoying enough to make it a blocker. |
|||
| msg89583 - (view) | Author: Fredrik Lundh (effbot) * (Python committer) | Date: 2009-06-21 21:33 | |
Umm. Isn't _encode used to encode tags and attribute names? The charref syntax is only valid in CDATA sections and attribute values, which are encoded by the corresponding _escape functions. I suspect this patch will make things blow up on a non-ASCII tag/attribute name. |
|||
| msg89585 - (view) | Author: Fredrik Lundh (effbot) * (Python committer) | Date: 2009-06-21 21:42 | |
Did you look at the 1.3 alpha code base when you came up with this idea?
Unfortunately, 1.3's _encode is used for a different purpose...
I don't have time to test it tonight, but I suspect that 1.3's
escape_data/escape_attrib functions might work better under 3.X; they do
the text.replace dance first, and then an explicit text.encode(encoding,
"xmlcharrefreplace") at the end. E.g.
def _escape_cdata(text, encoding):
    # escape character data
    try:
        # it's worth avoiding do-nothing calls for strings that are
        # shorter than 500 character, or so. assume that's, by far,
        # the most common case in most applications.
        if "&" in text:
            text = text.replace("&", "&amp;")
        if "<" in text:
            text = text.replace("<", "&lt;")
        if ">" in text:
            text = text.replace(">", "&gt;")
        return text.encode(encoding, "xmlcharrefreplace")
    except (TypeError, AttributeError):
        _raise_serialization_error(text)
|
|||
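The ordering described in the message above (replace markup characters first, encode with "xmlcharrefreplace" last) matters because the encoder itself emits `&`. This is a rough sketch of that point; both helper names are hypothetical and are not part of ElementTree.

```python
def escape_then_encode(text, encoding):
    # Hypothetical helper: escape markup characters on the str,
    # then let the codec emit character references.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text.encode(encoding, "xmlcharrefreplace")


def encode_then_escape(text, encoding):
    # Hypothetical counter-example: escaping after encoding also
    # escapes the "&" of every generated character reference.
    data = text.encode(encoding, "xmlcharrefreplace")
    data = data.replace(b"&", b"&amp;")
    data = data.replace(b"<", b"&lt;")
    return data.replace(b">", b"&gt;")


sample = "a & \xe3 < b"
good = escape_then_encode(sample, "ascii")
bad = encode_then_escape(sample, "ascii")

print(good)  # b'a &amp; &#227; &lt; b'
print(bad)   # b'a &amp; &amp;#227; &lt; b'
```

In the second result the reference has become literal text (`&amp;#227;`), so a parser would no longer recover the original character.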
| msg89623 - (view) | Author: Jerry Chen (jcsalterego) | Date: 2009-06-23 05:37 | |
The attached patch includes Neil's original additions to test_xml_etree.py.
I also noticed that _encode_entity wasn't being called in ElementTree in py3k, with the important bit being the nested function escape_entities(), in conjunction with _escape and _escape_map.
In 2.x, _encode_entity() is used after _encode() throws Unicode exceptions [1], so I figured it would make sense to take the core functionality of _escape_entities() and integrate it into _encode in the same fashion -- when an exception is thrown.
Basically, I:
- changed _escape regexp from using "[\x0080-\uffff]" to "[\x80-xff]"
- extracted _encode_entity.escape_entities() and made it _escape_entities of module scope
- removed _encode_entity()
- added UnicodeEncodeError exception in _encode()
I'm not sure what the expected outcome is supposed to be when the text is not type bytes but str. With this patch, the output has b"tãt" rather than b"tãt".
Hope this is a step in the right direction.
[1] ElementTree.py:814, ElementTree.py:829, python 2.7 HEAD r50941
|
|||
| msg89684 - (view) | Author: Fredrik Lundh (effbot) * (Python committer) | Date: 2009-06-24 21:51 | |
That's backwards, unless I'm missing something here: charrefs represent Unicode characters, not UTF-8 byte values. The character "LATIN SMALL LETTER A WITH TILDE" with the character value 227 should be represented as "&#227;" if serialized to an encoding that doesn't support non-ASCII characters. And there's no need to use RE:s to filter things under 3.X; those parts of ET 1.2 are there for pre-2.0 compatibility. Did you try running the tests with the escape function I posted? |
|||
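The distinction drawn above between a character's code point and its UTF-8 bytes can be checked directly. This is an illustrative sketch using only the standard library; the variable names are assumptions made for the example.

```python
import html

ch = "\xe3"  # LATIN SMALL LETTER A WITH TILDE

code_point = ord(ch)             # the Unicode character value
utf8_bytes = ch.encode("utf-8")  # two bytes: 0xC3 0xA3

# Correct: one reference built from the code point.
char_ref = "&#%d;" % code_point

# Incorrect: one reference per UTF-8 byte.
bytewise = "".join("&#%d;" % b for b in utf8_bytes)

print(code_point, utf8_bytes)  # 227 b'\xc3\xa3'
print(char_ref, bytewise)      # &#227; &#195;&#163;

# Resolving the references shows the difference:
print(html.unescape(char_ref) == ch)  # True
print(html.unescape(bytewise))        # two characters, U+00C3 and U+00A3
```

Building references per byte yields two unrelated characters when the document is parsed, which is why the reference has to carry the value 227.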
| msg89690 - (view) | Author: Jerry Chen (jcsalterego) | Date: 2009-06-24 23:56 | |
Thanks for the explanation -- looks like I was way off base on that one.
I took a look at the code you provided but it doesn't work as a drop-in
replacement for _escape_cdata, since that function returns a string
rather than bytes.
However taking your code, calling it _encode_cdata and then refactoring
all calls _encode(_escape_cdata(x), encoding) to _encode_cdata(x,
encoding) seems to do the trick and passes the tests.
Specific example:
- file.write(_encode(_escape_cdata(node.text), encoding))
+ file.write(_encode_cdata(node.text, encoding))
One minor modification is to return the string as is if encoding=None,
just like _encode:
def _encode_cdata(text, encoding):
    # escape character data
    try:
        text = text.replace("&", "&amp;")
        text = text.replace("<", "&lt;")
        text = text.replace(">", "&gt;")
        if encoding:
            return text.encode(encoding, "xmlcharrefreplace")
        else:
            return text
    except (TypeError, AttributeError):
        _raise_serialization_error(text)
|
|||
| msg89715 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009-06-25 21:26 | |
effbot, do you have an opinion about the latest patch? It'd be nice to not have to delay the release for this. |
|||
| msg89718 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2009-06-26 01:12 | |
I disagree with this report being classified as release-critical - it is *not* a regression over 3.0 (i.e. 3.0 already behaved in the same way). That it is a regression relative to 2.x should not make it release-critical - we can still fix such regressions in 3.2.
In addition, there is an easy work-around for applications that run into the problem - just use utf-8 as the output encoding always:
py> e = ET.XML(b"<?xml version='1.0' encoding='iso-8859-1'?><body>t\xe3t</body>")
py> ET.tostring(e,encoding='utf-8')
b'<body>t\xc3\xa3t</body>'
|
|||
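The utf-8 work-around can be tried as a small script. This is a sketch that assumes a Python 3 version already containing the fix for this issue, so the ASCII case is shown for comparison as well; the checks use substring tests because whether an XML declaration is prepended depends on the Python version and the encoding name.

```python
import xml.etree.ElementTree as ET

source = b"<?xml version='1.0' encoding='iso-8859-1'?><body>t\xe3t</body>"
elem = ET.XML(source)

# Work-around: UTF-8 can represent every character directly.
utf8_out = ET.tostring(elem, encoding="utf-8")

# With the fix in place, ASCII output uses a character reference.
ascii_out = ET.tostring(elem, encoding="ascii")

print(b"t\xc3\xa3t" in utf8_out)  # raw UTF-8 bytes for 'ã'
print(b"t&#227;t" in ascii_out)   # character reference for 'ã'

# Parsing the UTF-8 output gives back the original text.
round_trip = ET.fromstring(utf8_out).text
print(round_trip == "t\xe3t")
```

Either output represents the same document; the work-around only avoids the need for character references.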
| msg89722 - (view) | Author: Raymond Hettinger (rhettinger) * (Python committer) | Date: 2009-06-26 03:30 | |
+1 for Py3.1.1 |
|||
| msg89728 - (view) | Author: Jerry Chen (jcsalterego) | Date: 2009-06-26 14:21 | |
Either way, it would be nice to get feedback so we can iterate on the patch or close out this issue already :-) |
|||
| msg95045 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2009-11-08 20:43 | |
The patch looks ok to me. |
|||
| msg99125 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010-02-09 16:56 | |
Committed in r78123 (py3k) and r78124 (3.1). I've also removed _escape_cdata() since it wasn't used anymore. Thanks Jerry for the patch. |
|||
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:49 | admin | set | github: 50482 |
| 2010-02-09 16:56:25 | pitrou | set | status: open -> closed; resolution: fixed; messages: + msg99125; stage: resolved |
| 2009-11-08 20:43:14 | pitrou | set | messages: + msg95045 |
| 2009-06-26 14:21:19 | jcsalterego | set | messages: + msg89728 |
| 2009-06-26 11:13:39 | pitrou | set | priority: release blocker -> critical; versions: + Python 3.2, - Python 3.0 |
| 2009-06-26 03:30:55 | rhettinger | set | nosy: + rhettinger; messages: + msg89722 |
| 2009-06-26 01:12:28 | loewis | set | nosy: + loewis; messages: + msg89718 |
| 2009-06-25 21:26:50 | benjamin.peterson | set | nosy: + benjamin.peterson; messages: + msg89715 |
| 2009-06-24 23:56:04 | jcsalterego | set | files: + issue6233-encode_cdata.diff; messages: + msg89690 |
| 2009-06-24 21:51:25 | effbot | set | messages: + msg89684 |
| 2009-06-23 05:37:25 | jcsalterego | set | files: + issue6233-escape_entities.diff; nosy: + jcsalterego; messages: + msg89623 |
| 2009-06-21 21:42:03 | effbot | set | messages: + msg89585 |
| 2009-06-21 21:33:01 | effbot | set | messages: + msg89583 |
| 2009-06-18 13:25:14 | pitrou | set | priority: release blocker; nosy: + pitrou; messages: + msg89505 |
| 2009-06-18 13:12:25 | Neil Muller | set | files: + issue6233_py3k_with_test.diff; messages: + msg89504 |
| 2009-06-12 12:50:24 | Neil Muller | set | messages: + msg89276 |
| 2009-06-07 21:47:42 | Neil Muller | set | files: + issue6233_py3k.diff; keywords: + patch; messages: + msg89059 |
| 2009-06-07 21:30:58 | Neil Muller | create | |