Issue 1767933: Badly formed XML using etree and utf-16


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/45279

classification

Title:	Badly formed XML using etree and utf-16
Type:	behavior	Stage:	resolved
Components:	XML	Versions:	Python 3.2, Python 3.3, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:	6472	Superseder:
Assigned To:	effbot	Nosy List:	BreamoreBoy, Richard.Urwin, amaury.forgeotdarc, bugok, effbot, eli.bendersky, flox, nnorwitz, python-dev, rurwin, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2007-08-05 15:01 by bugok, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
patch.txt	rurwin, 2008-11-14 17:32	patch to xml/etree/ElementTree.py
bug-test.py	Richard.Urwin, 2010-07-26 13:24	demonstrator
etree_write_utf16.patch	serhiy.storchaka, 2012-04-27 20:46	review
etree_write_utf16_2.patch	serhiy.storchaka, 2012-05-20 22:30	review
etree_write_utf16_3.patch	serhiy.storchaka, 2012-07-07 14:32	review
etree_write_utf16_4.patch	serhiy.storchaka, 2012-07-08 11:52	review
etree_write_utf16_5.patch	serhiy.storchaka, 2012-07-13 20:25	review
etree_write_utf16_without_tests-3.2.patch	serhiy.storchaka, 2012-07-18 06:49	review

Messages (34)
msg32587 - (view)	Author: BugoK (bugok)	Date: 2007-08-05 15:01
Hello, The bug occurs when writing an XML file using the UTF-16 encoding. The problem is that etree encodes every string to utf-16 by itself, meaning that it inserts the 0xfffe BOM before every string (tag, text, attribute name, etc.), which produces badly formed utf-16 output. A possible solution, which was offered by a co-worker of mine, is to use a utf-16 writer (from codecs.getwriter('utf-16')) to write the file. Best, BugoK.
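
To illustrate the report above, here is a minimal Python 3 sketch of the difference between encoding each fragment separately and writing through a codecs stream writer. It is not ElementTree's actual code; the fragment list is made up for illustration.

```python
import codecs
import io

# Hypothetical fragments, standing in for the tags and text a serializer emits.
fragments = ["<", "root", ">", "text", "</", "root", ">"]

# Encoding each fragment on its own prepends a BOM to every fragment.
naive = b"".join(f.encode("utf-16") for f in fragments)

# A stream writer keeps state across writes, so the BOM is emitted only once.
buf = io.BytesIO()
writer = codecs.getwriter("utf-16")(buf)
for f in fragments:
    writer.write(f)
streamed = buf.getvalue()

print(naive.count(codecs.BOM_UTF16))     # one BOM per fragment
print(streamed.count(codecs.BOM_UTF16))  # a single BOM at the start
```

The streamed bytes are the same as encoding the whole document in one call, which is what a UTF-16 reader expects.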
msg32588 - (view)	Author: Neal Norwitz (nnorwitz) * (Python committer)	Date: 2007-08-07 05:54
Fredrik, could you take a look at this?
msg32589 - (view)	Author: Fredrik Lundh (effbot) * (Python committer)	Date: 2007-08-07 06:20
ET's standard serializer currently only supports ASCII-compatible encodings. See e.g. http://effbot.python-hosting.com/ticket/47

The best workaround for ET 1.2 (Python 2.5) is probably to serialize as "utf-8" and transcode:

    out = unicode(ET.tostring(elem), "utf-8").encode(...)
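
For readers on Python 3, the same serialize-then-transcode idea can be sketched as follows. This is an adaptation, not the code from the message above: it assumes `ET.tostring(..., encoding="unicode")` (available in Python 3) to obtain a `str`, and the sample element and the hand-written XML declaration are illustrative choices.

```python
import codecs
import xml.etree.ElementTree as ET

# Illustrative element containing a non-ASCII character.
elem = ET.XML("<root><child>caf\u00e9</child></root>")

# Serialize to a str first (no bytes-level encoding, no character references).
text = ET.tostring(elem, encoding="unicode")

# Encode the whole document in a single call so the BOM appears exactly once.
declaration = "<?xml version='1.0' encoding='UTF-16'?>\n"
out = (declaration + text).encode("utf-16")

print(out.startswith(codecs.BOM_UTF16))
```

Because the text is encoded in one step, non-ASCII characters are stored as UTF-16 code units rather than as numeric character references.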
msg75864 - (view)	Author: Richard Urwin (rurwin)	Date: 2008-11-14 15:33
This is a bug in two halves.

1. Not all characters in the file are UTF-16. The initial xml header isn't, and the individual < > etc. characters are not. This is just a matter of extending the methodology to encode all characters and not just the textual bits. There is no work-around except a five-minute hack of the ElementTree.write() method.

2. Every write has a BOM, so corrupting the file in a manner analogous to bug 555360. This is a result of using string.encode() and is a well-known feature. It can be worked around by using UTF-16LE or UTF-16BE, which do not prepend a BOM, but then the file doesn't have any BOM.

A complete solution would be to rewrite ElementTree.write() to use a different encoding methodology such as StreamWriter.

I have made the above hack and work-around for my own use, and I can report that it produces perfect UTF-16.
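
The second point can be sketched in a few lines of Python 3: the endian-specific codecs never emit a BOM, so one can be prepended by hand exactly once. The `pieces` list is a made-up stand-in for serializer output, and this is only an illustration of the workaround, not the attached patch.

```python
import codecs

# Hypothetical pieces of a document, encoded one at a time.
pieces = ["<?xml version='1.0' encoding='UTF-16'?>\n", "<html>", "</html>"]

# "utf-16-le" does not prepend a BOM, so per-piece encoding is safe.
body = b"".join(p.encode("utf-16-le") for p in pieces)

# Add the little-endian BOM once, at the very start of the file.
data = codecs.BOM_UTF16_LE + body

print(data.decode("utf-16") == "".join(pieces))
```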
msg75866 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)	Date: 2008-11-14 15:48
Would you provide a patch?
msg75875 - (view)	Author: Richard Urwin (rurwin)	Date: 2008-11-14 17:32
Here is a patch of my quick hack, more for interest than any suggestion it gets used, although it does produce good output so long as you avoid the BOM.

The full solution is beyond my (very weak) Python skills. The character encoding is tied in with XML character substitution (& etc. and hexadecimal representation of multibyte characters). I could disentangle it, but I probably wouldn't produce optimal Python, or indeed anything that wouldn't inspire mirth and/or incredulity.

NB. The workaround suggested by Fredrik Lundh doesn't solve our particular problems, since the downsize to UTF-8 causes the multi-byte characters to be represented in hex. Our software doesn't read those. (I know that's our problem.)
msg99394 - (view)	Author: Florent Xicluna (flox) * (Python committer)	Date: 2010-02-16 11:43
Could you provide a test case, so we can check if the upgrade proposed on #6472 solves this issue?
msg111533 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-25 10:23
@Richard: Could you provide a test case for this, or do you consider it beyond your Python capabilities allowing for your comments on msg75875?
msg111608 - (view)	Author: Richard Urwin (Richard.Urwin)	Date: 2010-07-26 13:24
I can't produce an automated test, for want of time, but here is a demonstrator.

Grab the example XHTML from http://docs.python.org/library/xml.etree.elementtree.html#elementtree-objects or use some tiny ASCII-encoded xml file. Save it as "file.xml" in the same folder as bug-test.py attached here.

Execute bug-test.xml

file.xml is read and then written in UTF-16. The output file is then read and dumped to stdout as a byte-stream.

1. To be correct UTF-16, the output should start with 255 254, which should never occur in the rest of the file.
2. The rest of the output (including the first line) should alternate zeros with ASCII character codes.
3. The file output.xml should be loadable in a UTF16-capable text editor (e.g. jEdit), be recognised as UTF-16 and be identical in terms of content to file.xml.
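
The first criterion above can be turned into a small automated check. The sketch below is not the attached bug-test.py; the function name is hypothetical, and it relies on the fact that Python's "utf-16" decoder strips only the leading BOM, so any later BOM survives as the character U+FEFF.

```python
import codecs


def has_single_leading_bom(data: bytes) -> bool:
    """Return True if data is UTF-16 with a BOM at the start and nowhere else."""
    if not data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False
    try:
        text = data.decode("utf-16")
    except UnicodeDecodeError:
        return False
    # A BOM repeated inside the stream decodes to U+FEFF.
    return "\ufeff" not in text


good = "<html/>".encode("utf-16")
bad = b"".join(s.encode("utf-16") for s in ["<", "html", "/>"])

print(has_single_leading_bom(good))
print(has_single_leading_bom(bad))
```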
msg111611 - (view)	Author: Richard Urwin (Richard.Urwin)	Date: 2010-07-26 13:27
> Execute bug-test.xml

I meant bug-test.py, of course
msg111631 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-26 15:09
@Florent: is this something you could pick up? I think it's out of my league.
msg111635 - (view)	Author: Richard Urwin (Richard.Urwin)	Date: 2010-07-26 15:31
As an example, here are the first two lines of output when I use Python 2.6.3:

60 63 120 109 108 32 118 101 114 115 105 111 110 61 39 49 46 48 39 32 101 110 99 111 100 105 110 103 61 39 85 84 70 45 49 54 39 63 62 10
60 255 254 104 0 116 0 109 0 108 0 62 255 254 10

Note: No 255 254 at the start of the file, but several within it. No zeros interspersing the first line and the odd one missing thereafter.
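
The byte dump above can be inspected directly. This short Python 3 sketch only re-reads the numbers quoted in the message; it shows that the first line is plain ASCII and that the BOM bytes appear inside the second line rather than at the start.

```python
# Byte values copied from the message above.
dump = (
    "60 63 120 109 108 32 118 101 114 115 105 111 110 61 39 49 46 48 39 32 "
    "101 110 99 111 100 105 110 103 61 39 85 84 70 45 49 54 39 63 62 10 "
    "60 255 254 104 0 116 0 109 0 108 0 62 255 254 10"
)
data = bytes(int(n) for n in dump.split())

bom = b"\xff\xfe"  # 255 254, the little-endian UTF-16 BOM

# The first line runs up to and including the first newline byte (10).
header_end = data.index(b"\n") + 1
header = data[:header_end].decode("ascii")
rest = data[header_end:]

print(repr(header))          # the XML declaration, one byte per character
print(data.startswith(bom))  # whether the file begins with a BOM
print(rest.count(bom))       # how many BOMs appear in the second line
```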
msg117864 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)	Date: 2010-10-02 09:56
Python 3.1 improves the situation, the file looks more like utf-16, except that the BOM ("\xff\xfe") is repeated all the time, probably on every internal call to file.write().

Here is a test script that should work on both 2.7 and 3.1.

    from io import BytesIO
    from xml.etree.ElementTree import ElementTree

    content = "<?xml version='1.0' encoding='UTF-16'?><html></html>"
    input = BytesIO(content.encode('utf-16'))
    tree = ElementTree()
    tree.parse(input)
    # Write content
    output = BytesIO()
    tree.write(output, encoding="utf-16")
    assert output.getvalue().decode('utf-16') == content
msg159491 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-04-27 20:46
Here is a patch which solves the problem of writing ElementTree with utf-16 or utf-32 encoding.
msg161046 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-05-18 11:12
Can anyone review the patch?
msg161222 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)	Date: 2012-05-20 18:27
The patch needs some tests. Also, it seems that ElementTree.write() will only accept files inheriting from io.IOBase, where only a .write() method was expected before. Is that the case?
msg161236 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-05-20 22:30
Here is an updated patch, with tests and support for objects with only a 'write' method.
msg163739 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-06-24 07:20
It would be nice to fix this bug before forking of the 3.3.0b1 release clone.
msg163756 - (view)	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-06-24 09:43
I will try to find time to review it before the fork, but since time is tight I don't promise. That said, this patch falls more into the bugfix category than a new feature, so I think it will be OK after beta as well.
msg164713 - (view)	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-07-06 03:09
Serhiy, note that _SimpleElementPath is now gone in 3.3, since ElementPath.py is always there in stdlib. Could you update the patch to reflect this?

Another thing. I'm trying really hard to phase out the doctest tests of etree, replacing them with unittest-based tests as much as possible. The doctests are causing all kinds of trouble with parametrized testing for both the Python and the C implementations. Please don't add new doctests. If you add tests, add them to existing TestCase classes, or create new ones.
msg164858 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-07-07 14:32
> Serhiy, note that _SimpleElementPath is now gone in 3.3, since ElementPath.py is always there in stdlib. Could you update the patch to reflect this?

Don't worry, _SimpleElementPath is not used in the changes.

> Another thing. I'm trying really hard to phase out the doctest tests of etree, replacing them with unittest-based tests as much as possible. The doctests are causing all kinds of trouble with parametrized testing for both the Python and the C implementations. Please don't add new doctests. If you add tests, add them to existing TestCase classes, or create new ones.

Done. I replaced the encoding doctest with unittest-based tests and merged it with StringIOTest and the user IO tests into one IOTest class. Added a test for StringIO writing.

I've also improved support for unbuffered file objects (as for issue1470548).
msg164918 - (view)	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-07-07 18:20
Thanks for your work on this, Serhiy. I made some comments in the code-review tool, mainly about the complexity of the resulting code. Great work on switching the tests to unittest, much appreciated.
msg165003 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-07-08 11:23
Here is a patch using context management (as Eli advised). This makes error handling much safer and probably makes the code a little easier. Several new tests are added.
msg165350 - (view)	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-07-13 03:41
Thanks, this looks much better. I've reviewed the _4 patch with some minor comments.
msg165366 - (view)	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-07-13 07:22
Serhiy, can you also take a look at #9458 - it may be related?
msg165415 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-07-13 20:25
Patch updated with some comments.
msg165492 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-07-15 03:02
New changeset 6120cf695574 by Eli Bendersky in branch 'default':
Close #1767933: Badly formed XML using etree and utf-16. Patch by Serhiy Storchaka, with some minor fixes by me
http://hg.python.org/cpython/rev/6120cf695574
msg165508 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-07-15 07:02
Thank you, Eli. However, the changes to tostring() and tostringlist() break the invariant b"".join(tostringlist(element, 'utf-16')) == tostring(element, 'utf-16'). You should add the following methods to DataStream:

    def seekable(self):
        return True

    def tell(self):
        return len(data)

Note that the monkey-patched version is faster.

    stream = io.BufferedIOBase()
    stream.writable = lambda: True
    stream.write = data.append
    stream.seekable = lambda: True
    stream.tell = data.__len__

Benchmark results:

tostring() with BytesIO:
$ ./python -m timeit -s "import xml.etree.ElementTree as ET; e=ET.XML('<root/>')" "ET.tostring(e, 'utf-16')"
1000 loops, best of 3: 268 usec per loop
$ ./python -m timeit -s "import xml.etree.ElementTree as ET; e=ET.XML('<root>'+'<child/>'*100+'</root>' )" "ET.tostring(e, 'utf-16')"
100 loops, best of 3: 4.63 msec per loop

tostring() with monkey-patching:
$ ./python -m timeit -s "import xml.etree.ElementTree as ET; e=ET.XML('<root/>')" "ET.tostring(e, 'utf-16')"
1000 loops, best of 3: 263 usec per loop
$ ./python -m timeit -s "import xml.etree.ElementTree as ET; e=ET.XML('<root>'+'<child/>'*100+'</root>' )" "ET.tostring(e, 'utf-16')"
100 loops, best of 3: 3.84 msec per loop

tostringlist() with DataStream class:
$ ./python -m timeit -s "import xml.etree.ElementTree as ET; e=ET.XML('<root/>')" "ET.tostringlist(e, 'utf-16')"
1000 loops, best of 3: 624 usec per loop
$ ./python -m timeit -s "import xml.etree.ElementTree as ET; e=ET.XML('<root>'+'<child/>'*100+'</root>' )" "ET.tostringlist(e, 'utf-16')"
100 loops, best of 3: 4.09 msec per loop

tostringlist() with monkey-patching:
$ ./python -m timeit -s "import xml.etree.ElementTree as ET; e=ET.XML('<root/>')" "ET.tostringlist(e, 'utf-16')"
1000 loops, best of 3: 259 usec per loop
$ ./python -m timeit -s "import xml.etree.ElementTree as ET; e=ET.XML('<root>'+'<child/>'*100+'</root>' )" "ET.tostringlist(e, 'utf-16')"
100 loops, best of 3: 3.81 msec per loop
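
The invariant mentioned above can be checked with a list-backed stream that reports itself as seekable and implements tell(), as the message suggests. The sketch below targets Python 3.3 or later; the class name `ListStream` is hypothetical and this is not the DataStream class from the committed patch.

```python
import codecs
import io
import xml.etree.ElementTree as ET


class ListStream(io.BufferedIOBase):
    """Hypothetical write-only stream that collects written chunks in a list."""

    def __init__(self):
        self.chunks = []
        self._size = 0

    def writable(self):
        return True

    def seekable(self):
        return True

    def write(self, b):
        self.chunks.append(bytes(b))
        self._size += len(b)
        return len(b)

    def tell(self):
        # Position 0 before the first write marks the start of the stream.
        return self._size


elem = ET.XML("<root><child/></root>")
stream = ListStream()
ET.ElementTree(elem).write(stream, encoding="utf-16")
joined = b"".join(stream.chunks)

print(joined == ET.tostring(elem, "utf-16"))
print(b"".join(ET.tostringlist(elem, "utf-16")) == ET.tostring(elem, "utf-16"))
```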
msg165673 - (view)	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-07-17 02:46
Fixed the invariant violation in changeset 64ff90e07d71

I'll review the performance difference separately
msg165675 - (view)	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-07-17 03:34
I posted a message to python-dev about the performance issue
msg165714 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-07-17 12:10
New changeset 51978f89e5ed by Eli Bendersky in branch 'default':
Optimize tostringlist by taking the stream class outside the function. It's now 2x faster on short calls. Related to #1767933
http://hg.python.org/cpython/rev/51978f89e5ed
msg165723 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-07-17 14:33
How about porting this to 3.2? The main difficulty I see is with the tests, which differ significantly between 3.2 and 3.3.
msg165740 - (view)	Author: Eli Bendersky (eli.bendersky) * (Python committer)	Date: 2012-07-18 04:59
Frankly, I don't think the problem is serious enough to warrant a backport to 3.2, given that 3.3 is going to be out in a few weeks. The issue was open for 5 years without anyone seriously complaining :)
msg165746 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2012-07-18 06:49
Python 3.2 is currently shipped in the latest Ubuntu LTS and will be in production for at least the next 5 years. I think it will be the main Python version for many years.

Here is a compound patch (changesets 6120cf695574, 64ff90e07d71, 51978f89e5ed and 63ba0c32b81a) for Python 3.2 without tests. It is almost the same as for 3.3, except for using manual finalization instead of ExitStack.
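
The difference between the two finalization styles mentioned above can be sketched as follows. This is an illustration only, not code taken from either patch; the function names are hypothetical. Both versions wrap a binary stream in an io.TextIOWrapper and detach the wrapper afterwards so the caller's stream stays open.

```python
import contextlib
import io


def write_with_exitstack(raw, text, encoding="utf-16"):
    # Python 3.3+: register the cleanup on an ExitStack.
    with contextlib.ExitStack() as stack:
        wrapper = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
        stack.callback(wrapper.detach)
        wrapper.write(text)


def write_with_finally(raw, text, encoding="utf-16"):
    # Manual finalization, usable where ExitStack is not available.
    wrapper = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    try:
        wrapper.write(text)
    finally:
        wrapper.detach()


first, second = io.BytesIO(), io.BytesIO()
write_with_exitstack(first, "<a/>")
write_with_finally(second, "<a/>")

print(first.getvalue() == second.getvalue())
print(first.closed, second.closed)
```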

History
Date	User	Action	Args
2022-04-11 14:56:25	admin	set	github: 45279
2012-07-18 06:49:34	serhiy.storchaka	set	files: + etree_write_utf16_without_tests-3.2.patch messages: + msg165746
2012-07-18 04:59:32	eli.bendersky	set	messages: + msg165740
2012-07-17 14:33:57	serhiy.storchaka	set	messages: + msg165723
2012-07-17 12:10:19	python-dev	set	messages: + msg165714
2012-07-17 03:34:48	eli.bendersky	set	messages: + msg165675
2012-07-17 02:46:03	eli.bendersky	set	messages: + msg165673
2012-07-15 07:02:39	serhiy.storchaka	set	messages: + msg165508
2012-07-15 03:02:42	python-dev	set	status: open -> closed nosy: + python-dev messages: + msg165492 resolution: fixed stage: needs patch -> resolved
2012-07-13 20:25:33	serhiy.storchaka	set	files: + etree_write_utf16_5.patch messages: + msg165415
2012-07-13 07:22:17	eli.bendersky	set	messages: + msg165366
2012-07-13 03:41:14	eli.bendersky	set	messages: + msg165350
2012-07-08 11:52:41	serhiy.storchaka	set	files: + etree_write_utf16_4.patch
2012-07-08 11:23:38	serhiy.storchaka	set	messages: + msg165003
2012-07-07 18:20:49	eli.bendersky	set	messages: + msg164918
2012-07-07 14:32:24	serhiy.storchaka	set	files: + etree_write_utf16_3.patch messages: + msg164858
2012-07-06 03:09:28	eli.bendersky	set	messages: + msg164713
2012-06-24 09:43:45	eli.bendersky	set	messages: + msg163756
2012-06-24 07:20:20	serhiy.storchaka	set	messages: + msg163739
2012-06-17 03:18:42	eli.bendersky	set	nosy: + eli.bendersky
2012-05-20 22:30:13	serhiy.storchaka	set	files: + etree_write_utf16_2.patch messages: + msg161236
2012-05-20 18:27:11	amaury.forgeotdarc	set	messages: + msg161222
2012-05-18 11:12:29	serhiy.storchaka	set	messages: + msg161046
2012-04-27 20:46:31	serhiy.storchaka	set	files: + etree_write_utf16.patch versions: + Python 3.3 nosy: + serhiy.storchaka messages: + msg159491 keywords: + patch
2010-10-02 09:56:28	amaury.forgeotdarc	set	messages: + msg117864 stage: test needed -> needs patch
2010-07-26 15:31:50	Richard.Urwin	set	messages: + msg111635
2010-07-26 15:09:21	BreamoreBoy	set	messages: + msg111631
2010-07-26 13:27:27	Richard.Urwin	set	messages: + msg111611
2010-07-26 13:24:27	Richard.Urwin	set	files: + bug-test.py nosy: + Richard.Urwin messages: + msg111608
2010-07-25 10:23:24	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg111533
2010-02-16 11:43:34	flox	set	dependencies: + Update ElementTree with upstream changes type: behavior versions: + Python 2.7, Python 3.2, - Python 2.6 nosy: + flox messages: + msg99394 stage: test needed
2008-11-14 17:32:16	rurwin	set	files: + patch.txt messages: + msg75875
2008-11-14 15:48:20	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg75866
2008-11-14 15:33:53	rurwin	set	nosy: + rurwin messages: + msg75864 versions: + Python 2.6
2007-08-05 15:01:57	bugok	create
