Issue 691291: codecs.open(filename, 'U', 'UTF-16') corrupts text


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/38031

classification

Title:	codecs.open(filename, 'U', 'UTF-16') corrupts text
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 2.7, Python 2.6

process

Dependencies:	Superseder:
Status:	closed	Resolution:	accepted
Assigned To:	flox	Nosy List:	aclover, christian.heimes, flox, jackjansen, jorend, lemburg
Priority:	normal	Keywords:	patch

Created on 2003-02-22 19:21 by jorend, last changed 2022-04-10 16:07 by admin. This issue is now closed.

Files
File name	Uploaded	Description
UTest.py	jorend, 2003-02-22 20:01	Unit test demonstrating bug with codecs.open(filename, 'rU', 'UTF-16')
issue691291_py3k.diff	flox, 2009-12-01 08:01	Patch against branches/py3k r76622 (test only)
issue691291_v2.diff	flox, 2009-12-30 09:46	Patch, apply to trunk

Messages (10)
msg53767	Author: Jason Orendorff (jorend)	Date: 2003-02-22 19:21
Tested in Python 2.3a1. If I write u'Hello\r\nworld\r\n' to a file, then read it back in 'U' mode, I should get u'Hello\nworld\n'. However, if I do this using codecs.open() and the UTF-16 encoding, I get u'Hello\n\nworld\n\n'. codecs.open() is not 'U'-mode-aware. The underlying file is opened in universal newline mode, so the byte '\x0d' is erroneously translated to '\x0a' before the UTF-16 codec has a chance to decode it. The attached unit test should show specifically what it is that I wish would work.
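The corruption described above can be reproduced without 'U' mode by applying the newline translation to the encoded bytes by hand. This is a minimal sketch, assuming little-endian UTF-16 without a BOM; the helper name translate_newlines_bytewise is made up for illustration and only stands in for what a universal-newline file layer does to raw bytes.

```python
def translate_newlines_bytewise(raw):
    """Hypothetical stand-in for universal-newline translation on raw bytes."""
    # CRLF pairs collapse to LF first, then any remaining lone CR becomes LF.
    return raw.replace(b'\r\n', b'\n').replace(b'\r', b'\n')


text = u'Hello\r\nworld\r\n'
raw = text.encode('utf-16-le')

# In UTF-16-LE the CR is the byte pair 0d 00 and the LF is 0a 00, so the
# bytes 0d and 0a are never adjacent and every CR is rewritten on its own.
damaged = translate_newlines_bytewise(raw).decode('utf-16-le')
print(repr(damaged))
```

Each CR LF pair in the text comes back as two line feeds, which matches the doubled newlines in the report.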
msg53768	Author: Jason Orendorff (jorend)	Date: 2003-02-22 21:17
Tested in Python 2.3a2 as well (the bug is still there). Note that this isn't limited to UTF-16. It will affect any encoding that uses the byte '\x0d' to mean anything other than u'\r'. The most common American/European encodings are safe (ASCII, Latin-1 and friends, and UTF-8).
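To illustrate which encodings are exposed, the check below looks for the byte 0x0d in the encoded form of a character that is not a carriage return. It is only a sketch: has_stray_cr_byte is a hypothetical helper, and U+010D was picked as the sample because its UTF-16-LE encoding happens to start with 0x0d.

```python
def has_stray_cr_byte(char, encoding):
    """Return True if a non-CR character encodes to bytes containing 0x0d."""
    return char != u'\r' and b'\x0d' in char.encode(encoding)


sample = u'\u010d'  # LATIN SMALL LETTER C WITH CARON

for enc in ('utf-8', 'utf-16-le'):
    print(enc, has_stray_cr_byte(sample, enc))

# Byte-level CR -> LF translation turns the sample into a different character.
mangled = sample.encode('utf-16-le').replace(b'\r', b'\n').decode('utf-16-le')
print(repr(mangled))
```

In UTF-8 the byte 0x0d only ever encodes U+000D, so the same translation cannot change any other character there.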
msg53769	Author: Marc-Andre Lemburg (lemburg) (Python committer)	Date: 2003-02-26 13:44
I'm turning this into a feature request. codecs.open() does not support 'U' as file mode. Assigning to Jack since he introduced the 'U' mode option. Jack, what can we do about this?
msg53770	Author: Jack Jansen (jackjansen) (Python committer)	Date: 2003-03-03 12:10
The problem is that codecs.open() forces binary mode on the underlying file object, and this defeats the U mode. My feeling is that it should be okay to open the underlying file in text mode, thereby enabling the U flag to be passed. Opening the file in text mode would break, however, if one of the following conditions is met:
- there are encodings where 0x0a or 0x0d are valid characters, not end of line.
- there are libc implementations where opening a file in text mode has more implications than converting \r or \r\n to \n, i.e. if they change other bytes as well.
Re-assigning to MAL, as he put the binary mode in in the first place. If this was just defensive programming we might try taking it out, if there was a real error case with text mode then codecs.open should probably at least signal an error if universal newline mode is requested.
msg53771	Author: Marc-Andre Lemburg (lemburg) (Python committer)	Date: 2003-03-04 10:12
The proper thing to do would be to read the file content as Unicode and then use the .splitlines() method on the resulting data. The latter knows about the various ways you can do line ending in Unicode, including the Mac, DOS and Unix variations. I don't have time for this, so unassigning it again.
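The decode-then-split approach suggested here might look like the sketch below. The helper read_with_normalized_newlines is hypothetical and is not part of the codecs module; it reads the file in binary mode, decodes the whole content, and only then splits lines on the decoded text.

```python
import os
import tempfile


def read_with_normalized_newlines(path, encoding):
    """Hypothetical helper: decode first, then split lines on the decoded text."""
    with open(path, 'rb') as f:  # binary mode: no byte-level newline translation
        text = f.read().decode(encoding)
    # splitlines() recognizes \r\n, \r and \n, plus other Unicode line boundaries.
    return u''.join(line + u'\n' for line in text.splitlines())


fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(u'Hello\r\nworld\r\n'.encode('utf-16'))
    result = read_with_normalized_newlines(path, 'utf-16')
finally:
    os.remove(path)

print(repr(result))
```

Two caveats of this sketch: it always terminates the last line with '\n', even when the file did not end with a newline, and splitlines() also splits on boundaries such as form feed and U+2028, which 'U' mode does not treat as line endings.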
msg59293	Author: Christian Heimes (christian.heimes) (Python committer)	Date: 2008-01-05 18:00
Checks this for 2.6
msg81182	Author: And Clover (aclover)	Date: 2009-02-05 01:42
> The problem is that codecs.open() forces binary mode on the underlying file object, and this defeats the U mode.
Actually the problem is it doesn't defeat it! The function is documented to force binary, but it actually only does "mode = mode + 'b'", which can leave you with a mode of 'rUb'. This mode should be invalid but in practice the 'U' wins out, and causes the expected problems for UTF-16 and some East Asian codecs. Until such time as text/universal mode is supported at the overlying decoded stream level, I suggest that 'U' should be .replace()d out of the mode as well as 'b' being added, as the documentation would imply.
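The suggestion above amounts to a small mode-string cleanup before the file is opened. The sketch below shows one way to write it; force_binary_mode is a hypothetical name and this is not the committed patch, only an illustration of removing 'U' and appending 'b'. Defaulting to 'r' when nothing is left after removing 'U' is an assumption of this sketch.

```python
def force_binary_mode(mode):
    """Hypothetical helper: drop 'U' and make sure 'b' is present."""
    mode = mode.replace('U', '')
    # Assumption: a mode that was only 'U' (or 'Ub') means "open for reading".
    if mode[:1] not in ('r', 'w', 'a'):
        mode = 'r' + mode
    if 'b' not in mode:
        mode = mode + 'b'
    return mode


for requested in ('rU', 'U', 'rUb', 'r', 'wb'):
    print(requested, '->', force_binary_mode(requested))
```

With this cleanup a request for 'rU' opens the underlying file as 'rb', so the bytes reach the codec untranslated, as the documentation describes.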
msg95849	Author: Florent Xicluna (flox) (Python committer)	Date: 2009-12-01 08:00
Proposed patch following suggestion of And Clover. Compliant with documentation: «Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.»
msg97023	Author: Florent Xicluna (flox) (Python committer)	Date: 2009-12-30 09:46
slight update.
msg100146	Author: Florent Xicluna (flox) (Python committer)	Date: 2010-02-26 10:43
Fixed on trunk with r78461. The test will be ported to py3k.

History
Date	User	Action	Args
2022-04-10 16:07:02	admin	set	github: 38031
2010-02-27 11:41:33	flox	set	status: pending -> closed
2010-02-26 10:43:09	flox	set	status: open -> pending; messages: + msg100146; assignee: flox; resolution: accepted; stage: patch review -> resolved
2009-12-30 09:48:24	flox	set	files: - issue691291.diff
2009-12-30 09:46:21	flox	set	files: + issue691291_v2.diff; versions: + Python 2.7; messages: + msg97023; type: enhancement -> behavior; stage: patch review
2009-12-02 08:14:22	flox	set	files: - issue691291.diff
2009-12-02 08:14:04	flox	set	files: + issue691291.diff
2009-12-01 08:01:49	flox	set	files: + issue691291_py3k.diff
2009-12-01 08:00:30	flox	set	files: + issue691291.diff; nosy: + flox; messages: + msg95849; keywords: + patch
2009-02-05 01:42:20	aclover	set	nosy: + aclover; messages: + msg81182
2008-01-05 18:00:24	christian.heimes	set	nosy: + christian.heimes; messages: + msg59293; components: + Library (Lib), - None; versions: + Python 2.6
2003-02-22 19:21:01	jorend	create
