This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: codecs.open(filename, 'U', 'UTF-16') corrupts text
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 2.7, Python 2.6
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: flox
Nosy List: aclover, christian.heimes, flox, jackjansen, jorend, lemburg
Priority: normal
Keywords: patch

Created on 2003-02-22 19:21 by jorend, last changed 2022-04-10 16:07 by admin. This issue is now closed.

Files
File name | Uploaded | Description
UTest.py | jorend, 2003-02-22 20:01 | Unit test demonstrating bug with codecs.open(filename, 'rU', 'UTF-16')
issue691291_py3k.diff | flox, 2009-12-01 08:01 | Patch against branches/py3k r76622 (test only)
issue691291_v2.diff | flox, 2009-12-30 09:46 | Patch, apply to trunk
Messages (10)
msg53767 - Author: Jason Orendorff (jorend) Date: 2003-02-22 19:21
Tested in Python 2.3a1.
If I write u'Hello\r\nworld\r\n' to a file, then read
it back in 'U' mode, I should get u'Hello\nworld\n'.
However, if I do this using codecs.open() and the
UTF-16 encoding, I get u'Hello\n\nworld\n\n'.
codecs.open() is not 'U'-mode-aware. The underlying
file is opened in universal newline mode, so the byte
'\x0d' is erroneously translated to '\x0a' before the
UTF-16 codec has a chance to decode it.
The attached unit test should show specifically what it
is that I wish would work.
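The corruption described above can be reproduced without a file. This is only a sketch in Python 3 syntax: `translate_newlines` is a hypothetical stand-in for the byte-level translation that universal newline mode applied to the underlying file, not code from the attached UTest.py.

```python
# Assumption: universal newline mode rewrites b'\r\n' and lone b'\r'
# to b'\n' on the raw bytes, before any codec sees them.
def translate_newlines(data: bytes) -> bytes:
    return data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')

original = 'Hello\r\nworld\r\n'
encoded = original.encode('utf-16')  # BOM followed by two bytes per character

# In UTF-16 the bytes 0x0d and 0x0a are never adjacent, so each one is
# treated as a separate line ending and the 0x0d becomes 0x0a.
corrupted = translate_newlines(encoded).decode('utf-16')

# What the reporter expected instead: translation applied to decoded text.
expected = original.replace('\r\n', '\n')

print(repr(corrupted))
print(repr(expected))
```

Each u'\r\n' in the source turns into two u'\n' characters because the translation runs on the encoded bytes rather than on the decoded text.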
msg53768 - Author: Jason Orendorff (jorend) Date: 2003-02-22 21:17
Tested in Python 2.3a2 as well (the bug is still there).
Note that this isn't limited to UTF-16. It will affect any
encoding that uses the byte '\x0d' to mean anything other
than u'\r'. The most common American/European encodings are
safe (ASCII, Latin-1 and friends, and UTF-8).
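A quick way to see which encodings are exposed is to encode text that contains no carriage return and check whether the byte 0x0d shows up anyway. The helper name `has_stray_cr` and the sample character are illustrative assumptions, not part of the report.

```python
def has_stray_cr(text: str, encoding: str) -> bool:
    """Return True if encoding `text` yields a 0x0d byte although the
    text itself contains no carriage return."""
    if '\r' in text:
        raise ValueError('sample text must not contain a carriage return')
    return b'\r' in text.encode(encoding)

# U+010D (LATIN SMALL LETTER C WITH CARON) has 0x0d as its low byte.
sample = '\u010d'

for name in ('utf-16-le', 'utf-32-le', 'utf-8', 'iso8859-2'):
    print(name, has_stray_cr(sample, name))
```

For the multi-byte fixed-width encodings the stray byte appears even in text with no line endings at all, so byte-level newline translation can damage ordinary characters as well.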
msg53769 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-02-26 13:44
I'm turning this into a feature request. codecs.open()
does not support 'U' as file mode.
Assigning to Jack since he introduced the 'U' mode option.
Jack, what can we do about this?
msg53770 - Author: Jack Jansen (jackjansen) * (Python committer) Date: 2003-03-03 12:10
The problem is that codecs.open() forces binary mode on the underlying file object, and this defeats the U mode.
My feeling is that it should be okay to open the underlying file in text mode, thereby enabling the U flag to be passed. Opening the file in text mode would break, however, if one of the following conditions is met:
- there are encodings where 0x0a or 0x0d are valid characters, not end of line.
- there are libc implementations where opening a file in text mode has
more implications than converting \r or \r\n to \n, i.e. if they change
other bytes as well.
Re-assigning to MAL, as he put the binary mode in in the first place. If this was just defensive programming we might try taking it out, if there was a real error case with text mode then codecs.open should probably at least signal an error if universal newline mode is requested.
msg53771 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-03-04 10:12
The proper thing to do would be to read the file content
as Unicode and then use the .splitlines() method on the
resulting data. The latter knows about the various ways
you can do line ending in Unicode, including the Mac, DOS
and Unix variations.
I don't have time for this, so unassigning it again.
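The approach suggested here might look roughly like the following sketch in Python 3 syntax. The bytes are decoded first, and only then is the text split into lines; the helper name `normalize_newlines` is an assumption made for illustration.

```python
def normalize_newlines(text: str) -> str:
    """Split decoded text on line boundaries and rejoin with '\\n'."""
    # str.splitlines() recognizes '\r', '\n' and '\r\n', and also some
    # other separators (for example U+2028), which may be broader than
    # what 'U' mode translated.
    return ''.join(line + '\n' for line in text.splitlines())

raw = 'Hello\r\nworld\r\n'.encode('utf-16')  # bytes as they would sit on disk
decoded = raw.decode('utf-16')               # decode before touching newlines
lines = decoded.splitlines()
normalized = normalize_newlines(decoded)

print(lines)
print(repr(normalized))
```

Because the split happens on decoded characters, the 0x0d byte inside a UTF-16 code unit is never mistaken for a line ending.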
msg59293 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-05 18:00
Check this for 2.6.
msg81182 - Author: And Clover (aclover) * Date: 2009-02-05 01:42
> The problem is that codecs.open() forces binary mode on the underlying
> file object, and this defeats the U mode.
Actually the problem is it doesn't defeat it!
The function is documented to force binary, but it actually only does
"mode = mode + 'b'", which can leave you with a mode of 'rUb'. This mode
should be invalid but in practice the 'U' wins out, and causes the
expected problems for UTF-16 and some East Asian codecs.
Until such time as text/universal mode is supported at the overlying
decoded stream level, I suggest that 'U' should be .replace()d out of
the mode as well as 'b' being added, as the documentation would imply.
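The suggestion can be sketched as a small mode-normalizing helper. `force_binary_mode` is a hypothetical name, and this is not the patch attached to this issue; defaulting to 'r' when only 'U' was given is an extra assumption so that the result stays a valid mode string.

```python
def force_binary_mode(mode: str) -> str:
    """Drop the 'U' flag and make sure the mode requests binary I/O."""
    if 'U' in mode:
        mode = mode.strip().replace('U', '')
        # Assumption: a bare 'U' meant "read", so restore the 'r'.
        if mode[:1] not in ('r', 'w', 'a'):
            mode = 'r' + mode
    if 'b' not in mode:
        mode += 'b'
    return mode

for requested in ('rU', 'U', 'rUb', 'w', 'rb'):
    print(requested, '->', force_binary_mode(requested))
```

With the flag stripped, the underlying file never performs newline translation, which matches the documented behavior of always opening in binary mode.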
msg95849 - Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-01 08:00
Proposed patch following suggestion of And Clover.
Compliant with documentation:
«Files are always opened in binary mode, even if no binary mode was
specified. This is done to avoid data loss due to encodings using 8-bit
values. This means that no automatic conversion of '\n' is done on
reading and writing.»
msg97023 - Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-30 09:46
slight update.
msg100146 - Author: Florent Xicluna (flox) * (Python committer) Date: 2010-02-26 10:43
Fixed on trunk with r78461. The test will be ported to py3k.
History
Date | User | Action | Args
2022-04-10 16:07:02 | admin | set | github: 38031
2010-02-27 11:41:33 | flox | set | status: pending -> closed
2010-02-26 10:43:09 | flox | set | status: open -> pending; messages: + msg100146; assignee: flox; resolution: accepted; stage: patch review -> resolved
2009-12-30 09:48:24 | flox | set | files: - issue691291.diff
2009-12-30 09:46:21 | flox | set | files: + issue691291_v2.diff; versions: + Python 2.7; messages: + msg97023; type: enhancement -> behavior; stage: patch review
2009-12-02 08:14:22 | flox | set | files: - issue691291.diff
2009-12-02 08:14:04 | flox | set | files: + issue691291.diff
2009-12-01 08:01:49 | flox | set | files: + issue691291_py3k.diff
2009-12-01 08:00:30 | flox | set | files: + issue691291.diff; nosy: + flox; messages: + msg95849; keywords: + patch
2009-02-05 01:42:20 | aclover | set | nosy: + aclover; messages: + msg81182
2008-01-05 18:00:24 | christian.heimes | set | nosy: + christian.heimes; messages: + msg59293; components: + Library (Lib), - None; versions: + Python 2.6
2003-02-22 19:21:01 | jorend | create |
