Issue 222395: readline() of codecs.StreamReader doesn't work for "utf-16le"



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/33476

classification

Type:	Stage:
Title:	readline() of codecs.StreamReader doesn't work for "utf-16le"
Components:	Unicode	Versions:

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	lemburg	Nosy List:	gvanrossum, jhylton, lemburg
Priority:	high	Keywords:

Created on 2000-11-14 13:37 by anonymous, last changed 2022-04-10 16:03 by admin. This issue is now closed.

Messages (7)
msg2404	Author: Nobody/Anonymous (nobody)	Date: 2000-11-14 13:37
I tried this in BOTH Python 1.6 and Python 2.0 (operating system: Windows NT). I wrote:

    import codecs
    fileName1 = "d:\\sveta\\unicode\\try.txt"
    (UTF16LE_encode, UTF16LE_decode, UTF16LE_streamreader, UTF16LE_streamwriter) = codecs.lookup('UTF-16LE')
    output = UTF16LE_streamwriter( open(fileName1, 'wb') )
    output.write(unicode('abc\n'))
    output.write(unicode('def\n'))
    output.close()
    input = UTF16LE_streamreader( open(fileName1, 'rb') )
    rl = input.readline()
    print rl
    input.close()

After I ran it I got:

    Traceback (most recent call last):
      File "d:\\sveta\\unicode\\unicodecheck.py", line 13, in ?
        rl = input.readline()
      File "D:\Program Files\Python16\lib\codecs.py", line 250, in readline
        return self.decode(line)[0]
    UnicodeError: UTF-16 decoding error: truncated data
msg2405	Author: Guido van Rossum (gvanrossum) * (Python committer)	Date: 2000-11-14 14:02
One for Marc-Andre. (Unfortunately he's announced he'll be too busy to look at bugs this year, so if someone else has a smart idea, feel free to butt in!) This was originally classified as a Windows bug, but it's platform independent (I can reproduce it on Linux as well).
msg2406	Author: Guido van Rossum (gvanrossum) * (Python committer)	Date: 2000-11-14 14:09
A little bit of debugging suggests that the StreamReader.readline() method is naive: it calls the underlying stream's readline() method. Since the underlying stream in the example code is a regular 8-bit file, this returns an odd number of bytes. Because of the little-endian encoding, the file contains these hex bytes: 61 00 62 00 63 00 0a 00 ... (0a being '\n'), so the byte-level readline() stops right after 0a and leaves the trailing 00 behind. I'm not familiar enough with this class to tell whether this is simply inappropriate use of StreamReader, or whether this should be fixed. Maybe Marc-Andre can answer at least that question?
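
The byte-level split described above can be reproduced without touching the disk. The following is a minimal sketch, assuming a modern Python 3 interpreter and using io.BytesIO as a stand-in for the 8-bit file from the report (both are assumptions, not part of the original thread):

```python
import io

# The two lines from the report, encoded as UTF-16-LE (no byte order mark).
data = "abc\ndef\n".encode("utf-16-le")

# A plain binary stream knows nothing about the encoding: its readline()
# stops right after the first 0x0a byte, in the middle of a 2-byte code unit.
raw = io.BytesIO(data)
first = raw.readline()
print(len(first), first)

# Decoding that odd-length slice fails, as in the traceback in the report.
error = None
try:
    first.decode("utf-16-le")
except UnicodeDecodeError as exc:
    error = exc
    print("decode failed:", exc)
```

The slice handed to the decoder is 7 bytes long, so the final 0a has no partner byte and the decoder reports truncated data.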
msg2407	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2000-11-14 14:43
Some background: .readline() is implemented the way it is because all other techniques would require adding real buffering to the codec (AFAIK, at least) and this is currently out of scope. Besides, there is another problem: line breaking in Unicode is much more difficult to get right than for plain ASCII, since there are a lot more line break characters to watch out for.

.readline() currently relies on the underlying stream to do the line breaking. Since the stream doesn't know anything about encodings, it will break lines at single bytes. As a result, the input data for the codec is broken.

To correct the problem, one would have to write a true UTF-16 codec which implements buffering. This should be doable in Python, e.g. see how StringIO does it. The codec would then have to read the input data in chunks of, say, 1024 bytes (must be even), then pass the data through the codec and use the .splitlines() method on the Unicode output. Data which is not on the current line would have to be buffered until the next call to .read() or .readline(). Unfortunately, this technique will also break .tell(), .truncate() and friends... it's a mess.

An easy work-around is reading in the file as a whole and then using .splitlines() to get at the lines.
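
Both ideas from this message (chunked decoding with a buffer, and the whole-file work-around) can be sketched roughly as follows. This is an illustration only, assuming modern Python 3; iter_lines and its parameters are hypothetical names, not part of the codecs module. It also leans on codecs.getincrementaldecoder, which is not mentioned in this thread, to carry a dangling byte across chunk boundaries, so the chunk size does not have to be even here:

```python
import codecs
import io

def iter_lines(stream, encoding="utf-16-le", chunk_size=1024):
    """Yield decoded lines from a binary stream, reading it chunk by chunk.

    Hypothetical helper for illustration; not an API of the codecs module.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""  # decoded text not yet known to be a complete line
    while True:
        chunk = stream.read(chunk_size)
        final = not chunk  # an empty read means end of stream
        pending += decoder.decode(chunk, final)
        lines = pending.splitlines(keepends=True)
        if final:
            yield from lines
            return
        # Hold back the last piece: it may be an unfinished line, or end
        # in '\r' whose matching '\n' only arrives with the next chunk.
        pending = lines.pop() if lines else ""
        yield from lines

data = "abc\ndef\n".encode("utf-16-le")

# Buffered approach, with a deliberately odd chunk size.
buffered = list(iter_lines(io.BytesIO(data), chunk_size=3))

# Work-around from the message: decode everything, then split.
whole = data.decode("utf-16-le").splitlines()
```

As the message warns, a reader built this way consumes more bytes from the stream than it has returned as text, so the underlying stream's position no longer matches what the caller has seen.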
msg2408	Author: Jeremy Hylton (jhylton) (Python triager)	Date: 2002-03-01 22:37
What should be done to fix this? It sounds like things are plain broken. If readline() doesn't work, it should raise an exception at the very least.
msg2409	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2002-03-05 16:45
Uhm... it does raise an exception ;-) It is hard to fix this bug, since Unicode line breaking is much more elaborate than the line breaking done by the standard C library. The only way I see to handle this properly is by introducing line buffering. However, this can slow down the codec considerably. Perhaps we should simply have the .readline() method raise a NotImplementedError?!
msg2410	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2002-04-05 12:15
I've checked in a patch which raises a NotImplementedError for .readline() on UTF-16, -LE, -BE. This is not ideal, but more accurate than what was in place before.

History
Date	User	Action	Args
2022-04-10 16:03:29	admin	set	github: 33476
2000-11-14 13:37:26	anonymous	create
