
This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: readline() of codecs.StreamReader doesn't work for "utf-16le"
Type: Stage:
Components: Unicode Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: lemburg Nosy List: gvanrossum, jhylton, lemburg
Priority: high Keywords:

Created on 2000-11-14 13:37 by anonymous, last changed 2022-04-10 16:03 by admin. This issue is now closed.

Messages (7)
msg2404 - (view) Author: Nobody/Anonymous (nobody) Date: 2000-11-14 13:37
I tried that in
BOTH Python 1.6 and Python 2.0
(operating system: Windows NT)
I wrote :
import codecs
fileName1 = "d:\\sveta\\unicode\\try.txt"
(UTF16LE_encode, UTF16LE_decode,
 UTF16LE_streamreader, UTF16LE_streamwriter) = codecs.lookup('UTF-16LE')
output = UTF16LE_streamwriter( open(fileName1, 'wb') )
output.write(unicode('abc\n'))
output.write(unicode('def\n'))
output.close()
input = UTF16LE_streamreader( open(fileName1, 'rb') )
rl = input.readline()
print rl
input.close()
After I run it I got:
Traceback (most recent call last):
 File "d:\\sveta\\unicode\\unicodecheck.py", line 13, in ?
 rl = input.readline()
 File "D:\Program Files\Python16\lib\codecs.py", line 250, in readline
 return self.decode(line)[0]
UnicodeError: UTF-16 decoding error: truncated data
msg2405 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2000-11-14 14:02
One for Marc-Andre. (Unfortunately he's announced he'll be too busy to look at bugs this year, so if someone else has a smart idea, feel free to butt in!)
This was originally classified as a Windows bug, but it's platform independent (I can reproduce it on Linux as well).
msg2406 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2000-11-14 14:09
A little bit of debugging suggests that the StreamReader.readline() method is naive: it calls the underlying stream's readline() method. Since in the example code the underlying stream is a regular 8-bit file, this will return an odd number of bytes in the example. Because of the little-endian encoding, the file contains these hex bytes: 61 00 62 00 63 00 0a 00 ... (0a being '\n').
I'm not familiar enough with this class to tell whether this is simply inappropriate use of StreamReader, or that this should be fixed. Maybe Marc-Andre can answer at least that question?
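The diagnosis above can be reproduced without touching the file system. The following is a minimal sketch in modern Python 3 (an assumption; the report itself concerns Python 1.6 and 2.0), using io.BytesIO as a stand-in for the 8-bit file:

```python
import io

# Encode the two sample lines the way a UTF-16-LE stream writer would.
data = "abc\ndef\n".encode("utf-16-le")

# A byte-oriented readline() stops right after the first 0x0a byte,
# which is only the first half of the two-byte '\n' code unit.
raw_line = io.BytesIO(data).readline()
print(raw_line.hex(), len(raw_line))

# Decoding the odd-length result fails because the trailing 0x00 is missing.
try:
    raw_line.decode("utf-16-le")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("decode failed:", exc.reason)
```

The byte string handed to the decoder is one byte short of a whole number of 16-bit code units, which matches the "truncated data" error in the traceback.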
msg2407 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2000-11-14 14:43
Some background:
.readline() is implemented in the way it is because all other
techniques would require adding real buffering to the codec (AFAIK,
at least) and this is currently out of scope.
Besides, there is another problem: line breaking in Unicode is much
more difficult to get right than for plain ASCII, since there are a lot
more line break characters to watch out for.
.readline() is currently relying on the underlying stream to do the
line breaking. Since this doesn't know anything about encodings
it will break lines at single bytes. As a result, the input data for the
codec is broken.
To correct the problem, one would have to write a true UTF-16 codec
which implements buffering. This should be doable in Python, e.g. see
how StringIO does it. The codec would then have to read the
input data in chunks of say 1024 bytes (must be even), then
pass the data through the codec and use the .splitlines() method on
the Unicode output. Data which is not on the current line would
have to be buffered until the next call to .read() or .readline().
Unfortunately, this technique will also break .tell(), .truncate() and friends...
it's a mess.
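The buffering technique described above might look roughly like the following sketch. It is written for modern Python 3, and BufferedUTF16Reader is a hypothetical name chosen for illustration, not the patch that was checked in and not part of the codecs module. It also leans on codecs.getincrementaldecoder, which did not exist when this message was written, to carry incomplete code units across chunk boundaries:

```python
import codecs
import io


class BufferedUTF16Reader:
    """Hypothetical helper for illustration; not part of the codecs module."""

    def __init__(self, stream, encoding="utf-16-le", chunk_size=1024):
        if chunk_size % 2:
            raise ValueError("chunk_size must be even")
        self.stream = stream
        self.chunk_size = chunk_size
        # The incremental decoder keeps any incomplete code unit or
        # surrogate pair until the next chunk arrives.
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.pending = ""  # decoded text that has not been returned yet
        self.eof = False

    def _fill(self):
        # Read one more chunk of bytes and append its decoded text.
        chunk = self.stream.read(self.chunk_size)
        self.eof = not chunk
        self.pending += self.decoder.decode(chunk, final=self.eof)

    def readline(self):
        while True:
            lines = self.pending.splitlines(keepends=True)
            # The first line is known to be complete only once text follows
            # it (a trailing '\r' might still be joined by '\n') or at EOF.
            if len(lines) > 1 or self.eof:
                line = lines[0] if lines else ""
                self.pending = self.pending[len(line):]
                return line
            self._fill()


# Usage with a deliberately tiny chunk size to exercise the buffering.
data = "abc\ndef\n".encode("utf-16-le")
reader = BufferedUTF16Reader(io.BytesIO(data), chunk_size=4)
first, second, end = reader.readline(), reader.readline(), reader.readline()
print(repr(first), repr(second), repr(end))
```

As the message warns, the underlying stream's position runs ahead of the text actually returned, so .tell() and similar calls on the stream no longer correspond to the reader's logical position.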
An easy work-around is reading in the file as a whole and then
using .splitlines() to get at the lines.
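A minimal sketch of that work-around, again in modern Python 3 and with io.BytesIO standing in for the file opened in binary mode (both are assumptions made for illustration):

```python
import io

# Stand-in for open(fileName1, 'rb') from the original report.
stream = io.BytesIO("abc\ndef\n".encode("utf-16-le"))

# Decode the complete contents in one step, then split on line
# boundaries in the decoded text rather than in the raw bytes.
text = stream.read().decode("utf-16-le")
lines = text.splitlines()
print(lines)  # ['abc', 'def']
```

This avoids the problem because line breaking happens after decoding, at the cost of holding the whole file in memory.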
msg2408 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2002-03-01 22:37
What should be done to fix this? It sounds like things are 
plain broken. If readline() doesn't work, it should raise 
an exception at the very least.
msg2409 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2002-03-05 16:45
Uhm... it does raise an exception ;-)
It is hard to fix this bug, since Unicode line breaking
is much more elaborate than standard C lib type
line breaking. The only way I see to handle this
properly is by introducing line buffering. However,
this can slow down the codec considerably.
Perhaps we should simply have the .readline()
method raise a NotImplementedError ?!
msg2410 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2002-04-05 12:15
I've checked in a patch which raises a NotImplementedError for 
.readline() on UTF-16, -LE, -BE.
This is not ideal, but more accurate than what was in place
before.
History
Date                 User       Action  Args
2022-04-10 16:03:29  admin      set     github: 33476
2000-11-14 13:37:26  anonymous  create
