[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Wed Jul 28 11:38:16 CEST 2004


Hye-Shik Chang wrote:
> On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald
> <walter at livinglogic.de> wrote:
>> Python's unicode machinery currently has problems when decoding
>> incomplete input.
>>
>> When codecs.StreamReader.read() encounters a decoding error it
>> reads more bytes from the input stream and retries decoding.
>> This is broken for two reasons:
>> 1) The error might be due to a malformed byte sequence in the input,
>>    a problem that can't be fixed by reading more bytes.
>> 2) There may be no more bytes available at this time. Once more
>>    data is available decoding can't continue because bytes from
>>    the input stream have already been read and thrown away.
>> (sio.DecodingInputFilter has the same problems)
>
> StreamReaders and -Writers from CJK codecs are not suffering from
> these problems because they have an internal buffer for keeping state
> and the incomplete bytes of a sequence. In fact, CJK codecs have their
> own implementation of UTF-8 and UTF-16 on top of the multibytecodec
> system. It provides a "working" StreamReader/Writer already. :)

Seems you had the same problems with the builtin stream readers! ;)
BTW, how do you solve the problem that incomplete byte sequences
are retained in the middle of a stream, but should generate errors
at the end?
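
A minimal sketch of the behaviour in question: an incomplete sequence is
buffered in the middle of the stream but reported as an error at the end.
It assumes the incremental decoder interface of current Python 3
(codecs.getincrementaldecoder and its final flag), which is neither the
API from the patch nor the CJK codecs implementation, and the sample
character is arbitrary.

```python
import codecs

# Assumption: Python 3 incremental decoder, not the patched StreamReader.
decoder = codecs.getincrementaldecoder("utf-8")()

data = "ö".encode("utf-8")            # two bytes: lead byte + continuation

# Middle of the stream: the lone lead byte is kept in the decoder's
# internal buffer and no character is returned yet.
first = decoder.decode(data[:1])

# The continuation byte arrives later and completes the character.
second = decoder.decode(data[1:])

# End of the stream: the same dangling lead byte must now be an error.
decoder.reset()
decoder.decode(data[:1])
try:
    decoder.decode(b"", final=True)   # final=True means "no more input"
    truncated_accepted = True
except UnicodeDecodeError:
    truncated_accepted = False
```

The only difference between the two cases is the final flag, so the
caller, not the decoder, decides when the stream has ended.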

>> I've uploaded a patch that fixes these problems to SF:
>> http://www.python.org/sf/998993
>>
>> The patch implements a few additional features:
>> - read() has an additional argument chars that can be used to
>>   specify the number of characters that should be returned.
>> - readline() is supported on all readers derived from
>>   codecs.StreamReader().
>
> I have no comment for these, yet.
>
>> - readline() and readlines() have an additional option
>>   for dropping the u"\n".
>
> +1
>
> I wonder whether we need to add an optional argument for writelines()
> to add newline characters for each line, then.

This would probably be a convenient additional feature,
but of course you could always pass a generator expression
to writelines():

    stream.writelines(line+u"\n" for line in lines)
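
Spelled out as a small runnable sketch, assuming current Python 3 (where
u"\n" is just "\n"); the in-memory io.BytesIO target and the sample lines
are made up for illustration:

```python
import codecs
import io

raw = io.BytesIO()                        # stand-in for a real byte stream
stream = codecs.getwriter("utf-8")(raw)   # a codecs.StreamWriter

lines = ["first line", "zweite Zeile mit ö"]   # hypothetical sample data

# The generator expression appends the newline to each line lazily,
# so writelines() itself needs no extra option.
stream.writelines(line + "\n" for line in lines)

written = raw.getvalue()
```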

Bye,
 Walter Dörwald

