[Python-Dev] Re: Regression in unicodestr.encode()?

9 Apr 2002 21:13:37 -0400

[Guido]
> I knew all that, but I thought I'd read about a hack to encode NUL
> using c0 80, specifically to get around the limitation on encoded
> strings containing a NUL.

Ah, that violates the "shortest encoding" rule, so is invalid UTF-8. I'm
sure people have done it, though, and that many UTF-8 decoders accept it.
Python's doesn't:

>>> unicode('\xc0\x80', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
>>>

Believe it or not, accepting non-shortest encodings is considered to be "a
security hole"(!). That's a sad story of its own <wink> ...
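
To make the "security hole" remark concrete, here is a minimal sketch in Python 3 (not the Python 2 of the session above). The helper names strict_decode and lenient_decode_2byte are hypothetical, invented for illustration; the lenient one stands in for a decoder that skips the shortest-form check and is not any real library's behavior. It shows that a byte-level check for '/' does not see the overlong form c0 af, while a lenient decoder still turns it into '/'.

```python
# Python 3 sketch. Helper names are hypothetical, for illustration only.

OVERLONG_NUL = b"\xc0\x80"    # two-byte, non-shortest form of U+0000
OVERLONG_SLASH = b"\xc0\xaf"  # two-byte, non-shortest form of U+002F '/'


def strict_decode(data):
    """Decode with the built-in UTF-8 codec; return None if data is invalid."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def lenient_decode_2byte(data):
    """Decode one 2-byte sequence WITHOUT the shortest-form check.

    This imitates a sloppy decoder: it just combines the payload bits
    of the lead byte (110xxxxx) and the trail byte (10xxxxxx).
    """
    lead, trail = data
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


if __name__ == "__main__":
    # The strict codec rejects both overlong forms.
    print(strict_decode(OVERLONG_NUL))
    print(strict_decode(OVERLONG_SLASH))

    # A filter that scans the raw bytes for '/' finds nothing ...
    print(b"/" in OVERLONG_SLASH)

    # ... yet the lenient decoder produces '/' after the filter has run.
    print(repr(lenient_decode_2byte(OVERLONG_SLASH)))
```

The point of the sketch is the ordering problem: if validation happens on the bytes and decoding happens later with a lenient decoder, characters such as '/' or NUL can slip past the check. Rejecting non-shortest forms at decode time closes that gap.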