Why are some unicode error handlers "encode only"?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Mar 11 10:37:54 EDT 2012


At least two standard error handlers are documented as working for 
encoding only:
xmlcharrefreplace
backslashreplace
See http://docs.python.org/library/codecs.html#codec-base-classes
and http://docs.python.org/py3k/library/codecs.html
Why is this? I don't see why they shouldn't work for decoding as well. 
Consider this example using Python 3.2:
>>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10: illegal multibyte sequence
The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also 
known as MS-KANJI or SHIFT-JIS). Is there some reason why a decoding 
equivalent of backslashreplace shouldn't or can't be supported?
# Desired behaviour; this doesn't actually work in Python 3.2:
b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
=> r'aaa--騷--\xe9\x21--bbb'
and similarly for xmlcharrefreplace.
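
For what it's worth, the behaviour asked for above can be approximated with a 
custom error handler installed through codecs.register_error(). The following 
is only a minimal sketch, assuming Python 3; the handler name 
"decode_backslashreplace" is made up for illustration and is not part of the 
standard library. It uses the ascii codec, which reports one offending byte at 
a time, because the byte range a multibyte codec such as cp932 reports may 
differ between versions.

```python
import codecs

def decode_backslashreplace(exc):
    """Hypothetical handler: turn undecodable bytes into \\xNN escapes."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only handles decoding errors; re-raise anything else.
        raise exc
    bad = exc.object[exc.start:exc.end]        # the offending bytes
    replacement = "".join("\\x%02x" % byte for byte in bad)
    # Return the replacement text and the position at which to resume decoding.
    return replacement, exc.end

# The name is an assumption chosen for this example.
codecs.register_error("decode_backslashreplace", decode_backslashreplace)

text = b"abc\xffdef".decode("ascii", "decode_backslashreplace")
print(text)  # prints abc\xffdef
```

The handler receives the UnicodeDecodeError instance and returns a 
(replacement, resume_position) tuple, so nothing in the callback interface 
itself seems to restrict this style of replacement to encoding.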
-- 
Steven

