Message 195769 - Python tracker

Author	vstinner
Recipients	a.badger, abadger1999, benjamin.peterson, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, vstinner
Date	2013-08-21.10:38:52
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<CAMpsgwZSk2k41uRgph3y9fF3jc75Wbss+9wz3wqa2ADHRHoP0A@mail.gmail.com>
In-reply-to	<1377078267.22.0.222957122817.issue18713@psf.upfronthosting.co.za>

Content

Currently, Python 3 fails miserably when it reads a non-ASCII
character from stdin or when it tries to write a byte encoded as a
Unicode surrogate to stdout.
It works fine when OS data can be decoded from and encoded to the
locale encoding. Example on Linux with UTF-8 data and UTF-8 locale
encoding:
$ mkdir test
$ cd test
$ touch héhé.txt
$ ls
héhé.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt
$ echo "héhé"|python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
héhé
It fails miserably when OS data cannot be decoded from or encoded to
the locale encoding. Example on Linux with UTF-8 data and ASCII locale
encoding:
$ mkdir test
$ cd test
$ touch héhé.txt
$ export LANG= # switch to ASCII locale encoding
$ ls
h??h??.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
$ echo "héhé"|LANG= python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/vstinner/prog/python/default/Lib/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
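Both errors can be reproduced without changing the locale. This is only a sketch, assuming the raw UTF-8 bytes of the file name and the ASCII codec with the surrogateescape handler that Python applies to file names; raw_name, decoded_name, decode_error and encode_error are illustrative names, not part of the session above:

```python
# Raw UTF-8 bytes of the file name, as stored by the file system.
raw_name = b"h\xc3\xa9h\xc3\xa9.txt"

# Reading stdin under an ASCII locale: strict decoding rejects byte 0xc3.
try:
    raw_name.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc

# os.listdir() under an ASCII locale: undecodable bytes become lone surrogates.
decoded_name = raw_name.decode("ascii", "surrogateescape")

# Writing to stdout under an ASCII locale: strict encoding rejects the surrogates.
try:
    decoded_name.encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc

print(decode_error)
print(encode_error)
```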
The ls output is not the expected "héhé" string, but that is an issue
with the console output, not with the ls program. ls just writes raw
bytes to stdout:
$ ls|hexdump -C
00000000  68 c3 a9 68 c3 a9 2e 74  78 74 0a                 |h..h...txt.|
0000000b
("héhé" encoded to UTF-8 gives b'h\xc3\xa9h\xc3\xa9')
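The same bytes can be checked from Python. A small sketch, assuming the file name plus the trailing newline that ls writes; ls_output, hex_bytes, escaped and restored are names made up for the example:

```python
# File name followed by the newline that ls appends to its output.
ls_output = "héhé.txt\n".encode("utf-8")

# Space-separated hex bytes, comparable to the hexdump -C columns.
hex_bytes = " ".join(f"{byte:02x}" for byte in ls_output)
print(hex_bytes)

# With the ASCII codec, surrogateescape keeps each undecodable byte 0xNN
# as the lone surrogate U+DCNN, so the original bytes can be recovered.
escaped = ls_output.decode("ascii", "surrogateescape")
restored = escaped.encode("ascii", "surrogateescape")
print(restored == ls_output)
```

The round trip is lossless, which is why the strings returned by os.listdir() still carry the raw bytes even when the locale encoding cannot decode them.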
I agree that we can do something to improve the situation on standard
streams, but only on standard streams. It is already possible to
work around the issue by forcing the surrogateescape error handler on
stdout:
$ LANG= PYTHONIOENCODING=utf-8:surrogateescape python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt
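The effect of utf-8:surrogateescape on stdout can be imitated in-process. This is a sketch only, assuming an in-memory io.BytesIO in place of the real stdout; fake_stdout, listed_name and written are hypothetical names:

```python
import io

# File name as os.listdir() would return it under an ASCII locale.
listed_name = b"h\xc3\xa9h\xc3\xa9.txt".decode("ascii", "surrogateescape")

# Text stream configured like PYTHONIOENCODING=utf-8:surrogateescape.
buffer = io.BytesIO()
fake_stdout = io.TextIOWrapper(buffer, encoding="utf-8",
                               errors="surrogateescape", newline="\n")

print(listed_name, file=fake_stdout)
fake_stdout.flush()

# The lone surrogates are written back as the original raw bytes.
written = buffer.getvalue()
print(written)
```

Because the handler turns each surrogate back into the byte it came from, the terminal receives the same UTF-8 bytes that ls writes.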
Something similar can be done in Python. For example,
test.regrtest reopens sys.stdout to set the error handler to
"backslashreplace". Extract of the replace_stdout() function:
    import sys

    # Keep the original stream: open() reuses its file descriptor.
    stdout = sys.stdout
    sys.stdout = open(stdout.fileno(), 'w',
                      encoding=stdout.encoding,
                      errors="backslashreplace",
                      closefd=False,
                      newline='\n')
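To see what the "backslashreplace" handler produces, here is a sketch that assumes an ASCII text stream over an in-memory io.BytesIO rather than the real stdout; ascii_stdout and written are names invented for the example:

```python
import io

# ASCII text stream that escapes unencodable characters instead of failing.
buffer = io.BytesIO()
ascii_stdout = io.TextIOWrapper(buffer, encoding="ascii",
                                errors="backslashreplace", newline="\n")

# A regular non-ASCII string and a surrogate-escaped file name.
print("héhé.txt", file=ascii_stdout)
print(b"h\xc3\xa9".decode("ascii", "surrogateescape"), file=ascii_stdout)
ascii_stdout.flush()

# Each unencodable character is written as an ASCII escape sequence.
written = buffer.getvalue()
print(written)
```

Unlike surrogateescape, this handler does not restore the original bytes; it only guarantees that writing never raises UnicodeEncodeError. On Python 3.7 and later, when sys.stdout is an io.TextIOWrapper, sys.stdout.reconfigure(errors="backslashreplace") changes the handler without reopening the stream.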

History
Date	User	Action	Args
2013-08-21 10:38:53	vstinner	set	recipients: + vstinner, lemburg, ncoghlan, pitrou, abadger1999, benjamin.peterson, ezio.melotti, a.badger, r.david.murray
2013-08-21 10:38:53	vstinner	link	issue18713 messages
2013-08-21 10:38:52	vstinner	create
