Author vstinner
Recipients a.badger, abadger1999, benjamin.peterson, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, vstinner
Date 2013-08-21.10:38:52
Message-id <CAMpsgwZSk2k41uRgph3y9fF3jc75Wbss+9wz3wqa2ADHRHoP0A@mail.gmail.com>
In-reply-to <1377078267.22.0.222957122817.issue18713@psf.upfronthosting.co.za>
Content
Currently, Python 3 fails miserably when it gets a non-ASCII
character from stdin or when it tries to write a byte encoded as a
Unicode surrogate to stdout.
It works fine when OS data can be decoded from and encoded to the
locale encoding. Example on Linux with UTF-8 data and UTF-8 locale
encoding:
$ mkdir test
$ cd test
$ touch héhé.txt
$ ls
héhé.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt
$ echo "héhé"|python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
héhé
It fails miserably when OS data cannot be decoded from or encoded to
the locale encoding. Example on Linux with UTF-8 data and ASCII locale
encoding:
$ mkdir test
$ cd test
$ touch héhé.txt
$ export LANG= # switch to ASCII locale encoding
$ ls
h??h??.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
$ echo "héhé"|LANG= python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/vstinner/prog/python/default/Lib/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
The ls output is not the expected "héhé" string, but that is an issue
with the console output, not with the ls program: ls just writes raw
bytes to stdout:
$ ls|hexdump -C
00000000 68 c3 a9 68 c3 a9 2e 74 78 74 0a |h..h...txt.|
0000000b
("héhé" encoded to UTF-8 gives b'h\xc3\xa9h\xc3\xa9')
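The two tracebacks above can be reproduced without changing the locale. This is a minimal sketch, assuming the same file name bytes as in the hexdump and assuming that os.listdir() decodes names with the ASCII codec and the surrogateescape error handler under an ASCII locale, as Python 3 did at the time of this message. The variable names are illustrative only.

```python
# Raw file name bytes, as shown by the hexdump of the ls output.
raw = b"h\xc3\xa9h\xc3\xa9.txt"

# With a UTF-8 locale encoding the bytes decode cleanly.
utf8_name = raw.decode("utf-8")

# With an ASCII locale encoding, each undecodable byte 0xNN is mapped
# to the lone surrogate U+DCNN by the surrogateescape error handler.
name = raw.decode("ascii", "surrogateescape")

# A strict ASCII decoder (the stdin case) rejects the first non-ASCII byte.
decode_error = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc

# A strict ASCII encoder (the stdout case) rejects the surrogates.
encode_error = None
try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc

# Encoding with surrogateescape restores the original bytes.
roundtrip = name.encode("ascii", "surrogateescape")
```

The surrogate characters are only a lossless in-memory stand-in for the original bytes. They can be written back only through an encoder that also uses surrogateescape, which is what the workaround discussed in this message relies on.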
I agree that we can do something to improve the situation on standard
streams, but only on standard streams. It is already possible to
work around the issue by forcing the surrogateescape error handler on
stdout:
$ LANG= PYTHONIOENCODING=utf-8:surrogateescape python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt
Something similar can be done from within Python. For example,
test.support.regrtest reopens sys.stdout to set the error handler to
"backslashreplace". Excerpt from the replace_stdout() function:
    sys.stdout = open(stdout.fileno(), 'w',
                      encoding=sys.stdout.encoding,
                      errors="backslashreplace",
                      closefd=False,
                      newline='\n')
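To show what the choice of error handler changes, here is a minimal sketch that writes a surrogate-escaped file name through an ASCII text stream. It uses an in-memory io.BytesIO buffer in place of the real stdout so that it runs anywhere. The helper encode_via_stream() is hypothetical and is not part of regrtest.

```python
import io


def encode_via_stream(text, encoding, errors):
    """Hypothetical helper: write text through a text stream configured
    with the given encoding and error handler, return the bytes produced."""
    raw = io.BytesIO()  # stands in for the stdout file descriptor
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors,
                              newline="\n")
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# File name as decoded from b"h\xc3\xa9.txt" with ASCII + surrogateescape.
name = "h\udcc3\udca9.txt"

# backslashreplace never fails, but the output is an escaped rendering.
escaped = encode_via_stream(name, "ascii", "backslashreplace")

# surrogateescape writes the original bytes back unchanged.
restored = encode_via_stream(name, "ascii", "surrogateescape")

# The default strict handler raises, as in the traceback shown earlier.
try:
    encode_via_stream(name, "ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

With "backslashreplace" the output is always printable but no longer matches the file name on disk. With "surrogateescape" the original bytes are preserved, so a UTF-8 terminal displays them correctly. On Python 3.7 and later, sys.stdout.reconfigure(errors=...) changes the handler in place without reopening the stream.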
History
Date                 User      Action  Args
2013-08-21 10:38:53  vstinner  set     recipients: + vstinner, lemburg, ncoghlan, pitrou, abadger1999, benjamin.peterson, ezio.melotti, a.badger, r.david.murray
2013-08-21 10:38:53  vstinner  link    issue18713 messages
2013-08-21 10:38:52  vstinner  create