This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in Python's Developer Guide.
Created on 2013年08月12日 15:19 by ncoghlan, last changed 2022年04月11日 14:57 by admin.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| surrogateescape.patch | vstinner, 2013年08月22日 00:36 | |
| Messages (34) | |||
|---|---|---|---|
| msg194968 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2013年08月12日 15:19 | |
One problem with Unicode in 3.x is that surrogateescape isn't normally enabled on stdin and stdout. This means the following code will fail with UnicodeEncodeError in the presence of invalid filesystem metadata:
print(os.listdir())
We don't really want to enable surrogateescape on sys.stdin or sys.stdout unilaterally, as it increases the chance of data corruption errors when the filesystem encoding and the IO encodings don't match.
Last night, Toshio and I thought of a possible solution: enable surrogateescape by default for sys.stdin and sys.stdout on non-Windows systems if (and only if) they're using the same codec as that returned by sys.getfilesystemencoding() (allowing for codec aliases rather than doing a simple string comparison).
This means that for full UTF-8 systems (which includes most modern Linux installations), roundtripping will be enabled by default between the standard streams and OS facing APIs, while systems where the encodings don't match will still fail noisily.
A more general alternative is also possible: default to errors='surrogateescape' for *any* text stream that uses the filesystem encoding. It's primarily the standard streams we're interested in fixing, though.
|
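A minimal sketch of the check described in this proposal, assuming that "same codec" means both names resolve to the same canonical codec name. The helpers `same_codec` and `proposed_stdout_errors` are hypothetical names used for illustration; this is not how CPython configures its standard streams.

```python
import codecs
import sys


def same_codec(name_a, name_b):
    """Return True if both names resolve to the same codec (aliases allowed)."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


def proposed_stdout_errors(stream_encoding, fs_encoding=None, platform=None):
    """Pick the error handler the proposal would use for a standard stream."""
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    if platform is None:
        platform = sys.platform
    # Only non-Windows systems whose stream codec matches the filesystem
    # codec would get surrogateescape; everything else stays strict.
    if platform != "win32" and same_codec(stream_encoding, fs_encoding):
        return "surrogateescape"
    return "strict"
```

Comparing the canonical names from `codecs.lookup()` rather than the raw strings is what lets "utf8" and "UTF-8" count as the same codec.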
|||
| msg194969 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2013年08月12日 15:38 | |
My gut reaction to this is that it feels dangerous. That doesn't mean my gut is right, I'm just reporting my reaction :) |
|||
| msg194970 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2013年08月12日 15:45 | |
Everything about surrogateescape is dangerous - we're trying to work around the presence of bad data by at least allowing it to be tunnelled through Python code without corrupting it further :) |
|||
| msg195732 - (view) | Author: Toshio Kuratomi (a.badger) * | Date: 2013年08月21日 00:16 | |
Nick and I had talked about this at a recent conference and came to it from different directions. On the one hand, Nick made the point that any encoding of surrogateescape'd text to bytes via a different encoding is corrupting the data as a whole. On the other hand, I made the point that raising an exception when doing something as basic as printing something that's text type was reintroducing the issues that python2 had wrt unicode, bytes, and encodings -- particularly with the exception being raised far from the source of the problem (when the data is introduced into the program).
After some thought, Nick came up with this solution. The idea is that surrogateescape was originally accepted to allow roundtripping data from the OS and back when the OS considers it to be a "string" but python does not consider it to be "text". When that's the case, we know what encoding was used to attempt to construct the text in python. If that same encoding is used to re-encode the data on the way back to the OS, then we're successfully roundtripping the data we were given in the first place. So this is just applying the original goal to another API.
|
|||
| msg195733 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2013年08月21日 00:25 | |
Which reminds me: I'm curious what "ls" currently does for malformed filenames. The aim of this change would be to get 'python -c "import os; print(os.listdir())"' to do the best it can to work without losing data in such a situation. |
|||
| msg195734 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月21日 00:31 | |
On Linux, the locale encoding is usually UTF-8. If a filename cannot be decoded from UTF-8, invalid bytes are escaped to the surrogate range using PEP 383. If I create a UTF-8 text file and I try to write the filename into this text file, the Python UTF-8 encoder raises an error. IMO Python must raise an error here because I want to generate a valid UTF-8 text file, not a text file only readable by Python if the locale encoding is UTF-8.
So using the surrogateescape error handler if the encoding is sys.getfilesystemencoding() is *not* a good idea.
What is your use case where you need to display a filename? Is it displayed to the terminal, into a file or in a graphical window? Why not escape surrogates just to format the filename, as Gnome does? See for example: https://developer.gnome.org/glib/2.34/glib-Character-Set-Conversion.html#g-filename-display-name
|
|||
| msg195735 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月21日 00:37 | |
2013-08-21 Nick Coghlan <report@bugs.python.org>:
> Which reminds me: I'm curious what "ls" currently does for malformed
> filenames. The aim of this change would be to get 'python -c "import os;
> print(os.listdir())"' to do the best it can to work without losing data in
> such a situation.
The "ls" command works on bytes, not on characters. You can reimplement "ls" with:
* Unicode: os.listdir(str), os.fsencode() and sys.stdout.buffer
* bytes: os.listdir(bytes) and sys.stdout.buffer
os.fsencode() does exactly the opposite of os.fsdecode(). There is a unit test to check that :-)
I ensured that all OS functions can be used directly with bytes filenames in Python 3. That's why I added os.environb for example.
|
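The "Unicode" variant of the bytes-oriented "ls" described in this message could be sketched as follows. The function names `list_names_as_bytes` and `write_listing` are hypothetical; only `os.listdir()`, `os.fsencode()` and `sys.stdout.buffer` come from the message.

```python
import os
import sys


def list_names_as_bytes(path="."):
    """Return directory entries as the raw bytes the OS reported."""
    # os.listdir(str) decodes names with the filesystem encoding;
    # os.fsencode() reverses that decoding to recover the original bytes.
    return [os.fsencode(name) for name in os.listdir(path)]


def write_listing(path=".", out=None):
    """Write one entry per line to a binary stream, bypassing text encoding."""
    if out is None:
        out = sys.stdout.buffer
    for raw in sorted(list_names_as_bytes(path)):
        out.write(raw + b"\n")
```

Because the names never pass through the text layer of `sys.stdout`, no error handler is involved on output and undecodable filenames are written unchanged.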
|||
| msg195741 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2013年08月21日 02:35 | |
Think sysadmins running scripts on Linux, writing to the console or a pipe. I agree the generalisation is a bad idea, so only consider the original proposal that was specifically limited to the standard streams. Specifically, if a system is properly configured to use UTF-8 for all interfaces, I shouldn't have to live in fear of Python steps in a command pipeline falling over because it happens to encounter a filename encoded with latin-1 (etc). If the bytes oriented os tools like ls don't fall over on it, then neither should Python. This is about treating the standard streams as OS interfaces, as long as they're using the filesystem encoding. |
|||
| msg195746 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013年08月21日 05:34 | |
> After some thought, Nick came up with this solution. The idea is that
> surrogateescape was originally accepted to allow roundtripping data
> from the OS and back when the OS considers it to be a "string" but
> python does not consider it to be "text". When that's the case, we
> know what the encoding was used to attempt to construct the text in
> python. If that same encoding is used to re-encode the data on the
> way back to the OS, then we're successfully roundtripping the data we
> were given in the first place. So this is just applying the original
> goal to another API.
I think that outlook is a bit naïve. The text source is not always the same as the text destination, i.e. your surrogateescape-decoded data may come from a database or some JSON API, so there's no reason to think that the end of the stdout pipe will share the same convention.
I'm myself quite partial to the "round-tripping" use case, but I'm not sure we can solve it as bluntly. If it's merely for printing out data, maybe we can add an os.fsescape() function to allow for representation of broken filenames.
|
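One possible shape for such a display helper, as a sketch only: `os.fsescape()` does not exist, and the function below (`display_name`, a hypothetical name) simply swaps undecodable bytes for U+FFFD so the result can always be printed with a strict encoder.

```python
import sys


def display_name(name, encoding=None):
    """Return a printable form of a possibly surrogate-escaped filename."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    # Recover the original bytes, then decode again, this time replacing
    # undecodable bytes with U+FFFD instead of lone surrogates.
    raw = name.encode(encoding, "surrogateescape")
    return raw.decode(encoding, "replace")
```

The result is lossy by design: it is meant for showing a name to a user, not for passing back to the operating system.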
|||
| msg195761 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2013年08月21日 09:44 | |
I think the essential use case is using a python program in a unix pipeline. I'm very sympathetic to that use case, despite my unease. |
|||
| msg195769 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月21日 10:38 | |
Currently, Python 3 fails miserably when it gets a non-ASCII
character from stdin or when it tries to write a byte encoded as a
Unicode surrogate to stdout.
It works fine when OS data can be decoded from and encoded to the
locale encoding. Example on Linux with UTF-8 data and UTF-8 locale
encoding:
$ mkdir test
$ cd test
$ touch héhé.txt
$ ls
héhé.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt
$ echo "héhé"|python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
héhé
It fails miserably when OS data cannot be decoded from or encoded to
the locale encoding. Example on Linux with UTF-8 data and ASCII locale
encoding:
$ mkdir test
$ cd test
$ touch héhé.txt
$ export LANG= # switch to ASCII locale encoding
$ ls
h??h??.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
$ echo "héhé"|LANG= python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/vstinner/prog/python/default/Lib/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
The ls output is not the expected "héhé" string, but it is an issue
with the console output, not the ls program. ls does just write raw
bytes to stdout:
$ ls|hexdump -C
00000000 68 c3 a9 68 c3 a9 2e 74 78 74 0a |h..h...txt.|
0000000b
("héhé" encoded to UTF-8 gives b'h\xc3\xa9h\xc3\xa9')
I agree that we can do something to improve the situation on standard
streams, but only on standard streams. It is already possible to
work around the issue by forcing the surrogateescape error handler on
stdout:
$ LANG= PYTHONIOENCODING=utf-8:surrogateescape python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt
Something similar can be done in Python. For example,
test.support.regrtest reopens sys.stdout to set the error handler to
"backslashreplace". Extract of the replace_stdout() function:
sys.stdout = open(stdout.fileno(), 'w',
                  encoding=sys.stdout.encoding,
                  errors="backslashreplace",
                  closefd=False,
                  newline='\n')
|
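The same idea can be sketched with `io.TextIOWrapper` around a binary stream, which makes it easy to try out without touching the real `sys.stdout`. The helper name `text_writer` is hypothetical.

```python
import io


def text_writer(binary_stream, encoding, errors="surrogateescape"):
    """Wrap a binary stream in a text layer with an explicit error handler."""
    return io.TextIOWrapper(binary_stream, encoding=encoding, errors=errors,
                            newline="\n", write_through=True)


# Possible use on the real stream (not executed here):
#     sys.stdout = text_writer(sys.stdout.buffer, sys.stdout.encoding)
```

On Python 3.7 and later, `sys.stdout.reconfigure(errors="surrogateescape")` changes the error handler in place without rewrapping the stream.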
|||
| msg195844 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月22日 00:36 | |
Attached patch changes the error handler of stdin, stdout and stderr to surrogateescape by default. It can still be changed explicitly using the PYTHONIOENCODING environment variable. |
|||
| msg195860 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年08月22日 05:47 | |
The surrogateescape error handler works only with UTF-8. As a side effect of this change an input from stdin will be incompatible in general with extensions which implicitly encode a string to bytes with UTF-8 (e.g. tkinter, XML parsers, sqlite3, datetime, locale, curses, etc.) |
|||
| msg195862 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月22日 06:53 | |
"The surrogateescape error handler works with any codec." The surrogatepass only works with utf-8 if I remember correctly. The surrogateescape error handler works with any codec, especially ascii. "As a side effect of this change an input from stdin will be incompatible in general with extensions which implicitly encode a string to bytes with UTF-8 (e.g. tkinter, XML parsers, sqlite3, datetime, locale, curses, etc.)" Correct, but it's not something new: os.listdir(), sys.argv, os.environ and other functions using os.fsdecode(). Applications should already have to support surrogates. |
|||
| msg195866 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年08月22日 07:35 | |
> "The surrogateescape error handler works with any codec."
Ah, sorry. You are correct.
> Correct, but it's not something new: os.listdir(), sys.argv, os.environ and other functions using os.fsdecode(). Applications should already have to support surrogates.
I'm only saying that this will increase the number of cases where an exception is raised in an unexpected place.
Perhaps it would be safer to leave "strict" as the default error handler and make the errors attribute of text streams modifiable.
|
|||
| msg195867 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月22日 08:20 | |
> I'm only saying that this will increase the number of cases
> where an exception is raised in an unexpected place.
The print() instruction is much more common than input(). IMO changing the error handler should fix more issues than it adds regressions.
Python functions decoding OS data from the filesystem encoding with surrogateescape:
- sys.thread_info.version
- sys.argv
- os.environ, os.getenv()
- os.fsdecode()
- _ssl._SSLSocket.compression
- os.ttyname(), os.ctermid(), os.getcwd(), os.listdir(), os.uname(), os.getlogin(), os.readlink(), os.confstr(), os.listxattr(), nis.cat()
- grp.getgrgid(), grp.getgrnam(), grp.getgrall()
- spwd.getspnam(), spwd.getspall()
- pwd.getpwuid(), pwd.getpwnam(), pwd.getpwall()
- socket.socket.accept(), socket.socket.getsockname(), socket.socket.getpeername(), socket.socket.recvfrom(), socket.gethostname(), socket.if_nameindex(), socket.if_indextoname()
|
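Strings returned by the functions listed above may contain escaped bytes, which surrogateescape maps to the code points U+DC80 to U+DCFF. A small sketch for detecting them before passing a string to a strict encoder (the helper name `has_escaped_bytes` is hypothetical):

```python
def has_escaped_bytes(text):
    """Return True if text contains surrogateescape'd bytes (U+DC80..U+DCFF)."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)
```

A caller could use this to decide whether to escape a value for display, reject it, or pass it through with the surrogateescape handler.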
|||
| msg195868 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年08月22日 08:37 | |
Wouldn't it be safer to use surrogateescape for output and strict for input? |
|||
| msg195870 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月22日 09:12 | |
Serhiy Storchaka added the comment:
> Wouldn't it be safer to use surrogateescape for output and strict for input?
Nick wrote "Think sysadmins running scripts on Linux, writing to the console or a pipe." See my message msg195769: Python 3 cannot simply be used as a pipe because it wants to be kind by decoding binary data to Unicode, whereas not everybody cares about Unicode :-)
Hum, I realized that subprocess should also be patched to be consistent: subprocess already uses surrogateescape for the command line arguments and environment variables, so why not use the same error handler for stdin, stdout and stderr?
Serhiy Storchaka also noticed (in the review of my patch) that errors is "strict" when PYTHONIOENCODING=utf-8 is used. We should also use surrogateescape if only the encoding is changed.
|
|||
| msg195871 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013年08月22日 09:20 | |
> See my message msg195769: Python 3 cannot simply be used as a pipe
> because it wants to be kind by decoding binary data to Unicode,
> whereas not everybody cares about Unicode :-)
If somebody doesn't care about unicode, they can use sys.stdin.buffer. Problem solved :-)
Note: enabling surrogateescape on stdin enables precisely the "exception being raised far from the source of the problem" that people are afraid of. surrogateescape on stdin allows invalid unicode strings to slip into your application, only for a later encoding to utf-8 to fail (since lone surrogates are not allowed). For example if you are sending that user data over a utf-8 network protocol (perhaps JSON-encoded or XML-encoded)...
|
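The failure mode described here can be reproduced in a few lines. This is an illustrative sketch: `read_untrusted`, `send_over_utf8_protocol` and `demo` are hypothetical names standing in for a surrogateescape-enabled stdin and a strict UTF-8 consumer.

```python
def read_untrusted(raw, encoding="utf-8"):
    """Decode input the way a surrogateescape-enabled stdin would."""
    return raw.decode(encoding, "surrogateescape")


def send_over_utf8_protocol(text):
    """Encode strictly, as a UTF-8 network protocol or XML writer would."""
    return text.encode("utf-8")  # raises UnicodeEncodeError on lone surrogates


def demo():
    """Show that decoding succeeds but the later strict encode fails."""
    text = read_untrusted(b"caf\xe9")  # Latin-1 bytes; decoding succeeds
    try:
        send_over_utf8_protocol(text)
    except UnicodeEncodeError as exc:
        return exc.reason
    return None
```

The error surfaces in `send_over_utf8_protocol()`, which may be far away from the code that read the bad bytes.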
|||
| msg195872 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013年08月22日 09:21 | |
> Serhiy Storchaka also noticed (in the review of my patch) that errors
> is "strict" when PYTHONIOENCODING=utf-8 is used. We should also use
> surrogateescape if only the encoding is changed.
I don't understand what you say. Could you rephrase?
|
|||
| msg195873 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月22日 09:24 | |
>> Serhiy Storchaka also noticed (in the review of my patch) that errors
>> is "strict" when PYTHONIOENCODING=utf-8 is used. We should also use
>> surrogateescape if only the encoding is changed.
> I don't understand what you say. Could you rephrase?
With my patch, sys.stdin.errors is "surrogateescape" by default, but it is "strict" when the PYTHONIOENCODING environment variable is set to "utf-8".
|
|||
| msg195874 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2013年08月22日 09:27 | |
Is it a bug in your patch, or is it deliberate? |
|||
| msg195878 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月22日 11:50 | |
> Is it a bug in your patch, or is it deliberate?
It was not deliberate, and I think that it would be more consistent to use the same error handler (surrogateescape) when only the encoding is changed by the PYTHONIOENCODING environment variable. So surrogateescape should be used even with PYTHONIOENCODING=utf-8.
|
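PYTHONIOENCODING takes the form encodingname:errorhandler, with both parts optional. The behaviour argued for in this message could be sketched as below; `parse_io_encoding` is a hypothetical helper, and the `default_errors` value reflects the proposal in this thread, not necessarily what any CPython release does.

```python
def parse_io_encoding(value, default_encoding="utf-8",
                      default_errors="surrogateescape"):
    """Split an 'encodingname:errorhandler' value; both parts are optional."""
    encoding, _, errors = value.partition(":")
    # A missing part falls back to the default, so setting only the
    # encoding does not silently switch the error handler to "strict".
    return (encoding or default_encoding, errors or default_errors)
```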
|||
| msg195882 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年08月22日 12:47 | |
The surrogateescape error handler is dangerous with utf-16/32. It can produce globally invalid output. |
|||
| msg195886 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月22日 13:18 | |
> The surrogateescape error handler is dangerous with utf-16/32. It can produce globally invalid output.
I don't understand, can you give an example? surrogateescape generates an invalid encoded string with any encoding. Example with UTF-8:
>>> b"a\xffb".decode("utf-8", "surrogateescape")
'a\udcffb'
>>> 'a\udcffb'.encode("utf-8", "surrogateescape")
b'a\xffb'
>>> b'a\xffb'.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 1: invalid start byte
So str.encode("utf-8", "surrogateescape") produces an invalid UTF-8 sequence.
|
|||
| msg195894 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年08月22日 14:16 | |
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape')
b'\xff\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape').decode('utf-16le', 'surrogateescape')
'\udcff\udcdcqwerty'
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape').decode('utf-16le', 'surrogateescape').encode('utf-16le', 'surrogateescape')
b'\xff\xdc\xdc\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape').decode('utf-16le', 'surrogateescape').encode('utf-16le', 'surrogateescape').decode('utf-16le', 'surrogateescape')
'\udcff\udcdc\udcdc\udcdcqwerty'
|
|||
| msg195897 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2013年08月22日 14:42 | |
Note that the specific case I'm really interested in is printing on systems that are properly configured to use UTF-8, but are getting bad metadata from an OS API. I'm OK with the idea of *only* changing it for UTF-8 rather than for arbitrary encodings, as well as restricting it to sys.stdout when the codec used matches the default filesystem encoding.
To double check the current behaviour, I created a directory to tinker with this. Filenames were created with the following:
>>> open("ℙƴ☂ℌøἤ".encode("utf-8"), "w")
>>> open("basic_ascii".encode("utf-8"), "w")
>>> b"\xd0\xd1\xd2\xd3".decode("latin-1")
'ÐÑÒÓ'
>>> open(b"\xd0\xd1\xd2\xd3", "w")
That last generates an invalid UTF-8 filename. "ls" actually degrades less gracefully than I thought, and just prints question marks for the bad file:
$ ls -l
total 0
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:04 ????
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 basic_ascii
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 ℙƴ☂ℌøἤ
Python 2 & 3 both work OK if you just print the directory listing directly, since repr() happily displays the surrogate escaped string:
$ python -c "import os; print(os.listdir('.'))"
['basic_ascii', '\xd0\xd1\xd2\xd3', '\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4']
$ python3 -c "import os; print(os.listdir('.'))"
['basic_ascii', '\udcd0\udcd1\udcd2\udcd3', 'ℙƴ☂ℌøἤ']
Where it falls down is when you try to print the strings directly in Python 3:
$ python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<string>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 0: surrogates not allowed
While setting the IO encoding produces behaviour closer to that of the native tools:
$ PYTHONIOENCODING=utf-8:surrogateescape python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
����
ℙƴ☂ℌøἤ
On the other hand, setting PYTHONIOENCODING as shown provides an environmental workaround, and http://bugs.python.org/issue15216 will provide an improved programmatic workaround (which tools like http://code.google.com/p/pyp/ could use to configure surrogateescape by default).
So perhaps pursuing #15216 further would be a better approach than selectively changing the default behaviour? And better documentation for ways to handle the surrogate escape error when it arises?
|
|||
| msg195908 - (view) | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2013年08月22日 15:40 | |
If you pipe the ls (eg: ls >temp) the bytes are preserved. Since setting the escape handler via PYTHONIOENCODING sets it for both stdin and stdout, it sounds like that solves the sysadmin use case. The sysadmin can just put that environment variable setting in their default profile, and python will once again work like the other unix shell tools. (I double checked, and this does indeed work...doing the equivalent of ls >temp via python preserves the bytes with that PYTHONIOENCODING setting. I don't quite understand, however, why I get the � chars if I don't redirect the output.) I'd be inclined to consider the above as reason enough to close this issue. As usual with Python, explicit is better than implicit. |
|||
| msg195929 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2013年08月22日 22:11 | |
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape')
b'\xff\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
Oh, this is a bug in the UTF-16 encoder: it should not encode surrogate characters => see issue #12892
I read that it's possible to set a standard stream like stdout in UTF-16 mode on Windows. I don't know if it's commonly used, nor how it would impact Python. I have never seen a platform using UTF-16 or UTF-32 for standard streams.
|
|||
| msg195931 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2013年08月22日 23:04 | |
On 23 Aug 2013 01:40, "R. David Murray" <report@bugs.python.org> wrote:
> (I double checked, and this does indeed work...doing the equivalent of ls >temp via python preserves the bytes with that PYTHONIOENCODING setting. I don't quite understand, however, why I get the � chars if I don't redirect the output.)
I assume the terminal window is doing the substitution for the improperly encoded bytes.
Regarding the issue, perhaps we should convert this to a docs bug? Attempt to make "PYTHONIOENCODING=utf-8:surrogateescape" easier to discover? Heck, it may be worth creating a stable URL that we can include in surrogate related error messages...
|
|||
| msg195932 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2013年08月22日 23:06 | |
I think it would be great to have a "Unicode/bytes" howto with information like this included. |
|||
| msg195938 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2013年08月23日 04:08 | |
Note: I created issue 18814 to cover some additional tools for working with surrogate escaped strings.
For this issue, we currently have http://docs.python.org/3/howto/unicode.html, which aims to be a more comprehensive guide to understanding Unicode issues. I'm thinking we may want a "Debugging Unicode Errors" document, which defers to the existing howto guide for those that really want to understand Unicode, and instead focuses on quick fixes for resolving various problems that may present themselves. Application developers will likely want to read the longer guide, while the debugging document would be aimed at getting script writers past their immediate hurdle, without necessarily gaining a full understanding of Unicode.
The aim would be for this page to become the top hit for "python surrogates not allowed", rather than the current top hit, which is a rejected bug report about it (http://bugs.python.org/issue13717).
For example:
================================
What is the meaning of "UnicodeEncodeError: surrogates not allowed"?
--------------------------------------------------------------------
Operating system metadata on POSIX based systems like Linux and Mac OS X may include improperly encoded text values. To cope with this, Python uses the "surrogateescape" error handler to store those arbitrary bytes inside a Unicode object. When converted back to bytes using the same encoding and error handler, the original byte sequence is reproduced exactly. This allows operations like opening a file based on a directory listing to work correctly, even when the metadata is not properly encoded according to the system settings.
The "surrogates not allowed" error appears when a string from one of these operating system interfaces contains an embedded arbitrary byte sequence, but an attempt is made to encode it using the default "strict" error handler rather than the "surrogateescape" handler. This commonly occurs when printing improperly encoded operating system data to the console, or writing it to a file, database or other serialised interface.
The ``PYTHONIOENCODING`` environment variable can be used to ensure operating system metadata can always be read via sys.stdin and written via sys.stdout. The following command will display the encoding Python will use by default to interact with the operating system::
    $ python3 -c "import sys; print(sys.getfilesystemencoding())"
    utf-8
This can then be used to specify an appropriate setting for ``PYTHONIOENCODING``::
    $ export PYTHONIOENCODING=utf-8:surrogateescape
For other interfaces, there is no such general solution. If allowing the invalid byte sequence to propagate further is acceptable, then enabling the surrogateescape handler may be appropriate. Alternatively, it may be better to track these corrupted strings back to their point of origin, and either fix the underlying metadata, or else filter them out early on.
================================
If issue 18814 is implemented, then it could point to those tools. Similarly, issue 15216 could be referenced if that is implemented.
|
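The effect of the documented setting can be checked from a script by launching a child interpreter with the variable set. This is a sketch under the assumption of a POSIX-like system; the helper name `run_with_surrogateescape` is hypothetical.

```python
import os
import subprocess
import sys


def run_with_surrogateescape(code, encoding="utf-8"):
    """Run Python code in a child process with surrogateescape on std streams."""
    # Copy the current environment and add the setting from the draft text.
    env = dict(os.environ, PYTHONIOENCODING=f"{encoding}:surrogateescape")
    result = subprocess.run([sys.executable, "-c", code], env=env,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```

For example, `run_with_surrogateescape(r"print('\udcd0\udcd1')")` returns the raw bytes the child wrote instead of failing with "surrogates not allowed".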
|||
| msg195966 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2013年08月23日 12:52 | |
With the new subject, this issue looks like a duplicate of (or tightly related to) issue12832. |
|||
| msg242618 - (view) | Author: Nikolaus Rath (nikratio) * | Date: 2015年05月05日 20:44 | |
The first thing that would come to my mind when reading Nick's proposed document (without first reading this bug report) is "So why the heck is this not the default?". It would probably save a lot of people a lot of anger if there was also a brief explanation addressing this obvious first response :-). |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:49 | admin | set | status: pending -> open github: 62913 |
| 2017年03月24日 16:29:59 | serhiy.storchaka | set | status: open -> pending superseder: The documentation for the print function should explain/point to how to control the sys.stdout encoding resolution: duplicate |
| 2015年05月05日 20:44:35 | nikratio | set | nosy: + nikratio messages: + msg242618 |
| 2013年08月23日 12:52:26 | serhiy.storchaka | set | messages: + msg195966 |
| 2013年08月23日 06:46:35 | serhiy.storchaka | set | dependencies: + Empty PYTHONIOENCODING is not the same as nonexistent |
| 2013年08月23日 04:08:18 | ncoghlan | set | title: Enable surrogateescape on stdin and stdout when appropriate -> Clearly document the use of PYTHONIOENCODING to set surrogateescape nosy: + docs@python messages: + msg195938 assignee: docs@python components: + Documentation |
| 2013年08月22日 23:06:36 | benjamin.peterson | set | messages: + msg195932 |
| 2013年08月22日 23:04:42 | ncoghlan | set | messages: + msg195931 |
| 2013年08月22日 22:11:49 | vstinner | set | messages: + msg195929 |
| 2013年08月22日 15:40:38 | r.david.murray | set | messages: + msg195908 |
| 2013年08月22日 14:42:05 | ncoghlan | set | messages: + msg195897 |
| 2013年08月22日 14:16:13 | serhiy.storchaka | set | messages: + msg195894 |
| 2013年08月22日 13:18:23 | vstinner | set | messages: + msg195886 |
| 2013年08月22日 12:47:09 | serhiy.storchaka | set | messages: + msg195882 |
| 2013年08月22日 11:50:52 | vstinner | set | messages: + msg195878 |
| 2013年08月22日 09:27:00 | pitrou | set | messages: + msg195874 |
| 2013年08月22日 09:24:54 | vstinner | set | messages: + msg195873 |
| 2013年08月22日 09:21:46 | pitrou | set | messages: + msg195872 |
| 2013年08月22日 09:20:16 | pitrou | set | messages: + msg195871 |
| 2013年08月22日 09:12:47 | vstinner | set | messages: + msg195870 |
| 2013年08月22日 08:37:11 | serhiy.storchaka | set | messages: + msg195868 |
| 2013年08月22日 08:20:24 | vstinner | set | messages: + msg195867 |
| 2013年08月22日 07:35:04 | serhiy.storchaka | set | messages: + msg195866 |
| 2013年08月22日 06:53:28 | vstinner | set | messages: + msg195862 |
| 2013年08月22日 05:47:47 | serhiy.storchaka | set | nosy: + serhiy.storchaka messages: + msg195860 |
| 2013年08月22日 00:36:29 | vstinner | set | files: + surrogateescape.patch keywords: + patch messages: + msg195844 |
| 2013年08月21日 14:13:14 | Arfrever | set | nosy: + Arfrever |
| 2013年08月21日 10:38:53 | vstinner | set | messages: + msg195769 |
| 2013年08月21日 09:44:27 | r.david.murray | set | messages: + msg195761 |
| 2013年08月21日 05:34:50 | pitrou | set | messages: + msg195746 |
| 2013年08月21日 02:35:37 | ncoghlan | set | messages: + msg195741 |
| 2013年08月21日 00:37:10 | vstinner | set | messages: + msg195735 |
| 2013年08月21日 00:31:54 | vstinner | set | messages: + msg195734 |
| 2013年08月21日 00:25:20 | ncoghlan | set | messages: + msg195733 |
| 2013年08月21日 00:16:31 | a.badger | set | nosy: + a.badger messages: + msg195732 |
| 2013年08月12日 15:45:47 | ncoghlan | set | messages: + msg194970 |
| 2013年08月12日 15:38:56 | r.david.murray | set | nosy: + r.david.murray messages: + msg194969 |
| 2013年08月12日 15:19:34 | ncoghlan | create | |