This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: PEP 540: Add a new UTF-8 mode
Type: enhancement Stage: resolved
Components: Interpreter Core, Library (Lib), Unicode Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, ezio.melotti, methane, vstinner
Priority: normal Keywords: patch

Created on 2017-01-11 11:19 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
test_all_locales.py vstinner, 2018-01-15 09:38
Pull Requests
URL Status Linked
PR 855 merged vstinner, 2017-03-27 22:03
PR 4838 merged vstinner, 2017-12-13 14:04
PR 4895 merged vstinner, 2017-12-15 21:18
PR 4899 merged vstinner, 2017-12-16 03:10
PR 4968 merged vstinner, 2017-12-21 22:51
PR 5145 merged vstinner, 2018-01-10 17:59
PR 5148 merged vstinner, 2018-01-10 22:22
PR 5170 merged vstinner, 2018-01-13 00:23
PR 4174 merged vstinner, 2018-01-15 11:17
PR 5203 merged vstinner, 2018-01-16 15:46
PR 5204 merged python-dev, 2018-01-16 16:34
PR 5272 merged vstinner, 2018-01-22 16:48
Messages (37)
msg285214 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 11:19
This issue tracks the implementation of PEP 540.
The attached pep540_cli.py script can be used to experiment with it.
msg285215 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 11:27
pep540.patch: first draft
Changes:
* Add sys.flags.utf8mode
* Add -X utf8 command line option
* Add PYTHONUTF8 environment variable
* sys.stdin, sys.stdout and sys.stderr encoding and errors are modified in UTF-8 mode
* open() default encoding and errors are modified in UTF-8 mode
* Add Lib/test/test_utf8mode.py
* Skip a few tests relying on the locale encoding if the UTF-8 mode is enabled
* Document changes
Allowed options:
* Disable UTF-8 mode: -X utf8=0 or PYTHONUTF8=0
* Enable UTF-8 mode: -X utf8=1 or PYTHONUTF8=1
* Enable UTF-8 Strict mode: -X utf8=strict or PYTHONUTF8=strict
* Other -X utf8 and PYTHONUTF8 values cause a fatal error
Priorities (highest to lowest):
* open() encoding and errors arguments
* PYTHONIOENCODING
* UTF-8 mode
* os.device_encoding()
* locale encoding
TODO:
* re-encode sys.argv from the locale encoding to UTF-8 in Py_Main() when the UTF-8 mode is enabled
* support strict mode in Py_DecodeLocale() and Py_EncodeLocale()
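The priority list in this first draft can be read as a "first match wins" lookup. The sketch below only illustrates that ordering; `pick_text_encoding` and all of its parameters are hypothetical names invented for this example and are not part of the patch or of CPython.

```python
from typing import Optional


def pick_text_encoding(
    explicit: Optional[str] = None,          # open(..., encoding=...) argument
    pythonioencoding: Optional[str] = None,  # value of PYTHONIOENCODING, if set
    utf8_mode: bool = False,                 # -X utf8 or PYTHONUTF8=1
    device_encoding: Optional[str] = None,   # result of os.device_encoding()
    locale_encoding: str = "ascii",          # fallback: the locale encoding
) -> str:
    """Return the first applicable encoding, highest priority first."""
    if explicit:
        return explicit
    if pythonioencoding:
        return pythonioencoding
    if utf8_mode:
        return "utf-8"
    if device_encoding:
        return device_encoding
    return locale_encoding


# An explicit argument beats UTF-8 mode; UTF-8 mode beats the device encoding.
print(pick_text_encoding(explicit="latin-1", utf8_mode=True))
print(pick_text_encoding(utf8_mode=True, device_encoding="cp1252"))
```

The same ordering would apply to the error handler; it is left out here to keep the sketch short.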
msg285216 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 11:32
Examples with pep540_cli.py.
Python 3.5:
$ python3 pep540_cli.py 
sys.argv: ['pep540_cli.py']
stdin: UTF-8/strict
stdout: UTF-8/strict
stderr: UTF-8/backslashreplace
open(): UTF-8/strict
$ LC_ALL=C python3 pep540_cli.py 
sys.argv: ['pep540_cli.py']
stdin: ANSI_X3.4-1968/surrogateescape
stdout: ANSI_X3.4-1968/surrogateescape
stderr: ANSI_X3.4-1968/backslashreplace
open(): ANSI_X3.4-1968/strict
Patched Python 3.7:
$ ./python pep540_cli.py 
UTF-8 mode: 0
sys.argv: ['pep540_cli.py']
stdin: UTF-8/strict
stdout: UTF-8/strict
stderr: UTF-8/backslashreplace
open(): UTF-8/strict
$ LC_ALL=C ./python pep540_cli.py 
UTF-8 mode: 1
sys.argv: ['pep540_cli.py']
stdin: utf-8/surrogateescape
stdout: utf-8/surrogateescape
stderr: utf-8/backslashreplace
open(): utf-8/surrogateescape
$ ./python -X utf8 pep540_cli.py 
UTF-8 mode: 1
sys.argv: ['pep540_cli.py']
stdin: utf-8/surrogateescape
stdout: utf-8/surrogateescape
stderr: utf-8/backslashreplace
open(): utf-8/surrogateescape
$ ./python -X utf8=strict pep540_cli.py 
UTF-8 mode: 2
sys.argv: ['pep540_cli.py']
stdin: utf-8/strict
stdout: utf-8/strict
stderr: utf-8/backslashreplace
open(): utf-8/strict
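The pep540_cli.py script itself is no longer attached to this issue. The following is a rough sketch of a script that prints the same kind of report, written for a released Python 3.7 or later; it is not the original file, and `describe_io` is a name chosen for this example. It assumes the released attribute name `sys.flags.utf8_mode`.

```python
import os
import sys


def describe_io():
    """Collect the encoding/error-handler pairs shown in the transcripts above."""
    report = {"sys.argv": repr(sys.argv)}
    # sys.flags.utf8_mode exists in released Python 3.7+; fall back otherwise.
    report["UTF-8 mode"] = str(getattr(sys.flags, "utf8_mode", "n/a"))
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name)
        # A stream can be None or replaced (embedding, test runners, ...).
        encoding = getattr(stream, "encoding", None)
        errors = getattr(stream, "errors", None)
        report[name] = f"{encoding}/{errors}"
    # Open a file in text mode without arguments to see open()'s defaults.
    with open(os.devnull) as fp:
        report["open()"] = f"{fp.encoding}/{fp.errors}"
    return report


if __name__ == "__main__":
    for key, value in describe_io().items():
        print(f"{key}: {value}")
```

Running it with and without `-X utf8`, or with `LC_ALL=C`, should show how the defaults shift on a given platform.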
msg285275 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 22:04
pep540-2.patch: Patch version 2, updated to the latest version of PEP 540. It has no more FIXME/TODO and has more unit tests. The main change is that the strict mode no longer uses the strict error handler for OS data, but keeps surrogateescape. See the PEP for the rationale (especially the "Use the strict error handler for operating system data" alternative).
msg285276 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 22:13
Oops, I introduced an obvious bug in my latest refactoring. It's now fixed in the patch version 3: pep540-3.patch.
msg285277 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 23:00
Hum, pep540-3.patch doesn't work if the locale encoding is something other than ASCII or UTF-8. argv must be re-encoded:
$ LC_ALL=fr_FR ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\xff']
The result should not depend on the locale; it should be the same as:
$ LC_ALL=fr_FR.utf8 ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\udcff']
$ LC_ALL=C ./python -X utf8 -c 'import sys; print(ascii(sys.argv))' $(echo -ne "\xff")
['-c', '\udcff']
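The difference between the two outputs comes down to how the byte 0xFF is decoded. The snippet below reproduces both results in pure Python. It assumes that the fr_FR locale in the example uses ISO-8859-1, which is common but not stated in the message.

```python
raw = b"\xff"

# A single-byte locale encoding such as ISO-8859-1 maps the byte to U+00FF.
as_latin1 = raw.decode("iso-8859-1")
# UTF-8 with surrogateescape maps the undecodable byte to the lone
# surrogate U+DCFF.
as_utf8 = raw.decode("utf-8", "surrogateescape")

print(ascii(as_latin1))  # '\xff'
print(ascii(as_utf8))    # '\udcff'

# surrogateescape round-trips arbitrary bytes ...
assert as_utf8.encode("utf-8", "surrogateescape") == raw
# ... whereas re-encoding the Latin-1 result as UTF-8 yields different bytes.
assert as_latin1.encode("utf-8") == b"\xc3\xbf"
```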
msg285278 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 23:01
I only tested the PEP 540 implementation on Linux.
The PEP and its implementation should be adjusted for Windows, especially for Windows-only env vars like PYTHONLEGACYWINDOWSFSENCODING.
Changes may also be needed for Mac OS X and Android, which always use UTF-8. Currently, the locale encoding is still used on these platforms (e.g. by open()). Is it possible to have a locale encoding other than UTF-8 on Android, for example?
msg285280 - Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-11 23:57
> Hum, pep540-3.patch doesn't work if the locale encoding is different than ASCII and UTF-8. argv must be reencoded:
I want to skip reencoding.
In UTF-8 mode, arbitrary bytes on the command line (e.g. a broken filename passed by xargs) should be able to round-trip through UTF-8/surrogateescape.
I don't trust wcstombs/mbstowcs. They may not guarantee round-tripping of arbitrary bytes.
Can the -X utf8 option be processed before Py_Main()?
msg285296 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 09:18
> Can -X utf8 option be processed before Py_Main()?
I'm trying to implement that, but it's hard to factor out the common code. I will probably have to duplicate the code handling -E, -X utf8, PYTHONMALLOC and PYTHONUTF8 for wchar_t* (UCS4 or UTF-16) and char* (bytes).
msg285298 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 10:32
Hum, test_utf8mode lacks a unit test for the -E command line option:
PYTHONUTF8 should be ignored if -E is used.
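A check along these lines can be sketched with a child interpreter. This is not the test that was added to test_utf8mode; `utf8_mode_flag` is a helper invented for this example, and it assumes a released Python 3.7 or later where the flag is exposed as `sys.flags.utf8_mode`. Because the default value of the flag depends on the locale and the Python version, the sketch compares against a baseline instead of assuming 0.

```python
import os
import subprocess
import sys


def utf8_mode_flag(*options, env_value=None):
    """Run a child interpreter and return its sys.flags.utf8_mode as text."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)
    if env_value is not None:
        env["PYTHONUTF8"] = env_value
    cmd = [sys.executable, *options, "-c",
           "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("ascii").strip()


baseline = utf8_mode_flag()                    # PYTHONUTF8 unset
with_env = utf8_mode_flag(env_value="1")       # PYTHONUTF8=1 is honoured
ignored = utf8_mode_flag("-E", env_value="1")  # -E makes Python ignore it

assert with_env == "1"
assert ignored == baseline
```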
msg285325 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 13:31
Patch version 4:
* Handle PYTHONLEGACYWINDOWSFSENCODING: this env var now disables the UTF-8 mode and takes priority over -X utf8 and PYTHONUTF8
* Add a unit test on the PYTHONUTF8 env var and the -E cmdline option
* Add a unit test on the POSIX locale
* Fix initstdio() to correctly handle an empty PYTHONIOENCODING: this bug affects Python 3.6 as well and is not directly related to PEP 540
* Fix the handling of PYTHONUTF8 set to an empty string (ignore it)
* Skip a unit test in test_utf8mode which failed with the POSIX locale
Note: This patch still has the sys.argv encoding bug with locale encodings other than ASCII and UTF-8.
msg285332 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-12 16:45
encodings.py: enhanced version of pep540_cli.py that also reports the locale and filesystem encodings. It is a script to test the implementation of PEP 540 (and PEP 538).
msg285357 - Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-13 00:54
How about having locale.getpreferredencoding() return 'utf-8' in UTF-8 mode?
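For reference, released versions of Python (3.7 and later) document that locale.getpreferredencoding() returns UTF-8 when the UTF-8 mode is enabled. A quick way to observe this is to ask a child interpreter started with `-X utf8`; the spelling of the returned name varies between versions, so the sketch normalises it with `codecs.lookup()` before printing.

```python
import codecs
import subprocess
import sys

code = "import locale; print(locale.getpreferredencoding(False))"
out = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", code],
    stdout=subprocess.PIPE,
    check=True,
).stdout.decode("ascii").strip()

# Normalise the spelling ('UTF-8' vs 'utf-8') before comparing.
normalized = codecs.lookup(out).name
print(out, normalized)
```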
msg285407 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-13 15:27
Oh, I just noticed that os.environ uses the hardcoded error handler "surrogateescape": it should be replaced with sys.getfilesystemencodeerrors() to support UTF-8 Strict mode.
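The idea of replacing a hardcoded handler with the interpreter's configured one can be illustrated in pure Python. The helpers below are hypothetical and are not how os.environ is implemented; they only show the two `sys` calls involved, both of which exist in Python 3.6 and later. The standard `os.fsencode()` and `os.fsdecode()` functions already behave this way.

```python
import sys


def decode_os_data(raw: bytes) -> str:
    """Decode bytes from the OS using the interpreter's configured settings."""
    return raw.decode(sys.getfilesystemencoding(),
                      sys.getfilesystemencodeerrors())


def encode_os_data(text: str) -> bytes:
    """Inverse of decode_os_data()."""
    return text.encode(sys.getfilesystemencoding(),
                       sys.getfilesystemencodeerrors())


print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
value = decode_os_data(b"PATH")
assert encode_os_data(value) == b"PATH"
```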
msg285482 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2017-01-14 14:05
> it should be replaced with sys.getfilesystemencodeerrors() 
> to support UTF-8 Strict mode.
I did that in the patch for issue 28188. The focus of the patch is to add bytes support on Windows for os.putenv and os.environb, but I also tried to maximize consistency (at least parallel structure) between the POSIX and Windows implementations.
msg307694 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-05 22:12
I rebased my PR on master.
msg307695 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-05 22:12
I removed old patches in favor of the now up to date PR 855.
msg308182 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-13 01:21
PEP 538 has two open issues: bpo-30672 and bpo-32238.
I recently refactored the Py_Main() code so it should be simpler to implement PEP 540: see bpo-32030.
msg308183 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-13 01:25
Oh, the PYTHONCOERCECLOCALE env var is read very early in main() by _Py_CoerceLegacyLocale(), and it ignores the -E command line option.
 * Ignoring -E and -I is safe from a security perspective, as we only use
 * the setting to turn *off* the implicit locale coercion, and anyone with
 * access to the process environment already has the ability to set
 * `LC_ALL=C` to override the C level locale settings anyway.
msg308198 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-13 11:29
New changeset 91106cd9ff2f321c0f60fbaa09fd46c80aa5c266 by Victor Stinner in branch 'master':
bpo-29240: PEP 540: Add a new UTF-8 Mode (#855)
https://github.com/python/cpython/commit/91106cd9ff2f321c0f60fbaa09fd46c80aa5c266
msg308213 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-13 16:31
New changeset d5dda98fa80405db82e2eb36ac48671b4c8c0983 by Victor Stinner in branch 'master':
pymain_set_sys_argv() now copies argv (#4838)
https://github.com/python/cpython/commit/d5dda98fa80405db82e2eb36ac48671b4c8c0983
msg308217 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-13 16:46
test_readline failed. It seems to be related to my commit:
http://buildbot.python.org/all/#/builders/87/builds/360
======================================================================
FAIL: test_nonascii (test.test_readline.TestReadline)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/usr/home/buildbot/python/3.x.koobs-freebsd10/build/Lib/test/test_readline.py", line 219, in test_nonascii
 self.assertIn(b"text 't\\xeb'\r\n", output)
AssertionError: b"text 't\\xeb'\r\n" not found in bytearray(b"^A^B^B^B^B^B^B^B\t\tx\t\r\n[\303\257nserted]|t\x07\x08\x08\x08\x08\x08\x08\x08\x07\x07xrted]|t\x08\x08\x08\x08\x08\x08\x08\x07\r\nresult \'[\\xefnsexrted]|t\'\r\nhistory \'[\\xefnsexrted]|t\'\r\n")
msg308430 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-15 22:06
New changeset d2b02310acbfe6c978a8ad3cd3ac8b3f12927442 by Victor Stinner in branch 'master':
bpo-29240: Don't define decode_locale() on macOS (#4895)
https://github.com/python/cpython/commit/d2b02310acbfe6c978a8ad3cd3ac8b3f12927442
msg308448 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-16 03:54
New changeset 9454060e84a669dde63824d9e2fcaf295e34f687 by Victor Stinner in branch 'master':
bpo-29240, bpo-32030: Py_Main() re-reads config if encoding changes (#4899)
https://github.com/python/cpython/commit/9454060e84a669dde63824d9e2fcaf295e34f687
msg308915 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-21 23:09
New changeset 424315fa865b43f67e36a40647107379adf031da by Victor Stinner in branch 'master':
bpo-29240: Skip test_readline.test_nonascii() (#4968)
https://github.com/python/cpython/commit/424315fa865b43f67e36a40647107379adf031da
msg308916 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-21 23:11
IMHO test_readline should be fixed by ignoring the UTF-8 mode in Py_EncodeLocale/Py_DecodeLocale, but only when called from the Python readline module. We may need new functions, something like: Py_EncodeCurrentLocale/Py_DecodeCurrentLocale.
I will work on a patch when I am back from holiday. In the meantime, I skipped the test to repair the FreeBSD 3.x buildbots.
msg309782 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-10 21:46
New changeset 2cba6b85797ba60d67389126f184aad5c9e02ff3 by Victor Stinner in branch 'master':
bpo-29240: readline now ignores the UTF-8 Mode (#5145)
https://github.com/python/cpython/commit/2cba6b85797ba60d67389126f184aad5c9e02ff3
msg309798 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-11 09:38
New changeset cb3ae5588bd7733e76dc09277bb7626652d9bb64 by Victor Stinner in branch 'master':
bpo-29240: Ignore UTF-8 Mode in time module (#5148)
https://github.com/python/cpython/commit/cb3ae5588bd7733e76dc09277bb7626652d9bb64
msg309958 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-15 09:38
The attached test_all_locales.py is a test suite for locale functions: os.strerror(), locale.localeconv(), time.strftime(). I tested it on Linux Fedora 27, FreeBSD 11.0 and macOS 10.13.2.
The test should always pass on Python 2.7. On Python 3.6 and on the master branch with PR 5170, 2 tests on numeric localeconv() fail because Python uses the wrong encoding: see bpo-31900. master with PR 5170 now has fewer encoding bugs than Python 3.6.
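As a rough illustration of what such a suite exercises, the sketch below calls the three functions named above under the user's locale and collects the text they return. It is not the attached test_all_locales.py; `locale_text_samples` is a name made up for this example, and the real suite presumably switches between specific locales and compares against expected strings.

```python
import errno
import locale
import os
import time


def locale_text_samples():
    """Return text produced by the three locale-dependent functions."""
    try:
        # Use the user's locale; some environments have none configured.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass
    conv = locale.localeconv()
    return {
        "strerror": os.strerror(errno.ENOENT),
        "decimal_point": conv["decimal_point"],
        "thousands_sep": conv["thousands_sep"],
        "month_name": time.strftime("%B", time.gmtime(0)),
    }


for key, value in locale_text_samples().items():
    print(f"{key}: {value!a}")
```

An encoding bug in any of these functions typically shows up as mojibake or unexpected escapes in the printed values when a non-UTF-8 locale is active.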
msg309959 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-15 09:45
New changeset 7ed7aead9503102d2ed316175f198104e0cd674c by Victor Stinner in branch 'master':
bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)
https://github.com/python/cpython/commit/7ed7aead9503102d2ed316175f198104e0cd674c
msg310029 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-16 00:08
> New changeset 7ed7aead9503102d2ed316175f198104e0cd674c by Victor Stinner in branch 'master':
> bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)
Oh, this change broke test_nonascii() of test_readline() on FreeBSD.
Previously, readline used the ASCII/surrogateescape encoding for the POSIX locale. Now, mbstowcs() / wcstombs() is called directly, with the surrogateescape error handler.
msg310092 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-16 16:34
New changeset c495e799ed376af91ae2ddf6c4bcc592490fe294 by Victor Stinner in branch 'master':
Skip test_readline.test_nonascii() on C locale (#5203)
https://github.com/python/cpython/commit/c495e799ed376af91ae2ddf6c4bcc592490fe294
msg310097 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-16 17:27
New changeset c2740e8a263e76427a8102a89f4b491a3089b2a1 by Victor Stinner (Miss Islington (bot)) in branch '3.6':
Skip test_readline.test_nonascii() on C locale (GH-5203) (#5204)
https://github.com/python/cpython/commit/c2740e8a263e76427a8102a89f4b491a3089b2a1
msg310177 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 14:28
test_readline passes again on all buildbots, especially on the FreeBSD 3.6 and 3.x buildbots.
There are no more known issues; the implementation of PEP 540 (UTF-8 Mode) is now complete!
msg310443 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-22 18:07
New changeset 9089a265918754d95e105a7c4c409ac9352c87bb by Victor Stinner in branch 'master':
bpo-29240: PyUnicode_DecodeLocale() uses UTF-8 on Android (#5272)
https://github.com/python/cpython/commit/9089a265918754d95e105a7c4c409ac9352c87bb
msg310444 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-22 18:09
I partially reverted commit 7ed7aead9503102d2ed316175f198104e0cd674c: on Android, UTF-8 is now always used, again. Paul Peny (aka pmpp) confirmed to me that my commit broke Python on Android, at least with API 19 (locales don't work properly before API 21).
msg412665 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-06 20:51
> New changeset 91106cd9ff2f321c0f60fbaa09fd46c80aa5c266 by Victor Stinner in branch 'master':
> bpo-29240: PEP 540: Add a new UTF-8 Mode (#855)
> https://github.com/python/cpython/commit/91106cd9ff2f321c0f60fbaa09fd46c80aa5c266
Oh, this change broke the mbcs alias on Windows and the test_codecs and test_site tests (2 tests!) missed the bug :-( I fixed it in:
New changeset 04dd60e50cd3da48fd19cdab4c0e4cc600d6af30 by Victor Stinner in branch 'main':
bpo-46659: Update the test on the mbcs codec alias (GH-31168)
https://github.com/python/cpython/commit/04dd60e50cd3da48fd19cdab4c0e4cc600d6af30 
History
Date User Action Args
2022-04-11 14:58:41 admin set github: 73426
2022-02-08 11:52:04 yan12125 set nosy: - yan12125
2022-02-06 20:51:32 vstinner set messages: + msg412665
2018-01-22 18:09:52 vstinner set messages: + msg310444
2018-01-22 18:07:35 vstinner set messages: + msg310443
2018-01-22 16:48:36 vstinner set pull_requests: + pull_request5116
2018-01-17 14:28:41 vstinner set status: open -> closed; resolution: fixed; messages: + msg310177; stage: patch review -> resolved
2018-01-16 17:27:36 vstinner set messages: + msg310097
2018-01-16 16:34:45 python-dev set pull_requests: + pull_request5058
2018-01-16 16:34:37 vstinner set messages: + msg310092
2018-01-16 15:46:05 vstinner set pull_requests: + pull_request5057
2018-01-16 00:08:03 vstinner set messages: + msg310029
2018-01-15 11:17:23 vstinner set pull_requests: + pull_request5043
2018-01-15 09:45:56 vstinner set messages: + msg309959
2018-01-15 09:38:41 vstinner set files: + test_all_locales.py; messages: + msg309958
2018-01-13 00:23:31 vstinner set pull_requests: + pull_request5024
2018-01-11 09:38:07 vstinner set messages: + msg309798
2018-01-10 22:22:15 vstinner set pull_requests: + pull_request5005
2018-01-10 21:46:18 vstinner set messages: + msg309782
2018-01-10 17:59:22 vstinner set pull_requests: + pull_request5003
2017-12-21 23:11:01 vstinner set messages: + msg308916
2017-12-21 23:09:28 vstinner set messages: + msg308915
2017-12-21 22:51:06 vstinner set pull_requests: + pull_request4860
2017-12-16 03:54:25 vstinner set messages: + msg308448
2017-12-16 03:10:30 vstinner set pull_requests: + pull_request4793
2017-12-15 22:06:23 vstinner set messages: + msg308430
2017-12-15 21:18:45 vstinner set pull_requests: + pull_request4787
2017-12-13 16:46:03 vstinner set messages: + msg308217
2017-12-13 16:31:18 vstinner set messages: + msg308213
2017-12-13 14:04:28 vstinner set stage: patch review; pull_requests: + pull_request4727
2017-12-13 11:29:11 vstinner set messages: + msg308198
2017-12-13 01:25:01 vstinner set messages: + msg308183
2017-12-13 01:21:47 vstinner set messages: + msg308182
2017-12-05 22:12:54 vstinner set messages: + msg307695
2017-12-05 22:12:31 vstinner set files: - encodings.py
2017-12-05 22:12:14 vstinner set files: - pep540_cli.py
2017-12-05 22:12:14 vstinner set files: - pep540.patch
2017-12-05 22:12:13 vstinner set files: - pep540-2.patch
2017-12-05 22:12:12 vstinner set files: - pep540-3.patch
2017-12-05 22:12:11 vstinner set files: - pep540-4.patch
2017-12-05 22:12:00 vstinner set messages: + msg307694
2017-12-05 22:11:45 vstinner set title: [WIP] Implementation of the PEP 540: Add a new UTF-8 mode -> PEP 540: Add a new UTF-8 mode
2017-06-28 01:00:39 vstinner set title: Implementation of the PEP 540: Add a new UTF-8 mode -> [WIP] Implementation of the PEP 540: Add a new UTF-8 mode
2017-03-27 22:03:35 vstinner set pull_requests: + pull_request757
2017-03-27 22:03:20 vstinner set pull_requests: - pull_request15
2017-01-14 14:05:46 eryksun set nosy: + eryksun; messages: + msg285482
2017-01-13 15:27:23 vstinner set messages: + msg285407
2017-01-13 00:54:08 methane set messages: + msg285357
2017-01-12 16:45:20 vstinner set files: + encodings.py; messages: + msg285332
2017-01-12 13:31:42 vstinner set files: + pep540-4.patch; messages: + msg285325
2017-01-12 10:32:24 vstinner set messages: + msg285298
2017-01-12 10:19:41 yan12125 set nosy: + yan12125
2017-01-12 09:18:36 vstinner set messages: + msg285296
2017-01-11 23:57:12 methane set messages: + msg285280
2017-01-11 23:01:39 vstinner set messages: + msg285278
2017-01-11 23:00:07 vstinner set messages: + msg285277
2017-01-11 22:13:06 vstinner set files: + pep540-3.patch; messages: + msg285276
2017-01-11 22:04:22 vstinner set files: + pep540-2.patch; messages: + msg285275
2017-01-11 16:25:18 methane set nosy: + methane
2017-01-11 11:32:58 vstinner set messages: + msg285216
2017-01-11 11:27:22 vstinner set files: + pep540.patch; keywords: + patch; messages: + msg285215
2017-01-11 11:19:52 vstinner create
