Issue 8611: Python3 doesn't support locale different than utf8 and an non-ASCII path (POSIX)



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/52857

classification

Type:	Stage:
Title:	Python3 doesn't support locale different than utf8 and an non-ASCII path (POSIX)
Components:	Interpreter Core, Unicode	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:	9425	Superseder:
Assigned To:	Nosy List:	Arfrever, asvetlov, brett.cannon, georg.brandl, pitrou, r.david.murray, vstinner
Priority:	release blocker	Keywords:	patch

Created on 2010-05-04 13:30 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (26)
msg104932	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-05-04 13:30
Python3 is unable to start (bootstrap failure) on a POSIX system if the locale encoding is different from utf8 and the Python path (standard library path where the encoding module is stored) contains a non-ASCII character. (Windows and Mac OS X are not affected by this issue because the file system encoding is hardcoded.)

 - Py_FileSystemDefaultEncoding == NULL
 - calculate_path(): sys.path is filled with directory names decoded with the locale encoding
 - find_module() encodes each path using PyUnicode_AsEncodedString(..., Py_FileSystemDefaultEncoding, NULL): it uses the "utf-8" encoding because Py_FileSystemDefaultEncoding is NULL

=> error because the path is not encoded and decoded with the same encoding

We cannot encode a path with the locale encoding because we need find_module() to load the encoding codec, and loading the codec needs find_module()... (bootstrap error :-))

We should decode the path using a fixed encoding (e.g. ASCII or utf-8), use the same encoding to encode paths in find_module(), and then reencode the paths of all objects storing filenames:

 - sys.path list items
 - sys.modules dict keys
 - sys.modules values: each module has __file__ and/or __path__ attributes
 - all code objects (co_filename)
 - (maybe some others?)

The error occurs in an early stage of Py_InitializeEx(), so the object list is limited and we control this list (e.g. site is not loaded yet).

Related issues:

 - #8610: "Python3/POSIX: errors if file system encoding is None"
 - #8242: "Improve support of PEP 383 (surrogates) in Python3: meta-issue"
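
The decode/encode mismatch described in msg104932 can be modelled in pure Python. This is only an illustrative sketch: the directory name and the choice of ISO-8859-1 as the locale encoding are assumptions, and the code imitates the two C-level steps with explicit codecs instead of calling calculate_path() or find_module().

```python
# Hypothetical install prefix, as the raw bytes stored on disk
# by a system using an ISO-8859-1 locale (0xE9 is "é").
raw_path = b"/opt/py3k\xe9/lib"

# Step modelled on calculate_path(): decode with the locale encoding.
decoded = raw_path.decode("iso-8859-1")

# Step modelled on find_module(): encode with UTF-8, the fallback used
# while Py_FileSystemDefaultEncoding is still NULL.
reencoded = decoded.encode("utf-8")

print(raw_path)
print(reencoded)
print("same bytes:", raw_path == reencoded)
```

The re-encoded bytes name a directory that does not exist on disk, which is why the import of the codec itself fails.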
msg104934	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2010-05-04 13:36
We could have a separate list storing the original bytes form of sys.path; this list would be used by find_module() as long as Py_FileSystemDefaultEncoding isn't initialized.
msg104935	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2010-05-04 13:39
Or find_module() could use wcstombs() as long as Py_FileSystemDefaultEncoding is NULL.
msg104941	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-05-04 14:16
I have a patch implementing most of the points described in my first message. I have to rework it before submitting it. The patch depends on other issues, and I prefer to fix all related issues first.
msg105241	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-05-07 22:35
Let's try something: pyunicode_asencodefsdefault.patch adds a PyUnicode_EncodeFSDefault() function to make the conversion from unicode to bytes uniform. It falls back to UTF-8 if Py_FileSystemDefaultEncoding is not set (it should be ASCII, not UTF-8) and uses the surrogateescape error handler.
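
To show why the surrogateescape error handler matters here, the following sketch round-trips an undecodable file name through str and back. It assumes an ASCII codec and a made-up directory name; it demonstrates the PEP 383 handler at the Python level and is not the code of the patch.

```python
# Hypothetical directory name containing one byte (0xE9) that ASCII cannot decode.
raw_name = b"py3k\xe9"

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9
# instead of raising UnicodeDecodeError.
text = raw_name.decode("ascii", "surrogateescape")

# Encoding with the same codec and handler restores the original byte.
round_tripped = text.encode("ascii", "surrogateescape")

print(ascii(text))
print(round_tripped == raw_name)
```

At the Python level, os.fsencode() and os.fsdecode() apply the filesystem encoding with this handler on POSIX.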
msg105723	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-05-14 16:57
I opened a separate issue for the new function PyUnicode_EncodeFSDefault(): #8715.
msg106097	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-05-19 20:45
See also #4352.
msg106103	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-05-19 21:07
If I understood correctly, this issue is a regression introduced by r67055 (to fix #4213). Read: http://bugs.python.org/issue4213#msg75387
See also r67057 (issue #3723).
msg106154	Author: Andrew Svetlov (asvetlov) * (Python committer)	Date: 2010-05-20 14:10
After looking deeply into #4352, I figured out that a true separation of the filesystem default encoding and the utf-8 Python namespace is really too complicated. For example, the import call chain converts the module name from utf-8 to the filesystem encoding in import.c:find_module. After that, the converted name is used by PyImport_ExecCodeModule* as a utf-8 name while it actually has the filesystem encoding. That problem cannot be solved by a "five-line patch", and Martin von Löwis suggested that I stop potentially dangerous big import.c changes in the Python 3.1 beta.

I like the importlib way (with maybe a C implementation as the next step) as the "true way" to reorganize the Python import machinery, but unfortunately Cannon has no time for that.

From my perspective only a big refactoring can solve the encoding issues (and we can use the excellent io implementation to open utf-8 named files on Windows using native unicode functions). We need to cleanly split 'module names' from 'filesystem paths'. Maybe pure Python importing is not easy, I'm not sure. But a reorganization of the current 'import spaghetti' is required. importlib (and PEP 302) introduced a nice way to do that.

I would like to volunteer for this task, and I feel I have enough knowledge to implement it and cover it with tests on at least Windows and Linux (Mac OS is not a big problem either). But I need a mentor (Pitrou, Cannon: you are welcome) to get it done cleanly and stably, in a reasonable time period.
msg106159	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-05-20 15:22
As I wrote, I have a huge patch somewhere on my hard drive fixing this issue. But I don't want to publish it because it's really huge. I prefer to fix the problem step by step. I fixed most related issues: see the dependency list of #8242. I will publish the big patch shortly.
msg106337	Author: Andrew Svetlov (asvetlov) * (Python committer)	Date: 2010-05-23 17:14
I'm skeptical about surrogates particularly for that problem. From my perspective the solution is only to use native unicode support for windows file operation functions. Conversions utf-8 -> mbcs -> utf8 will lose encoding information thanks to tricky Microsoft mbcs encoding schema. If I'm wrong please correct me.
msg106474	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-05-25 20:55
asvetlov> I'm skeptical about surrogates particularly for that
asvetlov> problem. From my perspective the solution is only to use
asvetlov> native unicode support for windows file operation functions.

It's not exclusive. We can use surrogates on POSIX and then convert to bytes at the system calls, and use the unicode version of the Windows API. In both cases, filenames are unicode.

asvetlov> Conversions utf-8 -> mbcs -> utf8 will lose encoding
asvetlov> information thanks to tricky Microsoft mbcs encoding schema.
asvetlov> If I'm wrong please correct me.

On Windows, Python3 does convert unicode to bytes with the mbcs encoding in the import machinery. I tested and Python3 has the same problem on Windows with non-decodable filenames as Python3 on Unix. E.g. add the "\u0809" character (a random non-encodable character) to the Python directory name: Python3 doesn't start if the code page cannot encode/decode it.

To fix all OSes (Windows and POSIX), the Python3 import machinery should not convert filenames to bytes but manipulate unicode characters, and only convert filenames to bytes on POSIX at the last moment (at system calls).

--

The mbcs codec ignores the error handler: it replaces unknown characters by "?" by default, see #850997.
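
The information loss discussed in this exchange can be illustrated without a Windows machine. The sketch below is an approximation under stated assumptions: it uses the cp1252 codec with the "replace" error handler as a stand-in for the "?" substitution attributed to mbcs above (mbcs itself is only available on Windows), and the directory name is made up.

```python
# Hypothetical directory name containing U+0809, which cp1252 cannot encode.
name = "py3k\u0809"

# Strict encoding fails outright.
try:
    name.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Stand-in for the lossy behaviour: the unknown character becomes "?".
lossy = name.encode("cp1252", "replace")
restored = lossy.decode("cp1252")

print(strict_ok)
print(lossy)
print(restored == name)
```

Once the character has been replaced, decoding cannot recover the original name, so a path converted this way no longer points at the real directory.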
msg108569	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-06-24 23:54
I think that #8988 is a duplicate of this issue.
msg109025	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-06-30 23:20
See also #3080.
msg112031	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-07-30 00:23
I posted a patch to fix this issue: see #9425.
msg112119	Author: Georg Brandl (georg.brandl) * (Python committer)	Date: 2010-07-31 07:55
This will have to wait until after alpha1, as well.
msg115637	Author: Georg Brandl (georg.brandl) * (Python committer)	Date: 2010-09-05 08:15
The Unicode import system won't be put in place before 3.2a2, deferring.
msg115944	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-09-09 12:50
See also #9713 (Py_CompileString fails on non decode-able paths) and #9738 (Document the encoding of functions bytes arguments of the C API).
msg118324	Author: Georg Brandl (georg.brandl) * (Python committer)	Date: 2010-10-10 09:32
Deferring once again.
msg118908	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-10-17 00:31
Status of this issue, 5 months later: most tests pass except test_gc, test_gdb, test_runpy, test_sys, test_wsgiref and test_zipimport. In other words, 95% of the task (or more?) is done. It's possible to run Python installed in a non-ascii directory with any locale (I tested ascii, iso-8859-1 and utf-8).
msg118967	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-10-17 19:20
Updated list of failing tests with py3k and a non-ascii path:

 * Linux, LANG=C: test_gc test_gdb test_runpy test_zipimport
 * Windows: test_email test_httpservers test_zipimport

Possible reasons:

 * test_httpservers (CGIHTTPServerTestCase.setUp): the test should be skipped if sys.executable is not pure ASCII (and it's not possible to create an ASCII path using a symlink)
 * test_zipimport: zipimport uses utf-8 (in strict mode) for the prefix, instead of the filesystem encoding
 * test_gc (test_get_count): "The following two tests are fragile: ..." :-/
 * test_gdb: libpython doesn't support surrogates in paths
 * test_email: issue with the end of line (\n vs \r\n?)
 * test_runpy: ?
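
The suggestion to skip a test when sys.executable is not pure ASCII could look roughly like the sketch below. The helper is_ascii_path() and the test class are hypothetical names invented for illustration; this is not the change that was committed to test_httpservers.

```python
import sys
import unittest


def is_ascii_path(path):
    """Return True if the path can be encoded as pure ASCII.

    Hypothetical helper: paths holding non-ASCII characters or
    surrogate-escaped bytes both fail the strict ASCII encode.
    """
    try:
        path.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


class ExampleExecutableTest(unittest.TestCase):
    # Skip instead of failing when the interpreter lives in a non-ASCII directory.
    @unittest.skipUnless(is_ascii_path(sys.executable),
                         "sys.executable is not pure ASCII")
    def test_needs_ascii_executable(self):
        self.assertTrue(is_ascii_path(sys.executable))
```

Encoding with the strict ASCII codec, rather than checking character ranges by hand, also rejects the lone surrogates produced by surrogateescape.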
msg118976	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-10-17 20:03
r85655 fixed the test_gdb failure. The test_runpy failure seems to be linked to the test_zipimport problems.
msg118979	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-10-17 20:19
r85659 + r85662 + r85663 fixed test_httpservers.
msg118991	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2010-10-17 23:45
Victor, can you paste or attach the error for email? My MSDN subscription has expired so I can't set up to test it myself (I've submitted the renewal, but who knows how long it will take to process :)
msg118997	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-10-18 03:49
> Victor, can you paste or attach the error for email?

It doesn't seem to be related to the path name (same failure with a "py3ké" or "py3k" directory name), so I opened #10134.
msg119098	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-10-19 01:02
Starting at r85691, the full test suite of Python 3.2 passes with ASCII, ISO-8859-1 and UTF-8 locale encodings in a non-ascii directory. The work on this issue is done.

History
Date	User	Action	Args
2022-04-11 14:57:00	admin	set	github: 52857
2010-10-19 01:02:50	vstinner	set	status: open -> closed resolution: fixed messages: + msg119098
2010-10-18 03:49:57	vstinner	set	messages: + msg118997
2010-10-17 23:45:46	r.david.murray	set	nosy: + r.david.murray messages: + msg118991
2010-10-17 20:19:12	vstinner	set	messages: + msg118979
2010-10-17 20:03:10	vstinner	set	messages: + msg118976
2010-10-17 19:20:54	vstinner	set	messages: + msg118967
2010-10-17 00:31:22	vstinner	set	messages: + msg118908
2010-10-12 12:41:49	georg.brandl	set	priority: deferred blocker -> release blocker
2010-10-10 09:32:57	georg.brandl	set	priority: release blocker -> deferred blocker messages: + msg118324
2010-09-09 12:50:09	vstinner	set	messages: + msg115944
2010-09-06 08:26:36	georg.brandl	set	priority: deferred blocker -> release blocker
2010-09-05 08:15:44	georg.brandl	set	priority: release blocker -> deferred blocker messages: + msg115637
2010-07-31 18:24:55	georg.brandl	set	priority: deferred blocker -> release blocker
2010-07-31 07:55:45	georg.brandl	set	priority: release blocker -> deferred blocker nosy: + georg.brandl messages: + msg112119 dependencies: + Rewrite import machinery to work with unicode paths
2010-07-30 00:23:56	vstinner	set	messages: + msg112031
2010-06-30 23:20:35	vstinner	set	messages: + msg109025
2010-06-28 12:16:32	ncoghlan	set	priority: normal -> release blocker
2010-06-24 23:54:54	vstinner	set	messages: + msg108569
2010-05-25 20:55:20	vstinner	set	messages: + msg106474
2010-05-23 17:14:57	asvetlov	set	messages: + msg106337
2010-05-20 15:22:33	vstinner	set	messages: + msg106159
2010-05-20 14:11:02	asvetlov	set	nosy: + brett.cannon
2010-05-20 14:10:28	asvetlov	set	nosy: + asvetlov messages: + msg106154
2010-05-19 21:07:38	vstinner	set	messages: + msg106103
2010-05-19 20:45:52	vstinner	set	messages: + msg106097
2010-05-15 12:40:20	vstinner	link	issue8725 dependencies
2010-05-14 16:57:24	vstinner	set	messages: + msg105723
2010-05-14 16:56:54	vstinner	set	files: - pyunicode_encodefsdefault.patch
2010-05-07 22:36:02	vstinner	set	files: + pyunicode_encodefsdefault.patch keywords: + patch messages: + msg105241
2010-05-04 18:00:56	Arfrever	set	nosy: + Arfrever
2010-05-04 14:16:14	vstinner	set	messages: + msg104941
2010-05-04 13:39:41	pitrou	set	messages: + msg104935
2010-05-04 13:36:43	pitrou	set	nosy: + pitrou messages: + msg104934
2010-05-04 13:30:49	vstinner	create
