Issue 9167: argv double encoding on OSX



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53413

classification

Title:	argv double encoding on OSX
Type:	behavior	Stage:	resolved
Components:	Interpreter Core, macOS, Unicode	Versions:	Python 3.1, Python 3.2

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	ronaldoussoren	Nosy List:	ezio.melotti, piro, r.david.murray, ronaldoussoren, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-07-05 16:07 by piro, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
test-argv.patch	piro, 2010-07-06 09:43

Messages (15)
msg109333 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2010-07-05 16:07
Looks like the wchar_t* array returned by Py_GetArgcArgv() on OSX suffers from double encoding. This can affect sys.argv, sys.executable and, of course, C code relying on that function.

On Linux:

$ python3
Python 3.0rc1+ (py3k, Oct 28 2008, 09:22:29)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xe2\x98\x83'
0

On OSX (uname -a is Darwin comicbookguy.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386):

$ python3
Python 3.1.2 (r312:79147, Jul 5 2010, 11:57:14)
[GCC 4.2.1 (Apple Inc. build 5659)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xc3\xa2\xc2\x98\xc2\x83'
0

Is this a known limitation of the platform? I don't know much about OSX; I just found it while testing for regressions in setproctitle <http://code.google.com/p/py-setproctitle/>.

Reported to work correctly on Windows.
msg109367 - (view)	Author: Ronald Oussoren (ronaldoussoren) * (Python committer)	Date: 2010-07-06 07:24
I cannot reproduce this with either 3.1.2 or 3.2a (py3k:80693); in both cases I get the same output as you do on Linux. This is on OSX 10.6 though, I haven't tested on 10.4 yet.

What is the output of the locale command on your OSX system? Mine says:

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

And what is the value of "__CF_USER_TEXT_ENCODING"? Mine is:

$ echo ${__CF_USER_TEXT_ENCODING}
0x1F6:0:0
msg109368 - (view)	Author: Ronald Oussoren (ronaldoussoren) * (Python committer)	Date: 2010-07-06 07:25
BTW, my 3.1 build is release31-maint:80235M, which is slightly newer than the 3.1.2 release.
msg109377 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2010-07-06 09:43
Attached patch with test cases to check sys.argv and sys.executable. The tests fail against the daily snapshot, so adding Python 3.2 to the affected versions.

Variable __CF_USER_TEXT_ENCODING is undefined. Locale of the system is C:

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
msg109386 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2010-07-06 12:16
I've run some other tests with LANG=C on other platforms. It seems to result in a clean error on Linux:

$ LANG=C ./here/bin/python3
Python 3.2a0 (py3k, Jul 6 2010, 12:40:29)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, os
>>> snowman = '\u2603'
>>> os.system((sys.executable + " -c 'import sys; print(sys.argv[-1].encode(\"utf8\"))' " + snowman).encode(sys.getdefaultencoding()))
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 0: surrogates not allowed
256

Notice that I had to use an explicit encoding, or os.system would have tried to encode using ascii and barf, probably because of bug #8775.

I've also been told about issue #4388: I've checked, and test_run_code() fails as described. So I think this bug can be considered a duplicate of #4388.
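The '\udce2' in the traceback above is consistent with the argument bytes having been decoded as ASCII with the surrogateescape error handler, which maps each undecodable byte 0xNN to the lone surrogate U+DCNN. The following is a minimal sketch under that assumption, not a trace of what the interpreter actually did; the variable names are illustrative only.

```python
# UTF-8 bytes of the snowman, as they would arrive in argv.
raw = '\u2603'.encode('utf-8')          # b'\xe2\x98\x83'

# Assumption: the bytes are decoded as ASCII with surrogateescape,
# so every non-ASCII byte becomes a lone surrogate (0xE2 -> U+DCE2).
decoded = raw.decode('ascii', errors='surrogateescape')

# A strict UTF-8 encode of lone surrogates fails, like in the traceback.
try:
    decoded.encode('utf-8')
except UnicodeEncodeError as exc:
    strict_error = exc
else:
    strict_error = None

# Encoding with the same error handler restores the original bytes.
roundtrip = decoded.encode('ascii', errors='surrogateescape')
```

The round trip only works when the same error handler is used in both directions, which is why the strict a.encode("utf8") in the test command fails instead of printing bytes.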
msg111327 - (view)	Author: Ronald Oussoren (ronaldoussoren) * (Python committer)	Date: 2010-07-23 14:17
Daniele: which version of OSX do you use? And if you use OSX 10.5 or 10.6: which is your system language according to System Preferences (the topmost entry in the list of the "Language and Text" preference pane, whose icon looks a little like a UN flag)?

I can only reproduce this by explicitly setting LANG=C before running the test on OSX 10.6 (with English as the main language).

This may be very hard to fix. What happens is that subprocess.Popen converts the argument array into the filesystem encoding (which on OSX is always UTF-8). The argv decoder then decodes the arguments using the encoding specified in LANG, which on your system is different from UTF-8. This results in a string where each byte in the UTF-8 encoding of snowman is represented as a single character. Those characters are then encoded as UTF-8 by the test, and that results in the error you're seeing.

That is, the output looks like the output of this code:

>>> snowman = '\u2603'
>>> snowman.encode('utf-8').decode('latin1').encode('utf-8')
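The chain described in the message above can be spelled out step by step. This is a small illustrative sketch: latin1 stands in for whatever single-byte decoding the LANG=C locale implied (as in the snippet above), and the variable names are not from the original report.

```python
snowman = '\u2603'

# Step 1: the parent process encodes the argument as UTF-8.
utf8_bytes = snowman.encode('utf-8')            # b'\xe2\x98\x83'

# Step 2: the child decodes argv byte by byte (latin1 used as a stand-in),
# producing three separate characters instead of one.
misdecoded = utf8_bytes.decode('latin1')

# Step 3: the test encodes that string as UTF-8 again.
double_encoded = misdecoded.encode('utf-8')
print(double_encoded)                           # b'\xc3\xa2\xc2\x98\xc2\x83'

# The damage is reversible as long as the wrong decoding was lossless.
repaired = double_encoded.decode('utf-8').encode('latin1').decode('utf-8')
```

The printed bytes match the output reported on OSX in the first message, which supports the double-encoding diagnosis.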
msg111342 - (view)	Author: Ronald Oussoren (ronaldoussoren) * (Python committer)	Date: 2010-07-23 15:01
Daniele: never mind, you already said you are on OSX 10.4.

The current behavior is only a problem when the system default encoding as implied by LANG is different from the filesystem encoding.

How to fix this is an entirely different question: most (all?) unix tools just work with byte-strings and pass those through unmodified. This means that with something like:

subprocess.Popen(['ls', snowman])

the snowman character should be encoded using the filesystem encoding, as that is the bytestring that the C APIs that ls calls expect. Note that encoding using the preferred encoding would result in an exception, as the snowman character cannot be encoded in ASCII or even latin1.

A possible workaround is to use CFStringGetSystemEncoding from CoreFoundation to get the system encoding when LANG=C (and probably guarded so that it is only active on OSX releases before 10.5).

Another workaround: upgrade from OSX 10.4 to at least OSX 10.5 ;-)
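To see whether a given system is in the mismatched state described above, one can compare the two encodings directly. The sketch below uses only standard library calls; can_encode is a hypothetical helper introduced for illustration, and the printed values depend on the platform and environment.

```python
import locale
import sys

snowman = '\u2603'

# Encoding Python uses for file names and, on POSIX, for process arguments.
fs_encoding = sys.getfilesystemencoding()

# Encoding implied by the locale (LANG / LC_ALL / LC_CTYPE).
locale_encoding = locale.getpreferredencoding(False)

print('filesystem encoding:', fs_encoding)
print('locale encoding:    ', locale_encoding)


def can_encode(text, encoding):
    """Return True if text can be encoded strictly with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The snowman fits in UTF-8 but not in ASCII or latin1, so encoding the
# argument with a C-locale encoding would raise instead of passing it on.
for name in ('utf-8', 'ascii', 'latin-1'):
    print(name, can_encode(snowman, name))
```

If the two reported encodings differ, arguments encoded with one and decoded with the other can be garbled in the way this issue describes.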
msg111402 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-07-24 00:01
> This may be very hard to fix

I wrote a patch to fix this problem: see #8775.
msg111470 - (view)	Author: Ronald Oussoren (ronaldoussoren) * (Python committer)	Date: 2010-07-24 12:47
Using the CF API to fetch the system encoding won't work. Using PyObjC:

>>> CFStringConvertEncodingToIANACharSetName(CFStringGetSystemEncoding())
u'macintosh'

There doesn't seem to be another way to extract the preferred encoding from the system.

I see two possible resolutions for this issue:

* Close as won't fix. This is technically a platform issue that has been fixed in OSX 10.5.

* Add a workaround that explicitly sets os.environ['LANG'] to 'en_US.UTF-8' before converting argument and environment values to Unicode (only on OSX < 10.5, when LANG=C, and of course resetting the previous value after conversion).

I have a 10.4 system I could develop this on, but that's currently in a different country than me.
msg111565 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-07-25 22:23
Issue #8622 proposes the creation of an environment variable PYTHONFSENCODING. It will be used to set sys.getfilesystemencoding(). Would it help this issue?
msg111602 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2010-07-26 11:38
Ronald,

Thank you for the interest. For me, trying to deal with such a tricky issue on a system whose Best Before date has already passed would be a waste of time. I was only interested in separating the bugs in my extension module from the ones not under my responsibility, and I had the bad luck to find a 10.4 to test on.

I don't have a direct interest in this bug being fixed. Thank you very much again for your time.
msg119254 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-10-21 00:54
I just closed #4388 with r85765 (Python 3.2): always use UTF-8 to decode the command line arguments on Mac OS X, not the locale encoding. I suppose that it does fix this issue. Can someone check that?
msg119262 - (view)	Author: Ronald Oussoren (ronaldoussoren) * (Python committer)	Date: 2010-10-21 05:51
Thank you. I'll check, but probably only sometime next week.
msg119358 - (view)	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2010-10-22 01:03
rdmurray@buddy:~/python/py3k>uname -a
Darwin buddy.home.bitdance.com 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386

rdmurray@buddy:~/python/release31-maint>LC_ALL="C" ./python.exe
Python 3.1.2 (release31-maint:85783, Oct 21 2010, 20:31:06)
[GCC 4.2.1 (Apple Inc. build 5659)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xc3\xa2\xc2\x98\xc2\x83'
0

rdmurray@buddy:~/python/py3k>LC_ALL="C" ./python.exe
Python 3.2a3+ (py3k:85768, Oct 21 2010, 12:31:12)
[GCC 4.2.1 (Apple Inc. build 5659)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xe2\x98\x83'
0
msg119370 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-10-22 08:58
FYI, you should use ascii() instead of a.encode(\"utf8\") to dump arguments. It's easier to check '\u2603' than b'\xe2\x98\x83' for me :-)

So the bug is fixed in Python 3.2, great! I was thinking that we need a test for that, but then I remembered that I already wrote such a test :-) My test checks 3 unicode characters: \xe9, \u20ac, \U0010ffff; but also invalid byte sequences:

text = (
    b'\xff'         # invalid byte
    b'\xc3\xa9'     # valid utf-8 character
    b'\xc3\xff'     # invalid byte sequence
    b'\xed\xa0\x80' # lone surrogate character (invalid)
)

And it should be enough :-) See test_osx_utf8() of test_cmd_line to see the whole test.
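As an illustration of the two points in the last message, the sketch below dumps a string with ascii() and pushes the byte sequence from the test through a UTF-8 decode. The use of the surrogateescape error handler here is an assumption made for the example; see test_osx_utf8() for what the real test checks.

```python
# ascii() gives an escaped, unambiguous view of non-ASCII text.
print(ascii('\u2603'))          # '\u2603'

# Byte sequence quoted in the message above: a mix of valid and invalid UTF-8.
text = (
    b'\xff'          # invalid byte
    b'\xc3\xa9'      # valid utf-8 character (U+00E9)
    b'\xc3\xff'      # invalid byte sequence
    b'\xed\xa0\x80'  # encoded lone surrogate (invalid in UTF-8)
)

# Assumption: decode as UTF-8, escaping undecodable bytes as lone surrogates.
decoded = text.decode('utf-8', errors='surrogateescape')
print(ascii(decoded))

# With the same handler on the way back, no byte is lost.
restored = decoded.encode('utf-8', errors='surrogateescape')
```

The property worth checking is the lossless round trip: whatever bytes arrive on the command line can be recovered exactly, even when they are not valid UTF-8.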

History
Date	User	Action	Args
2022-04-11 14:57:03	admin	set	github: 53413
2010-10-22 08:58:23	vstinner	set	messages: + msg119370
2010-10-22 01:03:08	r.david.murray	set	status: open -> closed nosy: + r.david.murray messages: + msg119358 resolution: fixed stage: test needed -> resolved
2010-10-21 05:51:14	ronaldoussoren	set	messages: + msg119262
2010-10-21 00:54:28	vstinner	set	messages: + msg119254
2010-07-26 11:38:12	piro	set	messages: + msg111602
2010-07-25 22:23:49	vstinner	set	messages: + msg111565
2010-07-24 12:47:01	ronaldoussoren	set	messages: + msg111470
2010-07-24 00:01:38	vstinner	set	messages: + msg111402
2010-07-23 15:01:24	ronaldoussoren	set	messages: + msg111342
2010-07-23 14:17:22	ronaldoussoren	set	messages: + msg111327
2010-07-06 12:16:51	piro	set	messages: + msg109386
2010-07-06 09:43:12	piro	set	files: + test-argv.patch keywords: + patch messages: + msg109377 versions: + Python 3.2
2010-07-06 07:25:42	ronaldoussoren	set	messages: + msg109368
2010-07-06 07:24:50	ronaldoussoren	set	messages: + msg109367
2010-07-05 16:47:40	ezio.melotti	set	nosy: + ezio.melotti, ronaldoussoren assignee: ronaldoussoren components: + macOS, Unicode stage: test needed
2010-07-05 16:32:48	r.david.murray	set	nosy: + vstinner
2010-07-05 16:07:52	piro	create
