This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in Python's Developer Guide.
Created on 2018-08-06 16:06 by Michael.Felt, last changed 2022-04-11 14:59 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| pEpkey.asc | Michael.Felt, 2018-08-06 20:26 | |
| Pull Requests | | |
|---|---|---|
| URL | Status | Linked |
| PR 8923 | merged | Michael.Felt, 2018-08-25 13:38 |
| PR 14233 | merged | Michael.Felt, 2019-06-19 14:04 |
| Messages (18) | |||
|---|---|---|---|
| msg323214 - (view) | Author: Michael Felt (Michael.Felt) * | Date: 2018-08-06 16:06 | |
The test fails because
byte_str.decode('ascii', 'surrogateescape')
is not what ascii(byte_str) returns when called from the command line.
Assumption: since "check('utf8', [arg_utf8])" succeeds, I assume the parsing of the command line is correct.
DETAILS
>>> arg = 'h\xe9\u20ac'.encode('utf-8')
>>> arg
b'h\xc3\xa9\xe2\x82\xac'
>>> arg.decode('ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
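A minimal sketch of what the `surrogateescape` handler does with these bytes, using only the standard codecs (the variable names are illustrative): each undecodable byte 0xNN becomes the lone surrogate U+DCNN, and encoding with the same handler restores the original bytes.

```python
# Bytes handed to the child interpreter by the test.
arg = 'h\xe9\u20ac'.encode('utf-8')

# ASCII cannot decode bytes >= 0x80; surrogateescape maps each such
# byte 0xNN to the lone surrogate code point U+DCNN instead of failing.
escaped = arg.decode('ascii', 'surrogateescape')
code_points = [hex(ord(ch)) for ch in escaped]

# Encoding with the same handler reverses the mapping exactly.
round_trip = escaped.encode('ascii', 'surrogateescape')

print(code_points)
print(round_trip == arg)
```

This is why the `\udc..` form shows up as the expected value: it is a lossless text stand-in for bytes that the ASCII codec cannot interpret.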
I am having a difficult time getting the syntax correct for all the "escapes", so I added a print statement in the check routine:
test_cmd_line (test.test_utf8_mode.UTF8ModeTests) ...
code:import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:]))) arg:b'h\xc3\xa9\xe2\x82\xac'
out:UTF-8:['h\xe9\u20ac']
code:import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:]))) arg:b'h\xc3\xa9\xe2\x82\xac'
out:ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
test code with my debug statement (to generate above):
def test_cmd_line(self):
    arg = 'h\xe9\u20ac'.encode('utf-8')
    arg_utf8 = arg.decode('utf-8')
    arg_ascii = arg.decode('ascii', 'surrogateescape')
    code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'

    def check(utf8_opt, expected, **kw):
        out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
        print("\ncode:%s arg:%s\nout:%s" % (code, arg, out))
        args = out.partition(':')[2].rstrip()
        self.assertEqual(args, ascii(expected), out)

    check('utf8', [arg_utf8])
    if sys.platform == 'darwin' or support.is_android:
        c_arg = arg_utf8
    else:
        c_arg = arg_ascii
    check('utf8=0', [c_arg], LC_ALL='C')
So the first check succeeds:
check('utf8', [arg_utf8])
But the second does not:
FAIL: test_cmd_line (test.test_utf8_mode.UTF8ModeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/prj/python/src/python3-3.7.0/Lib/test/test_utf8_mode.py", line 225, in test_cmd_line
check('utf8=0', [c_arg], LC_ALL='C')
File "/data/prj/python/src/python3-3.7.0/Lib/test/test_utf8_mode.py", line 218, in check
self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
: ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
I tried setting "expected" to arg, but arg is still a bytes object, while the command-line result is a str (and is printed as such).
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
? +
: ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
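A small sketch of why passing the bytes object as "expected" can never match. The `child_output` string is copied from the failing run above; everything else is plain built-in behaviour. Entries of `sys.argv` are always `str`, so the child's `ascii()` output never carries the `b` prefix that `ascii()` adds for a bytes object.

```python
arg = 'h\xe9\u20ac'.encode('utf-8')

# What the assertion compares against when expected == [arg] (bytes).
expected_from_bytes = ascii([arg])

# What the child printed in the failing run (copied from the output above).
child_output = "['h\\xc3\\xa9\\xe2\\x82\\xac']"

print(expected_from_bytes)
print(expected_from_bytes == child_output)
# The two strings differ only by the bytes prefix.
print(expected_from_bytes.replace("b'", "'", 1) == child_output)
```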
|
|||
| msg323222 - (view) | Author: Michael Felt (Michael.Felt) * | Date: 2018-08-06 20:10 | |
In short, I do not understand how this passes on Linux.
This is python3-3.4.6 on sles12:
>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('us-ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
>>>
This is python3-3.7.0 on AIX:
>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('us-ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
If I am missing something essential here - please be blunt!
|
|||
| msg323223 - (view) | Author: Michael Felt (Michael.Felt) * | Date: 2018-08-06 20:26 | |
Also seeing the same with Windows.
C:\Users\MICHAELFelt>python
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('ascii','surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
>>>
|
|||
| msg323250 - (view) | Author: Michael Felt (Michael.Felt) * | Date: 2018-08-07 20:23 | |
Common "experts" - feedback needed!
Original:
test test_utf8_mode failed -- Traceback (most recent call last):
File "/data/prj/python/git/python3-3.8/Lib/test/test_utf8_mode.py", line 225, in test_cmd_line
check('utf8=0', [c_arg], LC_ALL='C')
File "/data/prj/python/git/python3-3.8/Lib/test/test_utf8_mode.py", line 217, in check
self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
: ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
Modification #1:
if sys.platform == 'darwin' or support.is_android:
    c_arg = arg_utf8
elif sys.platform.startswith("aix"):
    c_arg = arg_ascii.encode('utf-8', 'surrogateescape')
else:
    c_arg = arg_ascii
check('utf8=0', [c_arg], LC_ALL='C')
Result:
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
? +
: ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
Modification #2:
if sys.platform == 'darwin' or support.is_android:
    c_arg = arg_utf8
elif sys.platform.startswith("aix"):
    c_arg = arg
else:
    c_arg = arg_ascii
check('utf8=0', [c_arg], LC_ALL='C')
Result:
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
? +
: ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
The "expected" continues to be a "bytes" object, while the CLI code returns a non-byte string.
Or - the original has an ascii string object but uses \udc rather than \x
\udc is common (i.e., I see it frequently in googled results on other things) - should something in ascii() be changed to output \udc rather than \x ?
Thx!
|
|||
| msg323319 - (view) | Author: Michael Felt (Michael.Felt) * | Date: 2018-08-09 11:55 | |
Starting this discussion again. Please take time to read. I have spent hours trying to understand what is failing. Please spend a few minutes reading.
Sadly, there is a lot of text - but I do not know what I could leave out without damaging the process of discovery.
The failing result is:
self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
: ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
The test code is:
+207     @unittest.skipIf(MS_WINDOWS, 'test specific to Unix')
+208     def test_cmd_line(self):
+209         arg = 'h\xe9\u20ac'.encode('utf-8')
+210         arg_utf8 = arg.decode('utf-8')
+211         arg_ascii = arg.decode('ascii', 'surrogateescape')
+212         code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'
+213
+214         def check(utf8_opt, expected, **kw):
+215             out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
+216             args = out.partition(':')[2].rstrip()
+217             self.assertEqual(args, ascii(expected), out)
+218
+219         check('utf8', [arg_utf8])
+220         if sys.platform == 'darwin' or support.is_android:
+221             c_arg = arg_utf8
+222         else:
+223             c_arg = arg_ascii
+224         check('utf8=0', [c_arg], LC_ALL='C')
Question 1: Why is Windows excluded? Because it does not use UTF-8 as its default (its default is CP1252)
Question 2: It seems that what the test is 'checking' is that object.encode('utf-8') gets decoded by ascii() based on the utf8_mode set.
+215 out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
rewrites (less indent) as:
+215 out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)
or
out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)
Finally, in Lib/test/support/script_helper.py we have
+127     print("\n", cmd_line) # debug info, ignore
+128     proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
+129                             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
+130                             env=env, cwd=cwd)
Which gives:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
Above - utf8=1 - is successful
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
Here: utf8=0 fails. The arg to the CLI is equal in both cases.
FAIL
## Going back to check() and what it has:
## Add some debug. The first line is the 'raw' expected,
## the second line is ascii(expected),
## the final line is the value extracted from get_output
+214         def check(utf8_opt, expected, **kw):
+215             out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
+216             args = out.partition(':')[2].rstrip()
+217             print("")
+218             print("%s: expected\n%s:ascii(expected)\n%s:out" % (expected, ascii(expected), out))
+219             self.assertEqual(args, ascii(expected), out)
For: utf8 mode true, it works:
['h▒\u20ac']: expected
['h\xe9\u20ac']:ascii(expected)
UTF-8:['h\xe9\u20ac']:out
+221 check('utf8', [arg_utf8])
But not for utf8=0
+226 check('utf8=0', [c_arg], LC_ALL='C')
# note, different values for LC_ALL='C' have been tried
['h\udcc3\udca9\udce2\udc82\udcac']: expected
['h\udcc3\udca9\udce2\udc82\udcac']:ascii(expected)
ISO8859-1:['h\xc3\xa9\xe2\x82\xac']:out
## re: expected and ascii(expected)
When utf8=1 expected and ascii(expected) differ. "arg" looks different from both - but after processing by get_output() expected and out match.
When utf8=0 there is no difference in the "arg" passed to "code".
However, within check - the values for both expected and ascii(expected) are identical. And, sadly, the value coming back via get_output looks nothing like 'expected'.
In short, when utf8=1 ascii(b'h\xc3\xa9\xe2\x82\xac') becomes 'h\xe9\u20ac', which is what is desired. But when utf8=0 ascii(b'h\xc3\xa9\xe2\x82\xac') is b'h\xc3\xa9\xe2\x82\xac', not 'h\udcc3\udca9\udce2\udc82\udcac'
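The parent passes identical bytes in both runs; what differs is how the child turns them into a `str` before `ascii()` ever sees them. The sketch below imitates three possible decodings with Python codecs. That is an assumption made for illustration only: the interpreter decodes its arguments in C at startup, not through these codec calls.

```python
arg = b'h\xc3\xa9\xe2\x82\xac'

# Three ways the same five non-ASCII bytes can become text.
candidates = {
    'utf-8': arg.decode('utf-8'),
    'iso-8859-1': arg.decode('iso-8859-1'),
    'ascii+surrogateescape': arg.decode('ascii', 'surrogateescape'),
}

# ascii() of a one-element list, as printed by the child process.
shown = {name: ascii([text]) for name, text in candidates.items()}

for name, text in shown.items():
    print(name, text)
```

The first form matches the passing `UTF-8:` output, the second matches the failing `ISO8859-1:` output, and the third is the value the test expected.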
Finally, when I run the command from the command line (after rewrites)
What passes:
./python '-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'
UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']
encoding is UTF-8, but the result of ascii(argv[1]) is the same as argv[1]
./python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'
ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']
Here, the only difference in the output is that "UTF-8" has been changed to "ISO8859-1", i.e., I was expecting a difference in the result of ascii('bh\\xc3\\xa9\\xe2\\x82\\xac'). Instead, I see "bytes obj in", "bytes obj out" -- apparently unchanged. HOWEVER, the result returned by get_output is always different, even if it is just limited to removing the 'b' quality.
Again: test result includes:
ISO8859-1:['h\xc3\xa9\xe2\x82\xac'] - which is not equal to manual CLI with ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']
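The manual runs differ from the test because a shell does not understand Python's `b'...'` literal syntax. A rough sketch with `shlex`, which approximates POSIX shell quoting (the assumption is that the shell used on AIX treats single quotes the same way; the shortened command line is illustrative):

```python
import shlex

# What was typed at the prompt, shortened; a raw string keeps the backslashes.
typed = r"./python -c 'print(1)' b'h\xc3\xa9\xe2\x82\xac'"

# POSIX quoting: single quotes are removed, their content is kept literally,
# and the unquoted "b" in front is simply glued onto the same word.
tokens = shlex.split(typed)
last = tokens[-1]
shown = ascii([last])

print(tokens)
print(shown)
```

The child therefore receives the plain ASCII text `bh\xc3...` with real backslash characters, not five non-ASCII bytes, so both UTF-8 mode settings echo it unchanged.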
So, I feel the issue is not with test, but within what happens after:
+127     proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
+128                             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
+129                             env=env, cwd=cwd)
Specifically: here.
+130     with proc:
+131         try:
+132             out, err = proc.communicate()
+133         finally:
+134             proc.kill()
+135             subprocess._cleanup()
+136     rc = proc.returncode
+137     err = strip_python_stderr(err)
+138     return _PythonRunResult(rc, out, err), cmd_line
PASS:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
0 b"UTF-8:['h\\xe9\\u20ac']\n" b''
FAIL:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
0 b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n" b''
Seems the 'b' quality disappears somehow with:
+216 args = out.partition(':')[2].rstrip()
So, maybe it is in test - in that line.
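A sketch of what that line does to the captured output, under the assumption that the helper decodes the child's stdout to text before `partition` is applied (the `raw` value is copied from the FAIL case above). The outer `b"..."` is only the pipe's byte stream; the text inside it never contained a bytes-typed argument.

```python
# Child stdout as captured from the pipe in the failing run.
raw = b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n"

# The child printed pure ASCII (that is the point of ascii()), so decoding
# the pipe contents is lossless.
out = raw.decode('ascii')

# Split "ENCODING:ARGS" at the first colon and drop the trailing newline.
encoding, sep, rest = out.partition(':')
args = rest.rstrip()

print(encoding)
print(args)
```

So `partition` removes nothing of substance: the `b` seen in the debug print belongs to the pipe data, not to the argument list.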
However, this goes well beyond my comprehension of python internal workings.
Hope this helps. Please comment.
|
|||
| msg323831 - (view) | Author: Łukasz Langa (lukasz.langa) * (Python committer) | Date: 2018-08-21 14:55 | |
I have no idea what's going on here yet but just wanted to report that we are seeing this issue on one FreeBSD buildbot, too: https://buildbot.python.org/all/#/builders/124/builds/508/steps/4/logs/stdio I can also reproduce on CentOS 7. Could this be related to LC_ALL= or related environment variables? |
|||
| msg323941 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2018-08-23 10:48 | |
I fixed bpo-34207. |
|||
| msg323942 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2018-08-23 10:51 | |
Your issue is about decoding command line arguments, which is done from the main() function. It doesn't use Python codecs, but functions like Py_DecodeLocale().
> Question 1: why is windows excluded? Because it does not use UTF-8 as it's default (it's default is CP1252)
Windows uses wmain() which gets command line arguments as wchar_t* strings: Unicode. No decoding is needed.
|
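For code that needs the original bytes of an argument on Unix, the `sys.argv` documentation suggests re-encoding with `os.fsencode()`. A minimal sketch follows; `argv_bytes` is a hypothetical helper name, and `os.fsdecode()` is used only as a stand-in for the C-level decoding done at startup.

```python
import os

def argv_bytes(argv):
    """Return the byte form of each already-decoded argument."""
    return [os.fsencode(item) for item in argv]

original = b'h\xc3\xa9\xe2\x82\xac'

# Stand-in for startup decoding: filesystem encoding plus its error handler.
decoded = os.fsdecode(original)

print(ascii(decoded))
print(argv_bytes([decoded]) == [original])
```

The printed text form depends on the platform's filesystem encoding, which is the same kind of variation the test runs into; the byte round trip is what stays stable.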
|||
| msg323961 - (view) | Author: Michael Felt (Michael.Felt) * | Date: 2018-08-23 17:14 | |
On 23/08/2018 12:51, STINNER Victor wrote:
> Your issue is about decoding command line argument which is done from main() function. It doesn't use Python codecs, but functions like Py_DecodeLocale().
This is beyond my understanding atm.
Early on I tried making the expected just be 'arg' and went from situation A to situation B - which looked much closer, BUT, the 'types' differed:
Situation A (original)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
: ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
I tried saying the "expected" is arg, but arg is still a byte object, the cmd_line result is not (printed as such).
Situation B
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
? +
: ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
After further digging - to understand why it was coming as "\x encoding rather than \udc" - I looked at what was happening here:
out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
becomes
out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)
becomes
out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)
And finally, at the CLI becomes:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']
/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']
Note:
/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\udcc3\udca9\udce2\udc82\udcac'
ISO8859-1:['h\\udcc3\\udca9\\udce2\\udc82\\udcac']
/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\udcc3\udca9\udce2\udc82\udcac'
ISO8859-1:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']
root@x066:[/data/prj/python/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (>
UTF-8:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']
Summary:
a) concerned about how b'h....' becomes 'bh....'
b) whatever argv[1] is, is very close to what is returned - so whatever happens during the transformation from
self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
determines the output and the (failed) comparison.
|
|||
| msg323996 - (view) | Author: Michael Felt (Michael.Felt) * | Date: 2018-08-24 10:28 | |
p.s. also tried:
michael@x071:[/data/prj/python/git/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\xe9\u20ac'.encode\('utf-8'\)
ISO8859-1:['h\\xe9\\u20ac.encode(utf-8)']
michael@x071:[/data/prj/python/git/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\xe9\u20ac'.encode\('utf-8'\)
UTF-8:['h\\xe9\\u20ac.encode(utf-8)']
Really unclear to me what this test is trying to verify. The CLI seems to just 'echo' what it is provided.
|
|||
| msg324067 - (view) | Author: Michael Felt (Michael.Felt) * | Date: 2018-08-25 13:40 | |
Solution much simpler than I thought:
not arg.decode('ascii', 'surrogateescape'), but arg.decode('iso-8859-1')
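A sketch of how the expected value could be chosen per platform with that change. `expected_c_locale_arg` is a hypothetical helper written for illustration, shaped like the branch from the earlier "Modification" attempts; it is not presented as the merged patch.

```python
def expected_c_locale_arg(arg, platform, is_android=False):
    """Text the child is expected to report for `arg` under LC_ALL=C."""
    if platform == 'darwin' or is_android:
        return arg.decode('utf-8')
    if platform.startswith('aix'):
        # Every byte maps to the code point with the same value.
        return arg.decode('iso-8859-1')
    return arg.decode('ascii', 'surrogateescape')

arg = 'h\xe9\u20ac'.encode('utf-8')
results = {p: ascii([expected_c_locale_arg(arg, p)])
           for p in ('linux', 'darwin', 'aix7')}

for platform, shown in results.items():
    print(platform, shown)
```

With the ISO 8859-1 decoding, the AIX value equals the `['h\xc3\xa9\xe2\x82\xac']` string that the child printed in the failing run.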
|
|||
| msg324097 - (view) | Author: Michael Osipov (michael-o) * | Date: 2018-08-25 19:46 | |
This is a very thorough analysis. Kudos to that. |
|||
| msg324179 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2018-08-27 13:40 | |
New changeset 7ef1697be54a74314d5214d9ba0580d4e620694c by Victor Stinner (Michael Felt) in branch 'master': bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX (GH-8923) https://github.com/python/cpython/commit/7ef1697be54a74314d5214d9ba0580d4e620694c |
|||
| msg324181 - (view) | Author: Michael Osipov (michael-o) * | Date: 2018-08-27 14:24 | |
Interestingly, the very same approach does not work for HP-UX, even if I swap out the params for HP-UX:
$ ./python -m test test_utf8_mode
Run tests sequentially
0:00:00 [1/1] test_utf8_mode
test test_utf8_mode failed -- Traceback (most recent call last):
File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 226, in test_cmd_line
check('utf8=0', [c_arg], LC_ALL='C')
File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 217, in check
self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\xfb\\u02cb\\xe3\\x82\\u02dc']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\xfb\u02cb\xe3\x82\u02dc']
: roman8:['h\xc3\xa9\xe2\x82\xac']
|
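A small sketch of why swapping the codec name changes the expected string. It assumes the HP-UX attempt built its expected value with Python's `hp-roman8` codec, which the traceback does not state; the sketch only shows that the two decodings of the same bytes differ.

```python
arg = 'h\xe9\u20ac'.encode('utf-8')

# ISO 8859-1 maps byte 0xNN to code point U+00NN.
as_latin1 = arg.decode('iso-8859-1')

# HP Roman-8 assigns different characters to several of the same bytes.
as_roman8 = arg.decode('hp-roman8')

print(ascii([as_latin1]))
print(ascii([as_roman8]))
print(as_latin1 == as_roman8)
```

In the traceback the child's actual output has the byte-for-byte form even though the reported encoding is `roman8`, so the reported locale encoding alone does not predict how the argument was decoded there.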
|||
| msg324409 - (view) | Author: Michael Felt (Michael.Felt) * | Date: 2018-08-31 09:19 | |
The buildbots seem happy. This may be closed. |
|||
| msg324419 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2018-08-31 14:17 | |
> The buildbots seem happy. This may be closed.
Cool, thank you for checking, and thanks for your fix! I close the issue.
|
|||
| msg337636 - (view) | Author: Michael Felt (Michael.Felt) * | Date: 2019-03-10 18:48 | |
Could this be backported to version 3.7? |
|||
| msg346078 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2019-06-19 20:07 | |
New changeset 15e7d2432294ec46f1ad84ce958fdeb9d4ca78b1 by Victor Stinner (Michael Felt) in branch '3.7': [3.7] bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX (GH-8923) (GH-14233) https://github.com/python/cpython/commit/15e7d2432294ec46f1ad84ce958fdeb9d4ca78b1 |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:59:04 | admin | set | github: 78528 |
| 2019-06-19 20:07:50 | vstinner | set | messages: + msg346078 |
| 2019-06-19 14:04:17 | Michael.Felt | set | pull_requests: + pull_request14069 |
| 2019-03-10 18:48:12 | Michael.Felt | set | messages: + msg337636 |
| 2018-08-31 14:17:50 | vstinner | set | status: open -> closed resolution: fixed messages: + msg324419 stage: patch review -> resolved |
| 2018-08-31 09:19:16 | Michael.Felt | set | messages: + msg324409 |
| 2018-08-27 14:24:43 | michael-o | set | messages: + msg324181 |
| 2018-08-27 13:40:21 | vstinner | set | messages: + msg324179 |
| 2018-08-25 19:46:27 | michael-o | set | nosy: + michael-o messages: + msg324097 |
| 2018-08-25 13:40:42 | Michael.Felt | set | messages: + msg324067 |
| 2018-08-25 13:38:20 | Michael.Felt | set | keywords: + patch stage: patch review pull_requests: + pull_request8396 |
| 2018-08-24 10:28:43 | Michael.Felt | set | messages: + msg323996 |
| 2018-08-23 17:14:25 | Michael.Felt | set | messages: + msg323961 |
| 2018-08-23 10:51:30 | vstinner | set | messages: + msg323942 |
| 2018-08-23 10:48:50 | vstinner | set | nosy: + vstinner messages: + msg323941 |
| 2018-08-21 14:56:54 | lukasz.langa | set | keywords: + 3.7regression |
| 2018-08-21 14:56:17 | lukasz.langa | set | dependencies: + test_cmd_line test_utf8_mode test_warnings fail in all FreeBSD 3.x (3.8) buildbots |
| 2018-08-21 14:55:43 | lukasz.langa | set | nosy: + lukasz.langa messages: + msg323831 |
| 2018-08-09 11:55:07 | Michael.Felt | set | messages: + msg323319 |
| 2018-08-07 20:23:35 | Michael.Felt | set | messages: + msg323250 |
| 2018-08-06 20:26:57 | Michael.Felt | set | files: + pEpkey.asc messages: + msg323223 |
| 2018-08-06 20:10:54 | Michael.Felt | set | messages: + msg323222 |
| 2018-08-06 16:06:50 | Michael.Felt | create | |