Message 323319 - Python tracker


In-reply-to
Author	Michael.Felt
Recipients	Michael.Felt
Date	2018-08-09.11:55:06
Message-id	<1533815707.35.0.56676864532.issue34347@psf.upfronthosting.co.za>

Content

Starting this discussion again. Please take time to read. I have spent hours trying to understand what is failing. Please spend a few minutes reading.
Sadly, there is a lot of text - but I do not know what I could leave out without damaging the process of discovery.
The failing result is:
 self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
The test code is:
    @unittest.skipIf(MS_WINDOWS, 'test specific to Unix')
    def test_cmd_line(self):
        arg = 'h\xe9\u20ac'.encode('utf-8')
        arg_utf8 = arg.decode('utf-8')
        arg_ascii = arg.decode('ascii', 'surrogateescape')
        code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'

        def check(utf8_opt, expected, **kw):
            out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
            args = out.partition(':')[2].rstrip()
            self.assertEqual(args, ascii(expected), out)

        check('utf8', [arg_utf8])
        if sys.platform == 'darwin' or support.is_android:
            c_arg = arg_utf8
        else:
            c_arg = arg_ascii
        check('utf8=0', [c_arg], LC_ALL='C')
Question 1: why is Windows excluded? Because it does not use UTF-8 as its default (its default is CP1252).
Question 2: It seems that what the test is 'checking' is how the bytes from 'h\xe9\u20ac'.encode('utf-8') get decoded, as reported by ascii(), depending on the utf8 mode that is set.
    out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
can be rewritten (with less indentation) as:
    out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)
or
    out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)
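To make the candidate outcomes concrete, the same five bytes can be decoded by hand in three ways. This is only a sketch: the Latin-1 row is an assumption on my part, chosen because locale.getpreferredencoding() reports ISO8859-1 in the failing run, and the dictionary name is made up for illustration.

```python
# The argument the test passes: UTF-8 bytes for 'h', e-acute, euro sign.
arg = "h\xe9\u20ac".encode("utf-8")
print(arg)  # b'h\xc3\xa9\xe2\x82\xac'

# Three ways a child interpreter might turn those bytes into sys.argv[1].
decodings = {
    "utf-8": arg.decode("utf-8"),
    "ascii+surrogateescape": arg.decode("ascii", "surrogateescape"),
    # Assumption: ISO8859-1 (Latin-1) is what the C locale maps to here.
    "latin-1": arg.decode("latin-1"),
}

for name, text in decodings.items():
    # ascii() escapes every non-ASCII character, as the test's child code does.
    print(f"{name:>22}: {ascii([text])}")
```

The first row is what the test expects for utf8, the second is what it expects for utf8=0, and the third is the string that actually comes back in the failing run.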
Finally, in Lib/test/support/script_helper.py we have
    print("\n", cmd_line)  # debug info, ignore
    proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            env=env, cwd=cwd)
Which gives:
 ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
Above - utf8=1 - is successful
 ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
Here: utf8=0 fails. The arg to the CLI is equal in both cases.
FAIL
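To see what a child interpreter makes of that bytes argument without the rest of the test harness, something like the following could be run on a Unix-like system. It is a sketch under assumptions: child_argv is a name invented here, it runs sys.executable rather than the build under test, and it leaves out locale.getpreferredencoding() to keep the output short.

```python
import os
import subprocess
import sys

# Same payload as the test: the UTF-8 encoding of 'h', e-acute, euro sign.
ARG = "h\xe9\u20ac".encode("utf-8")  # b'h\xc3\xa9\xe2\x82\xac'
CODE = "import sys; print(ascii(sys.argv[1:]))"


def child_argv(utf8_opt, **env_overrides):
    """Hypothetical helper: run a child Python and return its view of argv."""
    env = dict(os.environ, **env_overrides)
    # The bytes object goes into the argument list unchanged, as in the test.
    cmd = [sys.executable, "-X", utf8_opt, "-c", CODE, ARG]
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, env=env, check=True)
    # ascii() output is pure ASCII, so this decode cannot fail.
    return proc.stdout.decode("ascii").rstrip()


if os.name == "posix":  # the original test is Unix-specific
    print("utf8   :", child_argv("utf8"))
    print("utf8=0 :", child_argv("utf8=0", LC_ALL="C"))
```

With -X utf8 the child should report ['h\xe9\u20ac']. What it reports with utf8=0 and LC_ALL=C depends on the encoding that the platform's C locale maps to, which is the open question here.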
## Going back to check() to see what it has:
## Add some debug. The first line is the 'raw' expected,
## the second line is ascii(expected)
## the final is the value extracted from get_output
        def check(utf8_opt, expected, **kw):
            out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
            args = out.partition(':')[2].rstrip()
            print("")
            print("%s: expected\n%s:ascii(expected)\n%s:out" % (expected, ascii(expected), out))
            self.assertEqual(args, ascii(expected), out)
For: utf8 mode true, it works:
['hé\u20ac']: expected
['h\xe9\u20ac']:ascii(expected)
UTF-8:['h\xe9\u20ac']:out
        check('utf8', [arg_utf8])
But not for utf8=0
        check('utf8=0', [c_arg], LC_ALL='C')
        # note, different values for LC_ALL have been tried
['h\udcc3\udca9\udce2\udc82\udcac']: expected
['h\udcc3\udca9\udce2\udc82\udcac']:ascii(expected)
ISO8859-1:['h\xc3\xa9\xe2\x82\xac']:out
## re: expected and ascii(expected)
When utf8=1, expected and ascii(expected) differ. "arg" looks different from both - but after processing by get_output() expected and out match.
When utf8=0, there is no difference in the "arg" passed to "code".
However, with check() the values for both expected and ascii(expected) are identical. And, sadly, the value coming back via get_output() looks nothing like 'expected'.
In short, when utf8=1 ascii(b'h\xc3\xa9\xe2\x82\xac') becomes ['h\xe9\u20ac'], which is what is desired. But when utf8=0 ascii(b'h\xc3\xa9\xe2\x82\xac') is b'h\xc3\xa9\xe2\x82\xac', not 'h\udcc3\udca9\udce2\udc82\udcac'.
Finally, when I run the command from the command line (after rewrites)
What passes:
./python '-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'
UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']
encoding is UTF-8, but the result of ascii(argv[1]) is the same as argv[1]
./python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'
ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']
Here, the only difference in the output is that "UTF-8" has changed to "ISO8859-1", i.e., I was expecting a difference in the result of ascii('bh\\xc3\\xa9\\xe2\\x82\\xac'). Instead, I see "bytes obj in", "bytes obj out" -- apparently unchanged. HOWEVER, the result returned by get_output is always different, even if it is just limited to removing the 'b' quality.
Again: test result includes:
 ISO8859-1:['h\xc3\xa9\xe2\x82\xac'] - which is not equal to manual CLI with ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']
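One thing worth ruling out for the manual runs: in a POSIX shell, b'...' is not a bytes literal. The quotes are removed, backslashes inside single quotes are kept literally, and the leading b is glued onto the front, so the interpreter receives a plain ASCII string. The sketch below uses shlex.split as a stand-in for the shell's word splitting (an approximation, not the shell itself), and the shortened -c payload is a placeholder.

```python
import shlex

# The manual command, with the -c payload shortened to a placeholder.
typed = r"./python -X utf8=0 -c 'pass' b'h\xc3\xa9\xe2\x82\xac'"

argv = shlex.split(typed)  # POSIX-style quoting rules
last = argv[-1]

print(last)           # bh\xc3\xa9\xe2\x82\xac  (literal backslashes, no bytes)
print(len(last))      # 22 characters, all ASCII
print(ascii([last]))  # ['bh\\xc3\\xa9\\xe2\\x82\\xac']
```

That matches the output of both manual runs, which would explain why they do not differ: there is nothing non-ASCII in the argument to decode. To hand the real five bytes to the interpreter from a shell, something like "$(printf 'h\303\251\342\202\254')" as the last argument should work.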
So, I feel the issue is not with test, but within what happens after:
    proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            env=env, cwd=cwd)
Specifically: here.
    with proc:
        try:
            out, err = proc.communicate()
        finally:
            proc.kill()
            subprocess._cleanup()
    rc = proc.returncode
    err = strip_python_stderr(err)
    return _PythonRunResult(rc, out, err), cmd_line
PASS:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
 0 b"UTF-8:['h\\xe9\\u20ac']\n" b''
FAIL:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
 0 b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n" b''
It seems the 'b' quality disappears somehow with:
            args = out.partition(':')[2].rstrip()
So, maybe it is in the test - in that line.
However, this goes well beyond my comprehension of Python's internal workings.
Hope this helps. Please comment.
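A small check of where the b prefix comes from. As far as I can tell it belongs to the bytes object holding the child's stdout, not to the argument text inside it. Since check() calls out.partition(':') with a str separator, get_output() must already have returned a str. The sketch below replays the FAIL record by hand; decoding as ASCII is an assumption about what get_output() does.

```python
# Raw stdout from the FAIL record, as returned by proc.communicate().
raw_stdout = b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n"

# bytes.partition() only accepts a bytes separator, so the test's
# out.partition(':') would raise TypeError if out were still bytes.
try:
    raw_stdout.partition(":")
except TypeError as exc:
    print("bytes.partition(':') ->", type(exc).__name__)

# Assumption: get_output() decodes stdout before returning it.
out = raw_stdout.decode("ascii")
args = out.partition(":")[2].rstrip()
print(args)  # ['h\xc3\xa9\xe2\x82\xac']

# What the assertion compares against when c_arg is arg_ascii.
expected = ["h\xe9\u20ac".encode("utf-8").decode("ascii", "surrogateescape")]
print(ascii(expected))          # ['h\udcc3\udca9\udce2\udc82\udcac']
print(args == ascii(expected))  # False
```

So partition() and rstrip() only trim text; the mismatch is already present in what the child printed.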

History
Date	User	Action	Args
2018-08-09 11:55:07	Michael.Felt	set	recipients: + Michael.Felt
2018-08-09 11:55:07	Michael.Felt	set	messageid: <1533815707.35.0.56676864532.issue34347@psf.upfronthosting.co.za>
2018-08-09 11:55:07	Michael.Felt	link	issue34347 messages
2018-08-09 11:55:06	Michael.Felt	create
