Created on 2012-11-08 22:52 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| support_undecodable.patch | vstinner, 2012-11-08 22:52 | |
| Messages (22) | |||
|---|---|---|---|
| msg175200 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012-11-08 22:52 | |
Attached patch changes how support.TESTFN_UNDECODABLE is computed on UNIX: use the filesystem encoding in *strict* mode, not using the surrogateescape error handler. So we can use support.TESTFN_UNDECODABLE to check if a function correctly uses the surrogateescape error handler and/or check if it behaves correctly with non-ASCII characters.
The patch also uses support.TESTFN_UNDECODABLE (only on UNIX) in test_cmd_line_script.test_non_ascii() to also check that the fix for #16218 works with UTF-8 locale encoding.
Please test the patch on UNIX, Windows and Mac OS X.
We may also use support.TESTFN_UNDECODABLE in test_cmd_line_script.test_non_ascii() on Windows, I will check. Windows has some strange behaviour with undecodable characters: some of them are replaced by a character with a similar glyph. |
|||
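The selection described in msg175200 can be sketched roughly as follows. This is a minimal illustration, not the code from support_undecodable.patch: the helper name `find_undecodable` and the `CANDIDATES` tuple are hypothetical, and the candidate bytes are simply those mentioned later in this thread.

```python
import sys

def find_undecodable(candidates, encoding=None):
    """Return the first candidate that `encoding` rejects in strict mode.

    Hypothetical helper: returns None when every candidate decodes.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    for name in candidates:
        try:
            # "strict" is the default error handler, so surrogateescape
            # is deliberately not used here.
            name.decode(encoding)
        except UnicodeDecodeError:
            return name
    return None

# Assumed candidate list, taken from bytes suggested in the discussion.
CANDIDATES = (b'\xff', b'\x81', b'\x98', b'\xae', b'\xd5', b'\xed\xb2\x80')

print(find_undecodable(CANDIDATES))
```

A test can then skip itself when the helper returns None, because the current filesystem encoding decodes every candidate.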
| msg175201 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012-11-08 22:53 | |
The patch contains two print statements to help debug the patch itself; they must be removed later.
+print("TESTFN_UNDECODABLE = %a" % TESTFN_UNDECODABLE)
+print("TESTFN_NONASCII = %a" % TESTFN_NONASCII)
|
|||
| msg175202 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012-11-08 23:04 | |
> We may also use support.TESTFN_UNDECODABLE
> in test_cmd_line_script.test_non_ascii() on Windows

Oh, subprocess doesn't support passing bytes arguments to a program anymore (since Python 3.0).
http://bugs.python.org/issue4036#msg100376
So it's better to use TESTFN_NONASCII instead for this test ;-)
It confirms that we need two constants depending on the context. It depends on the platform and how the data is read/written: sometimes undecodable characters are supported on any platform (e.g. base64 encoder), sometimes undecodable characters are not supported (e.g. distutils expects valid metadata), sometimes it depends on the platform (e.g. this test). |
|||
| msg175209 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012-11-08 23:50 | |
> Please test the patch on UNIX, Windows and Mac OS X.

The full test suite passes on:
* Linux with UTF-8 locale encoding
* Linux with ASCII locale encoding
* Windows with cp932 ANSI code page
* Mac OS 10.8 with ASCII locale encoding (and utf-8/surrogateescape for the filesystem encoding)

($LANG, $LC_ALL, $LC_CTYPE are not set) |
|||
| msg175221 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-09 11:00 | |
Try b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'. They are undecodable in all 1-byte encodings.
b'\x81' : shift_jis_2004 shift_jis shift_jisx0213 cp869 cp874 cp932 cp1250 cp1252 cp1253 cp1254 cp1255 cp1257 cp1258
b'\x98' : shift_jis_2004 shift_jis shift_jisx0213 cp874 cp932 cp1250 cp1251 cp1253 cp1257
b'\xae' : iso8859-3 iso8859-6 iso8859-7 cp424
b'\xd5' : iso8859-8 cp856 cp857
b'\xff' : hp-roman8 iso8859-6 iso8859-7 iso8859-8 iso8859-11 shift_jis_2004 shift_jis shift_jisx0213 tis-620 cp864 cp874 cp1253 cp1255 |
|||
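A table like the one in msg175221 can be reproduced with a short loop. The sketch below is an assumption about how such a list might be generated, not the script that was actually used. The function name `undecodable_in` and the shortened `ENCODINGS` tuple are hypothetical, and the results depend on the codec tables shipped with the Python version in use.

```python
def undecodable_in(byte, encodings):
    """List the encodings that cannot decode `byte` in strict mode."""
    failing = []
    for enc in encodings:
        try:
            byte.decode(enc)
        except UnicodeDecodeError:
            failing.append(enc)
        except LookupError:
            # Codec not available in this Python build: ignore it.
            continue
    return failing

# Small illustrative subset of the codecs named in the message.
ENCODINGS = ('latin-1', 'cp1250', 'cp1251', 'cp1252', 'iso8859-7', 'shift_jis')

for byte in (b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'):
    print(ascii(byte), ':', ' '.join(undecodable_in(byte, ENCODINGS)))
```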
| msg175222 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-09 11:09 | |
Try b'\xed\xb2\x80' and b'\xed\xb4\x80' for UTF-8 (on Unix and Mac OS X).
b'\xed\xb2\x80' is b'\x80'.decode('utf-8', 'surrogateescape').encode('utf-8', 'surrogatepass').
b'\xed\xb4\x80' is '\udd00'.encode('utf-8', 'surrogatepass') and '\udd00' can't be encoded with surrogateescape.
|
|||
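The two byte sequences from msg175222 can be checked directly. The sketch below only uses the standard `surrogateescape` and `surrogatepass` error handlers; the helper names `strict_decodable` and `escapable` are hypothetical.

```python
def strict_decodable(data, encoding='utf-8'):
    """True if `data` decodes with the strict error handler."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

def escapable(text, encoding='utf-8'):
    """True if `text` can be encoded back to bytes with surrogateescape."""
    try:
        text.encode(encoding, 'surrogateescape')
    except UnicodeEncodeError:
        return False
    return True

# b'\x80' is escaped to the lone surrogate U+DC80 ...
escaped = b'\x80'.decode('utf-8', 'surrogateescape')
# ... whose surrogatepass encoding is the first suggested sequence.
first = escaped.encode('utf-8', 'surrogatepass')
# U+DD00 lies outside the U+DC80..U+DCFF range used by surrogateescape.
second = '\udd00'.encode('utf-8', 'surrogatepass')

print(ascii(escaped), first, second)
print(strict_decodable(first), strict_decodable(second))
print(escapable(escaped), escapable('\udd00'))
```

Both sequences are rejected by the strict UTF-8 decoder, which is what makes them usable as undecodable file names on a UTF-8 locale.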
| msg175223 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-09 11:14 | |
> The full test suite passes on:

The matter is not only in the fact that tests passed. They should fail if the original bug occurs again. Have you tried to restore the bugs? |
|||
| msg175271 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012-11-10 10:50 | |
> The matter is not only in the fact that tests passed.

Right, but I don't want to introduce a regression :-)

> They should fail if the original bug occurs again. Have you tried to restore the bugs?

test_cmd_line_script.test_non_ascii() comes from the issue #16218, changeset 23ebe277e982. I checked this issue: support_undecodable.patch checks for non-regression with UTF-8 (and ASCII and ISO-8859-1) locale encoding on UNIX.
test_genericpath.test_non_ascii() comes from the issue #3426, this fix comes from the issue #3187, changeset 8a7c930abab6. I don't want to spend time on trying the new test on this issue because 8a7c930abab6 is a major change, and I don't see how to revert it just to test the issue. I consider the issue fixed, and the new test should not reduce the test coverage, but just increase it ;-) |
|||
| msg175272 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012-11-10 11:07 | |
New changeset 6b8a8bc6ba9c by Victor Stinner in branch 'default':
Issue #16444, #16218: Use TESTFN_UNDECODABLE on UNIX
http://hg.python.org/cpython/rev/6b8a8bc6ba9c |
|||
| msg175275 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-10 12:21 | |
TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258. Just add b'\x81\x98\xae\xd5\xff': at least one of these bytes is undecodable in any encoding that has undecodable bytes. |
|||
| msg175291 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012-11-10 18:24 | |
I suppose you noticed you broke a bunch of buildbots :) |
|||
| msg175296 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012-11-10 21:31 | |
New changeset 398f8770bf0d by Victor Stinner in branch 'default':
Issue #16444: disable undecodable characters in test_non_ascii() test until
http://hg.python.org/cpython/rev/398f8770bf0d |
|||
| msg175396 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012-11-11 21:51 | |
> TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258.

The Python encoding and the real codec used by Windows are different: Python fails to decode some bytes in the range 0x80-0x9f, but Windows does decode them. I prefer to avoid these bytes so the test does not rely too much on the Python codec. |
|||
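The Python side of this difference can be inspected with a small loop. This is only a sketch: the function name `undecodable_bytes` is hypothetical, and it shows which bytes Python's own codec rejects, not what the Windows codec does with them.

```python
def undecodable_bytes(encoding, start=0x80, stop=0xA0):
    """Return the byte values in [start, stop) that `encoding` rejects."""
    rejected = []
    for value in range(start, stop):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            rejected.append(value)
    return rejected

for codec in ('cp1250', 'cp1251', 'cp1252'):
    print(codec, [hex(v) for v in undecodable_bytes(codec)])
```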
| msg175399 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-11 22:08 | |
These encodings are used not only on Windows. |
|||
| msg175402 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012-11-11 22:15 | |
> I suppose you noticed you broke a bunch of buildbots :)

Failures occur on FreeBSD, OpenIndiana and some other buildbots which don't set a locale and so use the "C" locale. main() decodes command line arguments from the locale encoding using _Py_char2wchar(). On these OSes, the "C" locale uses the ISO-8859-1 encoding, but the problem is that nl_langinfo(CODESET) announces ASCII :-/
test_cmd_line.test_undecodable_code() handles this case. Extract of a comment:
# _Py_char2wchar() decoded b'\xff' as '\xff' even if the locale is
# C and the locale encoding is ASCII. It occurs on FreeBSD, Solaris
# and Mac OS X.
Mac OS X is now using UTF-8 to decode the command line arguments. I just created the issue #16455 to fix FreeBSD and OpenIndiana.
I propose to close this issue because I consider it as fixed (#16455 will reenable TESTFN_UNDECODABLE in test_cmd_line_script). |
|||
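To see which encodings a given machine announces, something like the following can be run. It is a hedged sketch: the function name `describe_encodings` is hypothetical, `locale.nl_langinfo()` only exists on some Unix platforms (hence the guard), and the output reports only the announced codeset, so it cannot by itself reveal the mismatch with the decoder described in msg175402.

```python
import locale
import sys

def describe_encodings():
    """Collect the encodings Python reports for the current process."""
    info = {
        'filesystem': sys.getfilesystemencoding(),
        'preferred': locale.getpreferredencoding(False),
    }
    # nl_langinfo() and CODESET are not available on every platform.
    if hasattr(locale, 'nl_langinfo') and hasattr(locale, 'CODESET'):
        info['nl_langinfo'] = locale.nl_langinfo(locale.CODESET)
    return info

for key, value in describe_encodings().items():
    print(key, '=', value)
```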
| msg175406 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012-11-11 23:12 | |
> These encodings are used not only on Windows.

You can use cpXXX encodings explicitly to read or write a file, but these encodings are not used for sys.getfilesystemencoding() (or sys.stdout.encoding). |
|||
| msg175413 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012-11-12 00:24 | |
New changeset 6017f09ead53 by Victor Stinner in branch '3.3':
Issue #16218, #16444: Backport improvment on tests for non-ASCII characters
http://hg.python.org/cpython/rev/6017f09ead53 |
|||
| msg175423 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-11-12 08:05 | |
> You can use cpXXX encodings explicitly to read or write a file, but these
> encodings are not used for sys.getfilesystemencoding() (or
> sys.stdout.encoding).

At least CP1251 was used for many Cyrillic locales in the pre-UTF-8 era (I still use it sometimes). Currently, CP1251 is the default encoding for Byelorussian and Bulgarian:
$ grep CP /usr/share/i18n/SUPPORTED
be_BY CP1251
bg_BG CP1251
ru_RU.CP1251 CP1251
yi_US CP1255 |
|||
| msg176893 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012-12-04 10:40 | |
Ping. |
|||
| msg176955 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012-12-04 20:42 | |
New changeset ed0ff4b3d1c4 by Victor Stinner in branch 'default':
Issue #16444: test more bytes in support.TESTFN_UNDECODABLE to support more Windows code pages
http://hg.python.org/cpython/rev/ed0ff4b3d1c4 |
|||
| msg176958 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012-12-04 20:53 | |
Ooook, all remaining issues about undecodable bytes should now be fixed (until someone opens a new one? :-)) |
|||
| msg178868 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2013-01-03 00:59 | |
New changeset 41658a4fb3cc by Victor Stinner in branch '3.2':
Issue #16218, #16414, #16444: Backport FS_NONASCII, TESTFN_UNDECODABLE,
http://hg.python.org/cpython/rev/41658a4fb3cc
New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3':
(Merge 3.2) Issue #16218, #16414, #16444: Backport FS_NONASCII,
http://hg.python.org/cpython/rev/4d40c1ce8566 |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:38 | admin | set | github: 60648 |
| 2013-01-03 01:07:37 | vstinner | set | versions: + Python 3.2, Python 3.3 |
| 2013-01-03 00:59:43 | python-dev | set | messages: + msg178868 |
| 2012-12-04 20:53:30 | vstinner | set | status: open -> closed resolution: fixed messages: + msg176958 |
| 2012-12-04 20:42:00 | python-dev | set | messages: + msg176955 |
| 2012-12-04 10:40:07 | serhiy.storchaka | set | type: enhancement messages: + msg176893 stage: patch review |
| 2012-11-15 15:53:18 | asvetlov | set | nosy: + asvetlov |
| 2012-11-12 08:05:43 | serhiy.storchaka | set | messages: + msg175423 |
| 2012-11-12 00:24:14 | python-dev | set | messages: + msg175413 |
| 2012-11-11 23:12:16 | vstinner | set | messages: + msg175406 |
| 2012-11-11 22:15:48 | vstinner | set | messages: + msg175402 |
| 2012-11-11 22:08:49 | serhiy.storchaka | set | messages: + msg175399 |
| 2012-11-11 21:51:52 | vstinner | set | messages: + msg175396 |
| 2012-11-10 21:31:49 | python-dev | set | messages: + msg175296 |
| 2012-11-10 18:24:50 | pitrou | set | nosy: + pitrou messages: + msg175291 |
| 2012-11-10 12:21:27 | serhiy.storchaka | set | messages: + msg175275 |
| 2012-11-10 11:07:35 | python-dev | set | nosy: + python-dev messages: + msg175272 |
| 2012-11-10 10:50:07 | vstinner | set | messages: + msg175271 |
| 2012-11-09 11:14:20 | serhiy.storchaka | set | messages: + msg175223 |
| 2012-11-09 11:09:49 | serhiy.storchaka | set | messages: + msg175222 |
| 2012-11-09 11:00:23 | serhiy.storchaka | set | messages: + msg175221 |
| 2012-11-08 23:50:56 | vstinner | set | messages: + msg175209 |
| 2012-11-08 23:04:16 | vstinner | set | messages: + msg175202 |
| 2012-11-08 22:53:12 | vstinner | set | messages: + msg175201 |
| 2012-11-08 22:52:14 | vstinner | create | |