Created on 2008年08月17日 00:39 by brett.cannon, last changed 2022年04月11日 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| fix_latin.diff | brett.cannon, 2008年09月05日 20:36 | |
| test_pep3120.py | brett.cannon, 2008年09月05日 20:37 | |
| iso.py | vstinner, 2008年10月04日 00:19 | Script encoded in ISO-8859-1, with a # coding header |
| tokenizer_iso-8859-1-patch2.patch | vstinner, 2008年10月04日 01:42 | |
| tokenizer_iso-8859-1-patch3.patch | vstinner, 2008年10月06日 21:49 | |
| alt_latin_1.diff | brett.cannon, 2008年10月07日 00:26 | Tweak to Victor's patch |
| Messages (31) | |||
|---|---|---|---|
| msg71251 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年08月17日 00:39 | |
The following leads to a SyntaxError in 3.0:

    compile(b'# coding: latin-1\nu = "\xC7"\n', '<dummy>', 'exec')

That is not the case in Python 2.6. |
|||
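A minimal reproduction sketch of the report above; the catch-and-print wrapper and the behaviour comments are only for illustration and are not part of the original message:

```python
# The byte string declares latin-1 as its source encoding, yet compile()
# in Python 3.0 rejects the 0xC7 byte with a SyntaxError.
source = b'# coding: latin-1\nu = "\xC7"\n'
try:
    compile(source, '<dummy>', 'exec')
except SyntaxError as exc:
    print('SyntaxError:', exc)   # seen on the buggy interpreter
else:
    print('compiled fine')       # expected behaviour (and the Python 2.6 behaviour)
```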
| msg71252 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年08月17日 00:39 | |
Looks like Parser/tokenizer.c:check_coding_spec() considered Latin-1 a raw encoding just like UTF-8. Patch is in the works. |
|||
| msg71257 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年08月17日 01:56 | |
Here is a potential fix. It broke test_imp because that test assumed that Latin-1 source files would be encoded as Latin-1 instead of UTF-8 when returned by imp.find_module(). Doesn't seem like a critical change as the file is still properly decoded. |
|||
| msg71258 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年08月17日 01:57 | |
Attached is a test for test_pep3120 (since that is what most likely introduced the breakage). It's a separate patch since the source file is marked as binary and thus can't be diffed by ``svn diff``. |
|||
| msg71401 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年08月19日 02:34 | |
Can someone double-check this patch for me? I don't have much experience with the parser so I want to make sure I am not doing anything wrong. |
|||
| msg71402 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年08月19日 02:37 | |
There is a potential dependency on issue 3594, as it would change how imp.find_module() acts and thus make test_imp no longer fail in the way it currently does. |
|||
| msg71403 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2008年08月19日 02:54 | |
That line dates back to the PEP 263 implementation. Martin? |
|||
| msg71846 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2008年08月24日 17:57 | |
Since this is marked "release blocker", I'll provide a shallow comment: I don't think it should be a release blocker. It's a bug in the compile function, and there are various work-arounds (such as saving the bytes to a temporary file and executing that one, or decoding the byte string to a Unicode string and then compiling the Unicode string). It is sufficient to fix it in 3.0.1.

I don't think the patch is right: as the test had to be changed, it means that somewhere the detection of the encoding declaration now fails. This is clearly a new bug, but I don't have the time to analyse the cause further.

In principle, there is nothing wrong with the tokenizer treating latin-1 as "raw" - that only means we don't go through a codec. |
|||
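A sketch of the second workaround Martin mentions (decode the bytes yourself, then compile the resulting str). Dropping the now-redundant coding line and the variable names are illustrative choices, not part of the report:

```python
raw = b'# coding: latin-1\nu = "\xC7"\n'

# Decode with the encoding the cookie declares, then drop the cookie line
# so the str source carries no conflicting declaration.
text = raw.decode('latin-1')
text = text.split('\n', 1)[1]

namespace = {}
exec(compile(text, '<dummy>', 'exec'), namespace)
print(ascii(namespace['u']))   # prints '\xc7'
```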
| msg71850 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年08月24日 18:12 | |
Actually, the tests don't have to change; if issue 3594 gets applied then that change cascades into this issue and negates the need to change the tests themselves.

As for treating Latin-1 as a raw encoding, how can that be theoretically okay if the parser assumes UTF-8 and UTF-8 is not a superset of Latin-1? |
|||
| msg71852 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2008年08月24日 19:19 | |
> As for treating Latin-1 as a raw encoding, how can that be theoretically
> okay if the parser assumes UTF-8 and UTF-8 is not a superset of Latin-1?

The parser doesn't assume UTF-8, but "ascii+", i.e. it passes all non-ASCII bytes on to the AST, which then needs to deal with them; it then could (but apparently doesn't) take into account whether the internal representation was UTF-8 or Latin-1: see ast.c:decode_unicode for some remains of that.

The other case (besides string literals) where bytes > 127 matter is tokenizer.c:verify_identifier; this indeed assumes UTF-8 only (but could easily be extended to support Latin-1 as well). The third case where non-ASCII bytes are allowed is comments; there they are entirely ignored (i.e. it is not even verified that the comment is well-formed UTF-8).

Removal of the special case should simplify the code; I would agree that any speedup gained by not going through a codec is irrelevant. I'm still puzzled why test_imp fails if the special case is removed. |
|||
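For readers less familiar with the tokenizer, here is an illustrative (hypothetical) Python 3 source fragment touching the three places Martin lists where bytes > 127 can appear:

```python
# 1. A string literal: non-ASCII bytes are passed through to the AST, which
#    must then know which source encoding they came from (ast.c:decode_unicode).
greeting = "chère"

# 2. An identifier: checked by tokenizer.c:verify_identifier, which per the
#    message assumes UTF-8 only.
café = 1

# 3. A comment: non-ASCII bytes here are ignored entirely - voilà.
```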
| msg71855 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年08月24日 19:53 | |
The test_imp stuff has to do with PyTokenizer_FindEncoding(). imp.find_module() only opens the file, passes the file descriptor to PyTokenizer_FindEncoding(), and then returns a file object with the found encoding.

The problem is that (as issue 3594 points out) PyTokenizer_FindEncoding() always fails, which means only the raw encodings end up being okay. With Latin-1 being one of them, it returns the file opened as Latin-1, which is correct. Removing that case here means PyTokenizer_FindEncoding() fails, and thus assumes only UTF-8 as a legitimate encoding and opens the file with the UTF-8 encoding.

It obviously took a while to find these two bugs. =) |
|||
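A sketch (module name and path are hypothetical, not from the report) of how this detection shows up at the Python level with the 3.0-era imp API: imp.find_module() hands back an already-opened text file whose .encoding reflects what PyTokenizer_FindEncoding() reported:

```python
import imp

# 'latin_mod' stands for a hypothetical module in the current directory whose
# source is saved in Latin-1 and carries a "# coding: latin-1" cookie.
fileobj, pathname, description = imp.find_module('latin_mod', ['.'])
try:
    # With the bug, detection falls back to UTF-8; with the latin-1 case
    # handled correctly, the declared iso-8859-1 encoding should show up here.
    print(pathname, fileobj.encoding)
finally:
    fileobj.close()
```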
| msg72624 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年09月05日 20:36 | |
I have attached a new version of the patch with the changes to test_imp removed, since the fix for issue 3594 made that change unnecessary. I have also directly uploaded test_pep3120.py since it is flagged as binary and thus cannot be diffed by svn. |
|||
| msg74287 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2008年10月04日 00:19 | |
Using py3k trunk + fix_latin.diff:

- compile(b'# coding: latin-1\nu = "\xC7"\n', '<dummy>', 'exec') doesn't fail
- test_pep3120.py is ok
- but executing an ISO-8859-1 script fails: see the attached iso.py

Original Python3:

    $ python iso.py
    'Bonjour ma ch\xe8re amie'

Patched Python3:

    $ python iso.py
    'Bonjour ma ch\xc3\xa8re amie' |
|||
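The attached iso.py is not reproduced in the tracker text; based on its description (an ISO-8859-1 encoded script with a # coding header) and the output above, it presumably resembles the following sketch, whose exact contents are an assumption:

```python
# -*- coding: iso-8859-1 -*-
# This only exercises the bug if the file itself is saved in ISO-8859-1,
# so that the "è" below is stored as the single byte 0xE8.
print(ascii("Bonjour ma chère amie"))
```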
| msg74288 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2008年10月04日 00:23 | |
See also Lib/test/test_shlex.py: trunk is ok, but with fix_latin.diff the test fails. |
|||
| msg74291 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2008年10月04日 00:40 | |
It looks like the problem with fix_latin.diff is the decoding_state: it's set to STATE_NORMAL whereas the current behaviour is to stay in state STATE_RAW. I wrote another patch which is a mix of case 1 (utf-8: just set tok->encoding) and case 2 (another charset: set tok->enc, tok->encoding and tok->decoding_state): a new case 3 which sets enc and encoding but stays in state STATE_RAW. I don't understand my patch, so review it (twice or more :-D).

Using my patch:

- compile(...) works
- test_shlex.py works
- test_pep3120.py works
- iso.py works |
|||
| msg74294 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年10月04日 01:24 | |
But why does iso-8859-1 need to be treated as a special case? UTF-8 is special because it is the default encoding for source. But iso-8859-1 really shouldn't be special, IMO. Your patch does exactly what happens further down when set_readline() succeeds. So I want to know why set_readline() is not returning something for iso-8859-1 (which is what your patch seems to suggest is happening). |
|||
| msg74295 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年10月04日 01:28 | |
Sorry, I misspoke: your patch, Victor, doesn't change the state to NORMAL. But my worry still stands; why does iso-8859-1 need to be special-cased? It suggests to me that something more fundamental needs to be dealt with instead of just patching around iso-8859-1. |
|||
| msg74296 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2008年10月04日 01:42 | |
@brett.cannon: I found it: ast.c used a hack for iso-8859-1! Since this hack introduces a bug (your compile(...) example), I prefer to remove it to simplify the code. The new patch just removes the hack in tokenizer.c and ast.c. It also documents the encoding and enc attributes of tok_state in tokenizer.h. Using tokenizer_iso-8859-1-patch2.patch, all tests (test_pep3120.py, iso.py, test_shlex.py, etc.) are OK. |
|||
| msg74299 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年10月04日 02:23 | |
Thanks for finding that, Victor! I will do a patch review when I have a chance (it won't be until after the weekend). |
|||
| msg74300 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2008年10月04日 02:36 | |
After reading tokenizer.c 1000 times, I finally used grep:

    $ grep -l -i 'iso.8859.1' $(find -name "*.c")
    ./Python/ast.c                          <~~~ WTF?
    ./Objects/unicodeobject.c
    ./Parser/tokenizer.c
    ./Modules/cjkcodecs/_codecs_iso2022.c
    ./Modules/expat/xmltok.c |
|||
| msg74393 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2008年10月06日 21:48 | |
My patch version 2 included an "unrelated" fix for issue #2384. |
|||
| msg74415 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年10月07日 00:26 | |
So I tried the patch (I attached my version with different comments in the header file and some redundant whitespace changes removed), and test_sys consistently fails for me:

    test_current_frames (__main__.SysModuleTest)
    AssertionError: '' != 'g456()'

And this is after a ``make distclean`` and a new compile. This failure doesn't occur on a clean checkout made after a 'distclean'. |
|||
| msg74428 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2008年10月07日 08:21 | |
The test_sys failure is fixed by issue #2384. |
|||
| msg74620 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2008年10月10日 08:33 | |
Amaury applied both of my patches, for issues #2384 and #3975. So all tests now pass with Python trunk + alt_latin_1.diff. |
|||
| msg74642 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年10月10日 20:24 | |
Yep, it passes for me now. Martin, do you have any objections to this patch? |
|||
| msg74793 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2008年10月15日 06:05 | |
The patch looks fine to me, please apply. I notice that the diff file reports changes to test_pep3120.py. No such changes should be necessary, so please exclude them from committing. |
|||
| msg74807 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年10月15日 17:15 | |
On Tue, Oct 14, 2008 at 11:05 PM, Martin v. Löwis <report@bugs.python.org> wrote:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> The patch looks fine to me, please apply.

Great!

> I notice that the diff file reports changes to test_pep3120.py. No such
> changes should be necessary, so please exclude them from committing.

I added a test for compile() in there, which is why the patch is claiming that. There is an uploaded version of test_pep3120.py on the issue. |
|||
| msg74811 - (view) | Author: Martin v. Löwis (loewis) * (Python committer) | Date: 2008年10月15日 19:33 | |
> I added a test for compile() in there, which is why the patch is
> claiming that. There is an uploaded version of test_pep3120.py on the
> issue.

Ah, ok. I missed that - that change is also fine. |
|||
| msg74892 - (view) | Author: Barry A. Warsaw (barry) * (Python committer) | Date: 2008年10月17日 01:55 | |
Brett, please apply and close the issue. |
|||
| msg74894 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年10月17日 03:34 | |
Sorry about that; been one of those days. Doing a svn up and making sure it still compiles fine. |
|||
| msg74895 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年10月17日 03:39 | |
Applied in r66951. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:56:37 | admin | set | github: 47824 |
| 2008年10月17日 03:39:11 | brett.cannon | set | status: open -> closed messages: + msg74895 |
| 2008年10月17日 03:34:07 | brett.cannon | set | messages: + msg74894 |
| 2008年10月17日 01:55:11 | barry | set | nosy: + barry messages: + msg74892 |
| 2008年10月15日 19:33:06 | loewis | set | messages: + msg74811 |
| 2008年10月15日 17:15:34 | brett.cannon | set | messages: + msg74807 |
| 2008年10月15日 06:05:29 | loewis | set | keywords: - needs review assignee: loewis -> brett.cannon resolution: accepted messages: + msg74793 |
| 2008年10月10日 20:24:41 | brett.cannon | set | assignee: loewis messages: + msg74642 |
| 2008年10月10日 08:33:03 | vstinner | set | messages: + msg74620 |
| 2008年10月07日 08:21:16 | vstinner | set | dependencies: + [Py3k] line number is wrong after encoding declaration messages: + msg74428 |
| 2008年10月07日 00:26:56 | brett.cannon | set | files: + alt_latin_1.diff |
| 2008年10月07日 00:26:18 | brett.cannon | set | assignee: brett.cannon -> (no value) messages: + msg74415 |
| 2008年10月06日 21:49:17 | vstinner | set | files: + tokenizer_iso-8859-1-patch3.patch |
| 2008年10月06日 21:49:07 | vstinner | set | files: - python3_bytes_filename-3.patch |
| 2008年10月06日 21:48:35 | vstinner | set | files: + python3_bytes_filename-3.patch messages: + msg74393 |
| 2008年10月06日 21:32:38 | vstinner | set | files: - tokenizer_iso-8859-1.patch |
| 2008年10月04日 02:36:38 | vstinner | set | messages: + msg74300 |
| 2008年10月04日 02:23:42 | brett.cannon | set | assignee: brett.cannon messages: + msg74299 |
| 2008年10月04日 01:42:22 | vstinner | set | files: + tokenizer_iso-8859-1-patch2.patch messages: + msg74296 |
| 2008年10月04日 01:28:14 | brett.cannon | set | messages: + msg74295 |
| 2008年10月04日 01:24:43 | brett.cannon | set | messages: + msg74294 |
| 2008年10月04日 00:40:19 | vstinner | set | files: + tokenizer_iso-8859-1.patch messages: + msg74291 |
| 2008年10月04日 00:23:23 | vstinner | set | messages: + msg74288 |
| 2008年10月04日 00:19:39 | vstinner | set | files: + iso.py nosy: + vstinner messages: + msg74287 |
| 2008年10月02日 12:54:23 | barry | set | priority: deferred blocker -> release blocker |
| 2008年09月26日 22:18:34 | barry | set | priority: release blocker -> deferred blocker |
| 2008年09月18日 05:43:20 | barry | set | priority: deferred blocker -> release blocker |
| 2008年09月09日 13:12:01 | barry | set | priority: release blocker -> deferred blocker |
| 2008年09月05日 20:37:13 | brett.cannon | set | files: - pep3120_test.diff |
| 2008年09月05日 20:37:07 | brett.cannon | set | files: + test_pep3120.py |
| 2008年09月05日 20:36:41 | brett.cannon | set | files: - fix_latin.diff |
| 2008年09月05日 20:36:29 | brett.cannon | set | files: + fix_latin.diff messages: + msg72624 |
| 2008年08月24日 19:53:27 | brett.cannon | set | messages: + msg71855 |
| 2008年08月24日 19:19:03 | loewis | set | messages: + msg71852 |
| 2008年08月24日 18:12:06 | brett.cannon | set | messages: + msg71850 |
| 2008年08月24日 17:57:03 | loewis | set | messages: + msg71846 |
| 2008年08月21日 20:33:57 | brett.cannon | set | keywords: + needs review |
| 2008年08月21日 18:35:23 | brett.cannon | set | priority: critical -> release blocker |
| 2008年08月19日 02:54:03 | benjamin.peterson | set | nosy: + loewis, benjamin.peterson messages: + msg71403 |
| 2008年08月19日 02:37:05 | brett.cannon | set | dependencies: + PyTokenizer_FindEncoding() never succeeds messages: + msg71402 |
| 2008年08月19日 02:34:23 | brett.cannon | set | messages: + msg71401 |
| 2008年08月17日 01:58:08 | brett.cannon | set | type: behavior |
| 2008年08月17日 01:57:59 | brett.cannon | set | priority: critical files: + pep3120_test.diff messages: + msg71258 components: + Interpreter Core versions: + Python 3.0 |
| 2008年08月17日 01:56:36 | brett.cannon | set | files: + fix_latin.diff keywords: + patch messages: + msg71257 |
| 2008年08月17日 00:39:59 | brett.cannon | set | messages: + msg71252 |
| 2008年08月17日 00:39:39 | brett.cannon | create | |