This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2008年12月11日 06:19 by brett.cannon, last changed 2022年04月11日 14:56 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| tokenizer_ignore_cookie-4.patch | vstinner, 2009年01月29日 23:13 | |||
| tok_ignore_cookie-5.patch | benjamin.peterson, 2009年02月14日 18:23 | |||
| unnamed | brett.cannon, 2009年03月24日 23:40 | |||
| Messages (29) | |||
|---|---|---|---|
| msg77590 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2008年12月11日 06:19 | |
When compile() is called with a string it is a reasonable assumption
that it has already been decoded. But this is not in fact the case and
leads to errors when trying to use non-ASCII identifiers::
>>> source = "# coding=latin-1\n\u00c6 = '\u00c6'"
>>> compile(source, '<test>', 'exec')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<test>", line 2
à = 'Ã'
^
SyntaxError: invalid character in identifier
>>> compile(source.encode('latin-1'), '<test>', 'exec')
<code object <module> at 0x389cc8, file "<test>", line 2>
|
|||
| msg78534 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2008年12月30日 14:37 | |
Issue4742 is similar issue: >>> source = b"# coding=cp1252\n\x94 = '\x94'".decode('cp1252') >>> compile(source, '<test>', 'exec') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<test>", line 0 SyntaxError: unknown encoding: cp1252 The real error here is masked; just before the exception is set, there is a pending SyntaxError: 'charmap' codec can't decode byte 0x9d in position 18: character maps to <undefined> It seems that the source internal representation is correct utf-8, but this is decoded again with the "coding=" cookie. |
|||
| msg78917 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2009年01月03日 01:13 | |
Here is what I have found out so far. Python/bltinmodule.c:builtin_compile takes in a PyObject and gets the char * representation of that object and passes it to Python/pythonrun.c:Py_CompileStringFlags. Unfortunately no other information is passed along in the call, including what the encoding happens to be. This is unfortunate as builtin_compile makes sure that the char* data is encoded using the default encoding before calling Py_CompileStringFlags. I just tried setting a PyCF flag to denote that the char* data is encoded using the default encoding, but Parser/tokenizer.c is not compiled against unicodeobject.c and thus one cannot use PyUnicode_GetDefaultEncoding() to know what the data is stored as. I'm going to try to explicitly convert to UTF-8 and see if that works. |
|||
| msg78930 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2009年01月03日 04:02 | |
So explicitly converting to UTF-8 didn't work, or at least as simply as I had hoped. |
|||
| msg79102 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年01月05日 01:56 | |
The function decode_str() (Parser/tokenizer.c) is responsible to detect the encoding using the BOM or the cookie ("coding: xxx"). decode_str() reencodes also the text to utf-8 if the encoding is different than utf-8. I think that we can just skip this function if the input text is already unicode (utf-8). Attached patch implements this idea. The patch introduces a new compiler flag (PyCF_IGNORE_COOKIE) and a new parser flag (PyPARSE_IGNORE_COOKIE). The new compiler flag is set by source_as_string() when the input is a PyUnicode object. "Ignore cookie" is maybe not the best name for this flag. With my patch, the first Brett's example displays: $ ./python com2.py Traceback (most recent call last): File "com2.py", line 3, in <module> compile(source, '<test>', 'exec') File "<test>", line 2 " = '"' ^ SyntaxError: invalid character in identifier The error cursor is not at the right column (bug related to the issue 2382 or introduced by my patch?). The patch changes the public API: PyTokenizer_FromString() prototype changed to get a new argument. I don't like changing public API. The new argument should be a bit vector (flags) instead of a single bit (ignore_cookie). We can avoid changing the public API by creating a new function (eg. "PyTokenizer_FromUnicode" ;-)). There are some old PyPARSE_xxx constants in Include/parsetok.h that might be removed. PyPARSE_WITH_IS_KEYWORD value is 3 which is strange since flags is a bit vector (changed with | and tested by &). But PyPARSE_WITH_IS_KEYWORD is a dead constant (written in #if 0...#endif). |
|||
| msg79103 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年01月05日 01:56 | |
Oops, I attached the wrong file :-p |
|||
| msg79104 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年01月05日 02:01 | |
> With my patch, the first Brett's example displays: > ... > " = '"' > ^ > SyntaxError: invalid character in identifier > > The error cursor is not at the right column (bug related to the issue > 2382 or introduced by my patch?). I tried py3k_adjust_cursor_at_syntax_error_v2.patch (issue #2382) and the cursor is displayed at the right column: $ ./python com2.py Traceback (most recent call last): ... File "<test>", line 2 " = '"' ^ So it's not a new bug ;-) |
|||
| msg79992 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年01月17日 01:55 | |
New version of my patch: add a regression test. @brett.cannon: Could you review my patch? |
|||
| msg79997 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2009年01月17日 02:37 | |
I will see when I can get to it (other stuff is taking priority). Not going to assign to myself quite yet in case someone wants to beat me to this. I won't lose track since I created the bug and have a saved query to look at all bugs I make. |
|||
| msg80579 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年01月26日 17:00 | |
Ping! Can anyone review my patch? |
|||
| msg80613 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009年01月27日 03:32 | |
I don't like the change of API to PyTokenizer_FromString. I would prefer another function like PyTokenizer_IgnoreCodingCookie() blows up when parsing has already started. The (char *) cast in PyTokenizer_FromString is unneeded. You need to indent the "else" clause after you test for ignore_cookie. I'd like to see a test that shows that byte strings still have their cookies examined. |
|||
| msg80635 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) | Date: 2009年01月27日 08:16 | |
Note that PyTokenizer_FromString is not an API function: it's not marked with PyAPI_FUNC, and not exported from the windows DLL. |
|||
| msg80791 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年01月29日 23:13 | |
> I don't like the change of API to PyTokenizer_FromString. > I would prefer another function like PyTokenizer_IgnoreCodingCookie() Ok, I created a new function PyTokenizer_FromUnicode(). I choosed "FromUnicode" because the string is encoded in unicode (as UTF-8, even if it's not the wchar_t* type). > The (char *) cast in PyTokenizer_FromString is unneeded. The cast on the decode_str() result? It was already present in the original code. I removed it in my new patch. > You need to indent the "else" clause after you test for ignore_cookie. Ooops, I always have problems to generate a diff because my editor removes trailing spaces and then I have to ignore space changes to create the diff. > I'd like to see a test that shows that byte strings still have their cookies examined. test_pep263 has already two tests using a "#coding:" header. |
|||
| msg80804 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009年01月30日 02:00 | |
On Thu, Jan 29, 2009 at 5:13 PM, STINNER Victor <report@bugs.python.org> wrote: > Ok, I created a new function PyTokenizer_FromUnicode(). I > choosed "FromUnicode" because the string is encoded in unicode (as > UTF-8, even if it's not the wchar_t* type). How about PyTokenizer_FromUTF8() then? > >> The (char *) cast in PyTokenizer_FromString is unneeded. > > The cast on the decode_str() result? It was already present in the > original code. I removed it in my new patch. No, I was referring to this line: tok->encoding = (char *)PyMem_MALLOC |
|||
| msg82102 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009年02月14日 18:23 | |
Here's another patch. The one problem is that it causes test_coding to fail because coding cookies are ignored and not checked to see if they are the same as the encoding of the file. It seems fine to me to just skip this test. |
|||
| msg82393 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年02月18日 00:22 | |
@benjamin.peterson: I don't see you changes... I read both patches carrefuly and I saw: - source_as_string(): remove "cf.cf_flags = supplied_flags | PyCF_SOURCE_IS_UTF8;" - rename PyTokenizer_FromUnicode() to PyTokenizer_FromUTF8() - update NEWS I don't understand the change in source_as_string(). Except of that, it looks correct. > The one problem is that it causes test_coding to fail because > coding cookies are ignored and not checked to see if they > are the same as the encoding of the file. The test have to fail, but the error is not the the compile() patch, but in the test. Input file is opened as unicode instead of bytes. A propose this patch to fix the test: Index: Lib/test/test_coding.py =================================================================== --- Lib/test/test_coding.py (révision 69723) +++ Lib/test/test_coding.py (copie de travail) @@ -17,7 +17,7 @@ path = os.path.dirname(__file__) filename = os.path.join(path, module_name + '.py') - fp = open(filename, encoding='utf-8') + fp = open(filename, 'rb') text = fp.read() fp.close() self.assertRaises(SyntaxError, compile, text, filename, 'exec') |
|||
| msg82615 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009年02月23日 01:40 | |
On Tue, Feb 17, 2009 at 6:22 PM, STINNER Victor <report@bugs.python.org> wrote: > I don't understand the change in source_as_string(). Except of that, > it looks correct. Py_CFFLAGS_SOURCE_IS_UTF8 is already set in compile(). > >> The one problem is that it causes test_coding to fail because >> coding cookies are ignored and not checked to see if they >> are the same as the encoding of the file. > > The test have to fail, but the error is not the the compile() patch, > but in the test. Input file is opened as unicode instead of bytes. A > propose this patch to fix the test: That fix is correct, but I think it avoids what the test is meant to test. The test is supposed to check that compile() complains if the encoding of the unicode string is wrong compared to the coding cookie, but I think that feature is ok to not support. |
|||
| msg82793 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年02月26日 23:54 | |
> I think that feature is ok to not support. Yeah! Anyone to review and/or commit the last patch? |
|||
| msg82815 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009年02月27日 03:22 | |
I'll deal with it eventually. |
|||
| msg83047 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009年03月02日 23:32 | |
Fixed in r70112. |
|||
| msg83048 - (view) | Author: Benjamin Peterson (benjamin.peterson) * (Python committer) | Date: 2009年03月02日 23:32 | |
Should this be backported? |
|||
| msg83069 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年03月03日 08:47 | |
> Should this be backported? It's the r70113 (not the 70112). I see that pitrou backported the fix to 3.0.x. I think that it's enough, 2.x doesn't require the fix. |
|||
| msg84093 - (view) | Author: jmf (jmfauth) | Date: 2009年03月24日 16:24 | |
I'm glad to have discovered this topic. I bumped into something similar when I toyed with an interactive interpreter. from code import InteractiveInterpreter ii = InteractiveInterpreter() source = ... ii.runsource(source) What should be the encoding and/or the type (str, bytes) of the "source" string? Taking into account the encoding of the script which contains this code, I have the feeling there is always something going wrong, this can be a "non ascii" char in the source (encoded in utf-8!) or the interactive interpreter does not accept very well a byte string representing a utf-8 encoded string. IDLE is not suffering from this. Its interactive interpreter is somehow receiving "ucs-2 ready string" from tkinter. I'm a little bit confused here (win2k, winXP sp2, Python 3.0.1). |
|||
| msg84125 - (view) | Author: Brett Cannon (brett.cannon) * (Python committer) | Date: 2009年03月24日 23:40 | |
On Tue, Mar 24, 2009 at 09:24, Jean-Michel Fauth <report@bugs.python.org>wrote: > > Jean-Michel Fauth <wxjmfauth@gmail.com> added the comment: > > I'm glad to have discovered this topic. I bumped into something similar > when I toyed with an interactive interpreter. > > from code import InteractiveInterpreter > > ii = InteractiveInterpreter() > source = ... > ii.runsource(source) > > What should be the encoding and/or the type (str, bytes) of the "source" > string? Off the top of my head it should be UTF-8. Otherwise it can probably be bytes as long as it has universal newlines. |
|||
| msg84153 - (view) | Author: jmf (jmfauth) | Date: 2009年03月25日 09:54 | |
When I was preparing some test examples to be submitted here. I noticed the module codeop.py used by the InteractiveInterpreter, does not like byte strings very much. IDLE, Python 3.0.1, winxp sp2 >>> source = b'print(999)' >>> compile(source, '<in>', 'exec') <code object <module> at 0x00AA5CC8, file "<in>", line 1> >>> r = compile(source, '<in>', 'exec') >>> exec(r) 999 >>> from code import InteractiveInterpreter >>> ii = InteractiveInterpreter() >>> ii.runsource(source) Traceback (most recent call last): File "<pyshell#6>", line 1, in <module> ii.runsource(source) File "C:\Python30\lib\code.py", line 63, in runsource code = self.compile(source, filename, symbol) File "C:\Python30\lib\codeop.py", line 168, in __call__ return _maybe_compile(self.compiler, source, filename, symbol) File "C:\Python30\lib\codeop.py", line 70, in _maybe_compile for line in source.split("\n"): TypeError: Type str doesn't support the buffer API >>> >>> source = 'print(999)' >>> ii.runsource(source) 999 False |
|||
| msg84154 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年03月25日 10:02 | |
@jmfauth: Can you open a different issue for the IDLE issue? |
|||
| msg84159 - (view) | Author: jmf (jmfauth) | Date: 2009年03月25日 14:56 | |
> Victor Yes, I could, but I think this is not an IDLE issue, I'm just using IDLE to illustrate the problem. I got the same when I'm working from within an editor or with my interactive interpreter I wrote for the fun. Code in the editor: # -*- coding: cp1252 -*- # Python 3.0.1, winxp sp2 from code import InteractiveInterpreter ii = InteractiveInterpreter() source = b'print(999)' ii.runsource(source) Output: >c:\python30\pythonw -u "uuu.py" Traceback (most recent call last): File "uuu.py", line 8, in <module> ii.runsource(source) File "c:\python30\lib\code.py", line 63, in runsource code = self.compile(source, filename, symbol) File "c:\python30\lib\codeop.py", line 168, in __call__ return _maybe_compile(self.compiler, source, filename, symbol) File "c:\python30\lib\codeop.py", line 70, in _maybe_compile for line in source.split("\n"): TypeError: Type str doesn't support the buffer API >Exit code: 1 My interactive interpreter >>> --- from code import InteractiveInterpreter >>> --- ii = InteractiveInterpreter() >>> --- source = b'print(999)' >>> --- ii.runsource(source) Traceback (most recent call last): File "<smid last command>", line 1, in <module> File "c:\Python30\lib\code.py", line 63, in runsource code = self.compile(source, filename, symbol) File "c:\Python30\lib\codeop.py", line 168, in __call__ return _maybe_compile(self.compiler, source, filename, symbol) File "c:\Python30\lib\codeop.py", line 70, in _maybe_compile for line in source.split("\n"): TypeError: Type str doesn't support the buffer API >>> --- ======================= I realised and missed the fact the str() function is now accepting an encoding argument (very good). With it, the above code works. Code in the editor: # -*- coding: cp1252 -*- # Python 3.0.1, winxp sp2 from code import InteractiveInterpreter ii = InteractiveInterpreter() source = b'print(999)' source = str(source, 'cp1252') #<<<<<<<<<< ii.runsource(source) Output: (ok) >c:\python30\pythonw -u "uuu.py" 999 >Exit code: 0 ======================= In a few words, my empirical understanding of the story. 1) Things are in good shape, but Python, itsself and its components (compile, interactiveinterpreter module, runsource(), exec(), ...) are lacking in consistency, the all accept miscellanous arguments type or they are not strict enough in their arguments acceptance. When they accept a str, which encoding is accepted? 2) Python 3 is now using unicode as str. Of course, this is welcome, but the caveat is that there are now two encodings in use. Code source defaults to utf-8, but the str in code defaults to ucs-2/4 (eg. in IDLE). 3) Maybe a solution is to have an optional encoding argument as we now have everywhere eg. str(), open(), which defaults to one encoding. compile(source, filename, encodings='utf-8', ...) (Problem: BOM, coding cookies?). I suspect the miscellaneous discussions one finds from people attempting to write a "correct" execfile() for Python 3 are coming from this. Regards. |
|||
| msg84160 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2009年03月25日 15:23 | |
> Yes, I could, but I think this is not an IDLE issue > (...) > File "uuu.py", line 8, in <module> > ii.runsource(source) > (...) > File "c:\python30\lib\codeop.py", line 70, in _maybe_compile > for line in source.split("\n"): > TypeError: Type str doesn't support the buffer API compile() works as expected. Your problem is related to InteractiveInterpreter().runsource(source) which is part of IDLE. runsource() is not compatible with bytes, only 'str' type is accepted. The error comes from bytes.split(str): _maybe_compile() should use source.split(b'\n') if source type is bytes. Or runsource() should reject bytes directly. > source = str(source, 'cp1252') #<<<<<<<<<< > ii.runsource(source) > > Output: (ok) Yes, runsource() (only) works with the str type. > I suspect the miscellaneous discussions one finds from people attempting > to write a "correct" execfile() for Python 3 are coming from this. Please see issues: - issue #5524: execfile() removed from Python3 - issue #4628: No universal newline support for compile() when using bytes |
|||
| msg84623 - (view) | Author: jmf (jmfauth) | Date: 2009年03月30日 19:26 | |
Quick feedback from a Windows user. I made a few more tests with the freshly installed Pyton 3.1a1. The compile() function is running very well. As a side effect, it now possible to write an "execfile()" without any problem. Tests with execfile() versions reading files as text or as binary, with/without coding cookies, BOM, cp1252, cp850, cp437, utf-8 cookie, utf-8 with bom, .... (Of course, taking in account and managing universal newline). |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:56:42 | admin | set | github: 48876 |
| 2009年03月30日 19:26:09 | jmfauth | set | messages: + msg84623 |
| 2009年03月25日 15:23:16 | vstinner | set | messages: + msg84160 |
| 2009年03月25日 14:56:49 | jmfauth | set | messages: + msg84159 |
| 2009年03月25日 10:02:32 | vstinner | set | messages: + msg84154 |
| 2009年03月25日 09:54:41 | jmfauth | set | messages: + msg84153 |
| 2009年03月24日 23:40:13 | brett.cannon | set | files:
+ unnamed messages: + msg84125 |
| 2009年03月24日 16:24:28 | jmfauth | set | nosy:
+ jmfauth messages: + msg84093 |
| 2009年03月20日 01:30:35 | vstinner | link | issue4282 dependencies |
| 2009年03月03日 08:47:20 | vstinner | set | messages: + msg83069 |
| 2009年03月02日 23:32:46 | benjamin.peterson | set | messages: + msg83048 |
| 2009年03月02日 23:32:33 | benjamin.peterson | set | status: open -> closed resolution: fixed messages: + msg83047 |
| 2009年02月27日 03:22:27 | benjamin.peterson | set | assignee: benjamin.peterson messages: + msg82815 |
| 2009年02月26日 23:54:51 | vstinner | set | messages:
+ msg82793 versions: + Python 3.1 |
| 2009年02月23日 01:40:59 | benjamin.peterson | set | messages: + msg82615 |
| 2009年02月18日 00:22:10 | vstinner | set | messages: + msg82393 |
| 2009年02月14日 18:23:08 | benjamin.peterson | set | keywords:
+ needs review files: + tok_ignore_cookie-5.patch messages: + msg82102 |
| 2009年01月30日 02:00:52 | benjamin.peterson | set | messages: + msg80804 |
| 2009年01月29日 23:13:48 | vstinner | set | files: - tokenizer_ignore_cookie-3.patch |
| 2009年01月29日 23:13:37 | vstinner | set | files:
+ tokenizer_ignore_cookie-4.patch messages: + msg80791 |
| 2009年01月27日 08:16:52 | amaury.forgeotdarc | set | messages: + msg80635 |
| 2009年01月27日 03:32:51 | benjamin.peterson | set | nosy:
+ benjamin.peterson messages: + msg80613 |
| 2009年01月26日 17:00:42 | vstinner | set | messages: + msg80579 |
| 2009年01月17日 02:37:16 | brett.cannon | set | messages: + msg79997 |
| 2009年01月17日 01:55:29 | vstinner | set | files: - tokenizer_ignore_cookie-2.patch |
| 2009年01月17日 01:55:24 | vstinner | set | files:
+ tokenizer_ignore_cookie-3.patch messages: + msg79992 |
| 2009年01月05日 02:01:47 | vstinner | set | messages: + msg79104 |
| 2009年01月05日 01:56:55 | vstinner | set | files: - tokenizer_ignore_cookie.patch |
| 2009年01月05日 01:56:50 | vstinner | set | files:
+ tokenizer_ignore_cookie-2.patch messages: + msg79103 |
| 2009年01月05日 01:56:18 | vstinner | set | files:
+ tokenizer_ignore_cookie.patch nosy: + vstinner messages: + msg79102 keywords: + patch |
| 2009年01月03日 04:02:14 | brett.cannon | set | messages: + msg78930 |
| 2009年01月03日 01:13:47 | brett.cannon | set | messages: + msg78917 |
| 2008年12月30日 21:28:04 | sjmachin | set | nosy: + sjmachin |
| 2008年12月30日 14:37:58 | amaury.forgeotdarc | link | issue4742 superseder |
| 2008年12月30日 14:37:18 | amaury.forgeotdarc | set | nosy:
+ amaury.forgeotdarc messages: + msg78534 |
| 2008年12月11日 06:19:16 | brett.cannon | create | |