This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: profile doesn't support non-UTF8 source code
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: 4626 4628 Superseder:
Assigned To: Nosy List: brett.cannon, christian.heimes, shidot, vstinner
Priority: normal Keywords: needs review, patch

Created on 2008-11-08 02:49 by shidot, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
profile_encoding.patch vstinner, 2008-11-10 00:40 profile module: open input file (from the command line) in binary mode
profile_encoding-2.patch vstinner, 2009-03-20 01:44
Messages (9)
msg75627 - (view) Author: Takafumi SHIDO (shidot) Date: 2008-11-08 02:49
The profile module of Python 3 doesn't understand the character set of
the script.
When the profiler is run (e.g. $ python -m profile -o prof.dat foo.py)
on a script (say foo.py) that declares its character set on the second
line (e.g. #coding:utf-8),
the profiler crashes with an error message like:
"SyntaxError: unknown encoding: utf-8"
msg75676 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-10 00:40
exec() doesn't work if the argument is a Unicode string. Here is a
workaround for the profile module (open the file in binary mode), but it
doesn't fix the exec() problem.
msg75677 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-10 01:03
Example of the problem:
exec('#header\n# encoding: ISO-8859-1\nprint("h\xe9 h\xe9")\n')
exec(unicode) calls source_as_string(), which converts unicode to bytes
using _PyUnicode_AsDefaultEncodedString() (UTF-8 charset). Then
PyRun_StringFlags() is called with the UTF-8 byte string and the
PyCF_SOURCE_IS_UTF8 flag. But in the parser, get_coding_spec() recognizes
the "#coding:" header and converts the bytes to unicode using the specified
charset (which may be different from UTF-8).
The problem is in the function PyAST_FromNode(): the flag is not used in
the tokenizer but only in the AST parser. I also see:
    if (flags && flags->cf_flags & PyCF_SOURCE_IS_UTF8) {
        c.c_encoding = "utf-8";
        if (TYPE(n) == encoding_decl) {
#if 0
            ast_error(n, "encoding declaration in Unicode string");
            goto error;
#endif
            n = CHILD(n, 0);
        }
    } else if (TYPE(n) == encoding_decl) {
        c.c_encoding = STR(n);
        n = CHILD(n, 0);
    } else {
        /* PEP 3120 */
        c.c_encoding = "utf-8";
    }
The ast_error() call (currently disabled with #if 0) could be re-enabled.
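The difference between passing text and passing bytes can be sketched in Python. This is only an illustration, not code from the patches in this issue, and it assumes a current Python 3 interpreter (where #4626 is fixed); the variable names are hypothetical. With a str argument the text is already decoded, so the coding declaration is ignored; with a bytes argument the declaration is used to decode the source.

```python
# Illustrative sketch only; assumes a current Python 3 interpreter.
# Same source as in the message above, but assigning instead of printing.
latin1_source = '#header\n# encoding: ISO-8859-1\nmessage = "h\xe9 h\xe9"\n'

# str input: the text is already decoded, so the declaration is ignored.
ns_str = {}
exec(compile(latin1_source, "<str>", "exec"), ns_str)

# bytes input: the compiler decodes the bytes using the declared encoding.
ns_bytes = {}
exec(compile(latin1_source.encode("iso-8859-1"), "<bytes>", "exec"), ns_bytes)

print(ns_str["message"] == ns_bytes["message"])
```

Both paths should yield the same string, which is the behaviour the exec(unicode) bug violated.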
msg83842 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:25
This bug was a duplicate of #4626 which was fixed by r70113 ;-)
msg83843 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:30
Oops, I misread this issue (wrong title!). #4626 is related, but this
issue is about the profile module. The problem is that profile opens
the source code as text (with the default charset: UTF-8).
The attached patch fixes the problem.
Example:
--- x.py (ISO-8859-1 text file) ---
#coding: ISO-8859-1
print("hé hé")
-----------------------------------
Run: python -m profile x.py
Current result:
 (...)
 File ".../py3k/Lib/profile.py", line 614, in main
 script = fp.read()
 File ".../Lib/codecs.py", line 300, in decode
 (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes (...)
With my patch, it works as expected.
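The failure and the binary-mode approach can be reproduced outside profile with a short script. This is a minimal sketch under stated assumptions, not the patch itself: it writes a temporary x.py encoded in ISO-8859-1, uses an assignment instead of print() so the result can be inspected, and all names are hypothetical.

```python
import os
import tempfile

# Assumption: same content as x.py above, with an assignment instead of print().
source_bytes = '#coding: ISO-8859-1\nmessage = "h\xe9 h\xe9"\n'.encode("iso-8859-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x.py")
    with open(path, "wb") as fp:
        fp.write(source_bytes)

    # Text mode with UTF-8: the 0xE9 bytes are not valid UTF-8.
    try:
        with open(path, encoding="utf-8") as fp:
            fp.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode: compile() honours the coding declaration itself.
    with open(path, "rb") as fp:
        code = compile(fp.read(), path, "exec")

namespace = {}
exec(code, namespace)
print(type(text_mode_error).__name__, namespace["message"])
```

Reading in binary mode leaves the decoding to the compiler, which is the idea behind opening the script in binary mode.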
msg83844 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:44
Oops, benjamin noticed that it doesn't work with Windows line endings
(\r\n). The new patch reads the file encoding instead of reading the file
content as bytes.
msg83846 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-20 01:56
This regression was introduced by the removal of execfile() in
Python 3. The proposed replacement for execfile() is wrong. I propose a
generic fix in issue #5524.
msg83933 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-21 10:51
After some discussions, I think that my first patch 
(profile_encoding.patch) was correct but we also have to fix #4628.
msg101477 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-22 02:00
Fixed by r79271 (py3k), r79272 (3.1).
History
Date User Action Args
2022-04-11 14:56:41 admin set github: 48532
2010-03-22 02:00:33 vstinner set status: open -> closed
    resolution: fixed
    messages: + msg101477
2009-03-21 10:51:36 vstinner set dependencies: + No universal newline support for compile() when using bytes
    messages: + msg83933
2009-03-20 01:56:57 vstinner set messages: + msg83846
2009-03-20 01:44:43 vstinner set files: + profile_encoding-2.patch
    keywords: + patch
    messages: + msg83844
2009-03-20 01:38:47 brett.cannon set keywords: - patch
    stage: test needed -> patch review
2009-03-20 01:30:45 vstinner set keywords: + needs review
2009-03-20 01:30:35 vstinner set status: closed -> open
    title: exec(unicode): invalid charset when #coding:xxx spec is used -> profile doesn't support non-UTF8 source code
    messages: + msg83843
    dependencies: + compile() doesn't ignore the source encoding when a string is passed in
    resolution: fixed -> (no value)
2009-03-20 01:25:22 vstinner set status: open -> closed
    resolution: fixed
    messages: + msg83842
2008-11-10 09:48:00 vstinner set title: (Python3) The profile module deesn't understand the character set definition -> exec(unicode): invalid charset when #coding:xxx spec is used
2008-11-10 09:46:37 vstinner set nosy: + brett.cannon
2008-11-10 01:03:28 vstinner set messages: + msg75677
2008-11-10 00:40:37 vstinner set files: + profile_encoding.patch
    keywords: + patch
    messages: + msg75676
    nosy: + vstinner
2008-11-09 17:39:21 christian.heimes set priority: normal
    nosy: + christian.heimes
    type: crash -> behavior
    components: + Library (Lib)
    stage: test needed
2008-11-08 02:49:31 shidot create
