Message 158797 - Python tracker

In-reply-to
Author	eric.snow
Recipients	brett.cannon, eric.snow, loewis
Date	2012-04-20 05:17:16
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1334899038.03.0.61083358932.issue14629@psf.upfronthosting.co.za>

Content

(see http://mail.python.org/pipermail/python-dev/2012-April/118889.html)
The behavior of tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename() is unexpectedly different and this has bearing on the current work on imports.
When a file has no encoding indicator (see PEP 263), the source encoding falls back to UTF-8 (see PEP 3120). The tokenize module (Lib/tokenize.py) facilitates this through "detect_encoding()". The CPython internal tokenizer (Python/tokenizer.c) does so through "PyTokenizer_FindEncodingFilename()". Both check the first two lines of the file, per PEP 263.
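For reference, the two-line check and the UTF-8 fallback can be seen from Python with in-memory sources. This is only a minimal sketch: the helper name sniff() and the sample byte strings are made up for illustration and are not part of the attached material.

```python
import io
import tokenize


def sniff(source: bytes) -> str:
    """Return the encoding name that tokenize.detect_encoding() reports.

    `sniff` is a hypothetical helper for this illustration only.
    """
    # detect_encoding() wants a readline callable that returns bytes.
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# No encoding indicator at all: the PEP 3120 default applies.
print(sniff(b"x = 1\n"))

# PEP 263 cookie on the first line.
print(sniff(b"# -*- coding: latin-1 -*-\nx = 1\n"))

# PEP 263 cookie on the second line, after a shebang.
print(sniff(b"#!/usr/bin/env python\n# coding: latin-1\nx = 1\n"))
```

Note that detect_encoding() normalizes the cookie name, so "latin-1" comes back as "iso-8859-1".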
When faced with an unparsable file (per the encoding), tokenize.detect_encoding() will gladly give you the encoding without any fuss. However, PyTokenizer_FindEncodingFilename() will raise a SyntaxError in that situation.
The 'badsyntax_pep3120' test (Lib/test/badsyntax_pep3120.py) is one module that demonstrates this discrepancy. I'll use it in the following example.
 ---
For tokenize.detect_encoding():
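 import tokenize
 # detect_encoding() expects a readline that returns bytes, so open in binary mode
 with open("cpython/Lib/test/badsyntax_pep3120.py", "rb") as f:
     enc, _ = tokenize.detect_encoding(f.readline)  # returns (encoding, lines_read)
 print(enc) # "utf-8" (no SyntaxError)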
For PyTokenizer_FindEncodingFilename():
I've attached the source for a C extension module ('_tokenizer') that wraps PyTokenizer_FindEncodingFilename().
 import _tokenizer
 enc = _tokenizer.detect_encoding("cpython/Lib/test/badsyntax_pep3120.py")
 print(enc) # raises SyntaxError
 ---
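The tokenize half of the comparison can also be tried without a CPython checkout. The sketch below is hedged: BAD_SOURCE is my own stand-in (a Latin-1 byte with no encoding cookie), assumed to have the same shape of problem as badsyntax_pep3120.py rather than being a copy of it, and probe() is a hypothetical helper. Depending on the Python version, detect_encoding() may either report "utf-8" or raise SyntaxError, so the helper reports whichever happens instead of assuming one outcome.

```python
import io
import tokenize

# Assumption: a stand-in for the test module, not its actual contents.
# 0xF6 is a Latin-1 character that is invalid as UTF-8 here, and there is
# no PEP 263 cookie, so the declared (default) encoding is UTF-8.
BAD_SOURCE = b'print("b\xf6se")\n'


def probe(source: bytes) -> str:
    """Return the detected encoding, or "SyntaxError" if detection refuses."""
    try:
        encoding, _lines_read = tokenize.detect_encoding(
            io.BytesIO(source).readline
        )
    except SyntaxError:
        return "SyntaxError"
    return encoding


print(probe(BAD_SOURCE))
```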
Some relevant, related notes:
The discrepancies extend further too. The following code raises a UnicodeDecodeError, rather than a SyntaxError:
 # binary mode is required; list() forces the lazy token generator to run
 list(tokenize.tokenize(open("/home/esnow/projects/import_cleanup/Lib/test/badsyntax_pep3120.py", "rb").readline))
In 3.1 (C-based import machinery, Python/import.c), the following results in a SyntaxError, during encoding detection. In the current repo tip (importlib-based import machinery, Lib/importlib/_bootstrap.py), the following results in a SyntaxError much later, during compilation.
 import test.badsyntax_pep3120
importlib uses tokenize.detect_encoding() and import.c uses PyTokenizer_FindEncodingFilename()...
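To illustrate the "during compilation" stage in isolation, here is a minimal sketch that hands undecodable source bytes straight to the built-in compile(). The assumptions are mine: BAD_SOURCE is a stand-in with a Latin-1 byte and no encoding cookie (not the actual test module), compile_outcome() is a hypothetical helper, and calling compile() on raw bytes is only an approximation of what the import machinery does at that step.

```python
# Assumption: stand-in source, not the contents of badsyntax_pep3120.py.
BAD_SOURCE = b'print("b\xf6se")\n'


def compile_outcome(source: bytes) -> str:
    """Return "ok" if the bytes compile, or "SyntaxError" if they do not."""
    try:
        # With bytes input, compile() determines the source encoding itself
        # (PEP 263 cookie, else UTF-8) before parsing.
        compile(source, "<bad_utf8>", "exec")
    except SyntaxError:
        return "SyntaxError"
    return "ok"


print(compile_outcome(BAD_SOURCE))
print(compile_outcome(b"x = 1\n"))
```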

History
Date	User	Action	Args
2012-04-20 05:17:18	eric.snow	set	recipients: + eric.snow, loewis, brett.cannon
2012-04-20 05:17:18	eric.snow	set	messageid: <1334899038.03.0.61083358932.issue14629@psf.upfronthosting.co.za>
2012-04-20 05:17:17	eric.snow	link	issue14629 messages
2012-04-20 05:17:17	eric.snow	create