Issue 14629: discrepancy between tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename()

This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/58834

classification

Title:	discrepancy between tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename()
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.2, Python 3.3

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	Arfrever, brett.cannon, eric.snow, loewis, python-dev
Priority:	normal	Keywords:

Created on 2012-04-20 05:17 by eric.snow, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
_tokenizer.c	eric.snow, 2012-04-20 05:17	extension module wrapping PyTokenizer_FindEncodingFilename
setup.py	eric.snow, 2012-04-20 05:17	setup script for the extension module

Messages (9)
msg158797	Author: Eric Snow (eric.snow) * (Python committer)	Date: 2012-04-20 05:17
(see http://mail.python.org/pipermail/python-dev/2012-April/118889.html)

The behavior of tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename() is unexpectedly different, and this has bearing on the current work on imports.

When a file has no encoding indicator (see PEP 263), it falls back to UTF-8 (see PEP 3120). The tokenize module (Lib/tokenize.py) facilitates this through "detect_encoding()". The CPython internal tokenizer (Python/tokenizer.c) does so through "PyTokenizer_FindEncodingFilename()". Both check the first two lines of the file, per PEP 263.

When faced with an unparsable file (per the encoding), tokenize.detect_encoding() will gladly give you the encoding without any fuss. However, PyTokenizer_FindEncodingFilename() will raise a SyntaxError in that situation.

The 'badsyntax_pep3120' test (Lib/test/badsyntax_pep3120.py) is one module that demonstrates this discrepancy. I'll use it in the following example.

---

For tokenize.detect_encoding():

    import tokenize
    enc = tokenize.detect_encoding(open("cpython/Lib/test/badsyntax_pep3120.py").readline)
    print(enc)  # "utf-8" (no SyntaxError)

For PyTokenizer_FindEncodingFilename():

I've attached the source for a C extension module ('_tokenizer') that wraps PyTokenizer_FindEncodingFilename().

    import _tokenizer
    enc = _tokenizer.detect_encoding("cpython/Lib/test/badsyntax_pep3120.py")
    print(enc)  # raises SyntaxError

---

Some relevant, related notes:

The discrepancies extend further too. The following code returns a UnicodeDecodeError, rather than a SyntaxError:

    tokenize.tokenize(open("/home/esnow/projects/import_cleanup/Lib/test/badsyntax_pep3120.py").readline)

In 3.1 (C-based import machinery, Python/import.c), the following results in a SyntaxError, during encoding detection. In the current repo tip (importlib-based import machinery, Lib/importlib/_bootstrap.py), the following results in a SyntaxError much later, during compilation.

    import test.badsyntax_pep3120

importlib uses tokenize.detect_encoding() and import.c uses PyTokenizer_FindEncodingFilename()...
msg158824	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-04-20 12:37
New changeset b07488490001 by Martin v. Löwis in branch '3.2':
Issue #14629: Raise SyntaxError in tokenizer.detect_encoding
http://hg.python.org/cpython/rev/b07488490001

New changeset 98a6a57c5876 by Martin v. Löwis in branch 'default':
merge 3.2: issue 14629
http://hg.python.org/cpython/rev/98a6a57c5876
msg158825	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012-04-20 12:39
Thanks for the report. This is now fixed in 3.2 and default. Notice that your usage of tokenize is incorrect: you need to open the file in binary mode.
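For readers following along, here is a minimal sketch of the binary-mode usage Martin describes. It assumes a Python version that includes the fix from this issue; the helper name `sniff_encoding` and the sample byte strings are made up for illustration and are not part of the tracker discussion.

```python
import io
import tokenize


def sniff_encoding(source_bytes):
    """Return the detected source encoding, or None if detection fails.

    Hypothetical helper: detect_encoding() expects a readline callable
    that yields bytes, which is why the data is wrapped in BytesIO
    (the in-memory equivalent of opening a file with mode "rb").
    """
    readline = io.BytesIO(source_bytes).readline
    try:
        encoding, _consumed_lines = tokenize.detect_encoding(readline)
    except SyntaxError:
        # With the fix applied, undecodable source is reported here.
        return None
    return encoding


plain = sniff_encoding(b"x = 1\n")
declared = sniff_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")
broken = sniff_encoding(b'x = "\xf6"\n')  # invalid UTF-8, no coding line
```

Note that detect_encoding() returns a tuple of the encoding name and the lines it consumed, so the encoding has to be unpacked from the result.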
msg158831	Author: Eric Snow (eric.snow) * (Python committer)	Date: 2012-04-20 15:09
Thanks, Martin! That did the trick.
msg158839	Author: Eric Snow (eric.snow) * (Python committer)	Date: 2012-04-20 15:24
Apparently the message string contained by the SyntaxError is different between the two. I noticed due to the hard-coded check in test_find_module_encoding (in Lib/test/test_imp.py). I've brought up the specific issue of that hard-coded message check in issue14633. However, in case it otherwise matters that the message string be the same between the two, I've brought it up here.
msg158847	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012-04-20 16:19
IMO, the test is flawed in testing for the specific error message. OTOH, the original message is better than the tokenize message in that it mentions the file name. However, tokenize does not have the file name available, so it can't possibly report it. I have no idea how to resolve this. Contributions are welcome.
msg158849	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-04-20 17:00
New changeset a281a6622714 by Brett Cannon in branch 'default':
Issue #14633: Simplify imp.find_module() test after fixes from issue
http://hg.python.org/cpython/rev/a281a6622714
msg158855	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-04-20 17:24
New changeset 1b57de8a8383 by Brett Cannon in branch 'default':
Issue #14629: Mention the filename in SyntaxError exceptions from
http://hg.python.org/cpython/rev/1b57de8a8383
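As a rough illustration of what this changeset is about, the sketch below checks whether the SyntaxError text names the file when the readline callable comes from a real file opened in binary mode. This reflects how Lib/tokenize.py behaves in recent releases as far as I know; the file name `bad_module.py` and the helper `encoding_error_message` are assumptions made for the example.

```python
import os
import tempfile
import tokenize


def encoding_error_message(path):
    """Return the SyntaxError text from detect_encoding(), or None on success."""
    with open(path, "rb") as source:
        try:
            tokenize.detect_encoding(source.readline)
        except SyntaxError as exc:
            return str(exc)
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical module containing a byte that is not valid UTF-8
    # and no coding declaration.
    bad_path = os.path.join(tmp, "bad_module.py")
    with open(bad_path, "wb") as handle:
        handle.write(b'x = "\xf6"\n')
    message = encoding_error_message(bad_path)

print(message)
```

When the readline callable has no associated file (for example an in-memory buffer), there is no name to report, which matches Martin's earlier point that tokenize only knows what its caller hands it.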
msg158858	Author: Eric Snow (eric.snow) * (Python committer)	Date: 2012-04-20 17:35
Looks good. Thanks for the help, Martin and Brett.

History
Date	User	Action	Args
2022-04-11 14:57:29	admin	set	github: 58834
2012-04-20 17:35:20	eric.snow	set	messages: + msg158858
2012-04-20 17:24:34	python-dev	set	messages: + msg158855
2012-04-20 17:00:10	python-dev	set	messages: + msg158849
2012-04-20 16:19:30	loewis	set	messages: + msg158847
2012-04-20 15:24:33	eric.snow	set	messages: + msg158839
2012-04-20 15:09:11	eric.snow	set	messages: + msg158831 versions: + Python 3.2
2012-04-20 12:39:37	loewis	set	status: open -> closed resolution: fixed messages: + msg158825
2012-04-20 12:37:25	python-dev	set	nosy: + python-dev messages: + msg158824
2012-04-20 06:18:25	Arfrever	set	nosy: + Arfrever
2012-04-20 05:17:48	eric.snow	set	files: + setup.py
2012-04-20 05:17:17	eric.snow	create