
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: tokenize module w/ coding cookie
Components: Unicode
Versions: Python 3.0

process
Status: closed
Assigned To: trent
Nosy List: barry, loewis, mark.dickinson, michael.foord, trent
Priority: normal
Keywords: patch

Created on 2003-04-11 19:24 by barry, last changed 2022-04-10 16:08 by admin. This issue is now closed.

Files
File name  Uploaded  Description
test_tokenize_patch.tar  trent, 2008-03-17 06:45  test_tokenizer patches and supporting text files.
tokenize.zip  michael.foord, 2008-03-18 17:34  Changes to tokenize.py with tests for Python 3.
tokenize_patch.diff  michael.foord, 2008-03-18 21:15  Patch for tokenize, tests and standard library and tools usage of tokenize.
Messages (18)
msg15444 - Author: Barry A. Warsaw (barry) (Python committer) Date: 2003-04-11 19:24
The tokenize module should honor the coding cookie in a
file, probably so that it returns Unicode strings with
decoded characters.
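For context, here is the behaviour being requested, shown with the detect_encoding helper that Python 3's tokenize module eventually gained (a minimal sketch; this API postdates the original report):

    import io
    import tokenize

    # A source file carrying a PEP 263 coding cookie, as raw bytes.
    src = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"

    # detect_encoding reads at most two lines and reports the cookie;
    # it normalizes the name, so 'latin-1' comes back as 'iso-8859-1'.
    encoding, first_lines = tokenize.detect_encoding(io.BytesIO(src).readline)
    print(encoding)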
msg15445 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2006-04-12 08:02
I don't think I will do anything about this anytime soon, so
unassigning myself.
msg63612 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2008-03-17 01:55
This issue is currently causing test_tokenize failures in Python 3.0.
There are other ways to fix the test failures, but making tokenize honor 
the source file encoding seems like the right thing to do to me.
Does this still seem like a good idea to everyone?
msg63618 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2008-03-17 04:39
In 3k, the tokenize module should definitely return strings, and, in 
doing so, it should definitely consider the encoding declaration (and 
also the default encoding in absence of the encoding declaration).
For 2.6, I wouldn't mind if it were changed incompatibly so that it 
returns Unicode strings, or else that it parses in Unicode, and then 
encodes back to the source encoding before returning anything.
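The default-encoding case Martin mentions works out as follows in the final Python 3 API (a minimal sketch; with no cookie and no BOM, UTF-8 is the default):

    import io
    import tokenize

    # No coding cookie and no BOM: Python 3 defaults to UTF-8.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
    print(encoding)  # 'utf-8'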
msg63619 - Author: Trent Nelson (trent) (Python committer) Date: 2008-03-17 06:41
I've attached a patch to test_tokenize.py and a bunch of text files 
(that should be dropped into Lib/test) that highlight this issue a 
*lot* better than the current state of affairs.
The existing implementation defines roundup() in the doctest, then 
defines it again in the code body. The last for loop in the doctest 
fails intermittently, and what it's failing on isn't at all clear, 
because a) ten random files are selected out of the 332 in Lib/test, 
and b) there's no way to figure out which files are causing the 
failure unless you hack another method into the test case to replicate 
what the doctest is doing, with some additional print statements. That 
was the approach I took, only to get bitten by the fact that roundup() 
was being resolved to the bogus definition in the code body rather 
than the functional one in the doctest, which resulted in even more 
misleading behaviour.
FWIW, the file that causes the exception is test_doctest2.py as it 
contains encoded characters.
So, the approach this patch takes is to drop the 'pick ten random test 
files and untokenize/tokenize' approach and add a class that 
specifically tests the tokenizer's compliance with PEP 263.
I'll move on to a patch for tokenize.py next, but this patch is OK to 
commit now -- at the very least it'll clean up the misleading errors 
being reported by the plethora of red 3.0 buildbots at the moment.
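A minimal sketch of the kind of check such a test class performs (the class and test names here are illustrative, not taken from the patch):

    import io
    import tokenize
    import unittest

    class TestPEP263Compliance(unittest.TestCase):
        # Illustrative: a coding cookie must win over the default encoding.
        def test_latin1_coding_cookie(self):
            src = b"# -*- coding: latin-1 -*-\ns = 'caf\xe9'\n"
            encoding, _ = tokenize.detect_encoding(io.BytesIO(src).readline)
            # tokenize normalizes the cookie name to 'iso-8859-1'.
            self.assertEqual(encoding, "iso-8859-1")

    if __name__ == "__main__":
        unittest.main()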
msg63620 - Author: Trent Nelson (trent) (Python committer) Date: 2008-03-17 06:45
Hmm, I take it multiple file uploads aren't supported. I don't want to 
use svn diff for the text files as it looks like it's butchering the 
BOM encodings, so, tar it is! (Untar in the root py3k/ directory.)
msg63949 - Author: Michael Foord (michael.foord) (Python committer) Date: 2008-03-18 17:34
Made quite extensive changes to tokenize.py (with tests) for Py3k. This
migrates it to a 'bytes' API so that it can correctly decode Python
source files following PEP 263.
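This is the shape the bytes API took in Python 3: tokenize.tokenize accepts a readline callable returning bytes, decodes the source per the cookie, and yields str-valued tokens, with an ENCODING token first:

    import io
    import tokenize

    src = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
    for tok in tokenize.tokenize(io.BytesIO(src).readline):
        # tok.string is already-decoded text; the very first token
        # (ENCODING) reports the detected source encoding.
        print(tokenize.tok_name[tok.type], repr(tok.string))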
msg63951 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2008-03-18 17:47
Michael, is the disappearance of the generate_tokens function in the new 
version of tokenize.py intentional?
msg63953 - Author: Michael Foord (michael.foord) (Python committer) Date: 2008-03-18 17:52
That was 'by discussion with wiser heads than I'. The existing module
has an old backwards compatibility interface called 'tokenize'. That can
be deprecated in 2.6.
As 'tokenize' is really the ideal name for the main entry point for the
module, 'generate_tokens' became tokenize for Py3.
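For what it's worth, Python 3 later regained a documented str-based generate_tokens alongside the bytes-based tokenize, so both entry points now exist (a minimal sketch):

    import io
    import tokenize

    # Bytes in: tokenize does the PEP 263 decoding itself.
    list(tokenize.tokenize(io.BytesIO(b"x = 1\n").readline))

    # Str in: generate_tokens is for source you have already decoded.
    list(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))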
msg63955 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2008-03-18 18:01
Is it worth keeping generate_tokens as an alias for tokenize, just
to avoid gratuitous 2-to-3 breakage? Maybe not---I guess they're
different beasts, in that one wants a string-valued iterator and the 
other wants a bytes-valued iterator.
So if I understand correctly, the readline argument to tokenize
would have to return bytes instances. Would it be worth adding a check
for this, to catch possible misuse? You could put the check in 
detect_encoding, so that it just checks that the first one or two 
yields from readline have the correct type, and assumes the rest are okay.
msg63957 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2008-03-18 18:05
Sorry---ignore the last comment; if readline() doesn't supply bytes
then the line.decode('ascii') will fail with an AttributeError. So
there won't be silent failure.
I'll try thinking first and posting later next time.
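A quick way to see that loud failure (the exact exception type varies across Python versions, but it is never silent):

    import io
    import tokenize

    # Feeding str where bytes are expected raises (AttributeError or
    # TypeError, depending on the version) instead of misbehaving quietly.
    try:
        tokenize.detect_encoding(io.StringIO("x = 1\n").readline)
    except (AttributeError, TypeError) as exc:
        print(type(exc).__name__, exc)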
msg63959 - Author: Trent Nelson (trent) (Python committer) Date: 2008-03-18 18:24
Tested patch on Win x86/x64 2k8, XP & FreeBSD 6.2, +1.
msg63980 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2008-03-18 20:27
With the patch, 
./python.exe Lib/test/regrtest.py test_tokenize
fails for me with the following output:
Macintosh-2:py3k dickinsm$ ./python.exe Lib/test/regrtest.py test_tokenize
test_tokenize
test test_tokenize produced unexpected output:
**********************************************************************
*** lines 2-5 of actual output doesn't appear in expected output after line 1:
+ testing: /Users/dickinsm/python_source/py3k/Lib/test/tokenize_tests-latin1-coding-cookie-and-utf8-bom-sig.txt
+ testing: /Users/dickinsm/python_source/py3k/Lib/test/tokenize_tests-no-coding-cookie-and-utf8-bom-sig-only.txt
+ testing: /Users/dickinsm/python_source/py3k/Lib/test/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt
+ testing: /Users/dickinsm/python_source/py3k/Lib/test/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt
**********************************************************************
1 test failed:
 test_tokenize
[65880 refs]
I get something similar on Linux.
msg63982 - Author: Michael Foord (michael.foord) (Python committer) Date: 2008-03-18 20:32
If you remove the following line from the tests (which generates
spurious additional output on stdout) then the problem goes away:
 print('testing: %s' % path, end='\n')
msg63990 - Author: Michael Foord (michael.foord) (Python committer) Date: 2008-03-18 21:15
*Full* patch (excluding the new dependent test text files) for Python 3.
Includes fixes for standard library and tools usage of tokenize.
If it breaks anything blame Trent... ;-)
msg63998 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2008-03-18 21:50
All tests pass for me on OS X 10.5.2 and SuSE Linux 10.2 (32-bit)!
msg64006 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2008-03-18 22:41
> Is it worth keeping generate_tokens as an alias for tokenize, just
> to avoid gratuitous 2-to-3 breakage? Maybe not---I guess they're
> different beasts, in that one wants a string-valued iterator and the 
> other wants a bytes-valued iterator.
Exactly so - that was the primary rationale for renaming it. It
shouldn't "silently" return something else, but there should be
an explicit clue that you need to port actively.
msg65679 - Author: Trent Nelson (trent) (Python committer) Date: 2008-04-22 19:09
This was fixed in trunk in r61573, and merged to py3k in r61982.
History
Date                 User            Action  Args
2022-04-10 16:08:06  admin           set     github: 38293
2008-04-22 19:09:53  trent           set     status: open -> closed; messages: + msg65679
2008-03-18 22:41:18  loewis          set     messages: + msg64006
2008-03-18 21:50:39  mark.dickinson  set     messages: + msg63998
2008-03-18 21:16:05  fuzzyman        set     files: + tokenize_patch.diff; messages: + msg63990; versions: - Python 2.6
2008-03-18 20:32:24  fuzzyman        set     messages: + msg63982
2008-03-18 20:27:36  mark.dickinson  set     messages: + msg63980
2008-03-18 18:24:01  trent           set     keywords: + patch; assignee: trent; messages: + msg63959
2008-03-18 18:05:49  mark.dickinson  set     messages: + msg63957
2008-03-18 18:01:22  mark.dickinson  set     messages: + msg63955
2008-03-18 17:52:59  fuzzyman        set     messages: + msg63953
2008-03-18 17:47:28  mark.dickinson  set     messages: + msg63951
2008-03-18 17:34:39  fuzzyman        set     files: + tokenize.zip; nosy: + fuzzyman; messages: + msg63949
2008-03-17 06:45:30  trent           set     files: + test_tokenize_patch.tar; messages: + msg63620
2008-03-17 06:41:10  trent           set     nosy: + trent; messages: + msg63619
2008-03-17 04:39:05  loewis          set     messages: + msg63618
2008-03-17 01:55:26  mark.dickinson  set     nosy: + mark.dickinson; messages: + msg63612; versions: + Python 2.6, Python 3.0, - Python 2.3
2003-04-11 19:24:37  barry           create
