Created on 2011年07月04日 03:58 by Devin Jeanpierre, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
Files

| File name | Uploaded |
|---|---|
| tokenize_str.diff | serhiy.storchaka, 2012年10月15日 20:48 |
| tokenize_str_2.diff | serhiy.storchaka, 2015年10月05日 06:04 |
Pull Requests

| URL | Status | Linked |
|---|---|---|
| PR 6957 | merged | takluyver, 2018年05月17日 20:40 |
Messages (19)
| msg139733 - (view) | Author: Devin Jeanpierre (Devin Jeanpierre) * | Date: 2011年07月04日 03:58 | |
tokenize only deals with bytes. Users might want to deal with unicode source (for example, if python source is embedded into a document with an already-known encoding).

The naive approach might be something like:

```python
def my_readline():
    return my_oldreadline().encode('utf-8')
```

But this doesn't work for python source that declares its encoding, which might be something other than utf-8. The only safe ways are to either manually add a coding line yourself (there are lots of ways, I picked a dumb one):

```python
def my_readline_safe(was_read=[]):
    if not was_read:
        was_read.append(True)
        return b'# coding: utf-8'
    return my_oldreadline().encode('utf-8')

tokenstream = tokenize.tokenize(my_readline_safe)
```

Or to use the same my_readline as before (no added coding line), but instead of passing it to tokenize.tokenize, you could pass it to the undocumented _tokenize function:

```python
tokenstream = tokenize._tokenize(my_readline, 'utf-8')
```

Or, ideally, you'd just pass the original readline that produces unicode into a utokenize function:

```python
tokenstream = tokenize.utokenize(my_oldreadline)
```
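
For reference, a minimal self-contained illustration of the bytes-only interface described above; io.BytesIO stands in for a real source file, and the sample source line is an arbitrary example:

```python
import io
import tokenize

# tokenize.tokenize() expects a readline callable that returns bytes;
# it reads the coding cookie itself and decodes the source for you.
source = b"# coding: latin-1\nx = 'caf\xe9'\n"
for tok in tokenize.tokenize(io.BytesIO(source).readline):
    print(tok)
# The first token emitted is ENCODING, carrying the detected encoding;
# the token strings that follow are already-decoded str objects.
```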
|
|||
| msg140050 - (view) | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2011年07月09日 05:34 | |
Hmm. Python 3 code is unicode. "Python reads program text as Unicode code points." The tokenize module purports to provide "a lexical scanner for Python source code". But it seems not to do that. Instead it provides a scanner for Python code encoded as bytes, which is something different. So this is at least a doc update issue (which affects 2.7/3.2 also). Another doc issue is given below.

A deeper problem is that tokenize uses the semi-obsolete readline protocol, which probably dates to 1.0 and which expects the source to be a file or file-like. The more recent iterator protocol would let the source be anything. A modern tokenize function should accept an iterable of strings. This would include, but not be limited to, a file opened in text mode.

A related problem is that 'tokenize' is a convenience function that does several things bundled together:
1. Read lines as bytes from a file-like source.
2. Detect encoding.
3. Decode lines to strings.
4. Actually tokenize the strings to tokens.

I understand this feature request to be a request that function 4, the actual Python 3 code tokenizer, be unbundled and exposed to users. I agree with this request. Any user that starts with actual Py3 code would benefit. (compile() is another function that bundles a tokenizer.)

Back to the current doc and another doc problem. The entry for untokenize() says "Converts tokens back into Python source code. ... The reconstructed script is returned as a single string." That would be nice if true, but I am going to guess it is not, as the entry continues "It returns bytes, encoded using the ENCODING token". In Py3, string != bytes, so this seems an incomplete doc conversion from Py2. |
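
A rough sketch of the iterator-based text tokenizer suggested here, built for illustration on top of the existing (then-undocumented) generate_tokens(); the name tokenize_text and the wrapper itself are hypothetical, not an existing API:

```python
import tokenize
from typing import Iterable, Iterator

def tokenize_text(lines: Iterable[str]) -> Iterator[tokenize.TokenInfo]:
    """Hypothetical: tokenize an iterable of already-decoded source lines.

    Only step 4 (actual tokenization) -- no reading, encoding detection,
    or decoding is performed here.
    """
    it = iter(lines)
    # generate_tokens() expects a readline callable; an empty string
    # signals end of input.
    return tokenize.generate_tokens(lambda: next(it, ''))

# Works with any iterable of str, e.g. a list or a text-mode file:
tokens = list(tokenize_text(["x = 1\n", "y = 'héllo'\n"]))
```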
|||
| msg140055 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2011年07月09日 09:03 | |
The compiler has a PyCF_SOURCE_IS_UTF8 flag: see the compile() builtin. The parser has a flag to ignore the coding cookie: PyPARSE_IGNORE_COOKIE. Patching tokenize to support Unicode is simple: use the PyCF_SOURCE_IS_UTF8 and/or PyPARSE_IGNORE_COOKIE flags and encode the strings to UTF-8. Rewriting the parser to work directly on Unicode is much more complex, and I don't think that we need that. |
|||
| msg173001 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年10月15日 20:48 | |
A patch to allow tokenize() to accept a string is very simple, only 4 lines. But it requires a lot of documentation changes. Then we can get rid of the undocumented generate_tokens(). Note, the stdlib and tools use only generate_tokens(); none uses tokenize(). Of course, it would be better if tokenize() worked with the iterator protocol. Here is a preliminary patch. I will be thankful for help with the documentation and for the discussion. |
|||
| msg178473 - (view) | Author: Meador Inge (meador.inge) * (Python committer) | Date: 2012年12月29日 05:23 | |
See also issue9969. |
|||
| msg252300 - (view) | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015年10月05日 03:27 | |
I agree it would be very useful to be able to tokenize arbitrary text without worrying about encoding tokens. I left some suggestions for the documentation changes. Also some test cases for it would be good.

However, I wonder if a separate function would be better for the text-mode tokenization. It would make it clearer when an ENCODING token is expected and when it isn’t, and would avoid any confusion about what happens when readline() returns a byte string one time and a text string another time. Also, having untokenize() change its output type depending on the ENCODING token seems like bad design to me.

Why not just bless the existing generate_tokens() function as a public API, perhaps renaming it to something clearer like tokenize_text() or tokenize_text_lines() at the same time? |
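
The dual behaviour of untokenize() discussed here is easy to demonstrate; a small example round-tripping the same source through both entry points:

```python
import io
import tokenize

src = "x = 1\n"

# Text path: no ENCODING token, so untokenize() returns str.
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
print(type(tokenize.untokenize(toks)))   # <class 'str'>

# Bytes path: the leading ENCODING token makes untokenize() return bytes.
btoks = list(tokenize.tokenize(io.BytesIO(src.encode('utf-8')).readline))
print(type(tokenize.untokenize(btoks)))  # <class 'bytes'>
```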
|||
| msg252305 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年10月05日 06:01 | |
Thank you for your review, Martin. Here is a rebased patch that addresses Martin's comments. I agree that having untokenize() change its output type depending on the ENCODING token is bad design and we should change this. But this is perhaps another issue. |
|||
| msg252309 - (view) | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015年10月05日 07:35 | |
I didn’t notice that this dual untokenize() behaviour already existed. Taking that into account weakens my argument for having separate text and bytes tokenize() functions. |
|||
| msg313591 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年03月11日 09:22 | |
> Why not just bless the existing generate_tokens() function as a public API

We're actually using generate_tokens() from IPython - we wanted a way to tokenize unicode strings, and although it's undocumented, it's been there for a number of releases and does what we want. So +1 to promoting it to a public API.

In fact, at the moment, IPython has its own copy of tokenize to fix one or two old issues. I'm trying to get rid of that and use the stdlib module again, which is how I came to notice that we're using an undocumented API. |
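
For readers following along, the generate_tokens() usage being described looks like this; io.StringIO supplies the str-returning readline, and the sample cell text is arbitrary:

```python
import io
import tokenize

# Tokenizing source that is already a decoded str: no encoding detection
# happens and no ENCODING token is produced.
cell = "df = load('data.csv')\nprint(df)\n"
for tok in tokenize.generate_tokens(io.StringIO(cell).readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))
```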
|||
| msg316982 - (view) | Author: Matthias Bussonnier (mbussonn) * | Date: 2018年05月17日 20:28 | |
> Why not just bless the existing generate_tokens() function as a public API,

Yes please, or just make the private `_tokenize` public under another name. The `tokenize.tokenize` method tries to magically detect the encoding, which may be unnecessary. |
|||
| msg317004 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2018年05月18日 07:01 | |
The old generate_tokens() was renamed to tokenize() in issue719888 because the latter is a better name. Is "generate_tokens" considered a good name now? |
|||
| msg317010 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年05月18日 07:48 | |
I wouldn't say it's a good name, but I think the advantage of documenting an existing name outweighs that. We can start (or continue) using generate_tokens() right away, whereas a new name presumably wouldn't be available until Python 3.8 comes out. And we usually don't require a new Python version until a couple of years after it is released. If we want to add better names or clearer APIs on top of this, great. But I don't want that discussion to hold up the simple step of committing to keep the existing API. |
|||
| msg317011 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2018年05月18日 07:59 | |
My concern is that we will have two functions with non-similar names (tokenize() and generate_tokens()) that do virtually the same thing but accept different types of input (bytes or str), and a single function untokenize() that produces a different type of result depending on the value of its input. This doesn't look like a good design to me. |
|||
| msg317018 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年05月18日 08:21 | |
I agree, it's not a good design, but it's what's already there; I just want to ensure that it won't be removed without a deprecation cycle. My PR makes no changes to behaviour, only to documentation and tests. This and issue 9969 have both been around for several years. A new tokenize API is clearly not at the top of anyone's priority list - and that's fine. I'd rather have *some* unicode API today than a promise of a nice unicode API in the future. And it doesn't preclude adding a better API later, it just means that the existing API would have to have a deprecation cycle. |
|||
| msg317020 - (view) | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2018年05月18日 08:53 | |
Don’t forget about updating __all__. |
|||
| msg317021 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年05月18日 08:56 | |
Thanks - I had forgotten it, just fixed it now. |
|||
| msg317912 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年05月28日 20:09 | |
The tests on PR #6957 are passing now, if anyone has time to have a look. :-) |
|||
| msg318775 - (view) | Author: Carol Willing (willingc) * (Python committer) | Date: 2018年06月05日 17:26 | |
New changeset c56b17bd8c7a3fd03859822246633d2c9586f8bd by Carol Willing (Thomas Kluyver) in branch 'master': bpo-12486: Document tokenize.generate_tokens() as public API (#6957) https://github.com/python/cpython/commit/c56b17bd8c7a3fd03859822246633d2c9586f8bd |
|||
| msg318778 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年06月05日 17:51 | |
Thanks Carol :-) |
|||
History

| Date | User | Action | Args |
|---|---|---|---|
| 2022年04月11日 14:57:19 | admin | set | github: 56695 |
| 2018年06月05日 17:51:27 | takluyver | set | messages: + msg318778 |
| 2018年06月05日 17:30:56 | willingc | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved |
| 2018年06月05日 17:26:41 | willingc | set | nosy: + willingc; messages: + msg318775 |
| 2018年05月28日 20:09:06 | takluyver | set | messages: + msg317912 |
| 2018年05月18日 08:56:08 | takluyver | set | messages: + msg317021 |
| 2018年05月18日 08:53:49 | martin.panter | set | messages: + msg317020 |
| 2018年05月18日 08:21:22 | takluyver | set | messages: + msg317018 |
| 2018年05月18日 07:59:47 | serhiy.storchaka | set | messages: + msg317011 |
| 2018年05月18日 07:48:49 | takluyver | set | messages: + msg317010 |
| 2018年05月18日 07:04:20 | serhiy.storchaka | set | nosy: + barry, mark.dickinson, trent, michael.foord; versions: + Python 3.8, - Python 3.6 |
| 2018年05月18日 07:01:30 | serhiy.storchaka | set | messages: + msg317004 |
| 2018年05月17日 20:40:11 | takluyver | set | pull_requests: + pull_request6616 |
| 2018年05月17日 20:28:39 | mbussonn | set | nosy: + mbussonn; messages: + msg316982 |
| 2018年03月11日 09:22:55 | takluyver | set | nosy: + takluyver; messages: + msg313591 |
| 2015年10月05日 07:35:53 | martin.panter | set | messages: + msg252309 |
| 2015年10月05日 06:04:47 | serhiy.storchaka | set | files: + tokenize_str_2.diff |
| 2015年10月05日 06:01:59 | serhiy.storchaka | set | messages: + msg252305; versions: + Python 3.6, - Python 3.4 |
| 2015年10月05日 03:27:56 | martin.panter | set | nosy: + martin.panter; messages: + msg252300; stage: patch review |
| 2013年02月04日 17:06:59 | r.david.murray | link | issue17125 superseder |
| 2012年12月29日 05:23:22 | meador.inge | set | nosy: + meador.inge; messages: + msg178473 |
| 2012年10月15日 20:48:44 | serhiy.storchaka | set | files: + tokenize_str.diff; versions: + Python 3.4, - Python 3.3; nosy: + serhiy.storchaka; messages: + msg173001; keywords: + patch |
| 2012年10月14日 04:15:32 | eric.snow | set | nosy: terry.reedy, vstinner, Devin Jeanpierre, eric.araujo, eric.snow, petri.lehtinen |
| 2011年07月09日 20:53:49 | eric.snow | set | nosy: + eric.snow |
| 2011年07月09日 09:03:46 | vstinner | set | messages: + msg140055 |
| 2011年07月09日 05:34:16 | terry.reedy | set | nosy: + terry.reedy; messages: + msg140050 |
| 2011年07月08日 17:49:51 | petri.lehtinen | set | nosy: + petri.lehtinen |
| 2011年07月04日 16:23:57 | eric.araujo | set | nosy: + vstinner, eric.araujo; type: enhancement; versions: + Python 3.3 |
| 2011年07月04日 03:58:16 | Devin Jeanpierre | create | |