Created on 2011年07月04日 03:58 by Devin Jeanpierre, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
Files

| File name | Uploaded |
|---|---|
| tokenize_str.diff | serhiy.storchaka, 2012年10月15日 20:48 |
| tokenize_str_2.diff | serhiy.storchaka, 2015年10月05日 06:04 |
Pull Requests

| URL | Status | Linked |
|---|---|---|
| PR 6957 | merged | takluyver, 2018年05月17日 20:40 |
Messages (19)
| msg139733 - (view) | Author: Devin Jeanpierre (Devin Jeanpierre) * | Date: 2011年07月04日 03:58 | |
tokenize only deals with bytes. Users might want to deal with unicode source (for example, if python source is embedded into a document with an already-known encoding).

The naive approach might be something like:

```python
def my_readline():
    return my_oldreadline().encode('utf-8')
```

But this doesn't work for python source that declares its encoding, which might be something other than utf-8. The only safe ways are to either manually add a coding line yourself (there are lots of ways, I picked a dumb one):

```python
def my_readline_safe(was_read=[]):
    if not was_read:
        was_read.append(True)
        return b'# coding: utf-8'
    return my_oldreadline().encode('utf-8')

tokenstream = tokenize.tokenize(my_readline_safe)
```

Or to use the same my_readline as before (no added coding line), but instead of passing it to tokenize.tokenize, you could pass it to the undocumented _tokenize function:

```python
tokenstream = tokenize._tokenize(my_readline, 'utf-8')
```

Or, ideally, you'd just pass the original readline that produces unicode into a utokenize function:

```python
tokenstream = tokenize.utokenize(my_oldreadline)
```
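
For reference, a minimal self-contained illustration of the bytes-only interface described above; io.BytesIO stands in for a real source file, and the sample source line is an arbitrary example:

```python
import io
import tokenize

# tokenize.tokenize() expects a readline callable that returns bytes;
# it reads the coding cookie itself and decodes the source for you.
source = b"# coding: latin-1\nx = 'caf\xe9'\n"
for tok in tokenize.tokenize(io.BytesIO(source).readline):
    print(tok)
# The first token emitted is ENCODING, carrying the detected encoding;
# the token strings that follow are already-decoded str objects.
```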
|
|||
| msg140050 - (view) | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2011年07月09日 05:34 | |
Hmm. Python 3 code is unicode. "Python reads program text as Unicode code points." The tokenize module purports to provide "a lexical scanner for Python source code". But it seems not to do that. Instead it provides a scanner for Python code encoded as bytes, which is something different. So this is at least a doc update issue (which affects 2.7/3.2 also). Another doc issue is given below.

A deeper problem is that tokenize uses the semi-obsolete readline protocol, which probably dates to 1.0 and which expects the source to be a file or file-like. The more recent iterator protocol would let the source be anything. A modern tokenize function should accept an iterable of strings. This would include, but not be limited to, a file opened in text mode.

A related problem is that 'tokenize' is a convenience function that does several things bundled together:
1. Read lines as bytes from a file-like source.
2. Detect encoding.
3. Decode lines to strings.
4. Actually tokenize the strings to tokens.

I understand this feature request to be a request that function 4, the actual Python 3 code tokenizer, be unbundled and exposed to users. I agree with this request. Any user that starts with actual Py3 code would benefit. (compile() is another function that bundles a tokenizer.)

Back to the current doc and another doc problem. The entry for untokenize() says "Converts tokens back into Python source code. ... The reconstructed script is returned as a single string." That would be nice if true, but I am going to guess it is not, as the entry continues "It returns bytes, encoded using the ENCODING token". In Py3, string != bytes, so this seems an incomplete doc conversion from Py2. |
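
A rough sketch of the iterator-based text tokenizer suggested here, built for illustration on top of the existing (then-undocumented) generate_tokens(); the name tokenize_text and the wrapper itself are hypothetical, not an existing API:

```python
import tokenize
from typing import Iterable, Iterator

def tokenize_text(lines: Iterable[str]) -> Iterator[tokenize.TokenInfo]:
    """Hypothetical: tokenize an iterable of already-decoded source lines.

    Only step 4 (actual tokenization) -- no reading, encoding detection,
    or decoding is performed here.
    """
    it = iter(lines)
    # generate_tokens() expects a readline callable; an empty string
    # signals end of input.
    return tokenize.generate_tokens(lambda: next(it, ''))

# Works with any iterable of str, e.g. a list or a text-mode file:
tokens = list(tokenize_text(["x = 1\n", "y = 'héllo'\n"]))
```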
|||
| msg140055 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2011年07月09日 09:03 | |
The compiler has a PyCF_SOURCE_IS_UTF8 flag: see the compile() builtin. The parser has a flag to ignore the coding cookie: PyPARSE_IGNORE_COOKIE. Patching tokenize to support Unicode is simple: use the PyCF_SOURCE_IS_UTF8 and/or PyPARSE_IGNORE_COOKIE flags and encode the strings to UTF-8. Rewriting the parser to work directly on Unicode is much more complex, and I don't think that we need that. |
|||
| msg173001 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年10月15日 20:48 | |
A patch to allow tokenize() to accept a string is very simple, only 4 lines. But it requires a lot of documentation changes. Then we can get rid of the undocumented generate_tokens(). Note, the stdlib and tools use only generate_tokens(); none uses tokenize(). Of course, it would be better if tokenize() worked with the iterator protocol. Here is a preliminary patch. I will be thankful for help with the documentation and for the discussion. |
|||
| msg178473 - (view) | Author: Meador Inge (meador.inge) * (Python committer) | Date: 2012年12月29日 05:23 | |
See also issue9969. |
|||
| msg252300 - (view) | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015年10月05日 03:27 | |
I agree it would be very useful to be able to tokenize arbitrary text without worrying about encoding tokens. I left some suggestions for the documentation changes. Also some test cases for it would be good.

However, I wonder if a separate function would be better for the text-mode tokenization. It would make it clearer when an ENCODING token is expected and when it isn’t, and would avoid any confusion about what happens when readline() returns a byte string one time and a text string another time. Also, having untokenize() change its output type depending on the ENCODING token seems like bad design to me.

Why not just bless the existing generate_tokens() function as a public API, perhaps renaming it to something clearer like tokenize_text() or tokenize_text_lines() at the same time? |
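
The dual behaviour of untokenize() discussed here is easy to demonstrate; a small example round-tripping the same source through both entry points:

```python
import io
import tokenize

src = "x = 1\n"

# Text path: no ENCODING token, so untokenize() returns str.
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
print(type(tokenize.untokenize(toks)))   # <class 'str'>

# Bytes path: the leading ENCODING token makes untokenize() return bytes.
btoks = list(tokenize.tokenize(io.BytesIO(src.encode('utf-8')).readline))
print(type(tokenize.untokenize(btoks)))  # <class 'bytes'>
```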
|||
| msg252305 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年10月05日 06:01 | |
Thank you for your review, Martin. Here is a rebased patch that addresses Martin's comments. I agree that having untokenize() change its output type depending on the ENCODING token is bad design and we should change this. But this is perhaps another issue. |
|||
| msg252309 - (view) | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2015年10月05日 07:35 | |
I didn’t notice that this dual untokenize() behaviour already existed. Taking that into account weakens my argument for having separate text and bytes tokenize() functions. |
|||
| msg313591 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年03月11日 09:22 | |
> Why not just bless the existing generate_tokens() function as a public API

We're actually using generate_tokens() from IPython - we wanted a way to tokenize unicode strings, and although it's undocumented, it's been there for a number of releases and does what we want. So +1 to promoting it to a public API.

In fact, at the moment, IPython has its own copy of tokenize to fix one or two old issues. I'm trying to get rid of that and use the stdlib module again, which is how I came to notice that we're using an undocumented API. |
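
For readers following along, the generate_tokens() usage being described looks like this; io.StringIO supplies the str-returning readline, and the sample cell text is arbitrary:

```python
import io
import tokenize

# Tokenizing source that is already a decoded str: no encoding detection
# happens and no ENCODING token is produced.
cell = "df = load('data.csv')\nprint(df)\n"
for tok in tokenize.generate_tokens(io.StringIO(cell).readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))
```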
|||
| msg316982 - (view) | Author: Matthias Bussonnier (mbussonn) * | Date: 2018年05月17日 20:28 | |
> Why not just bless the existing generate_tokens() function as a public API,

Yes please, or just make the private `_tokenize` public under another name. The `tokenize.tokenize` method tries to magically detect the encoding, which may be unnecessary. |
|||
| msg317004 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2018年05月18日 07:01 | |
The old generate_tokens() was renamed to tokenize() in issue719888 because the latter is a better name. Is "generate_tokens" considered a good name now? |
|||
| msg317010 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年05月18日 07:48 | |
I wouldn't say it's a good name, but I think the advantage of documenting an existing name outweighs that. We can start (or continue) using generate_tokens() right away, whereas a new name presumably wouldn't be available until Python 3.8 comes out. And we usually don't require a new Python version until a couple of years after it is released. If we want to add better names or clearer APIs on top of this, great. But I don't want that discussion to hold up the simple step of committing to keep the existing API. |
|||
| msg317011 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2018年05月18日 07:59 | |
My concern is that we will have two functions with non-similar names (tokenize() and generate_tokens()) that do virtually the same thing but accept different types of input (bytes or str), and a single function untokenize() that produces a different type of result depending on the value of its input. This doesn't look like a good design to me. |
|||
| msg317018 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年05月18日 08:21 | |
I agree, it's not a good design, but it's what's already there; I just want to ensure that it won't be removed without a deprecation cycle. My PR makes no changes to behaviour, only to documentation and tests. This and issue 9969 have both been around for several years. A new tokenize API is clearly not at the top of anyone's priority list - and that's fine. I'd rather have *some* unicode API today than a promise of a nice unicode API in the future. And it doesn't preclude adding a better API later, it just means that the existing API would have to have a deprecation cycle. |
|||
| msg317020 - (view) | Author: Martin Panter (martin.panter) * (Python committer) | Date: 2018年05月18日 08:53 | |
Don’t forget about updating __all__. |
|||
| msg317021 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年05月18日 08:56 | |
Thanks - I had forgotten it, just fixed it now. |
|||
| msg317912 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年05月28日 20:09 | |
The tests on PR #6957 are passing now, if anyone has time to have a look. :-) |
|||
| msg318775 - (view) | Author: Carol Willing (willingc) * (Python committer) | Date: 2018年06月05日 17:26 | |
New changeset c56b17bd8c7a3fd03859822246633d2c9586f8bd by Carol Willing (Thomas Kluyver) in branch 'master': bpo-12486: Document tokenize.generate_tokens() as public API (#6957) https://github.com/python/cpython/commit/c56b17bd8c7a3fd03859822246633d2c9586f8bd |
|||
| msg318778 - (view) | Author: Thomas Kluyver (takluyver) * | Date: 2018年06月05日 17:51 | |
Thanks Carol :-) |
|||
History

| Date | User | Action | Args |
|---|---|---|---|
| 2022年04月11日 14:57:19 | admin | set | github: 56695 |
| 2018年06月05日 17:51:27 | takluyver | set | messages: + msg318778 |
| 2018年06月05日 17:30:56 | willingc | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved |
| 2018年06月05日 17:26:41 | willingc | set | nosy: + willingc; messages: + msg318775 |
| 2018年05月28日 20:09:06 | takluyver | set | messages: + msg317912 |
| 2018年05月18日 08:56:08 | takluyver | set | messages: + msg317021 |
| 2018年05月18日 08:53:49 | martin.panter | set | messages: + msg317020 |
| 2018年05月18日 08:21:22 | takluyver | set | messages: + msg317018 |
| 2018年05月18日 07:59:47 | serhiy.storchaka | set | messages: + msg317011 |
| 2018年05月18日 07:48:49 | takluyver | set | messages: + msg317010 |
| 2018年05月18日 07:04:20 | serhiy.storchaka | set | nosy: + barry, mark.dickinson, trent, michael.foord; versions: + Python 3.8, - Python 3.6 |
| 2018年05月18日 07:01:30 | serhiy.storchaka | set | messages: + msg317004 |
| 2018年05月17日 20:40:11 | takluyver | set | pull_requests: + pull_request6616 |
| 2018年05月17日 20:28:39 | mbussonn | set | nosy: + mbussonn; messages: + msg316982 |
| 2018年03月11日 09:22:55 | takluyver | set | nosy: + takluyver; messages: + msg313591 |
| 2015年10月05日 07:35:53 | martin.panter | set | messages: + msg252309 |
| 2015年10月05日 06:04:47 | serhiy.storchaka | set | files: + tokenize_str_2.diff |
| 2015年10月05日 06:01:59 | serhiy.storchaka | set | messages: + msg252305; versions: + Python 3.6, - Python 3.4 |
| 2015年10月05日 03:27:56 | martin.panter | set | nosy: + martin.panter; messages: + msg252300; stage: patch review |
| 2013年02月04日 17:06:59 | r.david.murray | link | issue17125 superseder |
| 2012年12月29日 05:23:22 | meador.inge | set | nosy: + meador.inge; messages: + msg178473 |
| 2012年10月15日 20:48:44 | serhiy.storchaka | set | files: + tokenize_str.diff; versions: + Python 3.4, - Python 3.3; nosy: + serhiy.storchaka; messages: + msg173001; keywords: + patch |
| 2012年10月14日 04:15:32 | eric.snow | set | nosy: terry.reedy, vstinner, Devin Jeanpierre, eric.araujo, eric.snow, petri.lehtinen |
| 2011年07月09日 20:53:49 | eric.snow | set | nosy: + eric.snow |
| 2011年07月09日 09:03:46 | vstinner | set | messages: + msg140055 |
| 2011年07月09日 05:34:16 | terry.reedy | set | nosy: + terry.reedy; messages: + msg140050 |
| 2011年07月08日 17:49:51 | petri.lehtinen | set | nosy: + petri.lehtinen |
| 2011年07月04日 16:23:57 | eric.araujo | set | nosy: + vstinner, eric.araujo; type: enhancement; versions: + Python 3.3 |
| 2011年07月04日 03:58:16 | Devin Jeanpierre | create | |