homepage

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Add new attribute to TokenInfo to report specific token IDs
Type: enhancement Stage: resolved
Components: Documentation, Library (Lib) Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: akuchling, docs@python, eric.araujo, eric.snow, ezio.melotti, gpolo, meador.inge, ncoghlan, python-dev, terry.reedy
Priority: normal Keywords: patch

Created on 2008年02月17日 23:00 by gpolo, last changed 2022年04月11日 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
tokenize_sample.py gpolo, 2008年02月17日 23:00
tokenize_r60884.diff gpolo, 2008年02月18日 13:22 review
tokenize-exact-type-v0.patch meador.inge, 2011年12月19日 04:50 review
tokenize-exact-type-v1.patch meador.inge, 2011年12月24日 23:28 review
tokenize-docs-2.7-3.2.patch meador.inge, 2011年12月24日 23:28
Messages (16)
msg62509 - (view) Author: Guilherme Polo (gpolo) * (Python committer) Date: 2008年02月17日 23:00
The function generate_tokens in tokenize.py yields token OP (51) for a colon,
while it should be token COLON (11). It probably affects other Python
versions as well.
I'm attaching a small sample that demonstrates this; running it produces
the following output:
1 'if' (1, 0) (1, 2) if a == 2:
1 'a' (1, 3) (1, 4) if a == 2:
51 '==' (1, 5) (1, 7) if a == 2:
2 '2' (1, 8) (1, 9) if a == 2:
51 ':' (1, 9) (1, 10) if a == 2:
1 'print' (2, 0) (2, 5) print 'hey'
3 "'hey'" (2, 6) (2, 11) print 'hey'
0 '' (3, 0) (3, 0)
I didn't check whether there are problems with other tokens; I noticed this
with the colon because I was trying to make some improvements to tabnanny.
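The following is a minimal illustrative sketch, not the attached tokenize_sample.py itself (the original sample targets Python 2 and prints raw token numbers); it shows '==' and ':' both coming back as the generic OP token rather than EQEQUAL/COLON:

import io
import token
import tokenize

source = "if a == 2:\n    pass\n"

for tok_type, tok_string, start, end, line in tokenize.generate_tokens(
        io.StringIO(source).readline):
    # '==' and ':' are reported as OP, not as EQEQUAL/COLON.
    print(tok_type, token.tok_name[tok_type], repr(tok_string), start, end)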
msg62527 - (view) Author: Guilherme Polo (gpolo) * (Python committer) Date: 2008年02月18日 13:22
I'm attaching a patch that solves this and updates tests.
msg99894 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2010年02月23日 02:46
Unfortunately I think this will break many users of tokenize.py.
e.g. http://browsershots.googlecode.com/svn/trunk/devtools/pep8/pep8.py 
has code like:
if (token_type == tokenize.OP and text in '([' and ...):
If tokenize now returns LPAR, this code will no longer work correctly.
Tools/i18n/pygettext.py, pylint, WebWare, pyfuscate, all have similar code. So I think we can't change the API this radically. Adding a parameter to enable more precise handling of tokens, and defaulting it to off, is probably the only way to change this.
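As a hedged illustration of the compatibility concern (this is not pep8.py's actual code), such tools typically match on the generic OP type plus the token string, so a tokenizer that suddenly reported LPAR/LSQB/COLON instead of OP would break them:

import io
import tokenize

def count_open_brackets(source):
    count = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # This test would stop matching if '(' and '[' were no longer
        # reported with the generic OP type.
        if tok[0] == tokenize.OP and tok[1] in "([":
            count += 1
    return count

print(count_open_brackets("f(x[0], y[1])\n"))  # -> 3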
msg112756 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010年08月04日 04:43
I have not looked at this, but a new parameter would be a new feature. It's a moot point until there is an agreed-on patch for a current version.
msg149487 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2011年12月15日 01:13
There are a *lot* of characters with semantic significance that are reported by the tokenize module as generic "OP" tokens:
token.LPAR
token.RPAR
token.LSQB
token.RSQB
token.COLON
token.COMMA
token.SEMI
token.PLUS
token.MINUS
token.STAR
token.SLASH
token.VBAR
token.AMPER
token.LESS
token.GREATER
token.EQUAL
token.DOT
token.PERCENT
token.BACKQUOTE
token.LBRACE
token.RBRACE
token.EQEQUAL
token.NOTEQUAL
token.LESSEQUAL
token.GREATEREQUAL
token.TILDE
token.CIRCUMFLEX
token.LEFTSHIFT
token.RIGHTSHIFT
token.DOUBLESTAR
token.PLUSEQUAL
token.MINEQUAL
token.STAREQUAL
token.SLASHEQUAL
token.PERCENTEQUAL
token.AMPEREQUAL
token.VBAREQUAL
token.CIRCUMFLEXEQUAL
token.LEFTSHIFTEQUAL
token.RIGHTSHIFTEQUAL
token.DOUBLESTAREQUAL
token.DOUBLESLASH
token.DOUBLESLASHEQUAL
token.AT
However, I can't fault tokenize for deciding to treat all of those tokens the same way - for many source code manipulation purposes, these just need to be transcribed literally, and the "OP" token serves that purpose just fine.
As the extensive test updates in the current patch suggest, AMK is also correct that changing this away from always returning "OP" tokens (even for characters with more specialised tokens available) would be a backwards incompatible change.
I think there are two parts to this problem, one documentation related (affecting 2.7, 3.2, 3.3) and another that would be an actual change in 3.3:
1. First, I think 3.3 should add an "exact_type" attribute to TokenInfo instances (without making it part of the tuple-based API). For most tokens, this would be the same as "type", but for OP tokens, it would provide the appropriate more specific token ID.
2. Second, the tokenize module documentation should state *explicitly* which tokens it collapses down into the generic "OP" token, and explain how to use the "string" attribute to recover the more detailed information.
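As a rough sketch of the idea only (this is not the eventual patch, and the mapping below covers just a few operators for illustration), the exact type of an OP token can be derived from its string while the reported type stays OP:

import token

_EXACT = {
    "(": token.LPAR,
    ")": token.RPAR,
    ":": token.COLON,
    "==": token.EQEQUAL,
    "**": token.DOUBLESTAR,
}

def exact_type(tok_type, tok_string):
    # For OP tokens, return the more specific ID; otherwise pass through.
    if tok_type == token.OP and tok_string in _EXACT:
        return _EXACT[tok_string]
    return tok_type

print(token.tok_name[exact_type(token.OP, ":")])      # COLON
print(token.tok_name[exact_type(token.NUMBER, "2")])  # NUMBER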
msg149489 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011年12月15日 02:06
I believe that that list includes all symbols and symbol combinations that are syntactically significant in expressions. This is the generalized meaning of 'operator' that is being used. What do not appear are '#' which marks comments, '_' which is a name char, and '\' which escapes chars within strings. Other symbols within strings will also not be marked as OP tokens. The non-syntactic symbols '$' and '?' are also omitted.
msg149491 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2011年12月15日 03:16
Sure, but what does that have to do with anything? tokenize isn't a general purpose tokenizer, it's specifically for tokenizing Python source code.
The *problem* is that it doesn't currently fully tokenize everything, but doesn't explicitly say that in the module documentation.
Hence my proposed two-fold fix: document the current behaviour explicitly and also add a separate "exact_type" attribute for easy access to the detailed tokenization without doing your own string comparisons.
msg149507 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011年12月15日 05:03
If you are responding to me, I am baffled. I gave a concise way to document the current behavior with respect to .OP, which you said you wanted.
msg149510 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2011年12月15日 05:48
Ah, I didn't read it as suggested documentation at all - you moved seamlessly from personal commentary to a docs suggestion without separating the two, so it appeared to be a complete non sequitur to me.
As for the docs suggestion, I think it works as the explanation of which tokens are affected once the concept of the token stream simplification is introduced:
=====
To simplify token stream handling, all literal tokens (':', '{', etc) are returned using the generic 'OP' token type. This allows them to be simply handled using common code paths (e.g. for literal transcription directly from input to output). Specific tokens can be distinguished by checking the "string" attribute of OP tokens for a match with the expected character sequence.
The affected tokens are all symbols and symbol combinations that are syntactically significant in expressions (as listed in the token module). Anything which is not an independent token (i.e. '#' which marks comments, '_' which is just part of a name, '\' which is used for line continuations, the contents of string literals and any symbols which are not a defined part of Python's syntax) is completely unaffected by this difference in behaviour.
===========
If "exact_type" is introduced in 3.3, then the first paragraph can be adjusted accordingly.
msg149578 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011年12月15日 22:02
Both the proposed text and 3.3 addition look good to me.
msg149815 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2011年12月19日 04:50
The proposed documentation text seems too complicated and language expert speaky to me. We should try to link to standard definitions when possible to reduce the text here. For example, I believe the "Operators" and "Delimiters" tokens in the "Lexical Analysis" section of the docs (http://docs.python.org/dev/reference/lexical_analysis.html#operators) are exactly what we are trying to describe when referencing "literal tokens" and "affected tokens".
I like Nick's idea to introduce a new attribute for the exact type, while keeping the tuple structure itself backwards compatible. Attached is a patch for 3.3 that updates the docs, adds exact_type, adds new unit tests, and adds a new CLI option for displaying token names using the exact type.
An example of the new CLI option is:
$ echo '1+2**4' | ./python -m tokenize
1,0-1,1: NUMBER '1' 
1,1-1,2: OP '+' 
1,2-1,3: NUMBER '2' 
1,3-1,5: OP '**' 
1,5-1,6: NUMBER '4' 
1,6-1,7: NEWLINE '\n' 
2,0-2,0: ENDMARKER '' 
$ echo '1+2**4' | ./python -m tokenize -e
1,0-1,1: NUMBER '1' 
1,1-1,2: PLUS '+' 
1,2-1,3: NUMBER '2' 
1,3-1,5: DOUBLESTAR '**' 
1,5-1,6: NUMBER '4' 
1,6-1,7: NEWLINE '\n' 
2,0-2,0: ENDMARKER ''
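A usage sketch of the new attribute from Python code (assuming Python 3.3 or later, where the committed patch adds TokenInfo.exact_type while leaving the tuple API unchanged):

import io
import token
import tokenize

source = "1+2**4\n"

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # type stays OP for operators; exact_type resolves to PLUS, DOUBLESTAR, ...
    print(token.tok_name[tok.type], token.tok_name[tok.exact_type],
          repr(tok.string))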
msg149820 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2011年12月19日 05:56
Meador's patch looks good to me. The docs change for 2.7 and 3.2 would be similar, just with text like "Specific tokens can be distinguished by checking the ``string`` attribute of OP tokens for a match with the expected character sequence." replacing the reference to the new "exact_type" attribute.
msg150013 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011年12月21日 16:49
The cmdoption directive should be used with a program directive. See library/trace for an example of how to use it and to see the anchors and index entries it generates.
msg150242 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2011年12月24日 23:28
> The cmdoption directive should be used with a program directive.
Ah, nice. Thanks for the tip Éric.
Updated patch attached, along with a patch for the 2.7/3.2 doc update.
msg151607 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012年01月19日 06:46
New changeset 75baef657770 by Meador Inge in branch '2.7':
Issue #2134: Clarify token.OP handling rationale in tokenize documentation.
http://hg.python.org/cpython/rev/75baef657770
New changeset dfd74d752b0e by Meador Inge in branch '3.2':
Issue #2134: Clarify token.OP handling rationale in tokenize documentation.
http://hg.python.org/cpython/rev/dfd74d752b0e
New changeset f4976fa6e830 by Meador Inge in branch 'default':
Issue #2134: Add support for tokenize.TokenInfo.exact_type.
http://hg.python.org/cpython/rev/f4976fa6e830 
msg151608 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2012年01月19日 06:49
Fixed. Thanks for the reviews everyone.
History
Date User Action Args
2022年04月11日 14:56:30  admin  set  github: 46387
2012年01月19日 06:49:13  meador.inge  set  status: open -> closed; resolution: fixed; messages: + msg151608; stage: patch review -> resolved
2012年01月19日 06:46:55  python-dev  set  nosy: + python-dev; messages: + msg151607
2012年01月16日 05:42:46  ezio.melotti  set  stage: needs patch -> patch review
2011年12月24日 23:28:57  meador.inge  set  files: + tokenize-docs-2.7-3.2.patch
2011年12月24日 23:28:39  meador.inge  set  files: + tokenize-exact-type-v1.patch; messages: + msg150242
2011年12月21日 16:49:30  eric.araujo  set  nosy: + eric.araujo; messages: + msg150013
2011年12月19日 05:56:17  ncoghlan  set  messages: + msg149820
2011年12月19日 04:50:23  meador.inge  set  files: + tokenize-exact-type-v0.patch; messages: + msg149815
2011年12月15日 22:18:25  ezio.melotti  set  nosy: + ezio.melotti
2011年12月15日 22:02:11  terry.reedy  set  messages: + msg149578
2011年12月15日 19:22:21  eric.snow  set  nosy: + eric.snow
2011年12月15日 18:42:53  meador.inge  set  nosy: + meador.inge
2011年12月15日 05:48:31  ncoghlan  set  messages: + msg149510
2011年12月15日 05:03:51  terry.reedy  set  messages: + msg149507
2011年12月15日 03:16:53  ncoghlan  set  messages: + msg149491
2011年12月15日 02:06:25  terry.reedy  set  messages: + msg149489
2011年12月15日 01:13:34  ncoghlan  set  assignee: docs@python; components: + Documentation; title: function generate_tokens at tokenize.py yields wrong token for colon -> Add new attribute to TokenInfo to report specific token IDs; nosy: + docs@python, ncoghlan; versions: + Python 2.7, Python 3.3; messages: + msg149487; stage: needs patch
2010年08月04日 04:43:33  terry.reedy  set  versions: + Python 3.2, - Python 2.6, Python 2.5, Python 2.4; nosy: + terry.reedy; messages: + msg112756; type: behavior -> enhancement
2010年02月23日 02:46:00  akuchling  set  nosy: + akuchling; messages: + msg99894
2008年02月19日 09:08:15  christian.heimes  set  priority: normal; keywords: + patch
2008年02月18日 13:22:48  gpolo  set  files: + tokenize_r60884.diff; messages: + msg62527
2008年02月17日 23:00:30  gpolo  create