

Author nedbat
Recipients nedbat
Date 2012-10-06 21:09:19
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1349557761.41.0.537472367724.issue16152@psf.upfronthosting.co.za>
In-reply-to
Content
When tokenizing with tokenize.generate_tokens, if the code ends with whitespace (no final newline), the tokenizer produces an ERRORTOKEN for each space. Additionally, each failed regex match scans the remaining run of spaces, which is linear in its length, so tokenizing the whole run is O(n**2) overall.
I found this while tokenizing code samples uploaded to a public website. One sample for some reason ended with 40,000 spaces, which took two hours to tokenize.
Demonstration:
{{{
import token
import tokenize

try:
    # Python 2
    from cStringIO import StringIO
except ImportError:
    # Python 3
    from io import StringIO

# An "@" followed by 10,000 trailing spaces and no newline.
code = "@" + " " * 10000
code_reader = StringIO(code).readline

# Each trailing space comes back as a separate ERRORTOKEN.
for num, (ttyp, ttok, _, _, _) in enumerate(tokenize.generate_tokens(code_reader)):
    print("%5d %15s %r" % (num, token.tok_name[ttyp], ttok))
}}}
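To make the quadratic growth concrete, here is a minimal timing sketch (mine, not part of the original report): on an affected Python version, doubling the number of trailing spaces should roughly quadruple the time spent in generate_tokens.
{{{
import time
import tokenize
from io import StringIO

for n in (2500, 5000, 10000):
    code = "@" + " " * n  # trailing spaces, no final newline
    reader = StringIO(code).readline
    start = time.time()
    # Consume the whole token stream; each trailing space is an ERRORTOKEN.
    for _ in tokenize.generate_tokens(reader):
        pass
    print("%6d spaces: %.3f s" % (n, time.time() - start))
}}}
As a practical workaround until tokenize itself is fixed, stripping trailing whitespace from untrusted input before tokenizing (e.g. code.rstrip() + "\n") sidesteps the pathological case.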
History
Date                 User    Action  Args
2012-10-06 21:09:21  nedbat  set     recipients: + nedbat
2012-10-06 21:09:21  nedbat  set     messageid: <1349557761.41.0.537472367724.issue16152@psf.upfronthosting.co.za>
2012-10-06 21:09:21  nedbat  link    issue16152 messages
2012-10-06 21:09:19  nedbat  create
