Message 176458 - Python tracker

Author	lpd
Recipients	ezio.melotti, lpd, mrabarnett
Date	2012-11-27.00:07:42
Message-id	<1353974863.42.0.0697393533738.issue16563@psf.upfronthosting.co.za>

Content

I've read a number of reports of exponential-time regexp matching, but this regexp uses no unusual features, requires no backtracking, and only loops "forever" on certain input strings.
I listed the Python version as 2.6. I actually observed the behavior in 2.5.1 and 2.5.2, but I'm almost certain it is still present, because I saw the same behavior in a very recent build of Google's V8 interpreter, which I believe uses the same regexp engine.
Here's the test case:
import re
re_utf8 = r'^([\x00-\x7f]+|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]|[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf])*$'
s = "\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x14\x00\x00\x00\x01\x00\x00,`\x00\x00\x004\x00\x01\x8d"
print re.match(re_utf8, s)
If you pass s[0:34] or s[34:35] to re.match, it returns the correct answer, but the code above appears to loop forever.
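
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the inner `+` sits inside a group that is itself repeated by `*`, so a run of n ASCII bytes can be split into groups in 2^(n-1) ways, and a backtracking engine may try all of them once the trailing `\x8d` makes the overall match fail. The sketch below is written for Python 3 with bytes patterns, not the 2.5/2.6 versions in the report. The names `NESTED`, `FLAT`, `REPORTED`, and `time_failing_match` are made up for illustration. It times the failing match on short inputs only and compares against a variant without the inner `+`.

```python
import re
import time

# Pattern from the report, as a Python 3 bytes pattern.
# The inner "+" is nested inside a group repeated by "*".
NESTED = re.compile(
    rb'^([\x00-\x7f]+|[\xc0-\xdf][\x80-\xbf]'
    rb'|[\xe0-\xef][\x80-\xbf][\x80-\xbf]'
    rb'|[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf])*$'
)

# Hypothetical variant: same alternatives, inner "+" removed.
# The outer "*" already repeats the group, so the same strings are accepted.
FLAT = re.compile(
    rb'^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]'
    rb'|[\xe0-\xef][\x80-\xbf][\x80-\xbf]'
    rb'|[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf])*$'
)

# The reported input, written compactly: 34 ASCII bytes, then a stray 0x8d.
REPORTED = (
    b"\x7fELF\x01\x02\x01" + b"\x00" * 10 +
    b"\x03\x00\x14\x00\x00\x00\x01\x00\x00,`\x00\x00\x004\x00\x01\x8d"
)


def time_failing_match(pattern, n):
    """Time one match attempt on n ASCII bytes followed by a stray 0x8d."""
    data = b"\x00" * n + b"\x8d"
    start = time.perf_counter()
    result = pattern.match(data)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: the nested pattern is expected to slow down sharply as n grows.
    for n in (10, 12, 14, 16):
        _, nested_seconds = time_failing_match(NESTED, n)
        _, flat_seconds = time_failing_match(FLAT, n)
        print(n, nested_seconds, flat_seconds)
    # The flat variant can be tried on the full reported input.
    print(FLAT.match(REPORTED))
```

If the nested timings grow much faster than the flat ones as n increases, that would be consistent with backtracking over the ways of splitting the ASCII run, rather than a true infinite loop.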

History
Date	User	Action	Args
2012-11-27 00:07:43	lpd	set	recipients: + lpd, ezio.melotti, mrabarnett
2012-11-27 00:07:43	lpd	set	messageid: <1353974863.42.0.0697393533738.issue16563@psf.upfronthosting.co.za>
2012-11-27 00:07:43	lpd	link	issue16563 messages
2012-11-27 00:07:42	lpd	create
