Issue 23532: add example of 'first match wins' to regex "|" documentation?

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67720

classification

Title:	add example of 'first match wins' to regex "|" documentation?
Type:	enhancement	Stage:	resolved
Components:	Documentation, Regular Expressions	Versions:	Python 3.7, Python 3.6, Python 3.5, Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	not a bug
Assigned To:	docs@python	Nosy List:	Mark.Shannon, Rick Otten, docs@python, ezio.melotti, mrabarnett, r.david.murray, rhettinger, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2015-02-26 22:55 by Rick Otten, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (8)
msg236715	Author: Rick Otten (Rick Otten)	Date: 2015-02-26 23:00
The documentation states that "|" parsing goes from left to right. This doesn't seem to be true when spaces (or \s) are involved.

Example:

In [40]: mystring
Out[40]: 'rwo incorporated'

In [41]: re.sub('incorporated| inc|llc|corporation|corp| co', '', mystring)
Out[41]: 'rwoorporated'

In this case " inc" was processed before incorporated. If I take the space out:

In [42]: re.sub('incorporated|inc|llc|corporation|corp| co', '', mystring)
Out[42]: 'rwo '

incorporated is processed first. If I put a space with each, then " incorporated" is processed first:

In [43]: re.sub(' incorporated| inc|llc|corporation|corp| co', '', mystring)
Out[43]: 'rwo'

And if I use \s instead of a space, it is processed first:

In [44]: re.sub('incorporated|\sinc|llc|corporation|corp| co', '', mystring)
Out[44]: 'rwoorporated'
msg236716	Author: Mark Shannon (Mark.Shannon) * (Python committer)	Date: 2015-02-26 23:13
This looks like the expected behaviour to me. re.sub matches the leftmost occurrence and the regular expression is greedy so (x|xy) will always match xy if it can.
msg236718	Author: Matthew Barnett (mrabarnett) * (Python triager)	Date: 2015-02-27 00:07
@Mark is correct, it's not a bug.

In the first example:

It tries to match each alternative at position 0. Failure.
It tries to match each alternative at position 1. Failure.
It tries to match each alternative at position 2. Failure.
It tries to match each alternative at position 3. Success. ' inc' matches.

In the second example:

It tries to match each alternative at position 0. Failure.
It tries to match each alternative at position 1. Failure.
It tries to match each alternative at position 2. Failure.
It tries to match each alternative at position 3. Failure.
It tries to match each alternative at position 4. Success. 'incorporated' matches.

('inc' is a later alternative; it's considered only if the earlier alternatives have failed to match at that position.)
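
The scan described above can be reproduced by hand. The following is only a minimal sketch: `trace_first_match` is a hypothetical helper written for illustration (it is not part of the `re` module), and it assumes every alternative is a literal string. It tries each alternative at each position, in order, and reports the first success.

```python
import re


def trace_first_match(alternatives, text):
    """Hypothetical helper: return (position, alternative) of the first match.

    Outer loop: positions in the string, left to right.
    Inner loop: alternatives, left to right, all anchored at that position.
    """
    for pos in range(len(text) + 1):
        for alt in alternatives:
            # Pattern.match(string, pos) only matches starting exactly at pos.
            if re.compile(re.escape(alt)).match(text, pos):
                return pos, alt
    return None


text = "rwo incorporated"
first = ["incorporated", " inc", "llc", "corporation", "corp", " co"]
second = ["incorporated", "inc", "llc", "corporation", "corp", " co"]

print(trace_first_match(first, text))   # (3, ' inc')
print(trace_first_match(second, text))  # (4, 'incorporated')

# The combined pattern finds the same first match.
print(re.search("|".join(first), text).group())   # ' inc'
print(re.search("|".join(second), text).group())  # 'incorporated'
```

The position loop is the outer one, which is why ' inc' at position 3 wins over 'incorporated' at position 4 even though 'incorporated' is listed first.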
msg236720	Author: Rick Otten (Rick Otten)	Date: 2015-02-27 00:36
Can the documentation be updated to make this more clear? I see now where the clause "As the target string is scanned, ..." is describing what you have listed here. I and a coworker both read the description several times and missed that.

I thought it first tried "incorporated" against the whole string, then tried " inc" against the whole string, etc. When actually it was trying each, "incorporated" and " inc" and the others, against the first position of the string. And then again for the second position.

Since I want to force the order against the whole string before trying the next one for my particular use case, I'll do a series of re.subs instead of trying to do them all in one. It makes sense now and is easy to fix.

Thanks for looking at it and explaining what is happening more clearly. It was really not obvious. I tried at least 100 variations and wasn't seeing the pattern.
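
A series of re.subs along those lines might look like the sketch below. The helper name `strip_in_order` is hypothetical, and the pattern list is taken from the first example in this issue; each pattern is applied to the whole string before the next one is tried.

```python
import re


def strip_in_order(text, patterns):
    """Hypothetical helper: apply one re.sub per pattern, in list order."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text


patterns = ["incorporated", " inc", "llc", "corporation", "corp", " co"]
mystring = "rwo incorporated"

# Sequential passes: 'incorporated' is removed everywhere before ' inc' is tried.
print(repr(strip_in_order(mystring, patterns)))      # 'rwo '

# Single pass with alternation: ' inc' wins because it matches at an earlier position.
print(repr(re.sub("|".join(patterns), "", mystring)))  # 'rwoorporated'
```

The sequential version scans the string once per pattern, so it trades some speed for a priority order that does not depend on where each alternative happens to start.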
msg236725	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2015-02-27 02:18
The thing is, what you describe is fundamental to how regular expressions work. I'm not sure it makes sense to add a specific mention of it to the '|' docs, since it applies to all regexes.
msg236821	Author: Matthew Barnett (mrabarnett) * (Python triager)	Date: 2015-02-27 19:18
Not quite all. POSIX regexes will always look for the longest match, so the order of the alternatives doesn't matter, i.e. x|xy would give the same result as xy|x.
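
For contrast, a short sketch of how Python's re module treats the same two patterns, where the order does matter. The helper `longest_first` is hypothetical; it only approximates the longest-match result for an alternation of literal strings and is not a general emulation of POSIX semantics.

```python
import re

# In Python's re module the first alternative that matches at a position wins.
print(re.match("x|xy", "xy").group())  # 'x'
print(re.match("xy|x", "xy").group())  # 'xy'


def longest_first(alternatives):
    """Hypothetical helper: join literal alternatives, longest first.

    With literals only, putting longer strings first makes the longest
    candidate at each position the one that is tried first.
    """
    ordered = sorted(alternatives, key=len, reverse=True)
    return "|".join(re.escape(alt) for alt in ordered)


pattern = longest_first(["x", "xy"])
print(pattern)                          # 'xy|x'
print(re.match(pattern, "xy").group())  # 'xy'
```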
msg295128	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2017-06-04 15:57
From the documentation:

"""
As the target string is scanned, REs separated by ``'|'`` are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once ``A`` matches, ``B`` will not be tested further, even if it would produce a longer overall match. In other words, the ``'|'`` operator is never greedy.
"""

I think this completely describes the behavior.
msg295129	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2017-06-04 16:19
I concur with Serhiy that the docs correctly and completely describe the behavior.

History
Date	User	Action	Args
2022-04-11 14:58:13	admin	set	github: 67720
2017-10-11 14:46:48	berker.peksag	set	status: open -> closed stage: resolved
2017-06-04 16:19:21	rhettinger	set	status: pending -> open nosy: + rhettinger messages: + msg295129
2017-06-04 15:57:51	serhiy.storchaka	set	status: open -> pending nosy: + serhiy.storchaka messages: + msg295128 resolution: not a bug
2016-10-16 22:32:17	serhiy.storchaka	set	type: behavior -> enhancement components: + Regular Expressions versions: + Python 3.5, Python 3.6, Python 3.7
2015-02-27 19:18:42	mrabarnett	set	messages: + msg236821
2015-02-27 02:18:20	r.david.murray	set	title: regex "|" behavior differs from documentation -> add example of 'first match wins' to regex "|" documentation? nosy: + r.david.murray, docs@python messages: + msg236725 assignee: docs@python components: + Documentation, - Regular Expressions
2015-02-27 00:36:49	Rick Otten	set	messages: + msg236720
2015-02-27 00:07:32	mrabarnett	set	messages: + msg236718
2015-02-26 23:13:00	Mark.Shannon	set	nosy: + Mark.Shannon messages: + msg236716
2015-02-26 23:00:23	Rick Otten	set	messages: + msg236715
2015-02-26 22:55:54	Rick Otten	create
