This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: add example of 'first match wins' to regex "|" documentation?
Type: enhancement Stage: resolved
Components: Documentation, Regular Expressions Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Mark.Shannon, Rick Otten, docs@python, ezio.melotti, mrabarnett, r.david.murray, rhettinger, serhiy.storchaka
Priority: normal Keywords:

Created on 2015-02-26 22:55 by Rick Otten, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (8)
msg236715 - Author: Rick Otten (Rick Otten) Date: 2015-02-26 23:00
The documentation states that "|" parsing goes from left to right. This doesn't seem to be true when spaces (or \s) are involved.
Example:
In [40]: mystring
Out[40]: 'rwo incorporated'
In [41]: re.sub('incorporated| inc|llc|corporation|corp| co', '', mystring)
Out[41]: 'rwoorporated'
In this case " inc" was processed before incorporated.
If I take the space out:
In [42]: re.sub('incorporated|inc|llc|corporation|corp| co', '', mystring)
Out[42]: 'rwo '
incorporated is processed first.
If I put a space with each, then " incorporated" is processed first:
In [43]: re.sub(' incorporated| inc|llc|corporation|corp| co', '', mystring)
Out[43]: 'rwo'
And if I use \s instead of a space, "\sinc" is processed first:
In [44]: re.sub('incorporated|\sinc|llc|corporation|corp| co', '', mystring)
Out[44]: 'rwoorporated'
msg236716 - Author: Mark Shannon (Mark.Shannon) (Python committer) Date: 2015-02-26 23:13
This looks like the expected behaviour to me.
re.sub matches the leftmost occurrence and the regular expression is greedy so (x|xy) will always match xy if it can.
msg236718 - Author: Matthew Barnett (mrabarnett) (Python triager) Date: 2015-02-27 00:07
@Mark is correct, it's not a bug.
In the first example:
It tries to match each alternative at position 0. Failure.
It tries to match each alternative at position 1. Failure.
It tries to match each alternative at position 2. Failure.
It tries to match each alternative at position 3. Success. ' inc' matches.
In the second example:
It tries to match each alternative at position 0. Failure.
It tries to match each alternative at position 1. Failure.
It tries to match each alternative at position 2. Failure.
It tries to match each alternative at position 3. Failure.
It tries to match each alternative at position 4. Success. 'incorporated' matches. ('inc' is a later alternative; it's considered only if the earlier alternatives have failed to match at that position.)
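The position-by-position walk-through above can be sketched in Python. This is only an illustration of the described scan, not how the re engine is implemented; the helper name trace_first_match is hypothetical, and the alternatives are the ones from the first two examples in the report.

```python
import re

def trace_first_match(alternatives, text):
    """Simulate the scan: at each position, try the alternatives left to right."""
    for pos in range(len(text) + 1):
        for alt in alternatives:
            # Pattern.match(text, pos) only succeeds if alt matches starting at pos.
            m = re.compile(alt).match(text, pos)
            if m:
                return pos, m.group()
    return None

mystring = 'rwo incorporated'
first = ['incorporated', ' inc', 'llc', 'corporation', 'corp', ' co']
second = ['incorporated', 'inc', 'llc', 'corporation', 'corp', ' co']

print(trace_first_match(first, mystring))   # (3, ' inc')
print(trace_first_match(second, mystring))  # (4, 'incorporated')

# re.search on the joined pattern reports the same leftmost match.
m = re.search('|'.join(first), mystring)
print(m.start(), repr(m.group()))           # 3 ' inc'
```

The position is the outer loop and the alternatives are the inner loop, which is why ' inc' at position 3 wins over 'incorporated' at position 4 even though 'incorporated' is listed first.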
msg236720 - Author: Rick Otten (Rick Otten) Date: 2015-02-27 00:36
Can the documentation be updated to make this more clear?
I see now where the clause "As the target string is scanned, ..." is describing what you have listed here.
A coworker and I both read the description several times and missed that. I thought it first tried "incorporated" against the whole string, then tried " inc" against the whole string, and so on, when actually it was trying each of "incorporated", " inc", and the others against the first position of the string, and then again against the second position.
Since I want to force the order against the whole string before trying the next one for my particular use case, I'll do a series of re.subs instead of trying to do them all in one. It makes sense now and is easy to fix.
Thanks for looking at it and explaining what is happening more clearly. It was really not obvious. I tried at least 100 variations and wasn't seeing the pattern.
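The "series of re.subs" approach mentioned above might look like the following sketch. The function name strip_terms is hypothetical and the pattern list simply reuses the alternatives from the original report; each pattern is applied to the whole string before the next one is tried.

```python
import re

def strip_terms(text, patterns):
    """Apply each pattern to the whole string, in order, before moving on."""
    for pattern in patterns:
        text = re.sub(pattern, '', text)
    return text

patterns = ['incorporated', ' inc', 'llc', 'corporation', 'corp', ' co']

print(repr(strip_terms('rwo incorporated', patterns)))  # 'rwo '
print(repr(strip_terms('acme inc', patterns)))          # 'acme'
```

With this ordering, 'incorporated' is removed everywhere first, so ' inc' can no longer consume the start of it.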
msg236725 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2015-02-27 02:18
The thing is, what you describe is fundamental to how regular expressions work. I'm not sure it makes sense to add a specific mention of it to the '|' docs, since it applies to all regexes.
msg236821 - Author: Matthew Barnett (mrabarnett) (Python triager) Date: 2015-02-27 19:18
Not quite all. POSIX regexes will always look for the longest match, so the order of the alternatives doesn't matter, i.e. x|xy would give the same result as xy|x.
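For contrast, a short sketch of how the order of alternatives matters in Python's re module. The longest_first helper is a hypothetical workaround, not part of re, and it only approximates longest-match behaviour for plain literal strings.

```python
import re

# In Python's re, the first alternative that matches at a position wins.
print(re.match('x|xy', 'xy').group())  # x
print(re.match('xy|x', 'xy').group())  # xy

def longest_first(literals):
    """Build an alternation of escaped literals, longest first (assumption:
    the inputs are literal strings, not sub-patterns)."""
    ordered = sorted(literals, key=len, reverse=True)
    return '|'.join(re.escape(s) for s in ordered)

pattern = longest_first(['x', 'xy'])
print(pattern)                           # xy|x
print(re.match(pattern, 'xy').group())   # xy
```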
msg295128 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-06-04 15:57
From the documentation:
"""
As the target string is scanned, REs separated by ``'|'`` are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once ``A`` matches, ``B`` will not be tested further, even if it would produce a longer overall match. In other words, the ``'|'`` operator is never greedy.
"""
I think this completely describes the behavior.
msg295129 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2017-06-04 16:19
I concur with Serhiy that the docs correctly and completely describe the behavior.
History
Date | User | Action | Args
2022-04-11 14:58:13 | admin | set | github: 67720
2017-10-11 14:46:48 | berker.peksag | set | status: open -> closed
    stage: resolved
2017-06-04 16:19:21 | rhettinger | set | status: pending -> open
    nosy: + rhettinger
    messages: + msg295129
2017-06-04 15:57:51 | serhiy.storchaka | set | status: open -> pending
    nosy: + serhiy.storchaka
    messages: + msg295128
    resolution: not a bug
2016-10-16 22:32:17 | serhiy.storchaka | set | type: behavior -> enhancement
    components: + Regular Expressions
    versions: + Python 3.5, Python 3.6, Python 3.7
2015-02-27 19:18:42 | mrabarnett | set | messages: + msg236821
2015-02-27 02:18:20 | r.david.murray | set | title: regex "|" behavior differs from documentation -> add example of 'first match wins' to regex "|" documentation?
    nosy: + r.david.murray, docs@python
    messages: + msg236725
    assignee: docs@python
    components: + Documentation, - Regular Expressions
2015-02-27 00:36:49 | Rick Otten | set | messages: + msg236720
2015-02-27 00:07:32 | mrabarnett | set | messages: + msg236718
2015-02-26 23:13:00 | Mark.Shannon | set | nosy: + Mark.Shannon
    messages: + msg236716
2015-02-26 23:00:23 | Rick Otten | set | messages: + msg236715
2015-02-26 22:55:54 | Rick Otten | create |
