Created on 2015-02-26 22:55 by Rick Otten, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Messages (8) | |||
|---|---|---|---|
| msg236715 | Author: Rick Otten (Rick Otten) | Date: 2015-02-26 23:00 | |
The documentation states that "|" parsing goes from left to right. This doesn't seem to be true when spaces (or \s) are involved.
Example:
In [40]: mystring
Out[40]: 'rwo incorporated'
In [41]: re.sub('incorporated| inc|llc|corporation|corp| co', '', mystring)
Out[41]: 'rwoorporated'
In this case " inc" was processed before incorporated.
If I take the space out:
In [42]: re.sub('incorporated|inc|llc|corporation|corp| co', '', mystring)
Out[42]: 'rwo '
incorporated is processed first.
If I put a space with each, then " incorporated" is processed first:
In [43]: re.sub(' incorporated| inc|llc|corporation|corp| co', '', mystring)
Out[43]: 'rwo'
And if I use \s instead of a space, it is processed first:
In [44]: re.sub('incorporated|\sinc|llc|corporation|corp| co', '', mystring)
Out[44]: 'rwoorporated'
|
|||
| msg236716 | Author: Mark Shannon (Mark.Shannon) * (Python committer) | Date: 2015-02-26 23:13 | |
This looks like the expected behaviour to me. re.sub matches the leftmost occurrence and the regular expression is greedy so (x|xy) will always match xy if it can. |
|||
| msg236718 | Author: Matthew Barnett (mrabarnett) * (Python triager) | Date: 2015-02-27 00:07 | |
@Mark is correct, it's not a bug.
In the first example:
It tries to match each alternative at position 0. Failure.
It tries to match each alternative at position 1. Failure.
It tries to match each alternative at position 2. Failure.
It tries to match each alternative at position 3. Success. ' inc' matches.
In the second example:
It tries to match each alternative at position 0. Failure.
It tries to match each alternative at position 1. Failure.
It tries to match each alternative at position 2. Failure.
It tries to match each alternative at position 3. Failure.
It tries to match each alternative at position 4. Success. 'incorporated' matches. ('inc' is a later alternative; it's considered only if the earlier alternatives have failed to match at that position.)
|
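The position-by-position search described above can be sketched in Python. This is only an illustration of the order in which alternatives are tried, not how the `re` engine is actually implemented; the helper name `first_match_trace` is hypothetical.

```python
import re

def first_match_trace(alternatives, text):
    """Return (position, alternative) for the first alternative that
    matches, trying every alternative at position 0, then at 1, etc.

    Hypothetical helper for illustration; the real engine works on a
    single compiled pattern rather than a list of separate ones.
    """
    compiled = [(alt, re.compile(alt)) for alt in alternatives]
    for pos in range(len(text) + 1):
        for alt, pattern in compiled:
            # Pattern.match(string, pos) anchors the attempt at `pos`.
            if pattern.match(text, pos):
                return pos, alt
    return None

mystring = 'rwo incorporated'

first = first_match_trace(
    ['incorporated', ' inc', 'llc', 'corporation', 'corp', ' co'], mystring)
second = first_match_trace(
    ['incorporated', 'inc', 'llc', 'corporation', 'corp', ' co'], mystring)

print(first)   # (3, ' inc')
print(second)  # (4, 'incorporated')
```

In the first call the space at position 3 lets ' inc' succeed before the scan ever reaches position 4, where 'incorporated' would have matched.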
|||
| msg236720 | Author: Rick Otten (Rick Otten) | Date: 2015-02-27 00:36 | |
Can the documentation be updated to make this clearer? I see now that the clause "As the target string is scanned, ..." describes what you have listed here. A coworker and I both read the description several times and missed that.
I thought it first tried "incorporated" against the whole string, then tried " inc" against the whole string, and so on. In fact it tries each alternative ("incorporated", " inc", and the others) at the first position of the string, and then again at the second position.
Since my particular use case needs each alternative tried against the whole string before moving on to the next one, I'll do a series of re.subs instead of trying to do them all in one. It makes sense now and is easy to fix.
Thanks for looking at it and explaining what is happening more clearly. It was really not obvious. I tried at least 100 variations and wasn't seeing the pattern. |
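A minimal sketch of the "series of re.subs" workaround mentioned above, assuming the same alternatives as in the original report. The helper name `sub_in_order` is hypothetical.

```python
import re

def sub_in_order(patterns, text):
    """Apply each pattern to the whole string before moving on to the
    next one, so earlier patterns always take priority."""
    for pattern in patterns:
        text = re.sub(pattern, '', text)
    return text

mystring = 'rwo incorporated'
patterns = ['incorporated', ' inc', 'llc', 'corporation', 'corp', ' co']

result = sub_in_order(patterns, mystring)
print(repr(result))  # 'rwo '
```

Here 'incorporated' is removed from the whole string first, so ' inc' no longer finds anything to match. The trailing space remains because none of the patterns consume it.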
|||
| msg236725 | Author: R. David Murray (r.david.murray) * (Python committer) | Date: 2015-02-27 02:18 | |
The thing is, what you describe is fundamental to how regular expressions work. I'm not sure it makes sense to add a specific mention of it to the '|' docs, since it applies to all regexes. |
|||
| msg236821 | Author: Matthew Barnett (mrabarnett) * (Python triager) | Date: 2015-02-27 19:18 | |
Not quite all. POSIX regexes will always look for the longest match, so the order of the alternatives doesn't matter, i.e. x|xy would give the same result as xy|x. |
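The contrast can be seen from the Python side: with `re`, swapping the order of `x|xy` changes the result. The second half of the sketch approximates a longest-match preference by sorting the alternatives by length; that only works for plain literal alternatives, it is not a POSIX engine, and the helper name `longest_first` is hypothetical.

```python
import re

text = 'xy'

# Python tries alternatives left to right and accepts the first success.
x_first = re.match('x|xy', text).group()
xy_first = re.match('xy|x', text).group()
print(x_first)   # x
print(xy_first)  # xy

def longest_first(alternatives):
    """Join literal alternatives with the longest ones first
    (hypothetical helper; only meaningful for literal strings)."""
    ordered = sorted(alternatives, key=len, reverse=True)
    return '|'.join(re.escape(alt) for alt in ordered)

pattern = longest_first(['x', 'xy'])
print(pattern)                           # xy|x
print(re.match(pattern, text).group())   # xy
```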
|||
| msg295128 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2017-06-04 15:57 | |
From the documentation:
"""
As the target string is scanned, REs separated by ``'|'`` are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once ``A`` matches, ``B`` will not be tested further, even if it would produce a longer overall match. In other words, the ``'|'`` operator is never greedy.
"""
I think this completely describes the behavior. |
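A short illustration of the quoted passage, reusing the string from the original report. Once the earlier alternative matches at a position, the later one is not tried there, even though it would match more text.

```python
import re

mystring = 'rwo incorporated'

# 'inc' is listed first, so it is accepted at position 4 even though
# 'incorporated' would produce a longer match at the same position.
short_first = re.sub('inc|incorporated', '', mystring)

# With the longer alternative listed first, the whole word is removed.
long_first = re.sub('incorporated|inc', '', mystring)

print(repr(short_first))  # 'rwo orporated'
print(repr(long_first))   # 'rwo '
```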
|||
| msg295129 | Author: Raymond Hettinger (rhettinger) * (Python committer) | Date: 2017-06-04 16:19 | |
I concur with Serhiy that the docs correctly and completely describe the behavior. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:13 | admin | set | github: 67720 |
| 2017-10-11 14:46:48 | berker.peksag | set | status: open -> closed; stage: resolved |
| 2017-06-04 16:19:21 | rhettinger | set | status: pending -> open; nosy: + rhettinger; messages: + msg295129 |
| 2017-06-04 15:57:51 | serhiy.storchaka | set | status: open -> pending; nosy: + serhiy.storchaka; messages: + msg295128; resolution: not a bug |
| 2016-10-16 22:32:17 | serhiy.storchaka | set | type: behavior -> enhancement; components: + Regular Expressions; versions: + Python 3.5, Python 3.6, Python 3.7 |
| 2015-02-27 19:18:42 | mrabarnett | set | messages: + msg236821 |
| 2015-02-27 02:18:20 | r.david.murray | set | title: regex "|" behavior differs from documentation -> add example of 'first match wins' to regex "|" documentation?; nosy: + r.david.murray, docs@python; messages: + msg236725; assignee: docs@python; components: + Documentation, - Regular Expressions |
| 2015-02-27 00:36:49 | Rick Otten | set | messages: + msg236720 |
| 2015-02-27 00:07:32 | mrabarnett | set | messages: + msg236718 |
| 2015-02-26 23:13:00 | Mark.Shannon | set | nosy: + Mark.Shannon; messages: + msg236716 |
| 2015-02-26 23:00:23 | Rick Otten | set | messages: + msg236715 |
| 2015-02-26 22:55:54 | Rick Otten | create | |
| 2015年02月26日 22:55:54 | Rick Otten | create | |