homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: zero-length match confuses re.finditer()
Type: behavior Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.7, Python 3.6, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: THRlWiTi, denversc, ezio.melotti, isoschiz, jfrechet, mrabarnett, niemeyer, rsc, serhiy.storchaka, timehorse
Priority: normal Keywords: patch

Created on 2007年01月29日 22:35 by jfrechet, last changed 2022年04月11日 14:56 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 4471 merged serhiy.storchaka, 2017年11月19日 23:36
PR 4678 closed serhiy.storchaka, 2017年12月02日 17:32
Messages (15)
msg31129 - (view) Author: Jacques Frechet (jfrechet) Date: 2007年01月29日 22:35
Hi!
re.finditer() seems to incorrectly increment the current position immediately after matching a zero-length substring. For example:
>>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'bc')]
What happened to the 'a'? I expected this result:
[('', None), (None, 'abc')]
Perl agrees with me:
% perl -le 'print defined(1ドル)?"\"1ドル\"":"undef",",",defined(2ドル)?"\"2ドル\"":"undef" while "abc" =~ /(z*)|(\w+)/g' 
"",undef
undef,"abc"
"",undef
Similarly, if I remove the ^:
>>> [m.groups() for m in re.finditer(r'(z*)|(\w+)', 'abc')]
[('', None), ('', None), ('', None), ('', None)]
Now all of the letters have fallen through the cracks! I expected this result:
[('', None), (None, 'abc'), ('', None)]
Again, perl agrees:
% perl -le 'print defined(1ドル)?"\"1ドル\"":"undef",",",defined(2ドル)?"\"2ドル\"":"undef" while "abc" =~ /(z*)|(\w+)/g' 
"",undef
undef,"abc"
"",undef
If this bug has already been reported, I apologize -- I wasn't able to find it here. I haven't looked at the code for the re module, but this seems like the sort of bug that might have been accidentally introduced in order to try to prevent the same zero-length match from being returned forever.
Thanks,
Jacques
msg73719 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008年09月24日 15:37
This also affects re.findall().
msg73737 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008年09月24日 17:59
What should:
 [m.groups() for m in re.finditer(r'(^z*)|(^q*)|(\w+)', 'abc')]
return? Should the second group also yield a zero-width match before the
third group is tried? I think it probably should. Does Perl?
msg73741 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008年09月24日 18:14
Hmmm. This strikes me as a bug, beyond the realm of Issue 3262. The
two items may be related, but the dropping of the 'a' seems like
unexpected behaviour that I doubt any current code is expecting to
occur. Clearly, what is going on is that the Engine starts scanning at
the 'a', finds the Zero-Width match and, having found a match,
increments its pointer within the input string, thus skipping the 'a'
when it matches 'bc'.
If it is indeed a bug, I think this should be considered for inclusion
in Python 2.6 rather than being part of the new Engine Design in Issue
3626. I think the solution would simply be to not increment the ptr
(which points to the input string) when findall / finditer encounters a
Zero-Width match.
msg73742 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008年09月24日 18:21
Never mind inclusion in 2.6 as no-one has repeated this bug in re-world
examples yet so it's going to have to wait for the Regexp 2.7 engine in
issue 2636.
msg73746 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008年09月24日 18:33
Ah, I see the problem, if ptr is not incremented, then it will keep
matching the first expression, (^z*), so it would have to both 'skip'
the 'a' and NOT skip the 'a'. Hmm. You're right, Matthew, this is
pretty complicated. Now, for your expression, Matthew,
r'(z*)|(^q*)|(\w+)', Perl gives:
"",undef,undef
undef,undef,"abc"
"",undef,undef
Meaning it doesn't even bother matching the ^q* since the ^z* matches
first. This seems the logical behaviour and fits with the idea that a
Zero-Width match would both only match once and NOT consume any
characters. An internal flag would just have to be created to tell the
2 find functions whether the current value of ptr would allow for a "No
Zero-Width Match" option on second go-around.
msg73755 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008年09月24日 21:14
What about r'(^z*)|(q*)|(\w+)'? I could imagine that the first group
could match only at the start of the string, but if the second group
doesn't have that restriction then it could match the second time, and
only after that could the third match, if you see what I mean. (The
previous example had (^q*) so it couldn't match because the first group
has already matched at the start of the string and we've already
advanced beyond that, even though by no characters!)
msg73765 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008年09月24日 23:34
FYI, I posted msg73737 after finding that the fix for the original case
was really very simple, but then thought about whether it would behave
as expected when there were more zero-width matches, hence the later posts.
msg73789 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008年09月25日 13:08
Perl gives this result for your new expression:
"",undef,undef
undef,undef,"abc"
undef,"",undef
I think it has to do with not thinking of a string as a sequence of
characters, but as a sequence of characters separated by null-space. 
Null-space is can be captured, but ONLY if it is part of a zero-width
match, and once captured, it can no longer be captured by another
zero-width expression. This is in keeping which what I see as Perl's
behaviour, namely that the (q*) group never participates in the first
match because, initially the (^z*) captures it. OTOH, when it gets to
the null-space AFTER the 'abc' capture, the (^z*) cannot participate
because it has a "at-beginning" restriction. The evaluator then moves
on to the (q*), which has no such restriction and this time it matches,
consuming the final null-space.
msg73792 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008年09月25日 13:55
I have to report that the fix appears to be successful:
>>> print [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'abc')]
>>> print re.findall(r"(^z*)|(\w+)", "abc")
[('', ''), ('', 'abc')]
>>> print [m.groups() for m in re.finditer(r"(^z*)|(q*)|(\w+)", "abc")]
[('', None, None), (None, None, 'abc'), (None, '', None)]
>>> print re.findall(r"(^z*)|(q*)|(\w+)", "abc")
[('', '', ''), ('', '', 'abc'), ('', '', '')]
The patch is regex_2.6rc2+7.diff.
msg73809 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008年09月25日 19:22
Matthew, I'll try to merge all your diffs with the current repository
over the weekend. Having done the first, I know where code differs
between your implementation, mine and the base, so I can apply your
patch, and then a patch that restores my changes so the rest of the
merges should be easy! :)
msg132827 - (view) Author: Denver Coneybeare (denversc) * Date: 2011年04月03日 01:59
I just re-tested this issue in trunk at changeset 053bc5ca199b and the issue is still exactly reproducible as originally reported. That is, the match to the empty string skips a character of the match:
>>> import re
>>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'bc')]
msg187318 - (view) Author: Martin Morrison (isoschiz) * Date: 2013年04月19日 00:21
This is still an issue today:
>>> import re
>>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'bc')]
msg221979 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014年06月30日 20:15
How does "the Regexp 2.7 engine in issue 2636" from msg73742 deal with this situation?
msg307556 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017年12月04日 12:29
New changeset 70d56fb52582d9d3f7c00860d6e90570c6259371 by Serhiy Storchaka in branch 'master':
bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. (#4471)
https://github.com/python/cpython/commit/70d56fb52582d9d3f7c00860d6e90570c6259371
History
Date User Action Args
2022年04月11日 14:56:22adminsetgithub: 44519
2018年03月14日 17:17:12serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2017年12月04日 12:29:09serhiy.storchakasetmessages: + msg307556
2017年12月02日 17:32:36serhiy.storchakasetpull_requests: + pull_request4587
2017年11月19日 23:36:58serhiy.storchakasetkeywords: + patch
stage: patch review
pull_requests: + pull_request4404
2017年11月18日 15:55:13serhiy.storchakasetnosy: + ezio.melotti
type: behavior
components: + Library (Lib)
2017年11月18日 15:55:00serhiy.storchakasetassignee: niemeyer -> serhiy.storchaka

nosy: + serhiy.storchaka
versions: + Python 3.6, Python 3.7
2016年11月05日 17:51:35BreamoreBoysetnosy: - BreamoreBoy
2016年11月05日 14:23:33THRlWiTisetnosy: + THRlWiTi
2014年06月30日 20:15:38BreamoreBoysetnosy: + BreamoreBoy
messages: + msg221979
2013年04月19日 00:21:29isoschizsetnosy: + isoschiz
messages: + msg187318
2011年04月03日 01:59:19denverscsetnosy: + denversc
messages: + msg132827
2008年09月25日 19:22:06timehorsesetmessages: + msg73809
2008年09月25日 13:55:38mrabarnettsetmessages: + msg73792
2008年09月25日 13:08:16timehorsesetmessages: + msg73789
2008年09月24日 23:34:19mrabarnettsetmessages: + msg73765
2008年09月24日 21:14:13mrabarnettsetmessages: + msg73755
2008年09月24日 18:33:08timehorsesetmessages: + msg73746
2008年09月24日 18:21:05timehorsesetmessages: + msg73742
versions: + Python 2.7, - Python 2.5
2008年09月24日 18:14:27timehorsesetmessages: + msg73741
2008年09月24日 17:59:56mrabarnettsetmessages: + msg73737
2008年09月24日 17:35:57timehorsesetnosy: + timehorse
2008年09月24日 15:37:46mrabarnettsetnosy: + mrabarnett
messages: + msg73719
2008年04月24日 21:08:30rscsetnosy: + rsc
2007年01月29日 22:35:21jfrechetcreate

AltStyle によって変換されたページ (->オリジナル) /