-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Fix Regex::searchAll behaviour wrt empty capturing groups #2393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Previously, searchAll would stop search when it encountered an empty matching group in any position. This means that, for example, regular expression "(a)(b?)(c)" would match string "ac", but the resulting group list would be ["ac", "a"]. After this change, the resulting list for the aforementioned regular expression becomes ["ac", "a", "", "c"] like it should've been. Additionally, this also changes behaviour for multiple matches. For example, when "aaa00bbb" is matched by "[a-z]*", previously only "aaa" would be returned. Now the matching list is ["aaa", "", "", "bbb", ""]. The old behaviour was confusing and almost certainly a bug. The new behaviour is the same as in Python's re.findall. For reference, though, Go does it somewhat differently: empty matches at the end of non-empty matches are ignored, so in Go above example is ["aaa", "", "bbb"] instead. This is the root cause of issue owasp-modsecurity#2336 which has been already fixed by replacing searchAll call there with a new function.
@martinhsv you may want to have a look at this one as well.
You are quite right that the underlying problem with searchAll, that was one of the initial triggers for creating the new function now used by the rx operator, still exists in searchAll for that other functionality (other than the rx operator) that still use it.
However, I believe there is a correctness issue with the particular solution in this pull request.
Consider a scenario where the regex is "^(|abc|def)" and the input is "abc".
In this case there should be two successful full matches. The first match should be an empty string. The second match should be "abc".
With the code in this pull request, the first match of an empty string will occur successfully, but then the code will advance the offset, and on the second iteration of the loop, "abc" can no longer be produced as a match.
(Note: There is a new searchGlobal function is another recent pull request (to support a new rxGlobal operator). I had a notion to implement that and then, after some stabilization time, look to migrate the remaining users of searchAll to the new function.
You may wish to wait for that process to play out.)
Interesting, I never thought about that. It's also interesting to note that Python used to have "my" behaviour (re.findall(r"^(|abc|def)", "abc") == [""]
) until Python 3.7, where it was changed to allow non-empty matches right after empty match (re.findall(r"^(|abc|def)", "abc") == ["", "abc"]
). python/cpython@70d56fb
Previously, searchAll would stop search when it encountered an empty
matching group in any position. This means that, for example,
regular expression "(a)(b?)(c)" would match string "ac", but the
resulting group list would be ["ac", "a"].
After this change, the resulting list for the aforementioned regular
expression becomes ["ac", "a", "", "c"] like it should've been.
Additionally, this also changes behaviour for multiple matches. For
example, when "aaa00bbb" is matched by "[a-z]*", previously only "aaa"
would be returned. Now the matching list is ["aaa", "", "", "bbb", ""].
The old behaviour was confusing and almost certainly a bug. The new
behaviour is the same as in Python's re.findall.
For reference, though, Go does it somewhat differently: empty matches
at the end of non-empty matches are ignored, so in Go above example is
["aaa", "", "bbb"] instead.
This is the root cause of issue #2336 which has been already fixed by
replacing searchAll call there with a new function.
Cherry-picked from #2012