The problem I'm solving is a more complex version of pairing up opening and closing brackets. Instead of matching only single characters like ([{}]), I also need to match opening sequences of arbitrary length to closing sequences of arbitrary length, such as '(' mapped to ')'.
Only the top-level matches should be returned, but the matches still have to account for the inner ones. This means that with a simple parenthesis mapping, ((f)) should match only on the first opening bracket and the final closing bracket.
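To make that concrete, with the function defined below the simple parenthesis case would behave like this:

>>> get_top_level_matching_pairs("((f))", {'(': ')'})
[(0, 1, 4, 5)]

Only the outer pair is reported, but the inner pair still has to be tracked to find the correct closing bracket.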
Motivation:
When parsing a grammar in a custom parser, I need to distinguish bracket literals, which are surrounded by apostrophes, from brackets at the grammar level, which are plain brackets. The goal is to return a list of all top-level matches that are found. A match is represented as a 4-tuple of the ranges that the opening and closing sequences span.
As a motivating example, evaluating the expression id '(' [FArgs] ')' ['::' FunType] '{' VarDecl* Stmt+ '}' with the mapping {'(': ')', '[': ']', '{': '}', "'('": "')'", "'['": "']'", "'{'": "'}'"} should yield the following list:
[(3, 6, 15, 18), (19, 20, 32, 33), (34, 37, 53, 56)]
For instance, the first tuple pairs the literal '(' spanning indices 3-6 with the literal ')' spanning indices 15-18.
My starting point was this code review on checking for balanced brackets in Python, which I adapted to suit the additional requirements:
- Match arbitrary opening and closing sequences of length >= 1
- Collect tuples of the ranges at which the sequences are found.
Armed with Python 3.9, this is my code:
from typing import KeysView, Optional, Union, ValuesView


def get_top_level_matching_pairs(expression: str, mapping: dict[str, str]) \
        -> list[tuple[int, int, int, int]]:
    """
    Returns all top-level matches of opening sequences to closing sequences.
    Each match is represented as a 4-tuple of the ranges that the opening and closing sequences span.
    >>> get_top_level_matching_pairs("(a) op (b) cl", {'(': ')', 'op': 'cl'})
    [(0, 1, 2, 3), (4, 6, 11, 13)]
    """
    def head_starts_with_one_from(match_targets: Union[KeysView[str], ValuesView[str]]) -> Optional[str]:
        # Check whether the expression, from index i, starts with one of the provided keys or values.
        # Return the first match found, None otherwise.
        return next(filter(lambda m: expression.startswith(m, i), match_targets), None)

    res = []
    queue = []  # Functions as a stack to keep track of the opened pairs.
    start_index = None
    start_match = None
    i = 0
    while i < len(expression):
        if open := head_starts_with_one_from(mapping.keys()):
            if start_index is None:
                start_index = i
                start_match = open
            queue.append(mapping[open])  # Store the closing counterpart for easy comparisons
            i += len(open)
            continue
        if close := head_starts_with_one_from(mapping.values()):
            try:
                if (stack_head := queue.pop()) == close:
                    if not queue:  # This closing token closes a top-level opening sequence, so add the result
                        res.append((start_index, start_index + len(start_match), i, i + len(close)))
                        start_index = None
                        start_match = None
                    i += len(close)
                    continue
                # raise mismatched opening and closing characters.
            except IndexError:
                # raise closing sequence without an opening. (uses stack_head variable)
                pass
        i += 1
    return res
My questions:
- Should I put the preconditions in the docstring for the function, or should I verify them in the code? The preconditions are: strings in the mapping are not empty, and no mapping string can be a subset of another.
- Is head_starts_with_one_from justified as a nested function? It allows for some neat walrus expressions on the if open and if close lines.
- Is there a better way to iterate over the expression, given that you need to match a varying range (here: 1 or 3 characters) of the expression at once and sometimes skip over part of it?
Of course, additional comments are more than welcome.
2 Answers
In response to question 1: in a REPL, help(obj) displays the docstring for the object, so the docstring should contain the information a developer needs to use the object (class, function, module, method, ...).
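For example, a minimal sketch that states the preconditions in the docstring and also verifies them up front could look like this. The helper name check_mapping is my own, and I am reading the question's "no mapping string can be a subset of another" as "no sequence is a prefix of another", since the example mapping deliberately contains ( inside '(':

def check_mapping(mapping: dict[str, str]) -> None:
    """Validate a bracket mapping.

    Preconditions:
    - no opening or closing sequence is empty
    - no sequence is a prefix of another (hypothetical reading of the "subset" rule)
    """
    sequences = list(mapping.keys()) + list(mapping.values())
    if any(not s for s in sequences):
        raise ValueError("Mapping contains an empty sequence")
    for a in sequences:
        for b in sequences:
            if a != b and b.startswith(a):
                raise ValueError(f"{a!r} is a prefix of {b!r}")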
In response to your 3rd question: your method of scanning is fairly simple to understand and implement. However, it is not very efficient, because it potentially checks each position in the expression against each opening and closing sequence. It may be "good enough" for your use case, but if there are many sequences, the expressions are long, or there are many expressions to check, another approach may be more efficient.
The re module from the standard library can be used to simplify the code. There is some overhead to compile the regular expression, but a regular expression essentially checks for all the sequences in parallel, and re is implemented in C so it runs at "C speed". You should test to see which approach is faster for your use case.
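For instance, a rough comparison with timeit might look like this. Here "my_parser" is only a placeholder for wherever the two implementations live, and build_top_level_scanner refers to the factory function defined further down in this answer:

import timeit

# "my_parser" is a placeholder module name; adjust the import to your project layout.
setup = """
from my_parser import get_top_level_matching_pairs, build_top_level_scanner
expression = "id '(' [FArgs] ')' ['::' FunType] '{' VarDecl* Stmt+ '}'"
mapping = {'(': ')', '[': ']', '{': '}', "'('": "')'", "'['": "']'", "'{'": "'}'"}
scan = build_top_level_scanner(mapping)
"""
print(timeit.timeit("get_top_level_matching_pairs(expression, mapping)", setup=setup, number=10_000))
print(timeit.timeit("scan(expression)", setup=setup, number=10_000))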
First, build a regular expression to match any of the opening or closing sequences.
import re

left = '|'.join(re.escape(k) for k in mapping.keys())
right = '|'.join(re.escape(k) for k in mapping.values())
pattern = re.compile(f"(?P<left>{left})|(?P<right>{right})")
The first line builds a regular expression that matches any of the opening (or left) sequences. Similarly, the second line builds one for any of the closing (or right) sequences. re.escape() is used to escape any characters in a sequence that would have a special meaning to the regex compiler. The f-string combines these pieces into a larger regular expression with named groups. Based on the motivating example, pattern would be a compiled regular expression for:
"(?P<left>\\(|\\[|\\{|'\\('|'\\['|'\\{')|(?P<right>\\)|\\]|\\}|'\\)'|'\\]'|'\\}')"
This matches any of the opening (left side) sequences in the group named 'left' or any of the closing (right side) sequences in the group named 'right'.
The resulting compiled regular expression can be used to scan expression for opening or closing sequences.
stack = []
result = []
for match in pattern.finditer(expression):
    if match.lastgroup == 'left':
        stack.append((match['left'], match.span()))
    else:
        if not stack:
            raise ValueError("Unmatched closing sequence")
        left, left_span = stack.pop()
        if match['right'] != mapping[left]:
            raise ValueError("Mismatched sequences")
        if not stack:
            result.append(left_span + match.span())
if stack:
    raise ValueError("Unmatched opening sequence.")
match is an re.Match object. match.lastgroup is the name of the last (or only) capture group that matched, which can be used to determine whether it matched an opening or closing sequence. Conveniently, match.span() returns a tuple of the indices of the matched sequence, so you don't need to keep track of them yourself.
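As a quick illustration, assuming expression is bound to the motivating example from the question, the raw matches come out like this:

for m in pattern.finditer(expression):
    print(m.lastgroup, m.span())

left (3, 6)
left (7, 8)
right (13, 14)
right (15, 18)
left (19, 20)
right (32, 33)
left (34, 37)
right (53, 56)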
Presumably, the regular expression would be compiled once and then used repeatedly to scan many expressions. This could be done by defining a factory function that takes a mapping and returns a scanner:
def build_top_level_scanner(mapping):
    left = '|'.join(re.escape(k) for k in mapping.keys())
    right = '|'.join(re.escape(k) for k in mapping.values())
    pattern = re.compile(f"(?P<left>{left})|(?P<right>{right})")

    def scanner(expression):
        stack = []
        result = []
        for match in pattern.finditer(expression):
            if match.lastgroup == 'left':
                stack.append((match['left'], match.span()))
            else:
                if not stack:
                    raise ValueError("Unmatched closing sequence")
                left, left_span = stack.pop()
                if match['right'] != mapping[left]:
                    raise ValueError("Mismatched sequences")
                if not stack:
                    result.append(left_span + match.span())
        if stack:
            raise ValueError("Unmatched opening sequence.")
        return result

    return scanner
Used like so:
expression = """id '(' [FArgs] ')' ['::' FunType] '{' VarDecl* Stmt+ '}'"""
mapping = {'(': ')', '[': ']', '{': '}', "'('": "')'", "'['": "']'", "'{'": "'}'"}
scan = build_top_level_scanner(mapping)
scan(expression)
which would return:
[(3, 6, 15, 18), (19, 20, 32, 33), (34, 37, 53, 56)]
It could also be done by defining a class:
class TopLevelScanner:
    def __init__(self, mapping):
        self.mapping = mapping
        left = '|'.join(re.escape(k) for k in mapping.keys())
        right = '|'.join(re.escape(k) for k in mapping.values())
        self.pattern = re.compile(f"(?P<left>{left})|(?P<right>{right})")

    def scan(self, expression):
        stack = []
        result = []
        for match in self.pattern.finditer(expression):
            if match.lastgroup == 'left':
                stack.append((match['left'], match.span()))
            else:
                if not stack:
                    raise ValueError("Unmatched closing sequence")
                left, left_span = stack.pop()
                if match['right'] != self.mapping[left]:
                    raise ValueError("Mismatched sequences")
                if not stack:
                    result.append(left_span + match.span())
        if stack:
            raise ValueError("Unmatched opening sequence.")
        return result
Used like so:
expression = """id '(' [FArgs] ')' ['::' FunType] '{' VarDecl* Stmt+ '}'"""
mapping = {'(': ')', '[': ']', '{': '}', "'('": "')'", "'['": "']'", "'{'": "'}'"}
scanner = TopLevelScanner(mapping)
scanner.scan(expression)
As an alternative to the re module, the third-party PyAhoCorasick library could be used in a similar manner.
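A rough sketch of finding the sequences with that library is shown below; it only covers the scanning part, and overlapping hits (such as ( inside '(') would still need to be resolved, for example by preferring the longest match at each position:

import ahocorasick  # pip install pyahocorasick

def find_sequences(expression, mapping):
    automaton = ahocorasick.Automaton()
    for seq in set(mapping) | set(mapping.values()):
        automaton.add_word(seq, seq)
    automaton.make_automaton()
    # Automaton.iter() yields (end_index, value) pairs; convert them to (start, end) spans.
    for end, seq in automaton.iter(expression):
        yield seq, (end - len(seq) + 1, end + 1)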
- I am in awe of this code. I disregarded regexes because they can't solve the problem on their own, but combining them with the delimiters algorithm makes so much sense. The code is very elegant. I ran timeit on your and @Peilonrayz's non-regex solution and your code is 10-15 times faster than the non-regex solution, regardless of the number of runs. Just for fun I also ran it while constantly recompiling the regex pattern, which only slows it down by a factor of 2. – Iterniam, Mar 5, 2021
- Two small nitpicks: stack.pop() causes an IndexError if there are more closing than opening sequences, which would ideally be re-thrown as a ValueError with a custom error message. A similar error could be raised if stack is not empty after going through the for loop. – Iterniam, Mar 5, 2021
- @Iterniam, I added checks for the two bugs you found. – RootTwo, Mar 5, 2021
- In my opinion head_starts_with_one_from would be cleaner using a generator expression:

def head_starts_with_one_from(match_targets: Iterable[str]) -> Optional[str]:
    return next(
        (m for m in match_targets if expression.startswith(m, i)),
        None,
    )
Or better yet, a plain old for loop:

def head_starts_with_one_from(match_targets: Iterable[str]) -> Optional[str]:
    for m in match_targets:
        if expression.startswith(m, i):
            return m
    return None
- Because you aren't storing start_index in queue, your code can't work with nested brackets. Storing the index in the queue gets rid of the need for start_index, and you don't really need start_match either; see the next bullet point.

queue.append((i, mapping[open]))
- You never use stack_head, so you don't need to use the walrus operator there.
- You can use elif and else to get rid of the continue statements.
- You can use guard statements to prevent the arrow anti-pattern:

if queue.pop() != close:
    ...  # raise mismatched opening and closing characters.
from typing import Iterable, Optional


def get_top_level_matching_pairs(
    expression: str,
    mapping: dict[str, str],
) -> list[tuple[int, int, int, int]]:
    def head_starts_with_one_from(match_targets: Iterable[str]) -> Optional[str]:
        for m in match_targets:
            if expression.startswith(m, i):
                return m
        return None

    res = []
    queue = []
    i = 0
    while i < len(expression):
        if open := head_starts_with_one_from(mapping.keys()):
            queue.append((i, open))
            i += len(open)
        elif close := head_starts_with_one_from(mapping.values()):
            if not queue:
                ...  # raise closing sequence without an opening.
            index, first_match = queue.pop()
            match = mapping[first_match]
            if match != close:
                ...  # raise mismatched opening and closing characters.
            if not queue:  # This closing token closes a top-level opening sequence, so add the result
                res.append((index, index + len(first_match), i, i + len(close)))
            i += len(close)
        else:
            i += 1
    return res
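As a quick sanity check, this version still produces the expected result for the motivating example from the question (assuming the placeholder comments are eventually replaced with real raise statements):

>>> expression = "id '(' [FArgs] ')' ['::' FunType] '{' VarDecl* Stmt+ '}'"
>>> mapping = {'(': ')', '[': ']', '{': '}', "'('": "')'", "'['": "']'", "'{'": "'}'"}
>>> get_top_level_matching_pairs(expression, mapping)
[(3, 6, 15, 18), (19, 20, 32, 33), (34, 37, 53, 56)]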
- You seem to imply that a normal for loop is preferred over a generator expression, which itself is preferred over a filter expression. Why this ordering? – Iterniam, Mar 3, 2021
- stack_head was used in error reporting; I've now included that in the original code. – Iterniam, Mar 3, 2021
- The code res.append((index, index + len(match), i, i + len(close))) implies that match differs from close, which it never does. This makes the second tuple element wrong if the length of the opening sequence is not equal to that of the closing sequence, which is why start_match was originally saved. Breaking test case: expression='op a c', mapping={'op': 'c'}. Expected is [(0, 2, 5, 6)], actual is [(0, 1, 5, 6)]. – Iterniam, Mar 3, 2021
- @Iterniam People read familiar code faster than unfamiliar code. A simple for loop is super basic and is implemented in lots of programming languages, so it is highly readable for most people. Additionally, filter is semi-deprecated, as comprehensions and generator expressions do everything it does and more. So, back to the familiarity point, Python programmers are less familiar with filter. Whilst a fair argument is that no one will read the code, I'd suggest looking at the long-term effect of potentially picking up a 'bad habit' and the potential grief. – Mar 5, 2021