The problem I'm solving is a more complex version of pairing up opening and closing brackets. Instead of matching only single characters like ([{}]), I also need to match opening sequences of arbitrary length to closing sequences of arbitrary length, such as '(' mapped to ')'.
Only the top-level matches should be returned, but the matches still have to account for the inner ones. This means that with a simple parenthesis mapping, ((f)) should match only on the first opening bracket and the final closing bracket.
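To make that concrete, with the function defined below the simple parenthesis case would behave like this:

>>> get_top_level_matching_pairs("((f))", {'(': ')'})
[(0, 1, 4, 5)]

Only the outer pair is reported, but the inner pair still has to be tracked to find the correct closing bracket.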
Motivation:
When parsing a grammar in a custom parser, I need to distinguish bracket literals, which are surrounded by apostrophes, from brackets at the grammar level, which are plain brackets. The goal is to return a list of all top-level matches that are found. A match is represented as a 4-tuple of the ranges that the opening and closing sequences span.
As a motivating example, evaluating the expression id '(' [FArgs] ')' ['::' FunType] '{' VarDecl* Stmt+ '}' with the mapping {'(': ')', '[': ']', '{': '}', "'('": "')'", "'['": "']'", "'{'": "'}'"} should yield the following list:
[(3, 6, 15, 18), (19, 20, 32, 33), (34, 37, 53, 56)]
For instance, the first tuple pairs the literal '(' spanning indices 3-6 with the literal ')' spanning indices 15-18.
My starting point was this code review on checking for balanced brackets in Python, which I adapted to suit the additional requirements:
- Match arbitrary opening and closing sequences of length >= 1
- Collect tuples of the ranges at which the sequences are found.
Armed with Python 3.9, this is my code:
from typing import KeysView, Optional, Union, ValuesView


def get_top_level_matching_pairs(expression: str, mapping: dict[str, str]) \
        -> list[tuple[int, int, int, int]]:
    """
    Returns all top-level matches of opening sequences to closing sequences.
    Each match is represented as a 4-tuple of the ranges that the opening and closing sequences span.
    >>> get_top_level_matching_pairs("(a) op (b) cl", {'(': ')', 'op': 'cl'})
    [(0, 1, 2, 3), (4, 6, 11, 13)]
    """
    def head_starts_with_one_from(match_targets: Union[KeysView[str], ValuesView[str]]) -> Optional[str]:
        # Check whether the expression, from index i, starts with one of the provided keys or values.
        # Return the first match found, None otherwise.
        return next(filter(lambda m: expression.startswith(m, i), match_targets), None)

    res = []
    queue = []  # Functions as a stack to keep track of the opened pairs.
    start_index = None
    start_match = None
    i = 0
    while i < len(expression):
        if open := head_starts_with_one_from(mapping.keys()):
            if start_index is None:
                start_index = i
                start_match = open
            queue.append(mapping[open])  # Store the closing counterpart for easy comparisons
            i += len(open)
            continue
        if close := head_starts_with_one_from(mapping.values()):
            try:
                if (stack_head := queue.pop()) == close:
                    if not queue:  # This closing token closes a top-level opening sequence, so add the result
                        res.append((start_index, start_index + len(start_match), i, i + len(close)))
                        start_index = None
                        start_match = None
                    i += len(close)
                    continue
                # raise mismatched opening and closing characters.
            except IndexError:
                # raise closing sequence without an opening. (uses stack_head variable)
                pass
        i += 1
    return res
My questions:
- Should I put the preconditions in the docstring for the function, or should I verify them in the code? The preconditions are: strings in the mapping are not empty, and no mapping string can be a subset of another.
- Is head_starts_with_one_from justified as a nested function? It allows for some neat walrus expressions on the if open and if close lines.
- Is there a better way to iterate over the expression, given that you need to match a varying range (here: 1 or 3 characters) of the expression at once and sometimes skip over part of it?
Of course, additional comments are more than welcome.
2 Answers
In response to question 1: in a REPL, help(obj) displays the docstring for the object, so the docstring should contain the information a developer needs to use the object (class, function, module, method, ...).
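For example, a minimal sketch that states the preconditions in the docstring and also verifies them up front could look like this. The helper name check_mapping is my own, and I am reading the question's "no mapping string can be a subset of another" as "no sequence is a prefix of another", since the example mapping deliberately contains ( inside '(':

def check_mapping(mapping: dict[str, str]) -> None:
    """Validate a bracket mapping.

    Preconditions:
    - no opening or closing sequence is empty
    - no sequence is a prefix of another (hypothetical reading of the "subset" rule)
    """
    sequences = list(mapping.keys()) + list(mapping.values())
    if any(not s for s in sequences):
        raise ValueError("Mapping contains an empty sequence")
    for a in sequences:
        for b in sequences:
            if a != b and b.startswith(a):
                raise ValueError(f"{a!r} is a prefix of {b!r}")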
In response to your 3rd question: your method of scanning is fairly simple to understand and implement. However, it is not very efficient, because it potentially checks each position in the expression against each opening and closing sequence. It may be "good enough" for your use case, but if there are many sequences, the expressions are long, or there are many expressions to check, another approach may be more efficient.
The re module from the standard library can be used to simplify the code. There is some overhead to compile the regular expression, but a regular expression essentially checks for all the sequences in parallel, and re is implemented in C so it runs at "C speed". You should test to see which approach is faster for your use case.
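For instance, a rough comparison with timeit might look like this. Here "my_parser" is only a placeholder for wherever the two implementations live, and build_top_level_scanner refers to the factory function defined further down in this answer:

import timeit

# "my_parser" is a placeholder module name; adjust the import to your project layout.
setup = """
from my_parser import get_top_level_matching_pairs, build_top_level_scanner
expression = "id '(' [FArgs] ')' ['::' FunType] '{' VarDecl* Stmt+ '}'"
mapping = {'(': ')', '[': ']', '{': '}', "'('": "')'", "'['": "']'", "'{'": "'}'"}
scan = build_top_level_scanner(mapping)
"""
print(timeit.timeit("get_top_level_matching_pairs(expression, mapping)", setup=setup, number=10_000))
print(timeit.timeit("scan(expression)", setup=setup, number=10_000))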
First, build a regular expression to match any of the opening or closing sequences.
import re

left = '|'.join(re.escape(k) for k in mapping.keys())
right = '|'.join(re.escape(k) for k in mapping.values())
pattern = re.compile(f"(?P<left>{left})|(?P<right>{right})")
The first line builds a regular expression that matches any of the opening (or left) sequences. Similarly, the second line builds one for any of the closing (or right) sequences. re.escape() is used to escape any characters in a sequence that would have a special meaning to the regex compiler. The f-string combines these pieces into a larger regular expression with named groups. Based on the motivating example, pattern would be a compiled regular expression for:
"(?P<left>\\(|\\[|\\{|'\\('|'\\['|'\\{')|(?P<right>\\)|\\]|\\}|'\\)'|'\\]'|'\\}')"
This matches any of the opening (left side) sequences in the group named 'left' or any of the closing (right side) sequences in the group named 'right'.
The resulting compiled regular expression can be used to scan expression for opening or closing sequences.
stack = []
result = []
for match in pattern.finditer(expression):
    if match.lastgroup == 'left':
        stack.append((match['left'], match.span()))
    else:
        if not stack:
            raise ValueError("Unmatched closing sequence")
        left, left_span = stack.pop()
        if match['right'] != mapping[left]:
            raise ValueError("Mismatched sequences")
        if not stack:
            result.append(left_span + match.span())
if stack:
    raise ValueError("Unmatched opening sequence.")
match is an re.Match object. match.lastgroup is the name of the last (or only) capture group that matched, which can be used to determine whether it matched an opening or closing sequence. Conveniently, match.span() returns a tuple of the indices of the matched sequence, so you don't need to keep track of them yourself.
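As a quick illustration, assuming expression is bound to the motivating example from the question, the raw matches come out like this:

for m in pattern.finditer(expression):
    print(m.lastgroup, m.span())

left (3, 6)
left (7, 8)
right (13, 14)
right (15, 18)
left (19, 20)
right (32, 33)
left (34, 37)
right (53, 56)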
Presumably, the regular expression would be compiled once and then used repeatedly to scan many expressions. This could be done by defining a factory function that takes a mapping and returns a scanner:
def build_top_level_scanner(mapping):
    left = '|'.join(re.escape(k) for k in mapping.keys())
    right = '|'.join(re.escape(k) for k in mapping.values())
    pattern = re.compile(f"(?P<left>{left})|(?P<right>{right})")

    def scanner(expression):
        stack = []
        result = []
        for match in pattern.finditer(expression):
            if match.lastgroup == 'left':
                stack.append((match['left'], match.span()))
            else:
                if not stack:
                    raise ValueError("Unmatched closing sequence")
                left, left_span = stack.pop()
                if match['right'] != mapping[left]:
                    raise ValueError("Mismatched sequences")
                if not stack:
                    result.append(left_span + match.span())
        if stack:
            raise ValueError("Unmatched opening sequence.")
        return result

    return scanner
Used like so:
expression = """id '(' [FArgs] ')' ['::' FunType] '{' VarDecl* Stmt+ '}'"""
mapping = {'(': ')', '[': ']', '{': '}', "'('": "')'", "'['": "']'", "'{'": "'}'"}
scan = build_top_level_scanner(mapping)
scan(expression)
which would return:
[(3, 6, 15, 18), (19, 20, 32, 33), (34, 37, 53, 56)]
It could also be done by defining a class:
class TopLevelScanner:
    def __init__(self, mapping):
        self.mapping = mapping
        left = '|'.join(re.escape(k) for k in mapping.keys())
        right = '|'.join(re.escape(k) for k in mapping.values())
        self.pattern = re.compile(f"(?P<left>{left})|(?P<right>{right})")

    def scan(self, expression):
        stack = []
        result = []
        for match in self.pattern.finditer(expression):
            if match.lastgroup == 'left':
                stack.append((match['left'], match.span()))
            else:
                if not stack:
                    raise ValueError("Unmatched closing sequence")
                left, left_span = stack.pop()
                if match['right'] != self.mapping[left]:
                    raise ValueError("Mismatched sequences")
                if not stack:
                    result.append(left_span + match.span())
        if stack:
            raise ValueError("Unmatched opening sequence.")
        return result
Used like so:
expression = """id '(' [FArgs] ')' ['::' FunType] '{' VarDecl* Stmt+ '}'"""
mapping = {'(': ')', '[': ']', '{': '}', "'('": "')'", "'['": "']'", "'{'": "'}'"}
scanner = TopLevelScanner(mapping)
scanner.scan(expression)
As an alternative to the re module, the third-party PyAhoCorasick library could be used in a similar manner.
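A rough sketch of finding the sequences with that library is shown below; it only covers the scanning part, and overlapping hits (such as ( inside '(') would still need to be resolved, for example by preferring the longest match at each position:

import ahocorasick  # pip install pyahocorasick

def find_sequences(expression, mapping):
    automaton = ahocorasick.Automaton()
    for seq in set(mapping) | set(mapping.values()):
        automaton.add_word(seq, seq)
    automaton.make_automaton()
    # Automaton.iter() yields (end_index, value) pairs; convert them to (start, end) spans.
    for end, seq in automaton.iter(expression):
        yield seq, (end - len(seq) + 1, end + 1)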
- I am in awe of this code. I disregarded regexes because they can't solve the problem on their own, but combining them with the delimiters algorithm makes so much sense. The code is very elegant. I ran timeit on your and @Peilonrayz's non-regex solution and your code is 10-15 times faster than the non-regex solution, regardless of the number of runs. Just for fun I also ran it while constantly recompiling the regex pattern, which only slows it down by a factor of 2. – Iterniam, Mar 5, 2021
- Two small nitpicks: stack.pop() causes an IndexError if there are more closing than opening sequences, which would ideally be re-thrown as a ValueError with a custom error message. A similar error could be raised if stack is not empty after going through the for loop. – Iterniam, Mar 5, 2021
- @Iterniam, I added checks for the two bugs you found. – RootTwo, Mar 5, 2021
- In my opinion head_starts_with_one_from would be cleaner using a generator expression:

def head_starts_with_one_from(match_targets: Iterable[str]) -> Optional[str]:
    return next(
        (m for m in match_targets if expression.startswith(m, i)),
        None,
    )
Or better yet, a plain old for loop:

def head_starts_with_one_from(match_targets: Iterable[str]) -> Optional[str]:
    for m in match_targets:
        if expression.startswith(m, i):
            return m
    return None
- Because you aren't storing start_index in queue, your code can't work with nested brackets. Storing the index in the queue gets rid of the need for start_index, and you don't really need start_match either; see the next bullet point.

queue.append((i, mapping[open]))
- You never use stack_head, so you don't need to use the walrus operator there.
- You can use elif and else to get rid of the continue statements.
- You can use guard statements to prevent the arrow anti-pattern:

if queue.pop() != close:
    ...  # raise mismatched opening and closing characters.
from typing import Iterable, Optional


def get_top_level_matching_pairs(
    expression: str,
    mapping: dict[str, str],
) -> list[tuple[int, int, int, int]]:
    def head_starts_with_one_from(match_targets: Iterable[str]) -> Optional[str]:
        for m in match_targets:
            if expression.startswith(m, i):
                return m
        return None

    res = []
    queue = []
    i = 0
    while i < len(expression):
        if open := head_starts_with_one_from(mapping.keys()):
            queue.append((i, open))
            i += len(open)
        elif close := head_starts_with_one_from(mapping.values()):
            if not queue:
                ...  # raise closing sequence without an opening.
            index, first_match = queue.pop()
            match = mapping[first_match]
            if match != close:
                ...  # raise mismatched opening and closing characters.
            if not queue:  # This closing token closes a top-level opening sequence, so add the result
                res.append((index, index + len(first_match), i, i + len(close)))
            i += len(close)
        else:
            i += 1
    return res
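As a quick sanity check, this version still produces the expected result for the motivating example from the question (assuming the placeholder comments are eventually replaced with real raise statements):

>>> expression = "id '(' [FArgs] ')' ['::' FunType] '{' VarDecl* Stmt+ '}'"
>>> mapping = {'(': ')', '[': ']', '{': '}', "'('": "')'", "'['": "']'", "'{'": "'}'"}
>>> get_top_level_matching_pairs(expression, mapping)
[(3, 6, 15, 18), (19, 20, 32, 33), (34, 37, 53, 56)]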
- You seem to imply that a normal for loop is preferred over a generator expression, which itself is preferred over a filter expression. Why this ordering? – Iterniam, Mar 3, 2021
- stack_head was used in error reporting; I've now included that in the original code. – Iterniam, Mar 3, 2021
- The code res.append((index, index + len(match), i, i + len(close))) implies that match differs from close, which it never does. This makes the second tuple element wrong if the length of the opening sequence is not equal to that of the closing sequence, which is why start_match was originally saved. Breaking test case: expression='op a c', mapping={'op': 'c'}. Expected is [(0, 2, 5, 6)], actual is [(0, 1, 5, 6)]. – Iterniam, Mar 3, 2021
- @Iterniam People read familiar code faster than unfamiliar code. A simple for loop is super basic and is implemented in lots of programming languages, so it is highly readable for most people. Additionally, filter is semi-deprecated, as comprehensions and generator expressions do everything it does and more. So, back to the familiarity point, Python programmers are less familiar with filter. Whilst a fair argument is that no one will read the code, I'd suggest looking at the long-term effect of potentially picking up a 'bad habit' and the potential grief. – Mar 5, 2021