Python function to expand regex with ranges

Question 1

I attempted to implement a python version of this function in my previous question. Given a regular expression with range(s) of alphabets/numbers, the function will expand the brackets and return a list of strings that explicitly spell out each match. If the input regex has multiple expressions separated by |, the output will be a dictionary where the keys are input expressions.

Within regex_expander, I created a sub-function single_expander to handle each expression separated by |. Would it be a better practice to separate this function out? Besides this, I am also looking for any advice on style, efficiency, any room for improvements.

Code

import re
import itertools
import warnings
def regex_expander(rex, verbose=True):
 """
 Given a regex with ranges (e.g. "as[1-9]df"), returns a list of strings that expands the bracket and explicitly spells out each match.
 If input regex contains multiple ranges separated by "|", returns a dictionary of converted strings where the keys are input expressions.
 args:
 - rex: regular expression with a range to expand
 - verbose: if True, will print verbose output
 """
 alphabets = "abcdefghijklmnopqrstuvwxyz"
 ALPHA_NUMS = alphabets.upper() + alphabets + "0123456789"
 def single_expander(rex, verbose):
 # extract ranges
 range_patterns = re.findall(r"\[.*?\]", rex)
 # replace ranges
 if len(range_patterns) == 1:
 range_pattern = range_patterns[0]
 expanded_range = re.findall(range_pattern, ALPHA_NUMS)
 replaced = [rex.replace(range_pattern, x) for x in expanded_range]
 elif len(range_patterns) > 1:
 expanded_range = [re.findall(rng, ALPHA_NUMS) for rng in range_patterns]
 expanded_range_prod = list(itertools.product(*expanded_range))
 
 replaced = []
 for tup in expanded_range_prod:
 range_dict = {k: v for k, v in zip(range_patterns, tup)}
 
 rex_copy = rex
 for k, v in range_dict.items():
 rex_copy = rex_copy.replace(k, v)
 replaced.append(rex_copy)
 
 # for verbose output
 expanded_range = ["".join(tup) for tup in expanded_range_prod]
 else: 
 replaced = rex
 warnings.warn(f"The input expression {rex} does not contain any ranges.")
 return replaced
 if verbose: 
 print("original string:", rex)
 print("expanded range\treplaced string")
 for e, r in zip(expanded_range, replaced):
 print(e.rjust(len("expanded range")), r.rjust(len("replaced string")), sep="\t")
 
 return replaced
 
 ###
 rex_split = rex.split("|")
 
 if len(rex_split) == 1:
 return single_expander(rex, verbose)
 
 else:
 return {r: single_expander(r, verbose) for r in rex_split}

Examples

r = "02[W04]F[0-4][JK]Z"
regex_expander(r, verbose=False)
# output
['02WF0JZ', '02WF0KZ', '02WF1JZ', '02WF1KZ', '02WF2JZ', '02WF2KZ', '02WF3JZ', '02WF3KZ', '02WF4JZ', '02WF4KZ', '020F0JZ', '020F0KZ', '020F1JZ', '020F1KZ', '020F2JZ', '020F2KZ', '020F3JZ', '020F3KZ', '020F4JZ', '020F4KZ', '024F0JZ', '024F0KZ', '024F1JZ', '024F1KZ', '024F2JZ', '024F2KZ', '024F3JZ', '024F3KZ', '024F4JZ', '024F4KZ']

r = "W3812|405[0-3L-O]|02[W04]F[0-4][JK]Z"
regex_expander(r, verbose=False)
# output
UserWarning: The input expression W3812 does not contain any ranges.
 warnings.warn(f"The input expression {rex} does not contain any ranges.")
{'W3812': 'W3812', '405[0-3L-O]': ['405L', '405M', '405N', '405O', '4050', '4051', '4052', '4053'], '02[W04]F[0-4][JK]Z': ['02WF0JZ', '02WF0KZ', '02WF1JZ', '02WF1KZ', '02WF2JZ', '02WF2KZ', '02WF3JZ', '02WF3KZ', '02WF4JZ', '02WF4KZ', '020F0JZ', '020F0KZ', '020F1JZ', '020F1KZ', '020F2JZ', '020F2KZ', '020F3JZ', '020F3KZ', '020F4JZ', '020F4KZ', '024F0JZ', '024F0KZ', '024F1JZ', '024F1KZ', '024F2JZ', '024F2KZ', '024F3JZ', '024F3KZ', '024F4JZ', '024F4KZ']}

Question 2

nested def

created a sub-function single_expander .... Would it be a better practice to separate this function out?

Yes. Nested functions tend to cause more problems than they solve, especially for junior devs. They have their uses, but this isn't one of them. The overlapping namespaces run the global variables risk of excessive coupling.

When in doubt, define a _private helper up at module level:

def _single_expander(rex, verbose):
 ...

That way your automated test suite can conveniently access and exercise the helper.

In the OP code, the range_expander function defines rex and verbose (they are input parameters). And then single_expander shadows both of them. Shadowing a name usually isn't desirable, and can lead to hard-to-debug maintenance issues, especially when multiple people are collaborating on a shared codebase. Mutating e.g. rex in the parent, and also in the child, could cause confusion.

type annotation

English is ambiguous.

def regex_expander(rex, verbose=True):
 ...
 args:
 - rex: regular expression with a range to expand

On my initial reading, I took that to indicate a re.Pattern, such as you get back from re.compile(). It would be better to mention str in the signature:

def regex_expander(rex: str, verbose: bool = True) -> list[str]:

type stability

... returns a list of strings ...
... returns a dictionary ...

Clearly it is possible to return diverse types. But that tends to make the Public API you're designing a little harder for callers to consume.

Consider defining a function that just returns a list of strings, and another that just returns a dictionary. Caller might want a convenience function that accepts a regex string and reports what kind of ranges it includes. Based on that, caller can choose appropriate function to call.

It's not clear the comment is actually telling the truth. In the "does not contain any ranges" warning case, we return a single str instead of wrapping a single list element with return [replaced]

Automated linting with mypy can help you notice such details.

common strings

alphabets = "abcdefghijklmnopqrstuvwxyz"
ALPHA_NUMS = alphabets.upper() + alphabets + "0123456789"

The string module already offers those. Prefer:

from string import ascii_letters, digits
ALPHA_NUMS = ascii_letters + digits

test suite

The examples you provided as Review Context are very nice. It would be useful to package them up as a TestCase. Then you, or any future maintainer, could easily verify that "things were working" before starting some refactor edits, and later verify that they're still working after completing those edits.

PEP-8

any advice on style

Thank you for habitually using raw strings for regexes; that is helpful.

Please use isort to organize your import statements in the usual way.

Occasionally doing "$ black *.py && ruff check" wouldn't hurt. Oh, and avoid nested defs, or at least write a # comment explaining the design rationale if you do use them.

compile once

any advice on ... efficiency

Imagine some caller that repeatedly invokes regex_expander() in a loop.

 range_patterns = re.findall(r"\[.*?\]", rex)

It's no big deal, but we are repeatedly asking the regex module to compile that string into a Pattern datastructure. Consider caching it across calls, so we incur the compile cost just once:

import re
range_re = re.compile(r"\[.*?\]")
def regex_expander( ...
 ...
 range_patterns = range_re.findall(rex)

two problems

Your code is essentially tackling these as distinct problems:

[abc]
(red|green|blue)

But (1.) could be turned into the same form as (2.), and then you'd be faced with just a single problem to solve. Consider writing a pre-processing layer which turns [a-c] into (a|b|c).

J_H J_H 41.7k3 gold badges38 silver badges157 bronze badges · Accepted Answer · 2024-09-08 23:21:13Z

nested def

created a sub-function single_expander .... Would it be a better practice to separate this function out?

Yes. Nested functions tend to cause more problems than they solve, especially for junior devs. They have their uses, but this isn't one of them. The overlapping namespaces run the global variables risk of excessive coupling.

When in doubt, define a _private helper up at module level:

def _single_expander(rex, verbose):
 ...

That way your automated test suite can conveniently access and exercise the helper.

In the OP code, the range_expander function defines rex and verbose (they are input parameters). And then single_expander shadows both of them. Shadowing a name usually isn't desirable, and can lead to hard-to-debug maintenance issues, especially when multiple people are collaborating on a shared codebase. Mutating e.g. rex in the parent, and also in the child, could cause confusion.

type annotation

English is ambiguous.

def regex_expander(rex, verbose=True):
 ...
 args:
 - rex: regular expression with a range to expand

On my initial reading, I took that to indicate a re.Pattern, such as you get back from re.compile(). It would be better to mention str in the signature:

def regex_expander(rex: str, verbose: bool = True) -> list[str]:

type stability

... returns a list of strings ...
... returns a dictionary ...

Clearly it is possible to return diverse types. But that tends to make the Public API you're designing a little harder for callers to consume.

Consider defining a function that just returns a list of strings, and another that just returns a dictionary. Caller might want a convenience function that accepts a regex string and reports what kind of ranges it includes. Based on that, caller can choose appropriate function to call.

It's not clear the comment is actually telling the truth. In the "does not contain any ranges" warning case, we return a single str instead of wrapping a single list element with return [replaced]

Automated linting with mypy can help you notice such details.

common strings

alphabets = "abcdefghijklmnopqrstuvwxyz"
ALPHA_NUMS = alphabets.upper() + alphabets + "0123456789"

The string module already offers those. Prefer:

from string import ascii_letters, digits
ALPHA_NUMS = ascii_letters + digits

test suite

The examples you provided as Review Context are very nice. It would be useful to package them up as a TestCase. Then you, or any future maintainer, could easily verify that "things were working" before starting some refactor edits, and later verify that they're still working after completing those edits.

PEP-8

any advice on style

Thank you for habitually using raw strings for regexes; that is helpful.

Please use isort to organize your import statements in the usual way.

Occasionally doing "$ black *.py && ruff check" wouldn't hurt. Oh, and avoid nested defs, or at least write a # comment explaining the design rationale if you do use them.

compile once

any advice on ... efficiency

Imagine some caller that repeatedly invokes regex_expander() in a loop.

 range_patterns = re.findall(r"\[.*?\]", rex)

It's no big deal, but we are repeatedly asking the regex module to compile that string into a Pattern datastructure. Consider caching it across calls, so we incur the compile cost just once:

import re
range_re = re.compile(r"\[.*?\]")
def regex_expander( ...
 ...
 range_patterns = range_re.findall(rex)

two problems

Your code is essentially tackling these as distinct problems:

[abc]
(red|green|blue)

But (1.) could be turned into the same form as (2.), and then you'd be faced with just a single problem to solve. Consider writing a pre-processing layer which turns [a-c] into (a|b|c).

Stack Exchange Network

Python function to expand regex with ranges

1 Answer 1

nested def

type annotation

type stability

common strings

test suite

PEP-8

compile once

two problems

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Python function to expand regex with ranges

1 Answer 1

nested def

type annotation

type stability

common strings

test suite

PEP-8

compile once

two problems

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions