I have a working function with too many if/else
statements (pylint). The code works but it's not good. I try to improve it but I have some tunnelvision. Can someone help me to improve the code?
So depending on a list of stacks, an environment (prod, nonprod or all) and a list of account IDs there can be filtered on which stacks need to be updated.
This is an example of a stack dict:
{'AccountId': '123456789101', 'StackName': 'test-nonprod'}
function:
def filter_stacks(stacks, environment, account_ids):
"""todo"""
non_prod_regex = re.compile("(nonprod|non-prod)")
stacks_to_update = []
if non_prod_regex.match(environment) and account_ids:
for stack in stacks:
if (
non_prod_regex.search(stack["StackName"])
and stack["AccountId"] in account_ids
):
stacks_to_update.append(stack)
elif non_prod_regex.match(environment) and not account_ids:
for stack in stacks:
if non_prod_regex.search(stack["StackName"]):
stacks_to_update.append(stack)
elif re.compile("all").match(environment) and account_ids:
for stack in stacks:
if stack["AccountId"] in account_ids:
stacks_to_update.append(stack)
elif re.compile("all").match(environment) and not account_ids:
stacks_to_update = stacks
elif not non_prod_regex.match(environment) and account_ids:
for stack in stacks:
if (
not non_prod_regex.search(stack["StackName"])
and stack["AccountId"] in account_ids
):
stacks_to_update.append(stack)
elif not non_prod_regex.match(environment) and not account_ids:
for stack in stacks:
if not non_prod_regex.search(stack["StackName"]):
stacks_to_update.append(stack)
else:
print("No match found.")
return stacks_to_update
-
3\$\begingroup\$ Also, regexes are the same every time the function is called. Make them static (i.e. global or attached to class object or something), not compile them every time. \$\endgroup\$Pavlo Slavynskyy– Pavlo Slavynskyy2022年01月14日 12:07:58 +00:00Commented Jan 14, 2022 at 12:07
-
6\$\begingroup\$ Without some more context, a review will be difficult. I think that the whole "non-production" logic is a non-starter and is indicative of suboptimal design. \$\endgroup\$Richard Neumann– Richard Neumann2022年01月14日 12:20:32 +00:00Commented Jan 14, 2022 at 12:20
-
4\$\begingroup\$ Welcome to Code Review! The current question title, which states your concerns about the code, is too general to be useful here. Please edit to the site standard, which is for the title to simply state the task accomplished by the code. Please see How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles. \$\endgroup\$Toby Speight– Toby Speight2022年01月14日 13:37:07 +00:00Commented Jan 14, 2022 at 13:37
-
3\$\begingroup\$ @PavloSlavynskyy Whereas OP should follow your advice, strictly speaking compile them every time is not what actually happens since there is a regex compilation cache. \$\endgroup\$Reinderien– Reinderien2022年01月14日 15:11:17 +00:00Commented Jan 14, 2022 at 15:11
-
1\$\begingroup\$ @Reinderien thnx gone reading \$\endgroup\$Pavlo Slavynskyy– Pavlo Slavynskyy2022年01月14日 15:18:08 +00:00Commented Jan 14, 2022 at 15:18
4 Answers 4
Simplifications
re.compile("(nonprod|non-prod)")
is just a complicated way to write a badly performing regex forre.compile("non-?prod")
, which does the same, but better.extract
re.compile("all")
into a variable like you did withnon_prod_regex
, or better yet, do a string comparison:if environment == "all"
... Achieves the same effect, but is quicker (because no regex involved).
A different design
An alterative to having that for-loop inside of the if-statements would be to have the if-statements inside of your loop. That can be made much cleaner by converting it to a list comprehension:
is_nonprod = bool(non_prod_regex.match(environment))
has_accounts = bool(account_ids)
stacks_to_update = [stack for stack in stacks if (
(environment == "all" or (is_nonprod == bool(non_prod_regex.match(stack["StackName"]))))
and (not has_accounts or stack["AccountId"] in account_ids)
)]
This design makes an interesting simplification: You have two dimensions of things you want to check: The environment of the stack and the account of the stack. Both of these dimensions impact the result set, but are independent of one another.
This allows us to simplify things a bit by noticing the repeating pattern:
- if there are
account_ids
, the account needs to be checked, otherwise we don't care - The environment must either be "all" or the 'nonprod-status' of the stack must match the environment.
These two conditions must be fulfilled to include a stack in the result set.
The first one is easier, it's just an implication \$A \Rightarrow s.aid \in A\$, which is equivalent to \$\neg A \lor s.aid \in A\$ (not account_ids or ...
)
The second condition is equally simple to implement and combining the two conditions with and
gets all the results filtered the way we need them to.
-
\$\begingroup\$ re: the regex... i doubt it makes a difference honestly, since unless the engine in question is bad the regex probably gets compiled to an nfa/dfa. so the decision is to choose whichever one is more maintainable... and that really depends on the user. \$\endgroup\$somebody– somebody2022年01月15日 00:43:10 +00:00Commented Jan 15, 2022 at 0:43
-
\$\begingroup\$ well actually, even better:
if environment in ["nonprod", "non-prod"]: <a> elif environment in ["all"]: <b> else: c
\$\endgroup\$somebody– somebody2022年01月15日 00:45:42 +00:00Commented Jan 15, 2022 at 0:45 -
\$\begingroup\$ @somebody except for the fact that string interning is a thing, the compilation of a regex takes time and even if it is a DFA, following a pointered graph instead of comparing elements of an array is slower. Note that the if you are suggesting won't work well in a list comprehension and is IMO easier expressed using the
XOR
semantic \$\endgroup\$Vogel612– Vogel6122022年01月15日 07:15:13 +00:00Commented Jan 15, 2022 at 7:15 -
\$\begingroup\$ "following a pointered graph"??? what exactly is that referring to? also imo the list comprehension doesn't look great... plus i wasn't replying to that section at all. \$\endgroup\$somebody– somebody2022年01月15日 10:51:21 +00:00Commented Jan 15, 2022 at 10:51
A data-oriented function should return data or raise an exception, not print. If there are no matching stacks, just return an empty list or raise.
Establish a habit of defining regular expressions with raw strings. If you always define regex patterns with raw strings,
you never have to waste time thinking about whether your pattern
contains things like \n
, \t
, and so forth. This is one example
of a preemptive strategy to avoid silly bugs. Efficient developers have
many habits like this.
Instead of checking everything at once, apply multiple filtering passes.
Your current strategy sets up all of the conditional logic branches based on
the environment and the presence/absence of account_ids
and then writes the
complete loop for each branch. That's a lot of code and it imposes a high
mental cost on the reader who must try to discern small differences between
each branch. But that's not necessary. If you break the process down into
phases, everything simplifies: first filter by environment, and then filter by
account IDs. Notice also that this reorganization of the code
has the benefit of laying the groundwork should you decide
in the future to put environment-filtering and account-id-filtering
in separate functions (for example, if the filtering logic in
one or both becomes more complex).
Organize code in commented paragraphs. As illustrated below, a few simple comments -- especially if they focus on strategy or purpose -- can help future readers navigate the code. The most likely future reader is you.
import re
def filter_stacks(stacks, environment, account_ids):
# Determine whether the environment is nonprod.
nonprod_rgx = re.compile(r'non-?prod')
env_is_nonprod = bool(nonprod_rgx.match(environment))
# Filter based on environment.
if re.compile(r'all').match(environment):
to_update = stacks
else:
to_update = [
s
for s in stacks
if env_is_nonprod is bool(nonprod_rgx.search(s['StackName']))
]
# Optionally filter based on account_ids.
if account_ids:
return [s for s in to_update if s['AccountId'] in account_ids]
else:
return to_update
According to your code the first optimization that came to my mind is to group your if statements by the variable accounts_ids like the next example
if accounts_ids:
if non_prod_regex.match(environment):
do....
elif re.compile("all").match(environment):
do....
else:
do...
else:
if non_prod_regex.match(environment):
do....
elif re.compile("all").match(environment):
stacks_to_update = stacks
else:
do....
That will make the code a bit more readable, however when there is multiple if statements my recommendation is that you use a Visitor pattern to solve the problem. Hope it helps to clarify
As other answers already pointed out, you should not compile call re.compile
several times, and there is no need to try to filter according to all filters at the same time.
I would go even further and suggest to use iterators, which are great tools in Python. It would allow you to write a single function per filter, and apply a bunch of function to each stack in a single iteration.
I would also suggest to type your code, which makes any review process easier.
Here is a suggestion that will work only in python 3.8+ (due to functools.cached_property
usage):
import re
from functools import cached_property
from typing import (
Any,
Iterable,
Iterator,
List,
Pattern,
TypeVar,
Mapping,
)
class Patterns:
"""Wrapper regex patterns into cached properties"""
@cached_property
def NON_PROD(self) -> Pattern[str]:
return re.compile(r"non-?prod")
@cached_property
def ALL(self) -> Pattern[str]:
return re.compile(r"all")
# Create a global instance that can be used accross the module
PATTERNS = Patterns()
def account_ids_filter(stack: Mapping[str, Any], account_ids: List[str]) -> bool:
"""Return True is stack account ID is present in list of accoutn IDs"""
return stack["AccountId"] in account_ids
def environment_filter(stack: Mapping[str, Any], environment: str) -> bool:
"""Return True if stack matches environment else False"""
env_is_non_prod = True if PATTERNS.NON_PROD.match(environment) else False
return env_is_non_prod is bool(PATTERNS.NON_PROD.search(stack["StackName"]))
# Whatever type is going into filter_stacks will also come out from filter_stacks
StackT = TypeVar("StackT", bound=Mapping[str, Any])
def filter_stacks(
stacks: Iterable[StackT],
environment: str,
account_ids: List[str],
) -> Iterator[StackT]:
"""Returns an iterator of filtered stacks"""
# Return early in case "all" pattern matches
if PATTERNS.ALL.match(environment):
return iter(stacks)
# Iterate through stacks
for stack in stacks:
# Filter using environment
if not environment_filter(stack, environment):
continue
# Filter using account IDs
if not account_ids_filter(stack, account_ids):
continue
yield stack
def get_filtered_stacks(
stacks: Iterable[StackT], environment: str, account_ids: List[str]
) -> List[StackT]:
"""Returns a list of filtered stacks"""
return list(filter_stacks(stacks, environment=environment, account_ids=account_ids))
Because we're working with iterators, the input stacks
argument is iterated on once only.
Moreover, regexp patterns are stored as cached properties on a global variable so that they are generated lazily (on first call, and not at module import)
If you filters grow more complex, you can write more filter functions, unit test them easily, and call them in the filter_stacks
function
-
1\$\begingroup\$ Your account_ids_filter function could be simplified.
return True if x else False
is equivalent toreturn x
(assuming x is a boolean, which it is here). \$\endgroup\$Seb– Seb2022年01月15日 23:05:49 +00:00Commented Jan 15, 2022 at 23:05 -
\$\begingroup\$
Pattern.match
returns eitherNone
if there is no match, or an instance ofre.Match
. So it's definitely not a boolean. I could have donebool(...)
instead ofTrue if ... else False
, but I guess it's a matter of style ? \$\endgroup\$gcharbon– gcharbon2022年01月16日 00:23:07 +00:00Commented Jan 16, 2022 at 0:23 -
1\$\begingroup\$ I’m talking about
stack["AccountId"] in account_ids
which is a Boolean \$\endgroup\$Seb– Seb2022年01月16日 00:27:04 +00:00Commented Jan 16, 2022 at 0:27 -
\$\begingroup\$ You're right, my bad ! I fixed it \$\endgroup\$gcharbon– gcharbon2022年01月16日 00:54:03 +00:00Commented Jan 16, 2022 at 0:54