4
\$\begingroup\$

I have a pattern that I have been using for a few lookup functions that I feel could be done in a better or more pythonic way. I am looking for guidance on how the following function could be improved. The list of patterns that I have for each type of lookup can be long - I have shortened it here but it is usually 50 - 100 patterns in each.

The goal of this function is to take an arg status and return back a string indicating which list it matched.

I use this as such:

... process csv data
normalize_license_status("Operating Approved") --> active
normalize_license_status("Pending") --> inactive

and if this matched the "active" list or expression I expect to get back "active".

def _normalize_license_status(self,status,pattern):
 logger.debug(f'Searching {status} for pattern {pattern}')
 if re.search(rf"\b(?=\w){pattern}\b(?!\w)", status, re.IGNORECASE):
 return True
def normalize_license_status(self,status):
 # Status Patterns
 active = "Operating|Established|Active|About to Expire"
 inactive = "Returned|Void|Pending-Inspection|Pending"
 missing = "Not Available|not available|Not Provided|not provided"
 
 # Strip off leading and training spaces
 status = status.strip()
 # Set default status
 
 try: 
 ## Inactive
 if self._normalize_license_status(status,inactive):
 result = 'Inactive'
 logger.debug(f'Matched {status} to {result}')
 
 # Active
 elif self._normalize_license_status(status,active):
 result = 'Active'
 logger.debug(f'Matched {status} to {result}')
 
 ## Missing or Not Provided or Not Available
 elif self._normalize_license_status(status,missing):
 result = 'Not Available'
 logger.debug(f'Matched {status} to {result}')
 
 ## No Match
 else:
 # Should only get here when a license status is not matched
 raise ValueError(f'Invalid status {status}. Could not find a match.')
 
 except ValueError as value_err:
 logger.error(value_err)
 except Exception as e:
 logger.error(e)
 else:
 logger.debug(f'Normalized status {status} as {result}')
 return result

My thought is that there is a better way to do this. I have several functions like this that verify a string against a list of patterns and I am looking to understand how to make this better.

Toby Speight
87.2k14 gold badges104 silver badges322 bronze badges
asked Feb 11, 2022 at 16:21
\$\endgroup\$
1
  • 2
    \$\begingroup\$ Welcome to Code Review! I changed the title so that it describes what the code does per site goals: "State what your code does in your title, not your main concerns about it.". Please check that I haven't misrepresented your code, and correct it if I have. \$\endgroup\$ Commented Feb 11, 2022 at 16:45

2 Answers 2

3
\$\begingroup\$

Discussion

If I'm understanding this correctly, you're trying to take map a set of known strings to one of three license states (active, inactive, or missing)?

If so, it feels like representing the data as enums seems the most natural.

The license states themselves feel like enums - in your implementation you're returning one of three strings.

The inputs feel like fixed values themselves (maybe I'm wrong, maybe they are highly dynamic), but if they are relatively fixed / there is a known set of the input values, they should be categorized as enums as well.

Proposed Implementation

from enum import Enum, auto
class LicenseState(Enum):
 ACTIVE = auto()
 INACTIVE = auto()
 MISSING = auto()
class LicenseDescription(Enum):
 OPERATING_APPROVED = auto()
 PENDING = auto()
license_states_by_license_description = {
 LicenseDescription.OPERATING_APPROVED: LicenseState.ACTIVE,
 LicenseDescription.PENDING: LicenseState.INACTIVE
}
def calculate_license_state(license_description):
 license_state = license_states_by_license_description.get(license_description, None)
 if license_state is None:
 raise ValueError(f"Unknown license state for license description: {license_description}")
 
 return license_state
print(calculate_license_state(LicenseDescription.OPERATING_APPROVED)) // LicenseStatus.ACTIVE
print(calculate_license_state(None)) // ValueError

repl.it

Proposed Implementation Discussion

I'll touch more on this in the review of your proposed implementation, but I think a shortcoming of the your proposed implementation is it depends on string parsing for analyzing which state a license description is in, which feels a little brittle to me.

The function inputs seem like they should be enumerable, and in my opinion, this categorization logic would be much easier to maintain if they were enumerated. As it stands, you're effectively enumerating some subset of the license descriptions by using the pipe-delimited approach.

Worse-case you could at least split the pipe-delimited strings into a set of strings and map them to the appropriate license status like

identification_patterns_by_license_status = {
 LicenseStatus.ACTIVE: ('Operating', 'Established'),
 // etc
}

And then iterate over the different LicenseStatuses in some order, iterating over each element of the set and seeing if there is a match against the input string.

Code Review

Naming

Input Argument Name

The input argument to normalize_license_status is named status but to me, the return string seems to be more "status"-like. "Active", "Inactive", "Not Available" seem to describe the state of a license than "Operating Approved". To me, the input seems more like a license description, or a license status description?

Helper Method

_normalize_license_status is supposed to return a boolean. Generally speaking, I like to name methods that return a boolean is_X or has_X. So in this case maybe something like is_license_description_contained_in_license_patterns. Also, I think it's generally a good practice that if you return True in one branch, you should explicitly return False in the other branch and not return None - or even simpler

return re.search(rf"\b(?=\w){pattern}\b(?!\w)", status, re.IGNORECASE)

General Thoughts

Each time the method is called, you are reallocating the same strings to the active, inactive, and missing variables - these seem like they should be class variables.

The string parsing approach seems very brittle. Imagine the case where "Foo" is both in the set of active patterns and inactive patterns. Then you can never change the order of evaluating the inactive patterns before evaluating the active patterns or else the behavior of the method might change.

The logic has a lot of repetition. Inside the try you are effectively testing each string pattern and then assigning a result if a test succeeds. Imagine something like this:

result = next(
 case_and_result[1]
 for case_and_result in
 [(active, "Active"), (inactive, "Inactive), (missing, "Not Available)]
 if _normalize_license_status(status, case_and_result[0]),
 None
)
if result is None:
 raise ValueError()

Logging & Exception Handling

I'm seeing a lot of logger method invocations. There's nothing wrong with using a logger for debugging, but to have logger methods in (theoretically) production code seems like a code smell - specifically, it smells like there aren't sufficient unit tests written to test the places that have logger method invocations.

I don't have much context on the overall class, but I don't think this method should swallow errors. The caller should be in charge of how it wants to handle any errors that are thrown by this method.

While your current use-case might work by logging errors and swallowing them, if some other person joined your team, it would not be obvious (or expected) that any errors raised by this method would be swallowed.

answered Feb 12, 2022 at 0:15
\$\endgroup\$
3
  • 1
    \$\begingroup\$ The implication that logging is inappropriate for production code is at odds with industry practice. \$\endgroup\$ Commented Feb 12, 2022 at 15:09
  • \$\begingroup\$ @FMc you're right that production code can have logging. I have mostly seen usages for error handling and telemetry. A lot of the log statements seem to be around validating logical behavior - i.e. if a certain output was produced - which seems more inline with a unit test of some kind rather than a log statement, imho. \$\endgroup\$ Commented Feb 12, 2022 at 15:51
  • \$\begingroup\$ @jaeBradley - ty for the very detailed response. This is my first time posting on this exchange and i was not certain what to expect. I am reviewing your recommendations above and am realizing that i have never used Enums in the past. I am testing various scenarios with your code sample. I am curious about the logging you mentioned. This process is part of a much larger internal ETL tool and I thought i understood logging properly and have added logging so that i can find specific issues when the pop-up. I do not have any unit testing setup at this time and am using logging to verify \$\endgroup\$ Commented Feb 14, 2022 at 18:06
1
\$\begingroup\$

I presume that there is a one-to-one mapping of input strings to normalized status. That is, the same input string can't map to more than one normalized status. If that is the case, just use a dict to map input strings to the normalized status strings?

So that you don't need to type 'Active'/'Inactive'/'Missing' 100 times, start with a dict mapping normalized status to the corresponding input strings. Update this dict whenever a user comes up with a new way to describe the license status in the CSV file. Make it easy to read and update.

_status_map = {
 'Active':"""
 about to expire
 active
 established
 operating
 """,
 'Inactive':"""
 pending
 pending-inspection
 returned
 void
 """,
 'Missing':"""
 not available
 not provided
 """,
}

But, we need a map from input strings to a normalized status string:

license_status_map = {value:key for key,value in _status_map.items()
 for value in values.strip().splitlines()}

The normalizing routing should put the input string into some standard form. Here it is just lowercased, but consider replacing punctuation with spaces and collapsing a adjacent whitespace into one space. Then simply look up the input string in the license_status_map.

def normalize_license_status(self, status):
 status = status.strip().tolower()
 result = license_status_map.get(status, None)
 if result is None:
 logger.error(f'Unmatched {status}')
 return None # default_status?
 logger.debug(f'Matched {status} to {result}')
 return status

Note: 'Active'/'Inactive'/'Missing' can be replaced by enums if that fits your use case.

answered Feb 15, 2022 at 20:26
\$\endgroup\$
1
  • \$\begingroup\$ I think it's a many-to-one mapping, but the same observations apply. \$\endgroup\$ Commented Feb 15, 2022 at 20:33

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.