Python code that checks the valid phrases in input string

Question 1

Determine if all the phrases are found within the input string.

If they're all found (using distance as a measure of leeway) return True. Else False.

Example:

input = 'can i go to the bathroom in the morning',

phrases = ['can go', 'bathroom morning']

if distance is 1 then this won't result in a match because

'bathroom', 'morning' has 2 words between it

if distance is 2 then 'bathroom in the morning' is counted as a valid phrase

Code improvement questions

How would you make the code more readable
How do they handle punctuation - do they distinguish between types of punctuation and why How do they handle stemming?
Do they handle overlapping instances and partial phrase matches (need to check them all)
Big O time/ best/worst case scenarios

def get_compound_keyword_match(input: str, phrases: list, distance: int) -> bool:
 
 if not distance:
 # We have no leeway for a match.
 if all(phrase in input for phrase in phrases):
 return True
 keywords = input.split()
 for phrase in phrases:
 phrase_matched = False
 ck_words = phrase.split()
 first_word_matches = [
 i for i, x in enumerate(keywords) if x == ck_words[0]
 ]
 if not first_word_matches:
 return False
 for first_word_match in first_word_matches:
 old_match_index = first_word_match
 matched = False
 for i in range(0, len(ck_words)):
 try:
 match_index = keywords.index(ck_words[i])
 if match_index - old_match_index > (distance + 1):
 matched = False
 old_match_index = match_index
 except ValueError:
 matched = False
 if matched:
 phrase_matched = True
 break
 if not phrase_matched:
 return False
 return True

Some unittests

class TestCompoundKeywords(unittest.TestCase):
 """ Runnable compound keyword unittests.
 """
 def test_valid_match(self):
 """finds a valid compound keyword"""
 self.assertTrue(
 get_compound_keyword_match(
 "can i take tylenol", ["take tylenol"], 0,
 )
 )
 def test_invalid_match(self):
 """finds multiple keywords in entry"""
 self.assertTrue(
 get_compound_keyword_match(
 "can i take a tylenol take tylenol", ["take", "tylenol"], 1,
 )
 )
 def test_ck_valid_match(self):
 """finds a valid compound keyword match with distance > 1"""
 self.assertTrue(
 get_compound_keyword_match(
 "can i take a big tylenol", ["take tylenol"], 2,
 )
 )
 def test_multiple_ck_valid_match(self):
 """Contains two compound keyword matches with distance <= 2"""
 self.assertTrue(
 get_compound_keyword_match(
 "can i take a big tylenol i have severe allergies", ["take tylenol", "have allergies"], 2,
 )
 )
 def test_multiple_ck_invalid_distance_match(self):
 """Assert that if not all compound keywords are matchable within distance we fail"""
 self.assertFalse(
 get_compound_keyword_match(
 "can i take a big tylenol i have severe allergies", ["take tylenol", "have allergies"], 1,
 )
 )
 def test_invalid_match_duplicate_words(self):
 """Assert that an unmatchable phrase fails"""
 self.assertFalse(
 get_compound_keyword_match(
 "if i am nauseous and if i am coughing can i take tylenol", ["i i nauseous i"], 5,
 )
 )
if __name__ == "__main__":
 unittest.main()

Question 2

Is this homework?

Question 3

@Reinderien Yes kind of

Question 4

Ooohhh, look! It comes with unit tests, excellent!

Also, brownie points for the optional type annotations.

This is Just Wrong.

 if all(phrase in input for phrase in phrases):
 return True

Given a distance of zero and even a single unmatched word, we should be returning False. You were looking for

 return all(phrase in input for phrase in phrases)

As it stands the False case falls through to rest of function, which does not appear to be Author's Intent.

I found this a bit opaque / hard-to-maintain:

 ck_words = phrase.split()
 first_word_matches = [
 i for i, x in enumerate(keywords) if x == ck_words[0]
 ]

I'm sure ck_words could be a perfectly nice local identifier. But please offer a # comment that gives me a hint how I should read "ck". Maybe "check"? But that didn't seem to make sense.

It's not obvious to me what's special about the 1st word, e.g. "ummm can i go ..." arguably has "can" as the 1st important word, but .split() makes it the 2nd word.

I would feel better about complex logic in the for first_word_match ... loop if it could be broken out as a helper.

That way it could have its own contract and its own unit tests.

Consider adding a """docstring""" to get_compound_keyword_match.

Consider deleting one or more docstrings from the unit tests. You have lovely method names, so they are largely self-explanatory. If you feel there's still more to say, then yes by all means say it in a docstring.

The tests are all fine as they are. But don't be shy about adding more than one assert to a test. Sometimes it's more convenient that way. We assume all tests are Green. If it turns out it's not obvious what went wrong with a Red test, then you might get serious about running tiny atomic single-assert tests so you know exactly where things went south.

The .main() call at the end is perfectly nice and convenient; feel free to keep it. But please understand that it is pretty usual for folks to run tests with a command like $ pytest or

$ python -m unittest test_compound_keywords.py

Sometimes a Makefile target ($ make test) will mention such a command.

I'm just saying: don't feel like you have to throw that main() call in there.

Also, consider running this with $ pytest --cov, to help you assess whether the tests have exercised both possibilities for each if.

Overall?

Identifiers are well chosen. This code can reasonably be maintained by others.

The for loop is perhaps more complex than it absolutely has to be, for example there's coupling among a bunch of local variables. Breaking out the occasoinal helper could improve readability.

LGTM, ship it.

Question 5

This is Just Wrong ...

 if all(phrase in input for phrase in phrases):
 return True

... but not only for the reason J_H has given.

 self.assertFalse(
 get_compound_keyword_match(
 "is it ok if I mistake tylenol for aspirin",
 ["take tylenol"], 0,
 )
 )

There is no word boundary checking when a zero distance is given, which can lead to false positives.

Question 6

Yay! I feel this is an excellent time to advocate running tests against hypothesis. Believe no one! Not me. Not my (esteemed!) collaborator AJNeufeld. Rather, believe the machine! And gaze in astonishment at what it tells you about your shaky beliefs and weak "proofs". Fuzzing is an ugly business, but your code will emerge the better for it. (I admit it has humbled me more than once. For example, what does the unicode regex r"\d+" really mean? Yeah, not what you thought, eh?)

J_H J_H 41.4k3 gold badges38 silver badges157 bronze badges · Answer 1 · 2023-01-10 04:37:39Z