Determine if all the phrases are found within the input string.
If they're all found (using distance as a measure of leeway) return True. Else False.
Example:
- input = 'can i go to the bathroom in the morning',
- phrases = ['can go', 'bathroom morning']
- if distance is 1 then this won't result in a match because
- 'bathroom', 'morning' has 2 words between it
- if distance is 2 then 'bathroom in the morning' is counted as a valid phrase
Code improvement questions
- How would you make the code more readable
- How do they handle punctuation - do they distinguish between types of punctuation and why How do they handle stemming?
- Do they handle overlapping instances and partial phrase matches (need to check them all)
- Big O time/ best/worst case scenarios
def get_compound_keyword_match(input: str, phrases: list, distance: int) -> bool:
if not distance:
# We have no leeway for a match.
if all(phrase in input for phrase in phrases):
return True
keywords = input.split()
for phrase in phrases:
phrase_matched = False
ck_words = phrase.split()
first_word_matches = [
i for i, x in enumerate(keywords) if x == ck_words[0]
]
if not first_word_matches:
return False
for first_word_match in first_word_matches:
old_match_index = first_word_match
matched = False
for i in range(0, len(ck_words)):
try:
match_index = keywords.index(ck_words[i])
if match_index - old_match_index > (distance + 1):
matched = False
old_match_index = match_index
except ValueError:
matched = False
if matched:
phrase_matched = True
break
if not phrase_matched:
return False
return True
Some unittests
class TestCompoundKeywords(unittest.TestCase):
""" Runnable compound keyword unittests.
"""
def test_valid_match(self):
"""finds a valid compound keyword"""
self.assertTrue(
get_compound_keyword_match(
"can i take tylenol", ["take tylenol"], 0,
)
)
def test_invalid_match(self):
"""finds multiple keywords in entry"""
self.assertTrue(
get_compound_keyword_match(
"can i take a tylenol take tylenol", ["take", "tylenol"], 1,
)
)
def test_ck_valid_match(self):
"""finds a valid compound keyword match with distance > 1"""
self.assertTrue(
get_compound_keyword_match(
"can i take a big tylenol", ["take tylenol"], 2,
)
)
def test_multiple_ck_valid_match(self):
"""Contains two compound keyword matches with distance <= 2"""
self.assertTrue(
get_compound_keyword_match(
"can i take a big tylenol i have severe allergies", ["take tylenol", "have allergies"], 2,
)
)
def test_multiple_ck_invalid_distance_match(self):
"""Assert that if not all compound keywords are matchable within distance we fail"""
self.assertFalse(
get_compound_keyword_match(
"can i take a big tylenol i have severe allergies", ["take tylenol", "have allergies"], 1,
)
)
def test_invalid_match_duplicate_words(self):
"""Assert that an unmatchable phrase fails"""
self.assertFalse(
get_compound_keyword_match(
"if i am nauseous and if i am coughing can i take tylenol", ["i i nauseous i"], 5,
)
)
if __name__ == "__main__":
unittest.main()
-
1\$\begingroup\$ Is this homework? \$\endgroup\$Reinderien– Reinderien2023年01月09日 13:07:18 +00:00Commented Jan 9, 2023 at 13:07
-
\$\begingroup\$ @Reinderien Yes kind of \$\endgroup\$Rohit Sthapit– Rohit Sthapit2023年01月09日 13:15:26 +00:00Commented Jan 9, 2023 at 13:15
2 Answers 2
Ooohhh, look! It comes with unit tests, excellent!
Also, brownie points for the optional type annotations.
This is Just Wrong.
if all(phrase in input for phrase in phrases):
return True
Given a distance of zero and even a single unmatched word, we should be returning False. You were looking for
return all(phrase in input for phrase in phrases)
As it stands the False case falls through to rest of function, which does not appear to be Author's Intent.
I found this a bit opaque / hard-to-maintain:
ck_words = phrase.split()
first_word_matches = [
i for i, x in enumerate(keywords) if x == ck_words[0]
]
I'm sure ck_words
could be a perfectly nice
local identifier. But please offer a # comment
that gives me a hint how I should read "ck".
Maybe "check"? But that didn't seem to make sense.
It's not obvious to me what's special about the 1st word, e.g. "ummm can i go ..." arguably has "can" as the 1st important word, but .split() makes it the 2nd word.
I would feel better about complex logic in the
for first_word_match ...
loop if it could be broken out
as a helper.
That way it could have its own contract and its own unit tests.
Consider adding a """docstring""" to get_compound_keyword_match
.
Consider deleting one or more docstrings from the unit tests. You have lovely method names, so they are largely self-explanatory. If you feel there's still more to say, then yes by all means say it in a docstring.
The tests are all fine as they are. But don't be shy about adding more than one assert to a test. Sometimes it's more convenient that way. We assume all tests are Green. If it turns out it's not obvious what went wrong with a Red test, then you might get serious about running tiny atomic single-assert tests so you know exactly where things went south.
The .main()
call at the end is perfectly nice and convenient;
feel free to keep it. But please understand that it is pretty
usual for folks to run tests with a command like $ pytest
or
$ python -m unittest test_compound_keywords.py
Sometimes a Makefile target ($ make test
) will mention
such a command.
I'm just saying: don't feel like you have to throw that main() call in there.
Also, consider running this with $ pytest --cov
,
to help you assess whether the tests have
exercised both possibilities for each if
.
Overall?
Identifiers are well chosen. This code can reasonably be maintained by others.
The for
loop is perhaps more complex than it absolutely
has to be, for example there's coupling among a bunch of local variables. Breaking out the occasoinal helper could improve readability.
LGTM, ship it.
This is Just Wrong ...
if all(phrase in input for phrase in phrases):
return True
... but not only for the reason J_H has given.
self.assertFalse(
get_compound_keyword_match(
"is it ok if I mistake tylenol for aspirin",
["take tylenol"], 0,
)
)
There is no word boundary checking when a zero distance is given, which can lead to false positives.
-
1\$\begingroup\$ Yay! I feel this is an excellent time to advocate running tests against hypothesis. Believe no one! Not me. Not my (esteemed!) collaborator AJNeufeld. Rather, believe the machine! And gaze in astonishment at what it tells you about your shaky beliefs and weak "proofs". Fuzzing is an ugly business, but your code will emerge the better for it. (I admit it has humbled me more than once. For example, what does the unicode regex
r"\d+"
really mean? Yeah, not what you thought, eh?) \$\endgroup\$J_H– J_H2023年01月11日 05:22:51 +00:00Commented Jan 11, 2023 at 5:22