This is an implementation of Boyer-Moore string search algorithm index_of
for a pattern in text. I was challenged to implement it by my friend.
This Python function returns the starting index of the first occurrence of a pattern in text and returns None
if no index is found. For example, index_of('ababc', 'abc')
returns 2 and index_of('abc', 'z')
returns None
.
I have the following implementation that handles most of the cases and I think the worst case running time is \$O(m/n)\$.
def index_of(text, pattern):
assert isinstance(text, str), 'text is not a string: {}'.format(text)
assert isinstance(pattern, str), 'pattern is not a string: {}'.format(text)
# Check if empty; if so, return 0, earliest bit
if len(pattern) == 0:
return 0
pattern = list(pattern)
pattern_len = len(pattern)
text = list(text)
textext_len = len(text)
shift = 0
while(shift <= textext_len - pattern_len):
index = pattern_len - 1
# Keep going until it all meets
while index >= 0 and pattern[index] == text[shift+index]:
index -= 1
# If we get past index, then return where the shift is
if index < 0:
return shift
else:
if text[shift+pattern_len-1] in pattern:
shift += pattern_len - pattern.index(text[shift+pattern_len-1]) - 1
else:
shift += pattern_len
return None
I include my code here. I ran my code against 3 separate tests, and it passed all of the pytests. So it's pretty robust, and I think my implementation works fine.
import unittest
class Test(unittest.TestCase):
def test_index_of_with_matching_patterns(self):
assert index_of('abc', '') == 0 # all strings contain empty string
assert index_of('abc', 'a') == 0 # single letters are easy
assert index_of('abc', 'b') == 1
assert index_of('abc', 'c') == 2
assert index_of('abc', 'ab') == 0 # multiple letters are harder
assert index_of('abc', 'bc') == 1
assert index_of('abc', 'abc') == 0 # all strings contain themselves
assert index_of('aaa', 'a') == 0 # multiple occurrences
assert index_of('aaa', 'aa') == 0 # overlapping pattern
assert index_of('thisisabsolutelycrazy', 't') == 0
assert index_of('abcdef', 'abcdef') == 0 # all strings contain
assert index_of('abcdef', 'abcdef') == 0 # all strings contain
def test_index_of_with_non_matching_patterns(self):
# Negative test cases (counterexamples) with non-matching patterns
assert index_of('abc', 'z') is None # remember to test other letters
assert index_of('abc', 'ac') is None # important to test close cases
assert index_of('abc', 'az') is None # first letter, but not last
assert index_of('abc', 'abz') is None # first 2 letters, but not last
def test_index_of_with_complex_patterns(self):
# Difficult test cases (examples) with complex patterns
assert index_of('ababc', 'abc') == 2 # overlapping prefix
assert index_of('bananas', 'nas') == 4 # overlapping prefix
assert index_of('abcabcabc', 'abc') == 0 # multiple occurrences
assert index_of('abcabcab', 'abc') == 0 # multiple occurrences
assert index_of('abcabcdef', 'abcd') == 3 # overlapping prefix
assert index_of('abcabcdef', 'abcdef') == 3 # overlapping prefix
assert index_of('abcabcdabcde', 'abcde') == 7 # overlapping prefix
assert index_of('abcabcdabcde', 'abcd') == 3 # multiple occurrences, overlapping prefix
assert index_of('abra cadabra', 'abra') == 0 # multiple occurrences
assert index_of('abra cadabra', 'adab') == 6 # overlapping prefix
if __name__ == '__main__':
unittest.main()
3 Answers 3
1. Bug
The code in the post does not work!
>>> index_of('abba', 'bba')
>>> # expected 1 but got None
Read on to see how you could have discovered this.
2. Review of the unit tests
Having a suite of unit tests is excellent, better than 99% of submissions to Code Review.
Using
assert
for the unit tests means that you don't get much information out of a test case failure. For example, if I artifically introduce a bug into the code, then I might get this test output:====================================================================== FAIL: test_index_of_with_matching_patterns (cr179311.Test) ---------------------------------------------------------------------- Traceback (most recent call last): File "cr179311.py", line 38, in test_index_of_with_matching_patterns assert index_of('abc', 'c') == 2 AssertionError ----------------------------------------------------------------------
At this point it would be helpful to know what the incorrect value was. The
unittest.TestCase
class has a bunch of assert methods that produce better output. For example, if I replace the line:assert index_of('abc', 'c') == 2
with:
self.assertEqual(index_of('abc', 'c'), 2)
then the test failure message includes the incorrect value:
====================================================================== FAIL: test_index_of_with_matching_patterns (cr179311.Test) ---------------------------------------------------------------------- Traceback (most recent call last): File "cr179311.py", line 38, in test_index_of_with_matching_patterns self.assertEqual(index_of('abc', 'c'), 2) AssertionError: -1 != 2 ----------------------------------------------------------------------
Each unit test knows its expected result. But you had to type these in by hand and could have made a mistake. It would be more reliable if the expected result were computed. In this case we could use the built-in
str.find
method, for example:self.assertEqual(index_of('abc', 'c'), 'abc'.find('c'))
This approach is known as using a test oracle.
The unit tests are very repetitive. Every test has the form:
assert index_of(haystack, needle) == expected_index
or if you follow my suggestions above,
self.assertEqual(index_of(haystack, needle), str.find(haystack, needle))
The repetition could be avoided by making the tests data-driven, like this:
class TestIndexOf(unittest.TestCase): CASES = [ # Positive cases ('abc', ''), # all strings contain empty string ('abc', 'a'), # single letters are easy ('abc', 'b'), ('abc', 'c'), ('abc', 'ab'), # multiple letters are harder ('abc', 'bc'), ('abc', 'abc'), # all strings contain themselves ('aaa', 'a'), # multiple occurrences ('aaa', 'aa'), # overlapping pattern ('thisisabsolutelycrazy', 't'), ('abcdef', 'abcdef'), # all strings contain themselves ('ababc', 'abc'), # overlapping prefix ('bananas', 'nas'), # overlapping prefix ('abcabcabc', 'abc'), # multiple occurrences ('abcabcab', 'abc'), # multiple occurrences ('abcabcdef', 'abcd'), # overlapping prefix ('abcabcdef', 'abcdef'), # overlapping prefix ('abcabcdabcde', 'abcde'), # overlapping prefix ('abcabcdabcde', 'abcd'), # multiple occurrences, overlapping prefix ('abra cadabra', 'abra'), # multiple occurrences ('abra cadabra', 'adab'), # overlapping prefix # Negative cases ('abc', 'z'), # remember to test other letters ('abc', 'ac'), # important to test close cases ('abc', 'az'), # first letter, but not last ('abc', 'abz'), # first 2 letters, but not last ] def check(self, haystack, needle): expected = haystack.find(needle) if expected == -1: expected = None self.assertEqual(index_of(haystack, needle), expected, "index_of({!r}, {!r})".format(haystack, needle)) def test_cases(self): for haystack, needle in self.CASES: self.check(haystack, needle)
This makes it easy to add more test cases.
Since we have a test oracle, it would make a lot of sense to generate random test cases. This allows us to explore a much larger space of possibilities than we could do just by writing out test cases by hand.
It's easy to generate random (positive) test cases:
def test_random(self): for n in range(1, 100): haystack = ''.join(random.choice('abc') for _ in range(n)) start, stop = sorted(random.sample(range(n + 1), 2)) needle = haystack[start:stop] self.check(haystack, needle)
And if we run these randomly generated test cases, we soon get a test failure as described in §1 above:
====================================================================== FAIL: test_random (cr179311.TestIndexOf) ---------------------------------------------------------------------- Traceback (most recent call last): File "cr179311.py", line 91, in test_random self.check(haystack, needle) File "cr179311.py", line 81, in check "index_of({!r}, {!r})".format(haystack, needle)) AssertionError: None != 3 : index_of('cacaaaaaa', 'aaaaaa') ----------------------------------------------------------------------
-
4\$\begingroup\$ Excellent answer; it might be worth mentioning that any failures from random test cases should be added to the test suite as they are found (so that they can be demonstrably fixed). I mention this because it's easy to take for granted, but not obvious to everybody. \$\endgroup\$Toby Speight– Toby Speight2018年01月23日 14:29:33 +00:00Commented Jan 23, 2018 at 14:29
Nitpick
Because of Gareth Rees' excellent answer, there is no point in trying to improve your code until you've made it correct. However, here is a tiny piece of code that could be improved:
if text[shift+pattern_len-1] in pattern:
shift += pattern_len - pattern.index(text[shift+pattern_len-1]) - 1
else:
shift += pattern_len
could be rewritten by incrementing shift
at the beginning:
shift += pattern_len
if text[shift-1] in pattern:
shift -= pattern.index(text[shift-1]) + 1
then, the check using in
before calling index
could be changed for a single call to find
, thus iterating over pattern
only once:
shift += pattern_len
pos = pattern.find(text[shift-1])
if pos >= 0:
shift -= pos + 1
Then, you go further and actually increment by pattern_len - 1
at the beginning:
shift += pattern_len - 1
pos = pattern.find(text[shift])
if pos >= 0:
shift -= pos
else:
shift += 1
Ultimately, this becomes:
shift += pattern_len - 1
shift -= pattern.find(text[shift]) # -1 when not found
This is not tested yet :(
A few very nitpicky things.
You are missing docstrings. Each function and class should have a docstring so someone reading quickly over the code can tell what each is supposed to do, what its reason for being is, without having to look at calls to it in context and/or read through the statements. Docstrings are better than first-line comments because intelligent editors etc. can automatically parse them.
Your comments in general should be improved. Your first comment:
# Check if empty; if so, return 0, earliest bit
just restates what the code is doing, except for the cryptic phrase "earliest bit" which doesn't explain why 0 is returned. It's not immediately obvious that the empty string is defined as being at the earliest possible location.
You should strive to remove any duplicated code. We see:
pattern = list(pattern) pattern_len = len(pattern) text = list(text) textext_len = len(text)
It would be better to define a function that both listifies and gives the length, and then call it:
(pattern, pattern_len) = listify(pattern)
(text, textext_len) = listify(text)
and I'm not sure you mean to be reusing text
.
Explore related questions
See similar questions with these tags.
If you find a mismatch, you have a configuration of the type
. I'm tempted to suggestif __name__ ...: import sys main(sys.argv)
) Do "the twinassert()
s" call for procedural abstraction? I don't likeif condition: return... else:
. \$\endgroup\$worst case running time
? 2)main()
looks scrambled: shouldn't "the single shot call" be in "the== 2
-branch", the script controlled byelse
? 3) this doesn't look a true implementation of Boyer-Moore string-search (Neither precomputation nor (not quite as true) memoisation of delta1) 4) the comment aboutshift rule
looks misplaced: put it after the setup oftextext_len
(what a name...) 5)return 0, earliest bit
-return 0 (1st character)
? \$\endgroup\$