indexOf Boyer-Moore String Search algorithm

Question 1

This is an implementation of Boyer-Moore string search algorithm index_of for a pattern in text. I was challenged to implement it by my friend.

This Python function returns the starting index of the first occurrence of a pattern in text and returns None if no index is found. For example, index_of('ababc', 'abc') returns 2 and index_of('abc', 'z') returns None.

I have the following implementation that handles most of the cases and I think the worst case running time is \$O(m/n)\$.

def index_of(text, pattern):
 assert isinstance(text, str), 'text is not a string: {}'.format(text)
 assert isinstance(pattern, str), 'pattern is not a string: {}'.format(text)
 # Check if empty; if so, return 0, earliest bit
 if len(pattern) == 0:
 return 0
 pattern = list(pattern)
 pattern_len = len(pattern)
 text = list(text)
 textext_len = len(text)
 shift = 0
 while(shift <= textext_len - pattern_len):
 index = pattern_len - 1
 # Keep going until it all meets
 while index >= 0 and pattern[index] == text[shift+index]:
 index -= 1
 # If we get past index, then return where the shift is
 if index < 0:
 return shift
 else:
 if text[shift+pattern_len-1] in pattern:
 shift += pattern_len - pattern.index(text[shift+pattern_len-1]) - 1
 else:
 shift += pattern_len
 return None

I include my code here. I ran my code against 3 separate tests, and it passed all of the pytests. So it's pretty robust, and I think my implementation works fine.

import unittest
class Test(unittest.TestCase):
 def test_index_of_with_matching_patterns(self):
 assert index_of('abc', '') == 0 # all strings contain empty string
 assert index_of('abc', 'a') == 0 # single letters are easy
 assert index_of('abc', 'b') == 1
 assert index_of('abc', 'c') == 2
 assert index_of('abc', 'ab') == 0 # multiple letters are harder
 assert index_of('abc', 'bc') == 1
 assert index_of('abc', 'abc') == 0 # all strings contain themselves
 assert index_of('aaa', 'a') == 0 # multiple occurrences
 assert index_of('aaa', 'aa') == 0 # overlapping pattern
 assert index_of('thisisabsolutelycrazy', 't') == 0
 assert index_of('abcdef', 'abcdef') == 0 # all strings contain
 assert index_of('abcdef', 'abcdef') == 0 # all strings contain
 def test_index_of_with_non_matching_patterns(self):
 # Negative test cases (counterexamples) with non-matching patterns
 assert index_of('abc', 'z') is None # remember to test other letters
 assert index_of('abc', 'ac') is None # important to test close cases
 assert index_of('abc', 'az') is None # first letter, but not last
 assert index_of('abc', 'abz') is None # first 2 letters, but not last
 def test_index_of_with_complex_patterns(self):
 # Difficult test cases (examples) with complex patterns
 assert index_of('ababc', 'abc') == 2 # overlapping prefix
 assert index_of('bananas', 'nas') == 4 # overlapping prefix
 assert index_of('abcabcabc', 'abc') == 0 # multiple occurrences
 assert index_of('abcabcab', 'abc') == 0 # multiple occurrences
 assert index_of('abcabcdef', 'abcd') == 3 # overlapping prefix
 assert index_of('abcabcdef', 'abcdef') == 3 # overlapping prefix
 assert index_of('abcabcdabcde', 'abcde') == 7 # overlapping prefix
 assert index_of('abcabcdabcde', 'abcd') == 3 # multiple occurrences, overlapping prefix
 assert index_of('abra cadabra', 'abra') == 0 # multiple occurrences
 assert index_of('abra cadabra', 'adab') == 6 # overlapping prefix
if __name__ == '__main__':
 unittest.main()

Question 2

(I don't understand If you find a mismatch, you have a configuration of the type. I'm tempted to suggest if __name__ ...: import sys main(sys.argv)) Do "the twin assert()s" call for procedural abstraction? I don't like if condition: return... else:.

Question 3

i updated the comment

Question 4

Let me know if there's any comment and feedback for tihs code review

Question 5

Comments consume less time than a full-blown code review: 1) How do/did you assess worst case running time? 2) main() looks scrambled: shouldn't "the single shot call" be in "the == 2-branch", the script controlled by else? 3) this doesn't look a true implementation of Boyer-Moore string-search (Neither precomputation nor (not quite as true) memoisation of delta1) 4) the comment about shift rule looks misplaced: put it after the setup of textext_len (what a name...) 5) return 0, earliest bit - return 0 (1st character)?

Question 6

I appreciate the comment, and I was able to assess my worst time using Big O notation.

Question 7

1. Bug

The code in the post does not work!

>>> index_of('abba', 'bba')
>>> # expected 1 but got None

Read on to see how you could have discovered this.

2. Review of the unit tests

Having a suite of unit tests is excellent, better than 99% of submissions to Code Review.

Using assert for the unit tests means that you don't get much information out of a test case failure. For example, if I artifically introduce a bug into the code, then I might get this test output:

======================================================================
FAIL: test_index_of_with_matching_patterns (cr179311.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "cr179311.py", line 38, in test_index_of_with_matching_patterns
 assert index_of('abc', 'c') == 2
AssertionError
----------------------------------------------------------------------

At this point it would be helpful to know what the incorrect value was. The unittest.TestCase class has a bunch of assert methods that produce better output. For example, if I replace the line:

assert index_of('abc', 'c') == 2

with:

self.assertEqual(index_of('abc', 'c'), 2)

then the test failure message includes the incorrect value:

======================================================================
FAIL: test_index_of_with_matching_patterns (cr179311.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "cr179311.py", line 38, in test_index_of_with_matching_patterns
 self.assertEqual(index_of('abc', 'c'), 2)
AssertionError: -1 != 2
----------------------------------------------------------------------

Each unit test knows its expected result. But you had to type these in by hand and could have made a mistake. It would be more reliable if the expected result were computed. In this case we could use the built-in str.find method, for example:
```
self.assertEqual(index_of('abc', 'c'), 'abc'.find('c'))
```
This approach is known as using a test oracle.

The unit tests are very repetitive. Every test has the form:

assert index_of(haystack, needle) == expected_index

or if you follow my suggestions above,

self.assertEqual(index_of(haystack, needle), str.find(haystack, needle))

The repetition could be avoided by making the tests data-driven, like this:

class TestIndexOf(unittest.TestCase):
 CASES = [
 # Positive cases
 ('abc', ''), # all strings contain empty string
 ('abc', 'a'), # single letters are easy
 ('abc', 'b'),
 ('abc', 'c'),
 ('abc', 'ab'), # multiple letters are harder
 ('abc', 'bc'),
 ('abc', 'abc'), # all strings contain themselves
 ('aaa', 'a'), # multiple occurrences
 ('aaa', 'aa'), # overlapping pattern
 ('thisisabsolutelycrazy', 't'),
 ('abcdef', 'abcdef'), # all strings contain themselves
 ('ababc', 'abc'), # overlapping prefix
 ('bananas', 'nas'), # overlapping prefix
 ('abcabcabc', 'abc'), # multiple occurrences
 ('abcabcab', 'abc'), # multiple occurrences
 ('abcabcdef', 'abcd'), # overlapping prefix
 ('abcabcdef', 'abcdef'), # overlapping prefix
 ('abcabcdabcde', 'abcde'), # overlapping prefix
 ('abcabcdabcde', 'abcd'), # multiple occurrences, overlapping prefix
 ('abra cadabra', 'abra'), # multiple occurrences
 ('abra cadabra', 'adab'), # overlapping prefix
 # Negative cases
 ('abc', 'z'), # remember to test other letters
 ('abc', 'ac'), # important to test close cases
 ('abc', 'az'), # first letter, but not last
 ('abc', 'abz'), # first 2 letters, but not last
 ]
 def check(self, haystack, needle):
 expected = haystack.find(needle)
 if expected == -1:
 expected = None
 self.assertEqual(index_of(haystack, needle), expected,
 "index_of({!r}, {!r})".format(haystack, needle))
 def test_cases(self):
 for haystack, needle in self.CASES:
 self.check(haystack, needle)

This makes it easy to add more test cases.

Since we have a test oracle, it would make a lot of sense to generate random test cases. This allows us to explore a much larger space of possibilities than we could do just by writing out test cases by hand.

It's easy to generate random (positive) test cases:

def test_random(self):
 for n in range(1, 100):
 haystack = ''.join(random.choice('abc') for _ in range(n))
 start, stop = sorted(random.sample(range(n + 1), 2))
 needle = haystack[start:stop]
 self.check(haystack, needle)

And if we run these randomly generated test cases, we soon get a test failure as described in §1 above:

======================================================================
FAIL: test_random (cr179311.TestIndexOf)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "cr179311.py", line 91, in test_random
 self.check(haystack, needle)
 File "cr179311.py", line 81, in check
 "index_of({!r}, {!r})".format(haystack, needle))
AssertionError: None != 3 : index_of('cacaaaaaa', 'aaaaaa')
----------------------------------------------------------------------

Question 8

Excellent answer; it might be worth mentioning that any failures from random test cases should be added to the test suite as they are found (so that they can be demonstrably fixed). I mention this because it's easy to take for granted, but not obvious to everybody.

Question 9

Nitpick

Because of Gareth Rees' excellent answer, there is no point in trying to improve your code until you've made it correct. However, here is a tiny piece of code that could be improved:

 if text[shift+pattern_len-1] in pattern:
 shift += pattern_len - pattern.index(text[shift+pattern_len-1]) - 1
 else:
 shift += pattern_len

could be rewritten by incrementing shift at the beginning:

 shift += pattern_len
 if text[shift-1] in pattern:
 shift -= pattern.index(text[shift-1]) + 1

then, the check using in before calling index could be changed for a single call to find, thus iterating over pattern only once:

 shift += pattern_len
 pos = pattern.find(text[shift-1])
 if pos >= 0:
 shift -= pos + 1

Then, you go further and actually increment by pattern_len - 1 at the beginning:

 shift += pattern_len - 1
 pos = pattern.find(text[shift])
 if pos >= 0:
 shift -= pos
 else:
 shift += 1

Ultimately, this becomes:

 shift += pattern_len - 1
 shift -= pattern.find(text[shift]) # -1 when not found

This is not tested yet :(

Question 10

A few very nitpicky things.

You are missing docstrings. Each function and class should have a docstring so someone reading quickly over the code can tell what each is supposed to do, what its reason for being is, without having to look at calls to it in context and/or read through the statements. Docstrings are better than first-line comments because intelligent editors etc. can automatically parse them.

Your comments in general should be improved. Your first comment:

 # Check if empty; if so, return 0, earliest bit

just restates what the code is doing, except for the cryptic phrase "earliest bit" which doesn't explain why 0 is returned. It's not immediately obvious that the empty string is defined as being at the earliest possible location.

You should strive to remove any duplicated code. We see:

 pattern = list(pattern)
 pattern_len = len(pattern)
 text = list(text)
 textext_len = len(text)

It would be better to define a function that both listifies and gives the length, and then call it:

(pattern, pattern_len) = listify(pattern)
(text, textext_len) = listify(text)

and I'm not sure you mean to be reusing text.

Toby Speight – Toby Speight · Accepted Answer · 2018-01-07 11:17:57Z

1. Bug

The code in the post does not work!

>>> index_of('abba', 'bba')
>>> # expected 1 but got None

Read on to see how you could have discovered this.

2. Review of the unit tests

Having a suite of unit tests is excellent, better than 99% of submissions to Code Review.

Using assert for the unit tests means that you don't get much information out of a test case failure. For example, if I artifically introduce a bug into the code, then I might get this test output:

======================================================================
FAIL: test_index_of_with_matching_patterns (cr179311.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "cr179311.py", line 38, in test_index_of_with_matching_patterns
 assert index_of('abc', 'c') == 2
AssertionError
----------------------------------------------------------------------

At this point it would be helpful to know what the incorrect value was. The unittest.TestCase class has a bunch of assert methods that produce better output. For example, if I replace the line:

assert index_of('abc', 'c') == 2

with:

self.assertEqual(index_of('abc', 'c'), 2)

then the test failure message includes the incorrect value:

======================================================================
FAIL: test_index_of_with_matching_patterns (cr179311.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "cr179311.py", line 38, in test_index_of_with_matching_patterns
 self.assertEqual(index_of('abc', 'c'), 2)
AssertionError: -1 != 2
----------------------------------------------------------------------

Each unit test knows its expected result. But you had to type these in by hand and could have made a mistake. It would be more reliable if the expected result were computed. In this case we could use the built-in str.find method, for example:
```
self.assertEqual(index_of('abc', 'c'), 'abc'.find('c'))
```
This approach is known as using a test oracle.

The unit tests are very repetitive. Every test has the form:

assert index_of(haystack, needle) == expected_index

or if you follow my suggestions above,

self.assertEqual(index_of(haystack, needle), str.find(haystack, needle))

The repetition could be avoided by making the tests data-driven, like this:

class TestIndexOf(unittest.TestCase):
 CASES = [
 # Positive cases
 ('abc', ''), # all strings contain empty string
 ('abc', 'a'), # single letters are easy
 ('abc', 'b'),
 ('abc', 'c'),
 ('abc', 'ab'), # multiple letters are harder
 ('abc', 'bc'),
 ('abc', 'abc'), # all strings contain themselves
 ('aaa', 'a'), # multiple occurrences
 ('aaa', 'aa'), # overlapping pattern
 ('thisisabsolutelycrazy', 't'),
 ('abcdef', 'abcdef'), # all strings contain themselves
 ('ababc', 'abc'), # overlapping prefix
 ('bananas', 'nas'), # overlapping prefix
 ('abcabcabc', 'abc'), # multiple occurrences
 ('abcabcab', 'abc'), # multiple occurrences
 ('abcabcdef', 'abcd'), # overlapping prefix
 ('abcabcdef', 'abcdef'), # overlapping prefix
 ('abcabcdabcde', 'abcde'), # overlapping prefix
 ('abcabcdabcde', 'abcd'), # multiple occurrences, overlapping prefix
 ('abra cadabra', 'abra'), # multiple occurrences
 ('abra cadabra', 'adab'), # overlapping prefix
 # Negative cases
 ('abc', 'z'), # remember to test other letters
 ('abc', 'ac'), # important to test close cases
 ('abc', 'az'), # first letter, but not last
 ('abc', 'abz'), # first 2 letters, but not last
 ]
 def check(self, haystack, needle):
 expected = haystack.find(needle)
 if expected == -1:
 expected = None
 self.assertEqual(index_of(haystack, needle), expected,
 "index_of({!r}, {!r})".format(haystack, needle))
 def test_cases(self):
 for haystack, needle in self.CASES:
 self.check(haystack, needle)

This makes it easy to add more test cases.

Since we have a test oracle, it would make a lot of sense to generate random test cases. This allows us to explore a much larger space of possibilities than we could do just by writing out test cases by hand.

It's easy to generate random (positive) test cases:

def test_random(self):
 for n in range(1, 100):
 haystack = ''.join(random.choice('abc') for _ in range(n))
 start, stop = sorted(random.sample(range(n + 1), 2))
 needle = haystack[start:stop]
 self.check(haystack, needle)

And if we run these randomly generated test cases, we soon get a test failure as described in §1 above:

======================================================================
FAIL: test_random (cr179311.TestIndexOf)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "cr179311.py", line 91, in test_random
 self.check(haystack, needle)
 File "cr179311.py", line 81, in check
 "index_of({!r}, {!r})".format(haystack, needle))
AssertionError: None != 3 : index_of('cacaaaaaa', 'aaaaaa')
----------------------------------------------------------------------

Excellent answer; it might be worth mentioning that any failures from random test cases should be added to the test suite as they are found (so that they can be demonstrably fixed). I mention this because it's easy to take for granted, but not obvious to everybody.

Stack Exchange Network

indexOf Boyer-Moore String Search algorithm

3 Answers 3

1. Bug

2. Review of the unit tests

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

indexOf Boyer-Moore String Search algorithm

3 Answers 3

1. Bug

2. Review of the unit tests

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions