Finding sub-list

Question 1

Pythonic way of expressing the simple problem:

Tell if the list needle is sublist of haystack

#!/usr/bin/env python3
def sublist (haystack, needle):
 def start ():
 i = iter(needle)
 return next(i), i
 try:
 n0, i = start()
 for h in haystack:
 if h == n0:
 n0 = next(i)
 else:
 n0, i = start()
 except StopIteration:
 return True
 return False

Question 2

So, what's the question?

Question 3

@JeffVanzella: Fair point. That's "code review". The question is: "what do you think in general?".

Question 4

1. Bug

Here's a case where your function fails:

>>> sublist([1, 1, 2], [1, 2])
False

this is because in the else: case you go back to the beginning of the needle, but you keep going forward in the haystack, possibly skipping over a match. In the test case, your function tries matching with the alignment shown below, which fails at the position marked X:

 X
haystack 1,1,2
needle 1,2

Then it starts over from the beginning of the needle, but keeps going forward in the haystack, thus missing the match:

 X
haystack 1,1,2
needle 1,2

So after a mismatch you need to go backward an appropriate distance in the haystack before starting over again from the beginning of the needle.

2. A better algorithm

It turns out to be better to start matching from the end of the needle. If this fails to match, we can skip forward several steps: possibly the whole length of the needle. For example, in this situation:

 X
haystack 1,2,3,4,6,1,2,3,4,5
needle 1,2,3,4,5

we can skip forward by the whole length of the needle (because 6 does not appear in the needle). The next alignment we need to try is this:

 O O O O O
haystack 1,2,3,4,6,1,2,3,4,5
needle 1,2,3,4,5

However, we can't always skip forward the whole length of the needle. The distance we can skip depends on the item that fails to match. In this situation:

 X
haystack 1,2,3,4,1,2,3,4,5
needle 1,2,3,4,5

we should skip forward by 4, to bring the 1s into alignment.

Making these ideas precise leads to the Boyer–Moore–Horspool algorithm:

def find(haystack, needle):
 """Return the index at which the sequence needle appears in the
 sequence haystack, or -1 if it is not found, using the Boyer-
 Moore-Horspool algorithm. The elements of needle and haystack must
 be hashable.
 >>> find([1, 1, 2], [1, 2])
 1
 """
 h = len(haystack)
 n = len(needle)
 skip = {needle[i]: n - i - 1 for i in range(n - 1)}
 i = n - 1
 while i < h:
 for j in range(n):
 if haystack[i - j] != needle[-j - 1]:
 i += skip.get(haystack[i], n)
 break
 else:
 return i - n + 1
 return -1

Question 5

@Lattyware: Generally I'd agree that you should use iterators wherever it makes sense, but this is one of the exceptional cases where iterators don't make much sense. A pure-iterator search runs in O(nm) whereas Boyer–Moore-Horspool is O(n). (In the average case on random text.) Sometimes you just have to get your hands grubby.

Question 6

Scratch all my arguments, it's impossible to do this correctly the way I was doing it.

Question 7

Accepted. The source is strong with this one.

Question 8

Does the dict comprehension overwrite previous values? If not you'll skip too far when you see e while searching for seven. I'd bet it does but am not sure. I do miss Python for its elegance.

Question 9

Yes, later items in the dict comprehension override previous items, and so find('seven', 'even') → 1 as required.

Gareth Rees Gareth Rees 50.1k3 gold badges130 silver badges210 bronze badges · Accepted Answer · 2012-12-15 02:00:44Z

1. Bug

Here's a case where your function fails:

>>> sublist([1, 1, 2], [1, 2])
False

this is because in the else: case you go back to the beginning of the needle, but you keep going forward in the haystack, possibly skipping over a match. In the test case, your function tries matching with the alignment shown below, which fails at the position marked X:

 X
haystack 1,1,2
needle 1,2

Then it starts over from the beginning of the needle, but keeps going forward in the haystack, thus missing the match:

 X
haystack 1,1,2
needle 1,2

So after a mismatch you need to go backward an appropriate distance in the haystack before starting over again from the beginning of the needle.

2. A better algorithm

It turns out to be better to start matching from the end of the needle. If this fails to match, we can skip forward several steps: possibly the whole length of the needle. For example, in this situation:

 X
haystack 1,2,3,4,6,1,2,3,4,5
needle 1,2,3,4,5

we can skip forward by the whole length of the needle (because 6 does not appear in the needle). The next alignment we need to try is this:

 O O O O O
haystack 1,2,3,4,6,1,2,3,4,5
needle 1,2,3,4,5

However, we can't always skip forward the whole length of the needle. The distance we can skip depends on the item that fails to match. In this situation:

 X
haystack 1,2,3,4,1,2,3,4,5
needle 1,2,3,4,5

we should skip forward by 4, to bring the 1s into alignment.

Making these ideas precise leads to the Boyer–Moore–Horspool algorithm:

def find(haystack, needle):
 """Return the index at which the sequence needle appears in the
 sequence haystack, or -1 if it is not found, using the Boyer-
 Moore-Horspool algorithm. The elements of needle and haystack must
 be hashable.
 >>> find([1, 1, 2], [1, 2])
 1
 """
 h = len(haystack)
 n = len(needle)
 skip = {needle[i]: n - i - 1 for i in range(n - 1)}
 i = n - 1
 while i < h:
 for j in range(n):
 if haystack[i - j] != needle[-j - 1]:
 i += skip.get(haystack[i], n)
 break
 else:
 return i - n + 1
 return -1

@Lattyware: Generally I'd agree that you should use iterators wherever it makes sense, but this is one of the exceptional cases where iterators don't make much sense. A pure-iterator search runs in O(nm) whereas Boyer–Moore-Horspool is O(n). (In the average case on random text.) Sometimes you just have to get your hands grubby.
Scratch all my arguments, it's impossible to do this correctly the way I was doing it.
Does the dict comprehension overwrite previous values? If not you'll skip too far when you see e while searching for seven. I'd bet it does but am not sure. I do miss Python for its elegance.
Yes, later items in the dict comprehension override previous items, and so find('seven', 'even') → 1 as required.

Stack Exchange Network

Finding sub-list

1 Answer 1

1. Bug

2. A better algorithm

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Finding sub-list

1 Answer 1

1. Bug

2. A better algorithm

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions