Pythonic way of expressing the simple problem:
Tell if the list
needle
is sublist ofhaystack
#!/usr/bin/env python3
def sublist (haystack, needle):
def start ():
i = iter(needle)
return next(i), i
try:
n0, i = start()
for h in haystack:
if h == n0:
n0 = next(i)
else:
n0, i = start()
except StopIteration:
return True
return False
-
\$\begingroup\$ So, what's the question? \$\endgroup\$Jeff Vanzella– Jeff Vanzella2012年12月15日 03:27:04 +00:00Commented Dec 15, 2012 at 3:27
-
\$\begingroup\$ @JeffVanzella: Fair point. That's "code review". The question is: "what do you think in general?". \$\endgroup\$Dacav– Dacav2012年12月15日 12:30:16 +00:00Commented Dec 15, 2012 at 12:30
1 Answer 1
1. Bug
Here's a case where your function fails:
>>> sublist([1, 1, 2], [1, 2])
False
this is because in the else:
case you go back to the beginning of the needle, but you keep going forward in the haystack, possibly skipping over a match. In the test case, your function tries matching with the alignment shown below, which fails at the position marked X
:
X
haystack 1,1,2
needle 1,2
Then it starts over from the beginning of the needle, but keeps going forward in the haystack, thus missing the match:
X
haystack 1,1,2
needle 1,2
So after a mismatch you need to go backward an appropriate distance in the haystack before starting over again from the beginning of the needle.
2. A better algorithm
It turns out to be better to start matching from the end of the needle. If this fails to match, we can skip forward several steps: possibly the whole length of the needle. For example, in this situation:
X
haystack 1,2,3,4,6,1,2,3,4,5
needle 1,2,3,4,5
we can skip forward by the whole length of the needle (because 6
does not appear in the needle). The next alignment we need to try is this:
O O O O O
haystack 1,2,3,4,6,1,2,3,4,5
needle 1,2,3,4,5
However, we can't always skip forward the whole length of the needle. The distance we can skip depends on the item that fails to match. In this situation:
X
haystack 1,2,3,4,1,2,3,4,5
needle 1,2,3,4,5
we should skip forward by 4, to bring the 1
s into alignment.
Making these ideas precise leads to the Boyer–Moore–Horspool algorithm:
def find(haystack, needle):
"""Return the index at which the sequence needle appears in the
sequence haystack, or -1 if it is not found, using the Boyer-
Moore-Horspool algorithm. The elements of needle and haystack must
be hashable.
>>> find([1, 1, 2], [1, 2])
1
"""
h = len(haystack)
n = len(needle)
skip = {needle[i]: n - i - 1 for i in range(n - 1)}
i = n - 1
while i < h:
for j in range(n):
if haystack[i - j] != needle[-j - 1]:
i += skip.get(haystack[i], n)
break
else:
return i - n + 1
return -1
-
\$\begingroup\$ @Lattyware: Generally I'd agree that you should use iterators wherever it makes sense, but this is one of the exceptional cases where iterators don't make much sense. A pure-iterator search runs in O(nm) whereas Boyer–Moore-Horspool is O(n). (In the average case on random text.) Sometimes you just have to get your hands grubby. \$\endgroup\$Gareth Rees– Gareth Rees2012年12月15日 02:11:56 +00:00Commented Dec 15, 2012 at 2:11
-
\$\begingroup\$ Scratch all my arguments, it's impossible to do this correctly the way I was doing it. \$\endgroup\$Latty– Latty2012年12月15日 02:29:05 +00:00Commented Dec 15, 2012 at 2:29
-
\$\begingroup\$ Accepted. The source is strong with this one. \$\endgroup\$Dacav– Dacav2012年12月15日 12:38:21 +00:00Commented Dec 15, 2012 at 12:38
-
\$\begingroup\$ Does the dict comprehension overwrite previous values? If not you'll skip too far when you see
e
while searching forseven
. I'd bet it does but am not sure. I do miss Python for its elegance. \$\endgroup\$David Harkness– David Harkness2014年02月06日 03:14:56 +00:00Commented Feb 6, 2014 at 3:14 -
\$\begingroup\$ Yes, later items in the dict comprehension override previous items, and so
find('seven', 'even')
→1
as required. \$\endgroup\$Gareth Rees– Gareth Rees2014年02月06日 10:17:39 +00:00Commented Feb 6, 2014 at 10:17