The minimum window substring problem from leetcode.com asks:
Given a string S and a string T, find the minimum window in S which will contain all the characters in T in complexity \$O(n)\$.
For example,
S =
"ADOBECODEBANC"
T ="ABC"
Minimum window is
"BANC"
.
I post two versions of code. The differences between two versions of code is, the first one will move the start index when there is a match, the 2nd version will not move the start index when there is a move.
Wondering if logics are both correct? If both have the correct logic, which one do you think has better performance? I have did basic testing and post my test cases below. Using Python 2.7.
from collections import defaultdict
import sys
def minWindow(source, pattern):
targetCount = defaultdict(int)
currentCount = defaultdict(int)
for ch in pattern:
targetCount[ch] += 1
start = 0
validCount = 0
minSize = sys.maxint
for index in range(len(source)):
if source[index] not in targetCount:
continue
if currentCount[source[index]] < targetCount[source[index]]:
validCount += 1
currentCount[source[index]] += 1
while validCount == len(targetCount):
if source[start] not in targetCount:
start += 1
elif currentCount[source[start]] > targetCount[source[start]]:
currentCount[source[start]] -= 1
start += 1
else:
minSize = min(minSize, index-start+1)
currentCount[source[start]] -= 1
validCount -= 1
start += 1
return (minSize, start-1)
def minWindowv2(source, pattern):
targetCount = defaultdict(int)
currentCount = defaultdict(int)
for ch in pattern:
targetCount[ch] += 1
start = 0
validCount = 0
minSize = sys.maxint
for index in range(len(source)):
if source[index] not in targetCount:
continue
if currentCount[source[index]] < targetCount[source[index]]:
validCount += 1
currentCount[source[index]] += 1
while validCount == len(targetCount):
if source[start] not in targetCount:
start += 1
elif currentCount[source[start]] > targetCount[source[start]]:
currentCount[source[start]] -= 1
start += 1
else:
minSize = min(minSize, index-start+1)
break
return (minSize, start)
if __name__ == "__main__":
print minWindow("ADOBECODEBANC","ABC")
print minWindowv2("ADOBECODEBANC", "ABC")
1 Answer 1
1. Review
I don't like comparative reviews, so I only looked in detail at the first version of your code. Possibly some of these comments apply to the other version too.
The code does not return the correct answer if there are repeated letters in
pattern
. For example, given S ='ABCA'
and T ='AA'
, the minimum window should be the whole string. But:>>> minWindow('ABCA', 'AA') (1, 3)
To fix this, use
len(pattern)
instead oflen(targetCount)
.It would be better to raise an exception if no window is found rather than returning the meaningless result
(4294967295, -1)
.sys.maxint
is not portable to Python 3. If you need a starting value that's bigger than any integer, usefloat('inf')
. But it's possible to reorganize the code so that there's no need for a starting value — see the revised code below.Instead of
defaultdict(int)
, useCounter
, which you can initialize directly from an iterable:targetCount = Counter(pattern)
Instead of iterating over indexes into
source
, useenumerate
:for end, c in enumerate(source): if c not in targetCount: continue if currentCount[c] < targetCount[c]: validCount += 1 currentCount[c] += 1 # ... and so on ...
This avoids lots of lookups of
source[index]
.There's repetition of lines like
start += 1
andcurrentCount[source[start]] -= 1
. It would be easy to reorder the code so that each line appears only once.Similarly, instead of maintaining and incrementing a
start
index, create an iterable that enumerates the source string:window_start = enumerate(source)
and call
next
each time you need another character from it:start, c = next(window_start) if c not in targetCount: continue # ... and so on ...
This avoids lots of lookups of
source[start]
.The inner loop is:
while validCount == len(targetCount):
but
validCount
does not change on every loop iteration (only on loop iterations where the start character causes one of the current counts to drop below its target). So some of these tests are wasted. Instead, you could loop using:for start, c in window_start:
and then exit the loop only after reducing
validCount
:validCount -= 1 if validCount < len(pattern): break
The test:
if source[index] not in targetCount: continue
is not necessary — the algorithm works just as well without it. And similarly for the corresponding test in the inner loop.
2. Revised code
In this implementation I've tried to emphasize the symmetry between the outer and the inner loop. Really what we have here is a pair of coroutines, each iterating over the input, and yielding to the other one at the appropriate point.
from collections import Counter
def min_window(source, pattern):
"""Return the shortest substring of source that contains all the
characters in pattern (with the required multiplicity). Raise
ValueError if there is no such substring.
"""
def windows():
pattern_length = len(pattern)
target = Counter(pattern) # character -> required count
current = Counter() # character -> count in current window
matched = 0 # matched characters in current window
window_start, window_end = enumerate(source), enumerate(source)
for end, c in window_end:
current[c] += 1
if current[c] <= target[c]:
matched += 1
if matched == pattern_length:
for start, c in window_start:
current[c] -= 1
if current[c] < target[c]:
matched -= 1
if matched < pattern_length:
yield end + 1 - start, start
break
length, start = min(windows())
return source[start: start + length]
-
\$\begingroup\$ Thanks Gareth for the excellent comments and vote up. One quick question is, what puzzled me most in my original implementation is, whether I should move start position further when there is a match, or not. Wondering your comments. It seems you do not move start position when there is a match, but I think whether move start could have better performance and less positions to evaluate? Please feel free to correct me. Thanks. \$\endgroup\$Lin Ma– Lin Ma2016年06月16日 21:59:21 +00:00Commented Jun 16, 2016 at 21:59
-
2\$\begingroup\$ @LinMa: It's better to update
start
every time around the inner loop (as in my code) otherwise there might be extra unnecessary iterations of the inner loop. This should be clear from the symmetry of the code: the whole purpose of the inner loop is to updatestart
(just as the purpose of the outer loop is to updateend
). \$\endgroup\$Gareth Rees– Gareth Rees2016年06月16日 22:12:04 +00:00Commented Jun 16, 2016 at 22:12 -
\$\begingroup\$ Thanks Gareth, vote up. Actually I am asking whether it is better to move
start
tostart+1
when there is a match, Just to confirm my understanding is correct, for my example, "ADOBECODEBANC", when there is a match for sub-string, "ADOBEC", start point to the first characterA
, then it is better to move/increase start to the 2nd character (other than keep start pointing toA
), is that correct understanding? \$\endgroup\$Lin Ma– Lin Ma2016年06月16日 22:15:38 +00:00Commented Jun 16, 2016 at 22:15 -
1\$\begingroup\$ @LinMa: Yes, that's right — once you've found
"ABODEC"
, you know that it's pointless to consider any other substrings starting with the first characterA
, because they must all be longer than"ABODEC"
, and so none of them can be the minimum. \$\endgroup\$Gareth Rees– Gareth Rees2016年06月16日 22:17:58 +00:00Commented Jun 16, 2016 at 22:17
Explore related questions
See similar questions with these tags.