Minimum window substring

Question 1

The minimum window substring problem from leetcode.com asks:

Given a string S and a string T, find the minimum window in S which will contain all the characters in T in complexity \$O(n)\$.

For example,

S = "ADOBECODEBANC"
T = "ABC"

Minimum window is "BANC".

I post two versions of code. The differences between two versions of code is, the first one will move the start index when there is a match, the 2nd version will not move the start index when there is a move.

Wondering if logics are both correct? If both have the correct logic, which one do you think has better performance? I have did basic testing and post my test cases below. Using Python 2.7.

from collections import defaultdict
import sys
def minWindow(source, pattern):
 targetCount = defaultdict(int)
 currentCount = defaultdict(int)
 for ch in pattern:
 targetCount[ch] += 1
 start = 0
 validCount = 0
 minSize = sys.maxint
 for index in range(len(source)):
 if source[index] not in targetCount:
 continue
 if currentCount[source[index]] < targetCount[source[index]]:
 validCount += 1
 currentCount[source[index]] += 1
 while validCount == len(targetCount):
 if source[start] not in targetCount:
 start += 1
 elif currentCount[source[start]] > targetCount[source[start]]:
 currentCount[source[start]] -= 1
 start += 1
 else:
 minSize = min(minSize, index-start+1)
 currentCount[source[start]] -= 1
 validCount -= 1
 start += 1
 return (minSize, start-1)
def minWindowv2(source, pattern):
 targetCount = defaultdict(int)
 currentCount = defaultdict(int)
 for ch in pattern:
 targetCount[ch] += 1
 start = 0
 validCount = 0
 minSize = sys.maxint
 for index in range(len(source)):
 if source[index] not in targetCount:
 continue
 if currentCount[source[index]] < targetCount[source[index]]:
 validCount += 1
 currentCount[source[index]] += 1
 while validCount == len(targetCount):
 if source[start] not in targetCount:
 start += 1
 elif currentCount[source[start]] > targetCount[source[start]]:
 currentCount[source[start]] -= 1
 start += 1
 else:
 minSize = min(minSize, index-start+1)
 break
 return (minSize, start)
if __name__ == "__main__":
 print minWindow("ADOBECODEBANC","ABC")
 print minWindowv2("ADOBECODEBANC", "ABC")

Question 2

1. Review

I don't like comparative reviews, so I only looked in detail at the first version of your code. Possibly some of these comments apply to the other version too.

The code does not return the correct answer if there are repeated letters in pattern. For example, given S = 'ABCA' and T = 'AA', the minimum window should be the whole string. But:
```
>>> minWindow('ABCA', 'AA')
(1, 3)
```
To fix this, use len(pattern) instead of len(targetCount).
It would be better to raise an exception if no window is found rather than returning the meaningless result (4294967295, -1).
sys.maxint is not portable to Python 3. If you need a starting value that's bigger than any integer, use float('inf'). But it's possible to reorganize the code so that there's no need for a starting value — see the revised code below.
Instead of defaultdict(int), use Counter, which you can initialize directly from an iterable:
```
targetCount = Counter(pattern)
```

Instead of iterating over indexes into source, use enumerate:

for end, c in enumerate(source):
 if c not in targetCount:
 continue
 if currentCount[c] < targetCount[c]:
 validCount += 1
 currentCount[c] += 1
 # ... and so on ...

This avoids lots of lookups of source[index].

There's repetition of lines like start += 1 and currentCount[source[start]] -= 1. It would be easy to reorder the code so that each line appears only once.
Similarly, instead of maintaining and incrementing a start index, create an iterable that enumerates the source string:
```
window_start = enumerate(source)
```
and call next each time you need another character from it:
```
start, c = next(window_start)
if c not in targetCount:
 continue
# ... and so on ...
```
This avoids lots of lookups of source[start].
The inner loop is:
```
while validCount == len(targetCount):
```
but validCount does not change on every loop iteration (only on loop iterations where the start character causes one of the current counts to drop below its target). So some of these tests are wasted. Instead, you could loop using:
```
for start, c in window_start:
```
and then exit the loop only after reducing validCount:
```
 validCount -= 1
 if validCount < len(pattern):
 break
```
The test:
```
if source[index] not in targetCount:
 continue
```
is not necessary — the algorithm works just as well without it. And similarly for the corresponding test in the inner loop.

2. Revised code

In this implementation I've tried to emphasize the symmetry between the outer and the inner loop. Really what we have here is a pair of coroutines, each iterating over the input, and yielding to the other one at the appropriate point.

from collections import Counter
def min_window(source, pattern):
 """Return the shortest substring of source that contains all the
 characters in pattern (with the required multiplicity). Raise
 ValueError if there is no such substring.
 """
 def windows():
 pattern_length = len(pattern)
 target = Counter(pattern) # character -> required count
 current = Counter() # character -> count in current window
 matched = 0 # matched characters in current window
 window_start, window_end = enumerate(source), enumerate(source)
 for end, c in window_end:
 current[c] += 1
 if current[c] <= target[c]:
 matched += 1
 if matched == pattern_length:
 for start, c in window_start:
 current[c] -= 1
 if current[c] < target[c]:
 matched -= 1
 if matched < pattern_length:
 yield end + 1 - start, start
 break
 length, start = min(windows())
 return source[start: start + length]

Question 3

Thanks Gareth for the excellent comments and vote up. One quick question is, what puzzled me most in my original implementation is, whether I should move start position further when there is a match, or not. Wondering your comments. It seems you do not move start position when there is a match, but I think whether move start could have better performance and less positions to evaluate? Please feel free to correct me. Thanks.

Question 4

@LinMa: It's better to update start every time around the inner loop (as in my code) otherwise there might be extra unnecessary iterations of the inner loop. This should be clear from the symmetry of the code: the whole purpose of the inner loop is to update start (just as the purpose of the outer loop is to update end).

Question 5

Thanks Gareth, vote up. Actually I am asking whether it is better to move start to start+1 when there is a match, Just to confirm my understanding is correct, for my example, "ADOBECODEBANC", when there is a match for sub-string, "ADOBEC", start point to the first character A, then it is better to move/increase start to the 2nd character (other than keep start pointing to A), is that correct understanding?

Question 6

@LinMa: Yes, that's right — once you've found "ABODEC", you know that it's pointless to consider any other substrings starting with the first character A, because they must all be longer than "ABODEC", and so none of them can be the minimum.

score 4 · Accepted Answer · 2016-06-16 21:17:28Z

1. Review

I don't like comparative reviews, so I only looked in detail at the first version of your code. Possibly some of these comments apply to the other version too.

The code does not return the correct answer if there are repeated letters in pattern. For example, given S = 'ABCA' and T = 'AA', the minimum window should be the whole string. But:
```
>>> minWindow('ABCA', 'AA')
(1, 3)
```
To fix this, use len(pattern) instead of len(targetCount).
It would be better to raise an exception if no window is found rather than returning the meaningless result (4294967295, -1).
sys.maxint is not portable to Python 3. If you need a starting value that's bigger than any integer, use float('inf'). But it's possible to reorganize the code so that there's no need for a starting value — see the revised code below.
Instead of defaultdict(int), use Counter, which you can initialize directly from an iterable:
```
targetCount = Counter(pattern)
```

Instead of iterating over indexes into source, use enumerate:

for end, c in enumerate(source):
 if c not in targetCount:
 continue
 if currentCount[c] < targetCount[c]:
 validCount += 1
 currentCount[c] += 1
 # ... and so on ...

This avoids lots of lookups of source[index].

There's repetition of lines like start += 1 and currentCount[source[start]] -= 1. It would be easy to reorder the code so that each line appears only once.
Similarly, instead of maintaining and incrementing a start index, create an iterable that enumerates the source string:
```
window_start = enumerate(source)
```
and call next each time you need another character from it:
```
start, c = next(window_start)
if c not in targetCount:
 continue
# ... and so on ...
```
This avoids lots of lookups of source[start].
The inner loop is:
```
while validCount == len(targetCount):
```
but validCount does not change on every loop iteration (only on loop iterations where the start character causes one of the current counts to drop below its target). So some of these tests are wasted. Instead, you could loop using:
```
for start, c in window_start:
```
and then exit the loop only after reducing validCount:
```
 validCount -= 1
 if validCount < len(pattern):
 break
```
The test:
```
if source[index] not in targetCount:
 continue
```
is not necessary — the algorithm works just as well without it. And similarly for the corresponding test in the inner loop.

2. Revised code

In this implementation I've tried to emphasize the symmetry between the outer and the inner loop. Really what we have here is a pair of coroutines, each iterating over the input, and yielding to the other one at the appropriate point.

from collections import Counter
def min_window(source, pattern):
 """Return the shortest substring of source that contains all the
 characters in pattern (with the required multiplicity). Raise
 ValueError if there is no such substring.
 """
 def windows():
 pattern_length = len(pattern)
 target = Counter(pattern) # character -> required count
 current = Counter() # character -> count in current window
 matched = 0 # matched characters in current window
 window_start, window_end = enumerate(source), enumerate(source)
 for end, c in window_end:
 current[c] += 1
 if current[c] <= target[c]:
 matched += 1
 if matched == pattern_length:
 for start, c in window_start:
 current[c] -= 1
 if current[c] < target[c]:
 matched -= 1
 if matched < pattern_length:
 yield end + 1 - start, start
 break
 length, start = min(windows())
 return source[start: start + length]

Thanks Gareth for the excellent comments and vote up. One quick question is, what puzzled me most in my original implementation is, whether I should move start position further when there is a match, or not. Wondering your comments. It seems you do not move start position when there is a match, but I think whether move start could have better performance and less positions to evaluate? Please feel free to correct me. Thanks.
@LinMa: It's better to update start every time around the inner loop (as in my code) otherwise there might be extra unnecessary iterations of the inner loop. This should be clear from the symmetry of the code: the whole purpose of the inner loop is to update start (just as the purpose of the outer loop is to update end).
Thanks Gareth, vote up. Actually I am asking whether it is better to move start to start+1 when there is a match, Just to confirm my understanding is correct, for my example, "ADOBECODEBANC", when there is a match for sub-string, "ADOBEC", start point to the first character A, then it is better to move/increase start to the 2nd character (other than keep start pointing to A), is that correct understanding?
@LinMa: Yes, that's right — once you've found "ABODEC", you know that it's pointless to consider any other substrings starting with the first character A, because they must all be longer than "ABODEC", and so none of them can be the minimum.

Stack Exchange Network

Minimum window substring

1 Answer 1

1. Review

2. Revised code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Minimum window substring

1 Answer 1

1. Review

2. Revised code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions