The "Save Humanity" problem on Hackerrank asks us to:
... find all substrings in the patient DNA that either exactly matches the virus DNA, or has at most one mismatch.
For example:
"aa"
and"aa"
are matching,"ab"
and"aa"
are matching, while"ab"
and"ba"
are not.Input Format
The first line contains the number of test cases T. T test cases follow. Each case contains two strings P(Patient DNA) and V(Virus DNA) separated by space.
Output Format
Output T lines, one corresponding to each test case. For each test case, output a space delimited list of starting indices (0 indexed) of substrings of P which are matching with V according to the condition mentioned above. The indices have to be in an increasing order. If there is no matching output No Match!.
In a nutshell, it requires me to compare two strings and find the beginning index value for all occurrences of the second substring in the first. The catch is that upto one mismatch in the substring is allowed.
I've coded the following solution which seems to work well for the first 3 testcases. The problem is it timed-out after the 3rd test case. I can't seem to optimize it any further than what I've already done unless my logic is somehow overly complicated or I'm not using enough in-built methods to quicken things up.
inp = inp.split('\n')
del inp[0]
for h in inp:
P,sep,V = str(h).partition(' ')
upperlimit = len(P) - len(V) + 1
if P == V:
print("0")
continue
else:
matches=0
for i in range(0,upperlimit):
Psub = P[i:i+len(V)]
if Psub == V: # Check if they match exactly
print(str(i) + " ", end='')
matches += 1
continue # Just proceed with the next iteration of loop if they match
mismatches = 0
for j in range(0,len(V)): # If we hit this code then, we're about to see how many characters mismatch in the substring
if Psub[j] != V[j]:
mismatches += 1
if mismatches>=2: # No point checking the substrings alphabet by alphabet any further, more than 1 mismatch present here.
break
if mismatches<2:
print(str(i)+" ",end='')
matches += 1
if matches==0: # This is possible ONLY if none of the above conditions were ever met.
print("No Match!",end='')
print()
Can someone please tell me why code is slow? Also, let me know if you want me to comment more of my code so it's easier to understand. The link to the problem description should be quite useful in understanding the problem.
1 Answer 1
Various nitpicks
range(0, x)
is better written asrange(x)
print(str(i) + " ", end='')
is equivalent toprint(i, end=' ')
but slowerif matches == 0
reads better asif not matches
- You compute
len(V)
several times, just store its value in a variable. - Given that you don't use
sep
, and thath
should already be a string,P,sep,V = str(h).partition(' ')
is better written asP, V = h.split()
Special cases aren't special enough to break the rules
Checking for equality between the patient sub-string and the virus is just a sub-case of counting the differences between them (and finding it is 0). Same for comparing the entire patient string and the virus. All are just sub-cases of:
for i in range(0,upperlimit):
Psub = P[i:i+len(V)]
for j in range(0,len(V)): # If we hit this code then, we're about to see how many characters mismatch in the substring
if Psub[j] != V[j]:
mismatches += 1
if mismatches>=2: # No point checking the substrings alphabet by alphabet any further, more than 1 mismatch present here.
break
Besides, it is to be expected that there will be much more mismatch or fuzzy-matching when checking the substrings than there will be exact matches. So the exact comparison is just slowing you down.
Moreover, you can abuse the fact that booleans are integers and use the sum
builtins to compute the number of differences faster:
for i in range(upperlimit):
Psub = P[i:i+len(V)]
mismatches = sum(Psub[j] != V[j] for j in range(len(V)))
Once again, this can be optimized by using zip
instead of retrieving letters with their indices in the string:
for i in range(upperlimit):
Psub = P[i:i+len(V)]
mismatches = sum(p != v for p, v in zip(Psub, V))
The advantage of using zip
being that the iteration will stop when reaching the end of the shortest string. We can use that to simplify the writting to:
for i in range(upperlimit):
mismatches = sum(p != v for p, v in zip(P[i:], V))
Separation of concerns
As it stand, your code does 3 things at once:
- iterate over the input data
- compute the index of "matching" substrings
- print such indexes
You should make functions to separate concerns and make it easier to re-use/test. A basic layout could look like:
def analyze(patient, virus):
# build indexes where there is a match
def main():
# iterate over lines in input
for line in input_string:
patient, virus = line.split()
matches = analyze(patient, virus)
# print output based on `matches`
It makes it more clear to follow what is going on. You then just have to figure out how to return meaningful values from analyze
. First thoughts should be to build a list and return it:
def analyze(patient, virus):
matches = []
for i in range(len(patient) - len(virus) + 1):
if sum(p != v for p, v in zip(patient[i:], virus)) < 2:
matches.append(i)
return matches
def main():
# iterate over lines in input
for line in input_string:
patient, virus = line.split()
matches = analyze(patient, virus)
if not matches:
print('No match!')
else:
print(' '.join(map(str, matches)))
Or you could turn analyze
into a generator and convert its computation into a list when calling it. It let you avoid explicitly calling append
:
Proposed improvements
def analyze(patient, virus):
for begin in range(len(patient) - len(virus) + 1):
if sum(p != v for p, v in zip(patient[begin:], virus)) < 2:
yield begin
if __name__ == '__main__':
T = int(input())
for _ in range(T):
patient, virus = input().split()
matches = list(analyze(patient, virus))
if not matches:
print('No Match!')
else:
print(' '.join(map(str, matches)))
The advantage of using a generator is that you don't even have to convert it to a list, your original print
s work as well:
for _ in range(T):
patient, virus = input().split()
matches = 0
for index in analyze(patient, virus):
print(index, end=' ')
matches += 1
if not matches:
print('No match!')
else:
print()
Or you can convert the matches on the fly:
for _ in range(T):
patient, virus = input().split()
matches = ' '.join(map(str, analyze(patient, virus)))
if not matches:
matches = 'No match!'
print(matches)
-
\$\begingroup\$ Thanks @Mathias for your answer, just one doubt though, in one of your improvements you've recommended Psub = P[i:i+len(V)] be replaced with P[i:] . Are the two even equivalent ? Psub and P[i:] will be of different lengths right ? Search for this sentence - "The advantage of using zip being that the iteration will stop when reaching the end of the shortest string. We can use that to simplify the writting to:" to know what I'm on about. \$\endgroup\$Dhiwakar Ravikumar– Dhiwakar Ravikumar2016年06月19日 14:06:19 +00:00Commented Jun 19, 2016 at 14:06
-
\$\begingroup\$ @DhiwakarRavikumar Thing is that you’re using
P[i:i+len(V)]
so thatPsub
andV
have the same length. But you don't really need to sincej
will stop when reaching the end ofV
. Same goes forzip
, it will stop generating pairs of letters when reaching the end of the shortest string, which will always beV
(orP[i:]
for the lasti
) because of therange(upperlimit)
. \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2016年06月19日 14:12:00 +00:00Commented Jun 19, 2016 at 14:12 -
\$\begingroup\$ @DhiwakarRavikumar And since slicing a string is not really dependent on the length of the substring compared to other overheads, it is better to avoid throwing a few useless computation in here. \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2016年06月19日 14:14:09 +00:00Commented Jun 19, 2016 at 14:14
Explore related questions
See similar questions with these tags.
inp
in the first place? \$\endgroup\$