HackerRank "Save Humanity"

Question 1

The "Save Humanity" problem on Hackerrank asks us to:

... find all substrings in the patient DNA that either exactly matches the virus DNA, or has at most one mismatch.

For example: "aa" and "aa" are matching, "ab" and "aa" are matching, while "ab" and "ba" are not.

Input Format

The first line contains the number of test cases T. T test cases follow. Each case contains two strings P(Patient DNA) and V(Virus DNA) separated by space.

Output Format

Output T lines, one corresponding to each test case. For each test case, output a space delimited list of starting indices (0 indexed) of substrings of P which are matching with V according to the condition mentioned above. The indices have to be in an increasing order. If there is no matching output No Match!.

In a nutshell, it requires me to compare two strings and find the beginning index value for all occurrences of the second substring in the first. The catch is that upto one mismatch in the substring is allowed.

I've coded the following solution which seems to work well for the first 3 testcases. The problem is it timed-out after the 3rd test case. I can't seem to optimize it any further than what I've already done unless my logic is somehow overly complicated or I'm not using enough in-built methods to quicken things up.

inp = inp.split('\n')
del inp[0]
for h in inp:
 P,sep,V = str(h).partition(' ')
 upperlimit = len(P) - len(V) + 1
 if P == V:
 print("0")
 continue
 else:
 matches=0
 for i in range(0,upperlimit):
 Psub = P[i:i+len(V)]
 if Psub == V: # Check if they match exactly
 print(str(i) + " ", end='')
 matches += 1
 continue # Just proceed with the next iteration of loop if they match
 mismatches = 0
 for j in range(0,len(V)): # If we hit this code then, we're about to see how many characters mismatch in the substring
 if Psub[j] != V[j]:
 mismatches += 1
 if mismatches>=2: # No point checking the substrings alphabet by alphabet any further, more than 1 mismatch present here.
 break
 if mismatches<2:
 print(str(i)+" ",end='')
 matches += 1
 if matches==0: # This is possible ONLY if none of the above conditions were ever met.
 print("No Match!",end='')
 print()

Can someone please tell me why code is slow? Also, let me know if you want me to comment more of my code so it's easier to understand. The link to the problem description should be quite useful in understanding the problem.

Question 2

How do you build inp in the first place?

Question 3

Various nitpicks

range(0, x) is better written as range(x)
print(str(i) + " ", end='') is equivalent to print(i, end=' ') but slower
if matches == 0 reads better as if not matches
You compute len(V) several times, just store its value in a variable.
Given that you don't use sep, and that h should already be a string, P,sep,V = str(h).partition(' ') is better written as P, V = h.split()

Special cases aren't special enough to break the rules

Checking for equality between the patient sub-string and the virus is just a sub-case of counting the differences between them (and finding it is 0). Same for comparing the entire patient string and the virus. All are just sub-cases of:

 for i in range(0,upperlimit):
 Psub = P[i:i+len(V)]
 for j in range(0,len(V)): # If we hit this code then, we're about to see how many characters mismatch in the substring
 if Psub[j] != V[j]:
 mismatches += 1
 if mismatches>=2: # No point checking the substrings alphabet by alphabet any further, more than 1 mismatch present here.
 break

Besides, it is to be expected that there will be much more mismatch or fuzzy-matching when checking the substrings than there will be exact matches. So the exact comparison is just slowing you down.

Moreover, you can abuse the fact that booleans are integers and use the sum builtins to compute the number of differences faster:

 for i in range(upperlimit):
 Psub = P[i:i+len(V)]
 mismatches = sum(Psub[j] != V[j] for j in range(len(V)))

Once again, this can be optimized by using zip instead of retrieving letters with their indices in the string:

 for i in range(upperlimit):
 Psub = P[i:i+len(V)]
 mismatches = sum(p != v for p, v in zip(Psub, V))

The advantage of using zip being that the iteration will stop when reaching the end of the shortest string. We can use that to simplify the writting to:

 for i in range(upperlimit):
 mismatches = sum(p != v for p, v in zip(P[i:], V))

Separation of concerns

As it stand, your code does 3 things at once:

iterate over the input data
compute the index of "matching" substrings
print such indexes

You should make functions to separate concerns and make it easier to re-use/test. A basic layout could look like:

def analyze(patient, virus):
 # build indexes where there is a match
def main():
 # iterate over lines in input
 for line in input_string:
 patient, virus = line.split()
 matches = analyze(patient, virus)
 # print output based on `matches`

It makes it more clear to follow what is going on. You then just have to figure out how to return meaningful values from analyze. First thoughts should be to build a list and return it:

def analyze(patient, virus):
 matches = []
 for i in range(len(patient) - len(virus) + 1):
 if sum(p != v for p, v in zip(patient[i:], virus)) < 2:
 matches.append(i)
 return matches
def main():
 # iterate over lines in input
 for line in input_string:
 patient, virus = line.split()
 matches = analyze(patient, virus)
 if not matches:
 print('No match!')
 else:
 print(' '.join(map(str, matches)))

Or you could turn analyze into a generator and convert its computation into a list when calling it. It let you avoid explicitly calling append:

Proposed improvements

def analyze(patient, virus):
 for begin in range(len(patient) - len(virus) + 1):
 if sum(p != v for p, v in zip(patient[begin:], virus)) < 2:
 yield begin
if __name__ == '__main__':
 T = int(input())
 for _ in range(T):
 patient, virus = input().split()
 matches = list(analyze(patient, virus))
 if not matches:
 print('No Match!')
 else:
 print(' '.join(map(str, matches)))

The advantage of using a generator is that you don't even have to convert it to a list, your original prints work as well:

 for _ in range(T):
 patient, virus = input().split()
 matches = 0
 for index in analyze(patient, virus):
 print(index, end=' ')
 matches += 1
 if not matches:
 print('No match!')
 else:
 print()

Or you can convert the matches on the fly:

 for _ in range(T):
 patient, virus = input().split()
 matches = ' '.join(map(str, analyze(patient, virus)))
 if not matches:
 matches = 'No match!'
 print(matches)

Question 4

Thanks @Mathias for your answer, just one doubt though, in one of your improvements you've recommended Psub = P[i:i+len(V)] be replaced with P[i:] . Are the two even equivalent ? Psub and P[i:] will be of different lengths right ? Search for this sentence - "The advantage of using zip being that the iteration will stop when reaching the end of the shortest string. We can use that to simplify the writting to:" to know what I'm on about.

Question 5

@DhiwakarRavikumar Thing is that you’re using P[i:i+len(V)] so that Psub and V have the same length. But you don't really need to since j will stop when reaching the end of V. Same goes for zip, it will stop generating pairs of letters when reaching the end of the shortest string, which will always be V (or P[i:] for the last i) because of the range(upperlimit).

Question 6

@DhiwakarRavikumar And since slicing a string is not really dependent on the length of the substring compared to other overheads, it is better to avoid throwing a few useless computation in here.

score 1 · Accepted Answer · 2016-06-19 11:54:12Z

Various nitpicks

range(0, x) is better written as range(x)
print(str(i) + " ", end='') is equivalent to print(i, end=' ') but slower
if matches == 0 reads better as if not matches
You compute len(V) several times, just store its value in a variable.
Given that you don't use sep, and that h should already be a string, P,sep,V = str(h).partition(' ') is better written as P, V = h.split()

Special cases aren't special enough to break the rules

Checking for equality between the patient sub-string and the virus is just a sub-case of counting the differences between them (and finding it is 0). Same for comparing the entire patient string and the virus. All are just sub-cases of:

 for i in range(0,upperlimit):
 Psub = P[i:i+len(V)]
 for j in range(0,len(V)): # If we hit this code then, we're about to see how many characters mismatch in the substring
 if Psub[j] != V[j]:
 mismatches += 1
 if mismatches>=2: # No point checking the substrings alphabet by alphabet any further, more than 1 mismatch present here.
 break

Besides, it is to be expected that there will be much more mismatch or fuzzy-matching when checking the substrings than there will be exact matches. So the exact comparison is just slowing you down.

Moreover, you can abuse the fact that booleans are integers and use the sum builtins to compute the number of differences faster:

 for i in range(upperlimit):
 Psub = P[i:i+len(V)]
 mismatches = sum(Psub[j] != V[j] for j in range(len(V)))

Once again, this can be optimized by using zip instead of retrieving letters with their indices in the string:

 for i in range(upperlimit):
 Psub = P[i:i+len(V)]
 mismatches = sum(p != v for p, v in zip(Psub, V))

The advantage of using zip being that the iteration will stop when reaching the end of the shortest string. We can use that to simplify the writting to:

 for i in range(upperlimit):
 mismatches = sum(p != v for p, v in zip(P[i:], V))

Separation of concerns

As it stand, your code does 3 things at once:

iterate over the input data
compute the index of "matching" substrings
print such indexes

You should make functions to separate concerns and make it easier to re-use/test. A basic layout could look like:

def analyze(patient, virus):
 # build indexes where there is a match
def main():
 # iterate over lines in input
 for line in input_string:
 patient, virus = line.split()
 matches = analyze(patient, virus)
 # print output based on `matches`

It makes it more clear to follow what is going on. You then just have to figure out how to return meaningful values from analyze. First thoughts should be to build a list and return it:

def analyze(patient, virus):
 matches = []
 for i in range(len(patient) - len(virus) + 1):
 if sum(p != v for p, v in zip(patient[i:], virus)) < 2:
 matches.append(i)
 return matches
def main():
 # iterate over lines in input
 for line in input_string:
 patient, virus = line.split()
 matches = analyze(patient, virus)
 if not matches:
 print('No match!')
 else:
 print(' '.join(map(str, matches)))

Or you could turn analyze into a generator and convert its computation into a list when calling it. It let you avoid explicitly calling append:

Proposed improvements

def analyze(patient, virus):
 for begin in range(len(patient) - len(virus) + 1):
 if sum(p != v for p, v in zip(patient[begin:], virus)) < 2:
 yield begin
if __name__ == '__main__':
 T = int(input())
 for _ in range(T):
 patient, virus = input().split()
 matches = list(analyze(patient, virus))
 if not matches:
 print('No Match!')
 else:
 print(' '.join(map(str, matches)))

The advantage of using a generator is that you don't even have to convert it to a list, your original prints work as well:

 for _ in range(T):
 patient, virus = input().split()
 matches = 0
 for index in analyze(patient, virus):
 print(index, end=' ')
 matches += 1
 if not matches:
 print('No match!')
 else:
 print()

Or you can convert the matches on the fly:

 for _ in range(T):
 patient, virus = input().split()
 matches = ' '.join(map(str, analyze(patient, virus)))
 if not matches:
 matches = 'No match!'
 print(matches)

Thanks @Mathias for your answer, just one doubt though, in one of your improvements you've recommended Psub = P[i:i+len(V)] be replaced with P[i:] . Are the two even equivalent ? Psub and P[i:] will be of different lengths right ? Search for this sentence - "The advantage of using zip being that the iteration will stop when reaching the end of the shortest string. We can use that to simplify the writting to:" to know what I'm on about.
@DhiwakarRavikumar Thing is that you’re using P[i:i+len(V)] so that Psub and V have the same length. But you don't really need to since j will stop when reaching the end of V. Same goes for zip, it will stop generating pairs of letters when reaching the end of the shortest string, which will always be V (or P[i:] for the last i) because of the range(upperlimit).
@DhiwakarRavikumar And since slicing a string is not really dependent on the length of the substring compared to other overheads, it is better to avoid throwing a few useless computation in here.

Stack Exchange Network

HackerRank "Save Humanity"

Input Format

Output Format

1 Answer 1

Various nitpicks

Special cases aren't special enough to break the rules

Separation of concerns

Proposed improvements

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

HackerRank "Save Humanity"

Input Format

Output Format

1 Answer 1

Various nitpicks

Special cases aren't special enough to break the rules

Separation of concerns

Proposed improvements

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions