I want to build a function which finds the longest substring that is common between two large strings. I want to do this using the first two functions.
I'm running it on very large strings (containing more than million characters) and it's taking a lot of time. How can I optimize the function findlongest?
def hash_sequence(string, k):
dictionary={}
for i in range(len(string)-(k-1)):
sequence = string[i:i+k]
dictionary.setdefault(sequence,[]).append(i)
return dictionary
def intersects(string1, string2, k): #what if k=0?
dictionary = hash_sequence(string1, k)
for i in range(len(string2)-1): #O(n) for sybstring in string
if string2[i:i+k] in dictionary: #O(n
return string2[i:i+k]
return None
def findlongest(string1,string2):
if intersects(string1,string2,1)==None:
return None
for i in range(2,min(len(string1),len(string2))):
if intersects(string1,string2,i)==None:
return intersects(string1,string2,i-1),len(intersects(string1,string2,i-1))
2 Answers 2
One big problem is that when you reach the longest common sequence and are ready to return, you re-run the previous iteration twice. Variables are your friends:
def find_longest(string1, string2):
longest_seq = None
for i in range(1, min(len(string1), len(string2))):
# Store the current iteration's intersection
current_seq = intersects(string1, string2, i)
# If this was one character too long, return.
# Else, a longer intersection was found. Store it.
if current_seq == None:
return longest_seq, i-1
else:
longest_seq = current_seq
# If we get here, the strings were the same.
# For consistency, return longest_seq and its length.
return longest_seq, len(longest_seq)
By adding in the two variables, we were able to remove the if-statement at the beginning and now only have to call intersects() once in the for-loop.
Also, take a look at PEP8, the official Python style guide. It will help you fit Pythonic conventions more closely.
If you wanted to use a more complex, and quite possibly faster, solution, take a look into implementing a binary search algorithm. They provide search times that are very efficient for large data sets.
Why don't you use the dynamic programming version of the code. Its complexity is O(nm) where n and m are the lengths of two strings. Yours seem to be a lot more complicated one. Below is a python code which implements the O(nm) solution of the problem.
def LCS(stringA, stringB):
lenStringA = 1 + len(stringA)
lenStringB = 1 + len(stringB)
matrix = [[0] * (lenStringB) for i in range(lenStringA)]
substringLength = 0
endIndex = 0
for aIndex in range(1, lenStringA):
for bIndex in range(1, lenStringB):
if stringA[aIndex - 1] == stringB[bIndex - 1]:
matrix[aIndex][bIndex] = matrix[aIndex - 1][bIndex - 1] + 1
if matrix[aIndex][bIndex] > substringLength:
substringLength = matrix[aIndex][bIndex]
endIndex = aIndex
else:
matrix[aIndex][bIndex] = 0
return stringA[endIndex - substringLength: endIndex]