Finding the longest substring in common with two large strings

Question 1

I want to build a function which finds the longest substring that is common between two large strings. I want to do this using the first two functions. I'm running it on very large strings (containing more than million characters) and it's taking a lot of time. How can I optimize the function findlongest?

def hash_sequence(string, k):
 dictionary={}
 for i in range(len(string)-(k-1)):
 sequence = string[i:i+k]
 dictionary.setdefault(sequence,[]).append(i)
 return dictionary
def intersects(string1, string2, k): #what if k=0?
 dictionary = hash_sequence(string1, k)
 for i in range(len(string2)-1): #O(n) for sybstring in string
 if string2[i:i+k] in dictionary: #O(n
 return string2[i:i+k]
 return None
def findlongest(string1,string2):
 if intersects(string1,string2,1)==None:
 return None
 for i in range(2,min(len(string1),len(string2))):
 if intersects(string1,string2,i)==None:
 return intersects(string1,string2,i-1),len(intersects(string1,string2,i-1))

Question 2

One big problem is that when you reach the longest common sequence and are ready to return, you re-run the previous iteration twice. Variables are your friends:

def find_longest(string1, string2):
 longest_seq = None
 for i in range(1, min(len(string1), len(string2))):
 # Store the current iteration's intersection
 current_seq = intersects(string1, string2, i)
 # If this was one character too long, return.
 # Else, a longer intersection was found. Store it.
 if current_seq == None:
 return longest_seq, i-1
 else:
 longest_seq = current_seq
 # If we get here, the strings were the same.
 # For consistency, return longest_seq and its length.
 return longest_seq, len(longest_seq)

By adding in the two variables, we were able to remove the if-statement at the beginning and now only have to call intersects() once in the for-loop.

Also, take a look at PEP8, the official Python style guide. It will help you fit Pythonic conventions more closely.

If you wanted to use a more complex, and quite possibly faster, solution, take a look into implementing a binary search algorithm. They provide search times that are very efficient for large data sets.

Question 3

Why don't you use the dynamic programming version of the code. Its complexity is O(nm) where n and m are the lengths of two strings. Yours seem to be a lot more complicated one. Below is a python code which implements the O(nm) solution of the problem.

def LCS(stringA, stringB):
lenStringA = 1 + len(stringA)
lenStringB = 1 + len(stringB)
matrix = [[0] * (lenStringB) for i in range(lenStringA)]
substringLength = 0
endIndex = 0
for aIndex in range(1, lenStringA):
 for bIndex in range(1, lenStringB):
 if stringA[aIndex - 1] == stringB[bIndex - 1]:
 matrix[aIndex][bIndex] = matrix[aIndex - 1][bIndex - 1] + 1
 if matrix[aIndex][bIndex] > substringLength:
 substringLength = matrix[aIndex][bIndex]
 endIndex = aIndex
 else:
 matrix[aIndex][bIndex] = 0
return stringA[endIndex - substringLength: endIndex]

BeetDemGuise 4,22612 silver badges29 bronze badges · Answer 1 · 2014-05-14 19:10:40Z

One big problem is that when you reach the longest common sequence and are ready to return, you re-run the previous iteration twice. Variables are your friends:

def find_longest(string1, string2):
 longest_seq = None
 for i in range(1, min(len(string1), len(string2))):
 # Store the current iteration's intersection
 current_seq = intersects(string1, string2, i)
 # If this was one character too long, return.
 # Else, a longer intersection was found. Store it.
 if current_seq == None:
 return longest_seq, i-1
 else:
 longest_seq = current_seq
 # If we get here, the strings were the same.
 # For consistency, return longest_seq and its length.
 return longest_seq, len(longest_seq)

By adding in the two variables, we were able to remove the if-statement at the beginning and now only have to call intersects() once in the for-loop.

Also, take a look at PEP8, the official Python style guide. It will help you fit Pythonic conventions more closely.

If you wanted to use a more complex, and quite possibly faster, solution, take a look into implementing a binary search algorithm. They provide search times that are very efficient for large data sets.

Pranav Raj 4952 silver badges11 bronze badges · Answer 2 · 2014-05-14 21:13:43Z

Why don't you use the dynamic programming version of the code. Its complexity is O(nm) where n and m are the lengths of two strings. Yours seem to be a lot more complicated one. Below is a python code which implements the O(nm) solution of the problem.

def LCS(stringA, stringB):
lenStringA = 1 + len(stringA)
lenStringB = 1 + len(stringB)
matrix = [[0] * (lenStringB) for i in range(lenStringA)]
substringLength = 0
endIndex = 0
for aIndex in range(1, lenStringA):
 for bIndex in range(1, lenStringB):
 if stringA[aIndex - 1] == stringB[bIndex - 1]:
 matrix[aIndex][bIndex] = matrix[aIndex - 1][bIndex - 1] + 1
 if matrix[aIndex][bIndex] > substringLength:
 substringLength = matrix[aIndex][bIndex]
 endIndex = aIndex
 else:
 matrix[aIndex][bIndex] = 0
return stringA[endIndex - substringLength: endIndex]

Stack Exchange Network

Finding the longest substring in common with two large strings

2 Answers 2

You must log in to answer this question.

Hot Network Questions

Finding the longest substring in common with two large strings

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions