2
\$\begingroup\$

I want to build a function which finds the longest substring that is common between two large strings. I want to do this using the first two functions. I'm running it on very large strings (containing more than million characters) and it's taking a lot of time. How can I optimize the function findlongest?

def hash_sequence(string, k):
 dictionary={}
 for i in range(len(string)-(k-1)):
 sequence = string[i:i+k]
 dictionary.setdefault(sequence,[]).append(i)
 return dictionary
def intersects(string1, string2, k): #what if k=0?
 dictionary = hash_sequence(string1, k)
 for i in range(len(string2)-1): #O(n) for sybstring in string
 if string2[i:i+k] in dictionary: #O(n
 return string2[i:i+k]
 return None
def findlongest(string1,string2):
 if intersects(string1,string2,1)==None:
 return None
 for i in range(2,min(len(string1),len(string2))):
 if intersects(string1,string2,i)==None:
 return intersects(string1,string2,i-1),len(intersects(string1,string2,i-1))
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked May 14, 2014 at 18:35
\$\endgroup\$

2 Answers 2

3
\$\begingroup\$

One big problem is that when you reach the longest common sequence and are ready to return, you re-run the previous iteration twice. Variables are your friends:

def find_longest(string1, string2):
 longest_seq = None
 for i in range(1, min(len(string1), len(string2))):
 # Store the current iteration's intersection
 current_seq = intersects(string1, string2, i)
 # If this was one character too long, return.
 # Else, a longer intersection was found. Store it.
 if current_seq == None:
 return longest_seq, i-1
 else:
 longest_seq = current_seq
 # If we get here, the strings were the same.
 # For consistency, return longest_seq and its length.
 return longest_seq, len(longest_seq)

By adding in the two variables, we were able to remove the if-statement at the beginning and now only have to call intersects() once in the for-loop.

Also, take a look at PEP8, the official Python style guide. It will help you fit Pythonic conventions more closely.


If you wanted to use a more complex, and quite possibly faster, solution, take a look into implementing a binary search algorithm. They provide search times that are very efficient for large data sets.

answered May 14, 2014 at 19:10
\$\endgroup\$
1
\$\begingroup\$

Why don't you use the dynamic programming version of the code. Its complexity is O(nm) where n and m are the lengths of two strings. Yours seem to be a lot more complicated one. Below is a python code which implements the O(nm) solution of the problem.

def LCS(stringA, stringB):
lenStringA = 1 + len(stringA)
lenStringB = 1 + len(stringB)
matrix = [[0] * (lenStringB) for i in range(lenStringA)]
substringLength = 0
endIndex = 0
for aIndex in range(1, lenStringA):
 for bIndex in range(1, lenStringB):
 if stringA[aIndex - 1] == stringB[bIndex - 1]:
 matrix[aIndex][bIndex] = matrix[aIndex - 1][bIndex - 1] + 1
 if matrix[aIndex][bIndex] > substringLength:
 substringLength = matrix[aIndex][bIndex]
 endIndex = aIndex
 else:
 matrix[aIndex][bIndex] = 0
return stringA[endIndex - substringLength: endIndex]
answered May 14, 2014 at 21:13
\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.