I am trying to speed up a function to return the longest common substring.
The requirements:
- The function takes two strings of arbitrary length (although on average they will be less than 50 chars each)
- If two common substrings of the same length exist, it can return either
- Speed is the primary concern
This is what I have and it works according to my tests:
from os.path import commonprefix

class LongestCommonSubstr(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):
        try:
            substr_len = 0
            max_len = 0
            lcs = None
            for i, n in enumerate(self._suffix_str_array):
                if n[-1] != self._suffix_str_array[i+1][-1]:
                    substr = commonprefix([n, self._suffix_str_array[i+1]])
                    substr_len = len(substr)
                    if substr_len > max_len:
                        max_len = substr_len
                        lcs = substr
        except IndexError:
            pass
        return lcs
Can the code be made faster in pure Python? Is there a better option to differentiate the input strings when made into one sorted list than concatenating an ID to them?
- Have you considered difflib in the Python library? – dawg, Nov 11, 2015 at 0:08
2 Answers
OK then: since your concern is speed, let's track our progress with actual timing data. The first step is to run the code through the Python profiler. With the addition of a bit of driver code that just calls LongestCommonSubstr().longest_common_substr 10000 times, I get the following results:
$ python3.4 -m profile lcs_profile.py abcdeffghiklnopqr bdefghijklmnopmnop
1130008 function calls in 3.083 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 :0(__build_class__)
1 0.000 0.000 3.082 3.082 :0(exec)
1 0.000 0.000 0.000 0.000 :0(hasattr)
280000 0.257 0.000 0.257 0.000 :0(len)
260000 0.330 0.000 0.330 0.000 :0(max)
260000 0.350 0.000 0.350 0.000 :0(min)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 0.042 0.000 0.042 0.000 :0(sorted)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)
260000 1.067 0.000 1.746 0.000 genericpath.py:69(commonprefix)
1 0.022 0.022 3.082 3.082 lcs_profile.py:1(<module>)
20000 0.069 0.000 0.157 0.000 lcs_profile.py:11(_get_suffix_str)
20000 0.071 0.000 0.071 0.000 lcs_profile.py:13(<listcomp>)
10000 0.807 0.000 2.794 0.000 lcs_profile.py:15(_get_lcsubstr)
1 0.000 0.000 0.000 0.000 lcs_profile.py:3(LongestCommonSubstr)
10000 0.067 0.000 3.060 0.000 lcs_profile.py:4(__init__)
1 0.000 0.000 3.083 3.083 profile:0(<code object <module> at 0x10996f030, file "lcs_profile.py", line 1>)
0 0.000 0.000 profile:0(profiler)
The most important column is the second one, tottime, which marks the total amount of time consumed by each function (but not the functions it calls). The largest entries in that column are for the commonprefix function and the _get_lcsubstr method, so those are the spots to focus on when optimizing.
To keep track of our progress, I coupled your class with the following driver program which prints the execution time for 10000 runs:
def lcs_longest_match(a, b):
    return LongestCommonSubstr(a, b).longest_common_substr

if __name__ == '__main__':
    import timeit, sys
    print('lcs    ', timeit.timeit('lcs_longest_match(sys.argv[1], sys.argv[2])',
                                   setup='from __main__ import lcs_longest_match',
                                   number=10000))
(this is for Python 3). A first run on my computer gives
$ python3.4 lcs.py abcdeffghiklnopqr bdefghijklmnopmnop
lcs 0.49046793299203273
I'll start with _get_lcsubstr, since you wrote that code and it'll be easier to work through. The algorithm you use is to go through each pair of consecutive strings in the sorted suffix array, find the common prefix, and save that prefix only if it's longer than any common prefix already found. I can suggest a few improvements:
- Iterating over pairs of consecutive elements is a common task that has a fairly standard recipe, pairwise(iterable), given in the documentation for the itertools module. You can use the implementation from the more-itertools package if you want. This also lets you get rid of the try/except block (which probably didn't affect runtime much, but it helps code clarity).
- Python has a built-in function, max, to find the maximum value of an iterable. If you use it, the nuts and bolts of the loop, as well as the compare-and-store-if-greater process, get handled internally by the interpreter, which should be faster than doing them manually in pure Python.
- Repeatedly accessing an attribute of an object, namely self._suffix_str_array, is slower than accessing it once and storing it locally as a new variable.
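The pairwise recipe itself is tiny; here is the version given in the itertools documentation (since Python 3.10 it is also available directly as itertools.pairwise):

```python
from itertools import tee

def pairwise(iterable):
    # pairwise('ABCD') -> ('A', 'B'), ('B', 'C'), ('C', 'D')
    a, b = tee(iterable)   # two independent iterators over the input
    next(b, None)          # advance the second iterator by one step
    return zip(a, b)       # zip stops as soon as the shorter one ends
```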
Let's check the effect of these changes on the runtime of the driver program:
lcs 0.49046793299203273
optlcs1 0.4739605739887338
optlcs2 0.4895538759883493
optlcs3 0.4828952929965453
Not much of a change. Actually, using max is slower here, so let's discard that change.
Time now to take a closer look at commonprefix.
If you look at the source code of this function,

def commonprefix(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m:
        return ''
    s1 = min(m)
    s2 = max(m)
    for i, c in enumerate(s1):
        if c != s2[i]:
            return s1[:i]
    return s1
you see that it goes through some preliminary steps relating to the fact that it needs to handle a list of potentially several paths. You only ever have two strings to compare, so you can skip that - and in fact, if you look back at the profiling data, you'll see that these calls to max and min do take up a significant amount of time. Therefore, it makes sense to implement your own version of commonprefix without the max and min calls.
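A two-string replacement can simply inline the comparison loop. This sketch (the name is mine, not from the profiled script) relies on the two strings being consecutive entries of a sorted list, so the lexicographically smaller one can never index past the end of the larger one:

```python
def common_prefix_2(s1, s2):
    # Assumes s1 <= s2 lexicographically, as consecutive entries of a
    # sorted suffix array always are; then s2[i] is valid whenever
    # s1[:i] == s2[:i] and i < len(s1).
    for i, c in enumerate(s1):
        if c != s2[i]:
            return s1[:i]
    return s1
```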
This makes a big difference in the runtime, cutting it down by over 30%:
lcs 0.49046793299203273
optlcs3 0.4828952929965453
optlcs4 0.3215277789859101
If you think about it, you don't even really need the common prefix itself, except for the one string you actually return from _get_lcsubstr. You only need its length, so you can decide which substring is the longest. So instead of using a function that finds the full common prefix, just write one that gives you its length. You store the length, along with one of the strings, and at the end of _get_lcsubstr, use the length to trim the stored string.

def _get_lcsubstr(self):
    # initialize
    for s1, s2 in mt.pairwise(suffix_str_array):
        if s1[-1] != s2[-1]:
            substr_len = get_common_prefix_length(s1, s2)
            if substr_len > max_len:
                max_len = substr_len
                max_substr = s1
    return max_substr[:max_len]
This shaves another few percent off the runtime:
lcs 0.49046793299203273
optlcs4 0.3215277789859101
optlcs5 0.30241094100347254
If it's faster to avoid computing a substring in your commonprefix replacement, you might think of doing the same thing when you're finding the suffixes in the first place. In other words, instead of calculating all the suffixes of a string in _get_suffix_str, just make a list of (index, which_string) tuples to represent the suffixes. Then whenever you need to compare two suffixes, instead of taking a substring of the original string, you just start comparing characters at the required indices.

There are two problems with this in practice. First, Python isn't well suited to iterating from an arbitrary point in the middle of a string. In a language like C, where strings are character pointers, this would work out quite well, because you can jump into the middle of the string by advancing a pointer. But in Python, iterating from the middle of a string requires you to either start from the beginning and skip the first several characters, or bypass the whole iteration mechanism and use a for loop with an integer index to access characters inside the string (which involves more relatively inefficient pure-Python code). Second, you need to create the substrings anyway in order to sort them. If you try to use the substrings as comparison keys without actually storing them, the program spends a lot of time converting between a substring and its index.
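To make the trade-off concrete, an index-based prefix-length comparison might look like the following sketch (names are mine); note how much machinery it needs compared to a plain slice-and-compare:

```python
from itertools import islice

def prefix_len_from(s1, start1, s2, start2):
    """Length of the common prefix of s1[start1:] and s2[start2:],
    computed without building either substring."""
    n = 0
    # islice must internally skip start1/start2 characters before yielding,
    # which is exactly the "start from the beginning" cost described above
    for c1, c2 in zip(islice(s1, start1, None), islice(s2, start2, None)):
        if c1 != c2:
            break
        n += 1
    return n
```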
All told, the changes you need to make to the code to use indices everywhere wind up hurting, not helping. Here's the timing result:
lcs 0.49046793299203273
optlcs5 0.30241094100347254
optlcs6 0.3683815289987251
At this point, you can spend a lot of time making little tweaks to try to squeeze some extra performance out of the program, but I don't think there are any major performance gains left. It's already something like 40% faster than the original, which is not bad.
Of course, as dawg wrote in a comment, you can accomplish this task using Python's standard module difflib.
def difflib_longest_match(a, b):
    i, j, k = difflib.SequenceMatcher(a=a, b=b).find_longest_match(0, len(a), 0, len(b))
    return a[i:i+k]
I have no idea what's in the difflib code (well, I could look, but I'll leave that as an exercise), but it's clearly heavily optimized for this kind of task. It's another 25% faster than my best version of your program:
lcs 0.49046793299203273
optlcs5 0.30241094100347254
difflib 0.2154458940058248
So if you're doing this for anything other than educational purposes, just use difflib.
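For instance (repeating the wrapper from above so the snippet runs on its own):

```python
import difflib

def difflib_longest_match(a, b):
    # find_longest_match returns a (start_in_a, start_in_b, size) triple
    i, j, k = difflib.SequenceMatcher(a=a, b=b).find_longest_match(0, len(a), 0, len(b))
    return a[i:i+k]

print(difflib_longest_match('abcdeffghiklnopqr', 'bdefghijklmnopmnop'))  # fghi
print(difflib_longest_match('banana', 'atana'))                          # ana
```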
Here is the content of the test script I used.
import difflib
import itertools as it
import more_itertools as mt
from os.path import commonprefix


class LongestCommonSubstr(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):
        try:
            substr_len = 0
            max_len = 0
            lcs = None
            for i, n in enumerate(self._suffix_str_array):
                if n[-1] != self._suffix_str_array[i+1][-1]:
                    substr = commonprefix([n, self._suffix_str_array[i+1]])
                    substr_len = len(substr)
                    if substr_len > max_len:
                        max_len = substr_len
                        lcs = substr
        except IndexError:
            pass
        return lcs


class OptimizedLongestCommonSubstr1(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):
        substr_len = 0
        max_len = 0
        lcs = None
        for s1, s2 in mt.pairwise(self._suffix_str_array):
            if s1[-1] != s2[-1]:
                substr = commonprefix([s1, s2])
                substr_len = len(substr)
                if substr_len > max_len:
                    max_len = substr_len
                    lcs = substr
        return lcs


class OptimizedLongestCommonSubstr2(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):
        return max((commonprefix([s1, s2])
                    for s1, s2 in mt.pairwise(self._suffix_str_array)
                    if s1[-1] != s2[-1]), key=len)


class OptimizedLongestCommonSubstr3(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):
        s_array = self._suffix_str_array
        return max((commonprefix([s1, s2])
                    for s1, s2 in mt.pairwise(s_array)
                    if s1[-1] != s2[-1]), key=len)


class OptimizedLongestCommonSubstr4(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    @staticmethod
    def _get_common_prefix(s1, s2):
        for i, c in enumerate(s1):
            if c != s2[i]:
                return s1[:i]
        return s1

    def _get_lcsubstr(self):
        s_array = self._suffix_str_array
        gcp = self._get_common_prefix
        substr_len = 0
        max_len = 0
        lcs = None
        for s1, s2 in mt.pairwise(s_array):
            if s1[-1] != s2[-1]:
                substr = gcp(s1, s2)
                substr_len = len(substr)
                if substr_len > max_len:
                    max_len = substr_len
                    lcs = substr
        return lcs


class OptimizedLongestCommonSubstr5(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    @staticmethod
    def _get_common_prefix_length(s1, s2):
        for i, c in enumerate(s1):
            if c != s2[i]:
                return i
        return len(s1)

    def _get_lcsubstr(self):
        s_array = self._suffix_str_array
        gcpl = self._get_common_prefix_length
        max_len = 0
        max_substr = ''
        for s1, s2 in mt.pairwise(s_array):
            if s1[-1] != s2[-1]:
                substr_len = gcpl(s1, s2)
                if substr_len > max_len:
                    max_len = substr_len
                    max_substr = s1
        return max_substr[:max_len]


class OptimizedLongestCommonSubstr6(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring
        self.rstring = rstring
        self._suffix_str_array = sorted(
            [(i, True) for i, _ in enumerate(lstring)]
            + [(i, False) for i, _ in enumerate(rstring)],
            key=lambda t: (lstring if t[1] else rstring)[t[0]:])
        self.longest_common_substr = self._get_lcsubstr()
    @staticmethod
    def _get_common_prefix_length(s1, start1, s2, start2):
        for i, c in enumerate(it.islice(s1, start1, None)):
            if c != s2[i + start2]:
                return i
        # Loop ran to completion: the whole remaining suffix matched, so
        # return its full length (a bare 'return i' would be off by one
        # and undefined for an empty slice)
        return len(s1) - start1
    def _get_lcsubstr(self):
        s_array = self._suffix_str_array
        gcpl = self._get_common_prefix_length
        max_len = 0
        max_start = 0
        max_substr = ''
        for (i1, b1), (i2, b2) in mt.pairwise(s_array):
            if b1 != b2:
                if b1:
                    s1 = self.lstring
                    s2 = self.rstring
                else:
                    s1 = self.rstring
                    s2 = self.lstring
                substr_len = gcpl(s1, i1, s2, i2)
                if substr_len > max_len:
                    max_len = substr_len
                    max_start = i1
                    max_substr = s1
        return max_substr[max_start:max_start + max_len]


def lcs_longest_match(a, b):
    return LongestCommonSubstr(a, b).longest_common_substr


for n in it.count(1):
    if 'OptimizedLongestCommonSubstr{}'.format(n) in globals():
        exec('''def optimized_lcs_longest_match_{}(a, b):
    return OptimizedLongestCommonSubstr{}(a, b).longest_common_substr
'''.format(n, n))
    else:
        break


def optimized_lcs_longest_match(a, b):
    return OptimizedLongestCommonSubstr(a, b).longest_common_substr


def difflib_longest_match(a, b):
    i, j, k = difflib.SequenceMatcher(a=a, b=b).find_longest_match(0, len(a), 0, len(b))
    return a[i:i+k]


if __name__ == '__main__':
    import timeit, sys
    n_runs = 10000
    print('lcs    ', timeit.timeit('lcs_longest_match(sys.argv[1], sys.argv[2])',
                                   setup='from __main__ import lcs_longest_match',
                                   number=n_runs))
    for k in range(1, n):
        print('optlcs{}'.format(k), timeit.timeit(
            'optimized_lcs_longest_match_{}(sys.argv[1], sys.argv[2])'.format(k),
            setup='from __main__ import optimized_lcs_longest_match_{}'.format(k),
            number=n_runs))
    print('difflib', timeit.timeit('difflib_longest_match(sys.argv[1], sys.argv[2])',
                                   setup='from __main__ import difflib_longest_match',
                                   number=n_runs))
    print('control results:', difflib_longest_match(sys.argv[1], sys.argv[2]))
    for k in range(1, n):
        print('test results {}: '.format(k),
              eval('optimized_lcs_longest_match_{}(sys.argv[1], sys.argv[2])'.format(k)))
- I have learnt a lot from this answer. Thank you for a great introduction to Code Review. – JAB, Nov 11, 2015 at 14:50
- @JAB Looking into difflib's source is definitely a good exercise. I managed to strip down the 'junk' overhead and get the run time down by ~30%. And if you change it into a function you can get it down by ~40% (without much effort)! Using the same test case, here's what I got: lcs 1.23378048, difflib 0.56012416, and the modified ones class 0.37018752, function 0.3279744. – Nov 11, 2015 at 15:24
- Is it unusual to have a class without any public methods, and to set the result of a private method to an attribute when the class is instantiated? – JAB, Nov 11, 2015 at 17:43
- @JAB, I would say your choice of using a class is somewhat debatable, or even possibly an anti-pattern. See my answer for further details related to the choice of using a class (and an even faster implementation :-) ) – holroy, Nov 11, 2015 at 21:42
- Thanks @holroy for another thorough answer. I will spend time walking through it tomorrow. Much appreciated. – JAB, Nov 12, 2015 at 3:50
Before tackling the speed issue, let us comment on the chosen solution of using a class, and on style issues:

- Missing documentation – Neither the class nor any of the methods are documented, which makes it hard to understand how to use your class and what it provides. Invoking it through LongestCommonSubstr('banana', 'atana').longest_common_substr is not exactly intuitive.
- Almost no public interface – You've caught on to the class naming and the indicative naming of private members and methods, but this leaves little of your class available. Only rstring, lstring and longest_common_substr are left as "public".
- Avoid intensive calculations in __init__ – It is not normal to do all of the calculation within the initialisation method. This should be used to set up the base of the class, after which methods are used to interact with the instantiated object. To do everything the class needs to do in __init__, just to expose a public variable, is an anti-pattern.
- Avoid using a class if only one operation needs to be executed – This is not a stringent rule, but if you only want the class to do a single thing, you are most likely better off with a function. There is no need to encapsulate the function within a class. If you need local variables or functions (like _get_suffix_str()), they can be made internal functions of your main function.
- Add spaces after commas, and vertical space – Your code mainly maintains a good style, but you should add a space after commas. And you could very well include a blank line before if, for and other blocks within your functions/methods.
Performance review
You are asking if there is another way to separate your input strings, and there is at least one method, using tuples, which can achieve this. Instead of storing suffix1 you store (suffix, 1) (or (suffix, 0)). However, this has a penalty both when creating the tuples and when accessing them again. When speed is important, it seems like your current solution of postfixing with 0 or 1 is faster, although less clear when reading the code.
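For concreteness, the tuple-based separation would build the suffix list like this (a sketch; the helper name is mine):

```python
def tagged_suffixes(first, second):
    # Each suffix carries an id marking which input it came from, so no
    # sentinel character has to be appended to the strings themselves.
    return sorted([(first[i:], 0) for i in range(len(first))] +
                  [(second[i:], 1) for i in range(len(second))])

tagged_suffixes('ab', 'b')  # [('ab', 0), ('b', 0), ('b', 1)]
```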
I reimplemented the faster solution of David Z as a function, and got a performance gain. But I got an even greater gain when losing the pairwise iterator and using a previous and current variable set.
Here are the two main contenders for the performance comparison. The first is a version using tuples instead of postfixing the strings, and not using the pairwise iterator:
def longest_common_substring_v3(first, second):
    """Utilises a set of suffix tuples to return the longest common substring."""
    max_len = 0
    max_substring = ''

    # Build suffixes based on tuples to differentiate strings
    suffixes = sorted([(first[i:], 0) for i in range(len(first))] +
                      [(second[i:], 1) for i in range(len(second))])

    # Loop through a sorted suffix set, and if two following tuples
    # are from alternating base strings (i.e. different id) compare
    # those and see if we found a new longer common substring
    previous_text, previous_id = suffixes[0]
    for (current_text, current_id) in suffixes[1:]:

        # Only continue comparison if the ids of two suffixes in a row differ
        if previous_id != current_id:

            # Find the longest common start of these two texts
            for i, c in enumerate(previous_text):
                if c != current_text[i]:
                    break
            else:
                i = len(previous_text)

            # If longer than previous max, set the new max
            if i > max_len:
                max_len = i
                max_substring = previous_text[:i]

        # Shift over current values, preparing for next loop
        previous_text, previous_id = current_text, current_id

    return max_substring
But the second-fastest version postfixes the strings, uses a single function, and skips the pairwise iteration, as in the following code:
def longest_common_substring_v4(first, second):
    """Utilises a suffix list to return the longest common substring."""
    max_len = 0
    max_substring = ''

    # Add magic number to end of string, to differentiate suffixes
    first += '0'
    second += '1'
    suffixes = sorted([first[i:] for i in range(len(first)-1)] +
                      [second[i:] for i in range(len(second)-1)])

    # Loop over suffixes, one at a time
    previous = suffixes[0]
    for current in suffixes[1:]:

        # If suffixes have different last character, they come from different
        # strings and they are candidates for common substrings
        if previous[-1] != current[-1]:

            # Find the longest common start of these texts
            for i, c in enumerate(previous):
                if c != current[i]:
                    break
            else:
                i = len(previous)

            # If longer than previous max, set the new max
            if i > max_len:
                max_len = i
                max_substring = previous[:i]

        previous = current

    return max_substring
I profiled version 4, and found that finding the common start was the slowest part of my code. Then it hit me: why don't we skip comparing byte for byte, and instead do a string compare on the current max_len first? This leads to version 5, where the code around for i, c ... has been replaced with the following:
if previous[-1] != current[-1] and previous[:max_len] == current[:max_len]:

    # Find the longest common start of these texts
    for i in range(max_len, len(previous)):
        if previous[i] != current[i]:
            break
    else:
        i = len(previous)
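Assembled into a complete function, version 5 might look like this (my reconstruction from version 4 plus the fragment above, not holroy's verbatim code):

```python
def longest_common_substring_v5(first, second):
    """Suffix-list LCS that only scans character by character when the
    first max_len characters already match (a fast C-level slice compare)."""
    max_len = 0
    max_substring = ''

    # Sentinel characters differentiate the suffixes of the two inputs
    first += '0'
    second += '1'
    suffixes = sorted([first[i:] for i in range(len(first) - 1)] +
                      [second[i:] for i in range(len(second) - 1)])

    previous = suffixes[0]
    for current in suffixes[1:]:
        # Different last characters mean the suffixes come from different
        # strings; the slice compare rejects pairs that cannot beat max_len
        if previous[-1] != current[-1] and previous[:max_len] == current[:max_len]:
            for i in range(max_len, len(previous)):
                if previous[i] != current[i]:
                    break
            else:
                i = len(previous)
            if i > max_len:
                max_len = i
                max_substring = previous[:i]
        previous = current
    return max_substring
```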
I compared my versions 3, 4 and 5, your original version, the version by David Z, and the difflib version, for various strings, using timeit.timeit (selecting the best of 3000 iterations), and got the following list:
Testing lcs of 'banana' vs 'atana'
main_org = ana in 44.070 ms
main_DavidZ = ana in 26.159 ms
main_v3 = ana in 16.612 ms
main_difflib = ana in 20.888 ms
main_v4 = ana in 14.125 ms
main_v5 = ana in 12.786 ms
Testing lcs of 'supercalifragilisticexpialidocious' vs 'superfragilisticexpialinocious'
main_org = fragilisticexpiali in 161.052 ms
main_DavidZ = fragilisticexpiali in 98.213 ms
main_v3 = fragilisticexpiali in 93.251 ms
main_difflib = fragilisticexpiali in 86.685 ms
main_v4 = fragilisticexpiali in 83.489 ms
main_v5 = fragilisticexpiali in 43.128 ms
Testing lcs of 'supercalifragilisticexpialidocious' vs 'asdfghjklqwertyuizxcvbnmwerftgh'
main_org = er in 90.627 ms
main_DavidZ = er in 58.914 ms
main_v3 = er in 58.540 ms
main_difflib = er in 57.414 ms
main_v4 = er in 45.756 ms
main_v5 = er in 37.218 ms
Testing lcs of 'supercalifragilisticexpialidocious' vs 'supercalifragilisticexpialidocious'
main_org = supercalifragilisticexpialidocious in 227.886 ms
main_DavidZ = supercalifragilisticexpialidocious in 155.412 ms
main_v3 = supercalifragilisticexpialidocious in 159.243 ms
main_difflib = supercalifragilisticexpialidocious in 92.802 ms
main_v4 = supercalifragilisticexpialidocious in 136.591 ms
main_v5 = supercalifragilisticexpialidocious in 49.778 ms
Testing lcs of 'abcdeffghiklnopqr' vs 'bdefghijklmnopmnop'
main_org = fghi in 69.507 ms
main_DavidZ = fghi in 42.372 ms
main_v3 = fghi in 34.204 ms
main_difflib = fghi in 35.223 ms
main_v4 = fghi in 31.538 ms
main_v5 = fghi in 25.354 ms
As can be seen, version 5 is now the fastest overall, even on the long text where difflib excels. In most of the other cases, too, the order of the tests is the order from slowest to fastest. Do however note that execution timing is not a precise art: the tests normally run on a computer which is busy doing other stuff, and as such can be interrupted at any point. But the timing does reveal that there are differences between the implementations.
So as a final note, I would say that your code does the job, but it hides the implementation away within an unnecessary and undocumented class, making it harder to reuse and introducing some overhead, as it uses a heavier structure around the current algorithm. Using a function can simplify the code, make it easier to read and maintain, and run around 50% faster.
PS! I tested this in Python 2.7, but the result should apply to Python 3.x as well; this is also why I originally used xrange in the code.