I am trying to speed up a function to return the longest common substring.
The requirements:
- The function takes two strings of arbitrary length (although on average they will be less than 50 chars each)
- If two common substrings of the same length exist, it can return either
- Speed is the primary concern
This is what I have and it works according to my tests:
from os.path import commonprefix

class LongestCommonSubstr(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):
        try:
            substr_len = 0
            max_len = 0
            lcs = None
            for i, n in enumerate(self._suffix_str_array):
                if n[-1] != self._suffix_str_array[i+1][-1]:
                    substr = commonprefix([n, self._suffix_str_array[i+1]])
                    substr_len = len(substr)
                    if substr_len > max_len:
                        max_len = substr_len
                        lcs = substr
        except IndexError:
            pass
        return lcs
Can the code be made faster in pure Python? Is there a better option to differentiate the input strings when made into one sorted list than concatenating an ID to them?
- Have you considered difflib in the Python library? – dawg, Nov 11, 2015 at 0:08
2 Answers
OK then: since your concern is speed, let's track our progress with actual timing data. The first step is to run the code through the Python profiler. With the addition of a bit of driver code that just calls LongestCommonSubstr().longest_common_substr 10000 times, I get the following results:
$ python3.4 -m profile lcs_profile.py abcdeffghiklnopqr bdefghijklmnopmnop
1130008 function calls in 3.083 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 :0(__build_class__)
1 0.000 0.000 3.082 3.082 :0(exec)
1 0.000 0.000 0.000 0.000 :0(hasattr)
280000 0.257 0.000 0.257 0.000 :0(len)
260000 0.330 0.000 0.330 0.000 :0(max)
260000 0.350 0.000 0.350 0.000 :0(min)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 0.042 0.000 0.042 0.000 :0(sorted)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)
260000 1.067 0.000 1.746 0.000 genericpath.py:69(commonprefix)
1 0.022 0.022 3.082 3.082 lcs_profile.py:1(<module>)
20000 0.069 0.000 0.157 0.000 lcs_profile.py:11(_get_suffix_str)
20000 0.071 0.000 0.071 0.000 lcs_profile.py:13(<listcomp>)
10000 0.807 0.000 2.794 0.000 lcs_profile.py:15(_get_lcsubstr)
1 0.000 0.000 0.000 0.000 lcs_profile.py:3(LongestCommonSubstr)
10000 0.067 0.000 3.060 0.000 lcs_profile.py:4(__init__)
1 0.000 0.000 3.083 3.083 profile:0(<code object <module> at 0x10996f030, file "lcs_profile.py", line 1>)
0 0.000 0.000 profile:0(profiler)
The most important column is the second one, tottime, which marks the total amount of time consumed by each function (but not the functions it calls). The largest entries in that column are for the commonprefix function and the _get_lcsubstr method, so those are the spots to focus on when optimizing.
To keep track of our progress, I coupled your class with the following driver program which prints the execution time for 10000 runs:
def lcs_longest_match(a, b):
    return LongestCommonSubstr(a, b).longest_common_substr

if __name__ == '__main__':
    import timeit, sys
    print('lcs    ', timeit.timeit('lcs_longest_match(sys.argv[1], sys.argv[2])',
                                   setup='from __main__ import lcs_longest_match',
                                   number=10000))
(this is for Python 3). A first run on my computer gives
$ python3.4 lcs.py abcdeffghiklnopqr bdefghijklmnopmnop
lcs 0.49046793299203273
I'll start with _get_lcsubstr, since you wrote that code and it'll be easier to work through. The algorithm you use is to go through each pair of consecutive strings in the sorted suffix array, find the common prefix, and save that prefix only if it's longer than any common prefix already found. I can suggest a few improvements:
- Iterating over pairs of consecutive elements is a common task that has a fairly standard recipe, pairwise(iterable), given in the documentation for the itertools module. You can use the implementation from the more-itertools package if you want. This also lets you get rid of the try/except block (which probably didn't affect runtime much, but it helps code clarity).
- Python has a built-in function, max, to find the maximum value of an iterable. If you use it, the nuts and bolts of the loop, as well as the compare-and-store-if-greater process, get handled internally by the interpreter, which should be faster than doing them manually in pure Python.
- Repeatedly accessing an attribute of an object, namely self._suffix_str_array, is slower than accessing it once and storing it locally as a new variable.
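The pairwise recipe itself is tiny; here is the version given in the itertools documentation (since Python 3.10 it is also available directly as itertools.pairwise):

```python
from itertools import tee

def pairwise(iterable):
    # pairwise('ABCD') -> ('A', 'B'), ('B', 'C'), ('C', 'D')
    a, b = tee(iterable)   # two independent iterators over the input
    next(b, None)          # advance the second iterator by one step
    return zip(a, b)       # zip stops as soon as the shorter one ends
```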
Let's check the effect of these changes on the runtime of the driver program:
lcs 0.49046793299203273
optlcs1 0.4739605739887338
optlcs2 0.4895538759883493
optlcs3 0.4828952929965453
Not much of a change. Actually, using max is slower here, so let's discard that change.
Time now to take a closer look at commonprefix.
If you look at the source code of this function,

def commonprefix(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m:
        return ''
    s1 = min(m)
    s2 = max(m)
    for i, c in enumerate(s1):
        if c != s2[i]:
            return s1[:i]
    return s1
you see that it goes through some preliminary steps relating to the fact that it needs to handle a list of potentially several paths. You only ever have two strings to compare, so you can skip that - and in fact, if you look back at the profiling data, you'll see that these calls to max and min do take up a significant amount of time. Therefore, it makes sense to implement your own version of commonprefix without the max and min calls.
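A two-string replacement can simply inline the comparison loop. This sketch (the name is mine, not from the profiled script) relies on the two strings being consecutive entries of a sorted list, so the lexicographically smaller one can never index past the end of the larger one:

```python
def common_prefix_2(s1, s2):
    # Assumes s1 <= s2 lexicographically, as consecutive entries of a
    # sorted suffix array always are; then s2[i] is valid whenever
    # s1[:i] == s2[:i] and i < len(s1).
    for i, c in enumerate(s1):
        if c != s2[i]:
            return s1[:i]
    return s1
```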
This makes a big difference in the runtime, cutting it down by over 30%:
lcs 0.49046793299203273
optlcs3 0.4828952929965453
optlcs4 0.3215277789859101
If you think about it, you don't even really need the common prefix itself, except for the one string you actually return from _get_lcsubstr. You only need its length, so you can decide which substring is the longest. So instead of using a function that finds the full common prefix, just write one that gives you its length. You store the length, along with one of the strings, and at the end of _get_lcsubstr, use the length to trim the stored string.

def _get_lcsubstr(self):
    # initialize
    for s1, s2 in mt.pairwise(suffix_str_array):
        if s1[-1] != s2[-1]:
            substr_len = get_common_prefix_length(s1, s2)
            if substr_len > max_len:
                max_len = substr_len
                max_substr = s1
    return max_substr[:max_len]
This shaves another few percent off the runtime:
lcs 0.49046793299203273
optlcs4 0.3215277789859101
optlcs5 0.30241094100347254
If it's faster to avoid computing a substring in your commonprefix replacement, you might think of doing the same thing when you're finding the suffixes in the first place. In other words, instead of calculating all the suffixes of a string in _get_suffix_str, just make a list of (index, which_string) tuples to represent the suffixes. Then whenever you need to compare two suffixes, instead of taking a substring of the original string, you just start comparing characters at the required indices.

There are two problems with this in practice. First, Python isn't well suited to iterating from an arbitrary point in the middle of a string. In a language like C, where strings are character pointers, this would work out quite well, because you can jump into the middle of the string by advancing a pointer. But in Python, iterating from the middle of a string requires you to either start from the beginning and skip the first several characters, or bypass the whole iteration mechanism and use a for loop with an integer index to access characters inside the string (which involves more relatively inefficient pure-Python code). Second, you need to create the substrings anyway in order to sort them. If you try to use the substrings as comparison keys without actually storing them, the program spends a lot of time converting between a substring and its index.
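To make the trade-off concrete, an index-based prefix-length comparison might look like the following sketch (names are mine); note how much machinery it needs compared to a plain slice-and-compare:

```python
from itertools import islice

def prefix_len_from(s1, start1, s2, start2):
    """Length of the common prefix of s1[start1:] and s2[start2:],
    computed without building either substring."""
    n = 0
    # islice must internally skip start1/start2 characters before yielding,
    # which is exactly the "start from the beginning" cost described above
    for c1, c2 in zip(islice(s1, start1, None), islice(s2, start2, None)):
        if c1 != c2:
            break
        n += 1
    return n
```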
All told, the changes you need to make to the code to use indices everywhere wind up hurting, not helping. Here's the timing result:
lcs 0.49046793299203273
optlcs5 0.30241094100347254
optlcs6 0.3683815289987251
At this point, you can spend a lot of time making little tweaks to try to squeeze some extra performance out of the program, but I don't think there are any major performance gains left. It's already something like 40% faster than the original, which is not bad.
Of course, as dawg wrote in a comment, you can accomplish this task using Python's standard module difflib.
def difflib_longest_match(a, b):
    i, j, k = difflib.SequenceMatcher(a=a, b=b).find_longest_match(0, len(a), 0, len(b))
    return a[i:i+k]
I have no idea what's in the difflib code (well, I could look, but I'll leave that as an exercise), but it's clearly heavily optimized for this kind of task. It's another 25% faster than my best version of your program:
lcs 0.49046793299203273
optlcs5 0.30241094100347254
difflib 0.2154458940058248
So if you're doing this for anything other than educational purposes, just use difflib.
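For instance (repeating the wrapper from above so the snippet runs on its own):

```python
import difflib

def difflib_longest_match(a, b):
    # find_longest_match returns a (start_in_a, start_in_b, size) triple
    i, j, k = difflib.SequenceMatcher(a=a, b=b).find_longest_match(0, len(a), 0, len(b))
    return a[i:i+k]

print(difflib_longest_match('abcdeffghiklnopqr', 'bdefghijklmnopmnop'))  # fghi
print(difflib_longest_match('banana', 'atana'))                          # ana
```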
Here is the content of the test script I used.
import difflib
import itertools as it
import more_itertools as mt
from os.path import commonprefix


class LongestCommonSubstr(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):
        try:
            substr_len = 0
            max_len = 0
            lcs = None
            for i, n in enumerate(self._suffix_str_array):
                if n[-1] != self._suffix_str_array[i+1][-1]:
                    substr = commonprefix([n, self._suffix_str_array[i+1]])
                    substr_len = len(substr)
                    if substr_len > max_len:
                        max_len = substr_len
                        lcs = substr
        except IndexError:
            pass
        return lcs


class OptimizedLongestCommonSubstr1(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):
        substr_len = 0
        max_len = 0
        lcs = None
        for s1, s2 in mt.pairwise(self._suffix_str_array):
            if s1[-1] != s2[-1]:
                substr = commonprefix([s1, s2])
                substr_len = len(substr)
                if substr_len > max_len:
                    max_len = substr_len
                    lcs = substr
        return lcs


class OptimizedLongestCommonSubstr2(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):
        return max((commonprefix([s1, s2])
                    for s1, s2 in mt.pairwise(self._suffix_str_array)
                    if s1[-1] != s2[-1]), key=len)


class OptimizedLongestCommonSubstr3(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):
        s_array = self._suffix_str_array
        return max((commonprefix([s1, s2])
                    for s1, s2 in mt.pairwise(s_array)
                    if s1[-1] != s2[-1]), key=len)


class OptimizedLongestCommonSubstr4(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    @staticmethod
    def _get_common_prefix(s1, s2):
        for i, c in enumerate(s1):
            if c != s2[i]:
                return s1[:i]
        return s1

    def _get_lcsubstr(self):
        s_array = self._suffix_str_array
        gcp = self._get_common_prefix
        substr_len = 0
        max_len = 0
        lcs = None
        for s1, s2 in mt.pairwise(s_array):
            if s1[-1] != s2[-1]:
                substr = gcp(s1, s2)
                substr_len = len(substr)
                if substr_len > max_len:
                    max_len = substr_len
                    lcs = substr
        return lcs


class OptimizedLongestCommonSubstr5(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring + '0'
        self.rstring = rstring + '1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
                                        + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):
        return [string[i:] for i in range(len(string))]

    @staticmethod
    def _get_common_prefix_length(s1, s2):
        for i, c in enumerate(s1):
            if c != s2[i]:
                return i
        return len(s1)

    def _get_lcsubstr(self):
        s_array = self._suffix_str_array
        gcpl = self._get_common_prefix_length
        max_len = 0
        max_substr = ''
        for s1, s2 in mt.pairwise(s_array):
            if s1[-1] != s2[-1]:
                substr_len = gcpl(s1, s2)
                if substr_len > max_len:
                    max_len = substr_len
                    max_substr = s1
        return max_substr[:max_len]


class OptimizedLongestCommonSubstr6(object):
    def __init__(self, lstring, rstring):
        self.lstring = lstring
        self.rstring = rstring
        self._suffix_str_array = sorted(
            [(i, True) for i, _ in enumerate(lstring)]
            + [(i, False) for i, _ in enumerate(rstring)],
            key=lambda t: (lstring if t[1] else rstring)[t[0]:])
        self.longest_common_substr = self._get_lcsubstr()
    @staticmethod
    def _get_common_prefix_length(s1, start1, s2, start2):
        for i, c in enumerate(it.islice(s1, start1, None)):
            if c != s2[i + start2]:
                return i
        # Loop ran to completion: the whole remaining suffix matched, so
        # return its full length (a bare 'return i' would be off by one
        # and undefined for an empty slice)
        return len(s1) - start1
    def _get_lcsubstr(self):
        s_array = self._suffix_str_array
        gcpl = self._get_common_prefix_length
        max_len = 0
        max_start = 0
        max_substr = ''
        for (i1, b1), (i2, b2) in mt.pairwise(s_array):
            if b1 != b2:
                if b1:
                    s1 = self.lstring
                    s2 = self.rstring
                else:
                    s1 = self.rstring
                    s2 = self.lstring
                substr_len = gcpl(s1, i1, s2, i2)
                if substr_len > max_len:
                    max_len = substr_len
                    max_start = i1
                    max_substr = s1
        return max_substr[max_start:max_start + max_len]


def lcs_longest_match(a, b):
    return LongestCommonSubstr(a, b).longest_common_substr


for n in it.count(1):
    if 'OptimizedLongestCommonSubstr{}'.format(n) in globals():
        exec('''def optimized_lcs_longest_match_{}(a, b):
    return OptimizedLongestCommonSubstr{}(a, b).longest_common_substr
'''.format(n, n))
    else:
        break


def optimized_lcs_longest_match(a, b):
    return OptimizedLongestCommonSubstr(a, b).longest_common_substr


def difflib_longest_match(a, b):
    i, j, k = difflib.SequenceMatcher(a=a, b=b).find_longest_match(0, len(a), 0, len(b))
    return a[i:i+k]


if __name__ == '__main__':
    import timeit, sys
    n_runs = 10000
    print('lcs    ', timeit.timeit('lcs_longest_match(sys.argv[1], sys.argv[2])',
                                   setup='from __main__ import lcs_longest_match',
                                   number=n_runs))
    for k in range(1, n):
        print('optlcs{}'.format(k), timeit.timeit(
            'optimized_lcs_longest_match_{}(sys.argv[1], sys.argv[2])'.format(k),
            setup='from __main__ import optimized_lcs_longest_match_{}'.format(k),
            number=n_runs))
    print('difflib', timeit.timeit('difflib_longest_match(sys.argv[1], sys.argv[2])',
                                   setup='from __main__ import difflib_longest_match',
                                   number=n_runs))
    print('control results:', difflib_longest_match(sys.argv[1], sys.argv[2]))
    for k in range(1, n):
        print('test results {}: '.format(k),
              eval('optimized_lcs_longest_match_{}(sys.argv[1], sys.argv[2])'.format(k)))
- I have learnt a lot from this answer. Thank you for a great introduction to Code Review. – JAB, Nov 11, 2015 at 14:50
- @JAB Looking into difflib's source is definitely a good exercise. I managed to strip down the 'junk' overhead and get the run time down by ~30%. And if you change it into a function you can get it down by ~40% (without much effort)! Using the same test case, here's what I got: lcs 1.23378048, difflib 0.56012416, and the modified ones class 0.37018752, function 0.3279744. – Nov 11, 2015 at 15:24
- Is it unusual to have a class without any public methods, and to set the result of a private method to an attribute when the class is instantiated? – JAB, Nov 11, 2015 at 17:43
- @JAB, I would say your choice of using a class is somewhat debatable, or even possibly an anti-pattern. See my answer for further details related to the choice of using a class (and an even faster implementation :-) ) – holroy, Nov 11, 2015 at 21:42
- Thanks @holroy for another thorough answer. I will spend time walking through it tomorrow. Much appreciated. – JAB, Nov 12, 2015 at 3:50
Before tackling the speed issue, let us comment on the chosen solution of using a class, and on style issues:

- Missing documentation – Neither the class nor any of the methods are documented, which makes it hard to understand how to use your class and what it provides. Invoking it through LongestCommonSubstr('banana', 'atana').longest_common_substr is not exactly intuitive.
- Almost no public interface – You've caught on to the class naming and the indicative naming of private members and methods, but this leaves little of your class available. Only rstring, lstring and longest_common_substr are left as "public".
- Avoid intensive calculations in __init__ – It is not normal to do all of the calculation within the initialisation method. This should be used to set up the base of the class, after which methods are used to interact with the instantiated object. To do everything the class needs to do in __init__, just to expose a public variable, is an anti-pattern.
- Avoid using a class if only one operation needs to be executed – This is not a stringent rule, but if you only want the class to do a single thing, you are most likely better off with a function. There is no need to encapsulate the function within a class. If you need local variables or functions (like _get_suffix_str()), they can be made internal functions of your main function.
- Add spaces after commas, and vertical space – Your code mainly maintains a good style, but you should add a space after commas. And you could very well include a blank line before if, for and other blocks within your functions/methods.
Performance review
You are asking if there is another way to separate your input strings, and there is at least one method, using tuples, which can achieve this. Instead of storing suffix1 you store (suffix, 1) (or (suffix, 0)). However, this has a penalty both when creating the tuples and when accessing them again. When speed is important, it seems like your current solution of postfixing with 0 or 1 is faster, although less clear when reading the code.
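For concreteness, the tuple-based separation would build the suffix list like this (a sketch; the helper name is mine):

```python
def tagged_suffixes(first, second):
    # Each suffix carries an id marking which input it came from, so no
    # sentinel character has to be appended to the strings themselves.
    return sorted([(first[i:], 0) for i in range(len(first))] +
                  [(second[i:], 1) for i in range(len(second))])

tagged_suffixes('ab', 'b')  # [('ab', 0), ('b', 0), ('b', 1)]
```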
I reimplemented the faster solution of David Z as a function, and got a performance gain. But I got an even greater gain when losing the pairwise iterator and using a previous and current variable set.
Here are the two main contenders for the performance comparison. The first is a version using tuples instead of postfixing the strings, and not using the pairwise iterator:
def longest_common_substring_v3(first, second):
    """Utilises a set of suffix tuples to return the longest common substring."""
    max_len = 0
    max_substring = ''

    # Build suffixes based on tuples to differentiate strings
    suffixes = sorted([(first[i:], 0) for i in range(len(first))] +
                      [(second[i:], 1) for i in range(len(second))])

    # Loop through a sorted suffix set, and if two following tuples
    # are from alternating base strings (i.e. different id) compare
    # those and see if we found a new longer common substring
    previous_text, previous_id = suffixes[0]
    for (current_text, current_id) in suffixes[1:]:

        # Only continue comparison if the ids of two suffixes in a row differ
        if previous_id != current_id:

            # Find the longest common start of these two texts
            for i, c in enumerate(previous_text):
                if c != current_text[i]:
                    break
            else:
                i = len(previous_text)

            # If longer than previous max, set the new max
            if i > max_len:
                max_len = i
                max_substring = previous_text[:i]

        # Shift over current values, preparing for next loop
        previous_text, previous_id = current_text, current_id

    return max_substring
But the second-fastest version postfixes the strings, uses a single function, and skips the pairwise iteration, as in the following code:
def longest_common_substring_v4(first, second):
    """Utilises a suffix list to return the longest common substring."""
    max_len = 0
    max_substring = ''

    # Add magic number to end of string, to differentiate suffixes
    first += '0'
    second += '1'
    suffixes = sorted([first[i:] for i in range(len(first)-1)] +
                      [second[i:] for i in range(len(second)-1)])

    # Loop over suffixes, one at a time
    previous = suffixes[0]
    for current in suffixes[1:]:

        # If suffixes have different last character, they come from different
        # strings and they are candidates for common substrings
        if previous[-1] != current[-1]:

            # Find the longest common start of these texts
            for i, c in enumerate(previous):
                if c != current[i]:
                    break
            else:
                i = len(previous)

            # If longer than previous max, set the new max
            if i > max_len:
                max_len = i
                max_substring = previous[:i]

        previous = current

    return max_substring
I profiled version 4, and found that finding the common start was the slowest part of my code. Then it hit me: why don't we skip comparing byte for byte, and instead do a string compare on the current max_len first? This leads to version 5, where the code around for i, c ... has been replaced with the following:
if previous[-1] != current[-1] and previous[:max_len] == current[:max_len]:

    # Find the longest common start of these texts
    for i in range(max_len, len(previous)):
        if previous[i] != current[i]:
            break
    else:
        i = len(previous)
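Assembled into a complete function, version 5 might look like this (my reconstruction from version 4 plus the fragment above, not holroy's verbatim code):

```python
def longest_common_substring_v5(first, second):
    """Suffix-list LCS that only scans character by character when the
    first max_len characters already match (a fast C-level slice compare)."""
    max_len = 0
    max_substring = ''

    # Sentinel characters differentiate the suffixes of the two inputs
    first += '0'
    second += '1'
    suffixes = sorted([first[i:] for i in range(len(first) - 1)] +
                      [second[i:] for i in range(len(second) - 1)])

    previous = suffixes[0]
    for current in suffixes[1:]:
        # Different last characters mean the suffixes come from different
        # strings; the slice compare rejects pairs that cannot beat max_len
        if previous[-1] != current[-1] and previous[:max_len] == current[:max_len]:
            for i in range(max_len, len(previous)):
                if previous[i] != current[i]:
                    break
            else:
                i = len(previous)
            if i > max_len:
                max_len = i
                max_substring = previous[:i]
        previous = current
    return max_substring
```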
I compared my versions 3, 4 and 5, your original version, the version by David Z, and the difflib version, for various strings, using timeit.timeit (selecting the best of 3000 iterations), and got the following list:
Testing lcs of 'banana' vs 'atana'
main_org = ana in 44.070 ms
main_DavidZ = ana in 26.159 ms
main_v3 = ana in 16.612 ms
main_difflib = ana in 20.888 ms
main_v4 = ana in 14.125 ms
main_v5 = ana in 12.786 ms
Testing lcs of 'supercalifragilisticexpialidocious' vs 'superfragilisticexpialinocious'
main_org = fragilisticexpiali in 161.052 ms
main_DavidZ = fragilisticexpiali in 98.213 ms
main_v3 = fragilisticexpiali in 93.251 ms
main_difflib = fragilisticexpiali in 86.685 ms
main_v4 = fragilisticexpiali in 83.489 ms
main_v5 = fragilisticexpiali in 43.128 ms
Testing lcs of 'supercalifragilisticexpialidocious' vs 'asdfghjklqwertyuizxcvbnmwerftgh'
main_org = er in 90.627 ms
main_DavidZ = er in 58.914 ms
main_v3 = er in 58.540 ms
main_difflib = er in 57.414 ms
main_v4 = er in 45.756 ms
main_v5 = er in 37.218 ms
Testing lcs of 'supercalifragilisticexpialidocious' vs 'supercalifragilisticexpialidocious'
main_org = supercalifragilisticexpialidocious in 227.886 ms
main_DavidZ = supercalifragilisticexpialidocious in 155.412 ms
main_v3 = supercalifragilisticexpialidocious in 159.243 ms
main_difflib = supercalifragilisticexpialidocious in 92.802 ms
main_v4 = supercalifragilisticexpialidocious in 136.591 ms
main_v5 = supercalifragilisticexpialidocious in 49.778 ms
Testing lcs of 'abcdeffghiklnopqr' vs 'bdefghijklmnopmnop'
main_org = fghi in 69.507 ms
main_DavidZ = fghi in 42.372 ms
main_v3 = fghi in 34.204 ms
main_difflib = fghi in 35.223 ms
main_v4 = fghi in 31.538 ms
main_v5 = fghi in 25.354 ms
As can be seen, version 5 is now the fastest overall, even on the long text where difflib excels. In most of the other cases, too, the order of the tests is the order from slowest to fastest. Do however note that execution timing is not a precise art: the tests normally run on a computer which is busy doing other stuff, and as such can be interrupted at any point. But the timing does reveal that there are differences between the implementations.
So as a final note, I would say that your code does the job, but it hides the implementation away within an unnecessary and undocumented class, making it harder to reuse and introducing some overhead, as it uses a heavier structure around the current algorithm. Using a function can simplify the code, make it easier to read and maintain, and run around 50% faster.
PS! I tested this in Python 2.7, but the result should apply to Python 3.x as well; this is also why I originally used xrange in the code.