I've designed an algorithm to find the longest common subsequence. The steps are:
Start with i = 0
- Picks the first letter from the first string start from $i^{th}$ letter.
- Go to the second string looking for that picked letter.
- If not found return to the first string and picks the next letter and repeat 1 to 3 until it finds a letter that is in the second string.
- Now that found a common letter in the second string, adds that to
common_subsequence
. - Store its position in
index
. - Picks next letter from the first string and do step 2 but this time starts from
index
. - Repeat 3 to 6 until reached end of string 1 or string 2.
- If length
common_subsequence
is greater than length of common subsequence so far add that changelcs
to thecommon_subsequence
. - Add 1 to the
i
and repeat 1 to 9 whilei
is less than the length of the first string.
Example
X = ABCBDAB Y = BDCABA
- First pick
A
.- Look for
A
inY
.- Now that found
A
add that to thecommon_subsequence
.- Then pick
B
fromX
.- Look for
B
inY
but this time start searching fromA
.- Now pick
C
. It isn't there in string 2, so pick the next letter inX
that isB
.
The complexity of this algorithm is $\Theta(n\cdot m)$.
I implemented it using two methods. Here are my implementations.
- First
import time
def lcs(xstr, ystr):
if not (xstr and ystr): return # if string is empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
for i in xrange(len(xstr)):
cs = '' # common subsequence
start = 0 # start position in ystr
for item in xstr[i:]:
index = ystr.find(item, start) # position at the common letter
if index != -1: # if common letter has found
cs += item # add common letter to the cs
start = index + 1
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
- Second Algorithm using hash table
import time
from collections import defaultdict
def lcs(xstr, ystr):
if not (xstr and ystr): return # if strings are empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
location = defaultdict(list) # keeps track of items in the ystr
i = 0
for k in ystr:
location[k].append(i)
i += 1
for i in xrange(len(xstr)):
cs = '' # common subsequence
index = -1
reached_index = defaultdict(int)
for item in xstr[i:]:
for new_index in location[item][reached_index[item]:]:
reached_index[item] += 1
if index < new_index:
cs += item # add item to the cs
index = new_index
break
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
The second one uses a hash table, but after implementing I found that it's much slower compared to the first algorithm. Can someone elaborate why?
-
2$\begingroup$ This might be better suited to stackoverflow.com, or better yet, to Code Review. To me, their FAQ seems to allow such questions as on topic. $\endgroup$Paresh– Paresh2013年01月09日 16:48:49 +00:00Commented Jan 9, 2013 at 16:48
-
1$\begingroup$ Also, algorithmist.com/index.php/… for an implementation. $\endgroup$Paresh– Paresh2013年01月09日 16:51:34 +00:00Commented Jan 9, 2013 at 16:51
-
$\begingroup$ The algorithm described here is cubic - O(m^2 * n) - not quadratic. The net is choc full not only with descriptions of the standard O(m*n) algorithm but also with explanations how to derive the algorithm from first principles... The most important optimisation here is to consult an introductory algorithm text or any of the gazillion resources out there on the 'net (including Rosetta Code, which has LCS implementations in plenty of languages). $\endgroup$DarthGizka– DarthGizka2016年03月24日 15:39:12 +00:00Commented Mar 24, 2016 at 15:39
1 Answer 1
One optimization is to replace the access to reached_index[item]
by a local variable which is initialized from the hash table at the beginning of the for item
loop, and stored back to the hash table at the end of the loop.
Another optimization which will speed up things is to replace the hash tables with arrays of size 256.
Explore related questions
See similar questions with these tags.