Algorithm for finding the longest common subsequence using hash table. Why is it slow?

Question 1

I've designed an algorithm to find the longest common subsequence. The steps are:

Start with i = 0

Picks the first letter from the first string start from $i^{th}$ letter.
Go to the second string looking for that picked letter.
If not found return to the first string and picks the next letter and repeat 1 to 3 until it finds a letter that is in the second string.
Now that found a common letter in the second string, adds that to common_subsequence.
Store its position in index.
Picks next letter from the first string and do step 2 but this time starts from index.
Repeat 3 to 6 until reached end of string 1 or string 2.
If length common_subsequence is greater than length of common subsequence so far add that change lcs to the common_subsequence.
Add 1 to the i and repeat 1 to 9 while i is less than the length of the first string.

Example‫‪
X = ABCBDAB‬‬ 
‫‪Y = BDCABA‬‬ 
First pick A.

Look for A in Y.

Now that found A add that to the common_subsequence.

Then pick B from X.

Look for B in Y but this time start searching from A.

Now pick C. It isn't there in string 2, so pick the next letter in X that is B.

The complexity of this algorithm is $\Theta(n\cdot m)$.

I implemented it using two methods. Here are my implementations.

First

import time
def lcs(xstr, ystr):
 if not (xstr and ystr): return # if string is empty
 lcs = [''] # longest common subsequence
 lcslen = 0 # length of longest common subsequence so far
 for i in xrange(len(xstr)):
 cs = '' # common subsequence
 start = 0 # start position in ystr
 for item in xstr[i:]:
 index = ystr.find(item, start) # position at the common letter
 if index != -1: # if common letter has found
 cs += item # add common letter to the cs
 start = index + 1
 if index == len(ystr) - 1: break # if reached end of the ystr
 # update lcs and lcslen if found better cs
 if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
 elif len(cs) == lcslen: lcs.append(cs)
 return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

Second Algorithm using hash table

import time
from collections import defaultdict
def lcs(xstr, ystr):
 if not (xstr and ystr): return # if strings are empty
 lcs = [''] # longest common subsequence
 lcslen = 0 # length of longest common subsequence so far
 location = defaultdict(list) # keeps track of items in the ystr
 i = 0
 for k in ystr:
 location[k].append(i)
 i += 1
 for i in xrange(len(xstr)):
 cs = '' # common subsequence
 index = -1
 reached_index = defaultdict(int)
 for item in xstr[i:]:
 for new_index in location[item][reached_index[item]:]:
 reached_index[item] += 1
 if index < new_index:
 cs += item # add item to the cs
 index = new_index
 break
 if index == len(ystr) - 1: break # if reached end of the ystr
 # update lcs and lcslen if found better cs
 if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
 elif len(cs) == lcslen: lcs.append(cs)
 return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

The second one uses a hash table, but after implementing I found that it's much slower compared to the first algorithm. Can someone elaborate why?

Question 2

This might be better suited to stackoverflow.com, or better yet, to Code Review. To me, their FAQ seems to allow such questions as on topic.

Question 3

Also, algorithmist.com/index.php/… for an implementation.

Question 4

The algorithm described here is cubic - O(m^2 * n) - not quadratic. The net is choc full not only with descriptions of the standard O(m*n) algorithm but also with explanations how to derive the algorithm from first principles... The most important optimisation here is to consult an introductory algorithm text or any of the gazillion resources out there on the 'net (including Rosetta Code, which has LCS implementations in plenty of languages).

Question 5

One optimization is to replace the access to reached_index[item] by a local variable which is initialized from the hash table at the beginning of the for item loop, and stored back to the hash table at the end of the loop.

Another optimization which will speed up things is to replace the hash tables with arrays of size 256.

Yuval Filmus Yuval Filmus 281k27 gold badges317 silver badges514 bronze badges · Accepted Answer · 2013-01-09 20:29:18Z

One optimization is to replace the access to reached_index[item] by a local variable which is initialized from the hash table at the beginning of the for item loop, and stored back to the hash table at the end of the loop.

Another optimization which will speed up things is to replace the hash tables with arrays of size 256.

Stack Exchange Network

Algorithm for finding the longest common subsequence using hash table. Why is it slow?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Algorithm for finding the longest common subsequence using hash table. Why is it slow?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions