0
$\begingroup$

I've designed an algorithm to find the longest common subsequence. The steps are:

Start with i = 0

  1. Picks the first letter from the first string start from $i^{th}$ letter.
  2. Go to the second string looking for that picked letter.
  3. If not found return to the first string and picks the next letter and repeat 1 to 3 until it finds a letter that is in the second string.
  4. Now that found a common letter in the second string, adds that to common_subsequence.
  5. Store its position in index.
  6. Picks next letter from the first string and do step 2 but this time starts from index.
  7. Repeat 3 to 6 until reached end of string 1 or string 2.
  8. If length common_subsequence is greater than length of common subsequence so far add that change lcs to the common_subsequence.
  9. Add 1 to the i and repeat 1 to 9 while i is less than the length of the first string.

Example‫‪

X = ABCBDAB‬‬ 
‫‪Y = BDCABA‬‬ 
  1. First pick A.
  2. Look for A in Y.
  3. Now that found A add that to the common_subsequence.
  4. Then pick B from X.
  5. Look for B in Y but this time start searching from A.
  6. Now pick C. It isn't there in string 2, so pick the next letter in X that is B.

The complexity of this algorithm is $\Theta(n\cdot m)$.

I implemented it using two methods. Here are my implementations.

  • First
import time
def lcs(xstr, ystr):
 if not (xstr and ystr): return # if string is empty
 lcs = [''] # longest common subsequence
 lcslen = 0 # length of longest common subsequence so far
 for i in xrange(len(xstr)):
 cs = '' # common subsequence
 start = 0 # start position in ystr
 for item in xstr[i:]:
 index = ystr.find(item, start) # position at the common letter
 if index != -1: # if common letter has found
 cs += item # add common letter to the cs
 start = index + 1
 if index == len(ystr) - 1: break # if reached end of the ystr
 # update lcs and lcslen if found better cs
 if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
 elif len(cs) == lcslen: lcs.append(cs)
 return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
  • Second Algorithm using hash table
import time
from collections import defaultdict
def lcs(xstr, ystr):
 if not (xstr and ystr): return # if strings are empty
 lcs = [''] # longest common subsequence
 lcslen = 0 # length of longest common subsequence so far
 location = defaultdict(list) # keeps track of items in the ystr
 i = 0
 for k in ystr:
 location[k].append(i)
 i += 1
 for i in xrange(len(xstr)):
 cs = '' # common subsequence
 index = -1
 reached_index = defaultdict(int)
 for item in xstr[i:]:
 for new_index in location[item][reached_index[item]:]:
 reached_index[item] += 1
 if index < new_index:
 cs += item # add item to the cs
 index = new_index
 break
 if index == len(ystr) - 1: break # if reached end of the ystr
 # update lcs and lcslen if found better cs
 if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
 elif len(cs) == lcslen: lcs.append(cs)
 return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

The second one uses a hash table, but after implementing I found that it's much slower compared to the first algorithm. Can someone elaborate why?

asked Jan 9, 2013 at 16:28
$\endgroup$
3
  • 2
    $\begingroup$ This might be better suited to stackoverflow.com, or better yet, to Code Review. To me, their FAQ seems to allow such questions as on topic. $\endgroup$ Commented Jan 9, 2013 at 16:48
  • 1
    $\begingroup$ Also, algorithmist.com/index.php/… for an implementation. $\endgroup$ Commented Jan 9, 2013 at 16:51
  • $\begingroup$ The algorithm described here is cubic - O(m^2 * n) - not quadratic. The net is choc full not only with descriptions of the standard O(m*n) algorithm but also with explanations how to derive the algorithm from first principles... The most important optimisation here is to consult an introductory algorithm text or any of the gazillion resources out there on the 'net (including Rosetta Code, which has LCS implementations in plenty of languages). $\endgroup$ Commented Mar 24, 2016 at 15:39

1 Answer 1

1
$\begingroup$

One optimization is to replace the access to reached_index[item] by a local variable which is initialized from the hash table at the beginning of the for item loop, and stored back to the hash table at the end of the loop.

Another optimization which will speed up things is to replace the hash tables with arrays of size 256.

answered Jan 9, 2013 at 20:29
$\endgroup$
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.