Algorithm for length of longest common subsequence

Question 1

The case of multiple strings. A slight modification of the dynamic programming algorithm for two strings is used as a subroutine. Here is the pseudo code:

# The modified dynamic programming algorithm for longest common subsequence.
# Inputs: x, y - the strings
 xL, yL - their respective length vectors
lcs(x, xL, y, yL)
 u = |x|
 v = |y|
 T = 2-dimensional table with default value 0 for all elements.
 for i in 0..u-1
 for j in 0..v-1
 if x[i] == y[j]
 T[i, j] = T[i - 1, j - 1] + 1
 else
 T[i, j] = max(T[i - 1, j], T[i, j - 1])
 # limit the value of the table element by the length vectors:
 T[i, j] = min(T[i, j], xL[i], yL[j])
 # Take the element wise minimum of the vectors for each of the strings and the bottom row and right-most column respectively.
 return x, element-wise-minimum-of(xL, T[_, v - 1]), y, element-wise-minimum-of(yL, T[u - 1, _])
# Algorithm to compute the length of the longest common subsequence of multiple strings.
# Inputs: strings - a list of strings
mlcs(strings):
 if for any string x: |x| == 0
 return 0
 R = a dictionary indexed by a string x, where the value is initialized to be the length vector of x (initially [1,2,..,|x|]).
 N = initially a copy of R.
 while true
 for i in 0..|strings|
 for j in i + 1..|strings|
 x, y = strings[i], strings[j]
 _, xL, _, yL = lcs(x, R[x], y, R[y])
 R[x], R[y] = xL, yL
 if for every string x: N[x] is element wise equal to R[x]
 break
 else
 N[x] = R[x]
 return the last element of any of the vectors in R

What follows is an implementation in the python language.

import collections
def lcs(x, xL, y, yL):
 u = len(x)
 v = len(y)
 T = collections.defaultdict(int)
 for i in range(u):
 for j in range(v):
 if x[i] == y[j]:
 T[i, j] = T[i - 1, j - 1] + 1
 else:
 T[i, j] = max(T[i - 1, j], T[i, j - 1])
 T[i, j] = min(T[i, j], xL[i], yL[j])
 return (x, [min(xL[i], T[i, v - 1]) for i in range(u)]), (y, [min(yL[j], T[u - 1, j]) for j in range(v)])
def mlcs(strings):
 if any(len(x) == 0 for x in strings):
 return 0
 R = {x: list(range(1, len(x) + 1)) for x in strings}
 N = {x: R[x] for x in strings}
 while True:
 for i in range(len(strings)):
 for j in range(i + 1, len(strings)):
 x, y = strings[i], strings[j]
 (_, xL), (_, yL) = lcs(x, R[x], y, R[y])
 R[x] = xL
 R[y] = yL
 if all(all(N[x][i] == R[x][i] for i in range(len(x))) for x in strings):
 break
 else:
 for x in strings:
 N[x] = R[x][:]
 return R[strings[-1]][-1]

The idea is to initialize for each input string a vector which holds information for each position of that string about what is the maximum length of a LCS up to that position. These are updated each time a pair-wise LCS computation is done. This in turn is done for all pairs until all the vectors are stable (ie. don't change) between iterations.

So far testing gives me mixed results. The correct length at first seems to be produced pretty reliably, but when trying to use the function mlcs as an oracle in an algorithm for retrieving an actual longest sequence, sometimes the result is not what is expected. This may or may not indicate that there are cases for which an incorrect length is produced.

So on to the question: Is the idea of the length computing algorithm sound?

Question 2

en.wikipedia.org/wiki/…

Question 3

Posting a big blob of code is a poor fit for this site, I think. Could you reformulate your problem using concise pseudo-code?

Question 4

Also, please add what you have tried towards proving correctness of your algorithm, and where you get stuck.

Question 5

Please define what you mean by "sound". What do you want to know about your approach? Do you want to know if it gives the correct answer? How efficient it is? Something else? What does "not what is expected" mean -- do you mean "is not the correct answer"? If so, does that mean you've answered your own question? Based on your comment below it appears that you already know of a testcase where your algorithm gives the wrong answer. So does any question remain? Is there any reason to keep this open? Would you like to answer your own question?

Question 6

By sound I meant that it would always produce the correct result, which is as we have seen not true, so I am going to add an answer to this question.

Question 7

The algorithm is incorrect. That is it may return a length that is greater than the actual length of a longest common subsequence. Here is a test case which fails to produce the correct length: ['baaa', 'aaba', 'baba'].

Question 8

This answer can be improved by explaining why the algorithm fails conceptually.

Question 9

I doubt that algorithm has any concept, so it is hard to fail conceptually. And "is incorrect" with a counterexample is absolutely good enough.

Question 10

I didn't try to read your pseudocode, but based on your description, if I understand your idea correctly, it has a (probably fatal) problem: Even if all pairs $(i, j)$ of strings have a common subsequence with length $\ge k,ドル it may be that the LCS of the set of all strings has length $< k$.

Here's an example:

AB
BC
CA

Every pair of strings has an LCS of length 1, but no character appears in all 3 strings, so the LCS of the set of all 3 strings is 0.

Question 11

So, initially we have the strings AB, BC, CA with the associated vectors [1 2], [1 2], [1 2]. Now, AB×BC gives AB=[0 1], BC=[1 1]. Then AB×CA gives AB=[0 0], CA =[0 0] and finally BC×CA gives BC=[0 0] and CA=[0 0], and we've reached a non-changing state of the vectors and we can conclude that the answer is 0.

Question 12

Then I didn't fully understand your idea. Since you know that the function returns incorrect values sometimes, I suggest using standard debugging techniques to isolate a small counterexample, which will allow to see whether it is a fundamental problem with the algorithm or just a bug in your implementation of it. (To automate this I would generate all possible inputs in increasing size order, looking for the first one for which your algorithm and a known-good algorithm return different results.)

Question 13

I haven't been able to pin-point the problem. I have been concentrating on exploring with the implementation which can be tracked at github.com/batprey/lcs and now it seems the code is at a point where a sequence the length of an embedded common sequence in the test strings can be found (maybe I should say most of the time to be cautious). But embedding a sequence in the test strings does not guarantee that there is not a sequence even longer than that common to the strings which should be found instead were the algorithm working correctly.

Question 14

Also, as to your good suggestions @j_random_hacker indeed I must test more systematically.

Question 15

Your counterexample ['baaa', 'aaba', 'baba'] is a good one -- that should be small enough that you can figure out by hand what your algorithm should be doing with it.

Question 16

The problem you're solving is called the MLCS (multiple longest common subsequence).

A standard sequential algorithm for three strings A, B, C of lengths a, b, c would look something like:

R <- a ×ばつ b ×ばつ c array
-- initialise the base cases
for i <- 1...a:
 R[i, 0, 0] <- 0
for i <- 1...b:
 R[0, i, 0] <- 0
for i <- 1...c:
 R[0, 0, i] <- 0
-- the actual bottom-up 3-LCS
for i <- 1...a:
 for j <- 1...b:
 for k <- 1...c:
 if A[i - 1] = B[j - 1] and B[j - 1] = C[k - 1]: -- match
 R[i, j, k] = R[i - 1, j - 1, k - 1] + 1
 else -- no match => get the max of the three predecessors
 R[i, j, k] = max(R[i - 1, j, k], R[i, j - 1, k], R[i, j, k - 1])
-- the length of the longest 3LCS is stored in R[a, b, c] at the end

As a bonus for reading through this pseudocode, here's a handy teaser for where you can take this idea going forward: There is an opportunity to parallelise this algorithm by wavefronts (each wavefront defined by i + j + k), processing the wavefronts successively, but the elements of each wavefront in parallel.

Question 17

R[2, 5, 0] has not been initialized to 0. It will be used when A[3] = B[6] = C[1]. R[0, 0, 0] is not initialized.

user72935 user72935 711 silver badge7 bronze badges · Answer 1 · 2016-08-16 06:25:02Z

3

$\begingroup$

The algorithm is incorrect. That is it may return a length that is greater than the actual length of a longest common subsequence. Here is a test case which fails to produce the correct length: ['baaa', 'aaba', 'baba'].

Share

Improve this answer

answered Aug 16, 2016 at 6:25

user72935's user avatar

user72935 user72935

711 silver badge7 bronze badges

$\endgroup$

2

1

$\begingroup$ This answer can be improved by explaining why the algorithm fails conceptually. $\endgroup$

Raphael
– Raphael

2016年08月16日 09:36:49 +00:00
Commented Aug 16, 2016 at 9:36
$\begingroup$ I doubt that algorithm has any concept, so it is hard to fail conceptually. And "is incorrect" with a counterexample is absolutely good enough. $\endgroup$

gnasher729
– gnasher729

2016年08月17日 14:03:24 +00:00
Commented Aug 17, 2016 at 14:03

Add a comment |

score 2 · Answer 2 · 2016-07-14 11:13:11Z

2

$\begingroup$

I didn't try to read your pseudocode, but based on your description, if I understand your idea correctly, it has a (probably fatal) problem: Even if all pairs $(i, j)$ of strings have a common subsequence with length $\ge k,ドル it may be that the LCS of the set of all strings has length $< k$.

Here's an example:

AB
BC
CA

Every pair of strings has an LCS of length 1, but no character appears in all 3 strings, so the LCS of the set of all 3 strings is 0.

Share

Improve this answer

answered Jul 14, 2016 at 11:13

j_random_hacker's user avatar

j_random_hacker j_random_hacker

5,5091 gold badge17 silver badges22 bronze badges

$\endgroup$

8

$\begingroup$ So, initially we have the strings AB, BC, CA with the associated vectors [1 2], [1 2], [1 2]. Now, AB×BC gives AB=[0 1], BC=[1 1]. Then AB×CA gives AB=[0 0], CA =[0 0] and finally BC×CA gives BC=[0 0] and CA=[0 0], and we've reached a non-changing state of the vectors and we can conclude that the answer is 0. $\endgroup$

user72935
– user72935

2016年07月14日 11:28:49 +00:00
Commented Jul 14, 2016 at 11:28
2

$\begingroup$ Then I didn't fully understand your idea. Since you know that the function returns incorrect values sometimes, I suggest using standard debugging techniques to isolate a small counterexample, which will allow to see whether it is a fundamental problem with the algorithm or just a bug in your implementation of it. (To automate this I would generate all possible inputs in increasing size order, looking for the first one for which your algorithm and a known-good algorithm return different results.) $\endgroup$

j_random_hacker
– j_random_hacker

2016年07月14日 11:39:06 +00:00
Commented Jul 14, 2016 at 11:39
$\begingroup$ I haven't been able to pin-point the problem. I have been concentrating on exploring with the implementation which can be tracked at github.com/batprey/lcs and now it seems the code is at a point where a sequence the length of an embedded common sequence in the test strings can be found (maybe I should say most of the time to be cautious). But embedding a sequence in the test strings does not guarantee that there is not a sequence even longer than that common to the strings which should be found instead were the algorithm working correctly. $\endgroup$

user72935
– user72935

2016年07月19日 20:31:56 +00:00
Commented Jul 19, 2016 at 20:31
$\begingroup$ Also, as to your good suggestions @j_random_hacker indeed I must test more systematically. $\endgroup$

user72935
– user72935

2016年07月19日 20:32:59 +00:00
Commented Jul 19, 2016 at 20:32
1

$\begingroup$ Your counterexample ['baaa', 'aaba', 'baba'] is a good one -- that should be small enough that you can figure out by hand what your algorithm should be doing with it. $\endgroup$

j_random_hacker
– j_random_hacker

2016年07月22日 13:31:31 +00:00
Commented Jul 22, 2016 at 13:31

| Show 3 more comments

Nathan Davis Nathan Davis 566 bronze badges · Answer 3 · 2023-06-15 22:42:37Z

The problem you're solving is called the MLCS (multiple longest common subsequence).

A standard sequential algorithm for three strings A, B, C of lengths a, b, c would look something like:

R <- a ×ばつ b ×ばつ c array
-- initialise the base cases
for i <- 1...a:
 R[i, 0, 0] <- 0
for i <- 1...b:
 R[0, i, 0] <- 0
for i <- 1...c:
 R[0, 0, i] <- 0
-- the actual bottom-up 3-LCS
for i <- 1...a:
 for j <- 1...b:
 for k <- 1...c:
 if A[i - 1] = B[j - 1] and B[j - 1] = C[k - 1]: -- match
 R[i, j, k] = R[i - 1, j - 1, k - 1] + 1
 else -- no match => get the max of the three predecessors
 R[i, j, k] = max(R[i - 1, j, k], R[i, j - 1, k], R[i, j, k - 1])
-- the length of the longest 3LCS is stored in R[a, b, c] at the end

As a bonus for reading through this pseudocode, here's a handy teaser for where you can take this idea going forward: There is an opportunity to parallelise this algorithm by wavefronts (each wavefront defined by i + j + k), processing the wavefronts successively, but the elements of each wavefront in parallel.

R[2, 5, 0] has not been initialized to 0. It will be used when A[3] = B[6] = C[1]. R[0, 0, 0] is not initialized.

Stack Exchange Network

Algorithm for length of longest common subsequence

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Algorithm for length of longest common subsequence

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions