0
$\begingroup$

The case of multiple strings. A slight modification of the dynamic programming algorithm for two strings is used as a subroutine. Here is the pseudo code:

# The modified dynamic programming algorithm for longest common subsequence.
# Inputs: x, y - the strings
 xL, yL - their respective length vectors
lcs(x, xL, y, yL)
 u = |x|
 v = |y|
 T = 2-dimensional table with default value 0 for all elements.
 for i in 0..u-1
 for j in 0..v-1
 if x[i] == y[j]
 T[i, j] = T[i - 1, j - 1] + 1
 else
 T[i, j] = max(T[i - 1, j], T[i, j - 1])
 # limit the value of the table element by the length vectors:
 T[i, j] = min(T[i, j], xL[i], yL[j])
 # Take the element wise minimum of the vectors for each of the strings and the bottom row and right-most column respectively.
 return x, element-wise-minimum-of(xL, T[_, v - 1]), y, element-wise-minimum-of(yL, T[u - 1, _])
# Algorithm to compute the length of the longest common subsequence of multiple strings.
# Inputs: strings - a list of strings
mlcs(strings):
 if for any string x: |x| == 0
 return 0
 R = a dictionary indexed by a string x, where the value is initialized to be the length vector of x (initially [1,2,..,|x|]).
 N = initially a copy of R.
 while true
 for i in 0..|strings|
 for j in i + 1..|strings|
 x, y = strings[i], strings[j]
 _, xL, _, yL = lcs(x, R[x], y, R[y])
 R[x], R[y] = xL, yL
 if for every string x: N[x] is element wise equal to R[x]
 break
 else
 N[x] = R[x]
 return the last element of any of the vectors in R

What follows is an implementation in the python language.

import collections
def lcs(x, xL, y, yL):
 u = len(x)
 v = len(y)
 T = collections.defaultdict(int)
 for i in range(u):
 for j in range(v):
 if x[i] == y[j]:
 T[i, j] = T[i - 1, j - 1] + 1
 else:
 T[i, j] = max(T[i - 1, j], T[i, j - 1])
 T[i, j] = min(T[i, j], xL[i], yL[j])
 return (x, [min(xL[i], T[i, v - 1]) for i in range(u)]), (y, [min(yL[j], T[u - 1, j]) for j in range(v)])
def mlcs(strings):
 if any(len(x) == 0 for x in strings):
 return 0
 R = {x: list(range(1, len(x) + 1)) for x in strings}
 N = {x: R[x] for x in strings}
 while True:
 for i in range(len(strings)):
 for j in range(i + 1, len(strings)):
 x, y = strings[i], strings[j]
 (_, xL), (_, yL) = lcs(x, R[x], y, R[y])
 R[x] = xL
 R[y] = yL
 if all(all(N[x][i] == R[x][i] for i in range(len(x))) for x in strings):
 break
 else:
 for x in strings:
 N[x] = R[x][:]
 return R[strings[-1]][-1]

The idea is to initialize for each input string a vector which holds information for each position of that string about what is the maximum length of a LCS up to that position. These are updated each time a pair-wise LCS computation is done. This in turn is done for all pairs until all the vectors are stable (ie. don't change) between iterations.

So far testing gives me mixed results. The correct length at first seems to be produced pretty reliably, but when trying to use the function mlcs as an oracle in an algorithm for retrieving an actual longest sequence, sometimes the result is not what is expected. This may or may not indicate that there are cases for which an incorrect length is produced.

So on to the question: Is the idea of the length computing algorithm sound?

asked Jul 14, 2016 at 6:43
$\endgroup$
6
  • $\begingroup$ en.wikipedia.org/wiki/… $\endgroup$ Commented Jul 14, 2016 at 7:41
  • 6
    $\begingroup$ Posting a big blob of code is a poor fit for this site, I think. Could you reformulate your problem using concise pseudo-code? $\endgroup$ Commented Jul 14, 2016 at 7:43
  • 2
    $\begingroup$ Also, please add what you have tried towards proving correctness of your algorithm, and where you get stuck. $\endgroup$ Commented Jul 14, 2016 at 9:10
  • $\begingroup$ Please define what you mean by "sound". What do you want to know about your approach? Do you want to know if it gives the correct answer? How efficient it is? Something else? What does "not what is expected" mean -- do you mean "is not the correct answer"? If so, does that mean you've answered your own question? Based on your comment below it appears that you already know of a testcase where your algorithm gives the wrong answer. So does any question remain? Is there any reason to keep this open? Would you like to answer your own question? $\endgroup$ Commented Aug 14, 2016 at 4:28
  • $\begingroup$ By sound I meant that it would always produce the correct result, which is as we have seen not true, so I am going to add an answer to this question. $\endgroup$ Commented Aug 16, 2016 at 6:22

3 Answers 3

3
$\begingroup$

The algorithm is incorrect. That is it may return a length that is greater than the actual length of a longest common subsequence. Here is a test case which fails to produce the correct length: ['baaa', 'aaba', 'baba'].

answered Aug 16, 2016 at 6:25
$\endgroup$
2
  • 1
    $\begingroup$ This answer can be improved by explaining why the algorithm fails conceptually. $\endgroup$ Commented Aug 16, 2016 at 9:36
  • $\begingroup$ I doubt that algorithm has any concept, so it is hard to fail conceptually. And "is incorrect" with a counterexample is absolutely good enough. $\endgroup$ Commented Aug 17, 2016 at 14:03
2
$\begingroup$

I didn't try to read your pseudocode, but based on your description, if I understand your idea correctly, it has a (probably fatal) problem: Even if all pairs $(i, j)$ of strings have a common subsequence with length $\ge k,ドル it may be that the LCS of the set of all strings has length $< k$.

Here's an example:

AB
BC
CA

Every pair of strings has an LCS of length 1, but no character appears in all 3 strings, so the LCS of the set of all 3 strings is 0.

answered Jul 14, 2016 at 11:13
$\endgroup$
8
  • $\begingroup$ So, initially we have the strings AB, BC, CA with the associated vectors [1 2], [1 2], [1 2]. Now, AB×BC gives AB=[0 1], BC=[1 1]. Then AB×CA gives AB=[0 0], CA =[0 0] and finally BC×CA gives BC=[0 0] and CA=[0 0], and we've reached a non-changing state of the vectors and we can conclude that the answer is 0. $\endgroup$ Commented Jul 14, 2016 at 11:28
  • 2
    $\begingroup$ Then I didn't fully understand your idea. Since you know that the function returns incorrect values sometimes, I suggest using standard debugging techniques to isolate a small counterexample, which will allow to see whether it is a fundamental problem with the algorithm or just a bug in your implementation of it. (To automate this I would generate all possible inputs in increasing size order, looking for the first one for which your algorithm and a known-good algorithm return different results.) $\endgroup$ Commented Jul 14, 2016 at 11:39
  • $\begingroup$ I haven't been able to pin-point the problem. I have been concentrating on exploring with the implementation which can be tracked at github.com/batprey/lcs and now it seems the code is at a point where a sequence the length of an embedded common sequence in the test strings can be found (maybe I should say most of the time to be cautious). But embedding a sequence in the test strings does not guarantee that there is not a sequence even longer than that common to the strings which should be found instead were the algorithm working correctly. $\endgroup$ Commented Jul 19, 2016 at 20:31
  • $\begingroup$ Also, as to your good suggestions @j_random_hacker indeed I must test more systematically. $\endgroup$ Commented Jul 19, 2016 at 20:32
  • 1
    $\begingroup$ Your counterexample ['baaa', 'aaba', 'baba'] is a good one -- that should be small enough that you can figure out by hand what your algorithm should be doing with it. $\endgroup$ Commented Jul 22, 2016 at 13:31
0
$\begingroup$

The problem you're solving is called the MLCS (multiple longest common subsequence).

A standard sequential algorithm for three strings A, B, C of lengths a, b, c would look something like:

R <- a ×ばつ b ×ばつ c array
-- initialise the base cases
for i <- 1...a:
 R[i, 0, 0] <- 0
for i <- 1...b:
 R[0, i, 0] <- 0
for i <- 1...c:
 R[0, 0, i] <- 0
-- the actual bottom-up 3-LCS
for i <- 1...a:
 for j <- 1...b:
 for k <- 1...c:
 if A[i - 1] = B[j - 1] and B[j - 1] = C[k - 1]: -- match
 R[i, j, k] = R[i - 1, j - 1, k - 1] + 1
 else -- no match => get the max of the three predecessors
 R[i, j, k] = max(R[i - 1, j, k], R[i, j - 1, k], R[i, j, k - 1])
-- the length of the longest 3LCS is stored in R[a, b, c] at the end

As a bonus for reading through this pseudocode, here's a handy teaser for where you can take this idea going forward: There is an opportunity to parallelise this algorithm by wavefronts (each wavefront defined by i + j + k), processing the wavefronts successively, but the elements of each wavefront in parallel.

answered Jun 15, 2023 at 22:42
$\endgroup$
1
  • $\begingroup$ R[2, 5, 0] has not been initialized to 0. It will be used when A[3] = B[6] = C[1]. R[0, 0, 0] is not initialized. $\endgroup$ Commented Jun 16, 2023 at 2:59

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.