Give an efficient dynamic programming algorithm that decides if a string is an interleaving of two other strings

Question 1

I'm trying to understand this dynamic programming related problem, adapted from Kleinberg's Algorithm Design book.

You’re consulting for a group of people (who would prefer not to be mentioned here by name) whose jobs consist of monitoring and analyzing electronic signals coming from ships in coastal Atlantic waters. They want a fast algorithm for a basic primitive that arises frequently: "untangling" a superposition of two known signals. Specifically, they’re picturing a situation in which each of two ships is emitting a short sequence of 0ドルs$ and 1ドルs$ over and over, and they want to make sure that the signal they’re hearing is simply an interleaving of these two emissions, with nothing extra added in. This describes the whole problem; we can make it a little more explicit as follows.

Given a string $x$ consisting of 0ドルs$ and 1ドルs$, we write $x^k$ to denote $k$ copies of $x$ concatenated together. We say that a string $x$ is a repetition of $x$ if it is a prefix of $x^k$ for some number $k$. So $x = 10110110110$ is a repetition of $x = 101$. We say that a string $s$ is an interleaving of $x$ and $y$ if its symbols can be partitioned into two (not necessarily contiguous) subsequences $s'$ and $s''$, so that $s'$ is a repetition of $x$, and $s''$ is a repetition of $y$. (So each digit in $s$ must belong to exactly one of $s'$ or $s''$.) For example, if $x = 101$ and $y = 00$, then $s = 100010101$ is an interleaving of $x$ and $y$, since characters at position 1,2,5,7,8,9ドル$ form 101101ドル$—a repetition of $x$—and the remaining characters form 000ドル$—a repetition of $y$.

In terms of our application, $x$ and $y$ are the repeating sequences from the two ships, and $s$ is the signal we’re listening to: We want to make sure $s$ "unravels" into simple repetitions of $x$ and $y$. Give an efficient algorithm that takes strings $s, x,$ and $y$ and decides if $s$ is an interleaving of $x$ and $y$.

Question 2

Let's use caps, S for input string, A for x and B for y. Furthermore, let |S| = lenS, |A| = lenA, |B| = lenB.

With any Dynamic Programming problem, it is useful to append to the problem:

"... demonstrate one such solution." or, equivalently "... provide one such proof certificate."

This is because a Dynamic Programming solution is equivalent to searching over proof certificates. For this problem, what's a proof certificate? There are many possibilities. One of them is: Strings over {a, b} of length lenS.

Certificate aabbaa encodes: S[1] = A[1], S[2] = A[2], S[3] = B[1], S[4] = B[2], S[5] = A[3],... where array indices in A & B wrap around i.e. A[r + lenA] = A[r].

Clearly, S is an interleaving of A & B if and only if such a proof certificate exists. The task now boils down to finding one such certificate, if it exists.

Source code in C++ (Repl: https://repl.it/@vemana/csstackexchange#interleaving/main.cpp)

#include <bits/stdc++.h>
using namespace std;
const int MAX_LENS = 100, MAX_LENA=100, MAX_LENB=100;
// Memoization array
// Initialize: to -1.
int memo[MAX_LENS][MAX_LENA][MAX_LENB];
// Initialize: to 0
// Encodes the Proof Certificate, C
// SELECT[r] = 1 implies C[r] = a
// SELECT[r] = 2 implies C[r] = b
int SELECT[MAX_LENS][MAX_LENA][MAX_LENB];
// The input strings
string S, A, B;
// Initialize: lenS = S.size(), lenA = A.size(), lenB = B.size()
// Expected: lenS > 0, lenA > 0, lenB > 0
int lenS, lenA, lenB;
vector<string> getCertificate() {
 if (SELECT[0][0][0] == 0)
 return vector<string>(); // No solution
 vector<string> ret(2);
 int s = 0, a = 0, b = 0;
 while(s < lenS) {
 if (SELECT[s][a][b] == 1) {
 ret[0] += A[a];
 ret[1] += '_';
 a = (a+1)%lenA;
 s = s + 1;
 } else {
 ret[0] += '_';
 ret[1] += B[b];
 b = (b+1)%lenB;
 s = s + 1;
 }
 }
 return ret;
}
// Finds certificate for S[s...lenS] where
// next index in A is a (0 based)
// next index in B is b (0 based)
//
// Returns 0 if not found
// Returns 1 if found
int findCertificate(int s, int a, int b) {
 if (s == lenS)
 return 1; // Found the "empty" certificate.
 int& ret = memo[s][a][b];
 if (ret != -1) // already computed
 return ret;
 // Try to match S with A
 bool takeA = (S[s] == A[a]) && findCertificate(s+1, (a+1)%lenA, b) == 1;
 // Try to match S with B
 bool takeB = (S[s] == B[b]) && findCertificate(s+1, a, (b+1)%lenB) == 1;
 if (takeA) {
 SELECT[s][a][b] = 1; // SELECT A to extend the certificate
 return ret = 1;
 }
 if (takeB) {
 SELECT[s][a][b] = 2; // SELECT B to extend the certificate
 return ret = 1;
 }
 return ret = 0; // Neither matched. So, no certificate found
}
// Just run with ./a.out < in
// File 'in' has three strings, one per line, representing A, B & C.
int main() {
 // initialize S, A, B, lenA, lenB, lenS, SELECT, memo
 cin >> A;
 cin >> B;
 cin >> S;
 lenA = A.size();
 lenB = B.size();
 lenS = S.size();
 memset(memo,-1,sizeof(memo));
 memset(SELECT, 0, sizeof(SELECT));
 int out = findCertificate(0, 0, 0);
 vector<string> cert = getCertificate();
 string output = cert.empty() ? "NOT INTERLEAVED" : ("INTERLEAVED\n" + cert[0] + "\n" + cert[1]);
 cout << output << endl;
 return 0;
}

Question 3

Welcome to SE computer science. I cannot really comment your answer because I am not used to the kind of approach you are using. However, I suspect that people here often prefer an abstract algorithmic presentation to a computer program. And also some of us do not know all languages. So, if you actually use a programming language, it might be nice to say which, and if useful, which version. Some of these answers are still being used years after they have been posted.

Question 4

Thanks for the advice ! Appreciate it. repl.it/repls/DistantCreativeInstitute#main.cpp is the repl. C++ (version 7.5.0). It prints out an interleaving given three strings. I left an input & instruction to run the program. I liked your elegant NFA argument. My solution is pretty much the same, but in code. I find it easier to present code rather than math; it is after-all one way of expressing mathematical thought and can be tested out on real inputs.

Question 5

babou, my solution uses the NFA approach you outline. In findCertificate(s, a, b), s stands for position in the input string being recognized, a stands for the state of the first DFA (the one for A), b stands for the state in the second DFA (the one for B). Note that the DFA that recognizes repetitions of A has exactly |A| accepting states. The index 0<=a<|A| identifies which state the DFA is in. The function findCertificate(s,a,b) is just trying out both possibilities for which of the DFAs consumes the next character. My solution is O(|A| |B| |S|).

Question 6

No way you can make me read that length of code, but I get it from your comments. Not sure how much it differs from the other answer, but ... The advantage of the automata approach is that you get for free all the know-how of automata theory, such as minimization of the number of states. I do try to avoid code, and even math when I can afford it. I believe that when I can do it, it means I understand what I am doing, and I am best able to convey the gist of the problem. Long ago, I once gave up on a class because I got the best grade without understanding what I was doing. Symbol pushing.

Question 7

One reason I've gravitated towards code is that its runtime complexity can be evaluated. For e.g. my solution is O(|A| |B| |S|). I am not yet clear on how to convert the NFA to a DFA in polynomial time. The powerset construction could theoretically take exponential time in the number of states, which is roughly |A|.|B|. So, even with linear recognition time, the total runtime is dominated by NFA construction, which is exponential. A pathological case for NFA -> DFA construction could be when A = (a^n)x and B = (a^n)y. The NFA can be in one of k+1 states after consuming the first k<n 'a's.

Question 8

For 0 ≤ k ≤ n, determine all pairs (i, j) such that the first k characters of s are an interleaving of a string X consisting of repeats of x followed by the i first characters of x, 0 ≤ i < length(x), and a string Y consisting of repeats of y followed by the j first characters of y.

For k = 0, this is { (0, 0) }.

For k+1, for every pair (i, j) in the set for k, check if the next character of s matches the next character of x or the next character of y and add 0, 1, or 2 pairs to the set for k+1, avoiding duplicates.

For fixed x, y, you have a regular language and can create a state machine for it. You can create that state machine on the fly, making the algorithm run very fast if the number is input symbols is much more than lcm (len(x), len(y)).

Question 9

The easiest way to tackle this problem is to first solve it with a non-deterministic solution, then use dynamic programming to change it into a deterministic one.

The main difficulty here is that the problem is so specific that the specifics hide a simple solution that can be applied to a much wider class of problems.

So I will first généralise the problem. You may have noticed that what you call a repetition of $x$ is a regular set. Instead on considering this very peculiar family of regular sets, we will consider all regular sets on the given alphabet. So, instead of repetitions of $x$ and $y$, we consider two regular sets $X$ and $Y$. Then we call an interleaving of $X$ and $Y$ a sequence that can be partitioned in the way you describe into 2 strings $x\in X$ and $y\in Y$.

If you consider two FSA (finite state automata) $F_X$ and $F_Y$ for recognizing respectively $X$ and $Y$, it is quite sraightforward to build from then a non-deterministic FSA that recognizes te set of interleavings of $X$ and $Y$. It is very similar to the cross-product construction to get the intersection of two regular sets. You take as set of states the crossproduct of the states of both FSA, so that you can, nondeterministically in some cases, follow the recognition on the $X$ side of things, or on the $Y$ side of things (but not both as the same time as in the intersection construction). Dealing with initial and final states is left as an exercise.

No dynamic programming yet. Here it comes. As we all know, non-determinism does not work nicely in the real world, unless we accomodate it with various techniques, and dynamic programming is one of them. This also applies to non-deterministic finite-state automata, and in this special case, dynamic programming is called the powerset construction to turn NFA (non-deterministic finite automata) into DFA (deterministic finite automata). So you just do it, and there you can go spying of other ships. The recognition algorithm works in linear-time. The cost is in preprocessing.

The definition of a repetition od a string $x$ plays no role in this solution. It is just to be translated into a FSA that recognizes such a repetition.

Vemana Vemana 314 bronze badges · Answer 1 · 2020-06-07 17:23:41Z

Let's use caps, S for input string, A for x and B for y. Furthermore, let |S| = lenS, |A| = lenA, |B| = lenB.

With any Dynamic Programming problem, it is useful to append to the problem:

"... demonstrate one such solution." or, equivalently "... provide one such proof certificate."

This is because a Dynamic Programming solution is equivalent to searching over proof certificates. For this problem, what's a proof certificate? There are many possibilities. One of them is: Strings over {a, b} of length lenS.

Certificate aabbaa encodes: S[1] = A[1], S[2] = A[2], S[3] = B[1], S[4] = B[2], S[5] = A[3],... where array indices in A & B wrap around i.e. A[r + lenA] = A[r].

Clearly, S is an interleaving of A & B if and only if such a proof certificate exists. The task now boils down to finding one such certificate, if it exists.

Source code in C++ (Repl: https://repl.it/@vemana/csstackexchange#interleaving/main.cpp)

#include <bits/stdc++.h>
using namespace std;
const int MAX_LENS = 100, MAX_LENA=100, MAX_LENB=100;
// Memoization array
// Initialize: to -1.
int memo[MAX_LENS][MAX_LENA][MAX_LENB];
// Initialize: to 0
// Encodes the Proof Certificate, C
// SELECT[r] = 1 implies C[r] = a
// SELECT[r] = 2 implies C[r] = b
int SELECT[MAX_LENS][MAX_LENA][MAX_LENB];
// The input strings
string S, A, B;
// Initialize: lenS = S.size(), lenA = A.size(), lenB = B.size()
// Expected: lenS > 0, lenA > 0, lenB > 0
int lenS, lenA, lenB;
vector<string> getCertificate() {
 if (SELECT[0][0][0] == 0)
 return vector<string>(); // No solution
 vector<string> ret(2);
 int s = 0, a = 0, b = 0;
 while(s < lenS) {
 if (SELECT[s][a][b] == 1) {
 ret[0] += A[a];
 ret[1] += '_';
 a = (a+1)%lenA;
 s = s + 1;
 } else {
 ret[0] += '_';
 ret[1] += B[b];
 b = (b+1)%lenB;
 s = s + 1;
 }
 }
 return ret;
}
// Finds certificate for S[s...lenS] where
// next index in A is a (0 based)
// next index in B is b (0 based)
//
// Returns 0 if not found
// Returns 1 if found
int findCertificate(int s, int a, int b) {
 if (s == lenS)
 return 1; // Found the "empty" certificate.
 int& ret = memo[s][a][b];
 if (ret != -1) // already computed
 return ret;
 // Try to match S with A
 bool takeA = (S[s] == A[a]) && findCertificate(s+1, (a+1)%lenA, b) == 1;
 // Try to match S with B
 bool takeB = (S[s] == B[b]) && findCertificate(s+1, a, (b+1)%lenB) == 1;
 if (takeA) {
 SELECT[s][a][b] = 1; // SELECT A to extend the certificate
 return ret = 1;
 }
 if (takeB) {
 SELECT[s][a][b] = 2; // SELECT B to extend the certificate
 return ret = 1;
 }
 return ret = 0; // Neither matched. So, no certificate found
}
// Just run with ./a.out < in
// File 'in' has three strings, one per line, representing A, B & C.
int main() {
 // initialize S, A, B, lenA, lenB, lenS, SELECT, memo
 cin >> A;
 cin >> B;
 cin >> S;
 lenA = A.size();
 lenB = B.size();
 lenS = S.size();
 memset(memo,-1,sizeof(memo));
 memset(SELECT, 0, sizeof(SELECT));
 int out = findCertificate(0, 0, 0);
 vector<string> cert = getCertificate();
 string output = cert.empty() ? "NOT INTERLEAVED" : ("INTERLEAVED\n" + cert[0] + "\n" + cert[1]);
 cout << output << endl;
 return 0;
}

Welcome to SE computer science. I cannot really comment your answer because I am not used to the kind of approach you are using. However, I suspect that people here often prefer an abstract algorithmic presentation to a computer program. And also some of us do not know all languages. So, if you actually use a programming language, it might be nice to say which, and if useful, which version. Some of these answers are still being used years after they have been posted.
Thanks for the advice ! Appreciate it. repl.it/repls/DistantCreativeInstitute#main.cpp is the repl. C++ (version 7.5.0). It prints out an interleaving given three strings. I left an input & instruction to run the program. I liked your elegant NFA argument. My solution is pretty much the same, but in code. I find it easier to present code rather than math; it is after-all one way of expressing mathematical thought and can be tested out on real inputs.
babou, my solution uses the NFA approach you outline. In findCertificate(s, a, b), s stands for position in the input string being recognized, a stands for the state of the first DFA (the one for A), b stands for the state in the second DFA (the one for B). Note that the DFA that recognizes repetitions of A has exactly |A| accepting states. The index 0<=a<|A| identifies which state the DFA is in. The function findCertificate(s,a,b) is just trying out both possibilities for which of the DFAs consumes the next character. My solution is O(|A| |B| |S|).
No way you can make me read that length of code, but I get it from your comments. Not sure how much it differs from the other answer, but ... The advantage of the automata approach is that you get for free all the know-how of automata theory, such as minimization of the number of states. I do try to avoid code, and even math when I can afford it. I believe that when I can do it, it means I understand what I am doing, and I am best able to convey the gist of the problem. Long ago, I once gave up on a class because I got the best grade without understanding what I was doing. Symbol pushing.
One reason I've gravitated towards code is that its runtime complexity can be evaluated. For e.g. my solution is O(|A| |B| |S|). I am not yet clear on how to convert the NFA to a DFA in polynomial time. The powerset construction could theoretically take exponential time in the number of states, which is roughly |A|.|B|. So, even with linear recognition time, the total runtime is dominated by NFA construction, which is exponential. A pathological case for NFA -> DFA construction could be when A = (a^n)x and B = (a^n)y. The NFA can be in one of k+1 states after consuming the first k<n 'a's.

gnasher729 gnasher729 32.4k36 silver badges56 bronze badges · Answer 2 · 2020-03-04 17:35:44Z

For 0 ≤ k ≤ n, determine all pairs (i, j) such that the first k characters of s are an interleaving of a string X consisting of repeats of x followed by the i first characters of x, 0 ≤ i < length(x), and a string Y consisting of repeats of y followed by the j first characters of y.

For k = 0, this is { (0, 0) }.

For k+1, for every pair (i, j) in the set for k, check if the next character of s matches the next character of x or the next character of y and add 0, 1, or 2 pairs to the set for k+1, avoiding duplicates.

For fixed x, y, you have a regular language and can create a state machine for it. You can create that state machine on the fly, making the algorithm run very fast if the number is input symbols is much more than lcm (len(x), len(y)).

babou babou 19.6k43 silver badges77 bronze badges · Answer 3 · 2020-06-08 15:30:31Z

The easiest way to tackle this problem is to first solve it with a non-deterministic solution, then use dynamic programming to change it into a deterministic one.

The main difficulty here is that the problem is so specific that the specifics hide a simple solution that can be applied to a much wider class of problems.

So I will first généralise the problem. You may have noticed that what you call a repetition of $x$ is a regular set. Instead on considering this very peculiar family of regular sets, we will consider all regular sets on the given alphabet. So, instead of repetitions of $x$ and $y$, we consider two regular sets $X$ and $Y$. Then we call an interleaving of $X$ and $Y$ a sequence that can be partitioned in the way you describe into 2 strings $x\in X$ and $y\in Y$.

If you consider two FSA (finite state automata) $F_X$ and $F_Y$ for recognizing respectively $X$ and $Y$, it is quite sraightforward to build from then a non-deterministic FSA that recognizes te set of interleavings of $X$ and $Y$. It is very similar to the cross-product construction to get the intersection of two regular sets. You take as set of states the crossproduct of the states of both FSA, so that you can, nondeterministically in some cases, follow the recognition on the $X$ side of things, or on the $Y$ side of things (but not both as the same time as in the intersection construction). Dealing with initial and final states is left as an exercise.

No dynamic programming yet. Here it comes. As we all know, non-determinism does not work nicely in the real world, unless we accomodate it with various techniques, and dynamic programming is one of them. This also applies to non-deterministic finite-state automata, and in this special case, dynamic programming is called the powerset construction to turn NFA (non-deterministic finite automata) into DFA (deterministic finite automata). So you just do it, and there you can go spying of other ships. The recognition algorithm works in linear-time. The cost is in preprocessing.

The definition of a repetition od a string $x$ plays no role in this solution. It is just to be translated into a FSA that recognizes such a repetition.

Stack Exchange Network

Give an efficient dynamic programming algorithm that decides if a string is an interleaving of two other strings

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Give an efficient dynamic programming algorithm that decides if a string is an interleaving of two other strings

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions