Given n strings, is one of them a substring of another?

Question 1

Suppose we are given a collection of $n$ strings, $S_1,\dots,S_n$. I would like to know whether any of those strings is a substring of any other string in the collection. In other words, I'd like an algorithm for the following task:

Input: $S_1,\dots,S_n$

Output: $i,j$ such that $S_i$ is a substring of $S_j$ and $i\ne j,ドル or None if no such $i,j$ exist

Is there an efficient algorithm for this?

If we replace "substring" with "prefix", there is an efficient algorithm (sort the strings, then do a linear scan to compare adjacent strings; sorting will ensure that substrings are adjacent). But it seems more challenging to test whether any string is a substring of any other string. A naive algorithm is to iterate over all pairs $i,j,ドル but this requires $\Theta(n^2)$ substring tests. Is there a more efficient algorithm?

I guess we could call this "all-pairs substring testing", or something like that.

My ultimate goal is to prune the collection so no string is a substring of any other, by removing each one that is a substring of something else in the collection.

Question 2

Hint: Suffix array.

Question 3

As a side note, $\Theta(n^2)$ is not correct if you remove substrings as you find them. It will be less. Also, you should sort by length since a longer string cannot appear in a shorter string. Again $\Theta(n^2)$ is wrong here.

Question 4

@AlexisWilke, $\Theta(n^2)$ is correct: that's the number of substring tests in the worst case (the worst case is where no string is a substring of any other). Sorting by length gives you only a factor of two, which doesn't affect the asymptotics.

Question 5

You can build a suffix tree in linear time and check if there's an inner node that corresponds to a full string (constant time per node).

In more detail, assume we are given strings $s_1, \dots, s_n \in \Sigma^*$.

Build a (generalised) suffix tree of $s_1\$_1, s_2\$_2, \dots, s_n\$_n$ with $n$ pairwise distinct terminal markers $\$_1,\dots,\$_n \notin \Sigma$.

Using Ukkonen's algorithm, this can be done in linear time; linear in the sum of all string lengths.
Assuming that you label leaves with $(i,j)$ if they represent suffix $s_i[j..|s_i|]$ of $s_i,ドル traverse the tree and find those $n$ leaves labelled $(i,0),ドル i.e. the leaves that correspond to the full strings.

This takes time linear in the tree size, which itself is linear in the input size.
The descendant leaves of the parent of $(i,0)$ (which is reached by an edge labelled $\$_i$) represent all matches from the set; this follows from the basic invariant of suffix trees. Find any one match by descending to any leaf (but $(i,0)$).

This again takes linear time.

The distinct terminal markers are not really necessary; a single one used to terminate all strings is quite sufficient as long as you allow multiple labels per leaf.

Raphael Raphael 73.3k31 gold badges184 silver badges406 bronze badges · Answer 1 · 2014-11-03 07:35:27Z

You can build a suffix tree in linear time and check if there's an inner node that corresponds to a full string (constant time per node).

In more detail, assume we are given strings $s_1, \dots, s_n \in \Sigma^*$.

Build a (generalised) suffix tree of $s_1\$_1, s_2\$_2, \dots, s_n\$_n$ with $n$ pairwise distinct terminal markers $\$_1,\dots,\$_n \notin \Sigma$.

Using Ukkonen's algorithm, this can be done in linear time; linear in the sum of all string lengths.
Assuming that you label leaves with $(i,j)$ if they represent suffix $s_i[j..|s_i|]$ of $s_i,ドル traverse the tree and find those $n$ leaves labelled $(i,0),ドル i.e. the leaves that correspond to the full strings.

This takes time linear in the tree size, which itself is linear in the input size.
The descendant leaves of the parent of $(i,0)$ (which is reached by an edge labelled $\$_i$) represent all matches from the set; this follows from the basic invariant of suffix trees. Find any one match by descending to any leaf (but $(i,0)$).

This again takes linear time.

The distinct terminal markers are not really necessary; a single one used to terminate all strings is quite sufficient as long as you allow multiple labels per leaf.

Stack Exchange Network

Given n strings, is one of them a substring of another?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Given n strings, is one of them a substring of another?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions