10
$\begingroup$

Suppose we are given a collection of $n$ strings, $S_1,\dots,S_n$. I would like to know whether any of those strings is a substring of any other string in the collection. In other words, I'd like an algorithm for the following task:

Input: $S_1,\dots,S_n$

Output: $i,j$ such that $S_i$ is a substring of $S_j$ and $i\ne j,ドル or None if no such $i,j$ exist

Is there an efficient algorithm for this?

If we replace "substring" with "prefix", there is an efficient algorithm (sort the strings, then do a linear scan to compare adjacent strings; sorting will ensure that substrings are adjacent). But it seems more challenging to test whether any string is a substring of any other string. A naive algorithm is to iterate over all pairs $i,j,ドル but this requires $\Theta(n^2)$ substring tests. Is there a more efficient algorithm?

I guess we could call this "all-pairs substring testing", or something like that.

My ultimate goal is to prune the collection so no string is a substring of any other, by removing each one that is a substring of something else in the collection.

Raphael
73.3k31 gold badges184 silver badges406 bronze badges
asked Nov 3, 2014 at 1:58
$\endgroup$
3
  • $\begingroup$ Hint: Suffix array. $\endgroup$ Commented Nov 3, 2014 at 4:29
  • $\begingroup$ As a side note, $\Theta(n^2)$ is not correct if you remove substrings as you find them. It will be less. Also, you should sort by length since a longer string cannot appear in a shorter string. Again $\Theta(n^2)$ is wrong here. $\endgroup$ Commented Nov 3, 2014 at 5:50
  • $\begingroup$ @AlexisWilke, $\Theta(n^2)$ is correct: that's the number of substring tests in the worst case (the worst case is where no string is a substring of any other). Sorting by length gives you only a factor of two, which doesn't affect the asymptotics. $\endgroup$ Commented Nov 3, 2014 at 8:09

1 Answer 1

7
$\begingroup$

You can build a suffix tree in linear time and check if there's an inner node that corresponds to a full string (constant time per node).

In more detail, assume we are given strings $s_1, \dots, s_n \in \Sigma^*$.

  1. Build a (generalised) suffix tree of $s_1\$_1, s_2\$_2, \dots, s_n\$_n$ with $n$ pairwise distinct terminal markers $\$_1,\dots,\$_n \notin \Sigma$.

    Using Ukkonen's algorithm, this can be done in linear time; linear in the sum of all string lengths.

  2. Assuming that you label leaves with $(i,j)$ if they represent suffix $s_i[j..|s_i|]$ of $s_i,ドル traverse the tree and find those $n$ leaves labelled $(i,0),ドル i.e. the leaves that correspond to the full strings.

    This takes time linear in the tree size, which itself is linear in the input size.

  3. The descendant leaves of the parent of $(i,0)$ (which is reached by an edge labelled $\$_i$) represent all matches from the set; this follows from the basic invariant of suffix trees. Find any one match by descending to any leaf (but $(i,0)$).

    This again takes linear time.

The distinct terminal markers are not really necessary; a single one used to terminate all strings is quite sufficient as long as you allow multiple labels per leaf.

answered Nov 3, 2014 at 7:35
$\endgroup$
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.