Given some string fragments, I would like to find the shortest possible single string ("output string") that contains all the fragments. Fragments can overlap each other in the output string.
Example:
For the string fragments:
BCDA
AGF
ABC
The following output string contains all fragments, and was made by naive appending:
BCDAAGFABC
However this output string is better (shorter), as it employs overlaps:
ABCDAGF
^
ABC
^
BCDA
^
AGF
I'm looking for algorithms for this problem. It's not absolutely important to find the strictly shortest output string, but the shorter the better. I'm looking for an algorithm better than the obvious naive one that would try appending all permutations of the input fragments and removing overlaps (which would appear to be NP-Complete).
I've started work on a solution and it's proving quite interesting; I'd like to see what other people might come up with. I'll add my work-in-progress to this question in a while.
-
3The problem seems to be NP-complete. If so, you won't be able to find a polynomial algorithm for determining the shortest string at all, but there might be polynomial algorithms which give approximate (not the shortest possible) solutions.superM– superM09/25/2012 10:59:01Commented Sep 25, 2012 at 10:59
-
3This blog post regarding NP-Complete is nice: codinghorror.com/blog/2008/11/…occulus– occulus09/25/2012 11:04:03Commented Sep 25, 2012 at 11:04
-
The blog's really nice, I read it all the time )))superM– superM09/25/2012 11:06:36Commented Sep 25, 2012 at 11:06
-
@superM this is similar enough to traveling salesman (each string a city and cost between cities = some number-overlap)ratchet freak– ratchet freak09/25/2012 11:34:11Commented Sep 25, 2012 at 11:34
-
@ratchet freak, it is _ you could give small cost between cities if they have more common letters, and the biggest cost when they don't have any common letter at allsuperM– superM09/25/2012 11:41:01Commented Sep 25, 2012 at 11:41
1 Answer 1
What you're asking about is the Shortest Common Superstring problem, for which there is no algorithm that works for all cases. But it is a common problem (in compression and DNA sequencing) and several approximation algorithms are well-known.
"Greedy" algorithms are generally accepted to be the most effective (as in, they have the least-bad worst-case).
Have a read of the paper Approximation Algorithms for the Shortest Common Superstring Problem by Jonathan Turner for much more information.
-
1Some relevant pages: update.uu.se/~shikaree/Westling and cs.sunysb.edu/~algorith/files/shortest-common-superstring.shtmlocculus– occulus09/25/2012 12:44:29Commented Sep 25, 2012 at 12:44
-
Hmm, note that the first link in my comment just above address supersequences and not superstrings! A supersequence does not seem to require all characters in a sequence be contiguous.occulus– occulus09/25/2012 13:24:00Commented Sep 25, 2012 at 13:24
-