I wrote a program to find the longest common subsequence among several strings. I used a naive algorithm and implementation.
The motivation was solving the Rosalind problem at http://rosalind.info/problems/lcs/ . You can find sample input there as well. The Rosalind problem concerns strings as DNA, but I think my code can be treated as a general string operation.
The problem asks for any of the common substrings if there is more than one, but I find all of them.
A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if a longer common substring of the collection does not exist. For example, CG is a common substring of ACGTACGT and AACCGGTATA, whereas GTA is a longest common substring. Note that multiple longest common substrings may exist.
Given: A collection of DNA strings (of length at most 1 kbp each; ).
Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)
Sample Dataset
GATTACA TAGACCA ATACA
Sample Output
AC
How can this code be improved? What obvious problems are there?
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Diagnostics;
using System.IO;
using System.Linq;
namespace Finding_a_Shared_Motif
{
class Program
{
private static void Main()
{
var input = File.ReadAllLines("rosalind_lcs.txt").ToList();
var t = new Stopwatch();
t.Restart();
var lcs = LongestCommonSubstring(input);
t.Stop();
File.WriteAllLines("output.txt", lcs);
Console.WriteLine("Finished in {0} msec.", t.ElapsedMilliseconds);
Console.ReadLine();
}
public static IEnumerable<string> LongestCommonSubstring(List<string> strings)
{
var lcs = LongestCommonSubstring(strings[0], strings[1]);
for (var i = 2; i < strings.Count(); i++)
{
var new_lcs = new BestStrings();
foreach (var s in lcs) new_lcs.Add(LongestCommonSubstring(s, strings[i]));
lcs = new_lcs;
}
return lcs;
}
private static BestStrings LongestCommonSubstring(string s1, string s2)
{
var lcs = new BestStrings();
for (var i = 1 - s2.Length; i < s1.Length; i++)
{
var substrings = BestSubstringWithAlignment(s1, s2, i);
if (substrings.Length == 0) continue;
lcs.Add(substrings);
}
return lcs;
}
private static BestStrings BestSubstringWithAlignment(string s1, string s2, int offset)
{
var substrings = new BestStrings();
var substring = "";
for (var i = Math.Max(0, offset); i < s1.Length && i < s2.Length + offset; i++)
{
var c1 = s1[i];
var c2 = s2[i - offset];
if (c1 == c2)
{
substring = substring + c1;
}
else
{
substrings.Add(substring);
substring = "";
}
}
substrings.Add(substring);
return substrings;
}
sealed class BestStrings : Collection<string>
{
public int Length
{
get { return base[0].Length; }
}
public BestStrings()
{
base.Add("");
}
public new void Add(string s)
{
if (s.Length == 0 || s.Length < Length || Contains(s)) return;
if (s.Length > Length) Clear();
base.Add(s);
}
public void Add(IEnumerable<string> collection)
{
foreach (var s in collection) Add(s);
}
}
}
}
-
\$\begingroup\$ Nothing major for me. Just your one place using new_lcs instead of camel casing. Pretty consistent everywhere else though. \$\endgroup\$dreza– dreza2013年02月04日 19:29:22 +00:00Commented Feb 4, 2013 at 19:29
-
\$\begingroup\$ It's not correct! Change input to List<string> input = new() { "abc1edfg", "abc2edfg", "abc3edfg", "abc4edg", "abc5edfg", }; the answer is "ed" not "abc", because it ignores "abc" since beginning. \$\endgroup\$Mu Jinming– Mu Jinming2021年11月04日 02:00:58 +00:00Commented Nov 4, 2021 at 2:00
2 Answers 2
Firstly some style.
Short named variables don't help readability at all. c1
, c2
, s1
, s2
are a bad idea. I know this is just a challenge and not production code but keeping a consistent style is a good habit.
Secondly I would start by using the string.Contains()
method. It will find if a specified substring exists and should help clean up some of the code.
In this train of thought I decided to start with all the possible substrings in the first string and then search the list of all strings.
public static IEnumerable<string> LongestCommonSubstrings(List<string> strings)
{
var firstString = strings.FirstOrDefault();
var allSubstrings = new List<string>();
for(int substringLength = firstString.Length -1; substringLength >0; substringLength--)
{
for(int offset = 0; (substringLength + offset) < firstString.Length; offset++)
{
string currentSubstring = firstString.Substring(offset,substringLength);
if (!System.String.IsNullOrWhiteSpace(currentSubstring) && !allSubstrings.Contains(currentSubstring))
{
allSubstrings.Add(currentSubstring);
}
}
}
return allSubstrings.OrderBy(subStr => subStr).ThenByDescending(subStr => subStr.Length).Where(subStr => strings.All(currentString => currentString.Contains(subStr)));
}
This will also allow us to do a .FirstOrDefault()
on the linq and get the largest ( due to the orderby()
calls.
(To test I used a static list of strings instead of a file:)
public static void Run()
{
var input = new List<string>{
"GATTACA",
"TAGACCA",
"ATACA",
};
var t = new Stopwatch();
t.Restart();
var lcs = LongestCommonSubstrings(input);
t.Stop();
Console.WriteLine("All Common substrings: \r\n{0}", string.Join("\r\n", lcs));
Console.WriteLine("Finished in {0} msec.", t.ElapsedMilliseconds);
Console.ReadLine();
}
DISCLAIMER: Haven't fully tested the above code. May harm your brain/computer.
-
1\$\begingroup\$ Careful!
OrderByDescending
overrides the effects ofOrderBy
— what you need isThenByDescending
. And please don't fully qualifySystem.String
. I suggest simply using the lower case aliasstring
if you feel thatString
alone is too ambiguous. \$\endgroup\$Adam– Adam2013年02月05日 09:54:33 +00:00Commented Feb 5, 2013 at 9:54 -
\$\begingroup\$ @codesparkle oops! Thanks. I really should be more careful I hadn't noticed the orderby/orderbydescending. \$\endgroup\$James Khoury– James Khoury2013年02月05日 23:36:05 +00:00Commented Feb 5, 2013 at 23:36
I'm not sure if you're interested in better algorithms, but if so here are some.
In this particular implementation, I would consider changing how you handle substrings. You're currently doing a lot of string concatenation, which can be slow as a new string
object is allocated each time. Since you're just tracking substrings anyway, you could instead store the source string, start index and length of each match. That would save you potentially quite a bit of memory, and run faster as well.
If you're really attached to having separate string objects, at the very least figure out the extent of the match and then do a single substring()
to extract it, rather than building up the substring one character at a time.
Explore related questions
See similar questions with these tags.