Longest common substring

Question 1

I wrote a program to find the longest common subsequence among several strings. I used a naive algorithm and implementation.

The motivation was solving the Rosalind problem at http://rosalind.info/problems/lcs/ . You can find sample input there as well. The Rosalind problem concerns strings as DNA, but I think my code can be treated as a general string operation.

The problem asks for any of the common substrings if there is more than one, but I find all of them.

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if a longer common substring of the collection does not exist. For example, CG is a common substring of ACGTACGT and AACCGGTATA, whereas GTA is a longest common substring. Note that multiple longest common substrings may exist.

Given: A collection of DNA strings (of length at most 1 kbp each; ).

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

Sample Dataset
 GATTACA 
 TAGACCA 
 ATACA
Sample Output
 AC

How can this code be improved? What obvious problems are there?

using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Diagnostics;
using System.IO;
using System.Linq;
namespace Finding_a_Shared_Motif
{
 class Program
 {
 private static void Main()
 {
 var input = File.ReadAllLines("rosalind_lcs.txt").ToList();
 var t = new Stopwatch();
 t.Restart();
 var lcs = LongestCommonSubstring(input);
 t.Stop();
 File.WriteAllLines("output.txt", lcs);
 Console.WriteLine("Finished in {0} msec.", t.ElapsedMilliseconds);
 Console.ReadLine();
 }
 public static IEnumerable<string> LongestCommonSubstring(List<string> strings)
 {
 var lcs = LongestCommonSubstring(strings[0], strings[1]);
 for (var i = 2; i < strings.Count(); i++)
 {
 var new_lcs = new BestStrings();
 foreach (var s in lcs) new_lcs.Add(LongestCommonSubstring(s, strings[i]));
 lcs = new_lcs;
 }
 return lcs;
 }
 private static BestStrings LongestCommonSubstring(string s1, string s2)
 {
 var lcs = new BestStrings();
 for (var i = 1 - s2.Length; i < s1.Length; i++)
 {
 var substrings = BestSubstringWithAlignment(s1, s2, i);
 if (substrings.Length == 0) continue;
 lcs.Add(substrings);
 }
 return lcs;
 }
 private static BestStrings BestSubstringWithAlignment(string s1, string s2, int offset)
 {
 var substrings = new BestStrings();
 var substring = "";
 for (var i = Math.Max(0, offset); i < s1.Length && i < s2.Length + offset; i++)
 {
 var c1 = s1[i];
 var c2 = s2[i - offset];
 if (c1 == c2)
 {
 substring = substring + c1;
 }
 else
 {
 substrings.Add(substring);
 substring = "";
 }
 }
 substrings.Add(substring);
 return substrings;
 }
 sealed class BestStrings : Collection<string>
 {
 public int Length
 {
 get { return base[0].Length; }
 }
 public BestStrings()
 {
 base.Add("");
 }
 public new void Add(string s)
 {
 if (s.Length == 0 || s.Length < Length || Contains(s)) return;
 if (s.Length > Length) Clear();
 base.Add(s);
 }
 public void Add(IEnumerable<string> collection)
 {
 foreach (var s in collection) Add(s);
 }
 }
 }
}

Question 2

Nothing major for me. Just your one place using new_lcs instead of camel casing. Pretty consistent everywhere else though.

Question 3

It's not correct! Change input to List<string> input = new() { "abc1edfg", "abc2edfg", "abc3edfg", "abc4edg", "abc5edfg", }; the answer is "ed" not "abc", because it ignores "abc" since beginning.

Question 4

Firstly some style.

Short named variables don't help readability at all. c1, c2, s1, s2 are a bad idea. I know this is just a challenge and not production code but keeping a consistent style is a good habit.

Secondly I would start by using the string.Contains() method. It will find if a specified substring exists and should help clean up some of the code.

In this train of thought I decided to start with all the possible substrings in the first string and then search the list of all strings.

public static IEnumerable<string> LongestCommonSubstrings(List<string> strings)
{
 var firstString = strings.FirstOrDefault();
 var allSubstrings = new List<string>();
 for(int substringLength = firstString.Length -1; substringLength >0; substringLength--)
 {
 for(int offset = 0; (substringLength + offset) < firstString.Length; offset++)
 {
 string currentSubstring = firstString.Substring(offset,substringLength);
 if (!System.String.IsNullOrWhiteSpace(currentSubstring) && !allSubstrings.Contains(currentSubstring))
 {
 allSubstrings.Add(currentSubstring);
 }
 }
 }
 return allSubstrings.OrderBy(subStr => subStr).ThenByDescending(subStr => subStr.Length).Where(subStr => strings.All(currentString => currentString.Contains(subStr)));
}

This will also allow us to do a .FirstOrDefault() on the linq and get the largest ( due to the orderby() calls.

(To test I used a static list of strings instead of a file:)

public static void Run()
{
 var input = new List<string>{
 "GATTACA",
 "TAGACCA",
 "ATACA",
 };
 var t = new Stopwatch();
 t.Restart();
 var lcs = LongestCommonSubstrings(input);
 t.Stop();
 Console.WriteLine("All Common substrings: \r\n{0}", string.Join("\r\n", lcs));
 Console.WriteLine("Finished in {0} msec.", t.ElapsedMilliseconds);
 Console.ReadLine();
}

DISCLAIMER: Haven't fully tested the above code. May harm your brain/computer.

Question 5

Careful! OrderByDescending overrides the effects of OrderBy — what you need is ThenByDescending. And please don't fully qualify System.String. I suggest simply using the lower case alias string if you feel that String alone is too ambiguous.

Question 6

@codesparkle oops! Thanks. I really should be more careful I hadn't noticed the orderby/orderbydescending.

Question 7

I'm not sure if you're interested in better algorithms, but if so here are some.

In this particular implementation, I would consider changing how you handle substrings. You're currently doing a lot of string concatenation, which can be slow as a new string object is allocated each time. Since you're just tracking substrings anyway, you could instead store the source string, start index and length of each match. That would save you potentially quite a bit of memory, and run faster as well.

If you're really attached to having separate string objects, at the very least figure out the extent of the match and then do a single substring() to extract it, rather than building up the substring one character at a time.

James Khoury James Khoury 3,1651 gold badge25 silver badges51 bronze badges · Answer 1 · 2013-02-05 06:00:42Z

Firstly some style.

Short named variables don't help readability at all. c1, c2, s1, s2 are a bad idea. I know this is just a challenge and not production code but keeping a consistent style is a good habit.

Secondly I would start by using the string.Contains() method. It will find if a specified substring exists and should help clean up some of the code.

In this train of thought I decided to start with all the possible substrings in the first string and then search the list of all strings.

public static IEnumerable<string> LongestCommonSubstrings(List<string> strings)
{
 var firstString = strings.FirstOrDefault();
 var allSubstrings = new List<string>();
 for(int substringLength = firstString.Length -1; substringLength >0; substringLength--)
 {
 for(int offset = 0; (substringLength + offset) < firstString.Length; offset++)
 {
 string currentSubstring = firstString.Substring(offset,substringLength);
 if (!System.String.IsNullOrWhiteSpace(currentSubstring) && !allSubstrings.Contains(currentSubstring))
 {
 allSubstrings.Add(currentSubstring);
 }
 }
 }
 return allSubstrings.OrderBy(subStr => subStr).ThenByDescending(subStr => subStr.Length).Where(subStr => strings.All(currentString => currentString.Contains(subStr)));
}

This will also allow us to do a .FirstOrDefault() on the linq and get the largest ( due to the orderby() calls.

(To test I used a static list of strings instead of a file:)

public static void Run()
{
 var input = new List<string>{
 "GATTACA",
 "TAGACCA",
 "ATACA",
 };
 var t = new Stopwatch();
 t.Restart();
 var lcs = LongestCommonSubstrings(input);
 t.Stop();
 Console.WriteLine("All Common substrings: \r\n{0}", string.Join("\r\n", lcs));
 Console.WriteLine("Finished in {0} msec.", t.ElapsedMilliseconds);
 Console.ReadLine();
}

DISCLAIMER: Haven't fully tested the above code. May harm your brain/computer.

Careful! OrderByDescending overrides the effects of OrderBy — what you need is ThenByDescending. And please don't fully qualify System.String. I suggest simply using the lower case alias string if you feel that String alone is too ambiguous.
@codesparkle oops! Thanks. I really should be more careful I hadn't noticed the orderby/orderbydescending.

breischl breischl 4492 silver badges5 bronze badges · Answer 2 · 2013-02-07 19:42:13Z

I'm not sure if you're interested in better algorithms, but if so here are some.

In this particular implementation, I would consider changing how you handle substrings. You're currently doing a lot of string concatenation, which can be slow as a new string object is allocated each time. Since you're just tracking substrings anyway, you could instead store the source string, start index and length of each match. That would save you potentially quite a bit of memory, and run faster as well.

If you're really attached to having separate string objects, at the very least figure out the extent of the match and then do a single substring() to extract it, rather than building up the substring one character at a time.

Stack Exchange Network

Longest common substring

Sample Dataset

Sample Output

2 Answers 2

Hot Network Questions

Longest common substring

Sample Dataset

Sample Output

2 Answers 2

Related

Hot Network Questions