Finding the longest common substring of multiple strings in Haskell

Question 1

The task is to find the longest common substring of a multitude of long strings. I'm trying to find the best algorithm that wouldn't use syntax trees/arrays (as I don't know anything about them yet).

The idea was to use binary search to find the length of the desired substring, instead of simply trying n, then (n - 1) etc.

My code looks like this:

-- find all the substrings of length n
ngrams :: Int -> String -> [String]
ngrams n s | n > length s = []
 | otherwise = take n s : ngrams n (drop 1 s)
-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring xs = go 0 $ length (head xs) + 1
 where
 -- find a substring of a given length n that is common to all strings
 commonSubstrings n = foldr1 intersect (map (ngrams n) xs)
 go l r
 -- if the binary search ended, pick one common substring
 | r - l == 1 = head $ commonSubstrings l
 | otherwise
 = case commonSubstrings m of
 [] -> go l m -- the length is too big, try a smaller one
 _ -> go m r -- try longer length to find longer substring
 where
 m = (l + r) `div` 2 -- the middle point

It runs at around 3s for my dataset (~100 strings of length ~1000), which seems slow to me. Is there any way to clean up and quicken the code? And is there a better way to approach this problem (apart from the syntax trees) in general?

Question 2

intersect runs in quadratic time. Sets can use Ord information to speed that up.

-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring xs = go 0 $ length (head xs) + 1
 where
 -- find a substring of a given length n that is common to all strings
 commonSubstrings n = foldr1 S.intersection (map (S.fromList . ngrams n) xs)
 go l r
 -- if the binary search ended, pick one common substring
 | r - l == 1 = S.findMin $ commonSubstrings l
 | otherwise
 = if S.null $ commonSubstrings m
 then go l m -- the length is too big, try a smaller one
 else go m r -- try longer length to find longer substring
 where
 m = (l + r) `div` 2 -- the middle point

For comparison, here's an implementation that skips the binary search part:

-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring = maximumBy (comparing length)
 . foldr1 S.intersection . map (S.fromList . map inits . tails)

Question 3

Good insights. Your version (with sets and with binary search) runs 2s longer then the one with lists, though.

Gurkenglas Gurkenglas 3,79413 silver badges17 bronze badges · Answer 1 · 2017-06-04 14:24:33Z

intersect runs in quadratic time. Sets can use Ord information to speed that up.

-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring xs = go 0 $ length (head xs) + 1
 where
 -- find a substring of a given length n that is common to all strings
 commonSubstrings n = foldr1 S.intersection (map (S.fromList . ngrams n) xs)
 go l r
 -- if the binary search ended, pick one common substring
 | r - l == 1 = S.findMin $ commonSubstrings l
 | otherwise
 = if S.null $ commonSubstrings m
 then go l m -- the length is too big, try a smaller one
 else go m r -- try longer length to find longer substring
 where
 m = (l + r) `div` 2 -- the middle point

For comparison, here's an implementation that skips the binary search part:

-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring = maximumBy (comparing length)
 . foldr1 S.intersection . map (S.fromList . map inits . tails)

Good insights. Your version (with sets and with binary search) runs 2s longer then the one with lists, though.

Stack Exchange Network

Finding the longest common substring of multiple strings in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Finding the longest common substring of multiple strings in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions