The task is to find the longest common substring of a multitude of long strings. I'm trying to find the best algorithm that wouldn't use syntax trees/arrays (as I don't know anything about them yet).
The idea was to use binary search to find the length of the desired substring, instead of simply trying n, then (n - 1) etc.
My code looks like this:
-- find all the substrings of length n
ngrams :: Int -> String -> [String]
ngrams n s | n > length s = []
| otherwise = take n s : ngrams n (drop 1 s)
-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring xs = go 0 $ length (head xs) + 1
where
-- find a substring of a given length n that is common to all strings
commonSubstrings n = foldr1 intersect (map (ngrams n) xs)
go l r
-- if the binary search ended, pick one common substring
| r - l == 1 = head $ commonSubstrings l
| otherwise
= case commonSubstrings m of
[] -> go l m -- the length is too big, try a smaller one
_ -> go m r -- try longer length to find longer substring
where
m = (l + r) `div` 2 -- the middle point
It runs at around 3s for my dataset (~100 strings of length ~1000), which seems slow to me. Is there any way to clean up and quicken the code? And is there a better way to approach this problem (apart from the syntax trees) in general?
1 Answer 1
intersect
runs in quadratic time. Set
s can use Ord
information to speed that up.
-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring xs = go 0 $ length (head xs) + 1
where
-- find a substring of a given length n that is common to all strings
commonSubstrings n = foldr1 S.intersection (map (S.fromList . ngrams n) xs)
go l r
-- if the binary search ended, pick one common substring
| r - l == 1 = S.findMin $ commonSubstrings l
| otherwise
= if S.null $ commonSubstrings m
then go l m -- the length is too big, try a smaller one
else go m r -- try longer length to find longer substring
where
m = (l + r) `div` 2 -- the middle point
For comparison, here's an implementation that skips the binary search part:
-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring = maximumBy (comparing length)
. foldr1 S.intersection . map (S.fromList . map inits . tails)
-
1\$\begingroup\$ Good insights. Your version (with sets and with binary search) runs 2s longer then the one with lists, though. \$\endgroup\$Eugleo– Eugleo2017年06月12日 15:17:56 +00:00Commented Jun 12, 2017 at 15:17