2
\$\begingroup\$

The task is to find the longest common substring of a multitude of long strings. I'm trying to find the best algorithm that wouldn't use syntax trees/arrays (as I don't know anything about them yet).

The idea was to use binary search to find the length of the desired substring, instead of simply trying n, then (n - 1) etc.

My code looks like this:

-- find all the substrings of length n
ngrams :: Int -> String -> [String]
ngrams n s | n > length s = []
 | otherwise = take n s : ngrams n (drop 1 s)
-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring xs = go 0 $ length (head xs) + 1
 where
 -- find a substring of a given length n that is common to all strings
 commonSubstrings n = foldr1 intersect (map (ngrams n) xs)
 go l r
 -- if the binary search ended, pick one common substring
 | r - l == 1 = head $ commonSubstrings l
 | otherwise
 = case commonSubstrings m of
 [] -> go l m -- the length is too big, try a smaller one
 _ -> go m r -- try longer length to find longer substring
 where
 m = (l + r) `div` 2 -- the middle point

It runs at around 3s for my dataset (~100 strings of length ~1000), which seems slow to me. Is there any way to clean up and quicken the code? And is there a better way to approach this problem (apart from the syntax trees) in general?

asked Jun 4, 2017 at 8:22
\$\endgroup\$

1 Answer 1

1
\$\begingroup\$

intersect runs in quadratic time. Sets can use Ord information to speed that up.

-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring xs = go 0 $ length (head xs) + 1
 where
 -- find a substring of a given length n that is common to all strings
 commonSubstrings n = foldr1 S.intersection (map (S.fromList . ngrams n) xs)
 go l r
 -- if the binary search ended, pick one common substring
 | r - l == 1 = S.findMin $ commonSubstrings l
 | otherwise
 = if S.null $ commonSubstrings m
 then go l m -- the length is too big, try a smaller one
 else go m r -- try longer length to find longer substring
 where
 m = (l + r) `div` 2 -- the middle point

For comparison, here's an implementation that skips the binary search part:

-- find the longest common substring of multiple strings
longestCommonSubstring :: [String] -> String
longestCommonSubstring = maximumBy (comparing length)
 . foldr1 S.intersection . map (S.fromList . map inits . tails)
answered Jun 4, 2017 at 14:24
\$\endgroup\$
1
  • 1
    \$\begingroup\$ Good insights. Your version (with sets and with binary search) runs 2s longer then the one with lists, though. \$\endgroup\$ Commented Jun 12, 2017 at 15:17

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.