Optimize calculation of string self-similarity with its suffixes

Question 1

I am trying to solve a HackerRank problem about string suffixes:

For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.

Calculate the sum of similarities of a string S with each of its suffixes.

I have the following code:

import System.IO
import Control.Monad
import Data.Maybe
import Data.List
import qualified Data.ByteString.Char8 as BSC

main = do
    n <- liftM getIntFromBS BSC.getLine
    replicateM_ n $ do
        s <- BSC.getLine
        putStrLn . show $ sum $ map (cntPrefix s) $ tail $ sort $ BSC.tails s
  where
    getIntFromBS = fst . fromJust . BSC.readInt
    cntPrefix str pref = length $ takeWhile (\t -> fst t == snd t) $ BSC.zip pref str

But the performance is really bad. I am doing these challenges to learn Haskell, so my skills aren't yet good enough to optimize this program. Any help is appreciated.

Comment

Normally, I would recommend the Z Algorithm, but I'm not sure how to translate it into Haskell.
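
One way the Z algorithm mentioned above might be translated into Haskell is sketched below. It is only a sketch under a few assumptions: it uses the array package that ships with GHC, it takes a ByteString as input, and the names zArray and similaritySum are invented for this illustration. It also assumes the whole string counts as a suffix of itself (z[0] is the length of the string); subtract the length if it should not.

```haskell
import Control.Monad (foldM_)
import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray, elems)
import qualified Data.ByteString.Char8 as BSC

-- | Entry i is the length of the longest common prefix of s and the
-- suffix of s starting at index i. Entry 0 is the length of s.
-- (zArray is a hypothetical name, not taken from the original post.)
zArray :: BSC.ByteString -> UArray Int Int
zArray s
  | n == 0    = runSTUArray (newArray (0, -1) 0)
  | otherwise = runSTUArray $ do
      z <- newArray (0, n - 1) 0
      writeArray z 0 n
      foldM_ (step z) (0, 0) [1 .. n - 1]
      return z
  where
    n = BSC.length s

    -- Extend a match at position i that is already known to have length k.
    extend :: Int -> Int -> Int
    extend i k
      | i + k < n && BSC.index s k == BSC.index s (i + k) = extend i (k + 1)
      | otherwise = k

    -- (l, r) is the rightmost window [l, r) known to equal a prefix of s.
    step :: STUArray s Int Int -> (Int, Int) -> Int -> ST s (Int, Int)
    step z (l, r) i = do
      -- Inside the window, reuse the value already computed for i - l.
      k0 <- if i < r
              then min (r - i) <$> readArray z (i - l)
              else return 0
      let k = extend i k0
      writeArray z i k
      return (if i + k > r then (i, i + k) else (l, r))

-- | Sum of similarities of s with all of its suffixes, s itself included.
similaritySum :: BSC.ByteString -> Int
similaritySum = sum . elems . zArray
```

For the string "ababaa" the array is [6,0,3,0,1,1], so similaritySum gives 11. Each character comparison either fails once per position or pushes the window's right edge forward, so the total work is linear in the length of the string, compared with the quadratic worst case of comparing every suffix against the string from scratch.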

bisserlis · Answer 1 · 2014-11-13 13:49:13Z

As far as I can tell, your solution is not only slow but also incorrect because of your use of sort. Dropping that single function from your pipeline gives a large speedup, and the result no longer includes the similarity of s to itself.

Also make sure you're compiling with -O2; otherwise sum won't be optimized into a strict fold, and you'll blow up the stack with thunks.
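
If depending on the optimizer feels fragile, the strict fold can also be written out explicitly. The following is a small sketch rather than part of the original answer, and sumStrict is a name invented here; it uses foldl' from Data.List, which forces the accumulator at every step.

```haskell
import Data.List (foldl')

-- | Strict left fold: the running total is evaluated at each element,
-- so no chain of unevaluated (+) thunks builds up.
-- (sumStrict is a hypothetical helper, not from the original code.)
sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0
```

Replacing sum with sumStrict in the pipeline should keep memory use flat regardless of the optimization level, although compiling with -O2 is still worthwhile for the rest of the program.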

I also came up with a version aimed at readability, but it's ~3x slower than your (corrected) ByteString-based version, so I'll just include it here as a curiosity.

import Control.Monad (replicateM_)
import Data.List (tails)

main :: IO ()
main = do
    n <- readLn
    replicateM_ n $ do
        s <- getLine
        print $ sum $ map (similarity s) (suffixes s)

-- Length of the longest common prefix of two lists.
similarity :: (Eq a) => [a] -> [a] -> Int
similarity s = length . takeWhile id . zipWith (==) s

-- All proper suffixes: tail drops the list itself, the empty suffix stays.
suffixes :: [a] -> [[a]]
suffixes = tail . tails

The sort isn't there for fun; it is necessary to give the correct answer. Your answer gives the wrong output.

Only if you consider xs :: [a] to be a suffix of xs :: [a], which depends on how pedantic you're willing to be. It is also completely unnecessary, because the only thing tail . sort . tails will ever do is drop an instance of the empty list, which in your case would have a similarity of 0 anyway and would not affect the sum. Remove tail $ sort $ from your version and you'll see the answer you expect along with the speedup.

That is indeed different from what you posted: in your answer, tail drops an arbitrary suffix rather than the empty list, which, as you said, is redundant to drop since its similarity is zero. Unfortunately, I tried the suggested modifications, but it is still too slow. I'll update the original post.
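
The disagreement in these comments comes down to which element tail removes in each pipeline, and that can be checked directly. The sketch below uses plain String lists instead of ByteString for brevity, and the names withSort and withoutSort are invented for this illustration.

```haskell
import Data.List (sort, tails)

-- | The question's pipeline: the empty suffix is the smallest element,
-- so sort moves it to the front and tail drops it. The whole string stays.
withSort :: String -> [String]
withSort = tail . sort . tails

-- | The answer's pipeline: tails lists the whole string first,
-- so tail drops the string itself. The empty suffix stays.
withoutSort :: String -> [String]
withoutSort = tail . tails
```

For "aba", withSort gives ["a","aba","ba"] and withoutSort gives ["ba","a",""]. The two sums therefore differ by exactly the length of the string, which is why removing only sort changes the output while removing tail $ sort $ together does not.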
