2
\$\begingroup\$

I'm working on the famous clump finding problem to learn Haskell. Part of the problem involve breaking nucleotide sequences, called kmers, into subsequences as follows:

ACTGCA -> [ACT,CTG,GCA]

This sliding window generates all possible kmers of length k, 3 in the above example.

The problem is my very simple algorithm is too slow to work on the E. Coli genome. Other algorithms in slower languages can generate the kmer list in seconds.

Here is what I wrote:

kmerBreak k genome@(x : xs)
 | k <= length genome = take k genome : kmerBreak k xs
 | otherwise = []

I would love some feedback on how to improve the performance of this simple algorithm. As well as my use of Haskell in general.

Sᴀᴍ Onᴇᴌᴀ
29.5k16 gold badges45 silver badges201 bronze badges
asked Jan 5, 2023 at 19:22
\$\endgroup\$

2 Answers 2

3
\$\begingroup\$

Simple mistake: a list does not contain its length, so length traverses the list to compute the length. This means that kmerBreak runs in time quadratic in the size of the input. A simple fix is to check instead that take k genome is indeed of length k, assuming that k is small. Or you can keep track of the length of genome as an extra parameter to the recursive function.

answered Jan 5, 2023 at 22:10
\$\endgroup\$
0
1
\$\begingroup\$

As the previous poster suggested, you can check that take k returns a list of length k. And it is possible to do it like this:

import Data.List (tails)
kmerBreak k = takeWhile ((==k) . length) . map (take k) . tails 
answered Jan 6, 2023 at 10:11
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.