I'm working on the famous clump finding problem to learn Haskell. Part of the problem involve breaking nucleotide sequences, called kmers, into subsequences as follows:
ACTGCA -> [ACT,CTG,GCA]
This sliding window generates all possible kmers of length k, 3 in the above example.
The problem is my very simple algorithm is too slow to work on the E. Coli genome. Other algorithms in slower languages can generate the kmer list in seconds.
Here is what I wrote:
kmerBreak k genome@(x : xs)
| k <= length genome = take k genome : kmerBreak k xs
| otherwise = []
I would love some feedback on how to improve the performance of this simple algorithm. As well as my use of Haskell in general.
2 Answers 2
Simple mistake: a list does not contain its length, so length
traverses the list to compute the length. This means that kmerBreak
runs in time quadratic in the size of the input. A simple fix is to check instead that take k genome
is indeed of length k
, assuming that k
is small. Or you can keep track of the length of genome
as an extra parameter to the recursive function.
As the previous poster suggested, you can check that take k
returns a list of length k
. And it is possible to do it like this:
import Data.List (tails)
kmerBreak k = takeWhile ((==k) . length) . map (take k) . tails
Explore related questions
See similar questions with these tags.