I recently found out how to use joblib for parallelization. I want to iterate through this string seq in steps of 1 and count the number of times each word (kmer) is seen. I thought this would be really fast, but for larger files it still takes a long time.
How could I make this word counter faster?
from collections import Counter
import multiprocessing

import numpy as np
from joblib import Parallel, delayed

def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]

def countKmers(seq, K=5, num_cores=multiprocessing.cpu_count()):
    """
    Count kmers in sequence
    """
    start_indices = np.arange(0, len(seq) - K + 1, step=1)
    end_indices = start_indices + K
    list_of_kmers = Parallel(n_jobs=num_cores)\
        (delayed(iterSlice)\
        (seq=seq, start_idx=start_idx, end_idx=end_idx)\
        for start_idx, end_idx in zip(start_indices, end_indices))
    return Counter(list_of_kmers)

seq = "ABCDBBBDBABCBDBABCBCBCABCDBACBDCBCABBCABDCBCABCABCABDCBABABABCD"
countKmers(seq=seq, K=3, num_cores=8)
Counter({'ABA': 2,
'ABB': 1,
'ABC': 7,
'ABD': 2,
'ACB': 1,
'BAB': 5,
'BAC': 1,
'BBB': 1,
'BBC': 1,
'BBD': 1,
'BCA': 6,
'BCB': 3,
'BCD': 3,
'BDB': 2,
'BDC': 3,
'CAB': 6,
'CBA': 1,
'CBC': 4,
'CBD': 2,
'CDB': 2,
'DBA': 3,
'DBB': 1,
'DCB': 3})
10 loops, best of 3: 186 ms per loop
1 Answer
Obligatory PEP8 reference: it is recommended to use lower_case names for variables and functions. camelCase is alright if you are consistent (you seem to be using lower_case for variables and camelCase for functions).
The hardest part to read is the call to Parallel. I would make a few changes there to improve readability.
The way delayed is used, I would think it is a decorator and you could do:
@delayed
def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]
Unfortunately this does not work (I asked a Stack Overflow question about why).
You can however do this:
def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]

iterSlice = delayed(iterSlice)
This seems a bit too verbose to me:
iterSlice(seq=seq, start_idx=start_idx, end_idx=end_idx)
This is just as explicit:
iterSlice(seq, start_idx, end_idx)
Parallel can be used as a context manager, so you can do:
with Parallel(n_jobs=num_cores) as parallel:
    list_of_kmers = parallel(iterSlice(seq, start_idx, end_idx)
                             for start_idx, end_idx in zip(start_indices, end_indices))
It would also allow you to reuse the pool of workers if you had an explicit loop in there, which you don't.
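For illustration only, a minimal sketch of what reusing the worker pool across an explicit loop could look like; the list of sequences here is a made-up example, not part of the original question:
from collections import Counter
from joblib import Parallel, delayed

def iter_slice(seq, start_idx, end_idx):
    return seq[start_idx:end_idx]

iter_slice = delayed(iter_slice)

# Hypothetical inputs: several sequences processed one after another.
sequences = ["ABCDBBBD", "BABCBDBA", "BCBCBCAB"]

with Parallel(n_jobs=4) as parallel:
    counters = []
    for s in sequences:
        # The same pool of workers is reused on every pass through the loop.
        counters.append(Counter(parallel(iter_slice(s, i, i + 3)
                                         for i in range(len(s) - 3 + 1))))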
You can probably also move the call to Parallel directly into Counter(...) rather than binding the result to list_of_kmers first.
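Putting the pieces above together, a sketch of what the reworked function could look like (the name count_kmers_parallel is mine, just to keep it distinct from the original countKmers):
from collections import Counter
import multiprocessing

import numpy as np
from joblib import Parallel, delayed

def iter_slice(seq, start_idx, end_idx):
    """Slice sequence"""
    return seq[start_idx:end_idx]

iter_slice = delayed(iter_slice)

def count_kmers_parallel(seq, K=5, num_cores=multiprocessing.cpu_count()):
    """Count kmers in sequence using a pool of workers."""
    start_indices = np.arange(0, len(seq) - K + 1)
    with Parallel(n_jobs=num_cores) as parallel:
        # Feed the results straight into Counter instead of binding them to a name first.
        return Counter(parallel(iter_slice(seq, start, start + K)
                                for start in start_indices))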
Regarding the performance:
The task you are actually parallelizing is taking a slice of the string. This is not really CPU intensive, so the overhead of setting up the parallelization may outweigh the benefits.
For large files you will be limited by reading the file itself, which takes time linear in the file size and uses only one CPU.
Have you tried comparing your code to a naive implementation like this:
from collections import Counter

def count_kmers(seq, K=5):
    """Count kmers in sequence"""
    # len(seq) - K + 1 so the last kmer is included, matching the original start_indices
    return Counter(seq[start:start+K] for start in xrange(len(seq) - K + 1))
(In Python 3.x use range instead of xrange.)
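As a quick sanity check, the naive version can be run on the example sequence from the question and compared against the parallel one; the %timeit lines assume an IPython session, which is where the "186 ms per loop" figure above comes from:
seq = "ABCDBBBDBABCBDBABCBCBCABCDBACBDCBCABBCABDCBCABCABCABDCBABABABCD"
count_kmers(seq, K=3)

# In IPython, time both implementations on a realistically sized sequence:
# %timeit count_kmers(seq, K=3)
# %timeit countKmers(seq=seq, K=3, num_cores=8)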
- O.rka (Aug 9, 2016 at 18:37): That's what I was using before, but I thought that if I could index all the kmers/words in parallel it would be much faster than reading through it item by item. Is that not the case? I'm very new to parallel processing.
- Graipher (Aug 9, 2016 at 19:51): Well, it depends. At some point reading the file becomes the major issue. If that takes longer than the actual processing (linear, not parallel), then it doesn't make sense to parallelize. However, if you can read the file once and then operate on the content in memory, it might make sense again, especially if you need to do many things/complicated things.
- Graipher (Aug 9, 2016 at 20:07): You should just run both versions on a file of the size that is usual for your application and time them. See if it makes any difference.
- O.rka (Aug 9, 2016 at 21:17): So much faster! That's what I was doing before I tried the parallelization. Man, I really thought that splitting up the indexing on different cores and combining the results would be the fastest thing possible.
Comments on the question:
- Could you add the import statements? As is, we can not run this to see what it does.
- Where do Parallel, delayed and multiprocessing come from / what do they do?
- The first two come from joblib and the third one is a basic Python module.
- The import statements should have been included in the code.