I recently found out how to use joblib for parallelization. I want to iterate through this string seq in steps of 1 and count the number of times each word (kmer) is seen. I thought this would be really fast, but for larger files it still takes a long time.
How could I make this word counter faster?
from collections import Counter
import multiprocessing

import numpy as np
from joblib import Parallel, delayed

def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]

def countKmers(seq, K=5, num_cores=multiprocessing.cpu_count()):
    """
    Count kmers in sequence
    """
    start_indices = np.arange(0, len(seq) - K + 1, step=1)
    end_indices = start_indices + K
    list_of_kmers = Parallel(n_jobs=num_cores)\
        (delayed(iterSlice)\
        (seq=seq, start_idx=start_idx, end_idx=end_idx)\
        for start_idx, end_idx in zip(start_indices, end_indices))
    return Counter(list_of_kmers)

seq = "ABCDBBBDBABCBDBABCBCBCABCDBACBDCBCABBCABDCBCABCABCABDCBABABABCD"
countKmers(seq=seq, K=3, num_cores=8)
Counter({'ABA': 2,
'ABB': 1,
'ABC': 7,
'ABD': 2,
'ACB': 1,
'BAB': 5,
'BAC': 1,
'BBB': 1,
'BBC': 1,
'BBD': 1,
'BCA': 6,
'BCB': 3,
'BCD': 3,
'BDB': 2,
'BDC': 3,
'CAB': 6,
'CBA': 1,
'CBC': 4,
'CBD': 2,
'CDB': 2,
'DBA': 3,
'DBB': 1,
'DCB': 3})
10 loops, best of 3: 186 ms per loop
1 Answer
Obligatory PEP8 reference: it is recommended to use lower_case names for variables and functions. camelCase is alright if you are consistent (you seem to be using lower_case for variables and camelCase for functions).
The hardest part to read is the call to Parallel. I would make a few changes there to improve readability.
The way delayed is used, I would think it is a decorator and you could do:
@delayed
def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]
Unfortunately this does not work (I asked a Stack Overflow question about why).
You can however do this:
def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]

iterSlice = delayed(iterSlice)
This seems a bit too verbose to me:
iterSlice(seq=seq, start_idx=start_idx, end_idx=end_idx)
This is just as explicit:
iterSlice(seq, start_idx, end_idx)
Parallel can be used as a context manager, so you can do:
with Parallel(n_jobs=num_cores) as parallel:
    list_of_kmers = parallel(iterSlice(seq, start_idx, end_idx)
                             for start_idx, end_idx in zip(start_indices, end_indices))
It would also allow you to reuse the pool of workers if you had an explicit loop in there, which you don't.
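For illustration only, a minimal sketch of what reusing the worker pool across an explicit loop could look like; the list of sequences here is a made-up example, not part of the original question:
from collections import Counter
from joblib import Parallel, delayed

def iter_slice(seq, start_idx, end_idx):
    return seq[start_idx:end_idx]

iter_slice = delayed(iter_slice)

# Hypothetical inputs: several sequences processed one after another.
sequences = ["ABCDBBBD", "BABCBDBA", "BCBCBCAB"]

with Parallel(n_jobs=4) as parallel:
    counters = []
    for s in sequences:
        # The same pool of workers is reused on every pass through the loop.
        counters.append(Counter(parallel(iter_slice(s, i, i + 3)
                                         for i in range(len(s) - 3 + 1))))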
You can probably also move the call to Parallel directly into Counter(...) rather than binding the result to list_of_kmers first.
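Putting the pieces above together, a sketch of what the reworked function could look like (the name count_kmers_parallel is mine, just to keep it distinct from the original countKmers):
from collections import Counter
import multiprocessing

import numpy as np
from joblib import Parallel, delayed

def iter_slice(seq, start_idx, end_idx):
    """Slice sequence"""
    return seq[start_idx:end_idx]

iter_slice = delayed(iter_slice)

def count_kmers_parallel(seq, K=5, num_cores=multiprocessing.cpu_count()):
    """Count kmers in sequence using a pool of workers."""
    start_indices = np.arange(0, len(seq) - K + 1)
    with Parallel(n_jobs=num_cores) as parallel:
        # Feed the results straight into Counter instead of binding them to a name first.
        return Counter(parallel(iter_slice(seq, start, start + K)
                                for start in start_indices))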
Regarding the performance:
The task you are actually parallelizing is taking a slice of the string. This is not really CPU intensive, so the overhead of setting up the parallelization may outweigh the benefits.
For large files you will be limited by reading the file itself, which takes time linear in the file size and uses only one CPU.
Have you tried comparing your code to a naive implementation like this:
from collections import Counter

def count_kmers(seq, K=5):
    """Count kmers in sequence"""
    # len(seq) - K + 1 so the last kmer is included, matching the original start_indices
    return Counter(seq[start:start+K] for start in xrange(len(seq) - K + 1))
(In Python 3.x use range instead of xrange.)
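As a quick sanity check, the naive version can be run on the example sequence from the question and compared against the parallel one; the %timeit lines assume an IPython session, which is where the "186 ms per loop" figure above comes from:
seq = "ABCDBBBDBABCBDBABCBCBCABCDBACBDCBCABBCABDCBCABCABCABDCBABABABCD"
count_kmers(seq, K=3)

# In IPython, time both implementations on a realistically sized sequence:
# %timeit count_kmers(seq, K=3)
# %timeit countKmers(seq=seq, K=3, num_cores=8)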
- O.rka (Aug 9, 2016 at 18:37): That's what I was using before, but I thought that if I could index all the kmers/words in parallel it would be much faster than reading through it item by item. Is that not the case? I'm very new to parallel processing.
- Graipher (Aug 9, 2016 at 19:51): Well, it depends. At some point reading the file becomes the major issue. If that takes longer than the actual processing (linear, not parallel), then it doesn't make sense to parallelize. However, if you can read the file once and then operate on the content in memory, it might make sense again, especially if you need to do many things/complicated things.
- Graipher (Aug 9, 2016 at 20:07): You should just run both versions on a file of the size that is usual for your application and time them. See if it makes any difference.
- O.rka (Aug 9, 2016 at 21:17): So much faster! That's what I was doing before I tried the parallelization. Man, I really thought that splitting up the indexing on different cores and combining the results would be the fastest thing possible.
Comments on the question:
- Could you add the import statements? As is, we can not run this to see what it does.
- Where do Parallel, delayed and multiprocessing come from / what do they do?
- The first two come from joblib and the third one is a basic Python module.
- The import statements should have been included in the code.