Objective: Given a set of sequences ( eg: step1->step2->step3, step1->step3->step5) ) arranged in an array of lists, count the number of times every contiguous sub-lists occur
Where I need your help:
The code below works, but is very slow on my original dataset (100M sequences, with 100 unique steps). Can you help me make this more efficient? For larger datasets, is there a more efficient programming method than just brute force?
My code currently depends on each element in the list being a single character. How can I adapt this code to handle multiple-character elements?
Working code:
from collections import Counter
sequences = [['A','B'],['A','B','B'],['A','C','A','B']]
counts = Counter()
for sequence in sequences:
input = "".join(sequence)
for j in range(1,len(input)+1):
counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
print counts
for x in counts:
print x,":",counts[x]," times"
1 Answer 1
1. Write a test case
When working on performance of code, the first thing to do is to make a reproducible test case. We'll need to have your code in a function:
from collections import Counter
def subsequence_counts_1(sequences):
counts = Counter()
for sequence in sequences:
input = "".join(sequence)
for j in range(1,len(input)+1):
counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
return counts
and then we need some test data:
import random
import string
def test_data(n, m, choices):
"""Return a list of n lists of m items chosen randomly from choices."""
return [[random.choice(choices) for _ in range(m)] for _ in range(n)]
So let's try a small example:
>>> data = test_data(50, 50, string.ascii_uppercase)
>>> from timeit import timeit
>>> timeit(lambda:subsequence_counts_1(data), number=1)
102.42408156394958
2. Don't build data structures by repeated addition
The main problem with the code is this line:
counts = counts + Counter(...)
which has essentially the same effect as:
new_counts = counts + Counter(...)
counts = new_counts
That is, it creates a new Counter
object and populates it with the counts from both of the addends. In particular, this involves copying across the whole contents of counts
into the new object.
We can avoid all that copying by using the update
method:
def subsequence_counts_2(sequences):
counts = Counter()
for sequence in sequences:
input = "".join(sequence)
for j in range(1,len(input)+1):
counts.update(input[i:i+j] for i in range(len(input)-(j-1)))
return counts
and this is a thousand times faster:
>>> timeit(lambda:subsequence_counts_2(data), number=1)
0.08853063406422734
3. Further improvements
There are few more minor improvements that we could make:
Instead of having separate iterations over
i
andj
, we can iterate over both at the same time usingitertools.combinations
.Instead of calling
Counter.update
for each sequence, we could do all the work in one comprehension.
This results in the following:
from itertools import combinations
def subsequence_counts_3(sequences):
return Counter(seq[i:j]
for seq in map(''.join, sequences)
for i, j in combinations(range(len(seq) + 1), 2))
which is about 60% faster than subsequence_counts_2
:
>>> timeit(lambda:subsequence_counts_3(data), number=1)
0.052610712591558695
But it's still not going to be able to solve your problem in a reasonable amount of time:
>>> data2 = test_data(100, 100, string.ascii_uppercase)
>>> timeit(lambda:subsequence_counts_3(data2), number=1)
0.5441978382878006
So to process 100 million sequences of 100 characters would take more than half a million seconds, which is more than six days.
4. Using tuples
If you want to handle other kinds of data, convert the sequences to tuples:
def subsequence_counts_4(sequences):
return Counter(seq[i:j]
for seq in map(tuple, sequences)
for i, j in combinations(range(len(seq) + 1), 2))
and then you can use any hashable items:
>>> data3 = test_data(50, 50, list(range(10)))
>>> subsequence_counts_4(data3)[(8, 8, 8)]
5
-
\$\begingroup\$ Gareth... you are amazing \$\endgroup\$ADatoo– ADatoo2015年10月19日 21:29:54 +00:00Commented Oct 19, 2015 at 21:29
-
\$\begingroup\$ Thanks Gareth - one follow up. In the function subsequence_counts_4, how can I add a parameter to only select tuples of length 3? \$\endgroup\$ADatoo– ADatoo2015年10月20日 10:56:52 +00:00Commented Oct 20, 2015 at 10:56
-
\$\begingroup\$
seq[i:i+3] for seq in map(tuple, sequences) for i in range(len(seq) - 3)
\$\endgroup\$Gareth Rees– Gareth Rees2015年10月20日 11:17:52 +00:00Commented Oct 20, 2015 at 11:17 -
\$\begingroup\$ Thanks Gareth. One more follow up: How can I insert a criteria to ignore tuples where the length is 1? I tried putting the condition "if j-i>1" at the end but it didnt work \$\endgroup\$ADatoo– ADatoo2015年10月21日 17:32:30 +00:00Commented Oct 21, 2015 at 17:32
-
\$\begingroup\$
seq[i:j] for seq in map(tuple, sequences) for i, j in combinations(range(len(seq) + 1), 2) if j - i > 1
\$\endgroup\$Gareth Rees– Gareth Rees2015年10月21日 21:11:57 +00:00Commented Oct 21, 2015 at 21:11