Finding most common contiguous sub-lists in an array of lists

Question 1

Objective: Given a set of sequences ( eg: step1->step2->step3, step1->step3->step5) ) arranged in an array of lists, count the number of times every contiguous sub-lists occur

Where I need your help:

The code below works, but is very slow on my original dataset (100M sequences, with 100 unique steps). Can you help me make this more efficient? For larger datasets, is there a more efficient programming method than just brute force?
My code currently depends on each element in the list being a single character. How can I adapt this code to handle multiple-character elements?

Working code:

from collections import Counter
sequences = [['A','B'],['A','B','B'],['A','C','A','B']]
counts = Counter()
for sequence in sequences:
 input = "".join(sequence)
 for j in range(1,len(input)+1):
 counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
 print counts
for x in counts:
 print x,":",counts[x]," times"

Question 2

1. Write a test case

When working on performance of code, the first thing to do is to make a reproducible test case. We'll need to have your code in a function:

from collections import Counter
def subsequence_counts_1(sequences):
 counts = Counter()
 for sequence in sequences:
 input = "".join(sequence)
 for j in range(1,len(input)+1):
 counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
 return counts

and then we need some test data:

import random
import string
def test_data(n, m, choices):
 """Return a list of n lists of m items chosen randomly from choices."""
 return [[random.choice(choices) for _ in range(m)] for _ in range(n)]

So let's try a small example:

>>> data = test_data(50, 50, string.ascii_uppercase)
>>> from timeit import timeit
>>> timeit(lambda:subsequence_counts_1(data), number=1)
102.42408156394958

2. Don't build data structures by repeated addition

The main problem with the code is this line:

counts = counts + Counter(...)

which has essentially the same effect as:

new_counts = counts + Counter(...)
counts = new_counts

That is, it creates a new Counter object and populates it with the counts from both of the addends. In particular, this involves copying across the whole contents of counts into the new object.

We can avoid all that copying by using the update method:

def subsequence_counts_2(sequences):
 counts = Counter()
 for sequence in sequences:
 input = "".join(sequence)
 for j in range(1,len(input)+1):
 counts.update(input[i:i+j] for i in range(len(input)-(j-1)))
 return counts

and this is a thousand times faster:

>>> timeit(lambda:subsequence_counts_2(data), number=1)
0.08853063406422734

3. Further improvements

There are few more minor improvements that we could make:

Instead of having separate iterations over i and j, we can iterate over both at the same time using itertools.combinations.
Instead of calling Counter.update for each sequence, we could do all the work in one comprehension.

This results in the following:

from itertools import combinations
def subsequence_counts_3(sequences):
 return Counter(seq[i:j]
 for seq in map(''.join, sequences)
 for i, j in combinations(range(len(seq) + 1), 2))

which is about 60% faster than subsequence_counts_2:

>>> timeit(lambda:subsequence_counts_3(data), number=1)
0.052610712591558695

But it's still not going to be able to solve your problem in a reasonable amount of time:

>>> data2 = test_data(100, 100, string.ascii_uppercase)
>>> timeit(lambda:subsequence_counts_3(data2), number=1)
0.5441978382878006

So to process 100 million sequences of 100 characters would take more than half a million seconds, which is more than six days.

4. Using tuples

If you want to handle other kinds of data, convert the sequences to tuples:

def subsequence_counts_4(sequences):
 return Counter(seq[i:j]
 for seq in map(tuple, sequences)
 for i, j in combinations(range(len(seq) + 1), 2))

and then you can use any hashable items:

>>> data3 = test_data(50, 50, list(range(10)))
>>> subsequence_counts_4(data3)[(8, 8, 8)]
5

Question 3

Gareth... you are amazing

Question 4

Thanks Gareth - one follow up. In the function subsequence_counts_4, how can I add a parameter to only select tuples of length 3?

Question 5

seq[i:i+3] for seq in map(tuple, sequences) for i in range(len(seq) - 3)

Question 6

Thanks Gareth. One more follow up: How can I insert a criteria to ignore tuples where the length is 1? I tried putting the condition "if j-i>1" at the end but it didnt work

Question 7

seq[i:j] for seq in map(tuple, sequences) for i, j in combinations(range(len(seq) + 1), 2) if j - i > 1

Gareth Rees Gareth Rees 50.1k3 gold badges130 silver badges210 bronze badges · Accepted Answer · 2015-10-19 20:43:24Z

1. Write a test case

When working on performance of code, the first thing to do is to make a reproducible test case. We'll need to have your code in a function:

from collections import Counter
def subsequence_counts_1(sequences):
 counts = Counter()
 for sequence in sequences:
 input = "".join(sequence)
 for j in range(1,len(input)+1):
 counts = counts + Counter(input[i:i+j] for i in range(len(input)-(j-1)))
 return counts

and then we need some test data:

import random
import string
def test_data(n, m, choices):
 """Return a list of n lists of m items chosen randomly from choices."""
 return [[random.choice(choices) for _ in range(m)] for _ in range(n)]

So let's try a small example:

>>> data = test_data(50, 50, string.ascii_uppercase)
>>> from timeit import timeit
>>> timeit(lambda:subsequence_counts_1(data), number=1)
102.42408156394958

2. Don't build data structures by repeated addition

The main problem with the code is this line:

counts = counts + Counter(...)

which has essentially the same effect as:

new_counts = counts + Counter(...)
counts = new_counts

That is, it creates a new Counter object and populates it with the counts from both of the addends. In particular, this involves copying across the whole contents of counts into the new object.

We can avoid all that copying by using the update method:

def subsequence_counts_2(sequences):
 counts = Counter()
 for sequence in sequences:
 input = "".join(sequence)
 for j in range(1,len(input)+1):
 counts.update(input[i:i+j] for i in range(len(input)-(j-1)))
 return counts

and this is a thousand times faster:

>>> timeit(lambda:subsequence_counts_2(data), number=1)
0.08853063406422734

3. Further improvements

There are few more minor improvements that we could make:

Instead of having separate iterations over i and j, we can iterate over both at the same time using itertools.combinations.
Instead of calling Counter.update for each sequence, we could do all the work in one comprehension.

This results in the following:

from itertools import combinations
def subsequence_counts_3(sequences):
 return Counter(seq[i:j]
 for seq in map(''.join, sequences)
 for i, j in combinations(range(len(seq) + 1), 2))

which is about 60% faster than subsequence_counts_2:

>>> timeit(lambda:subsequence_counts_3(data), number=1)
0.052610712591558695

But it's still not going to be able to solve your problem in a reasonable amount of time:

>>> data2 = test_data(100, 100, string.ascii_uppercase)
>>> timeit(lambda:subsequence_counts_3(data2), number=1)
0.5441978382878006

So to process 100 million sequences of 100 characters would take more than half a million seconds, which is more than six days.

4. Using tuples

If you want to handle other kinds of data, convert the sequences to tuples:

def subsequence_counts_4(sequences):
 return Counter(seq[i:j]
 for seq in map(tuple, sequences)
 for i, j in combinations(range(len(seq) + 1), 2))

and then you can use any hashable items:

>>> data3 = test_data(50, 50, list(range(10)))
>>> subsequence_counts_4(data3)[(8, 8, 8)]
5

Thanks Gareth - one follow up. In the function subsequence_counts_4, how can I add a parameter to only select tuples of length 3?
seq[i:i+3] for seq in map(tuple, sequences) for i in range(len(seq) - 3)
Thanks Gareth. One more follow up: How can I insert a criteria to ignore tuples where the length is 1? I tried putting the condition "if j-i>1" at the end but it didnt work
seq[i:j] for seq in map(tuple, sequences) for i, j in combinations(range(len(seq) + 1), 2) if j - i > 1

Stack Exchange Network

Finding most common contiguous sub-lists in an array of lists

1 Answer 1

1. Write a test case

2. Don't build data structures by repeated addition

3. Further improvements

4. Using tuples

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Finding most common contiguous sub-lists in an array of lists

1 Answer 1

1. Write a test case

2. Don't build data structures by repeated addition

3. Further improvements

4. Using tuples

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions