1. Improving the code
Here are some ways in which you could improve your code, while leaving the algorithm unchanged:
You can avoid the need to check
counts.has_key(j-i)
if you usecollections.Counter
.You are looping over all pairs of distinct indexes between
0
andlen(array)-2
. The functionitertools.combinations
provides a way to cleanly express your intention.Instead of comparing three pairs of array elements, you can compare one pair of array slices.
Organize the code into a function with a docstring.
Generalize it from 3 to k.
Here's the revised code after making the above improvements:
from collections import Counter
from itertools import combinations
def substring_distance_counts(array, k = 3):
"""
Return a dictionary whose keys are distances and whose values are
the number of repeated substrings of length `k` in `array` at that
distance. `k` defaults to 3.
"""
counts = Counter()
for i, j in combinations(range(len(array) - k + 1), 2):
if array[i:i + k] == array[j:j + k]:
counts[j - i] += 1
return counts
It's possible to rewrite the body of this function as a single expression:
return Counter(j - i
for i, j in combinations(range(len(array) - k + 1), 2)
if array[i:i + k] == array[j:j + k])
Some people prefer this style of coding, and others prefer the explicitness of the first version. It doesn't make much difference to the performance, so comes down to personal taste.
2. Improving the algorithm
Here's an alternative approach. Your original algorithm is O(n2), where n is the length of the array. The algorithm below is O(n + m), where m is the number of repeated triples. This won't make any different in the worst case, where m = Θ(n2), but in cases where m = ο(n2) it should be an improvement.
def substring_distance_counts(array, k = 3):
"""
Return a dictionary whose keys are distances and whose values are
the number of repeated substrings of length `k` in `array` at that
distance. `k` defaults to 3.
"""
positions = defaultdict(set) # Set of positions where each substring appears.
for i in range(len(array) - k + 1):
positions[array[i:i + k]].add(i)
return Counter(j - i
for p in positions.itervalues()
for i, j in combinations(p, 2))
- 50.1k
- 3
- 130
- 210