Skip to main content
Code Review

Return to Revisions

2 of 6
generalize to arbitrary k
Gareth Rees
  • 50.1k
  • 3
  • 130
  • 210

1. Improving the code

Here are some ways in which you could improve your code, while leaving the algorithm unchanged:

  1. You can avoid the need to check counts.has_key(j-i) if you use collections.Counter.

  2. You are looping over all pairs of distinct indexes between 0 and len(array)-2. The function itertools.combinations provides a way to cleanly express your intention.

  3. Instead of comparing three pairs of array elements, you can compare one pair of array slices.

  4. Organize the code into a function with a docstring.

  5. Generalize it from 3 to k.

Here's the revised code after making the above improvements:

from collections import Counter
from itertools import combinations
def substring_distance_counts(array, k = 3):
 """
 Return a dictionary whose keys are distances and whose values are
 the number of repeated substrings of length `k` in `array` at that
 distance. `k` defaults to 3.
 """
 counts = Counter()
 for i, j in combinations(range(len(array) - k + 1), 2):
 if array[i:i + k] == array[j:j + k]:
 counts[j - i] += 1
 return counts

It's possible to rewrite the body of this function as a single expression:

 return Counter(j - i
 for i, j in combinations(range(len(array) - k + 1), 2)
 if array[i:i + k] == array[j:j + k])

Some people prefer this style of coding, and others prefer the explicitness of the first version. It doesn't make much difference to the performance, so comes down to personal taste.

2. Improving the algorithm

Here's an alternative approach. Your original algorithm is O(n2), where n is the length of the array. The algorithm below is O(n + m), where m is the number of repeated triples. This won't make any different in the worst case, where m = Θ(n2), but in cases where m = ο(n2) it should be an improvement.

def substring_distance_counts(array, k = 3):
 """
 Return a dictionary whose keys are distances and whose values are
 the number of repeated substrings of length `k` in `array` at that
 distance. `k` defaults to 3.
 """
 positions = defaultdict(set) # Set of positions where each substring appears.
 for i in range(len(array) - k + 1):
 positions[array[i:i + k]].add(i)
 return Counter(j - i
 for p in positions.itervalues()
 for i, j in combinations(p, 2))
Gareth Rees
  • 50.1k
  • 3
  • 130
  • 210
default

AltStyle によって変換されたページ (->オリジナル) /