You can avoid the need to check
counts.has_key(j-i)
if you usecollections.Counter
.You are looping over all pairs of distinct indexes between
0
andlen(array)-2
. The functionitertools.combinations
provides a way to cleanly express your intention.Instead of comparing three pairs of array elements, you can compare one pair of array slices.
Organize the code into a function with a docstring.
Generalize it from 3 to k\$k\$.
Python doesn't have an "array" type, but it does have sequences. So your variables could be better named.
from collections import Counter
from itertools import combinations
def subseq_distance_counts(seq, k = 3k=3):
"""
Return"""Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
"""
counts = Counter()
for i, j in combinations(range(len(seq) - k + 1), 2):
if seq[i:i + k] == seq[j:j + k]:
counts[j - i] += 1
return counts
Here's an alternative approach (that you could use if the subsequences are hashable). Your original algorithm is O(n2)\$ O(n^2) \$, where n\$n\$ is the length of the sequence. The algorithm below is O(n + m)\$O(n + m)\$, where m\$m\$ is the number of repeated triples. This won't make any different in the worst case, where m = Θ(n2)\$m = Θ(n^2)\$, but in cases where m = ο(n2)\$m = ο(n^2)\$ it should be an improvement.
def subseq_distance_counts(seq, k = 3k=3):
"""
Return"""Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
>>> subseq_distance_counts('abcabcabc')
Counter({3: 4, 6: 1})
>>> subseq_distance_counts('aaaaaa', 1)
Counter({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
"""
positions = defaultdict(list) # List of positions where each subsequence appears.
for i in range(len(seq) - k + 1):
positions[seq[i:i + k]].append(i)
return Counter(j - i
for p in positions.itervalues()
for i, j in combinations(p, 2))
You can avoid the need to check
counts.has_key(j-i)
if you usecollections.Counter
.You are looping over all pairs of distinct indexes between
0
andlen(array)-2
. The functionitertools.combinations
provides a way to cleanly express your intention.Instead of comparing three pairs of array elements, you can compare one pair of array slices.
Organize the code into a function with a docstring.
Generalize it from 3 to k.
Python doesn't have an "array" type, but it does have sequences. So your variables could be better named.
from collections import Counter
from itertools import combinations
def subseq_distance_counts(seq, k = 3):
"""
Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
"""
counts = Counter()
for i, j in combinations(range(len(seq) - k + 1), 2):
if seq[i:i + k] == seq[j:j + k]:
counts[j - i] += 1
return counts
Here's an alternative approach (that you could use if the subsequences are hashable). Your original algorithm is O(n2), where n is the length of the sequence. The algorithm below is O(n + m), where m is the number of repeated triples. This won't make any different in the worst case, where m = Θ(n2), but in cases where m = ο(n2) it should be an improvement.
def subseq_distance_counts(seq, k = 3):
"""
Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
>>> subseq_distance_counts('abcabcabc')
Counter({3: 4, 6: 1})
>>> subseq_distance_counts('aaaaaa', 1)
Counter({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
"""
positions = defaultdict(list) # List of positions where each subsequence appears.
for i in range(len(seq) - k + 1):
positions[seq[i:i + k]].append(i)
return Counter(j - i
for p in positions.itervalues()
for i, j in combinations(p, 2))
You can avoid the need to check
counts.has_key(j-i)
if you usecollections.Counter
.You are looping over all pairs of distinct indexes between
0
andlen(array)-2
. The functionitertools.combinations
provides a way to cleanly express your intention.Instead of comparing three pairs of array elements, you can compare one pair of array slices.
Organize the code into a function with a docstring.
Generalize it from 3 to \$k\$.
Python doesn't have an "array" type, but it does have sequences. So your variables could be better named.
from collections import Counter
from itertools import combinations
def subseq_distance_counts(seq, k=3):
"""Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
"""
counts = Counter()
for i, j in combinations(range(len(seq) - k + 1), 2):
if seq[i:i + k] == seq[j:j + k]:
counts[j - i] += 1
return counts
Here's an alternative approach (that you could use if the subsequences are hashable). Your original algorithm is \$ O(n^2) \$, where \$n\$ is the length of the sequence. The algorithm below is \$O(n + m)\$, where \$m\$ is the number of repeated triples. This won't make any different in the worst case, where \$m = Θ(n^2)\$, but in cases where \$m = ο(n^2)\$ it should be an improvement.
def subseq_distance_counts(seq, k=3):
"""Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
>>> subseq_distance_counts('abcabcabc')
Counter({3: 4, 6: 1})
>>> subseq_distance_counts('aaaaaa', 1)
Counter({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
"""
positions = defaultdict(list) # List of positions where each subsequence appears.
for i in range(len(seq) - k + 1):
positions[seq[i:i + k]].append(i)
return Counter(j - i
for p in positions.itervalues()
for i, j in combinations(p, 2))
Here's an alternative approach (that you could use if the subsequences are hashable). Your original algorithm is O(n2), where n is the length of the sequence. The algorithm below is O(n + m), where m is the number of repeated triples. This won't make any different in the worst case, where m = Θ(n2), but in cases where m = ο(n2) it should be an improvement.
Here's an alternative approach. Your original algorithm is O(n2), where n is the length of the sequence. The algorithm below is O(n + m), where m is the number of repeated triples. This won't make any different in the worst case, where m = Θ(n2), but in cases where m = ο(n2) it should be an improvement.
Here's an alternative approach (that you could use if the subsequences are hashable). Your original algorithm is O(n2), where n is the length of the sequence. The algorithm below is O(n + m), where m is the number of repeated triples. This won't make any different in the worst case, where m = Θ(n2), but in cases where m = ο(n2) it should be an improvement.
- 50.1k
- 3
- 130
- 210
You can avoid the need to check
counts.has_key(j-i)
if you usecollections.Counter
.You are looping over all pairs of distinct indexes between
0
andlen(array)-2
. The functionitertools.combinations
provides a way to cleanly express your intention.Instead of comparing three pairs of array elements, you can compare one pair of array slices.
Organize the code into a function with a docstring.
Generalize it from 3 to k.
Python doesn't have an "array" type, but it does have sequences . So your variables could be better named.
from collections import Counter
from itertools import combinations
def substring_distance_countssubseq_distance_counts(arrayseq, k = 3):
"""
Return a dictionary whose keys are distances and whose values are
are the number of repeatedidentical substringspairs of `k`-length `k`subsequences
in `array` at that
of `seq` separated by that distance. `k` defaults to 3.
"""
counts = Counter()
for i, j in combinations(range(len(arrayseq) - k + 1), 2):
if array[iseq[i:i + k] == array[jseq[j:j + k]:
counts[j - i] += 1
return counts
return Counter(j - i
for i, j in combinations(range(len(arrayseq) - k + 1), 2)
if array[iseq[i:i + k] == array[jseq[j:j + k])
Here's an alternative approach. Your original algorithm is O(n2), where n is the length of the arraysequence. The algorithm below is O(n + m), where m is the number of repeated triples. This won't make any different in the worst case, where m = Θ(n2), but in cases where m = ο(n2) it should be an improvement.
def substring_distance_countssubseq_distance_counts(arrayseq, k = 3):
"""
Return a dictionary whose keys are distances and whose values are
are the number of repeatedidentical substringspairs of `k`-length `k`subsequences
in `array` at that
of `seq` separated by that distance. `k` defaults to 3.
>>> subseq_distance_counts('abcabcabc')
Counter({3: 4, 6: 1})
>>> subseq_distance_counts('aaaaaa', 1)
Counter({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
"""
positions = defaultdict(setlist) # SetList of positions where each substringsubsequence appears.
for i in range(len(arrayseq) - k + 1):
positions[array[ipositions[seq[i:i + k]].addappend(i)
return Counter(abs(j - i)
for p in positions.itervalues()
for i, j in combinations(p, 2))
You can avoid the need to check
counts.has_key(j-i)
if you usecollections.Counter
.You are looping over all pairs of distinct indexes between
0
andlen(array)-2
. The functionitertools.combinations
provides a way to cleanly express your intention.Instead of comparing three pairs of array elements, you can compare one pair of array slices.
Organize the code into a function with a docstring.
Generalize it from 3 to k.
from collections import Counter
from itertools import combinations
def substring_distance_counts(array, k = 3):
"""
Return a dictionary whose keys are distances and whose values are
the number of repeated substrings of length `k` in `array` at that
distance. `k` defaults to 3.
"""
counts = Counter()
for i, j in combinations(range(len(array) - k + 1), 2):
if array[i:i + k] == array[j:j + k]:
counts[j - i] += 1
return counts
return Counter(j - i
for i, j in combinations(range(len(array) - k + 1), 2)
if array[i:i + k] == array[j:j + k])
Here's an alternative approach. Your original algorithm is O(n2), where n is the length of the array. The algorithm below is O(n + m), where m is the number of repeated triples. This won't make any different in the worst case, where m = Θ(n2), but in cases where m = ο(n2) it should be an improvement.
def substring_distance_counts(array, k = 3):
"""
Return a dictionary whose keys are distances and whose values are
the number of repeated substrings of length `k` in `array` at that
distance. `k` defaults to 3.
"""
positions = defaultdict(set) # Set of positions where each substring appears.
for i in range(len(array) - k + 1):
positions[array[i:i + k]].add(i)
return Counter(abs(j - i)
for p in positions.itervalues()
for i, j in combinations(p, 2))
You can avoid the need to check
counts.has_key(j-i)
if you usecollections.Counter
.You are looping over all pairs of distinct indexes between
0
andlen(array)-2
. The functionitertools.combinations
provides a way to cleanly express your intention.Instead of comparing three pairs of array elements, you can compare one pair of array slices.
Organize the code into a function with a docstring.
Generalize it from 3 to k.
Python doesn't have an "array" type, but it does have sequences . So your variables could be better named.
from collections import Counter
from itertools import combinations
def subseq_distance_counts(seq, k = 3):
"""
Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
"""
counts = Counter()
for i, j in combinations(range(len(seq) - k + 1), 2):
if seq[i:i + k] == seq[j:j + k]:
counts[j - i] += 1
return counts
return Counter(j - i
for i, j in combinations(range(len(seq) - k + 1), 2)
if seq[i:i + k] == seq[j:j + k])
Here's an alternative approach. Your original algorithm is O(n2), where n is the length of the sequence. The algorithm below is O(n + m), where m is the number of repeated triples. This won't make any different in the worst case, where m = Θ(n2), but in cases where m = ο(n2) it should be an improvement.
def subseq_distance_counts(seq, k = 3):
"""
Return a dictionary whose keys are distances and whose values
are the number of identical pairs of `k`-length subsequences
of `seq` separated by that distance. `k` defaults to 3.
>>> subseq_distance_counts('abcabcabc')
Counter({3: 4, 6: 1})
>>> subseq_distance_counts('aaaaaa', 1)
Counter({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
"""
positions = defaultdict(list) # List of positions where each subsequence appears.
for i in range(len(seq) - k + 1):
positions[seq[i:i + k]].append(i)
return Counter(j - i
for p in positions.itervalues()
for i, j in combinations(p, 2))