I wrote a function to return the first n length sequence which occurs twice in the digits of pi. Along with the repeated sequence, I want the indices of the first and second occurrences. For example, if n is 1
, code returns (1, 0, 2)
and if n is 2
, (26, 5, 20)
.
For reference, here are the first digits of pi:
3.141592653589793238462643383279502884197169399375105820974944592...
The code uses a breadth-first search (as opposed to a depth-first search, which I coded up to be faster, but isn't what I'm looking for). I realize code below is convoluted and inefficient because it checks same digits multiple times.
How do I make it simpler and faster?
code
from mpmath import mp
mp.dps = 1000
pi = mp.pi
def repeats(n):
pi_list = [int(x) for x in str(pi)[2:]]
breadth = range(n + n, len(pi_list))
for breadth_level in breadth:
pattern_starts = range(breadth_level - 2 * n + 1)
for p_s in pattern_starts:
test_starts = range(n, breadth_level - n + 1)
for t_s in test_starts:
candidate = pi_list[p_s:p_s + n]
test_start = max(p_s + n, t_s)
test = pi_list[test_start:test_start + n]
if candidate == test:
return candidate, p_s, t_s
print(repeats(1))
print(repeats(2))
print(repeats(3))
output
([1], 0, 2)
([2, 6], 5, 20)
([5, 9, 2], 3, 60)
2 Answers 2
How do I make it simpler and faster?
A more efficient approach:
- Iterate on
pi
with a sliding window of sizen
. - Save each sequence (and the starting position) in a lookup map
- Return as soon as you find a match in the lookup map
Something like this:
def repeats(n):
lookup = {}
pi_s = str(pi)[2:]
for i in range(n, len(pi_s)):
start = i - n
seq = pi_s[start:i]
if seq in lookup:
return list(map(int, seq)), lookup[seq], start
lookup[seq] = start
For n=5
this runs in 0.001 seconds, while your approach takes ~40 seconds.
For n=10
and 1 million digits, this approach finds ([4, 3, 9, 2, 3, 6, 6, 4, 8, 4], 182104, 240478)
in ~50 seconds. The slowest part now seems to be str(pi)
, which generates the digits. Once the digits are generated, searching for the sequence takes less than a second.
Review
- Naming: the function name
repeats
is a bit ambiguous, it sounds like how many timesn
appears in the sequence. A better name could befirst_repeating_sequence_of
. - Edge case: for bigger
n
, more digits of pi are required, otherwise no matches are found. Consider passing the max number of digits of pi as an argument so the caller is aware of this limit.
-
2\$\begingroup\$ This solution is really beautiful and simple. I didn't realize that dictionary lookups are so fast and nested loops so slow. Thanks for the other feedback too! \$\endgroup\$cona– cona2021年05月18日 03:10:52 +00:00Commented May 18, 2021 at 3:10
-
3\$\begingroup\$ @cona I am glad I could help. To get more answers, consider accepting after one day, so reviewers in different timezone have the time to answer. Generally, answers on Code Review come slower compared to Stack Overflow. \$\endgroup\$Marc– Marc2021年05月18日 03:24:38 +00:00Commented May 18, 2021 at 3:24
A late review, but ...
You are trying to solve the problem as one monolithic block. It is often better to break the problem down into individual steps.
In this case, you have a long string of digits, and you want to extract sub-sequences of digits of a specific length.
For each sub-sequence item you find, you want to remember the position you found it at, and if you've seen it before.
And you might want to stop when you've found the first one.
You can take these smaller operations and compose them together.
The following reworked code demonstrates this approach, along with type-hints, docstrings, and a doctest, which are always good things to have.
from typing import Iterable, Tuple
def sliding_window(subject: str, window: int) -> Iterable[str]:
"""
Return successive substrings of a given length, moving forward
through the string:
>>> list(sliding_window("3.14159", 2))
['3.', '.1', '14', '41', '15', '59']
>>> list(sliding_window("3.14159", 3))
['3.1', '.14', '141', '415', '159']
"""
for i in range(len(subject) - window + 1):
yield subject[i:i+window]
def repeats(sequence: Iterable[str]) -> Iterable[Tuple[str, int, int]]:
"""
Scan through a sequence of items, recording the location of each item,
and when a duplication is found, return the item along with
its previous and current position.
>>> list(repeats(["xy", "xyz", "ab", "xy", "ab", "ab"]))
[('xy', 0, 3), ('ab', 2, 4), ('ab', 4, 5)]
"""
positions = {}
for pos, item in enumerate(sequence):
prev = positions.get(item)
if prev is not None:
yield item, prev, pos
positions[item] = pos
def first_repeat(subject: str, length: int) -> Tuple[str, int, int]:
"""
Scan through a string, and find the first repeated substring of
a given length. Returns the substring, first position and second position.
>>> first_repeat("The quick brown fox jumped over the lazy dog", 2)
('he', 1, 33)
"""
return next(repeats(sliding_window(subject, length)))
def main():
pi = "3.141592653589793238462643383279502884197169399375105820974944592"
pi_fraction = pi[2:]
print(first_repeat(pi_fraction, 1))
print(first_repeat(pi_fraction, 2))
print(first_repeat(pi_fraction, 3))
if __name__ == '__main__':
import doctest
doctest.testmod()
main()
-
3\$\begingroup\$ +1 for the nice review. One more step would be to map the result into a list of integers, to get the same output as OP's solution. \$\endgroup\$Marc– Marc2021年05月21日 01:13:54 +00:00Commented May 21, 2021 at 1:13
Explore related questions
See similar questions with these tags.
repeat
many times, it might be worth building asuffix tree
once, and the running the queries on the tree. A suffix tree can also let you answer similar questions like: are there any that repeat more the 2 times, what's the most repeats, which pair has the earliest repeat, find all repeats, etc. \$\endgroup\$