Find the first repeating sequence in digits of pi

Question 1

I wrote a function to return the first n length sequence which occurs twice in the digits of pi. Along with the repeated sequence, I want the indices of the first and second occurrences. For example, if n is 1, code returns (1, 0, 2) and if n is 2, (26, 5, 20).

For reference, here are the first digits of pi:

3.141592653589793238462643383279502884197169399375105820974944592...

The code uses a breadth-first search (as opposed to a depth-first search, which I coded up to be faster, but isn't what I'm looking for). I realize code below is convoluted and inefficient because it checks same digits multiple times.

How do I make it simpler and faster?

code

from mpmath import mp
mp.dps = 1000
pi = mp.pi
def repeats(n):
 pi_list = [int(x) for x in str(pi)[2:]]
 breadth = range(n + n, len(pi_list))
 
 for breadth_level in breadth:
 
 pattern_starts = range(breadth_level - 2 * n + 1)
 for p_s in pattern_starts:
 
 test_starts = range(n, breadth_level - n + 1)
 for t_s in test_starts:
 candidate = pi_list[p_s:p_s + n]
 test_start = max(p_s + n, t_s)
 test = pi_list[test_start:test_start + n]
 if candidate == test:
 return candidate, p_s, t_s
print(repeats(1))
print(repeats(2))
print(repeats(3))

output

([1], 0, 2)
([2, 6], 5, 20)
([5, 9, 2], 3, 60)

Question 2

If you expect to call repeat many times, it might be worth building a suffix tree once, and the running the queries on the tree. A suffix tree can also let you answer similar questions like: are there any that repeat more the 2 times, what's the most repeats, which pair has the earliest repeat, find all repeats, etc.

Question 3

How do I make it simpler and faster?

A more efficient approach:

Iterate on pi with a sliding window of size n.
Save each sequence (and the starting position) in a lookup map
Return as soon as you find a match in the lookup map

Something like this:

def repeats(n):
 lookup = {}
 pi_s = str(pi)[2:]
 for i in range(n, len(pi_s)):
 start = i - n
 seq = pi_s[start:i]
 if seq in lookup:
 return list(map(int, seq)), lookup[seq], start
 lookup[seq] = start

For n=5 this runs in 0.001 seconds, while your approach takes ~40 seconds.

For n=10 and 1 million digits, this approach finds ([4, 3, 9, 2, 3, 6, 6, 4, 8, 4], 182104, 240478) in ~50 seconds. The slowest part now seems to be str(pi), which generates the digits. Once the digits are generated, searching for the sequence takes less than a second.

Review

Naming: the function name repeats is a bit ambiguous, it sounds like how many times n appears in the sequence. A better name could be first_repeating_sequence_of.
Edge case: for bigger n, more digits of pi are required, otherwise no matches are found. Consider passing the max number of digits of pi as an argument so the caller is aware of this limit.

Question 4

This solution is really beautiful and simple. I didn't realize that dictionary lookups are so fast and nested loops so slow. Thanks for the other feedback too!

Question 5

@cona I am glad I could help. To get more answers, consider accepting after one day, so reviewers in different timezone have the time to answer. Generally, answers on Code Review come slower compared to Stack Overflow.

Question 6

A late review, but ...

You are trying to solve the problem as one monolithic block. It is often better to break the problem down into individual steps.

In this case, you have a long string of digits, and you want to extract sub-sequences of digits of a specific length.

For each sub-sequence item you find, you want to remember the position you found it at, and if you've seen it before.

And you might want to stop when you've found the first one.

You can take these smaller operations and compose them together.

The following reworked code demonstrates this approach, along with type-hints, docstrings, and a doctest, which are always good things to have.

from typing import Iterable, Tuple
def sliding_window(subject: str, window: int) -> Iterable[str]:
 """
 Return successive substrings of a given length, moving forward
 through the string:
 >>> list(sliding_window("3.14159", 2))
 ['3.', '.1', '14', '41', '15', '59']
 >>> list(sliding_window("3.14159", 3))
 ['3.1', '.14', '141', '415', '159']
 """
 for i in range(len(subject) - window + 1):
 yield subject[i:i+window]
def repeats(sequence: Iterable[str]) -> Iterable[Tuple[str, int, int]]:
 """
 Scan through a sequence of items, recording the location of each item,
 and when a duplication is found, return the item along with
 its previous and current position.
 >>> list(repeats(["xy", "xyz", "ab", "xy", "ab", "ab"]))
 [('xy', 0, 3), ('ab', 2, 4), ('ab', 4, 5)]
 """
 positions = {}
 for pos, item in enumerate(sequence):
 prev = positions.get(item)
 if prev is not None:
 yield item, prev, pos
 positions[item] = pos
def first_repeat(subject: str, length: int) -> Tuple[str, int, int]:
 """
 Scan through a string, and find the first repeated substring of
 a given length. Returns the substring, first position and second position.
 >>> first_repeat("The quick brown fox jumped over the lazy dog", 2)
 ('he', 1, 33)
 """
 return next(repeats(sliding_window(subject, length)))
def main():
 pi = "3.141592653589793238462643383279502884197169399375105820974944592"
 pi_fraction = pi[2:]
 print(first_repeat(pi_fraction, 1))
 print(first_repeat(pi_fraction, 2))
 print(first_repeat(pi_fraction, 3))
if __name__ == '__main__':
 import doctest
 doctest.testmod()
 main()

Question 7

+1 for the nice review. One more step would be to map the result into a list of integers, to get the same output as OP's solution.

Marc Marc 5,7342 gold badges15 silver badges35 bronze badges · Accepted Answer · 2021-05-18 02:23:07Z

How do I make it simpler and faster?

A more efficient approach:

Iterate on pi with a sliding window of size n.
Save each sequence (and the starting position) in a lookup map
Return as soon as you find a match in the lookup map

Something like this:

def repeats(n):
 lookup = {}
 pi_s = str(pi)[2:]
 for i in range(n, len(pi_s)):
 start = i - n
 seq = pi_s[start:i]
 if seq in lookup:
 return list(map(int, seq)), lookup[seq], start
 lookup[seq] = start

For n=5 this runs in 0.001 seconds, while your approach takes ~40 seconds.

For n=10 and 1 million digits, this approach finds ([4, 3, 9, 2, 3, 6, 6, 4, 8, 4], 182104, 240478) in ~50 seconds. The slowest part now seems to be str(pi), which generates the digits. Once the digits are generated, searching for the sequence takes less than a second.

Review

Naming: the function name repeats is a bit ambiguous, it sounds like how many times n appears in the sequence. A better name could be first_repeating_sequence_of.
Edge case: for bigger n, more digits of pi are required, otherwise no matches are found. Consider passing the max number of digits of pi as an argument so the caller is aware of this limit.

This solution is really beautiful and simple. I didn't realize that dictionary lookups are so fast and nested loops so slow. Thanks for the other feedback too!
@cona I am glad I could help. To get more answers, consider accepting after one day, so reviewers in different timezone have the time to answer. Generally, answers on Code Review come slower compared to Stack Overflow.

Stack Exchange Network

Find the first repeating sequence in digits of pi

code

output

2 Answers 2

Review

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Find the first repeating sequence in digits of pi

code

output

2 Answers 2

Review

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions