Use Python to determine the repeating pattern in a string

Question 1

I am writing an algorithm to count the number of times a substring repeats itself. The string is between 1-200 characters ranging from letters a-z. There should be no left overs at the end of the pattern either and it should be split into the smallest possible combination.

answer("abcabcabcabc"):output = 4
answer("abccbaabccba"): output = 2
answer("abcabcd"): output = 1

My code:

import re
def answer(s):
 length = len(s)
 x=[]
 reg = ""
 for i in range(1, length+1):
 if length % i == 0:
 x.append(i)
 repeat = len(x)*10
 for _ in range(0, repeat):
 a = length/max(x)
 print(length)
 print(max(x))
 print(a)
 for _ in range(0, a):
 reg = reg + "."
 exp = re.findall(reg, s)
 print exp
 if all(check==exp[0] for check in exp):
 print len(exp)
 return len(exp)
 elif all(check!=exp[0] for check in exp):
 x.remove(max(x))

This is Python 2.7 code and I don't have to use regex. It just seemed like the easiest way.

Is there a better/faster/more optimal way to do this?

NOTE: it breaks if the string size is too big.

EDIT: Fixed indentation

Question 2

You list this as a programming-challenge, could you please state the site of this programming challenge?

Question 3

It's a level 1 foobar question. I passed the test, I'm just curious if there is a better way.

Question 4

Indentation seems off. Please double check.

Question 5

@vnp where? It works for me

Question 6

@JakeSteele for i in range(1, length+1): is indented just as the next line if length % i == 0:.

Question 7

I see that there already is an accepted answer, before I got around to writing this answer. That is a little discouraging for me as a reviewer, but I've written a code alternative and have some thoughts regarding your code.

Code review

Document your code – To guess what your code is doing by simply reading it is not very easy. It could very well have a few comments on what it tries to achieve.
Better variable names – In general try to use variable names describing the content. Names like x, a, exp (which I would read as expression?), and so on, are not very informative.
Regex are generally expensive – In general regex, whilst being a fantastic too, are somewhat overused. Do note that it takes time to build the patterns and match the patterns, especially compared to simple string matches.

Some new(?) constructs

I'm not sure if you know these already, but there are a few new constructs I would like to show you:

String repeat – Strings can be multiplied (aka duplicated) using the multiplication operator.
```
text = "abc"
print text * 3 # Will output "abcabcabc"
```
divmod with multiple outputs – divmod(a, b) will divide a by b and return the divisor and the rest. This can be stored directly into a tuple like in the following:
```
div, rest = divmod(13, 5) # Giving div == 2, rest = 3
```
Slice and dice strings – As of Python 2.3 one can slice strings. The most common variants indicates how many characters you want from start like text[:3] or if you want the rest of the text like text[3:]. This can be very useful...
A slightly fancier print varant – Using .format in combination with print can produce nicer output rather easily:
```
print 'i: {}, sub: {}, div: {}, mod: {}'.format(i, source[:i], div, rest)
```
This would output on the same line, something like:
```
i: 4, sub: abcd, div: 2, mod: 3
```
else-block after for?! – This will make sense later on, but if a for loop completes normally, it'll not enter the optional else:-block. In other words, if you break out of a for loop Python won't enter the else block. This can be used to verify that the for loop actually found/did something, and provide an alternative if it didn't.
```
for i in range(5):
 print i
 if i == 3:
 break
else:
 print "Nothing larger than 3..."
```

Code alternative

I haven't commented too much on your chosen algorithm, as I find it a little confusing, so I thought what is he trying to achieve and what is an alternative approach. I then came up with these demands for the code:

You're looking for the maximum repeated substring completely filling out the original string. This means:
- The candidate substring length, must then divide the original string length without any leftover (or rest characters)
- The candidate substring can't be more than half the length of the original string, as it then can't be duplicated
- The first (and shortest) candidate substring will always give the most repeats, if it matches the other criteria

So one way to write this out is like this:

def repeatedSubstring(source):
 """Look for the shortest substring which when repeated equals
 the source string, without any left over characters.
 """
 length = len(source)
 print '\nOriginal text: {}. Length: {}'.format(source, length)
 # Check candidate strings
 for i in range(1, length/2+1):
 repeat_count, leftovers = divmod(length, i)
 # print 'i: {}, sub: {}, div: {}, mod: {}'.format(i, source[:i], repeat_count, leftovers)
 # print 'repeated: {}'.format(source[:i]*repeat_count)
 # Check for no leftovers characters, and equality when repeated 
 if (leftovers == 0) and (source == source[:i]*repeat_count):
 print 'Found repeated substring: {}, count: {}'.format(source[:i], repeat_count)
 break
 else:
 print "Couldn't find any repeated substring"
repeatedSubstring("abcdeabcde")
repeatedSubstring("abcdeabcdef")
repeatedSubstring("aaaaa")

I've commented out some debug print statements, and left it a little more verbose than the original code.

If you want, you can easily replace the remaining print statements with return div and return 1. A stripped down version would then look like:

def repeatedSubstringCount(source):
 """Look for the shortest substring which when repeated equals
 the source string, without any left over characters.
 Return the maximum repeat count, 1 if none found.
 """
 length = len(source)
 # Check candidate strings
 for i in range(1, length/2+1):
 repeat_count, leftovers = divmod(length, i)
 # Check for no leftovers characters, and equality when repeated 
 if (leftovers == 0) and (source == source[:i]*repeat_count):
 return repeat_count
 return 1

I still left a few comments in there, so that it possible to have some idea on what is happening. I've also changed the name to "repeatedSubstringCount" to indicate both what the function does, but also what it returns.

Question 8

Edit: @holroy's answer is faster than this one

This would be my approached on this task:

from itertools import groupby
def all_equal(iterable):
 "Returns True if all the elements are equal to each other"
 "https://docs.python.org/2.7/library/itertools.html#recipes"
 g = groupby(iterable)
 return next(g, True) and not next(g, False)
def divisors(n): # based on vnp's answer
 x = []
 y = []
 i = 1
 while i*i < n:
 if n % i == 0:
 x.append(i)
 y.append(n/i)
 i += 1
 if i*i == n:
 x.append(i)
 x.extend(list(reversed(y)))
 return x
def answer(s):
 n = len(s)
 for divisor in divisors(n):
 # cut the string in divisor-length substring and check if all substrings are equal
 if all_equal(s[i:i+divisor] for i in range(0, n, divisor)):
 return n / divisor

Time measurements:

In [52]: %timeit answer1("abcdefghijklmnopqrstuvwxyz"*10) # Your solution
1000 loops, best of 3: 437 μs per loop
In [53]: %timeit answer("abcdefghijklmnopqrstuvwxyz"*10) # My solution
10000 loops, best of 3: 36.6 μs per loop
In [19]: %timeit repeatedSubstringCount("abcdefghijklmnopqrstuvwxyz"*10) # @holroy's solution
100000 loops, best of 3: 12.2 μs per loop
In [59]: %timeit answer1("abcabcd")
100000 loops, best of 3: 20.4 μs per loop
In [60]: %timeit answer("abcabcd")
100000 loops, best of 3: 8.96 μs per loop
In [21]: %timeit repeatedSubstringCount("abcabcd")
1000000 loops, best of 3: 2.03 μs per loop

Question 9

Thanks for commenting on the speed issue... I didn't think about that at all when implementing it, I just went with my basic instinct on how I would code it and make it readable. It's always nice when that is fast as well!

Question 10

The loop

 for i in range(1, length+1):
 if length % i == 0:
 x.append(i)

builds a list of the divisors of length. It has a very well defined purpose, and I recommend to factor it out into a function

 def divisors(n)

It could also be optimized. There is no need to encompass the entire range(1, length+1). Notice that the divisors come in pairs: if i divides n, then n/i also divides n:

 i = 1
 while i*i < n:
 if n % i == 0:
 x.append(i)
 x.append(n/i)
 i += 1
 if i*i == n:
 x.append(i)
 return x

This version has the \$O(\sqrt{n})\$ time complexity vs \$O(n)\$ of the original. The resulting list is not sorted, but it is easily amendable:

 x = []
 y = []
 while i*i < n:
 if n % i == 0:
 x.append(i)
 y.append(n/i)
 i += 1
 if i*i == n:
 x.append(i)
 x.append(list(reversed(y)))
 return x

The line

 repeat = len(x)*10

truly stumbles me. Where does 10 come from?

The

 for _ in range(0, len(x)):

seems to work just fine.

The loop above computes max(x) at each iteration, and therefore exhibits a quadratic complexity over the len(x). Since x is sorted, you should just iterate from the end (or reverse x to begin with).

The elif is unjustified. If the code reaches this clause, it is already known that the condition is true - otherwise the function would already return. I recommend

 if all(check==exp[0] for check in exp):
 print len(exp)
 return len(exp)
 x.remove(max(x))

All that said, I am not sure I understand the core logic (or the problem statement, since you said you passed the test). Can you explain the results:

>>> print answer("aba")
1
>>> print answer("aaba")
1
>>> print answer("aabaa")
None

Question 11

Aren't the first two cases you asked about just without any repeating substring, so it is one string which is repeated exactly once? The last one should also be 1 in that case, though.

Question 12

Whenever I ran it for a larger string with close to 200 characters it would break. I have no idea why the 10 makes a difference, but it didn't work without making the range bigger.

Question 13

There is a more elegant way to do that.

def answer(s):
 return len(s) // (s+s).find(s, 1)

This works because the first alignment of a string on itself doubled is exactly at the length of the repeated pattern.

abcabcabcabcabcabcabcabc
 abcabcabcabc
 ^ first alignment at index 3 (len(s)/3 = 4)

abccbaabccbaabccbaabccba
 abccbaabccba
 ^ first alignment at index 6 (len(s)/6 = 2)

abcabcdabcabcd
 abcabcd
 ^ first alignment at index 7 (len(s)/7 = 1)

holroy holroy 11.7k1 gold badge27 silver badges59 bronze badges · Accepted Answer · 2017-04-21 13:51:39Z

I see that there already is an accepted answer, before I got around to writing this answer. That is a little discouraging for me as a reviewer, but I've written a code alternative and have some thoughts regarding your code.

Code review

Document your code – To guess what your code is doing by simply reading it is not very easy. It could very well have a few comments on what it tries to achieve.
Better variable names – In general try to use variable names describing the content. Names like x, a, exp (which I would read as expression?), and so on, are not very informative.
Regex are generally expensive – In general regex, whilst being a fantastic too, are somewhat overused. Do note that it takes time to build the patterns and match the patterns, especially compared to simple string matches.

Some new(?) constructs

I'm not sure if you know these already, but there are a few new constructs I would like to show you:

String repeat – Strings can be multiplied (aka duplicated) using the multiplication operator.
```
text = "abc"
print text * 3 # Will output "abcabcabc"
```
divmod with multiple outputs – divmod(a, b) will divide a by b and return the divisor and the rest. This can be stored directly into a tuple like in the following:
```
div, rest = divmod(13, 5) # Giving div == 2, rest = 3
```
Slice and dice strings – As of Python 2.3 one can slice strings. The most common variants indicates how many characters you want from start like text[:3] or if you want the rest of the text like text[3:]. This can be very useful...
A slightly fancier print varant – Using .format in combination with print can produce nicer output rather easily:
```
print 'i: {}, sub: {}, div: {}, mod: {}'.format(i, source[:i], div, rest)
```
This would output on the same line, something like:
```
i: 4, sub: abcd, div: 2, mod: 3
```
else-block after for?! – This will make sense later on, but if a for loop completes normally, it'll not enter the optional else:-block. In other words, if you break out of a for loop Python won't enter the else block. This can be used to verify that the for loop actually found/did something, and provide an alternative if it didn't.
```
for i in range(5):
 print i
 if i == 3:
 break
else:
 print "Nothing larger than 3..."
```

Code alternative

I haven't commented too much on your chosen algorithm, as I find it a little confusing, so I thought what is he trying to achieve and what is an alternative approach. I then came up with these demands for the code:

You're looking for the maximum repeated substring completely filling out the original string. This means:
- The candidate substring length, must then divide the original string length without any leftover (or rest characters)
- The candidate substring can't be more than half the length of the original string, as it then can't be duplicated
- The first (and shortest) candidate substring will always give the most repeats, if it matches the other criteria

So one way to write this out is like this:

def repeatedSubstring(source):
 """Look for the shortest substring which when repeated equals
 the source string, without any left over characters.
 """
 length = len(source)
 print '\nOriginal text: {}. Length: {}'.format(source, length)
 # Check candidate strings
 for i in range(1, length/2+1):
 repeat_count, leftovers = divmod(length, i)
 # print 'i: {}, sub: {}, div: {}, mod: {}'.format(i, source[:i], repeat_count, leftovers)
 # print 'repeated: {}'.format(source[:i]*repeat_count)
 # Check for no leftovers characters, and equality when repeated 
 if (leftovers == 0) and (source == source[:i]*repeat_count):
 print 'Found repeated substring: {}, count: {}'.format(source[:i], repeat_count)
 break
 else:
 print "Couldn't find any repeated substring"
repeatedSubstring("abcdeabcde")
repeatedSubstring("abcdeabcdef")
repeatedSubstring("aaaaa")

I've commented out some debug print statements, and left it a little more verbose than the original code.

If you want, you can easily replace the remaining print statements with return div and return 1. A stripped down version would then look like:

def repeatedSubstringCount(source):
 """Look for the shortest substring which when repeated equals
 the source string, without any left over characters.
 Return the maximum repeat count, 1 if none found.
 """
 length = len(source)
 # Check candidate strings
 for i in range(1, length/2+1):
 repeat_count, leftovers = divmod(length, i)
 # Check for no leftovers characters, and equality when repeated 
 if (leftovers == 0) and (source == source[:i]*repeat_count):
 return repeat_count
 return 1

I still left a few comments in there, so that it possible to have some idea on what is happening. I've also changed the name to "repeatedSubstringCount" to indicate both what the function does, but also what it returns.

Stack Exchange Network

Use Python to determine the repeating pattern in a string

4 Answers 4

Code review

Some new(?) constructs

Code alternative

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Use Python to determine the repeating pattern in a string

4 Answers 4

Code review

Some new(?) constructs

Code alternative

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions