I am writing an algorithm to count the number of times a substring repeats itself. The string is between 1-200 characters ranging from letters a-z. There should be no left overs at the end of the pattern either and it should be split into the smallest possible combination.
answer("abcabcabcabc"):output = 4
answer("abccbaabccba"): output = 2
answer("abcabcd"): output = 1
My code:
import re
def answer(s):
length = len(s)
x=[]
reg = ""
for i in range(1, length+1):
if length % i == 0:
x.append(i)
repeat = len(x)*10
for _ in range(0, repeat):
a = length/max(x)
print(length)
print(max(x))
print(a)
for _ in range(0, a):
reg = reg + "."
exp = re.findall(reg, s)
print exp
if all(check==exp[0] for check in exp):
print len(exp)
return len(exp)
elif all(check!=exp[0] for check in exp):
x.remove(max(x))
This is Python 2.7 code and I don't have to use regex. It just seemed like the easiest way.
Is there a better/faster/more optimal way to do this?
NOTE: it breaks if the string size is too big.
EDIT: Fixed indentation
4 Answers 4
I see that there already is an accepted answer, before I got around to writing this answer. That is a little discouraging for me as a reviewer, but I've written a code alternative and have some thoughts regarding your code.
Code review
- Document your code – To guess what your code is doing by simply reading it is not very easy. It could very well have a few comments on what it tries to achieve.
- Better variable names – In general try to use variable names describing the content. Names like
x
,a
,exp
(which I would read as expression?), and so on, are not very informative. - Regex are generally expensive – In general regex, whilst being a fantastic too, are somewhat overused. Do note that it takes time to build the patterns and match the patterns, especially compared to simple string matches.
Some new(?) constructs
I'm not sure if you know these already, but there are a few new constructs I would like to show you:
String repeat – Strings can be multiplied (aka duplicated) using the multiplication operator.
text = "abc" print text * 3 # Will output "abcabcabc"
divmod
with multiple outputs –divmod(a, b)
will dividea
byb
and return the divisor and the rest. This can be stored directly into a tuple like in the following:div, rest = divmod(13, 5) # Giving div == 2, rest = 3
- Slice and dice strings – As of Python 2.3 one can slice strings. The most common variants indicates how many characters you want from start like
text[:3]
or if you want the rest of the text liketext[3:]
. This can be very useful... A slightly fancier
print
varant – Using .format in combination withprint
can produce nicer output rather easily:print 'i: {}, sub: {}, div: {}, mod: {}'.format(i, source[:i], div, rest)
This would output on the same line, something like:
i: 4, sub: abcd, div: 2, mod: 3
else
-block afterfor
?! – This will make sense later on, but if afor
loop completes normally, it'll not enter the optionalelse:
-block. In other words, if youbreak
out of afor
loop Python won't enter theelse
block. This can be used to verify that thefor
loop actually found/did something, and provide an alternative if it didn't.for i in range(5): print i if i == 3: break else: print "Nothing larger than 3..."
Code alternative
I haven't commented too much on your chosen algorithm, as I find it a little confusing, so I thought what is he trying to achieve and what is an alternative approach. I then came up with these demands for the code:
- You're looking for the maximum repeated substring completely filling out the original string. This means:
- The candidate substring length, must then divide the original string length without any leftover (or rest characters)
- The candidate substring can't be more than half the length of the original string, as it then can't be duplicated
- The first (and shortest) candidate substring will always give the most repeats, if it matches the other criteria
So one way to write this out is like this:
def repeatedSubstring(source):
"""Look for the shortest substring which when repeated equals
the source string, without any left over characters.
"""
length = len(source)
print '\nOriginal text: {}. Length: {}'.format(source, length)
# Check candidate strings
for i in range(1, length/2+1):
repeat_count, leftovers = divmod(length, i)
# print 'i: {}, sub: {}, div: {}, mod: {}'.format(i, source[:i], repeat_count, leftovers)
# print 'repeated: {}'.format(source[:i]*repeat_count)
# Check for no leftovers characters, and equality when repeated
if (leftovers == 0) and (source == source[:i]*repeat_count):
print 'Found repeated substring: {}, count: {}'.format(source[:i], repeat_count)
break
else:
print "Couldn't find any repeated substring"
repeatedSubstring("abcdeabcde")
repeatedSubstring("abcdeabcdef")
repeatedSubstring("aaaaa")
I've commented out some debug print
statements, and left it a little more verbose than the original code.
If you want, you can easily replace the remaining print
statements with return div
and return 1
. A stripped down version would then look like:
def repeatedSubstringCount(source):
"""Look for the shortest substring which when repeated equals
the source string, without any left over characters.
Return the maximum repeat count, 1 if none found.
"""
length = len(source)
# Check candidate strings
for i in range(1, length/2+1):
repeat_count, leftovers = divmod(length, i)
# Check for no leftovers characters, and equality when repeated
if (leftovers == 0) and (source == source[:i]*repeat_count):
return repeat_count
return 1
I still left a few comments in there, so that it possible to have some idea on what is happening. I've also changed the name to "repeatedSubstringCount" to indicate both what the function does, but also what it returns.
Edit: @holroy's answer is faster than this one
This would be my approached on this task:
from itertools import groupby
def all_equal(iterable):
"Returns True if all the elements are equal to each other"
"https://docs.python.org/2.7/library/itertools.html#recipes"
g = groupby(iterable)
return next(g, True) and not next(g, False)
def divisors(n): # based on vnp's answer
x = []
y = []
i = 1
while i*i < n:
if n % i == 0:
x.append(i)
y.append(n/i)
i += 1
if i*i == n:
x.append(i)
x.extend(list(reversed(y)))
return x
def answer(s):
n = len(s)
for divisor in divisors(n):
# cut the string in divisor-length substring and check if all substrings are equal
if all_equal(s[i:i+divisor] for i in range(0, n, divisor)):
return n / divisor
Time measurements:
In [52]: %timeit answer1("abcdefghijklmnopqrstuvwxyz"*10) # Your solution
1000 loops, best of 3: 437 μs per loop
In [53]: %timeit answer("abcdefghijklmnopqrstuvwxyz"*10) # My solution
10000 loops, best of 3: 36.6 μs per loop
In [19]: %timeit repeatedSubstringCount("abcdefghijklmnopqrstuvwxyz"*10) # @holroy's solution
100000 loops, best of 3: 12.2 μs per loop
In [59]: %timeit answer1("abcabcd")
100000 loops, best of 3: 20.4 μs per loop
In [60]: %timeit answer("abcabcd")
100000 loops, best of 3: 8.96 μs per loop
In [21]: %timeit repeatedSubstringCount("abcabcd")
1000000 loops, best of 3: 2.03 μs per loop
-
\$\begingroup\$ Thanks for commenting on the speed issue... I didn't think about that at all when implementing it, I just went with my basic instinct on how I would code it and make it readable. It's always nice when that is fast as well! \$\endgroup\$holroy– holroy2017年04月21日 14:12:20 +00:00Commented Apr 21, 2017 at 14:12
The loop
for i in range(1, length+1):
if length % i == 0:
x.append(i)
builds a list of the divisors of length
. It has a very well defined purpose, and I recommend to factor it out into a function
def divisors(n)
It could also be optimized. There is no need to encompass the entire range(1, length+1)
. Notice that the divisors come in pairs: if i
divides n
, then n/i
also divides n
:
i = 1
while i*i < n:
if n % i == 0:
x.append(i)
x.append(n/i)
i += 1
if i*i == n:
x.append(i)
return x
This version has the \$O(\sqrt{n})\$ time complexity vs \$O(n)\$ of the original. The resulting list is not sorted, but it is easily amendable:
x = []
y = []
while i*i < n:
if n % i == 0:
x.append(i)
y.append(n/i)
i += 1
if i*i == n:
x.append(i)
x.append(list(reversed(y)))
return x
The line
repeat = len(x)*10
truly stumbles me. Where does 10
come from?
The
for _ in range(0, len(x)):
seems to work just fine.
The loop above computes max(x)
at each iteration, and therefore exhibits a quadratic complexity over the len(x)
. Since x
is sorted, you should just iterate from the end (or reverse x
to begin with).
The elif
is unjustified. If the code reaches this clause, it is already known that the condition is true - otherwise the function would already return. I recommend
if all(check==exp[0] for check in exp):
print len(exp)
return len(exp)
x.remove(max(x))
All that said, I am not sure I understand the core logic (or the problem statement, since you said you passed the test). Can you explain the results:
>>> print answer("aba")
1
>>> print answer("aaba")
1
>>> print answer("aabaa")
None
-
\$\begingroup\$ Aren't the first two cases you asked about just without any repeating substring, so it is one string which is repeated exactly once? The last one should also be 1 in that case, though. \$\endgroup\$Graipher– Graipher2017年04月21日 10:26:02 +00:00Commented Apr 21, 2017 at 10:26
-
\$\begingroup\$ Whenever I ran it for a larger string with close to 200 characters it would break. I have no idea why the 10 makes a difference, but it didn't work without making the range bigger. \$\endgroup\$Jake Steele– Jake Steele2017年04月21日 13:02:43 +00:00Commented Apr 21, 2017 at 13:02
There is a more elegant way to do that.
def answer(s):
return len(s) // (s+s).find(s, 1)
This works because the first alignment of a string on itself doubled is exactly at the length of the repeated pattern.
abcabcabcabcabcabcabcabc
abcabcabcabc
^ first alignment at index 3 (len(s)/3 = 4)
abccbaabccbaabccbaabccba
abccbaabccba
^ first alignment at index 6 (len(s)/6 = 2)
abcabcdabcabcd
abcabcd
^ first alignment at index 7 (len(s)/7 = 1)
Explore related questions
See similar questions with these tags.
for i in range(1, length+1):
is indented just as the next lineif length % i == 0:
. \$\endgroup\$