Determine if a string only contains repetitions of a substring

Question 1

This is my first Python program in years, it's just a simple string manipulation program. Given a string, determine if it contains repetitions of any substring. For example, testtest contains two substring test and nothing else.

def onlySubstrings(stringOrig):
 stringCopy = (stringOrig+'.')[:-1] # copy string contents
 for i in range(1, len(stringCopy)): # for each character
 stringCopy = stringCopy[1:]+stringCopy[0] # shift first char to end
 if stringCopy == stringOrig: # if equivalent to original input 
 return True # it consists of substrings only
 else: # otherwise the only substring is the string itself
 return False
if __name__ == '__main__':
 wordList = ['testtest','testteste','test','tetestst'] # some test words
 print('The below words only contain substring repetitions:')
 for word in wordList: # for each word
print(onlySubstrings(word), '\t', word)

If you imagine the string as circular unto itself, the string can only contain equal substring(s) if you can move the characters to the left or right and get the same original string in less than len shifts:

testtest - 0 shifts, original string
esttestt - 1 shift, not original
sttestte - 2 shifts, not original
ttesttes - 3 shifts, not original
testtest - 4 shifts, original again

Since 4 < len, it contains only substring repetitions.

testteste - 0 shifts, original string
esttestet - 1 shift, not original
...
etesttest - (len-1) shifts, not original
testteste - len shifts, original again

Since we couldn't find the original string again in under len shifts, it does not only contain repetitions of a substring.

Question 2

I also just realized that a string containing >1 equal substrings must have maximum substring length no more than half the string length. I don't need to shift len-1 times, rather I need to shift ceil(len/2) times. Saves a little time.

Question 3

another optimization: you don't need to check if i does not evenly divide len(str) since doing so would attempt to match partials which we know won't be true

Question 4

The code is relatively concise and easy to read. Here are some points that could be improved:

A general thing is the amount of commenting, you should generally resort to comments only when it's hard to make the code self-explanatory. You don't really need to comment each line of the code.
stringCopy = (stringOrig+'.')[:-1] # copy string contents
You can make a copy this way instead stringCopy = stringOrig[:].
To make the code more readable, you can create a function that shifts a string n characters. This will really help make the code more readable.
The else keyword after the for loop is not needed since you return anyway if you didn't finish the loop. else is usually used with break in the for loop.
The name of the function could be slightly improved, maybe consists_of_repeated_substrings. This is longer but is very easy to understand.
A note on performance: your algorithm creates a copy of the original string in each iteration. This works fine for shorter strings, but if you wanna make it more efficient you can consider matching the string without making a shifted copy of it.

Incorporating these comments you could rewrite the code as follows:

def shifted_string(str, n):
 return str[n:] + str[:n]
def consists_of_repeated_substrings(str):
 '''
 Checks whether a string consists of a substring repeated multiple times.
 Ex. 'testtesttest' is the string 'test' repeated 3 times.
 '''
 # The string can only contain equal substrings if you can shift it n 
 # times (0 < n < len) and get the same original string.
 for i in range(1, len(str)):
 if shifted_string(str, i) == str:
 return True
 return False

A clever algorithm for doing the same thing is as follows:

double_str = (str + str)[1:-1]
return double_str.find(str) != -1

The idea is similar to yours, so I will leave understanding it as an exercise.

Question 5

That is quite clever and an impressive truncated-to-one-line.

Question 6

This can be shortened to a one-liner using regular expressions:

import re
def isMadeFromRepeatedSubstrings(text):
 return re.search(r'^(.+?)1円+$', text)

The returned object will evaluate true or false, as in the original, and the substring itself is accessible via .groups(1):

>>> isMadeFromRepeatedSubstrings("testtest").groups(1)
('test',)

Question 7

This is exactly what I thought of when I read the question title (under HNQ) but I would go with (.+?) instead of (.+). It will make a difference if the input is something like "TestTestTestTest".

Question 8

excellent point, edited

Question 9

Please change your code to snake_case so people don't think camelCase is a widely accepted style in Python.

Question 10

My impression is that the code seems good: no suggestions at that level, other than a very minor one. Python's for-else structure is a bit of an oddball: I never use it and a almost never see it used. More to the point, it adds no clarity in this specific case. Just return False outside the loop.

Regarding the algorithm, however, I do have a suggestion. It seems fairly low level (in the sense of mucking around with character shifting and copying), not super easy to explain, and it did not seem intuitive at first glance to me.

A different approach is to rely on string multiplication: if the full string equals exactly N copies of a substring, then return True. You just need to loop over all possible substring lengths. For example:

def onlySubstrings(orig):
 orig_len = len(orig)
 max_len = int(orig_len / 2)
 for i in range(1, max_len + 1):
 substr = orig[0:i]
 n_copies = int(orig_len / i)
 if orig == substr * n_copies:
 return True
 return False

Question 11

On top of the other great answers, here are a few additional comments.

Style

There is a Style Guide for Python code called PEP 8 and I'd recommend reading it and trying to follow it more or less strictly. It your case, you could change the functions/variables names.

Better tests

Your test suite can be improved with a few simple details:

add the edge cases (strings of length 0 or 1) - also, these would need to be documented in the function docstring
add to your structure the expected output so that you can check the result automatically

You could write something like:

def unit_test_only_substring():
 # test cases - list of (input, expected_output)
 tests = [
 # Edge cases
 ('', False), # To be confirmed
 ('t', False), # To be confirmed
 # No repetition
 ('ab', False),
 ('tetestst', False),
 ('testteste', False),
 # Repetition
 ('tt', True),
 ('ttt', True),
 ('testtest', True),
 ('testtesttest', True),
 ]
 for str_input, expected_out in tests:
 out = onlySubstrings(str_input)
 print(out, '\t', expected_out, '\t', str_input)
 assert out == expected_out
if __name__ == '__main__':
 unit_test_only_substring()

You could go further and use a proper unit-test framework.

More code improvement

(Based on Hesham Attia) This is great chance to learn about Python builtins all and any:

def onlySubstrings(string):
 # The string can only contain equal substrings if you can shift it n
 # times (0 < n < len) and get the same original string.
 return any(shiftedString(string, i) == string for i in range(1, len(string)))

Hesham Attia Hesham Attia 8184 silver badges9 bronze badges · Accepted Answer · 2019-05-12 06:31:57Z

The code is relatively concise and easy to read. Here are some points that could be improved:

A general thing is the amount of commenting, you should generally resort to comments only when it's hard to make the code self-explanatory. You don't really need to comment each line of the code.
stringCopy = (stringOrig+'.')[:-1] # copy string contents
You can make a copy this way instead stringCopy = stringOrig[:].
To make the code more readable, you can create a function that shifts a string n characters. This will really help make the code more readable.
The else keyword after the for loop is not needed since you return anyway if you didn't finish the loop. else is usually used with break in the for loop.
The name of the function could be slightly improved, maybe consists_of_repeated_substrings. This is longer but is very easy to understand.
A note on performance: your algorithm creates a copy of the original string in each iteration. This works fine for shorter strings, but if you wanna make it more efficient you can consider matching the string without making a shifted copy of it.

Incorporating these comments you could rewrite the code as follows:

def shifted_string(str, n):
 return str[n:] + str[:n]
def consists_of_repeated_substrings(str):
 '''
 Checks whether a string consists of a substring repeated multiple times.
 Ex. 'testtesttest' is the string 'test' repeated 3 times.
 '''
 # The string can only contain equal substrings if you can shift it n 
 # times (0 < n < len) and get the same original string.
 for i in range(1, len(str)):
 if shifted_string(str, i) == str:
 return True
 return False

A clever algorithm for doing the same thing is as follows:

double_str = (str + str)[1:-1]
return double_str.find(str) != -1

The idea is similar to yours, so I will leave understanding it as an exercise.

That is quite clever and an impressive truncated-to-one-line.

Stack Exchange Network

Determine if a string only contains repetitions of a substring

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Determine if a string only contains repetitions of a substring

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions