Anagram substrings in a string

Question 1

I was trying to solve this questions:

Given an input string, find all combination of unique substrings, which are anagrams of each other and distinct from each other. The substrings must follow 2 <= length(substring) < len(original string).

Example 1:

String: "vadapav"
substrings: va, vad, vada, vadap, vadapa, ad, ada, adap, adapa, adapav, da, dap, dapa, dapav, ap, apa, apav, pa, pav, av
Output: (va, av), (ad, da), (ap, pa)

How can it be optimised in terms of algorithmic and code complexity?

# Code goes below
from collections import defaultdict
import itertools
def anagram_substring(str):
 substr_list = []
 ans = []
 is_present = defaultdict(list)
 for i in xrange(len(str)):
 for j in xrange(i+2, len(str)+1):
 substr_list.append(str[i:j])
 substr_list = list(set(substr_list)) 
 for substr in substr_list:
 if is_present[''.join(sorted(substr))]:
 for anagram_str in is_present[''.join(sorted(substr))]:
 ans.append([anagram_str,substr])
 is_present[''.join(sorted(substr))].append(substr) 
 return ans
str = raw_input().strip()
print anagram_substring(str)

Question 2

The requirement 'The substrings must follow 2 <= length(substring) < len(original string)' is supefluous, because it follows from the preceding sentence: 1-char anagrams can't be distinct, hence len>1, and there's only one substring with length equal to the input string, so it couldn't differ from itself, either.

Question 3

If you do not have all expected outputs, the code is broken, which is off-topic here, but my suggestion would be to at least remove duplicates with a set when building the list.

Question 4

As far as I can understand your code, you test for two substrings for being reverse of one another (like "dome" & "emod"), not anagrams (like "dome" & "mode"). Reverse is a special case of anagram, but not every anagram is a reverse; not if string has more than 2 characters. For example, in "brabant" your code could find "ab" & "ba", but probably not "bra" & "rab".

Question 5

IMHO from the problem description it follows that for "barbarian" the solution should be similar to ("bar", "arb", "bra"). Can your code produce that?

Question 6

yes, I modified code, its working fine, can it be optimised more?

Question 7

There's no need to use a list and convert it to a set, you can use a set from the start.

Also, you're checking for every entry if it's in the list as an anagram and adding it to the answer, which is inefficient. You can do this in a faster way by using a dictionary that has the sorted substr as key and pushing the values in there.

from collections import defaultdict
import itertools
def anagram_substring(str):
 substr_list = set([])
 anagrams = defaultdict(list)
 for i in xrange(len(str)):
 for j in xrange(i + 2, len(str) + 1):
 substr_list.add(str[i:j])
 for substr in substr_list:
 anagrams[''.join(sorted(substr))].append(substr)
 return [values for key, values in anagrams.iteritems() if len(values) > 1]
str = raw_input().strip()
print anagram_substring(str)

Question 8

If you look at the current code complexity, it is O(n**2) in space complexity (since all substrings need to be saved).

Also, it is O(n**2)*O(k*log(k)) in time complexity, where O(n**2) is time to generate each substring for original string of length n, and O(k*log(k)) is time complexity to sort each substring of length k and save it as a key to be checked against later on.

In the worst case, k will approach n, so your time complexity would be O(log(n)*n**3) for the overall solution.

However, instead of sorting the keys, if you could just look up the keys for each substring, and compare against their values in other substrings, you would be good to go.

So consider solution below, which basically stores the count of each character within a substring in a Counter dictionary, and compares the Counters. Creation of counter as well as comparision of it is an O(k) operation. So that, the overall time complexity becomes O(n**3) in worst case.

Also note, your original output in the problem statement needs to be corrected: 'dapav', 'vadap' is a pair of anagrams that the output in the question is missing.

def find_anagram_substrings(s):
 length = len(s)
 substrings, substring_chars = [], {}
 results = set()
 for i in xrange(0, length):
 for j in xrange(i+2, length+1):
 substring = s[i:j]
 substrings.append(s[i:j])
 substring_chars[substring] = Counter(substring)
 for s1, s2 in product(substrings, substrings):
 if s1 != s2 and substring_chars[s1] == substring_chars[s2]:
 results.add((s1, s2) if s1 < s2 else (s2, s1))
 return results
print find_anagram_substrings("vadapav")
print find_anagram_substrings("gnomeslikelemons")

Coming to the code in the original post, I've added a few suggestions on the naming conventions that could help make it more readable:

from collections import defaultdict
import itertools
# itertools is not used, no need to import
def anagram_substring(str):
 substr_list = []
 ans = []
 # ans by itself does not clarify what it might contain.
 # result could be a better choice of variable name
 is_present = defaultdict(list)
 # is_present is again an ambiguous choice of variable name
 # it does not tell what values could it be storing
 for i in xrange(len(str)):
 # len(str) is used multiple times, can be a variable
 for j in xrange(i+2, len(str)+1):
 substr_list.append(str[i:j])
 substr_list = list(set(substr_list))
 # a set can be iterated normally, so why convert it back to list 
 for substr in substr_list:
 if is_present[''.join(sorted(substr))]:
 for anagram_str in is_present[''.join(sorted(substr))]:
 ans.append([anagram_str,substr])
 is_present[''.join(sorted(substr))].append(substr) 
 return ans
str = raw_input().strip()
print anagram_substring(str)

ChatterOne ChatterOne 2,84512 silver badges18 bronze badges · Answer 1 · 2017-04-01 15:40:31Z

There's no need to use a list and convert it to a set, you can use a set from the start.

Also, you're checking for every entry if it's in the list as an anagram and adding it to the answer, which is inefficient. You can do this in a faster way by using a dictionary that has the sorted substr as key and pushing the values in there.

from collections import defaultdict
import itertools
def anagram_substring(str):
 substr_list = set([])
 anagrams = defaultdict(list)
 for i in xrange(len(str)):
 for j in xrange(i + 2, len(str) + 1):
 substr_list.add(str[i:j])
 for substr in substr_list:
 anagrams[''.join(sorted(substr))].append(substr)
 return [values for key, values in anagrams.iteritems() if len(values) > 1]
str = raw_input().strip()
print anagram_substring(str)

Anshul Goyal Anshul Goyal 4443 silver badges13 bronze badges · Answer 2 · 2017-04-04 20:30:13Z

If you look at the current code complexity, it is O(n**2) in space complexity (since all substrings need to be saved).

Also, it is O(n**2)*O(k*log(k)) in time complexity, where O(n**2) is time to generate each substring for original string of length n, and O(k*log(k)) is time complexity to sort each substring of length k and save it as a key to be checked against later on.

In the worst case, k will approach n, so your time complexity would be O(log(n)*n**3) for the overall solution.

However, instead of sorting the keys, if you could just look up the keys for each substring, and compare against their values in other substrings, you would be good to go.

So consider solution below, which basically stores the count of each character within a substring in a Counter dictionary, and compares the Counters. Creation of counter as well as comparision of it is an O(k) operation. So that, the overall time complexity becomes O(n**3) in worst case.

Also note, your original output in the problem statement needs to be corrected: 'dapav', 'vadap' is a pair of anagrams that the output in the question is missing.

def find_anagram_substrings(s):
 length = len(s)
 substrings, substring_chars = [], {}
 results = set()
 for i in xrange(0, length):
 for j in xrange(i+2, length+1):
 substring = s[i:j]
 substrings.append(s[i:j])
 substring_chars[substring] = Counter(substring)
 for s1, s2 in product(substrings, substrings):
 if s1 != s2 and substring_chars[s1] == substring_chars[s2]:
 results.add((s1, s2) if s1 < s2 else (s2, s1))
 return results
print find_anagram_substrings("vadapav")
print find_anagram_substrings("gnomeslikelemons")

Coming to the code in the original post, I've added a few suggestions on the naming conventions that could help make it more readable:

from collections import defaultdict
import itertools
# itertools is not used, no need to import
def anagram_substring(str):
 substr_list = []
 ans = []
 # ans by itself does not clarify what it might contain.
 # result could be a better choice of variable name
 is_present = defaultdict(list)
 # is_present is again an ambiguous choice of variable name
 # it does not tell what values could it be storing
 for i in xrange(len(str)):
 # len(str) is used multiple times, can be a variable
 for j in xrange(i+2, len(str)+1):
 substr_list.append(str[i:j])
 substr_list = list(set(substr_list))
 # a set can be iterated normally, so why convert it back to list 
 for substr in substr_list:
 if is_present[''.join(sorted(substr))]:
 for anagram_str in is_present[''.join(sorted(substr))]:
 ans.append([anagram_str,substr])
 is_present[''.join(sorted(substr))].append(substr) 
 return ans
str = raw_input().strip()
print anagram_substring(str)

Stack Exchange Network

Anagram substrings in a string

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Anagram substrings in a string

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions