High execution time to count overlapping substrings

Question 1

I was doing this HackerRank problem which basically boils down to counting overlapping substrings in a string. I used this solution from StackOverflow to build this program -

def overlapping_occurrences(string, sub):
 count = start = 0
 while True:
 start = string.find(sub, start) + 1
 if start > 0:
 count+=1
 else:
 return count
def find_health(start, end, d, genes, gene_health):
 hv = 0
 beneficial = set(genes[start:end+1])
 for b in beneficial:
 # b is just a single letter, so no possibilities of overlapping
 if len(b) == 1:
 occurrences = d.count(b)
 else:
 occurrences = overlapping_occurrences(d, b)
 total_value = sum(gene_health[b])
 hv += total_value*occurrences
 return hv
if __name__ == '__main__':
 n = int(raw_input())
 genes = list(raw_input().split())
 health = list(raw_input().split())
 gene_health = {}
 for idx in xrange(n):
 try:
 gene_health[genes[idx]].append(int(health[idx]))
 except:
 gene_health[genes[idx]] = []
 gene_health[genes[idx]].append(int(health[idx]))
 s = int(raw_input())
 health_values = []
 for _ in xrange(s):
 start, end, d = raw_input().split()
 health_values.append(find_health(int(start), int(end), d, genes, gene_health))
 low = min(health_values)
 high = max(health_values)
 print "%d %d" % (low, high)

I've used a dictionary gene_health because as the first example in the problem description shows, a single gene can have multiple different values, so I store all those values in a list which I sum over to get the total_value.

My problem is that this program takes a very long time and exceeds the time limit on most of the test cases. I do think that this algorithm is CORRECT, since for the few test cases where it doesn't time out, it gives the correct answer. So I'm just looking for a way to speed up the program, and I think the only place where any speedups can be performed is in the overlapping_occurrences method.

I've searched for other similar questions on StackOverflow but all of them use regex but regex will make it even slower (I think).

Any suggestions or poking in the right direction will help. Thank you!

Question 2

overlapping_occurrences has \$O(MN)\$ complexity where \$M\$ and \$N\$ are the sizes of the string and the substring respectively. You can reduce it to \$O(M+N)\$ using KMP algorithm. However, looking at the constraints, I feel that even that might not be sufficient - the highest score for the problem is only 50.

Question 3

A naive implementation of the algorithm described there performs worse for some worst-case scenarios like text = "A"*10000000 + "B"; word = "A"*100 + "B". text.find(word) is significantly faster than kmp_search(text, word) there. Not sure what kind of corner cases the actual challenge uses, though.

Question 4

Can you show me your naive implementation? It is very easy to get KMP wrong.

Question 5

Sure, I'll just ask it as a question here.

Question 6

I have a question though. The KMP algorithm is just used to find if a substring occurs in a word. So, if the word is caaabaa, then kmp('caaabaa', 'aa') will just return the index of the first occurrence of the substring in the word. So if I want to count the number of occurrences, do I have to do something like kmp('caaabaa', 'aa') which will give me the first occurrence's index, then get the part of the string AFTER this index, and do kmp('aabaa', 'aa'), and so on? If so, I don't see how that's better cause the complexity will again be a product of lengths.

Question 7

My naive KMP implementation can be found in this question

Raziman T V 8331 gold badge5 silver badges8 bronze badges · Answer 1 · 2017-03-21 12:46:29Z

6

\$\begingroup\$

overlapping_occurrences has \$O(MN)\$ complexity where \$M\$ and \$N\$ are the sizes of the string and the substring respectively. You can reduce it to \$O(M+N)\$ using KMP algorithm. However, looking at the constraints, I feel that even that might not be sufficient - the highest score for the problem is only 50.

Share

answered Mar 21, 2017 at 12:46

Raziman T V's user avatar

Raziman T V

8331 gold badge5 silver badges8 bronze badges

\$\endgroup\$

9

\$\begingroup\$ A naive implementation of the algorithm described there performs worse for some worst-case scenarios like text = "A"*10000000 + "B"; word = "A"*100 + "B". text.find(word) is significantly faster than kmp_search(text, word) there. Not sure what kind of corner cases the actual challenge uses, though. \$\endgroup\$

Graipher
– Graipher

2017年03月21日 13:17:46 +00:00
Commented Mar 21, 2017 at 13:17
\$\begingroup\$ Can you show me your naive implementation? It is very easy to get KMP wrong. \$\endgroup\$

Raziman T V
– Raziman T V

2017年03月21日 13:18:26 +00:00
Commented Mar 21, 2017 at 13:18
\$\begingroup\$ Sure, I'll just ask it as a question here. \$\endgroup\$

Graipher
– Graipher

2017年03月21日 13:19:33 +00:00
Commented Mar 21, 2017 at 13:19
\$\begingroup\$ I have a question though. The KMP algorithm is just used to find if a substring occurs in a word. So, if the word is caaabaa, then kmp('caaabaa', 'aa') will just return the index of the first occurrence of the substring in the word. So if I want to count the number of occurrences, do I have to do something like kmp('caaabaa', 'aa') which will give me the first occurrence's index, then get the part of the string AFTER this index, and do kmp('aabaa', 'aa'), and so on? If so, I don't see how that's better cause the complexity will again be a product of lengths. \$\endgroup\$

Sidharth Samant
– Sidharth Samant

2017年03月21日 13:28:22 +00:00
Commented Mar 21, 2017 at 13:28
1

\$\begingroup\$ My naive KMP implementation can be found in this question \$\endgroup\$

Graipher
– Graipher

2017年03月21日 13:31:57 +00:00
Commented Mar 21, 2017 at 13:31

| Show 4 more comments

Stack Exchange Network

High execution time to count overlapping substrings

1 Answer 1

You must log in to answer this question.

Linked

Hot Network Questions

High execution time to count overlapping substrings

1 Answer 1

You must log in to answer this question.

Linked

Related

Hot Network Questions