I was doing this HackerRank problem which basically boils down to counting overlapping substrings in a string. I used this solution from StackOverflow to build this program -
def overlapping_occurrences(string, sub):
count = start = 0
while True:
start = string.find(sub, start) + 1
if start > 0:
count+=1
else:
return count
def find_health(start, end, d, genes, gene_health):
hv = 0
beneficial = set(genes[start:end+1])
for b in beneficial:
# b is just a single letter, so no possibilities of overlapping
if len(b) == 1:
occurrences = d.count(b)
else:
occurrences = overlapping_occurrences(d, b)
total_value = sum(gene_health[b])
hv += total_value*occurrences
return hv
if __name__ == '__main__':
n = int(raw_input())
genes = list(raw_input().split())
health = list(raw_input().split())
gene_health = {}
for idx in xrange(n):
try:
gene_health[genes[idx]].append(int(health[idx]))
except:
gene_health[genes[idx]] = []
gene_health[genes[idx]].append(int(health[idx]))
s = int(raw_input())
health_values = []
for _ in xrange(s):
start, end, d = raw_input().split()
health_values.append(find_health(int(start), int(end), d, genes, gene_health))
low = min(health_values)
high = max(health_values)
print "%d %d" % (low, high)
I've used a dictionary gene_health
because as the first example in the problem description shows, a single gene can have multiple different values, so I store all those values in a list which I sum over to get the total_value
.
My problem is that this program takes a very long time and exceeds the time limit on most of the test cases. I do think that this algorithm is CORRECT, since for the few test cases where it doesn't time out, it gives the correct answer. So I'm just looking for a way to speed up the program, and I think the only place where any speedups can be performed is in the overlapping_occurrences
method.
I've searched for other similar questions on StackOverflow but all of them use regex but regex will make it even slower (I think).
Any suggestions or poking in the right direction will help. Thank you!
1 Answer 1
overlapping_occurrences
has \$O(MN)\$ complexity where \$M\$ and \$N\$ are the sizes of the string and the substring respectively. You can reduce it to \$O(M+N)\$ using KMP algorithm. However, looking at the constraints, I feel that even that might not be sufficient - the highest score for the problem is only 50.
-
\$\begingroup\$ A naive implementation of the algorithm described there performs worse for some worst-case scenarios like
text = "A"*10000000 + "B"; word = "A"*100 + "B"
.text.find(word)
is significantly faster thankmp_search(text, word)
there. Not sure what kind of corner cases the actual challenge uses, though. \$\endgroup\$Graipher– Graipher2017年03月21日 13:17:46 +00:00Commented Mar 21, 2017 at 13:17 -
\$\begingroup\$ Can you show me your naive implementation? It is very easy to get KMP wrong. \$\endgroup\$Raziman T V– Raziman T V2017年03月21日 13:18:26 +00:00Commented Mar 21, 2017 at 13:18
-
\$\begingroup\$ Sure, I'll just ask it as a question here. \$\endgroup\$Graipher– Graipher2017年03月21日 13:19:33 +00:00Commented Mar 21, 2017 at 13:19
-
\$\begingroup\$ I have a question though. The KMP algorithm is just used to find if a substring occurs in a word. So, if the word is
caaabaa
, thenkmp('caaabaa', 'aa')
will just return the index of the first occurrence of the substring in the word. So if I want to count the number of occurrences, do I have to do something likekmp('caaabaa', 'aa')
which will give me the first occurrence's index, then get the part of the string AFTER this index, and dokmp('aabaa', 'aa')
, and so on? If so, I don't see how that's better cause the complexity will again be a product of lengths. \$\endgroup\$Sidharth Samant– Sidharth Samant2017年03月21日 13:28:22 +00:00Commented Mar 21, 2017 at 13:28 -
1\$\begingroup\$ My naive KMP implementation can be found in this question \$\endgroup\$Graipher– Graipher2017年03月21日 13:31:57 +00:00Commented Mar 21, 2017 at 13:31
Explore related questions
See similar questions with these tags.