I'm solving yet another problem in HackerRank (https://www.hackerrank.com/challenges/determining-dna-health/problem). In short: you are given 2 arrays (genes
and health
), one of which have a 'gene' name, and the other - 'gene' weight (aka health). You then given a bunch of strings, each containing values m
and n
, which denote the start and end of the slice to be applied to the genes
and health
arrays, and the 'gene'-string, for which we need to determine healthiness. Then we need to return health-values for the most and the least healthy strings.
My solution is below, and it works, but it's not scalable, i.e. it fails testcases with a lot of values.
import re
if __name__ == '__main__':
n = int(input())
genes = input().rstrip().split()
health = list(map(int, input().rstrip().split()))
s = int(input())
weights = []
for s_itr in range(s):
m,n,gn = input().split()
weight = 0
for i in range(int(m),int(n)+1):
if genes[i] in gn:
compilt = "r'(?=("+genes[i]+"))'"
matches = len(re.findall(eval(compilt), gn))
weight += health[i]*matches
weights.append(weight)
print(min(weights), max(weights))
Can you advise on how to apply generators here? I suspect that the solution fails because of the very big list that's being assembled. Is there a way to get min and max values here without collecting them all?
Example values:
genes = ['a', 'b', 'c', 'aa', 'd', 'b']
health = [1, 2, 3, 4, 5, 6]
gene1 = "1 5 caaab" (result = 19 = max)
gene2 = "0 4 xyz" (result = 0 = min)
gene3 = "2 4 bcdybc" (result = 11)
This case returns 0 19
UPD: Follow-up question on optimized (but still not working) solution here.
-
6\$\begingroup\$ I have rolled back your most recent edit. Code Review does not permit the question to be substantially changed after answers have been posted (accepted or not) as it becomes ambiguous as to which version of the question the answer post is answering. If you have an improved version of the code, and want feedback on it, you must post a new question. See "What should I do when someone answers my question" in the help center. Note the recommendation about adding a link from this question to the new question, as well as a back link from the new question to this one. \$\endgroup\$AJNeufeld– AJNeufeld2020年06月01日 23:08:47 +00:00Commented Jun 1, 2020 at 23:08
1 Answer 1
eval(...)
What is the purpose of:
compilt = "r'(?=(" + genes[i] + "))'"
... eval(compilt), ...
It takes a string like "aa"
, and forms a new string "r'(?=(aa))'"
, which is source code for the raw string r'(?=(aa))'
, which when eval
uated yields the string "(?=(aa))"
.
There is no escaping being done, no obvious reason to do the raw string formation and subsequent evaluation, and no prevention of a syntax error due to a stray '
character in the genes[i]
array. So ... why not simply:
regex = "(?=(" + gene[i] + "))"
and no call to eval(...)
at all?
Regex Capturing
The regex subexpression (...)
is a "capturing group", which copies the matching characters into an internal buffer, for returning in the match group.
>>> re.findall('(?=(aa))', "caaab")
['aa', 'aa']
Without the capturing group, the matching characters do not have to be copied to the internal buffer, to be returned.
>>> re.findall('(?=aa)', "caaab")
['', '']
Given that you are only interested in the len(...)
of the list returned from re.findall()
, the capturing group seems like unnecessary overhead, which can be eliminated for faster execution.
Compiled Regex
As Python uses regular expressions, it maintains a cache of the most recently used regular expressions. This cache has a limited size, to prevent an excessive memory load.
In this exercise, you are repeatedly using the same gene regular expressions for each "healthy string" test. If the number of genes exceeds the cache size, Python will be tossing out compiled regular expressions, only to compile them again moments later.
You can preempt this by compiling and storing all the gene regular expressions ahead of time. Leveraging Python 3.6's f-strings, and list comprehension:
genes = input().rstrip().split()
genes_rx = [re.compile(f"(?={gene})") for gene in genes]
Used as:
matches = len(re.findall(genes_rx[i], gn))
Now the gene to regular expression string, to compiled regular expression is done once per gene, instead of once per "healthy string" test.
Computing min/max weight without creating a list
How about:
min_weight = math.inf
max_weight = -math.inf
for ...:
weight = ...
if weight < min_weight:
min_weight = weight
if weight > max_weight:
max_weight = weight
print(min_weight, max_weight)
-
\$\begingroup\$ AJNeufeld, I would need to take some time digesting your answer: some of the concepts you mentioned are not yet familiar to me, but right now I would like to thank you: your effort is really appreciated! \$\endgroup\$Denis Shvetsov– Denis Shvetsov2020年06月01日 20:13:04 +00:00Commented Jun 1, 2020 at 20:13
-
\$\begingroup\$ As for the 'what's the purpose' questions - I'm fairly new to the language, I learn stuff from various sources, and that is the best I could come up with. You are definitely right about it not being clean or optimal \$\endgroup\$Denis Shvetsov– Denis Shvetsov2020年06月01日 20:14:36 +00:00Commented Jun 1, 2020 at 20:14
-
\$\begingroup\$ I tried the things you mentioned, but the code still fails due to timeout. See full solution in the question body. \$\endgroup\$Denis Shvetsov– Denis Shvetsov2020年06月01日 22:22:55 +00:00Commented Jun 1, 2020 at 22:22
-
1\$\begingroup\$ That too bad. It should be faster. Does it pass more test cases, or is it just one super large case that fails with a timeout? \$\endgroup\$AJNeufeld– AJNeufeld2020年06月01日 23:12:54 +00:00Commented Jun 1, 2020 at 23:12
-
\$\begingroup\$ @DenisShvetsov Can you test and profile your code locally to see the difference in execution time? \$\endgroup\$2020年06月02日 06:11:51 +00:00Commented Jun 2, 2020 at 6:11
Explore related questions
See similar questions with these tags.