This is a follow up for my question about optimizing solution for DNA Health HackerRank problem.
Short re-cap:
You are given 2 arrays (
genes
andhealth
), one of which have a 'gene' name, and the other - 'gene' weight (aka health). You then given a bunch of strings, each containing valuesm
andn
, which denote the start and end of the slice to be applied to thegenes
andhealth
arrays, and the 'gene'-string, for which we need to determine healthiness. Then we need to return health-values for the most and the least healthy strings.
At the advice from AJNeufeld, I optimized my code into the following. Unfotunately, it still fails to execute on large testcases, such as this one.
So, the question now is: what else can I do to make the solution less expensive?
if __name__ == '__main__':
n = int(input())
genes = input().rstrip().split()
genes_regex = [re.compile(f"(?={gene})") for gene in genes]
health = list(map(int, input().rstrip().split()))
s = int(input())
min_weight = math.inf
max_weight = -math.inf
for s_itr in range(s):
m,n,gn = input().split()
weight = 0
for i in range(int(m),int(n)+1):
if genes[i] in gn:
matches = len(re.findall(genes_regex[i], gn))
weight += health[i]*matches
if weight < min_weight:
min_weight = weight
if weight > max_weight:
max_weight = weight
print(min_weight, max_weight)
2 Answers 2
There are three obvious suggestions here.
Measure!
First, and most important: develop a timing framework to measure your changes! You can't know if a change is beneficial if all you have is "the old version doesn't pass" and "the new version doesn't pass".
Build a standard set of test cases, and a timing framework, and subject any change to measurement in the timing framework. It's better if the timing is better, otherwise it's not.
Cache your results
The examples shown at the hackerrank site specifically include one where the same gene string is repeated twice. So it seems likely that caching your results might provide an obvious performance win.
Stop using the regex engine
This is a "maybe." You're using the regex engine to get the findall
behavior, which is sensible, since it gives you access to C code that does what you want. But using that engine comes at a price -- regex operations are traditionally slower than string operations. So see if you can write your code to do the same job without using regex calls.
I'm honestly not sure if this will benefit you or not, since the regexes you are using are so simple. But if you pre-compute the minimum offset for each pattern, to allow for overlaps (like "a" -> +1, "aa" -> +1, "ab" -> +2) you should be able to scan using str.find
or str.index
and get what you want without any re
calls.
Bonus: generator
Your original question asked about using generators. Because the underlying operation is so expensive, I'd suggest writing a single minmax
function that yields both values at the same time (like divmod
does). You can feed that function with a generator that yields up the scores:
queries = [input() for _ in range(s)]
low, high = minmax(gene_scores(int(l[0]), int(l[1]), l[2] for l in queries))
(This has nothing to do with performance. But you wanted to use them, so here's a way!)
-
\$\begingroup\$ Thank you, Austin Quick question: I'm trying to use
lru_cache
fromfunctools
to optimize code, and I'm getting TypeError: unhashable type: 'list'. In the 2nd version of the solution it is probably because ofgenes_regex = [re.compile(f"(?={gene})") for gene in genes]
line. Or can it be the inputs? Any suggestions? \$\endgroup\$Denis Shvetsov– Denis Shvetsov2020年06月02日 22:02:11 +00:00Commented Jun 2, 2020 at 22:02 -
1\$\begingroup\$ @DenisShvetsov since
@lru_cache
is a function decorator, it must be on a function you have written. So that's where thelist
type is being returned in error. Make sure you are caching single-item functions, since those are the ones likely to provide benefit from caching. \$\endgroup\$aghast– aghast2020年06月03日 00:21:57 +00:00Commented Jun 3, 2020 at 0:21
Aho-Corasick algorithm
Perhaps another approach is in order. I'd suggest the Aho-Corasick algorithm. Here's the original paper Efficient String Matching: An Aid to Bibliographic Search (pdf).
- Create a mapping from gene to gene index and weight.
- Build the DFA with the all of the genes.
- Loop over the DNA tests
- Scan each DNA test using the DFA.
- For each gene returned by the DFA, look up the index and weight.
- If the index is in the allowed range for that DNA, add the weight to the running total for that DNA
- Keep track of the min and max scores for each DNA test
- Output the results
Don't have time now. I'll try to code it up later.
-
\$\begingroup\$ This is very interesting. But maybe don't publish your code here right away. I would like to try and figure it out myself, if you don't mind \$\endgroup\$Denis Shvetsov– Denis Shvetsov2020年06月05日 06:55:19 +00:00Commented Jun 5, 2020 at 6:55
-
\$\begingroup\$ Hey, @RootTwo, can you please check out codereview.stackexchange.com/questions/243500/… - it's my attempt to apply Aho-Corasick using ahocorapy (github.com/abusix/ahocorapy/blob/master/README.md), but it doesn't work well. \$\endgroup\$Denis Shvetsov– Denis Shvetsov2020年06月06日 23:18:59 +00:00Commented Jun 6, 2020 at 23:18
Explore related questions
See similar questions with these tags.
genes
&health
) are 100K each, and then there are 41K+ genes to test \$\endgroup\$