Finding min and max values of an iterable on the fly

Question 1

This is a follow up for my question about optimizing solution for DNA Health HackerRank problem.

Short re-cap:

You are given 2 arrays (genes and health), one of which have a 'gene' name, and the other - 'gene' weight (aka health). You then given a bunch of strings, each containing values m and n, which denote the start and end of the slice to be applied to the genes and health arrays, and the 'gene'-string, for which we need to determine healthiness. Then we need to return health-values for the most and the least healthy strings.

At the advice from AJNeufeld, I optimized my code into the following. Unfotunately, it still fails to execute on large testcases, such as this one.

So, the question now is: what else can I do to make the solution less expensive?

if __name__ == '__main__':
 n = int(input())
 genes = input().rstrip().split()
 genes_regex = [re.compile(f"(?={gene})") for gene in genes]
 health = list(map(int, input().rstrip().split()))
 s = int(input())
 min_weight = math.inf
 max_weight = -math.inf
 for s_itr in range(s):
 m,n,gn = input().split()
 weight = 0
 for i in range(int(m),int(n)+1):
 if genes[i] in gn:
 matches = len(re.findall(genes_regex[i], gn))
 weight += health[i]*matches
 if weight < min_weight:
 min_weight = weight
 if weight > max_weight:
 max_weight = weight
 print(min_weight, max_weight)

Question 2

Timing measurements for small testcase: Version 1(original): 0.0080 averaged accross 5 attempts (100 cycles each) (min=0.006530, max=0.010242) Version 2(current): 0.0029 averaged accross 5 attempts (100 cycles each) (min=0.002103, max=0.003921) As you can see, version 2 is faster than version 1 almost 3 times

Question 3

Timing measurements for a large testcase. Version 1: single attempt with 1 cycle - 234.6446889 Version 2: single attempt with 1 cycle - 164.2045158 Definitely better, but still rather expensive

Question 4

The measurements don't help much unless we know what the test cases looked like. Put them in the question or link to a pastebin. Or at least for the large test case: how many genes, how many gene strings, how long are the gene strings, how many (min, max) genes are tested against each gene string? Try running the profiler on you code to see where it si spending time. (My guess is the nested loops)

Question 5

The small testcase mentioned can be found in the first question asked (codereview.stackexchange.com/questions/243217/…). The large one I will put on pastebin a little latter, but the lists (genes & health) are 100K each, and then there are 41K+ genes to test

Question 6

There are three obvious suggestions here.

Measure!

First, and most important: develop a timing framework to measure your changes! You can't know if a change is beneficial if all you have is "the old version doesn't pass" and "the new version doesn't pass".

Build a standard set of test cases, and a timing framework, and subject any change to measurement in the timing framework. It's better if the timing is better, otherwise it's not.

Cache your results

The examples shown at the hackerrank site specifically include one where the same gene string is repeated twice. So it seems likely that caching your results might provide an obvious performance win.

Stop using the regex engine

This is a "maybe." You're using the regex engine to get the findall behavior, which is sensible, since it gives you access to C code that does what you want. But using that engine comes at a price -- regex operations are traditionally slower than string operations. So see if you can write your code to do the same job without using regex calls.

I'm honestly not sure if this will benefit you or not, since the regexes you are using are so simple. But if you pre-compute the minimum offset for each pattern, to allow for overlaps (like "a" -> +1, "aa" -> +1, "ab" -> +2) you should be able to scan using str.find or str.index and get what you want without any re calls.

Bonus: generator

Your original question asked about using generators. Because the underlying operation is so expensive, I'd suggest writing a single minmax function that yields both values at the same time (like divmod does). You can feed that function with a generator that yields up the scores:

queries = [input() for _ in range(s)]
low, high = minmax(gene_scores(int(l[0]), int(l[1]), l[2] for l in queries))

(This has nothing to do with performance. But you wanted to use them, so here's a way!)

Question 7

Thank you, Austin Quick question: I'm trying to use lru_cache from functools to optimize code, and I'm getting TypeError: unhashable type: 'list'. In the 2nd version of the solution it is probably because of genes_regex = [re.compile(f"(?={gene})") for gene in genes] line. Or can it be the inputs? Any suggestions?

Question 8

@DenisShvetsov since @lru_cache is a function decorator, it must be on a function you have written. So that's where the list type is being returned in error. Make sure you are caching single-item functions, since those are the ones likely to provide benefit from caching.

Question 9

Aho-Corasick algorithm

Perhaps another approach is in order. I'd suggest the Aho-Corasick algorithm. Here's the original paper Efficient String Matching: An Aid to Bibliographic Search (pdf).

Create a mapping from gene to gene index and weight.
Build the DFA with the all of the genes.
Loop over the DNA tests
Scan each DNA test using the DFA.
For each gene returned by the DFA, look up the index and weight.
If the index is in the allowed range for that DNA, add the weight to the running total for that DNA
Keep track of the min and max scores for each DNA test
Output the results

Don't have time now. I'll try to code it up later.

Question 10

This is very interesting. But maybe don't publish your code here right away. I would like to try and figure it out myself, if you don't mind

Question 11

Hey, @RootTwo, can you please check out codereview.stackexchange.com/questions/243500/… - it's my attempt to apply Aho-Corasick using ahocorapy (github.com/abusix/ahocorapy/blob/master/README.md), but it doesn't work well.

aghast aghast 12.6k25 silver badges46 bronze badges · Accepted Answer · 2020-06-02 11:31:59Z

There are three obvious suggestions here.

Measure!

First, and most important: develop a timing framework to measure your changes! You can't know if a change is beneficial if all you have is "the old version doesn't pass" and "the new version doesn't pass".

Build a standard set of test cases, and a timing framework, and subject any change to measurement in the timing framework. It's better if the timing is better, otherwise it's not.

Cache your results

The examples shown at the hackerrank site specifically include one where the same gene string is repeated twice. So it seems likely that caching your results might provide an obvious performance win.

Stop using the regex engine

This is a "maybe." You're using the regex engine to get the findall behavior, which is sensible, since it gives you access to C code that does what you want. But using that engine comes at a price -- regex operations are traditionally slower than string operations. So see if you can write your code to do the same job without using regex calls.

I'm honestly not sure if this will benefit you or not, since the regexes you are using are so simple. But if you pre-compute the minimum offset for each pattern, to allow for overlaps (like "a" -> +1, "aa" -> +1, "ab" -> +2) you should be able to scan using str.find or str.index and get what you want without any re calls.

Bonus: generator

Your original question asked about using generators. Because the underlying operation is so expensive, I'd suggest writing a single minmax function that yields both values at the same time (like divmod does). You can feed that function with a generator that yields up the scores:

queries = [input() for _ in range(s)]
low, high = minmax(gene_scores(int(l[0]), int(l[1]), l[2] for l in queries))

(This has nothing to do with performance. But you wanted to use them, so here's a way!)

Thank you, Austin Quick question: I'm trying to use lru_cache from functools to optimize code, and I'm getting TypeError: unhashable type: 'list'. In the 2nd version of the solution it is probably because of genes_regex = [re.compile(f"(?={gene})") for gene in genes] line. Or can it be the inputs? Any suggestions?
@DenisShvetsov since @lru_cache is a function decorator, it must be on a function you have written. So that's where the list type is being returned in error. Make sure you are caching single-item functions, since those are the ones likely to provide benefit from caching.

Stack Exchange Network

Finding min and max values of an iterable on the fly

2 Answers 2

Measure!

Cache your results

Stop using the regex engine

Bonus: generator

Aho-Corasick algorithm

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Finding min and max values of an iterable on the fly

2 Answers 2

Measure!

Cache your results

Stop using the regex engine

Bonus: generator

Aho-Corasick algorithm

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions