How to apply Python generators to finding min and max values?

Question 1

I'm solving yet another problem in HackerRank (https://www.hackerrank.com/challenges/determining-dna-health/problem). In short: you are given 2 arrays (genes and health), one of which have a 'gene' name, and the other - 'gene' weight (aka health). You then given a bunch of strings, each containing values m and n, which denote the start and end of the slice to be applied to the genes and health arrays, and the 'gene'-string, for which we need to determine healthiness. Then we need to return health-values for the most and the least healthy strings.

My solution is below, and it works, but it's not scalable, i.e. it fails testcases with a lot of values.

import re
if __name__ == '__main__':
 n = int(input())
 genes = input().rstrip().split()
 health = list(map(int, input().rstrip().split()))
 s = int(input())
 weights = []
 for s_itr in range(s):
 m,n,gn = input().split()
 weight = 0
 for i in range(int(m),int(n)+1):
 if genes[i] in gn:
 compilt = "r'(?=("+genes[i]+"))'"
 matches = len(re.findall(eval(compilt), gn))
 weight += health[i]*matches
 weights.append(weight)
 print(min(weights), max(weights))

Can you advise on how to apply generators here? I suspect that the solution fails because of the very big list that's being assembled. Is there a way to get min and max values here without collecting them all?

Example values:

genes = ['a', 'b', 'c', 'aa', 'd', 'b']
health = [1, 2, 3, 4, 5, 6]
gene1 = "1 5 caaab" (result = 19 = max)
gene2 = "0 4 xyz" (result = 0 = min)
gene3 = "2 4 bcdybc" (result = 11)

This case returns 0 19

UPD: Follow-up question on optimized (but still not working) solution here.

Question 2

I have rolled back your most recent edit. Code Review does not permit the question to be substantially changed after answers have been posted (accepted or not) as it becomes ambiguous as to which version of the question the answer post is answering. If you have an improved version of the code, and want feedback on it, you must post a new question. See "What should I do when someone answers my question" in the help center. Note the recommendation about adding a link from this question to the new question, as well as a back link from the new question to this one.

Question 3

eval(...)

What is the purpose of:

compilt = "r'(?=(" + genes[i] + "))'"
... eval(compilt), ...

It takes a string like "aa", and forms a new string "r'(?=(aa))'", which is source code for the raw string r'(?=(aa))', which when evaluated yields the string "(?=(aa))".

There is no escaping being done, no obvious reason to do the raw string formation and subsequent evaluation, and no prevention of a syntax error due to a stray ' character in the genes[i] array. So ... why not simply:

regex = "(?=(" + gene[i] + "))"

and no call to eval(...) at all?

Regex Capturing

The regex subexpression (...) is a "capturing group", which copies the matching characters into an internal buffer, for returning in the match group.

>>> re.findall('(?=(aa))', "caaab")
['aa', 'aa']

Without the capturing group, the matching characters do not have to be copied to the internal buffer, to be returned.

>>> re.findall('(?=aa)', "caaab")
['', '']

Given that you are only interested in the len(...) of the list returned from re.findall(), the capturing group seems like unnecessary overhead, which can be eliminated for faster execution.

Compiled Regex

As Python uses regular expressions, it maintains a cache of the most recently used regular expressions. This cache has a limited size, to prevent an excessive memory load.

In this exercise, you are repeatedly using the same gene regular expressions for each "healthy string" test. If the number of genes exceeds the cache size, Python will be tossing out compiled regular expressions, only to compile them again moments later.

You can preempt this by compiling and storing all the gene regular expressions ahead of time. Leveraging Python 3.6's f-strings, and list comprehension:

 genes = input().rstrip().split()
 genes_rx = [re.compile(f"(?={gene})") for gene in genes]

Used as:

 matches = len(re.findall(genes_rx[i], gn))

Now the gene to regular expression string, to compiled regular expression is done once per gene, instead of once per "healthy string" test.

Computing min/max weight without creating a list

How about:

 min_weight = math.inf
 max_weight = -math.inf
 for ...:
 weight = ...
 if weight < min_weight:
 min_weight = weight
 if weight > max_weight:
 max_weight = weight
 print(min_weight, max_weight)

Question 4

AJNeufeld, I would need to take some time digesting your answer: some of the concepts you mentioned are not yet familiar to me, but right now I would like to thank you: your effort is really appreciated!

Question 5

As for the 'what's the purpose' questions - I'm fairly new to the language, I learn stuff from various sources, and that is the best I could come up with. You are definitely right about it not being clean or optimal

Question 6

I tried the things you mentioned, but the code still fails due to timeout. See full solution in the question body.

Question 7

That too bad. It should be faster. Does it pass more test cases, or is it just one super large case that fails with a timeout?

Question 8

@DenisShvetsov Can you test and profile your code locally to see the difference in execution time?

AJNeufeld AJNeufeld 35.2k5 gold badges41 silver badges103 bronze badges · Accepted Answer · 2020-06-01 20:00:13Z

eval(...)

What is the purpose of:

compilt = "r'(?=(" + genes[i] + "))'"
... eval(compilt), ...

It takes a string like "aa", and forms a new string "r'(?=(aa))'", which is source code for the raw string r'(?=(aa))', which when evaluated yields the string "(?=(aa))".

There is no escaping being done, no obvious reason to do the raw string formation and subsequent evaluation, and no prevention of a syntax error due to a stray ' character in the genes[i] array. So ... why not simply:

regex = "(?=(" + gene[i] + "))"

and no call to eval(...) at all?

Regex Capturing

The regex subexpression (...) is a "capturing group", which copies the matching characters into an internal buffer, for returning in the match group.

>>> re.findall('(?=(aa))', "caaab")
['aa', 'aa']

Without the capturing group, the matching characters do not have to be copied to the internal buffer, to be returned.

>>> re.findall('(?=aa)', "caaab")
['', '']

Given that you are only interested in the len(...) of the list returned from re.findall(), the capturing group seems like unnecessary overhead, which can be eliminated for faster execution.

Compiled Regex

As Python uses regular expressions, it maintains a cache of the most recently used regular expressions. This cache has a limited size, to prevent an excessive memory load.

In this exercise, you are repeatedly using the same gene regular expressions for each "healthy string" test. If the number of genes exceeds the cache size, Python will be tossing out compiled regular expressions, only to compile them again moments later.

You can preempt this by compiling and storing all the gene regular expressions ahead of time. Leveraging Python 3.6's f-strings, and list comprehension:

 genes = input().rstrip().split()
 genes_rx = [re.compile(f"(?={gene})") for gene in genes]

Used as:

 matches = len(re.findall(genes_rx[i], gn))

Now the gene to regular expression string, to compiled regular expression is done once per gene, instead of once per "healthy string" test.

Computing min/max weight without creating a list

How about:

 min_weight = math.inf
 max_weight = -math.inf
 for ...:
 weight = ...
 if weight < min_weight:
 min_weight = weight
 if weight > max_weight:
 max_weight = weight
 print(min_weight, max_weight)

AJNeufeld, I would need to take some time digesting your answer: some of the concepts you mentioned are not yet familiar to me, but right now I would like to thank you: your effort is really appreciated!
As for the 'what's the purpose' questions - I'm fairly new to the language, I learn stuff from various sources, and that is the best I could come up with. You are definitely right about it not being clean or optimal
I tried the things you mentioned, but the code still fails due to timeout. See full solution in the question body.
That too bad. It should be faster. Does it pass more test cases, or is it just one super large case that fails with a timeout?
@DenisShvetsov Can you test and profile your code locally to see the difference in execution time?

Stack Exchange Network

How to apply Python generators to finding min and max values?

1 Answer 1

eval(...)

Regex Capturing

Compiled Regex

Computing min/max weight without creating a list

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

How to apply Python generators to finding min and max values?

1 Answer 1

eval(...)

Regex Capturing

Compiled Regex

Computing min/max weight without creating a list

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions