Case 1: rank1_naming
This function takes two arguments
- list_proteins_pattern_available
- best_match_protein_name
Objective: Extract the three-letter pattern from both arguments. Match the pattern and keep only the matched items. Extract the numbers from list_proteins_pattern_available and sort them. Find the maximum of the collected numbers and add 1 to get the next number.
Please let me know if you have any questions.
Can you point out ways to improve this script?
import re

def case_rank1_naming(list_proteins_pattern_available, best_match_protein_name):
    # This will store the list of numbers
    available_list_numbers = []
    # Extract the three-letter pattern
    protein_pattern = re.search(r"[A-Z]{1}[a-z]{2}", best_match_protein_name)
    protein_pattern = protein_pattern.group()
    # Extract the numbers
    for name in list_proteins_pattern_available:
        pattern = re.search(r"[A-Z]{1}[a-z]{2}\d{1,3}", name)
        number = re.search(r"\d{1,3}", pattern.group())
        available_list_numbers.append(number.group())
    # Convert all the string numbers to integers
    available_list_numbers = [int(x) for x in available_list_numbers]
    # Sort the available numbers. Just realized I use the sort function twice.
    available_list_numbers.sort()
    # Sort the available numbers, get the maximum number and add one to get the next number
    # Example: result will be 50
    primary_number_prediction = int(max(sorted(available_list_numbers))) + 1
    # Add the protein pattern, the next predicted number and 'Aa1' as the suffix
    predicted_name = protein_pattern + str(primary_number_prediction) + 'Aa1'
    return predicted_name

list_proteins_pattern_available = ['Xpp1Aa1', 'Xpp2Aa1', 'Xpp35Aa1', 'Xpp35Ab1', 'Xpp35Ac1', 'Xpp35Ba1', 'Xpp36Aa1', 'Xpp49Aa1', 'Xpp49Ab1']
best_match_protein_name = 'Xpp35Ba1'
predicted_name = case_rank1_naming(list_proteins_pattern_available, best_match_protein_name)
print(predicted_name)
# Xpp50Aa1
1 Answer
I'll show an example implementation first, and then describe it:
from typing import Iterable
import re

def case_rank1_naming(proteins_available: Iterable[str], best_match_protein_name: str) -> str:
    # extract the three-letter pattern
    protein_pattern = re.search(r"[A-Z][a-z]{2}", best_match_protein_name).group()
    # extract the numbers
    best_number = max(
        int(re.search(r"[A-Z][a-z]{2}(\d{1,3})", name)[1])
        for name in proteins_available
    )
    # Add the protein pattern, the next predicted number and 'Aa1' as the suffix
    return f'{protein_pattern}{best_number + 1}Aa1'

def main():
    proteins_available = (
        'Xpp1Aa1', 'Xpp2Aa1', 'Xpp35Aa1', 'Xpp35Ab1', 'Xpp35Ac1',
        'Xpp35Ba1', 'Xpp36Aa1', 'Xpp49Aa1', 'Xpp49Ab1'
    )
    best_match_protein_name = 'Xpp35Ba1'
    predicted_name = case_rank1_naming(proteins_available, best_match_protein_name)
    assert predicted_name == 'Xpp50Aa1'

if __name__ == '__main__':
    main()
- Add type hints to better define your function signature
- Don't write {1} in a regex - you can just drop it
- Call max immediately on a generator rather than making and sorting a list
- Shorten your variable names. Especially don't include the type of the variable in its name. Type hints and appropriate pluralization will cover you instead.
- Use f-strings
- Have a main function
- In main, use a tuple for proteins_available instead of a list because it doesn't need to mutate
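
One more thing worth considering: the stated objective says to keep only the items that match the three-letter pattern, but neither version above filters on it, and both will raise if a name has no match at all. Here is a rough sketch of a variant that does filter. It rests on a few assumptions of mine that are not in the question: non-matching names are skipped, the pattern must appear at the start of the name, and an empty result starts numbering at 1. The name next_rank1_name is made up for illustration.

```python
import re
from typing import Iterable


def next_rank1_name(proteins_available: Iterable[str], best_match_protein_name: str) -> str:
    # Three-letter prefix of the best match, e.g. 'Xpp'
    prefix_match = re.search(r"[A-Z][a-z]{2}", best_match_protein_name)
    if prefix_match is None:
        raise ValueError(f"no three-letter pattern in {best_match_protein_name!r}")
    prefix = prefix_match.group()

    # Only names that start with the same prefix contribute a number
    number_pattern = re.compile(re.escape(prefix) + r"(\d{1,3})")
    numbers = (
        int(m[1])
        for m in map(number_pattern.match, proteins_available)
        if m is not None
    )
    # Assumption: with no matching names, numbering starts at 1
    return f"{prefix}{max(numbers, default=0) + 1}Aa1"
```

If names with other prefixes can never appear in your input, the simpler version above is enough.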
- you are on a roll :) – dfhwze, Sep 17, 2019 at 15:32
- Thank you. I will go through it and understand. – catuf, Sep 17, 2019 at 15:46
- Why this step, assert predicted_name == 'Xpp50Aa1'? In many cases we don't know the predicted name, right? Is this for testing? – catuf, Sep 17, 2019 at 17:20
- Yes, it's only for testing. You'll definitely want to leave that out in general program use, and move it to a unit test. – Reinderien, Sep 17, 2019 at 17:22
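
For anyone wondering what moving the assert into a unit test might look like, here is a minimal sketch using the standard unittest module. The function is repeated so the snippet runs on its own; the class and method names are hypothetical, and in a real project the function would be imported from its own module instead.

```python
import re
import unittest
from typing import Iterable


def case_rank1_naming(proteins_available: Iterable[str], best_match_protein_name: str) -> str:
    # Same logic as in the answer: prefix of the best match + (max number + 1) + 'Aa1'
    protein_pattern = re.search(r"[A-Z][a-z]{2}", best_match_protein_name).group()
    best_number = max(
        int(re.search(r"[A-Z][a-z]{2}(\d{1,3})", name)[1])
        for name in proteins_available
    )
    return f'{protein_pattern}{best_number + 1}Aa1'


class CaseRank1NamingTest(unittest.TestCase):  # hypothetical test class
    def test_next_number_follows_maximum(self):
        proteins_available = (
            'Xpp1Aa1', 'Xpp2Aa1', 'Xpp35Aa1', 'Xpp35Ab1', 'Xpp35Ac1',
            'Xpp35Ba1', 'Xpp36Aa1', 'Xpp49Aa1', 'Xpp49Ab1'
        )
        self.assertEqual(case_rank1_naming(proteins_available, 'Xpp35Ba1'), 'Xpp50Aa1')

    def test_single_entry(self):
        self.assertEqual(case_rank1_naming(('Xpp1Aa1',), 'Xpp1Aa1'), 'Xpp2Aa1')
```

Saved in a test file, this can be run with python -m unittest, which keeps the expected values out of the program itself.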