Extract, filter and match three letters from the given arguments and predict the name

Question 1

Case 1: rank1_naming

This function takes two arguments:

list_proteins_pattern_available
best_match_protein_name

Objective: Extract the three-letter pattern from both arguments. Match the pattern and keep only the matching items. Extract the numbers from list_proteins_pattern_available and sort them. Find the maximum of the collected numbers and add 1 to get the next number.

Please let me know if you have any questions.

Can you point out ways to improve this script?

import re

def case_rank1_naming(list_proteins_pattern_available, best_match_protein_name):
    #This will store the list of numbers
    available_list_numbers = []
    #extract the three letter pattern
    protein_pattern = re.search(r"[A-Z]{1}[a-z]{2}", best_match_protein_name)
    protein_pattern = protein_pattern.group()
    #extract the numbers
    for name in list_proteins_pattern_available:
        pattern = re.search(r"[A-Z]{1}[a-z]{2}\d{1,3}", name)
        number = re.search(r"\d{1,3}", pattern.group())
        available_list_numbers.append(number.group())
    #Convert all the string numbers to integers
    available_list_numbers = [int(x) for x in available_list_numbers]
    #Sort the available number. Just realized I use two times sort function.
    available_list_numbers.sort()
    # Sort the available number, get the maximum number and add one to get next number
    # Example: result will be 50
    primary_number_prediction = int(max(sorted(available_list_numbers))) + 1
    #Add the protein pattern, the next predicted number and 'Aa1' at the suffix
    predicted_name = protein_pattern + str(primary_number_prediction) + 'Aa1'
    return predicted_name

list_proteins_pattern_available = ['Xpp1Aa1', 'Xpp2Aa1', 'Xpp35Aa1', 'Xpp35Ab1', 'Xpp35Ac1', 'Xpp35Ba1', 'Xpp36Aa1', 'Xpp49Aa1', 'Xpp49Ab1']
best_match_protein_name = 'Xpp35Ba1'
predicted_name = case_rank1_naming(list_proteins_pattern_available, best_match_protein_name)
print(predicted_name)
#Xpp50Aa1

Question 2

I'll show an example implementation first, and then describe it:

from typing import Iterable
import re

def case_rank1_naming(proteins_available: Iterable[str], best_match_protein_name: str) -> str:
    # extract the three-letter pattern
    protein_pattern = re.search(r"[A-Z][a-z]{2}", best_match_protein_name).group()
    # extract the numbers
    best_number = max(
        int(re.search(r"[A-Z][a-z]{2}(\d{1,3})", name)[1])
        for name in proteins_available
    )
    # Add the protein pattern, the next predicted number and 'Aa1' at the suffix
    return f'{protein_pattern}{best_number + 1}Aa1'

def main():
    proteins_available = (
        'Xpp1Aa1', 'Xpp2Aa1', 'Xpp35Aa1', 'Xpp35Ab1', 'Xpp35Ac1',
        'Xpp35Ba1', 'Xpp36Aa1', 'Xpp49Aa1', 'Xpp49Ab1'
    )
    best_match_protein_name = 'Xpp35Ba1'
    predicted_name = case_rank1_naming(proteins_available, best_match_protein_name)
    assert predicted_name == 'Xpp50Aa1'

if __name__ == '__main__':
    main()

Add type hints to better-define your function signature
Don't write {1} in a regex - you can just drop it
Call max immediately on a generator rather than making and sorting a list
Shorten your variable names. Especially don't include the type of the variable in its name. Type hints and appropriate pluralization will cover you instead.
Use f-strings
Have a main function
In main, use a tuple for proteins_available instead of a list because it doesn't need to mutate
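
One thing neither version does is the "keep only the matched items" step from the stated objective: both take the maximum over every name in the list, whatever its prefix. Both also fail with an AttributeError or TypeError if a name does not match the regex at all, because re.search returns None. A minimal sketch of a variant that covers both cases follows. The name predict_next_name, the precompiled NAME_RE, the use of match (which assumes the pattern sits at the start of each name, and that the best match also carries a number), and the choice to raise ValueError are all assumptions made for illustration, not part of the original code.

```python
import re
from typing import Iterable

# Three-letter prefix (one uppercase, two lowercase) followed by a 1-3 digit number.
# Assumption: the pattern always sits at the start of the name, so match() is enough.
NAME_RE = re.compile(r"([A-Z][a-z]{2})(\d{1,3})")


def predict_next_name(proteins_available: Iterable[str], best_match: str) -> str:
    """Hypothetical variant that filters by prefix before taking the maximum."""
    best = NAME_RE.match(best_match)
    if best is None:
        raise ValueError(f"no three-letter pattern in {best_match!r}")
    prefix = best[1]

    # Keep only names that parse and share the prefix of the best match.
    matches = (NAME_RE.match(name) for name in proteins_available)
    numbers = [int(m[2]) for m in matches if m is not None and m[1] == prefix]
    if not numbers:
        raise ValueError(f"no available names start with {prefix!r}")

    return f"{prefix}{max(numbers) + 1}Aa1"


# 'Yqq80Aa1' has a different prefix and 'bad' does not parse, so both are skipped.
print(predict_next_name(["Xpp1Aa1", "Xpp49Ab1", "Yqq80Aa1", "bad"], "Xpp35Ba1"))
```

Whether skipping unparseable names is the right behaviour depends on where the list comes from; raising on the first bad name would be just as defensible.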

Question 3

you are on a roll :)

Question 4

Thank you. I will go through it and understand.

Question 5

Why is this step there: assert predicted_name == 'Xpp50Aa1'? In many cases we don't know the predicted name, right? Is this for testing?

Question 6

Yes, it's only for testing. You'll definitely want to leave that out in general program use, and move it to a unit test.
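
As a rough sketch of what moving it to a unit test could look like, assuming a runner such as pytest that collects functions named test_*: the test names and the shortened input tuples below are made up for illustration, and the function is repeated only so the snippet runs on its own.

```python
import re
from typing import Iterable


def case_rank1_naming(proteins_available: Iterable[str], best_match_protein_name: str) -> str:
    # Same logic as the answer's function, repeated so this sketch is self-contained.
    protein_pattern = re.search(r"[A-Z][a-z]{2}", best_match_protein_name).group()
    best_number = max(
        int(re.search(r"[A-Z][a-z]{2}(\d{1,3})", name)[1])
        for name in proteins_available
    )
    return f'{protein_pattern}{best_number + 1}Aa1'


def test_predicts_next_number():
    # Highest existing number is 49, so the next name should use 50.
    proteins_available = ('Xpp1Aa1', 'Xpp2Aa1', 'Xpp35Ba1', 'Xpp49Ab1')
    assert case_rank1_naming(proteins_available, 'Xpp35Ba1') == 'Xpp50Aa1'


def test_input_order_does_not_matter():
    # max() does not rely on the names being sorted.
    assert case_rank1_naming(('Xpp49Aa1', 'Xpp1Aa1'), 'Xpp1Aa1') == 'Xpp50Aa1'
```

In the tests the expected name is known because the inputs are chosen by hand; main itself would then just print or return the prediction.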

score 3 · Accepted Answer · 2019-09-17 15:26:12Z

