Python script which translates a DNA sequence

Question 1

For my Biology class, I made a Python script which takes a DNA sequence as input, translates it into an mRNA sequence, and then again into a tRNA sequence. It then matches each mRNA codon with an amino acid, and gives the user all of the data it produces.

Since this program may have to work with large amounts of DNA code, I just want some advice to see what could be done to make the program run faster and more efficiently.

Here is the 'symbols.p' file and here is the 'mRNA_to_protein.p' file.

import pickle
'''
Program takes a DNA genetic code sequence in as input.
It then translates the DNA code into mRNA code, and again into tRNA code.
After that, it matches each mRNA codon with an amino acid, as found in the hash table inside the pickle file.
It then matches each amino acid with its symbol, and prints all the data onto the screen.
'''
def main():
 # Asks the user if they would like to open a file which contains their genetic sequence.
 # Removes all whitespace from input in order to process it correctly.
 open_choice = remove_spaces(input("Do you want to load a file to translate [Y/N]").upper())
 # Processes whether the user wants to use a file
 while open_choice != 'Y' and open_choice != 'N':
 open_choice = remove_spaces(input("Do you want to load a file to translate [Y/N]").upper())
 if open_choice == 'Y':
 sequence = get_file().upper()
 else:
 sequence = input("Enter the DNA sequence to convert it: ").upper() # Gets the DNA sequence to convert from input, if the user
 # declines to open a file.
 sequence = remove_spaces(sequence) # Removes spaces from the user's sequence
 while not check_sequence(sequence, 'dna'): # Sends to check sequence function to confirm that it is a valid sequence
 sequence = input("Please enter a correct sequence: ").upper() # If sequence is invalid, repeat until it is valid
 sequence = remove_spaces(sequence)
 original_sequence = ' '.join([sequence[i:i + 3] for i in range(0, len(sequence), 3)]) # Saves original DNA sequence
 mRNA = convert_sequence(sequence, 'dna') # Saves mRNA sequence
 tRNA = convert_sequence(remove_spaces(mRNA), 'rna') # Saves tRNA sequence
 proteins = convert_to_proteins((mRNA + " ")) # Prints amino acid sequence
 symbols = convert_symbols(proteins) # Prints amino acid symbols
 print('DNA: ' + original_sequence) # Prints original sequence
 print('mRNA: ' + mRNA) # Prints mRNA sequence
 print('tRNA: ' + tRNA) # Prints tRNA sequence
 print(" ".join(proteins))
 print(" ".join(symbols))
 dump_data(original_sequence, mRNA, tRNA, " ".join(proteins), " ".join(symbols))
 input()
# Checks sequence for validility
def check_sequence(sequence, type): # Takes the sequence input and the type of sequence
 if type == 'rna': # If it is an RNA sequence, confirm it only contains characters in AUCG
 a = 'AUCG'
 else:
 a = 'ATCG' # If it is an DNA sequence, confirm it only contains characters in ATCG
 sequence_list = list(sequence) # Converts sequence into a list
 # Checks each character in list to see if it is in respective character list determined above
 for i in sequence_list:
 if i not in a: # If a character is invalid, return False
 return False
 return True # If all characters are valid, return True
# Converts sequence to rNA
def convert_sequence(sequence, sequence_type): # Takes sequence and type of secuence
 if sequence_type == 'dna': # if the sequence is DNA: convert t to u
 conversion_dict = {
 'A': 'U',
 'T': 'A',
 'C': 'G',
 'G': 'C'
 }
 else: # if the sequence is RBA: convert u to a
 conversion_dict = {
 'A': 'U',
 'U': 'A',
 'C': 'G',
 'G': 'C'
 }
 # convert sequence into a list
 converted_sequence = []
 sequence_list = list(sequence)
 # convert list one by one, checking the dictionary for the corresponding key, and add it to the new clist
 for i in sequence_list:
 converted_sequence.append(conversion_dict[i])
 # return converted sequence, seperated by a space every three spaces
 converted_sequence = ''.join(converted_sequence)
 # noinspection PyTypeChecker
 return ' '.join([converted_sequence[i:i + 3] for i in range(0, len(converted_sequence), 3)])
def convert_to_proteins(sequence):
 n = []
 protein_sequence = []
 mrna_to_protein = pickle.load(open('mRNA_to_protein.p', 'rb'))
 for i in sequence:
 if not i.isspace():
 n.append(i)
 else:
 if len(n) < 3:
 break
 protein_sequence.append(mrna_to_protein[''.join(n)])
 n = []
 return protein_sequence
def convert_symbols(proteins):
 symbol_list = []
 symbols = pickle.load(open('symbols.p', 'rb'))
 for i in proteins:
 symbol_list.append(symbols[i])
 return symbol_list
# removes all spaces in a sequence
def remove_spaces(x):
 return (''.join(x.split())).strip()
def get_file():
 file_name = input("Enter file name: ")
 while True:
 try:
 f = open(file_name, 'r')
 sequence = f.read()
 while not check_sequence(remove_spaces(sequence).upper(), 'dna'):
 file_name = input("\nPlease provide a file with a correct DNA sequence: ")
 break
 except FileNotFoundError:
 file_name = input("\nThe file '{}' was not found. \nPlease enter an accurate file name/path: ".format(file_name))
 return sequence
def dump_data(dna, mrna, trna, aa, s):
 file = open('results.txt', 'w')
 file.write('DNA: ' + dna + "\n")
 file.write('mRNA: ' + mrna + "\n")
 file.write('tRNA: ' + trna + "\n")
 file.write(aa + "\n")
 file.write(s + "\n")
 return True
main()

Question 2

There's a couple of high level problems with your code:

You should remove your comments. They're as useful as the code they're commenting as they're 'micro' comments. print(mRNA) is without a doubt printing mRNA. If there was a comment saying that you use lists in both forms, ['abc', 'def'] and 'abc def', then that would have been a good comment.

They should be used to describe oddities in your code. Or if we are using optimized code to say what the code is doing, from a high-level perspective.
You should order your code so you can read down the file, and know most things about the code at that point. To fully understand main you need to know every other function, and so that should go to the bottom of the file.
You should use if __name__ == '__main__'.
You should pick one quotation mark style. And stick with it. ' or ".

The above are simple and easy to fix. But you also have harder to fix problems, which require more effort to fix.

You duplicate the code to 'chunk' strings into lists. This is when you change 'abcdef' to either 'abc def' or ['abc', 'def']. And so you should make this a function.
You should change check_sequence to take a rather than type as an argument. This allows greater control on the function. And allows you to perform an abstraction on your code.

You can then move these global inputs to global constants, and use them when needed. Global constants mean that you're less likely to make a typo and unknowingly break your program.
You could change check_sequence to use a comprehension and all to reduce the amount of lines. It can also improve clarity as then you don't need to explicitly return True or False.
You should use arrays rather than strings. As said above you use all three formats, 'abcdef', 'abc def', and ['abc', 'def']. Instead you should only use the last format. This allows you to get easier to read code, and removes the need for n in convert_to_proteins.
As stated above you should change convert_sequence to take a list. This can allow you to use a list comprehension to build the list with minimal characters. You can also change your code to use str.translate rather than implementing the translation yourself.
You can change both convert_to_proteins and convert_symbols to be the same function. Moving pickle.load out of the function, and by passing lists, simplify the code to just a list comprehension, where you're indexing a dictionary.
You can move your print to dump_data as you're printing and writing the same data. This allows you to use format to build the string once, and for you to print it and write it.
Always use with when you use open. It closes the stream when you're done with it, and without can lead to bugs and errors.
Your code to get user input is confusing, and duplicates logic. Your option for reading from a file or from user input is good. You code initially reads that a user could pick to read from a file, but then if the file is malformed it reads from user input. On further reading you find out it's not possible, which is a problem for readers.

And so if you implement all the above you can get:

import pickle
TRANSLATION_DNA = {'A': 'U', 'T': 'A', 'C': 'G', 'G': 'C'}
TRANSLATION_OTHER = {'A': 'U', 'U': 'A', 'C': 'G', 'G': 'C'}
VALID_INPUT_RNA = 'AUCG'
VALID_INPUT_OTHER = 'ATCG'
def grouper(sequence, n):
 return [sequence[i:i + n] for i in range(0, len(sequence), n)]
def check_sequence(sequence, valid_input):
 return all(i in valid_input for i in sequence)
def translate_sequence(sequence, conversion_dict):
 table = {ord(k): v for k, v in conversion_dict.items()}
 return [g.translate(table) for g in sequence]
def convert_sequence(sequence, conversion_dict):
 return [conversion_dict[i] for i in sequence]
def dump_data(*args):
 output = 'DNA: {}\nmRNA: {}\ntRNA: {}\n{}\n{}'.format(*map(' '.join, args))
 print(output)
 with open('results.txt', 'w') as f:
 f.write(output + '\n')
def safe_read(file_name):
 with open(file_name, 'rb') as f:
 return pickle.load(f)
def convert(sequence):
 mrna_to_protein = safe_read('mRNA_to_protein.p')
 protein_symbols = safe_read('symbols.p')
 original_sequence = grouper(sequence, 3)
 mRNA = translate_sequence(original_sequence, TRANSLATION_DNA)
 tRNA = translate_sequence(mRNA, TRANSLATION_OTHER)
 proteins = convert_sequence(mRNA, mrna_to_protein)
 symbols = convert_sequence(proteins, protein_symbols)
 return original_sequence, mRNA, tRNA, proteins, symbols
def read_sequence():
 while True:
 file_name = input('Enter file name: ')
 try:
 with open(file_name, 'r') as f:
 return f.read()
 except FileNotFoundError:
 print('File {!r} not found.'.format(file_name))
def remove_spaces(x):
 return (''.join(x.split())).strip()
def get_sequence():
 while True:
 open_choice = input('Do you want to load a file to translate [Y/N]').strip().upper()
 if open_choice in ('Y', 'N'):
 break
 while True:
 sequence = (
 read_sequence()
 if open_choice == 'Y' else
 input('Enter the DNA sequence to convert it: ')
 )
 sequence = remove_spaces(sequence.upper())
 if check_sequence(sequence, VALID_INPUT_OTHER):
 return sequence
 print('Invalid sequence.')
def main():
 sequence = get_sequence()
 original_sequence, mRNA, tRNA, proteins, symbols = convert(sequence)
 dump_data(original_sequence, mRNA, tRNA, proteins, symbols)
 input()
if __name__ == '__main__':
 main()

Question 3

Do you think that doing dump_data(convert(get_sequence())) in main would be too compact?

Question 4

@ChatterOne Yes you definitely could use dump_data(*convert(get_sequence())). If I edit my program I'd add that in, I just didn't think of it. :)

Question 5

@Peilonrayz, Thanks for the excellent review. I am having trouble understanding how the translate_sequence function works. Could you explain it?

Question 6

@vkumar Sure, :) The first line changes the dictionary from say {"A": "B"} to {65: "B"} this is as str.translate looks up on the characters Unicode ordinal. If this limitation wasn't imposed by str.translate then that line wouldn't be there. The second line performs the translation on all the items in the list, as we're inputting it in the form ['abc', 'def'] rather than 'abcdef'. And is roughly equivalent to [''.join(conversion_dict[i] for i in g) for g in sequence].

Peilonrayz ♦Peilonrayz 44.4k7 gold badges80 silver badges157 bronze badges · Accepted Answer · 2016-11-15 12:03:16Z

There's a couple of high level problems with your code:

You should remove your comments. They're as useful as the code they're commenting as they're 'micro' comments. print(mRNA) is without a doubt printing mRNA. If there was a comment saying that you use lists in both forms, ['abc', 'def'] and 'abc def', then that would have been a good comment.

They should be used to describe oddities in your code. Or if we are using optimized code to say what the code is doing, from a high-level perspective.
You should order your code so you can read down the file, and know most things about the code at that point. To fully understand main you need to know every other function, and so that should go to the bottom of the file.
You should use if __name__ == '__main__'.
You should pick one quotation mark style. And stick with it. ' or ".

The above are simple and easy to fix. But you also have harder to fix problems, which require more effort to fix.

You duplicate the code to 'chunk' strings into lists. This is when you change 'abcdef' to either 'abc def' or ['abc', 'def']. And so you should make this a function.
You should change check_sequence to take a rather than type as an argument. This allows greater control on the function. And allows you to perform an abstraction on your code.

You can then move these global inputs to global constants, and use them when needed. Global constants mean that you're less likely to make a typo and unknowingly break your program.
You could change check_sequence to use a comprehension and all to reduce the amount of lines. It can also improve clarity as then you don't need to explicitly return True or False.
You should use arrays rather than strings. As said above you use all three formats, 'abcdef', 'abc def', and ['abc', 'def']. Instead you should only use the last format. This allows you to get easier to read code, and removes the need for n in convert_to_proteins.
As stated above you should change convert_sequence to take a list. This can allow you to use a list comprehension to build the list with minimal characters. You can also change your code to use str.translate rather than implementing the translation yourself.
You can change both convert_to_proteins and convert_symbols to be the same function. Moving pickle.load out of the function, and by passing lists, simplify the code to just a list comprehension, where you're indexing a dictionary.
You can move your print to dump_data as you're printing and writing the same data. This allows you to use format to build the string once, and for you to print it and write it.
Always use with when you use open. It closes the stream when you're done with it, and without can lead to bugs and errors.
Your code to get user input is confusing, and duplicates logic. Your option for reading from a file or from user input is good. You code initially reads that a user could pick to read from a file, but then if the file is malformed it reads from user input. On further reading you find out it's not possible, which is a problem for readers.

And so if you implement all the above you can get:

import pickle
TRANSLATION_DNA = {'A': 'U', 'T': 'A', 'C': 'G', 'G': 'C'}
TRANSLATION_OTHER = {'A': 'U', 'U': 'A', 'C': 'G', 'G': 'C'}
VALID_INPUT_RNA = 'AUCG'
VALID_INPUT_OTHER = 'ATCG'
def grouper(sequence, n):
 return [sequence[i:i + n] for i in range(0, len(sequence), n)]
def check_sequence(sequence, valid_input):
 return all(i in valid_input for i in sequence)
def translate_sequence(sequence, conversion_dict):
 table = {ord(k): v for k, v in conversion_dict.items()}
 return [g.translate(table) for g in sequence]
def convert_sequence(sequence, conversion_dict):
 return [conversion_dict[i] for i in sequence]
def dump_data(*args):
 output = 'DNA: {}\nmRNA: {}\ntRNA: {}\n{}\n{}'.format(*map(' '.join, args))
 print(output)
 with open('results.txt', 'w') as f:
 f.write(output + '\n')
def safe_read(file_name):
 with open(file_name, 'rb') as f:
 return pickle.load(f)
def convert(sequence):
 mrna_to_protein = safe_read('mRNA_to_protein.p')
 protein_symbols = safe_read('symbols.p')
 original_sequence = grouper(sequence, 3)
 mRNA = translate_sequence(original_sequence, TRANSLATION_DNA)
 tRNA = translate_sequence(mRNA, TRANSLATION_OTHER)
 proteins = convert_sequence(mRNA, mrna_to_protein)
 symbols = convert_sequence(proteins, protein_symbols)
 return original_sequence, mRNA, tRNA, proteins, symbols
def read_sequence():
 while True:
 file_name = input('Enter file name: ')
 try:
 with open(file_name, 'r') as f:
 return f.read()
 except FileNotFoundError:
 print('File {!r} not found.'.format(file_name))
def remove_spaces(x):
 return (''.join(x.split())).strip()
def get_sequence():
 while True:
 open_choice = input('Do you want to load a file to translate [Y/N]').strip().upper()
 if open_choice in ('Y', 'N'):
 break
 while True:
 sequence = (
 read_sequence()
 if open_choice == 'Y' else
 input('Enter the DNA sequence to convert it: ')
 )
 sequence = remove_spaces(sequence.upper())
 if check_sequence(sequence, VALID_INPUT_OTHER):
 return sequence
 print('Invalid sequence.')
def main():
 sequence = get_sequence()
 original_sequence, mRNA, tRNA, proteins, symbols = convert(sequence)
 dump_data(original_sequence, mRNA, tRNA, proteins, symbols)
 input()
if __name__ == '__main__':
 main()

Do you think that doing dump_data(convert(get_sequence())) in main would be too compact?
@ChatterOne Yes you definitely could use dump_data(*convert(get_sequence())). If I edit my program I'd add that in, I just didn't think of it. :)
@Peilonrayz, Thanks for the excellent review. I am having trouble understanding how the translate_sequence function works. Could you explain it?
@vkumar Sure, :) The first line changes the dictionary from say {"A": "B"} to {65: "B"} this is as str.translate looks up on the characters Unicode ordinal. If this limitation wasn't imposed by str.translate then that line wouldn't be there. The second line performs the translation on all the items in the list, as we're inputting it in the form ['abc', 'def'] rather than 'abcdef'. And is roughly equivalent to [''.join(conversion_dict[i] for i in g) for g in sequence].

Stack Exchange Network

Python script which translates a DNA sequence

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Python script which translates a DNA sequence

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions