I worked on a script that takes as input a multi-sequence .fasta
file and outputs the number of each aminoacid/sequence in a .tsv
format.
This is how a .fasta
file looks:
>sp|P21515|ACPH_ECOLI Acyl carrier protein phosphodiesterase OS=Escherichia coli (strain K12) OX=83333 GN=acpH PE=1 SV=2
MNFLAHLHLAHLAESSLSGNLLADFVRGNPEESFPPDVVAGIHMHRRIDVLTDNLPEVREAREWFRSETRRVAPITLDVMWDHFLSRHWSQLSPDFPLQEFVCYAREQVMTILPDSPPRFINLNNYLWSEQWLVRYRDMDFIQNVLNGMASRRPRLDALRDSWYDLDAHYDALETRFWQFYPRMMAQASRKAL
I know there are libraries which do the same thing, however, I had to write this script as a homework, without the help of BioPython
or other similar libraries.
My question is: how can I make this code more efficient?
import sys
import os
def readFasta( path ):
seq = { }
aCount = { }
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWYZX*-"
for fileName in path:
with open( fileName ) as file:
fastaLine = file.readlines( )
# sequence storage
for line in fastaLine:
if line[ 0 ] == ">" :
headerTitle = line.split( "|" )[ 1 ]
seq[ headerTitle ] = ""
else:
seq[ headerTitle ] += line.strip( )
# aminoacid count
for id in seq:
aCount[ id ] = { }
for sequence in seq[ id ]:
if sequence not in alphabet:
print( "Check your sequence! Character: " + str(sequence) + " from sequence: " + id )
for letter in alphabet:
aCount[ id ][ letter ] = 0
# count aminoacid occurence/sequence
for name in seq:
for char in seq[ name ]:
if char in aCount[ name ]:
aCount[ name ][ char ] += 1
# write to .csv file
outFile = open( "output.csv" , "w" )
outFile.write( "Name" + '\t' )
header = ""
for letter in alphabet:
header += letter + '\t'
outFile.write( header + "Total" )
outFile.write( '\n' )
for name in aCount:
entry = ""
counter = 0
for letter in aCount[ name ]:
entry += str(aCount[ name ][ letter ]).strip( ) + '\t'
counter += aCount[ name ][ letter ]
outFile.write( name + '\t' + entry + '\t' + str( counter ) )
outFile.write( '\n' )
# files to list
entries = os.listdir( path )
# filter for .fasta/.fa files
new_list = [ ]
for entry in entries:
if ".fasta" in entry:
new_list.append( entry )
elif ".fa" in entry:
new_list.append( entry )
# execute func
readFasta( new_list )
3 Answers 3
In addition to the points raised in the other answers:
Extraneous import
The first line is import sys
, but I don't see sys
used anywhere; this can go.
Explicit naming
It's generally better to use explicit names. Thus, rather than just seq
(which might be short for sequence
or sequences
, for example) spell it out. In this case, it looks like you should use sequences
. Similarly, aCounts
doesn't tell me what you're counting, other than the fact that there's an a
involved; just call it amino_acid_counts
.
Consistent naming
When you're talking about the same object in two different places in the code, you should prefer to use consistent names. For example, you use headerTitle
for the seq
key when reading the file in, but id
for the same keys in the next block, and name
in the block after that. Again, I would go with a more explicit name like sequence_id
.
Shadowing built-in objects
Speaking of id
, that's actually a built-in function. It's generally a bad idea to rename something that's built in to the language. This is called "shadowing", and it's possible to break code by doing it — and generally considered bad form.
Keeping the file open longer than it's needed
Once you've run fastaLine = file.readlines( )
, you're done with the file, and you can close it — meaning that you're done with the with open
block, and you can un-indent the blocks after it. Of course, as Reinderien
pointed out, it's probably better to just process each line as you iterate over them. (This is especially important with huge files, where you might want to just load a single line at a time into memory, rather than loading the entire file into memory.) So you'll want that with open
block to look more like this:
# sequence storage
with open(path, "r") as file:
for line in file:
if line[0] == ">" :
sequence_id = line.split("|")[1]
sequences[sequence_id] = ""
else:
sequences[sequence_id] += line.strip()
# amino acid count
Note that I've un-indented the amino acid count comment (and everything following it).
Probably excessive looping
I'm guessing that you just want to warn the user about all unrecognized characters in a given line, rather than every single occurrence of every such character. In that case, it's better to make a set
of all the characters in the sequence, and check if any are not in alphabet
:
sequence_characters = set(seq[id])
if not sequence_characters.issubset(alphabet):
print(
"Check your sequence! Extra characters: '" + str(sequence_characters - alphabet)
+ "' from sequence: " + str(sequence_id)
)
Definitely excessive looping
When doing for sequence in seq[ id ]
, you're looping over every character in the sequence. While I doubt that this is what you want for the previous lines, this definitely is not what you want when doing for letter in alphabet
; that loop should be un-indented, so that it's just done once for each id
.
Use the built-in str.count
method to do your counting
Python already knows how to count how many times a character appears in a string: that's the str.count
method. So rather than looping over every character in the string and adding 1 to the appropriate character, just loop over the alphabet. You can cut out that loop where you initialize the counts to 0, and just figure out the counts. Putting this together with the previous items (including renaming some variables), you'll have something like this:
# amino acid count
for sequence_id, sequence in sequences.items():
amino_acid_counts[sequence_id] = { }
sequence_characters = set(sequence)
if not sequence_characters.issubset(alphabet):
print(
"Check your sequence! Extra characters: '" + str(sequence_characters - alphabet)
+ "' from sequence: " + str(sequence_id)
)
for letter in alphabet:
amino_acid_counts[sequence_id][letter] = sequence.count(letter)
And that whole loop under # count aminoacid occurence/sequence
can be removed. This will be much faster.
Use with open
block for writing file, too
You correctly used a with open
block for reading the files; use it when you write the CSV file, too:
with open("output.csv", "w") as outFile:
(and indent whatever follows and uses outFile
).
No need to add explicit strings
When you write "Name" + '\t'
, you could just as well write "Name\t"
. Similarly, when you write
outFile.write( header + "Total" )
outFile.write( '\n' )
you could just as well write
outFile.write(header + "Total\n")
-
\$\begingroup\$
str.count()
has to loop over the whole string, and you're doing that 26 times. That costs 26x the memory traffic if it's a long sequence that doesn't fit in CPU cache. It's probably still faster than looping manually (assuming CPython where interpreter overhead is killer), but even better would be if Python has an efficient function to histogram, like into a dictionary of key => count. (Then you could check the set of keys against the alphabet instead of doing that separately.) \$\endgroup\$Peter Cordes– Peter Cordes2020年10月14日 05:27:48 +00:00Commented Oct 14, 2020 at 5:27 -
\$\begingroup\$ Also, probably it would make more sense to do the alphabet check by looking for a match against the regex
[^-*A-Z]
. If it matches, there was a character not in that set. Perhaps faster than forcing it to figure out exactly which characters were present/absent, IDK. (If it were just one contiguous range, it could be checked very efficiently if the regex engine had the right optimizations...) But hopefully that goes away with a histogram where we end up with the set of keys as part of the result. \$\endgroup\$Peter Cordes– Peter Cordes2020年10月14日 05:31:58 +00:00Commented Oct 14, 2020 at 5:31 -
2\$\begingroup\$ @PeterCordes
collections.Counter
is exactly what you hoped for in your first comment. \$\endgroup\$Graipher– Graipher2020年10月14日 08:33:01 +00:00Commented Oct 14, 2020 at 8:33 -
\$\begingroup\$ @Graipher: Thanks. Someone that knows something about Python should post that as an answer, then; if it has an efficient internal implementation it should be much faster than anything proposed in any existing answers. \$\endgroup\$Peter Cordes– Peter Cordes2020年10月14日 08:38:01 +00:00Commented Oct 14, 2020 at 8:38
-
1\$\begingroup\$ But if they're asked to "do without"
csv
, I'm thinking the teacher wants them to do things at a more basic level than just callingCounter
. It seems the students are at a level where they could use more learning about writing their own algorithms before learning about libraries. And I'd bet regexes are a bit advanced for them too — whereas using a basic built-in type likeset
is probably something the teacher would approve. \$\endgroup\$Mike– Mike2020年10月14日 13:14:50 +00:00Commented Oct 14, 2020 at 13:14
Welcome to Code Review!
I'll add to the other answer from Reinderien.
PEP-8
In python, it is common (and recommended) to follow the PEP-8 style guide for writing clean, maintainable and consistent code.
Functions and variables should be named in a lower_snake_case
, classes as UpperCamelCase
, and constants as UPPER_SNAKE_CASE
.
No unnecessary whitespaces are needed around the arguments when defining or calling a function. This applies for referring array/list elements as well.
Functions
Split your code into individual smaller functions, doing singular tasks. A few examples would be, reading/parsing the fasta files, validating sequence of gene, counting occurrences (try collections.Counter
package to do this).
CSV/TSV
Python comes with inbuilt package csv
which can also be used to write tsv files. You do not need to modify/write each line yourself.
Glob searching for files
Python 3+ has another inbuilt package pathlib
, which supports getting all files using a glob pattern. You would again, not need to manually (metaphorically) aggregate all the files with .fa
or .fasta
extensions.
if __name__
block
For scripts, it is a good practice to put your executable feature inside the if __name__ == "__main__"
clause.
-
\$\begingroup\$ Thank you for all your suggestions! Regarding the built-in csv package, I know of it, but our teacher wanted us to do it without. That's also the case with the whitespace. \$\endgroup\$sachikox– sachikox2020年10月13日 22:57:13 +00:00Commented Oct 13, 2020 at 22:57
-
3\$\begingroup\$ @sachikox not using csv is ok, but whitespaces around functions are frowned upon in python, officially. You can link your teacher to the PEP: python.org/dev/peps/pep-0008/… \$\endgroup\$hjpotter92– hjpotter922020年10月13日 23:35:53 +00:00Commented Oct 13, 2020 at 23:35
Path?
Surely path
is not a single path, since you loop through it. So at the least, this is poorly-named and should be paths
. The variable you iterate through it, fileName
, should be file_name
; similarly for your other function and variable names.
Line iteration
Rather than calling readlines()
, simply iterate over file
itself, which will have the same effect.
Alphabet
Why is it out of order - WYZX
?
Redundant if
if ".fasta" in entry:
new_list.append( entry )
elif ".fa" in entry:
new_list.append( entry )
can collapse the if
to
if '.fa' in entry:
new_list.append(entry)
If you look at the substring pattern, the second includes the first. Also, don't add spaces on the inside of your parens.
Explore related questions
See similar questions with these tags.
numpy.histogram
, I imaginepandas
/counter
/seaborn
(either by themselves or in any combination) would work too. \$\endgroup\$