FASTA-to-tsv conversion script

Question 1

I worked on a script that takes as input a multi-sequence .fasta file and outputs the number of each aminoacid/sequence in a .tsv format.

This is how a .fasta file looks:

>sp|P21515|ACPH_ECOLI Acyl carrier protein phosphodiesterase OS=Escherichia coli (strain K12) OX=83333 GN=acpH PE=1 SV=2 
MNFLAHLHLAHLAESSLSGNLLADFVRGNPEESFPPDVVAGIHMHRRIDVLTDNLPEVREAREWFRSETRRVAPITLDVMWDHFLSRHWSQLSPDFPLQEFVCYAREQVMTILPDSPPRFINLNNYLWSEQWLVRYRDMDFIQNVLNGMASRRPRLDALRDSWYDLDAHYDALETRFWQFYPRMMAQASRKAL

I know there are libraries which do the same thing, however, I had to write this script as a homework, without the help of BioPython or other similar libraries.

My question is: how can I make this code more efficient?

import sys
import os
def readFasta( path ):
 
 seq = { }
 aCount = { }
 alphabet = "ABCDEFGHIJKLMNOPQRSTUVWYZX*-" 
 for fileName in path:
 
 with open( fileName ) as file:
 fastaLine = file.readlines( )
 
 # sequence storage
 
 for line in fastaLine:
 if line[ 0 ] == ">" :
 headerTitle = line.split( "|" )[ 1 ]
 seq[ headerTitle ] = ""
 else:
 seq[ headerTitle ] += line.strip( )
 # aminoacid count
 for id in seq:
 aCount[ id ] = { }
 for sequence in seq[ id ]:
 if sequence not in alphabet:
 print( "Check your sequence! Character: " + str(sequence) + " from sequence: " + id )
 for letter in alphabet:
 aCount[ id ][ letter ] = 0
 
 # count aminoacid occurence/sequence
 
 for name in seq:
 for char in seq[ name ]:
 if char in aCount[ name ]:
 aCount[ name ][ char ] += 1
 
 # write to .csv file
 
 outFile = open( "output.csv" , "w" )
 
 
 outFile.write( "Name" + '\t' )
 
 header = ""
 for letter in alphabet:
 header += letter + '\t'
 outFile.write( header + "Total" )
 outFile.write( '\n' ) 
 
 for name in aCount:
 entry = ""
 counter = 0
 for letter in aCount[ name ]:
 entry += str(aCount[ name ][ letter ]).strip( ) + '\t'
 counter += aCount[ name ][ letter ]
 outFile.write( name + '\t' + entry + '\t' + str( counter ) )
 outFile.write( '\n' )
# files to list
entries = os.listdir( path )
# filter for .fasta/.fa files
new_list = [ ]
for entry in entries:
 if ".fasta" in entry:
 new_list.append( entry )
 elif ".fa" in entry:
 new_list.append( entry )
 
# execute func
readFasta( new_list )

Question 2

Your indentation is wrong. Since indentation is vital in Python, please fix it before we can review it.

Question 3

@Mast fixed it, so sorry, I haven't noticed.

Question 4

Much better, thank you. The current question title, which states your concerns about the code, applies to too many questions on this site to be useful. The site standard is for the title to simply state the task accomplished by the code. Please see How to Ask for examples, and revise the title accordingly.

Question 5

I don't know if Python has a more efficient way to histogram the characters in a string; I hope so because looping manually is very slow in the CPython interpreter. (Other Python implementations might be less bad.)

Question 6

@PeterCordes Well, there's numpy.histogram, I imagine pandas/counter/seaborn (either by themselves or in any combination) would work too.

Question 7

In addition to the points raised in the other answers:

Extraneous import

The first line is import sys, but I don't see sys used anywhere; this can go.

Explicit naming

It's generally better to use explicit names. Thus, rather than just seq (which might be short for sequence or sequences, for example) spell it out. In this case, it looks like you should use sequences. Similarly, aCounts doesn't tell me what you're counting, other than the fact that there's an a involved; just call it amino_acid_counts.

Consistent naming

When you're talking about the same object in two different places in the code, you should prefer to use consistent names. For example, you use headerTitle for the seq key when reading the file in, but id for the same keys in the next block, and name in the block after that. Again, I would go with a more explicit name like sequence_id.

Shadowing built-in objects

Speaking of id, that's actually a built-in function. It's generally a bad idea to rename something that's built in to the language. This is called "shadowing", and it's possible to break code by doing it — and generally considered bad form.

Keeping the file open longer than it's needed

Once you've run fastaLine = file.readlines( ), you're done with the file, and you can close it — meaning that you're done with the with open block, and you can un-indent the blocks after it. Of course, as Reinderien pointed out, it's probably better to just process each line as you iterate over them. (This is especially important with huge files, where you might want to just load a single line at a time into memory, rather than loading the entire file into memory.) So you'll want that with open block to look more like this:

 # sequence storage
 with open(path, "r") as file:
 for line in file:
 if line[0] == ">" :
 sequence_id = line.split("|")[1]
 sequences[sequence_id] = ""
 else:
 sequences[sequence_id] += line.strip()
 # amino acid count

Note that I've un-indented the amino acid count comment (and everything following it).

Probably excessive looping

I'm guessing that you just want to warn the user about all unrecognized characters in a given line, rather than every single occurrence of every such character. In that case, it's better to make a set of all the characters in the sequence, and check if any are not in alphabet:

sequence_characters = set(seq[id])
if not sequence_characters.issubset(alphabet):
 print(
 "Check your sequence! Extra characters: '" + str(sequence_characters - alphabet)
 + "' from sequence: " + str(sequence_id)
 )

Definitely excessive looping

When doing for sequence in seq[ id ], you're looping over every character in the sequence. While I doubt that this is what you want for the previous lines, this definitely is not what you want when doing for letter in alphabet; that loop should be un-indented, so that it's just done once for each id.

Use the built-in `str.count` method to do your counting

Python already knows how to count how many times a character appears in a string: that's the str.count method. So rather than looping over every character in the string and adding 1 to the appropriate character, just loop over the alphabet. You can cut out that loop where you initialize the counts to 0, and just figure out the counts. Putting this together with the previous items (including renaming some variables), you'll have something like this:

 # amino acid count
 for sequence_id, sequence in sequences.items():
 amino_acid_counts[sequence_id] = { }
 sequence_characters = set(sequence)
 if not sequence_characters.issubset(alphabet):
 print(
 "Check your sequence! Extra characters: '" + str(sequence_characters - alphabet)
 + "' from sequence: " + str(sequence_id)
 )
 for letter in alphabet:
 amino_acid_counts[sequence_id][letter] = sequence.count(letter)

And that whole loop under # count aminoacid occurence/sequence can be removed. This will be much faster.

Use `with open` block for writing file, too

You correctly used a with open block for reading the files; use it when you write the CSV file, too:

 with open("output.csv", "w") as outFile:

(and indent whatever follows and uses outFile).

No need to add explicit strings

When you write "Name" + '\t', you could just as well write "Name\t". Similarly, when you write

 outFile.write( header + "Total" )
 outFile.write( '\n' )

you could just as well write

 outFile.write(header + "Total\n")

Question 8

str.count() has to loop over the whole string, and you're doing that 26 times. That costs 26x the memory traffic if it's a long sequence that doesn't fit in CPU cache. It's probably still faster than looping manually (assuming CPython where interpreter overhead is killer), but even better would be if Python has an efficient function to histogram, like into a dictionary of key => count. (Then you could check the set of keys against the alphabet instead of doing that separately.)

Question 9

Also, probably it would make more sense to do the alphabet check by looking for a match against the regex [^-*A-Z]. If it matches, there was a character not in that set. Perhaps faster than forcing it to figure out exactly which characters were present/absent, IDK. (If it were just one contiguous range, it could be checked very efficiently if the regex engine had the right optimizations...) But hopefully that goes away with a histogram where we end up with the set of keys as part of the result.

Question 10

@PeterCordes collections.Counter is exactly what you hoped for in your first comment.

Question 11

@Graipher: Thanks. Someone that knows something about Python should post that as an answer, then; if it has an efficient internal implementation it should be much faster than anything proposed in any existing answers.

Question 12

But if they're asked to "do without" csv, I'm thinking the teacher wants them to do things at a more basic level than just calling Counter. It seems the students are at a level where they could use more learning about writing their own algorithms before learning about libraries. And I'd bet regexes are a bit advanced for them too — whereas using a basic built-in type like set is probably something the teacher would approve.

Question 13

Welcome to Code Review!

I'll add to the other answer from Reinderien.

PEP-8

In python, it is common (and recommended) to follow the PEP-8 style guide for writing clean, maintainable and consistent code.

Functions and variables should be named in a lower_snake_case, classes as UpperCamelCase, and constants as UPPER_SNAKE_CASE.

No unnecessary whitespaces are needed around the arguments when defining or calling a function. This applies for referring array/list elements as well.

Functions

Split your code into individual smaller functions, doing singular tasks. A few examples would be, reading/parsing the fasta files, validating sequence of gene, counting occurrences (try collections.Counter package to do this).

CSV/TSV

Python comes with inbuilt package csv which can also be used to write tsv files. You do not need to modify/write each line yourself.

Glob searching for files

Python 3+ has another inbuilt package pathlib, which supports getting all files using a glob pattern. You would again, not need to manually (metaphorically) aggregate all the files with .fa or .fasta extensions.

`if name` block

For scripts, it is a good practice to put your executable feature inside the if __name__ == "__main__" clause.

Question 14

Thank you for all your suggestions! Regarding the built-in csv package, I know of it, but our teacher wanted us to do it without. That's also the case with the whitespace.

Question 15

@sachikox not using csv is ok, but whitespaces around functions are frowned upon in python, officially. You can link your teacher to the PEP: python.org/dev/peps/pep-0008/…

Question 16

Path?

Surely path is not a single path, since you loop through it. So at the least, this is poorly-named and should be paths. The variable you iterate through it, fileName, should be file_name; similarly for your other function and variable names.

Line iteration

Rather than calling readlines(), simply iterate over file itself, which will have the same effect.

Alphabet

Why is it out of order - WYZX?

Redundant `if`

if ".fasta" in entry:
 new_list.append( entry )
elif ".fa" in entry:
 new_list.append( entry )

can collapse the if to

if '.fa' in entry:
 new_list.append(entry)

If you look at the substring pattern, the second includes the first. Also, don't add spaces on the inside of your parens.

Mike Mike 1662 bronze badges · Accepted Answer · 2020-10-14 04:37:49Z

In addition to the points raised in the other answers:

Extraneous import

The first line is import sys, but I don't see sys used anywhere; this can go.

Explicit naming

It's generally better to use explicit names. Thus, rather than just seq (which might be short for sequence or sequences, for example) spell it out. In this case, it looks like you should use sequences. Similarly, aCounts doesn't tell me what you're counting, other than the fact that there's an a involved; just call it amino_acid_counts.

Consistent naming

When you're talking about the same object in two different places in the code, you should prefer to use consistent names. For example, you use headerTitle for the seq key when reading the file in, but id for the same keys in the next block, and name in the block after that. Again, I would go with a more explicit name like sequence_id.

Shadowing built-in objects

Speaking of id, that's actually a built-in function. It's generally a bad idea to rename something that's built in to the language. This is called "shadowing", and it's possible to break code by doing it — and generally considered bad form.

Keeping the file open longer than it's needed

Once you've run fastaLine = file.readlines( ), you're done with the file, and you can close it — meaning that you're done with the with open block, and you can un-indent the blocks after it. Of course, as Reinderien pointed out, it's probably better to just process each line as you iterate over them. (This is especially important with huge files, where you might want to just load a single line at a time into memory, rather than loading the entire file into memory.) So you'll want that with open block to look more like this:

 # sequence storage
 with open(path, "r") as file:
 for line in file:
 if line[0] == ">" :
 sequence_id = line.split("|")[1]
 sequences[sequence_id] = ""
 else:
 sequences[sequence_id] += line.strip()
 # amino acid count

Note that I've un-indented the amino acid count comment (and everything following it).

Probably excessive looping

I'm guessing that you just want to warn the user about all unrecognized characters in a given line, rather than every single occurrence of every such character. In that case, it's better to make a set of all the characters in the sequence, and check if any are not in alphabet:

sequence_characters = set(seq[id])
if not sequence_characters.issubset(alphabet):
 print(
 "Check your sequence! Extra characters: '" + str(sequence_characters - alphabet)
 + "' from sequence: " + str(sequence_id)
 )

Definitely excessive looping

When doing for sequence in seq[ id ], you're looping over every character in the sequence. While I doubt that this is what you want for the previous lines, this definitely is not what you want when doing for letter in alphabet; that loop should be un-indented, so that it's just done once for each id.

Use the built-in `str.count` method to do your counting

Python already knows how to count how many times a character appears in a string: that's the str.count method. So rather than looping over every character in the string and adding 1 to the appropriate character, just loop over the alphabet. You can cut out that loop where you initialize the counts to 0, and just figure out the counts. Putting this together with the previous items (including renaming some variables), you'll have something like this:

 # amino acid count
 for sequence_id, sequence in sequences.items():
 amino_acid_counts[sequence_id] = { }
 sequence_characters = set(sequence)
 if not sequence_characters.issubset(alphabet):
 print(
 "Check your sequence! Extra characters: '" + str(sequence_characters - alphabet)
 + "' from sequence: " + str(sequence_id)
 )
 for letter in alphabet:
 amino_acid_counts[sequence_id][letter] = sequence.count(letter)

And that whole loop under # count aminoacid occurence/sequence can be removed. This will be much faster.

Use `with open` block for writing file, too

You correctly used a with open block for reading the files; use it when you write the CSV file, too:

 with open("output.csv", "w") as outFile:

(and indent whatever follows and uses outFile).

No need to add explicit strings

When you write "Name" + '\t', you could just as well write "Name\t". Similarly, when you write

 outFile.write( header + "Total" )
 outFile.write( '\n' )

you could just as well write

 outFile.write(header + "Total\n")

str.count() has to loop over the whole string, and you're doing that 26 times. That costs 26x the memory traffic if it's a long sequence that doesn't fit in CPU cache. It's probably still faster than looping manually (assuming CPython where interpreter overhead is killer), but even better would be if Python has an efficient function to histogram, like into a dictionary of key => count. (Then you could check the set of keys against the alphabet instead of doing that separately.)
Also, probably it would make more sense to do the alphabet check by looking for a match against the regex [^-*A-Z]. If it matches, there was a character not in that set. Perhaps faster than forcing it to figure out exactly which characters were present/absent, IDK. (If it were just one contiguous range, it could be checked very efficiently if the regex engine had the right optimizations...) But hopefully that goes away with a histogram where we end up with the set of keys as part of the result.
@PeterCordes collections.Counter is exactly what you hoped for in your first comment.
@Graipher: Thanks. Someone that knows something about Python should post that as an answer, then; if it has an efficient internal implementation it should be much faster than anything proposed in any existing answers.
But if they're asked to "do without" csv, I'm thinking the teacher wants them to do things at a more basic level than just calling Counter. It seems the students are at a level where they could use more learning about writing their own algorithms before learning about libraries. And I'd bet regexes are a bit advanced for them too — whereas using a basic built-in type like set is probably something the teacher would approve.

Stack Exchange Network

FASTA-to-tsv conversion script

3 Answers 3

Extraneous import

Explicit naming

Consistent naming

Shadowing built-in objects

Keeping the file open longer than it's needed

Probably excessive looping

Definitely excessive looping

Use the built-in `str.count` method to do your counting

Use `with open` block for writing file, too

No need to add explicit strings

PEP-8

Functions

CSV/TSV

Glob searching for files

`if name` block

Path?

Line iteration

Alphabet

Redundant `if`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

FASTA-to-tsv conversion script

3 Answers 3

Extraneous import

Explicit naming

Consistent naming

Shadowing built-in objects

Keeping the file open longer than it's needed

Probably excessive looping

Definitely excessive looping

Use the built-in str.count method to do your counting

Use with open block for writing file, too

No need to add explicit strings

PEP-8

Functions

CSV/TSV

Glob searching for files

if __name__ block

Path?

Line iteration

Alphabet

Redundant if

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Use the built-in `str.count` method to do your counting

Use `with open` block for writing file, too

`if name` block

Redundant `if`