Context:
This script reads in a fasta file, counts the number '-' characters, alphabetic characters and total characters for each sequence (sequence descriptions are on lines marked with '>'), and then prints out a processed file with columns where 90% (or more) of '-' characters are removed.
I've tested it with smaller alignments and it worked fine, but it's currently been running for around 4 hours on one alignment (220Mb), which seems slow and probably inefficient.
The script:
import sys
from Bio import AlignIO
alignment = AlignIO.read(open(sys.argv[1]), "fasta")
length = alignment.get_alignment_length()
tofilter = []
for i in range(length):
counterofgaps=0
counterofsites=0
counteroftotal=0
for record in alignment:
for site in record.seq[i]:
if site == '-':
counterofgaps +=1
counteroftotal +=1
elif site.isalpha():
counterofsites +=1
counteroftotal +=1
Fractionofgaps = counterofgaps/counteroftotal * 100
if Fractionofgaps >=90:
tofilter.append(i)
for record in alignment:
print ('>' + record.id)
for i in range(length):
if i not in tofilter:
print (record.seq[i], end ='')
print()
1 Answer 1
Disregarding using a better parser (biopython's SeqIO), here are some immediate speed boosts due to better use of vanilla Python:
You don't actually need the counterofsites
at all. If you only want the fraction of '-'
, this is the same as the average of a sequence of 1
and 0
, where the value is 1
(or equivalently, True
) if the character is '-'
:
from statistics import mean
def gap_fraction(alignment, i):
return mean(site == "-" for record in alignment for site in record.seq[i])
This uses a generator expression to flatten the sites.
The other improvement is using a set
instead of a list
for the to be filtered elements. This is needed since you later do if i not in tofilter
, which needs to scan through the whole list in the worst case. With a set this is immediate, i.e. it is \$\mathcal{O}(1)\$ instead of \$\mathcal{O}(n)\$. You will only see a real difference if your number of columns to filter gets large (>>100), though.
to_filter = {i for i in range(length) if gap_fraction(alignment, i) > 0.9}
I also used a set comprehension to make this a lot shorter and followed Python's official style-guide, PEP8, which recommends using lower_case_with_underscores
for variables and functions.
You should also keep your calling code under a if __name__ == "__main__":
guard to allow importing from this script form another script without running the code.
SeqIO.parse
instead help? \$\endgroup\$-
char. Potentially, all records that fall into crucial condition ">= 90
% of-
chars " can have those-
chars (gaps) as the rightmost sequence and alpha chars - as leftmost, at the very start. But in such case - all alpha letters would be skipped. Why is that logic correct? \$\endgroup\$