Code to remove columns in a (fasta) file with more than 90% '-' characters in that column

Question 1

Context:

This script reads in a fasta file, counts the number '-' characters, alphabetic characters and total characters for each sequence (sequence descriptions are on lines marked with '>'), and then prints out a processed file with columns where 90% (or more) of '-' characters are removed.

I've tested it with smaller alignments and it worked fine, but it's currently been running for around 4 hours on one alignment (220Mb), which seems slow and probably inefficient.

The script:

import sys
from Bio import AlignIO
alignment = AlignIO.read(open(sys.argv[1]), "fasta")
length = alignment.get_alignment_length()
tofilter = []
for i in range(length):
 counterofgaps=0
 counterofsites=0
 counteroftotal=0
 for record in alignment:
 for site in record.seq[i]:
 if site == '-':
 counterofgaps +=1
 counteroftotal +=1
 elif site.isalpha():
 counterofsites +=1
 counteroftotal +=1
 Fractionofgaps = counterofgaps/counteroftotal * 100
 if Fractionofgaps >=90:
 tofilter.append(i)
for record in alignment:
 print ('>' + record.id)
 for i in range(length):
 if i not in tofilter:
 print (record.seq[i], end ='')
 print()

Question 2

Have you tried running this on smaller alignments? On what Python version are you running this? Would parsing this with SeqIO.parse instead help?

Question 3

Python 3.7.2, yes I have tried on smaller alignments it works. would SeqIO.parse be quicker?

Question 4

@Biomage, Can you share a pastebin.com link with some testable fragement from your input fasta file?

Question 5

pastebin.com/2Ce7mQJa

Question 6

@Biomage, I don't understand why the current approach should skip the consecutive columns even if they are not - char. Potentially, all records that fall into crucial condition ">= 90% of - chars " can have those - chars (gaps) as the rightmost sequence and alpha chars - as leftmost, at the very start. But in such case - all alpha letters would be skipped. Why is that logic correct?

Question 7

Disregarding using a better parser (biopython's SeqIO), here are some immediate speed boosts due to better use of vanilla Python:

You don't actually need the counterofsites at all. If you only want the fraction of '-', this is the same as the average of a sequence of 1 and 0, where the value is 1 (or equivalently, True) if the character is '-':

from statistics import mean
def gap_fraction(alignment, i):
 return mean(site == "-" for record in alignment for site in record.seq[i])

This uses a generator expression to flatten the sites.

The other improvement is using a set instead of a list for the to be filtered elements. This is needed since you later do if i not in tofilter, which needs to scan through the whole list in the worst case. With a set this is immediate, i.e. it is \$\mathcal{O}(1)\$ instead of \$\mathcal{O}(n)\$. You will only see a real difference if your number of columns to filter gets large (>>100), though.

to_filter = {i for i in range(length) if gap_fraction(alignment, i) > 0.9}

I also used a set comprehension to make this a lot shorter and followed Python's official style-guide, PEP8, which recommends using lower_case_with_underscores for variables and functions.

You should also keep your calling code under a if __name__ == "__main__": guard to allow importing from this script form another script without running the code.

Graipher Graipher 41.6k7 gold badges70 silver badges134 bronze badges · Accepted Answer · 2019-12-12 11:04:11Z

Disregarding using a better parser (biopython's SeqIO), here are some immediate speed boosts due to better use of vanilla Python:

You don't actually need the counterofsites at all. If you only want the fraction of '-', this is the same as the average of a sequence of 1 and 0, where the value is 1 (or equivalently, True) if the character is '-':

from statistics import mean
def gap_fraction(alignment, i):
 return mean(site == "-" for record in alignment for site in record.seq[i])

This uses a generator expression to flatten the sites.

The other improvement is using a set instead of a list for the to be filtered elements. This is needed since you later do if i not in tofilter, which needs to scan through the whole list in the worst case. With a set this is immediate, i.e. it is \$\mathcal{O}(1)\$ instead of \$\mathcal{O}(n)\$. You will only see a real difference if your number of columns to filter gets large (>>100), though.

to_filter = {i for i in range(length) if gap_fraction(alignment, i) > 0.9}

I also used a set comprehension to make this a lot shorter and followed Python's official style-guide, PEP8, which recommends using lower_case_with_underscores for variables and functions.

You should also keep your calling code under a if __name__ == "__main__": guard to allow importing from this script form another script without running the code.

Stack Exchange Network

Code to remove columns in a (fasta) file with more than 90% '-' characters in that column

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Code to remove columns in a (fasta) file with more than 90% '-' characters in that column

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions