3
\$\begingroup\$

Context:

This script reads in a fasta file, counts the number '-' characters, alphabetic characters and total characters for each sequence (sequence descriptions are on lines marked with '>'), and then prints out a processed file with columns where 90% (or more) of '-' characters are removed.

I've tested it with smaller alignments and it worked fine, but it's currently been running for around 4 hours on one alignment (220Mb), which seems slow and probably inefficient.

The script:

import sys
from Bio import AlignIO
alignment = AlignIO.read(open(sys.argv[1]), "fasta")
length = alignment.get_alignment_length()
tofilter = []
for i in range(length):
 counterofgaps=0
 counterofsites=0
 counteroftotal=0
 for record in alignment:
 for site in record.seq[i]:
 if site == '-':
 counterofgaps +=1
 counteroftotal +=1
 elif site.isalpha():
 counterofsites +=1
 counteroftotal +=1
 Fractionofgaps = counterofgaps/counteroftotal * 100
 if Fractionofgaps >=90:
 tofilter.append(i)
for record in alignment:
 print ('>' + record.id)
 for i in range(length):
 if i not in tofilter:
 print (record.seq[i], end ='')
 print()
AlexV
7,3532 gold badges24 silver badges47 bronze badges
asked Dec 12, 2019 at 9:25
\$\endgroup\$
6
  • \$\begingroup\$ Have you tried running this on smaller alignments? On what Python version are you running this? Would parsing this with SeqIO.parse instead help? \$\endgroup\$ Commented Dec 12, 2019 at 9:34
  • \$\begingroup\$ Python 3.7.2, yes I have tried on smaller alignments it works. would SeqIO.parse be quicker? \$\endgroup\$ Commented Dec 12, 2019 at 9:42
  • 1
    \$\begingroup\$ @Biomage, Can you share a pastebin.com link with some testable fragement from your input fasta file? \$\endgroup\$ Commented Dec 12, 2019 at 10:37
  • \$\begingroup\$ pastebin.com/2Ce7mQJa \$\endgroup\$ Commented Dec 12, 2019 at 11:28
  • \$\begingroup\$ @Biomage, I don't understand why the current approach should skip the consecutive columns even if they are not - char. Potentially, all records that fall into crucial condition ">= 90% of - chars " can have those - chars (gaps) as the rightmost sequence and alpha chars - as leftmost, at the very start. But in such case - all alpha letters would be skipped. Why is that logic correct? \$\endgroup\$ Commented Dec 12, 2019 at 12:31

1 Answer 1

4
\$\begingroup\$

Disregarding using a better parser (biopython's SeqIO), here are some immediate speed boosts due to better use of vanilla Python:

You don't actually need the counterofsites at all. If you only want the fraction of '-', this is the same as the average of a sequence of 1 and 0, where the value is 1 (or equivalently, True) if the character is '-':

from statistics import mean
def gap_fraction(alignment, i):
 return mean(site == "-" for record in alignment for site in record.seq[i])

This uses a generator expression to flatten the sites.

The other improvement is using a set instead of a list for the to be filtered elements. This is needed since you later do if i not in tofilter, which needs to scan through the whole list in the worst case. With a set this is immediate, i.e. it is \$\mathcal{O}(1)\$ instead of \$\mathcal{O}(n)\$. You will only see a real difference if your number of columns to filter gets large (>>100), though.

to_filter = {i for i in range(length) if gap_fraction(alignment, i) > 0.9}

I also used a set comprehension to make this a lot shorter and followed Python's official style-guide, PEP8, which recommends using lower_case_with_underscores for variables and functions.

You should also keep your calling code under a if __name__ == "__main__": guard to allow importing from this script form another script without running the code.

answered Dec 12, 2019 at 11:04
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.