Python FASTA parser using dictionaries without using BioPython or other external libraries

Question 1

I am writing my own parser for FASTA format. I can't use BioPython or anything else, because it's a part of an assignment and our teacher wants us to try to do it manually. For now, I have done this:

def read_file(fasta_file):
 "Parse fasta file"
 count = 0
 headers = []
 sequences = []
 aux = []
 with open("yeast.fna", 'r') as infile:
 for line in infile:
 record = line.rstrip()
 if record and record[0] == '>':
 headers.append(record[1:])
 if count > 0:
 sequences.append(''.join(aux))
 aux = []
 else:
 aux.append(record)
 count += 1
 sequences.append(''.join(aux))
 return headers, sequences
headers, sequences = read_file("yeast.fna")
lengths = 0
countN = 0
acum = 0
lengths = []
n = 0
lengths = [len(entry) for entry in sequences]
count = sum(lengths)
seqlengths = [entry for entry in sequences]
countN = [entry.count("N") + entry.count("n") for entry in sequences]
print("\nTotal length of the assembly : ", count)
print("\n Total number of unsequenced nucleotides (Ns) in the assembly : ", countN)
#arrange the secq sizes in decrossing order
all_len = sorted(lengths, reverse=True)
print(all_len)
#Search for size accumulation until the size is <= half the size
for y in range (len(lengths)):
 if acum <= count/2:
 acum = all_len[y] + acum
 n = y #L50
n = n + 1 # because the vector begin by a 0
print("The L50 is", n)
print("The N50 is", all_len[n-1]) #Seq is n-1 because vector begin in 0

As you can see this parser does some basic stats like calculating L50, ... But the problem with this parser is that it is not very universal since I designed it for the assembled genome of S. Cerivisae which has 17 chromosomes. I know that the best way to do a universal effective parser is to use a dictionary (and make 2 dictionaries).

Can you help me with that?

Here is a sample of input file yeast.fna. It is basically a txt file with headers which begin by a > followed by the dna sequences. I just copy part of it because it's a very big txt file with 17 chromosomes sequences (each one has it own header).

>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence
ccacaccacacccacacacccacacaccacaccacacaccacaccacacccacacacacacatCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCAT
TCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
>NC_001134.8 Saccharomyces cerevisiae S288C chromosome II, complete sequence
AAATAGCCCTCATGTACGTCTCCTCCAAGCCCTGTTGTCTCTTACCCGGATGTTCAACCAAAAGCTACTTACtaccttta
ttttatgtttactttttatagGTTGTCTTTTTATCCCACTTCTTCGCACTTGTCTCTCGCTACTGCCGTGCAACAAACAC
TAAATcaaaacaatgaaataCTACTACATCAAAACGCATTTTccctagaaaaaaaattttcttacAATATACTATACTAC
ACAATACATAATCACTGACTTTCgtaacaacaatttccttcacTCTCCAACTTCTCTGCTCGAATCTCTACatagtaata
>NC_001148.4 Saccharomyces cerevisiae S288C chromosome XVI, complete sequence
AAATAGCCCTCATGTACGTCTCCTCCAAGCCCTGTTGTCTCTTACCCGGATGTTCAACCAAAAGCTACTTACtaccttta
ttttatgtttactttttatagaTTGTCTTTTTATCCTACTCTTTCCCACTTGTCTCTCGCTACTGCCGTGCAACAAACAC
TAAATCAAAACAGTGAAATACTACTACATCAAAACGCATATTccctagaaaaaaaaatttcttacaATATACTATACTAC

Here is my output when I run my code :

Total length of the assembly : 12157105
 Total number of unsequenced nucleotides (Ns) in the assembly : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1531933, 1091291, 1090940, 1078177, 948066, 924431, 813184, 784333, 745751, 666816, 576874, 562643, 439888, 316620, 270161, 230218, 85779]
The L50 is 6
The N50 is 924431

Everything is okay but I need to make a more universal fasta parser which doesn't work only for my specific input file with 17 chromosome. In order to do that I have to use dictionary but I really don't know how to do that. You can also download the full input file. At the top of the page in line "Download sequences in FASTA format for genome, transcript, protein" you click on "genome" and you will download the file which is zipped.

Question 2

Your read_file returns headers and sequences implicitly correlated by their order. A model that is easier to work with is to make sequence class instances that each have their own header and sequence string.

If you accept fasta_file, then why do you hard-code yeast.fna?

yeast.fna's original name is GCF_000146045.2_R64_genomic.fna. It seems unwise to discard this information, unless perhaps yeast.fna is a symlink.

This code is not just a parser - it does a little bit of analysis. Ensure that your parsing code and analysis code are separated, and that neither leaks into the global namespace.

Separate your summary calculation from your summary printing.

As to your concern that

it is not very universal since I designed it for the assembled genome of S. Cerivisae which has 17 chromosomes.

I don't see where you've hard-coded 17 chromosomes, so (apart from other design concerns) this doesn't seem like a problem.

I know that the best way to do a universal effective parser is to use a dictionary (and make 2 dictionaries).

I don't think so. Prefer class instances rather than dictionaries for internal data.

Consider making two lightweight classes - one for each sequence, and one for the assembly overall.

You should do some basic parsing of your header - the first field is the ID, and the rest of the string is the description.

I have validity concerns about the way that you measure unsequenced data. Part of the problem seems to be with the format itself, but part of the problem looks to be with your parser, because it doesn't pay attention to X or -.

Consider using Python's built-in gzip support and operating on that archive directly rather than decompressing it on the file system.

Introduce PEP484 type hinting to your code.

You have descriptive headings for almost all of your summary fields - good! Add one for your all_len.

Suggested

This suggested code exclusively uses modules that are built into Python 3.9.

import gzip
from io import StringIO
from itertools import accumulate
from typing import Iterator, NamedTuple, Optional, Tuple
class FastaSequence(NamedTuple):
 identifier: str
 description: str
 sequence: str
 @staticmethod
 def parse_lines(lines: Iterator[str]) -> Tuple[
 str, # sequence
 Optional[str], # next header
 ]:
 next_header: Optional[str] = None
 def iterate():
 nonlocal next_header
 for line in lines:
 line = line.rstrip()
 if line.startswith('>'):
 next_header = line
 break
 yield line
 return ''.join(iterate()), next_header
 @classmethod
 def from_file(cls, in_file: Iterator[str]) -> Iterator['FastaSequence']:
 header = next(in_file).rstrip()
 while True:
 identifier, description = header[1:].split(maxsplit=1)
 sequence, next_header = cls.parse_lines(in_file)
 yield cls(identifier, description, sequence)
 if next_header is None:
 break
 header = next_header
 @property
 def n_unsequenced(self) -> int:
 # This is problematic and probably wrong, because documentation like
 # http://genetics.bwh.harvard.edu/pph/FASTA.html
 # says that X or - could also indicate unsequenced data, and N might
 # also just mean 'asparagine'.
 # - in particular indicates that this count is only a lower bound as the
 # gap is of indeterminate length.
 return sum(
 1 for x in self.sequence
 if x.lower() in {'n', 'x', '-'}
 )
class FastaAssembly:
 def __init__(self, sequences: Tuple[FastaSequence]) -> None:
 self.sequences = sequences
 
 self.sequence_lengths = sorted(
 (len(seq.sequence) for seq in self.sequences),
 reverse=True,
 )
 self.length = sum(self.sequence_lengths)
 index_50 = self.half_length
 self.l50 = index_50 + 1
 self.n50 = self.sequence_lengths[index_50]
 @property
 def half_length(self) -> int:
 for i, total in enumerate(accumulate(self.sequence_lengths)):
 if total > self.length/2:
 return i
 raise IndexError() # This will never happen
 @classmethod
 def from_file(cls, in_file: Iterator[str]) -> 'FastaAssembly':
 return cls(tuple(FastaSequence.from_file(in_file)))
 @property
 def summary(self) -> str:
 unsequenced = [seq.n_unsequenced for seq in self.sequences]
 return (
 f'Total length of the assembly: {self.length}'
 f'\nTotal number of unsequenced nucleotides (Ns) in the assembly: {unsequenced}'
 f'\nAll sequence lengths: {self.sequence_lengths}'
 f'\nThe L50 is {self.l50}'
 f'\nThe N50 is {self.n50}'
 )
EXAMPLE = \
'''>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence
ccacaccacacccacacacccacacaccacaccacacaccacaccacacccacacacacacatCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCAT
TCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
>NC_001134.8 Saccharomyces cerevisiae S288C chromosome II, complete sequence
AAATAGCCCTCATGTACGTCTCCTCCAAGCCCTGTTGTCTCTTACCCGGATGTTCAACCAAAAGCTACTTACtaccttta
ttttatgtttactttttatagGTTGTCTTTTTATCCCACTTCTTCGCACTTGTCTCTCGCTACTGCCGTGCAACAAACAC
TAAATcaaaacaatgaaataCTACTACATCAAAACGCATTTTccctagaaaaaaaattttcttacAATATACTATACTAC
ACAATACATAATCACTGACTTTCgtaacaacaatttccttcacTCTCCAACTTCTCTGCTCGAATCTCTACatagtaata
>NC_001148.4 Saccharomyces cerevisiae S288C chromosome XVI, complete sequence
AAATAGCCCTCATGTACGTCTCCTCCAAGCCCTGTTGTCTCTTACCCGGATGTTCAACCAAAAGCTACTTACtaccttta
ttttatgtttactttttatagaTTGTCTTTTTATCCTACTCTTTCCCACTTGTCTCTCGCTACTGCCGTGCAACAAACAC
TAAATCAAAACAGTGAAATACTACTACATCAAAACGCATATTccctagaaaaaaaaatttcttacaATATACTATACTAC
'''
def test() -> None:
 with StringIO(EXAMPLE) as f:
 assembly = FastaAssembly.from_file(f)
 print(assembly.summary)
 print()
 with gzip.open('GCF_000146045.2_R64_genomic.fna.gz', 'rt') as f:
 assembly = FastaAssembly.from_file(f)
 print(assembly.summary)
if __name__ == '__main__':
 test()

Output

Total length of the assembly: 800
Total number of unsequenced nucleotides (Ns) in the assembly: [0, 0, 0]
All sequence lengths: [320, 240, 240]
The L50 is 2
The N50 is 240
Total length of the assembly: 12157105
Total number of unsequenced nucleotides (Ns) in the assembly: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
All sequence lengths: [1531933, 1091291, 1090940, 1078177, 948066, 924431, 813184, 784333, 745751, 666816, 576874, 562643, 439888, 316620, 270161, 230218, 85779]
The L50 is 6
The N50 is 924431

Question 3

may i ask why you used NamedTuple instead of dataclasses ? Is it just for unpacking namedtuples or it has more advantages in this case? Thanks!!

Question 4

@AlexDotis Named tuples are naturally immutable and have a simpler API than dataclasses.

Reinderien Reinderien 70.9k5 gold badges76 silver badges256 bronze badges · Accepted Answer · 2021-11-09 17:26:46Z

Your read_file returns headers and sequences implicitly correlated by their order. A model that is easier to work with is to make sequence class instances that each have their own header and sequence string.

If you accept fasta_file, then why do you hard-code yeast.fna?

yeast.fna's original name is GCF_000146045.2_R64_genomic.fna. It seems unwise to discard this information, unless perhaps yeast.fna is a symlink.

This code is not just a parser - it does a little bit of analysis. Ensure that your parsing code and analysis code are separated, and that neither leaks into the global namespace.

Separate your summary calculation from your summary printing.

As to your concern that

it is not very universal since I designed it for the assembled genome of S. Cerivisae which has 17 chromosomes.

I don't see where you've hard-coded 17 chromosomes, so (apart from other design concerns) this doesn't seem like a problem.

I know that the best way to do a universal effective parser is to use a dictionary (and make 2 dictionaries).

I don't think so. Prefer class instances rather than dictionaries for internal data.

Consider making two lightweight classes - one for each sequence, and one for the assembly overall.

You should do some basic parsing of your header - the first field is the ID, and the rest of the string is the description.

I have validity concerns about the way that you measure unsequenced data. Part of the problem seems to be with the format itself, but part of the problem looks to be with your parser, because it doesn't pay attention to X or -.

Consider using Python's built-in gzip support and operating on that archive directly rather than decompressing it on the file system.

Introduce PEP484 type hinting to your code.

You have descriptive headings for almost all of your summary fields - good! Add one for your all_len.

Suggested

This suggested code exclusively uses modules that are built into Python 3.9.

import gzip
from io import StringIO
from itertools import accumulate
from typing import Iterator, NamedTuple, Optional, Tuple
class FastaSequence(NamedTuple):
 identifier: str
 description: str
 sequence: str
 @staticmethod
 def parse_lines(lines: Iterator[str]) -> Tuple[
 str, # sequence
 Optional[str], # next header
 ]:
 next_header: Optional[str] = None
 def iterate():
 nonlocal next_header
 for line in lines:
 line = line.rstrip()
 if line.startswith('>'):
 next_header = line
 break
 yield line
 return ''.join(iterate()), next_header
 @classmethod
 def from_file(cls, in_file: Iterator[str]) -> Iterator['FastaSequence']:
 header = next(in_file).rstrip()
 while True:
 identifier, description = header[1:].split(maxsplit=1)
 sequence, next_header = cls.parse_lines(in_file)
 yield cls(identifier, description, sequence)
 if next_header is None:
 break
 header = next_header
 @property
 def n_unsequenced(self) -> int:
 # This is problematic and probably wrong, because documentation like
 # http://genetics.bwh.harvard.edu/pph/FASTA.html
 # says that X or - could also indicate unsequenced data, and N might
 # also just mean 'asparagine'.
 # - in particular indicates that this count is only a lower bound as the
 # gap is of indeterminate length.
 return sum(
 1 for x in self.sequence
 if x.lower() in {'n', 'x', '-'}
 )
class FastaAssembly:
 def __init__(self, sequences: Tuple[FastaSequence]) -> None:
 self.sequences = sequences
 
 self.sequence_lengths = sorted(
 (len(seq.sequence) for seq in self.sequences),
 reverse=True,
 )
 self.length = sum(self.sequence_lengths)
 index_50 = self.half_length
 self.l50 = index_50 + 1
 self.n50 = self.sequence_lengths[index_50]
 @property
 def half_length(self) -> int:
 for i, total in enumerate(accumulate(self.sequence_lengths)):
 if total > self.length/2:
 return i
 raise IndexError() # This will never happen
 @classmethod
 def from_file(cls, in_file: Iterator[str]) -> 'FastaAssembly':
 return cls(tuple(FastaSequence.from_file(in_file)))
 @property
 def summary(self) -> str:
 unsequenced = [seq.n_unsequenced for seq in self.sequences]
 return (
 f'Total length of the assembly: {self.length}'
 f'\nTotal number of unsequenced nucleotides (Ns) in the assembly: {unsequenced}'
 f'\nAll sequence lengths: {self.sequence_lengths}'
 f'\nThe L50 is {self.l50}'
 f'\nThe N50 is {self.n50}'
 )
EXAMPLE = \
'''>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence
ccacaccacacccacacacccacacaccacaccacacaccacaccacacccacacacacacatCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCAT
TCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
>NC_001134.8 Saccharomyces cerevisiae S288C chromosome II, complete sequence
AAATAGCCCTCATGTACGTCTCCTCCAAGCCCTGTTGTCTCTTACCCGGATGTTCAACCAAAAGCTACTTACtaccttta
ttttatgtttactttttatagGTTGTCTTTTTATCCCACTTCTTCGCACTTGTCTCTCGCTACTGCCGTGCAACAAACAC
TAAATcaaaacaatgaaataCTACTACATCAAAACGCATTTTccctagaaaaaaaattttcttacAATATACTATACTAC
ACAATACATAATCACTGACTTTCgtaacaacaatttccttcacTCTCCAACTTCTCTGCTCGAATCTCTACatagtaata
>NC_001148.4 Saccharomyces cerevisiae S288C chromosome XVI, complete sequence
AAATAGCCCTCATGTACGTCTCCTCCAAGCCCTGTTGTCTCTTACCCGGATGTTCAACCAAAAGCTACTTACtaccttta
ttttatgtttactttttatagaTTGTCTTTTTATCCTACTCTTTCCCACTTGTCTCTCGCTACTGCCGTGCAACAAACAC
TAAATCAAAACAGTGAAATACTACTACATCAAAACGCATATTccctagaaaaaaaaatttcttacaATATACTATACTAC
'''
def test() -> None:
 with StringIO(EXAMPLE) as f:
 assembly = FastaAssembly.from_file(f)
 print(assembly.summary)
 print()
 with gzip.open('GCF_000146045.2_R64_genomic.fna.gz', 'rt') as f:
 assembly = FastaAssembly.from_file(f)
 print(assembly.summary)
if __name__ == '__main__':
 test()

Output

Total length of the assembly: 800
Total number of unsequenced nucleotides (Ns) in the assembly: [0, 0, 0]
All sequence lengths: [320, 240, 240]
The L50 is 2
The N50 is 240
Total length of the assembly: 12157105
Total number of unsequenced nucleotides (Ns) in the assembly: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
All sequence lengths: [1531933, 1091291, 1090940, 1078177, 948066, 924431, 813184, 784333, 745751, 666816, 576874, 562643, 439888, 316620, 270161, 230218, 85779]
The L50 is 6
The N50 is 924431

may i ask why you used NamedTuple instead of dataclasses ? Is it just for unpacking namedtuples or it has more advantages in this case? Thanks!!
@AlexDotis Named tuples are naturally immutable and have a simpler API than dataclasses.

Stack Exchange Network

Python FASTA parser using dictionaries without using BioPython or other external libraries

1 Answer 1

Suggested

Output

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Python FASTA parser using dictionaries without using BioPython or other external libraries

1 Answer 1

Suggested

Output

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions