GenBank to FASTA format using regular expressions without Biopython

Question 1

This is a Biopython alternative with pretty straightforward code. How can I make this more concise?

    import re

    def genbank_to_fasta():
        file = input(r'Input the path to your file: ')
        with open(f'{file}') as f:
            gb = f.readlines()
            locus = re.search('NC_\d+\.\d+', gb[3]).group()
            region = re.search('(\d+)?\.+(\d+)', gb[2])
            definition = re.search('\w.+', gb[1][10:]).group()
            definition = definition.replace(definition[-1], "")
            tag = locus + ":" + region.group(1) + "-" + region.group(2) + " " + definition
            sequence = ""
            for line in (gb):
                pattern = re.compile('[a,t,g,c]{10}')
                matches = pattern.finditer(line)
                for match in matches:
                    sequence += match.group().upper()
            end_pattern = re.search('[a,t,g,c]{1,9}', gb[-3])
            sequence += end_pattern.group().upper()
            print(len(sequence))
            return sequence, tag

Comment 1

@Emma Here is an example: 'REGION: 26156329..26157115'

Comment 2

@Emma Those numbers are not always 8 digits long; it depends. Here are the links to the references I use: ncbi.nlm.nih.gov/nuccore/… (GenBank file) and ncbi.nlm.nih.gov/nuccore/… (FASTA file).

Comment 3

@Emma That's good to know. My main problem was with the sequence. If the last group of DNA is not a group of 10, my current code does not parse it, so I had to write the end_pattern search to pick up the last one. I think there is a better way to do it, but I'm not sure.

Comment 4

What is the order of magnitude of the input file sizes? Kilobytes? Gigabytes?

Comment 5

@PeterMortensen I believe it is in kilobytes.

Answer

Line iteration

    gb = f.readlines()
    locus = re.search('NC_\d+\.\d+', gb[3]).group()
    region = re.search('(\d+)?\.+(\d+)', gb[2])
    definition = re.search('\w.+', gb[1][10:]).group()
    for line in (gb):
        # ...
    end_pattern = re.search('[a,t,g,c]{1,9}', gb[-3])

can be

    next(f)
    definition = re.search('\w.+', next(f)[10:]).group()
    region = re.search('(\d+)?\.+(\d+)', next(f))
    locus = re.search('NC_\d+\.\d+', next(f)).group()
    gb = tuple(f)
    for line in gb:

Since you need gb[-3], you can't get away with a purely "streamed" iteration. You could be clever and keep a small queue of the last few entries you've read, if you're deeply worried about memory consumption.
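As a rough sketch of that queue idea, assuming the standard-library collections.deque with a maxlen is an acceptable way to hold the last few lines: the function name stream_with_tail and the line counter are hypothetical stand-ins for the real per-line work, not part of the original code.

```python
from collections import deque


def stream_with_tail(lines, tail_size=3):
    """Consume lines one at a time while remembering only the last few.

    Hypothetical helper: the counter stands in for the real per-line
    processing (for example, collecting sequence matches).
    """
    tail = deque(maxlen=tail_size)  # older entries are dropped automatically
    processed = 0
    for line in lines:
        tail.append(line)
        processed += 1
    return processed, list(tail)
```

Once the file has been consumed, the first element of the returned tail plays the role of gb[-3], provided the file has at least three lines after the header.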

Debugging statements

Remove this:

    print(len(sequence))

or convert it to a logging call at the debug level.
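A minimal sketch of the logging variant, using the standard logging module. The helper name report_length is made up for illustration; in the real function the logger.debug call would simply replace the print.

```python
import logging

logger = logging.getLogger(__name__)


def report_length(sequence):
    """Log the sequence length at DEBUG level instead of printing it."""
    length = len(sequence)
    # Lazy %-style formatting: the message is only built if DEBUG is enabled.
    logger.debug("sequence length: %d", length)
    return length
```

With the default configuration the message stays hidden; calling logging.basicConfig(level=logging.DEBUG) before running makes it visible.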

Compilation

Do not do this:

    pattern = re.compile('[a,t,g,c]{10}')

in an inner loop. The whole point of re.compile is that it can be done once to save the cost of re-compilation, so a safe option is to compile the pattern at the global (module) level instead.
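Here is one way that could look, as a sketch rather than a drop-in replacement. Two assumptions of mine: the character class is written as [acgt] (inside brackets the commas in '[a,t,g,c]' are matched literally, which is probably not intended), and the function receives only the lines that follow the ORIGIN marker. The names BASES and extract_sequence are hypothetical.

```python
import re

# Compiled once at import time rather than on every loop iteration.
# Assumption: [acgt]+ expresses what '[a,t,g,c]{10}' was aiming for,
# while also accepting runs shorter than 10 bases.
BASES = re.compile(r'[acgt]+')


def extract_sequence(origin_lines):
    """Join every run of lowercase bases found in the given lines."""
    parts = []
    for line in origin_lines:
        parts.extend(match.group() for match in BASES.finditer(line))
    return ''.join(parts).upper()
```

Because the quantifier is + rather than {10}, a final group with fewer than 10 bases is picked up by the same loop, so a separate end_pattern search should not be needed under these assumptions.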

Reinderien · Accepted Answer · 2020-06-26 02:57:48Z
