This is a Biopython alternative with pretty straightforward code. How can I make this more concise?
def genbank_to_fasta():
file = input(r'Input the path to your file: ')
with open(f'{file}') as f:
gb = f.readlines()
locus = re.search('NC_\d+\.\d+', gb[3]).group()
region = re.search('(\d+)?\.+(\d+)', gb[2])
definition = re.search('\w.+', gb[1][10:]).group()
definition = definition.replace(definition[-1], "")
tag = locus + ":" + region.group(1) + "-" + region.group(2) + " " + definition
sequence = ""
for line in (gb):
pattern = re.compile('[a,t,g,c]{10}')
matches = pattern.finditer(line)
for match in matches:
sequence += match.group().upper()
end_pattern = re.search('[a,t,g,c]{1,9}', gb[-3])
sequence += end_pattern.group().upper()
print(len(sequence))
return sequence, tag
-
1\$\begingroup\$ @Emma here is an example. 'REGION: 26156329..26157115' \$\endgroup\$Ethan Hetrick– Ethan Hetrick2020年06月26日 02:00:55 +00:00Commented Jun 26, 2020 at 2:00
-
1\$\begingroup\$ @Emma Those numbers are not always 8 digits long it depends. Here are the links for the reference I use. ncbi.nlm.nih.gov/nuccore/… (GenBank file) ncbi.nlm.nih.gov/nuccore/… (FASTA file). \$\endgroup\$Ethan Hetrick– Ethan Hetrick2020年06月26日 02:38:32 +00:00Commented Jun 26, 2020 at 2:38
-
1\$\begingroup\$ @Emma that's good to know. My main problem came with the sequence. If the last group of DNA was not a group of 10, my current code will not parse it so I had to write the end_pattern pattern in order to get the last one. I think there is a better way to do it but I'm not sure. \$\endgroup\$Ethan Hetrick– Ethan Hetrick2020年06月26日 02:53:23 +00:00Commented Jun 26, 2020 at 2:53
-
\$\begingroup\$ What is the order of the size of the input files? Kilobytes? Gigabytes? \$\endgroup\$Peter Mortensen– Peter Mortensen2020年06月26日 08:01:15 +00:00Commented Jun 26, 2020 at 8:01
-
\$\begingroup\$ @PeterMortensen I believe it is in kilobytes \$\endgroup\$Ethan Hetrick– Ethan Hetrick2020年06月26日 23:25:38 +00:00Commented Jun 26, 2020 at 23:25
1 Answer 1
Line iteration
gb = f.readlines()
locus = re.search('NC_\d+\.\d+', gb[3]).group()
region = re.search('(\d+)?\.+(\d+)', gb[2])
definition = re.search('\w.+', gb[1][10:]).group()
for line in (gb):
# ...
end_pattern = re.search('[a,t,g,c]{1,9}', gb[-3])
can be
next(f)
definition = re.search('\w.+', next(f)[10:]).group()
region = re.search('(\d+)?\.+(\d+)', next(f))
locus = re.search('NC_\d+\.\d+', next(f)).group()
gb = tuple(f)
for line in gb:
Since you need gb[-3]
, you can't get away with a purely "streamed" iteration. You could be clever and keep a small queue of the last few entries you've read, if you're deeply worried about memory consumption.
Debugging statements
Remove this:
print(len(sequence))
or convert it to a logging call at the debug level.
Compilation
Do not do this:
pattern = re.compile('[a,t,g,c]{10}')
in an inner loop. The whole point of compile
is that it can be done once to save the cost of re-compilation; so a safe option is to do this at the global level instead.
Explore related questions
See similar questions with these tags.