6
\$\begingroup\$

This is a Biopython alternative with pretty straightforward code. How can I make this more concise?

def genbank_to_fasta():
 file = input(r'Input the path to your file: ')
 with open(f'{file}') as f:
 gb = f.readlines()
 locus = re.search('NC_\d+\.\d+', gb[3]).group()
 region = re.search('(\d+)?\.+(\d+)', gb[2])
 definition = re.search('\w.+', gb[1][10:]).group()
 definition = definition.replace(definition[-1], "")
 tag = locus + ":" + region.group(1) + "-" + region.group(2) + " " + definition
 sequence = ""
 for line in (gb):
 pattern = re.compile('[a,t,g,c]{10}')
 matches = pattern.finditer(line)
 for match in matches:
 sequence += match.group().upper()
 end_pattern = re.search('[a,t,g,c]{1,9}', gb[-3])
 sequence += end_pattern.group().upper()
 print(len(sequence))
 return sequence, tag
Peilonrayz
44.4k7 gold badges80 silver badges157 bronze badges
asked Jun 25, 2020 at 23:05
\$\endgroup\$
5
  • 1
    \$\begingroup\$ @Emma here is an example. 'REGION: 26156329..26157115' \$\endgroup\$ Commented Jun 26, 2020 at 2:00
  • 1
    \$\begingroup\$ @Emma Those numbers are not always 8 digits long it depends. Here are the links for the reference I use. ncbi.nlm.nih.gov/nuccore/… (GenBank file) ncbi.nlm.nih.gov/nuccore/… (FASTA file). \$\endgroup\$ Commented Jun 26, 2020 at 2:38
  • 1
    \$\begingroup\$ @Emma that's good to know. My main problem came with the sequence. If the last group of DNA was not a group of 10, my current code will not parse it so I had to write the end_pattern pattern in order to get the last one. I think there is a better way to do it but I'm not sure. \$\endgroup\$ Commented Jun 26, 2020 at 2:53
  • \$\begingroup\$ What is the order of the size of the input files? Kilobytes? Gigabytes? \$\endgroup\$ Commented Jun 26, 2020 at 8:01
  • \$\begingroup\$ @PeterMortensen I believe it is in kilobytes \$\endgroup\$ Commented Jun 26, 2020 at 23:25

1 Answer 1

5
\$\begingroup\$

Line iteration

 gb = f.readlines()
 locus = re.search('NC_\d+\.\d+', gb[3]).group()
 region = re.search('(\d+)?\.+(\d+)', gb[2])
 definition = re.search('\w.+', gb[1][10:]).group()
 for line in (gb):
 # ...
 end_pattern = re.search('[a,t,g,c]{1,9}', gb[-3])

can be

next(f)
definition = re.search('\w.+', next(f)[10:]).group()
region = re.search('(\d+)?\.+(\d+)', next(f))
locus = re.search('NC_\d+\.\d+', next(f)).group()
gb = tuple(f)
for line in gb:

Since you need gb[-3], you can't get away with a purely "streamed" iteration. You could be clever and keep a small queue of the last few entries you've read, if you're deeply worried about memory consumption.

Debugging statements

Remove this:

 print(len(sequence))

or convert it to a logging call at the debug level.

Compilation

Do not do this:

 pattern = re.compile('[a,t,g,c]{10}')

in an inner loop. The whole point of compile is that it can be done once to save the cost of re-compilation; so a safe option is to do this at the global level instead.

answered Jun 26, 2020 at 2:57
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.