Background
This script can be used as a command-line interface (CLI) or as a sub-module in another program to download the latest UniProt proteome for a given taxon. Files are downloaded to the same directory as the script.
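It can be invoked along these lines (an illustrative call; the script's filename here is an assumption):

python uniprot_proteome_updater.py --organism Mouse --file_type txt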
Code
#!/usr/bin/env python
"UniProt Proteome Updater"
# Copyright James Draper 2017 MIT license
import argparse
import os
import gzip
import itertools
from urllib import request
from dateutil.parser import parse as dt_parse
def check_uniprot(organism='Mouse', file_format='txt', archived=True):
    """Return the latest time-stamp from the local UniProt proteomes.

    If there are no proteomes available, None is returned.

    Parameters
    ----------
    organism : str
        The taxon id for the species e.g. Mouse, Human, etc.
    file_format : str
        The desired file format e.g. txt or fasta.
    archived : bool
        If True, isolates gzipped files.

    Returns
    -------
    top_hit : float or None
        The latest time-stamp in the isolated list.
    """
    # Return a list with files in the same directory as the script.
    top = list(filter(lambda x: x[0] == '.', list(os.walk('.'))))
    # Flatten the top into a single list.
    top = list(itertools.chain.from_iterable(top[0][1:]))
    # Filter for files with the given file format.
    all_format = list(filter(lambda x: file_format in x.split('.'), top))
    all_format.sort()
    if archived:
        all_format = list(filter(lambda x: 'gz' in x.split('.'), all_format))
        all_format.sort()
    else:
        all_format = list(filter(lambda x: 'gz' not in x.split('.'), all_format))
        all_format.sort()
    # Filter for the files that contain 'uniprot-proteome'.
    all_uniprot = list(filter(lambda x: 'uniprot-proteome' in x, all_format))
    # Filter for the correctly formatted file.
    all_uniprot = list(filter(lambda x: len(x.split('-')) == 4, all_uniprot))
    # Filter for the specified organism.
    all_uniprot = list(filter(lambda x: organism in x, all_uniprot))
    # Sort the list in descending order.
    all_uniprot.sort(reverse=True)
    if len(all_uniprot) > 0:
        # Grab the top hit which should be the newest file.
        top_hit = all_uniprot[0]
        # Grab the timestamp.
        top_hit = top_hit.split('-')[-1].split('.')[0]
        top_hit = float(top_hit)
        return top_hit
    else:
        return None
def get_uniprot_proteome(organism='Mouse', file_format='txt', archived=True,
                         force=False):
    """Download the entire proteome for a given taxon.

    Allow 5-15 minutes for the download.

    Parameters
    ----------
    organism : str
        The taxon id for the species e.g. Mouse, Human, etc.
    file_format : str
        The desired file format e.g. txt or fasta.
    archived : bool
        If True, gzip the downloaded file.
    force : bool
        Forces the download even if the file is present.
    """
    # Load the terms into the query.
    query = "?query=organism:{0}&format={1}".format(organism, file_format)
    # Create the request string.
    url = "".join(["http://www.uniprot.org/uniprot/", query])
    # Make request.
    req = request.urlopen(url)
    # Grab the 'Last-Modified' string from req.info() then convert to datetime.
    last_modified = dt_parse(req.info()['Last-Modified']).replace(tzinfo=None)
    # Get the time stamp for the latest locally available proteome.
    check = check_uniprot(organism=organism, file_format=file_format,
                          archived=archived)
    if last_modified.timestamp() == check and force is False:
        print('Your UniProt Proteome is up to date.')
    else:
        print("UniProt Proteome is downloading. This may take a while.")
        time_stamp = str(last_modified.timestamp()).split('.')[0]
        front_term = 'uniprot-proteome'
        fn = '-'.join([front_term, organism, time_stamp])
        fn = '.'.join([fn, file_format])
        if archived:
            fn = '.'.join([fn, 'gz'])
            f = open(fn, 'wb')
            f.write(gzip.compress(req.read()))
        else:
            f = open(fn, 'wb')
            f.write(req.read())
        f.close()
        print('UniProt Proteome has been downloaded:', fn)
    return check
# Command-line interface
parser = argparse.ArgumentParser()
parser.add_argument("-o", "--organism",
                    type=str,
                    help="The desired organism.",
                    nargs='?',
                    const="Mouse",
                    default="Mouse")
parser.add_argument("-t", "--file_type",
                    type=str,
                    help="The desired file format.",
                    nargs="?",
                    const="txt",
                    default="txt")
parser.add_argument("-a", "--archived",
                    type=bool,
                    help="True will use gzip to archive your file.",
                    nargs="?",
                    const=True,
                    default=True)
parser.add_argument("-f", "--force",
                    type=bool,
                    help="Force the download even if the file is present.",
                    nargs="?",
                    const=True,
                    default=False)
args = parser.parse_args()

if __name__ == '__main__':
    get_uniprot_proteome(args.organism, args.file_type,
                         args.archived, args.force)
Questions
Is there any way to improve the performance?
Could multiprocessing or threading be applied anywhere?
Are there any other ways that I could generally improve this code?
All comments and suggestions welcome.
2 Answers
Thank you for sharing.
The docstrings are lovely. Kudos.
This comment is helpful:
# Get the time stamp for the latest locally available proteome.
check = check_uniprot(organism=organism, file_format=file_format,
                      archived=archived)
The comments above it are redundant. I advocate deleting them. Comments lie, as bit rot sets in. Too often, when people maintain code it drifts away from the (unchanged) comment. Don't restate in English what is already obvious in the (clear, well written) code.
parser.add_argument("-t", "--file_type",
type=str,
help="The desired file format.",
nargs="?",
const="txt",
default="txt")
This should allow 'txt' or 'fasta', only. Please refer to https://docs.python.org/3/library/argparse.html#choices
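For instance (a sketch; with choices, argparse rejects anything outside the listed values with a clear error message):

parser.add_argument("-t", "--file_type",
                    type=str,
                    choices=["txt", "fasta"],
                    help="The desired file format.",
                    nargs="?",
                    const="txt",
                    default="txt")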
Maybe the dateutil dependency is worth it:
last_modified = dt_parse(req.info()['Last-Modified']).replace(tzinfo=None)
Personally, I would kind of like to see an explicit date format here, and then datetime's strptime would suffice. Put another way, if a website's date format doesn't conform to the RFC, I would like to know about it.
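Something like this sketch, reusing req from the question and assuming the server sends the standard RFC 7231 date format (e.g. 'Tue, 12 Sep 2017 16:07:31 GMT'); a nonconforming date would then raise ValueError, which is exactly the visibility I want:

from datetime import datetime

last_modified = datetime.strptime(req.info()['Last-Modified'],
                                  '%a, %d %b %Y %H:%M:%S GMT')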
# Get the time stamp for the latest locally available proteome.
check = check_uniprot(organism=organism, file_format=file_format,
                      archived=archived)
check sounds more like a boolean than a timestamp.
It would be more natural to use positional than keyword=keyword arguments.
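That is, simply:

check = check_uniprot(organism, file_format, archived)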
"""Return the latest time-stamp from the local UniProt proteomes.
I find that unclear. Perhaps the docstring could spell out that it is a unixtime (seconds since 1970). I'm slightly surprised filenames don't use ISO8601, as that sorts nicely and is much more human-friendly.
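Something along these lines would do it (a sketch, reusing last_modified from the question; the colons are dropped so the filename is also legal on Windows):

# e.g. '2017-09-12T160731' -- sorts lexicographically by date.
time_stamp = last_modified.strftime('%Y-%m-%dT%H%M%S')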
    # Return a list with files in the same directory as the script.
    top = list(filter(lambda x: x[0] == '.', list(os.walk('.'))))
    # Flatten the top into a single list.
    top = list(itertools.chain.from_iterable(top[0][1:]))
    # Filter for files with the given file format.
    all_format = list(filter(lambda x: file_format in x.split('.'), top))
    all_format.sort()
I don't understand why glob.glob('*.' + file_format) wouldn't suffice. Ok, there's the '.gz' detail, but perhaps instead of a boolean archived parameter we'd prefer to use file_format.endswith('.gz').
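A sketch of the glob approach, assuming the function's parameters are in scope and the filename layout 'uniprot-proteome-&lt;organism&gt;-&lt;timestamp&gt;.&lt;format&gt;[.gz]' that get_uniprot_proteome() builds:

import glob

pattern = 'uniprot-proteome-{}-*.{}'.format(organism, file_format)
if archived:
    pattern += '.gz'
# Newest file first, since the timestamps are zero-padded-free unixtimes.
all_uniprot = sorted(glob.glob(pattern), reverse=True)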
    # Filter for the files that contain 'uniprot-proteome'.
    all_uniprot = list(filter(lambda x: 'uniprot-proteome' in x, all_format))
    # Filter for the correctly formatted file.
    all_uniprot = list(filter(lambda x: len(x.split('-')) == 4, all_uniprot))
    # Filter for the specified organism.
    all_uniprot = list(filter(lambda x: organism in x, all_uniprot))
I believe a single regex would accomplish all that in a clearer fashion.
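For example (a sketch, assuming the function's parameters are in scope and the filename layout 'uniprot-proteome-&lt;organism&gt;-&lt;unixtime&gt;.&lt;format&gt;[.gz]'):

import os
import re

name_re = re.compile(r'^uniprot-proteome-{}-(\d+)\.{}{}$'.format(
    re.escape(organism), re.escape(file_format),
    r'\.gz' if archived else ''))
# One pass: match each filename and keep the captured timestamps.
matches = (name_re.match(name) for name in os.listdir('.'))
timestamps = [float(m.group(1)) for m in matches if m]
top_hit = max(timestamps) if timestamps else None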
    if len(all_uniprot) > 0:
        # Grab the top hit which should be the newest file.
        top_hit = all_uniprot[0]
        # Grab the timestamp.
        top_hit = top_hit.split('-')[-1].split('.')[0]
        top_hit = float(top_hit)
        return top_hit
    else:
        return None
This code is tightly coupled to the filtering code above - you assume by construction that if at least one file survived filtering, it will be returned by that clause. In other words, we never execute beyond the if. It would be simpler to omit the else and unconditionally return None. Another approach that costs a few more cycles but yields easier to understand code is to init top_hit to None, scan every entry in the (ascending) sorted list, conditionally assigning a new candidate return value to top_hit, and finally return top_hit. It will have the best value found, either None or the file with largest timestamp.
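Sketched out, with the filtered filenames as input (the helper name latest_timestamp is mine):

def latest_timestamp(all_uniprot):
    # Scan the ascending sorted list; keep the largest timestamp seen.
    top_hit = None
    for name in sorted(all_uniprot):
        candidate = float(name.split('-')[-1].split('.')[0])
        if top_hit is None or candidate > top_hit:
            top_hit = candidate
    return top_hit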
- Is there any way to improve the performance?
- Could multiprocessing or threading be applied anywhere?
To answer these questions, "no". The req.read() is going as fast as it can. (One could try to use sendfile, but that hardly matters here. And one could download chunks in parallel, e.g. 1st half & 2nd half, but that goes against TCP's attempt to measure bottleneck bandwidth; better to use a single connection.)
Your command line is not accepting multiple download filespecs, so there's not much opportunity to fork off worker processes.
Comment from James Draper (Sep 12, 2017 at 16:07): Thank you for your suggestions. What do you mean by 'bit rot' in this context?
In addition to J H's answer, I would suggest the following to improve clarity and performance:
- Using os.scandir instead of os.walk: os.scandir only scans the directory you specify (rather than the whole tree) and returns os.DirEntry objects, which have an is_file() method (your use of itertools.chain.from_iterable on the result of os.walk included directories as well as files) and a name property that you can use in filtering.
- Consolidating your filters so that you only need one pass through the results of os.scandir: in the code below, I've consolidated the 7 list constructions in your check_uniprot function to just 1. The associated filters are now in the dir_entry_filter function, below.
- Sorting the result set only once: there were a few redundant sorts between list constructions.
Below, I have a suggestion for how you could refactor the check_uniprot function to make it faster using the above suggestions. I have removed the docstring only to shorten the answer.
I've added a few other features (incl. checking for an empty list using not all_uniprot, and a try: ... except: ... block when calling float) that I would be happy to expand on in this answer if desired.
def check_uniprot(organism='Mouse', file_format='txt', archived=True):
    def dir_entry_filter(x):
        # Split the name on dots so both 'foo.txt' and 'foo.txt.gz' are
        # handled (os.path.splitext would only return the last suffix).
        name_parts = x.name.split('.')
        return x.is_file(follow_symlinks=False) and \
            file_format in name_parts and \
            archived == ('gz' in name_parts) and \
            'uniprot-proteome' in x.name and \
            organism in x.name and \
            x.name.count('-') == 3  # uniprot-proteome-<organism>-<timestamp>

    all_uniprot = list(filter(dir_entry_filter, os.scandir('.')))
    # DirEntry objects are not orderable, so sort on the filename.
    all_uniprot.sort(key=lambda entry: entry.name, reverse=True)
    if not all_uniprot:
        return None
    try:
        return float(all_uniprot[0].name.split('-')[-1].split('.')[0])
    except (ValueError, OverflowError):
        return None
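A quick way to exercise it from the download directory (the printed value is illustrative):

print(check_uniprot('Mouse', 'txt', archived=True))  # e.g. 1505232451.0, or None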