Background
This script can be used as a command-line interface (CLI) or as a sub-module in another program to download the latest UniProt proteome for a given taxon. Files are downloaded to the same directory as the script.
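It can be invoked along these lines (an illustrative call; the script's filename here is an assumption):

python uniprot_proteome_updater.py --organism Mouse --file_type txt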
Code
#!/usr/bin/env python
"UniProt Proteome Updater"
# Copyright James Draper 2017 MIT license
import argparse
import os
import gzip
import itertools
from urllib import request
from dateutil.parser import parse as dt_parse
def check_uniprot(organism='Mouse', file_format='txt', archived=True):
    """Return the latest time-stamp from the local UniProt proteomes.

    If there are no proteomes available, None is returned.

    Parameters
    ----------
    organism : str
        The taxon id for the species e.g. Mouse, Human, etc.
    file_format : str
        The desired file format e.g. txt or fasta.
    archived : bool
        If True, isolates gzipped files.

    Returns
    -------
    top_hit : float or None
        The latest time-stamp in the isolated list.
    """
    # Return a list with files in the same directory as the script.
    top = list(filter(lambda x: x[0] == '.', list(os.walk('.'))))
    # Flatten the top into a single list.
    top = list(itertools.chain.from_iterable(top[0][1:]))
    # Filter for files with the given file format.
    all_format = list(filter(lambda x: file_format in x.split('.'), top))
    all_format.sort()
    if archived:
        all_format = list(filter(lambda x: 'gz' in x.split('.'), all_format))
        all_format.sort()
    else:
        all_format = list(filter(lambda x: 'gz' not in x.split('.'), all_format))
        all_format.sort()
    # Filter for the files that contain 'uniprot-proteome'.
    all_uniprot = list(filter(lambda x: 'uniprot-proteome' in x, all_format))
    # Filter for the correctly formatted file.
    all_uniprot = list(filter(lambda x: len(x.split('-')) == 4, all_uniprot))
    # Filter for the specified organism.
    all_uniprot = list(filter(lambda x: organism in x, all_uniprot))
    # Sort the list in descending order.
    all_uniprot.sort(reverse=True)
    if len(all_uniprot) > 0:
        # Grab the top hit which should be the newest file.
        top_hit = all_uniprot[0]
        # Grab the timestamp.
        top_hit = top_hit.split('-')[-1].split('.')[0]
        top_hit = float(top_hit)
        return top_hit
    else:
        return None
def get_uniprot_proteome(organism='Mouse', file_format='txt', archived=True,
                         force=False):
    """Download the entire proteome for a given taxon.

    Allow 5-15 minutes for the download.

    Parameters
    ----------
    organism : str
        The taxon id for the species e.g. Mouse, Human, etc.
    file_format : str
        The desired file format e.g. txt or fasta.
    archived : bool
        If True, gzip the downloaded file.
    force : bool
        Forces the download even if the file is present.
    """
    # Load the terms into the query.
    query = "?query=organism:{0}&format={1}".format(organism, file_format)
    # Create the request string.
    url = "".join(["http://www.uniprot.org/uniprot/", query])
    # Make request.
    req = request.urlopen(url)
    # Grab the 'Last-Modified' string from req.info() then convert to datetime.
    last_modified = dt_parse(req.info()['Last-Modified']).replace(tzinfo=None)
    # Get the time stamp for the latest locally available proteome.
    check = check_uniprot(organism=organism, file_format=file_format,
                          archived=archived)
    if last_modified.timestamp() == check and force is False:
        print('Your UniProt Proteome is up to date.')
    else:
        print("UniProt Proteome is downloading. This may take a while.")
        time_stamp = str(last_modified.timestamp()).split('.')[0]
        front_term = 'uniprot-proteome'
        fn = '-'.join([front_term, organism, time_stamp])
        fn = '.'.join([fn, file_format])
        if archived:
            fn = '.'.join([fn, 'gz'])
            f = open(fn, 'wb')
            f.write(gzip.compress(req.read()))
        else:
            f = open(fn, 'wb')
            f.write(req.read())
        f.close()
        print('UniProt Proteome has been downloaded:', fn)
    return check
# Command-line interface
parser = argparse.ArgumentParser()
parser.add_argument("-o", "--organism",
                    type=str,
                    help="The desired organism.",
                    nargs='?',
                    const="Mouse",
                    default="Mouse")
parser.add_argument("-t", "--file_type",
                    type=str,
                    help="The desired file format.",
                    nargs="?",
                    const="txt",
                    default="txt")
parser.add_argument("-a", "--archived",
                    type=bool,
                    help="True will use gzip to archive your file.",
                    nargs="?",
                    const=True,
                    default=True)
parser.add_argument("-f", "--force",
                    type=bool,
                    help="Force the download even if the file is present.",
                    nargs="?",
                    const=True,
                    default=False)
args = parser.parse_args()

if __name__ == '__main__':
    get_uniprot_proteome(args.organism, args.file_type,
                         args.archived, args.force)
Questions
Is there any way to improve the performance?
Could multiprocessing or threading be applied anywhere?
Are there any other ways that I could generally improve this code?
All comments and suggestions welcome.
2 Answers
Thank you for sharing.
The docstrings are lovely. Kudos.
This comment is helpful:
# Get the time stamp for the latest locally available proteome.
check = check_uniprot(organism=organism, file_format=file_format,
                      archived=archived)
The comments above it are redundant. I advocate deleting them. Comments lie, as bit rot sets in. Too often, when people maintain code it drifts away from the (unchanged) comment. Don't restate in English what is already obvious in the (clear, well written) code.
parser.add_argument("-t", "--file_type",
type=str,
help="The desired file format.",
nargs="?",
const="txt",
default="txt")
This should allow 'txt' or 'fasta', only. Please refer to https://docs.python.org/3/library/argparse.html#choices
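For instance (a sketch; with choices, argparse rejects anything outside the listed values with a clear error message):

parser.add_argument("-t", "--file_type",
                    type=str,
                    choices=["txt", "fasta"],
                    help="The desired file format.",
                    nargs="?",
                    const="txt",
                    default="txt")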
Maybe the dateutil dependency is worth it:
last_modified = dt_parse(req.info()['Last-Modified']).replace(tzinfo=None)
Personally, I would kind of like to see an explicit date format here, and then datetime's strptime would suffice. Put another way, if a website's date format doesn't conform to the RFC, I would like to know about it.
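Something like this sketch, reusing req from the question and assuming the server sends the standard RFC 7231 date format (e.g. 'Tue, 12 Sep 2017 16:07:31 GMT'); a nonconforming date would then raise ValueError, which is exactly the visibility I want:

from datetime import datetime

last_modified = datetime.strptime(req.info()['Last-Modified'],
                                  '%a, %d %b %Y %H:%M:%S GMT')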
# Get the time stamp for the latest locally available proteome.
check = check_uniprot(organism=organism, file_format=file_format,
                      archived=archived)
check sounds more like a boolean than a timestamp.
It would be more natural to use positional than keyword=keyword arguments.
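That is, simply:

check = check_uniprot(organism, file_format, archived)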
"""Return the latest time-stamp from the local UniProt proteomes.
I find that unclear. Perhaps the docstring could spell out that it is a unixtime (seconds since 1970). I'm slightly surprised filenames don't use ISO8601, as that sorts nicely and is much more human-friendly.
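Something along these lines would do it (a sketch, reusing last_modified from the question; the colons are dropped so the filename is also legal on Windows):

# e.g. '2017-09-12T160731' -- sorts lexicographically by date.
time_stamp = last_modified.strftime('%Y-%m-%dT%H%M%S')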
    # Return a list with files in the same directory as the script.
    top = list(filter(lambda x: x[0] == '.', list(os.walk('.'))))
    # Flatten the top into a single list.
    top = list(itertools.chain.from_iterable(top[0][1:]))
    # Filter for files with the given file format.
    all_format = list(filter(lambda x: file_format in x.split('.'), top))
    all_format.sort()
I don't understand why glob.glob('*.' + file_format) wouldn't suffice. Ok, there's the '.gz' detail, but perhaps instead of a boolean archived parameter we'd prefer to use file_format.endswith('.gz').
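A sketch of the glob approach, assuming the function's parameters are in scope and the filename layout 'uniprot-proteome-&lt;organism&gt;-&lt;timestamp&gt;.&lt;format&gt;[.gz]' that get_uniprot_proteome() builds:

import glob

pattern = 'uniprot-proteome-{}-*.{}'.format(organism, file_format)
if archived:
    pattern += '.gz'
# Newest file first, since the timestamps are zero-padded-free unixtimes.
all_uniprot = sorted(glob.glob(pattern), reverse=True)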
    # Filter for the files that contain 'uniprot-proteome'.
    all_uniprot = list(filter(lambda x: 'uniprot-proteome' in x, all_format))
    # Filter for the correctly formatted file.
    all_uniprot = list(filter(lambda x: len(x.split('-')) == 4, all_uniprot))
    # Filter for the specified organism.
    all_uniprot = list(filter(lambda x: organism in x, all_uniprot))
I believe a single regex would accomplish all that in a clearer fashion.
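For example (a sketch, assuming the function's parameters are in scope and the filename layout 'uniprot-proteome-&lt;organism&gt;-&lt;unixtime&gt;.&lt;format&gt;[.gz]'):

import os
import re

name_re = re.compile(r'^uniprot-proteome-{}-(\d+)\.{}{}$'.format(
    re.escape(organism), re.escape(file_format),
    r'\.gz' if archived else ''))
# One pass: match each filename and keep the captured timestamps.
matches = (name_re.match(name) for name in os.listdir('.'))
timestamps = [float(m.group(1)) for m in matches if m]
top_hit = max(timestamps) if timestamps else None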
    if len(all_uniprot) > 0:
        # Grab the top hit which should be the newest file.
        top_hit = all_uniprot[0]
        # Grab the timestamp.
        top_hit = top_hit.split('-')[-1].split('.')[0]
        top_hit = float(top_hit)
        return top_hit
    else:
        return None
This code is tightly coupled to the filtering code above - you assume by construction that if at least one file survived filtering, it will be returned by that clause. In other words, we never execute beyond the if. It would be simpler to omit the else and unconditionally return None. Another approach that costs a few more cycles but yields easier to understand code is to init top_hit to None, scan every entry in the (ascending) sorted list, conditionally assigning a new candidate return value to top_hit, and finally return top_hit. It will have the best value found, either None or the file with largest timestamp.
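Sketched out, with the filtered filenames as input (the helper name latest_timestamp is mine):

def latest_timestamp(all_uniprot):
    # Scan the ascending sorted list; keep the largest timestamp seen.
    top_hit = None
    for name in sorted(all_uniprot):
        candidate = float(name.split('-')[-1].split('.')[0])
        if top_hit is None or candidate > top_hit:
            top_hit = candidate
    return top_hit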
- Is there any way to improve the performance?
- Could multiprocessing or threading be applied anywhere?
To answer these questions, "no". The req.read() is going as fast as it can. (One could try to use sendfile, but that hardly matters here. And one could download chunks in parallel, e.g. 1st half & 2nd half, but that goes against TCP's attempt to measure bottleneck bandwidth; better to use a single connection.)
Your command line is not accepting multiple download filespecs, so there's not much opportunity to fork off worker processes.
Comment from James Draper (Sep 12, 2017 at 16:07): Thank you for your suggestions. What do you mean by 'bit rot' in this context?
In addition to J H's answer, I would suggest the following to improve clarity and performance:
- Using os.scandir instead of os.walk: os.scandir only scans the directory you specify (rather than the whole tree) and returns os.DirEntry objects, which have an is_file() method (your use of itertools.chain.from_iterable on the result of os.walk included directories as well as files) and a name property that you can use in filtering.
- Consolidating your filters so that you only need one pass through the results of os.scandir: in the code below, I've consolidated the 7 list constructions in your check_uniprot function to just 1. The associated filters are now in the dir_entry_filter function, below.
- Sorting the result set only once: there were a few redundant sorts between list constructions.
Below, I have a suggestion for how you could refactor the check_uniprot function to make it faster using the above suggestions. I have removed the docstring only to shorten the answer.
I've added a few other features (incl. checking for an empty list using not all_uniprot, and a try: ... except: ... block when calling float) that I would be happy to expand on in this answer if desired.
def check_uniprot(organism='Mouse', file_format='txt', archived=True):
    def dir_entry_filter(x):
        # Split the name on dots so both 'foo.txt' and 'foo.txt.gz' are
        # handled (os.path.splitext would only return the last suffix).
        name_parts = x.name.split('.')
        return x.is_file(follow_symlinks=False) and \
            file_format in name_parts and \
            archived == ('gz' in name_parts) and \
            'uniprot-proteome' in x.name and \
            organism in x.name and \
            x.name.count('-') == 3  # uniprot-proteome-<organism>-<timestamp>

    all_uniprot = list(filter(dir_entry_filter, os.scandir('.')))
    # DirEntry objects are not orderable, so sort on the filename.
    all_uniprot.sort(key=lambda entry: entry.name, reverse=True)
    if not all_uniprot:
        return None
    try:
        return float(all_uniprot[0].name.split('-')[-1].split('.')[0])
    except (ValueError, OverflowError):
        return None
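A quick way to exercise it from the download directory (the printed value is illustrative):

print(check_uniprot('Mouse', 'txt', archived=True))  # e.g. 1505232451.0, or None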