
I wrote a Python script to download files using multiple source IP addresses. Kindly suggest any improvements.

import cgi
import os
import posixpath
import Queue
import threading
import urllib
import urlparse
import random
import re
import shutil
import time
import requests
import requests_toolbelt

def get_IPs():
    """Returns all available IP addresses in a list."""
    # TODO: Windows only. Other options?
    out = []
    for i in os.popen('ipconfig'):
        i = i.strip()
        if i.startswith('IP'):
            out.append(i.rsplit(' ', 1)[-1])
    return out

def get_info(url):
    """Returns name and size of file to be downloaded."""
    try:
        resp = requests.head(url, allow_redirects=True)
        name = cgi.parse_header(resp.headers['content-disposition'])[1]['filename']
    except KeyError:
        path = urlparse.urlsplit(url).path
        name = posixpath.basename(path)
        name = urllib.unquote_plus(name)
    size = int(resp.headers['content-length'])
    return name, size

def worker(url, session, ud, part, size):
    """Downloads a part of the file specified by 'part' parameter."""
    # TODO: optimal tries, timeout?
    for _ in xrange(2):
        try:
            open('%s/%04d' % (ud, part), 'wb').write(
                session.get(url, timeout=(2, 7), headers={'range': 'bytes=%s-%s' % (
                    part*chunk, min(size, part*chunk + chunk - 1))}).content)
            break
        except:
            pass
    else:
        # both attempts failed: retry the part with a different session
        worker(url, sessions_queue.get(), ud, part, size)
    sessions_queue.put(session)

def summary(name, size, elapsed):
    """Prints summary of the download after it is completed."""
    print (
        '--\n'
        '%s download completed.\n'
        'Time elapsed: %.2fs\n'
        'Average download speed: %.2f MB/s\n'
        '--' % (name, elapsed, size/elapsed/2**20))

def download(url):
    """Downloads the file pointed to by 'url' parameter."""
    start = time.clock()
    name, size = get_info(url)
    # random id of length 20
    ud = '%0x' % random.getrandbits(80)
    os.mkdir(ud)
    threads = []
    for i in xrange(size/chunk + (size%chunk != 0)):
        t = threading.Thread(target=worker, args=(url, sessions_queue.get(), ud, i, size))
        threads.append(t)
        t.start()
    # characters \/:*?"<>| not allowed in filenames in Windows
    name = re.sub(r'[\\/:*?"<>|]', '_', name)
    # TODO: check if a file is already present with same name
    out = open(name, 'ab')
    for i, t in enumerate(threads):
        t.join()
        out.write(open('%s/%04d' % (ud, i), 'rb').read())
    summary(name, size, time.clock() - start)
    shutil.rmtree(ud)

def main():
    IPs = get_IPs()
    print len(IPs), 'IPs available.'
    for ip in IPs:
        adapter = requests_toolbelt.adapters.SourceAddressAdapter(ip)
        session = requests.Session()
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        sessions_queue.put(session)
    while True:
        threading.Thread(target=download, args=(raw_input(),)).start()

if __name__ == '__main__':
    sessions_queue = Queue.Queue()
    KB = 1024
    MB = 1024*KB
    # TODO: optimal size?
    chunk = 100*KB
    main()

I am using it with about 100 IP addresses on my Ethernet connection, each with about 100 KB/s of speed. What would be the optimal configuration (number of threads, chunk size)?

asked Feb 23, 2016 at 8:29
  • Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. Commented Feb 23, 2016 at 12:07

1 Answer


You could rewrite your get_IPs function to be a list comprehension instead:

return [i.rsplit(' ', 1)[-1]
        for i in map(str.strip, os.popen('ipconfig'))
        if i.startswith('IP')]

map calls strip on every line of output from ipconfig, and then you can iterate over the results, ignoring any values that don't start with "IP".

In worker you're using a loop to retry after timeouts, but the count of 2 is arbitrary. Use a named constant here so it's clear what you're doing and easy to change later:
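For example (a sketch; MAX_TRIES is an illustrative name placed alongside chunk at module level, and the loop body is unchanged from your worker):

MAX_TRIES = 2

for _ in xrange(MAX_TRIES):
    try:
        open('%s/%04d' % (ud, part), 'wb').write(
            session.get(url, timeout=(2, 7), headers={'range': 'bytes=%s-%s' % (
                part*chunk, min(size, part*chunk + chunk - 1))}).content)
        break
    except:
        pass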

You also open files in multiple places, but you should always try to use with, known as a context manager. It automatically closes the file even if an error is raised; it's the safest way to open a file.

with open(filepath) as filename:
    execute_code_with(filename)
print("Done with filename")

Once you leave that indented block, the file is automatically closed. No need to even call filename.close().
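Applied to your worker, the part-file write might look like this (a sketch reusing your names, with nothing else changed):

with open('%s/%04d' % (ud, part), 'wb') as f:
    f.write(session.get(url, timeout=(2, 7), headers={
        'range': 'bytes=%s-%s' % (part*chunk, min(size, part*chunk + chunk - 1))
    }).content)

The same goes for the part reads and the output file in download.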

answered Feb 23, 2016 at 10:57
  • Thanks for the answer. I considered a list comprehension for the get_IPs function, but I thought it was too much of a stretch. I added a MAX_TRIES variable (or constant ;) ). As there are no references to those opened files, I thought the garbage collector closes them automatically, but I just read in the docs that we should not rely on that and should always close files. Commented Feb 23, 2016 at 11:18
  • Yes, GC often will, but not reliably, and if errors arise the files may not be cleaned up properly. The key value of with is that it ensures files are closed. Also, I like list comprehensions, especially when they're abstracted into a small function anyway :P But that's more of a personal opinion; I showed it partially in case you didn't know it was possible. Commented Feb 23, 2016 at 11:22
