
I wrote a Python script to download files using multiple source IP addresses. Kindly suggest any improvements.

import cgi
import os
import posixpath
import Queue
import threading
import urllib
import urlparse
import random
import re
import shutil
import time
import requests
import requests_toolbelt

def get_IPs():
    """Returns all available IP addresses in a list."""
    # TODO: Windows only. Other options?
    out = []
    for i in os.popen('ipconfig'):
        i = i.strip()
        if i.startswith('IP'):
            out.append(i.rsplit(' ', 1)[-1])
    return out

def get_info(url):
    """Returns name and size of file to be downloaded."""
    try:
        resp = requests.head(url, allow_redirects=True)
        name = cgi.parse_header(resp.headers['content-disposition'])[1]['filename']
    except KeyError:
        path = urlparse.urlsplit(url).path
        name = posixpath.basename(path)
        name = urllib.unquote_plus(name)
    size = int(resp.headers['content-length'])
    return name, size

def worker(url, session, ud, part, size):
    """Downloads a part of the file specified by 'part' parameter."""
    # TODO: optimal tries, timeout?
    for _ in xrange(2):
        try:
            open('%s/%04d' % (ud, part), 'wb').write(
                session.get(url, timeout=(2, 7), headers={'range': 'bytes=%s-%s' % (
                    part*chunk, min(size, part*chunk + chunk - 1))}).content)
            break
        except:
            pass
    else:
        # both attempts failed: retry the part with a different session
        worker(url, sessions_queue.get(), ud, part, size)
    sessions_queue.put(session)

def summary(name, size, elapsed):
    """Prints summary of the download after it is completed."""
    print (
        '--\n'
        '%s download completed.\n'
        'Time elapsed: %.2fs\n'
        'Average download speed: %.2f MB/s\n'
        '--' % (name, elapsed, size/elapsed/2**20))

def download(url):
    """Downloads the file pointed to by 'url' parameter."""
    start = time.clock()
    name, size = get_info(url)
    # random id of length 20
    ud = '%0x' % random.getrandbits(80)
    os.mkdir(ud)
    threads = []
    for i in xrange(size/chunk + (size%chunk != 0)):
        t = threading.Thread(target=worker, args=(url, sessions_queue.get(), ud, i, size))
        threads.append(t)
        t.start()
    # characters \/:*?"<>| not allowed in filenames in Windows
    name = re.sub(r'[\\/:*?"<>|]', '_', name)
    # TODO: check if a file is already present with same name
    out = open(name, 'ab')
    for i, t in enumerate(threads):
        t.join()
        out.write(open('%s/%04d' % (ud, i), 'rb').read())
    summary(name, size, time.clock() - start)
    shutil.rmtree(ud)

def main():
    IPs = get_IPs()
    print len(IPs), 'IPs available.'
    for ip in IPs:
        adapter = requests_toolbelt.adapters.SourceAddressAdapter(ip)
        session = requests.Session()
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        sessions_queue.put(session)
    while True:
        threading.Thread(target=download, args=(raw_input(),)).start()

if __name__ == '__main__':
    sessions_queue = Queue.Queue()
    KB = 1024
    MB = 1024*KB
    # TODO: optimal size?
    chunk = 100*KB
    main()

I am using it with about 100 IP addresses on my Ethernet connection, each with about 100 KB/s of speed. What would be the optimal configuration (number of threads, chunk size)?

asked Feb 23, 2016 at 8:29
  • Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. Commented Feb 23, 2016 at 12:07

1 Answer


You could rewrite your get_IPs function to be a list comprehension instead:

return [i.rsplit(' ', 1)[-1]
        for i in map(str.strip, os.popen('ipconfig'))
        if i.startswith('IP')]

map calls strip on every line of output from ipconfig, and then you can iterate over the results, ignoring any values that don't start with "IP".

In worker you're using a loop to retry after timeouts, but the count of 2 is arbitrary. Use a named constant here so it's clear what you're doing and easy to change later:
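For example (a sketch; MAX_TRIES is an illustrative name placed alongside chunk at module level, and the loop body is unchanged from your worker):

MAX_TRIES = 2

for _ in xrange(MAX_TRIES):
    try:
        open('%s/%04d' % (ud, part), 'wb').write(
            session.get(url, timeout=(2, 7), headers={'range': 'bytes=%s-%s' % (
                part*chunk, min(size, part*chunk + chunk - 1))}).content)
        break
    except:
        pass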

You also open files in multiple places, but you should always try to use with, known as a context manager. It automatically closes the file even if an error is raised; it's the safest way to open a file.

with open(filepath) as filename:
    execute_code_with(filename)
print("Done with filename")

Once you leave that indented block, the file is automatically closed. No need to even call filename.close().
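Applied to your worker, the part-file write might look like this (a sketch reusing your names, with nothing else changed):

with open('%s/%04d' % (ud, part), 'wb') as f:
    f.write(session.get(url, timeout=(2, 7), headers={
        'range': 'bytes=%s-%s' % (part*chunk, min(size, part*chunk + chunk - 1))
    }).content)

The same goes for the part reads and the output file in download.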

answered Feb 23, 2016 at 10:57
  • Thanks for the answer. I considered a list comprehension for the get_IPs function, but I thought it was too much of a stretch. I added a MAX_TRIES variable (or constant ;) ). As there are no references to those opened files, I thought the garbage collector closes them automatically, but I just read in the docs that we should not rely on that and should always close files. Commented Feb 23, 2016 at 11:18
  • Yes, GC often will, but not reliably, and if errors arise the files may not be cleaned up properly. The key value of with is that it ensures files are closed. Also, I like list comprehensions, especially when they're abstracted into a small function anyway :P But that's more of a personal opinion; I showed it partially in case you didn't know it was possible. Commented Feb 23, 2016 at 11:22
