We have about 14 GB of files, and each file is about 150 KB.
I'm writing a script to upload them to Azure Blob storage and would like to run an upload()
function (each call with its own file to upload) in 5 threads.
Here is how I implemented it, and it seems to work, but I still have doubts about how correct this code is.
...
class Loader:
    def __init__(self):
        self.account = 'accname'
        #self.container = 'userdata'
        self.container = 'bar1'
        key = 'DQ4***A=='
        self.base_url = 'http://' + self.account + '.blob.core.windows.net'
        self.blob = BlobService(account_name=self.account, account_key=key)

    def uploader(self, filepath, userfile, locker):
        # Worker: upload a single file, then free one semaphore slot.
        print 'Uploading file: {}'.format(userfile)
        self.blob.put_block_blob_from_path(self.container,
                                           userfile,
                                           filepath,
                                           max_connections=1,
                                           max_retries=5,
                                           retry_wait=1.0)
        locker.release()

    def upload(self, path):
        print('Upload files from {} to {}'.format(path, self.base_url))
        # At most 5 uploads may run concurrently.
        locker = threading.BoundedSemaphore(5)
        for root, dirs, files in os.walk(path):
            print root
            for userfile in files:
                #print 'Uploading {}'.format(os.path.join(root, userfile))
                locker.acquire()
                t = threading.Thread(target=self.uploader,
                                     args=(os.path.join(root, userfile), userfile, locker))
                t.start()
...
And the script in action:
$ ./storage_upload.py -u --path /tmp/bartest/
Upload files from /tmp/bartest/ to http://accname.blob.core.windows.net
/tmp/bartest/
Uploading file: 15.txt
Uploading file: 12.txt
Uploading file: 7.txt
Uploading file: 13.txt
Uploading file: 19.txt
1 Answer
Looks okay, if a bit sloppy with the different formatting for some print x vs. print(x) calls (the latter being preferred, really); you probably should also use new-style classes, i.e. class Loader(object):.
Other than that, the main concern I'd have is that the semaphore should be protected against exceptions. This is mostly a concern for bigger scripts, but it's a good habit regardless: the release method should be called regardless of whether anything else in the thread raised an exception, otherwise a failed upload never frees its slot and upload() can end up stuck in acquire(). For a one-off script that is probably tolerable, but it is easy to avoid.
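For example, a try/finally around the upload guarantees the slot is always freed. This is a minimal sketch that reuses the uploader method from the question unchanged apart from the added try/finally:

    def uploader(self, filepath, userfile, locker):
        try:
            print 'Uploading file: {}'.format(userfile)
            self.blob.put_block_blob_from_path(self.container,
                                               userfile,
                                               filepath,
                                               max_connections=1,
                                               max_retries=5,
                                               retry_wait=1.0)
        finally:
            # Runs whether or not the upload raised, so the semaphore slot is
            # always returned and upload() cannot block forever in acquire().
            locker.release()

Note that an exception raised inside a Thread target is only printed to stderr and the thread dies silently, so you may also want an except clause that logs which file failed.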
You should probably also check whether the threading actually improves throughput, considering Python's global interpreter lock.
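One rough way to check (this timing wrapper is my own sketch, not part of the original script) is to time a run with 5 workers against a single-threaded run. Since upload() never joins its threads, the sketch collects them in a list first so the measurement covers the whole job; it reuses the Loader class from the question, while timed_upload and max_workers are invented names for the comparison:

    import os
    import threading
    import time

    def timed_upload(path, max_workers=5):
        # Same walk-and-spawn logic as Loader.upload(), but the threads are
        # kept and joined so the elapsed time covers every upload.
        loader = Loader()
        locker = threading.BoundedSemaphore(max_workers)
        threads = []
        start = time.time()
        for root, dirs, files in os.walk(path):
            for userfile in files:
                locker.acquire()
                t = threading.Thread(target=loader.uploader,
                                     args=(os.path.join(root, userfile), userfile, locker))
                t.start()
                threads.append(t)
        for t in threads:
            t.join()
        print('{} worker(s): {:.1f} s'.format(max_workers, time.time() - start))

Compare timed_upload(path, 5) with timed_upload(path, 1); the GIL is released while a thread waits on the network, so several workers usually do help for uploads, but measuring is the only way to be sure.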