We have about 14 GB of files, and each file is about 150 KB.
I'm writing a script to upload them to Azure Blob storage and would like to run an upload()
function (each call with its own file to upload) in 5 threads.
Here is how I implemented it, and it seems to work, but I still have doubts about how correct this code is.
...
class Loader:
    def __init__(self):
        self.account = 'accname'
        #self.container = 'userdata'
        self.container = 'bar1'
        key = 'DQ4***A=='
        self.base_url = 'http://' + self.account + '.blob.core.windows.net'
        self.blob = BlobService(account_name=self.account, account_key=key)

    def uploader(self, filepath, userfile, locker):
        # Worker: upload a single file, then free one semaphore slot.
        print 'Uploading file: {}'.format(userfile)
        self.blob.put_block_blob_from_path(self.container,
                                           userfile,
                                           filepath,
                                           max_connections=1,
                                           max_retries=5,
                                           retry_wait=1.0)
        locker.release()

    def upload(self, path):
        print('Upload files from {} to {}'.format(path, self.base_url))
        # At most 5 uploads may run concurrently.
        locker = threading.BoundedSemaphore(5)
        for root, dirs, files in os.walk(path):
            print root
            for userfile in files:
                #print 'Uploading {}'.format(os.path.join(root, userfile))
                locker.acquire()
                t = threading.Thread(target=self.uploader,
                                     args=(os.path.join(root, userfile), userfile, locker))
                t.start()
...
And the script in action:
$ ./storage_upload.py -u --path /tmp/bartest/
Upload files from /tmp/bartest/ to http://accname.blob.core.windows.net
/tmp/bartest/
Uploading file: 15.txt
Uploading file: 12.txt
Uploading file: 7.txt
Uploading file: 13.txt
Uploading file: 19.txt
1 Answer
Looks okay, if a bit sloppy with the different formatting for some print x vs. print(x) calls (the latter being preferred, really); you probably should also use new-style classes, i.e. class Loader(object):.
Other than that, the main concern I'd have is that the semaphore should be protected against exceptions. This is mostly a concern for bigger scripts, but it's a good habit regardless: the release method should be called regardless of whether anything else in the thread raised an exception, otherwise a failed upload never frees its slot and upload() can end up stuck in acquire(). For a one-off script that is probably tolerable, but it is easy to avoid.
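For example, a try/finally around the upload guarantees the slot is always freed. This is a minimal sketch that reuses the uploader method from the question unchanged apart from the added try/finally:

    def uploader(self, filepath, userfile, locker):
        try:
            print 'Uploading file: {}'.format(userfile)
            self.blob.put_block_blob_from_path(self.container,
                                               userfile,
                                               filepath,
                                               max_connections=1,
                                               max_retries=5,
                                               retry_wait=1.0)
        finally:
            # Runs whether or not the upload raised, so the semaphore slot is
            # always returned and upload() cannot block forever in acquire().
            locker.release()

Note that an exception raised inside a Thread target is only printed to stderr and the thread dies silently, so you may also want an except clause that logs which file failed.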
You should probably also check whether the threading actually improves throughput, considering Python's global interpreter lock.
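One rough way to check (this timing wrapper is my own sketch, not part of the original script) is to time a run with 5 workers against a single-threaded run. Since upload() never joins its threads, the sketch collects them in a list first so the measurement covers the whole job; it reuses the Loader class from the question, while timed_upload and max_workers are invented names for the comparison:

    import os
    import threading
    import time

    def timed_upload(path, max_workers=5):
        # Same walk-and-spawn logic as Loader.upload(), but the threads are
        # kept and joined so the elapsed time covers every upload.
        loader = Loader()
        locker = threading.BoundedSemaphore(max_workers)
        threads = []
        start = time.time()
        for root, dirs, files in os.walk(path):
            for userfile in files:
                locker.acquire()
                t = threading.Thread(target=loader.uploader,
                                     args=(os.path.join(root, userfile), userfile, locker))
                t.start()
                threads.append(t)
        for t in threads:
            t.join()
        print('{} worker(s): {:.1f} s'.format(max_workers, time.time() - start))

Compare timed_upload(path, 5) with timed_upload(path, 1); the GIL is released while a thread waits on the network, so several workers usually do help for uploads, but measuring is the only way to be sure.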