My code accesses an FTP server, downloads a .zip file, and pushes the file contents as .gz to an AWS S3 bucket.
    import boto3
    import ftplib
    import gzip
    import io
    import zipfile

    def _move_to_s3(fname):
        host = 'some_host'
        user = 'some_user'
        passwd = 'some_password'
        tmp_dir = '/tmp/'
        tmp_fname = fname.split('.')[0] + '.gz'
        target_bucket = 'some_bucket'
        s3 = boto3.resource('s3')
        try:
            ftp = ftplib.FTP(host, user, passwd)
            with io.BytesIO() as data, gzip.open(tmp_dir + tmp_fname, 'wb') as gz:
                ftp.retrbinary('RETR ' + fname, data.write)
                gz.write(data.getvalue())
            s3.Object(target_bucket, tmp_fname).put(Body=open(tmp_dir + tmp_fname, 'rb'))
        except Exception as e:
            print e
        finally:
            ftp.quit()

    if __name__ == '__main__':
        _move_to_s3('some_file.zip')
I welcome any comments but my main points of interest are:
- Assuming the source zip contains only one text file, is it OK to write to the gzip file from a stream containing the contents of the downloaded zip file?
- I think I should change the code to hold the gzip in memory, without generating temporary files.
1 Answer
Looks okay to me in general. I'd prefer to pass in configuration separately, parsed from a config file or command line arguments, but if it's a one-off script it's probably okay this way.
The temporary files aren't deleted as far as I see; consider using the tempfile module to allocate and delete temporary files automatically.
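A minimal sketch of that (the upload step is elided; the names here are placeholders, not from your code):

    import tempfile

    # The file is created for us and deleted automatically when the
    # with-block exits, so nothing is left behind in /tmp.
    with tempfile.NamedTemporaryFile(suffix='.gz') as tmp:
        tmp.write(b'gzipped payload goes here')
        tmp.seek(0)
        # ... hand tmp to boto3 as the upload body ...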
Also, while the string-to-bytes conversion works as is, it's also brittle when used with large files, e.g. larger than the amount of RAM available, because it will store the whole downloaded file in memory. I'd suggest passing in gz.write directly instead:
    with gzip.open(tmp_dir + tmp_fname, 'wb') as gz:
        ftp.retrbinary('RETR ' + fname, gz.write)
Unless I'm mistaken, that should keep the bytes in order; if not, you could still pass in a function that converts only the downloaded chunk to bytes and calls gz.write on the converted value.
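For instance, a hypothetical wrapper along these lines (assuming gz is the open gzip file from the snippet above):

    def write_converted(chunk):
        # retrbinary hands each downloaded block to this callback;
        # convert here if your setup yields str rather than bytes.
        gz.write(bytes(chunk))

    ftp.retrbinary('RETR ' + fname, write_converted)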
Regarding your questions:
- Since there's no extraction of data from the zip file, I don't see how that matters. If you were to extract something (a single file) from the zip file, then yes, gzipping that will of course be okay. Make sure that the gzipped files are how you expect them (i.e. a single compressed text file) and that you don't need the file name or other attributes in the original zip archive. At the moment you basically upload some_file.zip.gz to S3, as in, two compressions nested. Probably not what you want (see the sketch after this list).
- As argued above, that's probably not advisable unless you know that the data fits into memory. If it does, sure, why not.
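If you do want the inner file rather than a nested compression, a rough sketch (assuming the archive holds exactly one member) could look like:

    import gzip
    import zipfile

    with zipfile.ZipFile('some_file.zip') as zf:
        member = zf.namelist()[0]  # assumes a single text file inside
        with zf.open(member) as src, gzip.open(member + '.gz', 'wb') as gz:
            gz.write(src.read())   # fine as long as the file fits in memory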
Some more remarks:
- The zipfile import is unused, as mentioned above.
- Prefer os.path.join to concatenating strings.
- fname/passwd instead of filename/password seems like pointless truncation.
- print should be used as a function to be forwards compatible, i.e. print(e) instead.
- The result of tmp_dir + tmp_fname should be reused instead of writing the same expression twice.
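Pulling these remarks together, a possible revision might look like the following sketch (configuration passed in as parameters; the parameter names are mine, not from the original):

    import boto3
    import ftplib
    import gzip
    import tempfile

    def move_to_s3(filename, host, user, password, target_bucket):
        s3 = boto3.resource('s3')
        key = filename.split('.')[0] + '.gz'
        ftp = ftplib.FTP(host, user, password)
        try:
            # tempfile allocates and deletes the temporary file for us
            with tempfile.NamedTemporaryFile(suffix='.gz') as tmp:
                with gzip.open(tmp.name, 'wb') as gz:
                    # stream each downloaded chunk straight into the gzip writer
                    ftp.retrbinary('RETR ' + filename, gz.write)
                tmp.seek(0)
                s3.Object(target_bucket, key).put(Body=tmp)
        except Exception as e:
            print(e)  # print as a function, forwards compatible
        finally:
            ftp.quit()

This still goes through a temporary file, but cleanup is left to tempfile; whether to keep everything in memory instead depends on the file sizes, as discussed above.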