My code accesses an FTP server, downloads a .zip file, and pushes the file contents as .gz to an AWS S3 bucket.
    import boto3
    import ftplib
    import gzip
    import io
    import zipfile

    def _move_to_s3(fname):
        host = 'some_host'
        user = 'some_user'
        passwd = 'some_password'
        tmp_dir = '/tmp/'
        tmp_fname = fname.split('.')[0] + '.gz'
        target_bucket = 'some_bucket'
        s3 = boto3.resource('s3')
        try:
            ftp = ftplib.FTP(host, user, passwd)
            with io.BytesIO() as data, gzip.open(tmp_dir + tmp_fname, 'wb') as gz:
                ftp.retrbinary('RETR ' + fname, data.write)
                gz.write(data.getvalue())
            s3.Object(target_bucket, tmp_fname).put(Body=open(tmp_dir + tmp_fname, 'rb'))
        except Exception as e:
            print e
        finally:
            ftp.quit()

    if __name__ == '__main__':
        _move_to_s3('some_file.zip')
I welcome any comments but my main points of interest are:
- Assuming the source zip contains only one text file, is it OK to write to the gzip file from a stream containing the contents of the downloaded zip file?
- I think I should change the code to hold the gzip in memory, without generating temporary files.
1 Answer
Looks okay to me in general. I'd prefer to pass in configuration separately, parsed from a config file or command line arguments, but if it's a one-off script it's probably okay this way.
The temporary files aren't deleted as far as I see; consider using the tempfile module to allocate and delete temporary files automatically.
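A minimal sketch of that (the upload step is elided; the names here are placeholders, not from your code):

    import tempfile

    # The file is created for us and deleted automatically when the
    # with-block exits, so nothing is left behind in /tmp.
    with tempfile.NamedTemporaryFile(suffix='.gz') as tmp:
        tmp.write(b'gzipped payload goes here')
        tmp.seek(0)
        # ... hand tmp to boto3 as the upload body ...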
Also, while the string-to-bytes conversion works as is, it's also brittle when used with large files, e.g. larger than the amount of RAM available, because it will store the whole downloaded file in memory. I'd suggest passing in gz.write directly instead:
    with gzip.open(tmp_dir + tmp_fname, 'wb') as gz:
        ftp.retrbinary('RETR ' + fname, gz.write)
Unless I'm mistaken, that should keep the bytes in order; if not, you could still pass in a function that converts only the downloaded chunk to bytes and calls gz.write on the converted value.
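For instance, a hypothetical wrapper along these lines (assuming gz is the open gzip file from the snippet above):

    def write_converted(chunk):
        # retrbinary hands each downloaded block to this callback;
        # convert here if your setup yields str rather than bytes.
        gz.write(bytes(chunk))

    ftp.retrbinary('RETR ' + fname, write_converted)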
Regarding your questions:
- Since there's no extraction of data from the zip file, I don't see how that matters. If you were to extract something (a single file) from the zip file, then yes, gzipping that will of course be okay. Make sure that the gzipped files are how you expect them (i.e. a single compressed text file) and that you don't need the file name or other attributes in the original zip archive. At the moment you basically upload some_file.zip.gz to S3, as in, two compressions nested. Probably not what you want (see the sketch after this list).
- As argued above, that's probably not advisable unless you know that the data fits into memory. If it does, sure, why not.
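If you do want the inner file rather than a nested compression, a rough sketch (assuming the archive holds exactly one member) could look like:

    import gzip
    import zipfile

    with zipfile.ZipFile('some_file.zip') as zf:
        member = zf.namelist()[0]  # assumes a single text file inside
        with zf.open(member) as src, gzip.open(member + '.gz', 'wb') as gz:
            gz.write(src.read())   # fine as long as the file fits in memory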
Some more remarks:
- The zipfile import is unused, as mentioned above.
- Prefer os.path.join to concatenating strings.
- fname/passwd instead of filename/password seems like pointless truncation.
- print should be used as a function to be forwards compatible, i.e. print(e) instead.
- The result of tmp_dir + tmp_fname should be reused instead of writing the same expression twice.
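Pulling these remarks together, a possible revision might look like the following sketch (configuration passed in as parameters; the parameter names are mine, not from the original):

    import boto3
    import ftplib
    import gzip
    import tempfile

    def move_to_s3(filename, host, user, password, target_bucket):
        s3 = boto3.resource('s3')
        key = filename.split('.')[0] + '.gz'
        ftp = ftplib.FTP(host, user, password)
        try:
            # tempfile allocates and deletes the temporary file for us
            with tempfile.NamedTemporaryFile(suffix='.gz') as tmp:
                with gzip.open(tmp.name, 'wb') as gz:
                    # stream each downloaded chunk straight into the gzip writer
                    ftp.retrbinary('RETR ' + filename, gz.write)
                tmp.seek(0)
                s3.Object(target_bucket, key).put(Body=tmp)
        except Exception as e:
            print(e)  # print as a function, forwards compatible
        finally:
            ftp.quit()

This still goes through a temporary file, but cleanup is left to tempfile; whether to keep everything in memory instead depends on the file sizes, as discussed above.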