I wrote a script for Python 2.78 which is supposed to download and save a file from a given URL using the requests library.
In case that a connection to the server can be established and a valid response is received, the response (e.g. a HTML file) should be downloaded to hard disk.
If the server gives a bad HTTP response code (anything but 200) or no connection could be established at all, a revealing error message should be printed, giving the specific type of error (e.g. 'connection refused' or 'HTTP 400 response code'. In this case, the download attempt should be retried until a predefined maximum of download tries is exceeded.
The URL and other connection parameters such as connection timeout, read timeout, maximum download attempts etc. as well as the location and filename for downloading the file can be specified on top of the script within the CONFIGURATION
section. There can also be found some example URLs which mock bad server behaviour (based on httpbin).
The script works for me as is. However, I get the impression that it is overly complicated and includes too much nesting of the function which reduces readability. Moreover, the two configuration dictionaries are passed through several functions which I have the impression is hard to follow. Any suggestions on how to improve this code in terms of readability, simplicity or good coding style is appreciated!
###########
# IMPORTS #
###########
# file system browsing
import glob, os
# http services / file downloading
import requests
from requests.auth import HTTPBasicAuth
# sleeping
import time
###########
# CONFIGURATION #
###########
# get path from which the script is being run
import os
path = os.path.dirname(os.path.abspath(__file__))
# useful test urls:
# http://httpbin.org/basic-auth/user/passwd <- returns a valid HTTP 200 response that can be downloaded and saved as html
# http://httpbin.org/delay/n <- returns response after n sec of delay (to create a timeout)
# http://httpbin.org/status/404 <- returns a HTTP 404 response
# connection parameters
connectionParameters = {'url' : r'http://httpbin.org/basic-auth/user/passwd',
'user' : 'user',
'password' : 'passwd',
'connectTimeout' : 10,
'readTimeout' : 5,
'maxAttempts' : 3,
'secondsBetweenAttempts' : 3}
# download parameters
downloadParameters = {'folder' : r'{0}'.format(path),
'filename' : 'downloadedFile.html'}
##################
# Functions #
##################
# create custom exception for the case that the maximum of download tries is exceeded
class MaxAttemptsException(requests.RequestException):
pass
# clean up the folder with the downloaded files so that old ones are not checked again in case that the download fails
def cleanUpDownloadFolder(downloadParameters):
print('cleaning up download folder: {0}\n'.format(downloadParameters['folder']))
os.chdir(downloadParameters['folder'])
[[os.remove(r'{0}\{1}'.format(downloadParameters['folder'], filename)) for filename in glob.glob(extension)] for extension in ['*.*']]
return
def getResponseFromServer(connectionParameters, downloadParameters):
try:
response = requests.get(url=connectionParameters['url'], auth=HTTPBasicAuth(connectionParameters['user'], connectionParameters['password']), timeout=(connectionParameters['connectTimeout'], connectionParameters['readTimeout']), verify=True)
success = checkHTTPresponseStatus(response, connectionParameters, downloadParameters)
return success
# if no HTTP response from server, catch connection error and return None response
except requests.RequestException as e:
print('There was an error while connecting to the server. Error message:')
print('{0}\n'.format(e.message))
success = False
return success
def sleepBetweenAttempts(connectionParameters):
print('Waiting {0} seconds before next try...\n'.format(connectionParameters['secondsBetweenAttempts']))
time.sleep(connectionParameters['secondsBetweenAttempts'])
return
def increaseAttempts(attempt):
attempt += 1
return attempt
def saveFile(response, downloadParameters):
try:
with open(os.path.join(downloadParameters['folder'], downloadParameters['filename']), 'wb') as outputFile:
outputFile.write(response.content)
success = True
return success
except EnvironmentError as e:
print('The downloaded file could not be saved on disk.')
print('{0}\n'.format(e))
success = False
return success
def checkHTTPresponseStatus(response, connectionParameters, downloadParameters):
try:
# raise exception if response is not OK
response.raise_for_status()
# if response is OK, download and save file to disk
success = saveFile(response, downloadParameters)
return success
# if response is NOT OK, an exception is raised. catch HTTP error.
except requests.HTTPError as e:
print('Connection to server has been established but there was a HTTP error')
print('{0}\n'.format(e.message))
success = False
return success
def retry(attempt, connectionParameters):
# increase download attempt counter, then sleep and try again if any attempts left
try:
attempt = increaseAttempts(attempt)
if attempt > connectionParameters['maxAttempts']:
raise MaxAttemptsException('MaxAttemptsException: Failed {0} attempts to download the file. Exiting...'.format(connectionParameters['maxAttempts']))
else:
sleepBetweenAttempts(connectionParameters)
except MaxAttemptsException as e:
print('{0}\n'.format(e.message))
exit()
return attempt
def tryToDownloadXmlFileFromServer(connectionParameters, downloadParameters):
attempt = 1
while True:
print('Downloading xml file from {0}, attempt {1}/{2}\n'.format(connectionParameters['url'], attempt, connectionParameters['maxAttempts']))
success = getResponseFromServer(connectionParameters, downloadParameters)
if success:
print('File has been successfully downloaded and saved at {0}\{1}.'.format(downloadParameters['folder'], downloadParameters['filename']))
break
else:
# file could not be downloaded or saved, try again
attempt = retry(attempt, connectionParameters)
return
if __name__ == "__main__":
tryToDownloadXmlFileFromServer(connectionParameters, downloadParameters)
1 Answer 1
Python's naming case is snake_case
(makes sense, right?); you are writing your identifiers and functions is camelCase
.
For example, this:
connectionParameters
becomes:
connection_parameters
And, as another example, this:
cleanUpDownloadFolder
becomes this:
cleanup_download_folder
However, for classes, the naming case is PascalCase
, which you did perfectly well for MaxAttemptsException
.
It is a good idea to put all your import
s together, as you did under your IMPORTS
section. However, there are two problems:
In the next section, you
import os
again.The
import os
in the next section is not up in theIMPORTS
section with the otherimport
s.
Since you are already import
ing os
in the IMPORTS
section, you should just completely remove this line from CONFIGURATION
:
import os
You have some fairly long function names:
tryToDownloadXmlFileFromServer
cleanUpDownloadFolder
checkHTTPresponseStatus
You seem to have some unnecessary words in these names. I would change these to:
download_xml
clean_downloads
check_response
You don't need to be so specific with tryTo
and Folder
and Server
... since your program is all about those, these topics should just be implied in your function names.
When you call the method sys.exit
:
exit()
You should be passing an exit/error code. The exit code describes how execution went to external programs in a single number.
Any non-zero number means that something went wrong. Therefore, for example, when you are exit
ing from the program in retry
, you should be passing a non-zero number to show that something went wrong.
This seems awfully unnecessary:
success = False
return success
Why are you creating a variable with an extremely simplistic value, and then having it returned from the function in the very next line?
Wouldn't it make a lot more sense to just do:
return False
The same sort-of goes to these lines:
success = checkHTTPresponseStatus(response, connectionParameters, downloadParameters)
return success
Why are you creating a brand new variable to only have it destroyed in the next line?
You should just merge the content of success
straight to the return statement, like this:
return checkHTTPresponseStatus(response, connectionParameters, downloadParameters)
You do this in a few other spots, along with the above recommendation. You should fix those spots too.
Your function checkHTTPresponseStatus
is not doing what it seems like it should be doing. The title tells me,
Oh, hey! Me? Yeah, I check to make sure that the response status from the server is a good one.
However, what it really is doing is this:
Oh, hey! Me? Yeah, I save the response file to the disk.
This leads me to my next point.
Starting from the first function you call, tryToDownloadXmlFileFromServer
, all the other functions that are called in this one are just a chain of results.
What I mean is, the success
variable in tryToDownloadXmlFileFromServer
is going to be the return value of saveFile
unless an error occurred along the way; in that case it'd be false.
Why? Well (assuming no errors have occured),
checkHTTPresponseStatus
returns the return ofsaveFile
getResponseFromServer
returns the return ofcheckHTTPresponseStatus
tryToDownloadXmlFileFromServer
uses the return ofgetResponseFromServer
.
To fix this, I'd say that you have two options:
Since these functions do so little, just group them into one function.
Move all error catching to one function, but keep the others doing their job.
I personally would choose the first one, because if you were to string the error catching and returns from all these functions, you would literally be left with a single line.
Here is how I would write the beginning of tryToDownloadXmlFileFromServer
:
def download_xml(connection_params, download_params):
attempt = 1
while True
try:
print('Downloading xml file from {0}, attempt {1}/{2}\n'.format(connectionParameters['url'], attempt, connectionParameters['maxAttempts']))
response = requests.get(url=connectionParameters['url'], auth=HTTPBasicAuth(connectionParameters['user'], connectionParameters['password']), timeout=(connectionParameters['connectTimeout'], connectionParameters['readTimeout']), verify=True)
response.raise_for_status()
with open(os.path.join(downloadParameters['folder'], downloadParameters['filename']), 'wb') as outputFile:
outputFile.write(response.content)
Then, you do the error catching that you were doing in the other functions:
except requests.RequestException as e: #from getResponseFromServer
[error message]
except requests.HTTPError as e: #from checkHTTPresponseStatus
[error message]
except EnvironmentError as e: #from saveFile
[error message]
attempt = retry(attempt, connectionParameters)
Here is what is happening in the error message part:
If there were any errors from the code in the
try
, enter the respectiveexcept
.Once the
except
is finished, set the new attempt amountGo back up the to the
while true
loop and try again.
Explore related questions
See similar questions with these tags.