Download a file from given URL and retry on connection errors

Question 1

I wrote a script for Python 2.78 which is supposed to download and save a file from a given URL using the requests library.

In case that a connection to the server can be established and a valid response is received, the response (e.g. a HTML file) should be downloaded to hard disk.

If the server gives a bad HTTP response code (anything but 200) or no connection could be established at all, a revealing error message should be printed, giving the specific type of error (e.g. 'connection refused' or 'HTTP 400 response code'. In this case, the download attempt should be retried until a predefined maximum of download tries is exceeded.

The URL and other connection parameters such as connection timeout, read timeout, maximum download attempts etc. as well as the location and filename for downloading the file can be specified on top of the script within the CONFIGURATION section. There can also be found some example URLs which mock bad server behaviour (based on httpbin).

The script works for me as is. However, I get the impression that it is overly complicated and includes too much nesting of the function which reduces readability. Moreover, the two configuration dictionaries are passed through several functions which I have the impression is hard to follow. Any suggestions on how to improve this code in terms of readability, simplicity or good coding style is appreciated!

###########
# IMPORTS #
###########
# file system browsing
import glob, os
# http services / file downloading
import requests
from requests.auth import HTTPBasicAuth
# sleeping
import time
###########
# CONFIGURATION #
###########
# get path from which the script is being run
import os
path = os.path.dirname(os.path.abspath(__file__))
# useful test urls:
# http://httpbin.org/basic-auth/user/passwd <- returns a valid HTTP 200 response that can be downloaded and saved as html
# http://httpbin.org/delay/n <- returns response after n sec of delay (to create a timeout)
# http://httpbin.org/status/404 <- returns a HTTP 404 response
# connection parameters
connectionParameters = {'url' : r'http://httpbin.org/basic-auth/user/passwd',
 'user' : 'user',
 'password' : 'passwd',
 'connectTimeout' : 10,
 'readTimeout' : 5,
 'maxAttempts' : 3,
 'secondsBetweenAttempts' : 3}
# download parameters
downloadParameters = {'folder' : r'{0}'.format(path),
 'filename' : 'downloadedFile.html'}
##################
# Functions #
##################
# create custom exception for the case that the maximum of download tries is exceeded
class MaxAttemptsException(requests.RequestException):
 pass
# clean up the folder with the downloaded files so that old ones are not checked again in case that the download fails
def cleanUpDownloadFolder(downloadParameters):
 print('cleaning up download folder: {0}\n'.format(downloadParameters['folder']))
 os.chdir(downloadParameters['folder'])
 [[os.remove(r'{0}\{1}'.format(downloadParameters['folder'], filename)) for filename in glob.glob(extension)] for extension in ['*.*']]
 return
def getResponseFromServer(connectionParameters, downloadParameters):
 try:
 response = requests.get(url=connectionParameters['url'], auth=HTTPBasicAuth(connectionParameters['user'], connectionParameters['password']), timeout=(connectionParameters['connectTimeout'], connectionParameters['readTimeout']), verify=True)
 success = checkHTTPresponseStatus(response, connectionParameters, downloadParameters)
 return success
 # if no HTTP response from server, catch connection error and return None response
 except requests.RequestException as e:
 print('There was an error while connecting to the server. Error message:')
 print('{0}\n'.format(e.message))
 success = False
 return success
def sleepBetweenAttempts(connectionParameters):
 print('Waiting {0} seconds before next try...\n'.format(connectionParameters['secondsBetweenAttempts']))
 time.sleep(connectionParameters['secondsBetweenAttempts'])
 return
def increaseAttempts(attempt):
 attempt += 1
 return attempt
def saveFile(response, downloadParameters):
 try:
 with open(os.path.join(downloadParameters['folder'], downloadParameters['filename']), 'wb') as outputFile:
 outputFile.write(response.content)
 success = True
 return success
 except EnvironmentError as e:
 print('The downloaded file could not be saved on disk.')
 print('{0}\n'.format(e))
 success = False
 return success
def checkHTTPresponseStatus(response, connectionParameters, downloadParameters):
 try:
 # raise exception if response is not OK
 response.raise_for_status()
 # if response is OK, download and save file to disk
 success = saveFile(response, downloadParameters)
 return success
 # if response is NOT OK, an exception is raised. catch HTTP error.
 except requests.HTTPError as e:
 print('Connection to server has been established but there was a HTTP error')
 print('{0}\n'.format(e.message))
 success = False
 return success
def retry(attempt, connectionParameters):
 # increase download attempt counter, then sleep and try again if any attempts left
 try:
 attempt = increaseAttempts(attempt)
 if attempt > connectionParameters['maxAttempts']:
 raise MaxAttemptsException('MaxAttemptsException: Failed {0} attempts to download the file. Exiting...'.format(connectionParameters['maxAttempts']))
 else:
 sleepBetweenAttempts(connectionParameters)
 except MaxAttemptsException as e:
 print('{0}\n'.format(e.message))
 exit()
 return attempt
def tryToDownloadXmlFileFromServer(connectionParameters, downloadParameters):
 attempt = 1
 while True:
 print('Downloading xml file from {0}, attempt {1}/{2}\n'.format(connectionParameters['url'], attempt, connectionParameters['maxAttempts']))
 success = getResponseFromServer(connectionParameters, downloadParameters)
 if success:
 print('File has been successfully downloaded and saved at {0}\{1}.'.format(downloadParameters['folder'], downloadParameters['filename']))
 break
 else:
 # file could not be downloaded or saved, try again
 attempt = retry(attempt, connectionParameters)
 return
if __name__ == "__main__":
 tryToDownloadXmlFileFromServer(connectionParameters, downloadParameters)

Question 2

Python's naming case is snake_case (makes sense, right?); you are writing your identifiers and functions is camelCase.

For example, this:

connectionParameters

becomes:

connection_parameters

And, as another example, this:

cleanUpDownloadFolder

becomes this:

cleanup_download_folder

However, for classes, the naming case is PascalCase, which you did perfectly well for MaxAttemptsException.

It is a good idea to put all your imports together, as you did under your IMPORTS section. However, there are two problems:

In the next section, you import os again.
The import os in the next section is not up in the IMPORTS section with the other imports.

Since you are already importing os in the IMPORTS section, you should just completely remove this line from CONFIGURATION:

import os

You have some fairly long function names:

tryToDownloadXmlFileFromServer
cleanUpDownloadFolder
checkHTTPresponseStatus

You seem to have some unnecessary words in these names. I would change these to:

download_xml
clean_downloads
check_response

You don't need to be so specific with tryTo and Folder and Server... since your program is all about those, these topics should just be implied in your function names.

When you call the method sys.exit:

exit()

You should be passing an exit/error code. The exit code describes how execution went to external programs in a single number.

Any non-zero number means that something went wrong. Therefore, for example, when you are exiting from the program in retry, you should be passing a non-zero number to show that something went wrong.

This seems awfully unnecessary:

success = False
return success

Why are you creating a variable with an extremely simplistic value, and then having it returned from the function in the very next line?

Wouldn't it make a lot more sense to just do:

return False

The same sort-of goes to these lines:

success = checkHTTPresponseStatus(response, connectionParameters, downloadParameters)
return success

Why are you creating a brand new variable to only have it destroyed in the next line?

You should just merge the content of success straight to the return statement, like this:

return checkHTTPresponseStatus(response, connectionParameters, downloadParameters)

You do this in a few other spots, along with the above recommendation. You should fix those spots too.

Your function checkHTTPresponseStatus is not doing what it seems like it should be doing. The title tells me,

Oh, hey! Me? Yeah, I check to make sure that the response status from the server is a good one.

However, what it really is doing is this:

Oh, hey! Me? Yeah, I save the response file to the disk.

This leads me to my next point.

Starting from the first function you call, tryToDownloadXmlFileFromServer, all the other functions that are called in this one are just a chain of results.

What I mean is, the success variable in tryToDownloadXmlFileFromServer is going to be the return value of saveFile unless an error occurred along the way; in that case it'd be false.

Why? Well (assuming no errors have occured),

checkHTTPresponseStatus returns the return of saveFile
getResponseFromServer returns the return of checkHTTPresponseStatus
tryToDownloadXmlFileFromServer uses the return of getResponseFromServer.

To fix this, I'd say that you have two options:

Since these functions do so little, just group them into one function.
Move all error catching to one function, but keep the others doing their job.

I personally would choose the first one, because if you were to string the error catching and returns from all these functions, you would literally be left with a single line.

Here is how I would write the beginning of tryToDownloadXmlFileFromServer:

def download_xml(connection_params, download_params):
 attempt = 1
 while True
 try:
 print('Downloading xml file from {0}, attempt {1}/{2}\n'.format(connectionParameters['url'], attempt, connectionParameters['maxAttempts']))
 response = requests.get(url=connectionParameters['url'], auth=HTTPBasicAuth(connectionParameters['user'], connectionParameters['password']), timeout=(connectionParameters['connectTimeout'], connectionParameters['readTimeout']), verify=True)
 response.raise_for_status()
 with open(os.path.join(downloadParameters['folder'], downloadParameters['filename']), 'wb') as outputFile:
 outputFile.write(response.content)

Then, you do the error catching that you were doing in the other functions:

except requests.RequestException as e: #from getResponseFromServer
 [error message]
except requests.HTTPError as e: #from checkHTTPresponseStatus
 [error message]
except EnvironmentError as e: #from saveFile
 [error message]
attempt = retry(attempt, connectionParameters)

Here is what is happening in the error message part:

If there were any errors from the code in the try, enter the respective except.
Once the except is finished, set the new attempt amount
Go back up the to the while true loop and try again.

SirPython SirPython 13.4k3 gold badges38 silver badges93 bronze badges · Accepted Answer · 2015-07-14 14:48:43Z

Python's naming case is snake_case (makes sense, right?); you are writing your identifiers and functions is camelCase.

For example, this:

connectionParameters

becomes:

connection_parameters

And, as another example, this:

cleanUpDownloadFolder

becomes this:

cleanup_download_folder

However, for classes, the naming case is PascalCase, which you did perfectly well for MaxAttemptsException.

It is a good idea to put all your imports together, as you did under your IMPORTS section. However, there are two problems:

In the next section, you import os again.
The import os in the next section is not up in the IMPORTS section with the other imports.

Since you are already importing os in the IMPORTS section, you should just completely remove this line from CONFIGURATION:

import os

You have some fairly long function names:

tryToDownloadXmlFileFromServer
cleanUpDownloadFolder
checkHTTPresponseStatus

You seem to have some unnecessary words in these names. I would change these to:

download_xml
clean_downloads
check_response

You don't need to be so specific with tryTo and Folder and Server... since your program is all about those, these topics should just be implied in your function names.

When you call the method sys.exit:

exit()

You should be passing an exit/error code. The exit code describes how execution went to external programs in a single number.

Any non-zero number means that something went wrong. Therefore, for example, when you are exiting from the program in retry, you should be passing a non-zero number to show that something went wrong.

This seems awfully unnecessary:

success = False
return success

Why are you creating a variable with an extremely simplistic value, and then having it returned from the function in the very next line?

Wouldn't it make a lot more sense to just do:

return False

The same sort-of goes to these lines:

success = checkHTTPresponseStatus(response, connectionParameters, downloadParameters)
return success

Why are you creating a brand new variable to only have it destroyed in the next line?

You should just merge the content of success straight to the return statement, like this:

return checkHTTPresponseStatus(response, connectionParameters, downloadParameters)

You do this in a few other spots, along with the above recommendation. You should fix those spots too.

Your function checkHTTPresponseStatus is not doing what it seems like it should be doing. The title tells me,

Oh, hey! Me? Yeah, I check to make sure that the response status from the server is a good one.

However, what it really is doing is this:

Oh, hey! Me? Yeah, I save the response file to the disk.

This leads me to my next point.

Starting from the first function you call, tryToDownloadXmlFileFromServer, all the other functions that are called in this one are just a chain of results.

What I mean is, the success variable in tryToDownloadXmlFileFromServer is going to be the return value of saveFile unless an error occurred along the way; in that case it'd be false.

Why? Well (assuming no errors have occured),

checkHTTPresponseStatus returns the return of saveFile
getResponseFromServer returns the return of checkHTTPresponseStatus
tryToDownloadXmlFileFromServer uses the return of getResponseFromServer.

To fix this, I'd say that you have two options:

Since these functions do so little, just group them into one function.
Move all error catching to one function, but keep the others doing their job.

I personally would choose the first one, because if you were to string the error catching and returns from all these functions, you would literally be left with a single line.

Here is how I would write the beginning of tryToDownloadXmlFileFromServer:

def download_xml(connection_params, download_params):
 attempt = 1
 while True
 try:
 print('Downloading xml file from {0}, attempt {1}/{2}\n'.format(connectionParameters['url'], attempt, connectionParameters['maxAttempts']))
 response = requests.get(url=connectionParameters['url'], auth=HTTPBasicAuth(connectionParameters['user'], connectionParameters['password']), timeout=(connectionParameters['connectTimeout'], connectionParameters['readTimeout']), verify=True)
 response.raise_for_status()
 with open(os.path.join(downloadParameters['folder'], downloadParameters['filename']), 'wb') as outputFile:
 outputFile.write(response.content)

Then, you do the error catching that you were doing in the other functions:

except requests.RequestException as e: #from getResponseFromServer
 [error message]
except requests.HTTPError as e: #from checkHTTPresponseStatus
 [error message]
except EnvironmentError as e: #from saveFile
 [error message]
attempt = retry(attempt, connectionParameters)

Here is what is happening in the error message part:

If there were any errors from the code in the try, enter the respective except.
Once the except is finished, set the new attempt amount
Go back up the to the while true loop and try again.

Stack Exchange Network

Download a file from given URL and retry on connection errors

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Download a file from given URL and retry on connection errors

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions