Reading and returning the HTML content from a specified URL

Question 1

I wanted to create a simple function that can read and return the HTML content from a specified URL. This is what reading here and there lead me to:

from socket import timeout
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
def get_html_content(url, max_attempt = 3):
 req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
 content = ""
 attempt = 1
 while True:
 try:
 html_page = urlopen(req, timeout=10)
 content = html_page.read()
 except (HTTPError, URLError, timeout) as e:
 if isinstance(e, HTTPError):
 print("The server couldn\'t fulfill the request....attempt %d/%d" % (attempt, max_attempt))
 print('Error code: ', e.code)
 if isinstance(e, URLError):
 print("We failed to reach a server....attempt %d/%d" % (attempt, max_attempt))
 print('Reason: ', e.reason)
 if isinstance(e, timeout):
 print('timeout...attempt %d/%d' % (attempt, max_attempt))
 attempt += 1
 if attempt > max_attempt:
 break
 continue
 else:
 break
 return content

I would use this function to parse the content of many URLs. For if content = "", I would raise a random exception after writing to some file whatever I had already successfully gathered.

Question 2

you may want to consider using requests, possibly with an extension to account for the multiple attempts, or just iterating as you did)

Question 3

There are a couple of minor technical issues:

The content variable is unnecessary, because you can simply return html_page.read() directly. (And you could as well return urlopen(req, timeout=10).read() directly...) When the max attempts are reached, you could return "" instead of relying on that content was initialized to "". And how about returning None? Then you could simply omit the return statement to the same effect.
In the exception handling, there are multiple if statements with conditions that are mutually exclusive, only one can match at a time. In such situation you should chain them together with elif.
Instead of doing a single except statement with multiple error types and then using conditionals to identify the correct one, it would be better to use multiple except statements each with a single error type.
You could iterate using range for slightly more compact code.

Like this:

def get_html_content(url, max_attempt = 3):
 req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
 for attempt in range(max_attempt):
 try:
 return urlopen(req, timeout=10).read()
 except HTTPError as e:
 print("The server couldn\'t fulfill the request....attempt {}/{}".format(attempt + 1, max_attempt))
 print('Error code: ', e.code)
 except URLError as e:
 print("We failed to reach a server....attempt {}/{}".format(attempt + 1, max_attempt))
 print('Reason: ', e.reason)
 except timeout as e:
 print('timeout...attempt {}/{}'.format(attempt + 1, max_attempt))

janos janos 113k15 gold badges154 silver badges396 bronze badges · Accepted Answer · 2018-07-28 14:07:23Z

There are a couple of minor technical issues:

The content variable is unnecessary, because you can simply return html_page.read() directly. (And you could as well return urlopen(req, timeout=10).read() directly...) When the max attempts are reached, you could return "" instead of relying on that content was initialized to "". And how about returning None? Then you could simply omit the return statement to the same effect.
In the exception handling, there are multiple if statements with conditions that are mutually exclusive, only one can match at a time. In such situation you should chain them together with elif.
Instead of doing a single except statement with multiple error types and then using conditionals to identify the correct one, it would be better to use multiple except statements each with a single error type.
You could iterate using range for slightly more compact code.

Like this:

def get_html_content(url, max_attempt = 3):
 req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
 for attempt in range(max_attempt):
 try:
 return urlopen(req, timeout=10).read()
 except HTTPError as e:
 print("The server couldn\'t fulfill the request....attempt {}/{}".format(attempt + 1, max_attempt))
 print('Error code: ', e.code)
 except URLError as e:
 print("We failed to reach a server....attempt {}/{}".format(attempt + 1, max_attempt))
 print('Reason: ', e.reason)
 except timeout as e:
 print('timeout...attempt {}/{}'.format(attempt + 1, max_attempt))

Stack Exchange Network

Reading and returning the HTML content from a specified URL

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Reading and returning the HTML content from a specified URL

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions