I wanted to create a simple function that can read and return the HTML content from a specified URL. This is what reading here and there led me to:

    from socket import timeout
    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError

    def get_html_content(url, max_attempt = 3):
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        content = ""
        attempt = 1
        while True:
            try:
                html_page = urlopen(req, timeout=10)
                content = html_page.read()
            except (HTTPError, URLError, timeout) as e:
                if isinstance(e, HTTPError):
                    print("The server couldn\'t fulfill the request....attempt %d/%d" % (attempt, max_attempt))
                    print('Error code: ', e.code)
                if isinstance(e, URLError):
                    print("We failed to reach a server....attempt %d/%d" % (attempt, max_attempt))
                    print('Reason: ', e.reason)
                if isinstance(e, timeout):
                    print('timeout...attempt %d/%d' % (attempt, max_attempt))
                attempt += 1
                if attempt > max_attempt:
                    break
                continue
            else:
                break
        return content

I would use this function to parse the content of many URLs. If `content == ""`, I would raise an exception after writing whatever I had already successfully gathered to a file.
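The calling pattern described above could look roughly like the sketch below. The names `collect` and `FetchFailed`, the injectable `fetch` argument (standing in for `get_html_content`), and the choice of JSON for the output file are all hypothetical and only serve the illustration.

```python
import json


class FetchFailed(Exception):
    """Hypothetical exception raised when a URL yields no content."""


def collect(urls, fetch, out_path):
    """Fetch each URL; on the first empty result, save progress and raise."""
    gathered = {}
    for url in urls:
        content = fetch(url)
        if not content:  # "" (or None) signals that every attempt failed
            with open(out_path, 'w') as f:
                json.dump(gathered, f)
            raise FetchFailed(url)
        # urlopen(...).read() returns bytes, which JSON cannot store directly
        if isinstance(content, bytes):
            content = content.decode('utf-8', errors='replace')
        gathered[url] = content
    return gathered
```

Passing the fetch function as an argument is just one way to keep the sketch testable without network access; calling `get_html_content` directly inside the loop would work the same way.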
You may want to consider using requests, possibly with an extension to account for the multiple attempts (or just iterating as you did). – WoJ, Jul 28, 2018 at 19:24
1 Answer
There are a couple of minor technical issues:
- The `content` variable is unnecessary, because you can simply return `html_page.read()` directly. (And you could just as well return `urlopen(req, timeout=10).read()` directly.) When the maximum number of attempts is reached, you could `return ""` instead of relying on `content` having been initialized to `""`. And how about returning `None`? Then you could simply omit the `return` statement to the same effect.
- In the exception handling, there are multiple `if` statements whose conditions are meant to be mutually exclusive, so that only one matches at a time. In such a situation you should chain them together with `elif`. (As written they are not quite exclusive: `HTTPError` is a subclass of `URLError`, so an `HTTPError` triggers both of the first two branches.)
- Instead of a single `except` statement with multiple error types, followed by conditionals to identify the correct one, it would be better to use multiple `except` statements, each with a single error type.
- You could iterate using `range` for slightly more compact code.

Like this:

    def get_html_content(url, max_attempt = 3):
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        for attempt in range(max_attempt):
            try:
                return urlopen(req, timeout=10).read()
            except HTTPError as e:
                print("The server couldn\'t fulfill the request....attempt {}/{}".format(attempt + 1, max_attempt))
                print('Error code: ', e.code)
            except URLError as e:
                print("We failed to reach a server....attempt {}/{}".format(attempt + 1, max_attempt))
                print('Reason: ', e.reason)
            except timeout as e:
                print('timeout...attempt {}/{}'.format(attempt + 1, max_attempt))
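
One thing to keep in mind with separate `except` clauses is their order: `HTTPError` is a subclass of `URLError`, so the `HTTPError` clause has to come first or it would never be reached. The sketch below is one way to check the retry behavior without touching the network. The `opener` parameter and the `FlakyOpener` class are hypothetical additions for testing only, not part of `urllib`, and the messages are shortened for brevity.

```python
from socket import timeout
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def get_html_content(url, max_attempt=3, opener=urlopen):
    """Retry sketch; `opener` is a hypothetical hook added only for testing."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(1, max_attempt + 1):
        try:
            return opener(req, timeout=10).read()
        except HTTPError as e:  # listed first: HTTPError subclasses URLError
            print('HTTP error {} (attempt {}/{})'.format(e.code, attempt, max_attempt))
        except URLError as e:
            print('Unreachable: {} (attempt {}/{})'.format(e.reason, attempt, max_attempt))
        except timeout:
            print('Timeout (attempt {}/{})'.format(attempt, max_attempt))
    return None  # explicit here, but the same as falling off the end


class FlakyOpener:
    """Fake opener: raises the queued exceptions in order, then serves a body."""

    def __init__(self, failures, body=b'<html></html>'):
        self.failures = list(failures)
        self.body = body

    def __call__(self, req, timeout=None):
        if self.failures:
            raise self.failures.pop(0)
        return self

    def read(self):
        return self.body
```

With two queued failures and three attempts, the call returns the body on the third try; with three queued failures it returns `None`, which the caller then has to check for.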