
I wanted to create a simple function that can read and return the HTML content from a specified URL. This is what reading here and there led me to:

from socket import timeout
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError


def get_html_content(url, max_attempt=3):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    content = ""
    attempt = 1
    while True:
        try:
            html_page = urlopen(req, timeout=10)
            content = html_page.read()
        except (HTTPError, URLError, timeout) as e:
            if isinstance(e, HTTPError):
                print("The server couldn't fulfill the request....attempt %d/%d" % (attempt, max_attempt))
                print('Error code: ', e.code)
            if isinstance(e, URLError):
                print("We failed to reach a server....attempt %d/%d" % (attempt, max_attempt))
                print('Reason: ', e.reason)
            if isinstance(e, timeout):
                print('timeout...attempt %d/%d' % (attempt, max_attempt))
            attempt += 1
            if attempt > max_attempt:
                break
            continue
        else:
            break
    return content

I would use this function to parse the content of many URLs. If content == "", I would raise an exception after writing to a file whatever I had already gathered successfully.
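For context, a caller along these lines is what I have in mind (the URL list, the output file name, and the EmptyContentError exception are hypothetical, just to illustrate the intended use):

import urls_module  # hypothetical source of the URL list


class EmptyContentError(Exception):
    """Raised when a URL yields no content (hypothetical name)."""


def gather(urls, out_path='gathered.html'):
    collected = []
    for url in urls:
        content = get_html_content(url)
        if not content:
            # Persist whatever was gathered so far, then bail out.
            with open(out_path, 'wb') as f:
                f.write(b''.join(collected))
            raise EmptyContentError('No content returned for %s' % url)
        collected.append(content)
    return collected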

asked Jul 28, 2018 at 13:25
You may want to consider using requests, possibly with an extension to account for the multiple attempts, or just iterating as you did. – Commented Jul 28, 2018 at 19:24
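A minimal sketch of that suggestion, assuming the third-party requests library and urllib3's Retry helper for the repeated attempts; the function name fetch_html and the retried status codes are my own choices, not part of the original review:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def fetch_html(url, max_attempt=3):
    session = requests.Session()
    # Retry on connection problems and on common transient HTTP status codes.
    retries = Retry(total=max_attempt, backoff_factor=1,
                    status_forcelist=[429, 500, 502, 503, 504])
    session.mount('http://', HTTPAdapter(max_retries=retries))
    session.mount('https://', HTTPAdapter(max_retries=retries))
    response = session.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    response.raise_for_status()
    return response.text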

1 Answer


There are a couple of minor technical issues:

  • The content variable is unnecessary, because you can simply return html_page.read() directly. (You could even return urlopen(req, timeout=10).read() directly...) When the maximum number of attempts is reached, you could return "" instead of relying on content having been initialized to "". Or how about returning None? Then you could simply omit the return statement to the same effect.

  • In the exception handling, there are multiple if statements whose conditions are mutually exclusive: only one can match at a time. In such a situation you should chain them together with elif.

  • Instead of using a single except statement with multiple error types and then using conditionals to identify the correct one, it would be better to use multiple except statements, each with a single error type.

  • You could iterate using range for slightly more compact code.

Like this:

def get_html_content(url, max_attempt=3):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(max_attempt):
        try:
            return urlopen(req, timeout=10).read()
        except HTTPError as e:
            print("The server couldn't fulfill the request....attempt {}/{}".format(attempt + 1, max_attempt))
            print('Error code: ', e.code)
        except URLError as e:
            print("We failed to reach a server....attempt {}/{}".format(attempt + 1, max_attempt))
            print('Reason: ', e.reason)
        except timeout:
            print('timeout...attempt {}/{}'.format(attempt + 1, max_attempt))
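A quick usage sketch (the URL is hypothetical) showing how a caller might handle the None that is implicitly returned once the attempts are exhausted:

content = get_html_content('https://example.com')
if content is None:
    print('Giving up after repeated failures.')
else:
    print('Fetched {} bytes.'.format(len(content)))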
answered Jul 28, 2018 at 14:07
