The project outline:
Write a program that, given the URL of a web page, will attempt to download every linked page on the page. The program should flag any pages that have a 404 "Not Found" status code and print them out as broken links.
My solution:
# Usage: python link_verification.py "webpage to search"
import sys, requests, bs4


def link_validator(links):
    for link in links:
        page = link.get("href")
        if str(page).startswith("https://"):
            url = link.get("href")
            res = requests.get(url)
            try:
                res.raise_for_status()
                print(f"Page found: {page}")
            except Exception as exc:
                print(f"There was a problem with {page}:\n{exc}")


def main():
    while True:
        url = sys.argv[1]  # Enter a website to search
        res = requests.get(url)
        try:
            res.raise_for_status()
        except Exception as exc:
            print(f"There was a problem: {exc}")
            continue
        soup = bs4.BeautifulSoup(res.text, "html.parser")
        links = soup.select("a")
        link_validator(links)
        print("Search complete")
        break


if __name__ == '__main__':
    main()
2 Answers
So far, I think your code looks really good. It's clear and easy to read, you are using library code to reduce boilerplate (res.raise_for_status), and you have an if __name__ guard. Nice work! A few things I'd consider:
while True: break
You don't need the loop in your main function, since it only goes through a single iteration unless there's a problem with the URL, and in that case continue just retries the same URL forever. I'd probably just raise rather than continue.
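For instance, a minimal sketch of main without the loop (reusing the imports and link_validator from your script, and still reading the URL inside main for now, which the next point covers) could look like:
def main():
    url = sys.argv[1]
    res = requests.get(url)
    res.raise_for_status()  # let a bad URL raise instead of silently retrying
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    link_validator(soup.select("a"))
    print("Search complete")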
sys args
It's usually more conventional to unpack sys args in the if __name__ block and pass them as arguments to main. This way there isn't a weird side effect if you want to use this function in another module:
def main(url):
    res = requests.get(url)
    ...


if __name__ == "__main__":
    url = sys.argv[1]
    main(url)
if guards on loops:
When using a for loop, it's common to see this kind of pattern:
for thing in iterable:
    if condition:
        # do things
While this is straightforward and easy to understand, it can be a little cleaner, and avoid extra indentation, to check the negative case and skip it:
for thing in iterable:
    if not condition:
        continue
    # do things, note no else
So in your link_validator:
def link_validator(links):
    for link in links:
        page = link.get("href")
        if not str(page).startswith('https://'):
            continue
        url = link.get("href")
        res = requests.get(url)
        try:
            res.raise_for_status()
            print(f"Page found: {page}")
        except Exception as exc:
            print(f"There was a problem with {page}:\n{exc}")
The page and url variables inside of link_validator(links) are redundant. url could be removed and requests.get(url) replaced with requests.get(page).
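Putting that together with the guard clause above, a sketch of link_validator without the duplicate lookup might look like:
def link_validator(links):
    for link in links:
        page = link.get("href")
        if not str(page).startswith("https://"):
            continue
        res = requests.get(page)  # reuse page instead of a second href lookup
        try:
            res.raise_for_status()
            print(f"Page found: {page}")
        except Exception as exc:
            print(f"There was a problem with {page}:\n{exc}")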
Requests error handling
requests' helper raise_for_status is great and you're using it well! A small improvement here would be catching explicit errors, though, as catching all exceptions with except Exception could have unintended side effects.
For example, if you accidentally made a typo within the try block and wrote res.raise_for_sttus(), your code here would swallow the resulting error and never report it. Similarly for a typo'd, undefined variable name, which would normally raise a NameError.
Additionally, just stringifying an error instance usually loses the specific error type. That's probably fine when you're only catching a particular type of error, but when you're catching all errors it can produce very ambiguous messages. Take the following as an example:
urls = {"a": "http://google.com", "b": "https://stackexchange.com"}
try:
    requests.get(urls["c"])
except Exception as e:
    print(e)
# prints the following
# 'c'

# or perhaps easier to play with
try:
    d = {}
    print(d['hi'])
except Exception as e:
    print(e)
# prints
# 'hi'

print(d['hi'])
# raises a KeyError: 'hi' - notice without the exception type, the message is very ambiguous
These may seem like silly examples, but I've seen patterns like these in production code, at a much larger scale, produce bugs that were very hard to find.
So that's all to say you should catch requests.exceptions.HTTPError rather than the blanket Exception.
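A sketch of that narrower handler inside link_validator's loop (everything else unchanged) might be:
try:
    res.raise_for_status()
    print(f"Page found: {page}")
except requests.exceptions.HTTPError as exc:
    # only HTTP error statuses are handled here; typos and other bugs still raise
    print(f"There was a problem with {page}:\n{exc}")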
Finally, it likely makes sense to report on any/all "broken" links, but your prompt does specifically only call out reporting 404s:
flag any pages that have a 404 "Not Found" status code and print them out as broken links
Depending on server implementations, raise_for_status may raise for responses you wouldn't consider a "broken link", but either way your code will flag more than only 404s. If you want to limit reporting to 404 status codes, or just handle a 404 differently, you can check the status_code of the response, which you can do in two ways:
res = requests.get(...)
if res.status_code == 404:
    print(f"this {page} was not found.")

# or
try:
    res.raise_for_status()
except requests.exceptions.HTTPError as e:
    if e.response.status_code == 404:
        print(f"this {page} was not found")
    else:
        raise e  # or maybe just print("this page was broken but found")