The project outline:
Write a program that, given the URL of a web page, will attempt to download every linked page on the page. The program should flag any pages that have a 404 "Not Found" status code and print them out as broken links.
My solution:
# Usage: python link_verification.py "webpage to search"
import sys, requests, bs4


def link_validator(links):
    for link in links:
        page = link.get("href")
        if str(page).startswith("https://"):
            url = link.get("href")
            res = requests.get(url)
            try:
                res.raise_for_status()
                print(f"Page found: {page}")
            except Exception as exc:
                print(f"There was a problem with {page}:\n{exc}")


def main():
    while True:
        url = sys.argv[1]  # Enter a website to search
        res = requests.get(url)
        try:
            res.raise_for_status()
        except Exception as exc:
            print(f"There was a problem: {exc}")
            continue
        soup = bs4.BeautifulSoup(res.text, "html.parser")
        links = soup.select("a")
        link_validator(links)
        print("Search complete")
        break


if __name__ == '__main__':
    main()
2 Answers
So far, I think your code looks really good. It's clear and easy to read, you are using library code to reduce boilerplate (res.raise_for_status), and you have an if __name__ guard. Nice work! A few things I'd consider:
while True: break
You don't need the loop in your main function, since it only goes through a single iteration unless there's a problem with the URL, and in that case continue just retries the same URL forever. I'd probably just raise rather than continue.
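For instance, a minimal sketch of main without the loop (reusing the imports and link_validator from your script, and still reading the URL inside main for now, which the next point covers) could look like:
def main():
    url = sys.argv[1]
    res = requests.get(url)
    res.raise_for_status()  # let a bad URL raise instead of silently retrying
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    link_validator(soup.select("a"))
    print("Search complete")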
sys args
It's usually more conventional to unpack sys args in the if __name__ block and pass them as arguments to main. This way there isn't a weird side effect if you want to use this function in another module:
def main(url):
    res = requests.get(url)
    ...


if __name__ == "__main__":
    url = sys.argv[1]
    main(url)
if guards on loops:
When using a for loop, it's common to see this kind of pattern:
for thing in iterable:
    if condition:
        # do things
While this is straightforward and easy to understand, it can be a little cleaner, and avoid extra indentation, to check the negative case and skip it:
for thing in iterable:
    if not condition:
        continue
    # do things, note no else
So in your link_validator:
def link_validator(links):
    for link in links:
        page = link.get("href")
        if not str(page).startswith('https://'):
            continue
        url = link.get("href")
        res = requests.get(url)
        try:
            res.raise_for_status()
            print(f"Page found: {page}")
        except Exception as exc:
            print(f"There was a problem with {page}:\n{exc}")
The page and url variables inside of link_validator(links) are redundant. url could be removed and requests.get(url) replaced with requests.get(page).
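Putting that together with the guard clause above, a sketch of link_validator without the duplicate lookup might look like:
def link_validator(links):
    for link in links:
        page = link.get("href")
        if not str(page).startswith("https://"):
            continue
        res = requests.get(page)  # reuse page instead of a second href lookup
        try:
            res.raise_for_status()
            print(f"Page found: {page}")
        except Exception as exc:
            print(f"There was a problem with {page}:\n{exc}")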
Requests error handling
requests' helper raise_for_status is great and you're using it well! A small improvement here would be catching explicit errors, though, as catching all exceptions with except Exception could have unintended side effects.
For example, if you accidentally made a typo within the try block and wrote res.raise_for_sttus(), your code here would swallow the resulting error and never report it. Similarly for a typo'd, undefined variable name, which would normally raise a NameError.
Additionally, just stringifying an error instance usually loses the specific error type. That's probably fine when you're only catching a particular type of error, but when you're catching all errors it can produce very ambiguous messages. Take the following as an example:
urls = {"a": "http://google.com", "b": "https://stackexchange.com"}
try:
    requests.get(urls["c"])
except Exception as e:
    print(e)
# prints the following
# 'c'

# or perhaps easier to play with
try:
    d = {}
    print(d['hi'])
except Exception as e:
    print(e)
# prints
# 'hi'

print(d['hi'])
# raises a KeyError: 'hi' - notice without the exception type, the message is very ambiguous
These may seem like silly examples, but I've seen patterns like these in production code, at a much larger scale, produce bugs that were very hard to find.
So that's all to say you should catch requests.exceptions.HTTPError rather than the blanket Exception.
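A sketch of that narrower handler inside link_validator's loop (everything else unchanged) might be:
try:
    res.raise_for_status()
    print(f"Page found: {page}")
except requests.exceptions.HTTPError as exc:
    # only HTTP error statuses are handled here; typos and other bugs still raise
    print(f"There was a problem with {page}:\n{exc}")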
Finally, it likely makes sense to report on any/all "broken" links, but your prompt does specifically only call out reporting 404s:
flag any pages that have a 404 "Not Found" status code and print them out as broken links
Depending on server implementations, raise_for_status may raise for responses you wouldn't consider a "broken link", but either way your code will flag more than only 404s. If you want to limit reporting to 404 status codes, or just handle a 404 differently, you can check the status_code of the response, which you can do in two ways:
res = requests.get(...)
if res.status_code == 404:
    print(f"this {page} was not found.")

# or
try:
    res.raise_for_status()
except requests.exceptions.HTTPError as e:
    if e.response.status_code == 404:
        print(f"this {page} was not found")
    else:
        raise e  # or maybe just print("this page was broken but found")