I've written a script in Python which collects the links of posts from a target page and then fetches the title of each post by going one layer deep.
I've applied a @get_links decorator which scrapes the titles from the inner pages.
I'd appreciate any suggestions to improve my existing approach while keeping the decorator, as I'm very new to working with decorators.
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    url = "https://stackoverflow.com/questions/tagged/web-scraping"

    def get_links(func):
        def get_target_link(*args, **kwargs):
            titles = []
            for link in func(*args, **kwargs):
                res = requests.get(link)
                soup = BeautifulSoup(res.text, "lxml")
                title = soup.select_one("h1[itemprop='name'] a").text
                titles.append(title)
            return titles
        return get_target_link

    @get_links
    def get_info(link):
        ilink = []
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        for items in soup.select(".summary .question-hyperlink"):
            ilink.append(urljoin(url, items.get('href')))
        return ilink

    if __name__ == '__main__':
        print(get_info(url))
While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.
Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def get_title(link):
        """Load a link to get the page title."""
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        return soup.title.text.split(" - ")[1]  # Will only work exactly like this with Stack Exchange
        # return soup.select_one("h1[itemprop='name'] a").text

    def get_links(link):
        """Get all links from a page."""
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        relative_urls = soup.select(".summary .question-hyperlink")
        return [urljoin(url, items.get('href')) for items in relative_urls]

    if __name__ == '__main__':
        url = "https://stackoverflow.com/questions/tagged/web-scraping"
        links = get_links(url)
        link_titles = [get_title(link) for link in links]
        print(link_titles)
If you really want to, you can then make a new function that uses these two functions:
    def get_link_titles(url):
        """Get the titles of all links present in `url`."""
        return [get_title(link) for link in get_links(url)]
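With that in place, the main block reduces to a single call, for example:

    if __name__ == '__main__':
        url = "https://stackoverflow.com/questions/tagged/web-scraping"
        print(get_link_titles(url))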
In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host). You could put getting a page and parsing it with BeautifulSoup into its own function:
    SESSION = requests.Session()

    def get_soup(url):
        res = SESSION.get(url)
        return BeautifulSoup(res.text, "lxml")
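With that helper in place, the two functions above could shrink to something like this (same selectors as before, just routed through get_soup):

    def get_title(link):
        """Load a link to get the page title."""
        return get_soup(link).title.text.split(" - ")[1]

    def get_links(link):
        """Get all links from a page."""
        relative_urls = get_soup(link).select(".summary .question-hyperlink")
        return [urljoin(url, items.get('href')) for items in relative_urls]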
You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).
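If you want to be defensive about that, one possible sketch (assuming the server signals throttling with an HTTP 429 status and a Retry-After header given in seconds; max_retries is just an illustrative parameter) is to let get_soup back off and retry:

    import time

    def get_soup(url, max_retries=3):
        """Fetch a page via the shared session, backing off if we get throttled."""
        for _ in range(max_retries):
            res = SESSION.get(url)
            if res.status_code != 429:  # not throttled, parse as usual
                return BeautifulSoup(res.text, "lxml")
            # Sleep for the advertised number of seconds, or a small default.
            time.sleep(int(res.headers.get("Retry-After", 5)))
        res.raise_for_status()  # still throttled after all retries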