I've written a script in Python which collects the links of posts from a target page and then fetches the title of each post by going one layer deep.
I've applied a @get_links decorator which scrapes the titles from the inner pages.
I'd appreciate any suggestions to improve my existing approach while keeping the decorator, as I'm very new to working with decorators.
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    url = "https://stackoverflow.com/questions/tagged/web-scraping"

    def get_links(func):
        def get_target_link(*args, **kwargs):
            titles = []
            for link in func(*args, **kwargs):
                res = requests.get(link)
                soup = BeautifulSoup(res.text, "lxml")
                title = soup.select_one("h1[itemprop='name'] a").text
                titles.append(title)
            return titles
        return get_target_link

    @get_links
    def get_info(link):
        ilink = []
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        for items in soup.select(".summary .question-hyperlink"):
            ilink.append(urljoin(url, items.get('href')))
        return ilink

    if __name__ == '__main__':
        print(get_info(url))
While decorators are fun to learn about (especially when you get to decorators taking arguments and class decorators) and they can be quite useful, I think this decorator should not be one. Sorry.
Your code becomes much easier to read and understand by making this into two functions, one that gets the links and one that gets the title from a link, which you then apply to each link:
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def get_title(link):
        """Load a link to get the page title."""
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        return soup.title.text.split(" - ")[1]  # Will only work exactly like this with Stack Exchange
        # return soup.select_one("h1[itemprop='name'] a").text

    def get_links(link):
        """Get all links from a page."""
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        relative_urls = soup.select(".summary .question-hyperlink")
        return [urljoin(url, items.get('href')) for items in relative_urls]

    if __name__ == '__main__':
        url = "https://stackoverflow.com/questions/tagged/web-scraping"
        links = get_links(url)
        link_titles = [get_title(link) for link in links]
        print(link_titles)
If you really want to, you can then make a new function that uses these two functions:
    def get_link_titles(url):
        """Get the titles of all links present in `url`."""
        return [get_title(link) for link in get_links(url)]
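With that in place, the main block reduces to a single call, for example:

    if __name__ == '__main__':
        url = "https://stackoverflow.com/questions/tagged/web-scraping"
        print(get_link_titles(url))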
In addition, you should use requests.Session to reuse the connection to the website (since you are always connecting to the same host). You could put getting a page and parsing it with BeautifulSoup into its own function:
    SESSION = requests.Session()

    def get_soup(url):
        res = SESSION.get(url)
        return BeautifulSoup(res.text, "lxml")
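With that helper in place, the two functions above could shrink to something like this (same selectors as before, just routed through get_soup):

    def get_title(link):
        """Load a link to get the page title."""
        return get_soup(link).title.text.split(" - ")[1]

    def get_links(link):
        """Get all links from a page."""
        relative_urls = get_soup(link).select(".summary .question-hyperlink")
        return [urljoin(url, items.get('href')) for items in relative_urls]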
You might also want to check the headers for a rate limit, because when I ran your code and tried to time it, Stack Exchange temporarily blocked me after some time because the request rate was too high :).
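If you want to be defensive about that, one possible sketch (assuming the server signals throttling with an HTTP 429 status and a Retry-After header given in seconds; max_retries is just an illustrative parameter) is to let get_soup back off and retry:

    import time

    def get_soup(url, max_retries=3):
        """Fetch a page via the shared session, backing off if we get throttled."""
        for _ in range(max_retries):
            res = SESSION.get(url)
            if res.status_code != 429:  # not throttled, parse as usual
                return BeautifulSoup(res.text, "lxml")
            # Sleep for the advertised number of seconds, or a small default.
            time.sleep(int(res.headers.get("Retry-After", 5)))
        res.raise_for_status()  # still throttled after all retries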