I have written some Python code to scrape items from a webpage. Each container has a link (titled "see more") attached to it. Clicking that link leads to a page where all of the information is available, so it would be convenient to simply follow the "see more" link and parse everything from there. However, my goal is to parse the first two items (the event name and date) from the first page, then move on to the other page (by following the "see more" link) and parse the rest there. I have implemented exactly that, and it works fine.
At this point, I'm seriously dubious whether the way I'm doing things is right or error-prone, because I pass two items from the earlier page to be printed inside the for loop defined in the later function. It gives accurate results, though. Any suggestion as to whether what I did here is ideal, or any guidance on why I should not practice this, will be highly appreciated.
This is the site link. This is the full script:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'replace_with_the_above_link'

def glean_items(main_link):
    res = requests.get(main_link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select('.course-list-item'):
        event = item.select(".lead")[0].text
        date = item.select(".date")[0].text
        link = urljoin(main_link,item.select(".pull-right a")[0]['href'])
        parse_more(event,date,link)

def parse_more(event_name,ev_date,i_link): ## notice the two items (event_name,ev_date)
    res = requests.get(i_link)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select('.e-loc-cost'):
        location = ' '.join([' '.join(item.text.split()) for item in items.select(".event-add")])
        cost = ' '.join([' '.join(item.text.split()) for item in items.select(".costs")])
        print(event_name,ev_date,location,cost) ##again take a look: I used those two items within this newly created for loop.

if __name__ == '__main__':
    glean_items(url)
2 Answers
First of all, the usual things (I feel like I'm suggesting these in most web-scraping related discussions):

- initialize a `session` as `requests.Session()` and use `session.get()` instead of `requests.get()`; this would speed things up because the underlying TCP connection will be re-used for subsequent queries to the same domain (see the sketch right after this list)
- use `SoupStrainer` to limit what is parsed when the "soup" object is initialized
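A minimal sketch of both ideas together (the URL here is just a placeholder):

import requests
from bs4 import BeautifulSoup, SoupStrainer

with requests.Session() as session:  # the underlying TCP connection is re-used between .get() calls
    response = session.get('https://example.com')  # placeholder URL

    # build the tree only from elements with this class, skipping the rest of the page
    parse_only = SoupStrainer(class_='course-list-item')
    soup = BeautifulSoup(response.text, 'lxml', parse_only=parse_only)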
And I am not completely sure about the reliability of the `.pull-right a` selector. As it stands, it reads as "follow the link that is on the right in its respective container", and the position of an element on the screen may easily change with a design change. What about a selector like `a[id*=ReadMore]` or `.more > a` instead?
Also, note that `select()[0]` could be replaced with `select_one()`.
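For instance, a small hypothetical snippet (assuming `item` is one result container from the listing page):

# position-based selector: breaks as soon as the layout changes
link = item.select(".pull-right a")[0]['href']

# purpose-based alternative, using select_one() for the first match
link = item.select_one(".more > a")['href']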
Code Style
There is not a lot to point out since the code is rather short and straightforward, but there are some PEP8 violations that can be addressed:

- watch the use of blank lines
- you are missing spaces between the function arguments
- `url` should be named `URL`, since it is a "constant"
- use a single `#` for comments
- group imports properly

Some of the variable names could be more explicit, e.g. `event_date` in place of `ev_date` and `event_link` instead of `i_link`.
The improved code
from urllib.parse import urljoin

from bs4 import BeautifulSoup, SoupStrainer
import requests

URL = 'replace_with_the_above_link'

def glean_items(session, main_link):
    response = session.get(main_link)

    parse_only = SoupStrainer(class_='course-list-item')
    soup = BeautifulSoup(response.text, "lxml", parse_only=parse_only)

    for item in soup.select('.course-list-item'):
        event = item.select_one(".lead").text
        date = item.select_one(".date").text
        link = urljoin(main_link, item.select_one(".pull-right a")['href'])

        parse_more(session, event, date, link)

def parse_more(session, event_name, event_date, event_link):
    response = session.get(event_link)

    parse_only = SoupStrainer(class_="e-loc-cost")
    soup = BeautifulSoup(response.text, "lxml", parse_only=parse_only)

    for items in soup.select('.e-loc-cost'):
        location = ' '.join([' '.join(item.text.split()) for item in items.select(".event-add")])
        cost = ' '.join([' '.join(item.text.split()) for item in items.select(".costs")])

        print(event_name, event_date, location, cost)

if __name__ == '__main__':
    with requests.Session() as session:
        glean_items(session, URL)
- I see our recommendations are quite similar :). And I always forget that `requests.Session` is a context manager... – Graipher, Dec 25, 2017 at 22:30
- @Graipher ah, nice, quite similar, but good idea to extract the souping into a separate function! :) Happy holidays! – alecxe, Dec 25, 2017 at 22:35
- Happy holidays to you as well! – Graipher, Dec 25, 2017 at 22:37
- Thanks for pointing out the fragility of the selector I used above to grab the "read more" links. Your suggested one is much better and more consistent. I have a question on this, though. There are two classes in it: `<div class="pull-right more">`. How did you become sure which one is generated dynamically and which one is not? Because instead of `pull-right` you prefer to go with `more`. Thanks again, @alecxe. – MITHU, Dec 26, 2017 at 10:58
- @Mithu no problem. Sure, my thinking was that "more" in this case is logical: you are following the link to more information about this search result. Thanks. – alecxe, Dec 26, 2017 at 14:47
- I would use `requests.Session`, which allows re-using the connection. This speeds up successive requests (as you are doing here).
- I would factor out the getting and souping into a function, to be more DRY. This uses the Python 3.x exclusive `yield from` to consume its argument as an iterator:
def get(session, url, selector):
    res = session.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    yield from soup.select(selector)
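    # note: the 'yield from' above is shorthand for re-yielding each
    # element of the iterable, i.e. it is equivalent to:
    #     for element in soup.select(selector):
    #         yield element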
- When you only want the first occurrence of something, you can use `select_one(...)` instead of `select(...)[0]`, which is slightly more readable (and might even be faster, depending on the implementation).
- Instead of `print`ing, return/yield the found values and make it the responsibility of the caller to print them.
- You should have a look at Python's official style-guide, PEP8, which you already mostly follow. One thing it recommends, though, is adding a blank after a comma. It also recommends using `UPPER_CASE` for global constants.
- You could add docstrings to describe what your functions do (see the example right after this list).
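For example, a possible docstring for the `get` helper above (the wording is just illustrative):

def get(session, url, selector):
    """Fetch `url` with the given session and yield every element
    matching the CSS `selector`."""
    res = session.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    yield from soup.select(selector)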
With (most) of these changes implemented, I get this code:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = 'replace_with_the_above_link'

def get(session, url, selector):
    res = session.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    yield from soup.select(selector)

def glean_items(session, main_link):
    for item in get(session, main_link, '.course-list-item'):
        event_name = item.select_one(".lead").text
        date = item.select_one(".date").text
        link = urljoin(main_link, item.select_one(".pull-right a")['href'])
        for items in get(session, link, '.e-loc-cost'):
            location = ' '.join([' '.join(item.text.split())
                                 for item in items.select(".event-add")])
            cost = ' '.join([' '.join(item.text.split())
                             for item in items.select(".costs")])
            yield event_name, date, location, cost

if __name__ == '__main__':
    session = requests.Session()
    for event in glean_items(session, URL):
        print(*event)
Your original code took about 17.5s on my machine, this code takes about 10s.
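If you want to reproduce such a measurement, a simple sketch using `time.perf_counter()` could look like this (the exact numbers will of course depend on your machine and network):

import time

start = time.perf_counter()
with requests.Session() as session:
    for event in glean_items(session, URL):
        print(*event)
print(f"elapsed: {time.perf_counter() - start:.1f}s")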