I have written some Python code to scrape items from a webpage. Each container has a link (titled "see more") attached to it. Clicking that link leads to a page where all of the information is available, so it would be convenient to simply follow the "see more" link and parse everything from there. However, my goal is to parse the first two items (the event name and date) from the first page, then move on to the other page (by following the "see more" link) and parse the rest there. I have implemented exactly that, and it works fine.
At this point, I'm seriously dubious whether the way I'm doing things is right or error-prone, because I pass two items from the earlier page to be printed inside the for loop defined in the later function. It gives accurate results, though. Any suggestion as to whether what I did here is ideal, or any guidance on why I should not practice this, will be highly appreciated.
This is the site link. This is the full script:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'replace_with_the_above_link'

def glean_items(main_link):
    res = requests.get(main_link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select('.course-list-item'):
        event = item.select(".lead")[0].text
        date = item.select(".date")[0].text
        link = urljoin(main_link,item.select(".pull-right a")[0]['href'])
        parse_more(event,date,link)

def parse_more(event_name,ev_date,i_link): ## notice the two items (event_name,ev_date)
    res = requests.get(i_link)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select('.e-loc-cost'):
        location = ' '.join([' '.join(item.text.split()) for item in items.select(".event-add")])
        cost = ' '.join([' '.join(item.text.split()) for item in items.select(".costs")])
        print(event_name,ev_date,location,cost) ##again take a look: I used those two items within this newly created for loop.

if __name__ == '__main__':
    glean_items(url)
2 Answers
First of all, the usual things (I feel like I'm suggesting these in most web-scraping related discussions):

- initialize a `session` as `requests.Session()` and use `session.get()` instead of `requests.get()`; this would speed things up because the underlying TCP connection will be re-used for subsequent queries to the same domain (see the sketch right after this list)
- use `SoupStrainer` to limit what is parsed when the "soup" object is initialized
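A minimal sketch of both ideas together (the URL here is just a placeholder):

import requests
from bs4 import BeautifulSoup, SoupStrainer

with requests.Session() as session:  # the underlying TCP connection is re-used between .get() calls
    response = session.get('https://example.com')  # placeholder URL

    # build the tree only from elements with this class, skipping the rest of the page
    parse_only = SoupStrainer(class_='course-list-item')
    soup = BeautifulSoup(response.text, 'lxml', parse_only=parse_only)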
And I am not completely sure about the reliability of the `.pull-right a` selector. As it stands, it reads as "follow the link that is on the right in its respective container", and the position of an element on the screen may easily change with a design change. What about a selector like `a[id*=ReadMore]` or `.more > a` instead?
Also, note that `select()[0]` could be replaced with `select_one()`.
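For instance, a small hypothetical snippet (assuming `item` is one result container from the listing page):

# position-based selector: breaks as soon as the layout changes
link = item.select(".pull-right a")[0]['href']

# purpose-based alternative, using select_one() for the first match
link = item.select_one(".more > a")['href']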
Code Style
There is not a lot to point out since the code is rather short and straightforward, but there are some PEP8 violations that can be addressed:

- watch the use of blank lines
- you are missing spaces between the function arguments
- `url` should be named `URL`, since it is a "constant"
- use a single `#` for comments
- group imports properly

Some of the variable names could be more explicit, e.g. `event_date` in place of `ev_date` and `event_link` instead of `i_link`.
The improved code
from urllib.parse import urljoin

from bs4 import BeautifulSoup, SoupStrainer
import requests

URL = 'replace_with_the_above_link'

def glean_items(session, main_link):
    response = session.get(main_link)

    parse_only = SoupStrainer(class_='course-list-item')
    soup = BeautifulSoup(response.text, "lxml", parse_only=parse_only)

    for item in soup.select('.course-list-item'):
        event = item.select_one(".lead").text
        date = item.select_one(".date").text
        link = urljoin(main_link, item.select_one(".pull-right a")['href'])

        parse_more(session, event, date, link)

def parse_more(session, event_name, event_date, event_link):
    response = session.get(event_link)

    parse_only = SoupStrainer(class_="e-loc-cost")
    soup = BeautifulSoup(response.text, "lxml", parse_only=parse_only)

    for items in soup.select('.e-loc-cost'):
        location = ' '.join([' '.join(item.text.split()) for item in items.select(".event-add")])
        cost = ' '.join([' '.join(item.text.split()) for item in items.select(".costs")])

        print(event_name, event_date, location, cost)

if __name__ == '__main__':
    with requests.Session() as session:
        glean_items(session, URL)
- I see our recommendations are quite similar :). And I always forget that `requests.Session` is a context manager... – Graipher, Dec 25, 2017 at 22:30
- @Graipher ah, nice, quite similar, but good idea to extract the souping into a separate function! :) Happy holidays! – alecxe, Dec 25, 2017 at 22:35
- Happy holidays to you as well! – Graipher, Dec 25, 2017 at 22:37
- Thanks for pointing out the fragility of the selector I used above to grab the "read more" links. Your suggested one is much better and more consistent. I have a question on this, though. There are two classes in it: `<div class="pull-right more">`. How did you become sure which one is generated dynamically and which one is not? Because instead of `pull-right` you prefer to go with `more`. Thanks again, @alecxe. – MITHU, Dec 26, 2017 at 10:58
- @Mithu no problem. Sure, my thinking was that "more" in this case is logical: you are following the link to more information about this search result. Thanks. – alecxe, Dec 26, 2017 at 14:47
- I would use `requests.Session`, which allows re-using the connection. This speeds up successive requests (as you are doing here).
- I would factor out the getting and souping into a function, to be more DRY. This uses the Python 3.x exclusive `yield from` to consume its argument as an iterator:
def get(session, url, selector):
    res = session.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    yield from soup.select(selector)
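    # note: the 'yield from' above is shorthand for re-yielding each
    # element of the iterable, i.e. it is equivalent to:
    #     for element in soup.select(selector):
    #         yield element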
- When you only want the first occurrence of something, you can use `select_one(...)` instead of `select(...)[0]`, which is slightly more readable (and might even be faster, depending on the implementation).
- Instead of `print`ing, return/yield the found values and make it the responsibility of the caller to print them.
- You should have a look at Python's official style-guide, PEP8, which you already mostly follow. One thing it recommends, though, is adding a blank after a comma. It also recommends using `UPPER_CASE` for global constants.
- You could add docstrings to describe what your functions do (see the example right after this list).
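For example, a possible docstring for the `get` helper above (the wording is just illustrative):

def get(session, url, selector):
    """Fetch `url` with the given session and yield every element
    matching the CSS `selector`."""
    res = session.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    yield from soup.select(selector)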
With (most) of these changes implemented, I get this code:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = 'replace_with_the_above_link'

def get(session, url, selector):
    res = session.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    yield from soup.select(selector)

def glean_items(session, main_link):
    for item in get(session, main_link, '.course-list-item'):
        event_name = item.select_one(".lead").text
        date = item.select_one(".date").text
        link = urljoin(main_link, item.select_one(".pull-right a")['href'])
        for items in get(session, link, '.e-loc-cost'):
            location = ' '.join([' '.join(item.text.split())
                                 for item in items.select(".event-add")])
            cost = ' '.join([' '.join(item.text.split())
                             for item in items.select(".costs")])
            yield event_name, date, location, cost

if __name__ == '__main__':
    session = requests.Session()
    for event in glean_items(session, URL):
        print(*event)
Your original code took about 17.5s on my machine, this code takes about 10s.
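If you want to reproduce such a measurement, a simple sketch using `time.perf_counter()` could look like this (the exact numbers will of course depend on your machine and network):

import time

start = time.perf_counter()
with requests.Session() as session:
    for event in glean_items(session, URL):
        print(*event)
print(f"elapsed: {time.perf_counter() - start:.1f}s")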