I've written a script to crawl a website recursively until all the links connected to some tutorials are exhausted. It is working smoothly now. There is always room for improvement, though!
```python
import requests
from lxml import html

Page_link="http://www.wiseowl.co.uk/videos/"
visited_links = []

def GrabbingData(url):
    base="http://www.wiseowl.co.uk"
    visited_links.append(url)
    response = requests.get(url)
    tree = html.fromstring(response.text)
    title = tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]//a/text()')
    link = tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]//a/@href')
    for i,j in zip(title,link):
        print(i,j)
    pagination=tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//a[@class='woPagingItem' or @class='woPagingNext']/@href")
    for nextp in pagination:
        url1 = str(base + nextp)
        if url1 not in visited_links:
            GrabbingData(url1)

GrabbingData(Page_link)
```
Comment (yuri, May 26, 2017): Could you briefly explain your reason for choosing recursion over iteration for this?
1 Answer
First of all, you don't need to make your solution recursive. An iterative approach would be simpler and more intuitive in your case.
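If you do want to keep the original link-following approach, the recursion can be replaced with an explicit queue of pending URLs. Here is a minimal sketch of that traversal logic over a toy in-memory link graph (the graph data is hypothetical, just to illustrate the recursion-to-iteration conversion, so no network access is involved):

```python
from collections import deque

# Hypothetical link graph standing in for the site's pagination links.
links = {
    "/videos/": ["/videos/default-2.htm"],
    "/videos/default-2.htm": ["/videos/", "/videos/default-3.htm"],
    "/videos/default-3.htm": [],
}

def crawl(start):
    """Visit every reachable page exactly once, without recursion."""
    visited = []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.append(url)
        # In the real script this is where the page would be fetched
        # and its pagination links extracted via XPath.
        for next_url in links[url]:
            if next_url not in visited:
                queue.append(next_url)
    return visited

print(crawl("/videos/"))
```

Swapping `deque.popleft()` for `deque.pop()` turns the breadth-first traversal into a depth-first one that visits pages in the same order the recursive version does.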
Moreover, there is a better way to handle pagination for this particular website: the paginated parts of the video catalog follow the `http://www.wiseowl.co.uk/videos/default-<number>.htm` pattern, which means that you can start with `number=1` and increment it until you get a 404, which concludes the catalog:
```python
import requests
from lxml import html

URL_PATTERN = "http://www.wiseowl.co.uk/videos/default-{}.htm"

with requests.Session() as session:
    page_number = 1
    while True:
        response = session.get(URL_PATTERN.format(page_number))
        if response.status_code == 404:  # break once the page is not found
            break

        print("Processing page number {}..".format(page_number))

        tree = html.fromstring(response.text)
        for video_link in tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]//a'):
            title = video_link.text
            link = video_link.attrib['href']
            print(title, link)

        page_number += 1
```
Notes about some of the improvements I've made:
- re-using the same `Session` instance: this results in a memory-usage and performance improvement because the same TCP connection is re-used: "if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection)"
- instead of searching the entire tree for the "video" elements 2 times, looping over the video link elements directly, once
- naming: first of all, make sure to follow the `lower_case_with_underscores` Python naming recommendations, and try to avoid meaningless variable names like `i` and `j` if they are not used as throwaway loop variables; `url1` was not a good variable choice either
- follow the other `PEP8` style guide recommendations; in particular, watch the spaces around operators and the newlines
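As a concrete illustration of the naming points, here is a hypothetical snippet (the function name and data are made up, not taken from the original script) rewritten with `lower_case_with_underscores` and descriptive loop variables:

```python
def pair_videos(titles, links):
    """Pair each video title with its link - a stand-in for the printing loop."""
    return list(zip(titles, links))

# Descriptive names instead of i/j and url1:
for title, link in pair_videos(["Intro"], ["/videos/intro.htm"]):
    print(title, link)
```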
Comment (SIM, May 27, 2017): Thanks, alecxe, for such invaluable suggestions. I'll try to comply with them from now on. You are really a legend.