I've written a script to crawl a website recursively until all the links connected to some tutorials are exhausted. It is working smoothly now. There is always room for improvement, though!
```python
import requests
from lxml import html

Page_link="http://www.wiseowl.co.uk/videos/"
visited_links = []

def GrabbingData(url):
    base="http://www.wiseowl.co.uk"
    visited_links.append(url)
    response = requests.get(url)
    tree = html.fromstring(response.text)
    title = tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]//a/text()')
    link = tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]//a/@href')
    for i,j in zip(title,link):
        print(i,j)
    pagination=tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//a[@class='woPagingItem' or @class='woPagingNext']/@href")
    for nextp in pagination:
        url1 = str(base + nextp)
        if url1 not in visited_links:
            GrabbingData(url1)

GrabbingData(Page_link)
```
Comment (yuri, May 26, 2017): Could you briefly explain your reason for choosing recursion over iteration for this?
1 Answer
First of all, you don't need to make your solution recursive. An iterative approach would be simpler and more intuitive in your case.
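If you do want to keep the original link-following approach, the recursion can be replaced with an explicit queue of pending URLs. Here is a minimal sketch of that traversal logic over a toy in-memory link graph (the graph data is hypothetical, just to illustrate the recursion-to-iteration conversion, so no network access is involved):

```python
from collections import deque

# Hypothetical link graph standing in for the site's pagination links.
links = {
    "/videos/": ["/videos/default-2.htm"],
    "/videos/default-2.htm": ["/videos/", "/videos/default-3.htm"],
    "/videos/default-3.htm": [],
}

def crawl(start):
    """Visit every reachable page exactly once, without recursion."""
    visited = []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.append(url)
        # In the real script this is where the page would be fetched
        # and its pagination links extracted via XPath.
        for next_url in links[url]:
            if next_url not in visited:
                queue.append(next_url)
    return visited

print(crawl("/videos/"))
```

Swapping `deque.popleft()` for `deque.pop()` turns the breadth-first traversal into a depth-first one that visits pages in the same order the recursive version does.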
Moreover, there is a better way to handle pagination for this particular website: the paginated parts of the video catalog follow the `http://www.wiseowl.co.uk/videos/default-<number>.htm` pattern, which means that you can start with `number=1` and increment it until you get a 404, which concludes the catalog:
```python
import requests
from lxml import html

URL_PATTERN = "http://www.wiseowl.co.uk/videos/default-{}.htm"

with requests.Session() as session:
    page_number = 1
    while True:
        response = session.get(URL_PATTERN.format(page_number))
        if response.status_code == 404:  # break once the page is not found
            break

        print("Processing page number {}..".format(page_number))

        tree = html.fromstring(response.text)
        for video_link in tree.xpath('//p[@class="woVideoListDefaultSeriesTitle"]//a'):
            title = video_link.text
            link = video_link.attrib['href']
            print(title, link)

        page_number += 1
```
Notes about some of the improvements I've made:
- re-using the same `Session` instance: this results in a memory-usage and performance improvement because the same TCP connection is re-used: "if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection)"
- instead of searching the entire tree for the "video" elements 2 times, looping over the video link elements directly, once
- naming: first of all, make sure to follow the `lower_case_with_underscores` Python naming recommendations, and try to avoid meaningless variable names like `i` and `j` if they are not used as throwaway loop variables; `url1` was not a good variable choice either
- follow the other `PEP8` style guide recommendations; in particular, watch the spaces around operators and the newlines
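As a concrete illustration of the naming points, here is a hypothetical snippet (the function name and data are made up, not taken from the original script) rewritten with `lower_case_with_underscores` and descriptive loop variables:

```python
def pair_videos(titles, links):
    """Pair each video title with its link - a stand-in for the printing loop."""
    return list(zip(titles, links))

# Descriptive names instead of i/j and url1:
for title, link in pair_videos(["Intro"], ["/videos/intro.htm"]):
    print(title, link)
```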
Comment (SIM, May 27, 2017): Thanks, alecxe, for such invaluable suggestions. I'll try to comply with them from now on. You are really a legend.