I've written a script which, when run, parses the links of the items from the left-side menu, then follows each individual link to its main page. Some links spread across several paginated pages, some end on a single page. Whatever the case, the crawler scrapes the name and href of each item. While doing so I came across some duplicates, which I eventually filtered out using the got_already list. The site contains 251 videos and my crawler is able to parse them all. Here is what I have done:
import requests
from lxml import html

got_already=[]
Page_link="http://www.wiseowl.co.uk/videos/"
Blink="http://www.wiseowl.co.uk"

def startpoint(page):
    req=requests.Session()
    response = req.get(page)
    tree = html.fromstring(response.text)
    titles = tree.xpath("//ul[@class='woMenuList']//li[@class='woMenuItem']/a/@href")
    for title in titles:
        if "author" not in title and "year" not in title:
            GrabbingData(Blink + title)
            midpoint(Blink + title)

def midpoint(links):
    req=requests.Session()
    response = req.get(links)
    tree = html.fromstring(response.text)
    for links in tree.xpath("//a[@class='woPagingItem']/@href"):
        GrabbingData(Blink + links)

def GrabbingData(url):
    req=requests.Session()
    response = req.get(url)
    tree = html.fromstring(response.text)
    for item in tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']"):
        title = item.xpath('.//a/text()')[0]
        link = item.xpath('.//a/@href')[0]
        if title not in got_already:
            got_already.append(title)
            got_already.append(link)
            print(title,link)

startpoint(Page_link)
1 Answer
There are a couple of things you can improve. I'll start with some PEP8-related observations:

- variables / methods should be lower_cased: instead of `GrabbingData` you should write `grabbing_data`;
- for readability, you should have two newlines between your methods;
- use spaces around operators like `=`, `+` etc.;
- after each `,` you should put a space;
- constants are usually uppercased, so `page_link` should be `PAGE_LINK` and `blink` should be `BLINK`;
- add docstrings to your functions;
- it's good practice to split words using `_` (e.g. `start_point` instead of `startpoint`);
- it's also recommended to have one newline between builtin modules and 3rd-party modules.
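To make the naming and layout points concrete, here's a minimal sketch of how the script's skeleton might look after applying them (function bodies elided with `...`; this is only an illustration, not the full rewrite):

```python
import requests

from lxml import html

# constants are uppercased and defined at module level
PAGE_LINK = "http://www.wiseowl.co.uk/videos/"
BLINK = "http://www.wiseowl.co.uk"


def start_point(page):
    """Collect the menu links and hand each one off for scraping."""
    ...


def mid_point(link):
    """Follow the pagination links found on a menu page."""
    ...


def grabbing_data(url):
    """Scrape the name and href of every video on one page."""
    ...
```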
Now, about the code:

- it's not necessary to create a new `Session` in every one of your functions. It's preferable to create only one session for the same website, as it can give your scraper a speed boost. What's more, the session object allows you to persist certain parameters across requests (if needed);
- you only use `got_already` in your last function, so define it there. Or better yet, use a `set()`;
- instead of `if "author" not in title and "year" not in title` you can use `all()`;
- `PAGE_LINK` is irrelevant and confusing. You can build your full video links using just `BLINK`;
- last but not least, I'd use `BeautifulSoup` for html parsing. The downside of this parser is that it is much slower than lxml's HTML parser, so if performance matters you might want to consider using `soupparser` only as a fallback for certain cases.
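For instance, the session, `set()` and `all()` suggestions could look like this (a sketch with an illustrative value, not the full scraper):

```python
import requests

session = requests.Session()   # one session, reused for every request
got_already = set()            # a set gives O(1) membership tests

title = "/videos/author/andrew.htm"   # hypothetical href, for illustration only
# all() replaces the chained 'not in ... and ... not in' condition
if all(word not in title for word in ("author", "year")):
    got_already.add(title)
```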
That said, if I were you, I'd have built the scraper like this:
from requests import Session
from bs4 import BeautifulSoup as bs


def get_html_source(_session, url, min_pag, max_pag):
    """Yield the source html for each set of videos"""
    for i in range(min_pag, max_pag):
        html_source = _session.get('{}/videos/default-{}.htm'.format(url, i)).text
        yield bs(html_source, 'html.parser')


def parse_html(soup, url):
    """
    Process the html and get all the titles and the
    video urls that match the condition
    """
    for a in soup.find_all('p', attrs={'class': 'woVideoListDefaultSeriesTitle'}):
        video_title = a.find('a').text
        video_url = a.find('a', href=True)['href']
        if "author" not in video_title and "year" not in video_title:
            print('Title: {} | Url: {}{}'.format(video_title, url, video_url))


def main():
    _session = Session()
    # the video hrefs are root-relative, so build full links from the site root
    url = 'http://www.wiseowl.co.uk'
    for soup in get_html_source(_session, url, 1, 14):
        parse_html(soup, url)


if __name__ == '__main__':
    main()
Things that I did differently:
Intro
If you were paying attention, the website adds /videos/default-N.htm at the end of the base url depending on the set of pages you click at the bottom of the page. There's a total of 13 such urls.
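In other words, all thirteen page urls can be generated directly from that pattern:

```python
# the 13 paginated listing urls, built from the site's url pattern
urls = ['http://www.wiseowl.co.uk/videos/default-{}.htm'.format(i)
        for i in range(1, 14)]
```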
Coding around
My first function goes through each page and yields the parsed html source for each one.
In the second function, I've just used BeautifulSoup to get the titles and the urls of the videos that you're interested in. I've also cut out a lot of the logic you had because it wasn't needed any more, and I've printed the data a bit differently than you did. You should know that printing huge amounts of data has a big impact on performance; I'd recommend logging to a file instead.
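As a sketch, logging to a file with the standard logging module might look like this (`log_video` is a hypothetical helper name, not part of the code above):

```python
import logging

# send results to a file instead of the console
logging.basicConfig(filename='videos.log', level=logging.INFO,
                    format='%(message)s')


def log_video(video_title, video_url, base_url):
    """Hypothetical drop-in replacement for the print() call in parse_html."""
    logging.info('Title: %s | Url: %s%s', video_title, base_url, video_url)
```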
The third function is the one that gets called inside our `if __name__ == '__main__':` safeguard.