I've written a script which, when run, parses the links of the items from the left-side menu, then follows each individual link to its main page. Some links spread across several paginated pages, some end on a single page. Whatever the case, the crawler scrapes the name and href of each item. While doing so I came across some duplicates, which I eventually filtered out using the got_already list. The site contains 251 videos and my crawler is able to parse them all. Here is what I have done:
import requests
from lxml import html

got_already=[]
Page_link="http://www.wiseowl.co.uk/videos/"
Blink="http://www.wiseowl.co.uk"

def startpoint(page):
    req=requests.Session()
    response = req.get(page)
    tree = html.fromstring(response.text)
    titles = tree.xpath("//ul[@class='woMenuList']//li[@class='woMenuItem']/a/@href")
    for title in titles:
        if "author" not in title and "year" not in title:
            GrabbingData(Blink + title)
            midpoint(Blink + title)

def midpoint(links):
    req=requests.Session()
    response = req.get(links)
    tree = html.fromstring(response.text)
    for links in tree.xpath("//a[@class='woPagingItem']/@href"):
        GrabbingData(Blink + links)

def GrabbingData(url):
    req=requests.Session()
    response = req.get(url)
    tree = html.fromstring(response.text)
    for item in tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']"):
        title = item.xpath('.//a/text()')[0]
        link = item.xpath('.//a/@href')[0]
        if title not in got_already:
            got_already.append(title)
            got_already.append(link)
            print(title,link)

startpoint(Page_link)
1 Answer
There are a couple of things you can improve. I'll start with some PEP8-related observations:

- variables / methods should be lower_cased: instead of `GrabbingData` you should write `grabbing_data`;
- for readability, you should have two newlines between your methods;
- use spaces around operators like `=`, `+` etc.;
- after each `,` you should put a space;
- constants are usually uppercased, so `page_link` should be `PAGE_LINK` and `blink` should be `BLINK`;
- add docstrings to your functions;
- it's good practice to split words using `_` (e.g. `start_point` instead of `startpoint`);
- it's also recommended to have one newline between builtin modules and 3rd-party modules.
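To make the naming and layout points concrete, here's a minimal sketch of how the script's skeleton might look after applying them (function bodies elided with `...`; this is only an illustration, not the full rewrite):

```python
import requests

from lxml import html

# constants are uppercased and defined at module level
PAGE_LINK = "http://www.wiseowl.co.uk/videos/"
BLINK = "http://www.wiseowl.co.uk"


def start_point(page):
    """Collect the menu links and hand each one off for scraping."""
    ...


def mid_point(link):
    """Follow the pagination links found on a menu page."""
    ...


def grabbing_data(url):
    """Scrape the name and href of every video on one page."""
    ...
```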
Now, about the code:

- it's not necessary to create a new `Session` in every one of your functions. It's preferable to create only one session for the same website, as it can give your scraper a speed boost. What's more, the session object allows you to persist certain parameters across requests (if needed);
- you only use `got_already` in your last function, so define it there. Or better yet, use a `set()`;
- instead of `if "author" not in title and "year" not in title` you can use `all()`;
- `PAGE_LINK` is irrelevant and confusing. You can build your full video links using just `BLINK`;
- last but not least, I'd use `BeautifulSoup` for html parsing. The downside of this parser is that it is much slower than lxml's HTML parser, so if performance matters you might want to consider using `soupparser` only as a fallback for certain cases.
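For instance, the session, `set()` and `all()` suggestions could look like this (a sketch with an illustrative value, not the full scraper):

```python
import requests

session = requests.Session()   # one session, reused for every request
got_already = set()            # a set gives O(1) membership tests

title = "/videos/author/andrew.htm"   # hypothetical href, for illustration only
# all() replaces the chained 'not in ... and ... not in' condition
if all(word not in title for word in ("author", "year")):
    got_already.add(title)
```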
That said, if I were you, I'd have built the scraper like this:
from requests import Session
from bs4 import BeautifulSoup as bs


def get_html_source(_session, url, min_pag, max_pag):
    """Yield the source html for each set of videos"""
    for i in range(min_pag, max_pag):
        html_source = _session.get('{}/videos/default-{}.htm'.format(url, i)).text
        yield bs(html_source, 'html.parser')


def parse_html(soup, url):
    """
    Process the html and get all the titles and the
    video urls that match the condition
    """
    for a in soup.find_all('p', attrs={'class': 'woVideoListDefaultSeriesTitle'}):
        video_title = a.find('a').text
        video_url = a.find('a', href=True)['href']
        if "author" not in video_title and "year" not in video_title:
            print('Title: {} | Url: {}{}'.format(video_title, url, video_url))


def main():
    _session = Session()
    # the video hrefs are root-relative, so build full links from the site root
    url = 'http://www.wiseowl.co.uk'
    for soup in get_html_source(_session, url, 1, 14):
        parse_html(soup, url)


if __name__ == '__main__':
    main()
Things that I did differently:
Intro
If you were paying attention, the website adds /videos/default-N.htm at the end of the base url depending on the set of pages you click at the bottom of the page. There's a total of 13 such urls.
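In other words, all thirteen page urls can be generated directly from that pattern:

```python
# the 13 paginated listing urls, built from the site's url pattern
urls = ['http://www.wiseowl.co.uk/videos/default-{}.htm'.format(i)
        for i in range(1, 14)]
```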
Coding around
My first function goes through each page and yields the parsed html source for each one.
In the second function, I've just used BeautifulSoup to get the titles and the urls of the videos that you're interested in. I've also cut out a lot of the logic you had because it wasn't needed any more, and I've printed the data a bit differently than you did. You should know that printing huge amounts of data has a big impact on performance; I'd recommend logging to a file instead.
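As a sketch, logging to a file with the standard logging module might look like this (`log_video` is a hypothetical helper name, not part of the code above):

```python
import logging

# send results to a file instead of the console
logging.basicConfig(filename='videos.log', level=logging.INFO,
                    format='%(message)s')


def log_video(video_title, video_url, base_url):
    """Hypothetical drop-in replacement for the print() call in parse_html."""
    logging.info('Title: %s | Url: %s%s', video_title, base_url, video_url)
```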
The third function is the one that gets called inside our `if __name__ == '__main__':` safeguard.