This is a program I wrote in Python using the BeautifulSoup library. The program scrapes YouTube search results for a given query and extracts data from the channels returned in the search results.
I'm just looking for some tips on how to make my code look (and function) better. I removed most of the redundancies, but the code still feels ugly to me.
Suggestions?
#!/usr/bin/python
# http://docs.python-requests.org/en/latest/user/quickstart/
# http://www.crummy.com/software/BeautifulSoup/bs4/doc/
import csv
import re
import requests
import time
from bs4 import BeautifulSoup
# scrapes the title
def getTitle():
    d = soup.find_all("h1", "branded-page-header-title")
    for i in d:
        name = i.text.strip().replace('\n',' ').replace(',','').encode("utf-8")
        f.write(name+',')
        print('\t\t%s') % (name)

# scrapes the subscriber and view count
def getStats():
    b = soup.find_all("li", "about-stat ") # trailing space is required.
    for i in b:
        value = i.b.text.strip().replace(',','')
        name = i.b.next_sibling.strip().replace(',','')
        f.write(value+',')
        print('\t\t%s = %s') % (name, value)

# scrapes the description
def getDescription():
    c = soup.find_all("div", "about-description")
    for i in c:
        description = i.text.strip().replace('\n',' ').replace(',','').encode("utf-8")
        f.write(description+',')
        #print('\t\t%s') % (description)

# scrapes all the external links
def getLinks():
    a = soup.find_all("a", "about-channel-link ") # trailing space is required.
    for i in a:
        url = i.get('href')
        f.write(url+',')
        print('\t\t%s') % (url)

# scrapes the related channels
def getRelated():
    s = soup.find_all("h3", "yt-lockup-title")
    for i in s:
        t = i.find_all(href=re.compile("user"))
        for i in t:
            url = 'https://www.youtube.com'+i.get('href')
            rCSV.write(url+'\n')
            print('\t\t%s,%s') % (i.text, url)
f = open("youtube-scrape-data.csv", "w+")
rCSV = open("related-channels.csv", "w+")
visited = []
base = "https://www.youtube.com/results?search_query="
q = ['search+query+here']
page = "&page="
count = 1
pagesToScrape = 20
for query in q:
    while count <= pagesToScrape:
        scrapeURL = base + str(query) + page + str(count)
        print('Scraping %s\n') %(scrapeURL)
        r = requests.get(scrapeURL)
        soup = BeautifulSoup(r.text)
        users = soup.find_all("div", "yt-lockup-byline")
        for each in users:
            a = each.find_all(href=re.compile("user"))
            for i in a:
                url = 'https://www.youtube.com'+i.get('href')+'/about'
                if url in visited:
                    print('\t%s has already been scraped\n\n') %(url)
                else:
                    r = requests.get(url)
                    soup = BeautifulSoup(r.text)
                    f.write(url+',')
                    print('\t%s') % (url)
                    getTitle()
                    getStats()
                    getDescription()
                    getLinks()
                    getRelated()
                    f.write('\n')
                    print('\n')
                    visited.append(url)
                    time.sleep(3)
        count += 1
        time.sleep(3)
        print('\n')
    count = 1
    print('\n')
f.close()
f.close()
2 Answers
I'm pretty much a n00b to programming myself, so take my advice with a grain of salt... but I would try making each of your "get..." functions into a method of a class (let's say `YoutubeVid`). Its `__init__` would set all the attributes at once, without printing. A separate function, let's say `print_attributes`, could do the printing. Once you code that part, you would replace this:
else:
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    f.write(url+',')
    print('\t%s') % (url)
    getTitle()
    getStats()
    getDescription()
    getLinks()
    getRelated()
    f.write('\n')
    print('\n')
    visited.append(url)
    time.sleep(3)
With something like this:
else:
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    video_page = YoutubeVid(soup)
    print_attributes(video_page)
I'm sorry I don't have the time to work out a more detailed example, but if that makes any sense to you, maybe you can give it a try and post what you come up with.
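That said, a bare-bones sketch of the idea might look something like this (`YoutubeVid` and `print_attributes` are just the placeholder names from above, the selectors are copied from the question, and I've dropped the trailing spaces from the class names — adjust to whatever the live markup actually needs):

```python
from bs4 import BeautifulSoup

class YoutubeVid(object):
    """Holds everything scraped from one channel page, set once in __init__."""
    def __init__(self, soup):
        # no printing here - just collect the data
        self.titles = [h.text.strip() for h in
                       soup.find_all("h1", "branded-page-header-title")]
        self.links = [a.get('href') for a in
                      soup.find_all("a", "about-channel-link")]

def print_attributes(video_page):
    """All the printing lives in one place."""
    for title in video_page.titles:
        print('\t\t%s' % title)
    for link in video_page.links:
        print('\t\t%s' % link)
```

Separating "collect" from "print" also means you can later write the attributes to CSV without touching the scraping code.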
Also, a minor detail regarding function names... Mixed case like `getTitle()` is discouraged in Python; lowercase with underscores like `get_title()` is preferred. See the PEP 8 Style Guide.
Code Style

There are multiple PEP8 code style violations; some of them are:

- naming - use the `lower_case_with_underscores` naming style
- spaces around operators
- blank lines between imports and functions

You should also improve your variable naming - for example, `d`, `b` and `i` are not meaningful. Think of more descriptive names - remember: code is much more often read than written.
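To make the naming point concrete, here is how `getStats()` might read with descriptive names (a hypothetical rewrite that returns data instead of writing and printing; I've dropped the trailing space from the class name here, so adjust to the live markup):

```python
from bs4 import BeautifulSoup

def get_stats(soup):
    """Return a {stat name: value} mapping from a channel's About page."""
    stats = {}
    for stat in soup.find_all("li", "about-stat"):
        value = stat.b.text.strip().replace(',', '')  # e.g. '1,234' -> '1234'
        name = stat.b.next_sibling.strip()            # e.g. 'subscribers'
        stats[name] = value
    return stats
```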
HTML-parsing and Web-scraping

- instantiate `requests.Session()` and reuse it to make requests - this would give you a performance boost "for free":

  > if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase

- you can replace `.find_all()` calls with more explicit and robust `select()` calls and use CSS selectors. For instance, `soup.find_all("h1", "branded-page-header-title")` would become `soup.select("h1.branded-page-header-title")`
- it is also a good idea to specify the parser that `BeautifulSoup` uses under the hood explicitly:

      soup = BeautifulSoup(r.text, "html.parser")  # built-in, no extra dependencies
      # soup = BeautifulSoup(r.text, "lxml")       # the fastest
      # soup = BeautifulSoup(r.text, "html5lib")   # the most lenient
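Combining those three points, the fetch-and-parse step could be sketched like this (the selector is taken from the question; the function names are illustrative only):

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # one session, so the TCP connection is reused

def get_soup(url):
    """Fetch url through the shared session and parse with an explicit parser."""
    response = session.get(url)
    return BeautifulSoup(response.text, "html.parser")

def get_titles(soup):
    """CSS-selector version of soup.find_all("h1", "branded-page-header-title")."""
    return [h1.get_text(strip=True)
            for h1 in soup.select("h1.branded-page-header-title")]
```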
Other

- use the `with` context manager when dealing with file-like objects
- either remove the unused `csv` import or use it to write the results into CSV files
- convert the comments before the functions into proper docstrings
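For instance, the first two bullets could be combined into something like this (Python 3 shown; the filename is from the question, the function name is illustrative). The `csv` module quotes fields for you, which makes all those `.replace(',', '')` calls unnecessary:

```python
import csv

def write_rows(rows, filename="youtube-scrape-data.csv"):
    """Write one row per scraped channel; csv handles commas and quoting."""
    # the with-block closes the file even if an exception is raised
    with open(filename, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerows(rows)
```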