This is a program I wrote in Python using the BeautifulSoup library. The program scrapes YouTube search results for a given query and extracts data from the channels returned in the search results.
I'm just looking for some tips on how to make my code look (and function) better. I removed most of the redundancies, but the code still feels ugly to me.
Suggestions?
#!/usr/bin/python
# http://docs.python-requests.org/en/latest/user/quickstart/
# http://www.crummy.com/software/BeautifulSoup/bs4/doc/
import csv
import re
import requests
import time
from bs4 import BeautifulSoup
# scrapes the title
def getTitle():
    d = soup.find_all("h1", "branded-page-header-title")
    for i in d:
        name = i.text.strip().replace('\n',' ').replace(',','').encode("utf-8")
        f.write(name+',')
        print('\t\t%s') % (name)

# scrapes the subscriber and view count
def getStats():
    b = soup.find_all("li", "about-stat ") # trailing space is required.
    for i in b:
        value = i.b.text.strip().replace(',','')
        name = i.b.next_sibling.strip().replace(',','')
        f.write(value+',')
        print('\t\t%s = %s') % (name, value)

# scrapes the description
def getDescription():
    c = soup.find_all("div", "about-description")
    for i in c:
        description = i.text.strip().replace('\n',' ').replace(',','').encode("utf-8")
        f.write(description+',')
        #print('\t\t%s') % (description)

# scrapes all the external links
def getLinks():
    a = soup.find_all("a", "about-channel-link ") # trailing space is required.
    for i in a:
        url = i.get('href')
        f.write(url+',')
        print('\t\t%s') % (url)

# scrapes the related channels
def getRelated():
    s = soup.find_all("h3", "yt-lockup-title")
    for i in s:
        t = i.find_all(href=re.compile("user"))
        for i in t:
            url = 'https://www.youtube.com'+i.get('href')
            rCSV.write(url+'\n')
            print('\t\t%s,%s') % (i.text, url)
f = open("youtube-scrape-data.csv", "w+")
rCSV = open("related-channels.csv", "w+")
visited = []
base = "https://www.youtube.com/results?search_query="
q = ['search+query+here']
page = "&page="
count = 1
pagesToScrape = 20
for query in q:
    while count <= pagesToScrape:
        scrapeURL = base + str(query) + page + str(count)
        print('Scraping %s\n') %(scrapeURL)
        r = requests.get(scrapeURL)
        soup = BeautifulSoup(r.text)
        users = soup.find_all("div", "yt-lockup-byline")
        for each in users:
            a = each.find_all(href=re.compile("user"))
            for i in a:
                url = 'https://www.youtube.com'+i.get('href')+'/about'
                if url in visited:
                    print('\t%s has already been scraped\n\n') %(url)
                else:
                    r = requests.get(url)
                    soup = BeautifulSoup(r.text)
                    f.write(url+',')
                    print('\t%s') % (url)
                    getTitle()
                    getStats()
                    getDescription()
                    getLinks()
                    getRelated()
                    f.write('\n')
                    print('\n')
                    visited.append(url)
                    time.sleep(3)
        count += 1
        time.sleep(3)
        print('\n')
    count = 1
    print('\n')
f.close()
f.close()
2 Answers
I'm pretty much a n00b to programming myself, so take my advice with a grain of salt... but I would try making each of your "get..." functions into a method of a class (let's say `YoutubeVid`). Its `__init__` would set all the attributes at once, without printing. A separate function, let's say `print_attributes`, could do the printing. Once you code that part, you would replace this:
else:
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    f.write(url+',')
    print('\t%s') % (url)
    getTitle()
    getStats()
    getDescription()
    getLinks()
    getRelated()
    f.write('\n')
    print('\n')
    visited.append(url)
    time.sleep(3)
With something like this:
else:
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    video_page = YoutubeVid(soup)
    print_attributes(video_page)
I'm sorry I don't have the time to work out a more detailed example, but if that makes any sense to you, maybe you can give it a try and post what you come up with.
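That said, a bare-bones sketch of the idea might look something like this (`YoutubeVid` and `print_attributes` are just the placeholder names from above, the selectors are copied from the question, and I've dropped the trailing spaces from the class names — adjust to whatever the live markup actually needs):

```python
from bs4 import BeautifulSoup

class YoutubeVid(object):
    """Holds everything scraped from one channel page, set once in __init__."""
    def __init__(self, soup):
        # no printing here - just collect the data
        self.titles = [h.text.strip() for h in
                       soup.find_all("h1", "branded-page-header-title")]
        self.links = [a.get('href') for a in
                      soup.find_all("a", "about-channel-link")]

def print_attributes(video_page):
    """All the printing lives in one place."""
    for title in video_page.titles:
        print('\t\t%s' % title)
    for link in video_page.links:
        print('\t\t%s' % link)
```

Separating "collect" from "print" also means you can later write the attributes to CSV without touching the scraping code.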
Also, a minor detail regarding function names... Mixed case like `getTitle()` is discouraged in Python; lowercase with underscores like `get_title()` is preferred. See the PEP 8 Style Guide.
Code Style

There are multiple PEP8 code style violations; some of them are:

- naming - use the `lower_case_with_underscores` naming style
- spaces around operators
- blank lines between imports and functions

You should also improve your variable naming - for example, `d`, `b` and `i` are not meaningful. Think of more descriptive names - remember: code is much more often read than written.
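To make the naming point concrete, here is how `getStats()` might read with descriptive names (a hypothetical rewrite that returns data instead of writing and printing; I've dropped the trailing space from the class name here, so adjust to the live markup):

```python
from bs4 import BeautifulSoup

def get_stats(soup):
    """Return a {stat name: value} mapping from a channel's About page."""
    stats = {}
    for stat in soup.find_all("li", "about-stat"):
        value = stat.b.text.strip().replace(',', '')  # e.g. '1,234' -> '1234'
        name = stat.b.next_sibling.strip()            # e.g. 'subscribers'
        stats[name] = value
    return stats
```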
HTML-parsing and Web-scraping

- instantiate `requests.Session()` and reuse it to make requests - this would give you a performance boost "for free":

  > if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase

- you can replace `.find_all()` calls with more explicit and robust `select()` calls and use CSS selectors. For instance, `soup.find_all("h1", "branded-page-header-title")` would become `soup.select("h1.branded-page-header-title")`
- it is also a good idea to specify the parser that `BeautifulSoup` uses under the hood explicitly:

      soup = BeautifulSoup(r.text, "html.parser")  # built-in, no extra dependencies
      # soup = BeautifulSoup(r.text, "lxml")       # the fastest
      # soup = BeautifulSoup(r.text, "html5lib")   # the most lenient
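Combining those three points, the fetch-and-parse step could be sketched like this (the selector is taken from the question; the function names are illustrative only):

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # one session, so the TCP connection is reused

def get_soup(url):
    """Fetch url through the shared session and parse with an explicit parser."""
    response = session.get(url)
    return BeautifulSoup(response.text, "html.parser")

def get_titles(soup):
    """CSS-selector version of soup.find_all("h1", "branded-page-header-title")."""
    return [h1.get_text(strip=True)
            for h1 in soup.select("h1.branded-page-header-title")]
```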
Other

- use the `with` context manager when dealing with file-like objects
- either remove the unused `csv` import or use it to write the results into CSV files
- convert the comments before the functions into proper docstrings
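For instance, the first two bullets could be combined into something like this (Python 3 shown; the filename is from the question, the function name is illustrative). The `csv` module quotes fields for you, which makes all those `.replace(',', '')` calls unnecessary:

```python
import csv

def write_rows(rows, filename="youtube-scrape-data.csv"):
    """Write one row per scraped channel; csv handles commas and quoting."""
    # the with-block closes the file even if an exception is raised
    with open(filename, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerows(rows)
```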