Python script to scrape titles of public YouTube playlist

Question 1

I just started with Python and wrote a script that gets the titles of all videos in a public YouTube playlist given as input, but it turned out messier than it needs to be.

I looked around online and found HTMLParser, which I used to extract the titles. It had encoding problems, probably because the playlist HTML contains foreign characters, so I messed around with encodes and decodes until it worked. Is there a prettier way to fix the problem?

import urllib.request
from html.parser import HTMLParser
playlistUrl = input("gib nem: ")
with urllib.request.urlopen(playlistUrl) as response:
  playlist = response.read()
html = playlist.decode("utf-8").encode('cp1252','replace').decode('cp1252')
titles = ""
class MyHTMLParser(HTMLParser):
  def handle_starttag(self, tag, attrs):
    for attr in attrs:
      if attr[0] == "data-title":
        global titles
        titles += attr[1] + "\n"
parser = MyHTMLParser()
parser.feed(html)
print(titles)
with open("playlistNames.txt", "w") as f:
  f.write(titles)

Answer 1

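Well, the way you collect the titles can be improved. You don't need to fall back on a global variable here; globals are very rarely needed. It is easier to collect the titles in a list on the parser instance and join them with str.join after parsing:

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []  # one list per parser instance, no global needed

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-title" and value is not None:
                self.titles.append(value)

parser = MyHTMLParser()
parser.feed(html)
titles = '\n'.join(parser.titles)
print(titles)

A tempting alternative would be to make handle_starttag a generator and pass the result of parser.feed to str.join. That assumes that HTMLParser.feed returns the output of handle_starttag, which it does not: feed returns None and ignores whatever the handler methods return, so the titles have to be stored somewhere the caller can reach, such as an instance attribute.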
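
How feed treats handler return values can be checked directly. The following is a minimal sketch, using a made-up HTML snippet and a hypothetical class name (GeneratorParser), that writes the handler as a generator and inspects what feed hands back:

```python
from html.parser import HTMLParser


class GeneratorParser(HTMLParser):
    """Hypothetical parser whose handler is written as a generator."""

    body_ran = False  # flipped only if the handler body actually executes

    def handle_starttag(self, tag, attrs):
        GeneratorParser.body_ran = True
        for name, value in attrs:
            if name == "data-title":
                yield value


# Made-up sample markup, not real playlist HTML.
sample_html = '<table><tr data-title="First video"></tr></table>'

parser = GeneratorParser()
result = parser.feed(sample_html)

print(result)                    # None
print(GeneratorParser.body_ran)  # False
```

feed returns None, and because calling a generator function only creates a generator object that nobody iterates, the handler body never runs at all.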

Note that I increased the number of spaces to 4 per indentation level, as recommended by Python's official style guide, PEP 8.

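You might also want to add an early exit at the top of handle_starttag if the tag is not the one that carries the data-title attribute, so the attribute loop is skipped for all other tags.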
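
A minimal sketch of such an early exit, under some assumptions: the tag name "tr" and the class name TitleParser are placeholders chosen for illustration (the real playlist markup may use a different tag), the sample HTML is made up, and the titles are collected in a list on the instance.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    # Assumption: titles live on <tr> tags; adjust to the real markup.
    wanted_tag = "tr"

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != self.wanted_tag:
            return  # early exit: skip the attribute loop for other tags
        for name, value in attrs:
            if name == "data-title" and value is not None:
                self.titles.append(value)


# Made-up sample markup; the <td> attribute is ignored by the early exit.
sample_html = (
    '<table>'
    '<tr data-title="First video"><td data-title="ignored">x</td></tr>'
    '<tr data-title="Second video"></tr>'
    '</table>'
)

parser = TitleParser()
parser.feed(sample_html)
print("\n".join(parser.titles))
```

Besides saving a little work, the tag check also guards against unrelated elements that happen to carry an attribute with the same name.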

Alternatively, you might want to look for a different tool. Most HTML parsing in Python is done with BeautifulSoup, as far as I can tell. It offers strainers (SoupStrainer), with which you can reduce the amount of HTML to parse to only those tags you care about, and CSS selectors, which would let you directly select all tags that have the right attribute.
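
A rough sketch of the BeautifulSoup route, assuming the third-party bs4 package is installed and that the titles sit in a data-title attribute as in the question. The function name extract_titles and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer


def extract_titles(html):
    """Return the data-title values found in html, in document order."""
    # Strainer: only build tree nodes for tags that carry a data-title attribute.
    only_titled = SoupStrainer(attrs={"data-title": True})
    soup = BeautifulSoup(html, "html.parser", parse_only=only_titled)
    # CSS attribute selector: every tag that has a data-title attribute.
    return [tag["data-title"] for tag in soup.select("[data-title]")]


# Made-up sample markup, not real playlist HTML.
sample_html = (
    '<table>'
    '<tr data-title="First video"><td>x</td></tr>'
    '<tr data-title="Second video"><td>y</td></tr>'
    '<tr><td>no title here</td></tr>'
    '</table>'
)

titles = extract_titles(sample_html)
print("\n".join(titles))
```

The strainer and the selector overlap here; either one alone would be enough for this task, and they are combined only to show both features.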

Graipher · Accepted Answer · 2018-02-02 10:12:22Z
