
I just started with Python and wrote a script that gets the titles of all the videos in a public YouTube playlist whose URL is given as input, but it got messier than it probably needs to be.

I looked around online and found HTMLParser, which I used to extract the titles. It had problems with encoding, which might be due to foreign characters in the playlist HTML, so I messed around with encodes and decodes until it worked. Is there a prettier way to fix the problem?

import urllib.request
from html.parser import HTMLParser
playlistUrl = input("gib nem: ")
with urllib.request.urlopen(playlistUrl) as response:
  playlist = response.read()
html = playlist.decode("utf-8").encode('cp1252','replace').decode('cp1252')
titles = ""
class MyHTMLParser(HTMLParser):
  def handle_starttag(self, tag, attrs):
    for attr in attrs:
      if attr[0] == "data-title":
        global titles
        titles += attr[1] + "\n"
parser = MyHTMLParser()
parser.feed(html)
print(titles)
with open("playlistNames.txt", "w") as f:
  f.write(titles)
asked Feb 2, 2018 at 5:05

1 Answer


How you handle the output of the titles can be improved. You don't need to fall back on global variables here; they are very rarely needed. It would be easier to collect the titles in a list on the parser instance and join them with str.join afterwards:

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []  # filled in by handle_starttag

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-title":
                self.titles.append(value)

parser = MyHTMLParser()
parser.feed(html)
titles = '\n'.join(parser.titles)
print(titles)

This assumes that HTMLParser.feed calls handle_starttag once for every start tag it encounters. feed itself returns None and discards whatever the handler methods return, so the handler has to store its results. Turning handle_starttag into a generator would not work, because nothing would ever iterate over it.
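
What feed does with the handler's return value is easy to check. Here is a minimal sketch, using a hypothetical GeneratorParser (a name made up for this check) whose handler is a generator function:

```python
from html.parser import HTMLParser


class GeneratorParser(HTMLParser):
    """Hypothetical parser whose start-tag handler is a generator function."""

    def __init__(self):
        super().__init__()
        self.body_ran = False

    def handle_starttag(self, tag, attrs):
        # This body only executes if something iterates over the generator.
        self.body_ran = True
        yield tag


parser = GeneratorParser()
result = parser.feed('<p data-title="x">')
print(result)           # what feed returned
print(parser.body_ran)  # whether the handler body ever executed
```

This prints None and False: feed returns nothing, and calling a generator function only creates a generator object, so the handler body never runs unless something iterates over it.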

Note that I increased the indentation to 4 spaces per level, as recommended by Python's official style guide, PEP 8.

You might also want to add an early exit when the tag is not the one that carries the data-title attribute.
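
An early exit could look roughly like this. I am assuming here that the titles sit on tr elements, which is a guess about the playlist markup and not something the question states; TitleParser, WANTED_TAG and the sample HTML are made up for illustration:

```python
from html.parser import HTMLParser

WANTED_TAG = "tr"  # assumption: the tag that carries data-title


class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != WANTED_TAG:
            return  # early exit: ignore every other tag
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "data-title" and value is not None:
                self.titles.append(value)


# Made-up sample markup, not real playlist HTML.
sample = (
    '<table><tr data-title="First video"><td>x</td></tr>'
    '<tr data-title="Second video"><td>y</td></tr>'
    '<div data-title="ignored"></div></table>'
)
parser = TitleParser()
parser.feed(sample)
print("\n".join(parser.titles))
```

The div is skipped even though it has a data-title attribute, because the handler returns before looking at the attributes of any tag other than WANTED_TAG.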


If those assumptions about feed do not hold, or HTMLParser turns out to be too limited, you might want to look for a different tool. As far as I can tell, most HTML parsing in Python is done with BeautifulSoup. It offers strainers, which restrict parsing to only the tags you care about, and CSS selectors, which would let you directly select all tags that have the right attribute.
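
On the encoding question: the decode/encode/decode chain works, but it replaces every character that cp1252 cannot represent with a question mark. My guess, and it is only a guess since the original error is not shown, is that the problem came from the platform default encoding (cp1252 on many Windows setups) being used when writing the file. Under that assumption, decoding once and passing encoding="utf-8" to open keeps the titles intact. The sample string below is made up for illustration:

```python
import os
import tempfile

# Stand-in for response.read(): UTF-8 bytes with characters outside cp1252.
raw = "Caf\u00e9 \u6771\u4eac".encode("utf-8")

# The round trip from the question: unencodable characters become "?".
lossy = raw.decode("utf-8").encode("cp1252", "replace").decode("cp1252")

# Decoding once keeps every character.
text = raw.decode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "playlistNames.txt")
    # Explicit encoding instead of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()

print(lossy == text)          # did the cp1252 round trip preserve the text?
print(round_tripped == text)  # did the UTF-8 file preserve the text?
```

Printing to a console that only supports cp1252 may still need separate handling; this sketch only covers the file.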

answered Feb 2, 2018 at 10:12
