I have very recently learned some Python and found out about web scraping.
This is my first bash at writing some code to get a playlist from a site I had previously been copying and pasting from. I would love it if someone could have a look at the code and review it for me. Any advice, tips, suggestions welcome.
sample URL - https://www.bbc.co.uk/programmes/m000yzhb
# Create environment by importing required functions
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# Ask user for URL of playlist
my_url = input("paste the URL here: ")
# Grab and process URL. Having uClient.Close() in this section bugged out - not sure why.
uClient = uReq(my_url)
page_html = uClient.read()
page_soup = soup(page_html, "html.parser")
# Prep CSV file for writing.
filename = "take the floor.csv"
f = open(filename,"w")
# Find the section with the date of the broadcast and assign it to a variable.
containers = page_soup.findAll("div",{"class":"broadcast-event__time beta"})
date_container = containers[0]
# Pull the exact info and assign it to a variable.
show_date = str(date_container["title"])
# Find the section with the playlist in it and assign it to a variable.
containers = page_soup.findAll("div",{"class":"feature__description centi"})
title_container = containers[0]
# Pull the exact info and assign it to a variable.
dance_title = str(title_container)
# Clean the data a little
dance1 = dance_title.replace("<br/>" , " ")
dance2 = dance1.replace("<p>", " ")
#Write to CSV.
f.write(show_date + "," + dance2.replace("</p>",""))
f.close()
Notes: My next steps with this code (but I don't know where to start) would be to have it run automatically once per month, finding the web pages itself (the 4 most recent broadcasts).
Also worth noting is that I have added some code in here to show some data manipulation, but I actually do this manipulation with a macro in Excel, since I found it easier to get the results I wanted that way.
I have really enjoyed this project. I have learnt a huge amount and it has saved me a bunch of time manually copying and pasting this data into a Word document.
Thank you for taking the time to look at this; any suggestions or tips are greatly appreciated!
Cheers
2 Answers
I don't think the CSV format is benefiting you all that much here. You haven't written header names, and it seems you only have one column - and maybe only one row? Either you should have multiple columns with headers, or give up on CSV and just call it a text file.
Otherwise:
- Prefer Requests over urllib
- Check for errors during the HTTP request
- If possible, constrain the part of the DOM that BeautifulSoup parses, using a strainer (a sketch follows the example below)
The first part could look like:
    from requests import Session

    def download_programme(session: Session, name: str) -> str:
        with session.get(
            'https://www.bbc.co.uk/programmes/' + name,
            headers={'Accept': 'text/html'},
        ) as response:
            response.raise_for_status()
            return response.text
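For the strainer suggestion, a minimal sketch might look like the following; it assumes the playlist still lives in the same div class that your original code targets:

    from bs4 import BeautifulSoup, SoupStrainer

    def parse_playlist(page_html: str) -> str:
        # Build a tree for the playlist description only, not the whole page
        only_playlist = SoupStrainer('div', attrs={'class': 'feature__description centi'})
        page_soup = BeautifulSoup(page_html, 'html.parser', parse_only=only_playlist)
        return page_soup.get_text(' ', strip=True)

Parsing only that fragment keeps the tree small, and get_text() also replaces your string replace() calls, since it discards the tags for you.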
CSV
I agree with reinderien. You're not writing a CSV file; you're just writing a file you've added the extension .csv to. I suggest looking at an example CSV file to understand the format, which may or may not fit your needs.
If you do decide to write a CSV, one line per dance might be a reasonable format.
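A minimal sketch with the standard csv module, assuming you have already split the playlist text into a list of dance names (the date and dance values below are placeholders):

    import csv

    show_date = 'broadcast date scraped from the page'        # placeholder
    dances = ['first dance', 'second dance', 'third dance']   # placeholders

    with open('take the floor.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['date', 'dance'])  # header row
        for dance in dances:
            writer.writerow([show_date, dance])

The csv module also handles quoting for you, so a comma inside a dance title won't break the file the way it would with a hand-rolled f.write().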
Libraries
- As Reinderien mentions, prefer requests over urllib
- BeautifulSoup is a good choice already
- If you're going to read or write CSVs, use the standard csv library. But as I mention above, I don't think you're using the format right now.
Style
- All code should be in a function, or in an if __name__ == '__main__': block.
- Consider splitting code into functions. Separate getting a URL, grabbing the HTML, parsing the HTML, and writing a CSV file. This code is short enough that it's not necessary, but if you want functions, that's how I would split the code (a sketch follows this list).
- Several of your comments are useless, such as # Pull the exact info and assign it to a variable. Not everything needs a comment; in this case, your variable names are good enough to not need them.
- Your comments are all about the line they're next to, which is not good style. I often have a good idea of what each line does already as a programmer, so it's equally important to help me understand what the program does as a whole. Document what the entire program does at the very top, maybe with sample output. A larger section should be labeled with a comment like "read the HTML and extract a list of dances", which summarizes what a whole set of lines does. If you've seen an outline in English class, imagine something similar.
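Here is one possible shape for that split. It is only a sketch: the class names come from your original code, the download function repeats the one from the other answer, and taking a programme ID instead of a full URL is just an assumption you could change:

    import csv

    from bs4 import BeautifulSoup
    from requests import Session

    def download_programme(session: Session, name: str) -> str:
        # Grab the HTML for one programme page
        with session.get(
            'https://www.bbc.co.uk/programmes/' + name,
            headers={'Accept': 'text/html'},
        ) as response:
            response.raise_for_status()
            return response.text

    def parse_programme(page_html: str) -> tuple[str, str]:
        # Extract the broadcast date and the playlist text from one page
        page_soup = BeautifulSoup(page_html, 'html.parser')
        show_date = page_soup.find('div', {'class': 'broadcast-event__time beta'})['title']
        playlist = page_soup.find('div', {'class': 'feature__description centi'}).get_text(' ', strip=True)
        return show_date, playlist

    def write_csv(filename: str, show_date: str, playlist: str) -> None:
        # One row per programme for now; see the CSV discussion above
        with open(filename, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['date', 'playlist'])
            writer.writerow([show_date, playlist])

    def main() -> None:
        name = input('paste the programme ID (e.g. m000yzhb) here: ')
        with Session() as session:
            page_html = download_programme(session, name)
        show_date, playlist = parse_programme(page_html)
        write_csv('take the floor.csv', show_date, playlist)

    if __name__ == '__main__':
        main()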
Next Steps
- You can't run it monthly as long as you're taking input from the user when it runs, because you might not be there! Take input from a file or command-line arguments instead. Over time, I've found this is more flexible in general if you're the only user.
- What operating system are you on?
- If you're on Windows, I'm told you can use 'scheduled tasks' to run your script monthly.
- If you're on Linux, cron can run your task monthly (a small sketch of a crontab entry, together with command-line arguments, follows this list).
- If you're on Mac, you can do the same stuff as in Linux, or you could use Automator to run your script monthly.
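Tying the command-line and scheduling points together, a minimal sketch; the script name, path, and schedule here are illustrative only, and finding the four most recent broadcasts automatically would be a separate scraping step of its own:

    import sys

    def main() -> None:
        # Programme IDs come from the command line instead of input(), so the
        # script can run unattended, e.g.:
        #   python take_the_floor.py m000yzhb
        # An illustrative crontab entry to run it at 06:00 on the first day of
        # every month (path and IDs are placeholders):
        #   0 6 1 * * python /path/to/take_the_floor.py m000yzhb
        for name in sys.argv[1:]:
            print('would scrape', 'https://www.bbc.co.uk/programmes/' + name)

    if __name__ == '__main__':
        main()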
Thank you both so much for your input, I really appreciate you both taking the time to look at the code and offer feedback. I agree, I am not writing a CSV properly; I need to work on formatting the data in Python. I basically got the information to my CSV file and created a macro to format the data before adding it to my main workbook with all the other playlists. Thank you very much for the link Zachary, I will look into formatting my data before writing it to CSV. Thank you both for all the comments, each comment has helped me learn and further my understanding! Cheers, Craig – Noob_warning2, Oct 23, 2021 at 4:21