I have very recently learned some Python and found out about web scraping.
This is my first bash at writing some code to get a playlist from a site I had previously been copying and pasting from. I would love it if someone could have a look at the code and review it for me. Any advice, tips, suggestions welcome.
sample URL - https://www.bbc.co.uk/programmes/m000yzhb
# Create environment by importing required functions
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# Ask user for URL of playlist
my_url = input("paste the URL here: ")
# Grab and process URL. Having uClient.Close() in this section bugged out - not sure why.
uClient = uReq(my_url)
page_html = uClient.read()
page_soup = soup(page_html, "html.parser")
# Prep CSV file for writing.
filename = "take the floor.csv"
f = open(filename,"w")
# Find the section with the date of the broadcast and assign it to a variable.
containers = page_soup.findAll("div",{"class":"broadcast-event__time beta"})
date_container = containers[0]
# Pull the exact info and assign it to a variable.
show_date = str(date_container["title"])
# Find the section with the playlist in it and assign it to a variable.
containers = page_soup.findAll("div",{"class":"feature__description centi"})
title_container = containers[0]
# Pull the exact info and assign it to a variable.
dance_title = str(title_container)
# Clean the data a little
dance1 = dance_title.replace("<br/>" , " ")
dance2 = dance1.replace("<p>", " ")
#Write to CSV.
f.write(show_date + "," + dance2.replace("</p>",""))
f.close()
Notes: My next steps with this code (but I don't know where to start) would be to have it run automatically once per month, finding the web pages itself (the 4 most recent broadcasts).
Also worth noting is that I have added some code in here to show some data manipulation, but I actually do this manipulation with a macro in Excel, since I found it easier to get the results I wanted that way.
I have really enjoyed this project. I have learnt a huge amount and it has saved me a bunch of time manually copying and pasting this data into a Word document.
Thank you for taking the time to look at this; any suggestions or tips are greatly appreciated!
Cheers
2 Answers
I don't think the CSV format is benefiting you all that much here. You haven't written header names, and it seems you only have one column - and maybe only one row? Either you should have multiple columns with headers, or give up on CSV and just call it a text file.
Otherwise:
- Prefer Requests over urllib
- Check for errors during the HTTP request
- If possible, constrain the part of the DOM that BeautifulSoup parses, using a strainer (a sketch follows the example below)
The first part could look like:
    from requests import Session

    def download_programme(session: Session, name: str) -> str:
        with session.get(
            'https://www.bbc.co.uk/programmes/' + name,
            headers={'Accept': 'text/html'},
        ) as response:
            response.raise_for_status()
            return response.text
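For the strainer suggestion, a minimal sketch might look like the following; it assumes the playlist still lives in the same div class that your original code targets:

    from bs4 import BeautifulSoup, SoupStrainer

    def parse_playlist(page_html: str) -> str:
        # Build a tree for the playlist description only, not the whole page
        only_playlist = SoupStrainer('div', attrs={'class': 'feature__description centi'})
        page_soup = BeautifulSoup(page_html, 'html.parser', parse_only=only_playlist)
        return page_soup.get_text(' ', strip=True)

Parsing only that fragment keeps the tree small, and get_text() also replaces your string replace() calls, since it discards the tags for you.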
CSV
I agree with reinderien. You're not writing a CSV file; you're just writing a file you've added the extension .csv to. I suggest looking at an example CSV file to understand the format, which may or may not fit your needs.
If you do decide to write a CSV, one line per dance might be a reasonable format.
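A minimal sketch with the standard csv module, assuming you have already split the playlist text into a list of dance names (the date and dance values below are placeholders):

    import csv

    show_date = 'broadcast date scraped from the page'        # placeholder
    dances = ['first dance', 'second dance', 'third dance']   # placeholders

    with open('take the floor.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['date', 'dance'])  # header row
        for dance in dances:
            writer.writerow([show_date, dance])

The csv module also handles quoting for you, so a comma inside a dance title won't break the file the way it would with a hand-rolled f.write().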
Libraries
- As Reinderien mentions, prefer requests over urllib
- BeautifulSoup is a good choice already
- If you're going to read or write CSVs, use the standard csv library. But as I mention above, I don't think you're using the format right now.
Style
- All code should be in a function, or in an if __name__ == '__main__': block.
- Consider splitting code into functions. Separate getting a URL, grabbing the HTML, parsing the HTML, and writing a CSV file. This code is short enough that it's not necessary, but if you want functions, that's how I would split the code (a sketch follows this list).
- Several of your comments are useless, such as # Pull the exact info and assign it to a variable. Not everything needs a comment; in this case, your variable names are good enough to not need them.
- Your comments are all about the line they're next to, which is not good style. I often have a good idea of what each line does already as a programmer, so it's equally important to help me understand what the program does as a whole. Document what the entire program does at the very top, maybe with sample output. A larger section should be labeled with a comment like "read the HTML and extract a list of dances", which summarizes what a whole set of lines does. If you've seen an outline in English class, imagine something similar.
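Here is one possible shape for that split. It is only a sketch: the class names come from your original code, the download function repeats the one from the other answer, and taking a programme ID instead of a full URL is just an assumption you could change:

    import csv

    from bs4 import BeautifulSoup
    from requests import Session

    def download_programme(session: Session, name: str) -> str:
        # Grab the HTML for one programme page
        with session.get(
            'https://www.bbc.co.uk/programmes/' + name,
            headers={'Accept': 'text/html'},
        ) as response:
            response.raise_for_status()
            return response.text

    def parse_programme(page_html: str) -> tuple[str, str]:
        # Extract the broadcast date and the playlist text from one page
        page_soup = BeautifulSoup(page_html, 'html.parser')
        show_date = page_soup.find('div', {'class': 'broadcast-event__time beta'})['title']
        playlist = page_soup.find('div', {'class': 'feature__description centi'}).get_text(' ', strip=True)
        return show_date, playlist

    def write_csv(filename: str, show_date: str, playlist: str) -> None:
        # One row per programme for now; see the CSV discussion above
        with open(filename, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['date', 'playlist'])
            writer.writerow([show_date, playlist])

    def main() -> None:
        name = input('paste the programme ID (e.g. m000yzhb) here: ')
        with Session() as session:
            page_html = download_programme(session, name)
        show_date, playlist = parse_programme(page_html)
        write_csv('take the floor.csv', show_date, playlist)

    if __name__ == '__main__':
        main()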
Next Steps
- You can't run it monthly as long as you're taking input from the user when it runs, because you might not be there! Take input from a file or command-line arguments instead. Over time, I've found this is more flexible in general if you're the only user.
- What operating system are you on?
- If you're on Windows, I'm told you can use 'scheduled tasks' to run your script monthly.
- If you're on Linux, cron can run your task monthly (a small sketch of a crontab entry, together with command-line arguments, follows this list).
- If you're on Mac, you can do the same stuff as in Linux, or you could use Automator to run your script monthly.
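Tying the command-line and scheduling points together, a minimal sketch; the script name, path, and schedule here are illustrative only, and finding the four most recent broadcasts automatically would be a separate scraping step of its own:

    import sys

    def main() -> None:
        # Programme IDs come from the command line instead of input(), so the
        # script can run unattended, e.g.:
        #   python take_the_floor.py m000yzhb
        # An illustrative crontab entry to run it at 06:00 on the first day of
        # every month (path and IDs are placeholders):
        #   0 6 1 * * python /path/to/take_the_floor.py m000yzhb
        for name in sys.argv[1:]:
            print('would scrape', 'https://www.bbc.co.uk/programmes/' + name)

    if __name__ == '__main__':
        main()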
Thank you both so much for your input, I really appreciate you both taking the time to look at the code and offer feedback. I agree, I am not writing a CSV properly; I need to work on formatting the data in Python. I basically got the information to my CSV file and created a macro to format the data before adding it to my main workbook with all the other playlists. Thank you very much for the link Zachary, I will look into formatting my data before writing it to CSV. Thank you both for all the comments, each comment has helped me learn and further my understanding! Cheers, Craig – Noob_warning2, Oct 23, 2021 at 4:21