I've written a simple web scraper and want to make sure all my steps are correct. Is it considered clean code? Is there a better way to handle the multi-page scraping?
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main():
    data = []
    for page_num in range(1,51):
        url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
        headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'}
        response = requests.get(url, headers = headers)
        soup = BeautifulSoup(response.content, "lxml")
        books = soup.find_all('article', class_ = 'product_pod')
        for book in books:
            name = book.find('img').attrs['alt']
            price = book.find('p', class_ = 'price_color').text.strip()
            link = 'https://books.toscrape.com/' + book.find('a').attrs['href']
            stock = book.find('p', class_ = 'instock availability').text.strip()
            data.append([name, price, link,stock])
    df = pd.DataFrame(data, columns=['name', 'price', 'link', 'stock'])
    df.to_csv('data.csv')

main()
Comment (Toby Speight, May 9, 2024): Welcome to Code Review! To help reviewers give you better answers, we need to know what the code is intended to achieve. Please add sufficient context to your question to describe the purpose of the code. We want to know why much more than how. The more you tell us about what your code is for, the easier it will be for reviewers to help you. Also, edit the title to simply summarise the task, rather than your concerns about the code.
Comment (ggorlen, May 9, 2024): There's very little here to review. I'd say run it through black and flake8, give the function a better name and you're good.
3 Answers
I agree there is little to review. Two tips though: use requests.Session() and always check the status code. If it's not 200, the request did not succeed and you should not proceed with parsing.
The requests documentation on sessions also explains how you can implement a retry feature for transient errors.
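For example, a session with retries wired in might look roughly like this (a sketch only; the retry counts, status codes and timeout are illustrative choices, and url/headers are the ones from the question):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures (connection problems, 429/5xx responses) a few times with backoff.
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get(url, headers=headers, timeout=10)
if response.status_code != 200:
    # Don't parse a failed response.
    raise RuntimeError(f'Request failed with status {response.status_code}')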
The problem is that if an error occurs in the middle of your script (the odds are fairly high this could happen), then an exception will occur and you'll lose all the data already scraped.
It would be better to have a small dedicated function and call it in a loop like this:
for page_num in range(1, 51):
    results = fetch_page(page_num)
    ...
Append the data to the CSV file at each iteration, not in bulk at the end.
Use: df.to_csv('data.csv', mode='a')
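A minimal sketch of that loop-plus-append structure (fetch_page here stands in for the request and parsing code from the question; note that header= needs handling so the column names aren't written again on every append):

import os
import pandas as pd

def fetch_page(page_num):
    # Fetch one catalogue page and return its rows
    # (the requests/BeautifulSoup code from the question goes here).
    ...

def append_to_csv(rows, path='data.csv'):
    df = pd.DataFrame(rows, columns=['name', 'price', 'link', 'stock'])
    # Write the header only on the first write, then append without it.
    df.to_csv(path, mode='a', header=not os.path.exists(path), index=False)

for page_num in range(1, 51):
    append_to_csv(fetch_page(page_num))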
The code itself is very short and it is also easy to read and comprehend. What is missing is exception handling. Not just for requests, but also for BeautifulSoup. The HTML tags you are expecting may not always be there: find returns None when a tag is missing, and the following attribute access crashes with an AttributeError, which is not what you want.
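One way to guard against missing tags, as an illustration (skipping the entry is just one possible policy; logging or re-raising are others):

for book in books:
    img = book.find('img')
    price_tag = book.find('p', class_='price_color')
    if img is None or price_tag is None:
        # The expected markup is missing; skip this entry rather than crash.
        continue
    name = img.attrs.get('alt', '')
    price = price_tag.text.strip()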
Instead of returning a list, use something more flexible like a namedtuple, a custom class or even a dict, so that you can insert more fields or change order without having to rewrite upstream code.
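For example, with a dataclass (one of the options mentioned; Book and its field names are illustrative):

from dataclasses import dataclass, asdict
import pandas as pd

@dataclass
class Book:
    name: str
    price: str
    link: str
    stock: str

# In the scraping loop:    data.append(Book(name, price, link, stock))
# When building the frame: df = pd.DataFrame([asdict(book) for book in data])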
If you have more extensive needs, then consider using a tool such as Scrapy, so that you don't reinvent the wheel for nothing.
Since your df is just using a default range index, I suggest ignoring the index column with index=False when saving:

df.to_csv('data.csv', index=False)

Otherwise, whenever someone tries to load your data with read_csv('data.csv'), they'll get an "Unnamed: 0" column that just duplicates the index:
>>> pd.read_csv('data.csv')
# Unnamed: 0 name price link stock
# 0 0 A Light in the Attic 51ドル.77 https://books.toscrape.com/a-light-in-the-atti... In stock
# 1 1 Tipping the Velvet 53ドル.74 https://books.toscrape.com/tipping-the-velvet_... In stock
# .. ... ... ... ... ...
This is a common/annoying issue:
How to get rid of "Unnamed: 0" column in a pandas DataFrame read in from CSV file?
I have a situation wherein sometimes when I read a csv from df, I get an unwanted index-like column named Unnamed: 0.
This is very annoying! Does anyone have an idea on how to get rid of this?
Yes, future users can work around it by specifying read_csv('data.csv', index_col=0), but it's just an unnecessary annoyance that can be prevented from the start by saving it properly once.
Also some formatting nits (corrected examples are sketched below):

- Don't use spaces around keyword arguments (e.g., class_ = 'product_pod' -> class_='product_pod')
- Don't mix single and double quotes unless needed (e.g., "lxml" is double-quoted while everything else uses single quotes)
- Use a space after commas (e.g., the range(1,51) and data.append calls have some missing spaces)
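Applied to the lines from the question, those nits would look roughly like this:

soup = BeautifulSoup(response.content, 'lxml')              # single quotes like the rest
price = book.find('p', class_='price_color').text.strip()   # no spaces around keyword arguments
data.append([name, price, link, stock])                     # space after every comma; range(1, 51) likewise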
I would like to see a few things generally to make this code easier to work with in future.

- Rather than have main do all the work, I would have main call one or more further methods with descriptive names, e.g. scrape_pages.
- In range(1,51), what is 51? (Trick question: it's a magic number.) Might it change? This could be a parameter to scrape_pages, for example def scrape_pages(max_page).
- You are setting headers on every page-loop iteration, which seems like wasted work.
- I would probably break the "page-loop" body and the "book-loop" body into their own methods. I realise that implementing this would possibly negate my previous point.
- I would probably also have scrape_pages return the array, and then the responsibility would be on the calling code to decide what to do with it, e.g. write_to_csv (a rough sketch of this structure is below).

Please note I am not a Python guy or familiar with BeautifulSoup. These are just my immediate code-cleanliness thoughts.
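A rough sketch of that structure, assuming the parsing code from the question is moved into helper functions (scrape_pages, parse_book and write_to_csv are illustrative names, not part of the original):

import requests
from bs4 import BeautifulSoup
import pandas as pd

# User-agent string from the question, shortened here; built once, not per iteration.
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'}

def parse_book(book):
    # "book-loop" body: extract the fields for one article tag.
    return [
        book.find('img').attrs['alt'],
        book.find('p', class_='price_color').text.strip(),
        'https://books.toscrape.com/' + book.find('a').attrs['href'],
        book.find('p', class_='instock availability').text.strip(),
    ]

def scrape_pages(max_page):
    # "page-loop" body: fetch each catalogue page and collect rows.
    data = []
    for page_num in range(1, max_page + 1):
        url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
        response = requests.get(url, headers=HEADERS)
        soup = BeautifulSoup(response.content, 'lxml')
        for book in soup.find_all('article', class_='product_pod'):
            data.append(parse_book(book))
    return data

def write_to_csv(data, path):
    df = pd.DataFrame(data, columns=['name', 'price', 'link', 'stock'])
    df.to_csv(path, index=False)

def main():
    write_to_csv(scrape_pages(50), 'data.csv')

main()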