I've written a simple web scraper and want to make sure all my steps are correct. Is it considered clean code? Is there a better way to handle the multi-page scraping?
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main():
    data = []
    for page_num in range(1,51):
        url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
        headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'}
        response = requests.get(url, headers = headers)
        soup = BeautifulSoup(response.content, "lxml")
        books = soup.find_all('article', class_ = 'product_pod')
        for book in books:
            name = book.find('img').attrs['alt']
            price = book.find('p', class_ = 'price_color').text.strip()
            link = 'https://books.toscrape.com/' + book.find('a').attrs['href']
            stock = book.find('p', class_ = 'instock availability').text.strip()
            data.append([name, price, link,stock])
    df = pd.DataFrame(data, columns=['name', 'price', 'link', 'stock'])
    df.to_csv('data.csv')

main()
Comment (Toby Speight, May 9, 2024): Welcome to Code Review! To help reviewers give you better answers, we need to know what the code is intended to achieve. Please add sufficient context to your question to describe the purpose of the code. We want to know why much more than how. The more you tell us about what your code is for, the easier it will be for reviewers to help you. Also, edit the title to simply summarise the task, rather than your concerns about the code.
Comment (ggorlen, May 9, 2024): There's very little here to review. I'd say run it through black and flake8, give the function a better name and you're good.
3 Answers
I agree there is little to review. Two tips though: use requests.Session() and always check the status code. If it's not 200, the request did not succeed and you should not proceed with parsing.
The requests documentation on sessions also explains how you can implement a retry feature for transient errors.
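For example, a session with retries wired in might look roughly like this (a sketch only; the retry counts, status codes and timeout are illustrative choices, and url/headers are the ones from the question):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures (connection problems, 429/5xx responses) a few times with backoff.
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get(url, headers=headers, timeout=10)
if response.status_code != 200:
    # Don't parse a failed response.
    raise RuntimeError(f'Request failed with status {response.status_code}')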
The problem is that if an error occurs in the middle of your script (the odds are fairly high this could happen), then an exception will occur and you'll lose all the data already scraped.
It would be better to have a small dedicated function and call it in a loop like this:
for page_num in range(1, 51):
    results = fetch_page(page_num)
    ...
Append the data to the CSV file at each iteration, not in bulk at the end.
Use: df.to_csv('data.csv', mode='a')
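A minimal sketch of that loop-plus-append structure (fetch_page here stands in for the request and parsing code from the question; note that header= needs handling so the column names aren't written again on every append):

import os
import pandas as pd

def fetch_page(page_num):
    # Fetch one catalogue page and return its rows
    # (the requests/BeautifulSoup code from the question goes here).
    ...

def append_to_csv(rows, path='data.csv'):
    df = pd.DataFrame(rows, columns=['name', 'price', 'link', 'stock'])
    # Write the header only on the first write, then append without it.
    df.to_csv(path, mode='a', header=not os.path.exists(path), index=False)

for page_num in range(1, 51):
    append_to_csv(fetch_page(page_num))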
The code itself is very short and it is also easy to read and comprehend. What is missing is exception handling. Not just for requests, but also for BeautifulSoup. The HTML tags you are expecting may not always be there: find returns None when a tag is missing, and the following attribute access crashes with an AttributeError, which is not what you want.
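One way to guard against missing tags, as an illustration (skipping the entry is just one possible policy; logging or re-raising are others):

for book in books:
    img = book.find('img')
    price_tag = book.find('p', class_='price_color')
    if img is None or price_tag is None:
        # The expected markup is missing; skip this entry rather than crash.
        continue
    name = img.attrs.get('alt', '')
    price = price_tag.text.strip()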
Instead of returning a list, use something more flexible like a namedtuple, a custom class or even a dict, so that you can insert more fields or change order without having to rewrite upstream code.
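For example, with a dataclass (one of the options mentioned; Book and its field names are illustrative):

from dataclasses import dataclass, asdict
import pandas as pd

@dataclass
class Book:
    name: str
    price: str
    link: str
    stock: str

# In the scraping loop:    data.append(Book(name, price, link, stock))
# When building the frame: df = pd.DataFrame([asdict(book) for book in data])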
If you have more extensive needs, then consider using a tool such as Scrapy, so that you don't reinvent the wheel for nothing.
Since your df is just using a default range index, I suggest ignoring the index column with index=False when saving:

df.to_csv('data.csv', index=False)

Otherwise, whenever someone tries to load your data with read_csv('data.csv'), they'll get an "Unnamed: 0" column that just duplicates the index:
>>> pd.read_csv('data.csv')
# Unnamed: 0 name price link stock
# 0 0 A Light in the Attic 51ドル.77 https://books.toscrape.com/a-light-in-the-atti... In stock
# 1 1 Tipping the Velvet 53ドル.74 https://books.toscrape.com/tipping-the-velvet_... In stock
# .. ... ... ... ... ...
This is a common/annoying issue:
How to get rid of "Unnamed: 0" column in a pandas DataFrame read in from CSV file?
I have a situation wherein sometimes when I read a csv from df, I get an unwanted index-like column named Unnamed: 0.
This is very annoying! Does anyone have an idea on how to get rid of this?
Yes, future users can work around it by specifying read_csv('data.csv', index_col=0), but it's just an unnecessary annoyance that can be prevented from the start by saving it properly once.
Also some formatting nits (corrected examples are sketched below):

- Don't use spaces around keyword arguments (e.g., class_ = 'product_pod' -> class_='product_pod')
- Don't mix single and double quotes unless needed (e.g., "lxml" is double-quoted while everything else uses single quotes)
- Use a space after commas (e.g., the range(1,51) and data.append calls have some missing spaces)
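Applied to the lines from the question, those nits would look roughly like this:

soup = BeautifulSoup(response.content, 'lxml')              # single quotes like the rest
price = book.find('p', class_='price_color').text.strip()   # no spaces around keyword arguments
data.append([name, price, link, stock])                     # space after every comma; range(1, 51) likewise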
I would like to see a few things generally to make this code easier to work with in future.

- Rather than have main do all the work, I would have main call one or more further methods with descriptive names, e.g. scrape_pages.
- In range(1,51), what is 51? (Trick question: it's a magic number.) Might it change? This could be a parameter to scrape_pages, for example def scrape_pages(max_page).
- You are setting headers on every page-loop iteration, which seems like wasted work.
- I would probably break the "page-loop" body and the "book-loop" body into their own methods. I realise that implementing this would possibly negate my previous point.
- I would probably also have scrape_pages return the array, and then the responsibility would be on the calling code to decide what to do with it, e.g. write_to_csv (a rough sketch of this structure is below).

Please note I am not a Python guy or familiar with BeautifulSoup. These are just my immediate code-cleanliness thoughts.
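A rough sketch of that structure, assuming the parsing code from the question is moved into helper functions (scrape_pages, parse_book and write_to_csv are illustrative names, not part of the original):

import requests
from bs4 import BeautifulSoup
import pandas as pd

# User-agent string from the question, shortened here; built once, not per iteration.
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'}

def parse_book(book):
    # "book-loop" body: extract the fields for one article tag.
    return [
        book.find('img').attrs['alt'],
        book.find('p', class_='price_color').text.strip(),
        'https://books.toscrape.com/' + book.find('a').attrs['href'],
        book.find('p', class_='instock availability').text.strip(),
    ]

def scrape_pages(max_page):
    # "page-loop" body: fetch each catalogue page and collect rows.
    data = []
    for page_num in range(1, max_page + 1):
        url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
        response = requests.get(url, headers=HEADERS)
        soup = BeautifulSoup(response.content, 'lxml')
        for book in soup.find_all('article', class_='product_pod'):
            data.append(parse_book(book))
    return data

def write_to_csv(data, path):
    df = pd.DataFrame(data, columns=['name', 'price', 'link', 'stock'])
    df.to_csv(path, index=False)

def main():
    write_to_csv(scrape_pages(50), 'data.csv')

main()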