I am 12 days into learning Python and web scraping, and I managed to write my first ever automation script. Please review my code and point out any blunders.
What do I want to achieve?
I want to scrape all chapters of each novel in each category and post them on a WordPress blog to test. Please also point out anything I missed that is mandatory for running this script against the WordPress blog.
from requests import get
from bs4 import BeautifulSoup
import re

site = "https://readlightnovel.org/"

r = get(site,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
soup = BeautifulSoup(r.text, "lxml")
category = soup.findAll(class_="search-by-genre")

# Getting all categories
categories = []
for link in soup.findAll(href=re.compile(r'/category/\w+$')):
    print("Category:", link.text)
    category_link = link['href']
    categories.append(category_link)

# Getting all Novel Headers
for category in categories:
    r = get(category_link,
            headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
    soup = BeautifulSoup(r.text, "lxml")
    Novels_header = soup.findAll(class_="top-novel-header")

    # Getting Novels' Title and Link
    for Novel_names in Novels_header:
        print("Novel:", Novel_names.text.strip())
        Novel_link = Novel_names.find('a')['href']

        # Getting Novel's Info
        r = get(Novel_link, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
        soup = BeautifulSoup(r.text, "lxml")
        Novel_divs = soup.findAll(class_="chapter-chs")

        # Novel Chapters
        for articles in Novel_divs:
            article_ch = articles.findAll("a")
            for chapters in article_ch:
                ch = chapters["href"]

                # Getting article
                r = get(ch, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
                soup = BeautifulSoup(r.content, "lxml")
                title = soup.find(class_="block-title")
                print(title.text.strip())
                full_article = soup.find("div", {"class": "desc"})

                # remove ads inside the text:
                for ads in full_article.select('center, small, a'):
                    ads.extract()
                print(full_article.get_text(strip=True, separator='\n'))
2 Answers
Naming
Variable names should be snake_case and should describe what they contain. I would also use req instead of r; the extra two characters aren't going to cause any heartache.
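For instance, a few lines from your script with the renames applied (the new names are just suggestions):

from requests import get
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}

# snake_case, descriptive names instead of r / Novels_header / Novel_names
req = get("https://readlightnovel.org/", headers=headers)
soup = BeautifulSoup(req.text, "lxml")
novel_headers = soup.findAll(class_="top-novel-header")
for novel_header in novel_headers:
    print("Novel:", novel_header.text.strip())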
Constants
You have the same headers dict in four different places. I would instead define it once at the top of the file in UPPER_CASE, then use that wherever you need headers. I would do the same for site.
List Comprehension
I would go about collecting categories this way:

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]

It's shorter and uses one of Python's most idiomatic features. Of course, if you want to print out each one, then add this just after:

for category in categories:
    print(category)

Also, it seems like you only ever use category_link, which ends up holding the last element of the list, so that assignment can go just after the list comprehension. (Note, though, that this means your loop fetches the same last category page on every iteration; you probably meant to request category inside the loop.)
Save your assignments
Instead of assigning the result of soup.findAll to a variable and then looping over that variable, simply put the soup.findAll call in the loop header. Take a look:

for articles in soup.findAll(class_="chapter-chs"):
    for chapters in articles.findAll("a"):
        ....
As a result of the above changes, your code would look something like this:

from requests import get
from bs4 import BeautifulSoup
import re

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}
SITE = "https://readlightnovel.org/"

req = get(SITE, headers=HEADERS)
soup = BeautifulSoup(req.text, "lxml")
category = soup.findAll(class_="search-by-genre")

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]
category_link = categories[-1]

# Getting all Novel Headers
for category in categories:
    req = get(category_link, headers=HEADERS)
    soup = BeautifulSoup(req.text, "lxml")
    novels_header = soup.findAll(class_="top-novel-header")

    # Getting Novels' Title and Link
    for novel_names in novels_header:
        print("Novel:", novel_names.text.strip())
        novel_link = novel_names.find('a')['href']

        # Getting Novel's Info
        req = get(novel_link, headers=HEADERS)
        soup = BeautifulSoup(req.text, "lxml")

        # Novel Chapters
        for articles in soup.findAll(class_="chapter-chs"):
            for chapters in articles.findAll("a"):
                ch = chapters["href"]

                # Getting article
                req = get(ch, headers=HEADERS)
                soup = BeautifulSoup(req.content, "lxml")
                title = soup.find(class_="block-title")
                print(title.text.strip())
                full_article = soup.find("div", {"class": "desc"})

                # remove ads inside the text:
                for ads in full_article.select('center, small, a'):
                    ads.extract()
                print(full_article.get_text(strip=True, separator='\n'))
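As for the WordPress half of your question, which the review above doesn't cover: one common route is the WordPress REST API, authenticating with an application password. A minimal sketch (the blog URL, username, and application password below are placeholders you would replace with your own):

from requests import post

WP_POSTS_URL = "https://your-blog.example.com/wp-json/wp/v2/posts"  # placeholder blog URL
WP_AUTH = ("your-username", "xxxx xxxx xxxx xxxx")  # placeholder application password

def publish_chapter(title, content):
    """Create a post on the blog via the WordPress REST API."""
    resp = post(WP_POSTS_URL,
                auth=WP_AUTH,  # HTTP Basic auth with an application password
                json={"title": title, "content": content, "status": "draft"})
    resp.raise_for_status()
    return resp.json()

You would then call publish_chapter(title.text.strip(), full_article.get_text(strip=True, separator='\n')) in the innermost loop instead of (or in addition to) the print calls.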
I think you can even get rid of the regular expressions; I prefer to use the BS4 functions.
Instead of:

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]

you can write a near-equivalent statement using a CSS selector:

categories = [link['href'] for link in soup.select('a[href*="/category/"]')]

That means: fetch all the a tags whose href attribute contains the text /category/ (quoting the value inside the selector avoids having to escape the slashes). Note that *= is a contains-match, so unlike the regex it isn't anchored to the end of the href.
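A quick illustration of that difference, on a made-up snippet of HTML:

import re
from bs4 import BeautifulSoup

html = '<a href="/category/action">Action</a><a href="/category/action/page/2">Page 2</a>'
soup = BeautifulSoup(html, "lxml")

# regex: anchored, matches only hrefs ending in /category/<word>
print([a['href'] for a in soup.findAll(href=re.compile(r'/category/\w+$'))])
# ['/category/action']

# CSS contains-match: matches anything with /category/ in the href
print([a['href'] for a in soup.select('a[href*="/category/"]')])
# ['/category/action', '/category/action/page/2']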