
I am 12 days into Python and web scraping, and I managed to write my first-ever automation script. Please review my code and point out any blunders.

What do I want to achieve?

I want to scrape all chapters of each novel in each category and post them on a WordPress blog as a test. Please point out anything I missed that is mandatory for running this script against a WordPress blog.

from requests import get
from bs4 import BeautifulSoup
import re

site = "https://readlightnovel.org/"
r = get(site,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
soup = BeautifulSoup(r.text, "lxml")
category = soup.findAll(class_="search-by-genre")

# Getting all categories
categories = []
for link in soup.findAll(href=re.compile(r'/category/\w+$')):
    print("Category:", link.text)
    category_link = link['href']
    categories.append(category_link)

# Getting all Novel Headers
for category in categories:
    r = get(category_link,
            headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
    soup = BeautifulSoup(r.text, "lxml")
    Novels_header = soup.findAll(class_="top-novel-header")
    # Getting Novels' Title and Link
    for Novel_names in Novels_header:
        print("Novel:", Novel_names.text.strip())
        Novel_link = Novel_names.find('a')['href']
        # Getting Novel's Info
        r = get(Novel_link, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
        soup = BeautifulSoup(r.text, "lxml")
        Novel_divs = soup.findAll(class_="chapter-chs")
        # Novel Chapters
        for articles in Novel_divs:
            article_ch = articles.findAll("a")
            for chapters in article_ch:
                ch = chapters["href"]
                # Getting article
                r = get(ch, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
                soup = BeautifulSoup(r.content, "lxml")
                title = soup.find(class_="block-title")
                print(title.text.strip())
                full_article = soup.find("div", {"class": "desc"})
                # remove ads inside the text:
                for ads in full_article.select('center, small, a'):
                    ads.extract()
                print(full_article.get_text(strip=True, separator='\n'))
asked May 16, 2020 at 16:54

2 Answers


Naming

Variable names should be snake_case and should describe what they contain. I would also use req instead of r; the extra two characters aren't going to cause any heartache.

Constants

You have the same headers dict in four different places. I would instead define it once at the top of the file in UPPER_CASE, then just use that wherever you need headers. I would do the same for site.
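A minimal sketch of that layout. The requests.Session here is my addition, not part of the original code: a session attaches the headers once and sends them with every request, which goes a step further than repeating headers=HEADERS.

```python
import requests

# Constants defined once at module level, UPPER_CASE by convention.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko)"
}
SITE = "https://readlightnovel.org/"

# A Session carries the headers automatically on every get(),
# and also reuses the underlying TCP connection between calls.
session = requests.Session()
session.headers.update(HEADERS)
# req = session.get(SITE)  # no headers=... argument needed anymore
```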

List Comprehension

I would go about collecting categories in this way:

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]

It's shorter and uses a common Python idiom. Of course, if you want to print out each one, then add this just after:

for category in categories:
    print(category)
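As a toy illustration of the comprehension-plus-regex pattern, without needing the live site (the hrefs list is made-up stand-in data, not output from the real pages):

```python
import re

# Stand-in for the href values BeautifulSoup would yield (assumed data).
hrefs = ["/category/action", "/about", "/category/fantasy"]

pattern = re.compile(r'/category/\w+$')
# Keep only the hrefs that match the category pattern.
categories = [h for h in hrefs if pattern.search(h)]
# categories is now ['/category/action', '/category/fantasy']
```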

Also, since category_link ends up assigned to the last element of the list, that assignment can go just after the list comprehension.

Save your assignments

Instead of assigning the result of soup.find to a variable, then using it in a loop, simply put that soup.find in the loop. Take a look:

for articles in soup.findAll(class_="chapter-chs"):
    for chapters in articles.findAll("a"):
        ...


As a result of the above changes, your code would look something like this:

from requests import get
from bs4 import BeautifulSoup
import re

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}
SITE = "https://readlightnovel.org/"

req = get(SITE, headers=HEADERS)
soup = BeautifulSoup(req.text, "lxml")
category = soup.findAll(class_="search-by-genre")
categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]
category_link = categories[-1]

# Getting all Novel Headers
for category in categories:
    req = get(category_link, headers=HEADERS)
    soup = BeautifulSoup(req.text, "lxml")
    novels_header = soup.findAll(class_="top-novel-header")
    # Getting Novels' Title and Link
    for novel_names in novels_header:
        print("Novel:", novel_names.text.strip())
        novel_link = novel_names.find('a')['href']
        # Getting Novel's Info
        req = get(novel_link, headers=HEADERS)
        soup = BeautifulSoup(req.text, "lxml")
        # Novel Chapters
        for articles in soup.findAll(class_="chapter-chs"):
            for chapters in articles.findAll("a"):
                ch = chapters["href"]
                # Getting article
                req = get(ch, headers=HEADERS)
                soup = BeautifulSoup(req.content, "lxml")
                title = soup.find(class_="block-title")
                print(title.text.strip())
                full_article = soup.find("div", {"class": "desc"})
                # remove ads inside the text:
                for ads in full_article.select('center, small, a'):
                    ads.extract()
                print(full_article.get_text(strip=True, separator='\n'))
answered May 16, 2020 at 19:00

I think you can even get rid of the regular expressions. I prefer to use the BS4 functions.

Instead of:

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]

This statement is roughly equivalent, using a CSS selector:

categories = [link['href'] for link in soup.select('a[href*="/category/"]')]

That means: select every a tag whose href attribute contains the text /category/. (Note it is not strictly identical to the regex version, which also requires the URL to end in \w+.)
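A self-contained sketch of that selector on a made-up HTML snippet (using the stdlib html.parser instead of lxml, so nothing extra is required):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real category listing page.
html = '''
<a href="/category/action">Action</a>
<a href="/about">About</a>
<a href="/category/fantasy">Fantasy</a>
'''

soup = BeautifulSoup(html, "html.parser")
# Attribute substring selector: href contains "/category/"
categories = [link['href'] for link in soup.select('a[href*="/category/"]')]
# categories is now ['/category/action', '/category/fantasy']
```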

answered May 16, 2020 at 22:09
