I am 12 days into learning Python and web scraping, and I managed to write my first ever automation script. Please review my code and point out any blunders.
What do I want to achieve?
I want to scrape all chapters of each novel in each category and post them on a WordPress blog to test. Please also point out anything I missed that is mandatory for running this script against the WordPress blog.
from requests import get
from bs4 import BeautifulSoup
import re

site = "https://readlightnovel.org/"

r = get(site,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
soup = BeautifulSoup(r.text, "lxml")
category = soup.findAll(class_="search-by-genre")

# Getting all categories
categories = []
for link in soup.findAll(href=re.compile(r'/category/\w+$')):
    print("Category:", link.text)
    category_link = link['href']
    categories.append(category_link)

# Getting all Novel Headers
for category in categories:
    r = get(category_link,
            headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
    soup = BeautifulSoup(r.text, "lxml")
    Novels_header = soup.findAll(class_="top-novel-header")

    # Getting Novels' Title and Link
    for Novel_names in Novels_header:
        print("Novel:", Novel_names.text.strip())
        Novel_link = Novel_names.find('a')['href']

        # Getting Novel's Info
        r = get(Novel_link, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
        soup = BeautifulSoup(r.text, "lxml")
        Novel_divs = soup.findAll(class_="chapter-chs")

        # Novel Chapters
        for articles in Novel_divs:
            article_ch = articles.findAll("a")
            for chapters in article_ch:
                ch = chapters["href"]

                # Getting article
                r = get(ch, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
                soup = BeautifulSoup(r.content, "lxml")
                title = soup.find(class_="block-title")
                print(title.text.strip())
                full_article = soup.find("div", {"class": "desc"})

                # remove ads inside the text:
                for ads in full_article.select('center, small, a'):
                    ads.extract()
                print(full_article.get_text(strip=True, separator='\n'))
2 Answers
Naming
Variable names should be snake_case and should describe what they contain. I would also use req instead of r; the extra two characters aren't going to cause any heartache.
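For instance, a few lines from your script with the renames applied (the new names are just suggestions):

from requests import get
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}

# snake_case, descriptive names instead of r / Novels_header / Novel_names
req = get("https://readlightnovel.org/", headers=headers)
soup = BeautifulSoup(req.text, "lxml")
novel_headers = soup.findAll(class_="top-novel-header")
for novel_header in novel_headers:
    print("Novel:", novel_header.text.strip())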
Constants
You have the same headers dict in four different places. I would instead define it once at the top of the file in UPPER_CASE, then use that wherever you need headers. I would do the same for site.
List Comprehension
I would go about collecting categories this way:

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]

It's shorter and uses one of Python's most idiomatic features. Of course, if you want to print out each one, then add this just after:

for category in categories:
    print(category)

Also, it seems like you only ever use category_link, which ends up holding the last element of the list, so that assignment can go just after the list comprehension. (Note, though, that this means your loop fetches the same last category page on every iteration; you probably meant to request category inside the loop.)
Save your assignments
Instead of assigning the result of soup.findAll to a variable and then looping over that variable, simply put the soup.findAll call in the loop header. Take a look:

for articles in soup.findAll(class_="chapter-chs"):
    for chapters in articles.findAll("a"):
        ....
As a result of the above changes, your code would look something like this:

from requests import get
from bs4 import BeautifulSoup
import re

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}
SITE = "https://readlightnovel.org/"

req = get(SITE, headers=HEADERS)
soup = BeautifulSoup(req.text, "lxml")
category = soup.findAll(class_="search-by-genre")

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]
category_link = categories[-1]

# Getting all Novel Headers
for category in categories:
    req = get(category_link, headers=HEADERS)
    soup = BeautifulSoup(req.text, "lxml")
    novels_header = soup.findAll(class_="top-novel-header")

    # Getting Novels' Title and Link
    for novel_names in novels_header:
        print("Novel:", novel_names.text.strip())
        novel_link = novel_names.find('a')['href']

        # Getting Novel's Info
        req = get(novel_link, headers=HEADERS)
        soup = BeautifulSoup(req.text, "lxml")

        # Novel Chapters
        for articles in soup.findAll(class_="chapter-chs"):
            for chapters in articles.findAll("a"):
                ch = chapters["href"]

                # Getting article
                req = get(ch, headers=HEADERS)
                soup = BeautifulSoup(req.content, "lxml")
                title = soup.find(class_="block-title")
                print(title.text.strip())
                full_article = soup.find("div", {"class": "desc"})

                # remove ads inside the text:
                for ads in full_article.select('center, small, a'):
                    ads.extract()
                print(full_article.get_text(strip=True, separator='\n'))
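As for the WordPress half of your question, which the review above doesn't cover: one common route is the WordPress REST API, authenticating with an application password. A minimal sketch (the blog URL, username, and application password below are placeholders you would replace with your own):

from requests import post

WP_POSTS_URL = "https://your-blog.example.com/wp-json/wp/v2/posts"  # placeholder blog URL
WP_AUTH = ("your-username", "xxxx xxxx xxxx xxxx")  # placeholder application password

def publish_chapter(title, content):
    """Create a post on the blog via the WordPress REST API."""
    resp = post(WP_POSTS_URL,
                auth=WP_AUTH,  # HTTP Basic auth with an application password
                json={"title": title, "content": content, "status": "draft"})
    resp.raise_for_status()
    return resp.json()

You would then call publish_chapter(title.text.strip(), full_article.get_text(strip=True, separator='\n')) in the innermost loop instead of (or in addition to) the print calls.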
I think you can even get rid of the regular expressions; I prefer to use the BS4 functions.
Instead of:

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]

you can write a near-equivalent statement using a CSS selector:

categories = [link['href'] for link in soup.select('a[href*="/category/"]')]

That means: fetch all the a tags whose href attribute contains the text /category/ (quoting the value inside the selector avoids having to escape the slashes). Note that *= is a contains-match, so unlike the regex it isn't anchored to the end of the href.
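A quick illustration of that difference, on a made-up snippet of HTML:

import re
from bs4 import BeautifulSoup

html = '<a href="/category/action">Action</a><a href="/category/action/page/2">Page 2</a>'
soup = BeautifulSoup(html, "lxml")

# regex: anchored, matches only hrefs ending in /category/<word>
print([a['href'] for a in soup.findAll(href=re.compile(r'/category/\w+$'))])
# ['/category/action']

# CSS contains-match: matches anything with /category/ in the href
print([a['href'] for a in soup.select('a[href*="/category/"]')])
# ['/category/action', '/category/action/page/2']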