I just started learning Python and made a script that scrapes a corporate directory: https://www.sccci.org.sg/en/directory/corporate/members
So far it gets all the company names and their details under one category page (although I do intend to make it fetch the other pages automatically in the future) and writes them into a text file.
However, the code uses a lot of nested loops, and I am looking to see if there is a better way to write it in terms of efficiency and good practice.
Here is the code:
#Author: James
#Date: 9/11/2017
#nlist stores company names
#detlist stores the details
#finlist stores the href links required to scrape the subsites of SCCCI
import requests
from bs4 import BeautifulSoup

check = False
finlist = []
nlist = []
detlist = []

r = requests.get("https://www.sccci.org.sg/en/directory/corporate/members?ind=150")
soup = BeautifulSoup(r.content, "html.parser")

#finds all the links in the html class "listing" and stores them in "finlist"
for items in soup.findAll("div", {"class" : "listing"}):
    for a in items.findAll("ol"):
        for b in a.findAll("a"):
            finlist.append("https://www.sccci.org.sg" + b.get("href"))

#enters each site in finlist and gets the company name found in "member-name"
for record in finlist:
    print("Entering " + record + "...")
    lr = requests.get(record)
    lsoup = BeautifulSoup(lr.content, "html.parser")
    for o in lsoup.findAll(["span"], {"class" : "member-name"}):
        nlist.append(o.get_text("\n", strip=True))
    for o in lsoup.findAll("div", {"class" : "member-info hidden"}):
        detlist.append(o.get_text("\n", strip=True))
    #this loop checks for any additional pages in the link and searches through the additional sites for names and details too
    for j in lsoup.findAll("li", {"class" : "pager-item"}):
        for b in j.findAll("a"):
            print(" Entering https://www.sccci.org.sg" + b.get("href") + "...")
            mR = requests.get("https://www.sccci.org.sg" + b.get("href"))
            mSoup = BeautifulSoup(mR.content, "html.parser")
            for soups in mSoup.findAll("span", {"class" : "member-name"}):
                nlist.append(soups.get_text("\n", strip=True))
            for soups in mSoup.findAll("div", {"class" : "member-info hidden"}):
                detlist.append(soups.get_text("\n", strip=True))

# Request process end -- File dump process start --
print("Start file dump...")
text_file = open("debug.txt", "w", encoding="utf-8")
#combines the name list and detail list into finalList
finalList = [j for i in zip(nlist, detlist) for j in i]
for zippy in finalList:
    zippy = zippy.replace(" ", " ")
    zipstring = str(zippy)
    text_file.write(zipstring + "\n\n")
text_file.close()

text_file_names = open("cnames.txt", "w", encoding="utf-8")
count = 0
for names in nlist:
    count += 1
    names = str(names)
    text_file_names.write(str(count) + " | " + names + "\n")
text_file_names.close()

text_file_pnames = open("names.txt", "w", encoding="utf-8")
for pnames in nlist:
    pnames = str(pnames)
    text_file_pnames.write(pnames + "\n")
text_file_pnames.close()

finalitem = len(finalList) / 2
print("Done | " + str(finalitem) + " items")
print("Files generated: names.txt | debug.txt | cnames.txt")
2 Answers
It seems like you are creating a kind of a crawler. To improve performance, you can use a Session object. It will reuse the underlying TCP connection, making the script run faster.
You can also use BeautifulSoup's SoupStrainer to only parse the elements inside the listing class. This way, your crawler won't have to go through each and every line of the HTML document.
Also on the performance front, you can use the lxml parser, which happens to be faster than html.parser.
You seem to be using the base link quite often; move it into a constant and append the href paths to it as needed.
As an example, the way you are filling up your finlist can be written as follows:
import requests
from bs4 import BeautifulSoup, SoupStrainer
URL = 'https://www.sccci.org.sg'
session = requests.Session()
response = session.get(URL + '/en/directory/corporate/members?ind=150')
strainer = SoupStrainer(class_='listing')
soup = BeautifulSoup(response.content, 'lxml', parse_only=strainer)
listing_links = [URL + link['href'] for link in soup.select('ol[start] a[href]')]
Note that I have used all of the above together with CSS selectors, reducing the number of for loops to just one. Populating finlist takes about 2 seconds on my machine.
Staying on finlist, it's not a particularly descriptive variable name. Perhaps something like listing_links would be better?
I can't go through the rest of the crawler as I am short on time, but for file operations, use the with statement. It will ensure the file is closed, even if an error occurs.
Example usage:
foo = '1234'
with open('bar.txt', 'w') as f:
    f.write(foo)
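Applied to your own file-dump section, the same pattern removes the manual close() calls. The sketch below is only an outline and assumes the nlist and detlist lists built earlier in your script; enumerate is swapped in for the manual counter:
final_list = [item for pair in zip(nlist, detlist) for item in pair]

# "with" closes the file even if a write raises an exception
with open("debug.txt", "w", encoding="utf-8") as debug_file:
    for entry in final_list:
        debug_file.write(entry + "\n\n")

with open("cnames.txt", "w", encoding="utf-8") as names_file:
    # enumerate replaces the manual count variable
    for count, name in enumerate(nlist, start=1):
        names_file.write(str(count) + " | " + name + "\n")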
Lukasz has already pinpointed multiple issues, but to talk specifically about replacing the nested loops with something flatter, I would look into CSS selectors and list comprehensions.
For instance, the following part:
for items in soup.findAll("div", {"class" : "listing"}):
    for a in items.findAll("ol"):
        for b in a.findAll("a"):
            finlist.append("https://www.sccci.org.sg" + b.get("href"))
can be replaced with:
finlist = ["https://www.sccci.org.sg" + a.get("href")
           for a in soup.select(".listing ol a")]
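The same idea carries over to the loops that collect the names and details from each member page. A rough sketch, assuming the lsoup object and the member-name / member-info hidden classes from your code (the variable names here are just for illustration):
# CSS selectors replace the nested findAll calls
names = [tag.get_text("\n", strip=True)
         for tag in lsoup.select("span.member-name")]
details = [tag.get_text("\n", strip=True)
           for tag in lsoup.select("div.member-info.hidden")]
pager_links = ["https://www.sccci.org.sg" + a.get("href")
               for a in lsoup.select("li.pager-item a")]
Inside the per-record loop you would then extend nlist and detlist with these lists instead of appending inside nested loops.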