I just started learning Python and made a script that scrapes a corporate directory: https://www.sccci.org.sg/en/directory/corporate/members
So far it gets all the company names and their details under one category page (although I do intend to make it fetch the other pages automatically in the future) and writes them into a text file.
However, the code uses a lot of nested loops, and I am looking to see if there is a better way to write it in terms of efficiency and good practice.
Here is the code:
#Author: James
#Date: 9/11/2017
#nlist stores company names
#detlist stores the details
#finlist stores the href links required to scrape the subsites of SCCCI
import requests
from bs4 import BeautifulSoup

check = False
finlist = []
nlist = []
detlist = []

r = requests.get("https://www.sccci.org.sg/en/directory/corporate/members?ind=150")
soup = BeautifulSoup(r.content, "html.parser")

#finds all the links in the html class "listing" and stores them in "finlist"
for items in soup.findAll("div", {"class" : "listing"}):
    for a in items.findAll("ol"):
        for b in a.findAll("a"):
            finlist.append("https://www.sccci.org.sg" + b.get("href"))

#enters each site in finlist and gets the company name found in "member-name"
for record in finlist:
    print("Entering " + record + "...")
    lr = requests.get(record)
    lsoup = BeautifulSoup(lr.content, "html.parser")
    for o in lsoup.findAll(["span"], {"class" : "member-name"}):
        nlist.append(o.get_text("\n", strip=True))
    for o in lsoup.findAll("div", {"class" : "member-info hidden"}):
        detlist.append(o.get_text("\n", strip=True))
    #this loop checks for any additional pages in the link and searches through the additional sites for names and details too
    for j in lsoup.findAll("li", {"class" : "pager-item"}):
        for b in j.findAll("a"):
            print(" Entering https://www.sccci.org.sg" + b.get("href") + "...")
            mR = requests.get("https://www.sccci.org.sg" + b.get("href"))
            mSoup = BeautifulSoup(mR.content, "html.parser")
            for soups in mSoup.findAll("span", {"class" : "member-name"}):
                nlist.append(soups.get_text("\n", strip=True))
            for soups in mSoup.findAll("div", {"class" : "member-info hidden"}):
                detlist.append(soups.get_text("\n", strip=True))

# Request process end -- File dump process start --
print("Start file dump...")
text_file = open("debug.txt", "w", encoding="utf-8")
#combines the name list and detail list into finalList
finalList = [j for i in zip(nlist, detlist) for j in i]
for zippy in finalList:
    zippy = zippy.replace(" ", " ")
    zipstring = str(zippy)
    text_file.write(zipstring + "\n\n")
text_file.close()

text_file_names = open("cnames.txt", "w", encoding="utf-8")
count = 0
for names in nlist:
    count += 1
    names = str(names)
    text_file_names.write(str(count) + " | " + names + "\n")
text_file_names.close()

text_file_pnames = open("names.txt", "w", encoding="utf-8")
for pnames in nlist:
    pnames = str(pnames)
    text_file_pnames.write(pnames + "\n")
text_file_pnames.close()

finalitem = len(finalList) / 2
print("Done | " + str(finalitem) + " items")
print("Files generated: names.txt | debug.txt | cnames.txt")
2 Answers
It seems like you are creating a kind of a crawler. To improve performance, you can use a Session object. It will reuse the underlying TCP connection, making the script run faster.
You can also use BeautifulSoup's SoupStrainer to only parse the elements inside the listing class. This way, your crawler won't have to go through each and every line of the HTML document.
Also on the performance front, you can use the lxml parser, which happens to be faster than html.parser.
You seem to be using the base link quite often; move it into a constant and append the href paths to it as needed.
As an example, the way you are filling up your finlist can be written as follows:
import requests
from bs4 import BeautifulSoup, SoupStrainer
URL = 'https://www.sccci.org.sg'
session = requests.Session()
response = session.get(URL + '/en/directory/corporate/members?ind=150')
strainer = SoupStrainer(class_='listing')
soup = BeautifulSoup(response.content, 'lxml', parse_only=strainer)
listing_links = [URL + link['href'] for link in soup.select('ol[start] a[href]')]
Note that I have used all of the above together with CSS selectors, reducing the number of for loops to just one. Populating finlist takes about 2 seconds on my machine.
Staying on finlist, it's not a particularly descriptive variable name. Perhaps something like listing_links would be better?
I can't go through the rest of the crawler as I am short on time, but for file operations, use the with statement. It will ensure the file is closed, even if an error occurs.
Example usage:
foo = '1234'
with open('bar.txt', 'w') as f:
    f.write(foo)
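Applied to your own file-dump section, the same pattern removes the manual close() calls. The sketch below is only an outline and assumes the nlist and detlist lists built earlier in your script; enumerate is swapped in for the manual counter:
final_list = [item for pair in zip(nlist, detlist) for item in pair]

# "with" closes the file even if a write raises an exception
with open("debug.txt", "w", encoding="utf-8") as debug_file:
    for entry in final_list:
        debug_file.write(entry + "\n\n")

with open("cnames.txt", "w", encoding="utf-8") as names_file:
    # enumerate replaces the manual count variable
    for count, name in enumerate(nlist, start=1):
        names_file.write(str(count) + " | " + name + "\n")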
Lukasz has already pinpointed multiple issues, but to talk specifically about replacing the nested loops with something flatter, I would look into CSS selectors and list comprehensions.
For instance, the following part:
for items in soup.findAll("div", {"class" : "listing"}):
    for a in items.findAll("ol"):
        for b in a.findAll("a"):
            finlist.append("https://www.sccci.org.sg" + b.get("href"))
can be replaced with:
finlist = ["https://www.sccci.org.sg" + a.get("href")
           for a in soup.select(".listing ol a")]
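The same idea carries over to the loops that collect the names and details from each member page. A rough sketch, assuming the lsoup object and the member-name / member-info hidden classes from your code (the variable names here are just for illustration):
# CSS selectors replace the nested findAll calls
names = [tag.get_text("\n", strip=True)
         for tag in lsoup.select("span.member-name")]
details = [tag.get_text("\n", strip=True)
           for tag in lsoup.select("div.member-info.hidden")]
pager_links = ["https://www.sccci.org.sg" + a.get("href")
               for a in lsoup.select("li.pager-item a")]
Inside the per-record loop you would then extend nlist and detlist with these lists instead of appending inside nested loops.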