Using python and beautifulsoup to iterate through a list of websites to find a particular string

Question 1

I'm trying to find companies that mention a particular service on their homepage. To do this, I iterate through a CSV file with two columns, ID and URL. I use BeautifulSoup to get the HTML and a regex to find the string.

At present my code works, but it feels very clunky and takes forever. I'm also not writing the matching IDs to the new CSV file, which I haven't been able to figure out.

Since this is at least working, hopefully, it will help someone else who is spinning their wheels trying to figure it out.

How can it be improved?

import requests
from bs4 import BeautifulSoup
import re
import csv

with open('web1.csv', mode='r') as infile:
    reader = csv.reader(infile)
    with open('websites_new.csv', mode='w') as outfile:
        writer = csv.writer(outfile)
        mydict = dict((rows[0],rows[1]) for rows in reader)

newlist = []

for v in mydict.itervalues():
    try:
        page = requests.get('http://www.' + v)
    except:
        pass
    soup = BeautifulSoup(page.content, 'html.parser')
    soupString = str(soup)
    re1='.*?'
    re2='(secretword)'
    rg = re.compile(re1+re2,re.IGNORECASE|re.DOTALL)
    m = rg.search(soupString)
    if m is None:
        value = 'x'
        newlist.extend(value)
    else:
        newlist.extend(v)

print newlist

alecxe · Answer 1 · 2017-05-18 02:43:58Z

First of all, since you are applying a regular expression to the complete source of the page, you don't need an HTML parser like BeautifulSoup: search page.content directly.
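A minimal sketch of that idea, assuming the only goal is to know whether the word occurs anywhere in the raw response body. The helper name mentions_word is hypothetical, and the pattern is built as bytes because page.content is a bytes object in requests.

```python
import re


def mentions_word(content, word):
    """Return True if `word` occurs anywhere in the raw page bytes.

    Matching is case-insensitive for ASCII letters.
    """
    # re.escape keeps any special characters in the word literal; encoding
    # it gives a bytes pattern that can be applied to page.content as-is.
    pattern = re.compile(re.escape(word.encode('utf-8')), re.IGNORECASE)
    return pattern.search(content) is not None
```

In the question's loop this would be called as mentions_word(page.content, 'secretword'), with no BeautifulSoup or str(soup) step in between.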

And if you do need to go the HTML parsing route and speed matters, use either lxml on its own or BeautifulSoup with the lxml parser.
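If parsing is wanted, for example to search only the text of the page and not tag names or attribute values, switching parsers is a one-argument change. This is a rough sketch that assumes both the bs4 and lxml packages are installed; text_mentions_word is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def text_mentions_word(html, word):
    """Return True if `word` appears in the document's text nodes."""
    # 'lxml' selects the lxml-based parser instead of 'html.parser'.
    soup = BeautifulSoup(html, 'lxml')
    # get_text() joins the text nodes, so attribute values are not searched.
    return word.lower() in soup.get_text().lower()
```

Unlike a regex over the whole source, this would not report a page where the word only appears inside an attribute such as a meta tag.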

You may also look into reusing the same requests.Session() instance - it may have a positive impact on performance.
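A sketch of what session reuse could look like, assuming the same 'http://www.' prefix as in the question. The function name fetch_pages and the timeout value are assumptions. The session is passed in so that a single instance serves every request.

```python
def fetch_pages(session, domains, timeout=10):
    """Fetch each domain's homepage through one shared session.

    `session` is expected to be a requests.Session (anything with a
    compatible .get works). Returns {domain: body bytes, or None on failure}.
    """
    pages = {}
    for domain in domains:
        try:
            response = session.get('http://www.' + domain, timeout=timeout)
        except OSError:
            # requests' exceptions derive from IOError (an alias of OSError
            # in Python 3), so connection errors and timeouts land here.
            pages[domain] = None
        else:
            pages[domain] = response.content
    return pages


# Typical use (requires the third-party requests package):
#     import requests
#     with requests.Session() as session:
#         pages = fetch_pages(session, ['example.com', 'example.org'])
```

Recording None for a failed request also avoids a problem in the original loop, where a failed requests.get is followed by code that still reads page.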

Overall though, your approach is blocking and synchronous: the code processes URLs one by one and does not start on the next URL until it is done with the current one. Look into tools like Scrapy to approach the problem in an asynchronous, non-blocking fashion.
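Scrapy is a full framework, so as a lighter illustration of the same point, here is a sketch that overlaps the downloads with a thread pool from the standard library. This is a swapped-in technique, not Scrapy and not true async I/O. The names find_matching_ids and fetch_homepage, the worker count and the timeout are all assumptions.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_homepage(domain, timeout=10):
    """Download a homepage with the standard library; returns bytes."""
    with urlopen('http://www.' + domain, timeout=timeout) as response:
        return response.read()


def find_matching_ids(rows, word, fetch=fetch_homepage, max_workers=8):
    """Return the IDs whose homepage contains `word`.

    `rows` is an iterable of (id, domain) pairs, as read from the CSV.
    """
    pattern = re.compile(re.escape(word.encode('utf-8')), re.IGNORECASE)

    def check(row):
        site_id, domain = row
        try:
            return site_id, pattern.search(fetch(domain)) is not None
        except Exception:
            # Any download failure is treated as "no match" in this sketch.
            return site_id, False

    # map() runs check() on worker threads and yields results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [site_id for site_id, found in pool.map(check, list(rows)) if found]
```

The returned IDs could then be written out with csv.writer(outfile).writerows([i] for i in ids), which is the step the question was missing.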

jmcph4 · Answer 2 · 2017-05-18 05:08:52Z

Overall, I think your code is simple and nice enough, though I also agree with the points raised in alecxe's answer.

One thing that I noticed when skimming your code for the first time is the use of re1 and re2 on lines 21 and 22, respectively. A good rule of thumb is that if you're numbering your variables, you might want to put them into a list.

However, as you only seem to have two regular expressions, I can understand if that might feel a little redundant. Regardless, I think you should at least make those variable names meaningful by putting their intended function in their names (e.g. reSecretWord instead of re2). Obviously this depends on your style guide and preferences.
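As a rough illustration of the renaming, with names that are only suggestions (written here in PEP 8 style, not the camelCase used above):

```python
import re

# Hypothetical names that say what each piece of the pattern is for.
LAZY_PREFIX = r'.*?'
SECRET_WORD = r'(secretword)'
SECRET_WORD_PATTERN = re.compile(LAZY_PREFIX + SECRET_WORD,
                                 re.IGNORECASE | re.DOTALL)


def find_secret_word(text):
    """Return the matched word as it appears in `text`, or None if absent."""
    match = SECRET_WORD_PATTERN.search(text)
    return match.group(1) if match else None
```

Because search() already scans the whole string, the lazy prefix does not change whether a match is found, so the pattern could probably be reduced to the named word alone.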
