7
\$\begingroup\$

I'm attempting to find companies who mention a particular service in on their homepage. To do this, I am iterating through a csv file with two columns - ID and URL. I'm using BeautifulSoup to get the html and regex to find the string.

At present, my code works, but it feels very clunky and takes forever. I'm also not writing my matching IDs to the new csv, which I haven't been able to figure out.

Since this is at least working, hopefully, it will help someone else who is spinning their wheels trying to figure it out.

How can it be improved?

import requests
from bs4 import BeautifulSoup
import re
import csv
with open('web1.csv', mode='r') as infile:
 reader = csv.reader(infile)
 with open('websites_new.csv', mode='w') as outfile:
 writer = csv.writer(outfile)
 mydict = dict((rows[0],rows[1]) for rows in reader)
newlist = []
for v in mydict.itervalues():
 try:
 page = requests.get('http://www.' + v)
 except:
 pass
 soup = BeautifulSoup(page.content, 'html.parser')
 soupString = str(soup)
 re1='.*?'
 re2='(secretword)'
 rg = re.compile(re1+re2,re.IGNORECASE|re.DOTALL)
 m = rg.search(soupString)
 if m is None:
 value = 'x'
 newlist.extend(value)
 else:
 newlist.extend(v)
print newlist
asked May 18, 2017 at 1:59
\$\endgroup\$

2 Answers 2

5
\$\begingroup\$

First of all, since you are applying a regular expression pattern to the complete source of the page, you don't need an HTML parser like BeautifulSoup - directly search inside the page.content.

And, if you need to go the HTML parsing route and speed matters, choose either lxml, or lxml parser with BeautifulSoup.

You may also look into reusing the same requests.Session() instance - it may have a positive impact on performance.


Overall though, your approach is blocking/synchronous - your code processes URLs one by one - it would not process the next URL until it is done with the current one. Look into tools like Scrapy to approach the problem in the asynchronous/non-blocking fashion.

answered May 18, 2017 at 2:43
\$\endgroup\$
5
\$\begingroup\$

Overall, I think your code is simple and nice enough. I agree with the points raised in alecxe's answer too though.

One thing that I noticed when skimming your code for the first time is the use of re1 and re2 on lines 21 and 22, respectively. Normally, a good rule of thumb is if you're numbering your variables, you might want to put them into a list.

However, as you only seem to have two regular expressions, I can understand if that might feel a little redundant. Regardless, I think you should at least make those variable names meaningful by putting their intended function in their names (e.g. instead of re2 perhaps reSecretWord). Obviously this depends on your style guide/preferences.

answered May 18, 2017 at 5:08
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.