I usually write function-only Python programs, but have decided on an OOP approach (my first) for my current program, a web scraper. Here is the working code I wrote, which I am looking to have critiqued:
import csv
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']

class Page(object):
    def __init__(self, url):
        self.url = url

    def get_source(self):
        self.source = urllib2.urlopen(self.url).read()
        return self.source

class HubPage(Page):
    def has_vacancies(self):
        return not(any(text for text in NO_VACANCIES if text in self.source.lower()))

urls = []
with open('25.csv', 'rb') as spreadsheet:
    reader = csv.reader(spreadsheet)
    for row in reader:
        urls.append(row[0].strip())

for url in urls:
    page = HubPage(url)
    source = page.get_source()
    if page.has_vacancies():
        print 'Has vacancies'
Some context: HubPage represents a typical 'jobs' page on a company's web site. I am subclassing Page because I will eventually subclass it again for individual job pages, which will have some methods used only to extract data from those pages.
Here's my issue: I know from experience that urllib2, while it has its critics, is fast - very fast - at doing what it does, namely fetching a page's source. Yet I notice that in my design, processing each URL takes a few orders of magnitude longer than what I typically observe.
- Is it the fact that class instantiations are involved (unnecessarily, perhaps)?
- Might the fact that HubPage is inherited be the cause?
- Is the call to any() expensive?
1 Answer
Use top-level functions
Reading URLs from a file and checking each of them for vacancies is done at the top level of the file. You should enclose this code in functions and put the remaining top-level code under if __name__ == '__main__' to:

- separate the intents;
- reuse the code more easily;
- avoid running code as a side effect when importing the module.
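A minimal sketch of that structure, reusing the read_urls and has_vacancies functions defined later in this answer:

def main():
    # All the former top-level logic lives here now.
    for url in read_urls('25.csv'):
        if has_vacancies(url):
            print url

if __name__ == '__main__':
    # Runs only when the file is executed directly, not when imported.
    main()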
Use filter
You are only interested in knowing which URLs have vacancies. The easiest way to do that is:

interesting_urls = filter(has_vacancies, urls)

with the proper definition of has_vacancies.
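For reference, in Python 2 filter returns a list of the items for which the predicate is truthy, so the call above behaves like this list comprehension:

interesting_urls = [url for url in urls if has_vacancies(url)]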
You also only print 'Has vacancies' when a given URL has some, leaving the user to figure out which one actually does. You might want to print the URL instead to make your script more usable:

print 'Following URLs have vacancies'
for url in filter(has_vacancies, urls):
    print url
Use generators
There is no need to store all the URLs you’ll be checking up front. Yield them from a generator, since only one is processed at a time:

def read_urls(filename):
    with open(filename, 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            yield row[0].strip()

This also lets you parametrize the name of the file to read, in case you have several of them.
Do not force OOP when not required
You are not interested in storing state about each URL. You just want a simple question answered: "does this URL point to a page that provides vacancies?" Thus a simple function will do, for each URL:

def has_vacancies(url):
    source = urllib2.urlopen(url).read()
    return not(any(text for text in NO_VACANCIES if text in source.lower()))

See what I did there? This is the exact function we need for the filter I introduced earlier.
Remove extraneous computation
Calling lower() on a string has a cost. Calling it once for every piece of text you are looking for in said string is even more costly. You should call it only once, after fetching the source of the URL.

Of lesser interest, any converts each piece of text to its truthy value after the test text in source has already returned True. You also store source, self.url and self.source (and, to a lesser extent, page) without really needing to. But I’ve shown ways to avoid that.
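A rough way to measure the difference with timeit (a sketch only; the sample data, its size and the iteration count are arbitrary assumptions):

import timeit

setup = """
source = ('filler text ' * 5000) + 'NOT HIRING'
NO_VACANCIES = ['no vacancies', 'not hiring']
"""
# lower() re-applied for every candidate phrase
repeated = "any(text for text in NO_VACANCIES if text in source.lower())"
# lower() applied once, its result reused for every phrase
cached = "low = source.lower()\nany(text in low for text in NO_VACANCIES)"

print timeit.timeit(repeated, setup, number=1000)
print timeit.timeit(cached, setup, number=1000)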
Proposed improvements
import csv
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']

def has_vacancies(url):
    source = urllib2.urlopen(url).read().lower()
    return not any(text in source for text in NO_VACANCIES)

def read_urls(filename):
    with open(filename, 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            yield row[0].strip()

def check_vacancies(filename):
    print 'Following URLs have vacancies'
    for url in filter(has_vacancies, read_urls(filename)):
        print url

if __name__ == '__main__':
    check_vacancies('25.csv')
Going further
You might be interested in trying the re module to perform the text matches; there may be a performance improvement.
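A sketch of what that could look like, assuming the phrases remain plain literals that can be joined into one alternation:

import re
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']
# One precompiled pattern; re.escape guards against regex metacharacters.
NO_VACANCIES_RE = re.compile('|'.join(re.escape(text) for text in NO_VACANCIES))

def has_vacancies(url):
    source = urllib2.urlopen(url).read().lower()
    # search returns None when no phrase occurs anywhere in the source
    return NO_VACANCIES_RE.search(source) is None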
Alternatively, the multiprocessing module provides a map_async that is performed in parallel by a pool of workers. You might want to change has_vacancies a bit and use that instead of filter to improve overall performance.
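A sketch of that approach (the pool size of 4 is an arbitrary assumption, and has_vacancies must remain a top-level function so the workers can pickle it):

from multiprocessing import Pool

def check_vacancies(filename):
    urls = list(read_urls(filename))
    pool = Pool(4)  # number of worker processes; tune to taste
    # Check the URLs in parallel; .get() blocks until all results are in.
    results = pool.map_async(has_vacancies, urls).get()
    pool.close()
    pool.join()
    print 'Following URLs have vacancies'
    for url, vacant in zip(urls, results):
        if vacant:
            print url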
You may be interested in handling exceptions from urllib2 so that a sudden network failure or an invalid URL won't crash everything. Something simple will do:

def has_vacancies(url):
    try:
        source = urllib2.urlopen(url).read()
    except urllib2.URLError:
        print 'Error processing', url
        return False
    else:
        source = source.lower()
        return not any(text in source for text in NO_VACANCIES)
- Many thanks. You say, 'Calling lower() once for every piece of text you are looking for in said string is even more costly. You should call it only once after fetching the source of the URL.' But I am only calling it once, am I not? – Pyderman, Feb 2, 2016 at 17:07
- @Pyderman My intuition is telling me that if text in self.source.lower() will call lower for each text. Will need to examine the compiled bytecode to be sure. – 301_Moved_Permanently, Feb 2, 2016 at 17:15
- I see the distinction now, and yes, it would certainly appear that lower() was being called for every single phrase checked against the source. – Pyderman, Feb 3, 2016 at 3:38
- @Pyderman Just checked with the dis module by turning the generator expression fed into any into a list comprehension and assigning it to a variable... Using if text in self.source.lower() will perform a (costly) call to lower at each iteration and not work with a cached result of this call. – 301_Moved_Permanently, Feb 3, 2016 at 15:11
- Interesting. Still on the matter of speed, it has been suggested that I convert the NO_VACANCIES list to a set. Would you agree that this might speed things up further? See here: stackoverflow.com/questions/35181153/… – Pyderman, Feb 3, 2016 at 15:36