I usually write function-only Python programs, but have decided on an OOP approach (my first) for my current program, a web scraper. Here is the working code I wrote, which I am looking to have critiqued:
import csv
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']

class Page(object):
    def __init__(self, url):
        self.url = url

    def get_source(self):
        self.source = urllib2.urlopen(self.url).read()
        return self.source

class HubPage(Page):
    def has_vacancies(self):
        return not(any(text for text in NO_VACANCIES if text in self.source.lower()))

urls = []
with open('25.csv', 'rb') as spreadsheet:
    reader = csv.reader(spreadsheet)
    for row in reader:
        urls.append(row[0].strip())

for url in urls:
    page = HubPage(url)
    source = page.get_source()
    if page.has_vacancies():
        print 'Has vacancies'
Some context: HubPage represents a typical 'jobs' page on a company's web site. I am subclassing Page because I will eventually subclass it again for individual job pages, which will have some methods used only to extract data from those pages.
Here's my issue: I know from experience that urllib2, while it has its critics, is fast - very fast - at doing what it does, namely fetching a page's source. Yet I notice that in my design, processing each URL takes a few orders of magnitude longer than what I typically observe.
- Is it the fact that class instantiations are involved (unnecessarily, perhaps)?
- Might the fact that HubPage is inherited be the cause?
- Is the call to any() expensive?
1 Answer
Use top-level functions
Reading URLs from a file and checking each of them for vacancies is done at the top level of the file. You should enclose this code in functions and put the remaining top-level code under if __name__ == '__main__' to:

- separate the intents;
- reuse the code more easily;
- avoid running code as a side effect when importing the module.
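A minimal sketch of that structure, reusing the read_urls and has_vacancies functions defined later in this answer:

def main():
    # All the former top-level logic lives here now.
    for url in read_urls('25.csv'):
        if has_vacancies(url):
            print url

if __name__ == '__main__':
    # Runs only when the file is executed directly, not when imported.
    main()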
Use filter
You are only interested in knowing which URLs have vacancies. The easiest way to do that is:

interesting_urls = filter(has_vacancies, urls)

with the proper definition of has_vacancies.
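For reference, in Python 2 filter returns a list of the items for which the predicate is truthy, so the call above behaves like this list comprehension:

interesting_urls = [url for url in urls if has_vacancies(url)]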
You also only print 'Has vacancies' when a given URL has some, leaving the user to figure out which one actually does. You might want to print the URL instead to make your script more usable:

print 'Following URLs have vacancies'
for url in filter(has_vacancies, urls):
    print url
Use generators
There is no need to store all the URLs you’ll be checking up front. Yield them from a generator, since only one is processed at a time:

def read_urls(filename):
    with open(filename, 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            yield row[0].strip()

This also lets you parametrize the name of the file to read, in case you have several of them.
Do not force OOP when not required
You are not interested in storing state about each URL. You just want a simple question answered: "does this URL point to a page that provides vacancies?" Thus a simple function will do, for each URL:

def has_vacancies(url):
    source = urllib2.urlopen(url).read()
    return not(any(text for text in NO_VACANCIES if text in source.lower()))

See what I did there? This is the exact function we need for the filter I introduced earlier.
Remove extraneous computation
Calling lower() on a string has a cost. Calling it once for every piece of text you are looking for in said string is even more costly. You should call it only once, after fetching the source of the URL.

Of lesser interest, any converts each piece of text to its truthy value after the test text in source has already returned True. You also store source, self.url and self.source (and, to a lesser extent, page) without really needing to. But I’ve shown ways to avoid that.
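A rough way to measure the difference with timeit (a sketch only; the sample data, its size and the iteration count are arbitrary assumptions):

import timeit

setup = """
source = ('filler text ' * 5000) + 'NOT HIRING'
NO_VACANCIES = ['no vacancies', 'not hiring']
"""
# lower() re-applied for every candidate phrase
repeated = "any(text for text in NO_VACANCIES if text in source.lower())"
# lower() applied once, its result reused for every phrase
cached = "low = source.lower()\nany(text in low for text in NO_VACANCIES)"

print timeit.timeit(repeated, setup, number=1000)
print timeit.timeit(cached, setup, number=1000)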
Proposed improvements
import csv
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']

def has_vacancies(url):
    source = urllib2.urlopen(url).read().lower()
    return not any(text in source for text in NO_VACANCIES)

def read_urls(filename):
    with open(filename, 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            yield row[0].strip()

def check_vacancies(filename):
    print 'Following URLs have vacancies'
    for url in filter(has_vacancies, read_urls(filename)):
        print url

if __name__ == '__main__':
    check_vacancies('25.csv')
Going further
You might be interested in trying the re module to perform the text matches; there may be a performance improvement.
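A sketch of what that could look like, assuming the phrases remain plain literals that can be joined into one alternation:

import re
import urllib2

NO_VACANCIES = ['no vacancies', 'not hiring']
# One precompiled pattern; re.escape guards against regex metacharacters.
NO_VACANCIES_RE = re.compile('|'.join(re.escape(text) for text in NO_VACANCIES))

def has_vacancies(url):
    source = urllib2.urlopen(url).read().lower()
    # search returns None when no phrase occurs anywhere in the source
    return NO_VACANCIES_RE.search(source) is None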
Alternatively, the multiprocessing module provides a map_async that is performed in parallel by a pool of workers. You might want to change has_vacancies a bit and use that instead of filter to improve overall performance.
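A sketch of that approach (the pool size of 4 is an arbitrary assumption, and has_vacancies must remain a top-level function so the workers can pickle it):

from multiprocessing import Pool

def check_vacancies(filename):
    urls = list(read_urls(filename))
    pool = Pool(4)  # number of worker processes; tune to taste
    # Check the URLs in parallel; .get() blocks until all results are in.
    results = pool.map_async(has_vacancies, urls).get()
    pool.close()
    pool.join()
    print 'Following URLs have vacancies'
    for url, vacant in zip(urls, results):
        if vacant:
            print url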
You may be interested in handling exceptions from urllib2 so that a sudden network failure or an invalid URL won't crash everything. Something simple will do:

def has_vacancies(url):
    try:
        source = urllib2.urlopen(url).read()
    except urllib2.URLError:
        print 'Error processing', url
        return False
    else:
        source = source.lower()
        return not any(text in source for text in NO_VACANCIES)
- Many thanks. You say, 'Calling lower() once for every piece of text you are looking for in said string is even more costly. You should call it only once after fetching the source of the URL.' But I am only calling it once, am I not? – Pyderman, Feb 2, 2016 at 17:07
- @Pyderman My intuition is telling me that if text in self.source.lower() will call lower for each text. Will need to examine the compiled bytecode to be sure. – 301_Moved_Permanently, Feb 2, 2016 at 17:15
- I see the distinction now, and yes, it would certainly appear that lower() was being called for every single phrase checked against the source. – Pyderman, Feb 3, 2016 at 3:38
- @Pyderman Just checked with the dis module by turning the generator expression fed into any into a list comprehension and assigning it to a variable... Using if text in self.source.lower() will perform a (costly) call to lower at each iteration and not work with a cached result of this call. – 301_Moved_Permanently, Feb 3, 2016 at 15:11
- Interesting. Still on the matter of speed, it has been suggested that I convert the NO_VACANCIES list to a set. Would you agree that this might speed things up further? See here: stackoverflow.com/questions/35181153/… – Pyderman, Feb 3, 2016 at 15:36