I have just written this code to scrape some data from a website. In its current state it works fine; however, from my tests on the script, I discovered that with the amount of data I am processing, it will take a few days to finish the task. Is there a way to improve its performance? I will insert a sample of the data rather than the bulk of it.
Input data in CSV format:
Code Origin
1 Eisenstadt
2 Tirana
3 St Pölten Hbf
6 Wien Westbahnhof
7 Wien Hauptbahnhof
8 Klagenfurt Hbf
9 Villach Hbf
11 Graz Hbf
12 Liezen
Code:
# import needed libraries
import csv
from datetime import datetime
from mechanize import Browser
from bs4 import BeautifulSoup
def datareader(datafile):
    """ This function reads the cities from csv file and processes
    them into an O-D for input into the web scrapper """

    # Read the csv
    with open(datafile, 'r') as f:
        reader = csv.reader(f)
        header = reader.next()
        ListOfCities = [lines for lines in reader]
        temp = ListOfCities[:]
        city_num = []
        city_orig_dest = []
        for i in ListOfCities:
            for j in temp:
                ans1 = i[0], j[0]
                if ans1[0] != ans1[1]:
                    city_num.append(ans1)
                ans = (unicode(i[1], 'iso-8859-1'), unicode(j[1], 'iso-8859-1'), i[0], j[0])
                if ans[0] != ans[1] and ans[2] != ans[3]:
                    city_orig_dest.append(ans)
    yield city_orig_dest


input_data = datareader('BAK.csv')  # Input data here


def webscrapper(x):
    """ This function scraped the required website and extracts the
    quickest connection time within given time durations """

    # Create a browser object
    br = Browser()
    # Ignore robots.txt
    br.set_handle_robots(False)
    # Google demands a user-agent that isn't a robot
    br.addheaders = [('User-agent', 'Chrome')]
    # Retrieve the Google home page, saving the response
    br.open('http://fahrplan.sbb.ch/bin/query.exe/en')
    # Select the 6th form
    br.select_form(nr=6)

    # Assign origin and destination to the o d variables
    o = i[0].encode('iso-8859-1')
    d = i[1].encode('iso-8859-1')
    print 'o-d:', i[0], i[1]

    # Enter the text input (This section should be automated to read multiple text input as shown in the question)
    br.form["REQ0JourneyStopsS0G"] = o  # Origin train station (From)
    br.form["REQ0JourneyStopsZ0G"] = d  # Destination train station (To)
    br.form["REQ0JourneyTime"] = x  # Search Time
    br.form["date"] = '18.01.17'  # Search Date

    # Get the search results
    br.submit()

    # Click the later link three times to get trip times
    for _ in xrange(5):
        # Looking at some results in link format
        for l in br.links(text='Later'):
            pass
        response = br.follow_link(l)

    # get the response from mechanize Browser
    soup = BeautifulSoup(br.response().read(), 'lxml', from_encoding="utf-8")

    trs = soup.select('table.hfs_overview tr')

    connections_times = []
    ListOfSearchTimes = []

    # Scrape the search results from the resulting table
    for tr in trs:
        locations = tr.select('td.location')
        if len(locations) > 0:
            time = tr.select('td.time')[0].contents[0].strip()
            ListOfSearchTimes.append(time.encode('latin-1'))
            durations = tr.select('td.duration')
            # Check that the duration cell is not empty
            if len(durations) == 0:
                duration = ''
            else:
                duration = durations[0].contents[0].strip()

                # Convert duration time string to minutes
                def get_sec(time_str):
                    h, m = time_str.split(':')
                    return int(h) * 60 + int(m)

                connections_times.append(get_sec(duration))

    def group(lst, n):
        return zip(*[lst[i::n] for i in range(n)])

    arrivals_and_departure_pair = group(ListOfSearchTimes, 2)

    # Check that the selected departures for one interval occurs before the departure of the next interval
    fmt = '%H:%M'
    finalDepartureList = []
    for ind, res in arrivals_and_departure_pair:
        t1 = datetime.strptime(ind, fmt)
        if x == '05:30':
            control = datetime.strptime('09:00', fmt)
        if x == '09:00':
            control = datetime.strptime('12:00', fmt)
        if x == '12:00':
            control = datetime.strptime('15:00', fmt)
        if x == '15:00':
            control = datetime.strptime('18:00', fmt)
        if x == '18:00':
            control = datetime.strptime('21:00', fmt)
        if x == '21:00':
            control = datetime.strptime('05:30', fmt)
        if t1 < control:
            finalDepartureList.append(ind)

    # Get the list of connection times for the departures above
    fastest_connect = connections_times[:len(finalDepartureList)]

    # Get the fastest connections time and catch any error when there is no connection between the OD pairs
    try:
        best_connect = sorted(fastest_connect)[0]
        print 'fastest connection', best_connect
        # print duration
    except IndexError:
        print "No Connection"
        # print

    # Return the result of the search
    if len(fastest_connect) == 0:
        return [i[2], i[3], '999999']
    else:
        return [i[2], i[3], str(best_connect)]


# List of time intervals
times = ['05:30', '09:00', '12:00', '15:00', '18:00', '21:00']

# Write the heading of the output text file
headings = ["from_BAKCode", "to_BAKCode", "interval", "duration"]
f = open("traveltime_rail2rail_2017.txt", "w+")
f.write(','.join(headings))
f.write('\n')
f.close()

# Call the web scraper function
for i in input_data.next():
    for index, time in enumerate(times):
        result = webscrapper(time)
        result.insert(2, str(index + 1))
        print 'result:', result
        print

        f = open("traveltime_rail2rail_2017.txt", "a")
        f.write(','.join(result[0:4]))
        f.write('\n')
        f.close()
2 Answers
Performance Issues
The main bottleneck here is the blocking nature of the program. You are processing URLs one by one, sequentially - you don't process the next URL until you are done with the current one. This can be solved by switching to an asynchronous approach - either using Scrapy (which is the best thing that has happened in the Python web-scraping world), or something like grequests.
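For illustration only, here is a rough sketch of the grequests approach (not a drop-in replacement for the code above - the payloads below are placeholders that only mirror the form fields used in the question; the real request needs the additional default parameters shown in the requests example further down):

import grequests

URL = 'http://fahrplan.sbb.ch/bin/query.exe/en'

# placeholder payloads - illustrative only
payloads = [
    {"REQ0JourneyStopsS0G": "Eisenstadt", "REQ0JourneyStopsZ0G": "Tirana",
     "REQ0JourneyTime": "05:30", "date": "18.01.17"},
    {"REQ0JourneyStopsS0G": "Graz Hbf", "REQ0JourneyStopsZ0G": "Liezen",
     "REQ0JourneyTime": "09:00", "date": "18.01.17"},
]

# build the (not yet sent) requests
pending = (grequests.post(URL, data=payload) for payload in payloads)

# send them concurrently, at most 10 in flight at a time
for response in grequests.map(pending, size=10):
    if response is not None:  # None means the request failed
        print(response.status_code, len(response.content))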
Also, the HTML parsing speed can be improved by parsing only the relevant part of the document with the SoupStrainer class:
from bs4 import BeautifulSoup, SoupStrainer
parse_only = SoupStrainer("table", class_="hfs_overview")
soup = BeautifulSoup(br.response(), 'lxml', from_encoding="utf-8", parse_only=parse_only)
trs = soup.select('tr')
The other thing you can try is to switch from mechanize to requests, using a single requests.Session() instance for all the requests. This way, the underlying TCP connection will be reused, which may result in a performance improvement.
There are also some things you are re-doing over and over again in the loops. Things like the control variable should be pre-computed beforehand. Also, avoid redefining the get_sec() function inside the loop - define it beforehand. And use the min() function instead of calling sorted() and taking the first element.
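For example, a sketch of those three changes, reusing the names from the question's code:

from datetime import datetime

FMT = '%H:%M'

# pre-computed once instead of being re-parsed on every loop iteration;
# the interval boundaries are taken from the question's chain of ifs
CONTROL_TIMES = {
    '05:30': datetime.strptime('09:00', FMT),
    '09:00': datetime.strptime('12:00', FMT),
    '12:00': datetime.strptime('15:00', FMT),
    '15:00': datetime.strptime('18:00', FMT),
    '18:00': datetime.strptime('21:00', FMT),
    '21:00': datetime.strptime('05:30', FMT),
}


def get_sec(time_str):
    """Convert an 'H:MM' duration string to minutes - defined once, at module level."""
    h, m = time_str.split(':')
    return int(h) * 60 + int(m)


# inside webscrapper(x):
#     control = CONTROL_TIMES[x]              # no strptime() calls in the loop
#     best_connect = min(fastest_connect)     # instead of sorted(fastest_connect)[0]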
Code Style Issues
- if len(locations) > 0: can be improved to if locations:
- if len(durations) == 0: can be improved to if not durations:
- if len(fastest_connect) == 0: can be improved to if not fastest_connect:
- .select(...)[0] can be replaced with .select_one(...)
- BeautifulSoup understands file-like objects as well; replace br.response().read() with br.response()
- organize the imports as per the PEP8 recommendations:

  import csv
  from datetime import datetime

  from bs4 import BeautifulSoup
  from mechanize import Browser

- the # import needed libraries comment does not make much sense
- no need for the extra newline before the function docstrings
- put the main program logic into an if __name__ == '__main__': block to avoid it being executed on import
- by introducing the time variable, you are shadowing the name of the standard time module
- properly define constants (for example, the time format, or the magical 999999 number)
- use the with context manager when dealing with files
- remove the unused header variable
- skip the CSV header via the next() built-in function: next(reader, None)

A short sketch applying a few of these points follows below.
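Here it is (the constant names are illustrative, not part of the original code):

import csv

NO_CONNECTION = '999999'
OUTPUT_FILE = 'traveltime_rail2rail_2017.txt'
HEADINGS = ["from_BAKCode", "to_BAKCode", "interval", "duration"]


def datareader(datafile):
    """Read the city rows from the CSV file, skipping the header."""
    with open(datafile, 'r') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header instead of storing an unused variable
        return [row for row in reader]


if __name__ == '__main__':
    cities = datareader('BAK.csv')
    with open(OUTPUT_FILE, 'w') as out:
        out.write(','.join(HEADINGS) + '\n')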
A note about Python 3 compatibility:

- use the next() function instead of the .next() method
- range() vs xrange() (and the cross-Python way to handle both)
- use the print() function instead of the print statement

A small cross-version sketch follows below.
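One common way to keep a single code base running on both versions (an illustration, not taken from the original code):

from __future__ import print_function  # print() behaves the same on Python 2 and 3

try:
    range = xrange  # Python 2: reuse the lazy xrange under the Python 3 name
except NameError:
    pass            # Python 3: range is already lazy

rows = iter([['Code', 'Origin'], ['1', 'Eisenstadt']])
header = next(rows)       # the next() built-in works on both versions
print(header, list(rows))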
Here is some sample code that uses requests to make a search (note that we handle the default parameters "manually" - if you want the default parameter values to be handled automatically, as in the case of mechanize, look into MechanicalSoup or RoboBrowser):
import requests
from bs4 import BeautifulSoup, SoupStrainer
def merge_two_dicts(x, y):
    """Given two dicts, merge them into a new dict as a shallow copy."""
    z = x.copy()
    z.update(y)
    return z


url = "http://fahrplan.sbb.ch/bin/query.exe/en"

DEFAULT_PARAMS = {
    "changeQueryInputData=yes&start": "Search connection",
    "REQ0Total_KissRideMotorClass": "404",
    "REQ0Total_KissRideCarClass": "5",
    "REQ0Total_KissRide_maxDist": "10000000",
    "REQ0Total_KissRide_minDist": "0",
    "REQComparisonCarload": "0",
    "REQ0JourneyStopsS0A": "255",
    "REQ0JourneyStopsZ0A": "255",
    "REQ0JourneyStops1.0G": "",
    "REQ0JourneyStops1.0A": "1",
    "REQ0JourneyStopover1": ""
}

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"}
    session.get(url)  # visit the main page (might not be actually needed)

    # sample parameters
    params = {
        "REQ0JourneyStopsS0G": "Eisenstadt",
        "REQ0JourneyStopsZ0G": "Tirano, Stazione",
        "date": "27.02.17",
        "REQ0JourneyTime": "17:00"
    }
    response = session.post(url, data=merge_two_dicts(DEFAULT_PARAMS, params))

    parse_only = SoupStrainer("table", class_="hfs_overview")
    soup = BeautifulSoup(response.content, "lxml", parse_only=parse_only)

    # print out times for demonstration purposes
    trs = soup.select('tr')
    for tr in trs:
        time = tr.select_one('td.time')
        if time:
            print(time.get_text(strip=True))
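If you go this route, one way to plug it into your existing loop is to wrap the request in a small helper parameterized by the origin-destination pair (a sketch building on the snippet above, with an illustrative function name):

def query_connections(session, origin, destination, search_date, search_time):
    """Post one timetable query and return the rows of the results table."""
    params = {
        "REQ0JourneyStopsS0G": origin,
        "REQ0JourneyStopsZ0G": destination,
        "date": search_date,
        "REQ0JourneyTime": search_time,
    }
    response = session.post(url, data=merge_two_dicts(DEFAULT_PARAMS, params))
    parse_only = SoupStrainer("table", class_="hfs_overview")
    soup = BeautifulSoup(response.content, "lxml", parse_only=parse_only)
    return soup.select('tr')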
- Thanks for all your suggestions. Can you write a line of code showing how requests can be used in the code? – Nobi, Feb 27, 2017 at 14:52
- @Nobi yup, good idea, I was thinking about that, will do, thanks. – alecxe, Feb 27, 2017 at 14:52
- With your suggestions there is definitely a significant improvement (10 secs saved on each iteration), which will lead to considerable time savings since I have a lot of iterations to deal with. Also, very helpful tips for my future programming efforts. – Nobi, Feb 27, 2017 at 15:42
- @Nobi awesome! Glad to see it is helping. I've included a sample requests code (please try it out and measure if it is faster than mechanize). Thanks. – alecxe, Feb 27, 2017 at 16:28
- I know this is coming very late, but are you able to show me any other scheme to further increase the speed of this code? Context: after your good work, one iteration now runs in about 4 secs and I have just over 1 million iterations to run; that will take over 40 days, which is a lot of time that I don't have. Please help. – Nobi, Mar 11, 2017 at 13:38
Put this at the beginning:
import threading as th


class scrape(th.Thread):
    def __init__(self, time):
        th.Thread.__init__(self)
        self.time = time
        self.result = None

    def run(self):
        self.result = webscrapper(self.time)
And this instead of the for loop at the end:
for i in input_data.next():
    threads = [scrape(time) for time in times]
    for t in threads:
        t.start()

    for index, t in enumerate(threads):
        t.join()
        result = t.result
        result.insert(2, str(index + 1))
        print 'result:', result
        print

        f = open("traveltime_rail2rail_2017.txt", "a")
        f.write(','.join(result[0:4]))
        f.write('\n')
        f.close()
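For what it's worth, a thread pool keeps the bookkeeping simpler than subclassing Thread by hand. A minimal sketch using multiprocessing.dummy (a thread-backed Pool available on both Python 2 and 3), assuming webscrapper() is safe to call from several threads at once - which the code above does not guarantee:

from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API

pool = Pool(4)  # four worker threads
for i in input_data.next():
    # run webscrapper for every interval of the current O-D pair concurrently
    results = pool.map(webscrapper, times)
    for index, result in enumerate(results):
        result.insert(2, str(index + 1))
        print 'result:', result
pool.close()
pool.join()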
- Welcome to StackExchange Code Review! Please see: How do I write a good answer?, where you will find: "Every answer must make at least one insightful observation about the code in the question. Answers that merely provide an alternate solution with no explanation or justification do not constitute valid Code Review answers and may be deleted". – Stephen Rauch, Feb 26, 2017 at 17:33