I have just written this code to scrape some data from a website. In its current state it works fine; however, from my tests on the script, I discovered that with the amount of data I am processing, it will take a few days to finish the task. Is there a way to improve its performance? I will insert a sample of the data rather than the bulk of it.
Input data in CSV format:
Code Origin
1 Eisenstadt
2 Tirana
3 St Pölten Hbf
6 Wien Westbahnhof
7 Wien Hauptbahnhof
8 Klagenfurt Hbf
9 Villach Hbf
11 Graz Hbf
12 Liezen
Code:
# import needed libraries
import csv
from datetime import datetime
from mechanize import Browser
from bs4 import BeautifulSoup
def datareader(datafile):
    """ This function reads the cities from csv file and processes
    them into an O-D for input into the web scrapper """

    # Read the csv
    with open(datafile, 'r') as f:
        reader = csv.reader(f)
        header = reader.next()
        ListOfCities = [lines for lines in reader]
        temp = ListOfCities[:]
        city_num = []
        city_orig_dest = []
        for i in ListOfCities:
            for j in temp:
                ans1 = i[0], j[0]
                if ans1[0] != ans1[1]:
                    city_num.append(ans1)
                ans = (unicode(i[1], 'iso-8859-1'), unicode(j[1], 'iso-8859-1'), i[0], j[0])
                if ans[0] != ans[1] and ans[2] != ans[3]:
                    city_orig_dest.append(ans)
    yield city_orig_dest


input_data = datareader('BAK.csv')  # Input data here


def webscrapper(x):
    """ This function scraped the required website and extracts the
    quickest connection time within given time durations """

    # Create a browser object
    br = Browser()
    # Ignore robots.txt
    br.set_handle_robots(False)
    # Google demands a user-agent that isn't a robot
    br.addheaders = [('User-agent', 'Chrome')]
    # Retrieve the Google home page, saving the response
    br.open('http://fahrplan.sbb.ch/bin/query.exe/en')
    # Select the 6th form
    br.select_form(nr=6)

    # Assign origin and destination to the o d variables
    o = i[0].encode('iso-8859-1')
    d = i[1].encode('iso-8859-1')
    print 'o-d:', i[0], i[1]

    # Enter the text input (This section should be automated to read multiple text input as shown in the question)
    br.form["REQ0JourneyStopsS0G"] = o  # Origin train station (From)
    br.form["REQ0JourneyStopsZ0G"] = d  # Destination train station (To)
    br.form["REQ0JourneyTime"] = x  # Search Time
    br.form["date"] = '18.01.17'  # Search Date

    # Get the search results
    br.submit()

    # Click the later link three times to get trip times
    for _ in xrange(5):
        # Looking at some results in link format
        for l in br.links(text='Later'):
            pass
        response = br.follow_link(l)

    # get the response from mechanize Browser
    soup = BeautifulSoup(br.response().read(), 'lxml', from_encoding="utf-8")

    trs = soup.select('table.hfs_overview tr')

    connections_times = []
    ListOfSearchTimes = []

    # Scrape the search results from the resulting table
    for tr in trs:
        locations = tr.select('td.location')
        if len(locations) > 0:
            time = tr.select('td.time')[0].contents[0].strip()
            ListOfSearchTimes.append(time.encode('latin-1'))
            durations = tr.select('td.duration')
            # Check that the duration cell is not empty
            if len(durations) == 0:
                duration = ''
            else:
                duration = durations[0].contents[0].strip()

                # Convert duration time string to minutes
                def get_sec(time_str):
                    h, m = time_str.split(':')
                    return int(h) * 60 + int(m)

                connections_times.append(get_sec(duration))

    def group(lst, n):
        return zip(*[lst[i::n] for i in range(n)])

    arrivals_and_departure_pair = group(ListOfSearchTimes, 2)

    # Check that the selected departures for one interval occurs before the departure of the next interval
    fmt = '%H:%M'
    finalDepartureList = []
    for ind, res in arrivals_and_departure_pair:
        t1 = datetime.strptime(ind, fmt)
        if x == '05:30':
            control = datetime.strptime('09:00', fmt)
        if x == '09:00':
            control = datetime.strptime('12:00', fmt)
        if x == '12:00':
            control = datetime.strptime('15:00', fmt)
        if x == '15:00':
            control = datetime.strptime('18:00', fmt)
        if x == '18:00':
            control = datetime.strptime('21:00', fmt)
        if x == '21:00':
            control = datetime.strptime('05:30', fmt)
        if t1 < control:
            finalDepartureList.append(ind)

    # Get the list of connection times for the departures above
    fastest_connect = connections_times[:len(finalDepartureList)]

    # Get the fastest connections time and catch any error when there is no connection between the OD pairs
    try:
        best_connect = sorted(fastest_connect)[0]
        print 'fastest connection', best_connect
        # print duration
    except IndexError:
        print "No Connection"
        # print

    # Return the result of the search
    if len(fastest_connect) == 0:
        return [i[2], i[3], '999999']
    else:
        return [i[2], i[3], str(best_connect)]


# List of time intervals
times = ['05:30', '09:00', '12:00', '15:00', '18:00', '21:00']

# Write the heading of the output text file
headings = ["from_BAKCode", "to_BAKCode", "interval", "duration"]
f = open("traveltime_rail2rail_2017.txt", "w+")
f.write(','.join(headings))
f.write('\n')
f.close()

# Call the web scraper function
for i in input_data.next():
    for index, time in enumerate(times):
        result = webscrapper(time)
        result.insert(2, str(index + 1))
        print 'result:', result
        print

        f = open("traveltime_rail2rail_2017.txt", "a")
        f.write(','.join(result[0:4]))
        f.write('\n')
        f.close()
2 Answers
Performance Issues
The main bottleneck here is the blocking nature of the program. You are processing URLs one by one, sequentially - you don't process the next URL until you are done with the current one. This can be solved by switching to an asynchronous approach - either using Scrapy (which is the best thing that has happened in the Python web-scraping world), or something like grequests.
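For illustration only, here is a rough sketch of the grequests approach (not a drop-in replacement for the code above - the payloads below are placeholders that only mirror the form fields used in the question; the real request needs the additional default parameters shown in the requests example further down):

import grequests

URL = 'http://fahrplan.sbb.ch/bin/query.exe/en'

# placeholder payloads - illustrative only
payloads = [
    {"REQ0JourneyStopsS0G": "Eisenstadt", "REQ0JourneyStopsZ0G": "Tirana",
     "REQ0JourneyTime": "05:30", "date": "18.01.17"},
    {"REQ0JourneyStopsS0G": "Graz Hbf", "REQ0JourneyStopsZ0G": "Liezen",
     "REQ0JourneyTime": "09:00", "date": "18.01.17"},
]

# build the (not yet sent) requests
pending = (grequests.post(URL, data=payload) for payload in payloads)

# send them concurrently, at most 10 in flight at a time
for response in grequests.map(pending, size=10):
    if response is not None:  # None means the request failed
        print(response.status_code, len(response.content))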
Also, the HTML parsing speed can be improved by parsing only the relevant part of the document with the SoupStrainer class:
from bs4 import BeautifulSoup, SoupStrainer
parse_only = SoupStrainer("table", class_="hfs_overview")
soup = BeautifulSoup(br.response(), 'lxml', from_encoding="utf-8", parse_only=parse_only)
trs = soup.select('tr')
The other thing you can try is to switch from mechanize to requests, using a single requests.Session() instance for all the requests. This way, the underlying TCP connection will be reused, which may result in a performance improvement.
There are also some things you are re-doing over and over again in the loops. Things like the control variable should be pre-computed beforehand. Also, avoid redefining the get_sec() function inside the loop - define it beforehand. And use the min() function instead of calling sorted() and taking the first element.
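For example, a sketch of those three changes, reusing the names from the question's code:

from datetime import datetime

FMT = '%H:%M'

# pre-computed once instead of being re-parsed on every loop iteration;
# the interval boundaries are taken from the question's chain of ifs
CONTROL_TIMES = {
    '05:30': datetime.strptime('09:00', FMT),
    '09:00': datetime.strptime('12:00', FMT),
    '12:00': datetime.strptime('15:00', FMT),
    '15:00': datetime.strptime('18:00', FMT),
    '18:00': datetime.strptime('21:00', FMT),
    '21:00': datetime.strptime('05:30', FMT),
}


def get_sec(time_str):
    """Convert an 'H:MM' duration string to minutes - defined once, at module level."""
    h, m = time_str.split(':')
    return int(h) * 60 + int(m)


# inside webscrapper(x):
#     control = CONTROL_TIMES[x]              # no strptime() calls in the loop
#     best_connect = min(fastest_connect)     # instead of sorted(fastest_connect)[0]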
Code Style Issues
- if len(locations) > 0: can be improved to if locations:
- if len(durations) == 0: can be improved to if not durations:
- if len(fastest_connect) == 0: can be improved to if not fastest_connect:
- .select(...)[0] can be replaced with .select_one(...)
- BeautifulSoup understands file-like objects as well; replace br.response().read() with br.response()
- organize the imports as per the PEP8 recommendations:

  import csv
  from datetime import datetime

  from bs4 import BeautifulSoup
  from mechanize import Browser

- the # import needed libraries comment does not make much sense
- no need for the extra newline before the function docstrings
- put the main program logic into an if __name__ == '__main__': block to avoid it being executed on import
- by introducing the time variable, you are shadowing the name of the standard time module
- properly define constants (for example, the time format, or the magical 999999 number)
- use the with context manager when dealing with files
- remove the unused header variable
- skip the CSV header via the next() built-in function: next(reader, None)

A short sketch applying a few of these points follows below.
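Here it is (the constant names are illustrative, not part of the original code):

import csv

NO_CONNECTION = '999999'
OUTPUT_FILE = 'traveltime_rail2rail_2017.txt'
HEADINGS = ["from_BAKCode", "to_BAKCode", "interval", "duration"]


def datareader(datafile):
    """Read the city rows from the CSV file, skipping the header."""
    with open(datafile, 'r') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header instead of storing an unused variable
        return [row for row in reader]


if __name__ == '__main__':
    cities = datareader('BAK.csv')
    with open(OUTPUT_FILE, 'w') as out:
        out.write(','.join(HEADINGS) + '\n')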
A note about Python 3 compatibility:

- use the next() function instead of the .next() method
- range() vs xrange() (and the cross-Python way to handle both)
- use the print() function instead of the print statement

A small cross-version sketch follows below.
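One common way to keep a single code base running on both versions (an illustration, not taken from the original code):

from __future__ import print_function  # print() behaves the same on Python 2 and 3

try:
    range = xrange  # Python 2: reuse the lazy xrange under the Python 3 name
except NameError:
    pass            # Python 3: range is already lazy

rows = iter([['Code', 'Origin'], ['1', 'Eisenstadt']])
header = next(rows)       # the next() built-in works on both versions
print(header, list(rows))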
Here is some sample code that uses requests to make a search (note that we handle the default parameters "manually" - if you want the default parameter values to be handled automatically, as in the case of mechanize, look into MechanicalSoup or RoboBrowser):
import requests
from bs4 import BeautifulSoup, SoupStrainer
def merge_two_dicts(x, y):
    """Given two dicts, merge them into a new dict as a shallow copy."""
    z = x.copy()
    z.update(y)
    return z


url = "http://fahrplan.sbb.ch/bin/query.exe/en"

DEFAULT_PARAMS = {
    "changeQueryInputData=yes&start": "Search connection",
    "REQ0Total_KissRideMotorClass": "404",
    "REQ0Total_KissRideCarClass": "5",
    "REQ0Total_KissRide_maxDist": "10000000",
    "REQ0Total_KissRide_minDist": "0",
    "REQComparisonCarload": "0",
    "REQ0JourneyStopsS0A": "255",
    "REQ0JourneyStopsZ0A": "255",
    "REQ0JourneyStops1.0G": "",
    "REQ0JourneyStops1.0A": "1",
    "REQ0JourneyStopover1": ""
}

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"}
    session.get(url)  # visit the main page (might not be actually needed)

    # sample parameters
    params = {
        "REQ0JourneyStopsS0G": "Eisenstadt",
        "REQ0JourneyStopsZ0G": "Tirano, Stazione",
        "date": "27.02.17",
        "REQ0JourneyTime": "17:00"
    }
    response = session.post(url, data=merge_two_dicts(DEFAULT_PARAMS, params))

    parse_only = SoupStrainer("table", class_="hfs_overview")
    soup = BeautifulSoup(response.content, "lxml", parse_only=parse_only)

    # print out times for demonstration purposes
    trs = soup.select('tr')
    for tr in trs:
        time = tr.select_one('td.time')
        if time:
            print(time.get_text(strip=True))
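If you go this route, one way to plug it into your existing loop is to wrap the request in a small helper parameterized by the origin-destination pair (a sketch building on the snippet above, with an illustrative function name):

def query_connections(session, origin, destination, search_date, search_time):
    """Post one timetable query and return the rows of the results table."""
    params = {
        "REQ0JourneyStopsS0G": origin,
        "REQ0JourneyStopsZ0G": destination,
        "date": search_date,
        "REQ0JourneyTime": search_time,
    }
    response = session.post(url, data=merge_two_dicts(DEFAULT_PARAMS, params))
    parse_only = SoupStrainer("table", class_="hfs_overview")
    soup = BeautifulSoup(response.content, "lxml", parse_only=parse_only)
    return soup.select('tr')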
- Thanks for all your suggestions. Can you write a line of code showing how requests can be used in the code? – Nobi, Feb 27, 2017 at 14:52
- @Nobi yup, good idea, I was thinking about that, will do, thanks. – alecxe, Feb 27, 2017 at 14:52
- With your suggestions there is definitely a significant improvement (10 secs saved on each iteration), which will lead to considerable time savings since I have a lot of iterations to deal with. Also, very helpful tips for my future programming efforts. – Nobi, Feb 27, 2017 at 15:42
- @Nobi awesome! Glad to see it is helping. I've included a sample requests code (please try it out and measure if it is faster than mechanize). Thanks. – alecxe, Feb 27, 2017 at 16:28
- I know this is coming very late, but are you able to show me any other scheme to further increase the speed of this code? Context: after your good work, one iteration now runs in about 4 secs and I have just over 1 million iterations to run; that will take over 40 days, which is a lot of time that I don't have. Please help. – Nobi, Mar 11, 2017 at 13:38
Put this at the beginning:
import threading as th


class scrape(th.Thread):
    def __init__(self, time):
        th.Thread.__init__(self)
        self.time = time
        self.result = None

    def run(self):
        self.result = webscrapper(self.time)
And this instead of the for loop at the end:
for i in input_data.next():
    threads = [scrape(time) for time in times]
    for t in threads:
        t.start()

    for index, t in enumerate(threads):
        t.join()
        result = t.result
        result.insert(2, str(index + 1))
        print 'result:', result
        print

        f = open("traveltime_rail2rail_2017.txt", "a")
        f.write(','.join(result[0:4]))
        f.write('\n')
        f.close()
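For what it's worth, a thread pool keeps the bookkeeping simpler than subclassing Thread by hand. A minimal sketch using multiprocessing.dummy (a thread-backed Pool available on both Python 2 and 3), assuming webscrapper() is safe to call from several threads at once - which the code above does not guarantee:

from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API

pool = Pool(4)  # four worker threads
for i in input_data.next():
    # run webscrapper for every interval of the current O-D pair concurrently
    results = pool.map(webscrapper, times)
    for index, result in enumerate(results):
        result.insert(2, str(index + 1))
        print 'result:', result
pool.close()
pool.join()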
- Welcome to StackExchange Code Review! Please see: How do I write a good answer?, where you will find: "Every answer must make at least one insightful observation about the code in the question. Answers that merely provide an alternate solution with no explanation or justification do not constitute valid Code Review answers and may be deleted". – Stephen Rauch, Feb 26, 2017 at 17:33