Below is a portion of the code I have written to scrape the bikesales.com.au website for details of bikes for sale (the full code is here). It finds all the 'href' attributes on each search page and requests the HTML for each href, which corresponds to an individual bike for sale. The code works correctly, but I had to add retry attempts with exponential backoff to avoid the following error:

ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)

I would like to avoid the backoff approach if possible.
import time
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def get_html_content(url, multiplier=1):
 """
 Retrieve the contents of the url.
 """
 # Be a responsible scraper.
 # The multiplier is used to exponentially increase the delay when there are
 # several attempts at connecting to the url.
 time.sleep(2*multiplier)
 # Get the html from the url
 try:
 with closing(get(url)) as resp:
 content_type = resp.headers['Content-Type'].lower()
 if is_good_response(resp):
 return resp.content
 else:
 # Unable to get the url response
 return None
 except RequestException as e:
 print("Error during requests to {0} : {1}".format(url, str(e)))
if __name__ == '__main__':
 baseUrl = 'https://www.bikesales.com.au/'
 url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'
 content = get_html_content(url)
 html = BeautifulSoup(content, 'html.parser')
 BikeList = html.findAll("a", {"class": "item-link-container"})
 # Cycle through the list of bikes on each search page.
 for bike in BikeList:
 # Get the URL for each bike.
 individualBikeURL = bike.attrs['href']
 BikeContent = get_html_content(baseUrl+individualBikeURL)
 # Reset the multiplier for each new url.
 multiplier = 1
 ## Occasionally the connection is lost, so try again.
 ## I'm not sure why the connection is lost; it might be that the site is trying to guard against scraping software.
 # If the initial attempt to connect to the url was unsuccessful, try again with an increasing delay.
 while (BikeContent == None):
 # Limit the exponential delay to 16x
 if (multiplier < 16):
 multiplier *= 2
 BikeContent = get_html_content(baseUrl+individualBikeURL, multiplier)
My question is: is there something I am missing in the implementation of the request, or is this just a result of the site denying scraping tools?
1 Answer
- I assume `is_good_response` is just checking for a 200 response code (a sketch of that assumption follows below).
- Merge `is_good_response`, `get_html_content` and the insides of your for-loop in your main together.
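For reference, `is_good_response` isn't shown in the question, so the following is just a guess matching that assumption:

def is_good_response(resp):
 # Assumption only: the helper simply checks for a successful (2xx) status code.
 return 200 <= resp.status_code < 300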
Merging them makes the main code:
import time
from contextlib import closing
from requests import get
from requests.exceptions import RequestException
from bs4 import BeautifulSoup

if __name__ == '__main__':
 baseUrl = 'https://www.bikesales.com.au/'
 url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'
 content = get_bike(url)
 html = BeautifulSoup(content, 'html.parser')
 bike_list = html.findAll("a", {"class": "item-link-container"})
 for bike in bike_list:
 individualBikeURL = bike.attrs['href']
 bike_content = get_bike(baseUrl + individualBikeURL)
We will be focusing on:
def get_bike(url):
 multiplier = 1
 content = None
 while content is None:
 time.sleep(2*multiplier)
 try:
 with closing(get(url)) as resp:
 content_type = resp.headers['Content-Type'].lower()
 if 200 <= resp.status_code < 300:
 content = resp.content
 except RequestException as e:
 print("Error during requests to {0} : {1}".format(url, str(e)))
 if multiplier < 16:
 multiplier *= 2
 return content
Allow a retry argument. Retry should act differently depending on its value:

- `None` - don't retry.
- `-1` - retry infinitely.
- `n` - retry `n` times, with exponentially increasing delays from \2ドル^0\$ up to \2ドル^{n-1}\$ seconds.
- an iterable - loop through it for the delays.
We can also add another function to work the same way your previous code did.
- You shouldn't need to use `contextlib.closing`, as `Response.close` "should not normally need to be called explicitly."
- You don't need `content_type` in `get_bike`.
- You should use `*args` and `**kwargs` so you can use `requests.get`'s arguments if you ever need to.
- You can allow this to work with `post` and other request methods if you take the method as a parameter.
import time
import itertools
import collections.abc

import requests.exceptions


def request(method, retry=None, *args, **kwargs):
 if retry is None:
 retry = iter(())
 elif retry == -1:
 retry = (2**i for i in itertools.count())
 elif isinstance(retry, int):
 retry = (2**i for i in range(retry))
 elif isinstance(retry, collections.abc.Iterable):
 pass
 else:
 raise ValueError('Unknown retry {retry}'.format(retry=retry))
 for sleep in itertools.chain([0], retry):
 if sleep:
 time.sleep(sleep)
 try:
 resp = method(*args, **kwargs)
 if 200 <= resp.status_code < 300:
 return resp.content
 except requests.exceptions.RequestException as e:
 print('Error during request: {0}'.format(e))
 return None
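For illustration, the different retry values could be passed like this (the URL is just a placeholder):

request(requests.get, None, 'https://example.com') # one attempt, no retries
request(requests.get, 5, 'https://example.com') # delays of 1, 2, 4, 8, 16 seconds
request(requests.get, [1, 1, 5], 'https://example.com') # explicit delay schedule
request(requests.post, 3, 'https://example.com', data={}) # works with other methods too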
def bike_retrys():
 for i in range(5):
 yield 2**i
 while True:
 yield 16
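As a quick check of the schedule this produces, the first few delays it yields are:

print(list(itertools.islice(bike_retrys(), 8)))
# [1, 2, 4, 8, 16, 16, 16, 16]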
To improve the rest of the code:

- Use snake case.
- Constants should be in upper snake case.
- Use the above code.
- Use `import requests`, rather than `from requests import get`.
- You can make a little helper function to call `request`, so usage is cleaner.
import requests
from bs4 import BeautifulSoup


def get_bike(*args, **kwargs):
 return request(requests.get, bike_retrys(), *args, **kwargs)


if __name__ == '__main__':
 BASE_URL = 'https://www.bikesales.com.au/'
 url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'
 content = get_bike(url)
 html = BeautifulSoup(content, 'html.parser')
 bike_list = html.findAll("a", {"class": "item-link-container"})
 for bike in bike_list:
 bike_content = get_bike(BASE_URL + bike.attrs['href'])
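As a side note, `request` returns `None` when every attempt fails, so a guard before parsing the search page might be worth adding, for example:

content = get_bike(url)
if content is None:
 raise SystemExit('Could not fetch {}'.format(url))
html = BeautifulSoup(content, 'html.parser')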
Comments:

- `if 200 <= resp.status_code < 300` => `if resp.ok`? – 301_Moved_Permanently, May 29, 2018 at 15:45
- @MathiasEttinger I didn't know `resp.ok` was a thing. However, from the documentation, it is the same as `200 <= resp.status_code < 400`. – May 29, 2018 at 15:48
- Right, but since `allow_redirects=False` is not used here, all 3xx are converted to the final element. – 301_Moved_Permanently, May 29, 2018 at 15:57
- @MathiasEttinger I'll admit I don't know much about 3xx. From what you put, it'd be better to use it whether we use `allow_redirects` or not. I'll edit my answer in a bit, or you can if you want. – May 29, 2018 at 16:12
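Following that exchange, the status check inside `request` could be written with `resp.ok` (a sketch of the commenters' suggestion; `resp.ok` is true for any status code below 400):

try:
 resp = method(*args, **kwargs)
 if resp.ok: # equivalent to resp.status_code < 400
 return resp.content
except requests.exceptions.RequestException as e:
 print('Error during request: {0}'.format(e))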