I am trying to improve the performance of my scraper and plug up any possible security leaks (identifying information being revealed).
Ideally, I would like to achieve a performance of 10 pages per second. What would I need to do to achieve the biggest performance boost, besides getting a faster connection / dedicated server? What could be improved?
PS: I am only using eBay.com as an example here. The production version of the scraper will obey robots.txt requests, avoid peak traffic hours, and be throttled so I am not effectively DDoSing the site.
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool
import logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logging.getLogger("requests").setLevel(logging.WARNING)
# NOTE: The next two sections are for demo purposes only, they will be imported from modules
# this will be stored in proxies.py module
from random import choice
proxies = [
    {'host': '1.2.3.4', 'port': '1234', 'username': 'myuser', 'password': 'pw'},
    {'host': '2.3.4.5', 'port': '1234', 'username': 'myuser', 'password': 'pw'},
]
def check_proxy(session, proxy_host):
    response = session.get('http://canihazip.com/s')
    returned_ip = response.text
    if returned_ip != proxy_host:
        raise StandardError('Proxy check failed: {} not used while requesting'.format(proxy_host))
def random_proxy():
    return choice(proxies)
# / end of proxies.py
# this will be stored in user_agents.py module
from random import choice
user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
]
def random_user_agent():
    return choice(user_agents)
# / end of user_agents.py
def scrape_results_page(url):
    proxy = random_proxy()  # will be proxies.random_proxy()
    session = requests.Session()
    session.headers['User-Agent'] = random_user_agent()  # will be imported with "from user_agents import random_user_agent"
    session.proxies = {'http': "http://{username}:{password}@{host}:{port}/".format(**proxy)}
    check_proxy(session, proxy['host'])  # will be proxies.check_proxy(session, proxy['host'])
    response = session.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    try:
        first_item_title = soup.find('h3', class_="lvtitle").text.strip()
        return first_item_title
    except Exception as e:
        print url
        logging.error(e, exc_info=True)
        return None
page_numbers = range(1, 10)
search_url = 'http://www.ebay.com/sch/Gardening-Supplies-/2032/i.html?_pgn='
url_list = [search_url + str(i) for i in page_numbers]
# Make the Pool of workers
pool = ThreadPool(8)
# Open the urls in their own threads and return the results
results = pool.map(scrape_results_page, url_list)
# Close the pool and wait for the work to finish
pool.close()
pool.join()
To get it working you will need to enter valid proxies, or comment out the session.proxies = ... and check_proxy(session, proxy['host']) lines.
4 Answers
Performance
Each call to scrape_results_page will also call check_proxy. As such, check_proxy will get called for the same proxies multiple times, and I'm wondering if there's a reason for re-checking the proxies. If not, then you can save time by checking the list of proxies once at the beginning of the program, and removing the check from scrape_results_page.
Once you have a set of verified proxies, instead of selecting them randomly, it would be better to use them round-robin style, to balance the load. Admittedly this tip may not make any visible difference whatsoever.
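As a rough sketch (this assumes the proxies list and check_proxy from the question; verified_proxies, proxy_pool and next_proxy are placeholder names of mine), the one-time check plus round-robin selection could look like:
from itertools import cycle
# Check every proxy once at startup instead of on every page request.
verified_proxies = []
for proxy in proxies:
    session = requests.Session()
    session.proxies = {'http': "http://{username}:{password}@{host}:{port}/".format(**proxy)}
    try:
        check_proxy(session, proxy['host'])
        verified_proxies.append(proxy)
    except StandardError:
        logging.warning('Dropping proxy %s', proxy['host'])
# cycle() hands the verified proxies out round-robin instead of at random.
proxy_pool = cycle(verified_proxies)
def next_proxy():
    return next(proxy_pool)
scrape_results_page would then call next_proxy() instead of random_proxy() and skip the per-request check_proxy call.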
Other improvements
There are a lot of parameters buried in the implementation, for example the ebay url, canihazip.com, the number of threads and page numbers, and possibly others. It would be better to define such values at the top, in variables with descriptive names, all upper-cased to follow the convention for "constants".
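For example, pulling the values that already appear in the code above into module-level constants (the names are just illustrative):
SEARCH_URL = 'http://www.ebay.com/sch/Gardening-Supplies-/2032/i.html?_pgn='
IP_CHECK_URL = 'http://canihazip.com/s'
THREAD_COUNT = 8
PAGE_NUMBERS = range(1, 10)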
Intrusive Code:
When this script is imported from somewhere else, the interpreter will automatically execute all of its module-level code. I really don't want multithreaded code to run for \$n\$ seconds when I import something.
You should include an if __name__ == '__main__' block to fix that behaviour. In addition to that, I'd personally extract the last four lines into a separate method.
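A minimal sketch of both suggestions, reusing the setup lines from the question (the name main is my choice, not part of the original code):
def main():
    page_numbers = range(1, 10)
    search_url = 'http://www.ebay.com/sch/Gardening-Supplies-/2032/i.html?_pgn='
    url_list = [search_url + str(i) for i in page_numbers]
    pool = ThreadPool(8)
    results = pool.map(scrape_results_page, url_list)
    pool.close()
    pool.join()
    return results
if __name__ == '__main__':
    main()
This way nothing runs on import, and another module can call main() on its own terms.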
Impossible Async:
Let's say I want to run this and do something else on the main thread; not possible! You instantly join the pool after starting to scrape the results.
Instead of that, I'd personally keep the pool open and asynchronous, unless a .join method is explicitly called:
def join():
    pool.join()
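One way to read that, as a sketch: multiprocessing.dummy's Pool also exposes map_async, which returns immediately, so the main thread stays free until join() is called explicitly (url_list and scrape_results_page are the ones from the question):
pool = ThreadPool(8)
async_result = pool.map_async(scrape_results_page, url_list)
# ... the main thread is free to do other work here ...
def join():
    # Block only when the caller explicitly asks for the results.
    pool.close()
    pool.join()
    return async_result.get()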
Aside from that, janos makes awesome points; I'd just want to see a little documentation on what scrape_results_page does.
Requests is lovely to use, but unfortunately it's very slow. If you are aiming for maximum speed, the best use of your time would be spent switching to pycurl. You get true multi-threading with no Python GIL issues, and it's very fast. It's a bit low-level to work with, but if speed is of primary importance, I've never used a better option.
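To give a rough idea, a single fetch with pycurl could look something like this (a sketch only: the fetch helper and its parameters are mine, not an existing API, and it assumes pycurl is installed):
import pycurl
from io import BytesIO
def fetch(url, proxy=None, user_agent=None):
    buffer = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEFUNCTION, buffer.write)  # collect the body in memory
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    if user_agent:
        curl.setopt(pycurl.USERAGENT, user_agent)
    if proxy:
        curl.setopt(pycurl.PROXY, proxy)  # e.g. 'http://myuser:pw@1.2.3.4:1234'
    curl.perform()
    status = curl.getinfo(pycurl.RESPONSE_CODE)
    curl.close()
    return status, buffer.getvalue()
For many concurrent transfers, libcurl's multi interface (exposed as pycurl.CurlMulti) is where the larger gains tend to come from, at the cost of noticeably more bookkeeping.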
There's also an existing scraping framework I'm a big fan of called Grab, which implements pycurl and is quite nice to use. The only problem is that until recently it was only documented in Russian and only really used by Russian speakers, so I had to do a bit of source-code browsing to work out how I was supposed to be using it.
Welcome to Code Review! Your post is looking good so far, but it could really use some more. Could you write out some examples of using pycurl in your answer, and explain why your solution is better than the OP's? And, is there anything else in the OP's code that you notice? – SirPython, Oct 14, 2015 at 22:39
Have you considered using the Proxicity.io APIs (https://www.proxicity.io/api)? They would allow you to ditch the proxy list, and would relieve a lot of issues with trying to rotate the proxies yourself. Not to mention, proxy lists get stale pretty fast. Getting the proxy is as easy as:
proxy = requests.get('https://www.proxicity.io/api/v1/APIKEY/proxy').json()
Result:
{
"cookiesSupport": true,
"country": "US",
"curl": "http://107.151.136.205:80",
"getSupport": true,
"httpsSupport": false,
"ip": "107.151.136.205",
"ipPort": "107.151.136.205:80",
"isAnonymous": true,
"lastChecked": "Tue, 31 May 2016 12:36:45 GMT",
"port": "80",
"postSupport": true,
"protocol": "http",
"refererSupport": true,
"userAgentSupport": true
}
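As a sketch, the curl field of that response can be plugged straight into a requests session (this assumes the proxy variable from the snippet above):
session = requests.Session()
# 'curl' is already in the 'http://host:port' form that requests expects
session.proxies = {'http': proxy['curl']}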
Disclaimer: I am affiliated with Proxicity.io.
The APIs only give you 12 requests/minute for free, and the OP is looking for at least 10 requests per second ("Ideally, I would like to achieve a performance of 10 pages per second."). – Pimgd, Jul 20, 2016 at 20:46