I am trying to improve the performance of my scraper and plug up any possible security leaks (identifying information being revealed).
Ideally, I would like to achieve a performance of 10 pages per second. What would I need to do to achieve the biggest performance boost, besides getting a faster connection / dedicated server? What could be improved?
PS: I am only using eBay.com as an example here. The production version of the scraper will obey robots.txt requests, avoid peak traffic hours, and be throttled so I am not effectively DDoSing the site.
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool
import logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logging.getLogger("requests").setLevel(logging.WARNING)
# NOTE: The next two sections are for demo purposes only, they will be imported from modules
# this will be stored in proxies.py module
from random import choice
proxies = [
    {'host': '1.2.3.4', 'port': '1234', 'username': 'myuser', 'password': 'pw'},
    {'host': '2.3.4.5', 'port': '1234', 'username': 'myuser', 'password': 'pw'},
]
def check_proxy(session, proxy_host):
    response = session.get('http://canihazip.com/s')
    returned_ip = response.text
    if returned_ip != proxy_host:
        raise StandardError('Proxy check failed: {} not used while requesting'.format(proxy_host))
def random_proxy():
    return choice(proxies)
# / end of proxies.py
# this will be stored in user_agents.py module
from random import choice
user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
]
def random_user_agent():
    return choice(user_agents)
# / end of user_agents.py
def scrape_results_page(url):
    proxy = random_proxy()  # will be proxies.random_proxy()
    session = requests.Session()
    session.headers['User-Agent'] = random_user_agent()  # will be imported with "from user_agents import random_user_agent"
    session.proxies = {'http': "http://{username}:{password}@{host}:{port}/".format(**proxy)}
    check_proxy(session, proxy['host'])  # will be proxies.check_proxy(session, proxy['host'])
    response = session.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    try:
        first_item_title = soup.find('h3', class_="lvtitle").text.strip()
        return first_item_title
    except Exception as e:
        print url
        logging.error(e, exc_info=True)
        return None
page_numbers = range(1, 10)
search_url = 'http://www.ebay.com/sch/Gardening-Supplies-/2032/i.html?_pgn='
url_list = [search_url + str(i) for i in page_numbers]
# Make the Pool of workers
pool = ThreadPool(8)
# Open the urls in their own threads and return the results
results = pool.map(scrape_results_page, url_list)
# Close the pool and wait for the work to finish
pool.close()
pool.join()
To get it working you will need to enter valid proxies, or comment out the session.proxies = ... and check_proxy(session, proxy['host']) lines.
4 Answers
Performance
Each call to scrape_results_page will also call check_proxy. As such, check_proxy will get called for the same proxies multiple times, and I'm wondering if there's a reason for re-checking the proxies. If not, then you can save time by checking the list of proxies once at the beginning of the program, and removing the check from scrape_results_page.
Once you have a set of verified proxies, instead of selecting them randomly, it would be better to use them round-robin style, to balance the load. Admittedly this tip may not make any visible difference whatsoever.
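As a rough sketch (this assumes the proxies list and check_proxy from the question; verified_proxies, proxy_pool and next_proxy are placeholder names of mine), the one-time check plus round-robin selection could look like:
from itertools import cycle
# Check every proxy once at startup instead of on every page request.
verified_proxies = []
for proxy in proxies:
    session = requests.Session()
    session.proxies = {'http': "http://{username}:{password}@{host}:{port}/".format(**proxy)}
    try:
        check_proxy(session, proxy['host'])
        verified_proxies.append(proxy)
    except StandardError:
        logging.warning('Dropping proxy %s', proxy['host'])
# cycle() hands the verified proxies out round-robin instead of at random.
proxy_pool = cycle(verified_proxies)
def next_proxy():
    return next(proxy_pool)
scrape_results_page would then call next_proxy() instead of random_proxy() and skip the per-request check_proxy call.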
Other improvements
There are a lot of parameters buried in the implementation, for example the ebay url, canihazip.com, the number of threads and page numbers, and possibly others. It would be better to define such values at the top, in variables with descriptive names, all upper-cased to follow the convention for "constants".
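For example, pulling the values that already appear in the code above into module-level constants (the names are just illustrative):
SEARCH_URL = 'http://www.ebay.com/sch/Gardening-Supplies-/2032/i.html?_pgn='
IP_CHECK_URL = 'http://canihazip.com/s'
THREAD_COUNT = 8
PAGE_NUMBERS = range(1, 10)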
Intrusive Code:
When this script is imported from somewhere else, the interpreter will automatically execute all of its module-level code. I really don't want multithreaded code to run for \$n\$ seconds when I import something.
You should include an if __name__ == '__main__' block to fix that behaviour. In addition to that, I'd personally extract the last four lines into a separate method.
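A minimal sketch of both suggestions, reusing the setup lines from the question (the name main is my choice, not part of the original code):
def main():
    page_numbers = range(1, 10)
    search_url = 'http://www.ebay.com/sch/Gardening-Supplies-/2032/i.html?_pgn='
    url_list = [search_url + str(i) for i in page_numbers]
    pool = ThreadPool(8)
    results = pool.map(scrape_results_page, url_list)
    pool.close()
    pool.join()
    return results
if __name__ == '__main__':
    main()
This way nothing runs on import, and another module can call main() on its own terms.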
Impossible Async:
Let's say I want to run this and do something else on the main thread; not possible! You instantly join the pool after starting to scrape the results.
Instead of that, I'd personally keep the pool open and asynchronous, unless a .join method is explicitly called:
def join():
    pool.join()
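One way to read that, as a sketch: multiprocessing.dummy's Pool also exposes map_async, which returns immediately, so the main thread stays free until join() is called explicitly (url_list and scrape_results_page are the ones from the question):
pool = ThreadPool(8)
async_result = pool.map_async(scrape_results_page, url_list)
# ... the main thread is free to do other work here ...
def join():
    # Block only when the caller explicitly asks for the results.
    pool.close()
    pool.join()
    return async_result.get()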
Aside from that, janos makes awesome points; I'd just want to see a little documentation on what scrape_results_page does.
Requests is lovely to use, but unfortunately it's very slow. If you are aiming for maximum speed, the best use of your time would be spent switching to pycurl. You get true multi-threading with no Python GIL issues, and it's very fast. It's a bit low-level to work with, but if speed is of primary importance, I've never used a better option.
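To give a rough idea, a single fetch with pycurl could look something like this (a sketch only: the fetch helper and its parameters are mine, not an existing API, and it assumes pycurl is installed):
import pycurl
from io import BytesIO
def fetch(url, proxy=None, user_agent=None):
    buffer = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEFUNCTION, buffer.write)  # collect the body in memory
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    if user_agent:
        curl.setopt(pycurl.USERAGENT, user_agent)
    if proxy:
        curl.setopt(pycurl.PROXY, proxy)  # e.g. 'http://myuser:pw@1.2.3.4:1234'
    curl.perform()
    status = curl.getinfo(pycurl.RESPONSE_CODE)
    curl.close()
    return status, buffer.getvalue()
For many concurrent transfers, libcurl's multi interface (exposed as pycurl.CurlMulti) is where the larger gains tend to come from, at the cost of noticeably more bookkeeping.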
There's also an existing scraping framework I'm a big fan of called Grab, which implements pycurl and is quite nice to use. The only problem is that until recently it was only documented in Russian and only really used by Russian speakers, so I had to do a bit of source-code browsing to work out how I was supposed to be using it.
Welcome to Code Review! Your post is looking good so far, but it could really use some more. Could you write out some examples of using pycurl in your answer, and explain why your solution is better than the OP's? And, is there anything else in the OP's code that you notice? – SirPython, Oct 14, 2015 at 22:39
Have you considered using the Proxicity.io APIs (https://www.proxicity.io/api)? They would allow you to ditch the proxy list, and would relieve a lot of issues with trying to rotate the proxies yourself. Not to mention, proxy lists get stale pretty fast. Getting the proxy is as easy as:
proxy = requests.get('https://www.proxicity.io/api/v1/APIKEY/proxy').json()
Result:
{
"cookiesSupport": true,
"country": "US",
"curl": "http://107.151.136.205:80",
"getSupport": true,
"httpsSupport": false,
"ip": "107.151.136.205",
"ipPort": "107.151.136.205:80",
"isAnonymous": true,
"lastChecked": "Tue, 31 May 2016 12:36:45 GMT",
"port": "80",
"postSupport": true,
"protocol": "http",
"refererSupport": true,
"userAgentSupport": true
}
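As a sketch, the curl field of that response can be plugged straight into a requests session (this assumes the proxy variable from the snippet above):
session = requests.Session()
# 'curl' is already in the 'http://host:port' form that requests expects
session.proxies = {'http': proxy['curl']}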
Disclaimer: I am affiliated with Proxicity.io.
The APIs only give you 12 requests/minute for free, and the OP is looking for at least 10 requests per second ("Ideally, I would like to achieve a performance of 10 pages per second."). – Pimgd, Jul 20, 2016 at 20:46