I've written a script in Python using the multiprocessing module to scrape values from web pages (one page per subprocess). As I'm very new to multiprocessing, I'm not sure whether I did everything the right way. It works without error, though.
Here goes the full script:
import requests
from lxml.html import fromstring
from multiprocessing import Pool

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def create_links(url):
    response = requests.get(url).text
    tree = fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span")[0].text
        try:
            street = title.cssselect("span.street-address")[0].text
        except IndexError:
            street = ""
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        print(name, street, phone)

if __name__ == '__main__':
    links = [link.format(page) for page in range(1, 4)]
    with Pool(4) as p:
        p.map(create_links, links)
Any ideas for making it more robust would be highly appreciated.
Were you intending to include page 4? If so, you need to change the range's (exclusive) stop value to 5. – Solomon Ucko, Nov 16, 2018 at 12:35
2 Answers
Proper use of a Pool p should include p.close() and p.join() (in this order). Note that the with Pool(4) as p: block calls terminate() on exit rather than close() and join(), so calling them explicitly is still worthwhile if you want the workers to shut down cleanly.
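A minimal sketch of that pattern, reusing the question's link and create_links:

if __name__ == '__main__':
    links = [link.format(page) for page in range(1, 4)]
    p = Pool(4)
    try:
        p.map(create_links, links)
    finally:
        p.close()  # no new tasks will be submitted to the pool
        p.join()   # wait for the worker processes to exit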
Cases of the websites not responding should be handled: requests should have a timeout, the timeout exception should be caught, and non-200 responses should be handled as well.
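A sketch of that handling, with a hypothetical 10-second timeout and failed pages simply skipped:

def create_links(url):
    try:
        response = requests.get(url, timeout=10)  # 10 s is an arbitrary choice
    except requests.exceptions.Timeout:
        print("timed out:", url)
        return
    if response.status_code != 200:  # e.g. blocked, rate-limited, or missing
        print("bad status:", response.status_code, url)
        return
    tree = fromstring(response.text)
    # ... parse as in the question ...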
Other than that, the script is correct for a one-time extraction of a few pages, but it will need to be extended if you intend to build a high-volume daemon. In that case, using Pool is questionable and has several alternatives, such as the Scrapy framework's scheduler or a Celery broker for high-level handling of workers (which, among other benefits, keeps your workers from crashing entirely on exceptions).
You've (manually?) escaped your query params in your URL string. This is OK, and technically a few nanoseconds faster, but less legible than the alternative:
requests.get('https://www.yellowpages.com/search',
params={'search_terms': 'coffee',
'geo_location_terms': 'Los Angeles, CA',
'page': page})
Then, rather than calling format, you simply pass in the page parameter.
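Put together, a sketch where create_links takes the page number directly and the pool maps over range(1, 4) instead of pre-formatted URLs:

def create_links(page):
    response = requests.get('https://www.yellowpages.com/search',
                            params={'search_terms': 'coffee',
                                    'geo_location_terms': 'Los Angeles, CA',
                                    'page': page})
    tree = fromstring(response.text)
    # ... parse as in the question ...

if __name__ == '__main__':
    with Pool(4) as p:
        p.map(create_links, range(1, 4))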