
I've written a script in Python using the multiprocessing module to scrape values from web pages (one page per subprocess). As I'm very new to multiprocessing, I'm not sure whether I did everything in the right way. It works without error, though.

Here goes the full script:

import requests
from lxml.html import fromstring
from multiprocessing import Pool

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def create_links(url):
    response = requests.get(url).text
    tree = fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span")[0].text
        try:
            street = title.cssselect("span.street-address")[0].text
        except IndexError:
            street = ""
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        print(name, street, phone)

if __name__ == '__main__':
    links = [link.format(page) for page in range(1, 4)]
    with Pool(4) as p:
        p.map(create_links, links)

Any ideas for making it more robust will be highly appreciated.

Toby Speight
asked Nov 16, 2018 at 9:38
  • Were you intending to include page 4? If so, you need to change the range's (exclusive) stop value to 5. – Commented Nov 16, 2018 at 12:35

2 Answers


Proper use of a Pool p should include p.close() and p.join() (in this order).

Cases where the website does not respond should be handled: requests should have a timeout, the timeout exception should be caught, and non-200 responses should be handled as well.
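One way to sketch this advice, wrapped in a `fetch` helper that is not part of the original script (the 10-second timeout is an arbitrary choice):

```python
import requests

def fetch(url):
    """Return the page body, or None when the site misbehaves."""
    try:
        response = requests.get(url, timeout=10)  # fail fast instead of hanging forever
        response.raise_for_status()               # non-200 status -> requests.HTTPError
    except requests.Timeout:
        print("timed out:", url)
        return None
    except requests.RequestException as exc:      # connection errors, HTTP errors, ...
        print("request failed:", url, "-", exc)
        return None
    return response.text
```

`requests.Timeout` is itself a subclass of `RequestException`, so it must be caught first to get the more specific message.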

Other than that, the script is correct for a one-time extract of a few pages, but will need to be extended if you intend to run a high-volume daemon. In that case, the use of Pool is questionable and has multiple alternatives, such as the Scrapy framework's scheduler or a celery broker for high-level handling of workers (this prevents your workers from crashing entirely on exceptions, among a few other benefits).

answered Nov 16, 2018 at 17:58

You've (manually?) escaped your query params in your URL string. This is OK, and technically a few nanoseconds faster, but less legible than the alternative:

requests.get('https://www.yellowpages.com/search',
             params={'search_terms': 'coffee',
                     'geo_location_terms': 'Los Angeles, CA',
                     'page': page})

Then, rather than calling format, you simply pass in the page parameter.
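To see the encoding without hitting the network, a prepared request shows the URL that requests would build from those params (this snippet is illustrative and not part of the answer):

```python
from requests import Request

# Build the request object without sending it
prepared = Request(
    "GET", "https://www.yellowpages.com/search",
    params={"search_terms": "coffee",
            "geo_location_terms": "Los Angeles, CA",
            "page": 2},
).prepare()

# requests percent-encodes the space and comma for us
print(prepared.url)
```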

answered Nov 17, 2018 at 7:39