I've written a script in Python using the multiprocessing module to scrape values from web pages (one page per subprocess). As I'm very new to multiprocessing, I'm not sure whether I did everything the right way. It works without error, though.
Here goes the full script:
import requests
from lxml.html import fromstring
from multiprocessing import Pool

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def create_links(url):
    response = requests.get(url).text
    tree = fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span")[0].text
        try:
            street = title.cssselect("span.street-address")[0].text
        except IndexError:
            street = ""
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        print(name, street, phone)

if __name__ == '__main__':
    links = [link.format(page) for page in range(1, 4)]
    with Pool(4) as p:
        p.map(create_links, links)
Any ideas for making it more robust would be highly appreciated.
Were you intending to include page 4? If so, you need to change the range's (exclusive) stop value to 5. – Solomon Ucko, Nov 16, 2018 at 12:35
2 Answers
Proper use of a Pool p should include p.close() and p.join() (in this order). Note that the with Pool(4) as p: block calls terminate() on exit rather than close() and join(), so calling them explicitly is still worthwhile if you want the workers to shut down cleanly.
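A minimal sketch of that pattern, reusing the question's link and create_links:

if __name__ == '__main__':
    links = [link.format(page) for page in range(1, 4)]
    p = Pool(4)
    try:
        p.map(create_links, links)
    finally:
        p.close()  # no new tasks will be submitted to the pool
        p.join()   # wait for the worker processes to exit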
Cases of the websites not responding should be handled: requests should have a timeout, the timeout exception should be caught, and non-200 responses should be handled as well.
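A sketch of that handling, with a hypothetical 10-second timeout and failed pages simply skipped:

def create_links(url):
    try:
        response = requests.get(url, timeout=10)  # 10 s is an arbitrary choice
    except requests.exceptions.Timeout:
        print("timed out:", url)
        return
    if response.status_code != 200:  # e.g. blocked, rate-limited, or missing
        print("bad status:", response.status_code, url)
        return
    tree = fromstring(response.text)
    # ... parse as in the question ...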
Other than that, the script is correct for a one-time extraction of a few pages, but it will need to be extended if you intend to build a high-volume daemon. In that case, using Pool is questionable and has several alternatives, such as the Scrapy framework's scheduler or a Celery broker for high-level handling of workers (which, among other benefits, keeps your workers from crashing entirely on exceptions).
You've (manually?) escaped your query params in your URL string. This is OK, and technically a few nanoseconds faster, but less legible than the alternative:
requests.get('https://www.yellowpages.com/search',
params={'search_terms': 'coffee',
'geo_location_terms': 'Los Angeles, CA',
'page': page})
Then, rather than calling format, you simply pass in the page parameter.
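Put together, a sketch where create_links takes the page number directly and the pool maps over range(1, 4) instead of pre-formatted URLs:

def create_links(page):
    response = requests.get('https://www.yellowpages.com/search',
                            params={'search_terms': 'coffee',
                                    'geo_location_terms': 'Los Angeles, CA',
                                    'page': page})
    tree = fromstring(response.text)
    # ... parse as in the question ...

if __name__ == '__main__':
    with Pool(4) as p:
        p.map(create_links, range(1, 4))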