I've written a script in Python using multiprocessing to handle multiple processes at the same time and make the scraping faster. I've used locking within it to prevent two processes from changing its internal state. As I'm very new to implementing locking within multiprocessing, I suppose there is room for improvement.
The scraper collects the name, address and phone number of every coffee shop, traversing multiple pages of Yellowpages listings.
    import requests
    from lxml.html import fromstring
    from multiprocessing import Process, Lock

    link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
    itemstorage = []

    def get_info(url,lock,itemstorage):
        response = requests.get(url).text
        tree = fromstring(response)
        for title in tree.cssselect("div.info"):
            name = title.cssselect("a.business-name span")[0].text
            try:
                street = title.cssselect("span.street-address")[0].text
            except IndexError: street = ""
            try:
                phone = title.cssselect("div[class^=phones]")[0].text
            except IndexError: phone = ""
            itemstorage.extend([name, street, phone])
        return printer(lock,itemstorage)

    def printer(lock,data):
        lock.acquire()
        try:
            print(data)
        finally:
            lock.release()

    if __name__ == '__main__':
        lock = Lock()
        for i in [link.format(page) for page in range(1,15)]:
            p = Process(target=get_info, args=(i,lock,itemstorage))
            p.start()
2 Answers
Style
Please read PEP 8 and use a consistent style throughout your code:
- put a space after commas;
- use a line break after colons (see the example below);
- use UPPERCASE names for constants...
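Applied to your code, the second point means the except clauses would become:

    try:
        street = title.cssselect("span.street-address")[0].text
    except IndexError:
        street = ""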
Using locks
First off, locks support the context manager protocol, so you can simplify your printer to:

    def printer(lock, data):
        with lock:
            print(data)

which may not warrant a function of its own.
But most importantly, you say that "I've used locking within it to prevent two processes from changing its internal state", but you're not changing any shared state at all. All this lock does is prevent the output of different processes from being interleaved on the screen. Let's take a look at a modified version of your script: I've stored the started processes so I can join them, and I print itemstorage after all the computation is done.
    if __name__ == '__main__':
        lock = Lock()
        processes = [
            Process(target=get_info, args=(link.format(page), lock, itemstorage))
            for page in range(1, 15)
        ]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print('itemstorage is', itemstorage)
This prints

    [...actual results snipped...]
    itemstorage is []
This is because each process is operating on its own copy of itemstorage and nothing is done to retrieve the data afterwards. Instead, you should have your processes return their results and store them in itemstorage yourself. In fact, this very pattern is already implemented by multiprocessing.Pool.map.
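As an aside: if you really did want a single list shared between processes, the standard library offers multiprocessing.Manager. Here is a minimal sketch of that approach, with a stub worker standing in for get_info:

    from multiprocessing import Manager, Process

    def worker(page, shared):
        # stub standing in for get_info: extend the proxied, truly shared list
        shared.extend([(page, "name", "street", "phone")])

    if __name__ == '__main__':
        with Manager() as manager:
            shared = manager.list()  # a list proxy that all processes can mutate
            processes = [Process(target=worker, args=(page, shared))
                         for page in range(1, 15)]
            for p in processes:
                p.start()
            for p in processes:
                p.join()
            print(list(shared))  # copy the data out before the manager shuts down

That said, for this task, having the workers return their results is the better fit, as shown below.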
Simplifying element retrieval
Since you extract text from the DOM three times per title, you can pull out a helper function to simplify that task. Having done so, it is even easier to build the return list using a list comprehension:
    def extract(element, descriptor, default=None):
        try:
            return element.cssselect(descriptor)[0].text
        except IndexError:
            if default is None:
                raise
            return default

    def get_info(url):
        response = requests.get(url).text
        tree = fromstring(response)
        return [(
            extract(title, "a.business-name span"),
            extract(title, "span.street-address", ""),
            extract(title, "div[class^=phones]", ""),
        ) for title in tree.cssselect("div.info")]
This changes the structure a bit, but I believe it makes the information easier to access. You can still use itertools.chain.from_iterable if need be to flatten the returned list.
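For instance, with made-up per-page results:

    from itertools import chain

    pages = [[("a", "1 Main St", "555-0100")],
             [("b", "2 Oak Ave", "555-0101"), ("c", "", "")]]
    print(list(chain.from_iterable(pages)))
    # [('a', '1 Main St', '555-0100'), ('b', '2 Oak Ave', '555-0101'), ('c', '', '')]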
Proposed improvements
    import itertools
    from multiprocessing import Pool

    import requests
    from lxml.html import fromstring

    LINK = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

    def extract(element, descriptor, default=None):
        try:
            return element.cssselect(descriptor)[0].text
        except IndexError:
            if default is None:
                raise
            return default

    def get_info(url):
        response = requests.get(url)
        tree = fromstring(response.content)
        return [(
            extract(title, "a.business-name span"),
            extract(title, "span.street-address", ""),
            extract(title, "div[class^=phones]", ""),
        ) for title in tree.cssselect("div.info")]

    if __name__ == '__main__':
        pages_count = 14
        with Pool(processes=pages_count) as pool:
            urls = [LINK.format(page) for page in range(1, pages_count + 1)]
            itemstorage = pool.map(get_info, urls)
        for result in itertools.chain.from_iterable(itemstorage):
            print(result)
Note that I also changed the document-parsing part. For one, lxml is perfectly capable of handling bytes, so you don't have to perform the decoding yourself; for two, blindly decoding into a string can lead to using an improper charset, whereas lxml can figure out the right one by looking at the appropriate meta tag.
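To illustrate that last point, here is a small self-contained sketch (the markup is made up, not taken from Yellowpages):

    from lxml.html import fromstring

    # bytes declaring a non-UTF-8 charset; 0xe9 is "é" in iso-8859-1
    raw = ('<html><head>'
           '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
           '</head><body><p>caf\xe9</p></body></html>').encode('iso-8859-1')
    tree = fromstring(raw)
    print(tree.findtext('.//p'))  # café -- the charset was read from the meta tag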
General Feedback
This code is fairly straightforward and easy to read. There are only three functions and none of them extends beyond thirteen lines. If you really wanted more atomic functions, you could abstract some functionality out of get_info, for instance the code that parses each listing.
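Such a helper might look like this (parse_listing is a hypothetical name):

    def parse_listing(title):
        # parse one "div.info" element into a (name, street, phone) tuple
        name = title.cssselect("a.business-name span")[0].text
        try:
            street = title.cssselect("span.street-address")[0].text
        except IndexError:
            street = ""
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        return name, street, phone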
The name link doesn't feel as appropriate for a string literal that represents a URL as something like url_template does. Also see the related section below about naming constants.
The description was somewhat vague and didn't specify whether each listing should correspond to an individual list within the returned list, but if so, one could use itemstorage.append() instead of itemstorage.extend().
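The difference in the resulting structure:

    itemstorage = []
    itemstorage.extend(["name", "street", "phone"])  # flat: ['name', 'street', 'phone']

    itemstorage = []
    itemstorage.append(["name", "street", "phone"])  # nested: [['name', 'street', 'phone']]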
Suggestions
While it isn't a requirement, it is recommended that each function have a docstring:

"All modules should normally have docstrings, and all functions and classes exported by a module should also have docstrings. Public methods (including the __init__ constructor) should also have docstrings." (PEP 257)
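For instance, a docstring for get_info could be as simple as (the wording is only a suggestion):

    def get_info(url):
        """Scrape one Yellowpages result page.

        Return the collected (name, street, phone) data for every listing.
        """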
Additionally, while constants aren't really different from variables in Python, uppercase names are idiomatically used for constants in Python, as in many other languages. As mentioned above, url_template feels more appropriate for the string currently named link, so using uppercase letters to signify that it is a constant may improve readability further: URL_TEMPLATE.
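Putting both suggestions together:

    URL_TEMPLATE = ("https://www.yellowpages.com/search?search_terms=coffee"
                    "&geo_location_terms=Los%20Angeles%2C%20CA&page={}")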