I've written a script in Python using multiprocessing to handle multiple processes at the same time and make the scraping faster. I've used locking within it to prevent two processes from changing its internal state. As I'm very new to implementing locking within multiprocessing, I suppose there is room for improvement.
The scraper collects the name, address and phone number of every coffee shop, traversing multiple pages of Yellowpages listings.
    import requests
    from lxml.html import fromstring
    from multiprocessing import Process, Lock

    link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
    itemstorage = []

    def get_info(url,lock,itemstorage):
        response = requests.get(url).text
        tree = fromstring(response)
        for title in tree.cssselect("div.info"):
            name = title.cssselect("a.business-name span")[0].text
            try:
                street = title.cssselect("span.street-address")[0].text
            except IndexError: street = ""
            try:
                phone = title.cssselect("div[class^=phones]")[0].text
            except IndexError: phone = ""
            itemstorage.extend([name, street, phone])
        return printer(lock,itemstorage)

    def printer(lock,data):
        lock.acquire()
        try:
            print(data)
        finally:
            lock.release()

    if __name__ == '__main__':
        lock = Lock()
        for i in [link.format(page) for page in range(1,15)]:
            p = Process(target=get_info, args=(i,lock,itemstorage))
            p.start()
2 Answers
Style
Please read PEP 8 and use a consistent style throughout your code:
- put a space after commas;
- use a line break after colons (see the example below);
- use UPPERCASE names for constants...
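Applied to your code, the second point means the except clauses would become:

    try:
        street = title.cssselect("span.street-address")[0].text
    except IndexError:
        street = ""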
Using locks
First off, locks support the context manager protocol, so you can simplify your printer to:

    def printer(lock, data):
        with lock:
            print(data)

which may not warrant a function of its own.
But most importantly, you say that "I've used locking within it to prevent two processes from changing its internal state", but you're not changing any shared state at all. All this lock does is prevent the output of different processes from being interleaved on the screen. Let's take a look at a modified version of your script: I've stored the started processes so I can join them, and I print itemstorage after all the computation is done.
    if __name__ == '__main__':
        lock = Lock()
        processes = [
            Process(target=get_info, args=(link.format(page), lock, itemstorage))
            for page in range(1, 15)
        ]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print('itemstorage is', itemstorage)
This prints

    [...actual results snipped...]
    itemstorage is []
This is because each process is operating on its own copy of itemstorage and nothing is done to retrieve the data afterwards. Instead, you should have your processes return their results and store them in itemstorage yourself. In fact, this very pattern is already implemented by multiprocessing.Pool.map.
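As an aside: if you really did want a single list shared between processes, the standard library offers multiprocessing.Manager. Here is a minimal sketch of that approach, with a stub worker standing in for get_info:

    from multiprocessing import Manager, Process

    def worker(page, shared):
        # stub standing in for get_info: extend the proxied, truly shared list
        shared.extend([(page, "name", "street", "phone")])

    if __name__ == '__main__':
        with Manager() as manager:
            shared = manager.list()  # a list proxy that all processes can mutate
            processes = [Process(target=worker, args=(page, shared))
                         for page in range(1, 15)]
            for p in processes:
                p.start()
            for p in processes:
                p.join()
            print(list(shared))  # copy the data out before the manager shuts down

That said, for this task, having the workers return their results is the better fit, as shown below.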
Simplifying element retrieval
Since you extract text from the DOM three times per title, you can pull out a helper function to simplify that task. Having done so, it is even easier to build the return list using a list comprehension:
    def extract(element, descriptor, default=None):
        try:
            return element.cssselect(descriptor)[0].text
        except IndexError:
            if default is None:
                raise
            return default

    def get_info(url):
        response = requests.get(url).text
        tree = fromstring(response)
        return [(
            extract(title, "a.business-name span"),
            extract(title, "span.street-address", ""),
            extract(title, "div[class^=phones]", ""),
        ) for title in tree.cssselect("div.info")]
This changes the structure a bit, but I believe it makes the information easier to access. You can still use itertools.chain.from_iterable if need be to flatten the returned list.
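For instance, with made-up per-page results:

    from itertools import chain

    pages = [[("a", "1 Main St", "555-0100")],
             [("b", "2 Oak Ave", "555-0101"), ("c", "", "")]]
    print(list(chain.from_iterable(pages)))
    # [('a', '1 Main St', '555-0100'), ('b', '2 Oak Ave', '555-0101'), ('c', '', '')]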
Proposed improvements
    import itertools
    from multiprocessing import Pool

    import requests
    from lxml.html import fromstring

    LINK = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

    def extract(element, descriptor, default=None):
        try:
            return element.cssselect(descriptor)[0].text
        except IndexError:
            if default is None:
                raise
            return default

    def get_info(url):
        response = requests.get(url)
        tree = fromstring(response.content)
        return [(
            extract(title, "a.business-name span"),
            extract(title, "span.street-address", ""),
            extract(title, "div[class^=phones]", ""),
        ) for title in tree.cssselect("div.info")]

    if __name__ == '__main__':
        pages_count = 14
        with Pool(processes=pages_count) as pool:
            urls = [LINK.format(page) for page in range(1, pages_count + 1)]
            itemstorage = pool.map(get_info, urls)
        for result in itertools.chain.from_iterable(itemstorage):
            print(result)
Note that I also changed the document-parsing part. For one, lxml is perfectly capable of handling bytes, so you don't have to perform the decoding yourself; for two, blindly decoding into a string can lead to using an improper charset, whereas lxml can figure out the right one by looking at the appropriate meta tag.
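To illustrate that last point, here is a small self-contained sketch (the markup is made up, not taken from Yellowpages):

    from lxml.html import fromstring

    # bytes declaring a non-UTF-8 charset; 0xe9 is "é" in iso-8859-1
    raw = ('<html><head>'
           '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
           '</head><body><p>caf\xe9</p></body></html>').encode('iso-8859-1')
    tree = fromstring(raw)
    print(tree.findtext('.//p'))  # café -- the charset was read from the meta tag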
General Feedback
This code is fairly straightforward and easy to read. There are only three functions and none of them extends beyond thirteen lines. If you really wanted more atomic functions, you could abstract some functionality out of get_info, for instance the code that parses each listing.
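Such a helper might look like this (parse_listing is a hypothetical name):

    def parse_listing(title):
        # parse one "div.info" element into a (name, street, phone) tuple
        name = title.cssselect("a.business-name span")[0].text
        try:
            street = title.cssselect("span.street-address")[0].text
        except IndexError:
            street = ""
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        return name, street, phone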
The name link doesn't feel as appropriate for a string literal that represents a URL as something like url_template does. Also see the related section below about naming constants.
The description was somewhat vague and didn't specify whether each listing should correspond to an individual list within the returned list, but if so, one could use itemstorage.append() instead of itemstorage.extend().
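The difference in the resulting structure:

    itemstorage = []
    itemstorage.extend(["name", "street", "phone"])  # flat: ['name', 'street', 'phone']

    itemstorage = []
    itemstorage.append(["name", "street", "phone"])  # nested: [['name', 'street', 'phone']]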
Suggestions
While it isn't a requirement, it is recommended that each function have a docstring:

"All modules should normally have docstrings, and all functions and classes exported by a module should also have docstrings. Public methods (including the __init__ constructor) should also have docstrings." (PEP 257)
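For instance, a docstring for get_info could be as simple as (the wording is only a suggestion):

    def get_info(url):
        """Scrape one Yellowpages result page.

        Return the collected (name, street, phone) data for every listing.
        """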
Additionally, while constants aren't really different from variables in Python, uppercase names are idiomatically used for constants in Python, as in many other languages. As mentioned above, url_template feels more appropriate for the string currently named link, so using uppercase letters to signify that it is a constant may improve readability further: URL_TEMPLATE.
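Putting both suggestions together:

    URL_TEMPLATE = ("https://www.yellowpages.com/search?search_terms=coffee"
                    "&geo_location_terms=Los%20Angeles%2C%20CA&page={}")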