To familiarize myself with async requests, I wrote a very simple scraper that relies on aiohttp to retrieve some basic information (product name and availability status) from the product pages of an Italian e-commerce retailer.
The code is organized according to the following structure:
stores.py
The stores module contains a prototype AsyncScraper class that holds all the request-related logic: it builds the coroutine task list (one coroutine per product to be scraped) and provides a method to dispatch each request and extract the target information. Since every website has a different DOM, each e-commerce site gets its own subclass implementing site-specific extraction methods.
import asyncio

from aiohttp import ClientSession
from bs4 import BeautifulSoup

import const


class AsyncScraper:
    """
    A base scraper class to interact with a website.
    """

    def __init__(self):
        self.product_ids = None
        self.base_url = None
        self.content = None

    # Placeholder methods, overridden by website-specific subclasses
    def get_product_title(self):
        pass

    def get_product_availability(self):
        pass

    async def _get_tasks(self):
        # Build one coroutine per product, each capped at a 20-second timeout
        tasks = []
        async with ClientSession() as s:
            for product in self.product_ids:
                tasks.append(asyncio.wait_for(self._scrape_elem(product, s), 20))
            print(tasks)  # debug: inspect the scheduled tasks
            return await asyncio.gather(*tasks)

    async def _scrape_elem(self, product, session):
        async with session.get(
            self._build_url(product), raise_for_status=True
        ) as res:
            if res.status != 200:
                print(f"something went wrong: {res.status}")
            page_content = await res.text()
            self.content = BeautifulSoup(page_content, "html.parser")
            # Extract product attributes
            title = self.get_product_title()
            availability = self.get_product_availability()
            # Check if stuff is actually working
            print(f"{title} - {availability}")

    def scrape_stuff(self):
        loop = asyncio.get_event_loop()
        loop.run_until_complete(self._get_tasks())

    def _build_url(self, product_id):
        return f"{self.base_url}{product_id}"
class EuronicsScraper(AsyncScraper):
    """
    Class implementing extraction logic for euronics.it
    """

    base_url = "https://www.euronics.it/"

    def __init__(self):
        self.product_ids = const.euronics_prods

    def get_product_title(self):
        title = self.content.find(
            "h1", {"class": "productDetails__name"}
        ).text.strip()
        return title

    def get_product_availability(self):
        avail_kw = ["prenota", "aggiungi"]
        availability = self.content.find(
            "span", {"class": "button__title--iconTxt"}
        ).text.strip()
        # Availability is inferred from the button text
        if any(word in availability.lower() for word in avail_kw):
            availability = "Disponibile"
        else:
            availability = "Non disponibile"
        return availability
const.py
The target products to be scraped are stored in a const module. This is as simple as declaring a tuple of product IDs.
# Product IDs to be scraped
euronics_prods = (
    "obiettivi-zoom/nikon/50mm-f12-nikkor/eProd162017152/",
    "tostapane-tostiere/ariete/155/eProd172015168/",
)
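Combined with the base_url defined on EuronicsScraper, _build_url turns the first entry into https://www.euronics.it/obiettivi-zoom/nikon/50mm-f12-nikkor/eProd162017152/.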
runner.py
The script is ultimately run by iterating over a list of scrapers and invoking their scrape_stuff method, inherited from the AsyncScraper parent class.
"""
This is just a helper used as a script runner
"""
from stores import EuronicsScraper
def main():
scrapers = [EuronicsScraper()]
for scraper in scrapers:
scraper.scrape_stuff()
if __name__ == "__main__":
main()
Questions
I am mainly interested in whether I've overlooked anything major that might make this piece of code hard to rework or debug in the future. While I was writing it, it made complete sense to me because:
- Implementing a new scraper is just a matter of subclassing AsyncScraper and implementing its extraction methods.
- All request-related logic is in one place. It might be necessary to override these methods for classes dealing with websites that need some JS interaction (probably using a headless browser via selenium; a minimal sketch follows this list), but I feel that's well beyond the scope of this review.
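For illustration, here is a rough sketch of what such an override might look like (the JsHeavyScraper name and the _fetch_with_browser helper are hypothetical, and this assumes selenium with a Chrome driver is installed):
import asyncio

from bs4 import BeautifulSoup
from selenium import webdriver


class JsHeavyScraper(AsyncScraper):
    """Hypothetical subclass for a site that renders its content with JS."""

    async def _scrape_elem(self, product, session):
        # Selenium's API is blocking, so run it in a thread executor
        # to avoid stalling the event loop.
        loop = asyncio.get_event_loop()
        page_content = await loop.run_in_executor(
            None, self._fetch_with_browser, self._build_url(product)
        )
        self.content = BeautifulSoup(page_content, "html.parser")
        print(f"{self.get_product_title()} - {self.get_product_availability()}")

    def _fetch_with_browser(self, url):
        driver = webdriver.Chrome()
        try:
            driver.get(url)
            return driver.page_source  # HTML after JS has executed
        finally:
            driver.quit()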
One thing I am not too fond of (I probably need to dive deeper into inheritance) is the use of placeholder methods in AsyncScraper, as it forces me to implement n dummy methods (where n is the number of website-specific methods found in the other classes). I feel this is a bit of a hack and kind of defeats the purpose of class inheritance.
Any advice is more than welcome.
1 Answer
One thing I am not too fond of (probably need to dive deeper into inheritance) is the use of placeholder methods in AsyncScraper as it'll force me to implement n dummy methods (where n is the number of website-specific methods that can be found in the other classes). I feel this is a bit of a hack and kind of defeats the purpose of class inheritance.
Instead of additional placeholder methods in AsyncScraper, you could use a single abstract method that returns a dict of additional site-specific data. Then concrete classes would override that single abstract method to supply the n additional data points. Something like:
stores.py
class AsyncScraper:
    ...
    def get_site_specific_details(self) -> dict[str, str]:
        raise NotImplementedError()  # or pass if this is optional
    ...

    async def _scrape_elem(self, product, session):
        ...
        # Extract product attributes
        title = self.get_product_title()
        availability = self.get_product_availability()
        additional_details = self.get_site_specific_details()
        # Check if stuff is actually working
        print(f"{title} - {availability}")
        print("Additional details: ")
        for name, value in additional_details.items():
            print(f"{name}: {value}")
    ...


class SomeNewScraper(AsyncScraper):
    ...
    def get_site_specific_details(self) -> dict[str, str]:
        details = {}
        positive_reviews = self.content.find("...")
        details["positive_reviews"] = positive_reviews
        ...
        return details
Then AsyncScraper can focus on the minimum set of attributes required across all site scrapers.
Note: Python does have an Abstract Base Classes library (abc), but I'm not familiar with it. My example probably isn't using the best syntax, but conceptually I think it gets the point across.
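For what it's worth, here is a minimal sketch of how the abc module could enforce this (assuming Python 3.9+ for the dict[str, str] annotation):
from abc import ABC, abstractmethod


class AsyncScraper(ABC):
    @abstractmethod
    def get_site_specific_details(self) -> dict[str, str]:
        """Concrete scrapers must return their site-specific data points."""


class SomeNewScraper(AsyncScraper):
    def get_site_specific_details(self) -> dict[str, str]:
        return {"positive_reviews": "123"}


# AsyncScraper() now raises TypeError, as does instantiating any
# subclass that forgets to implement get_site_specific_details.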
- Thanks for taking some time to give it some thought, this looks like a step in the right direction - will definitely play around with it. — anddt, Mar 19, 2021
- Post what you come up with... I like to learn new approaches. — dstricks, Mar 19, 2021
- I've implemented your approach and I believe it serves the purpose well. Obviously it forces you to be careful with helper methods to avoid ending up with a too-long get_site_specific_details (but again, it is minor). — anddt, Mar 21, 2021