To familiarize myself with async requests, I wrote a very simple scraper that relies on aiohttp to retrieve some basic information (product name and availability status) from the product pages of an Italian e-commerce retailer.
The code is organized according to the following structure:
stores.py
The stores module contains a prototype AsyncScraper class that holds all the request-related logic: it builds the coroutine task list (one coroutine per product to be scraped) and provides a method to dispatch each request and extract the target information. Since every website has a different DOM, each e-commerce site gets its own subclass implementing site-specific extraction methods.
import asyncio

from aiohttp import ClientSession
from bs4 import BeautifulSoup

import const


class AsyncScraper:
    """
    A base scraper class to interact with a website.
    """

    def __init__(self):
        self.product_ids = None
        self.base_url = None
        self.content = None

    # Placeholder methods, overridden by website-specific subclasses
    def get_product_title(self):
        pass

    def get_product_availability(self):
        pass

    async def _get_tasks(self):
        # Build one coroutine per product, each capped at a 20-second timeout
        tasks = []
        async with ClientSession() as s:
            for product in self.product_ids:
                tasks.append(asyncio.wait_for(self._scrape_elem(product, s), 20))
            print(tasks)  # debug: inspect the scheduled tasks
            return await asyncio.gather(*tasks)

    async def _scrape_elem(self, product, session):
        async with session.get(
            self._build_url(product), raise_for_status=True
        ) as res:
            if res.status != 200:
                print(f"something went wrong: {res.status}")
            page_content = await res.text()
            self.content = BeautifulSoup(page_content, "html.parser")
            # Extract product attributes
            title = self.get_product_title()
            availability = self.get_product_availability()
            # Check if stuff is actually working
            print(f"{title} - {availability}")

    def scrape_stuff(self):
        loop = asyncio.get_event_loop()
        loop.run_until_complete(self._get_tasks())

    def _build_url(self, product_id):
        return f"{self.base_url}{product_id}"
class EuronicsScraper(AsyncScraper):
    """
    Class implementing extraction logic for euronics.it
    """

    base_url = "https://www.euronics.it/"

    def __init__(self):
        self.product_ids = const.euronics_prods

    def get_product_title(self):
        title = self.content.find(
            "h1", {"class": "productDetails__name"}
        ).text.strip()
        return title

    def get_product_availability(self):
        avail_kw = ["prenota", "aggiungi"]
        availability = self.content.find(
            "span", {"class": "button__title--iconTxt"}
        ).text.strip()
        # Availability is inferred from the button text
        if any(word in availability.lower() for word in avail_kw):
            availability = "Disponibile"
        else:
            availability = "Non disponibile"
        return availability
const.py
The target products to be scraped are stored in a const module. This is as simple as declaring a tuple of product IDs.
# Product IDs to be scraped
euronics_prods = (
    "obiettivi-zoom/nikon/50mm-f12-nikkor/eProd162017152/",
    "tostapane-tostiere/ariete/155/eProd172015168/",
)
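Combined with the base_url defined on EuronicsScraper, _build_url turns the first entry into https://www.euronics.it/obiettivi-zoom/nikon/50mm-f12-nikkor/eProd162017152/.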
runner.py
The script is ultimately run by iterating over a list of scrapers and invoking their scrape_stuff method, inherited from the AsyncScraper parent class.
"""
This is just a helper used as a script runner
"""
from stores import EuronicsScraper
def main():
scrapers = [EuronicsScraper()]
for scraper in scrapers:
scraper.scrape_stuff()
if __name__ == "__main__":
main()
Questions
I am mainly interested in whether I've overlooked anything major that might make this piece of code hard to rework or debug in the future. While I was writing it, it made complete sense to me because:
- Implementing a new scraper is just a matter of subclassing AsyncScraper and implementing its extraction methods.
- All request-related logic is in one place. It might be necessary to override these methods for classes dealing with websites that need some JS interaction (probably using a headless browser via selenium; a minimal sketch follows this list), but I feel that's well beyond the scope of this review.
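For illustration, here is a rough sketch of what such an override might look like (the JsHeavyScraper name and the _fetch_with_browser helper are hypothetical, and this assumes selenium with a Chrome driver is installed):
import asyncio

from bs4 import BeautifulSoup
from selenium import webdriver


class JsHeavyScraper(AsyncScraper):
    """Hypothetical subclass for a site that renders its content with JS."""

    async def _scrape_elem(self, product, session):
        # Selenium's API is blocking, so run it in a thread executor
        # to avoid stalling the event loop.
        loop = asyncio.get_event_loop()
        page_content = await loop.run_in_executor(
            None, self._fetch_with_browser, self._build_url(product)
        )
        self.content = BeautifulSoup(page_content, "html.parser")
        print(f"{self.get_product_title()} - {self.get_product_availability()}")

    def _fetch_with_browser(self, url):
        driver = webdriver.Chrome()
        try:
            driver.get(url)
            return driver.page_source  # HTML after JS has executed
        finally:
            driver.quit()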
One thing I am not too fond of (I probably need to dive deeper into inheritance) is the use of placeholder methods in AsyncScraper, as it forces me to implement n dummy methods (where n is the number of website-specific methods found in the other classes). I feel this is a bit of a hack and kind of defeats the purpose of class inheritance.
Any advice is more than welcome.
1 Answer
One thing I am not too fond of (probably need to dive deeper into inheritance) is the use of placeholder methods in AsyncScraper as it'll force me to implement n dummy methods (where n is the number of website-specific methods that can be found in the other classes). I feel this is a bit of a hack and kind of defeats the purpose of class inheritance.
Instead of additional placeholder methods in AsyncScraper, you could use a single abstract method that returns a dict of additional site-specific data. Then concrete classes would override that single abstract method to supply the n additional data points. Something like:
stores.py
class AsyncScraper:
    ...
    def get_site_specific_details(self) -> dict[str, str]:
        raise NotImplementedError()  # or pass if this is optional
    ...

    async def _scrape_elem(self, product, session):
        ...
        # Extract product attributes
        title = self.get_product_title()
        availability = self.get_product_availability()
        additional_details = self.get_site_specific_details()
        # Check if stuff is actually working
        print(f"{title} - {availability}")
        print("Additional details: ")
        for name, value in additional_details.items():
            print(f"{name}: {value}")
    ...


class SomeNewScraper(AsyncScraper):
    ...
    def get_site_specific_details(self) -> dict[str, str]:
        details = {}
        positive_reviews = self.content.find("...")
        details["positive_reviews"] = positive_reviews
        ...
        return details
Then AsyncScraper can focus on the minimum set of attributes required across all site scrapers.
Note: Python does have an Abstract Base Classes library (abc), but I'm not familiar with it. My example probably isn't using the best syntax, but conceptually I think it gets the point across.
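For what it's worth, here is a minimal sketch of how the abc module could enforce this (assuming Python 3.9+ for the dict[str, str] annotation):
from abc import ABC, abstractmethod


class AsyncScraper(ABC):
    @abstractmethod
    def get_site_specific_details(self) -> dict[str, str]:
        """Concrete scrapers must return their site-specific data points."""


class SomeNewScraper(AsyncScraper):
    def get_site_specific_details(self) -> dict[str, str]:
        return {"positive_reviews": "123"}


# AsyncScraper() now raises TypeError, as does instantiating any
# subclass that forgets to implement get_site_specific_details.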
- Thanks for taking some time to give it some thought, this looks like a step in the right direction - will definitely play around with it. — anddt, Mar 19, 2021
- Post what you come up with... I like to learn new approaches. — dstricks, Mar 19, 2021
- I've implemented your approach and I believe it serves the purpose well. Obviously it forces you to be careful with helper methods to avoid ending up with a too-long get_site_specific_details (but again, it is minor). — anddt, Mar 21, 2021