The following code is a beginner's exploration of asynchronous file downloads, attempting to improve the download time of files from a specific website.
Tasks:
1. Extract the latest publication link. It is the top link under Latest Statistics and is the first match returned by the CSS selector .cta__button.
N.B. This link updates each month, so for the next monthly publication (e.g. 8 Apr 2021) it will point to the one associated with Mental Health Services Monthly Statistics Performance January, Provisional February 2021.
2. Visit the extracted link: https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics/performance-december-2020-provisional-january-2021 and, from there, extract the download links for all the files listed under Resources; currently 17 files.
3. Finally, download all those files, using the retrieved URLs, and save them to the location specified by the folder variable.
Set-up:
Python 3.9.0 (64-bit), Windows 10
Request:
I would appreciate any suggested improvements to this code. For example, should I have refactored the coroutine fetch_download_links into two coroutines, each with its own ClientSession, where the first coroutine gets the initial link to the resources page and the second retrieves the actual resource links?
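For illustration, here is a rough, untested sketch of the split I have in mind; the names get_publication_link and get_resource_links are just placeholders, and it uses the same imports as the full script below:
async def get_publication_link(url: str) -> str:
    # Hypothetical first coroutine: find the latest publication page under Latest Statistics.
    async with aiohttp.ClientSession() as session:
        r = await session.get(url, ssl=False)
        soup = bs(await r.text(), 'lxml')
        return 'https://digital.nhs.uk' + soup.select_one('.cta__button')['href']

async def get_resource_links(url: str) -> list:
    # Hypothetical second coroutine: collect the download links listed under Resources.
    async with aiohttp.ClientSession() as session:
        r = await session.get(url, ssl=False)
        soup = bs(await r.text(), 'lxml')
        return [i['href'] for i in soup.select('.attachment a')]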
mhsmsAsynDownloads.py
import time
import os
from bs4 import BeautifulSoup as bs
import aiohttp
import aiofiles
import asyncio
import urllib.parse
async def fetch_download_links(url: str) -> list:
    async with aiohttp.ClientSession() as session:
        # Get the landing page and pull out the latest publication link under Latest Statistics.
        r = await session.get(url, ssl=False)
        html = await r.text()
        soup = bs(html, 'lxml')
        link = 'https://digital.nhs.uk' + soup.select_one('.cta__button')['href']
        # Follow that link and collect the download links listed under Resources.
        r = await session.get(link, ssl=False)
        html = await r.text()
        soup = bs(html, 'lxml')
        files = [i['href'] for i in soup.select('.attachment a')]
        return files

async def place_file(source: str) -> None:
    async with aiohttp.ClientSession() as session:
        # Derive the file name from the URL and decode any percent-encoding.
        file_name = source.split('/')[-1]
        file_name = urllib.parse.unquote(file_name)
        r = await session.get(source, ssl=False)
        content = await r.read()
        async with aiofiles.open(folder + file_name, 'wb') as f:
            await f.write(content)

async def main():
    tasks = []
    urls = await fetch_download_links('https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics')
    for url in urls:
        tasks.append(place_file(url))
    await asyncio.gather(*tasks)

folder = 'C:/Users/<User>/OneDrive/Desktop/testing/'

if __name__ == '__main__':
    t1 = time.perf_counter()
    print("process started...")
    asyncio.get_event_loop().run_until_complete(main())
    os.startfile(folder[:-1])
    t2 = time.perf_counter()
    print(f'Completed in {t2-t1} seconds.')
References/Notes:
- I wrote this after watching an async tutorial on YouTube by Andrei Dumitrescu; the example above is my own.
- https://docs.aiohttp.org/en/v0.20.0/client.html
- https://docs.aiohttp.org/en/stable/client_advanced.html
- Data is publicly available
1 Answer
Looks fine to me; you could potentially shave off some repetition, but it's not like it matters that much for a small script.
That said, I'd reuse the session, inline a few variables, fix the naming (constants should be uppercase) and use some standard library tools like os.path to make the script more error-proof. The list return type could also be more concrete and mention that it's a list of strings, for example.
Also, and this is a bit more important, I'd rather not have the script slurp in the whole HTTP response like that! Who knows how big the files are, and it's wasteful ... better to stream the response directly to disk. Thankfully the APIs exist: using .content.iter_any() (or .content.iter_chunked() if you prefer a fixed chunk size) you can iterate over each block of data as it is read and write it to the file like so:
import time
import os
import os.path
from bs4 import BeautifulSoup as bs
import aiohttp
import aiofiles
import asyncio
import urllib.parse
import typing
STATISTICS_URL = 'https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics'
OUTPUT_FOLDER = 'C:/Users/<User>/OneDrive/Desktop/testing/'
async def fetch_download_links(session: aiohttp.ClientSession, url: str) -> typing.List[str]:
    # Locate the latest publication page, then collect the resource download links from it.
    r = await session.get(url, ssl=False)
    soup = bs(await r.text(), 'lxml')
    link = 'https://digital.nhs.uk' + soup.select_one('.cta__button')['href']
    r = await session.get(link, ssl=False)
    soup = bs(await r.text(), 'lxml')
    return [i['href'] for i in soup.select('.attachment a')]

async def place_file(session: aiohttp.ClientSession, source: str) -> None:
    r = await session.get(source, ssl=False)
    file_name = urllib.parse.unquote(source.split('/')[-1])
    async with aiofiles.open(os.path.join(OUTPUT_FOLDER, file_name), 'wb') as f:
        # Stream the response to disk block by block instead of reading it all into memory.
        async for data in r.content.iter_any():
            await f.write(data)

async def main():
    async with aiohttp.ClientSession() as session:
        urls = await fetch_download_links(session, STATISTICS_URL)
        await asyncio.gather(*[place_file(session, url) for url in urls])

if __name__ == '__main__':
    t1 = time.perf_counter()
    print('process started...')
    asyncio.get_event_loop().run_until_complete(main())
    os.startfile(OUTPUT_FOLDER[:-1])
    t2 = time.perf_counter()
    print(f'Completed in {t2-t1} seconds.')
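If you'd rather have a fixed chunk size, only place_file needs to change. A minimal sketch, assuming the same imports and constants as above (the 64 KiB chunk size is just an arbitrary example):
async def place_file(session: aiohttp.ClientSession, source: str) -> None:
    r = await session.get(source, ssl=False)
    file_name = urllib.parse.unquote(source.split('/')[-1])
    async with aiofiles.open(os.path.join(OUTPUT_FOLDER, file_name), 'wb') as f:
        # iter_chunked yields blocks of at most the given size as the body is read.
        async for data in r.content.iter_chunked(64 * 1024):
            await f.write(data)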