
The following code is a beginner's exploration of asynchronous file downloads, written to try to improve the download time of files from a specific website.


Tasks:

The tasks are as follows:

  1. Visit https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics

  2. Extract the latest publication link. It is the top link under Latest Statistics and is the first match returned by the CSS selector .cta__button (see the selector sketch after this list).

N.B. This link updates each month, so for the next monthly publication (e.g. 8 Apr 2021) the link will change to the one associated with Mental Health Services Monthly Statistics, Performance January, Provisional February 2021.

  3. Visit the extracted link: https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics/performance-december-2020-provisional-january-2021 and, from there, extract the download links for all the files listed under Resources (currently 17 files).

  4. Finally, download all of those files, using the retrieved URLs, and save them to the location specified by the folder variable.
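
As a sanity check for steps 2 and 3, the two selectors can be verified synchronously before any async code is written. This is only a sketch, assuming requests and lxml are installed; it uses the same start URL and selectors as the async script further down.

import requests
from bs4 import BeautifulSoup

BASE = 'https://digital.nhs.uk'
START = BASE + '/data-and-information/publications/statistical/mental-health-services-monthly-statistics'

# Step 2: the first .cta__button match is the latest publication link.
soup = BeautifulSoup(requests.get(START).text, 'lxml')
publication_url = BASE + soup.select_one('.cta__button')['href']

# Step 3: every anchor inside an .attachment block is a resource download link.
soup = BeautifulSoup(requests.get(publication_url).text, 'lxml')
file_urls = [a['href'] for a in soup.select('.attachment a')]
print(publication_url, len(file_urls))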


Set-up:

Python 3.9.0 (64-bit), Windows 10


Request:

I would appreciate any suggested improvements to this code. For example, should I have refactored the coroutine fetch_download_links into two coroutines, each with its own ClientSession, where one coroutine gets the initial link to where the resources live, and the second coroutine retrieves the actual resource links? A rough sketch of that split follows.
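
For illustration only, this is one shape that split could take (a sketch, not tested; the names fetch_publication_link and fetch_resource_links are made up, and here the two coroutines share a single session passed in rather than each opening their own):

async def fetch_publication_link(session: aiohttp.ClientSession, url: str) -> str:
    # First hop: landing page -> latest publication link.
    r = await session.get(url, ssl=False)
    soup = bs(await r.text(), 'lxml')
    return 'https://digital.nhs.uk' + soup.select_one('.cta__button')['href']

async def fetch_resource_links(session: aiohttp.ClientSession, url: str) -> list:
    # Second hop: publication page -> all Resources download links.
    r = await session.get(url, ssl=False)
    soup = bs(await r.text(), 'lxml')
    return [a['href'] for a in soup.select('.attachment a')]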


mhsmsAsynDownloads.py

import time
import os
from bs4 import BeautifulSoup as bs
import aiohttp
import aiofiles
import asyncio
import urllib.parse


async def fetch_download_links(url: str) -> list:
    async with aiohttp.ClientSession() as session:
        # Get the landing page and pull out the latest publication link.
        r = await session.get(url, ssl=False)
        html = await r.text()
        soup = bs(html, 'lxml')
        link = 'https://digital.nhs.uk' + soup.select_one('.cta__button')['href']

        # Follow that link and collect the download links listed under Resources.
        r = await session.get(link, ssl=False)
        html = await r.text()
        soup = bs(html, 'lxml')
        files = [i['href'] for i in soup.select('.attachment a')]
        return files


async def place_file(source: str) -> None:
    async with aiohttp.ClientSession() as session:
        file_name = source.split('/')[-1]
        file_name = urllib.parse.unquote(file_name)
        r = await session.get(source, ssl=False)
        content = await r.read()

        async with aiofiles.open(folder + file_name, 'wb') as f:
            await f.write(content)


async def main():
    tasks = []
    urls = await fetch_download_links('https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics')

    for url in urls:
        tasks.append(place_file(url))

    await asyncio.gather(*tasks)


folder = 'C:/Users/<User>/OneDrive/Desktop/testing/'

if __name__ == '__main__':
    t1 = time.perf_counter()
    print("process started...")
    asyncio.get_event_loop().run_until_complete(main())
    os.startfile(folder[:-1])
    t2 = time.perf_counter()
    print(f'Completed in {t2-t1} seconds.')

References/Notes:

  1. I wrote this after watching an async tutorial on YouTube by Andrei Dumitrescu. The example above is my own.
  2. https://docs.aiohttp.org/en/v0.20.0/client.html
  3. https://docs.aiohttp.org/en/stable/client_advanced.html
  4. Data is publicly available
asked Apr 5, 2021 at 6:09

1 Answer


Looks fine to me; you could potentially shave off some repetition, but that hardly matters for a small script.

That said, I'd reuse the session, inline a few variables, fix the naming (constants should be uppercase) and use some standard library tools like os.path to make the script more error-proof. The list return type could also be more concrete and say that it's a list of strings, for example.

Also, and that's a bit more important, I'd rather not have a script slurp in the whole HTTP response like that! Who knows how big the files are, plus it's wasteful ... better to stream the response directly to disk! Thankfully the APIs do exist: using .content.iter_any() (or .iter_chunked() if you prefer a fixed chunk size) you can iterate over each block of data as it is read and write it to the file, like so:

import time
import os
import os.path
from bs4 import BeautifulSoup as bs
import aiohttp
import aiofiles
import asyncio
import urllib.parse
import typing

STATISTICS_URL = 'https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics'
OUTPUT_FOLDER = 'C:/Users/<User>/OneDrive/Desktop/testing/'


async def fetch_download_links(session: aiohttp.ClientSession, url: str) -> typing.List[str]:
    r = await session.get(url, ssl=False)
    soup = bs(await r.text(), 'lxml')
    link = 'https://digital.nhs.uk' + soup.select_one('.cta__button')['href']
    r = await session.get(link, ssl=False)
    soup = bs(await r.text(), 'lxml')
    return [i['href'] for i in soup.select('.attachment a')]


async def place_file(session: aiohttp.ClientSession, source: str) -> None:
    r = await session.get(source, ssl=False)
    file_name = urllib.parse.unquote(source.split('/')[-1])
    async with aiofiles.open(os.path.join(OUTPUT_FOLDER, file_name), 'wb') as f:
        # Stream the response to disk block by block instead of reading it all into memory.
        async for data in r.content.iter_any():
            await f.write(data)


async def main():
    async with aiohttp.ClientSession() as session:
        urls = await fetch_download_links(session, STATISTICS_URL)
        await asyncio.gather(*[place_file(session, url) for url in urls])


if __name__ == '__main__':
    t1 = time.perf_counter()
    print('process started...')
    asyncio.get_event_loop().run_until_complete(main())
    os.startfile(OUTPUT_FOLDER[:-1])
    t2 = time.perf_counter()
    print(f'Completed in {t2-t1} seconds.')
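
If you prefer a fixed chunk size, place_file could instead look like this (a sketch only; the 64 KiB chunk size is an arbitrary choice, and the same imports and OUTPUT_FOLDER constant from above are assumed):

async def place_file(session: aiohttp.ClientSession, source: str) -> None:
    r = await session.get(source, ssl=False)
    file_name = urllib.parse.unquote(source.split('/')[-1])
    async with aiofiles.open(os.path.join(OUTPUT_FOLDER, file_name), 'wb') as f:
        # Same streaming idea, but reading fixed-size blocks instead of whatever arrives.
        async for data in r.content.iter_chunked(64 * 1024):
            await f.write(data)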
answered Apr 5, 2021 at 20:25