The following code is a beginner's exploration of asynchronous file downloads, attempting to improve the download time of files from a specific website.
Tasks:
1. Extract the latest publication link. It is the top link under Latest Statistics and is the first match returned by the CSS selector .cta__button.
N.B. This link updates each month, so for the next monthly publication (e.g. 8 Apr 2021) it will point to the one associated with Mental Health Services Monthly Statistics Performance January, Provisional February 2021.
2. Visit the extracted link: https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics/performance-december-2020-provisional-january-2021 and, from there, extract the download links for all the files listed under Resources; currently 17 files.
3. Finally, download all those files, using the retrieved URLs, and save them to the location specified by the folder variable.
Set-up:
Python 3.9.0 (64-bit), Windows 10
Request:
I would appreciate any suggested improvements to this code. For example, should I have refactored the coroutine fetch_download_links into two coroutines, each with its own ClientSession, where the first coroutine gets the initial link to the resources page and the second retrieves the actual resource links?
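For illustration, here is a rough, untested sketch of the split I have in mind; the names get_publication_link and get_resource_links are just placeholders, and it uses the same imports as the full script below:
async def get_publication_link(url: str) -> str:
    # Hypothetical first coroutine: find the latest publication page under Latest Statistics.
    async with aiohttp.ClientSession() as session:
        r = await session.get(url, ssl=False)
        soup = bs(await r.text(), 'lxml')
        return 'https://digital.nhs.uk' + soup.select_one('.cta__button')['href']

async def get_resource_links(url: str) -> list:
    # Hypothetical second coroutine: collect the download links listed under Resources.
    async with aiohttp.ClientSession() as session:
        r = await session.get(url, ssl=False)
        soup = bs(await r.text(), 'lxml')
        return [i['href'] for i in soup.select('.attachment a')]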
mhsmsAsynDownloads.py
import time
import os
from bs4 import BeautifulSoup as bs
import aiohttp
import aiofiles
import asyncio
import urllib.parse
async def fetch_download_links(url: str) -> list:
    async with aiohttp.ClientSession() as session:
        # Get the landing page and pull out the latest publication link under Latest Statistics.
        r = await session.get(url, ssl=False)
        html = await r.text()
        soup = bs(html, 'lxml')
        link = 'https://digital.nhs.uk' + soup.select_one('.cta__button')['href']
        # Follow that link and collect the download links listed under Resources.
        r = await session.get(link, ssl=False)
        html = await r.text()
        soup = bs(html, 'lxml')
        files = [i['href'] for i in soup.select('.attachment a')]
        return files

async def place_file(source: str) -> None:
    async with aiohttp.ClientSession() as session:
        # Derive the file name from the URL and decode any percent-encoding.
        file_name = source.split('/')[-1]
        file_name = urllib.parse.unquote(file_name)
        r = await session.get(source, ssl=False)
        content = await r.read()
        async with aiofiles.open(folder + file_name, 'wb') as f:
            await f.write(content)

async def main():
    tasks = []
    urls = await fetch_download_links('https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics')
    for url in urls:
        tasks.append(place_file(url))
    await asyncio.gather(*tasks)

folder = 'C:/Users/<User>/OneDrive/Desktop/testing/'

if __name__ == '__main__':
    t1 = time.perf_counter()
    print("process started...")
    asyncio.get_event_loop().run_until_complete(main())
    os.startfile(folder[:-1])
    t2 = time.perf_counter()
    print(f'Completed in {t2-t1} seconds.')
References/Notes:
- I wrote this after watching an async tutorial on YouTube by Andrei Dumitrescu; the example above is my own.
- https://docs.aiohttp.org/en/v0.20.0/client.html
- https://docs.aiohttp.org/en/stable/client_advanced.html
- Data is publicly available
1 Answer
Looks fine to me; you could potentially shave off some repetition, but it's not like it matters that much for a small script.
That said, I'd reuse the session, inline a few variables, fix the naming (constants should be uppercase) and use some standard library tools like os.path to make the script more error-proof. The list return type could also be more concrete and mention that it's a list of strings, for example.
Also, and this is a bit more important, I'd rather not have the script slurp in the whole HTTP response like that! Who knows how big the files are, and it's wasteful ... better to stream the response directly to disk. Thankfully the APIs exist: using .content.iter_any() (or .content.iter_chunked() if you prefer a fixed chunk size) you can iterate over each block of data as it is read and write it to the file like so:
import time
import os
import os.path
from bs4 import BeautifulSoup as bs
import aiohttp
import aiofiles
import asyncio
import urllib.parse
import typing
STATISTICS_URL = 'https://digital.nhs.uk/data-and-information/publications/statistical/mental-health-services-monthly-statistics'
OUTPUT_FOLDER = 'C:/Users/<User>/OneDrive/Desktop/testing/'
async def fetch_download_links(session: aiohttp.ClientSession, url: str) -> typing.List[str]:
    # Locate the latest publication page, then collect the resource download links from it.
    r = await session.get(url, ssl=False)
    soup = bs(await r.text(), 'lxml')
    link = 'https://digital.nhs.uk' + soup.select_one('.cta__button')['href']
    r = await session.get(link, ssl=False)
    soup = bs(await r.text(), 'lxml')
    return [i['href'] for i in soup.select('.attachment a')]

async def place_file(session: aiohttp.ClientSession, source: str) -> None:
    r = await session.get(source, ssl=False)
    file_name = urllib.parse.unquote(source.split('/')[-1])
    async with aiofiles.open(os.path.join(OUTPUT_FOLDER, file_name), 'wb') as f:
        # Stream the response to disk block by block instead of reading it all into memory.
        async for data in r.content.iter_any():
            await f.write(data)

async def main():
    async with aiohttp.ClientSession() as session:
        urls = await fetch_download_links(session, STATISTICS_URL)
        await asyncio.gather(*[place_file(session, url) for url in urls])

if __name__ == '__main__':
    t1 = time.perf_counter()
    print('process started...')
    asyncio.get_event_loop().run_until_complete(main())
    os.startfile(OUTPUT_FOLDER[:-1])
    t2 = time.perf_counter()
    print(f'Completed in {t2-t1} seconds.')
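If you'd rather have a fixed chunk size, only place_file needs to change. A minimal sketch, assuming the same imports and constants as above (the 64 KiB chunk size is just an arbitrary example):
async def place_file(session: aiohttp.ClientSession, source: str) -> None:
    r = await session.get(source, ssl=False)
    file_name = urllib.parse.unquote(source.split('/')[-1])
    async with aiofiles.open(os.path.join(OUTPUT_FOLDER, file_name), 'wb') as f:
        # iter_chunked yields blocks of at most the given size as the body is read.
        async for data in r.content.iter_chunked(64 * 1024):
            await f.write(data)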