Asynchronous web scraping

This is my solution to a "vacancy test" task.

I'm not sure at all if I have correctly implemented the task, but here is my solution.

Goals of the code:

  1. Parse the rows of a table from a URL and extract some data (build a list of JSON objects from it; see the example record below).
  2. Parse the rows of a table from a URL and download the PDF (the link is inside each row).
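For illustration, one element of that list of JSON objects ends up looking roughly like this (the field names match the code below; the values are invented):

{
    "form_number": "Form W-2",
    "form_title": "Wage and Tax Statement",
    "min_year": 2018,
    "max_year": 2020
}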
My first version of the code, before refactoring, was incorrect:

import asyncio
import aiohttp
from sys import exit as sys_exit
from bs4 import BeautifulSoup
import timeit


def is_item_exist(dict_key: str, array: list):
    """
    The only way I found to check if dict key is in a list
    """
    for i, dict_ in enumerate(array):
        if dict_['form_number'] == dict_key:
            return i


async def load_pages():
    shift = 0  # To iterate over pages (pagination)
    pages = []
    url = (f'https://apps.irs.gov/app/picklist/list/priorFormPublication.html?'
           f'indexOfFirstRow={shift}&sortColumn=sortOrder&value=&criteria=&resultsPerPage=200&isDescending=false')
    async with aiohttp.ClientSession() as session:
        for shift in range(10):
            async with session.get(url=url) as resp:
                if 200 <= resp.status <= 299:
                    await resp.text()
                    pages.append(resp.text())
                    shift += 200  # 200 rows per page
                else:
                    print('Cant access a page')
                    sys_exit(1)
    return pages


def parse_pages(pages):
    pages_forms = []  # Result
    for page in pages:
        soup = BeautifulSoup(page.content, 'html.parser')

        table = soup.find(name='table', attrs={'class': 'picklist-dataTable'})
        rows = table.find_all(name='tr')[1:]
        page_forms = []
        for row in rows:
            form_number = row.findChild(['LeftCellSpacer', 'a']).string.split(' (', 1)[0]  # Remove non eng versions of forms
            current_year = int(row.find(name='td', attrs={'class': 'EndCellSpacer'}).string.strip())
            index = is_item_exist(dict_key=form_number, array=page_forms)
            if index is not None:
                if page_forms[index]['min_year'] > current_year:
                    page_forms[index]['min_year'] = current_year
                    # print('changed')
                elif page_forms[index]['max_year'] < current_year:
                    page_forms[index]['max_year'] = current_year
                    # print('changed')
            else:
                form_title = row.find(name='td', attrs={'class': 'MiddleCellSpacer'}).string.strip()
                page_forms.append({
                    'form_number': form_number,
                    'form_title': form_title,
                    'min_year': current_year,
                    'max_year': current_year,
                })
        pages_forms.append(page_forms)
    return pages_forms


def main():
    return parse_pages(load_pages())


if __name__ == '__main__':
    start = timeit.default_timer()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    print(timeit.default_timer() - start)

Error:

line 36, in parse_pages
 for page in pages:
TypeError: 'coroutine' object is not iterable
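For context, a minimal reproduction of that TypeError (not my real code, just the shape of the problem): iterating over a coroutine object that was never awaited.

async def load_pages():
    return ['page1', 'page2']

def parse_pages(pages):
    for page in pages:  # pages is a coroutine object here, not a list
        print(page)

parse_pages(load_pages())  # TypeError: 'coroutine' object is not iterable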
After refactoring, the current version of the code is:
import asyncio
import aiofiles
from aiohttp import ClientSession as aiohttp_ClientSession
from bs4 import BeautifulSoup
from pathlib import Path as pathlib_Path
from json import dumps as json_dumps
from loguru import logger
from sys import platform as sys_platform

# # # HARDCODED # # #
MIN_YEAR = 2018
MAX_YEAR = 2020


def is_key_in_dict(dict_key: str, array: list):
    for i, dict_ in enumerate(array):
        if dict_['form_number'] == dict_key:
            return i


async def load_page(url: str, session, ):
    async with session.get(url=url, ) as resp:
        if 200 <= resp.status <= 299:
            return await resp.read()
        else:
            logger.error(f'Can not access a page, returned status code: {resp.status}')


async def parse_page_forms_json(rows: list) -> list[str]:  # Every page contains a set of forms
    page_forms = []
    for row in rows:
        try:
            form_number = row.findChild(['LeftCellSpacer', 'a']).string.split(' (', 1)[0]  # split to remove non eng ver
            current_year = int(row.find(name='td', attrs={'class': 'EndCellSpacer'}).string.strip())
            index = is_key_in_dict(dict_key=form_number, array=page_forms)
            if index is None:
                page_forms.append({
                    'form_number': form_number,  # Title in reality
                    'form_title': row.find(name='td', attrs={'class': 'MiddleCellSpacer'}).string.strip(),
                    'min_year': current_year,
                    'max_year': current_year,
                })
            else:  # If exists - modify form
                if page_forms[index]['min_year'] > current_year:
                    page_forms[index]['min_year'] = current_year
                elif page_forms[index]['max_year'] < current_year:
                    page_forms[index]['max_year'] = current_year
        except Exception as e:
            logger.error(f'Error: {e}')
    return [json_dumps(page_form) for page_form in page_forms]  # What to do with this data?


async def save_page_form_pdf(rows, session):
    for row in rows:
        try:
            current_year = int(row.find(name='td', attrs={'class': 'EndCellSpacer'}).string.strip())
            form_number_elem = row.findChild(['LeftCellSpacer', 'a'])  # link and name
            form_number = form_number_elem.string.split(' (', 1)[0]  # split to remove non eng ver
            if MIN_YEAR <= current_year <= MAX_YEAR:
                resp = await load_page(url=form_number_elem.attrs['href'], session=session)
                # See https://docs.python.org/3/library/pathlib.html#pathlib.Path.mkdir
                pathlib_Path(form_number).mkdir(parents=True, exist_ok=True)  # exist_ok - skip FileExistsError
                filename = f"{form_number}/{form_number}_{current_year}.pdf"
                async with aiofiles.open(file=filename, mode='wb') as f:
                    await f.write(resp)
        except Exception as e:
            logger.error(f'Can not save file, error: {e}')


async def main():
    async with aiohttp_ClientSession() as session:
        tasks = []
        pagination = 0
        while 1:
            url = (f'https://apps.irs.gov/app/picklist/list/priorFormPublication.html?'
                   f'indexOfFirstRow={pagination}&'
                   f'sortColumn=sortOrder&'
                   f'value=&criteria=&'
                   f'resultsPerPage=200&'
                   f'isDescending=false')
            page = await load_page(url=url, session=session)
            soup = BeautifulSoup(page, 'html.parser')
            table = soup.find(name='table', attrs={'class': 'picklist-dataTable'})  # Target
            rows = table.find_all(name='tr')[1:]  # [1:] - Just wrong HTML
            if rows:
                tasks.append(await parse_page_forms_json(rows=rows))  # Task 1
                tasks.append(await save_page_form_pdf(rows=rows, session=session))  # Task 2
                pagination += 200
            else:  # Stop pagination
                break
        await asyncio.gather(*tasks)
        pass


if __name__ == '__main__':
    # See https://github.com/encode/httpx/issues/914#issuecomment-622586610 (exit code is 0, but an error exists)
    if sys_platform.startswith('win'):
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    logger.add('errors.txt', level='ERROR', rotation="30 days", backtrace=True)
    asyncio.run(main())

The logic of the code:

  1. Asynchronously get a page with table rows (tax forms).
  2. Do some actions with the forms (blocking and non-blocking).
  3. Repeat.

P.S. You can skip the code of all functions except main, load_page and save_page_form_pdf (they don't matter here).
My questions:

  1. Have I picked a good architecture? I'm especially curious about the self-implemented "if item not in list" Python pattern, because that pattern works well for a simple object inside a list, but I have a list of dicts and I check by a dict key. After some experiments, I didn't find a better implementation than the one presented here (see the first sketch after this list).
  2. Is an asynchronous solution good enough? It's my first experience with asynchronous web scraping.
  3. Maybe I need to optimize the check whether the form already exists in the list of forms, and not iterate over the entire list every time, but apply some sorting algorithm or something else?
  4. Do I need to add the await keyword in the tasks.append(await save_page_form_pdf(rows=rows, session=session)) line? (See the second sketch after this list.)
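To make questions 1 and 3 concrete, here is the membership-check pattern I mean, reduced to a minimal standalone sketch; the data is made up, but the function is the same is_key_in_dict used above:

def is_key_in_dict(dict_key: str, array: list):
    # Linear scan: return the index of the dict whose 'form_number' matches, else None
    for i, dict_ in enumerate(array):
        if dict_['form_number'] == dict_key:
            return i

forms = [
    {'form_number': 'Form 1040', 'min_year': 2018, 'max_year': 2020},
    {'form_number': 'Form W-2', 'min_year': 2019, 'max_year': 2020},
]

print(is_key_in_dict('Form W-2', forms))   # 1 -> already in the list, update its years
print(is_key_in_dict('Form 941', forms))   # None -> append a new record

And this is the distinction question 4 asks about, as a toy example with a dummy coroutine (not the real save_page_form_pdf):

import asyncio

async def work(n):
    await asyncio.sleep(0)
    return n

async def main():
    # As in my code now: await immediately, so the list holds plain results
    sequential = [await work(1), await work(2)]

    # Without await: the list holds coroutine objects, and gather runs them
    concurrent = await asyncio.gather(work(3), work(4))

    print(sequential, concurrent)  # [1, 2] [3, 4]

asyncio.run(main())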