I've been working on speeding up my web scraping with the asyncio library. I have a working solution, but I'm unsure how Pythonic it is or whether I'm using the library properly. Any input would be appreciated.
import aiohttp
import asyncio
import requests
from lxml import etree


@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())


@asyncio.coroutine
def extract_text(url):
    """
    Given the url for a chapter, extract the relevant text from it

    :param url: the url for the chapter to scrape
    :return: a string containing the chapter's text
    """
    sem = asyncio.Semaphore(5)
    with (yield from sem):
        page = yield from get(url)
    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1: -1]
    return b'\n'.join(etree.tostring(paragraph) for paragraph in paragraphs)


def generate_links():
    """
    Generate the links to each of the chapters

    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = 'https://twigserial.wordpress.com/category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]


@asyncio.coroutine
def run():
    links = generate_links()
    chapters = []
    for f in asyncio.as_completed([extract_text(link) for link in links]):
        result = yield from f
        chapters.append(result)
    return chapters


def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run())
    print(len(chapters))


if __name__ == '__main__':
    main()
- For future readers attempting to run this from a Jupyter notebook: note that Jupyter's Tornado 5.0 update will result in RuntimeError: This event loop is already running. Unclosed client session when running this. Resolution: stackoverflow.com/questions/47518874/… – QHarr, Nov 21, 2018 at 10:41
1 Answer
Looks ... great? Not a lot to complain about really.
The semaphore doesn't do anything used like this, though; it should be passed in from the top to protect the get/aiohttp.request call. You can see that if you print something right before the HTTP request.
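To make that concrete, here's a minimal sketch (my own illustration, in the same pre-3.5 coroutine style as the post, with the HTTP request replaced by a sleep): the question's version creates a fresh Semaphore(5) inside every call, so the acquire never has to wait and every request starts at once, whereas a single shared semaphore, as below, actually throttles to five at a time.

import asyncio


@asyncio.coroutine
def fetch(url, sem):
    with (yield from sem):
        print('start', url)          # with one shared semaphore: at most 5 of these at a time
        yield from asyncio.sleep(1)  # stand-in for the HTTP request
    return url


sem = asyncio.Semaphore(5)           # created once, shared by every fetch()
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([fetch('url-%d' % i, sem) for i in range(20)]))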
Also, the results of asyncio.as_completed will be in random order, so be sure to sort the resulting chapters somehow, e.g. by returning both the URL and the collected text from extract_text.
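For instance, once extract_text returns (url, text) pairs as in the revised code below, a small helper (hypothetical, not part of the original answer) can restore the original chapter order afterwards:

def in_original_order(links, results):
    """Reorder (url, text) pairs from as_completed back into the order of `links`."""
    by_url = dict(results)
    return [by_url[link] for link in links]


# e.g. in main():
#     links = generate_links()
#     chapters = in_original_order(links, loop.run_until_complete(run(links)))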
A couple of small things as well:
- List comprehensions are okay, but with just a single argument it can be shorter and equally performant just to use map.
- The URL constants should ideally be defined on the top level; at least base_url can also be defined by concatenating with start_url. Alternatively they could be passed in to generate_links (see the sketch after this list). Then again, it's unlikely that another blog has the exact same layout?
- The manual append in run seems unnecessary; I'd rewrite it into a list of generators and use a list comprehension instead.
- At the moment generate_links is called from run; I think it makes more sense to call it from the main function: it doesn't need to run concurrently, and you could think of a situation where you'd pass in the result of a different function to be fetched and collected.
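For the URL-constants point, here is a sketch of the parameterised variant (an assumption about how one might restructure it, not code from the question): the start URL becomes a module-level constant and an optional argument, and base_url is derived from it.

import requests
from lxml import etree

START_URL = 'https://twigserial.wordpress.com/'   # module-level constant


def generate_links(start_url=START_URL):
    """Generate the chapter links, starting from a configurable blog URL."""
    base_url = start_url + 'category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]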
All in all, I'd maybe change things to the code below. Of course if you were to add things to it, I'd recommend looking into command line arguments and configuration files, ...
import aiohttp
import asyncio
import requests
from lxml import etree


@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())


@asyncio.coroutine
def extract_text(url, sem):
    """
    Given the url for a chapter, extract the relevant text from it

    :param url: the url for the chapter to scrape
    :return: a string containing the chapter's text
    """
    with (yield from sem):
        page = yield from get(url)
    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1:-1]
    return url, b'\n'.join(map(etree.tostring, paragraphs))


def generate_links():
    """
    Generate the links to each of the chapters

    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = start_url + 'category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]


@asyncio.coroutine
def run(links):
    sem = asyncio.Semaphore(5)
    fetchers = [extract_text(link, sem) for link in links]
    return [(yield from f) for f in asyncio.as_completed(fetchers)]


def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run(generate_links()))
    print(len(chapters))


if __name__ == '__main__':
    main()
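As a bare-bones illustration of the command-line-arguments suggestion above (the flag names here are hypothetical, not part of the review), the concurrency limit and start URL could be exposed via argparse:

import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='Scrape chapters from a WordPress serial.')
    parser.add_argument('--start-url', default='https://twigserial.wordpress.com/',
                        help='blog front page to discover chapter links from')
    parser.add_argument('--concurrency', type=int, default=5,
                        help='maximum number of simultaneous requests')
    return parser.parse_args()

main() would then build the semaphore from args.concurrency and pass args.start_url on to generate_links.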
- Shouldn't you run loop.close() at the end of main? – Derek Adair, May 3, 2016 at 0:53
- No idea, I expected the original code to be correct. Does it matter in main? Lastly, while I found some examples with close, there are some without ... so I'm not sure about it. Do you have a good reference for it? – ferada, May 3, 2016 at 8:07
- I'm very new to asyncio, and I'm building a crawler. The only reference I have is the asyncio docs which I linked. Also this. It clears the queue and "shuts down the executor" ... whatever that means, I was hoping you may know! Ha! – Derek Adair, May 3, 2016 at 14:54
- Maybe only useful when you run loop.run_forever() instead of loop.run_until_complete(). – Derek Adair, May 3, 2016 at 15:01
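For what it's worth, one common pattern for the loop.close() question (an assumption on my part, not something from the answer itself) is to close the loop in a finally block once run_until_complete has returned, e.g. as a tweak to the main() above:

def main():
    loop = asyncio.get_event_loop()
    try:
        chapters = loop.run_until_complete(run(generate_links()))
        print(len(chapters))
    finally:
        loop.close()  # release the loop's resources once we're done with it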