I've been working on speeding up my web scraping with the asyncio library. I have a working solution, but I'm unsure how Pythonic it is or whether I'm using the library properly. Any input would be appreciated.
import aiohttp
import asyncio
import requests
from lxml import etree


@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())


@asyncio.coroutine
def extract_text(url):
    """
    Given the url for a chapter, extract the relevant text from it

    :param url: the url for the chapter to scrape
    :return: a string containing the chapter's text
    """
    sem = asyncio.Semaphore(5)
    with (yield from sem):
        page = yield from get(url)
    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1: -1]
    return b'\n'.join(etree.tostring(paragraph) for paragraph in paragraphs)


def generate_links():
    """
    Generate the links to each of the chapters

    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = 'https://twigserial.wordpress.com/category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]


@asyncio.coroutine
def run():
    links = generate_links()
    chapters = []
    for f in asyncio.as_completed([extract_text(link) for link in links]):
        result = yield from f
        chapters.append(result)
    return chapters


def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run())
    print(len(chapters))


if __name__ == '__main__':
    main()
- For future readers attempting to run this from a Jupyter notebook: note that Jupyter's Tornado 5.0 update will result in RuntimeError: This event loop is already running. Unclosed client session when running this. Resolution: stackoverflow.com/questions/47518874/… – QHarr, Nov 21, 2018 at 10:41
1 Answer
Looks ... great? Not a lot to complain about really.
The semaphore doesn't do anything used like this, though; it should be passed in from the top to protect the get/aiohttp.request call. You can see that if you print something right before the HTTP request.
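To make that concrete, here's a minimal sketch (my own illustration, in the same pre-3.5 coroutine style as the post, with the HTTP request replaced by a sleep): the question's version creates a fresh Semaphore(5) inside every call, so the acquire never has to wait and every request starts at once, whereas a single shared semaphore, as below, actually throttles to five at a time.

import asyncio


@asyncio.coroutine
def fetch(url, sem):
    with (yield from sem):
        print('start', url)          # with one shared semaphore: at most 5 of these at a time
        yield from asyncio.sleep(1)  # stand-in for the HTTP request
    return url


sem = asyncio.Semaphore(5)           # created once, shared by every fetch()
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([fetch('url-%d' % i, sem) for i in range(20)]))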
Also, the results of asyncio.as_completed will be in random order, so be sure to sort the resulting chapters somehow, e.g. by returning both the URL and the collected text from extract_text.
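For instance, once extract_text returns (url, text) pairs as in the revised code below, a small helper (hypothetical, not part of the original answer) can restore the original chapter order afterwards:

def in_original_order(links, results):
    """Reorder (url, text) pairs from as_completed back into the order of `links`."""
    by_url = dict(results)
    return [by_url[link] for link in links]


# e.g. in main():
#     links = generate_links()
#     chapters = in_original_order(links, loop.run_until_complete(run(links)))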
A couple of small things as well:
- List comprehensions are okay, but with just a single argument it can be shorter and equally performant just to use map.
- The URL constants should ideally be defined on the top level; at least base_url can also be defined by concatenating with start_url. Alternatively they could be passed in to generate_links (see the sketch after this list). Then again, it's unlikely that another blog has the exact same layout?
- The manual append in run seems unnecessary; I'd rewrite it into a list of generators and use a list comprehension instead.
- At the moment generate_links is called from run; I think it makes more sense to call it from the main function: it doesn't need to run concurrently, and you could think of a situation where you'd pass in the result of a different function to be fetched and collected.
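For the URL-constants point, here is a sketch of the parameterised variant (an assumption about how one might restructure it, not code from the question): the start URL becomes a module-level constant and an optional argument, and base_url is derived from it.

import requests
from lxml import etree

START_URL = 'https://twigserial.wordpress.com/'   # module-level constant


def generate_links(start_url=START_URL):
    """Generate the chapter links, starting from a configurable blog URL."""
    base_url = start_url + 'category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]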
All in all, I'd maybe change things to the code below. Of course if you were to add things to it, I'd recommend looking into command line arguments and configuration files, ...
import aiohttp
import asyncio
import requests
from lxml import etree


@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())


@asyncio.coroutine
def extract_text(url, sem):
    """
    Given the url for a chapter, extract the relevant text from it

    :param url: the url for the chapter to scrape
    :return: a string containing the chapter's text
    """
    with (yield from sem):
        page = yield from get(url)
    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1:-1]
    return url, b'\n'.join(map(etree.tostring, paragraphs))


def generate_links():
    """
    Generate the links to each of the chapters

    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = start_url + 'category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]


@asyncio.coroutine
def run(links):
    sem = asyncio.Semaphore(5)
    fetchers = [extract_text(link, sem) for link in links]
    return [(yield from f) for f in asyncio.as_completed(fetchers)]


def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run(generate_links()))
    print(len(chapters))


if __name__ == '__main__':
    main()
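As a bare-bones illustration of the command-line-arguments suggestion above (the flag names here are hypothetical, not part of the review), the concurrency limit and start URL could be exposed via argparse:

import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='Scrape chapters from a WordPress serial.')
    parser.add_argument('--start-url', default='https://twigserial.wordpress.com/',
                        help='blog front page to discover chapter links from')
    parser.add_argument('--concurrency', type=int, default=5,
                        help='maximum number of simultaneous requests')
    return parser.parse_args()

main() would then build the semaphore from args.concurrency and pass args.start_url on to generate_links.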
- Shouldn't you run loop.close() at the end of main? – Derek Adair, May 3, 2016 at 0:53
- No idea, I expected the original code to be correct. Does it matter in main? Lastly, while I found some examples with close, there are some without ... so I'm not sure about it. Do you have a good reference for it? – ferada, May 3, 2016 at 8:07
- I'm very new to asyncio, and I'm building a crawler. The only reference I have is the asyncio docs which I linked. Also this. It clears the queue and "shuts down the executor" ... whatever that means, I was hoping you may know! Ha! – Derek Adair, May 3, 2016 at 14:54
- Maybe only useful when you run loop.run_forever() instead of loop.run_until_complete(). – Derek Adair, May 3, 2016 at 15:01
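For what it's worth, one common pattern for the loop.close() question (an assumption on my part, not something from the answer itself) is to close the loop in a finally block once run_until_complete has returned, e.g. as a tweak to the main() above:

def main():
    loop = asyncio.get_event_loop()
    try:
        chapters = loop.run_until_complete(run(generate_links()))
        print(len(chapters))
    finally:
        loop.close()  # release the loop's resources once we're done with it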