Asyncio web scraping framework. The project aims to make it easy to write high-performance crawlers with little knowledge of asyncio, while giving users enough flexibility to customise the behaviour of their scrapers. It also supports uvloop and can be used in conjunction with custom clients, allowing for browser-based rendering.
The project can be installed using pip:

pip install scrapio
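The examples below also rely on aiofiles for asynchronous file writes and on lxml (with cssselect) for HTML parsing; these are separate packages and can be installed the same way:

pip install aiofiles lxml cssselect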
from collections import defaultdict

import aiofiles  # external dependency
import lxml.html as lh

from scrapio.crawlers.base_crawler import BaseCrawler  # import from scrapio.scrapers on version 0.14 and lower
from scrapio.utils.helpers import response_to_html


class OurScraper(BaseCrawler):

    def parse_result(self, response):
        html = response_to_html(response)
        dom = lh.fromstring(html)
        result = defaultdict(lambda: "N/A")
        result['url'] = response.url
        title = dom.cssselect('title')
        h1 = dom.cssselect('h1')
        if title:
            result['title'] = title[0].text_content()
        if h1:
            result['h1'] = h1[0].text_content()
        return result

    async def save_results(self, result):
        if result:
            async with aiofiles.open('example_output.csv', 'a') as f:
                url = result.get('url')
                title = result.get('title')
                h1 = result.get('h1')
                await f.write('"{}","{}","{}"\n'.format(url, title, h1))


if __name__ == '__main__':
    scraper = OurScraper('http://edmundmartin.com')
    scraper.run_crawler(10)
The above is a fully functional scraper built with the Scrapio framework. We override the parse_result and save_results methods of the base crawler class. We then initialise the crawler with our start URL and set the number of scraping and parsing processes.
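Since the framework supports uvloop, the standard event loop can be swapped for uvloop's implementation before the crawler is started. Below is a minimal sketch, assuming uvloop is installed separately and that the crawler picks up the default asyncio event loop policy; the our_scraper module name is only illustrative.

import asyncio

import uvloop  # optional dependency: pip install uvloop

from our_scraper import OurScraper  # hypothetical module containing the scraper defined above

# Any event loop created from now on will be a uvloop loop
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

if __name__ == '__main__':
    scraper = OurScraper('http://edmundmartin.com')
    scraper.run_crawler(10)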
The default link-following behaviour can be customised by subclassing the base URLFilter class and overriding its can_crawl method, as outlined in the example below.
from collections import defaultdict

import aiofiles  # external dependency
import lxml.html as lh

from scrapio.crawlers import BaseCrawler
from scrapio.utils.helpers import response_to_html
from scrapio.structures.filtering import URLFilter


class PythonURLFilter(URLFilter):

    def can_crawl(self, host: str, url: str):
        if 'edmundmartin.com' in host and 'python' in url.lower():
            return True
        return False


class OurScraper(BaseCrawler):

    def parse_result(self, response):
        html = response_to_html(response)
        dom = lh.fromstring(html)
        result = defaultdict(lambda: "N/A")
        result['url'] = response.url
        title = dom.cssselect('title')
        h1 = dom.cssselect('h1')
        if title:
            result['title'] = title[0].text_content()
        if h1:
            result['h1'] = h1[0].text_content()
        return result

    async def save_results(self, result):
        if result:
            async with aiofiles.open('example_output.csv', 'a') as f:
                url = result.get('url')
                title = result.get('title')
                h1 = result.get('h1')
                await f.write('"{}","{}","{}"\n'.format(url, title, h1))


if __name__ == '__main__':
    scraper = OurScraper('http://edmundmartin.com', custom_filter=PythonURLFilter)
    scraper.run_crawler(10)
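The same mechanism can be used for other crawl policies, for example keeping the crawler away from binary assets. Below is a minimal sketch reusing the URLFilter base class and the can_crawl signature shown above; the extension list is purely illustrative.

from scrapio.structures.filtering import URLFilter

SKIPPED_EXTENSIONS = ('.jpg', '.png', '.gif', '.pdf', '.zip')


class HTMLOnlyFilter(URLFilter):

    def can_crawl(self, host: str, url: str):
        # Stay on the target domain and ignore links that point at binary files
        if 'edmundmartin.com' not in host:
            return False
        return not url.lower().endswith(SKIPPED_EXTENSIONS)

As in the previous example, the filter is passed to the crawler via the custom_filter argument.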