crawley lets you crawl websites and extract structured data with a tiny,
declarative API. This is the modernized release: the legacy eventlet / elixir
stack has been replaced by asyncio, httpx and SQLAlchemy 2.x.
Documentation: https://jmg.github.io/crawley/ (or run `mkdocs serve`
locally; see Development).
- High-speed asynchronous crawler powered by `asyncio` + `httpx`.
- A modern, ergonomic scraping API (`fetch`, `Document`, CSS/XPath, `extract`).
- Extract data with your favourite tool: XPath, CSS selectors or PyQuery (a jQuery-like API).
- Politeness built in: `robots.txt`, per-host rate limiting and retries with exponential backoff.
- Persist to relational databases (SQLite, PostgreSQL, MySQL, Oracle) via SQLAlchemy 2.x, to MongoDB / CouchDB, or export to JSON / XML / CSV.
- Cookie handling and proxies out of the box.
- A small DSL to define scrapers declaratively.
- Command line tools (`crawley startproject`, `crawley run`, ...).
- Optional visual scraping browser (PySide6).
- Python 3.9+
```bash
~$ pip install crawley            # core (httpx, lxml, pyquery, cssselect)
~$ pip install "crawley[sql]"     # + SQLAlchemy for relational storage
~$ pip install "crawley[mongo]"   # + pymongo
~$ pip install "crawley[gui]"     # + PySide6 visual browser
~$ pip install "crawley[dev]"     # tests + linters
```
From a checkout:
```bash
~$ pip install -e ".[dev]"
```
```python
import asyncio

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor


class QuotesScraper(BaseScraper):
    # only pages matching these patterns are scraped ("%" is a wildcard)
    matching_urls = ["%quotes.toscrape.com%"]

    def scrape(self, response):
        for quote in response.html.xpath("//div[@class='quote']"):
            text = quote.xpath(".//span[@class='text']")[0].text
            author = quote.xpath(".//small[@class='author']")[0].text
            print(author, "->", text)


class QuotesCrawler(BaseCrawler):
    start_urls = ["https://quotes.toscrape.com/"]
    scrapers = [QuotesScraper]
    max_depth = 2
    extractor = XPathExtractor  # or CSSExtractor / PyQueryExtractor


# Synchronous entry point:
QuotesCrawler().run()

# ...or await it from your own event loop:
# asyncio.run(QuotesCrawler().start())
```
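The `"%"` wildcard in `matching_urls` can be pictured with a small standalone sketch. It assumes `%` means "any run of characters" (as in SQL `LIKE`); `url_matches` is a hypothetical helper written for illustration, not crawley's own matching code.

```python
import re


def url_matches(pattern: str, url: str) -> bool:
    """Return True if url matches pattern, treating "%" as "any run of characters".

    Hypothetical helper: the semantics are an assumption, not crawley's code.
    """
    # Escape each literal chunk so characters like "." lose their regex meaning,
    # then join the chunks with ".*" wherever a "%" stood.
    regex = ".*".join(re.escape(chunk) for chunk in pattern.split("%"))
    return re.fullmatch(regex, url) is not None


print(url_matches("%quotes.toscrape.com%", "https://quotes.toscrape.com/page/2/"))  # True
print(url_matches("%quotes.toscrape.com%", "https://example.com/"))                 # False
print(url_matches("%", "https://pypi.org/"))                                        # True
```

Under this reading, a bare `"%"` matches every URL, and a pattern without `%` only matches that exact URL.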
Need a one-off request?
```python
from crawley.toolbox import request

response = request("https://example.com")
print(response.status_code, response.html.xpath("//title")[0].text)
```
For "just scrape this page" use cases there's a small, ergonomic API
(à la parsel / requests-html) built on the same httpx + lxml stack.
Selectors accept an optional `::text` or `::attr(name)` suffix.
```python
from crawley.scraping import fetch

doc = fetch("https://quotes.toscrape.com/")

doc.title                     # -> "Quotes to Scrape"
doc.css_first("h1").text      # first match (an Element)
doc.css("span.text::text")    # list of texts
doc.css("a::attr(href)")      # list of (absolute) hrefs
doc.links()                   # de-duplicated absolute links

# Declarative extraction: a string selector -> one value, [selector] -> a list
doc.extract({
    "quote": "span.text::text",
    "author": "small.author::text",
    "tags": ["a.tag::text"],
})
```
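To make the two conventions above concrete (the `::text` / `::attr(name)` suffix, and "string selector -> one value, `[selector]` -> a list"), here is a minimal standalone sketch. `split_selector` and this `extract` are hypothetical helpers, the `select` callable stands in for a real CSS engine, and returning `None` for a selector with no match is an assumption.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple, Union

# A trailing "::text" or "::attr(name)" pseudo-element.
_SUFFIX = re.compile(r"::(?:text|attr\((?P<attr>[^)]+)\))$")


def split_selector(selector: str) -> Tuple[str, str, Optional[str]]:
    """Split a selector into (css, mode, attribute_name); mode is element/text/attr."""
    match = _SUFFIX.search(selector)
    if match is None:
        return selector, "element", None
    css = selector[: match.start()]
    attr = match.group("attr")
    return (css, "attr", attr) if attr else (css, "text", None)


Schema = Dict[str, Union[str, List[str]]]


def extract(schema: Schema, select: Callable[[str], list]) -> dict:
    """Apply a schema; select(selector) must return the list of matches."""
    result = {}
    for key, spec in schema.items():
        if isinstance(spec, list):
            result[key] = select(spec[0])  # [selector] -> every match
        else:
            matches = select(spec)
            result[key] = matches[0] if matches else None  # selector -> first match
    return result


print(split_selector("span.text::text"))  # ('span.text', 'text', None)
print(split_selector("a::attr(href)"))    # ('a', 'attr', 'href')

# Made-up page content keyed by selector, standing in for a parsed document.
fake_page = {"span.text::text": ["quote one"], "a.tag::text": ["tag-a", "tag-b"]}
record = extract(
    {"quote": "span.text::text", "tags": ["a.tag::text"]},
    lambda selector: fake_page.get(selector, []),
)
print(record)  # {'quote': 'quote one', 'tags': ['tag-a', 'tag-b']}
```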
Fetch many pages concurrently, or scrape a URL in one call:
```python
import asyncio

from crawley.scraping import afetch_all, scrape

scrape("https://example.com", {"title": "h1::text"})
docs = asyncio.run(afetch_all(["https://a.com", "https://b.com"]))
```
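How `afetch_all` is implemented is not shown here, but the usual asyncio pattern for bounded concurrent fetching looks roughly like the sketch below. `fake_fetch`, `fetch_all` and the `limit` parameter are hypothetical names, and the stand-in coroutine never touches the network.

```python
import asyncio
from typing import List


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request (hypothetical, no network involved)."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_all(urls: List[str], limit: int = 5) -> List[str]:
    """Fetch all urls concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fake_fetch(url)

    # gather() returns results in the same order as the input urls.
    return await asyncio.gather(*(bounded(u) for u in urls))


pages = asyncio.run(fetch_all(["https://a.com", "https://b.com"]))
print(pages)
```

The semaphore caps the number of simultaneous requests, while `gather` keeps the results aligned with the input order.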
The same shortcuts (`response.css`, `response.css_first`, `response.extract`,
`response.doc`) are available on the crawler's response object inside
`scrape()`.
For full crawls there's a Scrapy-style `Spider`: yield Requests (or
`response.follow(...)`) to navigate and dicts/Items to emit data, with item
pipelines, rule-based crawling and optional JavaScript rendering. See
`docs/spiders.md`.
```python
from crawley.spider import Spider


class BlogSpider(Spider):
    start_urls = ["https://example.com/blog/"]

    def parse(self, response):  # default callback
        for href in response.css("a.post::attr(href)"):
            yield response.follow(href, callback=self.parse_post)
        nxt = response.css_first("a.next::attr(href)")
        if nxt:
            yield response.follow(nxt)  # follows pagination

    def parse_post(self, response):
        yield {"title": response.css_first("h1").text, "url": response.url}


BlogSpider().run()
```
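Following links means turning each `href` into an absolute URL and not visiting the same page twice. The standalone sketch below shows one way to do that with the standard library; `absolute_links` is a hypothetical helper, and dropping `#fragment` parts is an assumption rather than documented crawley behaviour.

```python
from urllib.parse import urldefrag, urljoin


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments and duplicates, keep order."""
    seen = {}
    for href in hrefs:
        url, _fragment = urldefrag(urljoin(base_url, href))
        seen.setdefault(url, None)  # dicts preserve insertion order
    return list(seen)


links = absolute_links(
    "https://example.com/blog/",
    ["post-1/", "/about", "post-1/#comments", "https://other.org/x"],
)
print(links)
# ['https://example.com/blog/post-1/', 'https://example.com/about', 'https://other.org/x']
```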
- Item pipelines: `crawley.pipelines.ItemPipeline` + `DropItem`.
- Rule-based: `CrawlSpider` + `Rule(LinkExtractor(allow=..., deny=...))`.
- Sitemaps: `SitemapSpider(sitemap_urls=[...])`.
- JavaScript: `render_js = True` (install `crawley[js]` + `playwright install`).
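The item-pipeline idea can be sketched without crawley at all: each stage receives an item and either returns it (possibly modified) or raises an exception to discard it. The classes below are local, hypothetical stand-ins; in particular the `process_item` method name is borrowed from Scrapy's convention and is only an assumption about crawley's `ItemPipeline` interface.

```python
class DropItem(Exception):
    """Raised by a pipeline stage to discard the current item (local stand-in)."""


class RequireTitle:
    def process_item(self, item):
        if not item.get("title"):
            raise DropItem("missing title")
        return item


class StripWhitespace:
    def process_item(self, item):
        # Trim string values, leave everything else untouched.
        return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


def run_pipelines(items, stages):
    """Pass each item through every stage in order; skip items that get dropped."""
    kept = []
    for item in items:
        try:
            for stage in stages:
                item = stage.process_item(item)
        except DropItem:
            continue
        kept.append(item)
    return kept


scraped = [
    {"title": "  Hello  ", "url": "https://example.com/1"},
    {"title": "", "url": "https://example.com/2"},
]
cleaned = run_pipelines(scraped, [RequireTitle(), StripWhitespace()])
print(cleaned)  # [{'title': 'Hello', 'url': 'https://example.com/1'}]
```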
```bash
~$ crawley startproject myproject
~$ cd myproject
```
```python
from crawley.persistance import Entity, UrlEntity, Field, Unicode


class Package(Entity):
    updated = Field(Unicode(255))
    package = Field(Unicode(255))
    description = Field(Unicode(255))
```
```python
from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *


class pypiScraper(BaseScraper):
    matching_urls = ["%"]

    def scrape(self, response):
        for tr in response.html.xpath("//table/tr"):
            Package(package=tr[1].text, description=tr[2].text)


class pypiCrawler(BaseCrawler):
    start_urls = ["https://pypi.org/"]
    scrapers = [pypiScraper]
    max_depth = 0
    extractor = XPathExtractor
```
```bash
~$ crawley run
```

Other commands: `crawley syncdb`, `crawley migratedb`, `crawley shell <url>`,
`crawley browser <url>`.
| Extractor | `response.html` is... | Query with |
|---|---|---|
| `XPathExtractor` | an lxml tree | `.xpath(...)` |
| `CSSExtractor` | an lxml tree | `.getroot().cssselect(...)` |
| `PyQueryExtractor` | a PyQuery object | `pq("div.foo")` |
| `RawExtractor` | the raw html `str` | anything you like |
Crawl responsibly with a few class attributes (see
docs/politeness.md):
```python
from crawley.crawlers import BaseCrawler


class PoliteCrawler(BaseCrawler):
    start_urls = ["https://example.com/"]
    respect_robots = True            # honour robots.txt (+ Crawl-delay)
    crawl_delay = 1.0                # >= 1s between requests to the same host
    max_concurrency_per_host = 2     # at most 2 concurrent requests per host
    max_retries = 3                  # retry 429/5xx + network errors...
    retry_backoff = 0.5              # ...with exponential backoff + jitter
```
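To give a feel for what "exponential backoff + jitter" means for `max_retries = 3` and `retry_backoff = 0.5`, here is a minimal sketch of one common schedule. The exact formula crawley uses is not stated here, so `base * 2**attempt` and the `jitter` fraction are assumptions, and `backoff_delays` is a hypothetical helper.

```python
import random


def backoff_delays(max_retries=3, base=0.5, jitter=0.1, rng=random.random):
    """Yield the wait in seconds before each retry: base * 2**attempt, plus jitter.

    The formula and the `jitter` parameter are assumptions for illustration.
    """
    for attempt in range(max_retries):
        delay = base * (2 ** attempt)
        # Add up to `jitter` (10% by default) of random extra wait so that
        # many clients retrying at once do not all hit the host together.
        yield delay + rng() * jitter * delay


# With the jitter switched off the schedule is plain doubling.
print(list(backoff_delays(rng=lambda: 0.0)))  # [0.5, 1.0, 2.0]
```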
Retries honour the `Retry-After` header, and `on_robots_blocked(url)` lets you
react to disallowed URLs.
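The robots.txt check itself can be tried in isolation with the standard library's `urllib.robotparser`. This sketch parses an in-memory robots.txt (so no network is needed); the `"my-crawler"` user agent, the sample rules and the local `on_robots_blocked` function are made up for illustration and do not show how crawley wires the hook internally.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in-memory text; no network access


def on_robots_blocked(url):
    # Hypothetical stand-in for the hook named above.
    print("blocked by robots.txt:", url)


for url in ["https://example.com/", "https://example.com/private/data"]:
    if parser.can_fetch("my-crawler", url):
        print("allowed:", url)
    else:
        on_robots_blocked(url)

print("crawl delay:", parser.crawl_delay("my-crawler"))  # crawl delay: 2
```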
```bash
~$ pip install -e ".[dev]"
~$ pytest                                     # run the (hermetic) test suite
~$ ruff check crawley
~$ pip install -e ".[docs]" && mkdocs serve   # docs preview
```
The test suite spins up a local HTTP server, so it never hits the network.
Runnable, documented scripts live in examples/:
| File | Shows |
|---|---|
| `01_scraping_quickstart.py` | The scraping API: `fetch`, CSS/XPath, `extract`. |
| `02_crawler.py` | A crawler that follows pagination. |
| `03_polite_crawler.py` | `robots.txt`, rate limiting and retries. |
| `04_persistence_json.py` | Persisting scraped data to JSON. |
| `05_concurrent_fetch.py` | Concurrent fetching with `afetch_all`. |
```bash
~$ python examples/01_scraping_quickstart.py
```

Every example is exercised by the test suite against a local server, so they stay in sync with the code.
GPL v3