> [!IMPORTANT]
> CrawlWall is alpha and experimental. It is useful for local testing, demos, and careful shadow-mode trials, but it is not yet a battle-tested production security boundary. Review the policy, verifier, and ledger behavior before enforcing blocks on real traffic.
A self-hosted Caddy module for AI crawler blocking, bot verification, rate limiting, metered access, and signed crawl receipts.
CrawlWall sits in front of your application and turns robots.txt-style crawler policy into enforceable HTTP-edge rules using YAML and CEL. It identifies crawlers, verifies their identity, evaluates policy, records what happened, and can sign receipts for metered access.
The short version is:
- `robots.txt` is advisory
- CrawlWall is enforcement
- YAML is the config container
- CEL is the policy language
- Caddy is the runtime
- Why this exists
- Mental model
- Architecture
- How a request is handled
- Getting started
- Requirements
- Keys and receipts
- Policy shape
- Writing policy rules
- Verifiers
- Client IP and trusted proxies
- Actions
- CLI
- Project layout
- Scope
- Status
- Help
- License
Sites increasingly need something more precise than:
- "please do not crawl this"
- "this bot says it is Google"
- "this path should maybe cost money"
That is awkward to express with robots.txt, awkward to audit in application
code, and annoying to keep consistent across services.
CrawlWall moves that logic into the HTTP edge and gives it a stable shape:
| Concern | CrawlWall answer |
|---|---|
| Is this crawler known? | Match on User-Agent |
| Is it really that crawler? | Verify by reverse DNS or IP ranges |
| What should happen? | Evaluate CEL rules in priority order |
| Need proof later? | Write a ledger event with a stable event ID |
| Need metering? | allow_metered + signed receipts |
The point is not to be clever. The point is to be explicit, inspectable, and replaceable.
Think of CrawlWall as four subsystems glued together inside a Caddy handler:
- Bot identification: map a request to a known bot definition or `unknown`.
- Verification: decide whether the claimed crawler identity is trustworthy.
- Policy evaluation: run CEL rules against the request context.
- Audit trail: write the event and optionally sign a receipt.
That means the project is not "a YAML parser" and not "a crawler blocklist." It is a policy runtime.
```mermaid
flowchart LR
    A["HTTP request"] --> B["Bot identifier"]
    B --> C["Verifier"]
    C --> D["Policy engine (CEL)"]
    D --> E["Decision"]
    E --> F["Allow / Block / Rate limit / Allow metered"]
    E --> H["Signed receipt (optional)"]
    H --> G["Ledger writer"]
    F --> I["Upstream app"]
```
The startup path matters as much as the request path.
At startup CrawlWall:
- loads `crawlwall.yaml`
- validates the config
- compiles CEL expressions
- opens the ledger backend
- prepares verifiers
- loads the receipt signer
If a CEL expression is broken, startup should fail. That is the right pain location.
| Step | What happens |
|---|---|
| 1 | Read User-Agent and identify the claimed crawler |
| 2 | Verify the request source according to that crawler's verifier |
| 3 | Build the policy input: bot, request, site, sets, labels |
| 4 | Evaluate rules by ascending priority |
| 5 | Enforce the first matching action |
| 6 | If requested, sign a receipt over the stable event ID |
| 7 | Write one ledger record containing the decision and receipt metadata |
The policy input is intentionally small and boring. It is easier to extend a plain model than to untangle a magical one.
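Steps 4 and 5 can be sketched in a few lines of Go. This is an illustration only, not CrawlWall's internal code: the `Input`, `Rule`, and `Evaluate` names are hypothetical, and a plain Go function stands in for a compiled CEL expression.

```go
package main

import (
	"fmt"
	"sort"
)

// Input is a hypothetical, trimmed-down stand-in for the policy input.
type Input struct {
	BotClass string
	Claimed  bool
	Verified bool
	Path     string
}

// Rule pairs a priority with a predicate and an action name.
// A plain Go func stands in here for a compiled CEL expression.
type Rule struct {
	ID       string
	Priority int
	When     func(Input) bool
	Action   string
}

// Evaluate walks the rules in ascending priority and returns the first match.
// When nothing matches it returns an empty rule ID and the default action.
func Evaluate(rules []Rule, in Input, defaultAction string) (ruleID, action string) {
	sorted := append([]Rule(nil), rules...) // copy so the caller's order is untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority < sorted[j].Priority
	})
	for _, r := range sorted {
		if r.When(in) {
			return r.ID, r.Action
		}
	}
	return "", defaultAction
}

// SampleRules mirrors two rules from the full policy example, deliberately
// listed out of priority order.
func SampleRules() []Rule {
	return []Rule{
		{
			ID:       "allow_verified_search",
			Priority: 100,
			When:     func(in Input) bool { return in.Verified && in.BotClass == "search" },
			Action:   "allow",
		},
		{
			ID:       "block_spoofed_known_bots",
			Priority: 10,
			When:     func(in Input) bool { return in.Claimed && !in.Verified },
			Action:   "block",
		},
	}
}

func main() {
	spoofed := Input{BotClass: "search", Claimed: true, Verified: false, Path: "/"}
	id, action := Evaluate(SampleRules(), spoofed, "allow")
	fmt.Println(id, action) // prints: block_spoofed_known_bots block
}
```

Because the first match wins, low-numbered rules such as the spoof check act as guards in front of the more permissive rules that follow.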
Do not start from a blank file.
This repo ships with two starter policies and a fixture file for policy tests:
| File | Use it when |
|---|---|
| `examples/minimal.yaml` | You want a readable starter with no receipt signing |
| `examples/full.yaml` | You want the full V1 shape with metering and signed receipts |
| `examples/policy-fixtures.yaml` | You want regression tests for policy behavior |
There is also a scaffold command:
```sh
go run ./cmd/crawlwall init --profile minimal
go run ./cmd/crawlwall init --profile full
```
That writes:
- `crawlwall.yaml`
- `Caddyfile`
- `.gitignore`
- `crawlwall.key` and `crawlwall.pub` unless you disable key generation
If you want the scaffold without keys yet:
```sh
go run ./cmd/crawlwall init --profile minimal --generate-keys=false
```
- Go matching the version in `go.mod`
- `xcaddy` to build a Caddy binary with the CrawlWall module
- Caddy for config validation and runtime
- A SQLite ledger path when `ledger.enabled` is `true`
Build a custom Caddy binary with xcaddy.
From this local checkout:
```sh
go mod tidy
xcaddy build --with github.com/jolovicdev/crawlwall=.
```
From a published module version:
```sh
xcaddy build --with github.com/jolovicdev/crawlwall@latest
```
Check that the module is present:
```sh
caddy list-modules | grep crawlwall
```

Validate the config:

```sh
go run ./cmd/crawlwall policy check --config ./crawlwall.yaml
caddy validate --config ./Caddyfile --adapter caddyfile
```
Run it:
```sh
caddy run --config ./Caddyfile --adapter caddyfile
```
Try a few requests:
```sh
curl http://localhost:8080/
curl http://localhost:8080/archive/a
curl -A "GPTBot/1.1" http://localhost:8080/archive/a
```

> [!NOTE]
> The docs use plain executable names on purpose. Use whatever binary name your environment produces.
Receipt signing uses Ed25519.
The private key is sensitive and should never be committed. This repo ignores
it by default in .gitignore.
You have two normal ways to create keys:
- let `crawlwall init` generate them
- generate them yourself with `openssl`

```sh
openssl genpkey -algorithm Ed25519 -out crawlwall.key
openssl pkey -in crawlwall.key -pubout -out crawlwall.pub
```
Receipt config looks like this:
```yaml
receipts:
  enabled: true
  signer:
    type: ed25519
    key_file: ./crawlwall.key
```
Receipts are for proving what decision was made for a request. In V1 they are used for metered access and audit, not settlement.
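To make the signing step concrete, here is a minimal sketch that uses Go's standard `crypto/ed25519` package to sign an event ID and check the signature. The function names, the base64 encoding, and the choice to sign the bare event ID string are assumptions for illustration; CrawlWall's actual canonical receipt format may differ.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// NewKeyPair generates a throwaway key pair for this demo.
// A real deployment would load the key from key_file instead.
func NewKeyPair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// SignEventID signs the event ID and returns a base64 signature.
// Signing only the ID is an assumption made to keep the sketch small.
func SignEventID(priv ed25519.PrivateKey, eventID string) string {
	sig := ed25519.Sign(priv, []byte(eventID))
	return base64.StdEncoding.EncodeToString(sig)
}

// VerifyEventID reports whether sigB64 is a valid signature over eventID.
func VerifyEventID(pub ed25519.PublicKey, eventID, sigB64 string) bool {
	sig, err := base64.StdEncoding.DecodeString(sigB64)
	if err != nil {
		return false
	}
	return ed25519.Verify(pub, []byte(eventID), sig)
}

func main() {
	pub, priv, err := NewKeyPair()
	if err != nil {
		panic(err)
	}
	sig := SignEventID(priv, "evt-0001")
	fmt.Println(VerifyEventID(pub, "evt-0001", sig)) // true
	fmt.Println(VerifyEventID(pub, "evt-0002", sig)) // false: different event ID
}
```

Anyone holding only the public key can run the verification half, which is what makes a receipt useful as evidence after the fact.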
The top-level config model is stable even if the individual rules change:
| Section | Purpose |
|---|---|
| `site` | Site identity and mode |
| `runtime` | Failure behavior and default action |
| `ledger` | Event recording settings |
| `receipts` | Receipt signer configuration |
| `bots` | Known crawler definitions and verifier settings |
| `sets` | Reusable policy data |
| `rules` | CEL expressions plus actions |
`site.mode` controls enforcement:
| Mode | Effect |
|---|---|
| `shadow` | Log decisions without enforcing blocks or rate limits |
| `observe` | Alias for `shadow`, kept for older configs |
| `enforce` | Enforce policy decisions |
Use shadow before blocking crawlers on a production site. It lets you inspect
the ledger first, which is less exciting than debugging a self-inflicted 403
storm.
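The effect of `site.mode` on a decision can be pictured as a small mapping from the decided action to the action that is actually applied. The sketch below is a hypothetical helper, not CrawlWall's implementation; it assumes that in shadow mode only `block` and `rate_limit` are softened, since those are the two behaviors the table says are not enforced.

```go
package main

import "fmt"

// Applied returns the action that actually reaches the client for a given
// site.mode. The decided action would still be logged to the ledger either way.
func Applied(mode, decided string) (string, error) {
	switch mode {
	case "enforce":
		return decided, nil
	case "shadow", "observe": // observe is an alias for shadow
		if decided == "block" || decided == "rate_limit" {
			return "allow", nil // logged, but not enforced
		}
		return decided, nil
	default:
		return "", fmt.Errorf("unknown site.mode %q", mode)
	}
}

func main() {
	for _, mode := range []string{"shadow", "enforce"} {
		action, err := Applied(mode, "block")
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: decided=block applied=%s\n", mode, action)
	}
}
```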
Start with the policy guide. It explains the available CEL inputs, rule priority, shadow mode, common recipes, verifier cache status, and fixture tests.
```caddyfile
{
	order crawlwall before reverse_proxy
}

:8080 {
	crawlwall {
		policy ./crawlwall.yaml
		ledger sqlite://./crawlwall.db
		fail_mode block
	}
	reverse_proxy localhost:3000
}
```

```yaml
- id: meter_training_on_protected_paths
  priority: 200
  when: >
    bot.verified &&
    bot.class == "ai_training" &&
    sets.protected_paths.exists(p, request.path.startsWith(p))
  action:
    type: allow_metered
    price:
      amount: 0.002
      currency: USD
      unit: request
  audit:
    receipt: true
    tags: ["ai_training", "metered"]
```
Full V1 policy example:
```yaml
version: crawlwall.io/v1

site:
  id: local-dev
  host: localhost
  mode: enforce

runtime:
  fail_mode: block
  default_action:
    type: allow

ledger:
  enabled: true

receipts:
  enabled: true
  signer:
    type: ed25519
    key_file: ./crawlwall.key

bots:
  - id: googlebot
    name: Googlebot
    class: search
    match:
      user_agents:
        - "Googlebot"
    verify:
      type: reverse_dns
      allowed_suffixes:
        - ".googlebot.com"
        - ".google.com"

  - id: gptbot
    name: GPTBot
    class: ai_training
    match:
      user_agents:
        - "GPTBot"
    verify:
      type: ip_ranges
      sources:
        - "https://openai.com/gptbot.json"
      refresh: 1h
      stale_action: fail_closed
      max_stale: 0s

  - id: unknown
    name: Unknown
    class: unknown
    match:
      default: true
    verify:
      type: none

sets:
  protected_paths:
    - "/archive"
    - "/datasets"
    - "/reports"
  known_ai_training:
    - "gptbot"
    - "claudebot"

rules:
  - id: block_spoofed_known_bots
    priority: 10
    when: >
      bot.claimed && !bot.verified
    action:
      type: block
      status: 403
      reason: spoofed_bot
    audit:
      receipt: true
      tags: ["spoofed", "security"]

  - id: allow_verified_search
    priority: 100
    when: >
      bot.verified && bot.class == "search"
    action:
      type: allow
    audit:
      receipt: false
      tags: ["search"]

  - id: meter_training_on_protected_paths
    priority: 200
    when: >
      bot.verified &&
      bot.class == "ai_training" &&
      sets.protected_paths.exists(p, request.path.startsWith(p))
    action:
      type: allow_metered
      price:
        amount: 0.002
        currency: USD
        unit: request
    audit:
      receipt: true
      tags: ["ai_training", "metered"]

  - id: rate_limit_ai_training_elsewhere
    priority: 300
    when: >
      bot.verified && bot.class == "ai_training"
    action:
      type: rate_limit
      limit:
        key: "bot.id"
        rpm: 120
    audit:
      receipt: true
      tags: ["ai_training"]

  - id: block_unknown_protected_paths
    priority: 900
    when: >
      bot.class == "unknown" &&
      sets.protected_paths.exists(p, request.path.startsWith(p))
    action:
      type: block
      status: 403
      reason: unknown_crawler_protected_path
    audit:
      receipt: true
      tags: ["unknown", "blocked"]
```
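If the CEL `exists` macro is unfamiliar, the path check used in the rules above, `sets.protected_paths.exists(p, request.path.startsWith(p))`, behaves like the following Go loop. The function name is hypothetical and exists only to show the semantics.

```go
package main

import (
	"fmt"
	"strings"
)

// MatchesAnyPrefix mirrors the CEL expression
//   sets.protected_paths.exists(p, request.path.startsWith(p))
// It is true as soon as one prefix matches the start of the path.
func MatchesAnyPrefix(prefixes []string, path string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	return false
}

func main() {
	protected := []string{"/archive", "/datasets", "/reports"}
	fmt.Println(MatchesAnyPrefix(protected, "/archive/a"))   // true
	fmt.Println(MatchesAnyPrefix(protected, "/about"))       // false
	fmt.Println(MatchesAnyPrefix(protected, "/archive-old")) // true: plain string prefix
}
```

The last case is worth noticing: `startsWith` is a plain string prefix test, so `/archive` also covers `/archive-old`. Writing the set entry as `/archive/` is one way to narrow that, at the cost of no longer matching the bare `/archive` path.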
V1 ships with three verifier types:
| Verifier | What it means |
|---|---|
| `none` | No verification step; useful for the `unknown` catch-all bot |
| `reverse_dns` | Verify by PTR lookup and forward-confirm the result |
| `ip_ranges` | Verify by matching the request IP against fetched CIDR ranges |
`reverse_dns` is the standard pattern used for bots like Googlebot:
- resolve remote IP to PTR names
- require a configured suffix match
- resolve the PTR hostname back to A/AAAA
- require the original IP to be present
Completed reverse-DNS decisions are cached per IP for five minutes to avoid doing PTR and forward lookups on every request from a claimed crawler.
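The four reverse-DNS steps can be sketched with Go's standard `net` package. This is an illustration under assumptions, not CrawlWall's verifier: the `Resolver` type and function names are hypothetical, the five-minute cache is left out, and the demo uses canned lookups with documentation addresses so it runs without network access.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// Resolver holds the two lookups so they can be swapped for canned data.
type Resolver struct {
	LookupAddr func(ip string) ([]string, error)   // IP -> PTR names
	LookupHost func(host string) ([]string, error) // name -> A/AAAA addresses
}

// DefaultResolver uses the system resolver.
func DefaultResolver() Resolver {
	return Resolver{LookupAddr: net.LookupAddr, LookupHost: net.LookupHost}
}

// VerifyReverseDNS performs a forward-confirmed reverse DNS check.
func VerifyReverseDNS(r Resolver, ip string, suffixes []string) bool {
	want := net.ParseIP(ip)
	if want == nil {
		return false
	}
	names, err := r.LookupAddr(ip) // step 1: resolve the IP to PTR names
	if err != nil {
		return false
	}
	for _, name := range names {
		host := strings.ToLower(strings.TrimSuffix(name, "."))
		if !hasAllowedSuffix(host, suffixes) { // step 2: require a suffix match
			continue
		}
		addrs, err := r.LookupHost(host) // step 3: resolve the name forward
		if err != nil {
			continue
		}
		for _, a := range addrs { // step 4: the original IP must be present
			if got := net.ParseIP(a); got != nil && got.Equal(want) {
				return true
			}
		}
	}
	return false
}

func hasAllowedSuffix(host string, suffixes []string) bool {
	for _, s := range suffixes {
		if strings.HasSuffix(host, s) {
			return true
		}
	}
	return false
}

// FakeResolver returns canned answers. 192.0.2.99 has a PTR record that
// claims a crawler name, but that name does not resolve back to it.
func FakeResolver() Resolver {
	ptr := map[string][]string{
		"192.0.2.10": {"crawl-10.crawler.example."},
		"192.0.2.99": {"crawl-10.crawler.example."},
	}
	fwd := map[string][]string{"crawl-10.crawler.example": {"192.0.2.10"}}
	return Resolver{
		LookupAddr: func(ip string) ([]string, error) {
			if names, ok := ptr[ip]; ok {
				return names, nil
			}
			return nil, fmt.Errorf("no PTR record for %s", ip)
		},
		LookupHost: func(host string) ([]string, error) {
			if addrs, ok := fwd[host]; ok {
				return addrs, nil
			}
			return nil, fmt.Errorf("no such host %s", host)
		},
	}
}

func main() {
	suffixes := []string{".crawler.example"}
	fmt.Println(VerifyReverseDNS(FakeResolver(), "192.0.2.10", suffixes)) // true
	fmt.Println(VerifyReverseDNS(FakeResolver(), "192.0.2.99", suffixes)) // false
}
```

The forward confirmation in steps 3 and 4 is what defeats a spoofed PTR record: whoever controls the reverse zone for an IP can publish any name, but cannot make the crawler operator's forward zone point back at that IP.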
`ip_ranges` is the simpler model for bots that publish source ranges:
- fetch remote JSON
- extract CIDRs
- cache them in memory
- refresh on the configured interval
- match the request IP against the cache
> [!IMPORTANT]
> A `GPTBot` request only verifies as true if the actual source IP falls inside OpenAI's published GPTBot ranges at evaluation time.
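The final matching step is a CIDR containment check. Below is a minimal sketch using Go's `net/netip`; the function names are hypothetical, the ranges are documentation prefixes rather than any provider's real data, and fetching and parsing the provider JSON is left out.

```go
package main

import (
	"fmt"
	"net/netip"
)

// ParseRanges turns CIDR strings into prefixes, failing on the first bad entry.
func ParseRanges(cidrs []string) ([]netip.Prefix, error) {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", c, err)
		}
		out = append(out, p.Masked()) // normalize away any host bits
	}
	return out, nil
}

// InRanges reports whether ip falls inside any of the cached ranges.
func InRanges(ranges []netip.Prefix, ip string) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	for _, r := range ranges {
		if r.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	ranges, err := ParseRanges([]string{"198.51.100.0/24", "2001:db8::/32"})
	if err != nil {
		panic(err)
	}
	fmt.Println(InRanges(ranges, "198.51.100.7")) // true
	fmt.Println(InRanges(ranges, "203.0.113.5"))  // false
}
```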
There is one unavoidable freshness tradeoff: CrawlWall can only know about an
IP range rotation after it refreshes the provider document. A shorter
refresh reduces that window but makes more network calls.
When a refresh is due and the provider document cannot be fetched,
stale_action controls whether the expired cache is still trusted:
| Field | Default | Meaning |
|---|---|---|
| `refresh` | `12h` | How often to refetch the range document |
| `stale_action` | `fail_closed` | Refuse expired ranges after refresh failure |
| `max_stale` | `0s` | Extra stale-cache time for `use_stale` |
Security-first config:
```yaml
verify:
  type: ip_ranges
  sources:
    - "https://openai.com/gptbot.json"
  refresh: 1h
  stale_action: fail_closed
  max_stale: 0s
```
Availability-first config:
```yaml
verify:
  type: ip_ranges
  sources:
    - "https://openai.com/gptbot.json"
  refresh: 1h
  stale_action: use_stale
  max_stale: 24h
```
Use fail_closed when spoof resistance matters more than crawler availability.
Use use_stale only when temporarily blocking a legitimate crawler is worse
than trusting a bounded stale range cache.
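The interaction between `refresh`, `stale_action`, and `max_stale` can be written as one small decision function. This sketch encodes one reading of the table above, namely that `max_stale` extends the cache lifetime beyond `refresh` only when `stale_action` is `use_stale`. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// CacheTrusted reports whether cached ranges of the given age may still be
// used. An age beyond refresh implies the refetch failed, otherwise the
// cache would have been replaced.
func CacheTrusted(age, refresh, maxStale time.Duration, staleAction string) bool {
	if age <= refresh {
		return true // still fresh, no refresh due yet
	}
	if staleAction == "use_stale" {
		return age <= refresh+maxStale // bounded grace period
	}
	return false // fail_closed: refuse expired ranges
}

// hours is a small helper to keep call sites readable.
func hours(n int) time.Duration { return time.Duration(n) * time.Hour }

func main() {
	// Security-first: refresh 1h, fail_closed, max_stale 0s.
	fmt.Println(CacheTrusted(hours(2), hours(1), 0, "fail_closed")) // false
	// Availability-first: refresh 1h, use_stale, max_stale 24h.
	fmt.Println(CacheTrusted(hours(2), hours(1), hours(24), "use_stale"))  // true
	fmt.Println(CacheTrusted(hours(26), hours(1), hours(24), "use_stale")) // false
}
```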
CrawlWall verifies crawlers against Caddy's trusted-proxy-aware client IP. If
Caddy receives traffic directly, that is the socket remote address. If Caddy is
behind a CDN, load balancer, or reverse proxy, configure Caddy's server-level
trusted_proxies and client_ip_headers options so forwarded client IP headers
are trusted only from known proxy ranges.
Example:
```caddyfile
{
	servers {
		trusted_proxies static private_ranges
		client_ip_headers X-Forwarded-For CF-Connecting-IP
	}
	order crawlwall before reverse_proxy
}
```

Do not trust arbitrary X-Forwarded-For headers from the public internet. That turns crawler verification into wishful thinking with a header parser.
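To see why the trust boundary matters, here is a generic sketch of header handling that only honors `X-Forwarded-For` when the direct peer is a trusted proxy. It illustrates the idea and is not Caddy's algorithm: the function names are hypothetical, and picking the right-most untrusted hop is one conservative strategy among several.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// MustPrefixes parses CIDRs and panics on bad input; fine for a demo.
func MustPrefixes(cidrs ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c))
	}
	return out
}

func inAny(ranges []netip.Prefix, a netip.Addr) bool {
	for _, r := range ranges {
		if r.Contains(a) {
			return true
		}
	}
	return false
}

// ClientIP returns the address to verify. remote must be a bare IP, without
// a port. The header is ignored unless remote is itself a trusted proxy.
func ClientIP(remote, xff string, trusted []netip.Prefix) string {
	peer, err := netip.ParseAddr(remote)
	if err != nil || !inAny(trusted, peer) || xff == "" {
		return remote
	}
	// Walk right to left: entries appended by trusted proxies are skipped,
	// and the first untrusted hop is taken as the client.
	hops := strings.Split(xff, ",")
	for i := len(hops) - 1; i >= 0; i-- {
		hop, err := netip.ParseAddr(strings.TrimSpace(hops[i]))
		if err != nil {
			return remote // malformed header: fall back to the socket peer
		}
		if !inAny(trusted, hop) {
			return hop.String()
		}
	}
	return remote
}

func main() {
	trusted := MustPrefixes("10.0.0.0/8")
	// Direct connection from the internet: the header is ignored.
	fmt.Println(ClientIP("203.0.113.9", "198.51.100.7", trusted)) // 203.0.113.9
	// Behind a trusted proxy: the forged left-most entry is not used.
	fmt.Println(ClientIP("10.0.0.5", "203.0.113.50, 198.51.100.7", trusted)) // 198.51.100.7
}
```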
V1 supports four actions:
| Action | Effect |
|---|---|
| `allow` | Let the request pass |
| `block` | Return an error response immediately |
| `rate_limit` | Allow within a configured rate, then return 429 |
| `allow_metered` | Allow the request and record pricing metadata |
allow_metered is intentionally narrow. It does not try to settle payment,
issue invoices, or do 402 handshakes. It records the metering event and signs a
receipt so that payment can be built later without changing the core decision
engine.
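For `rate_limit`, a per-key request budget per minute (as in `key: "bot.id"` with `rpm: 120`) can be approximated with a fixed-window counter. The sketch below is an assumption about one simple way to do it; the algorithm used by CrawlWall's in-memory limiter is not specified here, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type window struct {
	start time.Time
	count int
}

// Limiter is a fixed-window, in-memory limiter keyed by a string such as bot.id.
type Limiter struct {
	mu      sync.Mutex
	rpm     int
	windows map[string]*window
}

func NewLimiter(rpm int) *Limiter {
	return &Limiter{rpm: rpm, windows: make(map[string]*window)}
}

// Allow reports whether one more request for key fits in the current minute.
// A false result is what the handler would translate into a 429.
func (l *Limiter) Allow(key string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	w, ok := l.windows[key]
	if !ok || now.Sub(w.start) >= time.Minute {
		w = &window{start: now} // start a fresh one-minute window
		l.windows[key] = w
	}
	if w.count >= l.rpm {
		return false
	}
	w.count++
	return true
}

// at builds a timestamp n seconds after the Unix epoch, for deterministic demos.
func at(n int) time.Time { return time.Unix(int64(n), 0) }

func main() {
	l := NewLimiter(2)
	fmt.Println(l.Allow("gptbot", at(0)))  // true
	fmt.Println(l.Allow("gptbot", at(1)))  // true
	fmt.Println(l.Allow("gptbot", at(2)))  // false: budget used up
	fmt.Println(l.Allow("gptbot", at(61))) // true: new window
}
```

A fixed window is easy to reason about but lets a client burst at window boundaries; a token bucket or sliding window smooths that out at the cost of a little more state.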
The CLI exists to make policy iteration less miserable:
```sh
go run ./cmd/crawlwall init --profile minimal
go run ./cmd/crawlwall policy check --config ./crawlwall.yaml
go run ./cmd/crawlwall policy eval \
  --config ./crawlwall.yaml --ua "GPTBot/1.1" \
  --path "/archive/a" --ip 20.125.66.81
go run ./cmd/crawlwall policy test \
  --config ./crawlwall.yaml --fixtures ./examples/policy-fixtures.yaml
go run ./cmd/crawlwall verifiers status --config ./crawlwall.yaml
go run ./cmd/crawlwall ledger report --db ./crawlwall.db --since 24h
go run ./cmd/crawlwall ledger export --db ./crawlwall.db --format jsonl
go run ./cmd/crawlwall ledger vacuum --db ./crawlwall.db --older-than 30d
go run ./cmd/crawlwall receipts verify \
  --file ./ledger-export.jsonl --public-key ./crawlwall.pub
```
Useful split:
- `init`: create a starting point
- `policy check`: validate and compile
- `policy eval`: answer "what would happen to this request?"
- `policy test`: run fixture-based policy regression tests
- `verifiers status`: show IP range verifier cache health
- `ledger report`: summarize observed traffic
- `ledger export`: dump the event log
- `ledger vacuum`: delete old events and compact the SQLite file
- `receipts verify`: validate signed receipt output
```text
cmd/crawlwall/       CLI
docs/                usage guides
examples/            starter policies
internal/bot/        user-agent matching and bot registry
internal/config/     YAML load and validation
internal/ledger/     ledger interface and SQLite backend
internal/policy/     CEL environment, compile, evaluate
internal/ratelimit/  in-memory limiter
internal/receipt/    canonical receipts and Ed25519 signing
internal/scaffold/   starter templates for init
internal/verify/     reverse DNS and IP range verifiers
```
The main interface worth caring about is the ledger boundary. Request handling
depends only on an EventWriter contract: one fully-formed event in, storage
error out. Reporting and export are separate interfaces, so a Postgres,
webhook, or queue-backed writer does not have to pretend it is SQLite.
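The shape of that boundary might look roughly like the sketch below. Only the `EventWriter` name comes from the text above; the method signature, the `Event` fields, and the in-memory writer are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Event is a hypothetical, fully-formed ledger record.
type Event struct {
	ID     string
	BotID  string
	Action string
	Path   string
}

// EventWriter is the only thing request handling needs from storage:
// one event in, storage error out.
type EventWriter interface {
	WriteEvent(ctx context.Context, e Event) error
}

// MemoryWriter is a toy backend; a SQLite, Postgres, webhook, or queue
// writer would satisfy the same interface. It is not safe for concurrent use.
type MemoryWriter struct {
	Events []Event
}

func (m *MemoryWriter) WriteEvent(_ context.Context, e Event) error {
	if e.ID == "" {
		return errors.New("event ID is required")
	}
	m.Events = append(m.Events, e)
	return nil
}

// Record stands in for the request path: it knows the interface, not the backend.
func Record(w EventWriter, e Event) error {
	return w.WriteEvent(context.Background(), e)
}

func main() {
	w := &MemoryWriter{}
	err := Record(w, Event{ID: "evt-0001", BotID: "gptbot", Action: "allow_metered", Path: "/archive/a"})
	fmt.Println(err, len(w.Events)) // <nil> 1
}
```

Keeping reporting and export out of this interface means a write-only sink, such as a webhook, never has to implement queries it cannot answer.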
Included in V1:
- Caddy handler
- CEL policy engine
- reverse DNS verification
- IP range verification
- verifier cache status checks
- shadow mode for dry-run policy rollout
- SQLite ledger
- ledger retention cleanup
- signed receipts
- local reporting and export
- policy fixture tests
Deliberately not included in V1:
- payment processing
- dashboards
- distributed quotas
- extra webserver integrations
- policy languages beyond CEL
The current implementation has been exercised with:
- `go test ./...`
- custom Caddy builds through `xcaddy`
- live requests through Caddy to a local upstream
- verifier cache status checks
- policy fixture tests
- ledger export
- receipt verification
- integration tests for blocked, shadowed, metered, and rate-limited flows
That means the current claim is modest but honest:
CrawlWall is a self-hosted Caddy crawler access-control layer. It identifies bots, verifies identity, evaluates CEL policy rules, enforces allow/block/rate-limit/metered decisions, stores a crawler ledger, and exports signed crawl receipts.
Use GitHub Issues for bugs, security-relevant behavior questions, and integration reports. Include the Caddyfile, CrawlWall policy, request path, user agent, and observed ledger row when possible.
MIT. See LICENSE.
Cloudflare's pay-per-crawl docs were useful inspiration for separating metering from payment, but CrawlWall stays much smaller and self-hosted: