CrawlWall: Caddy AI Crawler Access Control

Important

CrawlWall is alpha and experimental. It is useful for local testing, demos, and careful shadow-mode trials, but it is not yet a battle-tested production security boundary. Review the policy, verifier, and ledger behavior before enforcing blocks on real traffic.

A self-hosted Caddy module for AI crawler blocking, bot verification, rate limiting, metered access, and signed crawl receipts.

CrawlWall sits in front of your application and turns robots.txt-style crawler policy into enforceable HTTP-edge rules using YAML and CEL. It identifies crawlers, verifies their identity, evaluates policy, records what happened, and can sign receipts for metered access.

The short version is:

robots.txt is advisory
CrawlWall is enforcement
YAML is the config container
CEL is the policy language
Caddy is the runtime

Why this exists
Mental model
Architecture
How a request is handled
Getting started
Requirements
Keys and receipts
Policy shape
Writing policy rules
Verifiers
Client IP and trusted proxies
Actions
CLI
Project layout
Scope
Status
Help
License

Why this exists

Sites increasingly need something more precise than:

"please do not crawl this"
"this bot says it is Google"
"this path should maybe cost money"

That is awkward to express with robots.txt, awkward to audit in application code, and annoying to keep consistent across services.

CrawlWall moves that logic into the HTTP edge and gives it a stable shape:

Concern	CrawlWall answer
Is this crawler known?	Match on `User-Agent`
Is it really that crawler?	Verify by reverse DNS or IP ranges
What should happen?	Evaluate CEL rules in priority order
Need proof later?	Write a ledger event with a stable event ID
Need metering?	`allow_metered` + signed receipts

The point is not to be clever. The point is to be explicit, inspectable, and replaceable.

Mental model

Think of CrawlWall as four subsystems glued together inside a Caddy handler:

Bot identification: map a request to a known bot definition or unknown.
Verification: decide whether the claimed crawler identity is trustworthy.
Policy evaluation: run CEL rules against the request context.
Audit trail: write the event and optionally sign a receipt.

That means the project is not "a YAML parser" and not "a crawler blocklist." It is a policy runtime.

Architecture

flowchart LR
 A["HTTP request"] --> B["Bot identifier"]
 B --> C["Verifier"]
 C --> D["Policy engine (CEL)"]
 D --> E["Decision"]
 E --> F["Allow / Block / Rate limit / Allow metered"]
 E --> H["Signed receipt (optional)"]
 H --> G["Ledger writer"]
 F --> I["Upstream app"]

The startup path matters as much as the request path.

At startup CrawlWall:

loads crawlwall.yaml
validates the config
compiles CEL expressions
opens the ledger backend
prepares verifiers
loads the receipt signer

If a CEL expression is broken, startup should fail. That is the right pain location.

How a request is handled

Step	What happens
1	Read `User-Agent` and identify the claimed crawler
2	Verify the request source according to that crawler's verifier
3	Build the policy input: `bot`, `request`, `site`, `sets`, `labels`
4	Evaluate rules in ascending priority order (lowest number first)
5	Enforce the first matching action
6	If requested, sign a receipt over the stable event ID
7	Write one ledger record containing the decision and receipt metadata

The policy input is intentionally small and boring. It is easier to extend a plain model than to untangle a magical one.
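
Steps 4 and 5 can be sketched in a few lines of Go. This is an illustration only, not CrawlWall's implementation: plain Go predicates stand in for compiled CEL expressions, and the `Input`, `Rule`, and `Decide` names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Input is a hypothetical, trimmed-down stand-in for the policy input.
type Input struct {
	BotClass string
	Claimed  bool
	Verified bool
	Path     string
}

// Rule pairs a Go predicate (standing in for a compiled CEL expression)
// with the action to enforce when it matches.
type Rule struct {
	ID       string
	Priority int
	When     func(Input) bool
	Action   string
}

// Decide evaluates rules in ascending priority order and returns the first
// match, falling back to defaultAction when nothing matches.
func Decide(rules []Rule, in Input, defaultAction string) (ruleID, action string) {
	sorted := append([]Rule(nil), rules...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority < sorted[j].Priority
	})
	for _, r := range sorted {
		if r.When(in) {
			return r.ID, r.Action
		}
	}
	return "", defaultAction
}

// onProtectedPath mirrors sets.protected_paths.exists(p, request.path.startsWith(p)).
func onProtectedPath(path string) bool {
	for _, prefix := range []string{"/archive", "/datasets", "/reports"} {
		if strings.HasPrefix(path, prefix) {
			return true
		}
	}
	return false
}

// exampleRules lists three rules from the example config, deliberately out of order.
func exampleRules() []Rule {
	return []Rule{
		{ID: "rate_limit_ai_training_elsewhere", Priority: 300, Action: "rate_limit",
			When: func(in Input) bool { return in.Verified && in.BotClass == "ai_training" }},
		{ID: "block_spoofed_known_bots", Priority: 10, Action: "block",
			When: func(in Input) bool { return in.Claimed && !in.Verified }},
		{ID: "meter_training_on_protected_paths", Priority: 200, Action: "allow_metered",
			When: func(in Input) bool {
				return in.Verified && in.BotClass == "ai_training" && onProtectedPath(in.Path)
			}},
	}
}

func main() {
	in := Input{BotClass: "ai_training", Claimed: true, Verified: true, Path: "/archive/a"}
	id, action := Decide(exampleRules(), in, "allow")
	fmt.Println(id, action) // meter_training_on_protected_paths allow_metered
}
```

Because the first match wins, the metering rule at priority 200 shadows the broader rate-limit rule at priority 300 for protected paths.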

Getting started

Do not start from a blank file.

This repo ships with two starter policies and a policy fixture file:

File	Use it when
`examples/minimal.yaml`	You want a readable starter with no receipt signing
`examples/full.yaml`	You want the full V1 shape with metering and signed receipts
`examples/policy-fixtures.yaml`	You want regression tests for policy behavior

There is also a scaffold command:

go run ./cmd/crawlwall init --profile minimal
go run ./cmd/crawlwall init --profile full

That writes:

crawlwall.yaml
Caddyfile
.gitignore
crawlwall.key and crawlwall.pub unless you disable key generation

If you want the scaffold without keys yet:

go run ./cmd/crawlwall init --profile minimal --generate-keys=false

Requirements

Go matching the version in go.mod
xcaddy to build a Caddy binary with the CrawlWall module
Caddy for config validation and runtime
A SQLite ledger path when ledger.enabled is true

Build

Build a custom Caddy binary with xcaddy.

From this local checkout:

go mod tidy
xcaddy build --with github.com/jolovicdev/crawlwall=.

From a published module version:

xcaddy build --with github.com/jolovicdev/crawlwall@latest

Check that the module is present:

caddy list-modules | grep crawlwall

Validate the config:

go run ./cmd/crawlwall policy check --config ./crawlwall.yaml
caddy validate --config ./Caddyfile --adapter caddyfile

Run it:

caddy run --config ./Caddyfile --adapter caddyfile

Try a few requests:

curl http://localhost:8080/
curl http://localhost:8080/archive/a
curl -A "GPTBot/1.1" http://localhost:8080/archive/a

Note

The docs use plain executable names on purpose. Use whatever binary name your environment produces.

Keys and receipts

Receipt signing uses Ed25519.

The private key is sensitive and should never be committed. This repo ignores it by default in .gitignore.

You have two normal ways to create keys:

let crawlwall init generate them
generate them yourself with openssl

openssl genpkey -algorithm Ed25519 -out crawlwall.key
openssl pkey -in crawlwall.key -pubout -out crawlwall.pub

Receipt config looks like this:

receipts:
  enabled: true
  signer:
    type: ed25519
    key_file: ./crawlwall.key

Receipts are for proving what decision was made for a request. In V1 they are used for metered access and audit, not settlement.
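
As a rough illustration of what signing over a stable event ID means, here is a minimal Go sketch using the standard library's `crypto/ed25519`. The `Receipt` struct, its fields, and the base64 encoding are assumptions made for the example; CrawlWall's canonical receipt format is not shown here, and a real setup would load the key from `crawlwall.key` instead of generating one in memory.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// Receipt is a hypothetical minimal receipt: an event ID plus a signature.
type Receipt struct {
	EventID   string
	Signature string // base64-encoded Ed25519 signature over EventID
}

// NewKeyPair generates an in-memory Ed25519 key pair for the demo.
func NewKeyPair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// SignReceipt signs the event ID with the private key.
func SignReceipt(priv ed25519.PrivateKey, eventID string) Receipt {
	sig := ed25519.Sign(priv, []byte(eventID))
	return Receipt{EventID: eventID, Signature: base64.StdEncoding.EncodeToString(sig)}
}

// VerifyReceipt checks the signature against the public key.
func VerifyReceipt(pub ed25519.PublicKey, r Receipt) bool {
	sig, err := base64.StdEncoding.DecodeString(r.Signature)
	if err != nil {
		return false
	}
	return ed25519.Verify(pub, []byte(r.EventID), sig)
}

func main() {
	pub, priv, err := NewKeyPair()
	if err != nil {
		panic(err)
	}
	r := SignReceipt(priv, "evt_example_1")
	fmt.Println("valid:", VerifyReceipt(pub, r))

	r.EventID = "evt_example_2" // any change to the signed data invalidates the receipt
	fmt.Println("valid after tampering:", VerifyReceipt(pub, r))
}
```

Only the public key is needed to verify, which is why `crawlwall.pub` can be shared while `crawlwall.key` stays private.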

Policy shape

The top-level config model is stable even if the individual rules change:

Section	Purpose
`site`	Site identity and mode
`runtime`	Failure behavior and default action
`ledger`	Event recording settings
`receipts`	Receipt signer configuration
`bots`	Known crawler definitions and verifier settings
`sets`	Reusable policy data
`rules`	CEL expressions plus actions

site.mode controls enforcement:

Mode	Effect
`shadow`	Log decisions without enforcing blocks or rate limits
`observe`	Alias for `shadow`, kept for older configs
`enforce`	Enforce policy decisions

Use shadow before blocking crawlers on a production site. It lets you inspect the ledger first, which is less exciting than debugging a self-inflicted 403 storm.
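
The mode table can be read as a small function from the decided action to the action actually applied. The Go sketch below is an assumption about that mapping, not CrawlWall's code; in particular, leaving `allow_metered` unchanged in shadow mode is a guess, and the `Apply` name is hypothetical.

```go
package main

import "fmt"

// Apply returns the action applied to the request for a given site.mode.
// The decided action is what the ledger would record either way.
func Apply(mode, decided string) (string, error) {
	switch mode {
	case "enforce":
		return decided, nil
	case "shadow", "observe": // observe is an alias for shadow
		if decided == "block" || decided == "rate_limit" {
			return "allow", nil // logged, but not enforced
		}
		return decided, nil
	default:
		return "", fmt.Errorf("unknown site.mode %q", mode)
	}
}

func main() {
	for _, mode := range []string{"shadow", "enforce"} {
		applied, err := Apply(mode, "block")
		if err != nil {
			panic(err)
		}
		fmt.Printf("mode=%s decided=block applied=%s\n", mode, applied)
	}
}
```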

Writing policy rules

Start with the policy guide. It explains the available CEL inputs, rule priority, shadow mode, common recipes, verifier cache status, and fixture tests.

Example Caddyfile

{
	order crawlwall before reverse_proxy
}
:8080 {
	crawlwall {
		policy ./crawlwall.yaml
		ledger sqlite://./crawlwall.db
		fail_mode block
	}
	reverse_proxy localhost:3000
}

Example rule

- id: meter_training_on_protected_paths
  priority: 200
  when: >
    bot.verified &&
    bot.class == "ai_training" &&
    sets.protected_paths.exists(p, request.path.startsWith(p))
  action:
    type: allow_metered
    price:
      amount: 0.002
      currency: USD
      unit: request
  audit:
    receipt: true
    tags: ["ai_training", "metered"]

Example config

Full V1 policy example:

version: crawlwall.io/v1
site:
  id: local-dev
  host: localhost
  mode: enforce
runtime:
  fail_mode: block
  default_action:
    type: allow
ledger:
  enabled: true
receipts:
  enabled: true
  signer:
    type: ed25519
    key_file: ./crawlwall.key
bots:
  - id: googlebot
    name: Googlebot
    class: search
    match:
      user_agents:
        - "Googlebot"
    verify:
      type: reverse_dns
      allowed_suffixes:
        - ".googlebot.com"
        - ".google.com"
  - id: gptbot
    name: GPTBot
    class: ai_training
    match:
      user_agents:
        - "GPTBot"
    verify:
      type: ip_ranges
      sources:
        - "https://openai.com/gptbot.json"
      refresh: 1h
      stale_action: fail_closed
      max_stale: 0s
  - id: unknown
    name: Unknown
    class: unknown
    match:
      default: true
    verify:
      type: none
sets:
  protected_paths:
    - "/archive"
    - "/datasets"
    - "/reports"
  known_ai_training:
    - "gptbot"
    - "claudebot"
rules:
  - id: block_spoofed_known_bots
    priority: 10
    when: >
      bot.claimed && !bot.verified
    action:
      type: block
      status: 403
      reason: spoofed_bot
    audit:
      receipt: true
      tags: ["spoofed", "security"]
  - id: allow_verified_search
    priority: 100
    when: >
      bot.verified && bot.class == "search"
    action:
      type: allow
    audit:
      receipt: false
      tags: ["search"]
  - id: meter_training_on_protected_paths
    priority: 200
    when: >
      bot.verified &&
      bot.class == "ai_training" &&
      sets.protected_paths.exists(p, request.path.startsWith(p))
    action:
      type: allow_metered
      price:
        amount: 0.002
        currency: USD
        unit: request
    audit:
      receipt: true
      tags: ["ai_training", "metered"]
  - id: rate_limit_ai_training_elsewhere
    priority: 300
    when: >
      bot.verified && bot.class == "ai_training"
    action:
      type: rate_limit
      limit:
        key: "bot.id"
        rpm: 120
    audit:
      receipt: true
      tags: ["ai_training"]
  - id: block_unknown_protected_paths
    priority: 900
    when: >
      bot.class == "unknown" &&
      sets.protected_paths.exists(p, request.path.startsWith(p))
    action:
      type: block
      status: 403
      reason: unknown_crawler_protected_path
    audit:
      receipt: true
      tags: ["unknown", "blocked"]

Verifiers

V1 ships with three verifier types:

Verifier	What it means
`none`	No verification step; useful for the `unknown` catch-all bot
`reverse_dns`	Verify by PTR lookup and forward-confirm the result
`ip_ranges`	Verify by matching the request IP against fetched CIDR ranges

`reverse_dns`

This is the standard pattern used for bots like Googlebot:

resolve remote IP to PTR names
require a configured suffix match
resolve the PTR hostname back to A/AAAA
require the original IP to be present

Completed reverse-DNS decisions are cached per IP for five minutes to avoid doing PTR and forward lookups on every request from a claimed crawler.
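
The four steps above are the usual forward-confirmed reverse DNS check. A minimal Go sketch follows, assuming a small `Resolver` interface that `*net.Resolver` already satisfies; the `StaticResolver`, `VerifyReverseDNS`, and `Check` names are hypothetical, and the five-minute cache is left out to keep the example short.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strings"
)

// Resolver is the subset of *net.Resolver this sketch needs.
type Resolver interface {
	LookupAddr(ctx context.Context, addr string) ([]string, error)
	LookupHost(ctx context.Context, host string) ([]string, error)
}

// StaticResolver is an in-memory resolver so the example runs offline.
type StaticResolver struct {
	PTR   map[string][]string // IP -> PTR names
	Hosts map[string][]string // hostname -> A/AAAA addresses
}

func (s StaticResolver) LookupAddr(_ context.Context, addr string) ([]string, error) {
	return s.PTR[addr], nil
}

func (s StaticResolver) LookupHost(_ context.Context, host string) ([]string, error) {
	return s.Hosts[host], nil
}

// VerifyReverseDNS resolves ip to PTR names, keeps names with an allowed
// suffix, resolves those names forward, and requires ip among the results.
func VerifyReverseDNS(ctx context.Context, r Resolver, ip string, suffixes []string) (bool, error) {
	want := net.ParseIP(ip)
	if want == nil {
		return false, fmt.Errorf("invalid IP %q", ip)
	}
	names, err := r.LookupAddr(ctx, ip)
	if err != nil {
		return false, err
	}
	for _, name := range names {
		host := strings.ToLower(strings.TrimSuffix(name, ".")) // PTR names end with a dot
		if !hasAllowedSuffix(host, suffixes) {
			continue
		}
		addrs, err := r.LookupHost(ctx, host)
		if err != nil {
			continue // try the next PTR name
		}
		for _, a := range addrs {
			if got := net.ParseIP(a); got != nil && got.Equal(want) {
				return true, nil
			}
		}
	}
	return false, nil
}

func hasAllowedSuffix(host string, suffixes []string) bool {
	for _, s := range suffixes {
		if strings.HasSuffix(host, s) {
			return true
		}
	}
	return false
}

// Check is a convenience wrapper that treats lookup errors as "not verified".
func Check(r Resolver, ip string, suffixes []string) bool {
	ok, err := VerifyReverseDNS(context.Background(), r, ip, suffixes)
	return err == nil && ok
}

func main() {
	r := StaticResolver{
		PTR:   map[string][]string{"192.0.2.10": {"crawl-192-0-2-10.googlebot.com."}},
		Hosts: map[string][]string{"crawl-192-0-2-10.googlebot.com": {"192.0.2.10"}},
	}
	fmt.Println(Check(r, "192.0.2.10", []string{".googlebot.com", ".google.com"}))
}
```

Passing `net.DefaultResolver` instead of the static one performs real lookups. The addresses above come from documentation ranges and are not real crawler IPs.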

`ip_ranges`

This is the simpler model for bots that publish source ranges:

fetch remote JSON
extract CIDRs
cache them in memory
refresh on the configured interval
match the request IP against the cache
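
The final matching step can be sketched with Go's `net/netip`. This example assumes the CIDR strings have already been extracted from the provider document, since that document's JSON layout is not described here; `ParseRanges` and `InRanges` are hypothetical names and the prefixes are documentation ranges, not real crawler ranges.

```go
package main

import (
	"fmt"
	"net/netip"
)

// ParseRanges converts CIDR strings into prefixes, failing on the first bad entry.
func ParseRanges(cidrs []string) ([]netip.Prefix, error) {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", c, err)
		}
		out = append(out, p.Masked()) // normalize host bits to zero
	}
	return out, nil
}

// InRanges reports whether ip falls inside any cached prefix.
func InRanges(ranges []netip.Prefix, ip string) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	for _, p := range ranges {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	ranges, err := ParseRanges([]string{"192.0.2.0/24", "2001:db8::/32"})
	if err != nil {
		panic(err)
	}
	for _, ip := range []string{"192.0.2.17", "198.51.100.1", "2001:db8::1"} {
		fmt.Printf("%s in ranges: %v\n", ip, InRanges(ranges, ip))
	}
}
```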

Important

A GPTBot request only verifies as true if the actual source IP falls inside OpenAI's published GPTBot ranges at evaluation time.

There is one unavoidable freshness tradeoff: CrawlWall can only know about an IP range rotation after it refreshes the provider document. A shorter refresh reduces that window but makes more network calls.

When a refresh is due and the provider document cannot be fetched, stale_action controls whether the expired cache is still trusted:

Field	Default	Meaning
`refresh`	`12h`	How often to refetch the range document
`stale_action`	`fail_closed`	Refuse expired ranges after refresh failure
`max_stale`	`0s`	Extra stale-cache time for `use_stale`

Security-first config:

verify:
  type: ip_ranges
  sources:
    - "https://openai.com/gptbot.json"
  refresh: 1h
  stale_action: fail_closed
  max_stale: 0s

Availability-first config:

verify:
  type: ip_ranges
  sources:
    - "https://openai.com/gptbot.json"
  refresh: 1h
  stale_action: use_stale
  max_stale: 24h

Use fail_closed when spoof resistance matters more than crawler availability. Use use_stale only when temporarily blocking a legitimate crawler is worse than trusting a bounded stale range cache.
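
One way to picture the tradeoff is as a function of cache age. The Go sketch below assumes that age is measured from the last successful fetch and that `max_stale` starts counting when the refresh came due; both are assumptions for illustration, and `StalePolicy` and `Trust` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// StalePolicy mirrors the refresh, stale_action, and max_stale fields.
type StalePolicy struct {
	Refresh     time.Duration
	StaleAction string // "fail_closed" or "use_stale"
	MaxStale    time.Duration
}

// Trust reports whether ranges fetched `age` ago may still be used,
// assuming every refresh attempt since then has failed.
func (p StalePolicy) Trust(age time.Duration) bool {
	if age <= p.Refresh {
		return true // still fresh, no refresh was due
	}
	if p.StaleAction == "use_stale" {
		return age <= p.Refresh+p.MaxStale // bounded stale window
	}
	return false // fail_closed: expired ranges are refused
}

// mins is a small helper for readable durations.
func mins(n int) time.Duration { return time.Duration(n) * time.Minute }

func main() {
	security := StalePolicy{Refresh: mins(60), StaleAction: "fail_closed"}
	availability := StalePolicy{Refresh: mins(60), StaleAction: "use_stale", MaxStale: mins(24 * 60)}

	for _, age := range []time.Duration{mins(30), mins(90), mins(26 * 60)} {
		fmt.Printf("age=%v fail_closed=%v use_stale=%v\n",
			age, security.Trust(age), availability.Trust(age))
	}
}
```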

Client IP and trusted proxies

CrawlWall verifies crawlers against Caddy's trusted-proxy-aware client IP. If Caddy receives traffic directly, that is the socket remote address. If Caddy is behind a CDN, load balancer, or reverse proxy, configure Caddy's server-level trusted_proxies and client_ip_headers options so forwarded client IP headers are trusted only from known proxy ranges.

Example:

{
	servers {
		trusted_proxies static private_ranges
		client_ip_headers X-Forwarded-For CF-Connecting-IP
	}
	order crawlwall before reverse_proxy
}

Do not trust arbitrary X-Forwarded-For headers from the public internet. That turns crawler verification into wishful thinking with a header parser.
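
To make the risk concrete, here is a generic Go sketch of trusted-proxy client IP selection. It is not Caddy's implementation and its header parsing rules may differ: the sketch honors `X-Forwarded-For` only when the socket peer is a trusted proxy, then walks the header right to left and returns the first address that is not itself a trusted proxy. `ClientIP` and `MustPrefixes` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// MustPrefixes parses CIDR strings and panics on invalid input.
func MustPrefixes(cidrs ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c))
	}
	return out
}

func isTrusted(addr netip.Addr, trusted []netip.Prefix) bool {
	for _, p := range trusted {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// ClientIP picks the address to verify crawlers against.
// remoteAddr is the socket peer in "ip:port" form; xff is the raw header value.
func ClientIP(remoteAddr, xff string, trusted []netip.Prefix) (string, error) {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return "", fmt.Errorf("parse remote address %q: %w", remoteAddr, err)
	}
	remote := ap.Addr().Unmap()
	if !isTrusted(remote, trusted) {
		return remote.String(), nil // direct traffic: ignore forwarded headers
	}
	parts := strings.Split(xff, ",")
	for i := len(parts) - 1; i >= 0; i-- {
		addr, err := netip.ParseAddr(strings.TrimSpace(parts[i]))
		if err != nil {
			break // stop at the first unparsable entry
		}
		addr = addr.Unmap()
		if !isTrusted(addr, trusted) {
			return addr.String(), nil
		}
	}
	return remote.String(), nil
}

func main() {
	trusted := MustPrefixes("10.0.0.0/8")

	// A spoofed header from the public internet is ignored.
	ip, _ := ClientIP("203.0.113.9:5000", "192.0.2.1", trusted)
	fmt.Println(ip) // 203.0.113.9

	// The same header is honored when it arrives through a trusted proxy.
	ip, _ = ClientIP("10.0.0.5:4321", "192.0.2.1, 10.0.0.7", trusted)
	fmt.Println(ip) // 192.0.2.1
}
```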

Actions

V1 supports four actions:

Action	Effect
`allow`	Let the request pass
`block`	Return an error response immediately
`rate_limit`	Allow within a configured rate, then return `429`
`allow_metered`	Allow the request and record pricing metadata

allow_metered is intentionally narrow. It does not try to settle payment, issue invoices, or do 402 handshakes. It records the metering event and signs a receipt so that payment can be built later without changing the core decision engine.
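
For `rate_limit`, the `rpm` and `key` fields suggest a per-key requests-per-minute counter. The sketch below uses a fixed one-minute window, which is an assumption; the algorithm used by CrawlWall's in-memory limiter is not described here, and `Limiter` and `FakeClock` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type window struct {
	start time.Time
	count int
}

// Limiter is a fixed-window, per-key requests-per-minute limiter.
type Limiter struct {
	mu      sync.Mutex
	rpm     int
	now     func() time.Time
	windows map[string]window
}

// NewLimiter builds a limiter; now is injectable so tests can control time.
func NewLimiter(rpm int, now func() time.Time) *Limiter {
	return &Limiter{rpm: rpm, now: now, windows: make(map[string]window)}
}

// Allow reports whether one more request for key fits in the current window.
// A false result is where a handler would respond with 429.
func (l *Limiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := l.now()
	w, ok := l.windows[key]
	if !ok || now.Sub(w.start) >= time.Minute {
		w = window{start: now} // start a new window
	}
	if w.count >= l.rpm {
		l.windows[key] = w
		return false
	}
	w.count++
	l.windows[key] = w
	return true
}

// FakeClock is a manually advanced clock for deterministic examples.
type FakeClock struct{ t time.Time }

func (c *FakeClock) Now() time.Time      { return c.t }
func (c *FakeClock) Advance(seconds int) { c.t = c.t.Add(time.Duration(seconds) * time.Second) }

func main() {
	l := NewLimiter(2, time.Now)
	for i := 1; i <= 3; i++ {
		fmt.Printf("request %d for gptbot allowed: %v\n", i, l.Allow("gptbot"))
	}
}
```

Keying on `bot.id`, as the example rule does, means every verified request from one crawler shares a single budget regardless of source IP.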

CLI

The CLI exists to make policy iteration less miserable:

go run ./cmd/crawlwall init --profile minimal
go run ./cmd/crawlwall policy check --config ./crawlwall.yaml
go run ./cmd/crawlwall policy eval \
 --config ./crawlwall.yaml --ua "GPTBot/1.1" \
 --path "/archive/a" --ip 20.125.66.81
go run ./cmd/crawlwall policy test \
 --config ./crawlwall.yaml --fixtures ./examples/policy-fixtures.yaml
go run ./cmd/crawlwall verifiers status --config ./crawlwall.yaml
go run ./cmd/crawlwall ledger report --db ./crawlwall.db --since 24h
go run ./cmd/crawlwall ledger export --db ./crawlwall.db --format jsonl
go run ./cmd/crawlwall ledger vacuum --db ./crawlwall.db --older-than 30d
go run ./cmd/crawlwall receipts verify \
 --file ./ledger-export.jsonl --public-key ./crawlwall.pub

Useful split:

init: create a starting point
policy check: validate and compile
policy eval: answer "what would happen to this request?"
policy test: run fixture-based policy regression tests
verifiers status: show IP range verifier cache health
ledger report: summarize observed traffic
ledger export: dump the event log
ledger vacuum: delete old events and compact the SQLite file
receipts verify: validate signed receipt output

Project layout

cmd/crawlwall/ CLI
docs/ usage guides
examples/ starter policies
internal/bot/ user-agent matching and bot registry
internal/config/ YAML load and validation
internal/ledger/ ledger interface and SQLite backend
internal/policy/ CEL environment, compile, evaluate
internal/ratelimit/ in-memory limiter
internal/receipt/ canonical receipts and Ed25519 signing
internal/scaffold/ starter templates for init
internal/verify/ reverse DNS and IP range verifiers

The main interface worth caring about is the ledger boundary. Request handling depends only on an EventWriter contract: one fully-formed event in, storage error out. Reporting and export are separate interfaces, so a Postgres, webhook, or queue-backed writer does not have to pretend it is SQLite.
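
The shape of that boundary can be sketched as a one-method Go interface. The method name, the `Event` fields, and the `MemoryWriter` backend below are hypothetical; only the contract (one fully-formed event in, storage error out) comes from the description above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Event is a hypothetical, trimmed-down ledger event.
type Event struct {
	ID     string
	BotID  string
	RuleID string
	Action string
}

// EventWriter is the only thing request handling needs from a ledger backend.
type EventWriter interface {
	WriteEvent(ctx context.Context, e Event) error
}

// MemoryWriter is a toy backend; a SQLite, Postgres, webhook, or queue
// writer would satisfy the same interface.
type MemoryWriter struct {
	Events []Event
}

func (m *MemoryWriter) WriteEvent(_ context.Context, e Event) error {
	if e.ID == "" {
		return errors.New("event has no ID")
	}
	m.Events = append(m.Events, e)
	return nil
}

// Record is what a handler would call once the decision is fully formed.
func Record(w EventWriter, e Event) error {
	return w.WriteEvent(context.Background(), e)
}

func main() {
	ledger := &MemoryWriter{}
	err := Record(ledger, Event{ID: "evt_example_1", BotID: "gptbot", RuleID: "block_spoofed_known_bots", Action: "block"})
	fmt.Println(err, len(ledger.Events))
}
```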

Scope

Included in V1:

Caddy handler
CEL policy engine
reverse DNS verification
IP range verification
verifier cache status checks
shadow mode for dry-run policy rollout
SQLite ledger
ledger retention cleanup
signed receipts
local reporting and export
policy fixture tests

Deliberately not included in V1:

payment processing
dashboards
distributed quotas
extra webserver integrations
policy languages beyond CEL

Status

The current implementation has been exercised with:

go test ./...
custom Caddy builds through xcaddy
Caddy config validation
live requests through Caddy to a local upstream
verifier cache status checks
policy fixture tests
ledger export
receipt verification
integration tests for blocked, shadowed, metered, and rate-limited flows

That means the current claim is modest but honest:

CrawlWall is a self-hosted Caddy crawler access-control layer. It identifies bots, verifies identity, evaluates CEL policy rules, enforces allow/block/rate-limit/metered decisions, stores a crawler ledger, and exports signed crawl receipts.

Help

Use GitHub Issues for bugs, security-relevant behavior questions, and integration reports. Include the Caddyfile, CrawlWall policy, request path, user agent, and observed ledger row when possible.

License

MIT. See LICENSE.

Inspiration

Cloudflare's pay-per-crawl docs were useful inspiration for separating metering from payment, but CrawlWall stays much smaller and self-hosted:

What is pay per crawl?
