feat: add /add-guardrails skill — per-agent-group input/output guardrails#2726

Open

amit-shafnir wants to merge 4 commits into

nanocoai:main from

amit-shafnir:feat/add-guardrails

Open

feat: add /add-guardrails skill — per-agent-group input/output guardrails #2726
amit-shafnir wants to merge 4 commits into
nanocoai:main from
amit-shafnir:feat/add-guardrails

Conversation

@amit-shafnir

@amit-shafnir amit-shafnir commented Jun 10, 2026 •

edited

Loading

Copy link

Copy Markdown

Contributor

Summary

Adds the /add-guardrails skill: optional, per-agent-group input/output guardrails — deterministic regex/keyphrase rules (prompt-injection phrase blocking, credential-leak patterns) with block/flag actions, chat alerts, and a host-side quarantine audit trail. Fails closed on broken config; zero overhead for unconfigured groups.

This PR ships only .claude/skills/add-guardrails/ — no trunk behavior changes. Per the skills model, the guardrails code lives as frozen copies under the skill's resources/; users opt in by running /add-guardrails, which materializes the code and wiring into their install. REMOVE.md reverses it completely.

Enforcement layers (what the skill installs)

Inbound (host) — router gate in deliverToAgent(): a blocked message never wakes the container.
Inbound (container) — poll-loop hooks re-check rows that bypass the router (tasks, on_wake, a2a), running before pre-task scripts so a blocked task never executes its script.
Outbound (container) — result-text and all five MCP tool paths (send_message, send_file caption + display filename, edit_message, ask_user_question, send_card string leaves) get an actionable tool error on block.
Outbound (host) — delivery checkpoint in deliverMessage() re-checks every non-system row's content string leaves before platform delivery. This is the layer an injected agent cannot bypass by INSERTing into outbound.db directly; it has deliberately no exemption mechanism, and its alerts go straight through the adapter so they can't recurse.

Design notes

Every deliverable string is evaluated separately (never joined), so anchored regexes match and JSON escaping can't defeat keyphrases containing quotes/newlines; block beats flag across texts.
Default chat alerts name only the rule id + type — never the matched content or reason; the quarantine record keeps the full reason for audit, and a per-rule message field overrides the user-facing wording.
Quarantine is block-but-retain JSONL under data/guardrails/<group-id>/ (rotated, size-capped), never agent-readable; container-side input blocks carry messageId instead of content since outbound.db is agent-readable.
Strict whole-config validation: any invalid rule blocks all traffic for the group (with an admin alert) until fixed — picked up within ~5s, no restart.

Skill-guidelines conformance

Mostly-add shape; all reach-ins are minimal colocated dynamic-import + call blocks.
Behavior tests drive the real MCP tool handlers (mcp-hooks.test.ts, bun:test); structural AST/marker wiring tests pin every reach-in's placement, the hook order, and byte-identity of the intentionally duplicated rules.ts across the two trees.
Idempotent marker-guarded apply, complete REMOVE.md, resources shipped byte-identical to the applied tree.

Test plan

Verification was done by applying the skill to a checkout and running the full suites there:

pnpm run build + pnpm exec tsc -p container/agent-runner/tsconfig.json --noEmit clean on the applied tree
pnpm test — 397 host tests pass (incl. 48 guardrails + wiring)
bun test in container/agent-runner — 128 tests pass (incl. 27 guardrails + MCP-hook behavior tests)
diff -r both skill resource trees against the applied code — byte-identical (the applied tree is preserved on a local branch for future skill iteration)

🤖 Generated with Claude Code

@amit-shafnir amit-shafnir requested review from gabi-simons and gavrielc as code owners

June 10, 2026 12:32

@amit-shafnir @claude


 feat: add /add-guardrails skill — per-agent-group input/output guardr...

fd77109

...ails
Utility skill that installs optional deterministic guardrails for an
agent group: regex/keyphrase rules with block/flag actions, evaluated
per text at four enforcement layers (container inbound, MCP send hooks,
host outbound delivery, router inbound), fail-closed on broken config,
with a host-side quarantine audit trail and chat alerts.
Ships only .claude/skills/add-guardrails/ — no trunk behavior changes.
Resources were verified by applying them to a checkout: build and
typecheck clean, 397 host vitest + 128 container bun tests green;
resources are byte-identical to the applied tree.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@amit-shafnir amit-shafnir force-pushed the feat/add-guardrails branch from 06544fe to fd77109 Compare

June 10, 2026 13:04

@amit-shafnir amit-shafnir changed the title ~~(削除) feat: add /add-guardrails — per-agent-group input/output guardrails (削除ここまで)~~ (追記) feat: add /add-guardrails skill — per-agent-group input/output guardrails (追記ここまで)

Jun 10, 2026

amit-shafnir and others added 3 commits

June 10, 2026 17:40

@amit-shafnir @claude


 feat: generic module hook seams for optional host/container modules

fcda8ea

Optional modules can now attach to the message path without editing core:
- src/module-hooks.ts — inbound message gates (router, pre-write),
 outbound message gates (delivery, pre-send), mount contributors
 (container spawn). Empty registries are exact no-ops.
- container/agent-runner/src/hooks.ts — inbound batch hooks (run before
 pre-task scripts in both the initial and follow-up poll paths) and
 result-text hooks (null suppresses dispatch).
- mcp-tools/server.ts — tool-call middleware chain around the single
 dispatch chokepoint, covering every registered tool.
- container/agent-runner/src/modules.ts — registration barrel imported
 by both container entry points (poll loop + MCP server); wrong-process
 registrations are inert.
deliverMessage now takes the existing OutboundMessage type instead of an
inline duplicate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@amit-shafnir @claude


 fix: writeOutboundDirect opened outbound.db read-only, so every direc...

ace0c97

...t host write threw
openOutboundDb is readonly:true by design (host reads, container writes),
but writeOutboundDirect — the command gate's deny-response path — used it
for an INSERT, so every call threw 'attempt to write a readonly database'
and the deny response never reached the user. Use the RW opener, mirroring
the host-sweep orphan-claim fix (8d022fd). Callers only write when they've
decided not to wake the container for the message; even seq parity is
preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@amit-shafnir @claude


 refactor(add-guardrails): install via trunk hook seams — no core edits

3e30268

The skill previously instructed ~11 marker-guarded edit blocks across 8
core files, guarded by an AST wiring test. It now rides the generic hook
seams: install = copy the module dirs + one barrel-import line per side
+ build.
- host: index.ts is now pure import-time registration (inbound gate,
 outbound delivery gate, guardrails-dir RO mount contributor, quarantine
 delivery action); the old index.ts logic lives unchanged in inbound.ts;
 delivery-gate.ts absorbs the alert-delivery block that SKILL.md used to
 splice into deliverMessage.
- container: register.ts registers the inbound batch hook, result-text
 hook, and tool middleware; tool-middleware.ts guards every MCP send path
 through the dispatch chokepoint via a per-tool text-extractor map
 (covering a future tool = one map entry).
- guardrails-wiring.test.ts (AST test) deleted — there are no hand edits
 left to drift. The rules.ts host↔container byte-identity guard moved to
 registration.test.ts; mcp-hooks.test.ts now drives the middleware
 directly.
- SKILL.md phases collapse to cp + 2 barrel lines + build; REMOVE.md to
 the reverse. Two deliberate semantic shifts documented: MCP block alerts
 route via session routing (not the resolved to= destination), and the
 guard now runs before tool-handler validation (a blocked send_file
 filename can no longer probe path existence).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

This was referenced Jun 11, 2026

🦞 OpenClaw 生态日报 2026年06月11日 hehongtao88/big_model_radar#12

Open

🦞 OpenClaw 生态日报 2026年06月11日 litang9/big_model_radar#46

Open

Labels

None yet

1 participant

@amit-shafnir

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add /add-guardrails skill — per-agent-group input/output guardrails#2726

feat: add /add-guardrails skill — per-agent-group input/output guardrails #2726
amit-shafnir wants to merge 4 commits into
nanocoai:main from
amit-shafnir:feat/add-guardrails

Conversation

@amit-shafnir amit-shafnir commented Jun 10, 2026 •

edited

Loading

Uh oh!

Summary

Enforcement layers (what the skill installs)

Design notes

Skill-guidelines conformance

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

@amit-shafnir amit-shafnir commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Enforcement layers (what the skill installs)

Design notes

Skill-guidelines conformance

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

@amit-shafnir amit-shafnir commented Jun 10, 2026 •

edited

Loading