feat: add /add-guardrails skill — per-agent-group input/output guardrails #2726

Open
amit-shafnir wants to merge 4 commits into nanocoai:main from amit-shafnir:feat/add-guardrails

amit-shafnir (Contributor) commented Jun 10, 2026 (edited)

Summary

Adds the /add-guardrails skill: optional, per-agent-group input/output guardrails — deterministic regex/keyphrase rules (prompt-injection phrase blocking, credential-leak patterns) with block/flag actions, chat alerts, and a host-side quarantine audit trail. Fails closed on broken config; zero overhead for unconfigured groups.

This PR ships only .claude/skills/add-guardrails/ — no trunk behavior changes. Per the skills model, the guardrails code lives as frozen copies under the skill's resources/; users opt in by running /add-guardrails, which materializes the code and wiring into their install. REMOVE.md reverses it completely.

Enforcement layers (what the skill installs)

  • Inbound (host) — router gate in deliverToAgent(): a blocked message never wakes the container.
  • Inbound (container) — poll-loop hooks re-check rows that bypass the router (tasks, on_wake, a2a), running before pre-task scripts so a blocked task never executes its script.
  • Outbound (container) — result-text and all five MCP tool paths (send_message, send_file caption + display filename, edit_message, ask_user_question, send_card string leaves) get an actionable tool error on block.
  • Outbound (host) — delivery checkpoint in deliverMessage() re-checks the content string leaves of every non-system row before platform delivery. This is the layer an injected agent cannot bypass by INSERTing into outbound.db directly; it deliberately has no exemption mechanism, and its alerts go straight through the adapter so they can't recurse.
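
To illustrate what "content string leaves" could mean at the host delivery checkpoint, here is a minimal sketch that walks a JSON-like payload and collects every nested string so each one can be checked on its own. The `Json` type, the `stringLeaves` helper, and the example card shape are assumptions made for illustration; they are not taken from the skill's code.

```typescript
// Hypothetical JSON-like payload type (an assumption, not the skill's type).
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Collect every string value nested anywhere inside `value`, in traversal order.
// Non-string leaves (numbers, booleans, null) are ignored.
function stringLeaves(value: Json, out: string[] = []): string[] {
  if (typeof value === "string") {
    out.push(value);
  } else if (Array.isArray(value)) {
    for (const item of value) stringLeaves(item, out);
  } else if (value !== null && typeof value === "object") {
    for (const key in value) {
      const child = value[key];
      if (child !== undefined) stringLeaves(child, out);
    }
  }
  return out;
}

// Example: a card-like payload with nested button labels (shape is illustrative).
const exampleCard: Json = {
  title: "Deploy?",
  buttons: [{ label: "Yes" }, { label: "No" }],
  ttl: 30,
  note: null,
};
console.log(stringLeaves(exampleCard)); // [ 'Deploy?', 'Yes', 'No' ]
```

Each collected string would then be passed to the rule evaluator individually rather than as one serialized blob.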

Design notes

  • Every deliverable string is evaluated separately (never joined), so anchored regexes match and JSON escaping can't defeat keyphrases containing quotes/newlines; block beats flag across texts.
  • Default chat alerts name only the rule id + type — never the matched content or reason; the quarantine record keeps the full reason for audit, and a per-rule message field overrides the user-facing wording.
  • Quarantine is block-but-retain JSONL under data/guardrails/<group-id>/ (rotated, size-capped), never agent-readable; container-side input blocks carry messageId instead of content since outbound.db is agent-readable.
  • Strict whole-config validation: any invalid rule blocks all traffic for the group (with an admin alert) until fixed — picked up within ~5s, no restart.
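
The design notes above can be sketched roughly as follows: each text is evaluated separately, a block verdict beats a flag across all texts, and a single invalid rule makes the whole config fail closed. The `Rule` and `Verdict` shapes, the `invalid-config` id, and case-insensitive keyphrase matching are assumptions for illustration, not the skill's actual schema.

```typescript
type Action = "block" | "flag";

// Hypothetical rule shape (an assumption, not the skill's config schema).
interface Rule {
  id: string;
  type: "regex" | "keyphrase";
  pattern: string;
  action: Action;
}

interface Verdict {
  action: Action | "allow";
  ruleId?: string;
}

type Compiled = { rule: Rule; matches: (text: string) => boolean };

// Whole-config validation: returns null if ANY rule is invalid.
function compileRules(rules: Rule[]): Compiled[] | null {
  const compiled: Compiled[] = [];
  for (const rule of rules) {
    if (rule.pattern.length === 0) return null;
    if (rule.type === "keyphrase") {
      // Assumption: keyphrases match case-insensitively.
      const needle = rule.pattern.toLowerCase();
      compiled.push({ rule, matches: (t) => t.toLowerCase().includes(needle) });
    } else {
      try {
        const re = new RegExp(rule.pattern); // no "g" flag, so test() is stateless
        compiled.push({ rule, matches: (t) => re.test(t) });
      } catch {
        return null;
      }
    }
  }
  return compiled;
}

// Evaluate each text on its own (never joined); block beats flag across texts.
function evaluateTexts(rules: Rule[], texts: string[]): Verdict {
  const compiled = compileRules(rules);
  if (compiled === null) return { action: "block", ruleId: "invalid-config" }; // fail closed
  let flagged: Verdict | null = null;
  for (const text of texts) {
    for (const { rule, matches } of compiled) {
      if (!matches(text)) continue;
      if (rule.action === "block") return { action: "block", ruleId: rule.id };
      if (flagged === null) flagged = { action: "flag", ruleId: rule.id };
    }
  }
  return flagged ?? { action: "allow" };
}

const exampleRules: Rule[] = [
  { id: "inj-1", type: "keyphrase", pattern: "ignore previous instructions", action: "flag" },
  { id: "cred-1", type: "regex", pattern: "^sk-[A-Za-z0-9]{8,}$", action: "block" },
];
```

With these example rules, the anchored `cred-1` pattern matches the text `sk-abcdefgh12` when it is evaluated alone, but would not match if that text were first joined with another string, which is the reason for per-text evaluation.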

Skill-guidelines conformance

  • Mostly-add shape; all reach-ins are minimal colocated dynamic-import + call blocks.
  • Behavior tests drive the real MCP tool handlers (mcp-hooks.test.ts, bun:test); structural AST/marker wiring tests pin every reach-in's placement, the hook order, and byte-identity of the intentionally duplicated rules.ts across the two trees.
  • Idempotent marker-guarded apply, complete REMOVE.md, resources shipped byte-identical to the applied tree.

Test plan

Verification was done by applying the skill to a checkout and running the full suites there:

  • pnpm run build + pnpm exec tsc -p container/agent-runner/tsconfig.json --noEmit clean on the applied tree
  • pnpm test — 397 host tests pass (incl. 48 guardrails + wiring)
  • bun test in container/agent-runner — 128 tests pass (incl. 27 guardrails + MCP-hook behavior tests)
  • diff -r both skill resource trees against the applied code — byte-identical (the applied tree is preserved on a local branch for future skill iteration)

🤖 Generated with Claude Code

...ails
Utility skill that installs optional deterministic guardrails for an
agent group: regex/keyphrase rules with block/flag actions, evaluated
per text at four enforcement layers (container inbound, MCP send hooks,
host outbound delivery, router inbound), fail-closed on broken config,
with a host-side quarantine audit trail and chat alerts.
Ships only .claude/skills/add-guardrails/ — no trunk behavior changes.
Resources were verified by applying them to a checkout: build and
typecheck clean, 397 host vitest + 128 container bun tests green;
resources are byte-identical to the applied tree.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
amit-shafnir changed the title from "feat: add /add-guardrails — per-agent-group input/output guardrails" to "feat: add /add-guardrails skill — per-agent-group input/output guardrails" Jun 10, 2026
amit-shafnir and others added 3 commits June 10, 2026 17:40
Optional modules can now attach to the message path without editing core:
- src/module-hooks.ts — inbound message gates (router, pre-write),
 outbound message gates (delivery, pre-send), mount contributors
 (container spawn). Empty registries are exact no-ops.
- container/agent-runner/src/hooks.ts — inbound batch hooks (run before
 pre-task scripts in both the initial and follow-up poll paths) and
 result-text hooks (null suppresses dispatch).
- mcp-tools/server.ts — tool-call middleware chain around the single
 dispatch chokepoint, covering every registered tool.
- container/agent-runner/src/modules.ts — registration barrel imported
 by both container entry points (poll loop + MCP server); wrong-process
 registrations are inert.
deliverMessage now takes the existing OutboundMessage type instead of an
inline duplicate.
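
As a rough illustration of the hook-seam idea described above, the sketch below shows a gate registry whose empty state is a no-op: with nothing registered, every message is allowed without further work. All names here (`InboundMessage`, `registerInboundGate`, `runInboundGates`) are hypothetical and synchronous for brevity; the real src/module-hooks.ts API may differ and is likely asynchronous.

```typescript
// Hypothetical message and result shapes (assumptions for illustration).
interface InboundMessage {
  groupId: string;
  text: string;
}

type GateResult = { allow: true } | { allow: false; reason: string };
type InboundGate = (msg: InboundMessage) => GateResult;

// Module-level registry; optional modules push gates at import time.
const inboundGates: InboundGate[] = [];

function registerInboundGate(gate: InboundGate): void {
  inboundGates.push(gate);
}

// Run gates in registration order; the first denial wins.
// With an empty registry the loop body never runs, so this is a no-op allow.
function runInboundGates(msg: InboundMessage): GateResult {
  for (const gate of inboundGates) {
    const result = gate(msg);
    if (!result.allow) return result;
  }
  return { allow: true };
}
```

A module would call `registerInboundGate(...)` once when it is imported, and the core message path would call `runInboundGates(msg)` before waking the container.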
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
...t host write threw
openOutboundDb is readonly:true by design (host reads, container writes),
but writeOutboundDirect — the command gate's deny-response path — used it
for an INSERT, so every call threw 'attempt to write a readonly database'
and the deny response never reached the user. Use the RW opener, mirroring
the host-sweep orphan-claim fix (8d022fd). Callers only write when they've
decided not to wake the container for the message; even seq parity is
preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The skill previously instructed ~11 marker-guarded edit blocks across 8
core files, guarded by an AST wiring test. It now rides the generic hook
seams: install = copy the module dirs + one barrel-import line per side
+ build.
- host: index.ts is now pure import-time registration (inbound gate,
 outbound delivery gate, guardrails-dir RO mount contributor, quarantine
 delivery action); the old index.ts logic lives unchanged in inbound.ts;
 delivery-gate.ts absorbs the alert-delivery block that SKILL.md used to
 splice into deliverMessage.
- container: register.ts registers the inbound batch hook, result-text
 hook, and tool middleware; tool-middleware.ts guards every MCP send path
 through the dispatch chokepoint via a per-tool text-extractor map
 (covering a future tool = one map entry).
- guardrails-wiring.test.ts (AST test) deleted — there are no hand edits
 left to drift. The rules.ts host↔container byte-identity guard moved to
 registration.test.ts; mcp-hooks.test.ts now drives the middleware
 directly.
- SKILL.md phases collapse to cp + 2 barrel lines + build; REMOVE.md to
 the reverse. Two deliberate semantic shifts documented: MCP block alerts
 route via session routing (not the resolved to= destination), and the
 guard now runs before tool-handler validation (a blocked send_file
 filename can no longer probe path existence).
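
The per-tool text-extractor map and the "guard runs before the handler" ordering could look something like the sketch below. The argument field names (`text`, `caption`, `filename`), the result shape, and the `guardToolCall` function are assumptions for illustration and are not the actual tool-middleware.ts code.

```typescript
type ToolArgs = Record<string, unknown>;
type ToolResult = { isError: boolean; text: string };
type ToolHandler = (args: ToolArgs) => ToolResult;
type TextExtractor = (args: ToolArgs) => string[];

// Wrap a value as a one-element list if it is a string, else an empty list.
const asText = (v: unknown): string[] => (typeof v === "string" ? [v] : []);

// Per-tool extractor map: covering a new tool means adding one entry.
// Field names are hypothetical.
const extractors = new Map<string, TextExtractor>([
  ["send_message", (a) => asText(a["text"])],
  ["send_file", (a) => asText(a["caption"]).concat(asText(a["filename"]))],
  ["edit_message", (a) => asText(a["text"])],
]);

// Middleware: check every extracted text BEFORE the handler runs, so a blocked
// call never reaches handler-side validation (e.g. file path checks).
function guardToolCall(
  name: string,
  args: ToolArgs,
  isBlocked: (text: string) => boolean,
  handler: ToolHandler
): ToolResult {
  const extract = extractors.get(name);
  const texts = extract ? extract(args) : [];
  if (texts.some(isBlocked)) {
    return { isError: true, text: `Blocked by guardrails (${name})` };
  }
  return handler(args);
}
```

In this sketch a tool with no map entry passes straight through to its handler, which is why each send path needs its own extractor.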
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Reviewers

gavrielc (code owner): awaiting requested review
gabi-simons (code owner): awaiting requested review
