feat: add /add-guardrails skill — per-agent-group input/output guardrails #2726
Open
amit-shafnir wants to merge 4 commits into
amit-shafnir requested review from gabi-simons and gavrielc as code owners, June 10, 2026 12:32
...ails Utility skill that installs optional deterministic guardrails for an agent group: regex/keyphrase rules with block/flag actions, evaluated per text at four enforcement layers (container inbound, MCP send hooks, host outbound delivery, router inbound), fail-closed on broken config, with a host-side quarantine audit trail and chat alerts. Ships only .claude/skills/add-guardrails/ — no trunk behavior changes. Resources were verified by applying them to a checkout: build and typecheck clean, 397 host vitest + 128 container bun tests green; resources are byte-identical to the applied tree. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
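The rule model described in this commit (regex/keyphrase rules, block/flag actions, fail-closed on broken config) could look roughly like the sketch below. The `Rule` shape, `compileRules`, and `evaluate` are hypothetical names chosen for illustration; the PR's actual `rules.ts` is not shown on this page and may differ.

```typescript
// Hypothetical rule shape; not taken from the PR's rules.ts.
type Action = "block" | "flag";

interface Rule {
  id: string;
  kind: "regex" | "keyphrase";
  pattern: string;
  action: Action;
  message?: string; // optional override for the user-facing wording
}

interface Verdict {
  action: Action | "allow";
  ruleId?: string;
  message?: string;
}

type Compiled = Array<{ rule: Rule; test: (text: string) => boolean }>;

// Compile once. Any invalid rule makes the whole config unusable (null),
// which evaluate() treats as "block everything" (fail closed).
function compileRules(rules: Rule[]): Compiled | null {
  try {
    return rules.map((rule) => {
      if (rule.kind === "regex") {
        const re = new RegExp(rule.pattern, "i"); // throws on a malformed pattern
        return { rule, test: (text: string) => re.test(text) };
      }
      const phrase = rule.pattern.toLowerCase();
      if (phrase.length === 0) throw new Error(`empty keyphrase in rule ${rule.id}`);
      return { rule, test: (text: string) => text.toLowerCase().includes(phrase) };
    });
  } catch {
    return null;
  }
}

function evaluate(text: string, compiled: Compiled | null): Verdict {
  if (compiled === null) {
    return { action: "block", ruleId: "config-error", message: "Guardrails config is invalid." };
  }
  let flagged: Verdict | undefined;
  for (const { rule, test } of compiled) {
    if (!test(text)) continue;
    const hit: Verdict = { action: rule.action, ruleId: rule.id, message: rule.message };
    if (rule.action === "block") return hit; // a block wins immediately
    if (!flagged) flagged = hit; // remember the first flag, keep looking for a block
  }
  return flagged ?? { action: "allow" };
}
```

In this sketch a broken config never degrades to "allow": the evaluator returns a block verdict for every text until the config is fixed, which is one way to read "fail-closed".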
amit-shafnir force-pushed the feat/add-guardrails branch from 06544fe to fd77109, June 10, 2026 13:04
amit-shafnir changed the title from "feat: add /add-guardrails — per-agent-group input/output guardrails" to "feat: add /add-guardrails skill — per-agent-group input/output guardrails", Jun 10, 2026
Optional modules can now attach to the message path without editing core:
- src/module-hooks.ts: inbound message gates (router, pre-write), outbound message gates (delivery, pre-send), mount contributors (container spawn). Empty registries are exact no-ops.
- container/agent-runner/src/hooks.ts: inbound batch hooks (run before pre-task scripts in both the initial and follow-up poll paths) and result-text hooks (null suppresses dispatch).
- mcp-tools/server.ts: tool-call middleware chain around the single dispatch chokepoint, covering every registered tool.
- container/agent-runner/src/modules.ts: registration barrel imported by both container entry points (poll loop + MCP server); wrong-process registrations are inert.

deliverMessage now takes the existing OutboundMessage type instead of an inline duplicate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
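As a rough illustration of the "empty registries are exact no-ops" property, a gate registry can be sketched as a plain array that is iterated on every message. The types and function names below (`InboundMessage`, `GateResult`, `registerInboundGate`, `runInboundGates`) are assumptions for this sketch, not the real exports of src/module-hooks.ts.

```typescript
// Hypothetical shapes; the real src/module-hooks.ts signatures are not shown on this page.
interface InboundMessage {
  groupId: string;
  text: string;
}

type GateResult = { allow: true } | { allow: false; reason: string };
type InboundGate = (msg: InboundMessage) => GateResult;

const inboundGates: InboundGate[] = [];

// Optional modules call this at import time to attach to the message path.
function registerInboundGate(gate: InboundGate): void {
  inboundGates.push(gate);
}

// Runs gates in registration order; the first denial wins.
// With nothing registered the loop body never runs and the message passes.
function runInboundGates(msg: InboundMessage): GateResult {
  for (const gate of inboundGates) {
    const result = gate(msg);
    if (!result.allow) return result;
  }
  return { allow: true };
}
```

Under these assumptions, core code calls `runInboundGates` unconditionally, and an install without any optional module pays only for iterating an empty array.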
...t host write threw openOutboundDb is readonly:true by design (host reads, container writes), but writeOutboundDirect — the command gate's deny-response path — used it for an INSERT, so every call threw 'attempt to write a readonly database' and the deny response never reached the user. Use the RW opener, mirroring the host-sweep orphan-claim fix (8d022fd). Callers only write when they've decided not to wake the container for the message; even seq parity is preserved. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The skill previously instructed ~11 marker-guarded edit blocks across 8 core files, guarded by an AST wiring test. It now rides the generic hook seams: install = copy the module dirs + one barrel-import line per side + build.
- host: index.ts is now pure import-time registration (inbound gate, outbound delivery gate, guardrails-dir RO mount contributor, quarantine delivery action); the old index.ts logic lives unchanged in inbound.ts; delivery-gate.ts absorbs the alert-delivery block that SKILL.md used to splice into deliverMessage.
- container: register.ts registers the inbound batch hook, result-text hook, and tool middleware; tool-middleware.ts guards every MCP send path through the dispatch chokepoint via a per-tool text-extractor map (covering a future tool = one map entry).
- guardrails-wiring.test.ts (AST test) deleted: there are no hand edits left to drift. The rules.ts host↔container byte-identity guard moved to registration.test.ts; mcp-hooks.test.ts now drives the middleware directly.
- SKILL.md phases collapse to cp + 2 barrel lines + build; REMOVE.md to the reverse.

Two deliberate semantic shifts documented: MCP block alerts route via session routing (not the resolved to= destination), and the guard now runs before tool-handler validation (a blocked send_file filename can no longer probe path existence).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
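The per-tool text-extractor map and the "guard runs before tool-handler validation" ordering might be sketched as follows. Everything here is illustrative: the argument field names (`text`, `caption`, `filename`), the `isBlocked` stand-in, and the `dispatch` signature are assumptions, not the PR's tool-middleware.ts.

```typescript
// All names are illustrative; the PR's tool-middleware.ts is not shown on this page.
type ToolArgs = Record<string, unknown>;
type ToolResult = { ok: true } | { ok: false; error: string };
type ToolHandler = (args: ToolArgs) => ToolResult;
type Middleware = (tool: string, args: ToolArgs, next: ToolHandler) => ToolResult;

// One entry per guarded tool: which argument strings should be checked.
// Covering a new tool means adding one entry here.
const textExtractors: Record<string, (args: ToolArgs) => string[]> = {
  send_message: (a) => [a["text"]].filter((v): v is string => typeof v === "string"),
  send_file: (a) =>
    [a["caption"], a["filename"]].filter((v): v is string => typeof v === "string"),
};

// Stand-in for the real rule check.
const isBlocked = (text: string): boolean => /ignore previous instructions/i.test(text);

const guardMiddleware: Middleware = (tool, args, next) => {
  const extract = textExtractors[tool];
  if (!extract) return next(args); // no extractor registered for this tool
  const hit = extract(args).find(isBlocked);
  if (hit !== undefined) {
    return { ok: false, error: `Blocked by guardrails: ${tool} text matched a block rule.` };
  }
  return next(args);
};

// Single dispatch point: the middleware runs before the handler's own validation,
// so a blocked call never reaches handler logic such as path checks.
function dispatch(
  tool: string,
  args: ToolArgs,
  handlers: Record<string, ToolHandler>,
): ToolResult {
  const handler = handlers[tool];
  if (!handler) return { ok: false, error: `unknown tool: ${tool}` };
  return guardMiddleware(tool, args, handler);
}
```

Because the guard sits in front of the handler, a blocked `send_file` call in this sketch returns the same error whether or not the named file exists, which matches the commit's note about no longer probing path existence.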
This was referenced Jun 11, 2026
Summary
Adds the /add-guardrails skill, which provides optional, per-agent-group input/output guardrails: deterministic regex/keyphrase rules (prompt-injection phrase blocking, credential-leak patterns) with block/flag actions, chat alerts, and a host-side quarantine audit trail. Fails closed on broken config; zero overhead for unconfigured groups.

This PR ships only .claude/skills/add-guardrails/, with no trunk behavior changes. Per the skills model, the guardrails code lives as frozen copies under the skill's resources/; users opt in by running /add-guardrails, which materializes the code and wiring into their install. REMOVE.md reverses it completely.

Enforcement layers (what the skill installs)
- deliverToAgent(): a blocked message never wakes the container.
- […] send_message, send_file caption + display filename, edit_message, ask_user_question, send_card string leaves) get an actionable tool error on block.
- deliverMessage() re-checks every non-system row's content string leaves before platform delivery. This is the layer an injected agent cannot bypass by INSERTing into outbound.db directly; it has deliberately no exemption mechanism, and its alerts go straight through the adapter so they can't recurse.

Design notes
- […] message field overrides the user-facing wording.
- […] data/guardrails/<group-id>/ (rotated, size-capped), never agent-readable; container-side input blocks carry messageId instead of content since outbound.db is agent-readable.

Skill-guidelines conformance
- […] mcp-hooks.test.ts, bun:test); structural AST/marker wiring tests pin every reach-in's placement, the hook order, and byte-identity of the intentionally duplicated rules.ts across the two trees.

Test plan
Verification was done by applying the skill to a checkout and running the full suites there:
- pnpm run build + pnpm exec tsc -p container/agent-runner/tsconfig.json --noEmit clean on the applied tree
- pnpm test: 397 host tests pass (incl. 48 guardrails + wiring)
- bun test in container/agent-runner: 128 tests pass (incl. 27 guardrails + MCP-hook behavior tests)
- diff -r both skill resource trees against the applied code: byte-identical (the applied tree is preserved on a local branch for future skill iteration)

🤖 Generated with Claude Code
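
The "string leaves" checks mentioned under the enforcement layers can be pictured as a recursive walk that collects every string inside a structured payload, so that each one can be run through the rules. The `stringLeaves` helper below is a hypothetical sketch of that idea, not code from the PR; the sample card fields are likewise invented for illustration.

```typescript
// Hypothetical helper: collect every string found anywhere inside a JSON-like value.
function stringLeaves(value: unknown, out: string[] = []): string[] {
  if (typeof value === "string") {
    out.push(value);
  } else if (Array.isArray(value)) {
    for (const item of value) stringLeaves(item, out);
  } else if (value !== null && typeof value === "object") {
    for (const item of Object.values(value as Record<string, unknown>)) {
      stringLeaves(item, out);
    }
  }
  // Numbers, booleans, null, and undefined carry no text and are skipped.
  return out;
}

// Example payload with invented field names.
const sampleCard = {
  title: "Deploy?",
  buttons: [
    { label: "Yes", id: 1 },
    { label: "Later", id: 2 },
  ],
  footer: null,
};

const leaves = stringLeaves(sampleCard);
```

Checking leaves rather than a serialized blob means a rule sees each user-visible string on its own, regardless of how deeply it is nested in the payload.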