RPL Judgement
RPL-Judgement is blumi's adversarial, regret-minimizing pre-execution reasoning loop for AI coding agents. Before a risky tool call touches your real files, repo, or network, blumi maps the action's blast radius, puts the plan on trial before an adversarial LLM-as-a-judge ("Porfiry"), and only executes the plan that survives cross-examination — then writes the predicted-vs-actual "regret" back into long-term memory so the agent learns from consequences. A standard agent maximizes success; an RPL agent minimizes regret.
Inspired by Fyodor Dostoevsky's Crime and Punishment — the agent doesn't just plan; it simulates the crime, prosecutes itself, and updates its logic before ever touching a live system.
RPL is off by default (opt-in via rpl.enabled) — it trades extra LLM calls for far fewer
catastrophic, irreversible actions, so you turn it on for high-stakes, autonomous, or unattended work.
- TL;DR: blast-radius risk model → adversarial judge gate → typed actuation → regret-to-memory.
- Where it lives: crates/blumi-core/src/rpl.rs (pure core) + crates/blumi-core/src/agent.rs (the live loop) + RplConfig in crates/blumi-config.
- Related: Self-Management · Memory & Knowledge · Configuration.
Most agent frameworks let the model plan and then immediately execute — a single chain of thought
followed by a tool call. The failure mode is well known: the model, biased toward task completion,
rationalizes a risky step ("just git push --force, it'll be fine") and the side effect is already
done by the time anyone notices.
RPL-Judgement inserts a deliberation stage between deciding and doing. It replaces "plan → act" with "plan → map risk → prosecute → (re-plan or) act → confess." It is an application of three ideas:
- A cheap, deterministic risk prior (the blast radius) decides whether the expensive review is even worth running — so safe, read-only work pays nothing.
- An adversarial LLM-as-a-judge — a separate model invocation whose only instruction is to find the flaw — gates execution. Adversarial review beats self-critique precisely because the judge has no stake in finishing the task.
- Outcome-based credit assignment — the gap between the predicted risk and what actually happened (the Error Delta, i.e. "regret") is persisted to memory and feeds value-based memory fitness, so the system gets calibrated over time.
Risk isn't a vibe — it's computed from the capabilities a tool batch requests
(blumi-core/src/rpl.rs::BlastRadius::assess). Each finalized tool call's
required_capabilities are flattened and bucketed: files written, shell commands,
network egress, VCS mutations, sub-agent spawns, and whether any command matches blumi's
destructive-command heuristic (rm -rf, sudo, git reset --hard, git push --force, mkfs,
dd if=, ... — shared with the permission engine).
From that, a 0–100 severity is derived (BlastRadius::severity):
| Factor | Contribution |
|---|---|
| A destructive command present | +60 |
| Shell commands | +20 each (capped at 2) |
| VCS mutations (commit/reset/push) | +25 each (capped at 2) |
| Network egress to a host | +15 each (capped at 2) |
| File writes | +10 each (capped at 3) |
| Irreversible (destructive / network / VCS) | +15 |
A batch is reversible when it has no destructive command, no network egress, and no VCS mutation
(file writes are reversible via the /undo change journal). Only a mutating batch
whose severity clears blast_threshold (default 40) triggers the full loop — everything else takes
the cheap path and executes normally.
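The scoring table and the reversibility rule above can be sketched as follows. This is a minimal illustration, not the code in `rpl.rs`: the struct `BlastRadiusSketch` and its field names are hypothetical, "capped at N" is assumed to mean that only the first N items of a kind count, and the total is assumed to be clamped to 100.

```rust
/// Hypothetical, simplified stand-in for blumi's `BlastRadius`.
/// Field names are assumptions made for this sketch.
#[derive(Debug, Default, Clone)]
pub struct BlastRadiusSketch {
    pub files_written: u32,
    pub shell_commands: u32,
    pub network_hosts: u32,
    pub vcs_mutations: u32,
    pub destructive: bool,
}

impl BlastRadiusSketch {
    /// Reversible: no destructive command, no network egress, no VCS mutation.
    /// File writes alone stay reversible (the /undo change journal covers them).
    pub fn is_reversible(&self) -> bool {
        !self.destructive && self.network_hosts == 0 && self.vcs_mutations == 0
    }

    /// 0-100 severity using the weights from the table.
    /// Assumption: caps limit how many items count, and the sum is clamped to 100.
    pub fn severity(&self) -> u8 {
        let mut score: u32 = 0;
        if self.destructive {
            score += 60;
        }
        score += 20 * self.shell_commands.min(2);
        score += 25 * self.vcs_mutations.min(2);
        score += 15 * self.network_hosts.min(2);
        score += 10 * self.files_written.min(3);
        if !self.is_reversible() {
            score += 15;
        }
        score.min(100) as u8
    }
}
```

Under these assumptions, a batch of plain file writes tops out at 30 and stays below the default threshold of 40, while one shell command plus one commit scores 20 + 25 + 15 = 60 and is reviewed.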
Opaque mutations are covered too. A tool that mutates but declares no capabilities (many MCP tools,
self_config, manage_skill, restart_gateway) would otherwise produce an empty blast radius and slip through. BlastRadius::should_review folds in each tool's is_read_only flag, so an undeclared side effect is reviewed anyway and its predicted risk is floored to the threshold, never recorded as "safe".
Code-graph fan-in. When the code graph is built and knowledge.graph.rpl_impact is on, editing a heavily-referenced file adds its incoming-reference count to the severity, so changing a high-fan-in symbol is likelier to face review. This is a no-op when the graph isn't available.
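The opaque-mutation rule and the fan-in bump can be sketched like this. Everything here is illustrative: `ToolCallSketch`, the free functions, and their signatures are hypothetical, "clears the threshold" is assumed to mean `>=`, and the fan-in result is assumed to be clamped to 100.

```rust
/// Hypothetical view of one finalized tool call; names are assumptions.
pub struct ToolCallSketch {
    pub is_read_only: bool,
    pub declares_capabilities: bool,
}

/// True when some call mutates but declared no capabilities.
fn has_opaque_mutation(calls: &[ToolCallSketch]) -> bool {
    calls.iter().any(|c| !c.is_read_only && !c.declares_capabilities)
}

/// Review a batch if it hides an undeclared mutation, or if it mutates
/// and its severity reaches the threshold. Read-only batches never qualify.
pub fn should_review(calls: &[ToolCallSketch], severity: u8, blast_threshold: u8) -> bool {
    let mutating = calls.iter().any(|c| !c.is_read_only);
    has_opaque_mutation(calls) || (mutating && severity >= blast_threshold)
}

/// Predicted risk to record: an opaque mutation is floored to the
/// threshold so it is never logged as "safe".
pub fn predicted_risk(calls: &[ToolCallSketch], severity: u8, blast_threshold: u8) -> u8 {
    if has_opaque_mutation(calls) {
        severity.max(blast_threshold)
    } else {
        severity
    }
}

/// Add a file's incoming-reference count to the severity (clamped to 100).
pub fn with_fan_in(severity: u8, incoming_refs: u32) -> u8 {
    (severity as u32).saturating_add(incoming_refs).min(100) as u8
}
```

With the default threshold of 40, an MCP tool that declares nothing but is not read-only is reviewed at a recorded risk of 40 even though its computed severity is 0.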
The surviving plan is submitted to Porfiry (agent.rs::rpl_porfiry) — a focused LLM call whose
system directive (PORFIRY_POLICY) is only to find the flaw: the ignored edge case, the state the
plan failed to re-read, the irreversible step taken for granted, the way it could break the
environment. The judge returns a single line of JSON — {"approved":bool,"risk":0-100,"flaw":"..."} —
parsed leniently and fail-open (parse_porfiry): an unreachable or unparseable judge approves
(at the blast severity), so a flaky model never deadlocks the agent.
Why a separate adversarial call instead of asking the planner to double-check itself? Because the planner is optimizing for finishing; a self-review inherits that bias. A judge instructed to assume the plan is flawed and prosecute it surfaces failure modes the planner has already rationalized away. It runs at temperature 0 with a tight token budget — cheap and deterministic.
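A lenient, fail-open parse of the judge's one-line verdict might look like the sketch below. It is not the real `parse_porfiry`: the `Verdict` type and `parse_verdict` function are hypothetical, and the hand-rolled key scan is used only to keep the example dependency-free (a JSON library would be the more likely choice in practice). It does not handle escaped quotes inside `flaw`.

```rust
/// Hypothetical verdict type mirroring {"approved":bool,"risk":0-100,"flaw":"..."}.
#[derive(Debug, Clone, PartialEq)]
pub struct Verdict {
    pub approved: bool,
    pub risk: u8,
    pub flaw: String,
}

/// Text right after `"key"` and its colon, if the key is present.
fn value_after<'a>(text: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("\"{}\"", key);
    let start = text.find(&needle)? + needle.len();
    let rest = text[start..].trim_start();
    rest.strip_prefix(':').map(str::trim_start)
}

/// Parse leniently; if no usable `approved` field is found, fail open:
/// approve and record the risk at the blast severity.
pub fn parse_verdict(reply: &str, blast_severity: u8) -> Verdict {
    let fail_open = Verdict { approved: true, risk: blast_severity, flaw: String::new() };

    let approved = match value_after(reply, "approved") {
        Some(v) if v.starts_with("true") => true,
        Some(v) if v.starts_with("false") => false,
        _ => return fail_open,
    };

    // A missing or malformed risk falls back to the blast severity.
    let risk = value_after(reply, "risk")
        .map(|v| v.chars().take_while(|c| c.is_ascii_digit()).collect::<String>())
        .and_then(|digits| digits.parse::<u32>().ok())
        .map(|n| n.min(100) as u8)
        .unwrap_or(blast_severity);

    // Take the text up to the next quote; escaped quotes are not handled.
    let flaw = value_after(reply, "flaw")
        .and_then(|v| v.strip_prefix('"'))
        .and_then(|v| v.split('"').next())
        .unwrap_or("")
        .to_string();

    Verdict { approved, risk, flaw }
}
```

The fail-open branch is the design choice the text describes: a timeout message or stray prose yields an approval at the blast severity rather than an error that would stall the turn.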
After execution, blumi computes the Error Delta (rpl.rs::ErrorDelta::compute) — how far reality
diverged from the prediction:
- Failed despite a low predicted risk ⇒ maximal regret (100 − predicted_risk, floored at 40): the surprise most worth learning from.
- Succeeded ⇒ a small residual regret proportional to how scared we were (predicted_risk / 4).
That delta is written as an rpl_delta episode in the agent memory namespace and feeds
value-based memory fitness — so memories and decisions that lead to low-regret outcomes gain value,
and the agent's risk judgement self-calibrates. This is the credit-assignment loop:
regret → memory value → what survives eviction.
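The two Error Delta rules reduce to a few lines of arithmetic. The function below is a sketch under stated assumptions: `error_delta` is a hypothetical name, the inputs are simplified to a risk score and a success flag, and integer division is assumed for `predicted_risk / 4`.

```rust
/// Regret on a 0-100 scale, following the two rules in the text.
/// Hypothetical helper; the real `ErrorDelta::compute` may take richer inputs.
pub fn error_delta(predicted_risk: u8, succeeded: bool) -> u8 {
    let predicted = predicted_risk.min(100);
    if succeeded {
        // Small residual regret, proportional to how cautious the prediction was.
        predicted / 4
    } else {
        // Failure: the lower the predicted risk, the bigger the surprise,
        // but never less than 40.
        (100 - predicted).max(40)
    }
}
```

For example, a failure that was predicted at risk 10 yields a regret of 90, while a failure predicted at risk 90 still yields the floor of 40. A success predicted at risk 80 yields 20.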
RPL is injected into the agent turn between finalize_tool_calls and execute_calls
(blumi-core/src/agent.rs) — the tool calls are fully materialized but nothing has run yet.
Map what the batch would touch into a structured BlastRadius + severity. Read-only / low-blast
batches skip the loop entirely (the cheap path), keeping cost bounded.
For a batch over blast_threshold, predict the worst-case outcome and a risk score.
(MVP: a single-plan dry prediction. Multi-branch Tree-of-Thoughts simulation and real sandboxed
execution are planned — see Implementation status.)
The plan goes to the adversarial judge. If it approves, proceed. If it rejects, the plan is
bounced: the agent keeps its reasoning, the flaw is injected as guidance, and the model re-plans —
bounded by max_defend_rounds (default 2), after which blumi proceeds under caution so the turn can
never deadlock. The bounce is protocol-safe (it never leaves a dangling tool_use, mirroring blumi's
doom-loop guard).
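The bounded reject and re-plan loop can be sketched as below. This is an illustration only: `defend`, `Gate`, and the closure signatures are hypothetical, plans are modeled as plain strings rather than tool-call batches, and it is assumed that the plan produced by the last allowed re-plan round is judged once more before blumi proceeds under caution.

```rust
/// How a plan left the judge gate (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Gate {
    Approved { rounds_used: u32 },
    ProceedUnderCaution { last_flaw: String },
}

/// `judge` returns Ok(()) to approve or Err(flaw) to reject.
/// `replan` receives the current plan plus the flaw as guidance.
/// The loop always terminates: at most `max_defend_rounds` re-plans.
pub fn defend<J, R>(
    mut plan: String,
    max_defend_rounds: u32,
    mut judge: J,
    mut replan: R,
) -> (String, Gate)
where
    J: FnMut(&str) -> Result<(), String>,
    R: FnMut(&str, &str) -> String,
{
    let mut rounds = 0;
    loop {
        match judge(&plan) {
            Ok(()) => return (plan, Gate::Approved { rounds_used: rounds }),
            Err(flaw) => {
                if rounds >= max_defend_rounds {
                    // Out of rounds: proceed anyway so the turn cannot deadlock.
                    return (plan, Gate::ProceedUnderCaution { last_flaw: flaw });
                }
                rounds += 1;
                plan = replan(&plan, &flaw);
            }
        }
    }
}
```

With the default of 2 rounds and a judge that always rejects, this sketch calls the judge three times and then returns `ProceedUnderCaution` with the last flaw attached.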
The surviving plan flows through the normal typed tool pipeline (execute_tool_call) unchanged —
the permission engine and pre_tool_use hooks
still apply. RPL only decides which plan reaches this point.
blumi compares the predicted risk to the actual result, computes the Error Delta, and writes it
back to memory as an rpl_delta episode, priming future blast-radius judgments and feeding
value-based fitness. The agent evolves from the consequence.
In ~/.blumi/settings.json (defaults shown — enabled is false):
"rpl": { "enabled": false, "blast_threshold": 40, "branches": 3, "max_defend_rounds": 2, "judge_model": "", "sandbox": "dry" }
| Field | Meaning |
|---|---|
| enabled | Master switch. Off by default; opt in for high-stakes / unattended work. |
| blast_threshold | 0–100 severity a mutating batch must reach to be reviewed. Lower = more reviews. |
| branches | Tree-of-Thoughts branches to simulate per review (1–5). Reserved; see roadmap. |
| max_defend_rounds | Porfiry reject → re-plan rounds before proceeding under caution. |
| judge_model | Model for the Porfiry judge. Empty = reuse the main model. |
| sandbox | dry (predict) or worktree (real sandboxed sim). worktree reserved; see roadmap. |
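On the Rust side, the documented defaults could be mirrored by a struct along these lines. This is a guess at the shape, not the actual `RplConfig` in `crates/blumi-config`: the field types, the `Sandbox` enum, and both helper methods are hypothetical, and clamping out-of-range values is an assumption (the source only states the ranges).

```rust
/// Hypothetical sandbox mode; only `Dry` is live per the status notes.
#[derive(Debug, Clone, PartialEq)]
pub enum Sandbox {
    Dry,
    Worktree,
}

/// Hypothetical mirror of the "rpl" settings block.
#[derive(Debug, Clone, PartialEq)]
pub struct RplConfigSketch {
    pub enabled: bool,
    pub blast_threshold: u8,
    pub branches: u8,
    pub max_defend_rounds: u32,
    pub judge_model: String,
    pub sandbox: Sandbox,
}

impl Default for RplConfigSketch {
    /// Defaults as shown in settings.json above.
    fn default() -> Self {
        Self {
            enabled: false,
            blast_threshold: 40,
            branches: 3,
            max_defend_rounds: 2,
            judge_model: String::new(),
            sandbox: Sandbox::Dry,
        }
    }
}

impl RplConfigSketch {
    /// Empty `judge_model` means "reuse the main model".
    pub fn judge_model_or<'a>(&'a self, main_model: &'a str) -> &'a str {
        if self.judge_model.is_empty() { main_model } else { &self.judge_model }
    }

    /// Assumption: out-of-range values are clamped to the documented ranges.
    pub fn normalized(mut self) -> Self {
        self.blast_threshold = self.blast_threshold.min(100);
        self.branches = self.branches.clamp(1, 5);
        self
    }
}
```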
RPL composes with your permissions and pre_tool_use hooks; it does not replace them. Those still gate the Strike; RPL adds an upstream adversarial review so a risky plan is caught and re-thought before it ever reaches them.
RPL-Judgement borrows its structure from Raskolnikov's psychology in Crime and Punishment:
| Crime and Punishment | RPL-Judgement |
|---|---|
| Raskolnikov's "Extraordinary Man" theory — deciding the normal rules don't apply | The Hypothesis — mapping the blast radius / what protections are being bypassed |
| The fever dream — rehearsing the act, obsessing over chaotic variables | The Fever Dream — worst-case simulation + paranoia score |
| Porfiry Petrovich, the investigator who traps him in his own logic | The Porfiry Node — the adversarial judge that must be satisfied |
| The crime itself | The Strike — deterministic, typed actuation |
| Guilt — reality didn't match the sterile plan | The Confession — the Error Delta written to memory |
Live today (v0.5.0):
- ✅ Phase 1 blast-radius gate, including opaque-mutation coverage (should_review / predicted_risk).
- ✅ Phase 3 Porfiry adversarial judge with reject → re-plan bounce, bounded + fail-open.
- ✅ Phase 4 typed actuation through the normal pipeline (permissions + hooks intact).
- ✅ Phase 5 Error-Delta "regret" written to memory, feeding value-based fitness.
Planned (config present, not yet active):
- 🚧 Multi-branch Tree-of-Thoughts (Phase 2). The MVP judges the single materialized plan; the branches knob and the ParanoiaScore / least-regret machinery in rpl.rs are unit-tested but not yet wired into the live loop.
- 🚧 sandbox: "worktree". Real sandboxed branch execution in throwaway git worktrees (observe outcomes instead of predicting them). The MVP runs sandbox: "dry".
RPL is not on by default. It's opt-in via rpl.enabled = true. It adds LLM calls, so it's meant for high-stakes or
unattended runs rather than every keystroke.
RPL does not review every tool call. Only mutating batches whose blast-radius severity clears blast_threshold are reviewed.
Read-only and low-risk work takes the cheap path with zero added cost.
If the judge is unreachable or its reply can't be parsed, RPL fails open: the action is approved (recorded at the blast severity), so a flaky judge can never deadlock the agent. RPL is a safety enhancement layered over the permission engine, not a single point of failure.
Permissions and hooks gate an individual call at execution time. RPL adds an upstream, plan-level adversarial review that can bounce a whole batch back for re-planning before it reaches those gates. They compose.
RPL changes whether a plan executes (approve / bounce), not how. The surviving plan runs through the exact same typed pipeline, permissions, and approvals as without RPL.
The Error Delta is the regret signal (predicted risk vs. actual outcome) written to long-term memory, where it feeds value-based fitness so the agent's judgement self-calibrates over time.
See also: Self-Management · Memory & Knowledge · Configuration · FAQ.