RPL Judgement

ankurCES edited this page Jun 8, 2026 · 2 revisions

Raskolnikov's Psychological Loop (RPL-Judgement)

RPL-Judgement is blumi's adversarial, regret-minimizing pre-execution reasoning loop for AI coding agents. Before a risky tool call touches your real files, repo, or network, blumi maps the action's blast radius, puts the plan on trial before an adversarial LLM-as-a-judge ("Porfiry"), and only executes the plan that survives cross-examination — then writes the predicted-vs-actual "regret" back into long-term memory so the agent learns from consequences. A standard agent maximizes success; an RPL agent minimizes regret.

Inspired by Fyodor Dostoevsky's Crime and Punishment — the agent doesn't just plan; it simulates the crime, prosecutes itself, and updates its logic before ever touching a live system.

RPL is off by default (opt-in via rpl.enabled) — it trades extra LLM calls for far fewer catastrophic, irreversible actions, so you turn it on for high-stakes, autonomous, or unattended work.

  • TL;DR: blast-radius risk model → adversarial judge gate → typed actuation → regret-to-memory.
  • Where it lives: crates/blumi-core/src/rpl.rs (pure core) + crates/blumi-core/src/agent.rs (the live loop) + RplConfig in crates/blumi-config.
  • Related: Self-Management · Memory & Knowledge · Configuration.

What is RPL-Judgement?

Most agent frameworks let the model plan and then immediately execute — a single chain of thought followed by a tool call. The failure mode is well known: the model, biased toward task completion, rationalizes a risky step ("just git push --force, it'll be fine") and the side effect is already done by the time anyone notices.

RPL-Judgement inserts a deliberation stage between deciding and doing. It replaces "plan → act" with "plan → map risk → prosecute → (re-plan or) act → confess." It is an application of three ideas:

  1. A cheap, deterministic risk prior (the blast radius) decides whether the expensive review is even worth running — so safe, read-only work pays nothing.
  2. An adversarial LLM-as-a-judge — a separate model invocation whose only instruction is to find the flaw — gates execution. Adversarial review beats self-critique precisely because the judge has no stake in finishing the task.
  3. Outcome-based credit assignment — the gap between the predicted risk and what actually happened (the Error Delta, i.e. "regret") is persisted to memory and feeds value-based memory fitness, so the system gets calibrated over time.

The science of the judgement

1. A structured risk model (the blast radius)

Risk isn't a vibe — it's computed from the capabilities a tool batch requests (blumi-core/src/rpl.rs::BlastRadius::assess). Each finalized tool call's required_capabilities are flattened and bucketed: files written, shell commands, network egress, VCS mutations, sub-agent spawns, and whether any command matches blumi's destructive-command heuristic (rm -rf, sudo, git reset --hard, git push --force, mkfs, dd if=, ... — shared with the permission engine).

From that, a 0–100 severity is derived (BlastRadius::severity):

| Factor | Contribution |
| --- | --- |
| A destructive command present | +60 |
| Shell commands | +20 each (capped at 2) |
| VCS mutations (commit/reset/push) | +25 each (capped at 2) |
| Network egress to a host | +15 each (capped at 2) |
| File writes | +10 each (capped at 3) |
| Irreversible (destructive / network / VCS) | +15 |

A batch is reversible when it has no destructive command, no network egress, and no VCS mutation (file writes are reversible via the /undo change journal). Only a mutating batch whose severity clears blast_threshold (default 40) triggers the full loop — everything else takes the cheap path and executes normally.
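The factor table and the reversibility rule can be sketched in a few lines of Rust. This is a minimal illustration, not the code in rpl.rs: the struct fields are hypothetical names, the counts are assumed to be supplied independently (a git push run through the shell is counted here as one shell command and one VCS mutation), and the final clamp to 100 is assumed from the stated 0–100 range.

```rust
/// Hypothetical, simplified stand-in for blumi's `BlastRadius`.
/// Field names are assumptions made for this sketch.
#[derive(Debug, Default, Clone)]
pub struct BlastRadius {
    pub files_written: u32,
    pub shell_commands: u32,
    pub network_hosts: u32,
    pub vcs_mutations: u32,
    pub destructive: bool,
}

impl BlastRadius {
    /// Reversible: no destructive command, no network egress, no VCS mutation.
    pub fn is_reversible(&self) -> bool {
        !self.destructive && self.network_hosts == 0 && self.vcs_mutations == 0
    }

    /// Severity built from the factor table, with each count capped as listed.
    pub fn severity(&self) -> u32 {
        let mut score: u32 = 0;
        if self.destructive {
            score += 60;
        }
        score += 20 * self.shell_commands.min(2);
        score += 25 * self.vcs_mutations.min(2);
        score += 15 * self.network_hosts.min(2);
        score += 10 * self.files_written.min(3);
        if !self.is_reversible() {
            score += 15;
        }
        // Assumption: the total is clamped into the documented 0-100 range.
        score.min(100)
    }
}
```

Under these assumptions, three file writes score 30 and stay below the default threshold of 40, while one shell command plus one VCS mutation scores 20 + 25 + 15 = 60 and would be reviewed.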

Opaque mutations are covered too. A tool that mutates but declares no capabilities (many MCP tools, self_config, manage_skill, restart_gateway) would otherwise produce an empty blast radius and slip through. BlastRadius::should_review folds in each tool's is_read_only flag, so an undeclared side effect is reviewed anyway and its predicted risk is floored to the threshold — never recorded as "safe".
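A rough sketch of how the opaque-mutation rule might look, assuming a hypothetical per-call summary type. Only is_read_only, should_review and predicted_risk are names taken from this page; the struct, the declares_capabilities field, the write_file and read_file tool names in the usage note, and the use of >= for the threshold comparison are illustrative assumptions.

```rust
/// Hypothetical summary of one finalized tool call.
pub struct ToolCall {
    pub name: &'static str,
    pub is_read_only: bool,
    /// Assumed flag: true when the tool declared at least one capability.
    pub declares_capabilities: bool,
}

/// True when some call mutates without declaring what it touches.
fn has_opaque_mutation(calls: &[ToolCall]) -> bool {
    calls
        .iter()
        .any(|c| !c.is_read_only && !c.declares_capabilities)
}

/// Review a mutating batch when its severity reaches the threshold,
/// or when it contains an opaque mutation regardless of severity.
pub fn should_review(calls: &[ToolCall], severity: u32, threshold: u32) -> bool {
    let mutating = calls.iter().any(|c| !c.is_read_only);
    mutating && (severity >= threshold || has_opaque_mutation(calls))
}

/// Predicted risk is floored to the threshold for opaque mutations,
/// so an undeclared side effect is never recorded as "safe".
pub fn predicted_risk(calls: &[ToolCall], severity: u32, threshold: u32) -> u32 {
    if has_opaque_mutation(calls) {
        severity.max(threshold)
    } else {
        severity
    }
}
```

With a threshold of 40, a lone restart_gateway call that declares nothing has a computed severity of 0 but is still reviewed, and its predicted risk is recorded as 40.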

Code-graph fan-in. When the code graph is built and knowledge.graph.rpl_impact is on, editing a heavily referenced file adds its incoming-reference count to the severity, so a change to a high-fan-in symbol is more likely to face review. This is a no-op when the graph isn't available.
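As a small illustration of the fan-in adjustment, the following sketch adds an optional incoming-reference count to a base severity. The function name is hypothetical, and clamping the result to 100 is an assumption based on the 0–100 severity range rather than something this page states.

```rust
/// Adds a file's incoming-reference count to the base severity.
/// `fan_in` is `None` when no code graph is available (the no-op case).
/// Assumption: the result is clamped into the 0-100 severity range.
pub fn severity_with_fan_in(base: u32, fan_in: Option<u32>) -> u32 {
    base.saturating_add(fan_in.unwrap_or(0)).min(100)
}
```

A single file write (severity 10) to a file with 35 incoming references would come out at 45 under this sketch, which is enough to cross the default threshold of 40.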

2. Adversarial LLM-as-a-judge ("Porfiry")

The surviving plan is submitted to Porfiry (agent.rs::rpl_porfiry), a focused LLM call whose system directive (PORFIRY_POLICY) is only to find the flaw: the ignored edge case, the state the plan failed to re-read, the irreversible step taken for granted, the way it could break the environment. The judge returns a single line of JSON, {"approved":bool,"risk":0-100,"flaw":"..."}, which is parsed leniently and fail-open (parse_porfiry): if the judge is unreachable or its reply cannot be parsed, the plan is approved with its risk recorded at the blast severity, so a flaky model never deadlocks the agent.
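To make "lenient and fail-open" concrete, here is a standard-library-only sketch of a verdict parser. It is not blumi's parse_porfiry: the Verdict type, the parse_verdict and value_after names, and the key-scanning approach are assumptions for illustration, and escaped quotes inside the flaw string are deliberately not handled.

```rust
#[derive(Debug, PartialEq)]
pub struct Verdict {
    pub approved: bool,
    pub risk: u32,
    pub flaw: String,
}

/// Returns the text right after `"key":` (whitespace skipped), if present.
fn value_after<'a>(raw: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("\"{}\"", key);
    let rest = &raw[raw.find(&needle)? + needle.len()..];
    let rest = rest.trim_start().strip_prefix(':')?;
    Some(rest.trim_start())
}

/// Lenient, fail-open parse: any reply without a readable `approved`
/// field is treated as an approval at the blast severity.
pub fn parse_verdict(raw: &str, blast_severity: u32) -> Verdict {
    let fail_open = Verdict {
        approved: true,
        risk: blast_severity,
        flaw: String::new(),
    };
    let approved = match value_after(raw, "approved") {
        Some(v) if v.starts_with("true") => true,
        Some(v) if v.starts_with("false") => false,
        _ => return fail_open,
    };
    // A missing or malformed risk falls back to the blast severity.
    let risk = value_after(raw, "risk")
        .map(|v| v.chars().take_while(|c| c.is_ascii_digit()).collect::<String>())
        .and_then(|digits| digits.parse::<u32>().ok())
        .map(|r| r.min(100))
        .unwrap_or(blast_severity);
    // Take everything up to the next quote; escapes are not handled.
    let flaw = value_after(raw, "flaw")
        .and_then(|v| v.strip_prefix('"'))
        .and_then(|v| v.split('"').next())
        .unwrap_or("")
        .to_string();
    Verdict { approved, risk, flaw }
}
```

Scanning for keys instead of requiring strict JSON means a reply wrapped in extra chatter still yields a verdict, while an error page or empty reply falls through to the fail-open default.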

Why a separate adversarial call instead of asking the planner to double-check itself? Because the planner is optimizing for finishing; a self-review inherits that bias. A judge instructed to assume the plan is flawed and prosecute it surfaces failure modes the planner has already rationalized away. It runs at temperature 0 with a tight token budget, which keeps it cheap and largely deterministic.

3. Regret minimization, not reward maximization

After execution, blumi computes the Error Delta (rpl.rs::ErrorDelta::compute) — how far reality diverged from the prediction:

  • Failed despite a low predicted risk ⇒ maximal regret (100 − predicted_risk, floored at 40) — the surprise most worth learning from.
  • Succeeded ⇒ a small residual regret proportional to how scared we were (predicted_risk / 4).

That delta is written as an rpl_delta episode in the agent memory namespace and feeds value-based memory fitness — so memories and decisions that lead to low-regret outcomes gain value, and the agent's risk judgement self-calibrates. This is the credit-assignment loop: regret → memory value → what survives eviction.
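The two Error Delta rules reduce to a short function. This is a sketch of the arithmetic as described here, not the body of ErrorDelta::compute: the function name and signature are hypothetical, and integer division for predicted_risk / 4 is an assumption.

```rust
/// Regret for one executed batch, on a 0-100 scale.
/// `predicted_risk` is the pre-execution estimate; `succeeded` is the outcome.
pub fn error_delta(predicted_risk: u32, succeeded: bool) -> u32 {
    let predicted = predicted_risk.min(100);
    if succeeded {
        // Small residual regret, proportional to how scared we were.
        predicted / 4
    } else {
        // Failure: the lower the predicted risk, the larger the surprise,
        // with a floor of 40.
        (100 - predicted).max(40)
    }
}
```

For example, a failure predicted at risk 10 yields a regret of 90, a failure predicted at risk 90 is floored to 40, and a success predicted at risk 80 leaves a residual of 20.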


How it works — the five phases

RPL is injected into the agent turn between finalize_tool_calls and execute_calls (blumi-core/src/agent.rs) — the tool calls are fully materialized but nothing has run yet.

Phase 1 — The Hypothesis (blast radius)

Map what the batch would touch into a structured BlastRadius + severity. Read-only / low-blast batches skip the loop entirely (the cheap path), keeping cost bounded.

Phase 2 — The Fever Dream (simulation)

For a batch over blast_threshold, predict the worst-case outcome and a risk score. (MVP: a single-plan dry prediction. Multi-branch Tree-of-Thoughts simulation and real sandboxed execution are planned — see Implementation status.)

Phase 3 — The Porfiry Node (adversarial judgement)

The plan goes to the adversarial judge. If it approves, proceed. If it rejects, the plan is bounced: the agent keeps its reasoning, the flaw is injected as guidance, and the model re-plans — bounded by max_defend_rounds (default 2), after which blumi proceeds under caution so the turn can never deadlock. The bounce is protocol-safe (it never leaves a dangling tool_use, mirroring blumi's doom-loop guard).
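The bounded reject and re-plan cycle can be sketched as a loop over two callbacks. Everything here is illustrative: the defend function, the Verdict and Outcome types, and the closure signatures are hypothetical, and the sketch assumes the plan produced by the final re-plan is judged once more before blumi proceeds under caution.

```rust
pub struct Verdict {
    pub approved: bool,
    pub flaw: String,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The judge approved after `rounds` re-plans.
    Approved { rounds: u32 },
    /// The round budget ran out; execution continues so the turn cannot deadlock.
    ProceedUnderCaution { last_flaw: String },
}

/// Judge the plan; on rejection, feed the flaw back to the planner and retry,
/// up to `max_defend_rounds` re-plans.
pub fn defend<P, J>(
    mut plan: String,
    max_defend_rounds: u32,
    mut replan: P,
    mut judge: J,
) -> (String, Outcome)
where
    P: FnMut(&str, &str) -> String,
    J: FnMut(&str) -> Verdict,
{
    let mut rounds = 0;
    loop {
        let verdict = judge(&plan);
        if verdict.approved {
            return (plan, Outcome::Approved { rounds });
        }
        if rounds == max_defend_rounds {
            let last_flaw = verdict.flaw;
            return (plan, Outcome::ProceedUnderCaution { last_flaw });
        }
        // Bounce: the flaw becomes guidance for the next planning attempt.
        plan = replan(&plan, &verdict.flaw);
        rounds += 1;
    }
}
```

Because the loop always returns once the round budget is spent, a judge that rejects everything costs at most max_defend_rounds + 1 judge calls under this sketch.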

Phase 4 — The Strike (deterministic actuation)

The surviving plan flows through the normal typed tool pipeline (execute_tool_call) unchanged — the permission engine and pre_tool_use hooks still apply. RPL only decides which plan reaches this point.

Phase 5 — The Confession (regret → memory)

blumi compares the predicted risk to the actual result, computes the Error Delta, and writes it back to memory as an rpl_delta episode, priming future blast-radius judgements and feeding value-based fitness. The agent learns from the consequence.


Configuration

In ~/.blumi/settings.json (defaults shown — enabled is false):

"rpl": {
 "enabled": false,
 "blast_threshold": 40,
 "branches": 3,
 "max_defend_rounds": 2,
 "judge_model": "",
 "sandbox": "dry"
}

| Field | Meaning |
| --- | --- |
| enabled | Master switch. Off by default; opt in for high-stakes / unattended work. |
| blast_threshold | 0–100 severity a mutating batch must reach to be reviewed. Lower = more reviews. |
| branches | Tree-of-Thoughts branches to simulate per review (1–5). Reserved; see roadmap. |
| max_defend_rounds | Porfiry reject → re-plan rounds before proceeding under caution. |
| judge_model | Model for the Porfiry judge. Empty = reuse the main model. |
| sandbox | dry (predict) or worktree (real sandboxed sim). worktree reserved; see roadmap. |
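For readers who think in types, the settings block could map onto a Rust struct along these lines. The real RplConfig lives in crates/blumi-config and may differ: the field types, the Sandbox enum, and both helper methods are assumptions, and this page does not say whether blumi clamps out-of-range values.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Sandbox {
    Dry,
    Worktree,
}

/// Hypothetical mirror of the "rpl" settings block; types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct RplConfig {
    pub enabled: bool,
    pub blast_threshold: u8,
    pub branches: u8,
    pub max_defend_rounds: u8,
    pub judge_model: String,
    pub sandbox: Sandbox,
}

impl Default for RplConfig {
    /// Defaults as shown in the settings block.
    fn default() -> Self {
        Self {
            enabled: false,
            blast_threshold: 40,
            branches: 3,
            max_defend_rounds: 2,
            judge_model: String::new(),
            sandbox: Sandbox::Dry,
        }
    }
}

impl RplConfig {
    /// Assumed helper: clamp values into the ranges the field table documents.
    pub fn normalized(mut self) -> Self {
        self.blast_threshold = self.blast_threshold.min(100);
        self.branches = self.branches.clamp(1, 5);
        self
    }

    /// An empty `judge_model` means "reuse the main model".
    pub fn judge_model_or<'a>(&'a self, main_model: &'a str) -> &'a str {
        if self.judge_model.is_empty() {
            main_model
        } else {
            &self.judge_model
        }
    }
}
```

Keeping the defaults in a Default impl means a settings file that omits the rpl block entirely would still resolve to enabled = false in a design like this one.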

RPL composes with — it does not replace — your permissions and pre_tool_use hooks. Those still gate the Strike; RPL adds an upstream adversarial review so a risky plan is caught and re-thought before it ever reaches them.


The Dostoevsky mapping

RPL-Judgement borrows its structure from Raskolnikov's psychology in Crime and Punishment:

| Crime and Punishment | RPL-Judgement |
| --- | --- |
| Raskolnikov's "Extraordinary Man" theory: deciding the normal rules don't apply | The Hypothesis: mapping the blast radius / what protections are being bypassed |
| The fever dream: rehearsing the act, obsessing over chaotic variables | The Fever Dream: worst-case simulation + paranoia score |
| Porfiry Petrovich, the investigator who traps him in his own logic | The Porfiry Node: the adversarial judge that must be satisfied |
| The crime itself | The Strike: deterministic, typed actuation |
| Guilt: reality didn't match the sterile plan | The Confession: the Error Delta written to memory |

Implementation status & roadmap

Live today (v0.5.0):

  • ✅ Phase 1 blast-radius gate, including opaque-mutation coverage (should_review / predicted_risk).
  • ✅ Phase 3 Porfiry adversarial judge with reject → re-plan bounce, bounded + fail-open.
  • ✅ Phase 4 typed actuation through the normal pipeline (permissions + hooks intact).
  • ✅ Phase 5 Error-Delta "regret" written to memory, feeding value-based fitness.

Planned (config present, not yet active):

  • 🚧 Multi-branch Tree-of-Thoughts (Phase 2). The MVP judges the single materialized plan; the branches knob and the ParanoiaScore / least-regret machinery in rpl.rs are unit-tested but not yet wired into the live loop.
  • 🚧 sandbox: "worktree". Real sandboxed branch execution in throwaway git worktrees (observe outcomes instead of predicting them). The MVP runs sandbox: "dry".

FAQ

Is RPL-Judgement on by default?

No. It's opt-in via rpl.enabled = true. It adds LLM calls, so it's meant for high-stakes or unattended runs rather than every keystroke.

Does it slow down every action?

No. Only mutating batches whose blast-radius severity clears blast_threshold are reviewed. Read-only and low-risk work takes the cheap path with zero added cost.

What happens if the judge model is down or returns garbage?

It fails open — the action is approved (recorded at the blast severity) — so a flaky judge can never deadlock the agent. RPL is a safety enhancement layered over the permission engine, not a single point of failure.

How is this different from the permission engine or pre_tool_use hooks?

Permissions and hooks gate an individual call at execution time. RPL adds an upstream, plan-level adversarial review that can bounce a whole batch back for re-planning before it reaches those gates. They compose.

Does RPL change what executes?

It changes whether a plan executes (approve / bounce), not how. The surviving plan runs through the exact same typed pipeline, permissions, and approvals as without RPL.

What is the "Error Delta" used for?

It's the regret signal — predicted risk vs. actual outcome — written to long-term memory, where it feeds value-based fitness so the agent's judgement self-calibrates over time.


See also: Self-Management · Memory & Knowledge · Configuration · FAQ.
