@jugeni, @txdesk, @hannune, and @nazar_boyko took the loose idea and turned it into a spec. This post is that spec, credited to them, and it's now filed as issues on the repo.
The principle, stated properly
Sort your features by whether their source is independent of the model. Gate on those. Treat the self-authored one as context, never authorization. That was @txdesk's line, and it outlives the email case completely — it's the rule for any model-scored decision.
The part I got wrong in post four: I called confidence a "tiebreaker." @jugeni corrected it, and the correction matters. Confidence doesn't demote to a weak signal — it inverts. Self-graded confidence has the same computational shape on adversarial input as on cooperative input, and that sameness is the definition of a confident hallucination. A polished impersonation that reads as a trusted sender is exactly a high-confidence, high-sender-trust, reversible-looking email. So on cooperative input confidence is scenery; on adversarial input it's counter-evidence. The same number flips meaning depending on what the rest of the gate sees. It can't be a tiebreaker, because it's wrong precisely when you'd most want to trust it.
The wiring
The gate decides on the world-anchored features only. senderTrust is grounded on observed sender history, and reversibility is sourced from an action-type lookup; both belong to the runtime, not the model. The classifier proposes; the runtime arbitrates with facts the model cannot author.
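Here is a minimal sketch of that wiring in Python, to make the split concrete. It is not the repo's code: the action names, the `REVIEW` tier, the lookup table, and the `MIN_SENDER_TRUST` threshold are all assumptions made up for illustration. The point is only that the gate takes the model's confidence as an argument and never reads it.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    REVIEW = "review"  # hypothetical fallback tier


@dataclass(frozen=True)
class WorldAnchored:
    """Features computed by the runtime; the model never writes these."""
    sender_trust: float  # from observed sender history, 0.0 to 1.0
    reversible: bool     # from an action-type lookup


# Hypothetical lookup table and threshold, for illustration only.
REVERSIBILITY = {"archive": True, "label": True, "send_reply": False}
MIN_SENDER_TRUST = 0.8


def anchor(action_type: str, sender_trust: float) -> WorldAnchored:
    # Unknown action types fail closed: they are treated as irreversible.
    return WorldAnchored(sender_trust, REVERSIBILITY.get(action_type, False))


def gate(features: WorldAnchored, model_confidence: float) -> Decision:
    # Confidence is passed in as context but deliberately takes no part
    # in the decision.
    del model_confidence
    if features.reversible and features.sender_trust >= MIN_SENDER_TRUST:
        return Decision.AUTO
    return Decision.REVIEW
```

With this shape, `gate(anchor("send_reply", 0.9), 0.99)` comes back as `Decision.REVIEW` no matter how sure the model is, because nothing the model authored is in the vote.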
Confidence gets a different job: the canary. After the gate decides, compare confidence to the gate's conclusion. If they agree, silence. If confidence is high and the gate rejected — that's the post-mortem you want, and it goes to a triage queue, not a log line nobody reads. @jugeni's framing: confidence reads the gate, not the other way around. That keeps the self-authored number out of the vote and turns disagreement into something you can audit.
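A rough sketch of the canary, assuming it runs strictly after the gate has decided. `TriageQueue` is a stand-in for whatever review queue you actually have, and reusing 0.85 as the "model was sure" cutoff is my assumption, not part of the spec.

```python
from dataclasses import dataclass, field


@dataclass
class TriageQueue:
    """Stand-in for a real review queue (hypothetical)."""
    items: list = field(default_factory=list)

    def push(self, item: dict) -> None:
        self.items.append(item)


HIGH_CONFIDENCE = 0.85  # assumed cutoff for "the model was sure"


def run_canary(decision_id: str, gate_allowed: bool,
               confidence: float, queue: TriageQueue) -> bool:
    """Runs after the gate. Returns True if a disagreement was queued."""
    if confidence >= HIGH_CONFIDENCE and not gate_allowed:
        queue.push({
            "id": decision_id,
            "confidence": confidence,
            "reason": "high confidence, gate rejected",
        })
        return True
    return False  # agreement or low confidence: stay silent
```

The function returns a flag and writes to the queue, but it has no way to change `gate_allowed`. That one-way dependency is the whole design: confidence reads the gate.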
One implementation detail makes the whole thing provable: pull the runtime corroborator into a named external-context object in the decision trace. The model reads it; it can't write it. That's what lets you prove after the fact that the decision was anchored to something outside the model — which is also where the eval gets its teeth.
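One way to sketch the external-context object, assuming a Python runtime. The field names and the `source` label are hypothetical. Frozen dataclasses and `MappingProxyType` give read-only views inside one process, which is a convention rather than a security boundary; a real system would also want the trace persisted somewhere the model's process can't touch.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ExternalContext:
    source: str                  # which runtime component produced the facts
    facts: Mapping[str, object]  # read-only view of the corroborating facts


@dataclass(frozen=True)
class DecisionTrace:
    decision: str
    model_confidence: float
    external_context: ExternalContext


def build_trace(decision: str, confidence: float,
                source: str, facts: dict) -> DecisionTrace:
    # Copy the facts, then wrap them so later mutation attempts fail.
    ctx = ExternalContext(source, MappingProxyType(dict(facts)))
    return DecisionTrace(decision, confidence, ctx)


def facts_are_writable(trace: DecisionTrace) -> bool:
    """Audit helper: True if the external facts can be mutated in place."""
    try:
        trace.external_context.facts["tampered"] = True
    except TypeError:
        return False
    return True
```

Because the context is a named field in the trace, a post-mortem can point at exactly which runtime facts anchored a decision, separately from the model's own score.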
The eval that turns a belief into a number
Post four's honest close was that "the floor saves us" is a belief until it's a measurement. @jugeni and @hannune specced the measurement, and it's sharper than what I'd have built.
Don't measure a threshold ("did the impersonation reach AUTO"). Measure the delta: whether the (confidence − world-anchored-corroboration) spread separates adversarial from cooperative samples in distribution. The cooperative set is held-out known-safe senders — that's your floor distribution. The adversarial set is hand-crafted to be high-confidence, thin-corroboration on purpose, and — this is the discipline I'd have skipped — matched to the cooperative set on confidence. If the adversarial set has lower confidence than the cooperative one, the eval is leaking signal somewhere else and the spread isn't measuring what it looks like.
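A minimal sketch of that measurement, under my own assumptions: each sample is a `(confidence, corroboration)` pair with both values in [0, 1], "matched on confidence" is checked as a difference in means within a tolerance, and separation is scored as the probability that a random adversarial spread exceeds a random cooperative one. The spec doesn't fix any of those choices; they're just one reasonable way to get a number.

```python
from statistics import mean

Sample = tuple[float, float]  # (confidence, corroboration), both in [0, 1]


def spreads(samples: list[Sample]) -> list[float]:
    """The delta: confidence minus world-anchored corroboration."""
    return [conf - corr for conf, corr in samples]


def confidence_matched(coop: list[Sample], adv: list[Sample],
                       tol: float = 0.05) -> bool:
    """True if the two sets have mean confidence within tol of each other."""
    coop_mean = mean(conf for conf, _ in coop)
    adv_mean = mean(conf for conf, _ in adv)
    return abs(coop_mean - adv_mean) <= tol


def separation(coop: list[Sample], adv: list[Sample]) -> float:
    """Probability that a random adversarial spread exceeds a random
    cooperative one, with ties counted as half.
    0.5 means no separation, 1.0 means perfect separation."""
    coop_s, adv_s = spreads(coop), spreads(adv)
    wins = sum((a > c) + 0.5 * (a == c) for a in adv_s for c in coop_s)
    return wins / (len(adv_s) * len(coop_s))
```

Run `confidence_matched` first. If it fails, the `separation` score isn't trustworthy, because the two sets differ on confidence itself and the spread may just be picking that up.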
The canonical fixture: a sender impersonation that lands AUTO at 0.92 against an action the runtime reversibility table marks internal-only. That's the exact corner where the floor does all the work and the score does none.
It's not just email
@hannune pointed out the principle generalizes straight into retrieval. A model's confidence in a retrieved answer is high precisely because the chunk sounded plausible, not because it's grounded — the same self-referential trap. Citation overlap across retrieved chunks, entity-level consistency with a knowledge graph: corroborators the model can't author. Confident-plus-external-signals-thin-or-contradictory is the canary in that domain too. Any time a model scores features for a decision, the same sort applies.
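A rough sketch of the same canary for retrieval. Everything here is an assumption for illustration: entities are taken to be extracted upstream by something other than the answering model, corroboration is defined as the share of retrieved chunks that contain every entity in the answer, and both thresholds are placeholders. A knowledge-graph consistency check would slot in as a second corroborator in the same way.

```python
def corroboration(answer_entities: set[str], chunks: list[set[str]]) -> float:
    """Share of retrieved chunks whose entity set contains every
    entity in the answer. Empty inputs count as no corroboration."""
    if not chunks or not answer_entities:
        return 0.0
    return sum(answer_entities <= chunk for chunk in chunks) / len(chunks)


def retrieval_canary(confidence: float, answer_entities: set[str],
                     chunks: list[set[str]],
                     high_conf: float = 0.85,
                     min_corroboration: float = 0.5) -> bool:
    """True when the model is confident but external support is thin."""
    support = corroboration(answer_entities, chunks)
    return confidence >= high_conf and support < min_corroboration
```

As with email, the canary fires on the combination, not on either number alone: a confident answer backed by most chunks is quiet, and so is an unconfident answer with thin support.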
Honest status
None of this is shipped yet. Today confidence still gates AUTO at 0.85, and what makes that safe is the deterministic floor underneath — AUTO's autonomous execution is off, and the three irreversible actions fail closed regardless of any score. This is design hardening for when AUTO acts, not a live hole. I filed the two pieces as issues so the thread has somewhere to land: the world-anchored gate + canary and the delta eval.
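For contrast, here is a sketch of the current shape as described above, not a copy of the repo's code. The action names are placeholders because I haven't named the three irreversible actions here, `REVIEW` is a hypothetical fallback tier, and whether the real comparison is `>=` or `>` is an assumption.

```python
AUTO_THRESHOLD = 0.85           # today, confidence gates AUTO at 0.85
AUTO_EXECUTION_ENABLED = False  # today, autonomous execution is off

# Placeholder names: there are three irreversible actions, unnamed here.
IRREVERSIBLE_ACTIONS = frozenset(
    {"irreversible_a", "irreversible_b", "irreversible_c"}
)


def tier_for(confidence: float) -> str:
    # The model's confidence still picks the tier in the current design.
    return "AUTO" if confidence >= AUTO_THRESHOLD else "REVIEW"


def may_execute(action: str, confidence: float) -> bool:
    """The deterministic floor: checked before any score is consulted."""
    if action in IRREVERSIBLE_ACTIONS:
        return False  # fails closed regardless of any score
    return tier_for(confidence) == "AUTO" and AUTO_EXECUTION_ENABLED
```

With the flag off, `may_execute` returns `False` for every input, which is why a confident impersonation reaching the AUTO tier today is a labeling problem and not an execution problem.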
Four posts and a comment section later, the thesis is smaller and sharper than where it started: keep the model in the perception layer, gate on what it can't author, and treat its opinion of itself as a canary, never a vote. Thanks to everyone who out-designed me in the replies. The repo's in the open if you want to keep going — and if the series was useful to you, a ⭐ helps me gauge whether these are worth continuing: github.com/k08200/klorn.