Six Principles for Agent Systems That Don't Hallucinate

DEV Community

Principle 4: Knowledge as a separate layer

What it is. Domain knowledge – platform patterns, known constraints, gotchas you only discover in real-world use – lives in separate files that agents read at run time and that the main code never imports. The format is curated Markdown or YAML, not an embedding vector store where texts are pre-translated into numeric representations and retrieved by similarity.

Why it works. Domain knowledge changes on a different rhythm than the code itself. A UI framework might update once a year; your code changes every week. If the knowledge is baked into the code, a framework upgrade becomes a migration. If it lives in a separate layer, you change one YAML file and everything else stays intact.

A curated KB is also deterministic. RAG chooses top-k documents by embedding similarity, and if an important paragraph misses the retrieval cut, the agent runs without it. A flat KB is either entirely present in context or it is not – and that is immediately visible.
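The "entirely present or not" property can be sketched in a few lines of Python. This is a minimal illustration, not the loader from the project: the function name `load_flat_kb`, the directory layout, and the `## KB file:` separator are all assumptions made for the example.

```python
from pathlib import Path


def load_flat_kb(kb_dir: Path, suffixes=(".md", ".yml", ".yaml")) -> tuple[str, list[str]]:
    """Concatenate every KB file into one context string, in a fixed order.

    Returns the context text plus the list of files that went into it,
    so a run can log exactly what the agent saw.
    """
    # Sort by relative path so the same KB always yields the same context.
    files = sorted(
        (p for p in kb_dir.rglob("*") if p.is_file() and p.suffix in suffixes),
        key=lambda p: p.relative_to(kb_dir).as_posix(),
    )
    parts: list[str] = []
    loaded: list[str] = []
    for path in files:
        rel = path.relative_to(kb_dir).as_posix()
        body = path.read_text(encoding="utf-8").strip()
        # Hypothetical separator; any stable header format would do.
        parts.append(f"## KB file: {rel}\n{body}\n")
        loaded.append(rel)
    return "\n".join(parts), loaded
```

There is no ranking step and no cut-off: a file is either in `loaded` or it was not in the directory, which is what makes a missing rule easy to spot.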

E2E example. On my ecommerce project, the local KB is 12 Markdown files (admin, classic-storefront, modern-storefront), plus 9 YAML files in a global cross-stack KB (tailwind-css, alpine-js, fastapi, nextjs, and so on). When I ported the method to FastAPI + NextJS, tailwind-css.yml, alpine-js.yml, and mailpit.yml just worked on the new stack without modification. That is cross-project KB reuse: platform knowledge isolated into its own layer travels across projects.

This is a rare kind of evidence in the current multi-agent literature – almost every public case study shows one system on one stack. Portability is what confirms that the split between code, KB, and agents is not cosmetic but architectural: the KB layer behaves like a self-contained component.

Non-E2E application. A security-audit KB can cover CVE categories, OWASP patterns, and framework-specific gotchas (XSS in template engines, SQL injection in ORM bypasses). A customer-support KB can encode ticket types, escalation rules, and refund policies. A documentation generator KB can define documentation formats (JSDoc, RST, OpenAPI) and conventions for each language.

What breaks without it. Knowledge gets smeared across prompts and code. Every agent ends up with its own copy of the rules, and those copies drift apart over time. When the platform changes, there is no single place to update.

When RAG is actually needed

A flat KB stops working in one of three situations: it grows past roughly 200k tokens (too expensive to load in full), its sources are uncurated (code, tickets, logs), or retrieval becomes history-driven (the agent benefits from the top-k most similar prior cases). At that point, the KB evolves into RAG – but that is a change of tool, not of methodology. The contract, role separation, and persistent state all remain.
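The three conditions can be written down as an explicit check. The sketch below is only illustrative: the four-characters-per-token estimate is a rough assumption (a real tokenizer would be more accurate), and the function and flag names are hypothetical.

```python
# The "around 200k tokens" figure from the text; tune it to your model and budget.
FLAT_KB_TOKEN_BUDGET = 200_000


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return round(len(text) / chars_per_token)


def reasons_to_leave_flat_kb(
    kb_text: str,
    *,
    sources_curated: bool = True,
    needs_similar_cases: bool = False,
    token_budget: int = FLAT_KB_TOKEN_BUDGET,
) -> list[str]:
    """Return the conditions that argue for RAG; an empty list means stay flat."""
    reasons = []
    if estimate_tokens(kb_text) > token_budget:
        reasons.append("size")
    if not sources_curated:
        reasons.append("uncurated sources")
    if needs_similar_cases:
        reasons.append("history-driven retrieval")
    return reasons
```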

Principle 5: Closed-loop learning (knowledge compounding)

What it is. Every failure or error is turned into a structured artifact – not "fixed a selector," but a completed template with diagnosis, hypotheses, action taken, verification, KB candidates, and out-of-scope items. Those artifacts then feed back into the KB, so the next agent run already sees them.

Why it works. Without a closed loop, every run rediscovers the same failures. With one, you get knowledge compounding. The KB grows by the same logic as compound interest: the system becomes cheaper and more accurate on every pass.

E2E example. The healer writes per-run files under heal-findings/<date>-<module>.md with six sections: A (diagnosis), B (hypotheses), C (action), D (verification), E (KB candidates), F (out-of-scope siblings). Section E is the promotion path into the KB. On my project, across eight runs, the KB grew by about two-thirds (from 25 gotchas to 42), and first_try_pass_rate rose from 14% (a new module) to 95% (the third run of the same module). That is the KB saturation curve: same agents, same prompts, different feed.
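A six-section finding could be modelled roughly as follows. The section letters and the heal-findings/<date>-<module>.md naming come from the description above; the class name, field names, Markdown heading format, and the parsing helper are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from datetime import date

# Section letters and meanings as described in the text; titles are illustrative.
SECTIONS = [
    ("A", "Diagnosis"),
    ("B", "Hypotheses"),
    ("C", "Action"),
    ("D", "Verification"),
    ("E", "KB candidates"),
    ("F", "Out-of-scope siblings"),
]


def _bullets(items: list[str]) -> str:
    return "\n".join(f"- {item}" for item in items) if items else "- (none)"


@dataclass
class HealFinding:
    module: str
    run_date: date
    diagnosis: str
    hypotheses: list[str]
    action: str
    verification: str
    kb_candidates: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)

    def filename(self) -> str:
        # Mirrors heal-findings/<date>-<module>.md
        return f"{self.run_date.isoformat()}-{self.module}.md"

    def to_markdown(self) -> str:
        bodies = {
            "A": self.diagnosis,
            "B": _bullets(self.hypotheses),
            "C": self.action,
            "D": self.verification,
            "E": _bullets(self.kb_candidates),
            "F": _bullets(self.out_of_scope),
        }
        return "\n".join(f"## {key}. {title}\n\n{bodies[key]}\n" for key, title in SECTIONS)


def extract_kb_candidates(markdown: str) -> list[str]:
    """Pull the bullet items out of section E, the promotion path into the KB."""
    candidates: list[str] = []
    in_section_e = False
    for line in markdown.splitlines():
        if line.startswith("## "):
            in_section_e = line.startswith("## E.")
        elif in_section_e and line.startswith("- ") and line != "- (none)":
            candidates.append(line[2:].strip())
    return candidates
```

Keeping section E machine-readable means candidates can be listed for review without re-reading the whole report; whether a candidate is actually promoted remains a separate decision.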

Non-E2E application. In a code-review pipeline, each rejected agent comment becomes structured feedback ("false positive: the agent flagged X, but X is allowed in this module under line N of CONTRACT.md") and is then promoted into the KB, so the next run sees the rule. In a migration tool, each failed migration becomes a markdown report with the root cause, then a rule in migration-gotchas.yml, so the next migration does not repeat the mistake. In a security audit, each false positive becomes a rule in audit-exceptions.yml, improving signal-to-noise.

What breaks without it. Agents do not learn between runs. The tenth run is as expensive as the first. Every failure requires manual diagnosis from scratch.

Principle 6: Additive instrumentation

What it is. Metrics after each run are written to a file with an evolving schema: new fields are added, old fields stay. v1 records remain valid after v2 fields are introduced. No breaking changes, no migrations.

Why it works. Without quantitative feedback, "is it getting better?" is an unanswerable question. The feeling that "it seems faster now" is not data. With metrics.jsonl, you can actually see the trendline.

There is a second benefit: an additive schema lets you learn gradually which metrics matter. I did not know in advance that first_try_pass_rate would become a key metric; it only appeared on the third run, when I noticed that the number of healing iterations was a proxy for KB maturity. If the schema had been rigid, I would have needed a migration for older records. With an additive schema, I simply added the field and the old records stayed valid.

E2E example. metrics.jsonl v1 (the first two runs) contains timestamp, target, stack, phases, kb_updates, and volume. v2 (from the third run onward) adds first_try_pass_rate, real_app_bugs_found[], test_churn, kb_hits, patterns_added, and wall_clock_ms. The v1 records remained valid, which lets me query across all eight runs.

Non-E2E application. In an ML training pipeline, experiments.jsonl can record hyperparameters, dataset version, and metrics. In a refactoring tool, refactor-runs.jsonl can track the number of changed files, tests broken or restored, and review time. In customer support, tickets.jsonl can store time-to-first-response, escalation depth, and resolution type.

What breaks without it. You cannot say objectively whether the system is improving. Debates about whether it got better or worse get resolved by intuition instead of data. When a new agent introduces an unexpected regression, you do not see it until complaints accumulate.

What these principles give you together

Each principle on its own is a useful pattern. Together they produce a system with specific properties:

Accuracy. Contract + source reading + role separation cut down the space for improvisation. The agent works from ground truth – what is actually in the code – rather than guesses about how it might be organised.
Fewer hallucinations. Persistent state provides stable context; the KB provides deterministic rules; the closed loop catches hallucinations and prevents them from recurring.
Reproducibility. The same input artifact plus the same KB snapshot should produce the same output. Different results across runs are treated as a bug to investigate, not as "the nature of LLMs."
Knowledge accumulation. Closed-loop learning plus additive metrics turn every run into data. After ten runs, you know more about your system than after a hundred one-off GPT calls driven by a single prompt.
Portability. The same six principles work for E2E testing, code review, refactoring, security audit, and migration tools. Only the KB and helpers are platform-specific; the architecture is not.
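The reproducibility property in the list above is easier to enforce if each run records what it was given. One possible way to do that, sketched here as an assumption rather than as part of the original method, is to fingerprint the input artifact together with the KB snapshot; the function name and the hashing scheme are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot_fingerprint(input_artifact: Path, kb_dir: Path) -> str:
    """Hash the input artifact plus every KB file, in a fixed order."""
    digest = hashlib.sha256()

    def add(label: str, path: Path) -> None:
        # Include the label so renaming a KB file changes the fingerprint too.
        digest.update(label.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")

    add("input", input_artifact)
    kb_files = sorted(
        (p for p in kb_dir.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(kb_dir).as_posix(),
    )
    for path in kb_files:
        add(path.relative_to(kb_dir).as_posix(), path)
    return digest.hexdigest()
```

If two runs share a fingerprint but produce different outputs, the difference cannot be blamed on the input or the KB, which narrows the investigation.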

What these principles do not give you

I would not present this as a silver bullet. The principles solve a specific class of problems – accuracy and reproducibility in multi-agent systems – and do not solve others.

They do not make the agent smarter. GPT does not turn into an expert just because you wrapped six layers around it. If the task requires creativity or deep understanding, the agent stays limited by the model.
They do not pay off for one-off tasks. The payback starts after three to five runs. If you only run the system once, the overhead is not justified.
They do not replace review. Closed-loop learning catches errors that the agent or the system itself already noticed. Errors nobody recognised as errors still stay in the code.
They require discipline. Six-section heal findings, an explicit contract, persistent state – all of that is work. If the team is not willing to maintain those artifacts, the method turns into dead weight.

What comes next

I am now applying these six principles to a third independent domain – knowledge work (planning, learning, content), not software development. This is a deliberate attempt to eliminate the method's software bias: the first two validations were in E2E testing, and it is still unclear which principles are code-specific and which are truly domain-agnostic.

If you are applying a similar architecture in another domain – or, conversely, if you found where it stops working – I would love to hear about it. I am especially interested in cases where a principle did not work. Those cases show the boundaries of the method more clearly than successful implementations do.

P.S. In parallel, I am writing a more technical deep-dive series on one concrete application of these principles – E2E testing: a month and a half of iteration, eight runs, a six-section healing protocol, and a breakdown of KB-saturation metrics. I am also preparing an open-source companion repo with a reference implementation of the six principles – framework, four agents, metrics schema, and skeleton KBs. Announcements for new articles and the repo launch go out on LinkedIn; the articles themselves are published on the blog.

Top comments (6)

Theo Valmis
Founder, Mneme HQ. Engineering governance for AI coding agents: keeping AI-generated code aligned with your architecture, standards, and decisions. Preventing architectural drift.
Work: Founder · Joined: May 8, 2026

• May 12

The distinction between curated KB and RAG in Principle 4 is underappreciated. Most teams jump straight to vector retrieval because it feels like the modern default, but deterministic knowledge that is either entirely in context or not is far easier to debug than probabilistic top-k retrieval. With a flat KB, you know exactly what the agent saw. With RAG, you have to reconstruct whether the critical paragraph made it through similarity scoring, and that reconstruction step is itself error-prone.

Your KB saturation curve (14% to 95% first-try pass rate over three runs of the same module) is the kind of metric I wish more agent case studies would track. Single-run anecdotes are almost useless for evaluating whether an architecture is sound. The compounding effect you describe in Principle 5 is where the real ROI lives, and it only becomes visible when you measure across runs rather than celebrating a single good output.

Webmaster Ramos
Building AI agents for e-commerce. I write about what actually ships to production – not what fits on a slide. Topics: agent systems, e2e+LLM, MCP, agent-first commerce.
Location: Spain · Work: Independent e-commerce engineer focused on AI agent systems. · Joined: Feb 5, 2024

• May 12

Thanks – both points cut to the part of the article I most wanted to land, and you sharpened the framing on Principle 4 in a way the article didn't quite reach.

The reconstruction-step framing is the right one. In all three systems I run this on, the kb phase logs which KB files it loaded into a per-session lineage record. When a regression shows up, I can attribute it cleanly to either (a) a loaded KB-claim that was wrong, or (b) a failure class no loaded KB covered at all – and only (b) is a candidate for KB promotion. RAG's similarity-score floor blurs those two into a single "the model didn't retrieve it" bucket, and that loss of attribution is what makes the debugging cost compound, not the retrieval cost.

On the multi-run point: since publishing I've replicated the curve in two more settings. First – same E2E methodology (same four agents, same contract) ported from Magento to FastAPI/Alpine. One flow, two runs: 48.6% → 91.4%. Same shape, so the curve isn't a Magento artifact. Second – same six principles applied to writing production code instead of generating tests. Four analogous agents (analyzer/planner/executor/validator), shadow snapshots, auto-revert on regression. Here a "run" is one executor→validator round inside a single session. Early sessions (first scaffold features) took 4 rounds to PASS with 3 auto-reverts; recent sessions on the same agents and the same KB converge in 1-2 rounds with zero reverts. Same compounding shape, the loop just closes inside an hour rather than across days. Both replications strengthen your "single-run anecdotes are almost useless" point – even two runs of the same setup tell a different story than one.

The KB-promotion gate is the missing gear I'd add to the next iteration – that one's editorial, not automatic, and load-bearing.

Theo Valmis

• May 13

The attribution split is the load-bearing piece here. "KB-claim was wrong" and "no KB covered this failure class" are completely different problems with completely different remediation paths. One is a data quality fix, one is a coverage gap. Treating them as a single retrieval miss means you run two different fix loops through the same interface and the noise never settles.

The replication curve appearing in code generation is particularly useful evidence. Different task type, same agent loop structure, same compounding shape. That points to something in the loop architecture rather than domain-specific luck.

The editorial gate on KB promotion is the right call. Auto-promotion optimizes for recall over precision. You end up with a KB that explains everything the system got wrong rather than what is actually true. That distinction is where a lot of teams lose the compounding benefit.

Webmaster Ramos

• May 13

Yes, particularly the recall-over-precision framing — that's the cleanest handle on the editorial gate I've seen. The KB-as-error-log failure mode is exactly what auto-promotion produces. Thanks for sharpening it.

Siyu
Developer of Opportunity Skill. Full-stack builder. Deep into backend architecture, semantic search, and agent skills. Former VC and management consultant.
Email: siyu@questmeet.ai · Location: San Francisco, CA / Beijing, China · Education: University College Dublin · Work: Founder & Architect @ QuestMeet · Joined: May 6, 2026

• May 12

What struck me most is Principle 5 on closed-loop learning. You make a compelling case that agent systems should not be treated as stateless, disposable tools that rediscover the same failures every run. Instead, every failure should be converted into a structured artifact with six sections (diagnosis, hypotheses, action, verification, KB candidates, and out-of-scope items), which then feeds back into a curated knowledge base. This turns the system into a compounding asset: the KB grows, the first-try pass rate climbs, and each subsequent run becomes cheaper and more accurate. The data you shared (67% KB growth and a jump from 14% to 95% pass rate) is a rare kind of quantitative evidence in agent architecture writing.
