Principle 4: Knowledge as a separate layer
What it is. Domain knowledge – platform patterns, known constraints, gotchas you only discover in real-world use – lives in separate files that agents read at run time, rather than being baked into the main code. The files are curated Markdown or YAML, not a vector store in which texts are converted to embeddings and retrieved by similarity.
Why it works. Domain knowledge changes on a different rhythm than the code itself. A UI framework might update once a year; your code changes every week. If the knowledge is baked into the code, a framework upgrade becomes a migration. If it lives in a separate layer, you change one YAML file and everything else stays intact.
A curated KB is also deterministic. RAG chooses top-k documents by embedding similarity, and if an important paragraph misses the retrieval cut, the agent runs without it. A flat KB is either entirely present in context or it is not – and that is immediately visible.
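To make the "entirely present or not" property concrete, here is a minimal sketch of loading a flat KB. It assumes the KB is plain Markdown and YAML text under one directory; the function name `load_flat_kb`, the suffix list, and the `=== name ===` header format are illustrative choices, not part of any real project.

```python
from pathlib import Path

# Assumption: the KB is plain Markdown/YAML text files under one directory.
KB_SUFFIXES = {".md", ".yml", ".yaml"}


def load_flat_kb(kb_dir):
    """Concatenate every KB file under kb_dir into one context string."""
    root = Path(kb_dir)
    # Sort by relative POSIX path so the same files always yield the same string.
    files = sorted(
        (p for p in root.rglob("*") if p.is_file() and p.suffix in KB_SUFFIXES),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    sections = []
    for path in files:
        name = path.relative_to(root).as_posix()
        body = path.read_text(encoding="utf-8").strip()
        # One header per file makes it visible exactly what the agent was given.
        sections.append(f"=== {name} ===\n{body}")
    return "\n\n".join(sections)
```

There is no ranking step and no cut-off: a file is in the string or it is not, and two calls over the same directory return the same text.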
E2E example. On my ecommerce project, the local KB is 12 Markdown files (admin, classic-storefront, modern-storefront), plus 9 YAML files in a global cross-stack KB (tailwind-css, alpine-js, fastapi, nextjs, and so on). When I ported the method to FastAPI + Next.js, tailwind-css.yml, alpine-js.yml, and mailpit.yml worked on the new stack without modification. That is cross-project KB reuse: platform knowledge isolated into its own layer travels across projects.
This is a rare kind of evidence in the current multi-agent literature – almost every public case study shows one system on one stack. Portability is what confirms that the split between code, KB, and agents is not cosmetic but architectural: the KB layer behaves like a self-contained component.
Non-E2E application. A security-audit KB can cover CVE categories, OWASP patterns, and framework-specific gotchas (XSS in template engines, SQL injection in ORM bypasses). A customer-support KB can encode ticket types, escalation rules, and refund policies. A documentation generator KB can define documentation formats (JSDoc, RST, OpenAPI) and conventions for each language.
What breaks without it. Knowledge gets smeared across prompts and code. Every agent ends up with its own copy of the rules, and those copies drift apart over time. When the platform changes, there is no single place to update.
When RAG is actually needed
A flat KB stops working under any of three conditions: the KB grows past roughly 200k tokens (too expensive to load in full), the sources are uncurated (code, tickets, logs), or retrieval is history-driven (the agent benefits from the top-k most similar prior cases). At that point, the KB evolves into RAG – but that is a change of tool, not of methodology. The contract, role separation, and persistent state still remain.
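The three conditions can be written down as an explicit check. This is a sketch under stated assumptions: `KbProfile`, `estimate_tokens`, and `reasons_to_move_to_rag` are hypothetical names, the 4-characters-per-token estimate is a rough heuristic, and the 200k figure is the approximate limit from the text rather than a hard boundary.

```python
from dataclasses import dataclass

# "Around 200k tokens" from the text; treat the exact figure as tunable.
FLAT_KB_TOKEN_LIMIT = 200_000


@dataclass
class KbProfile:
    estimated_tokens: int
    curated: bool               # False for raw code, tickets, logs
    needs_similar_cases: bool   # True when top-k prior cases help the agent


def estimate_tokens(text):
    # Rough assumption: about 4 characters per token. Use a real tokenizer
    # when the number matters.
    return len(text) // 4


def reasons_to_move_to_rag(profile):
    """Return the conditions that apply; an empty list means a flat KB is fine."""
    reasons = []
    if profile.estimated_tokens >= FLAT_KB_TOKEN_LIMIT:
        reasons.append("too large to load in full")
    if not profile.curated:
        reasons.append("uncurated sources")
    if profile.needs_similar_cases:
        reasons.append("history-driven retrieval")
    return reasons
```

Returning the list of reasons, rather than a single boolean, keeps the decision auditable: you can see which condition pushed the KB toward retrieval.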
Principle 5: Closed-loop learning (knowledge compounding)
What it is. Every failure or error is turned into a structured artifact – not "fixed a selector," but a completed template with diagnosis, hypotheses, action taken, verification, KB candidates, and out-of-scope items. Those artifacts then feed back into the KB, so the next agent run already sees them.
Why it works. Without a closed loop, every run rediscovers the same failures. With one, you get knowledge compounding. The KB grows by the same logic as compound interest: the system becomes cheaper and more accurate on every pass.
E2E example. The healer writes per-run files under heal-findings/<date>-<module>.md with six sections: A (diagnosis), B (hypotheses), C (action), D (verification), E (KB candidates), F (out-of-scope siblings). Section E is the promotion path into the KB. On my project, across eight runs, the KB grew by 68% (from 25 gotchas to 42), and first_try_pass_rate rose from 14% (a new module) to 95% (the third run of the same module). That is the KB saturation curve: same agents, same prompts, different feed.
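One way to keep the six sections from being skipped is to model the finding as a typed record and render it. The sketch below is not the healer's actual code: the class `HealFinding`, the field names, the heading wording, and the choice of an ISO date for `<date>` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from pathlib import Path


def _bullets(items):
    return "\n".join(f"- {item}" for item in items) if items else "(none)"


@dataclass
class HealFinding:
    module: str
    run_date: date
    diagnosis: str
    hypotheses: list
    action: str
    verification: str
    kb_candidates: list = field(default_factory=list)
    out_of_scope: list = field(default_factory=list)

    def path(self, root="heal-findings"):
        # Assumption: <date> is rendered as an ISO date (YYYY-MM-DD).
        return Path(root) / f"{self.run_date.isoformat()}-{self.module}.md"

    def to_markdown(self):
        sections = [
            ("A. Diagnosis", self.diagnosis),
            ("B. Hypotheses", _bullets(self.hypotheses)),
            ("C. Action", self.action),
            ("D. Verification", self.verification),
            ("E. KB candidates", _bullets(self.kb_candidates)),
            ("F. Out-of-scope siblings", _bullets(self.out_of_scope)),
        ]
        return "\n\n".join(f"## {title}\n{body}" for title, body in sections) + "\n"
```

Because diagnosis, hypotheses, action, and verification are required fields, a finding that only says "fixed a selector" cannot be constructed; empty sections E and F render as "(none)" so their absence is explicit.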
Non-E2E application. In a code-review pipeline, each rejected agent comment becomes structured feedback ("false positive: the agent flagged X, but X is allowed in this module under line N of CONTRACT.md") and is then promoted into the KB, so the next run sees the rule. In a migration tool, each failed migration becomes a markdown report with the root cause, then a rule in migration-gotchas.yml, so the next migration does not repeat the mistake. In a security audit, each false positive becomes a rule in audit-exceptions.yml, improving signal-to-noise.
What breaks without it. Agents do not learn between runs. The tenth run is as expensive as the first. Every failure requires manual diagnosis from scratch.
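The promotion step, where candidates from a finding become KB rules, can be sketched as a merge with de-duplication. This assumes gotchas are stored as plain strings and that any human review of candidates happens before this function is called; `promote_candidates` is a hypothetical helper name.

```python
def _normalise(text):
    # Case and whitespace differences should not create duplicate rules.
    return " ".join(text.lower().split())


def promote_candidates(kb_gotchas, candidates):
    """Return a new gotcha list with unseen candidates appended in order."""
    seen = {_normalise(g) for g in kb_gotchas}
    merged = list(kb_gotchas)  # copy, so the caller's list is not mutated
    for candidate in candidates:
        key = _normalise(candidate)
        if key and key not in seen:
            seen.add(key)
            merged.append(candidate.strip())
    return merged
```

The comparison here is purely textual, so two differently worded versions of the same rule would both be kept; catching those still takes a curator.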
Principle 6: Additive instrumentation
What it is. Metrics after each run are written to a file with an evolving schema: new fields are added, old fields stay. v1 records remain valid after v2 fields are introduced. No breaking changes, no migrations.
Why it works. Without quantitative feedback, "is it getting better?" is an unanswerable question. The feeling that "it seems faster now" is not data. With metrics.jsonl, you can actually see the trendline.
There is a second benefit: an additive schema lets you learn gradually which metrics matter. I did not know in advance that first_try_pass_rate would become a key metric; it only appeared on the third run, when I noticed that the number of healing iterations was a proxy for KB maturity. If the schema had been rigid, I would have needed a migration for older records. With an additive schema, I simply added the field and the old records stayed valid.
E2E example. metrics.jsonl v1 (the first two runs) contains timestamp, target, stack, phases, kb_updates, and volume. v2 (from the third run onward) adds first_try_pass_rate, real_app_bugs_found[], test_churn, kb_hits, patterns_added, and wall_clock_ms. The v1 records remained valid, which lets me query across all eight runs.
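Reading a mixed v1/v2 file is mostly a matter of never assuming a newer field exists. In the sketch below the field names come from the text, but the sample values, their types, and the helper names `read_metrics`, `pass_rate_trend`, and `mean_pass_rate` are made up for illustration.

```python
import json

# Illustrative records only: values and types are invented, not real run data.
SAMPLE_JSONL = """\
{"timestamp": "2000-01-01T00:00:00Z", "target": "module-a", "stack": "stack-x", "phases": 4, "kb_updates": 3, "volume": 20}
{"timestamp": "2000-01-03T00:00:00Z", "target": "module-a", "stack": "stack-x", "phases": 4, "kb_updates": 1, "volume": 20, "first_try_pass_rate": 0.5, "kb_hits": 7, "wall_clock_ms": 1000}
"""


def read_metrics(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def pass_rate_trend(records):
    # .get() yields None for v1 records, so old rows need no migration.
    return [(r["timestamp"], r.get("first_try_pass_rate")) for r in records]


def mean_pass_rate(records):
    """Average over the records that have the field; None if none do."""
    rates = [r["first_try_pass_rate"] for r in records if "first_try_pass_rate" in r]
    return sum(rates) / len(rates) if rates else None


records = read_metrics(SAMPLE_JSONL.splitlines())
print(pass_rate_trend(records))
```

Fields shared by both versions, such as timestamp and kb_updates, can be queried across every run, while v2-only fields show up as gaps for the early runs instead of errors.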
Non-E2E application. In an ML training pipeline, experiments.jsonl can record hyperparameters, dataset version, and metrics. In a refactoring tool, refactor-runs.jsonl can track the number of changed files, tests broken or restored, and review time. In customer support, tickets.jsonl can store time-to-first-response, escalation depth, and resolution type.
What breaks without it. You cannot say objectively whether the system is improving. Debates about whether it got better or worse get resolved by intuition instead of data. When a new agent introduces an unexpected regression, you do not see it until complaints accumulate.
What these principles give you together
Each principle on its own is a useful pattern. Together they produce a system with specific properties:
- Accuracy. Contract + source reading + role separation cut down the space for improvisation. The agent works from ground truth – what is actually in the code – rather than guesses about how it might be organised.
- Fewer hallucinations. Persistent state provides stable context; the KB provides deterministic rules; the closed loop catches hallucinations and prevents them from recurring.
- Reproducibility. The same input artifact plus the same KB snapshot should produce the same output. Different results across runs are treated as a bug to investigate, not as "the nature of LLMs."
- Knowledge accumulation. Closed-loop learning plus additive metrics turn every run into data. After ten runs, you know more about your system than after a hundred one-off GPT calls driven by a single prompt.
- Portability. The same six principles work for E2E testing, code review, refactoring, security audit, and migration tools. Only the KB and helpers are platform-specific; the architecture is not.
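The reproducibility property in the list above is checkable if each run records a fingerprint of its inputs. The following is a sketch, assuming the input artifact and the KB files are available as bytes; `run_fingerprint` and `divergent_runs` are hypothetical helpers, and SHA-256 is simply a convenient choice of hash.

```python
import hashlib
from collections import defaultdict


def _feed(hasher, chunk):
    # A length prefix keeps ("ab", "c") and ("a", "bc") from hashing the same.
    hasher.update(len(chunk).to_bytes(8, "big"))
    hasher.update(chunk)


def run_fingerprint(input_artifact, kb_files):
    """Hash the input bytes plus a KB snapshot given as {filename: bytes}."""
    hasher = hashlib.sha256()
    _feed(hasher, input_artifact)
    for name in sorted(kb_files):  # sorted so dict order does not matter
        _feed(hasher, name.encode("utf-8"))
        _feed(hasher, kb_files[name])
    return hasher.hexdigest()


def divergent_runs(runs):
    """runs: iterable of (fingerprint, output_digest) pairs.

    Return the fingerprints that produced more than one distinct output.
    """
    outputs = defaultdict(set)
    for fingerprint, output_digest in runs:
        outputs[fingerprint].add(output_digest)
    return sorted(fp for fp, seen in outputs.items() if len(seen) > 1)
```

Any fingerprint returned by `divergent_runs` marks a case where identical inputs led to different outputs, which is the situation the list treats as a bug to investigate.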
What these principles do not give you
I would not present this as a silver bullet. The principles solve a specific class of problems – accuracy and reproducibility in multi-agent systems – and do not solve others.
- They do not make the agent smarter. GPT does not turn into an expert just because you wrapped six layers around it. If the task requires creativity or deep understanding, the agent stays limited by the model.
- They do not work well for very short tasks. The payback starts after three to five runs. If you only run the system once, the overhead is not justified.
- They do not replace review. Closed-loop learning catches errors that the agent or the system itself already noticed. Errors nobody recognised as errors still stay in the code.
- They require discipline. Six-section heal findings, an explicit contract, persistent state – all of that is work. If the team is not willing to maintain those artifacts, the method turns into dead weight.
What comes next
I am now applying these six principles in a third validation, this time in a different domain – knowledge work (planning, learning, content), not software development. This is a deliberate attempt to eliminate the method's software bias: the first two validations were both in E2E testing, and it is still unclear which principles are code-specific and which are truly domain-agnostic.
If you are applying a similar architecture in another domain – or, conversely, if you found where it stops working – I would love to hear about it. I am especially interested in cases where a principle did not work. Those cases show the boundaries of the method more clearly than successful implementations do.
P.S. In parallel, I am writing a more technical deep-dive series on one concrete application of these principles – E2E testing: a month and a half of iteration, eight runs, a six-section healing protocol, and a breakdown of KB-saturation metrics. I am also preparing an open-source companion repo with a reference implementation of the six principles – framework, four agents, metrics schema, and skeleton KBs. Announcements for new articles and the repo launch go out on LinkedIn; the articles themselves are published on the blog.