Code example: orchestration (planner → specialists → synthesizer)
The same example wires the routing logic directly in Python, passing structured state (planner output, specialist JSON) rather than a growing chat transcript:
...
def run_review(design_doc: str) -> tuple[PlannerOutput, list[SpecialistOutput], ArchitectureReview]:
    deps = ReviewDeps(design_doc=design_doc)

    # Step 1: the planner selects lenses and emits clarifying questions.
    plan = planner.run_sync(
        "Plan the review lenses and questions for this design doc.",
        deps=deps,
    ).output

    # Step 2: fan out, one specialist per lens, each receiving the planner's scope notes.
    specialist_outputs: list[SpecialistOutput] = []
    for lens in plan.lenses:
        specialist = make_specialist(lens)
        specialist_outputs.append(
            specialist.run_sync(
                (
                    "Review this architecture with your lens. "
                    f"Scope notes: {plan.scope_notes or '(none)'}"
                ),
                deps=deps,
            ).output
        )

    # Step 3: pass structured state (questions + specialist JSON) to the synthesizer.
    synthesis_input = (
        "Synthesize the final ArchitectureReview from:\n\n"
        "Planner clarifying questions:\n- "
        + "\n- ".join(plan.clarifying_questions or ["(none)"])
        + "\n\nSpecialist outputs:\n"
        + "\n\n".join(o.model_dump_json(indent=2) for o in specialist_outputs)
    )
    final = synthesizer.run_sync(synthesis_input, deps=deps).output
    return plan, specialist_outputs, final
End-to-End Example: One Review from Design Doc to Structured Output
An architecture review only becomes real when you can run it on a concrete input and get a structured artifact out the other side. This section sketches a single "happy path" run.
Walkthrough inputs
At minimum, you want three kinds of input:
- Architecture description: the system diagram in prose - components, dependencies, data flows, and boundaries.
- Constraints: what is non-negotiable (compliance, latency targets, cloud restrictions, tenancy model).
- Known risks / focus areas: what the team is already worried about (migration, multi-region, PII, cost).
The biggest determinant of review quality is whether these are explicit. If the input does not state assumptions, the reviewer will either guess (bad) or ask a lot of questions (good but slower). Your contract should reward "ask a question" rather than "invent a fact".
Reference implementation sketch
Conceptually, you define three things:
- Schemas (Pydantic models) for the planner output, specialist findings, and the final review artifact.
- Agents bound to those schemas (planner agent, specialist agents, synthesizer agent).
- Orchestration that routes structured state between them.
Claude is the model behind each agent; PydanticAI is the layer that forces responses to fit the schema and provides retries/repairs when they don’t.
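To make the "schemas" item concrete, here is a minimal sketch of the finding and review contracts. It uses stdlib dataclasses and enums so it runs without dependencies; in the article's stack these would be Pydantic models. Field names follow the output excerpt shown later in this section. The `p2` severity, the `parse_finding` helper, and the sample values are assumptions for illustration, not the repository's actual definitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(str, Enum):
    P0 = "p0"
    P1 = "p1"
    P2 = "p2"  # assumed: the excerpt only shows p0 and p1


class EvidenceKind(str, Enum):
    QUOTE = "quote"
    OBSERVATION = "observation"
    INFERENCE = "inference"


@dataclass
class Evidence:
    kind: EvidenceKind
    detail: str


@dataclass
class Finding:
    title: str
    category: str
    severity: Severity
    evidence: list[Evidence]
    recommendation: str


@dataclass
class ArchitectureReview:
    summary: str
    overall_risk: str
    findings: list[Finding] = field(default_factory=list)
    questions: list[str] = field(default_factory=list)


def parse_finding(raw: dict) -> Finding:
    """Hypothetical helper: build a Finding, rejecting unknown enum values."""
    return Finding(
        title=raw["title"],
        category=raw["category"],
        severity=Severity(raw["severity"]),  # ValueError on e.g. "critical"
        evidence=[Evidence(EvidenceKind(e["kind"]), e["detail"]) for e in raw["evidence"]],
        recommendation=raw["recommendation"],
    )


raw_finding = {
    "title": "API key lifecycle undefined",
    "category": "authn",
    "severity": "p1",
    "evidence": [{"kind": "quote", "detail": "Tenants are identified by an API key."}],
    "recommendation": "Define rotation and revocation.",
}
example = parse_finding(raw_finding)
```

The enums are what make a wrong severity a validation error rather than a silent string; a schema layer such as Pydantic would give the same rejection plus the retry behavior described above.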
Clone the repository, set ANTHROPIC_API_KEY, and run python run_review.py to reproduce the walkthrough below.
Happy path routing
The run looks like this:
- Planner call: read the input, select lenses (e.g., Security, Scalability, Operability, Data/Integrity), and emit clarifying questions if required.
- Specialist fan-out: run each lens with the same input plus planner scope. Each specialist emits a list of Finding objects (and questions/unknowns if the contract supports them).
- Synthesizer merge: merge the lists into a final ArchitectureReview artifact: dedupe, rank, and normalize severity.
If you log inputs and outputs at each step, this pipeline is easy to debug: you can see whether the planner scoped incorrectly, whether a specialist missed evidence, or whether synthesis merged incorrectly.
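In the pipeline above the merge is done by the synthesizer agent. As a point of comparison when debugging, the same "dedupe, rank, normalize severity" semantics can be sketched deterministically. This is an illustrative baseline, not the article's implementation: the `merge_findings` name, the case-folded-title dedupe key, and the plain-dict findings are all assumptions.

```python
# Lower number = more severe; used both for normalization and ranking.
SEVERITY_ORDER = {"p0": 0, "p1": 1, "p2": 2}


def normalize_severity(raw: str) -> str:
    """Map variants like ' P1 ' to the canonical lowercase form."""
    value = raw.strip().lower()
    if value not in SEVERITY_ORDER:
        raise ValueError(f"unknown severity: {raw!r}")
    return value


def merge_findings(specialist_outputs: list[list[dict]]) -> list[dict]:
    """Dedupe by normalized title, keep the most severe copy, rank by severity."""
    best: dict[str, dict] = {}
    for findings in specialist_outputs:
        for finding in findings:
            severity = normalize_severity(finding["severity"])
            # Assumed dedupe key: case-folded title with collapsed whitespace.
            key = " ".join(finding["title"].casefold().split())
            current = best.get(key)
            if current is None or SEVERITY_ORDER[severity] < SEVERITY_ORDER[current["severity"]]:
                best[key] = {**finding, "severity": severity}
    # sorted() is stable, so equally severe findings keep first-seen order.
    return sorted(best.values(), key=lambda f: SEVERITY_ORDER[f["severity"]])


security = [{"title": "No key rotation", "severity": "P1"}]
operability = [
    {"title": "no key  rotation", "severity": "p0"},
    {"title": "Retry storms", "severity": "p1"},
]
merged = merge_findings([security, operability])
```

Comparing this baseline with the synthesizer's output is one cheap way to spot a merge that dropped or duplicated a finding.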
Example input and output
The final artifact should be something you can paste into a PR comment and parse as data. A good output has:
- A short summary (what’s good, what’s risky).
- Ranked findings with clear severities and categories.
- Evidence tied to the input (or explicit "inference" flags).
- Actionable recommendations (what to change, what to measure, what to decide).
- A small set of clarifying questions that genuinely block conclusions.
The companion repository includes a toy design doc at sample_design.md - short on purpose, but enough for specialists to anchor findings to real statements:
# Example design doc (toy)
We are building a multi-tenant SaaS API that ingests events from customer apps.
## Components
- Public REST API behind an API gateway.
- Worker service that processes events asynchronously.
- Postgres for tenant metadata and configuration.
- S3 for raw event payload storage.
- Redis for rate limiting and job deduplication.
## Constraints / assumptions
- Tenants are identified by an API key.
- Peak: 50k events/sec across tenants, spikes up to ×.
- PII may be present in event payloads.
- 99.9% availability target for ingestion endpoint.
## Known risks / focus
- We previously had incidents from retry storms.
- We need to support data deletion by tenant (GDPR-style).
Running python run_review.py against that file produces out_review.json. Here is an excerpt of the structured review (truncated for readability):
{
  "summary": "This multi-tenant SaaS event-ingestion platform has a **high overall risk profile** driven by four converging concern areas: (1) an undefined API key lifecycle... (2) absent tenant isolation controls... (3) PII in S3 with no encryption-at-rest strategy... (4) a GDPR deletion requirement spanning Postgres, S3, and Redis with no coordination mechanism...",
  "overall_risk": "high",
  "findings": [
    {
      "title": "GDPR Deletion Lacks Cross-Store Atomicity, Completeness Guarantee, and Audit Trail",
      "category": "compliance",
      "severity": "p0",
      "evidence": [
        {
          "kind": "quote",
          "detail": "'We need to support data deletion by tenant (GDPR-style)' is listed as a known risk but no deletion workflow, ordering, or rollback strategy is documented."
        },
        {
          "kind": "observation",
          "detail": "Data is spread across three independent stores — Postgres, S3, and Redis — with no described coordination mechanism."
        }
      ],
      "recommendation": "Implement a saga/orchestration pattern for tenant deletion that tracks per-store deletion state..."
    },
    {
      "title": "API Key Lifecycle Management Undefined — No Rotation, Revocation, or Scoping Controls",
      "category": "authn",
      "severity": "p1",
      "evidence": [
        {
          "kind": "quote",
          "detail": "Tenants are identified by an API key — no mention of key rotation, revocation, expiry, or scoping anywhere in the design."
        }
      ],
      "recommendation": "Implement a full API key lifecycle: scoped creation, server-side HMAC-SHA256 hashing, rotation policies, and immediate revocation propagation via Redis..."
    }
  ],
  "questions": [
    "What queue technology sits between the API gateway and the worker (SQS, Kafka, RabbitMQ, etc.)?",
    "..."
  ]
}
Notice how each finding carries explicit evidence kinds (quote, observation, inference) and a ranked severity - exactly the shape your contracts enforce, and exactly what makes the output routable into a PR comment or issue tracker without a second translation pass.
What did that run cost? On this toy doc, a full pipeline (planner + four specialists + synthesizer, using claude-sonnet-4-6) came to about $0.45. That’s reasonable for an occasional architecture review on a real design doc; it’s expensive if you run it on every small PR. Treat this as one data point - cost scales with document length, lens count, validation retries, and model choice - not a fixed price tag. It’s another reason to keep the topology lean and scope lenses deliberately.
The point isn’t the exact field names; it’s that the artifact can be routed into real workflows without a human reformatting it.
Failure Modes: Hallucinations, Conflicts, and Unknowns
Architecture review is a high-trust activity. When a human reviewer says "this will fail under load", you can ask why, argue about assumptions, or request a benchmark plan. When a model says it, you get a different problem: the statement is often well-phrased but its epistemic status is unclear. Is it anchored in the input? Is it an inference? Is it a generic warning? If you don’t design for that, you’ll end up with a reviewer that either hallucinates confidently or hedges uselessly.
This section is a set of failure modes worth designing against up front.
Unsupported claims: make evidence a first-class field
The simplest guardrail is structural: require each finding to carry evidence. "Evidence" can be a quote, a reference to a section of the input, or a concrete observation about the described architecture. If the input does not contain enough evidence, the finding should not pretend otherwise - it should downgrade severity or convert into a clarifying question.
This one constraint changes behavior. Models are much less likely to invent specifics when they must attach them to an evidence slot. And when they do invent, it becomes visible: the evidence field will be empty, vague, or obviously unrelated.
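One way to enforce this after validation is a small "evidence gate". The sketch below is an assumption about how such a gate could look, not code from the repository: findings with no usable evidence become clarifying questions, and findings backed only by inference are downgraded one level. The `apply_evidence_gate` name, the downgrade table, and the sample findings are hypothetical.

```python
# Assumed downgrade policy: one step down, with p2 as the floor.
DOWNGRADE = {"p0": "p1", "p1": "p2", "p2": "p2"}


def apply_evidence_gate(findings: list[dict]) -> tuple[list[dict], list[str]]:
    """Return (kept findings, clarifying questions) after checking evidence."""
    kept: list[dict] = []
    questions: list[str] = []
    for finding in findings:
        # Ignore evidence entries whose detail is empty or whitespace.
        evidence = [e for e in finding.get("evidence", []) if e.get("detail", "").strip()]
        if not evidence:
            # No anchor in the input: ask instead of asserting.
            questions.append(f"What in the design supports: {finding['title']}?")
            continue
        gated = {**finding, "evidence": evidence}
        if all(e["kind"] == "inference" for e in evidence):
            # Pure inference should not carry full severity.
            gated["severity"] = DOWNGRADE[finding["severity"]]
        kept.append(gated)
    return kept, questions


sample_findings = [
    {"title": "GDPR deletion lacks coordination", "severity": "p0",
     "evidence": [{"kind": "quote", "detail": "We need to support data deletion by tenant"}]},
    {"title": "Redis is a single point of failure", "severity": "p1",
     "evidence": [{"kind": "inference", "detail": "No replication is described."}]},
    {"title": "Queue partitions are undersized", "severity": "p1", "evidence": []},
]
kept, questions = apply_evidence_gate(sample_findings)
```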
Specialist disagreement: preserve dissent when it matters
Parallel specialists will disagree. Sometimes that’s a bug (one misunderstood the architecture). Sometimes it’s the point (tradeoffs are real). Synthesis should not always force consensus. A useful pattern is:
- If the disagreement is resolvable by the input, resolve it and cite the evidence.
- If the disagreement is resolvable by a missing fact, emit a question and present the conditional conclusions ("If X, then P0; if not, then P2").
- If it’s a genuine tradeoff, preserve dissent explicitly and explain the consequence of each choice.
The goal is not to "sound decisive". It’s to help engineers decide with clarity about what hinges on what.
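The conditional pattern ("If X, then P0; if not, then P2") can be carried as data rather than prose. A minimal sketch, assuming a hypothetical `ConditionalFinding` type that is not part of the article's contracts; the example question is invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConditionalFinding:
    title: str
    hinge_question: str   # the missing fact the severity depends on
    severity_if_yes: str
    severity_if_no: str

    def resolve(self, answer: Optional[bool] = None) -> str:
        """Render both branches while unanswered, or the single outcome once known."""
        if answer is None:
            return (
                f"{self.title}: if yes then {self.severity_if_yes.upper()}, "
                f"if no then {self.severity_if_no.upper()} "
                f"(open question: {self.hinge_question})"
            )
        severity = self.severity_if_yes if answer else self.severity_if_no
        return f"{self.title}: {severity.upper()}"


isolation = ConditionalFinding(
    title="Tenant isolation",
    hinge_question="Are S3 prefixes shared across tenants?",  # hypothetical
    severity_if_yes="p0",
    severity_if_no="p2",
)
```

Keeping the hinge question attached to the finding means the severity can be settled later without re-running the review.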
Unknowns: treat "needs human" as a valid outcome
Most model failures under architecture review are failures of uncertainty handling. The model would rather guess than admit it doesn’t know. Your contract should give it a safe place to put uncertainty: unknown, assumption, or needs_human fields that are treated as valid outputs, not errors.
This is also where you distinguish "missing input" from questions the document alone cannot settle. Missing input can be fixed by asking a question. The second kind might require a benchmark, a threat model, or a human policy decision. Your reviewer should surface that explicitly instead of burying it in hedged prose.
Guardrails without over-engineering
You don’t need a full evaluation harness to be safer than the average demo. A few cheap guardrails go a long way:
- Schema constraints: enums for severity/categories; required evidence fields; bounded list sizes.
- Rubric checks: simple consistency rules ("P0 findings must include a clear blast radius and an action").
- Spot re-asks: targeted second-pass prompts when specific fields are weak ("rewrite evidence", "justify severity", "convert speculative claims into questions").
The point is to fix predictable defects deterministically, not to create an open-ended "think harder" loop.
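A rubric check of this kind can be a plain function over validated findings. The sketch below is illustrative only: the `blast_radius` field, the `MAX_FINDINGS` bound, and the wording of the re-ask hints are assumptions, while `recommendation` matches the output excerpt shown earlier.

```python
MAX_FINDINGS = 10  # assumed bound on list size


def rubric_defects(findings: list[dict]) -> list[str]:
    """Return human-readable defects; an empty list means the rubric passed."""
    defects: list[str] = []
    if len(findings) > MAX_FINDINGS:
        defects.append(f"too many findings: {len(findings)} > {MAX_FINDINGS}")
    for index, finding in enumerate(findings):
        if finding.get("severity") != "p0":
            continue
        # Rule from the text: P0 needs a clear blast radius and an action.
        if not finding.get("blast_radius", "").strip():
            defects.append(f"finding {index}: p0 without blast_radius (re-ask: justify severity)")
        if not finding.get("recommendation", "").strip():
            defects.append(f"finding {index}: p0 without recommendation (re-ask: add action)")
    return defects


checked = rubric_defects([
    {"severity": "p0", "blast_radius": "all tenants", "recommendation": "Add deletion saga."},
    {"severity": "p0", "blast_radius": "", "recommendation": "Encrypt payloads."},
    {"severity": "p2"},
])
```

Each defect maps to one targeted re-ask, which keeps the repair loop bounded.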
What to log (so you can debug)
If you deploy this, you want logs that help you answer: "which step failed, and how?" At minimum:
- The input digest (so you can correlate runs without storing sensitive docs verbatim).
- Planner output (selected lenses, questions, scoping decisions).
- Each specialist’s structured findings (including validation failures/retries).
- Synthesizer merge decisions (deduping and any conflict resolution notes).
With that, you can debug multi-agent runs like any other pipeline: identify the step that produced bad data, tighten the contract or prompt for that role, and move on.
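The input digest and per-step records need nothing beyond the standard library. A minimal sketch, assuming SHA-256 as the digest and one JSON object per log line; the `log_record` function and its field names are hypothetical:

```python
import hashlib
import json


def input_digest(design_doc: str) -> str:
    """Stable fingerprint of the input, so runs correlate without storing the doc."""
    return hashlib.sha256(design_doc.encode("utf-8")).hexdigest()


def log_record(step: str, digest: str, payload: dict) -> str:
    """Serialize one pipeline step as a single JSON line."""
    return json.dumps(
        {"step": step, "input_sha256": digest, "payload": payload},
        sort_keys=True,
    )


doc = "We are building a multi-tenant SaaS API that ingests events from customer apps."
digest = input_digest(doc)
planner_line = log_record("planner", digest, {"lenses": ["security", "scalability"]})
```

Because every record carries the same digest, the planner, specialist, and synthesizer lines for one run can be grouped with a simple filter.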
Operational Heuristics: Prompt Pack and Debugging
If you treat your architecture reviewer like a one-off prompt, it will behave like one. If you treat it like a component in an engineering system, it becomes maintainable.
Build a "review prompt pack"
Instead of hand-editing prompts in code, keep a small prompt pack with:
- The role definitions (planner, each specialist, synthesizer).
- Your rubric snippets (what counts as severity P0/P1/P2, what categories you care about).
- One or two output examples that demonstrate the contract "done right".
This does two things. First, it creates a shared artifact for the team - people can review and improve it like any other engineering asset. Second, it makes drift obvious: if the output starts violating the rubric, you can update the pack instead of chasing ad-hoc prompt edits scattered through the code.
Version your contracts like an API
Once downstream systems depend on your schema, it becomes an API. Treat it that way:
- Make breaking changes intentionally (renames, enum changes, required fields).
- Consider adding a schema_version to the top-level review artifact.
- Keep migration logic simple: prefer additive changes early, and prune later once consumers catch up.
Most failures in production-like agent systems aren’t "the model got dumber". They’re "the contract moved and the assumptions didn’t".
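On the consumer side, a version gate can be a few lines. This sketch assumes a "major.minor" string in `schema_version`, where a major bump signals a breaking change; that convention and the `is_compatible` helper are assumptions, not something the article's repository defines.

```python
SUPPORTED_MAJOR = 1  # assumed: the major version this consumer understands


def is_compatible(review: dict) -> bool:
    """Accept any minor version under the supported major; reject everything else."""
    version = review.get("schema_version")
    if not isinstance(version, str):
        return False  # missing or malformed version is treated as incompatible
    major, _, _minor = version.partition(".")
    return major.isdigit() and int(major) == SUPPORTED_MAJOR


additive_change = is_compatible({"schema_version": "1.3"})
breaking_change = is_compatible({"schema_version": "2.0"})
unversioned = is_compatible({"summary": "no version field"})
```

Additive changes bump the minor number and pass; renames or enum changes bump the major number and fail loudly at the boundary instead of deep inside a consumer.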
Debugging checklist: contract vs. reasoning vs. routing
When something goes wrong, you want a fast way to localize the problem:
- Contract failure: validation errors, missing fields, wrong enum values. Fix with stricter schemas, clearer instructions, or repair prompts.
- Reasoning failure: the model followed the schema but produced low-quality findings. Fix with better rubric, better lens prompts, and better evidence requirements.
- Routing failure: the right work didn’t run (wrong lenses selected), or state was passed incorrectly. Fix the planner logic and the state model; don’t patch around it in specialist prompts.
This is why structured state passing matters: you can inspect each stage and see whether the pipeline is broken structurally or semantically.
Keep it lean: don’t add features until you feel pain
It’s tempting to add memory, retrieval (RAG), tool routers, and evaluation harnesses immediately. Most of that is premature for a reviewer that’s still proving it can produce a single reliable structured artifact.
Add only what fixes a named problem:
- Add memory when you have multi-step interactions that truly benefit from long-lived context.
- Add evaluation when you’re shipping frequent prompt/contract changes and need regression protection.
- Add retrieval when your reviewer needs access to external specs, policies, or service inventories that are too large to paste into the input.
Extend one lens at a time
The clean way to extend this system is to add a specialist, not to bloat existing ones. If you want a "Compliance" lens, define:
- A new role prompt for compliance.
- The same output contract as the other specialists.
- A planner rule for when to include that lens.
Because the contract is stable and synthesis already knows how to merge, you get extensibility without rewriting orchestration.
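The three pieces (role prompt, shared contract, planner rule) can be pictured as one registry entry per lens. In the article the planner is an agent that selects lenses; the sketch below swaps in a deterministic keyword rule purely to show the shape. The `Lens` type, the prompts, and the naive substring matching are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Lens:
    name: str
    role_prompt: str
    include_when: Callable[[str], bool]  # the "planner rule" for this lens


def always(_doc: str) -> bool:
    return True


def mentions(*keywords: str) -> Callable[[str], bool]:
    """Build a rule that fires when any keyword appears (naive substring match)."""
    def rule(doc: str) -> bool:
        text = doc.lower()
        return any(keyword in text for keyword in keywords)
    return rule


LENSES = [
    Lens("security", "Review authentication, authorization, and data exposure.", always),
    Lens("scalability", "Review load, backpressure, and bottlenecks.", always),
    # Adding a lens is one new entry; orchestration and contract stay unchanged.
    Lens("compliance", "Review retention, deletion, and PII handling.", mentions("gdpr", "pii")),
]


def select_lenses(design_doc: str) -> list[str]:
    return [lens.name for lens in LENSES if lens.include_when(design_doc)]


with_gdpr = select_lenses("We need to support data deletion by tenant (GDPR-style).")
without = select_lenses("A stateless thumbnail service.")
```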
Closing: From Demo to Engineering Workflow
The pattern in this article is deliberately simple: contracts + roles + a boring topology. That combination is what turns "LLM feedback" into something you can actually integrate into engineering work.
Contracts are the differentiator. They force the reviewer to produce findings as data, not prose. Roles keep each agent honest: the planner scopes, specialists apply lenses, and the synthesizer merges into a single artifact. The topology stays lean so you can run it often and debug it when it misbehaves.
The practical question is where to plug this in. Architecture review is not a single event; it happens at different points in a system’s lifecycle. A structured reviewer can support a few common workflows:
- Design reviews: run it on a design doc draft to surface missing assumptions and obvious risks before a meeting.
- PRs for architectural changes: attach the structured artifact as a PR comment, with a short summary plus ranked findings.
- ADRs: use the questions and "needs human judgment" fields to drive what the ADR must explicitly decide.
The point is not to replace humans. It’s to make the review loop tighter and more consistent - and to ensure the output is shaped like something your team can act on.
If you outgrow the lean version, the next steps are straightforward:
- Add an evaluation harness with a small set of "golden" design docs and expected findings, so prompt/contract changes don’t regress silently.
- Add organization-specific retrieval (policies, SLO templates, service inventories) when you repeatedly see "unknown" due to missing institutional context.
- Expand lens coverage one specialist at a time, keeping the contract stable.
The best call to action is also the simplest: ship the smallest reviewer that returns structured findings with evidence. Run it on one real design doc. If the output is useful, you’ll know exactly what to improve next. If it isn’t, don’t add more agents - tighten the contract and the inputs until it becomes reliable.
Try the code: github.com/nunombispo/multi-agent-architecture-reviewer-article - clone, point run_review.py at your own design doc, and iterate on contracts and lenses from there.