A rough eval harness, even if it is just a spreadsheet with pass/fail and notes.
The POC answers four questions in order:
| Question | What "no" means |
| --- | --- |
| Does the model produce the right shape of output reliably? | Schema issues, structured-output failures. Fixable. |
| Does it produce the right content on easy cases? | Capability gap. Sometimes fixable with retrieval or examples. |
| Does it handle the long tail without catastrophic failures? | The real risk. Often the project killer. |
| Can we detect when it is wrong? | If no, the project cannot ship to production. Full stop. |
That last question is the one most people skip. An AI system you cannot evaluate is an AI system you cannot trust, and an AI system you cannot trust is a demo, not a product. I have walked away from POCs that worked 90% of the time because there was no signal to catch the 10%.
Exit criterion: measurable performance on the eval set that the client agrees is good enough to justify integration cost, plus a documented failure mode list.
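As a rough illustration of how small such a harness can be, here is a minimal sketch. It checks shape before content, mirroring the order of the questions above, and returns a pass rate plus a list of failures with notes. The `Case` fields, the `REQUIRED_KEYS` schema, and the `model` callable are all assumptions made for the example, not part of any particular project.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    prompt: str
    expected_label: str  # hypothetical: whatever "right content" means for the task


# Assumed output schema, purely for illustration.
REQUIRED_KEYS = {"label", "confidence"}


def check_case(raw_output: str, case: Case) -> tuple[bool, str]:
    """Return (passed, note). Shape is checked before content."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "shape: not valid JSON"
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return False, "shape: missing required keys"
    if parsed["label"] != case.expected_label:
        return False, f"content: expected {case.expected_label!r}, got {parsed['label']!r}"
    return True, ""


def run_eval(cases: list[Case], model: Callable[[str], str]) -> dict:
    """Run every case through the model and collect failures with notes."""
    failures = []
    for case in cases:
        passed, note = check_case(model(case.prompt), case)
        if not passed:
            failures.append((case.name, note))
    total = len(cases)
    return {
        "pass_rate": (total - len(failures)) / total if total else 0.0,
        "failures": failures,
    }
```

The `failures` list doubles as raw material for the documented failure mode list: grouping the notes by prefix (`shape:` versus `content:`) already separates the fixable problems from the capability gaps.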
Phase 3: Integration (3 to 8 weeks)
This is where most of the actual work lives, and where most of my time goes. The model is usually the easy part by now. The integration is what makes it real.
My default stack for production AI work:
- Orchestration: simple, explicit code first. I reach for LangGraph or a hand-rolled state machine only when the workflow genuinely has branches and loops. Most "agents" are a sequential pipeline pretending to be agentic.
- Storage: Postgres for everything, with pgvector when retrieval matters. Supabase if the client wants managed. I do not introduce a separate vector DB until pgvector measurably stops scaling, which is later than people think.
- Retrieval: hybrid search (BM25 + dense) with reciprocal rank fusion. Pure semantic search loses on exact identifiers, SKUs, error codes, names. Pure keyword loses on paraphrase. RRF is the cheap fix.
- Compute: AWS Lambda + EventBridge for scheduled and event-driven work, API Gateway when something needs to be called. Scales to zero, which matters for workloads that run hourly or in bursts.
- Frontend (when needed): Next.js with server actions. Boring is good here.
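To make the retrieval point concrete, here is a minimal sketch of reciprocal rank fusion. It assumes each retriever (BM25 and dense) has already returned an ordered list of document IDs. The function name and the sample IDs are made up for the example, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Only ranks are used, so the raw BM25 and
    cosine scores never need to be put on a common scale.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query.
bm25_hits = ["sku-123", "doc-a", "doc-b"]   # keyword search nails the exact SKU
dense_hits = ["doc-a", "doc-c", "sku-123"]  # dense search prefers the paraphrase
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that appear in both lists float to the top, while a document that only one retriever found still survives in the tail instead of being dropped.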
Three integration details I now treat as non-negotiable:
1. Idempotency keys on everything
Any external action (send email, create ticket, post to CRM) gets an idempotency key derived from the input. Retries are inevitable, duplicate side effects are not.
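```python
def idempotency_key(workflow_id: str, input_hash: str, step: str) -> str:
    # Same workflow, step, and input always produce the same key, so a retry can be recognized.
    return f"{workflow_id}:{step}:{input_hash}"
```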
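The `input_hash` argument is left open above. One way to produce it, sketched here under assumptions, is to hash a canonical JSON form of the input so that key order does not change the result. `hash_input` and `InMemoryDedupe` are hypothetical names. In a real system the "seen" set would more likely be a database table with a unique constraint on the key than an in-process set.

```python
import hashlib
import json
from typing import Callable


def hash_input(payload: dict) -> str:
    """Stable SHA-256 of a JSON-serializable payload, independent of key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def idempotency_key(workflow_id: str, input_hash: str, step: str) -> str:
    # Repeated from above so this block runs on its own.
    return f"{workflow_id}:{step}:{input_hash}"


class InMemoryDedupe:
    """Toy stand-in for a persistent store of keys that have already run."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def run_once(self, key: str, action: Callable[[], None]) -> bool:
        """Run the action unless this key has already completed. Returns True if it ran."""
        if key in self._seen:
            return False
        action()
        self._seen.add(key)  # recorded only after the action succeeds
        return True
```

Recording the key only after success means a crash between the action and the write can still cause one duplicate. Where the downstream API accepts an idempotency key itself, passing the same key through closes that gap.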
2. A human-in-the-loop seam, even if unused
I always build the approval queue before I build the auto-send. Even if the client wants full automation eventually, shipping with human review for the first 2 to 4 weeks catches the failure modes the eval set missed. Turning approval off later is one config change.
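A minimal sketch of what such a seam might look like: every outbound action goes through one dispatcher, and a single flag decides whether it is queued for review or sent immediately. The `Dispatcher` class, the `require_approval` flag, and the in-memory queue are assumptions for illustration. A real queue would live in the database.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Dispatcher:
    send: Callable[[str], None]    # the real side effect, e.g. an email client
    require_approval: bool = True  # the one config switch
    queue: list[str] = field(default_factory=list)

    def submit(self, draft: str) -> str:
        """Route a draft either to the approval queue or straight to send."""
        if self.require_approval:
            self.queue.append(draft)
            return "queued"
        self.send(draft)
        return "sent"

    def approve(self, index: int = 0) -> None:
        """A human approved the queued draft at this position: send it now."""
        self.send(self.queue.pop(index))
```

Because the workflow code only ever calls `submit`, switching to full automation means flipping `require_approval` and nothing else.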
3. Cost guardrails per workflow
Token budgets per execution, hard cutoffs, alerts at 50/80/100% of monthly budget. I have seen a single retry loop burn $400 in an hour. Never again.
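Both halves of that guardrail fit in a few lines. The sketch below is illustrative only: `TokenBudget`, `BudgetExceeded`, and `crossed_thresholds` are hypothetical names, and the token counts are assumed to come from whatever usage data the model API reports.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an execution would go past its hard token cutoff."""


class TokenBudget:
    """Per-execution token budget with a hard cutoff."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage, refusing any charge that would exceed the budget."""
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(f"{self.used + tokens} > {self.max_tokens} tokens")
        self.used += tokens


ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # fractions of the monthly budget


def crossed_thresholds(spend_before: float, spend_after: float, monthly_budget: float) -> list[float]:
    """Return the thresholds crossed by this spend update, so each alert fires once."""
    return [
        t for t in ALERT_THRESHOLDS
        if spend_before < t * monthly_budget <= spend_after
    ]
```

Calling `charge` before each model call inside a retry loop turns a runaway loop into one raised exception instead of an open-ended bill.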
Exit criterion: the system runs end to end on real production data, with logging, retries, idempotency, and a kill switch. Not perfect outputs yet, but the pipes are sound.
Phase 4: Evaluation (continuous, but formalized for 2 weeks)
Evaluation is not a phase you finish. It is a system you build once and keep running forever. But there is a discrete block of work to set it up, and that is what this phase is.
I build three layers of evaluation:
- Offline eval set. The 30 to 50 examples from the POC, grown to 100 to 300, with expected outputs and a scoring rubric. Run on every prompt or model change. This is your regression test.
- LLM-as-judge for open-ended outputs. For anything where there is no single correct answer (drafted emails, summaries, classifications with reasoning), I use a separate, stronger model with a calibrated rubric to score outputs. I have written about how to actually calibrate this so the judge does not just rubber-stamp. The short version: you score the judge against human labels on a held-out set, and you do not trust a judge you have not calibrated.
- Production telemetry. Every run logs inputs, outputs, model version, prompt version, latency, tokens, cost, and the downstream outcome (was the draft email sent as-is, edited, or rejected?). That last signal is gold. It is the closest thing to ground truth you get in production.
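Scoring the judge against human labels can be as simple as the sketch below. It assumes categorical labels such as "pass" and "fail" on a held-out set. Cohen's kappa is used here as one reasonable agreement measure, not necessarily the one used in the calibration write-up mentioned above, and the function names are made up for the example.

```python
from collections import Counter


def agreement(judge: list[str], human: list[str]) -> float:
    """Fraction of held-out examples where judge and human give the same label."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for chance: 0 means no better than chance, 1 is perfect."""
    n = len(human)
    p_observed = agreement(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement: probability both pick the same label independently.
    p_expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both sides always use the same single label
    return (p_observed - p_expected) / (1.0 - p_expected)


# A judge that rubber-stamps everything as "pass".
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
rubber_stamp = ["pass"] * 6
```

On this toy data the rubber-stamp judge agrees with the humans half the time, yet its kappa is zero. Raw agreement alone can make a useless judge look acceptable, which is why the chance-corrected number is worth tracking.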
The trap here is treating eval as a one-time gate. Models change. Prompts drift. Data shifts. The eval set has to be re-run on every change and the production telemetry has to feed back into growing the eval set. If a real production failure happens, it goes into the eval set the same day.
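The feedback loop from telemetry to the eval set can be sketched roughly as follows. The record fields follow the telemetry list above. The outcome labels, the choice to treat "edited" as a failure worth reviewing, and the `EvalCase` shape are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed labels for the downstream outcome signal.
FAILURE_OUTCOMES = {"edited", "rejected"}


@dataclass
class RunRecord:
    input_text: str
    output_text: str
    model_version: str
    prompt_version: str
    latency_ms: float
    tokens: int
    cost_usd: float
    outcome: str  # "sent_as_is", "edited", or "rejected"


@dataclass
class EvalCase:
    input_text: str
    expected_output: Optional[str]  # None until a human writes the expected answer
    source: str


def new_eval_cases(records: list[RunRecord], existing_inputs: set[str]) -> list[EvalCase]:
    """Turn production failures into candidate eval cases, skipping inputs already covered."""
    seen = set(existing_inputs)
    cases = []
    for record in records:
        if record.outcome in FAILURE_OUTCOMES and record.input_text not in seen:
            seen.add(record.input_text)
            cases.append(EvalCase(record.input_text, None, f"production:{record.outcome}"))
    return cases
```

The expected output is deliberately left empty: a person still has to decide what the right answer was before the case counts as a regression test.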
Exit criterion: the client can answer "is the system still working correctly?" without calling me.
Phase 5: Operations and Handoff (2 to 4 weeks)
This is the phase that separates a project that survives from one that dies six months in when something breaks and nobody knows where to look.
What I deliver in operations:
- Runbook. A markdown doc with the top 10 things that can go wrong, how to detect them, and how to fix them. Real ones, from this system, not generic.
- Dashboards. Usually a simple internal page or a Grafana board: success rate, cost per day, queue depth, latency P50/P95, model errors. The client looks at this weekly.
- Alerts. Pager-worthy alerts on hard failures (pipeline stopped, cost spike, eval regression). Low-noise. If alerts cry wolf, they get muted, and then the real failure goes unnoticed.
- Versioned prompts and configs. In git, with a changelog. Prompt changes are deploys, not Slack messages.
- A maintenance retainer or a clean exit. Either I stay on for a defined number of hours per month, or I hand off to an internal team with a transition period. No silent fade-outs. Those end badly for both sides.
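For the dashboard numbers in the list above, a simple internal page needs little more than the standard library. This is a sketch under assumptions: the per-run dictionaries and their field names (`latency_ms`, `ok`, `cost_usd`) are hypothetical, and the function expects at least two runs.

```python
import statistics


def dashboard_summary(runs: list[dict]) -> dict:
    """Compute success rate, latency P50/P95, and total cost from per-run logs."""
    latencies = [run["latency_ms"] for run in runs]
    # n=100 yields 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "success_rate": sum(bool(run["ok"]) for run in runs) / len(runs),
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "total_cost_usd": sum(run["cost_usd"] for run in runs),
    }
```

Filtering `runs` to a single day before calling this gives the cost-per-day figure. The same summary can feed an alert check, so the dashboard and the pager read from one source.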
What I would do differently if I were starting over
A few opinions, after running this loop enough times:
- Spend more on scoping, less on the POC. A bad scope makes a great POC useless. I have never regretted an extra week of scoping. I have regretted skipping it.
- Pick the boring model. Use the strongest reliable model in your tier (Claude Sonnet or GPT-4 class) until you have a reason not to. Optimizing for cost too early picks fights you cannot win yet.
- Build the eval before the agent. Sounds backwards. It is not. If you cannot define what good looks like, you cannot build toward it.
- Treat the first 30 days in production as part of the build. Most of the real bugs surface there. Budget for it. Tell the client.
- Say no more often. The projects I have turned down have, on average, been better decisions than the ones I took. Wrong-shaped projects do not get better with effort.
The shape of this process is not unique to my work. What is mine is the calibration: which phases I now know to invest in, which exit criteria I refuse to skip, and which mistakes I have made enough times to write them down. That last category is the actual deliverable when you hire someone like me, more than the code.
If you are scoping an AI implementation and want a second pair of eyes on it before you commit budget, I am happy to look at it. Reach out at lazar-milicevic.com/#contact, or browse the rest of the blog for more on evaluation, RAG, and getting agents into production.