How to Test AI Agents for Production Failures Before Your Users Do


Breaking your own system on purpose has a name, chaos engineering, and it's how resilient distributed systems get built: you assume things will fail, so you rehearse the failure first.

AI agents almost never get that rehearsal. They get a happy-path demo, a thumbs-up, and a deploy. Then a tool times out, an API returns garbage, a network call blips, and the agent, which has never once met a broken tool, confidently tells the user a task succeeded when nothing actually happened.

The good news: you can run Chaos Monkey's idea on an agent now, in a few lines of code. Strands Evals ships chaos testing that injects controlled tool failures during evaluation, so you find the cracks in your agent's harness before production does.

This is the spine of a series. Each fix below has its own deep-dive post; this one is the map and the diagnostic that opens them.

What is the demo?

The demo is a travel agent, built with Strands Agents, with three tools that each touch the outside world:

search_flights looks up real fares from the Duffel sandbox.
get_weather reads a public forecast API for the destination.
book_flight writes a booking into a local SQLite ledger (the "database of record" we check against).

That's a normal little agent: it searches, it checks the weather, it books a trip. On the happy path it works perfectly, which is exactly the problem. To see where it actually breaks, we have to break its tools on purpose.
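To make the "database of record" concrete, here is a minimal sketch of what a booking tool that writes to a local SQLite ledger could look like. It is plain Python, not the demo's actual tool: the table schema, the `open_ledger` helper, and the `BK-` reference format are all assumptions made for illustration.

```python
import sqlite3
import uuid


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Create the bookings table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookings ("
        "reference TEXT PRIMARY KEY, passenger TEXT, flight_id TEXT)"
    )
    return conn


def book_flight(conn: sqlite3.Connection, passenger: str, flight_id: str) -> dict:
    """Write one booking row and return the confirmation the agent will see."""
    reference = "BK-" + uuid.uuid4().hex[:8].upper()  # hypothetical format
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO bookings VALUES (?, ?, ?)",
            (reference, passenger, flight_id),
        )
    return {"status": "confirmed", "reference": reference}


ledger = open_ledger()
confirmation = book_flight(ledger, "Ada Lovelace", "FL-123")
stored = ledger.execute("SELECT reference FROM bookings").fetchall()
```

The point of the ledger is that there are now two sources of truth to compare: what the tool told the agent (`confirmation`) and what actually landed (`stored`). Every check later in this post is some version of that comparison.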

What is chaos testing for AI agents?

Chaos testing injects controlled failures (timeouts, network errors, corrupted responses) into an agent's tool calls during evaluation, to measure how the agent behaves when its environment breaks instead of only testing the happy path. It's the Chaos Monkey discipline applied to an agent: assume the tool will fail, make it fail in a test, and check whether the agent recovers or at least fails honestly.

The key idea: we're hardening the harness, not grading the model. The failures and the fixes are deterministic parts of the agent's architecture (hooks, a fallback tool, a ground-truth evaluator). They behave the same no matter which model runs inside. The model's reaction to a broken tool varies run to run, which is exactly why resilience has to live in the deterministic harness around the model, not in hoping the model copes.

The two ways a tool fails

Strands Evals gives you two families of failure, and they break an agent in opposite ways:

| Family | Effects | What happens | What the agent sees |
| --- | --- | --- | --- |
| Pre-hook (cancels the call) | `Timeout`, `NetworkError`, `ExecutionError`, `ValidationError` | The tool is cancelled before it runs, so a write never persists | An error |
| Post-hook (corrupts the result) | `CorruptValues`, `TruncateFields`, `RemoveFields` | The tool runs (the write does persist), then its response is corrupted | Garbage it may trust |

A pre-hook failure is loud: the tool errors, the database stays empty, easy to spot. A post-hook failure is silent and dangerous: the booking really landed, but the agent was handed a broken confirmation and relays it as success. Same agent, two completely different failure shapes, which is why you diagnose before you fix.
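The two shapes are easy to reproduce without any framework. The sketch below is plain Python and does not use the Strands Evals effects: `with_pre_hook_timeout` and `with_post_hook_corruption` are hypothetical wrappers, and reversing strings is just a stand-in for whatever a real corruption effect does to a payload.

```python
import sqlite3


def make_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (reference TEXT PRIMARY KEY)")
    return conn


def book_flight(conn, reference):
    with conn:
        conn.execute("INSERT INTO bookings VALUES (?)", (reference,))
    return {"status": "confirmed", "reference": reference}


def with_pre_hook_timeout(tool):
    """Cancel the call before the tool body runs (pre-hook style)."""
    def wrapped(*args, **kwargs):
        raise TimeoutError("injected: tool timed out")
    return wrapped


def with_post_hook_corruption(tool):
    """Let the tool run, then corrupt every string value in its result."""
    def wrapped(*args, **kwargs):
        result = tool(*args, **kwargs)
        return {k: v[::-1] if isinstance(v, str) else v for k, v in result.items()}
    return wrapped


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM bookings").fetchone()[0]


# Pre-hook: the agent sees an error and nothing is written.
pre_ledger = make_ledger()
try:
    with_pre_hook_timeout(book_flight)(pre_ledger, "BK-1001")
    pre_outcome = "ok"
except TimeoutError as exc:
    pre_outcome = f"error: {exc}"

# Post-hook: the write lands, but the confirmation no longer matches it.
post_ledger = make_ledger()
corrupted = with_post_hook_corruption(book_flight)(post_ledger, "BK-1001")
```

After this runs, `pre_ledger` is empty and the caller holds an error, while `post_ledger` holds a real booking that the returned confirmation no longer describes.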

Adding chaos is one line

You build your agent normally, then add the plugin:

from strands import Agent
from strands_evals import Case
from strands_evals.chaos import ChaosCase, ChaosExperiment, ChaosPlugin, Timeout, CorruptValues
from strands_evals.eval_task_handler import TracedHandler, eval_task

# TRIP, MODEL, TOOLS, and PROMPT come from the rest of the demo and are not shown here.
# Name each failure: which effect, on which tool.
effect_maps = {
    "book_timeout": {"tool_effects": {"book_flight": [Timeout()]}},
    "book_corrupt": {"tool_effects": {"book_flight": [CorruptValues(corrupt_ratio=1.0)]}},
}

# One case per named failure, plus a baseline with no failure injected.
cases = ChaosCase.expand(
    [Case(name="trip", input=TRIP)],
    effect_maps,
    include_no_effect_baseline=True,
)

@eval_task(TracedHandler())
def task(case):
    return Agent(
        model=MODEL,
        tools=TOOLS,
        plugins=[ChaosPlugin()],  # <- the whole setup
        system_prompt=PROMPT,
    )

report = ChaosExperiment(cases=cases, evaluators=[...]).run_evaluations(task=task)

ChaosPlugin() in plugins is the entire wiring. It injects each case's failure through Strands' native tool-call hooks. No mocks, no patching your tools.

Diagnose, Fix, Validate

The chaos docs frame the work as a loop, and the demo follows it on the travel agent above. The diagram shows the full cycle: the ChaosPlugin injects failures into the agent's tools, two evaluators score the result against ground truth to surface where it breaks, you add one fix per failure type, and then the whole suite re-runs to confirm the fixes hold and nothing regressed.

Figure: The Diagnose, Fix, Validate loop. ChaosPlugin injects tool failures into the travel agent, two ground-truth evaluators show where it breaks, one fix is added per failure type, then the whole suite re-runs to prove the fixes hold and catch regressions.

Diagnose. Hit the naive agent with all seven effects across its tools and score against ground truth (the database) with two evaluators that have different blind spots: one checks "did the booking actually persist?", the other checks "did the agent state a booking reference that really exists?". The pre-hook failures show up as an empty database. The post-hook ones are the trap: the row persisted (so a state-only check says "pass") but the agent relayed a broken reference. Two evaluators catch what one would miss.
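Here is a rough sketch of those two checks as plain functions over a SQLite ledger. They are not the Strands Evals evaluator classes from the demo: the function names, the table layout, and the `BK-1234` reference format are assumptions chosen to keep the example small.

```python
import re
import sqlite3

REFERENCE_PATTERN = re.compile(r"\bBK-\d{4}\b")  # hypothetical reference format


def booking_persisted(conn) -> bool:
    """State check: is there at least one booking row?"""
    return conn.execute("SELECT COUNT(*) FROM bookings").fetchone()[0] > 0


def stated_reference_exists(conn, agent_answer: str) -> bool:
    """Honesty check: every reference the agent states must exist in the ledger."""
    stated = REFERENCE_PATTERN.findall(agent_answer)
    if not stated:
        return False  # no reference stated, so there is nothing to verify
    known = {row[0] for row in conn.execute("SELECT reference FROM bookings")}
    return all(ref in known for ref in stated)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (reference TEXT PRIMARY KEY)")
conn.execute("INSERT INTO bookings VALUES ('BK-1001')")

# The write landed, but the agent relays a reference that was corrupted on the way back.
silent_failure_answer = "Booked! Your reference is BK-9999."
state_verdict = booking_persisted(conn)
honesty_verdict = stated_reference_exists(conn, silent_failure_answer)
```

On the silent-corruption answer the state check passes and the honesty check fails, which is the disagreement you are looking for during diagnosis.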

Fix, one at a time, matched to the failure. A blanket retry doesn't work, because the failures aren't the same shape:

Silent corruption becomes an AfterToolCallEvent hook that re-reads the result against the database and rewrites it with the truth. (The full pattern is in the multi-step-tasks deep-dive below.)
A failed read that has a second provider (weather) becomes a BeforeToolCallEvent hook that fails over to a genuinely different provider. It's a real fallback, because two weather APIs actually exist.
A failure with no recovery path (search down, no backup) becomes failure-awareness in the prompt: make the agent communicate honestly instead of fabricating. The right outcome isn't a fake success; it's an honest "couldn't do it."
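The silent-corruption fix boils down to one rule: never trust the tool's confirmation over the ledger. The sketch below shows only that rule as a plain function. In the demo this logic runs inside an AfterToolCallEvent hook; the function name, the schema, and the result shape here are assumptions, not the demo's code.

```python
import sqlite3


def repair_booking_result(conn, passenger: str, tool_result: dict) -> dict:
    """Re-read the ledger and replace the tool result with what was actually stored."""
    row = conn.execute(
        "SELECT reference FROM bookings WHERE passenger = ? "
        "ORDER BY rowid DESC LIMIT 1",
        (passenger,),
    ).fetchone()
    if row is None:
        # Nothing persisted: report failure rather than pass along a claim.
        return {"status": "failed", "reference": None}
    if tool_result.get("reference") == row[0] and tool_result.get("status") == "confirmed":
        return tool_result  # the result already agrees with the ledger
    return {"status": "confirmed", "reference": row[0]}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (reference TEXT PRIMARY KEY, passenger TEXT)")
conn.execute("INSERT INTO bookings VALUES ('BK-1001', 'Ada')")

corrupted = {"status": "c0nf1rm3d", "reference": "BK-####"}
repaired = repair_booking_result(conn, "Ada", corrupted)
missing = repair_booking_result(conn, "Grace", corrupted)
```

Because the repair reads ground truth instead of asking the model to notice something odd, it behaves the same whichever model is running.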

Validate. Re-run the whole chaos suite with the fixes in place. This is the step that earns its keep: it not only proves that the previously failing cases now pass, it also catches a fix that broke another case. Our first failure-awareness prompt accidentally stopped the agent from booking when the weather tool failed (0/4 bookings, down from 3/4). You only see that by re-running everything, not just the case you meant to fix.
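Spotting that kind of regression is a simple diff between two runs of the same suite. The sketch below assumes you have already reduced each run to a pass count per case. The 3-to-0 drop on the weather case mirrors the one described above, and the other numbers and case names are made up for illustration.

```python
def find_regressions(before: dict, after: dict) -> dict:
    """Return cases whose pass count dropped between two runs of the same suite."""
    return {
        case: (before[case], after[case])
        for case in before
        if case in after and after[case] < before[case]
    }


# Hypothetical passes out of 4 runs per case, before and after adding a fix.
before_fix = {"baseline": 4, "weather_timeout": 3, "search_timeout": 0}
after_fix = {"baseline": 4, "weather_timeout": 0, "search_timeout": 4}

regressions = find_regressions(before_fix, after_fix)
```

Here the fix repaired `search_timeout` and quietly broke `weather_timeout`, and only the second fact shows up in `regressions`.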

Not every failure "passes", and that's the point

When the booking write is cancelled and the agent has no second booking provider, the case stays red. That's honest: it's a structural gap in the harness, not a model failure. The fix is structural too: add a backup provider and fail over, exactly like the weather example. A good resilience eval separates recoverable failures from unrecoverable-but-honest ones, so you know which need a new piece of architecture and which just need to fail cleanly.
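A failover is only as real as the second provider behind it. As a minimal sketch of the shape, assuming plain callables rather than Strands tools or hooks, the helper below tries providers in order and raises an honest error when none is left. `ProviderDown`, `call_with_failover`, and the canned forecast are hypothetical names and values.

```python
class ProviderDown(Exception):
    """Raised by a provider that cannot serve the request."""


def call_with_failover(providers, *args):
    """Try each provider in order; return the first success plus which one served it."""
    errors = []
    for name, provider in providers:
        try:
            return {"provider": name, "result": provider(*args)}
        except ProviderDown as exc:
            errors.append(f"{name}: {exc}")
    # No provider left: surface an honest failure instead of inventing a result.
    raise ProviderDown("all providers failed (" + "; ".join(errors) + ")")


def primary_weather(city):
    raise ProviderDown("injected timeout")


def backup_weather(city):
    return {"city": city, "forecast": "sunny"}  # canned value for the sketch


forecast = call_with_failover(
    [("primary", primary_weather), ("backup", backup_weather)], "Lisbon"
)
```

With a single provider in the list, the same call raises, which is the "stays red" case: the harness has nowhere to fail over to, so the only correct behavior is to say so.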

The deep-dives: each failure, built into a full demo

This chaos run surfaces tool failures in miniature. Each one gets its own post that builds the cure out fully, on the same kind of travel agent. The thread that ties them together: a failure the model can't self-detect, fixed deterministically in the harness instead of hoped away in the prompt.

Stop AI Agent Hallucinations: Validate Before the Agent Writes to Memory takes the same lesson as the silent-corruption fix (the agent trusted bad data it couldn't verify) back one step earlier: a BeforeToolCallEvent write-gate that validates a fact before it's stored, so a hallucination never becomes a permanent memory.
Prompt injection in agents that read untrusted content is the security version of "the agent trusted its tool": an injected instruction gets stored as memory and drives a dangerous action a session later. The cure is the same tool-boundary gate, blocking the action deterministically.
Why agents fail at multi-step tasks is the post-hook silent-corruption failure stretched across a whole multi-step task: a tool reports "done" while nothing saved. The cure is the same idea, "verify against ground truth", run per step with a retry.
Self-improving agents that write their own tools turns repeated, deterministic work into a tool the agent writes once and reuses exactly, instead of re-reasoning (and misfiring) every call.

Frequently asked questions

Is chaos testing only for Strands or AWS?
No. Failure injection, tool-call hooks, fallback tools, and ground-truth evaluation are general agent concepts. This demo uses Strands Agents, which is model-agnostic: its providers are interchangeable, so the same code runs on Amazon Bedrock (the default), Anthropic, OpenAI, or a local model via Ollama. The demo defaults to OpenAI gpt-4o-mini because it needs only an API key to try, though that's still a cloud API call, not a model on your machine.

Why measure the database instead of the agent's answer?
Because an agent that writes state can claim success while the data is wrong. A state check catches the loud failures; an honesty check (does the reference the agent stated actually exist?) catches the silent corruption a state check is fooled by.

Why not just retry every failed tool?
A retry re-hits a failure that's active for the whole case, and it doesn't fire at all on corruption that returns "success" with a bad payload. Match the fix to the kind of failure instead.
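Both halves of that answer show up in a few lines. This is a generic retry loop, not anything from Strands, and the two fake tools are hypothetical stand-ins for a failure that lasts the whole case and for a corrupted "success".

```python
def retry(tool, attempts=3):
    """Call tool up to `attempts` times, retrying only when it raises."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return {"attempts": attempt, "result": tool()}
        except TimeoutError as exc:
            last_error = exc
    return {"attempts": attempts, "result": None, "error": str(last_error)}


def always_times_out():
    raise TimeoutError("injected for the whole case")


def corrupted_success():
    return {"status": "confirmed", "reference": "###"}  # looks like success


persistent = retry(always_times_out)  # burns every attempt, still fails
silent = retry(corrupted_success)     # returns on attempt 1, payload still bad
```

The retry spends all three attempts on the timeout and never fires on the corrupted payload, because nothing was raised.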

Does this need live infrastructure to fail?
No, and that's the whole value. Chaos testing injects the failures deterministically, so you rehearse the outage without waiting for a real one.

Run it yourself

The full Diagnose, Fix, Validate demo (a travel agent, seven chaos effects across three tools, two ground-truth evaluators, and the before/after for each fix) runs end to end in one notebook. Clone the repo and run it:

git clone https://github.com/elizabethfuentes12/resilient-agent-harness-sample-for-aws.git
cd resilient-agent-harness-sample-for-aws/00-agent-resilience-journey
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
# Default: OpenAI gpt-4o-mini (just an API key to try)
cp .env.example .env # then fill in OPENAI_API_KEY and a free DUFFEL_API_KEY (app.duffel.com)

Then open agent_resilience_journey.ipynb and run it top to bottom.

The pattern follows PALADIN (Sep 2025), which trains agents to recover from injected tool failures. The benchmark figures and the full reading are in the repo's README. This demo reproduces the mechanism (inject, measure, recover) with its own deterministic output.

What's the failure that bit your agent in production: a timeout, a corrupted response, a confident lie? Tell me in the comments.

📬 Building reliable AI agents? I write about agent memory, guardrails, evaluation, and multi-agent patterns. Subscribe to my newsletter to get the next one.

Thanks!


Elizabeth Fuentes L

I help developers build production-ready AI applications through hands-on tutorials and open-source projects.
