Building FailureDNA: an agent memory that knows when not to trust itself

Incident
 -> embed symptoms (Qwen text-embedding-v3)
 -> pgvector semantic search on Alibaba Cloud RDS
 -> fuse semantic + keyword scores
 -> DETERMINISTIC validity gate <- the important part
 -> Qwen picks one allowlisted action (validated JSON)
 -> execute -> persist the real outcome back to memory

The validity gate is deliberately boring and deterministic:

Prior outcome   Environment match                          Disposition
failure         any                                        avoid
success         full match                                 use
success         driver / topology / config hash changed    inspect

No model decides whether a memory is trustworthy. And critically, avoid is enforced, not advised: an action with a symptom-matching prior failure is removed from the candidate list before the model sees it. The agent cannot repeat a known failure even if it wanted to — which matters, because a live LLM handed the same memories as a "hint" will sometimes ignore the hint. The creative part (which action, given the evidence) goes to Qwen; the part that must never hallucinate (is this memory valid? did this action succeed?) stays in deterministic code.
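The table and the enforcement rule can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the names Fingerprint, Memory, disposition, and gate are hypothetical, the memories passed in are assumed to have already been matched to the current symptoms by retrieval, and treating an action as inspect whenever any of its prior successes came from a changed environment is my assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(str, Enum):
    AVOID = "avoid"
    USE = "use"
    INSPECT = "inspect"


@dataclass(frozen=True)
class Fingerprint:
    # Hypothetical environment fingerprint; fields mirror the table's third row.
    driver: str
    topology: str
    config_hash: str


@dataclass(frozen=True)
class Memory:
    action: str
    outcome: str  # "success" or "failure"
    env: Fingerprint


def disposition(memory: Memory, current: Fingerprint) -> Disposition:
    """Apply the three-row table. No model call, no scoring."""
    if memory.outcome == "failure":
        return Disposition.AVOID
    if memory.env == current:
        return Disposition.USE
    return Disposition.INSPECT


def gate(
    candidates: list[str], memories: list[Memory], current: Fingerprint
) -> tuple[list[str], dict[str, str]]:
    """Return (allowed actions, warnings keyed by action).

    Actions with a prior failure are removed outright, so the model
    never sees them. Stale successes stay available but carry a warning.
    """
    avoided: set[str] = set()
    stale: set[str] = set()
    for memory in memories:
        d = disposition(memory, current)
        if d is Disposition.AVOID:
            avoided.add(memory.action)
        elif d is Disposition.INSPECT:
            stale.add(memory.action)
    allowed = [a for a in candidates if a not in avoided]
    warnings = {
        a: "stale success: environment changed" for a in allowed if a in stale
    }
    return allowed, warnings
```

Only the list returned by gate would go into the prompt, which is what makes avoid a hard constraint rather than a hint.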

Why Qwen Cloud fit

I used Qwen Cloud through its OpenAI-compatible DashScope endpoint, which made two things nearly free:

Embeddings for recall. text-embedding-v3 turns incident symptoms into 1024-d vectors for pgvector search. Hybrid retrieval fuses semantic similarity (weight 0.70) with keyword overlap (0.30), so it catches both paraphrased and exact-token symptoms.
Validated action selection. A Qwen chat model receives the symptoms, the environment fingerprint, the (gated) candidate actions, and the memories, and returns a single JSON decision at temperature=0 with thinking disabled — fast, deterministic-ish output that I validate before anything executes.
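As a rough sketch of the 0.70 / 0.30 fusion described above: in the real system the semantic score comes from pgvector, so the in-process cosine similarity here is only a stand-in to keep the example self-contained. The function names, the Jaccard-style token overlap, and the clamping of negative cosine values to zero are all assumptions, not the project's actual scoring code.

```python
import math
import re

SEMANTIC_WEIGHT = 0.70  # weights taken from the text above
KEYWORD_WEIGHT = 0.30


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_overlap(query: str, document: str) -> float:
    """Jaccard overlap of lowercase tokens (an assumed keyword metric)."""
    q = set(re.findall(r"[a-z0-9_]+", query.lower()))
    d = set(re.findall(r"[a-z0-9_]+", document.lower()))
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def fused_score(semantic: float, keyword: float) -> float:
    """Weighted sum of the two signals, both expected in [0, 1]."""
    return SEMANTIC_WEIGHT * semantic + KEYWORD_WEIGHT * keyword


def rank(query_text, query_vec, memories):
    """Rank memories given as (memory_id, symptom_text, embedding) tuples.

    Returns a list of (score, memory_id), best first.
    """
    scored = []
    for memory_id, text, embedding in memories:
        semantic = max(0.0, cosine_similarity(query_vec, embedding))
        scored.append((fused_score(semantic, keyword_overlap(query_text, text)), memory_id))
    return sorted(scored, reverse=True)
```

A paraphrased symptom can still rank well on the semantic term alone, while an exact error token gets an extra lift from the keyword term.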

Because it's OpenAI-compatible, the whole client is a thin, well-typed wrapper with explicit timeouts and one retry — no exotic SDK to fight.
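The "validate before anything executes" step might look roughly like this. The JSON shape (an "action" string plus an optional "reason") and the names InvalidDecision and parse_decision are hypothetical; the post only states that the model returns a single JSON decision restricted to allowlisted actions.

```python
import json


class InvalidDecision(ValueError):
    """Raised when the model's reply cannot be trusted as a decision."""


def parse_decision(raw: str, allowed_actions: list[str]) -> dict:
    """Parse the model's reply and reject anything outside the gated allowlist."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidDecision(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidDecision("decision must be a JSON object")

    action = data.get("action")
    if not isinstance(action, str) or action not in allowed_actions:
        raise InvalidDecision(f"action {action!r} is not in the allowlist")

    reason = data.get("reason", "")
    if not isinstance(reason, str):
        raise InvalidDecision("reason must be a string")

    # Return only the fields we checked, so stray keys never reach the executor.
    return {"action": action, "reason": reason}
```

Passing the gated candidate list as allowed_actions means an action removed by the validity gate is rejected here as well, even if the model names it anyway.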

Proving it helps (without fooling myself)

A demo where the new thing wins is easy to fake, so FailureDNA ships a benchmark designed to be hard on itself. Three modes (no_memory, naive, failuredna) run on identical seeded history, each with its own isolated memory. Simulator outcomes are hidden from the agent, and safe/unsafe labels are visible only to the evaluator. Static shortcut baselines (always_inspect_downstream, ...) check that FailureDNA isn't just rediscovering that one action is usually right.
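One way to aggregate such a benchmark per mode, and to flag when a static baseline does as well as the memory-backed agent, is sketched below. The Episode record, its field names, and the functions summarize and shortcut_suspects are illustrative assumptions rather than the project's evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    mode: str                    # "no_memory", "naive", "failuredna", or a baseline name
    actions_taken: int           # actions needed to resolve the incident
    first_action_resolved: bool  # did the very first action resolve it?
    first_action_unsafe: bool    # evaluator-only label, never shown to the agent


def summarize(episodes: list[Episode]) -> dict[str, dict[str, float]]:
    """Group episodes by mode and compute per-mode rates."""
    by_mode: dict[str, list[Episode]] = {}
    for episode in episodes:
        by_mode.setdefault(episode.mode, []).append(episode)

    report = {}
    for mode, eps in by_mode.items():
        n = len(eps)
        report[mode] = {
            "episodes": n,
            "first_action_resolution": sum(e.first_action_resolved for e in eps) / n,
            "unsafe_first_action_rate": sum(e.first_action_unsafe for e in eps) / n,
            "mean_actions": sum(e.actions_taken for e in eps) / n,
        }
    return report


def shortcut_suspects(report, mode="failuredna", baselines=("always_inspect_downstream",)):
    """Return baselines that match or beat `mode` on first-action resolution."""
    target = report[mode]["first_action_resolution"]
    return [
        b for b in baselines
        if b in report and report[b]["first_action_resolution"] >= target
    ]
```

A non-empty result from shortcut_suspects is a sign that the scenario set may be too easy to separate the agent from a fixed policy.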

FailureDNA lifts first-action resolution well above the naive agent, cuts unsafe first actions sharply, resolves in fewer actions, and repeats zero historical failures and zero stale successes. The honest caveat I left in the open: in this small scenario set, a static always-inspect policy also scores well — which is exactly why the shortcut audit exists. FailureDNA's value isn't a magic action; it's that it never repeats a known failure and never blindly reuses a stale fix as environments change — the behavior that generalizes beyond a fixed benchmark.

Running it on Alibaba Cloud (and the sharp edges)

The backend runs as a custom container on Alibaba Cloud Function Compute (FastAPI, port 9000), memory persists in ApsaraDB RDS for PostgreSQL + pgvector (HNSW), and the image lives in ACR Personal Edition. A few things bit, and are worth writing down for the next person:

ACR tiers matter. Enterprise Economy can't back an FC custom-container function (no image processing). Personal Edition (free) works.
The default *.fcapp.run domain forces downloads. It adds Content-Disposition: attachment to HTML and JSON responses, so a browser downloads your dashboard or health JSON instead of rendering it. I serve the UI from GitHub Pages and added a small API status page that fetches and displays /health/ready.
CORS is the gateway's job, not the app's. The FC gateway already injects Access-Control-* headers (it even reflects the request origin). The app's only CORS responsibility is to return 200 on OPTIONS — if it 405s the preflight, the browser blocks every POST. Adding app-level CORS headers on top just produces duplicates the browser also rejects.
Watch for stale images. Pushing a tag to ACR doesn't update a running function; you must repoint the function at the new image, and FC can serve a cached image. I added a /health/cors-debug endpoint and a build marker so "is my new code actually live?" is a one-glance check.
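The preflight rule and the build marker can be illustrated with a dependency-free ASGI sketch. It is not the project's FastAPI code: PreflightOK, health_app, and the BUILD_MARKER environment variable are hypothetical names. Because FastAPI applications are ASGI callables, the same wrapper could in principle be placed around one.

```python
import json
import os


class PreflightOK:
    """Pure-ASGI wrapper: answer every OPTIONS request with an empty 200.

    It deliberately adds no Access-Control-* headers, on the assumption
    (from the text above) that the FC gateway injects them itself.
    """

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http" and scope["method"] == "OPTIONS":
            await send({
                "type": "http.response.start",
                "status": 200,
                "headers": [(b"content-length", b"0")],
            })
            await send({"type": "http.response.body", "body": b""})
            return
        await self.app(scope, receive, send)


async def health_app(scope, receive, send):
    """Tiny stand-in app: /health/ready reports a build marker, all else is 404."""
    if scope["type"] != "http":
        return
    if scope["path"] == "/health/ready":
        status = 200
        # BUILD_MARKER is a hypothetical variable set at image build time.
        payload = {"status": "ready", "build": os.environ.get("BUILD_MARKER", "dev")}
    else:
        status = 404
        payload = {"detail": "not found"}
    body = json.dumps(payload).encode()
    await send({
        "type": "http.response.start",
        "status": status,
        "headers": [
            (b"content-type", b"application/json"),
            (b"content-length", str(len(body)).encode()),
        ],
    })
    await send({"type": "http.response.body", "body": body})


app = PreflightOK(health_app)
```

Comparing the build value returned by /health/ready with the tag just pushed shows whether the function is still serving an older image.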

What I'd do next

The most interesting open problem is the inspect disposition. Today the deterministic gate hard-removes avoid actions but leaves inspect ones available with a warning. The right next step is a real verification tool behind inspect — so a stale success is checked against the current environment, not just flagged. That keeps the thesis intact: let the model be creative where creativity helps, and let deterministic code (and real checks) hold the line where being wrong is expensive.

Try it: Live dashboard · API status · GitHub (MIT)

Built with Qwen Cloud + Alibaba Cloud Function Compute and RDS pgvector.