Incident
-> embed symptoms (Qwen text-embedding-v3)
-> pgvector semantic search on Alibaba Cloud RDS
-> fuse semantic + keyword scores
-> DETERMINISTIC validity gate <- the important part
-> Qwen picks one allowlisted action (validated JSON)
-> execute -> persist the real outcome back to memory
The validity gate is deliberately boring and deterministic:
| Prior outcome | Environment match | Disposition |
| --- | --- | --- |
| failure | any | avoid |
| success | full match | use |
| success | driver / topology / config hash changed | inspect |
No model decides whether a memory is trustworthy. And critically, avoid is enforced, not advised: an action with a symptom-matching prior failure is removed from the candidate list before the model sees it. The agent cannot repeat a known failure even if it wanted to — which matters, because a live LLM handed the same memories as a "hint" will sometimes ignore the hint. The creative part (which action, given the evidence) goes to Qwen; the part that must never hallucinate (is this memory valid? did this action succeed?) stays in deterministic code.
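A minimal sketch of such a gate, not FailureDNA's actual code. It assumes the memories passed in have already been retrieved as symptom-matching, and the `Memory` fields, the fingerprint keys, and the action names are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Memory:
    action: str   # action that was tried for a matching incident
    outcome: str  # "success" or "failure"
    env: dict     # hypothetical fingerprint: driver, topology, config_hash


def disposition(memory: Memory, current_env: dict) -> str:
    """Apply the three-row table with plain rules and no model call."""
    if memory.outcome == "failure":
        return "avoid"      # any environment
    if memory.env == current_env:
        return "use"        # success and full environment match
    return "inspect"        # success, but something in the environment changed


def gate(candidates: list[str], memories: list[Memory], current_env: dict):
    """Return (allowed actions, actions that carry an inspect warning)."""
    labels: dict[str, set[str]] = {}
    for memory in memories:
        labels.setdefault(memory.action, set()).add(disposition(memory, current_env))
    # "avoid" is enforced: the action is dropped before any model sees the list.
    allowed = [a for a in candidates if "avoid" not in labels.get(a, set())]
    warnings = [a for a in allowed if "inspect" in labels.get(a, set())]
    return allowed, warnings
```

In this sketch, a prior failure removes an action outright, while a success recorded under a different fingerprint keeps the action available and flags it.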
Why Qwen Cloud fit
I used Qwen Cloud through its OpenAI-compatible DashScope endpoint, which made two things nearly free:
- Embeddings for recall. text-embedding-v3 turns incident symptoms into 1024-d vectors for pgvector search. Hybrid retrieval fuses semantic similarity (weight 0.70) with keyword overlap (0.30), so it catches both paraphrased and exact-token symptoms.
- Validated action selection. A Qwen chat model receives the symptoms, the environment fingerprint, the (gated) candidate actions, and the memories, and returns a single JSON decision at temperature=0 with thinking disabled: fast, deterministic-ish output that I validate before anything executes.
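The 0.70 / 0.30 fusion can be illustrated with a short sketch. Only the two weights come from the text above. The keyword-overlap metric (fraction of query tokens found in the stored symptom text), the tokenizer, and the function names are assumptions, and the semantic score is assumed to be already scaled to [0, 1] (for example, derived from a pgvector cosine distance):

```python
import re

SEMANTIC_WEIGHT = 0.70  # weight from the post
KEYWORD_WEIGHT = 0.30   # weight from the post


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a deliberately simple, assumed tokenizer."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def keyword_overlap(query: str, document: str) -> float:
    """Fraction of query tokens that also appear in the document (assumed metric)."""
    query_tokens = tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokens(document)) / len(query_tokens)


def fused_score(semantic_similarity: float, query: str, document: str) -> float:
    """Weighted sum of a semantic score in [0, 1] and exact-token overlap."""
    return (SEMANTIC_WEIGHT * semantic_similarity
            + KEYWORD_WEIGHT * keyword_overlap(query, document))
```

With these weights, a memory that shares every query token but is only moderately similar in embedding space can still outrank a paraphrase with no shared tokens, which is the behavior the hybrid is meant to provide.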
Because it's OpenAI-compatible, the whole client is a thin, well-typed wrapper with explicit timeouts and one retry — no exotic SDK to fight.
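Validating the model's decision before execution might look like the sketch below. The JSON field names (`action`, `reason`) and the `InvalidDecision` exception are hypothetical. The point is only that the reply is parsed and checked against the gated allowlist in ordinary code:

```python
import json


class InvalidDecision(ValueError):
    """Raised when the model's reply cannot be trusted enough to execute."""


def parse_decision(raw: str, allowed_actions: list[str]) -> dict:
    """Parse a model reply and accept it only if it names one allowlisted action."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidDecision(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidDecision("reply must be a JSON object")
    action = data.get("action")
    if not isinstance(action, str) or action not in allowed_actions:
        raise InvalidDecision(f"action {action!r} is not in the allowlist")
    # Keep only the fields the executor needs; ignore anything else the model added.
    return {"action": action, "reason": str(data.get("reason", ""))}
```

Because `allowed_actions` is the post-gate list, an avoided action is rejected here even if the model somehow names it.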
Proving it helps (without fooling myself)
A demo where the new thing wins is easy to fake, so FailureDNA ships a benchmark designed to be hard on itself: three modes (no_memory, naive, failuredna) on identical seeded history, hidden simulator outcomes, evaluator-only safe/unsafe labels, isolated memory per mode, and static shortcut baselines (always_inspect_downstream, ...) to check it isn't just rediscovering that one action is usually right.
FailureDNA lifts first-action resolution well above the naive agent, cuts unsafe first actions sharply, resolves in fewer actions, and repeats zero historical failures and zero stale successes. The honest caveat I left in the open: in this small scenario set, a static always-inspect policy also scores well — which is exactly why the shortcut audit exists. FailureDNA's value isn't a magic action; it's that it never repeats a known failure and never blindly reuses a stale fix as environments change — the behavior that generalizes beyond a fixed benchmark.
Running it on Alibaba Cloud (and the sharp edges)
The backend runs as a custom container on Alibaba Cloud Function Compute (FastAPI, port 9000), memory persists in ApsaraDB RDS for PostgreSQL + pgvector (HNSW), and the image lives in ACR Personal Edition. A few things bit me, and they are worth writing down for the next person:
- ACR tiers matter. Enterprise Economy can't back an FC custom-container function (no image processing). Personal Edition (free) works.
- The default *.fcapp.run domain forces downloads. It adds Content-Disposition: attachment to HTML and JSON responses, so a browser downloads your dashboard or health JSON instead of rendering it. I serve the UI from GitHub Pages and added a small API status page that fetches and displays /health/ready.
- CORS is the gateway's job, not the app's. The FC gateway already injects Access-Control-* headers (it even reflects the request origin). The app's only CORS responsibility is to return 200 on OPTIONS; if it 405s the preflight, the browser blocks every POST. Adding app-level CORS headers on top just produces duplicates the browser also rejects.
- Watch for stale images. Pushing a tag to ACR doesn't update a running function; you must repoint it, and FC can serve a cached image. I added a /health/cors-debug endpoint and a build marker so "is my new code actually live?" is a one-glance check.
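The "return 200 on OPTIONS, add no CORS headers" rule from the CORS item can be sketched as a dependency-free ASGI wrapper. This is an illustration, not the project's actual handler. The name `answer_preflight` is hypothetical, and it assumes the gateway in front of the app adds the Access-Control-* headers as described above:

```python
def answer_preflight(app):
    """Wrap an ASGI app so every HTTP OPTIONS request gets a bare 200."""

    async def wrapped(scope, receive, send):
        if scope["type"] == "http" and scope["method"] == "OPTIONS":
            # Deliberately no Access-Control-* headers: the gateway is assumed
            # to inject them, and adding them here would create duplicates.
            await send({
                "type": "http.response.start",
                "status": 200,
                "headers": [(b"content-length", b"0")],
            })
            await send({"type": "http.response.body", "body": b""})
            return
        # Everything else (other methods, lifespan events) goes to the real app.
        await app(scope, receive, send)

    return wrapped
```

A FastAPI application is itself an ASGI callable, so a wrapper of this shape can sit in front of one without touching the route definitions.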
What I'd do next
The most interesting open problem is the inspect disposition. Today the deterministic gate hard-removes avoid actions but leaves inspect ones available with a warning. The right next step is a real verification tool behind inspect — so a stale success is checked against the current environment, not just flagged. That keeps the thesis intact: let the model be creative where creativity helps, and let deterministic code (and real checks) hold the line where being wrong is expensive.
Try it: Live dashboard · API status · GitHub (MIT)
Built with Qwen Cloud + Alibaba Cloud Function Compute and RDS pgvector.