Relvy: Automating On-Call Runbooks with AI Agents!

DEV Community

```yaml
runbook_id: "service_latency_spike"
steps:
  - name: "check_shard_distribution"
    tool: "telemetry_query"
    params:
      metric: "http_request_duration_seconds"
      group_by: "shard_id"
      threshold: "p95 > 500ms"
  - name: "correlate_with_deployments"
    tool: "git_query"
    params:
      repository: "core-api"
      lookback_minutes: 30
```
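
To make the runbook format concrete, here is a minimal sketch of how such steps could be dispatched to tools in order. It is an illustration under assumptions, not Relvy's implementation: the `TOOLS` registry, the `run_runbook` function, and the stub tool bodies are all hypothetical, and the runbook is written as a plain Python dict (with `threshold` assumed to sit under `params`) to avoid a YAML dependency.

```python
from typing import Any, Callable, Dict, List


# Hypothetical stand-ins for the two tools named in the runbook. Real
# implementations would query a telemetry backend and a Git host.
def telemetry_query(params: Dict[str, Any]) -> Dict[str, Any]:
    return {"tool": "telemetry_query", "params": params}


def git_query(params: Dict[str, Any]) -> Dict[str, Any]:
    return {"tool": "git_query", "params": params}


# Hypothetical registry mapping the runbook's `tool` field to a callable.
TOOLS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "telemetry_query": telemetry_query,
    "git_query": git_query,
}

# The runbook from the YAML block, expressed as a plain dict.
RUNBOOK: Dict[str, Any] = {
    "runbook_id": "service_latency_spike",
    "steps": [
        {
            "name": "check_shard_distribution",
            "tool": "telemetry_query",
            "params": {
                "metric": "http_request_duration_seconds",
                "group_by": "shard_id",
                "threshold": "p95 > 500ms",
            },
        },
        {
            "name": "correlate_with_deployments",
            "tool": "git_query",
            "params": {"repository": "core-api", "lookback_minutes": 30},
        },
    ],
}


def run_runbook(runbook: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Run each step in order and collect one result record per step."""
    results: List[Dict[str, Any]] = []
    for step in runbook["steps"]:
        tool = TOOLS.get(step["tool"])
        if tool is None:
            raise KeyError(f"unknown tool: {step['tool']}")
        output = tool(step.get("params", {}))
        results.append({"step": step["name"], "output": output})
    return results
```

Calling `run_runbook(RUNBOOK)` returns one record per step, in the order the steps are declared. Keeping the tool lookup in a registry means an unknown tool name fails loudly instead of being skipped.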

*Problem slicing:*
Problem slicing isolates a fault by intersecting high-cardinality dimensions one at a time. If `error_rate` is high, check `dimension_A` (e.g., `customer_tier`) first, then narrow within the affected value by `dimension_B` (e.g., `availability_zone`). Each step discards the slices that look healthy, so the procedure works like a binary search through the metadata space.
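
The slicing idea can be sketched in a few lines. This is a hedged illustration, not the article's actual algorithm: it assumes events are dicts carrying dimension values plus a boolean `error` field, and the function names `worst_slice` and `slice_problem` are hypothetical. Strictly speaking it is a greedy narrowing (pick the worst value of one dimension, filter, move to the next dimension), which is the behavior the binary-search analogy points at.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Assumed event shape: dimension keys plus a boolean "error" flag.
Event = Dict[str, object]


def worst_slice(events: Sequence[Event], dimension: str) -> Tuple[str, float]:
    """Return the value of `dimension` with the highest error rate, and that rate."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for event in events:
        key = str(event[dimension])
        totals[key] += 1
        if event["error"]:
            errors[key] += 1
    value = max(totals, key=lambda k: errors[k] / totals[k])
    return value, errors[value] / totals[value]


def slice_problem(
    events: Sequence[Event], dimensions: Sequence[str]
) -> List[Tuple[str, str, float]]:
    """Narrow the event set one dimension at a time, keeping the worst slice."""
    path: List[Tuple[str, str, float]] = []
    remaining = list(events)
    for dimension in dimensions:
        if not remaining:
            break
        value, rate = worst_slice(remaining, dimension)
        path.append((dimension, value, rate))
        # Keep only events in the worst slice before checking the next dimension.
        remaining = [e for e in remaining if str(e[dimension]) == value]
    return path


# Made-up sample data, for illustration only.
EVENTS: List[Event] = [
    {"customer_tier": "enterprise", "availability_zone": "us-east-1a", "error": True},
    {"customer_tier": "enterprise", "availability_zone": "us-east-1a", "error": True},
    {"customer_tier": "enterprise", "availability_zone": "us-east-1b", "error": False},
    {"customer_tier": "free", "availability_zone": "us-east-1a", "error": False},
    {"customer_tier": "free", "availability_zone": "us-east-1b", "error": False},
]
```

With the sample data, `slice_problem(EVENTS, ["customer_tier", "availability_zone"])` first selects `enterprise` (2 of 3 events failing), then `us-east-1a` within that tier (2 of 2 failing). The order of `dimensions` matters: a different order can lead to a different final slice, so a real system would likely need a rule for choosing which dimension to split on next.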

Originally published in Spanish at www.mgatc.com/blog/relvy-ai-on-call-automation-runbooks/