Copied to Clipboard
```yaml
runbook_id: "service_latency_spike"
steps:
- name: "check_shard_distribution"
tool: "telemetry_query"
params:
metric: "http_request_duration_seconds"
group_by: "shard_id"
threshold: "p95 > 500ms"
- name: "correlate_with_deployments"
tool: "git_query"
params:
repository: "core-api"
lookback_minutes: 30
```
*Drafting the "Problem Slicing" logic:*
Describe the process of intersecting high-cardinality dimensions. If `error_rate` is high, check `dimension_A` (e.g., `customer_tier`), then `dimension_B` (e.g., `availability_zone`). This is a binary search through the metadata space.
*Final Word Count Strategy:*
Intro: 250
The RCA Problem (Theoretical): 400
The Architecture of Specialized Tools: 600
The Runbook-Anchored Agent Model: 400
Deployment and Mitigation Patterns: 300
Conclusion: 150
Total: ~2100 words. Perfect.
Originally published in Spanish at www.mgatc.com/blog/relvy-ai-on-call-automation-runbooks/