Relvy AI: Automated On-Call Runbooks for Engineering Teams!

        look for spikes'.
        """
        data = self.ds.get_metrics(metric_name, time_range)
        anomalies = detect_outliers(data)
        # Return a summarized representation rather than raw data points
        return {"anomalies_detected": len(anomalies), "period": anomalies}

    def correlate_with_deployment(self, timestamp):
        """Query CI/CD metadata to find recent code changes."""
        return self.ds.get_recent_commits(limit=5)

Because these targeted tools return summaries instead of raw data points, they reduce the token load significantly. The agent receives a structured JSON object describing the anomaly, which acts as a "ground truth" anchor and keeps the model from hallucinating error patterns that do not exist.
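The snippet above calls `detect_outliers` without showing it. As a minimal sketch, assuming the data arrives as `(timestamp, value)` pairs and that a simple z-score rule is acceptable, it could look like the following. The signature, the `z_threshold` default, and the `summarize` helper are assumptions for illustration, not Relvy's actual implementation.

```python
from statistics import mean, pstdev


def detect_outliers(points, z_threshold=3.0):
    """Return the (timestamp, value) pairs whose z-score exceeds z_threshold.

    Hypothetical helper: `points` is assumed to be a list of
    (timestamp, value) pairs ordered by timestamp.
    """
    values = [value for _, value in points]
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a flat series has no deviations
    return [(t, v) for t, v in points if abs(v - mu) / sigma > z_threshold]


def summarize(points, z_threshold=3.0):
    """Build a compact summary for the agent instead of passing raw points."""
    anomalies = detect_outliers(points, z_threshold)
    return {
        "anomalies_detected": len(anomalies),
        "first_seen": anomalies[0][0] if anomalies else None,
        "last_seen": anomalies[-1][0] if anomalies else None,
    }
```

The summary carries only a count and a time window, which is the kind of compact payload the paragraph above describes.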

2. Deterministic Reasoning via Runbook Graphs

We define runbooks as Directed Acyclic Graphs (DAGs). Each node represents a specific diagnostic action. When an alert fires, the Relvy agent traverses the DAG based on the results of the preceding step.

If a diagnostic step yields a result that exceeds a confidence threshold (e.g., an 80% correlation between a latency spike and a specific deployment ID), the agent moves to the mitigation phase. If the confidence is low, the agent surfaces a "notebook" for the human engineer, highlighting the ambiguous data.
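The traversal described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Relvy's runbook engine: the `Node` structure, the `on_high`/`on_low` edge names, and the example node names are all hypothetical, and each action is assumed to return a `(confidence, evidence)` pair.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Node:
    name: str
    # Hypothetical contract: an action returns (confidence, evidence).
    action: Callable[[dict], Tuple[float, dict]]
    on_high: Optional[str] = None  # next node when confidence exceeds the threshold
    on_low: Optional[str] = None   # next node otherwise


def run_runbook(nodes: Dict[str, Node], start: str, context: dict, threshold: float = 0.8):
    """Walk the runbook from `start`, branching on each step's confidence."""
    trace = []
    visited = set()
    current = start
    while current is not None:
        if current in visited:
            raise ValueError(f"cycle detected at {current!r}; a runbook must be a DAG")
        visited.add(current)
        node = nodes[current]
        confidence, evidence = node.action(context)
        trace.append({"step": node.name, "confidence": confidence, "evidence": evidence})
        current = node.on_high if confidence > threshold else node.on_low
    return trace


# Example graph: terminal nodes have no outgoing edges, so the walk stops there.
EXAMPLE_NODES = {
    "check_latency": Node(
        "check_latency",
        lambda ctx: (0.95, {"p99_ms": ctx["p99_ms"]}),
        on_high="correlate_deploy",
        on_low="open_notebook",
    ),
    "correlate_deploy": Node(
        "correlate_deploy",
        lambda ctx: (ctx["deploy_correlation"], {}),
        on_high="mitigate",
        on_low="open_notebook",
    ),
    "mitigate": Node("mitigate", lambda ctx: (1.0, {"action": "rollback"})),
    "open_notebook": Node("open_notebook", lambda ctx: (1.0, {"action": "escalate_to_human"})),
}
```

Calling `run_runbook(EXAMPLE_NODES, "check_latency", {"p99_ms": 1200, "deploy_correlation": 0.85})` ends at `mitigate`, while a correlation of `0.4` ends at `open_notebook`, the branch that hands ambiguous data to a human.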

Implementation: The Tooling Layer

Relvy uses a local-first deployment architecture (Docker/Helm) to reduce both security exposure and latency when accessing internal observability stacks such as Datadog, Prometheus, or Honeycomb. The agent operates within the customer’s VPC, so proprietary codebases and sensitive telemetry do not leave the infrastructure perimeter.

The agentic loop is implemented via a specialized controller that manages three distinct components:

Observation Loop: Regularly polls observability sinks for anomalous state changes.
Reasoning Thread: Uses a RAG-augmented LLM to match the current incident signature against existing runbook definitions.
Action/Execution Layer: Executes approved CLI commands or API calls to perform mitigation (e.g., rolling back a deployment, restarting a service, or adjusting traffic weights).
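One pass through these three stages can be sketched as a single function. This is a simplified illustration, not the actual controller: the tag-overlap matcher is a plain stand-in for the RAG-augmented LLM step, and the alert and runbook dictionary shapes, `control_step`, and the `approved_actions` allowlist are assumptions.

```python
def match_runbook(alert, runbooks):
    """Stand-in for RAG-based matching: pick the runbook sharing the most tags."""
    best_name, best_score = None, 0
    for name, runbook in runbooks.items():
        score = len(set(alert["tags"]) & set(runbook["tags"]))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def control_step(poll, runbooks, approved_actions, execute):
    """Run one observe -> reason -> act pass and report what happened per alert."""
    results = []
    for alert in poll():                        # observation
        name = match_runbook(alert, runbooks)   # reasoning
        if name is None:
            results.append((alert["id"], "no_match", None))
            continue
        action = runbooks[name]["action"]
        if action not in approved_actions:      # execution guard
            results.append((alert["id"], "needs_approval", action))
            continue
        execute(action, alert)                  # action
        results.append((alert["id"], "executed", action))
    return results
```

Keeping the allowlist check between matching and execution means an unapproved action is reported rather than run, which fits the "approved CLI commands or API calls" wording above.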

Designing for Trust: The Notebook UI

In high-stakes environments, a "black box" AI is unacceptable. We built a notebook-style output interface to maintain transparency. Every autonomous action taken by the agent is logged as a cell in the notebook, containing the input data, the reasoning process, and the resulting visualization.

{
  "step": "Check Endpoint Latency",
  "status": "completed",
  "data": {
    "avg_latency": "450ms",
    "p99_latency": "1200ms",
    "anomaly_confirmed": true
  },
  "agent_thought": "P99 latency has deviated from the 7-day moving average by 3.2 standard deviations. Initiating segment analysis by shard ID."
}

This record allows engineers to review the agent's work post-incident. If the agent takes a wrong turn, the user can modify the runbook YAML configuration, essentially "training" the agent for future incidents without having to fine-tune the base model again.
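A cell log of this kind can be modeled as an append-only list of immutable records. The sketch below mirrors the four keys of the JSON record shown above; the `NotebookCell` and `Notebook` class names and the JSON Lines export are assumptions, not the product's real schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class NotebookCell:
    """One logged agent step; frozen so a recorded cell cannot be reassigned."""
    step: str
    status: str
    data: Dict[str, Any]
    agent_thought: str


class Notebook:
    """Append-only log of cells, exported as one JSON object per line."""

    def __init__(self) -> None:
        self._cells: List[NotebookCell] = []

    def record(self, cell: NotebookCell) -> None:
        self._cells.append(cell)

    def to_json_lines(self) -> str:
        return "\n".join(json.dumps(asdict(cell), sort_keys=True) for cell in self._cells)
```

Serializing each cell on its own line keeps the log easy to append to during an incident and easy to replay afterward.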

Overcoming the "Cold Start" Problem

One of the significant hurdles in adopting automated on-call tools is the lack of initial runbooks. We address this through an "observation-first" mode. When installed, Relvy monitors alerts and suggests candidate runbooks based on historical incident patterns.

To do this, we retrospectively analyze resolved tickets. By feeding historical incident logs and the associated mitigation actions into the agent, we can generate a baseline "Draft Runbook" that the engineering team reviews and approves. This significantly reduces the overhead of adopting Relvy in legacy environments where documentation is outdated or non-existent.
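The idea of mining resolved tickets can be illustrated with a simple frequency count. This is only a stand-in for the agent-driven generation described above: the ticket fields `alert_signature` and `mitigation`, the `min_support` cutoff, and the draft shape are hypothetical.

```python
from collections import Counter, defaultdict


def draft_runbooks(tickets, min_support=2):
    """Suggest the most common past mitigation for each alert signature.

    A draft is emitted only when the same mitigation resolved at least
    `min_support` tickets; every draft still needs human approval.
    """
    by_signature = defaultdict(Counter)
    for ticket in tickets:
        by_signature[ticket["alert_signature"]][ticket["mitigation"]] += 1

    drafts = []
    for signature, counts in sorted(by_signature.items()):
        action, support = counts.most_common(1)[0]
        if support >= min_support:
            drafts.append({
                "alert_signature": signature,
                "suggested_action": action,
                "support": support,
                "status": "draft",
            })
    return drafts
```

The `status` field stays at `"draft"` until someone on the team approves it, matching the review step in the paragraph above.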

The Role of Local Execution

The critical distinction in our architecture is the decision to keep the agentic reasoning and tool execution as close to the data as possible. By installing Relvy within the user's environment, we solve two problems simultaneously:

Security and Compliance: Data-at-rest stays within the perimeter. Only anonymized metadata is sent to the orchestration layer for agent planning.
Latency: The agent interacts with internal APIs (Kubernetes, AWS, Datadog) over high-speed local networks, which is crucial when an incident is causing cascading failures.
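The article does not say how metadata is anonymized before it leaves the perimeter. One possible approach, shown purely as an assumption, is to forward an allowlist of harmless fields, replace identifying fields with keyed hashes, and drop everything else. The field names and the `anonymize` function are hypothetical; `hmac` and `hashlib` are from the Python standard library.

```python
import hashlib
import hmac

# Hypothetical field policy: anything not listed in either set is dropped.
SAFE_FIELDS = {"metric", "anomalies_detected", "severity"}
PSEUDONYMIZED_FIELDS = {"service", "host"}


def anonymize(event: dict, secret: bytes) -> dict:
    """Return a copy of `event` that is safe to send outside the perimeter."""
    out = {}
    for key, value in event.items():
        if key in SAFE_FIELDS:
            out[key] = value
        elif key in PSEUDONYMIZED_FIELDS:
            # Keyed hash: stable per secret, so events for the same host still
            # correlate, but the raw name is never transmitted.
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            out[key] = digest.hexdigest()[:16]
    return out
```

Because the secret never leaves the customer's environment, the planning side can group events by pseudonym without being able to recover the original names.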

Conclusion: Moving Towards Autonomous Resilience

The shift toward autonomous on-call is not about replacing human engineers; it is about automating the "drudge work" of the investigation. By combining deterministic runbook workflows with specialized observability tools, Relvy provides a structured environment where AI can perform root cause analysis (RCA) effectively, accurately, and safely.

The next evolution of this technology will likely involve cross-service dependency mapping, where the agent automatically maps an alert in a frontend service to a failing downstream microservice, further shortening the path to resolution.

For organizations looking to integrate autonomous on-call capabilities into their existing infrastructure, or for deep dives into building out scalable observability pipelines, we are available to assist. Our team specializes in bridging the gap between high-volume telemetry and actionable, AI-driven automation. Visit https://www.mgatc.com for consulting services.

Originally published in Spanish at www.mgatc.com/blog/relvy-ai-automated-on-call-runbooks/