Feature Request: Service Level Objectives (SLOs), SLIs, and Error Budgets #1453

Open

alok87 opened on Dec 8, 2025

Summary

Add support for Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to HyperDX, inspired by Honeycomb's SLO implementation.

Background

SLOs are a critical component of modern observability and Site Reliability Engineering (SRE) practices. They help teams:

Define measurable reliability goals for services
Balance feature development with infrastructure maintenance
Prioritize incidents effectively based on error budget consumption
Report on service quality with precision

Key Concepts to Implement

1. Service Level Indicator (SLI)

A per-event measurement that defines whether the system succeeded or failed.

Examples:

Latency: response_time < 500ms
Availability: http_status_code < 500
Error rate: error == false
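
As a rough illustration of the three examples above, each SLI can be modelled as a predicate over a single event. The `RequestEvent` shape and the helper names are hypothetical; only the field names and thresholds come from the examples.

```typescript
// Hypothetical event shape; field names mirror the SLI examples above.
interface RequestEvent {
  response_time: number; // milliseconds
  http_status_code: number;
  error: boolean;
}

// An SLI answers one question per event: did it succeed?
type Sli = (event: RequestEvent) => boolean;

const latencySli: Sli = (e) => e.response_time < 500;
const availabilitySli: Sli = (e) => e.http_status_code < 500;
const errorSli: Sli = (e) => e.error === false;

// Percentage of events that satisfy the SLI.
function successPercentage(events: RequestEvent[], sli: Sli): number {
  // Assumption: a window with no traffic is treated as fully compliant.
  if (events.length === 0) return 100;
  const good = events.filter(sli).length;
  return (good / events.length) * 100;
}

const sample: RequestEvent[] = [
  { response_time: 120, http_status_code: 200, error: false },
  { response_time: 740, http_status_code: 200, error: false },
  { response_time: 95, http_status_code: 503, error: true },
  { response_time: 310, http_status_code: 200, error: false },
];

console.log(successPercentage(sample, latencySli)); // 3 of 4 events are under 500ms
```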

2. Service Level Objective (SLO)

The target proportion of successful SLIs over a rolling time window, expressed as a percentage.

Examples:

"99.9% availability over 30 days"
"99% of requests under 500ms latency over 7 days"

3. Error Budget

The allowable amount of failure within the SLO window.

Example calculation:

At 99.9% target with 1 million events in 30 days
Error budget = 1,000 failed events (0.1% of 1M)
Or, measured in time, ~43 minutes of downtime in 30 days (0.1% of 43,200 minutes)
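
The arithmetic above can be sketched as a small helper. The function and field names are hypothetical, and rounding to whole events is an assumption made here to absorb floating-point error in `100 - 99.9`.

```typescript
interface ErrorBudget {
  allowedFailures: number; // whole events
  allowedDowntimeMinutes: number; // rounded to one decimal place
}

function errorBudget(
  targetPercentage: number,
  totalEvents: number,
  windowDays: number,
): ErrorBudget {
  // Fraction of the window that is allowed to fail, e.g. 0.001 for 99.9%.
  const failureFraction = (100 - targetPercentage) / 100;
  const windowMinutes = windowDays * 24 * 60;
  return {
    // Math.round absorbs floating-point noise (999.9999... becomes 1000).
    allowedFailures: Math.round(totalEvents * failureFraction),
    allowedDowntimeMinutes: Math.round(windowMinutes * failureFraction * 10) / 10,
  };
}

const budget = errorBudget(99.9, 1_000_000, 30);
console.log(budget.allowedFailures, budget.allowedDowntimeMinutes);
```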

4. Burn Rate

How quickly the error budget is being consumed, relative to the rate that would use it up exactly at the end of the SLO window.

Burn rate of 1.0 = consuming budget evenly (will deplete exactly at window end)
Burn rate of 2.0 = consuming twice as fast (will deplete in half the window)

Formula:

Burn rate = actual error rate / allowed error rate, where allowed error rate = 1 - target (e.g., 0.001 for a 99.9% target)
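
A minimal sketch of the formula, assuming the expected error rate is the one allowed by the SLO target (1 - target). The function names are hypothetical, and the exhaustion estimate assumes the burn rate stays constant from the start of the window.

```typescript
function burnRate(
  failedEvents: number,
  totalEvents: number,
  targetPercentage: number,
): number {
  if (targetPercentage >= 100) {
    throw new Error("A 100% target leaves no error budget to burn");
  }
  if (totalEvents === 0) return 0; // no traffic, nothing consumed
  const actualErrorRate = failedEvents / totalEvents;
  const allowedErrorRate = (100 - targetPercentage) / 100;
  return actualErrorRate / allowedErrorRate;
}

// Days until the budget runs out if the given burn rate holds for the
// whole window. A rate of 1.0 lasts exactly the window; 2.0 lasts half.
function daysUntilExhausted(rate: number, windowDays: number): number {
  return rate > 0 ? windowDays / rate : Infinity;
}

// 0.2% observed errors against a 0.1% allowance: roughly twice the sustainable rate.
const rate = burnRate(2_000, 1_000_000, 99.9);
console.log(rate.toFixed(2), daysUntilExhausted(2, 30));
```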

Proposed Features

Phase 1: Core SLO Infrastructure

SLI Definition
- Allow users to define SLIs using existing query builder
- Support for multiple SLI types:
  - Latency-based (threshold comparison)
  - Availability-based (error rate)
  - Custom expressions
SLO Configuration
- Target percentage (e.g., 99.9%)
- Rolling time window (7d, 14d, 30d, 90d)
- Associated services/sources
- Name and description
Error Budget Tracking
- Calculate remaining error budget
- Budget burndown visualization
- Historical budget consumption

Phase 2: Alerting & Notifications

Burn Alerts
- Alert when error budget is depleting faster than expected
- Configurable burn rate thresholds
- Multiple alert severity levels based on burn rate
Integration with Existing Alert Channels
- Webhook notifications
- Future: Slack, PagerDuty, email
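
One possible way to map a burn rate onto severity levels is to pick the highest configured threshold that the current rate has reached. This is only a sketch: the threshold values 2 and 10 are illustrative, and `triggeredSeverity` is a hypothetical helper, not an existing HyperDX function.

```typescript
type Severity = "warning" | "critical";

interface BurnThreshold {
  burnRate: number;
  severity: Severity;
}

// Returns the severity of the highest threshold reached, or null if none.
function triggeredSeverity(
  currentBurnRate: number,
  thresholds: BurnThreshold[],
): Severity | null {
  const reached = thresholds
    .filter((t) => currentBurnRate >= t.burnRate)
    .sort((a, b) => b.burnRate - a.burnRate);
  return reached[0]?.severity ?? null;
}

// Illustrative configuration only.
const thresholds: BurnThreshold[] = [
  { burnRate: 2, severity: "warning" },
  { burnRate: 10, severity: "critical" },
];

console.log(triggeredSeverity(3, thresholds));
```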

Phase 3: Visualization & Reporting

SLO Dashboard
- Current SLO compliance percentage
- Error budget remaining (events and time)
- Burn rate visualization
- Historical SLO performance
BubbleUp Integration
- Click through from SLO violations to investigate root causes
- Identify outliers contributing to budget burn

Phase 4: Advanced Features (Future)

Multi-Service SLOs
- Share error budget across multiple services
- Aggregate events from related services
- Support for up to 10 services per SLO
SLO Tags
- Organize SLOs by team, project, or service
- Filter and group SLOs
SLO Reporting
- Weekly/monthly SLO reports
- Trend analysis
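
For the multi-service case above, sharing one error budget suggests summing events across services before computing compliance, rather than averaging per-service percentages. A minimal sketch under that assumption, with hypothetical type and function names:

```typescript
const MAX_SERVICES_PER_SLO = 10; // limit taken from the proposal above

interface ServiceCounts {
  service: string;
  successful: number;
  total: number;
}

// Compliance over the combined event stream, so busier services
// weigh more heavily than quiet ones.
function combinedCompliance(services: ServiceCounts[]): number {
  if (services.length > MAX_SERVICES_PER_SLO) {
    throw new Error(`An SLO supports at most ${MAX_SERVICES_PER_SLO} services`);
  }
  const successful = services.reduce((sum, s) => sum + s.successful, 0);
  const total = services.reduce((sum, s) => sum + s.total, 0);
  // Assumption: no traffic counts as fully compliant.
  return total === 0 ? 100 : (successful * 100) / total;
}

const combined = combinedCompliance([
  { service: "checkout", successful: 990, total: 1_000 },
  { service: "payments", successful: 4_950, total: 5_000 },
]);
console.log(combined);
```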

Technical Considerations

Data Model

New MongoDB models needed:

// ObjectId is the MongoDB/Mongoose ObjectId type and AlertChannel is the
// existing alert channel type; both are assumed to be in scope.

// SLI Model
interface ISLI {
  id: string;
  name: string;
  description?: string;
  team: ObjectId;
  source: ObjectId; // Reference to Source/Service

  // Query definition
  successCondition: {
    field: string;
    operator: '<' | '<=' | '>' | '>=' | '==' | '!=' | 'exists';
    value: string | number | boolean; // not meaningful for 'exists'
  };

  createdBy: ObjectId;
  createdAt: Date;
  updatedAt: Date;
}

// SLO Model
interface ISLO {
  id: string;
  name: string;
  description?: string;
  team: ObjectId;
  sli: ObjectId; // Reference to SLI

  // Target configuration
  targetPercentage: number; // e.g., 99.9
  windowDays: number; // e.g., 30

  // Current state
  currentPercentage?: number;
  errorBudgetRemaining?: number;
  burnRate?: number;

  // Alerting
  burnAlerts?: {
    enabled: boolean;
    thresholds: Array<{
      burnRate: number;
      severity: 'warning' | 'critical';
      channel: AlertChannel;
    }>;
  };

  tags?: string[];
  createdBy: ObjectId;
  createdAt: Date;
  updatedAt: Date;
}
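
To connect the `successCondition` shape to the ClickHouse side, here is a hedged sketch of turning a condition into a boolean SQL expression suitable for `countIf(...)`. The helper names are hypothetical. The identifier check and string escaping are a simplification, and a real implementation would more likely bind values through the client's query parameters.

```typescript
type Operator = "<" | "<=" | ">" | ">=" | "==" | "!=" | "exists";

interface SuccessCondition {
  field: string;
  operator: Operator;
  value: string | number | boolean;
}

// Renders a value as a SQL literal (simplified escaping, for illustration).
function literal(v: string | number | boolean): string {
  if (typeof v === "number") {
    if (!Number.isFinite(v)) throw new Error("Value must be a finite number");
    return String(v);
  }
  if (typeof v === "boolean") return v ? "true" : "false";
  return `'${v.replace(/\\/g, "\\\\").replace(/'/g, "\\'")}'`;
}

function toClickHouseExpr(c: SuccessCondition): string {
  // Only plain identifiers are accepted, so the field cannot smuggle in SQL.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(c.field)) {
    throw new Error(`Invalid field name: ${c.field}`);
  }
  if (c.operator === "exists") return `isNotNull(${c.field})`;
  return `${c.field} ${c.operator} ${literal(c.value)}`;
}

const expr = toClickHouseExpr({ field: "duration_ms", operator: "<", value: 500 });
console.log(`countIf(${expr})`); // countIf(duration_ms < 500)
```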

ClickHouse Queries

SLO calculations will require efficient ClickHouse queries:

-- Calculate SLI success rate over time window
SELECT
 countIf(duration_ms < 500) as successful_events,
 count(*) as total_events,
 successful_events / total_events * 100 as success_rate
FROM events
WHERE 
 timestamp >= now() - INTERVAL 30 DAY
 AND service_name = 'my-service'
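
A row returned by a query like the one above could then be folded into the SLO's current state. This sketch assumes `errorBudgetRemaining` is counted in events, and it accepts counts as strings because 64-bit integers may arrive quoted depending on client and output-format settings. The type and function names are hypothetical.

```typescript
interface SliRow {
  successful_events: number | string;
  total_events: number | string;
}

interface SloState {
  currentPercentage: number;
  errorBudgetRemaining: number; // events; negative once the budget is overspent
}

function toSloState(row: SliRow, targetPercentage: number): SloState {
  const successful = Number(row.successful_events);
  const total = Number(row.total_events);
  if (total === 0) {
    // Assumption: no traffic means full compliance and an empty budget.
    return { currentPercentage: 100, errorBudgetRemaining: 0 };
  }
  const failed = total - successful;
  // Rounded to whole events to absorb floating-point noise.
  const allowed = Math.round((total * (100 - targetPercentage)) / 100);
  return {
    currentPercentage: (successful / total) * 100,
    errorBudgetRemaining: allowed - failed,
  };
}

const state = toSloState(
  { successful_events: "999400", total_events: "1000000" },
  99.9,
);
console.log(state.errorBudgetRemaining); // 600 of the 1,000 allowed failures are used
```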

Background Tasks

New background task for SLO monitoring:

Calculate SLO compliance periodically (every 1-5 minutes)
Update burn rates
Trigger burn alerts when thresholds are exceeded
Store historical SLO data for reporting
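
The four steps above could be organised as a single pass that a scheduler calls every few minutes. The sketch below is synchronous and takes its data access and notification as injected functions so it stays self-contained. All names are hypothetical, and a real task would presumably be async and query ClickHouse.

```typescript
interface SloConfig {
  name: string;
  targetPercentage: number;
  alertBurnRate: number; // alert once the burn rate reaches this value
}

interface Counts {
  failed: number;
  total: number;
}

interface Snapshot {
  at: Date;
  sloName: string;
  compliance: number;
  burnRate: number;
  alerted: boolean;
}

// Steps 1 and 2: compute compliance and burn rate for one SLO.
function checkSlo(slo: SloConfig, counts: Counts, now: Date): Snapshot {
  const allowedErrorRate = (100 - slo.targetPercentage) / 100;
  const errorRate = counts.total === 0 ? 0 : counts.failed / counts.total;
  const burnRate = errorRate / allowedErrorRate;
  return {
    at: now,
    sloName: slo.name,
    compliance: (1 - errorRate) * 100,
    burnRate,
    alerted: burnRate >= slo.alertBurnRate,
  };
}

// One pass of the periodic task over every configured SLO.
function runOnce(
  slos: SloConfig[],
  fetchCounts: (slo: SloConfig) => Counts,
  history: Snapshot[],
  notify: (snapshot: Snapshot) => void,
  now: Date = new Date(),
): void {
  for (const slo of slos) {
    const snapshot = checkSlo(slo, fetchCounts(slo), now);
    history.push(snapshot); // step 4: keep history for reporting
    if (snapshot.alerted) notify(snapshot); // step 3: trigger burn alerts
  }
}
```

Keeping `checkSlo` free of I/O makes the calculation easy to unit test separately from scheduling and storage.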

UI/UX Considerations

SLO List Page
- Overview of all SLOs with current status
- Quick view of error budget remaining
- Filter by tags, services, compliance status
SLO Detail Page
- Real-time compliance percentage
- Error budget burndown chart
- Burn rate over time
- Recent violations with drill-down capability
SLO Creation Wizard
- Step-by-step SLI and SLO configuration
- Preview of expected error budget
- Alert configuration

References

Acceptance Criteria

Users can define SLIs based on telemetry data
Users can create SLOs with target percentages and time windows
Error budget is calculated and displayed in near real time
Burn rate is tracked and visualized
Burn alerts can be configured and trigger notifications
SLO dashboard provides at-a-glance view of service reliability
Historical SLO data is available for reporting

Labels

enhancement, feature-request, observability
