Summary
Add support for Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to HyperDX, inspired by Honeycomb's SLO implementation.
Background
SLOs are a critical component of modern observability and Site Reliability Engineering (SRE) practices. They help teams:
- Define measurable reliability goals for services
- Balance feature development with infrastructure maintenance
- Prioritize incidents effectively based on error budget consumption
- Report on service quality with precision
Key Concepts to Implement
1. Service Level Indicator (SLI)
A per-event measurement that defines whether the system succeeded or failed.
Examples:
- Latency: response_time < 500ms
- Availability: http_status_code < 500
- Error rate: error == false
2. Service Level Objective (SLO)
The target proportion of events that satisfy the SLI over a rolling time window, expressed as a percentage.
Examples:
- "99.9% availability over 30 days"
- "99% of requests under 500ms latency over 7 days"
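As a rough illustration of how an SLO check reduces to two counters, the sketch below computes compliance from successful and total events in the window. The names `SloWindowCounts`, `compliancePercentage`, and `meetsTarget` are hypothetical, and treating an empty window as compliant is an assumption, not a decision made in this proposal.

```typescript
// Hypothetical shape: counts of events inside the SLO's rolling window.
interface SloWindowCounts {
  successfulEvents: number;
  totalEvents: number;
}

// Share of events that satisfied the SLI, as a percentage.
function compliancePercentage(counts: SloWindowCounts): number {
  // Assumption: a window with no traffic is reported as fully compliant.
  if (counts.totalEvents === 0) return 100;
  return (counts.successfulEvents / counts.totalEvents) * 100;
}

// True when the measured compliance is at or above the SLO target.
function meetsTarget(counts: SloWindowCounts, targetPercentage: number): boolean {
  return compliancePercentage(counts) >= targetPercentage;
}
```

For example, 999,500 successful events out of 1,000,000 is 99.95% compliance, which meets a 99.9% target.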
3. Error Budget
The allowable amount of failure within the SLO window.
Example calculation:
- At 99.9% target with 1 million events in 30 days
- Error budget = 1,000 failed events (0.1% of 1M)
- Or ~43 minutes of downtime in 30 days (0.1% of 43,200 minutes)
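The arithmetic above can be sketched as follows. The function names are hypothetical. The target is converted to parts per million with rounding, because `100 - 99.9` is not exactly `0.1` in floating point and a naive `Math.floor` would otherwise report 999 instead of 1,000.

```typescript
const PPM = 1_000_000;

// Convert a target such as 99.9 into the allowed failure ratio in parts per
// million. Rounding absorbs floating-point drift in (100 - target).
function allowedFailurePpm(targetPercentage: number): number {
  return Math.round((100 - targetPercentage) * 10_000);
}

// Event-based budget: how many events may fail within the window.
function errorBudgetEvents(targetPercentage: number, totalEvents: number): number {
  return Math.floor((totalEvents * allowedFailurePpm(targetPercentage)) / PPM);
}

// Time-based budget: how many minutes of full downtime the window tolerates.
function errorBudgetMinutes(targetPercentage: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (windowMinutes * allowedFailurePpm(targetPercentage)) / PPM;
}
```

With these definitions, `errorBudgetEvents(99.9, 1_000_000)` gives 1000 events and `errorBudgetMinutes(99.9, 30)` gives 43.2 minutes.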
4. Burn Rate
How quickly the error budget is being consumed compared to the target rate.
- Burn rate of 1.0 = consuming budget evenly (will deplete exactly at window end)
- Burn rate of 2.0 = consuming twice as fast (will deplete in half the window)
Formula:
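Burn rate = actual error rate / expected error rate, where the expected error rate is the rate the SLO allows (100% minus the target, e.g. 0.1% for a 99.9% target)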
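A minimal sketch of the formula, assuming the expected error rate is the allowed failure ratio derived from the target. `burnRate` and `daysToExhaustion` are hypothetical names, and returning 0 for a window without traffic is an assumption.

```typescript
// Ratio of the observed error rate to the error rate the SLO allows.
function burnRate(
  failedEvents: number,
  totalEvents: number,
  targetPercentage: number,
): number {
  if (targetPercentage >= 100) {
    throw new Error('A 100% target leaves no error budget to burn');
  }
  // Assumption: no traffic means no budget is being consumed.
  if (totalEvents === 0) return 0;
  const actualErrorRate = failedEvents / totalEvents;
  const allowedErrorRate = (100 - targetPercentage) / 100;
  return actualErrorRate / allowedErrorRate;
}

// Days until a full budget is spent if the given burn rate stays constant.
function daysToExhaustion(rate: number, windowDays: number): number {
  return rate > 0 ? windowDays / rate : Infinity;
}
```

A service with a 99.9% target that fails 2,000 of 1,000,000 events burns at roughly 2.0, and at that rate a 30-day budget lasts 15 days.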
Proposed Features
Phase 1: Core SLO Infrastructure
- SLI Definition
  - Allow users to define SLIs using existing query builder
  - Support for multiple SLI types:
    - Latency-based (threshold comparison)
    - Availability-based (error rate)
    - Custom expressions
- SLO Configuration
  - Target percentage (e.g., 99.9%)
  - Rolling time window (7d, 14d, 30d, 90d)
  - Associated services/sources
  - Name and description
- Error Budget Tracking
  - Calculate remaining error budget
  - Budget burndown visualization
  - Historical budget consumption
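For the burndown visualization, the underlying series could be as simple as a running subtraction of failures from the budget. The sketch below assumes failures are already bucketed per day; `budgetBurndown` and its input shape are hypothetical.

```typescript
// Remaining error budget (in events) after each day of the window.
// failedPerDay[i] is the number of failed events observed on day i.
// A negative value means the budget has been overspent.
function budgetBurndown(budgetEvents: number, failedPerDay: number[]): number[] {
  let remaining = budgetEvents;
  return failedPerDay.map((failed) => {
    remaining -= failed;
    return remaining;
  });
}
```

Given a budget of 1,000 events and daily failures of 100, 250, and 700, the series is 900, 650, -50, so the chart would show the budget exhausted on the third day.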
Phase 2: Alerting & Notifications
- Burn Alerts
  - Alert when error budget is depleting faster than expected
  - Configurable burn rate thresholds
  - Multiple alert severity levels based on burn rate
- Integration with Existing Alert Channels
  - Webhook notifications
  - Future: Slack, PagerDuty, email
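One possible way to map a burn rate onto severity levels is sketched below. The threshold shape mirrors the `burnAlerts.thresholds` entries proposed in the data model, minus the channel. `triggeredSeverity` is a hypothetical name, and treating a burn rate equal to the threshold as a trigger is an assumption.

```typescript
type Severity = 'warning' | 'critical';

interface BurnThreshold {
  burnRate: number;
  severity: Severity;
}

// Returns the most severe level whose threshold the current burn rate has
// reached, or null when no threshold is reached.
function triggeredSeverity(
  currentBurnRate: number,
  thresholds: BurnThreshold[],
): Severity | null {
  const reached = thresholds.filter((t) => currentBurnRate >= t.burnRate);
  if (reached.length === 0) return null;
  return reached.some((t) => t.severity === 'critical') ? 'critical' : 'warning';
}
```

With illustrative thresholds of 2 (warning) and 10 (critical), a burn rate of 3 yields `'warning'` and a burn rate of 14.4 yields `'critical'`.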
Phase 3: Visualization & Reporting
- SLO Dashboard
  - Current SLO compliance percentage
  - Error budget remaining (events and time)
  - Burn rate visualization
  - Historical SLO performance
- BubbleUp Integration
  - Click through from SLO violations to investigate root causes
  - Identify outliers contributing to budget burn
Phase 4: Advanced Features (Future)
- Multi-Service SLOs
  - Share error budget across multiple services
  - Aggregate events from related services
  - Support for up to 10 services per SLO
- SLO Tags
  - Organize SLOs by team, project, or service
  - Filter and group SLOs
- SLO Reporting
  - Weekly/monthly SLO reports
  - Trend analysis
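For the Multi-Service SLOs item, sharing one budget could amount to summing per-service counters before computing compliance. The sketch below is one hypothetical approach; `aggregateServiceCounts` and `ServiceCounts` are illustrative names, and only the limit of 10 services comes from this proposal.

```typescript
const MAX_SERVICES_PER_SLO = 10; // limit stated in the proposal

interface ServiceCounts {
  service: string;
  successfulEvents: number;
  totalEvents: number;
}

// Sum the counters of all services that share one SLO so a single
// compliance figure and error budget can be derived from the totals.
function aggregateServiceCounts(
  perService: ServiceCounts[],
): { successfulEvents: number; totalEvents: number } {
  if (perService.length > MAX_SERVICES_PER_SLO) {
    throw new Error(`An SLO supports at most ${MAX_SERVICES_PER_SLO} services`);
  }
  let successfulEvents = 0;
  let totalEvents = 0;
  for (const counts of perService) {
    successfulEvents += counts.successfulEvents;
    totalEvents += counts.totalEvents;
  }
  return { successfulEvents, totalEvents };
}
```

Summing raw counts weights each service by its traffic, so a low-volume service cannot dominate the shared budget; whether that weighting is desired is an open design question.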
Technical Considerations
Data Model
New MongoDB models needed:
    // SLI Model
    interface ISLI {
      id: string;
      name: string;
      description?: string;
      team: ObjectId;
      source: ObjectId; // Reference to Source/Service

      // Query definition
      successCondition: {
        field: string;
        operator: '<' | '<=' | '>' | '>=' | '==' | '!=' | 'exists';
        value: string | number | boolean;
      };

      createdBy: ObjectId;
      createdAt: Date;
      updatedAt: Date;
    }

    // SLO Model
    interface ISLO {
      id: string;
      name: string;
      description?: string;
      team: ObjectId;
      sli: ObjectId; // Reference to SLI

      // Target configuration
      targetPercentage: number; // e.g., 99.9
      windowDays: number; // e.g., 30

      // Current state
      currentPercentage?: number;
      errorBudgetRemaining?: number;
      burnRate?: number;

      // Alerting
      burnAlerts?: {
        enabled: boolean;
        thresholds: Array<{
          burnRate: number;
          severity: 'warning' | 'critical';
          channel: AlertChannel;
        }>;
      };

      tags?: string[];
      createdBy: ObjectId;
      createdAt: Date;
      updatedAt: Date;
    }
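To make the semantics of `successCondition` concrete, here is a minimal sketch of evaluating it against a single event in application code. In practice the check would more likely run inside ClickHouse; `isSuccessful` and `TelemetryEvent` are hypothetical, and two behaviours are assumptions: a missing field counts as a failure, and ordering operators only apply to numeric values.

```typescript
type Operator = '<' | '<=' | '>' | '>=' | '==' | '!=' | 'exists';

interface SuccessCondition {
  field: string;
  operator: Operator;
  value: string | number | boolean;
}

// Hypothetical flat event shape; real telemetry rows may be nested.
type TelemetryEvent = Record<string, string | number | boolean | undefined>;

function isSuccessful(event: TelemetryEvent, cond: SuccessCondition): boolean {
  const actual = event[cond.field];
  const operator = cond.operator;
  const expected = cond.value;

  if (operator === 'exists') return actual !== undefined;
  // Assumption: an event without the field fails the SLI.
  if (actual === undefined) return false;
  if (operator === '==') return actual === expected;
  if (operator === '!=') return actual !== expected;

  // Assumption: ordering comparisons are only defined for numbers here.
  if (typeof actual !== 'number' || typeof expected !== 'number') return false;
  switch (operator) {
    case '<':
      return actual < expected;
    case '<=':
      return actual <= expected;
    case '>':
      return actual > expected;
    default:
      // The only operator left at this point is '>='.
      return actual >= expected;
  }
}
```

For instance, `{ field: 'duration_ms', operator: '<', value: 500 }` marks an event with `duration_ms: 120` as successful and one with `duration_ms: 900` as failed.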
ClickHouse Queries
SLO calculations will require efficient ClickHouse queries:
    -- Calculate SLI success rate over time window
    SELECT
      countIf(duration_ms < 500) AS successful_events,
      count(*) AS total_events,
      successful_events / total_events * 100 AS success_rate
    FROM events
    WHERE timestamp >= now() - INTERVAL 30 DAY
      AND service_name = 'my-service'
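A query like the one above could be generated from an SLI definition. The sketch below is one hedged way to do that: values are passed as ClickHouse query parameters (`{name:Type}` placeholders) instead of being interpolated, and the column name is validated because identifiers cannot be parameterized. `buildSliQuery`, the `events` table, and the column names are taken from the example or invented for illustration; the real HyperDX schema may differ.

```typescript
type ComparisonOperator = '<' | '<=' | '>' | '>=' | '==' | '!=';

interface SliQueryInput {
  field: string; // column to compare, e.g. duration_ms
  operator: ComparisonOperator;
  threshold: number;
  serviceName: string;
  windowDays: number;
}

interface ParameterizedQuery {
  query: string;
  params: Record<string, string | number>;
}

const IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;
const OPERATORS: string[] = ['<', '<=', '>', '>=', '==', '!='];

function buildSliQuery(input: SliQueryInput): ParameterizedQuery {
  // Column names and operators are spliced into the SQL text, so both are
  // checked against an allow-list to avoid injection.
  if (!IDENTIFIER.test(input.field)) {
    throw new Error(`Invalid column name: ${input.field}`);
  }
  if (OPERATORS.indexOf(input.operator) === -1) {
    throw new Error(`Unsupported operator: ${input.operator}`);
  }
  const query = [
    'SELECT',
    `  countIf(${input.field} ${input.operator} {threshold:Float64}) AS successful_events,`,
    '  count(*) AS total_events,',
    '  successful_events / total_events * 100 AS success_rate',
    'FROM events',
    'WHERE timestamp >= now() - toIntervalDay({windowDays:UInt32})',
    '  AND service_name = {serviceName:String}',
  ].join('\n');
  return {
    query,
    params: {
      threshold: input.threshold,
      windowDays: input.windowDays,
      serviceName: input.serviceName,
    },
  };
}
```

The returned `query` and `params` would then be handed to whichever ClickHouse client the API already uses; how parameters are passed depends on that client.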
Background Tasks
New background task for SLO monitoring:
- Calculate SLO compliance periodically (every 1-5 minutes)
- Update burn rates
- Trigger burn alerts when thresholds exceeded
- Store historical SLO data for reporting
UI/UX Considerations
- SLO List Page
  - Overview of all SLOs with current status
  - Quick view of error budget remaining
  - Filter by tags, services, compliance status
- SLO Detail Page
  - Real-time compliance percentage
  - Error budget burndown chart
  - Burn rate over time
  - Recent violations with drill-down capability
- SLO Creation Wizard
  - Step-by-step SLI and SLO configuration
  - Preview of expected error budget
  - Alert configuration
Acceptance Criteria
- Users can define SLIs based on telemetry data
- Users can create SLOs with target percentages and time windows
- Error budget is calculated and displayed in real time
- Burn rate is tracked and visualized
- Burn alerts can be configured and trigger notifications
- SLO dashboard provides at-a-glance view of service reliability
- Historical SLO data is available for reporting
Labels
enhancement, feature-request, observability