Feature Request: Service Level Objectives (SLOs), SLIs, and Error Budgets #1453

Open
@alok87

Description

Summary

Add support for Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to HyperDX, inspired by Honeycomb's SLO implementation.

Background

SLOs are a critical component of modern observability and Site Reliability Engineering (SRE) practices. They help teams:

  • Define measurable reliability goals for services
  • Balance feature development with infrastructure maintenance
  • Prioritize incidents effectively based on error budget consumption
  • Report on service quality with precision

Key Concepts to Implement

1. Service Level Indicator (SLI)

A per-event measurement that defines whether the system succeeded or failed.

Examples:

  • Latency: response_time < 500ms
  • Availability: http_status_code < 500
  • Error rate: error == false
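As a minimal sketch of the per-event idea, each example above can be written as a predicate over a single event. The `TelemetryEvent` shape and its field names are assumptions made for illustration, not HyperDX's actual schema.

```typescript
// Hypothetical event shape; the field names are illustrative only.
interface TelemetryEvent {
  response_time_ms: number;
  http_status_code: number;
  error: boolean;
}

// An SLI classifies one event as good (true) or bad (false).
type Sli = (event: TelemetryEvent) => boolean;

const latencySli: Sli = (e) => e.response_time_ms < 500;
const availabilitySli: Sli = (e) => e.http_status_code < 500;
const errorSli: Sli = (e) => e.error === false;

// Count the events that an SLI classifies as good.
function countGood(events: TelemetryEvent[], sli: Sli): number {
  return events.filter(sli).length;
}

const sample: TelemetryEvent[] = [
  { response_time_ms: 120, http_status_code: 200, error: false },
  { response_time_ms: 740, http_status_code: 200, error: false },
  { response_time_ms: 90, http_status_code: 503, error: true },
];
```

Calling `countGood(sample, latencySli)` counts the events under the latency threshold; the same event can be good for one SLI and bad for another.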

2. Service Level Objective (SLO)

The target proportion of successful SLIs over a rolling time window, expressed as a percentage.

Examples:

  • "99.9% availability over 30 days"
  • "99% of requests under 500ms latency over 7 days"

3. Error Budget

The allowable amount of failure within the SLO window.

Example calculation:

  • At 99.9% target with 1 million events in 30 days
  • Error budget = 1,000 failed events (0.1% of 1M)
  • Or ~43 minutes of downtime in 30 days (0.1% of 43,200 minutes)
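The arithmetic behind this example can be sketched as follows. The function names are hypothetical, and the results are floating-point values that a real implementation would probably round before display.

```typescript
// Fraction of events (or time) allowed to fail, e.g. about 0.001 for a 99.9% target.
function allowedFailureFraction(targetPercentage: number): number {
  if (targetPercentage <= 0 || targetPercentage >= 100) {
    throw new RangeError("targetPercentage must be between 0 and 100 (exclusive)");
  }
  return (100 - targetPercentage) / 100;
}

// Error budget expressed as a number of failed events.
function errorBudgetEvents(targetPercentage: number, totalEvents: number): number {
  return totalEvents * allowedFailureFraction(targetPercentage);
}

// Error budget expressed as minutes of full downtime within the window.
function errorBudgetMinutes(targetPercentage: number, windowDays: number): number {
  const minutesInWindow = windowDays * 24 * 60;
  return minutesInWindow * allowedFailureFraction(targetPercentage);
}

// Budget left after some failures; a negative result means the SLO is breached.
function errorBudgetRemaining(
  targetPercentage: number,
  totalEvents: number,
  failedEvents: number,
): number {
  return errorBudgetEvents(targetPercentage, totalEvents) - failedEvents;
}

const budgetEvents = errorBudgetEvents(99.9, 1000000); // about 1,000 events
const budgetMinutes = errorBudgetMinutes(99.9, 30); // about 43.2 minutes
```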

4. Burn Rate

How quickly the error budget is being consumed compared to the target rate.

  • Burn rate of 1.0 = consuming budget evenly (will deplete exactly at window end)
  • Burn rate of 2.0 = consuming twice as fast (will deplete in half the window)

Formula:

Burn rate = actual error rate / allowed error rate, where allowed error rate = 1 - SLO target (e.g., 0.001 for a 99.9% target)
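A small sketch of this formula, assuming the denominator is the error rate the SLO target permits (1 minus the target). The function names are hypothetical, and `daysToExhaustion` assumes a full budget at the start of the window and a constant burn rate.

```typescript
// Burn rate = observed error rate divided by the error rate the target allows.
function burnRate(failedEvents: number, totalEvents: number, targetPercentage: number): number {
  const allowedErrorRate = (100 - targetPercentage) / 100;
  if (allowedErrorRate <= 0) {
    throw new RangeError("targetPercentage must be below 100");
  }
  if (totalEvents === 0) {
    return 0; // no traffic, so no budget is being consumed
  }
  const actualErrorRate = failedEvents / totalEvents;
  return actualErrorRate / allowedErrorRate;
}

// Days until a full budget is exhausted if the burn rate stays constant.
function daysToExhaustion(windowDays: number, rate: number): number {
  return rate > 0 ? windowDays / rate : Infinity;
}

// 0.2% of events failing against a 99.9% target burns the budget about twice as fast as allowed.
const exampleRate = burnRate(2000, 1000000, 99.9);
```

With a 30-day window, a burn rate of 2.0 gives `daysToExhaustion(30, 2)` of 15 days, matching the "half the window" description.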

Proposed Features

Phase 1: Core SLO Infrastructure

  1. SLI Definition

    • Allow users to define SLIs using the existing query builder
    • Support for multiple SLI types:
      • Latency-based (threshold comparison)
      • Availability-based (error rate)
      • Custom expressions
  2. SLO Configuration

    • Target percentage (e.g., 99.9%)
    • Rolling time window (7d, 14d, 30d, 90d)
    • Associated services/sources
    • Name and description
  3. Error Budget Tracking

    • Calculate remaining error budget
    • Budget burndown visualization
    • Historical budget consumption

Phase 2: Alerting & Notifications

  1. Burn Alerts

    • Alert when error budget is depleting faster than expected
    • Configurable burn rate thresholds
    • Multiple alert severity levels based on burn rate
  2. Integration with Existing Alert Channels

    • Webhook notifications
    • Future: Slack, PagerDuty, email
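One possible way to map a burn rate onto severity levels is to pick the highest configured threshold that the current burn rate has reached. This is only a sketch: the threshold values below are illustrative, and treating "reached" as `>=` is an assumption.

```typescript
type Severity = 'warning' | 'critical';

interface BurnThreshold {
  burnRate: number;
  severity: Severity;
}

// Returns the highest threshold the current burn rate has reached, or undefined if none.
function evaluateBurnAlert(
  currentBurnRate: number,
  thresholds: BurnThreshold[],
): BurnThreshold | undefined {
  let match: BurnThreshold | undefined;
  for (const t of thresholds) {
    const reached = currentBurnRate >= t.burnRate;
    if (reached && (match === undefined || t.burnRate > match.burnRate)) {
      match = t;
    }
  }
  return match;
}

// Illustrative configuration; real thresholds would be user-defined.
const exampleThresholds: BurnThreshold[] = [
  { burnRate: 2, severity: 'warning' },
  { burnRate: 10, severity: 'critical' },
];
```

The returned threshold can then be handed to whichever notification channel is configured, such as a webhook.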

Phase 3: Visualization & Reporting

  1. SLO Dashboard

    • Current SLO compliance percentage
    • Error budget remaining (events and time)
    • Burn rate visualization
    • Historical SLO performance
  2. BubbleUp Integration

    • Click through from SLO violations to investigate root causes
    • Identify outliers contributing to budget burn

Phase 4: Advanced Features (Future)

  1. Multi-Service SLOs

    • Share error budget across multiple services
    • Aggregate events from related services
    • Support for up to 10 services per SLO
  2. SLO Tags

    • Organize SLOs by team, project, or service
    • Filter and group SLOs
  3. SLO Reporting

    • Weekly/monthly SLO reports
    • Trend analysis

Technical Considerations

Data Model

New MongoDB models needed:

// SLI Model
interface ISLI {
  id: string;
  name: string;
  description?: string;
  team: ObjectId;
  source: ObjectId; // Reference to Source/Service

  // Query definition
  successCondition: {
    field: string;
    operator: '<' | '<=' | '>' | '>=' | '==' | '!=' | 'exists';
    value: string | number | boolean;
  };

  createdBy: ObjectId;
  createdAt: Date;
  updatedAt: Date;
}

// SLO Model
interface ISLO {
  id: string;
  name: string;
  description?: string;
  team: ObjectId;
  sli: ObjectId; // Reference to SLI

  // Target configuration
  targetPercentage: number; // e.g., 99.9
  windowDays: number; // e.g., 30

  // Current state
  currentPercentage?: number;
  errorBudgetRemaining?: number;
  burnRate?: number;

  // Alerting
  burnAlerts?: {
    enabled: boolean;
    thresholds: Array<{
      burnRate: number;
      severity: 'warning' | 'critical';
      channel: AlertChannel;
    }>;
  };

  tags?: string[];
  createdBy: ObjectId;
  createdAt: Date;
  updatedAt: Date;
}

ClickHouse Queries

SLO calculations will require efficient ClickHouse queries:

-- Calculate SLI success rate over time window
SELECT
  countIf(duration_ms < 500) AS successful_events,
  count(*) AS total_events,
  successful_events / total_events * 100 AS success_rate
FROM events
WHERE
  timestamp >= now() - INTERVAL 30 DAY
  AND service_name = 'my-service';
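To connect the data model to a query like the one above, the SLI's `successCondition` could be translated into a `countIf(...)` expression. The sketch below is one hypothetical approach: `toCountIf` and `toSqlLiteral` are illustrative names, mapping `exists` to `isNotNull` is an assumption about the intended semantics, and only plain column names are accepted (map or nested attribute access is out of scope here). Binding values as query parameters would likely be the safer route in a real implementation.

```typescript
type Operator = '<' | '<=' | '>' | '>=' | '==' | '!=' | 'exists';

interface SuccessCondition {
  field: string;
  operator: Operator;
  value: string | number | boolean;
}

// Only plain column names pass, so user input never reaches the SQL as a raw identifier.
const IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Render a value as a SQL literal, escaping strings.
function toSqlLiteral(value: string | number | boolean): string {
  if (typeof value === 'number') {
    if (!Number.isFinite(value)) {
      throw new RangeError('value must be a finite number');
    }
    return String(value);
  }
  if (typeof value === 'boolean') {
    return value ? 'true' : 'false';
  }
  // Escape backslashes first, then single quotes.
  return "'" + value.replace(/\\/g, '\\\\').replace(/'/g, "\\'") + "'";
}

// Build the countIf(...) fragment for the "successful events" column.
function toCountIf(condition: SuccessCondition): string {
  if (!IDENTIFIER.test(condition.field)) {
    throw new Error(`invalid field name: ${condition.field}`);
  }
  if (condition.operator === 'exists') {
    // Assumption: "exists" means the column is not NULL.
    return `countIf(isNotNull(${condition.field}))`;
  }
  const op = condition.operator === '==' ? '=' : condition.operator;
  return `countIf(${condition.field} ${op} ${toSqlLiteral(condition.value)})`;
}
```

For the latency example, `toCountIf({ field: 'duration_ms', operator: '<', value: 500 })` yields the same `countIf(duration_ms < 500)` fragment used in the query above.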

Background Tasks

New background task for SLO monitoring:

  • Calculate SLO compliance periodically (every 1-5 minutes)
  • Update burn rates
  • Trigger burn alerts when thresholds are exceeded
  • Store historical SLO data for reporting
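One evaluation pass of such a task might look like the sketch below. It assumes the event counts for the SLO window have already been fetched (for example by a ClickHouse query), computes burn rate over the whole window as a simplification, and keeps history in an in-memory array purely for illustration. `computeSnapshot`, `runTick`, and the interface names are hypothetical.

```typescript
interface SloTarget {
  targetPercentage: number; // e.g. 99.9
}

// Counts over the SLO's rolling window, assumed to be fetched elsewhere.
interface WindowCounts {
  successfulEvents: number;
  totalEvents: number;
}

interface SloSnapshot {
  currentPercentage: number;
  errorBudgetRemaining: number; // in events
  burnRate: number;
  evaluatedAt: Date;
}

function computeSnapshot(slo: SloTarget, counts: WindowCounts, now: Date): SloSnapshot {
  const allowed = (100 - slo.targetPercentage) / 100;
  if (allowed <= 0) {
    throw new RangeError('targetPercentage must be below 100');
  }
  if (counts.totalEvents === 0) {
    // No traffic in the window: nothing succeeded or failed, so no budget exists or is consumed.
    return { currentPercentage: 100, errorBudgetRemaining: 0, burnRate: 0, evaluatedAt: now };
  }
  const failed = counts.totalEvents - counts.successfulEvents;
  return {
    currentPercentage: (counts.successfulEvents / counts.totalEvents) * 100,
    errorBudgetRemaining: counts.totalEvents * allowed - failed,
    burnRate: failed / counts.totalEvents / allowed,
    evaluatedAt: now,
  };
}

// One tick of the task: evaluate the SLO and append the result to its history.
function runTick(
  slo: SloTarget,
  counts: WindowCounts,
  history: SloSnapshot[],
  now: Date = new Date(),
): SloSnapshot {
  const snapshot = computeSnapshot(slo, counts, now);
  history.push(snapshot);
  return snapshot;
}
```

A scheduler could call `runTick` once per SLO at the chosen interval and compare the returned `burnRate` against the configured alert thresholds.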

UI/UX Considerations

  1. SLO List Page

    • Overview of all SLOs with current status
    • Quick view of error budget remaining
    • Filter by tags, services, compliance status
  2. SLO Detail Page

    • Real-time compliance percentage
    • Error budget burndown chart
    • Burn rate over time
    • Recent violations with drill-down capability
  3. SLO Creation Wizard

    • Step-by-step SLI and SLO configuration
    • Preview of expected error budget
    • Alert configuration

Acceptance Criteria

  • Users can define SLIs based on telemetry data
  • Users can create SLOs with target percentages and time windows
  • Error budget is calculated and displayed in near real time
  • Burn rate is tracked and visualized
  • Burn alerts can be configured and trigger notifications
  • SLO dashboard provides at-a-glance view of service reliability
  • Historical SLO data is available for reporting

Labels

enhancement, feature-request, observability
