springdom/solace

Status: Alpha | Python 3.12+ | License: Apache 2.0

Solace

Open-source alert management and incident response platform. Ingest alerts from any monitoring source, deduplicate them, auto-correlate into incidents, and manage the response — all from a single dashboard.

Think PagerDuty / OpsGenie, but open-source and self-hosted.

Features

Authentication & Access Control

  • JWT-based authentication — Secure login with username/password, 8-hour token expiry
  • Role-based access control (RBAC) — Three roles: Admin (full access), User (read + acknowledge/resolve), Viewer (read-only)
  • Default admin account — Auto-seeded on first startup with configurable credentials
  • First-login password change — Admin account requires password change on first login
  • API key backward compatibility — Webhook ingestion continues to use X-API-Key header, existing integrations unaffected
  • User management — Admin panel to create, edit, and deactivate user accounts
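
The three roles can be read as nested permission sets. The sketch below is illustrative only: the action names (`read`, `acknowledge`, `resolve`, `manage_users`) and the `can` helper are assumptions made for this example, not Solace's actual implementation.

```python
from enum import Enum


class Role(str, Enum):
    ADMIN = "admin"
    USER = "user"
    VIEWER = "viewer"


# Hypothetical action names; the feature list only describes roles in prose.
READ_ONLY = frozenset({"read"})
RESPONDER = READ_ONLY | {"acknowledge", "resolve"}

PERMISSIONS = {
    Role.VIEWER: READ_ONLY,
    Role.USER: RESPONDER,
    Role.ADMIN: None,  # None means "every action is allowed"
}


def can(role: Role, action: str) -> bool:
    """Return True if the given role may perform the action."""
    allowed = PERMISSIONS[role]
    return allowed is None or action in allowed
```

Under these assumptions, `can(Role.USER, "resolve")` is true while `can(Role.VIEWER, "acknowledge")` is false.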

On-Call Scheduling

  • Flexible rotations — Hourly, daily, weekly, or custom rotation intervals
  • Member management — Add team members to schedules with ordered rotation positions
  • Timezone-aware handoffs — Configure handoff time and timezone per schedule
  • Temporary overrides — Swap on-call duty for a time range with reason tracking
  • "Who's On Call" view — Real-time display of the current on-call person per schedule
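
One way to picture how a rotation and its overrides resolve to a single person is sketched below. The `Schedule` and `Override` shapes and the `current_on_call` function are assumptions for illustration, and the fixed-length `timedelta` rotation ignores daylight-saving shifts that a real timezone-aware handoff would have to handle.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Override:
    user: str
    start: datetime
    end: datetime
    reason: str = ""


@dataclass
class Schedule:
    members: list[str]        # ordered rotation positions
    rotation: timedelta       # e.g. timedelta(weeks=1)
    first_handoff: datetime   # timezone-aware start of the first shift
    overrides: list[Override] = field(default_factory=list)


def current_on_call(schedule: Schedule, now: datetime) -> str:
    """Return who is on call at `now`; an active override takes precedence."""
    for override in schedule.overrides:
        if override.start <= now < override.end:
            return override.user
    # Number of whole rotation intervals since the first handoff.
    shifts_elapsed = (now - schedule.first_handoff) // schedule.rotation
    return schedule.members[shifts_elapsed % len(schedule.members)]


start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
schedule = Schedule(["alice", "bob", "carol"], timedelta(days=1), start)
print(current_on_call(schedule, start + timedelta(days=1, hours=1)))  # bob
```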

Escalation Policies

  • Multi-level escalation — Define escalation levels with configurable timeouts (1-1440 minutes)
  • Mixed targets — Each level can notify users directly or the current on-call from a schedule
  • Repeat support — Policies can repeat through all levels N times before stopping
  • Service-to-policy mapping — Map services to escalation policies using glob patterns (e.g., billing-*, *)
  • Priority ordering — When multiple mappings match, the lowest priority number wins
  • Severity filtering — Optionally restrict mappings to specific severity levels

Alert Ingestion & Normalization

  • 6 built-in webhook normalizers — Generic, Prometheus Alertmanager, Grafana, Splunk, Datadog, and Email ingest
  • Pluggable architecture — Each provider has its own normalizer that maps vendor-specific payloads to Solace's internal format
  • Auto-severity mapping — Provider-specific priority/severity levels are normalized to Solace's 5-level model (critical, high, warning, low, info)

Deduplication

  • Fingerprint-based dedup — SHA256 hash of identity fields (source, name, service, host, labels) ensures identical alerts merge rather than duplicate
  • Configurable dedup window — Default 5 minutes; identical alerts within the window increment duplicate_count
  • Occurrence timeline — Every duplicate arrival is tracked with a timestamp for frequency analysis
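
A minimal sketch of fingerprint-based deduplication follows. It hashes the identity fields named above with SHA256 and merges repeats inside the window. The canonical JSON encoding, the in-memory `store` dict, and measuring the window from the most recent arrival are assumptions; Solace's actual encoding and storage may differ.

```python
import hashlib
import json
from datetime import datetime, timedelta

DEDUP_WINDOW = timedelta(seconds=300)  # mirrors the documented 5-minute default


def fingerprint(alert: dict) -> str:
    """SHA256 over the identity fields only; descriptions do not affect it."""
    identity = {
        "source": alert.get("source"),
        "name": alert.get("name"),
        "service": alert.get("service"),
        "host": alert.get("host"),
        "labels": alert.get("labels") or {},
    }
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(store: dict, alert: dict, now: datetime) -> dict:
    """Merge into an existing record inside the window, else create a new one."""
    key = fingerprint(alert)
    record = store.get(key)
    if record and now - record["last_received_at"] <= DEDUP_WINDOW:
        record["duplicate_count"] += 1
        record["last_received_at"] = now
        record["occurrences"].append(now)  # occurrence timeline
        return record
    record = {**alert, "fingerprint": key, "duplicate_count": 0,
              "last_received_at": now, "occurrences": [now]}
    store[key] = record
    return record
```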

Incident Correlation

  • Automatic service-based grouping — Alerts from the same service within a configurable time window (default 10 min) are grouped into a single incident
  • Severity auto-promotion — Incident severity always reflects the worst alert severity
  • Auto-resolve — When all alerts in an incident resolve, the incident auto-resolves
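
The correlation rules above can be sketched as follows. The incident dict shape, the `correlate` and `maybe_auto_resolve` names, and measuring the window from the most recent alert are assumptions for this example.

```python
from datetime import datetime, timedelta

SEVERITY_RANK = {"info": 0, "low": 1, "warning": 2, "high": 3, "critical": 4}
CORRELATION_WINDOW = timedelta(seconds=600)  # mirrors the documented 10-minute default


def correlate(incidents: list[dict], alert: dict, now: datetime) -> dict:
    """Attach the alert to an open incident for its service, or open a new one."""
    for inc in incidents:
        if (inc["status"] != "resolved"
                and inc["service"] == alert["service"]
                and now - inc["last_alert_at"] <= CORRELATION_WINDOW):
            break
    else:
        inc = {"service": alert["service"], "status": "open",
               "severity": "info", "alerts": [], "last_alert_at": now}
        incidents.append(inc)
    inc["alerts"].append(alert)
    inc["last_alert_at"] = now
    # Severity auto-promotion: the incident carries the worst alert severity.
    inc["severity"] = max((a["severity"] for a in inc["alerts"]),
                          key=SEVERITY_RANK.__getitem__)
    return inc


def maybe_auto_resolve(inc: dict) -> None:
    """Resolve the incident once every alert in it has resolved."""
    if inc["alerts"] and all(a.get("status") == "resolved" for a in inc["alerts"]):
        inc["status"] = "resolved"
```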

Alert Lifecycle

  • Full status workflow — Firing → Acknowledged → Resolved, plus Suppressed and Archived states
  • Acknowledge & resolve — One-click actions from the dashboard or via API
  • Bulk operations — Select multiple alerts and acknowledge or resolve them in one action
  • Archive — Archive resolved alerts older than N days to keep the dashboard clean

Incident Management

  • Incident timeline — Every action (created, alert added, severity changed, acknowledged, resolved) is recorded as a timestamped event
  • Incident detail view — See all correlated alerts, event audit trail, and incident metadata in one place
  • Cascade actions — Acknowledging/resolving an incident applies to all its alerts

Notification Channels (5 types)

  • Slack — Block Kit formatted messages with severity color coding, alert counts, service info, and dashboard links
  • Microsoft Teams — Adaptive Card messages via incoming webhook or Power Automate workflow URLs
  • Email — HTML-formatted incident notifications via SMTP with correlated alert tables
  • Generic Webhook (Outbound) — JSON payload with full incident and alert data, optional shared secret for HMAC verification, custom headers support
  • PagerDuty — Events API v2 integration; triggers, resolves, and dedup keys sync incidents to PagerDuty services
  • Per-channel filters — Filter notifications by severity and/or service
  • Rate limiting — Per-channel, per-incident cooldown prevents notification spam
  • Delivery logs — Every notification attempt is logged with status (pending/sent/failed) and error details
  • Test button — Send a test notification through any channel from the UI
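
To illustrate how per-channel filters and the per-channel, per-incident cooldown might combine, here is a small sketch. The `filters` shape follows the channel examples in Quick Start; the `service` filter key, the `id` fields, and the in-memory `last_sent` dict are assumptions.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(seconds=300)  # mirrors NOTIFICATION_COOLDOWN_SECONDS


def passes_filters(filters: dict, incident: dict) -> bool:
    """An empty or missing filter list matches everything."""
    for key in ("severity", "service"):
        allowed = filters.get(key)
        if allowed and incident.get(key) not in allowed:
            return False
    return True


def should_notify(channel: dict, incident: dict, last_sent: dict, now: datetime) -> bool:
    """Apply the channel filters, then the cooldown for this channel/incident pair."""
    if not passes_filters(channel.get("filters", {}), incident):
        return False
    key = (channel["id"], incident["id"])
    previous = last_sent.get(key)
    if previous is not None and now - previous < COOLDOWN:
        return False  # still cooling down
    last_sent[key] = now
    return True
```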

Silence / Maintenance Windows

  • Time-based suppression — Define start/end times for maintenance windows
  • Flexible matchers — Match by service (list), severity (list), or label key-value pairs
  • AND logic — All matchers must match for an alert to be suppressed
  • CRUD management — Create, edit, and view active/expired windows from the UI
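
The AND logic across matchers can be sketched like this. The window dict layout (`starts_at`, `ends_at`, `matchers`) is an assumption for illustration; an unset matcher is treated as matching everything.

```python
from datetime import datetime


def is_silenced(window: dict, alert: dict, now: datetime) -> bool:
    """Return True only if the window is active and every matcher matches."""
    if not (window["starts_at"] <= now < window["ends_at"]):
        return False
    matchers = window.get("matchers", {})
    if matchers.get("service") and alert.get("service") not in matchers["service"]:
        return False
    if matchers.get("severity") and alert.get("severity") not in matchers["severity"]:
        return False
    labels = alert.get("labels") or {}
    # Every label key-value pair in the matcher must be present on the alert.
    return all(labels.get(k) == v for k, v in matchers.get("labels", {}).items())
```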

Alert Enrichment

  • Tags — Free-form string tags with add/remove from UI or API; stored as JSONB with GIN index for fast queries
  • Investigation notes — Timestamped notes with author attribution and full CRUD
  • External ticket linking — Link alerts to Jira, GitHub, or any URL; auto-prepends https:// if missing
  • Runbook URL — Editable from the alert detail panel; manually paste a URL or auto-attach via runbook rules
  • Runbook rules — Pattern-based rules that auto-attach runbook URLs to incoming alerts. Define a service glob pattern (e.g., payment-*), an optional name pattern, and a URL template with variables ({service}, {host}, {name}, {environment}). First matching rule wins (priority-ordered). "Save as Rule" checkbox on the alert panel creates a rule from the current alert in one click.
  • Raw payload — Full original webhook payload preserved for forensic inspection

Alert Auto-Expire

  • Configurable TTL — Firing alerts auto-resolve after a configurable time-to-live (default 24 hours, 0 to disable)
  • Admin-only control — Only admins can adjust the TTL at runtime via Settings; env var ALERT_TTL_SECONDS for persistence across restarts
  • Smart exclusions — Acknowledged alerts are excluded from auto-expire (someone is working on it)
  • Freshness tracking — Duplicate arrivals reset the expiry timer via last_received_at, so actively recurring issues stay open
  • Auto-expired tag — Expired alerts are tagged auto-expired to distinguish from manual resolution
  • Cascade — Auto-expired alerts trigger incident resolution and WebSocket events like any other resolve
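
The auto-expire rules above reduce to a periodic sweep along these lines. This is a sketch over plain dicts, with `expire_alerts` as a hypothetical name; the real check presumably runs against the database every `ALERT_EXPIRE_CHECK_INTERVAL_SECONDS`.

```python
from datetime import datetime, timedelta


def expire_alerts(alerts: list[dict], ttl_seconds: int, now: datetime) -> list[dict]:
    """Resolve firing alerts whose last arrival is older than the TTL."""
    if ttl_seconds <= 0:  # a TTL of 0 disables auto-expire
        return []
    cutoff = now - timedelta(seconds=ttl_seconds)
    expired = []
    for alert in alerts:
        # Only firing alerts expire; acknowledged ones are left alone.
        # last_received_at is refreshed by duplicates, so recurring alerts stay open.
        if alert["status"] == "firing" and alert["last_received_at"] < cutoff:
            alert["status"] = "resolved"
            alert.setdefault("tags", []).append("auto-expired")
            expired.append(alert)
    return expired
```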

Analytics Dashboard

  • Alert volume trends — Hourly area chart showing alert ingest rate over time
  • MTTA/MTTR trends — Daily line chart tracking mean time to acknowledge and resolve
  • Top noisy services — Bar chart ranking services by alert volume
  • Severity breakdown — Distribution of alerts across severity levels
  • Time range selector — Toggle between 7-day, 14-day, and 30-day views
  • Integrated into Statistics — Expands the existing Statistics view, no separate navigation

Heartbeat / Dead-Man Monitoring

  • Dead-man switch — Register expected check-ins; if a service doesn't ping within the interval + grace period, Solace fires an alert
  • HTTP health checks — Periodically GET a URL; if the response is non-2xx or times out, Solace fires an alert
  • Automatic recovery — When a failed heartbeat recovers, Solace sends a resolved alert to close the incident
  • Full pipeline integration — Heartbeat alerts go through the standard ingestion pipeline (dedup, correlation, notifications, escalation)
  • CRUD management — Create, edit, delete, and monitor heartbeats from the Heartbeats tab
  • Slug-based ping endpoint — Dead-man pings use POST /api/v1/heartbeats/{slug}/ping with API key auth

Dashboard & UI

  • Light and dark themes — Toggle between a high-contrast dark ops-console theme and a clean light theme; preference persisted in localStorage
  • Real-time updates — WebSocket connection with automatic reconnect and fallback polling
  • Keyboard shortcuts — j/k navigation, a acknowledge, r resolve, Esc close, ? help
  • Search & filter — Full-text search across name, service, host, tags with status/severity/service filters
  • Sortable columns — Sort by time, severity, name, service, duplicate count, or status
  • Pagination — Configurable page size with server-side pagination
  • Stats bar — Live counts of alerts by status/severity, incident counts, MTTA, and MTTR

API & Integration

  • Full REST API — Every feature is accessible via API (alerts, incidents, silences, notifications, on-call, stats, settings)
  • OpenAPI docs — Auto-generated Swagger UI at /docs
  • Health checks — Liveness (/health) and readiness (/health/ready) endpoints for Kubernetes probes
  • WebSocket events — Real-time event stream for alert.created, incident.updated, incident_created, severity_changed, incident_resolved
  • Dual auth — JWT Bearer tokens for user sessions, X-API-Key header for webhook ingestion and external integrations

Architecture

Prometheus ──┐
Grafana ─────┤                   ┌─────────────┐     ┌─────────────┐
Datadog ─────┼─▶ Webhook API ──▶ │ Normalizer  │ ──▶ │    Dedup    │
Splunk ──────┤   (X-API-Key)     │ (pluggable) │     │   Engine    │
Email ───────┤                   └─────────────┘     └──────┬──────┘
Custom ──────┘                                              │
                                                     ┌──────▼──────┐
                                                     │   Silence   │
                                                     │    Check    │
                                                     └──────┬──────┘
                                                            │
                                                     ┌──────▼──────┐     ┌───────────────┐
                                                     │ Correlation │ ──▶ │ Notifications │
                                                     │   Engine    │     └───────┬───────┘
                                                     └──────┬──────┘             │
                                                            │            ┌───────▼───────┐
                                                            │            │  Escalation   │
                                                            │            │    Engine     │
                                                            │            └───────┬───────┘
                                                            │                    │
                                               ┌────────────▼────────────────────▼───────┐
                                               │           PostgreSQL + Redis            │
                                               └────────────┬────────────────────────────┘
                                                            │
                                               ┌────────────▼────────────┐
                                               │  React Dashboard (WS)   │
                                               │     JWT Auth + RBAC     │
                                               │    (Vite + Tailwind)    │
                                               └─────────────────────────┘

Quick Start

Docker Compose (recommended)

git clone https://github.com/springdom/solace.git
cd solace
docker compose up --build

Default login: admin / admin (you'll be prompted to change the password on first login)

Send a test alert

# Generic webhook
curl -X POST http://localhost:8000/api/v1/webhooks/generic \
  -H "Content-Type: application/json" \
  -d '{
    "name": "HighCPU",
    "severity": "critical",
    "service": "payment-api",
    "host": "web-01",
    "description": "CPU usage above 95% for 10 minutes",
    "tags": ["production", "us-east-1"]
  }'

# Prometheus Alertmanager
curl -X POST http://localhost:8000/api/v1/webhooks/prometheus \
  -H "Content-Type: application/json" \
  -d '{
    "version": "4",
    "status": "firing",
    "alerts": [{
      "status": "firing",
      "labels": {
        "alertname": "DiskFull",
        "instance": "db-01:9090",
        "job": "postgres",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Disk 95% full on db-01"
      },
      "startsAt": "2024-01-15T10:00:00.000Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }]
  }'

# Grafana unified alerting
curl -X POST http://localhost:8000/api/v1/webhooks/grafana \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": [{
      "status": "firing",
      "labels": { "alertname": "HighMemory", "grafana_folder": "Infrastructure" },
      "annotations": { "summary": "Memory above 90%", "severity": "high" },
      "startsAt": "2024-01-15T10:00:00.000Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "values": { "B": 92.5 }
    }]
  }'

# Datadog monitor webhook
curl -X POST http://localhost:8000/api/v1/webhooks/datadog \
  -H "Content-Type: application/json" \
  -d '{
    "id": "123456789",
    "title": "CPU is high on web-01",
    "text": "CPU utilization above threshold",
    "alert_status": "triggered",
    "priority": "P1",
    "hostname": "web-01",
    "org": { "name": "MyOrg" },
    "tags": "env:production,service:payment-api"
  }'

# Splunk webhook alert
curl -X POST http://localhost:8000/api/v1/webhooks/splunk \
  -H "Content-Type: application/json" \
  -d '{
    "result": {
      "host": "web-01",
      "severity": "critical",
      "service": "payment-api",
      "message": "CPU usage above 95% for 10 minutes"
    },
    "sid": "scheduler_admin_HighCPU_at_17000000_132",
    "search_name": "High CPU Usage Alert"
  }'
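
The same generic webhook call can be made from Python using only the standard library. This is a sketch: `build_alert_request` and `send_alert` are hypothetical helper names, and the `X-API-Key` header is only attached when a key is configured (the default `API_KEY` is empty in development).

```python
import json
import urllib.request


def build_alert_request(base_url: str, payload: dict, api_key: str = "") -> urllib.request.Request:
    """Build a POST request for the generic webhook endpoint."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        f"{base_url}/api/v1/webhooks/generic",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send_alert(base_url: str, payload: dict, api_key: str = "") -> int:
    """Send the alert and return the HTTP status code."""
    request = build_alert_request(base_url, payload, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Example (requires a running Solace instance):
# send_alert("http://localhost:8000",
#            {"name": "HighCPU", "severity": "critical", "service": "payment-api"})
```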

Test incident correlation

Alerts from the same service auto-group into a single incident:

# These two alerts will be correlated into ONE incident
curl -X POST http://localhost:8000/api/v1/webhooks/generic \
  -H "Content-Type: application/json" \
  -d '{"name":"HighCPU","severity":"critical","service":"payment-api","host":"web-01"}'

curl -X POST http://localhost:8000/api/v1/webhooks/generic \
  -H "Content-Type: application/json" \
  -d '{"name":"HighMemory","severity":"high","service":"payment-api","host":"web-02"}'

# This creates a SEPARATE incident (different service)
curl -X POST http://localhost:8000/api/v1/webhooks/generic \
  -H "Content-Type: application/json" \
  -d '{"name":"HighErrorRate","severity":"warning","service":"auth-service"}'

Configure notification channels

# Slack
curl -X POST http://localhost:8000/api/v1/notifications/channels \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Ops Slack",
    "channel_type": "slack",
    "config": { "webhook_url": "https://hooks.slack.com/services/YOUR/HOOK/URL" },
    "filters": { "severity": ["critical", "high"] }
  }'

# Microsoft Teams
curl -X POST http://localhost:8000/api/v1/notifications/channels \
  -H "Content-Type: application/json" \
  -d '{
    "name": "DevOps Teams",
    "channel_type": "teams",
    "config": { "webhook_url": "https://your-org.webhook.office.com/..." },
    "filters": { "severity": ["critical"] }
  }'

# PagerDuty
curl -X POST http://localhost:8000/api/v1/notifications/channels \
  -H "Content-Type: application/json" \
  -d '{
    "name": "PagerDuty On-Call",
    "channel_type": "pagerduty",
    "config": { "routing_key": "YOUR_PAGERDUTY_INTEGRATION_KEY" },
    "filters": { "severity": ["critical"] }
  }'

# Generic outbound webhook
curl -X POST http://localhost:8000/api/v1/notifications/channels \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Automation Webhook",
    "channel_type": "webhook",
    "config": {
      "webhook_url": "https://your-service.com/hooks/solace",
      "secret": "optional-shared-secret",
      "headers": { "X-Custom-Header": "value" }
    }
  }'

API Endpoints

Authentication

Method Endpoint Description
POST /api/v1/auth/login Login with username/password, returns JWT
GET /api/v1/auth/me Get current user profile
POST /api/v1/auth/change-password Change password

Users (Admin only)

Method Endpoint Description
GET /api/v1/users List users
POST /api/v1/users Create user
PUT /api/v1/users/{id} Update user profile/role
POST /api/v1/users/{id}/reset-password Reset user password
DELETE /api/v1/users/{id} Deactivate user

Health

Method Endpoint Description
GET /health Liveness check
GET /health/ready Readiness check (DB + Redis)

Webhooks (Alert Ingestion)

Method Endpoint Description
POST /api/v1/webhooks/generic Generic webhook
POST /api/v1/webhooks/prometheus Prometheus Alertmanager
POST /api/v1/webhooks/grafana Grafana unified alerting
POST /api/v1/webhooks/datadog Datadog monitor webhook
POST /api/v1/webhooks/splunk Splunk saved search webhook
POST /api/v1/webhooks/email_ingest Email-based alert ingestion

Alerts

Method Endpoint Description
GET /api/v1/alerts List alerts (filterable, sortable, paginated)
GET /api/v1/alerts/{id} Get alert by ID
POST /api/v1/alerts/{id}/acknowledge Acknowledge alert
POST /api/v1/alerts/{id}/resolve Resolve alert
PUT /api/v1/alerts/{id}/tags Replace all tags
POST /api/v1/alerts/{id}/tags/{tag} Add a single tag
DELETE /api/v1/alerts/{id}/tags/{tag} Remove a tag
GET /api/v1/alerts/{id}/notes List investigation notes
POST /api/v1/alerts/{id}/notes Add a note
PUT /api/v1/alerts/notes/{id} Edit a note
DELETE /api/v1/alerts/notes/{id} Delete a note
GET /api/v1/alerts/{id}/history Get occurrence timeline
PUT /api/v1/alerts/{id}/ticket Set external ticket URL
PUT /api/v1/alerts/{id}/runbook Set runbook URL (optionally create rule)
POST /api/v1/alerts/bulk/acknowledge Bulk acknowledge
POST /api/v1/alerts/bulk/resolve Bulk resolve
POST /api/v1/alerts/archive Archive old resolved alerts

Incidents

Method Endpoint Description
GET /api/v1/incidents List incidents (filterable, sortable, paginated)
GET /api/v1/incidents/{id} Get incident with alerts + event timeline
POST /api/v1/incidents/{id}/acknowledge Acknowledge incident + all alerts
POST /api/v1/incidents/{id}/resolve Resolve incident + all alerts

Silences (Maintenance Windows)

Method Endpoint Description
GET /api/v1/silences List silence windows (filterable by state)
POST /api/v1/silences Create silence window
GET /api/v1/silences/{id} Get silence window
PUT /api/v1/silences/{id} Update silence window
DELETE /api/v1/silences/{id} Delete silence window

Notification Channels

Method Endpoint Description
GET /api/v1/notifications/channels List all channels
POST /api/v1/notifications/channels Create channel (slack/teams/email/webhook/pagerduty)
GET /api/v1/notifications/channels/{id} Get channel
PUT /api/v1/notifications/channels/{id} Update channel
DELETE /api/v1/notifications/channels/{id} Delete channel
POST /api/v1/notifications/channels/{id}/test Send test notification
GET /api/v1/notifications/logs List delivery logs

On-Call Schedules

Method Endpoint Description
GET /api/v1/oncall/schedules List schedules (paginated, active_only filter)
POST /api/v1/oncall/schedules Create schedule (admin)
GET /api/v1/oncall/schedules/{id} Get schedule
PUT /api/v1/oncall/schedules/{id} Update schedule (admin)
DELETE /api/v1/oncall/schedules/{id} Delete schedule (admin)
GET /api/v1/oncall/schedules/{id}/current Get who is currently on call
POST /api/v1/oncall/schedules/{id}/overrides Create temporary override (admin)
DELETE /api/v1/oncall/overrides/{id} Delete override (admin)

Escalation Policies

Method Endpoint Description
GET /api/v1/oncall/policies List escalation policies
POST /api/v1/oncall/policies Create policy (admin)
GET /api/v1/oncall/policies/{id} Get policy
PUT /api/v1/oncall/policies/{id} Update policy (admin)
DELETE /api/v1/oncall/policies/{id} Delete policy (admin)

Service Mappings

Method Endpoint Description
GET /api/v1/oncall/mappings List service-to-policy mappings
POST /api/v1/oncall/mappings Create mapping (admin)
DELETE /api/v1/oncall/mappings/{id} Delete mapping (admin)

Runbook Rules

Method Endpoint Description
GET /api/v1/runbooks/rules List runbook rules
POST /api/v1/runbooks/rules Create rule (admin)
PUT /api/v1/runbooks/rules/{id} Update rule (admin)
DELETE /api/v1/runbooks/rules/{id} Delete rule (admin)

Stats & Settings

Method Endpoint Description
GET /api/v1/stats Dashboard statistics (counts, MTTA, MTTR)
GET /api/v1/stats/trends Time-series analytics (alert volume, MTTA/MTTR daily, top services)
GET /api/v1/settings Application configuration (includes alert TTL)
PUT /api/v1/settings/alert-ttl Update alert auto-expire TTL (admin only)

Heartbeats

Method Endpoint Description
GET /api/v1/heartbeats List all heartbeats
POST /api/v1/heartbeats Create heartbeat (admin only)
PUT /api/v1/heartbeats/{id} Update heartbeat (admin only)
DELETE /api/v1/heartbeats/{id} Delete heartbeat (admin only)
POST /api/v1/heartbeats/{slug}/ping Record dead-man check-in (API key auth)

WebSocket

Endpoint Description
GET /api/v1/ws?token={jwt_or_api_key} Real-time event stream

Configuration

All settings are configurable via environment variables:

| Variable | Default | Description |
|---|---|---|
| DATABASE_URL | postgresql+asyncpg://solace:solace@localhost:5432/solace | PostgreSQL connection |
| REDIS_URL | redis://localhost:6379/0 | Redis connection |
| API_KEY | "" | API key for webhook ingestion (empty = no auth in dev) |
| SECRET_KEY | change-me-to-a-random-secret-key | Secret for JWT signing |
| ADMIN_USERNAME | admin | Default admin username (created on first startup) |
| ADMIN_PASSWORD | admin | Default admin password |
| ADMIN_EMAIL | admin@solace.local | Default admin email |
| JWT_EXPIRE_MINUTES | 480 | JWT token expiry (8 hours) |
| DEDUP_WINDOW_SECONDS | 300 | Window for deduplicating identical alerts (5 min) |
| CORRELATION_WINDOW_SECONDS | 600 | Window for correlating alerts into incidents (10 min) |
| NOTIFICATION_COOLDOWN_SECONDS | 300 | Per-channel, per-incident notification cooldown (5 min) |
| SOLACE_DASHBOARD_URL | http://localhost:3000 | Dashboard URL (used in notification links) |
| APP_ENV | development | Environment (development / production) |
| LOG_LEVEL | INFO | Logging level |
| ALERT_TTL_SECONDS | 86400 | Auto-expire firing alerts after N seconds (0 = disabled, default 24h) |
| ALERT_EXPIRE_CHECK_INTERVAL_SECONDS | 60 | How often to check for expired alerts |
| HEARTBEAT_CHECK_INTERVAL_SECONDS | 30 | How often to run heartbeat monitoring checks |
| SMTP_HOST | "" | SMTP server for email notifications |
| SMTP_PORT | 587 | SMTP port |
| SMTP_USER | "" | SMTP username |
| SMTP_PASSWORD | "" | SMTP password |
| SMTP_USE_TLS | true | Enable STARTTLS |
| SMTP_FROM_ADDRESS | solace@localhost | Sender address for email notifications |

Tech Stack

Backend: Python 3.12+, FastAPI, async SQLAlchemy (asyncpg), Alembic, PostgreSQL, Redis, python-jose (JWT), passlib (bcrypt)

Frontend: React 18, TypeScript, Vite, Tailwind CSS, Zustand

Deployment: Docker Compose, Kubernetes-ready health probes

Development

Run tests

pip install -e ".[dev]"
pytest tests/ -v

Lint

ruff check backend/

Local development (without Docker)

# Start PostgreSQL and Redis
# Create database: CREATE DATABASE solace;
# Run migrations
alembic upgrade head
# Start API server
uvicorn backend.main:app --reload --port 8000
# Start frontend (separate terminal)
cd frontend && npm install && npm run dev

Roadmap

Completed

  • Multi-source webhook ingestion (Generic, Prometheus, Grafana, Datadog, Splunk, Email)
  • Fingerprint-based deduplication with configurable window
  • Service-based automatic incident correlation
  • Full alert lifecycle (firing, acknowledged, resolved, suppressed, archived)
  • Incident management with event audit trail
  • Notification channels: Slack, Microsoft Teams, Email, Webhook (outbound), PagerDuty
  • Notification filters, rate limiting, delivery logs, and test button
  • Silence / maintenance windows with flexible matchers
  • Alert tagging and investigation notes
  • External ticket URL linking (Jira, GitHub, etc.)
  • Runbook URL support with editable UI and auto-attach rules
  • Bulk acknowledge/resolve operations
  • Archive old resolved alerts
  • Dashboard stats (MTTA, MTTR, counts by status/severity)
  • Real-time WebSocket updates with fallback polling
  • Keyboard shortcuts for fast navigation
  • Light and dark theme toggle
  • JWT authentication with default admin account
  • Role-based access control (admin, user, viewer)
  • User management (create, edit, deactivate)
  • On-call scheduling (hourly/daily/weekly/custom rotations)
  • Temporary on-call overrides
  • Escalation policies with multi-level targets
  • Service-to-policy mapping with glob patterns and priority ordering
  • Alert auto-expire with configurable TTL (admin-controlled)
  • Analytics dashboard with time-series trends (alert volume, MTTA/MTTR, top services)
  • Heartbeat / dead-man monitoring with HTTP health checks

Next Up

  • Background escalation checker (auto-escalate if not ack'd in N minutes)
  • SSO integration (Google, GitHub, SAML)
  • SMS and voice call notifications (Twilio)
  • Status pages (public incident status)
  • Plugin system (custom normalizers, notification channels, enrichment hooks)
  • Alert pattern detection and noise scoring
  • Post-incident review and retrospectives
  • Topology-aware correlation (service dependency graph)

License

Apache 2.0
