Alert pipeline: silencing, inhibition, dedup, grouping, persistence #255

Merged: mostafa merged 10 commits into main from feat/alert-pipeline on Jun 26, 2026


@mostafa mostafa commented Jun 26, 2026

Summary

Adds an optional post-engine alert-processing stage to the daemon sink path, between enrichment and the sinks, modeled on the Alertmanager processing pipeline. Enabled with --alert-pipeline <path> (or the daemon.alert_pipeline config key) and hot-reloaded on SIGHUP, file-watcher changes, and POST /api/v1/reload; a failed reload keeps the previous pipeline active. The stage is strictly post-engine (it consumes and emits EvaluationResults), so the evaluation hot path is untouched.

The pipeline runs four stages in Alertmanager order (inhibition, silencing, deduplication, grouping):

  • Inhibition mutes a target while a matching source is active (source_match, target_match, equal, duration). It carries a self-inhibition guard and is non-transitive.
  • Silencing mutes results matching operator-defined matchers (=, !=, =~, !~; regexes are anchored) for a time window. Static silences come from the config and are re-seeded on reload; dynamic silences are created via POST /api/v1/silences and bounded by max_silences.
  • Deduplication folds repeats of a configurable fingerprint into one active alert with an active -> resolved lifecycle: the first fire passes through, repeats fold in, the alert re-emits on repeat_interval, and a final resolved record is emitted after resolve_timeout. The active-alert store is bounded by dedup.max_active_alerts.
  • Grouping assigns survivors to incidents and annotates each pass-through result with incident_id. group_by (default) uses deterministic ids stable across restarts; an opt-in entity_graph union-find merges incidents sharing an entity value, guarded against the giant-component failure by a stop_values list and a per-value max_value_cardinality ceiling. Incidents emit on group_wait, group_interval, and repeat_interval, with per-incident caps.
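
The stage order can be pictured with a small sketch. This is illustrative only and is not the rsigma_runtime::alert_pipeline API: the `Outcome` and `Verdicts` types and the `process` function are hypothetical, and the per-stage verdicts are assumed to be precomputed so that only the ordering is shown.

```rust
/// Hypothetical outcome of running one result through the pipeline.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Inhibited,
    Silenced,
    Folded,
    Emitted { incident_id: String },
}

/// Hypothetical per-stage verdicts, precomputed to keep the sketch short.
pub struct Verdicts {
    pub inhibited: bool,
    pub silenced: bool,
    pub duplicate: bool,
    pub incident_id: String,
}

/// Applies the stages in the order stated in the summary:
/// inhibition, silencing, deduplication, grouping.
pub fn process(v: &Verdicts) -> Outcome {
    if v.inhibited {
        return Outcome::Inhibited; // muted while a matching source is active
    }
    if v.silenced {
        return Outcome::Silenced; // muted by an operator-defined silence
    }
    if v.duplicate {
        return Outcome::Folded; // folded into an already active alert
    }
    // Survivors are annotated with their incident id and passed through.
    Outcome::Emitted { incident_id: v.incident_id.clone() }
}
```

A result that is both inhibited and silenced is accounted to inhibition in this sketch, because that check runs first.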

State (active dedup alerts, open incidents, dynamic silences, inhibition active-source index) persists to the existing SQLite store in its own table when --state-db is set, with window-aware pruning on restore and a versioned snapshot.

New HTTP endpoints: GET /api/v1/incidents, GET/POST /api/v1/silences, DELETE /api/v1/silences/{id}. Incidents are delivered via an additive Sink::send_incident across stdout/file/NATS (with an optional nats_subject override). Thirteen Prometheus metrics cover all four stages.

The Scope filter was lifted into a shared crate-level rsigma_runtime::scope module (re-exported from enrichment, so existing paths are unchanged).

Test plan

  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo test -p rsigma-runtime --all-features (unit, golden wire-shape, snapshot round-trip)
  • cargo test -p rsigma --test cli_daemon_alert_pipeline (dedup, grouping, silencing, inhibition, persistence across restart)
  • mkdocs build --strict
  • Sigma corpus regression on the release binary

mostafa added 10 commits June 26, 2026 15:32
Move the Scope filter out of the enrichment module into a crate-level
rsigma_runtime::scope module so post-engine stages can share one
implementation. The enrichment module keeps a re-export, so both
rsigma_runtime::Scope and rsigma_runtime::enrichment::Scope are
unchanged. Behavior-neutral; pinned by the existing enrichment and
scope tests.

Add an optional post-engine alert-processing layer in rsigma-runtime
(rsigma_runtime::alert_pipeline) and wire it into the daemon sink path
between enrichment and the sinks, configured with --alert-pipeline (or
the daemon.alert_pipeline config key) and hot-reloaded like --enrichers.
The dedup stage fingerprints each in-scope result by the rule identity
plus configured field selectors and keeps one active alert per
fingerprint: the first fire passes through, duplicates fold in, the
alert re-emits on repeat_interval carrying the accumulated count, and
resolves after resolve_timeout with no further fires. Summary records
ride the existing NDJSON wire shape via a dedup_state enrichment key.
scope restricts which results the layer acts on; strip_event reads the
event for selector resolution then drops raw payloads before delivery.
Adds five Prometheus metrics, a fuzz target over the config and
selector grammar, unit/golden/E2E tests, and docs.
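
The dedup lifecycle described above can be sketched as a small state machine. This is a minimal sketch under stated assumptions, not the code in this PR: `Dedup`, `DedupEvent`, `observe`, and `tick` are hypothetical names, timestamps are plain seconds, the fingerprint is taken as an already computed string, and the repeat_interval check is done when a duplicate arrives (the real stage may re-emit from a timer tick instead).

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum DedupEvent {
    FirstFire,             // first result for this fingerprint, passes through
    Folded,                // duplicate folded into the active alert
    Repeat { count: u64 }, // re-emitted on repeat_interval with the accumulated count
}

struct ActiveAlert {
    last_seen: u64,
    last_emit: u64,
    count: u64,
}

pub struct Dedup {
    repeat_interval: u64,
    resolve_timeout: u64,
    active: HashMap<String, ActiveAlert>,
}

impl Dedup {
    pub fn new(repeat_interval: u64, resolve_timeout: u64) -> Self {
        Dedup { repeat_interval, resolve_timeout, active: HashMap::new() }
    }

    /// Records one fire of `fingerprint` at time `now` (seconds).
    pub fn observe(&mut self, fingerprint: &str, now: u64) -> DedupEvent {
        if let Some(alert) = self.active.get_mut(fingerprint) {
            alert.count += 1;
            alert.last_seen = now;
            if now.saturating_sub(alert.last_emit) >= self.repeat_interval {
                alert.last_emit = now;
                return DedupEvent::Repeat { count: alert.count };
            }
            return DedupEvent::Folded;
        }
        let alert = ActiveAlert { last_seen: now, last_emit: now, count: 1 };
        self.active.insert(fingerprint.to_string(), alert);
        DedupEvent::FirstFire
    }

    /// Resolves alerts with no fires for `resolve_timeout`.
    /// Returns the (fingerprint, final count) pairs that were resolved.
    pub fn tick(&mut self, now: u64) -> Vec<(String, u64)> {
        let timeout = self.resolve_timeout;
        let mut resolved = Vec::new();
        self.active.retain(|fp, alert| {
            if now.saturating_sub(alert.last_seen) >= timeout {
                resolved.push((fp.clone(), alert.count));
                false
            } else {
                true
            }
        });
        resolved
    }
}
```

After an alert resolves, the next fire for the same fingerprint is treated as a new first fire.
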
Add a grouping stage to the alert pipeline, after dedup, that collapses
survivors into incidents and emits a higher-level IncidentResult on the
Alertmanager timers. Each pass-through survivor is annotated with its
incident_id in enrichments and flows on immediately; the incident itself
is emitted by the sink-task tick.
Two modes: group_by (default) groups by equality on a selector list with
a deterministic incident id stable across restarts; an opt-in
entity_graph union-find merges incidents sharing an entity value, guarded
against the giant-component failure by a stop_values list and a per-value
max_value_cardinality ceiling. Incidents emit on group_wait,
group_interval, and repeat_interval, and resolve after resolve_timeout;
include refs or full (event-stripped) results, bounded by per-incident
caps.
IncidentResult is one flat NDJSON object disambiguated by an incident_id
key, delivered via an additive Sink::send_incident across stdout/file/NATS
(with an optional nats_subject override) routed through the dispatcher;
OTLP and webhook sinks no-op. Open incidents are readable at
GET /api/v1/incidents. Adds four incident metrics, a criterion bench,
unit/golden/E2E tests, and docs.
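
The entity_graph mode and its two guards can be illustrated with a compact union-find. This is a sketch under assumptions, not the PR's implementation: `EntityGraph` and `add` are hypothetical names, every added result starts as its own node, the root index stands in for the incident, and max_value_cardinality is interpreted here as "a value seen more than N times stops joining incidents".

```rust
use std::collections::{HashMap, HashSet};

pub struct EntityGraph {
    parent: Vec<usize>,                   // union-find forest over result nodes
    by_value: HashMap<String, usize>,     // entity value -> first node carrying it
    value_counts: HashMap<String, usize>, // how often each value has been seen
    stop_values: HashSet<String>,         // values that never join incidents
    max_value_cardinality: usize,
}

impl EntityGraph {
    pub fn new(stop_values: &[&str], max_value_cardinality: usize) -> Self {
        EntityGraph {
            parent: Vec::new(),
            by_value: HashMap::new(),
            value_counts: HashMap::new(),
            stop_values: stop_values.iter().map(|s| s.to_string()).collect(),
            max_value_cardinality,
        }
    }

    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent; // path halving
            x = grandparent;
        }
        x
    }

    /// Adds a result with its entity values and returns the root node
    /// (the incident) it ends up in.
    pub fn add(&mut self, values: &[&str]) -> usize {
        let id = self.parent.len();
        self.parent.push(id);
        for v in values {
            if self.stop_values.contains(*v) {
                continue; // guard 1: stop list
            }
            let count = self.value_counts.entry(v.to_string()).or_insert(0);
            *count += 1;
            if *count > self.max_value_cardinality {
                continue; // guard 2: value is too common to be a join key
            }
            match self.by_value.get(*v).copied() {
                Some(other) => {
                    let (a, b) = (self.find(id), self.find(other));
                    if a != b {
                        // Keep the older node as the root so the incident is stable.
                        self.parent[a.max(b)] = a.min(b);
                    }
                }
                None => {
                    self.by_value.insert(v.to_string(), id);
                }
            }
        }
        self.find(id)
    }
}
```

Without the two guards, one ubiquitous value (a loopback address, a shared service account) would connect every result into a single incident, which is the giant-component failure the commit message refers to.
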
Add a silencing stage to the alert pipeline, ahead of dedup, that mutes
results matching operator-defined matchers. A muted result is acked and
dropped before dedup, so it neither emits nor opens an incident.
A matcher is selector <op> value over the field-selector namespace with
the =, !=, =~, !~ operators (regex anchored), ANDed into a set; the
matcher engine is shared with the forthcoming inhibition stage. Silences
carry an optional RFC 3339 time window with a derived
pending/active/expired state and one of two origins: static silences
declared under silences: in the config (re-seeded on hot-reload) and api
silences created at runtime. Expired silences are garbage-collected on
the tick.
New endpoints GET/POST /api/v1/silences and DELETE /api/v1/silences/{id}
manage silences, with two metrics (rsigma_silenced_total,
rsigma_silences_active). The sink task's mutable stores (dedup, incidents,
silences) are unified behind one AlertPipelineState shared via RwLock with
the admin API. Adds unit/E2E tests and docs.
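
A silence's derived state and its ANDed matcher set can be sketched as follows. The types here (`Silence`, `Matcher`, `Op`, `SilenceState`) are hypothetical, not the PR's definitions. As simplifying assumptions, the time window uses Unix seconds instead of RFC 3339 strings, only the = and != operators are shown (the regex operators would need a regex library), and a missing field is treated as the empty string.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum SilenceState {
    Pending,
    Active,
    Expired,
}

pub enum Op {
    Eq,
    NotEq,
}

pub struct Matcher {
    pub selector: String,
    pub op: Op,
    pub value: String,
}

pub struct Silence {
    pub matchers: Vec<Matcher>,
    pub starts_at: Option<u64>, // None: active immediately
    pub ends_at: Option<u64>,   // None: never expires
}

impl Silence {
    /// Derives the state from the optional time window.
    pub fn state(&self, now: u64) -> SilenceState {
        if matches!(self.starts_at, Some(s) if now < s) {
            return SilenceState::Pending;
        }
        if matches!(self.ends_at, Some(e) if now >= e) {
            return SilenceState::Expired;
        }
        SilenceState::Active
    }

    /// A result is muted only while the silence is active and
    /// every matcher in the set holds (logical AND).
    pub fn mutes(&self, fields: &HashMap<String, String>, now: u64) -> bool {
        self.state(now) == SilenceState::Active
            && self.matchers.iter().all(|m| {
                let actual = fields.get(&m.selector).map(String::as_str).unwrap_or("");
                match m.op {
                    Op::Eq => actual == m.value,
                    Op::NotEq => actual != m.value,
                }
            })
    }
}
```

Because the state is derived from the window on each call, a tick-driven garbage collector only needs to drop silences whose state is Expired.
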
Add an inhibition stage that mutes a target result while a matching
source is active, modeled on Alertmanager inhibit_rules. Each rule is
{ source_match, target_match, equal, duration } reusing the matcher
engine: while a result matching source_match has been seen within
duration, any result matching target_match that shares the same equal
selector values is muted.
Encodes the two Alertmanager behaviors via evaluation order: a silenced
source still inhibits its targets (the active-source index is updated
from every non-inhibited result before silencing), while an inhibited
target does not become a source (it is dropped before the index update).
Carries the self-inhibition guard (a result matching both sides does not
inhibit itself) and is non-transitive.
Inhibition is config-driven and hot-reloads with the pipeline; the
active-source index lives in the shared AlertPipelineState and is
GC'd on the tick. Two metrics (rsigma_inhibited_total{rule},
rsigma_inhibit_sources_active). Adds unit/E2E tests and docs.
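
The evaluation order described above can be sketched for a single rule. This is a hedged illustration rather than the PR's code: `InhibitRule`, `SourceIndex`, and `inhibit` are hypothetical, each side of the rule is reduced to one selector = value pair, each result carries an assumed string `id`, and times are plain seconds.

```rust
use std::collections::HashMap;

pub type Fields = HashMap<String, String>;

pub struct InhibitRule {
    pub source_match: (String, String), // one selector = value pair, for brevity
    pub target_match: (String, String),
    pub equal: Vec<String>,
    pub duration: u64,
}

/// Active-source index: equal-values key -> (source id -> last seen).
#[derive(Default)]
pub struct SourceIndex {
    sources: HashMap<Vec<String>, HashMap<String, u64>>,
}

fn matches(m: &(String, String), f: &Fields) -> bool {
    f.get(&m.0) == Some(&m.1)
}

fn equal_key(rule: &InhibitRule, f: &Fields) -> Vec<String> {
    rule.equal.iter().map(|s| f.get(s).cloned().unwrap_or_default()).collect()
}

/// Returns true when the result is muted by `rule`.
pub fn inhibit(rule: &InhibitRule, index: &mut SourceIndex, id: &str, f: &Fields, now: u64) -> bool {
    let key = equal_key(rule, f);
    if matches(&rule.target_match, f) {
        let muted = index.sources.get(&key).map_or(false, |by_id| {
            by_id.iter().any(|(src, &seen)| {
                // Self-inhibition guard: a result never counts as its own source.
                src.as_str() != id && now.saturating_sub(seen) <= rule.duration
            })
        });
        if muted {
            // Dropped before the index update, so an inhibited
            // target never becomes a source.
            return true;
        }
    }
    if matches(&rule.source_match, f) {
        index.sources.entry(key).or_default().insert(id.to_string(), now);
    }
    false
}
```

In this sketch a caller would run silencing only after `inhibit` returns false, so a source that is later silenced has already been recorded in the index and still inhibits its targets.
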
Add a versioned AlertPipelineSnapshot (active dedup alerts, open
incidents, dynamic silences, and the inhibition active-source index)
saved to the existing SQLite store in its own rsigma_alert_pipeline_state
table, on the periodic and shutdown hooks beside the correlation
snapshot, and restored on boot.
Restore is window-aware: dedup alerts past resolve_timeout, incidents
past their resolve_timeout, silences past ends_at, and inhibition sources
past their rule's duration are pruned on load. Deterministic group_by
incident ids survive the restart; a snapshot version mismatch starts
fresh with a warning. --clear-state skips the restore and --keep-state
forces it, matching the correlation-state flags.
To make the state serializable without an EvaluationResult Deserialize,
retained samples (dedup) and embedded results (incidents) are stored as
serialized JSON values, and dedup summary lines are delivered through the
existing raw incident dispatch path. The sink task's stores are restored
into the shared AlertPipelineState. Adds runtime snapshot round-trip,
SQLite round-trip, and an end-to-end restart test, plus docs.
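
Window-aware restore with a version check can be sketched like this. The `Snapshot` layout below is a hypothetical reduction (two of the four stores, keyed by string, with timestamps in seconds) and is not the AlertPipelineSnapshot format; `SNAPSHOT_VERSION` and `restore` are likewise illustrative names.

```rust
pub const SNAPSHOT_VERSION: u32 = 1;

#[derive(Debug, Default, PartialEq)]
pub struct Snapshot {
    pub version: u32,
    /// (fingerprint, last_seen) for active dedup alerts.
    pub dedup_last_seen: Vec<(String, u64)>,
    /// (silence id, ends_at) for dynamic silences.
    pub silences_ends_at: Vec<(String, u64)>,
}

/// Restores a snapshot at boot time `now`, pruning entries whose
/// window has already elapsed while the daemon was down.
pub fn restore(snap: Snapshot, now: u64, resolve_timeout: u64) -> Snapshot {
    if snap.version != SNAPSHOT_VERSION {
        // A version mismatch starts fresh with a warning.
        eprintln!("warning: snapshot version {} unsupported, starting fresh", snap.version);
        return Snapshot { version: SNAPSHOT_VERSION, ..Default::default() };
    }
    Snapshot {
        version: snap.version,
        // Dedup alerts past resolve_timeout are dropped on load.
        dedup_last_seen: snap
            .dedup_last_seen
            .into_iter()
            .filter(|(_, seen)| now.saturating_sub(*seen) < resolve_timeout)
            .collect(),
        // Silences past ends_at are dropped on load.
        silences_ends_at: snap
            .silences_ends_at
            .into_iter()
            .filter(|(_, ends)| now < *ends)
            .collect(),
    }
}
```

Pruning on load means a long outage cannot resurrect alerts or silences that would already have expired had the daemon kept running.
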
The entity-graph cardinality counters (value_counts) were inserted on
every joinable entity value but never pruned, so the map grew unbounded
with distinct entity values over the daemon's lifetime. Free a resolved
incident's counters alongside its entity-index entries on eviction.
Also enforce max_results_per_incident when merging two incidents so a
survivor's refs/results cannot exceed the configured per-incident cap.

Add two ceilings to close the remaining unbounded-growth paths in the
alert pipeline:
- dedup.max_active_alerts (default 100000): once the active-alert store
 is full, a first-fire for a new fingerprint passes through un-deduped
 instead of opening another alert, so a high-cardinality fingerprint
 cannot exhaust memory between resolve windows. The store-entries gauge
 plateauing at the cap signals saturation.
- max_silences (default 1000): cap concurrently-tracked dynamic (API)
 silences; POST /api/v1/silences returns 429 past the cap. Static
 silences from the config do not count against it.
Also downgrade the per-result incident_id collision log from warn to
debug so an upstream enricher setting incident_id on every result cannot
flood the log.
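
The two ceilings behave differently at saturation, which a short sketch makes concrete. `BoundedStore`, `Admission`, and `create_silence_status` are hypothetical names for illustration. The 429 status comes from the commit message; the 201 returned on success is an assumption.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum Admission {
    Opened,               // new active alert opened
    Duplicate,            // fingerprint already active, fold as usual
    PassThroughUndeduped, // store full: emit the result without tracking it
}

pub struct BoundedStore {
    max_active_alerts: usize,
    active: HashSet<String>,
}

impl BoundedStore {
    pub fn new(max_active_alerts: usize) -> Self {
        BoundedStore { max_active_alerts, active: HashSet::new() }
    }

    pub fn admit(&mut self, fingerprint: &str) -> Admission {
        if self.active.contains(fingerprint) {
            return Admission::Duplicate;
        }
        if self.active.len() >= self.max_active_alerts {
            // Fail open: the alert is still delivered, only dedup is skipped.
            return Admission::PassThroughUndeduped;
        }
        self.active.insert(fingerprint.to_string());
        Admission::Opened
    }
}

/// Status code for creating a dynamic silence. Static silences from the
/// config are not counted in `dynamic_count`.
pub fn create_silence_status(dynamic_count: usize, max_silences: usize) -> u16 {
    if dynamic_count >= max_silences {
        429 // past the cap
    } else {
        201 // assumed success status
    }
}
```

The dedup ceiling fails open so that no alert is lost under high cardinality, whereas the silence ceiling rejects the request, which is the safer direction for an operator-driven API.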
@mostafa mostafa merged commit c57ddcb into main Jun 26, 2026
18 checks passed
@mostafa mostafa deleted the feat/alert-pipeline branch June 26, 2026 16:27