feat(observability): service-state OTel metrics, PrometheusRule, Grafana dashboard#840

Open

QuentinBisson wants to merge 2 commits into

main from

worktree-feat+otel-service-state-metrics



@QuentinBisson QuentinBisson commented Jun 12, 2026


Contributor

Closes #401

What changed

Go: service lifecycle metrics (`internal/orchestrator/metrics.go`)

Two new OTel instruments emitted from the orchestrator's state-change hook:

| Metric | Type | Labels |
| --- | --- | --- |
| `muster.service.state_transitions_total` | Counter | `service_name`, `service_type`, `from_state`, `to_state` |
| `muster.service.up` | Observable Gauge | `service_name`, `service_type` |

`muster.service.up` is 1 when a service is in the `running` or `connected` state and 0 otherwise, which makes it the standard "is it up?" signal for alerting.

Both use otel.Meter(observability.TracerName), the same scope as the existing muster.tool_calls counter in internal/aggregator/metrics.go. The global MeterProvider is already initialised in cmd/serve.go via mcp-toolkit, so no additional bootstrap is needed.
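To make the semantics of the two instruments concrete, here is a minimal stdlib-only sketch of the bookkeeping behind them. It deliberately does not use the OTel SDK and is not the code in `internal/orchestrator/metrics.go`. The type and function names (`stateRecorder`, `onStateChange`, `observeUp`, `upValue`) and the states `starting` and `failed` are assumptions for illustration. Only `running` and `connected` come from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// transitionKey mirrors the four labels on muster.service.state_transitions_total.
type transitionKey struct {
	ServiceName, ServiceType, FromState, ToState string
}

// serviceKey mirrors the two labels on muster.service.up.
type serviceKey struct {
	ServiceName, ServiceType string
}

// stateRecorder is a hypothetical in-memory stand-in for the two instruments.
type stateRecorder struct {
	mu          sync.Mutex
	transitions map[transitionKey]int64
	current     map[serviceKey]string
}

func newStateRecorder() *stateRecorder {
	return &stateRecorder{
		transitions: make(map[transitionKey]int64),
		current:     make(map[serviceKey]string),
	}
}

// onStateChange is what a state-change hook would call: it increments the
// counter for this label set and remembers the latest state for the gauge.
func (r *stateRecorder) onStateChange(name, typ, from, to string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.transitions[transitionKey{name, typ, from, to}]++
	r.current[serviceKey{name, typ}] = to
}

// upValue maps a state to the gauge value: 1 for running or connected, else 0.
func upValue(state string) int64 {
	switch state {
	case "running", "connected":
		return 1
	default:
		return 0
	}
}

// observeUp returns the values a gauge callback would report on each
// collection cycle, one per known service.
func (r *stateRecorder) observeUp() map[serviceKey]int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make(map[serviceKey]int64, len(r.current))
	for k, state := range r.current {
		out[k] = upValue(state)
	}
	return out
}

func main() {
	r := newStateRecorder()
	r.onStateChange("github", "MCPServer", "starting", "running")
	r.onStateChange("github", "MCPServer", "running", "failed")
	fmt.Println(r.observeUp()[serviceKey{"github", "MCPServer"}])
}
```

In a real OTel setup the counter increment and the gauge callback would go through the meter obtained from `otel.Meter(observability.TracerName)`. The sketch only shows which label sets are counted and how the state maps to 0 or 1.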

Helm: PrometheusRule (`helm/muster/templates/prometheusrule.yaml`)

Three alerting rules (guarded by prometheusExporterEnabled, same as ServiceMonitor):

- `MusterServiceDown`: fires once `muster_service_up == 0` has held for 5 min
- `MusterServiceFlapping`: fires immediately on more than 4 state transitions within 10 min
- `MusterHighToolErrorRate`: fires once the tool error ratio has stayed above 10 % for 5 min

Supports observability.giantswarm.io/tenant label for multi-tenant Mimir.

Enable with:

muster:
  observability:
    metrics:
      exporter: prometheus
      prometheus:
        prometheusRule:
          enabled: true
          tenant: giantswarm # optional

Helm: Grafana dashboard (`helm/muster/templates/grafana-dashboard.yaml`)

ConfigMap carrying a ready-to-use dashboard, picked up by the standard grafana-sidecar. Panels:

- Service status stat (up/down per service)
- State transitions timeseries
- Tool call rate by tool + outcome
- Tool call latency p50/p99
- Tool error rate per tool

Enable with:

muster:
  observability:
    metrics:
      grafanaDashboard:
        enabled: true

What was already present (not changed)

- `muster.tool_calls` counter + `muster.tool_call.duration` histogram in `internal/aggregator/metrics.go`
- mcp-go OTel tracing hooks (`internal/aggregator/server_options.go`)
- `cmd/serve.go` tracing + metrics init via mcp-toolkit
- ServiceMonitor Helm template and deployment OTel env vars
- `values.yaml` observability section

QuentinBisson and others added 2 commits

June 12, 2026 14:15

@QuentinBisson


 feat(observability): add service-state OTel metrics and Helm alerting...

31db95a

.../dashboard templates
Closes #401
Three additions on top of the tool-call metrics already present on main:
1. internal/orchestrator/metrics.go
 - muster.service.state_transitions_total (counter): incremented on every
 MCPServer/aggregator state transition; labels: service_name, service_type,
 from_state, to_state.
 - muster.service.up (observable gauge): polled each collection cycle; 1 when
 a service is running/connected, 0 otherwise; labels: service_name, service_type.
 Both instruments use otel.Meter(observability.TracerName) — the same scope as
 the existing tool-call counter in internal/aggregator/metrics.go — so they share
 the global MeterProvider already initialised by cmd/serve.go via mcp-toolkit.
2. helm/muster/templates/prometheusrule.yaml
 PrometheusRule with three rules:
 - MusterServiceDown (5 m for): muster_service_up == 0
 - MusterServiceFlapping (instant): > 4 transitions in 10 m
 - MusterHighToolErrorRate (5 m for): error ratio > 10 % per tool
 Guarded by the same prometheusExporterEnabled helper as the ServiceMonitor.
 Supports tenant label for multi-tenant Mimir deployments.
3. helm/muster/templates/grafana-dashboard.yaml
 ConfigMap carrying a ready-to-use Grafana dashboard (picked up automatically
 by the grafana-sidecar). Panels: service status stat, state transitions
 timeseries, tool call rate, p50/p99 latency, error rate per tool.
 Namespace/pod template variables for multi-instance filtering.

@QuentinBisson


 Merge branch 'main' into worktree-feat+otel-service-state-metrics

b6342e9

@QuentinBisson QuentinBisson marked this pull request as ready for review

June 12, 2026 12:25

@QuentinBisson QuentinBisson requested a review from a team as a code owner

June 12, 2026 12:25
