feat(observability): service-state OTel metrics, PrometheusRule, Grafana dashboard #840

Open: QuentinBisson wants to merge 2 commits into main from worktree-feat+otel-service-state-metrics

@QuentinBisson QuentinBisson commented Jun 12, 2026

Closes #401

What changed

Go: service lifecycle metrics (internal/orchestrator/metrics.go)

Two new OTel instruments emitted from the orchestrator's state-change hook:

| Metric | Type | Labels |
| --- | --- | --- |
| muster.service.state_transitions_total | Counter | service_name, service_type, from_state, to_state |
| muster.service.up | Observable Gauge | service_name, service_type |

muster.service.up is 1 when a service is in the running or connected state and 0 otherwise, which makes it the standard "is it up?" signal for alerting.

Both use otel.Meter(observability.TracerName), the same scope as the existing muster.tool_calls counter in internal/aggregator/metrics.go. The global MeterProvider is already initialised in cmd/serve.go via mcp-toolkit, so no additional bootstrap is needed.
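
As a rough illustration of the behaviour described above, here is a stdlib-only Go sketch in which a map stands in for the counter and the Up method stands in for the observable-gauge callback. It deliberately does not use the OTel API and is not the code in internal/orchestrator/metrics.go. The names stateRecorder, OnStateChange, Up and Transitions, the service name "github", and the states "starting" and "failed" are hypothetical; only the label names and the running/connected rule come from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// transitionKey mirrors the four labels of muster.service.state_transitions_total.
type transitionKey struct {
	ServiceName, ServiceType, FromState, ToState string
}

// serviceKey mirrors the two labels of muster.service.up.
type serviceKey struct {
	ServiceName, ServiceType string
}

// stateRecorder is a hypothetical stand-in for the real instruments.
type stateRecorder struct {
	mu          sync.Mutex
	transitions map[transitionKey]int64
	current     map[serviceKey]string
}

func newStateRecorder() *stateRecorder {
	return &stateRecorder{
		transitions: make(map[transitionKey]int64),
		current:     make(map[serviceKey]string),
	}
}

// OnStateChange is what a state-change hook would call: it increments the
// counter for this label set and remembers the latest state for the gauge.
func (r *stateRecorder) OnStateChange(name, typ, from, to string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.transitions[transitionKey{name, typ, from, to}]++
	r.current[serviceKey{name, typ}] = to
}

// upValue maps a state to the gauge value: 1 for running or connected, 0 otherwise.
func upValue(state string) int64 {
	switch state {
	case "running", "connected":
		return 1
	default:
		return 0
	}
}

// Up returns what a gauge callback would observe for one service.
// A service that was never seen reports 0.
func (r *stateRecorder) Up(name, typ string) int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return upValue(r.current[serviceKey{name, typ}])
}

// Transitions returns the counter value for one label set.
func (r *stateRecorder) Transitions(name, typ, from, to string) int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.transitions[transitionKey{name, typ, from, to}]
}

func main() {
	r := newStateRecorder()
	r.OnStateChange("github", "MCPServer", "starting", "running")
	fmt.Println("up after start:", r.Up("github", "MCPServer"))
	r.OnStateChange("github", "MCPServer", "running", "failed")
	fmt.Println("up after failure:", r.Up("github", "MCPServer"))
	fmt.Println("starting->running count:", r.Transitions("github", "MCPServer", "starting", "running"))
}
```

Keeping the latest state in a map and deriving the 0/1 value at read time matches the "polled each collection cycle" nature of an observable gauge: the hook only records state, and the value is computed when it is collected.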

Helm: PrometheusRule (helm/muster/templates/prometheusrule.yaml)

Three alerting rules (guarded by prometheusExporterEnabled, same as ServiceMonitor):

  • MusterServiceDown: fires once muster_service_up == 0 has held for 5 min
  • MusterServiceFlapping: fires immediately on more than 4 transitions in 10 min
  • MusterHighToolErrorRate: fires once the tool error ratio has stayed above 10 % for 5 min
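
To make the three thresholds concrete, the sketch below restates them as plain Go predicates. It is only a model of the conditions listed above, not the PromQL in prometheusrule.yaml (the actual expressions are not shown here). The function names are hypothetical, timestamps are assumed to be Unix seconds, and treating "no calls" as "no alert" is an assumption.

```go
package main

import "fmt"

// Thresholds taken from the rule descriptions; names are hypothetical.
const (
	pendingSeconds     = 5 * 60  // "fires after 5 min"
	flapWindowSeconds  = 10 * 60 // "in 10 min"
	flapMaxTransitions = 4       // "> 4 transitions"
	errorRatioLimit    = 0.10    // "> 10 %"
)

// heldLongEnough models the 5-minute pending period: the condition has been
// true continuously since `since` (Unix seconds) and `now` is at least 5 min later.
func heldLongEnough(since, now int64) bool {
	return now-since >= pendingSeconds
}

// isFlapping reports whether more than 4 transitions happened in the
// 10 minutes up to and including `now`. No pending period applies.
func isFlapping(transitionTimes []int64, now int64) bool {
	n := 0
	for _, t := range transitionTimes {
		if t <= now && now-t <= flapWindowSeconds {
			n++
		}
	}
	return n > flapMaxTransitions
}

// highErrorRatio reports whether errors/total is strictly above 10 %.
// With no calls there is no ratio, so it returns false (an assumption).
func highErrorRatio(errors, total float64) bool {
	if total <= 0 {
		return false
	}
	return errors/total > errorRatioLimit
}

func main() {
	now := int64(1_000_000)
	recent := []int64{now - 30, now - 90, now - 200, now - 400, now - 550}

	fmt.Println("service down alert:", heldLongEnough(now-400, now))
	fmt.Println("flapping alert:", isFlapping(recent, now))
	fmt.Println("tool error alert condition:", highErrorRatio(3, 20))
}
```

Note that the error-ratio condition on its own is not enough for MusterHighToolErrorRate: per the list above it also has to hold for 5 minutes, which is what heldLongEnough models.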

Supports the observability.giantswarm.io/tenant label for multi-tenant Mimir.

Enable with:

muster:
  observability:
    metrics:
      exporter: prometheus
      prometheus:
        prometheusRule:
          enabled: true
          tenant: giantswarm # optional

Helm: Grafana dashboard (helm/muster/templates/grafana-dashboard.yaml)

ConfigMap carrying a ready-to-use dashboard, picked up by the standard grafana-sidecar. Panels:

  • Service status stat (up/down per service)
  • State transitions timeseries
  • Tool call rate by tool + outcome
  • Tool call latency p50/p99
  • Tool error rate per tool
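
For readers unfamiliar with how a p50/p99 latency panel can be derived from a histogram such as muster.tool_call.duration, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the matching bucket. This is an assumption about the general technique, not the dashboard's actual query, which is not shown here. The function name bucketQuantile and the bucket bounds are made up for the example, and the +Inf bucket that real histograms carry is left out for brevity.

```go
package main

import (
	"fmt"
	"math"
)

// bucketQuantile estimates the q-quantile (0 <= q <= 1) from histogram buckets.
// bounds[i] is the upper bound of bucket i and cumulative[i] is the number of
// observations less than or equal to that bound. Both slices must be sorted
// ascending and have the same length. The first bucket is assumed to start at 0.
func bucketQuantile(q float64, bounds, cumulative []float64) float64 {
	if len(bounds) == 0 || len(bounds) != len(cumulative) {
		return math.NaN()
	}
	total := cumulative[len(cumulative)-1]
	if total == 0 {
		return math.NaN() // no observations, no quantile
	}
	rank := q * total
	for i, c := range cumulative {
		if c < rank {
			continue
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = bounds[i-1], cumulative[i-1]
		}
		inBucket := c - below
		if inBucket == 0 {
			return bounds[i]
		}
		// Assume observations are spread evenly inside the bucket.
		return lower + (bounds[i]-lower)*(rank-below)/inBucket
	}
	return bounds[len(bounds)-1]
}

func main() {
	// Hypothetical latency buckets in seconds with cumulative counts.
	bounds := []float64{0.1, 0.5, 1.0}
	cumulative := []float64{50, 90, 100}

	fmt.Printf("p50 estimate: %.3fs\n", bucketQuantile(0.50, bounds, cumulative))
	fmt.Printf("p99 estimate: %.3fs\n", bucketQuantile(0.99, bounds, cumulative))
}
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the bucket boundaries cover the latency range of interest.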

Enable with:

muster:
  observability:
    metrics:
      grafanaDashboard:
        enabled: true

What was already present (not changed)

  • muster.tool_calls counter + muster.tool_call.duration histogram in internal/aggregator/metrics.go
  • mcp-go OTel tracing hooks (internal/aggregator/server_options.go)
  • cmd/serve.go tracing + metrics init via mcp-toolkit
  • ServiceMonitor Helm template and deployment OTel env vars
  • values.yaml observability section

QuentinBisson and others added 2 commits June 12, 2026 14:15
.../dashboard templates
Closes #401
Three additions on top of the tool-call metrics already present on main:

1. internal/orchestrator/metrics.go
   - muster.service.state_transitions_total (counter): incremented on every
     MCPServer/aggregator state transition; labels: service_name, service_type,
     from_state, to_state.
   - muster.service.up (observable gauge): polled each collection cycle; 1 when
     a service is running/connected, 0 otherwise; labels: service_name, service_type.
   Both instruments use otel.Meter(observability.TracerName), the same scope as
   the existing tool-call counter in internal/aggregator/metrics.go, so they share
   the global MeterProvider already initialised by cmd/serve.go via mcp-toolkit.

2. helm/muster/templates/prometheusrule.yaml
   PrometheusRule with three rules:
   - MusterServiceDown (5 m for): muster_service_up == 0
   - MusterServiceFlapping (instant): > 4 transitions in 10 m
   - MusterHighToolErrorRate (5 m for): error ratio > 10 % per tool
   Guarded by the same prometheusExporterEnabled helper as the ServiceMonitor.
   Supports tenant label for multi-tenant Mimir deployments.

3. helm/muster/templates/grafana-dashboard.yaml
   ConfigMap carrying a ready-to-use Grafana dashboard (picked up automatically
   by the grafana-sidecar). Panels: service status stat, state transitions
   timeseries, tool call rate, p50/p99 latency, error rate per tool.
   Namespace/pod template variables for multi-instance filtering.

@QuentinBisson QuentinBisson marked this pull request as ready for review June 12, 2026 12:25
@QuentinBisson QuentinBisson requested a review from a team as a code owner June 12, 2026 12:25

Development

Successfully merging this pull request may close these issues.

Add OpenTelemetry instrumentation and Prometheus metrics
