feat(observability): service-state OTel metrics, PrometheusRule, Grafana dashboard #840
Open
QuentinBisson wants to merge 2 commits into
.../dashboard templates

Closes #401

Three additions on top of the tool-call metrics already present on main:

1. internal/orchestrator/metrics.go
   - muster.service.state_transitions_total (counter): incremented on every MCPServer/aggregator state transition; labels: service_name, service_type, from_state, to_state.
   - muster.service.up (observable gauge): polled each collection cycle; 1 when a service is running/connected, 0 otherwise; labels: service_name, service_type.
   Both instruments use otel.Meter(observability.TracerName), the same scope as the existing tool-call counter in internal/aggregator/metrics.go, so they share the global MeterProvider already initialised by cmd/serve.go via mcp-toolkit.

2. helm/muster/templates/prometheusrule.yaml
   PrometheusRule with three rules:
   - MusterServiceDown (5 m for): muster_service_up == 0
   - MusterServiceFlapping (instant): > 4 transitions in 10 m
   - MusterHighToolErrorRate (5 m for): error ratio > 10 % per tool
   Guarded by the same prometheusExporterEnabled helper as the ServiceMonitor. Supports tenant label for multi-tenant Mimir deployments.

3. helm/muster/templates/grafana-dashboard.yaml
   ConfigMap carrying a ready-to-use Grafana dashboard (picked up automatically by the grafana-sidecar). Panels: service status stat, state transitions timeseries, tool call rate, p50/p99 latency, error rate per tool. Namespace/pod template variables for multi-instance filtering.
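For reviewers who want the semantics of the two service instruments at a glance, here is a minimal stdlib-only sketch of the bookkeeping they imply. It is not the code in internal/orchestrator/metrics.go and does not use the OTel API; the registry type, its methods, the service name "github", and the states "starting" and "failed" are all hypothetical, and the exact spelling of the running/connected state strings is assumed from the description.

```go
package main

import "fmt"

// transitionLabels mirrors the four labels on muster.service.state_transitions_total.
type transitionLabels struct {
	ServiceName, ServiceType, FromState, ToState string
}

// serviceLabels mirrors the two labels on muster.service.up.
type serviceLabels struct {
	ServiceName, ServiceType string
}

// registry is a hypothetical in-memory stand-in for the two OTel instruments.
type registry struct {
	transitions map[transitionLabels]int64 // counter value per label set
	current     map[serviceLabels]string   // last known state per service
}

func newRegistry() *registry {
	return &registry{
		transitions: make(map[transitionLabels]int64),
		current:     make(map[serviceLabels]string),
	}
}

// upValue returns 1 for running or connected and 0 for anything else.
// The exact state strings are an assumption taken from the description.
func upValue(state string) int64 {
	if state == "running" || state == "connected" {
		return 1
	}
	return 0
}

// recordTransition plays the role of the state-change hook: it bumps the
// counter for this label set and remembers the new state.
func (r *registry) recordTransition(name, typ, from, to string) {
	r.transitions[transitionLabels{name, typ, from, to}]++
	r.current[serviceLabels{name, typ}] = to
}

// collectUp plays the role of the observable gauge callback: it is evaluated
// at collection time and reports one 0/1 value per known service.
func (r *registry) collectUp() map[serviceLabels]int64 {
	out := make(map[serviceLabels]int64, len(r.current))
	for labels, state := range r.current {
		out[labels] = upValue(state)
	}
	return out
}

func main() {
	r := newRegistry()
	// "github", "starting" and "failed" are made-up example values.
	r.recordTransition("github", "MCPServer", "starting", "running")
	fmt.Println("up after start:", r.collectUp()[serviceLabels{"github", "MCPServer"}])
	r.recordTransition("github", "MCPServer", "running", "failed")
	fmt.Println("up after failure:", r.collectUp()[serviceLabels{"github", "MCPServer"}])
	fmt.Println("transitions recorded:", len(r.transitions))
}
```

The split matters for alerting: the counter only changes when a transition happens, while the gauge is recomputed on every collection, so a service that stays down keeps reporting 0 without any new event.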
Closes #401
What changed
Go: service lifecycle metrics (internal/orchestrator/metrics.go)

Two new OTel instruments emitted from the orchestrator's state-change hook:

- muster.service.state_transitions_total (counter); labels: service_name, service_type, from_state, to_state
- muster.service.up (observable gauge); labels: service_name, service_type

muster.service.up is 1 when a service is in running or connected state, 0 otherwise: the standard "is it up?" signal for alerting.

Both use otel.Meter(observability.TracerName), the same scope as the existing muster.tool_calls counter in internal/aggregator/metrics.go. The global MeterProvider is already initialised in cmd/serve.go via mcp-toolkit, so no additional bootstrap is needed.

Helm: PrometheusRule (helm/muster/templates/prometheusrule.yaml)

Three alerting rules (guarded by prometheusExporterEnabled, same as ServiceMonitor):

- MusterServiceDown (for: 5m): muster_service_up == 0
- MusterServiceFlapping (instant): more than 4 state transitions in 10 minutes
- MusterHighToolErrorRate (for: 5m): error ratio above 10% per tool

Supports the observability.giantswarm.io/tenant label for multi-tenant Mimir.

Enable with:

Helm: Grafana dashboard (helm/muster/templates/grafana-dashboard.yaml)

ConfigMap carrying a ready-to-use dashboard, picked up by the standard grafana-sidecar. Panels:

- service status (stat)
- state transitions (timeseries)
- tool call rate
- p50/p99 latency
- error rate per tool

Enable with:
What was already present (not changed)
- muster.tool_calls counter + muster.tool_call.duration histogram in internal/aggregator/metrics.go
- internal/aggregator/server_options.go
- cmd/serve.go tracing + metrics init via mcp-toolkit
- values.yaml observability section
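The MusterHighToolErrorRate rule presumably builds on the existing muster.tool_calls counter; that link is an assumption, since the rule expression is not quoted here. As a reading aid, the "error ratio above 10% per tool" condition can be sketched as a plain predicate. The toolCalls type, the tool names, and the counts below are hypothetical, and the 5 minute "for" clause is deliberately not modelled.

```go
package main

import "fmt"

// toolCalls holds hypothetical per-tool totals over some evaluation window.
type toolCalls struct {
	Total  float64 // all calls to the tool
	Errors float64 // calls that ended in an error
}

// errorRatioThreshold is the 10% limit named in the rule description.
const errorRatioThreshold = 0.10

// errorRatio returns Errors/Total, or 0 when the tool was never called,
// so that idle tools cannot trip the check through a division by zero.
func errorRatio(c toolCalls) float64 {
	if c.Total == 0 {
		return 0
	}
	return c.Errors / c.Total
}

// exceedsThreshold reports whether the ratio is strictly above the limit.
func exceedsThreshold(c toolCalls) bool {
	return errorRatio(c) > errorRatioThreshold
}

func main() {
	// Tool names and counts are made up for illustration.
	healthy := toolCalls{Total: 200, Errors: 10}
	failing := toolCalls{Total: 200, Errors: 50}
	idle := toolCalls{}

	fmt.Println("list_pods exceeds:", exceedsThreshold(healthy))
	fmt.Println("get_logs exceeds:", exceedsThreshold(failing))
	fmt.Println("unused_tool exceeds:", exceedsThreshold(idle))
}
```

In the real rule the ratio would be computed per tool from rates over a time window, and the alert would only fire once the condition has held for the full "for" duration.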