Health indicators based on Service Level Objectives #21311

Open

jkschneider wants to merge 2 commits into spring-projects:main from jkschneider:health-slos

+486 −1

jkschneider

Contributor

jkschneider commented May 4, 2020 • edited

This pull request adds a commonly requested feature: letting an application aggregate a set of metrics-based key performance indicators down to a health indicator.

I fully expect changes, probably significant ones, based on feedback, but I want to offer this up early in the 2.4.0 release cycle so we have time to iterate and to dogfood any autoconfigured service level objectives.

Some indicators are known to be broadly applicable to a wide range of Java applications, and those could be autoconfigured. An example of a set of such indicators is defined here and autoconfigured by this pull request (JvmServiceLevelObjectives.MEMORY).

In many cases, users would like to configure a load balancer to avoid instances that are failing a key performance indicator by configuring an HTTP health check on the load balancer. In fact, some applications may already be doing this for the health indicators Spring Boot or users already provide. Example platform load balancer configurations that can be pointed to /actuator/health:

CloudFoundry health-check-http-endpoint
AWS ALB health checking
Kubernetes service health checking:

metadata:
  name: instance-reported-utilization
  annotations:
    service.beta.kubernetes.io/do-loadbalancer-healthcheck-port: "80"
    service.beta.kubernetes.io/do-loadbalancer-healthcheck-protocol: "http"
    service.beta.kubernetes.io/do-loadbalancer-healthcheck-path: "/actuator/health"

See micrometer-metrics/micrometer#2055 for more detail.
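
To make the load balancer interaction concrete, here is a minimal sketch of how an HTTP health check might interpret the status reported by /actuator/health. It assumes Spring Boot's default status mapping (DOWN and OUT_OF_SERVICE are served as 503, anything else as 200). The class HealthProbeSketch and its methods are hypothetical and are not part of this pull request.

```java
// Hypothetical helper: models how a load balancer health check
// might interpret the status served by /actuator/health.
final class HealthProbeSketch {

    // Assumes Spring Boot's default mapping: DOWN and OUT_OF_SERVICE
    // become 503, every other status (including UP and UNKNOWN) becomes 200.
    static int httpStatusFor(String healthStatus) {
        switch (healthStatus) {
            case "DOWN":
            case "OUT_OF_SERVICE":
                return 503;
            default:
                return 200;
        }
    }

    // A typical HTTP health check keeps an instance in rotation
    // only when the probe returns a 2xx response.
    static boolean keepInRotation(String healthStatus) {
        int code = httpStatusFor(healthStatus);
        return code >= 200 && code < 300;
    }
}
```

Under these assumptions, an SLO-backed health indicator that reports DOWN is enough for the load balancer to route around the instance, with no SLO-specific configuration on the load balancer itself.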

The `HealthMeterRegistry`

As of 1.6.0, Micrometer has a new registry implementation: micrometer-registry-health. This pull request adds an autoconfiguration for it to spring-boot-actuator-autoconfigure.

Any @Bean ServiceLevelObjective is configured onto the HealthMeterRegistry and bound as a Spring Boot HealthIndicator.

What it looks like in `/actuator/health`

[Image: example /actuator/health output]

About `ServiceLevelObjective`

Service level objectives broadly have the following capabilities:

Are defined as a single- or multi-indicator test against a set of time series registered to HealthMeterRegistry.
Can declare the MeterBinder instances that provide the measurements they need to determine availability.
Contain a filterable and transformable name and tag set, mapped to the Spring Boot bean name and the Health#details map, respectively.
Optionally contain a readable base unit that is mapped to health details.
Can pretty-print values and thresholds for human-readable interpretation of an SLO at some instant.
Can be defined to look back and aggregate over a time window in different ways.
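
The capabilities above can be illustrated with a small self-contained model. This is a sketch only: SloSketch is a hypothetical stand-in and does not use Micrometer's ServiceLevelObjective API. It assumes a single-indicator objective that takes the maximum over a fixed number of recent samples as its look-back window and compares it to an upper threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a single-indicator SLO (not Micrometer's API).
final class SloSketch {
    private final String name;       // would map to the bean name
    private final String baseUnit;   // would map to a health detail
    private final double threshold;  // objective is met while value < threshold
    private final int windowSize;    // number of recent samples to look back over
    private final Deque<Double> samples = new ArrayDeque<>();

    SloSketch(String name, String baseUnit, double threshold, int windowSize) {
        this.name = name;
        this.baseUnit = baseUnit;
        this.threshold = threshold;
        this.windowSize = windowSize;
    }

    String name() {
        return name;
    }

    // Record a new measurement and evict samples that fall out of the window.
    void record(double value) {
        samples.addLast(value);
        if (samples.size() > windowSize) {
            samples.removeFirst();
        }
    }

    // Aggregate over the window; with no samples this sketch reports 0.
    double maxInWindow() {
        double max = 0.0;
        for (double sample : samples) {
            max = Math.max(max, sample);
        }
        return max;
    }

    boolean healthy() {
        return maxInWindow() < threshold;
    }

    // Pretty-printed value and threshold for human-readable health details.
    Map<String, String> details() {
        Map<String, String> details = new LinkedHashMap<>();
        details.put("value", percent(maxInWindow()));
        details.put("mustBe", "<" + percent(threshold));
        details.put("unit", baseUnit);
        return details;
    }

    private static String percent(double ratio) {
        return String.format(Locale.ROOT, "%.2f%%", ratio * 100);
    }
}
```

Other aggregations (average, count, rate) would slot in where maxInWindow is used; the choice of maximum here is only for illustration.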

API error ratio property-driven configuration

management.metrics.export.health.api-error-budgets.api.customer=0.01
management.metrics.export.health.api-error-budgets.admin=0.02

The above properties result in two service level objective health indicators called apiErrorRatioApiCustomer and apiErrorRatioAdmin. They check that the ratio of SERVER_ERROR outcomes to total throughput is less than 1% for requests to paths starting with /api/customer and less than 2% for requests to paths starting with /admin, respectively.
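
As a rough illustration of that mapping, the sketch below derives an indicator name and path prefix from a property key and checks an error ratio against its budget. The naming rule is an assumption inferred from the two examples above, and ApiErrorBudgetSketch is a hypothetical class, not code from this pull request.

```java
// Hypothetical sketch of the property-to-indicator mapping described above.
final class ApiErrorBudgetSketch {

    // Assumed naming rule: "api.customer" -> "apiErrorRatioApiCustomer".
    static String indicatorName(String key) {
        StringBuilder name = new StringBuilder("apiErrorRatio");
        for (String part : key.split("\\.")) {
            if (part.isEmpty()) {
                continue;
            }
            name.append(Character.toUpperCase(part.charAt(0)))
                .append(part.substring(1));
        }
        return name.toString();
    }

    // Assumed path rule: "api.customer" -> "/api/customer".
    static String pathPrefix(String key) {
        return "/" + key.replace('.', '/');
    }

    // The objective is met while SERVER_ERROR outcomes / total requests < budget.
    // With no traffic there is nothing to violate, so this sketch reports true.
    static boolean withinBudget(long serverErrors, long totalRequests, double budget) {
        if (totalRequests == 0) {
            return true;
        }
        return (double) serverErrors / totalRequests < budget;
    }
}
```

For example, 5 server errors in 1,000 requests is a ratio of 0.5%, which is inside the 1% budget configured for /api/customer.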


 Upgrade Micrometer to 1.6.0-SNAPSHOT

c5b75a7

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage label

May 4, 2020

@jkschneider

Contributor Author

jkschneider commented May 4, 2020 • edited

Open questions

We build health indicators with AbstractHealthIndicator(slo.getFailedMessage()). It's unclear to me if the failed message ever appears in /actuator/health response body output.

Some of the SLOs are a combination of two or more indicators. For example, in jvmTotalMemory, we set a relatively low threshold on GC overhead (20% of CPU time over the last 5 minutes) that applies only if there is 90% pool utilization as well. These composite SLOs are registered with the relatively new CompositeHealthContributor.fromMap(..) API. Unfortunately, I can see no way to provide details and a failed message on the composite. I'd like to add details and a failed message for each contributing health indicator, and potentially a different one for what it means for a set of such indicators to fail together. @philwebb, do you have suggestions? An example of what I think might be nice is included below (specifically the details directly underneath jvmTotalMemory).

"jvmTotalMemory": {
  "status": "UP",
  "details": {
    "someTag": "someValue"
  },
  "components": {
    "jvmGcOverhead": {
      "status": "UP",
      "details": {
        "value": "0.01%",
        "mustBe": "<20%",
        "unit": "percent CPU time spent"
      }
    },
    "jvmMemoryConsumption": {
      "status": "UP",
      "details": {
        "value": "9.09%",
        "mustBe": "<90%",
        "unit": "maximum percent used in last 5 minutes"
      }
    }
  }
}
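
The shape above could be modeled roughly as follows. This is a sketch under stated assumptions, not the pull request's implementation: CompositeSloSketch is hypothetical, it builds plain maps rather than using CompositeHealthContributor, and it assumes the composite is DOWN only when every component is violated at the same time.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of a composite SLO with details at both levels.
final class CompositeSloSketch {

    // One contributing indicator: UP while value < threshold.
    static Map<String, Object> component(double value, double threshold, String unit) {
        Map<String, Object> details = new LinkedHashMap<>();
        details.put("value", String.format(Locale.ROOT, "%.2f%%", value * 100));
        details.put("mustBe", String.format(Locale.ROOT, "<%.0f%%", threshold * 100));
        details.put("unit", unit);

        Map<String, Object> body = new LinkedHashMap<>();
        body.put("status", value < threshold ? "UP" : "DOWN");
        body.put("details", details);
        return body;
    }

    // Assumption: the composite fails only if all components fail together.
    static Map<String, Object> jvmTotalMemory(double gcOverhead, double memoryUsed,
                                              Map<String, String> tags) {
        Map<String, Map<String, Object>> components = new LinkedHashMap<>();
        components.put("jvmGcOverhead",
                component(gcOverhead, 0.20, "percent CPU time spent"));
        components.put("jvmMemoryConsumption",
                component(memoryUsed, 0.90, "maximum percent used in last 5 minutes"));

        boolean allViolated = components.values().stream()
                .allMatch(c -> "DOWN".equals(c.get("status")));

        Map<String, Object> body = new LinkedHashMap<>();
        body.put("status", allViolated ? "DOWN" : "UP");
        body.put("details", new LinkedHashMap<>(tags)); // composite-level details
        body.put("components", components);
        return body;
    }
}
```

The composite-level details map is where the SLO's tags, and possibly a failed message for the combination, could live.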

@jkschneider jkschneider force-pushed the health-slos branch 3 times, most recently from 220c8ba to d907ba5

May 5, 2020 13:26

@philwebb philwebb added type: enhancement and removed status: waiting-for-triage labels

May 5, 2020

@philwebb philwebb added this to the 2.4.x milestone

May 5, 2020

@philwebb

Member

philwebb commented May 5, 2020

Thanks @jkschneider! I'll target this for 2.4.x so we remember to take a look as soon as the 2.3.0 release crunch is over.


 Service level objective health indicators

7290f5f

@jkschneider jkschneider force-pushed the health-slos branch from d907ba5 to 7290f5f

May 5, 2020 21:36

@snicoll snicoll added the for: team-attention label

Sep 9, 2020

@bclozel bclozel modified the milestones: 2.4.x, 2.x

Sep 28, 2020

@bclozel bclozel added status: blocked and removed for: team-attention labels

Sep 28, 2020

@bclozel

Member

bclozel commented Sep 28, 2020

We haven't had a chance to take a look at this change, nor upgrade to Micrometer 1.6.
We're already quite late in the Milestone cycle and we don't think we'll have time to address this change properly.
We need to take a look at this change and its implications (including the new concepts introduced and the Health endpoint format).

@snicoll snicoll mentioned this pull request

Sep 29, 2020

Upgrade to Micrometer 1.6.0 #23525

Closed

@wilkinsona wilkinsona mentioned this pull request

Aug 23, 2021

Provide a configuration property for setting the path used by auto-configured disk space metrics #27306

Closed

@mbhave mbhave self-assigned this

Aug 24, 2021

@mbhave

Contributor

mbhave commented Sep 16, 2021 • edited

@snicoll and I discussed this today. There are a few things that came up:

Since we decided that the diskspace health indicator should ideally be something that can be configured in the monitoring system, this feels very much along those lines. If we decide to surface the SLOs as health indicators, we should align our strategy for diskspace accordingly. Even with the deprecation of the diskspace indicator, we could surface that information in health via the SLOs.
We are not sure that having a top-level component for every SLO is the best way to do this. A nested structure for the SLOs might be a better alternative.
From an API perspective, we could have an API to expose SLOs, which we could use to create the composite, rather than the current approach of registering beans within a bean method.

Flagging for team-meeting so that we can discuss this on the next team call.

@mbhave mbhave added for: team-meeting and removed status: blocked labels

Sep 16, 2021

@wilkinsona

Member

wilkinsona commented Sep 17, 2021 • edited

We discussed this some more as a team today and our feeling is that we're not sure that we have a strong enough opinion to auto-configure SLOs as health indicators. We can see that it may make sense for some users but not for others. For example, in some cases, a proxy will already be aware of the error rate for requests that it routes to an application instance. In this case, exposing the information via a health endpoint that it will also be monitoring will be of minimal value, and may even be harmful depending on how things behave when the application's health changes. For users that do want to expose SLOs as health indicators, we could provide some classes that make it easier to do so.

Since this proposal was made, we've also introduced the concept of application state. It may be that some users want to configure things such that an unmet objective results in a change to the application state to indicate that it's no longer ready, for example. We could provide some helper classes that a user can configure to connect SLOs to application state.
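
One possible shape for such a helper is sketched below. It is hypothetical and deliberately free of Spring types so that it compiles on its own: the Readiness enum only mirrors the names of Spring Boot's ReadinessState values, and SloReadinessBridgeSketch is not an existing class.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

// Hypothetical bridge from an SLO check to application readiness.
final class SloReadinessBridgeSketch {

    // Local stand-in mirroring the names of Spring Boot's ReadinessState values.
    enum Readiness { ACCEPTING_TRAFFIC, REFUSING_TRAFFIC }

    private final BooleanSupplier objectiveMet;   // e.g. an SLO's pass/fail test
    private final Consumer<Readiness> publisher;  // receives state changes
    private Readiness last;                       // null until the first check

    SloReadinessBridgeSketch(BooleanSupplier objectiveMet, Consumer<Readiness> publisher) {
        this.objectiveMet = objectiveMet;
        this.publisher = publisher;
    }

    // Intended to be called periodically; publishes only when the state changes,
    // so a steady state does not produce a stream of duplicate events.
    void check() {
        Readiness current = objectiveMet.getAsBoolean()
                ? Readiness.ACCEPTING_TRAFFIC
                : Readiness.REFUSING_TRAFFIC;
        if (current != last) {
            last = current;
            publisher.accept(current);
        }
    }
}
```

In a real application the publisher could delegate to something like AvailabilityChangeEvent.publish(context, ReadinessState.REFUSING_TRAFFIC), and check() could be driven by a scheduled task. Both choices are assumptions rather than anything decided in this thread.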

We discussed possibly auto-configuring the HealthMeterRegistry, automatically adding any ServiceLevelObjective beans to it. We could auto-configure some ServiceLevelObjective beans such as JvmServiceLevelObjectives.MEMORY and OperatingSystemServiceLevelObjectives.DISK rather than hard-coding them as proposed here. This would align with our auto-configuring of Micrometer's various Jvm...Metrics classes.

Overall, our feeling was that we would stop short of anything that exposes the SLOs externally, instead auto-configuring the HealthMeterRegistry and supporting beans and making it easier for a user to then plug the SLOs into health or application state in a way that meets their specific needs.

@shakuzen @jonatan-ivanov Could we have your input here please? Are we right to be cautious and just give users the parts they need and leave them to join things together or is there some clearly established usage of HealthMeterRegistry and SLOs that means that we can proceed with confidence in a particular direction?