Understanding SLOs for monitoring applications

To manage and monitor an application properly, you need a reference point that tells you where you are and how you are doing, so you can adjust and improve over time. This reference point is known as a service level objective (SLO). Taking the time to define clear SLOs will make life easier for service owners as well as for the internal or external users who depend on your services.

However, before you can define an SLO, you need an objective, quantitative metric that tells you how well your application is performing or how reliable it is. These metrics are known as service level indicators (SLIs).

Service level indicator—SLI

A good way to determine which metrics to use for your SLIs is to think about what directly affects your users’ happiness with your application’s performance. This could include things such as the latency, availability, and accuracy of the application. On the other hand, CPU utilization would be a poor SLI because your users don’t really care how your server’s CPU is doing, as long as it isn’t affecting their experience with your app.

Additionally, the SLIs you choose will depend on what type of application you are running. For a typical request/response application, you will probably focus on availability, request latency, and throughput (successful requests per second). For data storage, you might look at availability and the consistency of the data being served. For a data pipeline, your SLIs might be whether the expected data is returned and how long it takes for the data to be processed, especially in an eventual consistency model.
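As a rough illustration of how SLIs for a request/response application might be computed, here is a minimal Python sketch. The `Request` record, its field names, the choice to count any non-5xx response as "good," and the 300 ms threshold are all assumptions made for this example, not something prescribed by a particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical log record; field names are assumptions for this sketch.
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request, in milliseconds


def availability_sli(requests):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return None  # no traffic in the window, so the SLI is undefined
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests served within threshold_ms."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(500, 30.0),   # server error: counts against availability
    Request(200, 480.0),  # slow response: counts against latency
]

print(availability_sli(sample))    # 0.75
print(latency_sli(sample, 300.0))  # 0.75
```

Both indicators are expressed as "good events divided by total events," which makes it straightforward to compare them against a percentage target later on.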

Service level objective—SLO

An SLO is a target threshold for an SLI, measured over a period of time. This is the bar against which the SLI is measured to determine if performance is meeting expectations. A good SLO will define the level of performance your application needs, but no higher than necessary. This is a crucial point, and finding the right level will require some testing over time. If your users are fine with 99% availability, there’s no reason to make the massive investment that would be required to hit 99.999% availability.
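To make the gap between those two targets concrete, the sketch below converts an availability SLO into the downtime it permits. The 30-day window and the function names are assumptions chosen for illustration; the arithmetic is simply the window length multiplied by the allowed failure fraction.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full downtime a given availability SLO permits in the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def meets_slo(sli_percent, slo_percent):
    """True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_percent


for target in (99.0, 99.9, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.2f} minutes of downtime")

# A measured availability of 99.5% satisfies a 99% SLO but not a 99.999% one.
print(meets_slo(99.5, 99.0))
print(meets_slo(99.5, 99.999))
```

Over 30 days, a 99% target leaves room for about 432 minutes (7.2 hours) of downtime, while 99.999% leaves less than half a minute, which is why the stricter target is so much more expensive to meet.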

An example latency SLO could be a target for the 95th percentile latency, which is the value that 95% of requests complete within, so only the slowest 5% of user requests exceed it. This is far more informative than a simple latency average, which can be easily skewed by outliers and can hide a slow tail.
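The difference between an average and a percentile is easy to see with a small, made-up data set. The sketch below uses the nearest-rank method for the percentile (one of several common definitions); the sample latencies and the 500 ms target are invented for illustration.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 18 fast requests, one slow one, and one extreme outlier.
latencies_ms = [100] * 18 + [900, 5000]

mean_ms = statistics.mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)

print(mean_ms)  # 385
print(p95_ms)   # 900

# Hypothetical SLO: 95th percentile latency at or below 500 ms.
slo_target_ms = 500
print(p95_ms <= slo_target_ms)  # False
```

Here the mean of 385 ms describes neither the typical request (100 ms) nor the slow tail, and it would pass a 500 ms target. The 95th percentile of 900 ms shows that the slowest requests are well outside it.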

Copyright © 2021 IDG Communications, Inc.
