What are SLIs?
Service Level Indicators (SLIs) are carefully defined quantitative measures of some aspect of the level of service that is provided. They form the foundation of reliability engineering by providing objective measurements that reflect user experience and system performance[3].
SLIs are crucial in both Site Reliability Engineering (SRE) and DevOps because they provide objective data for making informed decisions about system reliability. Without clearly defined SLIs, teams lack the visibility needed to understand if their services are meeting user expectations or where improvements are needed[7].
In practical terms, SLIs are metrics measured over time that indicate the health of a service[1]. They serve as the raw measurements that feed into Service Level Objectives (SLOs), which in turn inform Service Level Agreements (SLAs). This hierarchical relationship creates a framework for managing reliability:
- SLIs measure specific aspects of service performance
- SLOs set targets for those measurements
- SLAs formalize commitments based on those targets
For example, if we consider a web application, an SLI might measure the percentage of requests that complete in under 200ms. The corresponding SLO might state that 99% of requests should complete within that timeframe. Finally, an SLA might formalize this commitment to customers with specific consequences if the objective isn’t met[11].
SLIs vs SLOs vs SLAs
Understanding the distinction between these three concepts is essential for effective reliability engineering:
Service Level Indicators (SLIs) are the actual measurements of service performance. They are the metrics that matter to users and reflect the quality of service being delivered. Examples include error rates, latency measurements, and system throughput[3].
Service Level Objectives (SLOs) are target values or ranges for a service level measured by an SLI. They define the expectations for how reliable a service should be. For instance, an SLO might state that 99.9% of requests should return successfully over a 30-day window[1].
Service Level Agreements (SLAs) are formal contracts that include SLOs and specify consequences if service levels aren’t met. SLAs typically include penalties, such as financial compensation or service credits, when objectives aren’t achieved[9].
Here’s a practical example to illustrate the relationship:
- SLI: Percentage of API requests that return a response within 300 milliseconds
- SLO: 99% of API requests must return within 300 milliseconds over a 7-day period
- SLA: If the 99% threshold isn’t met for a calendar month, customers receive a 10% service credit[4]
SLIs provide the raw data, SLOs set the targets, and SLAs formalize the business commitments around those targets.
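To make the hierarchy concrete, here is a minimal Python sketch (the latencies, thresholds, and function name are illustrative, not taken from any particular tool) that computes a latency SLI from raw request timings and checks it against an SLO target:

```python
# Minimal sketch: compute a latency SLI and compare it against an SLO target.
# All values are illustrative; a real system would read them from a metrics store.

def latency_sli(latencies_ms, threshold_ms=300):
    """Percentage of requests that completed within the latency threshold."""
    if not latencies_ms:
        return 100.0
    fast = sum(1 for l in latencies_ms if l <= threshold_ms)
    return 100.0 * fast / len(latencies_ms)

# Raw measurements (the SLI input) collected over the SLO window.
request_latencies_ms = [120, 250, 310, 95, 280, 150, 400, 175]

sli = latency_sli(request_latencies_ms, threshold_ms=300)
slo_target = 99.0  # "99% of API requests must return within 300 milliseconds"

print(f"SLI: {sli:.2f}% of requests within 300 ms")
print("SLO met" if sli >= slo_target else "SLO violated -- SLA consequences may follow")
```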
The Golden Signals
When implementing SLIs, it’s helpful to start with the “Golden Signals” – four key metrics that provide comprehensive insight into service health:
Latency: The time it takes to service a request. This includes both successful requests and failed requests (which can time out or return an error). Latency is typically measured at various percentiles (e.g., 50th, 90th, 99th) to capture both the typical and worst-case user experiences[1].
Traffic: A measure of how much demand is being placed on your system. For web services, this is typically measured in requests per second. For data processing systems, it might be transactions or records processed per second[3].
Errors: The rate of requests that fail. Failures can be explicit (e.g., HTTP 500 errors) or implicit (successful HTTP responses that contain error messages). Error rates are often expressed as a percentage of total requests[1].
Saturation: How “full” your service is or how close it is to its capacity limit. This could measure CPU utilization, memory usage, I/O operations, or network bandwidth. Saturation metrics help predict when a system will begin to degrade due to resource constraints[3].
These Golden Signals provide a balanced view of service health and form the basis for many effective SLIs. By monitoring these four aspects, teams can quickly identify issues affecting user experience and system performance.
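As an illustration of how the latency signal is typically summarized, the short Python sketch below (the samples and the nearest-rank helper are made up for the example) computes the 50th, 90th, and 99th percentiles of a batch of response times:

```python
# Sketch: summarizing the latency golden signal at several percentiles.
# Percentiles expose the worst-case behavior that an average would hide.
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples (0 < p <= 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [85, 92, 110, 120, 135, 150, 180, 220, 450, 1200]

for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")  # p50: 135 ms, p90: 450 ms, p99: 1200 ms
```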
Part 2: Designing SLIs
How to Design Good SLIs
Effective SLIs share several key characteristics:
- User-centric: Good SLIs reflect what users actually experience and care about. They should correlate strongly with user satisfaction.
- Measurable: SLIs must be quantifiable and objectively measurable through automated systems.
- Actionable: When an SLI indicates a problem, it should be clear what actions might resolve the issue.
- Simple: SLIs should be easy to understand and explain to both technical and non-technical stakeholders.
- Consistent: The measurement methodology should produce consistent results over time to enable meaningful comparisons.
When designing SLIs, it’s important to distinguish between user-centric and system-centric metrics. User-centric SLIs directly measure the user experience, such as page load time or transaction success rate. System-centric SLIs focus on internal system performance, such as database query time or CPU utilization.
While both types have value, user-centric SLIs should generally take precedence because they more directly reflect service quality as experienced by users. System-centric SLIs are most valuable when they have a clear correlation with user experience or when they help diagnose issues identified by user-centric SLIs.
Choosing the Right Metrics
Selecting appropriate metrics for your SLIs requires careful consideration of what truly matters for your service:
Quantitative vs. Qualitative Metrics
Quantitative metrics provide numerical measurements that can be objectively tracked and compared over time. These include response times, error rates, and throughput measurements. Qualitative metrics attempt to measure subjective aspects of the user experience, such as satisfaction scores or feature usability. While SLIs typically focus on quantitative metrics due to their objectivity, qualitative feedback can help validate that your SLIs are measuring what truly matters to users.
Leading vs. Lagging Indicators
Leading indicators predict future performance or issues before they significantly impact users. For example, increasing memory usage might predict an upcoming out-of-memory error. Lagging indicators measure outcomes after they’ve occurred, such as the number of failed requests. A balanced set of SLIs should include both types: leading indicators to provide early warnings and lagging indicators to confirm actual service quality.
When choosing metrics, focus on those that:
- Have a direct impact on user experience
- Align with business objectives
- Can be consistently measured
- Provide actionable insights when they deviate from expected values
Types of SLIs
Different services require different types of SLIs based on their nature and user expectations. Here are common SLI types with examples:
Availability SLIs
- Percentage of successful requests vs. total requests
- Percentage of time the service is operational
- Ratio of successful API calls to total calls
Example: “99.9% of API requests return a valid response (non-5xx status code)”
Latency SLIs
- Time taken for requests to complete
- Percentile-based measurements (e.g., 95th percentile response time)
- Time to first byte or time to interactive
Example: “95% of web page loads complete within 2 seconds”
Throughput SLIs
- Requests processed per second
- Transactions completed per minute
- Data volume processed per hour
Example: “The system can handle 10,000 requests per second during peak times”
Error Rate SLIs
- Percentage of failed requests
- Rate of specific error types
- Failed transactions as a percentage of total
Example: “Fewer than 0.05% of all transactions result in an error”
Quality/Correctness SLIs
- Percentage of data processed correctly
- Proportion of responses that were served in an undegraded state
- Accuracy of results compared to expected outcomes
Example: “99.99% of database writes are successfully replicated to all nodes”
For data processing systems, additional SLI types include:
Freshness SLIs
- How recently data was updated
- Time lag between data creation and availability
Example: “90% of dashboard data is less than 5 minutes old”
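A freshness SLI can be computed directly from record timestamps. The sketch below is a hypothetical illustration (the five-minute threshold and the record ages are made up) of measuring the share of dashboard records updated recently:

```python
# Sketch: freshness SLI -- share of records updated within the last 5 minutes.
from datetime import datetime, timedelta, timezone

def freshness_sli(last_updated, max_age=timedelta(minutes=5)):
    now = datetime.now(timezone.utc)
    fresh = sum(1 for t in last_updated if now - t <= max_age)
    return 100.0 * fresh / len(last_updated) if last_updated else 100.0

# Illustrative update timestamps for records backing a dashboard.
now = datetime.now(timezone.utc)
records = [now - timedelta(minutes=m) for m in (1, 2, 3, 4, 8, 12)]

print(f"Freshness SLI: {freshness_sli(records):.1f}% of data is less than 5 minutes old")
```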
Coverage SLIs
- Percentage of data successfully processed
- Proportion of expected records that were handled
Example: “99% of incoming records are successfully processed within 10 minutes”
Durability SLIs
- Probability of data being retained over time
- Percentage of data that can be successfully recovered
Example: “99.999999% of objects stored will be retained for one year”
Instrumenting Applications
To collect SLI data, applications must be properly instrumented. This involves adding code or configuration to measure and expose metrics. Here are key approaches:
Code-Level Instrumentation
- Add timing code around critical functions
- Count errors and successful operations
- Track resource usage and saturation
Infrastructure Instrumentation
- Configure monitoring agents on servers
- Enable metrics collection in cloud platforms
- Set up network monitoring
Application Framework Instrumentation
- Use built-in metrics capabilities of frameworks
- Add middleware for consistent measurement
- Leverage auto-instrumentation libraries
Popular tools and libraries for instrumenting applications include:
- Prometheus Client Libraries – For languages like Go, Python, Java, and others
- OpenTelemetry – For collecting traces, metrics, and logs
- StatsD – For sending custom metrics to collection systems
- Micrometer – For JVM-based applications
- Application Performance Monitoring (APM) tools like New Relic, Datadog, and Dynatrace
When instrumenting applications, follow these best practices:
- Measure at service boundaries to capture the full user experience
- Include contextual information like service name and environment
- Consider the performance impact of instrumentation itself
- Standardize naming conventions for consistency
- Instrument both successful operations and failures
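As a minimal sketch of code-level instrumentation with the Prometheus Python client (the metric names, labels, and handle_request stand-in are assumptions for the example, not a prescribed standard), the code below counts requests by outcome and records their latency in a histogram at the service boundary:

```python
# Sketch: instrumenting a request handler with the Prometheus Python client.
# Metric names, label values, and handle_request() are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["service", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["service"]
)

def handle_request():
    """Stand-in for real request handling; fails about 5% of the time."""
    time.sleep(random.uniform(0.01, 0.2))
    return "500" if random.random() < 0.05 else "200"

def instrumented_handler(service="checkout"):
    with LATENCY.labels(service=service).time():            # record latency
        status = handle_request()
    REQUESTS.labels(service=service, status=status).inc()   # count by outcome

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        instrumented_handler()
```

Counters and histograms like these are exactly the raw series that the SLI queries in the next part aggregate (for example, http_requests_total and http_request_duration_seconds_bucket).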
Part 3: Collecting & Analyzing SLIs
Monitoring Tools
Several powerful tools are available for collecting, analyzing, and visualizing SLIs:
Prometheus
Prometheus has established itself as a core tool for SRE monitoring due to its flexibility, scalability, and open-source nature. It’s designed specifically for complex, dynamic environments and excels at real-time metrics and alerting. Key features include:
- Powerful Query Language (PromQL) for sophisticated queries and insights
- Multi-dimensional data model that organizes time series data by metric names and key/value labels
- Pull model for data collection that’s highly suitable for rapidly changing infrastructure like containers
Grafana
Grafana is an open-source, composable platform for monitoring and observability that allows you to query, visualize, and analyze metrics regardless of where they’re stored. Its powerful visualization capabilities make it indispensable for SREs because it can:
- Integrate with Prometheus and a wide range of other data sources and platforms
- Create dashboards providing real-time insights into system health
- Support a wide range of visualizations from simple graphs to complex heatmaps
Datadog
Datadog is a commercial monitoring and analytics platform for cloud-scale applications that integrates with various services and tools to provide comprehensive visibility. It offers:
- Application Performance Monitoring (APM)
- Log management and security monitoring
- Real-time dashboards and alerting capabilities
Google Cloud Monitoring
Google Cloud’s Operations suite uses machine learning to group related issues, which Google says helps teams identify and resolve problems faster. It provides:
- Integrated monitoring for Google Cloud resources
- Custom metrics and dashboards
- ML-powered alerting and anomaly detection
New Relic
New Relic is another commercial observability platform providing real-time insights into application performance and infrastructure. It combines:
- Application monitoring
- Infrastructure monitoring
- Real-user monitoring
- Intuitive interface and automation capabilities
Querying Metrics
Once you’ve collected SLI data, you need effective ways to query and analyze it. For Prometheus-based systems, PromQL (Prometheus Query Language) is the standard tool:
PromQL Basics
- Simple Selectors: Retrieve time series data
http_requests_total{status="200"}
- Range Vectors: Get data over time
http_requests_total{status="200"}[5m]
- Aggregation: Combine multiple time series
sum(rate(http_requests_total{status="200"}[5m])) by (service)
- Functions: Apply transformations
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Writing SLI-focused Queries
For availability SLIs:
sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
For latency SLIs:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
For error rate SLIs:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
For throughput SLIs:
sum(rate(http_requests_total[5m]))
For saturation SLIs:
sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)
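Beyond dashboards, these queries can also be evaluated programmatically. The sketch below (the server address is hypothetical) fetches the availability SLI above through Prometheus’s HTTP query API:

```python
# Sketch: fetching the availability SLI value via the Prometheus HTTP API.
# PROMETHEUS_URL is a placeholder; the query mirrors the availability example above.
import requests  # third-party: pip install requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical address
AVAILABILITY_QUERY = (
    'sum(rate(http_requests_total{status!~"5.."}[5m])) '
    "/ sum(rate(http_requests_total[5m]))"
)

def current_availability():
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": AVAILABILITY_QUERY},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

availability = current_availability()
if availability is not None:
    print(f"Availability SLI: {availability * 100:.3f}%")
```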
Alerting Based on SLIs
Effective alerting is crucial for responding to SLI issues before they significantly impact users:
Defining Alert Thresholds
Alert thresholds should be set based on your SLOs. For example, if your SLO states that 99.9% of requests should be successful, you might set an alert when the success rate drops below 99.95% (providing a buffer for response)[5].
Common threshold approaches include:
- Static thresholds: Alert when a metric crosses a fixed value
- Dynamic thresholds: Alert based on deviation from historical patterns
- Burn rate alerts: Alert when error budget consumption accelerates (see the sketch after this list)
- Multi-level thresholds: Different severity alerts at different thresholds
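To make the burn-rate approach concrete, here is a small Python sketch (the SLO target, observed error rate, and alert thresholds are illustrative); a burn rate above 1 means the error budget will be exhausted before the SLO window ends:

```python
# Sketch: error-budget burn rate. All numbers are illustrative.

def burn_rate(observed_error_rate, slo_target):
    """How fast the error budget is being consumed relative to the allowed pace."""
    allowed_error_rate = 1.0 - slo_target      # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / allowed_error_rate

slo_target = 0.999             # 99.9% of requests should succeed
observed_error_rate = 0.004    # 0.4% of requests failing over the last hour

rate = burn_rate(observed_error_rate, slo_target)
print(f"Burn rate: {rate:.1f}x")  # 4.0x -> budget gone in a quarter of the window

# A common pattern pairs a fast-burn alert over a short window with a slower-burn
# alert over a longer one, for example:
if rate > 14:
    print("Page the on-call engineer")
elif rate > 2:
    print("Open a ticket for follow-up")
```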
Avoiding Alert Fatigue
Alert fatigue occurs when teams receive too many alerts, leading them to ignore or miss critical notifications. To avoid this:
- Only alert on symptoms that directly affect users, not causes
- Ensure alerts are actionable
- Implement proper alert routing to the right teams
- Use alert suppression during known issues or maintenance
- Regularly review and tune alert thresholds
Alert Levels
Implement a tiered alert system:
| Alert Level | Threshold | Action |
|---|---|---|
| Warning | 90% of SLO | Notify team lead |
| Critical | 95% of SLO | Page on-call engineer |
Part 4: Real-World Examples
Web Applications
Web applications typically focus on user experience metrics as their primary SLIs:
Key SLIs for Web Applications:
- Page Load Time: The time it takes for a page to become fully interactive
  - SLI Example: “95% of page loads complete within 2 seconds”
  - Implementation: Use the Navigation Timing API or Real User Monitoring (RUM) tools
- HTTP Error Rate: The percentage of HTTP requests that result in errors
  - SLI Example: “99.9% of HTTP requests return status codes other than 5xx”
  - Implementation: Monitor web server logs or application metrics
- Availability: Whether the website is accessible to users
  - SLI Example: “99.95% of health check probes succeed”
  - Implementation: External synthetic monitoring from multiple regions
- Time to First Byte (TTFB): How quickly the server starts sending data
  - SLI Example: “90% of requests have TTFB < 100ms”
  - Implementation: Server-side timing or RUM data
- Client-Side Errors: JavaScript errors experienced by users
  - SLI Example: “Less than 0.1% of page views result in JavaScript errors”
  - Implementation: Client-side error tracking
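To show one way the HTTP error-rate SLI above could be derived from web server logs, here is a small Python sketch; the log lines and the regular expression assume a combined-log-style format and are purely illustrative:

```python
# Sketch: computing an HTTP error-rate SLI from access-log lines.
# Assumes a combined-log-like format with the status code after the request string.
import re

STATUS = re.compile(r'"\s(\d{3})\s')  # three-digit status right after the closing quote

sample_log = [
    '203.0.113.7 - - [01/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '203.0.113.8 - - [01/Jan/2025:10:00:01 +0000] "GET /api HTTP/1.1" 500 87',
    '203.0.113.9 - - [01/Jan/2025:10:00:02 +0000] "GET /img HTTP/1.1" 304 0',
]

statuses = [m.group(1) for line in sample_log if (m := STATUS.search(line))]
errors = sum(1 for s in statuses if s.startswith("5"))

sli = 100.0 * (len(statuses) - errors) / len(statuses)
print(f"{sli:.2f}% of requests returned non-5xx status codes")
```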
APIs & Microservices
APIs and microservices require SLIs that reflect both external quality and internal health:
Key SLIs for APIs & Microservices:
- Request Success Rate: Percentage of successful API calls
  - SLI Example: “99.95% of API requests return successful responses (non-5xx status codes)”
  - Implementation: API gateway logs or service instrumentation
- Latency: Response time for API requests
  - SLI Example: “99% of API requests complete in under 300ms”
  - Implementation: Service-level timing metrics
- Throughput: Request handling capacity
  - SLI Example: “API handles 1,000 requests per second with < 1% error rate”
  - Implementation: Load balancer metrics or application counters
- Dependency Health: Success rate of calls to dependencies
  - SLI Example: “99.9% of database queries complete successfully”
  - Implementation: Client library instrumentation
- Resource Utilization: CPU, memory, and connection usage
  - SLI Example: “Services maintain < 80% CPU utilization during peak load”
  - Implementation: Container or host-level metrics
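One way to collect the dependency-health SLI is to wrap every call to a downstream system and count successes and failures. The sketch below is a hypothetical illustration (query_database and the in-memory counters stand in for real instrumentation):

```python
# Sketch: measuring dependency health by wrapping calls to a downstream system.
# query_database() and the in-memory counters are hypothetical placeholders.
import time
from collections import Counter

dependency_calls = Counter()     # (dependency, outcome) -> count
dependency_latency_ms = {}       # dependency -> last observed latency

def track_dependency(name):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                dependency_calls[(name, "success")] += 1
                return result
            except Exception:
                dependency_calls[(name, "failure")] += 1
                raise
            finally:
                dependency_latency_ms[name] = (time.monotonic() - start) * 1000
        return wrapper
    return decorator

@track_dependency("orders-db")
def query_database(sql):
    time.sleep(0.01)             # stand-in for a real database round trip
    return []

query_database("SELECT 1")
ok = dependency_calls[("orders-db", "success")]
total = ok + dependency_calls[("orders-db", "failure")]
print(f"orders-db success rate: {100.0 * ok / total:.2f}%")
```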
Databases
Databases require specialized SLIs that focus on data integrity, performance, and availability:
Key SLIs for Databases:
- Query Latency: Time to execute database queries
  - SLI Example: “95% of queries complete in under 50ms”
  - Implementation: Database performance monitoring or client-side timing
- Error Rate: Failed query percentage
  - SLI Example: “99.99% of write operations succeed”
  - Implementation: Database logs or client error tracking
- Replication Lag: Delay between primary and replica databases
  - SLI Example: “Replication lag remains under 10 seconds for 99.9% of the time”
  - Implementation: Database-specific replication metrics
- Connection Utilization: Usage of available database connections
  - SLI Example: “Connection pool utilization stays below 80%”
  - Implementation: Database metrics or connection pool monitoring
- Storage Utilization: Disk space usage and growth rate
  - SLI Example: “Storage utilization increases by less than 5% per day”
  - Implementation: Database or filesystem metrics
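The replication-lag SLI above is time-based rather than request-based; one simple way to approximate it is to sample the lag periodically and compute the share of samples under the threshold, as in this illustrative sketch:

```python
# Sketch: replication-lag SLI from periodically sampled lag values.
# The samples and the 10-second threshold are illustrative.

lag_samples_seconds = [0.8, 1.2, 0.9, 14.5, 2.1, 0.7, 11.0, 1.5]  # e.g. one per minute

within_threshold = sum(1 for lag in lag_samples_seconds if lag < 10)
sli = 100.0 * within_threshold / len(lag_samples_seconds)

print(f"Replication lag was under 10 seconds for {sli:.1f}% of sampled intervals")
```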
CDNs / Edge Services
Content Delivery Networks and edge services focus on content distribution efficiency:
Key SLIs for CDNs / Edge Services:
- Cache Hit Ratio: Percentage of requests served from cache
  - SLI Example: “90% of eligible content is served from cache”
  - Implementation: CDN analytics or custom headers
- Edge Latency: Response time from edge locations
  - SLI Example: “95% of edge responses complete in under 100ms”
  - Implementation: CDN metrics or RUM data
- Origin Shield Effectiveness: Reduction in origin requests
  - SLI Example: “Less than 10% of requests reach the origin servers”
  - Implementation: CDN analytics and origin server metrics
- Geographic Performance: Latency across different regions
  - SLI Example: “99% of users in each region experience < 200ms latency”
  - Implementation: Synthetic monitoring from multiple locations
- Purge/Invalidation Time: Time to refresh content after changes
  - SLI Example: “95% of cache purge operations complete within 5 minutes globally”
  - Implementation: CDN API metrics and content freshness checks
CI/CD Pipelines
CI/CD pipelines benefit from SLIs that measure build and deployment reliability:
Key SLIs for CI/CD Pipelines:
- Build Success Rate: Percentage of successful builds
  - SLI Example: “99% of builds in the main branch succeed”
  - Implementation: CI system metrics or logs
- Build Duration: Time to complete builds
  - SLI Example: “90% of builds complete in under 10 minutes”
  - Implementation: CI system timing metrics
- Deployment Success Rate: Percentage of successful deployments
  - SLI Example: “99.5% of production deployments succeed without rollback”
  - Implementation: Deployment tool metrics or manual tracking
- Deployment Duration: Time to complete deployments
  - SLI Example: “95% of deployments complete in under 15 minutes”
  - Implementation: Deployment tool timing metrics
- Time to Recovery: Time to recover from failed deployments
  - SLI Example: “99% of failed deployments are recovered within 30 minutes”
  - Implementation: Incident tracking or deployment metrics
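As a final illustration, the sketch below derives the deployment success rate and time-to-recovery SLIs from a list of deployment records; the record fields are hypothetical and would normally come from a deployment tool or incident tracker:

```python
# Sketch: deployment SLIs derived from deployment records.
# The record fields (status, recovery_minutes) are hypothetical.

deployments = [
    {"status": "success"},
    {"status": "success"},
    {"status": "rolled_back", "recovery_minutes": 22},
    {"status": "success"},
]

succeeded = sum(1 for d in deployments if d["status"] == "success")
success_rate = 100.0 * succeeded / len(deployments)

recoveries = [d["recovery_minutes"] for d in deployments if d["status"] != "success"]
within_30 = sum(1 for m in recoveries if m <= 30)
recovery_sli = 100.0 * within_30 / len(recoveries) if recoveries else 100.0

print(f"Deployment success rate: {success_rate:.1f}%")
print(f"Failed deployments recovered within 30 minutes: {recovery_sli:.1f}%")
```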