What are SLIs?
Service Level Indicators (SLIs) are carefully defined quantitative measures of some aspect of the level of service that is provided. They form the foundation of reliability engineering by providing objective measurements that reflect user experience and system performance[3].
SLIs are crucial in both Site Reliability Engineering (SRE) and DevOps because they provide objective data for making informed decisions about system reliability. Without clearly defined SLIs, teams lack the visibility needed to understand if their services are meeting user expectations or where improvements are needed[7].
In practical terms, SLIs are metrics measured over time that indicate the health of a service[1]. They serve as the raw measurements that feed into Service Level Objectives (SLOs), which in turn inform Service Level Agreements (SLAs). This hierarchical relationship creates a framework for managing reliability:
- SLIs measure specific aspects of service performance
- SLOs set targets for those measurements
- SLAs formalize commitments based on those targets
For example, if we consider a web application, an SLI might measure the percentage of requests that complete in under 200ms. The corresponding SLO might state that 99% of requests should complete within that timeframe. Finally, an SLA might formalize this commitment to customers with specific consequences if the objective isn’t met[11].
SLIs vs SLOs vs SLAs
Understanding the distinction between these three concepts is essential for effective reliability engineering:
Service Level Indicators (SLIs) are the actual measurements of service performance. They are the metrics that matter to users and reflect the quality of service being delivered. Examples include error rates, latency measurements, and system throughput[3].
Service Level Objectives (SLOs) are target values or ranges for a service level measured by an SLI. They define the expectations for how reliable a service should be. For instance, an SLO might state that 99.9% of requests should return successfully over a 30-day window[1].
Service Level Agreements (SLAs) are formal contracts that include SLOs and specify consequences if service levels aren’t met. SLAs typically include penalties, such as financial compensation or service credits, when objectives aren’t achieved[9].
Here’s a practical example to illustrate the relationship:
- SLI: Percentage of API requests that return a response within 300 milliseconds
- SLO: 99% of API requests must return within 300 milliseconds over a 7-day period
- SLA: If the 99% threshold isn’t met for a calendar month, customers receive a 10% service credit[4]
SLIs provide the raw data, SLOs set the targets, and SLAs formalize the business commitments around those targets.
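To make the hierarchy concrete, here is a minimal Python sketch (the latencies, thresholds, and function name are illustrative, not taken from any particular tool) that computes a latency SLI from raw request timings and checks it against an SLO target:

```python
# Minimal sketch: compute a latency SLI and compare it against an SLO target.
# All values are illustrative; a real system would read them from a metrics store.

def latency_sli(latencies_ms, threshold_ms=300):
    """Percentage of requests that completed within the latency threshold."""
    if not latencies_ms:
        return 100.0
    fast = sum(1 for l in latencies_ms if l <= threshold_ms)
    return 100.0 * fast / len(latencies_ms)

# Raw measurements (the SLI input) collected over the SLO window.
request_latencies_ms = [120, 250, 310, 95, 280, 150, 400, 175]

sli = latency_sli(request_latencies_ms, threshold_ms=300)
slo_target = 99.0  # "99% of API requests must return within 300 milliseconds"

print(f"SLI: {sli:.2f}% of requests within 300 ms")
print("SLO met" if sli >= slo_target else "SLO violated -- SLA consequences may follow")
```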
The Golden Signals
When implementing SLIs, it’s helpful to start with the “Golden Signals” – four key metrics that provide comprehensive insight into service health:
Latency: The time it takes to service a request. This includes both successful requests and failed requests (which can time out or return an error). Latency is typically measured at various percentiles (e.g., 50th, 90th, 99th) to capture both the typical and worst-case user experiences[1].
Traffic: A measure of how much demand is being placed on your system. For web services, this is typically measured in requests per second. For data processing systems, it might be transactions or records processed per second[3].
Errors: The rate of requests that fail. Failures can be explicit (e.g., HTTP 500 errors) or implicit (successful HTTP responses that contain error messages). Error rates are often expressed as a percentage of total requests[1].
Saturation: How “full” your service is or how close it is to its capacity limit. This could measure CPU utilization, memory usage, I/O operations, or network bandwidth. Saturation metrics help predict when a system will begin to degrade due to resource constraints[3].
These Golden Signals provide a balanced view of service health and form the basis for many effective SLIs. By monitoring these four aspects, teams can quickly identify issues affecting user experience and system performance.
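As an illustration of how the latency signal is typically summarized, the short Python sketch below (the samples and the nearest-rank helper are made up for the example) computes the 50th, 90th, and 99th percentiles of a batch of response times:

```python
# Sketch: summarizing the latency golden signal at several percentiles.
# Percentiles expose the worst-case behavior that an average would hide.
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples (0 < p <= 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [85, 92, 110, 120, 135, 150, 180, 220, 450, 1200]

for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")  # p50: 135 ms, p90: 450 ms, p99: 1200 ms
```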
Part 2: Designing SLIs
How to Design Good SLIs
Effective SLIs share several key characteristics:
- User-centric: Good SLIs reflect what users actually experience and care about. They should correlate strongly with user satisfaction.
- Measurable: SLIs must be quantifiable and objectively measurable through automated systems.
- Actionable: When an SLI indicates a problem, it should be clear what actions might resolve the issue.
- Simple: SLIs should be easy to understand and explain to both technical and non-technical stakeholders.
- Consistent: The measurement methodology should produce consistent results over time to enable meaningful comparisons.
When designing SLIs, it’s important to distinguish between user-centric and system-centric metrics. User-centric SLIs directly measure the user experience, such as page load time or transaction success rate. System-centric SLIs focus on internal system performance, such as database query time or CPU utilization.
While both types have value, user-centric SLIs should generally take precedence because they more directly reflect service quality as experienced by users. System-centric SLIs are most valuable when they have a clear correlation with user experience or when they help diagnose issues identified by user-centric SLIs.
Choosing the Right Metrics
Selecting appropriate metrics for your SLIs requires careful consideration of what truly matters for your service:
Quantitative vs. Qualitative Metrics
Quantitative metrics provide numerical measurements that can be objectively tracked and compared over time. These include response times, error rates, and throughput measurements. Qualitative metrics attempt to measure subjective aspects of the user experience, such as satisfaction scores or feature usability. While SLIs typically focus on quantitative metrics due to their objectivity, qualitative feedback can help validate that your SLIs are measuring what truly matters to users.
Leading vs. Lagging Indicators
Leading indicators predict future performance or issues before they significantly impact users. For example, increasing memory usage might predict an upcoming out-of-memory error. Lagging indicators measure outcomes after they’ve occurred, such as the number of failed requests. A balanced set of SLIs should include both types: leading indicators to provide early warnings and lagging indicators to confirm actual service quality.
When choosing metrics, focus on those that:
- Have a direct impact on user experience
- Align with business objectives
- Can be consistently measured
- Provide actionable insights when they deviate from expected values
Types of SLIs
Different services require different types of SLIs based on their nature and user expectations. Here are common SLI types with examples:
Availability SLIs
- Percentage of successful requests vs. total requests
- Percentage of time the service is operational
- Ratio of successful API calls to total calls
Example: “99.9% of API requests return a valid response (non-5xx status code)”
Latency SLIs
- Time taken for requests to complete
- Percentile-based measurements (e.g., 95th percentile response time)
- Time to first byte or time to interactive
Example: “95% of web page loads complete within 2 seconds”
Throughput SLIs
- Requests processed per second
- Transactions completed per minute
- Data volume processed per hour
Example: “The system can handle 10,000 requests per second during peak times”
Error Rate SLIs
- Percentage of failed requests
- Rate of specific error types
- Failed transactions as a percentage of total
Example: “Fewer than 0.05% of all transactions result in an error”
Quality/Correctness SLIs
- Percentage of data processed correctly
- Proportion of responses that were served in an undegraded state
- Accuracy of results compared to expected outcomes
Example: “99.99% of database writes are successfully replicated to all nodes”
For data processing systems, additional SLI types include:
Freshness SLIs
- How recently data was updated
- Time lag between data creation and availability
Example: “90% of dashboard data is less than 5 minutes old”
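A freshness SLI can be computed directly from record timestamps. The sketch below is a hypothetical illustration (the five-minute threshold and the record ages are made up) of measuring the share of dashboard records updated recently:

```python
# Sketch: freshness SLI -- share of records updated within the last 5 minutes.
from datetime import datetime, timedelta, timezone

def freshness_sli(last_updated, max_age=timedelta(minutes=5)):
    now = datetime.now(timezone.utc)
    fresh = sum(1 for t in last_updated if now - t <= max_age)
    return 100.0 * fresh / len(last_updated) if last_updated else 100.0

# Illustrative update timestamps for records backing a dashboard.
now = datetime.now(timezone.utc)
records = [now - timedelta(minutes=m) for m in (1, 2, 3, 4, 8, 12)]

print(f"Freshness SLI: {freshness_sli(records):.1f}% of data is less than 5 minutes old")
```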
Coverage SLIs
- Percentage of data successfully processed
- Proportion of expected records that were handled
Example: “99% of incoming records are successfully processed within 10 minutes”
Durability SLIs
- Probability of data being retained over time
- Percentage of data that can be successfully recovered
Example: “99.999999% of objects stored will be retained for one year”
Instrumenting Applications
To collect SLI data, applications must be properly instrumented. This involves adding code or configuration to measure and expose metrics. Here are key approaches:
Code-Level Instrumentation
- Add timing code around critical functions
- Count errors and successful operations
- Track resource usage and saturation
Infrastructure Instrumentation
- Configure monitoring agents on servers
- Enable metrics collection in cloud platforms
- Set up network monitoring
Application Framework Instrumentation
- Use built-in metrics capabilities of frameworks
- Add middleware for consistent measurement
- Leverage auto-instrumentation libraries
Popular tools and libraries for instrumenting applications include:
- Prometheus Client Libraries – For languages like Go, Python, Java, and others
- OpenTelemetry – For collecting traces, metrics, and logs
- StatsD – For sending custom metrics to collection systems
- Micrometer – For JVM-based applications
- Application Performance Monitoring (APM) tools like New Relic, Datadog, and Dynatrace
When instrumenting applications, follow these best practices:
- Measure at service boundaries to capture the full user experience
- Include contextual information like service name and environment
- Consider the performance impact of instrumentation itself
- Standardize naming conventions for consistency
- Instrument both successful operations and failures
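As a minimal sketch of code-level instrumentation with the Prometheus Python client (the metric names, labels, and handle_request stand-in are assumptions for the example, not a prescribed standard), the code below counts requests by outcome and records their latency in a histogram at the service boundary:

```python
# Sketch: instrumenting a request handler with the Prometheus Python client.
# Metric names, label values, and handle_request() are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["service", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["service"]
)

def handle_request():
    """Stand-in for real request handling; fails about 5% of the time."""
    time.sleep(random.uniform(0.01, 0.2))
    return "500" if random.random() < 0.05 else "200"

def instrumented_handler(service="checkout"):
    with LATENCY.labels(service=service).time():            # record latency
        status = handle_request()
    REQUESTS.labels(service=service, status=status).inc()   # count by outcome

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        instrumented_handler()
```

Counters and histograms like these are exactly the raw series that the SLI queries in the next part aggregate (for example, http_requests_total and http_request_duration_seconds_bucket).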
Part 3: Collecting & Analyzing SLIs
Monitoring Tools
Several powerful tools are available for collecting, analyzing, and visualizing SLIs:
Prometheus
Prometheus has established itself as a core tool for SRE monitoring due to its flexibility, scalability, and open-source nature. It’s designed specifically for complex, dynamic environments and excels at real-time metrics and alerting. Key features include:
- Powerful Query Language (PromQL) for sophisticated queries and insights
- Multi-dimensional data model that organizes time series data by metric names and key/value labels
- Pull model for data collection that’s highly suitable for rapidly changing infrastructure like containers
Grafana
Grafana is an open-source, composable platform for monitoring and observability that allows you to query, visualize, and analyze metrics regardless of where they’re stored. Its powerful visualization capabilities make it indispensable for SREs because it can:
- Integrate with Prometheus and a wide range of other data sources and platforms
- Create dashboards providing real-time insights into system health
- Support a wide range of visualizations from simple graphs to complex heatmaps
Datadog
Datadog is a commercial monitoring and analytics platform for cloud-scale applications that integrates with various services and tools to provide comprehensive visibility. It offers:
- Application Performance Monitoring (APM)
- Log management and security monitoring
- Real-time dashboards and alerting capabilities
Google Cloud Monitoring
Google Cloud’s Operations suite uses machine learning to group related issues, which Google says helps teams identify and resolve problems faster. It provides:
- Integrated monitoring for Google Cloud resources
- Custom metrics and dashboards
- ML-powered alerting and anomaly detection
New Relic
New Relic is another commercial observability platform providing real-time insights into application performance and infrastructure. It combines:
- Application monitoring
- Infrastructure monitoring
- Real-user monitoring
- Intuitive interface and automation capabilities
Querying Metrics
Once you’ve collected SLI data, you need effective ways to query and analyze it. For Prometheus-based systems, PromQL (Prometheus Query Language) is the standard tool:
PromQL Basics
- Simple Selectors: Retrieve time series data
http_requests_total{status="200"}
- Range Vectors: Get data over time
http_requests_total{status="200"}[5m]
- Aggregation: Combine multiple time series
sum(rate(http_requests_total{status="200"}[5m])) by (service)
- Functions: Apply transformations
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Writing SLI-focused Queries
For availability SLIs:
sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
For latency SLIs:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
For error rate SLIs:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
For throughput SLIs:
sum(rate(http_requests_total[5m]))
For saturation SLIs:
sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)
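Beyond dashboards, these queries can also be evaluated programmatically. The sketch below (the server address is hypothetical) fetches the availability SLI above through Prometheus’s HTTP query API:

```python
# Sketch: fetching the availability SLI value via the Prometheus HTTP API.
# PROMETHEUS_URL is a placeholder; the query mirrors the availability example above.
import requests  # third-party: pip install requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical address
AVAILABILITY_QUERY = (
    'sum(rate(http_requests_total{status!~"5.."}[5m])) '
    "/ sum(rate(http_requests_total[5m]))"
)

def current_availability():
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": AVAILABILITY_QUERY},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

availability = current_availability()
if availability is not None:
    print(f"Availability SLI: {availability * 100:.3f}%")
```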
Alerting Based on SLIs
Effective alerting is crucial for responding to SLI issues before they significantly impact users:
Defining Alert Thresholds
Alert thresholds should be set based on your SLOs. For example, if your SLO states that 99.9% of requests should be successful, you might set an alert when the success rate drops below 99.95% (providing a buffer for response)[5].
Common threshold approaches include:
- Static thresholds: Alert when a metric crosses a fixed value
- Dynamic thresholds: Alert based on deviation from historical patterns
- Burn rate alerts: Alert when error budget consumption accelerates (see the sketch after this list)
- Multi-level thresholds: Different severity alerts at different thresholds
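To make the burn-rate approach concrete, here is a small Python sketch (the SLO target, observed error rate, and alert thresholds are illustrative); a burn rate above 1 means the error budget will be exhausted before the SLO window ends:

```python
# Sketch: error-budget burn rate. All numbers are illustrative.

def burn_rate(observed_error_rate, slo_target):
    """How fast the error budget is being consumed relative to the allowed pace."""
    allowed_error_rate = 1.0 - slo_target      # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / allowed_error_rate

slo_target = 0.999             # 99.9% of requests should succeed
observed_error_rate = 0.004    # 0.4% of requests failing over the last hour

rate = burn_rate(observed_error_rate, slo_target)
print(f"Burn rate: {rate:.1f}x")  # 4.0x -> budget gone in a quarter of the window

# A common pattern pairs a fast-burn alert over a short window with a slower-burn
# alert over a longer one, for example:
if rate > 14:
    print("Page the on-call engineer")
elif rate > 2:
    print("Open a ticket for follow-up")
```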
Avoiding Alert Fatigue
Alert fatigue occurs when teams receive too many alerts, leading them to ignore or miss critical notifications. To avoid this:
- Only alert on symptoms that directly affect users, not causes
- Ensure alerts are actionable
- Implement proper alert routing to the right teams
- Use alert suppression during known issues or maintenance
- Regularly review and tune alert thresholds
Alert Levels
Implement a tiered alert system:
| Alert Level | Threshold | Action |
|---|---|---|
| Warning | 90% of SLO | Notify team lead |
| Critical | 95% of SLO | Page on-call engineer |
Part 4: Real-World Examples
Web Applications
Web applications typically focus on user experience metrics as their primary SLIs:
Key SLIs for Web Applications:
- Page Load Time: The time it takes for a page to become fully interactive
  - SLI Example: “95% of page loads complete within 2 seconds”
  - Implementation: Use the Navigation Timing API or Real User Monitoring (RUM) tools
- HTTP Error Rate: The percentage of HTTP requests that result in errors
  - SLI Example: “99.9% of HTTP requests return status codes other than 5xx”
  - Implementation: Monitor web server logs or application metrics
- Availability: Whether the website is accessible to users
  - SLI Example: “99.95% of health check probes succeed”
  - Implementation: External synthetic monitoring from multiple regions
- Time to First Byte (TTFB): How quickly the server starts sending data
  - SLI Example: “90% of requests have TTFB < 100ms”
  - Implementation: Server-side timing or RUM data
- Client-Side Errors: JavaScript errors experienced by users
  - SLI Example: “Less than 0.1% of page views result in JavaScript errors”
  - Implementation: Client-side error tracking
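To show one way the HTTP error-rate SLI above could be derived from web server logs, here is a small Python sketch; the log lines and the regular expression assume a combined-log-style format and are purely illustrative:

```python
# Sketch: computing an HTTP error-rate SLI from access-log lines.
# Assumes a combined-log-like format with the status code after the request string.
import re

STATUS = re.compile(r'"\s(\d{3})\s')  # three-digit status right after the closing quote

sample_log = [
    '203.0.113.7 - - [01/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '203.0.113.8 - - [01/Jan/2025:10:00:01 +0000] "GET /api HTTP/1.1" 500 87',
    '203.0.113.9 - - [01/Jan/2025:10:00:02 +0000] "GET /img HTTP/1.1" 304 0',
]

statuses = [m.group(1) for line in sample_log if (m := STATUS.search(line))]
errors = sum(1 for s in statuses if s.startswith("5"))

sli = 100.0 * (len(statuses) - errors) / len(statuses)
print(f"{sli:.2f}% of requests returned non-5xx status codes")
```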
APIs & Microservices
APIs and microservices require SLIs that reflect both external quality and internal health:
Key SLIs for APIs & Microservices:
- Request Success Rate: Percentage of successful API calls
  - SLI Example: “99.95% of API requests return successful responses (non-5xx status codes)”
  - Implementation: API gateway logs or service instrumentation
- Latency: Response time for API requests
  - SLI Example: “99% of API requests complete in under 300ms”
  - Implementation: Service-level timing metrics
- Throughput: Request handling capacity
  - SLI Example: “API handles 1,000 requests per second with < 1% error rate”
  - Implementation: Load balancer metrics or application counters
- Dependency Health: Success rate of calls to dependencies
  - SLI Example: “99.9% of database queries complete successfully”
  - Implementation: Client library instrumentation
- Resource Utilization: CPU, memory, and connection usage
  - SLI Example: “Services maintain < 80% CPU utilization during peak load”
  - Implementation: Container or host-level metrics
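One way to collect the dependency-health SLI is to wrap every call to a downstream system and count successes and failures. The sketch below is a hypothetical illustration (query_database and the in-memory counters stand in for real instrumentation):

```python
# Sketch: measuring dependency health by wrapping calls to a downstream system.
# query_database() and the in-memory counters are hypothetical placeholders.
import time
from collections import Counter

dependency_calls = Counter()     # (dependency, outcome) -> count
dependency_latency_ms = {}       # dependency -> last observed latency

def track_dependency(name):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                dependency_calls[(name, "success")] += 1
                return result
            except Exception:
                dependency_calls[(name, "failure")] += 1
                raise
            finally:
                dependency_latency_ms[name] = (time.monotonic() - start) * 1000
        return wrapper
    return decorator

@track_dependency("orders-db")
def query_database(sql):
    time.sleep(0.01)             # stand-in for a real database round trip
    return []

query_database("SELECT 1")
ok = dependency_calls[("orders-db", "success")]
total = ok + dependency_calls[("orders-db", "failure")]
print(f"orders-db success rate: {100.0 * ok / total:.2f}%")
```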
Databases
Databases require specialized SLIs that focus on data integrity, performance, and availability:
Key SLIs for Databases:
- Query Latency: Time to execute database queries
  - SLI Example: “95% of queries complete in under 50ms”
  - Implementation: Database performance monitoring or client-side timing
- Error Rate: Failed query percentage
  - SLI Example: “99.99% of write operations succeed”
  - Implementation: Database logs or client error tracking
- Replication Lag: Delay between primary and replica databases
  - SLI Example: “Replication lag remains under 10 seconds for 99.9% of the time”
  - Implementation: Database-specific replication metrics
- Connection Utilization: Usage of available database connections
  - SLI Example: “Connection pool utilization stays below 80%”
  - Implementation: Database metrics or connection pool monitoring
- Storage Utilization: Disk space usage and growth rate
  - SLI Example: “Storage utilization increases by less than 5% per day”
  - Implementation: Database or filesystem metrics
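The replication-lag SLI above is time-based rather than request-based; one simple way to approximate it is to sample the lag periodically and compute the share of samples under the threshold, as in this illustrative sketch:

```python
# Sketch: replication-lag SLI from periodically sampled lag values.
# The samples and the 10-second threshold are illustrative.

lag_samples_seconds = [0.8, 1.2, 0.9, 14.5, 2.1, 0.7, 11.0, 1.5]  # e.g. one per minute

within_threshold = sum(1 for lag in lag_samples_seconds if lag < 10)
sli = 100.0 * within_threshold / len(lag_samples_seconds)

print(f"Replication lag was under 10 seconds for {sli:.1f}% of sampled intervals")
```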
CDNs / Edge Services
Content Delivery Networks and edge services focus on content distribution efficiency:
Key SLIs for CDNs / Edge Services:
- Cache Hit Ratio: Percentage of requests served from cache
  - SLI Example: “90% of eligible content is served from cache”
  - Implementation: CDN analytics or custom headers
- Edge Latency: Response time from edge locations
  - SLI Example: “95% of edge responses complete in under 100ms”
  - Implementation: CDN metrics or RUM data
- Origin Shield Effectiveness: Reduction in origin requests
  - SLI Example: “Less than 10% of requests reach the origin servers”
  - Implementation: CDN analytics and origin server metrics
- Geographic Performance: Latency across different regions
  - SLI Example: “99% of users in each region experience < 200ms latency”
  - Implementation: Synthetic monitoring from multiple locations
- Purge/Invalidation Time: Time to refresh content after changes
  - SLI Example: “95% of cache purge operations complete within 5 minutes globally”
  - Implementation: CDN API metrics and content freshness checks
CI/CD Pipelines
CI/CD pipelines benefit from SLIs that measure build and deployment reliability:
Key SLIs for CI/CD Pipelines:
- Build Success Rate: Percentage of successful builds
  - SLI Example: “99% of builds in the main branch succeed”
  - Implementation: CI system metrics or logs
- Build Duration: Time to complete builds
  - SLI Example: “90% of builds complete in under 10 minutes”
  - Implementation: CI system timing metrics
- Deployment Success Rate: Percentage of successful deployments
  - SLI Example: “99.5% of production deployments succeed without rollback”
  - Implementation: Deployment tool metrics or manual tracking
- Deployment Duration: Time to complete deployments
  - SLI Example: “95% of deployments complete in under 15 minutes”
  - Implementation: Deployment tool timing metrics
- Time to Recovery: Time to recover from failed deployments
  - SLI Example: “99% of failed deployments are recovered within 30 minutes”
  - Implementation: Incident tracking or deployment metrics
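As a final illustration, the sketch below derives the deployment success rate and time-to-recovery SLIs from a list of deployment records; the record fields are hypothetical and would normally come from a deployment tool or incident tracker:

```python
# Sketch: deployment SLIs derived from deployment records.
# The record fields (status, recovery_minutes) are hypothetical.

deployments = [
    {"status": "success"},
    {"status": "success"},
    {"status": "rolled_back", "recovery_minutes": 22},
    {"status": "success"},
]

succeeded = sum(1 for d in deployments if d["status"] == "success")
success_rate = 100.0 * succeeded / len(deployments)

recoveries = [d["recovery_minutes"] for d in deployments if d["status"] != "success"]
within_30 = sum(1 for m in recoveries if m <= 30)
recovery_sli = 100.0 * within_30 / len(recoveries) if recoveries else 100.0

print(f"Deployment success rate: {success_rate:.1f}%")
print(f"Failed deployments recovered within 30 minutes: {recovery_sli:.1f}%")
```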