What are rules in Prometheus?
A Prometheus rule is a way to run a PromQL expression at a regular interval and store the result in the Prometheus time series database (TSDB) for later use, either to keep a derived value available as its own series or to evaluate an alert condition.
Why use rules in Prometheus?
Time series queries can quickly become quite complicated to remember and type using the Expression Browser in the default Prometheus User Interface.
100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
This is not so bad; it gives the percentage of memory that is not free (100 minus the free percentage) on the server running the Prometheus Node Exporter. Still, rather than remembering and typing this query every time we want that answer, we can create a recording rule that runs at a chosen interval and makes the result available as its own time series.
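How often rules run is set in prometheus.yml. As a minimal sketch, the fragment below sets the global evaluation interval; the 15s values are illustrative assumptions rather than recommendations (both settings default to 1m when omitted).

```yaml
# prometheus.yml (fragment); the 15s values are illustrative assumptions
global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often recording and alerting rules are evaluated
```

A rule group can override the global value with its own interval setting.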
Types of rules in Prometheus
Prometheus supports two types of rules, which are configured in rule files and then evaluated at regular intervals:
- Recording rules pre-calculate frequently used or computationally expensive queries. The results of those rules are saved into their own time series.
- Alerting rules, on the other hand, let you specify the conditions under which an alert fires. They are also based on PromQL expressions. Firing alerts are sent to Alertmanager, which can route notifications to an external service such as Slack.
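To make the difference concrete, here is a small sketch of one rule file holding both kinds of rule. The group name, alert name, threshold, and 5m duration are hypothetical choices for illustration; the metric names are the node exporter series used elsewhere in this post.

```yaml
groups:
  - name: example_rules                 # hypothetical group name
    rules:
      # Recording rule: the result is saved as a new time series.
      - record: node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
      # Alerting rule: fires once the condition has held for 5 minutes.
      - alert: RootFilesystemLowSpace   # hypothetical alert name
        expr: node_filesystem_free_percent <= 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low free disk space on {{ $labels.instance }}"
```

Rules within a group are evaluated in order, so the alerting rule can build on the series that the recording rule produces.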
How to add recording rules and alerting rules in Prometheus?
Step 1: Create a prometheus.rules.yml file in the same directory where prometheus.yml is stored, e.g. /etc/prometheus/prometheus.rules.yml.
Step 2: Add a reference to prometheus.rules.yml in the rule_files section of prometheus.yml.
Step 3: Restart the Prometheus service.
Step 4: Refresh the Prometheus user interface and check the dropdown.
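For the prometheus.yml side of these steps, a minimal sketch might look like the fragment below. The Alertmanager address localhost:9093 is an assumption; the alerting section only matters if the rule file contains alerting rules and an Alertmanager is running.

```yaml
# prometheus.yml (fragment)
rule_files:
  - "prometheus.rules.yml"   # resolved relative to the directory of prometheus.yml

# Only needed for alerting rules; the target address is an assumption.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
```

The rule_files entries may also be glob patterns, which helps once rules are split across several files.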
How to check the rules config file?
$ promtool check rules /etc/prometheus/prometheus.rules.yml
Steps to enable a Prometheus rule
================================
Change into the /usr/local/bin/prometheus folder:
    cd /usr/local/bin/prometheus
Create a new file called prometheus_rules.yml:
    sudo nano prometheus_rules.yml
Add our test expression as a recording rule:
    groups:
      - name: custom_rules
        rules:
          # Despite the name, this expression yields the percentage of memory that is not free.
          - record: node_memory_MemFree_percent
            expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
Save it, then verify that the syntax of the rules file is valid:
    ./promtool check rules prometheus_rules.yml
Now add the prometheus_rules.yml reference to the rule_files section of prometheus.yml:
    rule_files:
      - "prometheus_rules.yml"
Restart the Prometheus service and check its status:
    $ sudo service prometheus restart
    $ sudo service prometheus status
Refresh the Prometheus user interface and check the dropdown.
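The syntax check does not confirm that a rule computes what you expect. promtool can also run unit tests against a rule file; the sketch below feeds synthetic samples into the recording rule above and checks the recorded value. The test file name prometheus_rules_test.yml, the instance label, and the sample values are assumptions chosen for illustration.

```yaml
# prometheus_rules_test.yml (hypothetical file name)
rule_files:
  - prometheus_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic input: 250 bytes free out of 1000 bytes total.
    input_series:
      - series: 'node_memory_MemFree_bytes{instance="host1"}'
        values: '250 250 250'
      - series: 'node_memory_MemTotal_bytes{instance="host1"}'
        values: '1000 1000 1000'
    promql_expr_test:
      - expr: node_memory_MemFree_percent
        eval_time: 1m
        exp_samples:
          # 100 - (100 * 250 / 1000) = 75
          - labels: 'node_memory_MemFree_percent{instance="host1"}'
            value: 75
```

Run it from the same folder with ./promtool test rules prometheus_rules_test.yml.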
Examples of Prometheus recording rules
Recording Rule Example 1
================================
    # Aggregating up requests per second that has a path label:
    - record: instance_path:requests:rate5m
      expr: rate(requests_total{job="myjob"}[5m])
    - record: path:requests:rate5m
      expr: sum without (instance)(instance_path:requests:rate5m{job="myjob"})
Recording Rule Example 2
================================
    # Calculating a request failure ratio and aggregating up to the job-level failure ratio:
    - record: instance_path:request_failures:rate5m
      expr: rate(request_failures_total{job="myjob"}[5m])
    - record: instance_path:request_failures_per_requests:ratio_rate5m
      expr: |2
          instance_path:request_failures:rate5m{job="myjob"}
        /
          instance_path:requests:rate5m{job="myjob"}
    # Aggregate up numerator and denominator, then divide to get path-level ratio.
    - record: path:request_failures_per_requests:ratio_rate5m
      expr: |2
          sum without (instance)(instance_path:request_failures:rate5m{job="myjob"})
        /
          sum without (instance)(instance_path:requests:rate5m{job="myjob"})
    # No labels left from instrumentation or distinguishing instances,
    # so we use 'job' as the level.
    - record: job:request_failures_per_requests:ratio_rate5m
      expr: |2
          sum without (instance, path)(instance_path:request_failures:rate5m{job="myjob"})
        /
          sum without (instance, path)(instance_path:requests:rate5m{job="myjob"})
Recording Rule Example 3
================================
    # Calculating average latency over a time period from a Summary:
    - record: instance_path:request_latency_seconds_count:rate5m
      expr: rate(request_latency_seconds_count{job="myjob"}[5m])
    - record: instance_path:request_latency_seconds_sum:rate5m
      expr: rate(request_latency_seconds_sum{job="myjob"}[5m])
    - record: instance_path:request_latency_seconds:mean5m
      expr: |2
          instance_path:request_latency_seconds_sum:rate5m{job="myjob"}
        /
          instance_path:request_latency_seconds_count:rate5m{job="myjob"}
    # Aggregate up numerator and denominator, then divide.
    - record: path:request_latency_seconds:mean5m
      expr: |2
          sum without (instance)(instance_path:request_latency_seconds_sum:rate5m{job="myjob"})
        /
          sum without (instance)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
Recording Rule Example 5
================================
    # Calculating the average query rate across instances and paths is done using the avg() function.
    # The input is the per-instance, per-path rate recorded in Example 3.
    - record: job:request_latency_seconds_count:avg_rate5m
      expr: avg without (instance, path)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
Recording Rule Example 6
================================
    groups:
      - name: custom_rules
        rules:
          # Percentage of memory that is not free (100 minus the free percentage).
          - record: node_memory_MemFree_percent
            expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
          # Percentage of free space on the root filesystem.
          - record: node_filesystem_free_percent
            expr: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
Recording Rule Example 7
================================
    groups:
      - name: recording_rules
        # Evaluate this group every 5 seconds instead of using the global evaluation interval.
        interval: 5s
        rules:
          # Metric names use the _bytes suffix exposed by node_exporter 0.16 and later.
          - record: node_exporter:node_filesystem_free:fs_used_percents
            expr: 100 - 100 * (node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
          - record: node_exporter:node_memory_free:memory_used_percents
            expr: 100 - 100 * (node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
Examples of Prometheus alerting rules
Alerting Rule Example 1
==============================================
    groups:
      - name: example
        rules:
          - alert: HighRequestLatency
            expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
            for: 10m
            labels:
              severity: page
            annotations:
              summary: High request latency
Alerting Rule Example 2
==============================================
    groups:
      - name: example
        rules:
          # Alert for any instance that is unreachable for >5 minutes.
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: page
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
          # Alert for any instance that has a median request latency >1s.
          - alert: APIHighRequestLatency
            expr: api_http_request_latencies_second{quantile="0.5"} > 1
            for: 10m
            annotations:
              summary: "High request latency on {{ $labels.instance }}"
              description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
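Alerting rules can be exercised with promtool unit tests as well. The sketch below assumes the group from Example 2 has been saved as alert_rules.yml (a hypothetical file name) and uses a made-up target, host1:9100; it simulates that target being down and checks that InstanceDown is firing after the 5 minute wait.

```yaml
# alert_rules_test.yml (hypothetical file name)
rule_files:
  - alert_rules.yml          # assumed to contain the "example" group shown above

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic input: the target reports up == 0 for 11 consecutive samples.
    input_series:
      - series: 'up{job="myjob", instance="host1:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # The rule has "for: 5m", so the alert should be firing by the 6 minute mark.
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              instance: host1:9100
              job: myjob
            exp_annotations:
              summary: "Instance host1:9100 down"
              description: "host1:9100 of job myjob has been down for more than 5 minutes."
```

Run it with promtool test rules alert_rules_test.yml. The expected annotations must match the rendered templates exactly, so they need updating whenever the rule text changes.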
Alerting Rule Example 3
==============================================
    groups:
      - name: alert_rules
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Instance [{{ $labels.instance }}] down"
              description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 1 minute."
Alerting Rule Example 4
==============================================
    # Uses the node_filesystem_free_percent series produced by Recording Rule Example 6.
    - alert: DiskSpaceFree10Percent
      expr: node_filesystem_free_percent <= 10
      labels:
        severity: warning
      annotations:
        summary: "Instance [{{ $labels.instance }}] has 10% or less free disk space"
        description: "[{{ $labels.instance }}] has only {{ $value }}% free disk space."
I'm a DevOps/SRE/DevSecOps/Cloud Expert passionate about sharing knowledge and experiences. I am working at Cotocus. I blog tech insights at DevOps School, travel stories at Holiday Landmark, stock market tips at Stocks Mantra, health and fitness guidance at My Medic Plus, product reviews at I reviewed, and SEO strategies at Wizbrand.
Author: Rajesh Kumar