Recording rules and Alerting rules explained in Prometheus!!!

What are rules in Prometheus?

A Prometheus rule is a way to run a PromQL expression at a regular interval and act on the result: either store it in the Prometheus time series database (TSDB) as a new series for later use, or evaluate it as an alert condition.

Why use rules in Prometheus?

Time series queries can quickly become quite complicated to remember and type using the Expression Browser in the default Prometheus User Interface.

100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)

This one is not so bad: it gives the percentage of memory that is not free (that is, in use) on a server running the Prometheus Node Exporter. Still, rather than remembering and typing this query every time we want the answer, we can create a recording rule that runs at a chosen interval and makes the result available as its own time series.

Types of rules in Prometheus?

Prometheus supports two types of rules which may be configured and then evaluated at regular intervals:

  • Recording rules pre-calculate frequently used or computationally expensive queries. The result of each rule is saved as its own time series.
  • Alerting rules, on the other hand, let you specify the conditions under which an alert should fire. They are based on PromQL queries, and firing alerts are typically sent to Alertmanager, which routes them on to an external service such as Slack.
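
To make the two types concrete, here is a minimal sketch of a single rule group that contains one of each. The group name example_rules, the recorded series name instance:node_memory_not_free:percent, the alert name HighMemoryUsage, and the 90% threshold are illustrative assumptions, not values from a real setup.

```yaml
groups:
  - name: example_rules          # hypothetical group name
    rules:
      # Recording rule: store the percentage of memory that is not free
      # as a new time series (the series name is an assumption).
      - record: instance:node_memory_not_free:percent
        expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)

      # Alerting rule: fire when the recorded value stays above an
      # assumed threshold of 90% for 10 minutes.
      - alert: HighMemoryUsage   # hypothetical alert name
        expr: instance:node_memory_not_free:percent > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
```

Rules within a group are evaluated in order, so the alerting rule can build on the series that the recording rule produces.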

How to add Recording rules and Alerting rules in Prometheus?

Step 1 – Create a prometheus.rules.yml file in the same directory where prometheus.yml is stored, e.g. /etc/prometheus/prometheus.rules.yml.
Step 2 – Add a reference to prometheus.rules.yml in the rule_files section of prometheus.yml.
Step 3 – Restart the Prometheus service.
Step 4 – Refresh the Prometheus user interface and check that the new metric appears in the metrics drop-down.
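
Step 2 can be sketched as the following fragment of prometheus.yml. The 15s evaluation_interval is an assumed value, chosen only for illustration; it sets how often rules are evaluated by default.

```yaml
# Fragment of prometheus.yml (only the parts relevant to rules are shown).
global:
  evaluation_interval: 15s   # assumed value: how often rules are evaluated

rule_files:
  # A relative path is resolved against the directory that holds prometheus.yml.
  - "prometheus.rules.yml"
```

A rule group can override this default with its own interval field.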

How to check the rules config file?
$ promtool check rules /etc/prometheus/prometheus.rules.yml

Steps to enable a Prometheus rule

================================
Change into the /usr/local/bin/prometheus folder
cd /usr/local/bin/prometheus
Create a new file called prometheus_rules.yml
sudo nano prometheus_rules.yml
Add our test expression as a recording rule
groups:
  - name: custom_rules
    rules:
      # Note: despite its name, this series holds the percentage of memory that is NOT free.
      - record: node_memory_MemFree_percent
        expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
Save it, then verify that the syntax of the rules file is OK.
./promtool check rules prometheus_rules.yml
Now let's add the prometheus_rules.yml reference to the rule_files section of prometheus.yml.
rule_files:
  - "prometheus_rules.yml"
Then restart the Prometheus service and check its status.
$ sudo service prometheus restart
$ sudo service prometheus status
# Refresh the Prometheus user interface and check the drop-down for node_memory_MemFree_percent
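
Beyond a syntax check, promtool can also unit-test rules against synthetic data. The following is a minimal sketch of such a test for the recording rule above; the test file name, the instance="host1" label, and the input values are made up for illustration.

```yaml
# prometheus_rules_test.yml (hypothetical file name)
rule_files:
  - prometheus_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic input: 25 bytes free out of 100 total, constant for 10 minutes.
    input_series:
      - series: 'node_memory_MemFree_bytes{instance="host1"}'
        values: '25+0x10'
      - series: 'node_memory_MemTotal_bytes{instance="host1"}'
        values: '100+0x10'
    promql_expr_test:
      # 100 - (100 * 25 / 100) = 75
      - expr: node_memory_MemFree_percent
        eval_time: 5m
        exp_samples:
          - labels: 'node_memory_MemFree_percent{instance="host1"}'
            value: 75
```

Assuming the test file sits next to the rules file, it can be run with ./promtool test rules prometheus_rules_test.yml.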

Examples of Prometheus recording rules

Recording Rule Example 1
================================
# Aggregating up requests per second that has a path label:
- record: instance_path:requests:rate5m
  expr: rate(requests_total{job="myjob"}[5m])
- record: path:requests:rate5m
  expr: sum without (instance)(instance_path:requests:rate5m{job="myjob"})
Recording Rule Example 2
================================
# Calculating a request failure ratio and aggregating up to the job-level failure ratio:
- record: instance_path:request_failures:rate5m
  expr: rate(request_failures_total{job="myjob"}[5m])
- record: instance_path:request_failures_per_requests:ratio_rate5m
  expr: |2
      instance_path:request_failures:rate5m{job="myjob"}
    /
      instance_path:requests:rate5m{job="myjob"}
# Aggregate up numerator and denominator, then divide to get path-level ratio.
- record: path:request_failures_per_requests:ratio_rate5m
  expr: |2
      sum without (instance)(instance_path:request_failures:rate5m{job="myjob"})
    /
      sum without (instance)(instance_path:requests:rate5m{job="myjob"})
# No labels left from instrumentation or distinguishing instances,
# so we use 'job' as the level.
- record: job:request_failures_per_requests:ratio_rate5m
  expr: |2
      sum without (instance, path)(instance_path:request_failures:rate5m{job="myjob"})
    /
      sum without (instance, path)(instance_path:requests:rate5m{job="myjob"})
Recording Rule Example 3
================================
# Calculating average latency over a time period from a Summary:
- record: instance_path:request_latency_seconds_count:rate5m
  expr: rate(request_latency_seconds_count{job="myjob"}[5m])
- record: instance_path:request_latency_seconds_sum:rate5m
  expr: rate(request_latency_seconds_sum{job="myjob"}[5m])
- record: instance_path:request_latency_seconds:mean5m
  expr: |2
      instance_path:request_latency_seconds_sum:rate5m{job="myjob"}
    /
      instance_path:request_latency_seconds_count:rate5m{job="myjob"}
# Aggregate up numerator and denominator, then divide.
- record: path:request_latency_seconds:mean5m
  expr: |2
      sum without (instance)(instance_path:request_latency_seconds_sum:rate5m{job="myjob"})
    /
      sum without (instance)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
Recording Rule Example 5
================================
# Calculating the average query rate across instances and paths is done using the avg() function.
# The input is the instance_path-level series recorded in Recording Rule Example 3.
- record: job:request_latency_seconds_count:avg_rate5m
  expr: avg without (instance, path)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
Recording Rule Example 6
================================
groups:
  - name: custom_rules
    rules:
      - record: node_memory_MemFree_percent
        expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
      - record: node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
Recording Rule Example 7
================================
# Metric names use the _bytes suffix (node_exporter 0.16 and later), matching the other examples.
groups:
  - name: recording_rules
    interval: 5s
    rules:
      - record: node_exporter:node_filesystem_free:fs_used_percents
        expr: 100 - 100 * ( node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} )
      - record: node_exporter:node_memory_free:memory_used_percents
        expr: 100 - 100 * (node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
Recording Rule Example 8
================================
groups:
  - name: custom_rules
    rules:
      - record: node_memory_MemFree_percent
        expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
      - record: node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

Examples of Prometheus alerting rules

Example of prometheus alerting rules 1
==============================================
groups:
  - name: example
    rules:
      - alert: HighRequestLatency
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
Example of prometheus alerting rules 2
==============================================
groups:
  - name: example
    rules:
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      # Alert for any instance that has a median request latency >1s.
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
Example of prometheus alerting rules 3
==============================================
groups:
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] down"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 1 minute."
Example of prometheus alerting rules 4
==============================================
# Relies on the node_filesystem_free_percent recording rule from Recording Rule Example 6.
- alert: DiskSpaceFree10Percent
  expr: node_filesystem_free_percent <= 10
  labels:
    severity: warning
  annotations:
    summary: "Instance [{{ $labels.instance }}] has 10% or less free disk space"
    description: "[{{ $labels.instance }}] has only {{ $value }}% free disk space."
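
For alerts like these to reach an external service such as Slack, Prometheus has to be told where to send them, which is usually an Alertmanager instance. A minimal sketch of the relevant prometheus.yml fragment, assuming Alertmanager runs on the same host on its default port 9093 (the address is an assumption, adjust it to your setup):

```yaml
# Fragment of prometheus.yml: where firing alerts are sent.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"   # assumed Alertmanager address
```

Routing to Slack itself is then configured in Alertmanager, not in Prometheus.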
