Recording rules and Alerting rules explained in Prometheus!!!

What are rules in Prometheus?

Time series queries can quickly become complicated to remember and type in the Expression Browser of the default Prometheus user interface. A Prometheus rule is a way to run a PromQL expression at a regular interval and act on the result: either store it as a new time series in the Prometheus time series database (TSDB) for later use, such as a derived value you query often, or evaluate it as an alert condition.

Why use rules in Prometheus?

Even a fairly simple query takes effort to remember and type correctly each time, for example:

100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)

This one is not too bad: it calculates the percentage of memory that is in use (that is, not free) on a server running the Prometheus Node Exporter. Still, rather than remembering and typing this query every time we want the answer, we can create a recording rule that runs at a chosen interval and makes the result available as its own time series.

Types of rules in Prometheus

Prometheus supports two types of rules which may be configured and then evaluated at regular intervals:

Recording rules are for pre-calculating frequently used or computationally expensive queries. The results of those rules are saved into their own time series.
Alerting rules, on the other hand, let you specify the conditions under which an alert should fire. The conditions are written as PromQL queries, and firing alerts are sent to Alertmanager, which can route notifications to an external service such as Slack.
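To make the difference concrete, here is a minimal sketch of a rules file that holds one rule of each type. The group name, the recorded series name, the alert name, and the 90% threshold are all assumptions chosen for illustration; only the memory expression comes from the section above.

```yaml
groups:
  - name: memory_rules                                  # hypothetical group name
    rules:
      # Recording rule: precompute the expression and store the result
      # as a new time series.
      - record: instance:node_memory_in_use:percent     # hypothetical series name
        expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)

      # Alerting rule: fire when the recorded value stays above a threshold.
      - alert: HighMemoryInUse                          # hypothetical alert name
        expr: instance:node_memory_in_use:percent > 90  # threshold is an assumption
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory in use on {{ $labels.instance }} is above 90%"
```

The alerting rule reads the series written by the recording rule, so the longer expression only has to be maintained in one place.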

How to add Recording rules and Alerting rules in Prometheus?

Step 1 – Create a prometheus.rules.yml file in the same directory as prometheus.yml, e.g. /etc/prometheus/prometheus.rules.yml.
Step 2 – Add a reference to prometheus.rules.yml in the rule_files section of prometheus.yml.
Step 3 – Restart the Prometheus service.
Step 4 – Refresh the Prometheus user interface and check the metrics drop-down for the new series.
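As a rough sketch of Step 2, the relevant part of prometheus.yml might look like the excerpt below. The 15s value is an assumption for illustration; the file path is the one from Step 1.

```yaml
# prometheus.yml (excerpt)
global:
  evaluation_interval: 15s   # assumed value; how often rules are evaluated by default

rule_files:
  - "/etc/prometheus/prometheus.rules.yml"
```

A rule group can override the global setting with its own interval field, which is how a group ends up being evaluated more or less often than the default.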

How to check the rules config file?
$ promtool check rules /etc/prometheus/prometheus.rules.yml

Steps to enable a Prometheus rule

	================================
	Change into the /usr/local/bin/prometheus folder:

	cd /usr/local/bin/prometheus

	Create a new file called prometheus_rules.yml:

	sudo nano prometheus_rules.yml

	Add our test expression as a recording rule:

	groups:
	  - name: custom_rules
	    rules:
	      # As written, this evaluates to the percentage of memory that is not free.
	      - record: node_memory_MemFree_percent
	        expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
	Save the file, then verify that its syntax is valid:

	./promtool check rules prometheus_rules.yml

	Now add a reference to prometheus_rules.yml in the rule_files section of prometheus.yml:

	rule_files:
	  - "prometheus_rules.yml"

	Then restart the Prometheus service and check its status:

	$ sudo service prometheus restart
	$ sudo service prometheus status

	# Refresh the Prometheus user interface and check the dropdown


Examples of Prometheus recording rules

	Recording Rule Example 1
	================================
	# Aggregating up requests per second that has a path label:
	- record: instance_path:requests:rate5m
	  expr: rate(requests_total{job="myjob"}[5m])

	- record: path:requests:rate5m
	  expr: sum without (instance)(instance_path:requests:rate5m{job="myjob"})

	Recording Rule Example 2
	================================
	# Calculating a request failure ratio and aggregating up to the job-level failure ratio:
	- record: instance_path:request_failures:rate5m
	  expr: rate(request_failures_total{job="myjob"}[5m])

	- record: instance_path:request_failures_per_requests:ratio_rate5m
	  expr: |2
	      instance_path:request_failures:rate5m{job="myjob"}
	    /
	      instance_path:requests:rate5m{job="myjob"}

	# Aggregate up numerator and denominator, then divide to get path-level ratio.
	- record: path:request_failures_per_requests:ratio_rate5m
	  expr: |2
	      sum without (instance)(instance_path:request_failures:rate5m{job="myjob"})
	    /
	      sum without (instance)(instance_path:requests:rate5m{job="myjob"})

	# No labels left from instrumentation or distinguishing instances,
	# so we use 'job' as the level.
	- record: job:request_failures_per_requests:ratio_rate5m
	  expr: |2
	      sum without (instance, path)(instance_path:request_failures:rate5m{job="myjob"})
	    /
	      sum without (instance, path)(instance_path:requests:rate5m{job="myjob"})
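The comments in Example 2 say to aggregate the numerator and denominator first and divide afterwards. A small worked case with made-up numbers shows why. The record name path:request_failures_per_requests:avg_of_ratios is hypothetical and appears only as a pattern to avoid.

```yaml
# Hypothetical numbers for two instances serving the same path:
#   instance A:  1 failure/s  out of  10 requests/s  -> ratio 0.10
#   instance B:  0 failures/s out of 990 requests/s  -> ratio 0.00
#
# Average of the per-instance ratios: (0.10 + 0.00) / 2    = 0.05
# Ratio of the sums:                  (1 + 0) / (10 + 990) = 0.001
#
# Only the second number is the real failure ratio for the path.

# Avoid: averaging a ratio weights a quiet instance the same as a busy one.
- record: path:request_failures_per_requests:avg_of_ratios   # hypothetical name
  expr: avg without (instance)(instance_path:request_failures_per_requests:ratio_rate5m{job="myjob"})
```

Summing both sides before dividing, as Example 2 does, weights each instance by its actual traffic.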

	Recording Rule Example 3
	================================
	# Calculating average latency over a time period from a Summary:

	- record: instance_path:request_latency_seconds_count:rate5m
	  expr: rate(request_latency_seconds_count{job="myjob"}[5m])

	- record: instance_path:request_latency_seconds_sum:rate5m
	  expr: rate(request_latency_seconds_sum{job="myjob"}[5m])

	- record: instance_path:request_latency_seconds:mean5m
	  expr: |2
	      instance_path:request_latency_seconds_sum:rate5m{job="myjob"}
	    /
	      instance_path:request_latency_seconds_count:rate5m{job="myjob"}

	# Aggregate up numerator and denominator, then divide.
	- record: path:request_latency_seconds:mean5m
	  expr: |2
	      sum without (instance)(instance_path:request_latency_seconds_sum:rate5m{job="myjob"})
	    /
	      sum without (instance)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})


	Recording Rule Example 5
	================================
	# Calculating the average query rate across instances and paths is done using the avg() function:

	# Reads the instance_path-level series recorded in Example 3.
	- record: job:request_latency_seconds_count:avg_rate5m
	  expr: avg without (instance, path)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})

	Recording Rule Example 6
	================================
	groups:
	  - name: custom_rules
	    rules:
	      # As written, this evaluates to the percentage of memory that is not free.
	      - record: node_memory_MemFree_percent
	        expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)

	      - record: node_filesystem_free_percent
	        expr: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}


	Recording Rule Example 7
	================================
	groups:
	  - name: recording_rules
	    interval: 5s
	    rules:
	      # Metric names use the _bytes suffix (node_exporter 0.16 and later), matching the other examples.
	      - record: node_exporter:node_filesystem_free:fs_used_percents
	        expr: 100 - 100 * (node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})

	      - record: node_exporter:node_memory_free:memory_used_percents
	        expr: 100 - 100 * (node_memory_MemFree_bytes / node_memory_MemTotal_bytes)

	Recording Rule Example 8
	================================
	groups:
	  - name: custom_rules
	    rules:
	      - record: node_memory_MemFree_percent
	        expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
	      - record: node_filesystem_free_percent
	        expr: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}


Examples of Prometheus alerting rules

	Example of prometheus alerting rules 1
	==============================================

	groups:
	  - name: example
	    rules:
	      - alert: HighRequestLatency
	        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
	        for: 10m
	        labels:
	          severity: page
	        annotations:
	          summary: High request latency


	Example of prometheus alerting rules 2
	==============================================

	groups:
	  - name: example
	    rules:

	      # Alert for any instance that is unreachable for >5 minutes.
	      - alert: InstanceDown
	        expr: up == 0
	        for: 5m
	        labels:
	          severity: page
	        annotations:
	          summary: "Instance {{ $labels.instance }} down"
	          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

	      # Alert for any instance that has a median request latency >1s.
	      - alert: APIHighRequestLatency
	        expr: api_http_request_latencies_second{quantile="0.5"} > 1
	        for: 10m
	        annotations:
	          summary: "High request latency on {{ $labels.instance }}"
	          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
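Besides the syntax check, promtool can also unit test rules. The sketch below tests the InstanceDown alert from Example 2 and assumes that example is saved as alerting_rules.yml next to the test file. That file name, the job name myjob, and the instance host1:9100 are made up for illustration.

```yaml
# alerting_rules_test.yml (hypothetical file name)
rule_files:
  - alerting_rules.yml        # assumed to contain the InstanceDown rule from Example 2

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 15 samples, one per minute, all zero: the target is down the whole time.
      - series: 'up{job="myjob", instance="host1:9100"}'
        values: '0+0x14'
    alert_rule_test:
      # The rule has "for: 5m", so the alert should be firing by the 10 minute mark.
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              instance: host1:9100
              job: myjob
            exp_annotations:
              summary: "Instance host1:9100 down"
              description: "host1:9100 of job myjob has been down for more than 5 minutes."
```

Under these assumptions the test would be run with promtool test rules alerting_rules_test.yml. The expected annotations have to match the rendered templates exactly, so they need updating whenever the rule text changes.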

	Example of prometheus alerting rules 3
	==============================================

	groups:
	  - name: alert_rules
	    rules:
	      - alert: InstanceDown
	        expr: up == 0
	        for: 1m
	        labels:
	          severity: critical
	        annotations:
	          summary: "Instance [{{ $labels.instance }}] down"
	          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 1 minute."

	Example of prometheus alerting rules 4
	==============================================

	# Reads the node_filesystem_free_percent series recorded in Example 6.
	- alert: DiskSpaceFree10Percent
	  expr: node_filesystem_free_percent <= 10
	  labels:
	    severity: warning
	  annotations:
	    summary: "Instance [{{ $labels.instance }}] has 10% or less free disk space"
	    description: "[{{ $labels.instance }}] has only {{ $value }}% free disk space."


Rajesh Kumar

I’m a DevOps/SRE/DevSecOps/Cloud Expert passionate about sharing knowledge and experiences. I am working at Cotocus. I blog tech insights at DevOps School, travel stories at Holiday Landmark, stock market tips at Stocks Mantra, health and fitness guidance at My Medic Plus, product reviews at I reviewed, and SEO strategies at Wizbrand.

