Writing Prometheus Alert Rules — from expr, for, labels, annotations to testing
How to write Prometheus alerting rules properly. We cover rule file structure and the five parts of a rule (alert/expr/for/labels/annotations), the pending→firing states, writing annotations, good expr patterns, real examples like InstanceDown/error rate/latency/disk prediction, recording rules, and promtool validation/unit testing plus operational tips.
Half the point of collecting metrics is to "get alerted when something breaks." In Prometheus, the Alert Rule defines the condition for that alert. Write rules well and only real incidents page you cleanly; write them poorly and you drown in meaningless alerts every night.
What this post covers:
- The evaluation flow and file structure of alerting rules
- The five parts of a rule: alert/expr/for/labels/annotations
- The pending → firing states and avoiding flapping
- Annotation templating and writing good expr
- Real examples: InstanceDown, error rate, latency, disk prediction
- Recording rules, promtool validation/unit tests, operational tips
1. What an Alert Rule is
Every evaluation interval (evaluation_interval), Prometheus evaluates each rule's expr. If it returns results (= condition met) it becomes an alert candidate, and once the condition holds for the configured for duration it enters the firing state and is sent to Alertmanager. Grouping, deduplication, routing, and delivery (Slack, email, etc.) are Alertmanager's job.
    [metrics] → Prometheus evaluates expr (periodically)
              → condition met + for elapsed → firing
              → sent to Alertmanager → grouping/routing → Slack/Email/PagerDuty

There are two kinds of rules: alerting rules (alert conditions) and recording rules (precompute frequently used expressions and store them as new series). This post is about the former; the latter appears in section 9.
2. Rule file structure
Write rules in a separate YAML file under groups, and load them via rule_files in prometheus.yml.
    # prometheus.yml
    global:
      evaluation_interval: 15s   # how often to evaluate rules
    rule_files:
      - "rules/*.yml"
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["localhost:9093"]

    # rules/app-alerts.yml
    groups:
      - name: app-availability   # group name (rules in a group evaluate in order)
        interval: 30s            # (optional) per-group evaluation interval
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Instance down: {{ $labels.instance }}"
              description: "{{ $labels.job }} / {{ $labels.instance }} has not responded for over 5 minutes."

Rules within a group evaluate top to bottom, so a recording rule's result can be referenced by another rule in the same group.
3. The five parts of an Alert Rule
| Field | Role | Required |
|---|---|---|
| alert | Alert name (= the alertname label) | Y |
| expr | The PromQL condition. If it returns even one result, each such series becomes an alert candidate | Y |
| for | Must hold continuously for this duration to fire | N |
| labels | Labels added to the alert (routing/severity). Merged with the expr result labels | N |
| annotations | Human-readable description (templatable). Not used for routing | N |

The key is that each series returned by expr is an individual alert instance. If up == 0 is true for 3 instances, 3 alerts are created, each carrying its own instance label.
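If one alert per instance is too fine-grained, aggregating inside expr changes what counts as an alert instance. The following is a minimal sketch, not part of the original rule set: the group name, alert name, and severity are assumptions chosen for illustration.

```yaml
groups:
  - name: app-availability-aggregated   # hypothetical group name
    rules:
      - alert: JobInstancesDown          # hypothetical alert name
        # count by (job) collapses the per-instance series into one series per job,
        # so three down instances of the same job produce a single alert
        expr: count by (job) (up == 0) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} instance(s) of {{ $labels.job }} are down"
```

Because the instance label is aggregated away, it is no longer available in the annotations; only job survives. That is the trade-off between the two forms.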
4. for and alert states
An alert moves through three states.
- inactive: expr returns nothing (condition not met)
- pending: the condition is met but the for duration hasn't elapsed
- firing: the condition has held continuously for the for duration → sent to Alertmanager

    inactive ──(condition met)──▶ pending ──(for elapsed)──▶ firing
       ▲                             │                          │
       └────(condition cleared)──────┴──────────────────────────┘

Without for, a single match fires immediately, so transient spikes make alerts flap on and off. Putting for: 5m to for: 15m on most rules dramatically reduces noise. If the condition clears even once during pending, the timer resets to 0.
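The timer reset can be pinned down with a promtool unit test (the format is covered in section 10). This is a sketch under assumptions: the test file name is hypothetical, and it assumes the InstanceDown rule from rules/app-alerts.yml in section 2, with for: 5m and both annotations.

```yaml
# tests/instance-down-reset.test.yml (hypothetical file name)
rule_files:
  - ../rules/app-alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api", instance="a:8080"}'
        # t=0m..2m down, t=3m up again (pending timer resets), down again from t=4m
        values: "0 0 0 1 0 0 0 0 0 0"
    alert_rule_test:
      # 4 minutes into the second outage: still pending, so nothing is firing
      - eval_time: 8m
        alertname: InstanceDown
        exp_alerts: []
      # 5 minutes into the second outage: firing
      - eval_time: 9m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: a:8080
            exp_annotations:
              summary: "Instance down: a:8080"
              description: "api / a:8080 has not responded for over 5 minutes."
```

Without the single healthy sample at t=3m the alert would already be firing at 5m; with it, the 5-minute clock restarts at t=4m.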
Active alerts are also queryable via the ALERTS{alertname=..., alertstate="pending|firing", ...} metric, so you can monitor the alerts themselves.
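One way to make the ALERTS metric easier to graph is to precompute counts with recording rules. This is only a sketch: the group name and the recorded series names are assumptions, not an established convention.

```yaml
groups:
  - name: meta-alerts   # hypothetical group name
    rules:
      # number of currently firing alerts per alert name
      - record: alertname:alerts_firing:count
        expr: count by (alertname) (ALERTS{alertstate="firing"})
      # number of alerts still waiting out their for duration
      - record: alertname:alerts_pending:count
        expr: count by (alertname) (ALERTS{alertstate="pending"})
```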
5. labels — routing and severity
labels give an alert meaning, and Alertmanager uses them to decide routing, grouping, and inhibition. The most common is severity.
    labels:
      severity: critical   # use a consistent scheme: critical / warning / info
      team: platform       # which team to route to

Labels already on the expr result (instance, job, etc.) are preserved on the alert, with the above added on top. So an Alertmanager route can branch like severity="critical" to PagerDuty and warning to Slack.

Tip: standardize severity values across the whole org with the same vocabulary. If one person uses crit and another critical, your routing rules start leaking.
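On the Alertmanager side, the severity-based branching described above could look roughly like this. It is a sketch, not a complete configuration: the receiver names are hypothetical, and the receivers are left empty where a real setup would carry pagerduty_configs or slack_configs.

```yaml
# alertmanager.yml (fragment)
route:
  receiver: slack-default          # fallback when no child route matches
  group_by: [alertname, job]
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall   # hypothetical receiver name
    - matchers:
        - severity="warning"
      receiver: slack-warnings     # hypothetical receiver name
receivers:
  - name: slack-default
  - name: pagerduty-oncall
  - name: slack-warnings
```

An alert labelled severity: crit would match neither child route and fall through to the default receiver, which is exactly the kind of leak the tip warns about.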
6. annotations and templating
annotations are the alert's message body. Use Go templating to inject labels and values, stating what and how bad the problem is.
    annotations:
      summary: "{{ $labels.job }} high 5xx error rate"
      description: "Error rate {{ $value | humanizePercentage }} (threshold 5%) for over 10 minutes."
      runbook_url: "https://runbooks.example.com/HighErrorRate"
      dashboard: "https://grafana.example.com/d/abc/app"

Common templating elements:

- {{ $labels.<name> }}: a label value of this alert series (e.g. {{ $labels.instance }})
- {{ $value }}: the current value of expr
- {{ $value | humanize }}: 1234567 → 1.235M, human-friendly
- {{ $value | humanizePercentage }}: 0.0523 → 5.23%
- {{ $value | humanizeDuration }}: seconds → a duration such as 2h 5m 0s
- {{ printf "%.2f" $value }}: fixed decimal places

A good annotation includes the current value, threshold, scope of impact, and a runbook link, so you can act from the alert alone.
7. Writing good expr
A rule's quality is decided in the expr.
- Counters via rate(). Don't compare cumulative values directly; use the per-second rate.
- Aggregate keeping only meaningful labels. For example, sum by (job, instance) (...) keeps the labels that distinguish the alert and sums away the rest.
- Beware divide-by-zero in ratios. In an error ratio, a zero denominator means a zero numerator too, so the division yields NaN (or no series at all if the metric is missing). NaN never passes the > comparison, so the alert doesn't fire, which is usually the safe behavior.
- Detect absence with absent(). A metric vanishing entirely (a target disappearing) can't be caught by a threshold comparison.
- Alert on symptoms, not causes. "Error rate up, latency up" (user impact) is far less noisy than "CPU 90%" alone.

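The absent() point above can be turned into a rule along these lines. This is a minimal sketch: the group name, alert name, severity, and the 10-minute for are assumptions, and payment-api is just the example job name used in this post.

```yaml
groups:
  - name: app-absence              # hypothetical group name
    rules:
      - alert: PaymentApiAbsent    # hypothetical alert name
        # absent() returns a single series with value 1 when no matching series
        # exists; equality matchers such as job="payment-api" become its labels
        expr: absent(up{job="payment-api"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No up series for job {{ $labels.job }}"
          description: "Prometheus has seen no up series for payment-api for 10 minutes; the target may have dropped out of service discovery."
```

This complements InstanceDown rather than replacing it: up == 0 covers a target that is scraped but failing, while absent() covers a target that is no longer scraped at all.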
    # error rate (5xx ratio): yields no result when the denominator is 0
    sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
      / sum by (job) (rate(http_requests_total[5m])) > 0.05

    # whether a job's metric disappeared entirely
    absent(up{job="payment-api"})

8. Common patterns
    groups:
      - name: app-symptoms
        rules:
          # instance down
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels: { severity: critical }
            annotations:
              summary: "Instance down: {{ $labels.instance }}"

          # 5xx error rate over 5%
          - alert: HighErrorRate
            expr: |
              sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
                / sum by (job) (rate(http_requests_total[5m])) > 0.05
            for: 10m
            labels: { severity: warning }
            annotations:
              summary: "{{ $labels.job }} high error rate"
              description: "Error rate {{ $value | humanizePercentage }}, sustained 10 minutes."

          # p95 latency over 1s (histogram)
          - alert: HighRequestLatency
            expr: |
              histogram_quantile(0.95,
                sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
              ) > 1
            for: 10m
            labels: { severity: warning }
            annotations:
              summary: "{{ $labels.job }} high p95 latency"
              description: "p95 = {{ $value | humanizeDuration }} (threshold 1s)."

          # disk trending to fill within 4 hours (prediction)
          - alert: DiskWillFillSoon
            expr: |
              predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4*3600) < 0
            for: 30m
            labels: { severity: warning }
            annotations:
              summary: "{{ $labels.instance }} disk expected to fill within 4 hours"

          # memory saturation over 90%
          - alert: HighMemoryUsage
            expr: |
              (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
            for: 15m
            labels: { severity: warning }
            annotations:
              summary: "{{ $labels.instance }} memory usage over 90%"

predict_linear(...[1h], 4*3600) extrapolates the value 4 hours out from the last hour's trend. < 0 means "at this trend it hits 0 (= full) within 4 hours," giving you an alert before the disk is actually full.
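A linear extrapolation over one hour can overreact to a short burst of writes on a nearly empty disk. One common refinement, shown here as a sketch rather than as part of the original rule set, is to require that free space is already below some ratio as well. The alert name and the 40% threshold are assumptions.

```yaml
groups:
  - name: node-disk   # hypothetical group name
    rules:
      - alert: DiskWillFillSoonGuarded   # hypothetical alert name
        expr: |
          (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.4
          )
          and
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4*3600) < 0
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} expected to fill within 4 hours"
          description: "Free space is {{ $value | humanizePercentage }} and trending to zero."
```

With and, the result keeps the left-hand side's value, so $value here is the free-space ratio rather than the predicted byte count.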
9. Using recording rules
For complex or frequently used expressions, a recording rule precomputes them into a new series, simplifying your alert rules and dashboards and reducing evaluation cost.
    groups:
      - name: app-recordings
        rules:
          - record: job:http_requests:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))
          - record: job:http_errors:ratio5m
            expr: |
              sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
                / sum by (job) (rate(http_requests_total[5m]))

By convention, names use the level:metric:operations form (job:http_errors:ratio5m). Then the alert becomes short:

          - alert: HighErrorRate
            expr: job:http_errors:ratio5m > 0.05
            for: 10m
            labels: { severity: warning }

10. Validation and testing (promtool)
Before shipping rules to production, validate syntax and behavior with promtool.
    # syntax/structure validation
    promtool check rules rules/*.yml

Going further, unit tests let you pin down "given this input, this alert should fire."

    # tests/instance-down.test.yml
    rule_files:
      - ../rules/app-alerts.yml
    evaluation_interval: 1m
    tests:
      - interval: 1m
        input_series:
          - series: 'up{job="api", instance="a:8080"}'
            values: "1 1 0 0 0 0 0 0"   # 0 (down) from the third sample, t=2m
        alert_rule_test:
          - eval_time: 7m               # 5 minutes after the first 0, so for: 5m has elapsed
            alertname: InstanceDown
            exp_alerts:
              - exp_labels:
                  severity: critical
                  job: api
                  instance: a:8080
                # expected annotations must match every annotation the rule produces
                exp_annotations:
                  summary: "Instance down: a:8080"
                  description: "api / a:8080 has not responded for over 5 minutes."

    promtool test rules tests/instance-down.test.yml

Wiring promtool check rules and promtool test rules into CI prevents broken rules from being merged.
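As one possible way to wire this into CI, here is a sketch of a GitHub Actions workflow. Everything about it is an assumption: the workflow file name, the rules/ and tests/ layout, and the choice to run promtool from the prom/prometheus container image by overriding its entrypoint.

```yaml
# .github/workflows/prometheus-rules.yml (hypothetical file name)
name: prometheus-rules
on:
  pull_request:
    paths:
      - "rules/**"
      - "tests/**"
jobs:
  promtool:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate rule syntax
        run: |
          docker run --rm -v "$PWD:/work" -w /work --entrypoint promtool \
            prom/prometheus:latest check rules rules/*.yml
      - name: Run rule unit tests
        run: |
          docker run --rm -v "$PWD:/work" -w /work --entrypoint promtool \
            prom/prometheus:latest test rules tests/*.test.yml
```

Pinning the image to the Prometheus version you run in production, rather than latest, keeps the validation consistent with what the server will actually accept.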
11. Operational tips / anti-patterns
- Put for on almost every rule. Transient-spike alerts disappear.
- Alert on symptoms. Tie alerts to user impact (error rate, latency, availability); keep cause metrics (CPU, memory) as supporting signals.
- Make annotations actionable. Current value, threshold, and a runbook link are essential.
- Standardize severity. Routing depends on it.
- Beware alert fatigue. Alerts you never act on get ignored by everyone. If there's no answer to "what does a person do when this fires," it's a dashboard item, not an alert.
- After changing rules, reload Prometheus.

    # when running with --web.enable-lifecycle
    curl -X POST http://localhost:9090/-/reload

Connecting to Alertmanager only needs the alerting: block from section 2; the actual delivery channels, routing, and deduplication are configured separately in Alertmanager's alertmanager.yml (a topic of its own).
Wrapping up
An Alert Rule is ultimately the combination of "a meaningful expr + an appropriate for + an actionable annotation." Capture symptoms precisely with expr, filter noise with for, route to the right person with labels, and make it immediately actionable with annotations — those four are the whole of a good alert. Finally, add promtool unit tests and you can guarantee, like code, that a rule "fires as expected."
As a next step, connect these alerts to Alertmanager routes (per-severity channels, business-hours branching, inhibition) and dashboard the ALERTS metric to manage "alerts themselves" — that is, alert hygiene.