
Writing Prometheus Alert Rules — from expr, for, labels, annotations to testing

How to write Prometheus alerting rules properly. We cover rule file structure and the five parts of a rule (alert/expr/for/labels/annotations), the pending→firing states, writing annotations, good expr patterns, real examples like InstanceDown/error rate/latency/disk prediction, recording rules, and promtool validation/unit testing plus operational tips.

Data Dynamics · June 12, 2026 · 9 min read

Half the point of collecting metrics is to "get alerted when something breaks." In Prometheus, the Alert Rule defines the condition for that alert. Write rules well and only real incidents page you cleanly; write them poorly and you drown in meaningless alerts every night.

What this post covers:

The evaluation flow and file structure of alerting rules
The five parts of a rule: alert / expr / for / labels / annotations
The pending → firing states and avoiding flapping
Annotation templating and writing good expr
Real examples: InstanceDown, error rate, latency, disk prediction
Recording rules, promtool validation/unit tests, operational tips

1. What an Alert Rule is

Every evaluation interval (evaluation_interval), Prometheus evaluates each rule's expr. If it returns results (= condition met) it becomes an alert candidate, and once the condition holds for the configured for duration it enters the firing state and is sent to Alertmanager. Grouping, deduplication, routing, and delivery (Slack, email, etc.) are Alertmanager's job.


There are two kinds of rules: alerting rules (alert conditions) and recording rules (precompute frequently used expressions and store them as new series). This post is about the former; the latter appears in section 9.

2. Rule file structure

Write rules in a separate YAML file under groups, and load them via rule_files in prometheus.yml.

# prometheus.yml
global:
  evaluation_interval: 15s        # how often to evaluate rules
 
rule_files:
  - "rules/*.yml"
 
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

# rules/app-alerts.yml
groups:
  - name: app-availability         # group name (rules in a group evaluate in order)
    interval: 30s                  # (optional) per-group evaluation interval
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance down: {{ $labels.instance }}"
          description: "{{ $labels.job }} / {{ $labels.instance }} has not responded for over 5 minutes."

Rules within a group evaluate top to bottom, so a recording rule's result can be referenced by another rule in the same group.

3. The five parts of an Alert Rule

Field	Role	Required
`alert`	Alert name (= the `alertname` label)	Y
`expr`	The PromQL condition. If it returns even one result, each such series becomes an alert candidate	Y
`for`	Must hold continuously for this duration to fire	N
`labels`	Labels added to the alert (routing/severity). Merged with the `expr` result labels	N
`annotations`	Human-readable description (templatable). Not used for routing	N

The key is that each series returned by expr is an individual alert instance. If up == 0 is true for 3 instances, 3 alerts are created, each carrying its own instance label.
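
The two rules below sketch how the shape of expr decides how many alerts you get. The group name and the second rule (MultipleInstancesDown, with its threshold of 3) are hypothetical and only illustrate the point.

```yaml
groups:
  - name: alert-cardinality-demo        # hypothetical group name
    rules:
      # One alert per matching series: 3 instances down -> 3 alerts,
      # each carrying its own job and instance labels.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical

      # count by (job) drops the instance label, so 3 instances down
      # in one job -> 1 alert for that job, with $value equal to 3.
      - alert: MultipleInstancesDown    # hypothetical rule name
        expr: count by (job) (up == 0) >= 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} instances of {{ $labels.job }} are down"
```

Note that the aggregated rule can no longer reference {{ $labels.instance }}, because that label is gone from its result.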

4. `for` and alert states

An alert moves through three states.

inactive: expr returns nothing (condition not met)
pending: the condition is met but the `for` duration hasn't elapsed
firing: the condition has held continuously for the full `for` duration, so the alert is sent to Alertmanager


Without `for`, a single match fires immediately, so transient spikes make alerts flap on and off. Setting `for` to somewhere between 5m and 15m on most rules dramatically reduces noise. If the condition clears even once during pending, the alert returns to inactive and the timer starts again from zero.
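
As a rough illustration of the timer reset, here is a hypothetical timeline for the InstanceDown rule, assuming it is evaluated once per minute. The timestamps are invented for the example.

```yaml
- alert: InstanceDown
  expr: up == 0
  for: 5m
  # Hypothetical timeline (one evaluation per minute assumed):
  #   t=0m  up == 1  -> inactive
  #   t=1m  up == 0  -> pending   (timer starts)
  #   t=3m  up == 1  -> inactive  (timer discarded)
  #   t=4m  up == 0  -> pending   (timer restarts from zero)
  #   t=9m  up == 0 at every evaluation since t=4m -> firing
```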

Active alerts are also queryable via the ALERTS{alertname=..., alertstate="pending|firing", ...} metric, so you can monitor the alerts themselves.
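
One way to use that metric, sketched as a recording rule. The group name and the recorded series name are assumptions made for this example.

```yaml
groups:
  - name: alert-hygiene                         # hypothetical group name
    rules:
      # Number of currently firing alert instances per alert name
      - record: alertname:alerts_firing:count   # hypothetical series name
        expr: count by (alertname) (ALERTS{alertstate="firing"})
```

Graphing this series over time makes it easy to spot which rules fire most often.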

5. labels — routing and severity

labels give an alert meaning, and Alertmanager uses them to decide routing, grouping, and inhibition. The most common is severity.

labels:
  severity: critical      # use a consistent scheme: critical / warning / info
  team: platform          # which team to route to

Labels already on the expr result (instance, job, etc.) are preserved on the alert, with the above added on top. So an Alertmanager route can branch like severity="critical" to PagerDuty and warning to Slack.

Tip: standardize severity values across the whole org with the same vocabulary. If one person uses crit and another critical, your routing rules start leaking.
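
To show where these labels end up, here is a minimal alertmanager.yml sketch that branches on severity. The receiver names are hypothetical, and the receivers are left empty; a real setup would add the matching slack_configs or pagerduty_configs entries.

```yaml
# alertmanager.yml (sketch; receiver names are assumptions)
route:
  receiver: slack-default              # fallback when no child route matches
  group_by: [alertname, job]
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
    - matchers:
        - severity="warning"
      receiver: slack-default

receivers:
  - name: slack-default                # add slack_configs here
  - name: pagerduty-oncall             # add pagerduty_configs here
```

An alert labeled severity: crit would match neither child route and fall through to the default receiver, which is exactly the kind of leak the tip above warns about.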

6. annotations and templating

annotations are the alert's message body. Use Go templating to inject labels and values, and state what the problem is and how bad it is.

annotations:
  summary: "{{ $labels.job }} high 5xx error rate"
  description: "Error rate {{ $value | humanizePercentage }} (threshold 5%) for over 10 minutes."
  runbook_url: "https://runbooks.example.com/HighErrorRate"
  dashboard: "https://grafana.example.com/d/abc/app"

Common templating elements:

{{ $labels.<name> }}: a label value of this alert series ({{ $labels.instance }})
{{ $value }}: the value of the expr result for this alert series
{{ $value | humanize }}: 1234567 → 1.235M, human-friendly
{{ $value | humanizePercentage }}: 0.0523 → 5.23%
{{ $value | humanizeDuration }}: seconds to a readable duration, e.g. 7500 → 2h 5m 0s
{{ printf "%.2f" $value }}: fixed decimal places

A good annotation includes the current value, threshold, scope of impact, and a runbook link, so you can act from the alert alone.

7. Writing good `expr`

A rule's quality is decided in the expr.

Counters via rate(). Don't compare cumulative values directly; use per-second rate.
Aggregate keeping only meaningful labels. Like sum by (job, instance) (...) — keep the labels that distinguish the alert and sum away the rest.
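Beware divide-by-zero in ratios. When numerator and denominator are both 0, the division yields NaN; any comparison against NaN is false, so the series is dropped and the alert doesn't fire, which is usually the safe behavior. If the denominator series is missing altogether, the result is empty (no data).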
Detect absence with absent(). A metric vanishing entirely (a target disappearing) can't be caught by a threshold comparison.
Alert on symptoms, not causes. "Error rate up, latency up" (user impact) is far less noisy than "CPU 90%" alone.

# error rate (5xx ratio): a 0/0 result is NaN, which never passes the > 0.05 comparison
sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
  / sum by (job) (rate(http_requests_total[5m])) > 0.05
 
# whether a job's metric disappeared entirely
absent(up{job="payment-api"})
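
Put into full rules, the two expressions above might look like the sketch below. The group name, the PaymentApiAbsent rule name, and the 1 request/second traffic floor are assumptions for illustration, not values from a real system.

```yaml
groups:
  - name: expr-patterns                # hypothetical group name
    rules:
      # absent() returns one series with value 1 when nothing matches,
      # carrying the labels from the equality matchers (job="payment-api").
      - alert: PaymentApiAbsent        # hypothetical rule name
        expr: absent(up{job="payment-api"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No up series for job {{ $labels.job }}"

      # Error ratio with a minimum-traffic guard: `and` keeps the ratio
      # only for jobs that also exceed the assumed 1 req/s floor.
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
              / sum by (job) (rate(http_requests_total[5m])) > 0.05
          )
          and
          sum by (job) (rate(http_requests_total[5m])) > 1
        for: 10m
        labels:
          severity: warning
```

The guard is optional. It is one way to keep a handful of failed requests on a nearly idle service from tripping a ratio-based alert.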

8. Common patterns

groups:
  - name: app-symptoms
    rules:
      # instance down
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Instance down: {{ $labels.instance }}"
 
      # 5xx error rate over 5%
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.job }} high error rate"
          description: "Error rate {{ $value | humanizePercentage }}, sustained 10 minutes."
 
      # p95 latency over 1s (histogram)
      - alert: HighRequestLatency
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.job }} high p95 latency"
          description: "p95 = {{ $value | humanizeDuration }} (threshold 1s)."
 
      # disk trending to fill within 4 hours (prediction)
      - alert: DiskWillFillSoon
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4*3600) < 0
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.instance }} disk expected to fill within 4 hours"
 
      # memory saturation over 90%
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.instance }} memory usage over 90%"

predict_linear(...[1h], 4*3600) extrapolates the value 4 hours out from the last hour's trend. < 0 means "at this trend it hits 0 (= full) within 4 hours," giving you an alert before the threshold is reached.
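
A worked example of that extrapolation, written as comments on the rule. The disk sizes and rates are invented, and the arithmetic assumes a perfectly linear trend over the last hour.

```yaml
- alert: DiskWillFillSoon
  # Hypothetical numbers, assuming a perfectly linear trend:
  #   40 GiB free, shrinking 12 GiB/h -> 40 - 12*4 = -8 GiB  (< 0, condition met)
  #   40 GiB free, shrinking  5 GiB/h -> 40 -  5*4 = 20 GiB  (>= 0, no match)
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4*3600) < 0
  for: 30m
  labels:
    severity: warning
```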

9. Using recording rules

For complex or frequently used expressions, a recording rule precomputes them into a new series, simplifying your alert rules and dashboards and reducing evaluation cost.

groups:
  - name: app-recordings
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
 
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))

By convention, names use the level:metric:operations form (job:http_errors:ratio5m). Then the alert becomes short:

- alert: HighErrorRate
  expr: job:http_errors:ratio5m > 0.05
  for: 10m
  labels: { severity: warning }

10. Validation and testing (promtool)

Before shipping rules to production, validate syntax and behavior with promtool.

# syntax/structure validation
promtool check rules rules/*.yml

Going further, unit tests let you pin down "given this input, this alert should fire."

# tests/instance-down.test.yml
rule_files:
  - ../rules/app-alerts.yml
evaluation_interval: 1m
 
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api", instance="a:8080"}'
        values: "1 1 0 0 0 0 0 0"      # 0 from t=2m (third sample), so firing at t=7m
    alert_rule_test:
      - eval_time: 7m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: a:8080
            exp_annotations:
              summary: "Instance down: a:8080"
              description: "api / a:8080 has not responded for over 5 minutes."

promtool test rules tests/instance-down.test.yml
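
It can be just as useful to assert that an alert is not yet firing. This companion sketch (the file name is hypothetical) reuses the same input and checks one minute earlier, when the alert should still be pending because only 4 of the 5 minutes have elapsed.

```yaml
# tests/instance-down-pending.test.yml (hypothetical file name)
rule_files:
  - ../rules/app-alerts.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api", instance="a:8080"}'
        values: "1 1 0 0 0 0 0 0"      # 0 from t=2m
    alert_rule_test:
      - eval_time: 6m                  # pending since 2m, for: 5m not yet satisfied
        alertname: InstanceDown
        exp_alerts: []                 # an empty list asserts nothing is firing
```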

Wiring promtool check rules and promtool test rules into CI prevents broken rules from being merged.
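
As one possible wiring, here is a sketch of a GitHub Actions workflow. The choice of CI system, the workflow and job names, and running promtool from the prom/prometheus container image are all assumptions; adapt them to whatever your pipeline already uses.

```yaml
# .github/workflows/prometheus-rules.yml (hypothetical workflow)
name: prometheus-rules
on: [pull_request]

jobs:
  promtool:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Syntax and structure check for every rule file
      - name: Check rules
        run: |
          docker run --rm -v "$PWD:/work" -w /work --entrypoint promtool \
            prom/prometheus check rules rules/*.yml

      # Behavioral unit tests
      - name: Test rules
        run: |
          docker run --rm -v "$PWD:/work" -w /work --entrypoint promtool \
            prom/prometheus test rules tests/*.test.yml
```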

11. Operational tips / anti-patterns

Put for on almost every rule. Transient-spike alerts disappear.
Alert on symptoms. Tie alerts to user impact (error rate, latency, availability); keep cause metrics (CPU, memory) as supporting signals.
Make annotations actionable. Current value, threshold, and a runbook link are essential.
Standardize severity. Routing depends on it.
Beware alert fatigue. Alerts you never act on get ignored by everyone. If there's no answer to "what does a person do when this fires," it's a dashboard item, not an alert.
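After changing rules, reload Prometheus. It does not pick up rule file changes on its own; send the process a SIGHUP or call the reload endpoint.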

# when running with --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload

Connecting to Alertmanager only needs the alerting: block from section 2; the actual delivery channels, routing, and deduplication are configured separately in Alertmanager's alertmanager.yml (a topic of its own).

Wrapping up

An Alert Rule is ultimately the combination of "a meaningful expr + an appropriate for + consistent labels + an actionable annotation." Capture symptoms precisely with expr, filter noise with for, route to the right person with labels, and make it immediately actionable with annotations: those four are the whole of a good alert. Finally, add promtool unit tests and you can verify, as you would with code, that a rule "fires as expected."

As a next step, connect these alerts to Alertmanager routes (per-severity channels, business-hours branching, inhibition) and dashboard the ALERTS metric to manage "alerts themselves" — that is, alert hygiene.