[Kafka Ops 1] Mastering Consumer Lag — Measuring, Diagnosing, and Resolving
From the precise definition of Kafka Consumer Lag (LEO − committed offset) through the offset commit mechanism, measuring with kafka-consumer-groups, JMX, and Burrow, to reading lag correctly by distinguishing normal lag from dangerous lag — all from an operations perspective.
3 AM, PagerDuty fires: "Consumer lag exceeded 2 million." You rub your eyes, open the dashboard, and the number keeps climbing. Is this actually an incident, or just a transient effect of a traffic peak? Of all the metrics you deal with when operating Kafka, few are as commonly encountered as consumer lag, and few are as hard to explain: what exactly is it, and how much is dangerous? In this first post of the series, we drill all the way into how to define, measure, and correctly read lag.
What you'll learn in this post
- The precise definition of consumer lag: lag = LEO − committed offset, and offset lag vs time lag
- How offsets are stored in __consumer_offsets and committed, and position vs committed offset
- How to measure lag with kafka-consumer-groups.sh, JMX records-lag-max, and Burrow / Kafka Lag Exporter
- The skill of distinguishing normal steady-state lag, growing lag, and spiky lag
- A fast triage checklist, and preparation before moving on to Part 2 (root-cause deep dive)
1. What lag actually is
Consumer lag has a one-sentence definition.
lag = log-end-offset (LEO) − last committed offset — per partition.
That is, it is the distance between the log end offset (LEO), which is the offset the next produced message will receive (last written offset + 1), and the offset the consumer group last committed as processed (committed offset). This distance is the number of messages the group has not yet committed as processed.
The key point is that lag is a per-partition metric. The total lag of a consumer group is the sum of the lag across every partition the group subscribes to.
group lag = Σ (LEO_p − committed_offset_p) for all partitions p assigned to the group
Looking only at the group total makes it easy to miss "which partition is the problem." Because it is common for a single partition to become a hot partition and generate lag all by itself, you should always break it down per partition.
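The per-partition definition and the group sum can be sketched in a few lines. This is a minimal illustration with made-up offsets held in a plain dictionary; in practice the values would come from the broker, and the names `offsets`, `partition_lag`, and `hottest` are hypothetical.

```python
from typing import Dict, Tuple

# partition -> (log_end_offset, committed_offset); illustrative values only
offsets: Dict[int, Tuple[int, int]] = {
    0: (1_048_210, 1_048_210),
    1: (1_003_221, 982_134),
    2: (1_120_010, 1_120_004),
    3: (770_540, 550_120),
}

def partition_lag(log_end: int, committed: int) -> int:
    """lag = LEO - committed offset, clamped at 0 to absorb races between the two reads."""
    return max(log_end - committed, 0)

lag_by_partition = {p: partition_lag(leo, c) for p, (leo, c) in offsets.items()}
group_lag = sum(lag_by_partition.values())

# The partition contributing the most lag, and its share of the group total
hottest = max(lag_by_partition, key=lag_by_partition.get)
share = lag_by_partition[hottest] / group_lag if group_lag else 0.0

print(lag_by_partition)
print(f"group lag={group_lag}, hottest partition={hottest}, share={share:.1%}")
```

Reporting the share held by the worst partition next to the total makes a hot partition visible even when only a single number fits on the dashboard.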
Offset lag vs Time lag
There are two units for expressing lag. They answer entirely different questions.
| Type | Definition | Question it answers | Unit |
|---|---|---|---|
| Offset lag | LEO − committed offset | "How many messages are backed up?" | record count |
| Time lag | now − (timestamp of oldest unprocessed message) | "How far behind in wall-clock time?" | time (ms/s) |
Why do you need both? Because message size and processing cost are not uniform.
- Even if offset lag is 1 million, at 0.1ms processing per message you may actually be only 100 seconds behind.
- Even if offset lag is 10,000, if each message triggers a heavy DB join, the time lag could be 30 minutes.
If your SLA says "data must be processed within 5 minutes," what actually matters is time lag, not offset lag. Conversely, to assess the risk of message loss past the topic's retention, you need offset lag (relative to partition size). In operations you need both to see the full picture.
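The two bullet examples above are simple arithmetic, sketched here under the simplifying assumption that processing cost per record is constant and that production pauses while the consumer catches up. The function names and the 180 ms figure for the "heavy DB join" case are illustrative assumptions.

```python
import time
from typing import Optional

def catch_up_seconds(offset_lag: int, seconds_per_record: float) -> float:
    """Time needed to drain the backlog at a constant per-record cost."""
    return offset_lag * seconds_per_record

def time_lag_seconds(oldest_unprocessed_ts_ms: int, now_ms: Optional[int] = None) -> float:
    """Time lag = now - timestamp of the oldest unprocessed message."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return max(now_ms - oldest_unprocessed_ts_ms, 0) / 1000.0

# 1,000,000 records at 0.1 ms each: a large offset lag, a small delay
fast = catch_up_seconds(1_000_000, 0.0001)

# 10,000 records at an assumed 180 ms each: a small offset lag, a 30-minute delay
slow = catch_up_seconds(10_000, 0.18)

print(f"fast pipeline: {fast:.0f} s, slow pipeline: {slow / 60:.0f} min")
```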
2. How offsets and commits work
To understand lag, you first need to know how a consumer records "how far it has read."
The __consumer_offsets internal topic
Kafka stores each consumer group's committed offsets in an internal topic called __consumer_offsets. This topic has the following characteristics.
| Property | Value | Meaning |
|---|---|---|
| Partition count | offsets.topic.num.partitions (default 50) | A group maps to one partition via hash(group.id) % 50 |
| Cleanup policy | compact | Keeps only the latest offset per (group, topic, partition) key |
| Replication factor | offsets.topic.replication.factor (default 3) | Guarantees durability of offset data |
On commit, the consumer writes (group.id, topic, partition) → offset, metadata, commit-timestamp as a message to this topic. Thanks to compaction, past commits for the same key are cleaned up and only the latest value remains.
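To see which __consumer_offsets partition a group lands on, the mapping from the table can be reproduced offline. The sketch below assumes the broker takes the absolute value of Java's String.hashCode() of the group id modulo the partition count; treat that as an assumption to verify against your broker version. The function names are hypothetical.

```python
def java_string_hash(s: str) -> int:
    """Replicate Java's String.hashCode() over UTF-16 code units as a signed 32-bit int."""
    data = s.encode("utf-16-be")
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    return h - (1 << 32) if h >= (1 << 31) else h

def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Assumed mapping: abs(hashCode(group.id)) % offsets.topic.num.partitions."""
    h = java_string_hash(group_id)
    non_negative = 0 if h == -(1 << 31) else abs(h)  # abs() of the minimum int is treated as 0
    return non_negative % num_partitions

print(offsets_partition_for("order-processor"))
```

Knowing the partition tells you which broker acts as the group's coordinator (the leader of that partition), which helps when commits fail for one group but not others.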
Position vs Committed offset
Here you must distinguish two frequently confused concepts.
- Position (current position): the offset the consumer will fetch next via poll(). It is an in-memory value that advances forward on every fetch.
- Committed offset: the offset the consumer last recorded to __consumer_offsets. After a rebalance or restart, the consumer resumes from here.
committed offset (recovery point on failure)
    │  ... [read but not yet committed] ...
    ▼
position (next read location)
    │  ... [not yet fetched] ...
    ▼
LEO (right edge: last produced + 1)
Lag is computed against the committed offset (the CURRENT-OFFSET column of kafka-consumer-groups is the committed offset). Therefore, even if a consumer processes messages quickly, committing late or infrequently can make the lag metric appear larger than reality.
| Commit method | Configuration | Effect on lag metric |
|---|---|---|
| Auto commit | enable.auto.commit=true, auto.commit.interval.ms=5000 | Up to 5s of processing uncommitted → lag always looks slightly inflated |
| Manual sync commit | commitSync() | Accurate but may reduce throughput |
| Manual async commit | commitAsync() | Great throughput, but needs a retry strategy on failure |
Note: One common cause of "processing is done but lag won't drop" is the commit interval/failure. We cover this pitfall in detail in Part 2.
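How much the commit interval can inflate the metric is easy to estimate. The sketch below splits the reported lag into its two parts and bounds the uncommitted part for a given processing rate; the rate of 2,000 records/s and the offsets are illustrative assumptions, and the function names are hypothetical.

```python
def max_commit_inflation(records_per_second: float, commit_interval_ms: int) -> int:
    """Upper bound on records read but not yet committed between two commits."""
    return int(records_per_second * commit_interval_ms / 1000)

def split_lag(log_end: int, position: int, committed: int) -> dict:
    """Break committed-offset lag into the unfetched part and the uncommitted part."""
    return {
        "reported_lag": log_end - committed,  # what committed-offset-based tools show
        "unfetched": log_end - position,      # records the consumer has not read yet
        "uncommitted": position - committed,  # read, but the commit has not happened yet
    }

# With auto.commit.interval.ms=5000 and an assumed 2,000 records/s
inflation = max_commit_inflation(2_000, 5_000)

parts = split_lag(log_end=1_003_221, position=1_001_000, committed=982_134)
print(inflation, parts)
```

In this example most of the reported lag is commit delay rather than real backlog, which is why a large number alone does not prove the consumer is slow.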
3. Measuring lag
3.1 CLI — kafka-consumer-groups.sh
The most basic and most accurate tool. It queries the broker directly and shows group state.
kafka-consumer-groups.sh \
--bootstrap-server kafka-1:9092 \
--describe \
--group order-processor
Example output:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
order-processor orders 0 1048210 1048210 0 consumer-1-a1b2c3-... /10.0.1.21 consumer-1
order-processor orders 1 982134 1003221 21087 consumer-2-d4e5f6-... /10.0.1.22 consumer-2
order-processor orders 2 1120004 1120010 6 consumer-3-g7h8i9-... /10.0.1.23 consumer-3
order-processor orders 3 550120 770540 220420 - - -
How to read it:
- CURRENT-OFFSET = committed offset, LOG-END-OFFSET = LEO, LAG = the difference.
- Partition 1 has lag 21,087 — it is processing but slightly behind.
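- Partition 3 has lag 220,420 but its CONSUMER-ID is "-", meaning no consumer is currently assigned to it. This is a strong signal that the group has lost members (an instance crashed or was evicted, or none is running) or that a rebalance is in progress. Simply running fewer instances than partitions does not cause this, because every partition is still assigned to one of the live members. It is the decisive clue you would have missed by looking only at the group total.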
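The reading steps above can be automated for ad-hoc debugging. This is a minimal sketch that parses the sample output shown earlier; it assumes the nine whitespace-separated columns of that sample, and the column layout can differ between Kafka versions. `Row` and `parse_describe` are hypothetical names.

```python
from typing import List, NamedTuple

class Row(NamedTuple):
    topic: str
    partition: int
    committed: int
    log_end: int
    lag: int
    consumer_id: str

SAMPLE = """\
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
order-processor orders 0 1048210 1048210 0 consumer-1-a1b2c3-... /10.0.1.21 consumer-1
order-processor orders 1 982134 1003221 21087 consumer-2-d4e5f6-... /10.0.1.22 consumer-2
order-processor orders 2 1120004 1120010 6 consumer-3-g7h8i9-... /10.0.1.23 consumer-3
order-processor orders 3 550120 770540 220420 - - -
"""

def parse_describe(text: str) -> List[Row]:
    rows = []
    for line in text.splitlines():
        cols = line.split()
        # Skip blank lines, the header, and anything without the 9 expected columns
        if len(cols) < 9 or cols[0] == "GROUP":
            continue
        _group, topic, partition, committed, log_end, lag, consumer_id = cols[:7]
        if not (partition.isdigit() and committed.isdigit() and log_end.isdigit() and lag.isdigit()):
            continue  # offsets may be printed as "-" when nothing has been committed
        rows.append(Row(topic, int(partition), int(committed), int(log_end), int(lag), consumer_id))
    return rows

rows = parse_describe(SAMPLE)
unassigned = [r for r in rows if r.consumer_id == "-"]
worst_first = sorted(rows, key=lambda r: r.lag, reverse=True)

for r in worst_first:
    flag = "UNASSIGNED" if r.consumer_id == "-" else ""
    print(r.topic, r.partition, r.lag, flag)
```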
To quickly check just the state:
# List all groups
kafka-consumer-groups.sh --bootstrap-server kafka-1:9092 --list
# Summarize group STATE: check Stable / PreparingRebalance / CompletingRebalance / Empty
kafka-consumer-groups.sh --bootstrap-server kafka-1:9092 --describe --group order-processor --state
3.2 JMX metric — records-lag-max
The CLI is a snapshot and puts query load on the broker. For continuous monitoring, the JMX metric the consumer exposes itself is more suitable.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=<id>
├─ records-lag-max # max lag across all partitions of this consumer (most important)
├─ records-lag # current per-partition lag (with a partition tag)
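└─ records-lead-min # headroom from the partition start (log-start-offset); loss imminent if it shrinks
records-lag-max is computed on the consumer side, so you can scrape it with Prometheus (JMX Exporter) and alert without querying the broker. Keep in mind that the consumer computes it against its current position rather than the committed offset, so it can read lower than the CLI's LAG column, and a consumer that has died reports nothing at all. Monitor records-lead-min as well: when this value approaches 0, unprocessed messages are at risk of being deleted by retention.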
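Lag and lead are two distances measured from the same consumer position, one toward the end of the log and one toward its start. The sketch below shows the relationship with illustrative offsets; the `floor` threshold and the function names are hypothetical, and the right threshold depends on your production rate and retention settings.

```python
from typing import Tuple

def lag_and_lead(log_start: int, position: int, log_end: int) -> Tuple[int, int]:
    """Return (lag, lead) for one partition, both in records."""
    lag = log_end - position     # records still ahead of the consumer
    lead = position - log_start  # retained records behind the consumer
    return lag, lead

def retention_risk(lead: int, floor: int = 10_000) -> bool:
    """True when the consumer sits within `floor` records of the log start."""
    return lead <= floor

# Before retention runs: plenty of headroom behind the consumer
before = lag_and_lead(log_start=500_000, position=550_120, log_end=770_540)

# Retention has advanced the log start while the consumer stood still
after = lag_and_lead(log_start=549_000, position=550_120, log_end=770_540)

print(before, retention_risk(before[1]))
print(after, retention_risk(after[1]))
```

The lag is identical in both snapshots; only the lead reveals that deletion is about to overtake the consumer.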
3.3 External tools
| Tool | Characteristics | When it fits |
|---|---|---|
| Burrow (LinkedIn) | Judges status (OK/WARN/ERR) by consumption trend (evaluation rules) instead of thresholds. Auto-detects a stalled/slow committer | When you want group health without tuning thresholds |
| Kafka Lag Exporter | Exposes offset lag and (estimated) time lag simultaneously as Prometheus metrics | Standardizing Grafana dashboards/alerts |
| Cruise Control (LinkedIn) | Automates partition rebalancing and load distribution rather than lag itself | Resolving hot partitions / broker imbalance |
| kafka-consumer-groups | Accurate snapshot with no extra infrastructure | Ad-hoc debugging |
Burrow's core idea is the insight that "applying thresholds to absolute lag values yields many false positives depending on traffic." So Burrow looks at the trend of offset progression and judges whether the consumer has stalled (committed offset stops moving) or cannot keep up with the production rate.
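The trend-based idea can be illustrated with a small evaluator over a sliding window of commits. This is a simplified sketch loosely modeled on the kind of rules Burrow describes, not Burrow's actual implementation; the status names, the window shape, and `evaluate_window` are assumptions made for illustration.

```python
from typing import List, Tuple

Sample = Tuple[int, int]  # (committed_offset, lag) captured at successive checks

def evaluate_window(window: List[Sample]) -> str:
    """Classify a consumer partition from the trend of its recent samples."""
    if len(window) < 2:
        return "OK"  # not enough evidence to judge a trend
    offsets = [offset for offset, _ in window]
    lags = [lag for _, lag in window]
    if min(lags) == 0:
        return "OK"  # the consumer caught up at least once in the window
    if offsets[-1] == offsets[0]:
        return "ERR"  # backlog exists and the committed offset never moved: stalled
    if all(later >= earlier for earlier, later in zip(lags, lags[1:])):
        return "WARN"  # still committing, but lag never decreased: falling behind
    return "OK"

print(evaluate_window([(100, 5), (200, 0), (300, 7)]))       # caught up mid-window
print(evaluate_window([(100, 50), (100, 80), (100, 120)]))   # offset frozen
print(evaluate_window([(100, 50), (150, 80), (200, 120)]))   # moving but losing ground
```

Note that no absolute lag threshold appears anywhere: a lag of 120 is an error in one window and would be fine in another, depending only on how offsets and lag moved.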
4. Reading lag correctly
Do not panic over a single number. Lag must be read as a pattern over time.
Interpretation by pattern
| Pattern | Graph shape | Usual meaning | Response |
|---|---|---|---|
| Steady-state lag | Nonzero but flat | Normal — in-flight buffer. Consumer keeps up with production | Set the threshold to a reasonable nonzero value |
| Growing lag | Steadily rising | Dangerous — consumption < production. Capacity/bottleneck/stall | Investigate immediately (Part 2) |
| Spiky lag | Sawtooth/pulse then fall | Usually normal — batch production / traffic peak absorption | Confirm recovery time is within SLA |
| Step-and-plateau | Rises then stays flat | Some consumers stalled / unassigned after rebalance | Check per partition and consumer-id |
Why small steady-state lag is normal
If lag is always exactly 0, you should actually be suspicious. That usually means "production has stopped or the consumer is over-provisioned." A healthy streaming pipeline always has some in-flight messages (fetched batches, a processing queue), so small, steady lag is a healthy state. Therefore, the alert threshold should be of the form "lag keeps rising for N minutes" or "time lag > SLA" — not "lag > 0."
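An alert rule of that form might look like the following sketch. It assumes one lag sample per minute and fits a least-squares slope over the most recent window; the window length, the minimum growth rate, and the function names are hypothetical defaults to tune for your pipeline.

```python
from typing import Sequence

def slope_per_minute(lags: Sequence[float]) -> float:
    """Least-squares slope of lag samples taken one minute apart."""
    n = len(lags)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(lags) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(lags))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def should_alert(lags: Sequence[float], time_lag_s: float, sla_s: float,
                 window: int = 10, min_growth_per_min: float = 100.0) -> bool:
    """Alert on sustained growth over the window, or on time lag above the SLA."""
    recent = list(lags)[-window:]
    sustained_growth = len(recent) >= window and slope_per_minute(recent) >= min_growth_per_min
    return sustained_growth or time_lag_s > sla_s

steady = [5_000] * 10                      # nonzero but flat: healthy
growing = [1_000 * i for i in range(10)]   # rising about 1,000 records per minute

print(should_alert(steady, time_lag_s=20, sla_s=300))
print(should_alert(growing, time_lag_s=20, sla_s=300))
print(should_alert(steady, time_lag_s=400, sla_s=300))
```

Using a fitted slope rather than comparing two points keeps a single noisy sample from triggering or suppressing the alert.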
Records vs Time, again
For the same offset lag of 50,000:
- If messages are small and the transform is stateless → time lag of a few seconds, negligible.
- If each message makes an external API call → time lag of tens of minutes, SLA violation.
That is why alerting on time lag where possible reduces false positives (use Kafka Lag Exporter's estimated time lag, or a calculation based on message timestamps).
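When message timestamps are not readily available, time lag can be estimated from offsets alone: record the log end offset periodically, then interpolate to find when the log end was at the committed offset. The sketch below shows that interpolation under the assumption that production is roughly linear between samples; it illustrates the general approach rather than any specific tool's code, and `estimate_time_lag` is a hypothetical name.

```python
from bisect import bisect_left
from typing import List, Tuple

Point = Tuple[float, int]  # (unix_seconds, log_end_offset), sampled in time order

def estimate_time_lag(history: List[Point], committed: int, now: float) -> float:
    """Estimate now minus the time at which the log end offset equaled `committed`."""
    if not history:
        raise ValueError("need at least one (time, log_end_offset) sample")
    offsets = [offset for _, offset in history]
    if committed >= offsets[-1]:
        return 0.0  # caught up with the latest sample
    if committed <= offsets[0]:
        return now - history[0][0]  # older than the history: a lower bound only
    i = bisect_left(offsets, committed)
    (t0, o0), (t1, o1) = history[i - 1], history[i]
    # Linear interpolation between the two samples that bracket the committed offset
    produced_at = t0 + (t1 - t0) * (committed - o0) / (o1 - o0)
    return now - produced_at

history = [(0.0, 1_000), (60.0, 4_000), (120.0, 7_000)]
print(estimate_time_lag(history, committed=5_500, now=120.0))
```

The estimate degrades when production is bursty between samples, so a shorter sampling interval gives a tighter result.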
5. Fast triage checklist
When the 3 AM alert hits, check in order.
The checklist:
- Check the trend: rising, flat, or recovering? (Never a single snapshot.)
- Break down per partition: use --describe to see if it concentrates on specific partitions.
- Assignment state: is any partition's CONSUMER-ID "-"? → members lost / rebalance in progress.
- Group STATE: does --state repeatedly show PreparingRebalance or CompletingRebalance? → suspect a rebalance storm.
- Production-side change: did the production rate spike? (Lag can come from a production surge, not consumption.)
- Check time lag: even with a large record count, is it within the SLA time?
- Retention headroom: is records-lead-min approaching 0? → loss imminent, top-priority response.
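The checklist can also be encoded as a small function that turns one observation into an ordered list of findings. This is a rough sketch: the `Snapshot` fields, the thresholds (`hot_share`, `surge_ratio`, `lead_floor`), and the wording of the findings are all hypothetical and would need tuning for a real pipeline.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Snapshot:
    lag_trend: str                      # "rising", "flat", or "recovering"
    lag_by_partition: Dict[int, int]
    group_state: str                    # e.g. "Stable"
    produce_rate_ratio: float           # current produce rate / usual produce rate
    time_lag_s: float
    sla_s: float
    records_lead_min: int
    unassigned_partitions: List[int] = field(default_factory=list)

def triage(s: Snapshot, hot_share: float = 0.8, surge_ratio: float = 2.0,
           lead_floor: int = 10_000) -> List[str]:
    """Walk the checklist in order and collect every step that looks abnormal."""
    findings = []
    if s.lag_trend == "rising":
        findings.append("trend: lag is rising")
    total = sum(s.lag_by_partition.values())
    if total > 0:
        partition, worst = max(s.lag_by_partition.items(), key=lambda kv: kv[1])
        if worst / total >= hot_share:
            findings.append(f"partition {partition} holds most of the lag")
    if s.unassigned_partitions:
        findings.append(f"unassigned partitions: {s.unassigned_partitions}")
    if s.group_state != "Stable":
        findings.append(f"group state is {s.group_state}")
    if s.produce_rate_ratio >= surge_ratio:
        findings.append("production rate surged")
    if s.time_lag_s > s.sla_s:
        findings.append("time lag exceeds SLA")
    if s.records_lead_min <= lead_floor:
        findings.append("retention headroom low: top priority")
    return findings

incident = Snapshot("rising", {0: 0, 1: 21_087, 2: 6, 3: 220_420}, "Stable",
                    1.0, 900.0, 300.0, 50_120, unassigned_partitions=[3])
healthy = Snapshot("flat", {0: 40, 1: 55, 2: 38}, "Stable", 1.1, 4.0, 300.0, 2_000_000)

print(triage(incident))
print(triage(healthy))
```

An empty result corresponds to the "normal/recovering" outcome; anything else points to the step worth investigating first.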
These 7 steps let you distinguish "normal/recovering" from "a real incident" in most cases. The root-cause diagnosis and resolution for cases classified as "a real incident" (why the consumer can't catch up, rebalance storms, processing bottlenecks, partition scaling) is covered in earnest in this series' Part 2: "When lag won't go down."
Wrapping up
- lag = LEO − committed offset, per partition. Group lag is the sum. Always break it down per partition.
- offset lag (record count) and time lag (wall clock) answer different questions. If your SLA is time-based, alert on time lag.
- Offsets are committed to __consumer_offsets (a compact topic). Since lag is measured against the committed offset, the commit interval can inflate the metric.
- For measurement, combine kafka-consumer-groups (snapshot) + JMX records-lag-max (continuous monitoring) + Burrow / Kafka Lag Exporter (trend, time lag).
- Small, steady lag is normal. The danger signals are "sustained growth" and "time lag > SLA." Do not alert on "lag > 0."
- In the next post (Part 2), we cover the root causes and resolution strategies for "when lag won't go down."
References
- Apache Kafka Documentation: https://kafka.apache.org/documentation/
- Apache Kafka, Consumer Group / Offset Management: https://kafka.apache.org/documentation/#impl_offsettracking
- kafka-consumer-groups tooling (Kafka Operations): https://kafka.apache.org/documentation/#basic_ops_consumer_group
- LinkedIn Burrow, Kafka Consumer Lag Checking: https://github.com/linkedin/Burrow
- Kafka Lag Exporter: https://github.com/seglo/kafka-lag-exporter
— Data Dynamics Engineering Team