kafkaconsumer-lagstreamingoperationstroubleshooting

[Kafka Ops 1] Mastering Consumer Lag — Measuring, Diagnosing, and Resolving

From the precise definition of Kafka Consumer Lag (LEO − committed offset) through the offset commit mechanism, measuring with kafka-consumer-groups, JMX, and Burrow, to reading lag correctly by distinguishing normal lag from dangerous lag — all from an operations perspective.

Data DynamicsMay 31, 202611 min read

3 AM, PagerDuty fires: "Consumer lag exceeded 2 million." You rub your eyes, open the dashboard, and the number keeps climbing. Is this actually an incident, or just a transient effect of a traffic peak? Among everything you deal with operating Kafka, few metrics are as commonly encountered yet as hard to explain — "what exactly is it, and how much is dangerous?" In this first post of the series, we drill all the way into how to define, measure, and correctly read lag.

What you'll learn in this post

The precise definition of consumer lag: lag = LEO − committed offset, and offset lag vs time lag

How offsets are stored in __consumer_offsets and committed, and position vs committed offset

How to measure lag with kafka-consumer-groups.sh, JMX records-lag-max, and Burrow / Kafka Lag Exporter

The skill of distinguishing normal steady-state lag, growing lag, and spiky lag

A fast triage checklist, and preparation before moving on to Part 2 (root-cause deep dive)

1. What lag actually is

Consumer lag has a one-sentence definition.

lag = log-end-offset (LEO) − last committed offset — per partition.

That is, it is the distance between the position of the last message written to the partition (LEO) and the position the consumer last committed as processed (committed offset). This distance is precisely "the number of messages not yet processed."

The key point is that lag is a per-partition metric. The total lag of a consumer group is the sum of the lag across every partition the group subscribes to.

group lag = Σ (LEO_p − committed_offset_p)   for all partitions p assigned to the group

Looking only at the group total makes it easy to miss "which partition is the problem." Because it is common for a single partition to become a hot partition and generate lag all by itself, you should always break it down per partition.

Offset lag vs Time lag

There are two units for expressing lag. They answer entirely different questions.

Type	Definition	Question it answers	Unit
Offset lag	LEO − committed offset	"How many messages are backed up?"	record count
Time lag	now − (timestamp of oldest unprocessed message)	"How far behind in wall-clock time?"	time (ms/s)

Why do you need both? Because message size and processing cost are not uniform.

Even if offset lag is 1 million, at 0.1ms processing per message you may actually be only 100 seconds behind.
Even if offset lag is 10,000, if each message triggers a heavy DB join, the time lag could be 30 minutes.

If your SLA says "data must be processed within 5 minutes," what actually matters is time lag, not offset lag. Conversely, to assess the risk of message loss past the topic's retention, you need offset lag (relative to partition size). In operations you need both to see the full picture.

2. How offsets and commits work

To understand lag, you first need to know how a consumer records "how far it has read."

The `__consumer_offsets` internal topic

Kafka stores each consumer group's committed offsets in an internal topic called __consumer_offsets. This topic has the following characteristics.

Property	Value	Meaning
Partition count	`offsets.topic.num.partitions` (default 50)	A group maps to one partition via `hash(group.id) % 50`
Cleanup policy	`compact`	Keeps only the latest offset per (group, topic, partition) key
Replication factor	`offsets.topic.replication.factor` (default 3)	Guarantees durability of offset data

On commit, the consumer writes (group.id, topic, partition) → offset, metadata, commit-timestamp as a message to this topic. Thanks to compaction, past commits for the same key are cleaned up and only the latest value remains.

Position vs Committed offset

Here you must distinguish two frequently confused concepts.

Position (current position): the offset the consumer will fetch next via poll(). It is an in-memory value that advances forward on every fetch.
Committed offset: the offset the consumer last recorded to __consumer_offsets. After a rebalance or restart, the consumer resumes from here.

... [read but not yet committed] ...
LEO ──────────────────────────────► right edge (last produced + 1)
        ▲ position (next read location)
              ▲ committed offset (recovery point on failure)

Lag is computed against the committed offset (the CURRENT-OFFSET column of kafka-consumer-groups is the committed offset). Therefore, even if a consumer processes messages quickly, committing late or infrequently can make the lag metric appear larger than reality.

Commit method	Configuration	Effect on lag metric
Auto commit	`enable.auto.commit=true`, `auto.commit.interval.ms=5000`	Up to 5s of processing uncommitted → lag always looks slightly inflated
Manual sync commit	`commitSync()`	Accurate but may reduce throughput
Manual async commit	`commitAsync()`	Great throughput, but needs a retry strategy on failure

Note: One common cause of "processing is done but lag won't drop" is the commit interval/failure. We cover this pitfall in detail in Part 2.

3. Measuring lag

3.1 CLI — `kafka-consumer-groups.sh`

The most basic and most accurate tool. It queries the broker directly and shows group state.

kafka-consumer-groups.sh \
  --bootstrap-server kafka-1:9092 \
  --describe \
  --group order-processor

Example output:

GROUP           TOPIC        PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID                HOST            CLIENT-ID
order-processor  orders       0          1048210         1048210         0      consumer-1-a1b2c3-...      /10.0.1.21      consumer-1
order-processor  orders       1          982134          1003221         21087  consumer-2-d4e5f6-...      /10.0.1.22      consumer-2
order-processor  orders       2          1120004         1120010         6      consumer-3-g7h8i9-...      /10.0.1.23      consumer-3
order-processor  orders       3          550120          770540          220420 -                          -               -

How to read it:

CURRENT-OFFSET = committed offset, LOG-END-OFFSET = LEO, LAG = the difference.
Partition 1 has lag 21,087 — it is processing but slightly behind.
Partition 3 has lag 220,420 but CONSUMER-ID is - — meaning no consumer is assigned to it. This is a strong signal of too few instances or an in-progress rebalance. It is the decisive clue you would have missed by looking only at the group total.

To quickly check just the state:

# List all groups
kafka-consumer-groups.sh --bootstrap-server kafka-1:9092 --list
 
# Summarize group STATE — check Stable / Rebalancing / Empty
kafka-consumer-groups.sh --bootstrap-server kafka-1:9092 --describe --group order-processor --state

3.2 JMX metric — `records-lag-max`

The CLI is a snapshot and puts query load on the broker. For continuous monitoring, the JMX metric the consumer exposes itself is more suitable.

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=<id>
  ├─ records-lag-max     # max lag across all partitions of this consumer (most important)
  ├─ records-lag         # current per-partition lag (with a partition tag)
  └─ records-lead-min    # headroom from the partition start (log-start-offset) — loss imminent if it shrinks

records-lag-max is computed on the consumer side, so you can scrape it with Prometheus (JMX Exporter) and alert without querying the broker. Monitor records-lead-min as well — when this value approaches 0, it warns that unprocessed messages risk being deleted by retention.

3.3 External tools

Tool	Characteristics	When it fits
Burrow (LinkedIn)	Judges status (OK/WARN/ERR) by consumption trend (evaluation rules) instead of thresholds. Auto-detects a stalled/slow committer	When you want group health without tuning thresholds
Kafka Lag Exporter	Exposes offset lag and (estimated) time lag simultaneously as Prometheus metrics	Standardizing Grafana dashboards/alerts
Cruise Control (LinkedIn)	Automates partition rebalancing and load distribution rather than lag itself	Resolving hot partitions / broker imbalance
kafka-consumer-groups	Accurate snapshot with no extra infrastructure	Ad-hoc debugging

Burrow's core idea is the insight that "applying thresholds to absolute lag values yields many false positives depending on traffic." So Burrow looks at the trend of offset progression and judges whether the consumer has stalled (committed offset stops moving) or cannot keep up with the production rate.

4. Reading lag correctly

Do not panic over a single number. Lag must be read as a pattern over time.

Loading diagram…

Interpretation by pattern

Pattern	Graph shape	Usual meaning	Response
Steady-state lag	Nonzero but flat	Normal — in-flight buffer. Consumer keeps up with production	Set the threshold to a reasonable nonzero value
Growing lag	Steadily rising	Dangerous — consumption < production. Capacity/bottleneck/stall	Investigate immediately (Part 2)
Spiky lag	Sawtooth/pulse then fall	Usually normal — batch production / traffic peak absorption	Confirm recovery time is within SLA
Step-and-plateau	Rises then stays flat	Some consumers stalled / unassigned after rebalance	Check per partition and consumer-id

Why small steady-state lag is normal

If lag is always exactly 0, you should actually be suspicious. That usually means "production has stopped or the consumer is over-provisioned." A healthy streaming pipeline always has some in-flight messages (fetched batches, a processing queue), so small, steady lag is a healthy state. Therefore, the alert threshold should be of the form "lag keeps rising for N minutes" or "time lag > SLA" — not "lag > 0."

Records vs Time, again

For the same offset lag of 50,000:

If messages are small and the transform is stateless → time lag of a few seconds, negligible.
If each message makes an external API call → time lag of tens of minutes, SLA violation.

That is why alerting on time lag where possible reduces false positives (use Kafka Lag Exporter's estimated time lag, or a calculation based on message timestamps).

5. Fast triage checklist

When the 3 AM alert hits, check in order.

Loading diagram…

The checklist:

Check the trend: rising, flat, or recovering? (Never a single snapshot.)
Break down per partition: use --describe to see if it concentrates on specific partitions.
Assignment state: is any partition's CONSUMER-ID -? → too few instances / rebalance.
Group STATE: does --state show repeated Rebalancing? → suspect a rebalance storm.
Production-side change: did the production rate spike? (Lag can come from a production surge, not consumption.)
Check time lag: even with a large record count, is it within the SLA time?
Retention headroom: is records-lead-min approaching 0? → loss imminent, top-priority response.

These 7 steps let you distinguish "normal/recovering" from "a real incident" in most cases. The root-cause diagnosis and resolution for cases classified as "a real incident" (why the consumer can't catch up, rebalance storms, processing bottlenecks, partition scaling) is covered in earnest in this series' Part 2: "When lag won't go down."

Wrapping up

lag = LEO − committed offset, per partition. Group lag is the sum. Always break it down per partition.
offset lag (record count) and time lag (wall clock) answer different questions. If your SLA is time-based, alert on time lag.
Offsets are committed to __consumer_offsets (a compact topic). Since lag is measured against the committed offset, the commit interval can inflate the metric.
For measurement, combine kafka-consumer-groups (snapshot) + JMX records-lag-max (continuous monitoring) + Burrow / Kafka Lag Exporter (trend, time lag).
Small, steady lag is normal. The danger signals are "sustained growth" and "time lag > SLA." Do not alert on "lag > 0."
In the next post (Part 2), we cover the root causes and resolution strategies for "when lag won't go down."

References

Apache Kafka Documentation — https://kafka.apache.org/documentation/

Apache Kafka — Consumer Group / Offset Management — https://kafka.apache.org/documentation/#impl_offsettracking

kafka-consumer-groups tooling (Kafka Operations) — https://kafka.apache.org/documentation/#basic_ops_consumer_group

LinkedIn Burrow — Kafka Consumer Lag Checking — https://github.com/linkedin/Burrow

Kafka Lag Exporter — https://github.com/seglo/kafka-lag-exporter

— Data Dynamics Engineering Team