Blog
kafkaconsumer-lagstreamingoperationstroubleshooting

[Kafka Ops 1] Mastering Consumer Lag — Measuring, Diagnosing, and Resolving

From the precise definition of Kafka Consumer Lag (LEO − committed offset) through the offset commit mechanism, measuring with kafka-consumer-groups, JMX, and Burrow, to reading lag correctly by distinguishing normal lag from dangerous lag — all from an operations perspective.

Data DynamicsMay 31, 202611 min read

3 AM, PagerDuty fires: "Consumer lag exceeded 2 million." You rub your eyes, open the dashboard, and the number keeps climbing. Is this actually an incident, or just a transient effect of a traffic peak? Among everything you deal with operating Kafka, few metrics are as commonly encountered yet as hard to explain — "what exactly is it, and how much is dangerous?" In this first post of the series, we drill all the way into how to define, measure, and correctly read lag.

What you'll learn in this post

  • The precise definition of consumer lag: lag = LEO − committed offset, and offset lag vs time lag
  • How offsets are stored in __consumer_offsets and committed, and position vs committed offset
  • How to measure lag with kafka-consumer-groups.sh, JMX records-lag-max, and Burrow / Kafka Lag Exporter
  • The skill of distinguishing normal steady-state lag, growing lag, and spiky lag
  • A fast triage checklist, and preparation before moving on to Part 2 (root-cause deep dive)

1. What lag actually is

Consumer lag has a one-sentence definition.

lag = log-end-offset (LEO) − last committed offset — per partition.

That is, it is the distance between the position of the last message written to the partition (LEO) and the position the consumer last committed as processed (committed offset). This distance is precisely "the number of messages not yet processed."

The key point is that lag is a per-partition metric. The total lag of a consumer group is the sum of the lag across every partition the group subscribes to.

group lag = Σ (LEO_p − committed_offset_p)   for all partitions p assigned to the group

Looking only at the group total makes it easy to miss "which partition is the problem." Because it is common for a single partition to become a hot partition and generate lag all by itself, you should always break it down per partition.

Offset lag vs Time lag

There are two units for expressing lag. They answer entirely different questions.

TypeDefinitionQuestion it answersUnit
Offset lagLEO − committed offset"How many messages are backed up?"record count
Time lagnow − (timestamp of oldest unprocessed message)"How far behind in wall-clock time?"time (ms/s)

Why do you need both? Because message size and processing cost are not uniform.

  • Even if offset lag is 1 million, at 0.1ms processing per message you may actually be only 100 seconds behind.
  • Even if offset lag is 10,000, if each message triggers a heavy DB join, the time lag could be 30 minutes.

If your SLA says "data must be processed within 5 minutes," what actually matters is time lag, not offset lag. Conversely, to assess the risk of message loss past the topic's retention, you need offset lag (relative to partition size). In operations you need both to see the full picture.


2. How offsets and commits work

To understand lag, you first need to know how a consumer records "how far it has read."

The __consumer_offsets internal topic

Kafka stores each consumer group's committed offsets in an internal topic called __consumer_offsets. This topic has the following characteristics.

PropertyValueMeaning
Partition countoffsets.topic.num.partitions (default 50)A group maps to one partition via hash(group.id) % 50
Cleanup policycompactKeeps only the latest offset per (group, topic, partition) key
Replication factoroffsets.topic.replication.factor (default 3)Guarantees durability of offset data

On commit, the consumer writes (group.id, topic, partition) → offset, metadata, commit-timestamp as a message to this topic. Thanks to compaction, past commits for the same key are cleaned up and only the latest value remains.

Position vs Committed offset

Here you must distinguish two frequently confused concepts.

  • Position (current position): the offset the consumer will fetch next via poll(). It is an in-memory value that advances forward on every fetch.
  • Committed offset: the offset the consumer last recorded to __consumer_offsets. After a rebalance or restart, the consumer resumes from here.
... [read but not yet committed] ...
LEO ──────────────────────────────► right edge (last produced + 1)
        ▲ position (next read location)
              ▲ committed offset (recovery point on failure)

Lag is computed against the committed offset (the CURRENT-OFFSET column of kafka-consumer-groups is the committed offset). Therefore, even if a consumer processes messages quickly, committing late or infrequently can make the lag metric appear larger than reality.

Commit methodConfigurationEffect on lag metric
Auto commitenable.auto.commit=true, auto.commit.interval.ms=5000Up to 5s of processing uncommitted → lag always looks slightly inflated
Manual sync commitcommitSync()Accurate but may reduce throughput
Manual async commitcommitAsync()Great throughput, but needs a retry strategy on failure

Note: One common cause of "processing is done but lag won't drop" is the commit interval/failure. We cover this pitfall in detail in Part 2.


3. Measuring lag

3.1 CLI — kafka-consumer-groups.sh

The most basic and most accurate tool. It queries the broker directly and shows group state.

kafka-consumer-groups.sh \
  --bootstrap-server kafka-1:9092 \
  --describe \
  --group order-processor

Example output:

GROUP           TOPIC        PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID                HOST            CLIENT-ID
order-processor  orders       0          1048210         1048210         0      consumer-1-a1b2c3-...      /10.0.1.21      consumer-1
order-processor  orders       1          982134          1003221         21087  consumer-2-d4e5f6-...      /10.0.1.22      consumer-2
order-processor  orders       2          1120004         1120010         6      consumer-3-g7h8i9-...      /10.0.1.23      consumer-3
order-processor  orders       3          550120          770540          220420 -                          -               -

How to read it:

  • CURRENT-OFFSET = committed offset, LOG-END-OFFSET = LEO, LAG = the difference.
  • Partition 1 has lag 21,087 — it is processing but slightly behind.
  • Partition 3 has lag 220,420 but CONSUMER-ID is - — meaning no consumer is assigned to it. This is a strong signal of too few instances or an in-progress rebalance. It is the decisive clue you would have missed by looking only at the group total.

To quickly check just the state:

# List all groups
kafka-consumer-groups.sh --bootstrap-server kafka-1:9092 --list
 
# Summarize group STATE — check Stable / Rebalancing / Empty
kafka-consumer-groups.sh --bootstrap-server kafka-1:9092 --describe --group order-processor --state

3.2 JMX metric — records-lag-max

The CLI is a snapshot and puts query load on the broker. For continuous monitoring, the JMX metric the consumer exposes itself is more suitable.

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=<id>
  ├─ records-lag-max     # max lag across all partitions of this consumer (most important)
  ├─ records-lag         # current per-partition lag (with a partition tag)
  └─ records-lead-min    # headroom from the partition start (log-start-offset) — loss imminent if it shrinks

records-lag-max is computed on the consumer side, so you can scrape it with Prometheus (JMX Exporter) and alert without querying the broker. Monitor records-lead-min as well — when this value approaches 0, it warns that unprocessed messages risk being deleted by retention.

3.3 External tools

ToolCharacteristicsWhen it fits
Burrow (LinkedIn)Judges status (OK/WARN/ERR) by consumption trend (evaluation rules) instead of thresholds. Auto-detects a stalled/slow committerWhen you want group health without tuning thresholds
Kafka Lag ExporterExposes offset lag and (estimated) time lag simultaneously as Prometheus metricsStandardizing Grafana dashboards/alerts
Cruise Control (LinkedIn)Automates partition rebalancing and load distribution rather than lag itselfResolving hot partitions / broker imbalance
kafka-consumer-groupsAccurate snapshot with no extra infrastructureAd-hoc debugging

Burrow's core idea is the insight that "applying thresholds to absolute lag values yields many false positives depending on traffic." So Burrow looks at the trend of offset progression and judges whether the consumer has stalled (committed offset stops moving) or cannot keep up with the production rate.


4. Reading lag correctly

Do not panic over a single number. Lag must be read as a pattern over time.

Loading diagram…

Interpretation by pattern

PatternGraph shapeUsual meaningResponse
Steady-state lagNonzero but flatNormal — in-flight buffer. Consumer keeps up with productionSet the threshold to a reasonable nonzero value
Growing lagSteadily risingDangerous — consumption < production. Capacity/bottleneck/stallInvestigate immediately (Part 2)
Spiky lagSawtooth/pulse then fallUsually normal — batch production / traffic peak absorptionConfirm recovery time is within SLA
Step-and-plateauRises then stays flatSome consumers stalled / unassigned after rebalanceCheck per partition and consumer-id

Why small steady-state lag is normal

If lag is always exactly 0, you should actually be suspicious. That usually means "production has stopped or the consumer is over-provisioned." A healthy streaming pipeline always has some in-flight messages (fetched batches, a processing queue), so small, steady lag is a healthy state. Therefore, the alert threshold should be of the form "lag keeps rising for N minutes" or "time lag > SLA" — not "lag > 0."

Records vs Time, again

For the same offset lag of 50,000:

  • If messages are small and the transform is stateless → time lag of a few seconds, negligible.
  • If each message makes an external API call → time lag of tens of minutes, SLA violation.

That is why alerting on time lag where possible reduces false positives (use Kafka Lag Exporter's estimated time lag, or a calculation based on message timestamps).


5. Fast triage checklist

When the 3 AM alert hits, check in order.

Loading diagram…

The checklist:

  1. Check the trend: rising, flat, or recovering? (Never a single snapshot.)
  2. Break down per partition: use --describe to see if it concentrates on specific partitions.
  3. Assignment state: is any partition's CONSUMER-ID -? → too few instances / rebalance.
  4. Group STATE: does --state show repeated Rebalancing? → suspect a rebalance storm.
  5. Production-side change: did the production rate spike? (Lag can come from a production surge, not consumption.)
  6. Check time lag: even with a large record count, is it within the SLA time?
  7. Retention headroom: is records-lead-min approaching 0? → loss imminent, top-priority response.

These 7 steps let you distinguish "normal/recovering" from "a real incident" in most cases. The root-cause diagnosis and resolution for cases classified as "a real incident" (why the consumer can't catch up, rebalance storms, processing bottlenecks, partition scaling) is covered in earnest in this series' Part 2: "When lag won't go down."


Wrapping up

  • lag = LEO − committed offset, per partition. Group lag is the sum. Always break it down per partition.
  • offset lag (record count) and time lag (wall clock) answer different questions. If your SLA is time-based, alert on time lag.
  • Offsets are committed to __consumer_offsets (a compact topic). Since lag is measured against the committed offset, the commit interval can inflate the metric.
  • For measurement, combine kafka-consumer-groups (snapshot) + JMX records-lag-max (continuous monitoring) + Burrow / Kafka Lag Exporter (trend, time lag).
  • Small, steady lag is normal. The danger signals are "sustained growth" and "time lag > SLA." Do not alert on "lag > 0."
  • In the next post (Part 2), we cover the root causes and resolution strategies for "when lag won't go down."

References


— Data Dynamics Engineering Team