kafkaconsumer-lagperformanceoperationstroubleshooting

[Kafka Ops 2] Seven Reasons Consumer Lag Won't Go Down — and How to Fix Them

Seven root causes of stubborn Kafka consumer lag, each laid out as symptom → diagnosis → fix: slow consumer processing, a blocked poll loop, insufficient parallelism, hot partitions, GC, broker bottlenecks, and offset commit issues. With diagnostic commands, a config table, and a decision flow.

Data DynamicsJune 1, 202617 min read

3 a.m., the pager goes off: "Consumer lag at 12 million and climbing." The consumers are running, there are no error logs, the CPU is idle — and yet the lag won't shrink. It's like bailing water out of a bucket while more pours in through a hole. This post is a map for finding that hole. When lag won't go down, the cause is almost always one of seven recurring patterns.

What you'll learn in this post

The seven root causes of stubborn consumer lag, and the fix for each

How to use kafka-consumer-groups.sh --describe to pinpoint exactly which partition is falling behind

The meaning and trade-offs of max.poll.records, max.poll.interval.ms, and the fetch.* settings

The golden ratio between consumer count and partition count, and how to untangle a hot partition

A decision flow that takes you straight from symptom to cause

This is Part 2 of the "Kafka Operations Troubleshooting" series. Part 1, "Measuring and Monitoring Consumer Lag," covers how to measure and monitor lag; Part 6, "Rebalance Storm," goes deeper into the mechanism by which rebalances create lag. This post focuses on the next step: "I can measure it, but it won't go down."

1. First, define what "won't go down" means

Before prescribing, read the symptom precisely. Lag is the difference, per partition, between log-end-offset (the broker's last offset) and current-offset (the offset the consumer has committed). Stubborn lag usually takes one of three shapes.

Pattern	Lag graph shape	Interpretation
Linear growth	Straight line, rising	Consume rate < produce rate. Chronic under-processing
Sawtooth	Builds up → briefly drops → builds again	Rebalance loop, periodic GC, batched-downstream stalls
One partition spikes	One partition climbs, the rest flat	Hot/skewed partition, or a single stalled consumer

Where diagnosis starts: `--describe`

Lag troubleshooting always begins with this command. The key is to look per partition, not at the group as a whole.

kafka-consumer-groups.sh \
  --bootstrap-server broker1:9092 \
  --describe \
  --group payment-consumer

GROUP            TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG       CONSUMER-ID                   HOST          CLIENT-ID
payment-consumer payments   0          1048576         1048590         14        consumer-1-a1b2  /10.0.1.5     consumer-1
payment-consumer payments   1          990000          1990000         1000000   consumer-2-c3d4  /10.0.1.6     consumer-2
payment-consumer payments   2          1048500         1048512         12        consumer-3-e5f6  /10.0.1.7     consumer-3
payment-consumer payments   3          1048400         1048450         50        consumer-1-a1b2  /10.0.1.5     consumer-1

This single output already does half the diagnosis. In the example above, only partition 1 is backed up by a million messages (the rest are double digits). That is not an "everything is slow" problem — it's either a hot partition (cause 4) or a stalled consumer that owns partition 1 (causes 5/7). If instead every partition's LAG grows evenly, you have a throughput shortfall (causes 1/3/6).

Checkpoint: If a partition has an empty (-) CONSUMER-ID, that partition has no consumer assigned. Either a consumer died, a rebalance hasn't finished, or you have more consumers than partitions (cause 3).

2. Cause 1: Slow consumer processing (downstream latency)

The most common cause. The consumer code itself is fast, but for every message it makes a synchronous call to a downstream system (DB / external API) and burns its throughput waiting for the response.

Symptom

Lag grows linearly and evenly across all partitions.
Consumer CPU is idle (I/O wait) yet messages/sec is low.
Downstream DB/API p99 latency is high or erratic.

Diagnosis

Work the throughput backwards. If a single message takes 20 ms of downstream round-trip, the theoretical maximum for a single-threaded consumer is 50 messages/sec. If the produce rate is 5,000/sec, you're 100x short — and the lag will never shrink.

produce rate              : 5,000 msg/s
downstream RTT per consumer: 20 ms  → 50 msg/s
parallelism needed         : 5,000 / 50 = 100 (rough)
current partitions/consumers: 12     → 600 msg/s ⟶ deficit

Fix

The key is to not process polled messages one at a time, synchronously.

Batch the downstream writes: instead of per-row INSERT, use INSERT ... VALUES (...), (...), ... for hundreds of rows at once. One RTT handles hundreds of records, lifting throughput by tens to hundreds of times.
Async writes + pipelining: issue downstream calls asynchronously and commit only up to the offset whose response has returned (mind ordering and consistency).
More parallelism: add partitions and spin up matching consumers (see cause 3). Remember: partition count is your parallelism ceiling.

# Anti-pattern: per-record synchronous processing (RTT dominates throughput)
for record in consumer:
    db.execute("INSERT INTO events VALUES (%s)", record.value)  # 20ms wait each time
 
# Fix: poll-batch + bulk write
while True:
    batch = consumer.poll(timeout_ms=500, max_records=500)
    rows = [r.value for tp in batch.values() for r in tp]
    if rows:
        db.execute_batch("INSERT INTO events VALUES %s", rows)  # hundreds per RTT
    consumer.commit()  # commit after downstream success

3. Cause 2: The poll loop blocks longer than `max.poll.interval.ms`

The most insidious cause. The consumer is alive, lag builds in a sawtooth, and the logs show periodic rebalances.

Mechanism

A Kafka consumer has to call poll() regularly for that call to count as an "I'm alive" signal. If processing the messages from one poll() (max.poll.records of them) exceeds max.poll.interval.ms (default 5 minutes), the broker treats the consumer as dead, evicts it from the group, and triggers a rebalance.

Loading diagram…

The evicted consumer never commits its last batch, so the reassigned consumer reprocesses the same range — and since that processing is also slow, it too gets evicted. The loop never catches up; lag oscillates as a sawtooth while trending upward. (When this eviction→rebalance chain spreads across the cluster, you get the rebalance storm of Part 6.)

Diagnosis

Look for this message in the consumer log.

This member will leave the group because consumer poll timeout has expired.
This means the time between subsequent calls to poll() was longer than the
configured max.poll.interval.ms, which typically implies that the poll loop
is spending too much time processing messages.

Fix

The goal is to bring processing time back inside max.poll.interval.ms. You have three levers.

Lever	Direction	Effect	Caveat
Lower `max.poll.records`	↓ (e.g. 500→100)	Shorter per-batch processing → avoids eviction	Too low raises poll overhead
Raise `max.poll.interval.ms`	↑ (e.g. 5min→10min)	Tolerates slow batches	Slower to detect truly dead consumers
Move processing off the poll thread	restructure	Root fix	Requires careful offset-commit design

The most robust fix is the third. The poll thread only receives messages and hands them to a worker queue, while the real processing happens in a separate thread pool. The poll loop then never blocks.

# Poll thread stays fast; processing is delegated to a worker pool
while True:
    batch = consumer.poll(timeout_ms=300, max_records=100)
    for tp, records in batch.items():
        worker_pool.submit(process_partition, tp, records)
    # commit only up to the offset workers have reported complete (consistency is key)
    consumer.commit(offsets=completed_offsets.snapshot())

But once you move processing off-thread, when to commit becomes tricky. Commit a range that isn't done yet and you lose data on failure; commit too conservatively and you reprocess more. Track each worker's completed offset and commit only up to the "lowest contiguously completed offset."

4. Cause 3: Insufficient parallelism — consumers ≠ partitions

Kafka's unit of parallelism is the partition. A single partition can be read by exactly one consumer within a consumer group. Two mistakes fall out of this constraint.

Symptom A: consumers < partitions (some partitions starved)

One consumer carries several partitions. When that consumer hits its processing limit, the lag on all of its partitions rises together.

Symptom B: consumers > partitions (idle consumers)

With 6 partitions and 10 consumers, 4 get no partition and sit idle. They show up in --describe with no CONSUMER-ID assignment, or you scale out and throughput doesn't budge.

The golden rule

Effective max consumers = partition count. Beyond that, you add no throughput, only wasted resources.

Partitions	Consumers	Result
12	4	3 partitions each. 4x headroom remaining
12	12	1:1, maximum parallelism
12	16	Only 12 work, 4 idle (wasted)

Diagnosis and fix

# Check the topic's partition count
kafka-topics.sh --bootstrap-server broker1:9092 --describe --topic payments
# → PartitionCount: 12  (this is your parallelism ceiling)

Throughput short and consumers < partitions → add consumers (up to the partition count).
Already at consumers = partitions and still short → add partitions (--alter --partitions). Caveat: with key-based partitioning, increasing partitions changes the key→partition mapping and can break ordering guarantees.

# Increase partitions (decreasing is not allowed)
kafka-topics.sh --bootstrap-server broker1:9092 \
  --alter --topic payments --partitions 24

5. Cause 4: Hot/skewed partition (bad key distribution)

If --describe shows only one partition spiking while the rest stay flat, it's almost certainly a hot partition. Even if you grow to 24 partitions, when 80% of traffic funnels into one key, the single partition that key maps to burns while the other 23 idle.

Cause: the key → partition mapping

The default partitioner distributes messages by partition = hash(key) % partitionCount. If keys are not evenly distributed, messages pile onto specific partitions. Classic cases:

Keying on tenant_id when one large customer is half the traffic.
A null key or a constant key like "default" mixed in, all heading to one partition.
Keying on a time bucket (yyyyMMddHH), so the "current hour" partition is always hot.

Diagnosis

Look at the variance of per-partition LAG in the --describe output. If one partition's LAG exceeds the sum of the rest, you have skew. For more precision, compare the per-partition message intake rate.

# Compare per-partition message counts (offsets) — is one partition growing fast?
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list broker1:9092 --topic payments --time -1
# payments:0:1048590
# payments:1:9982110   ← this partition has overwhelmingly more
# payments:2:1048512

Fix

Redesign the key: pick a high-cardinality, evenly distributed key. Use a composite key like tenant_id + ":" + record_id instead of tenant_id to spread the load (only when per-tenant ordering isn't required).
Custom partitioner: implement a partitioner that detects hot keys and scatters them across partitions (e.g. round-robin hot keys only).
Drop the key if ordering isn't required: with a null key, messages distribute evenly via round-robin (or sticky) partitioning.

// Outline of a custom partitioner that scatters only hot keys
public int partition(String topic, Object key, byte[] keyBytes,
                     Object value, byte[] valueBytes, Cluster cluster) {
    int numPartitions = cluster.partitionsForTopic(topic).size();
    if (isHotKey(key)) {
        // scatter hot keys round-robin instead of by key hash
        return Math.floorMod(counter.getAndIncrement(), numPartitions);
    }
    return Math.floorMod(Utils.murmur2(keyBytes), numPartitions);
}

Scattering hot keys trades off against ordering. In domains where same-key order matters (e.g. transaction events for one account), you must not scatter the key. Instead, consider adding partitions and decomposing the hot key itself (e.g. moving a large customer to a dedicated topic).

6. Cause 5: GC pauses / an under-resourced consumer

When the consumer JVM's stop-the-world GC runs long, both poll() and heartbeats halt for that duration. The result is the same as cause 2 — poll timeout → eviction → rebalance → sawtooth lag. The difference is that the trigger isn't processing time but a GC pause.

Symptom

Lag spikes periodically, and those moments line up with long pauses in the GC log.
Consumer memory usage is near its limit and max.poll.records is large (records pulled at once pressure the heap).
In a container, CPU throttling or proximity to the memory limit.

Diagnosis

Enable GC logging and inspect pause times.

# Add GC logging to the consumer JVM
-Xlog:gc*:file=/var/log/kafka/consumer-gc.log:time,uptime:filecount=5,filesize=10M
 
# Correlate long STW events (e.g. > 1s) against lag spike timestamps

Fix

Shrink the batch that pressures the heap: lower max.poll.records and fetch.max.bytes to reduce how much you load into memory at once.
Grow the heap or tune GC: with G1GC, lower -XX:MaxGCPauseMillis to reduce the pause target. At very high throughput, consider a low-latency GC like ZGC/Shenandoah.
Fix container resources: if a CPU limit causes throttling, raise the limit or guarantee the request. Size the memory limit to cover heap + direct buffers + native.
Release records quickly: drop references after processing so GC can reclaim sooner (watch for unnecessary accumulating collections).

7. Cause 6: Broker-side bottleneck

No matter how much you tune the consumer, if the broker can't serve data fast enough, lag won't fall. When both the consumer CPU and downstream are idle yet throughput is low, suspect the broker.

Symptom and diagnosis

Bottleneck	Symptom	Indicator
Under-replicated partitions	ISR shrinks, some partitions fetch slowly	`UnderReplicatedPartitions > 0`
Slow fetch	Consumer fetch latency rises	`FetchConsumerTotalTimeMs` p99
Network saturation	Broker NIC bandwidth maxed	`BytesOutPerSec` ≈ NIC limit
Disk I/O limit	Disk wait when reading cold data	iostat `%util` ≈ 100
Large messages	Big records inflate the fetch and delay it	average record size↑

# Check under-replicated partitions immediately
kafka-topics.sh --bootstrap-server broker1:9092 --describe --under-replicated-partitions
 
# Broker fetch/network metrics come from JMX (example MBeans)
# kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer
# kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec

Fix

Resolve under-replication: recover the slow/failed broker and, if needed, redistribute partition leadership (kafka-leader-election.sh).
Tune fetch efficiency: raising the consumer's fetch.min.bytes lets the broker batch small responses, cutting request count and lifting throughput (at a slight latency cost — tune alongside fetch.max.wait.ms).
Large messages: raise max.partition.fetch.bytes to match your message size so the consumer can receive a big record in one go. Too small and it can't receive even one record, stalling. But too large invites memory pressure (cause 5).
Network/disk: if NIC or disk is maxed, add brokers or reassign partitions to spread the load.

8. Cause 7: Commit/offset issues

Here the lag number itself lies. You're actually processing, but the offset isn't advancing — or you keep rewinding into the past and reprocessing the same data forever.

Symptom A: not committing

With enable.auto.commit=false, if the code never calls commitSync()/commitAsync(), processing happens but current-offset stays put. --describe LAG keeps climbing while data lands fine downstream. On restart, you get massive reprocessing from the last commit point.

Symptom B: it keeps rewinding (offset reset)

With auto.offset.reset=earliest, if the committed offset disappears (purged from __consumer_offsets by retention, or the group ID changes every time), the consumer reads from the very start of the topic. Lag suddenly jumps to the full topic size.

Diagnosis

# Is the commit point stuck? If CURRENT-OFFSET doesn't move over time, it's a commit issue
kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --describe --group payment-consumer
 
# Did the offset reset? Check group state / reset logs
# Consumer log: "Resetting offset for partition ... to offset 0" means a reset occurred

Fix

Guarantee commits: always commit after a batch succeeds. For async processing, apply the "commit up to the lowest completed offset" pattern from cause 2.
Stable group ID: don't change the consumer group ID per environment/deploy. If it accidentally becomes a new group each time, earliest triggers a full reprocess.
Check offset retention: if the consumer is down longer than offsets.retention.minutes (broker default 7 days), commits expire. If a long outage is expected, raise retention or deliberately reset to a specific offset.

# Deliberately reset offsets to after a given time (dry-run first, then --execute)
kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
  --group payment-consumer --topic payments \
  --reset-offsets --to-datetime 2026-07-02T03:00:00.000 --dry-run

9. The key settings at a glance

These are the consumer settings you'll touch most often in lag troubleshooting. Remember the trade-off: turning one up cuts another down.

Setting	Default	Role	Higher	Lower
`max.poll.records`	500	Max records per poll	Batch efficiency↑, processing time↑ (eviction risk, GC pressure)	poll overhead↑, eviction risk↓
`max.poll.interval.ms`	300000 (5min)	Max gap allowed between polls	Tolerates slow batches	Faster dead-consumer detection
`fetch.min.bytes`	1	Min accumulated bytes before a response	Throughput↑ (fewer requests), latency↑	Latency↓, requests↑
`fetch.max.bytes`	52428800 (50MB)	Max bytes in one fetch response	Receive more at once (memory↑)	Memory↓, requests↑
`max.partition.fetch.bytes`	1048576 (1MB)	Max fetch bytes per partition	Accommodates large messages	Memory↓ (risk: can't receive large messages)
`fetch.max.wait.ms`	500	Max wait when below `fetch.min.bytes`	Batch efficiency↑	Latency↓

A common starting point: if throughput is short, raise fetch.min.bytes to improve fetch efficiency; if you suspect eviction, lower max.poll.records to shorten per-batch processing. If large messages stall you, raise max.partition.fetch.bytes above your maximum message size.

10. Symptom → cause decision flow

A branch you can follow fast at 3 a.m. Start with --describe.

Loading diagram…

Wrapping up

Lag troubleshooting always starts with per-partition diagnosis via kafka-consumer-groups.sh --describe. Whether everything backs up evenly or only one partition does is the first fork.
Even growth across all partitions is a throughput problem — suspect slow downstream (1), insufficient parallelism (3), and broker bottlenecks (6) in turn.
One partition spiking means a hot/skewed partition (4) or a dead consumer owning that partition (5/7).
Sawtooth + rebalance logs signal poll blocking (2) or GC (5). The textbook move is to lower max.poll.records and move processing off the poll thread.
Every setting is a trade-off. Raising max.poll.records or fetch.* lifts throughput, but memory, latency, and eviction risk move with it. Change one thing at a time and verify against the lag graph.
The ceiling on effective parallelism is the partition count. Adding consumers beyond that is pointless; to genuinely raise throughput, look at partitions and key design together.
The next installment, Part 6 "Rebalance Storm," covers how causes 2 and 5 escalate into a cluster-wide rebalance storm — and how to break it.

References

Apache Kafka — Consumer Configs: https://kafka.apache.org/documentation/#consumerconfigs

Apache Kafka — Consumer Group Tooling: https://kafka.apache.org/documentation/#basic_ops_consumer_group

Apache Kafka — Design & Replication: https://kafka.apache.org/documentation/#design

"Kafka Operations Troubleshooting 1" — Measuring and Monitoring Consumer Lag

"Kafka Operations Troubleshooting 6" — Diagnosing and Stopping the Rebalance Storm

— The Data Dynamics Engineering Team

1. First, define what "won't go down" means

Where diagnosis starts: --describe

2. Cause 1: Slow consumer processing (downstream latency)

Symptom

Diagnosis

Fix

3. Cause 2: The poll loop blocks longer than max.poll.interval.ms

Mechanism

Diagnosis

Fix

4. Cause 3: Insufficient parallelism — consumers ≠ partitions

Symptom A: consumers < partitions (some partitions starved)

Symptom B: consumers > partitions (idle consumers)

The golden rule

Diagnosis and fix

5. Cause 4: Hot/skewed partition (bad key distribution)

Cause: the key → partition mapping

Diagnosis

Fix

6. Cause 5: GC pauses / an under-resourced consumer

Symptom

Diagnosis

Fix

7. Cause 6: Broker-side bottleneck

Symptom and diagnosis

Fix

8. Cause 7: Commit/offset issues

Symptom A: not committing

Symptom B: it keeps rewinding (offset reset)

Diagnosis

Fix

9. The key settings at a glance

10. Symptom → cause decision flow

Wrapping up

Where diagnosis starts: `--describe`

3. Cause 2: The poll loop blocks longer than `max.poll.interval.ms`