[Kafka Ops 2] Seven Reasons Consumer Lag Won't Go Down — and How to Fix Them
Seven root causes of stubborn Kafka consumer lag, each laid out as symptom → diagnosis → fix: slow consumer processing, a blocked poll loop, insufficient parallelism, hot partitions, GC, broker bottlenecks, and offset commit issues. With diagnostic commands, a config table, and a decision flow.
3 a.m., the pager goes off: "Consumer lag at 12 million and climbing." The consumers are running, there are no error logs, the CPU is idle — and yet the lag won't shrink. It's like bailing water out of a bucket while more pours in through a hole. This post is a map for finding that hole. When lag won't go down, the cause is almost always one of seven recurring patterns.
What you'll learn in this post
- The seven root causes of stubborn consumer lag, and the fix for each
- How to use kafka-consumer-groups.sh --describe to pinpoint exactly which partition is falling behind
- The meaning and trade-offs of max.poll.records, max.poll.interval.ms, and the fetch.* settings
- The golden ratio between consumer count and partition count, and how to untangle a hot partition
- A decision flow that takes you straight from symptom to cause
This is Part 2 of the "Kafka Operations Troubleshooting" series. Part 1, "Measuring and Monitoring Consumer Lag," covers how to measure and monitor lag; Part 6, "Rebalance Storm," goes deeper into the mechanism by which rebalances create lag. This post focuses on the next step: "I can measure it, but it won't go down."
1. First, define what "won't go down" means
Before prescribing, read the symptom precisely. Lag is the difference, per partition, between log-end-offset (the broker's last offset) and current-offset (the offset the consumer has committed). Stubborn lag usually takes one of three shapes.
| Pattern | Lag graph shape | Interpretation |
|---|---|---|
| Linear growth | Straight line, rising | Consume rate < produce rate. Chronic under-processing |
| Sawtooth | Builds up → briefly drops → builds again | Rebalance loop, periodic GC, batched-downstream stalls |
| One partition spikes | One partition climbs, the rest flat | Hot/skewed partition, or a single stalled consumer |
Where diagnosis starts: --describe
Lag troubleshooting always begins with this command. The key is to look per partition, not at the group as a whole.
kafka-consumer-groups.sh \
  --bootstrap-server broker1:9092 \
  --describe \
  --group payment-consumer

GROUP            TOPIC    PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG     CONSUMER-ID     HOST      CLIENT-ID
payment-consumer payments 0         1048576        1048590        14      consumer-1-a1b2 /10.0.1.5 consumer-1
payment-consumer payments 1         990000         1990000        1000000 consumer-2-c3d4 /10.0.1.6 consumer-2
payment-consumer payments 2         1048500        1048512        12      consumer-3-e5f6 /10.0.1.7 consumer-3
payment-consumer payments 3         1048400        1048450        50      consumer-1-a1b2 /10.0.1.5 consumer-1

This single output already does half the diagnosis. In the example above, only partition 1 is backed up by a million messages (the rest are double digits). That is not an "everything is slow" problem: it's either a hot partition (cause 4) or a stalled consumer that owns partition 1 (causes 5/7). If instead every partition's LAG grows evenly, you have a throughput shortfall (causes 1/3/6).
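Checkpoint: If a partition shows an empty (-) CONSUMER-ID, no active consumer is assigned to it. Either the consumer that owned it died, a rebalance hasn't finished, or the group has no running members at all. (The opposite mismatch, more consumers than partitions, leaves consumers idle rather than partitions unowned; see cause 3.)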
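The per-partition reading described above can be scripted. The following is a minimal sketch, assuming the column layout shown in the sample output; the function names and the skew threshold (10x the median lag) are illustrative choices, not part of any Kafka tooling.

```python
from typing import NamedTuple, Optional


class PartitionLag(NamedTuple):
    topic: str
    partition: int
    lag: int
    consumer_id: Optional[str]


def parse_describe(output: str) -> list:
    """Parse the per-partition rows of `kafka-consumer-groups.sh --describe`."""
    rows = []
    for line in output.strip().splitlines():
        cols = line.split()
        # Skip blank lines and the header row.
        if len(cols) < 7 or cols[0] == "GROUP":
            continue
        lag = int(cols[5]) if cols[5].isdigit() else 0  # "-" means no committed offset yet
        consumer_id = None if cols[6] == "-" else cols[6]
        rows.append(PartitionLag(cols[1], int(cols[2]), lag, consumer_id))
    return rows


def triage(rows, skew_factor=10):
    """Return (skewed, unassigned) partition numbers.

    A partition counts as skewed when its lag exceeds skew_factor times the
    median lag; the factor is an arbitrary illustrative threshold.
    """
    lags = sorted(r.lag for r in rows)
    median = lags[len(lags) // 2]
    skewed = [r.partition for r in rows if r.lag > skew_factor * max(median, 1)]
    unassigned = [r.partition for r in rows if r.consumer_id is None]
    return skewed, unassigned


SAMPLE = """
GROUP            TOPIC    PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG     CONSUMER-ID     HOST      CLIENT-ID
payment-consumer payments 0         1048576        1048590        14      consumer-1-a1b2 /10.0.1.5 consumer-1
payment-consumer payments 1         990000         1990000        1000000 consumer-2-c3d4 /10.0.1.6 consumer-2
payment-consumer payments 2         1048500        1048512        12      consumer-3-e5f6 /10.0.1.7 consumer-3
payment-consumer payments 3         1048400        1048450        50      consumer-1-a1b2 /10.0.1.5 consumer-1
"""

skewed, unassigned = triage(parse_describe(SAMPLE))
print(skewed, unassigned)  # [1] []
```

Feeding it the live command output on a schedule gives a cheap first fork: a non-empty first list points toward cause 4, a non-empty second list toward a dead or missing consumer.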
2. Cause 1: Slow consumer processing (downstream latency)
The most common cause. The consumer code itself is fast, but for every message it makes a synchronous call to a downstream system (DB / external API) and burns its throughput waiting for the response.
Symptom
- Lag grows linearly and evenly across all partitions.
- Consumer CPU is idle (I/O wait) yet messages/sec is low.
- Downstream DB/API p99 latency is high or erratic.
Diagnosis
Work the throughput backwards. If a single message takes 20 ms of downstream round-trip, the theoretical maximum for a single-threaded consumer is 50 messages/sec. If the produce rate is 5,000/sec, you're 100x short — and the lag will never shrink.
produce rate : 5,000 msg/s
downstream RTT per consumer: 20 ms → 50 msg/s
parallelism needed : 5,000 / 50 = 100 (rough)
current partitions/consumers: 12 → 600 msg/s ⟶ deficit
Fix
The key is to not process polled messages one at a time, synchronously.
- Batch the downstream writes: instead of per-row INSERT, use INSERT ... VALUES (...), (...), ... for hundreds of rows at once. One RTT handles hundreds of records, lifting throughput by tens to hundreds of times.
- Async writes + pipelining: issue downstream calls asynchronously and commit only up to the offset whose response has returned (mind ordering and consistency).
- More parallelism: add partitions and spin up matching consumers (see cause 3). Remember: partition count is your parallelism ceiling.
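To size these fixes, the arithmetic from the diagnosis can be wrapped in a few helpers. This is a rough sketch: the function names are made up for illustration, and it assumes (optimistically) that a bulk write of 500 rows costs about the same 20 ms round trip as a single-row write.

```python
import math


def per_consumer_rate(rtt_ms: float, batch_size: int = 1) -> float:
    """Messages/sec for one consumer issuing one downstream call at a time."""
    return batch_size * 1000.0 / rtt_ms


def consumers_needed(produce_rate: float, rtt_ms: float, batch_size: int = 1) -> int:
    """Parallelism required just to keep up with the produce rate."""
    return math.ceil(produce_rate / per_consumer_rate(rtt_ms, batch_size))


def drain_seconds(backlog: int, consume_rate: float, produce_rate: float) -> float:
    """Time to work off an existing backlog; infinite if there is no surplus."""
    surplus = consume_rate - produce_rate
    return math.inf if surplus <= 0 else backlog / surplus


print(per_consumer_rate(20))                       # 50.0
print(consumers_needed(5000, 20))                  # 100
print(consumers_needed(5000, 20, batch_size=500))  # 1
print(drain_seconds(12_000_000, 600, 5000))        # inf
```

The last line is the 3 a.m. situation in numbers: 12 consumers at 50 msg/s each never drain a 12-million backlog while 5,000 msg/s keep arriving.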
# Anti-pattern: per-record synchronous processing (RTT dominates throughput)
for record in consumer:
    db.execute("INSERT INTO events VALUES (%s)", (record.value,))  # 20 ms wait each time

# Fix: poll-batch + bulk write
# (db.execute_batch stands for whatever bulk-insert call your driver provides)
while True:
    batch = consumer.poll(timeout_ms=500, max_records=500)  # {TopicPartition: [records]}
    rows = [(r.value,) for records in batch.values() for r in records]
    if rows:
        db.execute_batch("INSERT INTO events VALUES %s", rows)  # hundreds per RTT
        consumer.commit()  # commit only after the downstream write succeeds
3. Cause 2: The poll loop blocks longer than max.poll.interval.ms
The most insidious cause. The consumer is alive, lag builds in a sawtooth, and the logs show periodic rebalances.
Mechanism
A Kafka consumer must call poll() regularly; each call doubles as the "I'm still processing" signal. If processing the messages from one poll() (up to max.poll.records of them) takes longer than max.poll.interval.ms (default 5 minutes), the client's background heartbeat thread gives up and sends a leave-group request. The group coordinator removes the member and triggers a rebalance.
The evicted consumer never commits its last batch, so the reassigned consumer reprocesses the same range — and since that processing is also slow, it too gets evicted. The loop never catches up; lag oscillates as a sawtooth while trending upward. (When this eviction→rebalance chain spreads across the cluster, you get the rebalance storm of Part 6.)
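Whether a batch fits inside the poll interval is simple multiplication, sketched below. The helper names and the 50% safety margin are assumptions for illustration; the per-record time has to come from your own measurements.

```python
DEFAULT_MAX_POLL_INTERVAL_MS = 300_000  # consumer default (5 minutes)


def batch_time_ms(max_poll_records: int, per_record_ms: float) -> float:
    """Worst-case processing time for one full poll() batch."""
    return max_poll_records * per_record_ms


def exceeds_poll_interval(max_poll_records, per_record_ms,
                          max_poll_interval_ms=DEFAULT_MAX_POLL_INTERVAL_MS) -> bool:
    return batch_time_ms(max_poll_records, per_record_ms) > max_poll_interval_ms


def safe_max_poll_records(per_record_ms,
                          max_poll_interval_ms=DEFAULT_MAX_POLL_INTERVAL_MS,
                          safety=0.5) -> int:
    """Largest batch that uses at most `safety` of the poll interval."""
    return max(1, int(max_poll_interval_ms * safety / per_record_ms))


print(exceeds_poll_interval(500, 700))  # True: 350 s of work per batch
print(exceeds_poll_interval(100, 700))  # False: 70 s of work per batch
print(safe_max_poll_records(700))       # 214
```

With the default 500 records per poll, anything slower than 600 ms per record crosses the 5-minute limit.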
Diagnosis
Look for this message in the consumer log.
This member will leave the group because consumer poll timeout has expired.
This means the time between subsequent calls to poll() was longer than the
configured max.poll.interval.ms, which typically implies that the poll loop
is spending too much time processing messages.
Fix
The goal is to bring processing time back inside max.poll.interval.ms. You have three levers.
| Lever | Direction | Effect | Caveat |
|---|---|---|---|
| Lower max.poll.records | ↓ (e.g. 500→100) | Shorter per-batch processing → avoids eviction | Too low raises poll overhead |
| Raise max.poll.interval.ms | ↑ (e.g. 5min→10min) | Tolerates slow batches | Slower to detect truly dead consumers |
| Move processing off the poll thread | restructure | Root fix | Requires careful offset-commit design |
The most robust fix is the third. The poll thread only receives messages and hands them to a worker queue, while the real processing happens in a separate thread pool. The poll loop then never blocks.
# Poll thread stays fast; processing is delegated to a worker pool
while True:
    batch = consumer.poll(timeout_ms=300, max_records=100)
    for tp, records in batch.items():
        worker_pool.submit(process_partition, tp, records)
    # commit only up to the offset workers have reported complete (consistency is key)
    consumer.commit(offsets=completed_offsets.snapshot())

But once you move processing off-thread, deciding when to commit becomes tricky. Commit a range that isn't done yet and you lose data on failure; commit too conservatively and you reprocess more. Track each worker's completed offset and commit only up to the "lowest contiguously completed offset."
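One way to track the "lowest contiguously completed offset" is a small per-partition tracker like the sketch below. The class name and methods are hypothetical, and it assumes the offsets handed to workers are consecutive (gaps from compaction or transaction markers would need extra handling).

```python
import threading


class OffsetTracker:
    """Tracks out-of-order completions for a single partition.

    committable() returns the next offset to read, i.e. one past the end of
    the contiguous run of completed offsets, which is the value Kafka expects
    in a commit.
    """

    def __init__(self, first_offset: int):
        self._next = first_offset      # lowest offset not yet known to be complete
        self._done = set()             # completed offsets beyond the contiguous run
        self._lock = threading.Lock()  # workers report from several threads

    def complete(self, offset: int) -> None:
        with self._lock:
            if offset < self._next:
                return                 # duplicate report, already covered
            self._done.add(offset)
            # Advance over any contiguous run of completed offsets.
            while self._next in self._done:
                self._done.remove(self._next)
                self._next += 1

    def committable(self) -> int:
        with self._lock:
            return self._next


tracker = OffsetTracker(first_offset=100)
tracker.complete(102)         # finished out of order
print(tracker.committable())  # 100: offset 100 is still in flight
tracker.complete(100)
print(tracker.committable())  # 101
tracker.complete(101)
print(tracker.committable())  # 103
```

The poll thread would call committable() for each assigned partition and commit those values; a crash then replays at most the records that were in flight.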
4. Cause 3: Insufficient parallelism — consumers ≠ partitions
Kafka's unit of parallelism is the partition. A single partition can be read by exactly one consumer within a consumer group. Two mistakes fall out of this constraint.
Symptom A: consumers < partitions (some partitions starved)
One consumer carries several partitions. When that consumer hits its processing limit, the lag on all of its partitions rises together.
Symptom B: consumers > partitions (idle consumers)
With 6 partitions and 10 consumers, 4 get no partition and sit idle. They never appear as a CONSUMER-ID in the per-partition --describe output (--describe --members lists them with zero partitions), and scaling out further doesn't move throughput.
The golden rule
Effective max consumers = partition count. Beyond that, you add no throughput, only wasted resources.
| Partitions | Consumers | Result |
|---|---|---|
| 12 | 4 | 3 partitions each. Room to scale out 3x (up to 12 consumers) |
| 12 | 12 | 1:1, maximum parallelism |
| 12 | 16 | Only 12 work, 4 idle (wasted) |
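The table follows from two lines of arithmetic. A minimal sketch, assuming the assignor spreads partitions as evenly as possible (the function name is illustrative):

```python
def assignment_summary(partitions: int, consumers: int) -> dict:
    """How a consumer group of a given size maps onto a topic's partitions."""
    active = min(partitions, consumers)
    idle = max(0, consumers - partitions)
    # Ceiling division: the most-loaded consumer under an even spread.
    max_per_consumer = -(-partitions // active) if active else 0
    return {"active": active, "idle": idle, "max_partitions_per_consumer": max_per_consumer}


for consumers in (4, 12, 16):
    print(consumers, assignment_summary(12, consumers))
# 4 {'active': 4, 'idle': 0, 'max_partitions_per_consumer': 3}
# 12 {'active': 12, 'idle': 0, 'max_partitions_per_consumer': 1}
# 16 {'active': 12, 'idle': 4, 'max_partitions_per_consumer': 1}
```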
Diagnosis and fix
# Check the topic's partition count
kafka-topics.sh --bootstrap-server broker1:9092 --describe --topic payments
# → PartitionCount: 12 (this is your parallelism ceiling)
- Throughput short and consumers < partitions → add consumers (up to the partition count).
- Already at consumers = partitions and still short → add partitions (--alter --partitions). Caveat: with key-based partitioning, increasing partitions changes the key→partition mapping and can break ordering guarantees.
# Increase partitions (decreasing is not allowed)
kafka-topics.sh --bootstrap-server broker1:9092 \
  --alter --topic payments --partitions 24
5. Cause 4: Hot/skewed partition (bad key distribution)
If --describe shows only one partition spiking while the rest stay flat, it's almost certainly a hot partition. Even if you grow to 24 partitions, when 80% of traffic funnels into one key, the single partition that key maps to burns while the other 23 idle.
Cause: the key → partition mapping
The default partitioner distributes messages by partition = hash(key) % partitionCount. If keys are not evenly distributed, messages pile onto specific partitions. Classic cases:
- Keying on tenant_id when one large customer is half the traffic.
- A constant key like "default" (or an empty string substituted for a missing key) mixed in, all heading to one partition. A genuinely null key is not hashed and does not cause this.
- Keying on a time bucket (yyyyMMddHH), so the "current hour" partition is always hot.
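The effect of the key → partition mapping is easy to reproduce offline. The sketch below is a simulation, not Kafka's partitioner: it substitutes MD5 for murmur2, and the traffic mix (every second message from one tenant called "big-co") is invented for illustration.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for hash(key) % partitionCount; Kafka itself uses murmur2.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def hottest_share(keys, num_partitions=12) -> float:
    """Fraction of all messages that land on the busiest partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return max(counts.values()) / len(keys)


# Hypothetical traffic: every second message belongs to one large tenant.
tenants = ["big-co" if i % 2 == 0 else f"tenant-{i % 1000}" for i in range(10_000)]
by_tenant = tenants
by_composite = [f"{t}:{i}" for i, t in enumerate(tenants)]

print(f"tenant_id key: hottest partition gets {hottest_share(by_tenant):.0%}")
print(f"composite key: hottest partition gets {hottest_share(by_composite):.0%}")
```

With the tenant key, at least half of all messages land on one of the 12 partitions; with a composite key the busiest partition stays close to the even share of about 8%.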
Diagnosis
Look at the variance of per-partition LAG in the --describe output. If one partition's LAG exceeds the sum of the rest, you have skew. For more precision, compare the per-partition message intake rate.
# Compare per-partition message counts (offsets) — is one partition growing fast?
kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list broker1:9092 --topic payments --time -1
# payments:0:1048590
# payments:1:9982110 ← this partition has overwhelmingly more
# payments:2:1048512
Fix
- Redesign the key: pick a high-cardinality, evenly distributed key. Use a composite key like tenant_id + ":" + record_id instead of tenant_id to spread the load (only when per-tenant ordering isn't required).
- Custom partitioner: implement a partitioner that detects hot keys and scatters them across partitions (e.g. round-robin hot keys only).
- Drop the key if ordering isn't required: with a null key, messages distribute evenly via round-robin (or sticky) partitioning.
// Outline of a custom partitioner that scatters only hot keys.
// isHotKey(...) and counter (an atomic counter field) are left to the implementation.
public int partition(String topic, Object key, byte[] keyBytes,
                     Object value, byte[] valueBytes, Cluster cluster) {
    int numPartitions = cluster.partitionsForTopic(topic).size();
    if (keyBytes == null || isHotKey(key)) {
        // scatter hot (and null) keys round-robin instead of by key hash
        return Math.floorMod(counter.getAndIncrement(), numPartitions);
    }
    // same formula as Kafka's default partitioner, so all other keys keep their partition
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}

Scattering hot keys trades off against ordering. In domains where same-key order matters (e.g. transaction events for one account), you must not scatter the key. Instead, consider adding partitions and decomposing the hot key itself (e.g. moving a large customer to a dedicated topic).
6. Cause 5: GC pauses / an under-resourced consumer
When the consumer JVM's stop-the-world GC runs long, both poll() and the background heartbeat thread halt for that duration. If the pause outlasts session.timeout.ms (missed heartbeats) or pushes the gap between polls past max.poll.interval.ms, the outcome is the same as cause 2: eviction → rebalance → sawtooth lag. The difference is that the trigger isn't processing time but a GC pause.
Symptom
- Lag spikes periodically, and those moments line up with long pauses in the GC log.
- Consumer memory usage is near its limit and max.poll.records is large (records pulled at once pressure the heap).
- In a container, CPU throttling or proximity to the memory limit.
Diagnosis
Enable GC logging and inspect pause times.
# Add GC logging to the consumer JVM
-Xlog:gc*:file=/var/log/kafka/consumer-gc.log:time,uptime:filecount=5,filesize=10M
# Correlate long STW events (e.g. > 1s) against lag spike timestamps
Fix
- Shrink the batch that pressures the heap: lower max.poll.records and fetch.max.bytes to reduce how much you load into memory at once.
- Grow the heap or tune GC: with G1GC, lower -XX:MaxGCPauseMillis to reduce the pause target. At very high throughput, consider a low-latency GC like ZGC/Shenandoah.
- Fix container resources: if a CPU limit causes throttling, raise the limit or guarantee the request. Size the memory limit to cover heap + direct buffers + native.
- Release records quickly: drop references after processing so GC can reclaim sooner (watch for unnecessary accumulating collections).
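For the diagnosis step in this section, correlating long pauses with lag spikes can be scripted. This is a sketch under assumptions: the sample lines imitate what JDK unified logging typically writes with the time,uptime decorators, the values are invented, and the regular expression should be checked against your own GC log before relying on it.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches lines such as:
# [2026-07-02T03:00:01.123+0000][3601.456s] GC(42) Pause Full (...) 980M->870M(1024M) 1523.411ms
PAUSE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\].*\bPause\b.*?(?P<ms>\d+(?:\.\d+)?)ms\s*$")


def long_pauses(gc_log: str, threshold_ms: float = 1000.0):
    """Return (timestamp, pause_ms) for every pause at or above the threshold."""
    found = []
    for line in gc_log.splitlines():
        m = PAUSE_RE.match(line)
        if m and float(m["ms"]) >= threshold_ms:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%f%z")
            found.append((ts, float(m["ms"])))
    return found


def pauses_near(pauses, spike_times, window_s: float = 30.0):
    """Keep only the pauses that fall within window_s of a lag spike."""
    window = timedelta(seconds=window_s)
    return [(ts, ms) for ts, ms in pauses
            if any(abs(ts - spike) <= window for spike in spike_times)]


SAMPLE_LOG = """\
[2026-07-02T02:59:40.010+0000][3580.343s] GC(41) Pause Young (Normal) (G1 Evacuation Pause) 812M->640M(1024M) 12.345ms
[2026-07-02T03:00:01.123+0000][3601.456s] GC(42) Pause Full (Allocation Failure) 980M->870M(1024M) 1523.411ms
[2026-07-02T03:00:05.000+0000][3605.333s] GC(43) Concurrent Mark Cycle 50.123ms
"""

spikes = [datetime(2026, 7, 2, 3, 0, 10, tzinfo=timezone.utc)]  # taken from the lag graph
print(pauses_near(long_pauses(SAMPLE_LOG), spikes))
```

If the result is non-empty for most spikes, GC is a strong suspect; if it is empty, go back to cause 2.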
7. Cause 6: Broker-side bottleneck
No matter how much you tune the consumer, if the broker can't serve data fast enough, lag won't fall. When both the consumer CPU and downstream are idle yet throughput is low, suspect the broker.
Symptom and diagnosis
| Bottleneck | Symptom | Indicator |
|---|---|---|
| Under-replicated partitions | ISR shrinks, some partitions fetch slowly | UnderReplicatedPartitions > 0 |
| Slow fetch | Consumer fetch latency rises | FetchConsumerTotalTimeMs p99 |
| Network saturation | Broker NIC bandwidth maxed | BytesOutPerSec ≈ NIC limit |
| Disk I/O limit | Disk wait when reading cold data | iostat %util ≈ 100 |
| Large messages | Big records inflate the fetch and delay it | average record size↑ |
# Check under-replicated partitions immediately
kafka-topics.sh --bootstrap-server broker1:9092 --describe --under-replicated-partitions
# Broker fetch/network metrics come from JMX (example MBeans)
# kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer
# kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
Fix
- Resolve under-replication: recover the slow/failed broker and, if needed, redistribute partition leadership (kafka-leader-election.sh).
- Tune fetch efficiency: raising the consumer's fetch.min.bytes lets the broker batch small responses, cutting request count and lifting throughput (at a slight latency cost; tune alongside fetch.max.wait.ms).
- Large messages: raise max.partition.fetch.bytes to match your message size so a fetch can carry more than one oversized batch. On current clients a too-small value no longer stalls the consumer outright, because the first batch is returned even when it exceeds the limit, but each fetch then delivers little more than that one batch and throughput suffers; only very old clients get stuck. Too large, on the other hand, invites memory pressure (cause 5).
- Network/disk: if NIC or disk is maxed, add brokers or reassign partitions to spread the load.
8. Cause 7: Commit/offset issues
Here the lag number itself lies. You're actually processing, but the offset isn't advancing — or you keep rewinding into the past and reprocessing the same data forever.
Symptom A: not committing
With enable.auto.commit=false, if the code never calls commitSync()/commitAsync(), processing happens but current-offset stays put. --describe LAG keeps climbing while data lands fine downstream. On restart, you get massive reprocessing from the last commit point.
Symptom B: it keeps rewinding (offset reset)
With auto.offset.reset=earliest, if the committed offset disappears (purged from __consumer_offsets by retention, or the group ID changes every time), the consumer reads from the very start of the topic. Lag suddenly jumps to the full topic size.
Diagnosis
# Is the commit point stuck? If CURRENT-OFFSET doesn't move over time, it's a commit issue
kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
--describe --group payment-consumer
# Did the offset reset? Check group state / reset logs
# Consumer log: "Resetting offset for partition ... to offset 0" means a reset occurred
Fix
- Guarantee commits: always commit after a batch succeeds. For async processing, apply the "commit up to the lowest completed offset" pattern from cause 2.
- Stable group ID: don't change the consumer group ID per environment/deploy. If it accidentally becomes a new group each time, earliest triggers a full reprocess.
- Check offset retention: if the consumer is down longer than offsets.retention.minutes (broker default 7 days), commits expire. If a long outage is expected, raise retention or deliberately reset to a specific offset.
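The "is CURRENT-OFFSET moving?" check from the diagnosis can be automated by comparing two snapshots taken a minute or so apart. A minimal sketch; the snapshot format (partition → (current offset, log end offset)) and the verdict labels are assumptions for illustration.

```python
def classify_offsets(before: dict, after: dict) -> dict:
    """Compare two snapshots of {partition: (current_offset, log_end_offset)}."""
    verdict = {}
    for partition, (cur0, end0) in before.items():
        cur1, end1 = after[partition]
        if cur1 < cur0:
            verdict[partition] = "rewound"        # offset reset or deliberate rewind
        elif cur1 == cur0 and end1 > end0:
            verdict[partition] = "commit stuck"   # data arriving, committed offset frozen
        elif cur1 > cur0:
            verdict[partition] = "advancing"
        else:
            verdict[partition] = "idle"           # nothing produced, nothing consumed
    return verdict


snapshot_1 = {0: (1_000, 1_500), 1: (2_000, 2_600), 2: (900_000, 900_050), 3: (10, 10)}
snapshot_2 = {0: (1_400, 1_900), 1: (2_000, 3_100), 2: (0, 900_090), 3: (10, 10)}

print(classify_offsets(snapshot_1, snapshot_2))
# {0: 'advancing', 1: 'commit stuck', 2: 'rewound', 3: 'idle'}
```

"commit stuck" maps to symptom A and "rewound" to symptom B.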
# Deliberately reset offsets to after a given time (dry-run first, then --execute)
kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
--group payment-consumer --topic payments \
  --reset-offsets --to-datetime 2026-07-02T03:00:00.000 --dry-run
9. The key settings at a glance
These are the consumer settings you'll touch most often in lag troubleshooting. Remember the trade-off: turning one up cuts another down.
| Setting | Default | Role | Higher | Lower |
|---|---|---|---|---|
| max.poll.records | 500 | Max records per poll | Batch efficiency↑, processing time↑ (eviction risk, GC pressure) | poll overhead↑, eviction risk↓ |
| max.poll.interval.ms | 300000 (5min) | Max gap allowed between polls | Tolerates slow batches | Faster dead-consumer detection |
| fetch.min.bytes | 1 | Min accumulated bytes before a response | Throughput↑ (fewer requests), latency↑ | Latency↓, requests↑ |
| fetch.max.bytes | 52428800 (50MB) | Max bytes in one fetch response | Receive more at once (memory↑) | Memory↓, requests↑ |
| max.partition.fetch.bytes | 1048576 (1MB) | Max fetch bytes per partition | Accommodates large messages | Memory↓ (risk: oversized batches arrive one per fetch) |
| fetch.max.wait.ms | 500 | Max wait when below fetch.min.bytes | Batch efficiency↑ | Latency↓ |
A common starting point: if throughput is short, raise fetch.min.bytes to improve fetch efficiency; if you suspect eviction, lower max.poll.records to shorten per-batch processing. If large messages slow your fetches, raise max.partition.fetch.bytes above your maximum message size.
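These trade-offs can be turned into a quick review of a proposed configuration. The sketch below uses the defaults from the table; the function name and the three checks are illustrative choices rather than an official validation, and the per-record time and largest batch size must come from your own measurements.

```python
DEFAULTS = {
    "max.poll.records": 500,
    "max.poll.interval.ms": 300_000,
    "fetch.min.bytes": 1,
    "fetch.max.bytes": 52_428_800,
    "max.partition.fetch.bytes": 1_048_576,
    "fetch.max.wait.ms": 500,
}


def review_settings(overrides: dict, per_record_ms: float, largest_batch_bytes: int) -> list:
    """Return human-readable findings for a consumer config (defaults + overrides)."""
    cfg = {**DEFAULTS, **overrides}
    findings = []
    if cfg["max.poll.records"] * per_record_ms > cfg["max.poll.interval.ms"]:
        findings.append("max.poll.records: a full batch outlasts max.poll.interval.ms")
    if largest_batch_bytes > cfg["max.partition.fetch.bytes"]:
        findings.append("max.partition.fetch.bytes: smaller than the largest record batch")
    if cfg["max.partition.fetch.bytes"] > cfg["fetch.max.bytes"]:
        findings.append("fetch.max.bytes: below max.partition.fetch.bytes, so it caps every fetch")
    return findings


for finding in review_settings({}, per_record_ms=700, largest_batch_bytes=4_000_000):
    print(finding)
```

Running it against the defaults with 700 ms per record and 4 MB batches flags the first two settings.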
10. Symptom → cause decision flow
A branch you can follow fast at 3 a.m. Start with --describe.
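As a minimal sketch, the branching can be written as a function assembled from the rules stated in this post. The field names are hypothetical and the order of the checks is an assumption; treat the result as the first suspect, not a verdict.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    committed_offset_moving: bool = True      # CURRENT-OFFSET advances over time
    one_partition_spiking: bool = False       # one partition's LAG dwarfs the rest
    sawtooth_with_rebalances: bool = False    # sawtooth graph plus rebalance logs
    gc_pauses_match_spikes: bool = False      # long STW pauses line up with spikes
    downstream_latency_high: bool = False     # DB/API p99 high or erratic
    consumers_below_partitions: bool = False  # fewer consumers than partitions


def diagnose(o: Observations) -> str:
    if not o.committed_offset_moving:
        return "cause 7: commit/offset issue"
    if o.one_partition_spiking:
        return "cause 4: hot partition (or a stalled owner, causes 5/7)"
    if o.sawtooth_with_rebalances:
        return "cause 5: GC pauses" if o.gc_pauses_match_spikes else "cause 2: blocked poll loop"
    # From here on: even growth across all partitions, i.e. a throughput shortfall.
    if o.downstream_latency_high:
        return "cause 1: slow downstream"
    if o.consumers_below_partitions:
        return "cause 3: insufficient parallelism"
    return "cause 6: broker-side bottleneck"


print(diagnose(Observations(one_partition_spiking=True)))
print(diagnose(Observations(sawtooth_with_rebalances=True)))
print(diagnose(Observations(downstream_latency_high=True)))
```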
Wrapping up
- Lag troubleshooting always starts with per-partition diagnosis via kafka-consumer-groups.sh --describe. Whether everything backs up evenly or only one partition does is the first fork.
- Even growth across all partitions is a throughput problem: suspect slow downstream (1), insufficient parallelism (3), and broker bottlenecks (6) in turn.
- One partition spiking means a hot/skewed partition (4) or a dead consumer owning that partition (5/7).
- Sawtooth + rebalance logs signal poll blocking (2) or GC (5). The textbook move is to lower max.poll.records and move processing off the poll thread.
- Every setting is a trade-off. Raising max.poll.records or fetch.* lifts throughput, but memory, latency, and eviction risk move with it. Change one thing at a time and verify against the lag graph.
- The ceiling on effective parallelism is the partition count. Adding consumers beyond that is pointless; to genuinely raise throughput, look at partitions and key design together.
- Part 6, "Rebalance Storm," covers how causes 2 and 5 escalate into a cluster-wide rebalance storm, and how to break it.
References
- Apache Kafka — Consumer Configs: https://kafka.apache.org/documentation/#consumerconfigs
- Apache Kafka — Consumer Group Tooling: https://kafka.apache.org/documentation/#basic_ops_consumer_group
- Apache Kafka — Design & Replication: https://kafka.apache.org/documentation/#design
- "Kafka Operations Troubleshooting 1" — Measuring and Monitoring Consumer Lag
- "Kafka Operations Troubleshooting 6" — Diagnosing and Stopping the Rebalance Storm
— The Data Dynamics Engineering Team