[Kafka Ops 6] Taming the Rebalance Storm in Consumer Groups
We dissect the 'rebalance storm' — where a consumer group rebalances endlessly — through the heartbeat/session/poll timeout model, then tame it with cooperative rebalancing and static membership.
2 a.m. A consumer lag alert fires. You open the dashboard and find the same consumer group rebalancing dozens of times per minute. The consumers are all up and running, yet almost no messages are being processed. A partition gets assigned, then immediately revoked, then assigned again, then revoked — on and on. This is the rebalance storm, the thing operators dread most. In this post we dissect how the storm forms through the timeout model, then cover the concrete settings and strategies that calm it down.
What you'll learn in this post
- What a rebalance is and which events trigger it
- The liveness model created by three timeouts: heartbeat, session, and poll-interval
- The anatomy of the storm loop — "slow processing → kicked → reassignment → even slower" that never converges
- How to stop the storm with timeout tuning, cooperative rebalancing, and static membership
1. What a Rebalance Is and What Triggers It
Definition
A consumer group divides a topic's partitions among its members (consumers). A rebalance is the process of recomputing and redistributing this "partition → member" assignment. The broker-side Group Coordinator drives the rebalance, and members receive their new assignment through the JoinGroup → SyncGroup protocol.
A rebalance is a normal mechanism in itself. The problem is when it happens too often and won't stop.
Events that trigger a rebalance
| Trigger | Description | How common |
|---|---|---|
| Member join | A new consumer instance joins the group (scale-out, restart) | Normal |
| Member leave | A consumer leaves the group gracefully (clean shutdown) | Normal |
| Subscription change | The subscribe() topic list changes, or a regex subscription matches a new topic | Occasional |
| Partition count change | A topic's partition count increases | Rare |
| Missed heartbeat | A member fails to send a heartbeat within session.timeout.ms | Common culprit |
| Poll delay | A member fails to call poll() within max.poll.interval.ms (processing too slow) | Common culprit |
The first four are intentional changes you can't avoid. What torments us in operations is almost always the last two — a member that gets kicked out because it "looks dead." The consumer process is perfectly fine, but the coordinator decides it's dead and revokes its partitions.
2. The Timeout Model — Two Independent Liveness Checks
To understand a rebalance storm, you must distinguish the two separate mechanisms that judge whether a consumer is "alive." Many operators confuse the two and end up tuning the wrong setting.
Mechanism 1 — The heartbeat thread (heartbeat / session)
A consumer runs a separate heartbeat thread in the background. This thread sends "I'm alive" signals to the coordinator independently, whether or not the application is processing messages.
heartbeat.interval.ms(default 3000): how often heartbeats are sent. Shorter means faster death detection but more traffic.session.timeout.ms(default 45000): if the coordinator receives no heartbeat within this window, it declares the member dead and starts a rebalance.
session.timeout.ms must fall between the broker's group.min.session.timeout.ms (default 6000) and group.max.session.timeout.ms (default 1800000). A value outside this range causes the consumer's join to be rejected.
Recommended ratio: heartbeat.interval.ms ≈ session.timeout.ms / 3. This ensures at least three heartbeat opportunities before the session expires — a safety margin so a single network blip doesn't get the member kicked.
Mechanism 2 — The poll loop (max.poll.interval)
Even if the heartbeat thread is alive, whether the application thread is actually doing work is a separate matter. If a consumer spends too long processing the records returned by poll() and fails to call the next poll(), that consumer should be considered "alive but stuck."
max.poll.interval.ms(default 300000, 5 min): the maximum allowed gap between twopoll()calls. Exceed it and the consumer leaves the group on its own, and the coordinator starts a rebalance.max.poll.records(default 500): the maximum number of records a singlepoll()returns. The larger this value, the longer one batch takes to process.
The key point is that these two mechanisms are independent.
| Measures | Who sends/judges | On violation | |
|---|---|---|---|
| session.timeout | Liveness of the heartbeat thread | Background thread ↔ coordinator | Coordinator declares it dead |
| max.poll.interval | Progress of the poll loop | The application thread itself | Consumer leaves the group itself |
In other words, the most common situation in practice is heartbeats going fine but the member getting kicked because processing is slow. The heartbeat thread diligently shouts "I'm alive," while the main thread is stuck on one heavy record for over five minutes. In this case, raising session.timeout.ms is useless — the real culprit is max.poll.interval.ms.
3. How the Storm Forms
A rebalance storm is not a single cause but a self-reinforcing loop. It usually starts with processing delay.
- Some consumer hits a heavy batch of messages and its processing exceeds
max.poll.interval.ms. - That consumer is kicked from the group, and a rebalance occurs.
- The kicked consumer's partitions are redistributed to the remaining consumers. The survivors now carry more partitions.
- The now-overloaded consumers also slow down and begin exceeding
max.poll.interval.ms. - Kicked again → reassigned again → even more load on the survivors → even slower...
It never converges. On top of that, the rebalance itself is costly. In the eager protocol, every member revokes all of its partitions (stop-the-world) during a rebalance, then gets new ones. The more time processing is halted, the more lag piles up, and the accumulated lag makes the next batch heavier, accelerating the vicious cycle.
This symptom ties directly to Part 2 — "Consumer lag won't decrease." When you trace why lag won't go down, it's often not that the consumer can't process, but that it keeps grinding to a halt because of rebalances. Always look at rebalance frequency alongside the lag graph (metrics like RebalanceRatePerHour, or the frequency of Preparing to rebalance in coordinator logs).
Signs of a storm
- Consumer logs repeating
Member ... sending LeaveGroup request/Attempt to heartbeat failed since group is rebalancing. - Broker (coordinator) logs repeating
Preparing to rebalance group ... (reason: removing member ... on ...)every minute. - Consumers are up with no OOM/crash, yet throughput oscillates near zero.
4. Tuning Settings to Calm the Storm
The epicenter of a storm is almost always processing exceeding a timeout. So the prescription has two directions: "make processing finish within the timeout, or raise the timeout to match the processing time."
Aligning the timeouts
# heartbeat/session — so a transient network blip doesn't get you kicked
session.timeout.ms=45000
heartbeat.interval.ms=15000 # session / 3
# poll loop — generous enough for one batch's processing time
max.poll.interval.ms=600000 # 5 min → 10 min (based on real P99 processing time)
max.poll.records=200 # 500 → 200, make the batch lighterThe decision criteria are clear.
- If you're kicked because heartbeats are dropping (network/GC) → raise
session.timeout.msand setheartbeat.interval.msto a third of it. - If you're kicked because processing is slow (heavy batches) → raise
max.poll.interval.mscomfortably above your P99 batch processing time, or lowermax.poll.recordsto make each batch lighter. Doing both together is usually the most effective.
Move heavy processing off the poll thread
The most fundamental fix is to move heavy work outside the poll loop. Let poll() only fetch records quickly, and delegate the actual processing to a separate worker thread pool. The catch: you must then manage ordering and dedup guarantees between offset commits and in-flight processing yourself. You'll also want a backpressure pattern using pause()/resume() so you commit only up to records that have actually finished.
| Prescription | Effect | Caveat |
|---|---|---|
session.timeout.ms ↑ | Fewer kicks from blips/GC | Slower detection of truly dead members |
max.poll.interval.ms ↑ | Fewer kicks from slow processing | Slower detection of truly stuck members |
max.poll.records ↓ | Lighter batch → shorter poll interval | Throughput preserved (slightly more overhead) |
| Offload processing | poll always returns quickly | Must implement offset/ordering management yourself |
5. Cooperative (Incremental) Rebalancing
No matter how well you tune your settings, eager rebalancing carries the inherent cost of stop-the-world on every rebalance. Cooperative (Incremental) Rebalancing (KIP-429) reduces it.
Eager vs cooperative
| Aspect | Eager (RangeAssignor, RoundRobinAssignor) | Cooperative (CooperativeStickyAssignor) |
|---|---|---|
| Revocation | Every member revokes all partitions at once | Only the partitions that need reassignment are revoked incrementally |
| Processing pause | The whole group stops during the rebalance (stop-the-world) | Unaffected partitions keep processing |
| Rebalance count | 1 | Usually 2 (revoke, then rejoin), but with a small blast radius |
| Storm impact | Large stalls accelerate lag buildup | Small stalls help mitigate the storm |
In the eager protocol, even one member joining or leaving makes the entire group drop all partitions first. This wastes effort revoking partitions you already owned and then getting them back. Cooperative revokes only the partitions whose owner actually has to change. The rest keep processing, which greatly reduces the "stall begets stall" vicious cycle during a storm.
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignorMigration note
The eager → cooperative switch is not safe unless every consumer in the group uses the same strategy at the same time. For a zero-downtime rolling switch, you typically go through two phases.
- First deploy: list both the old strategy and CooperativeSticky in
partition.assignment.strategy(e.g.,RangeAssignor, CooperativeStickyAssignor). Support both until every instance has this config. - Second deploy: remove the old eager strategy from the list, leaving only
CooperativeStickyAssignor.
If you switch all at once, mixed strategies among members mid-transition can break assignment negotiation.
6. Static Membership — Stopping Rolling-Restart Storms
If you roll-restart consumers on every deploy, each member leaving and rejoining causes two rebalances (one on leave, one on rejoin). With many instances, a single deploy triggers dozens of rebalance storms. Static Membership (KIP-345) eliminates this.
By giving each consumer instance a fixed group.instance.id, the coordinator remembers it as a "static member." If a static member disappears briefly and returns with the same ID within session.timeout.ms, the coordinator triggers no rebalance and hands back the previous assignment unchanged.
# A stable ID unique to each instance (e.g., StatefulSet ordinal, hostname)
group.instance.id=consumer-prod-0
# The restart must finish within this window for the rebalance to be skipped → keep it generous
session.timeout.ms=120000Key operational points:
group.instance.idmust be unique per instance and identical across restarts. In Kubernetes, a StatefulSet's Pod ordinal (pod-0,pod-1, ...) is a perfect fit.- The restart must finish within
session.timeout.ms. That's why, with static membership, the session timeout is often set generously (e.g., 2–5 minutes) above how long a deploy takes. - Trade-off: a static member that truly dies isn't detected until
session.timeout.msfully elapses. It's a balance between availability (fast failure detection) and deploy stability (avoiding rebalances).
Static Membership + CooperativeStickyAssignor together is the de facto standard combination for large consumer groups today.
7. Diagnostic Checklist — Reading the Timeline
When you hit a storm, narrow down the cause through this timeline lens. Seeing how the heartbeat interval, session expiry, and poll interval interlock along the time axis reveals at a glance which mechanism blew up.
| Symptom / log | Suspected cause | First action |
|---|---|---|
Repeated heartbeat failed ... group is rebalancing | Another member triggered the kick | Trace the kicked member's logs |
LeaveGroup right after processing | Poll delay (excessive processing) | max.poll.interval.ms↑ / max.poll.records↓ |
| Kicked after a long GC | Session expiry (stop-the-world GC) | session.timeout.ms↑, heap/GC tuning |
| Storm on every rolling deploy | Member restarts | group.instance.id (static membership) |
| Throughput 0 on every rebalance | Eager STW | Switch to CooperativeStickyAssignor |
Wrapping up
- A rebalance is a normal mechanism, but when it repeats endlessly it becomes a storm. The storm is almost always the self-reinforcing loop "slow processing → kicked → reassignment → even slower."
- A consumer's liveness is judged by two independent mechanisms: the heartbeat thread (
session.timeout.ms) and the poll loop (max.poll.interval.ms). If you can't tell them apart, you'll tune the wrong setting. The most common culprit in operations is poll delay. - Prescription: align
heartbeat.interval.ms≈session.timeout.ms/3, raisemax.poll.interval.msto match processing time or lowermax.poll.records, and offload heavy processing off the poll thread. - Eliminate stop-the-world revocation with CooperativeStickyAssignor, and avoid rolling-restart rebalances with Static Membership (
group.instance.id). This pairing is the standard prescription for large consumer groups. - If lag won't decrease (see Part 2), it may not be that the consumer is slow but that it's stalled by rebalances. Always read the lag graph and rebalance frequency together.
References
- Apache Kafka. "Consumer Configurations" — https://kafka.apache.org/documentation/#consumerconfigs
- KIP-429. "Kafka Consumer Incremental Rebalance Protocol" — https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol
- KIP-345. "Introduce static membership protocol to reduce consumer rebalances" — https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances
— The Data Dynamics Engineering Team