Blog
kafkaconsumer-grouprebalanceoperationstroubleshooting

[Kafka Ops 6] Taming the Rebalance Storm in Consumer Groups

We dissect the 'rebalance storm' — where a consumer group rebalances endlessly — through the heartbeat/session/poll timeout model, then tame it with cooperative rebalancing and static membership.

Data DynamicsJune 5, 202612 min read

2 a.m. A consumer lag alert fires. You open the dashboard and find the same consumer group rebalancing dozens of times per minute. The consumers are all up and running, yet almost no messages are being processed. A partition gets assigned, then immediately revoked, then assigned again, then revoked — on and on. This is the rebalance storm, the thing operators dread most. In this post we dissect how the storm forms through the timeout model, then cover the concrete settings and strategies that calm it down.

What you'll learn in this post

  • What a rebalance is and which events trigger it
  • The liveness model created by three timeouts: heartbeat, session, and poll-interval
  • The anatomy of the storm loop — "slow processing → kicked → reassignment → even slower" that never converges
  • How to stop the storm with timeout tuning, cooperative rebalancing, and static membership

1. What a Rebalance Is and What Triggers It

Definition

A consumer group divides a topic's partitions among its members (consumers). A rebalance is the process of recomputing and redistributing this "partition → member" assignment. The broker-side Group Coordinator drives the rebalance, and members receive their new assignment through the JoinGroupSyncGroup protocol.

A rebalance is a normal mechanism in itself. The problem is when it happens too often and won't stop.

Events that trigger a rebalance

TriggerDescriptionHow common
Member joinA new consumer instance joins the group (scale-out, restart)Normal
Member leaveA consumer leaves the group gracefully (clean shutdown)Normal
Subscription changeThe subscribe() topic list changes, or a regex subscription matches a new topicOccasional
Partition count changeA topic's partition count increasesRare
Missed heartbeatA member fails to send a heartbeat within session.timeout.msCommon culprit
Poll delayA member fails to call poll() within max.poll.interval.ms (processing too slow)Common culprit

The first four are intentional changes you can't avoid. What torments us in operations is almost always the last two — a member that gets kicked out because it "looks dead." The consumer process is perfectly fine, but the coordinator decides it's dead and revokes its partitions.


2. The Timeout Model — Two Independent Liveness Checks

To understand a rebalance storm, you must distinguish the two separate mechanisms that judge whether a consumer is "alive." Many operators confuse the two and end up tuning the wrong setting.

Mechanism 1 — The heartbeat thread (heartbeat / session)

A consumer runs a separate heartbeat thread in the background. This thread sends "I'm alive" signals to the coordinator independently, whether or not the application is processing messages.

  • heartbeat.interval.ms (default 3000): how often heartbeats are sent. Shorter means faster death detection but more traffic.
  • session.timeout.ms (default 45000): if the coordinator receives no heartbeat within this window, it declares the member dead and starts a rebalance.

session.timeout.ms must fall between the broker's group.min.session.timeout.ms (default 6000) and group.max.session.timeout.ms (default 1800000). A value outside this range causes the consumer's join to be rejected.

Recommended ratio: heartbeat.interval.mssession.timeout.ms / 3. This ensures at least three heartbeat opportunities before the session expires — a safety margin so a single network blip doesn't get the member kicked.

Mechanism 2 — The poll loop (max.poll.interval)

Even if the heartbeat thread is alive, whether the application thread is actually doing work is a separate matter. If a consumer spends too long processing the records returned by poll() and fails to call the next poll(), that consumer should be considered "alive but stuck."

  • max.poll.interval.ms (default 300000, 5 min): the maximum allowed gap between two poll() calls. Exceed it and the consumer leaves the group on its own, and the coordinator starts a rebalance.
  • max.poll.records (default 500): the maximum number of records a single poll() returns. The larger this value, the longer one batch takes to process.

The key point is that these two mechanisms are independent.

MeasuresWho sends/judgesOn violation
session.timeoutLiveness of the heartbeat threadBackground thread ↔ coordinatorCoordinator declares it dead
max.poll.intervalProgress of the poll loopThe application thread itselfConsumer leaves the group itself

In other words, the most common situation in practice is heartbeats going fine but the member getting kicked because processing is slow. The heartbeat thread diligently shouts "I'm alive," while the main thread is stuck on one heavy record for over five minutes. In this case, raising session.timeout.ms is useless — the real culprit is max.poll.interval.ms.


3. How the Storm Forms

A rebalance storm is not a single cause but a self-reinforcing loop. It usually starts with processing delay.

  1. Some consumer hits a heavy batch of messages and its processing exceeds max.poll.interval.ms.
  2. That consumer is kicked from the group, and a rebalance occurs.
  3. The kicked consumer's partitions are redistributed to the remaining consumers. The survivors now carry more partitions.
  4. The now-overloaded consumers also slow down and begin exceeding max.poll.interval.ms.
  5. Kicked again → reassigned again → even more load on the survivors → even slower...

It never converges. On top of that, the rebalance itself is costly. In the eager protocol, every member revokes all of its partitions (stop-the-world) during a rebalance, then gets new ones. The more time processing is halted, the more lag piles up, and the accumulated lag makes the next batch heavier, accelerating the vicious cycle.

Loading diagram…

This symptom ties directly to Part 2 — "Consumer lag won't decrease." When you trace why lag won't go down, it's often not that the consumer can't process, but that it keeps grinding to a halt because of rebalances. Always look at rebalance frequency alongside the lag graph (metrics like RebalanceRatePerHour, or the frequency of Preparing to rebalance in coordinator logs).

Signs of a storm

  • Consumer logs repeating Member ... sending LeaveGroup request / Attempt to heartbeat failed since group is rebalancing.
  • Broker (coordinator) logs repeating Preparing to rebalance group ... (reason: removing member ... on ...) every minute.
  • Consumers are up with no OOM/crash, yet throughput oscillates near zero.

4. Tuning Settings to Calm the Storm

The epicenter of a storm is almost always processing exceeding a timeout. So the prescription has two directions: "make processing finish within the timeout, or raise the timeout to match the processing time."

Aligning the timeouts

# heartbeat/session — so a transient network blip doesn't get you kicked
session.timeout.ms=45000
heartbeat.interval.ms=15000   # session / 3
 
# poll loop — generous enough for one batch's processing time
max.poll.interval.ms=600000   # 5 min → 10 min (based on real P99 processing time)
max.poll.records=200          # 500 → 200, make the batch lighter

The decision criteria are clear.

  • If you're kicked because heartbeats are dropping (network/GC) → raise session.timeout.ms and set heartbeat.interval.ms to a third of it.
  • If you're kicked because processing is slow (heavy batches) → raise max.poll.interval.ms comfortably above your P99 batch processing time, or lower max.poll.records to make each batch lighter. Doing both together is usually the most effective.

Move heavy processing off the poll thread

The most fundamental fix is to move heavy work outside the poll loop. Let poll() only fetch records quickly, and delegate the actual processing to a separate worker thread pool. The catch: you must then manage ordering and dedup guarantees between offset commits and in-flight processing yourself. You'll also want a backpressure pattern using pause()/resume() so you commit only up to records that have actually finished.

PrescriptionEffectCaveat
session.timeout.msFewer kicks from blips/GCSlower detection of truly dead members
max.poll.interval.msFewer kicks from slow processingSlower detection of truly stuck members
max.poll.recordsLighter batch → shorter poll intervalThroughput preserved (slightly more overhead)
Offload processingpoll always returns quicklyMust implement offset/ordering management yourself

5. Cooperative (Incremental) Rebalancing

No matter how well you tune your settings, eager rebalancing carries the inherent cost of stop-the-world on every rebalance. Cooperative (Incremental) Rebalancing (KIP-429) reduces it.

Eager vs cooperative

AspectEager (RangeAssignor, RoundRobinAssignor)Cooperative (CooperativeStickyAssignor)
RevocationEvery member revokes all partitions at onceOnly the partitions that need reassignment are revoked incrementally
Processing pauseThe whole group stops during the rebalance (stop-the-world)Unaffected partitions keep processing
Rebalance count1Usually 2 (revoke, then rejoin), but with a small blast radius
Storm impactLarge stalls accelerate lag buildupSmall stalls help mitigate the storm

In the eager protocol, even one member joining or leaving makes the entire group drop all partitions first. This wastes effort revoking partitions you already owned and then getting them back. Cooperative revokes only the partitions whose owner actually has to change. The rest keep processing, which greatly reduces the "stall begets stall" vicious cycle during a storm.

partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

Migration note

The eager → cooperative switch is not safe unless every consumer in the group uses the same strategy at the same time. For a zero-downtime rolling switch, you typically go through two phases.

  1. First deploy: list both the old strategy and CooperativeSticky in partition.assignment.strategy (e.g., RangeAssignor, CooperativeStickyAssignor). Support both until every instance has this config.
  2. Second deploy: remove the old eager strategy from the list, leaving only CooperativeStickyAssignor.

If you switch all at once, mixed strategies among members mid-transition can break assignment negotiation.


6. Static Membership — Stopping Rolling-Restart Storms

If you roll-restart consumers on every deploy, each member leaving and rejoining causes two rebalances (one on leave, one on rejoin). With many instances, a single deploy triggers dozens of rebalance storms. Static Membership (KIP-345) eliminates this.

By giving each consumer instance a fixed group.instance.id, the coordinator remembers it as a "static member." If a static member disappears briefly and returns with the same ID within session.timeout.ms, the coordinator triggers no rebalance and hands back the previous assignment unchanged.

# A stable ID unique to each instance (e.g., StatefulSet ordinal, hostname)
group.instance.id=consumer-prod-0
 
# The restart must finish within this window for the rebalance to be skipped → keep it generous
session.timeout.ms=120000

Key operational points:

  • group.instance.id must be unique per instance and identical across restarts. In Kubernetes, a StatefulSet's Pod ordinal (pod-0, pod-1, ...) is a perfect fit.
  • The restart must finish within session.timeout.ms. That's why, with static membership, the session timeout is often set generously (e.g., 2–5 minutes) above how long a deploy takes.
  • Trade-off: a static member that truly dies isn't detected until session.timeout.ms fully elapses. It's a balance between availability (fast failure detection) and deploy stability (avoiding rebalances).

Static Membership + CooperativeStickyAssignor together is the de facto standard combination for large consumer groups today.


7. Diagnostic Checklist — Reading the Timeline

When you hit a storm, narrow down the cause through this timeline lens. Seeing how the heartbeat interval, session expiry, and poll interval interlock along the time axis reveals at a glance which mechanism blew up.

Loading diagram…
Symptom / logSuspected causeFirst action
Repeated heartbeat failed ... group is rebalancingAnother member triggered the kickTrace the kicked member's logs
LeaveGroup right after processingPoll delay (excessive processing)max.poll.interval.ms↑ / max.poll.records
Kicked after a long GCSession expiry (stop-the-world GC)session.timeout.ms↑, heap/GC tuning
Storm on every rolling deployMember restartsgroup.instance.id (static membership)
Throughput 0 on every rebalanceEager STWSwitch to CooperativeStickyAssignor

Wrapping up

  • A rebalance is a normal mechanism, but when it repeats endlessly it becomes a storm. The storm is almost always the self-reinforcing loop "slow processing → kicked → reassignment → even slower."
  • A consumer's liveness is judged by two independent mechanisms: the heartbeat thread (session.timeout.ms) and the poll loop (max.poll.interval.ms). If you can't tell them apart, you'll tune the wrong setting. The most common culprit in operations is poll delay.
  • Prescription: align heartbeat.interval.mssession.timeout.ms/3, raise max.poll.interval.ms to match processing time or lower max.poll.records, and offload heavy processing off the poll thread.
  • Eliminate stop-the-world revocation with CooperativeStickyAssignor, and avoid rolling-restart rebalances with Static Membership (group.instance.id). This pairing is the standard prescription for large consumer groups.
  • If lag won't decrease (see Part 2), it may not be that the consumer is slow but that it's stalled by rebalances. Always read the lag graph and rebalance frequency together.

References


— The Data Dynamics Engineering Team