[Kafka Ops 9] Incident Response Runbook — A Symptom-to-Fix Decision Tree

Consumer lag spikes, URP, offline partitions, missing controller, full disks, ISR flapping, request latency — a runbook to detect, diagnose, and remediate Kafka incidents by symptom. A decision tree, JMX metric thresholds, and a CLI cheat sheet that tie Parts 1-8 together.

Data Dynamics · June 8, 2026 · 18 min read

It's 3 a.m. and your pager goes off. All it says is "producer latency alert" — it won't tell you what actually broke. Did a broker die? Is the disk full? Did the controller vanish? What you need in that moment isn't heroic intuition but a runbook that starts from a single symptom and narrows down to the cause. This post ties together the diagnostic techniques from the whole Kafka Ops series (Parts 1-8) into one decision tree — a manual you can follow step by step in the middle of an incident.

What you'll learn in this post

  • A master decision tree that flows symptom → branch → action
  • Detect/diagnose/remediate procedures for the 7 major symptoms: consumer lag spikes, URP, offline partitions, missing controller, and more
  • The JMX metric tied to each symptom and its healthy threshold
  • A CLI cheat sheet you can use right in the trenches (kafka-topics.sh, kafka-consumer-groups.sh, kafka-log-dirs.sh, kafka-reassign-partitions.sh)
  • A map of when to reach for each diagnostic technique from Parts 1-8

1. How to use this runbook — start from the symptom

The most common mistake in incident response is guessing the cause first. You decide "it's probably GC" and start digging through GC logs, only to miss that the disk is actually full. A good runbook works the other way around. It starts from an observable symptom (a metric or an alert), follows branch conditions to narrow the cause, and ends with a verified action.

Every section in this post follows the same three-step structure.

| Step | Question | Tooling |
|---|---|---|
| Detect | What tells us something is wrong? | JMX metrics, alert rules |
| Diagnose | Which of the possible causes is it? | CLI, logs, branch conditions |
| Remediate | What do we safely do about it? | Restart, reassign, config change |

And before any action, record three things: (1) when the symptom started, (2) recent deploy/config changes, and (3) the affected topics, partitions, and consumer groups. Those three lines are half of your postmortem.


2. The master decision tree

This is the first picture to pull up when the pager fires. The most critical symptoms (those that directly affect data availability) sit at the top, and branches descend in order of narrowing blast radius.

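[Diagram: master decision tree. Branches run in priority order: offline partitions → controller → disk → URP → ISR → latency → lag.]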

The branch order is the priority order. Offline partitions and a missing controller threaten cluster-wide availability, so check them first. Consumer lag is serious but usually recoverable without data loss, so it comes last. When several symptoms appear at once, it's usually one upstream cause (e.g. a broker down) cascading into URP, ISR, and lag — so clear the cause near the top of the tree first.


3. Symptom: Consumer lag spiking

The most direct signal that a downstream consumer isn't getting data in time. (For consumer internals see Parts 1-2; for deeper lag monitoring see Part 6.)

Detect

| Metric | Location | Meaning |
|---|---|---|
| records-lag-max | Consumer client JMX (consumer-fetch-manager-metrics) | Lag (in records) of the most-behind partition |
| records-consumed-rate | Consumer client JMX | Records consumed per second; converging to 0 means stalled |
| Group LAG sum | kafka-consumer-groups.sh --describe | Total group lag. Track trend with an exporter (e.g. Burrow) |

Alert on trend rather than absolute value: "lag rising monotonically for N minutes" or "consume rate at 0 while produce rate is positive" are good conditions.
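The two conditions above can be sketched as small shell predicates. This is a minimal sketch, not an alerting system: the function names are hypothetical, and it assumes you already export one lag sample per scrape interval (oldest first) plus the current consume and produce rates.

```shell
# lag_rising N: reads one lag sample per line (oldest first) on stdin.
# Exits 0 if the last N samples are strictly increasing, 1 otherwise.
lag_rising() {
  awk -v n="$1" '
    { v[NR] = $1 + 0 }
    END {
      if (NR < n) exit 1               # not enough samples yet
      for (i = NR - n + 2; i <= NR; i++)
        if (v[i] <= v[i - 1]) exit 1   # a flat or falling step breaks the trend
      exit 0
    }'
}

# consumer_stalled CONSUME_RATE PRODUCE_RATE:
# exits 0 when nothing is consumed while records are still being produced.
consumer_stalled() {
  awk -v c="$1" -v p="$2" 'BEGIN { exit !(c + 0 == 0 && p + 0 > 0) }'
}
```

For example, `printf '10\n20\n35\n' | lag_rising 3` succeeds, while a single dip anywhere in the last three samples makes it fail, which keeps a noisy but healthy consumer from paging you.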

Diagnose

# Group state and per-partition lag at a glance
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --group payment-consumer --describe

Branch points in the output:

| Observation | Likely cause | Next step |
|---|---|---|
| CONSUMER-ID empty | Consumer dropped out (crash/deploy) | Check consumer process/logs |
| STATE = PreparingRebalance persists (add --state to --describe to see the group state) | Rebalance storm, session timeout | Check max.poll.interval.ms / processing time |
| Only specific partitions lag | Hot partition / key skew / slow handler | Review partitioning and processing logic |
| All partitions lag uniformly | Insufficient throughput / downstream backpressure | Scale out consumers |
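To read those branch points quickly, you can pipe the describe output through a small summarizer. The sketch below is an assumption-laden helper (`lag_summary` is a hypothetical name): it assumes the usual column order GROUP, TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG, CONSUMER-ID, HOST, CLIENT-ID, and that an unassigned partition shows `-` as its CONSUMER-ID. Check the header on your Kafka version before trusting it.

```shell
# lag_summary: reads `kafka-consumer-groups.sh --describe` output on stdin and
# prints total lag, the most-behind partition, and the count of unowned partitions.
lag_summary() {
  awk '
    $1 == "GROUP" { seen = 1; next }       # skip the header row
    !seen || NF < 7 { next }               # skip blank or unrelated lines
    {
      lag = ($6 == "-") ? 0 : $6 + 0       # "-" means no committed offset yet
      total += lag
      if (lag > max) { max = lag; hot = $2 "-" $3 }
      if ($7 == "-") unowned++             # no consumer owns this partition
    }
    END { printf "total=%d max=%d(%s) unowned=%d\n", total, max, hot, unowned }'
}
```

Usage: `kafka-consumer-groups.sh --bootstrap-server broker:9092 --group payment-consumer --describe | lag_summary`. A `max` close to `total` points at a hot partition; a non-zero `unowned` points at dropped consumers.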

Remediate

  • Rebalance storm: lower max.poll.records or speed up processing so each poll stays within max.poll.interval.ms. Switch to cooperative rebalancing (cooperative-sticky) to reduce stop-the-world.
  • Insufficient throughput: add consumer instances up to the partition count (consumers beyond the partition count sit idle).
  • Hot partition: rethink key design or increase partitions to redistribute (mind the impact on existing ordering).
  • Downstream backpressure: clear the sink bottleneck (DB, external API) first. Adding consumers alone just increases sink load.

4. Symptom: Under-Replicated Partitions (URP)

The leader is alive, but some replicas have fallen out of the ISR (In-Sync Replicas). Not immediate data loss, but a danger zone of reduced fault tolerance.

Detect

| Metric | Healthy | Warning |
|---|---|---|
| kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | 0 | Investigate if > 0 persists for minutes |

# See exactly which partitions are under-replicated
kafka-topics.sh --bootstrap-server broker:9092 \
  --describe --under-replicated-partitions

Diagnose

URP usually comes from one of three causes.

  • Broker down: confirm live brokers with kafka-broker-api-versions.sh --bootstrap-server ..., or watch for BrokerChange in controller logs. Every replica on a dead broker shows up as URP.
  • Slow follower: a follower fails to catch up within replica.lag.time.max.ms (default 30s) and drops out of the ISR. Disk I/O saturation or long GC are common causes (same diagnosis as Part 8).
  • Network: packet loss or insufficient bandwidth between brokers. Confirm via RequestQueueTimeMs and replication traffic metrics.

Remediate

| Cause | Action |
|---|---|
| Broker down | Recover/restart the broker. On return, followers auto-catch-up and rejoin the ISR → URP converges to 0 |
| Permanently lost broker | Use kafka-reassign-partitions.sh to move replicas to live brokers |
| Slow follower | Resolve the disk/GC issue (§8). If transient, recovers automatically |
| Leader skew | After recovery, rebalance with kafka-leader-election.sh --election-type preferred |

When doing a rolling restart, take one broker at a time, and move to the next only after the previous broker's URP returns to 0. Taking two down at once can break min.insync.replicas and block writes.
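The "wait for URP to return to 0" gate can be scripted so nobody skips it at 3 a.m. Below is a minimal sketch with a hypothetical helper name. It assumes the check command prints nothing when no partition is under-replicated, and it treats a failed command (for example an unreachable broker) as "not clear yet" rather than as success.

```shell
# wait_until_empty TRIES SLEEP_SECS CMD...: re-runs CMD until it prints nothing.
# Exits 0 once the output is empty, 1 if it is still non-empty after TRIES runs.
wait_until_empty() {
  _tries=$1; _pause=$2; shift 2
  while [ "$_tries" -gt 0 ]; do
    # Success needs both: the command ran cleanly AND printed nothing.
    if _out=$("$@" 2>/dev/null) && [ -z "$_out" ]; then return 0; fi
    _tries=$((_tries - 1))
    if [ "$_tries" -gt 0 ]; then sleep "$_pause"; fi
  done
  return 1
}
```

Usage between broker restarts: `wait_until_empty 60 10 kafka-topics.sh --bootstrap-server broker:9092 --describe --under-replicated-partitions || echo "URP did not clear, stop the rollout"`.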


5. Symptom: Offline partitions

The most severe symptom. The partition has no leader, so neither produce nor consume is possible. It directly hits data availability, so check it at the top of the tree. (For unclean leader election and ISR semantics, see Part 5.)

Detect

| Metric | Healthy | Warning |
|---|---|---|
| kafka.controller:type=KafkaController,name=OfflinePartitionsCount | 0 | Respond immediately (page) if > 0 |

Diagnose

Offline partitions typically arise from this combination:

  • No replicas left in the ISR + unclean leader election disabled (unclean.leader.election.enable=false, the recommended production default). When the ISR empties, Kafka won't promote an out-of-sync replica that may be missing committed messages, so it leaves the partition offline, choosing consistency over availability.
# Find partitions with no leader (printed as "none" on recent Kafka versions, -1 on older ones)
kafka-topics.sh --bootstrap-server broker:9092 \
  --describe --topic orders | grep -E "Leader: (none|-1)"

To narrow the cause, check whether multiple brokers went down at once just before (power, rack, or storage failure). With RF=3, if the ISR has already shrunk to one and that broker dies too, the ISR empties and the partition goes offline.

Remediate

If you must restore availability at the cost of data loss, temporarily enable unclean leader election for the affected topics only.

# Allow unclean leader election for one topic (availability first, possible data loss)
kafka-configs.sh --bootstrap-server broker:9092 --alter \
  --entity-type topics --entity-name orders \
  --add-config unclean.leader.election.enable=true
# Always revert after recovery

Unclean leader election is a last resort. It promotes an out-of-ISR replica to leader, so some messages may be lost permanently. If you enable it, revert to false right after recovery and record the potential loss window in your postmortem.
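Because both the enable and the revert step are easy to fat-finger under pressure, one option is a dry-run wrapper that prints the command unless you explicitly opt in. This is only a sketch: `unclean_election`, `BOOTSTRAP`, and `APPLY` are hypothetical names, not part of Kafka.

```shell
# unclean_election enable|revert TOPIC
# Prints the kafka-configs.sh command by default; set APPLY=1 to run it.
# BOOTSTRAP overrides the bootstrap server (defaults to broker:9092).
unclean_election() {
  case "$1" in
    enable) _value=true ;;
    revert) _value=false ;;
    *) echo "usage: unclean_election enable|revert TOPIC" >&2; return 2 ;;
  esac
  set -- kafka-configs.sh --bootstrap-server "${BOOTSTRAP:-broker:9092}" --alter \
    --entity-type topics --entity-name "$2" \
    --add-config "unclean.leader.election.enable=$_value"
  if [ "${APPLY:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi
}
```

Run `unclean_election enable orders` to review the command, `APPLY=1 unclean_election enable orders` to execute it, and `APPLY=1 unclean_election revert orders` once the partition is back.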


6. Symptom: No active controller (or duplicate)

The controller manages partition leadership, ISR, and topic metadata. With no controller, metadata changes like leader election and reassignment stop.

Detect

ActiveControllerCount should sum to exactly 1 across the whole cluster. Each broker reports 0 or 1, and a total of 1 is healthy.

| Sum | Meaning | Severity |
|---|---|---|
| 1 | Healthy: single controller | OK |
| 0 | No controller: metadata changes stalled | Critical |
| 2+ | Split brain: duplicate controllers | Critical |

# Confirm the sum of ActiveControllerCount across brokers (JMX → Prometheus recommended)
# Page immediately if the sum is not 1

Diagnose

  • KRaft mode (controller quorum): the controller is elected within a dedicated quorum (controller.quorum.voters). A sum of 0 means the quorum lost its majority — confirm a majority of controller nodes are alive. Inspect leader/replication state with kafka-metadata-quorum.sh.
# KRaft: check metadata quorum status
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --status
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --replication
  • ZooKeeper mode (legacy): the controller is decided by claiming the /controller znode in ZK. A sum of 0 suggests a ZK session expiry or lost connection; 2+ suggests (rarely) a zombie controller or a network partition. Confirm majority availability of the ZK ensemble and the ZK↔broker connectivity.

Remediate

| Situation | Action |
|---|---|
| KRaft quorum lost majority | Recover controller nodes to restore the majority. If a data dir is corrupt, run the metadata restore procedure |
| ZK session issue | Stabilize the ZK ensemble (confirm majority of nodes), check zookeeper.session.timeout.ms |
| Duplicate controller (2+) | Resolve the network partition. Usually one side resigns once the split heals. To force a move, restart the current controller broker to trigger failover |

A normal controller failover (e.g. restarting one controller broker) drops the sum to 0 briefly and recovers to 1. Debounce the alert so it fires only when the sum stays "not 1" for tens of seconds or more.
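The debounce rule can be expressed as "fire only after K consecutive samples that are not 1". A minimal sketch, assuming you already collect the cluster-wide ActiveControllerCount sum once per scrape interval (the function name is hypothetical):

```shell
# controller_alert K: reads the cluster-wide ActiveControllerCount sum,
# one sample per line (oldest first). Exits 0 (fire the alert) only if the
# most recent K samples are all different from 1.
controller_alert() {
  awk -v k="$1" '
    { bad = ($1 + 0 != 1) ? bad + 1 : 0 }   # length of the current "not 1" streak
    END { exit !(bad >= k + 0) }'
}
```

With a 10-second scrape interval, `controller_alert 3` ignores a one- or two-sample dip during a normal failover but fires once the sum has been 0 or 2+ for about 30 seconds.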


7. Symptom: Disk full / log directory offline

When a broker has nowhere to write, it takes the affected log directory offline, and all replicas in that directory can cascade from URP into offline. (Storage and retention design are covered in detail in Part 7 — read it alongside this.)

Detect

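| Signal | Where to look / what it means |
|---|---|
| kafka.log:type=LogManager,name=OfflineLogDirectoryCount | Broker JMX; > 0 means a disk/mount problem |
| Disk usage 95%+ | node_exporter filesystem metrics (node_filesystem_avail_bytes / node_filesystem_size_bytes) |
| No space left on device, Error while writing to checkpoint file | Broker server.log |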

Diagnose

# Per-broker / per-log-dir usage and partition sizes
# (the tool prints status lines before the JSON, so keep only the JSON line)
kafka-log-dirs.sh --bootstrap-server broker:9092 \
  --describe --broker-list 1,2,3 | grep '^{' | python3 -m json.tool | less

Branch points:

  • Simple capacity overrun: a topic accumulates faster than retention allows. Check whether retention.ms/retention.bytes are appropriate and whether there was a sudden traffic surge.
  • Log directory offline (mount/disk failure): OfflineLogDirectoryCount > 0. The disk itself failed. Replicas held by that broker are at risk.

Remediate

| Cause | Immediate action | Root-cause action |
|---|---|---|
| Excessive retention | Temporarily lower retention.ms/retention.bytes on affected topics to trigger segment deletion | Redesign per-topic retention policy (Part 7) |
| Traffic surge | Add disk, or reassign some partitions to brokers with headroom | Capacity planning and alert thresholds |
| Disk failure | Inspect/replace the offline-directory broker. Replicas already exist on other brokers (RF≥2) | Tighten JBOD disk replacement procedure |

# Temporarily lower retention to relieve disk pressure (caution: early data deletion)
kafka-configs.sh --bootstrap-server broker:9092 --alter \
  --entity-type topics --entity-name verbose-logs \
  --add-config retention.ms=3600000

Alert at 80%, act at 90% is the safe line. Once you hit 100%, the broker can't write, and recovery is harder because compaction and segment deletion need extra space too.
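For a quick on-host check of the 80/90 line, the sketch below classifies `df -P` output. It is a hypothetical helper, assumes the POSIX `df -P` column layout (capacity in column 5, mount point in column 6), and does not handle mount points containing spaces.

```shell
# disk_levels: reads `df -P` output on stdin and prints "<mount> <pct> <level>"
# where level is ACT (>= 90%), ALERT (>= 80%), or ok.
disk_levels() {
  awk 'NR > 1 {
    pct = $5; sub(/%/, "", pct); pct += 0
    level = (pct >= 90) ? "ACT" : ((pct >= 80) ? "ALERT" : "ok")
    print $6, pct, level
  }'
}
```

Usage: `df -P /path/to/kafka/log/dir | disk_levels`, substituting the directories listed in your broker's log.dirs.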


8. Symptom: ISR shrinking/flapping

When replicas repeatedly enter and leave the ISR (flapping), it doesn't break availability by itself, but it's a precursor to URP and offline partitions.

Detect

| Metric | Meaning | Warning |
|---|---|---|
| kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | ISR shrinks per second | Persistently above 0 |
| kafka.server:type=ReplicaManager,name=IsrExpandsPerSec | ISR expands per second | Repeating alongside shrinks = flapping |

Alternating shrink and expand means replicas are walking the tightrope right at the replica.lag.time.max.ms boundary (default 30,000 ms).
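One way to tell flapping from a one-off shrink is to look at both rates over a window. The sketch below is illustrative only: the function name and the "at least two intervals of each" threshold are assumptions, and the input is one line per scrape interval holding the IsrShrinksPerSec and IsrExpandsPerSec values.

```shell
# isr_state: reads "<shrinks_per_sec> <expands_per_sec>" per line on stdin and
# prints one of: stable, transient, shrinking, flapping.
isr_state() {
  awk '
    $1 + 0 > 0 { s++ }                       # intervals with at least one shrink
    $2 + 0 > 0 { e++ }                       # intervals with at least one expand
    END {
      if (s >= 2 && e >= 2) print "flapping"         # repeated leave/rejoin cycles
      else if (s > 0 && e == 0) print "shrinking"    # replicas left and did not return
      else if (s > 0) print "transient"              # a single leave/rejoin
      else print "stable"
    }'
}
```

"shrinking" is the case that turns into URP (§4); "flapping" is the case to chase with the GC, disk, and network checks in this section.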

Diagnose

  • GC pauses: a full GC exceeding replica.lag.time.max.ms briefly drops a follower from the ISR before it recovers. Check stop-the-world durations in the GC log.
  • Disk I/O saturation: the follower can't write the leader's data to disk fast enough. Check %util/await in iostat -x.
  • Network jitter: replication traffic stalls intermittently.

Remediate

  • GC: tune heap and G1 parameters; check that partitions/segments per broker aren't excessive.
  • Disk: move to faster storage or rebalance load (partition reassignment); keep page-cache headroom.
  • Adjust the boundary value carefully: blindly raising replica.lag.time.max.ms keeps truly-lagging replicas in the ISR and weakens consistency guarantees. Fix the root cause (GC/disk) first.

9. Symptom: Request latency high

Brokers are alive and replication is healthy, but produce/fetch responses are slow. The usual cause is broker-internal thread-pool saturation.

Detect

| Metric | Healthy | Warning |
|---|---|---|
| kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | > 0.3 (30%+ idle) | < 0.2 = I/O threads saturated |
| kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent | > 0.3 | < 0.2 = network threads saturated |
| kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce (p99) | Stable baseline | p99 spikes |
| ...,request=FetchConsumer (p99) | Stable baseline | p99 spikes |

RequestHandlerAvgIdlePercent ranges 0-1 and is an idle ratio. The closer to 0, the more the I/O handler threads are saturated with no time to rest.

Diagnose

TotalTimeMs can be broken into stages. Where the time leaks tells you the cause.

| Sub-metric | If elevated, suspect |
|---|---|
| RequestQueueTimeMs | Too few I/O handler threads → check num.io.threads |
| LocalTimeMs | Disk/page-cache bottleneck (read/write latency) |
| RemoteTimeMs | Replication wait: slow follower under acks=all (§4/§8) |
| ResponseQueueTimeMs / ResponseSendTimeMs | Too few network threads / slow client receive |
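The table lookup can be turned into a tiny helper that names the dominant stage. This is a sketch with a hypothetical function name: you pass the five p99 values (in ms) you read from JMX, and it echoes the largest one with the suspect from the table above.

```shell
# slow_stage QUEUE LOCAL REMOTE RESP_QUEUE RESP_SEND
# Prints the largest TotalTimeMs component and what to suspect.
slow_stage() {
  awk -v q="$1" -v l="$2" -v r="$3" -v rq="$4" -v rs="$5" 'BEGIN {
    name[1] = "RequestQueueTimeMs";  val[1] = q;  why[1] = "too few I/O handler threads, check num.io.threads"
    name[2] = "LocalTimeMs";         val[2] = l;  why[2] = "disk/page-cache bottleneck"
    name[3] = "RemoteTimeMs";        val[3] = r;  why[3] = "replication wait, slow follower under acks=all"
    name[4] = "ResponseQueueTimeMs"; val[4] = rq; why[4] = "too few network threads / slow client receive"
    name[5] = "ResponseSendTimeMs";  val[5] = rs; why[5] = "too few network threads / slow client receive"
    best = 1
    for (i = 2; i <= 5; i++) if (val[i] + 0 > val[best] + 0) best = i
    print name[best] ": " why[best]
  }'
}
```

For example, `slow_stage 2 5 180 1 3` prints `RemoteTimeMs: replication wait, slow follower under acks=all`, which sends you to §4/§8 instead of thread tuning.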

Remediate

| Cause | Action |
|---|---|
| I/O threads saturated | Raise num.io.threads (typically match the disk count), distribute load |
| Network threads saturated | Raise num.network.threads |
| Replication wait (RemoteTime) | Fix the slow-follower cause (§8), revisit the acks policy |
| Disk bottleneck (LocalTime) | Upgrade storage, redistribute hot partitions |
| Plain overload | Add partitions/brokers, tune client batching (linger.ms/batch.size) |

Before adding threads, look at idle ratio and queue time together. If handler idle is healthy but only p99 is high, the problem likely isn't threads but disk or replication (RemoteTime).


10. Must-watch JMX metrics at a glance

Register these as alert rules and you can have most of the tree's branch questions answered automatically.

| Metric | MBean (summary) | Healthy threshold | What it warns about | Section |
|---|---|---|---|---|
| OfflinePartitionsCount | KafkaController | = 0 | Leaderless partitions, reads/writes stopped | §5 |
| ActiveControllerCount | KafkaController | cluster sum = 1 | No controller (0) / duplicate (2+) | §6 |
| UnderReplicatedPartitions | ReplicaManager | = 0 | Reduced replication fault tolerance | §4 |
| OfflineLogDirectoryCount | LogManager | = 0 | Disk/mount failure | §7 |
| IsrShrinksPerSec | ReplicaManager | normally 0 | Slow follower / GC / disk | §8 |
| IsrExpandsPerSec | ReplicaManager | normally 0 | Flapping (alongside shrink) | §8 |
| RequestHandlerAvgIdlePercent | KafkaRequestHandlerPool | > 0.3 | I/O thread saturation | §9 |
| NetworkProcessorAvgIdlePercent | SocketServer | > 0.3 | Network thread saturation | §9 |
| TotalTimeMs (Produce/Fetch p99) | RequestMetrics | stable baseline | Request latency | §9 |
| records-lag-max | Consumer client | flat trend | Consumer lag | §3 |
| BytesInPerSec / BytesOutPerSec | BrokerTopicMetrics | vs. baseline | Sudden traffic change (capacity) | §7 |

Absolute thresholds vary per cluster. Alert on counter-type metrics (Offline/URP/Controller) by exact value (0 or 1), and on rate/gauge-type metrics (IdlePercent/p99/lag) by deviation from baseline — this cuts false positives.
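A baseline-deviation rule can be as simple as a percentage check. The sketch below uses a hypothetical function name and assumes a positive baseline you have measured yourself (for example last week's p99 at the same hour).

```shell
# deviates BASELINE VALUE PCT: exits 0 if VALUE differs from BASELINE by more
# than PCT percent in either direction. A zero baseline never matches.
deviates() {
  awk -v b="$1" -v v="$2" -v p="$3" 'BEGIN {
    d = v - b; if (d < 0) d = -d
    exit !(b + 0 != 0 && d / b * 100 > p + 0)
  }'
}
```

For example, `deviates 100 160 50 && echo "p99 is more than 50% off baseline"`.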


11. CLI cheat sheet

The commands you'll reach for most in the trenches. Don't confuse --bootstrap-server (brokers) with --bootstrap-controller (KRaft controllers).

# ── Topic / replication state ───────────────────
# Print only under-replicated partitions
kafka-topics.sh --bootstrap-server broker:9092 --describe --under-replicated-partitions
# Partitions below min ISR
kafka-topics.sh --bootstrap-server broker:9092 --describe --under-min-isr-partitions
# Leaderless (offline) partitions
kafka-topics.sh --bootstrap-server broker:9092 --describe --unavailable-partitions
 
# ── Consumer groups / lag ───────────────────────
kafka-consumer-groups.sh --bootstrap-server broker:9092 --list
kafka-consumer-groups.sh --bootstrap-server broker:9092 --group payment-consumer --describe
 
# ── Disk / log directories ──────────────────────
kafka-log-dirs.sh --bootstrap-server broker:9092 --describe --broker-list 1,2,3
 
# ── Partition reassignment (lost / skewed brokers) ──
# 1) Generate a reassignment plan
kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
  --topics-to-move-json-file topics.json --broker-list "1,2,3" --generate
# 2) Execute
kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
  --reassignment-json-file reassignment.json --execute --throttle 50000000
# 3) Check progress
kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
  --reassignment-json-file reassignment.json --verify
 
# ── Preferred leader election (rebalance after broker returns) ──
kafka-leader-election.sh --bootstrap-server broker:9092 --election-type preferred --all-topic-partitions
 
# ── Controller / metadata quorum (KRaft) ────────
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --status
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --replication
 
# ── Dynamic config changes (retention / unclean election etc.) ──
kafka-configs.sh --bootstrap-server broker:9092 --entity-type topics --entity-name orders --describe

Throttle replication bandwidth with --throttle during reassignment. Running a large reassignment without a throttle lets replication traffic crowd out normal produce/fetch traffic, and you inflict the §9 request-latency symptom on yourself.
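Step 1 of the reassignment needs a topics.json file. A small generator avoids hand-typing JSON mid-incident; this is a sketch with a hypothetical function name, and it assumes topic names contain no characters that need JSON escaping.

```shell
# topics_json TOPIC...: prints the JSON expected by --topics-to-move-json-file.
topics_json() {
  printf '{"topics":['
  _sep=""
  for _t in "$@"; do
    printf '%s{"topic":"%s"}' "$_sep" "$_t"
    _sep=","
  done
  printf '],"version":1}\n'
}
```

Usage: `topics_json orders payments > topics.json`, then pass the file to the `--generate` step above.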


12. Series map — which part to open when

This runbook is the entry point. Once you've narrowed the cause from the symptom, dive into the relevant part for deeper diagnosis and tuning.

| Symptom / topic | Section here | Part to go deeper |
|---|---|---|
| Consumer mechanics / offsets | §3 | Parts 1-2 (consumer / offset management) |
| Advanced consumer-lag monitoring | §3 | Part 6 (lag monitoring / alerting) |
| ISR / unclean leader election semantics | §5/§8 | Part 5 (replication / ISR / durability) |
| Disk / retention / storage | §7 | Part 7 (storage operations) |
| Broker / network tuning | §4/§9 | (Broker tuning part) |

Wrapping up

  • Incident response starts from the symptom, not a guess. Pin the master tree from §2 next to your pager.
  • The branch order is the priority order. Check by availability impact: offline partitions → controller → disk → URP → ISR → latency → lag.
  • When several symptoms appear at once, it's usually one upstream cause (e.g. a broker down) cascading. Clear it from the top of the tree.
  • Alert on counter-type metrics (Offline/URP/Controller) by exact value, and on rate/gauge types by deviation from baseline.
  • Unclean leader election and un-throttled reassignment are double-edged swords. Don't forget to revert after recovery and record it in the postmortem.
  • A good runbook is never written once. Update the tree, thresholds, and cheat sheet after every incident to make the next 3 a.m. a little easier.

— The Data Dynamics Engineering Team