
[Kafka Ops 9] Incident Response Runbook — A Symptom-to-Fix Decision Tree

Consumer lag spikes, URP, offline partitions, missing controller, full disks, ISR flapping, request latency — a runbook to detect, diagnose, and remediate Kafka incidents by symptom. A decision tree, JMX metric thresholds, and a CLI cheat sheet that tie Parts 1-8 together.

Data Dynamics · June 8, 2026 · 18 min read

It's 3 a.m. and your pager goes off. All it says is "producer latency alert" — it won't tell you what actually broke. Did a broker die? Is the disk full? Did the controller vanish? What you need in that moment isn't heroic intuition but a runbook that starts from a single symptom and narrows down to the cause. This post ties together the diagnostic techniques from the whole Kafka Ops series (Parts 1-8) into one decision tree — a manual you can follow step by step in the middle of an incident.

What you'll learn in this post

A master decision tree that flows symptom → branch → action

Detect/diagnose/remediate procedures for the 7 major symptoms: consumer lag spikes, URP, offline partitions, missing controller, and more

The JMX metric tied to each symptom and its healthy threshold

A CLI cheat sheet you can use right in the trenches (kafka-topics.sh, kafka-consumer-groups.sh, kafka-log-dirs.sh, kafka-reassign-partitions.sh)

A map of when to reach for each diagnostic technique from Parts 1-8

1. How to use this runbook — start from the symptom

The most common mistake in incident response is guessing the cause first. You decide "it's probably GC" and start digging through GC logs, only to miss that the disk is actually full. A good runbook works the other way around. It starts from an observable symptom (a metric or an alert), follows branch conditions to narrow the cause, and ends with a verified action.

Every section in this post follows the same three-step structure.

Step	Question	Tooling
Detect	What tells us something is wrong?	JMX metrics, alert rules
Diagnose	Which of the possible causes is it?	CLI, logs, branch conditions
Remediate	What do we safely do about it?	Restart, reassign, config change

And before any action, record three things: (1) when the symptom started, (2) recent deploy/config changes, and (3) the affected topics, partitions, and consumer groups. Those three lines are half of your postmortem.

2. The master decision tree

This is the first picture to pull up when the pager fires. The most critical symptoms (those that directly affect data availability) sit at the top, and branches descend in order of narrowing blast radius.

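[Diagram: master decision tree. Branches run in priority order: offline partitions → controller → disk → URP → ISR → latency → lag]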

The branch order is the priority order. Offline partitions and a missing controller threaten cluster-wide availability, so check them first. Consumer lag is serious but usually recoverable without data loss, so it comes last. When several symptoms appear at once, it's usually one upstream cause (e.g. a broker down) cascading into URP, ISR, and lag — so clear the cause near the top of the tree first.
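The priority walk can be sketched in a few lines of Python. This is a minimal illustration rather than the tree itself: the metric key names (`offline_partitions`, `disk_used_ratio`, and so on) are hypothetical labels, and the thresholds are the ones quoted in the later sections of this post.

```python
from typing import Mapping

# Walk the symptoms in the runbook's priority order and stop at the first hit.
# The dictionary keys are illustrative names, not real exporter labels.
def triage(m: Mapping[str, float]) -> str:
    """Return the first matching symptom, highest availability impact first."""
    if m.get("offline_partitions", 0) > 0:
        return "offline-partitions"   # section 5
    if m.get("active_controller_sum", 1) != 1:
        return "controller"           # section 6
    if m.get("offline_log_dirs", 0) > 0 or m.get("disk_used_ratio", 0.0) >= 0.90:
        return "disk"                 # section 7
    if m.get("under_replicated", 0) > 0:
        return "urp"                  # section 4
    if m.get("isr_shrinks_per_sec", 0.0) > 0:
        return "isr-flapping"         # section 8
    if m.get("handler_idle_ratio", 1.0) < 0.2:
        return "request-latency"      # section 9
    if m.get("lag_rising", False):
        return "consumer-lag"         # section 3
    return "no-known-symptom"
```

Because the function returns on the first match, a snapshot showing both URP and an offline partition resolves to the offline partition, which mirrors the "clear the cause near the top first" rule.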

3. Symptom: Consumer lag spiking

The most direct signal that a downstream system isn't getting data in time. (For consumer internals see Parts 1-2; for deeper lag monitoring see Part 6.)

Detect

Metric	Location	Meaning
`records-lag-max`	Consumer client JMX (`consumer-fetch-manager-metrics`)	Lag (in records) of the most-behind partition
`records-consumed-rate`	Consumer client JMX	Records consumed per second — converging to 0 means stalled
Group `LAG` sum	`kafka-consumer-groups.sh --describe`	Total group lag. Track the trend with a lag monitor (e.g. Burrow)

Alert on trend rather than absolute value: "lag rising monotonically for N minutes" or "consume rate at 0 while produce rate is positive" are good conditions.
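Both trend conditions are easy to express in code. The sketch below assumes lag is sampled at a fixed interval (say once a minute) and passed in oldest first; the function names and the default of five samples are illustrative choices, not values from any monitoring tool.

```python
from typing import Sequence

def lag_rising(samples: Sequence[int], min_points: int = 5) -> bool:
    """True if the last `min_points` lag samples are strictly increasing."""
    if len(samples) < min_points:
        return False  # not enough history to call it a trend
    window = samples[-min_points:]
    return all(a < b for a, b in zip(window, window[1:]))

def consumer_stalled(consume_rate: float, produce_rate: float) -> bool:
    """True if nothing is being consumed while records are still produced."""
    return consume_rate == 0 and produce_rate > 0
```

A single flat or falling sample resets the monotonic condition, which keeps short bursts from paging anyone.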

Diagnose

# Group state and per-partition lag at a glance
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --group payment-consumer --describe

Branch points in the output:

Observation	Likely cause	Next step
`CONSUMER-ID` empty	Consumer dropped out (crash/deploy)	Check consumer process/logs
`STATE = PreparingRebalance` persists (add `--state` to `--describe` to see it)	Rebalance storm, session timeout	Check `max.poll.interval.ms` / processing time
Only specific partitions lag	Hot partition / key skew / slow handler	Review partitioning and processing logic
All partitions lag uniformly	Insufficient throughput / downstream backpressure	Scale out consumers
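The lag-shape branches in this table can be approximated mechanically once the per-partition rows are parsed. The sketch below assumes the rows have already been extracted into a hypothetical `PartitionLag` record; the skew factor of 10 is an arbitrary illustration, and the rebalance-state branch is not covered because it depends on group state rather than lag.

```python
from dataclasses import dataclass
from statistics import median
from typing import List

@dataclass
class PartitionLag:
    partition: int
    lag: int
    consumer_id: str  # "-" or "" when no group member owns the partition

def classify(rows: List[PartitionLag], skew_factor: float = 10.0) -> str:
    """Map per-partition lag to one of the branches in the table above."""
    if any(r.consumer_id in ("", "-") for r in rows):
        return "consumer-missing"
    lags = [r.lag for r in rows]
    if max(lags, default=0) == 0:
        return "healthy"
    typical = max(median(lags), 1)  # avoid comparing against a zero median
    if max(lags) >= skew_factor * typical:
        return "hot-partition"
    return "uniform-lag"
```

Comparing the worst partition against the median rather than the mean keeps one very hot partition from hiding its own skew.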

Remediate

Rebalance storm: lower max.poll.records or speed up processing so each poll loop finishes within max.poll.interval.ms. Switching to cooperative rebalancing (the cooperative-sticky assignor) reduces stop-the-world rebalances.
Insufficient throughput: add consumer instances up to the partition count (consumers beyond the partition count sit idle).
Hot partition: rethink key design or increase partitions to redistribute (mind the impact on existing ordering).
Downstream backpressure: clear the sink bottleneck (DB, external API) first. Adding consumers alone just increases sink load.

4. Symptom: Under-Replicated Partitions (URP)

The leader is alive, but some replicas have fallen out of the ISR (In-Sync Replicas). Not immediate data loss, but a danger zone of reduced fault tolerance.

Detect

Metric	Healthy	Warning
`kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`	`0`	Investigate if `> 0` persists for minutes

# See exactly which partitions are under-replicated
kafka-topics.sh --bootstrap-server broker:9092 \
  --describe --under-replicated-partitions

Diagnose

URP usually comes from one of three causes.

Broker down: confirm live brokers with kafka-broker-api-versions.sh --bootstrap-server ..., or watch for BrokerChange in controller logs. Every replica on a dead broker shows up as URP.
Slow follower: a follower fails to catch up within replica.lag.time.max.ms (default 30s) and drops out of the ISR. Disk I/O saturation or long GC are common causes (same diagnosis as Part 8).
Network: packet loss or insufficient bandwidth between brokers. Confirm via RequestQueueTimeMs and replication traffic metrics.

Remediate

Cause	Action
Broker down	Recover/restart the broker. On return, followers auto-catch-up and rejoin the ISR → URP converges to 0
Permanently lost broker	Use `kafka-reassign-partitions.sh` to move replicas to live brokers
Slow follower	Resolve the disk/GC issue (§8). If transient, recovers automatically
Leader skew	After recovery, rebalance with `kafka-leader-election.sh --election-type preferred`

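When doing a rolling restart, take one broker at a time, and move to the next only after URP returns to 0 following the previous broker's restart. Taking two down at once can drop partitions below min.insync.replicas (for example, RF=3 with min.insync.replicas=2 leaves a single in-sync replica), which blocks acks=all writes.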
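The one-at-a-time rule amounts to a gate between restarts. Here is a minimal sketch of that gate, assuming you supply two hypothetical callables: `restart` (however your environment restarts a broker) and `urp_count` (however you read the cluster-wide UnderReplicatedPartitions value). The polling interval and timeout are placeholder values.

```python
import time
from typing import Callable, Iterable, List

def rolling_restart(
    brokers: Iterable[int],
    restart: Callable[[int], None],
    urp_count: Callable[[], int],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Restart brokers one by one; move on only once URP is back to 0."""
    done: List[int] = []
    for broker in brokers:
        restart(broker)
        waited = 0.0
        while urp_count() > 0:
            if waited >= timeout_seconds:
                raise TimeoutError(f"URP did not clear after broker {broker}")
            sleep(poll_seconds)
            waited += poll_seconds
        done.append(broker)
    return done
```

In practice you would also wait for the restarted broker to register before trusting a URP reading of 0, since the metric can lag behind the restart by a scrape interval.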

5. Symptom: Offline partitions

The most severe symptom. The partition has no leader, so neither produce nor consume is possible. It directly hits data availability, so check it at the top of the tree. (For unclean leader election and ISR semantics, see Part 5.)

Detect

Metric	Healthy	Warning
`kafka.controller:type=KafkaController,name=OfflinePartitionsCount`	`0`	Respond immediately (page) if `> 0`

Diagnose

Offline partitions typically arise from this combination:

No replicas left in the ISR + unclean leader election disabled (unclean.leader.election.enable=false, the recommended production default). When no in-sync replica is available, Kafka won't promote an out-of-sync replica (which may be missing committed messages) to leader and leaves the partition offline, choosing consistency over availability.

# Find partitions with no leader (printed as "none"; older versions may print "-1")
kafka-topics.sh --bootstrap-server broker:9092 \
  --describe --topic orders | grep -E "Leader: (none|-1)"

To narrow the cause, check whether multiple brokers went down at once just before (power, rack, or storage failure). With RF=3, if the ISR has already shrunk to one and that broker dies too, the ISR empties and the partition goes offline.
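The progression from healthy to offline follows directly from three numbers: the replication factor, the current ISR size, and min.insync.replicas. The sketch below encodes that ladder; the default of `min_insync_replicas=2` is an assumption for illustration and should be replaced with your topic's actual setting.

```python
def partition_state(replication_factor: int, isr_size: int,
                    has_leader: bool, min_insync_replicas: int = 2) -> str:
    """Classify a partition by how far its ISR has shrunk."""
    if not has_leader:
        return "offline"            # no produce or consume possible
    if isr_size < min_insync_replicas:
        return "under-min-isr"      # acks=all writes are rejected
    if isr_size < replication_factor:
        return "under-replicated"   # still writable, reduced fault tolerance
    return "healthy"
```

With RF=3, the sequence 3 → 2 → 1 → no leader maps to healthy, under-replicated, under-min-isr, and offline, which is the path described above when brokers fail one after another.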

Remediate

If you must restore availability at the cost of data loss, temporarily enable unclean leader election for the affected topics only.

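# Allow unclean leader election for one topic (availability first, possible data loss)
kafka-configs.sh --bootstrap-server broker:9092 --alter \
  --entity-type topics --entity-name orders \
  --add-config unclean.leader.election.enable=true
# Always revert after recovery: drop the topic override so the broker default applies again
kafka-configs.sh --bootstrap-server broker:9092 --alter \
  --entity-type topics --entity-name orders \
  --delete-config unclean.leader.election.enable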

Unclean leader election is a last resort. It promotes an out-of-ISR replica to leader, so some messages may be lost permanently. If you enable it, revert to false right after recovery and record the potential loss window in your postmortem.

6. Symptom: No active controller (or duplicate)

The controller manages partition leadership, ISR, and topic metadata. With no controller, metadata changes like leader election and reassignment stop.

Detect

ActiveControllerCount should sum to exactly 1 across the whole cluster. Each broker reports 0 or 1, and a total of 1 is healthy.

Sum	Meaning	Severity
`1`	Healthy — single controller	OK
`0`	No controller — metadata changes stalled	Critical
`2+`	Split brain — duplicate controllers	Critical

# Confirm the sum of ActiveControllerCount across brokers (JMX → Prometheus recommended)
# Page immediately if the sum is not 1

Diagnose

KRaft mode (controller quorum): the controller is elected within a dedicated quorum (controller.quorum.voters). A sum of 0 means the quorum lost its majority — confirm a majority of controller nodes are alive. Inspect leader/replication state with kafka-metadata-quorum.sh.

# KRaft: check metadata quorum status
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --status
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --replication
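The "lost its majority" condition is plain quorum arithmetic. A small sketch, with hypothetical function names, of how many controller nodes must be alive and how many failures a quorum of a given size tolerates:

```python
def quorum_majority(voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    return voters // 2 + 1

def can_elect_controller(voters: int, alive: int) -> bool:
    """True if enough controller nodes are alive to elect a leader."""
    return alive >= quorum_majority(voters)

def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while keeping a majority."""
    return voters - quorum_majority(voters)
```

A 3-node quorum tolerates one failure and so does a 4-node quorum, which is why odd quorum sizes are the usual choice.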

ZooKeeper mode (legacy): the controller is decided by claiming the /controller znode in ZK. A sum of 0 suggests a ZK session expiry or lost connection; 2+ suggests (rarely) a zombie controller or a network partition. Confirm majority availability of the ZK ensemble and the ZK↔broker connectivity.

Remediate

Situation	Action
KRaft quorum lost majority	Recover controller nodes to restore the majority. If a data dir is corrupt, run the metadata restore procedure
ZK session issue	Stabilize the ZK ensemble (confirm majority of nodes), check `zookeeper.session.timeout.ms`
Duplicate controller (2+)	Resolve the network partition. Usually one side resigns once the split heals. To force a move, restart the current controller broker to trigger failover

A normal controller failover (e.g. restarting one controller broker) drops the sum to 0 briefly and recovers to 1. Debounce the alert so it fires only when the sum stays "not 1" for tens of seconds or more.
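The sum check and the debounce can be sketched together. This assumes you already collect the cluster-wide sum as timestamped samples; the 30-second hold is one reading of "tens of seconds" and should be tuned to your failover times.

```python
from typing import Iterable, List, Tuple

def controller_status(per_broker_counts: Iterable[int]) -> str:
    """Classify the cluster by the sum of ActiveControllerCount."""
    total = sum(per_broker_counts)
    if total == 1:
        return "ok"
    return "no-controller" if total == 0 else "duplicate-controller"

def should_page(samples: List[Tuple[float, int]], hold_seconds: float = 30.0) -> bool:
    """samples: (timestamp_seconds, cluster_sum) pairs, oldest first.

    Page only if the sum has been continuously != 1 for hold_seconds.
    """
    if not samples or samples[-1][1] == 1:
        return False
    bad_since = samples[-1][0]
    for ts, total in reversed(samples):
        if total == 1:
            break
        bad_since = ts
    return samples[-1][0] - bad_since >= hold_seconds
```

A brief dip to 0 during a normal failover never accumulates enough continuous "not 1" time to page, while a genuine loss of the controller does.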

7. Symptom: Disk full / log directory offline

When a broker has nowhere to write, it takes the affected log directory offline, and all replicas in that directory can cascade from URP into offline. (Storage and retention design are covered in detail in Part 7 — read it alongside this.)

Detect

Signal	Where to look / meaning
`kafka.log:type=LogManager,name=OfflineLogDirectoryCount`	Broker JMX; `> 0` means a disk/mount problem
Disk usage 95%+	Host metrics, e.g. node_exporter `node_filesystem_avail_bytes` / `node_filesystem_size_bytes`
`No space left on device`, `Error while writing to checkpoint file`	Broker server.log

Diagnose

# Per-broker / per-log-dir usage and partition sizes
# (the tool prints status lines before the JSON, so keep only the JSON line)
kafka-log-dirs.sh --bootstrap-server broker:9092 \
  --describe --broker-list 1,2,3 | grep '^{' | python3 -m json.tool | less
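Paging through raw JSON is slow during an incident, so a small summarizer helps. The sketch below assumes the JSON layout emitted by recent Kafka versions (`brokers` → `logDirs` → `partitions` with a `size` field, and an `error` field per log dir); verify those field names against your version before relying on it. The function name is hypothetical.

```python
import json
from typing import Dict, List, Tuple

def summarize_log_dirs(raw: str) -> Tuple[Dict[int, int], List[str]]:
    """Return (bytes used per broker, log dirs that report an error)."""
    # The tool prints status lines before the JSON; keep only the JSON line.
    json_line = next(l for l in raw.splitlines() if l.lstrip().startswith("{"))
    doc = json.loads(json_line)
    used: Dict[int, int] = {}
    failed: List[str] = []
    for broker in doc["brokers"]:
        total = 0
        for log_dir in broker["logDirs"]:
            if log_dir.get("error"):
                failed.append(f'{broker["broker"]}:{log_dir["logDir"]}')
            total += sum(p["size"] for p in log_dir["partitions"])
        used[broker["broker"]] = total
    return used, failed
```

Feed it the captured stdout of the command above; a non-empty second element points at the "log directory offline" branch rather than a plain capacity overrun.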

Branch points:

Simple capacity overrun: a topic accumulates faster than retention allows. Check whether retention.ms/retention.bytes are appropriate and whether there was a sudden traffic surge.
Log directory offline (mount/disk failure): OfflineLogDirectoryCount > 0. The disk itself failed. Replicas held by that broker are at risk.

Remediate

Cause	Immediate action	Root-cause action
Excessive retention	Temporarily lower `retention.ms`/`retention.bytes` on affected topics to trigger segment deletion	Redesign per-topic retention policy (Part 7)
Traffic surge	Add disk, or reassign some partitions to brokers with headroom	Capacity planning and alert thresholds
Disk failure	Inspect/replace the offline-directory broker. Replicas already exist on other brokers (RF≥2)	Tighten JBOD disk replacement procedure

# Temporarily lower retention to relieve disk pressure (caution: early data deletion)
kafka-configs.sh --bootstrap-server broker:9092 --alter \
  --entity-type topics --entity-name verbose-logs \
  --add-config retention.ms=3600000

Alert at 80%, act at 90% is the safe line. Once you hit 100%, the broker can't write, and recovery is harder because compaction and segment deletion need extra space too.
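The two thresholds, plus a rough time-to-full estimate, fit in a few lines. This is a sketch under simple assumptions: usage is given as a ratio between 0 and 1, and net growth (writes minus retention deletes) is treated as constant, which real traffic rarely is.

```python
def disk_action(used_ratio: float) -> str:
    """Map disk usage to the runbook's 80% / 90% lines."""
    if used_ratio >= 0.90:
        return "act"
    if used_ratio >= 0.80:
        return "alert"
    return "ok"

def hours_until_full(capacity_bytes: int, used_bytes: int,
                     net_growth_bytes_per_sec: float) -> float:
    """Rough hours left at the current net growth rate."""
    if net_growth_bytes_per_sec <= 0:
        return float("inf")  # disk is not filling up
    return (capacity_bytes - used_bytes) / net_growth_bytes_per_sec / 3600
```

The time-to-full number is what turns an 80% alert into a decision: hours of headroom means a retention change can wait for review, minutes means act now.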

8. Symptom: ISR shrinking/flapping

When replicas repeatedly enter and leave the ISR (flapping), it doesn't break availability by itself, but it's a precursor to URP and offline partitions.

Detect

Metric	Meaning	Warning
`kafka.server:type=ReplicaManager,name=IsrShrinksPerSec`	ISR shrinks per second	Persistently above 0
`kafka.server:type=ReplicaManager,name=IsrExpandsPerSec`	ISR expands per second	Repeating alongside shrinks = flapping

Alternating shrink and expand means replicas are walking the tightrope right at the replica.lag.time.max.ms boundary (default 30,000 ms).
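Distinguishing flapping from a one-way shrink only needs both rates over the same window. A minimal sketch, assuming the two sequences are samples of IsrShrinksPerSec and IsrExpandsPerSec over a recent window (the window length is up to you):

```python
from typing import Sequence

def isr_pattern(shrinks: Sequence[float], expands: Sequence[float]) -> str:
    """Classify ISR behaviour over one observation window."""
    shrank = any(v > 0 for v in shrinks)
    expanded = any(v > 0 for v in expands)
    if shrank and expanded:
        return "flapping"     # replicas leave and rejoin repeatedly
    if shrank:
        return "shrinking"    # replicas left and have not returned
    if expanded:
        return "recovering"   # replicas are rejoining
    return "stable"
```

A "shrinking" result with no expands is the URP case from the earlier section, while "flapping" points at the GC, disk, or network causes discussed next.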

Diagnose

GC pauses: a full GC exceeding replica.lag.time.max.ms briefly drops a follower from the ISR before it recovers. Check stop-the-world durations in the GC log.
Disk I/O saturation: the follower can't write the leader's data to disk fast enough. Check %util/await in iostat -x.
Network jitter: replication traffic stalls intermittently.

Remediate

GC: tune heap and G1 parameters; check that partitions/segments per broker aren't excessive.
Disk: move to faster storage or rebalance load (partition reassignment); keep page-cache headroom.
Adjust the boundary value carefully: blindly raising replica.lag.time.max.ms keeps truly-lagging replicas in the ISR and weakens consistency guarantees. Fix the root cause (GC/disk) first.

9. Symptom: Request latency high

Brokers are alive and replication is healthy, but produce/fetch responses are slow. The usual cause is broker-internal thread-pool saturation.

Detect

Metric	Healthy	Warning
`kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent`	`> 0.3` (30%+ idle)	`< 0.2` = I/O threads saturated
`kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`	`> 0.3`	`< 0.2` = network threads saturated
`kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce` (p99)	Stable baseline	p99 spikes
`...,request=FetchConsumer` (p99)	Stable baseline	p99 spikes

RequestHandlerAvgIdlePercent ranges 0-1 and is an idle ratio. The closer to 0, the more the I/O handler threads are saturated with no time to rest.

Diagnose

TotalTimeMs can be broken into stages. Where the time leaks tells you the cause.

Sub-metric	If elevated, suspect
`RequestQueueTimeMs`	Too few I/O handler threads → check `num.io.threads`
`LocalTimeMs`	Disk/page-cache bottleneck (read/write latency)
`RemoteTimeMs`	Replication wait — slow follower under `acks=all` (§4/§8)
`ResponseQueueTimeMs` / `ResponseSendTimeMs`	Too few network threads / slow client receive
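Picking the stage where the time leaks can be automated once the sub-metrics are scraped. The sketch below takes a dictionary of stage name to milliseconds (for example, p99 values from the same scrape) and returns the largest stage with the suspicion from the table; the `SUSPECTS` mapping and function name are illustrative.

```python
from typing import Dict, Tuple

# Suspicions copied from the table above, keyed by RequestMetrics sub-metric.
SUSPECTS: Dict[str, str] = {
    "RequestQueueTimeMs": "too few I/O handler threads (check num.io.threads)",
    "LocalTimeMs": "disk or page-cache bottleneck",
    "RemoteTimeMs": "replication wait on a slow follower under acks=all",
    "ResponseQueueTimeMs": "too few network threads or slow client receive",
    "ResponseSendTimeMs": "too few network threads or slow client receive",
}

def dominant_stage(stage_ms: Dict[str, float]) -> Tuple[str, str]:
    """Return (stage, suspected cause) for the stage taking the most time."""
    known = {k: v for k, v in stage_ms.items() if k in SUSPECTS}
    if not known:
        raise ValueError("no known RequestMetrics stages supplied")
    stage = max(known, key=known.get)
    return stage, SUSPECTS[stage]
```

For example, a Produce request with 180 ms in RemoteTimeMs and single-digit milliseconds elsewhere points at replication, not at thread counts.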

Remediate

Cause	Action
I/O threads saturated	Raise `num.io.threads` (a common starting point is at least the number of disks), distribute load
Network threads saturated	Raise `num.network.threads`
Replication wait (RemoteTime)	Fix the slow-follower cause (§8), revisit the `acks` policy
Disk bottleneck (LocalTime)	Upgrade storage, redistribute hot partitions
Plain overload	Add partitions/brokers, tune client batching (`linger.ms`/`batch.size`)

Before adding threads, look at idle ratio and queue time together. If handler idle is healthy but only p99 is high, the problem likely isn't threads but disk or replication (RemoteTime).

10. Must-watch JMX metrics at a glance

Metric	MBean (summary)	Healthy threshold	What it warns about	Section
OfflinePartitionsCount	`KafkaController`	`= 0`	Leaderless partitions, reads/writes stopped	§5
ActiveControllerCount	`KafkaController`	cluster sum `= 1`	No controller (0) / duplicate (2+)	§6
UnderReplicatedPartitions	`ReplicaManager`	`= 0`	Reduced replication fault tolerance	§4
OfflineLogDirectoryCount	`LogManager`	`= 0`	Disk/mount failure	§7
IsrShrinksPerSec	`ReplicaManager`	normally `0`	Slow follower / GC / disk	§8
IsrExpandsPerSec	`ReplicaManager`	normally `0`	Flapping (alongside shrink)	§8
RequestHandlerAvgIdlePercent	`KafkaRequestHandlerPool`	`> 0.3`	I/O thread saturation	§9
NetworkProcessorAvgIdlePercent	`SocketServer`	`> 0.3`	Network thread saturation	§9
TotalTimeMs (Produce/Fetch p99)	`RequestMetrics`	stable baseline	Request latency	§9
records-lag-max	Consumer client	flat trend	Consumer lag	§3
BytesInPerSec / BytesOutPerSec	`BrokerTopicMetrics`	vs. baseline	Sudden traffic change (capacity)	§7

Absolute thresholds vary per cluster. Alert on count-type metrics (Offline/URP/Controller) by exact value (0 or 1), and on ratio, latency, and lag metrics (IdlePercent/p99/lag) by deviation from baseline. This cuts false positives.
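The two alerting styles differ only in what they compare against. A minimal sketch, assuming a short history of recent samples serves as the baseline; the 3-sigma band is a common convention used here for illustration, not a value from this series.

```python
from statistics import mean, pstdev
from typing import Sequence

def exact_value_alert(value: float, expected: float) -> bool:
    """For count-type metrics: any deviation from the expected value fires."""
    return value != expected

def baseline_alert(value: float, history: Sequence[float], sigmas: float = 3.0) -> bool:
    """For ratio/latency/lag metrics: fire when far outside the recent baseline."""
    if len(history) < 2:
        return False  # no baseline yet
    mu, sd = mean(history), pstdev(history)
    if sd == 0:
        return value != mu
    return abs(value - mu) > sigmas * sd
```

OfflinePartitionsCount would use `exact_value_alert(value, 0)`, while a Produce p99 would go through `baseline_alert` against the last hour or day of samples.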

11. CLI cheat sheet

The commands you'll reach for most in the trenches. Don't confuse --bootstrap-server (brokers) with --bootstrap-controller (KRaft controllers).

# ── Topic / replication state ───────────────────
# Print only under-replicated partitions
kafka-topics.sh --bootstrap-server broker:9092 --describe --under-replicated-partitions
# Partitions below min ISR
kafka-topics.sh --bootstrap-server broker:9092 --describe --under-min-isr-partitions
# Leaderless (offline) partitions
kafka-topics.sh --bootstrap-server broker:9092 --describe --unavailable-partitions
 
# ── Consumer groups / lag ───────────────────────
kafka-consumer-groups.sh --bootstrap-server broker:9092 --list
kafka-consumer-groups.sh --bootstrap-server broker:9092 --group payment-consumer --describe
 
# ── Disk / log directories ──────────────────────
kafka-log-dirs.sh --bootstrap-server broker:9092 --describe --broker-list 1,2,3
 
# ── Partition reassignment (lost / skewed brokers) ──
# 1) Generate a reassignment plan
kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
  --topics-to-move-json-file topics.json --broker-list "1,2,3" --generate
# 2) Execute
kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
  --reassignment-json-file reassignment.json --execute --throttle 50000000
# 3) Check progress (once the reassignment is complete, --verify also removes the throttle)
kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
  --reassignment-json-file reassignment.json --verify
 
# ── Preferred leader election (rebalance after broker returns) ──
kafka-leader-election.sh --bootstrap-server broker:9092 --election-type preferred --all-topic-partitions
 
# ── Controller / metadata quorum (KRaft) ────────
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --status
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --replication
 
# ── Dynamic config changes (retention / unclean election etc.) ──
kafka-configs.sh --bootstrap-server broker:9092 --entity-type topics --entity-name orders --describe

Throttle replication bandwidth with --throttle (bytes/sec) during reassignment. A large un-throttled reassignment lets replication traffic crowd out normal produce/fetch, so you end up inflicting the §9 request-latency symptom on yourself.
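Two small helpers make the reassignment steps less error-prone: one builds the topics.json input for --generate, the other gives a rough lower bound on how long a throttled move will take. This is a sketch; the function names are hypothetical, and the time estimate assumes the throttle is the only bottleneck and applies to the whole transfer.

```python
import json
from typing import Iterable

def topics_to_move(topics: Iterable[str]) -> str:
    """Build the JSON body for --topics-to-move-json-file."""
    return json.dumps({"version": 1, "topics": [{"topic": t} for t in topics]}, indent=2)

def min_transfer_minutes(bytes_to_move: int, throttle_bytes_per_sec: int) -> float:
    """Lower bound on reassignment time at the given throttle."""
    return bytes_to_move / throttle_bytes_per_sec / 60
```

At the 50000000 bytes/sec (50 MB/s) throttle used in the cheat sheet, moving 300 GB takes at least 100 minutes, which is worth knowing before you start a move in the middle of an incident.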

12. Series map — which part to open when

This runbook is the entry point. Once you've narrowed the cause from the symptom, dive into the relevant part for deeper diagnosis and tuning.

Symptom / topic	Section here	Part to go deeper
Consumer mechanics / offsets	§3	Parts 1-2 (consumer / offset management)
Advanced consumer-lag monitoring	§3	Part 6 (lag monitoring / alerting)
ISR / unclean leader election semantics	§5/§8	Part 5 (replication / ISR / durability)
Disk / retention / storage	§7	Part 7 (storage operations)
Broker / network tuning	§4/§9	(Broker tuning part)

Wrapping up

Incident response starts from the symptom, not a guess. Pin the master tree from §2 next to your pager.
The branch order is the priority order. Check by availability impact: offline partitions → controller → disk → URP → ISR → latency → lag.
When several symptoms appear at once, it's usually one upstream cause (e.g. a broker down) cascading. Clear it from the top of the tree.
Alert on count-type metrics (Offline/URP/Controller) by exact value, and on ratio, latency, and lag metrics by deviation from baseline.
Unclean leader election and un-throttled reassignment are double-edged swords. Don't forget to revert after recovery and record it in the postmortem.
A good runbook is never written once. Update the tree, thresholds, and cheat sheet after every incident to make the next 3 a.m. a little easier.

References

Apache Kafka. "Monitoring" — https://kafka.apache.org/documentation/#monitoring

Apache Kafka. "Operations" — https://kafka.apache.org/documentation/#operations

Apache Kafka. "Datacenters & Geo-Replication" — https://kafka.apache.org/documentation/#datacenters

Apache Kafka. "KRaft Metadata Quorum" — https://kafka.apache.org/documentation/#kraft

— The Data Dynamics Engineering Team