[Kafka Ops 9] Incident Response Runbook — A Symptom-to-Fix Decision Tree
Consumer lag spikes, URP, offline partitions, missing controller, full disks, ISR flapping, request latency — a runbook to detect, diagnose, and remediate Kafka incidents by symptom. A decision tree, JMX metric thresholds, and a CLI cheat sheet that tie Parts 1-8 together.
It's 3 a.m. and your pager goes off. All it says is "producer latency alert" — it won't tell you what actually broke. Did a broker die? Is the disk full? Did the controller vanish? What you need in that moment isn't heroic intuition but a runbook that starts from a single symptom and narrows down to the cause. This post ties together the diagnostic techniques from the whole Kafka Ops series (Parts 1-8) into one decision tree — a manual you can follow step by step in the middle of an incident.
What you'll learn in this post
- A master decision tree that flows symptom → branch → action
- Detect/diagnose/remediate procedures for the 7 major symptoms: consumer lag spikes, URP, offline partitions, missing controller, and more
- The JMX metric tied to each symptom and its healthy threshold
- A CLI cheat sheet you can use right in the trenches (kafka-topics.sh, kafka-consumer-groups.sh, kafka-log-dirs.sh, kafka-reassign-partitions.sh)
- A map of when to reach for each diagnostic technique from Parts 1-8
1. How to use this runbook — start from the symptom
The most common mistake in incident response is guessing the cause first. You decide "it's probably GC" and start digging through GC logs, only to miss that the disk is actually full. A good runbook works the other way around. It starts from an observable symptom (a metric or an alert), follows branch conditions to narrow the cause, and ends with a verified action.
Every section in this post follows the same three-step structure.
| Step | Question | Tooling |
|---|---|---|
| Detect | What tells us something is wrong? | JMX metrics, alert rules |
| Diagnose | Which of the possible causes is it? | CLI, logs, branch conditions |
| Remediate | What do we safely do about it? | Restart, reassign, config change |
And before any action, record three things: (1) when the symptom started, (2) recent deploy/config changes, and (3) the affected topics, partitions, and consumer groups. Those three lines are half of your postmortem.
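Those three items can be captured in a small structure before you touch anything. This is a minimal sketch; the class and field names are hypothetical and not part of any Kafka tooling, and the values are example data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical pre-action record: the three lines to write down first."""
    symptom_started_at: datetime
    recent_changes: List[str] = field(default_factory=list)
    affected: List[str] = field(default_factory=list)

    def summary(self) -> str:
        # One line per item so it pastes cleanly into a chat channel or ticket.
        return "\n".join([
            f"started: {self.symptom_started_at.isoformat()}",
            f"recent changes: {', '.join(self.recent_changes) or 'none known'}",
            f"affected: {', '.join(self.affected) or 'unknown'}",
        ])


record = IncidentRecord(
    symptom_started_at=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),  # example value
    recent_changes=["deploy payment-consumer"],
    affected=["topic orders", "group payment-consumer"],
)
print(record.summary())
```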
2. The master decision tree
This is the first picture to pull up when the pager fires. The most critical symptoms (those that directly affect data availability) sit at the top, and branches descend in order of narrowing blast radius.
The branch order is the priority order. Offline partitions and a missing controller threaten cluster-wide availability, so check them first. Consumer lag is serious but usually recoverable without data loss, so it comes last. When several symptoms appear at once, it's usually one upstream cause (e.g. a broker down) cascading into URP, ISR, and lag — so clear the cause near the top of the tree first.
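The priority order can also be written down as code, which makes it easy to review and to unit-test. The sketch below is one possible encoding, not the authoritative tree: the metric key names are hypothetical, and the cut-offs (90% disk usage, 0.2 idle ratio) are borrowed from the thresholds given in §7 and §9.

```python
def triage(metrics: dict) -> str:
    """Return the first matching branch for one snapshot of metrics.

    Checks run in priority order: offline partitions, controller, disk,
    URP, ISR, latency, lag. Missing keys are treated as healthy.
    """
    if metrics.get("offline_partitions", 0) > 0:
        return "offline-partitions (section 5)"
    if metrics.get("active_controller_sum", 1) != 1:
        return "controller (section 6)"
    if metrics.get("offline_log_dirs", 0) > 0 or metrics.get("disk_used_pct", 0.0) >= 90.0:
        return "disk (section 7)"
    if metrics.get("under_replicated", 0) > 0:
        return "urp (section 4)"
    if metrics.get("isr_shrinks_per_sec", 0.0) > 0:
        return "isr (section 8)"
    if metrics.get("handler_idle", 1.0) < 0.2 or metrics.get("network_idle", 1.0) < 0.2:
        return "latency (section 9)"
    if metrics.get("lag_rising", False):
        return "lag (section 3)"
    return "no branch matched"


# Several symptoms at once: the branch nearest the top of the tree wins.
print(triage({"under_replicated": 12, "isr_shrinks_per_sec": 0.4, "lag_rising": True}))
```

Because the function returns on the first match, a broker-down cascade (URP plus ISR shrinks plus lag) is routed to the URP branch rather than to the lag branch.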
3. Symptom: Consumer lag spiking
The most direct signal that a downstream consumer isn't getting data in time. (For consumer internals see Parts 1-2; for deeper lag monitoring see Part 6.)
Detect
| Metric | Location | Meaning |
|---|---|---|
| records-lag-max | Consumer client JMX (consumer-fetch-manager-metrics) | Lag (in records) of the most-behind partition |
| records-consumed-rate | Consumer client JMX | Records consumed per second; converging to 0 means stalled |
| Group LAG sum | kafka-consumer-groups.sh --describe | Total group lag. Track trend with an exporter (e.g. Burrow) |
Alert on trend rather than absolute value: "lag rising monotonically for N minutes" or "consume rate at 0 while produce rate is positive" are good conditions.
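Both trend conditions are simple to express once you have a short window of samples. The following is a sketch under assumptions: the function names are hypothetical, the samples are taken at a fixed scrape interval, and "N minutes" is translated into a number of samples by the caller.

```python
from typing import Sequence


def lag_rising_monotonically(samples: Sequence[int], min_points: int = 5) -> bool:
    """True if the last `min_points` lag samples are strictly increasing."""
    if len(samples) < min_points:
        return False  # not enough history to call it a trend
    window = samples[-min_points:]
    return all(earlier < later for earlier, later in zip(window, window[1:]))


def consumer_stalled(consume_rate: float, produce_rate: float) -> bool:
    """True if nothing is consumed while records are still being produced."""
    return consume_rate == 0.0 and produce_rate > 0.0


print(lag_rising_monotonically([10, 20, 35, 50, 80]))  # steadily climbing
print(lag_rising_monotonically([10, 20, 15, 50, 80]))  # dipped once, so no alert
print(consumer_stalled(consume_rate=0.0, produce_rate=120.0))
```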
Diagnose
# Group state and per-partition lag at a glance
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --group payment-consumer --describe
Branch points in the output:
| Observation | Likely cause | Next step |
|---|---|---|
| CONSUMER-ID empty | Consumer dropped out (crash/deploy) | Check consumer process/logs |
| STATE = PreparingRebalance persists (shown by --describe --state) | Rebalance storm, session timeout | Check max.poll.interval.ms / processing time |
| Only specific partitions lag | Hot partition / key skew / slow handler | Review partitioning and processing logic |
| All partitions lag uniformly | Insufficient throughput / downstream backpressure | Scale out consumers |
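The last two rows (skewed versus uniform lag) can be told apart mechanically from the per-partition LAG column. This is a rough heuristic sketch: the function name and the `skew_factor` of 10 are assumptions to tune for your own cluster.

```python
from statistics import median
from typing import Dict


def classify_lag(lag_by_partition: Dict[int, int], skew_factor: float = 10.0) -> str:
    """Classify per-partition lag as 'no-lag', 'skewed', or 'uniform'."""
    lags = list(lag_by_partition.values())
    if not lags or max(lags) == 0:
        return "no-lag"
    typical = max(median(lags), 1)  # avoid comparing against a zero median
    # One partition far above the typical lag points at a hot partition,
    # key skew, or a slow handler rather than a general throughput shortfall.
    if max(lags) > skew_factor * typical:
        return "skewed"
    return "uniform"


print(classify_lag({0: 100, 1: 120, 2: 50_000, 3: 90}))
print(classify_lag({0: 9_000, 1: 11_000, 2: 10_000}))
```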
Remediate
- Rebalance storm: lower max.poll.records or speed up processing so each poll stays within max.poll.interval.ms. Switch to cooperative rebalancing (cooperative-sticky) to reduce stop-the-world pauses.
- Insufficient throughput: add consumer instances up to the partition count (consumers beyond the partition count sit idle).
- Hot partition: rethink key design or increase partitions to redistribute (mind the impact on existing ordering).
- Downstream backpressure: clear the sink bottleneck (DB, external API) first. Adding consumers alone just increases sink load.
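For the rebalance-storm case, a quick calculation shows how large a batch can be before one poll loop overruns max.poll.interval.ms. This is a back-of-the-envelope sketch: the function name and the 50% safety margin are assumptions, and 300,000 ms is the consumer's default max.poll.interval.ms.

```python
import math


def max_poll_records_budget(per_record_ms: float,
                            max_poll_interval_ms: int = 300_000,
                            safety: float = 0.5) -> int:
    """Largest batch whose processing time fits in a fraction of the poll interval."""
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    budget_ms = max_poll_interval_ms * safety
    return max(1, math.floor(budget_ms / per_record_ms))


# A handler that takes about 800 ms per record (e.g. a slow external call):
print(max_poll_records_budget(per_record_ms=800))
```

At 800 ms per record the budget comes out to 187 records, well below the default max.poll.records of 500, which is why a slow handler combined with default settings tends to trigger repeated rebalances.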
4. Symptom: Under-Replicated Partitions (URP)
The leader is alive, but some replicas have fallen out of the ISR (In-Sync Replicas). Not immediate data loss, but a danger zone of reduced fault tolerance.
Detect
| Metric | Healthy | Warning |
|---|---|---|
| kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | 0 | Investigate if > 0 persists for minutes |
# See exactly which partitions are under-replicated
kafka-topics.sh --bootstrap-server broker:9092 \
  --describe --under-replicated-partitions
Diagnose
URP usually comes from one of three causes.
- Broker down: confirm live brokers with kafka-broker-api-versions.sh --bootstrap-server ..., or watch for BrokerChange in controller logs. Every replica on a dead broker shows up as URP.
- Slow follower: a follower fails to catch up within replica.lag.time.max.ms (default 30s) and drops out of the ISR. Disk I/O saturation or long GC are common causes (same diagnosis as Part 8).
- Network: packet loss or insufficient bandwidth between brokers. Confirm via RequestQueueTimeMs and replication traffic metrics.
Remediate
| Cause | Action |
|---|---|
| Broker down | Recover/restart the broker. On return, followers auto-catch-up and rejoin the ISR → URP converges to 0 |
| Permanently lost broker | Use kafka-reassign-partitions.sh to move replicas to live brokers |
| Slow follower | Resolve the disk/GC issue (§8). If transient, recovers automatically |
| Leader skew | After recovery, rebalance with kafka-leader-election.sh --election-type preferred |
When doing a rolling restart, take one broker at a time, and move to the next only after the previous broker's URP returns to 0. Taking two down at once can shrink the ISR below min.insync.replicas and block acks=all writes.
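The "one broker at a time, wait for URP to reach 0" rule is essentially a gated loop. The sketch below shows the control flow only: `restart` and `urp_count` are hypothetical callables you would wire to your own orchestration and metrics, and the poll interval and timeout are arbitrary example values.

```python
import time
from typing import Callable, Iterable


def rolling_restart(brokers: Iterable[int],
                    restart: Callable[[int], None],
                    urp_count: Callable[[], int],
                    poll_seconds: float = 10.0,
                    timeout_seconds: float = 1800.0,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Restart brokers one by one, moving on only once URP is back to 0."""
    for broker_id in brokers:
        restart(broker_id)
        waited = 0.0
        # In practice, also confirm the broker has rejoined before trusting a
        # zero reading taken right after the restart command returns.
        while urp_count() > 0:
            if waited >= timeout_seconds:
                raise TimeoutError(f"URP did not return to 0 after broker {broker_id}")
            sleep(poll_seconds)
            waited += poll_seconds


# Dry run with fakes: URP readings of 3, 1, 0 after broker 1, then 2, 0, then 0.
readings = iter([3, 1, 0, 2, 0, 0])
restarted = []
rolling_restart([1, 2, 3], restart=restarted.append,
                urp_count=lambda: next(readings), sleep=lambda _s: None)
print(restarted)
```

Raising on timeout rather than continuing is deliberate: if URP never clears, proceeding to the next broker is exactly the "two down at once" situation the rule is meant to prevent.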
5. Symptom: Offline partitions
The most severe symptom. The partition has no leader, so neither produce nor consume is possible. It directly hits data availability, so check it at the top of the tree. (For unclean leader election and ISR semantics, see Part 5.)
Detect
| Metric | Healthy | Warning |
|---|---|---|
| kafka.controller:type=KafkaController,name=OfflinePartitionsCount | 0 | Respond immediately (page) if > 0 |
Diagnose
Offline partitions typically arise from this combination:
- No replicas left in the ISR + unclean leader election disabled (unclean.leader.election.enable=false, the recommended production default). When the ISR empties, Kafka won't promote a lagging out-of-sync replica to leader and leaves the partition offline, choosing consistency over availability.
# Find leaderless partitions (printed as -1 or none, depending on the Kafka version)
kafka-topics.sh --bootstrap-server broker:9092 \
  --describe --topic orders | grep -E "Leader: (-1|none)"
To narrow the cause, check whether multiple brokers went down at once just before (power, rack, or storage failure). With RF=3, if the ISR has already shrunk to one and that broker dies too, the ISR empties and the partition goes offline.
Remediate
If you must restore availability at the cost of data loss, temporarily enable unclean leader election for the affected topics only.
# Allow unclean leader election for one topic (availability first, possible data loss)
kafka-configs.sh --bootstrap-server broker:9092 --alter \
--entity-type topics --entity-name orders \
--add-config unclean.leader.election.enable=true
# Always revert after recovery
Unclean leader election is a last resort. It promotes an out-of-ISR replica to leader, so some messages may be lost permanently. If you enable it, revert to false right after recovery and record the potential loss window in your postmortem.
6. Symptom: No active controller (or duplicate)
The controller manages partition leadership, ISR, and topic metadata. With no controller, metadata changes like leader election and reassignment stop.
Detect
ActiveControllerCount should sum to exactly 1 across the whole cluster. Each broker reports 0 or 1, and a total of 1 is healthy.
| Sum | Meaning | Severity |
|---|---|---|
| 1 | Healthy (single controller) | OK |
| 0 | No controller; metadata changes stalled | Critical |
| 2+ | Split brain (duplicate controllers) | Critical |
# Confirm the sum of ActiveControllerCount across brokers (JMX → Prometheus recommended)
# Page immediately if the sum is not 1
Diagnose
- KRaft mode (controller quorum): the controller is elected within a dedicated quorum (controller.quorum.voters). A sum of 0 means the quorum lost its majority, so confirm a majority of controller nodes are alive. Inspect leader/replication state with kafka-metadata-quorum.sh.
# KRaft: check metadata quorum status
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --status
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --replication
- ZooKeeper mode (legacy): the controller is decided by claiming the /controller znode in ZK. A sum of 0 suggests a ZK session expiry or lost connection; 2+ suggests (rarely) a zombie controller or a network partition. Confirm majority availability of the ZK ensemble and the ZK↔broker connectivity.
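"Majority" in both modes means strictly more than half of the voting members. A one-line check makes the arithmetic explicit; the function name is hypothetical.

```python
def quorum_has_majority(total_voters: int, alive_voters: int) -> bool:
    """True if the alive voters form a strict majority of the quorum."""
    return alive_voters >= total_voters // 2 + 1


for total, alive in [(3, 2), (3, 1), (5, 3), (4, 2)]:
    print(total, alive, quorum_has_majority(total, alive))
```

A 3-node quorum tolerates one failure and a 5-node quorum tolerates two. A 4-node quorum still tolerates only one, which is why odd sizes are the usual choice.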
Remediate
| Situation | Action |
|---|---|
| KRaft quorum lost majority | Recover controller nodes to restore the majority. If a data dir is corrupt, run the metadata restore procedure |
| ZK session issue | Stabilize the ZK ensemble (confirm majority of nodes), check zookeeper.session.timeout.ms |
| Duplicate controller (2+) | Resolve the network partition. Usually one side resigns once the split heals. To force a move, restart the current controller broker to trigger failover |
A normal controller failover (e.g. restarting one controller broker) drops the sum to 0 briefly and recovers to 1. Debounce the alert so it fires only when the sum stays "not 1" for tens of seconds or more.
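The debounce can be sketched as "the sum has been something other than 1 continuously for at least N seconds". The function name, the sample format, and the 30-second default below are assumptions; most alerting systems offer an equivalent "for" duration on the rule itself.

```python
from typing import List, Tuple


def controller_alert(samples: List[Tuple[float, int]],
                     debounce_seconds: float = 30.0) -> bool:
    """samples: (timestamp_seconds, cluster-wide ActiveControllerCount sum), oldest first.

    Returns True only if the sum has been != 1 without interruption for at
    least `debounce_seconds`, measured up to the most recent sample.
    """
    bad_since = None
    for timestamp, total in samples:
        if total == 1:
            bad_since = None          # healthy reading resets the clock
        elif bad_since is None:
            bad_since = timestamp     # start of the current unhealthy streak
    if bad_since is None:
        return False
    return samples[-1][0] - bad_since >= debounce_seconds


print(controller_alert([(0, 1), (10, 0), (20, 1), (30, 1)]))           # brief failover
print(controller_alert([(0, 1), (10, 0), (20, 0), (30, 0), (40, 0)]))  # sustained loss
```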
7. Symptom: Disk full / log directory offline
When a broker has nowhere to write, it takes the affected log directory offline, and all replicas in that directory can cascade from URP into offline. (Storage and retention design are covered in detail in Part 7 — read it alongside this.)
Detect
| Signal | Location |
|---|---|
| kafka.log:type=LogManager,name=OfflineLogDirectoryCount | > 0 means a disk/mount problem |
| Disk usage 95%+ | node_exporter filesystem metrics (node_filesystem_avail_bytes relative to node_filesystem_size_bytes) |
| No space left on device, Error while writing to checkpoint file | Broker server.log |
Diagnose
# Per-broker / per-log-dir usage and partition sizes
# The tool prints status lines before the JSON, so keep only the JSON line
kafka-log-dirs.sh --bootstrap-server broker:9092 \
  --describe --broker-list 1,2,3 | grep '^{' | python3 -m json.tool | less
Branch points:
- Simple capacity overrun: a topic accumulates faster than retention allows. Check whether retention.ms/retention.bytes are appropriate and whether there was a sudden traffic surge.
- Log directory offline (mount/disk failure): OfflineLogDirectoryCount > 0. The disk itself failed. Replicas held by that broker are at risk.
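To separate the two branches quickly, it helps to pull the biggest partitions and any errored directories out of the kafka-log-dirs.sh JSON. The sketch below assumes the JSON layout as I understand it (`brokers` → `logDirs` → `partitions`, with `size` in bytes and a non-null `error` on a failed directory); verify the field names against your Kafka version. The function names and the sample document are hypothetical.

```python
import json
from typing import List, Tuple


def largest_partitions(log_dirs_json: str, top_n: int = 5) -> List[Tuple[int, int, str, str]]:
    """Return (size_bytes, broker_id, log_dir, partition) rows, largest first."""
    rows = []
    for broker in json.loads(log_dirs_json).get("brokers", []):
        for log_dir in broker.get("logDirs", []):
            for part in log_dir.get("partitions", []):
                rows.append((part["size"], broker["broker"], log_dir["logDir"], part["partition"]))
    rows.sort(reverse=True)
    return rows[:top_n]


def offline_log_dirs(log_dirs_json: str) -> List[Tuple[int, str]]:
    """Return (broker_id, log_dir) for directories that report an error."""
    return [(broker["broker"], log_dir["logDir"])
            for broker in json.loads(log_dirs_json).get("brokers", [])
            for log_dir in broker.get("logDirs", [])
            if log_dir.get("error") is not None]


# Hand-written sample in the assumed format, not real tool output.
sample = json.dumps({"version": 1, "brokers": [
    {"broker": 1, "logDirs": [{"logDir": "/data/kafka", "error": None, "partitions": [
        {"partition": "orders-0", "size": 500, "offsetLag": 0, "isFuture": False},
        {"partition": "verbose-logs-3", "size": 90_000, "offsetLag": 0, "isFuture": False}]}]},
    {"broker": 2, "logDirs": [{"logDir": "/data/kafka", "error": "KafkaStorageException",
                               "partitions": []}]},
]})
print(largest_partitions(sample, top_n=1))
print(offline_log_dirs(sample))
```

A single topic dominating the size ranking points to the capacity-overrun branch; any entry from `offline_log_dirs` points to the disk-failure branch.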
Remediate
| Cause | Immediate action | Root-cause action |
|---|---|---|
| Excessive retention | Temporarily lower retention.ms/retention.bytes on affected topics to trigger segment deletion | Redesign per-topic retention policy (Part 7) |
| Traffic surge | Add disk, or reassign some partitions to brokers with headroom | Capacity planning and alert thresholds |
| Disk failure | Inspect/replace the offline-directory broker. Replicas already exist on other brokers (RF≥2) | Tighten JBOD disk replacement procedure |
# Temporarily lower retention to relieve disk pressure (caution: early data deletion)
kafka-configs.sh --bootstrap-server broker:9092 --alter \
--entity-type topics --entity-name verbose-logs \
  --add-config retention.ms=3600000
Alerting at 80% and acting at 90% is the safe line. Once you hit 100%, the broker can't write, and recovery is harder because compaction and segment deletion need extra space too.
8. Symptom: ISR shrinking/flapping
When replicas repeatedly enter and leave the ISR (flapping), it doesn't break availability by itself, but it's a precursor to URP and offline partitions.
Detect
| Metric | Meaning | Warning |
|---|---|---|
| kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | ISR shrinks per second | Persistently above 0 |
| kafka.server:type=ReplicaManager,name=IsrExpandsPerSec | ISR expands per second | Repeating alongside shrinks = flapping |
Alternating shrink and expand means replicas are walking the tightrope right at the replica.lag.time.max.ms boundary (default 30,000 ms).
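Telling flapping apart from a one-way shrink only needs the two rates over the same window. This is a simplified sketch: the function name and labels are hypothetical, and it treats any non-zero sample in the window as activity.

```python
from typing import Sequence


def isr_state(shrinks_per_sec: Sequence[float], expands_per_sec: Sequence[float]) -> str:
    """Classify ISR behaviour over one observation window."""
    shrinking = any(rate > 0 for rate in shrinks_per_sec)
    expanding = any(rate > 0 for rate in expands_per_sec)
    if shrinking and expanding:
        return "flapping"      # replicas keep leaving and rejoining
    if shrinking:
        return "shrinking"     # replicas left and have not come back yet
    if expanding:
        return "recovering"    # replicas are rejoining after an earlier drop
    return "stable"


print(isr_state([0.0, 0.2, 0.0, 0.1], [0.0, 0.0, 0.2, 0.1]))
print(isr_state([0.3, 0.1], [0.0, 0.0]))
```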
Diagnose
- GC pauses: a full GC exceeding replica.lag.time.max.ms briefly drops a follower from the ISR before it recovers. Check stop-the-world durations in the GC log.
- Disk I/O saturation: the follower can't write the leader's data to disk fast enough. Check %util/await in iostat -x.
- Network jitter: replication traffic stalls intermittently.
Remediate
- GC: tune heap and G1 parameters; check that partitions/segments per broker aren't excessive.
- Disk: move to faster storage or rebalance load (partition reassignment); keep page-cache headroom.
- Adjust the boundary value carefully: blindly raising replica.lag.time.max.ms keeps truly-lagging replicas in the ISR and weakens consistency guarantees. Fix the root cause (GC/disk) first.
9. Symptom: Request latency high
Brokers are alive and replication is healthy, but produce/fetch responses are slow. The usual cause is broker-internal thread-pool saturation.
Detect
| Metric | Healthy | Warning |
|---|---|---|
| kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | > 0.3 (30%+ idle) | < 0.2 = I/O threads saturated |
| kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent | > 0.3 | < 0.2 = network threads saturated |
| kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce (p99) | Stable baseline | p99 spikes |
| ...,request=FetchConsumer (p99) | Stable baseline | p99 spikes |
RequestHandlerAvgIdlePercent ranges 0-1 and is an idle ratio. The closer to 0, the more the I/O handler threads are saturated with no time to rest.
Diagnose
TotalTimeMs can be broken into stages. Where the time leaks tells you the cause.
| Sub-metric | If elevated, suspect |
|---|---|
| RequestQueueTimeMs | Too few I/O handler threads → check num.io.threads |
| LocalTimeMs | Disk/page-cache bottleneck (read/write latency) |
| RemoteTimeMs | Replication wait: slow follower under acks=all (§4/§8) |
| ResponseQueueTimeMs / ResponseSendTimeMs | Too few network threads / slow client receive |
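The table reads naturally as a lookup: find the stage that contributes the most time, then map it to its suspect. The sketch below is one way to do that; the function name and the sample numbers are made up for illustration, and in practice you would compare each stage with its own baseline rather than only taking the maximum.

```python
from typing import Dict, Tuple

# Suspects copied from the table above.
SUSPECTS = {
    "RequestQueueTimeMs": "too few I/O handler threads (check num.io.threads)",
    "LocalTimeMs": "disk or page-cache bottleneck",
    "RemoteTimeMs": "replication wait (slow follower under acks=all)",
    "ResponseQueueTimeMs": "too few network threads or slow client receive",
    "ResponseSendTimeMs": "too few network threads or slow client receive",
}


def dominant_stage(stage_ms: Dict[str, float]) -> Tuple[str, str]:
    """Return (stage, suspected cause) for the stage with the largest time."""
    if not stage_ms:
        raise ValueError("stage_ms must not be empty")
    stage = max(stage_ms, key=stage_ms.get)
    return stage, SUSPECTS.get(stage, "unknown stage")


# Example p99 breakdown (milliseconds) for a Produce request.
print(dominant_stage({
    "RequestQueueTimeMs": 2.0,
    "LocalTimeMs": 5.0,
    "RemoteTimeMs": 180.0,
    "ResponseQueueTimeMs": 1.0,
    "ResponseSendTimeMs": 3.0,
}))
```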
Remediate
| Cause | Action |
|---|---|
| I/O threads saturated | Raise num.io.threads (at least the number of disks is a common starting point), distribute load |
| Network threads saturated | Raise num.network.threads |
| Replication wait (RemoteTime) | Fix the slow-follower cause (§8), revisit the acks policy |
| Disk bottleneck (LocalTime) | Upgrade storage, redistribute hot partitions |
| Plain overload | Add partitions/brokers, tune client batching (linger.ms/batch.size) |
Before adding threads, look at idle ratio and queue time together. If handler idle is healthy but only p99 is high, the problem likely isn't threads but disk or replication (RemoteTime).
10. Must-watch JMX metrics at a glance
Register these as alert rules and you can have most of the tree's branch questions answered automatically.
| Metric | MBean (summary) | Healthy threshold | What it warns about | Section |
|---|---|---|---|---|
| OfflinePartitionsCount | KafkaController | = 0 | Leaderless partitions, reads/writes stopped | §5 |
| ActiveControllerCount | KafkaController | cluster sum = 1 | No controller (0) / duplicate (2+) | §6 |
| UnderReplicatedPartitions | ReplicaManager | = 0 | Reduced replication fault tolerance | §4 |
| OfflineLogDirectoryCount | LogManager | = 0 | Disk/mount failure | §7 |
| IsrShrinksPerSec | ReplicaManager | normally 0 | Slow follower / GC / disk | §8 |
| IsrExpandsPerSec | ReplicaManager | normally 0 | Flapping (alongside shrink) | §8 |
| RequestHandlerAvgIdlePercent | KafkaRequestHandlerPool | > 0.3 | I/O thread saturation | §9 |
| NetworkProcessorAvgIdlePercent | SocketServer | > 0.3 | Network thread saturation | §9 |
| TotalTimeMs (Produce/Fetch p99) | RequestMetrics | stable baseline | Request latency | §9 |
| records-lag-max | Consumer client | flat trend | Consumer lag | §3 |
| BytesInPerSec / BytesOutPerSec | BrokerTopicMetrics | vs. baseline | Sudden traffic change (capacity) | §7 |
Absolute thresholds vary per cluster. Alert on counter-type metrics (Offline/URP/Controller) by exact value (0 or 1), and on rate/gauge-type metrics (IdlePercent/p99/lag) by deviation from baseline — this cuts false positives.
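The two alerting styles can be sketched side by side. This is a minimal illustration, not a recommendation for a specific anomaly model: the function names are hypothetical, and "deviation from baseline" is implemented here as a plain k-sigma rule over recent history.

```python
from statistics import mean, pstdev
from typing import Sequence


def exact_value_alert(value: int, healthy: int) -> bool:
    """For OfflinePartitionsCount, URP, ActiveControllerCount sum: any mismatch alerts."""
    return value != healthy


def deviates_from_baseline(history: Sequence[float], current: float, k: float = 3.0) -> bool:
    """For idle ratios, p99 latency, lag: alert when far from the recent norm."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return current != baseline  # perfectly flat history: any change stands out
    return abs(current - baseline) > k * spread


p99_history = [10, 12, 11, 9, 10, 11, 10, 12]  # example p99 values in ms
print(exact_value_alert(value=2, healthy=1))
print(deviates_from_baseline(p99_history, current=40))
print(deviates_from_baseline(p99_history, current=11))
```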
11. CLI cheat sheet
The commands you'll reach for most in the trenches. Don't confuse --bootstrap-server (brokers) with --bootstrap-controller (KRaft controllers).
# ── Topic / replication state ───────────────────
# Print only under-replicated partitions
kafka-topics.sh --bootstrap-server broker:9092 --describe --under-replicated-partitions
# Partitions below min ISR
kafka-topics.sh --bootstrap-server broker:9092 --describe --under-min-isr-partitions
# Leaderless (offline) partitions
kafka-topics.sh --bootstrap-server broker:9092 --describe --unavailable-partitions
# ── Consumer groups / lag ───────────────────────
kafka-consumer-groups.sh --bootstrap-server broker:9092 --list
kafka-consumer-groups.sh --bootstrap-server broker:9092 --group payment-consumer --describe
# ── Disk / log directories ──────────────────────
kafka-log-dirs.sh --bootstrap-server broker:9092 --describe --broker-list 1,2,3
# ── Partition reassignment (lost / skewed brokers) ──
# 1) Generate a reassignment plan
kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
--topics-to-move-json-file topics.json --broker-list "1,2,3" --generate
# 2) Execute
kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
--reassignment-json-file reassignment.json --execute --throttle 50000000
# 3) Check progress
kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
--reassignment-json-file reassignment.json --verify
# ── Preferred leader election (rebalance after broker returns) ──
kafka-leader-election.sh --bootstrap-server broker:9092 --election-type preferred --all-topic-partitions
# ── Controller / metadata quorum (KRaft) ────────
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --status
kafka-metadata-quorum.sh --bootstrap-controller controller:9093 describe --replication
# ── Dynamic config changes (retention / unclean election etc.) ──
kafka-configs.sh --bootstrap-server broker:9092 --entity-type topics --entity-name orders --describe
Throttle replication bandwidth with --throttle during reassignment. Running a large reassignment without a throttle lets replication traffic crowd out normal produce/fetch and inflicts §9 request latency on yourself.
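Two small helpers go well with the reassignment commands: one writes the topics.json that --generate expects, the other gives a rough feel for how long a throttled move will take. The function names are hypothetical. The estimate assumes, pessimistically, that a single throttled broker carries all of the data, so treat it as an order-of-magnitude figure only.

```python
import json
from typing import List


def topics_to_move(topics: List[str]) -> str:
    """Build the JSON document passed via --topics-to-move-json-file."""
    return json.dumps({"version": 1, "topics": [{"topic": name} for name in topics]}, indent=2)


def reassignment_seconds(bytes_to_move: int, throttle_bytes_per_sec: int) -> float:
    """Rough duration if one broker's throttled replication carries everything."""
    if throttle_bytes_per_sec <= 0:
        raise ValueError("throttle must be positive")
    return bytes_to_move / throttle_bytes_per_sec


print(topics_to_move(["orders"]))
# 500 GB at --throttle 50000000 (50 MB/s), expressed in hours:
print(reassignment_seconds(500 * 10**9, 50_000_000) / 3600)
```

With the cheat sheet's throttle of 50 MB/s, 500 GB works out to 10,000 seconds, a little under three hours. Running the numbers first helps you pick a throttle that finishes in an acceptable window without starving client traffic.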
12. Series map — which part to open when
This runbook is the entry point. Once you've narrowed the cause from the symptom, dive into the relevant part for deeper diagnosis and tuning.
| Symptom / topic | Section here | Part to go deeper |
|---|---|---|
| Consumer mechanics / offsets | §3 | Parts 1-2 (consumer / offset management) |
| Advanced consumer-lag monitoring | §3 | Part 6 (lag monitoring / alerting) |
| ISR / unclean leader election semantics | §5/§8 | Part 5 (replication / ISR / durability) |
| Disk / retention / storage | §7 | Part 7 (storage operations) |
| Broker / network tuning | §4/§9 | (Broker tuning part) |
Wrapping up
- Incident response starts from the symptom, not a guess. Pin the master tree from §2 next to your pager.
- The branch order is the priority order. Check by availability impact: offline partitions → controller → disk → URP → ISR → latency → lag.
- When several symptoms appear at once, it's usually one upstream cause (e.g. a broker down) cascading. Clear it from the top of the tree.
- Alert on counter-type metrics (Offline/URP/Controller) by exact value, and on rate/gauge types by deviation from baseline.
- Unclean leader election and un-throttled reassignment are double-edged swords. Don't forget to revert after recovery and record it in the postmortem.
- A good runbook is never written once. Update the tree, thresholds, and cheat sheet after every incident to make the next 3 a.m. a little easier.
— The Data Dynamics Engineering Team