[Kafka Ops 7] Disk Full — Emergency Response and Prevention for Broker Storage
From diagnosing the symptoms of a full Kafka broker disk, to a safe ordered emergency response, to the capacity-planning, retention, and monitoring strategies that prevent it from happening again — written as an operational runbook.
3 AM, the pager goes off. Alongside an alert that "producers can't send messages," one Kafka broker shows an entire log directory marked offline. Disk usage: 100%. The most dangerous move in this moment is "we're in a hurry, let's just delete a few segment files." That single rm corrupts the log and breaks offsets, turning a disk-full incident into an outage that lasts far longer than the full disk itself would have. This article is the runbook for that 3 AM.
What you'll learn in this post
- The symptoms a broker emits when its disk fills up, and what they mean
- The root causes that fill the disk: retention, traffic spikes, compaction lag, and more
- An ordered emergency procedure that recovers the disk without data loss
- The one thing you must never do: manually delete segment files
- The capacity planning, quotas, and monitoring that keep that 3 AM from recurring
1. Symptoms — How a Full Disk Reports Itself
A full disk doesn't arrive quietly. Kafka screams from several layers at once. The first step is to identify precisely what you're looking at.
Signals the broker sends
| Symptom | Where you see it | What it means |
|---|---|---|
| Broker shutdown or log dir offline | Broker log, kafka.log.LogManager | Can no longer write to a log.dir, so that directory is taken offline |
| KafkaStorageException | Broker log, producer responses | Disk I/O failure. Cannot flush or roll segments |
| Produce requests fail | Producer client | NotEnoughReplicasException, timeouts, backpressure |
| Partitions go offline | kafka-topics.sh --describe, controller log | Partition lost its leader. Leader: none |
| Under-replicated partitions rising | JMX UnderReplicatedPartitions | Followers can't keep up. Replicas insufficient |
The key point: a broker goes offline at the log-directory level. In a JBOD (multiple disks) setup, one full disk takes only that directory's partitions offline while partitions on other disks keep working. In a single-disk setup, the broker is effectively paralyzed.
Quick symptom triage
# 1. Check the broker log for storage exceptions
grep -i "KafkaStorageException\|offline\|No space left" \
/var/log/kafka/server.log | tail -50
# 2. OS-level disk usage
df -h /data/kafka
# 3. Tally offline / under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --under-replicated-partitions
kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --unavailable-partitions
If you see No space left on device and df is near 100%, the diagnosis is done. Now move to why it filled up and how to recover.
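The grep in the triage above can also be expressed as a small script when the same check has to run across many brokers. This is a minimal sketch: the function names, the 99% cutoff, and the sample lines are illustrative assumptions, and the sample lines are synthetic rather than verbatim broker output.

```python
import re
from collections import Counter

# Signatures taken from the triage grep above; matched case-insensitively.
SIGNATURES = {
    "storage_exception": re.compile(r"KafkaStorageException", re.IGNORECASE),
    "no_space": re.compile(r"No space left", re.IGNORECASE),
    "offline": re.compile(r"offline", re.IGNORECASE),
}


def count_signatures(lines):
    """Count how many lines match each disk-full signature."""
    counts = Counter({name: 0 for name in SIGNATURES})
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def looks_like_disk_full(counts, disk_used_percent, threshold=99.0):
    """Hypothetical rule: 'No space left' in the log plus disk usage near 100%."""
    return counts["no_space"] > 0 and disk_used_percent >= threshold


if __name__ == "__main__":
    # Synthetic lines for illustration only, not real broker output.
    sample = [
        "synthetic: java.io.IOException: No space left on device",
        "synthetic: KafkaStorageException while appending records",
        "synthetic: unrelated informational line",
    ]
    counts = count_signatures(sample)
    print(dict(counts), looks_like_disk_full(counts, 100.0))
```

Feeding it the tail of server.log and the percentage from df gives the same yes/no answer as the manual check, which makes it easy to wire into an existing alert hook.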
2. Root Causes — Why the Disk Filled Up
Before jumping into emergency response, forming a hypothesis about the cause determines the recovery strategy. The same "disk full" calls for a different knob depending on the cause.
Common causes
| Cause | Mechanism | Clue |
|---|---|---|
| Retention too long | retention.ms / retention.bytes oversized vs. disk | Topic size exceeds plan, old segments lingering |
| Traffic spike | Sudden ingest growth inflates data within the retention window | BytesIn graph spikes |
| Stuck consumer | Retention ignores consumer progress, so a stuck consumer holds nothing by itself; the disk grows when retention is extended to wait for it | Consumer lag spikes alongside disk |
| Compaction lag | log.cleaner can't keep up, dirty segments accumulate | Abnormal compacted-topic size, too few cleaner threads |
| Replication catch-up after rejoin | A returning broker receives data all at once, doubling usage | Only one broker's disk spikes, right after rebalance |
| Oversized segments | Large segment.bytes means segments don't close, so they never become deletion candidates | Active segment is huge, no deletion happening |
A frequently missed trap: retention only deletes closed segments
Kafka deletes at the segment-file granularity. The active segment currently being written is never a deletion candidate. Only after a segment rolls and closes — via segment.bytes (default 1 GB) or segment.ms — does that closed segment become eligible for deletion under retention.ms / retention.bytes.
On top of that, the actual deletion is performed by a background thread that runs every log.retention.check.interval.ms (default 5 minutes). In other words, lowering retention does not free the disk instantly. Not knowing this delay wastes time on "I changed the setting, why isn't it shrinking?"
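The closed-segment rule is easier to reason about with a toy model. The sketch below is a simplified illustration of the behavior described above, not Kafka's implementation: the Segment class, its fields, and the function name are hypothetical, and it ignores the check interval and any other broker-side details.

```python
from dataclasses import dataclass

GIB = 1024 ** 3
HOUR_MS = 3_600_000


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    newest_record_ms: int  # timestamp of the newest record in the segment
    active: bool = False   # True for the segment currently being written


def deletable_segments(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return the closed segments a retention pass would remove (simplified).

    segments must be ordered oldest first. The active segment is never returned.
    """
    remaining = list(segments)
    doomed = []
    # Time-based pass: a closed segment expires once its newest record is too old.
    if retention_ms is not None:
        for seg in list(remaining):
            if not seg.active and now_ms - seg.newest_record_ms > retention_ms:
                doomed.append(seg)
                remaining.remove(seg)
    # Size-based pass: drop oldest closed segments while the log stays at or above the cap.
    if retention_bytes is not None:
        excess = sum(s.size_bytes for s in remaining) - retention_bytes
        for seg in list(remaining):
            if seg.active or excess - seg.size_bytes < 0:
                break
            doomed.append(seg)
            excess -= seg.size_bytes
    return doomed


if __name__ == "__main__":
    now = 100 * HOUR_MS
    log = [
        Segment(0, GIB, now - 8 * HOUR_MS),
        Segment(1000, GIB, now - 2 * HOUR_MS),
        Segment(2000, 5 * GIB, now, active=True),
    ]
    print([s.base_offset for s in deletable_segments(log, now, retention_ms=6 * HOUR_MS)])
    print([s.base_offset for s in deletable_segments(log, now, retention_bytes=GIB)])
```

In the second call the two closed segments go, but the 5 GiB active segment stays, so the log remains far above the 1 GiB cap. That is the "oversized segment" trap in miniature.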
When compaction is the cause
A compacted topic (cleanup.policy=compact) keeps only the latest value per key rather than expiring data by time. If the cleaner falls behind, either because the log never reaches min.cleanable.dirty.ratio (default 0.5) or because log.cleaner.threads is too low, the dirty region keeps growing. In this case lowering retention won't shrink the topic; cleaner tuning is required.
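To see why a log can sit below the threshold indefinitely, it helps to compute the ratio by hand. The sketch below assumes the dirty ratio is dirty bytes divided by total (clean plus dirty) bytes and that a log qualifies once the ratio exceeds the threshold; the function names are hypothetical and the exact comparison Kafka uses should be checked against your version.

```python
GIB = 1024 ** 3


def dirty_ratio(clean_bytes, dirty_bytes):
    """Fraction of the log that has not been compacted yet."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total else 0.0


def is_cleanable(clean_bytes, dirty_bytes, min_cleanable_dirty_ratio=0.5):
    """Assumed rule: eligible once the dirty ratio exceeds the threshold."""
    return dirty_ratio(clean_bytes, dirty_bytes) > min_cleanable_dirty_ratio


if __name__ == "__main__":
    clean, dirty = 8 * GIB, 2 * GIB
    print(dirty_ratio(clean, dirty))          # ratio for an 8 GiB clean / 2 GiB dirty log
    print(is_cleanable(clean, dirty))         # with the default 0.5 threshold
    print(is_cleanable(clean, dirty, 0.1))    # with a lowered threshold
```

A log with a large clean section needs a proportionally large dirty section before it qualifies, which is why a lower per-topic ratio is one of the tuning levers for big compacted topics.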
# Check whether compaction is stalled — cleaner status
grep -i "LogCleaner\|cleaner" /var/log/kafka/log-cleaner.log | tail -30
# JMX: max-clean-time, max-buffer-utilization, dead-thread-count
# kafka.log:type=LogCleanerManager,name=max-dirty-percent
3. Emergency Response — Order Is Everything
WARNING: Never delete segment files manually with rm. If you manually delete .log / .index / .timeindex files because disk space is urgent, the broker's in-memory index and offset metadata fall out of sync and the log is corrupted. In the worst case the partition becomes unrecoverable, and broken consumer offsets escalate into data loss or duplication. Always let Kafka perform deletion via retention. The more urgent the disk recovery, the more this principle matters.
Apply the steps below from the top. Each step is ordered "lowest risk of data loss first."
Step 1 — Find the biggest consumers of disk
# Per-partition size by log directory (values Kafka recognizes)
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
--describe --broker-list 3 \
| python3 -c "import sys,json; \
d=json.loads([l for l in sys.stdin if l.startswith('{')][0]); \
print('\n'.join(str(x) for x in sorted(((p['partition'], p['size']) \
for b in d['brokers'] for ld in b['logDirs'] for p in ld['partitions']), \
key=lambda x:-x[1])[:20]))"
# OS level — top 20 largest directories on the actual disk
du -h --max-depth=1 /data/kafka | sort -rh | head -20
kafka-log-dirs.sh shows per-partition bytes as the broker sees them; du shows actual disk occupancy. Cross-referencing the two narrows down "which topic-partition" is the culprit within minutes.
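When the question is "which topic" rather than "which partition," the same JSON can be rolled up per topic. The sketch below reads only the fields the one-liner above already uses (brokers, logDirs, partitions, partition, size); the function names and the sample payload are illustrative assumptions.

```python
import json
from collections import defaultdict


def extract_json(raw_output):
    """Return the parsed JSON line from kafka-log-dirs.sh output, skipping other lines."""
    for line in raw_output.splitlines():
        if line.startswith("{"):
            return json.loads(line)
    raise ValueError("no JSON line found in kafka-log-dirs output")


def bytes_per_topic(report):
    """Sum partition sizes per topic across all brokers and log dirs in the report."""
    totals = defaultdict(int)
    for broker in report["brokers"]:
        for log_dir in broker["logDirs"]:
            for part in log_dir["partitions"]:
                # Partition names look like "<topic>-<partition number>".
                topic, _, _ = part["partition"].rpartition("-")
                totals[topic] += part["size"]
    # Largest topic first.
    return sorted(totals.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    # Synthetic payload for illustration; real output comes from kafka-log-dirs.sh.
    raw = (
        "some banner line\n"
        '{"brokers":[{"logDirs":[{"partitions":['
        '{"partition":"events-0","size":100},'
        '{"partition":"events-1","size":200},'
        '{"partition":"audit-log-0","size":50}]}]}]}'
    )
    for topic, size in bytes_per_topic(extract_json(raw)):
        print(topic, size)
```

Splitting on the last hyphen keeps topic names that themselves contain hyphens intact.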
Step 2 — Temporarily lower the culprit topic's retention
This is the fastest and safest recovery lever. If the cause is "retention too long" or "traffic spike," this usually ends it.
# e.g. shrink the events topic retention from 7 days to 6 hours, temporarily
kafka-configs.sh --bootstrap-server localhost:9092 \
--alter --entity-type topics --entity-name events \
--add-config retention.ms=21600000
# or cap by size (50 GB per partition)
kafka-configs.sh --bootstrap-server localhost:9092 \
--alter --entity-type topics --entity-name events \
--add-config retention.bytes=53687091200
It will not shrink immediately after the change. Closed segments become deletion candidates, and the thread running every log.retention.check.interval.ms (default 5 minutes) must run before files actually disappear. If a huge active segment isn't rolling, you can force a roll with a short segment.ms.
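Typing millisecond and byte values by hand at 3 AM is an easy place to slip a digit. A tiny helper like the sketch below reproduces the values used in this runbook; the helper names and the unit letters are assumptions made for illustration.

```python
# Milliseconds per unit, for building retention.ms / segment.ms values.
MS_PER_UNIT = {"m": 60_000, "h": 3_600_000, "d": 86_400_000}


def to_ms(value, unit):
    """Convert a duration such as (6, 'h') to milliseconds."""
    return value * MS_PER_UNIT[unit]


def gib_to_bytes(gib):
    """Convert binary gigabytes to bytes, for retention.bytes."""
    return gib * 1024 ** 3


if __name__ == "__main__":
    print("retention.ms (6 hours):", to_ms(6, "h"))
    print("retention.ms (7 days):", to_ms(7, "d"))
    print("segment.ms (10 minutes):", to_ms(10, "m"))
    print("retention.bytes (50 GB, binary):", gib_to_bytes(50))
```

The results match the literals in the commands of this section (21600000, 604800000, 600000, 53687091200), so the helper doubles as a check on any value you are about to paste.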
# When segments aren't closing and deletion can't happen — force a roll
kafka-configs.sh --bootstrap-server localhost:9092 \
--alter --entity-type topics --entity-name events \
--add-config segment.ms=600000
These are emergency temporary values. Revert them in Step 5 after recovery, or reset to planned values.
Step 3 — Add or expand disk
If lowering retention recovers too little, or you must retain more data, grow the storage itself. Cloud volumes (e.g. EBS) support online expansion, and with JBOD you can add a new log directory to log.dirs.
| Situation | Action |
|---|---|
| Cloud block storage | Expand the volume, then grow the filesystem with resize2fs / xfs_growfs |
| Empty disk exists in JBOD | Add the directory to log.dirs, then restart the broker |
| Imminent disk exhaustion | Buy time with Step 2 first, expand in parallel |
Expanding the disk does not automatically migrate existing partitions to the new disk. Balance is restored only from new partition assignments or reassignment onward.
Step 4 — Move partitions off the hot broker
If only one broker is full (e.g. replication catch-up after a rejoin), reassign partitions to brokers with headroom to spread the load. Always apply a throttle so recovery replication traffic doesn't starve normal traffic.
# 1) Write a move plan (JSON): topics-to-move.json
# {"topics":[{"topic":"events"}],"version":1}
# 2) Generate a reassignment plan
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
--topics-to-move-json-file topics-to-move.json \
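--broker-list "1,2,4,5" --generate | tee generate-output.txt
# The output contains both the current and the proposed assignment, so it is not valid JSON as a whole.
# Save only the JSON under "Proposed partition reassignment configuration" as reassignment.json,
# and keep the current assignment in case you need to roll back.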
# 3) Execute with a throttle (50 MB/s)
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
--reassignment-json-file reassignment.json \
--throttle 52428800 --execute
# 4) Check progress (the throttle is removed once complete)
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
--reassignment-json-file reassignment.json --verify
When --verify reports all partitions "completed successfully," the throttle is automatically removed. The disk may fill further during the move, so it's safer to run Step 4 only after Step 2 has created headroom.
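Before executing a throttled move it is worth estimating how long it will take, since the hot broker stays full until the data has left. The sketch below is a lower-bound estimate under simple assumptions: the throttle is the only bottleneck and no new data arrives during the move. The function name and the 500 GiB figure are hypothetical.

```python
GIB = 1024 ** 3


def estimated_transfer_seconds(bytes_to_move, throttle_bytes_per_sec):
    """Lower bound on reassignment time when the throttle is the bottleneck."""
    if throttle_bytes_per_sec <= 0:
        raise ValueError("throttle must be positive")
    return bytes_to_move / throttle_bytes_per_sec


if __name__ == "__main__":
    throttle = 52428800  # the 50 MB/s throttle used in the command above
    seconds = estimated_transfer_seconds(500 * GIB, throttle)
    print(f"{seconds:.0f} s, about {seconds / 3600:.1f} h")
```

If the estimate is longer than the disk can survive at the current ingest rate, that is a signal to create headroom with Step 2 first or to raise the throttle deliberately.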
Step 5 — Revert and verify after stabilizing
# Revert temporary retention/segment (e.g. back to 7 days)
kafka-configs.sh --bootstrap-server localhost:9092 \
--alter --entity-type topics --entity-name events \
--add-config retention.ms=604800000
# Remove the temporary segment.ms
kafka-configs.sh --bootstrap-server localhost:9092 \
--alter --entity-type topics --entity-name events \
--delete-config segment.ms
# Final check that offline / under-replicated counts are zero
kafka-topics.sh --bootstrap-server localhost:9092 --describe \
--under-replicated-partitions
A log directory that was marked offline is generally not brought back online while the broker keeps running. Once disk space has been freed, restart the broker so that it reloads the directory, after first confirming that other brokers' ISR is sufficient.
4. Emergency Decision Tree
This tree is designed around the principle that "deletion is left to Kafka; humans only turn knobs." No branch falls into a manual rm.
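As a rough companion to the tree, the ordering from Section 3 can be written down as a function. This is a reconstruction from the steps in this article, not the original diagram: the cause labels, flag names, and step wording are all assumptions.

```python
KNOWN_CAUSES = {
    "retention",
    "traffic_spike",
    "stuck_consumer",
    "compaction",
    "replication_catch_up",
    "oversized_segments",
}


def emergency_plan(cause, can_expand_disk=False, single_broker_hot=False):
    """Return the ordered emergency steps for a classified cause (sketch)."""
    if cause not in KNOWN_CAUSES:
        raise ValueError(f"unknown cause: {cause}")
    steps = ["identify the largest partitions (kafka-log-dirs.sh + du)"]
    if cause == "compaction":
        # Retention does not shrink a compacted topic; the cleaner must catch up.
        steps.append("tune the log cleaner")
    elif cause == "oversized_segments":
        steps.append("force a roll with a temporary segment.ms, then lower retention")
    else:
        steps.append("temporarily lower retention on the culprit topic")
    if can_expand_disk:
        steps.append("expand or add disk")
    if single_broker_hot:
        steps.append("reassign partitions with a throttle")
    steps.append("revert temporary config and verify")
    return steps


if __name__ == "__main__":
    for step in emergency_plan("traffic_spike", single_broker_hot=True):
        print("-", step)
```

Every branch ends in a configuration change or a Kafka tool invocation; none of them touches segment files directly.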
5. Prevention — So That 3 AM Never Repeats
Emergency response puts out the fire; prevention keeps it from starting. Bake the following into your operational standards.
Disk-usage alerts at 75/85/90%, not 100%
If the alert arrives at 100%, it's already an incident. You need staged alerts well below 100% to give humans time to intervene.
| Threshold | Alert level | Recommended action |
|---|---|---|
| 75% | Warning | Begin capacity / retention review |
| 85% | High | Execute retention adjustment or expansion plan |
| 90% | Critical | Begin emergency response (Section 3) immediately |
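The staged thresholds in the table translate directly into a small check. The sketch below uses the standard-library shutil.disk_usage to read usage for a path; the function names are hypothetical, and the percentage may differ slightly from what df reports because of filesystem reserved blocks.

```python
import shutil

# Thresholds from the table above, highest first.
THRESHOLDS = [(90.0, "Critical"), (85.0, "High"), (75.0, "Warning")]


def alert_level(used_percent):
    """Map a disk usage percentage to the staged alert level, or None below 75%."""
    for threshold, level in THRESHOLDS:
        if used_percent >= threshold:
            return level
    return None


def used_percent(path):
    """Percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


if __name__ == "__main__":
    pct = used_percent(".")  # replace with the Kafka log directory, e.g. /data/kafka
    print(f"{pct:.1f}% used -> {alert_level(pct)}")
```

Checking the highest threshold first means a disk at 92% reports Critical once instead of firing all three levels.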
Capacity planning: compute the bytes to retain first
A topic's disk footprint is estimated roughly as:
disk per partition ≈ bytes ingested per hour × retention hours × replication factor
÷ partition count × (1 + headroom ratio)
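Using this to set retention.bytes comfortably below the physical disk means that even when traffic spikes, the size-based cap protects the disk before time-based retention breaks down. Note that retention.bytes is enforced on each partition replica's log, so divide the estimate by the replication factor before using it as the cap. Applying both time and size retention together is the safe choice.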
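A worked example makes the formula concrete. The numbers below (10 GiB per hour, 7 days, replication factor 3, 12 partitions, 30% headroom) are made up for illustration, and the function names are hypothetical. Because retention.bytes applies to each partition's log on each broker, the sketch also reports the per-replica share.

```python
GIB = 1024 ** 3


def disk_per_partition(bytes_per_hour, retention_hours, replication_factor,
                       partition_count, headroom_ratio):
    """Estimate from the formula above: total bytes per partition across all replicas."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    return (bytes_per_hour * retention_hours * replication_factor
            / partition_count * (1 + headroom_ratio))


def per_replica(total_per_partition, replication_factor):
    """Share of the estimate that lands on a single replica of the partition."""
    return total_per_partition / replication_factor


if __name__ == "__main__":
    total = disk_per_partition(10 * GIB, 7 * 24, 3, 12, 0.3)
    print(f"all replicas of one partition: {total / GIB:.0f} GiB")
    print(f"one replica: {per_replica(total, 3) / GIB:.0f} GiB")
```

Multiplying the per-replica figure by the number of partition replicas hosted on a broker gives the footprint to compare against that broker's physical disk.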
Per-topic retention review
| Check item | Question |
|---|---|
| Excessive retention | "Do we really need to keep this topic for N days?" |
| Dependence on stuck consumers | "Did we inflate retention because of a slow consumer?" |
| Compacted topics | "Is the cleaner keeping up with ingest?" |
| Segment settings | "Is segment.bytes oversized relative to the disk?" |
Producer/consumer quotas
To stop a runaway producer from filling the disk, apply byte-rate quotas per client/user.
# Limit producer byte rate (10 MB/s) — per client-id
kafka-configs.sh --bootstrap-server localhost:9092 \
--alter --add-config 'producer_byte_rate=10485760' \
--entity-type clients --entity-name ingest-app
Continuously monitor the key metrics
| Metric | What it watches | Alert threshold |
|---|---|---|
| log.dirs free space (OS) | Actual disk headroom | < 15% |
| UnderReplicatedPartitions (JMX) | Replication health | > 0 sustained |
| OfflineLogDirectoryCount | Number of offline log dirs | > 0 |
| BytesInPerSec | Ingest trend (spike detection) | Spike vs. baseline |
| max-dirty-percent (LogCleaner) | Compaction backlog | Sustained rise |
| Consumer lag | Detect stuck consumers | Spike |
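"Spike vs. baseline" needs a concrete rule before it can be alerted on. One simple option, shown as a sketch below, compares the current BytesInPerSec sample against a multiple of the median of recent samples. The factor of 3 and the function name are assumptions to tune against your own traffic, not recommended values.

```python
from statistics import median


def is_spike(history, current, factor=3.0):
    """True when current exceeds factor times the median of recent samples."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history)
    return baseline > 0 and current > factor * baseline


if __name__ == "__main__":
    recent = [10e6, 12e6, 11e6, 9e6, 10e6]  # hypothetical bytes/sec samples
    print(is_spike(recent, 40e6))
    print(is_spike(recent, 25e6))
```

The median is used instead of the mean so that one earlier outlier in the window does not inflate the baseline and hide the next spike.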
6. Operational Checklist
During emergency response (Section 3)
- Confirm symptoms: df, KafkaStorageException / No space left in broker log
- Identify the largest topic with kafka-log-dirs.sh + du
- Classify the cause: retention / traffic / compaction / replication catch-up / segment
- Temporarily shrink target topic retention (aware of the 5-minute check cycle)
- If needed, expand disk or reassign partitions with a throttle
- Never: manually rm segment files
- After stabilizing, revert temporary config + --verify
Prevention (ongoing)
- Staged disk alerts at 75/85/90%
- Per-topic time + size retention, with the size cap below the physical disk
- Producer/consumer byte quotas
- Continuous monitoring of UnderReplicatedPartitions, OfflineLogDirectoryCount
- Quarterly topic retention review
The post-incident cleanup and postmortem write-up are covered in more detail in Part 9 (Incident Runbook) of this series. Disk-full enters Part 9's runbook as one scenario, sharing the same "detect → classify → respond → verify → retrospect" skeleton.
Wrapping up
- A full disk's first report is a log directory going offline and KafkaStorageException. With JBOD, the paralysis is partial: per disk.
- There is never just one cause. Retention too long, traffic spike, stuck consumer, compaction lag, replication catch-up, oversized segments: the knob you turn depends on the cause.
- In emergency response, order is everything: identify the culprit → temporarily shrink retention → expand disk → reassign with a throttle → revert.
- The single most important line: do not rm segment files directly. Letting Kafka delete via retention is the fastest and safest path.
- Prevention must work before 100%. Staged alerts, size-capped retention, quotas, and monitoring of UnderReplicatedPartitions and free space.
- The best way to reduce 3 AM pages is the daytime alert you get when the disk hits 75%.
References
- Apache Kafka Documentation, Topic Configs (retention.ms, retention.bytes, segment.bytes, min.cleanable.dirty.ratio): https://kafka.apache.org/documentation/#topicconfigs
- Apache Kafka Documentation, Broker Configs (log.retention.check.interval.ms, log.dirs, log.cleaner.threads): https://kafka.apache.org/documentation/#brokerconfigs
- Apache Kafka Operations: Datacenters, Quotas, Reassignment: https://kafka.apache.org/documentation/#operations
- kafka-log-dirs.sh, kafka-reassign-partitions.sh, kafka-configs.sh (Kafka distribution bin/ tools)
— The Data Dynamics Engineering Team