[Kafka Ops 7] Disk Full — Emergency Response and Prevention for Broker Storage

From diagnosing the symptoms of a full Kafka broker disk, to a safe ordered emergency response, to the capacity-planning, retention, and monitoring strategies that prevent it from happening again — written as an operational runbook.

Data Dynamics | June 6, 2026 | 13 min read

3 AM, the pager goes off. Alongside an alert that "producers can't send messages," one Kafka broker shows an entire log directory marked offline. Disk usage: 100%. The most dangerous move in this moment is "we're in a hurry, let's just delete a few segment files." That single rm can corrupt the log and break offsets, turning a disk-full incident into an outage that lasts far longer than the full disk itself would have. This article is the runbook for that 3 AM.

What you'll learn in this post

The symptoms a broker emits when its disk fills up, and what they mean

The root causes that fill the disk: retention, traffic spikes, compaction lag, and more

An ordered emergency procedure that recovers the disk without corrupting the log

The one thing you must never do: manually delete segment files

The capacity planning, quotas, and monitoring that keep that 3 AM from recurring

1. Symptoms — How a Full Disk Reports Itself

A full disk doesn't arrive quietly. Kafka screams from several layers at once. The first step is to identify precisely what you're looking at.

Signals the broker sends

Symptom	Where you see it	What it means
Broker shutdown or log dir offline	Broker log, `kafka.log.LogManager`	Can no longer write to a `log.dir`, so that directory is taken offline
`KafkaStorageException`	Broker log, producer responses	Disk I/O failure. Cannot flush or roll segments
Produce requests fail	Producer client	`NotEnoughReplicasException`, timeouts, backpressure
Partitions go offline	`kafka-topics.sh --describe`, controller log	Partition lost its leader. `Leader: none`
Under-replicated partitions rising	JMX `UnderReplicatedPartitions`	Followers can't keep up. Replicas insufficient

The key point: a broker goes offline at the log-directory level. In a JBOD (multiple disks) setup, one full disk takes only that directory's partitions offline while partitions on other disks keep working. In a single-disk setup, the broker is effectively paralyzed.

Quick symptom triage

# 1. Check the broker log for storage exceptions
grep -i "KafkaStorageException\|offline\|No space left" \
  /var/log/kafka/server.log | tail -50
 
# 2. OS-level disk usage
df -h /data/kafka
 
# 3. Tally offline / under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
 
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions

If you see No space left on device and df is near 100%, the diagnosis is done. Now move to why it filled up and how to recover.

2. Root Causes — Why the Disk Filled Up

Before jumping into emergency response, form a hypothesis about the cause, because the cause determines the recovery strategy. The same "disk full" calls for a different knob depending on why it happened.

Common causes

Cause	Mechanism	Clue
Retention too long	`retention.ms` / `retention.bytes` oversized vs. disk	Topic size exceeds plan, old segments lingering
Traffic spike	Sudden ingest growth inflates data within the retention window	BytesIn graph spikes
Stuck consumer	Kafka retains by time and size regardless of consumer progress, so a stuck consumer tempts operators to inflate retention while it catches up	Consumer lag spikes alongside disk
Compaction lag	`log.cleaner` can't keep up, dirty segments accumulate	Abnormal compacted-topic size, too few cleaner threads
Replication catch-up after rejoin	A returning broker re-fetches its replicas all at once, so its usage jumps	Only one broker's disk spikes, right after a restart or reassignment
Oversized segments	A large `segment.bytes` delays rolling, and a segment that has not rolled is not a deletion candidate	Active segment is huge, no deletion happening

A frequently missed trap: retention only deletes closed segments

Kafka deletes at the segment-file granularity. The active segment currently being written is never a deletion candidate. Only after a segment rolls and closes — via segment.bytes (default 1 GB) or segment.ms — does that closed segment become eligible for deletion under retention.ms / retention.bytes.

On top of that, the actual deletion is performed by a background thread that runs every log.retention.check.interval.ms (default 5 minutes). In other words, lowering retention does not free the disk instantly. Not knowing this delay wastes time on "I changed the setting, why isn't it shrinking?"
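
The eligibility rule above can be sketched in a few lines of Python. This is a deliberately simplified model, not Kafka's implementation: `Segment` and `deletable_segments` are hypothetical names, only time-based retention is modelled, and the periodic check thread is left out.

```python
from dataclasses import dataclass
from typing import List

HOUR_MS = 60 * 60 * 1000


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    newest_record_ms: int  # timestamp of the newest record in the segment
    active: bool = False   # True for the segment currently being written


def deletable_segments(segments: List[Segment], now_ms: int, retention_ms: int) -> List[Segment]:
    """Closed segments whose newest record is older than retention_ms."""
    return [
        s for s in segments
        if not s.active and now_ms - s.newest_record_ms > retention_ms
    ]


now = 100 * HOUR_MS
log = [
    Segment(0, 1 << 30, newest_record_ms=now - 30 * HOUR_MS),
    Segment(1000, 1 << 30, newest_record_ms=now - 10 * HOUR_MS),
    Segment(2000, 512 << 20, newest_record_ms=now, active=True),
]

# Under 7-day retention nothing qualifies; under 6-hour retention both closed
# segments do, while the active segment is skipped in either case.
for retention_hours in (7 * 24, 6):
    eligible = deletable_segments(log, now, retention_hours * HOUR_MS)
    print(retention_hours, [s.base_offset for s in eligible])
```

Even in this toy model, the space only comes back once the closed segments are old enough and the next retention check has run.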

When compaction is the cause

A compacted topic (cleanup.policy=compact) retains the latest value per key rather than expiring data by time. Its dirty (not yet compacted) region keeps growing in two situations: the log stays below min.cleanable.dirty.ratio (default 0.5) and is therefore never picked up for cleaning, or log.cleaner.threads is too low for the cleaner to keep up with ingest. In either case lowering retention won't shrink the topic; cleaner tuning is required.

# Check whether compaction is stalled — cleaner status
grep -i "LogCleaner\|cleaner" /var/log/kafka/log-cleaner.log | tail -30
 
# JMX (kafka.log:type=LogCleaner): max-clean-time-secs, max-buffer-utilization-percent, DeadThreadCount
# JMX: kafka.log:type=LogCleanerManager,name=max-dirty-percent
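
To see why a large dirty region can sit untouched, here is a minimal sketch of the dirty-ratio threshold described above. The function names and the 60/40 GiB partition are hypothetical, and the strict "greater than" comparison is an assumption about how the threshold is applied.

```python
def dirty_ratio(clean_bytes: int, dirty_bytes: int) -> float:
    """Fraction of the log that has not been compacted yet."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total else 0.0


def is_cleanable(clean_bytes: int, dirty_bytes: int,
                 min_cleanable_dirty_ratio: float = 0.5) -> bool:
    """True when the dirty ratio exceeds the configured threshold."""
    return dirty_ratio(clean_bytes, dirty_bytes) > min_cleanable_dirty_ratio


GIB = 1024 ** 3
# Hypothetical partition: 60 GiB already compacted, 40 GiB dirty.
print(dirty_ratio(60 * GIB, 40 * GIB))
print(is_cleanable(60 * GIB, 40 * GIB))
print(is_cleanable(60 * GIB, 40 * GIB, min_cleanable_dirty_ratio=0.3))
```

In this example 40 GiB of dirty data is not enough to trigger cleaning at the default threshold, which is one way a compacted topic can grow without any retention setting being wrong.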

3. Emergency Response — Order Is Everything

WARNING: Never delete segment files manually with rm. If you delete .log / .index / .timeindex files by hand because disk space is urgent, the files on disk no longer match the broker's in-memory view of the log (segment list, indexes, offsets), and the log is corrupted. In the worst case the partition becomes unrecoverable, and broken consumer offsets escalate into data loss or duplication. Always let Kafka perform deletion via retention. The more urgent the disk recovery, the more this principle matters.

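Apply the steps below from the top. They are ordered so that the fastest, least invasive lever comes first. Note that Step 2 deliberately discards the oldest data, so confirm that consumers no longer need it before shrinking retention.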

Step 1 — Find the biggest consumers of disk

# Per-partition size by log directory, as reported by the broker (top 20)
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --describe --broker-list 3 \
  | python3 -c '
import json, sys

# kafka-log-dirs.sh prints status lines first; keep only the JSON line.
doc = json.loads(next(line for line in sys.stdin if line.startswith("{")))
sizes = [(p["partition"], p["size"])
         for b in doc["brokers"] for ld in b["logDirs"] for p in ld["partitions"]]
for name, size in sorted(sizes, key=lambda x: -x[1])[:20]:
    print(f"{size / 2**30:8.2f} GiB  {name}")
'
 
# OS level — top 20 largest directories on the actual disk
du -h --max-depth=1 /data/kafka | sort -rh | head -20

kafka-log-dirs.sh shows per-partition bytes as the broker sees them; du shows actual disk occupancy. Cross-referencing the two narrows down "which topic-partition" is the culprit within minutes.
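
When a topic has many partitions, a per-topic total is often easier to act on than a per-partition list. The sketch below sums sizes per topic from the same JSON structure; the sample document and the `topic_sizes` helper are hypothetical, and the sizes are made-up small numbers.

```python
import json
from collections import defaultdict

# Hypothetical sample, shaped like the JSON line that kafka-log-dirs.sh prints.
SAMPLE = """
{"version": 1, "brokers": [{"broker": 3, "logDirs": [{"logDir": "/data/kafka",
 "error": null, "partitions": [
  {"partition": "events-0", "size": 700, "offsetLag": 0, "isFuture": false},
  {"partition": "events-1", "size": 500, "offsetLag": 0, "isFuture": false},
  {"partition": "user-audit-0", "size": 300, "offsetLag": 0, "isFuture": false}
 ]}]}]}
"""


def topic_sizes(doc: dict) -> dict:
    """Sum partition sizes per topic across all brokers and log directories."""
    totals = defaultdict(int)
    for broker in doc["brokers"]:
        for log_dir in broker["logDirs"]:
            for part in log_dir["partitions"]:
                # Topic names may contain hyphens, so split on the last one only.
                topic, _, _partition_id = part["partition"].rpartition("-")
                totals[topic] += part["size"]
    return dict(totals)


ranked = sorted(topic_sizes(json.loads(SAMPLE)).items(), key=lambda kv: -kv[1])
for topic, size in ranked:
    print(f"{size:>12}  {topic}")
```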

Step 2 — Temporarily lower the culprit topic's retention

This is the fastest and safest recovery lever. If the cause is "retention too long" or "traffic spike," this usually ends it.

# e.g. shrink the events topic retention from 7 days to 6 hours, temporarily
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name events \
  --add-config retention.ms=21600000
 
# or cap by size (50 GB per partition)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name events \
  --add-config retention.bytes=53687091200

It will not shrink immediately after the change. Closed segments become deletion candidates, and the thread running every log.retention.check.interval.ms (default 5 minutes) must run before files actually disappear. If a huge active segment isn't rolling, you can force a roll with a short segment.ms.

# When segments aren't closing and deletion can't happen — force a roll
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name events \
  --add-config segment.ms=600000

These are emergency temporary values. Revert them in Step 5 after recovery, or reset to planned values.
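
Millisecond and byte values are easy to mistype under pressure. As a small sanity check, the values used in these commands can be derived rather than typed from memory; `to_ms` and `gib_to_bytes` are hypothetical helper names.

```python
from datetime import timedelta


def to_ms(**duration) -> int:
    """Convert a duration (keyword arguments as for timedelta) to milliseconds."""
    return timedelta(**duration) // timedelta(milliseconds=1)


def gib_to_bytes(gib: int) -> int:
    """Convert GiB to bytes."""
    return gib * 1024 ** 3


print("retention.ms, 6 hours   =", to_ms(hours=6))
print("retention.ms, 7 days    =", to_ms(days=7))
print("segment.ms, 10 minutes  =", to_ms(minutes=10))
print("retention.bytes, 50 GiB =", gib_to_bytes(50))
```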

Step 3 — Add or expand disk

If lowering retention recovers too little, or you must retain more data, grow the storage itself. Cloud volumes (e.g. EBS) support online expansion, and with JBOD you can add a new log directory to log.dirs.

Situation	Action
Cloud block storage	Expand the volume, then grow the filesystem with `resize2fs` / `xfs_growfs`
Empty disk exists in JBOD	Add the directory to `log.dirs`, then restart the broker
Disk about to fill	Buy time with Step 2 first, expand in parallel

Adding a disk does not automatically migrate existing partitions onto it. Usage evens out only as new partitions are assigned, or after you reassign existing ones (Step 4).

Step 4 — Move partitions off the hot broker

If only one broker is full (e.g. replication catch-up after a rejoin), reassign partitions to brokers with headroom to spread the load. Always apply a throttle so recovery replication traffic doesn't starve normal traffic.

# 1) Write a move plan (JSON): topics-to-move.json
# {"topics":[{"topic":"events"}],"version":1}
 
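# 2) Generate a reassignment plan and keep the full output
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "1,2,4,5" --generate | tee generate-output.txt
# The output contains two JSON documents. Save only the one printed under
# "Proposed partition reassignment configuration" as reassignment.json, and
# keep the "Current partition replica assignment" one as a rollback plan.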
 
# 3) Execute with a throttle (50 MB/s)
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json \
  --throttle 52428800 --execute
 
# 4) Check progress; once the move is complete, this same command removes the throttle
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --verify

Once --verify reports every partition as "completed successfully," that same run removes the throttle. If you never run --verify after completion, the throttle stays in place and keeps limiting replication. The disk may fill further during the move, so it's safer to run Step 4 only after Step 2 has created headroom.
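
Before starting a throttled move, it helps to know roughly how long it will take. The sketch below is a best-case estimate only: it assumes the throttle is the sole bottleneck and ignores new writes that must also be replicated during the move. The 500 GiB figure and the function name are hypothetical.

```python
def reassignment_hours(bytes_to_move: int, throttle_bytes_per_sec: int) -> float:
    """Best-case duration of a throttled reassignment, in hours."""
    if throttle_bytes_per_sec <= 0:
        raise ValueError("throttle must be positive")
    return bytes_to_move / throttle_bytes_per_sec / 3600


GIB = 1024 ** 3
THROTTLE = 52428800  # the 50 MB/s throttle used in the command above

# Hypothetical move of 500 GiB off the hot broker.
print(f"{reassignment_hours(500 * GIB, THROTTLE):.1f} hours")
```

If the estimate is longer than the time the disk has left, that is a signal to create headroom with Step 2 or Step 3 first rather than to raise the throttle blindly.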

Step 5 — Revert and verify after stabilizing

# Revert temporary retention/segment (e.g. back to 7 days)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name events \
  --add-config retention.ms=604800000
 
# Remove the temporary segment.ms
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name events \
  --delete-config segment.ms
 
# Final check that both lists come back empty
kafka-topics.sh --bootstrap-server localhost:9092 --describe \
  --under-replicated-partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe \
  --unavailable-partitions

A log directory that was marked offline stays offline for the life of the broker process, even after disk space is freed. Once there is headroom, restart the broker to bring the directory back online (after first confirming that the other brokers' ISRs are sufficient to absorb the restart).

4. Emergency Decision Tree

[Diagram: emergency decision tree (not rendered in this text version)]

This tree is designed around the principle that "deletion is left to Kafka; humans only turn knobs." No branch falls into a manual rm.
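
As a rough textual stand-in, the cause-to-lever mapping from Sections 2 and 3 can be written as a lookup. This is a sketch reconstructed from the steps in this article, not the original diagram; the enum, the table, and `next_action` are hypothetical names.

```python
from enum import Enum


class Cause(Enum):
    RETENTION_TOO_LONG = "retention too long"
    TRAFFIC_SPIKE = "traffic spike"
    OVERSIZED_SEGMENTS = "oversized segments"
    COMPACTION_LAG = "compaction lag"
    REPLICATION_CATCH_UP = "replication catch-up"


# First knob to turn for each cause, taken from the steps in Section 3.
FIRST_LEVER = {
    Cause.RETENTION_TOO_LONG: "Step 2: temporarily lower retention.ms / retention.bytes",
    Cause.TRAFFIC_SPIKE: "Step 2: temporarily lower retention.ms / retention.bytes",
    Cause.OVERSIZED_SEGMENTS: "Step 2: force a roll with a short segment.ms, then let retention delete",
    Cause.COMPACTION_LAG: "Tune the log cleaner; lowering retention will not help",
    Cause.REPLICATION_CATCH_UP: "Step 4: reassign with a throttle, after Step 2 has made headroom",
}
FALLBACK = "Step 3: add or expand disk"


def next_action(cause: Cause, first_lever_tried: bool = False) -> str:
    """Return the first lever for a cause, or the fallback if it was not enough."""
    return FALLBACK if first_lever_tried else FIRST_LEVER[cause]


for cause in Cause:
    print(f"{cause.value:22} -> {next_action(cause)}")
```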

5. Prevention — So That 3 AM Never Repeats

Emergency response puts out the fire; prevention keeps it from starting. Bake the following into your operational standards.

Disk-usage alerts at 75/85/90%, not 100%

If the alert arrives at 100%, it's already an incident. You need staged alerts well below 100% to give humans time to intervene.

Threshold	Alert level	Recommended action
75%	Warning	Begin capacity / retention review
85%	High	Execute retention adjustment or expansion plan
90%	Critical	Begin emergency response (Section 3) immediately
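
The staged thresholds translate directly into a small check. This is a minimal sketch, assuming the percentages from the table; `alert_level` and `disk_used_percent` are hypothetical helpers, and the demo checks the current directory so it runs anywhere (in production you would point it at the Kafka data path, e.g. /data/kafka).

```python
import shutil

# Thresholds from the table above, checked from most to least severe.
THRESHOLDS = [(90.0, "critical"), (85.0, "high"), (75.0, "warning")]


def alert_level(used_percent: float) -> str:
    """Map a disk-usage percentage to an alert level."""
    for threshold, level in THRESHOLDS:
        if used_percent >= threshold:
            return level
    return "ok"


def disk_used_percent(path: str) -> float:
    """Used space as a percentage of total; may differ slightly from df's Use%."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100 if usage.total else 0.0


print(alert_level(disk_used_percent(".")))
```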

Capacity planning: compute the bytes to retain first

A topic's disk footprint is estimated roughly as:

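bytes per partition replica ≈ (bytes ingested per hour × retention hours ÷ partition count)
                              × (1 + headroom ratio)
topic footprint in the cluster ≈ bytes per partition replica × partition count × replication factor

retention.bytes is enforced per partition, so size it from the first line, and check that the sum over all partition replicas hosted on a disk stays comfortably below that disk's capacity. Then, even when traffic spikes, the size-based cap protects the disk before time-based retention breaks down. Applying both time and size retention together is the safe choice.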
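
A worked example makes the estimate concrete. The sketch below assumes traffic is spread evenly across partitions and that retention.bytes applies to each partition replica, so the replication factor enters only the cluster-wide total. All input numbers and function names are hypothetical.

```python
def planned_bytes_per_partition(ingest_bytes_per_hour: int, retention_hours: int,
                                partition_count: int, headroom_ratio: float) -> int:
    """Bytes one partition replica holds over the retention window, plus headroom."""
    per_partition = ingest_bytes_per_hour * retention_hours / partition_count
    return int(per_partition * (1 + headroom_ratio))


def topic_cluster_footprint(per_partition_bytes: int, partition_count: int,
                            replication_factor: int) -> int:
    """Total bytes the topic occupies across all brokers."""
    return per_partition_bytes * partition_count * replication_factor


GIB = 1024 ** 3
# Hypothetical topic: 36 GiB/hour, 24-hour retention, 12 partitions, 25% headroom, RF 3.
per_partition = planned_bytes_per_partition(36 * GIB, 24, 12, 0.25)
total = topic_cluster_footprint(per_partition, 12, 3)
print(f"per partition replica: {per_partition / GIB:.0f} GiB")
print(f"cluster footprint:     {total / GIB:.0f} GiB")
```

The per-partition figure is a candidate for retention.bytes; the cluster figure is what has to fit across all brokers' disks.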

Per-topic retention review

Check item	Question
Excessive retention	"Do we really need to keep this topic for N days?"
Dependence on stuck consumers	"Did we inflate retention because of a slow consumer?"
Compacted topics	"Is the cleaner keeping up with ingest?"
Segment settings	"Is `segment.bytes` oversized relative to the disk?"

Producer/consumer quotas

To stop a runaway producer from filling the disk, apply byte-rate quotas per client/user.

# Limit producer byte rate (10 MB/s) — per client-id
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --add-config 'producer_byte_rate=10485760' \
  --entity-type clients --entity-name ingest-app
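
A quota also gives you an upper bound to plan against. The sketch below estimates the most a throttled producer can write within a retention window, assuming quotas are enforced per broker and ignoring the extra copies that replication adds. The function name and the four-broker example are hypothetical.

```python
def max_bytes_in_window(quota_bytes_per_sec: int, window_seconds: int,
                        brokers: int = 1) -> int:
    """Upper bound on bytes one client can produce within the window."""
    return quota_bytes_per_sec * window_seconds * brokers


GIB = 1024 ** 3
QUOTA = 10485760          # producer_byte_rate from the command above
SIX_HOURS = 6 * 60 * 60   # seconds

per_broker = max_bytes_in_window(QUOTA, SIX_HOURS)
print(f"per broker: {per_broker / GIB:.1f} GiB")
print(f"4 brokers:  {max_bytes_in_window(QUOTA, SIX_HOURS, brokers=4) / GIB:.1f} GiB")
```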

Continuously monitor the key metrics

Metric	What it watches	Alert threshold
`log.dirs` free space (OS)	Actual disk headroom	< 15%
`UnderReplicatedPartitions` (JMX)	Replication health	> 0 sustained
`OfflineLogDirectoryCount`	Number of offline log dirs	> 0
`BytesInPerSec`	Ingest trend (spike detection)	Spike vs. baseline
`max-dirty-percent` (LogCleaner)	Compaction backlog	Sustained rise
Consumer lag	Detect stuck consumers	Spike
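
"Spike vs. baseline" needs a concrete definition before it can become an alert. One simple option, shown here only as a sketch, is to compare the current rate against a multiple of the recent median. The factor of 3 and the sample values are hypothetical and would need tuning per topic.

```python
from statistics import median
from typing import Sequence


def is_spike(baseline: Sequence[float], current: float, factor: float = 3.0) -> bool:
    """True when the current rate exceeds factor times the baseline median."""
    if not baseline:
        raise ValueError("baseline must contain at least one sample")
    return current > factor * median(baseline)


# Hypothetical BytesInPerSec samples in MiB/s from the last few intervals.
recent = [40.0, 42.0, 38.0, 45.0, 41.0]
print(is_spike(recent, 150.0))
print(is_spike(recent, 60.0))
```

The median is used instead of the mean so that one earlier outlier in the baseline does not mask the next spike.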

6. Operational Checklist

During emergency response (Section 3)

Confirm symptoms: df, KafkaStorageException / No space left in broker log
Identify the largest topic with kafka-log-dirs.sh + du
Classify the cause: retention / traffic / compaction / replication catch-up / segment
Temporarily shrink the target topic's retention (allow for the 5-minute check cycle)
If needed, expand disk or reassign partitions with a throttle
Never: manually rm segment files
After stabilizing, revert temporary config + --verify

Prevention (ongoing)

Staged disk alerts at 75/85/90%
Per-topic time + size retention, with the size cap below the physical disk
Producer/consumer byte quotas
Continuous monitoring of UnderReplicatedPartitions, OfflineLogDirectoryCount
Quarterly topic retention review

The post-incident cleanup and postmortem write-up are covered in more detail in Part 9 (Incident Runbook) of this series. Disk-full enters Part 9's runbook as one scenario, sharing the same "detect → classify → respond → verify → retrospect" skeleton.

Wrapping up

A full disk's first report is a log directory going offline and KafkaStorageException. With JBOD, the paralysis is partial — per disk.
The cause is not always the same. Retention too long, traffic spike, stuck consumer, compaction lag, replication catch-up, oversized segments: the knob you turn depends on the cause.
In emergency response, order is everything: identify the culprit → temporarily shrink retention → expand disk → reassign with a throttle → revert.
The single most important line: do not rm segment files directly. Letting Kafka delete via retention is the fastest and safest path.
Prevention must work before 100%: staged alerts, size-capped retention, quotas, and monitoring of UnderReplicatedPartitions and free space.
The best way to reduce 3 AM pages is the daytime alert you get when the disk hits 75%.

References

Apache Kafka Documentation — Topic Configs (retention.ms, retention.bytes, segment.bytes, min.cleanable.dirty.ratio): https://kafka.apache.org/documentation/#topicconfigs

Apache Kafka Documentation — Broker Configs (log.retention.check.interval.ms, log.dirs, log.cleaner.threads): https://kafka.apache.org/documentation/#brokerconfigs

Apache Kafka Operations — Datacenters, Quotas, Reassignment: https://kafka.apache.org/documentation/#operations

kafka-log-dirs.sh, kafka-reassign-partitions.sh, kafka-configs.sh (Kafka distribution bin/ tools)

— The Data Dynamics Engineering Team