[Kafka Ops 3] Where Messages Get Lost — Failure Scenarios and a No-Loss Configuration
Every path by which a Kafka message can disappear, split across the producer, broker, and consumer, plus a no-loss recipe built from acks=all, RF=3, min.insync.replicas=2, and manual commits after processing.
"We definitely sent it to Kafka, but the message is gone." It is one of the most chilling sentences you can hear in operations. A payment event, an order log, an audit record vanished into thin air and nobody saw an exception — when that happens, the culprit is usually not your code but your configuration. Kafka ships tuned to favor throughput and availability, so if you want zero loss you have to ask for it explicitly. This post walks every junction where a message can disappear, then assembles a configuration that loses nothing, end to end.
What you'll learn in this post
- Whether message loss originates on the producer, broker, or consumer side
- How
acks,retries, andflush()tie into producer-side loss- How
unclean.leader.electionandmin.insync.replicasdecide broker-side loss- The trap of "logical loss" created by auto-commit
- A no-loss recipe (unified producer, broker, and consumer settings)
- The durability vs. availability/throughput trade-off
This is Part 3 of the Kafka Operations Troubleshooting series. Following Part 1 (consumer lag) and Part 2 (rebalancing), this installment is about how not to lose data. The internals of acks and min.insync.replicas get a deeper treatment in Part 4, and the danger of unclean.leader.election in Part 5.
1. The Life of a Message — Loss Can Happen Anywhere
For a single message to survive, it must pass through several gates. The producer sends it, the leader broker receives it, followers replicate it, it gets committed, and a consumer reads and processes it. Break any link in this chain and the message is gone.
Splitting the loss points into three regions:
| Region | Typical cause | Outcome |
|---|---|---|
| Producer | weak acks, retries=0, missing flush() | never reaches the broker |
| Broker | unclean leader election, low min.insync.replicas, RF=1 | committed records disappear |
| Consumer | commit before processing, auto-commit | records skipped (logical loss) |
Now let's dissect each region in turn.
2. Producer-Side Loss — Sent Doesn't Mean Delivered
Calling send() does not mean the message is safely stored on the broker. The outcome depends entirely on your settings.
2.1 acks=0 — Fire and Forget
With acks=0 the producer waits for no response from the broker. The moment the data hits the socket buffer it is considered "successful." If the network drops or the leader is dead, the producer has no idea. You get maximum throughput — the polar opposite of zero loss.
# Never use for no-loss — fastest but most dangerous
acks=02.2 acks=1 — Leader Acks, Then the Leader Dies?
With acks=1 the leader broker returns an ack as soon as it writes to its own log. It does not wait for followers to replicate. That is where the trouble starts.
1. Producer -> leader: record written, ack returned (acks=1)
2. Before any follower replicates, the leader broker crashes
3. The controller elects one of the followers as the new leader
4. The new leader does not have that message -> lost
The producer received an ack, so it moves on as if everything succeeded. This is the most common scenario in which a message already recorded as successful silently disappears.
2.3 retries=0 — Throwing Away Transient Failures
With retries=0, a transient error (leader change in progress, a brief network blip, NotEnoughReplicas) is not retried — it immediately raises an exception or hands an error to the callback. If the caller doesn't handle that error properly, the message simply evaporates. Most transient failures resolve themselves within a few hundred milliseconds; turning off retries throws away that grace period yourself.
// Dangerous: no retries + ignored error
producer.send(record); // never inspects the returned Future -> failures get buried2.4 async send() — When the Process Dies Without flush()
When you call send(), the Kafka producer stacks the message into an in-memory batch buffer (buffer.memory) and a background I/O thread gathers and ships batches. It's fast because it's asynchronous — but anything not yet sent lives only in process memory.
for (Order order : orders) {
producer.send(new ProducerRecord<>("orders", order.id(), order.json()));
}
// What if System.exit() or SIGKILL hits here?
// Whatever remains in the buffer is never sent and is lost wholesaleIf a batch job finishes and the JVM exits without calling flush() or close(), every un-sent message left in the buffer is lost. You must drain the buffer with flush() or close() before exiting.
try {
for (Order order : orders) {
producer.send(record, (meta, ex) -> {
if (ex != null) log.error("send failed", ex); // detect failure in the callback
});
}
producer.flush(); // guarantees all in-flight sends complete
} finally {
producer.close(); // close() flushes internally too
}3. Broker-Side Loss — Even Committed Records Vanish
Sometimes a message disappears even after the producer got an ack and the record was clearly committed. The broker's replication settings are the crux.
3.1 unclean.leader.election.enable=true — Promoting a Lagging Replica
Kafka tracks the set of replicas in sync with the leader as the ISR (In-Sync Replicas). Normally a new leader is chosen only from within the ISR. But if you set unclean.leader.election.enable=true, then when the ISR is empty Kafka will elect even an out-of-sync replica as leader.
1. The leader and its ISR followers go down at once (e.g. a rack power failure)
2. The only survivor is one badly lagging replica (outside the ISR)
3. unclean election is allowed -> that lagging replica becomes the new leader
4. Every "committed" record past that replica's offset is lost
It is an option that sacrifices durability for availability (quickly restoring a writable leader no matter what). If zero loss is the goal, it must be false. The full mechanism is covered in Part 5.
3.2 A Low min.insync.replicas Defeats Even acks=all
acks=all means "wait until every replica in the ISR has the record." But if the ISR has shrunk to just 1, then even with acks=all the ack returns once that single replica has it. If that one dies, you're done.
min.insync.replicas is the floor on how many replicas must be in the ISR for a write to succeed. If the floor isn't met, the broker rejects the write and throws NotEnoughReplicasException (the producer retries).
| Setting | What acks=all actually guarantees |
|---|---|
min.insync.replicas=1 | ack with ISR of 1 -> that one dies, data lost |
min.insync.replicas=2 | ack only once 2+ have it -> safe even if one dies |
acks=all and min.insync.replicas only have teeth when configured as a pair.
3.3 replication.factor=1 — No Replica, No Recovery
With a replication factor (RF) of 1, the partition data exists on a single broker. If that broker's disk corrupts or fails permanently, there is nowhere to recover from. No matter how strong your acks or how high your min.insync.replicas, there are no replicas to lean on. RF=1 is a non-starter for production topics.
4. Consumer-Side Loss — Read, But Not Processed
Consumer-side loss isn't data vanishing from disk. It's logical loss: the consumer records messages it never processed as "processed," so it never reads them again.
4.1 Auto-Commit + Committing Before Processing
With enable.auto.commit=true (the default), the consumer commits the offsets it has polled so far in the background every auto.commit.interval.ms (5 seconds by default). The problem is that the commit happens independently of actual processing completion.
1. poll() -> receives 100 records
2. auto-commit timer fires -> offsets committed up to 100
3. The consumer crashes while processing record 60
4. Restart -> reads from offset 100
5. Records 61-100 were never processed but are skipped forever -> logical loss
Similarly, explicitly committing before processing in your code produces the same result.
// Dangerous: commit first, process later
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
consumer.commitSync(); // committed up front
for (var r : records) process(r); // crash here -> unprocessed records lost4.2 The Safe Pattern — Commit After Processing
The fix is simple: turn off auto-commit and commit manually only after processing fully completes.
props.put("enable.auto.commit", "false"); // disable auto-commit
while (running) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (var r : records) {
process(r); // 1. process first (persist to DB, etc.)
}
consumer.commitSync(); // 2. commit only after processing finishes
}This way, if a crash hits mid-processing the offset was never committed, so on restart the consumer re-reads that batch. Note that this gives an at-least-once guarantee, so duplicate processing can occur. Design your processing logic to be idempotent (upsert by key, dedup by processing ID, etc.) or wrap it in a transaction. "No loss" and "no duplicates" are separate problems; choosing zero loss usually means accepting duplicates.
5. The No-Loss Recipe — A Configuration That Loses Nothing
Here we gather, in one place, the settings that close every loss path above. The key is locking all three regions — producer, broker/topic, and consumer — at once. Miss one and the chain snaps there.
5.1 Producer Settings
# Wait until all ISR members receive it
acks=all
# Idempotent producer: prevents duplicates / reordering from retries
enable.idempotence=true
# Default when idempotence is on. Retry transient failures to the end
retries=2147483647
# Total retry ceiling (retries operate only within this window)
delivery.timeout.ms=120000
# Must be <= 5 for the idempotence guarantee (order preservation)
max.in.flight.requests.per.connection=5Enabling
enable.idempotence=trueautomatically enforces and reconcilesacks=all,retries=Integer.MAX_VALUE, andmax.in.flight<=5. Still, spelling them out makes intent explicit, and if someone accidentally flipsacks=1the conflict surfaces immediately.
5.2 Broker/Topic Settings
# Replicate the partition across 3 brokers (survive 1 broker loss)
replication.factor=3
# A write needs at least 2 ISR members to succeed (pairs with acks=all)
min.insync.replicas=2
# Never elect a lagging replica as leader (durability first)
unclean.leader.election.enable=falsemin.insync.replicas can also be set at the topic level, where the topic setting overrides the broker default. For topics that require zero loss, set it explicitly at the topic level.
5.3 Consumer Settings
# Turn off auto-commit — commit manually after processing
enable.auto.commit=falseAnd follow the pattern from 4.2: commitSync() after processing completes.
5.4 Why RF=3 + min.insync.replicas=2
The most frequently recommended combination is RF=3, min.insync.replicas=2. The reason is that it satisfies two requirements at once.
| Property | Behavior |
|---|---|
| Durability | With 3 replicas and acks=all + min.insync=2, at least 2 always hold the data. Survive a single broker loss. |
| Availability (write) | Even if one broker dies, the remaining 2 satisfy min.insync=2 -> writes keep flowing. |
As a formula, the number of broker losses you can tolerate at once is RF - min.insync.replicas.
- RF=3, min.insync=2 -> writes survive up to 1 loss (the sweet spot)
- RF=3, min.insync=3 -> a single loss halts writes (max durability, sacrifices availability)
- RF=3, min.insync=1 -> max availability but effectively as risky as
acks=1
So RF=3 + min.insync=2 is the reasonable balance between durability and availability: "keep both the data and the writes up to a single broker loss."
5.5 The No-Loss Configuration at a Glance
| Region | Setting | Value | Purpose |
|---|---|---|---|
| Producer | acks | all | wait until all ISR members receive it |
| Producer | enable.idempotence | true | prevent retry duplicates / reordering |
| Producer | retries | 2147483647 | retry transient failures to the end |
| Producer | delivery.timeout.ms | 120000 | total retry ceiling |
| Broker/Topic | replication.factor | 3 | preserve data across a single loss |
| Broker/Topic | min.insync.replicas | 2 | the real guarantee floor for acks=all |
| Broker/Topic | unclean.leader.election.enable | false | forbid lagging-replica leadership |
| Consumer | enable.auto.commit | false | commit manually after processing |
6. The Trade-off — There Is No Free Durability
A no-loss configuration is safe, and it costs you. Choose it knowing exactly what you gain and what you give up.
| Choice | What you gain | What you give up |
|---|---|---|
acks=all | strong durability | higher latency (replication wait), lower throughput |
min.insync.replicas=2 | safe across one loss | writes rejected when ISR drops below 2 |
unclean.election=false | committed records preserved | partition writes/reads halt if the entire ISR is lost |
| manual commit (after processing) | blocks logical loss | duplicates possible -> needs idempotent design |
| RF=3 | recovery headroom | 3x disk and network cost |
The core principle is not to apply zero loss to every topic. Topics that cannot lose a single record — payments, orders, audit logs — go no-loss; topics that tolerate a little loss — clickstreams, metrics — go throughput-first (acks=1, RF=2, etc.). Tier the configuration to the value of the topic — that is the realistic approach.
If you want to push durability even further, continue with Part 4 on the internals of
acks/min.insync.replicas, and Part 5 which reproduces exactly how unclean leader election destroys data.
Wrapping up
- Message loss can arise in any of the three regions — producer, broker, consumer. Locking just one region is not safe.
- Producer:
acks=0/1,retries=0, and a missingflush()block delivery itself. Lock it down withacks=all+ idempotence +flush()/close()before exit. - Broker:
unclean.leader.election=true, a lowmin.insync.replicas, and RF=1 destroy even committed records. RF=3 + min.insync=2 + unclean=false is the balance point. - Consumer: auto-commit and committing before processing create logical loss. Turn off auto-commit and commit after processing, but design idempotently to absorb duplicates.
- Zero loss comes at the cost of latency, throughput, and availability. Don't apply it wholesale; tier it to the value of each topic.
- "Sent" is not "stored," and "read" is not "processed." Closing those gaps is the essence of a no-loss configuration.
References
- Apache Kafka Documentation — https://kafka.apache.org/documentation/
- Producer Configs (
acks,enable.idempotence,retries,delivery.timeout.ms) — https://kafka.apache.org/documentation/#producerconfigs- Broker/Topic Configs (
min.insync.replicas,unclean.leader.election.enable) — https://kafka.apache.org/documentation/#brokerconfigs- Replication & ISR design — https://kafka.apache.org/documentation/#replication
— The Data Dynamics Engineering Team