Blog
kafkadurabilitydata-lossreliabilitytroubleshooting

[Kafka Ops 3] Where Messages Get Lost — Failure Scenarios and a No-Loss Configuration

Every path by which a Kafka message can disappear, split across the producer, broker, and consumer, plus a no-loss recipe built from acks=all, RF=3, min.insync.replicas=2, and manual commits after processing.

Data DynamicsJune 2, 202613 min read

"We definitely sent it to Kafka, but the message is gone." It is one of the most chilling sentences you can hear in operations. A payment event, an order log, an audit record vanished into thin air and nobody saw an exception — when that happens, the culprit is usually not your code but your configuration. Kafka ships tuned to favor throughput and availability, so if you want zero loss you have to ask for it explicitly. This post walks every junction where a message can disappear, then assembles a configuration that loses nothing, end to end.

What you'll learn in this post

  • Whether message loss originates on the producer, broker, or consumer side
  • How acks, retries, and flush() tie into producer-side loss
  • How unclean.leader.election and min.insync.replicas decide broker-side loss
  • The trap of "logical loss" created by auto-commit
  • A no-loss recipe (unified producer, broker, and consumer settings)
  • The durability vs. availability/throughput trade-off

This is Part 3 of the Kafka Operations Troubleshooting series. Following Part 1 (consumer lag) and Part 2 (rebalancing), this installment is about how not to lose data. The internals of acks and min.insync.replicas get a deeper treatment in Part 4, and the danger of unclean.leader.election in Part 5.


1. The Life of a Message — Loss Can Happen Anywhere

For a single message to survive, it must pass through several gates. The producer sends it, the leader broker receives it, followers replicate it, it gets committed, and a consumer reads and processes it. Break any link in this chain and the message is gone.

Loading diagram…

Splitting the loss points into three regions:

RegionTypical causeOutcome
Producerweak acks, retries=0, missing flush()never reaches the broker
Brokerunclean leader election, low min.insync.replicas, RF=1committed records disappear
Consumercommit before processing, auto-commitrecords skipped (logical loss)

Now let's dissect each region in turn.


2. Producer-Side Loss — Sent Doesn't Mean Delivered

Calling send() does not mean the message is safely stored on the broker. The outcome depends entirely on your settings.

2.1 acks=0 — Fire and Forget

With acks=0 the producer waits for no response from the broker. The moment the data hits the socket buffer it is considered "successful." If the network drops or the leader is dead, the producer has no idea. You get maximum throughput — the polar opposite of zero loss.

# Never use for no-loss — fastest but most dangerous
acks=0

2.2 acks=1 — Leader Acks, Then the Leader Dies?

With acks=1 the leader broker returns an ack as soon as it writes to its own log. It does not wait for followers to replicate. That is where the trouble starts.

1. Producer -> leader: record written, ack returned (acks=1)
2. Before any follower replicates, the leader broker crashes
3. The controller elects one of the followers as the new leader
4. The new leader does not have that message -> lost

The producer received an ack, so it moves on as if everything succeeded. This is the most common scenario in which a message already recorded as successful silently disappears.

2.3 retries=0 — Throwing Away Transient Failures

With retries=0, a transient error (leader change in progress, a brief network blip, NotEnoughReplicas) is not retried — it immediately raises an exception or hands an error to the callback. If the caller doesn't handle that error properly, the message simply evaporates. Most transient failures resolve themselves within a few hundred milliseconds; turning off retries throws away that grace period yourself.

// Dangerous: no retries + ignored error
producer.send(record); // never inspects the returned Future -> failures get buried

2.4 async send() — When the Process Dies Without flush()

When you call send(), the Kafka producer stacks the message into an in-memory batch buffer (buffer.memory) and a background I/O thread gathers and ships batches. It's fast because it's asynchronous — but anything not yet sent lives only in process memory.

for (Order order : orders) {
    producer.send(new ProducerRecord<>("orders", order.id(), order.json()));
}
// What if System.exit() or SIGKILL hits here?
// Whatever remains in the buffer is never sent and is lost wholesale

If a batch job finishes and the JVM exits without calling flush() or close(), every un-sent message left in the buffer is lost. You must drain the buffer with flush() or close() before exiting.

try {
    for (Order order : orders) {
        producer.send(record, (meta, ex) -> {
            if (ex != null) log.error("send failed", ex); // detect failure in the callback
        });
    }
    producer.flush(); // guarantees all in-flight sends complete
} finally {
    producer.close(); // close() flushes internally too
}

3. Broker-Side Loss — Even Committed Records Vanish

Sometimes a message disappears even after the producer got an ack and the record was clearly committed. The broker's replication settings are the crux.

3.1 unclean.leader.election.enable=true — Promoting a Lagging Replica

Kafka tracks the set of replicas in sync with the leader as the ISR (In-Sync Replicas). Normally a new leader is chosen only from within the ISR. But if you set unclean.leader.election.enable=true, then when the ISR is empty Kafka will elect even an out-of-sync replica as leader.

1. The leader and its ISR followers go down at once (e.g. a rack power failure)
2. The only survivor is one badly lagging replica (outside the ISR)
3. unclean election is allowed -> that lagging replica becomes the new leader
4. Every "committed" record past that replica's offset is lost

It is an option that sacrifices durability for availability (quickly restoring a writable leader no matter what). If zero loss is the goal, it must be false. The full mechanism is covered in Part 5.

3.2 A Low min.insync.replicas Defeats Even acks=all

acks=all means "wait until every replica in the ISR has the record." But if the ISR has shrunk to just 1, then even with acks=all the ack returns once that single replica has it. If that one dies, you're done.

min.insync.replicas is the floor on how many replicas must be in the ISR for a write to succeed. If the floor isn't met, the broker rejects the write and throws NotEnoughReplicasException (the producer retries).

SettingWhat acks=all actually guarantees
min.insync.replicas=1ack with ISR of 1 -> that one dies, data lost
min.insync.replicas=2ack only once 2+ have it -> safe even if one dies

acks=all and min.insync.replicas only have teeth when configured as a pair.

3.3 replication.factor=1 — No Replica, No Recovery

With a replication factor (RF) of 1, the partition data exists on a single broker. If that broker's disk corrupts or fails permanently, there is nowhere to recover from. No matter how strong your acks or how high your min.insync.replicas, there are no replicas to lean on. RF=1 is a non-starter for production topics.


4. Consumer-Side Loss — Read, But Not Processed

Consumer-side loss isn't data vanishing from disk. It's logical loss: the consumer records messages it never processed as "processed," so it never reads them again.

4.1 Auto-Commit + Committing Before Processing

With enable.auto.commit=true (the default), the consumer commits the offsets it has polled so far in the background every auto.commit.interval.ms (5 seconds by default). The problem is that the commit happens independently of actual processing completion.

1. poll() -> receives 100 records
2. auto-commit timer fires -> offsets committed up to 100
3. The consumer crashes while processing record 60
4. Restart -> reads from offset 100
5. Records 61-100 were never processed but are skipped forever -> logical loss

Similarly, explicitly committing before processing in your code produces the same result.

// Dangerous: commit first, process later
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
consumer.commitSync();        // committed up front
for (var r : records) process(r); // crash here -> unprocessed records lost

4.2 The Safe Pattern — Commit After Processing

The fix is simple: turn off auto-commit and commit manually only after processing fully completes.

props.put("enable.auto.commit", "false"); // disable auto-commit
 
while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (var r : records) {
        process(r);        // 1. process first (persist to DB, etc.)
    }
    consumer.commitSync(); // 2. commit only after processing finishes
}

This way, if a crash hits mid-processing the offset was never committed, so on restart the consumer re-reads that batch. Note that this gives an at-least-once guarantee, so duplicate processing can occur. Design your processing logic to be idempotent (upsert by key, dedup by processing ID, etc.) or wrap it in a transaction. "No loss" and "no duplicates" are separate problems; choosing zero loss usually means accepting duplicates.


5. The No-Loss Recipe — A Configuration That Loses Nothing

Here we gather, in one place, the settings that close every loss path above. The key is locking all three regions — producer, broker/topic, and consumer — at once. Miss one and the chain snaps there.

5.1 Producer Settings

# Wait until all ISR members receive it
acks=all
 
# Idempotent producer: prevents duplicates / reordering from retries
enable.idempotence=true
 
# Default when idempotence is on. Retry transient failures to the end
retries=2147483647
 
# Total retry ceiling (retries operate only within this window)
delivery.timeout.ms=120000
 
# Must be <= 5 for the idempotence guarantee (order preservation)
max.in.flight.requests.per.connection=5

Enabling enable.idempotence=true automatically enforces and reconciles acks=all, retries=Integer.MAX_VALUE, and max.in.flight<=5. Still, spelling them out makes intent explicit, and if someone accidentally flips acks=1 the conflict surfaces immediately.

5.2 Broker/Topic Settings

# Replicate the partition across 3 brokers (survive 1 broker loss)
replication.factor=3
 
# A write needs at least 2 ISR members to succeed (pairs with acks=all)
min.insync.replicas=2
 
# Never elect a lagging replica as leader (durability first)
unclean.leader.election.enable=false

min.insync.replicas can also be set at the topic level, where the topic setting overrides the broker default. For topics that require zero loss, set it explicitly at the topic level.

5.3 Consumer Settings

# Turn off auto-commit — commit manually after processing
enable.auto.commit=false

And follow the pattern from 4.2: commitSync() after processing completes.

5.4 Why RF=3 + min.insync.replicas=2

The most frequently recommended combination is RF=3, min.insync.replicas=2. The reason is that it satisfies two requirements at once.

PropertyBehavior
DurabilityWith 3 replicas and acks=all + min.insync=2, at least 2 always hold the data. Survive a single broker loss.
Availability (write)Even if one broker dies, the remaining 2 satisfy min.insync=2 -> writes keep flowing.

As a formula, the number of broker losses you can tolerate at once is RF - min.insync.replicas.

  • RF=3, min.insync=2 -> writes survive up to 1 loss (the sweet spot)
  • RF=3, min.insync=3 -> a single loss halts writes (max durability, sacrifices availability)
  • RF=3, min.insync=1 -> max availability but effectively as risky as acks=1

So RF=3 + min.insync=2 is the reasonable balance between durability and availability: "keep both the data and the writes up to a single broker loss."

5.5 The No-Loss Configuration at a Glance

RegionSettingValuePurpose
Produceracksallwait until all ISR members receive it
Producerenable.idempotencetrueprevent retry duplicates / reordering
Producerretries2147483647retry transient failures to the end
Producerdelivery.timeout.ms120000total retry ceiling
Broker/Topicreplication.factor3preserve data across a single loss
Broker/Topicmin.insync.replicas2the real guarantee floor for acks=all
Broker/Topicunclean.leader.election.enablefalseforbid lagging-replica leadership
Consumerenable.auto.commitfalsecommit manually after processing

6. The Trade-off — There Is No Free Durability

A no-loss configuration is safe, and it costs you. Choose it knowing exactly what you gain and what you give up.

ChoiceWhat you gainWhat you give up
acks=allstrong durabilityhigher latency (replication wait), lower throughput
min.insync.replicas=2safe across one losswrites rejected when ISR drops below 2
unclean.election=falsecommitted records preservedpartition writes/reads halt if the entire ISR is lost
manual commit (after processing)blocks logical lossduplicates possible -> needs idempotent design
RF=3recovery headroom3x disk and network cost

The core principle is not to apply zero loss to every topic. Topics that cannot lose a single record — payments, orders, audit logs — go no-loss; topics that tolerate a little loss — clickstreams, metrics — go throughput-first (acks=1, RF=2, etc.). Tier the configuration to the value of the topic — that is the realistic approach.

If you want to push durability even further, continue with Part 4 on the internals of acks/min.insync.replicas, and Part 5 which reproduces exactly how unclean leader election destroys data.


Wrapping up

  • Message loss can arise in any of the three regions — producer, broker, consumer. Locking just one region is not safe.
  • Producer: acks=0/1, retries=0, and a missing flush() block delivery itself. Lock it down with acks=all + idempotence + flush()/close() before exit.
  • Broker: unclean.leader.election=true, a low min.insync.replicas, and RF=1 destroy even committed records. RF=3 + min.insync=2 + unclean=false is the balance point.
  • Consumer: auto-commit and committing before processing create logical loss. Turn off auto-commit and commit after processing, but design idempotently to absorb duplicates.
  • Zero loss comes at the cost of latency, throughput, and availability. Don't apply it wholesale; tier it to the value of each topic.
  • "Sent" is not "stored," and "read" is not "processed." Closing those gaps is the essence of a no-loss configuration.

References


— The Data Dynamics Engineering Team