[Kafka Ops 4] acks, min.insync.replicas, and Replication Factor — The Durability Trinity
How acks, replication factor (RF), and min.insync.replicas interlock to determine Kafka's data durability and availability. Covers why RF=3, min.insync.replicas=2, acks=all is the safe default, its trade-offs, a configuration matrix, and ISR behavior.
Few moments make an operator break into a cold sweat like hearing "we lost data" during an incident. Kafka is a distributed log, so it replicates data across multiple brokers — but turning on replication alone does not make you safe. The producer's acks, the topic's replication factor, and the broker's min.insync.replicas must interlock precisely before the guarantee "if a message was acknowledged, it will never be lost" actually holds. Get any one of them wrong and the other two — no matter how conservative — collapse.
In this post we treat these three settings not as separate knobs but as one system. We will trace exactly why the often-quoted RF=3, min.insync.replicas=2, acks=all is safe, and what you give up to get that safety.
What you'll learn in this post
- The precise meaning of the three acks modes (0/1/all) and a common misconception
- How replication factor and ISR (In-Sync Replicas) work
- What happens when min.insync.replicas meets acks=all
- Why RF=3 · min.insync.replicas=2 · acks=all is safe, and its limits
- A configuration matrix showing the durability/availability trade-off at a glance
This is Part 4 of the Kafka Ops Troubleshooting series. The exact conditions under which messages are lost were covered in [Part 3 — Message Loss Scenarios], and the danger of unclean leader election (which appears briefly here) is dissected in [Part 5 — Unclean Leader Election]. This post is the bridge between them: how to design durability during normal operation.
1. The Durability Trinity at a Glance
In Kafka, a single write passes through three distinct layers before it is safely stored. The key is to first separate who owns each setting and what it decides.
| Setting | Owner | Scope | What it decides |
|---|---|---|---|
| acks | Producer | Per-producer / per-request | When the producer considers a message acknowledged |
| replication.factor | Topic | Per-partition | How many copies of the data are kept |
| min.insync.replicas | Broker / Topic | Topic (or broker default) | The minimum in-sync replicas required to accept an acks=all write |
A guarantee only emerges when all three are conservatively aligned. An analogy:
- Replication factor is "how many copies do we make": the number of safe vaults.
- min.insync.replicas is "how many copies must actually be filled before we approve the transaction": the quorum for approval.
- acks is "will the producer wait for that quorum": when the receipt is handed back.
Raising the replication factor to 3 is meaningless if the producer uses acks=0, because it never waits for copies to be filled. Conversely, turning on acks=all with min.insync.replicas=1 lets a response go out as soon as a single copy exists. Let us nail this down from the start: the three are one team.
2. acks — What the Producer Waits For
acks defines what the producer must confirm before it considers a message accepted by the broker. There are three values.
| acks | What it waits for | Durability | Throughput/latency | Loss risk |
|---|---|---|---|---|
| 0 | Nothing | Very low | Highest throughput | Fire and forget; does not even notice a network drop |
| 1 | Leader only | Medium | Fast | Lost if the leader fails before followers replicate |
| all (= -1) | All ISR members | High | Slowest | Effectively lossless as long as min.insync.replicas keeps at least 2 replicas in the ISR |
acks=0 — Fire and Forget
The producer writes the message to the socket and moves on immediately. Because it does not wait for the broker's response, it treats the send as a success even if the packet is dropped on the network or the broker rejects it. Latency is lowest and throughput is highest, but use this only where some loss is acceptable, such as metrics or log collection.
acks=1 — Leader Only
The leader broker responds as soon as it has written the message to its own log. The problem arises when the leader dies right after responding but before any follower has finished replicating. The new leader has never seen that message, so the data is gone. The producer already received a "success" response, so it does not retry. This is the classic loss scenario covered in Part 3.
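The acks=1 failure window can be replayed with a toy model. This is only a sketch, not Kafka code: the Replica class and the produce_acks_1 and replicate helpers are hypothetical names, and a "log" is just a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A toy replica: nothing more than an append-only list of messages."""
    name: str
    log: list = field(default_factory=list)


def produce_acks_1(leader: Replica, message: str) -> bool:
    """acks=1: the leader appends locally and acknowledges immediately.

    Followers are not consulted; they copy the record later via their own fetch.
    """
    leader.log.append(message)
    return True  # the producer sees "success" at this point


def replicate(leader: Replica, follower: Replica) -> None:
    """One follower fetch: copy whatever the follower is still missing."""
    follower.log.extend(leader.log[len(follower.log):])


leader = Replica("broker-1")
follower = Replica("broker-2")

produce_acks_1(leader, "order-1")
replicate(leader, follower)  # order-1 reaches the follower

acked = produce_acks_1(leader, "order-2")
# broker-1 crashes here, before the follower's next fetch.
new_leader = follower

lost = [m for m in leader.log if m not in new_leader.log]
print(f"acked={acked}, lost after failover={lost}")
# prints: acked=True, lost after failover=['order-2']
```

The producer holds a success response for order-2, yet the new leader's log never contained it, which is exactly why no retry happens.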
acks=all — But "all" Doesn't Mean Everything
Let us clear up the most important misconception first. acks=all waits for "all ISR (In-Sync Replicas)", not "all replicas".
RF=3, with one follower lagging and dropped from the ISR:
replica set = { leader, follower-A, follower-B } (3)
ISR set = { leader, follower-A } (2) ← follower-B dropped
acks=all waits only for the ISR (2) → it does NOT wait for follower-B.
In other words, a lagging follower is excluded from the response path. If acks=all truly waited for "all replicas," a single slow follower would stall every write. Kafka does not work that way. Instead, the floor on "how many ISR members must remain to allow a write" is controlled by a separate setting, min.insync.replicas, which the next sections cover.
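To make the "all ISR, not all replicas" distinction concrete, here is a minimal sketch of which replicas each acks mode waits for. The function name replicas_awaited is made up for illustration; it models the rule described above and is not part of any Kafka API.

```python
def replicas_awaited(acks: str, leader: str, isr: set) -> set:
    """Return the replicas whose write must complete before the producer is acked."""
    if acks == "0":
        return set()        # fire and forget: nobody is awaited
    if acks == "1":
        return {leader}     # leader's local append only
    if acks in ("all", "-1"):
        return set(isr)     # the current ISR, not the full replica set
    raise ValueError(f"unknown acks value: {acks!r}")


replicas = {"leader", "follower-A", "follower-B"}
isr = {"leader", "follower-A"}  # follower-B has dropped out of the ISR

awaited = replicas_awaited("all", "leader", isr)
print(sorted(awaited))              # prints: ['follower-A', 'leader']
print(sorted(replicas - awaited))   # prints: ['follower-B']
```

The second line of output is the set of replicas that exist but are not on the response path.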
# Producer settings (producer.properties or code)
acks=all
# Deduplicate retried batches; requires acks=all
enable.idempotence=true
# Absorb transient failures via retries
retries=2147483647
# Upper limit that still preserves ordering for idempotent producers
max.in.flight.requests.per.connection=5
Setting enable.idempotence=true requires acks=all (the producer refuses to start if acks is explicitly set to a conflicting value) and deduplicates retried sends. On its own this is not end-to-end exactly-once delivery, which additionally needs transactions. If you take durability seriously, treat it as effectively a default and leave it on.
3. Replication Factor and ISR
Replication Factor
Replication factor is a per-topic setting that decides how many copies of each partition are kept. With RF=3, a partition exists across three brokers. One of them is the leader, which handles all reads and writes, while the other two are followers that continuously replicate the leader's log.
# Create a topic with RF=3
kafka-topics.sh --create \
--topic orders \
--partitions 6 \
--replication-factor 3 \
--bootstrap-server localhost:9092
RF sets the ceiling on "how many simultaneous broker failures can be tolerated." With RF=3, the data itself survives up to 2 lost brokers in theory. But "data survives" and "writes can continue" are different problems; the latter is decided by ISR and min.insync.replicas below.
ISR (In-Sync Replicas)
The ISR is the set of replicas currently sufficiently synchronized with the leader. The leader is always in the ISR, and a follower stays in the ISR as long as it has caught up to the leader's latest log within replica.lag.time.max.ms (default 30 seconds). If it fails to catch up within that window it is removed from the ISR, and it rejoins once it catches up again.
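The membership rule can be sketched as a small function. This is a simplified model under stated assumptions: it supposes the leader tracks, per follower, the last time that follower was fully caught up (the last_caught_up_ms mapping is a hypothetical name), and it ignores everything else a real broker does.

```python
REPLICA_LAG_TIME_MAX_MS = 30_000  # replica.lag.time.max.ms, 30 seconds as stated above


def compute_isr(leader: str, last_caught_up_ms: dict, now_ms: int,
                max_lag_ms: int = REPLICA_LAG_TIME_MAX_MS) -> set:
    """Return the in-sync replica set at time now_ms.

    The leader is always a member. A follower is a member only if it was
    fully caught up at some point within the last max_lag_ms milliseconds.
    """
    isr = {leader}
    for follower, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= max_lag_ms:
            isr.add(follower)
    return isr


# Broker 3 was last caught up 40 s ago, so it drops out of the ISR.
print(sorted(compute_isr("1", {"2": 95_000, "3": 60_000}, now_ms=100_000)))
# prints: ['1', '2']

# Once broker 3 catches up again, it rejoins.
print(sorted(compute_isr("1", {"2": 99_000, "3": 99_500}, now_ms=100_000)))
# prints: ['1', '2', '3']
```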
The ISR matters for two reasons.
- What acks=all waits for is precisely the ISR (not the full replica set).
- When the leader dies, a new leader is, in principle, elected only from within the ISR. Allowing a lagging replica outside the ISR to become leader is unclean leader election, which leads directly to data loss (see Part 5).
# Check ISR status — if Isr has fewer entries than Replicas, a replica is lagging
kafka-topics.sh --describe --topic orders \
--bootstrap-server localhost:9092
# Example output
# Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2
# ^^^ broker 3 dropped from ISR
In the output above, Replicas: 1,2,3 but Isr: 1,2 means broker 3's replica has fallen out of the ISR, either because it lagged for longer than replica.lag.time.max.ms (30 seconds by default) or because the broker is down. At this moment there are only 2 in-sync copies.
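If you want to spot shrunken ISRs across many partitions, the describe output can be scanned with a short script. The sketch below assumes the partition lines look like the example above (fields separated by whitespace); the lagging_replicas helper is a hypothetical name, and real output may carry extra fields that this regex simply ignores.

```python
import re

# Matches a per-partition line of `kafka-topics.sh --describe` output.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def lagging_replicas(describe_output: str) -> dict:
    """Map (topic, partition) -> broker ids listed in Replicas but absent from Isr."""
    result = {}
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # header lines and blank lines are skipped
        replicas = match["replicas"].split(",")
        isr = {broker for broker in match["isr"].split(",") if broker}
        missing = [broker for broker in replicas if broker not in isr]
        if missing:
            result[(match["topic"], int(match["partition"]))] = missing
    return result


sample = """\
Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2
Topic: orders Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
"""
print(lagging_replicas(sample))
# prints: {('orders', 0): ['3']}
```

For a quick manual check, kafka-topics.sh also accepts --under-replicated-partitions together with --describe, which filters the listing to the affected partitions.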
4. min.insync.replicas — The Write Quorum
min.insync.replicas (min.isr for short) defines the minimum number of replicas that must be in the ISR to accept an acks=all write. It is configured per topic or per broker.
The core rule is simple.
When acks=all, if ISR size < min.insync.replicas, the write is rejected.
When rejected, the producer receives one of two exceptions.
| Exception | When it occurs |
|---|---|
| NotEnoughReplicasException | The ISR is already short at the start of the write attempt; rejected before appending to the log |
| NotEnoughReplicasAfterAppendException | The message was appended to the leader log, but not enough ISR members confirmed replication |
The second exception is subtle. The message may have entered the leader log, but since it did not meet the quorum it is treated as a failure from the producer's perspective and retried (deduplicated if the producer is idempotent). In other words, Kafka chooses to "report a failure rather than lie about success" in such cases.
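The two rejection points can be modeled in a few lines. This is a toy sketch of the rule, not broker code: handle_produce and its parameters are hypothetical, the returned strings mirror the Kafka error names behind the two exceptions, and the "ISR shrank between append and ack" case stands in for "not enough ISR members confirmed replication."

```python
def handle_produce(log: list, message: str, acks: str, min_isr: int,
                   isr_at_append: int, isr_at_ack: int) -> str:
    """Decide the outcome of one produce request on the leader.

    isr_at_append: ISR size when the request arrives.
    isr_at_ack:    ISR size when the leader is ready to acknowledge.
    """
    if acks not in ("all", "-1"):
        # min.insync.replicas is only consulted for acks=all writes.
        log.append(message)
        return "NONE"
    if isr_at_append < min_isr:
        return "NOT_ENOUGH_REPLICAS"  # rejected before touching the log
    log.append(message)
    if isr_at_ack < min_isr:
        # The record is in the leader log, but the quorum was not met.
        return "NOT_ENOUGH_REPLICAS_AFTER_APPEND"
    return "NONE"


log = []
print(handle_produce(log, "m1", "all", min_isr=2, isr_at_append=3, isr_at_ack=3))
# prints: NONE
print(handle_produce(log, "m2", "all", min_isr=2, isr_at_append=1, isr_at_ack=1))
# prints: NOT_ENOUGH_REPLICAS
print(handle_produce(log, "m3", "all", min_isr=2, isr_at_append=2, isr_at_ack=1))
# prints: NOT_ENOUGH_REPLICAS_AFTER_APPEND
print(log)
# prints: ['m1', 'm3']
```

Note that m3 sits in the leader log even though the producer was told the write failed, which is why an idempotent producer matters when it retries.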
# Broker default (server.properties)
min.insync.replicas=2
# Or per topic (recommended in production — durability needs differ per topic)
kafka-configs.sh --alter --topic orders \
--add-config min.insync.replicas=2 \
--bootstrap-server localhost:9092
Important: min.insync.replicas only takes effect when acks=all. If the producer sends with acks=1, this setting is ignored and the leader responds alone. That is why all three settings must be aligned together: setting min.insync.replicas=2 on a topic has no effect whatsoever if the producer uses acks=1.
5. How the Three Interact — Why RF=3 · min.isr=2 · acks=all
Now let us combine all three. The most widely cited safe configuration is:
replication.factor = 3
min.insync.replicas = 2
acks = all
Let us trace step by step why this combination is called the "golden ratio."
Normal state (all 3 brokers healthy)
ISR = 3, so min.isr=2 is satisfied. With acks=all the leader acknowledges a write only after every current ISR member has it, which in the healthy state means all 3 copies. If one follower lags and is removed from the ISR, 2 members remain and writes still pass.
One broker fails (ISR drops to 2)
When one dies, ISR = 2. This exactly satisfies min.isr=2, so writes continue. At the same time, every message is safely written to at least 2 copies. That is, you tolerate a single broker failure while remaining writable. This is the core value of the combination.
Two brokers fail (ISR drops to 1)
Now ISR = 1, which fails to meet min.isr=2. Kafka rejects the write and the producer receives NotEnoughReplicasException. It hurts, but this is the correct behavior. If writes kept being accepted with only 1 copy left, and that last copy then died, the data would be gone forever. Kafka chooses to fail fast instead of losing data silently. The producer sees the error and can retry or fire an alert — far better than losing data.
To summarize the flow above in one line: the leader receives the message the producer sent, checks that the ISR still has at least min.insync.replicas members, appends the message, waits for every ISR follower to replicate it, and only then returns an ack to the producer. That check is the heart of the lossless guarantee.
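The three states walked through above can be replayed in a loop. This is a minimal sketch that assumes every failed broker hosted one replica of the partition and left the ISR; walk_failures is a hypothetical helper name.

```python
def walk_failures(rf: int, min_isr: int) -> list:
    """For 0..rf-1 failed brokers, report the ISR size and the fate of an acks=all write."""
    steps = []
    for failed in range(rf):
        isr_size = rf - failed
        if isr_size >= min_isr:
            outcome = "accepted"
        else:
            outcome = "rejected: NotEnoughReplicasException"
        steps.append((failed, isr_size, outcome))
    return steps


for failed, isr_size, outcome in walk_failures(rf=3, min_isr=2):
    print(f"{failed} broker(s) down -> ISR={isr_size} -> acks=all write {outcome}")
# prints:
# 0 broker(s) down -> ISR=3 -> acks=all write accepted
# 1 broker(s) down -> ISR=2 -> acks=all write accepted
# 2 broker(s) down -> ISR=1 -> acks=all write rejected: NotEnoughReplicasException
```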
6. The Trade-off — Durability and Availability Are a Seesaw
The higher you set min.insync.replicas, the higher the durability but the lower the availability. In the extreme, setting min.insync.replicas = RF requires all replicas to be in sync for a write to succeed, so durability is maximal — but the instant even one replica lags or dies, all writes stop. A single restart or a rolling upgrade would block writes, so this is almost never used operationally.
Conversely, min.insync.replicas=1 lets a response go out as soon as a single copy exists even with acks=all, effectively dropping you to a loss risk similar to acks=1.
The key formula:
Broker failures tolerated (while staying writable) = RF − min.insync.replicas
With RF=3, min.isr=2, that is 3 − 2 = 1, so writes survive up to 1 failure. Using this formula as the yardstick, let us compare configuration combinations.
| RF | min.insync.replicas | acks | Durability | Availability (writable) | Failures tolerated | Verdict |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Very low | Full outage on any failure | 0 | Dev/test only |
| 3 | 1 | all | Low | Writable up to 2 failures | 2 (write), but loss possible | acks=all rendered moot — not recommended |
| 3 | 2 | 1 | Low | High | — | min.isr neutralized (acks=1) — not recommended |
| 3 | 2 | all | High | Writable up to 1 failure | 1 | Standard safe config ✅ |
| 3 | 3 | all | Highest | Writes stop if even 1 drops | 0 (write) | Sacrifices availability, special use |
| 5 | 3 | all | Very high | Writable up to 2 failures | 2 | Mission-critical / high-availability |
As the table shows, for most production environments RF=3 · min.insync.replicas=2 · acks=all is the balance point of durability and availability. When you need both higher availability and durability — as in finance or payments — consider raising to RF=5 · min.isr=3 to stay writable through 2 simultaneous failures.
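The formula and the acks=all rows of the matrix can be cross-checked with a few lines of arithmetic. This is a sketch only: failures_tolerated and copies_guaranteed are hypothetical helper names, and the second one simply restates that an acknowledged acks=all write exists on at least min.insync.replicas replicas.

```python
def failures_tolerated(rf: int, min_isr: int) -> int:
    """Broker failures a partition survives while still accepting acks=all writes."""
    if not 1 <= min_isr <= rf:
        raise ValueError("min.insync.replicas must be between 1 and RF")
    return rf - min_isr


def copies_guaranteed(min_isr: int) -> int:
    """Minimum number of copies behind every acknowledged acks=all write."""
    return min_isr


# (RF, min.insync.replicas) pairs taken from the acks=all rows of the matrix.
for rf, min_isr in [(3, 1), (3, 2), (3, 3), (5, 3)]:
    print(f"RF={rf} min.isr={min_isr}: "
          f"writable through {failures_tolerated(rf, min_isr)} failure(s), "
          f"at least {copies_guaranteed(min_isr)} copy/copies per acked write")
```

Reading the two numbers side by side shows the seesaw: raising min.insync.replicas by one buys one more guaranteed copy and costs one tolerated failure.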
Common Mistakes Checklist
- Not changing acks on the producer: Setting only min.insync.replicas=2 on the topic while leaving the producer at acks=1 (the Java client default before Kafka 3.0, and possibly your client library's default) discards the entire quorum guarantee. Configure all three together.
- Setting min.insync.replicas equal to RF: Tempting when durability is the only goal, but min.isr=RF blocks writes during every rolling restart.
- Piling replicas into a single rack / single AZ: Even with RF=3, if all 3 copies sit in the same rack or availability zone, they vanish together when that zone fails. Enable rack-aware placement with broker.rack to spread the copies.
- Leaving unclean.leader.election.enable=true: With this on, a replica outside the ISR can become leader, collapsing every guarantee built above. The mechanism and danger are covered in [Part 5 — Unclean Leader Election].
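The checklist lends itself to a small pre-deployment lint. The sketch below is illustrative only: audit and its parameters are hypothetical, and replica_racks is assumed to be the list of broker.rack values of the brokers hosting one partition's replicas.

```python
def audit(rf: int, min_isr: int, acks: str, replica_racks: list,
          unclean_leader_election: bool) -> list:
    """Return a list of findings; an empty list means none of the four mistakes was detected."""
    findings = []
    if acks not in ("all", "-1"):
        findings.append("acks is not 'all': min.insync.replicas is not enforced")
    if rf > 1 and min_isr >= rf:
        findings.append("min.insync.replicas equals RF: one replica outage blocks writes")
    if rf > 1 and len(set(replica_racks)) < 2:
        findings.append("all replicas share one rack/AZ: a zone failure takes every copy")
    if unclean_leader_election:
        findings.append("unclean.leader.election.enable=true: an out-of-sync replica can become leader")
    return findings


print(audit(3, 2, "all", ["az-a", "az-b", "az-c"], False))
# prints: []

for finding in audit(3, 3, "1", ["az-a", "az-a", "az-a"], True):
    print("-", finding)
```

The second call trips all four checks, one per checklist item.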
Wrapping up
- Data durability is not a single setting but an ensemble of three. acks (what the producer waits for), replication.factor (number of copies), and min.insync.replicas (write quorum) must interlock for a guarantee to emerge.
- acks=all waits for "all ISR", not "all replicas". Lagging followers drop out of the response path, and the floor on minimum in-sync copies is controlled by min.insync.replicas.
- RF=3 · min.insync.replicas=2 · acks=all stays writable through a single broker failure, and when 2 die it fails fast with NotEnoughReplicasException instead of losing data silently.
- Durability and availability are a seesaw. Use the formula failures tolerated = RF − min.insync.replicas to pick the balance point for each topic's requirements.
- No matter how well you set these three, a single unclean.leader.election.enable=true can undo them all. That story continues in Part 5.
References
- Apache Kafka. "Broker Configs" (min.insync.replicas, replica.lag.time.max.ms, unclean.leader.election.enable): https://kafka.apache.org/documentation/#brokerconfigs
- Apache Kafka. "Producer Configs" (acks, enable.idempotence, retries): https://kafka.apache.org/documentation/#producerconfigs
- Apache Kafka. "Replication": https://kafka.apache.org/documentation/#replication
— The Data Dynamics Engineering Team