[Kafka Ops 4] acks, min.insync.replicas, and Replication Factor — The Durability Trinity

How acks, replication factor (RF), and min.insync.replicas interlock to determine Kafka's data durability and availability. Covers why RF=3, min.insync.replicas=2, acks=all is the safe default, its trade-offs, a configuration matrix, and ISR behavior.

Data Dynamics · June 3, 2026 · 13 min read

Few moments make an operator break into a cold sweat like hearing "we lost data" during an incident. Kafka is a distributed log, so it replicates data across multiple brokers — but turning on replication alone does not make you safe. The producer's acks, the topic's replication factor, and the broker's min.insync.replicas must interlock precisely before the guarantee "if a message was acknowledged, it will never be lost" actually holds. Get any one of them wrong and the other two — no matter how conservative — collapse.

In this post we treat these three settings not as separate knobs but as one system. We will trace exactly why the often-quoted RF=3, min.insync.replicas=2, acks=all is safe, and what you give up to get that safety.

What you'll learn in this post

  • The precise meaning of the three acks modes (0/1/all) and a common misconception
  • How replication factor and ISR (In-Sync Replicas) work
  • What happens when min.insync.replicas meets acks=all
  • Why RF=3 · min.insync.replicas=2 · acks=all is safe, and its limits
  • A configuration matrix showing the durability/availability trade-off at a glance

This is Part 4 of the Kafka Ops Troubleshooting series. The exact conditions under which messages are lost were covered in [Part 3 — Message Loss Scenarios], and the danger of unclean leader election (which appears briefly here) is dissected in [Part 5 — Unclean Leader Election]. This post is the bridge between them: how to design durability during normal operation.


1. The Durability Trinity at a Glance

In Kafka, a single write passes through three distinct layers before it is safely stored. The key is to first separate who owns each setting and what it decides.

| Setting | Owner | Scope | What it decides |
| --- | --- | --- | --- |
| acks | Producer | Per-producer / per-request | When the producer considers a message acknowledged |
| replication.factor | Topic | Per-partition | How many copies of the data are kept |
| min.insync.replicas | Broker / Topic | Topic (or broker default) | The minimum in-sync replicas required to accept an acks=all write |

A guarantee only emerges when all three are conservatively aligned. An analogy:

  • Replication factor is "how many copies do we make" — the number of safe vaults.
  • min.insync.replicas is "how many copies must actually be filled before we approve the transaction" — the quorum for approval.
  • acks is "will the producer wait for that quorum" — when the receipt is handed back.

Raising the replication factor to 3 is meaningless if the producer uses acks=0, because it never waits for copies to be filled. Conversely, turning on acks=all with min.insync.replicas=1 lets a response go out as soon as a single copy exists. Let us nail this down from the start: the three are one team.


2. acks — What the Producer Waits For

acks defines what the producer must confirm before it considers a message accepted by the broker. There are three values.

| acks | What it waits for | Durability | Throughput/latency | Loss risk |
| --- | --- | --- | --- | --- |
| 0 | Nothing | Very low | Highest throughput | Fire and forget: does not even notice a network drop |
| 1 | Leader only | Medium | Fast | Lost if the leader fails before followers replicate |
| all (= -1) | All ISR | High | Slowest | Effectively lossless if the ISR is defined well |

acks=0 — Fire and Forget

The producer writes the message to the socket and moves on immediately. Because it does not wait for the broker's response, it treats the send as a success even if the packet is dropped on the network or the broker rejects it. Latency is lowest and throughput is highest, but use this only where some loss is acceptable, such as metrics or log collection.

acks=1 — Leader Only

The leader broker responds as soon as it has written the message to its own log. The problem arises when the leader dies right after responding but before any follower has finished replicating. The new leader has never seen that message, so the data is gone. The producer already received a "success" response, so it does not retry. This is the classic loss scenario covered in Part 3.
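The sequence above can be replayed in a small in-memory model. This is a minimal sketch, not Kafka code: `Replica` and `send_with_acks_1` are hypothetical names, and replication is reduced to copying a list entry.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one broker's copy of a partition log."""
    name: str
    log: list = field(default_factory=list)


def send_with_acks_1(leader, followers, message, followers_catch_up):
    """Model acks=1: the leader appends and acknowledges on its own.

    followers_catch_up says whether followers copied the message
    before the leader crashed.
    """
    leader.log.append(message)
    acknowledged = True  # the producer sees success at this point
    if followers_catch_up:
        for follower in followers:
            follower.log.append(message)
    return acknowledged


leader = Replica("broker-1")
follower = Replica("broker-2")
acked = send_with_acks_1(leader, [follower], "order-42", followers_catch_up=False)

# broker-1 crashes and broker-2 takes over as leader.
new_leader = follower
lost_after_ack = acked and "order-42" not in new_leader.log
print(lost_after_ack)  # True: acknowledged, yet absent from the new leader
```

The producer holds a success response for a message that no surviving replica contains, which is why it never retries.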

acks=all — But "all" Doesn't Mean Everything

Let us clear up the most important misconception first. acks=all waits for "all ISR (In-Sync Replicas)", not "all replicas".

RF=3, with one follower lagging and dropped from the ISR:
 
  replica set = { leader, follower-A, follower-B }   (3)
  ISR set     = { leader, follower-A }               (2) ← follower-B dropped
 
  acks=all waits only for the ISR (2) → it does NOT wait for follower-B.

In other words, a lagging follower is excluded from the response path. If acks=all truly waited for "all replicas," a single slow follower would stall every write. Kafka does not work that way. Instead, the floor on "how many ISR must remain to allow a write" is controlled by a separate setting, min.insync.replicas — the subject of the next sections.
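To make the distinction concrete, here is a minimal sketch of which replicas each acks mode waits for. The function name `replicas_awaited` and the string replica names are illustrative assumptions, not part of any Kafka API.

```python
def replicas_awaited(acks, leader, isr):
    """Return the set of replicas whose append must complete before the ack."""
    if acks == "0":
        return set()          # nothing is awaited
    if acks == "1":
        return {leader}       # leader only
    if acks in ("all", "-1"):
        return set(isr)       # every current ISR member, not every replica
    raise ValueError(f"unknown acks value: {acks!r}")


replica_set = {"leader", "follower-A", "follower-B"}
isr = {"leader", "follower-A"}  # follower-B has dropped out of the ISR

awaited = replicas_awaited("all", "leader", isr)
print(sorted(awaited))                # ['follower-A', 'leader']
print(sorted(replica_set - awaited))  # ['follower-B']
```

The second line of output is the lagging follower that acks=all does not wait for.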

# Producer settings (producer.properties or code)
acks=all
enable.idempotence=true   # deduplicates retries per partition (requires acks=all)
retries=2147483647        # absorb transient failures via retries
max.in.flight.requests.per.connection=5  # ordering cap for idempotent producers

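Setting enable.idempotence=true requires acks=all and deduplicates on retry; in the Java client, explicitly combining it with a conflicting acks value is rejected as a configuration error. If you take durability seriously, treat it as effectively a default and leave it on.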
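The constraints that idempotence places on the other producer settings can be checked before deployment. The sketch below is an assumption-laden illustration: `idempotence_conflicts` is a hypothetical helper that reads a plain dictionary of string properties, and it treats missing keys as compatible rather than modelling any particular client's defaults.

```python
def idempotence_conflicts(config):
    """List settings that conflict with enable.idempotence=true.

    Missing keys are assumed to be compatible (an assumption of this sketch).
    """
    if config.get("enable.idempotence", "false").lower() != "true":
        return []
    problems = []
    if config.get("acks", "all") not in ("all", "-1"):
        problems.append("acks must be all")
    if int(config.get("retries", "1")) <= 0:
        problems.append("retries must be greater than 0")
    if int(config.get("max.in.flight.requests.per.connection", "5")) > 5:
        problems.append("max.in.flight.requests.per.connection must be <= 5")
    return problems


good = {
    "acks": "all",
    "enable.idempotence": "true",
    "retries": "2147483647",
    "max.in.flight.requests.per.connection": "5",
}
print(idempotence_conflicts(good))                  # []
print(idempotence_conflicts(dict(good, acks="1")))  # ['acks must be all']
```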


3. Replication Factor and ISR

Replication Factor

Replication factor is a per-topic setting that decides how many copies of each partition are kept. With RF=3, a partition exists across three brokers. One of them is the leader, which handles all writes and, by default, all reads, while the other two are followers that continuously replicate the leader's log.

# Create a topic with RF=3
kafka-topics.sh --create \
  --topic orders \
  --partitions 6 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092

RF sets the ceiling on "how many simultaneous broker failures can be tolerated." With RF=3, the data itself survives up to 2 lost brokers in theory. But "data survives" and "writes can continue" are different problems; the latter is decided by ISR and min.insync.replicas below.

ISR (In-Sync Replicas)

The ISR is the set of replicas currently sufficiently synchronized with the leader. The leader is always in the ISR, and a follower stays in the ISR as long as it has caught up to the leader's latest log within replica.lag.time.max.ms (default 30 seconds). If it fails to catch up within that window it is removed from the ISR, and it rejoins once it catches up again.
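The membership rule can be sketched as a simple time comparison. This is a simplified model under stated assumptions: `current_isr` is a hypothetical function, and each follower is represented only by the timestamp at which it last caught up to the leader.

```python
def current_isr(leader, last_caught_up_ms, now_ms, replica_lag_time_max_ms=30_000):
    """Return the replica ids considered in sync at now_ms.

    last_caught_up_ms maps follower id -> time it last reached the leader's log end.
    """
    isr = {leader}  # the leader is always a member
    for follower, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= replica_lag_time_max_ms:
            isr.add(follower)
    return isr


now = 100_000
followers = {2: 99_500, 3: 60_000}  # broker 3 last caught up 40 s ago
print(sorted(current_isr(1, followers, now)))  # [1, 2]

followers[3] = 99_900  # broker 3 catches up again and rejoins
print(sorted(current_isr(1, followers, now)))  # [1, 2, 3]
```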

The ISR matters for two reasons.

  1. What acks=all waits for is precisely the ISR (not the full replica set).
  2. When the leader dies, a new leader is, in principle, elected only from within the ISR. Allowing a lagging replica outside the ISR to become leader is unclean leader election, which leads directly to data loss (see Part 5).

# Check ISR status: if Isr has fewer entries than Replicas, a replica is lagging
kafka-topics.sh --describe --topic orders \
  --bootstrap-server localhost:9092
 
# Example output
# Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2
#                                                                  ^^^ broker 3 dropped from ISR

In the output above, Replicas: 1,2,3 but Isr: 1,2 means broker 3's replica has lagged more than 30 seconds and fallen out of the ISR. At this moment there are only 2 in-sync copies.
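Comparing the two lists by eye does not scale past a handful of partitions. As a rough sketch, the partition lines can be parsed to list the replicas missing from the ISR; the regular expression assumes the field layout shown in the example output above, and `lagging_replicas` is a hypothetical helper name.

```python
import re

# Matches one partition line of the describe output shown above.
PARTITION_LINE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(-?\d+)"
    r"\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def lagging_replicas(describe_line):
    """Return broker ids listed in Replicas but absent from Isr, or None if unparsable."""
    match = PARTITION_LINE.search(describe_line)
    if match is None:
        return None
    replicas = [int(x) for x in match.group(4).split(",")]
    isr = {int(x) for x in match.group(5).split(",") if x}
    return [broker for broker in replicas if broker not in isr]


line = "Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2"
print(lagging_replicas(line))  # [3]
```

For a quick check without any parsing, kafka-topics.sh --describe also accepts --under-replicated-partitions, which limits the output to partitions whose ISR is smaller than the replica set.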


4. min.insync.replicas — The Write Quorum

min.insync.replicas (min.isr for short) defines the minimum number of replicas that must be in the ISR to accept an acks=all write. It is configured per topic or per broker.

The core rule is simple.

When acks=all, if ISR size < min.insync.replicas, the write is rejected.

When rejected, the producer receives one of two exceptions.

| Exception | When it occurs |
| --- | --- |
| NotEnoughReplicasException | The ISR is already short at the start of the write attempt; rejected before appending to the log |
| NotEnoughReplicasAfterAppendException | The message was appended to the leader log, but not enough ISR members confirmed replication |

The second exception is subtle. The message may have entered the leader log, but since it did not meet the quorum it is treated as a failure from the producer's perspective and retried (deduplicated if the producer is idempotent). In other words, Kafka chooses to "report a failure rather than lie about success" in such cases.
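The two rejection points can be expressed as a small decision function. This is a simplified model rather than broker code: `acks_all_outcome` is a hypothetical name, and the ISR is represented only by its size before the append and at the moment the ack would be sent.

```python
def acks_all_outcome(isr_size_before_append, isr_size_at_ack, min_insync_replicas):
    """Classify an acks=all write by where, if anywhere, the quorum check fails."""
    if isr_size_before_append < min_insync_replicas:
        # Rejected up front: nothing is appended to the leader log.
        return "NotEnoughReplicasException"
    if isr_size_at_ack < min_insync_replicas:
        # Appended on the leader, but the ISR shrank before the quorum confirmed.
        return "NotEnoughReplicasAfterAppendException"
    return "acknowledged"


print(acks_all_outcome(1, 1, 2))  # NotEnoughReplicasException
print(acks_all_outcome(2, 1, 2))  # NotEnoughReplicasAfterAppendException
print(acks_all_outcome(3, 2, 2))  # acknowledged
```

Both failures are reported to the producer as errors it can retry, so neither one turns into a silent loss.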

# Broker default (server.properties)
min.insync.replicas=2
 
# Or per topic (recommended in production — durability needs differ per topic)
kafka-configs.sh --alter --topic orders \
  --add-config min.insync.replicas=2 \
  --bootstrap-server localhost:9092

Important: min.insync.replicas only takes effect when acks=all. If the producer sends with acks=1, this setting is ignored and the leader responds alone. That is why all three settings must be aligned together — setting min.insync.replicas=2 on a topic has no effect whatsoever if the producer uses acks=1.


5. How the Three Interact — Why RF=3 · min.isr=2 · acks=all

Now let us combine all three. The most widely cited safe configuration is:

replication.factor   = 3
min.insync.replicas  = 2
acks                 = all

Let us trace step by step why this combination is called the "golden ratio."

Normal state (all 3 brokers healthy)

ISR = 3. min.isr=2 is satisfied. Because acks=all waits for the whole ISR, each write is in practice acknowledged once all 3 replicas have it. If one follower lags and drops out, the remaining 2 still satisfy min.isr and writes keep passing.

One broker fails (ISR drops to 2)

When one dies, ISR = 2. This exactly satisfies min.isr=2, so writes continue. At the same time, every message is safely written to at least 2 copies. That is, you tolerate a single broker failure while remaining writable. This is the core value of the combination.

Two brokers fail (ISR drops to 1)

Now ISR = 1, which fails to meet min.isr=2. Kafka rejects the write and the producer receives NotEnoughReplicasException. It hurts, but this is the correct behavior. If writes kept being accepted with only 1 copy left, and that last copy then died, the data would be gone forever. Kafka chooses to fail fast instead of losing data silently. The producer sees the error and can retry or fire an alert, which is far better than losing data.
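The three states traced above can be reproduced with a few lines of arithmetic. The sketch assumes that every surviving replica is in sync and that leadership moves to a survivor; `write_outcome` is a hypothetical name.

```python
def write_outcome(replication_factor, min_insync_replicas, failed_brokers):
    """Return (isr_size, outcome) for an acks=all write after some brokers fail."""
    isr_size = max(replication_factor - failed_brokers, 0)
    if isr_size == 0:
        return isr_size, "partition offline"
    if isr_size < min_insync_replicas:
        return isr_size, "rejected (NotEnoughReplicasException)"
    return isr_size, "accepted"


for failed in range(3):
    isr_size, outcome = write_outcome(3, 2, failed)
    print(f"failed={failed} isr={isr_size} -> {outcome}")
# failed=0 isr=3 -> accepted
# failed=1 isr=2 -> accepted
# failed=2 isr=1 -> rejected (NotEnoughReplicasException)
```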


To summarize the flow above in one line: the leader receives the message the producer sent, confirms that the ISR still has at least min.insync.replicas members, appends the message, and returns an ack to the producer only after every ISR follower has replicated it. That ISR-size check is the heart of the lossless guarantee.


6. The Trade-off — Durability and Availability Are a Seesaw

The higher you set min.insync.replicas, the higher the durability but the lower the availability. In the extreme, setting min.insync.replicas = RF requires all replicas to be in sync for a write to succeed, so durability is maximal — but the instant even one replica lags or dies, all writes stop. A single restart or a rolling upgrade would block writes, so this is almost never used operationally.

Conversely, min.insync.replicas=1 lets a response go out as soon as a single copy exists even with acks=all, effectively dropping you to a loss risk similar to acks=1.

The key formula:

Broker failures tolerated (while staying writable) = RF − min.insync.replicas

With RF=3, min.isr=2, that is 3 − 2 = 1, so writes survive up to 1 failure. Using this formula as the yardstick, let us compare configuration combinations.
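The formula is easy to apply to several candidate configurations at once. A minimal sketch, assuming acks=all (otherwise min.insync.replicas is not enforced) and using the hypothetical name `write_failures_tolerated`:

```python
def write_failures_tolerated(replication_factor, min_insync_replicas):
    """Broker failures a partition can absorb while acks=all writes still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


for rf, min_isr in [(3, 1), (3, 2), (3, 3), (5, 3)]:
    print(f"RF={rf} min.isr={min_isr} -> {write_failures_tolerated(rf, min_isr)}")
# RF=3 min.isr=1 -> 2
# RF=3 min.isr=2 -> 1
# RF=3 min.isr=3 -> 0
# RF=5 min.isr=3 -> 2
```

The number says only how long writes stay available. With min.isr=1 the two tolerated failures come at the cost of acknowledging writes that exist in a single copy.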

| RF | min.insync.replicas | acks | Durability | Availability (writable) | Failures tolerated | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1 | Very low | Full outage on any failure | 0 | Dev/test only |
| 3 | 1 | all | Low | Writable up to 2 failures | 2 (write), but loss possible | acks=all rendered moot; not recommended |
| 3 | 2 | 1 | Low | High | | min.isr neutralized (acks=1); not recommended |
| 3 | 2 | all | High | Writable up to 1 failure | 1 | Standard safe config ✅ |
| 3 | 3 | all | Highest | Writes stop if even 1 drops | 0 (write) | Sacrifices availability, special use |
| 5 | 3 | all | Very high | Writable up to 2 failures | 2 | Mission-critical / high-availability |

As the table shows, for most production environments RF=3 · min.insync.replicas=2 · acks=all is the balance point of durability and availability. When you need both higher availability and durability — as in finance or payments — consider raising to RF=5 · min.isr=3 to stay writable through 2 simultaneous failures.

Common Mistakes Checklist

  • Not changing acks on the producer: Setting only min.insync.replicas=2 on the topic while leaving the producer at acks=1 (the Java client default before Kafka 3.0, and still the default in some other client libraries) discards the entire quorum guarantee. Configure all three together.
  • Setting min.insync.replicas equal to RF: Chasing maximum durability with min.isr=RF blocks writes during every rolling restart.
  • Piling replicas into a single rack / single AZ: Even with RF=3, if all 3 copies sit in the same rack or availability zone, they vanish together when that zone fails. Enable rack-aware placement with broker.rack to spread the copies.
  • Leaving unclean.leader.election.enable=true: With this on, a replica outside the ISR can become leader, collapsing every guarantee built above. The mechanism and danger are covered in [Part 5 — Unclean Leader Election].
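The checklist lends itself to an automated pre-deployment check. The following is a sketch under stated assumptions: `durability_warnings` is a hypothetical helper, its arguments are values you would gather yourself from the producer, topic, and broker configuration, and `replica_racks` is the list of broker.rack values for one partition's replicas.

```python
def durability_warnings(replication_factor, min_insync_replicas, acks,
                        replica_racks, unclean_leader_election):
    """Return a warning for each checklist mistake found in the given settings."""
    warnings = []
    if acks not in ("all", "-1"):
        warnings.append("acks is not all: min.insync.replicas is not enforced")
    if replication_factor > 1 and min_insync_replicas >= replication_factor:
        warnings.append("min.insync.replicas equals RF: any replica outage blocks writes")
    if len(replica_racks) > 1 and len(set(replica_racks)) == 1:
        warnings.append("all replicas share one rack or zone")
    if unclean_leader_election:
        warnings.append("unclean.leader.election.enable is true")
    return warnings


print(durability_warnings(3, 2, "all", ["az-a", "az-b", "az-c"], False))  # []
print(len(durability_warnings(3, 3, "1", ["az-a", "az-a", "az-a"], True)))  # 4
```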

Wrapping up

  • Data durability is not a single setting but an ensemble of three. acks (what the producer waits for), replication.factor (number of copies), and min.insync.replicas (write quorum) must interlock for a guarantee to emerge.
  • acks=all waits for "all ISR", not "all replicas". Lagging followers drop out of the response path, and the floor on minimum in-sync copies is controlled by min.insync.replicas.
  • RF=3 · min.insync.replicas=2 · acks=all stays writable through a single broker failure, and when 2 die it fails fast with NotEnoughReplicasException instead of losing data silently.
  • Durability and availability are a seesaw. Use the formula failures tolerated = RF − min.insync.replicas to pick the balance point for each topic's requirements.
  • No matter how well you set these three, a single unclean.leader.election.enable=true can undo them all — that story continues in Part 5.

— The Data Dynamics Engineering Team