[Kafka Ops 5] Unclean Leader Election — Trading Data for Availability

When every in-sync replica for a partition dies, Kafka stands at a fork in the road: keep the partition offline to protect data, or promote a lagging replica to leader to restore availability at the cost of data loss. Here is what unclean.leader.election.enable really means and how to decide.

Data Dynamics · June 4, 2026 · 13 min read

3 a.m. A single partition has gone completely silent. Producers can't send, consumers have stalled, and the OfflinePartitionsCount on your dashboard shows a number that is not zero. You trace it down: every in-sync replica for that partition is down. Now a cruel fork in the road lies before you. Do you wait and protect the data, or do you promote a lagging replica to leader to bring the service back — knowing some data will be lost forever? That choice has a name: Unclean Leader Election.

What you'll learn in this post

  • What leaders, followers, and the ISR are, and how a follower falls out of the ISR
  • The two choices Kafka faces when every in-sync replica for a partition is down
  • What unclean.leader.election.enable (true/false) means operationally
  • How to detect offline partitions, silent data loss, and offset truncation
  • The (rare) cases where enabling it is acceptable, and when you must never enable it

This is Part 5 of the Kafka Ops Troubleshooting series. The fundamentals of replication and the ISR were covered in Part 3 (ISR and Replication Lag), and how to guarantee durability with acks and min.insync.replicas was covered in Part 4 (Producer Durability). This installment is about the worst moment — when all of those safety nets have collapsed.


1. Recap — Leaders, Followers, and the ISR

Before diving into the scenario, let's revisit the three concepts you absolutely need to understand this post.

Leaders and followers

Each Kafka partition has as many replicas as replication.factor. One of them is the leader, the rest are followers.

  • All reads and writes go through the leader. Producers send only to the leader, and consumers (by default) also read only from the leader.
  • Followers simply fetch and replicate the leader's log. A follower periodically sends fetch requests to the leader to keep up with new messages.

When the leader dies, the controller elects one of the surviving replicas as the new leader. The thing that decides "who is even eligible to become leader" is the ISR.

ISR (In-Sync Replicas)

The ISR is the set of replicas that are sufficiently in sync with the leader. The leader itself is always part of the ISR.

| Term | Meaning |
|------|---------|
| AR (Assigned Replicas) | All replicas assigned to the partition |
| ISR (In-Sync Replicas) | The set of replicas kept in sync with the leader |
| OSR (Out-of-Sync Replicas) | Lagging replicas dropped from the ISR (AR - ISR) |

The very definition of "committed" hinges on this. A message is considered committed only once it has been replicated to all replicas in the ISR, and consumers can only read committed messages (exposed only up to the High Watermark). That's why ISR members are the replicas you can trust to "definitely have this data," and they are the first-class candidates for leader election.
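
To make the definition concrete, here is a minimal Python sketch of how the High Watermark follows from the ISR. It is a simplified model rather than Kafka's actual implementation: the function name `high_watermark`, the broker names, and the offsets are illustrative assumptions.

```python
def high_watermark(log_end_offsets, isr):
    """Return the offset up to which every ISR member has replicated.

    log_end_offsets: dict mapping replica name -> log-end-offset (LEO)
    isr: set of replica names currently in sync (always includes the leader)
    """
    if not isr:
        raise ValueError("ISR must contain at least the leader")
    # A record is committed once all ISR members have it, so the
    # high watermark is the smallest LEO among the ISR members.
    return min(log_end_offsets[r] for r in isr)


# Hypothetical offsets for illustration only.
leos = {"Broker1": 120, "Broker2": 118, "Broker3": 90}

# With all three replicas in the ISR, the slowest one bounds what consumers can read.
hw_full = high_watermark(leos, {"Broker1", "Broker2", "Broker3"})

# Once Broker3 is dropped from the ISR, it no longer holds the watermark back.
hw_shrunk = high_watermark(leos, {"Broker1", "Broker2"})

print(hw_full, hw_shrunk)
```

In this model, the records between Broker3's offset and the new watermark become committed without ever reaching Broker3, which is exactly the data at stake later in this post.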

How a follower falls out of the ISR

If a follower slows down or briefly dies, it can't keep up with the leader's log. When that lag exceeds a time threshold, the leader removes the follower from the ISR. That threshold is replica.lag.time.max.ms.

# broker config (default 30s)
replica.lag.time.max.ms=30000
  • If a follower fails to catch up to the leader's log-end-offset for longer than replica.lag.time.max.ms → removed from the ISR (demoted to OSR).
  • Once the follower catches up again → rejoins the ISR.

The key point: the ISR is not a fixed set — it is a living set that shrinks and grows moment to moment. Under load spikes or slow disk/network, followers can fall out of the ISR one by one, and in the worst case the ISR can shrink down to just the leader alone. That is exactly where the next scenario begins.
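
The time-based rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function `current_isr` and the idea of tracking a single "last caught up" timestamp per follower are illustrative, and the timestamps are made up.

```python
REPLICA_LAG_TIME_MAX_MS = 30_000  # mirrors replica.lag.time.max.ms


def current_isr(leader, last_caught_up_ms, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return the set of replicas that count as in sync at time now_ms.

    last_caught_up_ms: dict of follower -> last time (ms) at which it had
    fully caught up to the leader's log-end-offset.
    """
    isr = {leader}  # the leader is always part of its own ISR
    for follower, caught_up_at in last_caught_up_ms.items():
        # A follower stays in the ISR only while its lag is within the threshold.
        if now_ms - caught_up_at <= max_lag_ms:
            isr.add(follower)
    return isr


# Broker3 last caught up 40 s ago, so it is dropped; Broker2 is 5 s behind and stays.
shrunk = current_isr("Broker1", {"Broker2": 95_000, "Broker3": 60_000}, now_ms=100_000)

# After Broker3 catches up again, it rejoins.
regrown = current_isr("Broker1", {"Broker2": 101_000, "Broker3": 101_500}, now_ms=102_000)

print(sorted(shrunk), sorted(regrown))
```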

How the ISR shrinks and grows, and how to monitor under-replicated partitions, is covered in detail in Part 3.


2. The Scenario — When Every In-Sync Replica Dies

Now the main event. Suppose every replica in a partition's ISR goes down at once (or in a cascade).

For example, on a partition with replication.factor=3:

  1. Initially ISR = {Broker1, Broker2, Broker3} (healthy).
  2. Broker3 slows down due to a GC storm and drops out → ISR = {Broker1, Broker2}, OSR = {Broker3}.
  3. Right after, Broker1 (the leader) dies from a disk failure → the controller elects Broker2 as the new leader. ISR = {Broker2}.
  4. Then Broker2 also dies in a power outage. Now there is not a single live replica left in the ISR.

All that remains is Broker3 from the OSR. Broker3 is alive, but it does not have the data written after the moment it last fell out of the ISR. There are definitely messages that Broker2 committed but were never replicated to Broker3.
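
The cascade can be traced with a small state model. This is only a sketch: the helpers `drop_from_isr` and `fail_broker` are hypothetical names, the clean election simply picks the first surviving ISR member by name, and a failed follower is removed from the ISR immediately rather than after a lag timeout.

```python
def drop_from_isr(state, broker):
    """A live follower lags too long and the leader removes it from the ISR."""
    return {**state, "isr": state["isr"] - {broker}}


def fail_broker(state, broker):
    """A broker dies. If it was the leader, try a clean election from the ISR."""
    live = state["live"] - {broker}
    if broker != state["leader"]:
        return {**state, "live": live, "isr": state["isr"] - {broker}}
    survivors = sorted((state["isr"] - {broker}) & live)
    if survivors:
        # Clean election: a surviving ISR member takes over (first by name here).
        return {"leader": survivors[0], "isr": state["isr"] - {broker}, "live": live}
    # No in-sync replica is alive, so the partition has no leader.
    # The last ISR is remembered as the set that holds all committed data.
    return {"leader": None, "isr": state["isr"], "live": live}


all_brokers = {"Broker1", "Broker2", "Broker3"}
state = {"leader": "Broker1", "isr": set(all_brokers), "live": set(all_brokers)}  # step 1
state = drop_from_isr(state, "Broker3")  # step 2: GC storm
state = fail_broker(state, "Broker1")    # step 3: disk failure, Broker2 takes over
state = fail_broker(state, "Broker2")    # step 4: power outage

live_isr = state["isr"] & state["live"]  # in-sync replicas that are still alive
live_osr = state["live"] - state["isr"]  # lagging replicas that are still alive
print(state["leader"], live_isr, live_osr)
```

At the end of the trace there is no leader, no live in-sync replica, and exactly one live out-of-sync replica, which is the situation the rest of this section deals with.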

From the controller's point of view, there are exactly two options.

Option (a) — Wait (Clean): partition offline, data preserved

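Wait until a replica from the last ISR comes back to life. In the example above that means Broker2: Broker1 left the ISR when it failed in step 3, so it may be missing records that Broker2 committed afterward. A member of the last ISR is guaranteed to hold all committed data, so when it becomes leader there is zero data loss.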

But there's a price. Until it returns, the partition stays offline with no leader.

  • Producers cannot write to this partition (after a metadata refresh they retry on NotLeaderOrFollowerException and eventually time out).
  • Consumers can no longer read this partition.
  • The partition is unavailable, but not a single record is lost.

Option (b) — Unclean Election: lagging replica becomes leader, data is lost

Force Broker3 from the OSR to become the new leader. The partition comes back online immediately and can be read and written again.

But there's a fatal price. Because Broker3 was lagging:

  • Messages that were committed up to Broker2 but never reached Broker3 are gone forever. For the producer that received an ack for those messages, data it was told was successful has simply vanished.
  • Because the new leader's (Broker3's) log-end-offset is smaller than the previous leader's, the partition's offsets are truncated backward. When Broker1/Broker2 later return, they must truncate their own longer logs down to the new leader's (Broker3's) shorter log. In other words, even the data on the perfectly healthy replicas that came back gets thrown away.

| Aspect | (a) Wait (Clean) | (b) Unclean Election |
|--------|------------------|----------------------|
| Partition state | offline (unavailable) | online (available) |
| Data loss | none | yes (committed messages lost) |
| Offsets | preserved | can be truncated |
| Recovery time | only when an ISR member returns | immediate |
| Value favored | Durability | Availability |

This is the classic distributed-systems trade-off. Choose consistency/durability and you lose availability; choose availability and you lose consistency/durability. Kafka delegates that choice to you in a single line of configuration.
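
The two options can be compared side by side in a short sketch. It is a simplified model, not the controller's real logic: `elect_leader` is a hypothetical function, the unclean branch just picks the first live replica by name, and the offsets are illustrative values.

```python
def elect_leader(leos, last_isr, live, unclean_enabled):
    """Choose a leader for a leaderless partition.

    leos: replica -> log-end-offset
    last_isr: the ISR as it stood before the final failure
    live: replicas that are currently alive
    Returns (leader, records_at_risk); leader is None if the partition stays offline.
    """
    clean_candidates = sorted(last_isr & live)
    if clean_candidates:
        # Option (a) succeeded: an in-sync replica is available, nothing is lost.
        return clean_candidates[0], 0
    if not unclean_enabled or not live:
        # Option (a), still waiting: stay offline and keep the data safe.
        return None, 0
    # Option (b): promote a lagging replica (first by name in this sketch).
    leader = sorted(live)[0]
    committed_end = max(leos[r] for r in last_isr)
    return leader, max(0, committed_end - leos[leader])


leos = {"Broker1": 1_000_000, "Broker2": 1_000_420, "Broker3": 998_800}

waiting = elect_leader(leos, {"Broker2"}, {"Broker3"}, unclean_enabled=False)
unclean = elect_leader(leos, {"Broker2"}, {"Broker3"}, unclean_enabled=True)
recovered = elect_leader(leos, {"Broker2"}, {"Broker2", "Broker3"}, unclean_enabled=False)

print(waiting, unclean, recovered)
```

With these numbers, waiting loses nothing but leaves the partition leaderless, while the unclean path restores a leader at the cost of every record the old leader held beyond the lagging replica's log end.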


3. The Setting — unclean.leader.election.enable

The setting that decides which of the two paths above to take is unclean.leader.election.enable.

# set at the broker (server.properties) or topic level
unclean.leader.election.enable=false

| Value | Behavior | Meaning |
|-------|----------|---------|
| false (default) | wait when the ISR is empty | No data loss, but the partition stays offline. Durability first. |
| true | elect an OSR replica as leader | Immediate recovery, but possible data loss. Availability first. |

The default is false — and that's correct

In older Kafka (before 0.11.0.0) the default was true. Silently dropping data for the sake of availability was the default behavior. That led to the operator's nightmare of "I got an ack but the data disappeared," so Kafka switched the default to false. Modern Kafka favors durability over availability by default. Don't flip this to true without a specific reason.

Configuration level — broker vs topic

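  • Broker level (unclean.leader.election.enable in server.properties): the cluster-wide default policy. A change made in server.properties requires a broker restart; the setting can also be updated dynamically as a cluster-wide broker default with kafka-configs.sh (--entity-type brokers --entity-default), which takes effect without a restart.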
  • Topic level (dynamic config): lets you apply a different policy to specific topics. Applies immediately without a restart.
# Allow unclean election for one topic only (availability-first topic)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name clickstream-raw \
  --alter --add-config unclean.leader.election.enable=true
 
# Revert back to durability-first
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name clickstream-raw \
  --alter --delete-config unclean.leader.election.enable
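
The precedence between the two levels can be pictured as a simple lookup. This sketch is illustrative only: `effective_unclean_setting` is a hypothetical helper and the dictionary stands in for the topic-level overrides stored by the cluster.

```python
def effective_unclean_setting(topic, topic_overrides, broker_default=False):
    """A topic-level override wins; otherwise the broker-level default applies."""
    return topic_overrides.get(topic, broker_default)


# State after the first command above: one availability-first topic.
overrides = {"clickstream-raw": True}
clickstream_before = effective_unclean_setting("clickstream-raw", overrides)
orders = effective_unclean_setting("orders", overrides)

# State after the second command (--delete-config): the override is gone.
overrides.pop("clickstream-raw")
clickstream_after = effective_unclean_setting("clickstream-raw", overrides)

print(clickstream_before, orders, clickstream_after)
```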

Caution: If you dynamically flip this to true at the broker level, every partition currently sitting offline with an empty ISR will immediately trigger an unclean election and may lose data. Turning this on mid-incident with a "just bring it back up" mindset invites irreversible data loss — keep that in mind.


4. Symptoms and Detection

This problem has two faces. With false (the default) it shows up as a loud failure (offline partition); with true it shows up as silent data loss. You need to know how to detect both.

When false — offline partitions (loud failure)

When the ISR is empty and unclean election is disabled, the partition becomes offline with no leader.

① The key metric: OfflinePartitionsCount

This metric must be 0, and if it's anything other than 0 you should alert immediately — it's a first-priority indicator.

# JMX MBean
kafka.controller:type=KafkaController,name=OfflinePartitionsCount

| Metric | Healthy value | Meaning |
|--------|---------------|---------|
| OfflinePartitionsCount | 0 | Number of partitions with no leader (offline). Collected from the controller. |
| ActiveControllerCount | 1 (sum across cluster) | Number of active controllers |
| UnderMinIsrPartitionCount | 0 | Partitions that dropped below min.insync.replicas |
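
As a rough illustration, an alert check over these three metrics might look like the sketch below. It assumes the values have already been scraped from JMX into a plain dictionary (scraping itself is out of scope), and `evaluate` is a hypothetical helper rather than part of any monitoring tool.

```python
HEALTHY = {
    "OfflinePartitionsCount": 0,
    "ActiveControllerCount": 1,  # summed across all brokers
    "UnderMinIsrPartitionCount": 0,
}


def evaluate(metrics):
    """Return alert messages for metrics that deviate from their healthy value."""
    alerts = []
    for name, healthy in HEALTHY.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric means health cannot be confirmed, so flag it too.
            alerts.append(f"{name}: missing (cannot confirm health)")
        elif value != healthy:
            alerts.append(f"{name}: {value} (expected {healthy})")
    return alerts


incident = evaluate({
    "OfflinePartitionsCount": 3,
    "ActiveControllerCount": 1,
    "UnderMinIsrPartitionCount": 0,
})
print(incident)
```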

② Check via CLI — partitions with no leader

kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic orders
Topic: orders  Partition: 7  Leader: none  Replicas: 1,2,3  Isr:

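Leader: none is the smoking gun: with no leader, all reads and writes to this partition are blocked. The Isr: field is shown empty here; depending on the Kafka version, it may instead still list the last replica that was in sync (the one a clean election is waiting for).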
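
If you want to automate this check, the describe output can be filtered with a small parser. The sketch below assumes the line layout shown above and treats both `none` and `-1` as "no leader", since the exact rendering can differ between versions; `leaderless_partitions` is a hypothetical helper.

```python
import re

# Matches one partition line of `kafka-topics.sh --describe` output.
LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]*)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def leaderless_partitions(describe_output):
    """Return (topic, partition, isr) for every partition without a leader."""
    offline = []
    for line in describe_output.splitlines():
        m = LINE.search(line)
        if m and m.group("leader") in ("none", "-1"):
            isr = [b for b in m.group("isr").split(",") if b]
            offline.append((m.group("topic"), int(m.group("partition")), isr))
    return offline


sample = """\
Topic: orders  Partition: 6  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
Topic: orders  Partition: 7  Leader: none  Replicas: 1,2,3  Isr:
"""
offline = leaderless_partitions(sample)
print(offline)
```

kafka-topics.sh also accepts --unavailable-partitions together with --describe, which restricts the output to partitions whose leader is not available and may remove the need for parsing altogether.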

③ Client symptoms

  • Producers: sends to this partition stall and eventually fail with TimeoutException (no metadata, or no leader).
  • Consumers: lag on this partition stops decreasing and plateaus.

When true — silent data loss (quiet failure)

With unclean election enabled, the partition comes back to life just fine. The fact that everything looks okay on the surface is the scary part. But internally:

① log-end-offset goes backward

As the lagging replica becomes leader, the partition's end offset (LEO) becomes smaller than before. Plot the LEO over time and a graph that should monotonically increase jumps backward for a moment.

# log-end-offset over time
12:00:00  LEO=1,000,000
12:00:05  LEO=1,000,420   <- Broker2 is leader, normal growth
12:00:10  LEO=  998,800   <- unclean election to Broker3! 1,620 records vanish + offset goes back
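
A backward jump like this is easy to flag once the end offsets are collected as a time series. The following sketch assumes you already sample the LEO periodically; `find_leo_regressions` is a hypothetical helper, and the sample values are the ones from the trace above.

```python
def find_leo_regressions(samples):
    """Find points where the log-end-offset went backward.

    samples: list of (timestamp, leo) tuples in time order.
    Returns a list of (timestamp, previous_leo, leo, records_dropped).
    """
    regressions = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur < prev:
            regressions.append((ts, prev, cur, prev - cur))
    return regressions


samples = [
    ("12:00:00", 1_000_000),
    ("12:00:05", 1_000_420),
    ("12:00:10", 998_800),
]
regressions = find_leo_regressions(samples)
print(regressions)
```

Note that a deleted and recreated topic also resets its end offset, so a hit is worth correlating with leader-change events before treating it as data loss.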

② Truncation warnings in broker logs

When Broker1/Broker2 later recover, they truncate their own longer logs down to the new leader's shorter log. The broker logs leave messages like these.

WARN [ReplicaFetcher ...] Truncating partition orders-7 to offset 998800
      because the leader's log start/end offset is smaller ...
INFO Truncating log orders-7 to offset 998800, discarding 1620 records
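
Scanning broker logs for these messages can be scripted. The sketch below assumes the wording shown above; the exact log format varies between Kafka versions, so treat the regular expression as a starting point. `find_truncations` is a hypothetical helper.

```python
import re

# Matches "Truncating [log|partition] <topic>-<partition> to offset <n>".
TRUNCATION = re.compile(
    r"Truncating (?:log |partition )?(?P<tp>\S+-\d+) to offset (?P<offset>\d+)"
)


def find_truncations(log_lines):
    """Return (topic, partition, offset) for each truncation message found."""
    hits = []
    for line in log_lines:
        m = TRUNCATION.search(line)
        if m:
            # Topic names may contain dashes, so split on the last one.
            topic, _, partition = m.group("tp").rpartition("-")
            hits.append((topic, int(partition), int(m.group("offset"))))
    return hits


lines = [
    "WARN [ReplicaFetcher ...] Truncating partition orders-7 to offset 998800",
    "INFO Truncating log orders-7 to offset 998800, discarding 1620 records",
    "INFO Completed load of log payments-0",
]
hits = find_truncations(lines)
print(hits)
```

Followers can also truncate during ordinary, clean leader changes, so a truncation message alone does not prove an unclean election. It becomes meaningful when it coincides with an end offset that moved backward.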

③ Detection strategy

| Signal | Where to look |
|--------|---------------|
| LEO decreasing | Per-broker LogEndOffset metric over time, or topic end-offset monitoring |
| Log truncation | Truncating ... to offset (WARN/INFO) in broker logs |
| ISR change + leader change | Controller logs, LeaderElectionRateAndTimeMs, UncleanLeaderElectionsPerSec |
| Missing acked data | Application-level validation (message key/sequence gaps) |

Silent data loss is hard to catch 100% from metrics alone. If data integrity matters, add a defensive line that validates message sequence/key gaps at the application level.
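
One possible shape for such a check is sketched below. It assumes each producer embeds its own monotonically increasing sequence number in the message payload, which is an application-level convention and not something Kafka provides to consumers; `find_sequence_gaps` and the producer names are hypothetical.

```python
def find_sequence_gaps(records):
    """Report missing sequence numbers per producer.

    records: iterable of (producer_id, sequence) in consumption order.
    Returns a dict of producer_id -> list of missing sequence numbers.
    """
    last_seen = {}
    gaps = {}
    for producer_id, seq in records:
        prev = last_seen.get(producer_id)
        if prev is not None and seq > prev + 1:
            # Everything strictly between prev and seq never arrived.
            gaps.setdefault(producer_id, []).extend(range(prev + 1, seq))
        if prev is None or seq > prev:
            last_seen[producer_id] = seq
    return gaps


consumed = [
    ("svc-a", 1), ("svc-a", 2), ("svc-a", 5),
    ("svc-b", 10), ("svc-b", 11), ("svc-a", 6),
]
gaps = find_sequence_gaps(consumed)
print(gaps)
```

A gap reported here for records the producer saw acknowledged is the application-level evidence that metrics alone cannot give you.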


5. Decision Guidance — When to Enable, When to Disable

In the end there is only one question: "On this topic, is one record worth more than a few minutes of downtime?"

Cases where you must leave it as is (keep false) — most of them

If data integrity matters even a little, the answer is false, unconditionally.

  • Financial transactions, payments, orders, settlements — losing even one record is unacceptable.
  • Event Sourcing / CDC — the log is the source of truth. Offset truncation means state inconsistency.
  • Durability hard-won with acks=all + min.insync.replicas>=2 — unclean election nullifies that guarantee in one shot. You'd be undoing the durability you carefully built in Part 4.
  • Any case where you'd rather the partition pause briefly than lose data.

Cases where (rarely) enabling it is OK (true)

Consider it on a per-topic basis only when a single record is low-value and availability or data freshness clearly matters more than completeness.

  • Low-value, high-volume telemetry/metrics — missing a few seconds of data barely affects trend analysis.
  • Parts of clickstream/log collection that can be lost — data you treat with sampling/approximation anyway.
  • Real-time monitoring feeds where downtime costs more than data loss — when "latest state" matters more than "complete history."

Even then, the rule is to apply it as a per-topic override on that topic only, not the whole cluster. Keep the cluster default at false and treat the explicitly availability-first topics as exceptions.
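
The guidance in this section can be condensed into a small helper for review checklists. It is a deliberately crude sketch: the three yes/no inputs and the function `recommend_unclean_election` are assumptions made for illustration, and a real decision deserves a conversation with the data owners.

```python
def recommend_unclean_election(record_loss_acceptable,
                               downtime_costlier_than_loss,
                               is_source_of_truth):
    """Suggest a topic-level value for unclean.leader.election.enable."""
    # Logs that act as the source of truth, or any topic where losing a
    # record is unacceptable, always stay on the durable default.
    if is_source_of_truth or not record_loss_acceptable:
        return False
    # Otherwise enable it only when downtime really costs more than loss.
    return downtime_costlier_than_loss


orders = recommend_unclean_election(False, True, True)
clickstream = recommend_unclean_election(True, True, False)
batch_logs = recommend_unclean_election(True, False, False)

print(orders, clickstream, batch_logs)
```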


Wrapping up

Unclean Leader Election is not a "bug" — it's a decision delegated to you. When Kafka reaches a moment where it cannot guarantee both data and availability at 100%, it puts that choice into your hands as a single line of config.

  • When every ISR member of a partition dies, Kafka stands between two roads: (a) wait offline, or (b) promote a lagging replica to leader.
  • unclean.leader.election.enable is the trade-off switch between false (durability, the default) and true (availability). The modern default of false is the sensible choice — don't change it casually.
  • The symptom when false is offline partitions. Alert on OfflinePartitionsCount first, and confirm with kafka-topics.sh --describe showing Leader: none and an empty Isr:.
  • The symptom when true is silent data loss. Watch for LEO going backward and for Truncating ... to offset in the broker logs.
  • The decision rule is simple: if data integrity matters, keep it off; enable it per-topic only on low-value topics where downtime is more expensive than losing a record.
  • The best defense is to keep the ISR from emptying in the first place. Lower the probability of this extreme situation with replication.factor>=3, rack-aware placement, min.insync.replicas>=2, and the producer durability settings from Part 4.

In the next installment we'll cover consumer group rebalancing and stuck consumers — the stage on which all of this replication behavior plays out.


— The Data Dynamics Engineering Team