[Kafka DR 4] Failover & Failback Runbook — Actually Switching Over
An operational runbook for MirrorMaker 2-based Kafka DR: how to fail over to the DR cluster when disaster strikes, and how to fail back to the original cluster once it recovers — step by step. Covers the topic-renaming gotcha, offset translation, and duplicate/reprocessing reconciliation.
3 AM. The primary data center is gone — all of it. Monitoring is a wall of red and the on-call phone is ringing. What you need to reach for now is not heroic improvisation but a runbook that is already written and already familiar in your hands. If Parts 1–3 covered the replication architecture, the MirrorMaker 2 setup, and offset translation, this Part 4 turns all of that preparation into the actual buttons you press. Failover is not a question of "can we" — it's a question of "do it in order."
What you'll learn in this post
- The preconditions that must be in place before you ever open the failover runbook
- An ordered failover procedure, from declaring the disaster to repointing clients
- A failback procedure to return safely to the original cluster after recovery
- The DefaultReplicationPolicy topic-renaming (prefix) gotcha, and how IdentityReplicationPolicy changes the picture
- The duplicate/reprocessing reconciliation problem and why idempotent consumers are non-negotiable
- A checklist an on-call engineer can literally follow at 3 AM, plus RTO expectations
1. Preconditions — Before You Open the Runbook
A failover runbook does not operate in a vacuum. The conditions below must be satisfied ahead of time so that the 3 AM procedure doesn't wobble. Miss even one and failover becomes a gamble, not a switchover.
1.1 Is MM2 replication healthy?
A DR cluster is only meaningful if MirrorMaker 2 is replicating data primary → DR in real time. Replication lag must be under control during normal operation, not just at the moment of failover.
# Check MM2 source connector replication lag (JMX metrics)
# kafka.connect.mirror:type=MirrorSourceConnector,...
# replication-latency-ms : time from a record being written at source to landing at target
# record-age-ms : age of a record that just arrived at target
# End-to-end lag as seen by a consumer group (on the DR side)
kafka-consumer-groups.sh --bootstrap-server dr-broker:9092 \
--describe --group __replication-health-probe
| Signal | Healthy | Warning | Danger |
|---|---|---|---|
| replication-latency-ms (p99) | < a few seconds | tens of seconds | sustained minutes |
| MM2 connector state | RUNNING | RUNNING (some tasks FAILED) | FAILED |
| DR topic latest offset advancing | continuously increasing | stalling | stopped |
If replication is already behind when the primary dies, that backlog becomes data loss (RPO) verbatim. Remember: RPO is determined by everyday monitoring, not by the failover procedure.
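The thresholds in the table can be encoded as a small classifier for an alerting job. The sketch below is only an illustration and not part of MM2: the class and method names are hypothetical, and the cut-offs (5 seconds and 60 seconds) are assumed stand-ins for "a few seconds" and "sustained minutes" that you would tune to your own RPO target.

```java
import java.time.Duration;

/** Hypothetical helper: maps the signals from the table above to a health level. */
final class ReplicationHealth {
    enum Level { HEALTHY, WARNING, DANGER }

    // Assumed thresholds; the table only says "a few seconds" and "sustained minutes".
    static final Duration HEALTHY_MAX = Duration.ofSeconds(5);
    static final Duration DANGER_MIN = Duration.ofSeconds(60);

    /**
     * @param p99LatencyMs   p99 of replication-latency-ms
     * @param connectorState MM2 connector state, e.g. "RUNNING" or "FAILED"
     * @param failedTasks    number of connector tasks currently FAILED
     */
    static Level classify(long p99LatencyMs, String connectorState, int failedTasks) {
        if (!"RUNNING".equals(connectorState) || p99LatencyMs >= DANGER_MIN.toMillis()) {
            return Level.DANGER;
        }
        if (failedTasks > 0 || p99LatencyMs >= HEALTHY_MAX.toMillis()) {
            return Level.WARNING;
        }
        return Level.HEALTHY;
    }

    public static void main(String[] args) {
        System.out.println(classify(1_200, "RUNNING", 0));   // low latency, all tasks up
        System.out.println(classify(25_000, "RUNNING", 1));  // tens of seconds, one task failed
        System.out.println(classify(900, "FAILED", 0));      // connector down
    }
}
```

Feeding this from the JMX metrics on a schedule gives you the "everyday monitoring" that actually sets your RPO.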
1.2 Is offset translation working? (recap of Part 3)
The heart of failover is having consumers resume at the correct position on the DR cluster. Because source and target offsets generally differ (log start offsets, compaction, and retention are not the same on both clusters), a committed offset cannot be copied across as is; it must be translated.
- Set sync.group.offsets.enabled = true so MM2 periodically translates each group's committed offsets and writes them to the DR cluster's __consumer_offsets.
- Or, at failover time, translate source offsets to target offsets with RemoteClusterUtils.translateOffsets().
# MM2 connect-mirror-maker.properties (values set in Part 3 — reconfirm here)
primary->dr.emit.checkpoints.enabled = true
primary->dr.sync.group.offsets.enabled = true
primary->dr.sync.group.offsets.interval.seconds = 10
primary->dr.refresh.groups.interval.seconds = 10
Without this, after failover consumers will read from the earliest or latest position per auto.offset.reset. The former means mass reprocessing; the latter means data loss. Both are disasters.
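A worked example with made-up numbers shows why both fallbacks hurt. The sketch below is illustrative only: the class, the method names, and every offset are assumptions, and it models a single partition on the DR cluster.

```java
/** Hypothetical arithmetic for one DR partition when no translated offset exists. */
final class OffsetResetCost {
    /** Records read again if the consumer falls back to auto.offset.reset=earliest. */
    static long reprocessedWithEarliest(long logStartOffset, long truePosition) {
        return Math.max(0, truePosition - logStartOffset);
    }

    /** Records never read if the consumer falls back to auto.offset.reset=latest. */
    static long skippedWithLatest(long logEndOffset, long truePosition) {
        return Math.max(0, logEndOffset - truePosition);
    }

    public static void main(String[] args) {
        long logStart = 1_000_000L;      // assumed earliest retained offset on DR
        long logEnd = 9_000_000L;        // assumed latest offset on DR
        long truePosition = 8_990_000L;  // where the group really was, in DR offsets

        System.out.println("earliest re-reads "
                + reprocessedWithEarliest(logStart, truePosition) + " records");
        System.out.println("latest skips "
                + skippedWithLatest(logEnd, truePosition) + " records");
    }
}
```

With these assumed values, earliest replays 7,990,000 records and latest silently drops 10,000, which is the trade-off that offset translation exists to avoid.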
1.3 Do clients use an indirection layer?
This is the most frequently missed precondition. If a client's bootstrap.servers is a hardcoded primary broker address, failover means redeploying every application. Not a 3 AM task.
Instead, inject the bootstrap address through a repointable indirection layer.
| Mechanism | How to switch | Propagation delay | Watch out for |
|---|---|---|---|
| DNS CNAME | change the record | governed by TTL (keep TTL short, e.g. 30s) | client DNS caching, JVM networkaddress.cache.ttl |
| Config Service | change a key + client restart/refresh | instant to a few seconds | client must detect the change and reconnect |
| L4/L7 load balancer | switch the backend pool | instant | beware health-check false positives |
The point: you must be able to change the bootstrap address without touching client code.
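On the client side, this might look like the sketch below. It is a minimal illustration under stated assumptions: the class name and the KAFKA_BOOTSTRAP key are hypothetical, and the settings map stands in for whatever your environment or config service provides. The only real API used is java.security.Security.setProperty with the JVM's networkaddress.cache.ttl property.

```java
import java.security.Security;
import java.util.Map;
import java.util.Properties;

/** Hypothetical client bootstrap: reads the Kafka address from an indirection layer. */
final class BootstrapIndirection {
    /**
     * Caps the JVM's positive DNS cache so a changed CNAME is picked up quickly.
     * Call this early in startup, before the process resolves its first hostname.
     */
    static void capDnsCache(int ttlSeconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(ttlSeconds));
    }

    /** Builds client properties from external settings; fails fast if the address is missing. */
    static Properties clientProps(Map<String, String> settings) {
        String bootstrap = settings.get("KAFKA_BOOTSTRAP"); // assumed key name
        if (bootstrap == null || bootstrap.isBlank()) {
            throw new IllegalStateException("KAFKA_BOOTSTRAP is not set");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrap);
        return props;
    }

    public static void main(String[] args) {
        capDnsCache(30);
        // In production this map could be System.getenv() or a config-service snapshot.
        Properties props = clientProps(Map.of("KAFKA_BOOTSTRAP", "kafka.internal:9092"));
        System.out.println(props.getProperty("bootstrap.servers"));
    }
}
```

Failing fast on a missing address is a deliberate choice here: a client that silently falls back to a compiled-in default is exactly the hardcoding this section warns against.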
1.4 Are consumers idempotent?
MM2-based failover is fundamentally at-least-once. Offset translation is approximate, and at the replication boundary some messages will be processed twice. So a consumer must produce the same result whether it sees a message once or twice.
// Idempotent handling: dedup by message key + event ID
// (needs java.nio.charset.StandardCharsets; assumes every record carries an "event-id" header)
String eventId = new String(record.headers().lastHeader("event-id").value(), StandardCharsets.UTF_8);
String dedupKey = record.key() + ":" + eventId; // Header.value() is a byte[], so decode it first
if (processedStore.putIfAbsent(dedupKey, true) == null) {
    process(record); // actually process only on first sight
}
// Already-seen key → silently skip (safe to reprocess)
Idempotency is not an option for failover; it's a requirement. Fail over with non-idempotent consumers and you get duplicate orders, double billing, and duplicate notifications.
Precondition gate: Unless all four are satisfied ((1) MM2 healthy, (2) offset translation working, (3) indirection layer in place, (4) idempotent consumers), fix the gaps before following this runbook. A runbook cannot substitute for preparation.
2. Failover Procedure (Order Guaranteed)
In failover, order is correctness. Skip a step or reorder it and you add data loss or duplication. Follow the procedure as written.
2.1 (a) Detect and declare the disaster
First, judge whether this really is a disaster. Triggering failover on a transient network glitch costs you split-brain and needless data reconciliation.
| Criterion | Fail over | Wait |
|---|---|---|
| Entire primary cluster unreachable (DC down) | O | |
| Simultaneous loss of many brokers + ISR collapse | O | |
| Single broker failure | | O (Kafka's own replication handles it) |
| Transient network partition (tens of seconds) | | O (wait for recovery first) |
Declaring a disaster is an explicit human act. Automatic failover carries a high split-brain risk, so we put a human "declaration" gate in front of it. Record the timestamp and the decision-maker in the incident channel the moment you declare.
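The decision table can also back an advisory check that pre-fills the incident ticket while a human makes the call. The following is a sketch, not an automatic failover trigger: the names are hypothetical, and both the 120-second grace period and the reading of "many brokers" as "at least half" are assumptions.

```java
/** Hypothetical advisory helper; the final declaration stays a human decision. */
final class DisasterTriage {
    enum Advice { RECOMMEND_FAILOVER, WAIT }

    /**
     * @param reachableBrokers primary brokers that still answer
     * @param totalBrokers     brokers configured in the primary cluster
     * @param isrCollapsed     true if partitions have lost their in-sync replicas
     * @param outageSeconds    how long the condition has lasted
     */
    static Advice advise(int reachableBrokers, int totalBrokers,
                         boolean isrCollapsed, long outageSeconds) {
        // Assumed grace period so a blip of tens of seconds never suggests failover.
        final long graceSeconds = 120;
        if (outageSeconds < graceSeconds) {
            return Advice.WAIT;
        }
        boolean clusterGone = reachableBrokers == 0;
        // "Many brokers" is assumed to mean at least half of the cluster.
        boolean manyBrokersLost = (totalBrokers - reachableBrokers) * 2 >= totalBrokers;
        if (clusterGone || (manyBrokersLost && isrCollapsed)) {
            return Advice.RECOMMEND_FAILOVER;
        }
        return Advice.WAIT; // e.g. a single broker failure: Kafka's own replication handles it
    }

    public static void main(String[] args) {
        System.out.println(advise(0, 6, true, 300));   // whole cluster unreachable
        System.out.println(advise(5, 6, false, 300));  // single broker down
        System.out.println(advise(0, 6, true, 30));    // too early to tell
    }
}
```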
2.2 (b) Stop producers to the primary (if possible)
If the primary is partially reachable (e.g., some brokers survive, the network partly recovers), stop the producers writing to it first. This prevents accumulating more messages that are written only to the primary and never replicated to DR after the failover point.
# How you stop producers varies by environment:
# - a "produce: paused" flag in the indirection (config service)
# - or scale the producer app down temporarily (k8s: replicas=0)
kubectl scale deployment order-producer --replicas=0 -n prod
If the primary is fully unreachable, skip this step and accept the in-flight RPO: the messages written only to the primary that never replicated to DR are lost. Record an estimate of this loss in the incident log.
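For the incident log, a rough loss estimate is the last known replication lag multiplied by the produce rate. The sketch below shows that arithmetic; the class name and the sample numbers are assumptions, and the result is only as good as your last lag snapshot.

```java
/** Hypothetical estimate of events written to the primary but never replicated to DR. */
final class RpoEstimate {
    /** Lost events ≈ replication lag at the moment of failure × produce rate. */
    static long estimatedLostEvents(long replicationLagMs, double eventsPerSecond) {
        return Math.round(replicationLagMs / 1000.0 * eventsPerSecond);
    }

    public static void main(String[] args) {
        long lagMs = 4_500;        // assumed last replication-latency-ms snapshot
        double rate = 2_000.0;     // assumed produce rate for the topic
        System.out.println("estimated loss: " + estimatedLostEvents(lagMs, rate) + " events");
    }
}
```

With 4.5 seconds of lag at 2,000 events per second, the estimate is 9,000 events.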
2.3 (c) Confirm replication caught up (or accept the RPO)
If the primary is at least partially alive and MM2 is running, wait briefly so DR catches up as far as possible to the primary's last offset.
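# Replication progress as seen on the DR side: with producers stopped, watch until the
# latest offset of primary.<topic> stops increasing (= MM2 has drained the backlog)
# Newer Kafka releases ship this tool as kafka-get-offsets.sh and take --bootstrap-server instead of --broker-list.
watch -n2 'kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list dr-broker:9092 \
--topic primary.orders --time -1'
- Once MM2 lag ≈ 0, you can switch over with no data loss, which is the ideal case.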
- If the primary is fully dead and can't catch up further, accept the current RPO and proceed. Waiting forever only inflates RTO. Set an upper bound ("how long to wait", e.g. 60 seconds) in the runbook in advance.
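The "wait, but only up to a limit" rule is easy to get wrong under pressure, so it helps to have it in code. The sketch below is a minimal version: the class name is hypothetical, and the lag source is passed in as a supplier because how you read MM2 lag (JMX, an offset tool, a metrics API) depends on your setup.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/** Hypothetical bounded wait for replication to drain before switching over. */
final class CatchUpWait {
    /**
     * Polls the lag until it reaches zero or the deadline passes.
     * @return true if replication caught up; false if the caller should accept the current RPO
     */
    static boolean waitForCatchUp(LongSupplier lag, Duration maxWait, Duration pollInterval) {
        long deadline = System.nanoTime() + maxWait.toNanos();
        while (lag.getAsLong() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // upper bound reached: stop waiting, RTO matters more now
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in lag source that drains 40 records per poll; a real one would query MM2 metrics.
        long[] remaining = {100};
        LongSupplier fakeLag = () -> {
            remaining[0] = Math.max(0, remaining[0] - 40);
            return remaining[0];
        };
        boolean caughtUp = waitForCatchUp(fakeLag, Duration.ofSeconds(60), Duration.ofMillis(10));
        System.out.println(caughtUp ? "lag is 0, switch over" : "timed out, accept current RPO");
    }
}
```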
2.4 (d) Repoint producers to DR
Now change the producers' bootstrap address to DR. Beware the topic-name gotcha here (details in Section 4).
With DefaultReplicationPolicy, the replica on the DR cluster exists under the name primary.orders. But after failover, producers must write to DR's original topic, orders (no prefix). primary.orders is a replication topic managed by MM2 — it is not where new writes go.
# Change the indirection (config service) value
- bootstrap.servers: primary-broker:9092
+ bootstrap.servers: dr-broker:9092
# Topic name stays 'orders' — new writes on DR go to the un-prefixed original topic
topic: orders
# Bring producers back up
kubectl scale deployment order-producer --replicas=4 -n prod
2.5 (e) Repoint consumers to DR: resume at translated offsets
Consumers must resume at translated offsets. There are two ways.
Method 1 — use offsets auto-synced by sync.group.offsets.enabled
If MM2 has been translating and syncing __consumer_offsets to DR during normal operation, the consumer simply reattaches to the DR cluster with the same group ID. The translated position already exists in DR's __consumer_offsets.
# Consumer: keep the same group ID, change only the bootstrap to DR
group.id = order-processor
bootstrap.servers = dr-broker:9092
# auto.offset.reset triggers "only when there is no synced offset",
# so in a properly synced state the consumer resumes at the translated position
auto.offset.reset = latest
Method 2: explicit translation with RemoteClusterUtils
If you didn't enable auto-sync, or you want to verify the exact translated position, translate directly at failover time.
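import java.time.Duration;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;
// Translate source(primary) offsets → target(dr) offsets.
// mm2Props must connect to the DR cluster, where MM2 writes the primary's checkpoints.
Map<TopicPartition, OffsetAndMetadata> translated =
    RemoteClusterUtils.translateOffsets(
        mm2Props,
        "primary", // source cluster alias
        "order-processor", // consumer group
        Duration.ofSeconds(20));
// Commit the translated offsets to the DR consumer group (it must have no active members), then resume
try (Admin admin = Admin.create(drProps)) {
    admin.alterConsumerGroupOffsets("order-processor", translated).all().get();
}
Note: A translated offset only guarantees that nothing before this position is still unprocessed; it does not guarantee that everything from this position onward is new. MM2 rounds the position down rather than risk skipping records, so some records are read again. That's why some reprocessing occurs, and that's why the idempotency from 1.4 is required.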
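One way to picture why a translated position errs toward re-reading is a toy model of sync points. The sketch below is a deliberate simplification and not MM2's actual algorithm: the class is hypothetical, and it assumes translation simply rounds a committed primary offset down to the nearest known (primary offset → DR offset) pair.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;

/** Toy model of conservative offset translation; not MM2's real implementation. */
final class ConservativeTranslation {
    // primary offset -> DR offset known to hold the same record
    private final TreeMap<Long, Long> syncs = new TreeMap<>();

    void recordSync(long primaryOffset, long drOffset) {
        syncs.put(primaryOffset, drOffset);
    }

    /** Rounds down to the nearest known sync point so that no record is skipped. */
    OptionalLong translate(long committedPrimaryOffset) {
        Map.Entry<Long, Long> floor = syncs.floorEntry(committedPrimaryOffset);
        return floor == null ? OptionalLong.empty() : OptionalLong.of(floor.getValue());
    }

    public static void main(String[] args) {
        ConservativeTranslation t = new ConservativeTranslation();
        t.recordSync(100, 40);
        t.recordSync(200, 140);
        t.recordSync(300, 240);

        // The group committed 257 on the primary; the nearest earlier sync point is 200 -> 140.
        System.out.println(t.translate(257).getAsLong()); // prints 140
        // No sync point at or before offset 50, so nothing can be translated.
        System.out.println(t.translate(50).isPresent());  // prints false
    }
}
```

In this model the consumer resumes at DR offset 140 even though it had already processed records past that point on the primary, which is the reprocessing window that idempotent consumers absorb.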
2.6 (f) Validate
Switching over is not the end. A failover without validation is only half done.
# 1) Are consumers progressing — is lag shrinking?
kafka-consumer-groups.sh --bootstrap-server dr-broker:9092 \
--describe --group order-processor
# CURRENT-OFFSET should increase, LAG should decrease
# 2) Is mass reprocessing happening?
# (If LAG jumped to the full topic size, offset translation failed → suspect earliest reprocessing)
Validation checks:
- The consumer group's CURRENT-OFFSET increases and LAG converges.
- LAG did not explode to the full topic size (i.e., not an earliest reprocess).
- Producers are writing normally to DR's original topic (orders).
- Business validation passes: key metrics (order count, payment success rate, etc.) in normal range.
- Data flows to downstream (DB, search index, etc.).
Once these pass, failover is complete. Record in the incident channel: "DR active, failover complete, estimated RPO: about N events (roughly T seconds of writes)."
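The first two validation checks can be automated from two samples of the consumer-group output. The sketch below is a hypothetical helper, not a Kafka API: the verdict names are made up, and treating "lag at or above 90% of the retained log" as a sign of an earliest reset is an assumed heuristic.

```java
/** Hypothetical check over two samples of kafka-consumer-groups.sh output for one partition. */
final class FailoverValidation {
    enum Verdict { OK, STALLED, MASS_REPROCESSING }

    /**
     * @param offsetBefore CURRENT-OFFSET at the first sample
     * @param offsetAfter  CURRENT-OFFSET at the later sample
     * @param lagAfter     LAG at the later sample
     * @param retainedSize log-end offset minus log-start offset on DR
     */
    static Verdict check(long offsetBefore, long offsetAfter, long lagAfter, long retainedSize) {
        // Assumed heuristic: lag close to the whole retained log suggests a reset to earliest.
        if (retainedSize > 0 && lagAfter >= retainedSize * 0.9) {
            return Verdict.MASS_REPROCESSING;
        }
        if (offsetAfter <= offsetBefore && lagAfter > 0) {
            return Verdict.STALLED; // messages are waiting but the group is not advancing
        }
        return Verdict.OK;
    }

    public static void main(String[] args) {
        System.out.println(check(1_000, 1_500, 200, 1_000_000));
        System.out.println(check(1_000, 1_000, 200, 1_000_000));
        System.out.println(check(0, 10, 990_000, 1_000_000));
    }
}
```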
3. Failback Procedure — Returning to the Original Cluster
The primary data center has recovered. But don't rush. Failback is trickier than failover. Failover happens because "one side died"; failback is "both sides are alive and you move deliberately," so split-brain and duplicate reconciliation must be handled even more carefully.
3.1 Set up reverse replication (DR → Primary)
During failover, all new data piled up on DR. The primary knows nothing of it. So first enable replication in the DR → primary direction to let the primary catch up.
# MM2: enable the reverse flow (briefly making active-passive bidirectional)
clusters = primary, dr
# The original forward flow stalled when the primary went down; disable it explicitly, then enable the reverse
primary->dr.enabled = false
dr->primary.enabled = true
dr->primary.emit.checkpoints.enabled = true
dr->primary.sync.group.offsets.enabled = true
# Allow only original topics; exclude 'primary.'-prefixed topics so they don't replicate back
dr->primary.topics = orders, payments
dr->primary.topics.exclude = .*\.internal, primary\..*, .*\.replica
Cyclic-replication caveat: If you enable forward (primary→dr) and reverse (dr→primary) at the same time, the DefaultReplicationPolicy prefix prevents an infinite loop (e.g., primary.orders won't replicate back), but a misconfiguration can make topics ping-pong. It's safest to fully stop the forward flow before enabling the reverse.
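Before applying a reverse-flow config, it can be worth dry-running the include and exclude patterns against your real topic list. The sketch below mimics the idea of an include list combined with an exclude list; it is a hypothetical stand-in written for illustration, not MM2's own filter class, and the topic names in main are examples.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical dry-run of include/exclude topic patterns for the reverse flow. */
final class ReverseFlowTopicFilter {
    private final List<Pattern> include;
    private final List<Pattern> exclude;

    ReverseFlowTopicFilter(List<String> include, List<String> exclude) {
        this.include = compile(include);
        this.exclude = compile(exclude);
    }

    private static List<Pattern> compile(List<String> regexes) {
        return regexes.stream().map(Pattern::compile).collect(Collectors.toList());
    }

    private static boolean matchesAny(List<Pattern> patterns, String topic) {
        return patterns.stream().anyMatch(p -> p.matcher(topic).matches());
    }

    /** A topic replicates only if it fully matches an include pattern and no exclude pattern. */
    boolean shouldReplicate(String topic) {
        return matchesAny(include, topic) && !matchesAny(exclude, topic);
    }

    public static void main(String[] args) {
        ReverseFlowTopicFilter filter = new ReverseFlowTopicFilter(
                List.of("orders", "payments"),
                List.of(".*\\.internal", "primary\\..*", ".*\\.replica"));
        for (String topic : List.of("orders", "primary.orders", "payments", "inventory")) {
            System.out.println(topic + " -> " + filter.shouldReplicate(topic));
        }
    }
}
```

Running the same check with a catch-all include such as .* makes it obvious how much work the exclude list is then doing on its own.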
3.2 Wait for the primary to catch up
# Reverse-replication progress as seen on the primary — watch until dr.<topic>'s
# latest offset catches up to the DR original
watch -n2 'kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list primary-broker:9092 \
--topic dr.orders --time -1'
Wait until reverse lag is small enough. This is a planned operation not pressed by RTO, so you can wait comfortably until lag ≈ 0.
3.3 Schedule a cutover window
Perform failback during a planned maintenance window. Pick a low-traffic period and obtain permission for a brief write quiesce.
3.4 Stop DR producers → repoint clients to the primary
# 1) Stop producers to DR to quiesce writes
kubectl scale deployment order-producer --replicas=0 -n prod
# 2) Final check that reverse replication flushed the last message to the primary (lag == 0)
# 3) Point the indirection back to the primary
# config service: bootstrap.servers = primary-broker:9092
# or DNS CNAME: kafka.internal → primary
# 4) Consumers to primary too — resume at translated offsets via reverse sync.group.offsets
# 5) Restart producers (now to the primary's original topic 'orders')
kubectl scale deployment order-producer --replicas=4 -n prod
The order mirrors failover: stop producers → confirm last replication → repoint consumers/producers → restart.
3.5 Reconciliation — the duplicate/reprocessing problem
This is the hardest part of failback. Crossing two boundaries — the failover boundary and the failback boundary — some messages may have been processed on both sides.
Why idempotent/dedup consumers matter: at both boundaries offset translation is approximate, so exactly-once is not guaranteed. An idempotent consumer (1.4) absorbs these duplicates automatically. With non-idempotent consumers, you must do the following manually after failback.
| Reconciliation item | Method |
|---|---|
| Detect duplicate processing | detect duplicate rows downstream keyed by the idempotency key (event ID) |
| Correct double side-effects | cancel double billing/notifications with compensating transactions |
| Verify data consistency | reconcile aggregates over the window from just-before-failover to just-after-failback against the source of truth |
| Check offset gaps | confirm consumer group offsets did not regress or jump at the translation boundary |
The cost of reconciliation is the real cost of failback. Invest in idempotency and that cost converges to zero.
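The first row of the table, detecting duplicates by idempotency key, reduces to finding event IDs that occur more than once in the downstream store. The sketch below shows that check on an in-memory list; the class name is hypothetical, and in practice the IDs would come from a query over the failover-to-failback window.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical post-failback check for rows that were written more than once. */
final class Reconciliation {
    /** Returns each event ID that appears more than once, in the order its first repeat was seen. */
    static List<String> duplicateEventIds(List<String> eventIds) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : eventIds) {
            if (!seen.add(id)) {   // add() returns false when the ID was already present
                duplicates.add(id);
            }
        }
        return new ArrayList<>(duplicates);
    }

    public static void main(String[] args) {
        // Assumed sample: event IDs of downstream rows in the reconciliation window.
        List<String> rows = List.of("e1", "e2", "e1", "e3", "e2", "e1");
        System.out.println(duplicateEventIds(rows)); // prints [e1, e2]
    }
}
```

Every ID this returns is a candidate for a compensating transaction; an empty result is what idempotent consumers give you for free.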
4. The Topic-Renaming Gotcha — DefaultReplicationPolicy vs IdentityReplicationPolicy
This single thing causes the most incidents in active-passive failover/failback. It deserves its own section.
4.1 The DefaultReplicationPolicy prefix
MM2's default policy prepends the source cluster alias as a prefix to replicated topics. The primary's orders becomes primary.orders on DR. This prefix prevents cyclic replication and lets you trace which cluster the data came from.
The gotcha is clear: where do producers write after failover?
| Topic | What it is | New writes after failover |
|---|---|---|
| primary.orders (DR) | MM2-managed replica of the primary | ❌ must not write here |
| orders (DR) | DR's own original topic | ✅ write here |
Yet consumers may need to read historical data from primary.orders right after failover (pre-failover primary data). So on DR the same logical stream is split across two topics (primary.orders + orders), which is confusing. To handle it cleanly, consumers must subscribe to both topics, either by listing them or with a regex subscription. Keep the pattern tight, e.g. (primary\.)?orders: a loose pattern such as .*orders also matches unrelated topics whose names merely end in orders.
In failback, this confusion repeats as a mirror image: DR's orders becomes dr.orders on the primary and again diverges from the primary's own original orders.
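The naming rule and a tight subscription pattern can be captured in two small functions. The sketch below assumes the default "." separator between alias and topic; the class and method names are hypothetical, and only java.util.regex is used.

```java
import java.util.regex.Pattern;

/** Hypothetical helpers for topic names under a prefixing replication policy. */
final class PrefixedTopics {
    /** Name a topic from sourceAlias gets on the target cluster, assuming the "." separator. */
    static String remoteName(String sourceAlias, String topic) {
        return sourceAlias + "." + topic;
    }

    /** Pattern that matches exactly the local topic and its mirrored counterpart, nothing else. */
    static Pattern localAndMirrored(String sourceAlias, String topic) {
        return Pattern.compile(
                "(" + Pattern.quote(sourceAlias + ".") + ")?" + Pattern.quote(topic));
    }

    public static void main(String[] args) {
        System.out.println(remoteName("primary", "orders")); // prints primary.orders

        Pattern p = localAndMirrored("primary", "orders");
        for (String t : new String[] {"orders", "primary.orders", "backorders", "dr.orders"}) {
            System.out.println(t + " -> " + p.matcher(t).matches());
        }
    }
}
```

A pattern built this way can be handed to KafkaConsumer.subscribe(Pattern) on DR; for failback, the same helper with the alias "dr" covers orders and dr.orders on the primary.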
4.2 How IdentityReplicationPolicy changes the picture
IdentityReplicationPolicy (formerly LegacyReplicationPolicy) adds no prefix. The primary's orders stays orders on DR too.
# Same-name replication with no prefix
replication.policy.class = org.apache.kafka.connect.mirror.IdentityReplicationPolicy
| Aspect | DefaultReplicationPolicy | IdentityReplicationPolicy |
|---|---|---|
| Topic name on DR | primary.orders | orders |
| Client change after failover | must also handle the topic name (strip prefix) | topic name unchanged — change only the bootstrap |
| Cyclic-replication prevention | naturally blocked by the prefix | risky — needs separate prevention when bidirectional |
| Provenance tracking | clear from the topic name | unclear (needs headers or another means) |
| Suitable scenario | bidirectional / fan-in (aggregation) | unidirectional active-passive DR |
For simple failover/failback in active-passive DR, IdentityReplicationPolicy greatly simplifies operations. Because topic names are the same on both sides, clients only change the bootstrap address and the topic-name branching logic disappears. The trade-off: when you enable bidirectional replication (the reverse flow during failback), the responsibility for preventing cyclic replication shifts to the operator. You must block it with the topics include/exclude lists and by never running both directions at once. Header-based filtering is an option only if your producers or a custom transform stamp each record with its origin cluster, since stock MM2 does not add a provenance header for you.
Selection guide: If unidirectional DR is the main goal and you want to avoid the topic-name gotcha, use IdentityReplicationPolicy. If you aggregate many clusters into one (aggregation) or provenance tracking matters, use DefaultReplicationPolicy. Once you pick one, use it consistently across all clusters and all stages; mixing them makes topic names mismatch at failback.
5. End-to-End Diagram — Failover Then Failback
6. On-call Checklist & RTO Expectations
6.1 Failover checklist (for 3 AM)
Follow it as is. Top to bottom, no skipping.
- Declare disaster: confirm primary fully unreachable; record timestamp and decision-maker in the incident channel.
- Check MM2 state: capture a replication-lag snapshot (used to estimate RPO).
- Stop primary producers: quiesce if primary is partially reachable; otherwise skip and accept RPO.
- Catch up replication: wait for lag→0 (upper bound 60s). If primary is fully down, accept the current RPO.
- Repoint producers: indirection → DR. Topic is the un-prefixed original (orders).
- Repoint consumers: indirection → DR. Resume at translated offsets (sync.group.offsets or RemoteClusterUtils).
- Validate: consumer lag converges, no mass reprocessing, producers writing normally, business metrics normal.
- Announce: record "DR active, failover complete, estimated RPO" in the incident channel.
6.2 Failback checklist (for a planned window)
- Configure reverse replication: dr->primary enabled, cyclic replication blocked.
- Wait for primary catch-up: reverse-replication lag ≈ 0.
- Enter cutover window: low-traffic period, stakeholders notified.
- Stop DR producers: quiesce, confirm reverse replication flushed the last message (lag == 0).
- Repoint clients: indirection → Primary; consumers resume at translated offsets.
- Restart producers: to the primary's original topic.
- Reconcile: correct duplicates/reprocessing (auto-absorbed if idempotent, else manual).
- Restore forward replication: re-enable primary->dr, return to normal active-passive.
6.3 RTO expectations
RTO (recovery time objective) is a function of "where you invested."
| Readiness level | Failover RTO (approx.) | Determining factor |
|---|---|---|
| No indirection (hardcoded) | tens of minutes to hours | full app redeploy time |
| DNS CNAME (TTL 30s) | a few minutes | DNS propagation + client cache expiry |
| Config service (instant refresh) | 1 to a few minutes | change propagation + consumer reconnect |
| + offset auto-sync | < 1–2 minutes | translated offsets already on DR |
The point: RTO is determined by preparation (indirection layer, offset sync), not by the speed of the failover procedure. RPO is determined by everyday replication lag; RTO by switchover automation. Both are set "before disaster strikes."
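To make the table concrete, an RTO budget can be written as a sum of its stages. The sketch below is plain arithmetic with hypothetical names; every duration in main is an assumption for a DNS CNAME setup, not a measurement.

```java
/** Hypothetical RTO budget: adds up the stages of a failover, in seconds. */
final class RtoBudget {
    static long totalSeconds(long declareSeconds, long catchUpWaitSeconds,
                             long propagationSeconds, long reconnectSeconds) {
        return declareSeconds + catchUpWaitSeconds + propagationSeconds + reconnectSeconds;
    }

    public static void main(String[] args) {
        long declare = 60;          // assumed time to confirm and declare the disaster
        long catchUpWait = 60;      // upper bound on waiting for replication (see 2.3)
        long propagation = 30 + 30; // assumed DNS TTL plus JVM DNS cache expiry
        long reconnect = 30;        // assumed client reconnect and group rebalance
        System.out.println(totalSeconds(declare, catchUpWait, propagation, reconnect) + " s");
    }
}
```

These assumed values add up to 210 seconds, in line with the "a few minutes" row; replacing the propagation term with a near-instant config refresh is what moves you down the table.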
Wrapping up
Failover is not a heroic tale — it's a procedure. To summarize this part:
- Preconditions are 90% of it: MM2 health, offset translation, a repointable indirection layer, idempotent consumers — without these four, no runbook helps.
- Order is correctness: declare disaster → (if possible) stop producers → confirm replication / accept RPO → repoint producers → repoint consumers (translated offsets) → validate.
- Failover is at-least-once: idempotency is not optional but required. Translated offsets are approximate, and reprocessing inevitably occurs at the boundary.
- Failback is harder: you must reconcile duplicates accumulated across two boundaries, and only idempotent consumers drive that cost to zero.
- The topic-name gotcha: the DefaultReplicationPolicy prefix confuses "where to write and where to read" after failover. For unidirectional active-passive, IdentityReplicationPolicy simplifies operations.
- RTO/RPO are pre-decided: RPO by everyday lag, RTO by switchover automation. You can't change them on the day of the disaster.
And the most important truth of all: a runbook you never rehearse will fail. No matter how precise the procedure above, a failover attempted for the first time at 3 AM collapses on a missing permission, a cached DNS entry, or one forgotten topic. In Part 5, we cover how to regularly test and validate this runbook: game days, chaos injection, verifying replication and offset-translation correctness, and measuring RTO/RPO. A runbook comes alive not the moment you write it, but the moment you practice it.
References
- Apache Kafka, Geo-Replication (Cross-Cluster Data Mirroring): https://kafka.apache.org/documentation/#georeplication
- KIP-382: MirrorMaker 2.0: https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
- Confluent, Disaster Recovery for Multi-Datacenter Apache Kafka Deployments: https://docs.confluent.io/platform/current/multi-dc-deployments/index.html
- Apache Kafka, RemoteClusterUtils / MirrorClient (offset translation)
— The Data Dynamics Engineering Team