kafkadisaster-recoveryfailoverrunbookoperations

[Kafka DR 4] Failover & Failback Runbook — Actually Switching Over

An operational runbook for MirrorMaker 2-based Kafka DR: how to fail over to the DR cluster when disaster strikes, and how to fail back to the original cluster once it recovers — step by step. Covers the topic-renaming gotcha, offset translation, and duplicate/reprocessing reconciliation.

Data DynamicsJune 16, 202619 min read

3 AM. The primary data center is gone — all of it. Monitoring is a wall of red and the on-call phone is ringing. What you need to reach for now is not heroic improvisation but a runbook that is already written and already familiar in your hands. If Parts 1–3 covered the replication architecture, the MirrorMaker 2 setup, and offset translation, this Part 4 turns all of that preparation into the actual buttons you press. Failover is not a question of "can we" — it's a question of "do it in order."

What you'll learn in this post

The preconditions that must be in place before you ever open the failover runbook

An ordered failover procedure, from declaring the disaster to repointing clients

A failback procedure to return safely to the original cluster after recovery

The DefaultReplicationPolicy topic-renaming (prefix) gotcha, and how IdentityReplicationPolicy changes the picture

The duplicate/reprocessing reconciliation problem and why idempotent consumers are non-negotiable

A checklist an on-call engineer can literally follow at 3 AM, plus RTO expectations

1. Preconditions — Before You Open the Runbook

A failover runbook does not operate in a vacuum. The conditions below must be satisfied ahead of time so that the 3 AM procedure doesn't wobble. Miss even one and failover becomes a gamble, not a switchover.

1.1 Is MM2 replication healthy?

A DR cluster is only meaningful if MirrorMaker 2 is replicating data primary → DR in real time. Replication lag must be under control during normal operation, not just at the moment of failover.

# Check MM2 source connector replication lag (JMX metrics)
# kafka.connect.mirror:type=MirrorSourceConnector,...
#   replication-latency-ms  : time from a record being written at source to landing at target
#   record-age-ms           : age of a record that just arrived at target
 
# End-to-end lag as seen by a consumer group (on the DR side)
kafka-consumer-groups.sh --bootstrap-server dr-broker:9092 \
  --describe --group __replication-health-probe

Signal	Healthy	Warning	Danger
`replication-latency-ms` (p99)	< a few seconds	tens of seconds	sustained minutes
MM2 connector state	RUNNING	RUNNING (some tasks FAILED)	FAILED
DR topic latest offset advancing	continuously increasing	stalling	stopped

If replication is already behind when the primary dies, that backlog becomes data loss (RPO) verbatim. Remember: RPO is determined by everyday monitoring, not by the failover procedure.

1.2 Is offset translation working? (recap of Part 3)

The heart of failover is having consumers resume at the correct position on the DR cluster. Because source and target offsets are never identical (start offsets, compaction, and retention differ per partition), offsets must be translated.

sync.group.offsets.enabled = true so MM2 periodically translates and checkpoints __consumer_offsets.
Or, at failover time, translate source offsets to target offsets with RemoteClusterUtils.translateOffsets().

# MM2 connect-mirror-maker.properties (values set in Part 3 — reconfirm here)
primary->dr.emit.checkpoints.enabled = true
primary->dr.sync.group.offsets.enabled = true
primary->dr.sync.group.offsets.interval.seconds = 10
primary->dr.refresh.groups.interval.seconds = 10

Without this, after failover consumers will read from the earliest or latest position per auto.offset.reset. The former means mass reprocessing; the latter means data loss. Both are disasters.

1.3 Do clients use an indirection layer?

This is the most frequently missed precondition. If a client's bootstrap.servers is a hardcoded primary broker address, failover means redeploying every application. Not a 3 AM task.

Instead, inject the bootstrap address through a repointable indirection layer.

Loading diagram…

Mechanism	How to switch	Propagation delay	Watch out for
DNS CNAME	change the record	governed by TTL (keep TTL short, e.g. 30s)	client DNS caching, JVM `networkaddress.cache.ttl`
Config Service	change a key + client restart/refresh	instant to a few seconds	client must detect the change and reconnect
L4/L7 load balancer	switch the backend pool	instant	beware health-check false positives

The point: you must be able to change the bootstrap address without touching client code.

1.4 Are consumers idempotent?

MM2-based failover is fundamentally at-least-once. Offset translation is approximate, and at the replication boundary some messages will be processed twice. So a consumer must produce the same result whether it sees a message once or twice.

// Idempotent handling: dedup by message key + sequence
String dedupKey = record.key() + ":" + record.headers().lastHeader("event-id");
if (processedStore.putIfAbsent(dedupKey, true) == null) {
    process(record);   // actually process only on first sight
}
// Already-seen key → silently skip (safe to reprocess)

Idempotency is not an option for failover — it's a requirement. Fail over with non-idempotent consumers and you get duplicate orders, double billing, and duplicate notifications.

Precondition gate: If all four — (1) MM2 healthy, (2) offset translation, (3) indirection layer, (4) idempotent consumers — are not satisfied, fix those before following this runbook. A runbook cannot substitute for preparation.

2. Failover Procedure (Order Guaranteed)

In failover, order is correctness. Skip a step or reorder it and you add data loss or duplication. Follow the procedure as written.

2.1 (a) Detect and declare the disaster

First, judge whether this really is a disaster. Triggering failover on a transient network glitch costs you split-brain and needless data reconciliation.

Criterion	Fail over	Wait
Entire primary cluster unreachable (DC down)	O
Simultaneous loss of many brokers + ISR collapse	O
Single broker failure		O (Kafka's own replication handles it)
Transient network partition (tens of seconds)		O (wait for recovery first)

Declaring a disaster is an explicit human act. Automatic failover carries a high split-brain risk, so we put a human "declaration" gate in front of it. Record the timestamp and the decision-maker in the incident channel the moment you declare.

2.2 (b) Stop producers to the primary (if possible)

If the primary is partially reachable (e.g., some brokers survive, the network partly recovers), stop the producers writing to it first. This prevents accumulating more messages that are written only to the primary and never replicated to DR after the failover point.

# How you stop producers varies by environment:
#  - a "produce: paused" flag in the indirection (config service)
#  - or scale the producer app down temporarily (k8s: replicas=0)
kubectl scale deployment order-producer --replicas=0 -n prod

If the primary is fully unreachable, skip this step and accept the in-flight RPO — that is, accept as loss the messages written only to the primary that never replicated to DR. Record an estimate of this loss in the incident log.

2.3 (c) Confirm replication caught up (or accept the RPO)

If the primary is at least partially alive and MM2 is running, wait briefly so DR catches up as far as possible to the primary's last offset.

# Replication progress as seen on the DR side — watch until the latest offset of
# primary.<topic> stops increasing (= caught up)
watch -n2 'kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list dr-broker:9092 \
  --topic primary.orders --time -1'

Once MM2 lag ≈ 0, you can switch over with no data loss — ideal.
If the primary is fully dead and can't catch up further, accept the current RPO and proceed. Waiting forever only inflates RTO. Set an upper bound ("how long to wait", e.g. 60 seconds) in the runbook in advance.

2.4 (d) Repoint producers to DR

Now change the producers' bootstrap address to DR. Beware the topic-name gotcha here (details in Section 4).

With DefaultReplicationPolicy, the replica on the DR cluster exists under the name primary.orders. But after failover, producers must write to DR's original topic, orders (no prefix). primary.orders is a replication topic managed by MM2 — it is not where new writes go.

# Change the indirection (config service) value
- bootstrap.servers: primary-broker:9092
+ bootstrap.servers: dr-broker:9092
# Topic name stays 'orders' — new writes on DR go to the un-prefixed original topic
  topic: orders

# Bring producers back up
kubectl scale deployment order-producer --replicas=4 -n prod

2.5 (e) Repoint consumers to DR — resume at translated offsets

Consumers must resume at translated offsets. There are two ways.

Method 1 — use offsets auto-synced by sync.group.offsets.enabled

If MM2 has been translating and syncing __consumer_offsets to DR during normal operation, the consumer simply reattaches to the DR cluster with the same group ID. The translated position already exists in DR's __consumer_offsets.

# Consumer: keep the same group ID, change only the bootstrap to DR
group.id = order-processor
bootstrap.servers = dr-broker:9092
# auto.offset.reset triggers "only when there is no synced offset",
# so in a properly synced state the consumer resumes at the translated position
auto.offset.reset = latest

Method 2 — explicit translation with RemoteClusterUtils

If you didn't enable auto-sync, or you want to verify the exact translated position, translate directly at failover time.

// Translate source(primary) offsets → target(dr) offsets
Map<TopicPartition, OffsetAndMetadata> translated =
    RemoteClusterUtils.translateOffsets(
        mm2Props,
        "primary",            // source cluster alias
        "order-processor",    // consumer group
        Duration.ofSeconds(20));
 
// Commit the translated offsets to the DR consumer group, then resume
try (Admin admin = Admin.create(drProps)) {
    admin.alterConsumerGroupOffsets("order-processor", translated).all().get();
}

Note: A translated offset only guarantees "everything after this position was definitely not seen" — not "everything up to this position was seen exactly." That's why some reprocessing occurs, and that's why the idempotency from 1.4 is required.

2.6 (f) Validate

Switching over is not the end. A failover without validation is only half done.

# 1) Are consumers progressing — is lag shrinking?
kafka-consumer-groups.sh --bootstrap-server dr-broker:9092 \
  --describe --group order-processor
# CURRENT-OFFSET should increase, LAG should decrease
 
# 2) Is mass reprocessing happening?
# (If LAG jumped to the full topic size, offset translation failed → suspect earliest reprocessing)

Validation checks:

The consumer group's CURRENT-OFFSET increases and LAG converges.
LAG did not explode to the full topic size (i.e., not an earliest reprocess).
Producers are writing normally to DR's original topic (orders).
Business validation passes: key metrics (order count, payment success rate, etc.) in normal range.
Data flows to downstream (DB, search index, etc.).

Once these pass, failover is complete. Record in the incident channel: "DR active, failover complete, estimated RPO N events/sec."

3. Failback Procedure — Returning to the Original Cluster

The primary data center has recovered. But don't rush. Failback is trickier than failover. Failover happens because "one side died"; failback is "both sides are alive and you move deliberately," so split-brain and duplicate reconciliation must be handled even more carefully.

3.1 Set up reverse replication (DR → Primary)

During failover, all new data piled up on DR. The primary knows nothing of it. So first enable replication in the DR → primary direction to let the primary catch up.

# MM2: enable the reverse flow (briefly making active-passive bidirectional)
clusters = primary, dr
# The original forward flow is paused during failover; stop it and enable the reverse
dr->primary.enabled = true
dr->primary.emit.checkpoints.enabled = true
dr->primary.sync.group.offsets.enabled = true
# Whitelist only original topics — exclude 'primary.'-prefixed topics so they don't replicate back
dr->primary.topics = orders, payments, .*
dr->primary.topics.exclude = .*\.internal, primary\..*, .*\.replica

Cyclic-replication caveat: If you enable forward (primary→dr) and reverse (dr→primary) at the same time, the DefaultReplicationPolicy prefix prevents an infinite loop (e.g., primary.orders won't replicate back), but a misconfiguration can make topics ping-pong. It's safest to fully stop the forward flow before enabling the reverse.

3.2 Wait for the primary to catch up

# Reverse-replication progress as seen on the primary — watch until dr.<topic>'s
# latest offset catches up to the DR original
watch -n2 'kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list primary-broker:9092 \
  --topic dr.orders --time -1'

Wait until reverse lag is small enough. This is a planned operation not pressed by RTO, so you can wait comfortably until lag ≈ 0.

3.3 Schedule a cutover window

Perform failback during a planned maintenance window. Pick a low-traffic period and obtain permission for a brief write quiesce.

3.4 Stop DR producers → repoint clients to the primary

# 1) Stop producers to DR to quiesce writes
kubectl scale deployment order-producer --replicas=0 -n prod
 
# 2) Final check that reverse replication flushed the last message to the primary (lag == 0)
 
# 3) Point the indirection back to the primary
#   config service: bootstrap.servers = primary-broker:9092
#   or DNS CNAME: kafka.internal → primary
 
# 4) Consumers to primary too — resume at translated offsets via reverse sync.group.offsets
# 5) Restart producers (now to the primary's original topic 'orders')
kubectl scale deployment order-producer --replicas=4 -n prod

The order mirrors failover: stop producers → confirm last replication → repoint consumers/producers → restart.

3.5 Reconciliation — the duplicate/reprocessing problem

This is the hardest part of failback. Crossing two boundaries — the failover boundary and the failback boundary — some messages may have been processed on both sides.

Loading diagram…

Why idempotent/dedup consumers matter: at both boundaries offset translation is approximate, so exactly-once is not guaranteed. An idempotent consumer (1.4) absorbs these duplicates automatically. With non-idempotent consumers, you must do the following manually after failback.

Reconciliation item	Method
Detect duplicate processing	detect duplicate rows downstream keyed by the idempotency key (event ID)
Correct double side-effects	cancel double billing/notifications with compensating transactions
Verify data consistency	reconcile aggregates over the window from just-before-failover to just-after-failback against the source of truth
Check offset gaps	confirm consumer group offsets did not regress or jump at the translation boundary

The cost of reconciliation is the real cost of failback. Invest in idempotency and that cost converges to zero.

4. The Topic-Renaming Gotcha — `DefaultReplicationPolicy` vs `IdentityReplicationPolicy`

This single thing causes the most incidents in active-passive failover/failback. It deserves its own section.

4.1 The `DefaultReplicationPolicy` prefix

MM2's default policy prepends the source cluster alias as a prefix to replicated topics. The primary's orders becomes primary.orders on DR. This prefix prevents cyclic replication and lets you trace which cluster the data came from.

Loading diagram…

The gotcha is clear: where do producers write after failover?

Topic	What it is	New writes after failover
`primary.orders` (DR)	MM2-managed replica of the primary	❌ must not write here
`orders` (DR)	DR's own original topic	✅ write here

Yet consumers may need to read historical data from primary.orders right after failover (pre-failover primary data). So the same logical stream is split across two topics (primary.orders + orders) on DR — a confusing situation. To handle it cleanly, consumers must subscribe to both topics or unify them with a regex subscription (.*orders).

In failback, this confusion repeats as a mirror image: DR's orders becomes dr.orders on the primary and again diverges from the primary's own original orders.

4.2 How `IdentityReplicationPolicy` changes the picture

IdentityReplicationPolicy (formerly LegacyReplicationPolicy) adds no prefix. The primary's orders stays orders on DR too.

# Same-name replication with no prefix
replication.policy.class = org.apache.kafka.connect.mirror.IdentityReplicationPolicy

Aspect	DefaultReplicationPolicy	IdentityReplicationPolicy
Topic name on DR	`primary.orders`	`orders`
Client change after failover	must also handle the topic name (strip prefix)	topic name unchanged — change only the bootstrap
Cyclic-replication prevention	naturally blocked by the prefix	risky — needs separate prevention when bidirectional
Provenance tracking	clear from the topic name	unclear (needs headers or another means)
Suitable scenario	bidirectional / fan-in (aggregation)	unidirectional active-passive DR

For simple failover/failback in active-passive DR, IdentityReplicationPolicy greatly simplifies operations. Because topic names are the same on both sides, clients only change the bootstrap address and the topic-name branching logic disappears. The trade-off: when you enable bidirectional replication (the reverse flow during failback), the responsibility for preventing cyclic replication shifts to the operator. You must block it with the topics whitelist/exclude and source.cluster.alias header-based filtering.

Selection guide: If unidirectional DR is the main goal and you want to avoid the topic-name gotcha, use IdentityReplicationPolicy. If you aggregate many clusters into one (aggregation) or provenance tracking matters, use DefaultReplicationPolicy. Once you pick one, use it consistently across all clusters and all stages — mixing them makes topic names mismatch at failback.

5. End-to-End Diagram — Failover Then Failback

Loading diagram…

6. On-call Checklist & RTO Expectations

6.1 Failover checklist (for 3 AM)

Follow it as is. Top to bottom, no skipping.

Declare disaster: confirm primary fully unreachable; record timestamp and decision-maker in the incident channel.
Check MM2 state: capture a replication-lag snapshot (used to estimate RPO).
Stop primary producers: quiesce if primary is partially reachable; otherwise skip and accept RPO.
Catch up replication: wait for lag→0 (upper bound 60s). If primary is fully down, accept the current RPO.
Repoint producers: indirection → DR. Topic is the un-prefixed original (orders).
Repoint consumers: indirection → DR. Resume at translated offsets (sync.group.offsets or RemoteClusterUtils).
Validate: consumer lag converges, no mass reprocessing, producers writing normally, business metrics normal.
Announce: record "DR active, failover complete, estimated RPO" in the incident channel.

6.2 Failback checklist (for a planned window)

Configure reverse replication: dr->primary enabled, cyclic replication blocked.
Wait for primary catch-up: reverse-replication lag ≈ 0.
Enter cutover window: low-traffic period, stakeholders notified.
Stop DR producers: quiesce, confirm reverse replication flushed the last message (lag == 0).
Repoint clients: indirection → Primary; consumers resume at translated offsets.
Restart producers: to the primary's original topic.
Reconcile: correct duplicates/reprocessing (auto-absorbed if idempotent, else manual).
Restore forward replication: re-enable primary->dr, return to normal active-passive.

6.3 RTO expectations

RTO (recovery time objective) is a function of "where you invested."

Readiness level	Failover RTO (approx.)	Determining factor
No indirection (hardcoded)	tens of minutes to hours	full app redeploy time
DNS CNAME (TTL 30s)	a few minutes	DNS propagation + client cache expiry
Config service (instant refresh)	1 to a few minutes	change propagation + consumer reconnect
+ offset auto-sync	< 1–2 minutes	translated offsets already on DR

The point: RTO is determined by preparation (indirection layer, offset sync), not by the speed of the failover procedure. RPO is determined by everyday replication lag; RTO by switchover automation. Both are set "before disaster strikes."

Wrapping up

Failover is not a heroic tale — it's a procedure. To summarize this part:

Preconditions are 90% of it: MM2 health, offset translation, a repointable indirection layer, idempotent consumers — without these four, no runbook helps.
Order is correctness: declare disaster → (if possible) stop producers → confirm replication / accept RPO → repoint producers → repoint consumers (translated offsets) → validate.
Failover is at-least-once: idempotency is not optional but required. Translated offsets are approximate, and reprocessing inevitably occurs at the boundary.
Failback is harder: you must reconcile duplicates accumulated across two boundaries, and only idempotent consumers drive that cost to zero.
The topic-name gotcha: the DefaultReplicationPolicy prefix confuses "where to write and where to read" after failover. For unidirectional active-passive, IdentityReplicationPolicy simplifies operations.
RTO/RPO are pre-decided: RPO by everyday lag, RTO by switchover automation. You can't change them on the day of the disaster.

And one most important truth. A runbook you never rehearse will fail. No matter how precise the procedure above, a failover attempted for the first time at 3 AM collapses on a missing permission, a cached DNS entry, or one forgotten topic. In Part 5, we cover how to regularly test and validate this runbook — game days, chaos injection, verifying replication and offset-translation correctness, and measuring RTO/RPO. A runbook comes alive not the moment you write it, but the moment you practice it.

References

Apache Kafka — Geo-Replication (Cross-Cluster Data Mirroring): https://kafka.apache.org/documentation/#georeplication

KIP-382: MirrorMaker 2.0: https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0

Confluent — Disaster Recovery for Multi-Datacenter Apache Kafka Deployments: https://docs.confluent.io/platform/current/multi-dc-deployments/index.html

Apache Kafka — RemoteClusterUtils / MirrorClient (offset translation)

— The Data Dynamics Engineering Team

1. Preconditions — Before You Open the Runbook

1.1 Is MM2 replication healthy?

1.2 Is offset translation working? (recap of Part 3)

1.3 Do clients use an indirection layer?

1.4 Are consumers idempotent?

2. Failover Procedure (Order Guaranteed)

2.1 (a) Detect and declare the disaster

2.2 (b) Stop producers to the primary (if possible)

2.3 (c) Confirm replication caught up (or accept the RPO)

2.4 (d) Repoint producers to DR

2.5 (e) Repoint consumers to DR — resume at translated offsets

2.6 (f) Validate

3. Failback Procedure — Returning to the Original Cluster

3.1 Set up reverse replication (DR → Primary)

3.2 Wait for the primary to catch up

3.3 Schedule a cutover window

3.4 Stop DR producers → repoint clients to the primary

3.5 Reconciliation — the duplicate/reprocessing problem

4. The Topic-Renaming Gotcha — DefaultReplicationPolicy vs IdentityReplicationPolicy

4.1 The DefaultReplicationPolicy prefix

4.2 How IdentityReplicationPolicy changes the picture

5. End-to-End Diagram — Failover Then Failback

6. On-call Checklist & RTO Expectations

6.1 Failover checklist (for 3 AM)

6.2 Failback checklist (for a planned window)

6.3 RTO expectations

Wrapping up

4. The Topic-Renaming Gotcha — `DefaultReplicationPolicy` vs `IdentityReplicationPolicy`

4.1 The `DefaultReplicationPolicy` prefix

4.2 How `IdentityReplicationPolicy` changes the picture