[Kafka DR 5] DR Drills and Validation — Replication Alone Isn't Enough
A replication dashboard showing green does not prove recoverability. Learn how to measure your live RPO with MM2 heartbeats, measure your actual RTO with Game Days, and prove DR works with cross-cluster consistency validation.
Your DR cluster is receiving replication right now, this very second. The replication metrics on your dashboard are green, and MirrorMaker 2 hums along without missing a beat. But here's a question: if the primary cluster died right now, would failing over to DR actually bring the service back to life? Would consumers resume processing from exactly where they left off? Is your "RTO: 30 minutes" promise a measured number, or a hope?
If you can't answer that question with confidence, your DR is not an asset — it's a liability. A DR setup that has never been exercised is merely "a configuration you believe will recover," not "a system that recovers." This is the final installment of our Kafka DR series, and it covers how to turn replication into a verifiable guarantee: measurement, drills (Game Days), consistency validation, and observability.
What you'll learn in this post
- Why recoverability isn't guaranteed even when replication is "green"
- How to measure live RPO (replication lag) with MM2 heartbeats and alert when it breaches the budget
- How to design a Game Day (DR drill) and measure your actual RTO
- Cross-cluster consistency validation — message counts, checksums, canaries, offset-translation correctness
- The failures drills routinely expose, and how to automate the validation loop
1. The Trap of "Replication Is Green"
The most common delusion teams fall into after building DR is the belief that "replication is running, therefore we will recover." These are entirely different propositions.
"Replication is healthy" means "the primary cluster's data is flowing to the DR cluster." "We will recover" means "when the primary cluster is gone, the DR cluster alone lets producers and consumers resume normal operation." Between the two sits a pile of unverified assumptions.
| What replication guarantees | What replication does not guarantee |
|---|---|
| Messages flow to DR | Consumers resume at the correct offset |
| Topic data is mirrored | ACLs, topic configs, and schemas are synced |
| MM2 connectors are RUNNING | DNS/endpoint cutover works |
| Replication throughput is healthy | The DR cluster can absorb full traffic |
| Consumer-group offsets are checkpointed | Offset translation is correct (no gaps/dupes) |
A DR setup you've never failed over to is like a backup you've never restored. Its mere existence proves nothing.
The only way to prove DR is to regularly, in a controlled environment, actually execute a failover. Before we get there, let's be clear about what exactly we are measuring.
2. Measuring Replication Lag = Your Live RPO
RPO Is a Live Metric, Not a Document
RPO (Recovery Point Objective) is "how much data you're allowed to lose in a disaster." Writing "RPO: 5 seconds" in a DR design doc is easy. But your actual RPO at this very moment equals the replication lag between the primary and DR clusters. If replication is 5 seconds behind, a disaster right now loses the last 5 seconds of messages. Replication lag is your live RPO.
So honoring your RPO is exactly the work of measuring replication lag and alerting when it breaches the RPO budget.
Measuring End-to-End Latency with MM2 Heartbeats
The best tool for measuring replication lag precisely is MirrorMaker 2's heartbeats (KIP-382). Via the MirrorHeartbeatConnector, MM2 produces timestamped heartbeat messages to a heartbeats topic on the primary cluster at a fixed interval. These messages are replicated to the DR cluster just like any other message. By consuming the replicated <source>.heartbeats topic on the DR side and computing the difference between the embedded produce time and the current time, you get the end-to-end replication lag directly.
# MM2 connect-mirror-maker.properties (excerpt)
clusters = primary, dr
primary.bootstrap.servers = primary-broker:9092
dr.bootstrap.servers = dr-broker:9092
# Enable the primary -> dr replication flow
primary->dr.enabled = true
# Enable heartbeats / checkpoints (the heart of RPO measurement and offset translation)
primary->dr.emit.heartbeats.enabled = true
primary->dr.emit.checkpoints.enabled = true
emit.heartbeats.interval.seconds = 5
emit.checkpoints.interval.seconds = 30
# Offset sync (accuracy of consumer offset translation)
primary->dr.sync.group.offsets.enabled = true
sync.group.offsets.interval.seconds = 30
A measurer that reads replicated heartbeats on the DR side to compute live lag is as simple as this:
from kafka import KafkaConsumer
import time
RPO_BUDGET_MS = 5_000  # example RPO budget: 5 seconds
# extract_heartbeat_timestamp, emit_metric, and page_oncall are placeholders
# for your own payload parser, metrics client, and paging integration.
# Heartbeat topic replicated to the DR cluster: "<source-alias>.heartbeats"
consumer = KafkaConsumer(
    "primary.heartbeats",
    bootstrap_servers="dr-broker:9092",
    auto_offset_reset="latest",
    value_deserializer=lambda v: v,  # heartbeat payload is binary
)
for msg in consumer:
    # The heartbeat message records the time it was produced (epoch ms)
    produced_at_ms = extract_heartbeat_timestamp(msg)  # parse KIP-382 payload
    # Assumes the producing and measuring hosts have synchronized clocks
    replication_lag_ms = int(time.time() * 1000) - produced_at_ms
    # This value IS your "live RPO"
    emit_metric("kafka_dr_replication_lag_ms", replication_lag_ms)
    # Alert when it breaches the RPO budget
    if replication_lag_ms > RPO_BUDGET_MS:
        page_oncall(f"DR replication lag {replication_lag_ms}ms: RPO budget exceeded")
Watch Connector-Level Metrics Too
Beyond heartbeat-based end-to-end lag, reinforce monitoring with the JMX metrics exposed by MirrorSourceConnector (replication latency, record age, throughput) and MirrorCheckpointConnector (checkpoint latency).
| Metric | Meaning | Alert threshold (example) |
|---|---|---|
| replication-latency-ms (max/avg) | Time for a record to reach target from source | max > RPO budget |
| record-age-ms | Age of replicated records (since original produce) | continuously rising |
| byte-rate / record-rate | Replication throughput | drops to 0 (replication stalled) |
| checkpoint-latency-ms | Checkpoint emission lag | freshness threshold exceeded |
Connector/Task status | RUNNING / FAILED | FAILED immediately |
There's one core rule: alert when replication lag breaches the RPO budget. RPO must not be a number on a slide; it must be a live SLO that is measured every moment and wakes someone up when violated.
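The measurer in this section calls `extract_heartbeat_timestamp` without defining it. Here is a minimal sketch of that helper, under one loud assumption: that the heartbeat record value is a 2-byte big-endian version followed by an 8-byte big-endian epoch-millisecond timestamp. That matches our understanding of MM2's `Heartbeat` serialization, but verify it against the Kafka version you run. The names `parse_heartbeat_value` and `lag_ms` are ours, not part of any library.

```python
import struct
from types import SimpleNamespace

# ASSUMPTION: heartbeat value = int16 version + int64 epoch-ms timestamp, big-endian.
HEARTBEAT_VALUE = struct.Struct(">hq")

def parse_heartbeat_value(value: bytes) -> int:
    """Return the embedded produce time (epoch ms) from a heartbeat value."""
    if len(value) < HEARTBEAT_VALUE.size:
        raise ValueError(f"heartbeat value too short: {len(value)} bytes")
    _version, timestamp_ms = HEARTBEAT_VALUE.unpack_from(value)
    return timestamp_ms

def extract_heartbeat_timestamp(msg) -> int:
    """Adapter for a consumer record that exposes the raw bytes as msg.value."""
    return parse_heartbeat_value(msg.value)

def lag_ms(msg, now_ms: int) -> int:
    """Replication lag, clamped at zero so small clock skew never goes negative."""
    return max(0, now_ms - extract_heartbeat_timestamp(msg))

# Usage with a synthetic record (no broker needed)
fake = SimpleNamespace(value=struct.pack(">hq", 0, 1_700_000_000_000))
print(lag_ms(fake, now_ms=1_700_000_003_500))  # 3500
```

Keeping the byte parsing separate from the consumer loop means the layout assumption can be unit-tested without a running cluster.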
3. Game Day — Put DR Drills on the Calendar
Why Regular Drills
There is exactly one way to know whether DR works: actually fail over. And not once, but regularly. Systems change constantly. New topics appear, ACLs are added, consumer groups multiply, infrastructure gets upgraded. There's no guarantee that the DR that passed last quarter passes this quarter. That's why we nail down Game Days — drills that deliberately execute a failover within a controlled time window — on the calendar, quarterly or monthly.
Script the Game Day Scenario
Every step of the drill must follow a pre-written script (runbook), not on-the-spot human judgment. The key is running the runbook you built in Part 4 from start to finish inside this controlled window.
#!/usr/bin/env bash
# game-day-failover.sh — controlled failover drill script
set -euo pipefail
DRILL_ID="gameday-$(date +%Y%m%d)"
log() { echo "[$(date -u +%H:%M:%S)] $*"; }
# 0) Pre-check: is replication lag within the RPO budget?
log "[$DRILL_ID] pre-check: verify replication lag"
./check_replication_lag.sh --max-ms 5000 || { log "lag exceeded — aborting drill"; exit 1; }
# 1) Record T0 — the start point for measuring RTO
T0=$(date +%s)
log "[$DRILL_ID] T0=$T0 failover begins"
# 2) (optional) Simulate disaster by cutting primary traffic
./simulate_primary_outage.sh
# 3) Apply translated consumer-group offsets on DR (MM2 checkpoint-based)
log "[$DRILL_ID] applying translated consumer offsets"
./apply_translated_offsets.sh --from primary --to dr
# 4) Cut DNS / endpoints over to DR
log "[$DRILL_ID] endpoint cutover (primary -> dr)"
./switch_endpoint.sh --target dr
# 5) Confirm applications resumed normal processing on DR
log "[$DRILL_ID] service health check (resume confirmed)"
./wait_for_healthy.sh --cluster dr --timeout 600
# 6) Record T1 — compute actual RTO
T1=$(date +%s)
log "[$DRILL_ID] actual RTO = $((T1 - T0))s (target: 1800s)"
Target RTO vs Actual RTO
RTO (Recovery Time Objective) is "the time allowed from disaster to service recovery." A Game Day's most important output is the actual, measured RTO. In the script above, T1 - T0 is precisely that number.
| Item | Target RTO | Measured RTO | Verdict |
|---|---|---|---|
| Apply offset translation | 5 min | 7 min | ⚠️ over |
| Endpoint cutover (DNS TTL) | 2 min | 9 min | ❌ TTL too high |
| Confirm consumer resume | 5 min | 4 min | ✅ |
| Total | 30 min | 42 min | ❌ |
Every item that overshoots its target becomes an improvement task to fix before the next quarter. In the example above, DNS TTL was too long and cutover took 9 minutes — yielding an improvement to drop the failover TTL to 30 seconds. Without measurement, that "30 minutes" is just letters on a slide.
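The drill script records a single total (T1 - T0), while the table breaks the time down by phase. The sketch below shows one way to turn per-phase measurements into a list of improvement tasks. It assumes you capture a timestamp at each step of the runbook; the `Phase` class and `overshoots` function are hypothetical names, and the numbers are the ones from the example table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    name: str
    target_s: int    # target duration in seconds
    measured_s: int  # duration measured during the drill, in seconds

    @property
    def over_by_s(self) -> int:
        """Seconds over target (0 when the phase met its target)."""
        return max(0, self.measured_s - self.target_s)

def overshoots(phases):
    """Return the phases whose measured time exceeded the target."""
    return [p for p in phases if p.measured_s > p.target_s]

phases = [
    Phase("Apply offset translation", 5 * 60, 7 * 60),
    Phase("Endpoint cutover (DNS TTL)", 2 * 60, 9 * 60),
    Phase("Confirm consumer resume", 5 * 60, 4 * 60),
    Phase("Total", 30 * 60, 42 * 60),
]

for p in overshoots(phases):
    print(f"{p.name}: over target by {p.over_by_s // 60} min")
```

Each printed line maps directly to a ticket for the next quarter, and storing the `Phase` records per drill gives you the RTO trend over time.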
4. Consistency Validation — Replicated Doesn't Mean Identical
Even with zero replication lag and RTO within target, whether the data actually matches is a separate question. Consistency validation is done through automated cross-cluster checks.
4-1. Per-Topic/Partition Message Counts
The most basic check compares message counts (or high-watermark differences) between the two clusters at the topic/partition level.
def compare_message_counts(primary_admin, dr_admin, topic):
    """Return per-partition end-offset differences (primary minus DR)."""
    # end_offsets_per_partition is a placeholder for your own helper that
    # returns {partition: end_offset}; it is not a Kafka client method.
    p_offsets = primary_admin.end_offsets_per_partition(topic)
    # MM2 mirrors to DR as "<source>.topic", so account for the prefix
    d_offsets = dr_admin.end_offsets_per_partition(f"primary.{topic}")
    diffs = {}
    for partition, p_end in p_offsets.items():
        d_end = d_offsets.get(partition, 0)
        # A gap the size of replication lag is normal; beyond the RPO-equivalent is suspect
        diffs[partition] = p_end - d_end
    return diffs  # sustained/growing positive values suggest replication loss
4-2. Sampled Checksums
Counts can match while content differs. Sampling a range of offsets in each partition and comparing the message-payload checksum (e.g., CRC32/hash) on both sides catches tampering and ordering drift that counts miss.
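As a rough illustration of the checksum comparison, the sketch below hashes a sample of records from each side and distinguishes "same content, different order" from "different content". It assumes you have already fetched the two samples so that they cover the same logical records (offsets differ between clusters, so alignment has to come from translated offsets or from keys and timestamps), and that keys and values are non-null bytes. `ordered_digest` and `compare_samples` are hypothetical names.

```python
import hashlib
from typing import Iterable, Tuple

Record = Tuple[bytes, bytes]  # (key, value)

def ordered_digest(records: Iterable[Record]) -> str:
    """SHA-256 over the records in order; any change in content or order alters it."""
    h = hashlib.sha256()
    for key, value in records:
        for part in (key, value):
            # Length-prefix each field so (b"ab", b"c") and (b"a", b"bc") differ
            h.update(len(part).to_bytes(4, "big"))
            h.update(part)
    return h.hexdigest()

def compare_samples(primary: Iterable[Record], dr: Iterable[Record]) -> str:
    """Classify a sample pair as 'match', 'order-drift', or 'content-mismatch'."""
    primary, dr = list(primary), list(dr)
    if ordered_digest(primary) == ordered_digest(dr):
        return "match"
    if ordered_digest(sorted(primary)) == ordered_digest(sorted(dr)):
        return "order-drift"
    return "content-mismatch"

sample = [(b"k1", b"v1"), (b"k2", b"v2")]
print(compare_samples(sample, sample))                               # match
print(compare_samples(sample, list(reversed(sample))))               # order-drift
print(compare_samples(sample, [(b"k1", b"v1"), (b"k2", b"other")]))  # content-mismatch
```

SHA-256 is used here instead of CRC32 only because it is equally easy from the standard library; either works for detecting accidental divergence.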
4-3. End-to-End Canary Messages
The strongest validation is the canary. Periodically produce a canary message stamped with a unique ID to the primary cluster, then verify on the DR cluster (a) that the message arrived and (b) how late it arrived. That single round trip exercises the full produce → replicate → arrive path.
import uuid, time
def run_canary(primary_producer, dr_consumer, topic="dr-canary", timeout_s=30):
    # Assumes kafka-python clients without key/value (de)serializers, so keys and values are bytes
    canary_id = str(uuid.uuid4()).encode()
    sent_at = time.time()
    # Produce a canary to the primary cluster
    primary_producer.send(topic, key=canary_id, value=str(sent_at).encode())
    primary_producer.flush()
    # Wait for the same ID to arrive on the DR cluster
    deadline = sent_at + timeout_s
    while time.time() < deadline:
        # poll() returns even when no records arrive, so the deadline is always enforced
        batches = dr_consumer.poll(timeout_ms=1000)  # subscribed to "primary.dr-canary"
        for records in batches.values():
            for msg in records:
                if msg.key == canary_id:
                    arrived_lag = time.time() - sent_at
                    emit_metric("kafka_dr_canary_lag_s", arrived_lag)
                    return True
    page_oncall(f"canary {canary_id.decode()} did not reach DR within {timeout_s}s")
    return False
4-4. Validating Offset-Translation Correctness (Most Important)
The subtlest and most frequently broken part of Kafka DR is consumer offset translation. A primary cluster's consumer-group offsets do not point to the same numbers on the DR cluster — because each partition's replication start point differs. MM2 builds the source→target offset mapping (the __consumer_offsets translation) via MirrorCheckpointConnector. You must verify this translation is correct by failing over a test consumer group for real.
# Validate offset translation: fail a test consumer group over to DR
# checkpoint_reader, dr_admin, and consume_until_idle are placeholders for your own wrappers;
# ACCEPTABLE_GAP and ACCEPTABLE_DUP are thresholds you define for your workload.
def validate_offset_translation(test_group, dr_admin, checkpoint_reader):
    # 1) Read the translated offsets from MM2 checkpoints
    #    (in Java, RemoteClusterUtils.translateOffsets does this)
    translated = checkpoint_reader.translate_offsets(
        group=test_group, source="primary", target="dr"
    )
    # 2) Start the group on DR at the translated offsets
    dr_admin.alter_consumer_group_offsets(test_group, translated)
    # 3) Validate processing after resume
    resumed = consume_until_idle(test_group, cluster="dr")
    # Neither a large gap (loss) nor large duplication is acceptable
    assert resumed.gap < ACCEPTABLE_GAP, f"offset gap too large: {resumed.gap}"
    assert resumed.duplicates < ACCEPTABLE_DUP, f"too many duplicates: {resumed.duplicates}"
The pass criterion is clear: the test consumer group, after failing over to DR, must resume processing without large gaps (loss) and without large duplication. Some duplication is normal: Kafka DR is inherently at-least-once, so consumers must be idempotent (see Section 6).
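Because some re-delivery after failover is expected, the consumer side has to tolerate it. The sketch below shows the basic shape of an idempotent handler: it applies a side effect at most once per idempotency key. It is an illustration only. `IdempotentHandler` is a hypothetical name, and the in-memory set stands in for the durable store (a database unique constraint, for example) that a real consumer would need to survive restarts.

```python
from typing import Callable, Hashable, Set

class IdempotentHandler:
    """Apply a side effect at most once per idempotency key."""

    def __init__(self, apply: Callable[[object], None]) -> None:
        self._apply = apply
        # In-memory for illustration; production needs a durable, shared store.
        self._seen: Set[Hashable] = set()

    def handle(self, idempotency_key: Hashable, payload: object) -> bool:
        """Return True if the payload was applied, False if it was a duplicate."""
        if idempotency_key in self._seen:
            return False
        self._apply(payload)
        # Record the key only after the side effect succeeded
        self._seen.add(idempotency_key)
        return True

charges = []
handler = IdempotentHandler(charges.append)
print(handler.handle("order-1001", 49_00))  # True
print(handler.handle("order-1001", 49_00))  # False: re-delivered after failover
print(charges)                              # [4900]
```

The key should come from the business event itself (an order ID, a payment ID), not from the Kafka offset, since offsets differ between the primary and DR clusters.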
5. Observability — Keep DR Always Visible
Validation isn't a one-off event; it's continuous visibility. Put the following metrics on a dashboard and alert on threshold breaches.
| Area | Dashboard panel | Alert condition |
|---|---|---|
| Replication lag (RPO) | replication_lag_ms, heartbeat-based end-to-end lag | RPO budget exceeded |
| Heartbeat latency | heartbeat arrival interval / freshness | heartbeat gap or latency spike |
| Connector status | MirrorSource/Checkpoint/Heartbeat connector & task status | FAILED or PAUSED |
| Checkpoint freshness | time since last checkpoint | freshness threshold exceeded |
| Canary | canary round-trip lag / miss rate | non-arrival or latency spike |
| RTO trend | quarterly measured-RTO trend | exceeds target RTO |
The thing most often overlooked here is checkpoint freshness. If checkpoints stall, offset translation goes stale, and failing over in that state makes consumers resume at the wrong point. Heartbeats and checkpoints tell you not "replication is alive" but "recovery is possible" — which is why they rank above ordinary throughput metrics.
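A freshness check is easy to get wrong, because a lag metric that is computed only when a record arrives goes silent exactly when replication stalls. The sketch below flags signals by the age of the last record seen instead. The function name `stale_signals` and the thresholds (three missed intervals, based on the 5-second heartbeat and 30-second checkpoint intervals configured earlier) are illustrative assumptions, not recommendations.

```python
from typing import Dict, List, Optional

def stale_signals(
    last_seen_ms: Dict[str, Optional[int]],
    max_age_ms: Dict[str, int],
    now_ms: int,
) -> List[str]:
    """Return the names of signals whose latest record is older than allowed."""
    stale = []
    for name, limit in max_age_ms.items():
        seen = last_seen_ms.get(name)  # None means the signal was never observed
        if seen is None or now_ms - seen > limit:
            stale.append(name)
    return sorted(stale)

# Illustrative thresholds: three missed emit intervals
MAX_AGE_MS = {"heartbeat": 15_000, "checkpoint": 90_000}

now = 1_000_000
last_seen = {"heartbeat": now - 4_000, "checkpoint": now - 120_000}
print(stale_signals(last_seen, MAX_AGE_MS, now))  # ['checkpoint']
```

Run a check like this on a timer that is independent of the consumer loop, so that a fully stalled pipeline still raises an alert.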
6. The Failures Drills Routinely Expose
The real value of a Game Day isn't success — it's discovering failures at a safe time. You make them blow up in a controlled drill rather than during a real failover. Here are the recurring defects drills surface in the field.
| Defect found | Symptom | Fix |
|---|---|---|
| Stale runbook | Scripts point to vanished hosts / old endpoints | Manage runbook as code, refresh every drill |
| Unsynced ACL/topic config | Producers/consumers denied on DR, partition-count mismatch | Automate ACL & topic-config sync, validate counts |
| Untranslated offsets | Consumers re-read from the start or hit large gaps | Enable checkpoints + make offset-translation validation continuous |
| Non-idempotent consumers | Duplicate processing after failover causes side effects (double billing, etc.) | Introduce idempotency keys, redesign for at-least-once |
| DNS/cutover gap | Long TTL keeps clients pointed at the old cluster | Short failover TTL, health-based cutover |
| Under-provisioned DR | DR can't absorb full traffic, causing lag/outage | Size DR equal to primary, run periodic load validation |
Every row here is a concrete reason "replication is green but recovery fails." And all of them are invisible until you try.
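Several of these defects can be caught between drills with a simple drift check. As one example, the sketch below compares partition counts for each primary topic against its mirrored counterpart on DR. It assumes you have already collected `{topic: partition_count}` maps from both clusters with whatever admin client you use, and that mirrored topics follow the `<source-alias>.<topic>` naming used in this series. `partition_count_mismatches` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple

def partition_count_mismatches(
    primary_counts: Dict[str, int],
    dr_counts: Dict[str, int],
    source_alias: str = "primary",
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Map each drifting topic to (primary_count, dr_count); dr_count is None if missing."""
    mismatches = {}
    for topic, p_count in primary_counts.items():
        mirrored = f"{source_alias}.{topic}"
        d_count = dr_counts.get(mirrored)  # None: topic not mirrored to DR at all
        if d_count != p_count:
            mismatches[topic] = (p_count, d_count)
    return mismatches

primary = {"orders": 12, "payments": 6, "audit": 3}
dr = {"primary.orders": 12, "primary.payments": 3}
print(partition_count_mismatches(primary, dr))
# {'payments': (6, 3), 'audit': (3, None)}
```

The same pattern extends to topic configs and ACLs: collect both sides into plain dictionaries, diff them, and alert on any non-empty result.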
7. The Validation Loop — Keep Going Around
Everything so far is not a checklist you run once, but a continuously turning loop. You replicate, measure lag, drill the failover, validate consistency and RTO, fix the gaps exposed, and return to replication.
As long as this loop keeps turning, DR stays not a "belief" but a measured guarantee renewed every quarter. The moment the loop stops — if your last Game Day was six months ago — DR slides back into being a liability.
Wrapping up — Closing the Kafka DR Series
We close this five-part Kafka DR journey. DR is not any single stage but a continuum of design → build → translation → runbook → drills. Drop any one and the rest are rendered powerless.
| Part | Topic | Key message |
|---|---|---|
| 1 | DR design and topology | Set RPO/RTO and pick an Active-Passive/Active-Active topology |
| 2 | MirrorMaker 2 replication | Mirror topics and data to DR with MM2 |
| 3 | Offset translation & consumer failover | Use checkpoints so consumers resume where they left off |
| 4 | Failover runbook | Turn human-followable, step-by-step procedure into code |
| 5 | DR drills and validation (this post) | Prove recovery with measurement, Game Days, and consistency validation |
The single sentence this series exists to deliver is this: replication alone isn't enough. Replication is the beginning of DR, not the end. Replication moves the data, but what guarantees recovery is measurement, drills, and validation. A DR you've never exercised is a liability; only a DR that is exercised and measured every quarter is an asset.
If DR is the story of "when disaster strikes," its sibling series is about "the things that break on ordinary days." We recommend reading the "Kafka Operations Troubleshooting" series, which digs deep into everyday failures like consumer lag, message loss, acks/min.insync.replicas, and rebalance storms. Hardening your day-to-day operations is, after all, the best DR preparation there is.
May your DR be an asset, not a liability. And may the RTO you measure at your next Game Day land within target.
References
- Apache Kafka. "Geo-Replication (Cross-Cluster Data Mirroring)" — https://kafka.apache.org/documentation/#georeplication
- KIP-382: MirrorMaker 2.0 (heartbeats / checkpoints) — https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
- Confluent. "Disaster Recovery for Multi-Datacenter Apache Kafka Deployments" — https://docs.confluent.io/platform/current/multi-dc-deployments/index.html
- Confluent. "Monitoring Kafka with JMX" / Replicator monitoring — https://docs.confluent.io/platform/current/kafka/monitoring.html
— The Data Dynamics Engineering Team