Blog
kafkadisaster-recoverytestinggame-dayreliability

[Kafka DR 5] DR Drills and Validation — Replication Alone Isn't Enough

A replication dashboard showing green does not prove recoverability. Learn how to measure your live RPO with MM2 heartbeats, measure your actual RTO with Game Days, and prove DR works with cross-cluster consistency validation.

Data DynamicsJune 17, 202614 min read

Your DR cluster is receiving replication right now, this very second. The replication metrics on your dashboard are green, and MirrorMaker 2 hums along without missing a beat. But here's a question: if the primary cluster died right now, would failing over to DR actually bring the service back to life? Would consumers resume processing from exactly where they left off? Is your "RTO: 30 minutes" promise a measured number, or a hope?

If you can't answer that question with confidence, your DR is not an asset — it's a liability. A DR setup that has never been exercised is merely "a configuration you believe will recover," not "a system that recovers." This is the final installment of our Kafka DR series, and it covers how to turn replication into a verifiable guarantee: measurement, drills (Game Days), consistency validation, and observability.

What you'll learn in this post

  • Why recoverability isn't guaranteed even when replication is "green"
  • How to measure live RPO (replication lag) with MM2 heartbeats and alert when it breaches the budget
  • How to design a Game Day (DR drill) and measure your actual RTO
  • Cross-cluster consistency validation — message counts, checksums, canaries, offset-translation correctness
  • The failures drills routinely expose, and how to automate the validation loop

1. The Trap of "Replication Is Green"

The most common delusion teams fall into after building DR is the belief that "replication is running, therefore we will recover." These are entirely different propositions.

"Replication is healthy" means "the primary cluster's data is flowing to the DR cluster." "We will recover" means "when the primary cluster is gone, the DR cluster alone lets producers and consumers resume normal operation." Between the two sits a pile of unverified assumptions.

What replication guaranteesWhat replication does not guarantee
Messages flow to DRConsumers resume at the correct offset
Topic data is mirroredACLs, topic configs, and schemas are synced
MM2 connectors are RUNNINGDNS/endpoint cutover works
Replication throughput is healthyThe DR cluster can absorb full traffic
__consumer_offsets is replicatedOffset translation is correct (no gaps/dupes)

A DR setup you've never failed over to is like a backup you've never restored. Its mere existence proves nothing.

The only way to prove DR is to regularly, in a controlled environment, actually execute a failover. Before we get there, let's be clear about what exactly we are measuring.


2. Measuring Replication Lag = Your Live RPO

RPO Is a Live Metric, Not a Document

RPO (Recovery Point Objective) is "how much data you're allowed to lose in a disaster." Writing "RPO: 5 seconds" in a DR design doc is easy. But your actual RPO at this very moment equals the replication lag between the primary and DR clusters. If replication is 5 seconds behind, a disaster right now loses the last 5 seconds of messages. Replication lag is your live RPO.

So honoring your RPO is exactly the work of measuring replication lag and alerting when it breaches the RPO budget.

Measuring End-to-End Latency with MM2 Heartbeats

The best tool for measuring replication lag precisely is MirrorMaker 2's heartbeats (KIP-382). Via the MirrorHeartbeatConnector, MM2 produces timestamped heartbeat messages to a heartbeats topic on the primary cluster at a fixed interval. These messages are replicated to the DR cluster just like any other message. By consuming the replicated <source>.heartbeats topic on the DR side and computing the difference between the embedded produce time and the current time, you get the end-to-end replication lag directly.

# MM2 connect-mirror-maker.properties (excerpt)
clusters = primary, dr
primary.bootstrap.servers = primary-broker:9092
dr.bootstrap.servers = dr-broker:9092
 
# Enable the primary -> dr replication flow
primary->dr.enabled = true
 
# Enable heartbeats / checkpoints (the heart of RPO measurement and offset translation)
primary->dr.emit.heartbeats.enabled = true
primary->dr.emit.checkpoints.enabled = true
emit.heartbeats.interval.seconds = 5
emit.checkpoints.interval.seconds = 30
 
# Offset sync (accuracy of consumer offset translation)
primary->dr.sync.group.offsets.enabled = true
sync.group.offsets.interval.seconds = 30

A measurer that reads replicated heartbeats on the DR side to compute live lag is as simple as this:

from kafka import KafkaConsumer
import json, time
 
# Heartbeat topic replicated to the DR cluster: "<source-alias>.heartbeats"
consumer = KafkaConsumer(
    "primary.heartbeats",
    bootstrap_servers="dr-broker:9092",
    auto_offset_reset="latest",
    value_deserializer=lambda v: v,  # heartbeat payload is binary
)
 
for msg in consumer:
    # The heartbeat message records the time it was produced
    produced_at_ms = extract_heartbeat_timestamp(msg)   # parse KIP-382 payload
    replication_lag_ms = int(time.time() * 1000) - produced_at_ms
 
    # This value IS your "live RPO"
    emit_metric("kafka_dr_replication_lag_ms", replication_lag_ms)
 
    # Alert when it breaches the RPO budget (e.g., 5 seconds)
    if replication_lag_ms > 5_000:
        page_oncall(f"DR replication lag {replication_lag_ms}ms — RPO budget exceeded")

Watch Connector-Level Metrics Too

Beyond heartbeat-based end-to-end lag, reinforce monitoring with the JMX metrics exposed by MirrorSourceConnector.

MetricMeaningAlert threshold (example)
replication-latency-ms (max/avg)Time for a record to reach target from sourcemax > RPO budget
record-age-msAge of replicated records (since original produce)continuously rising
byte-rate / record-rateReplication throughputdrops to 0 (replication stalled)
checkpoint-latency-msCheckpoint emission lagfreshness threshold exceeded
Connector/Task statusRUNNING / FAILEDFAILED immediately

There's one core rule: alert when replication lag breaches the RPO budget. RPO must not be a number on a slide; it must be a live SLO that is measured every moment and wakes someone up when violated.


3. Game Day — Put DR Drills on the Calendar

Why Regular Drills

There is exactly one way to know whether DR works: actually fail over. And not once, but regularly. Systems change constantly. New topics appear, ACLs are added, consumer groups multiply, infrastructure gets upgraded. There's no guarantee that the DR that passed last quarter passes this quarter. That's why we nail down Game Days — drills that deliberately execute a failover within a controlled time window — on the calendar, quarterly or monthly.

Script the Game Day Scenario

Every step of the drill must follow a pre-written script (runbook), not on-the-spot human judgment. The key is running the runbook you built in Part 4 from start to finish inside this controlled window.

#!/usr/bin/env bash
# game-day-failover.sh — controlled failover drill script
set -euo pipefail
 
DRILL_ID="gameday-$(date +%Y%m%d)"
log() { echo "[$(date -u +%H:%M:%S)] $*"; }
 
# 0) Pre-check: is replication lag within the RPO budget?
log "[$DRILL_ID] pre-check: verify replication lag"
./check_replication_lag.sh --max-ms 5000 || { log "lag exceeded — aborting drill"; exit 1; }
 
# 1) Record T0 — the start point for measuring RTO
T0=$(date +%s)
log "[$DRILL_ID] T0=$T0 failover begins"
 
# 2) (optional) Simulate disaster by cutting primary traffic
./simulate_primary_outage.sh
 
# 3) Apply translated consumer-group offsets on DR (MM2 checkpoint-based)
log "[$DRILL_ID] applying translated consumer offsets"
./apply_translated_offsets.sh --from primary --to dr
 
# 4) Cut DNS / endpoints over to DR
log "[$DRILL_ID] endpoint cutover (primary -> dr)"
./switch_endpoint.sh --target dr
 
# 5) Confirm applications resumed normal processing on DR
log "[$DRILL_ID] service health check (resume confirmed)"
./wait_for_healthy.sh --cluster dr --timeout 600
 
# 6) Record T1 — compute actual RTO
T1=$(date +%s)
log "[$DRILL_ID] actual RTO = $((T1 - T0))s (target: 1800s)"

Target RTO vs Actual RTO

RTO (Recovery Time Objective) is "the time allowed from disaster to service recovery." A Game Day's most important output is the actual, measured RTO. In the script above, T1 - T0 is precisely that number.

ItemTarget RTOMeasured RTOVerdict
Apply offset translation5 min7 min⚠️ over
Endpoint cutover (DNS TTL)2 min9 min❌ TTL too high
Confirm consumer resume5 min4 min
Total30 min42 min

Every item that overshoots its target becomes an improvement task to fix before the next quarter. In the example above, DNS TTL was too long and cutover took 9 minutes — yielding an improvement to drop the failover TTL to 30 seconds. Without measurement, that "30 minutes" is just letters on a slide.


4. Consistency Validation — Replicated Doesn't Mean Identical

Even with zero replication lag and RTO within target, whether the data actually matches is a separate question. Consistency validation is done through automated cross-cluster checks.

4-1. Per-Topic/Partition Message Counts

The most basic check compares message counts (or high-watermark differences) between the two clusters at the topic/partition level.

def compare_message_counts(primary_admin, dr_admin, topic):
    p_offsets = primary_admin.end_offsets_per_partition(topic)
    # MM2 mirrors to DR as "<source>.topic", so account for the prefix
    d_offsets = dr_admin.end_offsets_per_partition(f"primary.{topic}")
 
    diffs = {}
    for partition, p_end in p_offsets.items():
        d_end = d_offsets.get(partition, 0)
        # A gap the size of replication lag is normal; beyond the RPO-equivalent is suspect
        diffs[partition] = p_end - d_end
 
    return diffs   # sustained/growing positive values suggest replication loss

4-2. Sampled Checksums

Counts can match while content differs. Sampling a range of offsets in each partition and comparing the message-payload checksum (e.g., CRC32/hash) on both sides catches tampering and ordering drift that counts miss.

4-3. End-to-End Canary Messages

The strongest validation is the canary. Periodically produce a canary message stamped with a unique ID to the primary cluster, then verify on the DR cluster that the message (a) arrived and (b) how late it arrived. That single round trip actually exercises the full produce → replicate → arrive path.

import uuid, time
 
def run_canary(primary_producer, dr_consumer, topic="dr-canary"):
    canary_id = str(uuid.uuid4())
    sent_at = time.time()
    # Produce a canary to the primary cluster
    primary_producer.send(topic, key=canary_id, value=str(sent_at))
    primary_producer.flush()
 
    # Wait for the same ID to arrive on the DR cluster
    deadline = time.time() + 30
    for msg in dr_consumer:                      # subscribes to "primary.dr-canary"
        if msg.key == canary_id:
            arrived_lag = time.time() - sent_at
            emit_metric("kafka_dr_canary_lag_s", arrived_lag)
            return True
        if time.time() > deadline:
            page_oncall(f"canary {canary_id} did not reach DR within 30s")
            return False

4-4. Validating Offset-Translation Correctness (Most Important)

The subtlest and most frequently broken part of Kafka DR is consumer offset translation. A primary cluster's consumer-group offsets do not point to the same numbers on the DR cluster — because each partition's replication start point differs. MM2 builds the source→target offset mapping (the __consumer_offsets translation) via MirrorCheckpointConnector. You must verify this translation is correct by failing over a test consumer group for real.

# Validate offset translation: fail a test consumer group over to DR
def validate_offset_translation(test_group, dr_admin, checkpoint_reader):
    # 1) Read the translated offsets from MM2 checkpoints (RemoteClusterUtils)
    translated = checkpoint_reader.translate_offsets(
        group=test_group, source="primary", target="dr"
    )
    # 2) Start the group on DR at the translated offsets
    dr_admin.alter_consumer_group_offsets(test_group, translated)
 
    # 3) Validate processing after resume
    resumed = consume_until_idle(test_group, cluster="dr")
 
    # Neither a large gap (loss) nor large duplication is acceptable
    assert resumed.gap < ACCEPTABLE_GAP, f"offset gap too large: {resumed.gap}"
    assert resumed.duplicates < ACCEPTABLE_DUP, f"too many duplicates: {resumed.duplicates}"

The pass criterion is clear: the test consumer group, after failing over to DR, must resume processing without large gaps (loss) and without large duplication. Some duplication is normal — Kafka DR is inherently at-least-once, so consumers must be idempotent (see Section 6).


5. Observability — Keep DR Always Visible

Validation isn't a one-off event; it's continuous visibility. Put the following metrics on a dashboard and alert on threshold breaches.

AreaDashboard panelAlert condition
Replication lag (RPO)replication_lag_ms, heartbeat-based end-to-end lagRPO budget exceeded
Heartbeat latencyheartbeat arrival interval / freshnessheartbeat gap or latency spike
Connector statusMirrorSource/Checkpoint/Heartbeat connector & task statusFAILED or PAUSED
Checkpoint freshnesstime since last checkpointfreshness threshold exceeded
Canarycanary round-trip lag / miss ratenon-arrival or latency spike
RTO trendquarterly measured-RTO trendexceeds target RTO

The thing most often overlooked here is checkpoint freshness. If checkpoints stall, offset translation goes stale, and failing over in that state makes consumers resume at the wrong point. Heartbeats and checkpoints tell you not "replication is alive" but "recovery is possible" — which is why they rank above ordinary throughput metrics.


6. The Failures Drills Routinely Expose

The real value of a Game Day isn't success — it's discovering failures at a safe time. You make them blow up in a controlled drill rather than during a real failover. Here are the recurring defects drills surface in the field.

Defect foundSymptomFix
Stale runbookScripts point to vanished hosts / old endpointsManage runbook as code, refresh every drill
Unsynced ACL/topic configProducers/consumers denied on DR, partition-count mismatchAutomate ACL & topic-config sync, validate counts
Untranslated offsetsConsumers re-read from the start or hit large gapsEnable checkpoints + make offset-translation validation continuous
Non-idempotent consumersDuplicate processing after failover causes side effects (double billing, etc.)Introduce idempotency keys, redesign for at-least-once
DNS/cutover gapLong TTL keeps clients pointed at the old clusterShort failover TTL, health-based cutover
Under-provisioned DRDR can't absorb full traffic, causing lag/outageSize DR equal to primary, run periodic load validation

Every row here is a concrete reason "replication is green but recovery fails." And all of them are invisible until you try.


7. The Validation Loop — Keep Going Around

Everything so far is not a checklist you run once, but a continuously turning loop. You replicate, measure lag, drill the failover, validate consistency and RTO, fix the gaps exposed, and return to replication.

Loading diagram…

As long as this loop keeps turning, DR stays not a "belief" but a measured guarantee renewed every quarter. The moment the loop stops — if your last Game Day was six months ago — DR slides back into being a liability.


Wrapping up — Closing the Kafka DR Series

We close this five-part Kafka DR journey. DR is not any single stage but a continuum of design → build → translation → runbook → drills. Drop any one and the rest are rendered powerless.

PartTopicKey message
1DR design and topologySet RPO/RTO and pick an Active-Passive/Active-Active topology
2MirrorMaker 2 replicationMirror topics and data to DR with MM2
3Offset translation & consumer failoverUse checkpoints so consumers resume where they left off
4Failover runbookTurn human-followable, step-by-step procedure into code
5DR drills and validation (this post)Prove recovery with measurement, Game Days, and consistency validation

The single sentence this series exists to deliver is this: replication alone isn't enough. Replication is the beginning of DR, not the end. Replication moves the data, but what guarantees recovery is measurement, drills, and validation. A DR you've never exercised is a liability; only a DR that is exercised and measured every quarter is an asset.

If DR is the story of "when disaster strikes," its sibling series is about "the things that break on ordinary days." We recommend reading the "Kafka Operations Troubleshooting" series, which digs deep into everyday failures like consumer lag, message loss, acks/min.insync.replicas, and rebalance storms. Hardening your day-to-day operations is, after all, the best DR preparation there is.

May your DR be an asset, not a liability. And may the RTO you measure at your next Game Day land within target.


References


— The Data Dynamics Engineering Team