[Kafka DR 1] Foundations of Kafka DR — From RPO/RTO to Topology Selection

We define RPO/RTO, the starting point of any Kafka disaster recovery design, then quantitatively compare the Active-Passive, Active-Active, and Stretch cluster topologies to give you selection criteria by requirement.

Data Dynamics | June 13, 2026 | 12 min read

The moment an entire region disappears, how far can your Kafka survive? On a normal day nobody asks, but it is the first question you must answer when an outage hits. "How much data can we afford to lose?" and "How fast must we recover?": a DR design without these two numbers is less a design than a wish. This is the first installment of our five-part Kafka Disaster Recovery series. It covers the RPO/RTO targets and topology choices that underpin every later decision, before we get into the actual build.

What you'll learn in this post

What RPO and RTO are, and why they are the starting point of every DR design

Why RPO can never be exactly zero with asynchronous replication

How the Active-Passive, Active-Active, and Stretch topologies work and their trade-offs

A table quantitatively comparing RPO, RTO, cost, complexity, and write latency per topology

Topology selection criteria by requirement: compliance, cost, global serving, and more

The build, failover, and validation roadmap covered in parts 2–5 of the series

1. Why Start With RPO/RTO

When DR comes up, a common reaction is "isn't it just about redundancy?" But the form and level of redundancy are decided by two target numbers. Without agreeing on them first, you either build an overly expensive architecture or a design that loses data exactly when disaster strikes.

Defining RPO and RTO

Metric	Full name	Meaning	One-line summary
RPO	Recovery Point Objective	Tolerable data loss	"How far back from the disaster is the last safely persisted point?"
RTO	Recovery Time Objective	Tolerable downtime	"How long may it take from outage to a healthy service?"

RPO is a tolerance for data loss, expressed as time. An RPO of 5 minutes means the business can absorb losing the data that arrived in the 5 minutes just before the disaster. RTO is a tolerance for downtime, expressed as time. An RTO of 15 minutes means consumers and producers must be operating normally again within 15 minutes of the outage.
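
As a minimal sketch of treating the two targets as explicit numbers, the snippet below records them in seconds and checks measured values against them. The names `DrObjectives` and `evaluate_drill` and the measured figures are hypothetical, chosen only for illustration; the targets are the 5-minute RPO and 15-minute RTO used as examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrObjectives:
    """Agreed DR targets, both expressed in seconds."""
    rpo_seconds: float  # tolerable data loss
    rto_seconds: float  # tolerable downtime


def evaluate_drill(objectives, measured_loss_seconds, measured_downtime_seconds):
    """Compare what a drill measured against the agreed targets."""
    return {
        "rpo_met": measured_loss_seconds <= objectives.rpo_seconds,
        "rto_met": measured_downtime_seconds <= objectives.rto_seconds,
    }


# Targets from the text: RPO of 5 minutes, RTO of 15 minutes.
targets = DrObjectives(rpo_seconds=5 * 60, rto_seconds=15 * 60)

# Hypothetical drill: 40 seconds of data lost, 20 minutes of downtime.
result = evaluate_drill(targets, measured_loss_seconds=40,
                        measured_downtime_seconds=20 * 60)
print(result)  # {'rpo_met': True, 'rto_met': False}
```

In this made-up drill the RPO target is met but the RTO target is not, which is why the two numbers have to be tracked separately.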

Two Numbers Decide the Architecture

The smaller you push RPO, the more frequently and synchronously replication must happen — which raises write latency and cost. The smaller you push RTO, the more failover must be automated and pre-staged — which raises operational complexity. DR design is ultimately the work of finding the balance point between RPO/RTO targets and cost, complexity, and latency.

The key question is not "can we prevent the outage?" but "when an outage happens, how much do we lose and how fast do we recover?" RPO/RTO are the contract that writes that answer down as numbers.

2. Asynchronous Replication and the Limit of RPO

Kafka's cross-cluster replication (e.g., MirrorMaker 2) is fundamentally asynchronous. A producer writes a message to the primary cluster, then the replication tool reads that message and writes it again to the DR cluster. This "read then re-write" process inevitably has a time gap — the replication lag.

RPO ≈ Replication Lag at the Moment of Disaster

When a disaster strikes under asynchronous replication, the messages that reached the primary but had not yet been replicated to DR are lost along with it. Therefore:

RPO with asynchronous replication ≈ the replication lag at the exact moment of the disaster

If replication lag is always 200ms, RPO is about 200ms. But lag is not a fixed value. Under traffic spikes or network jitter, lag can stretch to seconds or minutes, and if the disaster happens at that instant, RPO grows with it. This is why RPO with asynchronous replication can never be zero. You can drive it close to zero, but guaranteeing "exactly zero data lost" requires synchronous replication (the Stretch cluster, covered later).
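
To make the point about variable lag concrete, here is a rough sketch that summarizes a series of lag samples. The sample values and the helper names `percentile` and `rpo_exposure` are assumptions for illustration; how you actually collect lag depends on your monitoring stack.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile, with pct in the range (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def rpo_exposure(lag_samples_ms):
    """Summarize how much data would be at risk at different moments."""
    return {
        "typical_ms": percentile(lag_samples_ms, 50),
        "p99_ms": percentile(lag_samples_ms, 99),
        "worst_ms": max(lag_samples_ms),
    }


# Hypothetical samples: mostly ~200 ms, with two spikes (4.5 s and 60 s).
lag_samples_ms = [200, 210, 190, 205, 200, 4500, 60000, 220, 195, 200]
exposure = rpo_exposure(lag_samples_ms)
print(exposure)  # {'typical_ms': 200, 'p99_ms': 60000, 'worst_ms': 60000}
```

The median suggests an RPO of about 200ms, but a disaster during the worst spike would lose a full minute of data. An RPO commitment is therefore better judged against the tail of the lag distribution than against its typical value.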

Factors That Inflate Replication Lag

Factor	Description	Impact on RPO
Network latency/bandwidth	Physical distance between regions, available bandwidth	Higher lag → higher RPO
Producer throughput spikes	Replication momentarily falls behind	Temporary lag spike
Insufficient replication tasks	Too few MM2 tasks / partition parallelism	Backlog accumulates
DR cluster load	Write-throughput ceiling on the DR side	Replication writes slow down → lag grows
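
The throughput-spike and backlog rows can be worked through with simple arithmetic. This is a back-of-the-envelope sketch under assumed, constant rates; the function names and all figures are hypothetical.

```python
def backlog_after_spike(produce_rate, replicate_rate, spike_seconds):
    """Messages left unreplicated while producers outpace replication."""
    return max(produce_rate - replicate_rate, 0) * spike_seconds


def drain_seconds(backlog, replicate_rate, normal_produce_rate):
    """Time to clear the backlog once traffic returns to normal."""
    headroom = replicate_rate - normal_produce_rate
    if headroom <= 0:
        return float("inf")  # replication never catches up
    return backlog / headroom


# Hypothetical numbers: a 60-second spike at 80,000 msg/s against a
# replication capacity of 50,000 msg/s, then a return to 40,000 msg/s.
backlog = backlog_after_spike(80_000, 50_000, 60)
recovery = drain_seconds(backlog, 50_000, 40_000)
print(backlog, recovery)  # 1800000 180.0
```

With these numbers a one-minute spike leaves 1.8 million messages exposed and takes three more minutes to drain. During that whole window the effective RPO is elevated.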

The takeaway is clear. You must monitor replication lag continuously, recognizing that this lag is your effective RPO. (Lag monitoring and offset tracking are covered in detail in part 3.)

3. The Three DR Topologies

There are three representative topologies for Kafka DR. Each sits at a completely different point on RPO, RTO, cost, and complexity.

3.1 Active-Passive (Warm Standby)

The most common and simplest setup. Only the primary cluster serves traffic, while the DR cluster receives data through asynchronous replication (e.g., MirrorMaker 2) and stands by.

Behavior: In normal operation all reads/writes happen on the primary. DR only receives replication and does not serve.
Failover: On disaster, switch consumers and producers to the DR cluster. Consumers must resume from translated offsets (the topic of part 3).
Pros: Simple to set up, and being one-way replication, there is no loop or conflict to worry about. It is also the cheapest.
Cons: Being asynchronous, RPO is not zero. DR receives no traffic in normal times, so its resources sit idle. Failover requires a human decision or automation, plus time to repoint clients, so RTO tends to land in minutes.
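
One way to see why Active-Passive RTO lands in minutes is to add up the failover phases. The phase names and durations below are purely illustrative assumptions, not measurements; the 15-minute target is the example RTO from section 1.

```python
def rto_budget(phases_seconds, rto_seconds):
    """Return total failover time and the slack left against the RTO target."""
    total = sum(phases_seconds.values())
    return total, rto_seconds - total


# Hypothetical phase durations for a manual Active-Passive failover.
failover_phases_seconds = {
    "detect_outage": 120,
    "decide_to_fail_over": 300,
    "repoint_producers_and_consumers": 180,
    "consumers_resume_from_translated_offsets": 120,
}

total, slack = rto_budget(failover_phases_seconds, rto_seconds=15 * 60)
print(total, slack)  # 720 180
```

A negative slack would mean the runbook cannot meet the target as written, and some phase has to be automated or pre-staged.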

3.2 Active-Active (Bidirectional)

Both clusters serve traffic at the same time and replicate each other's data bidirectionally. This fits global services that want a nearby cluster per region while still sharing all data.

Behavior: Both clusters accept writes and replicate their topics to the other.
Cycle prevention required: You must stop an infinite loop where a message replicated A → B comes back B → A. MM2 prefixes topics with the source cluster alias (e.g., A.orders) so that a cluster does not re-replicate topics it already replicated.
Conflict handling: Updating the same key on both sides simultaneously creates ordering/conflict issues. You need a design that pins "which cluster owns writes" per key (partitioning) or resolves conflicts at the application level.
Pros: Every cluster uses its resources, and per-region low-latency serving is possible. If one side dies, the other is already serving traffic, so RTO can be the shortest.
Cons: Complexity is the highest. You must account for cycle prevention, conflict handling, and offset/consumer-group management on both sides. Being asynchronous, RPO is still not zero.
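
The alias-prefix rule and per-key write ownership can be sketched as follows. This is a simplified model of the idea, not MirrorMaker 2's actual implementation; every function name here is hypothetical, and the CRC32-based ownership scheme is just one assumed way to pin keys to a cluster.

```python
import zlib

SEPARATOR = "."


def remote_topic_name(source_alias, topic):
    """Name a replicated topic after the cluster it came from."""
    return f"{source_alias}{SEPARATOR}{topic}"


def provenance(topic, known_aliases):
    """Cluster aliases a topic has already passed through, newest first."""
    chain = []
    while True:
        head, sep, rest = topic.partition(SEPARATOR)
        if sep and head in known_aliases:
            chain.append(head)
            topic = rest
        else:
            return chain


def should_replicate(topic, target_alias, known_aliases):
    """Skip topics that already originated from or passed through the target."""
    return target_alias not in provenance(topic, known_aliases)


def owning_cluster(key, clusters):
    """Pin each key to exactly one writer cluster using a stable hash."""
    return clusters[zlib.crc32(key.encode("utf-8")) % len(clusters)]


aliases = {"A", "B"}
print(remote_topic_name("A", "orders"))            # A.orders
print(should_replicate("orders", "B", aliases))    # True: A -> B is fine
print(should_replicate("A.orders", "A", aliases))  # False: would loop back to A
print(owning_cluster("customer-42", ["A", "B"]))   # always the same cluster for this key
```

Note that this model would misread a native topic whose name happens to start with a cluster alias, so alias and topic naming conventions need to be agreed up front.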

3.3 Stretch Cluster (Single Cluster Across AZs/Regions)

Rather than two separate clusters, you place a single Kafka cluster across multiple availability zones (AZs) or regions. Replication is not cross-cluster but synchronous replication within the cluster.

Rack awareness: Setting broker.rack on each broker makes Kafka spread the replicas of a partition across different AZs. Even if an entire AZ dies, replicas survive in the other AZs.
Synchronous replication: With producers set to acks=all and topics to min.insync.replicas=2 (or more), a write succeeds only after at least two in-sync replicas have acknowledged it, and with rack-aware placement those replicas sit in different AZs. So even if one AZ dies, already-committed data survives in another AZ.
RPO ≈ 0: Thanks to synchronous replication, data loss effectively converges to zero. This is the Stretch cluster's biggest strength.
Quorum requirement: To satisfy the controller quorum (KRaft) and min.insync.replicas at once, at least three zones are recommended. With only two zones, losing the zone that holds the controller majority costs you the quorum.
Cons: Every write waits for cross-zone synchronous replication, so write latency is directly tied to the cross-zone network round trip (RTT). The farther apart the zones (regions), the higher the latency, so it usually fits the same metro / nearby regions or multi-AZ.
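
The three-zone recommendation can be checked with a small model. This sketch assumes every surviving replica is in sync and that each partition has three replicas and the cluster has three controller voters; the function names and zone labels are hypothetical.

```python
def writes_survive(replica_zones, failed_zone, min_insync_replicas):
    """With acks=all, writes continue only while enough replicas remain."""
    surviving = [zone for zone in replica_zones if zone != failed_zone]
    return len(surviving) >= min_insync_replicas


def quorum_survives(voter_zones, failed_zone):
    """A controller quorum needs a strict majority of voters alive."""
    alive = sum(1 for zone in voter_zones if zone != failed_zone)
    return alive > len(voter_zones) // 2


def tolerates_any_single_zone_loss(replica_zones, voter_zones, min_insync_replicas):
    """Check both conditions for the loss of each zone in turn."""
    zones = set(replica_zones) | set(voter_zones)
    return all(
        writes_survive(replica_zones, zone, min_insync_replicas)
        and quorum_survives(voter_zones, zone)
        for zone in zones
    )


# Three replicas and three voters spread over three zones vs. squeezed into two.
three_zones = tolerates_any_single_zone_loss(
    ["az1", "az2", "az3"], ["az1", "az2", "az3"], min_insync_replicas=2)
two_zones = tolerates_any_single_zone_loss(
    ["az1", "az1", "az2"], ["az1", "az1", "az2"], min_insync_replicas=2)
print(three_zones, two_zones)  # True False
```

In the two-zone layout, losing az1 removes two of three replicas and two of three voters at once, so both write availability and the quorum are lost.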

4. Quantitative Topology Comparison

Comparing the three topologies in one table makes the trade-offs explicit.

Item	Active-Passive	Active-Active	Stretch cluster
RPO	Non-zero (equals replication lag)	Non-zero (equals replication lag)	≈ 0 (synchronous)
RTO	Minutes (failover required)	Shortest (already serving)	Shortest (automatic leader election)
Cost	Lowest (DR on standby)	High (both fully active)	Medium (single cluster, extra zones)
Complexity	Low (one-way)	Highest (cycles, conflicts)	Medium (rack/quorum design)
Write latency	Unaffected (local write)	Unaffected (local write)	High (cross-zone sync RTT)
Failure domains tolerated	Whole cluster/region	Whole cluster/region	AZ (or nearby region) granularity
Replication mode	Asynchronous (MM2)	Async bidirectional (MM2)	Synchronous (intra-cluster)
DR resource usage	Idle	Fully used	Fully used

The core trade-off in one sentence: Stretch drives RPO close to zero but pays in write latency and a distance limit between zones, while Active-Passive/Active-Active protect even distant regions without write-latency penalty but give up some RPO.

5. Topology Selection Criteria by Requirement

There is no single right answer. It depends on your RPO/RTO targets, compliance requirements, geographic distribution, and budget. Start with the criteria below.

Decision Guide

Priority / requirement	Recommended topology	Why
Zero-loss (RPO≈0), compliance-mandated (finance, payments)	Stretch cluster	Only synchronous replication guarantees zero data loss
Cost-sensitive, "minute-level RPO acceptable"	Active-Passive	Simplest and cheapest, protects against a single-region outage
Global low-latency serving, multi-region writes	Active-Active	Per-region local writes + shared data
Extremely latency-sensitive writes + distant DR needed	Active-Passive/Active-Active	Avoids Stretch's sync latency while protecting at distance
Limited ops staff / maturity	Active-Passive	Avoids the conflict management of bidirectional replication
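
The decision guide can be read as a small rule set. The sketch below encodes three of its rows; the field names and the precedence between rules (zero loss first, then ops maturity, then multi-region writes) are assumptions made for illustration, and real requirements often pull in more than one direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    zero_loss_required: bool    # RPO≈0 mandated, e.g. by compliance
    multi_region_writes: bool   # producers write locally in several regions
    limited_ops_maturity: bool  # small team, little DR experience


def recommend_topology(req):
    """Map requirements to a starting-point topology, per the table above."""
    if req.zero_loss_required:
        return "Stretch cluster"
    if req.limited_ops_maturity:
        return "Active-Passive"
    if req.multi_region_writes:
        return "Active-Active"
    return "Active-Passive"


payments = Requirements(zero_loss_required=True, multi_region_writes=False,
                        limited_ops_maturity=False)
global_app = Requirements(zero_loss_required=False, multi_region_writes=True,
                          limited_ops_maturity=False)
print(recommend_topology(payments))    # Stretch cluster
print(recommend_topology(global_app))  # Active-Active
```

Treat the output as a first candidate to evaluate, not a verdict. A zero-loss requirement combined with a distant DR region, for example, still runs into the Stretch latency limit described in section 3.3.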

Common Pitfalls

The myth that "just having a DR cluster means RPO is zero" — as long as replication is asynchronous, RPO equals the replication lag. If you truly need zero, evaluate Stretch (synchronous).
Building Stretch with only two zones — you can lose quorum. Secure at least three zones for the controller/ISR quorum.
Applying Stretch to distant regions — the cross-zone RTT becomes your write latency. If distant protection is the goal, an asynchronous-replication topology is the right fit.
Watching only RTO and forgetting RPO — fixating on "recover fast" makes you miss "how much do we lose." Set both numbers together.

6. Series Roadmap

This was part 1, laying the foundation. The remaining four parts carry the design through to actual build, operations, and validation.

Part	Topic	Core content
Part 1 (this post)	DR design foundations	RPO/RTO, topology selection
Part 2	Building MirrorMaker 2	MM2 architecture, connector config, topic/ACL replication
Part 3	Offset translation and failover prep	Consumer offset translation, lag monitoring, `RemoteClusterUtils`
Part 4	Failover/failback runbook	Step-by-step switchover, data consistency on failback
Part 5	Testing and validation	DR drills, chaos testing, measured RPO/RTO

Part 2 walks step by step through actually building the most common Active-Passive setup with MirrorMaker 2.

Wrapping up

Every DR design starts from two numbers: RPO (tolerable data loss) and RTO (tolerable downtime). Without them, you have a wish, not a design.
The RPO of asynchronous cross-cluster replication equals the replication lag at the moment of disaster, and can never be exactly zero. You can get close, but guaranteeing zero requires synchronous replication.
Active-Passive is simplest and cheapest, but RPO is non-zero and RTO is in minutes. It fits cost-sensitive cases that tolerate minute-level loss.
Active-Active fully uses resources and enables global low-latency serving, but cycle prevention and conflict handling make it the most complex.
The Stretch cluster converges RPO to zero through synchronous replication, but adds cross-zone write latency and a minimum-three-zone (quorum) requirement, and is unsuitable for distant regions.
Compliance/zero-loss → Stretch, cost-sensitive → Active-Passive, global low-latency → Active-Active. Requirements decide the choice.
In part 2, we build an actual Active-Passive DR with MirrorMaker 2 on top of this foundation.

References

Apache Kafka. "Geo-Replication (Cross-Cluster Data Mirroring)" — https://kafka.apache.org/documentation/#georeplication

Apache Kafka. "MirrorMaker 2.0 (KIP-382)" — https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0

Apache Kafka. "Rack Awareness & Replica Placement" — https://kafka.apache.org/documentation/#basic_ops_racks

Confluent. "Disaster Recovery for Multi-Region Kafka" — https://docs.confluent.io/platform/current/multi-dc-deployments/index.html

Confluent. "Multi-Region Clusters" — https://docs.confluent.io/platform/current/multi-dc-deployments/multi-region.html

— The Data Dynamics Engineering Team