[Kafka DR 1] Foundations of Kafka DR — From RPO/RTO to Topology Selection
We define RPO/RTO, the starting point of any Kafka disaster recovery design, then quantitatively compare the Active-Passive, Active-Active, and Stretch cluster topologies to give you selection criteria by requirement.
The moment an entire region disappears, how far can your Kafka survive? It is an invisible question on a normal day, but the very first one you must answer when an outage hits. "How much data can we afford to lose?" and "How fast must we recover?" — a DR design without these two numbers is less a design than a wish. This is the first installment of our Kafka Disaster Recovery series (5 parts), covering the RPO/RTO and topology choices that underpin every later decision, before we get into the actual build.
What you'll learn in this post
- What RPO and RTO are, and why they are the starting point of every DR design
- Why RPO can never be exactly zero with asynchronous replication
- How the Active-Passive, Active-Active, and Stretch topologies work and their trade-offs
- A table quantitatively comparing RPO, RTO, cost, complexity, and write latency per topology
- Topology selection criteria by requirement: compliance, cost, global serving, and more
- The build, failover, and validation roadmap covered in parts 2–5 of the series
1. Why Start With RPO/RTO
When DR comes up, a common reaction is "isn't it just about redundancy?" But the form and level of redundancy are decided by two target numbers. Without agreeing on them first, you either build an overly expensive architecture or a design that loses data exactly when disaster strikes.
Defining RPO and RTO
| Metric | Full name | Meaning | One-line summary |
|---|---|---|---|
| RPO | Recovery Point Objective | Tolerable data loss | "How far back from the disaster is the last safely persisted point?" |
| RTO | Recovery Time Objective | Tolerable downtime | "How long may it take from outage to a healthy service?" |
RPO is a tolerance for data loss, expressed as time. An RPO of 5 minutes means the business can absorb losing the data that arrived in the 5 minutes just before the disaster. RTO is a tolerance for downtime, expressed as time. An RTO of 15 minutes means consumers and producers must be operating normally again within 15 minutes of the outage.
Two Numbers Decide the Architecture
The smaller you push RPO, the more frequently and synchronously replication must happen — which raises write latency and cost. The smaller you push RTO, the more failover must be automated and pre-staged — which raises operational complexity. DR design is ultimately the work of finding the balance point between RPO/RTO targets and cost, complexity, and latency.
The key question is not "can we prevent the outage?" but "when an outage happens, how much do we lose and how fast do we recover?" RPO/RTO are the contract that writes that answer down as numbers.
2. Asynchronous Replication and the Limit of RPO
Kafka's cross-cluster replication (e.g., MirrorMaker 2) is fundamentally asynchronous. A producer writes a message to the primary cluster, then the replication tool reads that message and writes it again to the DR cluster. This "read then re-write" process inevitably has a time gap — the replication lag.
RPO ≈ Replication Lag at the Moment of Disaster
When a disaster strikes under asynchronous replication, the messages that reached the primary but had not yet been replicated to DR are lost along with it. Therefore:
RPO with asynchronous replication ≈ the replication lag at the exact moment of the disaster
If replication lag is always 200ms, RPO is about 200ms. But lag is not a fixed value. Under traffic spikes or network jitter, lag can stretch to seconds or minutes, and if the disaster happens at that instant, RPO grows with it. This is why RPO with asynchronous replication can never be zero. You can drive it close to zero, but guaranteeing "exactly zero data lost" requires synchronous replication (the Stretch cluster, covered later).
Factors That Inflate Replication Lag
| Factor | Description | Impact on RPO |
|---|---|---|
| Network latency/bandwidth | Physical distance between regions, available bandwidth | Higher lag → higher RPO |
| Producer throughput spikes | Replication momentarily falls behind | Temporary lag spike |
| Insufficient replication tasks | Too few MM2 tasks / partition parallelism | Backlog accumulates |
| DR cluster load | Write-throughput ceiling on the DR side | Slower consumption |
The takeaway is clear. You must monitor replication lag continuously, recognizing that this lag is your effective RPO. (Lag monitoring and offset tracking are covered in detail in part 3.)
3. The Three DR Topologies
There are three representative topologies for Kafka DR. Each sits at a completely different point on RPO, RTO, cost, and complexity.
3.1 Active-Passive (Warm Standby)
The most common and simplest setup. Only the primary cluster serves traffic, while the DR cluster receives data through asynchronous replication (e.g., MirrorMaker 2) and stands by.
- Behavior: In normal operation all reads/writes happen on the primary. DR only receives replication and does not serve.
- Failover: On disaster, switch consumers and producers to the DR cluster. Consumers must resume from translated offsets (the topic of part 4).
- Pros: Simple to set up, and being one-way replication, there is no loop or conflict to worry about. It is also the cheapest.
- Cons: Being asynchronous, RPO is not zero. DR receives no traffic in normal times, so its resources sit idle. Failover needs human/automation involvement and some time, so RTO tends to land in minutes.
3.2 Active-Active (Bidirectional)
Both clusters serve traffic at the same time and replicate each other's data bidirectionally. This fits global services that want a nearby cluster per region while still sharing all data.
- Behavior: Both clusters accept writes and replicate their topics to the other.
- Cycle prevention required: You must stop an infinite loop where a message replicated A → B comes back B → A. MM2 prefixes topics with the source cluster alias (e.g.,
A.orders) so that a cluster does not re-replicate topics it already replicated. - Conflict handling: Updating the same key on both sides simultaneously creates ordering/conflict issues. You need a design that pins "which cluster owns writes" per key (partitioning) or resolves conflicts at the application level.
- Pros: Every cluster uses its resources, and per-region low-latency serving is possible. If one side dies, the other is already serving traffic, so RTO can be the shortest.
- Cons: Complexity is the highest. You must account for cycle prevention, conflict handling, and offset/consumer-group management on both sides. Being asynchronous, RPO is still not zero.
3.3 Stretch Cluster (Single Cluster Across AZs/Regions)
Rather than two separate clusters, you place a single Kafka cluster across multiple availability zones (AZs) or regions. Replication is not cross-cluster but synchronous replication within the cluster.
- Rack awareness: Setting
broker.rackon each broker makes Kafka spread the replicas of a partition across different AZs. Even if an entire AZ dies, replicas survive in the other AZs. - Synchronous replication: With producers set to
acks=alland topics tomin.insync.replicas=2(or more), a write succeeds only after replicas in multiple AZs have acknowledged it. So even if one AZ dies, already-committed data is guaranteed in the other AZs. - RPO ≈ 0: Thanks to synchronous replication, data loss effectively converges to zero. This is the Stretch cluster's biggest strength.
- Quorum requirement: To satisfy the controller quorum (KRaft) and
min.insync.replicasat once, at least three zones are recommended. Two zones can lose quorum if one side dies. - Cons: Every write waits for cross-zone synchronous replication, so write latency is directly tied to the cross-zone network round trip (RTT). The farther apart the zones (regions), the higher the latency, so it usually fits the same metro / nearby regions or multi-AZ.
4. Quantitative Topology Comparison
Comparing the three topologies in one table makes the trade-offs explicit.
| Item | Active-Passive | Active-Active | Stretch cluster |
|---|---|---|---|
| RPO | Non-zero (equals replication lag) | Non-zero (equals replication lag) | ≈ 0 (synchronous) |
| RTO | Minutes (failover required) | Shortest (already serving) | Shortest (automatic leader election) |
| Cost | Lowest (DR on standby) | High (both fully active) | Medium (single cluster, extra zones) |
| Complexity | Low (one-way) | Highest (cycles, conflicts) | Medium (rack/quorum design) |
| Write latency | Unaffected (local write) | Unaffected (local write) | High (cross-zone sync RTT) |
| Failure domains tolerated | Whole cluster/region | Whole cluster/region | AZ (or nearby region) granularity |
| Replication mode | Asynchronous (MM2) | Async bidirectional (MM2) | Synchronous (intra-cluster) |
| DR resource usage | Idle | Fully used | Fully used |
The core trade-off in one sentence: Stretch drives RPO close to zero but pays in write latency and a distance limit between zones, while Active-Passive/Active-Active protect even distant regions without write-latency penalty but give up some RPO.
5. Topology Selection Criteria by Requirement
There is no single right answer. It depends on your RPO/RTO targets, compliance requirements, geographic distribution, and budget. Start with the criteria below.
Decision Guide
| Priority / requirement | Recommended topology | Why |
|---|---|---|
| Zero-loss (RPO≈0), compliance-mandated (finance, payments) | Stretch cluster | Only synchronous replication guarantees zero data loss |
| Cost-sensitive, "minute-level RPO acceptable" | Active-Passive | Simplest and cheapest, protects against a single-region outage |
| Global low-latency serving, multi-region writes | Active-Active | Per-region local writes + shared data |
| Extremely latency-sensitive writes + distant DR needed | Active-Passive/Active-Active | Avoids Stretch's sync latency while protecting at distance |
| Limited ops staff / maturity | Active-Passive | Avoids the conflict management of bidirectional replication |
Common Pitfalls
- The myth that "just having a DR cluster means RPO is zero" — as long as replication is asynchronous, RPO equals the replication lag. If you truly need zero, evaluate Stretch (synchronous).
- Building Stretch with only two zones — you can lose quorum. Secure at least three zones for the controller/ISR quorum.
- Applying Stretch to distant regions — the cross-zone RTT becomes your write latency. If distant protection is the goal, an asynchronous-replication topology is the right fit.
- Watching only RTO and forgetting RPO — fixating on "recover fast" makes you miss "how much do we lose." Set both numbers together.
6. Series Roadmap
This was part 1, laying the foundation. The remaining four parts carry the design through to actual build, operations, and validation.
| Part | Topic | Core content |
|---|---|---|
| Part 1 (this post) | DR design foundations | RPO/RTO, topology selection |
| Part 2 | Building MirrorMaker 2 | MM2 architecture, connector config, topic/ACL replication |
| Part 3 | Offset translation and failover prep | Consumer offset translation, lag monitoring, RemoteClusterUtils |
| Part 4 | Failover/failback runbook | Step-by-step switchover, data consistency on failback |
| Part 5 | Testing and validation | DR drills, chaos testing, measured RPO/RTO |
Part 2 walks step by step through actually building the most common Active-Passive setup with MirrorMaker 2.
Wrapping up
- Every DR design starts from two numbers: RPO (tolerable data loss) and RTO (tolerable downtime). Without them, you have a wish, not a design.
- The RPO of asynchronous cross-cluster replication equals the replication lag at the moment of disaster, and can never be exactly zero. You can get close, but guaranteeing zero requires synchronous replication.
- Active-Passive is simplest and cheapest, but RPO is non-zero and RTO is in minutes. It fits cost-sensitive cases that tolerate minute-level loss.
- Active-Active fully uses resources and enables global low-latency serving, but cycle prevention and conflict handling make it the most complex.
- The Stretch cluster converges RPO to zero through synchronous replication, but adds cross-zone write latency and a minimum-three-zone (quorum) requirement, and is unsuitable for distant regions.
- Compliance/zero-loss → Stretch, cost-sensitive → Active-Passive, global low-latency → Active-Active. Requirements decide the choice.
- In part 2, we build an actual Active-Passive DR with MirrorMaker 2 on top of this foundation.
References
- Apache Kafka. "Geo-Replication (Cross-Cluster Data Mirroring)" — https://kafka.apache.org/documentation/#georeplication
- Apache Kafka. "MirrorMaker 2.0 (KIP-382)" — https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
- Apache Kafka. "Rack Awareness & Replica Placement" — https://kafka.apache.org/documentation/#basic_ops_racks
- Confluent. "Disaster Recovery for Multi-Region Kafka" — https://docs.confluent.io/platform/current/multi-dc-deployments/index.html
- Confluent. "Multi-Region Clusters" — https://docs.confluent.io/platform/current/multi-dc-deployments/multi-region.html
— The Data Dynamics Engineering Team