PySpark Dynamic Partition Overwrite and Idempotent Backfills — Reprocessing Without the Risk
How to recompute and overwrite just a few dates without wiping out the entire table. We cover dynamic partition overwrite mode, idempotent backfill design, and the safe partial-overwrite and replaceWhere patterns in Iceberg/Delta.
If you operate data pipelines, requests like "last Tuesday's data is wrong — recompute just that date and overwrite it" never stop coming. But if you naively write with mode("overwrite"), you don't just replace that date — the entire table is gone. Conversely, trying to avoid overwrites often leads to loading the same data twice and creating duplicates.
This post covers how to safely overwrite only specific partitions, how to design idempotent backfills whose results stay the same across reruns, and the safe partial-overwrite patterns in Iceberg/Delta.
1. How the Incident Starts — overwrite Deletes Everything
# Dangerous: we only meant to rewrite dt=2026-06-03, but...
(reprocessed_one_day
.write
.mode("overwrite") # ← overwrites the ENTIRE partitioned table! 💥
.partitionBy("dt")
.parquet("/warehouse/events"))

The default overwrite deletes the entire target path and writes fresh. If you overwrite with a DataFrame containing only one day of data, the other 364 days disappear. This incident is extremely common in practice.
2. Solution ① Dynamic Partition Overwrite
Spark has a dynamic mode that overwrites "only the partitions present in the incoming data."
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(reprocessed_one_day # contains only dt=2026-06-03
.write
.mode("overwrite")
.partitionBy("dt")
.parquet("/warehouse/events"))
# → only the dt=2026-06-03 partition is replaced; other dates stay intact

| Mode | Behavior |
|---|---|
| static (default) | Overwrites the entire target path |
| dynamic | Overwrites only partitions present in the DataFrame |
Key point: in dynamic mode, only the partition values that appear in the DataFrame are replaced. If you write five days of data, only those five days are replaced and the rest is preserved. This is the basic tool for backfills.
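The mode can also be scoped to a single write rather than the whole session, which keeps one job's setting from leaking into unrelated writes. A minimal sketch, assuming Spark's documented `partitionOverwriteMode` writer option; the function name and its parameters are illustrative and not part of the original pipeline:

```python
def overwrite_incoming_partitions(df, path, partition_col="dt"):
    """Replace only the partitions present in `df` under `path`.

    `df` is expected to be a pyspark.sql.DataFrame. The per-write option
    takes precedence over spark.sql.sources.partitionOverwriteMode.
    """
    (df.write
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")  # applies to this write only
       .partitionBy(partition_col)
       .parquet(path))
```

Wrapping the write in one helper also means the static default cannot be picked up by accident when a new job forgets the session setting.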
Limitations of dynamic mode: you need a partition column, and replacement happens only at "partition granularity." You cannot replace just some rows within a partition.
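The difference between the two modes, and the partition-granularity limitation, can be illustrated with a toy model in plain Python. This is not how Spark implements the write; the table is modeled as a dict from partition value to rows, and both function names are made up for the illustration:

```python
def static_overwrite(table, incoming):
    """Static mode: drop everything under the target, keep only the new data."""
    return {part: list(rows) for part, rows in incoming.items()}


def dynamic_overwrite(table, incoming):
    """Dynamic mode: replace only the partitions that appear in `incoming`."""
    result = {part: list(rows) for part, rows in table.items()}
    for part, rows in incoming.items():
        result[part] = list(rows)  # the whole partition is swapped, not merged
    return result


table = {
    "2026-06-02": ["a", "b"],
    "2026-06-03": ["c", "d"],
    "2026-06-04": ["e"],
}
incoming = {"2026-06-03": ["c2"]}  # a recomputed day with a single row

after_static = static_overwrite(table, incoming)
after_dynamic = dynamic_overwrite(table, incoming)

print(sorted(after_static))         # ['2026-06-03']
print(sorted(after_dynamic))        # ['2026-06-02', '2026-06-03', '2026-06-04']
print(after_dynamic["2026-06-03"])  # ['c2']
```

The last line shows the granularity limit: row `d` disappears along with `c`, because the incoming data replaces the whole `2026-06-03` partition rather than individual rows.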
3. Solution ② Iceberg / Delta — replaceWhere / Conditional Overwrite
Lakehouse formats are safer and more flexible. Because you explicitly state the overwrite scope as a predicate, the risk of incidents is much lower.
Delta replaceWhere
(reprocessed
.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", "dt >= '2026-06-01' AND dt <= '2026-06-03'")
.save("/warehouse/events"))
# → only data matching the predicate is replaced, atomically

Iceberg overwritePartitions / dynamic
# Iceberg: overwritePartitions on DataFrameWriterV2 (dynamic)
(reprocessed
.writeTo("analytics.events")
.overwritePartitions()) # replaces only the incoming partitions
# Or delete by predicate in SQL, then append or use MERGE.
# Note: DELETE and the follow-up append are separate commits, so the pair is not atomic.
spark.sql("""
    DELETE FROM analytics.events WHERE dt = DATE '2026-06-03'
""")

| Approach | Safety | Granularity |
|---|---|---|
| dynamic overwrite | Medium (partition-level) | Partition |
| Delta replaceWhere | High (explicit predicate) | Predicate range |
| Iceberg overwritePartitions | High (atomic) | Partition |
| MERGE | High (row-level) | Row |
The lakehouse advantage: overwrites are atomic. If a job fails midway, no broken partial state is left behind. And because the predicate is explicit, the "entire table wiped out" incident is structurally hard to cause.
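Since the predicate decides what gets replaced, it helps to build it from validated dates instead of formatting raw strings by hand. A small sketch of such a helper; `replace_where_for_dates` is a hypothetical name, and the `dt` column is taken from the examples in this post:

```python
from datetime import date


def replace_where_for_dates(start, end, column="dt"):
    """Build a replaceWhere predicate for an inclusive date range.

    Accepts ISO date strings or date objects and rejects anything else.
    """
    def as_date(value):
        return value if isinstance(value, date) else date.fromisoformat(value)

    start_d, end_d = as_date(start), as_date(end)
    if start_d > end_d:
        raise ValueError(f"empty range: {start_d} > {end_d}")
    if not column.isidentifier():
        raise ValueError(f"suspicious column name: {column!r}")
    return (f"{column} >= '{start_d.isoformat()}' "
            f"AND {column} <= '{end_d.isoformat()}'")


print(replace_where_for_dates("2026-06-01", "2026-06-03"))
# dt >= '2026-06-01' AND dt <= '2026-06-03'
```

The returned string can be passed to `.option("replaceWhere", ...)`. A reversed range or a malformed date fails before any write starts, instead of silently replacing the wrong slice.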
4. Idempotent Backfills — Same Result on Every Rerun
The essence of a backfill is idempotency: "reprocessing the same date multiple times must produce the same result." Retries and duplicate executions are inevitable in operations.
Non-idempotent pipeline:
  load with append → rerunning piles the same data up twice (duplicates!)
Idempotent pipeline:
  "replace the target partition" → same result no matter how many times you run it

Two patterns for idempotent backfills:
Pattern A — Partition Replacement (overwrite)
def backfill_day(dt):
    # compute_for_day and TABLE are defined elsewhere in the pipeline
    result = compute_for_day(dt)  # recompute one day of data for dt
    (result.write
        .format("delta").mode("overwrite")
        .option("replaceWhere", f"dt = '{dt}'")  # replace only that date
        .save(TABLE))
# As long as compute_for_day is deterministic, every run replaces the dt partition with the same result → idempotent

Pattern B — MERGE (Row-Level Idempotency)
# Key-based upsert, idempotent via updated_at comparison (see "PySpark Large-Scale Deduplication and SCD2")
spark.sql("""
    MERGE INTO analytics.events t USING updates s
    ON t.id = s.id
    WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

| Pattern | Best for |
|---|---|
| Partition replacement | When the entire partition can be recomputed (daily batches) |
| MERGE | Incremental, key-based, only some rows change |
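Why the `updated_at` comparison in Pattern B makes a rerun harmless can be seen in a toy model of the MERGE semantics. This is plain Python over dicts, not the table-format implementation, and it assumes at most one update row per `id`; `merge_upsert` is a made-up name:

```python
def merge_upsert(target, updates):
    """Apply MERGE-like semantics to a dict keyed by id.

    Matched rows are updated only when the incoming updated_at is newer;
    unmatched rows are inserted. Returns a new dict.
    """
    result = dict(target)
    for row in updates:
        current = result.get(row["id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            result[row["id"]] = row
    return result


target = {1: {"id": 1, "value": "old", "updated_at": "2026-06-03T10:00:00"}}
updates = [
    {"id": 1, "value": "new", "updated_at": "2026-06-03T12:00:00"},
    {"id": 2, "value": "added", "updated_at": "2026-06-03T12:00:00"},
]

once = merge_upsert(target, updates)
twice = merge_upsert(once, updates)  # simulate a retried job

print(once == twice)     # True
print(once[1]["value"])  # new
print(len(twice))        # 2
```

On the second run every incoming row matches an existing row with an equal `updated_at`, so nothing is updated or inserted and the table is unchanged.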
5. Backfill Orchestration
Large-range backfills (e.g., the past year) are split and run per date.
from datetime import date, timedelta

def date_range(start, end):
    """Yield every date from start to end, inclusive."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

for d in date_range(date(2025, 1, 1), date(2025, 12, 31)):
    backfill_day(d.isoformat())  # each date is idempotent → on failure, rerun just that date

Operational tips:
- Make each date an independent unit: if each date's backfill is idempotent and independent, you only need to rerun the dates that failed.
- Use a scheduler like Airflow to run the dates as parallel tasks, but watch out for concurrent write conflicts when several tasks write to the same table.
- For large data volumes, don't run the whole range as a single job — process it in chunks.
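The tips above can be sketched as a small driver that processes the range in chunks and records which dates failed, so only those need a rerun. This is one possible shape under stated assumptions, not a production orchestrator: `chunked_dates`, `run_backfill`, and `flaky_backfill` are hypothetical names, and `flaky_backfill` stands in for the real per-date job.

```python
from datetime import date, timedelta


def chunked_dates(start, end, chunk_days):
    """Yield lists of consecutive dates, at most chunk_days long, covering [start, end]."""
    chunk = []
    d = start
    while d <= end:
        chunk.append(d)
        if len(chunk) == chunk_days:
            yield chunk
            chunk = []
        d += timedelta(days=1)
    if chunk:
        yield chunk


def run_backfill(dates, backfill_fn):
    """Run backfill_fn(iso_date) per date and return the dates that failed."""
    failed = []
    for d in dates:
        try:
            backfill_fn(d.isoformat())
        except Exception as exc:  # keep going; one bad date should not stop the rest
            print(f"backfill failed for {d}: {exc}")
            failed.append(d)
    return failed


# Stand-in for the real per-date job: fails once for one date, then succeeds.
attempts = {}

def flaky_backfill(dt):
    attempts[dt] = attempts.get(dt, 0) + 1
    if dt == "2025-01-03" and attempts[dt] == 1:
        raise RuntimeError("simulated transient failure")


failed = []
for chunk in chunked_dates(date(2025, 1, 1), date(2025, 1, 7), chunk_days=3):
    failed += run_backfill(chunk, flaky_backfill)

print(failed)                                # [datetime.date(2025, 1, 3)]
print(run_backfill(failed, flaky_backfill))  # []
```

Rerunning only the failed dates is safe precisely because each per-date backfill is idempotent; with an append-based job the same retry would duplicate data.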
6. Common Pitfalls
| Pitfall | Result | Avoidance |
|---|---|---|
| Partial write with static overwrite | Entire table lost | dynamic mode / replaceWhere |
| Backfilling with append | Accumulating duplicates | overwrite/MERGE |
| dynamic without a partition column | Dynamic mode has no effect; the whole path is overwritten | partitionBy is required |
| Concurrent backfill conflicts | Commit conflicts, corruption | Table-format transactions, serialization |
| Small files after overwrite | They accumulate | Regular compaction (separate post) |
To structurally prevent static-overwrite incidents, make the lakehouse replaceWhere/overwritePartitions your default tools. replaceWhere forces you to state a predicate, and overwritePartitions only touches the partitions present in the incoming data, which makes the "delete everything" mistake hard to commit.
7. Summary
| Tool | Key point |
|---|---|
| dynamic partition overwrite | Replaces only the incoming partitions |
| Delta replaceWhere | Atomically replaces only the predicate range |
| Iceberg overwritePartitions | Atomic partition replacement |
| MERGE | Row-level idempotent upsert |
| Backfill design | Date-independent, idempotent units |
There are two keys to partition overwrites and backfills. First, recognize the "delete everything" hazard of static overwrite and use dynamic mode or conditional overwrites. Second, design backfills to be idempotent, so the result stays stable across reruns and duplicate executions. With the atomic, conditional overwrites of Iceberg/Delta as your default tools, the everyday request to "just recompute yesterday's data" stops being a dangerous operation.
This post is based on Spark 3.5 + Iceberg/Delta. If you need help designing safe backfill and reprocessing pipelines, feel free to reach out.
— Data Dynamics Engineering Team