
PySpark Dynamic Partition Overwrite and Idempotent Backfills — Reprocessing Without the Risk

How to recompute and overwrite just a few dates without wiping out the entire table. We cover dynamic partition overwrite mode, idempotent backfill design, and the safe partial-overwrite and replaceWhere patterns in Iceberg/Delta.

Data Dynamics · June 5, 2026 · 6 min read

If you operate data pipelines, requests like "last Tuesday's data is wrong — recompute just that date and overwrite it" never stop coming. But if you naively write with mode("overwrite"), you don't just replace that date — the entire table is gone. Conversely, trying to avoid overwrites often leads to loading the same data twice and creating duplicates.

This post covers how to safely overwrite only specific partitions, how to design idempotent backfills whose results stay the same across reruns, and the safe partial-overwrite patterns in Iceberg/Delta.

1. How the Incident Starts — overwrite Deletes Everything

# Dangerous: we only meant to rewrite dt=2026-06-03, but...
(reprocessed_one_day
    .write
    .mode("overwrite")            # ← overwrites the ENTIRE partitioned table! 💥
    .partitionBy("dt")
    .parquet("/warehouse/events"))

In the default (static) mode, overwrite deletes everything under the target path and then writes the new data. If you overwrite a year-long table with a DataFrame containing only one day of data, the other 364 days disappear. This incident is extremely common in practice.

2. Solution ① Dynamic Partition Overwrite

Spark has a dynamic mode that overwrites "only the partitions present in the incoming data."

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
 
(reprocessed_one_day              # contains only dt=2026-06-03
    .write
    .mode("overwrite")
    .partitionBy("dt")
    .parquet("/warehouse/events"))
# → only the dt=2026-06-03 partition is replaced; other dates stay intact

Mode	Behavior
`static` (default)	Overwrites the entire target path
`dynamic`	Overwrites only partitions present in the DataFrame

Key point: in dynamic mode, only the partition values that appear in the DataFrame are replaced. If you write five days of data, only those five days are replaced and the rest is preserved. This is the basic tool for backfills.
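The difference between the two modes can be sketched with a toy model in plain Python. This is only an illustration of the semantics described above, not how Spark is implemented; the `overwrite` function and the dict-based "table" are hypothetical stand-ins.

```python
def overwrite(table, incoming, mode="static"):
    """Toy model of a partitioned table: dict of partition value -> list of rows."""
    if mode == "static":
        # Static mode: everything under the target path is replaced.
        return dict(incoming)
    if mode == "dynamic":
        # Dynamic mode: only partitions present in the incoming data are replaced.
        merged = dict(table)
        merged.update(incoming)
        return merged
    raise ValueError(f"unknown mode: {mode}")


table = {"2026-06-01": ["a"], "2026-06-02": ["b"], "2026-06-03": ["bad"]}
incoming = {"2026-06-03": ["fixed"]}

after_static = overwrite(table, incoming, mode="static")
after_dynamic = overwrite(table, incoming, mode="dynamic")

print(sorted(after_static))   # ['2026-06-03']
print(sorted(after_dynamic))  # ['2026-06-01', '2026-06-02', '2026-06-03']
```

In the static case the two untouched dates are gone; in the dynamic case only `2026-06-03` changes.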

Limitations of dynamic mode: you need a partition column, and replacement happens only at "partition granularity." You cannot replace just some rows within a partition.
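A session-wide setting is easy to forget or to have overridden elsewhere. For file-based sources, Spark also accepts `partitionOverwriteMode` as a per-write option that takes precedence over the session config, so one possible approach is to wrap the write in a small helper. The helper name and its default column below are hypothetical.

```python
def overwrite_incoming_partitions(df, path, partition_col="dt"):
    """Overwrite only the partitions present in df, regardless of the session default.

    df is expected to be a PySpark DataFrame; the helper itself needs no imports.
    """
    (df.write
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")  # per-write override
        .partitionBy(partition_col)
        .parquet(path))
```

Usage would look like `overwrite_incoming_partitions(reprocessed_one_day, "/warehouse/events")`. Keeping the option next to the write makes the intended scope visible at the call site.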

3. Solution ② Iceberg / Delta — replaceWhere / Conditional Overwrite

Lakehouse formats are safer and more flexible. Because you explicitly state the overwrite scope as a predicate, the risk of incidents is much lower.

Delta replaceWhere

(reprocessed
    .write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "dt >= '2026-06-01' AND dt <= '2026-06-03'")
    .save("/warehouse/events"))
# → only data matching the predicate is replaced, atomically
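Hand-typed predicate strings are a common source of mistakes (swapped bounds, a wrong year). A minimal sketch of building the `replaceWhere` predicate from `date` values follows; the function names are hypothetical, and the `dt` column name is taken from the example above.

```python
from datetime import date, datetime


def replace_where_for_range(start, end, column="dt"):
    """Build an inclusive date-range predicate for Delta's replaceWhere option."""
    for value in (start, end):
        # datetime is a subclass of date, so it is rejected explicitly.
        if isinstance(value, datetime) or not isinstance(value, date):
            raise TypeError("start and end must be datetime.date values")
    if start > end:
        raise ValueError(f"empty range: {start} > {end}")
    return f"{column} >= '{start.isoformat()}' AND {column} <= '{end.isoformat()}'"


def overwrite_range(df, path, start, end):
    """Replace only the rows in [start, end]; df is expected to be a PySpark DataFrame."""
    predicate = replace_where_for_range(start, end)
    (df.write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)
        .save(path))
    return predicate


print(replace_where_for_range(date(2026, 6, 1), date(2026, 6, 3)))
# dt >= '2026-06-01' AND dt <= '2026-06-03'
```

Validating the bounds before the write means a swapped range fails fast instead of silently replacing nothing.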

Iceberg overwritePartitions / dynamic

# Iceberg: overwritePartitions on DataFrameWriterV2 (dynamic)
(reprocessed
    .writeTo("analytics.events")
    .overwritePartitions())       # replaces only the incoming partitions
 
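# Or a conditional delete (SQL), followed by an append or MERGE.
# Note: the DELETE and the later append are separate commits, so readers can
# briefly see the date as empty, and a failure in between leaves it empty.
spark.sql("""
  DELETE FROM analytics.events WHERE dt = DATE '2026-06-03'
""")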

Approach	Safety	Granularity
dynamic overwrite	Medium (partition-level)	Partition
Delta `replaceWhere`	High (explicit predicate)	Predicate range
Iceberg `overwritePartitions`	High (atomic)	Partition
MERGE	High (row-level)	Row

The lakehouse advantage: overwrites are atomic. If a job fails midway, no broken partial state is left behind. And because the predicate is explicit, the "entire table wiped out" incident is structurally hard to cause.

4. Idempotent Backfills — Same Result on Every Rerun

The essence of a backfill is idempotency: "reprocessing the same date multiple times must produce the same result." Retries and duplicate executions are inevitable in operations.
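To make the definition concrete, here is a small plain-Python sketch comparing an append-style backfill with a replace-style one under retries. The dict-based table and both function names are hypothetical stand-ins for real writes.

```python
def append_backfill(table, dt, rows):
    """Not idempotent: every run adds the rows again."""
    table.setdefault(dt, []).extend(rows)


def replace_backfill(table, dt, rows):
    """Idempotent: every run leaves the partition in the same state."""
    table[dt] = list(rows)


recomputed = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
appended, replaced = {}, {}

for _ in range(3):  # simulate the same backfill being retried three times
    append_backfill(appended, "2026-06-03", recomputed)
    replace_backfill(replaced, "2026-06-03", recomputed)

print(len(appended["2026-06-03"]))  # 6
print(len(replaced["2026-06-03"]))  # 2
```

The replace-style write ends in the same state after one run or three, which is the property the patterns in this section aim for.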


Two patterns for idempotent backfills:

Pattern A — Partition Replacement (overwrite)

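TABLE = "/warehouse/events"   # Delta table path, as in the replaceWhere example above

def backfill_day(dt):
    """Recompute one day and replace just that date. dt is an ISO date string."""
    result = compute_for_day(dt)       # your pipeline's own recompute logic for one day
    (result.write
        .format("delta").mode("overwrite")
        .option("replaceWhere", f"dt = '{dt}'")   # replace only that date
        .save(TABLE))
    # As long as compute_for_day is deterministic for the same inputs, the dt
    # partition ends up with the same result however many times this runs → idempotent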

Pattern B — MERGE (Row-Level Idempotency)

# Key-based upsert, idempotent via updated_at comparison (see "PySpark Large-Scale Deduplication and SCD2")
# `updates` is assumed to be a table or temp view holding the recomputed rows, one row per id
spark.sql("""
  MERGE INTO analytics.events t USING updates s
  ON t.id = s.id
  WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
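Why the `updated_at` comparison makes the upsert idempotent can be seen in a plain-Python model of the two MERGE clauses. This is a sketch of the logic only, assuming one row per `id` in the updates; `merge_rows` is a hypothetical name.

```python
def merge_rows(target, updates):
    """target: dict of id -> row. updates: rows with 'id' and 'updated_at' keys."""
    for row in updates:
        current = target.get(row["id"])
        if current is None:
            # WHEN NOT MATCHED THEN INSERT *
            target[row["id"]] = dict(row)
        elif row["updated_at"] > current["updated_at"]:
            # WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
            target[row["id"]] = dict(row)
        # Otherwise the existing row is at least as new, so nothing changes.
    return target


target = {1: {"id": 1, "value": "old", "updated_at": "2026-06-01T00:00:00"}}
updates = [
    {"id": 1, "value": "new", "updated_at": "2026-06-03T00:00:00"},
    {"id": 2, "value": "first", "updated_at": "2026-06-03T00:00:00"},
]

once = merge_rows(dict(target), updates)
twice = merge_rows(dict(once), updates)
print(once == twice)  # True
```

On the second run every incoming row has an `updated_at` equal to the stored one, so neither branch fires and the table is unchanged.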

Pattern	Best for
Partition replacement	When the entire partition can be recomputed (daily batches)
MERGE	Incremental, key-based, only some rows change

5. Backfill Orchestration

Large-range backfills (e.g., the past year) are split and run per date.

from datetime import date, timedelta
 
def date_range(start, end):
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)
 
for d in date_range(date(2025, 1, 1), date(2025, 12, 31)):
    backfill_day(d.isoformat())   # each date is idempotent → on failure, rerun just that date
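The loop above stops at the first exception. One possible variation, sketched below, keeps going and collects the failed dates so that only those need a rerun. `run_backfill` and the stub `flaky_backfill` are hypothetical; in practice you would pass your own per-day function.

```python
from datetime import date, timedelta


def run_backfill(start, end, backfill_fn):
    """Run backfill_fn for each date in [start, end]; return the dates that failed."""
    failed = []
    d = start
    while d <= end:
        try:
            backfill_fn(d.isoformat())
        except Exception as exc:  # broad on purpose: record the failure and continue
            failed.append((d, repr(exc)))
        d += timedelta(days=1)
    return failed


def flaky_backfill(dt):
    """Stub standing in for a real per-day backfill; fails on one date."""
    if dt == "2025-01-02":
        raise RuntimeError("simulated commit conflict")


failed = run_backfill(date(2025, 1, 1), date(2025, 1, 3), flaky_backfill)
print([d.isoformat() for d, _ in failed])  # ['2025-01-02']
```

Because each date is idempotent, rerunning only the returned dates is safe even if some of them had partially progressed.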

Operational tips:

Make each date an independent unit: if each date's backfill is idempotent and independent, you only need to rerun the dates that failed.
Use a scheduler like Airflow to parallelize as per-date tasks, but watch out for concurrent write conflicts (same table).
For large data volumes, don't run the whole range as a single job — process it in chunks.
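Chunking can be as simple as splitting the range into fixed-size windows, each of which becomes one job (for example, one range-scoped `replaceWhere` write). A minimal sketch follows; `chunk_ranges` and the 7-day default are assumptions for illustration.

```python
from datetime import date, timedelta


def chunk_ranges(start, end, days=7):
    """Yield inclusive (chunk_start, chunk_end) pairs covering [start, end]."""
    if days < 1:
        raise ValueError("days must be at least 1")
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + timedelta(days=days - 1), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + timedelta(days=1)


for chunk_start, chunk_end in chunk_ranges(date(2025, 1, 1), date(2025, 1, 20)):
    print(chunk_start.isoformat(), chunk_end.isoformat())
# 2025-01-01 2025-01-07
# 2025-01-08 2025-01-14
# 2025-01-15 2025-01-20
```

The chunks do not overlap, so each one can be retried on its own. If they run in parallel against the same table, the concurrent-write caveat above still applies.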

6. Common Pitfalls

Pitfall	Result	Avoidance
Partial write with static overwrite	Entire table lost	dynamic mode / replaceWhere
Backfilling with append	Accumulating duplicates	overwrite/MERGE
dynamic without a partition column	Nothing to scope the overwrite to; the whole dataset is replaced	partitionBy is required
Concurrent backfill conflicts	Commit conflicts, corruption	Table-format transactions, serialization
Small files after overwrite	They accumulate	Regular compaction (separate post)

To structurally prevent static-overwrite incidents, make the lakehouse replaceWhere/overwritePartitions your default tools. replaceWhere forces you to state a predicate, and overwritePartitions only touches the partitions present in the incoming data, so the "delete everything" mistake is hard to make.

7. Summary

Tool	Key point
dynamic partition overwrite	Replaces only the incoming partitions
Delta `replaceWhere`	Atomically replaces only the predicate range
Iceberg `overwritePartitions`	Atomic partition replacement
MERGE	Row-level idempotent upsert
Backfill design	Date-independent, idempotent units

There are two keys to partition overwrites and backfills. First, recognize the "delete everything" hazard of static overwrite and use dynamic mode or conditional overwrites. Second, design backfills to be idempotent, so the result stays stable across reruns and duplicate executions. With the atomic, conditional overwrites of Iceberg/Delta as your default tools, the everyday request to "just recompute yesterday's data" stops being a dangerous operation.

This post is based on Spark 3.5 + Iceberg/Delta. If you need help designing safe backfill and reprocessing pipelines, feel free to reach out.

— Data Dynamics Engineering Team