
PySpark Dynamic Partition Overwrite and Idempotent Backfills — Reprocessing Without the Risk

How to recompute and overwrite just a few dates without wiping out the entire table. We cover dynamic partition overwrite mode, idempotent backfill design, and the safe partial-overwrite and replaceWhere patterns in Iceberg/Delta.

Data Dynamics · June 5, 2026 · 6 min read

If you operate data pipelines, requests like "last Tuesday's data is wrong — recompute just that date and overwrite it" never stop coming. But if you naively write with mode("overwrite"), you don't just replace that date — the entire table is gone. Conversely, trying to avoid overwrites often leads to loading the same data twice and creating duplicates.

This post covers how to safely overwrite only specific partitions, how to design idempotent backfills whose results stay the same across reruns, and the safe partial-overwrite patterns in Iceberg/Delta.

1. How the Incident Starts — overwrite Deletes Everything

# Dangerous: we only meant to rewrite dt=2026-06-03, but...
(reprocessed_one_day
    .write
    .mode("overwrite")            # ← overwrites the ENTIRE partitioned table! 💥
    .partitionBy("dt")
    .parquet("/warehouse/events"))

By default, overwrite deletes everything under the target path and then writes the new data. If the table holds a year of data and you overwrite it with a DataFrame containing a single day, the other 364 days disappear. This incident is very common in practice.

2. Solution ① Dynamic Partition Overwrite

Spark has a dynamic mode that overwrites "only the partitions present in the incoming data."

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
 
(reprocessed_one_day              # contains only dt=2026-06-03
    .write
    .mode("overwrite")
    .partitionBy("dt")
    .parquet("/warehouse/events"))
# → only the dt=2026-06-03 partition is replaced; other dates stay intact

| Mode | Behavior |
| --- | --- |
| static (default) | Overwrites the entire target path |
| dynamic | Overwrites only partitions present in the DataFrame |

Key point: in dynamic mode, only the partition values that appear in the DataFrame are replaced. If you write five days of data, only those five days are replaced and the rest is preserved. This is the basic tool for backfills.
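
The session-level setting applies to every write in the session. Spark's configuration docs also describe a per-write `partitionOverwriteMode` option that takes precedence over the session setting, which keeps the behavior next to the write it protects. A minimal sketch, assuming `df` is a PySpark DataFrame holding only the dates to replace; the helper name `overwrite_partitions_dynamic` and its defaults are hypothetical:

```python
def overwrite_partitions_dynamic(df, path, partition_col="dt"):
    """Overwrite only the partitions present in df, for this one write.

    Hypothetical helper: df is expected to be a pyspark.sql.DataFrame.
    """
    (df.write
        .mode("overwrite")
        # Per-write option; takes precedence over the session-level
        # spark.sql.sources.partitionOverwriteMode setting.
        .option("partitionOverwriteMode", "dynamic")
        .partitionBy(partition_col)
        .parquet(path))
```

Calling `overwrite_partitions_dynamic(reprocessed_one_day, "/warehouse/events")` would then replace only the dates contained in `reprocessed_one_day`, regardless of how the session is configured.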

Limitations of dynamic mode: you need a partition column, and replacement happens only at "partition granularity." You cannot replace just some rows within a partition.

3. Solution ② Iceberg / Delta — replaceWhere / Conditional Overwrite

Lakehouse formats are safer and more flexible. Because you explicitly state the overwrite scope as a predicate, the risk of incidents is much lower.

Delta replaceWhere

(reprocessed
    .write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "dt >= '2026-06-01' AND dt <= '2026-06-03'")
    .save("/warehouse/events"))
# → only data matching the predicate is replaced, atomically
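
Hand-typed predicates are an easy place for a typo to widen or empty the replaced range. One option is to build the string from `datetime.date` values. A small sketch; the helper name `replace_where_for_range` is hypothetical, and the column name `dt` follows the examples in this post:

```python
from datetime import date, datetime

def replace_where_for_range(start, end, column="dt"):
    """Build an inclusive date-range predicate for Delta's replaceWhere option."""
    for value in (start, end):
        # datetime is a subclass of date, so reject it explicitly.
        if isinstance(value, datetime) or not isinstance(value, date):
            raise TypeError("start and end must be datetime.date values")
    if start > end:
        raise ValueError(f"empty range: {start} > {end}")
    # date.isoformat() yields YYYY-MM-DD, so no free-form text reaches the predicate.
    return f"{column} >= '{start.isoformat()}' AND {column} <= '{end.isoformat()}'"

predicate = replace_where_for_range(date(2026, 6, 1), date(2026, 6, 3))
print(predicate)  # dt >= '2026-06-01' AND dt <= '2026-06-03'
```

The resulting string can be passed straight to `.option("replaceWhere", predicate)`.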

Iceberg overwritePartitions / dynamic

# Iceberg: overwritePartitions on DataFrameWriterV2 (dynamic)
(reprocessed
    .writeTo("analytics.events")
    .overwritePartitions())       # replaces only the incoming partitions
 
# Or delete by predicate in SQL, then append or MERGE the recomputed rows.
# The DELETE and the follow-up write are separate commits, so the pair is not atomic.
spark.sql("""
  DELETE FROM analytics.events WHERE dt = DATE '2026-06-03'
""")

| Approach | Safety | Granularity |
| --- | --- | --- |
| dynamic overwrite | Medium (partition-level) | Partition |
| Delta replaceWhere | High (explicit predicate) | Predicate range |
| Iceberg overwritePartitions | High (atomic) | Partition |
| MERGE | High (row-level) | Row |

The lakehouse advantage: overwrites are atomic. If a job fails midway, no broken partial state is left behind. And because the predicate is explicit, the "entire table wiped out" incident is structurally hard to cause.
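
A DELETE followed by an append runs as two separate operations, so readers can briefly see the date missing. `DataFrameWriterV2` also offers `overwrite(condition)`, which replaces the rows matching a condition with the DataFrame's contents in one operation. A minimal sketch, assuming `df` is a PySpark DataFrame and `condition` is a `Column` such as `col("dt") == "2026-06-03"`; the wrapper name `overwrite_where` is hypothetical, and the condition is assumed to line up with partition boundaries:

```python
def overwrite_where(df, table, condition):
    """Replace the rows matching condition with the contents of df.

    Hypothetical helper: df is a pyspark.sql.DataFrame and condition is a
    pyspark.sql.Column, for example col("dt") == "2026-06-03".
    """
    # DataFrameWriterV2.overwrite(condition) removes the matching rows and
    # writes df as a single table operation.
    df.writeTo(table).overwrite(condition)
```

With Iceberg this reads as `overwrite_where(reprocessed, "analytics.events", col("dt") == "2026-06-03")`, keeping the explicit predicate that makes `replaceWhere` safe.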

4. Idempotent Backfills — Same Result on Every Rerun

The essence of a backfill is idempotency: "reprocessing the same date multiple times must produce the same result." Retries and duplicate executions are inevitable in operations.

Non-idempotent pipeline:
  load with append → rerunning piles the same data up twice (duplicates!)
 
Idempotent pipeline:
  "replace the target partition" → same result no matter how many times you run it

Two patterns for idempotent backfills:

Pattern A — Partition Replacement (overwrite)

TABLE = "/warehouse/events"   # assumed: the Delta path from the replaceWhere example

def backfill_day(dt):
    # dt is an ISO date string such as "2026-06-03"
    result = compute_for_day(dt)       # recompute that one day; must be deterministic
    (result.write
        .format("delta").mode("overwrite")
        .option("replaceWhere", f"dt = '{dt}'")   # replace only that date
        .save(TABLE))
    # Every run replaces the dt partition with the same result, so reruns are idempotent.
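
Why replacement survives retries while append does not can be seen in a toy model that needs no Spark at all. In this sketch the "table" is a plain dict from date to rows, and `append_load` and `replace_load` are illustrative names, not Spark APIs:

```python
def append_load(table, day, rows):
    """Non-idempotent: every run adds the rows again."""
    table.setdefault(day, []).extend(rows)

def replace_load(table, day, rows):
    """Idempotent: every run leaves exactly these rows for the day."""
    table[day] = list(rows)

rows = [{"id": 1}, {"id": 2}]
appended, replaced = {}, {}
for _ in range(3):  # simulate a job that is retried twice
    append_load(appended, "2026-06-03", rows)
    replace_load(replaced, "2026-06-03", rows)

print(len(appended["2026-06-03"]))  # 6
print(len(replaced["2026-06-03"]))  # 2
```

Three runs leave six rows under append and two under replacement, which is the property `replaceWhere` gives the real table.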

Pattern B — MERGE (Row-Level Idempotency)

# Key-based upsert, idempotent via updated_at comparison (see "PySpark Large-Scale Deduplication and SCD2")
spark.sql("""
  MERGE INTO analytics.events t USING updates s
  ON t.id = s.id
  WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

| Pattern | Best for |
| --- | --- |
| Partition replacement | When the entire partition can be recomputed (daily batches) |
| MERGE | Incremental, key-based loads where only some rows change |
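
The `updated_at` guard is what makes the MERGE safe to repeat: a second run finds no source row newer than the target, so nothing changes. The sketch below models that rule on plain dicts. It is a toy illustration of the matching logic, not how Delta or Iceberg execute MERGE, and `merge_updates` is a hypothetical name:

```python
def merge_updates(target, updates):
    """Upsert updates into target (a dict keyed by id); newest updated_at wins."""
    for row in updates:
        current = target.get(row["id"])
        # Mirrors WHEN NOT MATCHED THEN INSERT and
        # WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE.
        if current is None or row["updated_at"] > current["updated_at"]:
            target[row["id"]] = dict(row)
    return target

target = {1: {"id": 1, "value": "old", "updated_at": "2026-06-01"}}
updates = [
    {"id": 1, "value": "new", "updated_at": "2026-06-03"},
    {"id": 2, "value": "first", "updated_at": "2026-06-03"},
]
once = merge_updates(dict(target), updates)
twice = merge_updates(dict(once), updates)
print(once == twice)  # True
```

The same comparison also protects against late, stale replays: an update older than the stored row is ignored.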

5. Backfill Orchestration

Large-range backfills (e.g., the past year) are split and run per date.

from datetime import date, timedelta
 
def date_range(start, end):
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)
 
for d in date_range(date(2025, 1, 1), date(2025, 12, 31)):
    backfill_day(d.isoformat())   # each date is idempotent → on failure, rerun just that date

Operational tips:

  • Make each date an independent unit: if each date's backfill is idempotent and independent, you only need to rerun the dates that failed.
  • Use a scheduler like Airflow to parallelize as per-date tasks, but watch out for concurrent write conflicts (same table).
  • For large data volumes, don't run the whole range as a single job — process it in chunks.
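
The tips above can be sketched as two small helpers: one that splits a range into chunks, and one that runs each date independently and reports the failures for a targeted rerun. Both names (`date_chunks`, `run_backfill`) and the default chunk size are hypothetical; `backfill_day` is passed in as a callable taking an ISO date string:

```python
from datetime import date, timedelta

def date_chunks(start, end, chunk_days=7):
    """Yield inclusive (chunk_start, chunk_end) pairs covering [start, end]."""
    if chunk_days < 1:
        raise ValueError("chunk_days must be at least 1")
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + timedelta(days=chunk_days - 1), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + timedelta(days=1)

def run_backfill(days, backfill_day):
    """Call backfill_day for each ISO date; return the dates that failed."""
    failed = []
    for day in days:
        try:
            backfill_day(day)
        except Exception as exc:  # one bad date must not stop the rest
            print(f"{day} failed: {exc}")
            failed.append(day)
    return failed
```

Because each date is idempotent, the returned list can be fed straight back into `run_backfill` once the cause is fixed, without touching the dates that succeeded.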

6. Common Pitfalls

| Pitfall | Result | Avoidance |
| --- | --- | --- |
| Partial write with static overwrite | Entire table lost | dynamic mode / replaceWhere |
| Backfilling with append | Accumulating duplicates | overwrite / MERGE |
| dynamic without a partition column | No partition-level replacement | partitionBy is required |
| Concurrent backfill conflicts | Commit conflicts, corruption | Table-format transactions, serialization |
| Small files after overwrite | They accumulate | Regular compaction (separate post) |

To prevent static-overwrite incidents structurally, make the lakehouse tools your defaults: replaceWhere forces you to state the predicate being replaced, and overwritePartitions limits the replacement to the partitions present in the incoming data. Either way, the "delete everything" mistake becomes hard to commit.
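
With dynamic overwrite and overwritePartitions, the data itself defines what gets replaced, so a stray row from another date silently replaces that whole partition. A cheap pre-write check is to compare the partition values in the data against the dates that were requested. A minimal sketch; `unexpected_partitions` is a hypothetical helper, and the `incoming` list stands in for the distinct `dt` values of the DataFrame:

```python
def unexpected_partitions(incoming_days, requested_days):
    """Return partition values present in the data but not requested, sorted."""
    return sorted(set(incoming_days) - set(requested_days))

incoming = ["2026-06-03", "2026-06-03", "2026-06-04"]  # dt values found in the data
extra = unexpected_partitions(incoming, ["2026-06-03"])
if extra:
    print(f"refusing to write, unrequested partitions: {extra}")
```

In a real job the incoming values could be collected with something like `df.select("dt").distinct().collect()` before the write is issued.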

7. Summary

| Tool | Key point |
| --- | --- |
| dynamic partition overwrite | Replaces only the incoming partitions |
| Delta replaceWhere | Atomically replaces only the predicate range |
| Iceberg overwritePartitions | Atomic partition replacement |
| MERGE | Row-level idempotent upsert |
| Backfill design | Date-independent, idempotent units |

There are two keys to partition overwrites and backfills. First, recognize the "delete everything" hazard of static overwrite and use dynamic mode or conditional overwrites. Second, design backfills to be idempotent, so the result stays stable across reruns and duplicate executions. With the atomic, conditional overwrites of Iceberg/Delta as your default tools, the everyday request to "just recompute yesterday's data" stops being a dangerous operation.


This post is based on Spark 3.5 + Iceberg/Delta. If you need help designing safe backfill and reprocessing pipelines, feel free to reach out.

— Data Dynamics Engineering Team