Tags: pyspark, spark, small-files, iceberg, delta, data-engineering

The PySpark Small Files Problem — Taming Tens of Thousands of Tiny Files

Covers the Small Files Problem, where Spark output gets split into tens of thousands of tiny files that slow down downstream jobs and queries. We walk through the causes, coalesce vs repartition, AQE, partitioned-write strategies, and practical solutions all the way to Iceberg/Delta compaction.

Data Dynamics · June 5, 2026 · 7 min read

Your Spark job succeeded, but when you open the output directory you find 40,000 files of roughly 2.5KB each: 100MB of data spread across 40,000 files. The next job that reads this has to list, open, and schedule work for every one of them, and your metastore, NameNode, and object storage all start screaming. This is the chronic disease of data engineering: the Small Files Problem.

This post covers why small files appear, what damage they cause, and practical fixes ranging from coalesce/repartition to Iceberg/Delta compaction.

1. Why Small Files Hurt

With many small files, overhead scales with the number of files rather than with the actual data volume.

Damage	Reason
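Slow downstream jobs	Every file must be listed and opened. Engines that plan one split per file schedule 40K tasks for 40K files; Spark packs small files into shared tasks but still charges a per-file open cost (spark.sql.files.openCostInBytes)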
Metadata explosion	A metadata entry per file; NameNode memory and HMS load
Object storage cost	S3 and similar stores bill per request (GET/LIST), and every request adds latency
Slow query planning	The engine opens every file footer to check statistics
Worse compression	Small files reduce columnar compression and encoding efficiency

The typical target file size is 128MB to 1GB. If you have thousands to tens of thousands of files in the KB-to-MB range, they need cleanup.

2. Where Small Files Come From

Cause 1: Shuffle partition count too large for the data

Spark's default spark.sql.shuffle.partitions is 200. If you write right after a shuffle (groupBy/join) and nothing coalesces the shuffle partitions, you get up to 200 files however small the data is. The smaller the data, the smaller each file.

100MB of data + shuffle.partitions=200  →  200 files × 0.5MB average

Cause 2: Partition columns + writing many partitions

When writing with partitionBy, every Spark task writes a separate file for each output partition value it holds, so the file count can reach (output partitions × Spark partitions).

partition column dt=365 days × 200 Spark partitions  →  up to 73,000 files
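Both worst cases are plain arithmetic, which makes them easy to check before a job runs. A minimal sketch; the helper names are ours for illustration and are not part of any Spark API:

```python
def worst_case_files(spark_partitions: int, output_partitions: int = 1) -> int:
    """Upper bound on output files: every task writes one file per output partition."""
    return spark_partitions * output_partitions


def avg_file_mb(total_mb: float, files: int) -> float:
    """Average file size if the data were spread evenly over the files."""
    return total_mb / files


# Cause 1: 100MB written after a 200-partition shuffle
print(worst_case_files(200), avg_file_mb(100, 200))   # 200 0.5
# Cause 2: 365 dt values x 200 Spark partitions
print(worst_case_files(200, 365))                      # 73000
```

These are upper bounds: a task that holds no rows for a given dt writes no file for it, so real counts are often lower.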

Cause 3: Streaming micro-batches

Structured Streaming commits new files on every trigger. A 5-second trigger means 17,280 micro-batches per day, and every batch that carries data adds at least one more file.
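To see how strongly the trigger interval drives the file count, here is a rough calculation. It assumes every micro-batch writes a fixed number of files, which is a simplification; `files_per_day` is a hypothetical helper:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def files_per_day(trigger_seconds: int, files_per_batch: int = 1) -> int:
    """Files committed per day if every micro-batch writes `files_per_batch` files."""
    batches = SECONDS_PER_DAY // trigger_seconds
    return batches * files_per_batch


print(files_per_day(5))        # 17280  (5-second trigger, 1 file per batch)
print(files_per_day(300))      # 288    (5-minute trigger, 1 file per batch)
print(files_per_day(5, 10))    # 172800 (5-second trigger, 10 files per batch)
```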

3. Fix 1: coalesce vs repartition

Reduce the partition count right before writing to control the number of output files. You need to understand the difference between the two precisely.

# repartition: full shuffle with even redistribution (can increase or decrease partitions)
df.repartition(10).write.parquet("out")        # exactly 10, evenly balanced

# coalesce: merge partitions without a shuffle (decrease only)
df.coalesce(10).write.parquet("out")           # at most 10 files, no shuffle (fast) but possibly skewed

	`coalesce(n)`	`repartition(n)`
Shuffle	None (merges existing partitions)	Full shuffle
Speed	Fast	Slow
Balance	Can be skewed	Even
Use case	Simply reducing file count	Even distribution / repartitioning by key

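Pitfall: coalesce(1) does more than funnel the write through one task. Because coalesce is a narrow transformation, Spark can run the whole upstream stage (everything back to the previous shuffle or the source read) in that single task, so one task processes all the data and you get slowdowns or OOM. Reduce to a reasonable count, and only go down to 1 when the data is genuinely small.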
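When a single output file really is required, a common alternative is repartition(1): the extra shuffle lets the upstream stages keep their parallelism, and only the final write runs as one task. A minimal sketch, assuming `df` is any Spark DataFrame; `write_single_file` is a hypothetical helper name:

```python
def write_single_file(df, path: str) -> None:
    """Write `df` as one Parquet file without collapsing upstream parallelism.

    repartition(1) inserts a shuffle, so the stages before it still run in
    parallel and only the final write is a single task.
    """
    df.repartition(1).write.parquet(path)
```

The last task still has to hold and write all the data, so the "only when the data is genuinely small" rule applies here too.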

Computing a reasonable partition count

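import math

target_size = 256 * 1024 * 1024  # 256MB per output file
# total_bytes: your estimate of the output size on disk, e.g. taken from a previous run
# Round up so the remainder gets its own partition instead of inflating the others
num_partitions = max(1, math.ceil(total_bytes / target_size))
df.repartition(num_partitions).write.parquet("out")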

4. Fix 2: AQE Automatic Coalescing (Recommended)

Instead of computing this manually, let AQE automatically merge small partitions after a shuffle at runtime.

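# AQE and partition coalescing are on by default since Spark 3.2; set here for clarity
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Target partition size after coalescing (default: 64MB)
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
# With parallelismFirst=true (the default), Spark ignores the advisory size when coalescing
spark.conf.set("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")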

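AQE looks at the actual shuffle output sizes and merges adjacent small partitions until they approach the advisory size, so post-shuffle writes produce far fewer, more evenly sized files. Two limits are worth knowing: the advisory size refers to shuffle data, so the compressed columnar files on disk usually come out smaller, and AQE leaves a partition count you set explicitly with repartition(n) untouched. For most post-shuffle writes, this alone dramatically reduces small files.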
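To build intuition for what coalescing does, here is a deliberately simplified model in plain Python. It is not Spark's actual implementation (which also weighs a minimum partition size and parallelism); it only shows the idea of merging adjacent shuffle partitions up to a target size. The function name is ours:

```python
MB = 1024 * 1024


def coalesce_contiguous(sizes, target_bytes):
    """Greedily merge adjacent partitions while the merged size stays within target.

    Returns a list of groups, each a list of original partition indexes.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 100MB spread over 200 shuffle partitions collapses into a single partition
tiny = [MB // 2] * 200
print(len(coalesce_contiguous(tiny, 128 * MB)))   # 1

# Larger partitions are merged only as far as the target allows
mixed = [64 * MB, 64 * MB, 64 * MB, 10 * MB]
print(coalesce_contiguous(mixed, 128 * MB))       # [[0, 1], [2, 3]]
```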

5. Fix 3: Partitioned-Write Strategy

To avoid file explosion when writing with partitionBy, redistribute by the output partition column before writing.

# BAD: each Spark task writes one file for every dt it holds → explosion
df.write.partitionBy("dt").parquet("out")

# GOOD: repartition by dt → control files per dt
(df.repartition("dt")                       # collect each dt into one partition
   .write.partitionBy("dt").parquet("out")) # 1 to a few files per dt

# If you need multiple files per dt (large partitions)
# bucket_col is a placeholder: any column that splits a dt into reasonably even chunks
df.repartition("dt", "bucket_col").write.partitionBy("dt").parquet("out")

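repartition("dt") gathers rows with the same dt into a single Spark partition, minimizing the number of files per dt directory. The trade-off is that each dt is then written by a single task, so one oversized dt becomes a straggler; that is when adding a second column to the repartition key pays off.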
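When a dt is too large for one task and no natural second column is available, a random salt can stand in for bucket_col. A minimal sketch, assuming `df` is a Spark DataFrame with a `dt` column; the helper name, the `salt` column, and the default of 8 files are illustrative choices, not recommendations:

```python
def write_partitioned(df, path: str, files_per_dt: int = 8) -> None:
    """Write `df` partitioned by dt, with up to `files_per_dt` files per dt."""
    salted = df.selectExpr(
        "*",
        # Random integer in [0, files_per_dt), used only to split large dt values
        f"CAST(FLOOR(RAND() * {files_per_dt}) AS INT) AS salt",
    )
    (salted.repartition("dt", "salt")   # rows with the same (dt, salt) share a task
           .drop("salt")                # the helper column is not written out
           .write.partitionBy("dt")
           .parquet(path))
```

Two (dt, salt) pairs can hash to the same Spark partition, so a dt may end up with fewer than `files_per_dt` files; treat the number as an upper bound.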

6. Fix 4: Iceberg / Delta — Compaction in Table Formats

For readers of this blog, this is the most practical fix. Lakehouse table formats have built-in compaction that merges small files after the fact. Write freely, and compact on a schedule.

Iceberg

# Iceberg compaction via Spark SQL (rewrites small files into large ones)
spark.sql("""
  CALL catalog.system.rewrite_data_files(
    table => 'analytics.events',
    options => map('target-file-size-bytes', '536870912')   -- 512MB
  )
""")
 
# Clean up old snapshots and orphan files
spark.sql("CALL catalog.system.expire_snapshots('analytics.events', TIMESTAMP '2026-06-01 00:00:00')")
spark.sql("CALL catalog.system.remove_orphan_files(table => 'analytics.events')")

Iceberg also has table properties that shape files at write time, such as write.distribution-mode (how rows are distributed to writers before the write) and write.target-file-size-bytes, so you can reduce small files starting from the write stage. (We covered the equivalent operations on the Trino side in a separate post, "Maintaining Iceberg Tables with Trino.")
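As an illustration of the write-time route, the sketch below sets two Iceberg table properties through Spark SQL. The property names `write.distribution-mode` and `write.target-file-size-bytes` are the ones we know from Iceberg's table configuration; check them against the documentation for your Iceberg version. The helper name and the `hash` choice are assumptions for this example:

```python
def set_iceberg_write_defaults(spark, table: str,
                               target_bytes: int = 512 * 1024 * 1024) -> None:
    """Ask Iceberg to cluster rows by partition and aim for `target_bytes` files."""
    spark.sql(f"""
        ALTER TABLE {table} SET TBLPROPERTIES (
          'write.distribution-mode' = 'hash',
          'write.target-file-size-bytes' = '{target_bytes}'
        )
    """)
```

Usage would look like `set_iceberg_write_defaults(spark, "catalog.analytics.events")`. Scheduled compaction is still worth keeping, since streaming and small batches will produce undersized files regardless.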

Delta Lake

# Compaction via OPTIMIZE (+ optional Z-Order)
# The WHERE clause may only filter on partition columns (dt here)
spark.sql("OPTIMIZE analytics.events WHERE dt >= '2026-06-01'")
spark.sql("OPTIMIZE analytics.events ZORDER BY (user_id)")

# Clean up files no longer referenced by the table
spark.sql("VACUUM analytics.events RETAIN 168 HOURS")  # 7 days

# Auto compaction / optimized writes at write time
# (Databricks settings; open-source Delta Lake supports them from 3.1, as far as we know)
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

Format	Compaction	Cleanup
Iceberg	`rewrite_data_files`	`expire_snapshots`, `remove_orphan_files`
Delta	`OPTIMIZE` (+ZORDER)	`VACUUM`

If you run a Lakehouse, this pattern is the standard play: let streaming and frequent writes produce small files, and run scheduled compaction regularly. It lets you manage write latency and file size independently.
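A scheduled job for this pattern can be as small as a loop over tables. The sketch below reuses the statements from this section; the function names, the default catalog name `catalog`, and the fixed 168-hour retention are assumptions for illustration, and a real job would add error handling and per-table options:

```python
def compaction_statements(table_format: str, table: str, catalog: str = "catalog"):
    """Return the SQL statements for one compaction + cleanup pass."""
    if table_format == "iceberg":
        return [
            f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
            f"CALL {catalog}.system.expire_snapshots(table => '{table}')",
            f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
        ]
    if table_format == "delta":
        return [f"OPTIMIZE {table}", f"VACUUM {table} RETAIN 168 HOURS"]
    raise ValueError(f"unsupported table format: {table_format}")


def run_compaction(spark, tables):
    """Run compaction for each (table_format, table) pair, e.g. from a nightly job."""
    for table_format, table in tables:
        for statement in compaction_statements(table_format, table):
            spark.sql(statement)
```

Called as `run_compaction(spark, [("iceberg", "analytics.events")])` from whatever scheduler you already use, this keeps compaction out of the write path entirely.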

7. Small Files in Streaming

Structured Streaming produces small files by nature. Tackle it on two fronts.

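# 1) Increase the trigger interval to grow the data per batch
query = (df.writeStream
    .trigger(processingTime="5 minutes")   # 5 minutes, not 5 seconds
    .format("iceberg").outputMode("append")
    # Placeholder path; a checkpoint location is required unless
    # spark.sql.streaming.checkpointLocation is set for the session
    .option("checkpointLocation", "/path/to/checkpoints/analytics_events")
    .toTable("analytics.events"))

# 2) For the small files that accumulate anyway, clean up periodically with a separate compaction job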

For streaming, "write small, compact separately" is the realistic answer.

8. Diagnostics — Checking Your Small-File Situation

# Check file count and size distribution (Iceberg metadata table)
spark.sql("""
  SELECT count(*) AS files,
         avg(file_size_in_bytes)/1024/1024 AS avg_mb,
         min(file_size_in_bytes)/1024 AS min_kb
  FROM catalog.analytics.events.files
""").show()

If avg_mb is in the single digits and files is in the thousands or more, the table is a compaction candidate.
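That rule of thumb is easy to turn into a check a monitoring job can run on the query result. The thresholds below (1,000 files, 10MB average) are our reading of "thousands" and "single digits", not fixed limits; tune them per table:

```python
def is_compaction_candidate(files: int, avg_mb: float,
                            min_files: int = 1000, max_avg_mb: float = 10.0) -> bool:
    """True when a table has many files and a small average file size."""
    return files >= min_files and avg_mb < max_avg_mb


print(is_compaction_candidate(40_000, 0.0025))   # True: the 40,000-file case from the intro
print(is_compaction_candidate(50, 2.0))          # False: small files, but only a handful
print(is_compaction_candidate(5_000, 256.0))     # False: many files, already well sized
```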

9. Summary

Fix	When
AQE coalescePartitions	Writes after a shuffle — turn this on first
`repartition(n)` / `coalesce(n)`	Direct control over output file count
`repartition(partition column)` + `partitionBy`	Prevent partitioned-write explosion
Iceberg `rewrite_data_files` / Delta `OPTIMIZE`	After-the-fact compaction (scheduled)
Larger trigger interval	Streaming

The essence of the Small Files Problem comes down to two axes: "control the file count at write time" and "compact whatever accumulates anyway." For batch jobs, handle it at write time with AQE plus a sensible repartition; for streaming and frequent ingestion, schedule Iceberg/Delta compaction. As long as you resist the temptation of coalesce(1), small files are a problem you can definitely tame.

This post was written against Spark 3.5 + Iceberg/Delta. If you need file optimization and compaction automation for your Lakehouse ingestion pipelines, feel free to reach out anytime.

— The Data Dynamics Engineering Team