The PySpark Small Files Problem — Taming Tens of Thousands of Tiny Files
Covers the Small Files Problem, where Spark output gets split into tens of thousands of tiny files that slow down downstream jobs and queries. We walk through the causes, coalesce vs repartition, AQE, partitioned-write strategies, and practical solutions all the way to Iceberg/Delta compaction.
Your Spark job succeeded, but when you open the output directory you find 40,000 files of about 2KB each. The data is roughly 100MB, yet there are 40,000 files. When the next job reads this, it has to open every one of them and can schedule up to 40,000 tasks, and your metastore, NameNode, and object storage all start screaming. This is the chronic disease of data engineering: the Small Files Problem.
This post covers why small files appear, what damage they cause, and practical fixes ranging from coalesce/repartition to Iceberg/Delta compaction.
1. Why Small Files Hurt
With many small files, overhead explodes regardless of the actual data volume.
| Damage | Reason |
|---|---|
| Slow downstream jobs | Up to 1 task per file, plus a per-file open cost → 40K files = up to 40K tasks to schedule |
| Metadata explosion | A metadata entry per file; NameNode memory and HMS load |
| Object storage cost | S3 and friends bill per request (GET/LIST) + latency |
| Slow query planning | The engine opens every file footer to check statistics |
| Worse compression | Small files reduce columnar compression and encoding efficiency |
The typical target file size is 128MB to 1GB. If you have thousands to tens of thousands of files in the KB-to-MB range, they need cleanup.
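As a rough illustration of that rule of thumb, the sketch below computes how many files a dataset would ideally occupy at a target size and flags a layout that needs cleanup. The function names and the thresholds (an 8MB average and 1,000 files) are assumptions picked for this example, not Spark settings.

```python
import math

MB = 1024 * 1024

def ideal_file_count(total_bytes: int, target_file_bytes: int = 128 * MB) -> int:
    """Number of files the data would occupy at the target file size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

def needs_cleanup(file_count: int, total_bytes: int,
                  max_avg_bytes: int = 8 * MB, min_files: int = 1000) -> bool:
    """Flag layouts with many files whose average size is far below target.

    The thresholds are illustrative assumptions, not Spark settings.
    """
    if file_count == 0:
        return False
    avg_bytes = total_bytes / file_count
    return file_count >= min_files and avg_bytes < max_avg_bytes

# The example from the introduction: 40,000 files of about 2KB each.
files = 40_000
total = files * 2 * 1024
print(ideal_file_count(total))      # a single 128MB file would hold all of it
print(needs_cleanup(files, total))
```

With the introduction's numbers, the ideal count is one file and the layout is flagged, which is the gap that the fixes in the following sections try to close.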
2. Where Small Files Come From
Cause 1: Shuffle partition count too large for the data
Spark's default spark.sql.shuffle.partitions is 200. If you write after a shuffle (groupBy/join), you get up to 200 files even when the data is small. The smaller the data, the smaller each file.
100MB of data + shuffle.partitions=200 → 200 files × 0.5MB average
Cause 2: Partition columns + writing many partitions
When writing with partitionBy, a file is created for each (output partition × Spark partition) combination.
partition column dt=365 days × 200 Spark partitions → up to 73,000 files
Cause 3: Streaming micro-batches
Structured Streaming writes files on every trigger. With a 5-second trigger, that is 17,000+ batches per day = an enormous number of files.
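The estimates for the first two causes are plain arithmetic, and it can help to rerun them with your own numbers. A minimal sketch, assuming every shuffle partition writes exactly one file and, for partitionBy, the worst case where every Spark partition holds rows for every output partition; the function names are ours.

```python
MB = 1024 * 1024

def avg_shuffle_file_mb(total_bytes: int, shuffle_partitions: int = 200) -> float:
    """Average output file size when every shuffle partition writes one file."""
    return total_bytes / shuffle_partitions / MB

def max_partitioned_files(output_partitions: int, spark_partitions: int) -> int:
    """Worst case for partitionBy: each Spark partition writes into each output partition."""
    return output_partitions * spark_partitions

print(avg_shuffle_file_mb(100 * MB))    # 0.5
print(max_partitioned_files(365, 200))  # 73000
```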
3. Fix 1: coalesce vs repartition
Reduce the partition count right before writing to control the number of output files. You need to understand the difference between the two precisely.
# repartition: full shuffle with even redistribution (can increase or decrease partitions)
df.repartition(10).write.parquet("out") # exactly 10, evenly balanced
# coalesce: merge partitions without a shuffle (decrease only)
df.coalesce(10).write.parquet("out") # at most 10 files, no shuffle (fast) but possibly skewed
| | coalesce(n) | repartition(n) |
|---|---|---|
| Shuffle | None (merges existing partitions) | Full shuffle |
| Speed | Fast | Slow |
| Balance | Can be skewed | Even |
| Use case | Simply reducing file count | Even distribution / repartitioning by key |
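Pitfall: if you use coalesce(1) to produce a single file, the final stage becomes a single task, and that one task processes all the data → slowdowns or OOM. Because coalesce adds no shuffle boundary, the upstream computation in the same stage is squeezed into that one task as well. Reduce to a reasonable count, and only go down to 1 when the data is genuinely small.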
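To make the balance difference concrete, here is a toy model in plain Python that tracks only row counts per partition. It is a simplification and not Spark's implementation: the coalesce model merges contiguous partitions (Spark's real grouping also considers data locality), and the repartition model spreads rows perfectly evenly.

```python
def simulate_coalesce(partition_rows: list[int], n: int) -> list[int]:
    """Merge existing partitions into n groups without moving individual rows."""
    if not partition_rows:
        return []
    n = min(n, len(partition_rows))  # coalesce can only decrease the count
    group_size, extra = divmod(len(partition_rows), n)
    merged, start = [], 0
    for i in range(n):
        end = start + group_size + (1 if i < extra else 0)
        merged.append(sum(partition_rows[start:end]))
        start = end
    return merged

def simulate_repartition(partition_rows: list[int], n: int) -> list[int]:
    """Redistribute every row evenly across n partitions (models a full shuffle)."""
    base, extra = divmod(sum(partition_rows), n)
    return [base + (1 if i < extra else 0) for i in range(n)]

# Skewed input: one large partition followed by nine small ones.
rows = [1_000_000] + [1_000] * 9
print(simulate_coalesce(rows, 2))     # [1004000, 5000]
print(simulate_repartition(rows, 2))  # [504500, 504500]
```

In this model the coalesced output inherits the input skew, while the repartitioned output is even at the price of moving every row.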
Computing a reasonable partition count
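import math
target_size = 256 * 1024 * 1024 # 256MB target per output file
# total_bytes: estimated output size in bytes (for example, measured from a previous run)
# Rough estimate: total bytes / target size, rounded up
num_partitions = max(1, math.ceil(total_bytes / target_size))
df.repartition(num_partitions).write.parquet("out")
4. Fix 2: AQE Automatic Coalescing (Recommended)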
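Instead of computing this manually, let AQE automatically merge small partitions after a shuffle at runtime. AQE and its partition coalescing have been enabled by default since Spark 3.2, so on recent versions these settings mainly need confirming and tuning.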
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Target partition size after coalescing
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
AQE looks at the data size and dynamically reduces the partition count, producing consistently sized files. For most post-shuffle writes, this alone dramatically reduces small files.
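The idea behind the coalescing can be sketched as a greedy pass that packs adjacent shuffle partitions until the advisory size would be exceeded. This is a simplified model written for this post, not Spark's code: the real rule also depends on other settings (for example spark.sql.adaptive.coalescePartitions.parallelismFirst and a minimum partition size), and shuffle sizes are not the same as written Parquet sizes.

```python
MB = 1024 * 1024

def coalesce_shuffle_partitions(sizes: list[int],
                                advisory_bytes: int = 128 * MB) -> list[list[int]]:
    """Greedily group adjacent partitions without exceeding the advisory size.

    Returns groups of partition indices; in this model each group becomes
    one task and therefore one output file.
    """
    groups, current, current_bytes = [], [], 0
    for idx, size in enumerate(sizes):
        if current and current_bytes + size > advisory_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

# 200 shuffle partitions of 0.5MB each (the 100MB example from section 2).
groups = coalesce_shuffle_partitions([MB // 2] * 200)
print(len(groups))  # 1
```

Under this model the 200 half-megabyte partitions collapse into a single task, which is why the setting helps most when the shuffle output is much smaller than the configured partition count suggests.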
5. Fix 3: Partitioned-Write Strategy
To avoid file explosion when writing with partitionBy, redistribute by the output partition column before writing.
# BAD: each Spark partition creates files for every dt → explosion
df.write.partitionBy("dt").parquet("out")
# GOOD: repartition by dt → control files per dt
(df.repartition("dt") # collect each dt into one partition
    .write.partitionBy("dt").parquet("out")) # 1 to a few files per dt
# If you need multiple files per dt (large partitions)
df.repartition("dt", "bucket_col").write.partitionBy("dt").parquet("out")
repartition("dt") gathers rows with the same dt into a single Spark partition, minimizing the number of files per dt directory.
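When one dt is too large for a single file and no natural bucket column exists, one option is to derive a bucket from a hash of some high-cardinality column. The sketch below is one way to do that under stated assumptions: key_col is a hypothetical column name you supply, _bucket is a temporary helper column, and the upper bounds ignore settings that split files further (such as a records-per-file limit).

```python
def file_count_upper_bounds(dts: int, spark_partitions: int, buckets: int) -> dict[str, int]:
    """Upper bounds on total output files for the three write strategies."""
    return {
        "partitionBy only": dts * spark_partitions,
        "repartition(dt)": dts,
        "repartition(dt, bucket)": dts * buckets,
    }

def write_bucketed(df, path: str, key_col: str, buckets: int) -> None:
    """Write Parquet partitioned by dt with at most `buckets` files per dt.

    key_col is a hypothetical high-cardinality column (for example a user id)
    used only to spread the rows of one dt across several Spark partitions.
    """
    # Imported lazily so the arithmetic helper above runs without Spark installed.
    from pyspark.sql import functions as F

    bucket = F.expr(f"pmod(hash({key_col}), {buckets})")
    (df.withColumn("_bucket", bucket)
       .repartition("dt", "_bucket")  # each (dt, bucket) pair lands in one Spark partition
       .drop("_bucket")               # the helper column is not written out
       .write.partitionBy("dt")
       .parquet(path))

print(file_count_upper_bounds(dts=365, spark_partitions=200, buckets=4))
```

A call such as write_bucketed(df, "out", "user_id", 4) would then cap each dt directory at four files, assuming the DataFrame has dt and user_id columns.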
6. Fix 4: Iceberg / Delta — Compaction in Table Formats
For readers of this blog, this is the most practical fix. Lakehouse table formats have built-in compaction that merges small files after the fact. Write freely, and compact on a schedule.
Iceberg
# Iceberg compaction via Spark SQL (rewrites small files into large ones)
spark.sql("""
    CALL catalog.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('target-file-size-bytes', '536870912') -- 512MB
    )
""")
# Clean up old snapshots and orphan files
spark.sql("CALL catalog.system.expire_snapshots('analytics.events', TIMESTAMP '2026-06-01 00:00:00')")
spark.sql("CALL catalog.system.remove_orphan_files(table => 'analytics.events')")
Iceberg also has properties controlling automatic sorting and fan-out at write time, so you can reduce small files starting from the write stage. (We covered the equivalent operations on the Trino side in a separate post, "Maintaining Iceberg Tables with Trino.")
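As one example of write-time control, the sketch below builds an ALTER TABLE statement that sets two Iceberg table properties, write.target-file-size-bytes and write.distribution-mode. These are the ones we would reach for first; whether they are the exact properties the paragraph above has in mind is an assumption, and the helper name and the fully qualified table name are ours.

```python
def iceberg_write_properties_sql(table: str,
                                 target_file_bytes: int = 512 * 1024 * 1024,
                                 distribution_mode: str = "hash") -> str:
    """Build an ALTER TABLE statement that sets write-time file sizing properties."""
    if distribution_mode not in {"none", "hash", "range"}:
        raise ValueError(f"unsupported distribution mode: {distribution_mode}")
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'write.target-file-size-bytes' = '{target_file_bytes}', "
        f"'write.distribution-mode' = '{distribution_mode}')"
    )

stmt = iceberg_write_properties_sql("catalog.analytics.events")
print(stmt)
# With an active session, run it as: spark.sql(stmt)
```

With hash distribution, rows are shuffled by partition value before the write, which plays a similar role to the explicit repartition("dt") in the previous section.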
Delta Lake
# Compaction via OPTIMIZE (+ optional Z-Order)
spark.sql("OPTIMIZE analytics.events WHERE dt >= '2026-06-01'")
spark.sql("OPTIMIZE analytics.events ZORDER BY (user_id)")
# Clean up old files
spark.sql("VACUUM analytics.events RETAIN 168 HOURS") # 7 days
# Auto compaction / optimized writes at write time
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
| Format | Compaction | Cleanup |
|---|---|---|
| Iceberg | rewrite_data_files | expire_snapshots, remove_orphan_files |
| Delta | OPTIMIZE (+ZORDER) | VACUUM |
If you run a Lakehouse, this pattern is the standard play: let streaming and frequent writes produce small files, and run scheduled compaction regularly. It lets you manage write latency and file size independently.
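A scheduled job usually does not need to rewrite the whole table; compacting only the recently written partitions is often enough. The following sketch builds the statements for such a job. It assumes dt is a partition column holding ISO dates, and the helper names and the 7-day lookback are illustrative choices rather than recommendations.

```python
from datetime import date, timedelta

def delta_optimize_sql(table: str, run_date: date, lookback_days: int = 7) -> str:
    """OPTIMIZE only the dt partitions from the last `lookback_days` days."""
    since = run_date - timedelta(days=lookback_days)
    return f"OPTIMIZE {table} WHERE dt >= '{since.isoformat()}'"

def iceberg_rewrite_sql(catalog: str, table: str, run_date: date,
                        lookback_days: int = 7) -> str:
    """rewrite_data_files restricted to recent dt values via the `where` argument."""
    since = run_date - timedelta(days=lookback_days)
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', where => 'dt >= \"{since.isoformat()}\"')"
    )

today = date(2026, 6, 8)
print(delta_optimize_sql("analytics.events", today))
print(iceberg_rewrite_sql("catalog", "analytics.events", today))
# A scheduler (cron, Airflow, and so on) would pass each string to spark.sql(...).
```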
7. Small Files in Streaming
Structured Streaming produces small files by nature. Tackle it on two fronts.
# 1) Increase the trigger interval to grow the data per batch
query = (df.writeStream
    .trigger(processingTime="5 minutes") # 5 minutes, not 5 seconds
    .format("iceberg").outputMode("append")
    .option("checkpointLocation", checkpoint_path) # checkpoint_path: your checkpoint directory; required unless a default is configured
    .toTable("analytics.events"))
# 2) For the small files that accumulate anyway, clean up periodically with a separate compaction job
For streaming, "write small, compact separately" is the realistic answer.
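To see how much the trigger interval matters, compare the number of micro-batches per day at 5 seconds and at 5 minutes. This is a small arithmetic sketch; files_per_batch is an assumed parameter, since the real number depends on how many partitions each micro-batch writes.

```python
SECONDS_PER_DAY = 86_400

def batches_per_day(trigger_seconds: int) -> int:
    """Number of micro-batches fired in 24 hours at a fixed trigger interval."""
    return SECONDS_PER_DAY // trigger_seconds

def files_per_day(trigger_seconds: int, files_per_batch: int = 1) -> int:
    """Files written per day if every batch writes `files_per_batch` files."""
    return batches_per_day(trigger_seconds) * files_per_batch

print(batches_per_day(5))    # 17280
print(batches_per_day(300))  # 288
```

Moving from 5 seconds to 5 minutes cuts the batch count by a factor of 60, and the file count shrinks with it, at the cost of higher end-to-end latency.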
8. Diagnostics — Checking Your Small-File Situation
# Check file count and size distribution (Iceberg metadata table)
spark.sql("""
    SELECT count(*) AS files,
           avg(file_size_in_bytes)/1024/1024 AS avg_mb,
           min(file_size_in_bytes)/1024 AS min_kb
    FROM catalog.analytics.events.files
""").show()
If avg_mb is in the single digits and files is in the thousands or more, the table is a compaction candidate.
9. Summary
| Fix | When |
|---|---|
| AQE coalescePartitions | Writes after a shuffle — turn this on first |
| repartition(n) / coalesce(n) | Direct control over output file count |
| repartition(partition column) + partitionBy | Prevent partitioned-write explosion |
| Iceberg rewrite_data_files / Delta OPTIMIZE | After-the-fact compaction (scheduled) |
| Larger trigger interval | Streaming |
The essence of the Small Files Problem comes down to two axes: "control the file count at write time" and "compact whatever accumulates anyway." For batch jobs, handle it at write time with AQE plus a sensible repartition; for streaming and frequent ingestion, schedule Iceberg/Delta compaction. As long as you resist the temptation of coalesce(1), small files are a problem you can definitely tame.
This post was written against Spark 3.5 + Iceberg/Delta. If you need file optimization and compaction automation for your Lakehouse ingestion pipelines, feel free to reach out anytime.
— The Data Dynamics Engineering Team