PySpark Configuration Tuning, the Complete Picture — Which Configs to Change, and When

Out of the hundreds of Spark settings, we cover only the ones that actually matter: executor sizing (cores, memory, instances), AQE, shuffle partitions, serialization, dynamic allocation — and an evidence-based tuning order to replace the "just raise the memory" anti-pattern.

Data Dynamics · June 5, 2026 · 6 min read

Spark has hundreds of configuration options. That's why so many teams fall into guess-based tuning: "it's slow, so bump executor.memory; if that doesn't work, add more instances." This wastes resources while the actual problem stays put. In reality, only a dozen or so settings make a real difference, and there is a clear order for what to change and when.

This post picks out only the Spark settings that truly matter and explains what to adjust based on what evidence. (Detailed tuning for individual problems — skew, OOM, small files — is covered in their own dedicated posts; here we focus on the big picture and priorities.)

1. The Golden Rule of Tuning — Measure First, Then Attack the Biggest Bottleneck

❌ Guessing: it's slow → raise memory → still slow → add instances → only cost goes up
✅ Evidence: measure in Spark UI → identify the bottleneck (skew/shuffle/OOM) → adjust only the relevant setting

Before touching any config, identify the bottleneck in the Spark UI first (see our separate post "Debugging Slow PySpark Jobs"). Raising memory when the problem is skew, or adding instances when the problem is shuffle, accomplishes nothing.
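As a rough illustration of "measure first", the per-task durations listed on a stage page in the Spark UI can be boiled down to a single skew indicator. The sketch below is plain Python and needs no cluster; the helper name `skew_ratio`, the sample durations, and the threshold of 3 are illustrative assumptions, not Spark conventions.

```python
from statistics import median

def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task within one stage."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    mid = median(task_durations_s)
    if mid <= 0:
        raise ValueError("median duration must be positive")
    return max(task_durations_s) / mid

# Durations in seconds, copied by hand from a stage's task table (hypothetical values).
durations = [10, 11, 9, 10, 120]
ratio = skew_ratio(durations)

# Assumed rule of thumb: one task several times slower than the median suggests skew.
SKEW_THRESHOLD = 3.0
likely_skewed = ratio > SKEW_THRESHOLD
```

A high ratio points at skew, where extra memory or instances would not help; a ratio near 1 with long tasks everywhere points elsewhere, for example at shuffle volume.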

2. First Things First — Turn On AQE

Adaptive Query Execution in Spark 3.x automatically adjusts partition coalescing, skew joins, and join strategies at runtime. It delivers the biggest payoff for the least effort of any setting.

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # master switch
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # auto-merge small partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # auto-split skewed joins
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")

Spark 3.2 and later enable AQE by default, but verify it explicitly. AQE alone automatically mitigates a large share of small-file, skew, and partition-count problems.
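One way to do that verification is to read the switches back from the session. This is a minimal sketch: `aqe_switches_off` is a hypothetical helper, and it accepts any lookup function so that it can be tried without a cluster. With a live session you could pass `lambda k: spark.conf.get(k, None)`.

```python
AQE_SWITCHES = (
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
    "spark.sql.adaptive.skewJoin.enabled",
)

def aqe_switches_off(get_conf):
    """Return the AQE switches whose current value is not 'true'.

    get_conf: callable that takes a config key and returns its value or None.
    """
    off = []
    for key in AQE_SWITCHES:
        value = get_conf(key)
        if str(value).lower() != "true":
            off.append(key)
    return off

# Stand-in for a session's configuration (hypothetical values).
sample_conf = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "false",
}
disabled = aqe_switches_off(sample_conf.get)
```

An empty result means all three switches are on; anything else lists the keys that still need attention.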

3. Executor Sizing — The Most Important Decision

Resource allocation comes down to balancing three values: executor cores, memory, and instance count.

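from pyspark.sql import SparkSession
# Launch-time settings: set on the builder or spark-submit --conf; spark.conf.set() rejects them at runtime
spark = (SparkSession.builder
    .config("spark.executor.cores", "4")           # concurrent tasks per executor
    .config("spark.executor.memory", "8g")         # JVM heap
    .config("spark.executor.memoryOverhead", "2g") # off-heap (shuffle, native, Python)
    .config("spark.executor.instances", "20")      # number of executors
    .getOrCreate())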

Core Count — 4 to 5 Is the Sweet Spot

Too few cores (1-2)  → inefficient relative to per-executor overhead
Too many cores (>5)  → degraded HDFS/storage I/O throughput, GC contention
→ 4-5 cores per executor is the standard recommendation

Memory — Separate Heap from Overhead

PySpark's Python workers consume off-heap memory, so give memoryOverhead generous headroom to avoid "Container killed" errors (see our separate post "Conquering PySpark Executor OOM").
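To see why the overhead matters, recall that by default each executor container is requested as heap plus overhead, and the resource manager kills it once the process grows past that total. The arithmetic below is a sketch: `container_memory_mb` is a hypothetical helper, and it assumes Spark's documented default overhead of 10% of executor memory with a 384 MiB floor.

```python
DEFAULT_OVERHEAD_FACTOR = 0.10   # Spark's default share of executor memory
MIN_OVERHEAD_MB = 384            # Spark's default floor for the overhead

def container_memory_mb(heap_mb, overhead_mb=None):
    """Container request = JVM heap + memoryOverhead (default overhead if unset)."""
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB, int(heap_mb * DEFAULT_OVERHEAD_FACTOR))
    return heap_mb + overhead_mb

default_total = container_memory_mb(8 * 1024)           # 8g heap, default overhead
tuned_total = container_memory_mb(8 * 1024, 2 * 1024)   # 8g heap + explicit 2g overhead
extra_headroom_mb = tuned_total - default_total         # room gained for Python workers
```

With an 8g heap the default leaves well under 1 GB for everything outside the JVM, which is easy for Python workers to exceed; the explicit 2g setting adds more than a gigabyte of headroom.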

Setting	Meaning	Guideline
`executor.cores`	Concurrent tasks	4-5
`executor.memory`	JVM heap	Enough per task, mind GC
`executor.memoryOverhead`	Off-heap	~25% of heap, more for PySpark
`executor.instances`	Executor count	Parallelism = cores × instances

A common mistake: "a few big executors" vs. "many small executors." Executors that are too big suffer long GC pauses; too small, and the overhead dominates. Many medium-sized executors (4-5 cores, 8-16GB) is usually the safe choice.
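The "many medium-sized executors" guideline can be turned into a back-of-the-envelope calculation. The sketch below is one possible way to do it, not an official formula: `size_executors` is a hypothetical helper, and reserving 1 core and 1 GB per node for the OS and daemons is an assumption. It also ignores the driver, which needs its own resources.

```python
import math

def size_executors(nodes, node_cores, node_mem_gb,
                   cores_per_executor=5, overhead_fraction=0.25,
                   reserved_cores=1, reserved_mem_gb=1):
    """Derive executor settings from the cluster shape (rough estimate)."""
    usable_cores = node_cores - reserved_cores
    usable_mem_gb = node_mem_gb - reserved_mem_gb
    per_node = usable_cores // cores_per_executor
    if per_node < 1:
        raise ValueError("node too small for one executor of this size")
    # Split the node's usable memory evenly, then carve heap and overhead out of it.
    mem_per_executor = usable_mem_gb / per_node
    heap_gb = math.floor(mem_per_executor / (1 + overhead_fraction))
    overhead_gb = math.ceil(heap_gb * overhead_fraction)
    return {
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{heap_gb}g",
        "spark.executor.memoryOverhead": f"{overhead_gb}g",
        "spark.executor.instances": str(per_node * nodes),
    }

# Hypothetical cluster: 10 nodes with 16 cores and 64 GB each.
settings = size_executors(nodes=10, node_cores=16, node_mem_gb=64)
```

For this example cluster the result is 3 executors per node with 5 cores, a 16g heap, and 4g of overhead each, which lands inside the 4-5 core, 8-16GB range recommended above.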

4. Shuffle Partitions — The Trap of the 200 Default

The default of 200 for spark.sql.shuffle.partitions fits almost no workload. With large data, each partition becomes too big and you hit OOM; with small data, you get 200 mostly-empty partitions.

# With AQE enabled, this is adjusted automatically at runtime, so the default matters less
spark.conf.set("spark.sql.adaptive.enabled", "true")
 
# Tuning manually without AQE: target ~128MB per partition
# num = total shuffle data / 128MB
spark.conf.set("spark.sql.shuffle.partitions", "800")

With AQE's coalescePartitions enabled, it's safe to set this value high (e.g., 2000) — partitions get merged down to a sensible count at runtime. In the AQE era, the burden of hand-tuning this value is largely gone.
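For the manual case, the "total shuffle data / 128MB" rule is simple arithmetic. The sketch below assumes you have read the total shuffle size from the Shuffle Write figure of the relevant stage in the Spark UI; `shuffle_partitions` is a hypothetical helper name.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3

def shuffle_partitions(total_shuffle_bytes, target_partition_bytes=128 * MIB):
    """Partition count that keeps each shuffle partition near the target size."""
    if total_shuffle_bytes <= 0:
        return 1
    return max(1, math.ceil(total_shuffle_bytes / target_partition_bytes))

# Hypothetical job that shuffles about 100 GiB.
partitions = shuffle_partitions(100 * GIB)
```

Because `spark.sql.shuffle.partitions` is a SQL setting, the result can be applied on a running session with `spark.conf.set("spark.sql.shuffle.partitions", str(partitions))`.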

5. Serialization and Compression

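from pyspark.sql import SparkSession
# Serializer and shuffle compression are launch-time settings: set them on the builder
# (or via spark-submit --conf), not with spark.conf.set() on a running session
spark = (SparkSession.builder
    # Kryo serialization (faster and more compact than the Java default)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Shuffle compression (on by default): saves network and disk
    .config("spark.shuffle.compress", "true")
    .config("spark.shuffle.spill.compress", "true")
    .getOrCreate())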

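Kryo mainly helps RDD code and the shuffling or caching of JVM objects; DataFrame operations already use Spark's internal binary format, so the gain there is smaller. Enabling it rarely hurts. Compression is enabled by default and can usually be left alone.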

6. Dynamic Allocation — Resource Efficiency

Instead of a fixed executor count, scale with load (especially on shared clusters and K8s).

from pyspark.sql import SparkSession
# Dynamic allocation is configured at launch (builder or spark-submit --conf)
spark = (SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # needed without an external shuffle service, e.g. on K8s
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate())

(For shuffle tracking and spot-instance issues on K8s, see our separate post "PySpark on Kubernetes.")

7. Broadcast and Joins

# Auto-broadcast threshold (default 10MB) — optimizes small dimension joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50MB")
# Set to -1 to disable broadcasting entirely (when deliberately turning it off)

Setting this too high risks broadcasting a large table and triggering OOM (see our separate post "Broadcast Variables and Large Lookups"). 50-100MB is usually safe.

8. Configuration Priority — What to Touch First

1. Enable AQE (coalescePartitions, skewJoin)   → biggest impact, least effort
2. Executor sizing (4-5 cores, memory + overhead)
3. shuffle.partitions (with AQE, a high value is fine)
4. memoryOverhead (fixes PySpark "Container killed")
5. Serialization (Kryo), dynamic allocation
6. autoBroadcastJoinThreshold
   ───
   Still slow? → it's a code/data problem (skew, UDFs, small files) — configs can't fix it

The key point: config tuning has limits. Severe skew, bad joins, expensive UDFs, and small source files can be softened by settings such as AQE, but not fixed by them; you have to change the code or the data layout. If you've exhausted the configs and the job is still slow, the problem is the code, not the configuration.

9. Per-Workload Profiles

Workload	Settings to Emphasize
Large ETL batch	Big executors, shuffle.partitions↑, FTE
Interactive/short queries	Dynamic allocation, low latency, broadcast
Streaming	Triggers and backpressure, checkpoints, right-sized executors
ML training	Memory↑, caching, fewer shuffles
High-concurrency shared	Dynamic allocation, fair scheduler

10. Summary

Area	Key Settings	Effect
AQE	`adaptive.*`	Automatic skew/partition/join handling
Sizing	4-5 cores, memory, overhead	Stability and throughput
Shuffle	`shuffle.partitions` (less critical with AQE)	Partition size
Serialization	Kryo	Speed
Dynamic allocation	`dynamicAllocation.*`	Resource efficiency
Joins	`autoBroadcastJoinThreshold`	Shuffle avoidance

The essence of Spark config tuning is to "find the bottleneck through measurement, adjust the highest-impact settings first, and accept the limits of configuration." Just enabling AQE and sizing executors sensibly stabilizes most jobs. If things are still slow — before touching more configs, suspect code and data problems like skew, UDFs, and small files. Good configuration makes good code fast, but it can't fix bad code.

This article is based on Spark 3.5. If you need comprehensive performance tuning for your Spark clusters and jobs, feel free to reach out anytime.

— The Data Dynamics Engineering Team