pysparksparkmemoryoomtuningdata-engineering

Conquering PySpark Executor OOM — Putting an End to Container Killed Errors

We dig into the real causes behind the "Container killed by YARN for exceeding memory limits" error. Understand the executor memory layout (heap/overhead/off-heap), spill, GC, partition sizing, and PySpark's unique Python memory footprint, and learn how to fix OOM structurally.

Data DynamicsJune 5, 20267 min read

Two errors stand out as the most notorious in Spark operations: java.lang.OutOfMemoryError, and Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Both say "out of memory," but their causes and fixes are different. Blindly raising executor.memory only wastes resources while the problem keeps coming back.

This post starts with how executor memory is divided, then covers the difference between the two OOMs, PySpark's unique Python memory issues, and an approach to eliminating OOM structurally.

1. Executor Memory Layout — Start with the Picture

You cannot understand OOM by looking at executor.memory alone. A single container's memory is split into several regions.

┌─────────────────── Container (the unit YARN/K8s kills) ──────────────────┐
│                                                                          │
│  ┌──────────────── JVM heap (spark.executor.memory) ─────────────┐      │
│  │  Reserved (300MB)                                              │      │
│  │  ┌─ Unified Memory (spark.memory.fraction, default 0.6) ────┐  │      │
│  │  │   Execution (shuffle/sort/join buffers) ⇄ Storage (cache) │  │      │
│  │  └─────────────────────────────────────────────────────────┘  │      │
│  │  User Memory (UDFs, user data structures)                      │      │
│  └────────────────────────────────────────────────────────────────┘      │
│                                                                          │
│  Overhead (spark.executor.memoryOverhead)  ← native, shuffle, some Python│
│  Off-heap (spark.memory.offHeap.size, optional)                          │
│  ⟵ PySpark: Python worker process memory lives OUTSIDE the heap!         │
│     (it pressures overhead)                                              │
└──────────────────────────────────────────────────────────────────────────┘

The key point: YARN/K8s kills based on the container's total physical memory. Even if the heap (executor.memory) is fine, you get "Container killed" when overhead plus the Python process exceeds the container limit.

2. The Two OOMs Are Different

Error	Where it occurs	Meaning	First response
`java.lang.OutOfMemoryError: Java heap space`	Inside the JVM heap	Heap exhausted	Smaller partitions, more memory, less caching
`Container killed ... exceeding memory limits`	Whole container	Overhead/off-heap/Python over limit	memoryOverhead up, manage Python memory

The most common misconception: seeing "Container killed" and raising only executor.memory (the heap). This error usually means insufficient overhead, so raising the heap makes the container even bigger and can make things worse. In many cases, raising memoryOverhead is the right answer.

3. Key Settings at a Glance

spark = (SparkSession.builder
    .config("spark.executor.memory", "8g")              # JVM heap
    .config("spark.executor.memoryOverhead", "2g")      # native + shuffle + Python
    .config("spark.executor.cores", "4")                # concurrent tasks per executor
    .config("spark.memory.fraction", "0.6")             # execution+storage ratio
    .config("spark.sql.shuffle.partitions", "400")      # number of shuffle partitions
    .getOrCreate())

Setting	Role	Tuning direction
`executor.memory`	JVM heap	Too large worsens GC
`executor.memoryOverhead`	Off-heap regions	First thing to raise on Container killed
`executor.cores`	Concurrent task count	More cores means more memory contention
`memory.fraction`	Work/cache pool ratio	More execution headroom if you don't cache
`sql.shuffle.partitions`	Partition count after shuffle	More partitions means less memory per partition

4. The Real Cause of OOM — Partitions That Are Too Big

Most heap OOMs happen not because memory is small, but because a single partition does not fit in executor memory. One task handles one entire partition.

Data per partition ≈ input size / number of partitions
Big partitions → one task loads huge data into memory → OOM

The fix is usually not "add more memory" but "split partitions finer."

# Increase the post-shuffle partition count → less data per partition
spark.conf.set("spark.sql.shuffle.partitions", "800")
 
# Redistribute at the input stage
df = df.repartition(800, "key")
 
# Let AQE auto-tune partitions at runtime (recommended)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

Rule of thumb: aim for 100-200MB of data processed per partition. When you hit OOM, try increasing the partition count first. However, if skew makes only certain partitions huge, adding partitions won't help — fix the skew first (see our separate post "Mastering Data Skew in PySpark").

5. Spill — The Warning Sign Right Before OOM

When execution memory runs short, Spark spills data to disk. Spill itself prevents OOM, but excessive spill is a warning light that memory is tight.

If Spill (Memory) and Spill (Disk) are large in the Spark UI task metrics:

Split partitions finer (shuffle.partitions up)
Drop unneeded columns (narrow early with select)
Reduce caching to free up the execution pool

Spill slows the job down with disk I/O, so a little memory headroom makes a big performance difference.

6. The PySpark-Specific Trap — Python Memory

When you use Python UDFs, pandas_udf, or applyInPandas in PySpark, separate Python worker processes run outside the JVM heap. This memory does not count against executor.memory (the heap) — it pressures the container overhead.

Loading diagram…

Mitigation:

# Cap Python worker memory (spills beyond this value)
spark.conf.set("spark.executor.pyspark.memory", "2g")
# Give overhead generous room to match
spark.conf.set("spark.executor.memoryOverhead", "3g")

The fundamental fix is reducing Python UDFs. Replace them with built-in functions where possible, and when a UDF is unavoidable, use vectorized pandas_udf instead of row-by-row UDFs. (We cover this in detail in a separate post, "PySpark Python UDF vs Pandas UDF.")

7. collect / toPandas — Driver OOM

It is also common for the driver, not the executor, to die. The culprit is actions that pull a huge result into driver memory.

# Dangerous: pulls the entire result into driver memory
data = df.collect()
pdf = df.toPandas()
 
# Safe: only small results after aggregation, or write directly
df.write.parquet("...")            # executors write in a distributed fashion
small = df.limit(1000).toPandas()  # only as much as you need

collect(), toPandas(), and large broadcast variables consume driver memory. If the result is large, write it out in a distributed manner instead of collecting it to the driver.

8. Cache Management — Storage Starves Execution

When data you cache()/persist() occupies the Storage pool, the Execution pool shrinks and the OOM risk grows.

# Cache only what is truly reused multiple times
df_reused.persist(StorageLevel.MEMORY_AND_DISK)  # falls back to disk under memory pressure
...
df_reused.unpersist()  # release immediately when done
 
# Caching something used only once is wasteful and risky

With MEMORY_ONLY, data that doesn't fit in memory gets recomputed or adds pressure. For large data, MEMORY_AND_DISK is safer.

9. Quick Diagnosis-to-Fix Table

Symptom	Likely cause	Fix
`Java heap space`	Oversized partitions	`shuffle.partitions` up, repartition, fewer columns
`Container killed`	Overhead/Python over limit	`memoryOverhead` up, fewer Python UDFs
Driver OOM	`collect`/`toPandas`	Distributed write, limit
Spill explosion	Execution memory short	More partitions, less caching
Only one task OOMs	Skew	Fix the skew (salt/broadcast)
Excessive GC time	Heap too large/too many objects	Right-size the heap, reduce objects

10. Recommended OOM Tuning Order

1. Enable AQE (automatic partition tuning)
2. Identify the error type (heap vs container killed)
3. heap → finer partitions + drop unneeded columns/caches
4. container killed → memoryOverhead up + check Python memory
5. Only one task dying → suspect skew
6. Still not enough? Only then raise executor.memory

The key takeaway: growing memory is the last resort. Fix partitions, skew, caching, and Python first, and you can process more data with the same resources.

11. Summary

Area	Key point
Memory layout	Heap + overhead + off-heap + Python; killed at the container level
Two OOM types	heap = partitions, container killed = overhead/Python
First fix	Not more memory — finer partitions
PySpark trap	Python workers pressure overhead → reduce UDFs
Driver	Avoid `collect/toPandas`; write distributed

The core insight about executor OOM is that it is not "OOM = not enough memory" but rather "OOM = the unit of data doesn't fit in memory." Split partitions to the right size, understand the container memory layout to distinguish heap from overhead, and — for PySpark — keep Python memory in view, and the Container killed error is no longer a mystery.

This article is based on Spark 3.5. If you need memory and stability tuning for large-scale Spark jobs, feel free to reach out.

— The Data Dynamics Engineering Team