Blog
pysparksparkmemoryoomtuningdata-engineering

Conquering PySpark Executor OOM — Putting an End to Container Killed Errors

We dig into the real causes behind the "Container killed by YARN for exceeding memory limits" error. Understand the executor memory layout (heap/overhead/off-heap), spill, GC, partition sizing, and PySpark's unique Python memory footprint, and learn how to fix OOM structurally.

Data DynamicsJune 5, 20267 min read

Two errors stand out as the most notorious in Spark operations: java.lang.OutOfMemoryError, and Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Both say "out of memory," but their causes and fixes are different. Blindly raising executor.memory only wastes resources while the problem keeps coming back.

This post starts with how executor memory is divided, then covers the difference between the two OOMs, PySpark's unique Python memory issues, and an approach to eliminating OOM structurally.

1. Executor Memory Layout — Start with the Picture

You cannot understand OOM by looking at executor.memory alone. A single container's memory is split into several regions.

┌─────────────────── Container (the unit YARN/K8s kills) ──────────────────┐
│                                                                          │
│  ┌──────────────── JVM heap (spark.executor.memory) ─────────────┐      │
│  │  Reserved (300MB)                                              │      │
│  │  ┌─ Unified Memory (spark.memory.fraction, default 0.6) ────┐  │      │
│  │  │   Execution (shuffle/sort/join buffers) ⇄ Storage (cache) │  │      │
│  │  └─────────────────────────────────────────────────────────┘  │      │
│  │  User Memory (UDFs, user data structures)                      │      │
│  └────────────────────────────────────────────────────────────────┘      │
│                                                                          │
│  Overhead (spark.executor.memoryOverhead)  ← native, shuffle, some Python│
│  Off-heap (spark.memory.offHeap.size, optional)                          │
│  ⟵ PySpark: Python worker process memory lives OUTSIDE the heap!         │
│     (it pressures overhead)                                              │
└──────────────────────────────────────────────────────────────────────────┘

The key point: YARN/K8s kills based on the container's total physical memory. Even if the heap (executor.memory) is fine, you get "Container killed" when overhead plus the Python process exceeds the container limit.

2. The Two OOMs Are Different

ErrorWhere it occursMeaningFirst response
java.lang.OutOfMemoryError: Java heap spaceInside the JVM heapHeap exhaustedSmaller partitions, more memory, less caching
Container killed ... exceeding memory limitsWhole containerOverhead/off-heap/Python over limitmemoryOverhead up, manage Python memory

The most common misconception: seeing "Container killed" and raising only executor.memory (the heap). This error usually means insufficient overhead, so raising the heap makes the container even bigger and can make things worse. In many cases, raising memoryOverhead is the right answer.

3. Key Settings at a Glance

spark = (SparkSession.builder
    .config("spark.executor.memory", "8g")              # JVM heap
    .config("spark.executor.memoryOverhead", "2g")      # native + shuffle + Python
    .config("spark.executor.cores", "4")                # concurrent tasks per executor
    .config("spark.memory.fraction", "0.6")             # execution+storage ratio
    .config("spark.sql.shuffle.partitions", "400")      # number of shuffle partitions
    .getOrCreate())
SettingRoleTuning direction
executor.memoryJVM heapToo large worsens GC
executor.memoryOverheadOff-heap regionsFirst thing to raise on Container killed
executor.coresConcurrent task countMore cores means more memory contention
memory.fractionWork/cache pool ratioMore execution headroom if you don't cache
sql.shuffle.partitionsPartition count after shuffleMore partitions means less memory per partition

4. The Real Cause of OOM — Partitions That Are Too Big

Most heap OOMs happen not because memory is small, but because a single partition does not fit in executor memory. One task handles one entire partition.

Data per partition ≈ input size / number of partitions
Big partitions → one task loads huge data into memory → OOM

The fix is usually not "add more memory" but "split partitions finer."

# Increase the post-shuffle partition count → less data per partition
spark.conf.set("spark.sql.shuffle.partitions", "800")
 
# Redistribute at the input stage
df = df.repartition(800, "key")
 
# Let AQE auto-tune partitions at runtime (recommended)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

Rule of thumb: aim for 100-200MB of data processed per partition. When you hit OOM, try increasing the partition count first. However, if skew makes only certain partitions huge, adding partitions won't help — fix the skew first (see our separate post "Mastering Data Skew in PySpark").

5. Spill — The Warning Sign Right Before OOM

When execution memory runs short, Spark spills data to disk. Spill itself prevents OOM, but excessive spill is a warning light that memory is tight.

If Spill (Memory) and Spill (Disk) are large in the Spark UI task metrics:

  • Split partitions finer (shuffle.partitions up)
  • Drop unneeded columns (narrow early with select)
  • Reduce caching to free up the execution pool

Spill slows the job down with disk I/O, so a little memory headroom makes a big performance difference.

6. The PySpark-Specific Trap — Python Memory

When you use Python UDFs, pandas_udf, or applyInPandas in PySpark, separate Python worker processes run outside the JVM heap. This memory does not count against executor.memory (the heap) — it pressures the container overhead.

Python UDF runs → JVM ↔ Python serialization → Python process holds data in memory
→ heap looks fine but the container dies (Container killed)

Mitigation:

# Cap Python worker memory (spills beyond this value)
spark.conf.set("spark.executor.pyspark.memory", "2g")
# Give overhead generous room to match
spark.conf.set("spark.executor.memoryOverhead", "3g")

The fundamental fix is reducing Python UDFs. Replace them with built-in functions where possible, and when a UDF is unavoidable, use vectorized pandas_udf instead of row-by-row UDFs. (We cover this in detail in a separate post, "PySpark Python UDF vs Pandas UDF.")

7. collect / toPandas — Driver OOM

It is also common for the driver, not the executor, to die. The culprit is actions that pull a huge result into driver memory.

# Dangerous: pulls the entire result into driver memory
data = df.collect()
pdf = df.toPandas()
 
# Safe: only small results after aggregation, or write directly
df.write.parquet("...")            # executors write in a distributed fashion
small = df.limit(1000).toPandas()  # only as much as you need

collect(), toPandas(), and large broadcast variables consume driver memory. If the result is large, write it out in a distributed manner instead of collecting it to the driver.

8. Cache Management — Storage Starves Execution

When data you cache()/persist() occupies the Storage pool, the Execution pool shrinks and the OOM risk grows.

# Cache only what is truly reused multiple times
df_reused.persist(StorageLevel.MEMORY_AND_DISK)  # falls back to disk under memory pressure
...
df_reused.unpersist()  # release immediately when done
 
# Caching something used only once is wasteful and risky

With MEMORY_ONLY, data that doesn't fit in memory gets recomputed or adds pressure. For large data, MEMORY_AND_DISK is safer.

9. Quick Diagnosis-to-Fix Table

SymptomLikely causeFix
Java heap spaceOversized partitionsshuffle.partitions up, repartition, fewer columns
Container killedOverhead/Python over limitmemoryOverhead up, fewer Python UDFs
Driver OOMcollect/toPandasDistributed write, limit
Spill explosionExecution memory shortMore partitions, less caching
Only one task OOMsSkewFix the skew (salt/broadcast)
Excessive GC timeHeap too large/too many objectsRight-size the heap, reduce objects
1. Enable AQE (automatic partition tuning)
2. Identify the error type (heap vs container killed)
3. heap → finer partitions + drop unneeded columns/caches
4. container killed → memoryOverhead up + check Python memory
5. Only one task dying → suspect skew
6. Still not enough? Only then raise executor.memory

The key takeaway: growing memory is the last resort. Fix partitions, skew, caching, and Python first, and you can process more data with the same resources.

11. Summary

AreaKey point
Memory layoutHeap + overhead + off-heap + Python; killed at the container level
Two OOM typesheap = partitions, container killed = overhead/Python
First fixNot more memory — finer partitions
PySpark trapPython workers pressure overhead → reduce UDFs
DriverAvoid collect/toPandas; write distributed

The core insight about executor OOM is that it is not "OOM = not enough memory" but rather "OOM = the unit of data doesn't fit in memory." Split partitions to the right size, understand the container memory layout to distinguish heap from overhead, and — for PySpark — keep Python memory in view, and the Container killed error is no longer a mystery.


This article is based on Spark 3.5. If you need memory and stability tuning for large-scale Spark jobs, feel free to reach out.

— The Data Dynamics Engineering Team