PySpark on Kubernetes — Dynamic Allocation, Shuffle, and Running on Spot

The practical challenges of running Spark on Kubernetes: the executor Pod model, dynamic allocation and shuffle data retention, the trap of losing shuffle data on spot instances, and patterns that cut cost while preserving stability.

Data Dynamics · June 5, 2026 · 6 min read

Moving Spark workloads from YARN to Kubernetes has become mainstream. Container-based deployment, autoscaling, and the cost savings of spot instances are all attractive. But Spark on K8s comes with challenges that did not exist in the YARN era, most notably retaining shuffle data under dynamic allocation and losing it when executors disappear on spot instances.

This post walks through the execution model of Spark on Kubernetes, the key configuration settings, and the operational patterns that deliver both cost efficiency and stability.

1. Execution Model — Driver and Executor Pods

Spark concept	Kubernetes
Driver	Pod (one per job)
Executor	Pod (created/deleted by the driver)
Resource requests	Pod requests/limits
Isolation	namespace, resource quota

Unlike YARN, the driver creates and deletes executor Pods directly through the K8s API. Kubernetes itself acts as the cluster manager, so there is no ResourceManager/NodeManager layer in between.

2. Basic Submission and Resource Settings

# spark-submit example (conceptual)
# spark-submit \
#   --master k8s://https://<api-server> \
#   --deploy-mode cluster \
#   --conf spark.kubernetes.container.image=<spark-image> \
#   --conf spark.executor.instances=10 \
#   ...
 
conf = {
    "spark.kubernetes.container.image": "registry/spark:pinned-tag",
    "spark.executor.instances": "10",
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",   # account for PySpark Python memory
    "spark.executor.cores": "4",
    "spark.kubernetes.executor.request.cores": "4",
}
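
To connect the commented spark-submit outline with the conf dictionary, here is a minimal sketch that turns such a dictionary into a spark-submit argument list. The helper name to_submit_args and the application path local:///opt/app/job.py are assumptions made up for illustration; only the flags themselves come from the outline above.

```python
import shlex


def to_submit_args(master, conf, app):
    """Build a spark-submit argument list from a conf dict (hypothetical helper)."""
    args = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    # Sort keys so the generated command is stable across runs.
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


example_conf = {
    "spark.kubernetes.container.image": "registry/spark:pinned-tag",
    "spark.executor.instances": "10",
}
# The application path below is a placeholder, not a real location.
cmd = to_submit_args("k8s://https://<api-server>", example_conf, "local:///opt/app/job.py")
print(shlex.join(cmd))
```

Keeping the settings in one dictionary and generating the command from it makes it harder for the submitted configuration to drift from the one under version control.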

If you run PySpark, give memoryOverhead plenty of headroom — Python workers use memory outside the JVM heap, so the container can get OOMKilled (see our separate post, "Conquering PySpark Executor OOM").
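
As a rough worked example of why the overhead matters: the memory the executor container asks Kubernetes for is approximately the JVM heap plus the overhead. The sketch below does that arithmetic for the settings above. The helper names are hypothetical, the parser only understands g and m suffixes, and other contributions (such as off-heap memory, if configured) are left out.

```python
def parse_gib(value):
    """Parse a size string such as '8g' or '512m' into GiB (sketch: g/m suffixes only)."""
    value = value.strip().lower()
    if value.endswith("g"):
        return float(value[:-1])
    if value.endswith("m"):
        return float(value[:-1]) / 1024
    raise ValueError(f"unsupported size: {value}")


def pod_memory_gib(conf):
    """Approximate executor container memory: JVM heap plus explicit overhead."""
    heap = parse_gib(conf["spark.executor.memory"])
    overhead = parse_gib(conf["spark.executor.memoryOverhead"])
    return heap + overhead


mem_conf = {"spark.executor.memory": "8g", "spark.executor.memoryOverhead": "2g"}
print(pod_memory_gib(mem_conf))  # 10.0
```

With these values each executor needs roughly 10 GiB of schedulable memory on a node, and Python workers have to fit inside the 2 GiB overhead portion.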

3. Dynamic Allocation — Scaling Executors with Load

Instead of a fixed executor count, dynamic allocation grows and shrinks the executor pool based on job load, which cuts down on idle resources.

conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",  # critical on K8s!
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "50",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}
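
To make the effect of minExecutors and maxExecutors concrete, here is a simplified model of how many executors a given task backlog calls for. It is a sketch under stated assumptions, not Spark's actual allocation logic, which also ramps requests up gradually and honors further settings. The function name and the example task counts are invented for illustration.

```python
import math


def target_executors(pending_tasks, running_tasks, executor_cores=4, task_cpus=1,
                     min_executors=2, max_executors=50):
    """Simplified model: enough executors to run all tasks at once, clamped to min/max."""
    slots_per_executor = executor_cores // task_cpus
    needed = math.ceil((pending_tasks + running_tasks) / slots_per_executor)
    return max(min_executors, min(max_executors, needed))


print(target_executors(0, 0))     # 2  (never drops below minExecutors)
print(target_executors(100, 20))  # 30 (120 tasks / 4 slots per executor)
print(target_executors(1000, 0))  # 50 (capped at maxExecutors)
```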

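The key point: Kubernetes has traditionally had no equivalent of YARN's External Shuffle Service. Shuffle files live on the executor that wrote them, so when dynamic allocation scales an executor down, its shuffle data disappears with it. That is what shuffleTracking.enabled addresses: the driver tracks which executors hold shuffle data that is still needed and keeps them alive even when idle, instead of reclaiming them. The trade-off is that such executors can linger; spark.dynamicAllocation.shuffleTracking.timeout caps how long they are retained.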
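
The scale-down rule can be sketched as a toy model. This is an illustration of the idea only, not Spark's implementation; the ExecutorState class and the removable function are hypothetical names, and the model ignores minExecutors.

```python
from dataclasses import dataclass


@dataclass
class ExecutorState:
    executor_id: str
    idle_seconds: float
    holds_active_shuffle: bool  # shuffle output that a later stage may still read


def removable(executors, idle_timeout=60.0, shuffle_tracking=True):
    """Return ids of executors a scale-down pass would reclaim in this toy model."""
    out = []
    for e in executors:
        if e.idle_seconds < idle_timeout:
            continue  # not idle long enough
        if shuffle_tracking and e.holds_active_shuffle:
            continue  # kept alive because its shuffle data is still needed
        out.append(e.executor_id)
    return out


pool = [
    ExecutorState("exec-1", 120.0, False),
    ExecutorState("exec-2", 120.0, True),
    ExecutorState("exec-3", 10.0, False),
]
print(removable(pool))                          # ['exec-1']
print(removable(pool, shuffle_tracking=False))  # ['exec-1', 'exec-2']
```

The second call shows what a policy that ignores shuffle would reclaim: exec-2 goes away together with the shuffle files a downstream stage still needs.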

4. The Biggest Trap — Shuffle Data Loss

When an executor Pod disappears on K8s (scale-down, node failure, spot reclamation), the shuffle data on that Pod's local disk disappears with it. When another executor tries to read that shuffle, it hits a FetchFailedException, and Spark has to rerun the upstream tasks that produced the missing output before the failed stage can be retried.
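
To see why a single lost Pod can trigger a sizeable recomputation, consider a simplified picture in which each map partition's output lives on exactly one executor. The mapping and function below are hypothetical; Spark does the real bookkeeping internally.

```python
def lost_map_partitions(map_output_location, lost_executor):
    """Map partitions whose shuffle output lived on the lost executor."""
    return sorted(p for p, ex in map_output_location.items() if ex == lost_executor)


# Hypothetical placement: map partition id -> executor holding its shuffle output.
locations = {0: "exec-1", 1: "exec-2", 2: "exec-1", 3: "exec-3"}
print(lost_map_partitions(locations, "exec-1"))  # [0, 2]
```

Every reducer needs a slice of every map output, so partitions 0 and 2 have to be recomputed before any reducer can finish, even though the other executors are healthy.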

Mitigation strategies:

Strategy	Approach
shuffle tracking	Keep executors holding shuffle data alive via `shuffleTracking.enabled=true`
Externalize shuffle data	Move shuffle outside the cluster with a remote shuffle service (e.g. Celeborn)
Stage retries	Let Spark regenerate lost shuffle by recomputing the affected stage (attempts are bounded by `spark.stage.maxConsecutiveAttempts`)
Limit spot usage	Increase the on-demand share for shuffle-heavy jobs

5. Spot Instances — Cost Savings and Risk

Spot (or preemptible) instances are cheap but can be reclaimed at any time. The pattern for using spot safely with Spark on K8s:

conf = {
    # driver on on-demand, executors on the spot node pool (nodeSelector)
    "spark.kubernetes.driver.node.selector.node-pool": "on-demand",
    "spark.kubernetes.executor.node.selector.node-pool": "spot",
}

Component	Placement	Rationale
Driver	On-demand	If it dies, the whole job fails (SPOF)
Executor	Spot	Recoverable via recomputation/retry
Shuffle-heavy stages	Higher on-demand share	Shuffle loss is expensive
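
A back-of-the-envelope model shows how recomputation eats into spot savings. All prices, shares, and overheads below are made-up inputs chosen for easy arithmetic, not measurements, and the function is a hypothetical sketch.

```python
def blended_cost(executor_hours, on_demand_price, spot_price, spot_share, recompute_overhead):
    """Job cost when spot_share of executor-hours run on spot.

    recompute_overhead is the extra fraction of spot executor-hours spent
    redoing work after reclamations (0.0 means no shuffle loss at all).
    """
    spot_hours = executor_hours * spot_share * (1 + recompute_overhead)
    on_demand_hours = executor_hours * (1 - spot_share)
    return spot_hours * spot_price + on_demand_hours * on_demand_price


# Hypothetical inputs: 100 executor-hours, on-demand at 1.0/hour, spot at 0.25/hour.
print(blended_cost(100, 1.0, 0.25, 0.0, 0.0))  # 100.0  all on-demand
print(blended_cost(100, 1.0, 0.25, 0.5, 0.0))  # 62.5   half on spot, no shuffle loss
print(blended_cost(100, 1.0, 0.25, 0.5, 0.5))  # 68.75  half on spot, 50% recomputation
```

In this model the dollar cost stays below all on-demand even with heavy recomputation; the larger price of shuffle loss is usually the added wall-clock time, which the model does not capture.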

Rule of thumb: never place the driver on spot. Executor loss is recoverable through recomputation, but driver loss kills the entire job. (Same principle as the Trino coordinator — see our separate post, "Deploying Trino on Kubernetes".)
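
One way to enforce this rule is a small pre-submit check on the configuration. The sketch below assumes the node-pool label key and the spot and on-demand values used in this post; check_placement is a hypothetical helper, not part of Spark.

```python
def check_placement(conf, pool_label="node-pool", spot_value="spot"):
    """Return a list of problems with driver placement (empty list means OK)."""
    driver_key = f"spark.kubernetes.driver.node.selector.{pool_label}"
    driver_pool = conf.get(driver_key)
    problems = []
    if driver_pool is None:
        problems.append(f"{driver_key} is not set; the driver may land on any node")
    elif driver_pool == spot_value:
        problems.append("driver is pinned to the spot pool")
    return problems


good = {
    "spark.kubernetes.driver.node.selector.node-pool": "on-demand",
    "spark.kubernetes.executor.node.selector.node-pool": "spot",
}
bad = {"spark.kubernetes.driver.node.selector.node-pool": "spot"}
print(check_placement(good))  # []
print(check_placement(bad))   # ['driver is pinned to the spot pool']
```

Running a check like this in CI or in the submission wrapper catches a misplaced driver before the job starts rather than after a reclamation kills it.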

6. Data Locality and I/O

Spark on K8s usually runs with compute and storage separated (S3/object storage). Since there is no HDFS-style data locality:

I/O performance depends on the object storage connector (S3A, etc.), so tune the connector (multipart size, connection pooling).
Provision fast node-local SSDs for shuffle and spill (spark.local.dir).
A Lakehouse table format (Iceberg/Delta) on object storage is the natural combination.
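
The first two points can be sketched as configuration. Every value below is an illustrative starting point rather than a recommendation, and the host path /mnt/nvme0 and mount path /mnt/spark-local are assumptions about how the nodes are set up. The volume name starts with spark-local-dir- because Spark on Kubernetes uses volumes with that name prefix as local scratch space.

```python
io_conf = {
    # S3A connector: connection pool, upload threads, multipart part size.
    "spark.hadoop.fs.s3a.connection.maximum": "200",
    "spark.hadoop.fs.s3a.threads.max": "64",
    "spark.hadoop.fs.s3a.multipart.size": "128M",
    "spark.hadoop.fs.s3a.fast.upload.buffer": "disk",
    # Node-local SSD for shuffle and spill, exposed as a hostPath volume.
    # Both paths are hypothetical and depend on the node image.
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/mnt/spark-local",
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/nvme0",
}

for key, value in sorted(io_conf.items()):
    print(f"--conf {key}={value}")
```

Measure before and after changing any of these; the right pool and part sizes depend on file sizes, executor counts, and the storage backend.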

7. Operations — Monitoring and Isolation

Area	Approach
Resource isolation	namespace + ResourceQuota
Images	Pin versions (never use `latest`)
Logs	Collect driver/executor Pod logs
Metrics	Spark UI + Prometheus (metrics sink)
Cleanup	Policy for cleaning up completed driver Pods

When multiple teams share a cluster, isolate them with namespaces and quotas so one team's job cannot starve everyone else.
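
As a sketch of that isolation, the snippet below builds a ResourceQuota manifest for one team's namespace and points Spark at the same namespace. The namespace team-a, the quota name, and the limits are hypothetical; the output is JSON, which kubectl apply -f accepts as well as YAML.

```python
import json


def resource_quota(namespace, cpu, memory, pods):
    """Build a ResourceQuota manifest capping what one namespace may request."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-spark-quota", "namespace": namespace},
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory, "pods": pods}},
    }


# Hypothetical limits for a hypothetical team namespace.
quota = resource_quota("team-a", "200", "800Gi", "100")
print(json.dumps(quota, indent=2))

# Submit the team's jobs into the same namespace so the quota applies to them.
team_conf = {"spark.kubernetes.namespace": "team-a"}
```

Set maxExecutors with the quota in mind: once the namespace is at its limit, further executor Pod requests are rejected and the job runs with fewer executors than it asked for.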

8. Spark on K8s vs YARN at a Glance

Area	YARN	Kubernetes
Resource management	RM/NM	K8s scheduler
Shuffle service	Built-in External Shuffle Service	None → shuffle tracking/remote shuffle
Autoscaling	Limited	dynamic allocation + cluster autoscaler
Spot	Limited	Natural fit (but beware shuffle loss)
Multi-tenancy	Queues	namespace/quota
Data locality	Strong with HDFS	Usually separated (object storage)

9. Summary

Area	Key takeaway
Execution model	The driver manages executor Pods directly
dynamic allocation	`shuffleTracking` is essential (shuffle retention)
Shuffle loss	Lost executor = lost shuffle → recomputation/remote shuffle
Spot	Executors only; keep the driver on on-demand
Storage	Compute-storage separation, pairs well with a Lakehouse

The core insight of Spark on Kubernetes is that most operational challenges flow from a single fact: "there is no YARN External Shuffle Service". Dynamic allocation's shuffleTracking, shuffle loss on spot, separating driver and executor node pools — all of these come down to how you manage the lifetime of shuffle data. Protect the driver on on-demand nodes, cut cost by running executors on spot, and treat shuffle-heavy stages with care, and you can have both cost efficiency and stability.

This article is based on Spark 3.5. If you need help with a Spark on Kubernetes migration or cost-optimization design, feel free to reach out.

— The Data Dynamics Engineering Team