PySpark on Kubernetes — Dynamic Allocation, Shuffle, and Running on Spot
The practical challenges of running Spark on Kubernetes: the executor Pod model, dynamic allocation and shuffle data retention, the trap of losing shuffle data on spot instances, and patterns that cut cost while preserving stability.
Moving Spark workloads from YARN to Kubernetes has become mainstream. Container-based deployment, autoscaling, and the cost savings of spot instances are all attractive. But Spark on K8s comes with challenges that simply did not exist in the YARN era — most notably dynamic allocation and shuffle data retention, and shuffle loss when executors disappear on spot instances.
This post walks through the execution model of Spark on Kubernetes, the key configuration settings, and the operational patterns that deliver both cost efficiency and stability.
1. Execution Model — Driver and Executor Pods
spark-submit (--master k8s://...)
│
▼
Driver Pod ──(creates/deletes executor Pods via the K8s API)──┐
│ │
┌────┼────────────────┐ │
▼ ▼ ▼ │
Executor Executor Executor Pod ← managed directly by the driver| Spark concept | Kubernetes |
|---|---|
| Driver | Pod (one per job) |
| Executor | Pod (created/deleted by the driver) |
| Resource requests | Pod requests/limits |
| Isolation | namespace, resource quota |
Unlike YARN, the driver creates and deletes executor Pods directly through the K8s API. There is no separate cluster manager — Kubernetes itself fills that role.
2. Basic Submission and Resource Settings
# spark-submit example (conceptual)
# spark-submit \
# --master k8s://https://<api-server> \
# --deploy-mode cluster \
# --conf spark.kubernetes.container.image=<spark-image> \
# --conf spark.executor.instances=10 \
# ...
conf = {
"spark.kubernetes.container.image": "registry/spark:pinned-tag",
"spark.executor.instances": "10",
"spark.executor.memory": "8g",
"spark.executor.memoryOverhead": "2g", # account for PySpark Python memory
"spark.executor.cores": "4",
"spark.kubernetes.executor.request.cores": "4",
}If you run PySpark, give memoryOverhead plenty of headroom — Python workers use memory outside the JVM heap, so the container can get OOMKilled (see our separate post, "Conquering PySpark Executor OOM").
3. Dynamic Allocation — Scaling Executors with Load
Instead of a fixed executor count, dynamic allocation grows and shrinks the executor pool based on job load, eliminating idle resource waste.
conf = {
"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.shuffleTracking.enabled": "true", # critical on K8s!
"spark.dynamicAllocation.minExecutors": "2",
"spark.dynamicAllocation.maxExecutors": "50",
"spark.dynamicAllocation.executorIdleTimeout": "60s",
}The key point: Kubernetes (traditionally) has no equivalent of YARN's External Shuffle Service. So when dynamic allocation scales executors down, the shuffle data held by those executors disappears. That is what
shuffleTracking.enabledsolves — executors holding shuffle data are tracked and kept alive even when idle, instead of being reclaimed.
4. The Biggest Trap — Shuffle Data Loss
When an executor Pod disappears on K8s (scale-down, node failure, spot reclamation), the shuffle data on that Pod's local disk disappears with it. When another executor tries to read that shuffle, it hits a FetchFailedException, and Spark recomputes the affected stage.
Executor A writes shuffle → A disappears due to spot reclamation
→ Executor B tries to read A's shuffle → FetchFailedException
→ Stage recomputation (expensive); if it repeats, the job failsMitigation strategies:
| Strategy | Approach |
|---|---|
| shuffle tracking | Keep executors holding shuffle data alive via shuffleTracking.enabled=true |
| Externalize shuffle data | Move shuffle outside the cluster with a remote shuffle service (e.g. Celeborn) |
| FTE (retries) | Allow stage recomputation and regenerate shuffle |
| Limit spot usage | Increase the on-demand share for shuffle-heavy jobs |
5. Spot Instances — Cost Savings and Risk
Spot (or preemptible) instances are cheap but can be reclaimed at any time. The pattern for using spot safely with Spark on K8s:
Driver → on-demand nodes (if the driver dies, the whole job fails — SPOF)
Executor → spot nodes (recoverable via recomputation/retry if they die)conf = {
# driver on on-demand, executors on the spot node pool (nodeSelector)
"spark.kubernetes.driver.node.selector.node-pool": "on-demand",
"spark.kubernetes.executor.node.selector.node-pool": "spot",
}| Component | Placement | Rationale |
|---|---|---|
| Driver | On-demand | If it dies, the whole job fails (SPOF) |
| Executor | Spot | Recoverable via recomputation/retry |
| Shuffle-heavy stages | Higher on-demand share | Shuffle loss is expensive |
Rule of thumb: never place the driver on spot. Executor loss is recoverable through recomputation, but driver loss kills the entire job. (Same principle as the Trino coordinator — see our separate post, "Deploying Trino on Kubernetes".)
6. Data Locality and I/O
Spark on K8s usually runs with compute and storage separated (S3/object storage). Since there is no HDFS-style data locality:
- I/O depends on object storage connector (S3A, etc.) performance → tune the connector (multipart, connection pooling).
- Secure fast node-local SSDs for shuffle and spill (
spark.local.dir). - Lakehouse (Iceberg/Delta) + object storage is the natural combination.
7. Operations — Monitoring and Isolation
| Area | Approach |
|---|---|
| Resource isolation | namespace + ResourceQuota |
| Images | Pin versions (never use latest) |
| Logs | Collect driver/executor Pod logs |
| Metrics | Spark UI + Prometheus (metrics sink) |
| Cleanup | Policy for cleaning up completed driver Pods |
When multiple teams share a cluster, isolate them with namespaces and quotas so one team's job cannot starve everyone else.
8. Spark on K8s vs YARN at a Glance
| Area | YARN | Kubernetes |
|---|---|---|
| Resource management | RM/NM | K8s scheduler |
| Shuffle service | Built-in External Shuffle Service | None → shuffle tracking/remote shuffle |
| Autoscaling | Limited | dynamic allocation + cluster autoscaler |
| Spot | Limited | Natural fit (but beware shuffle loss) |
| Multi-tenancy | Queues | namespace/quota |
| Data locality | Strong with HDFS | Usually separated (object storage) |
9. Summary
| Area | Key takeaway |
|---|---|
| Execution model | The driver manages executor Pods directly |
| dynamic allocation | shuffleTracking is essential (shuffle retention) |
| Shuffle loss | Lost executor = lost shuffle → recomputation/remote shuffle |
| Spot | Executors only; keep the driver on on-demand |
| Storage | Compute-storage separation, pairs well with a Lakehouse |
The core insight of Spark on Kubernetes is that most operational challenges flow from a single fact: "there is no YARN External Shuffle Service". Dynamic allocation's shuffleTracking, shuffle loss on spot, separating driver and executor node pools — all of these come down to how you manage the lifetime of shuffle data. Protect the driver on on-demand nodes, cut cost by running executors on spot, and treat shuffle-heavy stages with care, and you can have both cost efficiency and stability.
This article is based on Spark 3.5. If you need help with a Spark on Kubernetes migration or cost-optimization design, feel free to reach out.
— The Data Dynamics Engineering Team