Blog
pysparksparkkubernetesdynamic-allocationspotdevops

PySpark on Kubernetes — Dynamic Allocation, Shuffle, and Running on Spot

The practical challenges of running Spark on Kubernetes: the executor Pod model, dynamic allocation and shuffle data retention, the trap of losing shuffle data on spot instances, and patterns that cut cost while preserving stability.

Data DynamicsJune 5, 20266 min read

Moving Spark workloads from YARN to Kubernetes has become mainstream. Container-based deployment, autoscaling, and the cost savings of spot instances are all attractive. But Spark on K8s comes with challenges that simply did not exist in the YARN era — most notably dynamic allocation and shuffle data retention, and shuffle loss when executors disappear on spot instances.

This post walks through the execution model of Spark on Kubernetes, the key configuration settings, and the operational patterns that deliver both cost efficiency and stability.

1. Execution Model — Driver and Executor Pods

spark-submit (--master k8s://...)


   Driver Pod  ──(creates/deletes executor Pods via the K8s API)──┐
        │                                            │
   ┌────┼────────────────┐                           │
   ▼    ▼                ▼                            │
Executor  Executor  Executor Pod  ← managed directly by the driver
Spark conceptKubernetes
DriverPod (one per job)
ExecutorPod (created/deleted by the driver)
Resource requestsPod requests/limits
Isolationnamespace, resource quota

Unlike YARN, the driver creates and deletes executor Pods directly through the K8s API. There is no separate cluster manager — Kubernetes itself fills that role.

2. Basic Submission and Resource Settings

# spark-submit example (conceptual)
# spark-submit \
#   --master k8s://https://<api-server> \
#   --deploy-mode cluster \
#   --conf spark.kubernetes.container.image=<spark-image> \
#   --conf spark.executor.instances=10 \
#   ...
 
conf = {
    "spark.kubernetes.container.image": "registry/spark:pinned-tag",
    "spark.executor.instances": "10",
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",   # account for PySpark Python memory
    "spark.executor.cores": "4",
    "spark.kubernetes.executor.request.cores": "4",
}

If you run PySpark, give memoryOverhead plenty of headroom — Python workers use memory outside the JVM heap, so the container can get OOMKilled (see our separate post, "Conquering PySpark Executor OOM").

3. Dynamic Allocation — Scaling Executors with Load

Instead of a fixed executor count, dynamic allocation grows and shrinks the executor pool based on job load, eliminating idle resource waste.

conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",  # critical on K8s!
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "50",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}

The key point: Kubernetes (traditionally) has no equivalent of YARN's External Shuffle Service. So when dynamic allocation scales executors down, the shuffle data held by those executors disappears. That is what shuffleTracking.enabled solves — executors holding shuffle data are tracked and kept alive even when idle, instead of being reclaimed.

4. The Biggest Trap — Shuffle Data Loss

When an executor Pod disappears on K8s (scale-down, node failure, spot reclamation), the shuffle data on that Pod's local disk disappears with it. When another executor tries to read that shuffle, it hits a FetchFailedException, and Spark recomputes the affected stage.

Executor A writes shuffle → A disappears due to spot reclamation
→ Executor B tries to read A's shuffle → FetchFailedException
→ Stage recomputation (expensive); if it repeats, the job fails

Mitigation strategies:

StrategyApproach
shuffle trackingKeep executors holding shuffle data alive via shuffleTracking.enabled=true
Externalize shuffle dataMove shuffle outside the cluster with a remote shuffle service (e.g. Celeborn)
FTE (retries)Allow stage recomputation and regenerate shuffle
Limit spot usageIncrease the on-demand share for shuffle-heavy jobs

5. Spot Instances — Cost Savings and Risk

Spot (or preemptible) instances are cheap but can be reclaimed at any time. The pattern for using spot safely with Spark on K8s:

Driver  → on-demand nodes (if the driver dies, the whole job fails — SPOF)
Executor → spot nodes (recoverable via recomputation/retry if they die)
conf = {
    # driver on on-demand, executors on the spot node pool (nodeSelector)
    "spark.kubernetes.driver.node.selector.node-pool": "on-demand",
    "spark.kubernetes.executor.node.selector.node-pool": "spot",
}
ComponentPlacementRationale
DriverOn-demandIf it dies, the whole job fails (SPOF)
ExecutorSpotRecoverable via recomputation/retry
Shuffle-heavy stagesHigher on-demand shareShuffle loss is expensive

Rule of thumb: never place the driver on spot. Executor loss is recoverable through recomputation, but driver loss kills the entire job. (Same principle as the Trino coordinator — see our separate post, "Deploying Trino on Kubernetes".)

6. Data Locality and I/O

Spark on K8s usually runs with compute and storage separated (S3/object storage). Since there is no HDFS-style data locality:

  • I/O depends on object storage connector (S3A, etc.) performance → tune the connector (multipart, connection pooling).
  • Secure fast node-local SSDs for shuffle and spill (spark.local.dir).
  • Lakehouse (Iceberg/Delta) + object storage is the natural combination.

7. Operations — Monitoring and Isolation

AreaApproach
Resource isolationnamespace + ResourceQuota
ImagesPin versions (never use latest)
LogsCollect driver/executor Pod logs
MetricsSpark UI + Prometheus (metrics sink)
CleanupPolicy for cleaning up completed driver Pods

When multiple teams share a cluster, isolate them with namespaces and quotas so one team's job cannot starve everyone else.

8. Spark on K8s vs YARN at a Glance

AreaYARNKubernetes
Resource managementRM/NMK8s scheduler
Shuffle serviceBuilt-in External Shuffle ServiceNone → shuffle tracking/remote shuffle
AutoscalingLimiteddynamic allocation + cluster autoscaler
SpotLimitedNatural fit (but beware shuffle loss)
Multi-tenancyQueuesnamespace/quota
Data localityStrong with HDFSUsually separated (object storage)

9. Summary

AreaKey takeaway
Execution modelThe driver manages executor Pods directly
dynamic allocationshuffleTracking is essential (shuffle retention)
Shuffle lossLost executor = lost shuffle → recomputation/remote shuffle
SpotExecutors only; keep the driver on on-demand
StorageCompute-storage separation, pairs well with a Lakehouse

The core insight of Spark on Kubernetes is that most operational challenges flow from a single fact: "there is no YARN External Shuffle Service". Dynamic allocation's shuffleTracking, shuffle loss on spot, separating driver and executor node pools — all of these come down to how you manage the lifetime of shuffle data. Protect the driver on on-demand nodes, cut cost by running executors on spot, and treat shuffle-heavy stages with care, and you can have both cost efficiency and stability.


This article is based on Spark 3.5. If you need help with a Spark on Kubernetes migration or cost-optimization design, feel free to reach out.

— The Data Dynamics Engineering Team