Debugging Slow PySpark Jobs — How to Read the Spark UI and DAG
A guide to ending "it's slow but I don't know where." Learn to read the Jobs, Stages, Tasks, and SQL tabs of the Spark UI to identify skew, spill, bad joins, and shuffle explosions, and to diagnose execution plans with EXPLAIN.
"The job is slow, but I don't know where it's slow." It's the phrase data engineers say most often. Guess-driven tuning — blindly bumping executor.memory or tweaking partition counts — only wastes time. If you know how to read the Spark UI, you can pinpoint the cause of a slow job in five minutes.
This post covers how to read each tab of the Spark UI, which numbers point to which problems, and the order in which to diagnose a slow job.
1. Where Diagnosis Starts — Top Down
The Spark UI is hierarchical. You narrow down from the top.
Jobs (the overall job)
└─ Stages (steps split at shuffle boundaries)
└─ Tasks (parallel units within a stage) ← most of the truth lives here
SQL / DataFrame tab (logical/physical plans + runtime metrics)
| Tab | Question it answers |
|---|---|
| Jobs | Which job/action is taking long |
| Stages | Which stage is the bottleneck |
| Tasks | Is there skew or spill (the key one) |
| SQL | Are joins, shuffles, and scans appropriate |
| Executors | Are resources, GC, and failures healthy |
2. Stages Tab — Finding the Bottleneck Stage
Click into a long-running job to see its stages, with each stage's duration and shuffle volume.
What to check:
- A stage with an overwhelmingly long Duration → the bottleneck.
- A stage with large Shuffle Read / Write → shuffle is the core of the cost.
- Input / Output → where data enters and leaves.
If you see a stage with heavy shuffle (large Shuffle Read/Write), a wide operation such as a join or groupBy is redistributing data in bulk there. Shuffle hits both the network and disk, so it often accounts for most of a job's runtime.
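If you would rather rank stages outside the browser, the same numbers are exposed by Spark's monitoring REST API under `/api/v1/applications/<app-id>/stages`. The sketch below assumes you have already fetched that JSON into a list of dicts; the field names (`executorRunTime`, `shuffleReadBytes`, `shuffleWriteBytes`) are the ones I understand that endpoint to return, so verify them against your Spark version. The sample records are made up. Note that `executorRunTime` is summed across tasks, so it is a proxy for cost rather than the wall-clock Duration shown in the UI.

```python
def rank_stages(stages, top=3):
    """Return the `top` stages by total executor run time, with shuffle volume in MiB."""
    ranked = sorted(stages, key=lambda s: s.get("executorRunTime", 0), reverse=True)
    return [
        {
            "stageId": s["stageId"],
            # executorRunTime is reported in milliseconds, summed over all tasks
            "runTimeSec": s.get("executorRunTime", 0) / 1000,
            "shuffleMiB": (s.get("shuffleReadBytes", 0) + s.get("shuffleWriteBytes", 0)) / 2**20,
        }
        for s in ranked[:top]
    ]


# Hypothetical records shaped like the REST API output (values invented for illustration)
sample_stages = [
    {"stageId": 1, "executorRunTime": 30_000, "shuffleReadBytes": 0, "shuffleWriteBytes": 0},
    {"stageId": 3, "executorRunTime": 120_000, "shuffleReadBytes": 0, "shuffleWriteBytes": 512 * 2**20},
    {"stageId": 7, "executorRunTime": 900_000, "shuffleReadBytes": 1536 * 2**20, "shuffleWriteBytes": 512 * 2**20},
]

for row in rank_stages(sample_stages, top=2):
    print(row)
```

The stage at the top of this list is the one to click into on the Stages tab.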
3. Tasks Tab — The Most Important Truth
Click into a stage and look at the Summary Metrics (task quantile statistics). Most problems reveal themselves here.
Identifying skew
Metric Min 25th Median 75th Max
Duration 2s 3s 3s 4s 8min ← Max is 160x the Median!
Shuffle Read 50MB 51MB 52MB 53MB 9GB ← a single task is enormous
If Max is tens to hundreds of times the Median, that's data skew. A handful of tasks is dragging the whole stage. (For fixes, see the separate post "Mastering Data Skew in PySpark".)
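The Max-versus-Median comparison is simple enough to compute yourself when you have per-task durations in hand. This is a minimal sketch; the function names and the 10x cutoff are my own assumptions (the text only says "tens to hundreds of times"), and the sample durations mirror the quantile row above.

```python
from statistics import median


def skew_ratio(values):
    """Return max / median for a list of per-task metrics (durations, bytes read, ...)."""
    if not values:
        raise ValueError("need at least one task metric")
    med = median(values)
    if med == 0:
        # Avoid dividing by zero when most tasks did no work at all
        return float("inf") if max(values) > 0 else 1.0
    return max(values) / med


def looks_skewed(values, threshold=10.0):
    """Flag a stage as skewed when max is at least `threshold` times the median (assumed cutoff)."""
    return skew_ratio(values) >= threshold


# Task durations in seconds: four ordinary tasks and one 8-minute straggler
durations_sec = [2, 3, 3, 4, 480]
print(skew_ratio(durations_sec), looks_skewed(durations_sec))
```

The same function works on Shuffle Read sizes; a high ratio on both duration and bytes read points at a hot key rather than a slow node.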
Identifying spill
Large values in the Spill (Memory) / Spill (Disk) columns
→ execution memory is exhausted and spilling to disk → slow
Large spill is a sign that memory is tight. Use smaller partitions or free up memory (see the separate post "Conquering PySpark Executor OOM").
| Task signal | Meaning | Response |
|---|---|---|
| Max ≫ Median (Duration) | Skew | salt/broadcast |
| Large Spill | Memory pressure | More partitions, less caching |
| High GC Time | Heap pressure | Tune heap/objects |
| Task count fixed at 200 | Default spark.sql.shuffle.partitions | Adjust spark.sql.shuffle.partitions or enable AQE |
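When spill or a fixed 200-task stage suggests the partition count is wrong, one common rule of thumb is to size partitions by the stage's shuffle volume. The sketch below assumes a target of about 128 MiB per partition; that target is my assumption rather than something this post prescribes, so treat the result as a starting point to test.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * 2**20, min_partitions=1):
    """Estimate a shuffle partition count so each partition holds about `target_partition_bytes`."""
    if shuffle_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target must be positive")
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


# A stage that shuffles 100 GiB would get 800 partitions at the assumed 128 MiB target
print(suggest_shuffle_partitions(100 * 2**30))
```

You would then apply the number with `spark.conf.set("spark.sql.shuffle.partitions", "800")`. With AQE enabled, Spark can coalesce small partitions at runtime, so erring on the high side is usually the safer direction.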
4. SQL Tab — Execution Plans with Real Metrics
For DataFrame/SQL jobs, the SQL tab draws an execution graph per query, with actual row counts and timings attached to each node. It's the most powerful diagnostic view.
What to check:
- Join strategy: BroadcastHashJoin (replicates the small side) vs SortMergeJoin (shuffles both sides). A small table going through SortMergeJoin (SMJ) means the broadcast didn't kick in, which is a wasted shuffle.
- Number of Exchange (shuffle) nodes: more nodes means more redistribution cost.
- number of output rows: which node explodes the row count (join blow-up).
- Filters/pruning on Scan nodes: did pushdown and partition pruning happen?
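For the join-strategy check, it helps to know why a small table missed the broadcast. Spark broadcasts automatically only when its size estimate is at or below `spark.sql.autoBroadcastJoinThreshold`, which defaults to 10 MiB, and a value of -1 disables automatic broadcast. The helper below is a hypothetical sketch of that decision; the 200 MiB ceiling for "still reasonable to hint" is an assumption of mine, and the right value depends on executor and driver memory.

```python
DEFAULT_AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # default spark.sql.autoBroadcastJoinThreshold


def broadcast_advice(small_side_bytes,
                     threshold_bytes=DEFAULT_AUTO_BROADCAST_THRESHOLD,
                     hint_limit_bytes=200 * 2**20):
    """Classify the smaller join side: 'auto', 'hint', or 'sort-merge'.

    hint_limit_bytes is an assumed ceiling, not a Spark setting.
    """
    if 0 <= small_side_bytes <= threshold_bytes:
        return "auto"        # Spark should pick BroadcastHashJoin by itself
    if small_side_bytes <= hint_limit_bytes:
        return "hint"        # use broadcast() or raise the threshold
    return "sort-merge"      # too large to replicate to every executor


print(broadcast_advice(50 * 2**20))
```

When the answer is "hint", the PySpark call is `large_df.join(broadcast(small_df), on="key")` with `from pyspark.sql.functions import broadcast`; `large_df`, `small_df`, and `key` are placeholder names here. Keep in mind that Spark's size estimate comes from statistics and can be far from the real size.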
Many Exchange (shuffle) nodes, and a small table handled via SortMergeJoin
→ raise the broadcast threshold or use a broadcast() hint to eliminate the shuffle
5. Checking the Plan with EXPLAIN
You can inspect the plan straight from code, without the UI.
df.explain(mode="formatted") # easy-to-read format
# or
df.explain(True) # logical/optimized/physical plans, all of them
Key keywords:
| Keyword | Meaning |
|---|---|
| BroadcastHashJoin | Join replicating the small side (good, no shuffle) |
| SortMergeJoin | Join shuffling both sides (fine for two large tables) |
| Exchange | Where a shuffle occurs |
| *(n) (codegen) | Whole-Stage CodeGen stage |
| PartitionFilters / PushedFilters | Pruning/pushdown in action |
If PushedFilters is empty, the filter never reached the source — suspect function wrapping in your WHERE clause.
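The keyword scan can be automated for quick checks in a notebook or a test. The sketch below relies on `df.explain()` printing the plan to standard output, which is how PySpark behaves as far as I know, and counts the keywords from the table above with plain regular expressions. The helper names and the abbreviated sample plan are made up for illustration, and the counts are heuristics: they are not a parse of the plan.

```python
import io
import re
from contextlib import redirect_stdout


def capture_plan(df):
    """Return the text that df.explain() prints (simple mode, so each node appears once)."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def summarize_plan(plan_text):
    """Count the plan keywords worth a second look."""
    return {
        # \b keeps BroadcastExchange and ReusedExchange out of the shuffle count
        "exchanges": len(re.findall(r"\bExchange\b", plan_text)),
        "sort_merge_joins": len(re.findall(r"\bSortMergeJoin\b", plan_text)),
        "broadcast_hash_joins": len(re.findall(r"\bBroadcastHashJoin\b", plan_text)),
        "empty_pushed_filters": len(re.findall(r"PushedFilters: \[\]", plan_text)),
    }


# Hand-written, abbreviated plan text for illustration only
SAMPLE_PLAN = """== Physical Plan ==
*(5) SortMergeJoin [id#0L], [id#7L], Inner
:- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- *(1) FileScan parquet [id#0L] PushedFilters: [], ReadSchema: struct<id:bigint>
+- *(4) Sort [id#7L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#7L, 200)
      +- *(3) FileScan parquet [id#7L] PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
"""

print(summarize_plan(SAMPLE_PLAN))
```

With a real DataFrame you would call `summarize_plan(capture_plan(df))`. A nonzero `empty_pushed_filters` on a scan that you expected to be filtered is the cue to look for functions wrapped around the filtered column.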
6. Checking AQE — Runtime Adaptation
With AQE enabled (spark.sql.adaptive.enabled=true), the plan in the SQL tab changes at runtime (AdaptiveSparkPlan). This is where you can confirm whether skew join splitting and partition coalescing were applied.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Look for AdaptiveSparkPlan and coalesced/skew markers in the UI's SQL tab
7. Executors Tab — Resources and Failures
| Check | Signal |
|---|---|
| Failed Tasks | Repeated retries/OOM |
| GC Time ratio | High means heap pressure |
| Storage Memory | Cache occupying memory |
| Active/Dead | Are executors dying |
If GC Time is a significant fraction of task time, the heap is under pressure: either it is too small for the working set or the job creates too many objects.
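The GC check reduces to one ratio per executor. The sketch below assumes executor records shaped like the output of the monitoring REST API's `/api/v1/applications/<app-id>/executors` endpoint, with `totalGCTime` and `totalDuration` in milliseconds; verify those field names on your version. The 10% cutoff is an assumed threshold, and the sample records are invented.

```python
def gc_report(executors, threshold=0.10):
    """Return executors sorted by GC ratio, flagging those above `threshold` (assumed cutoff)."""
    report = []
    for ex in executors:
        duration = ex.get("totalDuration", 0)
        gc_time = ex.get("totalGCTime", 0)
        ratio = gc_time / duration if duration else 0.0
        report.append({
            "id": ex["id"],
            "gcRatio": round(ratio, 3),
            "failedTasks": ex.get("failedTasks", 0),
            "flag": ratio > threshold,
        })
    return sorted(report, key=lambda r: r["gcRatio"], reverse=True)


# Hypothetical executor records (values invented for illustration)
sample_executors = [
    {"id": "1", "totalDuration": 600_000, "totalGCTime": 30_000, "failedTasks": 0},
    {"id": "2", "totalDuration": 600_000, "totalGCTime": 150_000, "failedTasks": 4},
]

for row in gc_report(sample_executors):
    print(row)
```

An executor that is flagged and also shows failed tasks is a strong candidate for the memory fixes, while a high ratio across every executor points at the job's object churn.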
8. Diagnostic Procedure (Recommended Order)
1. Jobs/Stages → identify the longest-running stage
│
2. Tasks Summary → compare Max vs Median
├─ Max ≫ Median → skew (salt/broadcast)
├─ Large Spill → memory (more partitions)
└─ High GC → tune the heap
│
3. SQL tab/EXPLAIN → check join strategy, shuffles, pruning
├─ Small table on SMJ → broadcast
├─ Too many Exchanges → reduce shuffle (bucketing/pre-aggregation)
└─ Empty PushedFilters → remove function wrapping in WHERE
│
4. Executors → check failures, GC, resources
9. Symptom → Screen → Prescription
| Symptom | Where to look | Prescription |
|---|---|---|
| One task never finishes | Tasks: Max Duration | Fix skew |
| Job is slow overall | Stages: Shuffle size | Reduce shuffle (bucketing, pre-aggregation) |
| Intermittent OOM | Tasks: Spill / Executors: GC | More partitions, check memory |
| Joins are slow | SQL: join nodes | Force broadcast |
| Scans are slow | SQL: Scan PushedFilters | Restore pruning/pushdown |
| Too many/too few tasks | Stages: task count | spark.sql.shuffle.partitions/AQE |
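The diagnostic order and the table above can be written down as a small checklist function, which is handy for runbooks. This is only a sketch: every threshold is an assumption of mine, the signal names are hypothetical, and the inputs are numbers you read off the Spark UI yourself.

```python
def diagnose(skew_ratio=1.0, spill_bytes=0, gc_ratio=0.0,
             small_table_on_smj=False, exchange_count=0, empty_pushed_filters=False,
             skew_threshold=10.0, spill_threshold_bytes=2**30,
             gc_threshold=0.10, max_exchanges=4):
    """Map observed signals to (signal, prescription) pairs, in the post's diagnostic order."""
    findings = []
    # Step 2: Tasks summary
    if skew_ratio >= skew_threshold:
        findings.append(("skew", "salt the hot key or broadcast the small side"))
    if spill_bytes >= spill_threshold_bytes:
        findings.append(("spill", "increase the partition count or free up memory"))
    if gc_ratio > gc_threshold:
        findings.append(("gc", "tune the heap or reduce object creation"))
    # Step 3: SQL tab / EXPLAIN
    if small_table_on_smj:
        findings.append(("small-table-smj", "broadcast the small table"))
    if exchange_count > max_exchanges:
        findings.append(("many-exchanges", "reduce shuffle with bucketing or pre-aggregation"))
    if empty_pushed_filters:
        findings.append(("no-pushdown", "remove function wrapping in WHERE"))
    # Step 4: nothing stood out, so fall back to the Executors tab
    return findings or [("none", "check the Executors tab for failures and dead executors")]


for signal, action in diagnose(skew_ratio=160, small_table_on_smj=True):
    print(f"{signal}: {action}")
```

The value of writing it down is the ordering: task-level signals are checked before plan-level ones, which matches how the tabs narrow the search.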
10. Summary
| Tab | Key metric | Problem it catches |
|---|---|---|
| Stages | Shuffle Read/Write | Shuffle bottleneck |
| Tasks | Max vs Median, Spill | Skew, memory |
| SQL | Join strategy, Exchange | Bad joins, shuffle |
| EXPLAIN | PushedFilters | Broken pruning |
| Executors | GC, Failed | Resources, stability |
The key to debugging slow Spark jobs is "don't guess — read the Spark UI." Once the procedure of finding the bottleneck stage in Stages, skew and spill in Tasks, and bad joins in the SQL tab becomes second nature, the root cause of most performance problems surfaces within five minutes. This diagnostic skill is what tells you where to apply the skew, memory, and join tuning covered in the earlier posts.
This article is based on Spark 3.5. If you need help diagnosing and tuning slow Spark jobs or building an operational monitoring practice, feel free to reach out.
— The Data Dynamics Engineering Team