Tags: pyspark, spark, spark-ui, debugging, performance, data-engineering

Debugging Slow PySpark Jobs — How to Read the Spark UI and DAG

A guide to ending "it's slow but I don't know where." Learn to read the Jobs, Stages, Tasks, and SQL tabs of the Spark UI to identify skew, spill, bad joins, and shuffle explosions, and to diagnose execution plans with EXPLAIN.

Data Dynamics | June 5, 2026 | 7 min read

"The job is slow, but I don't know where it's slow." It's the phrase data engineers say most often. Guess-driven tuning — blindly bumping executor.memory or tweaking partition counts — only wastes time. If you know how to read the Spark UI, you can pinpoint the cause of a slow job in five minutes.

This post covers how to read each tab of the Spark UI, which numbers point to which problems, and the order in which to diagnose a slow job.

1. Where Diagnosis Starts — Top Down

The Spark UI is hierarchical. You narrow down from the top.

Jobs (the overall job)
  └─ Stages (steps split at shuffle boundaries)
        └─ Tasks (parallel units within a stage)   ← most of the truth lives here
SQL / DataFrame tab (logical/physical plans + runtime metrics)

Tab         Question it answers
Jobs        Which job/action is taking long
Stages      Which stage is the bottleneck
Tasks       Is there skew or spill (the key one)
SQL         Are joins, shuffles, and scans appropriate
Executors   Are resources, GC, and failures healthy

2. Stages Tab — Finding the Bottleneck Stage

Click into a long-running job to see its stages, with each stage's duration and shuffle volume.

What to check:

  • A stage with an overwhelmingly long Duration → the bottleneck.
  • A stage with large Shuffle Read / Write → shuffle is the core of the cost.
  • Input / Output → where data enters and leaves.

If you see a stage with heavy shuffle (Shuffle Read/Write), it means a join or groupBy is redistributing data in bulk there. Shuffle hits the network and disk, so it often accounts for most of a job's runtime.
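The same numbers can be pulled out of the Spark monitoring REST API instead of being read off the screen. The sketch below is a minimal illustration: the base URL and application ID are placeholders you would supply, the sample records are made up, and `executorRunTime` (task time summed across the stage, in milliseconds) is used as a stand-in for the Duration column.

```python
import json
from urllib.request import urlopen


def fetch_stages(base_url, app_id):
    """Fetch stage summaries, e.g. base_url="http://localhost:4040" (placeholder)."""
    url = f"{base_url}/api/v1/applications/{app_id}/stages"
    with urlopen(url) as resp:
        return json.load(resp)


def slowest_stages(stages, top_n=3):
    """Rank stages by summed executor run time, longest first."""
    ranked = sorted(stages, key=lambda s: s.get("executorRunTime", 0), reverse=True)
    return [
        {
            "stageId": s["stageId"],
            "runTimeSec": s.get("executorRunTime", 0) / 1000,
            "shuffleReadMB": s.get("shuffleReadBytes", 0) / 1024**2,
            "shuffleWriteMB": s.get("shuffleWriteBytes", 0) / 1024**2,
        }
        for s in ranked[:top_n]
    ]


# Made-up sample records shaped like the REST API's stage entries.
sample_stages = [
    {"stageId": 3, "executorRunTime": 42_000,
     "shuffleReadBytes": 0, "shuffleWriteBytes": 512 * 1024**2},
    {"stageId": 7, "executorRunTime": 960_000,
     "shuffleReadBytes": 2 * 1024**3, "shuffleWriteBytes": 0},
]
top_stage = slowest_stages(sample_stages, top_n=1)[0]
print(top_stage)
```

Against a live application you would pass the result of `fetch_stages(...)` to `slowest_stages(...)` in place of the sample list.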

3. Tasks Tab — The Most Important Truth

Click into a stage to open its detail page and look at the Summary Metrics (quantile statistics across that stage's tasks). Most problems reveal themselves here.

Identifying skew

Metric          Min    25th   Median  75th   Max
Duration        2s     3s     3s      4s     8min   ← Max is 160x the Median!
Shuffle Read    50MB   51MB   52MB    53MB   9GB    ← a single task is enormous

If Max is tens to hundreds of times the Median, that's data skew. A handful of tasks is dragging the whole stage. (For fixes, see the separate post "Mastering Data Skew in PySpark".)
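The Max-to-Median comparison is simple arithmetic, and it can help to make it explicit. The following sketch reuses the Duration row from the example above (8 min = 480 s); the cutoff of 10 is an assumed reading of "tens to hundreds of times", not a Spark setting.

```python
from statistics import median


def skew_ratio(values):
    """Return max / median for a list of per-task measurements."""
    mid = median(values)
    if mid <= 0:
        raise ValueError("median must be positive to compute a ratio")
    return max(values) / mid


# Min, 25th, Median, 75th, Max from the Duration row, in seconds.
durations_sec = [2, 3, 3, 4, 480]
ratio = skew_ratio(durations_sec)

SKEW_CUTOFF = 10  # assumption: treat 10x or more as skew
print(f"max/median = {ratio:.0f}x, skewed = {ratio >= SKEW_CUTOFF}")
```

To confirm which keys cause the imbalance on the data side, a count per join or grouping key (for example `df.groupBy("key").count()` sorted by the count, with `key` standing in for your own column) usually shows the hot values directly.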

Identifying spill

Large values in the Spill (Memory) / Spill (Disk) columns
→ execution memory is exhausted and spilling to disk → slow

Large spill is a sign that execution memory is tight. Use more, smaller partitions or free up memory, for example by caching less (see the separate post "Conquering PySpark Executor OOM").
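One way to turn "smaller partitions" into a number is to divide the stage's total shuffle size by a target size per partition. The sketch below assumes a 128 MiB target, which is a common rule of thumb rather than anything this post or Spark prescribes; adjust it to your executors' memory.

```python
import math


def suggested_shuffle_partitions(total_shuffle_bytes,
                                 target_partition_bytes=128 * 1024**2):
    """Partition count that keeps each shuffle partition near the target size."""
    return max(1, math.ceil(total_shuffle_bytes / target_partition_bytes))


# Hypothetical stage that shuffles 100 GiB in total.
total_bytes = 100 * 1024**3
partitions = suggested_shuffle_partitions(total_bytes)
print(partitions)

# With a live session you would then apply it, for example:
# spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```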

Task signal               Meaning                      Response
Max ≫ Median (Duration)   Skew                         salt/broadcast
Large Spill               Memory pressure              More partitions, less caching
High GC Time              Heap pressure                Tune heap/objects
Task count fixed at 200   Default shuffle partitions   Adjust shuffle.partitions/AQE

4. SQL Tab — Execution Plans with Real Metrics

For DataFrame/SQL jobs, the SQL tab draws an execution graph per query, with actual row counts and timings attached to each node. It's the most powerful diagnostic view.

What to check:

  • Join strategy: BroadcastHashJoin (replicates the small side) vs SortMergeJoin (shuffles both sides). A small table going through SMJ means the broadcast didn't kick in — wasted shuffle.
  • Number of Exchange (shuffle) nodes: more nodes means more redistribution cost.
  • number of output rows: which node explodes the row count (join blow-up).
  • Filters/pruning on Scan nodes: did pushdown and partition pruning happen?

Example reading: many Exchange (shuffle) nodes, and a small table handled via SortMergeJoin
→ raise the broadcast threshold (spark.sql.autoBroadcastJoinThreshold) or use a broadcast() hint to eliminate the shuffle
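Both remedies for a small table stuck on SortMergeJoin can be sketched in a few lines. The helper names here are hypothetical, and `large_df`, `small_df`, and `key` stand for your own DataFrames and join column. The PySpark import is deferred so the threshold helper can be used without a cluster.

```python
BROADCAST_THRESHOLD_KEY = "spark.sql.autoBroadcastJoinThreshold"


def broadcast_threshold_setting(megabytes):
    """Return the (key, value) pair that raises the auto-broadcast threshold."""
    return BROADCAST_THRESHOLD_KEY, str(megabytes * 1024 * 1024)


def broadcast_small_side(large_df, small_df, key):
    """Join with an explicit broadcast hint on the small side."""
    from pyspark.sql.functions import broadcast  # needs PySpark installed
    return large_df.join(broadcast(small_df), on=key, how="inner")


setting = broadcast_threshold_setting(50)
print(setting)

# With a live session:
# spark.conf.set(*broadcast_threshold_setting(50))
# joined = broadcast_small_side(orders_df, countries_df, "country_code")
```

The hint is usually the safer choice when only one join is affected, since raising the threshold changes the behavior of every join in the session. Either way, the broadcast side has to fit comfortably in driver and executor memory.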

5. Checking the Plan with EXPLAIN

You can inspect the plan straight from code, without the UI.

# df is any existing DataFrame
df.explain(mode="formatted")    # easy-to-read format (Spark 3.0+)
# or
df.explain(True)                # parsed, analyzed, and optimized logical plans plus the physical plan

Key keywords:

Keyword                            Meaning
BroadcastHashJoin                  Join replicating the small side (good, no shuffle)
SortMergeJoin                      Join shuffling both sides (fine for two large tables)
Exchange                           Where a shuffle occurs
*(n) (codegen)                     Whole-Stage CodeGen stage
PartitionFilters / PushedFilters   Pruning/pushdown in action
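For plans too long to scan by eye, the keywords in the table above can be counted mechanically. This is a rough sketch, not a plan parser: it relies on `explain()` printing to standard output, uses the default (simple) mode so each node appears once, and the sample plan is hand-written for illustration rather than copied from a real job.

```python
import io
import re
from contextlib import redirect_stdout


def plan_text(df):
    """Capture the text that df.explain() prints."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def summarize_plan(text):
    """Count the plan keywords that matter for a first diagnosis."""
    return {
        # \b keeps BroadcastExchange from being counted as a shuffle Exchange
        "exchanges": len(re.findall(r"\bExchange\b", text)),
        "broadcast_hash_joins": len(re.findall(r"\bBroadcastHashJoin\b", text)),
        "sort_merge_joins": len(re.findall(r"\bSortMergeJoin\b", text)),
        "empty_pushed_filters": len(re.findall(r"PushedFilters: \[\]", text)),
    }


# Hand-written, simplified plan text for illustration only.
SAMPLE_PLAN = """\
== Physical Plan ==
*(5) SortMergeJoin [id#0L], [id#7L], Inner
:- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#0L, 200), ENSURE_REQUIREMENTS, [plan_id=20]
:     +- *(1) Filter isnotnull(id#0L)
:        +- FileScan parquet [id#0L] PushedFilters: [IsNotNull(id)]
+- *(4) Sort [id#7L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#7L, 200), ENSURE_REQUIREMENTS, [plan_id=28]
      +- *(3) Filter (year(ts#8) = 2024)
         +- FileScan parquet [id#7L,ts#8] PushedFilters: []
"""
summary = summarize_plan(SAMPLE_PLAN)
print(summary)
```

With a real DataFrame, `summarize_plan(plan_text(df))` gives the same counts for its plan.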

If PushedFilters is empty even though the query has a WHERE clause, the filter likely never reached the source; suspect a function wrapped around the filtered column (for example, year(ts) = 2024).
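A typical instance of function wrapping is filtering on the year of a date column. Rewriting the condition as a plain range on the raw column usually lets the filter be pushed down again. The sketch below shows one way to do that; the helper names and the `event_date` column are hypothetical, and whether pushdown actually happens still depends on the data source, so check PushedFilters after the change.

```python
import datetime


def year_bounds(year):
    """Half-open [start, end) date range covering one calendar year."""
    return datetime.date(year, 1, 1), datetime.date(year + 1, 1, 1)


def filter_year(df, date_col, year):
    """Filter on the raw column instead of wrapping it in year(...)."""
    start, end = year_bounds(year)
    return df.filter((df[date_col] >= start) & (df[date_col] < end))


bounds_2024 = year_bounds(2024)
print(bounds_2024)

# Wrapped form that tends to leave PushedFilters empty:
#   df.filter(F.year(df["event_date"]) == 2024)
# Range form on the raw column:
#   filter_year(df, "event_date", 2024)
```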

6. Checking AQE — Runtime Adaptation

With AQE enabled (spark.sql.adaptive.enabled=true, the default since Spark 3.2), the plan shown in the SQL tab is rewritten at runtime and appears under an AdaptiveSparkPlan node. This is where you can confirm whether skew join splitting and partition coalescing were applied.

spark.conf.set("spark.sql.adaptive.enabled", "true")
# Look for AdaptiveSparkPlan and coalesced/skew markers in the UI's SQL tab
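The same check can be done on plan text captured after the query has run. The sketch below lists the AQE switches that control coalescing and skew-join splitting, plus a helper that looks for marker strings. The marker strings reflect how Spark 3.x plans are commonly printed and should be treated as an assumption to verify against your own output; the sample plan is hand-written.

```python
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_settings(spark, settings):
    """Apply each setting to a live SparkSession."""
    for key, value in settings.items():
        spark.conf.set(key, value)


def aqe_markers(plan_text):
    """Report which AQE markers appear in a plan's text."""
    return {
        "adaptive": "AdaptiveSparkPlan" in plan_text,
        "final_plan": "isFinalPlan=true" in plan_text,
        "skew_join_split": "skew=true" in plan_text,
        "aqe_shuffle_read": "AQEShuffleRead" in plan_text,
    }


# Hand-written, simplified plan text for illustration only.
SAMPLE_FINAL_PLAN = """\
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(5) SortMergeJoin(skew=true) [k#1], [k#9], Inner
   :- AQEShuffleRead skewed
   +- AQEShuffleRead coalesced
"""
markers = aqe_markers(SAMPLE_FINAL_PLAN)
print(markers)
```

Note that the final plan only exists after an action has executed; before that, the plan is still marked as not final.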

7. Executors Tab — Resources and Failures

Check            Signal
Failed Tasks     Repeated retries/OOM
GC Time ratio    High means heap pressure
Storage Memory   Cache occupying memory
Active/Dead      Are executors dying

If GC Time is a significant fraction of task time, the heap is too small for the working set or the job is creating too many objects.
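The GC ratio can also be computed per executor from the monitoring REST API's executor list (`/api/v1/applications/<app-id>/executors`). In this sketch the records are made up, and the 10% cutoff is an assumed threshold for "significant", not a value taken from this post.

```python
def gc_pressure(executors, threshold=0.10):
    """Return (executor id, GC ratio) for executors above the threshold."""
    flagged = []
    for ex in executors:
        duration = ex.get("totalDuration", 0)  # summed task time, ms
        if duration <= 0:
            continue  # executor has not run any tasks yet
        ratio = ex.get("totalGCTime", 0) / duration
        if ratio > threshold:
            flagged.append((ex["id"], round(ratio, 2)))
    return flagged


# Made-up sample records shaped like the REST API's executor entries.
sample_executors = [
    {"id": "1", "totalDuration": 600_000, "totalGCTime": 30_000, "failedTasks": 0},
    {"id": "2", "totalDuration": 600_000, "totalGCTime": 150_000, "failedTasks": 4},
    {"id": "3", "totalDuration": 0, "totalGCTime": 0, "failedTasks": 0},
]
flagged = gc_pressure(sample_executors)
print(flagged)
```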

8. Diagnostic Order

1. Jobs/Stages → identify the longest-running stage

2. Tasks Summary → compare Max vs Median
        ├─ Max ≫ Median → skew (salt/broadcast)
        ├─ Large Spill  → memory (more partitions)
        └─ High GC      → tune the heap

3. SQL tab/EXPLAIN → check join strategy, shuffles, pruning
        ├─ Small table on SMJ → broadcast
        ├─ Too many Exchanges → reduce shuffle (bucketing/pre-aggregation)
        └─ Empty PushedFilters → remove function wrapping in WHERE

4. Executors → check failures, GC, resources
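Step 2 of this flow can be written down as a small decision function, which is handy when the same checks are repeated across many jobs. All three thresholds below are assumptions chosen for illustration; the post itself only says "Max ≫ Median", "large" spill, and "high" GC.

```python
def triage(max_sec, median_sec, spill_bytes, gc_ratio,
           skew_factor=10, spill_limit=1024**3, gc_limit=0.10):
    """Map task-level numbers from one stage to the suspicions in the flow."""
    findings = []
    if median_sec > 0 and max_sec / median_sec >= skew_factor:
        findings.append("skew: salt or broadcast")
    if spill_bytes >= spill_limit:
        findings.append("spill: more partitions")
    if gc_ratio >= gc_limit:
        findings.append("gc: tune the heap")
    # Nothing stands out at the task level: move on to step 3.
    return findings or ["tasks look healthy: check SQL tab/EXPLAIN"]


skewed_stage = triage(max_sec=480, median_sec=3, spill_bytes=0, gc_ratio=0.02)
healthy_stage = triage(max_sec=4, median_sec=3, spill_bytes=0, gc_ratio=0.02)
print(skewed_stage, healthy_stage)
```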

9. Symptom → Screen → Prescription

Symptom                   Where to look                  Prescription
One task never finishes   Tasks: Max Duration            Fix skew
Job is slow overall       Stages: Shuffle size           Reduce shuffle (bucketing, pre-aggregation)
Intermittent OOM          Tasks: Spill / Executors: GC   More partitions, check memory
Joins are slow            SQL: join nodes                Force broadcast
Scans are slow            SQL: Scan PushedFilters        Restore pruning/pushdown
Too many/too few tasks    Stages: task count             shuffle.partitions/AQE

10. Summary

Tab         Key metric               Problem it catches
Stages      Shuffle Read/Write       Shuffle bottleneck
Tasks       Max vs Median, Spill     Skew, memory
SQL         Join strategy, Exchange  Bad joins, shuffle
EXPLAIN     PushedFilters            Broken pruning
Executors   GC, Failed               Resources, stability

The key to debugging slow Spark jobs is "don't guess — read the Spark UI." Once the procedure of finding the bottleneck stage in Stages, skew and spill in Tasks, and bad joins in the SQL tab becomes second nature, the root cause of most performance problems surfaces within five minutes. This diagnostic skill is what tells you where to apply the skew, memory, and join tuning covered in the earlier posts.


This article is based on Spark 3.5. If you need help diagnosing and tuning slow Spark jobs or building an operational monitoring practice, feel free to reach out.

— The Data Dynamics Engineering Team