
Debugging Slow PySpark Jobs — How to Read the Spark UI and DAG

A guide to ending "it's slow but I don't know where." Learn to read the Jobs, Stages, Tasks, and SQL tabs of the Spark UI to identify skew, spill, bad joins, and shuffle explosions, and to diagnose execution plans with EXPLAIN.

Data Dynamics · June 5, 2026 · 7 min read

"The job is slow, but I don't know where it's slow." It's the phrase data engineers say most often. Guess-driven tuning — blindly bumping executor.memory or tweaking partition counts — only wastes time. If you know how to read the Spark UI, you can pinpoint the cause of a slow job in five minutes.

This post covers how to read each tab of the Spark UI, which numbers point to which problems, and the order in which to diagnose a slow job.

1. Where Diagnosis Starts — Top Down

The Spark UI is hierarchical. You narrow down from the top.

Tab	Question it answers
Jobs	Which job/action is taking long
Stages	Which stage is the bottleneck
Tasks	Is there skew or spill (the key one)
SQL	Are joins, shuffles, and scans appropriate
Executors	Are resources, GC, and failures healthy

2. Stages Tab — Finding the Bottleneck Stage

Click into a long-running job to see its stages, with each stage's duration and shuffle volume.

What to check:

A stage with an overwhelmingly long Duration → the bottleneck.
A stage with large Shuffle Read / Write → shuffle is the core of the cost.
Input / Output → where data enters and leaves.

If you see a stage with heavy shuffle (Shuffle Read/Write), it means a join or groupBy is redistributing data in bulk there. Shuffle hits the network and disk, so it often accounts for most of a job's runtime.
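
The same ranking can be done outside the browser. The sketch below sorts stage records by total shuffle volume. It assumes records shaped like the stage entries returned by Spark's monitoring REST API (`/api/v1/applications/<app-id>/stages`); the field names follow that API as I understand it, so check them against your Spark version. The sample values are made up for illustration.

```python
from typing import Dict, List


def rank_stages_by_shuffle(stages: List[Dict]) -> List[Dict]:
    """Return stages sorted by total shuffle bytes (read + write), largest first."""
    def shuffle_bytes(stage: Dict) -> int:
        return stage.get("shuffleReadBytes", 0) + stage.get("shuffleWriteBytes", 0)

    return sorted(stages, key=shuffle_bytes, reverse=True)


GIB = 1024 ** 3

# Illustrative records only; real ones would come from the REST API.
sample_stages = [
    {"stageId": 0, "executorRunTime": 42_000,
     "shuffleReadBytes": 0, "shuffleWriteBytes": 3 * GIB},
    {"stageId": 1, "executorRunTime": 910_000,
     "shuffleReadBytes": 12 * GIB, "shuffleWriteBytes": 1 * GIB},
    {"stageId": 2, "executorRunTime": 15_000,
     "shuffleReadBytes": 1 * GIB, "shuffleWriteBytes": 0},
]

for stage in rank_stages_by_shuffle(sample_stages):
    total_gib = (stage["shuffleReadBytes"] + stage["shuffleWriteBytes"]) / GIB
    run_s = stage["executorRunTime"] / 1000  # executorRunTime is in milliseconds
    print(f"stage {stage['stageId']}: {total_gib:.1f} GiB shuffled, {run_s:.0f}s executor run time")
```

The stage at the top of this list is usually the one worth opening first.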

3. Tasks Tab — The Most Important Truth

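Click into a stage to open its detail page; the task-level view lives there rather than in a separate top-level tab. Start with the Summary Metrics table, which shows the minimum, quartiles, and maximum of each metric across the stage's tasks. Most problems reveal themselves here.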

Identifying skew

Metric          Min    25th   Median  75th   Max
Duration        2s     3s     3s      4s     8min   ← Max is 160x the Median!
Shuffle Read    50MB   51MB   52MB    53MB   9GB    ← a single task is enormous

If Max is tens to hundreds of times the Median, that's data skew. A handful of tasks is dragging the whole stage. (For fixes, see the separate post "Mastering Data Skew in PySpark".)
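
The Max-versus-Median check is simple enough to script. This is a minimal sketch: the function names and the 10x threshold are my own choices (the low end of "tens to hundreds of times"), not a Spark setting.

```python
from statistics import median
from typing import Sequence


def skew_ratio(values: Sequence[float]) -> float:
    """Max divided by median; a value near 1 means evenly sized tasks."""
    mid = median(values)
    if mid == 0:
        return float("inf") if max(values) > 0 else 1.0
    return max(values) / mid


def is_skewed(values: Sequence[float], threshold: float = 10.0) -> bool:
    """Flag skew when the max is at least `threshold` times the median.

    The default threshold is an assumption, not a Spark constant.
    """
    return skew_ratio(values) >= threshold


# Task durations in seconds, taken from the Duration row above (8 min = 480 s).
durations_s = [2, 3, 3, 4, 480]
print(skew_ratio(durations_s))
print(is_skewed(durations_s))
```

The same function works on per-task Shuffle Read sizes, which is often the clearer signal because it is not affected by slow nodes.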

Identifying spill

Large values in the Spill (Memory) / Spill (Disk) columns
→ execution memory is exhausted and spilling to disk → slow

Large spill is a sign that memory is tight. Increase the partition count so each task handles less data, or free up execution memory, for example by caching less (see the separate post "Conquering PySpark Executor OOM").

Task signal	Meaning	Response
Max ≫ Median (Duration)	Skew	salt/broadcast
Large Spill	Memory pressure	More partitions, less caching
High GC Time	Heap pressure	Tune heap/objects
Task count fixed at 200	Default `spark.sql.shuffle.partitions`	Adjust `spark.sql.shuffle.partitions` or let AQE coalesce
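
When spill or a fixed task count of 200 points at partition sizing, a rough starting value can be derived from the stage's shuffle size. The sketch below assumes a target of about 128 MiB per shuffle partition, which is a common rule of thumb rather than anything Spark prescribes; the helper name and the 100 GiB input are hypothetical.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_partition_bytes: int = 128 * 1024 ** 2,
                               min_partitions: int = 1) -> int:
    """Partition count that keeps each shuffle partition near the target size."""
    if shuffle_bytes <= 0:
        return min_partitions
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


# Hypothetical: the spilling stage reads 100 GiB of shuffle data.
stage_shuffle_bytes = 100 * 1024 ** 3
partitions = suggest_shuffle_partitions(stage_shuffle_bytes)
print(partitions)

# Applying it on a live session (not run here):
# spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```

At the default of 200 partitions, the same 100 GiB would put roughly 512 MiB into each task, which helps explain why spill shows up. Treat the result as a first guess and confirm in the Summary Metrics that spill actually drops.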

4. SQL Tab — Execution Plans with Real Metrics

For DataFrame/SQL jobs, the SQL tab draws an execution graph per query, with actual row counts and timings attached to each node. It's the most powerful diagnostic view.

What to check:

Join strategy: BroadcastHashJoin (replicates the small side) vs SortMergeJoin (shuffles both sides). A small table going through SortMergeJoin means the broadcast didn't kick in, so both sides are shuffled for nothing.
Number of Exchange (shuffle) nodes: more nodes means more redistribution cost.
number of output rows: which node explodes the row count (join blow-up).
Filters/pruning on Scan nodes: did pushdown and partition pruning happen?

Example reading: many Exchange (shuffle) nodes, and a small table handled via SortMergeJoin
→ raise the broadcast threshold (`spark.sql.autoBroadcastJoinThreshold`) or use a broadcast() hint to eliminate the shuffle for that join
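
Both fixes can be sketched in a few lines. The helper names and the `orders` / `countries` DataFrames below are hypothetical. The 10 MiB default matches Spark's documented default for `spark.sql.autoBroadcastJoinThreshold`, but Spark compares it against its own size estimate, which can differ from the size on disk.

```python
def fits_broadcast_threshold(table_bytes: int,
                             threshold_bytes: int = 10 * 1024 ** 2) -> bool:
    """True if a table of this estimated size is within the auto-broadcast threshold.

    A threshold of -1 disables automatic broadcast, so nothing fits.
    """
    return 0 <= table_bytes <= threshold_bytes


def join_with_broadcast(large_df, small_df, key):
    """Join while hinting Spark to broadcast `small_df` instead of shuffling both sides."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql.functions import broadcast

    return large_df.join(broadcast(small_df), on=key, how="inner")


print(fits_broadcast_threshold(5 * 1024 ** 2))
print(fits_broadcast_threshold(50 * 1024 ** 2))

# Usage on a live session (hypothetical DataFrames, not run here):
# joined = join_with_broadcast(orders, countries, "country_code")
# joined.explain(mode="formatted")  # look for BroadcastHashJoin instead of SortMergeJoin
#
# Or raise the automatic threshold, here to 64 MiB:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 ** 2))
```

The explicit hint is the safer of the two when only one join is affected, since raising the threshold changes the behavior of every join in the session.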

5. Checking the Plan with EXPLAIN

You can inspect the plan straight from code, without the UI.

df.explain(mode="formatted")    # easy-to-read format
# or
df.explain(True)                # logical/optimized/physical plans, all of them

Key keywords:

Keyword	Meaning
`BroadcastHashJoin`	Join replicating the small side (good, no shuffle)
`SortMergeJoin`	Join shuffling both sides (fine for two large tables)
`Exchange`	Where a shuffle occurs
`*(n)` (codegen)	Whole-Stage CodeGen stage
`PartitionFilters` / `PushedFilters`	Pruning/pushdown in action

If PushedFilters is empty, the filter never reached the source. Suspect a function or cast wrapped around the column in your WHERE clause, which typically prevents the predicate from being pushed down.
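
If you check plans often, the keyword hunt can be automated. The sketch below captures what `df.explain()` prints and scans it for the operators and `PushedFilters` entries listed above. The helper names are my own, and `SAMPLE_PLAN` is hand-written in the general shape of Spark's simple-mode output, not copied from a real run. The regex is deliberately simple and would misread filters that contain their own square brackets.

```python
import io
import re
from contextlib import redirect_stdout
from typing import Dict, List


def capture_plan(df, mode: str = "simple") -> str:
    """Return the text that df.explain() prints to stdout."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain(mode=mode)
    return buffer.getvalue()


def summarize_plan(text: str) -> Dict[str, int]:
    """Count occurrences of the key operators in a plan string.

    Use simple mode: formatted mode lists each operator twice
    (once in the tree, once in the details), which doubles the counts.
    """
    operators = ("BroadcastHashJoin", "SortMergeJoin", "Exchange")
    return {op: len(re.findall(rf"\b{op}\b", text)) for op in operators}


def pushed_filters(text: str) -> List[str]:
    """Return the contents of every 'PushedFilters: [...]' entry in the plan."""
    return [match.strip() for match in re.findall(r"PushedFilters: \[(.*?)\]", text)]


# Illustrative plan text only.
SAMPLE_PLAN = """
== Physical Plan ==
*(5) SortMergeJoin [id#0L], [id#4L], Inner
:- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#0L, 200), ENSURE_REQUIREMENTS, [plan_id=30]
:     +- *(1) Filter (year(ts#1) = 2024)
:        +- FileScan parquet [id#0L,ts#1] PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint,ts:timestamp>
+- *(4) Sort [id#4L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#4L, 200), ENSURE_REQUIREMENTS, [plan_id=34]
      +- *(3) Filter isnotnull(id#4L)
         +- FileScan parquet [id#4L] PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
"""

print(summarize_plan(SAMPLE_PLAN))
print(pushed_filters(SAMPLE_PLAN))
```

In the sample, the first scan has an empty `PushedFilters` list because its predicate wraps `ts` in `year(...)`. Rewriting such a predicate as a range comparison on the raw column is the usual way to let the source see the filter. On a real DataFrame, `summarize_plan(capture_plan(df))` gives the same summary.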

6. Checking AQE — Runtime Adaptation

With AQE enabled (spark.sql.adaptive.enabled=true, the default since Spark 3.2), the plan in the SQL tab is wrapped in an AdaptiveSparkPlan node and can change at runtime. This is where you can confirm whether skew join splitting and partition coalescing were applied.

spark.conf.set("spark.sql.adaptive.enabled", "true")
# Look for AdaptiveSparkPlan and coalesced/skew markers in the UI's SQL tab
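
The same markers can be checked in plan text. The sketch below looks for the strings that, as far as I know, Spark 3.2 and later print for adaptive plans (`AdaptiveSparkPlan`, `isFinalPlan=true`, `AQEShuffleRead`, `skew=true`); treat these marker names as assumptions and compare them with what your own UI shows. The sample plan is hand-written for illustration.

```python
from typing import Dict


def aqe_markers(plan_text: str) -> Dict[str, bool]:
    """Report which AQE-related markers appear in a plan string."""
    return {
        "adaptive_plan": "AdaptiveSparkPlan" in plan_text,
        "final_plan": "isFinalPlan=true" in plan_text,
        "aqe_shuffle_read": "AQEShuffleRead" in plan_text,
        "skew_join_split": "skew=true" in plan_text,
    }


# Illustrative plan text only.
SAMPLE_AQE_PLAN = """
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(3) SortMergeJoin(skew=true) [k#0], [k#5], Inner
   :- *(1) Sort [k#0 ASC NULLS FIRST], false, 0
   :  +- AQEShuffleRead skewed
   +- *(2) Sort [k#5 ASC NULLS FIRST], false, 0
      +- AQEShuffleRead coalesced
"""

for name, present in aqe_markers(SAMPLE_AQE_PLAN).items():
    print(f"{name}: {present}")
```

Keep in mind that the final plan only exists after the query has executed. A plan printed before any action runs will show `isFinalPlan=false` and none of the runtime rewrites.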

7. Executors Tab — Resources and Failures

Check	Signal
Failed Tasks	Repeated retries/OOM
GC Time ratio	High means heap pressure
Storage Memory	Cache occupying memory
Active/Dead	Are executors dying

If GC Time is a significant fraction of task time, the heap is under pressure: it is too small for the working set, or the job is creating too many objects.
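
A quick way to apply this check across many executors is to compute the GC ratio per executor. The sketch assumes records shaped like the entries from the monitoring REST API's executor list (`/api/v1/applications/<app-id>/executors`), with `totalGCTime` and `totalDuration` in milliseconds; verify the field names on your version. The 10% threshold is a rule of thumb, and the sample values are invented.

```python
from typing import Dict, List


def gc_ratio(executor: Dict) -> float:
    """Fraction of total task time this executor spent in garbage collection."""
    total = executor.get("totalDuration", 0)
    return executor.get("totalGCTime", 0) / total if total else 0.0


def flag_executors(executors: List[Dict], gc_threshold: float = 0.10) -> List[Dict]:
    """Return executors with a high GC ratio or any failed tasks."""
    flagged = []
    for executor in executors:
        ratio = gc_ratio(executor)
        failed = executor.get("failedTasks", 0)
        if ratio > gc_threshold or failed > 0:
            flagged.append({"id": executor["id"],
                            "gc_ratio": round(ratio, 3),
                            "failedTasks": failed})
    return flagged


# Illustrative records only.
sample_executors = [
    {"id": "1", "totalDuration": 600_000, "totalGCTime": 12_000, "failedTasks": 0},
    {"id": "2", "totalDuration": 600_000, "totalGCTime": 150_000, "failedTasks": 3},
    {"id": "3", "totalDuration": 400_000, "totalGCTime": 20_000, "failedTasks": 1},
]

for entry in flag_executors(sample_executors):
    print(entry)
```

Here executor 2 is flagged for spending a quarter of its time in GC, and executor 3 for a failed task despite a healthy GC ratio.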

8. Diagnostic Procedure (Recommended Order)

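Work top down, in the order the tabs were covered above: find the slowest job in Jobs, the bottleneck stage in Stages, skew and spill in the task metrics, join strategy and shuffle count in the SQL tab, and resource health in Executors.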

9. Symptom → Screen → Prescription

Symptom	Where to look	Prescription
One task never finishes	Tasks: Max Duration	Fix skew
Job is slow overall	Stages: Shuffle size	Reduce shuffle (bucketing, pre-aggregation)
Intermittent OOM	Tasks: Spill / Executors: GC	More partitions, check memory
Joins are slow	SQL: join nodes	Broadcast the small side, if one side is small enough
Scans are slow	SQL: Scan PushedFilters	Restore pruning/pushdown
Too many/too few tasks	Stages: task count	Tune `spark.sql.shuffle.partitions` or rely on AQE

10. Summary

Tab	Key metric	Problem it catches
Stages	Shuffle Read/Write	Shuffle bottleneck
Tasks	Max vs Median, Spill	Skew, memory
SQL	Join strategy, Exchange	Bad joins, shuffle
EXPLAIN	PushedFilters	Broken pruning
Executors	GC, Failed	Resources, stability

The key to debugging slow Spark jobs is "don't guess — read the Spark UI." Once the procedure of finding the bottleneck stage in Stages, skew and spill in Tasks, and bad joins in the SQL tab becomes second nature, the root cause of most performance problems surfaces within five minutes. This diagnostic skill is what tells you where to apply the skew, memory, and join tuning covered in the earlier posts.

This article is based on Spark 3.5. If you need help diagnosing and tuning slow Spark jobs or building an operational monitoring practice, feel free to reach out.

— The Data Dynamics Engineering Team