Large-Scale Pivot and Wide Transforms in PySpark — Spreading Data into Thousands of Columns
Why high-cardinality pivots get slow or blow up, and how to fix them. We cover the two-pass cost of pivot, making it a single pass by specifying values explicitly, the pitfalls of wide tables with thousands of columns, map/array alternatives, and the unpivot(stack) pattern for going back.
"Spread sales per product into monthly columns", "turn event types into columns per user" — pivot requests like these are common in analytics, but at high cardinality they get slow or blow up in PySpark. When the pivot column has thousands of distinct values, you end up with a wide table of thousands of columns, and Spark handles that poorly.
This post explains why pivot is expensive, how to cut its cost in half by specifying values explicitly, the pitfalls of wide tables, and the map/array alternatives plus the unpivot pattern.
1. The Basic Pivot and Its Hidden Cost
from pyspark.sql import functions as F
# Pivot product x month
pivoted = (df
.groupBy("product")
.pivot("month") # month values become columns
.agg(F.sum("amount")))There is a hidden cost here. Spark does not know which columns to create. It must learn the distinct values of month before it can determine the columns, so the pivot internally performs two scans.
First scan: collect distinct values of the pivot column (distinct month)
Second scan: the actual aggregation
-> the data is read twice2. Fix 1: Specify values Explicitly — Single Pass
If you already know the values you are pivoting on, spell them out. The first distinct scan disappears, halving the cost.
months = ["2026-01", "2026-02", "2026-03", "2026-04",
"2026-05", "2026-06"]
pivoted = (df
.groupBy("product")
.pivot("month", months) # explicit values -> single pass
.agg(F.sum("amount")))| values omitted | values specified | |
|---|---|---|
| Scans | 2 (distinct + aggregation) | 1 |
| Column determination | inferred at runtime | fixed |
| Speed | slow | fast |
Rule of thumb: whenever the pivot values form a known finite set (months, categories, status codes), always specify them. If you don't know them, computing the distinct values yourself and then passing them in gives you more control than leaving it to pivot.
3. Fix 2: Avoid High-Cardinality Pivots
There is an upper bound on pivot cardinality. Spark refuses to pivot when there are too many pivot columns (10,000 by default) — that limit is a safety guard.
thousands to tens of thousands of pivot columns -> wide table
-> thousands of columns per row, mostly NULL (sparse) -> memory and performance disaster# You can raise the limit (usually a bad sign)
spark.conf.set("spark.sql.pivotMaxValues", "10000")High pivot cardinality is a signal that pivot is the wrong tool. Consider the alternatives below.
4. Fix 3: map / array — Structure Instead of Wide
Rather than spreading thousands of key-value pairs into columns, pack them into a single map column — an efficient representation for sparse data.
# Aggregate into a map instead of a (wide) pivot
as_map = (df
.groupBy("product")
.agg(F.map_from_entries(
F.collect_list(F.struct("month", "amount"))).alias("by_month")))
# Lookups are map accesses
as_map.select("product", F.col("by_month")["2026-06"].alias("jun"))| Representation | Best fit |
|---|---|
| Pivot (wide columns) | low cardinality (dozens), BI-friendly |
| map column | high cardinality, sparse, dynamic keys |
| long format (as-is) | aggregation and filtering are the main use |
Key insight: a requirement to "spread it into columns" is usually a presentation-stage need. Keep processing and storage in long format or a map, and pivot only a small slice right before the final display — that is the efficient approach.
5. The Pitfalls of Wide Tables
A table with hundreds to thousands of columns is a burden on Spark by itself.
| Pitfall | Why |
|---|---|
| Planner latency | planning slows down in proportion to column count |
| Codegen limits | Whole-Stage CodeGen becomes inefficient or falls back with many columns |
| Sparse/NULL waste | cost of storing and processing mostly-NULL columns |
| Serialization cost | thousands of fields per row |
Ask yourself whether you really need thousands of columns. For analytics or ML input, it is usually better to select only the features you need or bundle them into a vector.
6. Unpivot — Turning Wide Back into Long
Conversely, you often need to turn a wide table (where columns are values) back into long format (for normalization or easier aggregation). Use stack or unpivot in Spark 3.4+.
# Spark 3.4+ unpivot (melt)
long_df = wide_df.unpivot(
ids=["product"],
values=["jan", "feb", "mar"],
variableColumnName="month",
valueColumnName="amount")
# Or the stack expression (older versions)
long_df = wide_df.select(
"product",
F.expr("stack(3, 'jan', jan, 'feb', feb, 'mar', mar) as (month, amount)"))Long format is far better suited to aggregation, filtering, and joins. It is the practical tool behind the "store and process in long, display in wide" principle.
7. Multi-Aggregation Pivots
Pivoting several metrics at once multiplies the columns (values x metrics).
pivoted = (df
.groupBy("product")
.pivot("month", months)
.agg(
F.sum("amount").alias("sum"),
F.count("*").alias("cnt")))
# -> two columns (sum, cnt) per month -> column count doublesEvery additional metric accelerates the wide-table explosion, so be even more conservative about cardinality with multi-aggregation pivots.
8. Summary
| Fix | Key point |
|---|---|
Specify values | 2 scans -> 1 scan |
| Know the cardinality limit | thousands of columns is a bad sign |
| map/array | structure sparse, dynamic keys |
| Stay in long format | process and store in long, display in wide |
unpivot/stack | restore wide -> long |
The key to large-scale pivots is to first ask "do I really need to spread this into columns?" If the pivot values are a small, known set, specify values for a single-pass pivot; if cardinality is high, keep the data as a map or in long format and pivot only a narrow slice right before display. Remember that a wide table with thousands of columns is the shape Spark hates most, and pivot stops being the operation that kills your jobs.
This post was written against Spark 3.5. If you need help designing large-scale data reshape and aggregation pipelines, feel free to reach out.
— Data Dynamics Engineering Team