pysparksparkpivotwide-tablereshapedata-engineering

Large-Scale Pivot and Wide Transforms in PySpark — Spreading Data into Thousands of Columns

Why high-cardinality pivots get slow or blow up, and how to fix them. We cover the two-pass cost of pivot, making it a single pass by specifying values explicitly, the pitfalls of wide tables with thousands of columns, map/array alternatives, and the unpivot(stack) pattern for going back.

Data DynamicsJune 5, 20265 min read

"Spread sales per product into monthly columns", "turn event types into columns per user" — pivot requests like these are common in analytics, but at high cardinality they get slow or blow up in PySpark. When the pivot column has thousands of distinct values, you end up with a wide table of thousands of columns, and Spark handles that poorly.

This post explains why pivot is expensive, how to cut its cost in half by specifying values explicitly, the pitfalls of wide tables, and the map/array alternatives plus the unpivot pattern.

1. The Basic Pivot and Its Hidden Cost

from pyspark.sql import functions as F
 
# Pivot product x month
pivoted = (df
    .groupBy("product")
    .pivot("month")               # month values become columns
    .agg(F.sum("amount")))

There is a hidden cost here. Spark does not know which columns to create. It must learn the distinct values of month before it can determine the columns, so the pivot internally performs two scans.

Loading diagram…

2. Fix 1: Specify values Explicitly — Single Pass

If you already know the values you are pivoting on, spell them out. The first distinct scan disappears, halving the cost.

months = ["2026-01", "2026-02", "2026-03", "2026-04",
          "2026-05", "2026-06"]
 
pivoted = (df
    .groupBy("product")
    .pivot("month", months)       # explicit values -> single pass
    .agg(F.sum("amount")))

	values omitted	values specified
Scans	2 (distinct + aggregation)	1
Column determination	inferred at runtime	fixed
Speed	slow	fast

Rule of thumb: whenever the pivot values form a known finite set (months, categories, status codes), always specify them. If you don't know them, computing the distinct values yourself and then passing them in gives you more control than leaving it to pivot.

3. Fix 2: Avoid High-Cardinality Pivots

There is an upper bound on pivot cardinality. Spark refuses to pivot when there are too many pivot columns (10,000 by default) — that limit is a safety guard.

Loading diagram…

# You can raise the limit (usually a bad sign)
spark.conf.set("spark.sql.pivotMaxValues", "10000")

High pivot cardinality is a signal that pivot is the wrong tool. Consider the alternatives below.

4. Fix 3: map / array — Structure Instead of Wide

Rather than spreading thousands of key-value pairs into columns, pack them into a single map column — an efficient representation for sparse data.

# Aggregate into a map instead of a (wide) pivot
as_map = (df
    .groupBy("product")
    .agg(F.map_from_entries(
        F.collect_list(F.struct("month", "amount"))).alias("by_month")))
 
# Lookups are map accesses
as_map.select("product", F.col("by_month")["2026-06"].alias("jun"))

Representation	Best fit
Pivot (wide columns)	low cardinality (dozens), BI-friendly
map column	high cardinality, sparse, dynamic keys
long format (as-is)	aggregation and filtering are the main use

Key insight: a requirement to "spread it into columns" is usually a presentation-stage need. Keep processing and storage in long format or a map, and pivot only a small slice right before the final display — that is the efficient approach.

5. The Pitfalls of Wide Tables

A table with hundreds to thousands of columns is a burden on Spark by itself.

Pitfall	Why
Planner latency	planning slows down in proportion to column count
Codegen limits	Whole-Stage CodeGen becomes inefficient or falls back with many columns
Sparse/NULL waste	cost of storing and processing mostly-NULL columns
Serialization cost	thousands of fields per row

Ask yourself whether you really need thousands of columns. For analytics or ML input, it is usually better to select only the features you need or bundle them into a vector.

6. Unpivot — Turning Wide Back into Long

Conversely, you often need to turn a wide table (where columns are values) back into long format (for normalization or easier aggregation). Use stack or unpivot in Spark 3.4+.

# Spark 3.4+ unpivot (melt)
long_df = wide_df.unpivot(
    ids=["product"],
    values=["jan", "feb", "mar"],
    variableColumnName="month",
    valueColumnName="amount")
 
# Or the stack expression (older versions)
long_df = wide_df.select(
    "product",
    F.expr("stack(3, 'jan', jan, 'feb', feb, 'mar', mar) as (month, amount)"))

Long format is far better suited to aggregation, filtering, and joins. It is the practical tool behind the "store and process in long, display in wide" principle.

7. Multi-Aggregation Pivots

Pivoting several metrics at once multiplies the columns (values x metrics).

pivoted = (df
    .groupBy("product")
    .pivot("month", months)
    .agg(
        F.sum("amount").alias("sum"),
        F.count("*").alias("cnt")))
# -> two columns (sum, cnt) per month -> column count doubles

Every additional metric accelerates the wide-table explosion, so be even more conservative about cardinality with multi-aggregation pivots.

8. Summary

Fix	Key point
Specify `values`	2 scans -> 1 scan
Know the cardinality limit	thousands of columns is a bad sign
map/array	structure sparse, dynamic keys
Stay in long format	process and store in long, display in wide
`unpivot`/`stack`	restore wide -> long

The key to large-scale pivots is to first ask "do I really need to spread this into columns?" If the pivot values are a small, known set, specify values for a single-pass pivot; if cardinality is high, keep the data as a map or in long format and pivot only a narrow slice right before display. Remember that a wide table with thousands of columns is the shape Spark hates most, and pivot stops being the operation that kills your jobs.

This post was written against Spark 3.5. If you need help designing large-scale data reshape and aggregation pipelines, feel free to reach out.

— Data Dynamics Engineering Team