pysparksparkaggregationrollupcubedata-engineering

PySpark Multi-Dimensional Aggregation — Mastering rollup, cube, and grouping sets

How to produce subtotals, grand totals, and multi-dimensional cross aggregations in a single job. We cover the differences and costs of rollup (hierarchical subtotals), cube (all combinations), and grouping sets (only the combinations you need), how to distinguish aggregation levels with grouping_id, and patterns for avoiding high-cardinality cube explosions.

Data DynamicsJune 5, 20265 min read

"Show me revenue by country, by country × product, and the overall total — all at once." A staple request in OLAP reporting. Implemented naively, it turns into messy code that runs groupBy several times and stitches the results together with union. PySpark handles this multi-dimensional aggregation elegantly in a single job with rollup, cube, and grouping sets.

This post walks through the differences and costs of the three multi-dimensional aggregations, how to tell aggregation levels apart, and how to keep cube from exploding under high cardinality.

1. The problem — multiple aggregation levels at once

Desired result:
  revenue by (country, product)   ← most granular
  revenue by (country)            ← subtotal with product collapsed
  grand total                     ← overall sum

If you run groupBy("country","product"), groupBy("country"), and a global aggregation separately and union them — you scan the data 3 times and the code gets verbose. Multi-dimensional aggregation operators solve this with a single scan.

2. rollup — hierarchical subtotals

rollup builds subtotals hierarchically from the left. Column order is meaningful.

from pyspark.sql import functions as F
 
result = (df
    .rollup("country", "product")
    .agg(F.sum("amount").alias("total"))
    .orderBy("country", "product"))

The aggregation levels it produces:

rollup(country, product) produces:
  (country, product)   ← granular
  (country, NULL)      ← subtotal per country (product collapsed)
  (NULL,    NULL)      ← grand total

rollup fits "hierarchies" — relationships where the parent contains the child, like year→month→day, country→city, category→product. Column order is the hierarchy order.

3. cube — all combinations

cube produces every possible combination of the columns.

result = df.cube("country", "product").agg(F.sum("amount").alias("total"))

cube(country, product) produces:
  (country, product)   ← both
  (country, NULL)      ← country only
  (NULL,    product)   ← product only   ← a combination rollup doesn't have!
  (NULL,    NULL)      ← grand total

	rollup	cube
Combinations	Hierarchical (from the left)	All subsets
n columns	n+1 levels	2^n levels
Best for	Hierarchical subtotals	Multi-dimensional cross analysis

Caution: with cube, the number of combinations explodes as 2^n as columns grow. Four columns means 16 aggregations. Mix in a high-cardinality column and the result rows blow up, so make sure you really need every combination.

4. grouping sets — only the combinations you want

When you want specific combinations that are neither rollup nor cube, use grouping sets (clearest in SQL).

df.createOrReplaceTempView("sales")
 
result = spark.sql("""
    SELECT country, product, sum(amount) AS total
    FROM sales
    GROUP BY GROUPING SETS (
        (country, product),   -- granular
        (country),            -- per country
        ()                    -- grand total
        -- (product) deliberately excluded
    )
""")

grouping sets computes only the aggregation levels you pick, avoiding cube's cost for unnecessary combinations. It's ideal for custom reports like "just this combination and that one."

5. Distinguishing aggregation levels — grouping_id / grouping

In the result you must distinguish whether a NULL is "NULL because it's a subtotal" or "NULL in the original data." grouping_id() and grouping() tell you.

result = (df
    .rollup("country", "product")
    .agg(
        F.sum("amount").alias("total"),
        F.grouping_id().alias("gid"),                 # aggregation-level bitmask
        F.grouping("country").alias("is_country_agg"))  # 1 means this column was collapsed
)
 
# Interpret levels via gid
result = result.withColumn("level",
    F.when(F.col("gid") == 0, "country+product")
     .when(F.col("gid") == 1, "country subtotal")
     .when(F.col("gid") == 3, "grand total"))

Function	Meaning
`grouping(col)`	1 if that column was collapsed by aggregation, 0 otherwise
`grouping_id()`	Integer combining the grouping bits of all columns

Essential in practice: you absolutely need grouping_id to distinguish the NULLs in subtotal/grand-total rows from genuine NULLs in the data. Without it, you can't tell whether a "NULL country" is a grand total or missing data.

6. Preventing high-cardinality explosions

The result row count of a multi-dimensional aggregation is (the combination of each dimension's cardinality). Put a high-cardinality column into a cube and the result becomes enormous.

cube(country[200], product[100000], date[365])
→ worst case, billions of combination rows 💥

Countermeasures:

Strategy	How
Constrain with grouping sets	Specify only the levels you need
Exclude high cardinality	Drop fine-grained dimensions from the cube
Pre-filter / pre-aggregate	Reduce dimension cardinality first
Persist as a rollup table	Pre-compute and store (see the separate post "High-Cardinality Aggregation")

The safest approach is to specify only the combinations you actually need with grouping sets instead of cube.

7. Multi-dimensional aggregation vs pivot

Tool	Result shape
rollup/cube	long (levels expressed as rows)
pivot	wide (spread across columns)

Multi-dimensional aggregation produces results as rows (distinguished by level columns), while pivot spreads them into columns (see the separate post "Large-Scale Pivots in PySpark"). The long shape is better for storage and re-aggregation; wide is better for display.

8. Summary

Operation	What it produces	Best for
`rollup`	Hierarchical subtotals (n+1)	Hierarchies like year→month→day
`cube`	All combinations (2^n)	Multi-dimensional cross analysis
`grouping sets`	Only specified combinations	Custom reports, explosion avoidance
`grouping_id`	Aggregation-level distinction	Subtotal NULL vs data NULL

The essence of multi-dimensional aggregation is producing "subtotals, grand totals, and cross aggregations in a single scan." Use rollup for hierarchical subtotals, cube for all-around cross analysis, grouping sets for custom combinations — and interpret the NULLs in the result with grouping_id. Just watch out for one thing: cube's 2^n explosion. When high-cardinality dimensions are involved, pick only the levels you need with grouping sets and you can generate OLAP reports efficiently.

This article is based on Spark 3.5. If you need help designing OLAP reporting or multi-dimensional aggregation pipelines, feel free to reach out.

— Data Dynamics Engineering Team