PySpark Multi-Dimensional Aggregation — Mastering rollup, cube, and grouping sets
How to produce subtotals, grand totals, and multi-dimensional cross aggregations in a single job. We cover the differences and costs of rollup (hierarchical subtotals), cube (all combinations), and grouping sets (only the combinations you need), how to distinguish aggregation levels with grouping_id, and patterns for avoiding high-cardinality cube explosions.
"Show me revenue by country, by country × product, and the overall total — all at once." A staple request in OLAP reporting. Implemented naively, it turns into messy code that runs groupBy several times and stitches the results together with union. PySpark handles this multi-dimensional aggregation elegantly in a single job with rollup, cube, and grouping sets.
This post walks through the differences and costs of the three multi-dimensional aggregations, how to tell aggregation levels apart, and how to keep cube from exploding under high cardinality.
1. The problem — multiple aggregation levels at once
Desired result:
revenue by (country, product) ← most granular
revenue by (country) ← subtotal with product collapsed
grand total ← overall sumIf you run groupBy("country","product"), groupBy("country"), and a global aggregation separately and union them — you scan the data 3 times and the code gets verbose. Multi-dimensional aggregation operators solve this with a single scan.
2. rollup — hierarchical subtotals
rollup builds subtotals hierarchically from the left. Column order is meaningful.
from pyspark.sql import functions as F
result = (df
.rollup("country", "product")
.agg(F.sum("amount").alias("total"))
.orderBy("country", "product"))The aggregation levels it produces:
rollup(country, product) produces:
(country, product) ← granular
(country, NULL) ← subtotal per country (product collapsed)
(NULL, NULL) ← grand totalrollup fits "hierarchies" — relationships where the parent contains the child, like year→month→day, country→city, category→product. Column order is the hierarchy order.
3. cube — all combinations
cube produces every possible combination of the columns.
result = df.cube("country", "product").agg(F.sum("amount").alias("total"))cube(country, product) produces:
(country, product) ← both
(country, NULL) ← country only
(NULL, product) ← product only ← a combination rollup doesn't have!
(NULL, NULL) ← grand total| rollup | cube | |
|---|---|---|
| Combinations | Hierarchical (from the left) | All subsets |
| n columns | n+1 levels | 2^n levels |
| Best for | Hierarchical subtotals | Multi-dimensional cross analysis |
Caution: with cube, the number of combinations explodes as 2^n as columns grow. Four columns means 16 aggregations. Mix in a high-cardinality column and the result rows blow up, so make sure you really need every combination.
4. grouping sets — only the combinations you want
When you want specific combinations that are neither rollup nor cube, use grouping sets (clearest in SQL).
df.createOrReplaceTempView("sales")
result = spark.sql("""
SELECT country, product, sum(amount) AS total
FROM sales
GROUP BY GROUPING SETS (
(country, product), -- granular
(country), -- per country
() -- grand total
-- (product) deliberately excluded
)
""")grouping sets computes only the aggregation levels you pick, avoiding cube's cost for unnecessary combinations. It's ideal for custom reports like "just this combination and that one."
5. Distinguishing aggregation levels — grouping_id / grouping
In the result you must distinguish whether a NULL is "NULL because it's a subtotal" or "NULL in the original data." grouping_id() and grouping() tell you.
result = (df
.rollup("country", "product")
.agg(
F.sum("amount").alias("total"),
F.grouping_id().alias("gid"), # aggregation-level bitmask
F.grouping("country").alias("is_country_agg")) # 1 means this column was collapsed
)
# Interpret levels via gid
result = result.withColumn("level",
F.when(F.col("gid") == 0, "country+product")
.when(F.col("gid") == 1, "country subtotal")
.when(F.col("gid") == 3, "grand total"))| Function | Meaning |
|---|---|
grouping(col) | 1 if that column was collapsed by aggregation, 0 otherwise |
grouping_id() | Integer combining the grouping bits of all columns |
Essential in practice: you absolutely need
grouping_idto distinguish the NULLs in subtotal/grand-total rows from genuine NULLs in the data. Without it, you can't tell whether a "NULL country" is a grand total or missing data.
6. Preventing high-cardinality explosions
The result row count of a multi-dimensional aggregation is (the combination of each dimension's cardinality). Put a high-cardinality column into a cube and the result becomes enormous.
cube(country[200], product[100000], date[365])
→ worst case, billions of combination rows 💥Countermeasures:
| Strategy | How |
|---|---|
| Constrain with grouping sets | Specify only the levels you need |
| Exclude high cardinality | Drop fine-grained dimensions from the cube |
| Pre-filter / pre-aggregate | Reduce dimension cardinality first |
| Persist as a rollup table | Pre-compute and store (see the separate post "High-Cardinality Aggregation") |
The safest approach is to specify only the combinations you actually need with grouping sets instead of cube.
7. Multi-dimensional aggregation vs pivot
| Tool | Result shape |
|---|---|
| rollup/cube | long (levels expressed as rows) |
| pivot | wide (spread across columns) |
Multi-dimensional aggregation produces results as rows (distinguished by level columns), while pivot spreads them into columns (see the separate post "Large-Scale Pivots in PySpark"). The long shape is better for storage and re-aggregation; wide is better for display.
8. Summary
| Operation | What it produces | Best for |
|---|---|---|
rollup | Hierarchical subtotals (n+1) | Hierarchies like year→month→day |
cube | All combinations (2^n) | Multi-dimensional cross analysis |
grouping sets | Only specified combinations | Custom reports, explosion avoidance |
grouping_id | Aggregation-level distinction | Subtotal NULL vs data NULL |
The essence of multi-dimensional aggregation is producing "subtotals, grand totals, and cross aggregations in a single scan." Use rollup for hierarchical subtotals, cube for all-around cross analysis, grouping sets for custom combinations — and interpret the NULLs in the result with grouping_id. Just watch out for one thing: cube's 2^n explosion. When high-cardinality dimensions are involved, pick only the levels you need with grouping sets and you can generate OLAP reports efficiently.
This article is based on Spark 3.5. If you need help designing OLAP reporting or multi-dimensional aggregation pipelines, feel free to reach out.
— Data Dynamics Engineering Team