pysparksparkquantilepercentilestatisticsdata-engineering

Large-Scale Quantiles and Statistics in PySpark — p95 Latency, Accurate and Fast

Why computing medians and percentiles (p95/p99) over billions of rows is hard: the cost of exact quantiles, how approxQuantile works, per-group quantiles, outlier detection with IQR, and correlation/covariance — a survey of large-scale statistics patterns.

Data DynamicsJune 5, 20265 min read

"Get me the p95 of API response times." This seemingly simple request becomes tricky at billions of rows. Percentiles and medians fundamentally require sorting the entire dataset to compute exactly, and a global sort in a distributed environment means an expensive shuffle. Averages are easy; quantiles are hard.

This post covers why large-scale quantiles are expensive, how approximate quantiles (approxQuantile) solve the problem, and statistical computation patterns like per-group quantiles, outlier detection, and correlation analysis.

1. Why Quantiles Are Harder Than Averages

Average (avg)  : sum / count → just add up partial sums (associative, little shuffle)
Median / p95   : the value at a sorted position → requires knowing the full order (global sort)

Averages, sums, and maxima are associative: each partition computes a partial result, and the results are merged. But "exactly the 95th percentile" requires knowing the order of every value, so an exact computation demands a global sort (or an equivalently expensive shuffle).

Statistic	Cost
count, sum, avg, min, max	Cheap (associative)
stddev, var	Cheap (associative)
median, percentile	Expensive (sorting)
distinct count	Expensive (see the separate post on high-cardinality aggregation)

2. Exact Quantiles — percentile

If you truly need the exact value, use percentile (expensive).

from pyspark.sql import functions as F
 
# Exact quantiles (sorting cost)
df.agg(
    F.percentile("latency", 0.5).alias("median"),
    F.percentile("latency", 0.95).alias("p95"),
    F.percentile("latency", 0.99).alias("p99"))
 
# The exact counterpart of percentile_approx, or via expr
df.selectExpr("percentile(latency, array(0.5, 0.95, 0.99)) as pcts")

percentile considers every value and is exact, but at scale it consumes significant memory and time.

3. Approximate Quantiles — approxQuantile (Recommended)

Most monitoring and analytics metrics tolerate a small amount of error. Whether p95 is 230ms or 232ms, the decision you make is the same. approxQuantile approximates quantiles with a fixed-memory sketch, computing them quickly without sorting.

# DataFrame.approxQuantile(col, probabilities, relativeError)
quantiles = df.approxQuantile(
    "latency",
    [0.5, 0.95, 0.99],
    0.001)                  # relative error (smaller = more accurate, higher cost)
# → [median, p95, p99]
 
# SQL expression version (for use in groupBy/agg)
df.groupBy("endpoint").agg(
    F.percentile_approx("latency", 0.95, 10000).alias("p95"))

	percentile (exact)	approxQuantile (approximate)
Accuracy	100%	Within relativeError
Memory	Large	Fixed (small)
Speed	Slow	Fast
Best for	Billing, compliance	Monitoring, analytics

The relativeError trade-off: 0.0 is exact but expensive, while 0.01 (1%) is very fast. For SLA monitoring, 0.001~0.01 is usually sufficient.

Rule of thumb: as with distinct counts, first ask "do I really need exact?" for quantiles too. Metrics like p95 latency are fine with approximation, and the job runs several times faster.

4. Per-Group Quantiles

Per-group quantiles — p95 per endpoint, per country — are a common need. Use percentile_approx as an aggregation.

result = (df.groupBy("endpoint")
    .agg(
        F.percentile_approx("latency", 0.5, 10000).alias("p50"),
        F.percentile_approx("latency", 0.95, 10000).alias("p95"),
        F.percentile_approx("latency", 0.99, 10000).alias("p99"),
        F.avg("latency").alias("avg")))

The third argument (accuracy) trades higher accuracy for more memory. With many groups and skew, the cost grows (see the separate post on data skew) — check for skewed groups.

5. Outlier Detection — IQR

A classic application of quantiles is outlier detection. The IQR (interquartile range) identifies values outside the normal range.

# Approximate Q1 and Q3
q1, q3 = df.approxQuantile("amount", [0.25, 0.75], 0.001)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
 
# Flag outliers
flagged = df.withColumn("is_outlier",
    (F.col("amount") < lower) | (F.col("amount") > upper))
 
outliers = flagged.filter("is_outlier")

Method	Criterion
IQR	Outside Q1-1.5·IQR ~ Q3+1.5·IQR
Z-score	More than N standard deviations from the mean
Percentile cut	Trim the top and bottom 1%

Z-score uses only associative statistics (mean, standard deviation) and is therefore cheaper, but when the distribution is not normal, IQR is more robust.

6. Correlation, Covariance, and Summary Statistics

Quickly examine relationships between variables and overall distributions.

# Correlation coefficient / covariance
df.stat.corr("price", "demand")
df.stat.cov("price", "demand")
 
# Summary statistics (count, mean, stddev, min, max, quantiles)
df.summary("count", "mean", "stddev", "min", "25%", "50%", "75%", "max").show()
 
# Contingency table / approximate frequent items
df.stat.crosstab("device", "status")
df.stat.freqItems(["product_id"], support=0.01)

summary() is handy for EDA (exploratory data analysis), but because it includes quantiles it can be expensive on large data — applying it to a sample is one option.

7. Cutting Statistics Cost with Sampling

For many statistics, a representative sample is enough — you don't need the full dataset.

# Simple random sample
sample = df.sample(fraction=0.01, seed=42)
 
# Stratified sample (preserves group proportions)
stratified = df.sampleBy("country",
    fractions={"KR": 0.1, "US": 0.05, "JP": 0.1}, seed=42)
 
# Estimate quantiles from the sample
sample.approxQuantile("latency", [0.95], 0.001)

For distribution estimation and EDA, a 1% sample is often sufficient, dramatically reducing cost.

8. Summary

Statistic	Tool	Cost
Mean, sum, stddev	`avg`, `sum`, `stddev`	Cheap
Exact quantiles	`percentile`	Expensive
Approximate quantiles	`approxQuantile`/`percentile_approx`	Cheap (recommended)
Per-group p95	`percentile_approx` in agg	Moderate
Outliers	IQR (approxQuantile) / Z-score	Moderate/cheap
Relationships	`stat.corr`, `summary`	Moderate

The key insight for large-scale statistics is the distinction that "associative statistics are cheap; order-dependent statistics (quantiles) are expensive." When you need a p95 or median, make approxQuantile your default instead of the exact percentile — it delivers fast answers within a tolerable error bound, in fixed memory, without sorting. Add IQR outlier detection and sampling on top, and even statistical analysis over billions of rows becomes a constant-cost operation.

This post is based on Spark 3.5. If you need help with large-scale metrics and statistical analysis or designing outlier detection pipelines, feel free to reach out.

— The Data Dynamics Engineering Team