Large-Scale Fuzzy Matching in PySpark — Linking Hundreds of Millions of Records Without a Cross Join
How do you match hundreds of millions of records whose names and addresses differ slightly? We walk through the O(n²) cross join wall, blocking strategies, and a practical pattern for approximate joins with MinHash LSH that compares only likely candidates — with PySpark code.
Merging records that refer to "the same person but with slightly different names" — customer master consolidation, duplicate vendor cleanup, fraud detection — is one of data engineering's classic hard problems. If values match exactly, a join is all you need. But you have to match values that are similar yet different, like "Kim Cheolsu" vs "Kim Cheol-su", or "Seoul" vs "Seoul Special City".
The problem is scale. Comparing every pair across 1 million records requires 500 billion comparisons (O(n²)). With hundreds of millions of records, a cross join won't finish before the heat death of the universe. This post covers two key techniques for breaking through that wall — Blocking and LSH (Locality Sensitive Hashing) — and how to implement them in PySpark.
1. The Heart of the Problem — The O(n²) Wall
1M × 1M = 1 trillion pairs → similarity computation per pair → practically impossibleEvery fuzzy matching technique ultimately pursues a single goal — "removing pairs that don't need comparison from the candidate set". Instead of comparing everything (cross join), you keep only the pairs that might be similar and run precise comparisons within that subset.
| Approach | Pairs compared | Notes |
|---|---|---|
| Cross join (exhaustive) | O(n²) | Infeasible |
| Blocking | Within blocks only | Key design is critical |
| LSH | Similar buckets only | Approximate, threshold tuning |
2. Stage 1 — Blocking
The simplest and most powerful technique: compare only records that share the same block key. For example, comparing only people with the same "surname + birth year" dramatically shrinks the comparison space.
from pyspark.sql import functions as F
# Build the blocking key: first character of the name + birth year + first 3 digits of the zip code
df = df.withColumn("block_key",
F.concat_ws("|",
F.substring("name", 1, 1),
F.year("birth"),
F.substring("zipcode", 1, 3)))
# Self-join only within the same block (avoids the cross join)
a = df.alias("a")
b = df.alias("b")
candidates = (a.join(b, "block_key")
.where("a.id < b.id")) # remove duplicate pairs and self-matches| Pros | Limitations |
|---|---|
| Simple, fast | Misses true matches when block keys differ (recall↓) |
| Drastically fewer pairs | Large (skewed) blocks are O(k²) inside again |
The core trade-off: a loose block key makes blocks bigger and comparisons more expensive; a tight one misses true matches. The usual remedy is building multiple blocking keys and unioning them (multi-pass blocking) to boost recall.
Watch Out for Block Skew
A common surname (Kim) creates a giant block that becomes O(n²) inside again. Monitor block sizes and split large blocks with finer-grained keys. (For skew in general, see the separate post "Mastering PySpark Data Skew".)
3. Stage 2 — Pruning Similar Candidates with MinHash LSH
Blocking assumes "exactly the same key". To catch typos and spelling variations, you need LSH. The idea behind LSH: use hash functions that make similar items land in the same hash bucket with high probability, then compare only within each bucket.
For string similarity (Jaccard), use MinHash LSH. It is built into Spark MLlib.
Step ① Text → Feature Vectors (n-grams)
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF
# Character-level normalization
df = df.withColumn("clean", F.lower(F.regexp_replace("name", "\\s+", "")))
# Generate character 3-grams (e.g. "kimcheolsu" → ["kim", "imc", ...] — for Korean, choose jamo or character granularity)
df = df.withColumn("chars", F.split("clean", ""))
ngram = NGram(n=3, inputCol="chars", outputCol="ngrams")
df = ngram.transform(df)
# n-grams → sparse vectors
tf = HashingTF(inputCol="ngrams", outputCol="features", numFeatures=1 << 18)
df = tf.transform(df)Step ② Fit and Apply MinHashLSH
from pyspark.ml.feature import MinHashLSH
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(df)
# Find approximate similar pairs within the same dataset (Jaccard distance <= 0.3 = similarity >= 0.7)
pairs = model.approxSimilarityJoin(df, df, threshold=0.3, distCol="jaccard_dist")
candidates = (pairs
.where("datasetA.id < datasetB.id") # remove duplicates and self-matches
.select(
F.col("datasetA.id").alias("id_a"),
F.col("datasetB.id").alias("id_b"),
F.col("jaccard_dist")))approxSimilarityJoin emits only pairs whose LSH buckets overlap — no cross join involved. Increasing numHashTables improves recall (fewer misses) but also increases cost.
| Parameter | Effect |
|---|---|
numHashTables | ↑ recall, ↑ cost |
threshold (distance) | Lower = stricter (more precise), fewer candidates |
numFeatures | ↓ hash collisions, ↑ memory |
4. Stage 3 — Precise Comparison of Candidates (Scoring)
Run the expensive precise similarity computation only on the small set of candidate pairs produced by blocking/LSH. With so few pairs, a UDF/Pandas UDF is now perfectly fine.
# Weighted similarity across multiple fields (built-ins like Levenshtein + custom logic)
scored = candidates.join(df.alias("ra"), F.col("id_a") == F.col("ra.id")) \
.join(df.alias("rb"), F.col("id_b") == F.col("rb.id")) \
.withColumn("name_sim",
1 - F.levenshtein("ra.name", "rb.name") /
F.greatest(F.length("ra.name"), F.length("rb.name")))
matches = scored.withColumn("score",
0.6 * F.col("name_sim") + 0.4 * (1 - F.col("jaccard_dist"))
).where("score >= 0.85")If the precise comparison needs complex logic, a vectorized pandas_udf is a good fit (see the separate post "Why PySpark UDFs Are Slow and Pandas UDFs"). The key point is that the number of pairs reaching this stage is already small enough.
5. The Full Pipeline
[Raw data, hundreds of millions of records]
│ ① Normalize (lowercase, strip whitespace and special characters)
▼
[Blocking or LSH] ── avoid cross join, similar candidates only ──> [Candidate pairs (millions)]
│ ③ Precise scoring (Levenshtein, weighted sum)
▼
[Matched pairs (score ≥ threshold)]
│ ④ Clustering (group connected matches into one entity)
▼
[Consolidated entities]The final stage ④ — grouping matches into clusters (connected components) via "if A=B and B=C, then A=C" — is a graph problem. Solving it with Connected Components in GraphFrames is covered in the separate post "Entity Resolution with PySpark GraphFrames".
6. Tuning Precision vs Recall
Fuzzy matching is a tug-of-war between misses (recall) and false matches (precision).
| Goal | Adjustment |
|---|---|
| Fewer misses (recall↑) | Union multiple blocking keys, numHashTables↑, relax threshold |
| Fewer false matches (precision↑) | Raise the precise score threshold, weight multiple fields |
| Lower cost | Cap block sizes, tighten threshold, strengthen normalization |
Practical recommendation: a 3-stage setup — narrow first-pass candidates with blocking, cover variations with LSH, and finish with precise scoring — gives the best cost/quality balance. Don't try to do everything with a single technique.
7. Common Pitfalls
| Pitfall | Consequence |
|---|---|
| Skipping normalization | Spelling variants all treated as different values |
| Precise comparison via cross join | O(n²) blowup |
| Single blocking key | Recall collapses (any key mismatch is a miss) |
| Ignoring giant blocks | O(k²) skew within the block |
Excessive numHashTables | Candidate and cost explosion |
8. Summary
| Stage | Tool | Purpose |
|---|---|---|
| Normalization | lower, regexp_replace | Unify representations |
| Candidate reduction | Blocking / MinHash LSH | Avoid cross join |
| Precise comparison | Levenshtein, Pandas UDF | Score candidates |
| Clustering | GraphFrames CC | Consolidate entities |
The essence of large-scale fuzzy matching comes down to one thing — "never compare all pairs". Use blocking to capture obvious candidates, LSH to capture variations as well, then apply the expensive precise comparison only to that small remainder. Attacking O(n²) head-on can't be solved by any cluster, but with smart candidate reduction, even hundreds of millions of records can be matched in realistic time.
This article is based on Spark 3.5 + MLlib. If you need large-scale customer master consolidation, record linkage, or deduplication pipelines, feel free to reach out.
— Data Dynamics Engineering Team