Blog
pysparksparkfuzzy-matchinglshentity-resolutiondata-engineering

Large-Scale Fuzzy Matching in PySpark — Linking Hundreds of Millions of Records Without a Cross Join

How do you match hundreds of millions of records whose names and addresses differ slightly? We walk through the O(n²) cross join wall, blocking strategies, and a practical pattern for approximate joins with MinHash LSH that compares only likely candidates — with PySpark code.

Data DynamicsJune 5, 20267 min read

Merging records that refer to "the same person but with slightly different names" — customer master consolidation, duplicate vendor cleanup, fraud detection — is one of data engineering's classic hard problems. If values match exactly, a join is all you need. But you have to match values that are similar yet different, like "Kim Cheolsu" vs "Kim Cheol-su", or "Seoul" vs "Seoul Special City".

The problem is scale. Comparing every pair across 1 million records requires 500 billion comparisons (O(n²)). With hundreds of millions of records, a cross join won't finish before the heat death of the universe. This post covers two key techniques for breaking through that wall — Blocking and LSH (Locality Sensitive Hashing) — and how to implement them in PySpark.

1. The Heart of the Problem — The O(n²) Wall

1M × 1M = 1 trillion pairs → similarity computation per pair → practically impossible

Every fuzzy matching technique ultimately pursues a single goal — "removing pairs that don't need comparison from the candidate set". Instead of comparing everything (cross join), you keep only the pairs that might be similar and run precise comparisons within that subset.

ApproachPairs comparedNotes
Cross join (exhaustive)O(n²)Infeasible
BlockingWithin blocks onlyKey design is critical
LSHSimilar buckets onlyApproximate, threshold tuning

2. Stage 1 — Blocking

The simplest and most powerful technique: compare only records that share the same block key. For example, comparing only people with the same "surname + birth year" dramatically shrinks the comparison space.

from pyspark.sql import functions as F
 
# Build the blocking key: first character of the name + birth year + first 3 digits of the zip code
df = df.withColumn("block_key",
    F.concat_ws("|",
        F.substring("name", 1, 1),
        F.year("birth"),
        F.substring("zipcode", 1, 3)))
 
# Self-join only within the same block (avoids the cross join)
a = df.alias("a")
b = df.alias("b")
candidates = (a.join(b, "block_key")
    .where("a.id < b.id"))      # remove duplicate pairs and self-matches
ProsLimitations
Simple, fastMisses true matches when block keys differ (recall↓)
Drastically fewer pairsLarge (skewed) blocks are O(k²) inside again

The core trade-off: a loose block key makes blocks bigger and comparisons more expensive; a tight one misses true matches. The usual remedy is building multiple blocking keys and unioning them (multi-pass blocking) to boost recall.

Watch Out for Block Skew

A common surname (Kim) creates a giant block that becomes O(n²) inside again. Monitor block sizes and split large blocks with finer-grained keys. (For skew in general, see the separate post "Mastering PySpark Data Skew".)

3. Stage 2 — Pruning Similar Candidates with MinHash LSH

Blocking assumes "exactly the same key". To catch typos and spelling variations, you need LSH. The idea behind LSH: use hash functions that make similar items land in the same hash bucket with high probability, then compare only within each bucket.

For string similarity (Jaccard), use MinHash LSH. It is built into Spark MLlib.

Step ① Text → Feature Vectors (n-grams)

from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF
 
# Character-level normalization
df = df.withColumn("clean", F.lower(F.regexp_replace("name", "\\s+", "")))
 
# Generate character 3-grams (e.g. "kimcheolsu" → ["kim", "imc", ...] — for Korean, choose jamo or character granularity)
df = df.withColumn("chars", F.split("clean", ""))
ngram = NGram(n=3, inputCol="chars", outputCol="ngrams")
df = ngram.transform(df)
 
# n-grams → sparse vectors
tf = HashingTF(inputCol="ngrams", outputCol="features", numFeatures=1 << 18)
df = tf.transform(df)

Step ② Fit and Apply MinHashLSH

from pyspark.ml.feature import MinHashLSH
 
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(df)
 
# Find approximate similar pairs within the same dataset (Jaccard distance <= 0.3 = similarity >= 0.7)
pairs = model.approxSimilarityJoin(df, df, threshold=0.3, distCol="jaccard_dist")
 
candidates = (pairs
    .where("datasetA.id < datasetB.id")   # remove duplicates and self-matches
    .select(
        F.col("datasetA.id").alias("id_a"),
        F.col("datasetB.id").alias("id_b"),
        F.col("jaccard_dist")))

approxSimilarityJoin emits only pairs whose LSH buckets overlap — no cross join involved. Increasing numHashTables improves recall (fewer misses) but also increases cost.

ParameterEffect
numHashTables↑ recall, ↑ cost
threshold (distance)Lower = stricter (more precise), fewer candidates
numFeatures↓ hash collisions, ↑ memory

4. Stage 3 — Precise Comparison of Candidates (Scoring)

Run the expensive precise similarity computation only on the small set of candidate pairs produced by blocking/LSH. With so few pairs, a UDF/Pandas UDF is now perfectly fine.

# Weighted similarity across multiple fields (built-ins like Levenshtein + custom logic)
scored = candidates.join(df.alias("ra"), F.col("id_a") == F.col("ra.id")) \
                   .join(df.alias("rb"), F.col("id_b") == F.col("rb.id")) \
    .withColumn("name_sim",
        1 - F.levenshtein("ra.name", "rb.name") /
            F.greatest(F.length("ra.name"), F.length("rb.name")))
 
matches = scored.withColumn("score",
    0.6 * F.col("name_sim") + 0.4 * (1 - F.col("jaccard_dist"))
).where("score >= 0.85")

If the precise comparison needs complex logic, a vectorized pandas_udf is a good fit (see the separate post "Why PySpark UDFs Are Slow and Pandas UDFs"). The key point is that the number of pairs reaching this stage is already small enough.

5. The Full Pipeline

[Raw data, hundreds of millions of records]
     │  ① Normalize (lowercase, strip whitespace and special characters)

[Blocking or LSH] ── avoid cross join, similar candidates only ──> [Candidate pairs (millions)]
     │  ③ Precise scoring (Levenshtein, weighted sum)

[Matched pairs (score ≥ threshold)]
     │  ④ Clustering (group connected matches into one entity)

[Consolidated entities]

The final stage ④ — grouping matches into clusters (connected components) via "if A=B and B=C, then A=C" — is a graph problem. Solving it with Connected Components in GraphFrames is covered in the separate post "Entity Resolution with PySpark GraphFrames".

6. Tuning Precision vs Recall

Fuzzy matching is a tug-of-war between misses (recall) and false matches (precision).

GoalAdjustment
Fewer misses (recall↑)Union multiple blocking keys, numHashTables↑, relax threshold
Fewer false matches (precision↑)Raise the precise score threshold, weight multiple fields
Lower costCap block sizes, tighten threshold, strengthen normalization

Practical recommendation: a 3-stage setup — narrow first-pass candidates with blocking, cover variations with LSH, and finish with precise scoring — gives the best cost/quality balance. Don't try to do everything with a single technique.

7. Common Pitfalls

PitfallConsequence
Skipping normalizationSpelling variants all treated as different values
Precise comparison via cross joinO(n²) blowup
Single blocking keyRecall collapses (any key mismatch is a miss)
Ignoring giant blocksO(k²) skew within the block
Excessive numHashTablesCandidate and cost explosion

8. Summary

StageToolPurpose
Normalizationlower, regexp_replaceUnify representations
Candidate reductionBlocking / MinHash LSHAvoid cross join
Precise comparisonLevenshtein, Pandas UDFScore candidates
ClusteringGraphFrames CCConsolidate entities

The essence of large-scale fuzzy matching comes down to one thing — "never compare all pairs". Use blocking to capture obvious candidates, LSH to capture variations as well, then apply the expensive precise comparison only to that small remainder. Attacking O(n²) head-on can't be solved by any cluster, but with smart candidate reduction, even hundreds of millions of records can be matched in realistic time.


This article is based on Spark 3.5 + MLlib. If you need large-scale customer master consolidation, record linkage, or deduplication pipelines, feel free to reach out.

— Data Dynamics Engineering Team