
Designing Multilingual Search for Patents, Law, and Academic Papers - From BM25 to RRF and Cross-encoders

Specialist-domain multilingual search is not just a multilingual embedding plus RAG. This post breaks it down into multilingual BM25, language-specific analyzers, cross-lingual matching, sentence-to-document RRF fusion, and a side-by-side OpenSearch+Qdrant vs Elasticsearch+Milvus build-out.

Data Dynamics · May 18, 2026 · 22 min read

This post was written in response to a reader's request — "please cover multilingual search in domains like patents, law, and academic literature: not just multilingual embeddings and LLM RAG, but keyword matching (multilingual BM25), multilingual similarity, sentence-to-document RRF strategies." Thank you for the excellent prompt — this post is the answer.

Search for patents, law, and academic papers differs from web search in one decisive way: a single missed result can mean losing an invalidation trial, missing the controlling precedent in oral argument, or shipping a literature review that ignores prior work. Meanwhile users ask in Korean, but the answers live in English claims, Japanese rulings, and Chinese journals.

Rather than trying to close that gap with a single embedding model, this post decomposes it into four axes:

  1. Multilingual keyword matching (BM25 plus per-language analyzers)
  2. Cross-lingual information retrieval (CLIR)
  3. Multilingual semantic similarity
  4. Sentence-to-document RRF fusion plus cross-encoder reranking

The general RAG pipeline is covered in the RAG Complete Guide. This post goes deep on the retrieval stage only.


1. Why "multilingual embedding + RAG" alone falls short

Specialist-domain search differs from generic search for three decisive reasons.

(1) The vocabulary itself is the answer. A patent claim's "a plurality of", a legal "third party", a paper's "p < 0.05" — these phrases cannot be paraphrased without changing legal or scientific meaning. Embedding models are trained to pull semantically similar text together, which makes them weakest exactly where you need exact lexical matching: abbreviations, proper nouns, statute numbers, chemical formulas.

(2) Recall has to approach 1.0. Web search succeeds if the answer is in the top three. Patent prior-art search has to find that one filing that might exist somewhere in the world. Precision can be fixed by a human reviewer in the second stage; a recall loss is never seen by anyone, so it is never corrected.

(3) The unit is not the sentence — it is the document. One patent has dozens of claims and hundreds of specification paragraphs. The user wants the most relevant patent, not the most relevant claim. Score by chunk and present the chunk list, and five chunks from the same patent will dominate the top five results.

These three properties together mean the naive "shove all docs into a multilingual embedder, top-k into an LLM" RAG fails on day one in this domain.

Takeaway: specialist-domain search must be evaluated along three axes — lexical precision × domain recall × document-level aggregation.


2. Per-domain requirement decomposition

"Specialist search" is not one problem. Units, ranking signals, and evaluation criteria all differ across the three domains. Spec them out before you touch the index.

| Item | Patents | Law | Academic papers |
|---|---|---|---|
| Search unit | 1 patent (application no.) | 1 case or 1 article of a statute | 1 paper (DOI) |
| Chunk unit | Claim, spec paragraph | Holding paragraph, reasoning section | Abstract, section, figure caption |
| Hard-match signals | IPC class, applicant | Court, instance, year, cited cases | Authors, journal, citation graph |
| Critical vocabulary | Claim wording, drawing references | Statute numbers, case numbers | Taxonomic names, formulas, abbreviations |
| Recall standard | Even one miss = invalidation risk | Missing the key case = losing | Missing the key prior work |
| Multilingual pattern | en/ko/ja/zh parallel filings | Local language + English summary | English body + non-English abstract |

These three deserve separate indices, separate weights, and separate evaluation sets, even if they share the same engine.


3. Multilingual keyword matching: pushing BM25 across languages

BM25 is a 1990s algorithm, but in specialist domains it is still the strongest single signal. The trick to taking it multilingual lies not in the algorithm but in per-language analyzers and field weights.

3.1 Per-language analyzer map

| Language | OpenSearch / Elasticsearch analyzer | Strategy | Caveats |
|---|---|---|---|
| Korean | nori (analysis-nori plugin; preinstalled on Amazon OpenSearch Service) | Morphological + compound noun decomposition | Drop particles with nori_part_of_speech |
| Japanese | kuromoji | Morphological + kana/kanji normalization | Old-form to new-form kanji (kyūjitai → shinjitai) conversion needed |
| Chinese | smartcn or ik | Word segmentation | Traditional↔simplified normalization (stconvert) |
| English | standard + english stemmer | Stemming + lowercase | Disable stemmer for domain abbreviations |
| DE/FR/ES | Per-language stemmer | Stemming | Compound splitter (hyphenation_decompounder) |

3.2 Same Korean sentence, four analyzers

Looking at what each analyzer emits makes the choice obvious.

Input: "This invention relates to a channel estimation method in wireless communication systems." (Korean: "본 발명은 무선 통신 시스템에서의 채널 추정 방법에 관한 것이다.")

| Analyzer | Tokens |
|---|---|
| standard (English default) | 본, 발명은, 무선, 통신, 시스템에서의, 채널, 추정, 방법에, 관한, 것이다 |
| nori (default) | 본, 발명, 은, 무선, 통신, 시스템, 에서, 의, 채널, 추정, 방법, 에, 관한, 것, 이다 |
| nori + POS filter | 발명, 무선, 통신, 시스템, 채널, 추정, 방법 |
| CJK bigram (fallback) | 본발, 발명, 명은, 은무, 무선, … (n-gram) |

standard glues "시스템에서의" into one token so a search for "시스템" finds nothing. nori + POS filter compresses to seven content words, giving the cleanest BM25 signal. The CJK bigram fallback is the dictionary-less safety net for new terms and proper nouns — always index it alongside.

3.3 OpenSearch index mapping example (patents)

{
  "settings": {
    "analysis": {
      "analyzer": {
        "ko_nori": {
          "type": "custom",
          "tokenizer": "nori_tokenizer",
          "filter": ["nori_part_of_speech", "lowercase", "ko_synonyms"]
        },
        "ko_bigram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "cjk_bigram"]
        },
        "en_domain": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_minimal_stem", "ipc_synonyms"]
        }
      },
      "filter": {
        "ko_synonyms": { "type": "synonym_graph", "synonyms_path": "synonyms/ko-patent.txt" },
        "ipc_synonyms": { "type": "synonym_graph", "synonyms_path": "synonyms/ipc.txt" },
        "english_minimal_stem": { "type": "stemmer", "language": "minimal_english" }
      }
    }
  },
  "mappings": {
    "properties": {
      "application_no": { "type": "keyword" },
      "ipc":             { "type": "keyword" },
      "filing_date":     { "type": "date" },
      "applicant":       { "type": "keyword", "fields": { "text": { "type": "text", "analyzer": "ko_nori" } } },
      "title_ko":        { "type": "text", "analyzer": "ko_nori",
                           "fields": { "bigram": { "type": "text", "analyzer": "ko_bigram" } } },
      "title_en":        { "type": "text", "analyzer": "en_domain" },
      "claims_ko":       { "type": "text", "analyzer": "ko_nori" },
      "claims_en":       { "type": "text", "analyzer": "en_domain" },
      "abstract_ko":     { "type": "text", "analyzer": "ko_nori" },
      "abstract_en":     { "type": "text", "analyzer": "en_domain" }
    }
  }
}

Two things to notice: (a) the same field indexed by two analyzers (title_ko + title_ko.bigram) and (b) separate per-language fields (*_ko, *_en) so queries can re-weight by the user's language.

3.4 BM25F: field weights encode domain knowledge

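Claims weigh more than the spec, titles weigh more than abstracts. A multi-field BM25 query encodes that prior. Strictly speaking, multi_match with best_fields takes the best-scoring field and adds a tie_breaker share of the others, so it approximates BM25F rather than implementing it, but the field boosts carry the same domain knowledge.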

{
  "query": {
    "multi_match": {
      "query": "wireless communication channel estimation",
      "type": "best_fields",
      "fields": [
        "title_en^4",
        "claims_en^3",
        "abstract_en^2",
        "title_ko^2",
        "claims_ko^1.5"
      ],
      "tie_breaker": 0.3
    }
  }
}

Tune weights by grid search against the evaluation set from chapter 7. The heuristic "claims are 1.5–3× more important than spec" holds across nearly every patent corpus we have seen.
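The grid search mentioned above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: `grid_search_field_weights`, `to_fields`, and the `evaluate` callable are hypothetical names, and `evaluate` is assumed to run the query set with the given boosts and return one number (for example mean nDCG@10) where higher is better.

```python
from itertools import product
from typing import Callable


def grid_search_field_weights(
    grid: dict[str, list[float]],
    evaluate: Callable[[dict[str, float]], float],
) -> tuple[dict[str, float], float]:
    """Try every combination of per-field boosts and keep the best one.

    `evaluate` is an assumed callback: it receives {field: boost} and
    returns a quality score such as mean nDCG@10 on the evaluation set.
    """
    names = list(grid)
    best_weights: dict[str, float] = {}
    best_score = float("-inf")
    for combo in product(*(grid[name] for name in names)):
        weights = dict(zip(names, combo))
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


def to_fields(weights: dict[str, float]) -> list[str]:
    """Render {field: boost} into the 'field^boost' strings multi_match expects."""
    return [f"{name}^{boost:g}" for name, boost in weights.items()]
```

The cost grows multiplicatively with the number of fields and candidate values, so it is usually practical to fix the low-impact fields and search only two or three boosts at a time.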

Takeaway: multilingual BM25 quality is decided by analyzer × field split × synonym dictionary, not the algorithm.


4. Query, translation, and cross-lingual matching (CLIR)

A user enters "무선 통신 채널 추정" in Korean and you have to find an English patent for "channel estimation in wireless communication". Three approaches.

| Strategy | How | Pros | Cons |
|---|---|---|---|
| (A) Query Translation | Translate the query into every indexed language, run BM25 separately | Index untouched, simple | Short queries translate badly |
| (B) Document Translation | Pre-translate every document to every language, single index | Fastest at query time | Index size × N, large translation cost |
| (C) Cross-lingual Embedding | Dense search in a language-agnostic embedding space | Single representation for query and doc | Loses exact lexical matches |

Recommended in practice: (A) + (C) hybrid. Translate the query with a domain glossary first, complement with a separate multilingual encoder, then fuse with the chapter-6 RRF. Avoid (B) in domains like patent claims and statutes where translation itself can change legal meaning.

4.1 Glossary-driven query expansion

LLM translation might render "채널 추정" as "channel guess". A pinned domain glossary is safer.

GLOSSARY_KO_EN = {
    "채널 추정": ["channel estimation"],
    "무선 통신": ["wireless communication", "radio communication"],
    "직교 주파수 분할 다중화": ["OFDM", "orthogonal frequency-division multiplexing"],
    "제3자":     ["third party"],
    "선행기술":   ["prior art"],
}
 
def expand_query(q_ko: str) -> dict:
    en_terms = []
    for ko_term, en_list in GLOSSARY_KO_EN.items():
        if ko_term in q_ko:
            en_terms.extend(en_list)
    return {"ko": q_ko, "en": " ".join(en_terms) or None}

The expanded query is matched against the *_ko and *_en fields from 3.3 in a single multi-match query.
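One practical wrinkle with substring lookup is that Korean compound nouns are written both with and without spaces ("채널 추정" vs "채널추정"). The sketch below shows one possible whitespace-insensitive variant of the glossary lookup. `expand_query_loose` and the two-entry glossary are illustrative assumptions, not part of the original pipeline.

```python
import re

# Illustrative two-entry glossary; a real one would be far larger.
GLOSSARY = {
    "채널 추정": ["channel estimation"],
    "무선 통신": ["wireless communication", "radio communication"],
}


def _squash(text: str) -> str:
    """Remove all whitespace so spacing variants compare equal."""
    return re.sub(r"\s+", "", text)


def expand_query_loose(q_ko: str, glossary: dict[str, list[str]]) -> dict:
    """Glossary expansion that ignores spacing differences in the query."""
    squashed_query = _squash(q_ko)
    en_terms: list[str] = []
    for ko_term, en_list in glossary.items():
        if _squash(ko_term) in squashed_query:
            for term in en_list:
                if term not in en_terms:  # keep order, drop duplicates
                    en_terms.append(term)
    return {"ko": q_ko, "en": " ".join(en_terms) or None}
```

Ignoring whitespace can also produce false matches across word boundaries, so it is worth checking the expanded terms against a sample of real queries before enabling it.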

4.2 Tokens that must never reach the translator

  • Chemical and math formulas: H2SO4, O(n log n)
  • Abbreviations and proper nouns: OFDM, LSTM, K-NN
  • Statute and case numbers: Supreme Court 2019Da12345
  • IPC/CPC class codes: H04L 27/26
  • Drawing reference numerals: 100, 100a

Standard practice: regex-mask before translation, then unmask the output.
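A minimal sketch of the mask and unmask step, under the assumption that the translator passes ASCII placeholders through unchanged. The patterns and the `__T0__` placeholder format are illustrative choices and far from exhaustive (for instance, `O(n log n)` is not covered); a production list would be built from the domain's own token inventory.

```python
import re

# Illustrative patterns only; order matters because earlier alternatives win.
PROTECTED = re.compile(
    r"""
      \b[A-H]\d{2}[A-Z]\s?\d{1,4}/\d{2,6}\b   # IPC/CPC class codes, e.g. H04L 27/26
    | \b\d{4}[A-Za-z가-힣]{1,3}\d{3,7}\b      # case numbers, e.g. 2019Da12345
    | \b[A-Z][A-Z0-9]*(?:-[A-Z0-9]+)+\b       # hyphenated abbreviations, e.g. K-NN
    | \b[A-Z][A-Z0-9]+\b                      # abbreviations and simple formulas, e.g. OFDM, H2SO4
    | \b\d+[a-z]?\b                           # drawing reference numerals, e.g. 100, 100a
    """,
    re.VERBOSE,
)


def mask(text: str) -> tuple[str, list[str]]:
    """Replace protected tokens with numbered placeholders."""
    store: list[str] = []

    def _swap(match: re.Match) -> str:
        store.append(match.group(0))
        return f"__T{len(store) - 1}__"

    return PROTECTED.sub(_swap, text), store


def unmask(text: str, store: list[str]) -> str:
    """Put the original tokens back after translation."""
    return re.sub(r"__T(\d+)__", lambda m: store[int(m.group(1))], text)
```

The translator then only ever sees `__T0__`-style tokens in place of codes and abbreviations, and `unmask` restores them verbatim in the translated output.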


5. Multilingual semantic similarity

BM25 alone cannot bridge the gap between "channel estimation", "a method of channel estimation", and "radio channel identification". That is where dense search enters. Picking a multilingual embedding model for a specialist domain is more delicate than the general RAG case — see the Embedding Model Guide for a broader comparison. Here we list only the domain-search-specific criteria.

5.1 Multilingual embedding candidates

| Model | Dim | Langs | Strengths | Weaknesses |
|---|---|---|---|---|
| LaBSE | 768 | 109 | Strong on short query/sentence alignment, BERT-stable | Loses on long documents |
| multilingual-e5-large | 1024 | 100+ | Trained for query/doc asymmetry (query: / passage: prefixes) | Prefix mandatory |
| BGE-M3 | 1024 | 100+ | Emits dense + sparse + multi-vector in a single pass | Indexing complexity |
| Cohere embed-multilingual-v3 | 1024 | 100+ | API, commercial quality | External call |

Default for specialist domains: BGE-M3. A single pass yields dense + sparse + multi-vector (ColBERT-style) representations, so keyword and semantic signals come from the same model.

5.2 Asymmetric length problem

Patent claims often run 500+ characters in a single sentence; user queries are under 10 characters. Address the asymmetry from both sides.

(a) Unify chunk size: split each claim again on period + semicolon to 100–200 tokens.

(b) Expand the query: for short queries, generate a one-to-two sentence hypothetical answer with an LLM and embed that (HyDE).

def hyde_expand(query: str, llm) -> str:
    prompt = f"""Write a one-paragraph patent abstract that would directly answer:
"{query}"
Use technical vocabulary. Do not add disclaimers."""
    return llm.complete(prompt)
 
q_vec = embed(hyde_expand(user_query, llm))

HyDE typically lifts dense recall on short queries by 8–15 percentage points on BEIR-class and domain evaluation sets.
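Side (a), re-splitting long claims, might look like the sketch below. It is a rough illustration: `split_claim` is a hypothetical helper, and counting whitespace-separated words stands in for the embedding model's real tokenizer, which is what you would plug in as `count_tokens` in practice.

```python
import re
from typing import Callable


def split_claim(
    text: str,
    max_tokens: int = 200,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> list[str]:
    """Split on '.' or ';' boundaries, then pack pieces up to max_tokens.

    A single piece longer than max_tokens is kept whole rather than cut
    mid-clause; handle that case separately if it matters.
    """
    pieces = [p.strip() for p in re.split(r"(?<=[.;])\s+", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)   # current chunk is full, start a new one
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting only where a period or semicolon is followed by whitespace leaves decimals such as 0.05 intact, though abbreviations like "Fig. 3" would still be split and need their own guard.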

5.3 Domain adaptation

The base model alone is not enough for IPC codes, statute numbering, or taxonomy. Two options.

  • LoRA fine-tune: collect 10k–100k domain (query, positive doc, negative doc) triples and train 1–2 contrastive epochs. Typical nDCG@10 lift of 5–10 percentage points.
  • In-context query rewriting: keep the model frozen; do dictionary lookup and abbreviation expansion in query preprocessing.

Start with the latter when domain data is scarce; switch to the former once you have collected roughly 10k labeled triples, the lower bound given above for fine-tuning.


6. Sentence-to-document RRF fusion (the core of this post)

After chapters 3 (BM25) and 5 (dense), assume both have returned a top-K. Two problems remain.

  1. How do you combine two result lists with different score scales?
  2. When you searched at sentence level and the same document has multiple chunks in the result, how do you aggregate to a per-document ranking?

The answer to both is Reciprocal Rank Fusion (RRF). RRF uses only ranks, never raw scores, so combining systems with incomparable score scales is safe. And summing the reciprocal-rank contributions of multiple chunks from the same document naturally produces a document-level score.

6.1 RRF formula

$$\mathrm{RRF}(d) = \sum_{r \in R(d)} \frac{1}{k + \mathrm{rank}_r(d)}$$
  • d : document (or sentence chunk)
  • R(d) : every ranking system in which d appeared
  • k : flattening constant, typically 60
  • rank : 1-based
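A small worked example of the formula, using a hypothetical helper `rrf_score` that takes the ranks one item received across systems:

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """RRF score of one item given its 1-based rank in each system that returned it."""
    return sum(1.0 / (k + rank) for rank in ranks)


# Ranked 3rd by one system and 5th by another: 1/63 + 1/65
agreed = rrf_score([3, 5])

# Ranked 1st by a single system and absent from the other: 1/61
single = rrf_score([1])
```

With k = 60, `agreed` (about 0.0313) is nearly double `single` (about 0.0164): moderate agreement between two systems outweighs a single first place. That behaviour is what the flattening constant k buys.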

6.2 Full pipeline

The end-to-end flow as a diagram:

[Figure: Sentence-level hybrid search with document-level RRF fusion]

User query: "wireless channel estimation"
  → [A] Multilingual BM25 (nori + en stemmer, multi-field) → sentence chunks top-100 (claim_3#sent_2 → rank 1, spec_§4#sent_5 → rank 2)
  → [B] Expanded BM25 (glossary + HyDE) → sentence chunks top-100 (abstract#sent_1 → rank 1, claim_1#sent_1 → rank 2)
  → [C] Multilingual dense (BGE-M3 / multilingual-e5) → sentence chunks top-100 (claim_2#sent_4 → rank 1, spec_§7#sent_3 → rank 2)
  → [D] Document-level RRF aggregation: score(doc) = Σ 1 / (k + rank), k = 60, summed over chunks sharing application_no
  → [E] Cross-encoder rerank top-20: bge-reranker-v2-m3 (query × best_chunk_per_doc)

The crucial pattern: search at sentence level, aggregate at document level. BM25 and dense each score sentence chunks and return top-100. Chunks sharing the same application_no (patent) / case_id (precedent) / doi (paper) have their reciprocal-rank contributions summed by RRF into a document score.

6.3 RRF implementation (Python, stack-agnostic)

from collections import defaultdict
 
def rrf_fuse(
    rankings: list[list[tuple[str, str]]],  # list of [(doc_id, chunk_id), ...]
    k: int = 60,
    weights: list[float] | None = None,
) -> list[tuple[str, float, list[str]]]:
    """Sentence-chunk rankings -> document-level RRF scores.
    Each ranking list is assumed already sorted by rank.
    """
    weights = weights or [1.0] * len(rankings)
    doc_score: dict[str, float] = defaultdict(float)
    doc_chunks: dict[str, list[str]] = defaultdict(list)
 
    for ranking, w in zip(rankings, weights):
        for rank, (doc_id, chunk_id) in enumerate(ranking, start=1):
            # If you want to dampen the contribution of the 2nd+ chunk of
            # the same doc within one run, add a decay here.
            doc_score[doc_id] += w / (k + rank)
            if chunk_id not in doc_chunks[doc_id]:
                doc_chunks[doc_id].append(chunk_id)
 
    fused = sorted(
        ((d, s, doc_chunks[d]) for d, s in doc_score.items()),
        key=lambda x: x[1],
        reverse=True,
    )
    return fused

6.4 Tuning weights and k

k=60 is the original value (Cormack et al., 2009) and works in most cases. Weights typically start at [1.0, 1.0, 1.0] and grid-search toward something like [BM25, expanded BM25, dense] = [1.0, 0.6, 1.2].

Per-domain bias:

  • Patents: nudge BM25 up to 1.2–1.5 (lexical precision matters most)
  • Law: keep BM25 and dense roughly equal (statute citations are exact, facts are paraphrased)
  • Papers: lift dense to 1.3–1.5 (authors phrase the same idea many ways)

6.5 Cross-encoder reranking

After RRF, rerank only the top 20–50 with a cross-encoder. A cross-encoder concatenates (query, passage) into one transformer pass to produce a direct score — expensive but accurate. In specialist-domain search, the final nDCG is nearly always best with cross-encoder reranking enabled.

from sentence_transformers import CrossEncoder
 
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)
 
def rerank(query: str, candidates: list[dict], top_n: int = 20) -> list[dict]:
    # candidates: [{"doc_id": ..., "best_chunk_text": ...}, ...]
    pairs = [(query, c["best_chunk_text"]) for c in candidates]
    scores = reranker.predict(pairs, batch_size=32)
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    return sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)[:top_n]

When you rerank at the document level, do not feed the whole document — feed the single chunk that ranked highest under RRF for that document. That cost/accuracy tradeoff has consistently been best in our deployments.

Takeaway: RRF combines the rankings, RRF also changes the unit from sentence to document, and a cross-encoder grabs the final accuracy. Those three steps decide ~80% of specialist-search quality.


7. Evaluation

Tuning a search system without an evaluation set is gambling. Specialist evaluation differs from generic IR in two ways.

(a) There is no single correct answer. One query may have 50 relevant patents. You need graded relevance (0/1/2/3), not binary.

(b) Domain metrics are first-class. Track business measures separately: "What fraction of invalidating prior art did we catch?", "Is the controlling precedent in our top-N?".

7.1 Core metrics

| Metric | Definition | Domain use |
|---|---|---|
| Recall@k | Fraction of golds inside top-k | Patent prior-art: Recall@100 |
| nDCG@10 | Graded relevance weighted cumulative gain | Proxy for user satisfaction |
| MRR | Mean reciprocal rank of first hit | "Find one fast" UX |
| MAP | Mean average precision | Balanced single number |
| Coverage@k | Did domain must-have docs land in top-k (custom) | Hit rate of pivotal precedents |

7.2 Multilingual benchmarks

  • MIRACL: ad-hoc retrieval across 18 languages. Useful for ranking generic multilingual dense models. Includes Korean.
  • mMARCO: MS MARCO translated into 13 languages. More useful for training data.
  • BEIR: 14 English domains. Standard for domain generalization.

Those three measure generalization. A final domain evaluation requires your own eval set — typically a minimum of 200–500 queries with graded relevance from at least two annotators per query.

7.3 Evaluation code (minimal)

import math
 
def ndcg_at_k(gains: list[int], k: int) -> float:
    dcg = sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sorted(gains, reverse=True)[:k]
    idcg = sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
 
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    if not gold_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & gold_ids) / len(gold_ids)
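The remaining metrics from the table in 7.1 can be sketched per query in the same style. These are minimal versions assuming `retrieved_ids` is already de-duplicated. `coverage_at_k` is written here as an all-or-nothing check, which is one assumed reading of the custom metric; a fractional variant is equally plausible. MRR and MAP are the means of the first two functions over all queries.

```python
def reciprocal_rank(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """1 / rank of the first relevant result, 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    if not gold_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in gold_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold_ids)


def coverage_at_k(retrieved_ids: list[str], must_have_ids: set[str], k: int) -> float:
    """1.0 only if every must-have document is inside the top-k (assumed definition)."""
    return 1.0 if must_have_ids <= set(retrieved_ids[:k]) else 0.0
```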

8. Reference architecture: two stacks side by side

Two combinations that show up most often in production, broken down by responsibility. Either stack runs the chapter-6 pipeline unchanged.

| Responsibility | Stack A: OpenSearch + Qdrant | Stack B: Elasticsearch + Milvus |
|---|---|---|
| Sparse (BM25) | OpenSearch (Apache 2.0, nori via the analysis-nori plugin) | Elasticsearch (ELv2, kuromoji/nori plugin) |
| Dense | Qdrant (Rust, strong payload filters) | Milvus (large scale, GPU index) |
| Rerank | OpenSearch ML Commons or external | ES inference API or external |
| Managed | Amazon OpenSearch Service, self-host | Elastic Cloud, Zilliz Cloud |
| License | OSS-friendly across the board | ES is ELv2, Milvus is Apache 2.0 |

8.1 Stack A: OpenSearch + Qdrant

Dense: load chunks into Qdrant

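import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
 
qc = QdrantClient(url="http://qdrant:6333")
 
# Note: recreate_collection drops any existing collection with this name.
qc.recreate_collection(
    collection_name="patent_chunks",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
 
def point_id(chunk_id: str) -> str:
    # Qdrant point IDs must be unsigned integers or UUIDs, so derive a
    # deterministic UUID from the human-readable chunk key.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id))
 
points = [
    PointStruct(
        id=point_id(f"{doc['application_no']}#{chunk['idx']}"),
        vector=embed(chunk["text"]),
        payload={
            "chunk_id":       f"{doc['application_no']}#{chunk['idx']}",
            "application_no": doc["application_no"],
            "lang":           chunk["lang"],
            "field":          chunk["field"],     # "claim" | "abstract" | "spec"
            "ipc":            doc["ipc"],
            "filing_date":    doc["filing_date"],
            "text":           chunk["text"],
        },
    )
    for doc, chunk in iter_chunks()
]
qc.upsert(collection_name="patent_chunks", points=points)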

Sparse: OpenSearch multi-match

from opensearchpy import OpenSearch
 
os_client = OpenSearch(hosts=["https://opensearch:9200"])
 
def bm25_search(q_ko: str, q_en: str | None, size: int = 100):
    fields = ["title_ko^4", "claims_ko^3", "abstract_ko^2"]
    if q_en:
        fields += ["title_en^3", "claims_en^2", "abstract_en^1.5"]
    body = {
        "size": size,
        "_source": ["application_no", "field_id"],
        "query": {
            "multi_match": {
                "query": q_ko if not q_en else f"{q_ko} {q_en}",
                "type": "best_fields",
                "fields": fields,
                "tie_breaker": 0.3,
            }
        },
    }
    res = os_client.search(index="patent_chunks", body=body)
    return [(h["_source"]["application_no"], h["_source"]["field_id"])
            for h in res["hits"]["hits"]]

Dense search + RRF fusion

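from qdrant_client.models import FieldCondition, Filter, MatchAny
 
def dense_search(q_vec, size: int = 100, ipc_filter: list[str] | None = None):
    # Restrict to the given IPC classes when a filter is supplied.
    flt = (
        Filter(must=[FieldCondition(key="ipc", match=MatchAny(any=ipc_filter))])
        if ipc_filter else None
    )
    res = qc.search(
        collection_name="patent_chunks",
        query_vector=q_vec,
        limit=size,
        query_filter=flt,
    )
    # Prefer a readable chunk key from the payload when one was stored;
    # otherwise fall back to the point ID.
    return [(p.payload["application_no"], p.payload.get("chunk_id", p.id)) for p in res]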
 
def hybrid_search(query: str, llm, embedder, ipc_filter=None, top_n=20):
    expanded = expand_query(query)                 # 4.1
    bm25_a = bm25_search(expanded["ko"], None)
    bm25_b = bm25_search(expanded["ko"], expanded["en"])
    dense  = dense_search(embedder.encode(hyde_expand(query, llm)), ipc_filter=ipc_filter)
    fused  = rrf_fuse([bm25_a, bm25_b, dense], k=60, weights=[1.0, 0.6, 1.2])
    candidates = build_candidates(fused[:top_n * 3])  # best chunk per doc
    return rerank(query, candidates, top_n=top_n)
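`hybrid_search` relies on a `build_candidates` helper that is not shown. One way it might look is sketched below, under two assumptions: chunk text can be fetched by chunk ID through some lookup (`get_chunk_text` here, a hypothetical callable that would be bound beforehand, for example with `functools.partial`), and the first chunk ID recorded for a document is an acceptable stand-in for its best chunk.

```python
from typing import Callable


def build_candidates(
    fused: list[tuple[str, float, list[str]]],   # output of rrf_fuse()
    get_chunk_text: Callable[[str], str],        # assumed lookup: chunk_id -> text
) -> list[dict]:
    """Turn fused (doc_id, score, chunk_ids) rows into the dicts rerank() expects."""
    candidates = []
    for doc_id, rrf_score, chunk_ids in fused:
        if not chunk_ids:
            continue
        # rrf_fuse() records chunks in first-seen order, so index 0 is the
        # first chunk encountered for this doc, not necessarily its top chunk.
        best_chunk_id = chunk_ids[0]
        candidates.append(
            {
                "doc_id": doc_id,
                "rrf_score": rrf_score,
                "best_chunk_id": best_chunk_id,
                "best_chunk_text": get_chunk_text(best_chunk_id),
            }
        )
    return candidates


# Usage with an in-memory lookup standing in for the real chunk store.
texts = {"KR1#0": "A channel estimation method ...", "KR2#3": "A decoder ..."}
fused_rows = [("KR1", 0.032, ["KR1#0", "KR1#4"]), ("KR2", 0.016, ["KR2#3"])]
candidates = build_candidates(fused_rows, texts.__getitem__)
```

If the reranker should see the chunk with the highest chunk-level RRF score rather than the first-seen one, the fusion step has to track chunk scores as well.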

8.2 Stack B: Elasticsearch + Milvus

The same functionality with Elasticsearch 8.x hybrid search and Milvus 2.x.

Milvus collection definition

from pymilvus import MilvusClient, DataType
 
mc = MilvusClient(uri="http://milvus:19530")
 
schema = mc.create_schema(auto_id=False, enable_dynamic_field=False)
schema.add_field("id",             DataType.VARCHAR, is_primary=True, max_length=128)
schema.add_field("application_no", DataType.VARCHAR, max_length=32)
schema.add_field("lang",           DataType.VARCHAR, max_length=4)
schema.add_field("field",          DataType.VARCHAR, max_length=16)
schema.add_field("ipc",            DataType.VARCHAR, max_length=16)
schema.add_field("text",           DataType.VARCHAR, max_length=2048)
schema.add_field("vector",         DataType.FLOAT_VECTOR, dim=1024)
 
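# MilvusClient expects index parameters built via prepare_index_params().
index_params = mc.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)
# Passing index_params at creation time builds the index and loads the collection.
mc.create_collection(
    collection_name="patent_chunks", schema=schema, index_params=index_params
)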

Elasticsearch: BM25 + kNN in one query (own RRF recommended)

ES 8.x supports rank: { rrf: ... }, but sentence-to-document aggregation still needs custom code — so it is simpler to fetch both result lists separately and RRF them in Python.

from elasticsearch import Elasticsearch
 
es = Elasticsearch("https://es:9200")
 
def bm25_search_es(q_ko, q_en, size=100):
    body = {
        "size": size,
        "_source": ["application_no", "chunk_id"],
        "query": {
            "multi_match": {
                "query": f"{q_ko} {q_en or ''}".strip(),
                "fields": ["title_ko^4", "claims_ko^3", "abstract_ko^2",
                           "title_en^3", "claims_en^2"],
                "tie_breaker": 0.3,
            }
        },
    }
    res = es.search(index="patent_chunks", body=body)
    return [(h["_source"]["application_no"], h["_source"]["chunk_id"])
            for h in res["hits"]["hits"]]
 
def dense_search_milvus(q_vec, size=100):
    res = mc.search(
        collection_name="patent_chunks",
        data=[q_vec],
        limit=size,
        output_fields=["application_no"],
        search_params={"metric_type": "COSINE", "params": {"ef": 128}},
    )[0]
    return [(hit["entity"]["application_no"], hit["id"]) for hit in res]

From here the rrf_fuse() → rerank() flow is identical to Stack A.

8.3 How to pick between the two stacks

| Situation | Pick |
|---|---|
| Heavy Korean BM25 traffic | Stack A (nori is stable on OpenSearch and preinstalled on Amazon OpenSearch Service) |
| 100M+ chunks, GPU index needed | Stack B (Milvus scales better) |
| AWS single-cloud | Stack A (OpenSearch Service + self-hosted Qdrant) |
| Already on Elastic licensing | Stack B |
| Diverse payload filters (IPC, court, year) | Stack A (Qdrant payload index is more flexible) |

8.4 Operations checklist

  • Korean orthography rules and loanword variants (데이타 / 데이터): register both directions in the synonym dictionary
  • Japanese shinjitai vs kyūjitai ( / ): icu_normalizer
  • Traditional vs simplified Chinese (資料 / 资料): stconvert filter
  • English abbreviation getting stemmed (SAS → sa): protect with keyword_marker
  • Drawing reference numbers and claim numbers disappearing through tokenization: check word_delimiter_graph
  • Use index aliases: zero-downtime reindex via alias swap
  • Dense vector recompute policy: stamp the embedding-model version onto the chunk metadata
  • Never hardcode RRF k in production: externalize to config

Wrap-up

Multilingual specialist-domain search is not a "good embedding model" problem. It needs all six stages working together — per-language analyzers → domain dictionary → multilingual BM25 → multilingual dense → document-level RRF → cross-encoder rerank — before you get a system that does not lose a single piece of prior art. With the pipeline diagram and RRF implementation in this post, plus either of the two stack code samples, you should be able to stand up a first working system in a week.

The next post will cover what to do after this retrieval stage — how to bolt on LLM answer generation with citation, and how to reduce hallucinations in legal and patent answers. Suggestions for topics are always welcome.
