Designing Multilingual Search for Patents, Law, and Academic Papers - From BM25 to RRF and Cross-encoders
Specialist-domain multilingual search is not just a multilingual embedding plus RAG. This post breaks it down into multilingual BM25, language-specific analyzers, cross-lingual matching, sentence-to-document RRF fusion, and a side-by-side OpenSearch+Qdrant vs Elasticsearch+Milvus build-out.
This post was written in response to a reader's request — "please cover multilingual search in domains like patents, law, and academic literature: not just multilingual embeddings and LLM RAG, but keyword matching (multilingual BM25), multilingual similarity, sentence-to-document RRF strategies." Thank you for the excellent prompt — this post is the answer.
Search for patents, law, and academic papers differs from web search in one decisive way: a single missed result can mean losing an invalidation trial, missing the controlling precedent in oral argument, or shipping a literature review that ignores prior work. Meanwhile users ask in Korean, but the answers live in English claims, Japanese rulings, and Chinese journals.
This post tackles that gap without trying to solve it with one embedding model, decomposing it into four axes:
- Multilingual keyword matching (BM25 plus per-language analyzers)
- Cross-lingual information retrieval (CLIR)
- Multilingual semantic similarity
- Sentence-to-document RRF fusion plus cross-encoder reranking
The general RAG pipeline is covered in the RAG Complete Guide. This post focuses only on going deep on the retrieval stage.
1. Why "multilingual embedding + RAG" alone falls short
Specialist-domain search differs from generic search for three decisive reasons.
(1) The vocabulary itself is the answer. A patent claim's "a plurality of", a legal "third party", a paper's "p < 0.05" — these phrases cannot be paraphrased without changing legal or scientific meaning. Embedding models are trained to pull semantically similar text together, which makes them weakest exactly where you need exact lexical matching: abbreviations, proper nouns, statute numbers, chemical formulas.
(2) Recall has to approach 1.0. Web search succeeds if the answer is in the top three. Patent prior-art search has to find that one filing that might exist somewhere in the world. Precision can be fixed by a human reviewer in the second stage; a recall miss stays invisible, because nobody reviews a document that was never retrieved.
(3) The unit is not the sentence — it is the document. One patent has dozens of claims and hundreds of specification paragraphs. The user wants the most relevant patent, not the most relevant claim. Score by chunk and present the chunk list, and five chunks from the same patent will dominate the top five results.
These three properties together mean the naive "shove all docs into a multilingual embedder, top-k into an LLM" RAG fails on day one in this domain.
Takeaway: specialist-domain search must be evaluated along three axes — lexical precision × domain recall × document-level aggregation.
2. Per-domain requirement decomposition
"Specialist search" is not one problem. Units, ranking signals, and evaluation criteria all differ across the three domains. Spec them out before you touch the index.
| Item | Patents | Law | Academic papers |
|---|---|---|---|
| Search unit | 1 patent (application no.) | 1 case or 1 article of a statute | 1 paper (DOI) |
| Chunk unit | Claim, spec paragraph | Holding paragraph, reasoning section | Abstract, section, figure caption |
| Hard-match signals | IPC class, applicant | Court, instance, year, cited cases | Authors, journal, citation graph |
| Critical vocabulary | Claim wording, drawing references | Statute numbers, case numbers | Taxonomic names, formulas, abbreviations |
| Recall standard | Even one miss = invalidation risk | Missing the key case = losing | Missing the key prior work |
| Multilingual pattern | en/ko/ja/zh parallel filings | Local language + English summary | English body + non-English abstract |
These three deserve separate indices, separate weights, and separate evaluation sets, even if they share the same engine.
3. Multilingual keyword matching: pushing BM25 across languages
BM25 is a 1990s algorithm but in specialist domains it is still the strongest single signal. The trick of taking it multilingual is not the algorithm but per-language analyzers and field weights.
3.1 Per-language analyzer map
| Language | OpenSearch / Elasticsearch analyzer | Strategy | Caveats |
|---|---|---|---|
| Korean | nori (analysis-nori plugin; preinstalled on Amazon OpenSearch Service) | Morphological + compound noun decomposition | Drop particles with nori_part_of_speech |
| Japanese | kuromoji | Morphological + kana/kanji normalization | Old→new shinjitai conversion needed |
| Chinese | smartcn or ik | Word segmentation | Traditional↔simplified normalization (stconvert) |
| English | standard + english stemmer | Stemming + lowercase | Disable stemmer for domain abbreviations |
| DE/FR/ES | Per-language stemmer | Stemming | Compound splitter (hyphenation_decompounder) |
3.2 Same Korean sentence, four analyzers
Looking at what each analyzer emits makes the choice obvious.
Input: "This invention relates to a channel estimation method in wireless communication systems." (Korean: "본 발명은 무선 통신 시스템에서의 채널 추정 방법에 관한 것이다.")
| Analyzer | Tokens |
|---|---|
| standard (English default) | 본, 발명은, 무선, 통신, 시스템에서의, 채널, 추정, 방법에, 관한, 것이다 |
| nori (default) | 본, 발명, 은, 무선, 통신, 시스템, 에서, 의, 채널, 추정, 방법, 에, 관한, 것, 이다 |
| nori + POS filter | 발명, 무선, 통신, 시스템, 채널, 추정, 방법 |
| CJK bigram (fallback) | 본발, 발명, 명은, 은무, 무선, … (n-gram) |
standard glues "시스템에서의" into one token so a search for "시스템" finds nothing. nori + POS filter compresses to seven content words, giving the cleanest BM25 signal. The CJK bigram fallback is the dictionary-less safety net for new terms and proper nouns — always index it alongside.
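To make the bigram fallback concrete, here is a minimal Python sketch that reproduces the bigram row of the table above. It is an illustration only: char_bigrams is a hypothetical helper, and it slides over the text with whitespace and punctuation removed, which is a simplification and not the exact behavior of the cjk_bigram token filter.

```python
def char_bigrams(text: str) -> list[str]:
    """Overlapping character bigrams over the letters and digits of the text."""
    # Dropping whitespace and punctuation first is an assumption made to
    # mirror the table row; a real analyzer respects token boundaries.
    chars = [c for c in text if c.isalnum()]
    return [a + b for a, b in zip(chars, chars[1:])]


doc_grams = char_bigrams("본 발명은 채널 추정 방법")
query_grams = char_bigrams("채널추정")

# Every bigram of the unsegmented query also occurs in the document,
# so the fallback field can match even when no dictionary entry exists.
matched = set(query_grams) <= set(doc_grams)
```

The point of the design is that matching needs no dictionary at all: an unseen compound still shares its bigrams with the documents that contain it, at the cost of a larger and noisier index field.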
3.3 OpenSearch index mapping example (patents)
```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ko_nori": {
          "type": "custom",
          "tokenizer": "nori_tokenizer",
          "filter": ["nori_part_of_speech", "lowercase", "ko_synonyms"]
        },
        "ko_bigram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "cjk_bigram"]
        },
        "en_domain": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_minimal_stem", "ipc_synonyms"]
        }
      },
      "filter": {
        "english_minimal_stem": { "type": "stemmer", "language": "minimal_english" },
        "ko_synonyms": { "type": "synonym_graph", "synonyms_path": "synonyms/ko-patent.txt" },
        "ipc_synonyms": { "type": "synonym_graph", "synonyms_path": "synonyms/ipc.txt" }
      }
    }
  },
  "mappings": {
    "properties": {
      "application_no": { "type": "keyword" },
      "ipc": { "type": "keyword" },
      "filing_date": { "type": "date" },
      "applicant": { "type": "keyword", "fields": { "text": { "type": "text", "analyzer": "ko_nori" } } },
      "title_ko": { "type": "text", "analyzer": "ko_nori",
                    "fields": { "bigram": { "type": "text", "analyzer": "ko_bigram" } } },
      "title_en": { "type": "text", "analyzer": "en_domain" },
      "claims_ko": { "type": "text", "analyzer": "ko_nori" },
      "claims_en": { "type": "text", "analyzer": "en_domain" },
      "abstract_ko": { "type": "text", "analyzer": "ko_nori" },
      "abstract_en": { "type": "text", "analyzer": "en_domain" }
    }
  }
}
```
Two things to notice: (a) the same field indexed by two analyzers (title_ko + title_ko.bigram) and (b) separate per-language fields (*_ko, *_en) so queries can re-weight by the user's language. Custom token filters such as english_minimal_stem have to be declared under filter (here as a stemmer with language minimal_english) before an analyzer can reference them.
3.4 BM25F: field weights encode domain knowledge
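Claims weigh more than the spec, titles weigh more than abstracts. A multi-field BM25 query encodes that prior. Strictly speaking, multi_match with best_fields scores each field with its own BM25 and keeps the best boosted field score plus tie_breaker times the others, so it approximates BM25F rather than implementing it.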
```json
{
  "query": {
    "multi_match": {
      "query": "wireless communication channel estimation",
      "type": "best_fields",
      "fields": [
        "title_en^4",
        "claims_en^3",
        "abstract_en^2",
        "title_ko^2",
        "claims_ko^1.5"
      ],
      "tie_breaker": 0.3
    }
  }
}
```
Tune weights by grid search against the evaluation set from chapter 7. The heuristic "claims are 1.5–3× more important than spec" holds across nearly every patent corpus we have seen.
Takeaway: multilingual BM25 quality is decided by analyzer × field split × synonym dictionary, not the algorithm.
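The takeaway can be checked with a few lines of Python. The sketch below is a textbook Okapi BM25 scorer (with the non-negative IDF variant and k1=1.2, b=0.75 as assumed defaults), not the engine's actual implementation; bm25_scores and the two-document toy corpus are hypothetical. It scores the same query against the standard and nori + POS token lists from 3.2.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if t not in tf:
                continue  # a term that was never emitted as a token contributes nothing
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


other = ["다른", "문서"]
standard_tokens = ["본", "발명은", "무선", "통신", "시스템에서의",
                   "채널", "추정", "방법에", "관한", "것이다"]
nori_pos_tokens = ["발명", "무선", "통신", "시스템", "채널", "추정", "방법"]

score_standard = bm25_scores(["시스템"], [standard_tokens, other])[0]
score_nori = bm25_scores(["시스템"], [nori_pos_tokens, other])[0]
```

The scorer is identical in both calls; only the analyzer output changes. With the standard tokens the query term never appears as a token, so the document scores zero, while the nori + POS tokens give a positive score.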
4. Query, translation, and cross-lingual matching (CLIR)
A user enters "무선 통신 채널 추정" in Korean and you have to find an English patent for "channel estimation in wireless communication". Three approaches.
| Strategy | How | Pros | Cons |
|---|---|---|---|
| (A) Query Translation | Translate the query into every indexed language, run BM25 separately | Index untouched, simple | Short queries translate badly |
| (B) Document Translation | Pre-translate every document to every language, single index | Fastest at query time | Index size × N, large translation cost |
| (C) Cross-lingual Embedding | Dense search in a language-agnostic embedding space | Single representation for query and doc | Loses exact lexical matches |
Recommended in practice: (A) + (C) hybrid. Translate the query with a domain glossary first, complement with a separate multilingual encoder, then fuse with the chapter-6 RRF. Avoid (B) in domains like patent claims and statutes where translation itself can change legal meaning.
4.1 Glossary-driven query expansion
LLM translation might render "채널 추정" as "channel guess". A pinned domain glossary is safer.
```python
GLOSSARY_KO_EN = {
    "채널 추정": ["channel estimation"],
    "무선 통신": ["wireless communication", "radio communication"],
    "직교 주파수 분할 다중화": ["OFDM", "orthogonal frequency-division multiplexing"],
    "제3자": ["third party"],
    "선행기술": ["prior art"],
}


def expand_query(q_ko: str) -> dict[str, str | None]:
    """Collect the English glossary terms whose Korean key occurs in the query."""
    en_terms: list[str] = []
    for ko_term, en_list in GLOSSARY_KO_EN.items():
        if ko_term in q_ko:
            en_terms.extend(en_list)
    # "en" is None when no glossary entry matched, so callers can skip the English fields.
    return {"ko": q_ko, "en": " ".join(en_terms) or None}
```
The expanded query is matched against the *_ko and *_en fields from 3.3 in a single multi-match query.
4.2 Tokens that must never reach the translator
- Chemical and math formulas: H2SO4, O(n log n)
- Abbreviations and proper nouns: OFDM, LSTM, K-NN
- Statute and case numbers: Supreme Court 2019Da12345
- IPC/CPC class codes: H04L 27/26
- Drawing reference numerals: 100, 100a
Standard practice: regex-mask before translation, then unmask the output.
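A minimal sketch of that mask/unmask step follows. Everything in it is an assumption for illustration: the three patterns cover only IPC codes, one case-number shape, and uppercase abbreviations (drawing numerals such as 100a are not handled), and the __M0__ placeholder format is hypothetical. A real translator may still mangle placeholders, so verify them after translation.

```python
import re

# ASCII-only boundaries, so a Korean particle may follow a token directly (OFDM의).
_B = r"(?<![A-Za-z0-9])"
_E = r"(?![A-Za-z0-9])"

PROTECTED = re.compile("|".join([
    _B + r"[A-H]\d{2}[A-Z]\s?\d{1,4}/\d{2,6}" + _E,             # IPC/CPC code: H04L 27/26
    _B + r"\d{4}Da\d+" + _E,                                    # case number: 2019Da12345
    _B + r"[A-Z](?:[A-Z0-9]+|-[A-Z0-9]+)(?:-[A-Z0-9]+)*" + _E,  # OFDM, K-NN, H2SO4
]))


def mask(text: str) -> tuple[str, list[str]]:
    """Replace protected tokens with numbered placeholders."""
    slots: list[str] = []

    def _stash(match: re.Match) -> str:
        slots.append(match.group(0))
        return f"__M{len(slots) - 1}__"

    return PROTECTED.sub(_stash, text), slots


def unmask(text: str, slots: list[str]) -> str:
    """Put the original tokens back in place of the placeholders."""
    return re.sub(r"__M(\d+)__", lambda m: slots[int(m.group(1))], text)


original = "OFDM의 채널 추정, H04L 27/26, 2019Da12345"
masked, slots = mask(original)
restored = unmask(masked, slots)  # translation would happen between mask and unmask
```

The IPC pattern is listed first on purpose: the alternation tries patterns left to right, so H04L 27/26 is captured whole instead of being split by the abbreviation rule.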
5. Multilingual semantic similarity
BM25 alone cannot bridge the gap between "channel estimation", "a method of channel estimation", and "radio channel identification". That is where dense search enters. Picking a multilingual embedding model for a specialist domain is more delicate than the general RAG case — see the Embedding Model Guide for a broader comparison. Here we list only the domain-search-specific criteria.
5.1 Multilingual embedding candidates
| Model | Dim | Langs | Strengths | Weaknesses |
|---|---|---|---|---|
| LaBSE | 768 | 109 | Strong on short query/sentence alignment, BERT-stable | Loses on long documents |
| multilingual-e5-large | 1024 | 100+ | Trained for query/doc asymmetry (query: / passage: prefixes) | Prefix mandatory |
| BGE-M3 | 1024 | 100+ | Emits dense + sparse + multi-vector in a single pass | Indexing complexity |
| Cohere embed-multilingual-v3 | 1024 | 100+ | API, commercial quality | External call |
Default for specialist domains: BGE-M3. A single pass yields dense + sparse + multi-vector (ColBERT-style) representations, so keyword and semantic signals come from the same model.
5.2 Asymmetric length problem
Patent claims often run 500+ characters in a single sentence; user queries are under 10 characters. Address the asymmetry from both sides.
(a) Unify chunk size: split each claim again on period + semicolon to 100–200 tokens. (b) Expand the query: for short queries, generate a one-to-two sentence hypothetical answer with an LLM and embed that (HyDE).
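```python
def hyde_expand(query: str, llm) -> str:
    """Ask the LLM for a hypothetical abstract and return it for embedding (HyDE)."""
    prompt = f"""Write a one-paragraph patent abstract that would directly answer:
"{query}"
Use technical vocabulary. Do not add disclaimers."""
    return llm.complete(prompt)


# llm, embed, and user_query stand in for your own LLM client, embedding function, and input.
q_vec = embed(hyde_expand(user_query, llm))
```
HyDE typically lifts dense recall on short queries by 8–15 percentage points on BEIR-class and domain evaluation sets.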
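Step (a), re-splitting long claims, can be sketched as below. This is one possible reading, not a prescribed method: split_claim is a hypothetical helper, whitespace-separated words are used as a rough proxy for model tokens, and a single clause longer than the limit is left whole.

```python
import re


def split_claim(claim: str, max_tokens: int = 200) -> list[str]:
    """Split on '.' or ';' and greedily pack clauses into chunks of at most max_tokens words."""
    # The lookbehind keeps the delimiter attached to the clause it ends.
    pieces = [p.strip() for p in re.split(r"(?<=[.;])\s+", claim) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for piece in pieces:
        n = len(piece.split())
        # Start a new chunk when adding this clause would exceed the budget.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(piece)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


claim = ("A method comprising: receiving a signal; "
         "estimating a channel; and decoding the signal.")
small_chunks = split_claim(claim, max_tokens=6)
one_chunk = split_claim(claim, max_tokens=200)
```

Splitting only at clause boundaries keeps each chunk a grammatical unit. In production the word count would be replaced by the embedding model's own tokenizer, and abbreviations ending in a period would need an exception list.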
5.3 Domain adaptation
The base model alone is not enough for IPC codes, statute numbering, or taxonomy. Two options.
- LoRA fine-tune: collect 10k–100k domain (query, positive doc, negative doc) triples and train 1–2 contrastive epochs. Typical nDCG@10 lift of 5–10 percentage points.
- In-context query rewriting: keep the model frozen; do dictionary lookup and abbreviation expansion in query preprocessing.
Start with the latter when domain data is scarce; switch to the former once you have collected on the order of 10k labeled triples.
6. Sentence-to-document RRF fusion (the core of this post)
After chapters 3 (BM25) and 5 (dense), assume both have returned a top-K. Two problems remain.
- How do you combine two result lists with different score scales?
- When you searched at sentence level and the same document has multiple chunks in the result, how do you aggregate to a per-document ranking?
Both answers are RRF (Reciprocal Rank Fusion). RRF uses only ranks, never raw scores, so combining incomparable systems is safe. And summing the reciprocal-rank contributions of multiple chunks from the same document naturally produces a document-level score.
6.1 RRF formula
$$\mathrm{RRF}(d) = \sum_{r \in R(d)} \frac{1}{k + \mathrm{rank}_r(d)}$$
- d: document (or sentence chunk)
- R(d): every ranking system in which d appeared
- k: flattening constant, typically 60
- rank_r(d): 1-based rank of d in system r
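A small worked example of the formula, with made-up ranks, shows how document-level aggregation behaves. Here rrf_score is a hypothetical helper and the rank lists are invented for illustration: document A has three chunks at ranks 1, 4, and 2 across the runs, while document B has a single chunk at rank 1.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Sum of 1 / (k + rank) over every 1-based rank at which a document's chunks appear."""
    return sum(1.0 / (k + r) for r in ranks)


doc_a = rrf_score([1, 4, 2])   # 1/61 + 1/64 + 1/62
doc_b = rrf_score([1])         # 1/61

# With k=60 the per-hit contributions are nearly flat, so three good hits
# outweigh one top hit; a small k makes the top rank count for much more.
ratio_k60 = rrf_score([1], k=60) / rrf_score([10], k=60)
ratio_k1 = rrf_score([1], k=1) / rrf_score([10], k=1)
```

With k=60, rank 1 is worth only slightly more than rank 10 (70/61), whereas with k=1 it is worth 5.5 times as much. That flattening is why several supporting chunks can lift a document above one with a single top-ranked chunk.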
6.2 Full pipeline
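The end-to-end flow: query → glossary expansion and HyDE → sentence-level BM25 and dense retrieval (top-100 each) → document-level RRF → cross-encoder rerank of the top candidates.
The crucial pattern: search at sentence level, aggregate at document level. BM25 and dense each score sentence chunks and return top-100. Chunks sharing the same application_no (patent) / case_id (precedent) / doi (paper) have their reciprocal-rank contributions summed by RRF into a document score.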
6.3 RRF implementation (Python, stack-agnostic)
```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[tuple[str, str]]],  # one list of (doc_id, chunk_id) per retriever
    k: int = 60,
    weights: list[float] | None = None,
) -> list[tuple[str, float, list[str]]]:
    """Sentence-chunk rankings -> document-level RRF scores.

    Each ranking list is assumed already sorted by rank.
    Returns (doc_id, fused_score, chunk_ids) tuples, best document first.
    """
    weights = weights or [1.0] * len(rankings)
    doc_score: dict[str, float] = defaultdict(float)
    doc_chunks: dict[str, list[str]] = defaultdict(list)
    for ranking, w in zip(rankings, weights):
        for rank, (doc_id, chunk_id) in enumerate(ranking, start=1):
            # If you want to dampen the contribution of the 2nd+ chunk of
            # the same doc within one run, add a decay here.
            doc_score[doc_id] += w / (k + rank)
            if chunk_id not in doc_chunks[doc_id]:
                doc_chunks[doc_id].append(chunk_id)
    fused = sorted(
        ((d, s, doc_chunks[d]) for d, s in doc_score.items()),
        key=lambda x: x[1],
        reverse=True,
    )
    return fused
```
6.4 Tuning weights and k
k=60 is the original value (Cormack et al., 2009) and works in most cases. Weights typically start at [1.0, 1.0, 1.0] and grid-search toward something like [BM25, expanded BM25, dense] = [1.0, 0.6, 1.2].
Per-domain bias:
- Patents: nudge BM25 up to 1.2–1.5 (lexical precision matters most)
- Law: keep BM25 and dense roughly equal (statute citations are exact, facts are paraphrased)
- Papers: lift dense to 1.3–1.5 (authors phrase the same idea many ways)
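The grid search over fusion weights can be sketched as follows. This is a toy illustration under stated assumptions: fuse, recall_at_k, and grid_search are hypothetical helpers working on document ids only, the single query and its gold set are invented, and Recall@k is just one possible objective.

```python
from collections import defaultdict
from itertools import product


def fuse(rankings: list[list[str]], weights: tuple[float, ...], k: int = 60) -> list[str]:
    """Weighted RRF over document-id rankings; returns doc ids, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += w / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    return len(set(ranked[:k]) & gold) / len(gold) if gold else 0.0


def grid_search(queries, grid, k_eval=10):
    """queries: list of (rankings, gold_ids). Returns (best_weights, best_mean_recall)."""
    n_runs = len(queries[0][0])
    best_weights, best_recall = None, -1.0
    # Try every combination of one grid value per retriever.
    for weights in product(grid, repeat=n_runs):
        mean_recall = sum(
            recall_at_k(fuse(rankings, weights), gold, k_eval)
            for rankings, gold in queries
        ) / len(queries)
        if mean_recall > best_recall:
            best_weights, best_recall = weights, mean_recall
    return best_weights, best_recall


# Invented example: run 0 plays BM25, run 1 plays dense; the gold doc "x"
# is ranked first by dense and last by BM25.
toy_queries = [([["a", "b", "x"], ["x", "b", "a"]], {"x"})]
best_weights, best_recall = grid_search(toy_queries, grid=[0.5, 1.0, 1.5], k_eval=1)
```

The search is exhaustive, so its cost grows as the grid size raised to the number of retrievers. With three runs and a handful of values per weight that stays cheap, because fusion reuses cached rankings and never re-queries the engines.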
6.5 Cross-encoder reranking
After RRF, rerank only the top 20–50 with a cross-encoder. A cross-encoder concatenates (query, passage) into one transformer pass to produce a direct score — expensive but accurate. In specialist-domain search, the final nDCG is nearly always best with cross-encoder reranking enabled.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)


def rerank(query: str, candidates: list[dict], top_n: int = 20) -> list[dict]:
    # candidates: [{"doc_id": ..., "best_chunk_text": ...}, ...]
    pairs = [(query, c["best_chunk_text"]) for c in candidates]
    scores = reranker.predict(pairs, batch_size=32)
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    return sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)[:top_n]
```
When you rerank at the document level, do not feed the whole document. Feed the single chunk that ranked highest under RRF for that document. That cost/accuracy tradeoff has consistently been best in our deployments.
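The hybrid_search function in 8.1 calls a build_candidates helper that the post does not define. One way it could look is sketched below, under explicit assumptions: the signature (it takes the raw rankings and a chunk_text lookup in addition to the fused list) is hypothetical, chunk ids are assumed to be unique and shared between the BM25 and dense runs, and "highest under RRF" is read as the chunk with the largest chunk-level RRF sum.

```python
from collections import defaultdict


def best_chunk_per_doc(rankings: list[list[tuple[str, str]]], k: int = 60) -> dict[str, str]:
    """Chunk-level RRF: map each doc_id to its highest-scoring chunk_id."""
    chunk_score: dict[tuple[str, str], float] = defaultdict(float)
    for ranking in rankings:
        for rank, (doc_id, chunk_id) in enumerate(ranking, start=1):
            chunk_score[(doc_id, chunk_id)] += 1.0 / (k + rank)
    best: dict[str, tuple[str, float]] = {}
    for (doc_id, chunk_id), score in chunk_score.items():
        if doc_id not in best or score > best[doc_id][1]:
            best[doc_id] = (chunk_id, score)
    return {doc_id: chunk_id for doc_id, (chunk_id, _) in best.items()}


def build_candidates(fused, rankings, chunk_text: dict[str, str]) -> list[dict]:
    """Turn fused (doc_id, score, chunk_ids) rows into the dicts that rerank() expects."""
    best = best_chunk_per_doc(rankings)
    return [
        {"doc_id": doc_id, "best_chunk_text": chunk_text[best[doc_id]]}
        for doc_id, _score, _chunk_ids in fused
    ]


# Invented data: two runs over two patents.
runs = [
    [("P1", "c1"), ("P2", "c9"), ("P1", "c2")],
    [("P1", "c2"), ("P2", "c9")],
]
texts = {"c1": "claim one", "c2": "claim two", "c9": "claim nine"}
fused_rows = [("P1", 0.0, ["c1", "c2"]), ("P2", 0.0, ["c9"])]
candidates = build_candidates(fused_rows, runs, texts)
```

In the toy data chunk c2 wins for P1 because it appears in both runs, even though c1 holds the single best rank. That matches the intent of passing the reranker the chunk with the broadest support.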
Takeaway: RRF merges the rankings, RRF also changes the unit from chunk to document, and a cross-encoder grabs the final accuracy. Those three steps decide ~80% of specialist-search quality.
7. Evaluation — how to measure specialist search
Tuning a search system without an evaluation set is gambling. Specialist evaluation differs from generic IR in two ways.
(a) There is no single correct answer. One query may have 50 relevant patents. You need graded relevance (0/1/2/3), not binary.
(b) Domain metrics are first-class. Track business measures separately: "What fraction of invalidating prior art did we catch?", "Is the controlling precedent in our top-N?".
7.1 Core metrics
| Metric | Definition | Domain use |
|---|---|---|
| Recall@k | Fraction of golds inside top-k | Patent prior-art: Recall@100 |
| nDCG@10 | Graded relevance weighted cumulative gain | Proxy for user satisfaction |
| MRR | Mean reciprocal rank of first hit | "Find one fast" UX |
| MAP | Mean average precision | Balanced single number |
| Coverage@k | Did domain must-have docs land in top-k (custom) | Hit rate of pivotal precedents |
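The rank-based metrics in the table that section 7.3 does not cover can be sketched in a few lines. These are per-query building blocks: MRR and MAP are the means of reciprocal_rank and average_precision over all queries. The all-or-nothing reading of Coverage@k is an assumption, since the table only describes it as a custom metric.

```python
def reciprocal_rank(retrieved: list[str], gold: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for i, doc_id in enumerate(retrieved, start=1):
        if doc_id in gold:
            return 1.0 / i
    return 0.0


def average_precision(retrieved: list[str], gold: set[str]) -> float:
    """Mean of precision@i taken at each rank i that holds a relevant document."""
    hits, total = 0, 0.0
    for i, doc_id in enumerate(retrieved, start=1):
        if doc_id in gold:
            hits += 1
            total += hits / i
    return total / len(gold) if gold else 0.0


def coverage_at_k(retrieved: list[str], must_have: set[str], k: int) -> float:
    """Assumed definition: 1.0 only if every must-have document is inside top-k."""
    return 1.0 if must_have <= set(retrieved[:k]) else 0.0


def mean(values: list[float]) -> float:
    return sum(values) / len(values) if values else 0.0


retrieved = ["a", "b", "c", "d"]
gold = {"b", "d"}
mrr = mean([reciprocal_rank(retrieved, gold)])
map_score = mean([average_precision(retrieved, gold)])
```

A fractional Coverage@k would collapse into Recall@k over the must-have set, which is why the strict variant is used here. It answers the yes/no business question of whether the pivotal documents were all caught.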
7.2 Multilingual benchmarks
- MIRACL: ad-hoc retrieval across 18 languages. Useful for ranking generic multilingual dense models. Includes Korean.
- mMARCO: MS MARCO translated into 13 languages. More useful for training data.
- BEIR: 14 English domains. Standard for domain generalization.
Those three measure generalization. A final domain evaluation requires your own eval set — typically a minimum of 200–500 queries with graded relevance from at least two annotators per query.
7.3 Evaluation code (minimal)
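```python
import math


def ndcg_at_k(gains: list[int], k: int) -> float:
    """nDCG@k for graded gains listed in retrieved order."""
    dcg = sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    # The ideal ordering is built from the gains passed in, so append the gains of
    # judged-but-unretrieved documents if they should count against the score.
    ideal = sorted(gains, reverse=True)[:k]
    idcg = sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold documents found in the top-k results."""
    if not gold_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & gold_ids) / len(gold_ids)
```
8. Reference architecture: two stacks side by side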
The two combinations that show up most often in production are broken down by responsibility below. Either stack runs the chapter-6 pipeline unchanged.
| Responsibility | Stack A: OpenSearch + Qdrant | Stack B: Elasticsearch + Milvus |
|---|---|---|
| Sparse (BM25) | OpenSearch (Apache 2.0, nori via the analysis-nori plugin) | Elasticsearch (ELv2, kuromoji/nori plugins) |
| Dense | Qdrant (Rust, strong payload filters) | Milvus (large scale, GPU index) |
| Rerank | OpenSearch ML Commons or external | ES inference API or external |
| Managed | AWS OpenSearch Service, self-host | Elastic Cloud, Zilliz Cloud |
| License | OSS-friendly across the board | ES is ELv2, Milvus is Apache 2.0 |
8.1 Stack A: OpenSearch + Qdrant
Dense: load chunks into Qdrant
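```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

qc = QdrantClient(url="http://qdrant:6333")
# recreate_collection drops any existing collection with this name.
qc.recreate_collection(
    collection_name="patent_chunks",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# embed() and iter_chunks() stand in for your embedding function and chunk iterator.
points = [
    PointStruct(
        # Qdrant point ids must be unsigned integers or UUIDs, so derive a
        # deterministic UUID from the readable "application_no#idx" key.
        id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc['application_no']}#{chunk['idx']}")),
        vector=embed(chunk["text"]),
        payload={
            "application_no": doc["application_no"],
            "chunk_id": f"{doc['application_no']}#{chunk['idx']}",
            "lang": chunk["lang"],
            "field": chunk["field"],  # "claim" | "abstract" | "spec"
            "ipc": doc["ipc"],
            "filing_date": doc["filing_date"],
            "text": chunk["text"],
        },
    )
    for doc, chunk in iter_chunks()
]
qc.upsert(collection_name="patent_chunks", points=points)
```
Sparse: OpenSearch multi-match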
```python
from opensearchpy import OpenSearch

os_client = OpenSearch(hosts=["https://opensearch:9200"])


def bm25_search(q_ko: str, q_en: str | None, size: int = 100):
    fields = ["title_ko^4", "claims_ko^3", "abstract_ko^2"]
    if q_en:
        fields += ["title_en^3", "claims_en^2", "abstract_en^1.5"]
    body = {
        "size": size,
        "_source": ["application_no", "field_id"],
        "query": {
            "multi_match": {
                "query": q_ko if not q_en else f"{q_ko} {q_en}",
                "type": "best_fields",
                "fields": fields,
                "tie_breaker": 0.3,
            }
        },
    }
    res = os_client.search(index="patent_chunks", body=body)
    return [(h["_source"]["application_no"], h["_source"]["field_id"])
            for h in res["hits"]["hits"]]
```
Dense search + RRF fusion
```python
from qdrant_client.models import FieldCondition, Filter, MatchAny


def dense_search(q_vec, size: int = 100, ipc_filter: list[str] | None = None):
    # Restrict to the given IPC classes when a filter is supplied.
    flt = (
        Filter(must=[FieldCondition(key="ipc", match=MatchAny(any=ipc_filter))])
        if ipc_filter
        else None
    )
    res = qc.search(
        collection_name="patent_chunks",
        query_vector=q_vec,
        limit=size,
        query_filter=flt,
    )
    return [(p.payload["application_no"], p.id) for p in res]


def hybrid_search(query: str, llm, embedder, ipc_filter=None, top_n=20):
    expanded = expand_query(query)  # 4.1
    bm25_a = bm25_search(expanded["ko"], None)
    bm25_b = bm25_search(expanded["ko"], expanded["en"])
    dense = dense_search(embedder.encode(hyde_expand(query, llm)), ipc_filter=ipc_filter)
    fused = rrf_fuse([bm25_a, bm25_b, dense], k=60, weights=[1.0, 0.6, 1.2])
    # build_candidates is a helper (not shown) that picks the best chunk per doc.
    candidates = build_candidates(fused[:top_n * 3])
    return rerank(query, candidates, top_n=top_n)
```
8.2 Stack B: Elasticsearch + Milvus
The same functionality with Elasticsearch 8.x hybrid search and Milvus 2.x.
Milvus collection definition
```python
from pymilvus import DataType, MilvusClient

mc = MilvusClient(uri="http://milvus:19530")

schema = mc.create_schema(auto_id=False, enable_dynamic_field=False)
schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=128)
schema.add_field("application_no", DataType.VARCHAR, max_length=32)
schema.add_field("lang", DataType.VARCHAR, max_length=4)
schema.add_field("field", DataType.VARCHAR, max_length=16)
schema.add_field("ipc", DataType.VARCHAR, max_length=16)
schema.add_field("text", DataType.VARCHAR, max_length=2048)
schema.add_field("vector", DataType.FLOAT_VECTOR, dim=1024)
mc.create_collection(collection_name="patent_chunks", schema=schema)

# MilvusClient expects an IndexParams object rather than a list of dicts.
index_params = mc.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)
mc.create_index(collection_name="patent_chunks", index_params=index_params)
# A collection created without index params is not loaded; load it before searching.
mc.load_collection(collection_name="patent_chunks")
```
Elasticsearch: BM25 + kNN in one query (own RRF recommended)
ES 8.x supports rank: { rrf: ... }, but sentence-to-document aggregation still needs custom code — so it is simpler to fetch both result lists separately and RRF them in Python.
```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es:9200")


def bm25_search_es(q_ko, q_en, size=100):
    body = {
        "size": size,
        "_source": ["application_no", "chunk_id"],
        "query": {
            "multi_match": {
                "query": f"{q_ko} {q_en or ''}".strip(),
                "fields": ["title_ko^4", "claims_ko^3", "abstract_ko^2",
                           "title_en^3", "claims_en^2"],
                "tie_breaker": 0.3,
            }
        },
    }
    res = es.search(index="patent_chunks", body=body)
    return [(h["_source"]["application_no"], h["_source"]["chunk_id"])
            for h in res["hits"]["hits"]]


def dense_search_milvus(q_vec, size=100):
    # mc is the MilvusClient from the collection definition above.
    res = mc.search(
        collection_name="patent_chunks",
        data=[q_vec],
        limit=size,
        output_fields=["application_no"],
        search_params={"metric_type": "COSINE", "params": {"ef": 128}},
    )[0]
    return [(hit["entity"]["application_no"], hit["id"]) for hit in res]
```
From here the rrf_fuse() → rerank() flow is identical to Stack A.
8.3 How to pick between the two stacks
| Situation | Pick |
|---|---|
| Heavy Korean BM25 traffic | Stack A (OpenSearch's nori plugin is mature and stable) |
| 100M+ chunks, GPU index needed | Stack B (Milvus scales better) |
| AWS single-cloud | Stack A (OpenSearch Service + self-hosted Qdrant) |
| Already on Elastic licensing | Stack B |
| Diverse payload filters (IPC, court, year) | Stack A (Qdrant payload index is more flexible) |
8.4 Operations checklist
- Korean orthography rules and loanword variants (데이타/데이터): register both directions in the synonym dictionary
- Japanese shinjitai vs kyūjitai (国/國): icu_normalizer
- Traditional vs simplified Chinese (資料/资料): stconvert filter
- English abbreviation getting stemmed (SAS→sa): protect with keyword_marker
- Drawing reference numbers and claim numbers disappearing through tokenization: check word_delimiter_graph
- Use index aliases: zero-downtime reindex via alias swap
- Dense vector recompute policy: stamp the embedding-model version onto the chunk metadata
- Never hardcode RRF k in production: externalize to config
Wrap-up
Multilingual specialist-domain search is not a "good embedding model" problem. It needs all six stages working together — per-language analyzers → domain dictionary → multilingual BM25 → multilingual dense → document-level RRF → cross-encoder rerank — before you get a system that does not lose a single piece of prior art. With the pipeline diagram and RRF implementation in this post, plus either of the two stack code samples, you should be able to stand up a first working system in a week.
The next post will cover what to do after this retrieval stage — how to bolt on LLM answer generation with citation, and how to reduce hallucinations in legal and patent answers. Suggestions for topics are always welcome.