Embedding Model Selection Guide - From Concepts to Korean Language Benchmarks
A comprehensive guide covering embedding model concepts, major model comparison (OpenAI, Cohere, BGE-M3, E5, multilingual-e5), Korean language performance benchmarks, dimensionality reduction, fine-tuning, and selection strategies for RAG applications.
Embeddings are the foundational building block of modern NLP systems, from semantic search to RAG pipelines. Choosing the right embedding model directly impacts retrieval quality, latency, and cost. This guide covers everything from embedding fundamentals to Korean-language benchmarks and production deployment strategies.
1. What are Embeddings?
Definition
An embedding is a dense vector representation of text (or other data) in a continuous, high-dimensional space. Unlike sparse representations such as TF-IDF or bag-of-words, embeddings capture semantic meaning -- words and sentences with similar meanings are mapped to nearby points in vector space.
Text Input → Embedding Vector (simplified)
─────────────────────────────────────────────────────────────
"The cat sat on the mat" → [0.12, -0.34, 0.56, 0.78, ...]
"A kitten rested on a rug" → [0.11, -0.32, 0.55, 0.80, ...]
"Stock market crashed" → [-0.87, 0.45, -0.12, 0.03, ...]
The first two sentences have similar vectors because their meanings are close, while the third sentence about finance is far away in vector space.
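To make "nearby" concrete, the sketch below computes cosine similarity over only the four components printed above. Real embeddings have hundreds of dimensions, so the exact numbers are illustrative only, and the variable names are placeholders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Only the four components shown in the example; the rest are elided there.
cat = [0.12, -0.34, 0.56, 0.78]
kitten = [0.11, -0.32, 0.55, 0.80]
stocks = [-0.87, 0.45, -0.12, 0.03]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_stocks = cosine_similarity(cat, stocks)
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs stocks: {sim_cat_stocks:.3f}")
```

The first pair scores close to 1, while the finance sentence scores below zero against the cat sentence.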
How Text-to-Vector Conversion Works
The embedding pipeline follows a consistent pattern:
Raw Text → Tokenization → Token IDs → Transformer Encoder → Hidden States → Pooling → Embedding Vector
- Tokenization: Text is split into subword tokens (e.g., WordPiece, BPE)
- Token Encoding: Each token is mapped to an ID and passed through the model
- Transformer Layers: Self-attention layers build contextual representations
- Pooling: Token-level representations are aggregated into a single vector
Semantic Space Visualization
The classic example illustrating semantic relationships in embedding space:
king ─────── queen
│ │
│ (gender) │
│ │
man ─────── woman
king - man + woman ≈ queen
This demonstrates that embeddings capture not just similarity but also analogical relationships. The vector arithmetic king - man + woman yields a vector close to queen, showing that the model has learned the concept of gender as a direction in vector space.
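The arithmetic can be tried on hand-built toy vectors. In this sketch the two axes ("royalty" and "gender") and every value are assumptions chosen to make the geometry obvious; they are not output from any real model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy vocabulary: axis 0 = "royalty", axis 1 = "gender" (hypothetical values).
vocab = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vocab[w]))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is the usual convention for this test, since the result vector often stays closest to one of its own inputs.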
Types of Embeddings
| Level | Description | Use Case | Example Models |
|---|---|---|---|
| Word-level | One vector per word, context-independent | Word similarity, simple classification | Word2Vec, GloVe, FastText |
| Sentence-level | One vector per sentence, context-aware | Semantic search, similarity matching | Sentence-BERT, E5, BGE |
| Document-level | One vector per document or paragraph | Document retrieval, clustering | Doc2Vec, long-context embedding models |
Note: Modern embedding models (2023+) primarily operate at the sentence/passage level using transformer architectures. Word-level embeddings like Word2Vec are largely superseded for production use cases.
Embedding Dimensions Explained
The dimension of an embedding vector determines how much information it can encode:
| Dimension | Storage per Vector (float32) | Characteristics |
|---|---|---|
| 384 | 1.5 KB | Lightweight, fast, good for simple tasks |
| 768 | 3 KB | Standard size, good quality-cost balance |
| 1024 | 4 KB | Higher quality, moderate cost |
| 1536 | 6 KB | High quality (OpenAI default) |
| 3072 | 12 KB | Maximum quality, highest storage cost |
Higher dimensions generally improve quality but increase:
- Storage cost: Proportional to dimension size
- Search latency: Distance computation scales linearly
- Memory usage: Index size grows with dimension
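The storage figures follow from simple arithmetic, assuming 4-byte float32 values. The helper names below are made up for illustration, and the estimate ignores index overhead and metadata.

```python
BYTES_PER_FLOAT32 = 4

def vector_bytes(dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of one embedding vector in bytes."""
    return dimensions * bytes_per_value

def index_gigabytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * vector_bytes(dimensions, bytes_per_value) / 1e9

for dims in (384, 768, 1024, 1536, 3072):
    kb = vector_bytes(dims) / 1024
    gb = index_gigabytes(1_000_000, dims)
    print(f"{dims:>5} dims: {kb:.1f} KB per vector, {gb:.2f} GB per 1M vectors")
```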
2. Embedding Model Architecture
Bi-Encoder vs Cross-Encoder
The two fundamental architectures for computing text similarity serve different purposes:
[Bi-Encoder]
Query → Encoder → Vector Q
Doc → Encoder → Vector D
Similarity = cosine(Q, D)
✓ Pre-compute document vectors
✓ Fast retrieval (ANN search)
✗ Lower accuracy
[Cross-Encoder]
Query + Document → Encoder → Relevance Score
Score = sigmoid(output)
✗ Must process pair together
✗ Slow (O(n) for n documents)
✓ Higher accuracy
| Feature | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Input | Single text | Text pair (query + document) |
| Output | Vector embedding | Relevance score |
| Speed | Fast (independent encoding) | Slow (joint encoding) |
| Accuracy | Good | Superior |
| Offline indexing | Yes (pre-compute vectors) | No |
| Typical use | First-stage retrieval | Re-ranking stage |
| Scalability | Millions of documents | Top-K re-ranking (10-100) |
Note: In production RAG systems, the best practice is to use a Bi-Encoder for initial retrieval (top-100) and then a Cross-Encoder for re-ranking (top-100 to top-10). This combines speed with accuracy.
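A minimal sketch of the retrieve-then-rerank flow follows. The 2-d vectors stand in for precomputed bi-encoder output, and `overlap_score` is a hypothetical placeholder for a real cross-encoder; only the control flow is the point here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, doc_vecs, top_k):
    """Stage 1 (bi-encoder): rank precomputed document vectors by dot product."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:top_k]

def rerank(query, candidate_ids, docs, score_pair, top_k):
    """Stage 2 (cross-encoder): score each (query, document) pair jointly."""
    ranked = sorted(candidate_ids, key=lambda d: score_pair(query, docs[d]), reverse=True)
    return ranked[:top_k]

def overlap_score(query, document):
    """Placeholder for a cross-encoder: fraction of query words found in the document."""
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words) / len(q_words)

docs = {
    "d1": "paris is the capital of france",
    "d2": "berlin is the capital of germany",
    "d3": "the weather is sunny today",
}
doc_vecs = {"d1": [0.9, 0.1], "d2": [0.8, 0.2], "d3": [0.1, 0.9]}  # toy vectors
query, query_vec = "capital of france", [1.0, 0.0]

candidates = retrieve(query_vec, doc_vecs, top_k=2)               # cheap, over all docs
final = rerank(query, candidates, docs, overlap_score, top_k=1)   # costly, over few docs
print(candidates, final)
```

The expensive pairwise scorer only ever sees the short candidate list, which is what keeps the second stage affordable.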
Sentence-BERT Architecture
Sentence-BERT (SBERT) is the foundational architecture for modern sentence embedding models. It fine-tunes BERT with a siamese/triplet network structure:
┌──────────────────────────────────────────────┐
│ Sentence-BERT │
│ │
│ Sentence A Sentence B │
│ │ │ │
│ [BERT Encoder] [BERT Encoder] │
│ (shared weights) (shared weights) │
│ │ │ │
│ [Pooling] [Pooling] │
│ │ │ │
│ Vector u Vector v │
│ │ │ │
│ └──── Objective ───┘ │
│ Function │
│ (cosine similarity / contrastive loss) │
└──────────────────────────────────────────────┘
Key properties:
- Shared weights: Both sentences are encoded by the same model
- Independent encoding: Each sentence is processed separately
- Fixed-size output: Regardless of input length, output dimension is constant
Contrastive Learning
Modern embedding models are trained using contrastive learning objectives. The goal is to pull similar pairs closer and push dissimilar pairs apart in vector space.
InfoNCE Loss (common training objective):
L = -log( exp(sim(q, d+) / τ) / Σ exp(sim(q, di) / τ) )
Where:
q = query embedding
d+ = positive (relevant) document embedding
di = all documents in batch (positive + negatives)
τ = temperature parameter
sim = cosine similarity
Training data formats:
- Pairs: (query, positive_document)
- Triplets: (query, positive_document, negative_document)
- In-batch negatives: Other positives in the batch serve as negatives
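The InfoNCE formula above can be evaluated for a single query in a few lines. This sketch takes precomputed similarities as input; the temperature value 0.05 is an assumed example, not a recommendation from any specific model.

```python
import math

def info_nce(sim_positive, sim_negatives, temperature=0.05):
    """InfoNCE for one query: -log softmax of the positive among all candidates."""
    logits = [sim_positive / temperature] + [s / temperature for s in sim_negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

easy = info_nce(0.9, [0.1, 0.0, -0.2])      # positive clearly ahead of the negatives
hard = info_nce(0.5, [0.45, 0.4, 0.3])      # negatives almost as similar as the positive
uniform = info_nce(0.3, [0.3, 0.3, 0.3])    # model cannot tell candidates apart
print(f"easy={easy:.4f} hard={hard:.4f} uniform={uniform:.4f}")
```

When every candidate has the same similarity, the loss equals log(n) for n candidates, which is a handy sanity check for a training loop.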
Pooling Strategies
Pooling converts variable-length token representations into a fixed-size vector:
| Strategy | Method | Pros | Cons |
|---|---|---|---|
| CLS Token | Use the [CLS] token representation | Standard BERT approach | May not capture full sentence meaning |
| Mean Pooling | Average all token representations | Captures information from all tokens | Can dilute important signals |
| Max Pooling | Take max across each dimension | Captures salient features | May lose nuance |
| Weighted Mean | Attention-weighted average | Adaptive focus | More complex |
import torch

def mean_pooling(model_output, attention_mask):
    """Mean pooling - average token embeddings weighted by attention mask."""
    token_embeddings = model_output[0]  # per-token hidden states
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum only non-padding tokens, then divide by the number of real tokens
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )

def cls_pooling(model_output):
    """CLS pooling - use the [CLS] token representation."""
    return model_output[0][:, 0]

Note: Mean pooling often outperforms CLS pooling for sentence embeddings, but the pooling method is fixed by how each model was trained and must match at inference time. E5 and GTE use mean pooling, BGE uses the [CLS] token, and decoder-based models such as E5-Mistral-7B use the last token.
3. Major Embedding Models Comparison
Comprehensive Model Comparison
| Model | Provider | Dimensions | Max Tokens | Multilingual | License | Cost | Release |
|---|---|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | 8191 | Yes (100+) | Proprietary | $0.02/1M tokens | 2024-01 |
| text-embedding-3-large | OpenAI | 3072 | 8191 | Yes (100+) | Proprietary | $0.13/1M tokens | 2024-01 |
| embed-v3.0 | Cohere | 1024 | 512 | Yes (100+) | Proprietary | $0.10/1M tokens | 2023-11 |
| BGE-M3 | BAAI | 1024 | 8192 | Yes (100+) | MIT | Free (self-hosted) | 2024-01 |
| E5-Mistral-7B | Microsoft | 4096 | 32768 | Yes | MIT | Free (self-hosted) | 2024-01 |
| multilingual-e5-large | Microsoft | 1024 | 512 | Yes (100+) | MIT | Free (self-hosted) | 2023-06 |
| GTE-Qwen2-7B | Alibaba | 3584 | 131072 | Yes | Apache 2.0 | Free (self-hosted) | 2024-06 |
| jina-embeddings-v3 | Jina AI | 1024 | 8192 | Yes (89) | CC-BY-NC-4.0 | Free (self-hosted) | 2024-09 |
Model Characteristics Deep Dive
OpenAI text-embedding-3 Series
- Native dimensionality reduction via API parameter (e.g., 256, 512, 1024)
- Matryoshka Representation Learning support
- Best-in-class ease of use with simple API
- No self-hosting option
BGE-M3 (BAAI General Embedding - Multi-Functionality, Multi-Linguality, Multi-Granularity)
- Supports three retrieval methods: dense, sparse (lexical), and multi-vector (ColBERT)
- Excellent multilingual performance including Korean
- 8192 token context window
- Fully open-source (MIT license)
BGE-M3 Retrieval Modes:
┌─────────────────────────────────────────────────┐
│ BGE-M3 │
│ │
│ Dense Retrieval Sparse Retrieval ColBERT │
│ (single vector) (term weights) (multi-vec)│
│ │ │ │ │
│ └──────── Hybrid Score ───────────┘ │
│ (weighted combination) │
└─────────────────────────────────────────────────┘
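The "weighted combination" in the diagram can be sketched as a plain weighted sum. The weights and per-mode scores below are made-up values for illustration, and the sketch assumes the three scores are already on comparable scales.

```python
def hybrid_score(dense, sparse, colbert, weights=(0.4, 0.2, 0.4)):
    """Weighted sum of the three per-document relevance scores (weights are assumptions)."""
    w_dense, w_sparse, w_colbert = weights
    return w_dense * dense + w_sparse * sparse + w_colbert * colbert

# Hypothetical per-mode scores for two candidate documents.
candidates = {
    "doc_a": {"dense": 0.62, "sparse": 0.10, "colbert": 0.58},
    "doc_b": {"dense": 0.55, "sparse": 0.40, "colbert": 0.60},
}
ranked = sorted(candidates, key=lambda d: hybrid_score(**candidates[d]), reverse=True)
print(ranked)
```

Here the document with the stronger lexical match wins despite a lower dense score, which is the behavior hybrid retrieval is meant to provide. Weights are best tuned on a held-out query set.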
E5-Mistral-7B
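- Built on the Mistral-7B decoder (one of the two 7B-class models in this list, alongside GTE-Qwen2-7B)
- Long context window (32K tokens) inherited from Mistral-7B; the model card advises against inputs longer than 4096 tokens
- Requires instruction prefix for queries: "Instruct: ...\nQuery: ..."
- High VRAM requirement (~14GB in FP16)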
Cohere embed-v3
- Input type parameter (search_document, search_query, classification, clustering)
- Compression-aware training
- Built-in binary/int8 quantization support
MTEB Benchmark Comparison (English)
| Model | Retrieval | STS | Classification | Clustering | Avg |
|---|---|---|---|---|---|
| text-embedding-3-large | 55.4 | 64.6 | 75.5 | 49.0 | 64.6 |
| BGE-M3 | 54.3 | 63.2 | 74.1 | 48.5 | 62.8 |
| E5-Mistral-7B | 56.9 | 65.1 | 77.2 | 50.3 | 66.6 |
| GTE-Qwen2-7B | 57.2 | 65.8 | 76.9 | 51.1 | 67.0 |
| jina-embeddings-v3 | 55.8 | 64.3 | 75.1 | 49.8 | 65.0 |
| multilingual-e5-large | 50.1 | 61.5 | 72.3 | 46.2 | 59.8 |
Note: MTEB (Massive Text Embedding Benchmark) scores vary by version and evaluation subset. The values above are approximate and should be verified against the latest MTEB leaderboard for current rankings.
4. Korean Language Performance
Why Korean Embeddings are Challenging
Korean presents unique challenges for embedding models compared to English:
1. Agglutinative Morphology
Korean is an agglutinative language where morphemes combine to form complex words:
English: "I went to school" → 4 tokens
Korean: "학교에 갔다" → Morphemes: 학교(school) + 에(to) + 가(go) + 았(past) + 다(declarative)
A single Korean word can express what requires an entire English phrase. Tokenizers not designed for Korean often split words incorrectly.
2. Tokenization Issues
"임베딩 모델 선택" (Embedding model selection)
BPE Tokenizer (English-centric):
→ ["임", "베", "딩", " 모", "델", " 선", "택"] (7 tokens, character-level split)
Korean-optimized Tokenizer:
→ ["임베딩", "모델", "선택"] (3 tokens, word-level split)
English-centric tokenizers fragment Korean text into many small pieces, wasting context window capacity and degrading semantic representation.
3. Subject/Object Omission
Korean frequently omits subjects and objects when they can be inferred from context, making isolated sentence embeddings less informative.
4. Honorific System
The same meaning can be expressed in multiple ways depending on formality level, increasing surface-form variation.
Korean Retrieval Benchmark Results
Performance on Korean retrieval tasks (Ko-StrategyQA, KorQuAD retrieval, Ko-MIRACL):
| Model | Ko-StrategyQA (nDCG@10) | KorQuAD Retrieval (R@10) | Ko-MIRACL (nDCG@10) | Avg |
|---|---|---|---|---|
| BGE-M3 | 72.1 | 85.3 | 68.4 | 75.3 |
| multilingual-e5-large | 68.5 | 82.1 | 65.2 | 71.9 |
| KoSimCSE-roberta | 65.3 | 79.8 | 61.7 | 68.9 |
| text-embedding-3-large | 70.8 | 83.7 | 67.1 | 73.9 |
| text-embedding-3-small | 63.2 | 76.4 | 58.9 | 66.2 |
| Cohere embed-v3 | 69.5 | 81.9 | 66.0 | 72.5 |
| jina-embeddings-v3 | 69.1 | 82.5 | 65.8 | 72.5 |
| E5-Mistral-7B | 71.5 | 84.2 | 67.8 | 74.5 |
Recommended Models for Korean RAG
| Priority | Model | Reason |
|---|---|---|
| 1st | BGE-M3 | Best Korean performance, open-source, hybrid retrieval, 8K context |
| 2nd | text-embedding-3-large | Strong Korean support, easy API, Matryoshka dimensions |
| 3rd | multilingual-e5-large | Good balance of quality and resource efficiency |
| 4th | KoSimCSE-roberta | Purpose-built for Korean, lightweight, but limited context length |
Practical Tips for Korean Embeddings
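- Preprocessing matters: Morphological analysis (e.g., Mecab, Kiwi) can strip particles and endings before embedding, but dense models are trained on natural text, so confirm on your own data that it improves retrieval before adopting it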
- Test with your data: Benchmark scores do not always predict performance on domain-specific Korean text
- Consider hybrid retrieval: Combine dense embeddings with BM25 for Korean -- BM25 handles exact keyword matching well for Korean compound nouns
- Chunk size: Korean text is denser than English; use slightly smaller chunk sizes (300-400 tokens vs 500+ for English)
- Query formulation: For models requiring instruction prefixes, include Korean-specific instructions
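One common way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The guide does not prescribe a fusion method, so treat this as one possible sketch; the rankings are hypothetical, and k=60 is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d3", "d1", "d2"]  # hypothetical order from dense retrieval
bm25_ranking = ["d1", "d4", "d3"]   # hypothetical order from BM25
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)
```

RRF uses only rank positions, so it avoids having to normalize BM25 scores against cosine similarities.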
# Example: Korean preprocessing with Kiwi before embedding
from kiwipiepy import Kiwi

kiwi = Kiwi()

def preprocess_korean(text):
    """Morphological analysis for better Korean embedding quality."""
    tokens = kiwi.tokenize(text)
    # Join with spaces, keeping meaningful morphemes
    return " ".join([t.form for t in tokens if t.tag[0] in ('N', 'V', 'M', 'S')])

# Before embedding
raw_text = "한국어 임베딩 모델의 성능을 비교합니다"
processed = preprocess_korean(raw_text)
# "한국어 임베딩 모델 성능 비교"
5. Embedding Model Usage
OpenAI Embeddings
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_openai_embedding(text, model="text-embedding-3-small", dimensions=None):
    """Get embedding from OpenAI API."""
    params = {"input": text, "model": model}
    if dimensions:
        params["dimensions"] = dimensions  # Matryoshka dimension reduction
    response = client.embeddings.create(**params)
    return response.data[0].embedding

# Single text
embedding = get_openai_embedding("What is machine learning?")
print(f"Dimension: {len(embedding)}")  # 1536

# With dimension reduction (Matryoshka)
embedding_small = get_openai_embedding(
    "What is machine learning?",
    model="text-embedding-3-large",
    dimensions=256  # Reduce from 3072 to 256
)
print(f"Reduced dimension: {len(embedding_small)}")  # 256

# Batch processing
def get_openai_embeddings_batch(texts, model="text-embedding-3-small", batch_size=100):
    """Process texts in batches for efficiency."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings
HuggingFace Sentence-Transformers
from sentence_transformers import SentenceTransformer
import numpy as np
# Load model
model = SentenceTransformer("BAAI/bge-m3")
# Single text
embedding = model.encode("What is machine learning?")
print(f"Dimension: {embedding.shape}") # (1024,)
# Batch processing with normalization
sentences = [
"Machine learning is a subset of artificial intelligence.",
"Deep learning uses neural networks with multiple layers.",
"The weather is sunny today.",
]
embeddings = model.encode(
sentences,
normalize_embeddings=True, # L2 normalization for cosine similarity
batch_size=32,
show_progress_bar=True,
)
# Compute similarity matrix
similarity_matrix = np.inner(embeddings, embeddings)
print("Similarity matrix:")
print(similarity_matrix)
Using E5 models with instruction prefix:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/multilingual-e5-large")
# E5 models require "query: " or "passage: " prefix
queries = ["query: What is the capital of France?"]
passages = [
"passage: Paris is the capital and largest city of France.",
"passage: Berlin is the capital of Germany.",
]
query_embeddings = model.encode(queries, normalize_embeddings=True)
passage_embeddings = model.encode(passages, normalize_embeddings=True)
# Compute scores
scores = query_embeddings @ passage_embeddings.T
print(f"Score for Paris passage: {scores[0][0]:.4f}") # Higher
print(f"Score for Berlin passage: {scores[0][1]:.4f}")  # Lower
Ollama Embeddings (Local)
import requests
import numpy as np

def get_ollama_embedding(text, model="bge-m3"):
    """Get embedding from local Ollama server."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text}
    )
    return response.json()["embedding"]

# Single embedding
embedding = get_ollama_embedding("What is machine learning?")
print(f"Dimension: {len(embedding)}")

# Batch processing (Ollama processes one at a time)
def get_ollama_embeddings_batch(texts, model="bge-m3"):
    """Process multiple texts through Ollama."""
    embeddings = []
    for text in texts:
        emb = get_ollama_embedding(text, model)
        embeddings.append(emb)
    return np.array(embeddings)

# Using Ollama Python library
import ollama
response = ollama.embeddings(model="bge-m3", prompt="What is machine learning?")
embedding = response["embedding"]
Normalization
import numpy as np

def normalize_embedding(embedding):
    """L2 normalize an embedding vector."""
    embedding = np.array(embedding)
    norm = np.linalg.norm(embedding)
    if norm == 0:
        return embedding
    return embedding / norm

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# If embeddings are already L2-normalized, cosine similarity = dot product
embedding_a = [0.12, -0.34, 0.56, 0.78]  # placeholder vectors for illustration
embedding_b = [0.11, -0.32, 0.55, 0.80]
a_norm = normalize_embedding(embedding_a)
b_norm = normalize_embedding(embedding_b)
similarity = np.dot(a_norm, b_norm)  # Equivalent to cosine_similarity(embedding_a, embedding_b)
6. Dimensionality Reduction
Matryoshka Representation Learning (MRL)
Named after Russian nesting dolls, Matryoshka embeddings are trained so that the first d dimensions of a larger embedding are themselves a valid embedding. This allows flexible dimension selection at inference time without retraining.
Full embedding (3072 dims): [v1, v2, v3, ..., v256, ..., v1024, ..., v3072]
First 256 dims [v1 ... v256] → 256-dim embedding (valid!)
First 1024 dims [v1 ... v1024] → 1024-dim embedding (valid!)
All 3072 dims [v1 ... v3072] → 3072-dim embedding (full quality)
Models supporting Matryoshka embeddings:
- OpenAI text-embedding-3-small/large (via the API dimensions parameter)
- jina-embeddings-v3
- nomic-embed-text-v1.5
PCA (Principal Component Analysis)
For models that do not natively support Matryoshka, PCA can reduce dimensions post-hoc:
from sklearn.decomposition import PCA
import numpy as np
# Assume we have a matrix of embeddings (n_samples x original_dim)
embeddings = np.array(all_embeddings) # shape: (10000, 1024)
# Fit PCA
target_dim = 256
pca = PCA(n_components=target_dim)
reduced_embeddings = pca.fit_transform(embeddings)
print(f"Original shape: {embeddings.shape}") # (10000, 1024)
print(f"Reduced shape: {reduced_embeddings.shape}") # (10000, 256)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
# Save PCA model for query-time reduction
import pickle
with open("pca_model.pkl", "wb") as f:
    pickle.dump(pca, f)
Truncation
The simplest approach -- just take the first N dimensions. Works well with Matryoshka-trained models, poorly with others:
# Simple truncation (only for Matryoshka-trained models!)
full_embedding = model.encode("some text") # 3072 dims
truncated = full_embedding[:256] # Take first 256
truncated = truncated / np.linalg.norm(truncated)  # Re-normalize
Dimension vs Accuracy vs Cost Trade-offs
| Dimensions | Relative Accuracy | Storage (1M vectors) | Search Speed | Use Case |
|---|---|---|---|---|
| 3072 | 100% (baseline) | ~12 GB | Slowest | Maximum quality requirements |
| 1536 | ~98% | ~6 GB | Moderate | Standard production |
| 1024 | ~96% | ~4 GB | Moderate | Good balance |
| 512 | ~93% | ~2 GB | Fast | Cost-sensitive applications |
| 256 | ~88% | ~1 GB | Very fast | Prototyping, low-resource |
| 128 | ~82% | ~0.5 GB | Fastest | Edge deployment |
Note: Accuracy retention percentages are approximate and vary significantly by model and task. Matryoshka-trained models retain more accuracy at lower dimensions compared to post-hoc reduction methods like PCA or truncation.
7. Fine-Tuning Embeddings
When to Fine-Tune
Fine-tuning is worthwhile when:
- Domain-specific vocabulary: Medical, legal, or technical jargon not well-represented in general models
- Specialized similarity criteria: Your notion of similarity differs from general semantic similarity
- Low baseline performance: Off-the-shelf models underperform on your specific retrieval tasks
- Sufficient training data: You have at least 1,000+ labeled pairs (ideally 10K+)
Fine-tuning is NOT necessary when:
- General-purpose retrieval works well enough
- You have insufficient training data (< 500 pairs)
- The domain is well-covered by multilingual models
Training Data Format
The most effective format is query-positive-negative triples:
[
{
"query": "How to prevent SQL injection?",
"positive": "Use parameterized queries and prepared statements to prevent SQL injection attacks. Never concatenate user input directly into SQL strings.",
"negative": "SQL databases store data in tables with rows and columns. Common SQL databases include MySQL, PostgreSQL, and SQLite."
},
{
"query": "What causes memory leaks in Java?",
"positive": "Memory leaks in Java occur when objects are no longer needed but still referenced, preventing garbage collection. Common causes include static collections, unclosed resources, and listener accumulation.",
"negative": "Java is a popular programming language developed by Sun Microsystems. It runs on the Java Virtual Machine (JVM)."
}
]
Hard negatives (documents that are topically related but not relevant) are far more effective than random negatives for training.
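Hard negatives are often mined with the current model itself: retrieve the top-scoring documents for a query and keep those not labeled relevant. The sketch below assumes precomputed toy vectors; the document IDs and values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, num_negatives=2):
    """Return the highest-scoring documents that are not labeled relevant."""
    candidates = [doc_id for doc_id in doc_vecs if doc_id not in positive_ids]
    candidates.sort(key=lambda doc_id: dot(query_vec, doc_vecs[doc_id]), reverse=True)
    return candidates[:num_negatives]

doc_vecs = {
    "sql_injection_fix": [0.9, 0.1],  # labeled positive
    "sql_basics": [0.8, 0.3],         # same topic, does not answer the query
    "java_memory": [0.2, 0.9],
    "cooking_recipe": [-0.5, 0.1],
}
query_vec = [1.0, 0.0]
hard = mine_hard_negatives(query_vec, doc_vecs, {"sql_injection_fix"}, num_negatives=1)
print(hard)
```

Mined negatives can include unlabeled true positives, so a manual spot check or a score threshold is usually worth adding.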
Fine-Tuning with Sentence-Transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from torch.utils.data import DataLoader
import json
# Load base model
model = SentenceTransformer("BAAI/bge-m3")
# Prepare training data
def load_training_data(filepath):
    """Load training triples from JSON file."""
    with open(filepath) as f:
        data = json.load(f)
    examples = []
    for item in data:
        # Order matters for TripletLoss: anchor, positive, negative
        examples.append(InputExample(
            texts=[item["query"], item["positive"], item["negative"]]
        ))
    return examples
train_examples = load_training_data("training_triples.json")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Define loss function
train_loss = losses.TripletLoss(model=model)
# Alternative: MultipleNegativesRankingLoss (more effective with in-batch negatives)
# train_loss = losses.MultipleNegativesRankingLoss(model=model)
# Prepare evaluator
def create_evaluator(eval_data_path):
    """Create IR evaluator for monitoring training progress."""
    with open(eval_data_path) as f:
        eval_data = json.load(f)
    # Each eval item pairs one query with its single relevant document
    queries = {str(i): item["query"] for i, item in enumerate(eval_data)}
    corpus = {str(i): item["document"] for i, item in enumerate(eval_data)}
    relevant_docs = {str(i): {str(i)} for i in range(len(eval_data))}
    return InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name="custom-eval",
    )
evaluator = create_evaluator("eval_data.json")
# Fine-tune
model.fit(
train_objectives=[(train_dataloader, train_loss)],
evaluator=evaluator,
epochs=3,
evaluation_steps=500,
warmup_steps=100,
output_path="./finetuned-bge-m3",
save_best_model=True,
)
# Load and use fine-tuned model
finetuned_model = SentenceTransformer("./finetuned-bge-m3")
embeddings = finetuned_model.encode(["test query"], normalize_embeddings=True)
Evaluation Metrics
| Metric | Description | Formula | Good Value |
|---|---|---|---|
| Recall@K | Fraction of relevant docs in top-K | Relevant in top-K / Total relevant | > 0.90 for K=10 |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of first relevant result | Mean(1/rank_i) | > 0.50 |
| nDCG@K | Normalized Discounted Cumulative Gain | Considers position and graded relevance | > 0.60 |
| MAP (Mean Average Precision) | Mean of precision at each relevant rank | Mean of AP per query | > 0.50 |
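For intuition, three of these metrics can be written out directly. This sketch assumes binary relevance and a single query; MRR is the mean of `reciprocal_rank` over all queries. The function names and sample data are illustrative.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal

ranked = ["d7", "d2", "d9", "d4"]   # system output for one query
relevant = {"d2", "d4", "d5"}       # ground-truth relevant docs

print(f"Recall@3: {recall_at_k(ranked, relevant, 3):.4f}")
print(f"RR:       {reciprocal_rank(ranked, relevant):.4f}")
print(f"nDCG@4:   {ndcg_at_k(ranked, relevant, 4):.4f}")
```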
# Quick evaluation example (reuses the evaluator created above)
results = evaluator(model)
# Recent sentence-transformers releases return a dict of metrics whose keys
# include the evaluator name and the similarity function, so print what is
# returned rather than hard-coding key names. Older releases return one float.
if isinstance(results, dict):
    for metric_name, value in sorted(results.items()):
        print(f"{metric_name}: {value:.4f}")
else:
    print(f"Main score: {results:.4f}")
8. RAG Integration Best Practices
Chunking and Embedding Alignment
The relationship between chunking strategy and embedding model is critical:
Document: "Machine learning overview... [2000 tokens of content]"
Strategy 1: Fixed-size chunks (512 tokens)
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │ │ Chunk 4 │
│ 512 tok │ │ 512 tok │ │ 512 tok │ │ 512 tok │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
Strategy 2: Semantic chunks (variable size)
┌────────────────┐ ┌──────────┐ ┌──────────────────────┐
│ Section 1 │ │ Section 2│ │ Section 3 │
│ 320 tok │ │ 180 tok │ │ 640 tok │
└────────────────┘ └──────────┘ └──────────────────────┘
| Factor | Recommendation |
|---|---|
| Chunk size | Match to model's sweet spot (typically 256-512 tokens) |
| Overlap | 10-20% overlap to avoid boundary information loss |
| Max tokens | Never exceed model's max token limit |
| Granularity | Smaller chunks for precise Q&A, larger for summarization |
| Metadata | Include source, section headers, and page numbers |
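Fixed-size chunking with overlap can be sketched as a sliding window over a token list. The sizes below (512 tokens with 64 overlapping, about 12.5%) are example values within the ranges in the table, and the placeholder tokens stand in for output from the embedding model's own tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

tokens = [f"tok{i}" for i in range(2000)]  # placeholder for real tokenizer output
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
print(len(chunks), [len(c) for c in chunks])
```

Counting tokens with the embedding model's tokenizer, rather than by words or characters, is what keeps every chunk under the model's limit.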
Query vs Document Embedding
Some models distinguish between query-time and document-time embeddings:
# Models requiring different prefixes for queries vs documents
# BGE v1.5 English models (BGE-M3 does not need a query instruction)
query_prefix = "Represent this sentence for searching relevant passages: "
doc_prefix = "" # No prefix for documents
query_embedding = model.encode(query_prefix + "What is RAG?")
doc_embedding = model.encode(doc_prefix + "RAG combines retrieval with generation...")
# E5 models
query_text = "query: What is RAG?"
doc_text = "passage: RAG combines retrieval with generation..."
# Cohere API
import cohere
co = cohere.Client("your-api-key")
# Different input_type for indexing vs searching
doc_response = co.embed(
texts=["RAG combines retrieval..."],
model="embed-multilingual-v3.0",
input_type="search_document"
)
query_response = co.embed(
texts=["What is RAG?"],
model="embed-multilingual-v3.0",
input_type="search_query"
)
Note: Always use the correct prefix/input_type. Using a document prefix for queries (or vice versa) can reduce retrieval accuracy by 5-15%.
Caching Strategies
import hashlib
import os
import numpy as np

class EmbeddingCache:
    """Simple file-based embedding cache."""

    def __init__(self, cache_dir="./embedding_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _hash_text(self, text, model_name):
        """Create a unique hash for text + model combination."""
        key = f"{model_name}:{text}"
        return hashlib.sha256(key.encode()).hexdigest()

    def get(self, text, model_name):
        """Retrieve cached embedding."""
        hash_key = self._hash_text(text, model_name)
        cache_path = os.path.join(self.cache_dir, f"{hash_key}.npy")
        if os.path.exists(cache_path):
            return np.load(cache_path)
        return None

    def set(self, text, model_name, embedding):
        """Store embedding in cache."""
        hash_key = self._hash_text(text, model_name)
        cache_path = os.path.join(self.cache_dir, f"{hash_key}.npy")
        np.save(cache_path, np.array(embedding))

    def get_or_compute(self, text, model_name, compute_fn):
        """Get from cache or compute and store."""
        cached = self.get(text, model_name)
        if cached is not None:
            return cached
        embedding = compute_fn(text)
        self.set(text, model_name, embedding)
        return embedding

# Usage
cache = EmbeddingCache()
embedding = cache.get_or_compute(
    "What is machine learning?",
    "bge-m3",
    lambda text: model.encode(text, normalize_embeddings=True)
)
Cost Optimization
| Strategy | Savings | Impact on Quality | Implementation |
|---|---|---|---|
| Batch processing | 10-20% latency | None | Group requests, process together |
| Dimension reduction | 50-75% storage | 2-10% accuracy loss | Use Matryoshka or PCA |
| Caching | 80%+ API costs | None | Cache computed embeddings |
| Model selection | 5-10x cost | Varies | Use smaller model if quality permits |
| Quantization | 75% storage (int8), ~97% (binary) | ~1-3% accuracy loss for int8; typically more for binary | int8 or binary quantization |
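The quantization row can be illustrated with a small sketch. It assumes symmetric per-vector int8 scaling and sign-based binary codes, which are two common schemes and not necessarily what a given vector database implements. Going from float32 to int8 cuts storage 4x; binary codes cut it 32x.

```python
def quantize_int8(vector):
    """Symmetric scalar quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(x) for x in vector) / 127 or 1.0  # fall back to 1.0 for all-zero input
    return [round(x / scale) for x in vector], scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

def quantize_binary(vector):
    """Keep only the sign of each component: 1 bit per dimension."""
    return [1 if x > 0 else 0 for x in vector]

vector = [0.12, -0.34, 0.56, 0.78]
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)
bits = quantize_binary(vector)

max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, bits, f"max reconstruction error: {max_error:.5f}")
```

The int8 reconstruction error is bounded by half the scale step, which is why the quality loss tends to be small. Binary codes discard magnitude entirely and are often paired with a rescoring pass on full-precision vectors.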
# Cost estimation helper
def estimate_embedding_cost(
    num_documents: int,
    avg_tokens_per_doc: int,
    model: str = "text-embedding-3-small",
):
    """Estimate OpenAI embedding costs."""
    pricing = {
        "text-embedding-3-small": 0.02,  # per 1M tokens
        "text-embedding-3-large": 0.13,  # per 1M tokens
    }
    total_tokens = num_documents * avg_tokens_per_doc
    cost = (total_tokens / 1_000_000) * pricing.get(model, 0)
    print(f"Documents: {num_documents:,}")
    print(f"Avg tokens/doc: {avg_tokens_per_doc}")
    print(f"Total tokens: {total_tokens:,}")
    print(f"Model: {model}")
    print(f"Estimated cost: ${cost:.4f}")
    return cost

# Example: Embedding 100K documents
estimate_embedding_cost(100_000, 300, "text-embedding-3-small")
# Documents: 100,000
# Avg tokens/doc: 300
# Total tokens: 30,000,000
# Model: text-embedding-3-small
# Estimated cost: $0.6000
9. Selection Guide
Decision Flowchart
Start: Choose an Embedding Model
│
├─ Do you need multilingual (Korean) support?
│ ├─ Yes
│ │ ├─ Budget for API costs?
│ │ │ ├─ Yes → OpenAI text-embedding-3-large (easy, high quality)
│ │ │ └─ No → Can you self-host?
│ │ │ ├─ Yes → BGE-M3 (best open-source multilingual)
│ │ │ └─ No → multilingual-e5-large (lightweight self-host)
│ │ └─ Need hybrid retrieval (dense + sparse)?
│ │ ├─ Yes → BGE-M3 (native hybrid support)
│ │ └─ No → Continue below
│ └─ No (English only)
│ ├─ Need maximum quality?
│ │ ├─ Yes → GTE-Qwen2-7B or E5-Mistral-7B
│ │ └─ No → text-embedding-3-small (cost-effective)
│ └─ Need long context (>8K tokens)?
│ ├─ Yes → GTE-Qwen2-7B (128K) or E5-Mistral-7B (32K)
│ └─ No → BGE-M3 or text-embedding-3-small
│
└─ Special requirements?
├─ Edge/mobile deployment → Small model + dimension reduction
├─ Commercial license needed → Check model license (MIT, Apache)
└─ Maximum privacy → Self-hosted open-source model
Scenario-Based Recommendations
| Scenario | Recommended Model | Dimensions | Rationale |
|---|---|---|---|
| Korean enterprise RAG | BGE-M3 | 1024 | Best Korean performance, MIT license, hybrid retrieval |
| Multilingual SaaS product | OpenAI text-embedding-3-large | 1024 (reduced) | Easy API, 100+ languages, Matryoshka support |
| English-only high accuracy | GTE-Qwen2-7B | 3584 | Top MTEB scores, Apache license |
| Budget-constrained startup | OpenAI text-embedding-3-small | 512 (reduced) | Lowest API cost, decent quality |
| On-premise (no external API) | BGE-M3 | 1024 | MIT license, no API dependency |
| Long document retrieval | E5-Mistral-7B | 4096 | 32K context window |
| Low-resource / edge | multilingual-e5-large | 256 (PCA) | Good quality at small size |
| Korean-specific, lightweight | KoSimCSE-roberta | 768 | Korean-native, small footprint |
| Maximum flexibility | jina-embeddings-v3 | 1024 | Task-specific LoRA adapters, 8K context |
Quick Comparison Summary
Quality vs Cost Matrix:
| Relative quality (high to low) | Free (self-host) | Paid (API) |
|---|---|---|
| Highest | E5-Mistral, GTE-Qwen2 | |
| High | BGE-M3 | text-emb-3-large |
| Upper-mid | jina-v3 | |
| Mid | ml-e5-large | |
| Lower | | text-emb-3-small |
Production Checklist
Before deploying an embedding model to production, verify:
- Benchmark on your data: Run retrieval evaluations on representative queries
- Tokenization check: Verify Korean/special character handling
- Latency profiling: Measure p50/p95/p99 embedding generation time
- Cost projection: Calculate monthly embedding costs at expected volume
- Fallback plan: Have a backup model in case primary is unavailable
- Versioning: Track which model version generated each embedding
- Re-embedding strategy: Plan for model upgrades (must re-embed all documents)
- Monitoring: Set up quality metrics (retrieval recall, user feedback)
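For the latency-profiling item, a minimal sketch using only the standard library is shown below. `fake_embed` is a hypothetical stand-in for a real model or API call, and a real measurement would use representative texts and a warm-up phase.

```python
import statistics
import time

def profile_latency(embed_fn, texts):
    """Time embed_fn on each text and report p50/p95/p99 in milliseconds."""
    samples_ms = []
    for text in texts:
        start = time.perf_counter()
        embed_fn(text)
        samples_ms.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def fake_embed(text):
    """Hypothetical placeholder for a real embedding call."""
    return [float(len(text))] * 8

report = profile_latency(fake_embed, ["sample query"] * 200)
print({name: f"{value:.4f} ms" for name, value in report.items()})
```

Tail percentiles matter more than the mean here, because one slow embedding call delays the whole retrieval request.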
References
- Reimers, N. and Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019.
- Chen, J. et al. (2024). "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation." arXiv:2402.03216.
- Wang, L. et al. (2024). "Improving Text Embeddings with Large Language Models." arXiv:2401.00368.
- MTEB Benchmark Leaderboard - https://huggingface.co/spaces/mteb/leaderboard
- OpenAI Embeddings Documentation - https://platform.openai.com/docs/guides/embeddings
- Sentence-Transformers Documentation - https://www.sbert.net/
- Kusupati, A. et al. (2022). "Matryoshka Representation Learning." NeurIPS 2022.
- Cohere Embed v3 Documentation - https://docs.cohere.com/docs/embeddings
- Jina AI Embeddings v3 - https://jina.ai/embeddings/
— Data Dynamics Engineering Team