
Embedding Model Selection Guide - From Concepts to Korean Language Benchmarks

A comprehensive guide covering embedding model concepts, major model comparison (OpenAI, Cohere, BGE-M3, E5, multilingual-e5), Korean language performance benchmarks, dimensionality reduction, fine-tuning, and selection strategies for RAG applications.

Data Dynamics | April 16, 2026 | 23 min read

Embeddings are the foundational building block of modern NLP systems, from semantic search to RAG pipelines. Choosing the right embedding model directly impacts retrieval quality, latency, and cost. This guide covers everything from embedding fundamentals to Korean-language benchmarks and production deployment strategies.

1. What are Embeddings?

Definition

An embedding is a dense vector representation of text (or other data) in a continuous, high-dimensional space. Unlike sparse representations such as TF-IDF or bag-of-words, embeddings capture semantic meaning -- words and sentences with similar meanings are mapped to nearby points in vector space.

Text Input                     Embedding Vector (simplified)
─────────────────────────────────────────────────────────────
"The cat sat on the mat"   →   [0.12, -0.34, 0.56, 0.78, ...]
"A kitten rested on a rug" →   [0.11, -0.32, 0.55, 0.80, ...]
"Stock market crashed"     →   [-0.87, 0.45, -0.12, 0.03, ...]

The first two sentences have similar vectors because their meanings are close, while the third sentence about finance is far away in vector space.
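To make "close" and "far away" concrete, the sketch below computes cosine similarity on the four components shown in the illustration. It is only an illustration: real embeddings have hundreds or thousands of components, and the variable names are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Only the four components shown in the illustration; real vectors are much longer.
cat = [0.12, -0.34, 0.56, 0.78]
kitten = [0.11, -0.32, 0.55, 0.80]
stocks = [-0.87, 0.45, -0.12, 0.03]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_stocks = cosine_similarity(cat, stocks)
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs stocks: {sim_cat_stocks:.3f}")
```

The first score is close to 1 and the second is negative, which matches the intuition that the two cat sentences point in nearly the same direction.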

How Text-to-Vector Conversion Works

The embedding pipeline follows a consistent pattern:


Tokenization: Text is split into subword tokens (e.g., WordPiece, BPE)
Token Encoding: Each token is mapped to an ID and passed through the model
Transformer Layers: Self-attention layers build contextual representations
Pooling: Token-level representations are aggregated into a single vector

Semantic Space Visualization

The classic example illustrating semantic relationships in embedding space is the word analogy king - man + woman ≈ queen.

This demonstrates that embeddings capture not just similarity but also analogical relationships. The vector arithmetic king - man + woman yields a vector close to queen, showing that the model has learned the concept of gender as a direction in vector space.
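The arithmetic can be sketched with hand-made vectors. The two axes and all values below are invented for illustration (axis 0 loosely stands for "royalty", axis 1 for "gender"); a trained model learns such directions implicitly and far less cleanly.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only).
vocab = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(vocab["king"], vocab["man"], vocab["woman"])]

# Exclude the three input words, as analogy evaluations usually do.
candidates = {w: v for w, v in vocab.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)
```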

Types of Embeddings

Level	Description	Use Case	Example Models
Word-level	One vector per word, context-independent	Word similarity, simple classification	Word2Vec, GloVe, FastText
Sentence-level	One vector per sentence, context-aware	Semantic search, similarity matching	Sentence-BERT, E5, BGE
Document-level	One vector per document or paragraph	Document retrieval, clustering	Doc2Vec, long-context embedding models

Note: Modern embedding models (2023+) primarily operate at the sentence/passage level using transformer architectures. Word-level embeddings like Word2Vec are largely superseded for production use cases.

Embedding Dimensions Explained

The dimension of an embedding vector determines how much information it can encode:

Dimension	Storage per Vector (float32)	Characteristics
384	1.5 KB	Lightweight, fast, good for simple tasks
768	3 KB	Standard size, good quality-cost balance
1024	4 KB	Higher quality, moderate cost
1536	6 KB	High quality (OpenAI default)
3072	12 KB	Maximum quality, highest storage cost

Higher dimensions generally improve quality but increase:

Storage cost: Proportional to dimension size
Search latency: Distance computation scales linearly with dimension
Memory usage: Index size grows with dimension
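The storage figures in the table follow from simple arithmetic. The sketch below assumes float32 values (4 bytes each) and counts raw vector data only, without index overhead; the function name is an assumption for this example.

```python
def vector_storage_bytes(dimensions, num_vectors=1, bytes_per_value=4):
    """Raw storage for dense vectors; 4 bytes per value corresponds to float32."""
    return dimensions * bytes_per_value * num_vectors

for dims in (384, 768, 1024, 1536, 3072):
    per_vector_kib = vector_storage_bytes(dims) / 1024
    per_million_gb = vector_storage_bytes(dims, num_vectors=1_000_000) / 1e9
    print(f"{dims:>5} dims: {per_vector_kib:.1f} KiB per vector, {per_million_gb:.2f} GB per 1M vectors")
```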

2. Embedding Model Architecture

Bi-Encoder vs Cross-Encoder

The two fundamental architectures for computing text similarity serve different purposes:


Feature	Bi-Encoder	Cross-Encoder
Input	Single text	Text pair (query + document)
Output	Vector embedding	Relevance score
Speed	Fast (independent encoding)	Slow (joint encoding)
Accuracy	Good	Superior
Offline indexing	Yes (pre-compute vectors)	No
Typical use	First-stage retrieval	Re-ranking stage
Scalability	Millions of documents	Top-K re-ranking (10-100)

Note: In production RAG systems, the best practice is to use a Bi-Encoder for initial retrieval (top-100) and then a Cross-Encoder for re-ranking (top-100 to top-10). This combines speed with accuracy.
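A minimal sketch of the two-stage pattern is shown below. The document vectors are toy values, and `word_overlap` is a deliberately crude stand-in for a cross-encoder; in a real system, stage 1 would query a vector index and stage 2 would call a trained re-ranking model. All names here are assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_then_rerank(query_vec, query_text, corpus, rerank_score, first_k=100, final_k=10):
    """Stage 1: rank by precomputed vectors. Stage 2: rescore survivors pairwise."""
    # Stage 1 (bi-encoder role): cheap similarity against precomputed document vectors.
    candidates = sorted(corpus, key=lambda d: dot(query_vec, d["vector"]), reverse=True)[:first_k]
    # Stage 2 (cross-encoder role): expensive pairwise scoring, only on the candidates.
    reranked = sorted(candidates, key=lambda d: rerank_score(query_text, d["text"]), reverse=True)
    return reranked[:final_k]

# Toy stand-in for a cross-encoder: fraction of query words present in the document.
def word_overlap(query, doc):
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

corpus = [
    {"text": "bge-m3 supports dense and sparse retrieval", "vector": [0.9, 0.1]},
    {"text": "dense retrieval with bi-encoders", "vector": [0.8, 0.3]},
    {"text": "weather report for seoul", "vector": [0.0, 1.0]},
]
top = retrieve_then_rerank(
    [1.0, 0.0], "dense retrieval bi-encoders", corpus, word_overlap, first_k=2, final_k=1
)
print(top[0]["text"])
```

In this toy run the first stage ranks the BGE-M3 document highest, and the second stage promotes the bi-encoder document because it matches the query more completely.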

Sentence-BERT Architecture

Sentence-BERT (SBERT) is the foundational architecture for modern sentence embedding models. It fine-tunes BERT with a siamese/triplet network structure.

Key properties:

Shared weights: Both sentences are encoded by the same model
Independent encoding: Each sentence is processed separately
Fixed-size output: Regardless of input length, output dimension is constant

Contrastive Learning

Modern embedding models are trained using contrastive learning objectives. The goal is to pull similar pairs closer and push dissimilar pairs apart in vector space.

InfoNCE Loss (common training objective):

L = -log( exp(sim(q, d+) / τ) / Σ exp(sim(q, di) / τ) )

Where:
  q   = query embedding
  d+  = positive (relevant) document embedding
  di  = all documents in batch (positive + negatives)
  τ   = temperature parameter
  sim = cosine similarity
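The formula can be checked numerically for a single query. The sketch below takes precomputed cosine similarities as input; the temperature of 0.05 and the similarity values are assumptions chosen for illustration, not settings from any particular model.

```python
import math

def info_nce_loss(sim_positive, sim_negatives, temperature=0.05):
    """InfoNCE for one query, given cosine similarities to its positive and negatives."""
    logits = [sim_positive / temperature] + [s / temperature for s in sim_negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denominator)

easy = info_nce_loss(0.9, [0.1, 0.0, -0.2])    # positive clearly closest
hard = info_nce_loss(0.3, [0.8, 0.7, 0.6])     # negatives closer than the positive
uniform = info_nce_loss(0.5, [0.5, 0.5, 0.5])  # all equal: loss is log(4)
print(f"easy={easy:.4f} hard={hard:.4f} uniform={uniform:.4f}")
```

The loss is near zero when the positive is clearly the closest document and grows when negatives score higher, which is the pull-together, push-apart behavior described above.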

Training data formats:

Pairs: (query, positive_document)
Triplets: (query, positive_document, negative_document)
In-batch negatives: The positive documents of other queries in the same batch serve as negatives

Pooling Strategies

Pooling converts variable-length token representations into a fixed-size vector:

Strategy	Method	Pros	Cons
CLS Token	Use the [CLS] token representation	Standard BERT approach	May not capture full sentence meaning
Mean Pooling	Average all token representations	Captures information from all tokens	Can dilute important signals
Max Pooling	Take max across each dimension	Captures salient features	May lose nuance
Weighted Mean	Attention-weighted average	Adaptive focus	More complex

import torch
 
def mean_pooling(model_output, attention_mask):
    """Mean pooling - average token embeddings weighted by attention mask."""
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
 
def cls_pooling(model_output):
    """CLS pooling - use the [CLS] token representation."""
    return model_output[0][:, 0]

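Note: Mean pooling generally outperforms CLS pooling when a BERT-style model has not been trained for a specific pooling method, but the right choice depends on how the model was trained. E5 and the original GTE models use mean pooling, whereas BGE models use the normalized [CLS] token, so always apply the pooling method specified in the model card.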

3. Major Embedding Models Comparison

Comprehensive Model Comparison

Model	Provider	Dimensions	Max Tokens	Multilingual	License	Cost	Release
text-embedding-3-small	OpenAI	1536	8191	Yes (100+)	Proprietary	$0.02/1M tokens	2024-01
text-embedding-3-large	OpenAI	3072	8191	Yes (100+)	Proprietary	$0.13/1M tokens	2024-01
embed-v3.0	Cohere	1024	512	Yes (100+)	Proprietary	$0.10/1M tokens	2023-11
BGE-M3	BAAI	1024	8192	Yes (100+)	MIT	Free (self-hosted)	2024-01
E5-Mistral-7B	Microsoft	4096	32768	Yes	MIT	Free (self-hosted)	2024-01
multilingual-e5-large	Microsoft	1024	512	Yes (100+)	MIT	Free (self-hosted)	2023-06
GTE-Qwen2-7B	Alibaba	3584	131072	Yes	Apache 2.0	Free (self-hosted)	2024-06
jina-embeddings-v3	Jina AI	1024	8192	Yes (89)	CC-BY-NC-4.0	Free (self-hosted)	2024-09

Model Characteristics Deep Dive

OpenAI text-embedding-3 Series

Native dimensionality reduction via API parameter (e.g., 256, 512, 1024)
Matryoshka Representation Learning support
Best-in-class ease of use with simple API
No self-hosting option

BGE-M3 (BAAI General Embedding - Multi-Functionality, Multi-Linguality, Multi-Granularity)

Supports three retrieval methods: dense, sparse (lexical), and multi-vector (ColBERT)
Excellent multilingual performance including Korean
8192 token context window
Fully open-source (MIT license)


E5-Mistral-7B

Built on the Mistral-7B decoder (along with GTE-Qwen2-7B, one of the two 7B-parameter models in this list)
Exceptional long-context performance (32K tokens)
Requires instruction prefix for queries: "Instruct: ...\nQuery: ..."
High VRAM requirement (~14GB in FP16)

Cohere embed-v3

Input type parameter (search_document, search_query, classification, clustering)
Compression-aware training
Built-in binary/int8 quantization support

MTEB Benchmark Comparison (English)

Model	Retrieval	STS	Classification	Clustering	Avg
text-embedding-3-large	55.4	64.6	75.5	49.0	64.6
BGE-M3	54.3	63.2	74.1	48.5	62.8
E5-Mistral-7B	56.9	65.1	77.2	50.3	66.6
GTE-Qwen2-7B	57.2	65.8	76.9	51.1	67.0
jina-embeddings-v3	55.8	64.3	75.1	49.8	65.0
multilingual-e5-large	50.1	61.5	72.3	46.2	59.8

Note: MTEB (Massive Text Embedding Benchmark) scores vary by version and evaluation subset. The values above are approximate and should be verified against the latest MTEB leaderboard for current rankings.

4. Korean Language Performance

Why Korean Embeddings are Challenging

Korean presents unique challenges for embedding models compared to English:

1. Agglutinative Morphology

Korean is an agglutinative language where morphemes combine to form complex words:

English: "I went to school"  →  4 tokens
Korean:  "학교에 갔다"          →  Morphemes: 학교(school) + 에(to) + 가(go) + 았(past) + 다(declarative)

A single Korean word can express what requires an entire English phrase. Tokenizers not designed for Korean often split words incorrectly.

2. Tokenization Issues

"임베딩 모델 선택" (Embedding model selection)

BPE Tokenizer (English-centric):
  → ["임", "베", "딩", " 모", "델", " 선", "택"]   (7 tokens, character-level split)

Korean-optimized Tokenizer:
  → ["임베딩", "모델", "선택"]                       (3 tokens, word-level split)

English-centric tokenizers fragment Korean text into many small pieces, wasting context window capacity and degrading semantic representation.
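One reason for this fragmentation can be seen without any tokenizer library. Byte-level BPE tokenizers start from UTF-8 bytes, and each precomposed Hangul syllable takes three bytes, so a vocabulary with few learned Korean merges can fall back to several tokens per syllable. The sketch below only counts bytes; actual token counts depend on the specific tokenizer.

```python
text = "임베딩 모델 선택"

# Each precomposed Hangul syllable occupies 3 bytes in UTF-8; a space takes 1.
num_chars = len(text)
num_bytes = len(text.encode("utf-8"))
print(f"Korean:  {num_chars} characters, {num_bytes} UTF-8 bytes")

english = "Embedding model selection"
print(f"English: {len(english)} characters, {len(english.encode('utf-8'))} UTF-8 bytes")
```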

3. Subject/Object Omission

Korean frequently omits subjects and objects when they can be inferred from context, making isolated sentence embeddings less informative.

4. Honorific System

The same meaning can be expressed in multiple ways depending on formality level, increasing surface-form variation.

Korean Retrieval Benchmark Results

Performance on Korean retrieval tasks (Ko-StrategyQA, KorQuAD retrieval, Ko-MIRACL):

Model	Ko-StrategyQA (nDCG@10)	KorQuAD Retrieval (R@10)	Ko-MIRACL (nDCG@10)	Avg
BGE-M3	72.1	85.3	68.4	75.3
multilingual-e5-large	68.5	82.1	65.2	71.9
KoSimCSE-roberta	65.3	79.8	61.7	68.9
text-embedding-3-large	70.8	83.7	67.1	73.9
text-embedding-3-small	63.2	76.4	58.9	66.2
Cohere embed-v3	69.5	81.9	66.0	72.5
jina-embeddings-v3	69.1	82.5	65.8	72.5
E5-Mistral-7B	71.5	84.2	67.8	74.5

Recommended Models for Korean RAG

Priority	Model	Reason
1st	BGE-M3	Best Korean performance, open-source, hybrid retrieval, 8K context
2nd	text-embedding-3-large	Strong Korean support, easy API, Matryoshka dimensions
3rd	multilingual-e5-large	Good balance of quality and resource efficiency
4th	KoSimCSE-roberta	Purpose-built for Korean, lightweight, but limited context length

Practical Tips for Korean Embeddings

Preprocessing matters: Apply morphological analysis (e.g., Mecab, Kiwi) before embedding for better tokenization
Test with your data: Benchmark scores do not always predict performance on domain-specific Korean text
Consider hybrid retrieval: Combine dense embeddings with BM25 for Korean -- BM25 handles exact keyword matching well for Korean compound nouns
Chunk size: Korean text is denser than English; use slightly smaller chunk sizes (300-400 tokens vs 500+ for English)
Query formulation: For models requiring instruction prefixes, include Korean-specific instructions
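For the hybrid retrieval tip, one common way to combine a dense ranking with a BM25 ranking is Reciprocal Rank Fusion (RRF). This is a technique swapped in here for illustration, not a method prescribed by this guide; the document IDs are placeholders and k=60 is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists for one query (doc IDs are placeholders).
dense_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)
```

Because RRF uses only ranks, it avoids having to calibrate BM25 scores against cosine similarities, which live on different scales.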

# Example: Korean preprocessing with Kiwi before embedding
from kiwipiepy import Kiwi
 
kiwi = Kiwi()
 
def preprocess_korean(text):
    """Morphological analysis for better Korean embedding quality."""
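    tokens = kiwi.tokenize(text)
    # Keep content-bearing morphemes by POS tag prefix: N* (nouns), V* (verbs/adjectives),
    # M* (modifiers/adverbs), S* (symbols, numerals, foreign words); drop particles and endings
    return " ".join([t.form for t in tokens if t.tag[0] in ('N', 'V', 'M', 'S')])
 
# Before embedding
raw_text = "한국어 임베딩 모델의 성능을 비교합니다"
processed = preprocess_korean(raw_text)
# e.g. "한국어 임베딩 모델 성능 비교" (exact output can vary with the Kiwi version)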

5. Embedding Model Usage

OpenAI Embeddings

from openai import OpenAI
 
client = OpenAI()
 
def get_openai_embedding(text, model="text-embedding-3-small", dimensions=None):
    """Get embedding from OpenAI API."""
    params = {"input": text, "model": model}
    if dimensions:
        params["dimensions"] = dimensions  # Matryoshka dimension reduction
    
    response = client.embeddings.create(**params)
    return response.data[0].embedding
 
# Single text
embedding = get_openai_embedding("What is machine learning?")
print(f"Dimension: {len(embedding)}")  # 1536
 
# With dimension reduction (Matryoshka)
embedding_small = get_openai_embedding(
    "What is machine learning?",
    model="text-embedding-3-large",
    dimensions=256  # Reduce from 3072 to 256
)
print(f"Reduced dimension: {len(embedding_small)}")  # 256
 
# Batch processing
def get_openai_embeddings_batch(texts, model="text-embedding-3-small", batch_size=100):
    """Process texts in batches for efficiency."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings

HuggingFace Sentence-Transformers

from sentence_transformers import SentenceTransformer
import numpy as np
 
# Load model
model = SentenceTransformer("BAAI/bge-m3")
 
# Single text
embedding = model.encode("What is machine learning?")
print(f"Dimension: {embedding.shape}")  # (1024,)
 
# Batch processing with normalization
sentences = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with multiple layers.",
    "The weather is sunny today.",
]
 
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,  # L2 normalization for cosine similarity
    batch_size=32,
    show_progress_bar=True,
)
 
# Compute similarity matrix
similarity_matrix = np.inner(embeddings, embeddings)
print("Similarity matrix:")
print(similarity_matrix)

Using E5 models with instruction prefix:

from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("intfloat/multilingual-e5-large")
 
# E5 models require "query: " or "passage: " prefix
queries = ["query: What is the capital of France?"]
passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is the capital of Germany.",
]
 
query_embeddings = model.encode(queries, normalize_embeddings=True)
passage_embeddings = model.encode(passages, normalize_embeddings=True)
 
# Compute scores
scores = query_embeddings @ passage_embeddings.T
print(f"Score for Paris passage: {scores[0][0]:.4f}")   # Higher
print(f"Score for Berlin passage: {scores[0][1]:.4f}")  # Lower

Ollama Embeddings (Local)

import requests
import numpy as np
 
def get_ollama_embedding(text, model="bge-m3"):
    """Get embedding from local Ollama server."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    response.raise_for_status()  # Fail loudly if the server or model is unavailable
    return response.json()["embedding"]
 
# Single embedding
embedding = get_ollama_embedding("What is machine learning?")
print(f"Dimension: {len(embedding)}")
 
# Batch processing (this endpoint takes one prompt per request; newer Ollama
# versions also expose /api/embed, which accepts a list of inputs)
def get_ollama_embeddings_batch(texts, model="bge-m3"):
    """Process multiple texts through Ollama."""
    embeddings = []
    for text in texts:
        emb = get_ollama_embedding(text, model)
        embeddings.append(emb)
    return np.array(embeddings)
 
# Using Ollama Python library
import ollama
 
response = ollama.embeddings(model="bge-m3", prompt="What is machine learning?")
embedding = response["embedding"]

Normalization

import numpy as np
 
def normalize_embedding(embedding):
    """L2 normalize an embedding vector."""
    embedding = np.array(embedding)
    norm = np.linalg.norm(embedding)
    if norm == 0:
        return embedding
    return embedding / norm
 
def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
 
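# If embeddings are already L2-normalized, cosine similarity = dot product
embedding_a = [0.12, -0.34, 0.56, 0.78]  # Example vectors; substitute real model outputs
embedding_b = [0.11, -0.32, 0.55, 0.80]
a_norm = normalize_embedding(embedding_a)
b_norm = normalize_embedding(embedding_b)
similarity = np.dot(a_norm, b_norm)  # Equivalent to cosine_similarity(embedding_a, embedding_b)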

6. Dimensionality Reduction

Matryoshka Representation Learning (MRL)

Named after Russian nesting dolls, Matryoshka embeddings are trained so that the first d dimensions of a larger embedding are themselves a valid embedding. This allows flexible dimension selection at inference time without retraining.

Full embedding (3072 dims):  [v1, v2, v3, ..., v256, ..., v1024, ..., v3072]
                              |________________|
                              256-dim embedding (valid!)
                              |________________________________|
                              1024-dim embedding (valid!)
                              |__________________________________________________|
                              3072-dim embedding (full quality)

Models supporting Matryoshka embeddings:

OpenAI text-embedding-3-small/large (via API dimensions parameter)
jina-embeddings-v3
nomic-embed-text-v1.5

PCA (Principal Component Analysis)

For models that do not natively support Matryoshka, PCA can reduce dimensions post-hoc:

from sklearn.decomposition import PCA
import numpy as np
 
# Assume we have a matrix of embeddings (n_samples x original_dim)
embeddings = np.array(all_embeddings)  # shape: (10000, 1024)
 
# Fit PCA
target_dim = 256
pca = PCA(n_components=target_dim)
reduced_embeddings = pca.fit_transform(embeddings)
 
print(f"Original shape: {embeddings.shape}")           # (10000, 1024)
print(f"Reduced shape: {reduced_embeddings.shape}")     # (10000, 256)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
 
# Save PCA model for query-time reduction
import pickle
with open("pca_model.pkl", "wb") as f:
    pickle.dump(pca, f)

Truncation

The simplest approach -- just take the first N dimensions. Works well with Matryoshka-trained models, poorly with others:

# Simple truncation (only for Matryoshka-trained models!)
# Assumes `model` is a Matryoshka-trained model; the full dimension depends on the model
full_embedding = model.encode("some text")      # Full-size vector
truncated = full_embedding[:256]                 # Take first 256
truncated = truncated / np.linalg.norm(truncated)  # Re-normalize

Dimension vs Accuracy vs Cost Trade-offs

Dimensions	Relative Accuracy	Storage (1M vectors)	Search Speed	Use Case
3072	100% (baseline)	~12 GB	Slowest	Maximum quality requirements
1536	~98%	~6 GB	Moderate	Standard production
1024	~96%	~4 GB	Moderate	Good balance
512	~93%	~2 GB	Fast	Cost-sensitive applications
256	~88%	~1 GB	Very fast	Prototyping, low-resource
128	~82%	~0.5 GB	Fastest	Edge deployment

Note: Accuracy retention percentages are approximate and vary significantly by model and task. Matryoshka-trained models retain more accuracy at lower dimensions compared to post-hoc reduction methods like PCA or truncation.

7. Fine-Tuning Embeddings

When to Fine-Tune

Fine-tuning is worthwhile when:

Domain-specific vocabulary: Medical, legal, or technical jargon not well-represented in general models
Specialized similarity criteria: Your notion of similarity differs from general semantic similarity
Low baseline performance: Off-the-shelf models underperform on your specific retrieval tasks
Sufficient training data: You have at least 1,000 labeled pairs (ideally 10K+)

Fine-tuning is NOT necessary when:

General-purpose retrieval works well enough
You have insufficient training data (< 500 pairs)
The domain is well-covered by multilingual models

Training Data Format

The most effective format is query-positive-negative triples:

[
  {
    "query": "How to prevent SQL injection?",
    "positive": "Use parameterized queries and prepared statements to prevent SQL injection attacks. Never concatenate user input directly into SQL strings.",
    "negative": "SQL databases store data in tables with rows and columns. Common SQL databases include MySQL, PostgreSQL, and SQLite."
  },
  {
    "query": "What causes memory leaks in Java?",
    "positive": "Memory leaks in Java occur when objects are no longer needed but still referenced, preventing garbage collection. Common causes include static collections, unclosed resources, and listener accumulation.",
    "negative": "Java is a popular programming language developed by Sun Microsystems. It runs on the Java Virtual Machine (JVM)."
  }
]

Hard negatives (documents that are topically related but not relevant) are far more effective than random negatives for training.
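One simple way to obtain hard negatives is to embed the corpus with the current model and pick the highest-scoring document that is not labeled relevant. The sketch below uses toy vectors and a hypothetical helper name; note that top-scoring unlabeled documents are sometimes actually relevant, so mined negatives are usually worth spot-checking.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mine_hard_negative(query_vec, doc_vecs, relevant_ids):
    """Return the ID of the highest-scoring document that is NOT labeled relevant."""
    non_relevant = {doc_id: vec for doc_id, vec in doc_vecs.items() if doc_id not in relevant_ids}
    return max(non_relevant, key=lambda doc_id: dot(query_vec, non_relevant[doc_id]))

# Toy vectors standing in for real model outputs.
doc_vecs = {
    "sqli_prevention": [0.9, 0.4],   # labeled relevant
    "sql_basics": [0.8, 0.6],        # same topic, does not answer the query
    "java_history": [0.1, 0.99],     # unrelated (an easy, random-style negative)
}
hard_negative = mine_hard_negative([1.0, 0.0], doc_vecs, relevant_ids={"sqli_prevention"})
print(hard_negative)
```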

Fine-Tuning with Sentence-Transformers

from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from torch.utils.data import DataLoader
import json
 
# Load base model
model = SentenceTransformer("BAAI/bge-m3")
 
# Prepare training data
def load_training_data(filepath):
    """Load training triples from JSON file."""
    with open(filepath) as f:
        data = json.load(f)
    
    examples = []
    for item in data:
        examples.append(InputExample(
            texts=[item["query"], item["positive"], item["negative"]]
        ))
    return examples
 
train_examples = load_training_data("training_triples.json")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
 
# Define loss function
train_loss = losses.TripletLoss(model=model)
 
# Alternative: MultipleNegativesRankingLoss (more effective with in-batch negatives)
# train_loss = losses.MultipleNegativesRankingLoss(model=model)
 
# Prepare evaluator
def create_evaluator(eval_data_path):
    """Create IR evaluator for monitoring training progress."""
    with open(eval_data_path) as f:
        eval_data = json.load(f)
    
    queries = {str(i): item["query"] for i, item in enumerate(eval_data)}
    corpus = {str(i): item["document"] for i, item in enumerate(eval_data)}
    relevant_docs = {str(i): {str(i)} for i in range(len(eval_data))}
    
    return InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name="custom-eval",
    )
 
evaluator = create_evaluator("eval_data.json")
 
# Fine-tune (legacy fit() API; sentence-transformers v3+ also provides SentenceTransformerTrainer)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=3,
    evaluation_steps=500,
    warmup_steps=100,
    output_path="./finetuned-bge-m3",
    save_best_model=True,
)
 
# Load and use fine-tuned model
finetuned_model = SentenceTransformer("./finetuned-bge-m3")
embeddings = finetuned_model.encode(["test query"], normalize_embeddings=True)

Evaluation Metrics

Metric	Description	Formula	Good Value
Recall@K	Fraction of relevant docs in top-K	Relevant in top-K / Total relevant	> 0.90 for K=10
MRR (Mean Reciprocal Rank)	Average of 1/rank of first relevant result	Mean(1/rank_i)	> 0.50
nDCG@K	Normalized Discounted Cumulative Gain	Considers position and graded relevance	> 0.60
MAP (Mean Average Precision)	Mean of precision at each relevant rank	Mean of AP per query	> 0.50
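To make the definitions in the table concrete, here is a minimal pure-Python sketch for a single query with binary relevance labels. MRR is the mean of the reciprocal rank over all queries. The function names and the toy ranking are assumptions for illustration, and `relevant_ids` is assumed to be non-empty.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: gain 1 for relevant docs, discounted by log2(rank + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg

ranked = ["d3", "d1", "d7", "d2"]   # system output, best first
relevant = {"d1", "d2"}             # ground-truth labels

recall = recall_at_k(ranked, relevant, k=3)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant, k=4)
print(f"Recall@3={recall:.2f} RR={rr:.2f} nDCG@4={ndcg:.4f}")
```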

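# Quick evaluation example
# Reuses `evaluator` and `model` from the fine-tuning example.
# In sentence-transformers v3+, the evaluator returns a dict whose keys include the
# score function name (e.g. "cosine"); older versions return a single float instead.
from sentence_transformers.evaluation import InformationRetrievalEvaluator
 
results = evaluator(model)
print(f"Recall@10: {results['custom-eval_cosine_recall@10']:.4f}")
print(f"MRR@10: {results['custom-eval_cosine_mrr@10']:.4f}")
print(f"nDCG@10: {results['custom-eval_cosine_ndcg@10']:.4f}")
print(f"MAP@100: {results['custom-eval_cosine_map@100']:.4f}")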

8. RAG Integration Best Practices

Chunking and Embedding Alignment

The relationship between chunking strategy and embedding model is critical:

Document: "Machine learning overview... [2000 tokens of content]"

Strategy 1: Fixed-size chunks (512 tokens)
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Chunk 1  │ │ Chunk 2  │ │ Chunk 3  │ │ Chunk 4  │
│ 512 tok  │ │ 512 tok  │ │ 512 tok  │ │ 512 tok  │
└──────────┘ └──────────┘ └──────────┘ └──────────┘

Strategy 2: Semantic chunks (variable size)
┌────────────────┐ ┌──────────┐ ┌──────────────────────┐
│   Section 1    │ │ Section 2│ │      Section 3        │
│   320 tok      │ │ 180 tok  │ │      640 tok          │
└────────────────┘ └──────────┘ └──────────────────────┘

Factor	Recommendation
Chunk size	Match to model's sweet spot (typically 256-512 tokens)
Overlap	10-20% overlap to avoid boundary information loss
Max tokens	Never exceed model's max token limit
Granularity	Smaller chunks for precise Q&A, larger for summarization
Metadata	Include source, section headers, and page numbers
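A fixed-size chunker with overlap can be sketched in a few lines. The token list below is a stand-in for real tokenizer output, and the defaults (512-token chunks, 64-token overlap, about 12.5%) are example values within the ranges recommended above.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = [f"tok{i}" for i in range(2000)]   # stand-in for real tokenizer output
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
print(len(chunks), [len(c) for c in chunks])
```

With overlap, the 2000-token document yields five chunks instead of four, which is the storage price paid for not losing information at chunk boundaries.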

Query vs Document Embedding

Some models distinguish between query-time and document-time embeddings:

# Models requiring different prefixes for queries vs documents
 
# BGE v1/v1.5 English models (BGE-M3 does not require a query instruction)
query_prefix = "Represent this sentence for searching relevant passages: "
doc_prefix = ""  # No prefix for documents
 
query_embedding = model.encode(query_prefix + "What is RAG?")
doc_embedding = model.encode(doc_prefix + "RAG combines retrieval with generation...")
 
# E5 models
query_text = "query: What is RAG?"
doc_text = "passage: RAG combines retrieval with generation..."
 
# Cohere API
import cohere
co = cohere.Client("your-api-key")
 
# Different input_type for indexing vs searching
doc_response = co.embed(
    texts=["RAG combines retrieval..."],
    model="embed-multilingual-v3.0",
    input_type="search_document"
)
 
query_response = co.embed(
    texts=["What is RAG?"],
    model="embed-multilingual-v3.0",
    input_type="search_query"
)

Note: Always use the correct prefix/input_type. Using a document prefix for queries (or vice versa) can reduce retrieval accuracy by 5-15%.

Caching Strategies

import hashlib
import os
import numpy as np
 
class EmbeddingCache:
    """Simple file-based embedding cache."""
    
    def __init__(self, cache_dir="./embedding_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
    
    def _hash_text(self, text, model_name):
        """Create a unique hash for text + model combination."""
        key = f"{model_name}:{text}"
        return hashlib.sha256(key.encode()).hexdigest()
    
    def get(self, text, model_name):
        """Retrieve cached embedding."""
        hash_key = self._hash_text(text, model_name)
        cache_path = os.path.join(self.cache_dir, f"{hash_key}.npy")
        if os.path.exists(cache_path):
            return np.load(cache_path)
        return None
    
    def set(self, text, model_name, embedding):
        """Store embedding in cache."""
        hash_key = self._hash_text(text, model_name)
        cache_path = os.path.join(self.cache_dir, f"{hash_key}.npy")
        np.save(cache_path, np.array(embedding))
    
    def get_or_compute(self, text, model_name, compute_fn):
        """Get from cache or compute and store."""
        cached = self.get(text, model_name)
        if cached is not None:
            return cached
        embedding = compute_fn(text)
        self.set(text, model_name, embedding)
        return embedding
 
# Usage
cache = EmbeddingCache()
embedding = cache.get_or_compute(
    "What is machine learning?",
    "bge-m3",
    lambda text: model.encode(text, normalize_embeddings=True)
)

Cost Optimization

Strategy	Savings	Impact on Quality	Implementation
Batch processing	10-20% latency	None	Group requests, process together
Dimension reduction	50-75% storage	2-10% accuracy loss	Use Matryoshka or PCA
Caching	80%+ API costs	None	Cache computed embeddings
Model selection	5-10x cost	Varies	Use smaller model if quality permits
Quantization	50-75% storage	1-3% accuracy loss	int8 or binary quantization
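The quantization row can be illustrated with symmetric int8 scalar quantization, which stores one byte per value instead of four (a 75% reduction). This is a minimal sketch with hypothetical helper names; production systems would normally rely on the quantization built into the vector database or the embedding API.

```python
def quantize_int8(vector):
    """Symmetric scalar quantization: map floats to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 so an all-zero vector does not divide by zero.
    scale = max(abs(x) for x in vector) / 127 or 1.0
    return [round(x / scale) for x in vector], scale

def dequantize_int8(codes, scale):
    return [c * scale for c in codes]

vector = [0.12, -0.34, 0.56, 0.78]
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, f"max error {max_error:.4f}")
```

The reconstruction error per value is bounded by half the scale, which is why the accuracy impact tends to be small relative to the storage savings.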

# Cost estimation helper
def estimate_embedding_cost(
    num_documents: int,
    avg_tokens_per_doc: int,
    model: str = "text-embedding-3-small",
):
    """Estimate OpenAI embedding costs."""
    pricing = {
        "text-embedding-3-small": 0.02,   # per 1M tokens
        "text-embedding-3-large": 0.13,   # per 1M tokens
    }
    
    total_tokens = num_documents * avg_tokens_per_doc
    cost = (total_tokens / 1_000_000) * pricing.get(model, 0)
    
    print(f"Documents: {num_documents:,}")
    print(f"Avg tokens/doc: {avg_tokens_per_doc}")
    print(f"Total tokens: {total_tokens:,}")
    print(f"Model: {model}")
    print(f"Estimated cost: ${cost:.4f}")
    return cost
 
# Example: Embedding 100K documents
estimate_embedding_cost(100_000, 300, "text-embedding-3-small")
# Documents: 100,000
# Avg tokens/doc: 300
# Total tokens: 30,000,000
# Model: text-embedding-3-small
# Estimated cost: $0.6000

9. Selection Guide

Decision Flowchart

[Diagram not available: model selection decision flowchart]

Scenario-Based Recommendations

Scenario	Recommended Model	Dimensions	Rationale
Korean enterprise RAG	BGE-M3	1024	Best Korean performance, MIT license, hybrid retrieval
Multilingual SaaS product	OpenAI text-embedding-3-large	1024 (reduced)	Easy API, 100+ languages, Matryoshka support
English-only high accuracy	GTE-Qwen2-7B	3584	Top MTEB scores, Apache license
Budget-constrained startup	OpenAI text-embedding-3-small	512 (reduced)	Lowest API cost, decent quality
On-premise (no external API)	BGE-M3	1024	MIT license, no API dependency
Long document retrieval	E5-Mistral-7B	4096	32K context window
Low-resource / edge	multilingual-e5-large	256 (PCA)	Good quality at small size
Korean-specific, lightweight	KoSimCSE-roberta	768	Korean-native, small footprint
Maximum flexibility	jina-embeddings-v3	1024	Task-specific LoRA adapters, 8K context

Quick Comparison Summary

Quality vs Cost Matrix:
                    
  High Quality  │  E5-Mistral    GTE-Qwen2
                │       ●            ●
                │
                │  BGE-M3    text-emb-3-large
                │    ●              ●
                │         jina-v3
                │           ●
                │  ml-e5-large
                │      ●       text-emb-3-small
                │                    ●
  Low Quality   │
                └──────────────────────────────
                 Free              Paid
                (self-host)       (API)

Production Checklist

Before deploying an embedding model to production, verify:

Benchmark on your data: Run retrieval evaluations on representative queries
Tokenization check: Verify Korean/special character handling
Latency profiling: Measure p50/p95/p99 embedding generation time
Cost projection: Calculate monthly embedding costs at expected volume
Fallback plan: Have a backup model in case primary is unavailable
Versioning: Track which model version generated each embedding
Re-embedding strategy: Plan for model upgrades (must re-embed all documents)
Monitoring: Set up quality metrics (retrieval recall, user feedback)

References

Reimers, N. and Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019.
Chen, J. et al. (2024). "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation." arXiv:2402.03216.
Wang, L. et al. (2024). "Improving Text Embeddings with Large Language Models." arXiv:2401.00368.
MTEB Benchmark Leaderboard - https://huggingface.co/spaces/mteb/leaderboard
OpenAI Embeddings Documentation - https://platform.openai.com/docs/guides/embeddings
Sentence-Transformers Documentation - https://www.sbert.net/
Kusupati, A. et al. (2022). "Matryoshka Representation Learning." NeurIPS 2022.
Cohere Embed v3 Documentation - https://docs.cohere.com/docs/embeddings
Jina AI Embeddings v3 - https://jina.ai/embeddings/

— Data Dynamics Engineering Team