
Embedding Model Selection Guide - From Concepts to Korean Language Benchmarks

A comprehensive guide covering embedding model concepts, major model comparison (OpenAI, Cohere, BGE-M3, E5, multilingual-e5), Korean language performance benchmarks, dimensionality reduction, fine-tuning, and selection strategies for RAG applications.

Data Dynamics | April 16, 2026 | 24 min read

Embeddings are the foundational building block of modern NLP systems, from semantic search to RAG pipelines. Choosing the right embedding model directly impacts retrieval quality, latency, and cost. This guide covers everything from embedding fundamentals to Korean-language benchmarks and production deployment strategies.


1. What are Embeddings?

Definition

An embedding is a dense vector representation of text (or other data) in a continuous, high-dimensional space. Unlike sparse representations such as TF-IDF or bag-of-words, embeddings capture semantic meaning -- words and sentences with similar meanings are mapped to nearby points in vector space.

Text Input                     Embedding Vector (simplified)
─────────────────────────────────────────────────────────────
"The cat sat on the mat"   →   [0.12, -0.34, 0.56, 0.78, ...]
"A kitten rested on a rug" →   [0.11, -0.32, 0.55, 0.80, ...]
"Stock market crashed"     →   [-0.87, 0.45, -0.12, 0.03, ...]

The first two sentences have similar vectors because their meanings are close, while the third sentence about finance is far away in vector space.
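
To make "nearby in vector space" concrete, here is a minimal sketch that computes cosine similarity over the four components shown above. The vectors are the truncated illustrative values from the table, not output from a real model, and the function name is chosen for this example only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First four components from the illustration (real vectors have hundreds more).
cat = [0.12, -0.34, 0.56, 0.78]
kitten = [0.11, -0.32, 0.55, 0.80]
stocks = [-0.87, 0.45, -0.12, 0.03]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_stocks = cosine_similarity(cat, stocks)
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs stocks: {sim_cat_stocks:.3f}")
```

The first pair scores close to 1, while the unrelated pair scores below zero.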

How Text-to-Vector Conversion Works

The embedding pipeline follows a consistent pattern:

Raw Text → Tokenization → Token IDs → Transformer Encoder → Hidden States → Pooling → Embedding Vector
  1. Tokenization: Text is split into subword tokens (e.g., WordPiece, BPE)
  2. Token Encoding: Each token is mapped to an ID and passed through the model
  3. Transformer Layers: Self-attention layers build contextual representations
  4. Pooling: Token-level representations are aggregated into a single vector

Semantic Space Visualization

The classic example illustrating semantic relationships in embedding space:

         king ─────── queen
          │              │
          │   (gender)   │
          │              │
         man ─────── woman

    king - man + woman ≈ queen

This demonstrates that embeddings capture not just similarity but also analogical relationships. The vector arithmetic king - man + woman yields a vector close to queen, showing that the model has learned the concept of gender as a direction in vector space.
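
The analogy can be reproduced with a toy vocabulary. The sketch below uses hypothetical 2-D vectors (axis 0 for "royalty", axis 1 for "gender") that were hand-picked for illustration; real word embeddings are learned and only approximately satisfy this relationship.

```python
import math

# Hypothetical 2-D vectors: axis 0 = "royalty", axis 1 = "gender".
vocab = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is the usual convention for this test; without it, the nearest neighbor is often one of the inputs.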

Types of Embeddings

| Level | Description | Use Case | Example Models |
|---|---|---|---|
| Word-level | One vector per word, context-independent | Word similarity, simple classification | Word2Vec, GloVe, FastText |
| Sentence-level | One vector per sentence, context-aware | Semantic search, similarity matching | Sentence-BERT, E5, BGE |
| Document-level | One vector per document or paragraph | Document retrieval, clustering | Doc2Vec, long-context embedding models |

Note: Modern embedding models (2023+) primarily operate at the sentence/passage level using transformer architectures. Word-level embeddings like Word2Vec are largely superseded for production use cases.

Embedding Dimensions Explained

The dimension of an embedding vector determines how much information it can encode:

| Dimension | Storage per Vector (float32) | Characteristics |
|---|---|---|
| 384 | 1.5 KB | Lightweight, fast, good for simple tasks |
| 768 | 3 KB | Standard size, good quality-cost balance |
| 1024 | 4 KB | Higher quality, moderate cost |
| 1536 | 6 KB | High quality (OpenAI default) |
| 3072 | 12 KB | Maximum quality, highest storage cost |

Higher dimensions generally improve quality but increase:

  • Storage cost: Proportional to dimension size
  • Search latency: Distance computation scales linearly with dimension
  • Memory usage: Index size grows with dimension
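
The storage figures in the table follow directly from the vector size. A minimal sketch, assuming 4-byte float32 values and ignoring any index overhead (the function names are illustrative):

```python
BYTES_PER_FLOAT32 = 4

def vector_storage_kb(dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Storage for one vector in KB (1 KB = 1024 bytes)."""
    return dim * bytes_per_value / 1024

def index_storage_gb(dim, num_vectors, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return dim * bytes_per_value * num_vectors / 1e9

for dim in (384, 768, 1024, 1536, 3072):
    print(f"{dim:>5} dims: {vector_storage_kb(dim):.1f} KB per vector, "
          f"{index_storage_gb(dim, 1_000_000):.2f} GB per 1M vectors")
```

Passing `bytes_per_value=1` gives the corresponding numbers for int8-quantized vectors.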

2. Embedding Model Architecture

Bi-Encoder vs Cross-Encoder

The two fundamental architectures for computing text similarity serve different purposes:

[Bi-Encoder]                           [Cross-Encoder]
                                       
Query  → Encoder → Vector Q            Query + Document
Doc    → Encoder → Vector D            → Encoder → Relevance Score
                                       
Similarity = cosine(Q, D)              Score = sigmoid(output)
                                       
✓ Pre-compute document vectors         ✗ Must process pair together
✓ Fast retrieval (ANN search)          ✗ Slow (O(n) for n documents)
✗ Lower accuracy                       ✓ Higher accuracy

| Feature | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Input | Single text | Text pair (query + document) |
| Output | Vector embedding | Relevance score |
| Speed | Fast (independent encoding) | Slow (joint encoding) |
| Accuracy | Good | Superior |
| Offline indexing | Yes (pre-compute vectors) | No |
| Typical use | First-stage retrieval | Re-ranking stage |
| Scalability | Millions of documents | Top-K re-ranking (10-100) |

Note: In production RAG systems, the best practice is to use a Bi-Encoder for initial retrieval (top-100) and then a Cross-Encoder for re-ranking (top-100 to top-10). This combines speed with accuracy.
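
The two-stage pattern can be sketched in a few lines. In this illustration the document vectors are hand-written toy values, the exhaustive scan stands in for an ANN index, and `toy_rerank` is a hypothetical word-overlap scorer standing in for a real cross-encoder.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_then_rerank(query_vec, query_text, docs, rerank_fn,
                         first_k=100, final_k=10):
    """docs: list of (doc_text, doc_vec) pairs with pre-computed vectors."""
    # Stage 1 (bi-encoder): cheap vector scoring over the whole corpus.
    candidates = sorted(docs, key=lambda d: dot(query_vec, d[1]),
                        reverse=True)[:first_k]
    # Stage 2 (cross-encoder): expensive pairwise scoring on the short list only.
    reranked = sorted(candidates, key=lambda d: rerank_fn(query_text, d[0]),
                      reverse=True)
    return [text for text, _ in reranked[:final_k]]

def toy_rerank(query, doc):
    """Hypothetical stand-in for a cross-encoder: share of query words in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

docs = [
    ("paris is the capital of france", [0.9, 0.1]),
    ("france borders spain and italy", [0.8, 0.2]),
    ("the stock market fell today", [0.1, 0.9]),
]
top = retrieve_then_rerank([1.0, 0.0], "capital of france", docs,
                           toy_rerank, first_k=2, final_k=1)
print(top)
```

The re-ranker only ever sees `first_k` documents, which is what keeps its per-query cost bounded regardless of corpus size.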

Sentence-BERT Architecture

Sentence-BERT (SBERT) is the foundational architecture for modern sentence embedding models. It fine-tunes BERT with a siamese/triplet network structure:

┌──────────────────────────────────────────────┐
│              Sentence-BERT                    │
│                                              │
│  Sentence A        Sentence B                │
│      │                  │                    │
│  [BERT Encoder]    [BERT Encoder]            │
│  (shared weights)  (shared weights)          │
│      │                  │                    │
│  [Pooling]         [Pooling]                 │
│      │                  │                    │
│  Vector u          Vector v                  │
│      │                  │                    │
│      └──── Objective ───┘                    │
│           Function                           │
│    (cosine similarity / contrastive loss)     │
└──────────────────────────────────────────────┘

Key properties:

  • Shared weights: Both sentences are encoded by the same model
  • Independent encoding: Each sentence is processed separately
  • Fixed-size output: Regardless of input length, output dimension is constant

Contrastive Learning

Modern embedding models are trained using contrastive learning objectives. The goal is to pull similar pairs closer and push dissimilar pairs apart in vector space.

InfoNCE Loss (common training objective):

L = -log( exp(sim(q, d+) / τ) / Σ exp(sim(q, di) / τ) )

Where:
  q   = query embedding
  d+  = positive (relevant) document embedding
  di  = all documents in batch (positive + negatives)
  τ   = temperature parameter
  sim = cosine similarity

Training data formats:

  • Pairs: (query, positive_document)
  • Triplets: (query, positive_document, negative_document)
  • In-batch negatives: The positive documents paired with other queries in the same batch serve as negatives for the current query
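
The InfoNCE formula with in-batch negatives can be written out directly. This is a minimal sketch over a pre-computed similarity matrix in plain Python; the temperature default of 0.05 is an assumed value for illustration, and real training code would compute this on tensors with automatic differentiation.

```python
import math

def info_nce_loss(sim_matrix, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    sim_matrix[i][j] = sim(q_i, d_j). Document d_i is the positive for
    query q_i; every other d_j in the row acts as an in-batch negative.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # -log(exp(pos) / sum(exp(all))) = log(sum(exp(all))) - pos
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Positives on the diagonal score highest: low loss.
aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
# Positives score lower than some negatives: high loss.
misaligned = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.9, 0.1]]

loss_aligned = info_nce_loss(aligned)
loss_misaligned = info_nce_loss(misaligned)
print(f"aligned: {loss_aligned:.4f}, misaligned: {loss_misaligned:.4f}")
```

A lower temperature sharpens the softmax, so the loss concentrates on the hardest negatives in each row.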

Pooling Strategies

Pooling converts variable-length token representations into a fixed-size vector:

| Strategy | Method | Pros | Cons |
|---|---|---|---|
| CLS Token | Use the [CLS] token representation | Standard BERT approach | May not capture full sentence meaning |
| Mean Pooling | Average all token representations | Captures information from all tokens | Can dilute important signals |
| Max Pooling | Take max across each dimension | Captures salient features | May lose nuance |
| Weighted Mean | Attention-weighted average | Adaptive focus | More complex |

import torch
 
def mean_pooling(model_output, attention_mask):
    """Mean pooling - average token embeddings weighted by attention mask."""
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
 
def cls_pooling(model_output):
    """CLS pooling - use the [CLS] token representation."""
    return model_output[0][:, 0]

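Note: Mean pooling often outperforms CLS pooling for models that were not trained with a CLS-based objective, but the right choice is always the strategy the model was trained with. E5 and the original GTE models use mean pooling, BGE models (including BGE-M3's dense output) use the [CLS] token, and decoder-based models such as E5-Mistral-7B use the last token. Check the model card before writing custom pooling code.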


3. Major Embedding Models Comparison

Comprehensive Model Comparison

| Model | Provider | Dimensions | Max Tokens | Multilingual | License | Cost | Release |
|---|---|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | 8191 | Yes (100+) | Proprietary | $0.02/1M tokens | 2024-01 |
| text-embedding-3-large | OpenAI | 3072 | 8191 | Yes (100+) | Proprietary | $0.13/1M tokens | 2024-01 |
| embed-v3.0 | Cohere | 1024 | 512 | Yes (100+) | Proprietary | $0.10/1M tokens | 2023-11 |
| BGE-M3 | BAAI | 1024 | 8192 | Yes (100+) | MIT | Free (self-hosted) | 2024-01 |
| E5-Mistral-7B | Microsoft | 4096 | 32768 | Yes | MIT | Free (self-hosted) | 2024-01 |
| multilingual-e5-large | Microsoft | 1024 | 512 | Yes (100+) | MIT | Free (self-hosted) | 2023-06 |
| GTE-Qwen2-7B | Alibaba | 3584 | 131072 | Yes | Apache 2.0 | Free (self-hosted) | 2024-06 |
| jina-embeddings-v3 | Jina AI | 1024 | 8192 | Yes (89) | CC-BY-NC-4.0 | Free (self-hosted) | 2024-09 |

Model Characteristics Deep Dive

OpenAI text-embedding-3 Series

  • Native dimensionality reduction via API parameter (e.g., 256, 512, 1024)
  • Matryoshka Representation Learning support
  • Best-in-class ease of use with simple API
  • No self-hosting option

BGE-M3 (BAAI General Embedding - Multi-Functionality, Multi-Linguality, Multi-Granularity)

  • Supports three retrieval methods: dense, sparse (lexical), and multi-vector (ColBERT)
  • Excellent multilingual performance including Korean
  • 8192 token context window
  • Fully open-source (MIT license)

BGE-M3 Retrieval Modes:
┌─────────────────────────────────────────────────┐
│                    BGE-M3                        │
│                                                 │
│  Dense Retrieval    Sparse Retrieval   ColBERT   │
│  (single vector)   (term weights)    (multi-vec)│
│       │                  │               │      │
│       └──────── Hybrid Score ───────────┘      │
│              (weighted combination)              │
└─────────────────────────────────────────────────┘
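
The "weighted combination" in the diagram amounts to a weighted average of the three per-document scores. A minimal sketch, assuming the three scores are already on comparable scales; the weights and the example scores are hypothetical placeholders, not values recommended by the BGE-M3 authors.

```python
def hybrid_score(dense, sparse, colbert, weights=(0.4, 0.2, 0.4)):
    """Weighted average of dense, sparse, and multi-vector scores.

    The default weights are hypothetical; tune them on your own data.
    """
    w_dense, w_sparse, w_colbert = weights
    total = w_dense + w_sparse + w_colbert
    return (w_dense * dense + w_sparse * sparse + w_colbert * colbert) / total

# Hypothetical per-mode scores for two candidate documents.
scores = {
    "doc_a": hybrid_score(dense=0.82, sparse=0.10, colbert=0.78),
    "doc_b": hybrid_score(dense=0.75, sparse=0.60, colbert=0.70),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Here `doc_b` wins despite a lower dense score because its exact-term (sparse) match is much stronger, which is the situation hybrid scoring is meant to catch.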

E5-Mistral-7B

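  • Built on the Mistral-7B decoder (one of the two 7B-class models in this comparison, alongside GTE-Qwen2-7B)
  • 32K-token context window inherited from Mistral-7B, although the model card recommends keeping inputs within 4,096 tokens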
  • Requires instruction prefix for queries: "Instruct: ...\nQuery: ..."
  • High VRAM requirement (~14GB in FP16)

Cohere embed-v3

  • Input type parameter (search_document, search_query, classification, clustering)
  • Compression-aware training
  • Built-in binary/int8 quantization support

MTEB Benchmark Comparison (English)

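| Model | Retrieval | STS | Classification | Clustering | Avg |
|---|---|---|---|---|---|
| text-embedding-3-large | 55.4 | 64.6 | 75.5 | 49.0 | 64.6 |
| BGE-M3 | 54.3 | 63.2 | 74.1 | 48.5 | 62.8 |
| E5-Mistral-7B | 56.9 | 65.1 | 77.2 | 50.3 | 66.6 |
| GTE-Qwen2-7B | 57.2 | 65.8 | 76.9 | 51.1 | 67.0 |
| jina-embeddings-v3 | 55.8 | 64.3 | 75.1 | 49.8 | 65.0 |
| multilingual-e5-large | 50.1 | 61.5 | 72.3 | 46.2 | 59.8 |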

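Note: MTEB (Massive Text Embedding Benchmark) scores vary by leaderboard version and evaluation subset, so treat the values above as approximate and verify them against the current MTEB leaderboard. The Avg column is not the mean of the four columns shown (for text-embedding-3-large those four average 61.1, not 64.6); it appears to be an overall average that includes task categories omitted from this table.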


4. Korean Language Performance

Why Korean Embeddings are Challenging

Korean presents unique challenges for embedding models compared to English:

1. Agglutinative Morphology

Korean is an agglutinative language where morphemes combine to form complex words:

English: "I went to school"  →  4 tokens
Korean:  "학교에 갔다"          →  Morphemes: 학교(school) + 에(to) + 가(go) + 았(past) + 다(declarative)

A single Korean word can express what requires an entire English phrase. Tokenizers not designed for Korean often split words incorrectly.

2. Tokenization Issues

"임베딩 모델 선택" (Embedding model selection)

BPE Tokenizer (English-centric):
  → ["임", "베", "딩", " 모", "델", " 선", "택"]   (7 tokens, character-level split)

Korean-optimized Tokenizer:
  → ["임베딩", "모델", "선택"]                       (3 tokens, word-level split)

English-centric tokenizers fragment Korean text into many small pieces, wasting context window capacity and degrading semantic representation.
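
One contributing factor is easy to check without any tokenizer installed: byte-level BPE tokenizers operate on UTF-8 bytes, and each Hangul syllable occupies three bytes, so a syllable not covered by the learned merges can cost several tokens. The sketch below only counts bytes; actual token counts depend on the specific tokenizer's vocabulary.

```python
korean = "임베딩 모델 선택"
english = "Embedding model selection"

for label, text in (("Korean", korean), ("English", english)):
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{label}: {n_chars} characters, {n_bytes} UTF-8 bytes "
          f"({n_bytes / n_chars:.2f} bytes per character)")

korean_bytes = len(korean.encode("utf-8"))
english_bytes = len(english.encode("utf-8"))
```

The Korean phrase has roughly a third as many characters as its English translation but almost the same number of bytes, so a tokenizer with few Korean merges has far more pieces to emit per unit of meaning.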

3. Subject/Object Omission

Korean frequently omits subjects and objects when they can be inferred from context, making isolated sentence embeddings less informative.

4. Honorific System

The same meaning can be expressed in multiple ways depending on formality level, increasing surface-form variation.

Korean Retrieval Benchmark Results

Performance on Korean retrieval tasks (Ko-StrategyQA, KorQuAD retrieval, Ko-MIRACL):

| Model | Ko-StrategyQA (nDCG@10) | KorQuAD Retrieval (R@10) | Ko-MIRACL (nDCG@10) | Avg |
|---|---|---|---|---|
| BGE-M3 | 72.1 | 85.3 | 68.4 | 75.3 |
| multilingual-e5-large | 68.5 | 82.1 | 65.2 | 71.9 |
| KoSimCSE-roberta | 65.3 | 79.8 | 61.7 | 68.9 |
| text-embedding-3-large | 70.8 | 83.7 | 67.1 | 73.9 |
| text-embedding-3-small | 63.2 | 76.4 | 58.9 | 66.2 |
| Cohere embed-v3 | 69.5 | 81.9 | 66.0 | 72.5 |
| jina-embeddings-v3 | 69.1 | 82.5 | 65.8 | 72.5 |
| E5-Mistral-7B | 71.5 | 84.2 | 67.8 | 74.5 |

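Recommended priority for Korean workloads, weighing the scores above against licensing and deployment cost. E5-Mistral-7B has the second-highest average (74.5) but does not appear in this ranking; its ~14 GB FP16 footprint makes it costly to serve.

| Priority | Model | Reason |
|---|---|---|
| 1st | BGE-M3 | Best Korean performance, open-source, hybrid retrieval, 8K context |
| 2nd | text-embedding-3-large | Strong Korean support, easy API, Matryoshka dimensions |
| 3rd | multilingual-e5-large | Good balance of quality and resource efficiency |
| 4th | KoSimCSE-roberta | Purpose-built for Korean, lightweight, but limited context length |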

Practical Tips for Korean Embeddings

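  1. Preprocessing matters: Morphological analysis (e.g., Mecab, Kiwi) before embedding can improve tokenization, and it helps BM25-style keyword matching most; because dense models are trained on raw text, confirm the gain on your own data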
  2. Test with your data: Benchmark scores do not always predict performance on domain-specific Korean text
  3. Consider hybrid retrieval: Combine dense embeddings with BM25 for Korean -- BM25 handles exact keyword matching well for Korean compound nouns
  4. Chunk size: Korean text is denser than English; use slightly smaller chunk sizes (300-400 tokens vs 500+ for English)
  5. Query formulation: For models requiring instruction prefixes, include Korean-specific instructions
# Example: Korean preprocessing with Kiwi before embedding
from kiwipiepy import Kiwi
 
kiwi = Kiwi()
 
def preprocess_korean(text):
    """Morphological analysis for better Korean embedding quality."""
    tokens = kiwi.tokenize(text)
    # Join with spaces, keeping meaningful morphemes
    return " ".join([t.form for t in tokens if t.tag[0] in ('N', 'V', 'M', 'S')])
 
# Before embedding
raw_text = "한국어 임베딩 모델의 성능을 비교합니다"
processed = preprocess_korean(raw_text)
# "한국어 임베딩 모델 성능 비교"
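
Tip 3 above (hybrid retrieval) needs a way to merge the dense and BM25 result lists. One common option, swapped in here as an illustration rather than taken from this guide's benchmarks, is reciprocal rank fusion (RRF), which uses only ranks and so avoids having to calibrate the two score scales. The document IDs and rankings below are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; k=60 is a commonly used constant."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

dense_ranking = ["doc2", "doc1", "doc3"]  # hypothetical dense-retriever output
bm25_ranking = ["doc1", "doc4", "doc2"]   # hypothetical BM25 output

fused_ranking = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused_ranking)
```

Documents that appear in both lists rise to the top, while documents found by only one retriever are still kept as candidates.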

5. Embedding Model Usage

OpenAI Embeddings

from openai import OpenAI
 
client = OpenAI()
 
def get_openai_embedding(text, model="text-embedding-3-small", dimensions=None):
    """Get embedding from OpenAI API."""
    params = {"input": text, "model": model}
    if dimensions:
        params["dimensions"] = dimensions  # Matryoshka dimension reduction
    
    response = client.embeddings.create(**params)
    return response.data[0].embedding
 
# Single text
embedding = get_openai_embedding("What is machine learning?")
print(f"Dimension: {len(embedding)}")  # 1536
 
# With dimension reduction (Matryoshka)
embedding_small = get_openai_embedding(
    "What is machine learning?",
    model="text-embedding-3-large",
    dimensions=256  # Reduce from 3072 to 256
)
print(f"Reduced dimension: {len(embedding_small)}")  # 256
 
# Batch processing
def get_openai_embeddings_batch(texts, model="text-embedding-3-small", batch_size=100):
    """Process texts in batches for efficiency."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings

HuggingFace Sentence-Transformers

from sentence_transformers import SentenceTransformer
import numpy as np
 
# Load model
model = SentenceTransformer("BAAI/bge-m3")
 
# Single text
embedding = model.encode("What is machine learning?")
print(f"Dimension: {embedding.shape}")  # (1024,)
 
# Batch processing with normalization
sentences = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with multiple layers.",
    "The weather is sunny today.",
]
 
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,  # L2 normalization for cosine similarity
    batch_size=32,
    show_progress_bar=True,
)
 
# Compute similarity matrix
similarity_matrix = np.inner(embeddings, embeddings)
print("Similarity matrix:")
print(similarity_matrix)

Using E5 models with instruction prefix:

from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("intfloat/multilingual-e5-large")
 
# E5 models require "query: " or "passage: " prefix
queries = ["query: What is the capital of France?"]
passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is the capital of Germany.",
]
 
query_embeddings = model.encode(queries, normalize_embeddings=True)
passage_embeddings = model.encode(passages, normalize_embeddings=True)
 
# Compute scores
scores = query_embeddings @ passage_embeddings.T
print(f"Score for Paris passage: {scores[0][0]:.4f}")   # Higher
print(f"Score for Berlin passage: {scores[0][1]:.4f}")  # Lower

Ollama Embeddings (Local)

import requests
import numpy as np
 
def get_ollama_embedding(text, model="bge-m3"):
    """Get embedding from local Ollama server."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    response.raise_for_status()  # Fail loudly if the server or model is unavailable
    return response.json()["embedding"]
 
# Single embedding
embedding = get_ollama_embedding("What is machine learning?")
print(f"Dimension: {len(embedding)}")
 
# Batch processing (Ollama processes one at a time)
def get_ollama_embeddings_batch(texts, model="bge-m3"):
    """Process multiple texts through Ollama."""
    embeddings = []
    for text in texts:
        emb = get_ollama_embedding(text, model)
        embeddings.append(emb)
    return np.array(embeddings)
 
# Using Ollama Python library
import ollama
 
response = ollama.embeddings(model="bge-m3", prompt="What is machine learning?")
embedding = response["embedding"]

Normalization

import numpy as np
 
def normalize_embedding(embedding):
    """L2 normalize an embedding vector."""
    embedding = np.array(embedding)
    norm = np.linalg.norm(embedding)
    if norm == 0:
        return embedding
    return embedding / norm
 
def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
 
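# If embeddings are already L2-normalized, cosine similarity = dot product
embedding_a = [0.12, -0.34, 0.56, 0.78]  # example vectors for illustration
embedding_b = [0.11, -0.32, 0.55, 0.80]
a_norm = normalize_embedding(embedding_a)
b_norm = normalize_embedding(embedding_b)
similarity = np.dot(a_norm, b_norm)  # Equals cosine_similarity(embedding_a, embedding_b)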

6. Dimensionality Reduction

Matryoshka Representation Learning (MRL)

Named after Russian nesting dolls, Matryoshka embeddings are trained so that the first d dimensions of a larger embedding are themselves a valid embedding. This allows flexible dimension selection at inference time without retraining.

Full embedding (3072 dims):  [v1, v2, v3, ..., v256, ..., v1024, ..., v3072]
                              |________________|
                              256-dim embedding (valid!)
                              |________________________________|
                              1024-dim embedding (valid!)
                              |__________________________________________________|
                              3072-dim embedding (full quality)

Models supporting Matryoshka embeddings:

  • OpenAI text-embedding-3-small/large (via API dimensions parameter)
  • jina-embeddings-v3
  • nomic-embed-text-v1.5

PCA (Principal Component Analysis)

For models that do not natively support Matryoshka, PCA can reduce dimensions post-hoc:

import pickle

import numpy as np
from sklearn.decomposition import PCA
 
# all_embeddings: list of embedding vectors computed earlier
# (e.g., the output of get_openai_embeddings_batch), one row per document
embeddings = np.array(all_embeddings)  # e.g., shape: (10000, 1024)
 
# Fit PCA on the document embeddings
target_dim = 256
pca = PCA(n_components=target_dim)
reduced_embeddings = pca.fit_transform(embeddings)
 
print(f"Original shape: {embeddings.shape}")           # e.g., (10000, 1024)
print(f"Reduced shape: {reduced_embeddings.shape}")     # e.g., (10000, 256)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
 
# Save the fitted PCA model. At query time, load it and call pca.transform()
# on the query embedding so queries and documents share the same reduced space.
with open("pca_model.pkl", "wb") as f:
    pickle.dump(pca, f)

Truncation

The simplest approach -- just take the first N dimensions. Works well with Matryoshka-trained models, poorly with others:

# Simple truncation (only for Matryoshka-trained models!)
import numpy as np

# `model` is assumed to be a Matryoshka-trained model loaded beforehand
full_embedding = model.encode("some text")      # Full-size vector
truncated = full_embedding[:256]                 # Take first 256
truncated = truncated / np.linalg.norm(truncated)  # Re-normalize

Dimension vs Accuracy vs Cost Trade-offs

| Dimensions | Relative Accuracy | Storage (1M vectors, float32) | Search Speed | Use Case |
|---|---|---|---|---|
| 3072 | 100% (baseline) | ~12 GB | Slowest | Maximum quality requirements |
| 1536 | ~98% | ~6 GB | Moderate | Standard production |
| 1024 | ~96% | ~4 GB | Moderate | Good balance |
| 512 | ~93% | ~2 GB | Fast | Cost-sensitive applications |
| 256 | ~88% | ~1 GB | Very fast | Prototyping, low-resource |
| 128 | ~82% | ~0.5 GB | Fastest | Edge deployment |

Note: Accuracy retention percentages are approximate and vary significantly by model and task. Matryoshka-trained models retain more accuracy at lower dimensions compared to post-hoc reduction methods like PCA or truncation.


7. Fine-Tuning Embeddings

When to Fine-Tune

Fine-tuning is worthwhile when:

  • Domain-specific vocabulary: Medical, legal, or technical jargon not well-represented in general models
  • Specialized similarity criteria: Your notion of similarity differs from general semantic similarity
  • Low baseline performance: Off-the-shelf models underperform on your specific retrieval tasks
  • Sufficient training data: You have at least 1,000 labeled pairs (ideally 10K+)

Fine-tuning is NOT necessary when:

  • General-purpose retrieval works well enough
  • You have insufficient training data (< 500 pairs)
  • The domain is well-covered by multilingual models

Training Data Format

The most effective format is query-positive-negative triples:

[
  {
    "query": "How to prevent SQL injection?",
    "positive": "Use parameterized queries and prepared statements to prevent SQL injection attacks. Never concatenate user input directly into SQL strings.",
    "negative": "SQL databases store data in tables with rows and columns. Common SQL databases include MySQL, PostgreSQL, and SQLite."
  },
  {
    "query": "What causes memory leaks in Java?",
    "positive": "Memory leaks in Java occur when objects are no longer needed but still referenced, preventing garbage collection. Common causes include static collections, unclosed resources, and listener accumulation.",
    "negative": "Java is a popular programming language developed by Sun Microsystems. It runs on the Java Virtual Machine (JVM)."
  }
]

Hard negatives (documents that are topically related but not relevant) are far more effective than random negatives for training.
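
Hard negatives are usually mined rather than written by hand: embed the corpus with the current model, then take the highest-scoring documents that are not labeled relevant. The sketch below shows the selection step only, with hypothetical toy vectors and document IDs.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mine_hard_negatives(query_vec, corpus, positive_ids, num_negatives=2):
    """Return IDs of the top-scoring documents that are not labeled relevant.

    corpus: dict mapping doc_id -> embedding (toy 2-D vectors in this example).
    """
    ranked = sorted(corpus, key=lambda doc_id: dot(query_vec, corpus[doc_id]),
                    reverse=True)
    return [doc_id for doc_id in ranked if doc_id not in positive_ids][:num_negatives]

corpus = {
    "sql_injection_fix": [0.9, 0.1],   # labeled positive for the query
    "sql_table_basics": [0.8, 0.3],    # same topic, does not answer the query
    "java_memory_leaks": [0.2, 0.9],
    "cooking_pasta": [-0.5, 0.1],
}
query_vec = [1.0, 0.0]  # stands in for "How to prevent SQL injection?"

hard_negatives = mine_hard_negatives(query_vec, corpus, {"sql_injection_fix"},
                                     num_negatives=1)
print(hard_negatives)
```

Mined candidates should be spot-checked, since the top-ranked "negatives" sometimes turn out to be relevant documents that were simply never labeled.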

Fine-Tuning with Sentence-Transformers

from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from torch.utils.data import DataLoader
import json
 
# Load base model
model = SentenceTransformer("BAAI/bge-m3")
 
# Prepare training data
def load_training_data(filepath):
    """Load training triples from JSON file."""
    with open(filepath) as f:
        data = json.load(f)
    
    examples = []
    for item in data:
        examples.append(InputExample(
            texts=[item["query"], item["positive"], item["negative"]]
        ))
    return examples
 
train_examples = load_training_data("training_triples.json")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
 
# Define loss function
train_loss = losses.TripletLoss(model=model)
 
# Alternative: MultipleNegativesRankingLoss (more effective with in-batch negatives)
# train_loss = losses.MultipleNegativesRankingLoss(model=model)
 
# Prepare evaluator
def create_evaluator(eval_data_path):
    """Create IR evaluator; expects a JSON list of {"query": ..., "document": ...} items."""
    with open(eval_data_path) as f:
        eval_data = json.load(f)
    
    queries = {str(i): item["query"] for i, item in enumerate(eval_data)}
    corpus = {str(i): item["document"] for i, item in enumerate(eval_data)}
    relevant_docs = {str(i): {str(i)} for i in range(len(eval_data))}
    
    return InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name="custom-eval",
    )
 
evaluator = create_evaluator("eval_data.json")
 
# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=3,
    evaluation_steps=500,
    warmup_steps=100,
    output_path="./finetuned-bge-m3",
    save_best_model=True,
)
 
# Load and use fine-tuned model
finetuned_model = SentenceTransformer("./finetuned-bge-m3")
embeddings = finetuned_model.encode(["test query"], normalize_embeddings=True)

Evaluation Metrics

| Metric | Description | Formula | Good Value |
|---|---|---|---|
| Recall@K | Fraction of relevant docs in top-K | Relevant in top-K / Total relevant | > 0.90 for K=10 |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of first relevant result | Mean(1/rank_i) | > 0.50 |
| nDCG@K | Normalized Discounted Cumulative Gain | Considers position and graded relevance | > 0.60 |
| MAP (Mean Average Precision) | Mean of precision at each relevant rank | Mean of AP per query | > 0.50 |

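# Quick evaluation example (reuses the `evaluator` created above)
results = evaluator(model)
# Recent sentence-transformers releases return a dict whose keys include the
# similarity function (e.g., "custom-eval_cosine_recall@10"); older releases
# return a single float. Run print(results) to confirm the keys for your version.
print(f"Recall@10: {results['custom-eval_cosine_recall@10']:.4f}")
print(f"MRR@10: {results['custom-eval_cosine_mrr@10']:.4f}")
print(f"nDCG@10: {results['custom-eval_cosine_ndcg@10']:.4f}")
print(f"MAP@100: {results['custom-eval_cosine_map@100']:.4f}")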
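
For intuition about what these numbers mean, the three most common metrics can be computed by hand for a single query. This is a minimal sketch with binary relevance and hypothetical document IDs; MRR is the mean of the reciprocal rank over all queries.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top K."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@K: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

ranked = ["d3", "d1", "d7", "d2"]  # system output, best first
relevant = {"d1", "d2"}            # ground-truth relevant documents

recall = recall_at_k(ranked, relevant, k=3)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant, k=3)
print(f"Recall@3={recall:.2f}  RR={rr:.2f}  nDCG@3={ndcg:.4f}")
```

Only one of the two relevant documents is in the top 3 and it sits at rank 2, so Recall@3 and the reciprocal rank are both 0.5, while nDCG@3 is lower because it also penalizes the position.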

8. RAG Integration Best Practices

Chunking and Embedding Alignment

The relationship between chunking strategy and embedding model is critical:

Document: "Machine learning overview... [2000 tokens of content]"

Strategy 1: Fixed-size chunks (512 tokens)
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Chunk 1  │ │ Chunk 2  │ │ Chunk 3  │ │ Chunk 4  │
│ 512 tok  │ │ 512 tok  │ │ 512 tok  │ │ 512 tok  │
└──────────┘ └──────────┘ └──────────┘ └──────────┘

Strategy 2: Semantic chunks (variable size)
┌────────────────┐ ┌──────────┐ ┌──────────────────────┐
│   Section 1    │ │ Section 2│ │      Section 3        │
│   320 tok      │ │ 180 tok  │ │      640 tok          │
└────────────────┘ └──────────┘ └──────────────────────┘

| Factor | Recommendation |
|---|---|
| Chunk size | Match to model's sweet spot (typically 256-512 tokens) |
| Overlap | 10-20% overlap to avoid boundary information loss |
| Max tokens | Never exceed model's max token limit |
| Granularity | Smaller chunks for precise Q&A, larger for summarization |
| Metadata | Include source, section headers, and page numbers |
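
Fixed-size chunking with overlap (Strategy 1 plus the overlap recommendation) can be sketched as follows. The function works on an already-tokenized list; the 512/64 defaults are illustrative (64 is 12.5% of 512), and in practice the tokens should come from the embedding model's own tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

tokens = list(range(2000))  # stand-in for 2000 token IDs
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
print(len(chunks), [len(c) for c in chunks])
```

With overlap, the 2000-token document yields five chunks instead of four, and the last one is shorter; that extra embedding cost is the price of not losing sentences that straddle a boundary.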

Query vs Document Embedding

Some models distinguish between query-time and document-time embeddings:

# Models requiring different prefixes for queries vs documents
 
# BGE English v1.5 models (BGE-M3 itself does not need a query instruction)
query_prefix = "Represent this sentence for searching relevant passages: "
doc_prefix = ""  # No prefix for documents
 
query_embedding = model.encode(query_prefix + "What is RAG?")
doc_embedding = model.encode(doc_prefix + "RAG combines retrieval with generation...")
 
# E5 models
query_text = "query: What is RAG?"
doc_text = "passage: RAG combines retrieval with generation..."
 
# Cohere API
import cohere
co = cohere.Client("your-api-key")
 
# Different input_type for indexing vs searching
doc_response = co.embed(
    texts=["RAG combines retrieval..."],
    model="embed-multilingual-v3.0",
    input_type="search_document"
)
 
query_response = co.embed(
    texts=["What is RAG?"],
    model="embed-multilingual-v3.0",
    input_type="search_query"
)

Note: Always use the correct prefix/input_type. Using a document prefix for queries (or vice versa) can reduce retrieval accuracy by 5-15%.

Caching Strategies

import hashlib
import os

import numpy as np
 
class EmbeddingCache:
    """Simple file-based embedding cache."""
    
    def __init__(self, cache_dir="./embedding_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
    
    def _hash_text(self, text, model_name):
        """Create a unique hash for text + model combination."""
        key = f"{model_name}:{text}"
        return hashlib.sha256(key.encode()).hexdigest()
    
    def get(self, text, model_name):
        """Retrieve cached embedding."""
        hash_key = self._hash_text(text, model_name)
        cache_path = os.path.join(self.cache_dir, f"{hash_key}.npy")
        if os.path.exists(cache_path):
            return np.load(cache_path)
        return None
    
    def set(self, text, model_name, embedding):
        """Store embedding in cache."""
        hash_key = self._hash_text(text, model_name)
        cache_path = os.path.join(self.cache_dir, f"{hash_key}.npy")
        np.save(cache_path, np.array(embedding))
    
    def get_or_compute(self, text, model_name, compute_fn):
        """Get from cache or compute and store."""
        cached = self.get(text, model_name)
        if cached is not None:
            return cached
        embedding = compute_fn(text)
        self.set(text, model_name, embedding)
        return embedding
 
# Usage
cache = EmbeddingCache()
embedding = cache.get_or_compute(
    "What is machine learning?",
    "bge-m3",
    lambda text: model.encode(text, normalize_embeddings=True)
)

Cost Optimization

| Strategy | Savings | Impact on Quality | Implementation |
|---|---|---|---|
| Batch processing | 10-20% latency | None | Group requests, process together |
| Dimension reduction | 50-75% storage | 2-10% accuracy loss | Use Matryoshka or PCA |
| Caching | 80%+ API costs | None | Cache computed embeddings |
| Model selection | 5-10x cost | Varies | Use smaller model if quality permits |
| Quantization | 50-75% storage | 1-3% accuracy loss | int8 or binary quantization |

# Cost estimation helper
def estimate_embedding_cost(
    num_documents: int,
    avg_tokens_per_doc: int,
    model: str = "text-embedding-3-small",
):
    """Estimate OpenAI embedding costs."""
    pricing = {
        "text-embedding-3-small": 0.02,   # per 1M tokens
        "text-embedding-3-large": 0.13,   # per 1M tokens
    }
    
    total_tokens = num_documents * avg_tokens_per_doc
    cost = (total_tokens / 1_000_000) * pricing.get(model, 0)
    
    print(f"Documents: {num_documents:,}")
    print(f"Avg tokens/doc: {avg_tokens_per_doc}")
    print(f"Total tokens: {total_tokens:,}")
    print(f"Model: {model}")
    print(f"Estimated cost: ${cost:.4f}")
    return cost
 
# Example: Embedding 100K documents
estimate_embedding_cost(100_000, 300, "text-embedding-3-small")
# Documents: 100,000
# Avg tokens/doc: 300
# Total tokens: 30,000,000
# Model: text-embedding-3-small
# Estimated cost: $0.6000
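
The quantization row in the table can be illustrated with symmetric int8 scalar quantization, which stores one byte per dimension instead of four. This is a minimal per-vector sketch in plain Python; production vector databases typically calibrate the scale over the whole collection and use their own storage formats.

```python
def quantize_int8(vector):
    """Symmetric scalar quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(x / scale) for x in vector], scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

vector = [0.12, -0.34, 0.56, 0.78]
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, f"max reconstruction error: {max_error:.5f}")
```

The reconstruction error per component is bounded by half the scale, which is why the quality loss from int8 storage tends to be small relative to the 75% storage reduction.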

9. Selection Guide

Decision Flowchart

Start: Choose an Embedding Model
│
├─ Do you need multilingual (Korean) support?
│   ├─ Yes
│   │   ├─ Budget for API costs?
│   │   │   ├─ Yes → OpenAI text-embedding-3-large (easy, high quality)
│   │   │   └─ No  → Can you self-host?
│   │   │           ├─ Yes → BGE-M3 (best open-source multilingual)
│   │   │           └─ No  → multilingual-e5-large (lightweight self-host)
│   │   └─ Need hybrid retrieval (dense + sparse)?
│   │       ├─ Yes → BGE-M3 (native hybrid support)
│   │       └─ No  → Continue below
│   └─ No (English only)
│       ├─ Need maximum quality?
│       │   ├─ Yes → GTE-Qwen2-7B or E5-Mistral-7B
│       │   └─ No  → text-embedding-3-small (cost-effective)
│       └─ Need long context (>8K tokens)?
│           ├─ Yes → GTE-Qwen2-7B (128K) or E5-Mistral-7B (32K)
│           └─ No  → BGE-M3 or text-embedding-3-small
│
└─ Special requirements?
    ├─ Edge/mobile deployment → Small model + dimension reduction
    ├─ Commercial license needed → Check model license (MIT, Apache)
    └─ Maximum privacy → Self-hosted open-source model

Scenario-Based Recommendations

| Scenario | Recommended Model | Dimensions | Rationale |
|---|---|---|---|
| Korean enterprise RAG | BGE-M3 | 1024 | Best Korean performance, MIT license, hybrid retrieval |
| Multilingual SaaS product | OpenAI text-embedding-3-large | 1024 (reduced) | Easy API, 100+ languages, Matryoshka support |
| English-only high accuracy | GTE-Qwen2-7B | 3584 | Top MTEB scores, Apache license |
| Budget-constrained startup | OpenAI text-embedding-3-small | 512 (reduced) | Lowest API cost, decent quality |
| On-premise (no external API) | BGE-M3 | 1024 | MIT license, no API dependency |
| Long document retrieval | E5-Mistral-7B | 4096 | 32K context window |
| Low-resource / edge | multilingual-e5-large | 256 (PCA) | Good quality at small size |
| Korean-specific, lightweight | KoSimCSE-roberta | 768 | Korean-native, small footprint |
| Maximum flexibility | jina-embeddings-v3 | 1024 | Task-specific LoRA adapters, 8K context |

Quick Comparison Summary

Quality vs Cost Matrix:
                    
  High Quality  │  E5-Mistral    GTE-Qwen2
                │       ●            ●
                │
                │  BGE-M3    text-emb-3-large
                │    ●              ●
                │         jina-v3
                │           ●
                │  ml-e5-large
                │      ●       text-emb-3-small
                │                    ●
  Low Quality   │
                └──────────────────────────────
                 Free              Paid
                (self-host)       (API)

Production Checklist

Before deploying an embedding model to production, verify:

  • Benchmark on your data: Run retrieval evaluations on representative queries
  • Tokenization check: Verify Korean/special character handling
  • Latency profiling: Measure p50/p95/p99 embedding generation time
  • Cost projection: Calculate monthly embedding costs at expected volume
  • Fallback plan: Have a backup model in case primary is unavailable
  • Versioning: Track which model version generated each embedding
  • Re-embedding strategy: Plan for model upgrades (must re-embed all documents)
  • Monitoring: Set up quality metrics (retrieval recall, user feedback)
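
For the latency-profiling item, a minimal sketch of a p50/p95/p99 measurement is shown below. It uses the nearest-rank percentile definition, and `fake_embed` is a hypothetical stand-in to be replaced with a real embedding call (for example a `model.encode` wrapper); warm-up runs and batching effects are ignored.

```python
import math
import time

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct between 0 and 100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def profile_latency(embed_fn, texts):
    """Time embed_fn on each text and return p50/p95/p99 in milliseconds."""
    latencies = []
    for text in texts:
        start = time.perf_counter()
        embed_fn(text)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}

def fake_embed(text):
    """Hypothetical stand-in for a real embedding call."""
    return [float(len(text))]

stats = profile_latency(fake_embed, ["sample query"] * 200)
print(stats)
```

Profiling with queries of realistic length matters, because embedding latency grows with token count and tail percentiles are dominated by the longest inputs.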

— Data Dynamics Engineering Team