Tags: rag, llm, vectordb, embedding, ai, langchain, retrieval

RAG (Retrieval-Augmented Generation) Complete Guide - From Architecture to Production

A comprehensive guide covering RAG architecture, chunking strategies, vector databases, retrieval techniques, advanced patterns (Self-RAG, Graph RAG, Agentic RAG), evaluation methods, and enterprise deployment.

Data Dynamics | April 16, 2026 | 20 min read

RAG (Retrieval-Augmented Generation) is a technique that mitigates LLM limitations by retrieving external knowledge and supplying it to the model as context. This post covers RAG from fundamentals through advanced techniques to enterprise deployment.

1. What is RAG?

Definition and Background

RAG (Retrieval-Augmented Generation) was proposed in 2020 by Patrick Lewis et al. at Meta (Facebook) AI Research. It is an architecture that combines information retrieval with text generation.

The core idea is simple: if you retrieve relevant documents and provide them as context before the LLM generates an answer, it can produce more accurate and up-to-date responses.

Limitations of Standalone LLMs and the Need for RAG

Using LLMs alone has the following fundamental limitations:

Limitation	Description	How RAG Solves It
Knowledge Cutoff	Cannot know information after training	Retrieves up-to-date documents in real time
Hallucination	Confidently generates false information	Responds based on retrieved documents with citations
Lack of Domain Knowledge	Missing enterprise internal data, specialized knowledge	Stores internal docs in vector DB for retrieval
Cost	High cost for fine-tuning	Knowledge updates via document updates only

Note: RAG complements fine-tuning rather than replacing it. Fine-tuning is suited to changing model behavior (style, format), while RAG is suited to supplying up-to-date factual information. The two can be combined when an application needs both.

2. RAG Architecture in Detail

Overall Pipeline Architecture

A RAG system consists of an offline indexing pipeline and an online query pipeline.

Role of each stage:

Stage	Role	Key Tools
Document Loading	Convert various document formats to text	LangChain Loaders, Unstructured
Chunking	Split documents into appropriately sized pieces	RecursiveCharacterTextSplitter
Embedding	Convert text to high-dimensional vectors	OpenAI, Cohere, BGE, E5
Vector Storage	Index and store embedding vectors	Chroma, Pinecone, Milvus
Retrieval	Quickly find similar document chunks	ANN (Approximate Nearest Neighbor)
Re-ranking	Re-order results by relevance	Cross-Encoder, Cohere Rerank
Generation	Generate responses based on retrieved results	GPT-4, Claude, LLaMA

Naive RAG vs Advanced RAG vs Modular RAG

RAG can be categorized into three generations based on its evolution.

Naive RAG (1st Generation)

The most basic RAG implementation with a simple retrieve-generate pipeline.

Pros: Simple to implement
Cons: Entirely dependent on retrieval quality, may include noisy documents
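
The retrieve-then-generate flow can be sketched in a few lines. This is a minimal illustration rather than a production pipeline: it assumes a toy word-overlap score in place of embeddings, a made-up three-sentence corpus, and `fake_llm` as a hypothetical stand-in for a real model call.

```python
import re

# Toy corpus (made-up sentences for illustration only)
DOCS = [
    "Kudu supports hash and range partitioning.",
    "NiFi can authenticate users against an LDAP server.",
    "The data retention policy keeps logs for 90 days.",
]

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"

def fake_llm(prompt: str) -> str:
    # Hypothetical placeholder: a real system would call an LLM API here.
    return f"(model output for a prompt of {len(prompt)} characters)"

def naive_rag(query: str) -> str:
    context = retrieve(query, DOCS)
    return fake_llm(build_prompt(query, context))
```

Because generation only sees what `retrieve` returns, a bad top-k directly produces a bad answer, which is the weakness noted above.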

Advanced RAG (2nd Generation)

An improved pipeline with optimization stages before and after retrieval.

Query transformation: Restructures queries to improve retrieval quality
Hybrid search: Combines vector + keyword search
Re-ranking: Re-evaluates relevance of search results
Context compression: Removes unnecessary information

Modular RAG (3rd Generation)

An architecture that separates each component into independent modules for flexible composition.

Routing: Selects appropriate retrieval strategy based on query type
Module swapping: Each stage component can be independently replaced
Feedback loop: Evaluates generation results and retries retrieval
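
As a rough illustration of the routing idea, the sketch below picks a retrieval strategy from surface features of the query. The strategy labels (`keyword`, `multi_step`, `vector`) and the regular expressions are assumptions made for this example; real routers often use an LLM or a trained classifier instead.

```python
import re

def route_query(query: str) -> str:
    """Pick a retrieval strategy from simple surface features of the query."""
    # Exact identifiers (hex codes, version strings) favor keyword search.
    if re.search(r"\b0x[0-9a-fA-F]+\b|\b\d+\.\d+\.\d+\b", query):
        return "keyword"
    # Comparative questions may need several retrieval steps.
    if re.search(r"\b(compare|versus|vs|difference between)\b", query, re.IGNORECASE):
        return "multi_step"
    # Default: semantic (vector) search.
    return "vector"
```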

3. Data Preparation: Document Loading and Chunking

Document Loading

The first step in RAG is converting source documents of various formats into text.

Document Format	Loader	Considerations
PDF	PyPDFLoader, Unstructured	Watch for tables and OCR text extraction
HTML	BSHTMLLoader	Tag removal, body extraction
Markdown	UnstructuredMarkdownLoader	Preserve heading-based structure
Word/PPT	Unstructured	Can leverage formatting information
DB (SQL)	SQLDatabaseLoader	Documentize query results
Confluence/Notion	Dedicated API Loaders	Reflect page hierarchy

# Various document loading examples
from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    UnstructuredMarkdownLoader,
)

# PDF loading (requires the pypdf package); returns one Document per page
pdf_loader = PyPDFLoader("report.pdf")
pdf_docs = pdf_loader.load()

# Web page loading (requires beautifulsoup4)
web_loader = WebBaseLoader("https://docs.example.com/guide")
web_docs = web_loader.load()

# Markdown loading (requires the unstructured package)
md_loader = UnstructuredMarkdownLoader("README.md")
md_docs = md_loader.load()

# Combine everything into one list for the chunking step
documents = pdf_docs + web_docs + md_docs

Chunking Strategies

Chunking is one of the stages with the greatest impact on RAG performance. Chunks that are too large include noise, while chunks that are too small lose context.

Key chunking strategies:

Strategy	Description	Best For
Fixed Size	Split by fixed character/token count	Unstructured text
Recursive	Try splitting by paragraph → sentence → word	General purpose (most widely used)
Semantic	Split at embedding similarity change points	Documents with frequent topic changes
Document Structure	Leverage headings, sections, etc.	Technical docs, manuals
Agentic Chunking	LLM performs the chunking	Complex documents

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Recursive splitting (most commonly used)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Max chunk size, measured by length_function (characters here)
    chunk_overlap=50,      # Overlap between chunks (context preservation)
    separators=["\n\n", "\n", ". ", " ", ""],  # Split priority
    length_function=len,
)

# `documents` is the list of Document objects produced by the loading step
chunks = splitter.split_documents(documents)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
 
# Semantic splitting
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # Split where the distance between adjacent sentences exceeds the 95th percentile
)
 
semantic_chunks = semantic_splitter.split_documents(documents)

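Note: A general guideline for chunk size is 200–1000 tokens. Smaller chunks (200–400) favor precise retrieval, while larger chunks (600–1000) provide richer context. Overlap is typically set to 10–20% of chunk_size. Keep in mind that chunk_size is measured by length_function: with length_function=len, as in the recursive example above, the unit is characters, not tokens.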
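
To make the interaction between chunk size and overlap concrete, here is a minimal fixed-size chunker in plain Python. It is a sketch of the simplest strategy in the table (fixed character windows), not what `RecursiveCharacterTextSplitter` does internally; the function name and defaults are chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the defaults (10% overlap), a 1,200-character text yields three chunks of 500, 500, and 300 characters, and the last 50 characters of each chunk reappear at the start of the next one.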

Metadata Management

Attaching metadata to chunks enables filtering and weight adjustment during retrieval.

# Example of a chunk with metadata
{
    "content": "Quarterly revenue increased 15% year-over-year...",
    "metadata": {
        "source": "2025_Q4_report.pdf",
        "page": 12,
        "department": "finance",
        "date": "2025-12-31",
        "doc_type": "quarterly_report",
        "access_level": "internal"
    }
}

Use cases:

Date filter: "Q4 2025 revenue" → range search on date field
Department filter: "Marketing documents" → department == "marketing" filter
Access control: Filter by access_level based on user permissions
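
The three use cases above can be sketched as a single filter over chunk metadata. This in-memory version only illustrates the logic; in practice the vector database applies the filter. The sample chunks and the `confidential` access level are made up for the example.

```python
from datetime import date

# Made-up sample chunks following the metadata layout shown earlier
chunks = [
    {"content": "Q4 revenue summary",
     "metadata": {"department": "finance", "date": "2025-12-31", "access_level": "internal"}},
    {"content": "Brand guidelines",
     "metadata": {"department": "marketing", "date": "2025-06-01", "access_level": "public"}},
    {"content": "Q3 revenue summary",
     "metadata": {"department": "finance", "date": "2025-09-30", "access_level": "confidential"}},
]

def filter_chunks(chunks, department=None, start=None, end=None, allowed_levels=None):
    """Keep chunks whose metadata satisfies every filter that was supplied."""
    selected = []
    for chunk in chunks:
        meta = chunk["metadata"]
        chunk_date = date.fromisoformat(meta["date"])
        if department is not None and meta["department"] != department:
            continue
        if start is not None and chunk_date < start:
            continue
        if end is not None and chunk_date > end:
            continue
        if allowed_levels is not None and meta["access_level"] not in allowed_levels:
            continue
        selected.append(chunk)
    return selected

# "Q4 2025 revenue" for a user who may read public and internal documents
q4 = filter_chunks(chunks, start=date(2025, 10, 1), end=date(2025, 12, 31),
                   allowed_levels={"public", "internal"})
```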

4. Embeddings and Vector Databases

Embedding Model Selection

Embedding models convert text into high-dimensional vectors to enable semantic similarity computation. Model selection directly impacts RAG performance.

Model	Dimensions	Multilingual	Features
OpenAI text-embedding-3-large	3072	Yes	High performance, API cost
OpenAI text-embedding-3-small	1536	Yes	Excellent cost-performance ratio
Cohere embed-v3	1024	Yes	Search-optimized, multilingual strength
BGE-M3 (BAAI)	1024	Yes	Top open-source, multilingual
E5-Mistral-7B	4096	Limited (English recommended)	Open-source, strong on long documents
multilingual-e5-large	1024	Yes	Multilingual-focused, lightweight

# OpenAI embeddings
from langchain_openai import OpenAIEmbeddings
 
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector = embeddings.embed_query("RAG combines retrieval and generation")
# Result: 1536-dimensional vector
 
# Open-source embeddings (Hugging Face)
from langchain_huggingface import HuggingFaceEmbeddings
 
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cuda"},  # use "cpu" if no GPU is available
    encode_kwargs={"normalize_embeddings": True}
)

Note: For Korean RAG systems, multilingual models (BGE-M3, multilingual-e5, etc.) are recommended. English-only models may not accurately capture Korean semantics.

Major Vector DB Comparison

Vector DB	Type	Scalability	Hybrid Search	Metadata Filter	Best For
Chroma	Embedded	Small	No	Yes	Prototypes, PoC
Pinecone	Managed SaaS	Large	Yes	Yes	Minimal ops
Weaviate	Self-hosted/Cloud	Large	Yes	Yes	Hybrid search focus
Milvus	Self-hosted	Very large	Yes	Yes	Enterprise, high volume
Qdrant	Self-hosted/Cloud	Large	Yes	Yes	High-performance filtering
pgvector	PostgreSQL extension	Medium	No	Yes (SQL)	Existing PostgreSQL users

# Vector storage and search with Chroma
from langchain_chroma import Chroma
 
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="company_docs",
    persist_directory="./chroma_db"
)
 
# Similarity search (Chroma returns a distance here: lower means more similar)
results = vectorstore.similarity_search_with_score(
    query="What is the data retention policy?",
    k=5
)

for doc, distance in results:
    print(f"[Distance: {distance:.4f}] {doc.page_content[:100]}...")

Indexing Strategies

Vector search speed and accuracy depend heavily on the indexing algorithm.

Algorithm	Principle	Speed	Accuracy	Memory
Flat (Brute Force)	Compare against all vectors	Slow	100%	Low
HNSW	Hierarchical graph traversal	Fast	Very high	High
IVF (Inverted File)	Cluster-based search	Fast	High	Medium
PQ (Product Quantization)	Compress vectors before search	Very fast	Medium	Very low
HNSW + PQ	Combines HNSW and PQ	Fast	High	Medium

Small scale (< 100K vectors): Flat or HNSW
Medium scale (100K–10M): HNSW
Large scale (> 10M): IVF + PQ or HNSW + PQ
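
The memory column becomes clearer with a back-of-the-envelope calculation. The sketch below compares raw float32 storage with Product Quantization codes; the 10M x 1536 workload and the choice of 96 sub-vectors with 8-bit codes are assumptions picked for illustration, and graph overhead (for HNSW) is not included.

```python
def flat_index_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for uncompressed float32 vectors."""
    return num_vectors * dim * bytes_per_value

def pq_index_bytes(num_vectors: int, dim: int, num_subvectors: int, bits: int = 8) -> int:
    """PQ codes plus codebooks: each vector becomes `num_subvectors` small codes."""
    if dim % num_subvectors != 0:
        raise ValueError("dim must be divisible by num_subvectors")
    codes = num_vectors * num_subvectors * bits // 8
    # Each sub-quantizer stores 2**bits float32 centroids of size dim / num_subvectors.
    codebooks = num_subvectors * (2 ** bits) * (dim // num_subvectors) * 4
    return codes + codebooks

n, d = 10_000_000, 1536
print(f"Flat: {flat_index_bytes(n, d) / 1e9:.2f} GB")
print(f"PQ (96 x 8-bit codes): {pq_index_bytes(n, d, 96) / 1e9:.2f} GB")
```

Under these assumptions the uncompressed vectors need about 61.44 GB, while the PQ codes need about 0.96 GB, which is why quantization becomes attractive beyond tens of millions of vectors despite its accuracy cost.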

5. Retrieval Strategies

Vector Similarity Search

Vector search calculates similarity between query vectors and stored document vectors to find the most relevant documents.

Key similarity metrics:

Metric	Formula	Features
Cosine Similarity	cos(θ) = (A·B) / (‖A‖·‖B‖)	Direction-based, most widely used
Euclidean Distance (L2)	d = √(Σ(a_i - b_i)²)	Distance-based, sensitive to vector magnitude (normalize vectors first)
Inner Product (Dot Product)	s = Σ(a_i × b_i)	Fast computation, equivalent to Cosine for normalized vectors

Note: Many embedding models output unit-length (normalized) vectors. For such vectors, Cosine Similarity and Inner Product give identical scores, and Euclidean distance produces the same ranking. Inner Product skips the norm computation, so it is the cheaper choice when vectors are already normalized.
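
The relationship between the three metrics can be checked directly. The following plain-Python sketch uses two small example vectors chosen for illustration; production code would normally use a vectorized library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

print(cosine_similarity(a, b))  # direction only, ignores magnitude
print(dot(ua, ub))              # same value once both vectors are unit length
# For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity
print(euclidean_distance(ua, ub) ** 2, 2 - 2 * dot(ua, ub))
```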

Keyword Search (BM25)

BM25 is a traditional information retrieval algorithm that scores documents using term frequency (TF), inverse document frequency (IDF), and document length normalization.

from langchain_community.retrievers import BM25Retriever
 
# Create BM25 retriever
bm25_retriever = BM25Retriever.from_documents(
    documents=chunks,
    k=5
)
 
# Keyword search
results = bm25_retriever.invoke("NiFi LDAP authentication setup")
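
To show what the scoring looks like under the hood, here is a compact Okapi BM25 sketch in plain Python. It is an illustration, not the implementation used by `BM25Retriever`: the tokenizer, the `k1` and `b` defaults, the IDF variant (with +1 inside the logarithm to keep it non-negative), and the sample documents are all assumptions for this example.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25 (docs must be non-empty)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "Installation fails with error code 0x8007 on Windows",
    "Troubleshooting installation failures and common errors",
    "Configuring LDAP authentication",
]
scores = bm25_scores("error code 0x8007", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the first document shares exact tokens with the query, so it is the only one with a non-zero score. Note that "errors" does not match "error" without stemming, which is the kind of exact-match behavior that separates keyword search from vector search.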

Vector Search vs BM25 comparison:

Scenario	Vector Search	BM25
"NiFi authentication setup"	Finds semantically similar docs (good)	Matches "NiFi", "authentication" keywords (good)
"security access control"	Also finds "authentication", "permission management" (good)	Misses without exact keywords (weak)
"error code 0x8007"	Low semantic similarity, inaccurate (weak)	Exact code matching (good)

Hybrid Search

Hybrid search combines the strengths of vector search and keyword search. Reciprocal Rank Fusion (RRF) or weighted scoring is used to merge results from both search methods.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# BM25 retriever
bm25_retriever = BM25Retriever.from_documents(chunks, k=5)

# Hybrid search (EnsembleRetriever merges the ranked lists with weighted RRF)
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Vector results weighted 0.6, BM25 results 0.4
)

results = hybrid_retriever.invoke("NiFi cluster TLS certificate setup")

RRF (Reciprocal Rank Fusion) formula:

RRF_score(d) = Σ 1 / (k + rank_i(d))

- d: document
- k: constant (typically 60)
- rank_i(d): rank in the i-th retriever
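
The formula translates almost line for line into code. The sketch below is an unweighted RRF over lists of document IDs, assuming 1-based ranks; the document IDs and the two input rankings are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; ranks are 1-based."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
```

Here `doc_b` (ranks 2 and 1) edges out `doc_a` (ranks 1 and 3), and documents found by only one retriever fall to the bottom. Because RRF uses ranks rather than raw scores, it needs no score normalization across retrievers.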

Re-ranking

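Re-ranking re-orders retrieved documents more precisely by relevance. Initial retrieval with a bi-encoder sacrifices some accuracy for speed because the query and documents are embedded independently; a cross-encoder re-ranker scores each query-document pair jointly, which is slower but more accurate, so it is applied only to a small set of candidates.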

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
 
# Cohere Reranker
reranker = CohereRerank(
    model="rerank-v3.5",
    top_n=3  # Return only top 3
)
 
# Retriever with re-ranking
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever  # Re-rank hybrid search results
)
 
results = compression_retriever.invoke("Kudu partitioning strategies")

Effect of re-ranking:

Stage	Processes	Speed	Accuracy
Initial Retrieval (Bi-Encoder)	All documents (tens of thousands to millions)	Fast	Medium
Re-ranking (Cross-Encoder)	Candidate documents (10–50)	Slow	High

6. Prompt Construction and Response Generation

Context Injection Methods

Methods for inserting retrieved documents into the LLM prompt.

Stuff method (most basic): Insert all search results into a single prompt

Context:
[Document 1 content]
[Document 2 content]
[Document 3 content]

Based on the context above, answer the following question:
{question}

Map-Reduce method: Summarize each document individually, then combine

Map-Rerank method: Generate responses from each document, select by score
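
The three injection methods differ mainly in how many model calls they make and what each call sees. The sketch below expresses that control flow with pluggable callables; the `LLM` type alias, the prompt wording, and the `answer_with_score` interface are assumptions for illustration rather than any library's API.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

def stuff(question: str, docs: list[str], llm: LLM) -> str:
    """One call: all documents go into a single prompt."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

def map_reduce(question: str, docs: list[str], llm: LLM) -> str:
    """N + 1 calls: summarize each document, then answer from the summaries."""
    summaries = [llm(f"Summarize for the question '{question}':\n{doc}") for doc in docs]
    return stuff(question, summaries, llm)

def map_rerank(
    question: str,
    docs: list[str],
    answer_with_score: Callable[[str, str], tuple[str, float]],
) -> str:
    """N calls: answer from each document separately, keep the highest-scoring answer."""
    candidates = [answer_with_score(question, doc) for doc in docs]
    return max(candidates, key=lambda c: c[1])[0]
```

Stuff is the cheapest option but is bounded by the context window; Map-Reduce and Map-Rerank trade extra calls for the ability to handle more or longer documents.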

Prompt Template Design

Principles for effective RAG prompt design:

from langchain_core.prompts import ChatPromptTemplate
 
rag_prompt = ChatPromptTemplate.from_template("""
You are a technical support specialist at Data Dynamics.
Answer the question using ONLY the context information provided below.
 
## Rules
1. If the information is not in the context, respond with "I could not find that information in the provided documents."
2. Cite the source documents used in your answer.
3. Explain technical content with code examples.
 
## Context
{context}
 
## Question
{question}
 
## Answer
""")

Key design principles:

Specify role: Define the model's area of expertise
Force context-based responses: Prevent hallucination
Instruct to say "I don't know": Encourage honest responses
Require citations: Improve response credibility

Citation Processing

Including citations in RAG responses allows users to verify information.

# RAG chain implementation with citations
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Assumes `retriever` (e.g. compression_retriever), `llm` (any LangChain chat model),
# and `rag_prompt` are already defined.

def format_docs_with_sources(docs):
    formatted = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "Unknown")
        page = doc.metadata.get("page")
        # Compare against None: PyPDFLoader page numbers start at 0
        ref = f"[{i}] {source}" + (f" (p.{page})" if page is not None else "")
        formatted.append(f"{ref}\n{doc.page_content}")
    return "\n\n---\n\n".join(formatted)

rag_chain = (
    {"context": retriever | format_docs_with_sources,
     "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What are Kudu's partitioning limitations?")

7. Advanced RAG Techniques

Query Transformation

Techniques that transform queries to improve retrieval quality.

HyDE (Hypothetical Document Embedding)

Generates a hypothetical answer to the query, then uses that answer's embedding for retrieval.
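
In code, HyDE changes only what gets embedded. The following sketch assumes three hypothetical callables (`generate`, `embed`, `search`) standing in for an LLM, an embedding model, and a vector store; the prompt wording is also an assumption.

```python
from typing import Callable, Sequence

def hyde_search(
    query: str,
    generate: Callable[[str], str],                       # hypothetical LLM call
    embed: Callable[[str], Sequence[float]],              # hypothetical embedding call
    search: Callable[[Sequence[float], int], list[str]],  # hypothetical vector search
    k: int = 5,
) -> list[str]:
    """Retrieve with the embedding of a generated answer instead of the raw query."""
    hypothetical_doc = generate(f"Write a short passage that answers: {query}")
    # The hypothetical passage may contain errors; it only needs to land
    # near real answer passages in embedding space.
    return search(embed(hypothetical_doc), k)
```

The intuition is that a generated answer resembles stored answer passages more closely than a short question does, at the cost of one extra LLM call per query.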

Multi-Query

Rewrites a single query from multiple perspectives to expand retrieval coverage.

from langchain.retrievers.multi_query import MultiQueryRetriever
 
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)
 
# Original query: "Kudu performance optimization"
# Auto-generated queries:
# 1. "How to improve Apache Kudu table scan performance"
# 2. "Strategies for improving query speed with Kudu partitioning"
# 3. "Kudu cluster hardware and configuration tuning guide"
results = multi_retriever.invoke("Kudu performance optimization")

Step-back Prompting

Transforms specific queries into more general ones to first retrieve background knowledge.

Self-RAG / Corrective RAG (CRAG)

Self-RAG

The LLM itself determines whether retrieval is needed, evaluates relevance of retrieved documents, and verifies generated responses.

Corrective RAG (CRAG)

Evaluates retrieval result quality and supplements with alternative sources like web search when inaccurate.
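
A simplified version of this control flow is sketched below. It collapses the evaluation into a single pass/fail threshold, whereas the CRAG paper distinguishes more cases; the `grade` and `web_search` callables and the 0.5 threshold are hypothetical placeholders.

```python
from typing import Callable

def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], float],      # relevance in [0, 1]; hypothetical evaluator
    web_search: Callable[[str], list[str]],  # hypothetical fallback source
    threshold: float = 0.5,
) -> list[str]:
    """Keep retrieved documents that pass the relevance check; otherwise fall back."""
    docs = retrieve(query)
    relevant = [d for d in docs if grade(query, d) >= threshold]
    if relevant:
        return relevant
    # Nothing passed the check: supplement from an alternative source.
    return web_search(query)
```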

Graph RAG

A technique that leverages knowledge graphs to retrieve based on relationships between entities. Effective for relationship-based queries that are difficult to capture through simple text similarity.

Use cases:

Organizational structure queries ("Who is the team lead of Department A?")
Causal relationship tracking ("What is the root cause of this incident?")
Multi-hop reasoning ("What products did customers who bought Product A also buy?")
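
The multi-hop case shows why graph traversal helps where text similarity does not. The toy example below answers the "also bought" question with a two-hop walk over a made-up purchase graph; a real Graph RAG system would extract entities and relations from documents and store them in a graph database.

```python
from collections import Counter

# Toy purchase graph (made-up data): customer -> products bought
purchases = {
    "alice": {"A", "B"},
    "bob": {"A", "C"},
    "carol": {"B", "C"},
    "dave": {"A", "B"},
}

def also_bought(product: str, purchases: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Two hops: product -> customers who bought it -> their other products."""
    buyers = [c for c, items in purchases.items() if product in items]
    counts = Counter(p for c in buyers for p in purchases[c] if p != product)
    return counts.most_common()

result = also_bought("A", purchases)
```

No single document states "customers who bought A also bought B"; the answer only emerges by following edges, which is the kind of query Graph RAG targets.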

Agentic RAG

An LLM agent performs retrieval as a tool, building answers through a plan-execute-reflect loop for complex queries.

[User query: "How much did Q3 revenue increase compared to Q2, and what were the main drivers?"]

Agent planning:
1. Retrieve Q2 revenue data → tool: vector_search("Q2 revenue")
2. Retrieve Q3 revenue data → tool: vector_search("Q3 revenue")
3. Compare revenue changes → tool: calculator(Q3 - Q2)
4. Analyze growth drivers → tool: vector_search("Q3 revenue growth drivers")
5. Generate comprehensive report

Agent execution: Perform each step → Reflect on intermediate results → Re-retrieve if needed

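from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.tools.retriever import create_retriever_tool
from langchain_core.prompts import ChatPromptTemplate

# Create search tool (`retriever` and `llm` are assumed to be defined as in earlier sections)
search_tool = create_retriever_tool(
    retriever=retriever,
    name="company_docs_search",
    description="Search internal company documents. Use for finding policies, technical guides, and reports."
)

# Agent prompt: tool-calling agents need an agent_scratchpad placeholder
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use the search tool to ground your answers."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# Create agent
agent = create_tool_calling_agent(llm, [search_tool], prompt)
agent_executor = AgentExecutor(agent=agent, tools=[search_tool], verbose=True)

result = agent_executor.invoke({
    "input": "How much did Q3 revenue increase vs Q2, and what were the main drivers?"
})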

8. RAG Evaluation and Optimization

Evaluation Metrics

RAG system quality is evaluated on two axes: retrieval quality and generation quality.

Retrieval quality metrics:

Metric	Description
Recall@K	Proportion of ground truth documents found in top K results
Precision@K	Proportion of relevant documents in top K results
MRR (Mean Reciprocal Rank)	Average reciprocal of the rank of the first correct document
NDCG	Relevance score considering ranking position

Generation quality metrics:

Metric	Description
Faithfulness	Degree to which the response is grounded in retrieved documents
Answer Relevancy	Degree to which the response addresses the question
Context Relevancy	Degree to which retrieved context relates to the question
Context Utilization	Degree to which retrieved context is actually used

RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework for automatically evaluating RAG systems.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset
 
# Construct evaluation data (column names follow the RAGAS 0.1.x schema; later releases rename them)
eval_data = {
    "question": ["What are Kudu's partitioning methods?"],
    "answer": ["Kudu supports Hash and Range partitioning..."],
    "contexts": [["Kudu supports Hash Partitioning and Range Partitioning..."]],
    "ground_truth": ["Kudu supports Hash, Range partitioning and..."]
}
 
dataset = Dataset.from_dict(eval_data)
 
# Run RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
 
print(results)
# Example output format (illustrative values, not measured results):
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.90}

Performance Tuning Points

Key checkpoints when RAG performance falls short of expectations:

Problem	Symptoms	Tuning Direction
Retrieval miss	Relevant docs not found	Adjust chunk size, apply hybrid search, increase K
Noisy documents	Irrelevant docs included	Add re-ranking, metadata filtering, set thresholds
Hallucination	Response unrelated to search results	Improve prompt, lower temperature, force citations
Incomplete response	Only partial information included	Increase chunk size, apply Multi-Query, increase K
Slow response	High latency	Optimize index, caching, apply streaming
Embedding quality	Semantic similarity inaccurate	Change embedding model, domain fine-tuning

Recommended tuning priority:

1. Optimize chunking strategy (highest impact)
2. Apply hybrid search
3. Add re-ranking
4. Change embedding model
5. Improve prompts
6. Apply query transformation

9. Enterprise RAG Deployment

Security and Access Control

In enterprise environments, document-level access permissions must be reflected in the RAG system.

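# Retrieval with access control example

# Hypothetical permission helpers: replace with lookups against your IAM system
def get_allowed_levels(user_role: str) -> list[str]:
    levels = {"employee": ["public"], "manager": ["public", "internal"]}
    return levels.get(user_role, ["public"])

def get_allowed_departments(user_role: str, department: str) -> list[str]:
    # Assumption: users may only search their own department's documents
    return [department]

def secure_search(query: str, user_role: str, department: str):
    # Build metadata filter based on user permissions.
    # Chroma requires an explicit $and to combine multiple conditions.
    filter_conditions = {
        "$and": [
            {"access_level": {"$in": get_allowed_levels(user_role)}},
            {"department": {"$in": get_allowed_departments(user_role, department)}},
        ]
    }

    results = vectorstore.similarity_search(
        query=query,
        k=5,
        filter=filter_conditions
    )
    return results

# Regular employee: search public documents only
results = secure_search("HR policy", user_role="employee", department="engineering")

# Manager: search including internal documents
results = secure_search("HR policy", user_role="manager", department="hr")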

Key security considerations:

Document-level ACL: Manage per-document access permissions via metadata
Row-level security: Filter unauthorized documents from search results
Prompt injection prevention: Validate and sanitize user input
Data encryption: Apply encryption when storing in vector DB
Audit logging: Record search queries and access history

Multi-tenant Architecture

Architecture for multiple teams or customers sharing a single RAG system.

Isolation strategies:

Method	Description	Pros	Cons
Collection isolation	Separate collection per tenant	Complete data isolation	Management overhead
Namespace isolation	Namespace separation within same collection	Efficient resource use	Soft isolation
Metadata filtering	Filter by tenant ID in metadata	Simple implementation	Performance degradation at scale
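
For the metadata-filtering approach, one way to reduce the risk of cross-tenant leaks is to apply the tenant filter inside the store wrapper so that callers cannot omit it. The class below is an in-memory stand-in written for illustration (keyword matching instead of vector search, hypothetical class and field names), not a real vector DB client.

```python
class TenantScopedStore:
    """In-memory stand-in for a vector store that enforces a tenant filter on every query."""

    def __init__(self) -> None:
        self._records: list[dict] = []

    def add(self, tenant_id: str, text: str) -> None:
        self._records.append({"tenant_id": tenant_id, "text": text})

    def search(self, tenant_id: str, keyword: str) -> list[str]:
        # The tenant filter is applied here, so callers cannot forget it.
        return [
            r["text"] for r in self._records
            if r["tenant_id"] == tenant_id and keyword.lower() in r["text"].lower()
        ]

store = TenantScopedStore()
store.add("team_a", "Team A onboarding policy")
store.add("team_b", "Team B onboarding policy")
```

Searching as `team_a` for "policy" returns only Team A's document even though both records live in the same collection.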

Operational Monitoring and Feedback Loops

A monitoring framework for continuous quality management of production RAG systems.

Core monitoring metrics:

Category	Metric	Target
Performance	Response latency (P50/P95/P99)	P95 < 3s
Quality	User feedback (thumbs up/down)	Positive > 80%
Retrieval	No results rate	< 5%
Cost	Daily token usage	Within budget
Reliability	Error rate	< 0.1%
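
Several of these metrics can be derived from a plain request log. The sketch below assumes each log entry is a dict with `latency_s`, `num_results`, and `error` fields (hypothetical names) and uses the nearest-rank definition of a percentile; monitoring systems may interpolate differently.

```python
import math

def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(requests: list[dict]) -> dict:
    """Aggregate latency percentiles, no-result rate, and error rate from a request log."""
    latencies = [r["latency_s"] for r in requests]
    total = len(requests)
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "no_result_rate": sum(r["num_results"] == 0 for r in requests) / total,
        "error_rate": sum(bool(r["error"]) for r in requests) / total,
    }
```

The resulting dictionary can be compared against the targets in the table, for example alerting when `p95` exceeds 3 seconds or `no_result_rate` exceeds 0.05.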

[Diagram: building a feedback loop]

Note: A RAG system is not "build once and done." As documents are added/changed and user query patterns evolve, continuous monitoring and improvement are essential.

References

Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS
Gao, Y. et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv
Asai, A. et al. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." ICLR
Yan, S. et al. (2024). "Corrective Retrieval Augmented Generation." arXiv
Edge, D. et al. (2024). "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." arXiv
Es, S. et al. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL
LangChain Documentation — https://python.langchain.com/docs/
LlamaIndex Documentation — https://docs.llamaindex.ai/

— Data Dynamics Engineering Team