
RAG (Retrieval-Augmented Generation) Complete Guide - From Architecture to Production

A comprehensive guide covering RAG architecture, chunking strategies, vector databases, retrieval techniques, advanced patterns (Self-RAG, Graph RAG, Agentic RAG), evaluation methods, and enterprise deployment.

Data Dynamics · April 16, 2026 · 20 min read

RAG (Retrieval-Augmented Generation) is a technique that mitigates LLM limitations by retrieving external knowledge to ground responses. This post covers RAG from fundamentals through advanced techniques to enterprise deployment.


1. What is RAG?

Definition and Background

RAG (Retrieval-Augmented Generation) was proposed in 2020 by Patrick Lewis et al. at Meta (Facebook) AI Research. It is an architecture that combines information retrieval with text generation.

The core idea is simple: if you retrieve relevant documents and provide them as context before the LLM generates an answer, it can produce more accurate and up-to-date responses.

[Traditional LLM]
User question → LLM → Response (relies only on training data)

[RAG]
User question → Retrieve relevant docs → Retrieved results + question → LLM → Response (grounded in external knowledge)

Limitations of Standalone LLMs and the Need for RAG

Using LLMs alone has the following fundamental limitations:

| Limitation | Description | How RAG Solves It |
| --- | --- | --- |
| Knowledge Cutoff | Cannot know information after training | Retrieves up-to-date documents in real time |
| Hallucination | Confidently generates false information | Responds based on retrieved documents with citations |
| Lack of Domain Knowledge | Missing enterprise internal data, specialized knowledge | Stores internal docs in vector DB for retrieval |
| Cost | High cost for fine-tuning | Knowledge updates via document updates only |

Note: RAG complements fine-tuning rather than replacing it. Fine-tuning is suited to changing model behavior (style, format), while RAG is suited to supplying up-to-date factual information. Combining the two often works better than either alone.


2. RAG Architecture in Detail

Overall Pipeline Architecture

A RAG system consists of an offline indexing pipeline and an online query pipeline.

[Offline: Indexing Pipeline]
Source documents → Document loading → Chunking → Embedding → Vector DB storage

[Online: Query Pipeline]
User query → Query embedding → Vector search → Re-ranking → Context construction → LLM generation → Response
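
To make the two pipelines concrete, here is a minimal, dependency-free sketch. The bag-of-words `embed` function, the in-memory `index` list, and the `build_prompt` helper are illustrative stand-ins (assumptions, not part of any library); a real system would put an embedding model, a vector DB, and an LLM call in their place.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Offline: indexing pipeline (load -> chunk -> embed -> store)
source_chunks = [
    "Kudu supports hash partitioning and range partitioning.",
    "NiFi can authenticate users against an LDAP server.",
    "Quarterly revenue increased 15% year-over-year.",
]
index = [(chunk, embed(chunk)) for chunk in source_chunks]

# Online: query pipeline (embed query -> search -> build context)
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, k: int = 2) -> str:
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

top = retrieve("Which partitioning does Kudu support?", k=1)
```

Indexing runs once per document update, while `retrieve` and `build_prompt` run on every query, which is why the two halves are usually deployed and scaled separately.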

Role of each stage:

| Stage | Role | Key Tools |
| --- | --- | --- |
| Document Loading | Convert various document formats to text | LangChain Loaders, Unstructured |
| Chunking | Split documents into appropriately sized pieces | RecursiveCharacterTextSplitter |
| Embedding | Convert text to high-dimensional vectors | OpenAI, Cohere, BGE, E5 |
| Vector Storage | Index and store embedding vectors | Chroma, Pinecone, Milvus |
| Retrieval | Quickly find similar document chunks | ANN (Approximate Nearest Neighbor) |
| Re-ranking | Re-order results by relevance | Cross-Encoder, Cohere Rerank |
| Generation | Generate responses based on retrieved results | GPT-4, Claude, LLaMA |

Naive RAG vs Advanced RAG vs Modular RAG

RAG can be categorized into three generations based on its evolution.

Naive RAG (1st Generation)

The most basic RAG implementation with a simple retrieve-generate pipeline.

Query → Embedding → Top-K retrieval → Insert into prompt → LLM generation
  • Pros: Simple to implement
  • Cons: Entirely dependent on retrieval quality, may include noisy documents

Advanced RAG (2nd Generation)

An improved pipeline with optimization stages before and after retrieval.

Query → [Query transformation] → Embedding → [Hybrid search] → [Re-ranking] → [Context compression] → LLM generation
  • Query transformation: Restructures queries to improve retrieval quality
  • Hybrid search: Combines vector + keyword search
  • Re-ranking: Re-evaluates relevance of search results
  • Context compression: Removes unnecessary information

Modular RAG (3rd Generation)

An architecture that separates each component into independent modules for flexible composition.

Query analysis → Routing → [Select retrieval module] → [Compose post-processing] → Generation → Verification
                              ├─ Vector search
                              ├─ Graph search
                              ├─ SQL search
                              └─ Web search
  • Routing: Selects appropriate retrieval strategy based on query type
  • Module swapping: Each stage component can be independently replaced
  • Feedback loop: Evaluates generation results and retries retrieval
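
The routing step can be sketched with plain keyword rules. The trigger phrases and module names below are hypothetical choices made for illustration; production routers more often use an LLM or a trained classifier for this decision.

```python
def route_query(query: str) -> str:
    """Pick a retrieval module name using simple keyword heuristics (illustrative only)."""
    q = query.lower()
    if any(phrase in q for phrase in ("how many", "average", "total", "sum of")):
        return "sql_search"      # aggregate questions suit structured data
    if any(phrase in q for phrase in ("who reports to", "related to", "depends on")):
        return "graph_search"    # relationship questions suit a knowledge graph
    if any(phrase in q for phrase in ("latest", "today", "news")):
        return "web_search"      # freshness-sensitive questions
    return "vector_search"       # default: semantic search over internal docs

# Each module is an independently swappable callable (stubs here)
RETRIEVAL_MODULES = {
    "vector_search": lambda q: [f"[vector] {q}"],
    "graph_search": lambda q: [f"[graph] {q}"],
    "sql_search": lambda q: [f"[sql] {q}"],
    "web_search": lambda q: [f"[web] {q}"],
}

def routed_retrieve(query: str) -> list[str]:
    return RETRIEVAL_MODULES[route_query(query)](query)
```

Because every module shares the same call signature, replacing one (for example, swapping the stub for a real SQL retriever) does not affect the others.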

3. Data Preparation: Document Loading and Chunking

Document Loading

The first step in RAG is converting source documents of various formats into text.

| Document Format | Loader | Considerations |
| --- | --- | --- |
| PDF | PyPDFLoader, Unstructured | Watch for tables and OCR text extraction |
| HTML | BSHTMLLoader (BeautifulSoup) | Tag removal, body extraction |
| Markdown | UnstructuredMarkdownLoader | Preserve heading-based structure |
| Word/PPT | Unstructured | Can leverage formatting information |
| DB (SQL) | SQLDatabaseLoader | Convert query results into documents |
| Confluence/Notion | Dedicated API Loaders | Reflect page hierarchy |

# Various document loading examples
from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    UnstructuredMarkdownLoader,
    CSVLoader
)
 
# PDF loading
pdf_loader = PyPDFLoader("report.pdf")
pdf_docs = pdf_loader.load()
 
# Web page loading
web_loader = WebBaseLoader("https://docs.example.com/guide")
web_docs = web_loader.load()
 
# Markdown loading
md_loader = UnstructuredMarkdownLoader("README.md")
md_docs = md_loader.load()

Chunking Strategies

Chunking is one of the stages with the greatest impact on RAG performance. Chunks that are too large include noise, while chunks that are too small lose context.

Key chunking strategies:

| Strategy | Description | Best For |
| --- | --- | --- |
| Fixed Size | Split by fixed character/token count | Unstructured text |
| Recursive | Try splitting by paragraph → sentence → word | General purpose (most widely used) |
| Semantic | Split at embedding similarity change points | Documents with frequent topic changes |
| Document Structure | Leverage headings, sections, etc. | Technical docs, manuals |
| Agentic Chunking | LLM performs the chunking | Complex documents |

from langchain_text_splitters import RecursiveCharacterTextSplitter
 
# Recursive splitting (most commonly used)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Max chunk size (characters)
    chunk_overlap=50,      # Overlap between chunks (context preservation)
    separators=["\n\n", "\n", ". ", " ", ""],  # Split priority
    length_function=len
)
 
# Combine the documents loaded in the previous step
documents = pdf_docs + web_docs + md_docs
chunks = splitter.split_documents(documents)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
 
# Semantic splitting
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # Split at top 5% similarity changes
)
 
semantic_chunks = semantic_splitter.split_documents(documents)

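Note: A general guideline for chunk size is 200–1000 tokens. Smaller chunks (200–400) favor precise retrieval, while larger chunks (600–1000) provide richer context. Overlap should be 10–20% of chunk_size. Keep in mind that RecursiveCharacterTextSplitter measures chunk_size in characters when length_function=len, so convert the token guideline accordingly or pass a token-counting length function.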
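
To show what chunk_size and chunk_overlap actually do, here is a minimal sketch of fixed-size chunking with a sliding window. It is not the recursive algorithm used by RecursiveCharacterTextSplitter; the `chunk_text` function is a hypothetical helper that counts characters, not tokens.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

pieces = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
# pieces == ["abcd", "defg", "ghij"]; each chunk repeats the last character of the previous one
```

The repeated characters at chunk boundaries are what preserve context when a sentence straddles two chunks.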

Metadata Management

Attaching metadata to chunks enables filtering and weight adjustment during retrieval.

# Example of a chunk with metadata
{
    "content": "Quarterly revenue increased 15% year-over-year...",
    "metadata": {
        "source": "2025_Q4_report.pdf",
        "page": 12,
        "department": "finance",
        "date": "2025-12-31",
        "doc_type": "quarterly_report",
        "access_level": "internal"
    }
}

Use cases:

  • Date filter: "Q4 2025 revenue" → range search on date field
  • Department filter: "Marketing documents" → department == "marketing" filter
  • Access control: Filter by access_level based on user permissions
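
The three use cases above can be sketched as a pre-filter over chunk metadata. The `chunk_store` records and the `filter_chunks` helper are hypothetical; in practice the vector DB applies the equivalent filter during search.

```python
from datetime import date

chunk_store = [
    {"content": "Q4 revenue increased 15% year-over-year.",
     "metadata": {"department": "finance", "date": "2025-12-31", "access_level": "internal"}},
    {"content": "Spring campaign plan.",
     "metadata": {"department": "marketing", "date": "2025-03-01", "access_level": "public"}},
    {"content": "Q3 revenue summary.",
     "metadata": {"department": "finance", "date": "2025-09-30", "access_level": "restricted"}},
]

def filter_chunks(chunks, department=None, date_from=None, date_to=None, allowed_levels=None):
    """Return chunks whose metadata satisfies every filter that was supplied."""
    selected = []
    for chunk in chunks:
        meta = chunk["metadata"]
        chunk_date = date.fromisoformat(meta["date"])
        if department is not None and meta["department"] != department:
            continue
        if date_from is not None and chunk_date < date_from:
            continue
        if date_to is not None and chunk_date > date_to:
            continue
        if allowed_levels is not None and meta["access_level"] not in allowed_levels:
            continue
        selected.append(chunk)
    return selected

# "Q4 2025 revenue" for a user allowed to see public and internal documents
q4_finance = filter_chunks(
    chunk_store,
    department="finance",
    date_from=date(2025, 10, 1),
    date_to=date(2025, 12, 31),
    allowed_levels={"public", "internal"},
)
```

Filtering before similarity scoring keeps unauthorized or out-of-range chunks from ever reaching the prompt.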

4. Embeddings and Vector Databases

Embedding Model Selection

Embedding models convert text into high-dimensional vectors to enable semantic similarity computation. Model selection directly impacts RAG performance.

| Model | Dimensions | Multilingual | Features |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-large | 3072 | Yes | High performance, API cost |
| OpenAI text-embedding-3-small | 1536 | Yes | Excellent cost-performance ratio |
| Cohere embed-v3 | 1024 | Yes | Search-optimized, multilingual strength |
| BGE-M3 (BAAI) | 1024 | Yes | Top open-source, multilingual |
| E5-Mistral-7B | 4096 | Limited (primarily English) | Open-source, strong on long documents |
| multilingual-e5-large | 1024 | Yes | Multilingual-focused, lightweight |

# OpenAI embeddings
from langchain_openai import OpenAIEmbeddings
 
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector = embeddings.embed_query("RAG combines retrieval and generation")
# Result: 1536-dimensional vector
 
# Open-source embeddings (Hugging Face)
from langchain_huggingface import HuggingFaceEmbeddings
 
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True}
)

Note: For Korean RAG systems, multilingual models (BGE-M3, multilingual-e5, etc.) are recommended. English-only models may not accurately capture Korean semantics.

Major Vector DB Comparison

| Vector DB | Type | Scalability | Hybrid Search | Metadata Filter | Best For |
| --- | --- | --- | --- | --- | --- |
| Chroma | Embedded | Small | No | Yes | Prototypes, PoC |
| Pinecone | Managed SaaS | Large | Yes | Yes | Minimal ops |
| Weaviate | Self-hosted/Cloud | Large | Yes | Yes | Hybrid search focus |
| Milvus | Self-hosted | Very large | Yes | Yes | Enterprise, high volume |
| Qdrant | Self-hosted/Cloud | Large | Yes | Yes | High-performance filtering |
| pgvector | PostgreSQL extension | Medium | No | Yes (SQL) | Existing PostgreSQL users |

# Vector storage and search with Chroma
from langchain_chroma import Chroma
 
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="company_docs",
    persist_directory="./chroma_db"
)
 
# Similarity search
results = vectorstore.similarity_search_with_score(
    query="What is the data retention policy?",
    k=5
)
 
for doc, score in results:
    print(f"[Score: {score:.4f}] {doc.page_content[:100]}...")

Indexing Strategies

Vector search speed and accuracy depend heavily on the indexing algorithm.

| Algorithm | Principle | Speed | Accuracy | Memory |
| --- | --- | --- | --- | --- |
| Flat (Brute Force) | Compare against all vectors | Slow | 100% | Low |
| HNSW | Hierarchical graph traversal | Fast | Very high | High |
| IVF (Inverted File) | Cluster-based search | Fast | High | Medium |
| PQ (Product Quantization) | Compress vectors before search | Very fast | Medium | Very low |
| HNSW + PQ | Combines HNSW and PQ | Fast | High | Medium |

  • Small scale (< 100K vectors): Flat or HNSW
  • Medium scale (100K–10M): HNSW
  • Large scale (> 10M): IVF + PQ or HNSW + PQ

5. Retrieval Strategies

Vector Search

Vector search calculates similarity between query vectors and stored document vectors to find the most relevant documents.

Key similarity metrics:

| Metric | Formula | Features |
| --- | --- | --- |
| Cosine Similarity | cos(θ) = (A·B) / (‖A‖·‖B‖) | Direction-based, most widely used |
| Euclidean Distance (L2) | d = √Σ(a_i - b_i)² | Distance-based; sensitive to vector magnitude, so normalize vectors first when only direction matters |
| Inner Product (Dot Product) | s = Σ(a_i × b_i) | Fast computation, equivalent to Cosine for normalized vectors |

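Note: Many embedding models (OpenAI's, for example) output unit-normalized vectors, in which case Cosine Similarity and Inner Product produce identical scores. For such models, prefer Inner Product in performance-critical applications because it skips the norm computation. If your model does not normalize its output, normalize the vectors yourself or use Cosine Similarity.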
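
The relationship between the three metrics is easy to check numerically. The sketch below uses two small hand-picked vectors (an assumption for illustration) and shows that, once both are unit-normalized, cosine similarity equals the inner product and squared L2 distance equals 2 − 2·cosine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = normalize([3.0, 4.0])   # becomes [0.6, 0.8]
b = normalize([4.0, 3.0])   # becomes [0.8, 0.6]

cos_ab = cosine(a, b)
dot_ab = dot(a, b)
l2_sq = l2_distance(a, b) ** 2
# For unit vectors: cos_ab == dot_ab, and l2_sq == 2 - 2 * cos_ab (up to float rounding)
```

Because all three metrics are monotonic transformations of each other on unit vectors, they produce the same nearest-neighbor ranking in that case.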

Keyword Search (BM25)

BM25 is a traditional information retrieval algorithm that considers term frequency (TF) and inverse document frequency (IDF).

from langchain_community.retrievers import BM25Retriever
 
# Create BM25 retriever
bm25_retriever = BM25Retriever.from_documents(
    documents=chunks,
    k=5
)
 
# Keyword search
results = bm25_retriever.invoke("NiFi LDAP authentication setup")
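
For readers who want to see what happens inside a BM25 retriever, here is a minimal scoring sketch. It assumes whitespace tokenization, the common defaults k1=1.5 and b=0.75, and the non-negative IDF variant ln((N − n + 0.5) / (n + 0.5) + 1); real libraries differ in tokenization and parameter choices.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "Installer failed with error code 0x8007 during update",
    "Configure LDAP authentication for NiFi",
    "General troubleshooting guide for update errors",
]
scores = bm25_scores("error code 0x8007", docs)
best = scores.index(max(scores))  # index of the highest-scoring document
```

Only the first document contains the exact tokens "error", "code", and "0x8007", so it is the only one with a non-zero score. This is the exact-match behavior that makes BM25 strong on identifiers and error codes.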

Vector Search vs BM25 comparison:

| Scenario | Vector Search | BM25 |
| --- | --- | --- |
| "NiFi authentication setup" | Finds semantically similar docs (good) | Matches "NiFi", "authentication" keywords (good) |
| "security access control" | Also finds "authentication", "permission management" (good) | Misses without exact keywords (weak) |
| "error code 0x8007" | Low semantic similarity, inaccurate (weak) | Exact code matching (good) |

Hybrid Search

Hybrid search combines the strengths of vector search and keyword search. Reciprocal Rank Fusion (RRF) or weighted scoring is used to merge results from both search methods.

from langchain.retrievers import EnsembleRetriever
 
# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
 
# BM25 retriever
bm25_retriever = BM25Retriever.from_documents(chunks, k=5)
 
# Hybrid search (ensemble)
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # 60% vector search, 40% BM25
)
 
results = hybrid_retriever.invoke("NiFi cluster TLS certificate setup")

RRF (Reciprocal Rank Fusion) formula:

RRF_score(d) = Σ 1 / (k + rank_i(d))

- d: document
- k: constant (typically 60)
- rank_i(d): rank in the i-th retriever
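
The formula translates directly into a few lines of code. This is a minimal unweighted sketch; the document IDs and the two input rankings are made up for illustration, and ranks start at 1.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs using RRF."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

doc_b wins because it ranks near the top in both lists (1/62 + 1/61), narrowly ahead of doc_a (1/61 + 1/63). RRF uses only ranks, so it needs no score normalization between retrievers.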

Re-ranking

Re-ranking re-orders retrieved documents more precisely by relevance. Initial retrieval with a bi-encoder trades some accuracy for speed because queries and documents are embedded separately. A cross-encoder re-ranker scores each query-document pair jointly, which is slower but more precise, so it is applied only to a small set of candidates.

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
 
# Cohere Reranker
reranker = CohereRerank(
    model="rerank-v3.5",
    top_n=3  # Return only top 3
)
 
# Retriever with re-ranking
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever  # Re-rank hybrid search results
)
 
results = compression_retriever.invoke("Kudu partitioning strategies")

Effect of re-ranking:

| Stage | Processes | Speed | Accuracy |
| --- | --- | --- | --- |
| Initial Retrieval (Bi-Encoder) | All documents (tens of thousands to millions) | Fast | Medium |
| Re-ranking (Cross-Encoder) | Candidate documents (10–50) | Slow | High |

6. Prompt Construction and Response Generation

Context Injection Methods

Methods for inserting retrieved documents into the LLM prompt.

Stuff method (most basic): Insert all search results into a single prompt

Context:
[Document 1 content]
[Document 2 content]
[Document 3 content]

Based on the context above, answer the following question:
{question}

Map-Reduce method: Summarize each document individually, then combine

[Document 1] → LLM → Summary 1 ─┐
[Document 2] → LLM → Summary 2 ─┼→ Integration LLM → Final response
[Document 3] → LLM → Summary 3 ─┘

Map-Rerank method: Generate responses from each document, select by score

[Document 1] → LLM → Response 1 (score: 0.9) ← Selected
[Document 2] → LLM → Response 2 (score: 0.3)
[Document 3] → LLM → Response 3 (score: 0.7)
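
The three methods differ mainly in how many LLM calls they make and what each call sees. The sketch below takes the model as a plain callable so it runs without any provider; `llm` and `answer_with_score` are hypothetical stand-ins for real model calls, and the prompt wording is illustrative.

```python
from typing import Callable

def stuff(docs: list[str], question: str, llm: Callable[[str], str]) -> str:
    """Stuff: one LLM call with all documents in a single prompt."""
    context = "\n".join(docs)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Based on the context above, answer the following question:\n{question}"
    )
    return llm(prompt)

def map_reduce(docs: list[str], question: str, llm: Callable[[str], str]) -> str:
    """Map-Reduce: summarize each document, then answer from the summaries."""
    summaries = [
        llm(f"Summarize the parts relevant to '{question}':\n{doc}") for doc in docs
    ]
    return stuff(summaries, question, llm)

def map_rerank(
    docs: list[str],
    question: str,
    answer_with_score: Callable[[str, str], tuple[str, float]],
) -> str:
    """Map-Rerank: answer from each document separately, keep the best-scored answer."""
    candidates = [answer_with_score(doc, question) for doc in docs]
    best_answer, _ = max(candidates, key=lambda pair: pair[1])
    return best_answer
```

For N documents, Stuff makes 1 call, Map-Reduce makes N + 1, and Map-Rerank makes N. Stuff is the usual default while the combined context fits in the model's window.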

Prompt Template Design

Principles for effective RAG prompt design:

from langchain_core.prompts import ChatPromptTemplate
 
rag_prompt = ChatPromptTemplate.from_template("""
You are a technical support specialist at Data Dynamics.
Answer the question using ONLY the context information provided below.
 
## Rules
1. If the information is not in the context, respond with "I could not find that information in the provided documents."
2. Cite the source documents used in your answer.
3. Explain technical content with code examples.
 
## Context
{context}
 
## Question
{question}
 
## Answer
""")

Key design principles:

  • Specify role: Define the model's area of expertise
  • Force context-based responses: Prevent hallucination
  • Instruct to say "I don't know": Encourage honest responses
  • Require citations: Improve response credibility

Citation Processing

Including citations in RAG responses allows users to verify information.

# RAG chain implementation with citations
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
 
def format_docs_with_sources(docs):
    formatted = []
    for i, doc in enumerate(docs):
        source = doc.metadata.get("source", "Unknown")
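        page = doc.metadata.get("page")  # PyPDFLoader page numbers are 0-indexed, so 0 is valid
        ref = f"[{i+1}] {source}" + (f" (p.{page})" if page is not None else "")
        formatted.append(f"{ref}\n{doc.page_content}")
    return "\n\n---\n\n".join(formatted)
 
# Assumes `retriever` (e.g., compression_retriever above) and `llm` (a chat model) are already defined
rag_chain = (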
    {"context": retriever | format_docs_with_sources,
     "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
 
answer = rag_chain.invoke("What are Kudu's partitioning limitations?")

7. Advanced RAG Techniques

Query Transformation

Techniques that transform queries to improve retrieval quality.

HyDE (Hypothetical Document Embeddings)

Generates a hypothetical answer to the query, then uses that answer's embedding for retrieval.

Original query: "What are Kudu's partitioning methods?"
    ↓ LLM generates hypothetical answer
Hypothetical answer: "Apache Kudu supports Hash Partitioning and Range Partitioning.
                      Hash Partitioning distributes data evenly..."
    ↓ Embed hypothetical answer for retrieval
Search results: Actual Kudu partitioning documents
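
The flow above reduces to three calls. In this sketch `generate`, `embed`, and `search` are hypothetical callables standing in for an LLM, an embedding model, and a vector store, and the prompt wording is an assumption.

```python
from typing import Callable, Sequence

def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list[str]],
    k: int = 5,
) -> list[str]:
    """Retrieve with the embedding of a hypothetical answer instead of the raw query."""
    prompt = f"Write a short passage that answers the question: {query}"
    hypothetical_answer = generate(prompt)
    # Key step: embed the generated passage, not the original question
    query_vector = embed(hypothetical_answer)
    return search(query_vector, k)
```

The hypothetical answer may contain factual errors, which is acceptable here: it is used only to land in the right region of the embedding space, and the final response is still generated from the real documents that `search` returns.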

Multi-Query

Rewrites a single query from multiple perspectives to expand retrieval coverage.

from langchain.retrievers.multi_query import MultiQueryRetriever
 
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)
 
# Original query: "Kudu performance optimization"
# Auto-generated queries:
# 1. "How to improve Apache Kudu table scan performance"
# 2. "Strategies for improving query speed with Kudu partitioning"
# 3. "Kudu cluster hardware and configuration tuning guide"
results = multi_retriever.invoke("Kudu performance optimization")

Step-back Prompting

Transforms specific queries into more general ones to first retrieve background knowledge.

Original query: "Getting TLS errors with LDAP authentication in NiFi 1.23"
    ↓ Step-back
Transformed query: "What is Apache NiFi's LDAP authentication and TLS configuration architecture?"
    ↓ Retrieve background knowledge, then combine with original query

Self-RAG / Corrective RAG (CRAG)

Self-RAG

The LLM itself determines whether retrieval is needed, evaluates relevance of retrieved documents, and verifies generated responses.

Query received
  ↓
[Is retrieval needed?] → No → Direct response
  ↓ Yes
[Retrieve documents]
  ↓
[Are retrieved docs relevant?] → No → Re-retrieve with different strategy
  ↓ Yes
[Generate response]
  ↓
[Is response grounded in documents?] → No → Regenerate
  ↓ Yes
[Is response useful for the question?] → No → Regenerate
  ↓ Yes
Output final response
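
The decision flow can be sketched as ordinary control flow. Note the assumption: the original Self-RAG work trains the model to emit special reflection tokens, whereas this sketch replaces each check with a hypothetical callable (which could be a prompt to any LLM). The `max_attempts` limit and the fallback message are also illustrative additions.

```python
def self_rag(query, needs_retrieval, retrieve, is_relevant,
             generate, is_grounded, is_useful, max_attempts=3):
    """Run the retrieve / critique / generate loop shown above."""
    if not needs_retrieval(query):
        return generate(query, [])  # direct response, no retrieval

    for attempt in range(max_attempts):
        # `attempt` lets the retriever switch strategy on each retry
        docs = [d for d in retrieve(query, attempt) if is_relevant(query, d)]
        if not docs:
            continue  # nothing relevant: re-retrieve with a different strategy

        answer = generate(query, docs)
        if is_grounded(answer, docs) and is_useful(query, answer):
            return answer
        # otherwise fall through and try again (regenerate)

    return "I could not find a well-supported answer."
```

Bounding the loop matters in practice: each retry costs additional retrieval and LLM calls, so an explicit attempt limit keeps latency and cost predictable.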

Corrective RAG (CRAG)

Evaluates retrieval result quality and supplements with alternative sources like web search when inaccurate.

[Document retrieval] → [Relevance evaluation]
                         ├─ Correct → Knowledge refinement → Generation
                         ├─ Ambiguous → Supplement with web search → Generation
                         └─ Incorrect → Replace with web search → Generation
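
The three branches can be sketched as a threshold check on the best relevance score. The `score` and `web_search` callables and the 0.7 / 0.3 thresholds are assumptions for illustration, and "knowledge refinement" is simplified here to dropping documents below the upper threshold.

```python
def corrective_context(query, docs, score, web_search, upper=0.7, lower=0.3):
    """Choose the generation context based on how relevant the retrieved docs are."""
    scored = [(doc, score(query, doc)) for doc in docs]
    best = max((s for _, s in scored), default=0.0)

    if best >= upper:
        # Correct: keep only documents that clear the relevance bar
        return [doc for doc, s in scored if s >= upper]
    if best < lower:
        # Incorrect: discard retrieved docs and rely on web search instead
        return web_search(query)
    # Ambiguous: keep retrieved docs and supplement them with web search
    return [doc for doc, _ in scored] + web_search(query)
```

An empty retrieval result is treated as the "Incorrect" case, so the pipeline falls back to web search rather than generating from nothing.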

Graph RAG

A technique that leverages knowledge graphs to retrieve based on relationships between entities. Effective for relationship-based queries that are difficult to capture through simple text similarity.

[Standard RAG]
"Who is the data architect for this project?" → Text similarity search → Difficult to find accurate results

[Graph RAG]
"Who is the data architect for this project?"
  → Traverse "project" node in knowledge graph
  → Follow "role: data architect" relationship
  → Return connected "person" node

Use cases:

  • Organizational structure queries ("Who is the team lead of Department A?")
  • Causal relationship tracking ("What is the root cause of this incident?")
  • Multi-hop reasoning ("What products did customers who bought Product A also buy?")
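
A tiny in-memory graph is enough to show why relationship queries suit graph traversal. The node names, relation names, and `multi_hop` helper below are hypothetical; a real deployment would use a graph database and extract the entities and relations from documents.

```python
# Knowledge graph stored as (source node, relation) -> list of target nodes
knowledge_graph = {
    ("project_x", "data_architect"): ["alice"],
    ("project_x", "team_lead"): ["bob"],
    ("alice", "reports_to"): ["bob"],
    ("bob", "reports_to"): ["carol"],
}

def follow(node: str, relation: str) -> list[str]:
    """Return the nodes reachable from `node` via one `relation` edge."""
    return knowledge_graph.get((node, relation), [])

def multi_hop(start: str, relations: list[str]) -> list[str]:
    """Follow a chain of relations, one hop per relation."""
    frontier = [start]
    for relation in relations:
        frontier = [target for node in frontier for target in follow(node, relation)]
    return frontier

architect = multi_hop("project_x", ["data_architect"])
architects_manager = multi_hop("project_x", ["data_architect", "reports_to"])
```

The second query needs two hops (project → architect → manager). Text similarity search struggles with this kind of question because no single chunk may mention both the project and the manager.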

Agentic RAG

An LLM agent performs retrieval as a tool, building answers through a plan-execute-reflect loop for complex queries.

[User query: "How much did Q3 revenue increase compared to Q2, and what were the main drivers?"]

Agent planning:
1. Retrieve Q2 revenue data → tool: vector_search("Q2 revenue")
2. Retrieve Q3 revenue data → tool: vector_search("Q3 revenue")
3. Compare revenue changes → tool: calculator(Q3 - Q2)
4. Analyze growth drivers → tool: vector_search("Q3 revenue growth drivers")
5. Generate comprehensive report

Agent execution: Perform each step → Reflect on intermediate results → Re-retrieve if needed

from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.tools.retriever import create_retriever_tool
 
# Create search tool
search_tool = create_retriever_tool(
    retriever=retriever,
    name="company_docs_search",
    description="Search internal company documents. Use for finding policies, technical guides, and reports."
)
 
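# Prompt for the agent (the agent_scratchpad placeholder is required)
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Use the company_docs_search tool when internal information is needed."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# Create agent (assumes `llm` is a chat model that supports tool calling)
agent = create_tool_calling_agent(llm, [search_tool], prompt)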
agent_executor = AgentExecutor(agent=agent, tools=[search_tool], verbose=True)
 
result = agent_executor.invoke({
    "input": "How much did Q3 revenue increase vs Q2, and what were the main drivers?"
})

8. RAG Evaluation and Optimization

Evaluation Metrics

RAG system quality is evaluated on two axes: retrieval quality and generation quality.

Retrieval quality metrics:

| Metric | Description |
| --- | --- |
| Recall@K | Proportion of ground truth documents found in top K results |
| Precision@K | Proportion of relevant documents in top K results |
| MRR (Mean Reciprocal Rank) | Average reciprocal of the rank of the first correct document |
| NDCG | Relevance score considering ranking position |
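
These four metrics are short enough to implement directly. The sketch below assumes binary relevance (a document is either relevant or not) and uses made-up document IDs; MRR is the mean of `reciprocal_rank` over all evaluation queries.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """NDCG with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    ideal_dcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal_dcg if ideal_dcg else 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # system output, best first
relevant = {"d1", "d2"}                # ground truth for this query

recall = recall_at_k(retrieved, relevant, k=3)        # 1 of 2 relevant docs found
precision = precision_at_k(retrieved, relevant, k=3)  # 1 of 3 results relevant
rr = reciprocal_rank(retrieved, relevant)             # first hit at rank 2
ndcg = ndcg_at_k(retrieved, relevant, k=3)
```

Recall and precision ignore order within the top K, while reciprocal rank and NDCG reward placing relevant documents earlier. That ordering is what re-ranking is meant to improve.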

Generation quality metrics:

| Metric | Description |
| --- | --- |
| Faithfulness | Degree to which the response is grounded in retrieved documents |
| Answer Relevancy | Degree to which the response addresses the question |
| Context Relevancy | Degree to which retrieved context relates to the question |
| Context Utilization | Degree to which retrieved context is actually used |

RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework for automatically evaluating RAG systems.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset
 
# Construct evaluation data (column names vary between RAGAS releases; check the docs for your installed version)
eval_data = {
    "question": ["What are Kudu's partitioning methods?"],
    "answer": ["Kudu supports Hash and Range partitioning..."],
    "contexts": [["Kudu supports Hash Partitioning and Range Partitioning..."]],
    "ground_truth": ["Kudu supports Hash, Range partitioning and..."]
}
 
dataset = Dataset.from_dict(eval_data)
 
# Run RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
 
print(results)
# Example output (illustrative values, not measured results):
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.90}

Performance Tuning Points

Key checkpoints when RAG performance falls short of expectations:

| Problem | Symptoms | Tuning Direction |
| --- | --- | --- |
| Retrieval miss | Relevant docs not found | Adjust chunk size, apply hybrid search, increase K |
| Noisy documents | Irrelevant docs included | Add re-ranking, metadata filtering, set thresholds |
| Hallucination | Response unrelated to search results | Improve prompt, lower temperature, force citations |
| Incomplete response | Only partial information included | Increase chunk size, apply Multi-Query, increase K |
| Slow response | High latency | Optimize index, caching, apply streaming |
| Embedding quality | Semantic similarity inaccurate | Change embedding model, domain fine-tuning |

Recommended tuning priority:

  1. Optimize chunking strategy (highest impact)
  2. Apply hybrid search
  3. Add re-ranking
  4. Change embedding model
  5. Improve prompts
  6. Apply query transformation

9. Enterprise RAG Deployment

Security and Access Control

In enterprise environments, document-level access permissions must be reflected in the RAG system.

# Retrieval with access control example
def secure_search(query: str, user_role: str, department: str):
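    # Build metadata filter based on user permissions.
    # get_allowed_levels / get_allowed_departments are application-specific helpers (not shown).
    # Chroma requires multiple conditions to be combined with an explicit operator such as $and.
    filter_conditions = {
        "$and": [
            {"access_level": {"$in": get_allowed_levels(user_role)}},
            {"department": {"$in": get_allowed_departments(user_role, department)}},
        ]
    }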
 
    results = vectorstore.similarity_search(
        query=query,
        k=5,
        filter=filter_conditions
    )
    return results
 
# Regular employee: search public documents only
results = secure_search("HR policy", user_role="employee", department="engineering")
 
# Manager: search including internal documents
results = secure_search("HR policy", user_role="manager", department="hr")

Key security considerations:

  • Document-level ACL: Manage per-document access permissions via metadata
  • Row-level security: Filter unauthorized documents from search results
  • Prompt injection prevention: Validate and sanitize user input
  • Data encryption: Apply encryption when storing in vector DB
  • Audit logging: Record search queries and access history

Multi-tenant Architecture

Architecture for multiple teams or customers sharing a single RAG system.

[Multi-tenant RAG Architecture]

Tenant A ─┐                      ┌─ Collection A (Vector DB)
Tenant B ─┼→ API Gateway    → ─┼─ Collection B (Vector DB)
Tenant C ─┘   (Auth/Routing)     └─ Collection C (Vector DB)
                                          ↓
                                   Shared LLM Endpoint

Isolation strategies:

| Method | Description | Pros | Cons |
| --- | --- | --- | --- |
| Collection isolation | Separate collection per tenant | Complete data isolation | Management overhead |
| Namespace isolation | Namespace separation within same collection | Efficient resource use | Soft isolation |
| Metadata filtering | Filter by tenant ID in metadata | Simple implementation | Performance degradation at scale |

Operational Monitoring and Feedback Loops

A monitoring framework for continuous quality management of production RAG systems.

Core monitoring metrics:

| Category | Metric | Target |
| --- | --- | --- |
| Performance | Response latency (P50/P95/P99) | P95 < 3s |
| Quality | User feedback (thumbs up/down) | Positive > 80% |
| Retrieval | No results rate | < 5% |
| Cost | Daily token usage | Within budget |
| Reliability | Error rate | < 0.1% |
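
Checking these targets is straightforward once the raw counters are collected. The sketch below uses only the standard library; the `TARGETS` values mirror the table above, while the function names and input shape are assumptions. The cost target is omitted because "within budget" depends on a figure the table does not specify.

```python
import statistics

TARGETS = {
    "p95_latency_s": 3.0,       # P95 < 3s
    "positive_feedback": 0.80,  # Positive > 80%
    "no_result_rate": 0.05,     # < 5%
    "error_rate": 0.001,        # < 0.1%
}

def latency_percentiles(latencies_s: list[float]) -> dict[str, float]:
    """Return P50/P95/P99 (needs at least two samples)."""
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def check_targets(latencies_s, positive, total_feedback, no_results, errors, total_queries):
    """Compare observed metrics against the targets; True means the target is met."""
    percentiles = latency_percentiles(latencies_s)
    return {
        "latency_ok": percentiles["p95"] < TARGETS["p95_latency_s"],
        "feedback_ok": positive / total_feedback > TARGETS["positive_feedback"],
        "retrieval_ok": no_results / total_queries < TARGETS["no_result_rate"],
        "reliability_ok": errors / total_queries < TARGETS["error_rate"],
    }
```

Tracking P95 and P99 alongside the median matters because RAG latency is often long-tailed: a slow re-ranker or LLM call on a few queries can be invisible in P50 while still affecting many users.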

Building a feedback loop:

[User query] → [RAG response] → [User feedback]
                                       ↓
                               [Feedback analysis]
                               ├─ Negative feedback → Collect failure cases → Improve
                               │                      ├─ Retrieval miss → Add docs / adjust chunking
                               │                      ├─ Wrong answer → Improve prompt
                               │                      └─ Slow response → Optimize infrastructure
                               └─ Positive feedback → Analyze success patterns → Maintain/expand

Note: A RAG system is not "build once and done." As documents are added/changed and user query patterns evolve, continuous monitoring and improvement are essential.


References

  • Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS
  • Gao, Y. et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv
  • Asai, A. et al. (2024). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." ICLR
  • Yan, S. et al. (2024). "Corrective Retrieval Augmented Generation." arXiv
  • Edge, D. et al. (2024). "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." arXiv
  • Es, S. et al. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL
  • LangChain Documentation — https://python.langchain.com/docs/
  • LlamaIndex Documentation — https://docs.llamaindex.ai/

— Data Dynamics Engineering Team