
LLM (Large Language Model) Complete Guide - From Concepts to Applications

A comprehensive overview of LLM core concepts, Transformer architecture, training process, prompt engineering, and RAG.

Data Dynamics | April 15, 2026 | 13 min read

An LLM (Large Language Model) is a large-scale AI model that learns from vast amounts of text data to understand and generate human language. This post provides a structured overview of LLM concepts, from fundamentals to practical applications.

1. What is an LLM?

Definition and Background

An LLM (Large Language Model) is a deep learning model with billions to trillions of parameters, trained on massive text corpora to understand and generate natural language.

Here is a timeline of key milestones in LLM development:

Year	Technology	Significance
2017	Transformer paper	"Attention Is All You Need" — new parallelizable architecture
2018	GPT-1 / BERT	Established the pre-training + fine-tuning paradigm
2020	GPT-3 (175B params)	Demonstrated few-shot learning capabilities
2022	ChatGPT	Popularized conversational AI with RLHF
2023–2024	GPT-4, Claude, Gemini, LLaMA	Multimodal, long context, open-source competition
2025–2026	Claude 4, GPT-5, etc.	Agent-based autonomous work, 1M+ token context

Differences from Traditional NLP

The fundamental difference is the paradigm: traditional NLP trains a separate model for each task, while an LLM is a single general-purpose model that handles many tasks.

Aspect	Traditional NLP	LLM
Training	Task-specific individual training	Large-scale pre-training → fine-tuning
Data	Small labeled datasets	Internet-scale unstructured text
Model Size	Millions of parameters	Billions to trillions of parameters
Versatility	Single task (sentiment, NER, etc.)	Translation, summarization, coding, reasoning
Zero-shot	Cannot perform untrained tasks	Can perform untrained tasks via prompts

2. Core Technology: Transformer Architecture

Attention Mechanism

Attention is a mechanism that computes how each token in a sequence relates to every other token. It overcomes the sequential processing limitations of RNN/LSTM and effectively captures relationships between distant words in long sentences.

The core attention formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Q (Query): What the current token is asking about other tokens
K (Key): Feature information each token carries
V (Value): The actual values to be passed
√d_k: Scaling factor (square root of the key dimension d_k); it keeps the dot products from growing with the dimension and saturating the softmax
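The formula above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the function names and the toy Q, K, V values are made up for the example, and real models use batched tensor operations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (output, weights), where weights[i][j] is how much
    query i attends to key j.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is the attention-weighted sum of the value rows
    output = [
        [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights

# Toy example: 3 tokens, d_k = d_v = 2 (values are made up)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn_weights = attention(Q, K, V)
```

Each row of `attn_weights` sums to 1, so every output row is a weighted average of the value vectors.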

Self-Attention and Multi-Head Attention

Self-Attention computes relationships between each token and all other tokens within the same sequence. For example, in "The cat sat on the mat. It looked comfortable," it can determine that "It" refers to "the cat."

Multi-Head Attention performs Self-Attention in parallel across multiple "heads," learning diverse perspectives on token relationships.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) × W_O

Each head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

For instance, with 8 heads, each head can learn to attend to a different kind of pattern, such as grammatical, semantic, or positional relationships.
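A minimal sketch of the MultiHead formula follows, assuming random projection matrices in place of learned weights. The helper names (`matmul`, `rand_matrix`, `multi_head_attention`) and the sizes (4 tokens, d_model = 8, 2 heads) are illustrative choices, not values from any particular model.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    """X is n x d_model; heads is a list of (W_Q, W_K, W_V), each d_model x d_head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concat(head_1, ..., head_h): join each token's rows along the feature axis
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = rand_matrix(n_tokens, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(n_heads)]
W_O = rand_matrix(n_heads * d_head, d_model, rng)
Y = multi_head_attention(X, heads, W_O)  # same shape as X: 4 x 8
```

Because d_head = d_model / h, running h smaller heads costs about the same as one full-width attention pass.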

Encoder vs Decoder Architecture

The Transformer architecture is used in three variations depending on the purpose:

Architecture	Representative Models	Characteristics	Primary Use
Encoder-only	BERT, RoBERTa	Bidirectional context understanding	Classification, NER, sentiment analysis
Decoder-only	GPT, Claude, LLaMA	Left-to-right autoregressive generation	Text generation, chat, coding
Encoder-Decoder	T5, BART	Input understanding + output generation	Translation, summarization

Most current LLMs (GPT, Claude, LLaMA, Gemini, etc.) adopt the Decoder-only architecture. This structure generates text in an autoregressive manner by predicting the next token.
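Autoregressive generation can be illustrated with a toy loop. In this sketch the "model" is a hand-written lookup table (`TOY_TABLE`, a hypothetical stand-in for a real decoder), and decoding is greedy: the most probable next token is appended and fed back in until an end-of-sequence token appears.

```python
def generate(next_token_probs, prompt, max_new_tokens, eos="<eos>"):
    """Greedy autoregressive decoding.

    next_token_probs: callable that maps the tokens so far to a
    {token: probability} dict for the next position.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        next_token = max(probs, key=probs.get)  # pick the most probable token
        if next_token == eos:
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens

# Hypothetical stand-in for a trained model: it only looks at the last token
TOY_TABLE = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "sat": {"<eos>": 1.0},
}

def toy_model(tokens):
    return TOY_TABLE.get(tokens[-1], {"<eos>": 1.0})

result = generate(toy_model, ["the"], max_new_tokens=10)
# result == ["the", "cat", "sat"]
```

A real decoder conditions on the whole prefix rather than only the last token, and production systems usually sample instead of always taking the maximum.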

3. LLM Training Process

LLM training is broadly divided into three stages.

Pre-training

Pre-training is the first and most expensive stage. The model learns language structure and knowledge from massive text data including internet content, books, and papers.

Objective: Next Token Prediction

Input: "The weather today is really"
Prediction: "nice" (prob 0.35), "hot" (prob 0.20), "cold" (prob 0.15), ...
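As a rough sketch of how this objective is scored: the model emits one logit per vocabulary entry, softmax turns the logits into probabilities, and the training loss is the negative log-probability of the token that actually came next. The four-word vocabulary and the logit values below are made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy loss for one position: -log p(target token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

vocab = ["nice", "hot", "cold", "blue"]   # toy vocabulary
logits = [2.0, 1.4, 1.1, -1.0]            # made-up model outputs
probs = softmax(logits)

loss_if_nice = next_token_loss(logits, vocab.index("nice"))
loss_if_blue = next_token_loss(logits, vocab.index("blue"))
# A likely continuation gives a small loss, an unlikely one a large loss
```

Pre-training averages this loss over every position in the corpus and adjusts the parameters to reduce it.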

Key characteristics:

Data Scale: Multiple terabytes of text data
Compute Cost: Thousands to tens of thousands of GPUs for weeks to months
Self-supervised: Learns patterns from unlabeled text; the next token itself serves as the label
Cost: Millions to tens of millions of dollars for large models

Fine-tuning

Fine-tuning is additional training that adapts a pre-trained model to a specific purpose.

Method	Description	Advantage
Full Fine-tuning	Updates all parameters	Best possible performance
LoRA (Low-Rank Adaptation)	Trains only low-rank matrices	Less memory, faster training
QLoRA	Quantization + LoRA	Feasible on consumer GPUs
Instruction Tuning	Trains on instruction-response pairs	Improves instruction following

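# Fine-tuning with LoRA example (Hugging Face PEFT)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
 
# Placeholder checkpoint: any causal LM whose attention layers are named q_proj / v_proj
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
 
lora_config = LoraConfig(
    r=16,                      # Low-rank dimension
    lora_alpha=32,             # Scaling factor (updates are scaled by lora_alpha / r)
    target_modules=["q_proj", "v_proj"],  # Target layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
 
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Typically well under 1% of total parameters are trainable; the exact share depends on the model and r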
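To see where the savings come from, here is a small sketch of the LoRA arithmetic from Hu et al.: a frozen weight W (d_out x d_in) is augmented with a trainable update B·A of rank r, scaled by alpha / r. The layer size 4096 x 4096 is an illustrative assumption, not a figure from a specific model.

```python
def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full fine-tuning vs. LoRA for one weight matrix."""
    full = d_out * d_in
    lora = r * (d_in + d_out)   # A is r x d_in, B is d_out x r
    return full, lora

full, lora = lora_param_counts(4096, 4096, 16)
ratio = lora / full  # 131072 / 16777216 = 0.0078125, i.e. about 0.8% of this layer

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x), with W frozen and only A, B trained."""
    Wx = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    BAx = [sum(b * ai for b, ai in zip(row, Ax)) for row in B]
    scale = alpha / r
    return [wx + scale * d for wx, d in zip(Wx, BAx)]

# Tiny example: B starts at zero, so the adapted layer initially equals the base layer
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]          # r = 1
B = [[0.0], [0.0]]
h = lora_forward(W, A, B, [1.0, 1.0], alpha=2, r=1)   # [3.0, 7.0]
```

The overall trainable share for a whole model is lower than the per-layer ratio, because only the targeted layers receive adapters.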

RLHF (Reinforcement Learning from Human Feedback)

RLHF is a technique that fine-tunes an LLM so that its responses better align with human preferences.

RLHF 3-step process:

SFT (Supervised Fine-Tuning): Supervised learning with high-quality conversation data
Reward Model Training: Building a reward model that learns human response preferences
PPO (Proximal Policy Optimization): Reinforcement learning based on the reward model
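The reward-model step is commonly trained on pairs of responses where a human marked one as preferred. A minimal sketch of the pairwise loss used in that setting (as in Ouyang et al., 2022) is shown here; the reward scores are made-up numbers standing in for a reward model's outputs.

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-diff))."""
    diff = r_chosen - r_rejected
    return math.log1p(math.exp(-diff))

# Made-up scalar rewards for (preferred, rejected) responses
loss_correct_order = pairwise_reward_loss(2.0, 0.0)   # small: model agrees with the human
loss_tie = pairwise_reward_loss(0.0, 0.0)             # log(2): model cannot tell them apart
loss_wrong_order = pairwise_reward_loss(0.0, 2.0)     # large: model prefers the rejected one
```

Minimizing this loss pushes the reward of the preferred response above the rejected one; the PPO step then optimizes the LLM against the learned reward.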

[Diagram: RLHF 3-step process]

Note: Beyond RLHF, recent research explores various alignment techniques such as DPO (Direct Preference Optimization) and RLAIF (Reinforcement Learning from AI Feedback).

4. Tokenization and Embedding

Tokenization Methods

Tokenization is the process of splitting text into the smallest units (tokens) that the model can process.

Method	Description	Used By
BPE (Byte-Pair Encoding)	Iteratively merges frequent character pairs	GPT series
SentencePiece	Language-independent subword tokenization	LLaMA, T5
WordPiece	BPE variant, likelihood-based merging	BERT
Tiktoken	BPE-based, OpenAI optimized implementation	GPT-3.5/4
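The BPE merge loop in the first row can be sketched as a toy trainer. This is a simplified character-level version with a made-up word-frequency corpus; production tokenizers work on bytes, handle whitespace and special tokens, and are far more optimized.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its frequency in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Start from characters and apply the most frequent merge num_merges times."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}   # toy word counts
merges, segmented = learn_bpe(corpus, num_merges=3)
# The first two merges are ("e", "s") and then ("es", "t"), giving the subword "est"
```

Frequent character sequences such as "est" end up as single tokens, while rare words stay split into smaller pieces.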

Tokenization example:

# Tokenization example using tiktoken
import tiktoken
 
enc = tiktoken.encoding_for_model("gpt-4")
text = "Large Language Models are transforming AI"
tokens = enc.encode(text)
# [27050, 11688, 27972, 527, 46890, 15592]  → 6 tokens
 
# Korean tokenization
text_ko = "대규모 언어 모델은 AI를 변화시키고 있다"
tokens_ko = enc.encode(text_ko)
# Korean requires approximately 2-3x more tokens than English

Note: Korean uses more tokens than English to express the same meaning, which affects API costs and context utilization efficiency.

Vector Embedding Concepts

Embedding is the process of mapping tokens into a high-dimensional vector space. Semantically similar words are placed close together in the vector space.

"king" - "man" + "woman" ≈ "queen"
"Seoul" - "Korea" + "Japan" ≈ "Tokyo"
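The analogy arithmetic can be reproduced with cosine similarity on toy vectors. The 3-dimensional vectors below are hand-made for illustration (roughly "royalty", "male", "female" axes); real embeddings have hundreds or thousands of learned dimensions.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy embeddings, not taken from any trained model
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

# "king" - "man" + "woman"
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Nearest remaining word by cosine similarity (the three input words are excluded)
candidates = [w for w in toy if w not in {"king", "man", "woman"}]
nearest = max(candidates, key=lambda w: cosine_similarity(target, toy[w]))
# nearest == "queen"
```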

Roles of embedding in LLMs:

Input Embedding: Token → vector conversion (model input)
Positional Embedding: Encoding token order information
Output Embedding: Vector → token probability conversion (model output)
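One concrete way to encode token order is the fixed sinusoidal scheme from the original Transformer paper, sketched here. Many current LLMs use other schemes (learned or rotary position embeddings), so treat this as one example rather than what any given model does.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """PE[pos][2k] = sin(pos / 10000^(2k/d_model)), PE[pos][2k+1] = cos(same angle)."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Even and odd indices of a pair share the same frequency
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positions(seq_len=4, d_model=6)
# pe[0] is [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]; later rows differ, so each position gets a unique vector
```

These vectors are added to the input embeddings so that otherwise identical tokens at different positions look different to the attention layers.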

Context Window

The context window is the maximum number of tokens a model can process at once.

Model	Context Window	Notes
GPT-3	2,048 tokens	~1,500 words
GPT-4 Turbo	128K tokens	~96,000 words
Claude 3.5	200K tokens	~150,000 words
Claude Opus 4	1M tokens	~750,000 words
Gemini 1.5 Pro	1M tokens	~750,000 words

Larger context windows allow processing longer documents at once, but the cost of standard self-attention grows quadratically with the number of tokens.
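The quadratic growth is easy to check with arithmetic: full self-attention compares every token with every other token, so one head in one layer computes n x n scores. The helper names below are illustrative.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Number of pairwise scores for one head in one layer (n x n)."""
    return n_tokens * n_tokens

def relative_cost(n_tokens: int, baseline_tokens: int) -> float:
    """How many times more scores than at the baseline length."""
    return attention_score_entries(n_tokens) / attention_score_entries(baseline_tokens)

doubling = relative_cost(8_192, 4_096)       # 4.0: twice the tokens, four times the scores
long_context = relative_cost(128_000, 4_000)  # 1024.0: 32x the tokens, 1024x the scores
```

Optimized attention kernels reduce the memory needed for these scores, but the number of pairwise comparisons still grows with the square of the sequence length.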

5. Major LLM Model Comparison

GPT Series (OpenAI)

OpenAI's GPT (Generative Pre-trained Transformer) series led the popularization of LLMs.

Model	Parameters	Key Features
GPT-3	175B	Demonstrated few-shot learning potential
GPT-3.5	Undisclosed	ChatGPT foundation, RLHF applied
GPT-4	Undisclosed (unconfirmed reports suggest ~1.7T, MoE)	Multimodal, enhanced reasoning
GPT-4o	Undisclosed	Voice/image integration, fast inference

Claude (Anthropic)

Anthropic's Claude emphasizes the balance between safety and helpfulness.

Constitutional AI: Safety through constitutional AI techniques
Long Context: Up to 1M token support
Coding Ability: Strengths in code generation, analysis, and debugging
Agent Use: Autonomous task execution through tools like Claude Code

Open-Source Models

Model	Developer	Parameters	Features
LLaMA 3 / 3.1	Meta	8B / 70B / 405B	Open weights, commercial use allowed under Meta's license
Mistral / Mixtral	Mistral AI	7B / 8x7B (MoE)	High performance for size
Gemma / Gemma 2	Google	2B / 7B / 9B / 27B	Lightweight, various sizes
Qwen 2.5	Alibaba	0.5B–72B	Multilingual, code, math strengths

Note: MoE (Mixture of Experts) is an architecture that activates only a subset of expert networks from the total parameters for efficient inference. For example, Mixtral 8x7B has 46.7B total parameters but uses only ~12.9B per token during inference.
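A rough sketch of the routing idea: a router scores every expert for each token, only the top-k experts run, and their outputs are mixed using softmax weights over the selected scores. The eight toy experts and the router scores below are made up; in a real MoE layer each expert is a feed-forward network and the router is learned.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and combine their outputs."""
    routes = top_k_route(router_logits, k)
    output = [0.0] * len(x)
    for expert_index, weight in routes:
        expert_out = experts[expert_index](x)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, routes

# Eight toy "experts": expert i simply scales its input by (i + 1)
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # made-up scores for one token
y, routes = moe_forward([1.0, 2.0], experts, router_logits, k=2)
# Only experts 1 and 4 run for this token; the other six are skipped
```

This is why the active parameter count per token is much smaller than the total: most experts are never evaluated for a given token.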

6. Prompt Engineering

Prompt Structure and Principles

Prompt engineering is the technique of optimizing inputs (prompts) sent to LLMs to achieve desired results.

Components of an effective prompt:

Role: Define the role for the model
Context: Background information and constraints
Instruction: Clear description of the task
Examples: Desired input/output formats
Output Format: Response structure (JSON, table, list, etc.)
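These five components can be assembled programmatically. The following is one possible template, not a standard format: the function name `build_prompt`, the section labels, and the sample content are all illustrative.

```python
def build_prompt(role, context, instruction, examples, output_format):
    """Assemble the five prompt components into one string, skipping empty ones."""
    example_lines = "\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples)
    sections = [
        ("Role", role),
        ("Context", context),
        ("Instruction", instruction),
        ("Examples", example_lines),
        ("Output Format", output_format),
    ]
    return "\n\n".join(f"{name}:\n{body}" for name, body in sections if body)

prompt = build_prompt(
    role="You are a customer-review analyst.",
    context="Reviews come from an online electronics store.",
    instruction="Classify the sentiment of the review.",
    examples=[
        ("Fast delivery, very happy", "Positive"),
        ("Screen keeps turning off", "Negative"),
    ],
    output_format='JSON: {"sentiment": "Positive" | "Negative"}',
)
print(prompt)
```

Keeping the components as separate arguments makes it easy to vary one of them (for example, the examples) while testing prompts.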

Zero-shot / Few-shot / Chain-of-Thought

Zero-shot: Performing tasks with instructions only, no examples

Analyze the sentiment of the following sentence: "This product is absolutely amazing!"
→ Positive

Few-shot: Performing tasks with a few examples

Sentence: "Fast delivery, very happy" → Positive
Sentence: "Screen keeps turning off" → Negative
Sentence: "Price is ok but quality is poor" → ?
→ Negative

Chain-of-Thought (CoT): Inducing step-by-step reasoning

Q: A store had 23 apples. They sold 8 and received 12 new ones.
   How many apples are there now?

A: Let's solve step by step.
   1. Initial apples: 23
   2. After selling: 23 - 8 = 15
   3. After restocking: 15 + 12 = 27
   Therefore, there are currently 27 apples.

System Prompts and Role Assignment

System prompts define the overall behavior of the model and are set at the beginning of a conversation.

{
  "messages": [
    {
      "role": "system",
      "content": "You are a data engineer with 10 years of experience. Provide accurate, practical answers to technical questions. Include code examples."
    },
    {
      "role": "user",
      "content": "How do you resolve data skew issues in Spark?"
    }
  ]
}

Note: System prompt support varies by model. Claude uses a system parameter, while GPT uses a system role message.
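When the same conversation has to be sent to both styles of API, a small adapter can convert between them. This sketch only reshapes the data; it makes no API calls, and the helper names are hypothetical.

```python
def split_system_prompt(messages):
    """Role-message style -> (system text, remaining turns) for APIs with a separate system parameter."""
    system_text = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    turns = [m for m in messages if m["role"] != "system"]
    return system_text, turns

def merge_system_prompt(system_text, turns):
    """(system text, turns) -> role-message style with a leading system message."""
    if not system_text:
        return list(turns)
    return [{"role": "system", "content": system_text}] + list(turns)

messages = [
    {"role": "system", "content": "You are a data engineer with 10 years of experience."},
    {"role": "user", "content": "How do you resolve data skew issues in Spark?"},
]

system_text, turns = split_system_prompt(messages)
round_trip = merge_system_prompt(system_text, turns)
```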

7. RAG (Retrieval-Augmented Generation)

Why RAG and How It Works

RAG is a technique that retrieves relevant information from external knowledge bases to enhance LLM responses.

Limitations of standalone LLMs:

Cannot know information after training data cutoff
Lacks enterprise internal data or domain knowledge
Risk of hallucination (generating false information)

RAG Pipeline Architecture:

[Diagram: RAG pipeline]
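The retrieve-then-generate flow can be sketched without any external services. In this toy version a bag-of-words count stands in for a real embedding model, an in-memory list stands in for the vector database, and the final LLM call is left out; all names and sample documents are made up.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Step 1: rank stored chunks by similarity to the query and keep the top k."""
    query_vec = embed(query)
    return sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)[:k]

def build_rag_prompt(query, retrieved):
    """Step 2: put the retrieved chunks into the prompt as grounding context."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

chunks = [
    "Customer records are retained for five years after account closure.",
    "The cafeteria opens at 8 am on weekdays.",
    "Backups are encrypted and stored in two regions.",
]
query = "How long are customer records retained?"
top_chunks = retrieve(query, chunks, k=1)
rag_prompt = build_rag_prompt(query, top_chunks)
# Step 3 (not shown): send rag_prompt to an LLM to generate the grounded answer
```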

Vector Database Integration

Vector databases play a central role in RAG. Common options:

Vector DB	Features	Best For
Chroma	Lightweight, embedded, Python-friendly	Prototypes, small scale
Pinecone	Fully managed SaaS	Minimal ops overhead
Weaviate	Hybrid search support	Keyword + vector search
Milvus	Large-scale distributed processing	Enterprise, high volume
pgvector	PostgreSQL extension	Existing PostgreSQL users

# Simple RAG implementation with LangChain + Chroma
# Imports assume LangChain 0.2/0.3 with the langchain-community, langchain-openai,
# and langchain-text-splitters packages installed; module paths differ in other versions.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
 
# 1. Load and split documents
loader = TextLoader("company_docs.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
 
# 2. Store in vector DB
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
 
# 3. Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)
 
# 4. Query
result = qa_chain.invoke({"query": "What is our company's data retention policy?"})
print(result["result"])

RAG Pipelines in Enterprise Environments

Key considerations when adopting RAG in enterprise settings:

Chunking Strategy: Splitting documents into appropriate sizes (fixed-size vs. semantic)
Embedding Model Selection: Multilingual support, domain specialization
Hybrid Search: Combining vector search + keyword search (BM25)
Re-ranking: Reordering search results with LLM or Cross-Encoder
Security: Filtering search results based on document access permissions
Evaluation: Monitoring retrieval accuracy (Recall@K) and response quality
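Two of these considerations are simple enough to sketch directly: fixed-size chunking with overlap, and Recall@K for retrieval evaluation. The function names and the document IDs are illustrative; semantic chunking and production evaluation sets are more involved.

```python
def chunk_text(text, chunk_size, overlap):
    """Fixed-size character chunks where consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of the relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

pieces = chunk_text("abcdefghij", chunk_size=4, overlap=1)
# pieces == ["abcd", "defg", "ghij"]

retrieved = ["d3", "d1", "d7", "d2"]   # hypothetical ranked search results
relevant = {"d1", "d2"}                # hypothetical ground-truth labels
r3 = recall_at_k(retrieved, relevant, k=3)   # 0.5
r4 = recall_at_k(retrieved, relevant, k=4)   # 1.0
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.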

8. LLM Limitations and Considerations

Hallucination

Hallucination is the phenomenon where LLMs confidently generate information that is not factual.

Types of hallucination:

Factual errors: Citing non-existent papers or statistics
Logical errors: Plausible but incorrect reasoning
Consistency errors: Contradictory answers within the same conversation

Mitigation strategies:

RAG-based responses grounded in external knowledge
Requesting citations
Lowering the temperature parameter
Building response verification pipelines
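The effect of the temperature parameter can be shown numerically: logits are divided by the temperature before the softmax, so a lower temperature concentrates probability on the top token. The logits here are made up, and lower temperature only reduces sampling randomness; it does not by itself guarantee factual output.

```python
import math

def softmax_with_temperature(logits, temperature):
    """softmax(logits / T); T < 1 sharpens the distribution, T > 1 flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]   # made-up next-token scores
p_low = softmax_with_temperature(logits, 0.5)
p_default = softmax_with_temperature(logits, 1.0)
p_high = softmax_with_temperature(logits, 2.0)
# The top token's probability is highest at T = 0.5 and lowest at T = 2.0
```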

Bias

LLMs can reflect biases present in their training data.

Social bias: Gender, racial, and cultural stereotypes
Language bias: Performance degradation for non-English languages due to English-centric training
Temporal bias: Information skew based on training data timeframe

Cost and Latency

Operating LLMs requires significant cost and infrastructure.

Factor	Considerations
API Cost	Per-token billing for input/output (varies by model)
Self-hosting	GPU server costs (A100/H100, etc.)
Latency	Token generation speed (TTFT, TPS)
Concurrency	Batch processing, request queue management
Optimization	Quantization, KV cache, speculative decoding

Note: TTFT (Time to First Token) is the time until the first token is generated, and TPS (Tokens Per Second) is the number of tokens generated per second.
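Both metrics, plus a per-request cost estimate, can be computed from simple measurements. In this sketch TPS is taken as generated tokens divided by total request time, which is one of several conventions in use, and the per-million-token prices are hypothetical placeholders rather than any provider's actual rates.

```python
def latency_metrics(request_time, token_times):
    """TTFT and TPS from the request timestamp and each token's arrival time (seconds)."""
    ttft = token_times[0] - request_time
    total = token_times[-1] - request_time
    tps = len(token_times) / total
    return ttft, tps

def estimate_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Per-token billing: input and output tokens are priced separately."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000

# Six tokens arriving after a request sent at t = 0.0 (made-up timings)
ttft, tps = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
# ttft == 0.5 seconds, tps == 6.0 tokens per second

# Hypothetical prices: 3 USD per million input tokens, 15 USD per million output tokens
cost = estimate_cost(1_000, 500, price_in_per_million=3.0, price_out_per_million=15.0)
# about 0.0105 USD for this request
```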

9. Enterprise LLM Use Cases

Document Summarization and Search

Meeting notes auto-summarization: Extracting key points and action items from transcripts
Technical documentation search: RAG-based search across internal wikis, Confluence, etc.
Report generation: Converting data analysis results into natural language reports

Code Generation and Review

Code auto-completion: IDE-integrated coding assistants (Claude Code, GitHub Copilot, etc.)
Code review: Automated PR-based code review and improvement suggestions
Legacy code migration: Converting old code to modern frameworks
Test code generation: Auto-generating unit tests and integration tests

Data Pipeline Automation

SQL generation: Converting natural language queries to SQL (Text-to-SQL)
ETL pipeline design: Data flow design and code generation
Anomaly detection: Detecting system anomalies through log analysis
Data quality validation: Schema change detection, data integrity verification automation

References

Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS
Brown, T. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS
Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS
Hu, E. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR
Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS
Wei, J. et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS
Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv

— Data Dynamics Engineering Team