LLM (Large Language Model) Complete Guide - From Concepts to Applications
A comprehensive overview of LLM core concepts, Transformer architecture, training process, prompt engineering, and RAG.
A large language model (LLM) is a large-scale AI model that learns from vast amounts of text data to understand and generate human language. This post gives a structured overview of LLM concepts, from fundamentals to practical applications.
1. What is an LLM?
Definition and Background
An LLM (Large Language Model) is a deep learning model with billions to trillions of parameters, trained on massive text corpora to understand and generate natural language.
Here is a timeline of key milestones in LLM development:
| Year | Technology | Significance |
|---|---|---|
| 2017 | Transformer paper | "Attention Is All You Need" — new parallelizable architecture |
| 2018 | GPT-1 / BERT | Established the pre-training + fine-tuning paradigm |
| 2020 | GPT-3 (175B params) | Demonstrated few-shot learning capabilities |
| 2022 | ChatGPT | Popularized conversational AI with RLHF |
| 2023–2024 | GPT-4, Claude, Gemini, LLaMA | Multimodal, long context, open-source competition |
| 2025–2026 | Claude 4, GPT-5, etc. | Agent-based autonomous work, 1M+ token context |
Differences from Traditional NLP
The fundamental difference between traditional NLP and LLMs lies in the task-specific vs. general-purpose model paradigm.
| Aspect | Traditional NLP | LLM |
|---|---|---|
| Training | Task-specific individual training | Large-scale pre-training → fine-tuning |
| Data | Small labeled datasets | Internet-scale unstructured text |
| Model Size | Millions of parameters | Billions to trillions of parameters |
| Versatility | Single task (sentiment, NER, etc.) | Translation, summarization, coding, reasoning |
| Zero-shot | Cannot perform untrained tasks | Can perform untrained tasks via prompts |
2. Core Technology: Transformer Architecture
Attention Mechanism
Attention is a mechanism that computes how each token in a sequence relates to every other token. It overcomes the sequential processing limitations of RNN/LSTM and effectively captures relationships between distant words in long sentences.
The core attention formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
- Q (Query): What the current token is asking about other tokens
- K (Key): Feature information each token carries
- V (Value): The actual values to be passed
- √d_k: Scaling factor (square root of the key dimension d_k), which keeps the dot products from growing so large that the softmax saturates
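The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the shapes (4 tokens, d_k = 8) and the random inputs are assumptions chosen only to make the example runnable, and masking and batching are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys) similarity scores
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights

# Illustrative random inputs: 4 tokens, dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)
print(weights.sum(axis=-1))
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the rows of V.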
Self-Attention and Multi-Head Attention
Self-Attention computes relationships between each token and all other tokens within the same sequence. For example, in "The cat sat on the mat. It looked comfortable," it can determine that "It" refers to "the cat."
Multi-Head Attention performs Self-Attention in parallel across multiple "heads," learning diverse perspectives on token relationships.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) × W_O
Each head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
For instance, with 8 heads, each head has its own projections and can therefore learn a different pattern, such as grammatical, semantic, or positional relationships.
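A rough NumPy sketch of the multi-head formula follows. It assumes d_model = 64 and 8 heads, and it uses one full projection matrix per Q, K, and V that is then split into heads, which is equivalent to giving each head its own slice W_i^Q, W_i^K, W_i^V. The random weights are placeholders, not trained values.

```python
import numpy as np

def softmax(x, axis=-1):
    exp = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores) @ V                          # (heads, seq, d_head)
    # Concatenate the heads back into (seq_len, d_model), then apply W_O.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 64, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

mha_output = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(mha_output.shape)
```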
Encoder vs Decoder Architecture
The Transformer architecture is used in three variations depending on the purpose:
| Architecture | Representative Models | Characteristics | Primary Use |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa | Bidirectional context understanding | Classification, NER, sentiment analysis |
| Decoder-only | GPT, Claude, LLaMA | Left-to-right autoregressive generation | Text generation, chat, coding |
| Encoder-Decoder | T5, BART | Input understanding + output generation | Translation, summarization |
Most current LLMs (GPT, Claude, LLaMA, Gemini, etc.) adopt the Decoder-only architecture. This structure generates text in an autoregressive manner by predicting the next token.
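Autoregressive generation can be illustrated with a toy loop. The "model" below is a hypothetical bigram score table that looks only at the last token, whereas a real decoder attends to the whole prefix. The vocabulary and the scores are invented for the example, and greedy decoding (always taking the top-scoring token) is just one of several decoding strategies.

```python
import numpy as np

VOCAB = ["<bos>", "the", "cat", "sat", "<eos>"]

# Hypothetical scores: row = current token, column = candidate next token.
LOGITS = np.array([
    [0.0, 5.0, 1.0, 0.0, 0.0],  # after <bos>
    [0.0, 0.0, 5.0, 1.0, 0.0],  # after "the"
    [0.0, 0.0, 0.0, 5.0, 1.0],  # after "cat"
    [0.0, 1.0, 0.0, 0.0, 5.0],  # after "sat"
    [0.0, 0.0, 0.0, 0.0, 5.0],  # after <eos>
])

def toy_next_token_logits(token_ids):
    # Stand-in for a decoder forward pass; only the last token is used here.
    return LOGITS[token_ids[-1]]

def greedy_generate(max_new_tokens=10):
    ids = [0]  # start from <bos>
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(toy_next_token_logits(ids)))
        ids.append(next_id)  # the prediction becomes part of the next input
        if VOCAB[next_id] == "<eos>":
            break
    return [VOCAB[i] for i in ids]

generated = greedy_generate()
print(generated)
```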
3. LLM Training Process
LLM training is broadly divided into three stages.
Pre-training
Pre-training is the first and most expensive stage. The model learns language structure and knowledge from massive text data including internet content, books, and papers.
Objective: Next Token Prediction
Input: "The weather today is really"
Prediction: "nice" (prob 0.35), "hot" (prob 0.20), "cold" (prob 0.15), ...
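The objective can be made concrete with a small calculation. In this sketch the four-word vocabulary and the logit values are made up. The point is only that the model's scores are turned into probabilities with a softmax, and the training loss is the negative log-probability of the token that actually came next.

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

# Hypothetical vocabulary and model scores for the next position.
vocab = ["nice", "hot", "cold", "bad"]
logits = np.array([2.0, 1.4, 1.1, 0.2])

probs = softmax(logits)
target = vocab.index("nice")    # the token that really followed in the training text
loss = -np.log(probs[target])   # cross-entropy for this single position

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.2f}")
print(f"loss: {loss:.3f}")
```

Pre-training averages this loss over every position in the corpus and adjusts the parameters to reduce it.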
Key characteristics:
- Data Scale: Multiple terabytes of text data
- Compute Cost: Thousands to tens of thousands of GPUs for weeks to months
- Unsupervised: Learns patterns from text without labels
- Cost: Millions to tens of millions of dollars for large models
Fine-tuning
Additional training to adapt a pre-trained model for specific purposes.
| Method | Description | Advantage |
|---|---|---|
| Full Fine-tuning | Updates all parameters | Best possible performance |
| LoRA (Low-Rank Adaptation) | Trains only low-rank matrices | Less memory, faster training |
| QLoRA | Quantization + LoRA | Feasible on consumer GPUs |
| Instruction Tuning | Trains on instruction-response pairs | Improves instruction following |
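To see why LoRA needs so little memory, it can help to write the low-rank update out by hand. The sketch below is not the PEFT implementation. It assumes a single 512x512 weight matrix with r = 16 and alpha = 32, and it follows the initialization described in the LoRA paper (B starts at zero, so the adapted layer initially behaves exactly like the frozen one).

```python
import numpy as np

d_out, d_in, r, alpha = 512, 512, 16, 32
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))     # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init

def lora_forward(x):
    # Frozen path plus scaled low-rank path: W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
full_params = W.size
lora_params = A.size + B.size
ratio = lora_params / full_params
print(full_params, lora_params, ratio)
```

For this one matrix the adapter holds 16,384 trainable values against 262,144 frozen ones. Across a full model the share is much smaller, because most weight matrices receive no adapter at all.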
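# Fine-tuning with LoRA example (Hugging Face PEFT)
from peft import LoraConfig, get_peft_model
# base_model is assumed to be a causal language model loaded earlier
# (for example with transformers.AutoModelForCausalLM.from_pretrained).
lora_config = LoraConfig(
    r=16,                                 # Low-rank dimension
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Target layers (names depend on the model)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Only a small fraction of the parameters is trained (roughly 0.1% for a 7B model with these settings)
RLHF (Reinforcement Learning from Human Feedback)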
RLHF is a technique that improves LLM response quality to align with human preferences.
RLHF 3-step process:
- SFT (Supervised Fine-Tuning): Supervised learning with high-quality conversation data
- Reward Model Training: Building a reward model that learns human response preferences
- PPO (Proximal Policy Optimization): Reinforcement learning based on the reward model
User question → LLM generates multiple responses → Humans rank them
→ Reward model learns scores → RL improves LLM policy
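The reward-model step is commonly trained with a pairwise preference loss, as described in Ouyang et al. (2022). The sketch below shows only that loss for one comparison. The reward values are made-up scalars, and a real reward model would produce them from a prompt and a response.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    # -log(sigmoid(r_chosen - r_rejected)), rewritten as log(1 + exp(-diff)).
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))

# Hypothetical reward scores for the response humans preferred vs. the one they rejected.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)
loss_tie = pairwise_preference_loss(1.0, 1.0)
print(loss_correct_order, loss_wrong_order, loss_tie)
```

The loss is small when the preferred response already scores higher and large when the ranking is reversed, which pushes the reward model toward the human ordering.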
Note: Beyond RLHF, recent research explores various alignment techniques such as DPO (Direct Preference Optimization) and RLAIF (Reinforcement Learning from AI Feedback).
4. Tokenization and Embedding
Tokenization Methods
Tokenization is the process of splitting text into the smallest units (tokens) that the model can process.
| Method | Description | Used By |
|---|---|---|
| BPE (Byte-Pair Encoding) | Iteratively merges frequent character pairs | GPT series |
| SentencePiece | Language-independent subword tokenization | LLaMA, T5 |
| WordPiece | BPE variant, likelihood-based merging | BERT |
| Tiktoken | BPE-based, OpenAI optimized implementation | GPT-3.5/4 |
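The "iteratively merges frequent character pairs" idea behind BPE can be sketched on a toy corpus. This is a simplified, character-level illustration. The word frequencies are invented, and production tokenizers such as those used by GPT models work on bytes and add pre-tokenization rules that are omitted here.

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs, weighted by how often each word occurs.
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):
    counts = get_pair_counts(words)
    best = max(counts, key=counts.get)  # most frequent pair (first seen wins ties)
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(list(words))
```

Frequent fragments such as "est" end up as single tokens after a few merges, while rare words stay split into smaller pieces.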
Tokenization example:
# Tokenization example using tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
text = "Large Language Models are transforming AI"
tokens = enc.encode(text)
# [27050, 11688, 27972, 527, 46890, 15592] → 6 tokens
# Korean tokenization
text_ko = "대규모 언어 모델은 AI를 변화시키고 있다"
tokens_ko = enc.encode(text_ko)
# Korean requires approximately 2-3x more tokens than English
Note: Korean uses more tokens than English to express the same meaning, which affects API costs and context utilization efficiency.
Vector Embedding Concepts
Embedding is the process of mapping tokens into a high-dimensional vector space. Semantically similar words are placed close together in the vector space.
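"Close together" is usually measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors purely for illustration. Real embeddings have hundreds or thousands of dimensions and are learned during training, not chosen by hand.

```python
import numpy as np

# Hypothetical toy embeddings; the values are invented for the example.
embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 means same direction, 0.0 means orthogonal.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.2f}")
print(f"cat vs car: {sim_cat_car:.2f}")
```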
"king" - "man" + "woman" ≈ "queen"
"Seoul" - "Korea" + "Japan" ≈ "Tokyo"
Roles of embedding in LLMs:
- Input Embedding: Token → vector conversion (model input)
- Positional Embedding: Encoding token order information
- Output Embedding: Vector → token probability conversion (model output)
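One concrete way to encode token order is the sinusoidal scheme from the original Transformer paper, sketched below. Many current LLMs use learned or rotary position embeddings instead, so this is an illustration of the idea and not a description of any particular model. The sequence length and dimension are arbitrary.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=6, d_model=8)
print(pe.shape)
print(np.round(pe[:2], 3))
```

The resulting matrix is added to the input embeddings, so the same token receives a different vector at each position.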
Context Window
The context window is the maximum number of tokens a model can process at once.
| Model | Context Window | Notes |
|---|---|---|
| GPT-3 | 2,048 tokens | ~1,500 words |
| GPT-4 Turbo | 128K tokens | ~96,000 words |
| Claude 3.5 | 200K tokens | ~150,000 words |
| Claude Opus 4 | 1M tokens | ~750,000 words |
| Gemini 1.5 Pro | 1M tokens | ~750,000 words |
Larger context windows allow longer documents to be processed in one pass, but with standard self-attention the compute and memory for the attention scores grow quadratically with the number of tokens.
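The quadratic growth is easy to check with arithmetic. The sketch below counts only the entries of one attention score matrix (one head, one layer) for standard self-attention. It ignores optimizations that avoid materializing the full matrix and is meant only to show the scaling.

```python
def attention_score_entries(seq_len):
    # One score per (query token, key token) pair.
    return seq_len * seq_len

for n in (4_096, 128_000, 1_000_000):
    print(f"{n:>9} tokens -> {attention_score_entries(n):.2e} score entries")

# Doubling the sequence length quadruples the number of entries.
ratio = attention_score_entries(8_192) / attention_score_entries(4_096)
print(ratio)
```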
5. Major LLM Model Comparison
GPT Series (OpenAI)
OpenAI's GPT (Generative Pre-trained Transformer) series led the popularization of LLMs.
| Model | Parameters | Key Features |
|---|---|---|
| GPT-3 | 175B | Demonstrated few-shot learning potential |
| GPT-3.5 | Undisclosed | ChatGPT foundation, RLHF applied |
| GPT-4 | Undisclosed (unofficial estimates of ~1.7T, MoE) | Multimodal, enhanced reasoning |
| GPT-4o | Undisclosed | Voice/image integration, fast inference |
Claude (Anthropic)
Anthropic's Claude emphasizes the balance between safety and helpfulness.
- Constitutional AI: Training the model against a written set of principles (a "constitution"), using AI-generated feedback to improve safety
- Long Context: Up to 1M token support
- Coding Ability: Strengths in code generation, analysis, and debugging
- Agent Use: Autonomous task execution through tools like Claude Code
Open-Source Models
| Model | Developer | Parameters | Features |
|---|---|---|---|
| LLaMA 3 | Meta | 8B / 70B / 405B | Open source, commercially usable |
| Mistral / Mixtral | Mistral AI | 7B / 8x7B (MoE) | High performance for size |
| Gemma | Google | 2B / 7B / 27B | Lightweight, various sizes |
| Qwen 2.5 | Alibaba | 0.5B–72B | Multilingual, code, math strengths |
Note: MoE (Mixture of Experts) is an architecture that routes each token to only a subset of expert networks, so only part of the total parameters is used per token. For example, Mixtral 8x7B has 46.7B total parameters but uses only ~12.9B per token during inference.
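The routing idea can be sketched as follows. This is a simplified illustration, assuming 8 experts that are each a single random linear layer and a top-2 gate. Real MoE layers use feed-forward blocks as experts and add load-balancing terms that are not shown.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def moe_forward(x, gate_W, expert_Ws, top_k=2):
    gate_logits = gate_W @ x                # one score per expert
    top = np.argsort(gate_logits)[-top_k:]  # indices of the k highest-scoring experts
    weights = softmax(gate_logits[top])     # normalize over the selected experts only
    # Only the selected experts are evaluated; the rest are skipped entirely.
    out = sum(w * (expert_Ws[i] @ x) for w, i in zip(weights, top))
    return out, sorted(top.tolist())

rng = np.random.default_rng(0)
d_model, num_experts = 16, 8
gate_W = rng.normal(size=(num_experts, d_model))
expert_Ws = rng.normal(size=(num_experts, d_model, d_model))
x = rng.normal(size=d_model)

moe_output, used_experts = moe_forward(x, gate_W, expert_Ws, top_k=2)
print(used_experts)
print(moe_output.shape)
```

With top-2 routing over 8 experts, each token touches a quarter of the expert parameters, which is why the active parameter count is far below the total.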
6. Prompt Engineering
Prompt Structure and Principles
Prompt engineering is the technique of optimizing inputs (prompts) sent to LLMs to achieve desired results.
Components of an effective prompt:
- Role: Define the role for the model
- Context: Background information and constraints
- Instruction: Clear description of the task
- Examples: Desired input/output formats
- Output Format: Response structure (JSON, table, list, etc.)
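The five components can be assembled with a small helper. The function name, the section labels, and the sample values below are hypothetical. The layout is one reasonable convention, not a required format.

```python
def build_prompt(role, context, instruction, examples, output_format):
    # Render each example as an input/output pair on consecutive lines.
    example_lines = "\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples)
    sections = [
        f"Role: {role}",
        f"Context: {context}",
        f"Instruction: {instruction}",
        f"Examples:\n{example_lines}",
        f"Output format: {output_format}",
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    role="You are a customer-support analyst.",
    context="Reviews come from an online electronics store.",
    instruction="Classify the sentiment of the review.",
    examples=[("Fast delivery, very happy", "Positive"),
              ("Screen keeps turning off", "Negative")],
    output_format='JSON with a single key "sentiment"',
)
print(prompt)
```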
Zero-shot / Few-shot / Chain-of-Thought
Zero-shot: Performing tasks with instructions only, no examples
Analyze the sentiment of the following sentence: "This product is absolutely amazing!"
→ Positive
Few-shot: Performing tasks with a few examples
Sentence: "Fast delivery, very happy" → Positive
Sentence: "Screen keeps turning off" → Negative
Sentence: "Price is ok but quality is poor" → ?
→ Negative
Chain-of-Thought (CoT): Inducing step-by-step reasoning
Q: A store had 23 apples. They sold 8 and received 12 new ones.
How many apples are there now?
A: Let's solve step by step.
1. Initial apples: 23
2. After selling: 23 - 8 = 15
3. After restocking: 15 + 12 = 27
Therefore, there are currently 27 apples.
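The three styles differ only in how the prompt text is built, which the following sketch makes explicit. The helper names are hypothetical, and the strings mirror the examples above.

```python
def zero_shot(sentence):
    # Instruction only, no examples.
    return f'Analyze the sentiment of the following sentence: "{sentence}"'

def few_shot(examples, sentence):
    # Labeled examples first, then the unlabeled sentence for the model to complete.
    lines = [f'Sentence: "{text}" → {label}' for text, label in examples]
    lines.append(f'Sentence: "{sentence}" →')
    return "\n".join(lines)

def chain_of_thought(question):
    # The trailing cue invites the model to write out intermediate steps.
    return f"Q: {question}\nA: Let's solve step by step."

examples = [("Fast delivery, very happy", "Positive"),
            ("Screen keeps turning off", "Negative")]

zero_shot_prompt = zero_shot("This product is absolutely amazing!")
few_shot_prompt = few_shot(examples, "Price is ok but quality is poor")
cot_prompt = chain_of_thought(
    "A store had 23 apples. They sold 8 and received 12 new ones. "
    "How many apples are there now?"
)
print(few_shot_prompt)
print(cot_prompt)
```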
System Prompts and Role Assignment
System prompts define the overall behavior of the model and are set at the beginning of a conversation.
{
"messages": [
{
"role": "system",
"content": "You are a data engineer with 10 years of experience. Provide accurate, practical answers to technical questions. Include code examples."
},
{
"role": "user",
"content": "How do you resolve data skew issues in Spark?"
}
]
}
Note: System prompt support varies by model. Claude uses a system parameter, while GPT uses a system role message.
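The two conventions mentioned in the note can be compared by building the request bodies side by side. The sketch below only constructs dictionaries and does not call any API. "MODEL_NAME" is a placeholder, and the exact set of required fields should be checked against each provider's documentation.

```python
import json

SYSTEM_PROMPT = "You are a data engineer with 10 years of experience."
USER_QUESTION = "How do you resolve data skew issues in Spark?"

def system_as_role_message(model):
    # Convention where the system prompt is the first entry in "messages".
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_QUESTION},
        ],
    }

def system_as_parameter(model, max_tokens=1024):
    # Convention where the system prompt is a top-level field next to "messages".
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": USER_QUESTION}],
    }

role_style = system_as_role_message("MODEL_NAME")    # placeholder model name
param_style = system_as_parameter("MODEL_NAME")      # placeholder model name
print(json.dumps(param_style, indent=2))
```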
7. RAG (Retrieval-Augmented Generation)
Why RAG and How It Works
RAG is a technique that retrieves relevant information from external knowledge bases to enhance LLM responses.
Limitations of standalone LLMs:
- Cannot know information after training data cutoff
- Lacks enterprise internal data or domain knowledge
- Risk of hallucination (generating false information)
RAG Pipeline Architecture:
[User Query]
↓
[1. Query Embedding] → Convert query to vector
↓
[2. Vector Search] → Search similar documents in vector DB
↓
[3. Context Construction] → Combine retrieved documents + original query
↓
[4. LLM Generation] → Generate response based on context
↓
[Response Output]
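Steps 1 to 3 of the pipeline can be sketched without any external service. In this toy version a bag-of-words count vector stands in for a real embedding model, a Python list stands in for the vector DB, and the three documents are invented. Step 4 would send the resulting prompt to an LLM and is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base.
DOCS = [
    "Customer data is retained for five years after account closure.",
    "Employees may work remotely up to three days per week.",
    "Production deployments require approval from two reviewers.",
]

def embed(text):
    # Stand-in for a real embedding model: word counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    # Steps 1 and 2: embed the query, then rank documents by similarity.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved):
    # Step 3: combine retrieved documents with the original query.
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

query = "How long is customer data retained?"
top_docs = retrieve(query, DOCS, k=1)
rag_prompt = build_prompt(query, top_docs)
print(rag_prompt)
```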
Vector Database Integration
Key vector database options that play a central role in RAG:
| Vector DB | Features | Best For |
|---|---|---|
| Chroma | Lightweight, embedded, Python-friendly | Prototypes, small scale |
| Pinecone | Fully managed SaaS | Minimal ops overhead |
| Weaviate | Hybrid search support | Keyword + vector search |
| Milvus | Large-scale distributed processing | Enterprise, high volume |
| pgvector | PostgreSQL extension | Existing PostgreSQL users |
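# Simple RAG implementation with LangChain + Chroma
# Import paths below follow the split packages (langchain-community, langchain-openai,
# langchain-text-splitters); locations vary by LangChain version, and RetrievalQA is a legacy chain.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA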
# 1. Load and split documents
loader = TextLoader("company_docs.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# 2. Store in vector DB
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# 3. Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
llm=ChatOpenAI(model="gpt-4"),
retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
return_source_documents=True
)
# 4. Query
result = qa_chain.invoke({"query": "What is our company's data retention policy?"})
print(result["result"])
RAG Pipelines in Enterprise Environments
Key considerations when adopting RAG in enterprise settings:
- Chunking Strategy: Splitting documents into appropriate sizes (fixed-size vs. semantic)
- Embedding Model Selection: Multilingual support, domain specialization
- Hybrid Search: Combining vector search + keyword search (BM25)
- Re-ranking: Reordering search results with LLM or Cross-Encoder
- Security: Filtering search results based on document access permissions
- Evaluation: Monitoring retrieval accuracy (Recall@K) and response quality
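For the evaluation item, Recall@K can be computed with a few lines. The document IDs below are hypothetical, and the relevant set would normally come from a labeled evaluation dataset.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the relevant documents that appear in the top-k results.
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical retrieval result (ranked) and ground-truth relevant documents.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

recall_at_3 = recall_at_k(retrieved, relevant, k=3)
recall_at_4 = recall_at_k(retrieved, relevant, k=4)
print(recall_at_3, recall_at_4)
```

Tracking this metric over a fixed query set makes it possible to compare chunking strategies, embedding models, and re-ranking settings on equal terms.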
8. LLM Limitations and Considerations
Hallucination
Hallucination is the phenomenon where LLMs confidently generate information that is not factual.
Types of hallucination:
- Factual errors: Citing non-existent papers or statistics
- Logical errors: Plausible but incorrect reasoning
- Consistency errors: Contradictory answers within the same conversation
Mitigation strategies:
- RAG-based responses grounded in external knowledge
- Requesting citations
- Lowering the temperature parameter
- Building response verification pipelines
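The effect of the temperature parameter can be seen directly on a softmax. The logits below are made up. Lower temperature concentrates probability on the top-scoring token, which reduces sampling randomness but does not by itself guarantee factual output.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Dividing by a small temperature sharpens the distribution; a large one flattens it.
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.2)
flat = softmax_with_temperature(logits, temperature=1.5)
print(np.round(sharp, 3))
print(np.round(flat, 3))
```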
Bias
LLMs can reflect biases present in their training data.
- Social bias: Gender, racial, and cultural stereotypes
- Language bias: Performance degradation for non-English languages due to English-centric training
- Temporal bias: Information skew based on training data timeframe
Cost and Latency
Operating LLMs requires significant cost and infrastructure.
| Factor | Considerations |
|---|---|
| API Cost | Per-token billing for input/output (varies by model) |
| Self-hosting | GPU server costs (A100/H100, etc.) |
| Latency | Token generation speed (TTFT, TPS) |
| Concurrency | Batch processing, request queue management |
| Optimization | Quantization, KV cache, speculative decoding |
Note: TTFT (Time to First Token) is the time until the first token is generated, and TPS (Tokens Per Second) is the number of tokens generated per second.
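Both latency metrics and per-token billing reduce to simple arithmetic. In the sketch below the timestamps and prices are invented for illustration. TPS is computed over the tokens generated after the first one, which is one common convention among several.

```python
def latency_metrics(request_time, token_times):
    # token_times: arrival time (in seconds) of each generated token.
    ttft = token_times[0] - request_time
    generation_time = token_times[-1] - token_times[0]
    # Tokens after the first one, divided by the time spent producing them.
    tps = (len(token_times) - 1) / generation_time if generation_time > 0 else float("nan")
    return ttft, tps

def api_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    # Per-token billing with separate prices for input and output tokens.
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000

# Hypothetical measurements and prices (USD per million tokens).
ttft, tps = latency_metrics(request_time=0.0, token_times=[0.5, 1.0, 1.5, 2.0, 2.5])
cost = api_cost(input_tokens=2_000, output_tokens=500,
                input_price_per_million=3.0, output_price_per_million=15.0)
print(ttft, tps, cost)
```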
9. Enterprise LLM Use Cases
Document Summarization and Search
- Meeting notes auto-summarization: Extracting key points and action items from transcripts
- Technical documentation search: RAG-based search across internal wikis, Confluence, etc.
- Report generation: Converting data analysis results into natural language reports
Code Generation and Review
- Code auto-completion: IDE-integrated coding assistants (Claude Code, GitHub Copilot, etc.)
- Code review: Automated PR-based code review and improvement suggestions
- Legacy code migration: Converting old code to modern frameworks
- Test code generation: Auto-generating unit tests and integration tests
Data Pipeline Automation
- SQL generation: Converting natural language queries to SQL (Text-to-SQL)
- ETL pipeline design: Data flow design and code generation
- Anomaly detection: Detecting system anomalies through log analysis
- Data quality validation: Schema change detection, data integrity verification automation
References
- Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS
- Brown, T. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS
- Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS
- Hu, E. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR
- Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS
- Wei, J. et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS
- Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv
— Data Dynamics Engineering Team