
LLM (Large Language Model) Complete Guide - From Concepts to Applications

A comprehensive overview of LLM core concepts, Transformer architecture, training process, prompt engineering, and RAG.

Data Dynamics · April 15, 2026 · 13 min read

An LLM (Large Language Model) is a large-scale AI model that learns from vast amounts of text data to understand and generate human language. This post gives a structured overview of LLMs, from fundamentals to practical applications.


1. What is an LLM?

Definition and Background

An LLM (Large Language Model) is a deep learning model with billions to trillions of parameters, trained on massive text corpora to understand and generate natural language.

Here is a timeline of key milestones in LLM development:

| Year | Technology | Significance |
|---|---|---|
| 2017 | Transformer paper | "Attention Is All You Need": a new parallelizable architecture |
| 2018 | GPT-1 / BERT | Established the pre-training + fine-tuning paradigm |
| 2020 | GPT-3 (175B params) | Demonstrated few-shot learning capabilities |
| 2022 | ChatGPT | Popularized conversational AI with RLHF |
| 2023–2024 | GPT-4, Claude, Gemini, LLaMA | Multimodal, long context, open-source competition |
| 2025–2026 | Claude 4, GPT-5, etc. | Agent-based autonomous work, 1M+ token context |

Differences from Traditional NLP

The fundamental difference between traditional NLP and LLMs is the shift from one model per task to a single general-purpose model.

| Aspect | Traditional NLP | LLM |
|---|---|---|
| Training | Task-specific individual training | Large-scale pre-training → fine-tuning |
| Data | Small labeled datasets | Internet-scale unstructured text |
| Model Size | Millions of parameters | Billions to trillions of parameters |
| Versatility | Single task (sentiment, NER, etc.) | Translation, summarization, coding, reasoning |
| Zero-shot | Cannot perform untrained tasks | Can perform untrained tasks via prompts |

2. Core Technology: Transformer Architecture

Attention Mechanism

Attention is a mechanism that computes how each token in a sequence relates to every other token. It overcomes the sequential processing limitations of RNN/LSTM and effectively captures relationships between distant words in long sentences.

The core attention formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) × V
  • Q (Query): What the current token is asking about other tokens
  • K (Key): Feature information each token carries
  • V (Value): The actual values to be passed
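  • √d_k: Scaling factor, the square root of the key dimension d_k; it keeps the dot products from growing with the dimension and saturating the softmax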
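
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the matrix sizes and random inputs are arbitrary assumptions, and masking and batching are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row maximum avoids overflow and does not change the result.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy input: 4 tokens, d_k = 8 (arbitrary sizes chosen for illustration)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)
print(weights.sum(axis=-1))
```

Each row of `weights` says how strongly one token attends to every other token; the output is the corresponding weighted average of the value vectors.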

Self-Attention and Multi-Head Attention

Self-Attention computes relationships between each token and all other tokens within the same sequence. For example, in "The cat sat on the mat. It looked comfortable," it can determine that "It" refers to "the cat."

Multi-Head Attention performs Self-Attention in parallel across multiple "heads," learning diverse perspectives on token relationships.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) × W_O

Each head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

For instance, with 8 heads, each head has its own projection matrices and can learn a different pattern, such as grammatical, semantic, or positional relationships.
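
As a rough sketch of the two formulas above, the following NumPy code runs 8 heads over a toy sequence. It assumes self-attention (Q, K, and V all come from the same input X) and stores the per-head matrices W_i^Q, W_i^K, W_i^V as column slices of one larger matrix each, which is a common implementation shortcut rather than something the formulas require. The sizes and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (num_heads, n, n)
    heads = softmax(scores) @ V                             # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # Concat(head_1, ..., head_h)
    return concat @ W_o

# Toy sizes (assumptions): 5 tokens, model width 16, 8 heads of width 2
rng = np.random.default_rng(0)
n_tokens, d_model, num_heads = 5, 16, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)
```

Every head computes its own (n, n) attention pattern over a narrower slice of the representation, so the total cost is similar to one full-width head.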

Encoder vs Decoder Architecture

The Transformer architecture is used in three variations depending on the purpose:

| Architecture | Representative Models | Characteristics | Primary Use |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa | Bidirectional context understanding | Classification, NER, sentiment analysis |
| Decoder-only | GPT, Claude, LLaMA | Left-to-right autoregressive generation | Text generation, chat, coding |
| Encoder-Decoder | T5, BART | Input understanding + output generation | Translation, summarization |

Most current LLMs (GPT, Claude, LLaMA, Gemini, etc.) adopt the Decoder-only architecture. This structure generates text in an autoregressive manner by predicting the next token.


3. LLM Training Process

LLM training is broadly divided into three stages.

Pre-training

Pre-training is the first and most expensive stage. The model learns language structure and knowledge from massive text data including internet content, books, and papers.

Objective: Next Token Prediction

Input: "The weather today is really"
Prediction: "nice" (prob 0.35), "hot" (prob 0.20), "cold" (prob 0.15), ...
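
The objective can be illustrated with a toy calculation. The four-word vocabulary and the logit values below are made-up assumptions, not the output of any real model; the point is only how logits become probabilities and how the loss rewards putting probability on the token that actually came next.

```python
import numpy as np

vocab = ["nice", "hot", "cold", "bad"]      # toy vocabulary (assumption)
logits = np.array([2.0, 1.4, 1.1, 0.3])     # hypothetical raw scores from a model

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(logits)
target = vocab.index("nice")                # the token that actually follows in the text
loss = -np.log(probs[target])               # cross-entropy at this position

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.2f}")
print(f"loss = {loss:.3f}")
```

Pre-training minimizes this loss averaged over every position in the corpus, which is why no manual labels are needed: the text itself supplies the target.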

Key characteristics:

  • Data Scale: Multiple terabytes of text data
  • Compute Cost: Thousands to tens of thousands of GPUs for weeks to months
  • Self-supervised: The next token in the text serves as the label, so no manual annotation is needed
  • Cost: Millions to tens of millions of dollars for large models

Fine-tuning

Fine-tuning is additional training that adapts a pre-trained model to a specific purpose.

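| Method | Description | Advantage |
|---|---|---|
| Full Fine-tuning | Updates all parameters | Best possible performance |
| LoRA (Low-Rank Adaptation) | Trains only low-rank matrices | Less memory, faster training |
| QLoRA | Quantization + LoRA | Feasible on consumer GPUs |
| Instruction Tuning | Trains on instruction-response pairs | Improves instruction following |

# Fine-tuning with LoRA example (Hugging Face PEFT)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint name: substitute the causal LM you want to adapt
base_model = AutoModelForCausalLM.from_pretrained("your-base-model")

lora_config = LoraConfig(
    r=16,                      # Low-rank dimension
    lora_alpha=32,             # Scaling factor (update is scaled by lora_alpha / r)
    target_modules=["q_proj", "v_proj"],  # Target layers (names used by LLaMA-style models)
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, lora_config)
# Prints trainable vs. total parameters; typically well under 1% is trained
model.print_trainable_parameters()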
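
To show what "trains only low-rank matrices" means numerically, here is a NumPy sketch of a single LoRA-adapted weight matrix. The 1024×1024 layer size is an arbitrary assumption; r and alpha reuse the values from the config above. This illustrates the idea from Hu et al. and is not PEFT's internal code.

```python
import numpy as np

d_out, d_in, r, alpha = 1024, 1024, 16, 32   # layer size is a toy assumption

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01        # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init so the update starts at zero

def lora_forward(x):
    # h = W x + (alpha / r) * B (A x); only A and B would receive gradients
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
trainable = A.size + B.size
print(f"trainable params in this layer: {trainable:,} vs frozen: {W.size:,}")
print(f"ratio: {trainable / W.size:.2%}")
```

The ratio printed here is for one adapted matrix. Across a whole model it is much smaller, because only the targeted projections get adapters while every other weight stays frozen.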

RLHF (Reinforcement Learning from Human Feedback)

RLHF is a technique that improves LLM response quality to align with human preferences.

RLHF 3-step process:

  1. SFT (Supervised Fine-Tuning): Supervised learning with high-quality conversation data
  2. Reward Model Training: Building a reward model that learns human response preferences
  3. PPO (Proximal Policy Optimization): Reinforcement learning based on the reward model

User question → LLM generates multiple responses → Humans rank them
→ Reward model learns scores → RL improves LLM policy
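
Step 2 is commonly trained with a pairwise ranking loss, as described in Ouyang et al. (2022): for each pair of responses, the reward model should score the human-preferred one higher. The sketch below shows only that loss on hypothetical scalar scores; the reward model itself and the PPO step are out of scope.

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    # -log(sigmoid(r_chosen - r_rejected)), written via logaddexp for numerical stability
    return np.logaddexp(0.0, -(r_chosen - r_rejected))

# Hypothetical reward scores for (preferred, rejected) response pairs
print(pairwise_reward_loss(2.0, 0.0))   # preferred response scored higher: small loss
print(pairwise_reward_loss(1.0, 1.0))   # tie: loss equals log(2)
print(pairwise_reward_loss(0.0, 2.0))   # ranking violated: large loss
```

Minimizing this loss pushes the score gap in the direction of the human ranking, which is what lets the reward model stand in for human judgment during the reinforcement learning step.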

Note: Beyond RLHF, recent research explores various alignment techniques such as DPO (Direct Preference Optimization) and RLAIF (Reinforcement Learning from AI Feedback).


4. Tokenization and Embedding

Tokenization Methods

Tokenization is the process of splitting text into the smallest units (tokens) that the model can process.

| Method | Description | Used By |
|---|---|---|
| BPE (Byte-Pair Encoding) | Iteratively merges frequent character pairs | GPT series |
| SentencePiece | Language-independent subword tokenization | LLaMA, T5 |
| WordPiece | BPE variant, likelihood-based merging | BERT |
| Tiktoken | BPE-based, OpenAI optimized implementation | GPT-3.5/4 |

Tokenization example:

# Tokenization example using tiktoken
import tiktoken
 
enc = tiktoken.encoding_for_model("gpt-4")
text = "Large Language Models are transforming AI"
tokens = enc.encode(text)
# [27050, 11688, 27972, 527, 46890, 15592]  → 6 tokens
 
# Korean tokenization
text_ko = "대규모 언어 모델은 AI를 변화시키고 있다"
tokens_ko = enc.encode(text_ko)
# Korean requires approximately 2-3x more tokens than English

Note: Korean uses more tokens than English to express the same meaning, which affects API costs and context utilization efficiency.
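
To make the BPE merge procedure from the table above concrete, here is a toy trainer that learns three merges from a four-word corpus. The corpus and its word counts are invented for illustration. Production tokenizers such as tiktoken work on bytes and use far larger corpora and merge tables.

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs, weighted by how often each word occurs.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}   # toy word counts (assumption)
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)   # most frequent pair; ties go to the first one seen
    words = merge_pair(words, best)
    merges.append(best)

print(merges)
print(list(words))
```

Frequent character sequences end up as single tokens, while rare words stay split into smaller pieces. Text that is underrepresented in the tokenizer's training data therefore tends to need more tokens.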

Vector Embedding Concepts

Embedding is the process of mapping tokens into a high-dimensional vector space. Semantically similar words are placed close together in the vector space.

"king" - "man" + "woman" ≈ "queen"
"Seoul" - "Korea" + "Japan" ≈ "Tokyo"

Roles of embedding in LLMs:

  • Input Embedding: Token → vector conversion (model input)
  • Positional Embedding: Encoding token order information
  • Output Embedding: Vector → token probability conversion (model output)
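
The analogy arithmetic above can be reproduced with hand-made vectors. The four-dimensional embeddings below are invented so that the axes have an obvious meaning; real embeddings are learned, have hundreds or thousands of dimensions, and only approximately satisfy such relations.

```python
import numpy as np

# Hand-made toy vectors (assumption); axes: royalty, male, female, fruit
emb = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = emb["king"] - emb["man"] + emb["woman"]

# Exclude the words used in the query, as is usual for analogy tests
candidates = {w: v for w, v in emb.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(nearest)
print(round(cosine(emb["king"], emb["queen"]), 2), round(cosine(emb["king"], emb["apple"]), 2))
```

Cosine similarity is the usual measure of "closeness" here, and the same operation powers the vector search step in RAG.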

Context Window

The context window is the maximum number of tokens a model can process at once.

| Model | Context Window | Approx. English Words |
|---|---|---|
| GPT-3 | 2,048 tokens | ~1,500 words |
| GPT-4 Turbo | 128K tokens | ~96,000 words |
| Claude 3.5 | 200K tokens | ~150,000 words |
| Claude Opus 4 | 1M tokens | ~750,000 words |
| Gemini 1.5 Pro | 1M tokens | ~750,000 words |

Larger context windows allow processing longer documents at once, but in standard self-attention every token attends to every other token, so the attention cost grows quadratically with the number of tokens.
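
A quick back-of-the-envelope calculation shows what quadratic growth means. The function below counts attention scores for one head in one layer under standard (dense) self-attention; the token counts are simply example sizes, and real systems use optimizations that are not modeled here.

```python
def attention_score_entries(num_tokens: int) -> int:
    # One score per (query token, key token) pair, for a single head in a single layer.
    return num_tokens * num_tokens

for n in (4_096, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_score_entries(n):,} scores")

# Doubling the context quadruples the number of scores.
print(attention_score_entries(8_192) / attention_score_entries(4_096))
```

This is why long-context support is not just a matter of allowing longer inputs: a 10x longer prompt implies roughly 100x more attention work.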


5. Major LLM Model Comparison

GPT Series (OpenAI)

OpenAI's GPT (Generative Pre-trained Transformer) series led the popularization of LLMs.

| Model | Parameters | Key Features |
|---|---|---|
| GPT-3 | 175B | Demonstrated few-shot learning potential |
| GPT-3.5 | Undisclosed | ChatGPT foundation, RLHF applied |
| GPT-4 | Undisclosed (est. 1.7T MoE) | Multimodal, enhanced reasoning |
| GPT-4o | Undisclosed | Voice/image integration, fast inference |

Claude (Anthropic)

Anthropic's Claude emphasizes the balance between safety and helpfulness.

  • Constitutional AI: Trains the model to critique and revise its own outputs against a written set of principles
  • Long Context: Up to 1M token support
  • Coding Ability: Strengths in code generation, analysis, and debugging
  • Agent Use: Autonomous task execution through tools like Claude Code

Open-Source Models

| Model | Developer | Parameters | Features |
|---|---|---|---|
| Llama 3 / 3.1 | Meta | 8B / 70B / 405B | Open weights, commercial use permitted under Meta's license |
| Mistral / Mixtral | Mistral AI | 7B / 8x7B (MoE) | High performance for size |
| Gemma | Google | 2B / 7B / 27B | Lightweight, various sizes |
| Qwen 2.5 | Alibaba | 0.5B–72B | Multilingual, code, math strengths |

Note: MoE (Mixture of Experts) is an architecture in which a router activates only a few expert networks per token, so inference uses a fraction of the total parameters. For example, Mixtral 8x7B has 46.7B total parameters but uses only ~12.9B per token during inference.
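
The routing idea can be sketched as follows, using top-2 selection over 8 experts as in Mixtral 8x7B. Everything else is a simplifying assumption: each expert is a single random linear map instead of a feed-forward network, the width is tiny, and there is no batching or load balancing.

```python
import numpy as np

def moe_layer(x, gate_W, experts, top_k=2):
    logits = gate_W @ x                            # one router score per expert
    top = np.argsort(logits)[-top_k:]              # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w = w / w.sum()                                # softmax over the selected experts only
    # Only the selected experts are evaluated; the others cost nothing for this token.
    y = sum(wi * (experts[i] @ x) for wi, i in zip(w, top))
    return y, top

rng = np.random.default_rng(0)
d, num_experts = 8, 8                              # toy sizes (assumption)
gate_W = rng.normal(size=(num_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]

x = rng.normal(size=d)
y, used = moe_layer(x, gate_W, experts)
print(sorted(used.tolist()), y.shape)
```

All experts must still be held in memory, so MoE reduces compute per token rather than the memory footprint of the model.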


6. Prompt Engineering

Prompt Structure and Principles

Prompt engineering is the technique of optimizing inputs (prompts) sent to LLMs to achieve desired results.

Components of an effective prompt:

  • Role: Define the role for the model
  • Context: Background information and constraints
  • Instruction: Clear description of the task
  • Examples: Desired input/output formats
  • Output Format: Response structure (JSON, table, list, etc.)
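
One way to keep these components explicit is to assemble prompts from named parts. The helper below is a hypothetical sketch (the function name, section labels, and sample values are all assumptions), not a standard API.

```python
def build_prompt(role, context, instruction, examples, output_format):
    parts = [
        f"Role: {role}",
        f"Context: {context}",
        f"Instruction: {instruction}",
    ]
    if examples:
        shots = "\n".join(f"- Input: {inp}\n  Output: {out}" for inp, out in examples)
        parts.append(f"Examples:\n{shots}")
    parts.append(f"Output format: {output_format}")
    return "\n\n".join(parts)

prompt = build_prompt(
    role="You are a support analyst.",
    context="Reviews come from an online electronics store.",
    instruction="Classify the sentiment of the review.",
    examples=[("Fast delivery, very happy", "Positive")],
    output_format='JSON with a single key "sentiment"',
)
print(prompt)
```

Keeping the parts separate makes it easier to vary one component at a time, for example adding examples or changing the output format, and compare results.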

Zero-shot / Few-shot / Chain-of-Thought

Zero-shot: Performing tasks with instructions only, no examples

Analyze the sentiment of the following sentence: "This product is absolutely amazing!"
→ Positive

Few-shot: Performing tasks with a few examples

Sentence: "Fast delivery, very happy" → Positive
Sentence: "Screen keeps turning off" → Negative
Sentence: "Price is ok but quality is poor" → ?
→ Negative
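
In application code, a few-shot prompt like the one above is usually built from a list of labeled examples. The helper name below is an assumption made for illustration.

```python
def few_shot_prompt(examples, query):
    # Labeled examples first, then the unlabeled query for the model to complete.
    lines = [f'Sentence: "{text}" → {label}' for text, label in examples]
    lines.append(f'Sentence: "{query}" →')
    return "\n".join(lines)

examples = [
    ("Fast delivery, very happy", "Positive"),
    ("Screen keeps turning off", "Negative"),
]
prompt = few_shot_prompt(examples, "Price is ok but quality is poor")
print(prompt)
```

Passing an empty list of examples turns the same helper into a zero-shot prompt.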

Chain-of-Thought (CoT): Inducing step-by-step reasoning

Q: A store had 23 apples. They sold 8 and received 12 new ones.
   How many apples are there now?

A: Let's solve step by step.
   1. Initial apples: 23
   2. After selling: 23 - 8 = 15
   3. After restocking: 15 + 12 = 27
   Therefore, there are currently 27 apples.

System Prompts and Role Assignment

System prompts define the overall behavior of the model and are set at the beginning of a conversation.

{
  "messages": [
    {
      "role": "system",
      "content": "You are a data engineer with 10 years of experience. Provide accurate, practical answers to technical questions. Include code examples."
    },
    {
      "role": "user",
      "content": "How do you resolve data skew issues in Spark?"
    }
  ]
}

Note: How the system prompt is passed varies by API. Anthropic's Messages API takes a top-level system parameter, while OpenAI's Chat Completions API uses a message with the system role.
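
The difference is easiest to see in the request bodies. The helpers below only build dictionaries and send nothing; the function names and the model identifiers are placeholders, and the field layout reflects the two APIs as commonly documented, so check the current API reference before relying on it.

```python
import json

def openai_style_payload(model, system, user):
    # Chat Completions style: the system prompt is the first entry in "messages".
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

def anthropic_style_payload(model, system, user, max_tokens=1024):
    # Messages API style: the system prompt is a top-level field, not a message.
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system,
        "messages": [{"role": "user", "content": user}],
    }

system = "You are a data engineer with 10 years of experience."
user = "How do you resolve data skew issues in Spark?"

print(json.dumps(openai_style_payload("OPENAI_MODEL_NAME", system, user), indent=2))
print(json.dumps(anthropic_style_payload("ANTHROPIC_MODEL_NAME", system, user), indent=2))
```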


7. RAG (Retrieval-Augmented Generation)

Why RAG and How It Works

RAG is a technique that retrieves relevant information from external knowledge bases to enhance LLM responses.

Limitations of standalone LLMs:

  • Cannot know information after training data cutoff
  • Lacks enterprise internal data or domain knowledge
  • Risk of hallucination (generating false information)

RAG Pipeline Architecture:

[User Query]
     ↓
[1. Query Embedding] → Convert query to vector
     ↓
[2. Vector Search] → Search similar documents in vector DB
     ↓
[3. Context Construction] → Combine retrieved documents + original query
     ↓
[4. LLM Generation] → Generate response based on context
     ↓
[Response Output]
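
The four stages can be traced end to end with a dependency-free toy. The three "documents" are invented, and a bag-of-words count vector stands in for a real embedding model, so treat this as a sketch of the data flow rather than a usable retriever.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base (assumption)
docs = [
    "Customer records are retained for five years after account closure.",
    "Employees may work remotely up to three days per week.",
    "Production deployments require approval from two reviewers.",
]

def embed(text):
    # Stand-in for an embedding model: word counts instead of a dense vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    q = embed(query)                                          # 1. query embedding
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]                                         # 2. vector search

def build_prompt(query, k=1):
    context = "\n".join(retrieve(query, k))                   # 3. context construction
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How long are customer records retained?")
print(prompt)                                                 # 4. this string would be sent to the LLM
```

In a real pipeline the document embeddings are computed once and stored in a vector database, so only the query is embedded at request time.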

Vector Database Integration

Key vector database options that play a central role in RAG:

| Vector DB | Features | Best For |
|---|---|---|
| Chroma | Lightweight, embedded, Python-friendly | Prototypes, small scale |
| Pinecone | Fully managed SaaS | Minimal ops overhead |
| Weaviate | Hybrid search support | Keyword + vector search |
| Milvus | Large-scale distributed processing | Enterprise, high volume |
| pgvector | PostgreSQL extension | Existing PostgreSQL users |

# Simple RAG implementation with LangChain + Chroma
# Import paths assume the split-package layout (langchain-community,
# langchain-openai, langchain-text-splitters); adjust to your installed version.
# ChatOpenAI is a chat model, not part of langchain.llms.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# 1. Load and split documents
loader = TextLoader("company_docs.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 2. Store in vector DB (requires OPENAI_API_KEY in the environment)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),  # top 3 chunks
    return_source_documents=True
)

# 4. Query
result = qa_chain.invoke({"query": "What is our company's data retention policy?"})
print(result["result"])

RAG Pipelines in Enterprise Environments

Key considerations when adopting RAG in enterprise settings:

  • Chunking Strategy: Splitting documents into appropriate sizes (fixed-size vs. semantic)
  • Embedding Model Selection: Multilingual support, domain specialization
  • Hybrid Search: Combining vector search + keyword search (BM25)
  • Re-ranking: Reordering search results with LLM or Cross-Encoder
  • Security: Filtering search results based on document access permissions
  • Evaluation: Monitoring retrieval accuracy (Recall@K) and response quality
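
Two of these considerations, fixed-size chunking and Recall@K, are simple enough to sketch directly. The function names, default sizes, and document IDs below are assumptions made for illustration; semantic chunking, re-ranking, and permission filtering need more machinery than fits here.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    # Fixed-size character chunks; consecutive chunks share chunk_overlap characters.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)   # avoids a final chunk that is pure overlap
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Share of the relevant documents that appear among the top-k retrieved results.
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

chunks = chunk_text("x" * 1000)
print([len(c) for c in chunks])

# Hypothetical evaluation case: the retriever returned d3, d1, d7; d1 and d2 were relevant
print(recall_at_k(["d3", "d1", "d7"], ["d1", "d2"], k=3))
```

Tracking Recall@K on a fixed set of question and document pairs gives an early signal when a chunking or embedding change degrades retrieval, before it shows up as worse answers.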

8. LLM Limitations and Considerations

Hallucination

Hallucination is the phenomenon where LLMs confidently generate information that is not factual.

Types of hallucination:

  • Factual errors: Citing non-existent papers or statistics
  • Logical errors: Plausible but incorrect reasoning
  • Consistency errors: Contradictory answers within the same conversation

Mitigation strategies:

  • RAG-based responses grounded in external knowledge
  • Requesting citations
  • Lowering the temperature parameter
  • Building response verification pipelines
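
The effect of the temperature parameter can be seen by applying it to a fixed set of scores. The three logits below are hypothetical. Lower temperature concentrates probability on the top-scoring token, which makes output more deterministic but does not by itself make it more accurate.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [2.0, 1.4, 1.1]   # hypothetical scores for three candidate tokens

for t in (1.0, 0.5, 0.1):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
```

If the highest-scoring token is itself wrong, a low temperature will reproduce that error consistently, which is why it is listed alongside grounding and verification rather than instead of them.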

Bias

LLMs can reflect biases present in their training data.

  • Social bias: Gender, racial, and cultural stereotypes
  • Language bias: Performance degradation for non-English languages due to English-centric training
  • Temporal bias: Information skew based on training data timeframe

Cost and Latency

Operating LLMs requires significant cost and infrastructure.

| Factor | Considerations |
|---|---|
| API Cost | Per-token billing for input/output (varies by model) |
| Self-hosting | GPU server costs (A100/H100, etc.) |
| Latency | Token generation speed (TTFT, TPS) |
| Concurrency | Batch processing, request queue management |
| Optimization | Quantization, KV cache, speculative decoding |

Note: TTFT (Time to First Token) is the time until the first token is generated, and TPS (Tokens Per Second) is the number of tokens generated per second.
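
Both metrics can be derived from the arrival times of a streamed response. The sketch below assumes you have recorded the request time and one timestamp per generated token; the sample timestamps are invented. It measures TPS over the tokens after the first one, which is one common convention, though some tools include the first token.

```python
def latency_metrics(request_time, token_times):
    # token_times: arrival time in seconds of each generated token, in order.
    ttft = token_times[0] - request_time
    decode_time = token_times[-1] - token_times[0]
    # Tokens after the first, divided by the time spent producing them.
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else float("nan")
    return ttft, tps

# Hypothetical stream: first token after 0.40 s, then one token every 0.05 s
ttft, tps = latency_metrics(0.0, [0.40, 0.45, 0.50, 0.55, 0.60])
print(f"TTFT = {ttft:.2f} s, TPS = {tps:.1f}")
```

Separating the two matters in practice: TTFT dominates how responsive a chat interface feels, while TPS determines how long a long answer takes to finish.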


9. Enterprise LLM Use Cases

  • Meeting notes auto-summarization: Extracting key points and action items from transcripts
  • Technical documentation search: RAG-based search across internal wikis, Confluence, etc.
  • Report generation: Converting data analysis results into natural language reports

Code Generation and Review

  • Code auto-completion: IDE-integrated coding assistants (Claude Code, GitHub Copilot, etc.)
  • Code review: Automated PR-based code review and improvement suggestions
  • Legacy code migration: Converting old code to modern frameworks
  • Test code generation: Auto-generating unit tests and integration tests

Data Pipeline Automation

  • SQL generation: Converting natural language queries to SQL (Text-to-SQL)
  • ETL pipeline design: Data flow design and code generation
  • Anomaly detection: Detecting system anomalies through log analysis
  • Data quality validation: Schema change detection, data integrity verification automation

References

  • Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS
  • Brown, T. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS
  • Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS
  • Hu, E. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR
  • Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS
  • Wei, J. et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS
  • Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv

— Data Dynamics Engineering Team