LLM Inference Optimization Guide - Quantization, KV Cache, Speculative Decoding
A comprehensive guide covering LLM inference optimization techniques: quantization (GPTQ, AWQ, GGUF), KV Cache/PagedAttention, speculative decoding, continuous batching, FlashAttention, and tensor parallelism.
LLM inference optimization is about getting the highest throughput and the lowest latency out of limited GPU resources. This post covers the major techniques in turn: quantization, KV Cache management, speculative decoding, batching, and several others.
1. Understanding LLM Inference
Inference Stages
LLM inference consists of two stages: Prefill (input processing) and Decode (token generation).
[LLM Inference Stages]
Input: "How to optimize Spark performance?"
↓
[Prefill] — Process all input tokens in parallel
- Generate KV Cache
- Compute-bound (GPU compute intensive)
- Determines TTFT (Time to First Token)
↓
[Decode] — Generate tokens one at a time
- Read + update KV Cache
- Memory-bound (memory bandwidth intensive)
- Determines TPS (Tokens Per Second)
↓
Output: "To optimize Spark performance, you should..."
Key Performance Metrics
| Metric | Description | Affected By |
|---|---|---|
| TTFT | Time to first token | Prefill speed, model size |
| TPS | Tokens per second | Decode speed, memory bandwidth |
| Throughput | Total throughput (tok/s) | Batching, GPU utilization |
| Latency (P95) | 95th percentile response time | Queuing, scheduling |
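As a rough illustration of how these metrics fall out of raw timestamps, here is a minimal sketch. The function names and the choice to count only post-first-token output in TPS are assumptions made for this example, not a standard API.

```python
def latency_metrics(request_start, token_times):
    """Return (ttft_seconds, decode_tokens_per_second) for one request.

    token_times holds the arrival time of each generated token, in seconds.
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_start
    # TPS here counts only decode tokens, i.e. those after the first one.
    decode_tokens = len(token_times) - 1
    decode_time = token_times[-1] - token_times[0]
    tps = decode_tokens / decode_time if decode_time > 0 else float("inf")
    return ttft, tps


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct, e.g. pct=95 for P95."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]
```

Feeding per-request latencies into `percentile(latencies, 95)` gives the P95 row of the table.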
GPU Memory Breakdown
[7B Model GPU Memory (FP16)]
Total GPU Memory: 24 GB (RTX 4090)
Model weights: ██████████████ 14 GB (58%)
KV Cache: ████████ 8 GB (33%)
Activations/temp: ██ 2 GB (8%)
→ KV Cache is the bottleneck determining concurrent request count
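The arithmetic behind this breakdown can be sketched as follows. The layer count, head count, and head dimension are assumptions (roughly a Llama-2-7B-style shape) and are not taken from this post; the 8 GB cache budget is treated as 8 GiB for simplicity.

```python
def weight_bytes(n_params, bytes_per_param):
    """Memory needed for the model weights alone."""
    return n_params * bytes_per_param


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV Cache bytes that one token occupies across all layers."""
    # 2x for the separate Key and Value tensors stored in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


GiB = 1024 ** 3
# Assumed model shape (hypothetical, 7B-class, FP16 cache).
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
cache_budget = 8 * GiB
max_cached_tokens = cache_budget // per_token
max_requests_at_4k = max_cached_tokens // 4096
```

Under these assumptions each token costs 0.5 MiB of cache, so the 8 GB budget holds about 16K tokens in total, or only four concurrent requests at a 4096-token context.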
2. Quantization
What is Quantization
Quantization reduces the numeric precision of model weights to save memory and computation.
FP32 (32-bit): 0.12345678901234567890 → 4 bytes/param
FP16 (16-bit): 0.1235 → 2 bytes/param (2x savings)
INT8 (8-bit): 31 → 1 byte/param (4x savings)
INT4 (4-bit): 7 → 0.5 bytes/param (8x savings)
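To make the precision trade-off concrete, here is a minimal sketch of symmetric round-to-nearest quantization with a single scale per tensor. This is the simplest possible scheme, shown for illustration only; GPTQ, AWQ, and GGUF use more elaborate group-wise and calibration-based methods.

```python
def quantize_absmax(weights, bits):
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1           # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return [0] * len(weights), 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


def max_abs_error(weights, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize_absmax(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, dequantize(q, scale)))
```

The round-trip error is bounded by half the scale, which is why dropping from 8 to 4 bits (a 127-step grid versus a 7-step grid) costs noticeably more accuracy unless the method compensates for it.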
Quantization Methods Comparison
| Method | Bits | Quality Loss | Speed | Environment | Tools |
|---|---|---|---|---|---|
| FP16/BF16 | 16 | None | Baseline | GPU | Default |
| GPTQ | 4 | Very low | Fast | GPU (CUDA) | AutoGPTQ |
| AWQ | 4 | Very low | Very fast | GPU (CUDA) | AutoAWQ |
| GGUF | 2-8 | Varies | Medium | CPU/GPU | llama.cpp |
| BitsAndBytes | 4/8 | Low | Medium | GPU | HuggingFace |
| FP8 | 8 | Minimal | Fast | H100+ | vLLM, TRT-LLM |
Selection Guide
[GPU serving?]
├─ Yes
│ ├─ H100 or newer → FP8 (minimal loss, best speed)
│ ├─ A100/A10 → AWQ 4-bit (high quality, fast)
│ └─ RTX 4090 → GPTQ or AWQ 4-bit
└─ No (CPU or hybrid)
└─ GGUF Q4_K_M (Ollama/llama.cpp)
3. KV Cache Optimization
What is KV Cache
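The KV Cache stores the Key/Value tensors of previous tokens in Transformer self-attention so they are not recomputed at every decode step.
[Without KV Cache] → Each decode step reprocesses the whole sequence: O(n²) computation per new token
[With KV Cache] → Each decode step processes only the new token and reuses cached K,V: O(n) computation per new token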
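The difference can be sketched with a toy single-head attention in plain Python. The `project` function is a hypothetical stand-in for the learned K/V projections, and the counters only track how many projections each variant performs.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def project(x, scale):
    """Hypothetical stand-in for a learned K or V projection."""
    return [scale * v for v in x]


def decode_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, 0.5) for x in prefix]    # recomputed every step
        values = [project(x, 2.0) for x in prefix]
        projections += 2 * t
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    outputs, projections = [], 0
    keys, values = [], []                           # the KV Cache
    for x in tokens:
        keys.append(project(x, 0.5))                # only the new token
        values.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(x, keys, values))
    return outputs, projections
```

Both variants produce the same outputs; the cached one performs 2 projections per step instead of 2t, at the price of keeping `keys` and `values` in memory.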
PagedAttention (vLLM)
[Traditional: Contiguous allocation]
Request 1: [████████░░░░░░░░] ← Reserve max length, waste
Request 2: [██████░░░░░░░░░░]
Request 3: [Out of memory]
[PagedAttention: Block-level allocation]
Request 1: [██][██][██][██] ← Allocate only needed blocks
Request 2: [██][██][██]
Request 3: [██][██] ← Remaining blocks available!
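→ Recovers most of the 60-80% of KV Cache memory that contiguous allocation wastes on fragmentation and over-reservation
→ Fits 2-4x more concurrent requests in the same memory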
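The block-level idea can be sketched as a small allocator with a per-request block table. The class and method names are hypothetical and chosen for this example; they are not vLLM's internal API.

```python
class BlockAllocator:
    """Toy block-level KV Cache allocator (hypothetical names, not vLLM's)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token; allocate a block only when needed."""
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length == len(table) * self.block_size:   # last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV Cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

Because a request never holds more than one partially filled block, the waste per request is bounded by `block_size - 1` token slots instead of the gap between its actual and maximum length.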
GQA (Grouped-Query Attention)
| Method | KV Heads | Cache Size | Models |
|---|---|---|---|
| MHA (Multi-Head) | Same as query heads | Baseline | GPT-3 |
| MQA (Multi-Query) | 1 (shared by all query heads) | 1/n_heads (e.g., 1/32 with 32 heads) | PaLM, Falcon |
| GQA (Grouped-Query) | 1 per group of query heads | 1/4~1/8 | LLaMA 3, Gemma |
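The cache-size column follows directly from the ratio of KV heads to query heads. A minimal sketch, with head counts picked as illustrative assumptions:

```python
def kv_cache_ratio(n_query_heads, n_kv_heads):
    """KV Cache size relative to MHA with the same number of query heads."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    return n_kv_heads / n_query_heads


def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Index of the shared KV head that a given query head reads from."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size
```

For example, 32 query heads with 8 KV heads gives a ratio of 1/4, and 64 query heads with 8 KV heads gives 1/8, which is the range shown for GQA.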
4. Speculative Decoding
Principle
Use a small Draft model to quickly generate multiple tokens, then verify with the large Target model in one pass.
[Standard decoding]
Target(70B): Token 1 → Token 2 → Token 3 → Token 4 → Token 5 (5 steps)
[Speculative decoding]
Draft(8B): Propose tokens 1,2,3,4,5 (5 cheap draft steps)
Target(70B): Verify all 5 in one forward pass → Accept the first 4, replace the 5th with its own token (1 step)
→ ~2.5x speedup
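The propose-then-verify loop can be sketched as below. This is a simplified greedy variant that accepts a draft token only when it exactly matches the target's own choice; the original paper uses a probabilistic acceptance rule instead. `draft_next` and `target_next` are hypothetical callables that return the next token for a given context.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively (k cheap steps).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2. The target checks every position. A real engine scores all positions
    #    in one batched forward pass; this toy calls target_next per position.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)    # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))    # all k accepted: one bonus token
    return accepted
```

With greedy matching, the result is always identical to what the target model would have produced on its own, so the speedup comes with no change in output.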
# Enable in vLLM
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5
| Scenario | Standard | Speculative | Speedup |
|---|---|---|---|
| Code generation (repetitive) | 30 tok/s | 75 tok/s | 2.5x |
| General conversation | 30 tok/s | 55 tok/s | 1.8x |
| Creative writing (unpredictable) | 30 tok/s | 40 tok/s | 1.3x |
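One way to see why predictability matters is to estimate how many tokens each target pass yields. The sketch below assumes every draft token is accepted independently with the same probability, which is a simplification, and it ignores the cost of running the draft model, so it is an upper bound on the speedup rather than a prediction of the numbers in the table.

```python
def expected_tokens_per_round(acceptance_rate, num_draft_tokens):
    """Expected tokens produced per target pass, assuming i.i.d. acceptance."""
    a, k = acceptance_rate, num_draft_tokens
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if a == 1.0:
        return float(k + 1)              # every draft token plus the bonus token
    # Geometric series 1 + a + a^2 + ... + a^k
    return (1 - a ** (k + 1)) / (1 - a)
```

A high acceptance rate (repetitive code) pushes the value toward `k + 1`, while a low one (creative writing) pushes it toward 1, meaning the draft work is mostly wasted.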
5. Batching and Scheduling
Continuous Batching
[Static Batching]
Batch: [A(100tok), B(50tok), C(200tok)]
→ Wait for longest (C) to complete, A/B idle after finishing
[Continuous Batching]
Time 1: [A, B, C] processing
Time 2: B done → D immediately added [A, D, C]
Time 3: A done → E immediately added [E, D, C]
→ Minimize GPU idle time
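The gap between the two schedulers can be sketched by counting decode steps. The lengths of A, B, and C come from the example above; the lengths used for D and E in the usage note are assumptions, and each request is assumed to need at least one token.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled from the queue before the next step."""
    queue = list(lengths)
    running = []          # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1                                    # one decode step for the batch
        running = [r - 1 for r in running if r > 1]   # drop finished requests
    return steps
```

With lengths `[100, 50, 200, 120, 80]` and three slots, static batching needs 320 steps while continuous batching needs 200, because D and E finish inside the time C is still generating.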
# vLLM optimization settings
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.9 \
  --enable-chunked-prefill \
  --max-model-len 4096
6. Additional Optimization Techniques
FlashAttention
| Version | Speedup | Memory Savings | GPU Support |
|---|---|---|---|
| FlashAttention-1 | 2-4x | O(N²) → O(N) | A100+ |
| FlashAttention-2 | 5-9x (FP16) | Same | A100+ |
| FlashAttention-3 | 1.5-2x (additional) | FP8 support | H100 |
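The memory savings come from never materializing the full score matrix. The core arithmetic, an online softmax over blocks of keys and values, can be sketched for a single query in plain Python. This shows only the math, under the assumption of one query vector and no 1/√d scaling; the real kernels fuse these steps on the GPU and keep each tile in on-chip memory.

```python
import math


def naive_attention(q, keys, values):
    """Materializes all n scores before the softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, block=2):
    """Streams over K/V blocks, keeping only a running max, sum, and output."""
    m, z = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)    # correct the earlier partial sums
        z *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            z += p
            acc = [a + p * vd for a, vd in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

Both functions return the same result up to floating-point rounding, but the tiled version holds only `block` scores at a time, which is what brings attention memory down from quadratic to linear in sequence length.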
Prefix Caching
Reuse prefix KV Cache for requests sharing the same system prompt.
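The reuse mechanism can be sketched as a lookup keyed by a hash of the token prefix, one entry per full block. The class below is a hypothetical toy for illustration, not vLLM's implementation; the string `"kv-block"` stands in for the KV tensors a real engine would store.

```python
import hashlib


class PrefixCache:
    """Toy block-level prefix cache keyed by a hash of all tokens so far."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = {}   # prefix hash -> placeholder for that block's KV data

    def _block_keys(self, token_ids):
        keys, h = [], hashlib.sha256()
        full = len(token_ids) - len(token_ids) % self.block_size
        for start in range(0, full, self.block_size):
            h.update(repr(token_ids[start:start + self.block_size]).encode())
            keys.append(h.copy().hexdigest())   # hash covers the whole prefix
        return keys

    def lookup_and_insert(self, token_ids):
        """Return how many leading tokens were served from cache."""
        hits, still_matching = 0, True
        for key in self._block_keys(token_ids):
            if still_matching and key in self.blocks:
                hits += self.block_size
            else:
                still_matching = False
                self.blocks[key] = "kv-block"   # stands in for computed KV tensors
        return hits
```

Tokens counted as hits skip prefill entirely, which is where the TTFT reduction for long shared system prompts comes from.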
# Enable in vLLM
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching
# → 50-80% TTFT reduction for long system prompts
Tensor Parallelism
| Strategy | Splits | Best For |
|---|---|---|
| Tensor Parallel | Within layers (weights) | Single node multi-GPU |
| Pipeline Parallel | Across layers (depth) | Multi-node |
| Data Parallel | Requests (replicas) | N identical model instances |
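Splitting "within layers" can be sketched with a column-parallel matrix multiply: each simulated device holds a slice of the weight columns, computes its part of the output, and the slices are concatenated. Nested lists stand in for GPU tensors, and the function names are assumptions for this example.

```python
def matmul(x, w):
    """x: [rows][inner], w: [inner][cols] -> [rows][cols]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, n_shards):
    """Give each simulated device a contiguous slice of the output columns."""
    cols = len(w[0])
    if cols % n_shards != 0:
        raise ValueError("columns must divide evenly across shards")
    step = cols // n_shards
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(n_shards)]


def column_parallel_matmul(x, w, n_shards):
    """Compute x @ w with the weight columns sharded across n_shards devices."""
    partials = [matmul(x, shard) for shard in split_columns(w, n_shards)]
    # All-gather step: concatenate each device's output columns.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]
```

Each device stores only `1 / n_shards` of the weights, which is what lets a model too large for one GPU fit across several; the concatenation step is the communication cost that makes fast intra-node links preferable.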
7. Optimization Strategy Summary
Scenario Guide
| Goal | Key Technique | Expected Effect |
|---|---|---|
| Reduce model memory | Quantization (AWQ/GGUF Q4) | 75% model size reduction |
| Increase throughput | PagedAttention + Continuous Batching | 2-4x throughput |
| Faster first token | Prefix Caching + FlashAttention | 50-80% TTFT reduction |
| Faster token generation | Speculative Decoding | 1.5-2.5x TPS |
| Run large models | Tensor Parallelism | 70B+ model serving |
Recommended Priority
1. Apply quantization (easiest, highest impact)
2. vLLM + Continuous Batching (switch serving engine)
3. Enable FlashAttention (usually enabled by default)
4. Prefix Caching (reuse system prompts)
5. Speculative Decoding (large model acceleration)
6. Tensor Parallelism (very large models)
Note: Most optimization techniques can be combined. Applying quantization + PagedAttention + Continuous Batching + Prefix Caching together yields maximum effect.
References
- Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP
- Dao, T. (2024). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." ICLR
- Leviathan, Y. et al. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML
- Frantar, E. et al. (2023). "GPTQ: Accurate Post-Training Quantization." ICLR
- Lin, J. et al. (2024). "AWQ: Activation-aware Weight Quantization." MLSys
— Data Dynamics Engineering Team