
LLM Inference Optimization Guide - Quantization, KV Cache, Speculative Decoding

A comprehensive guide covering LLM inference optimization techniques: quantization (GPTQ, AWQ, GGUF), KV Cache/PagedAttention, speculative decoding, continuous batching, FlashAttention, and tensor parallelism.

Data Dynamics | April 16, 2026 | 6 min read

LLM inference optimization is the key technology for achieving maximum throughput and minimum latency with limited GPU resources. This post systematically covers quantization, KV Cache, speculative decoding, and other major optimization techniques.


1. Understanding LLM Inference

Inference Stages

LLM inference consists of two stages: Prefill (input processing) and Decode (token generation).

[LLM Inference Stages]

Input: "How to optimize Spark performance?"
       ↓
[Prefill] — Process all input tokens in parallel
  - Generate KV Cache
  - Compute-bound (GPU compute intensive)
  - Determines TTFT (Time to First Token)
       ↓
[Decode] — Generate tokens one at a time
  - Read + update KV Cache
  - Memory-bound (memory bandwidth intensive)
  - Determines TPS (Tokens Per Second)
       ↓
Output: "To optimize Spark performance, you should..."

Key Performance Metrics

| Metric | Description | Affected By |
|---|---|---|
| TTFT | Time to first token | Prefill speed, model size |
| TPS | Tokens per second | Decode speed, memory bandwidth |
| Throughput | Total throughput (tok/s) | Batching, GPU utilization |
| Latency (P95) | 95th percentile response time | Queuing, scheduling |
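As a rough illustration of how these metrics can be derived from raw timings, here is a minimal sketch. The function names, the timestamps, and the latency samples are all made up for the example, and the percentile uses the nearest-rank definition, which is only one of several in common use.

```python
import math

def summarize_request(request_start, token_times):
    """Return (ttft, tps) for one request.

    token_times: arrival time (seconds) of each output token.
    """
    ttft = token_times[0] - request_start
    decode_time = token_times[-1] - token_times[0]
    # TPS is measured over the decode phase, i.e. the tokens after the first.
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else float("inf")
    return ttft, tps

def percentile(values, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Hypothetical timings: request sent at t=0.0 s, first token at 0.25 s,
# then one token every 0.05 s.
token_times = [0.25 + 0.05 * i for i in range(21)]
ttft, tps = summarize_request(0.0, token_times)

# Hypothetical end-to-end latencies (seconds) for ten requests.
latencies = [1.0, 1.2, 0.9, 3.5, 1.1, 1.0, 1.3, 0.95, 1.05, 1.15]
p95 = percentile(latencies, 95)

print(f"TTFT={ttft:.2f}s  TPS={tps:.1f}  P95={p95:.2f}s")
```

A single slow request (3.5 s here) dominates P95 even though the median stays near 1 s, which is why tail latency is tracked separately from throughput.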

GPU Memory Breakdown

[7B Model GPU Memory (FP16)]

Total GPU Memory: 24 GB (RTX 4090)

Model weights:     ██████████████  14 GB (58%)
KV Cache:          ████████         8 GB (33%)
Activations/temp:  ██               2 GB (9%)

→ KV Cache is the bottleneck determining concurrent request count
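To see why the KV Cache limits concurrency, it helps to work out its per-token cost. The sketch below assumes a 7B-class model shape (32 layers, 32 KV heads, head dimension 128, FP16 values); these numbers are an assumption for illustration, not taken from a specific model card.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # Factor 2: one Key vector and one Value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes).
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)

kv_budget = 8 * 1024**3              # the 8 GB KV Cache slice in the breakdown
max_tokens = kv_budget // per_token  # total tokens that fit across all requests
concurrent_4k = max_tokens // 4096   # requests that fit at 4096 tokens each

print(per_token, max_tokens, concurrent_4k)  # 524288 16384 4
```

Under these assumptions each token costs 512 KiB of cache, so the 8 GB slice holds about 16K tokens in total, or only four requests at a 4096-token context.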

2. Quantization

What is Quantization

Quantization reduces the numeric precision of model weights to save memory and computation.

FP32 (32-bit): 0.12345678901234567890  → 4 bytes/param
FP16 (16-bit): 0.1235                  → 2 bytes/param (2x savings)
INT8 (8-bit):  31                       → 1 byte/param (4x savings)
INT4 (4-bit):  7                        → 0.5 bytes/param (8x savings)
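The mapping from floats to small integers can be sketched with the simplest scheme, symmetric absmax rounding with one scale per tensor. This is an illustrative baseline only; GPTQ, AWQ, and GGUF use more elaborate per-group and error-compensating schemes. The weight values are made up and assumed not to be all zero.

```python
def quantize_absmax(weights, bits):
    """Symmetric round-to-nearest quantization with a single scale."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax   # assumes a non-zero tensor
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def weight_gb(num_params, bits):
    """Weight storage in GB (10^9 bytes), ignoring scales and metadata."""
    return num_params * bits / 8 / 1e9

weights = [0.12, -0.50, 0.33, 0.02, -0.27]        # hypothetical weights
q8, s8 = quantize_absmax(weights, 8)
q4, s4 = quantize_absmax(weights, 4)
err8 = max(abs(w - d) for w, d in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - d) for w, d in zip(weights, dequantize(q4, s4)))

print("INT8:", q8, "max error", err8)
print("INT4:", q4, "max error", err4)
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(name, weight_gb(7e9, bits), "GB for a 7B model")
```

The rounding error is bounded by half the scale, so halving the bit width from 8 to 4 makes the worst-case error roughly 18x larger (127/7) for the same tensor. That gap is what the more careful 4-bit methods try to close.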

Quantization Methods Comparison

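| Method | Bits | Quality Loss | Speed | Environment | Tools |
|---|---|---|---|---|---|
| FP16/BF16 | 16 | None | Baseline | GPU | Default |
| GPTQ | 4 | Very low | Fast | GPU (CUDA) | AutoGPTQ |
| AWQ | 4 | Very low | Very fast | GPU (CUDA) | AutoAWQ |
| GGUF | 2-8 | Varies | Medium | CPU/GPU | llama.cpp |
| BitsAndBytes | 4/8 | Low | Medium | GPU | HuggingFace |
| FP8 | 8 | Minimal | Fast | H100+ | vLLM, TRT-LLM |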

Selection Guide

[GPU serving?]
├─ Yes
│  ├─ H100 → FP8 (minimal loss, best speed; needs native FP8 hardware)
│  ├─ A100/A10 → AWQ 4-bit (high quality, fast)
│  └─ RTX 4090 → GPTQ or AWQ 4-bit
└─ No (CPU or hybrid)
   └─ GGUF Q4_K_M (Ollama/llama.cpp)

3. KV Cache Optimization

What is KV Cache

The KV Cache stores the Key/Value tensors of previous tokens in Transformer self-attention so that they are not recomputed at every decode step.

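[Without KV Cache] → K,V recomputed for the whole sequence at every step: O(n²) computation per new token
[With KV Cache]    → Only the new token's K,V computed: O(n) computation per new token (reuse cached K,V)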
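The saving can be made concrete with a toy single-head attention that counts how many K/V projections each approach performs. This is a sketch under simplifying assumptions: the "weight matrices" are scalars, the token vectors are invented, and there is no 1/sqrt(d) scaling or multi-layer stack.

```python
import math

WQ, WK, WV = 0.5, 1.5, 2.0   # hypothetical scalar stand-ins for weight matrices

def project(x, w):
    return [w * xi for xi in x]

def attend(q, keys, values):
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[d] for e, v in zip(exps, values)) / z
            for d in range(len(values[0]))]

def decode_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        seen = tokens[:t]
        keys = [project(x, WK) for x in seen]     # recomputed every step
        values = [project(x, WV) for x in seen]
        projections += len(seen)
        outputs.append(attend(project(seen[-1], WQ), keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        k_cache.append(project(x, WK))            # only the newest token
        v_cache.append(project(x, WV))
        projections += 1
        outputs.append(attend(project(x, WQ), k_cache, v_cache))
    return outputs, projections

tokens = [[0.1 * i, 1.0 - 0.1 * i] for i in range(8)]
out_a, n_a = decode_without_cache(tokens)
out_b, n_b = decode_with_cache(tokens)
print(n_a, n_b)   # 36 8
```

Both versions produce the same attention outputs; the cached one does 8 K/V projections instead of 36, and the gap grows quadratically with sequence length. The price is the memory needed to hold the cache.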

PagedAttention (vLLM)

[Traditional: Contiguous allocation]
Request 1: [████████░░░░░░░░]  ← Reserve max length, waste
Request 2: [██████░░░░░░░░░░]
Request 3: [Out of memory]

[PagedAttention: Block-level allocation]
Request 1: [██][██][██][██]    ← Allocate only needed blocks
Request 2: [██][██][██]
Request 3: [██][██]            ← Remaining blocks available!

→ 60-80% memory waste reduction
→ 2-4x more concurrent requests
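The block-level idea in the diagram can be sketched as a small allocator that hands out fixed-size blocks on demand and keeps a per-request block table. This is a toy model for illustration, not vLLM's implementation; the class name, block size, and request sizes are all hypothetical.

```python
class BlockAllocator:
    """Toy block-level KV Cache allocator (block_size is in tokens)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve room for one more token; False if no block is free."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:   # last block is full
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return True

    def free(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = BlockAllocator(num_blocks=8, block_size=16)
for request_id, num_tokens in [("req1", 40), ("req2", 20), ("req3", 33)]:
    for _ in range(num_tokens):
        alloc.append_token(request_id)

blocks_used = {r: len(t) for r, t in alloc.block_tables.items()}
print(blocks_used)                        # {'req1': 3, 'req2': 2, 'req3': 3}

fits_before = alloc.append_token("req4")  # pool is exhausted
alloc.free("req1")
fits_after = alloc.append_token("req4")   # freed blocks are reused
print(fits_before, fits_after)            # False True
```

With contiguous reservation of a 64-token maximum (4 blocks per request), the same 8-block pool would hold only two requests; allocating per block fits all three.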

GQA (Grouped-Query Attention)

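| Method | KV Heads | Cache Size | Models |
|---|---|---|---|
| MHA (Multi-Head) | Same as query heads | Baseline | GPT-3 |
| MQA (Multi-Query) | 1 | 1/32 (with 32 query heads) | PaLM, Falcon |
| GQA (Grouped-Query) | One per group of query heads | 1/4 to 1/8 | LLaMA 3, Gemma |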
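The cache-size column follows directly from the ratio of KV heads to query heads. A short sketch, assuming 32 query heads throughout and 8 KV heads for the GQA case (illustrative head counts, not read from a particular model config):

```python
from fractions import Fraction

def kv_cache_ratio(num_query_heads, num_kv_heads):
    """KV Cache size relative to MHA with the same number of query heads."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must be divisible by KV heads")
    return Fraction(num_kv_heads, num_query_heads)

def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """In GQA, consecutive query heads share one KV head."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

mha = kv_cache_ratio(32, 32)   # every query head has its own K/V
mqa = kv_cache_ratio(32, 1)    # all query heads share one K/V
gqa = kv_cache_ratio(32, 8)    # groups of 4 query heads share one K/V

print(mha, mqa, gqa)           # 1 1/32 1/4
print([kv_head_for_query_head(h, 32, 8) for h in range(8)])
# [0, 0, 0, 0, 1, 1, 1, 1]
```

GQA sits between the two extremes: it keeps several distinct K/V heads for quality while still shrinking the cache by the group size.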

4. Speculative Decoding

Principle

Use a small Draft model to quickly generate multiple tokens, then verify with the large Target model in one pass.

[Standard decoding]
Target(70B): Token 1 → Token 2 → Token 3 → Token 4 → Token 5  (5 steps)

[Speculative decoding]
Draft(8B):   Generate tokens 1,2,3,4,5 quickly (5 cheap draft steps)
Target(70B): Verify all 5 in one forward pass → Accept 4, replace the mismatched 1 (1 step)
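→ ~2.5x speedup

# Enable in vLLM
# (Flag names vary by vLLM version; newer releases take these settings as a
#  JSON string via --speculative-config. Check --help for your install.)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.1-8B-Instruct \
    --num-speculative-tokens 5

| Scenario | Standard | Speculative | Speedup |
|---|---|---|---|
| Code generation (repetitive) | 30 tok/s | 75 tok/s | 2.5x |
| General conversation | 30 tok/s | 55 tok/s | 1.8x |
| Creative writing (unpredictable) | 30 tok/s | 40 tok/s | 1.3x |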
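The draft-then-verify loop can be sketched for the greedy case, where a drafted token is accepted only if it equals the token the target model would have chosen. This is a simplified illustration: the two "models" are hypothetical deterministic functions, verification is written as a loop rather than a single batched forward pass, and sampling-based decoding would need the rejection-sampling rule from Leviathan et al. instead of an equality check.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding; returns 1 to k+1 new tokens."""
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2) The target model checks every position. A real engine scores all
    #    positions in one forward pass; a loop is used here for clarity.
    accepted = []
    for i in range(k):
        expected = target_next(prefix + proposal[:i])
        if proposal[i] != expected:
            accepted.append(expected)   # replace the first mismatch, then stop
            return accepted
        accepted.append(proposal[i])

    # 3) Everything matched: the same target pass also yields a bonus token.
    accepted.append(target_next(prefix + proposal))
    return accepted

def target_next(seq):
    return (sum(seq) + len(seq)) % 10   # hypothetical "large model"

def draft_next(seq):
    # Hypothetical "small model": agrees with the target except at every 5th position.
    guess = target_next(seq)
    return (guess + 1) % 10 if len(seq) % 5 == 4 else guess

def generate(prefix, num_tokens, k=5):
    out, rounds = list(prefix), 0
    while len(out) - len(prefix) < num_tokens:
        out += speculative_step(out, draft_next, target_next, k)
        rounds += 1
    return out[:len(prefix) + num_tokens], rounds

def greedy_reference(prefix, num_tokens):
    out = list(prefix)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out

spec_out, rounds = generate([1], 20)
print(spec_out == greedy_reference([1], 20), rounds)
```

The output is identical to plain greedy decoding with the target model; only the number of target passes changes. How much is saved depends on how often the draft agrees, which is why the speedup in the table varies by scenario.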

5. Batching and Scheduling

Continuous Batching

[Static Batching]
Batch: [A(100tok), B(50tok), C(200tok)]
→ Wait for longest (C) to complete, A/B idle after finishing

[Continuous Batching]
Time 1: [A, B, C] processing
Time 2: B done → D immediately added [A, D, C]
Time 3: A done → E immediately added [E, D, C]
→ Minimize GPU idle time

# vLLM optimization settings
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.9 \
    --enable-chunked-prefill \
    --max-model-len 4096
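The difference between the two scheduling policies in the diagram can be checked with a small simulation that counts decode iterations. It is a sketch under simplifying assumptions: every request costs one step per output token, prefill is ignored, and requests D and E are given made-up lengths.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled from the queue on the next step."""
    queue = list(lengths)
    running = []
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1                                    # one decode iteration
        running = [r - 1 for r in running if r > 1]   # drop finished requests
    return steps

# A, B, C as in the diagram, plus hypothetical follow-up requests D and E.
lengths = [100, 50, 200, 150, 80]
static_steps = static_batching_steps(lengths, batch_size=3)
continuous_steps = continuous_batching_steps(lengths, batch_size=3)
print(static_steps, continuous_steps)   # 350 200
```

In the static case D and E cannot start until C finishes at step 200. With continuous batching they take over the slots freed by B and A, so all five requests complete within the 200 steps that C needs anyway.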

6. Additional Optimization Techniques

FlashAttention

| Version | Speedup | Memory Savings | GPU Support |
|---|---|---|---|
| FlashAttention-1 | 2-4x | O(N²) → O(N) | A100+ |
| FlashAttention-2 | 5-9x (FP16) | Same | A100+ |
| FlashAttention-3 | 1.5-2x over FlashAttention-2 | FP8 support | H100 |
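The memory saving comes from processing keys and values in blocks while keeping only a running maximum, a running softmax denominator, and a running output, so the full score vector never has to be stored. The sketch below shows that online-softmax idea for a single query in plain Python. It illustrates the arithmetic only, not the fused GPU kernel; the 1/sqrt(d) scaling is omitted and the inputs are arbitrary.

```python
import math

def naive_attention(q, keys, values):
    """Materializes all N scores before applying the softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]

def tiled_attention(q, keys, values, block_size=4):
    """Processes K/V block by block with a running max, sum, and output."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.3, -0.7]
keys = [[math.sin(i), math.cos(i)] for i in range(10)]
values = [[float(i), float(-i)] for i in range(10)]

full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=4)
print(max(abs(a - b) for a, b in zip(full, tiled)))   # difference is at rounding level
```

Because the rescaling step makes the blockwise result mathematically identical to the full softmax, the technique is exact rather than an approximation.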

Prefix Caching

Reuse prefix KV Cache for requests sharing the same system prompt.

# Enable in vLLM
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching
# → 50-80% TTFT reduction for long system prompts
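One way to picture prefix reuse is a cache keyed by hashes of token blocks, where each block's hash also covers everything before it. The sketch below is a toy model of that idea, not vLLM's actual data structures; the block size, token ids, and the string placeholder standing in for a KV block are all hypothetical, and eviction is ignored.

```python
import hashlib

BLOCK_SIZE = 4   # tokens per cache block (hypothetical; kept small for the demo)

def block_hashes(token_ids):
    """Hash each full block together with the hash of all preceding blocks."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = {}   # block hash -> cached KV block (placeholder value)

    def process(self, token_ids):
        """Return (tokens reused from cache, tokens that still need prefill)."""
        reused = 0
        for h in block_hashes(token_ids):
            if h in self.blocks:
                reused += BLOCK_SIZE          # chained hashes: hits form a prefix
            else:
                self.blocks[h] = "kv-block"   # would hold real K/V tensors
        return reused, len(token_ids) - reused

system_prompt = list(range(100, 112))         # 12 shared tokens = 3 full blocks
cache = PrefixCache()
first = cache.process(system_prompt + [1, 2, 3])
second = cache.process(system_prompt + [7, 8, 9, 10, 11])
print(first, second)                          # (0, 15) (12, 5)
```

The second request prefills only its 5 new tokens instead of all 17. The longer the shared system prompt is relative to the user input, the larger the TTFT saving.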

Tensor Parallelism

| Strategy | Splits | Best For |
|---|---|---|
| Tensor Parallel | Within layers (weights) | Single node multi-GPU |
| Pipeline Parallel | Across layers (depth) | Multi-node |
| Data Parallel | Requests (replicas) | N identical model instances |
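Splitting "within layers" means each GPU holds a slice of a weight matrix and computes part of the output. The following sketch shows the column-parallel case with nested lists standing in for tensors and list entries standing in for devices; it is an illustration of the arithmetic, with made-up matrices, and does not model communication cost.

```python
def matmul(x, w):
    """x: [rows][k], w: [k][cols], both as nested lists."""
    return [[sum(xi * wi for xi, wi in zip(row, col)) for col in zip(*w)]
            for row in x]

def split_columns(w, num_shards):
    """Give each shard an equal, contiguous slice of the output columns."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("columns must divide evenly across shards")
    step = cols // num_shards
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_shards)]

def tensor_parallel_matmul(x, w, num_shards):
    shards = split_columns(w, num_shards)
    partials = [matmul(x, shard) for shard in shards]   # one per "GPU"
    # All-gather step: concatenate each device's output columns.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1],
     [0, 1, 1, 3]]

full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, num_shards=2)
print(full)      # [[1, 2, 4, 7], [3, 4, 10, 15]]
print(sharded)   # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

Each shard stores half the weights and does half the multiply-adds, but the results must be gathered after every such layer. That communication is why tensor parallelism is usually kept within a single node with fast GPU interconnect.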

7. Optimization Strategy Summary

Scenario Guide

| Goal | Key Technique | Expected Effect |
|---|---|---|
| Reduce model memory | Quantization (AWQ/GGUF Q4) | 75% model size reduction (vs. FP16) |
| Increase throughput | PagedAttention + Continuous Batching | 2-4x throughput |
| Faster first token | Prefix Caching + FlashAttention | 50-80% TTFT reduction |
| Faster token generation | Speculative Decoding | 1.5-2.5x TPS |
| Run large models | Tensor Parallelism | 70B+ model serving |

Recommended order of application:

1. Apply quantization (easiest, highest impact)
2. vLLM + Continuous Batching (switch serving engine)
3. Enable FlashAttention (usually enabled by default)
4. Prefix Caching (reuse system prompts)
5. Speculative Decoding (large model acceleration)
6. Tensor Parallelism (very large models)

Note: Most optimization techniques can be combined. Applying quantization + PagedAttention + Continuous Batching + Prefix Caching together yields maximum effect.


References

  • Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP
  • Dao, T. (2024). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." ICLR
  • Leviathan, Y. et al. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML
  • Frantar, E. et al. (2023). "GPTQ: Accurate Post-Training Quantization." ICLR
  • Lin, J. et al. (2024). "AWQ: Activation-aware Weight Quantization." MLSys

— Data Dynamics Engineering Team