
LLM Inference Optimization Guide - Quantization, KV Cache, Speculative Decoding

A comprehensive guide covering LLM inference optimization techniques: quantization (GPTQ, AWQ, GGUF), KV Cache/PagedAttention, speculative decoding, continuous batching, FlashAttention, and tensor parallelism.

Data Dynamics · April 16, 2026 · 6 min read

LLM inference optimization is the key technology for achieving maximum throughput and minimum latency with limited GPU resources. This post systematically covers quantization, KV Cache, speculative decoding, and other major optimization techniques.

1. Understanding LLM Inference

Inference Stages

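LLM inference consists of two stages: Prefill, which processes the whole input prompt in one parallel pass and is typically compute-bound, and Decode, which generates output tokens one at a time and is typically memory-bandwidth-bound.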

Key Performance Metrics

Metric	Description	Affected By
TTFT	Time to first token	Prefill speed, prompt length, model size
TPS	Tokens per second for a single request	Decode speed, memory bandwidth
Throughput	Total tokens per second across all requests	Batching, GPU utilization
Latency (P95)	95th percentile response time	Queuing, scheduling
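As a rough illustration of how these metrics relate, the sketch below derives TTFT, per-request TPS, and a P95 latency from timestamps. The function names and all sample timestamps are hypothetical, and the percentile uses the simple nearest-rank definition; monitoring tools may interpolate differently.

```python
import math

def summarize_request(start, token_times):
    """Compute TTFT, decode TPS, and total latency from timestamps in seconds."""
    ttft = token_times[0] - start
    decode_time = token_times[-1] - token_times[0]
    # Decode TPS counts only the tokens produced after the first one.
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else float("nan")
    return {"ttft": ttft, "tps": tps, "latency": token_times[-1] - start}

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request: sent at t=0.0, five tokens arrive at the times below.
stats = summarize_request(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])

# Hypothetical end-to-end latencies (seconds) for ten requests.
latencies = [0.8, 1.1, 0.9, 3.2, 1.0, 1.2, 0.95, 1.05, 1.15, 0.85]
p95 = percentile(latencies, 95)

print(stats, p95)
```

A single slow request dominates the P95 here, which is why tail latency is driven by queuing and scheduling rather than by raw decode speed.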

GPU Memory Breakdown

[7B Model GPU Memory (FP16)]

Total GPU Memory: 24 GB (RTX 4090)

Model weights:     ██████████████  14 GB (58%)
KV Cache:          ████████         8 GB (33%)
Activations/temp:  ██               2 GB (9%)

→ KV Cache is the bottleneck determining concurrent request count
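The breakdown above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128) and FP16 values; those shape numbers and the 4096-token context are assumptions for illustration, not figures from this post.

```python
GB = 10**9

def weight_bytes(num_params, bytes_per_param):
    """Memory needed for the model weights alone."""
    return num_params * bytes_per_param

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes for one token: a Key and a Value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

weights = weight_bytes(7e9, 2)                    # 7B params in FP16
per_token = kv_bytes_per_token(32, 32, 128)       # assumed Llama-2-7B-like shape
kv_budget = 8 * GB                                # KV cache share from the breakdown
max_cached_tokens = kv_budget // per_token
full_length_requests = max_cached_tokens // 4096  # assumed 4096-token context

print(f"weights: {weights / GB:.0f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"tokens that fit in the KV budget: {max_cached_tokens}")
print(f"concurrent full-length requests: {full_length_requests}")
```

Under these assumptions only a handful of full-length requests fit at once, which is why the KV cache, not the weights, limits concurrency.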

2. Quantization

What is Quantization

Quantization reduces the numeric precision of model weights (and sometimes activations) to save memory, memory bandwidth, and computation.

FP32 (32-bit): 0.1234568  (~7 significant digits)   → 4 bytes/param
FP16 (16-bit): 0.1235     (~3-4 significant digits) → 2 bytes/param (2x savings vs. FP32)
INT8 (8-bit):  31         (one of 256 levels)       → 1 byte/param (4x savings)
INT4 (4-bit):  7          (one of 16 levels)        → 0.5 bytes/param (8x savings)
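To make the precision trade-off concrete, here is a minimal sketch of symmetric round-to-nearest (absmax) quantization in plain Python. It is the simplest possible scheme, not GPTQ or AWQ, which add calibration data and per-group scales; the weight values are made up for illustration.

```python
def quantize(weights, bits):
    """Symmetric absmax quantization: map floats to integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [v * scale for v in q]

weights = [0.1234, -0.5, 0.25, 0.03, -0.31]    # hypothetical weight values
results = {}
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    results[bits] = (q, scale, max_err)
    print(f"INT{bits}: q={q} max_abs_error={max_err:.4f}")
```

The rounding error is bounded by half the scale, so halving the bit width from 8 to 4 makes the worst-case error roughly 18 times larger for the same tensor (127 levels per side versus 7).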

Quantization Methods Comparison

Method	Bits	Quality Loss	Speed	Environment	Tools
FP16/BF16	16	None	Baseline	GPU	Default
GPTQ	4	Very low	Fast	GPU (CUDA)	AutoGPTQ
AWQ	4	Very low	Very fast	GPU (CUDA)	AutoAWQ
GGUF	2-8	Varies	Medium	CPU/GPU	llama.cpp
BitsAndBytes	4/8	Low	Medium	GPU	HuggingFace
FP8	8	Minimal	Fast	Hopper/Ada GPUs (H100, L40S) or newer	vLLM, TRT-LLM

Selection Guide

[Diagram: quantization method selection guide]

3. KV Cache Optimization

What is KV Cache

The KV cache stores the Key and Value tensors of previous tokens in Transformer self-attention so that they are not recomputed at every decode step.

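[Without KV Cache] → Each decode step recomputes K,V for all n tokens so far: O(n²) attention work per step
[With KV Cache]    → Each decode step computes K,V only for the new token and reuses the rest: O(n) per step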
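The difference can be seen in a toy single-head attention loop. The sketch below is an illustration only: the 2x2 projection matrices and the token vectors are hypothetical, and the query projection is omitted (the token vector is used as the query) to keep it short. It counts how many K/V projections each variant performs.

```python
import math

# Hypothetical 2x2 projection matrices for a single attention head.
W_K = [[0.5, -0.2], [0.1, 0.3]]
W_V = [[0.2, 0.4], [-0.3, 0.1]]
counter = {"projections": 0}

def project(x, w):
    """Multiply vector x by matrix w and count the call."""
    counter["projections"] += 1
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def attend(query, keys, values):
    """Softmax attention of one query over all keys and values seen so far."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def decode_without_cache(tokens):
    outputs = []
    for t in range(1, len(tokens) + 1):
        # Recompute K and V for every token seen so far at each step.
        keys = [project(x, W_K) for x in tokens[:t]]
        values = [project(x, W_V) for x in tokens[:t]]
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs

def decode_with_cache(tokens):
    keys, values, outputs = [], [], []
    for x in tokens:
        # Project only the new token and append it to the cache.
        keys.append(project(x, W_K))
        values.append(project(x, W_V))
        outputs.append(attend(x, keys, values))
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
counter["projections"] = 0
out_without = decode_without_cache(tokens)
calls_without = counter["projections"]
counter["projections"] = 0
out_with = decode_with_cache(tokens)
calls_with = counter["projections"]
print(calls_without, calls_with)
```

Both variants produce identical outputs; the cached version simply trades memory for the projections it no longer repeats, and that gap widens quadratically with sequence length.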

PagedAttention (vLLM)

[Traditional: Contiguous allocation]
Request 1: [████████░░░░░░░░]  ← Reserve max length, waste
Request 2: [██████░░░░░░░░░░]
Request 3: [Out of memory]

[PagedAttention: Block-level allocation]
Request 1: [██][██][██][██]    ← Allocate only needed blocks
Request 2: [██][██][██]
Request 3: [██][██]            ← Remaining blocks available!

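→ Prior systems waste 60-80% of KV cache memory to fragmentation and over-reservation; PagedAttention cuts that waste to under 4% (Kwon et al., 2023)
→ More concurrent requests fit in the same memory, which is where the reported 2-4x throughput gain comes from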
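The block-level idea can be sketched as a small allocator that hands out fixed-size blocks on demand and tracks a per-request block table. This is a simplified illustration, not vLLM's implementation; the class name, the 16-token block size, and the 128-token reservation used for comparison are assumptions.

```python
class PagedKVAllocator:
    """Toy block allocator: each request owns a list of block ids (its block table)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}
        self.lengths = {}

    def append_tokens(self, request_id, num_tokens):
        """Reserve room for num_tokens more tokens; return False if blocks run out."""
        table = self.block_tables.get(request_id, [])
        new_length = self.lengths.get(request_id, 0) + num_tokens
        blocks_needed = -(-new_length // self.block_size) - len(table)  # ceil division
        if blocks_needed > len(self.free_blocks):
            return False
        for _ in range(blocks_needed):
            table.append(self.free_blocks.pop())
        self.block_tables[request_id] = table
        self.lengths[request_id] = new_length
        return True

    def free(self, request_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

allocator = PagedKVAllocator(num_blocks=12, block_size=16)
requests = [("req1", 50), ("req2", 40), ("req3", 30)]
admitted = [allocator.append_tokens(rid, n) for rid, n in requests]

# Contiguous allocation would reserve the maximum length up front for each request.
MAX_LEN = 128
contiguous_capacity = (12 * 16) // MAX_LEN

print(admitted, len(allocator.free_blocks), contiguous_capacity)
```

With the same 192 token slots, reserving 128 tokens per request admits one request, while block-level allocation admits all three (4, 3, and 2 blocks) and still has blocks left over.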

GQA (Grouped-Query Attention)

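Method	KV Heads	KV Cache Size	Models
MHA (Multi-Head)	One per query head	Baseline	GPT-3
MQA (Multi-Query)	1, shared by all query heads	1/h for h query heads (1/32 with 32 heads)	PaLM, Falcon
GQA (Grouped-Query)	One per group of query heads	1/4~1/8 (e.g. 8 KV heads for 32 or 64 query heads)	LLaMA 3, Gemma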
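The cache-size column follows directly from the number of KV heads. The sketch below computes per-token KV cache size for the three variants, assuming a model with 32 layers, 32 query heads, head dimension 128, and FP16 values; the shape is an assumption chosen to match the 1/32 figure in the table.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """One Key and one Value vector per layer and per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

LAYERS, QUERY_HEADS, HEAD_DIM = 32, 32, 128   # assumed model shape

# Number of KV heads for each attention variant.
kv_heads = {"MHA": QUERY_HEADS, "GQA": 8, "MQA": 1}
sizes = {name: kv_cache_bytes_per_token(LAYERS, heads, HEAD_DIM)
         for name, heads in kv_heads.items()}

for name, size in sizes.items():
    ratio = sizes["MHA"] // size
    print(f"{name}: {size // 1024} KiB per token (1/{ratio} of MHA)")
```

Because query heads are untouched, GQA and MQA shrink only the cache and the K/V projections, which is why they help decode throughput far more than prefill.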

4. Speculative Decoding

Principle

Use a small Draft model to quickly generate multiple tokens, then verify with the large Target model in one pass.

[Standard decoding]
Target(70B): Token 1 → Token 2 → Token 3 → Token 4 → Token 5  (5 steps)

[Speculative decoding]
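Draft(8B):   Propose tokens 1,2,3,4,5 (5 cheap draft steps)
Target(70B): Verify all 5 in one forward pass → accept 4, replace the rejected 5th with its own token (1 step)
→ 5 tokens for 1 target step plus draft overhead, ~2.5x speedup in this example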

# Enable in vLLM (flag names vary by version; newer releases take a JSON --speculative-config instead)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.1-8B-Instruct \
    --num-speculative-tokens 5
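The propose-then-verify loop can be sketched with two toy deterministic "models". This is the greedy variant, where a draft token is accepted only if it equals the target's own greedy choice; Leviathan et al. use a rejection-sampling rule so that sampled outputs keep the target distribution. Both model functions here are hypothetical stand-ins, and the per-prefix target calls in the loop would be one batched forward pass in a real engine.

```python
def target_model(context):
    """Toy target: always predicts the next digit."""
    return (context[-1] + 1) % 10

def draft_model(context):
    """Toy draft: agrees with the target except right after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10

def speculative_step(context, k):
    """Draft proposes k tokens, target verifies them; returns the tokens to append."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_model(context + proposal))
    accepted = []
    for i in range(k):
        expected = target_model(context + proposal[:i])
        accepted.append(expected)
        if proposal[i] != expected:
            return accepted          # first mismatch replaced by the target's token
    # Every draft token matched: the same target pass also yields one bonus token.
    accepted.append(target_model(context + proposal))
    return accepted

def generate(prompt, num_tokens, k=5):
    context, target_passes = list(prompt), 0
    while len(context) - len(prompt) < num_tokens:
        context += speculative_step(context, k)
        target_passes += 1
    return context[len(prompt):len(prompt) + num_tokens], target_passes

def greedy_decode(prompt, num_tokens):
    """Reference: plain one-token-per-step decoding with the target model."""
    context = list(prompt)
    for _ in range(num_tokens):
        context.append(target_model(context))
    return context[len(prompt):]

prompt = [0]
tokens, target_passes = generate(prompt, num_tokens=10, k=5)
reference = greedy_decode(prompt, 10)
print(tokens, target_passes)
```

The output is identical to plain greedy decoding with the target, but it takes 2 target passes instead of 10 in this toy case; the saving depends entirely on how often the draft agrees with the target.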

Scenario	Standard	Speculative	Speedup
Code generation (repetitive)	30 tok/s	75 tok/s	2.5x
General conversation	30 tok/s	55 tok/s	1.8x
Creative writing (unpredictable)	30 tok/s	40 tok/s	1.3x
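Why the speedup varies by workload can be estimated with the formulas from Leviathan et al. (2023): with per-token acceptance rate α and γ draft tokens, one target pass yields (1 − α^(γ+1)) / (1 − α) tokens in expectation, and the wall-time improvement divides that by (γc + 1), where c is the draft-to-target cost ratio. The acceptance rates and the cost ratio of 0.1 below are illustrative assumptions, not measurements behind the table.

```python
def expected_tokens_per_pass(acceptance_rate, num_draft_tokens):
    """Expected tokens produced per target forward pass."""
    if acceptance_rate >= 1.0:
        return float(num_draft_tokens + 1)
    return (1 - acceptance_rate ** (num_draft_tokens + 1)) / (1 - acceptance_rate)

def expected_speedup(acceptance_rate, num_draft_tokens, cost_ratio):
    """Expected wall-time improvement; cost_ratio = draft step time / target step time."""
    tokens = expected_tokens_per_pass(acceptance_rate, num_draft_tokens)
    return tokens / (num_draft_tokens * cost_ratio + 1)

# Hypothetical acceptance rates for predictable vs. unpredictable text.
for alpha in (0.8, 0.6, 0.4):
    speedup = expected_speedup(alpha, num_draft_tokens=5, cost_ratio=0.1)
    print(f"acceptance={alpha:.1f} speedup={speedup:.2f}x")
```

Predictable text such as repetitive code gives a high acceptance rate and therefore a larger gain, while unpredictable text wastes most draft tokens.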

5. Batching and Scheduling

Continuous Batching

[Static Batching]
Batch: [A(100tok), B(50tok), C(200tok)]
→ Wait for longest (C) to complete, A/B idle after finishing

[Continuous Batching]
Time 1: [A, B, C] processing
Time 2: B done → D immediately added [A, D, C]
Time 3: A done → E immediately added [E, D, C]
→ Minimize GPU idle time
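The effect can be estimated with a small step-count simulation. The sketch below assumes each request needs a fixed number of decode steps (at least one), every step advances all active requests by one token, and prefill cost is ignored; requests D (80 tokens) and E (120 tokens) are hypothetical lengths added to the A/B/C example.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request right away."""
    queue = deque(lengths)
    active = []          # remaining tokens per running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Requests with one token left finish on this step and leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [100, 50, 200, 80, 120]   # A, B, C from above plus hypothetical D, E
static_steps = static_batching_steps(lengths, batch_size=3)
continuous_steps = continuous_batching_steps(lengths, batch_size=3)
print(static_steps, continuous_steps)
```

In this toy run the same five requests finish in 220 steps instead of 320, purely because slots are refilled as soon as they free up.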

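# vLLM optimization settings
# --max-num-seqs: upper bound on sequences batched per scheduler step
# --gpu-memory-utilization: fraction of GPU memory vLLM may use (weights + KV cache)
# --enable-chunked-prefill: split long prefills into chunks batched alongside decode steps
# --max-model-len: maximum context length (prompt + output) per request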
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.9 \
    --enable-chunked-prefill \
    --max-model-len 4096

6. Additional Optimization Techniques

FlashAttention

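Version	Speedup	Memory / Features	GPU Support
FlashAttention-1	2-4x vs. standard attention	Attention memory O(N²) → O(N) in sequence length	Turing, Ampere and newer
FlashAttention-2	~2x vs. FlashAttention-1 (up to 5-9x vs. standard attention, FP16)	Same	Ampere and newer (A100, H100)
FlashAttention-3	1.5-2x vs. FlashAttention-2	Same, adds FP8 support	Hopper (H100)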
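The memory saving comes from never materializing the full score matrix: keys and values are processed in tiles while a running maximum and running sum keep the softmax exact. The sketch below shows that online-softmax idea for a single query in plain Python. It is only an illustration of the math; the real FlashAttention is a fused GPU kernel that tiles over queries as well and keeps the tiles in on-chip SRAM. All vectors are made-up values.

```python
import math

def score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def standard_attention(q, keys, values):
    scores = [score(q, k) for k in keys]           # materializes all N scores
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, tile_size):
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile_size):
        tile_scores = [score(q, k) for k in keys[start:start + tile_size]]
        new_max = max(running_max, max(tile_scores))
        # Rescale what was accumulated under the old maximum (exp(-inf) is 0.0).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(tile_scores, values[start:start + tile_size]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

query = [0.3, -1.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 1.0]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0], [0.5, 0.5]]
full = standard_attention(query, keys, values)
tiled = tiled_attention(query, keys, values, tile_size=2)
print(full, tiled)
```

Only one tile of scores exists at any time, so extra memory grows with the tile size rather than with the sequence length, while the result matches standard attention up to floating-point rounding.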

Prefix Caching

Reuse prefix KV Cache for requests sharing the same system prompt.

# Enable in vLLM
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching
# → 50-80% TTFT reduction for long system prompts
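One common way to implement prefix reuse is to hash each full block of prompt tokens together with the hash of the block before it, so a block only matches when the entire prefix up to that point is identical. The sketch below illustrates that idea; the class name, the 4-token block size, and the token ids are hypothetical, and it tracks only which blocks are cached, not the KV tensors themselves.

```python
class PrefixCache:
    """Toy prefix cache keyed by a chained hash over full token blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cached_blocks = set()

    def _block_keys(self, tokens):
        keys, parent = [], None
        full = len(tokens) - len(tokens) % self.block_size
        for start in range(0, full, self.block_size):
            block = tuple(tokens[start:start + self.block_size])
            parent = hash((parent, block))   # chaining ties a block to its whole prefix
            keys.append(parent)
        return keys

    def process(self, tokens):
        """Return how many prompt tokens can reuse cached KV, then cache this prompt."""
        keys = self._block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.cached_blocks:
                break
            hits += 1
        self.cached_blocks.update(keys)
        return hits * self.block_size

system_prompt = list(range(10))            # hypothetical 10-token system prompt
cache = PrefixCache(block_size=4)
first = cache.process(system_prompt + [100, 101, 102])
second = cache.process(system_prompt + [200, 201])
print(first, second)
```

The second request reuses the two full blocks covered by the shared system prompt and recomputes only from the first block that differs, so prefill work shrinks in proportion to the shared prefix length.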

Tensor Parallelism

Strategy	Splits	Best For
Tensor Parallel	Within layers (weights)	Single node multi-GPU
Pipeline Parallel	Across layers (depth)	Multi-node
Data Parallel	Requests (replicas)	Scaling throughput with N identical model replicas
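Splitting "within layers" means sharding a weight matrix so each GPU computes part of the same matrix multiply. The sketch below simulates the two standard shardings in plain Python, with each loop iteration standing in for one GPU: a column split followed by concatenation (all-gather) and a row split followed by summation (all-reduce). The matrix and input values are arbitrary, and the dimensions are assumed to divide evenly by the shard count.

```python
def matmul(x, w):
    """Vector x of length d_in times matrix w of shape d_in x d_out."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def column_parallel(x, w, num_shards):
    """Each shard holds a slice of output columns; results are concatenated."""
    shard = len(w[0]) // num_shards
    out = []
    for rank in range(num_shards):
        cols = range(rank * shard, (rank + 1) * shard)
        w_shard = [[row[j] for j in cols] for row in w]
        out.extend(matmul(x, w_shard))           # stands in for an all-gather
    return out

def row_parallel(x, w, num_shards):
    """Each shard holds a slice of input rows; partial outputs are summed."""
    shard = len(w) // num_shards
    partials = []
    for rank in range(num_shards):
        rows = slice(rank * shard, (rank + 1) * shard)
        partials.append(matmul(x[rows], w[rows]))
    return [sum(vals) for vals in zip(*partials)]  # stands in for an all-reduce

x = [1, 2, 3, 4]
w = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1],
     [1, 1, 1, 0]]
reference = matmul(x, w)
col_result = column_parallel(x, w, num_shards=2)
row_result = row_parallel(x, w, num_shards=2)
print(reference, col_result, row_result)
```

Each shard stores only half of the weights, but every layer needs a collective operation to combine results, which is why tensor parallelism is best kept within a single node with fast GPU interconnect.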

7. Optimization Strategy Summary

Scenario Guide

Goal	Key Technique	Expected Effect
Reduce model memory	Quantization (AWQ/GGUF Q4)	75% model size reduction
Increase throughput	PagedAttention + Continuous Batching	2-4x throughput
Faster first token	Prefix Caching + FlashAttention	50-80% TTFT reduction
Faster token generation	Speculative Decoding	1.5-2.5x TPS
Run large models	Tensor Parallelism	70B+ model serving

Recommended Priority

1. Apply quantization (easiest, highest impact)
2. vLLM + Continuous Batching (switch serving engine)
3. Enable FlashAttention (usually enabled by default)
4. Prefix Caching (reuse system prompts)
5. Speculative Decoding (large model acceleration)
6. Tensor Parallelism (very large models)

Note: Most of these techniques are complementary because they target different bottlenecks. Quantization, PagedAttention, continuous batching, and prefix caching can all be enabled together, and their gains generally add up.

References

Kwon, W. et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP
Dao, T. (2024). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." ICLR
Leviathan, Y. et al. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML
Frantar, E. et al. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR
Lin, J. et al. (2024). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys

— Data Dynamics Engineering Team