LoRA, QLoRA, and DoRA Explained — What, When, and Why
A visual guide to the three core PEFT techniques for fine-tuning large models with few resources — LoRA, QLoRA, and DoRA — covering their principles, uses, differences, and code.
When you want to adapt an LLM with billions of parameters to your own domain, retraining the entire model is impractical for most teams. LoRA, QLoRA, and DoRA are techniques that "train only a tiny fraction of the model yet achieve an effect close to retraining the whole thing." This article explains, with diagrams, what each of the three techniques is for and when to use it, along with their principles, differences, and code.
If the terminology is unfamiliar, it helps to first skim the "Parameter-Efficient Fine-Tuning (PEFT)" entry in the AI Glossary.
Why Is Fine-Tuning Expensive?
Full fine-tuning updates every weight in the model. The problem is GPU memory. During a single training step, all four of the following sit in memory at once:
- Model weights (e.g., a 7B model = about 14GB in FP16)
- Gradients (same size as the weights)
- Optimizer state (Adam uses roughly 2x the weights)
- Activations
Roughly 4–6x the weight size of memory is needed. An 80GB GPU is tight for full fine-tuning of a 7B model, and 70B requires several GPUs. On top of that, you have to duplicate and store the entire model per task, so the operational burden is heavy too.
The idea that starts here is PEFT (Parameter-Efficient Fine-Tuning). The huge base model is frozen, and only a very small set of additional parameters is trained.
LoRA — Approximating the Weight Change with a Low Rank
Core Idea
Fine-tuning ultimately means adding a change ΔW to the original weights W (W' = W + ΔW). The insight of LoRA (Low-Rank Adaptation) is that this ΔW actually has a low-rank structure. In other words, the large matrix ΔW can be approximated by the product of two small matrices, B·A.
If W is d×d, then A is r×d and B is d×r, where the rank r is usually a very small 4–64. The number of trainable parameters drops dramatically.
The base weight W is left untouched, and only the low-rank path B·A attached alongside it is trained. Once training finishes, B·A can be merged into W to form a single weight, so there is no added latency at inference time.
Key Hyperparameters
| Parameter | Role | Practical sense |
|---|---|---|
r (rank) | Size of the low-rank path | Smaller is lighter, larger gives more expressiveness (usually 8–32) |
lora_alpha | Scale applied to B·A (alpha/r) | Often start at 2x r |
target_modules | Layers where LoRA is attached | Attention's q_proj and v_proj are the default; attaching to all layers raises performance and cost |
lora_dropout | Overfitting prevention | Around 0.05 |
Seeing It in Code
With Hugging Face's peft library, it takes only a few lines.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
# e.g.: trainable params: 6.8M || all params: 8.0B || trainable%: 0.085You can see the trainable parameters shrink to less than 0.1% of the total.
When to Use LoRA
- One base, many tasks: You only need to store a few-MB adapter per task. This is ideal for multi-tenant serving where you swap adapters on a single base model.
- Fast, cheap iterative experiments: Training is fast and memory-light, so experiment cycles are short.
- Most domain adaptations where full fine-tuning is burdensome: tone, format, injecting domain knowledge, and so on.
QLoRA — Quantize to 4 Bits to Go Bigger
Core Idea
LoRA reduced what gets trained, but the frozen base model itself still sits in memory in full (14GB for 7B). QLoRA (Quantized LoRA) loads this base model quantized to 4 bits. The 14GB shrinks to about 4–5GB, making it possible to fine-tune a large model even on a single consumer GPU.
The key point is that the base is stored in 4 bits but is briefly dequantized back to 16 bits only at the moment of actual computation. Gradients flow only through the 16-bit LoRA adapter. The base stays frozen the whole time.
The Three Things That Made QLoRA Possible
- NF4 (4-bit NormalFloat): A 4-bit data type optimized for the distribution of neural network weights (close to a normal distribution). It loses less accuracy than ordinary INT4.
- Double Quantization: The quantization constants used in quantization are themselves quantized once more, saving additional memory.
- Paged Optimizers: At moments of memory spikes, optimizer state is briefly offloaded to the CPU to prevent OOM (out-of-memory) errors.
Seeing It in Code
You load in 4 bits with bitsandbytes and then layer LoRA on top.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
bnb = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NF4 data type
bnb_4bit_use_double_quant=True, # double quantization
bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-70B", quantization_config=bnb, device_map="auto",
)
base = prepare_model_for_kbit_training(base)
model = get_peft_model(base, LoraConfig(
r=16, lora_alpha=32, target_modules="all-linear",
lora_dropout=0.05, task_type="CAUSAL_LM",
))When to Use QLoRA
- When GPU resources are tight but you need to fine-tune a large model (e.g., 13B–70B on a single 24GB/48GB GPU).
- When minimizing cost is the top priority. You accept a slight quantization loss in exchange for large resource savings.
- That said, if inference quality is extremely important and you have ample GPU, LoRA without quantization or full fine-tuning may be the safer choice.
DoRA — Separating Magnitude and Direction
Core Idea
LoRA is lightweight, but its training dynamics differ subtly from full fine-tuning, so a performance gap can appear. DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes weights into magnitude and direction. It then applies LoRA only to the direction component and adjusts the magnitude separately with a small, separate trainable vector.
Training magnitude and direction separately like this yields a training pattern closer to full fine-tuning, so it tends to reach higher accuracy than LoRA under the same parameter budget. At inference time, like LoRA, it can be merged back so there is no added latency.
Seeing It in Code
In peft, you just flip a single flag in the LoRA configuration.
from peft import LoraConfig
config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
use_dora=True, # enable DoRA
task_type="CAUSAL_LM",
)When to Use DoRA
- When you've tried LoRA but the performance is slightly disappointing, and you want to push up quality without greatly increasing parameters.
- That said, the added decomposition and computation make training somewhat slower than LoRA. It is a compromise between "LoRA's lightness" and "full fine-tuning's quality."
At-a-Glance Comparison
| Item | LoRA | QLoRA | DoRA |
|---|---|---|---|
| One-line definition | Approximate ΔW with low-rank B·A | Quantize base to 4 bits + LoRA | Decompose weights into magnitude and direction, then apply LoRA to direction |
| Main goal | Reduce trainable parameters | Reduce GPU memory (bigger models) | Improve quality over LoRA |
| Base precision | 16-bit (frozen) | 4-bit (frozen) | 16-bit (frozen) |
| Memory | Low | Lowest | Low (similar to LoRA) |
| Training speed | Fast | Fast (slight quantization overhead) | Somewhat slower than LoRA |
| Inference latency | None when merged | None when merged | None when merged |
| Typical use | Multi-task, fast experiments | Large models under resource constraints | Upgrade for a LoRA that falls short on quality |
The three are a combination, not a competition. In practice, configurations like "QLoRA + DoRA" — layering DoRA on a 4-bit base — are widely used to save memory while also securing quality.
What to Use When
To sum up, the decision has two axes. If resources (memory) are short, quantize (QLoRA); if quality is short, decompose (DoRA); if both, combine. With no particular constraints, the starting point is always plain LoRA.
Wrapping Up
All three techniques share the one-line PEFT principle: "freeze the base, train only a small part." From there, LoRA pushes harder on training volume, QLoRA on memory, and DoRA on quality. For most projects, it's enough to start with LoRA, move to QLoRA when resources fall short, and move to DoRA when quality falls short.
You can review more related terms (PEFT, quantization, adapters, rank, and so on) all at once in the AI Glossary.