loraqloradorapeftfine-tuning

LoRA, QLoRA, and DoRA Explained — What, When, and Why

A visual guide to the three core PEFT techniques for fine-tuning large models with few resources — LoRA, QLoRA, and DoRA — covering their principles, uses, differences, and code.

Data DynamicsJune 24, 20268 min read

When you want to adapt an LLM with billions of parameters to your own domain, retraining the entire model is impractical for most teams. LoRA, QLoRA, and DoRA are techniques that "train only a tiny fraction of the model yet achieve an effect close to retraining the whole thing." This article explains, with diagrams, what each of the three techniques is for and when to use it, along with their principles, differences, and code.

If the terminology is unfamiliar, it helps to first skim the "Parameter-Efficient Fine-Tuning (PEFT)" entry in the AI Glossary.

Why Is Fine-Tuning Expensive?

Full fine-tuning updates every weight in the model. The problem is GPU memory. During a single training step, all four of the following sit in memory at once:

Model weights (e.g., a 7B model = about 14GB in FP16)
Gradients (same size as the weights)
Optimizer state (Adam uses roughly 2x the weights)
Activations

Roughly 4–6x the weight size of memory is needed. An 80GB GPU is tight for full fine-tuning of a 7B model, and 70B requires several GPUs. On top of that, you have to duplicate and store the entire model per task, so the operational burden is heavy too.

Loading diagram…

The idea that starts here is PEFT (Parameter-Efficient Fine-Tuning). The huge base model is frozen, and only a very small set of additional parameters is trained.

LoRA — Approximating the Weight Change with a Low Rank

Core Idea

Fine-tuning ultimately means adding a change ΔW to the original weights W (W' = W + ΔW). The insight of LoRA (Low-Rank Adaptation) is that this ΔW actually has a low-rank structure. In other words, the large matrix ΔW can be approximated by the product of two small matrices, B·A.

If W is d×d, then A is r×d and B is d×r, where the rank r is usually a very small 4–64. The number of trainable parameters drops dramatically.

Loading diagram…

The base weight W is left untouched, and only the low-rank path B·A attached alongside it is trained. Once training finishes, B·A can be merged into W to form a single weight, so there is no added latency at inference time.

Key Hyperparameters

Parameter	Role	Practical sense
`r` (rank)	Size of the low-rank path	Smaller is lighter, larger gives more expressiveness (usually 8–32)
`lora_alpha`	Scale applied to `B·A` (`alpha/r`)	Often start at 2x `r`
`target_modules`	Layers where LoRA is attached	Attention's `q_proj` and `v_proj` are the default; attaching to all layers raises performance and cost
`lora_dropout`	Overfitting prevention	Around 0.05

Seeing It in Code

With Hugging Face's peft library, it takes only a few lines.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
 
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
 
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
 
model = get_peft_model(base, config)
model.print_trainable_parameters()
# e.g.: trainable params: 6.8M || all params: 8.0B || trainable%: 0.085

You can see the trainable parameters shrink to less than 0.1% of the total.

When to Use LoRA

One base, many tasks: You only need to store a few-MB adapter per task. This is ideal for multi-tenant serving where you swap adapters on a single base model.
Fast, cheap iterative experiments: Training is fast and memory-light, so experiment cycles are short.
Most domain adaptations where full fine-tuning is burdensome: tone, format, injecting domain knowledge, and so on.

QLoRA — Quantize to 4 Bits to Go Bigger

Core Idea

LoRA reduced what gets trained, but the frozen base model itself still sits in memory in full (14GB for 7B). QLoRA (Quantized LoRA) loads this base model quantized to 4 bits. The 14GB shrinks to about 4–5GB, making it possible to fine-tune a large model even on a single consumer GPU.

Loading diagram…

The key point is that the base is stored in 4 bits but is briefly dequantized back to 16 bits only at the moment of actual computation. Gradients flow only through the 16-bit LoRA adapter. The base stays frozen the whole time.

The Three Things That Made QLoRA Possible

NF4 (4-bit NormalFloat): A 4-bit data type optimized for the distribution of neural network weights (close to a normal distribution). It loses less accuracy than ordinary INT4.
Double Quantization: The quantization constants used in quantization are themselves quantized once more, saving additional memory.
Paged Optimizers: At moments of memory spikes, optimizer state is briefly offloaded to the CPU to prevent OOM (out-of-memory) errors.

Seeing It in Code

You load in 4 bits with bitsandbytes and then layer LoRA on top.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
 
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 data type
    bnb_4bit_use_double_quant=True,     # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
 
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B", quantization_config=bnb, device_map="auto",
)
base = prepare_model_for_kbit_training(base)
 
model = get_peft_model(base, LoraConfig(
    r=16, lora_alpha=32, target_modules="all-linear",
    lora_dropout=0.05, task_type="CAUSAL_LM",
))

When to Use QLoRA

When GPU resources are tight but you need to fine-tune a large model (e.g., 13B–70B on a single 24GB/48GB GPU).
When minimizing cost is the top priority. You accept a slight quantization loss in exchange for large resource savings.
That said, if inference quality is extremely important and you have ample GPU, LoRA without quantization or full fine-tuning may be the safer choice.

DoRA — Separating Magnitude and Direction

Core Idea

LoRA is lightweight, but its training dynamics differ subtly from full fine-tuning, so a performance gap can appear. DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes weights into magnitude and direction. It then applies LoRA only to the direction component and adjusts the magnitude separately with a small, separate trainable vector.

Loading diagram…

Training magnitude and direction separately like this yields a training pattern closer to full fine-tuning, so it tends to reach higher accuracy than LoRA under the same parameter budget. At inference time, like LoRA, it can be merged back so there is no added latency.

Seeing It in Code

In peft, you just flip a single flag in the LoRA configuration.

from peft import LoraConfig
 
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,            # enable DoRA
    task_type="CAUSAL_LM",
)

When to Use DoRA

When you've tried LoRA but the performance is slightly disappointing, and you want to push up quality without greatly increasing parameters.
That said, the added decomposition and computation make training somewhat slower than LoRA. It is a compromise between "LoRA's lightness" and "full fine-tuning's quality."

At-a-Glance Comparison

Item	LoRA	QLoRA	DoRA
One-line definition	Approximate ΔW with low-rank `B·A`	Quantize base to 4 bits + LoRA	Decompose weights into magnitude and direction, then apply LoRA to direction
Main goal	Reduce trainable parameters	Reduce GPU memory (bigger models)	Improve quality over LoRA
Base precision	16-bit (frozen)	4-bit (frozen)	16-bit (frozen)
Memory	Low	Lowest	Low (similar to LoRA)
Training speed	Fast	Fast (slight quantization overhead)	Somewhat slower than LoRA
Inference latency	None when merged	None when merged	None when merged
Typical use	Multi-task, fast experiments	Large models under resource constraints	Upgrade for a LoRA that falls short on quality

The three are a combination, not a competition. In practice, configurations like "QLoRA + DoRA" — layering DoRA on a 4-bit base — are widely used to save memory while also securing quality.

What to Use When

Loading diagram…

To sum up, the decision has two axes. If resources (memory) are short, quantize (QLoRA); if quality is short, decompose (DoRA); if both, combine. With no particular constraints, the starting point is always plain LoRA.

Wrapping Up

All three techniques share the one-line PEFT principle: "freeze the base, train only a small part." From there, LoRA pushes harder on training volume, QLoRA on memory, and DoRA on quality. For most projects, it's enough to start with LoRA, move to QLoRA when resources fall short, and move to DoRA when quality falls short.

You can review more related terms (PEFT, quantization, adapters, rank, and so on) all at once in the AI Glossary.