Interactive Explainer
Full fine-tuning of a 70B model touches 70 billion parameters. LoRA touches 200 million. The quality is often indistinguishable, and the checkpoint is 300× smaller. Slide the rank, and see the numbers move.
The LoRA insight (Hu et al., 2021): the change you want to apply to a pretrained weight is typically low-rank. You don't need 70B parameters of update budget — you need a tiny correction. So instead of fine-tuning W directly, replace it with W + BA, where B and A are two small trainable matrices and W stays frozen.
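To make the budget reduction concrete, here is a quick parameter count for a single square projection matrix (the 4096 hidden size is a hypothetical value, roughly that of a 7B-class model; the helper name is ours):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # B is d_out x r, A is r x d_in, so the update BA has rank <= r
    return d_out * r + r * d_in

d = 4096                       # hypothetical hidden size of one projection
full = d * d                   # updating W directly
lora = lora_params(d, d, 8)    # updating only B and A at rank 8
print(full, lora, full // lora)  # 16777216 65536 256
```

At rank 8 the update budget for this one matrix drops by a factor of d / (2r) = 256, and the same ratio repeats at every layer LoRA touches.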
| Method | Trainable params | Ratio vs full | Adapter size on disk |
|---|---|---|---|
| Full fine-tune | – | 1.00× | – |
| LoRA · r=8 | – | – | – |
| QLoRA (4-bit base + r=8) | – | – | – |
The magic: LoRA adapters at r=8 for a 7B model amount to ~4M trainable parameters. You can fine-tune on a single consumer GPU (24 GB of VRAM), save the adapter as a ~16 MB file, and ship it alongside the base weights.
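The ~16 MB figure is just the parameter count times the bytes per parameter. A back-of-the-envelope sketch, assuming the adapter is stored in fp32 (halve it for fp16/bf16 checkpoints):

```python
params = 4.2e6         # trainable LoRA params at r=8 on a 7B model
bytes_per_param = 4    # fp32 storage; 2 for fp16/bf16
size_mb = params * bytes_per_param / 1e6
print(f"{size_mb:.0f} MB")  # 17 MB
```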
y = W₀x + BAx,  with rank(BA) ≤ r
W₀ is frozen. A ∈ ℝ^{r×d} is initialized from a Gaussian; B ∈ ℝ^{d×r} is initialized to zero, so at t=0 the update BA is zero and the model is exactly the pretrained one. Only A and B receive gradients.
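The forward pass above can be sketched in a few lines of NumPy (shapes follow the definitions just given; most implementations also scale the update by α/r, which we omit here to match the equation):

```python
import numpy as np

d, r = 16, 4
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, d))        # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # Gaussian init
B = np.zeros((d, r))                    # zero init, so BA = 0 at t = 0

x = rng.standard_normal(d)
y = W0 @ x + B @ (A @ x)                # equals W0 @ x at initialization

assert np.allclose(y, W0 @ x)
```

Once B receives its first gradient update, BA becomes nonzero and y starts to diverge from the frozen model's output.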
```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# example 7B checkpoint; swap in any causal LM
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=8,                                  # rank
    lora_alpha=16,                        # scale
    target_modules=["q_proj", "v_proj"],  # where to inject
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(base_model, lora_cfg)
model.print_trainable_parameters()
# > trainable params: 4.2M  all params: 7B  trainable%: 0.06%
```
QLoRA (Dettmers et al., 2023) quantizes the frozen base model to 4-bit NormalFloat (NF4), then trains LoRA adapters in 16-bit precision on top. This makes it possible to fine-tune a 65B model on a single 48 GB GPU, matching 16-bit LoRA quality at a fraction of the memory.
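With the Hugging Face stack, the QLoRA recipe is the same peft setup with a 4-bit quantization config on the base model. A sketch, assuming `bitsandbytes` is installed and using a placeholder checkpoint name:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base (the QLoRA recipe)
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # adapters train in 16-bit
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder; any causal LM works
    quantization_config=bnb_cfg,
)
model = get_peft_model(base, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
))
```

Only the tiny fp16/bf16 adapter weights receive gradients; the 4-bit base is never updated, which is where the memory savings come from.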
Part of the ES 667 Deep Learning course · IIT Gandhinagar · Aug 2026.