Typical layer · one attention projection in a 7B LLaMA-style model is a 4096 × 4096 weight matrix W ≈ 16.8M parameters, and there are hundreds of such matrices.
Full fine-tune. Trainable = every weight: all 7B parameters need gradients, optimizer state, and a full-size checkpoint.
LoRA at rank r · freeze W and learn a low-rank update ΔW = B·A, with B of shape 4096 × r and A of shape r × 4096 → only 2 · 4096 · r trainable parameters per matrix (65,536 at r = 8).
Savings · with r = 8 on q_proj and v_proj across 32 layers: 32 × 2 × 65,536 ≈ 4.2M trainable parameters, about 0.06% of 7B.
| Method | Trainable params | Ratio | Disk |
|---|---|---|---|
| Full fine-tune | 7B | 100% | 14 GB |
| LoRA · r=8 | ~4M | 0.06% | 8 MB |
| LoRA · r=64 | ~33M | 0.47% | 66 MB |
| QLoRA · 4-bit base + r=8 | ~4M | 0.06% | 8 MB + 3.5 GB base |
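A quick sanity check of the r=8 row, assuming a LLaMA-7B-style shape (32 layers, hidden size 4096) and adapters on q_proj and v_proj only, matching the PEFT snippet below:

```python
# LoRA parameter count, back-of-the-envelope (assumed 7B shape: 32 layers, hidden 4096).
layers, hidden, r = 32, 4096, 8
per_matrix = 2 * hidden * r            # B (4096 x r) + A (r x 4096)
trainable = layers * 2 * per_matrix    # two target matrices (q_proj, v_proj) per layer
print(trainable)                       # 4,194,304 -> ~0.06% of 7B, ~8 MB in bf16
```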
Ship the 8 MB adapter alongside the public 7B base. Everyone downloads the base once; each task is just an adapter swap. This changed the open-source LLM ecosystem.
```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # where to inject the adapters
    lora_dropout=0.05,
    bias="none",
)

# base_model: any Hugging Face causal LM already loaded (e.g. a 7B model in bf16)
model = get_peft_model(base_model, lora_cfg)
model.print_trainable_parameters()
# trainable params: 4,194,304  all: 7,000,000,000  trainable%: 0.06%
```
The training loop is identical to normal SFT: only 0.06% of the parameters receive gradients; everything else stays frozen.
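To make the adapter swap described above concrete, a minimal sketch using the standard PEFT save/load API; the adapter directory name is just a placeholder:

```python
from peft import PeftModel

# After fine-tuning: write only the LoRA weights (~8 MB), not the 7B base.
model.save_pretrained("my-task-adapter")   # placeholder path

# Anyone who already has the shared base model attaches the adapter in one line.
tuned = PeftModel.from_pretrained(base_model, "my-task-adapter")
```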
Analogy. A photo .bmp (millions of colours) → save as .gif (256-colour palette). The GIF maps each pixel to its closest palette colour. Looks similar, vastly smaller.
QLoRA does this to weights. bf16 stores each weight in 16 bits (65,536 possible values); a 4-bit format keeps only a 16-value codebook, a 4× reduction in storage.
NF4 ("NormalFloat 4-bit") · a clever choice of those 16 codebook values, spaced to match the typical bell-shaped distribution of NN weights.
Per parameter: 16 bits → 4 bits, so the frozen 7B base shrinks from ~14 GB to ~3.5 GB.
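A toy sketch of block-wise codebook quantization. The evenly spaced codebook here is illustrative only; the real NF4 codebook places its 16 levels to match a normal weight distribution, with more levels near zero:

```python
import torch

def quantize_4bit(w: torch.Tensor, codebook: torch.Tensor):
    scale = w.abs().max()                                              # absmax scaling per block
    idx = ((w / scale).unsqueeze(-1) - codebook).abs().argmin(dim=-1)  # nearest codebook entry
    return idx.to(torch.uint8), scale                                  # 4-bit indices + one scale

def dequantize_4bit(idx: torch.Tensor, scale: torch.Tensor, codebook: torch.Tensor):
    return codebook[idx.long()] * scale                                # map indices back to values

codebook = torch.linspace(-1.0, 1.0, 16)   # illustrative; NOT the actual NF4 levels
w = torch.randn(4096)                      # one block of full-precision weights
idx, scale = quantize_4bit(w, codebook)
w_hat = dequantize_4bit(idx, scale, codebook)
print((w - w_hat).abs().mean())            # small reconstruction error, 4x less memory
```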
Standard LoRA (bf16 base). The frozen 7B base sits in GPU memory at 2 bytes per weight ≈ 14 GB, before adapters, activations, and optimizer state.
QLoRA (4-bit base). The same frozen base is stored in NF4 at ~0.5 bytes per weight ≈ 3.5 GB, which is what lets a 70B model fit on a single ≤ 48 GB consumer GPU.
Mechanism. Base weights stay quantized in NF4 and are dequantized to bf16 on the fly for each forward pass; the LoRA adapters are kept in bf16 and are the only parameters that receive gradients, so the quantized base is never updated.
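A hedged sketch of the usual Hugging Face setup (transformers + bitsandbytes + peft). BASE_MODEL_ID is a placeholder for whichever checkpoint you use, and lora_cfg is the config from the earlier snippet:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model

BASE_MODEL_ID = "your-7b-base-model"        # placeholder checkpoint name

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 codebook
    bnb_4bit_use_double_quant=True,         # also quantize the per-block scales
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for each matmul
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID, quantization_config=bnb_cfg, device_map="auto"
)
model = get_peft_model(base_model, lora_cfg)  # bf16 LoRA adapters on a 4-bit frozen base
```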
Used everywhere in open-source LLM fine-tuning since 2023.
The alignment loop that made ChatGPT
SFT teaches the model one way to respond to each prompt. But for open-ended questions, many responses are acceptable — and you want the model to prefer the best one.
"Explain quantum entanglement to a 12-year-old."
SFT would weight all three equally if all are in the dataset. RLHF adds preference learning.
SFT is teaching a dog the command "sit." One correct action; demonstrate, repeat.
RLHF is teaching a dog to choose between options · "bark or stay quiet when the doorbell rings" · we reward good choices, penalize bad ones.
By showing the model pairs of options (good answer · bad answer) and letting it learn from the preference, we shape general behavior instead of having it memorize a single correct response.
The analogy: train a dog to fetch. Reward it for bringing the ball back, but keep the training gentle enough that it doesn't wreck everything else it already knows.
RLHF balances both goals · maximize reward, but stay near the SFT model:

$$\max_\theta \;\; \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)$$

Symbols: π_θ · the policy being trained; π_ref · the frozen SFT model; r_φ · the reward model's score; β · strength of the KL penalty.
Part 1 drives the model to produce high-reward answers. Part 2 keeps the policy close to its SFT initialization → prevents the dog from tearing up the garden.
Prompt. "Explain gravity to a 5-year-old."
Reward.
KL penalty. Contribution from this example
Total objective.
PPO updates
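A minimal sketch of that per-example computation, assuming the reward model's score and the per-token log-probabilities of the sampled answer under both models are already available (the full PPO machinery of advantages, clipping, and value heads is omitted):

```python
import torch

def rlhf_objective(reward: torch.Tensor,
                   policy_logprobs: torch.Tensor,   # log pi_theta of each sampled token
                   ref_logprobs: torch.Tensor,      # log pi_ref of the same tokens
                   beta: float = 0.1) -> torch.Tensor:
    # Sequence-level KL estimate: summed log-prob gap between policy and SFT reference.
    kl = (policy_logprobs - ref_logprobs).sum()
    # Part 1 (reward) minus part 2 (KL penalty); PPO steps theta to increase this.
    return reward - beta * kl

obj = rlhf_objective(torch.tensor(0.9),
                     torch.tensor([-1.2, -0.8, -0.5]),
                     torch.tensor([-1.3, -0.9, -0.5]))
print(obj)  # 0.9 - 0.1 * 0.2 = 0.88
```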
Reward hacking. The policy finds ways to game the RM that humans wouldn't approve of. Classic example: the RM rewards longer answers, so the policy learns to produce verbose output regardless of content.
Mode collapse. PPO can push the policy to always produce very similar responses — diversity loss.
Sycophancy. If human labelers preferred agreeable answers, the model learns to agree with whatever the user says, even when wrong.
Mitigating these is half the art of alignment.
Bypass the reward model entirely
Rafailov et al. 2023 · "Direct Preference Optimization: Your Language Model is Secretly a Reward Model."
Insight: the RLHF objective has a closed-form optimal policy given the reward. Work backwards — derive a loss directly on preference pairs, skipping the RM and PPO entirely.
RLHF · two sodas → judge scores each 1–10 → use scores to declare a winner. Indirect.
DPO · just ask "A or B?" → direct preference.
DPO writes a loss that says · "directly increase the probability of the winner and decrease the probability of the loser, relative to the frozen reference model."
Build it from the inside out:

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left( \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right] \right)$$

Each log-ratio is the policy's implicit reward for a response; the loss is logistic regression on "the winner's implicit reward should exceed the loser's."
Prompt · "Suggest a coffee shop name." Winner
| 0.05 | 0.06 | |
| 0.10 | 0.12 |
Winner score.
Loser score.
Difference =
Loss =
Gradient pushes
Pure supervised loss. No RL loop. No reward model. ~50 lines vs RLHF's thousands.
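Roughly what those ~50 lines reduce to, sketched for a single preference pair with precomputed sequence-level log-probabilities:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: how much more (or less) likely each response is under
    # the policy than under the frozen reference model.
    winner_score = beta * (policy_logp_w - ref_logp_w)
    loser_score = beta * (policy_logp_l - ref_logp_l)
    # Logistic loss on "winner should out-score loser".
    return -F.logsigmoid(winner_score - loser_score)

# The coffee-shop numbers above: both log-ratios are equal, so the margin is 0
# and the loss starts at log 2 ~ 0.693 before training pushes the pair apart.
lw, ll = torch.tensor(0.05).log(), torch.tensor(0.10).log()   # pi_theta(winner), pi_theta(loser)
rw, rl = torch.tensor(0.06).log(), torch.tensor(0.12).log()   # pi_ref(winner),  pi_ref(loser)
print(dpo_loss(lw, ll, rw, rl))   # tensor(0.6931)
```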
| | DPO | RLHF |
|---|---|---|
| Training stages | 1 | 2 (RM + PPO) |
| Code complexity | ~50 LoC | ~2000 LoC (PPO, RM, rollout) |
| Compute | 1× | 3–5× |
| Quality at top scale | tied | often slightly ahead |
| Typical adopters | open-source ecosystem | frontier labs |
Open-source (Hugging Face, Mistral, most Llama fine-tunes) · DPO. Frontier labs (Anthropic, OpenAI, Google) · RLHF variants with proprietary tooling. Both work.
Anthropic's 2022 variation (Constitutional AI / RLAIF) · use the model itself (or a stronger one) to judge response pairs against a written "constitution" of principles and generate the preference labels, in place of most human annotation.
Scales human annotation by factors of 100+. Used in Claude's alignment pipeline.
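A hypothetical sketch of one AI-feedback labeling step; judge_model, its generate method, and the two example principles are placeholders, not Anthropic's actual constitution or API:

```python
CONSTITUTION = [
    "Choose the response that is more helpful, honest, and harmless.",
    "Choose the response least likely to encourage dangerous or illegal activity.",
]

def ai_preference_label(judge_model, prompt: str, response_a: str, response_b: str) -> str:
    principle = CONSTITUTION[0]  # in practice a principle is sampled per comparison
    judge_prompt = (
        f"Human: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        f"{principle} Answer with exactly 'A' or 'B'."
    )
    # The judge's single-letter verdict becomes a preference label for DPO/RLHF-style training.
    return judge_model.generate(judge_prompt).strip()
```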
Test-time compute as a new axis
Latest generation · o1, o3, Claude extended thinking, DeepSeek R1.
The core idea: train the model with RL to produce long chains of thought, then let it spend far more tokens "thinking" at inference time, so harder problems get more test-time compute.
Result: dramatically better on math, code, logic benchmarks.
Scaling laws in training compute produced pretrained capability. A new axis — scaling test-time compute — now unlocks reasoning. Both will likely continue.
Traditional RL reward · "was the final answer correct?" · sparse, late signal.
Process reward model (PRM) · grades individual reasoning steps · dense signal, catches bad intermediate logic.
| Stage | Input | Output |
|---|---|---|
| PRM training | 100k human-labeled step-by-step proofs | reward-per-step model |
| RL with PRM | sampled chains-of-thought | policy updated to raise per-step reward |
| Result | chains improve step-by-step, not just final | better generalization |
OpenAI's "math-shepherd" (2024) · PRM-trained models beat outcome-reward-only models by 20+ points on AIME. Process > outcome rewards for multi-step reasoning.
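To make the sparse-vs-dense distinction concrete, a small sketch; prm_score_step stands in for a trained process reward model and is a placeholder, not a real API:

```python
from typing import Callable, List

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    # Sparse, late signal: one number for the whole chain, granted only at the end.
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_rewards(steps: List[str], prm_score_step: Callable[[str], float]) -> List[float]:
    # Dense signal: every intermediate step gets its own score, so RL can credit
    # or penalize the exact point where the reasoning goes wrong.
    return [prm_score_step(step) for step in steps]
```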
| Model | AIME 2024 (math) | Codeforces Elo | GPQA (science) |
|---|---|---|---|
| GPT-4 | 12% | ~800 | 39% |
| o1-preview | 44% | ~1500 | 73% |
| o1 | 74% | ~1900 | 78% |
| o3 | 97% | ~2700 (grandmaster) | 88% |
o3 at 97% on AIME · humans gold-medal at ~85%. One year of inference-compute scaling delivered this jump. Same base model class; the training regime changed.
| Scenario | Recommended |
|---|---|
| Small task-specific dataset (< 10k) | SFT on full model |
| Large instruction dataset (100k+) | SFT + LoRA |
| Consumer GPU (≤ 48GB) on 70B model | QLoRA (NF4 + LoRA) |
| Need preference alignment, small team | DPO |
| Need precise control of behaviors | RLHF + custom RM |
| Need safety + low-cost labels | Constitutional AI / RLAIF |
| Need reasoning / math / code | SFT + RL-with-process-rewards (o1 style) |
In the 2026 open-source ecosystem · QLoRA + DPO is the dominant recipe for instruction tuning. Frontier labs mix all of the above.