How LLMs Are Trained: Pretraining to RLHF
A raw transformer architecture is just an empty shell — billions of parameters initialized to random values. Turning it into a capable assistant like Gemma 4 or Claude requires a carefully orchestrated, multi-phase training pipeline. In this lesson we will walk through every phase, from pretraining on raw internet text to alignment with human preferences.
1. Phase 1 — Pretraining
Pretraining is the most expensive and foundational phase. The model learns the statistical structure of language by predicting the next token across trillions of tokens of text. This is self-supervised learning: the training targets come from the text itself, so no human labels are required.
The Next-Token Prediction Objective
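Given a sequence of tokens $x_1, \dots, x_{t-1}$, the model learns to predict the probability of the next token $x_t$. The training objective minimizes the cross-entropy loss over the entire training corpus:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})$$

Here $\theta$ denotes the model parameters and $T$ is the number of tokens in the training sequence.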
In plain English: the model reads text left to right and tries to guess each next word. When it guesses wrong, the loss is high and the weights are updated via backpropagation. Over trillions of predictions, the model learns grammar, facts, reasoning patterns, and even some common sense.
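To make the objective concrete, here is a minimal sketch of the next-token cross-entropy loss in plain Python. The four-word vocabulary and the logits are made up for illustration; a real model would produce one score per vocabulary entry at every position.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average cross-entropy: mean of -log p(correct next token)."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy vocabulary and a 3-token target continuation (illustrative only).
vocab = ["the", "cat", "sat", "mat"]
targets = [1, 2, 3]  # "cat", "sat", "mat"

# Hypothetical model scores at each position, one score per vocab entry.
confident_logits = [
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 4.0],
]
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3

confident_loss = next_token_loss(confident_logits, targets)
uniform_loss = next_token_loss(uniform_logits, targets)
print(f"confident: {confident_loss:.3f}, uniform: {uniform_loss:.3f}")
```

A model that guesses uniformly over four tokens gets a loss of ln(4) ≈ 1.386, while the model that puts most of its probability on the correct token gets a much lower loss. Pretraining pushes the weights toward the second situation.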
Training Data
Pretraining datasets are enormous and diverse. A typical mix includes:
| Source | % of Mix (typical) | What It Teaches |
|---|---|---|
| Web crawl (CommonCrawl, C4) | 50–70% | Broad world knowledge, language patterns |
| Books (BookCorpus, Gutenberg) | 5–10% | Long-form coherence, narrative, style |
| Wikipedia | 3–5% | Factual knowledge, structured information |
| Code (GitHub, StackOverflow) | 10–20% | Programming, logical reasoning |
| Scientific papers (arXiv, PubMed) | 3–5% | Technical reasoning, domain knowledge |
| Multilingual text | 5–15% | Cross-lingual understanding |
| Math datasets | 2–5% | Mathematical reasoning, proofs |
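One way to picture a data mix is as a probability distribution over sources that the data loader samples from. The sketch below assumes the midpoint of each range in the table and normalizes them; the source names and the sampling scheme are illustrative, not a description of any specific lab's pipeline.

```python
import random

# Midpoints of the percentage ranges in the table above (illustrative).
raw_mix = {
    "web": 60.0,
    "books": 7.5,
    "wikipedia": 4.0,
    "code": 15.0,
    "papers": 4.0,
    "multilingual": 10.0,
    "math": 3.5,
}

# The midpoints add up to 104, so normalize them into probabilities.
total = sum(raw_mix.values())
weights = {name: share / total for name, share in raw_mix.items()}

def sample_sources(n, seed=0):
    """Draw the source of each of n training documents from the mixture."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

batch = sample_sources(10_000)
counts = {name: batch.count(name) for name in weights}
for name, count in counts.items():
    print(f"{name:<13}{count / len(batch):6.1%} (target {weights[name]:.1%})")
```

With enough samples, the observed share of each source approaches its target weight, which is how a fixed mix translates into what the model actually sees.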
Compute Requirements
Pretraining is extremely compute-intensive. Some reference points:
| Model | GPUs | Training Time | FLOPs (approx) |
|---|---|---|---|
| GPT-3 (175B) | ~1,000 V100s | ~34 days | 3.14 × 10²³ |
| Llama 2 70B | 2,000 A100-80GB | ~25 days | 1.0 × 10²⁴ |
| Llama 3 405B | 16,384 H100s | ~54 days | 3.8 × 10²⁵ |
| Gemma 4 27B | TPU v5p pods | — | — |
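The FLOPs column can be sanity-checked with a widely used rule of thumb: training compute is roughly 6 × (parameters) × (training tokens). The sketch below applies it using token counts as publicly reported for each model (about 300B for GPT-3, 2T for Llama 2, and 15.6T for Llama 3 405B). The rule is an approximation, so expect agreement only to the right order of magnitude.

```python
def approx_training_flops(n_params, n_tokens):
    """Rule of thumb: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# (parameters, training tokens), approximate publicly reported figures.
runs = {
    "GPT-3 175B": (175e9, 300e9),
    "Llama 2 70B": (70e9, 2e12),
    "Llama 3 405B": (405e9, 15.6e12),
}

estimates = {name: approx_training_flops(n, d) for name, (n, d) in runs.items()}
for name, flops in estimates.items():
    print(f"{name:<13} ~{flops:.2e} FLOPs")
```

The estimates come out near 3.15 × 10²³, 8.4 × 10²³, and 3.79 × 10²⁵, which is in line with the figures in the table.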
After pretraining, the model is a powerful text completer — it can continue any prompt in a fluent, coherent way. But it is not yet an assistant. If you ask it a question, it might continue the prompt with another question, or start writing a Wikipedia-style article instead of answering directly.
2. Phase 2 — Supervised Fine-Tuning (SFT)
SFT transforms a text completer into an instruction follower. The model is trained on curated (instruction, response) pairs where a human (or a stronger model) has written the ideal output.
```python
# Example SFT training pair (in chat format)
sft_example = {
    "messages": [
        {
            "role": "user",
            "content": "Explain photosynthesis in simple terms."
        },
        {
            "role": "assistant",
            "content": "Photosynthesis is how plants make food. They use "
                       "sunlight, water, and carbon dioxide to produce "
                       "glucose (sugar) and oxygen. Think of it as the "
                       "plant's way of cooking — sunlight is the stove, "
                       "water and CO₂ are the ingredients, and glucose "
                       "is the meal."
        }
    ]
}
```
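During SFT, a common (though not universal) choice is to compute the loss only on the assistant's tokens, so the model learns to write responses rather than to imitate user prompts. The sketch below shows the idea with a made-up chat template (`<|user|>`, `<|assistant|>`, `<|end|>`) and a whitespace tokenizer; real models use their own templates and subword tokenizers.

```python
def build_sft_example(messages):
    """Flatten a chat into tokens plus a mask marking assistant tokens.

    Uses a hypothetical template and whitespace tokenization purely
    for illustration.
    """
    tokens, loss_mask = [], []
    for message in messages:
        header = f"<|{message['role']}|>"
        body = message["content"].split() + ["<|end|>"]
        is_target = message["role"] == "assistant"
        # The role header is never a training target.
        tokens.append(header)
        loss_mask.append(0)
        # Only assistant content (and its end marker) contributes to the loss.
        tokens.extend(body)
        loss_mask.extend([1 if is_target else 0] * len(body))
    return tokens, loss_mask

messages = [
    {"role": "user", "content": "Explain photosynthesis in simple terms."},
    {"role": "assistant", "content": "Photosynthesis is how plants make food."},
]
tokens, loss_mask = build_sft_example(messages)
for token, flag in zip(tokens, loss_mask):
    print(f"{flag} {token}")
```

Positions with mask 1 are the ones whose next-token loss would be backpropagated; positions with mask 0 are context only.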
SFT Dataset Characteristics
| Property | Typical Value | Why It Matters |
|---|---|---|
| Dataset size | 10K – 100K examples | Quality > quantity; even 1K high-quality examples can dramatically improve behavior |
| Diversity | Wide range of tasks | The model should generalize to unseen instructions |
| Format | Chat / instruction-response | Teaches the model the turn-taking pattern |
| Source | Human-written or distilled from stronger models | Distillation (e.g., using GPT-4 outputs) is common but has licensing implications |
After SFT, the model can follow instructions and hold conversations. However, it may still produce outputs that are technically correct but not what humans actually prefer — it might be verbose when brevity is wanted, or hedge when a direct answer is expected.
3. Phase 3 — RLHF (Reinforcement Learning from Human Feedback)
RLHF aligns the model's outputs with human preferences. This is the phase that turned GPT-3-family models from clever autocomplete into InstructGPT and, later, ChatGPT. The process has two sub-steps:
Step A: Train a Reward Model
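Human annotators are shown pairs of model outputs for the same prompt and asked to rank which response is better. These preference pairs are used to train a reward model $r_\phi$ that predicts a scalar score for any (prompt, response) pair:

$$\mathcal{L}_{RM}(\phi) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]$$

Where $x$ is the prompt, $y_w$ is the preferred (winning) response, $y_l$ is the rejected (losing) response, and $\sigma$ is the sigmoid function. The reward model learns to assign higher scores to outputs humans prefer.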
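The pairwise objective is easy to see in code. This is a minimal sketch of the standard pairwise (Bradley-Terry style) loss, negative log-sigmoid of the score difference, applied to hypothetical scalar scores; a real reward model would compute those scores with a neural network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_reward_loss(score_chosen, score_rejected):
    """-log sigmoid(r_w - r_l): small when the chosen response scores higher."""
    return -math.log(sigmoid(score_chosen - score_rejected))

# Hypothetical scalar scores from a reward model for one preference pair.
good_ranking = pairwise_reward_loss(score_chosen=2.0, score_rejected=-1.0)
tie = pairwise_reward_loss(score_chosen=0.5, score_rejected=0.5)
bad_ranking = pairwise_reward_loss(score_chosen=-1.0, score_rejected=2.0)

print(f"good: {good_ranking:.3f}  tie: {tie:.3f}  bad: {bad_ranking:.3f}")
```

The loss is near zero when the preferred response already scores well above the rejected one, equals ln(2) for a tie, and grows when the ranking is inverted, so gradient descent pushes the two scores apart in the right direction.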
Step B: Optimize Policy with PPO
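The LLM is then fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward model's score while staying close to the SFT model (to prevent reward hacking):

$$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) \right] - \beta \, D_{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{SFT}(\cdot \mid x) \big)$$

Here $\pi_\theta$ is the policy being trained, $\pi_{SFT}$ is the frozen SFT model, $r_\phi$ is the reward model, and $\beta$ controls the strength of the penalty. The KL-divergence penalty is crucial: without it, the model would learn to exploit quirks in the reward model rather than genuinely improve quality. This is known as reward hacking.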
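The effect of the KL penalty can be illustrated with a small sketch. It covers only the reward shaping (reward-model score minus a scaled estimate of the divergence from the SFT model), not PPO's clipped policy update. The log-probabilities and the value of `beta` are hypothetical.

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus beta times a sampled KL estimate.

    The KL term is estimated from one sampled response as the sum over
    tokens of log pi_policy(token) - log pi_sft(token).
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate

# Hypothetical per-token log-probabilities for one sampled response.
sft_logprobs = [-1.0, -1.2, -0.8]
close_policy = [-0.9, -1.1, -0.8]    # barely moved from the SFT model
drifted_policy = [-0.1, -0.1, -0.1]  # far more confident than the SFT model

# Same reward-model score, different distance from the SFT model.
close_total = kl_penalized_reward(1.0, close_policy, sft_logprobs)
drifted_total = kl_penalized_reward(1.0, drifted_policy, sft_logprobs)
print(f"close: {close_total:.2f}, drifted: {drifted_total:.2f}")
```

Both responses receive the same raw reward, but the policy that has drifted further from the SFT model ends up with a lower total. That is the mechanism that discourages the policy from chasing reward-model quirks.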
4. DPO — A Simpler Alternative to RLHF
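Direct Preference Optimization (DPO) (Rafailov et al., 2023) achieves similar results to RLHF without needing a separate reward model or RL training loop. DPO directly optimizes the language model on preference pairs:

$$\mathcal{L}_{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right]$$

Here $\pi_\theta$ is the model being trained, $\pi_{ref}$ is a frozen reference model (typically the SFT model), $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$, and $\beta$ controls how far the model may move from the reference.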
DPO is simpler to implement, typically more stable to train, and requires less compute. It has become a popular alignment method for open-weights models, including Llama 3.
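Because DPO is an ordinary supervised loss, it fits in a few lines. The sketch below computes the DPO loss for a single preference pair from summed sequence log-probabilities under the trained policy and the frozen reference model. All numbers, and the choice `beta=0.1`, are hypothetical.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical sequence log-probabilities under the frozen reference model.
ref_chosen, ref_rejected = -12.0, -11.0

# Policy identical to the reference: margin is 0, loss is log(2).
start = dpo_loss(-12.0, -11.0, ref_chosen, ref_rejected)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-9.0, -14.0, ref_chosen, ref_rejected)
print(f"start: {start:.3f}, improved: {improved:.3f}")
```

The loss falls as the policy raises the likelihood of the preferred response and lowers that of the rejected one, relative to the reference. No reward model or sampling loop is involved, only two forward passes per response.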
| Property | RLHF (PPO) | DPO |
|---|---|---|
| Reward model needed? | Yes (separate model) | No |
| RL training loop? | Yes (complex, unstable) | No (standard supervised loss) |
| Compute cost | High (4 models in memory) | Lower (2 models in memory) |
| Training stability | Harder to tune | More stable |
| Quality | Excellent | Comparable (sometimes better) |
| Used by | OpenAI (ChatGPT), Anthropic, Google (Gemma) | Meta (Llama 3), many open models (e.g., Zephyr, Tülu 2) |
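The "models in memory" row can be turned into rough arithmetic. The sketch below counts weight memory only, assuming bf16 weights (2 bytes per parameter), a hypothetical 7B-parameter model, and reward and value models of the same size as the policy. Optimizer state, gradients, and activations would add substantially to both totals.

```python
def weights_memory_gb(n_params, n_model_copies, bytes_per_param=2):
    """Memory for model weights only (bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param * n_model_copies / 1e9

n_params = 7e9  # hypothetical 7B-parameter model

# PPO: policy, reference, reward model, and value model.
ppo_gb = weights_memory_gb(n_params, n_model_copies=4)
# DPO: policy and frozen reference.
dpo_gb = weights_memory_gb(n_params, n_model_copies=2)
print(f"PPO weights: {ppo_gb:.0f} GB, DPO weights: {dpo_gb:.0f} GB")
```

Under these assumptions, PPO needs about twice the weight memory of DPO before any training overhead is counted.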
5. Constitutional AI (Anthropic's Approach)
Anthropic's Constitutional AI (CAI) approach, used to train Claude, adds a self-improvement loop. Instead of relying solely on human annotators for preference data, the model critiques and revises its own outputs according to a set of principles (the "constitution"):
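- Red-teaming — Prompt the model with adversarial inputs designed to elicit potentially harmful outputs
- Self-critique — Ask the model to identify problems in its own output using constitutional principles
- Revision — Ask the model to rewrite the output to fix the identified problems, then fine-tune the model on the revised outputs
- RL from AI Feedback (RLAIF) — Have a model judge pairs of responses against the constitution, and use these AI-generated preference labels in place of human labels in the same RLHF/DPO pipeline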
This reduces dependence on expensive human annotation and allows encoding specific safety and helpfulness principles directly into training.
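The critique-and-revision steps above can be sketched as a simple loop. Here `generate` stands for any text-in, text-out model call. The prompt wording, the two example principles, and the `stub_model` stand-in are all hypothetical, chosen only so the sketch runs offline; they are not Anthropic's actual prompts or constitution.

```python
def critique_and_revise(generate, prompt, principles):
    """Run one critique-and-revision pass per principle.

    `generate` is any callable mapping a text prompt to a text completion.
    """
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Response: {response}\n"
            f"Identify how this response conflicts with the principle: {principle}"
        )
        response = generate(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # (prompt, revised response) pairs become supervised fine-tuning data.
    return {"prompt": prompt, "revised_response": response}

def stub_model(text):
    """Stand-in for a real LLM call so the sketch runs offline."""
    if "Rewrite the response" in text:
        return "[revised response]"
    if "Identify how" in text:
        return "[critique]"
    return "[initial response]"

principles = ["Avoid harmful or dangerous advice.", "Be honest about uncertainty."]
example = critique_and_revise(stub_model, "A red-team prompt goes here.", principles)
print(example)
```

Swapping `stub_model` for a real model call would yield a dataset of prompts paired with self-revised responses, which is the raw material for the fine-tuning stage.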
6. How Gemma 4 Is Trained
Google's Gemma 4 follows a similar pipeline tailored for open-weights release:
- Pretraining — Trained on a large, curated dataset of web text, code, math, and multilingual content on Google's TPU v5p infrastructure.
- SFT — Fine-tuned on high-quality instruction-response pairs across diverse tasks including coding, reasoning, and creative writing.
- Alignment — Aligned using RLHF with a mix of human and synthetic preference data. The `-it` (instruction-tuned) variants are the aligned models.
- Safety — Extensive red-teaming and safety filtering, with specific attention to reducing harmful outputs while maintaining helpfulness.
```python
import os

from google import genai

client = genai.Client(api_key=os.environ["GOOGLE_AI_STUDIO_API_KEY"])

# The "-it" suffix means "instruction-tuned" (SFT + RLHF aligned).
# An instruction-tuned model follows the instruction directly.
response = client.models.generate_content(
    model="gemma-4-12b-it",
    contents="List three benefits of RLHF alignment in LLMs.",
)

print("=== Gemma 4 12B (instruction-tuned) ===")
print(response.text)

# Base models would simply continue the text as if it were
# a document, not answer the question directly.
# That's why we always use the "-it" variants for applications.
```

Expected output from Gemma 4 (instruction-tuned):
```text
=== Gemma 4 12B (instruction-tuned) ===
Here are three key benefits of RLHF alignment in LLMs:

1. **Improved Helpfulness** — RLHF teaches the model to provide direct, relevant answers rather than generic text completions, making it significantly more useful as an assistant.
2. **Reduced Harmful Outputs** — By learning from human preferences that penalize toxic, biased, or dangerous content, the model becomes safer to deploy in real-world applications.
3. **Better Calibration** — RLHF-aligned models learn to express uncertainty appropriately, say "I don't know" when applicable, and avoid confidently stating incorrect information.
```
7. The Complete Training Pipeline — Summary
| Phase | Data | Objective | Result | Approx. cost (USD) |
|---|---|---|---|---|
| Pretraining | Trillions of tokens (unlabeled) | Next-token prediction | Fluent text completer | 100M+ |
| SFT | 10K – 100K instruction pairs | Supervised cross-entropy | Instruction follower | 50K |
| RLHF / DPO | 100K+ preference pairs | Maximize reward / DPO loss | Aligned assistant | 500K |
"Pretraining gives the model knowledge. SFT teaches it manners. RLHF teaches it judgment."