
How LLMs Are Trained: Pretraining to RLHF


A raw transformer architecture is just an empty shell — billions of parameters initialized to random values. Turning it into a capable assistant like Gemma 4 or Claude requires a carefully orchestrated, multi-phase training pipeline. In this lesson we will walk through every phase, from pretraining on raw internet text to alignment with human preferences.

1. Phase 1 — Pretraining

Pretraining is the most expensive and foundational phase. The model learns the statistical structure of language by predicting the next token across trillions of tokens of text. This is self-supervised learning: the training targets come from the text itself, so no human labels are required.

The Next-Token Prediction Objective

Given a sequence of tokens $x_{1}, x_{2}, \dots, x_{t-1}$, the model learns to predict the probability of the next token $x_{t}$. The training objective minimizes the cross-entropy loss over the entire training corpus:

$$L_{pretrain} = -\sum_{t=1}^{T} \log P_{\theta}(x_{t} \mid x_{1}, x_{2}, \dots, x_{t-1})$$

In plain English: the model reads text left to right and tries to guess each next token. The loss is high whenever the model assigns low probability to the token that actually comes next. Backpropagation computes the gradient of that loss, and the optimizer adjusts the weights so the correct token becomes more likely. Over trillions of predictions, the model learns grammar, facts, reasoning patterns, and even some common sense.
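To make the objective concrete, here is a minimal sketch in plain Python. The three-token vocabulary, the logit values, and the function names are all made up for illustration; a real model produces logits over tens of thousands of tokens and averages the loss over large batches.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_step, target_ids):
    """Sum of -log P(x_t | x_1..x_{t-1}) over all positions."""
    loss = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        loss += -math.log(probs[target])
    return loss

# Toy setup: vocabulary of 3 tokens, two prediction steps.
targets = [0, 1]
confident_right = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
confident_wrong = [[0.0, 5.0, 0.0], [5.0, 0.0, 0.0]]

low_loss = next_token_loss(confident_right, targets)
high_loss = next_token_loss(confident_wrong, targets)
print(f"loss when the model is right: {low_loss:.3f}")
print(f"loss when the model is wrong: {high_loss:.3f}")
```

The loss is small when most of the probability mass sits on the true next token and grows quickly when the model is confidently wrong, which is the signal gradient descent uses.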

Training Data

Pretraining datasets are enormous and diverse. A typical mix includes:

Source	% of Mix (typical)	What It Teaches
Web crawl (CommonCrawl, C4)	50–70%	Broad world knowledge, language patterns
Books (BookCorpus, Gutenberg)	5–10%	Long-form coherence, narrative, style
Wikipedia	3–5%	Factual knowledge, structured information
Code (GitHub, StackOverflow)	10–20%	Programming, logical reasoning
Scientific papers (arXiv, PubMed)	3–5%	Technical reasoning, domain knowledge
Multilingual text	5–15%	Cross-lingual understanding
Math datasets	2–5%	Mathematical reasoning, proofs
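One way to picture a data mix is as a weighted sampler that decides which source each training document comes from. The sketch below assumes one specific set of weights chosen from inside the ranges in the table; the weights, source names, and `sample_sources` helper are illustrative, not the recipe of any particular model.

```python
import random
from collections import Counter

# Assumed weights, each picked from inside the table's typical range.
MIXTURE = {
    "web_crawl": 0.60,
    "books": 0.07,
    "wikipedia": 0.04,
    "code": 0.15,
    "scientific_papers": 0.04,
    "multilingual": 0.07,
    "math": 0.03,
}

def sample_sources(mixture, n, seed=0):
    """Draw n source names in proportion to their mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

draws = sample_sources(MIXTURE, 10_000)
counts = Counter(draws)
for name in MIXTURE:
    print(f"{name:18s} {counts[name] / len(draws):.1%}")
```

With enough draws, the empirical shares approach the configured weights.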

Compute Requirements

Pretraining is extremely compute-intensive. Some reference points:

Model	GPUs	Training Time	FLOPs (approx)
GPT-3 (175B)	~1,024 A100s (later estimate; the original run used V100s)	~34 days (estimated)	3.14 × 10²³
Llama 2 70B	2,000 A100-80GB	~25 days	1.0 × 10²⁴
Llama 3 405B	16,384 H100s	~54 days	3.8 × 10²⁵
Gemma 4 27B	TPU v5p pods	—	—
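The FLOPs column can be sanity-checked with a widely used rule of thumb: training compute is roughly 6 × parameters × training tokens. The sketch below applies it to two rows of the table. The token counts (about 300B for GPT-3 and about 15.6T for Llama 3 405B) are assumptions taken from those models' public reports, not from this lesson.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# (parameters, training tokens); token counts are assumed from public reports.
runs = {
    "GPT-3 (175B)": (175e9, 300e9),
    "Llama 3 405B": (405e9, 15.6e12),
}

for name, (params, tokens) in runs.items():
    print(f"{name}: {training_flops(params, tokens):.2e} FLOPs")
```

Both estimates land close to the figures in the table, which suggests the rule is a reasonable first approximation for dense transformer models.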

After pretraining, the model is a powerful text completer — it can continue any prompt in a fluent, coherent way. But it is not yet an assistant. If you ask it a question, it might continue the prompt with another question, or start writing a Wikipedia-style article instead of answering directly.

2. Phase 2 — Supervised Fine-Tuning (SFT)

SFT transforms a text completer into an instruction follower. The model is trained on curated (instruction, response) pairs where a human (or a stronger model) has written the ideal output.

```python
# Example SFT training pair (in chat format)
sft_example = {
    "messages": [
        {
            "role": "user",
            "content": "Explain photosynthesis in simple terms."
        },
        {
            "role": "assistant",
            "content": "Photosynthesis is how plants make food. They use "
                       "sunlight, water, and carbon dioxide to produce "
                       "glucose (sugar) and oxygen. Think of it as the "
                       "plant's way of cooking — sunlight is the stove, "
                       "water and CO₂ are the ingredients, and glucose "
                       "is the meal."
        }
    ]
}
```
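During SFT, a chat like the one above is flattened into a single token sequence, and many setups compute the loss only on the assistant's tokens so the model learns to answer rather than to imitate the user. The sketch below illustrates that idea under loud assumptions: the `<|role|>` and `<|end|>` markers are hypothetical (real chat templates are model-specific), whitespace splitting stands in for a real tokenizer, and `build_sft_example` is an illustrative helper.

```python
def build_sft_example(messages):
    """Flatten a chat into tokens plus a loss mask (1 = train on this token)."""
    tokens, loss_mask = [], []
    for message in messages:
        is_assistant = message["role"] == "assistant"
        # Hypothetical role marker; it is context, never a prediction target.
        tokens.append(f"<|{message['role']}|>")
        loss_mask.append(0)
        # Whitespace split stands in for a real tokenizer.
        body = message["content"].split() + ["<|end|>"]
        tokens += body
        # Assistant text and its end marker are targets; user text is not.
        loss_mask += [int(is_assistant)] * len(body)
    return tokens, loss_mask

messages = [
    {"role": "user", "content": "Explain photosynthesis in simple terms."},
    {"role": "assistant", "content": "Photosynthesis is how plants make food."},
]

tokens, loss_mask = build_sft_example(messages)
for token, flag in zip(tokens, loss_mask):
    print(flag, token)
```

The training loss is the same cross-entropy used in pretraining, summed only over positions where the mask is 1.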

SFT Dataset Characteristics

Property	Typical Value	Why It Matters
Dataset size	10K – 100K examples	Quality > quantity; even 1K high-quality examples can dramatically improve behavior
Diversity	Wide range of tasks	The model should generalize to unseen instructions
Format	Chat / instruction-response	Teaches the model the turn-taking pattern
Source	Human-written or distilled from stronger models	Distillation (e.g., using GPT-4 outputs) is common but has licensing implications

After SFT, the model can follow instructions and hold conversations. However, it may still produce outputs that are technically correct but not what humans actually prefer — it might be verbose when brevity is wanted, or hedge when a direct answer is expected.

3. Phase 3 — RLHF (Reinforcement Learning from Human Feedback)

RLHF aligns the model's outputs with human preferences. This is the phase that turned GPT-3 from a clever autocomplete into ChatGPT. The process has two sub-steps:

Step A: Train a Reward Model

Human annotators are shown pairs of model outputs for the same prompt and asked to rank which response is better. These preference pairs are used to train a reward model $R_{ϕ}$ that predicts a scalar score for any (prompt, response) pair:

$$L_{reward} = -\log \sigma\left(R_{\phi}(x, y_{w}) - R_{\phi}(x, y_{l})\right)$$

Where $y_{w}$ is the preferred (winning) response and $y_{l}$ is the rejected (losing) response. The reward model learns to assign higher scores to outputs humans prefer.
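A small numeric sketch of this pairwise loss may help. The scores below are made-up reward-model outputs, and `reward_pair_loss` is an illustrative name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward_pair_loss(score_chosen, score_rejected):
    """-log sigma(R(x, y_w) - R(x, y_l)) for a single preference pair."""
    return -math.log(sigmoid(score_chosen - score_rejected))

# (score for preferred response, score for rejected response)
pairs = [(2.0, -1.0), (0.0, 0.0), (-1.0, 2.0)]
for chosen, rejected in pairs:
    loss = reward_pair_loss(chosen, rejected)
    print(f"chosen={chosen:+.1f} rejected={rejected:+.1f} loss={loss:.3f}")
```

The loss depends only on the score difference: it shrinks as the preferred response is scored further above the rejected one, and it grows when the ranking is inverted.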

Step B: Optimize Policy with PPO

The LLM is then fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward model's score while staying close to the SFT model (to prevent reward hacking):

$$L_{RLHF} = \mathbb{E}_{x,\, y \sim \pi_{\theta}}\left[R_{\phi}(x, y)\right] - \beta \, \mathrm{KL}\left[\pi_{\theta} \,\|\, \pi_{SFT}\right]$$

The KL-divergence penalty, scaled by the coefficient $\beta$, is crucial: without it, the model would learn to exploit quirks in the reward model rather than genuinely improve quality, a failure mode known as reward hacking. Unlike the losses above, this objective is maximized.
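The trade-off can be sketched for a single sampled response. For a response drawn from the policy, the log-probability gap between the policy and the SFT model is a one-sample estimate of the KL term. The numbers, the default `beta=0.1`, and the function name below are assumptions for illustration only.

```python
def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.1):
    """One-sample estimate of R(x, y) - beta * KL(pi_theta || pi_SFT).

    logp_policy and logp_sft are the log-probabilities that the current
    policy and the frozen SFT model assign to the same sampled response.
    """
    return reward - beta * (logp_policy - logp_sft)

# Same reward-model score, different drift from the SFT model.
close_to_sft = kl_penalized_reward(reward=1.5, logp_policy=-12.0, logp_sft=-12.5)
far_from_sft = kl_penalized_reward(reward=1.5, logp_policy=-4.0, logp_sft=-30.0)

print(f"stays close to SFT: {close_to_sft:.2f}")
print(f"drifts far from SFT: {far_from_sft:.2f}")
```

A response the SFT model considers wildly unlikely gets its reward discounted heavily, which is how the penalty discourages the policy from chasing reward-model quirks.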

4. DPO — A Simpler Alternative to RLHF

Direct Preference Optimization (DPO) (Rafailov et al., 2023) achieves similar results to RLHF without needing a separate reward model or RL training loop. DPO directly optimizes the language model on preference pairs:

$$L_{DPO} = -\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{ref}(y_{w} \mid x)} - \beta \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{ref}(y_{l} \mid x)}\right)$$

DPO is simpler to implement, more stable to train, and requires less compute. It has become the preferred alignment method for many open-source models, including Gemma and Llama variants.
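Because the DPO loss needs only sequence log-probabilities from the policy and a frozen reference model, it can be sketched in a few lines. The log-probability values and the default `beta=0.1` below are made up for illustration; in practice they would come from summing token log-probabilities over each response.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # log(1 + exp(-margin)) is the same quantity as -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference has been learned yet.
at_start = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy raises the chosen response and lowers the rejected one.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
# Policy moves in the wrong direction.
worse = dpo_loss(-12.0, -8.0, -10.0, -10.0)

print(f"start={at_start:.3f} improved={improved:.3f} worse={worse:.3f}")
```

The loss falls as the policy shifts probability toward the preferred response relative to the reference, with no reward model or sampling loop involved.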

Property	RLHF (PPO)	DPO
Reward model needed?	Yes (separate model)	No
RL training loop?	Yes (complex, unstable)	No (standard supervised loss)
Compute cost	High (4 models in memory)	Lower (2 models in memory)
Training stability	Harder to tune	More stable
Quality	Excellent	Comparable (sometimes better)
Used by	OpenAI (ChatGPT), Anthropic	Meta (Llama), Google (Gemma), many open models

5. Constitutional AI (Anthropic's Approach)

Anthropic's Constitutional AI (CAI) approach, used to train Claude, adds a self-improvement loop. Instead of relying solely on human annotators for preference data, the model critiques and revises its own outputs according to a set of principles (the "constitution"):

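Red-teaming: prompt the model with adversarial inputs to elicit potentially harmful outputs
Self-critique: ask the model to identify problems in its own output using constitutional principles
Revision: ask the model to rewrite the output to fix the identified problems, then fine-tune on the revised outputs (supervised)
RL from AI Feedback (RLAIF): have the model judge pairs of responses against the constitution, then use those AI-generated preference labels in place of human ones in the same RLHF/DPO pipeline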
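The critique and revision steps above can be sketched as a simple loop. Everything here is hypothetical: the prompt wording, the `critique_and_revise` helper, and the `toy_model` stub that stands in for a real LLM call so the example runs offline. It is not Anthropic's actual implementation.

```python
def critique_and_revise(prompt, generate, principles):
    """Run one critique/revision pass per principle; return the final draft."""
    draft = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the response below against this principle: {principle}\n"
            f"Response: {draft}"
        )
        draft = generate(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft

def toy_model(text):
    """Stand-in for an LLM call (hypothetical behavior, for illustration)."""
    if text.startswith("Critique"):
        return "The response could be more careful."
    if text.startswith("Rewrite"):
        return "[revised] " + text.rsplit("Response: ", 1)[1]
    return "first draft"

final = critique_and_revise(
    "Some red-team prompt", toy_model, ["Be harmless.", "Be honest."]
)
print(final)
```

In a real pipeline, the (prompt, final draft) pairs would become supervised fine-tuning data before the AI-feedback preference stage.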

This reduces dependence on expensive human annotation and allows encoding specific safety and helpfulness principles directly into training.

6. How Gemma 4 Is Trained

Google's Gemma 4 follows a similar pipeline tailored for open-weights release:

Pretraining — Trained on a large, curated dataset of web text, code, math, and multilingual content on Google's TPU v5p infrastructure.
SFT — Fine-tuned on high-quality instruction-response pairs across diverse tasks including coding, reasoning, and creative writing.
Alignment — Aligned using RLHF with a mix of human and synthetic preference data. The -it (instruction-tuned) variants are the aligned models.
Safety — Extensive red-teaming and safety filtering, with specific attention to reducing harmful outputs while maintaining helpfulness.

```python
import os
from google import genai

client = genai.Client(api_key=os.environ["GOOGLE_AI_STUDIO_API_KEY"])

# The "-it" suffix means "instruction-tuned" (SFT + RLHF aligned)
# Compare base model vs instruction-tuned behavior:

# Instruction-tuned model follows the instruction directly
response = client.models.generate_content(
    model="gemma-4-12b-it",
    contents="List three benefits of RLHF alignment in LLMs."
)
print("=== Gemma 4 12B (instruction-tuned) ===")
print(response.text)

# Base models would simply continue the text as if it were
# a document, not answer the question directly.
# That's why we always use the "-it" variants for applications.
```


Expected output from Gemma 4 (instruction-tuned); the exact wording will vary from run to run:

```text
=== Gemma 4 12B (instruction-tuned) ===
Here are three key benefits of RLHF alignment in LLMs:

1. **Improved Helpfulness** — RLHF teaches the model to provide
   direct, relevant answers rather than generic text completions,
   making it significantly more useful as an assistant.

2. **Reduced Harmful Outputs** — By learning from human preferences
   that penalize toxic, biased, or dangerous content, the model
   becomes safer to deploy in real-world applications.

3. **Better Calibration** — RLHF-aligned models learn to express
   uncertainty appropriately, say "I don't know" when applicable,
   and avoid confidently stating incorrect information.
```
7. The Complete Training Pipeline — Summary

Phase	Data	Objective	Result	Cost
Pretraining	Trillions of tokens (unlabeled)	Next-token prediction	Fluent text completer	$1M – $100M+
SFT	10K – 100K instruction pairs	Supervised cross-entropy	Instruction follower	$1K – $50K
RLHF / DPO	100K+ preference pairs	Maximize reward / DPO loss	Aligned assistant	$10K – $500K

"Pretraining gives the model knowledge. SFT teaches it manners. RLHF teaches it judgment."
