AIMaks

How LLMs Are Trained: Pretraining to RLHF


A raw transformer architecture is just an empty shell — billions of parameters initialized to random values. Turning it into a capable assistant like Gemma 4 or Claude requires a carefully orchestrated, multi-phase training pipeline. In this lesson we will walk through every phase, from pretraining on raw internet text to alignment with human preferences.

1. Phase 1 — Pretraining

Pretraining is the most expensive and foundational phase. The model learns the statistical structure of language by predicting the next token across trillions of tokens of text. This is self-supervised learning: no human labels are required, only raw text.

The Next-Token Prediction Objective

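Given a sequence of tokens $x_1, \dots, x_{t-1}$, the model learns to predict the probability of the next token, $P_\theta(x_t \mid x_1, \dots, x_{t-1})$. The training objective minimizes the cross-entropy loss over the entire training corpus:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})$$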

In plain English: the model reads text left to right and tries to guess each next word. When it guesses wrong, the loss is high, and the weights are adjusted by gradient descent using gradients computed via backpropagation. Over trillions of predictions, the model learns grammar, facts, reasoning patterns, and even some common sense.
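
To make the objective concrete, here is a minimal sketch of the next-token cross-entropy loss in plain Python. The three-token vocabulary and the logit values are made up for illustration; a real model produces one logit per vocabulary entry at every position.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average of -log P(correct next token) over all positions."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens and two positions (values are made up).
logits = [
    [2.0, 0.5, 0.1],  # the model favours token 0 here
    [0.1, 0.2, 3.0],  # the model favours token 2 here
]
good = next_token_loss(logits, target_ids=[0, 2])  # targets match the model's guesses
bad = next_token_loss(logits, target_ids=[1, 0])   # targets the model considered unlikely
print(f"loss when the guesses are right: {good:.3f}")
print(f"loss when the guesses are wrong: {bad:.3f}")
```

The second loss is larger than the first, which is the signal that drives the weight updates.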

Training Data

Pretraining datasets are enormous and diverse. A typical mix includes:

| Source | Typical share of mix | What it teaches |
|---|---|---|
| Web crawl (CommonCrawl, C4) | 50–70% | Broad world knowledge, language patterns |
| Books (BookCorpus, Gutenberg) | 5–10% | Long-form coherence, narrative, style |
| Wikipedia | 3–5% | Factual knowledge, structured information |
| Code (GitHub, StackOverflow) | 10–20% | Programming, logical reasoning |
| Scientific papers (arXiv, PubMed) | 3–5% | Technical reasoning, domain knowledge |
| Multilingual text | 5–15% | Cross-lingual understanding |
| Math datasets | 2–5% | Mathematical reasoning, proofs |
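
One way to picture such a mixture is as weighted sampling over sources when assembling training batches. The sketch below assumes hypothetical weights picked from inside the ranges above and normalised to sum to 1.0; actual recipes vary by model and are often not published in detail.

```python
import random
from collections import Counter

# Hypothetical mixture weights, chosen from inside the typical ranges
# and normalised to sum to 1.0. Real recipes differ per model.
MIXTURE = {
    "web": 0.60,
    "code": 0.15,
    "books": 0.07,
    "multilingual": 0.07,
    "wikipedia": 0.04,
    "papers": 0.04,
    "math": 0.03,
}

def sample_sources(mixture, n, seed=0):
    """Pick the source of each of n training documents according to the mixture."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

counts = Counter(sample_sources(MIXTURE, n=10_000))
for name, count in counts.most_common():
    print(f"{name:<13}{count / 10_000:.1%}")
```

With enough samples, the observed shares approach the configured weights.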

Compute Requirements

Pretraining is extremely compute-intensive. Some reference points:

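| Model | Hardware | Training time | Training compute (FLOPs, approx.) |
|---|---|---|---|
| GPT-3 (175B) | ~1,000 V100s | ~34 days | 3.14 × 10²³ |
| Llama 2 70B | 2,000 A100-80GB | ~25 days | 1.0 × 10²⁴ |
| Llama 3 405B | 16,384 H100s | ~54 days | 3.8 × 10²⁵ |
| Gemma 4 27B | TPU v5p pods | | |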
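
A common rule of thumb estimates training compute as roughly 6 FLOPs per parameter per training token. The sketch below applies it to the models above. The token counts are assumptions taken from the respective papers rather than from this lesson, and the rule itself is only an approximation.

```python
def training_flops(params, tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens

# (parameter count, training tokens). Token counts are assumptions
# taken from the respective papers, not from this lesson.
runs = {
    "GPT-3 (175B)": (175e9, 300e9),
    "Llama 2 70B": (70e9, 2.0e12),
    "Llama 3 405B": (405e9, 15.6e12),
}
for name, (params, tokens) in runs.items():
    print(f"{name:<14}{training_flops(params, tokens):.2e} FLOPs")
```

The estimates land close to the figures in the table, which suggests that parameter count times token count explains most of the cost.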

After pretraining, the model is a powerful text completer — it can continue any prompt in a fluent, coherent way. But it is not yet an assistant. If you ask it a question, it might continue the prompt with another question, or start writing a Wikipedia-style article instead of answering directly.

2. Phase 2 — Supervised Fine-Tuning (SFT)

SFT transforms a text completer into an instruction follower. The model is trained on curated (instruction, response) pairs where a human (or a stronger model) has written the ideal output.

```python
# Example SFT training pair (in chat format)
sft_example = {
    "messages": [
        {
            "role": "user",
            "content": "Explain photosynthesis in simple terms."
        },
        {
            "role": "assistant",
            "content": "Photosynthesis is how plants make food. They use "
                       "sunlight, water, and carbon dioxide to produce "
                       "glucose (sugar) and oxygen. Think of it as the "
                       "plant's way of cooking — sunlight is the stove, "
                       "water and CO₂ are the ingredients, and glucose "
                       "is the meal."
        }
    ]
}
```
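
Before training, a pair like this is flattened into a single token sequence. A common (though not universal) choice is to compute the loss only on the assistant's tokens, so the model learns to write responses rather than prompts. The sketch below illustrates that idea with hypothetical role markers and a toy whitespace tokenizer; real models each define their own chat template and use subword tokenizers.

```python
# Hypothetical role markers; real models each define their own chat template.
USER_TAG, ASSISTANT_TAG, END_TAG = "<user>", "<assistant>", "<end>"

def build_training_example(messages):
    """Flatten a chat into toy 'tokens' plus a 0/1 loss mask.

    The mask is 1 only for assistant tokens, so the loss ignores the prompt.
    """
    tokens, loss_mask = [], []
    for message in messages:
        is_assistant = message["role"] == "assistant"
        # The role tag itself is never a training target.
        tokens.append(ASSISTANT_TAG if is_assistant else USER_TAG)
        loss_mask.append(0)
        # Toy tokenizer: split on whitespace, then close the turn.
        words = message["content"].split() + [END_TAG]
        tokens.extend(words)
        loss_mask.extend([1 if is_assistant else 0] * len(words))
    return tokens, loss_mask

example = {
    "messages": [
        {"role": "user", "content": "Explain photosynthesis in simple terms."},
        {"role": "assistant", "content": "Photosynthesis is how plants make food."},
    ]
}
tokens, loss_mask = build_training_example(example["messages"])
for token, flag in zip(tokens, loss_mask):
    print(flag, token)
```

Only the rows printed with a 1 would contribute to the supervised cross-entropy loss.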

SFT Dataset Characteristics

| Property | Typical value | Why it matters |
|---|---|---|
| Dataset size | 10K – 100K examples | Quality > quantity; even 1K high-quality examples can dramatically improve behavior |
| Diversity | Wide range of tasks | The model should generalize to unseen instructions |
| Format | Chat / instruction-response | Teaches the model the turn-taking pattern |
| Source | Human-written or distilled from stronger models | Distillation (e.g., using GPT-4 outputs) is common but has licensing implications |

After SFT, the model can follow instructions and hold conversations. However, it may still produce outputs that are technically correct but not what humans actually prefer — it might be verbose when brevity is wanted, or hedge when a direct answer is expected.

3. Phase 3 — RLHF (Reinforcement Learning from Human Feedback)

RLHF aligns the model's outputs with human preferences. This is the phase that helped turn the GPT-3 family of models from a clever autocomplete into ChatGPT. The process has two sub-steps:

Step A: Train a Reward Model

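Human annotators are shown pairs of model outputs for the same prompt and asked to choose which response is better. These preference pairs are used to train a reward model $r_\phi$ that predicts a scalar score for any (prompt, response) pair, typically with a pairwise loss of the form:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$$

Here $x$ is the prompt, $y_w$ is the preferred (winning) response, $y_l$ is the rejected (losing) response, and $\sigma$ is the sigmoid function. The reward model learns to assign higher scores to outputs humans prefer.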
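
A common choice for this step is a Bradley-Terry style pairwise loss, the negative log-sigmoid of the score difference between the preferred and rejected responses. The sketch below computes it for two made-up score pairs; in practice the scores come from a neural network and the loss is averaged over a batch.

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log sigmoid(score_chosen - score_rejected)."""
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Made-up reward scores for one preference pair.
agrees = pairwise_reward_loss(score_chosen=2.0, score_rejected=0.5)
disagrees = pairwise_reward_loss(score_chosen=0.5, score_rejected=2.0)
print(f"reward model ranks the pair like the annotator: {agrees:.3f}")
print(f"reward model ranks the pair the other way:      {disagrees:.3f}")
```

The loss is small when the reward model already scores the preferred response higher, and large when it ranks the pair the wrong way round.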

Step B: Optimize Policy with PPO

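The LLM is then fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward model's score while staying close to the SFT model (to prevent reward hacking):

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{SFT}}(y \mid x)}\right]$$

Here $\pi_\theta$ is the policy being trained, $\pi_{\text{SFT}}$ is the frozen SFT model, $r_\phi(x, y)$ is the reward model's score, and $\beta$ sets the strength of the penalty. The expected log-ratio is the KL divergence between the policy and the SFT model.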

The KL-divergence penalty is crucial — without it, the model would learn to exploit quirks in the reward model rather than genuinely improve quality. This is known as reward hacking.
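
The effect of the penalty can be seen in a small sketch. It assumes the KL term is estimated from a single sampled response as the sum of per-token log-probability differences between the policy and the frozen SFT reference; the function name, the `beta` value, and all numbers are illustrative, not taken from any particular implementation.

```python
def kl_penalized_reward(reward, logprobs_policy, logprobs_reference, beta=0.1):
    """Reward-model score minus beta times a sampled KL estimate.

    The estimate is the sum over response tokens of
    log pi_policy(token) - log pi_reference(token).
    """
    kl_estimate = sum(p - r for p, r in zip(logprobs_policy, logprobs_reference))
    return reward - beta * kl_estimate

# A response the reference model also finds plausible (made-up numbers).
close = kl_penalized_reward(2.0, [-1.0, -1.2, -0.9], [-1.1, -1.3, -1.0])
# A response with a higher raw reward that the reference finds very unlikely.
drifted = kl_penalized_reward(3.0, [-0.1, -0.1, -0.1], [-8.0, -9.0, -10.0])
print(f"stays close to the reference:  {close:.2f}")
print(f"drifts far from the reference: {drifted:.2f}")
```

Even though the drifted response has the higher raw reward, the penalty leaves it with the lower training signal.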

4. DPO — A Simpler Alternative to RLHF

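Direct Preference Optimization (DPO) (Rafailov et al., 2023) achieves similar results to RLHF without needing a separate reward model or RL training loop. DPO directly optimizes the language model on preference pairs:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Here $y_w$ and $y_l$ are the preferred and rejected responses to prompt $x$, $\pi_{\text{ref}}$ is a frozen reference model (usually the SFT model), $\sigma$ is the sigmoid function, and $\beta$ controls how far the policy may move from the reference.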

DPO is simpler to implement, more stable to train, and requires less compute. It has become the preferred alignment method for many open-source models, including Gemma and Llama variants.
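
For a single preference pair, the DPO loss from Rafailov et al. can be computed directly from four sequence log-probabilities. The sketch below uses made-up log-probabilities and an illustrative `beta` of 0.1; a real implementation would obtain these values from the policy and a frozen reference model and average over a batch.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen - ref_chosen        # log pi/pi_ref for the preferred response
    rejected_logratio = policy_rejected - ref_rejected  # log pi/pi_ref for the rejected response
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in an overflow-safe form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Made-up log-probabilities; the reference model is identical in all three cases.
untrained = dpo_loss(-10.0, -12.0, ref_chosen=-10.0, ref_rejected=-12.0)
improved = dpo_loss(-8.0, -14.0, ref_chosen=-10.0, ref_rejected=-12.0)
worse = dpo_loss(-12.0, -10.0, ref_chosen=-10.0, ref_rejected=-12.0)
print(f"policy equals reference:        {untrained:.3f}")
print(f"policy favours the chosen one:  {improved:.3f}")
print(f"policy favours the rejected one: {worse:.3f}")
```

No reward model appears anywhere: the log-ratio against the reference plays the role of an implicit reward.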

| Property | RLHF (PPO) | DPO |
|---|---|---|
| Reward model needed? | Yes (separate model) | No |
| RL training loop? | Yes (complex, unstable) | No (standard supervised loss) |
| Compute cost | High (4 models in memory) | Lower (2 models in memory) |
| Training stability | Harder to tune | More stable |
| Quality | Excellent | Comparable (sometimes better) |
| Used by | OpenAI (ChatGPT), Anthropic | Meta (Llama), Google (Gemma), many open models |
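
The "models in memory" row can be turned into rough arithmetic. In a typical PPO setup the four models are the policy, the frozen reference, the reward model, and a value (critic) model; DPO needs only the policy and the reference. The sketch below assumes a hypothetical 7B-parameter model, assumes all models are the same size, and counts bf16 weights only, ignoring gradients, optimizer state, and activations.

```python
BYTES_PER_PARAM_BF16 = 2  # bf16 stores each weight in 2 bytes

def weights_memory_gb(num_params, num_models):
    """Memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM_BF16 * num_models / 1e9

NUM_PARAMS = 7e9  # hypothetical model size, for illustration only

ppo_gb = weights_memory_gb(NUM_PARAMS, num_models=4)  # policy, reference, reward, value
dpo_gb = weights_memory_gb(NUM_PARAMS, num_models=2)  # policy, reference
print(f"PPO weights: {ppo_gb:.0f} GB")
print(f"DPO weights: {dpo_gb:.0f} GB")
```

Under these assumptions the weight footprint halves, which is part of why DPO is cheaper to run.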

5. Constitutional AI (Anthropic's Approach)

Anthropic's Constitutional AI (CAI) approach, used to train Claude, adds a self-improvement loop. Instead of relying solely on human annotators for preference data, the model critiques and revises its own outputs according to a set of principles (the "constitution"):

  1. Red-teaming — Generate potentially harmful outputs
  2. Self-critique — Ask the model to identify problems in its own output using constitutional principles
  3. Revision — Ask the model to rewrite the output to fix the identified problems
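  4. RL from AI Feedback (RLAIF) — Fine-tune on the revised outputs, then have the model compare pairs of responses against the constitution and use those AI-generated preference labels in place of human ones in the same RLHF/DPO pipeline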

This reduces dependence on expensive human annotation and allows encoding specific safety and helpfulness principles directly into training.
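
The critique-and-revise part of this loop can be sketched as a small function. Everything here is illustrative: `generate` stands for any text-in, text-out model call, the prompt wording and the example principle are hypothetical, and `fake_model` is a stand-in so the sketch runs without a real model.

```python
def critique_and_revise(prompt, generate, principles):
    """Draft a response, then critique and rewrite it once per principle."""
    draft = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the response using this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {draft}"
        )
        draft = generate(
            f"Rewrite the response to address the critique.\n"
            f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}"
        )
    # The (prompt, revised response) pair can then be used as fine-tuning data.
    return {"prompt": prompt, "response": draft}

def fake_model(text):
    """Stand-in for a real model call so the sketch runs offline."""
    if text.startswith("Critique"):
        return "The response could be more careful."
    if text.startswith("Rewrite"):
        return "A more careful response."
    return "An initial, unfiltered response."

pair = critique_and_revise(
    "Tell me something risky.",
    generate=fake_model,
    principles=["Prefer responses that avoid encouraging harm."],
)
print(pair)
```

Swapping `fake_model` for a real model call would yield revised responses that reflect the chosen principles.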

6. How Gemma 4 Is Trained

Google's Gemma 4 follows a similar pipeline tailored for open-weights release:

  • Pretraining — Trained on a large, curated dataset of web text, code, math, and multilingual content on Google's TPU v5p infrastructure.
  • SFT — Fine-tuned on high-quality instruction-response pairs across diverse tasks including coding, reasoning, and creative writing.
  • Alignment — Aligned using RLHF with a mix of human and synthetic preference data. The -it (instruction-tuned) variants are the aligned models.
  • Safety — Extensive red-teaming and safety filtering, with specific attention to reducing harmful outputs while maintaining helpfulness.

```python
import os
from google import genai

client = genai.Client(api_key=os.environ["GOOGLE_AI_STUDIO_API_KEY"])

# The "-it" suffix means "instruction-tuned" (SFT + RLHF aligned)
# Compare base model vs instruction-tuned behavior:

# Instruction-tuned model follows the instruction directly
response = client.models.generate_content(
    model="gemma-4-12b-it",
    contents="List three benefits of RLHF alignment in LLMs."
)
print("=== Gemma 4 12B (instruction-tuned) ===")
print(response.text)

# Base models would simply continue the text as if it were
# a document, not answer the question directly.
# That's why we always use the "-it" variants for applications.
```

Expected output from Gemma 4 (instruction-tuned):

```text
=== Gemma 4 12B (instruction-tuned) ===
Here are three key benefits of RLHF alignment in LLMs:

1. **Improved Helpfulness** — RLHF teaches the model to provide
   direct, relevant answers rather than generic text completions,
   making it significantly more useful as an assistant.

2. **Reduced Harmful Outputs** — By learning from human preferences
   that penalize toxic, biased, or dangerous content, the model
   becomes safer to deploy in real-world applications.

3. **Better Calibration** — RLHF-aligned models learn to express
   uncertainty appropriately, say "I don't know" when applicable,
   and avoid confidently stating incorrect information.
```

7. The Complete Training Pipeline — Summary

| Phase | Data | Objective | Result | Cost (approx., USD) |
|---|---|---|---|---|
| Pretraining | Trillions of tokens (unlabeled) | Next-token prediction | Fluent text completer | 100M+ |
| SFT | 10K – 100K instruction pairs | Supervised cross-entropy | Instruction follower | 50K |
| RLHF / DPO | 100K+ preference pairs | Maximize reward / DPO loss | Aligned assistant | 500K |

"Pretraining gives the model knowledge. SFT teaches it manners. RLHF teaches it judgment."