Exploring LLM Architectures
Every modern LLM is built on the Transformer architecture introduced in the landmark 2017 paper "Attention Is All You Need". But not all Transformers are the same — the choices around attention type, normalization, positional encoding, and whether to use an encoder, decoder, or both create profoundly different models. This lesson takes you inside the architecture so you understand why models behave the way they do.
1. The Self-Attention Mechanism
Self-attention is the core innovation that makes Transformers powerful. Unlike RNNs, which process tokens one at a time, self-attention lets every token attend to every other token in the sequence in parallel. Because any two tokens are connected in a single step, this largely removes the long-range dependency problem that plagued earlier recurrent architectures.
For each token, the model computes three vectors from the input embedding:
- Query (Q) — "What am I looking for?"
- Key (K) — "What do I contain?"
- Value (V) — "What information do I provide?"
These are computed by multiplying the input embedding matrix $X$ by learned weight matrices:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
The attention scores are then computed using scaled dot-product attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
The scaling factor $\sqrt{d_k}$ (where $d_k$ is the dimension of the key vectors) prevents the dot products from growing too large, which would push the softmax into regions with vanishingly small gradients.
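To make the formula concrete, here is a minimal pure-Python sketch of scaled dot-product attention for a single sequence. The toy embeddings and the identity projection matrices are assumptions chosen for readability; a real model learns `W_q`, `W_k`, and `W_v` and runs this on an accelerator with a tensor library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return (output, weights)."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]          # transpose K
    scores = matmul(Q, K_t)                       # one row of scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy input: 3 tokens with 2-dimensional embeddings (made-up numbers).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical projections: identity matrices keep the arithmetic easy to follow.
W_q = W_k = W_v = [[1.0, 0.0], [0.0, 1.0]]

Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
output, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1 and says how much one token draws from every other token; `output` is the corresponding weighted mix of value vectors.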
2. Multi-Head Attention
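Rather than computing a single attention function, Transformers use multi-head attention: multiple attention "heads" operating in parallel, each with its own Q, K, V weight matrices. This allows the model to jointly attend to information from different representation subspaces:
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O, \quad \text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
Here $X$ is the matrix of input embeddings and $W^O$ is a learned output projection. For example, in a model with hidden dimension $d_{\text{model}}$ and $h$ heads, each head operates on $d_k = d_{\text{model}} / h$ dimensions (in the original Transformer, $d_{\text{model}} = 512$ and $h = 8$ give $d_k = 64$). One head might learn to attend to syntactic relationships, another to semantic similarity, and yet another to positional proximity.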
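The bookkeeping behind "each head operates on its own slice" can be sketched as follows. This only illustrates the split and concatenation steps; the per-head attention and the output projection are left out, and the function names are hypothetical.

```python
def split_heads(x, num_heads):
    """Split each token vector of size d_model into num_heads slices of size d_model // num_heads."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    # Result is indexed as heads[head][token][feature].
    return [[tok[h * d_head:(h + 1) * d_head] for tok in x] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head vectors back into one vector per token."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

# Two tokens, d_model = 8, split into 2 heads of 4 dimensions each.
x = [[1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16]]
heads = split_heads(x, num_heads=2)
restored = merge_heads(heads)
```

In practice each head would run attention on its slice before `merge_heads` is applied, and the merged result would then be multiplied by the output projection.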
Grouped-Query Attention (GQA) and Multi-Query Attention (MQA)
Standard multi-head attention requires separate K and V projections for every head, which is expensive during inference because the KV-cache grows linearly with the number of K/V heads. Modern models use optimized variants:
| Variant | K/V Heads | Memory | Used By |
|---|---|---|---|
| Multi-Head Attention (MHA) | Same as Q heads (e.g., 32) | Highest | Original Transformer, GPT-2 |
| Multi-Query Attention (MQA) | 1 shared K/V for all Q heads | Lowest | PaLM, Falcon |
| Grouped-Query Attention (GQA) | Groups share K/V (e.g., 8 KV for 32 Q) | Medium | Gemma 4, Llama 3/4, Mistral |
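The difference between the three variants comes down to how query heads map onto K/V heads. The sketch below assumes consecutive query heads share a K/V head, which is a common convention; the head counts are the illustrative ones from the table.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the index of the K/V head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def kv_cache_reduction(num_q_heads, num_kv_heads):
    """How many times smaller the KV-cache is compared with full multi-head attention."""
    return num_q_heads / num_kv_heads

# MHA: every query head has its own K/V head.
mha_map = [kv_head_for_query_head(q, 32, 32) for q in range(32)]
# MQA: all query heads share a single K/V head.
mqa_map = [kv_head_for_query_head(q, 32, 1) for q in range(32)]
# GQA: 32 query heads share 8 K/V heads, in groups of 4.
gqa_map = [kv_head_for_query_head(q, 32, 8) for q in range(32)]
gqa_reduction = kv_cache_reduction(32, 8)
```

With 32 query heads and 8 K/V heads, query heads 0 to 3 read K/V head 0, heads 4 to 7 read K/V head 1, and so on, so only a quarter as many keys and values need to be cached.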
3. Encoder, Decoder, and Encoder-Decoder
The original Transformer had both an encoder and a decoder. Modern LLMs have diverged into three main architecture families:
| Type | Attention Pattern | Best For | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional (sees full input) | Understanding: classification, NER, embeddings | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal (left-to-right only) | Generation: chat, code, reasoning | Gemma 4, GPT-4, Claude, Llama 4 |
| Encoder-Decoder | Bidirectional encoder + causal decoder | Seq-to-seq: translation, summarization | T5, BART, Flan-T5 |
The decoder-only architecture has won the scaling race. Nearly all frontier LLMs, including GPT-4, Claude, Gemini, Gemma 4, and Llama 4, are decoder-only or widely understood to be. The key insight: with enough scale, a decoder-only model can learn to handle most tasks the other architectures were designed for, while being simpler to train and scale.
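The "Attention Pattern" column in the table above corresponds to a mask applied to the attention scores before the softmax. The following is a minimal sketch with hypothetical helper names: masked positions are set to negative infinity so they receive zero weight.

```python
import math

def bidirectional_mask(n):
    """Encoder-style mask: every token may attend to every token."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: token i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, ignoring positions where the mask is False."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)  # finite, because a token can always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# With equal raw scores, the first token can only see itself,
# and the second token splits its attention over the first two positions.
first_token_weights = masked_softmax([0.0, 0.0, 0.0], mask[0])
second_token_weights = masked_softmax([0.0, 0.0, 0.0], mask[1])
```

An encoder-decoder model uses both: `bidirectional_mask` inside the encoder and `causal_mask` inside the decoder.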
4. Positional Encoding
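Self-attention on its own is permutation-equivariant: shuffling the input tokens simply shuffles the outputs in the same way, so the layer effectively treats the input as a set, not a sequence. To inject information about token order, we need positional encoding.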
Sinusoidal (Original Transformer)
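The original paper used fixed sine and cosine functions of different frequencies:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Here $pos$ is the token position and $i$ indexes pairs of embedding dimensions.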
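As a small illustration, the sinusoidal encoding for one position can be computed directly from its definition in the original Transformer paper. The function name is hypothetical; the base of 10000 is the value used in that paper.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for position pos."""
    pe = []
    for even_dim in range(0, d_model, 2):
        # even_dim plays the role of 2i, so the exponent is 2i / d_model.
        angle = pos / (10000 ** (even_dim / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe[:d_model]  # trims the extra value if d_model is odd

position_zero = sinusoidal_encoding(0, 4)
position_one = sinusoidal_encoding(1, 4)
```

These vectors are added to the token embeddings before the first layer, so identical tokens at different positions end up with different inputs.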
RoPE (Rotary Position Embeddings)
RoPE (Su et al., 2021) is the dominant positional encoding in modern LLMs. Instead of adding position information to embeddings, RoPE rotates the query and key vectors by an angle proportional to their position. This elegantly encodes relative position directly into the attention scores.
Key benefits of RoPE:
- Naturally encodes relative positions (attention depends on distance between tokens, not absolute position)
- Can extrapolate to sequences longer than those seen during training (with techniques like NTK-aware scaling or YaRN)
- Computationally efficient — just a rotation applied to Q and K
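The relative-position property can be checked numerically. The sketch below rotates adjacent pairs of dimensions, which is one common convention (some implementations instead pair dimension `i` with `i + d/2`); the base of 10000 and the function names are assumptions for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)  # lower dimensions rotate faster
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same distance (2 positions apart) at two different absolute offsets.
score_near_start = dot(rope(q, 5), rope(k, 3))
score_further_on = dot(rope(q, 12), rope(k, 10))
```

The two scores agree up to floating-point error, because the dot product of two rotated vectors depends only on the difference between their rotation angles.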
5. Gemma 4 Architecture Deep Dive
Gemma 4 is a decoder-only Transformer with several modern design choices that optimize both quality and inference efficiency:
| Component | Gemma 4 Choice | Why |
|---|---|---|
| Architecture | Decoder-only | Best for generative tasks; scales efficiently |
| Attention | Grouped-Query Attention (GQA) | Reduces KV-cache memory in proportion to the Q-to-KV head ratio |
| Positional encoding | RoPE | Relative positions; context length extrapolation |
| Normalization | RMSNorm (pre-norm) | Faster than LayerNorm; more stable training |
| Activation | GeGLU | Gated activation; typically outperforms plain ReLU or GELU feed-forward layers |
| Vocabulary | 256K tokens (SentencePiece) | Efficient encoding of multilingual text and code |
| Multimodal | Native vision encoder (SigLIP-based) | Process images alongside text |
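To see why the normalization row describes RMSNorm as cheaper than LayerNorm, compare the two on a single vector. This is a simplified sketch: the learned gain and bias parameters are omitted, and the `eps` value is an arbitrary choice.

```python
import math

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned parameters: subtract the mean, divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm without learned gain: divide by the root mean square, with no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

hidden = [2.0, 4.0, 6.0, 8.0]
ln_out = layer_norm(hidden)
rms_out = rms_norm(hidden)
```

RMSNorm skips the mean computation and the centering step, so it needs one statistic per vector instead of two. The output keeps its original sign pattern and is rescaled to a root mean square of about 1.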
Gemma 4 Model Sizes
| Variant | Parameters | Layers | Hidden Dim | Q Heads | KV Heads | Best For |
|---|---|---|---|---|---|---|
| Gemma 4 1B | 1 billion | 18 | 1,536 | 8 | 4 | On-device, mobile, edge |
| Gemma 4 4B | 4 billion | 26 | 2,560 | 16 | 8 | Lightweight tasks, fast inference |
| Gemma 4 12B | 12 billion | 36 | 3,840 | 24 | 8 | General purpose — our default in this course |
| Gemma 4 27B | 27 billion | 46 | 4,608 | 32 | 16 | Complex reasoning, coding, analysis |
```python
import os

from google import genai

# Assumes the API key is exported as GOOGLE_AI_STUDIO_API_KEY.
client = genai.Client(api_key=os.environ["GOOGLE_AI_STUDIO_API_KEY"])

# You can query different Gemma 4 sizes via the API:
models_to_try = [
    "gemma-4-4b-it",   # Fast, lightweight
    "gemma-4-12b-it",  # Balanced (our default)
    "gemma-4-27b-it",  # Most capable
]

prompt = "Explain the difference between RMSNorm and LayerNorm in one paragraph."

# Send the same prompt to each model size and print the answers for comparison.
for model_name in models_to_try:
    response = client.models.generate_content(
        model=model_name,
        contents=prompt,
    )
    print(f"=== {model_name} ===")
    print(response.text)
    print()
```
6. Mixture of Experts (MoE)
Some of the largest models use Mixture of Experts (MoE) architecture to scale parameter count without proportionally increasing compute. In an MoE model, each transformer layer has multiple "expert" feed-forward networks, but only a few are activated for each token.
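A router network decides which experts to activate for each token:
$$y = \sum_{i \in \text{TopK}(g(x),\,k)} g_i(x)\,E_i(x), \quad g(x) = \text{softmax}(W_r x)$$
Here $E_i$ is the $i$-th expert network, $W_r$ is the router's learned weight matrix, and $k$ is the number of active experts. Only a small subset of experts is active per token: 2 of 8 in Mixtral 8x7B, for example, or 8 of 256 in DeepSeek-V3. This means a model with 400B total parameters might use only 17B parameters per token, making it far cheaper to run than a dense model of the same total size.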
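A toy version of top-k routing might look like the sketch below. The experts are stand-in functions that just scale their input, the router logits are supplied directly instead of being computed from the token, and the selected weights are renormalized to sum to 1, which is one common variant. All names here are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k):
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for i, weight in route(router_logits, k):
        expert_out = experts[i](x)  # the other experts are never evaluated
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Four toy experts: expert s multiplies its input by s.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.0, 2.0, 1.0, -1.0]  # in a real model these come from the router network

selected = route(router_logits, k=2)
y = moe_layer([1.0, 1.0], experts, router_logits, k=2)
```

Only the two selected experts run, which is where the gap between total and active parameters comes from.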
| Model | Total Params | Active Params | Experts | Active Per Token |
|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 |
| GPT-4 (rumored) | ~1.8T | ~280B | 16 | 2 |
| DeepSeek-V3 | 671B | 37B | 256 | 8 |
| Llama 4 Maverick | 400B | 17B | 128 | — |
| Gemma 4 (all sizes) | Dense | All | — | — |
7. KV-Cache and Efficient Inference
During autoregressive generation, the model produces one token at a time. Naively, this would require recomputing keys, values, and attention for all previous tokens at each step, an O(n²) cost per generated token. The KV-cache avoids this by storing the Key and Value vectors from previous steps, so each step only runs attention for one new query:
- On step 1, compute Q, K, V for the prompt. Store K and V.
- On step 2, compute Q, K, and V for the new token only. Append the new K and V to the cache, then compute attention between the new Q and the full cached K, V.
- Repeat — each new token only needs its own Q, plus the growing KV-cache.
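The steps above can be sketched for a single attention head as follows. The `project` function is a stand-in that uses identity projections, an assumption made for brevity, and the prompt is prefilled one token at a time, whereas real systems process the prompt in one batch.

```python
import math

def attend(q, keys, values):
    """Attention output for one query vector over a list of keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

class KVCache:
    """Stores the key and value vectors of every token processed so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def project(x):
    """Hypothetical stand-in for the learned Q, K, V projections (identity here)."""
    return x, x, x

def decode_step(x_new, cache):
    """Process one new token: compute its q, k, v, extend the cache, attend over it."""
    q, k, v = project(x_new)
    cache.append(k, v)  # append first so the token can attend to itself
    return attend(q, cache.keys, cache.values)

prompt = [[1.0, 0.0], [0.0, 1.0]]
new_token = [1.0, 1.0]

cache = KVCache()
for tok in prompt:          # step 1: fill the cache from the prompt
    decode_step(tok, cache)
out = decode_step(new_token, cache)  # step 2: only the new token is projected
```

Each call to `decode_step` does work proportional to the current cache length, and the result matches what a full recomputation over all tokens would give for the last position.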
The KV-cache trades memory for speed. For a Gemma 4 12B model with a 128K context window, the KV-cache can grow to many gigabytes, depending on the numeric precision and the number of KV heads. This is why GQA (fewer KV heads) is so important for long-context models.
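A back-of-the-envelope estimate shows where the memory goes. The layer and head counts below come from the 12B row of the model size table; the head dimension (hidden size divided by the number of query heads), 16-bit cache values, a 131,072-token context, and full attention in every layer are all assumptions, so treat the result as an upper-bound illustration, not a measured figure.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the KV-cache: one K and one V vector per layer, KV head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

NUM_LAYERS = 36            # from the 12B row of the table
NUM_Q_HEADS = 24           # from the 12B row of the table
NUM_KV_HEADS = 8           # from the 12B row of the table
HEAD_DIM = 3840 // 24      # assumption: hidden size / query heads = 160
SEQ_LEN = 128 * 1024       # assumption: "128K" means 131,072 tokens

gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_Q_HEADS, HEAD_DIM, SEQ_LEN)

gqa_gib = gqa_bytes / 2**30   # 22.5 GiB under these assumptions
mha_gib = mha_bytes / 2**30   # 67.5 GiB if every query head had its own K/V
```

Under these assumptions GQA cuts the cache to a third of the multi-head figure. Techniques such as cache quantization or restricting some layers to a local attention window would lower the number further.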
8. Choosing the Right Model Size
There is no universally "best" model — the right choice depends on your constraints:
| Use Case | Recommended | Why |
|---|---|---|
| Learning & prototyping | Gemma 4 12B-it (API) | Free, fast, high quality — our course default |
| Production chatbot | Gemma 4 27B or Claude Sonnet | Best quality/cost ratio for conversational AI |
| Mobile / edge | Gemma 4 1B or 4B | Runs on-device with acceptable quality |
| Complex reasoning | Claude Opus, GPT-4o, Gemini Pro | Frontier models for hardest tasks |
| Cost-sensitive batch processing | Gemma 4 4B or Llama 4 Scout | Low cost per token at scale |
| Self-hosted / private data | Gemma 4 or Llama 4 via vLLM | Open-weights = full control, no data leaves your infra |