Exploring LLM Architectures

Nearly every modern LLM is built on the Transformer architecture introduced in the landmark 2017 paper "Attention Is All You Need". But not all Transformers are the same: the choices around attention type, normalization, positional encoding, and whether to use an encoder, decoder, or both create profoundly different models. This lesson takes you inside the architecture so you understand why models behave the way they do.

1. The Self-Attention Mechanism

Self-attention is the core innovation that makes Transformers powerful. Unlike RNNs, which process tokens one by one, self-attention lets every token attend to every other token in the sequence in parallel. Because any two tokens are connected directly, this largely removes the long-range dependency problem that plagued earlier architectures.

For each token, the model computes three vectors from the input embedding:

Query (Q) — "What am I looking for?"
Key (K) — "What do I contain?"
Value (V) — "What information do I provide?"

These are computed by multiplying the input embedding $X$ by learned weight matrices:

$$Q = X W^{Q}, \quad K = X W^{K}, \quad V = X W^{V}$$

The attention scores are then computed using scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V$$

The scaling factor $\sqrt{d_{k}}$ (where $d_{k}$ is the dimension of the key vectors) keeps the dot products from growing with $d_{k}$. Without it, large scores would push the softmax into saturated regions with vanishingly small gradients.
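
The computation described above fits in a few lines of NumPy. The sketch below is illustrative only: the toy sizes, the random stand-in embeddings, and the function names are assumptions, not part of any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every query against every key
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8          # toy sizes chosen for illustration
X = rng.normal(size=(n_tokens, d_model))  # stand-in for input embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)           # (4, 8)
print(weights.sum(axis=-1))   # every row sums to 1 (up to rounding)
```

Each row of `weights` says how much one token draws from every other token, and the output is the corresponding weighted average of the value vectors.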

2. Multi-Head Attention

Rather than computing a single attention function, Transformers use multi-head attention — multiple attention "heads" operating in parallel, each with its own Q, K, V weight matrices. This allows the model to jointly attend to information from different representation subspaces:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_{1}, \dots, \text{head}_{h}) \, W^{O}$$

where $\text{head}_{i} = \text{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})$

For example, in a model with $d_{model} = 768$ and $h = 12$ heads, each head operates on $d_{k} = 64$ dimensions. One head might learn to attend to syntactic relationships, another to semantic similarity, and yet another to positional proximity.
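
A minimal NumPy sketch of this head-splitting, using the $d_{model} = 768$, $h = 12$ example from the text. It assumes the per-head projections are stored as column blocks of one large matrix per role, which is a common layout but not the only one. The random weights and the helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    n_tokens, d_model = X.shape
    d_k = d_model // n_heads                # per-head dimension

    def split_heads(M):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_k)
        return M.reshape(n_tokens, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = (split_heads(X @ W) for W in (W_Q, W_K, W_V))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (n_heads, n_tokens, n_tokens)
    heads = softmax(scores, axis=-1) @ V                # (n_heads, n_tokens, d_k)
    # Concatenate the heads back into (n_tokens, d_model), then mix with W_O.
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ W_O

d_model, n_heads, n_tokens = 768, 12, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)
print(d_model // n_heads)   # 64
print(out.shape)            # (5, 768)
```

Each head runs attention independently on its own 64-dimensional slice, and only the final multiplication by `W_O` lets the heads exchange information.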

Grouped-Query Attention (GQA) and Multi-Query Attention (MQA)

Standard multi-head attention requires separate K and V projections for every head, which is expensive during inference (the KV-cache grows linearly with the number of K/V heads). Modern models use optimized variants:

Variant	K/V Heads	Memory	Used By
Multi-Head Attention (MHA)	Same as Q heads (e.g., 32)	Highest	Original Transformer, GPT-2
Multi-Query Attention (MQA)	1 shared K/V for all Q heads	Lowest	PaLM, Falcon
Grouped-Query Attention (GQA)	Groups share K/V (e.g., 8 KV for 32 Q)	Medium	Gemma 4, Llama 3/4, Mistral
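
To make the Memory column concrete, the sketch below estimates the KV-cache size for one sequence under each variant. The configuration (32 layers, head dimension 128, 8,192 tokens, 16-bit values) is hypothetical and not taken from any specific model. The formula assumes every layer caches K and V for the full sequence.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit storage (fp16 or bf16).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration, for illustration only.
config = dict(n_layers=32, head_dim=128, seq_len=8192)

for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_kv_heads=n_kv_heads, **config) / 2**30
    print(f"{name}: {n_kv_heads:>2} KV heads -> {gib:.3f} GiB")
# MHA: 32 KV heads -> 4.000 GiB
# GQA:  8 KV heads -> 1.000 GiB
# MQA:  1 KV heads -> 0.125 GiB
```

The cache shrinks in direct proportion to the number of K/V heads, which is why GQA sits between the other two variants.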

3. Encoder, Decoder, and Encoder-Decoder

The original Transformer had both an encoder and a decoder. Modern LLMs have diverged into three main architecture families:

Type	Attention Pattern	Best For	Examples
Encoder-only	Bidirectional (sees full input)	Understanding: classification, NER, embeddings	BERT, RoBERTa, DeBERTa
Decoder-only	Causal (left-to-right only)	Generation: chat, code, reasoning	Gemma 4, GPT-4, Claude, Llama 4
Encoder-Decoder	Bidirectional encoder + causal decoder	Seq-to-seq: translation, summarization	T5, BART, Flan-T5
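
The attention patterns in the table differ only in the mask applied to the scores before the softmax. The NumPy sketch below uses random scores and a hypothetical helper name to show both cases: a causal mask hides every position to the right of the current token, while bidirectional attention applies no mask.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal):
    """Turn raw (n, n) scores into attention weights, optionally causal."""
    n = scores.shape[-1]
    if causal:
        # True above the diagonal marks "future" positions that must be hidden.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                 # stand-in for Q K^T / sqrt(d_k)

bidirectional = attention_weights(scores, causal=False)  # encoder-style
causal = attention_weights(scores, causal=True)          # decoder-style

print(np.round(causal, 2))   # upper triangle is all zeros
```

In the causal case the first token can attend only to itself, so its row is `[1, 0, 0, 0]`, and each later row gains one more non-zero entry.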

The decoder-only architecture has won the scaling race. Nearly all frontier LLMs — GPT-4, Claude, Gemini, Gemma 4, Llama 4 — are decoder-only. The key insight: with enough scale, a decoder-only model can learn to do everything the other architectures can, while being simpler to train and scale.

4. Positional Encoding

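Self-attention on its own has no notion of token order: permuting the input tokens simply permutes the outputs in the same way (it is permutation-equivariant), so the model effectively sees a set, not a sequence. To inject information about token order, we need positional encoding.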

Sinusoidal (Original Transformer)

The original paper used fixed sine and cosine functions of different frequencies:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
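
As a sketch, the sinusoidal encoding can be built as one matrix with a row per position: sines in the even columns and cosines in the odd columns. The function name and the sizes are illustrative, and the code assumes an even model dimension.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of encodings; d_model is assumed even."""
    pos = np.arange(max_len)[:, None]         # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]      # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions
    pe[:, 1::2] = np.cos(angles)              # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # [0. 1. 0. 1.]
```

In the original Transformer this matrix is added to the token embeddings before the first layer.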

RoPE (Rotary Position Embeddings)

RoPE (Su et al., 2021) is the dominant positional encoding in modern LLMs. Instead of adding position information to embeddings, RoPE rotates the query and key vectors by an angle proportional to their position. This elegantly encodes relative position directly into the attention scores.

Key benefits of RoPE:

Naturally encodes relative positions (attention depends on distance between tokens, not absolute position)
Can extrapolate to sequences longer than those seen during training (with techniques like NTK-aware scaling or YaRN)
Computationally efficient — just a rotation applied to Q and K
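
The relative-position property can be checked numerically. The sketch below rotates adjacent pairs of dimensions, which is one common convention (some implementations pair dimensions differently). The base of 10000, the vector size, and the function name are assumptions for illustration.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of a 1-D vector by position-dependent angles."""
    d = x.shape[-1]                              # assumed even
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin       # standard 2-D rotation
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rope(q, 3) @ rope(k, 1)      # positions 3 and 1: offset 2
score_b = rope(q, 10) @ rope(k, 8)     # positions 10 and 8: offset 2
print(np.isclose(score_a, score_b))    # True
```

Both pairs are two positions apart, so their attention scores match even though the absolute positions differ. Rotation also preserves vector length, so RoPE changes only the angle between queries and keys.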

5. Gemma 4 Architecture Deep Dive

Gemma 4 is a decoder-only Transformer with several modern design choices that optimize both quality and inference efficiency:

Component	Gemma 4 Choice	Why
Architecture	Decoder-only	Best for generative tasks; scales efficiently
Attention	Grouped-Query Attention (GQA)	Shrinks the KV-cache by the Q-to-KV head ratio (2–3× for the sizes listed below)
Positional encoding	RoPE	Relative positions; context length extrapolation
Normalization	RMSNorm (pre-norm)	Faster than LayerNorm; more stable training
Activation	GeGLU	Gated activation; typically outperforms plain ReLU or GELU feed-forward layers
Vocabulary	256K tokens (SentencePiece)	Efficient encoding of multilingual text and code
Multimodal	Native vision encoder (SigLIP-based)	Process images alongside text
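
Two of the components in the table, RMSNorm and GeGLU, are small enough to sketch directly. The NumPy version below follows the commonly published definitions and is not Gemma's actual implementation. The function names, shapes, epsilon, and the tanh approximation of GELU are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the standard deviation."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: divide by the root-mean-square only (no mean, no bias)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def geglu_ffn(x, W_gate, W_up, W_down):
    """Gated feed-forward block: GELU(x W_gate) gates (x W_up) elementwise."""
    return (gelu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                         # toy sizes
x = rng.normal(size=(3, d_model))
normed = rms_norm(x, gamma=np.ones(d_model))
W_gate, W_up = rng.normal(size=(2, d_model, d_ff)) * 0.1
W_down = rng.normal(size=(d_ff, d_model)) * 0.1

print(geglu_ffn(normed, W_gate, W_up, W_down).shape)   # (3, 16)
```

RMSNorm skips the mean subtraction and the bias term that LayerNorm computes, which is where its speed advantage comes from.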

Gemma 4 Model Sizes

Variant	Parameters	Layers	Hidden Dim	Q Heads	KV Heads	Best For
Gemma 4 1B	1 billion	18	1,536	8	4	On-device, mobile, edge
Gemma 4 4B	4 billion	26	2,560	16	8	Lightweight tasks, fast inference
Gemma 4 12B	12 billion	36	3,840	24	8	General purpose — our default in this course
Gemma 4 27B	27 billion	46	4,608	32	16	Complex reasoning, coding, analysis

python

import os
from google import genai

client = genai.Client(api_key=os.environ["GOOGLE_AI_STUDIO_API_KEY"])

# You can query different Gemma 4 sizes via the API:
models_to_try = [
    "gemma-4-4b-it",    # Fast, lightweight
    "gemma-4-12b-it",   # Balanced (our default)
    "gemma-4-27b-it",   # Most capable
]

prompt = "Explain the difference between RMSNorm and LayerNorm in one paragraph."

for model_name in models_to_try:
    response = client.models.generate_content(
        model=model_name,
        contents=prompt,
    )
    print(f"=== {model_name} ===")
    print(response.text)
    print()

6. Mixture of Experts (MoE)

Some of the largest models use a Mixture of Experts (MoE) architecture to scale parameter count without proportionally increasing compute. In an MoE model, each transformer layer has multiple "expert" feed-forward networks, but only a few are activated for each token.

A router network decides which experts to activate for each token:

$$y = \sum_{i=1}^{N} g_{i}(x) \cdot E_{i}(x), \quad \text{where } g(x) = \text{TopK}(\text{softmax}(W_{g} \cdot x))$$

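The number of active experts varies by design: Mixtral activates 2 of its 8 experts per token, while DeepSeek-V3 activates 8 of 256. As a result, a model with 400B total parameters might use only 17B of them for any given token, which keeps per-token compute low despite the massive size. All experts must still be held in memory, however.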
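
A toy version of the routing formula for a single token is sketched below. It keeps the top-k softmax probabilities and zeroes the rest, as the formula is written. Some implementations instead renormalize the selected gates, so treat this as one variant. The single-matrix "experts", sizes, and names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, W_g, experts, top_k):
    """x: (d,) token vector; W_g: (n_experts, d); experts: list of callables."""
    probs = softmax(W_g @ x)              # router distribution over experts
    top = np.argsort(probs)[-top_k:]      # indices of the top_k experts
    gates = np.zeros_like(probs)
    gates[top] = probs[top]               # TopK: keep selected, zero the rest
    # Only the selected experts run; the others cost no compute for this token.
    y = sum(gates[i] * experts[i](x) for i in top)
    return y, gates

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2
W_g = rng.normal(size=(n_experts, d))
# Each toy expert is a single linear map; real experts are full feed-forward blocks.
expert_weights = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
experts = [lambda v, W=W: W @ v for W in expert_weights]

x = rng.normal(size=d)
y, gates = moe_layer(x, W_g, experts, top_k)
print(y.shape)                   # (16,)
print(np.count_nonzero(gates))   # 2
```

Six of the eight experts are never evaluated for this token, which is the source of the compute savings.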

Model	Total Params	Active Params	Experts	Active Per Token
Mixtral 8x7B	46.7B	12.9B	8	2
GPT-4 (rumored)	~1.8T	~280B	16	2
DeepSeek-V3	671B	37B	256	8
Llama 4 Maverick	400B	17B	128	—
Gemma 4 (all sizes)	Dense	All	—	—

7. KV-Cache and Efficient Inference

During autoregressive generation, the model produces one token at a time. Naively, this would require recomputing attention over all previous tokens at each step — an O(n²) cost per token. The KV-cache solves this by storing the Key and Value vectors from previous steps:

On step 1, compute Q, K, V for the prompt. Store K and V.
On step 2, compute Q, K, V for the new token only. Append the new K, V to the cache, then compute attention between the new Q and all cached K, V.
Repeat: each new token needs only its own Q, K, V plus the growing KV-cache.
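
The steps above can be sketched for a single attention head and compared against naive recomputation. For simplicity this toy version also feeds the prompt one token at a time. The random embeddings, weights, and names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    """Attention for one query vector against all available keys and values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(6, d))       # stand-in embeddings for 6 positions

# Cached decoding: project each new token once and append its K, V.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cached_outputs = []
for x in tokens:
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    cached_outputs.append(attend(q, K_cache, V_cache))

# Naive decoding: re-project the whole prefix at every step.
naive_outputs = []
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    K, V = prefix @ W_K, prefix @ W_V
    naive_outputs.append(attend(prefix[-1] @ W_Q, K, V))

print(np.allclose(cached_outputs, naive_outputs))   # True
print(K_cache.shape)                                # (6, 8)
```

Both loops produce the same outputs, but the cached loop projects each token once, while the naive loop re-projects every earlier token at every step.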

The KV-cache trades memory for speed. For a Gemma 4 12B model with a 128K context window, the KV-cache can grow to several gigabytes. This is why GQA (fewer KV heads) is so important for long-context models.

8. Choosing the Right Model Size

There is no universally "best" model — the right choice depends on your constraints:

Use Case	Recommended	Why
Learning & prototyping	Gemma 4 12B-it (API)	Free, fast, high quality — our course default
Production chatbot	Gemma 4 27B or Claude Sonnet	Best quality/cost ratio for conversational AI
Mobile / edge	Gemma 4 1B or 4B	Runs on-device with acceptable quality
Complex reasoning	Claude Opus, GPT-4o, Gemini Pro	Frontier models for hardest tasks
Cost-sensitive batch processing	Gemma 4 4B or Llama 4 Scout	Low cost per token at scale
Self-hosted / private data	Gemma 4 or Llama 4 via vLLM	Open-weights = full control, no data leaves your infra
