AIMaks

Exploring LLM Architectures


Every modern LLM is built on the Transformer architecture introduced in the landmark 2017 paper "Attention Is All You Need". But not all Transformers are the same — the choices around attention type, normalization, positional encoding, and whether to use an encoder, decoder, or both create profoundly different models. This lesson takes you inside the architecture so you understand why models behave the way they do.

1. The Self-Attention Mechanism

Self-attention is the core innovation that makes Transformers powerful. Unlike RNNs, which process tokens one by one, self-attention lets every token attend to every other token in the sequence in parallel. Any two tokens are connected in a single step, regardless of how far apart they are, which largely removes the long-range dependency problem that plagued earlier architectures.

For each token, the model computes three vectors from the input embedding:

  • Query (Q) — "What am I looking for?"
  • Key (K) — "What do I contain?"
  • Value (V) — "What information do I provide?"

These are computed by multiplying the input embedding matrix $X$ by learned weight matrices:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

The attention scores are then computed using scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

The scaling factor $1/\sqrt{d_k}$ (where $d_k$ is the dimension of the key vectors) prevents the dot products from growing too large, which would push the softmax into regions with vanishingly small gradients.
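
As a rough illustration of the projection and scaled dot-product steps described above, here is a minimal NumPy sketch. The toy sizes, the random stand-in embeddings `X`, and the weight names `W_q`, `W_k`, `W_v` are assumptions chosen for illustration, not values from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8          # toy sizes (assumed)
X = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # (4, 8)
```

Each row of `weights` is a probability distribution over the sequence, so every output vector is a weighted average of the value vectors.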

2. Multi-Head Attention

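Rather than computing a single attention function, Transformers use multi-head attention: multiple attention "heads" operating in parallel, each with its own Q, K, V weight matrices. This allows the model to jointly attend to information from different representation subspaces:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O, \quad \text{head}_i = \text{Attention}\left(X W_Q^{(i)},\, X W_K^{(i)},\, X W_V^{(i)}\right)$$

Here $X$ is the input embedding matrix, $h$ is the number of heads, and $W_O$ is a learned output projection. For example, in a model with $d_{\text{model}} = 512$ and $h = 8$ heads (the base configuration of the original Transformer), each head operates on $d_{\text{model}} / h = 64$ dimensions. One head might learn to attend to syntactic relationships, another to semantic similarity, and yet another to positional proximity.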
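
The head-splitting idea can be sketched in a few lines of NumPy. This is a simplified illustration, assuming a model width of 512 and 8 heads (the original Transformer's base configuration) and random weights; real implementations add batching, masking, and biases.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores) @ V                           # (heads, seq, d_head)
    # Concatenate the heads back to (seq_len, d_model), then mix with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 512, 8   # assumed toy configuration
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(d_model // num_heads, out.shape)  # 64 (5, 512)
```

Splitting one large projection into per-head slices, as done here, is equivalent to giving each head its own smaller weight matrices.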

Grouped-Query Attention (GQA) and Multi-Query Attention (MQA)

Standard multi-head attention requires separate K and V projections for every head, which is expensive during inference (the KV-cache grows linearly with the number of K/V heads). Modern models use optimized variants that share K/V heads across several query heads:

| Variant | K/V Heads | KV-Cache Memory | Used By |
| --- | --- | --- | --- |
| Multi-Head Attention (MHA) | Same as Q heads (e.g., 32) | Highest | Original Transformer, GPT-2 |
| Multi-Query Attention (MQA) | 1 shared K/V for all Q heads | Lowest | PaLM, Falcon |
| Grouped-Query Attention (GQA) | Groups share K/V (e.g., 8 KV for 32 Q) | Medium | Gemma 4, Llama 3/4, Mistral |
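
To make the sharing concrete, the sketch below runs the same 32 query heads against 32, 8, and 1 K/V heads and counts how many values would need to be cached. The sequence length, head dimension, and the function name `grouped_query_attention` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (n_q_heads, seq, d_head); K, V: (n_kv_heads, seq, d_head)."""
    n_q_heads, n_kv_heads = Q.shape[0], K.shape[0]
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    # Each K/V head is shared by `group_size` consecutive query heads.
    K_rep = np.repeat(K, group_size, axis=0)
    V_rep = np.repeat(V, group_size, axis=0)
    scores = Q @ K_rep.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V_rep

rng = np.random.default_rng(0)
seq, d_head = 6, 16                      # toy sizes (assumed)
Q = rng.normal(size=(32, seq, d_head))

cache_sizes = {}
for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    K = rng.normal(size=(n_kv, seq, d_head))
    V = rng.normal(size=(n_kv, seq, d_head))
    out = grouped_query_attention(Q, K, V)
    cache_sizes[name] = K.size + V.size  # only K and V need to be cached
    print(name, out.shape, cache_sizes[name])
```

The output shape is the same in all three cases; only the amount of cached K/V data changes (here by 4× for GQA and 32× for MQA relative to MHA).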

3. Encoder, Decoder, and Encoder-Decoder

The original Transformer had both an encoder and a decoder. Modern LLMs have diverged into three main architecture families:

| Type | Attention Pattern | Best For | Examples |
| --- | --- | --- | --- |
| Encoder-only | Bidirectional (sees full input) | Understanding: classification, NER, embeddings | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal (left-to-right only) | Generation: chat, code, reasoning | Gemma 4, GPT-4, Claude, Llama 4 |
| Encoder-Decoder | Bidirectional encoder + causal decoder | Seq-to-seq: translation, summarization | T5, BART, Flan-T5 |

The decoder-only architecture has won the scaling race. Nearly all frontier LLMs — GPT-4, Claude, Gemini, Gemma 4, Llama 4 — are decoder-only. The key insight: with enough scale, a decoder-only model can learn to do everything the other architectures can, while being simpler to train and scale.
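
At the level of a single attention layer, the bidirectional and causal patterns differ only in a mask applied before the softmax. A small sketch, using uniform raw scores (an assumption made purely so the effect of the mask is easy to see):

```python
import numpy as np

def attention_weights(scores, causal):
    if causal:
        seq_len = scores.shape[-1]
        # True above the diagonal marks future positions to hide.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform raw scores (assumed toy input)
bidirectional = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)

print(bidirectional[0])  # [0.25 0.25 0.25 0.25]
print(causal[0])         # [1. 0. 0. 0.]
print(causal[3])         # [0.25 0.25 0.25 0.25]
```

With the causal mask, the first token can only attend to itself, while the last token sees the whole prefix. This is what lets a decoder-only model be trained on next-token prediction at every position in one pass.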

4. Positional Encoding

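Self-attention on its own has no notion of token order: if you shuffle the input tokens, the outputs are simply shuffled the same way (strictly speaking, it is permutation-equivariant). In effect it treats the input as a set, not a sequence. To inject information about token order, we need positional encoding.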

Sinusoidal (Original Transformer)

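The original paper used fixed sine and cosine functions of different frequencies:

$$PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right), \quad PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)$$

where $pos$ is the token position, $i$ indexes pairs of embedding dimensions, and $d_{\text{model}}$ is the embedding width. The resulting vector is added to the token embedding.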
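
The fixed sine and cosine scheme can be sketched as follows. The function name `sinusoidal_positions` and the toy sizes are assumptions for illustration, and the sketch assumes an even embedding width.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i for i = 0, 1, ...
    angles = positions / (10000 ** (even_dims / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positions(max_len=50, d_model=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # [0. 1. 0. 1.]
```

Low dimensions oscillate quickly with position and high dimensions slowly, so each position gets a distinct pattern across the vector.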

RoPE (Rotary Position Embeddings)

RoPE (Su et al., 2021) is the dominant positional encoding in modern LLMs. Instead of adding position information to embeddings, RoPE rotates the query and key vectors by an angle proportional to their position. This elegantly encodes relative position directly into the attention scores.

Key benefits of RoPE:

  • Naturally encodes relative positions (attention depends on distance between tokens, not absolute position)
  • Can extrapolate to sequences longer than those seen during training (with techniques like NTK-aware scaling or YaRN)
  • Computationally efficient — just a rotation applied to Q and K
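
The relative-position property can be checked numerically. The sketch below is a simplified single-vector version of RoPE that rotates consecutive pairs of dimensions; production implementations often pair dimensions differently and operate on batches, so treat the layout and the `rope` helper as assumptions.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of dimensions of a 1-D vector x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (3 positions apart) at two different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_near, score_far))  # True
```

Because both vectors are rotated, the dot product depends only on the difference between the two positions, which is why the attention score encodes relative rather than absolute position.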

5. Gemma 4 Architecture Deep Dive

Gemma 4 is a decoder-only Transformer with several modern design choices that optimize both quality and inference efficiency:

| Component | Gemma 4 Choice | Why |
| --- | --- | --- |
| Architecture | Decoder-only | Best for generative tasks; scales efficiently |
| Attention | Grouped-Query Attention (GQA) | Reduces KV-cache memory by 4–8× |
| Positional encoding | RoPE | Relative positions; context length extrapolation |
| Normalization | RMSNorm (pre-norm) | Faster than LayerNorm; more stable training |
| Activation | GeGLU | Gated activation; better than ReLU or GELU |
| Vocabulary | 256K tokens (SentencePiece) | Efficient encoding of multilingual text and code |
| Multimodal | Native vision encoder (SigLIP-based) | Process images alongside text |
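
To show why RMSNorm is cheaper than LayerNorm, here is a textbook-style comparison of the two. This is a generic sketch, not Gemma's actual implementation; the `eps` value and the parameter names `gain` and `bias` are assumptions.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    # Recenters (subtracts the mean) and rescales (divides by the std).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias

def rms_norm(x, gain, eps=1e-6):
    # Only rescales by the root mean square: no mean subtraction, no bias.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.array([[1.0, 2.0, 3.0, 4.0]])
gain, bias = np.ones(4), np.zeros(4)

ln_out = layer_norm(x, gain, bias)
rms_out = rms_norm(x, gain)
print(ln_out.mean())          # close to 0: LayerNorm recenters
print((rms_out ** 2).mean())  # close to 1: RMSNorm only rescales
```

RMSNorm skips the mean computation and the bias term, which saves a reduction per call while keeping activations at a controlled scale.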

Gemma 4 Model Sizes

| Variant | Parameters | Layers | Hidden Dim | Q Heads | KV Heads | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma 4 1B | 1 billion | 18 | 1,536 | 8 | 4 | On-device, mobile, edge |
| Gemma 4 4B | 4 billion | 26 | 2,560 | 16 | 8 | Lightweight tasks, fast inference |
| Gemma 4 12B | 12 billion | 36 | 3,840 | 24 | 8 | General purpose; our default in this course |
| Gemma 4 27B | 27 billion | 46 | 4,608 | 32 | 16 | Complex reasoning, coding, analysis |

```python
# Requires the google-genai package (pip install google-genai).
import os
from google import genai

client = genai.Client(api_key=os.environ["GOOGLE_AI_STUDIO_API_KEY"])

# You can query different Gemma 4 sizes via the API:
models_to_try = [
    "gemma-4-4b-it",    # Fast, lightweight
    "gemma-4-12b-it",   # Balanced (our default)
    "gemma-4-27b-it",   # Most capable
]

prompt = "Explain the difference between RMSNorm and LayerNorm in one paragraph."

for model_name in models_to_try:
    response = client.models.generate_content(
        model=model_name,
        contents=prompt,
    )
    print(f"=== {model_name} ===")
    print(response.text)
    print()
```

6. Mixture of Experts (MoE)

Some of the largest models use Mixture of Experts (MoE) architecture to scale parameter count without proportionally increasing compute. In an MoE model, each transformer layer has multiple "expert" feed-forward networks, but only a few are activated for each token.

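A router network decides which experts to activate for each token. In the common top-k formulation, the router scores every expert, keeps the $k$ highest scores, and combines the selected experts' outputs using softmax-normalized gate weights:

$$g(x) = \text{softmax}\left(\text{TopK}(x W_r,\, k)\right), \quad y = \sum_{i=1}^{N} g_i(x)\, E_i(x)$$

where $W_r$ is the router's weight matrix, $E_i$ is the $i$-th expert network, $N$ is the number of experts, and $\text{TopK}$ sets all but the $k$ largest scores to $-\infty$ so that unselected experts get a gate weight of zero and are never evaluated.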

Typically, only 2 out of 8–16 experts are active per token. This means a model with 400B total parameters might only use 17B parameters per forward pass, making it fast despite its massive size.
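
A toy version of top-k routing for a single token might look like the sketch below. It assumes 8 experts with 2 active, uses tiny one-layer "experts" in place of real feed-forward networks, and omits load balancing; the names `moe_layer` and `make_expert` are hypothetical.

```python
import numpy as np

def moe_layer(x, W_router, experts, top_k=2):
    """x: (d_model,). experts: list of callables. Returns (output, chosen ids)."""
    logits = x @ W_router                 # one score per expert
    chosen = np.argsort(logits)[-top_k:]  # indices of the top-k experts
    gate = np.exp(logits[chosen] - logits[chosen].max())
    gate = gate / gate.sum()              # softmax over the chosen experts only
    # Only the chosen experts are evaluated; the rest cost nothing.
    output = sum(g * experts[i](x) for g, i in zip(gate, chosen))
    return output, chosen

rng = np.random.default_rng(0)
d_model, num_experts = 8, 8               # toy sizes (assumed)

def make_expert():
    W = rng.normal(size=(d_model, d_model)) * 0.1
    return lambda x: np.maximum(x @ W, 0.0)  # toy one-layer stand-in for an FFN

experts = [make_expert() for _ in range(num_experts)]
W_router = rng.normal(size=(d_model, num_experts))

x = rng.normal(size=d_model)
y, chosen = moe_layer(x, W_router, experts)
print(y.shape, len(chosen))  # (8,) 2
```

All experts still have to be held in memory, so MoE reduces compute per token rather than the memory footprint of the weights.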

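| Model | Total Params | Active Params | Experts | Active Per Token |
| --- | --- | --- | --- | --- |
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 |
| GPT-4 (rumored) | ~1.8T | ~280B | 16 | 2 |
| DeepSeek-V3 | 671B | 37B | 256 | 8 |
| Llama 4 Maverick | 400B | 17B | 128 | |
| Gemma 4 (all sizes) | Dense | All | | |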

7. KV-Cache and Efficient Inference

During autoregressive generation, the model produces one token at a time. Naively, this would require recomputing attention over all previous tokens at each step — an O(n²) cost per token. The KV-cache solves this by storing the Key and Value vectors from previous steps:

  1. On step 1, compute Q, K, V for the prompt. Store K and V.
  2. On step 2, compute Q for the new token only. Retrieve stored K, V. Compute attention. Append new K, V to the cache.
  3. Repeat — each new token only needs its own Q, plus the growing KV-cache.
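
The steps above can be sketched for a single attention layer as follows. The embeddings of the "generated" tokens are given up front here (in a real model they come from sampling), and all sizes and weights are random stand-ins, so this only demonstrates that cached and recomputed attention agree.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    weights = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    return weights @ V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(6, d))  # stand-ins: 4 prompt tokens + 2 generated

# Step 1: process the prompt once and store its K and V.
prompt = tokens[:4]
K_cache, V_cache = prompt @ W_k, prompt @ W_v

outputs = []
for x_new in tokens[4:]:
    # Later steps: project only the new token, append its K/V, then attend.
    q_new = x_new @ W_q
    K_cache = np.vstack([K_cache, x_new @ W_k])
    V_cache = np.vstack([V_cache, x_new @ W_v])
    outputs.append(attend(q_new, K_cache, V_cache))

# Reference: recompute K and V for the whole sequence from scratch.
K_full, V_full = tokens @ W_k, tokens @ W_v
reference = attend(tokens[-1] @ W_q, K_full, V_full)
print(np.allclose(outputs[-1], reference))  # True
print(K_cache.shape)                        # (6, 8)
```

Each decoding step performs one new K/V projection instead of reprojecting the entire prefix, at the cost of keeping `K_cache` and `V_cache` in memory.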

The KV-cache trades memory for speed. Its size grows linearly with context length, number of layers, number of KV heads, and per-head dimension, so for a Gemma 4 12B model with a 128K context window the cache can reach many gigabytes. This is why GQA (fewer KV heads) is so important for long-context models.
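
A back-of-the-envelope estimate makes the memory cost concrete. The layer and head counts below are taken from the 12B row of the model-size table; the per-head dimension (hidden dim divided by Q heads), 16-bit storage, and full attention in every layer are assumptions, so the result is an upper-bound style estimate rather than a measured figure.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# From the model-size table (12B row); head_dim is an assumption.
hidden_dim, q_heads, kv_heads, layers = 3840, 24, 8, 36
head_dim = hidden_dim // q_heads  # 160 (assumed: hidden_dim / q_heads)
seq_len = 128 * 1024              # 128K-token context

gqa = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)
mha = kv_cache_bytes(layers, q_heads, head_dim, seq_len)  # if every Q head had its own K/V
print(f"GQA: {gqa / 2**30:.1f} GiB, MHA: {mha / 2**30:.1f} GiB")
# GQA: 22.5 GiB, MHA: 67.5 GiB
```

Under these assumptions, sharing K/V across groups of three query heads cuts the cache to a third of what full multi-head attention would need.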

8. Choosing the Right Model Size

There is no universally "best" model — the right choice depends on your constraints:

| Use Case | Recommended | Why |
| --- | --- | --- |
| Learning & prototyping | Gemma 4 12B-it (API) | Free, fast, high quality; our course default |
| Production chatbot | Gemma 4 27B or Claude Sonnet | Best quality/cost ratio for conversational AI |
| Mobile / edge | Gemma 4 1B or 4B | Runs on-device with acceptable quality |
| Complex reasoning | Claude Opus, GPT-4o, Gemini Pro | Frontier models for hardest tasks |
| Cost-sensitive batch processing | Gemma 4 4B or Llama 4 Scout | Low cost per token at scale |
| Self-hosted / private data | Gemma 4 or Llama 4 via vLLM | Open-weights = full control, no data leaves your infra |