Vectors, Matrices, and Tensors
Mathematics for Machine Learning
Every machine learning model — from logistic regression to GPT-5 — is, at its core, a sequence of operations on arrays of numbers. Those arrays are vectors when they have one axis, matrices when they have two, and tensors when they have three or more. If you can think fluently in these shapes, every ML paper becomes easier to read and every tensor-shape bug becomes easier to diagnose.
1. Vectors: Points and Directions
A vector is an ordered list of numbers. We write a column vector $\mathbf{v} \in \mathbb{R}^3$ as:
$$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}$$
Geometrically, this is both a point in 3-space and a direction from the origin to that point. Both interpretations are used constantly in ML. A word embedding is a point. A gradient is a direction.
Two key operations define a vector space:
- Addition: $\mathbf{u} + \mathbf{v}$ is element-wise. Geometrically, it's "tip-to-tail": walk along $\mathbf{u}$, then along $\mathbf{v}$.
- Scalar multiplication: $c\,\mathbf{v}$ stretches or shrinks the vector (and flips it if $c < 0$) without rotating it off its line.
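A quick NumPy sketch of both operations. The vectors `u` and `v` are arbitrary illustrative values, not taken from the text above.

```python
import numpy as np

# Illustrative vectors; the specific values are arbitrary.
u = np.array([1.0, 2.0, 0.0])
v = np.array([3.0, -1.0, 2.0])

# Addition is element-wise: walk along u, then along v.
w = u + v            # [4., 1., 2.]

# Scalar multiplication rescales every component by the same factor.
stretched = 2.0 * v  # [6., -2., 4.]
flipped = -1.0 * v   # [-3., 1., -2.]  (same line, opposite direction)
```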
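The dot product is the single most important vector operation in ML. It measures how "aligned" two vectors are:
$$\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i = \|\mathbf{u}\|\,\|\mathbf{v}\|\cos\theta$$
where $\theta$ is the angle between the two vectors. The sign of the result tells you the alignment: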
- Zero → orthogonal (uncorrelated directions)
- Positive → pointing in similar directions
- Negative → pointing in opposite directions
Cosine similarity, the backbone of semantic search and RAG retrieval, is just a normalized dot product: $\cos\theta = \dfrac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$.
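As a minimal sketch, cosine similarity can be written in a few lines of NumPy. The helper name `cosine_similarity` and the test vectors are illustrative choices, and the function assumes neither input is the zero vector.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Dot product of u and v divided by the product of their L2 norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])   # orthogonal to a
c = np.array([3.0, 0.0])   # same direction as a, different length
d = np.array([-1.0, 0.0])  # opposite direction to a

print(cosine_similarity(a, b))  # 0.0
print(cosine_similarity(a, c))  # 1.0
print(cosine_similarity(a, d))  # -1.0
```

Note that `c` is three times longer than `a` yet scores 1.0: the normalization removes length and keeps only direction.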
2. Norms: Measuring Length
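A norm assigns a "length" to a vector. The two you'll meet most often:
- L2 (Euclidean) norm: $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$, the straight-line length.
- L1 (Manhattan) norm: $\|\mathbf{v}\|_1 = \sum_i |v_i|$, the sum of absolute values.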
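A minimal NumPy sketch computing the L2 (Euclidean) and L1 (Manhattan) norms, which are the usual pair in ML practice. The vector `v` is an arbitrary example chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

v = np.array([3.0, -4.0])

l2 = np.linalg.norm(v)         # sqrt(3^2 + (-4)^2) = 5.0
l1 = np.linalg.norm(v, ord=1)  # |3| + |-4| = 7.0

# Dividing by the L2 norm gives a unit vector in the same direction.
unit = v / l2                  # [0.6, -0.8]
```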
3. Matrices: Transformations
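A matrix $A \in \mathbb{R}^{m \times n}$ is a rectangular array with $m$ rows and $n$ columns. The single most useful way to think about a matrix is as a function that maps $\mathbb{R}^n$ to $\mathbb{R}^m$:
$$f(\mathbf{x}) = A\mathbf{x}, \qquad f : \mathbb{R}^n \to \mathbb{R}^m$$
Given a column vector $\mathbf{x} \in \mathbb{R}^n$, the matrix transforms it into a new vector $\mathbf{y} = A\mathbf{x} \in \mathbb{R}^m$. This is what a dense layer in a neural network does: it's a matrix multiply followed by a nonlinearity.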
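The dense-layer view can be sketched in NumPy as follows. The sizes, the random weights, the bias term `b`, and the choice of ReLU as the nonlinearity are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 4, 3                      # input and output dimensions (arbitrary)
W = rng.standard_normal((m, n))  # an m x n matrix maps R^4 to R^3
b = np.zeros(m)                  # bias: a common addition to the bare matmul
x = rng.standard_normal(n)

y = W @ x + b                    # the linear map
h = np.maximum(y, 0.0)           # ReLU, one common choice of nonlinearity

print(W.shape, x.shape, h.shape)  # (3, 4) (4,) (3,)
```

The shape bookkeeping is the point here: an `(m, n)` matrix consumes a length-`n` vector and produces a length-`m` vector.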
Three special matrices you'll encounter every day:
| Matrix | What it does | Example ML use |
|---|---|---|
| Identity | Leaves the input unchanged: $I\mathbf{x} = \mathbf{x}$ | Residual connections (skip path) |
| Diagonal | Scales each coordinate independently | Feature-wise normalization, LayerNorm's learned scale $\gamma$ |
| Orthogonal | Rotates / reflects; preserves all lengths and angles | RoPE positional encoding; weight init |
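The three behaviors in the table can be checked numerically. This is a small sketch with hand-picked 2×2 matrices: the diagonal entries and the 90° rotation angle are arbitrary example values.

```python
import numpy as np

x = np.array([3.0, 4.0])                 # length 5

I = np.eye(2)                            # identity
D = np.diag([2.0, 0.5])                  # diagonal: per-coordinate scaling
theta = np.pi / 2                        # 90-degree rotation
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # orthogonal

print(I @ x)                             # [3. 4.]
print(D @ x)                             # [6. 2.]
print(np.allclose(Q.T @ Q, np.eye(2)))   # True: Q is orthogonal
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))  # True: length preserved
```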
4. Tensors: Higher-Order Arrays
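A tensor is just a generalization: rank-0 is a scalar, rank-1 is a vector, rank-2 is a matrix, rank-3 has (depth, rows, cols), and so on. Here "rank" means the number of axes (what NumPy and PyTorch call `ndim`), not the linear-algebra rank of a matrix. In ML we routinely need rank 3 or higher because a batch axis sits on top of each example's own axes.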
```python
import torch

# A single RGB image:     (C=3, H=224, W=224)            → rank 3
# A batch of images:      (B=32, C=3, H=224, W=224)      → rank 4
# A batch of token seqs:  (B=8, T=1024, D=768)           → rank 3
# A batch of videos:      (B=4, T=16, C=3, H=224, W=224) → rank 5
x = torch.randn(8, 1024, 768)  # batch × time × embedding
print(x.shape)                 # torch.Size([8, 1024, 768])
print(x.ndim)                  # 3
```
5. Broadcasting: How NumPy and PyTorch Stretch Tensors
Broadcasting is the rule for how tensors of different shapes combine. Two shapes are compatible when, aligned from the right, each pair of dimensions is either equal or one of them is 1. If one shape has fewer dimensions, the missing leading dimensions are treated as 1.
```python
import numpy as np

x = np.zeros((32, 1024, 768))  # (B, T, D)
bias = np.zeros(768)           # (D,)
# bias is treated as (1, 1, 768), then stretched to (32, 1024, 768)
y = x + bias                   # OK, shape (32, 1024, 768)

scale = np.zeros((32, 1))      # (B, 1)
# Aligned from the right, (32, 1) pairs with the trailing (1024, 768):
# 1 vs 768 is fine, but 32 vs 1024 is a mismatch, so x * scale raises ValueError.
# Add a trailing axis so the batch dimension lines up:
scale = scale[:, :, None]      # (32, 1, 1) → broadcasts across T and D
y = x * scale                  # OK, shape (32, 1024, 768)
```
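The compatibility rule is short enough to write out by hand. The function below is an illustrative pure-Python sketch of the rule as stated above (the name `broadcast_shape` is made up for this example); in real code NumPy's own `np.broadcast_shapes` does the same job.

```python
from itertools import zip_longest

def broadcast_shape(a: tuple, b: tuple) -> tuple:
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the right; a missing dimension counts as 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast {a} with {b}: {da} vs {db}")
    return tuple(reversed(result))

print(broadcast_shape((32, 1024, 768), (768,)))      # (32, 1024, 768)
print(broadcast_shape((32, 1024, 768), (32, 1, 1)))  # (32, 1024, 768)
try:
    broadcast_shape((32, 1024, 768), (32, 1))
except ValueError as err:
    print(err)  # the 1024 vs 32 pair is the mismatch
```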
6. Why This Matters for ML
Almost every line of a deep-learning model's forward pass is a tensor operation. In outline, the forward pass of a transformer is:
- Tokens → embedding lookup (matrix indexing)
- Add positional encoding (tensor addition)
- Linear projections to Q, K, V (three matrix multiplies)
- Attention scores (batched matmul)
- Softmax (tensor operation over one axis)
- Weighted sum of values (another matmul)
- Output projection (matmul again)
- Repeat × N layers
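Little else is happening: the pieces left out of this outline (residual additions, layer normalization, the feed-forward block) are built from the same element-wise operations, reductions, and matmuls. If you are comfortable with the shapes flowing through those operations, you can read the PyTorch source for any transformer and know where you are.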
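The steps listed above can be traced shape by shape in a few lines of NumPy. This is a deliberately stripped-down sketch, not a real transformer: it assumes a single attention head, random weights, tiny made-up sizes, the conventional division by the square root of `D`, and no masking, layer normalization, residuals, or feed-forward block.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small sizes: batch, sequence length, model width, vocabulary.
B, T, D, V = 2, 5, 8, 50

tokens = rng.integers(0, V, size=(B, T))        # (B, T) integer token ids
embed = rng.standard_normal((V, D))             # embedding table
pos = rng.standard_normal((T, D))               # positional encoding
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) for _ in range(4))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

x = embed[tokens]                               # embedding lookup   -> (B, T, D)
x = x + pos                                     # broadcast addition -> (B, T, D)
q, k, v = x @ Wq, x @ Wk, x @ Wv                # three projections  -> (B, T, D) each
scores = q @ k.transpose(0, 2, 1) / np.sqrt(D)  # batched matmul     -> (B, T, T)
weights = softmax(scores, axis=-1)              # each row sums to 1 -> (B, T, T)
out = (weights @ v) @ Wo                        # weighted sum, then output projection

print(out.shape)  # (2, 5, 8)
```

Every intermediate keeps the batch axis in front, and the only place the shape changes is the `(B, T, T)` score matrix, which is why attention cost grows with the square of the sequence length.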