Mathematics for Machine Learning

Vectors, Matrices, and Tensors

Linear Algebra Foundations · 35 min read


Every machine learning model — from logistic regression to GPT-5 — is, at its core, a sequence of operations on arrays of numbers. Those arrays are vectors when they have one axis, matrices when they have two, and tensors when they have three or more. If you can think fluently in these shapes, every ML paper becomes easier to read and every tensor-shape bug becomes easier to diagnose.

1. Vectors: Points and Directions

A vector is an ordered list of numbers. We write a column vector as $v \in R^{n}$:

$$v = \begin{bmatrix} 2 \\ -1 \\ 4 \end{bmatrix} \in R^{3}$$

Geometrically, this is both a point in 3-space and a direction from the origin to that point. Both interpretations are used constantly in ML. A word embedding is a point. A gradient is a direction.

Two key operations define a vector space:

Addition: $u + v$ is element-wise. Geometrically, it's "tip-to-tail": walk along $u$, then along $v$.
Scalar multiplication: $c v$ stretches or shrinks the vector, and flips it if $c < 0$, but keeps it on the same line through the origin.
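
As a minimal NumPy sketch of these two operations: `v` reuses the example vector from above, while `u` and the scalar `-2.0` are illustrative values chosen here, not taken from the text.

```python
import numpy as np

u = np.array([1.0, 0.0, 3.0])   # illustrative vector (assumed values)
v = np.array([2.0, -1.0, 4.0])  # the example vector from above

total = u + v        # element-wise addition: [3., -1., 7.]
flipped = -2.0 * v   # scalar multiplication: [-4., 2., -8.]

# Scaling keeps the result on the line through v. In 3-D, the cross
# product of two parallel vectors is the zero vector.
parallel = np.allclose(np.cross(v, flipped), 0.0)
print(total, flipped, parallel)
```

The cross-product check only works in three dimensions; it is used here purely to make the "same line" claim concrete.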

The dot product $u \cdot v = \sum_{i} u_{i} v_{i}$ is the single most important vector operation in ML. It measures how "aligned" two vectors are:

Zero → orthogonal (uncorrelated directions)
Positive → pointing in similar directions
Negative → pointing in opposite directions

Cosine similarity, the backbone of semantic search and RAG retrieval, is just a normalized dot product: $\cos\theta = \frac{u \cdot v}{\|u\| \, \|v\|}$.
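
The three sign cases and the cosine formula can be checked numerically. This is a small sketch with made-up vectors; the helper name `cosine_similarity` is introduced here for illustration and is not a NumPy function.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product divided by the product of the two L2 norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * a                  # expect a positive dot product
opposite = -a                             # expect a negative dot product
orthogonal = np.array([3.0, 0.0, -1.0])   # 1*3 + 2*0 + 3*(-1) = 0

print(np.dot(a, same_direction))   # 28.0
print(np.dot(a, opposite))         # -14.0
print(np.dot(a, orthogonal))       # 0.0

cos_same = cosine_similarity(a, same_direction)   # close to +1
cos_opposite = cosine_similarity(a, opposite)     # close to -1
print(cos_same, cos_opposite)
```

The helper divides by the norms, so it is undefined for a zero vector; production code usually adds a small epsilon or filters such inputs first.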

2. Norms: Measuring Length

A norm assigns a "length" to a vector. The two you'll meet most often:

$$\|v\|_{2} = \sqrt{\sum_{i} v_{i}^{2}} \quad \text{(Euclidean, L2)}$$

$$\|v\|_{1} = \sum_{i} |v_{i}| \quad \text{(Manhattan, L1)}$$
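
Both norms are easy to compute by hand and to compare against `np.linalg.norm`. The sketch below reuses the example vector $(2, -1, 4)$ from Section 1; the variable names are illustrative.

```python
import numpy as np

v = np.array([2.0, -1.0, 4.0])

# Straight from the definitions.
l2_manual = np.sqrt(np.sum(v ** 2))   # sqrt(4 + 1 + 16) = sqrt(21)
l1_manual = np.sum(np.abs(v))         # 2 + 1 + 4 = 7

# Library equivalents.
l2 = np.linalg.norm(v)          # for 1-D input the default is the L2 norm
l1 = np.linalg.norm(v, ord=1)   # ord=1 selects the L1 norm

print(l2, l1)
```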

3. Matrices: Transformations

A matrix $A \in R^{m \times n}$ is a rectangular array with $m$ rows and $n$ columns. The single most useful way to think about a matrix is as a function that maps $R^{n}$ to $R^{m}$:

$$y = A x$$

Given a column vector $x \in R^{n}$, the matrix $A$ transforms it into a new vector $y \in R^{m}$. This is the core of a dense layer in a neural network: a matrix multiply (usually plus a bias vector) followed by a nonlinearity.
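
A rough sketch of that mapping in NumPy follows. The sizes, the random weights, the bias `b`, and the choice of ReLU as the nonlinearity are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 4, 3                       # input and output sizes (illustrative)
A = rng.standard_normal((m, n))   # weight matrix: maps R^4 to R^3
b = np.zeros(m)                   # bias vector (assumed; zero here)
x = rng.standard_normal(n)        # input vector in R^4

y = A @ x                         # the linear map y = A x
h = np.maximum(0.0, A @ x + b)    # dense layer with ReLU as the nonlinearity

print(y.shape, h.shape)           # (3,) (3,)
```

Note how the shapes line up: an `(m, n)` matrix consumes a length-`n` vector and produces a length-`m` vector.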

Three special matrices you'll encounter every day:

| Matrix | What it does | Example ML use |
| --- | --- | --- |
| Identity $I$ | Leaves the input unchanged: $I x = x$ | Residual connections (skip path) |
| Diagonal | Scales each coordinate independently | Feature-wise normalization, LayerNorm's $\gamma$ |
| Orthogonal | Rotates or reflects; preserves all lengths and angles | RoPE positional encoding; weight init |
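
The behavior of each of the three matrices can be verified on a small 2-D example. The specific vector, scale factors, and 30-degree rotation angle below are illustrative choices.

```python
import numpy as np

x = np.array([3.0, 4.0])          # length 5

I = np.eye(2)                     # identity
D = np.diag([2.0, 0.5])           # diagonal: scales each coordinate
theta = np.pi / 6                 # 30-degree rotation (illustrative)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal (a rotation)

print(I @ x)                      # [3. 4.]
print(D @ x)                      # [6. 2.]

rotated_length = np.linalg.norm(Q @ x)        # same length as x, up to rounding
is_orthogonal = np.allclose(Q.T @ Q, np.eye(2))   # Q^T Q = I
print(rotated_length, is_orthogonal)
```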

4. Tensors: Higher-Order Arrays

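A tensor is just a generalization: rank-0 is a scalar, rank-1 is a vector, rank-2 is a matrix, rank-3 has (depth, rows, cols), and so on. Here "rank" means the number of axes (what PyTorch and NumPy report as `ndim`), not the rank of a matrix in the linear-algebra sense. In ML we almost always need rank 3 or higher because we process data in batches.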

```python
import torch

# A single RGB image:            (C=3, H=224, W=224)            → rank 3
# A batch of images:      (B=32, C=3, H=224, W=224)              → rank 4
# A batch of token seqs:  (B=8, T=1024, D=768)                   → rank 3
# A batch of videos:      (B=4, T=16, C=3, H=224, W=224)         → rank 5

x = torch.randn(8, 1024, 768)  # batch × time × embedding
print(x.shape)                 # torch.Size([8, 1024, 768])
print(x.ndim)                  # 3
```
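
To see where the extra batch axis comes from, here is a small NumPy sketch (NumPy arrays follow the same shape conventions). Stacking 32 copies of one zero-filled image is an assumption made purely to show the shapes.

```python
import numpy as np

image = np.zeros((3, 224, 224), dtype=np.float32)   # one image: (C, H, W), rank 3
batch = np.stack([image] * 32)                       # 32 images: (B, C, H, W), rank 4

print(image.ndim, batch.ndim)   # 3 4
print(batch.shape)              # (32, 3, 224, 224)
print(batch[0].shape)           # (3, 224, 224): indexing drops the batch axis
```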

5. Broadcasting: How NumPy and PyTorch Stretch Tensors

Broadcasting is the rule for how tensors of different shapes combine. Two shapes are compatible when, aligned from the right, each pair of dimensions is either equal or one of them is 1; a shape with fewer dimensions is treated as if it were padded with leading 1s.

```python
import numpy as np

x = np.zeros((32, 1024, 768))  # (B, T, D)
bias = np.zeros(768)            # (D,)

# bias is treated as (1, 1, 768), then expands to (32, 1024, 768)
y = x + bias                    # OK, shape (32, 1024, 768)

scale = np.zeros((32, 1))       # (B, 1)
# Aligned from the right, (32, 1) pairs 1 with D=768 (fine) but 32 with T=1024,
# so x * scale would raise a ValueError. Add a trailing axis instead:
scale = scale[:, :, None]       # (32, 1, 1) → broadcasts across T and D
y = x * scale                   # OK, shape (32, 1024, 768)
```
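
The compatibility rule itself fits in a few lines of plain Python. The function name `broadcast_shape` is hypothetical; it is a sketch of the rule as stated above, not the code NumPy or PyTorch actually run.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Return the broadcast result shape of a and b, or None if incompatible."""
    result = []
    # Walk both shapes from the right; a missing dimension counts as 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or da == 1 or db == 1:
            # A size-1 dimension stretches to match the other one.
            result.append(db if da == 1 else da)
        else:
            return None
    return tuple(reversed(result))

print(broadcast_shape((32, 1024, 768), (768,)))       # (32, 1024, 768)
print(broadcast_shape((32, 1024, 768), (32, 1)))      # None
print(broadcast_shape((32, 1024, 768), (32, 1, 1)))   # (32, 1024, 768)
```

Running a check like this on paper before writing an expression is often the quickest way to locate a shape bug.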

6. Why This Matters for ML

Almost everything a modern deep-learning framework executes is a tensor operation. At a high level, the forward pass of a transformer is:

Tokens → embedding lookup (matrix indexing)
Add positional encoding (tensor addition)
Linear projections to Q, K, V (three matrix multiplies)
Attention scores $Q K^{T} / \sqrt{d_{k}}$ (batched matmul)
Softmax (tensor operation over one axis)
Weighted sum of values (another matmul)
Output projection (matmul again)
Repeat × N layers
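
The middle of that list (projections, scores, softmax, weighted sum, output projection) can be sketched in NumPy as a single attention head. Everything here is an assumption for illustration: the tiny sizes, the random weights named `W_q`, `W_k`, `W_v`, `W_o`, and the omission of masking and multiple heads. The scores are divided by $\sqrt{d_k}$, the standard scaled dot-product form, with $d_k = D$ in this single-head setup.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
B, T, D = 2, 5, 8                      # small illustrative sizes
x = rng.standard_normal((B, T, D))     # embeddings plus positions: (B, T, D)

# Hypothetical projection weights; a trained model would learn these.
W_q, W_k, W_v, W_o = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

Q, K, V = x @ W_q, x @ W_k, x @ W_v               # each (B, T, D)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(D)    # (B, T, T)
weights = softmax(scores, axis=-1)                # rows sum to 1 over the key axis
context = weights @ V                             # (B, T, D)
out = context @ W_o                               # (B, T, D)

print(scores.shape, out.shape)         # (2, 5, 5) (2, 5, 8)
```

Tracking the shape comment on each line is the habit worth copying: every step is a matmul or an axis-wise operation whose output shape you can predict in advance.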

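Apart from residual additions, layer normalization, and a feed-forward block in each layer (all tensor operations themselves), nothing else is happening. If you are comfortable with the shapes flowing through those operations, you can read the PyTorch source for any transformer and know where you are.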
