Mathematics for Machine Learning

Matrix Operations and Properties

Linear Algebra Foundations · 2 of 36 · 30 min read

A small handful of matrix operations (multiplication, transpose, inverse, determinant, rank, and trace) power everything from linear regression to the attention mechanism. This lesson builds fluency with those operations and, crucially, with knowing when each one fails.

1. Matrix Multiplication

For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the product $C = AB$ has shape $m \times p$, with

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

The inner dimensions must match. Three useful ways to see matrix multiplication — pick whichever fits the moment:

Row-by-column: entry $C_{ij}$ is the dot product of the $i$-th row of $A$ with the $j$-th column of $B$. Good for hand calculation.
Column combinations: the $j$-th column of $AB$ is a linear combination of the columns of $A$, with coefficients taken from the $j$-th column of $B$. Good for understanding what $B$ is "doing" to $A$.
Composition of transformations: if $A$ and $B$ are each linear maps, $AB$ is the map you get by applying $B$ first and then $A$. Good for thinking about neural network layers.
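
The three views can be checked against each other numerically. The following is a minimal NumPy sketch; the matrix sizes, the random seed, and the variable names are illustrative assumptions, not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # m x n
B = rng.standard_normal((4, 2))   # n x p
C = A @ B                         # m x p

# Row-by-column: entry (i, j) is the dot product of row i of A and column j of B.
C_rowcol = np.array([[A[i, :] @ B[:, j] for j in range(B.shape[1])]
                     for i in range(A.shape[0])])

# Column combinations: column j of C mixes the columns of A,
# using the entries of column j of B as the weights.
C_cols = np.column_stack([
    sum(B[k, j] * A[:, k] for k in range(A.shape[1]))
    for j in range(B.shape[1])
])

# Composition: applying B first and then A to a vector x
# gives the same result as applying the single matrix A @ B.
x = rng.standard_normal(2)
composed = A @ (B @ x)

print(np.allclose(C, C_rowcol), np.allclose(C, C_cols), np.allclose(C @ x, composed))
```

The explicit loops are only there to expose the structure; in real code `A @ B` is the form to use.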

2. Transpose

The transpose $A^{T}$ flips a matrix across its main diagonal: rows become columns. Key identities you will use constantly:

$$(A^{T})^{T} = A, \qquad (AB)^{T} = B^{T} A^{T}, \qquad (A + B)^{T} = A^{T} + B^{T}$$

The middle identity — order reverses when you transpose a product — is the source of half the "why is there a transpose here" moments in backprop derivations.
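
The order reversal is easy to confirm on non-square matrices, where the un-reversed product is not even defined. This is a small sketch with arbitrarily chosen shapes and names:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

lhs = (A @ B).T      # shape (2, 3)
rhs = B.T @ A.T      # (2, 4) @ (4, 3) -> shape (2, 3)
identity_holds = np.allclose(lhs, rhs)

# Without the reversal the shapes do not line up: A.T is (4, 3) and B.T is (2, 4).
try:
    A.T @ B.T
    wrong_order_defined = True
except ValueError:
    wrong_order_defined = False

print(identity_holds, wrong_order_defined)
```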

A matrix is symmetric if $A = A^{T}$. Symmetric matrices are everywhere in ML: covariance matrices, Gram matrices, Hessians of twice continuously differentiable losses. Real symmetric matrices have special structure (real eigenvalues, and eigenvectors that can be chosen orthonormal) that we will exploit in Lesson 4.
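
As a quick illustration of that structure, the sketch below builds a Gram matrix from made-up data (the data matrix `X` and its shape are assumptions for the example) and checks the properties with `np.linalg.eigh`, NumPy's eigensolver for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))   # hypothetical data: 50 samples, 4 features
G = X.T @ X                        # Gram matrix of the feature columns, 4 x 4

is_symmetric = np.allclose(G, G.T)

# eigh assumes a symmetric input and returns real eigenvalues in ascending
# order, plus a matrix Q whose columns are orthonormal eigenvectors.
eigvals, Q = np.linalg.eigh(G)
eigvecs_orthonormal = np.allclose(Q.T @ Q, np.eye(4))
reconstructs = np.allclose(Q @ np.diag(eigvals) @ Q.T, G)

print(is_symmetric, eigvecs_orthonormal, reconstructs)
```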

3. Inverse

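The inverse $A^{-1}$ of a square matrix $A$, when it exists, satisfies $A A^{-1} = A^{-1} A = I$. It is the "undo" of the linear transformation $A$. An inverse exists exactly when $\det(A) \neq 0$, or equivalently when $A$ has full rank.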
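
Purely to illustrate the definition, the sketch below forms an explicit inverse for a tiny matrix and then shows what happens when no inverse exists. The singular matrix `S` is a made-up example.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
A_inv = np.linalg.inv(A)

# The inverse undoes A from either side.
undoes_left = np.allclose(A_inv @ A, np.eye(2))
undoes_right = np.allclose(A @ A_inv, np.eye(2))

# Applying A and then its inverse returns the original vector.
v = np.array([1.0, -2.0])
round_trip = A_inv @ (A @ v)

# A singular matrix (second row is twice the first) has no inverse,
# and NumPy reports that by raising LinAlgError.
S = np.array([[1.0, 2.0], [2.0, 4.0]])
try:
    np.linalg.inv(S)
    singular_inverted = True
except np.linalg.LinAlgError:
    singular_inverted = False

print(undoes_left, undoes_right, round_trip, singular_inverted)
```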

In practice, you should almost never compute explicit matrix inverses in ML code. If you need $A^{-1} b$, solve the linear system $A x = b$ instead. It is faster and more numerically stable.

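```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

# BAD: forms the full inverse first, then multiplies. Still O(n³), but with
# more arithmetic and more accumulated rounding error than a direct solve.
x_bad = np.linalg.inv(A) @ b

# GOOD: solves A x = b directly via a pivoted LU factorization, without ever
# forming the inverse. Same O(n³) order, less work, smaller rounding error.
x_good = np.linalg.solve(A, b)

print(x_good)  # [2. 3.]
```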
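
On a well-conditioned 2 x 2 system the two approaches agree to many digits, so the difference is easier to see on a badly conditioned matrix. The sketch below uses a 10 x 10 Hilbert matrix, which is an illustrative choice and not part of the lesson. The exact residuals depend on the NumPy and LAPACK build, so no specific numbers are claimed here, but the direct solve typically leaves the smaller residual.

```python
import numpy as np

n = 10
idx = np.arange(n)
H = 1.0 / (idx[:, None] + idx[None, :] + 1.0)   # Hilbert matrix: H[i, j] = 1 / (i + j + 1)

x_true = np.ones(n)
b = H @ x_true

x_inv = np.linalg.inv(H) @ b        # explicit inverse, then multiply
x_solve = np.linalg.solve(H, b)     # direct solve

# The residual ||H x - b|| measures how well each answer satisfies the system.
res_inv = np.linalg.norm(H @ x_inv - b)
res_solve = np.linalg.norm(H @ x_solve - b)

print(f"condition number : {np.linalg.cond(H):.2e}")
print(f"residual (inv)   : {res_inv:.2e}")
print(f"residual (solve) : {res_solve:.2e}")
```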

4. Determinant

The determinant $\det(A)$ of a square matrix is a single scalar that summarizes two things about the linear map $A$:

Magnitude: $|\det(A)|$ is the factor by which $A$ scales volume. If $\det(A) = 0.5$, the volume of any region is halved after applying $A$.
Orientation: a negative determinant means $A$ also flips orientation (like a mirror reflection).

Useful properties:

$$\det(AB) = \det(A)\,\det(B), \qquad \det(A^{T}) = \det(A), \qquad \det(A^{-1}) = \frac{1}{\det(A)}$$
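
Both the geometric reading and the identities can be checked on small matrices. In this sketch the matrices and the helper `polygon_area` (a shoelace-formula area function written for this example) are illustrative assumptions.

```python
import numpy as np

def polygon_area(pts):
    """Area of a simple polygon from its vertices in order (shoelace formula)."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(np.roll(x, -1), y))

A = np.array([[3.0, 1.0], [1.0, 2.0]])
det_A = np.linalg.det(A)                      # 3*2 - 1*1 = 5

# Magnitude: the unit square (area 1) is mapped to a parallelogram of area |det(A)|.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
mapped = square @ A.T                         # each row is A applied to one corner
area_mapped = polygon_area(mapped)

# Orientation: a reflection across the x-axis has determinant -1.
R = np.array([[1.0, 0.0], [0.0, -1.0]])
det_R = np.linalg.det(R)

# The three identities from the text.
product_rule = np.isclose(np.linalg.det(A @ R), det_A * det_R)
transpose_rule = np.isclose(np.linalg.det(A.T), det_A)
inverse_rule = np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / det_A)

print(det_A, area_mapped, det_R)
print(product_rule, transpose_rule, inverse_rule)
```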

Determinants appear in probability (changes of variable in densities), in normalizing flows, and in regularization for certain generative models. In day-to-day ML code they're rare, but conceptually important.

5. Rank

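The rank of a matrix is the number of linearly independent columns (equivalently, rows), which is also the dimension of the space its outputs span. A full-rank square matrix keeps every direction of its input distinct; a rank-deficient one collapses at least one direction to zero.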

Rank shows up throughout ML:

| Where | Why rank matters |
| --- | --- |
| Linear regression | If $X^{T} X$ is rank-deficient, it has no inverse, so the closed-form solution $(X^{T} X)^{-1} X^{T} y$ is undefined. This happens, for example, when features are perfectly collinear. |
| LoRA fine-tuning | The whole idea is that weight updates are low-rank: you can approximate $\Delta W$ as $BA$, where $B$ is tall and $A$ is wide, with a small shared inner dimension. |
| PCA, embeddings | Dimensionality reduction is finding a low-rank approximation to your data matrix. |
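
The first two rows can be sketched numerically with `np.linalg.matrix_rank`. Everything specific here is an assumption made for illustration: the synthetic features, the collinear fourth column, and the layer sizes `d`, `k`, and rank `r`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear regression: add a feature that is an exact linear combination of two others.
X = rng.standard_normal((100, 3))
X_collinear = np.column_stack([X, 2.0 * X[:, 0] - X[:, 1]])   # 4 columns, only 3 independent
rank_X = np.linalg.matrix_rank(X_collinear)

# Because X_collinear has rank 3, the 4 x 4 matrix X^T X is singular and the
# closed-form solution is unavailable. A least-squares solver still returns
# an answer (the minimum-norm one) and reports the rank it detected.
y = rng.standard_normal(100)
w, _, rank_reported, _ = np.linalg.lstsq(X_collinear, y, rcond=None)

# LoRA-style update: a d x k matrix built from thin factors has rank at most r.
d, k, r = 64, 32, 4
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))
delta_W = B @ A
rank_delta = np.linalg.matrix_rank(delta_W)

full_params = d * k              # entries in a dense d x k update
low_rank_params = r * (d + k)    # entries in B and A together

print(rank_X, rank_reported, rank_delta)
print(full_params, low_rank_params)
```

With these sizes the factored form stores 384 numbers instead of 2048, which is the saving the low-rank assumption buys.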

6. Trace

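The trace $\operatorname{tr}(A)$ is the sum of the diagonal entries of a square matrix. It looks innocuous but has a magical property:

$$\operatorname{tr}(AB) = \operatorname{tr}(BA)$$

This holds whenever both products are defined, even if $A$ and $B$ are not square. Applied repeatedly it gives the cyclic property, $\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB)$, which lets you rotate the factors inside a trace (but not reorder them arbitrarily). It's the tool that makes derivations of batch normalization, the score function, and many loss gradients clean rather than horrific.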
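
A short check of the property, including the caveat that only cyclic rotations are safe. The matrices below are small made-up examples chosen so the arithmetic can be verified by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# tr(AB) = tr(BA) even when A and B are not square:
# A @ B is 2 x 2 while B @ A is 5 x 5, yet the traces agree.
A = rng.standard_normal((2, 5))
B = rng.standard_normal((5, 2))
swap_holds = np.isclose(np.trace(A @ B), np.trace(B @ A))

# Cyclic rotations of three square factors leave the trace unchanged...
P = np.array([[1.0, 2.0], [3.0, 4.0]])
Q = np.array([[0.0, 1.0], [1.0, 0.0]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
t_pqr = np.trace(P @ Q @ R)
t_qrp = np.trace(Q @ R @ P)
t_rpq = np.trace(R @ P @ Q)

# ...but a non-cyclic reordering generally changes it.
t_prq = np.trace(P @ R @ Q)

print(swap_holds)
print(t_pqr, t_qrp, t_rpq)  # 8.0 8.0 8.0
print(t_prq)                # 7.0
```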

7. Computational Cost Is Not Equal
