Tensors, Autograd, and GPU Computing

35 min read · video · PyTorch Foundations

2 of 42 · Deep Learning with PyTorch

Three primitives carry the entire course: tensors (multi-dimensional arrays), autograd (the reverse-mode differentiation engine), and devices (CPU / GPU / MPS placement). Every layer, every loss, every optimizer is a wrapper over these three. This lesson goes deep on each — what they actually do, where the sharp edges are, and the patterns you'll repeat in every PyTorch program you ever write.

1. Tensors: NumPy with Two Extra Powers

code

import torch

a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
b = torch.zeros(2, 2)
c = torch.randn(3, 4, 5)
d = torch.arange(0, 10, 2)        # [0, 2, 4, 6, 8]

print(a.shape, a.dtype, a.device, a.requires_grad)
# torch.Size([2, 2]) torch.float32 cpu False

A tensor's identity is shape × dtype × device × requires_grad. Two of those (device, requires_grad) don't exist in NumPy and are responsible for almost every PyTorch bug you'll hit early on:

device: where the data lives (CPU, CUDA GPU, or Apple MPS). Operations between tensors on different devices raise a RuntimeError.
requires_grad: whether autograd tracks operations on this tensor. Off by default for tensors you create; on by default for module parameters (nn.Parameter); set it yourself for any input you want gradients of.
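
A small sketch of how requires_grad and device show up in practice. It runs on the CPU, and the names `w` and `layer` are illustrative rather than part of the lesson.

```python
import torch
from torch import nn

# Plain tensors do not track gradients unless you ask for it.
w = torch.randn(3, 2)
print(w.requires_grad)                  # False
w.requires_grad_(True)                  # in-place switch
print(w.requires_grad)                  # True

# Parameters created by a module are tracked from the start.
layer = nn.Linear(3, 2)
print(all(p.requires_grad for p in layer.parameters()))   # True

# Every tensor reports where it lives.
print(w.device.type)                    # cpu
```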

2. Dtype Matters

Dtype	Bits	When
`torch.float32` / `torch.float`	32	Default; use unless you have a reason
`torch.float64` / `torch.double`	64	Numerical analysis; rarely in deep learning
`torch.float16` / `torch.half`	16	Mixed-precision training (with care)
`torch.bfloat16`	16	Modern default for mixed precision; same exponent range as fp32
`torch.int64` / `torch.long`	64	Indices, class labels, embedding lookup keys
`torch.bool`	8	Masks (stored as one byte per element)

Two dtype rules to internalise:

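Class-index labels for cross-entropy must be long, not float. (Float targets are interpreted as per-class probabilities and must match the shape of the logits.)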
Mixed-precision training uses bf16 (preferred on modern GPUs) or fp16 — Lesson 34 covers it.
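
The two rules can be checked directly. This is a minimal sketch; the tensors `labels`, `scores`, and `logits` are made-up illustration data.

```python
import torch

# bf16 keeps fp32's exponent range; fp16 overflows far earlier.
print(torch.finfo(torch.float16).max)            # 65504.0
print(torch.finfo(torch.bfloat16).max > 1e38)    # True

# Python ints become int64 (long); Python floats become float32.
labels = torch.tensor([0, 2, 1])
scores = torch.tensor([0.0, 2.0, 1.0])
print(labels.dtype, scores.dtype)                # torch.int64 torch.float32

# Labels that arrive as floats need an explicit cast before cross-entropy.
logits = torch.randn(3, 4)
loss = torch.nn.functional.cross_entropy(logits, scores.long())
print(loss.dim())                                # 0
```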

3. Indexing, Slicing, Broadcasting

code

x = torch.arange(24).reshape(2, 3, 4)         # (2, 3, 4)

x[0]              # → (3, 4) — first batch
x[:, 1]           # → (2, 4) — second row of every batch
x[..., -1]        # → (2, 3) — last element along the last dim
x[x > 10]         # → 1-D tensor of values > 10 (mask + select)

# Broadcasting follows NumPy rules
a = torch.randn(3, 1)       # (3, 1)
b = torch.randn(   4)       # (4,)
c = a + b                   # (3, 4)

If you can read NumPy you can read PyTorch indexing, with one trap: indexing with a boolean mask of the same shape as the tensor returns a flattened 1-D tensor of the selected values. Use torch.where if you need to keep the shape.
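
A short sketch of the difference between mask indexing and torch.where, using a small made-up tensor so the results are easy to read:

```python
import torch

x = torch.arange(6).reshape(2, 3)       # [[0, 1, 2], [3, 4, 5]]
mask = x > 2

selected = x[mask]                      # shape is lost
print(selected.tolist())                # [3, 4, 5]

kept = torch.where(mask, x, torch.zeros_like(x))   # shape is preserved
print(kept.tolist())                    # [[0, 0, 0], [3, 4, 5]]
print(kept.shape)                       # torch.Size([2, 3])
```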

4. The Two Reshape Operations

Op	Behaviour
`x.view(...)`	Returns a view; requires a compatible memory layout (contiguous always works); raises a RuntimeError otherwise
`x.reshape(...)`	Returns a view if possible; copies if not; always works
`x.permute(2, 0, 1)`	Reorders dimensions; produces a non-contiguous view
`x.transpose(0, 1)`	Swaps two dimensions
`x.contiguous()`	Returns the tensor itself if already contiguous; otherwise a contiguous copy

When in doubt: reshape. Reach for view only when you're sure the tensor is contiguous and want the failure mode to be loud.
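
The loud failure mode is easy to reproduce. The sketch below transposes a small tensor (an illustrative choice) so that its memory layout no longer supports a flat view:

```python
import torch

x = torch.arange(6).reshape(2, 3)
t = x.transpose(0, 1)                   # (3, 2), a non-contiguous view of x
print(t.is_contiguous())                # False

try:
    t.view(6)                           # layout is incompatible, so this raises
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)                      # True

flat = t.reshape(6)                     # falls back to a copy
print(flat.tolist())                    # [0, 3, 1, 4, 2, 5]
print(t.contiguous().view(6).tolist())  # [0, 3, 1, 4, 2, 5]
```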

5. Autograd: How It Actually Works

code

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()
print(x.grad)        # tensor(7.) — dy/dx at x=2 is 2*2 + 3 = 7

Behind the scenes, every operation that touches a tensor with requires_grad=True records a node in a dynamic computation graph. y.backward() walks the graph in reverse, applying the chain rule, and accumulates the result on the .grad attribute of every leaf tensor that has requires_grad=True.
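
The graph can be inspected directly. This sketch reuses the polynomial from the example above and also shows torch.autograd.grad, a functional alternative to .backward() that returns the gradient instead of accumulating it:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1

print(x.is_leaf, x.grad_fn)              # True None
print(y.is_leaf, y.grad_fn is not None)  # False True

# torch.autograd.grad returns the gradient without touching x.grad
(dy_dx,) = torch.autograd.grad(y, x)
print(dy_dx.item(), x.grad)              # 7.0 None
```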

Three rules to internalise:

Only scalar outputs can .backward() directly. For non-scalar outputs you must pass a gradient: y.backward(torch.ones_like(y)).
Gradients accumulate. Calling .backward() twice without clearing them (optimizer.zero_grad(), or x.grad = None on a bare tensor) sums them. This is a feature (gradient accumulation) but a foot-gun if you forget.
The graph is freed after backward by default. To call .backward() twice on the same graph, use retain_graph=True.
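
All three rules can be seen on a couple of tiny tensors. A minimal sketch, with values chosen so the gradients are easy to verify by hand:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Gradients accumulate across backward calls.
(x ** 2).backward()
print(x.grad.item())                    # 4.0
(x ** 2).backward()
print(x.grad.item())                    # 8.0
x.grad = None                           # manual reset for a bare tensor

# Non-scalar outputs need an explicit gradient argument.
v = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = v * 2
out.backward(torch.ones_like(out))
print(v.grad.tolist())                  # [2.0, 2.0, 2.0]

# A graph can be reused only if the first pass keeps it alive.
y = x ** 3
y.backward(retain_graph=True)           # graph kept for another pass
y.backward()                            # second pass succeeds
print(x.grad.item())                    # 24.0 (12 + 12)
```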

6. Detaching from the Graph

code

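import torch

model = torch.nn.Linear(4, 2)         # stand-in model, assumed for illustration
x = torch.randn(8, 4, requires_grad=True)

# Stop autograd from tracking a tensor
y = x.detach()                        # new tensor sharing x's storage, no graph

# Evaluation context: turns off graph recording inside the block
with torch.no_grad():
    pred = model(x)

# Inference: eval() switches batchnorm / dropout to inference behaviour
model.eval()
with torch.inference_mode():
    pred = model(x)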

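torch.no_grad() skips graph construction, so intermediate activations are not kept for a backward pass; evaluation uses less memory and usually runs faster, with the exact savings depending on the model. torch.inference_mode() is stricter: tensors created inside it cannot later be used in computations recorded by autograd, so use it only when the results will never need a gradient. Always wrap your eval / inference code in one of them.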
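
The effect of each context is visible on the output tensor. In this sketch the `model` is a hypothetical stand-in (a single linear layer), not a model from the course:

```python
import torch

model = torch.nn.Linear(4, 2)           # hypothetical stand-in model
x = torch.randn(8, 4)

tracked = model(x)
print(tracked.requires_grad)            # True

with torch.no_grad():
    untracked = model(x)
print(untracked.requires_grad)          # False

with torch.inference_mode():
    frozen = model(x)
print(frozen.requires_grad, frozen.is_inference())   # False True
```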

7. Devices: CPU, CUDA, MPS

code

def best_device():
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

DEVICE = best_device()

x = torch.randn(1024, 1024, device=DEVICE)         # allocate directly on DEVICE
y = torch.randn(1024, 1024).to(DEVICE)             # allocate on CPU, then copy to DEVICE

Two rules about devices:

Operations between tensors on different devices throw. Always move both the model and the data.
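Moving a tensor via .to(device) returns a copy on the target device (or the tensor itself if it is already there). After x_gpu = x.to("cuda") with x on the CPU, modifying x_gpu doesn't change x. Because the result is a new tensor, you must assign it; only nn.Module.to moves parameters in place.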
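
A minimal sketch of the move-both-model-and-data pattern. It falls back to the CPU when no GPU is present, and `model` is again a hypothetical single-layer stand-in:

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(4, 3)                   # created on the CPU
x.to(device)                            # result discarded: x itself is unchanged
x = x.to(device)                        # tensors must be reassigned

model = nn.Linear(3, 2)
model.to(device)                        # modules are moved in place

out = model(x)                          # both operands now share a device
print(out.device.type == device.type)   # True

# When nothing needs to change, .to returns the same object, not a copy.
print(x.to(x.device) is x)              # True
```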

8. CPU↔GPU Transfer is Where Performance Dies
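
As a rough illustration of the heading's claim (a sketch under assumptions, not the lesson's own example): calls such as .item() or .cpu() on a GPU tensor copy data to the host and wait for the GPU to finish. The list `losses` and the tensor `batch` below are made-up stand-in data, and the code falls back to the CPU when no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
losses = [torch.rand((), device=device) for _ in range(100)]

# Slow pattern: one device-to-host copy (and sync) per step.
total_slow = 0.0
for step_loss in losses:
    total_slow += step_loss.item()

# Cheaper pattern: reduce on the device, transfer a single scalar.
total_fast = torch.stack(losses).sum().item()
print(abs(total_slow - total_fast) < 1e-3)   # True

# Host-to-device copies can overlap with compute when the source is pinned.
batch = torch.randn(64, 128)
if device.type == "cuda":
    batch = batch.pin_memory()               # page-locked host memory
batch_dev = batch.to(device, non_blocking=True)
```

The same idea applies to logging: keep metrics on the device during the loop and move them to the host once per epoch rather than once per step.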

9. Common Operations Cheat Sheet

code

x.sum(), x.mean(), x.std()              # reductions
x.sum(dim=1, keepdim=True)              # along a specific axis
x.argmax(dim=-1)                        # for classification
x.softmax(dim=-1)                       # for probabilities
torch.matmul(a, b)                      # matrix multiply; same as a @ b
torch.einsum("bi,bj->bij", a, b)        # outer product per batch
torch.cat([a, b], dim=0)                # along an existing dim
torch.stack([a, b], dim=0)              # along a new dim
torch.where(mask, a, b)                 # vectorised if-else

Internalise einsum early: it expresses any multi-axis contraction without reshape gymnastics.
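
To see what the einsum string in the cheat sheet means, the sketch below compares it with the equivalent broadcasting expression and shows a second pattern (a plain matrix product). The tensor shapes are arbitrary illustration values.

```python
import torch

a = torch.randn(4, 3)
b = torch.randn(4, 5)

outer = torch.einsum("bi,bj->bij", a, b)             # (4, 3, 5)
manual = a[:, :, None] * b[:, None, :]               # same via broadcasting
print(outer.shape, torch.allclose(outer, manual))    # torch.Size([4, 3, 5]) True

m = torch.randn(3, 5)
same = torch.allclose(torch.einsum("bi,ij->bj", a, m), a @ m, atol=1e-5)
print(same)                                          # True
```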

10. Putting It All Together

code

x = torch.randn(8, 3, requires_grad=True, device=DEVICE)
W = torch.randn(3, 5, requires_grad=True, device=DEVICE)
b = torch.zeros(5,    requires_grad=True, device=DEVICE)

logits = x @ W + b                                # (8, 5) on DEVICE
target = torch.randint(0, 5, (8,), device=DEVICE) # (8,) longs
loss = torch.nn.functional.cross_entropy(logits, target)
loss.backward()

print(W.grad.shape, W.grad.norm().item())

This is a toy linear classifier with one minibatch, computed end-to-end on DEVICE (the GPU when one is available), with gradients automatically flowing back to x, W and b. The same pattern, scaled up, is every neural network in this course.
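
One way to build trust in what autograd computed is to compare it with the closed-form gradient. For mean cross-entropy the gradient with respect to the logits is (softmax(logits) - one_hot(target)) / batch_size, and the chain rule gives the gradients for W and b. The sketch below repeats the toy classifier on the CPU with a fixed seed; the `_manual` names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 3)
W = torch.randn(3, 5, requires_grad=True)
b = torch.zeros(5, requires_grad=True)
target = torch.randint(0, 5, (8,))

loss = F.cross_entropy(x @ W + b, target)
loss.backward()

# Closed form: dL/dlogits = (softmax(logits) - one_hot(target)) / batch_size
with torch.no_grad():
    d_logits = (F.softmax(x @ W + b, dim=-1) - F.one_hot(target, 5)) / x.shape[0]
    W_grad_manual = x.T @ d_logits          # chain rule through x @ W
    b_grad_manual = d_logits.sum(dim=0)     # bias is broadcast over the batch

print(torch.allclose(W.grad, W_grad_manual, atol=1e-5))   # True
print(torch.allclose(b.grad, b_grad_manual, atol=1e-5))   # True
```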

11. The Mental Model
