Tensors, Autograd, and GPU Computing
Lesson 2 of 42, Deep Learning with PyTorch
Three primitives carry the entire course: tensors (multi-dimensional arrays), autograd (the reverse-mode differentiation engine), and devices (CPU / GPU / MPS placement). Every layer, every loss, every optimizer is a wrapper over these three. This lesson goes deep on each — what they actually do, where the sharp edges are, and the patterns you'll repeat in every PyTorch program you ever write.
1. Tensors: NumPy with Two Extra Powers
import torch
a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
b = torch.zeros(2, 2)
c = torch.randn(3, 4, 5)
d = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
print(a.shape, a.dtype, a.device, a.requires_grad)
# torch.Size([2, 2]) torch.float32 cpu False
A tensor's identity is shape × dtype × device × requires_grad. Two of those (device, requires_grad) don't exist in NumPy and are responsible for almost every PyTorch bug you'll hit early on:
- device: where the data lives. CPU, CUDA GPU, Apple MPS. Operations between tensors on different devices raise a RuntimeError.
- requires_grad: whether autograd tracks operations on this tensor. Off by default for tensors you create; on by default for nn.Parameter (which is how model weights get gradients), and set by you for inputs you want gradients of.
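To see the requires_grad default in practice, here is a minimal sketch comparing a plain tensor with an nn.Parameter and with the parameters of a small nn.Linear layer. The variable names and layer sizes are illustrative assumptions, not part of the lesson.

```python
import torch
from torch import nn

plain = torch.zeros(3)                  # ordinary tensor
param = nn.Parameter(torch.zeros(3))    # the wrapper nn.Module uses for weights
layer = nn.Linear(3, 2)                 # arbitrary sizes, for illustration only

print(plain.requires_grad)              # False
print(param.requires_grad)              # True

# Every parameter registered on a module tracks gradients by default
layer_flags = [p.requires_grad for p in layer.parameters()]
print(layer_flags)                      # [True, True]
```

The flag comes from how the tensor was created, which is why freshly built layers are trainable without any extra setup.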
2. Dtype Matters
| Dtype | Bits | When |
|---|---|---|
| torch.float32 / torch.float | 32 | Default; use unless you have a reason |
| torch.float64 / torch.double | 64 | Numerical analysis; rarely in deep learning |
| torch.float16 / torch.half | 16 | Mixed-precision training (with care) |
| torch.bfloat16 | 16 | Modern default for mixed precision; same exponent range as fp32 |
| torch.int64 / torch.long | 64 | Indices, class labels, embedding lookup keys |
| torch.bool | 8 (one byte per element) | Masks |
Two dtype rules to internalise:
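- Class-index labels for cross-entropy must be long, not float. A float target is interpreted as per-class probabilities and must match the shape of the logits.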
- Mixed-precision training uses bf16 (preferred on modern GPUs) or fp16 — Lesson 34 covers it.
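Both rules can be checked in a few lines. The sketch below reads the numeric limits with torch.finfo and casts float labels to long before calling cross_entropy; the tensor sizes and label values are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Largest representable value per dtype: bf16 tracks fp32, fp16 does not
fp16_max = torch.finfo(torch.float16).max     # 65504.0
bf16_max = torch.finfo(torch.bfloat16).max    # about 3.39e38
fp32_max = torch.finfo(torch.float32).max     # about 3.40e38
print(fp16_max, bf16_max, fp32_max)

logits = torch.randn(4, 3)                            # 4 samples, 3 classes
labels_float = torch.tensor([0.0, 2.0, 1.0, 2.0])     # e.g. loaded from a CSV
labels = labels_float.long()                          # class indices as int64
loss = F.cross_entropy(logits, labels)
print(labels.dtype, loss.dim())                       # torch.int64 0
```

The fp16 ceiling of 65504 is why fp16 training usually needs loss scaling, while bf16 trades mantissa bits for range.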
3. Indexing, Slicing, Broadcasting
x = torch.arange(24).reshape(2, 3, 4) # (2, 3, 4)
x[0] # → (3, 4) — first batch
x[:, 1] # → (2, 4) — second row of every batch
x[..., -1] # → (2, 3) — last element along the last dim
x[x > 10] # → 1-D tensor of values > 10 (mask + select)
# Broadcasting follows NumPy rules
a = torch.randn(3, 1) # (3, 1)
b = torch.randn( 4) # (4,)
c = a + b # (3, 4)
If you can read NumPy you can read PyTorch indexing, minus one trap: indexing with a boolean mask of the same shape, tensor[mask], always returns a 1-D tensor. Use torch.where if you need to keep the shape.
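The difference between mask indexing and torch.where is easiest to see on a tiny tensor. A minimal sketch, with zero chosen arbitrarily as the fill value for positions the mask rejects:

```python
import torch

x = torch.arange(6).reshape(2, 3)       # [[0, 1, 2], [3, 4, 5]]
mask = x > 2

selected = x[mask]                      # flattens: tensor([3, 4, 5])
kept = torch.where(mask, x, torch.zeros_like(x))   # same shape as x

print(selected.shape)                   # torch.Size([3])
print(kept.shape)                       # torch.Size([2, 3])
print(kept.tolist())                    # [[0, 0, 0], [3, 4, 5]]
```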
4. The Two Reshape Operations
| Op | Behaviour |
|---|---|
| x.view(...) | Returns a view; requires a compatible (typically contiguous) memory layout; throws otherwise |
| x.reshape(...) | Returns a view if possible; copies if not; works for any target shape with the same number of elements |
| x.permute(2, 0, 1) | Reorders dimensions; returns a view that is usually non-contiguous |
| x.transpose(0, 1) | Swaps two dimensions; also returns a view |
| x.contiguous() | Returns a contiguous copy, or the tensor itself if it is already contiguous |
When in doubt: reshape. Reach for view only when you're sure the tensor is contiguous and want the failure mode to be loud.
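A transposed tensor is a quick way to watch view fail where reshape succeeds. The sketch below is illustrative; the view_failed flag is just a hypothetical name used to record the outcome.

```python
import torch

x = torch.arange(6).reshape(2, 3)
xt = x.t()                          # shape (3, 2), a non-contiguous view of x
print(xt.is_contiguous())           # False

try:
    xt.view(6)                      # strides cannot describe a flat view
    view_failed = False
except RuntimeError:
    view_failed = True

flat = xt.reshape(6)                # silently falls back to a copy
flat_again = xt.contiguous().view(6)    # explicit copy, then a cheap view

print(view_failed)                  # True
print(flat.tolist())                # [0, 3, 1, 4, 2, 5]
```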
5. Autograd: How It Actually Works
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()
print(x.grad) # tensor(7.) — dy/dx at x=2 is 2*2 + 3 = 7
Behind the scenes, every operation that touches a tensor with requires_grad=True records a node in a dynamic computation graph. y.backward() walks the graph in reverse, applying the chain rule, and accumulates the result on the .grad attribute of every leaf tensor that has requires_grad=True.
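The graph itself can be inspected through is_leaf and grad_fn. A small sketch reusing the polynomial from the example above:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1

# x was created by the user, so it is a leaf with no producing operation
print(x.is_leaf, x.grad_fn)                 # True None
# y was produced by tracked operations, so it carries a graph node
print(y.is_leaf, y.grad_fn is not None)     # False True

print(x.grad)                               # None (nothing accumulated yet)
y.backward()
print(x.grad)                               # tensor(7.)
```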
Three rules to internalise:
- Only scalar outputs can call .backward() directly. For non-scalar outputs you must pass a gradient: y.backward(torch.ones_like(y)).
- Gradients accumulate. Calling .backward() twice without opt.zero_grad() sums them. This is a feature (gradient accumulation) but a foot-gun if you forget.
- The graph is freed after backward by default. To call .backward() twice on the same graph, use retain_graph=True.
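The three rules can be exercised on one small vector. This is a sketch with made-up values; the gradient is reset by hand here because no optimizer is involved.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Rule 1: a non-scalar output needs an explicit upstream gradient
y = x ** 2
y.backward(torch.ones_like(y))
grad_once = x.grad.clone()              # 2 * x = [2, 4, 6]

# Rule 2: a second backward pass adds to .grad instead of replacing it
(x ** 2).sum().backward()
grad_twice = x.grad.clone()             # [4, 8, 12]

# Reset by hand; an optimizer's zero_grad() does this for its parameters
x.grad = None

# Rule 3: keep the graph alive to backpropagate through it twice
z = (x ** 2).sum()
z.backward(retain_graph=True)
z.backward()                            # would raise without retain_graph above
grad_retained = x.grad.clone()          # accumulated again: [4, 8, 12]

print(grad_once.tolist(), grad_twice.tolist(), grad_retained.tolist())
```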
6. Detaching from the Graph
# Stop autograd from tracking a tensor
y = x.detach() # new tensor outside the graph; shares memory with x
# Evaluation context: disables gradient tracking inside the block
with torch.no_grad():
    pred = model(x)
# model.eval() switches batchnorm / dropout to evaluation behaviour;
# inference_mode() then disables autograd for everything in the block
model.eval()
with torch.inference_mode():
    pred = model(x)
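torch.no_grad() skips graph construction, so no intermediate activations are kept for a backward pass; evaluation uses less memory and usually runs faster. torch.inference_mode() is stricter still: tensors created inside it cannot later take part in operations recorded by autograd, so use it when the result will never need a gradient. Always wrap your eval / inference code in one of them.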
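The effect of each context is visible on the output tensor. A minimal sketch, assuming a stand-in nn.Linear model and random input rather than anything from the lesson:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)         # stand-in model, sizes are arbitrary
x = torch.randn(3, 4)

tracked = model(x)              # normal forward pass: graph is recorded
with torch.no_grad():
    untracked = model(x)        # no graph, but an ordinary tensor
with torch.inference_mode():
    inferred = model(x)         # no graph, and marked as an inference tensor

print(tracked.requires_grad, tracked.grad_fn is not None)     # True True
print(untracked.requires_grad, untracked.is_inference())      # False False
print(inferred.requires_grad, inferred.is_inference())        # False True
```

Neither context changes layer behaviour; dropout and batchnorm still need model.eval().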
7. Devices: CPU, CUDA, MPS
def best_device():
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
DEVICE = best_device()
x = torch.randn(1024, 1024, device=DEVICE) # allocate on GPU
y = torch.randn(1024, 1024).to(DEVICE) # move
Two rules about devices:
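- Operations between tensors on different devices raise a RuntimeError. Always move both the model and the data.
- Moving to a different device via .to(device) makes a copy. After x_gpu = x.to("cuda"), modifying x_gpu doesn't change x. If the tensor is already on the target device, .to returns the same tensor instead.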
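Both rules can be tried on any machine, with or without a GPU. In the sketch below, pick_device is a hypothetical helper that mirrors best_device, and the model and batch sizes are arbitrary.

```python
import torch
from torch import nn

def pick_device():
    # Hypothetical helper: prefer CUDA, then Apple MPS, then CPU
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)   # absent on old PyTorch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
model = nn.Linear(4, 2).to(device)      # moves the module's parameters
batch = torch.randn(8, 4)               # created on the CPU
out = model(batch.to(device))           # data must follow the model
print(out.device.type == device.type)   # True

x = torch.zeros(3)
same = x.to("cpu")                      # already on the CPU: x itself comes back
forced = x.to("cpu", copy=True)         # copy=True always allocates new memory
forced += 1
print(same is x, x.tolist(), forced.tolist())
```

Because .to can return the original tensor, code that mutates the result should not assume it owns a private copy.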
8. CPU↔GPU Transfer is Where Performance Dies
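One common place this cost shows up is reading scalars back into Python: each .item() call on a GPU tensor forces a device-to-host copy and makes the CPU wait for queued GPU work. A minimal sketch, assuming a stand-in list of per-step losses rather than a real training loop:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the loss produced at each of 100 training steps
losses = [torch.rand((), device=device) for _ in range(100)]

# Pattern to avoid on a GPU: one transfer (and synchronisation) per step
total_per_step = 0.0
for step_loss in losses:
    total_per_step += step_loss.item()

# Usually cheaper: accumulate on the device, transfer once at the end
running = torch.zeros((), device=device)
for step_loss in losses:
    running += step_loss
total_once = running.item()

print(abs(total_per_step - total_once) < 1e-3)   # same result either way
```

On a CPU the two loops cost about the same; the difference only matters once the tensors live on an accelerator.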
9. Common Operations Cheat Sheet
x.sum(), x.mean(), x.std() # reductions
x.sum(dim=1, keepdim=True) # along a specific axis
x.argmax(dim=-1) # for classification
x.softmax(dim=-1) # for probabilities
torch.matmul(a, b) # matrix multiply; a @ b is equivalent
torch.einsum("bi,bj->bij", a, b) # outer product per batch
torch.cat([a, b], dim=0) # along an existing dim
torch.stack([a, b], dim=0) # along a new dim
torch.where(mask, a, b) # vectorised if-else
Internalise einsum early: it expresses any multi-axis contraction without reshape gymnastics.
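To make the einsum notation concrete, the sketch below checks two einsum expressions against their broadcasting and torch.bmm equivalents, and shows how cat and stack differ in output shape. Tensor sizes are arbitrary.

```python
import torch

a = torch.randn(2, 3)
b = torch.randn(2, 4)

# Per-batch outer product: (2, 3) and (2, 4) -> (2, 3, 4)
outer = torch.einsum("bi,bj->bij", a, b)
outer_ref = a[:, :, None] * b[:, None, :]       # same thing via broadcasting

# Batched matrix multiply: (2, 3, 4) @ (2, 4, 5) -> (2, 3, 5)
p = torch.randn(2, 3, 4)
q = torch.randn(2, 4, 5)
batched = torch.einsum("bij,bjk->bik", p, q)
batched_ref = torch.bmm(p, q)

outer_ok = torch.allclose(outer, outer_ref, atol=1e-5)
batched_ok = torch.allclose(batched, batched_ref, atol=1e-5)
print(outer_ok, batched_ok)

# cat joins along an existing dim; stack creates a new one
print(torch.cat([a, a], dim=0).shape)           # torch.Size([4, 3])
print(torch.stack([a, a], dim=0).shape)         # torch.Size([2, 2, 3])
```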
10. Putting It All Together
x = torch.randn(8, 3, requires_grad=True, device=DEVICE)
W = torch.randn(3, 5, requires_grad=True, device=DEVICE)
b = torch.zeros(5, requires_grad=True, device=DEVICE)
logits = x @ W + b # (8, 5) on GPU
target = torch.randint(0, 5, (8,), device=DEVICE) # (8,) longs
loss = torch.nn.functional.cross_entropy(logits, target)
loss.backward()
print(W.grad.shape, W.grad.norm().item())
This is a toy linear classifier with one minibatch, computed end-to-end on DEVICE (the GPU when one is available), with gradients automatically flowing back to W and b. The same pattern, scaled up, is every neural network in this course.
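Turning that single backward pass into training only needs an update step. The following is a hand-written gradient-descent loop, offered as a sketch: the seed, learning rate, and step count are arbitrary assumptions, and it runs on the CPU to stay self-contained.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)                        # arbitrary seed for repeatability
x = torch.randn(8, 3)                       # one fixed minibatch
target = torch.randint(0, 5, (8,))          # class indices, dtype int64
W = torch.randn(3, 5, requires_grad=True)
b = torch.zeros(5, requires_grad=True)

lr = 0.1                                    # assumed learning rate
history = []
for step in range(20):
    loss = F.cross_entropy(x @ W + b, target)
    loss.backward()                         # fills W.grad and b.grad
    with torch.no_grad():                   # the update must not be tracked
        W -= lr * W.grad
        b -= lr * b.grad
    W.grad.zero_()                          # otherwise gradients accumulate
    b.grad.zero_()
    history.append(loss.item())

print(history[0], history[-1])              # loss before and after training
```

An optimizer such as torch.optim.SGD packages the update and the zeroing, but the steps it performs are the ones written out here.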