Computer Vision with Deep Learning

Introduction to Computer Vision

Task	Output	Example
Classification	One label per image	"cat" / "dog"
Detection	Boxes + labels	"person at (x, y, w, h)"
Segmentation	Pixel-wise label map	"this pixel is road, this is sky"
Generation	A new image	"a photo of an astronaut on a horse"
Multi-modal	Text + image jointly	"describe this picture"
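To make the Output column concrete, here is a minimal sketch of what each task's result might look like as plain NumPy data. The image size, class names, and values are invented for illustration and no model is involved. Multi-modal outputs are free-form text, so they are left out.

```python
import numpy as np

H, W = 4, 6                # a tiny, hypothetical image size
CLASSES = ["road", "sky"]  # hypothetical label set for segmentation

# Classification: one label for the whole image.
classification = "cat"

# Detection: one (x, y, w, h) row per object, plus one label per row.
boxes = np.array([[1, 0, 2, 3]])
box_labels = ["person"]

# Segmentation: an integer class index for every pixel.
seg_map = np.zeros((H, W), dtype=np.int64)  # everything starts as "road" (0)
seg_map[:2, :] = 1                          # top two rows become "sky" (1)

# Generation: the output is itself an image, here H x W x 3 values in [0, 1).
rng = np.random.default_rng(0)
generated = rng.random((H, W, 3))

print("classification:", classification)
print("detection:", boxes.shape, box_labels)
print("segmentation:", seg_map.shape, [CLASSES[i] for i in np.unique(seg_map)])
print("generation:", generated.shape)
```

The point is the shapes: a single value for classification, a short list of rows for detection, and an array as large as the image for segmentation and generation.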

import numpy as np
import matplotlib.pyplot as plt

# An 8x8 "image" of a plus sign, built from nothing but numbers.
img = np.zeros((8, 8))
img[3:5, 1:7] = 0.6  # horizontal bar, medium gray
img[1:7, 3:5] = 1.0  # vertical bar, white

print("shape:", img.shape)
print("the raw numbers the computer sees:")
print(img)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].imshow(img, cmap="gray", vmin=0, vmax=1)
axes[0].set_title("What you see")
axes[1].imshow(img, cmap="gray", vmin=0, vmax=1)
for i in range(8):
    for j in range(8):
        axes[1].text(j, i, f"{img[i, j]:.1f}", ha="center", va="center",
                     color="tab:red", fontsize=7)
axes[1].set_title("What the computer sees")
for ax in axes:
    ax.axis("off")
fig.tight_layout()

Year	Milestone	Why it mattered
2012	AlexNet	Learned features beat two decades of hand-engineering
2014	VGG, GANs	Deeper networks; networks that generate images
2015	ResNet, U-Net, Faster R-CNN	Skip connections unlock 100+ layers; segmentation and detection go deep
2016	YOLO	Real-time detection in a single forward pass
2017	Transformer (in NLP)	The architecture that would later cross over to vision
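The 2015 row credits skip connections with making very deep networks trainable. The following is a toy NumPy sketch of that idea only, not ResNet's actual block (real ResNet blocks use convolutions and normalization); the function names `layer` and `residual_block` are illustrative.

```python
import numpy as np

def layer(x, weight):
    """A toy layer: a linear map followed by ReLU."""
    return np.maximum(0.0, x @ weight)

def residual_block(x, weight):
    """Skip connection: add the input back onto the layer's output."""
    return x + layer(x, weight)

x = np.array([[1.0, -2.0, 3.0]])
zero_weight = np.zeros((3, 3))

# With all-zero weights the plain layer wipes out its input,
# while the residual block passes the input through unchanged.
plain_out = layer(x, zero_weight)
residual_out = residual_block(x, zero_weight)

print("plain layer:   ", plain_out)
print("residual block:", residual_out)
```

Because the block defaults to the identity, stacking many of them does not by itself destroy the signal, which is one common intuition for why skip connections help depth.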

Layer	What it provides
Frameworks	PyTorch (the research and course default), TensorFlow / Keras
Vision libraries	torchvision, timm (1000+ pretrained backbones)
Image I/O	Pillow (PIL), OpenCV
Augmentation	torchvision.transforms.v2, Albumentations
Detection / segmentation	Ultralytics (YOLO lineage), RT-DETR, SAM / SAM 2, detectron2
Generation	🤗 diffusers (Stable Diffusion, SDXL-class models), ComfyUI
Multi-modal	CLIP / SigLIP, LLaVA-class VLMs, 🤗 transformers
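As a small taste of the Image I/O layer, here is a hedged sketch of a round trip between NumPy and Pillow, reusing the plus-sign pattern from the earlier example as 8-bit values. It writes the PNG to an in-memory buffer rather than to disk; that choice, and the variable names, are only for illustration.

```python
import io

import numpy as np
from PIL import Image

# The plus-sign pattern as an 8-bit grayscale array (0 = black, 255 = white).
arr = np.zeros((8, 8), dtype=np.uint8)
arr[3:5, 1:7] = 153  # horizontal bar, medium gray (0.6 * 255)
arr[1:7, 3:5] = 255  # vertical bar, white

# NumPy -> Pillow -> PNG bytes in memory -> Pillow -> NumPy.
buffer = io.BytesIO()
Image.fromarray(arr).save(buffer, format="PNG")
buffer.seek(0)
reloaded = Image.open(buffer)
restored = np.asarray(reloaded)

print(reloaded.mode, reloaded.size)   # Pillow reports size as (width, height)
print(np.array_equal(arr, restored))  # PNG is lossless, so the pixels survive
```

Note the ordering difference: NumPy shapes are (height, width), while Pillow's `size` is (width, height). For a square image the two look identical, which makes this an easy detail to miss.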

Introduction to Computer Vision

1. What Computer Vision Actually Does
2. Why Vision Is Hard (For Machines)
3. The Pre-Deep-Learning Era (Briefly)
4. The 2012 Inflection Point
5. Why Deep Learning Wins for Vision
6. Act One: The Deep-Learning Decade (2012-2020)
7. Act Two: The Foundation-Model Era
8. The Modern Vision Stack
9. Where Computer Vision Is Used Today
10. The Cost: Data and Compute
11. What This Course Will Build
12. The Road Ahead