Image Representation and Color Spaces
Before you train any model, you need to understand what an image is to a computer: an array of integers organized as height × width × channels. This notebook walks through pixels, channels, color spaces, and the small library of operations (read, resize, convert, normalize) that every vision pipeline relies on.
pip install pillow==11.0.0 numpy==2.1.2 opencv-python==4.10.0.84 \
    matplotlib==3.9.2 torch==2.5.1 torchvision==0.20.1
For sample images, torchvision.datasets.FakeData works if you have nothing else handy.
1. An Image Is a Numpy Array
from PIL import Image
import numpy as np
img = Image.open("cat.jpg") # PIL Image, 'RGB' mode
arr = np.array(img)
print(arr.shape, arr.dtype) # (H, W, 3), uint8
print(arr.min(), arr.max()) # within 0..255; often exactly 0 and 255
Three things to internalize:
- Shape is (H, W, C) in numpy — height first.
- dtype is usually uint8 — values 0-255.
- Channels last in numpy / PIL; PyTorch uses channels first (C, H, W).
Mixing up (H, W, C) vs (C, H, W) is the #1 cause of "this code ran yesterday and now it crashes" in vision pipelines.
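To make "height first" concrete, here is a minimal sketch on a synthetic image, so it runs without cat.jpg. The array name `demo` and its 4×6 size are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical 4x6 RGB image: 4 rows (height), 6 columns (width), 3 channels.
demo = np.zeros((4, 6, 3), dtype=np.uint8)

# Indexing is [row, column], i.e. [y, x], not [x, y].
demo[1, 5] = (255, 0, 0)   # paint the pixel at row 1, column 5 red

height, width, channels = demo.shape
print(height, width, channels)   # 4 6 3
print(demo[1, 5].tolist())       # [255, 0, 0]

# Channels-first view of the same data, the layout PyTorch expects.
chw = demo.transpose(2, 0, 1)
print(chw.shape)                 # (3, 4, 6)
```

Note that `demo[5, 1]` would raise an IndexError here, which is the kind of failure a swapped axis order produces.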
2. RGB and Grayscale
# RGB: 3 channels, ordered red-green-blue (PIL convention)
print(arr[0, 0]) # [R, G, B] for the top-left pixel
# Grayscale: 1 channel
gray = img.convert("L") # PIL "luminance"
print(np.array(gray).shape) # (H, W)
Grayscale is computed as a weighted sum
(0.299·R + 0.587·G + 0.114·B) approximating human
perception of brightness. Many CV operations (edge detection,
classical features) traditionally happened in grayscale; deep
models almost always use the full 3 channels.
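The weighted sum can be reproduced by hand in NumPy. This is a sketch on a hypothetical three-pixel image, not PIL's actual implementation; PIL's result may differ by a gray level because of how it rounds.

```python
import numpy as np

# Hypothetical 1x3 image: one pure red, one pure green, one pure blue pixel.
rgb_demo = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)

# Luma weights from the formula above.
weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)

# Weighted sum over the channel axis, then round back to uint8.
gray_demo = np.rint(rgb_demo.astype(np.float32) @ weights).astype(np.uint8)
print(gray_demo.shape)     # (1, 3)
print(gray_demo.tolist())  # [[76, 150, 29]]
```

Green contributes the most and blue the least, which is why a pure blue pixel comes out much darker than a pure green one.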
3. The Color Space Zoo
| Space | Channels | When you'd use it |
|---|---|---|
| RGB | R, G, B | Default; what most pretrained models expect |
| BGR | B, G, R | OpenCV's default; red and blue are silently swapped when displayed as RGB |
| HSV | Hue, Saturation, Value | Color-based filtering, augmentations like "shift hue" |
| Lab | Lightness, a*, b* | Approximately perceptually uniform; color-difference and image-similarity metrics |
| YCbCr | Luminance + chroma | JPEG compression, video |
import cv2
bgr = cv2.imread("cat.jpg") # OpenCV reads as BGR!
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
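The BGR/RGB difference is only a channel reordering, which can be checked without OpenCV. The sketch below uses a hypothetical two-pixel image and the standard-library colorsys module for a single-pixel HSV conversion. colorsys returns hue as a fraction in [0, 1], whereas OpenCV's 8-bit HSV stores hue as degrees divided by two (0 to 179), so the numbers are not directly comparable.

```python
import colorsys
import numpy as np

# Hypothetical 1x2 image in BGR order: pure blue, then pure red.
bgr_demo = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# BGR -> RGB is a reversal of the last (channel) axis.
rgb_demo = bgr_demo[..., ::-1]
print(rgb_demo.tolist())   # [[[0, 0, 255], [255, 0, 0]]]

# HSV for the second pixel (pure red), with floats in [0, 1].
r, g, b = (rgb_demo[0, 1] / 255.0).tolist()
h, s, v = colorsys.rgb_to_hsv(r, g, b)
print(h, s, v)             # 0.0 1.0 1.0
```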
4. Bit Depth and Range
- uint8 — 0 to 255. Standard for JPEG, PNG.
- uint16 — 0 to 65 535. Medical imaging (DICOM), high-end cameras.
- float32 — typically scaled to [0, 1] or normalized to mean 0 / std 1. What models actually consume.
img_f = arr.astype(np.float32) / 255.0
print(img_f.min(), img_f.max()) # 0.0, 1.0
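The same idea extends to 16-bit data and to mean 0 / std 1 normalization. The sketch below uses random arrays as stand-ins for real images and computes the statistics from the image itself; in a training pipeline the mean and std would more commonly come from the whole training set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 8-bit and 16-bit images filled with random values.
img8 = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
img16 = rng.integers(0, 65536, size=(32, 32), dtype=np.uint16)

# Scale each integer type by its own maximum to land in [0, 1].
f8 = img8.astype(np.float32) / np.iinfo(np.uint8).max
f16 = img16.astype(np.float32) / np.iinfo(np.uint16).max

# Per-channel standardization: subtract the mean, divide by the std.
mean = f8.mean(axis=(0, 1))   # shape (3,)
std = f8.std(axis=(0, 1))     # shape (3,)
normalized = (f8 - mean) / std

print(normalized.mean(axis=(0, 1)))  # close to 0 for each channel
print(normalized.std(axis=(0, 1)))   # close to 1 for each channel
```

Dividing a uint16 image by 255 instead of 65535 is a common slip; using `np.iinfo` ties the divisor to the dtype.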
5. Channel-First vs Channel-Last
import torch
# numpy / PIL: (H, W, C)
print(arr.shape) # e.g. (224, 224, 3) for a 224x224 image
# PyTorch: (C, H, W)
ten = torch.from_numpy(arr).permute(2, 0, 1)
print(ten.shape) # e.g. (3, 224, 224)
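# torchvision provides this conversion too; v2.ToTensor still works but is deprecated
from torchvision.transforms import v2
to_tensor = v2.Compose([v2.ToImage(), v2.ToDtype(torch.float32, scale=True)])
tensor = to_tensor(img) # (C, H, W), float32 rescaled to [0, 1]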
Why two conventions? PyTorch's choice (C, H, W) makes batched data shape (N, C, H, W), which is friendlier to convolutions because each channel's H × W plane stays together in memory. PIL and most NumPy-based image code use (H, W, C) because it mirrors interleaved pixel storage, where the R, G, and B values of one pixel sit next to each other. Neither layout is wrong, so train yourself to read shape tuples carefully.
6. Reading and Writing Images
from PIL import Image
# Read
img = Image.open("input.jpg") # lazy; data not loaded yet
img.load() # force load
# Write
img.save("output.png") # PNG (lossless)
img.save("output.jpg", quality=92) # JPEG with quality
PIL has the cleanest API for I/O; torchvision.io.read_image returns a (C, H, W) uint8 tensor directly; OpenCV is a common choice for fast video work. Pick by need: PIL for clarity, torchvision for PyTorch pipelines, OpenCV for high-throughput frame processing.
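The difference between lossless PNG and lossy JPEG is easy to verify with a round trip. The sketch below encodes to an in-memory buffer instead of a file; the `roundtrip` helper and the random test image are assumptions made for illustration.

```python
import io

import numpy as np
from PIL import Image

# Hypothetical noisy test image; noise makes JPEG loss easy to see.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
img = Image.fromarray(original)

def roundtrip(image, fmt, **save_kwargs):
    """Encode to an in-memory buffer, decode again, return a numpy array."""
    buffer = io.BytesIO()
    image.save(buffer, format=fmt, **save_kwargs)
    buffer.seek(0)
    return np.array(Image.open(buffer).convert("RGB"))

png = roundtrip(img, "PNG")
jpg = roundtrip(img, "JPEG", quality=92)

print(np.array_equal(original, png))   # True: PNG is lossless
print(np.array_equal(original, jpg))   # False for this noisy input
```

This matters for masks and label images: saving them as JPEG alters pixel values, so a lossless format is the safer choice there.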
7. Visualizing Images
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 3, figsize=(12, 4))
ax[0].imshow(arr); ax[0].set_title("RGB")
ax[1].imshow(np.array(gray), cmap="gray"); ax[1].set_title("Gray")
ax[2].imshow(arr[..., 0], cmap="gray"); ax[2].set_title("R channel")
for a in ax: a.axis("off")
plt.show()
Single-channel data needs cmap="gray"; otherwise
matplotlib applies the default viridis colormap and your image
looks alien. Triple-check colormaps in any vision plot you commit.
8. Indexing and Slicing
# Top-left 100×100 patch
patch = arr[:100, :100]
# Center crop
H, W = arr.shape[:2]; s = 224
top = (H - s) // 2; left = (W - s) // 2
center = arr[top:top + s, left:left + s]
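# Mask: where red is dominant over green and blue
# Cast to int16 first: in uint8, g + 30 wraps around past 255
r, g, b = (arr[..., i].astype(np.int16) for i in range(3))
mask = (r > 150) & (r > g + 30) & (r > b + 30)
print(mask.shape, mask.dtype) # (H, W), bool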
Numpy slicing is the universal language of vision preprocessing.
Crops, masks, color-thresholding, channel splits — all of them
reduce to a few lines of arr[...].
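As a rough illustration, the sketch below combines a crop, a channel split, and a mask on a synthetic image. The `center_crop` helper, the image size, and the thresholds are all assumptions for this example, not part of any library.

```python
import numpy as np

def center_crop(image, size):
    """Return a size x size crop from the middle of an (H, W, ...) array."""
    height, width = image.shape[:2]
    if size > height or size > width:
        raise ValueError(f"crop {size} exceeds image size {height}x{width}")
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]

# Hypothetical 300x400 image: dark background with a bright red square.
demo = np.full((300, 400, 3), 20, dtype=np.uint8)
demo[100:200, 150:250] = (220, 40, 40)

crop = center_crop(demo, 224)
print(crop.shape)   # (224, 224, 3)

# Channel split, widened to int16 so the +30 margin cannot wrap around.
r, g, b = (demo[..., i].astype(np.int16) for i in range(3))
mask = (r > 150) & (r > g + 30) & (r > b + 30)
print(mask.sum())   # 10000 pixels, the 100x100 square

# Boolean masks index the array directly: black out everything else.
masked = demo.copy()
masked[~mask] = 0
```

The explicit size check avoids a silent failure: with a crop larger than the image, the negative offsets would otherwise produce a wrong-sized or empty slice rather than an error.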