AIMaks

Image Representation and Color Spaces

40 min read · notebook · Image Fundamentals and Preprocessing
2 of 30 · Computer Vision with Deep Learning

Before you train any model, you need to understand what an image is to a computer: an array of integers organized as height × width × channels. This notebook walks through pixels, channels, color spaces, and the small library of operations (read, resize, convert, normalize) that every vision pipeline relies on.

code
pip install pillow==11.0.0 numpy==2.1.2 opencv-python==4.10.0.84 \
    matplotlib==3.9.2 torch==2.5.1 torchvision==0.20.1
The examples assume a standard JPEG on disk (the code calls it cat.jpg). torchvision.datasets.FakeData can generate placeholder images if you have nothing else handy.

1. An Image Is a NumPy Array

code
from PIL import Image
import numpy as np

img = Image.open("cat.jpg").convert("RGB")  # force 3-channel RGB; a JPEG can also be grayscale or CMYK
arr = np.array(img)
print(arr.shape, arr.dtype)             # (H, W, 3) uint8
print(arr.min(), arr.max())             # somewhere within 0..255, depending on the image

Three things to internalize:

  • Shape is (H, W, C) in NumPy: height (rows) first, then width, then channels.
  • dtype is usually uint8, with values from 0 to 255.
  • NumPy arrays made from PIL images are channels last; PyTorch is channels first: (C, H, W).

Mixing up (H, W, C) and (C, H, W) is one of the most common causes of "this code ran yesterday and now it crashes" in vision pipelines.
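
The height-first convention is easy to check on a tiny synthetic image. The sketch below builds a 2×3 array by hand (no file needed; the names tiny and pil_tiny and the pixel values are purely illustrative) and shows that NumPy indexes as [row, column] while PIL reports size as (width, height).

```python
import numpy as np
from PIL import Image

# A 2-row, 3-column RGB image: shape is (H=2, W=3, C=3)
tiny = np.zeros((2, 3, 3), dtype=np.uint8)
tiny[0, 2] = [255, 0, 0]                # row 0, column 2: top-right pixel is red

pil_tiny = Image.fromarray(tiny)        # mode is inferred from shape and dtype
print(tiny.shape)                       # (2, 3, 3): NumPy reports (H, W, C)
print(pil_tiny.size)                    # (3, 2): PIL reports (W, H)
print(pil_tiny.getpixel((2, 0)))        # (255, 0, 0): PIL takes (x, y), NumPy takes [y, x]
```

Using a non-square image for quick checks like this makes a swapped height and width show up immediately.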

2. RGB and Grayscale

code
# RGB: 3 channels, ordered red-green-blue (PIL convention)
print(arr[0, 0])                        # [R, G, B] for the top-left pixel

# Grayscale: 1 channel
gray = img.convert("L")                 # PIL "luminance"
print(np.array(gray).shape)             # (H, W)

PIL computes grayscale as a weighted sum, 0.299·R + 0.587·G + 0.114·B (the ITU-R 601 luma weights), which approximates how bright each color looks to the human eye; green gets the largest weight because the eye is most sensitive to it. Many classical CV operations (edge detection, hand-crafted features) traditionally ran on grayscale; deep models almost always take all 3 channels.
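
To make the weighted sum concrete, here is a minimal NumPy sketch that applies those three weights by hand. The helper name rgb_to_gray and the four-pixel demo image are illustrative, not part of any library.

```python
import numpy as np

# Luma weights quoted above, in R, G, B order
WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)

def rgb_to_gray(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB array to an (H, W) uint8 grayscale array."""
    luma = rgb.astype(np.float32) @ WEIGHTS      # weighted sum over the channel axis
    return np.rint(luma).astype(np.uint8)        # round to nearest, back to uint8

# Four pixels: pure red, pure green, pure blue, white
demo = np.array([[[255, 0, 0], [0, 255, 0]],
                 [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
gray_demo = rgb_to_gray(demo)
print(gray_demo)        # red -> 76, green -> 150, blue -> 29, white -> 255
```

Pure green comes out roughly twice as bright as pure red and five times as bright as pure blue. PIL's own conversion may round slightly differently, so compare against img.convert("L") with a tolerance of about one gray level rather than expecting exact equality.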

3. The Color Space Zoo

| Space | Channels | When you'd use it |
| ----- | -------- | ----------------- |
| RGB | R, G, B | Default; what most pretrained models expect |
| BGR | B, G, R | OpenCV's default; colors come out wrong, with no error, when displayed as RGB |
| HSV | Hue, Saturation, Value | Color-based filtering; augmentations such as hue shifts |
| Lab | Lightness, a*, b* | Approximately perceptually uniform; image-similarity metrics |
| YCbCr | Y (luma), Cb and Cr (chroma) | JPEG compression, video |
code
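import cv2

bgr = cv2.imread("cat.jpg")             # OpenCV returns BGR, or None if the file can't be read
if bgr is None:
    raise FileNotFoundError("cat.jpg")
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)  # for uint8 input: H in [0, 179], S and V in [0, 255]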
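
The BGR/RGB difference is nothing more than the order of the last axis, which a NumPy-only sketch can show without OpenCV installed. The one-pixel array and the names bgr_px, rgb_px, and rgb_contig are illustrative.

```python
import numpy as np

# One pure-red pixel as OpenCV would return it: B=0, G=0, R=255
bgr_px = np.array([[[0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps B and R; G stays in the middle
rgb_px = bgr_px[..., ::-1]
print(rgb_px[0, 0])                     # [255   0   0]

# Reversing twice returns the original ordering
assert np.array_equal(rgb_px[..., ::-1], bgr_px)

# The reversed slice is a view with a negative stride; make it contiguous
# before passing it to code that requires ordinary memory layout
rgb_contig = np.ascontiguousarray(rgb_px)
```

If a plot shows blue skin tones or an orange sky, a missing conversion of this kind is the first thing to check.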

4. Bit Depth and Range

  • uint8 — 0 to 255. Standard for JPEG, PNG.
  • uint16 — 0 to 65 535. Medical imaging (DICOM), high-end cameras.
  • float32 — typically scaled to [0, 1] or normalized to mean 0 / std 1. What models actually consume.
code
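img_f = arr.astype(np.float32) / 255.0  # cast to float first, then scale to [0, 1]
print(img_f.min(), img_f.max())         # within [0.0, 1.0]; exactly 0.0 and 1.0 only if the image has pure black and white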
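
The same scaling idea extends to 16-bit data and to mean 0 / std 1 normalization. The sketch below uses two hypothetical helpers, to_unit_float and standardize, on random test arrays. For simplicity it standardizes with the statistics of the single image, whereas training pipelines usually use fixed per-channel statistics computed over the whole dataset.

```python
import numpy as np

def to_unit_float(a: np.ndarray) -> np.ndarray:
    """Scale an unsigned-integer image to float32 in [0, 1] using its dtype's max."""
    if not np.issubdtype(a.dtype, np.unsignedinteger):
        raise TypeError(f"expected an unsigned integer image, got {a.dtype}")
    return a.astype(np.float32) / np.iinfo(a.dtype).max

def standardize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Shift and scale each channel of an (H, W, C) float image to mean 0, std 1."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    std = x.std(axis=(0, 1), keepdims=True)
    return (x - mean) / (std + eps)     # eps guards against a constant channel

rng = np.random.default_rng(0)
img8 = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
img16 = rng.integers(0, 65536, size=(4, 4, 3), dtype=np.uint16)

f8, f16 = to_unit_float(img8), to_unit_float(img16)
z = standardize(f8)
print(f8.dtype, f16.dtype)                       # float32 float32
print(z.mean(axis=(0, 1)), z.std(axis=(0, 1)))   # per channel: close to 0 and close to 1
```

Dividing by the dtype's maximum instead of a hard-coded 255 keeps a 16-bit image from ending up with values in the hundreds.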

5. Channel-First vs Channel-Last

code
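import torch
# NumPy / PIL: (H, W, C)
print(arr.shape)                        # e.g. (224, 224, 3) for a 224×224 image
# PyTorch:    (C, H, W)
ten = torch.from_numpy(arr).permute(2, 0, 1)   # still uint8; shares memory with arr
print(ten.shape)                        # e.g. torch.Size([3, 224, 224])

# torchvision's v2 transforms do the layout change and the rescale together
# (v2.ToTensor still works but is deprecated in favor of this pair)
from torchvision.transforms import v2
to_tensor = v2.Compose([v2.ToImage(), v2.ToDtype(torch.float32, scale=True)])
tensor = to_tensor(img)                 # (C, H, W) float32 in [0, 1]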

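Why two conventions? Image decoders and display buffers generally hand you pixels interleaved (R, G, B, R, G, B, and so on), which maps directly onto (H, W, C); that is what you get from np.array(pil_image) and from cv2.imread. PyTorch's (C, H, W) keeps each channel as its own H × W plane, so a batch has shape (N, C, H, W), the layout that PyTorch's convolution layers and torchvision's tensor transforms expect. Neither is more correct, so train yourself to read shape tuples carefully and convert explicitly at the boundary between libraries.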
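
A NumPy-only sketch of moving between the two layouts, including a pitfall worth seeing once: reshape produces the right shape but the wrong pixels. The array sizes and variable names are illustrative.

```python
import numpy as np

hwc = np.zeros((4, 6, 3), dtype=np.uint8)       # H=4, W=6, C=3
hwc[..., 0] = 255                               # fill the red channel only

chw = np.transpose(hwc, (2, 0, 1))              # (C, H, W): a view, no data copied
back = np.transpose(chw, (1, 2, 0))             # (H, W, C) again
batch = np.stack([chw, chw])                    # (N, C, H, W) with N=2

print(hwc.shape, chw.shape, batch.shape)        # (4, 6, 3) (3, 4, 6) (2, 3, 4, 6)
print(chw[0].min(), chw[1].max())               # 255 0: plane 0 is all red, plane 1 is empty

# Pitfall: reshape only reinterprets the flat buffer, it does not move axes
wrong = hwc.reshape(3, 4, 6)
print(wrong[0].min())                           # 0: the "red plane" now mixes all channels
```

Because wrong has exactly the shape a model expects, nothing crashes; the image is just scrambled. Moving axes needs transpose (or permute in PyTorch), not reshape.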

6. Reading and Writing Images

code
from PIL import Image

# Read
img = Image.open("input.jpg")           # lazy; data not loaded yet
img.load()                              # force load

# Write
img.save("output.png")                  # PNG (lossless)
img.save("output.jpg", quality=92)      # JPEG with quality

PIL has the most readable API for I/O; torchvision.io.read_image returns a uint8 (C, H, W) tensor directly; OpenCV is a common choice for video. Pick by need: PIL for clarity, torchvision for PyTorch pipelines, OpenCV for high-throughput frame processing.
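
The lossless/lossy distinction in the save calls above can be checked without touching the disk. This sketch round-trips a random-noise image through in-memory PNG and JPEG files; the roundtrip helper is hypothetical, and noise is close to a worst case for JPEG, so real photos will show smaller differences.

```python
import io

import numpy as np
from PIL import Image

# Random-noise image so the example needs no file on disk
rng = np.random.default_rng(0)
src = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

def roundtrip(a: np.ndarray, fmt: str, **save_kwargs) -> np.ndarray:
    """Encode an RGB array to an in-memory file and decode it again."""
    buf = io.BytesIO()
    Image.fromarray(a).save(buf, format=fmt, **save_kwargs)
    buf.seek(0)
    return np.array(Image.open(buf).convert("RGB"))

png = roundtrip(src, "PNG")
jpg = roundtrip(src, "JPEG", quality=92)

print(np.array_equal(png, src))         # True: PNG returns the exact pixels
diff = np.abs(jpg.astype(np.int16) - src.astype(np.int16))
print(diff.max() > 0)                   # True: JPEG changed at least some pixels
```

This is why intermediate results such as masks and labels belong in PNG: re-saving them as JPEG alters pixel values even at high quality settings.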

7. Visualizing Images

code
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3, figsize=(12, 4))
ax[0].imshow(arr); ax[0].set_title("RGB")
ax[1].imshow(np.array(gray), cmap="gray"); ax[1].set_title("Gray")
ax[2].imshow(arr[..., 0], cmap="gray"); ax[2].set_title("R channel")
for a in ax: a.axis("off")
plt.show()

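Single-channel data needs cmap="gray"; otherwise matplotlib applies its default colormap (viridis) and your image looks alien. imshow also stretches single-channel data to its own minimum and maximum by default, so pass vmin=0, vmax=255 when brightness needs to be comparable across plots. Triple-check colormaps in any vision plot you commit.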

8. Indexing and Slicing

code
# Top-left 100×100 patch
patch = arr[:100, :100]

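# Center crop (assumes the image is at least s pixels on each side)
H, W = arr.shape[:2]
s = 224
top, left = (H - s) // 2, (W - s) // 2
center = arr[top:top + s, left:left + s]

# Mask: red is strong and clearly above green
# Cast to a signed type first: uint8 arithmetic wraps around (230 + 30 -> 4)
r = arr[..., 0].astype(np.int16)
g = arr[..., 1].astype(np.int16)
mask = (r > 150) & (r > g + 30)
print(mask.shape, mask.dtype)           # (H, W), bool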

NumPy slicing is the universal language of vision preprocessing: crops, masks, color thresholds, and channel splits all reduce to a few lines of arr[...].
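
One detail of slicing is worth a short sketch: basic slices are views that share memory with the original, so editing a crop edits the source image. The canvas array and the variable names below are illustrative.

```python
import numpy as np

canvas = np.zeros((4, 4, 3), dtype=np.uint8)

view = canvas[:2, :2]           # basic slice: shares memory with canvas
view[...] = 255                 # writes through to canvas
print(canvas[0, 0])             # [255 255 255]

safe = canvas[2:, 2:].copy()    # explicit copy: independent of canvas
safe[...] = 7
print(canvas[3, 3])             # [0 0 0]

# Channel split: three (H, W) arrays, one per channel
red, green, blue = np.moveaxis(canvas, -1, 0)
print(red.shape)                # (4, 4)

# Assigning through a boolean mask edits the selected pixels in place
bright = canvas[..., 0] > 128
canvas[bright] = [10, 20, 30]
print(canvas[0, 0])             # [10 20 30]
```

Call .copy() on any crop you plan to modify or keep after the source array is reused.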

9. Tensor Sanity Checks
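
A minimal sketch of the kind of checks that follow from sections 4 and 5 (layout, dtype, value range), run on a batch right before it reaches a model. The helper check_image_batch is hypothetical, and it assumes inputs scaled to [0, 1]; after mean/std normalization the range check would need different bounds.

```python
import torch

def check_image_batch(x: torch.Tensor, channels: int = 3) -> None:
    """Raise AssertionError if x does not look like a float (N, C, H, W) batch in [0, 1].

    Hypothetical helper for illustration; the expected dtype and range are assumptions.
    """
    assert x.ndim == 4, f"expected 4 dims (N, C, H, W), got shape {tuple(x.shape)}"
    assert x.shape[1] == channels, f"expected {channels} channels at dim 1, got {x.shape[1]}"
    assert x.dtype == torch.float32, f"expected float32, got {x.dtype}"
    assert torch.isfinite(x).all(), "found NaN or Inf"
    lo, hi = x.min().item(), x.max().item()
    assert 0.0 <= lo and hi <= 1.0, f"expected values in [0, 1], got [{lo}, {hi}]"

good = torch.rand(2, 3, 8, 8)                   # float32 values in [0, 1)
check_image_batch(good)                         # passes silently

channels_last = good.permute(0, 2, 3, 1)        # (N, H, W, C): wrong layout
try:
    check_image_batch(channels_last)
except AssertionError as err:
    print("caught:", err)                       # reports 8 channels at dim 1
```

Checks like these are cheap enough to run on the first batch of every training job.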

10. Exercises
