Image Preprocessing and Augmentation
3 of 30 · Computer Vision with Deep Learning
Models don't eat raw pixels. They eat resized, normalized, augmented tensors. Preprocessing puts every input on a consistent scale; augmentation generates variety so the model learns features that generalize. This notebook covers the full preprocessing pipeline and the augmentation patterns that matter most in practice.
pip install albumentations==1.4.21 torchvision==0.20.1 \
    pillow==11.0.0 numpy==2.1.2
This notebook uses torchvision.transforms.v2 for PyTorch-native pipelines and albumentations for tasks where you also need to transform bounding boxes or masks.
1. Why Preprocess
A trained model expects inputs with the same size, value range, and statistics it saw during training. Three concrete reasons preprocessing matters:
- Fixed input size — most architectures take a specific shape (224 × 224 for ImageNet models, 384 × 384 for many ViTs).
- Numerical range — neural nets train better on data in roughly [-1, 1] than [0, 255].
- Train/serve consistency — preprocessing at training and at inference must be identical, or you ship garbage.
2. The Standard Pipeline
import torch
from torchvision.transforms import v2 as T

train_tx = T.Compose([
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.02),
    T.ToImage(),                            # PIL → uint8 image tensor (C, H, W)
    T.ToDtype(torch.float32, scale=True),   # → float32 in [0, 1]; replaces the deprecated v2 ToTensor()
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet stats
                std=[0.229, 0.224, 0.225]),
])

eval_tx = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
Two pipelines: training augments aggressively; evaluation is deterministic. The split is non-negotiable — augmenting at eval time produces non-reproducible numbers.
3. Why Those Specific Mean and Std Values
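The mean and std values used above are the per-channel mean and standard deviation of the ImageNet training images, in RGB order, after pixels are scaled to [0, 1]. Reusing them matters mainly when you fine-tune ImageNet-pretrained weights, because those weights were trained on inputs normalized this way. For a dataset from a very different domain, computing your own statistics is a common alternative. A minimal sketch of that computation, assuming the images are already stacked into one float tensor (the helpers `channel_stats` and `normalize` are illustrative names, not library functions):

```python
import torch

def channel_stats(images: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Per-channel mean and std for a float tensor of shape (N, C, H, W) in [0, 1]."""
    mean = images.mean(dim=(0, 2, 3))
    std = images.std(dim=(0, 2, 3))
    return mean, std

def normalize(images: torch.Tensor, mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
    """Subtract the channel mean and divide by the channel std (what Normalize does)."""
    return (images - mean[None, :, None, None]) / std[None, :, None, None]

torch.manual_seed(0)
batch = torch.rand(8, 3, 32, 32)   # random stand-in for a real dataset
mean, std = channel_stats(batch)
normed = normalize(batch, mean, std)

# After normalization each channel has roughly zero mean and unit std
print(normed.mean(dim=(0, 2, 3)), normed.std(dim=(0, 2, 3)))
```

For a dataset that does not fit in memory, the same statistics would be accumulated batch by batch rather than in one call.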
4. Resize, Crop, and Pad
| Operation | What it does | When to use |
|---|---|---|
| Resize(N) | Scale shorter side to N, preserve aspect | Keep all content, fixed size for next step |
| CenterCrop(N) | Take central N × N square | Eval-time deterministic crop |
| RandomResizedCrop(N) | Random box, resized to N × N | Training augmentation |
| Pad(p) | Add border | Preserve content without distortion |
| ResizedCrop / Letterbox | Resize + pad to a fixed shape | Detection (preserve aspect for boxes) |
5. Geometric Augmentations
The "classics" — useful for almost every task:
- Horizontal flip — free 2x effective data for most natural images. Skip for digits, text, asymmetric scenes (left/right matter for road signs).
- Random rotation — small angles (-15° to +15°); large rotations rarely help unless your data has them.
- Random crop / scale — handles position and size variation.
- Random affine — combined translate / rotate / scale; one-stop augment.
- Perspective — useful when test data has different camera angles (documents, signs).
6. Color Augmentations
T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)
T.RandomGrayscale(p=0.1)
T.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0))
Color jitter handles lighting variation. Grayscale (low p) helps the model not rely on color when it shouldn't. Blur simulates out-of-focus shots. The right intensity is task-dependent — a medical-imaging classifier should not have its hue rotated by 0.5.
7. Modern Augmentation Tricks
| Augmentation | What it does | Helps when |
|---|---|---|
| RandAugment | Randomly applies N transforms with magnitude M | Default for most modern training |
| AugMix | Mixes augmented chains; strong robustness | You care about distribution shift |
| Cutout / Random Erasing | Masks a random rectangle | Fights overfitting on small data |
| MixUp | Linearly mixes two images and their labels | Strong regularization, low-data regime |
| CutMix | Patches one image into another, blends labels by area | Object-centric tasks, classification |
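import torch

train_tx = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=2, magnitude=9),   # operates on the PIL / uint8 image
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),    # float32 in [0, 1]
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.25),                 # applied last, on the normalized tensor
])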
8. Albumentations: Boxes, Masks, Keypoints
Classic torchvision transforms operate on the image alone (the v2 API can also transform boxes and masks when they are wrapped as tv_tensors). Albumentations augments images and their associated bounding boxes, segmentation masks, and keypoints together and consistently. This matters for detection (Section 3) and segmentation (Section 4), where the labels also have to move when the image moves.
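import albumentations as A

train_tx = A.Compose([
    A.RandomResizedCrop(size=(224, 224), scale=(0.7, 1.0)),   # size is (height, width)
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.02, p=0.8),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
], bbox_params=A.BboxParams(format="yolo", label_fields=["class_ids"]))

# img: H × W × 3 uint8 RGB NumPy array
# boxes: YOLO format [x_center, y_center, width, height], normalized to [0, 1]
# ids: one class id per box
out = train_tx(image=img, bboxes=boxes, class_ids=ids)
img_aug, boxes_aug = out["image"], out["bboxes"]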