Image Preprocessing and Augmentation

45 min read · notebook · Image Fundamentals and Preprocessing
3 of 30 · Computer Vision with Deep Learning

Models don't eat raw pixels. They eat resized, normalized, augmented tensors. Preprocessing puts every input on a consistent scale; augmentation generates variety so the model learns features that generalize. This notebook covers the full preprocessing pipeline and the augmentation patterns that matter most in practice.

code
pip install albumentations==1.4.21 torchvision==0.20.1 \
    pillow==11.0.0 numpy==2.1.2
We use torchvision.transforms.v2 for PyTorch-native pipelines and albumentations for tasks where you also need to transform bounding boxes or masks.

1. Why Preprocess

A trained model expects inputs with the same shape and value statistics it saw during training. Three concrete reasons preprocessing matters:

  • Fixed input size — most architectures take a specific shape (224 × 224 for ImageNet models, 384 × 384 for many ViTs).
  • Numerical range — neural nets train better on data in roughly [-1, 1] than [0, 255].
  • Train/serve consistency — preprocessing at training and at inference must be identical, or you ship garbage.

2. The Standard Pipeline

code
import torch
from torchvision.transforms import v2 as T

train_tx = T.Compose([
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.02),
    # v2.ToTensor is deprecated; ToImage + ToDtype is its v2 replacement
    T.ToImage(),                              # PIL → (C, H, W) uint8 tensor
    T.ToDtype(torch.float32, scale=True),     # float32 in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet stats
                std=[0.229, 0.224, 0.225]),
])

eval_tx = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

Two pipelines: training augments aggressively; evaluation is deterministic. The split is non-negotiable — augmenting at eval time produces non-reproducible numbers.
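
To make the last steps of both pipelines concrete, here is a minimal NumPy sketch of what tensor conversion and normalization do numerically. It assumes an 8-bit RGB image in (H, W, C) layout; the helper name `to_normalized_chw` is illustrative and not part of torchvision.

```python
import numpy as np

# ImageNet per-channel statistics, as used in the pipelines above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_normalized_chw(img_u8: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 image to a normalized (3, H, W) float32 array."""
    x = img_u8.astype(np.float32) / 255.0      # scale [0, 255] -> [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD     # broadcasts over the channel axis
    return np.transpose(x, (2, 0, 1))          # (H, W, C) -> (C, H, W)


# Hypothetical input: a mid-gray 4 x 4 image.
gray = np.full((4, 4, 3), 128, dtype=np.uint8)
out = to_normalized_chw(gray)
```

With these statistics the output spans roughly -2.1 to 2.6 rather than exactly [-1, 1]; what matters is that training and inference apply the same mapping.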

3. Why Those Specific Mean and Std Values
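
The pipeline code labels these values as ImageNet stats: a per-channel mean and standard deviation of pixel values scaled to [0, 1]. As a rough sketch of how such statistics could be computed for your own data, the hypothetical helper `channel_stats` below accumulates running sums with NumPy. The random images are stand-ins, and using only the training split is an assumption of this sketch.

```python
import numpy as np


def channel_stats(images):
    """Per-channel mean and std over an iterable of (H, W, 3) uint8 images."""
    total = np.zeros(3, dtype=np.float64)
    total_sq = np.zeros(3, dtype=np.float64)
    n_pixels = 0
    for img in images:
        x = img.astype(np.float64) / 255.0       # same [0, 1] scale the model sees
        total += x.sum(axis=(0, 1))
        total_sq += (x ** 2).sum(axis=(0, 1))
        n_pixels += x.shape[0] * x.shape[1]
    mean = total / n_pixels
    var = total_sq / n_pixels - mean ** 2        # E[x^2] - E[x]^2
    std = np.sqrt(np.maximum(var, 0.0))          # guard against tiny negative values
    return mean, std


# Stand-in data: four random 8 x 8 RGB images.
rng = np.random.default_rng(0)
fake_images = [rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8) for _ in range(4)]
mean, std = channel_stats(fake_images)
```

When fine-tuning a model pretrained on ImageNet, keeping the pretrained statistics is the usual choice, since the weights were fit to inputs on that scale.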

4. Resize, Crop, and Pad

Operation | What it does | When to use
Resize(N) | Scale shorter side to N, preserve aspect ratio | Keep all content before a fixed-size step
CenterCrop(N) | Take the central N × N square | Eval-time deterministic crop
RandomResizedCrop(N) | Random box, resized to N × N | Training augmentation
Pad(p) | Add a border of p pixels | Preserve content without distortion
Letterbox | Resize + pad to a fixed shape | Detection (preserve aspect ratio for boxes)
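
The letterbox row can be sketched as plain arithmetic: scale the longer side to the target, then pad the shorter side symmetrically. The helpers `letterbox_params` and `letterbox_box` below are hypothetical names for illustration, assume a square target, and only compute geometry (no pixels are resized).

```python
def letterbox_params(height, width, target):
    """Return (new_h, new_w, pad_top, pad_left, scale) for a square letterbox."""
    scale = target / max(height, width)      # longer side maps exactly to target
    new_h = round(height * scale)
    new_w = round(width * scale)
    pad_top = (target - new_h) // 2          # split padding evenly; remainder goes bottom/right
    pad_left = (target - new_w) // 2
    return new_h, new_w, pad_top, pad_left, scale


def letterbox_box(box_xyxy, pad_top, pad_left, scale):
    """Map a pixel-space (x1, y1, x2, y2) box into the letterboxed image."""
    x1, y1, x2, y2 = box_xyxy
    return (x1 * scale + pad_left, y1 * scale + pad_top,
            x2 * scale + pad_left, y2 * scale + pad_top)


# A 480 x 640 (H x W) image letterboxed to 320 x 320.
new_h, new_w, pad_top, pad_left, scale = letterbox_params(480, 640, 320)
box = letterbox_box((100, 50, 300, 250), pad_top, pad_left, scale)
```

Because the same scale applies to both axes, box aspect ratios are unchanged; only an offset from the padding is added.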

5. Geometric Augmentations

The "classics" — useful for almost every task:

  • Horizontal flip — free 2x effective data for most natural images. Skip for digits, text, asymmetric scenes (left/right matter for road signs).
  • Random rotation — small angles (-15° to +15°); large rotations rarely help unless your data has them.
  • Random crop / scale — handles position and size variation.
  • Random affine — combined translate / rotate / scale; one-stop augment.
  • Perspective — useful when test data has different camera angles (documents, signs).

6. Color Augmentations

code
T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)
T.RandomGrayscale(p=0.1)
T.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0))

Color jitter handles lighting variation. Grayscale (low p) helps the model not rely on color when it shouldn't. Blur simulates out-of-focus shots. The right intensity is task-dependent — a medical-imaging classifier should not have its hue rotated by 0.5.
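
To show what the brightness and contrast strengths mean, here is a simplified NumPy sketch. It assumes the torchvision convention that a strength s samples a factor uniformly from [max(0, 1 - s), 1 + s]. The function `jitter_brightness_contrast` is a hypothetical helper, and its contrast step blends toward the overall image mean, which is a simplification rather than the library's exact implementation.

```python
import numpy as np


def jitter_brightness_contrast(img, brightness=0.4, contrast=0.4, rng=None):
    """Randomly rescale brightness and contrast of a float image in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    b = rng.uniform(max(0.0, 1 - brightness), 1 + brightness)
    c = rng.uniform(max(0.0, 1 - contrast), 1 + contrast)
    out = np.clip(img * b, 0.0, 1.0)          # brightness: scale toward or away from black
    mean = out.mean()
    out = (out - mean) * c + mean             # contrast: stretch around the mean intensity
    return np.clip(out, 0.0, 1.0)


# Stand-in image: a smooth ramp of values in [0, 1].
img = np.linspace(0.0, 1.0, 4 * 4 * 3).reshape(4, 4, 3)
jittered = jitter_brightness_contrast(img, rng=np.random.default_rng(1))
unchanged = jitter_brightness_contrast(img, brightness=0.0, contrast=0.0)
```

A strength of 0 leaves the image untouched, which is a quick sanity check when tuning intensities for a new task.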

7. Modern Augmentation Tricks

Augmentation | What it does | Helps when
RandAugment | Randomly applies N transforms with magnitude M | Default for most modern training
AugMix | Mixes augmented chains; strong robustness | You care about distribution shift
Cutout / Random Erasing | Masks a random rectangle | Fights overfitting on small data
MixUp | Linearly mixes two images and their labels | Strong regularization, low-data regime
CutMix | Patches one image into another, blends labels by area | Object-centric tasks, classification
code
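import torch
from torchvision.transforms import v2 as T

train_tx = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=2, magnitude=9),
    T.ToImage(),                            # replaces the deprecated v2.ToTensor
    T.ToDtype(torch.float32, scale=True),   # float32 in [0, 1]
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.25),                # operates on the normalized tensor
])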
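
MixUp and CutMix are not part of the per-image pipeline above because they combine two samples and their labels. The NumPy sketch below illustrates the idea for a single pair of (C, H, W) images with one-hot labels. The function names, the `alpha` defaults, and the toy inputs are assumptions for illustration, not library APIs.

```python
import numpy as np


def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two images and their one-hot labels with a Beta-sampled weight."""
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2


def cutmix(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Paste a random rectangle of x2 into x1; weight labels by pasted area."""
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))
    _, h, w = x1.shape
    cut_h = int(h * np.sqrt(1 - lam))         # patch covers roughly (1 - lam) of the area
    cut_w = int(w * np.sqrt(1 - lam))
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    x = x1.copy()
    x[:, top:bottom, left:right] = x2[:, top:bottom, left:right]
    # Recompute the weight from the area actually pasted after clipping at the borders.
    lam_adj = 1 - (bottom - top) * (right - left) / (h * w)
    return x, lam_adj * y1 + (1 - lam_adj) * y2


# Toy pair: an all-zeros image of class 0 and an all-ones image of class 1.
rng = np.random.default_rng(0)
a, ya = np.zeros((3, 8, 8), dtype=np.float32), np.array([1.0, 0.0])
b, yb = np.ones((3, 8, 8), dtype=np.float32), np.array([0.0, 1.0])
x_mix, y_mix = mixup(a, ya, b, yb, rng=rng)
x_cut, y_cut = cutmix(a, ya, b, yb, rng=rng)
```

In a real training loop these are typically applied per batch after the per-sample transforms, and the loss must accept soft labels.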

8. Albumentations: Boxes, Masks, Keypoints

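The classic torchvision transforms operate on images only; torchvision.transforms.v2 can also transform boxes and masks when they are wrapped in tv_tensors. Albumentations augments images together with their associated bounding boxes, segmentation masks, and keypoints, consistently and in a single call. This matters for detection (Section 3) and segmentation (Section 4), where the labels also have to move when the image moves.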

code
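import albumentations as A

train_tx = A.Compose([
    # albumentations 1.4.x takes the output shape as size=(height, width)
    A.RandomResizedCrop(size=(224, 224), scale=(0.7, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.02, p=0.8),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
], bbox_params=A.BboxParams(format="yolo", label_fields=["class_ids"]))

# img: (H, W, 3) uint8 NumPy array; boxes: normalized (cx, cy, w, h) rows;
# ids: one class label per box
out = train_tx(image=img, bboxes=boxes, class_ids=ids)
img_aug, boxes_aug = out["image"], out["bboxes"]
ids_aug = out["class_ids"]  # boxes cropped out are dropped, so read labels back too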
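
To see why labels have to move with the image, here is a hand-rolled horizontal flip for YOLO-format boxes in NumPy. It is a sketch of the bookkeeping a library does for you; `hflip_with_yolo_boxes` is a hypothetical helper and the tiny image and box are made up for the example.

```python
import numpy as np


def hflip_with_yolo_boxes(img, boxes):
    """Flip an (H, W, C) image left-right and mirror normalized (cx, cy, w, h) boxes."""
    flipped = img[:, ::-1].copy()                       # reverse the width axis
    boxes = np.array(boxes, dtype=np.float32).reshape(-1, 4)
    boxes[:, 0] = 1.0 - boxes[:, 0]                     # only the x-center changes
    return flipped, boxes


# Made-up example: one bright pixel on the left edge and one box left of center.
img = np.zeros((4, 6, 3), dtype=np.uint8)
img[1, 0] = 255
boxes = [[0.25, 0.5, 0.2, 0.4]]
img_f, boxes_f = hflip_with_yolo_boxes(img, boxes)
```

Flipping the pixels without updating `cx` would leave the box on the wrong side of the image, which is exactly the kind of silent label corruption a joint transform avoids.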

9. Anti-Patterns and Pitfalls

10. Exercises
