Convolutional Neural Networks for Image Classification

From convolutions to modern architectures

Christophe Avenel

NBIS

03-May-2026

CNNs for computer vision

CNN for image classification

CNN = Convolutional Neural Networks (or ConvNets)

Outline

  • Convolutions
  • Convolutions in Neural Networks
    • Motivations
    • Layers
  • Architectures
    • Classic CNN Architecture
    • AlexNet
    • VGG16
    • ResNet

Convolution

  • A mathematical operation that combines two functions to form a third function.
  • The feature map (or input data) and the kernel are combined to form a transformed feature map.
  • Often interpreted as a filter: the kernel filters the feature map for certain information (edges, etc.)

Convolving an image with an edge detector kernel.

The mathematical definition of the convolution of two functions f and x, evaluated at t:

y(t) = (f \otimes x)(t) = \int_{-\infty}^{\infty} f(k) \cdot x(t-k)\, \mathrm{d}k

Convolution as feature detector

Convolutional filters can be interpreted as feature detectors:

  • The input (feature map) is filtered for a certain feature (the kernel).
  • The output is large if the feature is detected in the image.

Image kernels demo

Convolution in a neural network

  • x is a 3 \times 3 chunk (yellow area) of the image (green array)
  • Each output neuron is parametrized with the 3 \times 3 weight matrix \mathbf{w} (small numbers)

The activation is obtained by sliding the 3 \times 3 window and computing:

z(x) = \mathrm{relu}(\mathbf{w}^T x + b)
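
A minimal sketch of this sliding-window computation in PyTorch (the 5×5 input and averaging kernel below are toy values, chosen only for illustration):

import torch
import torch.nn.functional as F

img = torch.arange(25.0).reshape(1, 1, 5, 5)   # toy 5x5 single-channel "image"
w = torch.ones(1, 1, 3, 3) / 9.0               # 3x3 averaging kernel
b = torch.zeros(1)

# conv2d slides the 3x3 window over the image; relu gives the activation z
z = F.relu(F.conv2d(img, w, bias=b))
print(z.shape)   # torch.Size([1, 1, 3, 3]): output side = 5 - 3 + 1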

Motivations

Standard Dense Layer for an image input:

import torch
import torch.nn as nn

# x: image batch of shape (N, 3, 480, 640)
x = torch.randn(1, 3, 480, 640)
y = nn.Flatten()(x)
# shape of y is: (N, 3 * 480 * 640)
z = nn.Linear(3 * 480 * 640, 1000)(y)

640 \times 480 \times 3 \times 1000 + 1000 \approx 922\,\text{M} parameters

  • No spatial organization of the input
  • Dense layers are never used directly on large images
  • The standard solution is to use convolutional layers

Motivations

Local connectivity

  • A neuron depends only on a few local input neurons
  • Translation equivariance: a shifted input produces a correspondingly shifted output

Comparison to Fully connected

  • Parameter sharing: reduce overfitting
  • Make use of spatial structure: strong prior for vision!

Animal Vision Analogy

  • Hubel & Wiesel, Receptive fields of single neurones in the cat’s striate cortex (1959)

Channels

Color image = tensor of shape (height, width, channels)

Convolutions are usually computed for each channel, and summed:

(k \star im^{color}) = \sum\limits_{c=0}^2 k^c \star im^c
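
A quick check of this identity in PyTorch, using random toy tensors rather than a real image:

import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 8, 8)   # toy RGB "image"
k = torch.randn(1, 3, 3, 3)     # one 3x3 kernel with 3 input channels

out = F.conv2d(img, k)          # multi-channel convolution...

# ...equals the sum of the three per-channel convolutions
per_channel = sum(F.conv2d(img[:, c:c + 1], k[:, c:c + 1]) for c in range(3))
print(torch.allclose(out, per_channel))   # True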

Multiple convolutions

  • Kernel size aka receptive field (usually 1, 3, 5, 7, 11)
  • Output dimension: length - kernel_size + 1

Strides

  • Strides: increment step size for the convolution operator
  • Reduces the size of the output map

Example with kernel size 3 \times 3 and a stride of 2 (image in blue)

Padding

  • Padding: artificially fill borders of image
  • Useful to keep spatial dimension constant across filters
  • Useful with strides and large receptive fields
  • Usually: fill with 0s

Shapes of convolution layers

Kernel or Filter shape: (F, F, C^i, C^o)

  • F \times F kernel size
  • C^i input channels
  • C^o output channels

Number of parameters: (F \times F \times C^i + 1) \times C^o
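
This count is easy to verify in PyTorch (example sizes F = 3, C^i = 64, C^o = 128):

import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))   # (3*3*64 + 1) * 128 = 73856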

Shapes of convolution layers

Activations or Feature maps shape:

  • Input: \left(W^i, H^i, C^i\right)
  • Output: \left(W^o, H^o, C^o\right)

W^o = (W^i - F + 2P) / S + 1
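
A quick sanity check of this formula in PyTorch (example values W^i = 32, F = 5, P = 2, S = 2, exercising both stride and padding):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
conv = nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2)
print(conv(x).shape)   # (1, 16, 16, 16): W^o = (32 - 5 + 2*2) // 2 + 1 = 16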

Convolution demo


Pooling

  • Spatial dimension reduction
  • Local invariance
  • No parameters: max or average of 2 \times 2 units

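For example, a 2×2 max-pooling halves each spatial dimension and has no learnable parameters:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)                               # (1, 64, 14, 14)
print(sum(p.numel() for p in pool.parameters()))   # 0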

Batch Normalization

Normalize activations within each mini-batch (per channel):

\hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \epsilon}}, \quad y = \gamma \hat{x} + \beta

Standard placement in a conv block:

# inside a model definition (in_c/out_c are the block's channel counts):
block = nn.Sequential(
    nn.Conv2d(in_c, out_c, kernel_size=3, padding=1),
    nn.BatchNorm2d(out_c),
    nn.ReLU(inplace=True),
)
  • Stabilizes and accelerates training of deep networks
  • Acts as a mild regularizer
  • Unlocked very deep models (ResNet, etc.)
  • Alternatives: LayerNorm (used in ViTs), GroupNorm (small batches)

Data augmentation

The cheapest regularizer you’ll ever deploy. Apply random transformations to training images on the fly:

import torch
from torchvision.transforms import v2

train_tf = v2.Compose([
    v2.ToImage(),                   # PIL image -> tensor (needed before ToDtype)
    v2.RandomResizedCrop(224),
    v2.RandomHorizontalFlip(),
    v2.ColorJitter(0.2, 0.2, 0.2),
    v2.RandAugment(),               # auto-tuned augmentation policy
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
  • Increases effective dataset size, reduces overfitting
  • More advanced: MixUp / CutMix (mix images and labels)
  • Always validate/test on un-augmented images

Transfer learning

In 2026, you almost never train a CNN from scratch — start from ImageNet-pretrained weights:

import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

# Replace the classification head for your N classes
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optionally freeze the backbone (linear-probe) ...
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

# ... or fine-tune the whole network with a small learning rate.
  • Linear probe: freeze backbone, train only the head (fast, small data)
  • Fine-tune: train everything with a small learning rate (best results, more data)

In PyTorch: MLP

Fully Connected Network: Multilayer Perceptron

import torch.nn as nn

mlp = nn.Sequential(
    nn.Flatten(),                # (N, 1, 28, 28) -> (N, 784)
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Linear(256, 10),          # logits; use CrossEntropyLoss
)

In PyTorch: ConvNet

Convolutional Network

import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 256),
    nn.ReLU(),
    nn.Linear(256, 10),          # logits; use CrossEntropyLoss
)

2D spatial organization of features preserved until Flatten.

Feature visualization

Early layers detect edges and textures, deeper layers respond to parts and whole objects.

Grad-CAM — debugging predictions

Question: which pixels did the network look at to predict this class?

Grad-CAM: weight each feature map by the average gradient of the class score w.r.t. that map, then ReLU and overlay:

L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_k \alpha_k^c \, A^k \right), \quad \alpha_k^c = \frac{1}{Z} \sum_{i,j} \frac{\partial y^c}{\partial A^k_{ij}}

  • Class-specific saliency map, no architectural changes needed
  • Use it to spot shortcut learning (e.g. model attending to background, watermarks, scanner artefacts) — a key debugging tool for the lab
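
A minimal sketch of this computation using forward/backward hooks, assuming a torchvision ResNet-50 with its last conv stage layer4 as the target (the random input below stands in for a real preprocessed image):

import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()

# Capture the feature maps A^k and the gradients dy^c/dA^k at layer4
feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(A=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(dA=go[0]))

x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
score = model(x)[0].max()              # class score y^c of the top prediction
score.backward()

alpha = grads["dA"].mean(dim=(2, 3), keepdim=True)   # spatial average of grads
cam = F.relu((alpha * feats["A"]).sum(dim=1))        # weighted sum over maps k
cam = F.interpolate(cam[None], size=(224, 224), mode="bilinear")[0]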

Grad-CAM — example

Architectures

Classic ConvNet Architecture

Input

Conv blocks

  • Convolution + activation (relu)
  • Convolution + activation (relu)
  • Maxpooling 2x2

Output

  • Fully connected layers
  • Softmax

AlexNet

Input: 227x227x3 image. First conv layer: kernel 11x11x3x96 stride 4

  • Kernel shape: (11,11,3,96)
  • Output shape: (55,55,96)
  • Number of parameters: 34,944
  • Equivalent MLP parameters: 43.7 × 10⁹

AlexNet

INPUT:     [227x227x3]
CONV1:     [55x55x96]    11x11, stride 4
MAX POOL1: [27x27x96]    3x3,   stride 2
CONV2-5:   [...x...x...] 5x5 / 3x3 stacks
MAX POOL3: [6x6x256]     3x3,   stride 2
FC6 / FC7: [4096]        4096 neurons
FC8:       [1000]        softmax logits

Total params: ~60M
  • First very large CNN trained on GPUs, 2012 ImageNet winner
  • Introduced ReLU, dropout, and aggressive data augmentation at scale

VGG16

VGG in PyTorch

import torch.nn as nn

def conv_block(in_c, out_c, n):
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(in_c if i == 0 else out_c, out_c, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return layers

vgg16 = nn.Sequential(
    *conv_block(3,   64, 2),
    *conv_block(64,  128, 2),
    *conv_block(128, 256, 3),
    *conv_block(256, 512, 3),
    *conv_block(512, 512, 3),
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096),        nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),        # logits; use CrossEntropyLoss
)

Or just load it pretrained — same recipe applies to ResNet, ConvNeXt, ViT, …

from torchvision.models import vgg16, VGG16_Weights
model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)

Memory and Parameters

           Activation maps          Parameters
INPUT:     [224x224x3]   = 150K     0
CONV3-64:  [224x224x64]  = 3.2M           1,728
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M          73,728
CONV3-256: [56x56x256]   = 800K         294,912
CONV3-512: [28x28x512]   = 400K       1,179,648
CONV3-512: [14x14x512]   = 100K       2,359,296
POOL2:     [7x7x512]     =  25K     0
FC:        [4096]        = 4096     102,760,448
FC:        [1000]        = 1000       4,096,000

TOTAL activations: 24M  → ~93MB / image  (x2 for backward)
TOTAL parameters:  138M → ~552MB        (x2 for SGD, x4 for Adam)
  • Most parameters live in the first FC layer — modern designs avoid this
  • Activations dominate memory at training time, parameters at inference time
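
These totals are easy to verify on the torchvision implementation:

from torchvision.models import vgg16

model = vgg16()   # random init; no download needed just to count parameters
print(sum(p.numel() for p in model.parameters()))   # 138,357,544, i.e. ~138M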

ResNet

Even deeper models:

34, 50, 101, 152 layers

ResNet — residual blocks

A block learns the residual w.r.t. identity

  • Good optimization properties
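
A minimal sketch of a basic residual block (same-channel case; real ResNets add strided 1×1 projections on the shortcut when shapes change):

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """out = relu(F(x) + x): the convolutions learn the residual F(x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # identity shortcut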

ResNet vs. VGG

ResNet50 compared to VGG:

  • Superior accuracy on most vision tasks
    • 5.25% top-5 error vs. 7.1%
  • Fewer parameters
    • 25M vs. 138M
  • Lower computational cost
    • 3.8 GFLOPs vs. 15.3 GFLOPs
  • Fully Convolutional until the last layer

What came after ResNet (2014–today)

Year  Model         Key idea
2014  Inception     factorized convolutions (1×1 + 3×3 + 5×5)
2017  MobileNet     depthwise-separable convs for mobile / edge
2019  EfficientNet  compound scaling of width/depth/resolution
2020  ViT           image as a sequence of patches → Transformer
2022  ConvNeXt      “modernized” CNN, matches ViTs at same compute

Vision Transformers (ViT) and foundation models (CLIP, DINOv2, SAM 2) are now state-of-the-art for many tasks — covered in Friday’s self-supervised lecture.

ImageNet-1k benchmarks

Model              Year  Params  Top-1  Notes
ResNet-50          2015  25 M    76.1%  the workhorse baseline
EfficientNet-B0    2019  5.3 M   77.7%  great accuracy/parameter
ConvNeXt-T         2022  29 M    82.1%  pure CNN, ViT-competitive
ViT-B/16           2020  86 M    81.1%  needs large-scale pretrain
ViT-L/16 + DINOv2  2023  300 M   86.7%  self-supervised pretraining
  • Architectures are converging at the top — CNN vs. Transformer matters less than data, scale, and pretraining
  • For most lab/research settings, a pretrained ResNet-50 or ConvNeXt-T is a strong default

Summary

  • Convolutions exploit local connectivity and parameter sharing to scale to images
  • Stacking conv + pooling layers builds a hierarchy of features
  • BatchNorm + residual connections were the key unlocks for very deep networks
  • Data augmentation is the cheapest regularizer; always use it
  • In 2026, start from pretrained weights — fine-tune or linear-probe rather than train from scratch
  • Use Grad-CAM to inspect what your model actually attends to

Coming up:

  • Next lecture — beyond classification: localisation, detection, segmentation
  • Friday — Vision Transformers, self-supervised pretraining, foundation models (CLIP, DINOv2, SAM 2)