Convolutional Neural Networks for Image Segmentation

Localisation, detection, and segmentation

Christophe Avenel

NBIS

03-May-2026

CNNs for computer vision

Beyond image classification

Classification answers “what is in this image?” — but real-world vision asks more:

  • Where is the object? (localisation)
  • How many objects, and which class is each one? (detection)
  • Which pixels belong to which object? (segmentation)

From classification to dense prediction

  • Same backbone (CNN/ViT) — but with task-specific heads
  • The pretrained classifier is the starting point in (almost) every modern approach

Outline

  • Localisation as regression
  • Detection algorithms (YOLO, RetinaNet, Faster R-CNN)
  • Fully convolutional networks → U-Net
  • Losses and transfer learning for segmentation
  • Instance segmentation (Mask R-CNN) and what came after

Localisation

  • Single object per image
  • Predict coordinates of a bounding box (x, y, w, h)
  • Evaluate via Intersection over Union (IoU), as in the sketch after this list
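
A minimal IoU computation for two axis-aligned boxes in the (x, y, w, h) format used above (box-format conventions vary across libraries; the function name is illustrative):

def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Intersection rectangle
    ix1, iy1 = max(xa, xb), max(ya, yb)
    ix2, iy2 = min(xa + wa, xb + wb), min(ya + ha, yb + hb)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 10, 10)))  # 25 / 175 ≈ 0.143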

Localisation as regression

Classification + Localisation

  • Use a CNN pre-trained on ImageNet (e.g. ResNet)
  • The “localisation head” is trained separately with a regression loss
  • At test time, use both heads

Output: C class scores + 4 box coordinates (one box).

To predict exactly N objects: regress N \times 4 coordinates and N \times C class scores. (A sketch of such a two-head model follows.)
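
A minimal sketch of such a model, assuming a torchvision ResNet-18 backbone; the class name and head sizes are illustrative, not from a specific paper:

import torch
import torch.nn as nn
from torchvision.models import resnet18

class ClassifyAndLocalise(nn.Module):
    """Shared pretrained backbone with a class head and a box-regression head."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.cls_head = nn.Linear(512, num_classes)   # C class scores
        self.box_head = nn.Linear(512, 4)             # (x, y, w, h)

    def forward(self, x):
        f = self.features(x).flatten(1)               # (batch, 512)
        return self.cls_head(f), self.box_head(f)

model = ClassifyAndLocalise(num_classes=20)
scores, box = model(torch.randn(1, 3, 224, 224))      # shapes (1, 20) and (1, 4)

The classification head is trained with cross-entropy, the box head with a regression loss (e.g. smooth L1) on the coordinates.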

Object detection

We don’t know in advance the number of objects in the image. Object detection relies on object proposal and object classification:

  • Object proposal: find regions of interest (RoIs) in the image
  • Object classification: classify the object in these regions

Two main families

  • Single-stage: a grid over the image where each cell acts as a proposal (SSD, YOLO, RetinaNet)
  • Two-stage: region proposal followed by classification (Faster R-CNN)

YOLO (You Only Look Once)

For each cell of the S \times S grid, predict:

  • B boxes with a confidence score each (5 \times B values: x, y, w, h, confidence) + class probabilities c
  • Final detections: keep boxes where C_j \cdot \mathrm{prob}(c) > \text{threshold}
  • One CNN, one forward pass — real-time detection
  • Processes the entire image globally, in a single pass (toy decode below)
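
To make the per-cell layout concrete, a toy decode of a YOLO-style output tensor (the values of S, B, C and the random tensor are placeholders; a real YOLO also parametrises boxes relative to their cell and applies non-maximum suppression):

import torch

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes
pred = torch.rand(S, S, B * 5 + C)       # stand-in for the network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5)    # (x, y, w, h, confidence)
class_probs = pred[..., B * 5:].softmax(dim=-1)  # per-cell class distribution

# Class-conditional score C_j * prob(c), thresholded for final detections
scores = boxes[..., 4].unsqueeze(-1) * class_probs.unsqueeze(2)  # (S, S, B, C)
keep = scores > 0.25
print(keep.sum().item(), "candidate detections above the threshold")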

YOLO today

The original YOLO concept is now a mature production library maintained by Ultralytics.

from ultralytics import YOLO

model = YOLO("yolo11n.pt")               # COCO-pretrained nano model
results = model.predict("img.jpg")       # detection

seg_model = YOLO("yolo11n-seg.pt")       # the -seg weights select the task
results = seg_model.predict("img.jpg")   # instance segmentation
  • Current generation: YOLO11 (2024) — anchor-free, decoupled heads, FPN
  • One package covers detection, instance segmentation, pose estimation, OBB, tracking
  • Real-time on CPU for the small variants; GPU for bigger ones

RetinaNet

Single-stage detector with:

  • Multiple scales through a Feature Pyramid Network
  • ≈100K anchor boxes per image
  • Focal loss to handle the imbalance between background and real objects (sketch after this list)
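
A few-line sketch of the binary focal loss from the RetinaNet paper, with the customary defaults alpha = 0.25 and gamma = 2:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy, well-classified examples."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

loss = focal_loss(torch.randn(8, 100), torch.randint(0, 2, (8, 100)).float())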

Box Proposals

Instead of using a predefined set of box proposals, derive them from the image itself:

  • Selective Search — from pixels (not learnt)
  • Faster R-CNN — Region Proposal Network (RPN)

Crop-and-resize operator (RoI-Pooling):

  • Input: convolutional map + N regions of interest
  • Output: a tensor of shape N \times 7 \times 7 \times \text{depth}
  • Lets the gradient propagate only through the selected regions, and keeps computation efficient (sketch after this list)
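
torchvision ships this operator directly; a minimal sketch with roi_align, the bilinearly interpolated refinement of RoI-Pooling (roi_pool is also available), on placeholder data:

import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)   # conv map of one image (stride 16)
# Boxes as (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 200.0],
                     [0, 50.0, 80.0, 400.0, 300.0]])

# spatial_scale converts image coordinates to feature-map coordinates
crops = roi_align(features, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(crops.shape)                        # torch.Size([2, 256, 7, 7])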

Faster R-CNN

  • Replace Selective Search with a Region Proposal Network (RPN), trained jointly with the detector
  • The RPN is translation invariant, unlike YOLO’s fixed grid (usage sketch below)
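
For reference, a pretrained Faster R-CNN is a few lines in torchvision (a usage sketch on a random image):

import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

with torch.no_grad():
    preds = model([torch.rand(3, 480, 640)])   # list of CHW images in [0, 1]

# Each prediction dict has 'boxes' (x1, y1, x2, y2), 'labels' and 'scores'
print(preds[0]["boxes"].shape)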

Segmentation

Output a class map for each pixel (e.g. dog vs background).

  • Instance segmentation: additionally distinguish object instances (two dogs get different instance labels)
  • One way to obtain it: object detection followed by per-box segmentation

Convolutionize

  • Slide a network trained on (224, 224) inputs over a larger image: the output becomes a spatial map whose size depends on the input
  • Convolutionize: replace the Dense (4096, 1000) layer with a 1 \times 1 convolution with 4096 input and 1000 output channels (sketch after this list)
  • Gives a coarse segmentation (no extra supervision)
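
A sketch of the conversion, using the slide’s 4096 → 1000 shapes: copy the Dense weights into a 1 \times 1 convolution so the head applies at every spatial location:

import torch
import torch.nn as nn

fc = nn.Linear(4096, 1000)                    # original Dense classifier layer
conv = nn.Conv2d(4096, 1000, kernel_size=1)   # same weights, applied everywhere

with torch.no_grad():
    conv.weight.copy_(fc.weight.view(1000, 4096, 1, 1))
    conv.bias.copy_(fc.bias)

# On a larger input the converted head produces a coarse map of class scores
feature_map = torch.randn(1, 4096, 10, 10)
print(conv(feature_map).shape)                # torch.Size([1, 1000, 10, 10])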

Fully Convolutional Network

  • Predict / backpropagate for every output pixel
  • Aggregate maps from several convolutions at different scales for more robust results

Deconvolution

“Deconvolution” is a misnomer: these are transposed convolutions, i.e. learned upsampling layers (sketch below).
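
A minimal example: a transposed convolution with kernel size 2 and stride 2 that learns to double the spatial resolution:

import torch
import torch.nn as nn

up = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                        kernel_size=2, stride=2)   # learned 2x upsampling
x = torch.randn(1, 64, 56, 56)
print(up(x).shape)                                 # torch.Size([1, 32, 112, 112])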

U-Net

  • Symmetric encoder–decoder with skip connections that concatenate features from the contracting path to the expanding path (tiny sketch after this list)
  • Trains well on small datasets with heavy augmentation
  • Fully convolutional → arbitrary input sizes at inference
  • The default architecture for biomedical and microscopy segmentation — directly relevant to the lab
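
A deliberately tiny U-Net with a single down/up level, just to make the skip concatenation explicit (real U-Nets stack four or five such levels with more channels):

import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, classes=2):
        super().__init__()
        self.enc = block(in_ch, 32)
        self.down = nn.MaxPool2d(2)
        self.mid = block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = block(64, 32)                  # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, classes, 1)     # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.down(e))
        d = self.dec(torch.cat([e, self.up(m)], dim=1))  # skip concatenation
        return self.head(d)

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])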

Segmentation losses

Segmentation = per-pixel classification, but with a strong class-imbalance problem (background dominates).

  • Cross-entropy (per pixel) — the default, but biased toward the majority class
  • Dice loss — directly optimizes overlap with the ground-truth mask:

\mathcal{L}_{\text{Dice}} = 1 - \frac{2 \, |P \cap G|}{|P| + |G|}

  • Focal loss — down-weights easy pixels, focuses on hard ones (same idea as RetinaNet, applied per-pixel)

In practice for biomedical/microscopy: train with BCE + Dice (sum of the two) for stable convergence on imbalanced masks.
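
A sketch of that combined loss for binary masks, using a soft (differentiable) Dice over predicted probabilities; the smoothing constant eps is a common convention, not a fixed standard:

import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, eps=1.0):
    """BCE + soft Dice for binary masks of shape (N, 1, H, W)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(1, 2, 3))
    denom = p.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1 - (2 * inter + eps) / (denom + eps)
    return bce + dice.mean()

loss = bce_dice_loss(torch.randn(2, 1, 64, 64),
                     torch.randint(0, 2, (2, 1, 64, 64)).float())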

Transfer learning for segmentation

Same recipe as classification: pretrained encoder + task-specific decoder.

import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",       # any timm/torchvision backbone
    encoder_weights="imagenet",    # ImageNet-pretrained
    in_channels=1,                  # grayscale microscopy
    classes=5,                      # number of mask channels
)
  • Decoder is randomly initialized; encoder starts from ImageNet weights
  • Works even when the input domain (microscopy, satellite, MRI) differs from ImageNet
  • Same library exposes U-Net, FPN, DeepLabV3+, MA-Net, etc. — swap one string

Mask R-CNN

The Faster R-CNN architecture with a third head that predicts a binary mask per RoI (plus RoIAlign in place of RoI-Pooling for pixel-accurate cropping).
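
As with Faster R-CNN, torchvision provides a pretrained model (usage sketch on a random image):

import torch
from torchvision.models.detection import (
    maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights,
)

model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

with torch.no_grad():
    preds = model([torch.rand(3, 480, 640)])

# 'masks' adds one soft mask per detected instance to boxes / labels / scores
print(preds[0]["masks"].shape)        # (num_instances, 1, 480, 640)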

Mask R-CNN — Results

  • Mask predictions are still coarse (predicted at low resolution, then upsampled to the box)
  • Excellent generalization across object instances

What came after Mask R-CNN

Year   Model        Key idea
2020   DETR         Transformer detection, end-to-end, no anchors / no NMS
2022   Mask2Former  Unified semantic / instance / panoptic segmentation
2023   SAM          Promptable segmentation — “click and get a mask”
2024   SAM 2        SAM extended to video with temporal consistency

Foundation models for segmentation (SAM 2, Grounding DINO, …) — pretrained on billions of masks, often zero-shot for new domains. Covered in Friday’s self-supervised lecture.

Summary

  • Localisation: regression heads on a CNN backbone for (x, y, w, h)
  • Detection: single-stage (YOLO, RetinaNet) vs. two-stage (Faster R-CNN); modern YOLO is a one-line pip install
  • Segmentation: encoder–decoder with skip connections — U-Net is the workhorse for biomedical imaging
  • Train it right: pretrained encoder + BCE + Dice loss + augmentation
  • Instance segmentation: Mask R-CNN, then DETR / Mask2Former
  • 2026 reality: foundation models (SAM 2, Grounding DINO) often give zero-shot masks — covered Friday