
NBIS
08-May-2026



The classification head typically has a fully connected layer followed by a softmax. The fully connected layer is what we’ll look at here.
\vec{y} = \operatorname{softmax}(W \vec{x} + \vec{\beta})
\Leftrightarrow
\vec{y} = \operatorname{softmax}( \begin{bmatrix} \vec{w}_1^\top \\ \vdots \\ \vec{w}_n^\top \end{bmatrix} \vec{x} + \vec{\beta})
A matrix multiplication can be thought of as a list of separate dot products.
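As a minimal sketch of this equivalence, the snippet below computes the logits of a fully connected layer once as a single matrix-vector product and once as one dot product per row \vec{w}_k^\top. The sizes, the random values, and the helper name `softmax` are illustrative assumptions, not part of the slides.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_classes, n_features = 3, 4          # illustrative sizes (assumption)
W = rng.normal(size=(n_classes, n_features))
beta = rng.normal(size=n_classes)
x = rng.normal(size=n_features)

# Matrix form: a single matrix-vector product.
logits_matrix = W @ x + beta

# Row form: one dot product per class weight vector w_k.
logits_rows = np.array([np.dot(W[k], x) + beta[k] for k in range(n_classes)])

# Both forms give the same logits, and the softmax output sums to one.
y = softmax(logits_matrix)
print(np.allclose(logits_matrix, logits_rows), y.sum())
```

Each logit is therefore a score of how well the input aligns with one class weight vector.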
The core mechanism of neural networks is the dot product
\vec{a} \cdot \vec{b} = |\vec{a}||\vec{b}| \cos\theta
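A small worked example of the identity above, with two hand-picked 2-D vectors (an assumption made purely for illustration): the angle recovered from the dot product is 45 degrees, and for unit-length vectors the dot product is the cosine itself.

```python
import numpy as np

a = np.array([3.0, 0.0])
b = np.array([1.0, 1.0])

# Dot product and the angle it implies: cos(theta) = a.b / (|a| |b|).
dot = float(np.dot(a, b))
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
theta_deg = float(np.degrees(np.arccos(cos_theta)))

# After L2 normalisation the dot product equals the cosine directly.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
cos_from_units = float(np.dot(a_hat, b_hat))

print(dot, theta_deg, cos_from_units)
```

This is why the losses later in the deck work on L2-normalised embeddings: the dot product then measures direction only.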
Different graphical representations of neural networks
Computational graph view of softmax







Each training step takes three L2-normalised embeddings — an anchor x_a, a positive x_p (same class), and a negative x_n (different class) — and asks the model to align the anchor with the positive more than with the negative:
\mathcal{L}_{\text{triplet}} = \max\!\Bigl(0,\; \underbrace{-\, x_a \cdot x_p}_{\text{attract positive}} + \underbrace{x_a \cdot x_n}_{\text{repel negative}} + \underbrace{m}_{\text{margin}}\Bigr)
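The formula above can be sketched in a few lines of NumPy. The function names, the margin value 0.2, and the toy vectors are assumptions for illustration only.

```python
import numpy as np

def l2_normalise(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def triplet_loss(x_a, x_p, x_n, margin=0.2):
    """Hinge on (negative similarity - positive similarity + margin)."""
    s_pos = np.sum(x_a * x_p, axis=-1)   # x_a . x_p
    s_neg = np.sum(x_a * x_n, axis=-1)   # x_a . x_n
    return np.maximum(0.0, s_neg - s_pos + margin)

x_a = l2_normalise(np.array([1.0, 0.0]))
x_p = l2_normalise(np.array([1.0, 0.1]))   # almost parallel to the anchor
x_n = l2_normalise(np.array([0.0, 1.0]))   # orthogonal to the anchor

loss_satisfied = triplet_loss(x_a, x_p, x_n)   # margin already met, loss is zero
loss_violated = triplet_loss(x_a, x_n, x_p)    # roles swapped, loss is positive
print(loss_satisfied, loss_violated)
```

Swapping the roles of positive and negative shows the two regimes: zero loss when the margin is met, and a loss larger than the margin when the negative is closer than the positive.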
Even when the formula is satisfied on average, the model can stagnate if the chosen negatives are too easy:
\mathcal{L}_{\text{triplet}} = \max\!\Bigl(0,\; s^- - s^+ + m\Bigr), \qquad s^+ = x_a \cdot x_p,\quad s^- = x_a \cdot x_n
| Situation | s^- - s^+ + m | Gradient |
|---|---|---|
| Easy negative: s^+ already \gg s^- + m | < 0 | Zero — no update |
| Semi-hard negative: s^- < s^+ but s^+ - s^- < m | (0, m) | Small corrective push |
| Hard negative: s^- > s^+ | > m | Large corrective update |
Most randomly sampled negatives are easy, so the model needs a hard-negative mining strategy to find informative triplets. This adds training overhead, since a clustering phase has to recompute the neighborhoods at regular intervals.
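The sketch below labels a triplet according to the table and then picks, for every anchor, the most similar example with a different label inside the batch. This in-batch selection is a simpler stand-in chosen for illustration, not the clustering-based neighborhood recomputation mentioned above; the function names and the toy data are assumptions.

```python
import numpy as np

def categorise_negative(s_pos, s_neg, margin):
    """Label a triplet as in the table: easy, semi-hard, or hard."""
    value = s_neg - s_pos + margin
    if value <= 0:
        return "easy"          # hinge inactive, no gradient
    if s_neg < s_pos:
        return "semi-hard"     # correct order, but inside the margin
    return "hard"              # negative at least as similar as the positive

def hardest_negative_indices(embeddings, labels):
    """For each anchor, index of the most similar example with another label."""
    sims = embeddings @ embeddings.T
    different = labels[:, None] != labels[None, :]
    sims = np.where(different, sims, -np.inf)   # ignore same-class entries
    return np.argmax(sims, axis=1)

# Four unit-length embeddings, two per class (toy data).
emb = np.array([[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]])
labels = np.array([0, 0, 1, 1])
hard_idx = hardest_negative_indices(emb, labels)
print(hard_idx)
print(categorise_negative(0.9, 0.1, 0.2),
      categorise_negative(0.9, 0.8, 0.2),
      categorise_negative(0.5, 0.7, 0.2))
```

Searching only within the batch avoids a separate clustering pass, at the cost of seeing only the negatives that happen to be sampled together.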

Instead of one positive and one negative per anchor, the SupCon loss uses, for each anchor i \in I, all same-class examples in the batch as positives and all remaining examples as negatives simultaneously:
\mathcal{L}_{\text{SupCon}} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \underbrace{\frac{\exp\!\bigl(x_i \cdot x_p \;/\; \tau\bigr)}{\displaystyle\sum_{k \in A(i)} \exp\!\bigl(x_i \cdot x_k \;/\; \tau\bigr)}}_{\text{softmax probability of drawing positive } p}
| Symbol | Meaning |
|---|---|
| x_i | L2-normalised embedding of anchor i |
| P(i) | All same-class examples in the batch, excluding i itself |
| A(i) | All other examples in the batch, excluding i |
| \tau | Temperature: lower \tau sharpens the distribution, penalising near-miss negatives more harshly |
| \tfrac{1}{\lvert P(i)\rvert} | Average over positives — more positives per anchor gives a more stable gradient signal |
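A minimal NumPy sketch of the SupCon formula, using the symbols from the table. The function name, the default \tau = 0.1, the toy embeddings, and the choice to skip anchors that have no positive in the batch are assumptions for illustration.

```python
import numpy as np

def supcon_loss(x, labels, tau=0.1):
    """SupCon loss for L2-normalised embeddings x (B, d) and labels (B,)."""
    B = x.shape[0]
    logits = (x @ x.T) / tau
    not_self = ~np.eye(B, dtype=bool)                              # A(i)
    positives = (labels[:, None] == labels[None, :]) & not_self    # P(i)

    # Log of the softmax denominator over A(i), computed stably.
    masked = np.where(not_self, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom

    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0          # assumption: anchors without positives are skipped
    sum_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -sum_log_prob[has_pos] / n_pos[has_pos]
    return float(per_anchor.sum())                                 # sum over i in I

x = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
x = x / np.linalg.norm(x, axis=1, keepdims=True)
loss_clustered = supcon_loss(x, np.array([0, 0, 1, 1]))   # labels match geometry
loss_mixed = supcon_loss(x, np.array([0, 1, 0, 1]))       # labels contradict it
print(loss_clustered, loss_mixed)
```

When the labels agree with the geometry of the embeddings the loss is close to zero; when they contradict it the loss is large.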
Khosla, Prannay, et al. “Supervised contrastive learning.” Advances in neural information processing systems 33 (2020): 18661-18673.



What if we can use the data itself to drive the learning of representations?
An autoencoder is a neural network trained to reconstruct its own input x from a latent code z:
z = f_{\text{encoder}}(x), \hat{x} = f_{\text{decoder}}(z)\\ \mathcal{L} = \|x - \hat{x}\|^2
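As a rough sketch of the objective above, the snippet trains a tiny autoencoder with plain gradient descent. Using linear maps for the encoder and decoder, the synthetic data, the learning rate, and the step count are all assumptions made to keep the example short; practical autoencoders use nonlinear networks.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 5-D data that lies on a 2-D subspace (assumption).
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X = X / np.sqrt(np.mean(np.sum(X ** 2, axis=1)))   # mean squared norm of 1

d, k = X.shape[1], 2                                # input and latent sizes
W_enc = rng.normal(scale=0.1, size=(d, k))          # linear encoder
W_dec = rng.normal(scale=0.1, size=(k, d))          # linear decoder

def reconstruction_loss(X, W_enc, W_dec):
    Z = X @ W_enc            # z = f_encoder(x)
    X_hat = Z @ W_dec        # x_hat = f_decoder(z)
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

loss_before = reconstruction_loss(X, W_enc, W_dec)
lr = 0.1
for _ in range(500):
    Z = X @ W_enc
    X_hat = Z @ W_dec
    G = 2.0 * (X_hat - X) / len(X)      # gradient of the loss w.r.t. X_hat
    grad_dec = Z.T @ G                  # chain rule through the decoder
    grad_enc = X.T @ (G @ W_dec.T)      # chain rule through the encoder
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
loss_after = reconstruction_loss(X, W_enc, W_dec)
print(loss_before, loss_after)
```

No labels are used anywhere: the input itself is the training target.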




We found that while no single transformation (that we studied) suffices to define a prediction task that yields the best representations, two transformations stand out: random cropping and random color distortion. Although neither cropping nor color distortion leads to high performance on its own, composing these two transformations leads to state-of-the-art results.
In our experiments, we found that using such a nonlinear projection helps improve the representation quality, improving the performance of a linear classifier trained on the SimCLR-learned representation by more than 10%.
we observe that the performance of a supervised ResNet peaked between 90 and 300 training epochs (on ImageNet), but SimCLR can continue its improvement even after 800 epochs of training

For each image, SimCLR creates two augmented views (x_i, x_j), then encodes and projects both to get normalised embeddings z_i, z_j. The NT-Xent (Normalised Temperature-scaled Cross-Entropy) loss asks the model to identify the matching view among the 2N-1 other views in the batch, of which 2(N-1) are negatives:
\mathcal{L}^{\text{NT-Xent}}_{i,j} = -\log \frac{\exp\!\bigl(\operatorname{sim}(z_i, z_j)/\tau\bigr)}{\displaystyle\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]}\, \exp\!\bigl(\operatorname{sim}(z_i, z_k)/\tau\bigr)}
| Symbol | Meaning |
|---|---|
| z_i = g(f(x_i)) | Projected embedding: encoder f followed by projection head g |
| \operatorname{sim}(u,v) = u^\top v / (\|u\|\|v\|) | Cosine similarity |
| N | Batch size; 2N total views per batch |
| \tau | Temperature: lower \tau sharpens the distribution |
| \mathbf{1}_{[k \neq i]} | Excludes the anchor itself from the denominator |
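The NT-Xent loss can be sketched as follows, averaging over all 2N anchors in the batch. The function name, the layout in which row i of the two inputs holds the two views of image i, the value \tau = 0.1, and the random toy embeddings are assumptions for illustration.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, tau=0.1):
    """Mean NT-Xent loss; z_a[i] and z_b[i] are the two views of image i."""
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)              # 2N views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # dot product = cosine
    logits = (z @ z.T) / tau
    np.fill_diagonal(logits, -np.inf)                   # indicator 1[k != i]

    # The positive partner of row i is row i + N, and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_prob_pos = logits[np.arange(2 * n), pos] - log_denom
    return float(-log_prob_pos.mean())

rng = np.random.default_rng(0)
views_a = rng.normal(size=(4, 8))
views_b = views_a + 0.01 * rng.normal(size=(4, 8))      # nearly identical views
loss_matched = nt_xent_loss(views_a, views_b)
loss_unrelated = nt_xent_loss(views_a, rng.normal(size=(4, 8)))
print(loss_matched, loss_unrelated)
```

Views that really match yield a much lower loss than unrelated embeddings, which is the signal the encoder is trained on.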
Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” ICML, 2020.
Instead of training a neural network (the student) only on the “hard” labels of our dataset, also train it on “soft” targets produced by a strong teacher model.
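The slide gives no formula, so the sketch below assumes one common formulation of distillation: a weighted sum of the usual cross-entropy on hard labels and a cross-entropy against temperature-softened teacher probabilities. The temperature `T`, the weight `alpha`, the T^2 scaling of the soft term, and all names are assumptions.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of a hard-label term and a soft-target term."""
    # Hard-label term: ordinary cross-entropy against the dataset labels.
    log_p = log_softmax(student_logits)
    hard = -log_p[np.arange(len(labels)), labels].mean()
    # Soft-target term: cross-entropy against the softened teacher output.
    soft_targets = np.exp(log_softmax(teacher_logits / T))
    soft = -(soft_targets * log_softmax(student_logits / T)).sum(axis=-1).mean()
    return float(alpha * hard + (1.0 - alpha) * (T ** 2) * soft)

teacher = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
labels = np.array([0, 1])
loss_agree = distillation_loss(teacher.copy(), teacher, labels)   # student copies teacher
loss_disagree = distillation_loss(-teacher, teacher, labels)      # student contradicts it
print(loss_agree, loss_disagree)
```

The soft targets also carry information about how the teacher ranks the wrong classes, which the hard labels cannot express.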

Grill, Jean-Bastien, et al. “Bootstrap your own latent - a new approach to self-supervised learning.” Advances in neural information processing systems 33 (2020): 21271-21284.
BYOL trains an online network (parameters \theta) to predict the representations of a target network (parameters \xi). There are no negative pairs; collapse is prevented by the asymmetry between the two networks.
\mathcal{L}_\text{BYOL} = \bigl\|\bar{q}_\theta(z^\text{online}) - \bar{z}^\text{target}\bigr\|_2^2 = 2 - 2\,\frac{\langle q_\theta(z^\text{online}),\, z^\text{target}\rangle}{\|q_\theta(z^\text{online})\|_2\;\|z^\text{target}\|_2}
| Symbol | Meaning |
|---|---|
| z^\text{online} = g_\theta(f_\theta(x)) | Online projection: encoder + projection head |
| z^\text{target} = g_\xi(f_\xi(x')) | Target projection for a different augmented view x' |
| q_\theta(\cdot) | Online predictor head (extra MLP absent from target network) |
| \bar{v} = v / \|v\|_2 | L2-normalised vector |
| \xi \leftarrow \lambda\,\xi + (1-\lambda)\,\theta | Target updated via EMA — no gradient flows through it |
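A short numerical check of the two forms of the BYOL loss, together with the EMA update from the table. The function names, the value of \lambda, and the toy vectors are assumptions for illustration.

```python
import numpy as np

def byol_loss(q_online, z_target):
    """Squared distance between the L2-normalised prediction and target."""
    q = q_online / np.linalg.norm(q_online, axis=-1, keepdims=True)
    z = z_target / np.linalg.norm(z_target, axis=-1, keepdims=True)
    return np.sum((q - z) ** 2, axis=-1)

def ema_update(xi, theta, lam=0.99):
    """Target parameters follow the online parameters slowly; no gradients."""
    return lam * xi + (1.0 - lam) * theta

q = np.array([[1.0, 2.0, 3.0]])    # stands in for q_theta(z_online)
z = np.array([[2.0, 1.0, 0.5]])    # stands in for z_target
loss = byol_loss(q, z)

# Same value through the cosine form: 2 - 2 * cos(q, z).
cos = np.sum(q * z, axis=-1) / (np.linalg.norm(q, axis=-1) * np.linalg.norm(z, axis=-1))
print(loss, 2.0 - 2.0 * cos)

new_xi = ema_update(np.array([1.0]), np.array([0.0]), lam=0.9)
print(new_xi)
```

Because both vectors are normalised first, the loss depends only on the angle between prediction and target, not on their lengths.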
Caron, Mathilde, et al. “Emerging properties in self-supervised vision transformers.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.
DINO uses a self-distillation objective: the student learns to match the (sharper) teacher’s output distribution over a set of learned prototype dimensions.
P_s(x) = \operatorname{softmax}\!\!\left(\frac{f_s(x)}{\tau_s}\right),\qquad P_t(x) = \operatorname{softmax}\!\!\left(\frac{f_t(x) - c}{\tau_t}\right)
\mathcal{L}_\text{DINO} = \sum_{\substack{v_1,\, v_2 \,\in\, \mathcal{V} \\ v_1 \neq v_2}} H\!\bigl(P_t(v_1),\, P_s(v_2)\bigr), \qquad H(p,q) = -\sum_i p_i \log q_i
| Symbol | Meaning |
|---|---|
| \tau_s > \tau_t | Student temperature higher — teacher output is sharper (more confident) |
| c | Centering vector: EMA of teacher outputs, subtracted to prevent one-prototype collapse |
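| \mathcal{V} | Multi-crop views: 2 global + several local crops; in the original paper only the global crops are passed through the teacher, so v_1 ranges over the global views |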
| \theta_t \leftarrow m\,\theta_t + (1-m)\,\theta_s | Teacher updated by EMA only — no backprop through teacher |
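The DINO objective can be sketched as below for the views of a single image. The code sums over all ordered pairs of distinct views, as the formula above is written; restricting the teacher to global crops would only change the outer loop. The temperatures, the centre momentum, the function names, and the toy outputs are assumptions.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """student_out, teacher_out: (V, K) outputs for V views of one image."""
    p_t = np.exp(log_softmax((teacher_out - center) / tau_t))  # centred, sharp
    log_p_s = log_softmax(student_out / tau_s)
    total = 0.0
    n_views = student_out.shape[0]
    for v1 in range(n_views):            # view seen by the teacher
        for v2 in range(n_views):        # view seen by the student
            if v1 != v2:
                total += -np.sum(p_t[v1] * log_p_s[v2])   # H(P_t(v1), P_s(v2))
    return float(total)

def update_center(center, teacher_out, momentum=0.9):
    """EMA of the mean teacher output, used as the centring vector c."""
    return momentum * center + (1.0 - momentum) * teacher_out.mean(axis=0)

teacher_out = np.array([[4.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
center = np.zeros(3)
loss_agree = dino_loss(teacher_out.copy(), teacher_out, center)
loss_disagree = dino_loss(np.array([[0.0, 4.0, 0.0], [0.0, 4.0, 0.0]]), teacher_out, center)
new_center = update_center(center, teacher_out)
print(loss_agree, loss_disagree, new_center)
```

The lower teacher temperature makes P_t close to one-hot, so the student is pushed towards a confident prediction that is consistent across views.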
Siméoni, Oriane, et al. “Dinov3.” arXiv preprint arXiv:2508.10104 (2025).


\mathcal{L}_\text{Gram} = \lVert \mathbf{X}_S \cdot \mathbf{X}_S^\top - \mathbf{X}_G \cdot \mathbf{X}_G^\top \rVert_\text{F}^2
Constrain all pairwise inner products between the current model's features (\mathbf{X}_S) to stay close to those of an earlier version of the model (\mathbf{X}_G).
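A minimal sketch of the Gram loss, assuming each row of the two matrices is one feature vector and that rows are L2-normalised before the Gram matrices are formed. The function name and the random toy features are also assumptions.

```python
import numpy as np

def gram_loss(X_s, X_g):
    """Squared Frobenius distance between the Gram matrices of (P, d) features."""
    X_s = X_s / np.linalg.norm(X_s, axis=1, keepdims=True)
    X_g = X_g / np.linalg.norm(X_g, axis=1, keepdims=True)
    diff = X_s @ X_s.T - X_g @ X_g.T     # difference of all pairwise similarities
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
X_g = rng.normal(size=(6, 4))                     # features of the earlier model
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))      # random orthogonal matrix

loss_rotated = gram_loss(X_g @ Q, X_g)            # same geometry, rotated basis
loss_other = gram_loss(rng.normal(size=(6, 4)), X_g)
print(loss_rotated, loss_other)
```

A rotated copy of the features incurs essentially zero loss: the constraint acts on the similarity structure between features, not on the feature values themselves.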
Multimodal Machine Learning tutorial - ACL2017 – https://www.cs.cmu.edu/~morency/MMML-Tutorial-ACL2017.pdf


With multimodal data we have “natural” associations between the separate modalities.


Afham, Mohamed, et al. “Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
https://github.com/openai/CLIP
Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International Conference on Machine Learning. PMLR, 2021.
For a batch of N image–text pairs, CLIP computes L2-normalised embeddings from two separate encoders and applies a symmetric NT-Xent loss across modalities:
\mathcal{L}_\text{CLIP} = \frac{1}{2}\Bigl(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\Bigr)
\mathcal{L}_{I \to T} = -\frac{1}{N}\sum_{i=1}^N \log \frac{\exp(I_i \cdot T_i\,/\,\tau)}{\displaystyle\sum_{j=1}^N \exp(I_i \cdot T_j\,/\,\tau)}, \qquad \mathcal{L}_{T \to I} = -\frac{1}{N}\sum_{i=1}^N \log \frac{\exp(T_i \cdot I_i\,/\,\tau)}{\displaystyle\sum_{j=1}^N \exp(T_i \cdot I_j\,/\,\tau)}
| Symbol | Meaning |
|---|---|
| I_i = f_I(x_i) / \|f_I(x_i)\| | L2-normalised image embedding |
| T_i = f_T(c_i) / \|f_T(c_i)\| | L2-normalised text embedding for paired caption c_i |
| \tau | Learned temperature (not a fixed hyperparameter as in SimCLR) |
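The symmetric loss can be sketched with a single similarity matrix: a softmax along rows gives the image-to-text term, a softmax along columns gives the text-to-image term. The function names, the fixed \tau = 0.07 (CLIP learns it during training), and the random toy features are assumptions for illustration.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(img_feats, txt_feats, tau=0.07):
    """Symmetric contrastive loss for N paired image and text features."""
    I = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    T = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = (I @ T.T) / tau                 # logits[i, j] = I_i . T_j / tau
    idx = np.arange(len(I))
    # Image to text: each image picks its caption among all captions (rows).
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text to image: each caption picks its image among all images (columns).
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_i2t + loss_t2i))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))   # toy captions aligned with images
loss_paired = clip_loss(img, txt)
loss_shuffled = clip_loss(img, txt[[1, 2, 3, 0]])   # pairing deliberately broken
print(loss_paired, loss_shuffled)
```

The correct pairs sit on the diagonal of the similarity matrix, so both terms are ordinary cross-entropies with the diagonal index as the target class.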
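Relation to SimCLR and SupCon: all three losses apply a softmax over temperature-scaled similarities of normalised embeddings. They differ in what counts as a positive pair: two augmented views of the same image (SimCLR), two examples with the same class label (SupCon), or an image and its paired caption (CLIP).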
Representation learning