Practical Advice

Generalization, hyperparameters, and practical deep learning

Erik Ylipää – AIDA Data Hub / Linköping University / NBIS

08-May-2026

Generalization

The goal of supervised machine learning is to minimize the (unknown) generalization error.

How to estimate generalization error

  • Set aside some test data for estimating the generalization error
  • No learning and no decisions are allowed based on this dataset for it to be meaningful

Training set

Use the training set to develop learning algorithms

Test set

At the very end, use the test set to estimate generalization error

How generalization fails

Underfitting

Just right

Overfitting

Model Capacity

  • We refer to a model’s ability to adapt to data as its capacity
  • Models with low capacity are limited in the datasets they can fit well
  • Models with high capacity can fit arbitrary datasets well

Underfitting

  • The adjustable function is inflexible, its capacity is too low
  • It misses the important variation in the data
  • It performs equally poorly on all data, both training data and unseen data
  • Both training error and generalization error are high

Overfitting

  • The adjustable function is too flexible, its capacity is too high
  • It adjusts to the unimportant variations in the data, like noise
  • Outside of the training data it performs poorly
    • Training error is low, generalization error is high

Just right

  • The adjustable function is just flexible enough, it has enough capacity
  • It adjusts to the variations in the data which are general
  • It would give good predictions for new data
  • Training error and generalization error follow each other
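The three regimes above can be illustrated with a small experiment. This is a minimal sketch, assuming a hypothetical toy problem (noisy samples of a sine curve) and using polynomial degree as the capacity knob; neither is taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples from a sine curve (hypothetical toy problem)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_new, y_new = make_data(1000)  # stands in for data the model has never seen

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 4, 12):  # low, moderate and high capacity
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_new, y_new))
    print(f"degree {degree:2d}: train MSE {errors[degree][0]:.3f}, "
          f"new-data MSE {errors[degree][1]:.3f}")
```

Training error keeps falling as the degree grows. The error on new data is high for the straight line (underfitting) and typically rises again for the highest degree (overfitting).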

To generalize, control capacity

  • A lot of practical machine learning is about controlling the capacity of the chosen model
  • Regularization refers to methods aimed at reducing generalization error, often by reducing the capacity of the model
  • Regularization is often controlled by hyperparameters

Hyperparameters

  • Parameters which are determined before learning are called hyperparameters
  • They often control model capacity or the speed of learning (which implicitly affects model capacity)
    • The number of free parameters directly controls capacity, so it is considered a hyperparameter

Few free parameters — low capacity

Lots of free parameters — high capacity

Hyperparameters

  • Hyperparameters which control capacity (such as the size of neural network layers) cannot be learned by minimizing the training error
    • Higher capacity will lead to lower training error
  • But we are not allowed to search for hyperparameters using the test error
    • If we do, the test set loses its purpose

Hyperparameter optimization

  • We want to find the hyperparameters that lead to the best generalization error
    • We can’t use the training set for this!
    • We can’t use the test set for this!

Training set

Use the training set to develop learning algorithms

Test set

At the very end, use the test set to estimate generalization error

Development set

  • The solution is to further split the training set
    • We hold out part of the training set and estimate the influence of the hyperparameters on it
    • This dataset is often referred to as the validation set, but the term development set (dev set for short) is increasingly used

Training set

Test set

Training set

Use the training set to adjust the function to minimize training error

Development set

Use the development set to adjust hyperparameters to minimize validation error

Test set

At the very end, use the test set to estimate generalization error
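The three-way split can be sketched as follows. This is a minimal illustration: the function name `split_indices` and the 70/15/15 proportions are assumptions, not values from the slides.

```python
import random

def split_indices(n_samples, dev_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle sample indices once and cut them into train, dev and test parts."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    n_test = round(n_samples * test_fraction)
    n_dev = round(n_samples * dev_fraction)
    test_idx = order[:n_test]                # touched only at the very end
    dev_idx = order[n_test:n_test + n_dev]   # used for hyperparameter decisions
    train_idx = order[n_test + n_dev:]       # used for fitting the model
    return train_idx, dev_idx, test_idx

train_idx, dev_idx, test_idx = split_indices(1000)
print(len(train_idx), len(dev_idx), len(test_idx))  # 700 150 150
```

Fixing the seed keeps the split identical across runs, so the test set stays untouched while you iterate on the other two.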

Tuning hyperparameters

  • We can plot the dev-set error as a function of model capacity (e.g. by varying a capacity-controlling hyperparameter)
  • We look for the minimum of these error curves to find good hyperparameter settings
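A sweep of this kind might look like the sketch below. It assumes a hypothetical toy regression problem and uses ridge (L2) regularization strength as the capacity-controlling hyperparameter; the helper names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples from a sine curve (hypothetical toy problem)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

def features(x, degree=10):
    """Polynomial features 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(X, y, strength):
    """Closed-form ridge regression; a larger strength means lower effective capacity."""
    return np.linalg.solve(X.T @ X + strength * np.eye(X.shape[1]), X.T @ y)

x_train, y_train = make_data(30)
x_dev, y_dev = make_data(30)

strengths = [10.0 ** k for k in range(-6, 3)]
dev_errors = []
for strength in strengths:
    weights = fit_ridge(features(x_train), y_train, strength)
    dev_errors.append(float(np.mean((features(x_dev) @ weights - y_dev) ** 2)))

best_strength = strengths[int(np.argmin(dev_errors))]
print("dev-set error per strength:", [round(e, 3) for e in dev_errors])
print("selected strength:", best_strength)
```

Plotting `dev_errors` against `strengths` on a logarithmic axis gives the error curve whose minimum is read off here with `argmin`.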

Small datasets — high estimator variance

  • If the amount of data is small, there’s a risk that the test or development sets become non-representative
    • Similar to having too small sample sizes in traditional statistical studies
  • They become poor estimates for the generalization error

Cross validation

A useful approach is to run multiple trainings, each with a different held-out test set. This is called cross-validation.

Source: https://mlfromscratch.com/nested-cross-validation

Cross validation

The average error over all splits is a better estimate of the generalization error than using a single split.

Source: https://mlfromscratch.com/nested-cross-validation
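A minimal k-fold sketch, assuming a hypothetical toy dataset and a straight-line fit as the "model". The function names are illustrative and not taken from any particular library.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, held_out) index arrays; every sample is held out exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def cross_validate(x, y, k=5):
    """Fit a straight line on each training part and score it on the held-out part."""
    errors = []
    for train, held_out in k_fold_indices(len(x), k):
        slope, intercept = np.polyfit(x[train], y[train], deg=1)
        prediction = slope * x[held_out] + intercept
        errors.append(float(np.mean((prediction - y[held_out]) ** 2)))
    return float(np.mean(errors)), float(np.std(errors))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=40)
y = 2.0 * x + rng.normal(scale=0.1, size=40)

mean_err, std_err = cross_validate(x, y)
print(f"cross-validated MSE: {mean_err:.4f} (std over folds {std_err:.4f})")
```

The spread over folds gives a rough idea of how much a single split could have misled you.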

Nested cross validation

  • For deep neural networks, it’s often a good idea to use nested cross-validation whenever you would otherwise need cross-validation
  • First use cross-validation to select multiple test sets. Then, for each of the corresponding training sets, do a further round of cross-validation with different development sets

Source: https://mlfromscratch.com/nested-cross-validation-python-code
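The two loops can be sketched like this. It is a simplified illustration under assumptions: polynomial degree stands in for the hyperparameter, the data is a hypothetical toy problem, and all function names are made up for the example.

```python
import numpy as np

def folds(indices, k, rng):
    """Yield (rest, held_out) splits of the given index array."""
    parts = np.array_split(rng.permutation(indices), k)
    for i in range(k):
        rest = np.concatenate([p for j, p in enumerate(parts) if j != i])
        yield rest, parts[i]

def fit_and_score(x, y, train, held_out, degree):
    """Fit a polynomial on `train` and return its MSE on `held_out`."""
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    return float(np.mean((np.polyval(coeffs, x[held_out]) - y[held_out]) ** 2))

def nested_cv(x, y, degrees=(1, 3, 5), outer_k=5, inner_k=4, seed=0):
    rng = np.random.default_rng(seed)
    outer_scores = []
    for train_dev, test in folds(np.arange(len(x)), outer_k, rng):
        # Inner loop: choose the hyperparameter using only the non-test data.
        inner_splits = list(folds(train_dev, inner_k, rng))
        mean_dev_error = {
            d: np.mean([fit_and_score(x, y, tr, dev, d) for tr, dev in inner_splits])
            for d in degrees
        }
        best_degree = min(mean_dev_error, key=mean_dev_error.get)
        # Outer loop: refit with the chosen setting and score on the untouched test fold.
        outer_scores.append(fit_and_score(x, y, train_dev, test, best_degree))
    return float(np.mean(outer_scores)), outer_scores

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=60)

mean_score, outer_scores = nested_cv(x, y)
print(f"nested CV estimate of generalization MSE: {mean_score:.3f}")
```

The outer score estimates the performance of the whole procedure (training plus hyperparameter selection), not of one particular fitted model.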

Practical steps to build generalizing models

  • During development, use a validation set to estimate how hyperparameters impact generalization error
    • Make a habit of calling this a development set, your non-ML colleagues will be much less confused
  • After development, use the test set to estimate your final generalization error
    • You do this for your own sake, to understand what you can expect of your model
    • Do not select models based on test performance
  • If you have small amounts of data, perform cross-validation to get a better generalization estimate of your method (not of a single model)

Do not select the model based on test performance

  • The cardinal sin of statistical learning is to use the test set for any model decision
  • Still, the ML research field often chooses which models to develop further based on which models have published the best score on an open test set
    • As a community, we tend to overfit to benchmark test sets: not at the level of individual articles, but collectively over time
  • How would you solve this problem?

Overfitting in neural networks

Source: https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks

Training neural networks in practice

  • Understanding learning curves is central to training networks in practice
  • They show you if or when your model underfits or overfits
  • Use early stopping to find the best point at which to stop training your model
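One common way to implement early stopping is a patience rule: stop once the validation loss has not improved for a fixed number of epochs, and keep the weights from the best epoch. The sketch below applies that rule to a made-up validation curve; the `patience` parameter and the numbers are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops at the first epoch that is `patience` epochs past the best
    one without improvement; the model saved at best_epoch is the one to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation curve: improves, flattens, then rises (overfitting).
curve = [1.00, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60, 0.65]
print(early_stopping_epoch(curve))  # (3, 6)
```

In a real training loop the same bookkeeping runs once per epoch, together with saving a checkpoint whenever the validation loss improves.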

Hyperparameters in deep learning

How do you decide on hyperparameters?

Grad Student Descent — someone manually tries out a bunch.

Hyperparameters

  • Parameters which are determined before learning are called hyperparameters
  • Learning rate is one of the most important hyperparameters in deep learning
  • We need to find a good learning rate. This search is called hyperparameter tuning

Hyperparameter tuning

  • Tuning hyperparameters is one of the central activities in practical deep learning
  • We tune them by looking at the error on a validation set
  • If the amount of data is small, use cross-validation

Basic Hyperparameter tuning

  1. Select values for the hyperparameters
    1. Train a network using these hyperparameters
    2. Record the validation error
  2. Repeat step 1 a preset number of times
  3. Select the hyperparameters corresponding to the best validation error
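The loop above, with random sampling as the selection rule, can be sketched as follows. `train_and_validate` is a hypothetical stand-in for training a network and measuring its validation error; its formula and the search ranges are invented for the example.

```python
import math
import random

def train_and_validate(learning_rate, hidden_size):
    """Hypothetical stand-in: pretend validation error is lowest near lr=1e-3, size=128."""
    return (math.log10(learning_rate) + 3.0) ** 2 + 0.1 * abs(math.log2(hidden_size) - 7)

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        # Step 1: select values (learning rate is sampled log-uniformly).
        params = {
            "learning_rate": 10.0 ** rng.uniform(-5.0, -1.0),
            "hidden_size": rng.choice([32, 64, 128, 256, 512]),
        }
        # Steps 1.1 and 1.2: train and record the validation error.
        trials.append((train_and_validate(**params), params))
    # Step 3: pick the setting with the best recorded validation error.
    return min(trials, key=lambda trial: trial[0])

best_error, best_params = random_search()
print(best_error, best_params)
```

Sampling the learning rate on a logarithmic scale is a common choice because useful values often span several orders of magnitude.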

Important Neural Network Hyperparameters

Optimization choices

  • Learning rate, often the most important hyperparameter
  • Learning rate schedule
  • Which optimization algorithm to choose
  • Metaparameters for optimization (e.g. momentum parameters)
  • Batch size
  • Weight initialization (there are heuristics for setting these, but they aren’t optimal)

Important Neural Network Hyperparameters

Architecture choices

  • Hidden layer types (fully connected, convolutional, etc.)
  • Number of hidden layers
  • Size of hidden layers
  • Activation functions (not as important)
  • Layer connectivity (skip-connections)
  • Normalization (layer norm, batch norm, etc.)

Important Neural Network Hyperparameters

Regularization choices

  • Weight decay strength
  • Dropout probability

Hyperparameter tuning example

Parallel coordinates chart of hyperparameter sweeps over learning rate, decay, momentum, and batch size, coloured by accuracy

Source: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters

Smarter hyperparameter tuning

  • We can try to fit another machine learning model to predict validation error given hyperparameter settings
    • Think of response surface methodology in Design of Experiments
  • Popular methods include Bayesian optimization using tree-structured Parzen estimators (TPE)

Optuna

Hyperparameter tuning in practice

  • Select the hyperparameters to tune, and try to be restrictive. If everyone in your field is using Adam as an optimizer, don’t search over different optimizers
  • For architectural parameters, tying them together throughout the whole architecture reduces the search space (e.g. the same layer size for multiple layers in a stack)
  • Define intervals of values to search over
    • Often, you want as small a range as possible, but what range is good is often task dependent, so this still needs human experience
  • Use random search as a baseline
    • It’s trivial to implement and use
  • If your framework has tools for this, use them
  • Optuna is a nice framework which isn’t tied to any specific ML framework (or to ML at all: you can use it to optimize the design of any experiment)

Discrete and continuous variables

  • We often differentiate between discrete and continuous variables
  • Discrete variables take their values from a finite set of values like {red, orange, yellow, green, blue, indigo, violet}
  • Continuous variables take their values from a range of real values, like (-\infty, \infty)

Discrete and continuous variables as inputs

  • As inputs, we often just standardize continuous variables (subtract the sample mean, then divide by the sample standard deviation)
  • Discrete variables require extra treatment. They are encoded in the input as integers
  • A special Embedding layer is used to map the integers to vectors
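Both kinds of input preparation can be sketched with NumPy. The example values are invented, and the embedding table is filled with random numbers here; in a real network it would be a learned parameter of the Embedding layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous input: standardize with statistics computed on the training data.
ages = np.array([23.0, 35.0, 47.0, 59.0])
mean, std = ages.mean(), ages.std(ddof=1)  # sample mean and sample standard deviation
ages_standardized = (ages - mean) / std

# Discrete input: map each category to an integer code.
colours = ["red", "green", "blue", "green"]
vocabulary = {name: code for code, name in enumerate(sorted(set(colours)))}
codes = np.array([vocabulary[c] for c in colours])

# Embedding: a table with one vector per category, indexed by the integer code.
embedding_dim = 3
embedding_table = rng.normal(size=(len(vocabulary), embedding_dim))
embedded = embedding_table[codes]

print(ages_standardized)
print(codes)           # [2 1 0 1]
print(embedded.shape)  # (4, 3)
```

The same `mean` and `std` should be reused for the development and test data, so that no information from those sets enters the preprocessing.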

Discrete and continuous variables as output

  • Discrete targets are easy, they often correspond to classification
    • Arbitrary discrete distributions are easy to model with the categorical distribution
  • Continuous variables are more difficult
    • Arbitrary continuous distributions are much more difficult to model
    • Linear outputs are traditionally used, but are very limited in what they can model

The conditional distribution

In supervised machine learning, we are interested in modelling P(Y|X): the probability distribution of our target variable Y conditioned on our specific input variables X.

Assumptions with linear outputs

  • Linear outputs trained with a squared-error loss implicitly model P(Y|X) as a Gaussian
  • In particular, it estimates the mean, E[Y|X]

Bishop, Christopher M. “Mixture density networks.” (1994).
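The link between a linear output and the Gaussian assumption can be made explicit. The short derivation below assumes the network is trained with squared error; the symbols f_\theta(x) for the network output and \sigma for a fixed noise standard deviation are notation introduced here, not taken from the slides.

```latex
p(y \mid x) = \mathcal{N}\!\left(y;\, f_\theta(x),\, \sigma^2\right)
\quad\Longrightarrow\quad
-\log p(y \mid x) = \frac{\left(y - f_\theta(x)\right)^2}{2\sigma^2}
                    + \frac{1}{2}\log\!\left(2\pi\sigma^2\right)

\arg\min_\theta \sum_{i} -\log p(y_i \mid x_i)
  = \arg\min_\theta \sum_{i} \left(y_i - f_\theta(x_i)\right)^2
```

With \sigma fixed, the second term is a constant, so maximizing the Gaussian likelihood is the same as minimizing squared error, and the minimizer of the expected squared error is the conditional mean E[Y|X].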

Mixture Density Networks

Gaussian Mixture Models (GMM) Explained, https://youtu.be/wT2yLNUfyoM?si=aYHnmxvH2wVhDjX6

Petrov, Tatjana, and Denis Repin. “Automated deep abstractions for stochastic chemical reaction networks.” arXiv preprint arXiv:2002.01889 (2020).

Assumptions with linear outputs

  • For many problems P(Y|X) is far from Gaussian
  • The mean is often a poor prediction
  • Fitting an MDN can be tricky because the distribution can have sharp spikes and skews
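To make the contrast concrete, the sketch below computes the negative log-likelihood that a mixture density network would minimize, assuming its output layer produces mixture logits, means and log standard deviations for each sample. The function name and the tiny bimodal example are illustrative assumptions.

```python
import numpy as np

def mixture_nll(y, logits, means, log_stds):
    """Mean negative log-likelihood of targets y (shape n) under per-sample
    Gaussian mixtures with K components (all parameter arrays have shape n x K)."""
    # Softmax over components, computed in log space for numerical stability.
    log_weights = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    z = (y[:, None] - means) / np.exp(log_stds)
    log_components = -0.5 * z ** 2 - log_stds - 0.5 * np.log(2.0 * np.pi)
    log_likelihood = np.logaddexp.reduce(log_weights + log_components, axis=1)
    return float(-np.mean(log_likelihood))

# Bimodal targets: for the same input, y is either -2 or 2.
y = np.array([-2.0, 2.0])

# One Gaussian centred on the mean (what a linear output effectively assumes).
single = mixture_nll(y, np.zeros((2, 1)), np.zeros((2, 1)), np.zeros((2, 1)))

# Two components, one per mode.
mixture = mixture_nll(y, np.zeros((2, 2)), np.tile([-2.0, 2.0], (2, 1)), np.zeros((2, 2)))

print(f"single Gaussian NLL: {single:.3f}, two-component mixture NLL: {mixture:.3f}")
```

The single Gaussian puts its mean at 0, where no data lies, which is why the mean is a poor prediction in this case.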

Four different conditional distributions on autoregressive tasks

Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. “Pixel recurrent neural networks.” arXiv preprint arXiv:1601.06759 (2016).

Discretization hack

  • We can discretize the continuous target into a finite set of values
    • Turn the regression task into a classification task
  • This is just like creating a histogram of the data

Discretization trick

  • A common strategy is to decide on the number of bins, then select the bin edges as the quantiles of the empirical distribution
  • Deciding on the number of bins is difficult. Look at histograms of your data.
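Quantile-based binning can be sketched as follows. The skewed toy targets, the choice of 10 bins and the helper names are assumptions made for the example.

```python
import numpy as np

def quantile_bin_edges(targets, n_bins):
    """Inner bin edges placed at evenly spaced quantiles of the empirical distribution."""
    inner_quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(targets, inner_quantiles)

def to_class_labels(targets, edges):
    """Map each continuous target to the index of the bin it falls into."""
    return np.searchsorted(edges, targets, side="right")

rng = np.random.default_rng(0)
y_train = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # skewed toy targets

edges = quantile_bin_edges(y_train, n_bins=10)
labels = to_class_labels(y_train, edges)
counts = np.bincount(labels, minlength=10)

print(np.round(edges, 2))
print(counts)
```

Quantile edges give roughly equally populated classes, so the classifier does not start out with a heavily imbalanced problem. The edges should be computed on the training targets only and then reused for other data.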

Thank you!

erik.ylipaa@scilifelab.se