Practical Advice

Generalization, hyperparameters, and practical deep learning

Erik Ylipää – AIDA Data Hub / Linköping University / NBIS

08-May-2026

Generalization

The goal of supervised machine learning is to minimize the (unknown) generalization error.

How to estimate generalization error

Set aside some test data for estimating the generalization error
For the estimate to be meaningful, no learning and no modelling decisions may be based on this dataset

Training set

Use the training set to develop learning algorithms

Test set

At the very end, use the test set to estimate generalization error

How generalization fails

Underfitting

Just right

Overfitting

Model Capacity

We refer to a model’s ability to adapt to data as its capacity
Models with low capacity are limited in the datasets they can fit well
Models with high capacity can fit arbitrary datasets well

Underfitting

The adjustable function is inflexible, its capacity is too low
It misses the important variation in the data
It performs equally poorly on all data, training and otherwise
Both training and generalization error are high

Overfitting

The adjustable function is too flexible, its capacity is too high
It adjusts to the unimportant variations in the data, like noise
Outside of the training data it performs poorly
- Training error is low, generalization error is high

Just right

The adjustable function is just flexible enough, it has enough capacity
It adjusts to the variations in the data which are general
It would give good predictions for new data
Training error and generalization error follow each other
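
The three regimes can be reproduced on a toy problem. The following is a minimal sketch, assuming polynomial regression as the adjustable function and the polynomial degree as its capacity; the data-generating function, noise level and degrees are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth function (hypothetical toy task)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_new, y_new = make_data(1000)  # stands in for unseen data

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):  # low, moderate and high capacity
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_new, y_new))
    print(f"degree {degree}: train {errors[degree][0]:.3f}, new data {errors[degree][1]:.3f}")
```

The training error can only go down as the degree grows, while the error on new data typically goes down and then up again.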

To generalize, control capacity

A lot of practical machine learning is about controlling the capacity of the chosen model
Regularization refers to methods aimed at reducing generalization error, often by reducing the capacity of the model
Regularization is often controlled by hyperparameters

Hyperparameters

Parameters which are determined before learning are called hyperparameters
They often control model capacity or the speed of learning (which implicitly affects model capacity)
- The number of free parameters directly controls capacity, so it is considered a hyperparameter

Few free parameters — low capacity

Lots of free parameters — high capacity

Hyperparameters

Hyperparameters which control capacity (such as the size of neural network layers) cannot be learned while minimizing the training error
- Higher capacity will lead to lower training error
But we are not allowed to search for hyperparameters using the test error
- If we do, the test set loses its purpose

Hyperparameter optimization

We want to find the hyperparameters which lead to the lowest generalization error
- We can’t use the training set for this!
- We can’t use the test set for this!

Development set

The solution is to further split the training set
- We hold out parts of the training set and estimate the influence of hyperparameters on this held-out data
- This dataset is often referred to as the validation set, but recently the term development set is being used (dev set for short)

Training set

Use the training set to adjust the function to minimize training error

Development set

Use the development set to adjust hyperparameters to minimize the development-set (validation) error

Test set

At the very end, use the test set to estimate generalization error
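
A minimal sketch of such a three-way split, assuming the samples can be shuffled freely (no grouping by patient, time, etc.). The function name and the 70/15/15 proportions are illustrative assumptions.

```python
import numpy as np

def three_way_split(n_samples, dev_fraction=0.15, test_fraction=0.15, seed=0):
    """Return disjoint index arrays (train_idx, dev_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    n_dev = int(round(n_samples * dev_fraction))
    test_idx = shuffled[:n_test]                 # touched only at the very end
    dev_idx = shuffled[n_test:n_test + n_dev]    # used for hyperparameter decisions
    train_idx = shuffled[n_test + n_dev:]        # used to fit the model
    return train_idx, dev_idx, test_idx

train_idx, dev_idx, test_idx = three_way_split(n_samples=200)
print(len(train_idx), len(dev_idx), len(test_idx))
```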

Tuning hyperparameters

We can plot the dev-set error as a function of model capacity (e.g. by varying a capacity-controlling hyperparameter)
We look for the minimum of these error curves to find good hyperparameter settings
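
As a minimal sketch of this procedure, assume polynomial degree plays the role of the capacity-controlling hyperparameter on a toy regression task (both are illustrative assumptions). The dev-set error is computed for each setting and the minimum is selected.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples of a smooth function (hypothetical toy task)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(40)
x_dev, y_dev = make_data(40)

degrees = list(range(1, 11))  # candidate capacities
dev_errors = []
for degree in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # fit on training data only
    dev_errors.append(float(np.mean((np.polyval(coeffs, x_dev) - y_dev) ** 2)))

best_degree = degrees[int(np.argmin(dev_errors))]
print("dev error per degree:", np.round(dev_errors, 3))
print("selected degree:", best_degree)
```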

Small datasets — high estimator variance

If the amount of data is small, there’s a risk that the test or development sets become non-representative
- Similar to having too small sample sizes in traditional statistical studies
They then give poor estimates of the generalization error

Cross-validation

A useful approach is to run multiple trainings, each with a different test set. This is called cross-validation.

Source: https://mlfromscratch.com/nested-cross-validation

Cross-validation

The average error over all splits is a better estimate of the generalization error than using a single split.

Source: https://mlfromscratch.com/nested-cross-validation
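
A minimal sketch of k-fold cross-validation in NumPy. The helper names `k_fold_indices` and `cross_validated_error` are hypothetical, and the straight-line fit at the end is only a stand-in for a real training procedure.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, held_out_idx) pairs; each sample is held out exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def cross_validated_error(x, y, fit, error, k=5):
    """Train k times and average the error on the held-out folds."""
    scores = []
    for train_idx, held_out_idx in k_fold_indices(len(x), k):
        model = fit(x[train_idx], y[train_idx])
        scores.append(error(model, x[held_out_idx], y[held_out_idx]))
    return float(np.mean(scores)), scores

# Toy usage: a straight-line fit on synthetic data.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + rng.normal(scale=0.1, size=50)
fit = lambda xs, ys: np.polyfit(xs, ys, deg=1)
error = lambda coeffs, xs, ys: float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

mean_error, fold_errors = cross_validated_error(x, y, fit, error, k=5)
print(f"mean held-out error {mean_error:.4f} over {len(fold_errors)} folds")
```

The spread of `fold_errors` also gives a rough indication of how much a single split could have misled you.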

Nested cross-validation

For deep neural networks, it’s often a good idea to use nested cross-validation whenever you would otherwise need cross-validation
First use cross-validation to select multiple test sets. Then, for each of the corresponding training sets, do further cross-validation with different development sets

Source: https://mlfromscratch.com/nested-cross-validation-python-code
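
A minimal sketch of nested cross-validation, assuming polynomial degree as the only hyperparameter and a polynomial fit as a stand-in for network training. All function names here are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, rng):
    """Yield (train_idx, held_out_idx) pairs for a random k-fold partition."""
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        yield np.concatenate([folds[j] for j in range(k) if j != i]), folds[i]

def fit(x, y, degree):
    return np.polyfit(x, y, deg=degree)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def nested_cv(x, y, degrees, outer_k=5, inner_k=4, seed=0):
    rng = np.random.default_rng(seed)
    outer_errors, chosen = [], []
    for trainval_idx, test_idx in k_fold_indices(len(x), outer_k, rng):
        x_tv, y_tv = x[trainval_idx], y[trainval_idx]
        # Inner loop: choose the hyperparameter without touching the test fold.
        inner_splits = list(k_fold_indices(len(x_tv), inner_k, rng))
        dev_error = {
            d: np.mean([mse(fit(x_tv[tr], y_tv[tr], d), x_tv[dev], y_tv[dev])
                        for tr, dev in inner_splits])
            for d in degrees
        }
        best = min(dev_error, key=dev_error.get)
        chosen.append(best)
        # Refit on the whole outer training portion, then score on the test fold.
        outer_errors.append(mse(fit(x_tv, y_tv, best), x[test_idx], y[test_idx]))
    return float(np.mean(outer_errors)), chosen

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=60)
degrees = (1, 3, 5, 7)
estimate, chosen_degrees = nested_cv(x, y, degrees)
print(f"estimated generalization error {estimate:.3f}, chosen degrees {chosen_degrees}")
```

Note that the outer folds may select different hyperparameters: the resulting number estimates the error of the whole procedure, not of one particular model.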

Practical steps to build generalizing models

During development, use a validation set to estimate how hyperparameters impact generalization error
- Make a habit of calling this a development set, your non-ML colleagues will be much less confused
After development, use the test set to estimate your final generalization error
- You do this for your own sake, to understand what you can expect of your model
- Do not select models based on test performance
If you have small amounts of data, perform cross-validation to get a better generalization estimate of your method (not model)

Do not select the model based on test performance

The cardinal sin of statistical learning is to use the test set for any model decision
Still, the ML research field often chooses which models to develop further based on which have published the best score on an open test set
- As a community, we tend to overfit to benchmark test sets: not at the level of individual articles, but collectively over time
How would you solve this problem?

Overfitting in neural networks

Source: https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks

Training neural networks in practice

Understanding learning curves is central to training networks in practice
They show you if or when your model underfits or overfits
Use early stopping to find the best point at which to stop training your model
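
A minimal sketch of early stopping with a patience counter. It assumes a linear model on polynomial features trained by plain gradient descent as a stand-in for a neural network; the learning rate, patience and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Toy regression data with degree-9 polynomial features (an assumption)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return np.vander(x, 10, increasing=True), y

A_train, y_train = make_data(15)
A_dev, y_dev = make_data(200)

def mse(w, A, y):
    return float(np.mean((A @ w - y) ** 2))

w = np.zeros(A_train.shape[1])
learning_rate = 0.05
patience = 200  # stop after this many steps without dev-set improvement
best_dev_error, best_w, best_step = np.inf, w.copy(), -1
history = []    # (train error, dev error) per step: the learning curves

for step in range(20_000):
    gradient = 2.0 * A_train.T @ (A_train @ w - y_train) / len(y_train)
    w = w - learning_rate * gradient
    train_error, dev_error = mse(w, A_train, y_train), mse(w, A_dev, y_dev)
    history.append((train_error, dev_error))
    if dev_error < best_dev_error:
        # Keep a copy of the best weights seen so far.
        best_dev_error, best_w, best_step = dev_error, w.copy(), step
    elif step - best_step >= patience:
        break

print(f"ran {len(history)} steps; best dev error {best_dev_error:.3f} at step {best_step}")
```

The model you keep is `best_w`, not the weights from the final step.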

Hyperparameters in deep learning

How do you decide on hyperparameters?

Grad Student Descent — someone manually tries out a bunch.

Hyperparameters

Parameters which are determined before learning are called hyperparameters
Learning rate is one of the most important hyperparameters in deep learning
Finding a good learning rate is an example of hyperparameter tuning

Hyperparameter tuning

Tuning hyperparameters is one of the central activities in practical deep learning
We tune them by looking at the error on a validation set
If the amount of data is small, use cross-validation

Basic Hyperparameter tuning

1. Select values for the hyperparameters
2. Train a network using these hyperparameters
3. Record the validation error
4. Repeat steps 1 to 3 a preset number of times
5. Select the hyperparameters with the lowest validation error
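
The loop above can be sketched as follows. This is a minimal illustration, assuming closed-form ridge regression on polynomial features as a cheap stand-in for training a network, with weight decay as the single hyperparameter; the search range is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Toy regression data with degree-9 polynomial features (an assumption)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return np.vander(x, 10, increasing=True), y

A_train, y_train = make_data(30)
A_dev, y_dev = make_data(30)

def train(weight_decay):
    """Ridge regression in closed form; stands in for training a network."""
    n_features = A_train.shape[1]
    return np.linalg.solve(A_train.T @ A_train + weight_decay * np.eye(n_features),
                           A_train.T @ y_train)

def validation_error(w):
    return float(np.mean((A_dev @ w - y_dev) ** 2))

n_trials = 20
trials = []
for _ in range(n_trials):
    weight_decay = 10.0 ** rng.uniform(-6, 1)           # select a value (log-uniform)
    w = train(weight_decay)                             # train with it
    trials.append((validation_error(w), weight_decay))  # record validation error

best_error, best_weight_decay = min(trials)             # lowest validation error wins
print(f"best weight decay {best_weight_decay:.2e}, validation error {best_error:.3f}")
```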

Important Neural Network Hyperparameters

Optimization choices

Learning rate, often the most important hyperparameter
Learning rate schedule
Which optimization algorithm to choose
Metaparameters for optimization (e.g. momentum parameters)
Batch size
Weight initialization (there are heuristics for setting these, but they aren’t optimal)

Important Neural Network Hyperparameters

Architecture choices

Hidden layer types (fully connected, convolutional, etc.)
Number of hidden layers
Size of hidden layers
Activation functions (not as important)
Layer connectivity (skip-connections)
Normalization (layer norm, batch norm, etc.)

Important Neural Network Hyperparameters

Regularization choices

Weight decay strength
Dropout probability

Hyperparameter tuning example

Parallel coordinates chart of hyperparameter sweeps over learning rate, decay, momentum, and batch size, coloured by accuracy

Source: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters

Start with random hyperparameter search

Grid search tests all combinations of values in nested loops
- A lot of time might be spent testing different values for unimportant hyperparameters without getting new information about the important ones

Bergstra, James, and Yoshua Bengio. “Random search for hyper-parameter optimization.” Journal of Machine Learning Research 13.Feb (2012): 281-305.
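
A small sketch of why this matters, assuming two hyperparameters (learning rate and dropout, chosen for illustration) and a budget of nine trials. The grid spends the budget on only three distinct learning rates, while random search tries a new learning rate in every trial.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
budget = 9

# Grid search: 3 x 3 combinations (values are illustrative assumptions).
grid_learning_rates = [1e-4, 1e-3, 1e-2]
grid_dropouts = [0.0, 0.25, 0.5]
grid_trials = list(itertools.product(grid_learning_rates, grid_dropouts))

# Random search: same budget, every trial draws fresh values.
random_trials = [(10.0 ** rng.uniform(-4, -2), rng.uniform(0.0, 0.5))
                 for _ in range(budget)]

distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})
print("distinct learning rates tested:", distinct_lr_grid, "(grid) vs",
      distinct_lr_random, "(random)")
```

If dropout turns out not to matter, the grid has effectively run only three informative experiments.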

Smarter hyperparameter tuning

We can try to fit another machine learning model to predict validation error given hyperparameter settings
- Think of response surface methodology in Design of Experiments
Popular methods include Bayesian optimization using Tree-structured Parzen Estimators (TPE)

Optuna

Hyperparameter tuning in practice

Select the hyperparameters to tune — try to be restrictive. If everyone in your field is using Adam as an optimizer, don’t search over different optimizers
For architectural parameters, tying them together throughout the whole architecture reduces the search space (e.g. the same layer size for multiple layers in a stack)
Define intervals of values to search over
- Often, you want as small a range as possible, but which range is good is often task dependent and still requires human experience
Use random search as a baseline
- It’s trivial to implement and use
If your framework has tools for this, use them
Optuna is a nice framework which isn’t tied to any specific ML framework (or to ML at all: you can use it to optimize the design of any experiment)

Discrete and continuous variables

Discrete and Continuous variables

We often differentiate between discrete and continuous variables
Discrete variables take their values from a finite set, like {red, orange, yellow, green, blue, indigo, violet}
Continuous variables take their values from a range of real values, like (-\infty, \infty)

Discrete and Continuous variables as inputs

As inputs, we often just standardize continuous variables (subtract the sample mean, then divide by the sample standard deviation)
Discrete variables require extra treatment: they are encoded in the input as integers
A special Embedding layer is used to map the integers to vectors
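
A minimal NumPy sketch of both input treatments. The example values are made up, and the embedding table is random here; in a real network it would be a learnable weight matrix (for example the weights of an embedding layer in your framework).

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous input: standardize with statistics computed on the training set.
ages = np.array([23.0, 35.0, 41.0, 58.0, 62.0])  # made-up values
mean, std = ages.mean(), ages.std(ddof=1)        # sample mean and sample standard deviation
ages_standardized = (ages - mean) / std

# Discrete input: encode each category as an integer ...
colours = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
colour_to_index = {name: i for i, name in enumerate(colours)}
batch = np.array([colour_to_index[c] for c in ["green", "red", "green"]])

# ... and look up one vector per integer in an embedding table.
embedding_dim = 4
embedding_table = rng.normal(size=(len(colours), embedding_dim))  # random stand-in
embedded = embedding_table[batch]                                 # shape (3, 4)
print(ages_standardized.round(2), embedded.shape)
```

The same mean and standard deviation must be reused for the development and test data, otherwise information leaks between the sets.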

Discrete and continuous variables as output

Discrete targets are easy, they often correspond to classification
- Arbitrary discrete distributions are easy to model with the categorical distribution
Continuous variables are more difficult
- Arbitrary continuous distributions are much more difficult to model
- Linear outputs are traditionally used, but are very limited in what they can model

The conditional distribution

In supervised machine learning, we are interested in modelling P(Y|X): the probability distribution of our target variable Y conditioned on our specific input variables X.

Assumptions with linear outputs

Linear outputs trained with a squared-error loss implicitly model P(Y|X) as a Gaussian
In particular, they estimate the mean, E[Y|X]
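
To see where the Gaussian assumption comes from, assume the network output is f_\theta(x), the loss is squared error, and the variance \sigma^2 is fixed (these symbols are introduced here for illustration). The negative log-likelihood of a Gaussian centred on the network output is then:

```latex
p(y \mid x) = \mathcal{N}\!\left(y;\, f_\theta(x),\, \sigma^2\right)
\quad\Longrightarrow\quad
-\log p(y \mid x) = \frac{\left(y - f_\theta(x)\right)^2}{2\sigma^2}
                  + \frac{1}{2}\log\!\left(2\pi\sigma^2\right)
```

The second term does not depend on \theta, so minimizing this negative log-likelihood is the same as minimizing the squared error, and the function that minimizes the expected squared error is the conditional mean E[Y|X].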

Bishop, Christopher M. “Mixture density networks.” (1994).

Mixture Density Networks

Gaussian Mixture Models (GMM) Explained, https://youtu.be/wT2yLNUfyoM?si=aYHnmxvH2wVhDjX6

Petrov, Tatjana, and Denis Repin. “Automated deep abstractions for stochastic chemical reaction networks.” arXiv preprint arXiv:2002.01889 (2020).

Assumptions with linear outputs

For many problems P(Y|X) is far from Gaussian
The mean is often a poor prediction
Fitting an MDN can be tricky because the distribution has weird spikes and skews
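
A minimal sketch of the loss a mixture density network minimizes, assuming a one-dimensional target and Gaussian components. In an MDN the three parameter arrays would be produced by the network for each input; here they are given directly, and the function name is hypothetical.

```python
import numpy as np

def mixture_nll(y, logits, means, log_stds):
    """Mean negative log-likelihood of y under per-sample 1-D Gaussian mixtures.

    y has shape (batch,); logits, means and log_stds have shape (batch, n_components).
    """
    # Log mixture weights via a numerically stable log-softmax.
    log_weights = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    z = (y[:, None] - means) / np.exp(log_stds)
    log_component = -0.5 * z ** 2 - log_stds - 0.5 * np.log(2.0 * np.pi)
    # Log-sum-exp over components gives the log-density of the mixture.
    log_likelihood = np.logaddexp.reduce(log_weights + log_component, axis=1)
    return float(-np.mean(log_likelihood))

# Bimodal toy target: a single Gaussian at the mean versus a two-component mixture.
y = np.array([-1.0, 1.0])
single = mixture_nll(y, np.zeros((2, 1)), np.zeros((2, 1)), np.zeros((2, 1)))
mixture = mixture_nll(y, np.zeros((2, 2)), np.tile([-1.0, 1.0], (2, 1)),
                      np.full((2, 2), np.log(0.1)))
print(f"single Gaussian NLL {single:.3f}, mixture NLL {mixture:.3f}")
```

Working with log standard deviations and log-sum-exp avoids some of the numerical problems, but narrow components can still make the loss very spiky.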

Four different conditional distributions on autoregressive tasks

Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. “Pixel recurrent neural networks.” arXiv preprint arXiv:1601.06759 (2016).

Discretization hack

We can discretize the continuous target into discrete values
- Turn the regression task into a classification task
This is just like creating a histogram of the data

Discretization trick

A common strategy is to decide on the number of bins, then select the bin edges as the quantiles of the empirical distribution
Deciding on the number of bins is difficult. Look at histograms of your data.
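
A minimal sketch of quantile binning with NumPy, assuming a skewed toy target and ten bins (both illustrative choices). Using the bin median to map a predicted class back to a value is one option among several, not something prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # skewed continuous target (toy data)

n_bins = 10
# Interior bin edges at the 10%, 20%, ..., 90% quantiles of the empirical distribution.
edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
labels = np.digitize(y, edges)                     # integer class labels 0 .. n_bins - 1
counts = np.bincount(labels, minlength=n_bins)     # roughly equal by construction

# One way to turn a predicted class back into a value: the median of that bin.
bin_medians = np.array([np.median(y[labels == b]) for b in range(n_bins)])
print(counts, bin_medians.round(2))
```

Quantile edges give every class about the same number of examples, at the cost of very wide bins in the tails of the distribution.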

Thank you!

erik.ylipaa@scilifelab.se