Linear regression (\(\approx\) single linear neuron)
closed form solution
ANN (arbitrary number of neurons in layers)
closed form does not work
iterative optimization algorithm (=Learning)
Supervised Learning
Aim
Find optimal values of \(w_{\cdot,j}\) and \(b_j\) over all neurons \(j\)
Data
\(x\) = input
\(y\) = labels, i.e., the known output corresponding to \(x\)
(Recall: \(\hat{y}\) is the estimated output)
Tools
Cross-validation
Data
Training set
Validation set
Test set
Loss function
(equiv. Cost/Error Function)
“How good is the ANN’s estimate?”
Optimizers
“How can the ANN improve?”
Gradient descent
Back-propagation
Cross-validation (a reminder)
Split data into:
training set
for learning
use in gradient descent during learning
validation set
know when to stop learning, avoid overfitting
evaluate progress/convergence during learning
test set
quality control
evaluate final result after learning
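The three-way split can be sketched in Python (the 70/15/15 proportions and toy data are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 3 features each, with one label per sample.
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Shuffle indices, then split 70/15/15 into train/validation/test.
idx = rng.permutation(len(X))
train_idx, val_idx, test_idx = idx[:70], idx[70:85], idx[85:]

X_train, y_train = X[train_idx], y[train_idx]  # used in gradient descent
X_val,   y_val   = X[val_idx],   y[val_idx]    # monitor convergence / early stopping
X_test,  y_test  = X[test_idx],  y[test_idx]   # final quality control only
```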
Loss Function
Suppose we have
an ANN that, with input \(x\), produces an estimated output \(\hat{y}\)
training samples \(X=(x^{(1)},\ldots,x^{(K)})\) with labels (true output values) \(Y=(y^{(1)},\ldots,y^{(K)})\). Then the Quadratic Loss Function is defined as follows:
For each \(x\in X\), use the residual sum of squares, RSS, as an error measure \[L(w,b|x) = \frac{1}{2}\sum_i\left(y_i-\hat{y}_i\right)^2\]
The full quadratic loss function is then the Mean Squared Error (MSE), the same measure used in cross-validation \[L(w,b) = \frac{1}{K} \sum_{k=1}^K L(w,b|x^{(k)})\]
Loss functions for regression
Quadratic loss function/Mean square error or variants thereof
Loss functions for classification
(Categorical) Cross-entropy or variants thereof
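Both loss families can be sketched in NumPy; the cross-entropy variant below assumes one-hot labels and predicted class probabilities (an illustrative implementation, not from the slides):

```python
import numpy as np

def rss(y, y_hat):
    """Per-sample quadratic loss: L(w,b|x) = 1/2 * sum_i (y_i - y_hat_i)^2."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def mse(Y, Y_hat):
    """Full loss: mean of the per-sample losses over the K training samples."""
    return np.mean([rss(y, y_hat) for y, y_hat in zip(Y, Y_hat)])

def categorical_cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross-entropy for one-hot labels Y and predicted probabilities Y_hat.

    eps guards against log(0) for confident-but-wrong predictions.
    """
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))
```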
Gradient Descent
Optimization
Consider inverted hill-climbing in one dimension \(v\), i.e., we want to find the minimum instead of the maximum.
Hill-climbing
randomly choose direction and length to change \(v\)
stay if \(L(v|x)\) got lower, else go back.
We want to be smarter!
Gradient descent
compute the derivative \(\frac{dL(v|x)}{dv}\) to see which way down is
Take a reasonably long step (\(\eta\)) in that direction, \(v' = v-\eta\frac{dL(v|x)}{dv}\)
\(\eta\) is called the learning rate
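The one-dimensional update \(v' = v-\eta\frac{dL(v|x)}{dv}\) can be sketched on a toy loss; the loss \((v-3)^2\), the starting point, and the learning rate below are hypothetical choices for illustration:

```python
# One-dimensional gradient descent on a toy loss L(v) = (v - 3)^2,
# whose minimum lies at v = 3.

def dL_dv(v):
    return 2.0 * (v - 3.0)  # derivative of (v - 3)^2

eta = 0.1  # learning rate
v = 0.0    # arbitrary starting point
for _ in range(100):
    v = v - eta * dL_dv(v)  # step "downhill", opposite to the derivative
```

Each step shrinks the distance to the minimum by the factor \(1-2\eta\), so `v` converges to 3.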
Gradient Descent in higher dimensions
Same idea really, but we need a partial derivative for each dimension, which makes the notation look more complicated.
Consider a 2-dimensional case. We will treat each dimension separately
Find the partial derivatives for both dimensions \[\begin{pmatrix}
\frac{\partial L(v_1,v_2|x)}{\partial v_1}\\
\frac{\partial L(v_1,v_2|x)}{\partial v_2}
\end{pmatrix}\]
Take a reasonably long step \(\begin{eqnarray*} \begin{pmatrix} v'_1\\ v'_2\end{pmatrix} &=& \begin{pmatrix}v_1-\eta\frac{\partial L(v_1,v_2|x)}{\partial v_1} \\ v_2-\eta\frac{\partial L(v_1,v_2|x)}{\partial v_2} \end{pmatrix} \end{eqnarray*}\)
(A vector of partial derivatives is called a gradient)
(Figure: the same update on a more realistic, non-convex parameter space)
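The two-dimensional update can be sketched with NumPy; the toy loss \((v_1-1)^2 + (v_2+2)^2\), starting point, and learning rate are illustrative assumptions:

```python
import numpy as np

# Gradient descent in two dimensions on a toy loss
# L(v1, v2) = (v1 - 1)^2 + (v2 + 2)^2, with its minimum at (1, -2).

def grad(v):
    # The gradient: vector of partial derivatives, one per dimension.
    return np.array([2.0 * (v[0] - 1.0), 2.0 * (v[1] + 2.0)])

eta = 0.1
v = np.array([5.0, 5.0])   # arbitrary starting point
for _ in range(200):
    v = v - eta * grad(v)  # step against the gradient in both dimensions at once
```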
Gradient descent strategy
Algorithm
Initialize weights and biases randomly, e.g. \(\sim N(0, \sigma^2)\)
Loop for \(M\) epochs or until convergence:
In each epoch and for each weight \(w_{i,j}\) and each bias \(b_j\):
compute the partial derivative \(\frac{\partial L(w,b)}{\partial w_{i,j}}\) (resp. \(\frac{\partial L(w,b)}{\partial b_j}\))
update \(w_{i,j} \leftarrow w_{i,j} - \eta\frac{\partial L(w,b)}{\partial w_{i,j}}\) (and correspondingly for \(b_j\))
Doing this efficiently requires vector and matrix multiplication
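A minimal sketch of this algorithm for a single linear neuron \(\hat{y} = wx + b\) with quadratic loss; the data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data generated from a "true" line with w = 2, b = 0.5.
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.5

# Initialize weight and bias randomly, ~ N(0, sigma^2).
w = rng.normal(0, 0.1)
b = rng.normal(0, 0.1)
eta, M = 0.5, 500  # learning rate and number of epochs

for _ in range(M):  # loop for M epochs
    y_hat = w * x + b
    # Partial derivatives of the MSE loss L(w,b) = 1/K sum_k 1/2 (y - y_hat)^2.
    dw = np.mean((y_hat - y) * x)
    db = np.mean(y_hat - y)
    # Gradient-descent update for each parameter.
    w, b = w - eta * dw, b - eta * db
```

After training, `w` and `b` recover the generating values closely.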
Even more complex designs
Requires operations on multidimensional matrices
Tensors
Arrays (matrices) of arbitrary dimensions (ML def)
Tensor operations
multiplication, decomposition, …
produce new tensors
TensorFlow
The forward and backward passes are viewed as
“Tensors (e.g., layers) that flow through the network”
An additional twist is that tensors allow running all training samples, or chunks (batches) of them, through the network simultaneously
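The batching idea can be sketched with plain NumPy: one matrix multiplication pushes a whole batch through a layer at once (layer sizes and batch size below are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# A hypothetical layer mapping 3 inputs to 4 neurons:
# W is a 3x4 weight matrix, b a length-4 bias vector.
W = rng.normal(size=(3, 4))
b = rng.normal(size=4)

# A batch of 32 samples flows through the layer in a single
# matrix multiplication instead of 32 separate vector products.
X = rng.normal(size=(32, 3))
A = sigmoid(X @ W + b)  # shape (32, 4): one row of activations per sample
```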
Summary Learning
Loss function
Measures how close the current outputs, \(\hat{y}\), are to the labels, \(y\).
(Quadratic) Loss function
\[\begin{array}{rcll}
L(w,b|x) &=& \frac{1}{2}\sum_i\left(y_i-\hat{y}_i\right)^2&\textsf{Residual sum of squares (RSS)}\\
L(w,b) &=& \frac{1}{K}\sum_{k=1}^K L(w,b|x^{(k)})&\textsf{Mean square error (MSE)}
\end{array}\]
Gradient descent
“Clever hill-climbing” in several dimensions
Change all variables \(v\in (w,b)\) by taking a reasonable step (the learning rate) in opposite direction to the gradient \[\begin{equation}
v' = v-\eta \frac{\partial L(w,b|x)}{\partial v}
\end{equation}\]
Back propagation
Decomposition of gradients (allows storing and re-using results)
Efficient implementation using tensors
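The chain-rule decomposition can be worked by hand for a single sigmoid neuron \(\hat{y}=\sigma(wx+b)\) with per-sample quadratic loss; the input, label, and parameter values below are arbitrary illustrative numbers:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.5, 1.0    # one training sample and its label
w, b = 0.4, -0.2   # current parameter values

# Forward pass: store intermediate results for reuse in the backward pass.
z = w * x + b
y_hat = sigmoid(z)

# Backward pass: chain rule, each factor computed once and reused.
dL_dyhat = y_hat - y              # dL/dy_hat for L = 1/2 (y - y_hat)^2
dyhat_dz = y_hat * (1.0 - y_hat)  # sigma'(z), reuses the stored y_hat
dL_dz = dL_dyhat * dyhat_dz       # shared by both parameter gradients
dL_dw = dL_dz * x                 # since dz/dw = x
dL_db = dL_dz                     # since dz/db = 1
```

Note how `dL_dz` is computed once and reused for both `dL_dw` and `dL_db`; in a deep network the same pattern propagates layer by layer.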
Activation functions revisited
Perceptron – step activation
Pros
Clear classification (0/1)
Why did the perceptron “fail”?
1 layer \(\Rightarrow\) linear classification
Not meaningfully differentiable
a requirement for multilayer ANN
Activation functions revisited
Why not use the linear function
Pros:
continuous output
better output “resolution”
Cons:
Not really “meaningfully” differentiable
Multilayer linear ANN collapses into a single linear model
However, used in the output layer for regression problems!
Activation functions revisited
Sigmoid activation function
Meaningfully differentiable
Intermediate between step and linear
True for most activation functions
Balance between pros and cons
Activation functions revisited
ReLU activation function
Meaningfully differentiable
(A different) intermediate between step and linear
True for most activation functions
Balance between pros and cons
Activation functions summary
Meaningfully differentiable is important
Often needs to balance pros and cons
Two main families (+ special cases)
Sigmoid (logistic) family
Examples
Sigmoid
Tanh
ReLU family
Examples
ReLU
Leaky ReLU
PReLU
Special uses (e.g., output layers)
Examples
SoftMax (classification)
Linear (regression)
(More about pros and cons of different activation functions in a later lecture)
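The activation functions above can be sketched in NumPy (minimal implementations for illustration; the leaky-ReLU slope 0.01 is a common but arbitrary default):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid family, output in (0, 1)

def tanh(z):
    return np.tanh(z)                     # sigmoid family, output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)             # ReLU family: max(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope alpha for negative inputs

def softmax(z):
    e = np.exp(z - np.max(z))             # shift by max(z) for numerical stability
    return e / e.sum()                    # probabilities summing to 1 (classification output)
```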