Linear regression (\(\approx\) single linear neuron)
closed form solution
ANN (arbitrary number of neurons in layers)
closed form does not work
iterative optimization algorithm (=Learning)
Supervised Learning
Aim
Find optimal values of \(w_{\cdot,j}\) and \(b_j\) over all neurons \(j\)
Data
\(x\) = input
\(y\) = labels, i.e., the known output corresponding to \(x\)
(Recall: \(\hat{y}\) is the estimated output)
Tools
Cross-validation
Data
Training set
Validation set
Test set
Loss function
(equiv. Cost/Error Function)
“How good is the ANN’s estimate?”
Optimizers
“How can the ANN improve?”
Gradient descent
Back-propagation
Cross-validation (a reminder)
Split data into:
training set
for learning
use in gradient descent during learning
validation set
know when to stop learning, avoid overfitting
evaluate progress/convergence during learning
test set
quality control
evaluate final result after learning
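The three-way split can be sketched in Python (the 70/15/15 proportions and toy data are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 3 features each, with one label per sample.
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Shuffle indices, then split 70/15/15 into train/validation/test.
idx = rng.permutation(len(X))
train_idx, val_idx, test_idx = idx[:70], idx[70:85], idx[85:]

X_train, y_train = X[train_idx], y[train_idx]  # used in gradient descent
X_val,   y_val   = X[val_idx],   y[val_idx]    # monitor convergence / early stopping
X_test,  y_test  = X[test_idx],  y[test_idx]   # final quality control only
```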
Loss Function
Suppose we have
an ANN that, with input \(x\), produces an estimated output \(\hat{y}\)
training samples \(X=(x^{(1)},\ldots,x^{(K)})\) with labels (true output values) \(Y=(y^{(1)},\ldots,y^{(K)})\). Then the Quadratic Loss Function is defined as follows:
For each \(x\in X\), use the residual sum of squares, RSS, as an error measure \[L(w,b|x) = \frac{1}{2}\sum_i\left(y_i-\hat{y}_i\right)^2\]
The full quadratic loss function is then the Mean Squared Error (MSE), the same measure used in cross-validation \[L(w,b) = \frac{1}{K} \sum_{k=1}^K L(w,b|x^{(k)})\]
Loss functions for regression
Quadratic loss function/Mean square error or variants thereof
Loss functions for classification
(Categorical) Cross-entropy or variants thereof
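Both loss families can be sketched in NumPy; the cross-entropy variant below assumes one-hot labels and predicted class probabilities (an illustrative implementation, not from the slides):

```python
import numpy as np

def rss(y, y_hat):
    """Per-sample quadratic loss: L(w,b|x) = 1/2 * sum_i (y_i - y_hat_i)^2."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def mse(Y, Y_hat):
    """Full loss: mean of the per-sample losses over the K training samples."""
    return np.mean([rss(y, y_hat) for y, y_hat in zip(Y, Y_hat)])

def categorical_cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross-entropy for one-hot labels Y and predicted probabilities Y_hat.

    eps guards against log(0) for confident-but-wrong predictions.
    """
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))
```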
Gradient Descent
Optimization
Consider inverted hill-climbing in one dimension \(v\), i.e., we want to find the minimum instead of the maximum.
Hill-climbing
randomly choose direction and length to change \(v\)
stay if \(L(v|x)\) got lower, else go back.
We want to be smarter!
Gradient descent
compute the derivative \(\frac{dL(v|x)}{dv}\) to see which way down is
Take a reasonably long step (\(\eta\)) in that direction, \(v' = v-\eta\frac{dL(v|x)}{dv}\)
\(\eta\) is called the learning rate
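The one-dimensional update \(v' = v-\eta\frac{dL(v|x)}{dv}\) can be sketched on a toy loss; the loss \((v-3)^2\), the starting point, and the learning rate below are hypothetical choices for illustration:

```python
# One-dimensional gradient descent on a toy loss L(v) = (v - 3)^2,
# whose minimum lies at v = 3.

def dL_dv(v):
    return 2.0 * (v - 3.0)  # derivative of (v - 3)^2

eta = 0.1  # learning rate
v = 0.0    # arbitrary starting point
for _ in range(100):
    v = v - eta * dL_dv(v)  # step "downhill", opposite to the derivative
```

Each step shrinks the distance to the minimum by the factor \(1-2\eta\), so `v` converges to 3.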
Gradient Descent in higher dimensions
Same idea really, but we need a partial derivative for each dimension, which makes the notation look more complicated.
Consider a 2-dimensional case. We will treat each dimension separately
Find the partial derivatives for both dimensions \[\begin{pmatrix}
\frac{\partial L(v_1,v_2|x)}{\partial v_1}\\
\frac{\partial L(v_1,v_2|x)}{\partial v_2}
\end{pmatrix}\]
Take a reasonably long step \(\begin{eqnarray*} \begin{pmatrix} v'_1\\ v'_2\end{pmatrix} &=& \begin{pmatrix}v_1-\eta\frac{\partial L(v_1,v_2|x)}{\partial v_1} \\ v_2-\eta\frac{\partial L(v_1,v_2|x)}{\partial v_2} \end{pmatrix} \end{eqnarray*}\)
(A vector of partial derivatives is called a gradient)
(Figure: the same update on a more realistic, non-convex parameter space)
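The two-dimensional update can be sketched with NumPy; the toy loss \((v_1-1)^2 + (v_2+2)^2\), starting point, and learning rate are illustrative assumptions:

```python
import numpy as np

# Gradient descent in two dimensions on a toy loss
# L(v1, v2) = (v1 - 1)^2 + (v2 + 2)^2, with its minimum at (1, -2).

def grad(v):
    # The gradient: vector of partial derivatives, one per dimension.
    return np.array([2.0 * (v[0] - 1.0), 2.0 * (v[1] + 2.0)])

eta = 0.1
v = np.array([5.0, 5.0])   # arbitrary starting point
for _ in range(200):
    v = v - eta * grad(v)  # step against the gradient in both dimensions at once
```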
Gradient descent strategy
Algorithm
Initialize weights and biases randomly, e.g. \(\sim N(0, \sigma^2)\)
Loop for \(M\) epochs or until convergence:
In each epoch and for each weight \(w_{i,j}\) and each bias \(b_j\):
compute the partial derivative \(\frac{\partial L(w,b)}{\partial w_{i,j}}\) (resp. \(\frac{\partial L(w,b)}{\partial b_j}\))
update \(w_{i,j} \leftarrow w_{i,j} - \eta\frac{\partial L(w,b)}{\partial w_{i,j}}\) (and correspondingly for \(b_j\))
Doing this efficiently requires vector and matrix multiplication
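A minimal sketch of this algorithm for a single linear neuron \(\hat{y} = wx + b\) with quadratic loss; the data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data generated from a "true" line with w = 2, b = 0.5.
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.5

# Initialize weight and bias randomly, ~ N(0, sigma^2).
w = rng.normal(0, 0.1)
b = rng.normal(0, 0.1)
eta, M = 0.5, 500  # learning rate and number of epochs

for _ in range(M):  # loop for M epochs
    y_hat = w * x + b
    # Partial derivatives of the MSE loss L(w,b) = 1/K sum_k 1/2 (y - y_hat)^2.
    dw = np.mean((y_hat - y) * x)
    db = np.mean(y_hat - y)
    # Gradient-descent update for each parameter.
    w, b = w - eta * dw, b - eta * db
```

After training, `w` and `b` recover the generating values closely.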
Even more complex designs
Requires operations on multidimensional matrices
Tensors
Arrays (matrices) of arbitrary dimensions (ML def)
Tensor operations
multiplication, decomposition, …
produce new tensors
TensorFlow
The forward and backward passes are viewed as
“Tensors (e.g., layers) that flow through the network”
An additional twist is that tensors allow running all training samples, or chunks (batches) of them, through the network simultaneously
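The batching idea can be sketched with plain NumPy: one matrix multiplication pushes a whole batch through a layer at once (layer sizes and batch size below are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# A hypothetical layer mapping 3 inputs to 4 neurons:
# W is a 3x4 weight matrix, b a length-4 bias vector.
W = rng.normal(size=(3, 4))
b = rng.normal(size=4)

# A batch of 32 samples flows through the layer in a single
# matrix multiplication instead of 32 separate vector products.
X = rng.normal(size=(32, 3))
A = sigmoid(X @ W + b)  # shape (32, 4): one row of activations per sample
```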
Summary Learning
Loss function
Measures how close the current outputs, \(\hat{y}\), are to the labels, \(y\).
(Quadratic) Loss function
\[\begin{array}{rcll}
L(w,b|x) &=& \frac{1}{2}\sum_i\left(y_i-\hat{y}_i\right)^2&\textsf{Residual sum of squares (RSS)}\\
L(w,b) &=& \frac{1}{K}\sum_{k=1}^K L(w,b|x^{(k)})&\textsf{Mean square error (MSE)}
\end{array}\]
Gradient descent
“Clever hill-climbing” in several dimensions
Change all variables \(v\in (w,b)\) by taking a reasonable step (the learning rate) in opposite direction to the gradient \[\begin{equation}
v' = v-\eta \frac{\partial L(w,b|x)}{\partial v}
\end{equation}\]
Back propagation
Decomposition of gradients (allows storing and re-using results)
Efficient implementation using tensors
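The chain-rule decomposition can be worked by hand for a single sigmoid neuron \(\hat{y}=\sigma(wx+b)\) with per-sample quadratic loss; the input, label, and parameter values below are arbitrary illustrative numbers:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.5, 1.0    # one training sample and its label
w, b = 0.4, -0.2   # current parameter values

# Forward pass: store intermediate results for reuse in the backward pass.
z = w * x + b
y_hat = sigmoid(z)

# Backward pass: chain rule, each factor computed once and reused.
dL_dyhat = y_hat - y              # dL/dy_hat for L = 1/2 (y - y_hat)^2
dyhat_dz = y_hat * (1.0 - y_hat)  # sigma'(z), reuses the stored y_hat
dL_dz = dL_dyhat * dyhat_dz       # shared by both parameter gradients
dL_dw = dL_dz * x                 # since dz/dw = x
dL_db = dL_dz                     # since dz/db = 1
```

Note how `dL_dz` is computed once and reused for both `dL_dw` and `dL_db`; in a deep network the same pattern propagates layer by layer.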
Activation functions revisited
Perceptron – step activation
Pros
Clear classification (0/1)
Why did the perceptron “fail”?
1 layer \(\Rightarrow\) linear classification
Not meaningfully differentiable
a requirement for multilayer ANN
Activation functions revisited
Why not use the linear function
Pros:
continuous output
better output “resolution”
Cons:
Not really “meaningfully” differentiable
Multilayer linear ANN collapses into a single linear model
However, used in the output layer for regression problems!
Activation functions revisited
Sigmoid activation function
Meaningfully differentiable
Intermediate between step and linear
True for most activation functions
Balance between pros and cons
Activation functions revisited
ReLU activation function
Meaningfully differentiable
(A different) intermediate between step and linear
True for most activation functions
Balance between pros and cons
Activation functions summary
Meaningfully differentiable is important
Often needs to balance pros and cons
Two main families (+ special cases)
Sigmoid (logistic) family
Examples
Sigmoid
Tanh
ReLU family
Examples
ReLU
Leaky ReLU
PReLU
Special uses (e.g., output layers)
Examples
SoftMax (classification)
Linear (regression)
(More about pros and cons of different activation functions in a later lecture)
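The activation functions above can be sketched in NumPy (minimal implementations for illustration; the leaky-ReLU slope 0.01 is a common but arbitrary default):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid family, output in (0, 1)

def tanh(z):
    return np.tanh(z)                     # sigmoid family, output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)             # ReLU family: max(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope alpha for negative inputs

def softmax(z):
    e = np.exp(z - np.max(z))             # shift by max(z) for numerical stability
    return e / e.sum()                    # probabilities summing to 1 (classification output)
```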