ANN Building Blocks part 2

Bengt Sennblad, NBIS

Learning

Estimating parameters (\(w\) and \(b\))

Linear regression (\(\approx\) single linear neuron)

  • closed form solution









ANN (arbitrary number of neurons in layers)

  • closed-form solution does not work
  • instead: iterative optimization algorithm (= learning)


Supervised Learning

Aim

Find optimal values of \(w_{\cdot,j}\) and \(b_j\) over all neurons \(j\)


Data

  • \(x\) = input
  • \(y\) = labels, i.e., the known output corresponding to \(x\)
  • (Recall: \(\hat{y}\) is the estimated output)

Tools

  • Cross-validation
    • Data
      • Training set
      • Validation set
      • Test set
  • Loss function
    • (equiv. Cost/Error Function)
    • “How good is the ANN’s estimate?”
  • Optimizers
    • “How can the ANN improve?”
    • Gradient descent
      • Back-propagation

Cross-validation (a reminder)

Split data into:

  1. training set
    • for learning
    • use in gradient descent during learning
  2. validation set
    • know when to stop learning, avoid overfitting
    • evaluate progress/convergence during learning
  3. test set
    • quality control
    • evaluate final result after learning
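As a reminder of what such a split looks like in practice, here is a minimal sketch using NumPy and scikit-learn's train_test_split (an assumption; any splitting utility, or plain indexing, works just as well):

```python
# Minimal sketch of a train/validation/test split (assuming scikit-learn is available).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))   # toy inputs
y = rng.normal(size=1000)        # toy labels

# First split off the test set (20%), then split the rest into training and validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```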

Loss Function

Suppose we have

  1. an ANN that, with input \(x\), produces an estimated output \(\hat{y}\)
  2. training samples \(X=(x^{(1)},\ldots,x^{(K)})\) with labels (true output values) \(Y=(y^{(1)},\ldots,y^{(K)})\).
Then the Quadratic Loss Function is defined as follows:

  1. For each \(x\in X\), use the residual sum of squares, RSS, as an error measure

\[\begin{eqnarray*}L(w,b|x) &=& \frac{1}{2}\sum_i \left(y_i-\hat{y}_i\right)^2\end{eqnarray*}\]

  2. The full quadratic cost function is simply the Mean Squared Error (MSE) used in cross-validation (see the numeric sketch below) \[\begin{eqnarray*} L(w,b) &=& \frac{1}{K} \sum_{k=1}^K L(w,b|x^{(k)}) \end{eqnarray*}\]
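A minimal NumPy sketch of these two definitions; the array names are illustrative only, not from the lecture code:

```python
import numpy as np

def loss_per_sample(y, y_hat):
    """Quadratic loss L(w,b|x) for one sample: 1/2 * sum_i (y_i - y_hat_i)^2 (RSS)."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def loss(Y, Y_hat):
    """Full quadratic loss L(w,b): mean of the per-sample losses over K samples (MSE)."""
    return np.mean([loss_per_sample(y, y_hat) for y, y_hat in zip(Y, Y_hat)])

# Toy example: K = 3 samples with 2 output units each.
Y     = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
Y_hat = np.array([[0.1, 0.8], [0.9, 0.2], [0.4, 0.6]])
print(loss(Y, Y_hat))
```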


Loss functions for regression

  • Quadratic loss function/Mean square error or variants thereof

Loss functions for classification

  • (Categorical) Cross-entropy or variants thereof
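For completeness, a corresponding sketch of the categorical cross-entropy for a single sample, assuming one-hot labels and predicted class probabilities (e.g., from a softmax output layer):

```python
import numpy as np

def categorical_cross_entropy(y, p, eps=1e-12):
    """Cross-entropy for one sample: -sum_i y_i * log(p_i), one-hot y, probabilities p."""
    return -np.sum(y * np.log(p + eps))

y = np.array([0.0, 1.0, 0.0])   # true class is class 1 (one-hot)
p = np.array([0.2, 0.7, 0.1])   # predicted class probabilities
print(categorical_cross_entropy(y, p))   # -log(0.7) ≈ 0.357
```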

Gradient Descent

Optimization

Consider inverted hill-climbing in one dimension \(v\), i.e., we want to find the minimum instead of the maximum.

Hill-climbing
  1. randomly choose a direction and a step length, and change \(v\) accordingly
  2. stay if \(L(v|x)\) got lower, else go back.
We want to be smarter!

[Figure: hill-climbing]

Gradient Descent

Instead of random steps, use the derivative:
  1. compute the derivative \(\frac{dL(v|x)}{dv}\) to see which way is downhill
  2. take a reasonably long step (\(\eta\)) in that direction, \(v' = v-\eta\frac{dL(v|x)}{dv}\)

\(\eta\) is called the learning rate

[Figure: gradient descent]
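A minimal sketch of these two steps in one dimension; the toy loss \(L(v) = (v-3)^2\) and the learning-rate value are arbitrary choices for illustration:

```python
def L(v):
    """Toy one-dimensional loss with its minimum at v = 3."""
    return (v - 3.0) ** 2

def dL_dv(v):
    """Derivative of the toy loss: tells us which way is downhill."""
    return 2.0 * (v - 3.0)

eta = 0.1    # learning rate: how long a step to take
v = 0.0      # arbitrary starting point
for step in range(100):
    v = v - eta * dL_dv(v)    # step in the direction opposite to the derivative
print(v)     # close to 3.0, the minimum of L
```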

Gradient Descent in higher dimensions

Same thing really, but we have to have partial derivatives for each dimension, which makes it look more complicated.

Consider a 2-dimensional case. We will treat each dimension separately.

  1. Find the partial derivatives for both dimensions \[\begin{pmatrix} \frac{\partial L(v_1,v_2|x)}{\partial v_1}\\ \frac{\partial L(v_1,v_2|x)}{\partial v_2} \end{pmatrix}\]

  2. Take a reasonably long step \(\begin{eqnarray*} \begin{pmatrix} v'_1\\ v'_2\end{pmatrix} &=& \begin{pmatrix}v_1-\eta\frac{\partial L(v_1,v_2|x)}{\partial v_1} \\ v_2-\eta\frac{\partial L(v_1,v_2|x)}{\partial v_2} \end{pmatrix} \end{eqnarray*}\)

(A vector of partial derivatives is called a gradient)

[Figure: loss surface (“valley”); a more realistic parameter space]
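The same update written with a gradient vector, for a toy two-dimensional loss (an illustrative quadratic valley, not the loss of an actual ANN):

```python
import numpy as np

def L(v):
    """Toy 2-D loss: an elongated quadratic valley with its minimum at (1, -2)."""
    return (v[0] - 1.0) ** 2 + 10.0 * (v[1] + 2.0) ** 2

def grad_L(v):
    """Vector of partial derivatives (the gradient)."""
    return np.array([2.0 * (v[0] - 1.0), 20.0 * (v[1] + 2.0)])

eta = 0.02
v = np.array([5.0, 5.0])          # arbitrary starting point
for step in range(500):
    v = v - eta * grad_L(v)       # update both dimensions at once
print(v)                          # close to [1, -2]
```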

Gradient descent strategy

Algorithm
  1. Initialize weights and biases randomly, e.g. \(\sim N(0, \sigma^2)\)
  2. Loop for \(M\) epochs or until convergence:
    • In each epoch and for each weight \(w_{i,j}\) and each bias \(b_j\) :
      1. Compute partial derivatives: \[\begin{eqnarray*} \frac{\partial L(w,b|x)}{\partial w_{i,j}}\\ \frac{\partial L(w,b|x)}{\partial b_{j}} \end{eqnarray*}\]
      2. Update: \[\begin{eqnarray*} w_{i,j} &=& w_{i,j} - \eta \frac{\partial L(w,b|x)}{\partial w_{i,j}}\\ b_{j} &=& b_{j} - \eta \frac{\partial L(w,b|x)}{\partial b_{j}} \end{eqnarray*}\]
  3. Return final weights and biases

For this to work, we need to be able to compute all \(\frac{\partial L(w,b|x)}{\partial v}\) efficiently
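As a sketch of this strategy in the simplest possible setting, a single linear neuron, where the partial derivatives can be written down directly; for a real ANN, the loss_grad part is exactly what back-propagation (below) provides:

```python
import numpy as np

def loss_grad(params, X, Y):
    """Gradients of the quadratic loss for a single linear neuron y_hat = w*x + b
    (a stand-in for what back-propagation computes for a full ANN)."""
    err = params["w"] * X + params["b"] - Y
    return {"w": np.mean(err * X), "b": np.mean(err)}

def train(params, X, Y, eta=0.1, epochs=1000):
    """Generic gradient-descent loop: update every parameter v by -eta * dL/dv each epoch."""
    for epoch in range(epochs):
        grads = loss_grad(params, X, Y)
        for name in params:
            params[name] = params[name] - eta * grads[name]
    return params

# 1. Initialize randomly, e.g. ~ N(0, sigma^2); 2. loop over epochs; 3. return the result.
rng = np.random.default_rng(0)
params = {"w": rng.normal(scale=0.5), "b": rng.normal(scale=0.5)}
X = rng.uniform(-1.0, 1.0, size=200)
Y = 2.0 * X + 0.5 + rng.normal(scale=0.05, size=200)   # toy data: true w = 2, b = 0.5
print(train(params, X, Y))                             # w close to 2, b close to 0.5
```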


Solution: Back propagation

Back propagation – Forward pass

[Figure: a two-neuron network, one neuron per layer]

\(i_1 \;\Rightarrow\; z_1 \;\Rightarrow\; a_1 \;\Rightarrow\; z_2 \;\Rightarrow\; a_2 \;\Rightarrow\; \hat{y}\)

  • \(i_1 = x = 0.05\)

  • \(z_1 = i_1 \times w_1 + b_1 = 0.05 \times 0.1 - 0.1 = -0.095\)

  • \(a_1 = \sigma(z_1) = \sigma(-0.095) = 0.476\)

  • \(z_2 = a_1 \times w_2 + b_2 = 0.476 \times 0.3 + 0.3 = 0.443\)

  • \(\hat{y} = a_2 = \sigma(z_2) = \sigma(0.443) = 0.609\)
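The same forward pass in a few lines of Python; the parameter values \(w_1 = 0.1\), \(b_1 = -0.1\), \(w_2 = 0.3\), \(b_2 = 0.3\) are those implied by the numbers above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameter values implied by the worked example.
w1, b1 = 0.1, -0.1
w2, b2 = 0.3, 0.3

x = 0.05
i1 = x
z1 = i1 * w1 + b1    # -0.095
a1 = sigmoid(z1)     #  0.476
z2 = a1 * w2 + b2    #  0.443
a2 = sigmoid(z2)     #  0.609  (= y_hat)
print(z1, a1, z2, a2)
```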

Back propagation – Backward pass

[Figure: a two-neuron network, one neuron per layer]


Forward-pass values: \(x = i_1 = 0.05\), \(z_1 = -0.095\), \(a_1 = 0.476\), \(z_2 = 0.443\), \(a_2 = 0.609\;(=\hat{y})\); label \(y = 0.01\).

Partial derivative of loss function w.r.t.:

\(w_2:\quad \frac{\partial z_2}{\partial w_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} \;=\; \frac{\partial L(w,b|x)}{\partial w_2}\)

\(\frac{\partial z_2}{\partial w_2} \;=\; \frac{\partial \left(a_1\times w_2 +b_2\right)}{\partial w_2} \;=\; a_1 \;=\; 0.476\)

\(\frac{\partial a_2}{\partial z_2} \;=\; \frac{\partial \sigma(z_2)}{\partial z_2} \;=\; a_2\left(1-a_2\right) \;=\; 0.609(1-0.609) \;=\; 0.238\)

\(\frac{\partial L(w,b|x)}{\partial a_2} \;=\; \frac{\partial \frac{1}{2}(y - a_2)^2}{\partial a_2} \;=\; a_2-y \;=\; 0.609 - 0.01 \;=\; 0.599\)

\(\frac{\partial L(w,b|x)}{\partial w_2} \;=\; \frac{\partial z_2}{\partial w_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} \;=\; 0.476 \times 0.238 \times 0.599 \;=\; 0.0679\)
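The same three factors and their product in Python, using the forward-pass values above:

```python
# Chain-rule factors for dL/dw2, with forward-pass values from the worked example.
a1, a2, y = 0.476, 0.609, 0.01

dz2_dw2 = a1                   # d(a1*w2 + b2)/dw2 = a1
da2_dz2 = a2 * (1.0 - a2)      # sigmoid derivative: a2*(1 - a2)   ≈ 0.238
dL_da2  = a2 - y               # d( 1/2*(y - a2)^2 )/da2           ≈ 0.599

dL_dw2 = dz2_dw2 * da2_dz2 * dL_da2
print(dL_dw2)                  # ≈ 0.0679
```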

Back propagation – Backward pass

[Figure: a two-neuron network, one neuron per layer]



Partial derivative of loss function w.r.t.:

\(w_2:\quad \frac{\partial z_2}{\partial w_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} \;=\; \frac{\partial L(w,b|x)}{\partial w_2}\)

\(b_2:\quad \frac{\partial z_2}{\partial b_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} \;=\; \frac{\partial L(w,b|x)}{\partial b_2}\)

\(w_1:\quad \frac{\partial z_1}{\partial w_1} \times \frac{\partial a_1}{\partial z_1} \times \frac{\partial z_2}{\partial a_1} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} \;=\; \frac{\partial L(w,b|x)}{\partial w_1}\)

\(b_1:\quad \frac{\partial z_1}{\partial b_1} \times \frac{\partial a_1}{\partial z_1} \times \frac{\partial z_2}{\partial a_1} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} \;=\; \frac{\partial L(w,b|x)}{\partial b_1}\)
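A sketch of the complete backward pass for this two-neuron chain, re-using the shared downstream factors (this re-use of stored intermediate results is the point of back-propagation); parameter values as in the forward pass above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters and data as in the worked example.
w1, b1, w2, b2 = 0.1, -0.1, 0.3, 0.3
x, y = 0.05, 0.01

# Forward pass (store intermediate values for re-use in the backward pass).
i1 = x
z1 = i1 * w1 + b1
a1 = sigmoid(z1)
z2 = a1 * w2 + b2
a2 = sigmoid(z2)

# Backward pass: propagate dL/da2 backwards, layer by layer.
delta2 = (a2 - y) * a2 * (1.0 - a2)      # dL/dz2, shared by dL/dw2 and dL/db2
dL_dw2 = delta2 * a1
dL_db2 = delta2 * 1.0
delta1 = delta2 * w2 * a1 * (1.0 - a1)   # dL/dz1, shared by dL/dw1 and dL/db1
dL_dw1 = delta1 * i1
dL_db1 = delta1 * 1.0

print(dL_dw2, dL_db2, dL_dw1, dL_db1)    # dL/dw2 ≈ 0.0679 as on the previous slide
```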

Back propagation IRL

Multiple neurons per layer

  1. Complication: Interactions between layers
  2. Solution: Requires vector and matrix multiplication

Even more complex designs

  • Requires operations on multidimensional matrices

Tensors

  • Arrays (matrices) of arbitrary dimensions (ML def)
  • Tensor operations
    • multiplication, decomposition, …
    • produce new tensors

TensorFlow

  • The forward and backward passes are viewed as
    “tensors (e.g., layers) that flow through the network”
  • An additional twist is that tensor operations allow running all of the samples, or chunks (batches) of them, through the network simultaneously
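A small NumPy sketch of the idea (shapes and values are illustrative only): with the weights of a layer stored as a matrix, a whole batch of samples flows through the layer in a single matrix multiplication:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
K, n_in, n_hidden = 32, 4, 3           # batch of 32 samples, 4 inputs, 3 hidden neurons

X = rng.normal(size=(K, n_in))         # one batch of inputs as a 2-D tensor
W = rng.normal(size=(n_in, n_hidden))  # all weights of the layer as one matrix
b = np.zeros(n_hidden)                 # one bias per neuron

Z = X @ W + b                          # z = x*w + b for every sample and neuron at once
A = sigmoid(Z)                         # activations for the whole batch
print(A.shape)                         # (32, 3)
```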


Summary Learning

Loss function

Measures how close the current outputs, \(\hat{y}\), are to the labels, \(y\).

(Quadratic) Loss function
\[\begin{array}{rcll} L(w,b|x) &=& \frac{1}{2}\sum_i\left(y_i-\hat{y}_i\right)^2&\textsf{Residual sum of squares (RSS)}\\ L(w,b) &=& \frac{1}{K}\sum_{k=1}^K L(w,b|x^{(k)})&\textsf{Mean square error (MSE)} \end{array}\]

Gradient descent

  • “Clever hill-climbing” in several dimensions
  • Change all variables \(v\in (w,b)\) by taking a reasonable step (the learning rate) in the opposite direction to the gradient \[\begin{equation} v' = v-\eta \frac{\partial L(w,b|x)}{\partial v} \end{equation}\]

Back propagation

  • Decomposition of gradients (allows storing and re-using results)
  • Efficient implementation using tensors

Activation functions revisited

Perceptron – step activation
  • Pros
    • Clear classification (0/1)
  • Why did the perceptron “fail”?
    • 1 layer \(\Rightarrow\) linear classification
    • Not meaningfully differentiable, which is a requirement for a multilayer ANN


Activation functions revisited

Why not use a linear activation function?
  • Pros:
    • continuous output
      • better output “resolution”
  • Cons:
    • Not really “meaningfully” differentiable
    • A multilayer linear ANN collapses into a single linear model

However, it is used in the output layer for regression problems!


Activation functions revisited

Sigmoid activation function

  • Meaningfully differentiable
  • Intermediate between step and linear
    • True for most activation functions
    • Balance between pros and cons


Activation functions revisited

ReLU activation function

  • Meaningfully differentiable
  • (A different) intermediate between step and linear
    • True for most activation functions
    • Balance between pros and cons


Activation functions summary

  • Being meaningfully differentiable is important
  • Often needs to balance pros and cons
  • Two main families (+ special cases)


Sigmoid (logistic) family

Examples
  • Sigmoid
  • Tanh

ReLU family

Examples
  • ReLU
  • Leaky ReLU
  • PReLU

Special uses (e.g., output layers)

Examples
  • SoftMax (classification)
  • Linear (regression)
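For reference, minimal NumPy definitions of the activation functions named above; the leaky-ReLU slope 0.01 is a common but arbitrary choice, and in PReLU that slope is instead a learned parameter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)   # small slope for negative z

def softmax(z):
    e = np.exp(z - np.max(z))              # shift for numerical stability
    return e / np.sum(e)                   # outputs sum to 1: class probabilities

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z), softmax(z))
```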






(More about pros and cons of different activation functions in a later lecture)