
Page 1: Backpropagation (D1L5 Deep Learning for Speech and Language)

[course site]

Day 1 Lecture 5

Backpropagation in Deep Networks

Elisa Sayrol

Page 2: Backpropagation (D1L5 Deep Learning for Speech and Language)

From previous lectures

L hidden layers

Hidden pre-activation (k > 0)

Hidden activation (k = 1, …, L)

Output activation (k = L+1)

Figure Credit: Hugo Larochelle, NN course
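For completeness, these quantities can be written out as follows (a reconstruction in Hugo Larochelle's standard notation, not transcribed from the slide; g is the hidden activation function and o the output non-linearity, e.g. softmax):

```latex
\begin{align*}
  \mathbf{a}^{(k)}(\mathbf{x}) &= \mathbf{b}^{(k)} + \mathbf{W}^{(k)} \mathbf{h}^{(k-1)}(\mathbf{x}),
      \quad \mathbf{h}^{(0)}(\mathbf{x}) = \mathbf{x}
      && \text{hidden pre-activation } (k > 0) \\
  \mathbf{h}^{(k)}(\mathbf{x}) &= g\bigl(\mathbf{a}^{(k)}(\mathbf{x})\bigr)
      && \text{hidden activation } (k = 1, \dots, L) \\
  \mathbf{h}^{(L+1)}(\mathbf{x}) &= o\bigl(\mathbf{a}^{(L+1)}(\mathbf{x})\bigr) = f(\mathbf{x})
      && \text{output activation } (k = L + 1)
\end{align*}
```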

Page 3: Backpropagation (D1L5 Deep Learning for Speech and Language)

Backpropagation algorithm

The output of the network gives class scores that depend on the input and the parameters.

• Define a loss function that quantifies our unhappiness with the scores across the training data.

• Come up with a way of efficiently finding the parameters that minimize the loss function (optimization); see the objective written out below.
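Stated as a formula (a hedged restatement, not from the slide; f(x; θ) denotes the class scores, L the per-example loss, and N the number of training examples):

```latex
\[
  \theta^{*} \;=\; \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} L\bigl(f(\mathbf{x}_i; \theta),\, y_i\bigr)
\]
```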

Page 4: Backpropagation (D1L5 Deep Learning for Speech and Language)

Class probability given an input (softmax)

[Figure: feed-forward network for the forward pass: input x, weights W1, W2, W3, pre-activations a2, a3, a4, activations h2, h3, h4, and the loss at the output; layers labelled Input, Hidden, Hidden, Output]

Figure Credit: Kevin McGuinness

Forward Pass
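As an illustration only (not code from the course), a minimal NumPy sketch of this forward pass, assuming ReLU hidden units, a softmax output, made-up layer sizes, and no biases:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    e = np.exp(a - a.max())          # shift for numerical stability
    return e / e.sum()

# Made-up sizes: input of 4 features, hidden layers of 5 and 3 units, 2 classes.
rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(5, 4)), rng.normal(size=(3, 5)), rng.normal(size=(2, 3))

x  = rng.normal(size=4)          # input
a2 = W1 @ x;  h2 = relu(a2)      # first hidden layer
a3 = W2 @ h2; h3 = relu(a3)      # second hidden layer
a4 = W3 @ h3; h4 = softmax(a4)   # output: class probabilities
```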

Page 5: Backpropagation (D1L5 Deep Learning for Speech and Language)

Class probability given an input (softmax)

Loss function; e.g., negative log-likelihood (good for classification)

[Figure: same feed-forward network as before (x, W1, W2, W3, a2–a4, h2–h4, loss)]

Regularization term (L2 norm), a.k.a. weight decay

Figure Credit: Kevin McGuinness

Forward Pass

Page 6: Backpropagation (D1L5 Deep Learning for Speech and Language)

Class probability given an input (softmax)

Minimize the loss (plus some regularization term) w.r.t. the parameters over the whole training set.

Loss function; e.g., negative log-likelihood (good for classification)

[Figure: same feed-forward network as before (x, W1, W2, W3, a2–a4, h2–h4, loss)]

Regularization term (L2 norm), a.k.a. weight decay

Figure Credit: Kevin McGuinness

Forward Pass
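Putting the labels together, the training objective would take the standard form below (a reconstruction with one common choice of constants, not copied from the slide): softmax class probabilities, average negative log-likelihood over the N training examples, plus an L2 weight-decay term with strength λ:

```latex
\begin{align*}
  p(c \mid \mathbf{x}) &= \frac{e^{a^{4}_{c}}}{\sum_{c'} e^{a^{4}_{c'}}} \\[4pt]
  J(\theta) &= -\frac{1}{N} \sum_{i=1}^{N} \log p\bigl(y_i \mid \mathbf{x}_i\bigr)
             \;+\; \frac{\lambda}{2} \sum_{k=1}^{3} \lVert \mathbf{W}^{k} \rVert_2^2
\end{align*}
```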

Page 7: Backpropagation (D1L5 Deep Learning for Speech and Language)

Backpropagation algorithm

• We need a way to fit the model to data: find parameters (W(k), b(k)) of the network that (locally) minimize the loss function.

• We can use stochastic gradient descent. Or better yet, mini-batch stochastic gradient descent (see the sketch below).

• To do this, we need to find the gradient of the loss function with respect to all the parameters of the model (W(k), b(k))

• These can be found using the chain rule of differentiation.

• The calculations reveal that the gradient w.r.t. the parameters in layer k depends only on the error from the layer above and the output from the layer below.

• This means that the gradients for each layer can be computed iteratively, starting at the last layer and propagating the error back through the network. This is known as the backpropagation algorithm.

Slide Credit: Kevin McGuinness
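A schematic mini-batch SGD loop to make the recipe concrete (a sketch only: sgd_train, grad_fn, and the batch iterator are hypothetical names, not functions from the course; grad_fn stands for the forward plus backward passes described in the next slides):

```python
def sgd_train(params, batches, grad_fn, lr=0.01, epochs=10):
    """Mini-batch stochastic gradient descent.

    params  -- dict of NumPy weight matrices, e.g. {"W1": ..., "W2": ..., "W3": ...}
    batches -- iterable of (x, y) mini-batches
    grad_fn -- hypothetical function that runs the forward and backward passes
               and returns the gradients {"W1": dW1, ...} of the loss
    """
    for _ in range(epochs):
        for x, y in batches:
            grads = grad_fn(params, x, y)          # backpropagation happens here
            for name in params:
                params[name] -= lr * grads[name]   # step against the gradient
    return params
```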

Page 8: Backpropagation (D1L5 Deep Learning for Speech and Language)

1. Find the error in the top layer:

[Figure: the same network traversed backwards: the error is computed at the loss L and propagated from the output towards the input]

Figure Credit: Kevin McGuinness

Backward Pass
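With a softmax output and a negative log-likelihood loss, the top-layer error takes the standard form below (stated for completeness, not transcribed from the slide; y is the one-hot target):

```latex
\[
  \boldsymbol{\delta}^{4} \;=\; \frac{\partial L}{\partial \mathbf{a}^{4}} \;=\; \mathbf{h}^{4} - \mathbf{y}
\]
```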

Page 9: Backpropagation (D1L5 Deep Learning for Speech and Language)

1. Find the error in the top layer:

2. Compute weight updates:

[Figure: backward pass through the same network]

Figure Credit: Kevin McGuinness

Backward Pass

To simplify, we don’t consider the biases.
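Step 2 written out (a standard reconstruction in the figure's notation, ignoring biases as the slide does): the gradient for a weight matrix is the outer product of the error from the layer above and the activation from the layer below, and the update subtracts it scaled by the learning rate η; W2 and W1 are handled analogously with h2 and x.

```latex
\[
  \frac{\partial L}{\partial \mathbf{W}^{3}} \;=\; \boldsymbol{\delta}^{4} \bigl(\mathbf{h}^{3}\bigr)^{\top},
  \qquad
  \mathbf{W}^{3} \;\leftarrow\; \mathbf{W}^{3} - \eta \,\frac{\partial L}{\partial \mathbf{W}^{3}}
\]
```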

Page 10: Backpropagation (D1L5 Deep Learning for Speech and Language)

1. Find the error in the top layer:

2. Compute weight updates:

3. Backpropagate the error to the layer below:

[Figure: backward pass through the same network]

Figure Credit: Kevin McGuinness

Backward Pass

To simplify, we don’t consider the biases.
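The three steps in code (illustrative only, not the course's implementation; assumes ReLU hidden units, a softmax output with negative log-likelihood, a one-hot target y, made-up sizes, and no biases):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(5, 4)), rng.normal(size=(3, 5)), rng.normal(size=(2, 3))
x, y = rng.normal(size=4), np.array([1.0, 0.0])   # input and one-hot target (made up)

# Forward pass, as in the earlier sketch (biases omitted).
a2 = W1 @ x;  h2 = relu(a2)
a3 = W2 @ h2; h3 = relu(a3)
a4 = W3 @ h3; h4 = softmax(a4)

# 1. Error in the top layer (softmax + negative log-likelihood): dL/da4.
d4 = h4 - y

# 2. Compute weight updates (error from the layer above x activation from below).
dW3 = np.outer(d4, h3)

# 3. Backpropagate the error to the layer below, through the ReLU derivative.
d3 = (W3.T @ d4) * (a3 > 0)   # dL/da3
dW2 = np.outer(d3, h2)
d2 = (W2.T @ d3) * (a2 > 0)   # dL/da2
dW1 = np.outer(d2, x)

# Gradient step with learning rate eta.
eta = 0.01
W1 -= eta * dW1; W2 -= eta * dW2; W3 -= eta * dW3
```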

Page 11: Backpropagation (D1L5 Deep Learning for Speech and Language)

Optimization

Stochastic Gradient Descent

Stochastic Gradient Descent with momentum

Stochastic Gradient Descent with L2 regularization

(The update equations involve a learning rate and a weight-decay coefficient.)

Recommended lectures:

Backpropagation: http://cs231n.github.io/optimization-2/

Optimization: http://sebastianruder.com/optimizing-gradient-descent/ (Sebastian Ruder's blog)
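The corresponding update rules in their usual textbook form (not transcribed from the slide), with learning rate η, momentum coefficient γ, velocity v, and weight decay λ:

```latex
\begin{align*}
  \text{SGD:}                        \quad & \theta \leftarrow \theta - \eta \, \nabla_{\theta} L \\
  \text{SGD with momentum:}          \quad & v \leftarrow \gamma v + \eta \, \nabla_{\theta} L,
                                            \qquad \theta \leftarrow \theta - v \\
  \text{SGD with L2 regularization:} \quad & \theta \leftarrow \theta - \eta \bigl( \nabla_{\theta} L + \lambda \theta \bigr)
\end{align*}
```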

Page 12: Backpropagation (D1L5 Deep Learning for Speech and Language)

In the backward pass you might be in the flat part of the sigmoid (or of another saturating activation function such as tanh), so the derivative tends to zero and your training loss will not go down.

“Vanishing Gradients”
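One way to see this (a standard argument, not from the slide): the sigmoid's derivative is bounded by 1/4 and vanishes in the saturated regions, so each layer multiplies the backpropagated error by a small factor.

```latex
\[
  \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad
  \sigma'(a) = \sigma(a)\bigl(1 - \sigma(a)\bigr) \le \tfrac{1}{4}, \qquad
  \sigma'(a) \to 0 \ \text{as } |a| \to \infty
\]
```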