
Machine Learning (CSE 446): Backpropagation

Noah Smith © 2017

University of Washington
nasmith@cs.washington.edu

November 8, 2017

Neuron-Inspired Classifiers

[Diagram: the input x_n is multiplied by the weights W, the bias b is added, and the result is passed through tanh to produce the "hidden units" ("activation"). These are weighted by v and summed to give the classifier output ŷ ("f"), which is compared against the correct output y_n to compute the loss L_n.]

Two-Layer Neural Network

$$ f(x) = \mathrm{sign}\left( \sum_{h=1}^{H} v_h \cdot \tanh(w_h \cdot x + b_h) \right) = \mathrm{sign}\left( v \cdot \tanh(Wx + b) \right) $$

▶ Two-layer networks allow decision boundaries that are nonlinear.

▶ It's fairly easy to show that "XOR" can be simulated (recall conjunction features from the "practical issues" lecture on 10/18).

▶ Theoretical result: any continuous function on a bounded region in R^d can be approximated arbitrarily well, with a finite number of hidden units.

▶ The number of hidden units affects how complicated your decision boundary can be and how easily you will overfit.
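As a concrete sketch of the formula above, here is a minimal pure-Python forward pass (the function name and the hand-picked XOR weights are illustrative assumptions, not part of the lecture):

```python
import math

def two_layer_forward(W, b, v, x):
    """f(x) = sign(v . tanh(Wx + b)) for a two-layer network.

    W is an H x d list of weight rows, b and v are length-H lists,
    and x is a length-d input list.
    """
    hidden = [math.tanh(sum(w_i * x_i for w_i, x_i in zip(w_row, x)) + b_h)
              for w_row, b_h in zip(W, b)]
    score = sum(v_h * e_h for v_h, e_h in zip(v, hidden))
    return 1 if score >= 0 else -1

# Hand-picked weights that simulate XOR on inputs in {-1, +1}:
# unit 1 is approximately AND, unit 2 approximately NOR, and
# unit 3 (zero weights, positive bias) acts as a constant bias unit.
W = [[5.0, 5.0], [-5.0, -5.0], [0.0, 0.0]]
b = [-5.0, -5.0, 5.0]
v = [-1.0, -1.0, -1.0]
```

With these weights, mixed inputs like [1, -1] score positive and matching inputs like [1, 1] score negative, which is the XOR simulation mentioned in the second bullet.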

Learning with a Two-Layer Network

Parameters: W ∈ R^{H×d}, b ∈ R^H, and v ∈ R^H.

▶ If we choose a differentiable loss, then the whole function will be differentiable with respect to all parameters.

▶ Because of the squashing function, which is not convex, the overall learning problem is not convex.

▶ What does (stochastic) (sub)gradient descent do with non-convex functions? It finds a local minimum.

▶ To calculate gradients, we need to use the chain rule from calculus.

▶ Special name for (S)GD with chain rule invocations: backpropagation.

Backpropagation

For every node in the computation graph, we wish to calculate the first derivative of L_n with respect to that node. For any node a, let:

$$ \bar{a} = \frac{\partial L_n}{\partial a} $$

Base case:

$$ \bar{L}_n = \frac{\partial L_n}{\partial L_n} = 1 $$

From here on, we overload notation and let a and b refer to nodes and to their values.

After working forwards through the computation graph to obtain the loss L_n, we work backwards through the computation graph, using the chain rule to calculate \bar{a} for every node a, making use of the work already done for nodes that depend on a:

$$ \frac{\partial L_n}{\partial a} = \sum_{b : a \to b} \frac{\partial L_n}{\partial b} \cdot \frac{\partial b}{\partial a} $$

$$ \bar{a} = \sum_{b : a \to b} \bar{b} \cdot \frac{\partial b}{\partial a} = \sum_{b : a \to b} \bar{b} \cdot \begin{cases} 1 & \text{if } b = a + c \text{ for some } c \\ c & \text{if } b = a \cdot c \text{ for some } c \\ 1 - b^2 & \text{if } b = \tanh(a) \end{cases} $$
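The three local rules can be exercised on a one-hidden-unit scalar graph. The sketch below assumes squared loss (consistent with the squared-loss example used later in these slides); the function name is illustrative:

```python
import math

def scalar_forward_backward(w, b, v, x, y):
    """Scalar one-hidden-unit network with squared loss.

    Forward: d = w*x, a = b + d, e = tanh(a), g = v*e, Ln = (y - g)**2.
    Backward: apply the local rules (sum -> 1, product -> c,
    tanh -> 1 - b^2) at each node, working from Ln back to w.
    """
    d = w * x
    a = b + d
    e = math.tanh(a)
    g = v * e
    Ln = (y - g) ** 2

    bar_Ln = 1.0                     # base case
    bar_g = bar_Ln * (-2 * (y - g))  # loss-specific derivative
    bar_e = bar_g * v                # product rule: c = v
    bar_v = bar_g * e
    bar_a = bar_e * (1 - e * e)      # tanh rule: 1 - b^2
    bar_b = bar_a * 1.0              # sum rule
    bar_d = bar_a * 1.0
    bar_w = bar_d * x                # product rule: c = x
    return Ln, {"w": bar_w, "b": bar_b, "v": bar_v}
```

A quick sanity check is to compare each returned derivative against a finite-difference estimate of the loss.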

Backpropagation with Vectors

Pointwise ("Hadamard") product for vectors in R^n:

$$ a \odot b = \begin{bmatrix} a[1] \cdot b[1] \\ a[2] \cdot b[2] \\ \vdots \\ a[n] \cdot b[n] \end{bmatrix} $$

$$ \bar{a} = \sum_{b : a \to b} \sum_{i=1}^{|b|} \bar{b}[i] \cdot \frac{\partial b[i]}{\partial a} = \sum_{b : a \to b} \begin{cases} \bar{b} & \text{if } b = a + c \text{ for some } c \\ \bar{b} \odot c & \text{if } b = a \odot c \text{ for some } c \\ \bar{b} \odot (1 - b \odot b) & \text{if } b = \tanh(a) \end{cases} $$
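The three vector rules translate directly into code. This is a minimal pure-Python sketch (function names are assumptions for illustration); the tanh rule can be verified coordinate-wise against finite differences:

```python
import math

def hadamard(a, b):
    """Pointwise (Hadamard) product of two equal-length vectors."""
    return [a_i * b_i for a_i, b_i in zip(a, b)]

def backward_add(bar_b):
    """b = a + c: the gradient passes through unchanged."""
    return list(bar_b)

def backward_hadamard(bar_b, c):
    """b = a (*) c: the gradient is bar_b (*) c."""
    return hadamard(bar_b, c)

def backward_tanh(bar_b, b):
    """b = tanh(a): the gradient is bar_b (*) (1 - b (*) b)."""
    return [g * (1 - b_i * b_i) for g, b_i in zip(bar_b, b)]
```

Note that the tanh rule is written in terms of the already-computed output b, so the forward-pass values are reused during the backward pass.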

Backpropagation, Illustrated

Consider the computation graph for the two-layer network, with intermediate nodes de-anonymized to make notation easier:

d = Wx_n,  a = b + d,  e = tanh(a),  f = v ⊙ e,  g = ∑_h f[h],  and g together with y_n determines L_n.

Working backwards through the graph:

▶ Base case: \bar{L}_n = ∂L_n/∂L_n = 1.

▶ The form of \bar{g} will be loss-function specific (e.g., −2(y_n − g) for squared loss).

▶ Sum: \bar{f} = \bar{g} · 1 (every coordinate of f receives \bar{g}).

▶ Product: \bar{v} = \bar{g} · e and \bar{e} = \bar{g} · v.

▶ Hyperbolic tangent: \bar{a} = (\bar{g} · v) ⊙ (1 − e ⊙ e).

▶ Sum: \bar{b} = \bar{a} and \bar{d} = \bar{a}.

▶ Product: \bar{W} = \bar{a} x_n^⊤.
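The full backward pass above can be sketched in pure Python. Squared loss is assumed (matching the \bar{g} example on the slide), and the function name is illustrative:

```python
import math

def net_forward_backward(W, b, v, x, y):
    """Forward and backward pass for the illustrated graph,
    assuming squared loss Ln = (y - g)**2.

    Returns the loss and the gradients bar_W, bar_b, bar_v."""
    H, d_in = len(W), len(x)
    d = [sum(W[h][i] * x[i] for i in range(d_in)) for h in range(H)]  # d = W x_n
    a = [b[h] + d[h] for h in range(H)]                               # a = b + d
    e = [math.tanh(a_h) for a_h in a]                                 # e = tanh(a)
    g = sum(v[h] * e[h] for h in range(H))                            # g = sum_h f[h]
    Ln = (y - g) ** 2

    bar_g = -2 * (y - g)                                    # loss-specific
    bar_v = [bar_g * e[h] for h in range(H)]                # product: g-bar * e
    bar_e = [bar_g * v[h] for h in range(H)]                # product: g-bar * v
    bar_a = [bar_e[h] * (1 - e[h] ** 2) for h in range(H)]  # tanh rule
    bar_b = list(bar_a)                                     # sum: passes through
    bar_W = [[bar_a[h] * x[i] for i in range(d_in)]         # outer product a-bar x^T
             for h in range(H)]
    return Ln, bar_W, bar_b, bar_v
```

As before, each gradient can be checked against a finite-difference estimate, which is a standard way to debug a hand-written backward pass.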

Practical Notes

▶ Don't initialize all parameters to zero; add some random noise.

▶ Random restarts: train K networks with different initializers, and you'll get K different classifiers of varying quality.

▶ Hyperparameters:
  ▶ number of training iterations
  ▶ learning rate for SGD
  ▶ number of hidden units (H)
  ▶ number of "layers" (and number of hidden units in each layer)
  ▶ amount of randomness in initialization
  ▶ regularization

▶ Interpretability?
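The first two notes can be sketched together: a small random initializer (the function name and Gaussian scale are assumptions for illustration), where different seeds give the K different starting points used for random restarts:

```python
import random

def init_params(H, d, scale=0.1, seed=0):
    """Initialize a two-layer network with small random noise.

    If all weights start at zero, the hidden units are
    indistinguishable and never differentiate during training;
    small random weights break this symmetry.
    """
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(H)]
    b = [0.0 for _ in range(H)]
    v = [rng.gauss(0.0, scale) for _ in range(H)]
    return W, b, v
```

Random restarts then amount to training one network per seed and keeping the one with the best validation performance.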


Challenge of Deeper Networks

Backpropagation aims to assign "credit" (or "blame") to each parameter. In a deep network, credit/blame is shared across all layers, so parameters at early layers tend to have very small gradients. One solution is to train a shallow network, then use it to initialize a deeper network, perhaps gradually increasing network depth. This is called layer-wise training.

Radial Basis Function Networks

[Diagram: the input x_n is compared to each weight vector w_h via sqd, scaled by γ_h, and passed through exp to produce the "hidden units" ("activation"). These are weighted by v_1, v_2 and summed to give the classifier output ŷ ("f"), which is compared against the correct output y_n to compute the loss L_n.]

In the diagram, sqd(x, w) = ‖x − w‖₂².

Generalizing to H hidden units:

$$ f(x) = \mathrm{sign}\left( \sum_{h=1}^{H} v[h] \cdot \exp\left( -\gamma_h \cdot \|x - w_h\|_2^2 \right) \right) $$

Each hidden unit is like a little "bump" in data space. w_h is the position of the bump, and γ_h is inversely proportional to its width.
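The RBF classifier above can be sketched directly from the formula (the function name is an assumption for illustration):

```python
import math

def rbf_forward(centers, gammas, v, x):
    """f(x) = sign(sum_h v[h] * exp(-gamma_h * ||x - w_h||^2)).

    centers is a list of H position vectors w_h, gammas a list of H
    width parameters gamma_h, v a list of H weights, x the input.
    """
    score = 0.0
    for w_h, g_h, v_h in zip(centers, gammas, v):
        sqd = sum((x_i - w_i) ** 2 for x_i, w_i in zip(x, w_h))
        score += v_h * math.exp(-g_h * sqd)
    return 1 if score >= 0 else -1

# Two bumps: a positive one at (0, 0) and a negative one at (3, 3).
centers = [[0.0, 0.0], [3.0, 3.0]]
gammas = [1.0, 1.0]
v = [1.0, -1.0]
```

Points near (0, 0) are classified positive and points near (3, 3) negative, matching the "bump" picture: each hidden unit dominates the score only in a neighborhood of its center.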

A Gentle Reading on Backpropagation

http://colah.github.io/posts/2015-08-Backprop/
