
MLP & Backpropagation Issues


Page 1: MLP & Backpropagation Issues

•  The following slides (up to Alternatives to Backprop) are thanks to J. Bullinaria.
•  You are expected to know the highlighted (red-boxed) parts of these slides, as well as anything covered in class.


Page 7: MLP & Backpropagation Issues

Preprocessing

Input variables should be decorrelated and have roughly equal variance. Typically, however, only a very simple linear transformation is applied to the input to obtain zero-mean, unit-variance inputs:

x̂i = (xi − x̄i) / σi,   where x̄i is the mean of xi over the patterns and σi² = 1/(N−1) Σp (xpi − x̄i)²

More complex preprocessing is also commonly done, e.g. principal component analysis.
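A minimal sketch of this standardization step in Python/NumPy (the array names and the toy data are illustrative, not from the slides):

    import numpy as np

    def standardize(X):
        """Zero-mean, unit-variance scaling of each input variable.
        X is an (N, d) array: N patterns, d input variables."""
        mean = X.mean(axis=0)            # mean of each variable over the patterns
        std = X.std(axis=0, ddof=1)      # sigma_i, using the 1/(N-1) estimate
        return (X - mean) / std, mean, std

    # The mean and std computed on the training set should be reused when
    # standardizing validation and test inputs.
    X_train = np.random.rand(100, 3) * 10.0   # toy data
    X_std, mu, sigma = standardize(X_train)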

Page 9: MLP & Backpropagation Issues

Effects of Sequential versus Batch Mode: Summary

Batch:
–  Better estimation of the gradient

Sequential (online/stochastic):
–  Better if the data is highly correlated
–  Better in terms of local minima (stochastic search)
–  Easier to implement
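The contrast can be made concrete with a small sketch (a single linear unit trained with squared error on toy data; all names and values are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # toy inputs: 100 patterns, 3 variables
    y = X @ np.array([1.0, -2.0, 0.5])       # toy targets from a known linear map

    def batch_epoch(w, eta=0.1):
        """Batch mode: accumulate the gradient over all patterns, then make one update."""
        grad = X.T @ (X @ w - y) / len(y)    # gradient of the mean squared error
        return w - eta * grad

    def sequential_epoch(w, eta=0.01):
        """Sequential (online/stochastic) mode: update after every pattern, in random order."""
        for p in rng.permutation(len(y)):
            w = w - eta * (X[p] @ w - y[p]) * X[p]   # per-pattern gradient step
        return w

    w_batch = w_seq = np.zeros(3)
    for _ in range(50):
        w_batch = batch_epoch(w_batch)
        w_seq = sequential_epoch(w_seq)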


Page 17: MLP & Backpropagation Issues

When to stop training

There is no precise formula:
1) At a local minimum, the gradient magnitude is 0.
   –  Stop when the gradient is sufficiently small.
      •  Need to calculate the gradient over the whole set of patterns.
      •  May need to measure the gradient in several directions, to avoid errors caused by numerical instability.
2) A local minimum is a stationary point of the performance index (the error).
   –  Stop when the absolute change in the weights is small.
      •  How to measure? Typically as a relative change, e.g. 0.01%.
3) We are interested in generalization ability.
   –  Stop when the error on a validation set (our measure of generalization) starts to increase.
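A sketch of the third criterion, early stopping on a validation set; train_one_epoch and validation_error are assumed callbacks wrapping an actual MLP implementation (in practice you would also save the weights of the best epoch):

    def train_with_early_stopping(train_one_epoch, validation_error,
                                  max_epochs=1000, patience=10):
        """Stop when the validation error has not improved for `patience`
        consecutive epochs, and report the best epoch found."""
        best_err, best_epoch, bad_epochs = float("inf"), 0, 0
        for epoch in range(max_epochs):
            train_one_epoch()                 # one pass of backprop over the training set
            err = validation_error()          # error on the held-out validation set
            if err < best_err:
                best_err, best_epoch, bad_epochs = err, epoch, 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:
                    break                     # validation error has started to rise
        return best_epoch, best_err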

Page 18: MLP & Backpropagation Issues

Local Minima of the Performance Criterion

–  The performance surface is a very high-dimensional (W-dimensional) space full of local minima.
–  Your best bet using gradient descent is to locate one of those local minima.
   –  Start the training from different random locations (we will later see how we can make use of several networks trained this way); see the sketch after this list.
   –  You may also use simulated annealing or genetic algorithms to improve the search in the weight space.
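A sketch of the multi-start idea; build_and_train and validation_error are assumed callbacks wrapping an actual MLP implementation:

    def multi_start_training(n_starts, build_and_train, validation_error):
        """Train from several random initial weight settings and keep the
        network that reaches the lowest validation error."""
        best_net, best_err = None, float("inf")
        for seed in range(n_starts):
            net = build_and_train(seed)        # new random initial weights, then gradient descent
            err = validation_error(net)
            if err < best_err:
                best_net, best_err = net, err  # best local minimum found so far
        return best_net, best_err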


Page 23: MLP & Backpropagation Issues
Neural Networks – Berrin Yanıkoğlu

Alternatives to Gradient Descent

Page 24: MLP & Backpropagation Issues

SUMMARY

There are alternatives to standard backpropagation, intended to speed up its convergence. These either choose a different search direction (p) or a different step size (α). In this course, we will cover updates to standard backpropagation as an overview, namely momentum and variable learning rate, skipping the other alternatives (those that do not follow steepest descent, such as the conjugate gradient method).
–  Remember that you are never responsible for the HIDDEN slides (those that do not show in show mode but are visible when you step through the slides).

Page 25: MLP & Backpropagation Issues

•  Variations of Backpropagation
   –  Momentum: adds a momentum term to effectively increase the step size when successive updates are in the same direction.
   –  Adaptive Learning Rate: tries to increase the step size and, if the effect is bad (it causes oscillations, as evidenced by a decrease in performance), undoes the change and reduces the step size.
•  Conjugate Gradient
•  Levenberg-Marquardt
   –  These two methods choose the next search direction so as to speed up convergence. You should use them instead of basic backpropagation.
•  Line search: ...

Page 26: MLP & Backpropagation Issues


Motivation for momentum (Bishop 7.5)

Page 27: MLP & Backpropagation Issues

Effect of momentum

Δwij(n) = −η ∂E/∂wij(n) + α Δwij(n−1)

Unfolding this recursion from the first iteration gives

Δwij(n) = −η Σt=0..n α^(n−t) ∂E/∂wij(t)

For Δwij(n) not to diverge, α must be < 1.
If the gradient has the same sign in consecutive iterations, the magnitude of the update grows; if it has opposite signs, the magnitude shrinks.

•  Effectively adds inertia to the motion through weight space and smooths out the oscillations.
•  The smaller the η, the smoother the trajectory.
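A minimal NumPy sketch of this update for a generic weight vector (grad_fn and the quadratic example error are illustrative assumptions, not from the slides):

    import numpy as np

    def gd_with_momentum(w, grad_fn, eta=0.1, alpha=0.9, n_steps=100):
        """Gradient descent with momentum:
        dw(n) = -eta * dE/dw(n) + alpha * dw(n-1), then w(n+1) = w(n) + dw(n).
        alpha < 1 keeps the accumulated update from diverging."""
        dw = np.zeros_like(w)
        for _ in range(n_steps):
            dw = -eta * grad_fn(w) + alpha * dw   # momentum update
            w = w + dw
        return w

    # Illustration on the quadratic error E(w) = 0.5 * ||w||^2, so dE/dw = w:
    w_final = gd_with_momentum(np.array([4.0, -3.0]), grad_fn=lambda w: w)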

Page 28: MLP & Backpropagation Issues

Effect of momentum

Δwij(n) = −η ∂E/∂wij(n) + α Δwij(n−1)

For Δwij(n) not to diverge, α must be < 1.
The smaller the η, the smoother the trajectory.
•  If the gradient has the same sign in consecutive iterations, the magnitude of the update grows.
•  If it has opposite signs in consecutive iterations, the magnitude shrinks.

Page 29: MLP & Backpropagation Issues


Effect of momentum

Page 30: MLP & Backpropagation Issues


Effect of momentum (Bishop 7.7)

Page 31: MLP & Backpropagation Issues

Convergence Example of Backpropagation
(Figure: trajectory in the (w11,1, w21,1) weight plane; both axes span −5 to 15.)

Page 32: MLP & Backpropagation Issues

Learning Rate Too Large
(Figure: trajectory in the (w11,1, w21,1) weight plane; both axes span −5 to 15.)

Page 33: MLP & Backpropagation Issues

Momentum Backpropagation
(Figure: trajectory in the (w11,1, w21,1) weight plane, both axes −5 to 15, with momentum coefficient γ = 0.8.)

Page 34: MLP & Backpropagation Issues

Variable Learning Rate (know the general idea, not the algorithm)

•  If the squared error decreases after a weight update:
   –  the weight update is accepted;
   –  the learning rate is multiplied by some factor η > 1;
   –  if the momentum coefficient γ has previously been set to zero, it is reset to its original value.
•  If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update:
   –  the weight update is discarded;
   –  the learning rate is multiplied by some factor ρ, 0 < ρ < 1;
   –  the momentum coefficient γ is set to zero.
•  If the squared error increases by less than ζ, the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
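A sketch of this rule in Python; squared_error and gradient are assumed callbacks for the error over the entire training set and its gradient, and the factor names (grow, shrink, zeta, momentum) are illustrative, not part of the slides:

    import numpy as np

    def variable_rate_gd(w, squared_error, gradient, lr=0.05,
                         grow=1.05, shrink=0.7, zeta=0.04,
                         momentum=0.8, n_steps=200):
        """Gradient descent with the variable-learning-rate heuristic."""
        dw, gamma = np.zeros_like(w), momentum
        err = squared_error(w)
        for _ in range(n_steps):
            dw_try = -lr * gradient(w) + gamma * dw
            w_try = w + dw_try
            new_err = squared_error(w_try)
            if new_err < err:                     # error decreased: accept and speed up
                w, dw, err = w_try, dw_try, new_err
                lr *= grow
                gamma = momentum                  # restore momentum if it had been zeroed
            elif new_err > err * (1.0 + zeta):    # error grew by more than zeta: reject
                lr *= shrink
                gamma = 0.0
            else:                                 # small increase: accept, keep lr and gamma
                w, dw, err = w_try, dw_try, new_err
        return w

    # Illustration on the quadratic error E(w) = 0.5 * ||w||^2:
    w_opt = variable_rate_gd(np.array([4.0, -3.0]),
                             squared_error=lambda w: 0.5 * float(w @ w),
                             gradient=lambda w: w)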

Page 35: MLP & Backpropagation Issues

Example
(Figures: trajectory in the (w11,1, w21,1) weight plane and curves plotted against iteration number, for η = 1.05, ρ = 0.7, ζ = 4%.)