
Neural Networks - Berrin Yanıkoğlu

Applications and Examples

From Mitchell Chp. 4


ALVINN drives 70 mph on highways


Speech Recognition


Hidden Node Functions


Head Pose Recognition


MLP & Backpropagation IssuesMLP & Backpropagation Issues


Considerations

• Network architecture
  – Typically feedforward; however, you may also use local receptive fields for hidden nodes, and recurrent nodes in sequence learning
• Number of input, hidden, output nodes
  – The number of hidden nodes is quite important; the others are determined by the problem setup
• Activation functions
  – Careful: regression requires linear activation on the output
  – For other tasks, sigmoid or hyperbolic tangent is a good choice (see the sketch after this list)
• Learning rate
  – Typically the software adjusts this
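A minimal sketch of these choices in Python (an illustration, not from the slides), assuming a one-hidden-layer feedforward network for regression: tanh activation on the hidden layer, a linear output as regression requires, and the hidden-node count as the key free parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Network sizes: the hidden-node count is the important design choice;
# input/output sizes are fixed by the problem setup.
n_in, n_hidden, n_out = 4, 10, 1
W1 = rng.normal(0.0, 0.5, (n_hidden, n_in))   # input -> hidden weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, (n_out, n_hidden))  # hidden -> output weights
b2 = np.zeros(n_out)

def forward(x):
    h = np.tanh(W1 @ x + b1)   # nonlinear hidden activation (tanh)
    return W2 @ h + b2         # linear output, as regression requires

print(forward(np.ones(n_in)))
```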


Considerations

• Preprocessing
  – Important (see next slides)
• Learning algorithm
  – Backpropagation with momentum, or Levenberg-Marquardt, is suggested
• When to stop training
  – Important (see next slides)


Preprocessing

Input variables should be decorrelated and have roughly equal variance.

But typically, a very simple linear transformation is applied to the input to obtain zero-mean, unit-variance inputs:

$\hat{x}_i = (x_i - \bar{x}_i)/\sigma_i$, where $\sigma_i^2 = \frac{1}{N-1}\sum_{p}(x_{pi} - \bar{x}_i)^2$, summing over the patterns $p$.

More complex preprocessing is also commonly done, e.g. principal component analysis.
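A minimal sketch of this standardization in Python (an illustration, not part of the slides), applied column-wise to an N x d matrix of input patterns:

```python
import numpy as np

def standardize(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)   # 1/(N-1) normalization, as in the formula
    return (X - mean) / std, mean, std

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
Xs, mean, std = standardize(X)
print(Xs.mean(axis=0), Xs.std(axis=0, ddof=1))  # ~0 and 1 per input variable
```

In practice the mean and standard deviation computed on the training set are reused to transform validation and test inputs.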


When to stop training

No precise formula:

1) At a local minimum, the gradient magnitude is 0
   – Stop when the gradient is sufficiently small
     • Need to calculate the gradient over the whole set of patterns
     • May need to measure the gradient in several directions, to avoid errors caused by numerical instability

2) A local minimum is a stationary point of the performance index (the error)
   – Stop when the absolute change in weights is small
     • How to measure? Typically as rates: 0.01%

3) We are interested in generalization ability
   – Stop when the generalization error, measured as the error on a validation set, starts to increase (see the sketch after this list)
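A minimal early-stopping sketch in Python for criterion 3. `train_epoch` and `validation_error` are hypothetical stand-ins for your own training and evaluation code, and the weight object is assumed to support `.copy()` (e.g. a NumPy array); none of these names come from the slides.

```python
def train_with_early_stopping(net, train_epoch, validation_error,
                              patience=10, max_epochs=1000):
    best_err, best_weights, bad_epochs = float("inf"), net.copy(), 0
    for epoch in range(max_epochs):
        train_epoch(net)                 # one pass over the training set
        err = validation_error(net)      # error on the held-out validation set
        if err < best_err:               # still generalizing: remember weights
            best_err, best_weights, bad_epochs = err, net.copy(), 0
        else:
            bad_epochs += 1              # validation error started to increase
            if bad_epochs >= patience:   # stop once the increase persists
                break
    return best_weights                  # weights with best validation error
```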


Effects of Sequential versus Batch Mode: Summary

– Batch:
  – Better estimation of the gradient

– Sequential (online):
  – Better if data is highly correlated
  – Better in terms of local minima (stochastic search)
  – Easier to implement
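A minimal sketch contrasting the two modes in Python, for squared error on a linear model (an illustration under assumed names, not from the slides): the batch step estimates the gradient over all patterns at once, while the sequential step updates after each pattern in random order.

```python
import numpy as np

def batch_update(w, X, y, lr):
    # Gradient of the mean squared error, estimated over ALL patterns
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def sequential_update(w, X, y, lr, rng):
    # One pattern at a time, in random order: a noisy, stochastic search
    for i in rng.permutation(len(y)):
        grad_i = (X[i] @ w - y[i]) * X[i]
        w = w - lr * grad_i
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)
w = batch_update(w, X, y, lr=0.1)
w = sequential_update(w, X, y, lr=0.1, rng=np.random.default_rng(0))
print(w)
```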


Performance Surface

Motivation for some of the practical issues


Local Minima of the Performance Criteria

- The performance surface is a very high-dimensional ($W$-dimensional) space full of local minima.

- Your best bet using gradient descent is to locate one of the local minima.
  – Start the training from different random locations (we will later see how we can make use of several networks trained this way); a sketch of this restart strategy follows below.
  – You may also use simulated annealing or genetic algorithms to improve the search in the weight space.
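A minimal multi-restart sketch in Python: train from several random initializations and keep the network with the lowest error. `init_net`, `train`, and `error` are hypothetical placeholders for your own initialization, training, and evaluation routines; none appear in the slides.

```python
def train_with_restarts(init_net, train, error, n_restarts=5):
    best_net, best_err = None, float("inf")
    for seed in range(n_restarts):
        net = train(init_net(seed))   # train from a different random start
        err = error(net)
        if err < best_err:            # keep the best local minimum found
            best_net, best_err = net, err
    return best_net
```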


Performance Surface Example

Network architecture: a 1-2-1 network (layer numbers are shown as superscripts).

Nominal parameter values:

$w^1_{1,1} = 10$, $w^1_{2,1} = 10$, $b^1_1 = -5$, $b^1_2 = 5$, $w^2_{1,1} = 1$, $w^2_{1,2} = 1$, $b^2 = -1$

[Figure: the nominal function of this network, plotted over inputs from -2 to 2 with outputs ranging from 0 to 1.]


Squared Error vs. $w^1_{1,1}$ and $b^1_1$

[Figure: surface and contour plots of the squared error as a function of $w^1_{1,1}$ and $b^1_1$.]


Squared Error vs. $w^1_{1,1}$ and $w^2_{1,1}$

[Figure: surface and contour plots of the squared error as a function of $w^1_{1,1}$ and $w^2_{1,1}$.]


Squared Error vs. $b^1_1$ and $b^1_2$

[Figure: surface and contour plots of the squared error as a function of $b^1_1$ and $b^1_2$.]


MLP & Backpropagation Summary MLP & Backpropagation Summary REST of the SLIDES are REST of the SLIDES are

ADVANCED MATERIAL ADVANCED MATERIAL (read only if you are interested, or (read only if you are interested, or

if there is something you do^’t understand…)if there is something you do^’t understand…)

These slides are thanks to John Bullinaria



Alternatives to Gradient Descent

ADVANCED MATERIAL

(read only if interested)


SUMMARY

There are alternatives to standard backpropagation, intended to speed up its convergence.

These either choose a different search direction ($p$) or a different step size ($\alpha$).

In this course, we will cover updates to standard backpropagation as an overview, namely momentum and variable-rate learning, skipping the other alternatives (those that do not follow steepest descent, such as the conjugate gradient method).
– Remember that you are never responsible for the HIDDEN slides (those that do not show in show mode but are visible when you step through the slides!)


• Variations of Backpropagation
  – Momentum: adds a momentum term to effectively increase the step size when successive updates are in the same direction.
  – Adaptive Learning Rate: tries to increase the step size, and backs off if the effect is bad (causes oscillations, as evidenced by a decrease in performance).

• Newton's Method
• Conjugate Gradient
• Levenberg-Marquardt
• Line search


Motivation for momentum (Bishop 7.5)


Effect of momentum

$\Delta w_{ij}(n) = -\eta \, \partial E/\partial w_{ij}(n) + \alpha \, \Delta w_{ij}(n-1)$

Unrolling over time:

$\Delta w_{ij}(n) = -\eta \sum_{t=0}^{n} \alpha^{n-t} \, \partial E/\partial w_{ij}(t)$

If the derivative has the same sign in consecutive iterations => the magnitude grows
If it has the opposite sign in consecutive iterations => the magnitude shrinks

• For $\Delta w_{ij}(n)$ not to diverge, $\alpha$ must be < 1.
• Effectively adds inertia to the motion through the weight space and smoothes out the oscillations.
• The smaller the $\eta$, the smoother the trajectory.
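A minimal sketch of this momentum update in Python (names are illustrative, not from the slides), showing the step magnitude growing while consecutive gradients keep the same sign:

```python
import numpy as np

def momentum_step(w, grad, prev_dw, lr=0.1, alpha=0.8):
    # dw(n) = -eta * dE/dw(n) + alpha * dw(n-1), with alpha < 1
    dw = -lr * grad + alpha * prev_dw
    return w + dw, dw

w = np.zeros(3)
prev_dw = np.zeros(3)
for grad in [np.ones(3)] * 5:            # same gradient sign every iteration
    w, prev_dw = momentum_step(w, grad, prev_dw)
    print(np.linalg.norm(prev_dw))       # step magnitude grows each time
```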


Convergence Example of Backpropagation

[Figure: backpropagation trajectory on the contour plot of the squared error versus $w^1_{1,1}$ and $w^2_{1,1}$.]


Learning Rate Too Large

[Figure: trajectory on the same $w^1_{1,1}$–$w^2_{1,1}$ error contour when the learning rate is too large.]


Momentum Backpropagation

[Figure: trajectory on the $w^1_{1,1}$–$w^2_{1,1}$ error contour using momentum backpropagation, with momentum coefficient 0.8.]


Variable Learning RateVariable Learning Rate

• If the squared error decreases after a weight update•the weight update is accepted•the learning rate is multiplied by some factor >1. •If the momentum coefficient has been previously set to zero, it is reset to its original value.

• If the squared error (over the entire training set) increases by more than some set percentage after a weight update

•weight update is discarded•the learning rate is multiplied by some factor (1>>0)•the momentum coefficient is set to zero.

• If the squared error increases by less than , then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
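A minimal sketch of this rule in Python, using the symbols above with the values of the example that follows (1.05, 0.7, 4%) as defaults; `sq_error` and `try_update` are hypothetical stand-ins for your own error and update computations, not names from the slides.

```python
def vlr_step(w, lr, gamma, gamma0, sq_error, try_update,
             eta=1.05, rho=0.7, zeta=0.04):
    old_err = sq_error(w)                # squared error over the training set
    w_new = try_update(w, lr, gamma)     # tentative weight update
    new_err = sq_error(w_new)
    if new_err < old_err:                # error decreased: accept, speed up,
        return w_new, lr * eta, (gamma0 if gamma == 0 else gamma)
    if new_err > old_err * (1 + zeta):   # grew by more than zeta: discard,
        return w, lr * rho, 0.0          # slow down, and zero the momentum
    return w_new, lr, gamma              # small increase: accept unchanged
```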


Example

$\eta = 1.05$, $\rho = 0.7$, $\zeta = 4\%$

[Figure: variable-learning-rate trajectory on the $w^1_{1,1}$–$w^2_{1,1}$ error contour, with plots of the squared error and the learning rate versus iteration number.]