Widrow-Hoff Learning (LMS)


Ch 10: Widrow-Hoff Learning (LMS Algorithm)

In this chapter we apply the principles of performance learning to a single-layer linear neural network. Widrow-Hoff learning is an approximate steepest descent algorithm in which the performance index is the mean square error.

Bernard Widrow began working on neural networks in the late 1950s, at about the same time that Frank Rosenblatt developed the perceptron learning rule.

Widrow and Hoff introduced the ADALINE (ADAptive LInear NEuron) network. Its learning rule is called the LMS (Least Mean Square) algorithm.

ADALINE is similar to the perceptron, except that its transfer function is linear instead of hard-limiting.


Widrow, B., and Hoff, M. E., Jr., 1960, Adaptive switching circuits, in 1960 IRE WESCON Convention Record, Part 4, New York: IRE, pp. 96-104.

Widrow, B., and Lehr, M. A., 1990, 30 years of adaptive neural networks: Perceptron, madaline, and backpropagation, Proc. IEEE, 78:1415-1441.

Widrow, B., and Stearns, S. D., 1985, Adaptive Signal Processing, Englewood Cliffs, NJ: Prentice-Hall.

Both have the same limitation: they can only solve linearly separable problems.

The LMS algorithm minimizes mean square error, and therefore tries to move the decision boundaries as far from the training patterns as possible.

The LMS algorithm has found many more practical uses than the perceptron learning rule (for example, most long-distance phone lines use ADALINE networks for echo cancellation).


    ADALINE Network

$$\mathbf{a} = \mathrm{purelin}(\mathbf{W}\mathbf{p} + \mathbf{b}) = \mathbf{W}\mathbf{p} + \mathbf{b}$$

$$a_i = \mathrm{purelin}(n_i) = \mathrm{purelin}({}_i\mathbf{w}^T\mathbf{p} + b_i) = {}_i\mathbf{w}^T\mathbf{p} + b_i$$

$${}_i\mathbf{w} = \begin{bmatrix} w_{i,1} \\ w_{i,2} \\ \vdots \\ w_{i,R} \end{bmatrix}$$

${}_i\mathbf{w}$ is made up of the elements of the $i$th row of $\mathbf{W}$.

    Two-Input ADALINE

$$a = \mathrm{purelin}(n) = \mathrm{purelin}({}_1\mathbf{w}^T\mathbf{p} + b) = {}_1\mathbf{w}^T\mathbf{p} + b$$

$$a = {}_1\mathbf{w}^T\mathbf{p} + b = w_{1,1}p_1 + w_{1,2}p_2 + b$$

Like the perceptron, the ADALINE has a decision boundary, which is determined by the input vectors for which the net input $n$ is zero.
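As an illustration (not part of the original slides), a minimal NumPy sketch of the ADALINE response $a = \mathrm{purelin}(\mathbf{Wp} + \mathbf{b})$; the weights, bias, and input below are made-up values:

```python
import numpy as np

def purelin(n):
    """Linear transfer function: a = n."""
    return n

def adaline(W, b, p):
    """ADALINE response a = purelin(Wp + b)."""
    return purelin(W @ p + b)

# Hypothetical two-input, single-neuron example: a = w11*p1 + w12*p2 + b
W = np.array([[1.0, 1.0]])
b = np.array([-1.0])
p = np.array([0.5, 0.5])

print(adaline(W, b, p))   # [0.] -> this p lies on the decision boundary n = 0
```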


    Mean Square Error

Training set: $\{\mathbf{p}_1, t_1\}, \{\mathbf{p}_2, t_2\}, \ldots, \{\mathbf{p}_Q, t_Q\}$

The LMS algorithm is an example of supervised training.

Input: $\mathbf{p}_q$    Target: $t_q$

Notation:
$$\mathbf{x} = \begin{bmatrix} {}_1\mathbf{w} \\ b \end{bmatrix}, \qquad \mathbf{z} = \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}, \qquad a = {}_1\mathbf{w}^T\mathbf{p} + b = \mathbf{x}^T\mathbf{z}$$

Mean square error:
$$F(\mathbf{x}) = E[e^2] = E[(t-a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$$

The expectation is taken over all sets of input/target pairs.

    Error Analysis

$$F(\mathbf{x}) = E[e^2] = E[(t-a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$$

$$F(\mathbf{x}) = E[t^2 - 2t\,\mathbf{x}^T\mathbf{z} + \mathbf{x}^T\mathbf{z}\mathbf{z}^T\mathbf{x}]$$

$$F(\mathbf{x}) = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\mathbf{x}$$

This can be written in the following convenient form:

$$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$$

where
$$c = E[t^2], \qquad \mathbf{h} = E[t\mathbf{z}], \qquad \mathbf{R} = E[\mathbf{z}\mathbf{z}^T].$$
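As an illustration (not from the slides), a NumPy sketch that estimates $c$, $\mathbf{h}$, and $\mathbf{R}$ by sample averages over a hypothetical training set and then evaluates $F(\mathbf{x})$:

```python
import numpy as np

# Hypothetical training samples: rows of P are inputs p_q, t holds targets t_q
P = np.array([[1.0,  1.0, -1.0],
              [1.0, -1.0, -1.0]])
t = np.array([-1.0, 1.0])

# Augmented input z = [p; 1], so that a = x^T z with x = [w_1; b]
Z = np.hstack([P, np.ones((P.shape[0], 1))])

# Sample estimates of c = E[t^2], h = E[t z], R = E[z z^T]
c = np.mean(t**2)
h = (Z * t[:, None]).mean(axis=0)
R = (Z.T @ Z) / Z.shape[0]

def F(x):
    """Mean square error F(x) = c - 2 x^T h + x^T R x."""
    return c - 2 * x @ h + x @ R @ x

x = np.zeros(Z.shape[1])
print(F(x))   # equals c when x = 0
```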


The vector $\mathbf{h}$ gives the cross-correlation between the input vector and its associated target. $\mathbf{R}$ is the input correlation matrix. The diagonal elements of $\mathbf{R}$ are equal to the mean square values of the elements of the input vectors.

The mean square error is therefore a quadratic function:
$$F(\mathbf{x}) = c + \mathbf{d}^T\mathbf{x} + \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}, \qquad \mathbf{d} = -2\mathbf{h}, \quad \mathbf{A} = 2\mathbf{R}.$$

    Stationary Point

Hessian matrix: $\mathbf{A} = 2\mathbf{R}$

The correlation matrix $\mathbf{R}$ must be at least positive semidefinite. In fact, it can be shown that all correlation matrices are either positive definite or positive semidefinite. If there are any zero eigenvalues, the performance index will either have a weak minimum or no stationary point (depending on $\mathbf{d} = -2\mathbf{h}$); see Ch 8.

$$\nabla F(\mathbf{x}) = \nabla\!\left(c + \mathbf{d}^T\mathbf{x} + \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}\right) = \mathbf{d} + \mathbf{A}\mathbf{x} = -2\mathbf{h} + 2\mathbf{R}\mathbf{x}$$

Stationary point:
$$-2\mathbf{h} + 2\mathbf{R}\mathbf{x} = \mathbf{0}$$


If $\mathbf{R}$ (the correlation matrix) is positive definite:
$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$$

If we could calculate the statistical quantities $\mathbf{h}$ and $\mathbf{R}$, we could find the minimum point directly from the equation above. But it is usually neither desirable nor convenient to compute $\mathbf{h}$ and $\mathbf{R}$, so instead we use an approximate steepest descent algorithm.

    Approximate Steepest Descent

Approximate mean square error (one sample):
$$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k)$$

Approximate (stochastic) gradient:
$$\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k)$$

The expectation of the squared error has been replaced by the squared error at iteration $k$.

$$[\nabla e^2(k)]_j = \frac{\partial e^2(k)}{\partial w_{1,j}} = 2e(k)\frac{\partial e(k)}{\partial w_{1,j}}, \qquad j = 1, 2, \ldots, R$$

$$[\nabla e^2(k)]_{R+1} = \frac{\partial e^2(k)}{\partial b} = 2e(k)\frac{\partial e(k)}{\partial b}$$


    Approximate Gradient Calculation

$$\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial[t(k) - a(k)]}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}}\left[t(k) - \left({}_1\mathbf{w}^T\mathbf{p}(k) + b\right)\right]$$

$$\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}}\left[t(k) - \left(\sum_{i=1}^{R} w_{1,i}\,p_i(k) + b\right)\right]$$

where $p_i(k)$ is the $i$th element of the input vector at the $k$th iteration.

$$\frac{\partial e(k)}{\partial w_{1,j}} = -p_j(k), \qquad \frac{\partial e(k)}{\partial b} = -1$$

$$\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k) = -2e(k)\mathbf{z}(k)$$

Now we can see the beauty of approximating the mean square error by the single error at iteration $k$, as in
$$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k).$$

This approximation to $\nabla F(\mathbf{x})$ can now be used in the steepest descent algorithm:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\,\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$$


If we substitute $\hat{\nabla}F(\mathbf{x})$ for $\nabla F(\mathbf{x})$:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha e(k)\mathbf{z}(k)$$

$${}_1\mathbf{w}(k+1) = {}_1\mathbf{w}(k) + 2\alpha e(k)\mathbf{p}(k)$$

$$b(k+1) = b(k) + 2\alpha e(k)$$

These last two equations make up the LMS algorithm, also called the delta rule or the Widrow-Hoff learning algorithm.
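A minimal single-neuron LMS sketch (not from the slides; NumPy assumed, and the learning rate, epoch count, and toy data are illustrative):

```python
import numpy as np

def lms_train(P, t, alpha, epochs, seed=None):
    """Single-neuron LMS: w(k+1) = w(k) + 2*alpha*e(k)*p(k),
                          b(k+1) = b(k) + 2*alpha*e(k)."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(P.shape[1]), 0.0
    for _ in range(epochs):
        for q in rng.permutation(len(P)):    # present the examples in random order
            p, target = P[q], t[q]
            e = target - (w @ p + b)         # e(k) = t(k) - purelin(w^T p + b)
            w = w + 2 * alpha * e * p
            b = b + 2 * alpha * e
    return w, b

# Illustrative data (not from the text)
P = np.array([[1.0, 1.0, -1.0], [1.0, -1.0, -1.0]])
t = np.array([-1.0, 1.0])
print(lms_train(P, t, alpha=0.1, epochs=50, seed=0))
```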

    Multiple-Neuron Case

$${}_i\mathbf{w}(k+1) = {}_i\mathbf{w}(k) + 2\alpha e_i(k)\mathbf{p}(k)$$

$$b_i(k+1) = b_i(k) + 2\alpha e_i(k)$$

Matrix form:

$$\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha\,\mathbf{e}(k)\,\mathbf{p}^T(k)$$

$$\mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha\,\mathbf{e}(k)$$
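For a whole layer, the matrix-form update can be sketched as follows (illustrative NumPy helper, not from the slides):

```python
import numpy as np

def lms_step(W, b, p, t, alpha):
    """One multi-neuron LMS step: W(k+1) = W(k) + 2*alpha*e(k)*p(k)^T,
                                  b(k+1) = b(k) + 2*alpha*e(k)."""
    e = t - (W @ p + b)                  # error vector, one element per neuron
    return W + 2 * alpha * np.outer(e, p), b + 2 * alpha * e
```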


    Analysis of Convergence

Note that $\mathbf{x}_k$ is a function only of $\mathbf{z}(k-1), \mathbf{z}(k-2), \ldots, \mathbf{z}(0)$. If the successive input vectors are statistically independent, then $\mathbf{x}_k$ is independent of $\mathbf{z}(k)$.

We will show that for stationary input processes meeting this condition, the expected value of the weight vector converges to
$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}.$$

This is the minimum mean square error $\{E[e_k^2]\}$ solution, as we saw before.

Recall the LMS algorithm:
$$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha e(k)\mathbf{z}(k)$$

Take the expectation of both sides and substitute $t(k) - \mathbf{x}_k^T\mathbf{z}(k)$ for the error $e(k)$:
$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\mathbf{z}(k)] - E[(\mathbf{x}_k^T\mathbf{z}(k))\,\mathbf{z}(k)]\right\}$$

$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\mathbf{z}(k)] - E[\mathbf{z}(k)\mathbf{z}^T(k)\,\mathbf{x}_k]\right\}$$


Since $\mathbf{x}_k$ is independent of $\mathbf{z}(k)$:
$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\{\mathbf{h} - \mathbf{R}\,E[\mathbf{x}_k]\}$$

$$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$$

For stability, the eigenvalues of this matrix must fall inside the unit circle.

Conditions for Stability

$$\mathrm{eig}[\mathbf{I} - 2\alpha\mathbf{R}] = 1 - 2\alpha\lambda_i$$

(where $\lambda_i$ is an eigenvalue of $\mathbf{R}$), so stability requires
$$|1 - 2\alpha\lambda_i| < 1.$$

Since $\lambda_i > 0$, we always have $1 - 2\alpha\lambda_i < 1$. Therefore the stability condition simplifies to
$$1 - 2\alpha\lambda_i > -1 \quad\Longrightarrow\quad \alpha < \frac{1}{\lambda_i} \;\text{ for all } i,$$

$$0 < \alpha < \frac{1}{\lambda_{\max}}.$$

Note: this is the same condition as for the steepest descent (SD) algorithm, except that in SD we use the Hessian matrix $\mathbf{A}$, while here we use the input correlation matrix $\mathbf{R}$ (recall that $\mathbf{A} = 2\mathbf{R}$).


    Steady State Response

$$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$$

If the system is stable, a steady state is reached in which
$$E[\mathbf{x}_{ss}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_{ss}] + 2\alpha\mathbf{h}.$$

The solution to this equation is
$$E[\mathbf{x}_{ss}] = \mathbf{R}^{-1}\mathbf{h} = \mathbf{x}^*.$$

This is also the strong minimum of the performance index. Thus the LMS solution, obtained by applying one input at a time, is the same as the minimum mean square error solution $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$.
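To see this numerically, here is a sketch (illustrative data and learning rate, NumPy assumed; not part of the slides) that runs LMS on a stationary random input and compares the result with $\mathbf{R}^{-1}\mathbf{h}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stationary inputs z and targets t from a fixed linear relation plus noise (made up)
x_true = np.array([0.5, -1.0, 0.25])
Z = rng.standard_normal((20000, 3))
t = Z @ x_true + 0.1 * rng.standard_normal(20000)

R = (Z.T @ Z) / len(Z)          # input correlation matrix E[z z^T]
h = (Z * t[:, None]).mean(0)    # cross-correlation E[t z]
x_star = np.linalg.solve(R, h)  # minimum mean square error solution

alpha = 0.01                    # chosen to satisfy 0 < alpha < 1/lambda_max(R)
x = np.zeros(3)
for z_k, t_k in zip(Z, t):
    e = t_k - x @ z_k
    x = x + 2 * alpha * e * z_k

print(x_star)                   # close to x_true
print(x)                        # LMS weights hover around x_star
```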

Example

$$\mathbf{p}_1 = \begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix},\; t_1 = -1 \;\text{(banana)}; \qquad \mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix},\; t_2 = 1 \;\text{(apple)}$$

If the inputs are generated randomly with equal probability, the input correlation matrix is
$$\mathbf{R} = E[\mathbf{p}\mathbf{p}^T] = \tfrac{1}{2}\mathbf{p}_1\mathbf{p}_1^T + \tfrac{1}{2}\mathbf{p}_2\mathbf{p}_2^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & -1 & 1 \end{bmatrix}$$

$$\lambda_1 = 1.0, \quad \lambda_2 = 0.0, \quad \lambda_3 = 2.0 \qquad\Longrightarrow\qquad \alpha < \frac{1}{\lambda_{\max}} = \frac{1}{2.0} = 0.5$$

We take $\alpha = 0.2$. (Note: in practice it is difficult to calculate $\mathbf{R}$ and $\lambda_{\max}$, so $\alpha$ is usually chosen by trial and error.)
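These numbers are easy to check (a sketch, NumPy assumed):

```python
import numpy as np

# Prototype patterns (banana, apple), assumed equally probable
p1 = np.array([-1.0, 1.0, -1.0])
p2 = np.array([ 1.0, 1.0, -1.0])

R = 0.5 * np.outer(p1, p1) + 0.5 * np.outer(p2, p2)   # E[p p^T]
lam = np.linalg.eigvalsh(R)
print(lam)               # [0. 1. 2.]
print(1.0 / lam.max())   # 0.5 -> any 0 < alpha < 0.5 is stable
```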


Iteration One

Present the banana, $\mathbf{p}(0) = \mathbf{p}_1$, with $\mathbf{W}(0)$ selected arbitrarily (here, all zeros):
$$a(0) = \mathbf{W}(0)\mathbf{p}(0) = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = 0$$

$$e(0) = t(0) - a(0) = t_1 - a(0) = -1 - 0 = -1$$

$$\mathbf{W}(1) = \mathbf{W}(0) + 2\alpha e(0)\mathbf{p}^T(0) = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix} + 2(0.2)(-1)\begin{bmatrix} -1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}$$

Iteration Two

Present the apple, $\mathbf{p}(1) = \mathbf{p}_2$:
$$a(1) = \mathbf{W}(1)\mathbf{p}(1) = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = -0.4$$

$$e(1) = t(1) - a(1) = t_2 - a(1) = 1 - (-0.4) = 1.4$$

$$\mathbf{W}(2) = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix} + 2(0.2)(1.4)\begin{bmatrix} 1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}$$


Iteration Three

Present the banana again:
$$a(2) = \mathbf{W}(2)\mathbf{p}(2) = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = -0.64$$

$$e(2) = t(2) - a(2) = t_1 - a(2) = -1 - (-0.64) = -0.36$$

$$\mathbf{W}(3) = \mathbf{W}(2) + 2\alpha e(2)\mathbf{p}^T(2) = \begin{bmatrix} 1.1040 & 0.0160 & -0.0160 \end{bmatrix}$$

If we continue this procedure, the algorithm converges to $\mathbf{W}(\infty) = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$.
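A short sketch that reproduces these three iterations (NumPy assumed, not part of the original slides):

```python
import numpy as np

alpha = 0.2
patterns = [(np.array([-1.0, 1.0, -1.0]), -1.0),   # banana
            (np.array([ 1.0, 1.0, -1.0]),  1.0)]   # apple

W = np.zeros(3)                     # W(0) selected arbitrarily as zero
for k in range(3):
    p, t = patterns[k % 2]          # alternate banana, apple, banana, ...
    e = t - W @ p                   # no bias is used in this example
    W = W + 2 * alpha * e * p
    print(k + 1, W)
# 1 [ 0.4 -0.4  0.4]
# 2 [ 0.96  0.16 -0.16]
# 3 [ 1.104  0.016 -0.016]
```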

Some general comments on the learning process:

Computationally, the learning process goes through all of the training examples (an epoch) a number of times, until a stopping criterion is reached.

The convergence process can be monitored with a plot of the mean squared error function $F(\mathbf{W}(k))$.


The popular stopping criteria are:

- the mean squared error is sufficiently small, i.e. $F(\mathbf{W}(k))$ falls below a chosen threshold;

- the rate of change of the mean squared error is sufficiently small.
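A training-loop sketch with both stopping criteria (illustrative thresholds, NumPy assumed; not from the slides):

```python
import numpy as np

def lms_train(P, t, alpha, mse_goal=1e-4, min_delta=1e-6, max_epochs=1000):
    """Train an ADALINE with LMS until the mean squared error is small enough
    or it stops changing appreciably (both thresholds are illustrative)."""
    w, b = np.zeros(P.shape[1]), 0.0
    prev_mse = np.inf
    for epoch in range(max_epochs):
        for p, target in zip(P, t):              # one pass over the set = one epoch
            e = target - (w @ p + b)
            w = w + 2 * alpha * e * p
            b = b + 2 * alpha * e
        mse = np.mean((t - (P @ w + b)) ** 2)    # monitor F(W(k))
        if mse < mse_goal or abs(prev_mse - mse) < min_delta:
            break
        prev_mse = mse
    return w, b, mse
```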

    Adaptive Filtering

The ADALINE is one of the most widely used neural networks in practical applications. One of its major application areas has been adaptive filtering.

    Tapped Delay Line Adaptive Filter



$$a(k) = \mathrm{purelin}(\mathbf{W}\mathbf{p} + b) = \sum_{i=1}^{R} w_{1,i}\,y(k - i + 1) + b$$

If you are familiar with digital signal processing, you will recognize this network as a finite impulse response (FIR) filter.
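A sketch of such an adaptive FIR filter trained with LMS (NumPy assumed; the function and argument names are illustrative, not from the slides):

```python
import numpy as np

def adaptive_fir(y, target, taps, alpha):
    """LMS-trained adaptive FIR filter.
    The input at time k is the tapped delay line [y(k), y(k-1), ..., y(k-taps+1)]."""
    w, b = np.zeros(taps), 0.0
    buf = np.zeros(taps)                   # tapped delay line contents
    out, err = [], []
    for k in range(len(y)):
        buf = np.roll(buf, 1)
        buf[0] = y[k]                      # newest sample enters the first tap
        a = w @ buf + b                    # a(k) = sum_i w_{1,i} y(k-i+1) + b
        e = target[k] - a
        w = w + 2 * alpha * e * buf
        b = b + 2 * alpha * e
        out.append(a)
        err.append(e)
    return np.array(out), np.array(err), w, b
```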

    Example: Noise Cancellation



Noise Cancellation Adaptive Filter

A two-input filter can attenuate and phase-shift the noise in the desired way.

Correlation Matrix

To analyze this system we need to find the input correlation matrix $\mathbf{R}$ and the input/target cross-correlation vector $\mathbf{h}$:
$$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T], \qquad \mathbf{h} = E[t\mathbf{z}]$$

In this case
$$\mathbf{z}(k) = \begin{bmatrix} v(k) \\ v(k-1) \end{bmatrix}, \qquad t(k) = s(k) + m(k),$$

so
$$\mathbf{R} = \begin{bmatrix} E[v^2(k)] & E[v(k)v(k-1)] \\ E[v(k-1)v(k)] & E[v^2(k-1)] \end{bmatrix}$$

$$\mathbf{h} = \begin{bmatrix} E[(s(k)+m(k))\,v(k)] \\ E[(s(k)+m(k))\,v(k-1)] \end{bmatrix}$$


Stationary Point

To find the first element of $\mathbf{h}$, expand
$$E[(s(k)+m(k))\,v(k)] = E[s(k)v(k)] + E[m(k)v(k)].$$

The first term is zero because $s(k)$ and $v(k)$ are independent and zero mean. The second term is evaluated over one period of the noise:
$$E[m(k)v(k)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\!\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)\left(1.2\sin\frac{2\pi k}{3}\right) = -0.51$$

Now we find the second element of $\mathbf{h}$:
$$E[(s(k)+m(k))\,v(k-1)] = E[s(k)v(k-1)] + E[m(k)v(k-1)]$$

Again the first term is zero, and
$$E[m(k)v(k-1)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\!\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)\left(1.2\sin\frac{2\pi(k-1)}{3}\right) = 0.70$$

$$\mathbf{h} = \begin{bmatrix} -0.51 \\ 0.70 \end{bmatrix}$$

The minimum point is
$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h} = \begin{bmatrix} 0.72 & -0.36 \\ -0.36 & 0.72 \end{bmatrix}^{-1}\begin{bmatrix} -0.51 \\ 0.70 \end{bmatrix} = \begin{bmatrix} -0.30 \\ 0.82 \end{bmatrix}$$
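These expectations can be checked numerically. The sketch below (NumPy assumed, not part of the slides) uses the noise source $v(k) = 1.2\sin(2\pi k/3)$ and filtered noise $m(k) = 1.2\sin(2\pi k/3 - 3\pi/4)$ implied by the sums above:

```python
import numpy as np

k = np.arange(1, 4)                                       # one period (3 samples)
v  = 1.2 * np.sin(2 * np.pi * k / 3)                      # noise source v(k)
v1 = 1.2 * np.sin(2 * np.pi * (k - 1) / 3)                # delayed noise v(k-1)
m  = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)      # filtered noise m(k)

R = np.array([[np.mean(v * v),  np.mean(v * v1)],
              [np.mean(v1 * v), np.mean(v1 * v1)]])
h = np.array([np.mean(m * v), np.mean(m * v1)])           # E[s v] terms vanish

print(R)                      # approx [[0.72, -0.36], [-0.36, 0.72]]
print(h)                      # approx [-0.51, 0.70]
print(np.linalg.solve(R, h))  # x* approx [-0.30, 0.82]
```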


Now, what kind of error will we have at the minimum solution?


Performance Index

$$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$$

We have just found $\mathbf{x}^*$, $\mathbf{R}$, and $\mathbf{h}$. To find $c$ we expand
$$c = E[t^2(k)] = E[(s(k) + m(k))^2]$$

$$c = E[s^2(k)] + 2E[s(k)m(k)] + E[m^2(k)].$$

The middle term is zero because $s(k)$ and $m(k)$ are independent and zero mean. Since $s(k)$ is uniformly distributed between $-0.2$ and $0.2$,
$$E[s^2(k)] = \frac{1}{0.4}\int_{-0.2}^{0.2} s^2\,ds = \frac{1}{3(0.4)}s^3\Big|_{-0.2}^{0.2} = 0.0133$$

$$E[m^2(k)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\!\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)^2 = 0.72$$

$$c = 0.0133 + 0.72 = 0.7333$$

The minimum mean square error is then
$$F(\mathbf{x}^*) = c - 2\mathbf{x}^{*T}\mathbf{h} + \mathbf{x}^{*T}\mathbf{R}\mathbf{x}^* = 0.7333 - 2(0.72) + 0.72 = 0.0133.$$

The minimum mean square error is the same as the mean square value of the EEG signal. This is what we expected, since the error of this adaptive noise canceller is in fact the reconstructed EEG signal.
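And a quick numerical check of $c$ and $F(\mathbf{x}^*)$ (NumPy assumed; the signal definitions are the same ones used in the sketch above):

```python
import numpy as np

k = np.arange(1, 4)                                   # one period of the noise
v  = 1.2 * np.sin(2 * np.pi * k / 3)
v1 = 1.2 * np.sin(2 * np.pi * (k - 1) / 3)
m  = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)

R = np.array([[np.mean(v * v),  np.mean(v * v1)],
              [np.mean(v1 * v), np.mean(v1 * v1)]])
h = np.array([np.mean(m * v), np.mean(m * v1)])
x_star = np.linalg.solve(R, h)

c = 0.4**2 / 12 + np.mean(m**2)                       # E[s^2] for uniform[-0.2, 0.2] plus E[m^2]
F_min = c - 2 * x_star @ h + x_star @ R @ x_star
print(c, F_min)                                       # approx 0.7333 and 0.0133
```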


LMS Response for α = 0.1

[Figure: trajectory of the weights $W_{1,1}$ and $W_{1,2}$ during adaptation, overlaid on the contours of the mean square error.]

The LMS trajectory looks like a noisy version of steepest descent.

Note that the contours in this figure reflect the fact that the eigenvalues and eigenvectors of the Hessian matrix $\mathbf{A} = 2\mathbf{R}$ are
$$\lambda_1 = 2.16,\;\; \mathbf{z}_1 = \begin{bmatrix} 0.7071 \\ -0.7071 \end{bmatrix}; \qquad \lambda_2 = 0.72,\;\; \mathbf{z}_2 = \begin{bmatrix} 0.7071 \\ 0.7071 \end{bmatrix}.$$

If the learning rate is decreased, the LMS trajectory is smoother, but the learning proceeds more slowly.

Note that the maximum learning rate for stability is $\alpha = 2/2.16 = 0.926$.


Note that the error does not go to zero, because the LMS algorithm is approximate steepest descent; it uses an estimate of the gradient, not the true gradient. (Demonstration: nnd10eeg)

    Echo Cancellation



    HW

    Ch 4: E 2, 4, 6, 7

    Ch 5: 5, 7, 9

    Ch 6: 4, 5, 8, 10

    Ch 7: 1, 5, 6, 7

    Ch 8: 2, 4, 5


    Ch 9: 2, 5, 6

    Ch 10: 3, 6, 7