Widrow-Hoff Learning (LMS)


Ch 10: Widrow-Hoff Learning (LMS Algorithm)

In this chapter we apply the principles of performance learning to a single-layer linear neural network. Widrow-Hoff learning is an approximate steepest descent algorithm in which the performance index is the mean square error.

Bernard Widrow began working on neural networks in the late 1950s, at about the same time that Frank Rosenblatt developed the perceptron learning rule.

Widrow and Hoff introduced the ADALINE (ADAptive LInear NEuron) network. Its learning rule is called the LMS (Least Mean Square) algorithm.

ADALINE is similar to the perceptron, except that its transfer function is linear instead of hard-limiting.


Widrow, B., and Hoff, M. E., Jr., 1960, Adaptive switching circuits, in 1960 IRE WESCON Convention Record, Part 4, New York: IRE, pp. 96-104.

Widrow, B., and Lehr, M. A., 1990, 30 years of adaptive neural networks: Perceptron, madaline, and backpropagation, Proc. IEEE, 78:1415-1441.

Widrow, B., and Stearns, S. D., 1985, Adaptive Signal Processing, Englewood Cliffs, NJ: Prentice-Hall.

Both have the same limitation: they can only solve linearly separable problems.

The LMS algorithm minimizes mean square error, and therefore tries to move the decision boundaries as far from the training patterns as possible.

The LMS algorithm has found many more practical uses than the perceptron learning rule (for example, most long-distance phone lines use ADALINE networks for echo cancellation).


    ADALINE Network

$$\mathbf{a} = \mathrm{purelin}(\mathbf{W}\mathbf{p} + \mathbf{b}) = \mathbf{W}\mathbf{p} + \mathbf{b}$$

$$a_i = \mathrm{purelin}(n_i) = \mathrm{purelin}({}_i\mathbf{w}^T\mathbf{p} + b_i) = {}_i\mathbf{w}^T\mathbf{p} + b_i$$

$${}_i\mathbf{w} = \begin{bmatrix} w_{i,1} \\ w_{i,2} \\ \vdots \\ w_{i,R} \end{bmatrix}$$

${}_i\mathbf{w}$ is made up of the elements of the $i$th row of $\mathbf{W}$.

    Two-Input ADALINE

$$a = \mathrm{purelin}(n) = \mathrm{purelin}({}_1\mathbf{w}^T\mathbf{p} + b) = {}_1\mathbf{w}^T\mathbf{p} + b$$

$$a = {}_1\mathbf{w}^T\mathbf{p} + b = w_{1,1}p_1 + w_{1,2}p_2 + b$$

Like the perceptron, the ADALINE has a decision boundary, which is determined by the input vectors for which the net input $n$ is zero.
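As an illustration (not part of the original slides), a minimal NumPy sketch of the ADALINE response $a = \mathrm{purelin}(\mathbf{Wp} + \mathbf{b})$; the weights, bias, and input below are made-up values:

```python
import numpy as np

def purelin(n):
    """Linear transfer function: a = n."""
    return n

def adaline(W, b, p):
    """ADALINE response a = purelin(Wp + b)."""
    return purelin(W @ p + b)

# Hypothetical two-input, single-neuron example: a = w11*p1 + w12*p2 + b
W = np.array([[1.0, 1.0]])
b = np.array([-1.0])
p = np.array([0.5, 0.5])

print(adaline(W, b, p))   # [0.] -> this p lies on the decision boundary n = 0
```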


    Mean Square Error

Training set: $\{\mathbf{p}_1, t_1\}, \{\mathbf{p}_2, t_2\}, \ldots, \{\mathbf{p}_Q, t_Q\}$

The LMS algorithm is an example of supervised training.

Input: $\mathbf{p}_q$    Target: $t_q$

Notation:
$$\mathbf{x} = \begin{bmatrix} {}_1\mathbf{w} \\ b \end{bmatrix}, \qquad \mathbf{z} = \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}, \qquad a = {}_1\mathbf{w}^T\mathbf{p} + b = \mathbf{x}^T\mathbf{z}$$

Mean square error:
$$F(\mathbf{x}) = E[e^2] = E[(t-a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$$

The expectation is taken over all sets of input/target pairs.

    Error Analysis

$$F(\mathbf{x}) = E[e^2] = E[(t-a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$$

$$F(\mathbf{x}) = E[t^2 - 2t\,\mathbf{x}^T\mathbf{z} + \mathbf{x}^T\mathbf{z}\mathbf{z}^T\mathbf{x}]$$

$$F(\mathbf{x}) = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\mathbf{x}$$

This can be written in the following convenient form:

$$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$$

where
$$c = E[t^2], \qquad \mathbf{h} = E[t\mathbf{z}], \qquad \mathbf{R} = E[\mathbf{z}\mathbf{z}^T].$$
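As an illustration (not from the slides), a NumPy sketch that estimates $c$, $\mathbf{h}$, and $\mathbf{R}$ by sample averages over a hypothetical training set and then evaluates $F(\mathbf{x})$:

```python
import numpy as np

# Hypothetical training samples: rows of P are inputs p_q, t holds targets t_q
P = np.array([[1.0,  1.0, -1.0],
              [1.0, -1.0, -1.0]])
t = np.array([-1.0, 1.0])

# Augmented input z = [p; 1], so that a = x^T z with x = [w_1; b]
Z = np.hstack([P, np.ones((P.shape[0], 1))])

# Sample estimates of c = E[t^2], h = E[t z], R = E[z z^T]
c = np.mean(t**2)
h = (Z * t[:, None]).mean(axis=0)
R = (Z.T @ Z) / Z.shape[0]

def F(x):
    """Mean square error F(x) = c - 2 x^T h + x^T R x."""
    return c - 2 * x @ h + x @ R @ x

x = np.zeros(Z.shape[1])
print(F(x))   # equals c when x = 0
```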


The vector $\mathbf{h}$ gives the cross-correlation between the input vector and its associated target. $\mathbf{R}$ is the input correlation matrix. The diagonal elements of $\mathbf{R}$ are equal to the mean square values of the elements of the input vectors.

The mean square error is therefore a quadratic function:
$$F(\mathbf{x}) = c + \mathbf{d}^T\mathbf{x} + \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}, \qquad \mathbf{d} = -2\mathbf{h}, \quad \mathbf{A} = 2\mathbf{R}.$$

    Stationary Point

Hessian matrix: $\mathbf{A} = 2\mathbf{R}$

The correlation matrix $\mathbf{R}$ must be at least positive semidefinite. In fact, it can be shown that all correlation matrices are either positive definite or positive semidefinite. If there are any zero eigenvalues, the performance index will either have a weak minimum or no stationary point (depending on $\mathbf{d} = -2\mathbf{h}$); see Ch 8.

$$\nabla F(\mathbf{x}) = \nabla\!\left(c + \mathbf{d}^T\mathbf{x} + \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}\right) = \mathbf{d} + \mathbf{A}\mathbf{x} = -2\mathbf{h} + 2\mathbf{R}\mathbf{x}$$

Stationary point:
$$-2\mathbf{h} + 2\mathbf{R}\mathbf{x} = \mathbf{0}$$


If $\mathbf{R}$ (the correlation matrix) is positive definite:
$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$$

If we could calculate the statistical quantities $\mathbf{h}$ and $\mathbf{R}$, we could find the minimum point directly from the equation above. But it is usually neither desirable nor convenient to compute $\mathbf{h}$ and $\mathbf{R}$, so instead we use an approximate steepest descent algorithm.

    Approximate Steepest Descent

Approximate mean square error (one sample):
$$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k)$$

Approximate (stochastic) gradient:
$$\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k)$$

The expectation of the squared error has been replaced by the squared error at iteration $k$.

$$[\nabla e^2(k)]_j = \frac{\partial e^2(k)}{\partial w_{1,j}} = 2e(k)\frac{\partial e(k)}{\partial w_{1,j}}, \qquad j = 1, 2, \ldots, R$$

$$[\nabla e^2(k)]_{R+1} = \frac{\partial e^2(k)}{\partial b} = 2e(k)\frac{\partial e(k)}{\partial b}$$


    Approximate Gradient Calculation

$$\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial[t(k) - a(k)]}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}}\left[t(k) - \left({}_1\mathbf{w}^T\mathbf{p}(k) + b\right)\right]$$

$$\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}}\left[t(k) - \left(\sum_{i=1}^{R} w_{1,i}\,p_i(k) + b\right)\right]$$

where $p_i(k)$ is the $i$th element of the input vector at the $k$th iteration.

$$\frac{\partial e(k)}{\partial w_{1,j}} = -p_j(k), \qquad \frac{\partial e(k)}{\partial b} = -1$$

$$\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k) = -2e(k)\mathbf{z}(k)$$

Now we can see the beauty of approximating the mean square error by the single error at iteration $k$, as in
$$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k).$$

This approximation to $\nabla F(\mathbf{x})$ can now be used in the steepest descent algorithm:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\,\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$$


If we substitute $\hat{\nabla}F(\mathbf{x})$ for $\nabla F(\mathbf{x})$:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha e(k)\mathbf{z}(k)$$

$${}_1\mathbf{w}(k+1) = {}_1\mathbf{w}(k) + 2\alpha e(k)\mathbf{p}(k)$$

$$b(k+1) = b(k) + 2\alpha e(k)$$

These last two equations make up the LMS algorithm, also called the delta rule or the Widrow-Hoff learning algorithm.
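A minimal single-neuron LMS sketch (not from the slides; NumPy assumed, and the learning rate, epoch count, and toy data are illustrative):

```python
import numpy as np

def lms_train(P, t, alpha, epochs, seed=None):
    """Single-neuron LMS: w(k+1) = w(k) + 2*alpha*e(k)*p(k),
                          b(k+1) = b(k) + 2*alpha*e(k)."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(P.shape[1]), 0.0
    for _ in range(epochs):
        for q in rng.permutation(len(P)):    # present the examples in random order
            p, target = P[q], t[q]
            e = target - (w @ p + b)         # e(k) = t(k) - purelin(w^T p + b)
            w = w + 2 * alpha * e * p
            b = b + 2 * alpha * e
    return w, b

# Illustrative data (not from the text)
P = np.array([[1.0, 1.0, -1.0], [1.0, -1.0, -1.0]])
t = np.array([-1.0, 1.0])
print(lms_train(P, t, alpha=0.1, epochs=50, seed=0))
```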

    Multiple-Neuron Case

$${}_i\mathbf{w}(k+1) = {}_i\mathbf{w}(k) + 2\alpha e_i(k)\mathbf{p}(k)$$

$$b_i(k+1) = b_i(k) + 2\alpha e_i(k)$$

Matrix form:

$$\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha\,\mathbf{e}(k)\,\mathbf{p}^T(k)$$

$$\mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha\,\mathbf{e}(k)$$
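For a whole layer, the matrix-form update can be sketched as follows (illustrative NumPy helper, not from the slides):

```python
import numpy as np

def lms_step(W, b, p, t, alpha):
    """One multi-neuron LMS step: W(k+1) = W(k) + 2*alpha*e(k)*p(k)^T,
                                  b(k+1) = b(k) + 2*alpha*e(k)."""
    e = t - (W @ p + b)                  # error vector, one element per neuron
    return W + 2 * alpha * np.outer(e, p), b + 2 * alpha * e
```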


    Analysis of Convergence

Note that $\mathbf{x}_k$ is a function only of $\mathbf{z}(k-1), \mathbf{z}(k-2), \ldots, \mathbf{z}(0)$. If the successive input vectors are statistically independent, then $\mathbf{x}_k$ is independent of $\mathbf{z}(k)$.

We will show that for stationary input processes meeting this condition, the expected value of the weight vector converges to
$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}.$$

This is the minimum mean square error $\{E[e_k^2]\}$ solution, as we saw before.

Recall the LMS algorithm:
$$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha e(k)\mathbf{z}(k)$$

Take the expectation of both sides and substitute $t(k) - \mathbf{x}_k^T\mathbf{z}(k)$ for the error $e(k)$:
$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\mathbf{z}(k)] - E[(\mathbf{x}_k^T\mathbf{z}(k))\,\mathbf{z}(k)]\right\}$$

$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\mathbf{z}(k)] - E[\mathbf{z}(k)\mathbf{z}^T(k)\,\mathbf{x}_k]\right\}$$


Since $\mathbf{x}_k$ is independent of $\mathbf{z}(k)$:
$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\{\mathbf{h} - \mathbf{R}\,E[\mathbf{x}_k]\}$$

$$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$$

For stability, the eigenvalues of this matrix must fall inside the unit circle.

Conditions for Stability

$$\mathrm{eig}[\mathbf{I} - 2\alpha\mathbf{R}] = 1 - 2\alpha\lambda_i$$

(where $\lambda_i$ is an eigenvalue of $\mathbf{R}$), so stability requires
$$|1 - 2\alpha\lambda_i| < 1.$$

Since $\lambda_i > 0$, we always have $1 - 2\alpha\lambda_i < 1$. Therefore the stability condition simplifies to
$$1 - 2\alpha\lambda_i > -1 \quad\Longrightarrow\quad \alpha < \frac{1}{\lambda_i} \;\text{ for all } i,$$

$$0 < \alpha < \frac{1}{\lambda_{\max}}.$$

Note: this is the same condition as for the steepest descent (SD) algorithm, except that in SD we use the Hessian matrix $\mathbf{A}$, while here we use the input correlation matrix $\mathbf{R}$ (recall that $\mathbf{A} = 2\mathbf{R}$).


    Steady State Response

$$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$$

If the system is stable, a steady state is reached in which
$$E[\mathbf{x}_{ss}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_{ss}] + 2\alpha\mathbf{h}.$$

The solution to this equation is
$$E[\mathbf{x}_{ss}] = \mathbf{R}^{-1}\mathbf{h} = \mathbf{x}^*.$$

This is also the strong minimum of the performance index. Thus the LMS solution, obtained by applying one input at a time, is the same as the minimum mean square error solution $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$.
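To see this numerically, here is a sketch (illustrative data and learning rate, NumPy assumed; not part of the slides) that runs LMS on a stationary random input and compares the result with $\mathbf{R}^{-1}\mathbf{h}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stationary inputs z and targets t from a fixed linear relation plus noise (made up)
x_true = np.array([0.5, -1.0, 0.25])
Z = rng.standard_normal((20000, 3))
t = Z @ x_true + 0.1 * rng.standard_normal(20000)

R = (Z.T @ Z) / len(Z)          # input correlation matrix E[z z^T]
h = (Z * t[:, None]).mean(0)    # cross-correlation E[t z]
x_star = np.linalg.solve(R, h)  # minimum mean square error solution

alpha = 0.01                    # chosen to satisfy 0 < alpha < 1/lambda_max(R)
x = np.zeros(3)
for z_k, t_k in zip(Z, t):
    e = t_k - x @ z_k
    x = x + 2 * alpha * e * z_k

print(x_star)                   # close to x_true
print(x)                        # LMS weights hover around x_star
```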

Example

$$\mathbf{p}_1 = \begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix},\; t_1 = -1 \;\text{(banana)}; \qquad \mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix},\; t_2 = 1 \;\text{(apple)}$$

If the inputs are generated randomly with equal probability, the input correlation matrix is
$$\mathbf{R} = E[\mathbf{p}\mathbf{p}^T] = \tfrac{1}{2}\mathbf{p}_1\mathbf{p}_1^T + \tfrac{1}{2}\mathbf{p}_2\mathbf{p}_2^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & -1 & 1 \end{bmatrix}$$

$$\lambda_1 = 1.0, \quad \lambda_2 = 0.0, \quad \lambda_3 = 2.0 \qquad\Longrightarrow\qquad \alpha < \frac{1}{\lambda_{\max}} = \frac{1}{2.0} = 0.5$$

We take $\alpha = 0.2$. (Note: in practice it is difficult to calculate $\mathbf{R}$ and $\lambda_{\max}$, so $\alpha$ is usually chosen by trial and error.)
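These numbers are easy to check (a sketch, NumPy assumed):

```python
import numpy as np

# Prototype patterns (banana, apple), assumed equally probable
p1 = np.array([-1.0, 1.0, -1.0])
p2 = np.array([ 1.0, 1.0, -1.0])

R = 0.5 * np.outer(p1, p1) + 0.5 * np.outer(p2, p2)   # E[p p^T]
lam = np.linalg.eigvalsh(R)
print(lam)               # [0. 1. 2.]
print(1.0 / lam.max())   # 0.5 -> any 0 < alpha < 0.5 is stable
```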


Iteration One

Present the banana, $\mathbf{p}(0) = \mathbf{p}_1$, with $\mathbf{W}(0)$ selected arbitrarily (here, all zeros):
$$a(0) = \mathbf{W}(0)\mathbf{p}(0) = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = 0$$

$$e(0) = t(0) - a(0) = t_1 - a(0) = -1 - 0 = -1$$

$$\mathbf{W}(1) = \mathbf{W}(0) + 2\alpha e(0)\mathbf{p}^T(0) = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix} + 2(0.2)(-1)\begin{bmatrix} -1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}$$

Iteration Two

Present the apple, $\mathbf{p}(1) = \mathbf{p}_2$:
$$a(1) = \mathbf{W}(1)\mathbf{p}(1) = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = -0.4$$

$$e(1) = t(1) - a(1) = t_2 - a(1) = 1 - (-0.4) = 1.4$$

$$\mathbf{W}(2) = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix} + 2(0.2)(1.4)\begin{bmatrix} 1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}$$


Iteration Three

Present the banana again:
$$a(2) = \mathbf{W}(2)\mathbf{p}(2) = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = -0.64$$

$$e(2) = t(2) - a(2) = t_1 - a(2) = -1 - (-0.64) = -0.36$$

$$\mathbf{W}(3) = \mathbf{W}(2) + 2\alpha e(2)\mathbf{p}^T(2) = \begin{bmatrix} 1.1040 & 0.0160 & -0.0160 \end{bmatrix}$$

If we continue this procedure, the algorithm converges to $\mathbf{W}(\infty) = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$.
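A short sketch that reproduces these three iterations (NumPy assumed, not part of the original slides):

```python
import numpy as np

alpha = 0.2
patterns = [(np.array([-1.0, 1.0, -1.0]), -1.0),   # banana
            (np.array([ 1.0, 1.0, -1.0]),  1.0)]   # apple

W = np.zeros(3)                     # W(0) selected arbitrarily as zero
for k in range(3):
    p, t = patterns[k % 2]          # alternate banana, apple, banana, ...
    e = t - W @ p                   # no bias is used in this example
    W = W + 2 * alpha * e * p
    print(k + 1, W)
# 1 [ 0.4 -0.4  0.4]
# 2 [ 0.96  0.16 -0.16]
# 3 [ 1.104  0.016 -0.016]
```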

Some general comments on the learning process:

Computationally, the learning process goes through all of the training examples (an epoch) a number of times, until a stopping criterion is reached.

The convergence process can be monitored with a plot of the mean squared error function $F(\mathbf{W}(k))$.


The popular stopping criteria are:

- the mean squared error is sufficiently small, i.e. $F(\mathbf{W}(k))$ falls below a chosen threshold;

- the rate of change of the mean squared error is sufficiently small.
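A training-loop sketch with both stopping criteria (illustrative thresholds, NumPy assumed; not from the slides):

```python
import numpy as np

def lms_train(P, t, alpha, mse_goal=1e-4, min_delta=1e-6, max_epochs=1000):
    """Train an ADALINE with LMS until the mean squared error is small enough
    or it stops changing appreciably (both thresholds are illustrative)."""
    w, b = np.zeros(P.shape[1]), 0.0
    prev_mse = np.inf
    for epoch in range(max_epochs):
        for p, target in zip(P, t):              # one pass over the set = one epoch
            e = target - (w @ p + b)
            w = w + 2 * alpha * e * p
            b = b + 2 * alpha * e
        mse = np.mean((t - (P @ w + b)) ** 2)    # monitor F(W(k))
        if mse < mse_goal or abs(prev_mse - mse) < min_delta:
            break
        prev_mse = mse
    return w, b, mse
```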

    Adaptive Filtering

The ADALINE is one of the most widely used neural networks in practical applications. One of its major application areas has been adaptive filtering.

    Tapped Delay Line Adaptive Filter



$$a(k) = \mathrm{purelin}(\mathbf{W}\mathbf{p} + b) = \sum_{i=1}^{R} w_{1,i}\,y(k - i + 1) + b$$

If you are familiar with digital signal processing, you will recognize this network as a finite impulse response (FIR) filter.
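A sketch of such an adaptive FIR filter trained with LMS (NumPy assumed; the function and argument names are illustrative, not from the slides):

```python
import numpy as np

def adaptive_fir(y, target, taps, alpha):
    """LMS-trained adaptive FIR filter.
    The input at time k is the tapped delay line [y(k), y(k-1), ..., y(k-taps+1)]."""
    w, b = np.zeros(taps), 0.0
    buf = np.zeros(taps)                   # tapped delay line contents
    out, err = [], []
    for k in range(len(y)):
        buf = np.roll(buf, 1)
        buf[0] = y[k]                      # newest sample enters the first tap
        a = w @ buf + b                    # a(k) = sum_i w_{1,i} y(k-i+1) + b
        e = target[k] - a
        w = w + 2 * alpha * e * buf
        b = b + 2 * alpha * e
        out.append(a)
        err.append(e)
    return np.array(out), np.array(err), w, b
```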

    Example: Noise Cancellation



Noise Cancellation Adaptive Filter

A two-input filter can attenuate and phase-shift the noise in the desired way.

Correlation Matrix

To analyze this system we need to find the input correlation matrix $\mathbf{R}$ and the input/target cross-correlation vector $\mathbf{h}$:
$$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T], \qquad \mathbf{h} = E[t\mathbf{z}]$$

In this case
$$\mathbf{z}(k) = \begin{bmatrix} v(k) \\ v(k-1) \end{bmatrix}, \qquad t(k) = s(k) + m(k),$$

so
$$\mathbf{R} = \begin{bmatrix} E[v^2(k)] & E[v(k)v(k-1)] \\ E[v(k-1)v(k)] & E[v^2(k-1)] \end{bmatrix}$$

$$\mathbf{h} = \begin{bmatrix} E[(s(k)+m(k))\,v(k)] \\ E[(s(k)+m(k))\,v(k-1)] \end{bmatrix}$$


Stationary Point

To find the first element of $\mathbf{h}$, expand
$$E[(s(k)+m(k))\,v(k)] = E[s(k)v(k)] + E[m(k)v(k)].$$

The first term is zero because $s(k)$ and $v(k)$ are independent and zero mean. The second term is evaluated over one period of the noise:
$$E[m(k)v(k)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\!\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)\left(1.2\sin\frac{2\pi k}{3}\right) = -0.51$$

Now we find the second element of $\mathbf{h}$:
$$E[(s(k)+m(k))\,v(k-1)] = E[s(k)v(k-1)] + E[m(k)v(k-1)]$$

Again the first term is zero, and
$$E[m(k)v(k-1)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\!\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)\left(1.2\sin\frac{2\pi(k-1)}{3}\right) = 0.70$$

$$\mathbf{h} = \begin{bmatrix} -0.51 \\ 0.70 \end{bmatrix}$$

The minimum point is
$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h} = \begin{bmatrix} 0.72 & -0.36 \\ -0.36 & 0.72 \end{bmatrix}^{-1}\begin{bmatrix} -0.51 \\ 0.70 \end{bmatrix} = \begin{bmatrix} -0.30 \\ 0.82 \end{bmatrix}$$
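These expectations can be checked numerically. The sketch below (NumPy assumed, not part of the slides) uses the noise source $v(k) = 1.2\sin(2\pi k/3)$ and filtered noise $m(k) = 1.2\sin(2\pi k/3 - 3\pi/4)$ implied by the sums above:

```python
import numpy as np

k = np.arange(1, 4)                                       # one period (3 samples)
v  = 1.2 * np.sin(2 * np.pi * k / 3)                      # noise source v(k)
v1 = 1.2 * np.sin(2 * np.pi * (k - 1) / 3)                # delayed noise v(k-1)
m  = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)      # filtered noise m(k)

R = np.array([[np.mean(v * v),  np.mean(v * v1)],
              [np.mean(v1 * v), np.mean(v1 * v1)]])
h = np.array([np.mean(m * v), np.mean(m * v1)])           # E[s v] terms vanish

print(R)                      # approx [[0.72, -0.36], [-0.36, 0.72]]
print(h)                      # approx [-0.51, 0.70]
print(np.linalg.solve(R, h))  # x* approx [-0.30, 0.82]
```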


Now, what kind of error will we have at the minimum solution?


Performance Index

$$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$$

We have just found $\mathbf{x}^*$, $\mathbf{R}$, and $\mathbf{h}$. To find $c$ we expand
$$c = E[t^2(k)] = E[(s(k) + m(k))^2]$$

$$c = E[s^2(k)] + 2E[s(k)m(k)] + E[m^2(k)].$$

The middle term is zero because $s(k)$ and $m(k)$ are independent and zero mean. Since $s(k)$ is uniformly distributed between $-0.2$ and $0.2$,
$$E[s^2(k)] = \frac{1}{0.4}\int_{-0.2}^{0.2} s^2\,ds = \frac{1}{3(0.4)}s^3\Big|_{-0.2}^{0.2} = 0.0133$$

$$E[m^2(k)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\!\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)^2 = 0.72$$

$$c = 0.0133 + 0.72 = 0.7333$$

The minimum mean square error is then
$$F(\mathbf{x}^*) = c - 2\mathbf{x}^{*T}\mathbf{h} + \mathbf{x}^{*T}\mathbf{R}\mathbf{x}^* = 0.7333 - 2(0.72) + 0.72 = 0.0133.$$

The minimum mean square error is the same as the mean square value of the EEG signal. This is what we expected, since the error of this adaptive noise canceller is in fact the reconstructed EEG signal.
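And a quick numerical check of $c$ and $F(\mathbf{x}^*)$ (NumPy assumed; the signal definitions are the same ones used in the sketch above):

```python
import numpy as np

k = np.arange(1, 4)                                   # one period of the noise
v  = 1.2 * np.sin(2 * np.pi * k / 3)
v1 = 1.2 * np.sin(2 * np.pi * (k - 1) / 3)
m  = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)

R = np.array([[np.mean(v * v),  np.mean(v * v1)],
              [np.mean(v1 * v), np.mean(v1 * v1)]])
h = np.array([np.mean(m * v), np.mean(m * v1)])
x_star = np.linalg.solve(R, h)

c = 0.4**2 / 12 + np.mean(m**2)                       # E[s^2] for uniform[-0.2, 0.2] plus E[m^2]
F_min = c - 2 * x_star @ h + x_star @ R @ x_star
print(c, F_min)                                       # approx 0.7333 and 0.0133
```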


LMS Response for α = 0.1

[Figure: trajectory of the weights $W_{1,1}$ and $W_{1,2}$ during adaptation, overlaid on the contours of the mean square error.]

The LMS trajectory looks like a noisy version of steepest descent.

Note that the contours in this figure reflect the fact that the eigenvalues and eigenvectors of the Hessian matrix $\mathbf{A} = 2\mathbf{R}$ are
$$\lambda_1 = 2.16,\;\; \mathbf{z}_1 = \begin{bmatrix} 0.7071 \\ -0.7071 \end{bmatrix}; \qquad \lambda_2 = 0.72,\;\; \mathbf{z}_2 = \begin{bmatrix} 0.7071 \\ 0.7071 \end{bmatrix}.$$

If the learning rate is decreased, the LMS trajectory is smoother, but the learning proceeds more slowly.

Note that the maximum learning rate for stability is $\alpha = 2/2.16 = 0.926$.


Note that the error does not go to zero, because the LMS algorithm is approximate steepest descent; it uses an estimate of the gradient, not the true gradient. (Demonstration: nnd10eeg)

    Echo Cancellation



    HW

    Ch 4: E 2, 4, 6, 7

    Ch 5: 5, 7, 9

    Ch 6: 4, 5, 8, 10

    Ch 7: 1, 5, 6, 7

    Ch 8: 2, 4, 5


    Ch 9: 2, 5, 6

    Ch 10: 3, 6, 7