Steepest Descent


• $\{x(n)\}$ are the WSS input samples.

• $\{d(n)\}$ is the WSS desired output.

• $\{\hat{d}(n)\}$ is the estimate of the desired signal, given by

      \hat{d}(n) = w^H(n) x(n)

  where

      x(n) = [x(n), x(n-1), \ldots, x(n-M+1)]^T

  and

      w(n) = [w_0(n), w_1(n), \ldots, w_{M-1}(n)]^T

  is the filter weight vector at time n.


Then

    e(n) = d(n) - \hat{d}(n) = d(n) - w^H(n) x(n)

Thus the MSE at time n is

    J(n) = E\{|e(n)|^2\} = \sigma_d^2 - w^H(n) p - p^H w(n) + w^H(n) R w(n)

where
    \sigma_d^2 – variance of the desired signal,
    p – cross-correlation between x(n) and d(n),
    R – correlation matrix of x(n).

When w(n) is set to the (optimal) Wiener solution, then

    w(n) = w_0 = R^{-1} p

and

    J(n) = J_{min} = \sigma_d^2 - p^H w_0


Hence, in order to iteratively find $w_0$, we use the method of steepest descent. To illustrate this concept, let M = 2; in the 2-D space of w(n), the MSE forms a bowl-shaped function, and a contour plot of the MSE is shown below. Thus, if we are at a specific point in the bowl, we can imagine dropping a marble: it would reach the minimum by following the path of steepest descent.

[Figure: bowl-shaped MSE surface J(w) and its contours over $(w_1, w_2)$, with the negative gradient $-\nabla J = (-\partial J/\partial w_1, \; -\partial J/\partial w_2)$ pointing toward the minimum $w_0$.]


Hence the direction in which we change the filter weight vector is $-\nabla J(n)$, or

    w(n+1) = w(n) + \frac{1}{2} \mu [-\nabla J(n)]

or, since $\nabla J(n) = -2p + 2 R w(n)$,

    w(n+1) = w(n) + \mu [p - R w(n)]

for n = 0, 1, 2, ..., where $\mu$ is called the step size and

    w(0) = 0   (in general)

Stability: Since the SD method uses feedback, the system can go unstable.

• Bounds on the step size guaranteeing stability can be determined with respect to the eigenvalues of R (Widrow, 1970).
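
As a concrete illustration, the recursion is a few lines of NumPy. This is a minimal sketch; R, p, the step size, and the iteration count are hypothetical values chosen only to show convergence:

```python
import numpy as np

# Hypothetical statistics: R Hermitian positive definite, p the cross-correlation.
R = np.array([[1.0, 0.8],
              [0.8, 1.0]])
p = np.array([0.8, 0.5])

lam_max = np.linalg.eigvalsh(R).max()
mu = 1.0 / lam_max            # any 0 < mu < 2/lam_max is stable
w = np.zeros(2)               # w(0) = 0, as in the notes

for n in range(200):
    w = w + mu * (p - R @ w)  # w(n+1) = w(n) + mu [p - R w(n)]

print(w)                      # approaches the Wiener solution w0 = R^{-1} p
print(np.linalg.solve(R, p))
```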


Define the error vector for the tap weights as

    c(n) = w(n) - w_0

Then, using $p = R w_0$ in the update,

    w(n+1) = w(n) + \mu [p - R w(n)]
           = w(n) + \mu [R w_0 - R w(n)]
           = w(n) - \mu R c(n)

and

    w(n+1) - w_0 = w(n) - w_0 - \mu R c(n)

or

    c(n+1) = c(n) - \mu R c(n) = [I - \mu R] c(n)


Using the unitary similarity transform

    R = Q \Lambda Q^H

we have

    c(n+1) = [I - \mu Q \Lambda Q^H] c(n)

Premultiplying by $Q^H$ gives

    Q^H c(n+1) = [Q^H - \mu Q^H Q \Lambda Q^H] c(n)
               = [I - \mu \Lambda] Q^H c(n)

Define the transformed coefficients as

    v(n) = Q^H c(n) = Q^H (w(n) - w_0)


Then

    v(n+1) = [I - \mu \Lambda] v(n)

with initial condition

    v(0) = Q^H (w(0) - w_0) = -Q^H w_0   if w(0) = 0

The k-th term of v(n+1) (the k-th mode) is given by

    v_k(n+1) = (1 - \mu \lambda_k) v_k(n),   k = 1, 2, \ldots, M

or, unrolling the recursion,

    v_k(n) = (1 - \mu \lambda_k)^n v_k(0)

Thus, for $\lim_{n \to \infty} v_k(n) = 0$ for all k, we must have

    |1 - \mu \lambda_k| < 1


The k-th mode has geometric decay

    v_k(n) = (1 - \mu \lambda_k)^n v_k(0)

We can characterize the rate of decay by finding the time $\tau_k$ it takes to decay to $e^{-1}$ of the initial value. Thus

    v_k(\tau_k) = (1 - \mu \lambda_k)^{\tau_k} v_k(0) = e^{-1} v_k(0)

    \tau_k = \frac{-1}{\ln(1 - \mu \lambda_k)} \approx \frac{1}{\mu \lambda_k}   for \mu \lambda_k \ll 1


Recall that

    J(n) = J_{min} + (w(n) - w_0)^H R (w(n) - w_0)
         = J_{min} + (w(n) - w_0)^H Q \Lambda Q^H (w(n) - w_0)
         = J_{min} + v^H(n) \Lambda v(n)
         = J_{min} + \sum_{k=1}^{M} \lambda_k |v_k(n)|^2
         = J_{min} + \sum_{k=1}^{M} \lambda_k (1 - \mu \lambda_k)^{2n} |v_k(0)|^2

Thus

    \lim_{n \to \infty} J(n) = J_{min}


Example: Consider a two-tap predictor. Consider the effects of the following cases:

• Varying the eigenvalue spread $\chi(R) = \lambda_{max}/\lambda_{min}$ while keeping $\mu$ fixed.

• Varying $\mu$ while keeping the eigenvalue spread $\chi(R)$ fixed.


Example: Consider the system identification problem shown below.

For M = 2, suppose

    R_x = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}, \qquad
    p = \begin{bmatrix} 0.8 \\ 0.5 \end{bmatrix}

[Figure: block diagram — x(n) drives both the unknown system, producing d(n), and the adaptive filter w(n), producing $\hat{d}(n)$; the error is $e(n) = d(n) - \hat{d}(n)$.]


From eigenanalysis we have $\lambda_1 = 1.8$, $\lambda_2 = 0.2$, and $\mu < 2/1.8$. Also

    q_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad
    q_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}

and

    Q = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}

Also,

    w_0 = R^{-1} p = \begin{bmatrix} 1.11 \\ -0.389 \end{bmatrix}

Thus $v(n) = Q^H [w(n) - w_0]$. Noting that w(0) = 0,

    v(0) = -Q^H w_0
         = -\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
            \begin{bmatrix} 1.11 \\ -0.389 \end{bmatrix}
         = \begin{bmatrix} -0.51 \\ -1.06 \end{bmatrix}


and

    v_1(n) = (1 - 1.8\mu)^n (-0.51)
    v_2(n) = (1 - 0.2\mu)^n (-1.06)
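
These numbers can be checked numerically. A minimal NumPy sketch follows; note that numerical eigendecompositions fix eigenvector order and sign only up to permutation and ±1, so v(0) may appear reordered or sign-flipped relative to the notes:

```python
import numpy as np

R = np.array([[1.0, 0.8],
              [0.8, 1.0]])
p = np.array([0.8, 0.5])

lam, Q = np.linalg.eigh(R)     # eigenvalues [0.2, 1.8], orthonormal eigenvectors
w0 = np.linalg.solve(R, p)     # Wiener solution, approx [1.11, -0.389]
v0 = Q.T @ (np.zeros(2) - w0)  # v(0) = Q^H (w(0) - w0), approx +/-[0.51, 1.06]
print(lam, w0, v0)
```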


The Least Mean Square (LMS) Algorithm

The error performance surface used by the SD method is not always known a priori. We can instead use estimated values. The estimates are random variables, and thus this leads to a stochastic approach.

We will use the following instantaneous estimates:

    \hat{R}(n) = x(n) x^H(n)
    \hat{p}(n) = x(n) d^*(n)


Recall the SD update

    w(n+1) = w(n) + \frac{1}{2} \mu [-\nabla J(n)]

where the gradient of the error surface at w(n) was shown to be

    \nabla J(n) = -2p + 2 R w(n)

Using the instantaneous estimates,

    \hat{\nabla} J(n) = -2 x(n) d^*(n) + 2 x(n) x^H(n) w(n)
                      = -2 x(n) [d^*(n) - x^H(n) w(n)]
                      = -2 x(n) [d(n) - \hat{d}(n)]^*
                      = -2 x(n) e^*(n)

where $e^*(n)$ is the complex conjugate of the estimation error.


Putting this in the update,

    w(n+1) = w(n) + \mu x(n) e^*(n)

Thus the LMS algorithm belongs to the family of stochastic gradient algorithms. The update is extremely simple; although the instantaneous estimates may have large variance, the LMS algorithm is recursive and effectively averages these estimates. The simplicity and good performance of the LMS algorithm make it the benchmark against which other adaptive algorithms are judged.
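
A minimal sketch of the LMS recursion; the two-tap system, noise level, and step size here are hypothetical stand-ins for whatever generates x(n) and d(n):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, mu = 2, 5000, 0.01
w_true = np.array([1.11, -0.389])   # hypothetical unknown system

x = rng.standard_normal(N)
w = np.zeros(M)
for n in range(M, N):
    xn = x[n:n-M:-1]                # x(n) = [x(n), x(n-1), ..., x(n-M+1)]^T
    d = w_true @ xn + 0.01 * rng.standard_normal()
    e = d - np.conj(w) @ xn         # e(n) = d(n) - w^H(n) x(n)
    w = w + mu * xn * np.conj(e)    # w(n+1) = w(n) + mu x(n) e*(n)
print(w)                            # fluctuates around w_true
```

The data here are real-valued, so the conjugates are no-ops; they are kept to mirror the complex-valued formulas in the notes.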


The LMS algorithm can be analyzed by invoking the independence theory, which states:

1) The vectors x(1), x(2), ..., x(n) are statistically independent.

2) x(n) is independent of d(1), d(2), ..., d(n-1).

3) d(n) is statistically dependent on x(n), but is independent of d(1), d(2), ..., d(n-1).

4) x(n) and d(n) are mutually Gaussian.

The independence theory is justified in some cases, e.g., beamforming, where we receive independent vector observations. In other cases it is not well justified, but it allows the analysis to proceed.


Using the independence theory we can show that w(n) converges to the optimal solution in the mean:

    \lim_{n \to \infty} E\{w(n)\} = w_0

To show this, evaluate the update:

    w(n+1) = w(n) + \mu x(n) e^*(n)
    w(n+1) - w_0 = w(n) - w_0 + \mu x(n) e^*(n)

so that, with $e^*(n) = d^*(n) - x^H(n) w(n)$ and $w(n) = c(n) + w_0$,

    c(n+1) = c(n) + \mu x(n) [d^*(n) - x^H(n) w(n)]
           = c(n) + \mu x(n) d^*(n) - \mu x(n) x^H(n) [c(n) + w_0]
           = [I - \mu x(n) x^H(n)] c(n) + \mu x(n) [d^*(n) - x^H(n) w_0]


Note that since w(n) is based on past inputs and desired responses, w(n) (and hence c(n)) is independent of x(n). Thus

    E\{c(n+1)\} = [I - \mu R] E\{c(n)\} + \mu E\{x(n) [d^*(n) - x^H(n) w_0]\}

The second expectation is zero. (Why? By the principle of orthogonality: $E\{x(n) [d^*(n) - x^H(n) w_0]\} = p - R w_0 = 0$.) Hence

    E\{c(n+1)\} = [I - \mu R] E\{c(n)\}

Using arguments similar to the SD case, we have

    \lim_{n \to \infty} E\{c(n)\} = 0   if   0 < \mu < \frac{2}{\lambda_{max}}


Noting that

    \lambda_{max} \le \mathrm{trace}[R] = N r(0) = N \sigma_x^2

a more conservative bound is

    0 < \mu < \frac{2}{N \sigma_x^2}


An equivalent condition is to show that

    \lim_{n \to \infty} J(n) = \lim_{n \to \infty} E\{|e(n)|^2\} = \text{constant}

Write e(n) as

    e(n) = d(n) - \hat{d}(n) = d(n) - w^H(n) x(n)
         = d(n) - w_0^H x(n) - c^H(n) x(n)
         = e_0(n) - c^H(n) x(n)

Thus

    J(n) = E\{|e(n)|^2\}
         = E\{(e_0(n) - c^H(n) x(n)) (e_0(n) - c^H(n) x(n))^H\}
         = J_{min} + E\{c^H(n) x(n) x^H(n) c(n)\}
         = J_{min} + J_{ex}(n)

where $J_{ex}(n)$ denotes the excess MSE.


Since $J_{ex}(n)$ is a scalar,

    J_{ex}(n) = E\{c^H(n) x(n) x^H(n) c(n)\}
              = E\{\mathrm{trace}[c^H(n) x(n) x^H(n) c(n)]\}
              = E\{\mathrm{trace}[x(n) x^H(n) c(n) c^H(n)]\}
              = \mathrm{trace}[E\{x(n) x^H(n) c(n) c^H(n)\}]

Invoking the independence theorem,

    J_{ex}(n) = \mathrm{trace}[E\{x(n) x^H(n)\} E\{c(n) c^H(n)\}]
              = \mathrm{trace}[R K(n)]

where

    K(n) = E\{c(n) c^H(n)\}


Thus

    J(n) = J_{min} + J_{ex}(n) = J_{min} + \mathrm{trace}[R K(n)]

Recall $R = Q \Lambda Q^H$, or $\Lambda = Q^H R Q$. Let

    S(n) = Q^H K(n) Q

where S(n) need not be diagonal. Then $K(n) = Q S(n) Q^H$ and

    J_{ex}(n) = \mathrm{trace}[R K(n)]
              = \mathrm{trace}[Q \Lambda Q^H Q S(n) Q^H]
              = \mathrm{trace}[Q \Lambda S(n) Q^H]
              = \mathrm{trace}[Q^H Q \Lambda S(n)]
              = \mathrm{trace}[\Lambda S(n)]


Since $\Lambda$ is diagonal,

    J_{ex}(n) = \mathrm{trace}[\Lambda S(n)] = \sum_{i=1}^{M} \lambda_i s_i(n)

where $s_1(n), s_2(n), \ldots, s_M(n)$ are the diagonal elements of S(n).

The recursion expression can be modified to yield a recursion on S(n), which is

    S(n+1) = (I - \mu \Lambda) S(n) (I - \mu \Lambda) + \mu^2 J_{min} \Lambda

which for the diagonal elements is

    s_i(n+1) = (1 - \mu \lambda_i)^2 s_i(n) + \mu^2 \lambda_i J_{min},   i = 1, 2, \ldots, M

Suppose $J_{ex}(n)$ converges; then $s_i(n+1) = s_i(n)$, and from the above

    s_i(n) = \frac{\mu^2 \lambda_i J_{min}}{1 - (1 - \mu \lambda_i)^2}
           = \frac{\mu^2 \lambda_i J_{min}}{2 \mu \lambda_i - \mu^2 \lambda_i^2}
           = \frac{\mu J_{min}}{2 - \mu \lambda_i},   i = 1, 2, \ldots, M


Utilizing

    J_{ex}(n) = \mathrm{trace}[\Lambda S(n)] = \sum_{i=1}^{M} \lambda_i s_i(n)

we see

    \lim_{n \to \infty} J_{ex}(n) = J_{min} \sum_{i=1}^{M} \frac{\mu \lambda_i}{2 - \mu \lambda_i}

The LMS misadjustment is defined as

    \mathcal{M} = \frac{\lim_{n \to \infty} J_{ex}(n)}{J_{min}}
                = \sum_{i=1}^{M} \frac{\mu \lambda_i}{2 - \mu \lambda_i}

A misadjustment of 10% or less is generally considered acceptable.
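
For instance, a quick numerical check of the misadjustment formula, using the eigenvalues $\lambda_1 = 1.8$, $\lambda_2 = 0.2$ from the earlier example:

```python
import numpy as np

lam = np.array([1.8, 0.2])           # eigenvalues of R from the earlier example
for mu in (0.01, 0.05, 0.1):
    M_adj = np.sum(mu * lam / (2 - mu * lam))
    print(mu, M_adj)                 # misadjustment grows with mu; keep below ~0.1
```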


Example: a one-tap predictor of an order-one AR process. Let

    x(n) = -a x(n-1) + v(n)

and use a one-tap predictor. The weight update is

    w(n+1) = w(n) + \mu x(n-1) e(n)
           = w(n) + \mu x(n-1) [x(n) - w(n) x(n-1)]

Note $w_0 = -a$. Consider two cases, with $\mu = 0.05$:

    a        \sigma_x^2
    -0.99    0.93627
     0.99    0.995
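
A minimal sketch of this experiment, assuming the driving-noise variance is set so that the variance of x matches the table, i.e. $\sigma_v^2 = (1 - a^2)\sigma_x^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
a, mu, N = -0.99, 0.05, 2000
sigma_x2 = 0.93627
sigma_v = np.sqrt((1 - a**2) * sigma_x2)   # so that var{x} = sigma_x2

x = np.zeros(N)
w = 0.0
for n in range(1, N):
    x[n] = -a * x[n-1] + sigma_v * rng.standard_normal()
    e = x[n] - w * x[n-1]                  # prediction error
    w = w + mu * x[n-1] * e                # one-tap LMS update
print(w, -a)                               # w(n) fluctuates around w0 = -a
```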


Consider the expected trajectory of w(n). Recall

    w(n+1) = w(n) + \mu x(n-1) e(n)
           = w(n) + \mu x(n-1) [x(n) - w(n) x(n-1)]
           = [1 - \mu x(n-1) x(n-1)] w(n) + \mu x(n-1) x(n)

Since $x(n) = -a x(n-1) + v(n)$,

    w(n+1) = [1 - \mu x^2(n-1)] w(n) + \mu x(n-1) [-a x(n-1) + v(n)]
           = [1 - \mu x^2(n-1)] w(n) - \mu a x^2(n-1) + \mu x(n-1) v(n)

Taking the expectation and invoking the independence theorem,

    E\{w(n+1)\} = (1 - \mu \sigma_x^2) E\{w(n)\} - \mu \sigma_x^2 a


We can also derive a theoretical expression for J(n). Note that the initial value of J(n) is

    J(0) = \sigma_x^2

and the final value is

    J(\infty) = J_{min} + J_{ex} = \sigma_v^2 + \frac{\mu \lambda_1 J_{min}}{2 - \mu \lambda_1}

If $\mu$ is small,

    J(\infty) \approx \sigma_v^2 + \frac{\mu \sigma_x^2 \sigma_v^2}{2}
              = \sigma_v^2 \left(1 + \frac{\mu \sigma_x^2}{2}\right)

Also, the time constant is

    \tau_1 = \frac{-1}{2 \ln(1 - \mu \lambda_1)} = \frac{-1}{2 \ln(1 - \mu \sigma_x^2)}
           \approx \frac{1}{2 \mu \sigma_x^2}

and the learning curve is

    J(n) = \left[\sigma_x^2 - \sigma_v^2 \left(1 + \frac{\mu \sigma_x^2}{2}\right)\right]
           (1 - \mu \sigma_x^2)^{2n}
           + \sigma_v^2 \left(1 + \frac{\mu \sigma_x^2}{2}\right)
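
A short sketch evaluating this theoretical learning curve, plugging in the a = -0.99 case from the example above:

```python
import numpy as np

mu, sx2, a = 0.05, 0.93627, -0.99
sv2 = (1 - a**2) * sx2                  # J_min = sigma_v^2 for the one-tap predictor
J_inf = sv2 * (1 + mu * sx2 / 2)        # final value J(infinity)
n = np.arange(200)
J = (sx2 - J_inf) * (1 - mu * sx2)**(2 * n) + J_inf
print(J[0], J[-1])                      # decays from sigma_x^2 toward J(infinity)
```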


Example: Adaptive equalization

Goal: Pass a known signal through an unknown channel and invert the effects of the channel and noise on the signal.


The signal is a Bernoulli sequence

    x_n = \begin{cases} +1 & \text{with probability } 1/2 \\
                        -1 & \text{with probability } 1/2 \end{cases}

The channel has a raised-cosine response

    h_n = \begin{cases} \frac{1}{2}\left[1 + \cos\left(\frac{2\pi}{W}(n-2)\right)\right] & n = 1, 2, 3 \\
                        0 & \text{otherwise} \end{cases}

Note that W controls the eigenvalue spread $\chi(R)$. Also, the additive noise is $\sim N(0, 0.001)$.

Note that $h_n$ is symmetric about n = 2 and thus introduces a delay of 2. We will use an M = 11 tap filter, which will be symmetric about n = 5 and introduce a delay of 5. Thus an overall delay of $\delta = 5 + 2 = 7$ is added to the system.


[Figure: channel response and filter response.]

Consider three W values. Note that the step size is bounded by the W = 3.5 case:

    \mu \le \frac{2}{N r(0)} = \frac{2}{11 (1.3022)} = 0.14

Choose $\mu = 0.075$ in all cases.
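
A minimal sketch of the equalization experiment. W = 3.1 is one assumed value of the channel parameter; M = 11, delta = 7, mu = 0.075, and the noise variance 0.001 follow the notes:

```python
import numpy as np

rng = np.random.default_rng(2)
W, M, delta, mu, N = 3.1, 11, 7, 0.075, 20000

# Raised-cosine channel taps h_n, nonzero for n = 1, 2, 3 (h_0 = 0)
h = np.zeros(4)
h[1:4] = 0.5 * (1 + np.cos(2 * np.pi / W * (np.arange(1, 4) - 2)))

s = rng.choice([-1.0, 1.0], size=N)               # Bernoulli +/-1 sequence
x = np.convolve(s, h)[:N] + np.sqrt(0.001) * rng.standard_normal(N)

w = np.zeros(M)
for n in range(M, N):
    xn = x[n:n-M:-1]                              # regressor [x(n), ..., x(n-M+1)]
    d = s[n - delta]                              # desired = input delayed by 7
    e = d - w @ xn
    w = w + mu * xn * e                           # LMS equalizer update
print(w)                                          # approximates the delayed inverse
```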


Example: Directionality of the LMS algorithm

• The speed of convergence of the LMS algorithm is faster in certain directions in the weight space.

• If the convergence is in the appropriate direction, the convergence can be accelerated by an increased eigenvalue spread.

Consider the deterministic signal

    x(n) = A_1 \cos(\omega_1 n) + A_2 \cos(\omega_2 n)

with

    R = \frac{1}{2} \begin{bmatrix}
        A_1^2 + A_2^2 & A_1^2 \cos\omega_1 + A_2^2 \cos\omega_2 \\
        A_1^2 \cos\omega_1 + A_2^2 \cos\omega_2 & A_1^2 + A_2^2
        \end{bmatrix}


which gives

    \lambda_1 = \frac{1}{2}\left[A_1^2 (1 + \cos\omega_1) + A_2^2 (1 + \cos\omega_2)\right]
    \lambda_2 = \frac{1}{2}\left[A_1^2 (1 - \cos\omega_1) + A_2^2 (1 - \cos\omega_2)\right]

and

    q_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad
    q_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}

Consider two cases:

    (a) x(n) = \cos(1.2 n) + 0.5 \cos(0.1 n)    with \chi(R) = 2.9
    (b) x(n) = \cos(0.6 n) + 0.5 \cos(0.23 n)   with \chi(R) = 12.9

In each case let

    p = \lambda_1 q_1 \;\Rightarrow\; R w_0 = \lambda_1 q_1 \;\Rightarrow\; w_0 = q_1

and

    p = \lambda_2 q_2 \;\Rightarrow\; R w_0 = \lambda_2 q_2 \;\Rightarrow\; w_0 = q_2

Look at 200 iterations of the algorithm: first the minimum eigenfilter, $w_0 = q_2$, then the maximum eigenfilter, $w_0 = q_1$. A sketch of this experiment follows.
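
This is a minimal sketch, assuming d(n) is generated by the eigenfilter itself so that $p = R w_0$ holds by construction; the step size is a hypothetical choice, while the 200 iterations follow the notes:

```python
import numpy as np

def lms_sine_run(w1, w2, q_sign, N=200, mu=0.1):
    """LMS driven by x(n) = cos(w1 n) + 0.5 cos(w2 n), with w0 = q1 or q2."""
    A1, A2 = 1.0, 0.5
    q = np.array([1.0, q_sign]) / np.sqrt(2)   # +1 -> q1 (max), -1 -> q2 (min)
    w = np.zeros(2)
    for n in range(1, N + 1):
        xn = np.array([A1*np.cos(w1*n)     + A2*np.cos(w2*n),
                       A1*np.cos(w1*(n-1)) + A2*np.cos(w2*(n-1))])
        d = q @ xn                             # desired output of the eigenfilter
        w = w + mu * xn * (d - w @ xn)         # LMS update
    return w

print(lms_sine_run(1.2, 0.1, -1.0))            # minimum eigenfilter, chi(R) = 2.9
print(lms_sine_run(0.6, 0.23, -1.0))           # minimum eigenfilter, chi(R) = 12.9
print(lms_sine_run(1.2, 0.1, +1.0))            # maximum eigenfilter, chi(R) = 2.9
```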


Normalized LMS Algorithm

In the standard LMS algorithm the correction is proportional to $\mu x(n) e^*(n)$:

    w(n+1) = w(n) + \mu x(n) e^*(n)

If x(n) is large, the update suffers from gradient noise amplification. The normalized LMS algorithm seeks to avoid this:

• The step size is made time-varying, $\mu(n)$, and optimized to minimize the error.


Thus let

    w(n+1) = w(n) + \frac{1}{2}\mu(n)[-\nabla J(n)] = w(n) + \mu(n)[p - R w(n)]

Choose $\mu(n)$ such that the updated w(n+1) produces the minimum MSE,

    J(n+1) = E\{|e(n+1)|^2\}

where

    e(n+1) = d(n+1) - w^H(n+1) x(n+1)

Thus we choose $\mu(n)$ such that it minimizes J(n+1). The optimal step size, $\mu_0(n)$, will be a function of R and $\nabla(n)$. As before, we use instantaneous estimates of these values.


To determine $\mu_0(n)$, expand J(n+1):

    J(n+1) = E\{e(n+1) e^*(n+1)\}
           = E\{(d(n+1) - w^H(n+1) x(n+1)) (d(n+1) - w^H(n+1) x(n+1))^H\}
           = \sigma_d^2 - w^H(n+1) p - p^H w(n+1) + w^H(n+1) R w(n+1)

Now use the fact that

    w(n+1) = w(n) - \frac{1}{2}\mu(n)\nabla(n)

which gives

    J(n+1) = \sigma_d^2
             - [w(n) - \tfrac{1}{2}\mu(n)\nabla(n)]^H p
             - p^H [w(n) - \tfrac{1}{2}\mu(n)\nabla(n)]
             + [w(n) - \tfrac{1}{2}\mu(n)\nabla(n)]^H R [w(n) - \tfrac{1}{2}\mu(n)\nabla(n)]
           = \sigma_d^2 - w^H(n) p - p^H w(n) + w^H(n) R w(n)
             + \tfrac{1}{2}\mu(n)\nabla^H(n) p + \tfrac{1}{2}\mu(n) p^H \nabla(n)
             - \tfrac{1}{2}\mu(n)\nabla^H(n) R w(n) - \tfrac{1}{2}\mu(n) w^H(n) R \nabla(n)
             + \tfrac{1}{4}\mu^2(n)\nabla^H(n) R \nabla(n)


Differentiating with respect to $\mu(n)$,

    \frac{\partial J(n+1)}{\partial \mu(n)} =
        \tfrac{1}{2}\nabla^H(n) p + \tfrac{1}{2} p^H \nabla(n)
        - \tfrac{1}{2}\nabla^H(n) R w(n) - \tfrac{1}{2} w^H(n) R \nabla(n)
        + \tfrac{1}{2}\mu(n)\nabla^H(n) R \nabla(n)

Setting this equal to 0,

    \mu_0(n)\nabla^H(n) R \nabla(n) = \nabla^H(n)[R w(n) - p] + [w^H(n) R - p^H]\nabla(n)

so

    \mu_0(n) = \frac{\nabla^H(n)[R w(n) - p] + [R w(n) - p]^H \nabla(n)}{\nabla^H(n) R \nabla(n)}


Since $\tfrac{1}{2}\nabla(n) = R w(n) - p$, this becomes

    \mu_0(n) = \frac{\tfrac{1}{2}\nabla^H(n)\nabla(n) + \tfrac{1}{2}\nabla^H(n)\nabla(n)}{\nabla^H(n) R \nabla(n)}
             = \frac{\nabla^H(n)\nabla(n)}{\nabla^H(n) R \nabla(n)}

Using the instantaneous estimates

    \hat{R}(n) = x(n) x^H(n)
    \hat{\nabla}(n) = 2[x(n) x^H(n) w(n) - x(n) d^*(n)]
                    = -2 x(n)[d(n) - w^H(n) x(n)]^*
                    = -2 x(n) e^*(n)

we obtain

    \mu_0(n) = \frac{4 e(n) x^H(n) x(n) e^*(n)}{4 e(n) x^H(n) x(n) x^H(n) x(n) e^*(n)}
             = \frac{|e(n)|^2 \, x^H(n) x(n)}{|e(n)|^2 \, (x^H(n) x(n))^2}
             = \frac{1}{x^H(n) x(n)}
             = \frac{1}{\|x(n)\|^2}


Thus the NLMS update is

    w(n+1) = w(n) + \frac{\tilde{\mu}}{\|x(n)\|^2} x(n) e^*(n)

To avoid problems when $\|x(n)\|^2 \approx 0$, we add an offset:

    w(n+1) = w(n) + \frac{\tilde{\mu}}{a + \|x(n)\|^2} x(n) e^*(n)

where a > 0.

Consider now the convergence of the NLMS algorithm

    w(n+1) = w(n) + \frac{\tilde{\mu}}{\|x(n)\|^2} x(n) e^*(n)

Substituting $e(n) = d(n) - w^H(n) x(n)$,

    w(n+1) = w(n) + \frac{\tilde{\mu}}{\|x(n)\|^2} x(n) [d(n) - w^H(n) x(n)]^*
           = \left[I - \tilde{\mu}\frac{x(n) x^H(n)}{\|x(n)\|^2}\right] w(n)
             + \tilde{\mu}\frac{x(n) d^*(n)}{\|x(n)\|^2}


Compare NLMS and LMS.

NLMS:

    w(n+1) = \left[I - \tilde{\mu}\frac{x(n) x^H(n)}{\|x(n)\|^2}\right] w(n)
             + \tilde{\mu}\frac{x(n) d^*(n)}{\|x(n)\|^2}

LMS:

    w(n+1) = [I - \mu x(n) x^H(n)] w(n) + \mu x(n) d^*(n)

Comparing, we see the following corresponding terms:

    LMS              NLMS
    \mu              \tilde{\mu}
    x(n) x^H(n)      x(n) x^H(n) / \|x(n)\|^2
    x(n) d^*(n)      x(n) d^*(n) / \|x(n)\|^2


Since in the LMS case the bound is

    0 < \mu < \frac{2}{\mathrm{trace}[E\{x(n) x^H(n)\}]} = \frac{2}{\mathrm{trace}[R]}


and since

    \mathrm{trace}\left[E\left\{\frac{x(n) x^H(n)}{\|x(n)\|^2}\right\}\right]
    = E\left\{\frac{x^H(n) x(n)}{\|x(n)\|^2}\right\} = 1

the NLMS update

    w(n+1) = w(n) + \frac{\tilde{\mu}}{\|x(n)\|^2} x(n) e^*(n)

will converge if $0 < \tilde{\mu} < 2$.