
Adaptive Filtering: Method of Steepest Descent · 2000-02-20



Adaptive Filtering: Method of Steepest Descent

Steepest descent is an old, deterministic method, which is the basis for stochastic gradient-based methods.

It is a feedback approach to finding the minimum of the error performance surface:
- the error surface must be known
- the adaptive approach converges to the optimal solution w_0 = R^{-1} p without inverting a matrix


• x(n) are the WSS input samples
• d(n) is the WSS desired output
• d̂(n) is the estimate of the desired signal, given by

d̂(n) = w^H(n) x(n)

where x(n) = [x(n), x(n−1), …, x(n−M+1)]^T and w(n) = [w_0(n), w_1(n), …, w_{M−1}(n)]^T is the filter weight vector at time n.


Then

e(n) = d(n) − d̂(n) = d(n) − w^H(n) x(n)

Thus the MSE at time n is

J(n) = E|e(n)|^2 = σ_d^2 − w^H(n) p − p^H w(n) + w^H(n) R w(n)

where
σ_d^2 — variance of the desired signal
p — cross-correlation between x(n) and d(n)
R — correlation matrix of x(n)

When w(n) is set to the (optimal) Wiener solution,

w(n) = w_0 = R^{-1} p

and

J(n) = J_min = σ_d^2 − p^H w_0


Hence, in order to iteratively find w_0, we use the method of steepest descent. To illustrate the concept, let M = 2; in the 2-D space of w(n), the MSE forms a bowl-shaped function.

Thus, if we are at a specific point in the bowl and imagine dropping a marble, it would reach the minimum by following the path of steepest descent.

[Figure: bowl-shaped surface J(w) over (w_1, w_2) and its contour plot; the negative gradient −∇J, with components −∂J/∂w_1 and −∂J/∂w_2, points toward the minimum w_0.]


Hence the direction in which we change the filter weight vector is −∇J(n), i.e.

w(n+1) = w(n) + (1/2) μ [−∇J(n)]

or, since ∇J(n) = −2p + 2R w(n),

w(n+1) = w(n) + μ [p − R w(n)]

for n = 0, 1, 2, …, where μ is called the step size and w(0) = 0 (in general).

Stability: Since the SD method uses feedback, the system can go unstable.
• Bounds on the step size guaranteeing stability can be determined with respect to the eigenvalues of R (Widrow, 1970).
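As a sketch (not part of the original notes), the recursion w(n+1) = w(n) + μ[p − R w(n)] can be run numerically. The R and p below are taken from the system-identification example later in these notes, and μ = 0.3 is an illustrative choice satisfying the stability bound.

```python
# Steepest descent on J(w): w(n+1) = w(n) + mu * (p - R w(n)).
# R and p are from the later 2-tap example; mu = 0.3 < 2/lambda_max = 1.11.

def sd_step(w, R, p, mu):
    # gradient direction: p - R w
    g = [p[i] - sum(R[i][j] * w[j] for j in range(len(w))) for i in range(len(w))]
    return [w[i] + mu * g[i] for i in range(len(w))]

R = [[1.0, 0.8], [0.8, 1.0]]
p = [0.8, 0.5]
w = [0.0, 0.0]               # w(0) = 0, as in the notes
for _ in range(500):
    w = sd_step(w, R, p, 0.3)

# converges to w0 = R^{-1} p without ever inverting R
print(w)
```

The iterate settles at w_0 = [1.111, −0.389], matching the direct solution R^{-1} p.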


Define the error vector for the tap weights as

c(n) = w(n) − w_0

Then, using p = R w_0 in the update,

w(n+1) = w(n) + μ [p − R w(n)]
       = w(n) + μ [R w_0 − R w(n)]
       = w(n) − μ R c(n)

and, subtracting w_0 from both sides,

w(n+1) − w_0 = w(n) − w_0 − μ R c(n)

or

c(n+1) = c(n) − μ R c(n) = [I − μR] c(n)


Using the unitary similarity transform

R = Q Λ Q^H

we have

c(n+1) = [I − μ Q Λ Q^H] c(n)

Premultiplying by Q^H gives

Q^H c(n+1) = Q^H [I − μ Q Λ Q^H] c(n)
           = [I − μΛ] Q^H c(n)

Define the transformed coefficients as

v(n) = Q^H c(n) = Q^H (w(n) − w_0)


Then

v(n+1) = [I − μΛ] v(n)

with initial condition

v(0) = Q^H (w(0) − w_0) = −Q^H w_0   if w(0) = 0

The kth term (mode) in v(n+1) is given by

v_k(n+1) = (1 − μ λ_k) v_k(n),   k = 1, 2, …, M

or, solving the recursion,

v_k(n) = (1 − μ λ_k)^n v_k(0)

Thus, for lim_{n→∞} v_k(n) = 0 we must have |1 − μ λ_k| < 1 for all k, or

0 < μ < 2/λ_max
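A quick numeric check (illustrative, not from the notes) of the per-mode condition against the bound 0 < μ < 2/λ_max, using the eigenvalues λ = 1.8, 0.2 that appear in the later two-tap example:

```python
# v_k(n) = (1 - mu*lam_k)**n * v_k(0) converges iff |1 - mu*lam_k| < 1 for all k.
lams = [1.8, 0.2]
mu_max = 2 / max(lams)                      # upper stability bound 2/lambda_max

for mu in (0.3, 1.2):                       # one stable, one unstable choice
    stable = all(abs(1 - mu * lam) < 1 for lam in lams)
    print(mu, mu < mu_max, stable)          # the two tests agree
```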


The kth mode has geometric decay

v_k(n) = (1 − μ λ_k)^n v_k(0)

We can characterize the rate of decay by finding the time constant τ_k, the time it takes to decay to e^{−1} of the initial value. Thus

v_k(τ_k) = (1 − μ λ_k)^{τ_k} v_k(0) = e^{−1} v_k(0)

which gives

τ_k = −1 / ln(1 − μ λ_k) ≈ 1/(μ λ_k)   for μ λ_k << 1

The overall rate of decay is bounded by

−1 / ln(1 − μ λ_max) ≤ τ ≤ −1 / ln(1 − μ λ_min)


Recall that

J(n) = J_min + (w(n) − w_0)^H R (w(n) − w_0)
     = J_min + (w(n) − w_0)^H Q Λ Q^H (w(n) − w_0)
     = J_min + v^H(n) Λ v(n)
     = J_min + Σ_{k=1}^{M} λ_k |v_k(n)|^2
     = J_min + Σ_{k=1}^{M} λ_k (1 − μ λ_k)^{2n} |v_k(0)|^2

Thus lim_{n→∞} J(n) = J_min.


Example: Consider a two-tap predictor.

Consider the effects of the following cases:

• Varying the eigenvalue spread χ(R) = λ_max/λ_min while keeping μ fixed.
• Varying μ while keeping the eigenvalue spread χ(R) fixed.

[Figures: SD weight trajectories and learning curves for the two-tap predictor under the cases above.]

Example: Consider the system identification problem.

[Figure: x(n) drives both an unknown system, producing d(n), and the adaptive filter w(n), producing d̂(n); the error is e(n) = d(n) − d̂(n).]

For M = 2, suppose

R_x = [ 1    0.8
        0.8  1   ],   p = [ 0.8
                            0.5 ]


From eigenanalysis we have

λ_1 = 1.8,   λ_2 = 0.2,   and   μ < 2/1.8

also

q_1 = (1/√2) [1, 1]^T,   q_2 = (1/√2) [1, −1]^T

and

Q = (1/√2) [ 1   1
             1  −1 ]

Also,

w_0 = R^{-1} p = [ 1.11
                  −0.389 ]

Thus v(n) = Q^H [w(n) − w_0]. Noting that

v(0) = −Q^H w_0 = [−0.51, −1.06]^T


and

v_1(n) = (1 − 1.8μ)^n (−0.51)
v_2(n) = (1 − 0.2μ)^n (−1.06)
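The numbers in this example are easy to verify numerically (a pure-Python check; the sign convention follows v(0) = −Q^H w_0):

```python
import math

R = [[1.0, 0.8], [0.8, 1.0]]
p = [0.8, 0.5]

# eigenvalues of [[1, r], [r, 1]] are 1 + r and 1 - r
lam1, lam2 = 1 + 0.8, 1 - 0.8                    # 1.8 and 0.2

# w0 = R^{-1} p via the explicit 2x2 inverse
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
w0 = [(R[1][1] * p[0] - R[0][1] * p[1]) / det,
      (R[0][0] * p[1] - R[1][0] * p[0]) / det]

# v(0) = -Q^H w0, with Q^H rows (1/sqrt(2))[1, 1] and (1/sqrt(2))[1, -1]
s = 1 / math.sqrt(2)
v0 = [-s * (w0[0] + w0[1]), -s * (w0[0] - w0[1])]

print(w0)   # about [1.111, -0.389]
print(v0)   # about [-0.51, -1.06]
```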


The Least Mean Square (LMS) Algorithm

The error performance surface used by the SD method is not always known a priori. We can instead use estimated values. The estimates are random variables, and this leads to a stochastic approach.

We will use the following instantaneous estimates:

R̂(n) = x(n) x^H(n)
p̂(n) = x(n) d*(n)


Recall the SD update

w(n+1) = w(n) + (1/2) μ [−∇J(n)]

where the gradient of the error surface at w(n) was shown to be

∇J(n) = −2p + 2R w(n)

Using the instantaneous estimates,

∇̂J(n) = −2 x(n) d*(n) + 2 x(n) x^H(n) w(n)
       = −2 x(n) [d(n) − w^H(n) x(n)]*
       = −2 x(n) [d(n) − d̂(n)]*
       = −2 x(n) e*(n)

where e*(n) is the complex conjugate of the estimation error.


Putting this into the update,

w(n+1) = w(n) + μ x(n) e*(n)

Thus the LMS algorithm belongs to the family of stochastic gradient algorithms.

The update is extremely simple; although the instantaneous estimates may have large variance, the LMS algorithm is recursive and effectively averages these estimates.

The simplicity and good performance of the LMS algorithm make it the benchmark against which other optimization algorithms are judged.
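A minimal real-valued LMS sketch (for real signals the conjugate is a no-op); the 2-tap system, step size, and run length below are illustrative assumptions, not values from the notes:

```python
import random

random.seed(0)

h = [0.6, -0.3]        # unknown 2-tap system to identify (illustrative)
M, mu = 2, 0.05

w = [0.0] * M
xbuf = [0.0] * M       # x(n) = [x(n), x(n-1)]
for n in range(5000):
    xbuf = [random.gauss(0, 1)] + xbuf[:-1]
    d = sum(h[i] * xbuf[i] for i in range(M))        # desired response
    e = d - sum(w[i] * xbuf[i] for i in range(M))    # e(n) = d(n) - w^T x(n)
    w = [w[i] + mu * xbuf[i] * e for i in range(M)]  # w(n+1) = w(n) + mu x(n) e(n)

print(w)   # approaches h
```

Despite the noisy per-sample gradient estimate, the recursion averages it out and the weights settle near h.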


The LMS algorithm can be analyzed by invoking the independence theory, which states:

1) The vectors x(1), x(2), …, x(n) are statistically independent.
2) x(n) is independent of d(1), d(2), …, d(n−1).
3) d(n) is statistically dependent on x(n), but independent of d(1), d(2), …, d(n−1).
4) x(n) and d(n) are mutually Gaussian.

The independence theorem is justified in some cases, e.g. beamforming, where we receive independent vector observations. In other cases it is not well justified, but it allows the analysis to proceed.


Using the independence theory we can show that w(n) converges to the optimal solution in the mean:

lim_{n→∞} E[w(n)] = w_0

To show this, evaluate the update

w(n+1) = w(n) + μ x(n) e*(n)
w(n+1) − w_0 = w(n) − w_0 + μ x(n) e*(n)

With c(n) = w(n) − w_0 and e*(n) = d*(n) − x^H(n) w(n),

c(n+1) = c(n) + μ x(n) [d*(n) − x^H(n) w(n)]
       = c(n) + μ x(n) d*(n) − μ x(n) x^H(n) [c(n) + w_0]
       = [I − μ x(n) x^H(n)] c(n) + μ x(n) [d*(n) − x^H(n) w_0]


Note that since w(n) is based on past inputs and desired responses, w(n) (and hence c(n)) is independent of x(n).

Thus

E[c(n+1)] = E{[I − μ x(n) x^H(n)] c(n)} + μ E{x(n) [d*(n) − x^H(n) w_0]}

The second expectation is zero (why? because E[x(n) d*(n)] = p and E[x(n) x^H(n)] w_0 = R w_0 = p), so

E[c(n+1)] = [I − μR] E[c(n)]

Using arguments similar to the SD case, we have

lim_{n→∞} E[c(n)] = 0   if 0 < μ < 2/λ_max

or equivalently

lim_{n→∞} E[w(n)] = w_0   if 0 < μ < 2/λ_max


Noting that

λ_max ≤ trace[R] = N r(0) = N σ_x^2

a more conservative bound is

0 < μ < 2/(N σ_x^2)

Also, convergence in the mean,

lim_{n→∞} E[w(n)] = w_0

is a weak condition that says nothing about the variance, which may even grow.

A stronger condition is convergence in the mean square, which says

lim_{n→∞} E|c(n)|^2 = constant


An equivalent condition is to show that

lim_{n→∞} J(n) = lim_{n→∞} E|e(n)|^2 = constant

Write e(n) as

e(n) = d(n) − d̂(n) = d(n) − w^H(n) x(n)
     = d(n) − w_0^H x(n) − c^H(n) x(n)
     = e_0(n) − c^H(n) x(n)

where e_0(n) = d(n) − w_0^H x(n). Thus

J(n) = E|e(n)|^2 = E[(e_0(n) − c^H(n) x(n))(e_0(n) − c^H(n) x(n))*]
     = J_min + E[c^H(n) x(n) x^H(n) c(n)]
     = J_min + J_ex(n)

where J_ex(n) = E[c^H(n) x(n) x^H(n) c(n)] is the excess MSE.


Since J_ex(n) is a scalar,

J_ex(n) = E[c^H(n) x(n) x^H(n) c(n)]
        = E{trace[c^H(n) x(n) x^H(n) c(n)]}
        = E{trace[x(n) x^H(n) c(n) c^H(n)]}
        = trace{E[x(n) x^H(n) c(n) c^H(n)]}

Invoking the independence theorem,

J_ex(n) = trace{E[x(n) x^H(n)] E[c(n) c^H(n)]} = trace[R K(n)]

where

K(n) = E[c(n) c^H(n)]


Thus

J(n) = J_min + J_ex(n) = J_min + trace[R K(n)]

Recall

R = Q Λ Q^H   or   Λ = Q^H R Q

Let

S(n) ≜ Q^H K(n) Q

where S(n) need not be diagonal. Then K(n) = Q S(n) Q^H and

J_ex(n) = trace[R K(n)]
        = trace[Q Λ Q^H Q S(n) Q^H]
        = trace[Q Λ S(n) Q^H]
        = trace[Λ S(n) Q^H Q]
        = trace[Λ S(n)]


Since Λ is diagonal,

J_ex(n) = trace[Λ S(n)] = Σ_{i=1}^{M} λ_i s_i(n)

where s_1(n), s_2(n), …, s_M(n) are the diagonal elements of S(n).

The recursion for K(n) can be modified to yield a recursion on S(n), which is

S(n+1) = (I − μΛ) S(n) (I − μΛ) + μ^2 J_min Λ

and for the diagonal elements,

s_i(n+1) = (1 − μ λ_i)^2 s_i(n) + μ^2 λ_i J_min,   i = 1, 2, …, M

Suppose J_ex(n) converges; then s_i(n+1) = s_i(n), and from the above

s_i(n) = (1 − μ λ_i)^2 s_i(n) + μ^2 λ_i J_min
⇒ s_i(n) [2 μ λ_i − μ^2 λ_i^2] = μ^2 λ_i J_min
⇒ s_i(n) = μ J_min / (2 − μ λ_i),   i = 1, 2, …, M


Utilizing

J_ex(n) = trace[Λ S(n)] = Σ_{i=1}^{M} λ_i s_i(n)

we see

lim_{n→∞} J_ex(n) = Σ_{i=1}^{M} μ λ_i J_min / (2 − μ λ_i)

The LMS misadjustment is defined as

M = lim_{n→∞} J_ex(n) / J_min = Σ_{i=1}^{M} μ λ_i / (2 − μ λ_i)

A misadjustment of 10% or less is generally considered acceptable.
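For example, the misadjustment formula can be evaluated directly; the eigenvalues reuse the earlier two-tap example and the step sizes are illustrative:

```python
# Misadjustment M = sum_i  mu*lam_i / (2 - mu*lam_i)
def misadjustment(mu, lams):
    return sum(mu * lam / (2 - mu * lam) for lam in lams)

lams = [1.8, 0.2]
for mu in (0.05, 0.1, 0.3):
    print(mu, misadjustment(mu, lams))
# small mu keeps the misadjustment under the ~10% guideline;
# larger mu trades steady-state accuracy for faster convergence
```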


Example: a one-tap predictor of an order-one AR process. Let

x(n) = −a x(n−1) + v(n)

and use a one-tap predictor. The weight update is

w(n+1) = w(n) + μ x(n−1) e(n)
       = w(n) + μ x(n−1) [x(n) − w(n) x(n−1)]

Note w_0 = −a. Consider two cases, with μ = 0.05:

a        σ_x^2
−0.99    0.93627
 0.99    0.995
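A simulation sketch of this predictor for the a = 0.99 case (the driving-noise variance is an assumption chosen so that σ_x^2 = σ_v^2/(1 − a^2) matches the tabulated 0.995; seed and run length are arbitrary):

```python
import math
import random

random.seed(1)

a, mu = 0.99, 0.05
sigma_v = math.sqrt(0.995 * (1 - a * a))   # makes sigma_x^2 ≈ 0.995

x_prev, w = 0.0, 0.0
for n in range(20000):
    x = -a * x_prev + random.gauss(0, sigma_v)   # x(n) = -a x(n-1) + v(n)
    e = x - w * x_prev                           # prediction error
    w += mu * x_prev * e                         # one-tap LMS update
    x_prev = x

print(w)   # fluctuates around w0 = -a = -0.99
```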

[Figures: simulated weight trajectories w(n) for the two cases.]

Consider the expected trajectory of w(n). Recall

w(n+1) = w(n) + μ x(n−1) e(n)
       = w(n) + μ x(n−1) [x(n) − w(n) x(n−1)]
       = [1 − μ x(n−1) x(n−1)] w(n) + μ x(n−1) x(n)

Since x(n) = −a x(n−1) + v(n),

w(n+1) = [1 − μ x(n−1) x(n−1)] w(n) + μ x(n−1) [−a x(n−1) + v(n)]
       = [1 − μ x^2(n−1)] w(n) − μ a x^2(n−1) + μ x(n−1) v(n)

Taking the expectation and invoking the independence theorem,

E[w(n+1)] = (1 − μ σ_x^2) E[w(n)] − μ σ_x^2 a


We can also derive a theoretical expression for J(n).

Note that the initial value of J(n) is

J(0) = σ_x^2

and the final value is

J(∞) = J_min + J_ex = σ_v^2 + μ λ_1 J_min / (2 − μ λ_1)

If μ is small,

J(∞) ≈ σ_v^2 + μ σ_x^2 σ_v^2 / 2 = σ_v^2 (1 + μ σ_x^2 / 2)

Also, the time constant is

τ_1 = −1 / [2 ln(1 − μ λ_1)] = −1 / [2 ln(1 − μ σ_x^2)] ≈ 1 / (2 μ σ_x^2)

and

J(n) = [σ_x^2 − σ_v^2 (1 + μ σ_x^2 / 2)] (1 − μ σ_x^2)^{2n} + σ_v^2 (1 + μ σ_x^2 / 2)


Example: Adaptive equalization

Goal: pass a known signal through an unknown channel, and invert the effects of the channel and noise on the signal.


The signal is a Bernoulli sequence

x_n = +1 with probability 1/2
      −1 with probability 1/2

The channel has a raised-cosine response

h_n = (1/2) [1 + cos(2π(n−2)/W)]   for n = 1, 2, 3
h_n = 0                            otherwise

Note that W controls the eigenvalue spread χ(R).

Also, the additive noise is ∼ N(0, 0.001).

Note that h_n is symmetric about n = 2 and thus introduces a delay of 2. We will use an M = 11 tap filter, which will be symmetric about n = 5 and introduce a delay of 5. Thus an overall delay of δ = 5 + 2 = 7 is added to the system.


Channel response and filter response: consider three values of W.

Note the step size is bounded by the W = 3.5 case:

μ ≤ 2 / [N r(0)] = 2 / [11 (1.3022)] = 0.14

Choose μ = 0.075 in all cases.
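A sketch of the full experiment for one eigenvalue-spread setting (W = 2.9); the run length and seed are illustrative, and for real data the LMS conjugates drop out:

```python
import math
import random

random.seed(2)

W, M, delay, mu = 2.9, 11, 7, 0.075

# raised-cosine channel h_n for n = 1, 2, 3
h = [0.5 * (1 + math.cos(2 * math.pi * (n - 2) / W)) for n in (1, 2, 3)]

w = [0.0] * M
xbuf = [0.0] * M              # equalizer input vector
sbuf = [0.0] * (delay + 1)    # past symbols, for the delayed reference
errs = []
for n in range(3000):
    s = random.choice((-1.0, 1.0))                   # Bernoulli +/-1 symbol
    sbuf = [s] + sbuf[:-1]
    # channel output sum_k h_k s(n-k) plus N(0, 0.001) noise
    u = sum(h[k] * sbuf[k + 1] for k in range(3)) + random.gauss(0, math.sqrt(0.001))
    xbuf = [u] + xbuf[:-1]
    d = sbuf[delay]                                  # reference s(n - 7)
    e = d - sum(w[i] * xbuf[i] for i in range(M))
    w = [w[i] + mu * xbuf[i] * e for i in range(M)]
    errs.append(e * e)

mse = sum(errs[-500:]) / 500   # steady-state squared-error estimate
print(mse)
```

The equalizer drives the squared error well below the initial symbol power, as in the learning curves of the original experiment.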

[Figures: equalizer learning curves and tap-weight solutions for the three values of W.]

Example: Directionality of the LMS algorithm

• The speed of convergence of the LMS algorithm is faster in certain directions in the weight space.
• If the convergence is in the appropriate direction, the convergence can be accelerated by increased eigenvalue spread.

Consider the deterministic signal

x(n) = A_1 cos(ω_1 n) + A_2 cos(ω_2 n)

with

R = (1/2) [ A_1^2 + A_2^2                      A_1^2 cos ω_1 + A_2^2 cos ω_2
            A_1^2 cos ω_1 + A_2^2 cos ω_2      A_1^2 + A_2^2 ]


which gives

λ_1 = (1/2) A_1^2 (1 + cos ω_1) + (1/2) A_2^2 (1 + cos ω_2)
λ_2 = (1/2) A_1^2 (1 − cos ω_1) + (1/2) A_2^2 (1 − cos ω_2)

and

q_1 = (1/√2) [1, 1]^T,   q_2 = (1/√2) [1, −1]^T

Consider two cases:

x_a(n) = cos(1.2 n) + 0.5 cos(0.1 n)    with χ(R) = 2.9
x_b(n) = cos(0.6 n) + 0.5 cos(0.23 n)   with χ(R) = 12.9

In each case let

p = λ_1 q_1 ⇒ R w_0 = λ_1 q_1 ⇒ w_0 = q_1
and
p = λ_2 q_2 ⇒ R w_0 = λ_2 q_2 ⇒ w_0 = q_2

Look at 200 iterations of the algorithm: first the minimum eigenfilter, w_0 = q_2 = (1/√2) [1, −1]^T, then the maximum eigenfilter, w_0 = q_1 = (1/√2) [1, 1]^T.

[Figures: LMS weight trajectories for the minimum and maximum eigenfilters in the two cases.]

Normalized LMS Algorithm

In the standard LMS algorithm the correction is proportional to μ x(n) e*(n):

w(n+1) = w(n) + μ x(n) e*(n)

If x(n) is large, the update suffers from gradient noise amplification. The normalized LMS algorithm seeks to avoid gradient noise amplification:

• The step size is made time-varying, μ(n), and optimized to minimize the error.


Thus let

w(n+1) = w(n) + (1/2) μ(n) [−∇J(n)]
       = w(n) + μ(n) [p − R w(n)]

Choose μ(n) such that the updated w(n+1) produces the minimum MSE,

J(n+1) = E|e(n+1)|^2

where

e(n+1) = d(n+1) − w^H(n+1) x(n+1)

Thus we choose μ(n) such that it minimizes J(n+1). The optimal step size, μ_0(n), will be a function of R and ∇(n). As before, we use instantaneous estimates of these values.


To determine μ_0(n), expand J(n+1):

J(n+1) = E[e(n+1) e*(n+1)]
       = E[(d(n+1) − w^H(n+1) x(n+1)) (d(n+1) − w^H(n+1) x(n+1))*]
       = σ_d^2 − p^H w(n+1) − w^H(n+1) p + w^H(n+1) R w(n+1)

Now use the fact that w(n+1) = w(n) − (1/2) μ(n) ∇(n):

J(n+1) = σ_d^2 − p^H [w(n) − (1/2) μ(n) ∇(n)] − [w(n) − (1/2) μ(n) ∇(n)]^H p
         + [w(n) − (1/2) μ(n) ∇(n)]^H R [w(n) − (1/2) μ(n) ∇(n)]

where the last (quadratic) term expands to

w^H(n) R w(n) − (1/2) μ(n) ∇^H(n) R w(n) − (1/2) μ(n) w^H(n) R ∇(n) + (1/4) μ^2(n) ∇^H(n) R ∇(n)


Collecting terms,

J(n+1) = σ_d^2 − p^H w(n) − w^H(n) p + w^H(n) R w(n)
         + (1/2) μ(n) [p^H ∇(n) + ∇^H(n) p − ∇^H(n) R w(n) − w^H(n) R ∇(n)]
         + (1/4) μ^2(n) ∇^H(n) R ∇(n)

Differentiating with respect to μ(n),

∂J(n+1)/∂μ(n) = (1/2) p^H ∇(n) + (1/2) ∇^H(n) p − (1/2) ∇^H(n) R w(n) − (1/2) w^H(n) R ∇(n)
                + (1/2) μ(n) ∇^H(n) R ∇(n)

Setting this equal to 0,

μ_0(n) ∇^H(n) R ∇(n) = ∇^H(n) [R w(n) − p] + [w^H(n) R − p^H] ∇(n)

so that

μ_0(n) = { ∇^H(n) [R w(n) − p] + [R w(n) − p]^H ∇(n) } / [∇^H(n) R ∇(n)]


Since ∇(n) = 2 [R w(n) − p], we have R w(n) − p = (1/2) ∇(n), and

μ_0(n) = [(1/2) ∇^H(n) ∇(n) + (1/2) ∇^H(n) ∇(n)] / [∇^H(n) R ∇(n)]
       = ∇^H(n) ∇(n) / [∇^H(n) R ∇(n)]

Using instantaneous estimates,

R̂(n) = x(n) x^H(n)
∇̂(n) = 2 [x(n) x^H(n) w(n) − x(n) d*(n)]
      = −2 x(n) [d(n) − d̂(n)]*
      = −2 x(n) e*(n)

Thus

μ_0(n) = [4 |e(n)|^2 x^H(n) x(n)] / [4 |e(n)|^2 (x^H(n) x(n))^2]
       = 1 / [x^H(n) x(n)]
       = 1 / ||x(n)||^2


Thus the NLMS update is

w(n+1) = w(n) + [μ̃ / ||x(n)||^2] x(n) e*(n)

where μ(n) = μ̃ / ||x(n)||^2. To avoid problems when ||x(n)||^2 ≈ 0, we add an offset:

w(n+1) = w(n) + [μ̃ / (a + ||x(n)||^2)] x(n) e*(n)

where a > 0.

Consider now the convergence of the NLMS algorithm. Substituting e(n) = d(n) − w^H(n) x(n) into

w(n+1) = w(n) + [μ̃ / ||x(n)||^2] x(n) e*(n)

gives

w(n+1) = [I − μ̃ x(n) x^H(n) / ||x(n)||^2] w(n) + μ̃ x(n) d*(n) / ||x(n)||^2
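A real-valued NLMS sketch including the offset a (the system, seed, and sizes are illustrative assumptions); the normalized step μ̃ = 0.5 satisfies the 0 < μ̃ < 2 condition derived below:

```python
import random

random.seed(3)

h = [0.5, 0.25, -0.1]      # unknown system (illustrative)
M, mu_t, a = 3, 0.5, 1e-6  # mu_t is the normalized step; a guards small ||x||^2

w = [0.0] * M
xbuf = [0.0] * M
for n in range(4000):
    xbuf = [random.gauss(0, 1)] + xbuf[:-1]
    d = sum(h[i] * xbuf[i] for i in range(M))
    e = d - sum(w[i] * xbuf[i] for i in range(M))
    norm2 = sum(xi * xi for xi in xbuf)               # ||x(n)||^2
    step = mu_t / (a + norm2)                         # time-varying step size
    w = [w[i] + step * xbuf[i] * e for i in range(M)]

print(w)   # approaches h
```

Because the step is scaled by ||x(n)||^2, large input bursts no longer amplify the gradient noise in the update.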


Compare NLMS and LMS:

NLMS:

w(n+1) = [I − μ̃ x(n) x^H(n) / ||x(n)||^2] w(n) + μ̃ x(n) d*(n) / ||x(n)||^2

LMS:

w(n+1) = [I − μ x(n) x^H(n)] w(n) + μ x(n) d*(n)

Comparing, we see the following corresponding terms:

LMS              NLMS
μ                μ̃
x(n) x^H(n)      x(n) x^H(n) / ||x(n)||^2
x(n) d*(n)       x(n) d*(n) / ||x(n)||^2


Since in the LMS case

0 < μ < 2 / trace{E[x(n) x^H(n)]} = 2 / trace[R]

guarantees stability, by analogy the NLMS condition is

0 < μ̃ < 2 / trace{E[x(n) x^H(n) / ||x(n)||^2]}

Make the following approximation:

E[x(n) x^H(n) / ||x(n)||^2] ≈ E[x(n) x^H(n)] / E[||x(n)||^2]

Then

trace{E[x(n) x^H(n) / ||x(n)||^2]} ≈ trace{E[x(n) x^H(n)]} / E[||x(n)||^2]
                                   = E{trace[x(n) x^H(n)]} / E[||x(n)||^2]
                                   = E[x^H(n) x(n)] / E[||x(n)||^2]


                                   = E[||x(n)||^2] / E[||x(n)||^2] = 1

Thus the NLMS update

w(n+1) = w(n) + [μ̃ / ||x(n)||^2] x(n) e*(n)

will converge if 0 < μ̃ < 2.

• The NLMS has a simpler convergence criterion than the LMS.
• The NLMS generally converges faster than the LMS algorithm.