11. The Least-Squares Criterion

At this point we have been dealing with linear least-mean-squares estimation and forming "cost" functions accordingly; as a result, we have been describing the results in terms of the minimum mean-square error. The purpose of Chapters 11-15 is to study the recursive least-squares (RLS) algorithm in greater detail. Rather than motivate it as a stochastic-gradient approximation to a steepest-descent method, as was done in Sec. 5.9, the discussion in these chapters will bring forth deeper insights into the nature of the RLS algorithm. In particular, it will be seen in Chapter 12 that RLS is an optimal (as opposed to approximate) solution to a well-defined optimization problem. In addition, the discussion will reveal that RLS is very rich in structure, so much so that many equivalent variants exist. While all these variants are mathematically equivalent, they differ among themselves in computational complexity, in performance under finite-precision conditions, and even in modularity and ease of implementation. Most important sections (from the preamble): 11.1-11.4. See Chapters 29 to 32 of the on-line textbook.


11.1 Least-Squares Problem

Assume we have available N realizations of the random variables d and u, say,

$$\{d(0), d(1), \ldots, d(N-1)\}, \qquad \{u_0, u_1, \ldots, u_{N-1}\},$$

where the $\{d(i)\}$ are scalars and the $\{u_i\}$ are $1 \times M$. Given the $\{d(i), u_i\}$, and assuming ergodicity, we can approximate the mean-square-error cost by its sample average as

$$E\,|d - uw|^2 \;\approx\; \frac{1}{N}\sum_{i=0}^{N-1} \left|d(i) - u_i w\right|^2 .$$

In this way, the optimization problem

$$\min_{w}\; E\,|d - uw|^2$$

can be replaced by the related problem

$$\min_{w}\; \sum_{i=0}^{N-1} \left|d(i) - u_i w\right|^2 .$$

Vector Formulation

Forming the references into an $N \times 1$ vector, let

$$y = \begin{bmatrix} d(0) \\ d(1) \\ \vdots \\ d(N-1) \end{bmatrix},$$

the regressors become an $N \times M$ matrix (assuming that each $u_i$ is $1 \times M$)

$$H = \begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_{N-1} \end{bmatrix},$$

and the cost function can be rewritten in terms of vectors, matrices, and the squared norm as

$$\min_{w}\; \|y - Hw\|^2 .$$

This is defined as the standard least-squares problem.
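As a quick illustration (a minimal MATLAB sketch, not from the text; the sizes N and M, the noise level, and all variable names are arbitrary choices), the standard least-squares problem can be set up and solved directly from the normal equations:

```matlab
% Minimal sketch of the standard least-squares problem  min_w ||y - H*w||^2.
% All sizes and names here are illustrative.
N = 200;  M = 4;                          % measurements and unknowns (N >= M)
H = randn(N,M) + 1j*randn(N,M);           % N x M data matrix (rows are the regressors u_i)
w_true = randn(M,1) + 1j*randn(M,1);      % weights used only to generate data
y = H*w_true + 0.1*(randn(N,1) + 1j*randn(N,1));   % N x 1 reference vector

w_hat = (H'*H) \ (H'*y);                  % solve the normal equations H^H*H*w = H^H*y
% w_hat = H \ y;                          % equivalent built-in least-squares solve
min_cost = norm(y - H*w_hat)^2;           % value of the cost at the minimum
```

In MATLAB, H' denotes the Hermitian (conjugate) transpose, matching the $H^{H}$ notation used in these notes.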


Classically, the least-squares problem is approached in different ways depending upon whether it is over-determined (N ≥ M) or under-determined (N < M):

$$\min_{w}\; \|y - Hw\|^2 .$$

1. Over-determined least-squares (N ≥ M): In this case, the data matrix H has at least as many rows as columns, so that the number of measurements (i.e., the number of entries of y) is at least equal to the number of unknowns (i.e., the number of entries of w). This situation corresponds to an over-determined least-squares problem and, as we shall see, the cost function will either have a unique solution or an infinite (rank-dependent) number of solutions.

2. Under-determined least-squares (N < M): In this case, the data matrix H has fewer rows than columns, so that the number of measurements is less than the number of unknowns. This situation corresponds to an under-determined least-squares problem for which the cost function will have an infinite number of solutions.

All solutions of the least-squares problem are characterized as solutions to the linear system of equations

$$H^{H} y = H^{H} H\, \hat{w},$$
which are known as the normal equations. In our presentation, we shall use both geometric and algebraic derivations to establish these facts. We start with the geometric argument and later show how to arrive at the same conclusions by means of algebraic arguments.

Geometric Arguments

The vector $Hw$ lies in the column span, or range space, of the data matrix H:
$$Hw \in \mathcal{R}(H).$$
Thinking of the range space as a plane, any vector y that originates from a point in the range space but does not lie entirely within it has a projection onto the range space, and that projection makes the distance
$$\|y - H\hat{w}\|$$
a minimum. For a two-dimensional plane, this is the vector from y's origin to the point in the plane for which the remaining error (from that point to y) is perpendicular to the plane, as illustrated in the figure.


As the error vector is orthogonal to the range space, it must hold that
$$p_H^{H}\,(y - H\hat{w}) = 0,$$
where $p_H$ is a vector in the range space of H. Writing $p_H = Hp$, this becomes
$$p^{H} H^{H} (y - H\hat{w}) = 0.$$
For p an arbitrary, non-zero vector, we must have
$$H^{H} (y - H\hat{w}) = 0.$$
Thus, it can be concluded that the solutions must satisfy the normal equations
$$H^{H} y = H^{H} H\, \hat{w}.$$
The projection of y onto the range space is defined as
$$\hat{y} = H\hat{w},$$
and the residual vector is defined as
$$\tilde{y} = y - \hat{y} = y - H\hat{w}.$$
This gives rise to the orthogonality condition
$$H^{H}\tilde{y} = 0 \quad\text{or}\quad \tilde{y} \perp H,$$
and the additional orthogonality property

$$\hat{y}^{H}\tilde{y} = 0 \quad\text{or}\quad \hat{y} \perp \tilde{y}.$$

Minimum cost: based on the vector formulation, let the minimum be defined as
$$\xi = \|y - H\hat{w}\|^2 = (y - H\hat{w})^{H}(y - H\hat{w}) = y^{H}(y - H\hat{w}) - \hat{w}^{H} H^{H}(y - H\hat{w}).$$
Since $H^{H}(y - H\hat{w}) = H^{H}\tilde{y} = 0$, the second term vanishes and
$$\xi = y^{H}(y - H\hat{w}).$$
Note that this equation is also equivalent to
$$\xi = y^{H}\tilde{y}.$$
But continuing,
$$\xi = y^{H} y - y^{H} H\hat{w},$$
but for the minimum $H^{H} y = H^{H} H\hat{w}$, therefore
$$y^{H} H\hat{w} = \left(H^{H} y\right)^{H}\hat{w} = \hat{w}^{H} H^{H} H\hat{w},$$
$$\xi = y^{H} y - \hat{w}^{H} H^{H} H\hat{w},$$
but this is just
$$\xi = y^{H} y - \hat{y}^{H}\hat{y}.$$
The minimum is the squared norm of y minus the squared norm of the projection of y:
$$\xi_{\min} = \|y\|^2 - \|\hat{y}\|^2 = y^{H}\tilde{y}.$$
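These relations are easy to verify numerically. The following hedged sketch (arbitrary sizes and names) checks the orthogonality condition $H^{H}\tilde{y} = 0$ and the minimum-cost identity $\xi_{\min} = \|y\|^2 - \|\hat{y}\|^2 = y^{H}\tilde{y}$:

```matlab
% Numerical check of the orthogonality condition and the minimum-cost identity.
N = 50;  M = 3;
H = randn(N,M);  y = randn(N,1);          % arbitrary real data for illustration

w_hat = (H'*H) \ (H'*y);                  % least-squares solution
y_hat = H*w_hat;                          % projection of y onto R(H)
y_til = y - y_hat;                        % residual vector

norm(H'*y_til)                            % ~0 : H^H * y_tilde = 0
norm(y - H*w_hat)^2                       % minimum cost ...
norm(y)^2 - norm(y_hat)^2                 % ... equals ||y||^2 - ||y_hat||^2 ...
y'*y_til                                  % ... and equals y^H * y_tilde
```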


Differentiation Arguments

The cost function to be minimized is
$$J(w) = \|y - Hw\|^2 = (y - Hw)^{H}(y - Hw) = y^{H}y - y^{H}Hw - w^{H}H^{H}y + w^{H}H^{H}Hw.$$
Differentiating with respect to w to define the minimum/maximum,
$$\nabla_{w} J = -H^{H}y + H^{H}H\hat{w} = 0.$$
Therefore the weight estimate satisfies $H^{H}H\,\hat{w} = H^{H}y$.

To determine if this is indeed a minimum (or a maximum), differentiate again and ensure that the result is non-negative definite:
$$\nabla_{w}^{2} J = \frac{\partial^2 J}{\partial w\,\partial w^{*}} = H^{H}H,$$
which will be zero or positive definite.

Vector equivalent of the cost function

$$J(w) = \begin{bmatrix} y \\ w \end{bmatrix}^{H} \begin{bmatrix} I_N & -H \\ -H^{H} & H^{H}H \end{bmatrix} \begin{bmatrix} y \\ w \end{bmatrix}.$$
Performing a block $UDU^{H}$ factorization of the central matrix,
$$\begin{bmatrix} I_N & -H \\ -H^{H} & H^{H}H \end{bmatrix} = \begin{bmatrix} I_N & -D \\ 0 & I_M \end{bmatrix} \begin{bmatrix} I_N - DH^{H} & 0 \\ 0 & H^{H}H \end{bmatrix} \begin{bmatrix} I_N & 0 \\ -D^{H} & I_M \end{bmatrix},$$
where $D\,H^{H}H = H$ (equivalently $H^{H}H\,D^{H} = H^{H}$), or
$$D = H\left(H^{H}H\right)^{-1} \quad\text{or}\quad D^{H} = \left(H^{H}H\right)^{-1}H^{H}.$$

For this matrix structure the cost becomes
$$J(w) = \begin{bmatrix} y \\ w \end{bmatrix}^{H} \begin{bmatrix} I_N & -D \\ 0 & I_M \end{bmatrix} \begin{bmatrix} I_N - DH^{H} & 0 \\ 0 & H^{H}H \end{bmatrix} \begin{bmatrix} I_N & 0 \\ -D^{H} & I_M \end{bmatrix} \begin{bmatrix} y \\ w \end{bmatrix}$$
$$J(w) = \begin{bmatrix} y \\ w - D^{H}y \end{bmatrix}^{H} \begin{bmatrix} I_N - DH^{H} & 0 \\ 0 & H^{H}H \end{bmatrix} \begin{bmatrix} y \\ w - D^{H}y \end{bmatrix}$$
$$J(w) = y^{H}\left(I_N - DH^{H}\right)y + \left(w - D^{H}y\right)^{H} H^{H}H \left(w - D^{H}y\right).$$
The two terms imply one term in y alone and another based on both y and w. As both terms can be shown to be non-negative, the w that minimizes the cost must minimize the second term (it has no effect on the first), leaving the cost
$$J_{\min} = y^{H}\left(I_N - DH^{H}\right)y.$$


We seek to minimize the second term, which is achieved by choosing $\hat{w}$ such that
$$H\left(\hat{w} - D^{H}y\right) = 0,$$
from which we have the solution
$$H\hat{w} = HD^{H}y.$$
Substituting for D,
$$H\hat{w} = H\left(H^{H}H\right)^{-1}H^{H}y.$$
Premultiplying by $H^{H}$ (to eliminate the inverse),
$$H^{H}H\hat{w} = H^{H}H\left(H^{H}H\right)^{-1}H^{H}y = H^{H}y,$$
which is the normal equation, as desired. As an alternate thought, we have
$$H^{H}H\left(\hat{w} - D^{H}y\right) = 0.$$
For this to occur, the element in parentheses must reside in the null space of $H^{H}H$ and H.
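The completion-of-squares decomposition can also be checked numerically. The sketch below (illustrative only, full-column-rank H assumed) evaluates both forms of the cost at an arbitrary test point w:

```matlab
% Check of the completion-of-squares form of the cost (full column rank case).
N = 30;  M = 5;
H = randn(N,M);  y = randn(N,1);  w = randn(M,1);   % arbitrary test point w

D  = H / (H'*H);                          % D = H * inv(H'*H)
J1 = norm(y - H*w)^2;                     % original cost
J2 = y'*(eye(N) - D*H')*y ...             % term in y only (the minimum value)
   + (w - D'*y)'*(H'*H)*(w - D'*y);       % term driven to zero by w_hat = D'*y
disp([J1 J2])                             % the two values agree
```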


Theorem 11.1.2 (Properties of the least-squares problem)

1. The solution $\hat{w}$ is unique if, and only if, the data matrix H has full column rank (i.e., all its columns are linearly independent, which necessarily requires N ≥ M). In this case, $\hat{w}$ is given by
$$\hat{w} = \left(H^{H}H\right)^{-1}H^{H}y.$$
This situation occurs only for over-determined least-squares problems.

2. When $H^{H}H$ is singular, then infinitely many solutions $\hat{w}$ exist and any two solutions differ by a vector in the nullspace of H, i.e., if $\hat{w}_1$ and $\hat{w}_2$ are any two solutions, then $H(\hat{w}_1 - \hat{w}_2) = 0$, or
$$\hat{w}_1 - \hat{w}_2 \in \mathcal{N}(H).$$
This situation can occur for both over- and under-determined least-squares problems.

3. When many solutions $\hat{w}$ exist, regardless of which one we pick, the resulting projection vector $\hat{y} = H\hat{w}$ is the same and the resulting minimum cost is also the same, given by
$$\xi_{\min} = \|y\|^2 - \|\hat{y}\|^2 = y^{H}\tilde{y}.$$

4. When many solutions $\hat{w}$ exist, the one that has the smallest Euclidean norm, namely, the one that solves
$$\min_{\hat{w}} \|\hat{w}\|^2 \quad\text{subject to}\quad H^{H}H\,\hat{w} = H^{H}y,$$
is given by
$$\hat{w} = \left[\left(H^{H}H\right)^{\dagger}H^{H}\right] y,$$
where the term in brackets is the pseudo-inverse of H, $H^{\dagger}$ (see App. 11.C).
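For the rank-deficient (or under-determined) case, the minimum-norm solution of the theorem can be obtained in MATLAB with pinv; a small illustrative sketch (arbitrary sizes) follows:

```matlab
% Minimum-norm least-squares solution via the pseudo-inverse (under-determined case).
N = 3;  M = 6;                            % fewer measurements than unknowns
H = randn(N,M);  y = randn(N,1);

w_mn  = pinv(H)*y;                        % minimum Euclidean-norm solution
w_any = w_mn + null(H)*randn(M-N,1);      % another solution: add a nullspace component

disp(norm(H'*(y - H*w_mn)))               % both satisfy the normal equations (~0)
disp(norm(H'*(y - H*w_any)))
disp([norm(w_mn) norm(w_any)])            % the pinv solution has the smaller norm
```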

Projection Matrices

Concentrating on over-determined solutions with full-rank H (N ≥ M), the least-squares solution was defined as
$$\hat{w} = \left(H^{H}H\right)^{-1}H^{H}y,$$
and the projection vector of y on H becomes
$$\hat{y} = H\hat{w} = H\left(H^{H}H\right)^{-1}H^{H}y.$$
Define the projection matrix through
$$\hat{y} = H\hat{w} = H\left(H^{H}H\right)^{-1}H^{H}y = P_{H}\,y,$$
or
$$P_{H} = H\left(H^{H}H\right)^{-1}H^{H}.$$
Some interesting properties of this matrix are
$$P_{H}^{H} = P_{H} \quad\text{and}\quad P_{H}^{2} = P_{H}.$$


Defining other terms, the residual becomes
$$\tilde{y} = y - \hat{y} = y - H\hat{w} = y - P_{H}\,y = \left(I - P_{H}\right)y.$$
The projection matrix onto the orthogonal complement is then
$$P_{H}^{\perp} = I - P_{H}.$$
The minimum cost:
$$\xi_{\min} = y^{H}y - \hat{y}^{H}\hat{y} = y^{H}y - y^{H}P_{H}^{H}P_{H}\,y = y^{H}y - y^{H}P_{H}\,y = y^{H}\left(I - P_{H}\right)y = y^{H}P_{H}^{\perp}\,y.$$
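A brief numerical sketch (illustrative names and sizes) of the projection-matrix properties and the associated minimum cost:

```matlab
% Properties of the projection matrix P_H = H*inv(H'*H)*H' (full column rank H).
N = 20;  M = 4;
H = randn(N,M);  y = randn(N,1);

P  = H*((H'*H)\H');                       % projection onto R(H)
Pp = eye(N) - P;                          % projection onto the orthogonal complement
w_hat = (H'*H)\(H'*y);

norm(P - P')                              % ~0 : P_H is Hermitian
norm(P*P - P)                             % ~0 : P_H is idempotent
norm(P*y - H*w_hat)                       % ~0 : P_H*y equals the projection H*w_hat
norm(y - H*w_hat)^2 - y'*Pp*y             % ~0 : minimum cost equals y'*P_H_perp*y
```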


11.2 Weighted Least-Squares

As seen in Chap. 9, there is a weighting factor that may also be involved in determining the cost function. The new minimization becomes
$$\min_{w}\;(y - Hw)^{H}\,W\,(y - Hw),$$
where W is a Hermitian positive-definite weight matrix. This may also be written as
$$\min_{w}\;\|y - Hw\|_{W}^{2},$$

as was defined in Chap. 9 for the weighted norm of x. This form is equivalent to a transformed version of the standard least-squares problem. Let

$$W = V\,\Lambda^{1/2}\,\Lambda^{1/2}\,V^{H},$$
the eigen-decomposition of the weight matrix into two unitary matrices (V and $V^{H}$) and a diagonal matrix (split into its two square-root factors). Therefore,
$$V^{H}V = VV^{H} = I.$$
By substitution,
$$\min_{w}\;(y - Hw)^{H}\,V\Lambda^{1/2}\Lambda^{1/2}V^{H}\,(y - Hw),$$

and combining the weight elements with the existing terms

$$\min_{w}\;\left(\Lambda^{1/2}V^{H}y - \Lambda^{1/2}V^{H}Hw\right)^{H}\left(\Lambda^{1/2}V^{H}y - \Lambda^{1/2}V^{H}Hw\right).$$
If we transform the input y and the H matrix, they become
$$a = \Lambda^{1/2}V^{H}y, \qquad A = \Lambda^{1/2}V^{H}H,$$
and we have the standard least-squares problem
$$\min_{w}\;(a - Aw)^{H}(a - Aw).$$

This problem is solved as

$$A^{H}\left(a - A\hat{w}\right) = 0,$$
which, when using the previous definitions, becomes
$$\left(\Lambda^{1/2}V^{H}H\right)^{H}\left(\Lambda^{1/2}V^{H}y - \Lambda^{1/2}V^{H}H\hat{w}\right) = 0$$
$$H^{H}V\Lambda^{1/2}\Lambda^{1/2}V^{H}\left(y - H\hat{w}\right) = 0$$
$$H^{H}W\left(y - H\hat{w}\right) = 0.$$
Comparing with the orthogonality condition (29.6) in the unweighted case, we see that the only difference is the presence of the weighting matrix W.


This conclusion suggests that we can extend to the weighted least-squares setting the same geometric properties of the standard least-squares setting if we simply employ the concept of weighted inner products. Specifically, for any two column vectors {c, d } , we can define their weighted inner product as

$$\langle c, d\rangle_{W} = c^{H}Wd,$$

and then say that c and d are orthogonal whenever their weighted inner product is zero. Using this definition, we can interpret

$$H^{H}W\left(y - H\hat{w}\right) = 0$$
to mean that the residual vector, $y - H\hat{w}$, is orthogonal to the column span of H in a weighted sense, i.e.,
$$\langle q,\; y - H\hat{w}\rangle_{W} = q^{H}W\left(y - H\hat{w}\right) = 0 \quad\text{for any } q \in \mathcal{R}(H).$$
We further conclude that the normal equations are now replaced by
$$H^{H}WH\,\hat{w} = H^{H}Wy.$$
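A minimal sketch (illustrative names; a full-column-rank H and a diagonal W are assumed) of the weighted normal equations and of their equivalence with the transformed problem in a and A:

```matlab
% Weighted least-squares: solve H^H*W*H*w = H^H*W*y and compare with the
% transformed (unweighted) problem built from a = Lam^(1/2)*V'*y, A = Lam^(1/2)*V'*H.
N = 100;  M = 4;
H = randn(N,M);  y = randn(N,1);
W = diag(0.5 + rand(N,1));                % Hermitian positive-definite weight matrix

w_w = (H'*W*H) \ (H'*W*y);                % weighted normal equations

[V,Lam] = eig(W);                         % W = V*Lam*V'
A = sqrt(Lam)*V'*H;  a = sqrt(Lam)*V'*y;  % transformed data
w_t = (A'*A) \ (A'*a);                    % standard least-squares on (a, A)

norm(w_w - w_t)                           % ~0 : the two solutions coincide
norm(H'*W*(y - H*w_w))                    % ~0 : weighted orthogonality condition
```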


Theorem 11.2.1 (Properties of the weighted least-squares problem)

Any solution $\hat{w}$ of the weighted least-squares problem satisfies the normal equations
$$H^{H}WH\,\hat{w} = H^{H}Wy.$$
The following properties hold:

1. These normal equations are always consistent, i.e., a solution always exists.

2. The solution is unique if, and only if, the data matrix H has full column rank (i.e., all its columns are linearly independent, which necessarily requires N ≥ M). In this case, $\hat{w}$ is given by

$$\hat{w} = \left(H^{H}WH\right)^{-1}H^{H}Wy.$$
Note: this situation occurs only for over-determined least-squares problems.

3. When $H^{H}WH$ is singular, which is equivalent to $H^{H}H$ being singular, then infinitely many solutions $\hat{w}$ exist and any two solutions differ by a vector in the nullspace of H, i.e., if $\hat{w}_1$ and $\hat{w}_2$ are any two solutions, then $H(\hat{w}_1 - \hat{w}_2) = 0$, or
$$\hat{w}_1 - \hat{w}_2 \in \mathcal{N}(H).$$
This situation can occur for both over- and under-determined least-squares problems.

4. When many solutions $\hat{w}$ exist, regardless of which one we pick, the resulting projection vector $\hat{y} = H\hat{w}$ is the same and the resulting minimum cost is also the same, given by
$$\xi_{\min} = \|y\|_{W}^{2} - \|\hat{y}\|_{W}^{2} = y^{H}W\tilde{y},$$
where $\hat{y} = H\hat{w}$.

5. When many solutions $\hat{w}$

exist, the one that has the smallest Euclidean norm, namely, the one that solves

$$\min_{\hat{w}} \|\hat{w}\|^2 \quad\text{subject to}\quad H^{H}WH\,\hat{w} = H^{H}Wy,$$
is given by the pseudo-inverse equation
$$\hat{w} = A^{\dagger}a,$$
where † refers to the pseudo-inverse of A, and
$$a = \Lambda^{1/2}V^{H}y, \qquad A = \Lambda^{1/2}V^{H}H.$$


Projection Matrix Definition

Concentrating on over-determined solutions with full-rank H (N ≥ M), the weighted least-squares solution was defined as
$$\hat{w} = \left(H^{H}WH\right)^{-1}H^{H}Wy,$$
and the projection vector of y on H becomes
$$\hat{y} = H\hat{w} = H\left(H^{H}WH\right)^{-1}H^{H}Wy.$$
Define the projection matrix through
$$\hat{y} = H\hat{w} = H\left(H^{H}WH\right)^{-1}H^{H}Wy = P_{H}\,y,$$
or
$$P_{H} = H\left(H^{H}WH\right)^{-1}H^{H}W.$$
The properties of this projection matrix are
$$WP_{H} = P_{H}^{H}W, \qquad P_{H}^{2} = P_{H}, \qquad P_{H}^{H}WP_{H} = WP_{H}.$$
The residual becomes, as before,
$$\tilde{y} = y - \hat{y} = y - H\hat{w} = y - P_{H}\,y = \left(I - P_{H}\right)y.$$
The projection matrix onto the orthogonal complement is then
$$P_{H}^{\perp} = I - P_{H}.$$
The minimum cost:
$$\xi_{\min} = y^{H}Wy - \hat{y}^{H}W\hat{y} = y^{H}Wy - y^{H}P_{H}^{H}WP_{H}\,y = y^{H}Wy - y^{H}WP_{H}\,y = y^{H}W\left(I - P_{H}\right)y = y^{H}WP_{H}^{\perp}\,y.$$


11.3 Regularized Least-Squares

An expanded cost function is often used in control-system theory, where additional factors beyond the weighted error are of concern and are to be minimized. For our purposes, we can directly add a weight-misadjustment term, forming regularized least-squares as

$$\min_{w}\;\left[(w - \bar{w})^{H}\,\Pi\,(w - \bar{w}) + (y - Hw)^{H}(y - Hw)\right],$$

where Π is a Hermitian positive-definite weight matrix. Typically Π is a multiple of the identity matrix, and $\bar{w}$ may be the "optimal weight" or could even be zero (using the weight magnitudes as part of the cost function). Beyond expanding the definition of the cost, this form also offers a way to incorporate a-priori information about the solution into the cost function. For example, Π can be used to express the certainty that $\bar{w}$ is an excellent starting point for estimating w. This solution can also relieve problems associated with rank deficiency in the matrix H. Developing solutions … the cost function can be differentiated! Differentiating with respect to w to define the minimum/maximum,

$$\nabla_{w}J = \Pi\left(\hat{w} - \bar{w}\right) - H^{H}\left(y - H\hat{w}\right) = 0.$$
If we again differentiate the cost function,
$$\nabla_{w}^{2}J = \Pi + H^{H}H.$$

As mentioned, if H is rank deficient, Π can be used to mitigate the deficiency and, in fact, reduce the distance between the maximum and minimum eigenvalues of the composite matrix! Solving for the optimal weight estimate we have

$$H^{H}\left(y - H\hat{w}\right) = \Pi\left(\hat{w} - \bar{w}\right),$$
or, collecting the terms in $\hat{w}$ (and using the Hermitian symmetry of Π),
$$\left(\Pi + H^{H}H\right)\hat{w} = H^{H}y + \Pi\,\bar{w},$$
$$\hat{w} = \left(\Pi + H^{H}H\right)^{-1}H^{H}y + \left(\Pi + H^{H}H\right)^{-1}\Pi\,\bar{w}.$$
The second, new term is based on the a-priori information. Note that if $\bar{w}$ is zero, the term becomes zero.
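A minimal illustrative sketch of this regularized solution (arbitrary sizes; Π is taken as a multiple of the identity, and alpha and w_bar are illustrative values):

```matlab
% Regularized least-squares with Pi = alpha*I and a prior guess w_bar.
N = 100;  M = 4;  alpha = 0.5;
H = randn(N,M);  y = randn(N,1);
Pi = alpha*eye(M);  w_bar = zeros(M,1);   % regularization matrix and prior guess

w_reg = (Pi + H'*H) \ (H'*y + Pi*w_bar);  % (Pi + H^H*H)*w_hat = H^H*y + Pi*w_bar

% Regularization also keeps the problem well posed when H is rank deficient:
Hd  = [H, H(:,1)];                        % repeated column -> Hd'*Hd is singular
w_d = (alpha*eye(M+1) + Hd'*Hd) \ (Hd'*y);% still a unique, well-defined solution
```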


The text uses an alternate solution involving a change of variables and the solution of an augmented matrix structure. First, for the change of variables, let
$$z = w - \bar{w} \quad\text{and}\quad b = y - H\bar{w}.$$
Then the minimization becomes
$$\min_{z}\;\left[z^{H}\Pi z + (b - Hz)^{H}(b - Hz)\right].$$

Performing the eigen-decomposition of Π,
$$\Pi = U\Lambda U^{H},$$

the minimization can be written in an augmented matrix form as

$$\min_{z}\;\left(\begin{bmatrix} 0 \\ b \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2}U^{H} \\ H \end{bmatrix} z\right)^{H}\left(\begin{bmatrix} 0 \\ b \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2}U^{H} \\ H \end{bmatrix} z\right).$$

Now the solution takes the form of the least-squares problem

$$\min_{w}\;(y - Hw)^{H}(y - Hw),$$
when we substitute the previous y, H, and w with the augmented terms above. From before, we had
$$H^{H}H\,\hat{w} = H^{H}y.$$
Therefore, the augmented solution must satisfy

$$\begin{bmatrix} \Lambda^{1/2}U^{H} \\ H \end{bmatrix}^{H}\begin{bmatrix} \Lambda^{1/2}U^{H} \\ H \end{bmatrix}\hat{z} = \begin{bmatrix} \Lambda^{1/2}U^{H} \\ H \end{bmatrix}^{H}\begin{bmatrix} 0 \\ b \end{bmatrix},$$
$$\left(U\Lambda^{1/2}\Lambda^{1/2}U^{H} + H^{H}H\right)\hat{z} = H^{H}b,$$
$$\left(\Pi + H^{H}H\right)\hat{z} = H^{H}b,$$

and reversing the substitution,
$$\left(\Pi + H^{H}H\right)\left(\hat{w} - \bar{w}\right) = H^{H}\left(y - H\bar{w}\right),$$
which is the form of the normal equations. Continuing,
$$\left(\Pi + H^{H}H\right)\hat{w} = H^{H}y - H^{H}H\,\bar{w} + \left(\Pi + H^{H}H\right)\bar{w} = H^{H}y + \Pi\,\bar{w},$$
$$\hat{w} = \left(\Pi + H^{H}H\right)^{-1}H^{H}y + \left(\Pi + H^{H}H\right)^{-1}\Pi\,\bar{w},$$
which is identical to the derivative solution above. The existence of the inverse is now dependent upon the combination $\Pi + H^{H}H$ being positive definite, and not on the individual components. Therefore, this is offered as an alternate solution when H is rank deficient (think about solving the exam 1 constrained-estimation equalizer!). As a note, this solution always exists and is unique!
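The equivalence between the augmented formulation and the direct (derivative) solution can be verified numerically; the sketch below (illustrative names and sizes) builds the augmented data and solves it with an ordinary least-squares solve:

```matlab
% Check: the augmented least-squares formulation reproduces the regularized solution.
N = 80;  M = 4;
H = randn(N,M);  y = randn(N,1);
Pi = diag(0.2 + rand(M,1));  w_bar = randn(M,1);

% Direct (derivative) solution.
w_direct = (Pi + H'*H) \ (H'*y + Pi*w_bar);

% Augmented formulation in the variables z = w - w_bar, b = y - H*w_bar.
[U,Lam] = eig(Pi);                        % Pi = U*Lam*U'
Haug = [sqrt(Lam)*U'; H];                 % stacked (M+N) x M data matrix
yaug = [zeros(M,1); y - H*w_bar];         % stacked reference vector [0; b]
z_hat = Haug \ yaug;                      % ordinary least-squares solve
w_aug = w_bar + z_hat;

norm(w_direct - w_aug)                    % ~0 : the two solutions coincide
```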


We computed some additional terms for the minimum cost and orthogonality conditions. It is useful to repeat these computations based on the augmented matrix construct. From before we had

$$\min_{w}\;\|y - Hw\|^{2}, \qquad \hat{w} = \left(H^{H}H\right)^{-1}H^{H}y.$$
Minimum cost:
$$\xi_{\min} = \|y\|^{2} - \|\hat{y}\|^{2} = y^{H}\tilde{y}.$$
Orthogonality condition:
$$H^{H}\left(y - H\hat{w}\right) = H^{H}\tilde{y} = 0.$$

With the augmented system the minimum cost becomes

$$\xi_{\min} = \begin{bmatrix} 0 \\ b \end{bmatrix}^{H}\left(\begin{bmatrix} 0 \\ b \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2}U^{H} \\ H \end{bmatrix}\hat{z}\right) = b^{H}\left(b - H\hat{z}\right) = \left(y - H\bar{w}\right)^{H}\left(y - H\bar{w} - H(\hat{w} - \bar{w})\right) = \left(y - H\bar{w}\right)^{H}\left(y - H\hat{w}\right) = \left(y - H\bar{w}\right)^{H}\tilde{y}.$$

As a note, this form of the equation does not include or require Π … An alternate way to proceed would be from

$$\left(\Pi + H^{H}H\right)\left(\hat{w} - \bar{w}\right) = H^{H}\left(y - H\bar{w}\right),$$
where now
$$\xi_{\min} = \left(y - H\bar{w}\right)^{H}\left(y - H\hat{w}\right)$$
becomes
$$\xi_{\min} = \left(y - H\bar{w}\right)^{H}\left(y - H\bar{w} - H\left(\Pi + H^{H}H\right)^{-1}H^{H}\left(y - H\bar{w}\right)\right) = \left(y - H\bar{w}\right)^{H}\left(I - H\left(\Pi + H^{H}H\right)^{-1}H^{H}\right)\left(y - H\bar{w}\right).$$
Using the matrix inversion formula
$$\left(A + BCD\right)^{-1} = A^{-1} - A^{-1}B\left(C^{-1} + DA^{-1}B\right)^{-1}DA^{-1},$$
we get
$$\xi_{\min} = \left(y - H\bar{w}\right)^{H}\left(I + H\,\Pi^{-1}H^{H}\right)^{-1}\left(y - H\bar{w}\right).$$

With the augmented system the orthogonality condition becomes

$$\begin{bmatrix} \Lambda^{1/2}U^{H} \\ H \end{bmatrix}^{H}\left(\begin{bmatrix} 0 \\ b \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2}U^{H} \\ H \end{bmatrix}\hat{z}\right) = 0,$$
$$-U\Lambda^{1/2}\Lambda^{1/2}U^{H}\hat{z} + H^{H}\left(b - H\hat{z}\right) = 0,$$
$$H^{H}\left(b - H\hat{z}\right) = U\Lambda^{1/2}\Lambda^{1/2}U^{H}\hat{z} = \Pi\hat{z} = \Pi\left(\hat{w} - \bar{w}\right).$$
Focusing on $b - H\hat{z}$:
$$b - H\hat{z} = y - H\bar{w} - H\left(\hat{w} - \bar{w}\right) = y - H\hat{w} = \tilde{y}.$$
So finally, we have the orthogonality condition
$$H^{H}\tilde{y} = \Pi\left(\hat{w} - \bar{w}\right).$$

This term is zero when Π is zero, but otherwise it becomes an "alternate orthogonality condition" for regularized least-squares.

11.4 Weighted Regularized Least-Squares

This is a homework problem, but it follows from a combination of the techniques used for weighted least-squares and regularized least-squares. For our purposes, we can directly add a weight-misadjustment term, as in regularized least-squares, and a weighting of the error, as in weighted least-squares:

$$\min_{w}\;\left[(w - \bar{w})^{H}\,\Pi\,(w - \bar{w}) + (y - Hw)^{H}\,W\,(y - Hw)\right],$$

where W and Π are Hermitian positive-definite weight matrices. Typically Π is a multiple of the identity matrix and W may be diagonal (focusing on the individual errors). Use both substitutions previously performed: a transformation is applied first (to handle the weighting), followed by the change of variables and an augmented matrix solution. Presenting the results:

Theorem 11.4.1 (Weighted regularized least-squares)

The solution is always unique and is given by

$$\hat{w} = \bar{w} + \left(\Pi + H^{H}WH\right)^{-1}H^{H}W\left(y - H\bar{w}\right),$$
the resulting minimum cost is given by
$$\xi_{\min} = \left(y - H\bar{w}\right)^{H}\left(W^{-1} + H\,\Pi^{-1}H^{H}\right)^{-1}\left(y - H\bar{w}\right),$$
where $\tilde{y} = y - \hat{y} = y - H\hat{w}$. The orthogonality condition becomes
$$H^{H}W\tilde{y} = \Pi\left(\hat{w} - \bar{w}\right).$$
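A short illustrative check of Theorem 11.4.1 (arbitrary sizes; W and Π are random positive-definite diagonal matrices used only for the sketch):

```matlab
% Weighted regularized least-squares: solution, minimum cost, and orthogonality.
N = 60;  M = 3;
H = randn(N,M);  y = randn(N,1);  w_bar = randn(M,1);
W  = diag(0.5 + rand(N,1));               % error weighting
Pi = diag(0.5 + rand(M,1));               % regularization weighting

w_hat = w_bar + (Pi + H'*W*H) \ (H'*W*(y - H*w_bar));
y_til = y - H*w_hat;

cost  = (w_hat - w_bar)'*Pi*(w_hat - w_bar) + y_til'*W*y_til;
cost2 = (y - H*w_bar)'*((inv(W) + H*(Pi\H'))\(y - H*w_bar));
disp([cost cost2])                        % the two minimum-cost expressions agree
norm(H'*W*y_til - Pi*(w_hat - w_bar))     % ~0 : orthogonality condition
```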


A summary of the four least-squares variants is provided in Table 11.1 on p. 672.

Their orthogonality conditions are given in Table 11.2.


The minimum costs are given in Table 11.3.

Appendix 11.A

Equivalence Results in Linear Estimation

Rather than repeating textbook information … go to pp. 724 and 725. Using the regularization matrix and the weighting matrix, the deterministic problem is identical in structure to the stochastic problem! In addition, the m.m.s.e. is related to the minimum cost. More fun with matrix theory … see the QR-decomposition material in Appendix 11.B.


Matlab Simulations

Project 11.1 (Amplitude tone detection)

A linear process model:

$$y = Hx + v,$$
where x and v are independent random processes; v will be Gaussian noise and x will be a sinusoid of known frequency with a random amplitude, uniformly distributed between -1 and 1. The linear m.m.s.e. estimate (Theorem 2.6.1) provides

$$R_{xy} = R_{x}H^{H}.$$
Then
$$\hat{x} = R_{x}H^{H}\left(R_{v} + HR_{x}H^{H}\right)^{-1}y,$$
or equivalently
$$\hat{x} = \left(R_{x}^{-1} + H^{H}R_{v}^{-1}H\right)^{-1}H^{H}R_{v}^{-1}\,y.$$
And the cost function is
$$J_{\mathrm{opt}} = K = \left(R_{x}^{-1} + H^{H}R_{v}^{-1}H\right)^{-1}.$$

If we used a weighted regularized least-squares cost function defined as

$$\min_{x}\;\left[x^{H}R_{x}^{-1}x + (y - Hx)^{H}R_{v}^{-1}(y - Hx)\right].$$
The solution becomes
$$\hat{x} = \left(R_{x}^{-1} + H^{H}R_{v}^{-1}H\right)^{-1}H^{H}R_{v}^{-1}\,y,$$
and the minimum cost is given by
$$\xi_{\min} = y^{H}\left(R_{v} + HR_{x}H^{H}\right)^{-1}y.$$

For $R_{x} = \sigma_{x}^{2}I$ and $R_{v} = \sigma_{v}^{2}I$,
$$\hat{x} = \left(\frac{1}{\sigma_{x}^{2}}I + \frac{1}{\sigma_{v}^{2}}H^{H}H\right)^{-1}\frac{1}{\sigma_{v}^{2}}H^{H}y$$
$$\hat{x} = \left(\frac{\sigma_{v}^{2}}{\sigma_{x}^{2}}I + H^{H}H\right)^{-1}H^{H}y$$
$$\hat{x} = \left(\frac{1}{\mathrm{SNR}}\,I + H^{H}H\right)^{-1}H^{H}y.$$

Note that 1/SNR is a popular choice for the regularization value multiplying the identity matrix.
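As a starting point for the simulation, a hedged sketch is given below; the tone frequency f0, the record length Nsamp, and the way the noise level is set from the requested SNR are all illustrative choices, not part of the project statement:

```matlab
% Sketch for Project 11.1: estimate a random tone amplitude with the
% SNR-regularized estimator  x_hat = (I/SNR + H'*H) \ (H'*y).
Nsamp = 256;  f0 = 0.05;                  % illustrative record length and tone frequency
n = (0:Nsamp-1).';
H = cos(2*pi*f0*n);                       % known sinusoid (the regressor)
a = 2*rand - 1;                           % random amplitude, uniform on [-1, 1]

SNRdB = 10;  SNR = 10^(SNRdB/10);
sigv  = sqrt(var(a*H)/SNR);               % noise level giving the requested SNR
y = a*H + sigv*randn(Nsamp,1);            % received signal

a_hat = (1/SNR + H'*H) \ (H'*y);          % regularized estimate of the amplitude
y_hat = H*a_hat;                          % estimated (denoised) tone
plot(n, y, n, y_hat);  legend('y','estimate of y');
```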


Part A) Plot y and the estimate of y for three SNR values: 10, 20, and 30 dB. Let H = the known sinusoid; "a" is the amplitude of the sinusoid.

$$\hat{x} = \left(\frac{1}{\mathrm{SNR}}\,I + H^{H}H\right)^{-1}H^{H}y.$$

Part B) Use a range of regularization parameters instead of the SNR. For the input, use a 10 dB SNR. (Note: alpha = 0.1 corresponds to +10 dB).

Project 11.2 (OFDM Receiver)

Welcome to advanced communication-system signal considerations. Orthogonal frequency-division multiplexing (OFDM) symbol transmission:

Complex data symbols (QAM-based constellation values) are placed in frequency bins. The data are inverse discrete Fourier transformed, generating a time sequence of fixed length. The time sequence is "circular" in that performing a circular shift on the sequence would only result in additional linear phase on the symbols (if the shifted sequence were directly discrete Fourier transformed). The last part of the sequence is pre-pended to the front of the sequence as a cyclic prefix (note that any segment equal to the original length would just pick up a linear phase if a DFT were performed). The signal is transmitted. The received signal is truncated to be the exact length of the original DFT, preferably cutting off the cyclic prefix that was prepended. A DFT is performed on the sequence. The "data symbols" are then in the DFT bins, with some corruption. See the on-line textbook for the detailed OFDM system description.
