11. The Least-Squares Criterion
At this point we have been dealing with linear least-mean-squares estimation and forming "cost" functions accordingly; as a result, we have been describing the results in terms of the minimum mean-square error. The purpose of Chapters 11-15 is to study the recursive least-squares (RLS) algorithm in greater detail. Rather than motivating it as a stochastic-gradient approximation to a steepest-descent method, as was done in Sec. 5.9, the discussion in these chapters will bring forth deeper insights into the nature of the RLS algorithm. In particular, it will be seen in Chapter 12 that RLS is an optimal (as opposed to approximate) solution to a well-defined optimization problem. In addition, the discussion will reveal that RLS is very rich in structure, so much so that many equivalent variants exist. While all these variants are mathematically equivalent, they differ among themselves in computational complexity, performance under finite-precision conditions, and even in modularity and ease of implementation.

Most important sections (from the preamble): 11.1-11.4. See Chapters 29 to 32 of the on-line textbook.
11.1 Least-Squares Problem

Assume we have available $N$ realizations of the random variables $d$ and $u$, say,

$$\{d(0), d(1), \ldots, d(N-1)\}, \qquad \{u_0, u_1, \ldots, u_{N-1}\},$$

where the $\{d(i)\}$ are scalars and the $\{u_i\}$ are $1 \times M$. Given the $\{d(i), u_i\}$, and assuming ergodicity, we can approximate the mean-square-error cost by its sample average as

$$E\,|d - uw|^2 \;\approx\; \frac{1}{N}\sum_{i=0}^{N-1} |d(i) - u_i w|^2.$$
In this way, the optimization problem

$$\min_w \; E\,|d - uw|^2$$

can be replaced by the related problem

$$\min_w \; \sum_{i=0}^{N-1} |d(i) - u_i w|^2.$$
Vector Formulation

Forming the references into an $N \times 1$ vector, let

$$y = \begin{bmatrix} d(0) \\ d(1) \\ \vdots \\ d(N-1) \end{bmatrix},$$

the regressors become an $N \times M$ matrix (assuming that each $u_i$ is $1 \times M$),

$$H = \begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_{N-1} \end{bmatrix},$$

and the cost function is rewritten in terms of vectors, matrices, and the squared norm as

$$\min_w \; \|y - Hw\|^2.$$

This is defined as the standard least-squares problem.
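As an added illustration (not part of the original notes), the following NumPy sketch stacks assumed data $\{d(i), u_i\}$ into $y$ and $H$ and solves the standard least-squares problem; the dimensions, the complex regressors, and the noise model are assumptions chosen only for the example.

```python
import numpy as np

# Illustrative sketch: build y (N x 1) and H (N x M) from assumed data and
# solve min_w ||y - H w||^2.  Dimensions and the data model are assumptions.
rng = np.random.default_rng(0)
N, M = 50, 4
H = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))  # rows are the 1 x M regressors u_i
w_true = rng.standard_normal((M, 1))
noise = 0.1 * (rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1)))
y = H @ w_true + noise                                               # the d(i) stacked into y

# Least-squares solution (equivalently (H^H H)^{-1} H^H y when H has full column rank)
w_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
print(np.linalg.norm(y - H @ w_hat) ** 2)   # minimum cost ||y - H w_hat||^2
```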
Classically, the least-squares problem is approached in different ways depending upon whether it is over-determined ($N \ge M$) or under-determined ($N < M$):

$$\min_w \; \|y - Hw\|^2$$

1. Over-determined least-squares ($N \ge M$): In this case, the data matrix $H$ has at least as many rows as columns, so that the number of measurements (i.e., the number of entries in $y$) is at least equal to the number of unknowns (i.e., the number of entries in $w$). This situation corresponds to an over-determined least-squares problem and, as we shall see, the cost function will either have a unique solution or an infinite (rank-dependent) number of solutions.

2. Under-determined least-squares ($N < M$): In this case, the data matrix $H$ has fewer rows than columns, so that the number of measurements is less than the number of unknowns. This situation corresponds to an under-determined least-squares problem for which the cost function will have an infinite number of solutions.

All solutions of the least-squares problem are characterized as solutions to the linear system of equations

$$H^H y = H^H H \hat{w},$$

which are known as the normal equations. In our presentation, we shall use both geometric and algebraic derivations to establish these facts. We start with the geometric argument and later show how to arrive at the same conclusions by means of algebraic arguments.
Geometric Arguments

The vector $Hw$ lies in the column span, or range space, of the data matrix $H$:

$$Hw \in \mathcal{R}(H).$$

If the range space is viewed as a linear plane, then any vector $y$ not lying entirely within the range space has a projection onto that space, and the projection is the point of the range space that makes the distance

$$\|y - H\hat{w}\|$$

a minimum. For a two-dimensional plane, this is obtained by dropping a perpendicular from the tip of $y$ onto the plane, as illustrated in the accompanying figure.
As the error vector is orthogonal to the range space, it must hold that

$$(Hp)^H (y - H\hat{w}) = 0,$$

where $Hp$ is any vector in the range space of $H$. This becomes

$$p^H H^H (y - H\hat{w}) = 0.$$

For $p$ an arbitrary, non-zero vector, we must have

$$H^H (y - H\hat{w}) = 0.$$

Thus, it can be concluded that the solutions must satisfy the normal equations

$$H^H y = H^H H \hat{w}.$$

As a definition, the projection of $y$ onto the range space is

$$\hat{y} = H\hat{w},$$

and the residual vector is defined as

$$\tilde{y} = y - \hat{y} = y - H\hat{w}.$$

This gives rise to the orthogonality condition

$$H^H \tilde{y} = 0, \qquad \text{i.e., } \tilde{y} \perp \mathcal{R}(H),$$

and the additional orthogonality relation

$$\hat{y}^H \tilde{y} = 0, \qquad \text{i.e., } \hat{y} \perp \tilde{y}.$$

Minimum cost based on vectors: the minimum cost is

$$\|y - H\hat{w}\|^2 = (y - H\hat{w})^H (y - H\hat{w}) = y^H (y - H\hat{w}) - \hat{w}^H H^H (y - H\hat{w}).$$

Since $H^H (y - H\hat{w}) = 0$ at the minimum, the second term vanishes and

$$\|y - H\hat{w}\|^2 = y^H (y - H\hat{w}).$$

Note that this is also equivalent to $y^H \tilde{y}$. Continuing,

$$y^H (y - H\hat{w}) = y^H y - y^H H \hat{w},$$

but for the minimum $H^H y = H^H H \hat{w}$, therefore

$$y^H y - y^H H \hat{w} = y^H y - \hat{w}^H H^H H \hat{w},$$

which is just

$$y^H y - \hat{y}^H \hat{y}.$$

The minimum cost is the squared norm of $y$ minus the squared norm of the projection of $y$:

$$\|y\|^2 - \|\hat{y}\|^2 = y^H \tilde{y}.$$
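The orthogonality and minimum-cost relations above are easy to confirm numerically. The following sketch is an added illustration with arbitrary assumed (real-valued) data, so the Hermitian transpose reduces to the ordinary transpose.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 30, 3
H = rng.standard_normal((N, M))
y = rng.standard_normal((N, 1))

w_hat = np.linalg.solve(H.T @ H, H.T @ y)   # normal equations (H assumed full column rank)
y_hat = H @ w_hat                           # projection of y onto R(H)
y_til = y - y_hat                           # residual

print(np.allclose(H.T @ y_til, 0))                                    # H^H y_til = 0
print(np.allclose(y_hat.T @ y_til, 0))                                # y_hat orthogonal to y_til
print(np.allclose(np.linalg.norm(y_til)**2,
                  np.linalg.norm(y)**2 - np.linalg.norm(y_hat)**2))   # minimum cost
```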
Differentiation Arguments

The cost function to be minimized is

$$J(w) = \|y - Hw\|^2 = (y - Hw)^H (y - Hw) = y^H y - y^H H w - w^H H^H y + w^H H^H H w.$$

Differentiating with respect to $w$ to find the minimum/maximum,

$$\nabla_w J(\hat{w}) = -y^H H + \hat{w}^H H^H H = 0.$$

Equivalently, $H^H H \hat{w} = H^H y$, so when $H^H H$ is invertible the weight estimate becomes

$$\hat{w} = (H^H H)^{-1} H^H y.$$

To determine whether this is indeed a minimum (or a maximum), differentiate again and ensure that the result is non-negative definite:

$$\nabla_w^2 J = H^H H,$$

which is zero or positive definite (i.e., non-negative definite).

Vector equivalent of the cost function: the cost can also be written as the quadratic form
$$J(w) = \begin{bmatrix} y^H & -w^H \end{bmatrix} \begin{bmatrix} I_N & H \\ H^H & H^H H \end{bmatrix} \begin{bmatrix} y \\ -w \end{bmatrix}.$$

Performing a block UDU^H factorization of the central matrix,

$$\begin{bmatrix} I_N & H \\ H^H & H^H H \end{bmatrix} = \begin{bmatrix} I_N & HD \\ 0 & I_M \end{bmatrix} \begin{bmatrix} I_N - H D H^H & 0 \\ 0 & H^H H \end{bmatrix} \begin{bmatrix} I_N & 0 \\ D H^H & I_M \end{bmatrix},$$

where

$$D = (H^H H)^{-1}, \qquad \text{i.e., } D^{-1} = H^H H.$$
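A quick numerical check of this block factorization (added here as an illustration; the matrices are arbitrary assumed data):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 3
H = rng.standard_normal((N, M))
D = np.linalg.inv(H.T @ H)                     # D = (H^H H)^{-1}

A = np.block([[np.eye(N), H], [H.T, H.T @ H]])                    # central matrix
U = np.block([[np.eye(N), H @ D], [np.zeros((M, N)), np.eye(M)]])
Delta = np.block([[np.eye(N) - H @ D @ H.T, np.zeros((N, M))],
                  [np.zeros((M, N)), H.T @ H]])

print(np.allclose(A, U @ Delta @ U.T))   # UDU^H factorization holds
```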
For this matrix structure,

$$J(w) = \begin{bmatrix} y^H & -w^H \end{bmatrix} \begin{bmatrix} I_N & HD \\ 0 & I_M \end{bmatrix} \begin{bmatrix} I_N - HDH^H & 0 \\ 0 & H^H H \end{bmatrix} \begin{bmatrix} I_N & 0 \\ DH^H & I_M \end{bmatrix} \begin{bmatrix} y \\ -w \end{bmatrix}.$$

Since

$$\begin{bmatrix} I_N & 0 \\ DH^H & I_M \end{bmatrix} \begin{bmatrix} y \\ -w \end{bmatrix} = \begin{bmatrix} y \\ DH^H y - w \end{bmatrix},$$

the cost collapses to

$$J(w) = y^H \left( I_N - HDH^H \right) y + \left( w - DH^H y \right)^H H^H H \left( w - DH^H y \right).$$

The two terms give one term in $y$ alone and another based on both $y$ and $w$. As both terms can be shown to be non-negative definite, the $w$ that minimizes the cost must minimize the second term (it has no effect on the first). The minimum cost is therefore

$$J(\hat{w}) = y^H \left( I_N - HDH^H \right) y.$$
We seek to minimize the second term, which is achieved by setting

$$\hat{w} - D H^H y = 0,$$

from which we have the solution

$$\hat{w} = D H^H y.$$

Substituting for $D$,

$$\hat{w} = (H^H H)^{-1} H^H y.$$

Premultiplying by $H^H H$ (to eliminate the inverse),

$$H^H H \hat{w} = H^H H (H^H H)^{-1} H^H y = H^H y,$$

which is the normal equation, as desired. As an alternate thought, we have

$$H^H H \left( \hat{w} - D H^H y \right) = 0.$$

For this to hold, the element in parentheses must reside in the null space of $H^H H$ (equivalently, of $H$).
Theorem 11.1.2: Properties of the least-squares problem

1. The solution $\hat{w}$ is unique if, and only if, the data matrix $H$ has full column rank (i.e., all its columns are linearly independent, which necessarily requires $N \ge M$). In this case, $\hat{w}$ is given by

$$\hat{w} = (H^H H)^{-1} H^H y.$$

This situation occurs only for over-determined least-squares problems.

2. When $H^H H$ is singular, then infinitely many solutions $\hat{w}$ exist, and any two solutions differ by a vector in the nullspace of $H$, i.e., if $\hat{w}_1$ and $\hat{w}_2$ are any two solutions, then

$$H(\hat{w}_1 - \hat{w}_2) = 0, \qquad \text{or} \qquad \hat{w}_1 - \hat{w}_2 \in \mathcal{N}(H).$$

This situation can occur for both over- and under-determined least-squares problems.

3. When many solutions $\hat{w}$ exist, regardless of which one we pick, the resulting projection vector $\hat{y} = H\hat{w}$ is the same, and the resulting minimum cost is also the same and given by

$$\|y\|^2 - \|\hat{y}\|^2 = y^H \tilde{y}.$$

4. When many solutions $\hat{w}$ exist, the one that has the smallest Euclidean norm, namely, the one that solves

$$\min_{\hat{w}} \|\hat{w}\|^2 \quad \text{subject to} \quad H^H H \hat{w} = H^H y,$$

is given by

$$\hat{w} = \left[ (H^H H)^{\dagger} H^H \right] y,$$

where the term in brackets is referred to as the pseudo-inverse of $H$ (see App. 11.C).
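As an added illustration of item 4, NumPy's pinv returns the minimum-norm solution when $H$ is rank deficient; the data below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 8, 4
H = rng.standard_normal((N, M))
H[:, 3] = H[:, 0] + H[:, 1]          # force a rank deficiency (column 3 is dependent)
y = rng.standard_normal((N, 1))

w_min_norm = np.linalg.pinv(H) @ y   # minimum-norm solution among all normal-equation solutions

print(np.allclose(H.T @ H @ w_min_norm, H.T @ y))   # it satisfies the normal equations
print(np.linalg.norm(w_min_norm))                   # and has the smallest Euclidean norm
```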
Projection Matrices

Concentrating on over-determined solutions with full-rank $H$ ($N \ge M$), the least-squares solution was defined as

$$\hat{w} = (H^H H)^{-1} H^H y,$$

and the projection vector of $y$ onto $H$ becomes

$$\hat{y} = H\hat{w} = H (H^H H)^{-1} H^H y.$$

Define the projection matrix through

$$\hat{y} = H\hat{w} = H (H^H H)^{-1} H^H y = P_H\, y,$$

or

$$P_H = H (H^H H)^{-1} H^H.$$

Some interesting properties of this matrix are

$$P_H^H = P_H \qquad \text{and} \qquad P_H^2 = P_H.$$
Defining other terms: the residual becomes

$$\tilde{y} = y - \hat{y} = y - H\hat{w} = y - P_H y = (I - P_H)\, y.$$

The projection matrix onto the orthogonal complement is then

$$P_H^{\perp} = I - P_H.$$

The minimum cost is

$$y^H y - \hat{y}^H \hat{y} = y^H y - y^H P_H^H P_H y = y^H y - y^H P_H y = y^H (I - P_H)\, y = y^H P_H^{\perp} y.$$
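The projection-matrix identities above can be checked numerically; the following added sketch, with assumed data, verifies symmetry, idempotence, and the minimum-cost expression.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 20, 5
H = rng.standard_normal((N, M))
y = rng.standard_normal((N, 1))

P = H @ np.linalg.inv(H.T @ H) @ H.T        # P_H = H (H^H H)^{-1} H^H
P_perp = np.eye(N) - P                      # projection onto the orthogonal complement

print(np.allclose(P, P.T), np.allclose(P, P @ P))                         # P_H^H = P_H, P_H^2 = P_H
w_hat = np.linalg.solve(H.T @ H, H.T @ y)
print(np.allclose(np.linalg.norm(y - H @ w_hat)**2, y.T @ P_perp @ y))    # min cost = y^H P_H_perp y
```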
11.2 Weighted Least-Squares

As seen in Chap. 9, a weighting factor may also be involved in the cost function. The new minimization becomes

$$\min_w \; (y - Hw)^H W (y - Hw),$$

where $W$ is a Hermitian positive-definite weight matrix. This may also be written as

$$\min_w \; \|y - Hw\|_W^2,$$

using the weighted norm defined in Chap. 9. This form is equivalent to a transformed version of the standard least-squares problem. Let

$$W = V \Lambda V^H = V \Lambda^{1/2} \Lambda^{1/2} V^H$$

be the eigendecomposition of the weight matrix into a unitary matrix $V$, a diagonal matrix $\Lambda$ with positive entries, and $V^H$, where

$$V V^H = V^H V = I.$$

By substitution,

$$\min_w \; (y - Hw)^H V \Lambda^{1/2} \Lambda^{1/2} V^H (y - Hw),$$

and combining the weight factors with the existing terms,

$$\min_w \; \left( \Lambda^{1/2} V^H y - \Lambda^{1/2} V^H H w \right)^H \left( \Lambda^{1/2} V^H y - \Lambda^{1/2} V^H H w \right).$$

If we transform the input $y$ and the $H$ matrix as

$$a = \Lambda^{1/2} V^H y, \qquad A = \Lambda^{1/2} V^H H,$$

we have the standard least-squares problem

$$\min_w \; (a - Aw)^H (a - Aw).$$

This problem is solved from

$$A^H (a - A\hat{w}) = 0,$$

which, when using the previous definitions, becomes

$$H^H V \Lambda^{1/2}\, \Lambda^{1/2} V^H (y - H\hat{w}) = 0,$$

$$H^H V \Lambda V^H (y - H\hat{w}) = 0,$$

$$H^H W (y - H\hat{w}) = 0.$$

Comparing with the orthogonality condition (29.6) in the unweighted case, we see that the only difference is the presence of the weighting matrix $W$.
This conclusion suggests that we can extend to the weighted least-squares setting the same geometric properties of the standard least-squares setting if we simply employ the concept of weighted inner products. Specifically, for any two column vectors $\{c, d\}$, we can define their weighted inner product as

$$\langle c, d \rangle_W = c^H W d,$$

and then say that $c$ and $d$ are orthogonal whenever their weighted inner product is zero. Using this definition, we can interpret

$$H^H W (y - H\hat{w}) = 0$$

to mean that the residual vector, $y - H\hat{w}$, is orthogonal to the column span of $H$ in the weighted sense, i.e.,

$$\langle q, y - H\hat{w} \rangle_W = q^H W (y - H\hat{w}) = 0 \quad \text{for any } q \in \mathcal{R}(H).$$

We further conclude that the normal equations are now replaced by

$$H^H W H \hat{w} = H^H W y.$$
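To illustrate the equivalence between the weighted problem and its transformed version, the following added sketch (with an assumed positive-definite $W$ and arbitrary data) solves the weighted normal equations directly and through the transformed pair $(a, A)$.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 25, 4
H = rng.standard_normal((N, M))
y = rng.standard_normal((N, 1))

# An assumed Hermitian positive-definite weight matrix
B = rng.standard_normal((N, N))
W = B @ B.T + N * np.eye(N)

# Direct solution of the weighted normal equations H^H W H w = H^H W y
w_direct = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)

# Transformed problem: W = V Lam V^H, a = Lam^{1/2} V^H y, A = Lam^{1/2} V^H H
lam, V = np.linalg.eigh(W)
S = np.diag(np.sqrt(lam)) @ V.T
a, A = S @ y, S @ H
w_transformed, *_ = np.linalg.lstsq(A, a, rcond=None)

print(np.allclose(w_direct, w_transformed))
```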
Theorem 11.2.1: Properties of the weighted least-squares problem

Every solution $\hat{w}$ satisfies the normal equations

$$H^H W H \hat{w} = H^H W y.$$

The following properties hold:

1. These normal equations are always consistent, i.e., a solution always exists.

2. The solution is unique if, and only if, the data matrix $H$ has full column rank (i.e., all its columns are linearly independent, which necessarily requires $N \ge M$). In this case, $\hat{w}$ is given by

$$\hat{w} = (H^H W H)^{-1} H^H W y.$$

Note: this situation occurs only for over-determined least-squares problems.

3. When $H^H W H$ is singular, which is equivalent to $H^H H$ being singular, then infinitely many solutions $\hat{w}$ exist, and any two solutions differ by a vector in the nullspace of $H$, i.e., if $\hat{w}_1$ and $\hat{w}_2$ are any two solutions, then

$$H(\hat{w}_1 - \hat{w}_2) = 0, \qquad \text{or} \qquad \hat{w}_1 - \hat{w}_2 \in \mathcal{N}(H).$$

This situation can occur for both over- and under-determined least-squares problems.

4. When many solutions $\hat{w}$ exist, regardless of which one we pick, the resulting projection vector $\hat{y} = H\hat{w}$ is the same, and the resulting minimum cost is also the same and given by

$$\|y\|_W^2 - \|\hat{y}\|_W^2 = y^H W \tilde{y}, \qquad \text{where } \hat{y} = H\hat{w}.$$

5. When many solutions $\hat{w}$ exist, the one that has the smallest Euclidean norm, namely, the one that solves

$$\min_{\hat{w}} \|\hat{w}\|^2 \quad \text{subject to} \quad H^H W H \hat{w} = H^H W y,$$

is given by the pseudo-inverse expression

$$\hat{w} = A^{\dagger} a,$$

where $\dagger$ refers to the pseudo-inverse of $A$ and

$$a = \Lambda^{1/2} V^H y, \qquad A = \Lambda^{1/2} V^H H.$$
Projection Matrix Definition

Concentrating on over-determined solutions with full-rank $H$ ($N \ge M$), the weighted least-squares solution was defined as

$$\hat{w} = (H^H W H)^{-1} H^H W y,$$

and the projection vector of $y$ onto $H$ becomes

$$\hat{y} = H\hat{w} = H (H^H W H)^{-1} H^H W y.$$

Define the projection matrix through

$$\hat{y} = H\hat{w} = H (H^H W H)^{-1} H^H W y = P_H\, y,$$

or

$$P_H = H (H^H W H)^{-1} H^H W.$$

The properties of this projection matrix are

$$(W P_H)^H = W P_H, \qquad P_H^2 = P_H, \qquad P_H^H W P_H = W P_H.$$

The residual becomes, as before,

$$\tilde{y} = y - \hat{y} = y - H\hat{w} = y - P_H y = (I - P_H)\, y.$$

The projection matrix onto the orthogonal complement is then

$$P_H^{\perp} = I - P_H.$$

The minimum cost is

$$y^H W y - \hat{y}^H W \hat{y} = y^H W y - y^H P_H^H W P_H y = y^H W y - y^H W P_H y = y^H W (I - P_H)\, y = y^H W P_H^{\perp} y.$$
11.3 Regularized Least-Squares

An expanded cost function is often used in control-system theory when additional factors beyond the weighted error are of concern and are to be minimized. For our purposes, we can directly add a weight-misadjustment term, forming the regularized least-squares problem

$$\min_w \; \left[ (w - \bar{w})^H \Pi (w - \bar{w}) + (y - Hw)^H (y - Hw) \right],$$

where $\Pi$ is a Hermitian positive-definite weight matrix. Typically $\Pi$ is a multiple of the identity matrix, and $\bar{w}$ may be the "optimal weight" or could even be zero (using the weight magnitudes as part of the cost function). Beyond expanding the definition of the cost, this form also offers a way to incorporate a priori information about the solution into the cost function. For example, $\Pi$ can be used to express the certainty that $\bar{w}$ is an excellent starting point for estimating $w$. This formulation can also relieve problems associated with rank deficiency in the matrix $H$.

Developing solutions: the cost function can be differentiated! Differentiating with respect to $w$ to find the minimum/maximum,

$$\nabla_w J(\hat{w}) = (\hat{w} - \bar{w})^H \Pi - (y - H\hat{w})^H H = 0.$$

If we again differentiate the cost function,

$$\nabla_w^2 J = \Pi + H^H H.$$

As mentioned, if $H$ is rank deficient, $\Pi$ can be used to mitigate the deficiency and, in fact, reduce the spread between the maximum and minimum eigenvalues of the composite matrix! Solving for the optimal weight estimate, we have

$$\hat{w}^H (\Pi + H^H H) = y^H H + \bar{w}^H \Pi,$$

or, taking the Hermitian transpose of the terms (recall that $\Pi$ is Hermitian),

$$(\Pi + H^H H)\, \hat{w} = H^H y + \Pi \bar{w},$$

so that

$$\hat{w} = (\Pi + H^H H)^{-1} H^H y + (\Pi + H^H H)^{-1} \Pi \bar{w}.$$

The second term is based on the a priori information. Note that if $\bar{w}$ is zero, this term vanishes.
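A brief numerical sketch of the regularized solution (added illustration; $\Pi = \delta I$, $\bar{w} = 0$, and the data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
N, M = 15, 6
H = rng.standard_normal((N, M))
y = rng.standard_normal((N, 1))

delta = 0.5
Pi = delta * np.eye(M)               # regularization matrix (a multiple of the identity)
w_bar = np.zeros((M, 1))             # a priori guess for w (zero here)

# w_hat = (Pi + H^H H)^{-1} (H^H y + Pi w_bar)
w_hat = np.linalg.solve(Pi + H.T @ H, H.T @ y + Pi @ w_bar)

# The regularized cost at w_hat is never larger than at the a priori guess w_bar
def cost(w):
    return ((w - w_bar).T @ Pi @ (w - w_bar) + (y - H @ w).T @ (y - H @ w)).item()

print(cost(w_hat) <= cost(w_bar))
```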
The text uses an alternate solution involving a change of variables and an augmented matrix structure. First, the change of variables: let

$$z = w - \bar{w} \qquad \text{and} \qquad b = y - H\bar{w}.$$

Then the minimization becomes

$$\min_z \; \left[ z^H \Pi z + (b - Hz)^H (b - Hz) \right].$$

Performing the eigendecomposition of $\Pi$,

$$\Pi = U \Lambda U^H,$$

the minimization can be written in an augmented matrix form as

$$\min_z \; \left\| \begin{bmatrix} 0 \\ b \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2} U^H \\ H \end{bmatrix} z \right\|^2.$$

Now the solution takes the form of the standard least-squares problem

$$\min_w \; (y - Hw)^H (y - Hw)$$

when we substitute the augmented terms above for the previous $y$, $H$, and $w$. From before, we had

$$H^H H \hat{w} = H^H y.$$

Therefore, the augmented solution must satisfy

$$\begin{bmatrix} \Lambda^{1/2} U^H \\ H \end{bmatrix}^H \begin{bmatrix} \Lambda^{1/2} U^H \\ H \end{bmatrix} \hat{z} = \begin{bmatrix} \Lambda^{1/2} U^H \\ H \end{bmatrix}^H \begin{bmatrix} 0 \\ b \end{bmatrix},$$

$$\left( U \Lambda^{1/2} \Lambda^{1/2} U^H + H^H H \right) \hat{z} = H^H b,$$

$$\left( \Pi + H^H H \right) \hat{z} = H^H b,$$

and reversing the substitution,

$$\left( \Pi + H^H H \right) (\hat{w} - \bar{w}) = H^H (y - H\bar{w}),$$

which is the form of the normal equations. Continuing,

$$\left( \Pi + H^H H \right) \hat{w} = H^H (y - H\bar{w}) + \left( \Pi + H^H H \right) \bar{w} = H^H y + \Pi \bar{w},$$

$$\hat{w} = (\Pi + H^H H)^{-1} H^H y + (\Pi + H^H H)^{-1} \Pi \bar{w},$$

which is identical to the derivative-based solution above. The existence of the inverse now depends upon the combination $\Pi + H^H H$ being positive definite, and not upon the individual components. Therefore, this is offered as an alternate solution when $H$ is rank deficient (think about solving the exam 1 constrained-estimation equalizer!). As a note, this solution always exists and is unique!
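The equivalence between the direct regularized solution and the augmented least-squares formulation is easy to confirm numerically; the sketch below is an added illustration with assumed data.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 15, 6
H = rng.standard_normal((N, M))
y = rng.standard_normal((N, 1))
Pi = 0.5 * np.eye(M)
w_bar = rng.standard_normal((M, 1))

# Direct solution: (Pi + H^H H) w_hat = H^H y + Pi w_bar
w_direct = np.linalg.solve(Pi + H.T @ H, H.T @ y + Pi @ w_bar)

# Augmented formulation: z = w - w_bar, b = y - H w_bar, Pi = U Lam U^H
b = y - H @ w_bar
lam, U = np.linalg.eigh(Pi)
H_aug = np.vstack([np.diag(np.sqrt(lam)) @ U.T, H])
y_aug = np.vstack([np.zeros((M, 1)), b])
z_hat, *_ = np.linalg.lstsq(H_aug, y_aug, rcond=None)
w_aug = w_bar + z_hat

print(np.allclose(w_direct, w_aug))
```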
We computed some additional terms for the minimum cost and orthogonality conditions. It is useful to repeat these computations based on the augmented matrix construction. From before, for the standard problem

$$\min_w \; \|y - Hw\|^2, \qquad \hat{w} = (H^H H)^{-1} H^H y,$$

the minimum cost was

$$\|y\|^2 - \|\hat{y}\|^2 = y^H \tilde{y},$$

and the orthogonality condition was

$$H^H (y - H\hat{w}) = H^H \tilde{y} = 0.$$

With the augmented system, the minimum cost becomes

$$\begin{bmatrix} 0 \\ b \end{bmatrix}^H \left( \begin{bmatrix} 0 \\ b \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2} U^H \\ H \end{bmatrix} \hat{z} \right) = b^H (b - H\hat{z}).$$

Reversing the substitutions ($b = y - H\bar{w}$, $\hat{z} = \hat{w} - \bar{w}$),

$$b^H (b - H\hat{z}) = (y - H\bar{w})^H \big( y - H\bar{w} - H(\hat{w} - \bar{w}) \big) = (y - H\bar{w})^H (y - H\hat{w}) = (y - H\bar{w})^H \tilde{y}.$$

As a note, this form of the equation does not include or require $\Pi$. An alternate way to proceed would be from

$$\left( \Pi + H^H H \right) (\hat{w} - \bar{w}) = H^H (y - H\bar{w}),$$

where now

$$(y - H\bar{w})^H (y - H\hat{w})$$

becomes

$$(y - H\bar{w})^H (y - H\bar{w}) - (y - H\bar{w})^H H \left( \Pi + H^H H \right)^{-1} H^H (y - H\bar{w}) = (y - H\bar{w})^H \left[ I - H \left( \Pi + H^H H \right)^{-1} H^H \right] (y - H\bar{w}).$$

Using the matrix inversion formula

$$(A + BCD)^{-1} = A^{-1} - A^{-1} B \left( C^{-1} + D A^{-1} B \right)^{-1} D A^{-1},$$

we get

$$(y - H\bar{w})^H \left( I + H \Pi^{-1} H^H \right)^{-1} (y - H\bar{w}).$$
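A quick added check that the two expressions for the minimum cost agree, i.e., that $I - H(\Pi + H^H H)^{-1} H^H = (I + H \Pi^{-1} H^H)^{-1}$ for an assumed $H$ and $\Pi$:

```python
import numpy as np

rng = np.random.default_rng(8)
N, M = 10, 4
H = rng.standard_normal((N, M))
Pi = 0.5 * np.eye(M)

lhs = np.eye(N) - H @ np.linalg.solve(Pi + H.T @ H, H.T)
rhs = np.linalg.inv(np.eye(N) + H @ np.linalg.inv(Pi) @ H.T)
print(np.allclose(lhs, rhs))   # matrix inversion lemma
```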
With the augmented system, the orthogonality condition becomes

$$\begin{bmatrix} \Lambda^{1/2} U^H \\ H \end{bmatrix}^H \left( \begin{bmatrix} 0 \\ b \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2} U^H \\ H \end{bmatrix} \hat{z} \right) = 0,$$

i.e.,

$$U \Lambda^{1/2} \left( 0 - \Lambda^{1/2} U^H \hat{z} \right) + H^H (b - H\hat{z}) = 0.$$
Rearranging,

$$U \Lambda^{1/2}\, \Lambda^{1/2} U^H \hat{z} = H^H (b - H\hat{z}),$$

i.e.,

$$H^H (b - H\hat{z}) = \Pi \hat{z} = \Pi (\hat{w} - \bar{w}).$$

Focusing on $b - H\hat{z}$,

$$b - H\hat{z} = y - H\bar{w} - H(\hat{w} - \bar{w}) = y - H\hat{w} = y - \hat{y} = \tilde{y}.$$

So, finally, we have the orthogonality condition

$$H^H \tilde{y} = \Pi (\hat{w} - \bar{w}).$$

The right-hand side is zero when $\Pi$ is zero, recovering the standard condition, but otherwise this becomes an "alternate orthogonality condition" for regularized least-squares.
11.4 Weighted Regularized Least-Squares

This is a homework problem, but it follows from a combination of the techniques used for weighted least-squares and regularized least-squares. For our purposes, we directly add a weight-misadjustment term as in regularized least-squares and a weighting on the error as in weighted least-squares:

$$\min_w \; \left[ (w - \bar{w})^H \Pi (w - \bar{w}) + (y - Hw)^H W (y - Hw) \right],$$

where $W$ and $\Pi$ are Hermitian positive-definite weight matrices. Typically $\Pi$ is a multiple of the identity matrix, and $W$ may be diagonal (to focus on the individual errors). Use both substitutions previously performed: a transformation handles the weighting, followed by the change of variables and an augmented matrix solution. Presenting the results:

Theorem 11.4.1: Weighted regularized least-squares

The solution is always unique and is given by

$$\hat{w} = \bar{w} + \left( \Pi + H^H W H \right)^{-1} H^H W (y - H\bar{w}).$$

The resulting minimum cost is given by

$$(y - H\bar{w})^H \left( W^{-1} + H \Pi^{-1} H^H \right)^{-1} (y - H\bar{w}),$$

where $\tilde{y} = y - \hat{y} = y - H\hat{w}$. The orthogonality condition becomes

$$H^H W \tilde{y} = \Pi (\hat{w} - \bar{w}).$$
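A short added sketch, with assumed data, computing the weighted regularized solution of Theorem 11.4.1 and checking its orthogonality condition and minimum-cost expression:

```python
import numpy as np

rng = np.random.default_rng(9)
N, M = 20, 5
H = rng.standard_normal((N, M))
y = rng.standard_normal((N, 1))
W = np.diag(rng.uniform(0.5, 2.0, N))    # diagonal positive-definite error weighting (assumed)
Pi = 0.1 * np.eye(M)                     # regularization (assumed)
w_bar = rng.standard_normal((M, 1))

# Theorem 11.4.1: w_hat = w_bar + (Pi + H^H W H)^{-1} H^H W (y - H w_bar)
w_hat = w_bar + np.linalg.solve(Pi + H.T @ W @ H, H.T @ W @ (y - H @ w_bar))
y_til = y - H @ w_hat

print(np.allclose(H.T @ W @ y_til, Pi @ (w_hat - w_bar)))   # orthogonality condition
min_cost = (y - H @ w_bar).T @ np.linalg.inv(np.linalg.inv(W) + H @ np.linalg.inv(Pi) @ H.T) @ (y - H @ w_bar)
direct   = (w_hat - w_bar).T @ Pi @ (w_hat - w_bar) + y_til.T @ W @ y_til
print(np.allclose(min_cost, direct))                        # minimum-cost expression
```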
A summary of the four least-squares variants is provided in Table 11.1 on p. 672.
Their orthogonality conditions are given in Table 11.2.
The minimum costs are given in Table 11.3.
Appendix 11.A Equivalence Results in Linear Estimation

Rather than repeating textbook information, go to pp. 724 and 725. Using the regularization matrix and the weighting matrix, the deterministic problem is identical in structure to the stochastic problem! In addition, the m.m.s.e. is related to the minimum cost. More fun with matrix theory: see the QR decomposition material in Appendix 11.B.
Matlab Simulations
Project 11.1 (Amplitude tone detection)

A linear process model

$$y = Hx + v,$$

where $x$ and $v$ are independent random processes: $v$ is Gaussian noise and $x$ is a sinusoid of known frequency with a random amplitude, uniformly distributed between -1 and 1. The linear MMSE estimate (Theorem 2.6.1) uses

$$R_{xy} = R_{xx} H^H.$$

Then

$$\hat{x} = R_{xx} H^H \left( R_{vv} + H R_{xx} H^H \right)^{-1} y,$$

or, equivalently,

$$\hat{x} = \left( R_{xx}^{-1} + H^H R_{vv}^{-1} H \right)^{-1} H^H R_{vv}^{-1} y,$$

and the cost function is

$$J_{\mathrm{opt}} = \left( R_{xx}^{-1} + H^H R_{vv}^{-1} H \right)^{-1}.$$

If we use a weighted regularized least-squares cost function defined as

$$\min_x \; \left[ x^H R_{xx}^{-1} x + (y - Hx)^H R_{vv}^{-1} (y - Hx) \right],$$

the solution becomes

$$\hat{x} = \left( R_{xx}^{-1} + H^H R_{vv}^{-1} H \right)^{-1} H^H R_{vv}^{-1} y,$$

and the minimum cost is given by

$$y^H \left( R_{vv} + H R_{xx} H^H \right)^{-1} y.$$

For $R_{xx} = \sigma_x^2 I$ and $R_{vv} = \sigma_v^2 I$,

$$\hat{x} = \left( \sigma_x^{-2} I + \sigma_v^{-2} H^H H \right)^{-1} \sigma_v^{-2} H^H y = \left( \frac{\sigma_v^2}{\sigma_x^2}\, I + H^H H \right)^{-1} H^H y = \left( \frac{1}{\mathrm{SNR}}\, I + H^H H \right)^{-1} H^H y.$$

Note that $1/\mathrm{SNR}$ is a popular choice for the regularization scalar multiplying the identity matrix.
Part A) Plot $y$ and the estimate $\hat{y}$ for three SNR values: 10, 20, and 30 dB. Let $H$ be the known sinusoid; "a" is the amplitude of the sinusoid.

$$\hat{x} = \left( \frac{1}{\mathrm{SNR}}\, I + H^H H \right)^{-1} H^H y$$

Part B) Use a range of regularization parameters instead of the SNR. For the input, use a 10 dB SNR. (Note: alpha = 0.1 corresponds to +10 dB.)
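A minimal NumPy sketch of the Part A estimator (the course projects are in MATLAB; the tone frequency, sequence length, and the way the noise variance is set from the SNR are assumptions chosen only to illustrate the formula):

```python
import numpy as np

rng = np.random.default_rng(10)
n = np.arange(200)
H = np.cos(2 * np.pi * 0.05 * n).reshape(-1, 1)   # known sinusoid (assumed frequency)
a = rng.uniform(-1, 1)                            # random amplitude in [-1, 1]

for snr_db in (10, 20, 30):
    snr = 10 ** (snr_db / 10)
    sigma_v = np.sqrt(a**2 * np.mean(H**2) / snr)        # assumed SNR definition
    y = a * H + sigma_v * rng.standard_normal(H.shape)

    # x_hat = ((1/SNR) I + H^H H)^{-1} H^H y; a 1x1 system since x is the scalar amplitude
    x_hat = np.linalg.solve(np.eye(1) / snr + H.T @ H, H.T @ y)
    y_hat = H @ x_hat                             # estimated (noise-free) tone; Part A would plot y and y_hat
    print(snr_db, x_hat.item(), a)
```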
Project 11.2 (OFDM Receiver)

Welcome to advanced communications-system signal considerations. Orthogonal frequency-division multiplexing (OFDM) symbol transmission proceeds as follows (see also the sketch following this list):

1. Complex data symbols (QAM-based constellation values) are placed in frequency bins.
2. The data are inverse discrete Fourier transformed, generating a time-domain sequence of fixed length.
3. The time sequence is "circular" in that performing a circular shift on the sequence would only introduce additional linear phase in the symbols (if the shifted sequence were directly discrete Fourier transformed).
4. The last part of the sequence is prepended to the front of the sequence as a cyclic prefix (note that any contiguous segment of the original length would just provide a linear phase if a DFT were performed).
5. The signal is transmitted.
6. The received signal is truncated to the exact length of the original DFT, preferably cutting off the cyclic prefix that was prepended.
7. A DFT is performed on the sequence.
8. The "data symbols" are in the DFT bins, with some corruption.

From the on-line textbook, see the detailed OFDM development.
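A minimal NumPy sketch of the cyclic-prefix round trip described above (added for illustration; the FFT size, prefix length, and QPSK symbols are assumptions, and no channel distortion or noise is modeled):

```python
import numpy as np

rng = np.random.default_rng(11)
Nfft, Ncp = 64, 16                                   # assumed FFT size and cyclic-prefix length

# Complex data symbols (QPSK constellation) placed in the frequency bins
bits = rng.integers(0, 2, (Nfft, 2))
symbols = ((2 * bits[:, 0] - 1) + 1j * (2 * bits[:, 1] - 1)) / np.sqrt(2)

tx_time = np.fft.ifft(symbols)                       # inverse DFT -> time-domain sequence
tx = np.concatenate([tx_time[-Ncp:], tx_time])       # prepend the last Ncp samples (cyclic prefix)

rx = tx                                              # ideal channel (no distortion or noise) in this sketch
rx_time = rx[Ncp:Ncp + Nfft]                         # truncate to the original DFT length (drop the prefix)
rx_symbols = np.fft.fft(rx_time)                     # DFT recovers the data symbols in the bins

print(np.allclose(rx_symbols, symbols))
```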