
LECTURE NOTES IN STOCHASTIC MODELS AND LEAST SQUARES ESTIMATION

MSc in Information Engineering

Dr George Halikias

EEIE, School of Engineering and Mathematical Sciences, City University

16 December 2006


1. Stochastic Models

1.1 Stochastic Processes (Revision)

A stochastic process is a sequence of random variables

{xt} = {. . . x−1, x0, x1, x2, . . .}

with joint probability distributions defined for all t1, t2, . . . , tn and all n. This description requires too much information to be of practical use. We typically work with the following two statistical characteristics of {xt}:

1. The mean:

m_t = E(x_t) = ∫_{−∞}^{∞} ξ f_{x_t}(ξ) dξ

where f_{x_t}(·) is the probability density function of x_t, and

2. The covariance function:

R(t, s) = E{(x_t − m_t)(x_s − m_s)} = E(x_t x_s) − m_t m_s = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y f_{x_t x_s}(x, y) dx dy

where f_{x_t x_s}(·, ·) is the joint probability density function of x_t and x_s.

A stochastic process {xt} is wide-sense stationary iff: (i) mt = m (constant for all t), and

(ii) R(t, s) = R(t− s, 0) := R(t− s) for all t and s. A wide-sense stationary process is ergodic

iff:

lim_{t→∞} { 1/(2t + 1) Σ_{k=−t}^{t} x_k } = m

and

lim_{t→∞} { 1/(2t + 1) Σ_{k=−t}^{t} x_k x_{k−n} } = R(n) + m²

i.e. if the “ensemble average” (freeze time index and take average over many different

random experiments) coincides with the “time average” (average over a single realisation of

the stochastic process).

A stochastic process of particular interest is the “white noise” sequence: {x_t} is white iff (i) m_t = 0, and (ii) R(0) is a finite constant and R(i) = 0 for i ≠ 0.

Given a stationary stochastic sequence {ηt} and its covariance function {R(k)}, its spectral

density function is defined as the (two-sided) Z-transform of {R(k)}, i.e.

Φ_ηη(z) = Σ_{k=−∞}^{∞} R(k) z^{−k}


The frequency content of {ηt} may be obtained by evaluating Φηη(z) for z = ejω, |ω| ≤ π. Here

ω is the normalised angular frequency measured in rads/sample and hence π corresponds to

one half the sampling frequency, i.e.

Ψ(ω) = Φ_ηη(e^{jω}) = Σ_{k=−∞}^{∞} R(k) e^{−jkω} ,   |ω| ≤ π

What is the spectral density function of a white noise sequence?

Ψ(ω) = R(0) = constant (independent of ω)

and hence a white noise sequence has “equal frequency content” at all frequencies.

1.2 Stochastic signals and models

We usually assume stationary, zero mean noise sequences {ηt}, t ∈ Z - note that non-zero means

can always be accommodated as a dc offset in the “deterministic” part of the model. This class

of models is too large for practical purposes. A simple widely-used class of models is generated

by the output of an LTI discrete-time system driven by white noise:

[Figure: Stochastic model. A white noise sequence {e_t} drives an LTI discrete-time system L, producing the noise sequence {η_t}.]

Here {et}, t ∈ Z is a white noise sequence, i.e. uncorrelated with E(et) = 0 and variance

Var(et) = σ2. System L can have a difference equation or state-space description. The most

widely used model is:

• A(z−1)ηt = B(z−1)et t ∈ Z (ARMA model)

where A(z^{−1}) = 1 + a_1 z^{−1} + . . . + a_n z^{−n} and B(z^{−1}) = b_0 + b_1 z^{−1} + . . . + b_m z^{−m}, and its special cases:

• ηt = B(z−1)et t ∈ Z (Moving average -MA- model)

The model name derives from the fact that η_t may be written as a (weighted) moving average of present and past samples of the input white noise sequence, i.e. η_t = b_0 e_t + b_1 e_{t−1} + . . . + b_m e_{t−m}.

• A(z−1)ηt = et t ∈ Z (AutoRegressive -AR- model)


The model name derives from the fact that η_t may be written in regression form with its own past values (i.e. “with itself”), i.e.

η_t = ( η_{t−1}  η_{t−2}  . . .  η_{t−n} ) ( −a_1  −a_2  . . .  −a_n )^T + e_t

Note we use z−1 both as a Z-domain variable and as a time-domain (unit delay) operator. We

have made the assumption that {ηt} must be stationary. We want to investigate the restrictions

that this assumption imposes on the coefficients of A(z−1) and B(z−1) which define the ARMA

model.

Example: Consider the first-order AR model ηt − aηt−1 = et, t ∈ Z where {et} is a white

noise sequence with variance Var(et) = σ2. Then

(1− az−1)ηt = et ⇒ ηt = (1− az−1)−1et

Thus

ηt = (1 + az−1 + a2z−2 + . . .)et = et + aet−1 + a2et−2 + . . .

So,

Var(η_t) = E(η_t²) = E[(e_t + a e_{t−1} + a² e_{t−2} + . . .)²]
         = E[e_t² + a² e_{t−1}² + a⁴ e_{t−2}² + . . . + cross terms]
         = σ²(1 + a² + a⁴ + . . .) = σ²/(1 − a²)   for |a| < 1

Thus, if |a| < 1, η_t has finite variance; if |a| ≥ 1 the variance of η_t “blows up”. Hence in this case a necessary condition for η_t to be stationary is that |a| < 1, or equivalently that z → G(z^{−1}) = (1 − a z^{−1})^{−1} has all its poles inside the unit circle ∂D = {z : |z| = 1}.

Next consider the MA process

ηt = B(z−1)et = b0et + b1et−1 + . . . + bmet−m

(t ∈ Z). Does stationarity of ηt impose any restriction on the bi’s in this case? Since

Var(η_t) = σ²(b_0² + b_1² + . . . + b_m²)

we always have finite variance for any collection of bi’s.
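These variance formulae are easy to check numerically. The short simulation below (a sketch only, using NumPy and scipy.signal.lfilter; the values a = 0.8, b = (1, 0.5, 0.25), σ² = 1 and the record length are arbitrary illustrative choices, not taken from the notes) generates long realisations of an AR(1) and an MA process driven by white noise and compares the sample variances with σ²/(1 − a²) and σ²(b_0² + . . . + b_m²):

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
N = 200_000
e = rng.standard_normal(N)              # white noise, sigma^2 = 1

# AR(1): eta_t - a*eta_{t-1} = e_t, i.e. eta = 1/(1 - a z^-1) e
a = 0.8
eta_ar = lfilter([1.0], [1.0, -a], e)
print(eta_ar.var(), 1.0 / (1.0 - a**2))     # both ~ 2.78

# MA: eta_t = b0 e_t + b1 e_{t-1} + b2 e_{t-2}
b = np.array([1.0, 0.5, 0.25])
eta_ma = lfilter(b, [1.0], e)
print(eta_ma.var(), np.sum(b**2))           # both ~ 1.3125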

The results of this example can be generalised as follows:

Definition: The transfer function z → G(z^{−1}) is: (i) stable iff G(z^{−1}) has all its poles inside ∂D (i.e. in the open unit disc), and (ii) minimum-phase iff G(z^{−1}) has all its zeros inside ∂D.

Theorem: If {et}, t ∈ Z is a white noise sequence with constant variance then ηt =

A−1(z−1)B(z−1)et defines a stationary process iff A−1(z−1)B(z−1) is stable. If A−1(z−1)B(z−1)


is stable and minimum phase, then {et} can be recovered from {ηt} as et = B−1(z−1)A(z−1)ηt,

i.e. system is “invertible”.

Note: The ARMA model introduced in this section formally defines a stochastic difference

equation. It is important to realize that here the underlying time-index set of this equation

is Z (all integers, both positive and negative). The above theorem establishes that when

we drive the LTI system by white noise, the output is stationary if the system is stable.

We can think of this in two ways: Perform multiple simulation experiments with random

realizations of {et} (starting in the infinite past). Then the statistical averages of the samples

of the resulting “output” sequences ηt will be the same, namely zero (i.e. the RV’s making up

the output stochastic sequence ηt will all have zero mean) and the statistical averages of all

covariance terms will satisfy the “time-invariance” property. Alternatively we can think of ηt as

a stochastic process formally defined by convolving the unit-pulse response of the LTI system

and et (involving an infinite sum) and studying the mean and covariance properties of the RV’s

making up ηt “in the limit”, i.e. as more and more terms are included in the sum (in the limit,

these are zero and time-invariant, respectively).

1.3 Spectral properties of ARMA models

Given a process {ηt} we define the covariance function as

R(k) = E{ηtηt−k} k = 0,±1,±2 . . .

and its spectral density function

Φ_ηη(z) = Σ_{k=−∞}^{∞} R(k) z^{−k}

Note that

Ψ(ω) = Φ_ηη(e^{jω}) = Σ_{k=−∞}^{∞} R(k) e^{−jωk}

is the Fourier transform of the covariance sequence {. . . , R(−1), R(0), R(1), . . .}. Note also that

due to stationarity,

R(−k) = E{ηtηt+k} = E{ηt−kηt} = R(k)

and this implies that:

Φ_ηη(z) = R(0) + Σ_{k=1}^{∞} R(k)(z^{−k} + z^{k})

so that:

Ψ(ω) = Φ_ηη(e^{jω}) = R(0) + Σ_{k=1}^{∞} R(k)(e^{jωk} + e^{−jωk}) = R(0) + 2 Σ_{k=1}^{∞} R(k) cos(ωk)

which is a real function of ω (in fact non-negative).

Given an ARMA model driven by white noise what is the spectral density of the output?


Theorem: Suppose that G(z−1) = A−1(z−1)B(z−1) is a stable transfer function. Then ηt

defined by A(z−1)ηt = B(z−1)et, where {et}, t ∈ Z is a white noise sequence of unit variance,

has spectral density:

Φ_ηη(z) = [B(z^{−1})/A(z^{−1})] · [B(z)/A(z)] = G(z^{−1}) G(z)

Proof: Let G(z−1) have a series expansion

G(z^{−1}) = g_0 + g_1 z^{−1} + g_2 z^{−2} + . . . = Σ_{i=0}^{∞} g_i z^{−i}

where the {gi} are the Markov parameters of G(z−1) (or its unit pulse response, assumed

causal). The series can be summed from i = −∞ to i = +∞ by defining gi = 0 for i < 0. Now:

R(k) = E{ηtηt−k}= E{(g0et + . . . + gket−k + gk+1et−k−1 + . . .)(g0et−k + g1et−k−1 + . . .)}

and hence

R(k) = g_0 g_k + g_1 g_{k+1} + . . . = Σ_{i=0}^{∞} g_i g_{i+k}

Hence Φηη(z) may be written as:

Φ_ηη(z) = Σ_{k=−∞}^{∞} R(k) z^{−k} = Σ_{k=−∞}^{∞} ( Σ_{i=−∞}^{∞} g_i g_{i+k} ) z^{−k} = Σ_{i=−∞}^{∞} g_i Σ_{k=−∞}^{∞} g_{i+k} z^{−k}

by interchanging the order of summation. Fixing i and defining j = i + k gives

Φ_ηη(z) = Σ_{i=−∞}^{∞} g_i Σ_{j=−∞}^{∞} g_j z^{−(j−i)} = ( Σ_{i=−∞}^{∞} g_i z^{i} ) ( Σ_{j=−∞}^{∞} g_j z^{−j} ) = G(z) G(z^{−1})

which completes the proof.

Note: ARMA models give rise to rational spectral densities. In fact the converse is also true:

Any rational spectral density function may be generated by an ARMA model driven by white

noise. (This is established by showing that any function of the form Σ_{k=−∞}^{∞} R(k) z^{−k} with R(−k) = R(k) can be spectrally factored in the form G(z^{−1})G(z), where G(z^{−1}) is rational.)

This result justifies the use of ARMA models: Most noise processes are adequately described by

their spectral densities and these can be approximated (to any degree of accuracy) by rational

functions.

Note that many ARMA models of the form A(z−1)ηt = B(z−1)et, t ∈ Z correspond to a single

spectral density

Φ_ηη(z) = [B(z^{−1})/A(z^{−1})] · [B(z)/A(z)]

If we impose the restriction, however, that

z → G(z^{−1}) = B(z^{−1})/A(z^{−1})


is stable and minimum phase, the polynomials corresponding to Φηη(z) are uniquely determined

(up to a sign).

Example: Consider the following ARMA model:

ηt + 0.75ηt−1 = et + 2et−1 t ∈ Z (1)

or

η_t = [(1 + 2z^{−1})/(1 + 0.75z^{−1})] e_t := [B(z^{−1})/A(z^{−1})] e_t

Then:

Φ_ηη(z) = [(1 + 2z^{−1})/(1 + 0.75z^{−1})] · [(1 + 2z)/(1 + 0.75z)] = [4(0.5 + z^{−1}) z · z^{−1}(0.5 + z)] / [(1 + 0.75z^{−1})(1 + 0.75z)]

which can be factored as:

Φ_ηη(z) = [2(1 + 0.5z)/(1 + 0.75z)] · [2(1 + 0.5z^{−1})/(1 + 0.75z^{−1})] := [B(z)/A(z)] · [B(z^{−1})/A(z^{−1})]

and hence the process:

A(z−1)ηt = B(z−1)et, i.e. ηt + 0.75ηt−1 = 2et + et−1

will have a spectral density identical to that of (1) and is both stable and minimum-phase.
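This spectral factorisation can be checked numerically. The sketch below (NumPy; the evaluation grid and the helper function spec are my own illustrative choices) evaluates |B(e^{jω})|²/|A(e^{jω})|² for both the original model (1) and its stable, minimum-phase factor and confirms that the two spectral densities coincide on the unit circle:

import numpy as np

w = np.linspace(0, np.pi, 512)
z1 = np.exp(-1j * w)                      # z^{-1} on the unit circle

def spec(b, a, z1):
    # |B(e^{jw})|^2 / |A(e^{jw})|^2 for polynomials in z^{-1} (ascending coefficients)
    B = np.polyval(b[::-1], z1)           # b0 + b1 z^{-1} + ...
    A = np.polyval(a[::-1], z1)
    return np.abs(B)**2 / np.abs(A)**2

psi1 = spec(np.array([1.0, 2.0]), np.array([1.0, 0.75]), z1)   # non-minimum-phase factor
psi2 = spec(np.array([2.0, 1.0]), np.array([1.0, 0.75]), z1)   # stable, minimum-phase factor
print(np.max(np.abs(psi1 - psi2)))        # ~ 1e-15: identical spectral densities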

A similar analysis can always be used to bring a zero lying outside ∂D inside the unit disc

without affecting the spectral density function (however, if a zero lies on ∂D it cannot be

moved). Since Φ_ηη(z) may be factored as G(z^{−1})G(z), if z_0 (p_0) is a zero (pole) of Φ_ηη(z), then z_0^{−1} (p_0^{−1}) is also a zero (pole) of Φ_ηη(z). Thus Φ_ηη(z) has a pole/zero pattern symmetric with respect to ∂D, as shown below:

[Figure: Spectral factorisation. Pole/zero pattern of Φ_ηη(z) in the complex plane, symmetric with respect to the unit circle; the stable and minimum-phase spectral factor collects the poles and zeros inside ∂D.]


Clearly, many pole/zero patterns are possible for a spectral factor, but only one will correspond

to a stable/minimum-phase system.

1.4 Calculation of covariance functions for ARMA models

Given an ARMA model A(z−1)ηt = B(z−1)et where {et} is white with unit covariance, we

sometimes need to obtain {R(k)} in closed form. There are several approaches to this.

Example: Consider the first order autoregression ηt − aηt−1 = et with |a| < 1

• Multiply by ηt−1 and take expectations:

E{η_t η_{t−1} − a η_{t−1}²} = E{η_{t−1} e_t} = 0 ⇒ R(1) − a R(0) = 0 (2)

• Next multiply through by ηt and take expectations:

E{η_t² − a η_t η_{t−1}} = E{η_t e_t} (3)

• Multiply through by et and take expectations:

E{η_t e_t − a η_{t−1} e_t} = E{e_t²} = 1 ⇒ E(η_t e_t) = 1 (4)

• From (2), (3) and (4)

R(0) − a R(1) = 1 ⇒ R(0) = 1/(1 − a²)   and   R(1) = a/(1 − a²) (5)

• Finally multiply through by ηt−k (k = 1, 2, . . .) and take expectations:

E{ηtηt−k − aηt−1ηt−k} = E{etηt−k} = 0 ⇒ R(k) = aR(k − 1) (6)

Combining (5) and (6) gives:

R(k) = a^k/(1 − a²) ,   k = 0, 1, 2, . . . (7)

which gives the covariance function in closed form.
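The closed-form covariance (7) can also be verified directly from the relation R(k) = Σ_i g_i g_{i+k} derived in Section 1.3, since the AR(1) model has unit-pulse response g_i = a^i. A minimal numerical sketch (NumPy; a = 0.6 and the truncation length are arbitrary choices):

import numpy as np

a = 0.6
M = 2000                                  # truncation length of the impulse response
g = a ** np.arange(M)                     # g_i = a^i for the AR(1) model

def R(k):
    # R(k) = sum_i g_i g_{i+k} (truncated; the tail is negligible for |a| < 1)
    return np.dot(g[:M - k], g[k:])

for k in range(5):
    print(k, R(k), a**k / (1 - a**2))     # the two columns agree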

This procedure of multiplying by past samples and taking expectations actually works in

general. The result is summarised in the following theorem (Yule-Walker equations):

Theorem: Let {R(0), R(1), R(2), . . .} be the covariance function of the stable ARMA model A(z^{−1})η_t = B(z^{−1})e_t, where A(z^{−1}) = 1 + a_1 z^{−1} + . . . + a_n z^{−n} and B(z^{−1}) = b_0 + b_1 z^{−1} + . . . + b_m z^{−m}, and suppose that {g_0, g_1, g_2, . . .} is the unit-pulse response of A(z^{−1})^{−1} B(z^{−1}). Then (with a_0 = 1):

Σ_{i=0}^{n} a_i R(l − i) = Σ_{i=max(0,l)}^{m} b_i g_{i−l}   for l ≤ m
                         = 0                                 for l > m


Proof: The ARMA model may be written in algebraic form as:

Σ_{i=0}^{n} a_i η_{t−i} = Σ_{i=0}^{m} b_i e_{t−i}

Multiplying by η_{t−l} (l = 0, 1, 2, . . .) and taking expectations gives:

Σ_{i=0}^{n} a_i E(η_{t−l} η_{t−i}) = Σ_{i=0}^{m} b_i E(η_{t−l} e_{t−i})

or equivalently

Σ_{i=0}^{n} a_i R(l − i) = Σ_{i=0}^{m} b_i E(η_{t−l} e_{t−i}) (8)

Now, {g0, g1, g2, . . .} are the Markov parameters of the model and hence

η_t = Σ_{j=0}^{∞} g_j e_{t−j}   ⇒   η_{t−l} = Σ_{j=0}^{∞} g_j e_{t−l−j}

and hence

E{η_{t−l} e_{t−i}} = E{ Σ_{j=0}^{∞} g_j e_{t−l−j} e_{t−i} }

The only non-zero terms in the expectation occur when t − l − j = t − i, i.e. when j = i − l.

Hence:

E{η_{t−l} e_{t−i}} = g_{i−l}   if i − l ≥ 0
                   = 0         if i − l < 0

Substituting into (8) gives:

Σ_{i=0}^{n} a_i R(l − i) = Σ_{i=max(0,l)}^{m} b_i g_{i−l}   for l ≤ m
                         = 0                                 for l > m

Note: The first n + 1 equations, for l = 0, 1, 2, . . . , n, involve the 2n + 1 unknowns R(−n), . . . , R(0), . . . , R(n), which may be solved on noting that R(−i) = R(i). Using these values as initial data, the higher-order R(l)'s can be determined recursively from the remaining Yule-Walker equations.

A second method for determining the covariance function is via contour integration. Recall

that given ηt satisfying A(z−1)ηt = B(z−1)et the spectral density is given by

Φ_ηη(z) = [B(z^{−1})/A(z^{−1})] · [B(z)/A(z)]


Also, Ψ(ω) = Φ_ηη(e^{jω}) is the Fourier transform of the covariance sequence {. . . , R(−1), R(0), R(1), . . .}. Hence the R(k)'s can be obtained as the Fourier coefficients of Ψ(ω), i.e.

R(k) = (1/2π) ∫_{−π}^{π} Φ_ηη(e^{jω}) e^{jkω} dω      (k = 0, ±1, ±2, . . .)
     = (1/2πj) ∫_{−π}^{π} Φ_ηη(e^{jω}) (e^{jω})^{k−1} d(e^{jω})
     = (1/2πj) ∮_{∂D} Φ_ηη(z) z^{k−1} dz

This is a contour integral in the complex plane around the unit circle ∂D which may be

calculated using Cauchy’s residue theorem. Thus, assuming that z → Φηη(z)zk−1 has only

simple poles inside ∂D (write them z_1, z_2, . . . , z_p), then

R(k) = Σ_{i=1}^{p} Res(Φ_ηη(z) z^{k−1}, z_i) = Σ_{i=1}^{p} lim_{z→z_i} (z − z_i) Φ_ηη(z) z^{k−1}

Example: Consider the process ηt − aηt−1 = et, t ∈ Z, |a| < 1. The spectral density function

is:

Φ_ηη(z) = [1/(1 − az^{−1})] · [1/(1 − az)] = z / [(z − a)(1 − az)]

Thus

R(k) = (1/2πj) ∮_{∂D} Φ_ηη(z) z^{k−1} dz = (1/2πj) ∮_{∂D} z^k / [(z − a)(1 − az)] dz

Thus

R(k) = Σ_i {Residues inside ∂D} = a^k/(1 − a²) ,   k = 0, 1, 2, . . .

(as before) since the only pole inside ∂D is z = a for all k = 0, 1, 2, . . ..

1.5 Stochastic state-space models

The standard state-space stochastic model we will be using is of the form:

xk+1 = Axk + Bek (9)

yk = Cxk + Dek (10)

As usual, xk denotes the state vector (at time-index k), yk is the system output vector and

ek the system input vector; we assume that the ek’s are uncorrelated, zero-mean, and that

Cov(ek) = E(ekeTk ) = I for all k. We can give two different interpretations of this model:

(i) The time set is Z+ (non-negative integers only); in this case we need to specify the (joint)

statistics of the initial state vector x0 and e0 to properly define the processes xk and yk; (ii)

The time set is Z (all integers); in this case xk and yk are the outputs of the ARMA models

xk = (zI − A)−1Bek and yk = [C(zI − A)−1B + D]ek in the sense discussed earlier.

Theorem: Suppose that in (9) the time index is Z+, the initial state x0 has zero mean and

covariance P0, and that x0 is uncorrelated with the ek for all k. Then P (k) := Cov(xk) satisfies

P (k + 1) = AP (k)AT + BBT , P (0) = P0 (11)


and

Cov(y_k, y_{k−j}) = C P(k) C^T + D D^T ,                   j = 0
                  = C A^j P(k − j) C^T + C A^{j−1} B D^T ,  j = 1, 2, . . . , k

If A is stable then P (k) → P as k → ∞ where P is the unique solution of the (discrete)

Lyapunov matrix equation:

P = APAT + BBT

Furthermore if P0 = P then P (k) = P for all k ≥ 0.

Next suppose that the time index is Z and that A is stable. In this case {xk} and {yk} are

wide-sense stationary processes, E(xk) = 0, Cov(xk) = P and

Cov(y_k, y_{k−j}) = C P C^T + D D^T ,                j = 0
                  = C A^j P C^T + C A^{j−1} B D^T ,   j > 0

Here P is once more the (unique) solution of the Lyapunov equation.

Proof: First note that E(xk+1) = AE(xk) and since E(x0) = 0, xk (and thus also yk) is

zero-mean for every k ≥ 0. Next note that

E{x_{k+1} x_{k+1}^T} = E{(A x_k + B e_k)(A x_k + B e_k)^T} = E{(A x_k + B e_k)(x_k^T A^T + e_k^T B^T)} = A E{x_k x_k^T} A^T + B E{e_k e_k^T} B^T

since xk and ek are uncorrelated, and hence

P (k + 1) = AP (k)AT + BBT

for all k ≥ 0, with P (0) = P0. Next write:

xk = Axk−1 + Bek−1 = A(Axk−2 + Bek−2) + Bek−1 = A2xk−2 + ABek−2 + Bek−1

and hence in general:

xk = Ajxk−j + Aj−1Bek−j + . . . + Bek−1

for 1 ≤ j ≤ k. Since xk−j is uncorrelated with {ek−j, . . . , ek−1} we have that:

E{x_k x_{k−j}^T} = E{A^j x_{k−j} x_{k−j}^T + A^{j−1} B e_{k−j} x_{k−j}^T + . . . + B e_{k−1} x_{k−j}^T} = A^j P(k − j) (12)

Also,

E{x_k e_{k−j}^T} = E{A^j x_{k−j} e_{k−j}^T + A^{j−1} B e_{k−j} e_{k−j}^T + . . . + B e_{k−1} e_{k−j}^T} = A^{j−1} B (13)

The covariance of the output signal is:

Cov(y_k, y_k) = E{y_k y_k^T} = E{(C x_k + D e_k)(x_k^T C^T + e_k^T D^T)} = C P(k) C^T + D D^T


using again the fact that xk and ek are uncorrelated. Also, for 1 ≤ j ≤ k,

Cov(y_k, y_{k−j}) = E{y_k y_{k−j}^T} = E{(C x_k + D e_k)(x_{k−j}^T C^T + e_{k−j}^T D^T)}
                  = C E{x_k x_{k−j}^T} C^T + C E{x_k e_{k−j}^T} D^T + D E{e_k x_{k−j}^T} C^T + D E{e_k e_{k−j}^T} D^T

The last two terms are equal to zero, and hence:

Cov(y_k, y_{k−j}) = C A^j P(k − j) C^T + C A^{j−1} B D^T

using (12) and (13). Next consider the discrete Lyapunov matrix equation P = APAT + BBT .

It is well known that if A is stable, the solution to this equation is unique and is given by:

P = Σ_{k=0}^{∞} A^k B B^T (A^k)^T

Now, by applying repeatedly the recursion (11) we get:

P(k) = A^k P_0 (A^k)^T + Σ_{i=0}^{k−1} A^i B B^T (A^i)^T

Taking limits k → ∞ the first term converges to zero (A is stable) and hence P (k) → P

(regardless of the initial condition P0). It is also clear that if P0 = P then Pk = P for all k ≥ 0.

Finally consider the case when the time-set is Z. Since A is stable, x_k and y_k can be expressed as the outputs of ARMA models and hence they are wide-sense stationary processes. The state-space model equations clearly show that E(x_k) = 0 and E(y_k) = 0. Let E(x_k x_k^T) = P̄. Then from stationarity,

P̄ = E(x_{k+1} x_{k+1}^T) = E{(A x_k + B e_k)(x_k^T A^T + e_k^T B^T)} = A P̄ A^T + B B^T

using the fact that x_k and e_k are uncorrelated. Thus P̄ satisfies the Lyapunov equation; uniqueness now implies that P̄ = P. Using a similar argument as before we can show that E(x_k x_{k−j}^T) = A^j P; using the output equations this establishes the expressions for Cov(y_k, y_{k−j}).
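The covariance recursion (11) and its limit are easy to check numerically. The sketch below (NumPy/SciPy; the matrices A and B are arbitrary illustrative values with A stable) iterates P(k+1) = A P(k) A^T + B B^T and compares the limit with the solution of the discrete Lyapunov equation returned by scipy.linalg.solve_discrete_lyapunov:

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Illustrative stable system (eigenvalues of A inside the unit circle)
A = np.array([[0.9, 0.2],
              [0.0, 0.5]])
B = np.array([[1.0],
              [0.5]])

# Iterate P(k+1) = A P(k) A^T + B B^T from an arbitrary P(0)
P = np.zeros((2, 2))
for _ in range(500):
    P = A @ P @ A.T + B @ B.T

# Compare with the unique solution of P = A P A^T + B B^T
P_lyap = solve_discrete_lyapunov(A, B @ B.T)
print(np.max(np.abs(P - P_lyap)))         # ~ 0: the recursion has converged to P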

1.6 The overall model (Deterministic and Stochastic parts)

We can now combine a deterministic model (described by difference equations) with an ARMA

noise model to give a complete stochastic dynamic model:

Here {ut} is a deterministic discrete-time input, {et} is a zero-mean white noise input of variance

E(e2t ) = σ2, ηt is the (coloured) noise signal and y(t) is the overall output signal. G(z−1) and

H(z−1) are assumed to be discrete LTI systems of the form:

G(z^{−1}) = (b_0 + b_1 z^{−1} + . . . + b_l z^{−l}) / (1 + a_1 z^{−1} + . . . + a_r z^{−r})   and   H(z^{−1}) = (d_0 + d_1 z^{−1} + . . . + d_p z^{−p}) / (1 + c_1 z^{−1} + . . . + c_q z^{−q})

i.e. y_t = G(z^{−1})u_t + H(z^{−1})e_t (see the figure below).

[Figure: Stochastic model. The input {u_t} drives G(z^{−1}); white noise {e_t} drives H(z^{−1}), producing {η_t}; the two signals are summed to give {y_t}.]

This will be the standard setting of the identification problem (next chapter): in general we are given a finite record of input-output data (u_t, y_t) (the output

data being corrupted by a noise component {ηt}) and we are required to estimate the parameters

of the two processes, i.e. the coefficients of G(z−1) and H(z−1) (and possibly their degrees).

Noise intensity σ2 is typically unknown.

Note: Inclusion of pure delays z^{−m} in G(z^{−1}) may be necessary, but pure delays in H(z^{−1}) can always be removed. For example, if

H(z^{−1}) = z^{−2}/(1 + 0.5z^{−1}) ;   η_t = H(z^{−1}) e_t

redefine ē_t = z^{−2} e_t = e_{t−2}; then the noise process is described identically as

H̄(z^{−1}) = 1/(1 + 0.5z^{−1}) ;   η_t = H̄(z^{−1}) ē_t

since {e_t} and {ē_t} have identical statistical properties.

Note: We can express model equations as

A(z−1)yt = B(z−1)ut + C(z−1)et

where A(z^{−1}) is the least common multiple of the denominator polynomials of G(z^{−1}) and H(z^{−1}). The explicit

dependence of this model on the deterministic sequence {ut} gives it the name ARMAX

(AutoRegressive Moving Average with eXogenous input).

1.7 Stochastic process modelling

Suppose we have a long record {η_t} of noise data and we wish to find its stochastic description (i.e. to model it as the output of an ARMA model driven by white noise of unit variance). We can proceed as follows:

1. Remove any significant mean/trend components. For example, we can remove the mean of η_t and redefine η̃_t = η_t − η̄, where η̄ = (1/N) Σ_{t=1}^{N} η_t is the estimated mean. η̄ can be modelled as a deterministic step disturbance.


2. Obtain estimates of the covariance function from the data (again assuming ergodicity) as:

R(k) = [1/(N − k)] Σ_{t=k+1}^{N} η_t η_{t−k}

for k = 0, 1, . . . , m, with m ≈ N/4. (A numerical sketch of steps 1-3 is given after this list.)

3. Use the discrete Fourier transform (FFT) to obtain an estimate of the spectral density:

Ψ(ω) = Φ_ηη(e^{jω}) = Σ_{k=−m}^{m} R(k) e^{−jωk}

in the range 0 ≤ ω ≤ π. (Note that ω = π corresponds to the Nyquist angular frequency; this is one half the sampling angular frequency and equals π/T rad/s, where T is the sampling period of the signal.)

4. Design a stable and minimum-phase filter

G(z^{−1}) = B(z^{−1})/A(z^{−1})

so that its frequency response magnitude |G(e^{jω})|² “fits” Ψ(ω). Note that (for real coefficients)

Ψ(ω) = Φ_ηη(e^{jω}) = G(e^{−jω}) G(e^{jω}) = |G(e^{jω})|²

since G(e^{−jω}) is the complex conjugate of G(e^{jω}).

5. The estimated model is shown in the figure below.

[Figure: Estimated model of stochastic signal. White noise {e_t} is filtered by B(z^{−1})/A(z^{−1}) and the estimated mean is added at a summing junction to produce the modelled signal {η_t}.]
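The sketch below (NumPy/SciPy; the simulated AR(1) record with coefficient 0.7 and an added mean is only a stand-in for a measured noise record) illustrates steps 1-3 numerically; the model-fitting step 4 is not shown:

import numpy as np
from scipy.signal import lfilter

# Stand-in data record: a simulated AR(1) noise sequence (in practice {eta_t} is measured)
rng = np.random.default_rng(1)
N = 5000
eta = lfilter([1.0], [1.0, -0.7], rng.standard_normal(N)) + 2.0

# Step 1: remove the estimated mean
eta = eta - eta.mean()

# Step 2: covariance estimates R(k) = 1/(N-k) * sum_{t=k+1}^{N} eta_t eta_{t-k}, k = 0..m
m = N // 4
R = np.array([np.dot(eta[k:], eta[:N - k]) / (N - k) for k in range(m + 1)])

# Step 3: spectral density estimate Psi(w) = R(0) + 2 sum_{k=1}^{m} R(k) cos(wk), 0 <= w <= pi
w = np.linspace(0, np.pi, 256)
Psi = R[0] + 2.0 * np.sum(R[1:, None] * np.cos(np.outer(np.arange(1, m + 1), w)), axis=0)

# Check the covariance estimates against the true AR(1) values R(k) = 0.7^k / (1 - 0.49)
print(R[:4], 0.7 ** np.arange(4) / (1 - 0.49))
# The raw Psi estimate with m = N/4 lags is noisy; in practice it would be smoothed
# (e.g. windowed) before fitting a stable, minimum-phase G(z^-1) in step 4.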


2. Systems Identification

2.1 The general identification problem

The general identification problem is to determine the parameters of a mathematical model

from the response of a physical system:

[Figure: Systems identification problem. The input u drives both the physical system (output y) and the model M(θ) (output y_m); the error e = y − y_m is formed at a comparator.]

The physical system is excited by an input u and produces an output y. The model is

parametrised by a vector-parameter θ which is allowed to vary over a suitable set Θ. When the

same input u is applied to the model, the simulated response is ym. This is compared with the

measured output y to produce an error e = y − ym. The problem is to choose θ ∈ Θ so that a

suitably chosen measure of the error “size” is minimised over all θ ∈ Θ, so that ym is “as close

as possible” to y.

2.2 Linear least-squares estimation

Postulate a model of the form:

y = Xθ + e

where:

y = (y_1, y_2, . . . , y_n)^T ,   e = (e_1, e_2, . . . , e_n)^T ,   θ = (θ_1, θ_2, . . . , θ_q)^T   and X ∈ R^{n×q}

Here y and X are known, θ is the vector of parameters to be estimated (unknown) and e is a

noise/disturbance vector (unknown).

Example: Consider a linear system which consists of a simple unknown gain θ and whose

output is corrupted by a noise sequence et (see figure). In this case model equations may be

written as:

( y_1 )   ( u_1 )       ( e_1 )
( y_2 ) = ( u_2 ) θ  +  ( e_2 )
(  ⋮  )   (  ⋮  )       (  ⋮  )
( y_n )   ( u_n )       ( e_n )


[Figure: Estimation of unknown gain. The input {u_t} is scaled by the unknown gain θ and the noise {e_t} is added to produce {y_t}.]

or as y = Xθ + e.

2.3 Least squares estimate

In least-squares estimation we choose θ to minimise the sum of squares of the “residuals”

(errors), i.e. we minimise

J(θ) = Σ_{i=1}^{n} (y_i − x_i^T θ)² = (y − Xθ)^T (y − Xθ)

Here xTi is the i-th row of matrix X (“regression vectors”). To find the minimising θ, first

expand J(θ):

J(θ) = (yT − θT XT )(y −Xθ) = yT y − 2θT XT y + θT XT Xθ

Setting the derivative to zero:

∂J(θ)/∂θ = −2X^T y + 2X^T Xθ = 0 ⇒ X^T Xθ = X^T y   (normal equations)

Note that XT X is a square (q × q) matrix and therefore the normal equations can be solved

uniquely if XT X is non-singular, or equivalently if X is of full column rank. In this case,

therefore, the normal equations have the unique solution:

θ̂ = (X^T X)^{−1} X^T y
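A minimal numerical sketch of this solution, applied to the unknown-gain example of Section 2.2 (NumPy; the gain value 2.5, the noise level and the sample size are made-up illustrative choices):

import numpy as np

rng = np.random.default_rng(2)
n, q = 200, 1
theta_true = np.array([2.5])                      # unknown gain (illustrative value)
X = rng.uniform(-1.0, 1.0, size=(n, q))           # input samples u_1..u_n as a column
e = 0.3 * rng.standard_normal(n)                  # measurement noise
y = X @ theta_true + e

# Normal equations: X^T X theta_hat = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically preferable) QR-based solution
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat, theta_lstsq)                     # both close to 2.5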

2.4 Geometric interpretation of least squares

Definition: For a real n × q matrix X the range of X is the set

Range(X) = {z ∈ R^n : z = Xθ, θ ∈ R^q}

which is a subspace of R^n. The dimension of Range(X) is called the rank of X. If X = [x_1 x_2 . . . x_q] then Range(X) is the linear span of the columns of X, i.e. Range(X) = {Σ_{i=1}^{q} θ_i x_i : θ_i ∈ R}.

The normal equations for the least-squares solution can be written as:

X^T Xθ̂ − X^T y = 0 ⇒ X^T (Xθ̂ − y) = 0 ⇒ x_i^T (Xθ̂ − y) = 0 for all i = 1, 2, . . . , q

This means that (y − Xθ̂) ⊥ x_i for all i, i.e. (y − Xθ̂) ⊥ Range(X). Thus Xθ̂ is the closest point in Range(X) to y, i.e. Xθ̂ is the orthogonal projection of y onto Range(X).


[Figure: Least squares estimation.]

In general, with every X ∈ R^{n×q} (of full column rank) we can define two projection operators:

• P_1 := X(X^T X)^{−1}X^T (projection onto Range(X)).

• P_2 := I_n − P_1 = I_n − X(X^T X)^{−1}X^T (projection onto Range(X)^⊥).

[Check this: for every y ∈ Range(X), i.e. y = Xθ for some θ, P_1 y = X(X^T X)^{−1}X^T Xθ = Xθ = y and P_2 y = (I_n − P_1)y = y − P_1 y = 0.] Thus, if θ̂ is the least-squares estimate, Xθ̂ = X(X^T X)^{−1}X^T y = P_1 y (the projection of y onto Range(X)) and ê = y − Xθ̂ = (I_n − P_1)y = P_2 y (the optimal residuals are orthogonal to Range(X)).

2.5 Statistical properties of least squares

Now assume that there is a “true” parameter θ, treat e as a vector of random variables and examine the statistical properties of θ̂. Assume that:

1. E(e) = 0.

2. E(eeT ) = σ2I

i.e. the ei’s are zero mean, uncorrelated with common variance σ2.

Note that since y = Xθ + e, the y_i's are random variables, and hence any linear combination of these, b^T y say, is also a random variable. Thus θ̂ = (X^T X)^{−1}X^T y is a vector of random variables and as such it will have a probability distribution. (Think of this as follows: perform a sequence of random experiments, each time with a different realisation of the random vector e (of course drawn from the same distribution!) and record in a histogram the percentage (successes over trials) with which the estimate falls in each interval of the histogram. In the limit, as the number of experiments increases and the histogram partitioning becomes finer and finer, we will have obtained the statistical distribution of θ̂.)

When is θ̂ a “good” estimate? It is natural to say that θ̂ is “good” if the following two conditions are satisfied:

• The statistical mean of θ̂ coincides with the true parameter θ, i.e. the estimates' average over noise realisations e (in the limit) is equal to θ.

• The variance of θ̂ is small, i.e. the statistical spread of the estimates (for different noise realisations) around the mean is small.

It is shown next that under the assumptions that (i) X is a constant (deterministic) matrix, and (ii) the e_i's are zero-mean and uncorrelated with common variance σ², then:

• θ̂ is an unbiased estimate of θ, i.e. E(θ̂) = θ.

• θ̂ has the smallest possible variance among all linear unbiased estimates of θ.

We start with the unbiasedness property:

Theorem: θ̂ is unbiased.

Proof:

E(θ̂) = E{(X^T X)^{−1}X^T (Xθ + e)} = E{θ + (X^T X)^{−1}X^T e}

Since θ and X are deterministic,

E(θ̂) = θ + (X^T X)^{−1}X^T E(e) = θ

since e is zero-mean.

Next we calculate the covariance of θ̂:

Theorem: Cov(θ̂) = σ²(X^T X)^{−1}.

Proof:

θ̂ − θ = (X^T X)^{−1}X^T (Xθ + e) − θ = θ + (X^T X)^{−1}X^T e − θ = (X^T X)^{−1}X^T e

Thus:

Cov(θ̂) = E{(θ̂ − θ)(θ̂ − θ)^T}                      (since E(θ̂) = θ)
        = E{(X^T X)^{−1}X^T e e^T X(X^T X)^{−1}}     (first line of proof!)
        = (X^T X)^{−1}X^T E(e e^T) X(X^T X)^{−1}     (since X is deterministic)
        = σ²(X^T X)^{−1}X^T X(X^T X)^{−1}            (since E(e e^T) = σ²I_n)
        = σ²(X^T X)^{−1}

as required.
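Both properties (unbiasedness and the covariance formula) are easy to check by Monte Carlo simulation with a fixed, deterministic X. A sketch (NumPy; the dimensions, the "true" θ and σ are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(3)
n, q, sigma = 50, 3, 0.5
X = rng.standard_normal((n, q))                   # fixed (deterministic) regressor matrix
theta = np.array([1.0, -2.0, 0.5])                # "true" parameter (illustrative)

trials = 20000
est = np.empty((trials, q))
for t in range(trials):
    e = sigma * rng.standard_normal(n)            # fresh noise realisation each trial
    y = X @ theta + e
    est[t] = np.linalg.solve(X.T @ X, X.T @ y)

print(est.mean(axis=0))                           # ~ theta (unbiased)
print(np.cov(est, rowvar=False))                  # ~ sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))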

Theorem: θ̂ is BLUE (‘Best Linear Unbiased Estimate’), i.e. given any estimate β of the form β = By (linear) such that E(By) = θ for all θ, then Cov(θ̂) ≤ Cov(β).

Proof: Let β = By be any linear unbiased estimate, then:

E{By} = E{B(Xθ + e)} = BXθ = θ for all θ


and so BX = I. Also

By = B(Xθ + e) = BXθ + Be = θ + Be

so that

By − θ = Be = deviation of By from its mean.

Hence,

Cov(β) = Cov(By) = E{(By − θ)(By − θ)^T} = E{B e e^T B^T} = B E{e e^T} B^T = σ²BB^T

Now let B = (X^T X)^{−1}X^T + B̃. Since

BX = I ⇒ (X^T X)^{−1}X^T X + B̃X = I ⇒ B̃X = 0

we have

Cov(β) = Cov(By) = σ²BB^T = σ²[(X^T X)^{−1}X^T + B̃][X(X^T X)^{−1} + B̃^T]
       = σ²(X^T X)^{−1} + σ²B̃B̃^T
       = Cov(θ̂) + σ²B̃B̃^T
       ≥ Cov(θ̂)

2.6 Estimation of noise variance

Consider the model

y = Xθ + e

in which e is an n-vector of zero-mean, uncorrelated random variables with common variance σ² and θ is a q-vector. We still assume that X is deterministic and that its columns are linearly independent. The parameter vector θ is constant (but unknown). Assume also that the noise variance σ² is unknown. A natural estimate of the variance σ² would be

(1/n) Σ_{i=1}^{n} e_i² = (1/n)(y − Xθ)^T (y − Xθ)

Since θ is unknown, however, we may try θ̂ instead. It turns out that the estimate

(1/n)(y − Xθ̂)^T (y − Xθ̂)

(which is the “maximum-likelihood” estimate of σ²) is biased. An unbiased estimate of σ² is

given in the next theorem. Before presenting this theorem and its proof some background

material on the trace of a matrix is included.

The trace of a matrix: For a square, (n × n) say, matrix S, its trace is defined as the sum

of its diagonal elements, i.e.

trace(S) = Σ_{i=1}^{n} S_ii


Trace(·) is a linear operator: trace(A + B) = trace(A) + trace(B), and trace(aA) = a·trace(A) for every scalar a. Another useful property is that trace(AB) = trace(BA) whenever the products AB and BA are both defined. Check this: if A ∈ R^{m×n} and B ∈ R^{n×m} (so that both products are defined), then:

(AB)_ii = Σ_k A_ik B_ki

and hence

trace(AB) = Σ_i (AB)_ii = Σ_i Σ_k A_ik B_ki = Σ_k Σ_i B_ki A_ik = Σ_k (BA)_kk = trace(BA)

Another useful property of trace (not used here) is that

trace(S) = Σ_{i=1}^{n} λ_i(S)

where λi(S) are the eigenvalues of S.

Theorem:

s² := [1/(n − q)] (y − Xθ̂)^T (y − Xθ̂)

is an unbiased estimate of σ².

Proof: First note that:

(y − Xθ̂)^T (y − Xθ̂) = (y − X(X^T X)^{−1}X^T y)^T (y − X(X^T X)^{−1}X^T y)
 = y^T (I − X(X^T X)^{−1}X^T)(I − X(X^T X)^{−1}X^T) y
 = y^T P_2² y                                   (P_2 is the projection onto Range(X)^⊥)
 = y^T (I − X(X^T X)^{−1}X^T) y                 (P_2² = P_2: one projection is enough, or expand)
 = trace{(I − X(X^T X)^{−1}X^T) y y^T}          (trace(AB) = trace(BA))
 = trace{(I − X(X^T X)^{−1}X^T)(Xθ + e)(θ^T X^T + e^T)}

Hence

E{(y − Xθ̂)^T (y − Xθ̂)} = E{trace(P_2 (Xθ + e)(θ^T X^T + e^T))}
 = trace{P_2 (Xθθ^T X^T + σ²I_n)}               (E(·) and trace(·) commute; E(e) = 0, E(ee^T) = σ²I_n)
 = σ²{trace(I_n) − trace(X(X^T X)^{−1}X^T)}     (since P_2 X = 0)
 = σ²{n − trace((X^T X)^{−1}X^T X)}
 = σ²{n − trace(I_q)}
 = σ²{n − q}

Thus

σ² = [1/(n − q)] E{(y − Xθ̂)^T (y − Xθ̂)} = E(s²)

and hence s² is an unbiased estimate of σ².
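The bias of the 1/n estimate and the unbiasedness of s² can be verified by simulation. A minimal Monte Carlo sketch (NumPy; n = 20, q = 5 and σ² = 1 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(4)
n, q, sigma2 = 20, 5, 1.0
X = rng.standard_normal((n, q))
theta = rng.standard_normal(q)

ml, unbiased = [], []
for _ in range(20000):
    y = X @ theta + np.sqrt(sigma2) * rng.standard_normal(n)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = float((y - X @ theta_hat) @ (y - X @ theta_hat))
    ml.append(rss / n)                      # biased ("maximum-likelihood") estimate
    unbiased.append(rss / (n - q))          # unbiased estimate s^2 from the theorem

print(np.mean(ml), sigma2 * (n - q) / n)    # ~ 0.75: biased low by the factor (n-q)/n
print(np.mean(unbiased))                    # ~ 1.0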


2.7 Least squares estimation for dynamic systems

Consider the ARMA model

A(z−1)yk = B(z−1)uk + ek

in which

A(z^{−1}) = 1 + a_1 z^{−1} + a_2 z^{−2} + . . . + a_n z^{−n}

and

B(z^{−1}) = b_0 + b_1 z^{−1} + b_2 z^{−2} + . . . + b_m z^{−m}

We wish to estimate the parameter vector:

θ = [a1 a2 . . . an | b0 . . . bm]T

from input-output measurements (ui, yi), i = 1, 2, . . . , N . The model equations may be written

as:

yk = [−yk−1 − yk−2 . . . − yk−n | uk . . . uk−m]θ + ek

for k = 1, 2, . . . , N , or, in matrix form:

( y_1 )   ( −y_0      −y_{−1}   . . .  −y_{1−n}   u_1   . . .  u_{1−m} )       ( e_1 )
( y_2 ) = ( −y_1      −y_0      . . .  −y_{2−n}   u_2   . . .  u_{2−m} ) θ  +  ( e_2 )
(  ⋮  )   (    ⋮          ⋮                ⋮         ⋮            ⋮     )       (  ⋮  )
( y_N )   ( −y_{N−1}  −y_{N−2}  . . .  −y_{N−n}   u_N   . . .  u_{N−m} )       ( e_N )

This is now in the form y = Xθ + e and the parameter vector θ may be estimated by least

squares.
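The construction of the regression matrix and the resulting estimate can be sketched as follows (NumPy/SciPy; the model orders n = 2, m = 1, the coefficient values and the white-noise input are illustrative choices, and samples with non-positive index are replaced by zeros as discussed in Note 2 below):

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(5)
N, n, m = 5000, 2, 1
a = np.array([1.0, -1.2, 0.5])            # A(z^-1) = 1 - 1.2 z^-1 + 0.5 z^-2 (stable, illustrative)
b = np.array([0.8, 0.4])                  # B(z^-1) = 0.8 + 0.4 z^-1
u = rng.standard_normal(N)                # persistently exciting input (white noise)
e = 0.1 * rng.standard_normal(N)
y = lfilter(b, a, u) + lfilter([1.0], a, e)     # A y = B u + e

# Build the regression matrix; samples with index <= 0 are replaced by zeros
def lagged(x, lags):
    cols = []
    for l in lags:
        col = np.zeros(N)
        col[l:] = x[:N - l]
        cols.append(col)
    return np.column_stack(cols)

X = np.hstack([-lagged(y, range(1, n + 1)), lagged(u, range(0, m + 1))])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)        # ~ [-1.2, 0.5, 0.8, 0.4] = [a1, a2, b0, b1]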

Note 1: The ‘regression matrix’ X is partitioned as X = [X1 | X2], in which X1 and X2 have

a special (“Toeplitz”) structure, i.e. entries along main diagonals are identical.

Note 2: The upper triangular parts of X_1 and X_2 contain the data y_i and u_i for i ≤ 0, which represent initial conditions. If these data are unavailable we use 0's to fill these positions.

Note 3: In this case X is a random matrix (since y is random and correlated with e), so standard LS theory does not apply. In particular:

• For any finite N, θ̂ is biased.

• θ̂ is asymptotically unbiased (i.e. lim_{N→∞} E(θ̂_N) = θ for all θ) if the e_k's are uncorrelated and the u_k's are “persistently exciting” (see next).

• θ̂ is biased even asymptotically if the e_k's are not white.

Asymptotic convergence in the case of white/correlated noise is examined in the following

examples:

Example 1: (white noise)


Consider a second-order autoregressive model for simplicity:

y_k = ( y_{k−1}  y_{k−2} ) ( −a_1 , −a_2 )^T + e_k ,   k = 1, 2, . . . , n

In this case,

y1

y2

...

yn

=

y0 y−1

y1 y0

......

yn−1 yn−2

(−a1

−a2

)+

e1

e2

...

en

and hence the least-squares estimate is given by:

θ̂ = (X^T X)^{−1}X^T y = (X^T X)^{−1}X^T (Xθ + e) = θ + (X^T X)^{−1}X^T e

Now,

X^T X = ( Σ_{k=1}^{n} y_{k−1}²          Σ_{k=1}^{n} y_{k−1} y_{k−2} )
        ( Σ_{k=1}^{n} y_{k−1} y_{k−2}   Σ_{k=1}^{n} y_{k−2}²        )

or

(1/n) X^T X = ( (1/n) Σ_{k=1}^{n} y_{k−1}²          (1/n) Σ_{k=1}^{n} y_{k−1} y_{k−2} )
              ( (1/n) Σ_{k=1}^{n} y_{k−1} y_{k−2}   (1/n) Σ_{k=1}^{n} y_{k−2}²        )

Under ergodicity assumptions, the “sample covariances” converge to the “true covariances” and

hence:

lim_{n→∞} { (1/n) X^T X } = ( R_yy(0)  R_yy(1) )
                            ( R_yy(1)  R_yy(0) )

Also,

(1/n) X^T e = ( (1/n) Σ_{k=1}^{n} y_{k−1} e_k )
              ( (1/n) Σ_{k=1}^{n} y_{k−2} e_k )

so that:

lim_{n→∞} { (1/n) X^T e } = ( E(y_{k−1} e_k) ) = 0
                            ( E(y_{k−2} e_k) )

since clearly the pairs (yk−1, ek) and (yk−2, ek) are uncorrelated. Hence:

θ̂_n − θ = { (1/n) X^T X }^{−1} { (1/n) X^T e } −→ 0

as n → ∞, so that lim_{n→∞} θ̂_n = θ and the LS estimates are asymptotically unbiased.

Example 2: (Correlated disturbances)


Consider now the model

yk = ayk−1 + wk, wk = ek + cek−1

in which {e_k} is white (E(e_k) = 0, Var(e_k) = σ²). In this case,

( y_1 )   ( y_0     )       ( w_1 )
( y_2 ) = ( y_1     ) a  +  ( w_2 )
(  ⋮  )   (   ⋮     )       (  ⋮  )
( y_n )   ( y_{n−1} )       ( w_n )

which is in the form y = Xa + w. Hence,

X^T X = Σ_{k=1}^{n} y_{k−1}² ,    X^T y = Σ_{k=1}^{n} y_k y_{k−1}

and hence the least-squares estimate is

â_n = [ Σ_{k=1}^{n} y_k y_{k−1} ] / [ Σ_{k=1}^{n} y_{k−1}² ]

so that

lim_{n→∞} â_n = lim_{n→∞} [ (1/n) Σ_{k=1}^{n} y_k y_{k−1} ] / [ (1/n) Σ_{k=1}^{n} y_{k−1}² ] = R_yy(1) / R_yy(0)

under appropriate ergodicity assumptions. Now

y_k = a y_{k−1} + e_k + c e_{k−1}

Multiplying by yk−1 and taking expectations:

E{y_k y_{k−1}} = a E{y_{k−1}²} + E{e_k y_{k−1}} + c E{e_{k−1} y_{k−1}} ⇒ R_yy(1) = a R_yy(0) + c E(e_k y_k) (14)

Multiplying by ek and taking expectations:

E{y_k e_k} = a E{y_{k−1} e_k} + E{e_k²} + c E{e_{k−1} e_k} ⇒ E(y_k e_k) = σ² (15)

Substituting (15) in (14):

R_yy(1) = a R_yy(0) + cσ² ⇒ lim_{n→∞} â_n = a + cσ²/R_yy(0)

and hence there is an asymptotic bias:

asymptotic bias = cσ²/R_yy(0)

Note: Ryy(0) may be calculated explicitly as:

R_yy(0) = σ²(c² + 2ac + 1)/(1 − a²)


(exercise!)
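The asymptotic bias formula can be checked by simulation. The sketch below (NumPy/SciPy; a = 0.8, c = 0.5, σ² = 1 and the record length are arbitrary choices) computes the least-squares estimate â_n from a long realisation and compares it with a + cσ²/R_yy(0):

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(6)
a, c, sigma2, N = 0.8, 0.5, 1.0, 500_000
e = np.sqrt(sigma2) * rng.standard_normal(N)
y = lfilter([1.0, c], [1.0, -a], e)              # y_k = a y_{k-1} + e_k + c e_{k-1}

a_hat = np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1])

Ryy0 = sigma2 * (c**2 + 2*a*c + 1) / (1 - a**2)
print(a_hat)                                     # biased estimate of a
print(a + c * sigma2 / Ryy0)                     # predicted asymptotic value (~ 0.89, not 0.8)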

2.8 Persistence of excitation and choice of input

Consider the least-squares algorithm applied to the Moving-Average model:

yk = b0uk + b1uk−1 + . . . + bquk−q + ek, k = 1, 2, . . . , n

This may be written in full as:

( y_1 )   ( u_1  u_0      . . .  u_{1−q} ) ( b_0 )   ( e_1 )
( y_2 ) = ( u_2  u_1      . . .  u_{2−q} ) ( b_1 ) + ( e_2 )
(  ⋮  )   (  ⋮    ⋮                ⋮     ) (  ⋮  )   (  ⋮  )
( y_n )   ( u_n  u_{n−1}  . . .  u_{n−q} ) ( b_q )   ( e_n )

or y = Xθ + e. It is clear that certain conditions must be imposed on the input sequence {u_k} so that the parameters b_0, b_1, . . . , b_q can be successfully identified; for example, if u_k = 0 for all k, identification of the b_i's is impossible. In general, for a unique least-squares estimate, X^T X

must be non-singular. In this case,

X^T X = ( u_1      u_2      . . .  u_n     ) ( u_1  u_0      . . .  u_{1−q} )
        ( u_0      u_1      . . .  u_{n−1} ) ( u_2  u_1      . . .  u_{2−q} )
        (  ⋮        ⋮                ⋮     ) (  ⋮    ⋮                ⋮     )
        ( u_{1−q}  u_{2−q}  . . .  u_{n−q} ) ( u_n  u_{n−1}  . . .  u_{n−q} )

and hence

(1/n) X^T X = ( (1/n) Σ_{k=1}^{n} u_k²            (1/n) Σ_{k=1}^{n} u_k u_{k−1}     . . .  (1/n) Σ_{k=1}^{n} u_k u_{k−q}     )
              ( (1/n) Σ_{k=1}^{n} u_k u_{k−1}     (1/n) Σ_{k=1}^{n} u_{k−1}²         . . .  (1/n) Σ_{k=1}^{n} u_{k−1} u_{k−q} )
              (   ⋮                                  ⋮                                        ⋮                               )
              ( (1/n) Σ_{k=1}^{n} u_{k−q} u_k     (1/n) Σ_{k=1}^{n} u_{k−q} u_{k−1}  . . .  (1/n) Σ_{k=1}^{n} u_{k−q}²        )

In the general case when {u_k} is a (zero-mean) stochastic process, under ergodicity assumptions:

(1/n)(X^T X) −→ ( R_uu(0)  R_uu(1)    . . .  R_uu(q)   )
                ( R_uu(1)  R_uu(0)    . . .  R_uu(q−1) )  := R_{q+1}
                (   ⋮         ⋮                  ⋮     )
                ( R_uu(q)  R_uu(q−1)  . . .  R_uu(0)   )

where Ruu(i) is the covariance function of {uk}. Hence, for asymptotic identifiability of a

(q + 1)-th order model we require that Rq+1 > 0 (positive definite). Note that if Rq > 0 then

Rq−i > 0 for all i = 1, 2, . . . , q.

Definition: A signal {uk} is called persistently exciting of order q if the matrix Rq is positive

definite.

Note: This condition allows us to decide whether a particular input signal is adequate to identify

the coefficients of a particular model.


Theorem: A signal {u_k} is persistently exciting of order q + 1 iff:

lim_{n→∞} (1/n) Σ_{k=1}^{n} {A(z^{−1}) u_k}² > 0

for every polynomial A(z^{−1}) ≠ 0 of degree q or less (i.e. for A(z^{−1}) of the form A(z^{−1}) = a_0 + a_1 z^{−1} + . . . + a_q z^{−q}).

Proof: For A(z^{−1}) of this form, define:

a^T = [a_0  a_1  . . .  a_q]   and   U_k^T = [u_k  u_{k−1}  . . .  u_{k−q}]

Then

(1/n) Σ_{k=1}^{n} {A(z^{−1}) u_k}² = (1/n) Σ_{k=1}^{n} a^T U_k U_k^T a

or,

(1/n) Σ_{k=1}^{n} {A(z^{−1}) u_k}² = a^T [ (1/n) Σ_{k=1}^{n} U_k U_k^T ] a

where the bracketed matrix is precisely the matrix of sample covariances of {u_k} displayed above. Hence, as n → ∞,

(1/n) Σ_{k=1}^{n} {A(z^{−1}) u_k}² −→ a^T R_{q+1} a

Thus lim_{n→∞} (1/n) Σ_{k=1}^{n} {A(z^{−1}) u_k}² > 0 for every A(z^{−1}) ≠ 0 of degree less than or equal to q iff a^T R_{q+1} a > 0 for all a ≠ 0, i.e. iff R_{q+1} > 0.

Example:

• Take u_k = 1 (unit step). Since (1 − z^{−1})u_k = u_k − u_{k−1} = 0, there exists a non-trivial polynomial in z^{−1} of degree 1 such that A(z^{−1})u_k = 0; hence {u_k} can be persistently exciting of order 1 at most.

• Take u_k = e^{jωk} (complex sinusoid) and define A(z^{−1}) = 1 − 2z^{−1} cos ω + z^{−2}. Now,

A(z^{−1}) u_k = (1 − z^{−1}(e^{jω} + e^{−jω}) + z^{−2}) e^{jωk}
             = e^{jωk} − e^{jω} e^{jω(k−1)} − e^{−jω} e^{jω(k−1)} + e^{jω(k−2)}
             = e^{jωk} (1 − 1 − e^{−2jω} + e^{−2jω}) = 0

Thus there exists a non-trivial polynomial of degree 2 such that A(z^{−1}) u_k = 0, and hence e^{jωk} can be persistently exciting of order at most 2.

• Take uk = ek (white noise of variance σ2). Then Rq = σ2Iq > 0 for all q. Hence ek is

persistently exciting of any order.
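The persistence-of-excitation condition can be checked numerically by forming the sample version of R_q from a data record and inspecting its smallest eigenvalue. A sketch (NumPy; the helper R_matrix, the signal choices and q = 3 are illustrative; a real cosine is used in place of the complex sinusoid, which is annihilated by the same second-order polynomial):

import numpy as np

def R_matrix(u, q):
    # Sample estimate of R_q: time average of U_k U_k^T with U_k = (u_k, ..., u_{k-q+1})
    n = len(u)
    U = np.column_stack([u[q - 1 - i : n - i] for i in range(q)])
    return (U.T @ U) / U.shape[0]

n, q = 10_000, 3
k = np.arange(n)
rng = np.random.default_rng(7)

signals = {
    "unit step": np.ones(n),
    "sinusoid": np.cos(0.3 * k),
    "white noise": rng.standard_normal(n),
}
for name, u in signals.items():
    lam_min = np.linalg.eigvalsh(R_matrix(u, q)).min()
    print(f"{name:12s}  smallest eigenvalue of R_{q}: {lam_min:.4f}")
# For q = 3 the step and the sinusoid give a (numerically) zero smallest eigenvalue,
# i.e. they are not persistently exciting of order 3, while white noise gives
# R_q ~ sigma^2 I > 0 for any q.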


2.9 Generalised least-squares algorithm

We have seen that the direct application of least-squares gives biased estimates for correlated

disturbances. Consider the model:

A(z−1)yk = B(z−1)uk + wk

where

D(z−1)wk = ek, {ek} white

(Note: Here the noise is modelled as an AR process, not MA!) Suppose that D(z^{−1}) is known. In this case,

A(z^{−1})D(z^{−1}) y_k = B(z^{−1})D(z^{−1}) u_k + e_k

Let y*_k and u*_k be the output and input sequences filtered by D(z^{−1}), i.e. define y*_k = D(z^{−1}) y_k and u*_k = D(z^{−1}) u_k. Then,

A(z^{−1}) y*_k = B(z^{−1}) u*_k + e_k

in which the noise is now white. Thus the least-squares algorithm can be applied to give

asymptotically unbiased estimates of the coefficients of A(z−1) and B(z−1).

The generalised least-squares algorithm is an iterative application of this idea when D(z−1) is

unknown. It involves two steps:

1. Least squares estimation applied to data filtered through the current estimate of D(z−1),

and

2. An iterative procedure for estimating D(z−1): Using the current estimates of A(z−1) and

B(z−1) we calculate wk := A(z−1)yk−B(z−1)uk; next treat {wk} as an output series and

estimate D(z−1) from the model D(z−1)wk = ek using least-squares.

A conceptual algorithm is summarised next:

Algorithm - generalised least-squares:

1. Choose D_0(z^{−1}) = 1; set j ← 1.

2. Choose:

A_j(z^{−1}) = 1 + a_1^{(j)} z^{−1} + . . . + a_n^{(j)} z^{−n}
B_j(z^{−1}) = b_0^{(j)} + b_1^{(j)} z^{−1} + . . . + b_m^{(j)} z^{−m}
D_j(z^{−1}) = 1 + d_1^{(j)} z^{−1} + . . . + d_p^{(j)} z^{−p}

according to the following procedure:

3. Filter the data through D_{j−1}(z^{−1}), i.e. calculate the series y*_k = D_{j−1}(z^{−1}) y_k and u*_k = D_{j−1}(z^{−1}) u_k.


4. Choose A_j(z^{−1}), B_j(z^{−1}) using least-squares to fit the model:

A(z^{−1}) y*_k = B(z^{−1}) u*_k + e_k

and form the resulting optimal polynomials A_j(z^{−1}) and B_j(z^{−1}).

5. Calculate the optimal residuals: w_k = A_j(z^{−1}) y_k − B_j(z^{−1}) u_k.

6. Using {w_k} as the output sequence, fit using least-squares the model D(z^{−1}) w_k = e_k and form the resulting optimal polynomial D_j(z^{−1}).

7. Calculate the residuals e_k = D_j(z^{−1}) w_k. Are these white (apply the whiteness test - see next), or have the estimates converged? If yes, stop; else set j ← j + 1 and go to (3).
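A compact numerical sketch of this algorithm is given below (NumPy/SciPy; the first-order polynomials A, B, D, their coefficient values and the helper functions arx_fit and ar_fit are illustrative choices, and convergence to a good estimate is not guaranteed in general):

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(8)
N = 20_000
a_true, b_true, d_true = [1.0, -0.7], [1.0, 0.5], [1.0, -0.6]   # A, B, D (illustrative)
u = rng.standard_normal(N)
e = 0.5 * rng.standard_normal(N)
w = lfilter([1.0], d_true, e)                                 # D(z^-1) w = e (AR noise)
y = lfilter(b_true, a_true, u) + lfilter([1.0], a_true, w)    # A y = B u + w

def lag(x, l):                                  # x delayed by l samples (zeros at the start)
    out = np.zeros_like(x)
    out[l:] = x[:len(x) - l]
    return out

def arx_fit(y, u, n, m):
    # LS fit of A(z^-1) y = B(z^-1) u + e; returns (A, B) coefficient arrays
    X = np.column_stack([-lag(y, i) for i in range(1, n + 1)] +
                        [lag(u, i) for i in range(0, m + 1)])
    th, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.r_[1.0, th[:n]], th[n:]

def ar_fit(w, p):
    # LS fit of D(z^-1) w = e; returns the D coefficient array
    X = np.column_stack([-lag(w, i) for i in range(1, p + 1)])
    d, *_ = np.linalg.lstsq(X, w, rcond=None)
    return np.r_[1.0, d]

D = np.array([1.0])                                       # D_0 = 1
for j in range(5):
    ys, us = lfilter(D, [1.0], y), lfilter(D, [1.0], u)   # step 3: filter the data
    A, B = arx_fit(ys, us, 1, 1)                          # step 4
    w_res = lfilter(A, [1.0], y) - lfilter(B, [1.0], u)   # step 5: residual series
    D = ar_fit(w_res, 1)                                  # step 6
    print(j + 1, A, B, D)
# The estimates should move towards A ~ [1, -0.7], B ~ [1, 0.5], D ~ [1, -0.6].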

A possible stopping condition of the generalised least-squares algorithm is a “whiteness test” on

the residual sequence {ek}. One such test is based on the statistical properties of the “sample

covariances” of ek.

Statistical properties of “white” sequences

Suppose that {e_k}_{k=1}^{n} is a white sequence with variance Var(e_k) = σ². Introduce the “sample covariances” φ_ee(k),

φ_ee(k) = [1/(n − k_n)] Σ_{i=1}^{n−k_n} e_i e_{i+k} ,   0 ≤ k ≤ k_n

for some kn (fixed). Typically kn ≈ 0.15n. Notice that:

E{φ_ee(0)} = [1/(n − k_n)] Σ_{i=1}^{n−k_n} E(e_i²) = [1/(n − k_n)] σ²(n − k_n) = σ²

and

E{φ_ee(k)} = [1/(n − k_n)] Σ_{i=1}^{n−k_n} E(e_i e_{i+k}) = 0   for k ≠ 0

Also, for k ≠ 0,

E{φ_ee²(k)} = [1/(n − k_n)²] E{ ( Σ_{i=1}^{n−k_n} e_i e_{i+k} )² } = [1/(n − k_n)²] E{ Σ_{i=1}^{n−k_n} e_i² e_{i+k}² + cross-terms }
            = [1/(n − k_n)²] σ⁴ (n − k_n) = σ⁴/(n − k_n)

and E{φ_ee(k) φ_ee(l)} = 0 for k ≠ l. We are interested in the (approximate) statistical properties of the random variable

ξ_k = φ_ee(k)/φ_ee(0) ,   k = 1, 2, . . .

If n is “large” we can ignore the statistical fluctuations in the denominator and take φ_ee(0) ≈ E{φ_ee(0)} = σ² (constant). Hence,

E(ξ_k) ≈ E( φ_ee(k)/σ² ) = 0


and

Var(ξ_k) = Var{ φ_ee(k)/φ_ee(0) } = E{ φ_ee²(k)/φ_ee²(0) } ≈ E{φ_ee²(k)}/σ⁴ = 1/(n − k_n) ,   k = 1, 2, . . .

Now, since the φ_ee(k) are averages of a large number of identically distributed random variables, φ_ee(k)/φ_ee(0) is approximately normal (central limit theorem). Hence, the probability that ξ_k is within two standard deviations of the mean is approximately 95 per cent, or

Prob{ −2/√(n − k_n) ≤ φ_ee(k)/φ_ee(0) ≤ 2/√(n − k_n) } ≈ 0.95

This leads to the following statistical “whiteness test” for a sequence {e_k}: plot the values of φ_ee(k)/φ_ee(0) for k = 1, 2, . . . , k_n. If fewer than 95 per cent of these statistics lie inside the strip [−2/√(n − k_n), 2/√(n − k_n)], then REJECT the hypothesis that {e_k} is white (with 95 per cent confidence).
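A direct implementation of this test (a sketch; the function whiteness_stats is my own naming, the default k_n = 0.15n follows the discussion above, and the test data are simulated):

import numpy as np
from scipy.signal import lfilter

def whiteness_stats(e, kn=None):
    # Normalised sample covariances xi_k = phi_ee(k)/phi_ee(0), k = 1..k_n, and the 95% bound
    n = len(e)
    if kn is None:
        kn = int(0.15 * n)
    base = e[:n - kn]
    phi0 = np.dot(base, base) / (n - kn)
    xi = np.array([np.dot(base, e[k:k + n - kn]) for k in range(1, kn + 1)]) / ((n - kn) * phi0)
    return xi, 2.0 / np.sqrt(n - kn)

rng = np.random.default_rng(9)
white = rng.standard_normal(5000)
coloured = lfilter([1.0], [1.0, -0.8], rng.standard_normal(5000))

for name, e in [("white", white), ("AR(1)-coloured", coloured)]:
    xi, bound = whiteness_stats(e)
    inside = np.mean(np.abs(xi) <= bound)
    print(name, round(inside, 3))   # ~0.95 for white noise, well below 0.95 for coloured data
# Reject whiteness (at the 95 per cent level) if the printed fraction falls below 0.95.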

2.10 Recursive least-squares

We consider a modification of the least-squares algorithm which updates the old estimates as

new data come in. Consider the standard model:

( y_1 )   ( x_1^T )       ( e_1 )
( y_2 ) = ( x_2^T ) θ  +  ( e_2 )
(  ⋮  )   (   ⋮   )       (  ⋮  )
( y_n )   ( x_n^T )       ( e_n )

which we now write as Y_n = X_n θ + E_n to keep track of dimensions. The least-squares estimate of θ at time t = n is given by:

θ̂_n = (X_n^T X_n)^{−1} X_n^T Y_n

Define P_n := (X_n^T X_n)^{−1}; then

θ̂_n = P_n ( x_1  . . .  x_n ) ( y_1 , . . . , y_n )^T = P_n Σ_{i=1}^{n} x_i y_i = P_n ( Σ_{i=1}^{n−1} x_i y_i + x_n y_n )

Now

θ̂_{n−1} = P_{n−1} Σ_{i=1}^{n−1} x_i y_i   ⇒   Σ_{i=1}^{n−1} x_i y_i = P_{n−1}^{−1} θ̂_{n−1}

Hence

θ̂_n = P_n { P_{n−1}^{−1} θ̂_{n−1} + x_n y_n }

But,

P_n^{−1} = X_n^T X_n = X_{n−1}^T X_{n−1} + x_n x_n^T = P_{n−1}^{−1} + x_n x_n^T

Substituting, we get

θ̂_n = P_n { (P_n^{−1} − x_n x_n^T) θ̂_{n−1} + x_n y_n }

or

θ̂_n = θ̂_{n−1} − P_n x_n x_n^T θ̂_{n−1} + P_n x_n y_n


and thus

θ̂_n = θ̂_{n−1} + P_n x_n (y_n − x_n^T θ̂_{n−1})

To find a recursive update of Pn we have to use the “matrix inversion lemma”:

Lemma: For any three matrices A, B and C of compatible dimensions such that A and A+BC

are invertible,

(A + BC)−1 = A−1 − A−1B(I + CA−1B)−1CA−1

Proof: Check the identity by multiplying the RHS by A + BC:

(RHS)(A + BC) = I + A^{−1}BC − A^{−1}B(I + CA^{−1}B)^{−1}C − A^{−1}B(I + CA^{−1}B)^{−1}CA^{−1}BC
              = I + A^{−1}B(I + CA^{−1}B)^{−1}{ (I + CA^{−1}B) − I − CA^{−1}B }C = I

Applying this to:

P_n = (X_n^T X_n)^{−1} = (X_{n−1}^T X_{n−1} + x_n x_n^T)^{−1}

gives

P_n = P_{n−1} − P_{n−1} x_n (I + x_n^T P_{n−1} x_n)^{−1} x_n^T P_{n−1}

or

P_n = P_{n−1} − [ P_{n−1} x_n x_n^T P_{n−1} ] / [ 1 + x_n^T P_{n−1} x_n ]

on noting that x_n^T P_{n−1} x_n is a scalar. The updating formulae of recursive least squares can be summarised as:

θ̂_n = θ̂_{n−1} + P_n x_n (y_n − x_n^T θ̂_{n−1})

P_n = P_{n−1} − [ P_{n−1} x_n x_n^T P_{n−1} ] / [ 1 + x_n^T P_{n−1} x_n ]
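A direct implementation of these recursions (a sketch; the function rls, the initialisation P_0 = (1/ε)I_q discussed in Note 3 below, and the test data are illustrative choices), compared against the batch least-squares solution:

import numpy as np

def rls(xs, ys, q, theta0=None, eps=1e-4):
    # Recursive least squares with P_0 = (1/eps) I_q and initial estimate theta_0
    theta = np.zeros(q) if theta0 is None else np.array(theta0, dtype=float)
    P = np.eye(q) / eps
    for x, y in zip(xs, ys):
        x = np.asarray(x, dtype=float)
        Px = P @ x
        gain = Px / (1.0 + x @ Px)                  # gain = P_n x_n
        theta = theta + gain * (y - x @ theta)      # theta_n = theta_{n-1} + P_n x_n (y_n - x_n^T theta_{n-1})
        P = P - np.outer(gain, Px)                  # P_n = P_{n-1} - P_{n-1} x_n x_n^T P_{n-1} / (1 + x_n^T P_{n-1} x_n)
    return theta, P

# Check against batch least squares on made-up data
rng = np.random.default_rng(10)
n, q = 500, 3
X = rng.standard_normal((n, q))
theta_true = np.array([1.0, -0.5, 2.0])
y = X @ theta_true + 0.1 * rng.standard_normal(n)

theta_rls, _ = rls(X, y, q)
theta_batch = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_rls)
print(theta_batch)        # nearly identical for small eps (i.e. large P_0)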

Note 1: Computationally demanding matrix inversion is now completely avoided. Also, there

is no need to store the old data; at time t = n we receive new information from the data (xn, yn)

and we update the old estimates (Pn−1, θn−1) to produce the new ones (Pn, θn).

Note 2: The updating formula for θ̂_n is intuitively appealing: the “predicted” value of y_n given information up to time t = n − 1 is

ŷ_{n|n−1} = x_n^T θ̂_{n−1}

When new information arrives at time t = n in the form of a new measurement y_n, this is compared with the predicted value to generate the “prediction error”:

e_{n|n−1} = y_n − ŷ_{n|n−1} = y_n − x_n^T θ̂_{n−1}

The new estimate θ̂_n is then a correction of the previous estimate θ̂_{n−1} by an amount proportional to the prediction error.


Note 3: Modified recursive least-squares algorithm: A slight disadvantage of the LS

recursion is that (in its present form) it cannot be started at time t = 1, since the matrix Pn

is only defined when XTn Xn is non-singular, or equivalently when Xn has full column rank. A

necessary condition for this is that n ≥ q, i.e. that we have at least as many measurements as

the number of parameters to be estimated. A possible solution is to wait for at least n0 ≥ q

measurements until XTn0

Xn0 is invertible, calculate Pn0 and θn0 by matrix inversion (i.e. as

in batch LS), and proceed from then on recursively. An alternative approach is to start the

recursions at time t = 1 by choosing reasonable initial conditions θ0 and P0 (e.g. in the absence

of any information we can take θ_0 = 0 and P_0 = (1/ε) I_q, where ε is a small positive number). It

can be shown that this initialization corresponds to the minimisation of:

J_n(θ) = Σ_{i=1}^{n} (y_i − x_i^T θ)² + (1/2)(θ − θ_0)^T P_0^{−1} (θ − θ_0)

This performance index differs from the “true” LS cost by a fixed amount (1/2)(θ − θ_0)^T P_0^{−1} (θ − θ_0).

As n increases, the deviation from the true cost decreases in relative terms. For small n, Jn(θ)

can be made approximately equal to the LS cost by choosing P_0 sufficiently large, e.g. P_0 = (1/ε) I_q

for a small positive ε, as suggested previously. Note that from a previous result the covariance

of the estimated vector is σ2(XTn Xn)−1 = σ2Pn (assuming that En is white with variance σ2).

Thus, in statistical terms a “large” P0 reflects a high initial uncertainty around the initial

estimates θ0.
