LECTURE NOTES IN STOCHASTIC MODELS AND LEAST SQUARES ESTIMATION
MSc in Information Engineering
Dr George Halikias
EEIE, School of Engineering and Mathematical Sciences, City University
16 December 2006
1. Stochastic Models
1.1 Stochastic Processes (Revision)
A stochastic process is a sequence of random variables
$$\{x_t\} = \{\ldots, x_{-1}, x_0, x_1, x_2, \ldots\}$$
with joint probability distributions defined for all $t_1, t_2, \ldots, t_n$ and all $n$. This description requires too much information to be of practical interest. We typically use the following two statistical characteristics of $\{x_t\}$:
1. The mean:
$$m_t = E(x_t) = \int_{-\infty}^{\infty} \xi\, f_{x_t}(\xi)\, d\xi$$
where $f_{x_t}(\cdot)$ is the probability density function of $x_t$, and
2. The covariance function:
$$R(t,s) = E\{(x_t - m_t)(x_s - m_s)\} = E(x_t x_s) - m_t m_s = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f_{x_t x_s}(x,y)\, dx\, dy \;-\; m_t m_s$$
where $f_{x_t x_s}(\cdot,\cdot)$ is the joint density of $x_t$, $x_s$.
A stochastic process $\{x_t\}$ is wide-sense stationary iff: (i) $m_t = m$ (constant for all $t$), and (ii) $R(t,s) = R(t-s, 0) := R(t-s)$ for all $t$ and $s$. A wide-sense stationary process is ergodic iff:
$$\lim_{t\to\infty} \left\{ \frac{1}{2t+1} \sum_{k=-t}^{t} x_k \right\} = m$$
and
$$\lim_{t\to\infty} \left\{ \frac{1}{2t+1} \sum_{k=-t}^{t} x_k x_{k-n} \right\} = R(n) + m^2$$
i.e. if the “ensemble average” (freeze time index and take average over many different
random experiments) coincides with the “time average” (average over a single realisation of
the stochastic process).
A stochastic process of particular interest is the “white noise” sequence: {xt} is white iff (i)
$m_t = 0$ and (ii) $R(0)$ is a constant and $R(i) = 0$ for $i \neq 0$.
Given a stationary stochastic sequence $\{\eta_t\}$ and its covariance function $\{R(k)\}$, its spectral density function is defined as the (two-sided) Z-transform of $\{R(k)\}$, i.e.
$$\Phi_{\eta\eta}(z) = \sum_{k=-\infty}^{\infty} R(k) z^{-k}$$
The frequency content of $\{\eta_t\}$ may be obtained by evaluating $\Phi_{\eta\eta}(z)$ for $z = e^{j\omega}$, $|\omega| \le \pi$. Here $\omega$ is the normalised angular frequency measured in rad/sample and hence $\pi$ corresponds to one half the sampling frequency, i.e.
$$\Psi(\omega) = \Phi_{\eta\eta}(e^{j\omega}) = \sum_{k=-\infty}^{\infty} R(k) e^{-jk\omega}, \qquad |\omega| \le \pi$$
What is the spectral density function of a white noise sequence?
Ψ(ω) = R(0) = constant (independent of ω)
and hence a white noise sequence has “equal frequency content” at all frequencies.
1.2 Stochastic signals and models
We usually assume stationary, zero mean noise sequences {ηt}, t ∈ Z - note that non-zero means
can always be accommodated as a dc offset in the “deterministic” part of the model. This class
of models is too large for practical purposes. A simple widely-used class of models is generated
by the output of an LTI discrete-time system driven by white noise:
[Figure: white noise {et} drives the LTI discrete system L, whose output is {ηt}. Caption: Stochastic model]
Here {et}, t ∈ Z is a white noise sequence, i.e. uncorrelated with E(et) = 0 and variance
Var(et) = σ2. System L can have a difference equation or state-space description. The most
widely used model is:
• $A(z^{-1})\eta_t = B(z^{-1})e_t$, $t \in \mathbb{Z}$ (ARMA model)
where $A(z^{-1}) = 1 + a_1 z^{-1} + \ldots + a_n z^{-n}$ and $B(z^{-1}) = b_0 + b_1 z^{-1} + \ldots + b_m z^{-m}$, and its special cases:
• $\eta_t = B(z^{-1})e_t$, $t \in \mathbb{Z}$ (Moving average, MA, model)
The model name derives from the fact that $\eta_t$ may be written as a (weighted) moving average of present and past samples of the input white noise sequence, i.e. $\eta_t = b_0 e_t + b_1 e_{t-1} + \ldots + b_m e_{t-m}$.
• $A(z^{-1})\eta_t = e_t$, $t \in \mathbb{Z}$ (AutoRegressive, AR, model)
The model name derives from the fact that $\eta_t$ may be written in regression form with its past values (i.e. “with itself”), i.e.
$$\eta_t = \begin{pmatrix} \eta_{t-1} & \eta_{t-2} & \cdots & \eta_{t-n} \end{pmatrix} \begin{pmatrix} -a_1 \\ -a_2 \\ \vdots \\ -a_n \end{pmatrix} + e_t$$
Note we use z−1 both as a Z-domain variable and as a time-domain (unit delay) operator. We
have made the assumption that {ηt} must be stationary. We want to investigate the restrictions
that this assumption imposes on the coefficients of A(z−1) and B(z−1) which define the ARMA
model.
Example: Consider the first-order AR model ηt − aηt−1 = et, t ∈ Z where {et} is a white
noise sequence with variance Var(et) = σ2. Then
(1− az−1)ηt = et ⇒ ηt = (1− az−1)−1et
Thus
ηt = (1 + az−1 + a2z−2 + . . .)et = et + aet−1 + a2et−2 + . . .
So,
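$$\begin{aligned} \mathrm{Var}(\eta_t) = E(\eta_t^2) &= E[(e_t + a e_{t-1} + a^2 e_{t-2} + \ldots)^2] \\ &= E[e_t^2 + a^2 e_{t-1}^2 + a^4 e_{t-2}^2 + \ldots + \text{cross terms}] \\ &= \sigma^2 (1 + a^2 + a^4 + \ldots) = \frac{\sigma^2}{1 - a^2} \quad \text{for } |a| < 1 \end{aligned}$$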
Thus, if $|a| < 1$ then $\eta_t$ has finite variance; if $|a| \ge 1$ the variance of $\eta_t$ “blows up”. Hence in this case a necessary condition for $\eta_t$ to be stationary is that $|a| < 1$, i.e. that $z \to G(z^{-1}) = (1 - a z^{-1})^{-1}$ has all its poles inside the unit circle $\partial D = \{z : |z| = 1\}$.
Next consider the MA process
$$\eta_t = B(z^{-1}) e_t = b_0 e_t + b_1 e_{t-1} + \ldots + b_m e_{t-m}$$
($t \in \mathbb{Z}$). Does stationarity of $\eta_t$ impose any restriction on the $b_i$’s in this case? Since
$$\mathrm{Var}(\eta_t) = \sigma^2 (b_0^2 + b_1^2 + \ldots + b_m^2)$$
we always have finite variance for any collection of $b_i$’s.
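The variance formula for the first-order AR model can be checked numerically. The sketch below is only an illustration: the values of `a` and `sigma2` and the truncation length `n_terms` are assumptions chosen for the example, and the infinite sum over the unit-pulse response is replaced by a long finite sum.

```python
import numpy as np

a, sigma2 = 0.8, 2.0            # illustrative values with |a| < 1
n_terms = 200                   # truncation length of the infinite sum

# Unit-pulse response of 1 / (1 - a z^-1) is g_k = a^k, k = 0, 1, 2, ...
g = a ** np.arange(n_terms)

# Var(eta_t) = sigma^2 * sum_k g_k^2 (cross terms vanish because e_t is white)
var_truncated = sigma2 * np.sum(g ** 2)
var_closed_form = sigma2 / (1.0 - a ** 2)

print(var_truncated, var_closed_form)
```

For |a| close to 1 the truncated sum needs many more terms to approach the closed form, which reflects the variance growing without bound as |a| approaches 1.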
The results of this example can be generalised as follows:
Definition: The transfer function z → G(z−1) is: (i) Stable iff z → G(z−1) has all its poles
inside ∂D (open unit disk), (ii) Minimum-phase iff z → G(z−1) has all its zeros inside ∂D.
Theorem: If {et}, t ∈ Z is a white noise sequence with constant variance then ηt =
A−1(z−1)B(z−1)et defines a stationary process iff A−1(z−1)B(z−1) is stable. If A−1(z−1)B(z−1)
is stable and minimum phase, then {et} can be recovered from {ηt} as et = B−1(z−1)A(z−1)ηt,
i.e. system is “invertible”.
Note: The ARMA model introduced in this section formally defines a stochastic difference
equation. It is important to realize that here the underlying time-index set of this equation
is Z (all integers, both positive and negative). The above theorem establishes that when
we drive the LTI system by white noise, the output is stationary if the system is stable.
We can think of this in two ways: Perform multiple simulation experiments with random
realizations of {et} (starting in the infinite past). Then the statistical averages of the samples
of the resulting “output” sequences ηt will be the same, namely zero (i.e. the RV’s making up
the output stochastic sequence ηt will all have zero mean) and the statistical averages of all
covariance terms will satisfy the “time-invariance” property. Alternatively we can think of ηt as
a stochastic process formally defined by convolving the unit-pulse response of the LTI system
and et (involving an infinite sum) and studying the mean and covariance properties of the RV’s
making up ηt “in the limit”, i.e. as more and more terms are included in the sum (in the limit,
these are zero and time-invariant, respectively).
1.3 Spectral properties of ARMA models
Given a process {ηt} we define the covariance function as
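$$R(k) = E\{\eta_t \eta_{t-k}\}, \qquad k = 0, \pm 1, \pm 2, \ldots$$
and its spectral density function
$$\Phi_{\eta\eta}(z) = \sum_{k=-\infty}^{\infty} R(k) z^{-k}$$
Note that
$$\Psi(\omega) = \Phi_{\eta\eta}(e^{j\omega}) = \sum_{k=-\infty}^{\infty} R(k) e^{-j\omega k}$$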
is the Fourier transform of the covariance sequence {. . . , R(−1), R(0), R(1), . . .}. Note also that
due to stationarity,
R(−k) = E{ηtηt+k} = E{ηt−kηt} = R(k)
and this implies that:
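$$\Phi_{\eta\eta}(z) = R(0) + \sum_{k=1}^{\infty} R(k)\,(z^{-k} + z^{k})$$
so that:
$$\Psi(\omega) = \Phi_{\eta\eta}(e^{j\omega}) = R(0) + \sum_{k=1}^{\infty} R(k)\,(e^{j\omega k} + e^{-j\omega k}) = R(0) + 2\sum_{k=1}^{\infty} R(k) \cos(\omega k)$$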
which is a real function of ω (in fact non-negative).
Given an ARMA model driven by white noise what is the spectral density of the output?
Theorem: Suppose that G(z−1) = A−1(z−1)B(z−1) is a stable transfer function. Then ηt
defined by A(z−1)ηt = B(z−1)et, where {et}, t ∈ Z is a white noise sequence of unit variance,
has spectral density:
$$\Phi_{\eta\eta}(z) = \frac{B(z^{-1})}{A(z^{-1})} \cdot \frac{B(z)}{A(z)} = G(z^{-1})\,G(z)$$
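Proof: Let $G(z^{-1})$ have the series expansion
$$G(z^{-1}) = g_0 + g_1 z^{-1} + g_2 z^{-2} + \ldots = \sum_{i=0}^{\infty} g_i z^{-i}$$
where the $\{g_i\}$ are the Markov parameters of $G(z^{-1})$ (i.e. its unit pulse response, assumed causal). The series can be summed from $i = -\infty$ to $i = +\infty$ by defining $g_i = 0$ for $i < 0$. Now:
$$R(k) = E\{\eta_t \eta_{t-k}\} = E\{(g_0 e_t + \ldots + g_k e_{t-k} + g_{k+1} e_{t-k-1} + \ldots)(g_0 e_{t-k} + g_1 e_{t-k-1} + \ldots)\}$$
and hence
$$R(k) = g_0 g_k + g_1 g_{k+1} + \ldots = \sum_{i=0}^{\infty} g_i g_{i+k}$$
Hence $\Phi_{\eta\eta}(z)$ may be written as:
$$\Phi_{\eta\eta}(z) = \sum_{k=-\infty}^{\infty} R(k) z^{-k} = \sum_{k=-\infty}^{\infty} \left( \sum_{i=-\infty}^{\infty} g_i g_{i+k} \right) z^{-k} = \sum_{i=-\infty}^{\infty} g_i \sum_{k=-\infty}^{\infty} g_{i+k} z^{-k}$$
by interchanging the order of summation. Fixing $i$ and defining $j = i + k$ gives
$$\Phi_{\eta\eta}(z) = \sum_{i=-\infty}^{\infty} g_i \sum_{j=-\infty}^{\infty} g_j z^{-(j-i)} = \sum_{i=-\infty}^{\infty} g_i z^{i} \sum_{j=-\infty}^{\infty} g_j z^{-j} = G(z)\,G(z^{-1})$$
which completes the proof.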
Note: ARMA models give rise to rational spectral densities. In fact the converse is also true:
Any rational spectral density function may be generated by an ARMA model driven by white noise. (This is established by showing that any function of the form $\sum_{k=-\infty}^{\infty} R(k) z^{-k}$ with $R(-k) = R(k)$ can be spectrally factored in the form $G(z^{-1})G(z)$ where $G(z^{-1})$ is rational).
This result justifies the use of ARMA models: Most noise processes are adequately described by
their spectral densities and these can be approximated (to any degree of accuracy) by rational
functions.
Note that many ARMA models of the form $A(z^{-1})\eta_t = B(z^{-1})e_t$, $t \in \mathbb{Z}$ correspond to a single spectral density
$$\Phi_{\eta\eta}(z) = \frac{B(z^{-1})}{A(z^{-1})} \cdot \frac{B(z)}{A(z)}$$
If we impose the restriction, however, that
$$z \to G(z^{-1}) = \frac{B(z^{-1})}{A(z^{-1})}$$
is stable and minimum phase, the polynomials corresponding to $\Phi_{\eta\eta}(z)$ are uniquely determined (up to a sign).
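Example: Consider the following ARMA model:
$$\eta_t + 0.75\,\eta_{t-1} = e_t + 2 e_{t-1}, \qquad t \in \mathbb{Z} \qquad (1)$$
or
$$\eta_t = \frac{1 + 2z^{-1}}{1 + 0.75 z^{-1}}\, e_t := \frac{B(z^{-1})}{A(z^{-1})}\, e_t$$
Then:
$$\Phi_{\eta\eta}(z) = \frac{1 + 2z^{-1}}{1 + 0.75 z^{-1}} \cdot \frac{1 + 2z}{1 + 0.75 z} = \frac{4\,(0.5 + z^{-1})\,z \cdot z^{-1}\,(0.5 + z)}{(1 + 0.75 z^{-1})(1 + 0.75 z)}$$
which can be factored as:
$$\Phi_{\eta\eta}(z) = \frac{2(1 + 0.5 z)}{1 + 0.75 z} \cdot \frac{2(1 + 0.5 z^{-1})}{1 + 0.75 z^{-1}} := \frac{\tilde{B}(z)}{\tilde{A}(z)} \cdot \frac{\tilde{B}(z^{-1})}{\tilde{A}(z^{-1})}$$
and hence the process:
$$\tilde{A}(z^{-1})\eta_t = \tilde{B}(z^{-1}) e_t, \quad \text{i.e.} \quad \eta_t + 0.75\,\eta_{t-1} = 2 e_t + e_{t-1}$$
will have a spectral density identical to that of (1) and is both stable and minimum-phase.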
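The equality of the two spectral densities in this example can be confirmed on a frequency grid. The following is a minimal sketch; the helper name `spectral_density` and the grid size are assumptions introduced for illustration, and unit-variance white noise is assumed as in the theorem above.

```python
import numpy as np

def spectral_density(b, a, w):
    """Evaluate |B(e^{-jw}) / A(e^{-jw})|^2 for coefficients given in powers of z^-1."""
    z_inv = np.exp(-1j * w)
    # np.polyval expects the highest power first, so reverse the coefficient lists
    B = np.polyval(b[::-1], z_inv)
    A = np.polyval(a[::-1], z_inv)
    return np.abs(B / A) ** 2

w = np.linspace(-np.pi, np.pi, 1001)

# Model (1): eta_t + 0.75 eta_{t-1} = e_t + 2 e_{t-1}   (zero at z = -2)
psi_original = spectral_density([1.0, 2.0], [1.0, 0.75], w)
# Minimum-phase version: eta_t + 0.75 eta_{t-1} = 2 e_t + e_{t-1}   (zero at z = -0.5)
psi_min_phase = spectral_density([2.0, 1.0], [1.0, 0.75], w)

max_diff = np.max(np.abs(psi_original - psi_min_phase))
print(max_diff)
```

Both numerators have squared magnitude 5 + 4 cos(ω) on the unit circle, which is why reflecting the zero from z = −2 to z = −0.5 and scaling by 2 leaves the spectral density unchanged.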
A similar analysis can always be used to bring a zero lying outside ∂D inside the unit disc without affecting the spectral density function (however, if a zero lies on ∂D it cannot be moved). Since $\Phi_{\eta\eta}(z)$ may be factored as $G(z^{-1})G(z)$, if $z_0$ ($p_0$) is a zero (pole) of $\Phi_{\eta\eta}(z)$, then $z_0^{-1}$ ($p_0^{-1}$) is also a zero (pole) of $\Phi_{\eta\eta}(z)$. Thus $\Phi_{\eta\eta}(z)$ has a pole/zero pattern symmetric with respect to ∂D, as shown below:
[Figure: pole/zero pattern of Φηη(z) in the complex plane (axes Re(z), Im(z)), with the poles and zeros of the stable and minimum-phase spectral factor indicated. Caption: Spectral factorisation]
Clearly, many pole/zero patterns are possible for a spectral factor, but only one will correspond
to a stable/minimum-phase system.
1.4 Calculation of covariance functions for ARMA models
Given an ARMA model A(z−1)ηt = B(z−1)et where {et} is white with unit variance, we
sometimes need to obtain {R(k)} in closed form. There are several approaches for this.
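Example: Consider the first-order autoregression $\eta_t - a\eta_{t-1} = e_t$ with $|a| < 1$.
• Multiply by $\eta_{t-1}$ and take expectations:
$$E\{\eta_t \eta_{t-1} - a\eta_{t-1}^2\} = E\{\eta_{t-1} e_t\} = 0 \;\Rightarrow\; R(1) - aR(0) = 0 \qquad (2)$$
• Next multiply through by $\eta_t$ and take expectations:
$$E\{\eta_t^2 - a\eta_t \eta_{t-1}\} = E\{\eta_t e_t\} \qquad (3)$$
• Multiply through by $e_t$ and take expectations:
$$E\{\eta_t e_t - a\eta_{t-1} e_t\} = E\{e_t^2\} = 1 \;\Rightarrow\; E\{\eta_t e_t\} = 1 \qquad (4)$$
• From (2), (3) and (4):
$$R(0) - aR(1) = 1 \;\Rightarrow\; R(0) = \frac{1}{1 - a^2} \quad \text{and} \quad R(1) = \frac{a}{1 - a^2} \qquad (5)$$
• Finally multiply through by $\eta_{t-k}$ ($k = 1, 2, \ldots$) and take expectations:
$$E\{\eta_t \eta_{t-k} - a\eta_{t-1}\eta_{t-k}\} = E\{e_t \eta_{t-k}\} = 0 \;\Rightarrow\; R(k) = aR(k-1) \qquad (6)$$
Combining (5) and (6) gives:
$$R(k) = \frac{a^k}{1 - a^2}, \qquad k = 0, 1, 2, \ldots \qquad (7)$$
which gives the covariance function in closed form.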
This procedure of multiplying by past samples and taking expectations actually works in
general. The result is summarised in the following theorem (Yule-Walker equations):
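Theorem: Let $\{R(0), R(1), R(2), \ldots\}$ be the covariance function of the stable ARMA model $A(z^{-1})\eta_t = B(z^{-1})e_t$ driven by unit-variance white noise, where (in this theorem and its proof) $A(z^{-1}) = 1 + a_1 z^{-1} + \ldots + a_m z^{-m}$ with $a_0 := 1$, and $B(z^{-1}) = b_0 + b_1 z^{-1} + \ldots + b_n z^{-n}$, and suppose that $\{g_0, g_1, g_2, \ldots\}$ is the unit-pulse response of $A(z^{-1})^{-1}B(z^{-1})$. Then, for $l = 0, 1, 2, \ldots$:
$$\sum_{i=0}^{m} a_i R(l - i) = \begin{cases} \displaystyle\sum_{i=\max(0,l)}^{n} b_i\, g_{i-l}, & l \le n \\ 0, & l > n \end{cases}$$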
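Proof: The ARMA model may be written in algebraic form as:
$$\sum_{i=0}^{m} a_i \eta_{t-i} = \sum_{i=0}^{n} b_i e_{t-i}$$
Multiplying by $\eta_{t-l}$ ($l = 0, 1, 2, \ldots$) and taking expectations gives:
$$\sum_{i=0}^{m} a_i E(\eta_{t-l}\eta_{t-i}) = \sum_{i=0}^{n} b_i E(\eta_{t-l} e_{t-i})$$
or equivalently
$$\sum_{i=0}^{m} a_i R(l - i) = \sum_{i=0}^{n} b_i E(\eta_{t-l} e_{t-i}) \qquad (8)$$
Now, $\{g_0, g_1, g_2, \ldots\}$ are the Markov parameters of the model and hence
$$\eta_t = \sum_{j=0}^{\infty} g_j e_{t-j} \;\Rightarrow\; \eta_{t-l} = \sum_{j=0}^{\infty} g_j e_{t-l-j}$$
and hence
$$E\{\eta_{t-l} e_{t-i}\} = E\left\{ \sum_{j=0}^{\infty} g_j e_{t-l-j}\, e_{t-i} \right\}$$
The only non-zero terms in the expectation occur when $t - l - j = t - i$, i.e. when $j = i - l$. Hence:
$$E\{\eta_{t-l} e_{t-i}\} = \begin{cases} g_{i-l}, & i - l \ge 0 \\ 0, & i - l < 0 \end{cases}$$
Substituting into (8) gives:
$$\sum_{i=0}^{m} a_i R(l - i) = \begin{cases} \displaystyle\sum_{i=\max(0,l)}^{n} b_i\, g_{i-l}, & l \le n \\ 0, & l > n \end{cases}$$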
Note: The first m + 1 equations for l = 0, 1, 2, . . . ,m involve 2m + 1 unknowns
R(−m), . . . , R(0), . . . R(m) which may be solved on noting the relations R(−i) = R(i). Using
these values as initial data the higher order R(i)’s can be determined recursively using the
remaining Yule-Walker equations.
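The Yule-Walker relations can be checked numerically for a small model. The sketch below is an illustration under stated assumptions: it uses the model ηt + 0.75ηt−1 = 2et + et−1 with unit-variance white noise, computes R(k) from a truncated unit-pulse response via R(k) = Σ g_i g_{i+k}, and the helper names (`pulse_response`, `R`, `yw_lhs`, `yw_rhs`) are hypothetical.

```python
import numpy as np

def pulse_response(a, b, n_terms):
    """Unit-pulse response of B(z^-1)/A(z^-1); a[0] is assumed to be 1."""
    g = np.zeros(n_terms)
    for k in range(n_terms):
        acc = b[k] if k < len(b) else 0.0
        for i in range(1, min(k, len(a) - 1) + 1):
            acc -= a[i] * g[k - i]
        g[k] = acc
    return g

a = [1.0, 0.75]       # A(z^-1) = 1 + 0.75 z^-1
b = [2.0, 1.0]        # B(z^-1) = 2 + z^-1
g = pulse_response(a, b, 300)   # truncated; the tail decays like 0.75^k

def R(k):
    """Covariance R(k) = sum_i g_i g_{i+|k|} (unit-variance white noise)."""
    k = abs(k)
    return float(np.dot(g[: len(g) - k], g[k:]))

def yw_lhs(l):
    return sum(a[i] * R(l - i) for i in range(len(a)))

def yw_rhs(l):
    # Empty sum (l larger than the degree of B) evaluates to zero
    return sum(b[i] * g[i - l] for i in range(l, len(b)))

residuals = [yw_lhs(l) - yw_rhs(l) for l in range(5)]
print(residuals)
```

For l = 0 the right-hand side is b0 g0 + b1 g1 = 4 − 0.5 = 3.5, and for l larger than the degree of B it is zero, so the higher-order covariances follow the homogeneous recursion.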
A second method for determining the covariance function is via contour integration. Recall
that given ηt satisfying A(z−1)ηt = B(z−1)et the spectral density is given by
$$\Phi_{\eta\eta}(z) = \frac{B(z^{-1})}{A(z^{-1})} \cdot \frac{B(z)}{A(z)}$$
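Also, $\Psi(\omega) = \Phi_{\eta\eta}(e^{j\omega})$ is the Fourier transform of $\{R(0), R(\pm 1), R(\pm 2), \ldots\}$. Hence the $R(i)$’s can be obtained as the Fourier coefficients of $\Psi(\omega)$, i.e.
$$\begin{aligned} R(k) &= \frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_{\eta\eta}(e^{j\omega})\, e^{jk\omega}\, d\omega \qquad (k = 0, \pm 1, \pm 2, \ldots) \\ &= \frac{1}{2\pi j} \int_{-\pi}^{\pi} \Phi_{\eta\eta}(e^{j\omega})\, (e^{j\omega})^{k-1}\, d(e^{j\omega}) \\ &= \frac{1}{2\pi j} \oint_{\partial D} \Phi_{\eta\eta}(z)\, z^{k-1}\, dz \end{aligned}$$
This is a contour integral in the complex plane around the unit circle ∂D which may be calculated using Cauchy’s residue theorem. Thus, assuming that $z \to \Phi_{\eta\eta}(z) z^{k-1}$ has only simple poles inside ∂D (write them $z_1, z_2, \ldots, z_p$), then
$$R(k) = \sum_{i=1}^{p} \mathrm{Res}\left(\Phi_{\eta\eta}(z) z^{k-1}, z_i\right) = \sum_{i=1}^{p} \left\{ \lim_{z \to z_i} (z - z_i)\, \Phi_{\eta\eta}(z)\, z^{k-1} \right\}$$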
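Example: Consider the process $\eta_t - a\eta_{t-1} = e_t$, $t \in \mathbb{Z}$, $|a| < 1$. The spectral density function is:
$$\Phi_{\eta\eta}(z) = \frac{1}{1 - a z^{-1}} \cdot \frac{1}{1 - a z} = \frac{z}{(z - a)(1 - a z)}$$
Thus
$$R(k) = \frac{1}{2\pi j} \oint_{\partial D} \Phi_{\eta\eta}(z)\, z^{k-1}\, dz = \frac{1}{2\pi j} \oint_{\partial D} \frac{z^k}{(z - a)(1 - a z)}\, dz$$
Thus
$$R(k) = \sum_i \text{Residues inside } \partial D = \frac{a^k}{1 - a^2}, \qquad k = 0, 1, 2, \ldots$$
(as before) since the only pole inside ∂D is $z = a$ for all $k = 0, 1, 2, \ldots$.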
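The contour-integral formula can also be evaluated numerically by sampling the unit circle, which gives an independent check of the residue calculation. This is a sketch only: the value of `a`, the number of grid points `N` and the helper name `R_numeric` are assumptions made for the illustration.

```python
import numpy as np

a = 0.5                                   # illustrative value with |a| < 1
N = 4096                                  # number of points on the unit circle
w = -np.pi + 2.0 * np.pi * np.arange(N) / N
z = np.exp(1j * w)

# Spectral density of eta_t - a eta_{t-1} = e_t on the unit circle
phi = 1.0 / ((1.0 - a / z) * (1.0 - a * z))

def R_numeric(k):
    """Approximate (1/2pi) * integral of Phi(e^{jw}) e^{jkw} dw by a uniform-grid mean."""
    return float(np.real(np.mean(phi * z ** k)))

ks = np.arange(6)
R_num = np.array([R_numeric(k) for k in ks])
R_closed = a ** ks / (1.0 - a ** 2)
print(np.max(np.abs(R_num - R_closed)))
```

Averaging over a uniform grid is the rectangle rule applied to a smooth periodic integrand, so the approximation converges very quickly with N.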
1.5 Stochastic state-space models
The standard state-space stochastic model we will be using is of the form:
xk+1 = Axk + Bek (9)
yk = Cxk + Dek (10)
As usual, $x_k$ denotes the state vector (at time-index $k$), $y_k$ is the system output vector and $e_k$ the system input vector; we assume that the $e_k$’s are uncorrelated, zero-mean, and that $\mathrm{Cov}(e_k) = E(e_k e_k^T) = I$ for all $k$. We can give two different interpretations of this model: (i) The time set is $\mathbb{Z}_+$ (non-negative integers only); in this case we need to specify the (joint) statistics of the initial state vector $x_0$ and $e_0$ to properly define the processes $x_k$ and $y_k$; (ii) The time set is $\mathbb{Z}$ (all integers); in this case $x_k$ and $y_k$ are the outputs of the ARMA models $x_k = (zI - A)^{-1}B e_k$ and $y_k = [C(zI - A)^{-1}B + D] e_k$ in the sense discussed earlier.
Theorem: Suppose that in (9) the time index is $\mathbb{Z}_+$, the initial state $x_0$ has zero mean and covariance $P_0$, and that $x_0$ is uncorrelated with the $e_k$ for all $k$. Then $P(k) := \mathrm{Cov}(x_k)$ satisfies
$$P(k+1) = A P(k) A^T + B B^T, \qquad P(0) = P_0 \qquad (11)$$
and
$$\mathrm{Cov}(y_k, y_{k-j}) = \begin{cases} C P(k) C^T + D D^T, & j = 0 \\ C A^j P(k-j) C^T + C A^{j-1} B D^T, & j = 1, 2, \ldots, k \end{cases}$$
If $A$ is stable then $P(k) \to P$ as $k \to \infty$ where $P$ is the unique solution of the (discrete) Lyapunov matrix equation:
$$P = A P A^T + B B^T$$
Furthermore if $P_0 = P$ then $P(k) = P$ for all $k \ge 0$.
Next suppose that the time index is $\mathbb{Z}$ and that $A$ is stable. In this case $\{x_k\}$ and $\{y_k\}$ are wide-sense stationary processes, $E(x_k) = 0$, $\mathrm{Cov}(x_k) = P$ and
$$\mathrm{Cov}(y_k, y_{k-j}) = \begin{cases} C P C^T + D D^T, & j = 0 \\ C A^j P C^T + C A^{j-1} B D^T, & j > 0 \end{cases}$$
Here $P$ is once more the (unique) solution of the Lyapunov equation.
Proof: First note that $E(x_{k+1}) = A E(x_k)$ and since $E(x_0) = 0$, $x_k$ (and thus also $y_k$) is zero-mean for every $k \ge 0$. Next note that
$$\begin{aligned} E\{x_{k+1} x_{k+1}^T\} &= E\{(A x_k + B e_k)(A x_k + B e_k)^T\} \\ &= E\{(A x_k + B e_k)(x_k^T A^T + e_k^T B^T)\} \\ &= A E\{x_k x_k^T\} A^T + B E\{e_k e_k^T\} B^T \end{aligned}$$
since $x_k$ and $e_k$ are uncorrelated, and hence
$$P(k+1) = A P(k) A^T + B B^T$$
for all $k \ge 0$, with $P(0) = P_0$. Next write:
$$x_k = A x_{k-1} + B e_{k-1} = A(A x_{k-2} + B e_{k-2}) + B e_{k-1} = A^2 x_{k-2} + A B e_{k-2} + B e_{k-1}$$
and hence in general:
$$x_k = A^j x_{k-j} + A^{j-1} B e_{k-j} + \ldots + B e_{k-1}$$
for $1 \le j \le k$. Since $x_{k-j}$ is uncorrelated with $\{e_{k-j}, \ldots, e_{k-1}\}$ we have that:
$$E\{x_k x_{k-j}^T\} = E\{A^j x_{k-j} x_{k-j}^T + A^{j-1} B e_{k-j} x_{k-j}^T + \ldots + B e_{k-1} x_{k-j}^T\} = A^j P(k-j) \qquad (12)$$
Also,
$$E\{x_k e_{k-j}^T\} = E\{A^j x_{k-j} e_{k-j}^T + A^{j-1} B e_{k-j} e_{k-j}^T + \ldots + B e_{k-1} e_{k-j}^T\} = A^{j-1} B \qquad (13)$$
The covariance of the output signal is:
$$\mathrm{Cov}(y_k, y_k) = E\{y_k y_k^T\} = E\{(C x_k + D e_k)(x_k^T C^T + e_k^T D^T)\} = C P(k) C^T + D D^T$$
using again the fact that $x_k$ and $e_k$ are uncorrelated. Also, for $1 \le j \le k$,
$$\begin{aligned} \mathrm{Cov}(y_k, y_{k-j}) &= E\{y_k y_{k-j}^T\} = E\{(C x_k + D e_k)(x_{k-j}^T C^T + e_{k-j}^T D^T)\} \\ &= C E\{x_k x_{k-j}^T\} C^T + C E\{x_k e_{k-j}^T\} D^T + D E\{e_k x_{k-j}^T\} C^T + D E\{e_k e_{k-j}^T\} D^T \end{aligned}$$
The last two terms are equal to zero, and hence:
$$\mathrm{Cov}(y_k, y_{k-j}) = C A^j P(k-j) C^T + C A^{j-1} B D^T$$
using (12) and (13). Next consider the discrete Lyapunov matrix equation $P = A P A^T + B B^T$.
It is well known that if $A$ is stable, the solution to this equation is unique and is given by:
$$P = \sum_{k=0}^{\infty} A^k B B^T (A^k)^T$$
Now, by applying repeatedly the recursion (11) we get:
$$P(k) = A^k P_0 (A^k)^T + \sum_{i=0}^{k-1} A^i B B^T (A^i)^T$$
Taking limits $k \to \infty$ the first term converges to zero ($A$ is stable) and hence $P(k) \to P$ (regardless of the initial condition $P_0$). It is also clear that if $P_0 = P$ then $P(k) = P$ for all $k \ge 0$.
Finally consider the case when the time set is $\mathbb{Z}$. Since $A$ is stable, $x_k$ and $y_k$ can be expressed as the outputs of ARMA models and hence they are wide-sense stationary processes. The state-space model equations clearly show that $E(x_k) = 0$ and $E(y_k) = 0$. Let $E(x_k x_k^T) = \bar{P}$. Then from stationarity,
$$\bar{P} = E(x_{k+1} x_{k+1}^T) = E\{(A x_k + B e_k)(x_k^T A^T + e_k^T B^T)\} = A \bar{P} A^T + B B^T$$
using the fact that $x_k$ and $e_k$ are uncorrelated. Thus $\bar{P}$ satisfies the Lyapunov equation; uniqueness now implies that $\bar{P} = P$. Using a similar argument as before we can show that $E(x_k x_{k-j}^T) = A^j P$; using the output equations this establishes the expressions for $\mathrm{Cov}(y_k, y_{k-j})$.
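The convergence of the covariance recursion to the Lyapunov solution can be illustrated numerically. In the sketch below the matrices `A` and `B` are arbitrary illustrative choices (with `A` stable), and the Lyapunov equation is solved through the vectorised identity vec(A P Aᵀ) = (A ⊗ A) vec(P); this is one possible way to compute P, not necessarily the method intended in the notes.

```python
import numpy as np

# Illustrative stable system: eigenvalues of A are 0.5 and 0.3
A = np.array([[0.5, 0.2],
              [0.0, 0.3]])
B = np.array([[1.0],
              [0.5]])
Q = B @ B.T
n = A.shape[0]

# Solve P = A P A^T + Q via (I - A kron A) vec(P) = vec(Q)
P = np.linalg.solve(np.eye(n * n) - np.kron(A, A), Q.reshape(-1)).reshape(n, n)

# Run the recursion P(k+1) = A P(k) A^T + B B^T from P(0) = 0
P_k = np.zeros((n, n))
for _ in range(200):
    P_k = A @ P_k @ A.T + Q

print(np.max(np.abs(P_k - P)))
```

Starting the recursion from any other symmetric P(0) gives the same limit, since the term Aᵏ P₀ (Aᵏ)ᵀ decays to zero when A is stable.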
1.6 The overall model (Deterministic and Stochastic parts)
We can now combine a deterministic model (described by difference equations) with an ARMA
noise model to give a complete stochastic dynamic model:
[Figure: block diagram in which {ut} drives G(z−1), {et} drives H(z−1) to produce {ηt}, and the two outputs are summed to give {yt}. Caption: Stochastic model]
Here $\{u_t\}$ is a deterministic discrete-time input, $\{e_t\}$ is a zero-mean white noise input of variance $E(e_t^2) = \sigma^2$, $\eta_t$ is the (coloured) noise signal and $y_t$ is the overall output signal. $G(z^{-1})$ and $H(z^{-1})$ are assumed to be discrete LTI systems of the form:
$$G(z^{-1}) = \frac{b_0 + b_1 z^{-1} + \ldots + b_l z^{-l}}{1 + a_1 z^{-1} + \ldots + a_r z^{-r}} \quad \text{and} \quad H(z^{-1}) = \frac{d_0 + d_1 z^{-1} + \ldots + d_p z^{-p}}{1 + c_1 z^{-1} + \ldots + c_q z^{-q}}$$
i.e. $y_t = G(z^{-1})u_t + H(z^{-1})e_t$. This will be the standard setting of the identification problem (next chapter): In general we are given a finite record of input-output data $(u_t, y_t)$ (the output data being corrupted by a noise component $\{\eta_t\}$) and we are required to estimate the parameters of the two processes, i.e. the coefficients of $G(z^{-1})$ and $H(z^{-1})$ (and possibly their degrees). Noise intensity $\sigma^2$ is typically unknown.
Note: Inclusion of pure delays $z^{-m}$ in $G(z^{-1})$ may be necessary, but pure delays in $H(z^{-1})$ can always be removed, e.g. if
$$H(z^{-1}) = \frac{z^{-2}}{1 + 0.5 z^{-1}}; \qquad \eta_t = H(z^{-1}) e_t$$
redefine $\tilde{e}_t = z^{-2} e_t = e_{t-2}$; then the noise process is described identically as
$$\tilde{H}(z^{-1}) = \frac{1}{1 + 0.5 z^{-1}}; \qquad \eta_t = \tilde{H}(z^{-1}) \tilde{e}_t$$
since $\{e_t\}$ and $\{\tilde{e}_t\}$ have identical statistical properties.
Note: We can express the model equations as
$$A(z^{-1}) y_t = B(z^{-1}) u_t + C(z^{-1}) e_t$$
where $A(z^{-1})$ is the least common multiple of the denominator polynomials of $G(z^{-1})$ and $H(z^{-1})$. The explicit dependence of this model on the deterministic sequence $\{u_t\}$ gives it the name ARMAX (AutoRegressive Moving Average with eXogenous input).
1.7 Stochastic process modelling
Suppose we have a long record $\{\eta_t\}$ of noise data and we wish to find its stochastic description (i.e. model it as the output of an ARMA model driven by white noise of unit variance). We can take the following steps:
1. Remove any significant mean/trend components. For example, we can remove the mean of $\eta_t$ and redefine $\tilde{\eta}_t = \eta_t - \bar{\eta}$ where $\bar{\eta} = \frac{1}{N}\sum_{t=1}^{N} \eta_t$ is the estimated mean. $\bar{\eta}$ can be modelled as a deterministic step disturbance.
2. Obtain estimates of the covariance function from the data (again assuming ergodicity) as:
$$\hat{R}(k) = \frac{1}{N - k} \sum_{t=k+1}^{N} \eta_t \eta_{t-k}$$
for $k = 0, 1, \ldots, m$, with $m \approx N/4$.
3. Use the discrete Fourier transform (FFT) to obtain an estimate of the spectral density:
$$\hat{\Psi}(\omega) = \hat{\Phi}_{\eta\eta}(e^{j\omega}) = \sum_{k=-m}^{m} \hat{R}(k) e^{-j\omega k}$$
in the range $0 \le \omega \le \pi$. (Note that $\omega = \pi$ corresponds to the Nyquist frequency, i.e. half the sampling frequency; in unnormalised units this is $\pi/T$ rad/s, where $T$ is the sampling period of the signal).
4. Design a stable and minimum-phase filter
$$G(z^{-1}) = \frac{B(z^{-1})}{A(z^{-1})}$$
so that its squared frequency-response magnitude $|G(e^{j\omega})|^2$ “fits” $\hat{\Psi}(\omega)$. Note:
$$\Psi(\omega) = \Phi_{\eta\eta}(e^{j\omega}) = G(e^{-j\omega})\, G(e^{j\omega}) = \overline{G(e^{j\omega})}\, G(e^{j\omega}) = |G(e^{j\omega})|^2$$
5. The estimated model is shown in the figure below.
[Figure: {et} drives B(z−1)/A(z−1); the filter output is added at a summing junction to a second input (labelled E(et) in the original figure) to give {ηt}. Caption: Estimated model of stochastic signal]
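Steps 2 and 3 of the procedure above can be sketched in a few lines. The function names `sample_covariance` and `spectral_estimate` are hypothetical, the spectral estimate is evaluated by a direct cosine sum rather than an FFT, and the simulated AR(1) record is only an assumed test signal.

```python
import numpy as np

def sample_covariance(eta, m):
    """R_hat(k) = 1/(N-k) * sum_{t=k+1}^{N} eta_t eta_{t-k}, for k = 0..m."""
    eta = np.asarray(eta, dtype=float)
    N = len(eta)
    return np.array([np.dot(eta[k:], eta[: N - k]) / (N - k) for k in range(m + 1)])

def spectral_estimate(R_hat, w):
    """Psi_hat(w) = R_hat(0) + 2 * sum_{k=1}^{m} R_hat(k) cos(w k), using R(-k) = R(k)."""
    k = np.arange(1, len(R_hat))
    return R_hat[0] + 2.0 * np.cos(np.outer(w, k)) @ R_hat[1:]

# Assumed test record: AR(1) noise eta_t = 0.6 eta_{t-1} + e_t
rng = np.random.default_rng(0)
N = 4000
e = rng.standard_normal(N)
eta = np.zeros(N)
for t in range(1, N):
    eta[t] = 0.6 * eta[t - 1] + e[t]
eta = eta - eta.mean()                     # step 1: remove the estimated mean

R_hat = sample_covariance(eta, m=20)       # step 2 (small m chosen for brevity)
w = np.linspace(0.0, np.pi, 256)
psi_hat = spectral_estimate(R_hat, w)      # step 3
print(R_hat[:3])
```

A raw truncated estimate of this kind can be quite noisy, and nothing guarantees it is non-negative at every frequency, so in practice some smoothing or windowing of the covariance estimates is often applied before fitting a filter in step 4.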
2. Systems Identification
2.1 The general identification problem
The general identification problem is to determine the parameters of a mathematical model
from the response of a physical system:
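[Figure: the input u is applied to both the Physical System (output y) and the Model M(θ) (output ym); the error e = y − ym is formed at a summing junction. Caption: Systems Identification problem]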
The physical system is excited by an input u and produces an output y. The model is
parametrised by a vector-parameter θ which is allowed to vary over a suitable set Θ. When the
same input u is applied to the model, the simulated response is ym. This is compared with the
measured output y to produce an error e = y − ym. The problem is to choose θ ∈ Θ so that a
suitably chosen measure of the error “size” is minimised over all θ ∈ Θ, so that ym is “as close
as possible” to y.
2.2 Linear least-squares estimation
Postulate a model of the form:
y = Xθ + e
where:
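$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}, \quad \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_q \end{pmatrix} \quad \text{and} \quad X \in \mathbb{R}^{n \times q}$$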
Here y and X are known, θ is the vector of parameters to be estimated (unknown) and e is a
noise/disturbance vector (unknown).
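Example: Consider a linear system which consists of a simple unknown gain $\theta$ and whose output is corrupted by a noise sequence $e_t$ (see figure). In this case the model equations may be written as:
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix} \theta + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$
or as $y = X\theta + e$.
[Figure: the input {ut} passes through the unknown gain θ and the noise {et} is added to give {yt}. Caption: Estimation of unknown gain]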
2.3 Least squares estimate
In least-squares estimation we choose $\theta$ to minimise the sum of squares of the “residuals” (errors), i.e. we minimise
$$J(\theta) = \sum_{i=1}^{n} (y_i - x_i^T \theta)^2 = (y - X\theta)^T (y - X\theta)$$
Here $x_i^T$ is the $i$-th row of matrix $X$ (“regression vectors”). To find the minimising $\theta$, first expand $J(\theta)$:
$$J(\theta) = (y^T - \theta^T X^T)(y - X\theta) = y^T y - 2\theta^T X^T y + \theta^T X^T X \theta$$
Setting the derivative to zero:
$$\frac{\partial J(\theta)}{\partial \theta} = -2 X^T y + 2 X^T X \theta = 0 \;\Rightarrow\; X^T X \theta = X^T y \qquad \text{(Normal equations)}$$
Note that $X^T X$ is a square ($q \times q$) matrix and therefore the normal equations can be solved uniquely if $X^T X$ is non-singular, or equivalently if $X$ is of full column rank. In this case the normal equations have the unique solution:
$$\hat{\theta} = (X^T X)^{-1} X^T y$$
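As a concrete illustration, the unknown-gain example can be solved both through the normal equations and with a library least-squares routine. This is a sketch under assumed data: the true gain, noise level, sample size and random seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
theta_true = 2.5                                  # assumed "true" gain
u = rng.standard_normal(n)
X = u.reshape(-1, 1)                              # n x 1 regression matrix
y = theta_true * u + 0.1 * rng.standard_normal(n) # noisy output

# Solve the normal equations X^T X theta = X^T y directly
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate from NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal, theta_lstsq)
```

For well-conditioned problems the two agree to rounding error; when XᵀX is nearly singular, a solver that works on X directly is generally the numerically safer choice.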
2.4 Geometric interpretation of least squares
Definition: For a real n× q matrix X the range of X is the set
Range(X) = {z ∈ Rn : z = Xθ, θ ∈ Rq}
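which is a subspace of $\mathbb{R}^n$. The dimension of Range($X$) is called the rank of $X$. If $X = [x_1\; x_2\; \ldots\; x_q]$, where $x_i$ here denotes the $i$-th column of $X$, then Range($X$) is the linear span of the columns of $X$, i.e. Range($X$) $= \{\sum_{i=1}^{q} \theta_i x_i : \theta_i \in \mathbb{R}\}$.
The normal equations for the least-squares solution $\hat{\theta}$ can be written as:
$$X^T X \hat{\theta} - X^T y = 0 \;\Rightarrow\; X^T (X\hat{\theta} - y) = 0 \;\Rightarrow\; x_i^T (X\hat{\theta} - y) = 0 \quad \text{for all } i = 1, 2, \ldots, q$$
This means that $(y - X\hat{\theta}) \perp x_i$ for all $i$, i.e. $(y - X\hat{\theta}) \perp$ Range($X$). Thus $X\hat{\theta}$ is the closest point in Range($X$) to $y$, i.e. $X\hat{\theta}$ is the orthogonal projection of $y$ onto Range($X$).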
[Figure: Least squares estimation]
In general, with every $X \in \mathbb{R}^{n \times q}$ (of full column rank) we can define two projection operators:
• $P_1 := X(X^T X)^{-1} X^T$ (projection onto Range($X$)).
• $P_2 := I_n - P_1 = I_n - X(X^T X)^{-1} X^T$ (projection onto the orthogonal complement of Range($X$)).
[Check this: for every $y \in$ Range($X$), i.e. $y = X\theta$ for some $\theta$, $P_1 y = X(X^T X)^{-1} X^T X\theta = X\theta = y$ and $P_2 y = (I_n - P_1) y = y - P_1 y = 0$]. Thus, if $\hat{\theta}$ is the least-squares estimate, $X\hat{\theta} = X(X^T X)^{-1} X^T y = P_1 y$ (projection of $y$ onto Range($X$)) and $\hat{e} = y - X\hat{\theta} = (I_n - P_1) y = P_2 y$ (optimal residuals orthogonal to Range($X$)).
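These projection properties are easy to verify numerically. The sketch below uses a randomly generated `X` and `y` purely as assumed test data; the dimensions and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 6, 2
X = rng.standard_normal((n, q))          # full column rank with probability one
y = rng.standard_normal(n)

P1 = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto Range(X)
P2 = np.eye(n) - P1                      # projection onto its orthogonal complement

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ theta_hat

print(np.max(np.abs(X.T @ residual)))    # residuals are orthogonal to the columns of X
```

The printed value should be at rounding-error level, reflecting the orthogonality condition expressed by the normal equations.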
2.5 Statistical properties of least squares
Now assume that there is a “true” parameter $\theta$, treat $e$ as a vector of random variables and examine the statistical properties of $\hat{\theta}$. Assume that:
1. $E(e) = 0$.
2. $E(ee^T) = \sigma^2 I$
i.e. the $e_i$’s are zero mean, uncorrelated with common variance $\sigma^2$.
Note that since $y = X\theta + e$, the $y_i$’s are random variables, and hence any linear combination of these, $b^T y$ say, is also a random variable. Thus $\hat{\theta} = (X^T X)^{-1} X^T y$ is a vector of random variables and as such it will have a probability distribution. (Think of this as follows: Perform a sequence of random experiments, each time with a different realisation of the random vector $e$ (of course drawn from the same distribution!) and record in a histogram the proportion of experiments in which the estimate falls in each interval of the histogram. In the limit (as the number of experiments increases and the histogram partitioning becomes finer and finer) we will have obtained the statistical distribution of $\hat{\theta}$).
When is $\hat{\theta}$ a “good” estimate? It is natural to say that $\hat{\theta}$ is “good” if the following two conditions are satisfied:
• The statistical mean of $\hat{\theta}$ coincides with the true parameter $\theta$, i.e. if the estimates’ average over noise realisations $e$ (in the limit) is equal to $\theta$.
• The variance of $\hat{\theta}$ is small, i.e. the statistical spread of the estimates (for different noise realisations) around the mean is small.
It is shown next that under the assumptions that: (i) $X$ is a constant (deterministic) matrix, and (ii) the $e_i$’s are zero-mean, uncorrelated with common variance $\sigma^2$, then:
• $\hat{\theta}$ is an unbiased estimate of $\theta$, i.e. $E(\hat{\theta}) = \theta$.
• $\hat{\theta}$ has the smallest possible variance among all linear unbiased estimates of $\theta$.
We start with the unbiasedness property:
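Theorem: $\hat{\theta}$ is unbiased.
Proof:
$$E(\hat{\theta}) = E\{(X^T X)^{-1} X^T (X\theta + e)\} = E\{\theta + (X^T X)^{-1} X^T e\}$$
Since $\theta$ and $X$ are deterministic,
$$E(\hat{\theta}) = \theta + (X^T X)^{-1} X^T E(e) = \theta$$
since $e$ is zero-mean.
Next we calculate the covariance of $\hat{\theta}$:
Theorem: $\mathrm{Cov}(\hat{\theta}) = \sigma^2 (X^T X)^{-1}$.
Proof:
$$\hat{\theta} - \theta = (X^T X)^{-1} X^T (X\theta + e) - \theta = \theta + (X^T X)^{-1} X^T e - \theta = (X^T X)^{-1} X^T e$$
Thus:
$$\begin{aligned} \mathrm{Cov}(\hat{\theta}) &= E\{(\hat{\theta} - \theta)(\hat{\theta} - \theta)^T\} && \text{(since } E(\hat{\theta}) = \theta\text{)} \\ &= E\{(X^T X)^{-1} X^T e e^T X (X^T X)^{-1}\} && \text{(first line of proof!)} \\ &= (X^T X)^{-1} X^T E(ee^T) X (X^T X)^{-1} && \text{(since } X \text{ is deterministic)} \\ &= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} && \text{(since } E(ee^T) = \sigma^2 I_n\text{)} \\ &= \sigma^2 (X^T X)^{-1} \end{aligned}$$
as required.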
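Theorem: $\hat{\theta}$ is BLUE (‘Best Linear Unbiased Estimate’), i.e. given any estimate $\beta$ of the form $\beta = By$ (linear) such that $E(By) = \theta$ for all $\theta$, then $\mathrm{Cov}(\hat{\theta}) \le \mathrm{Cov}(\beta)$.
Proof: Let $\beta = By$ be any linear unbiased estimate, then:
$$E\{By\} = E\{B(X\theta + e)\} = BX\theta = \theta \quad \text{for all } \theta$$
and so $BX = I$. Also
$$By = B(X\theta + e) = BX\theta + Be = \theta + Be$$
so that
$$By - \theta = Be = \text{deviation of } By \text{ from its mean}$$
Hence,
$$\mathrm{Cov}(\beta) = \mathrm{Cov}(By) = E\{(By - \theta)(By - \theta)^T\} = E\{B e e^T B^T\} = B E\{ee^T\} B^T = \sigma^2 B B^T$$
Now let $B = (X^T X)^{-1} X^T + \tilde{B}$. Since $BX = I$, we have
$$(X^T X)^{-1} X^T X + \tilde{B} X = I \;\Rightarrow\; \tilde{B} X = 0$$
Hence
$$\begin{aligned} \mathrm{Cov}(\beta) = \mathrm{Cov}(By) = \sigma^2 B B^T &= \sigma^2 [(X^T X)^{-1} X^T + \tilde{B}][X (X^T X)^{-1} + \tilde{B}^T] \\ &= \sigma^2 (X^T X)^{-1} + \sigma^2 \tilde{B}\tilde{B}^T \\ &= \mathrm{Cov}(\hat{\theta}) + \sigma^2 \tilde{B}\tilde{B}^T \\ &\ge \mathrm{Cov}(\hat{\theta}) \end{aligned}$$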
2.6 Estimation of noise variance
Consider the model
$$y = X\theta + e$$
in which $e$ is an $n$-vector of zero mean, uncorrelated random variables with common variance $\sigma^2$ and $\theta$ is a $q$-vector. We still assume that $X$ is deterministic and that its columns are linearly independent. The parameter vector $\theta$ is constant (but unknown). Assume also that the noise variance $\sigma^2$ is unknown. A natural estimate of the variance $\sigma^2$ is
$$\frac{1}{n} \sum_{i=1}^{n} e_i^2 = \frac{1}{n} (y - X\theta)^T (y - X\theta)$$
Since $\theta$ is unknown, however, we may try $\hat{\theta}$ instead. It turns out that the estimate:
$$\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n} (y - X\hat{\theta})^T (y - X\hat{\theta})$$
(which is the “maximum-likelihood” estimate of $\sigma^2$) is biased. An unbiased estimate of $\sigma^2$ is given in the next theorem. Before presenting this theorem and its proof some background material on the trace of a matrix is included.
The trace of a matrix: For a square ($n \times n$, say) matrix $S$, its trace is defined as the sum of its diagonal elements, i.e.
$$\mathrm{trace}(S) = \sum_{i=1}^{n} S_{ii}$$
Trace(·) is a linear operator: trace($A + B$) = trace($A$) + trace($B$), and trace($aA$) = $a\,$trace($A$) for every scalar $a$. Another useful property is that trace($AB$) = trace($BA$) whenever the products $AB$ and $BA$ are both defined. Check this: If $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$ (so that both products are defined), then:
$$(AB)_{ii} = \sum_{k} A_{ik} B_{ki}$$
and hence
$$\mathrm{trace}(AB) = \sum_{i} (AB)_{ii} = \sum_{i} \sum_{k} A_{ik} B_{ki} = \sum_{k} \sum_{i} B_{ki} A_{ik} = \sum_{k} (BA)_{kk} = \mathrm{trace}(BA)$$
Another useful property of trace (not used here) is that
$$\mathrm{trace}(S) = \sum_{i=1}^{n} \lambda_i(S)$$
where $\lambda_i(S)$ are the eigenvalues of $S$.
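Theorem:
$$\hat{\sigma}^2 = \frac{1}{n - q} (y - X\hat{\theta})^T (y - X\hat{\theta})$$
is an unbiased estimate of $\sigma^2$.
Proof: First note that:
$$\begin{aligned} (y - X\hat{\theta})^T (y - X\hat{\theta}) &= (y - X(X^T X)^{-1} X^T y)^T (y - X(X^T X)^{-1} X^T y) \\ &= y^T (I - X(X^T X)^{-1} X^T)(I - X(X^T X)^{-1} X^T) y \\ &= y^T P_2^2\, y && (P_2 \text{ is the projection onto Range}(X)^{\perp}) \\ &= y^T (I - X(X^T X)^{-1} X^T) y && (P_2^2 = P_2\text{: one projection is enough, or expand}) \\ &= \mathrm{trace}\{(I - X(X^T X)^{-1} X^T)\, y y^T\} && (\mathrm{trace}(AB) = \mathrm{trace}(BA)) \\ &= \mathrm{trace}\{(I - X(X^T X)^{-1} X^T)(X\theta + e)(\theta^T X^T + e^T)\} \end{aligned}$$
Hence
$$\begin{aligned} E\{(y - X\hat{\theta})^T (y - X\hat{\theta})\} &= E\{\mathrm{trace}(P_2 (X\theta + e)(\theta^T X^T + e^T))\} \\ &= \mathrm{trace}(P_2 (X\theta\theta^T X^T + \sigma^2 I_n)) && (E(\cdot) \text{ and trace}(\cdot) \text{ commute; } E(e) = 0) \\ &= \sigma^2 \{\mathrm{trace}(I_n) - \mathrm{trace}(X(X^T X)^{-1} X^T)\} && (\text{since } P_2 X = 0) \\ &= \sigma^2 \{n - \mathrm{trace}((X^T X)^{-1} X^T X)\} \\ &= \sigma^2 \{n - \mathrm{trace}(I_q)\} \\ &= \sigma^2 (n - q) \end{aligned}$$
Thus
$$\sigma^2 = \frac{1}{n - q} E\{(y - X\hat{\theta})^T (y - X\hat{\theta})\} = E(\hat{\sigma}^2)$$
and hence $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$.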
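A small Monte Carlo experiment illustrates the difference between dividing the residual sum of squares by n − q and by n. The sketch assumes Gaussian noise and arbitrary illustrative values for `n`, `q`, `sigma2`, `theta` and the number of trials; only the averages over many noise realisations are meaningful.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, sigma2, trials = 10, 3, 1.0, 20000
X = rng.standard_normal((n, q))           # fixed (deterministic) regression matrix
theta = np.array([1.0, -2.0, 0.5])        # assumed true parameter vector

P2 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
trace_P2 = np.trace(P2)                   # equals n - q

# Each row of Y is one realisation y = X theta + e
E = np.sqrt(sigma2) * rng.standard_normal((trials, n))
Y = X @ theta + E
residuals = Y @ P2                        # P2 is symmetric, so each row is (P2 y)^T
rss = np.sum(residuals ** 2, axis=1)      # (y - X theta_hat)^T (y - X theta_hat)

mean_unbiased = rss.mean() / (n - q)      # average of the 1/(n-q) estimate
mean_ml = rss.mean() / n                  # average of the 1/n estimate
print(trace_P2, mean_unbiased, mean_ml)
```

The 1/n estimate averages to roughly σ²(n − q)/n, which for small n is noticeably below σ².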
2.7 Least squares estimation for dynamic systems
Consider the ARMA model
A(z−1)yk = B(z−1)uk + ek
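in which
$$A(z^{-1}) = 1 + a_1 z^{-1} + a_2 z^{-2} + \ldots + a_n z^{-n}$$
and
$$B(z^{-1}) = b_0 + b_1 z^{-1} + b_2 z^{-2} + \ldots + b_m z^{-m}$$
We require to estimate the parameters:
$$\theta = [a_1\; a_2\; \ldots\; a_n \mid b_0\; \ldots\; b_m]^T$$
from input-output measurements $(u_i, y_i)$, $i = 1, 2, \ldots, N$. The model equations may be written as:
$$y_k = [-y_{k-1}\;\; -y_{k-2}\;\; \ldots\;\; -y_{k-n} \mid u_k\;\; \ldots\;\; u_{k-m}]\,\theta + e_k$$
for $k = 1, 2, \ldots, N$, or, in matrix form:
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} = \left( \begin{array}{cccc|ccc} -y_0 & -y_{-1} & \cdots & -y_{1-n} & u_1 & \cdots & u_{1-m} \\ -y_1 & -y_0 & \cdots & -y_{2-n} & u_2 & \cdots & u_{2-m} \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ -y_{N-1} & -y_{N-2} & \cdots & -y_{N-n} & u_N & \cdots & u_{N-m} \end{array} \right) \theta + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{pmatrix}$$
This is now in the form $y = X\theta + e$ and the parameter vector $\theta$ may be estimated by least squares.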
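The construction of the regression matrix can be sketched as follows. The helper `arx_regression_matrix`, the second-order test system and the zero initial conditions are all assumptions made for illustration; the data are generated without noise so that least squares should recover the parameters exactly.

```python
import numpy as np

def arx_regression_matrix(y, u, n, m):
    """Rows [-y[k-1] ... -y[k-n] | u[k] ... u[k-m]]; samples before the record are zero."""
    N = len(y)
    X = np.zeros((N, n + m + 1))
    for k in range(N):
        for i in range(1, n + 1):
            if k - i >= 0:
                X[k, i - 1] = -y[k - i]
        for j in range(m + 1):
            if k - j >= 0:
                X[k, n + j] = u[k - j]
    return X

# Hypothetical system: A = 1 - 0.5 z^-1 + 0.06 z^-2, B = 1 + 0.5 z^-1
theta_true = np.array([-0.5, 0.06, 1.0, 0.5])     # [a1, a2, b0, b1]
n, m, N = 2, 1, 200
rng = np.random.default_rng(0)
u = rng.standard_normal(N)

# Simulate the noise-free difference equation from zero initial conditions
y = np.zeros(N)
for k in range(N):
    y_1 = y[k - 1] if k >= 1 else 0.0
    y_2 = y[k - 2] if k >= 2 else 0.0
    u_1 = u[k - 1] if k >= 1 else 0.0
    y[k] = (-theta_true[0] * y_1 - theta_true[1] * y_2
            + theta_true[2] * u[k] + theta_true[3] * u_1)

X = arx_regression_matrix(y, u, n, m)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)
```

With noise added to `y` the estimate is no longer exact, and its statistical behaviour is the subject of the notes that follow.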
Note 1: The ‘regression matrix’ $X$ is partitioned as $X = [X_1 \mid X_2]$, in which $X_1$ and $X_2$ have a special (“Toeplitz”) structure, i.e. entries along each diagonal are identical.
Note 2: The upper triangular parts of $X_1$ and $X_2$ contain data $y_i$ and $u_i$, $i \le 0$, which represent initial conditions. If these data are unavailable we use 0’s to fill these positions.
Note 3: In this case $X$ is a random matrix (since $y$ is random and correlated with $e$), so standard LS theory does not apply. In particular:
• For any finite $N$, $\hat{\theta}$ is biased.
• $\hat{\theta}$ is asymptotically unbiased (i.e. $\lim_{N\to\infty} E(\hat{\theta}_N) = \theta$ for all $\theta$) if the $e_k$’s are uncorrelated and the $u_k$’s are “persistently exciting” (see next).
• $\hat{\theta}$ is biased even asymptotically if the $e_k$’s are not white.
Asymptotic convergence in the case of white/correlated noise is examined in the following examples:
Example 1: (white noise)
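Consider a second-order autoregressive model for simplicity:
$$y_k = \begin{pmatrix} y_{k-1} & y_{k-2} \end{pmatrix} \begin{pmatrix} -a_1 \\ -a_2 \end{pmatrix} + e_k, \qquad k = 1, 2, \ldots, n$$
In this case,
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} y_0 & y_{-1} \\ y_1 & y_0 \\ \vdots & \vdots \\ y_{n-1} & y_{n-2} \end{pmatrix} \begin{pmatrix} -a_1 \\ -a_2 \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$
and hence the least-squares estimate is given by:
$$\hat{\theta} = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T (X\theta + e) = \theta + (X^T X)^{-1} X^T e$$
Now,
$$X^T X = \begin{pmatrix} y_0 & y_1 & \cdots & y_{n-1} \\ y_{-1} & y_0 & \cdots & y_{n-2} \end{pmatrix} \begin{pmatrix} y_0 & y_{-1} \\ y_1 & y_0 \\ \vdots & \vdots \\ y_{n-1} & y_{n-2} \end{pmatrix} = \begin{pmatrix} \sum_{k=1}^{n} y_{k-1}^2 & \sum_{k=1}^{n} y_{k-1} y_{k-2} \\ \sum_{k=1}^{n} y_{k-1} y_{k-2} & \sum_{k=1}^{n} y_{k-2}^2 \end{pmatrix}$$
or
$$\frac{1}{n} X^T X = \begin{pmatrix} \frac{1}{n}\sum_{k=1}^{n} y_{k-1}^2 & \frac{1}{n}\sum_{k=1}^{n} y_{k-1} y_{k-2} \\ \frac{1}{n}\sum_{k=1}^{n} y_{k-1} y_{k-2} & \frac{1}{n}\sum_{k=1}^{n} y_{k-2}^2 \end{pmatrix}$$
Under ergodicity assumptions, the “sample covariances” converge to the “true covariances” and hence:
$$\lim_{n\to\infty} \left\{ \frac{1}{n} X^T X \right\} = \begin{pmatrix} R_{yy}(0) & R_{yy}(1) \\ R_{yy}(1) & R_{yy}(0) \end{pmatrix}$$
Also,
$$\frac{1}{n} X^T e = \frac{1}{n} \begin{pmatrix} y_0 & y_1 & \cdots & y_{n-1} \\ y_{-1} & y_0 & \cdots & y_{n-2} \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} = \begin{pmatrix} \frac{1}{n}\sum_{k=1}^{n} y_{k-1} e_k \\ \frac{1}{n}\sum_{k=1}^{n} y_{k-2} e_k \end{pmatrix}$$
so that:
$$\lim_{n\to\infty} \left\{ \frac{1}{n} X^T e \right\} = \begin{pmatrix} E(y_{k-1} e_k) \\ E(y_{k-2} e_k) \end{pmatrix} = 0$$
since clearly the pairs $(y_{k-1}, e_k)$ and $(y_{k-2}, e_k)$ are uncorrelated. Hence:
$$\hat{\theta}_n - \theta = \left\{ \frac{1}{n} (X^T X) \right\}^{-1} \left\{ \frac{1}{n} X^T e \right\} \longrightarrow 0$$
as $n \to \infty$, so that $\lim_{n\to\infty} \hat{\theta}_n = \theta$ and the LS estimates are asymptotically unbiased.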
Example 2: (Correlated disturbances)
Consider now the model
yk = ayk−1 + wk, wk = ek + cek−1
in which {e[} is white (E(ek) = 0, Var(ek) = σ2). In this case,
y1
y2
...
yn
=
y0
y1
...
yn−1
a +
w1
w2
...
wn
which is in the form y = Xa + w. Hence,
XT X =n∑
k=1
y2k−1, XT y =
n∑
k=1
ykyk−1
and hence least-squares estimate is
an =
∑nk=1 ykyk−1∑n
k=1 y2k−1
and hence
limn→∞
(an) = limn→∞
1n
∑nk=1 ykyk−1
1n
∑nk=1 y2
k−1
=Ryy(1)
Ryy(0)
under appropriate ergodicity assumptions. Now
yk = ayk−1 + ek + ck−1
Multiplying by yk−1 and taking expectations:
E{ykyk−1} = aE{y2k−1}+ E{ekyk−1}+ cE{ek−1yk−1} ⇒ Ryy(1) = aRyy(0) + cE(ekyk) (14)
Multiplying by ek and taking expectations:
$$E\{y_k e_k\} = aE\{y_{k-1} e_k\} + E\{e_k^2\} + cE\{e_{k-1} e_k\} \;\Rightarrow\; E(y_k e_k) = \sigma^2 \qquad (15)$$
Substituting (15) in (14):
$$R_{yy}(1) = aR_{yy}(0) + c\sigma^2 \;\Rightarrow\; \lim_{n\to\infty} \hat{a}_n = a + \frac{c\sigma^2}{R_{yy}(0)}$$
and hence there is an asymptotic bias:
$$\text{asymptotic bias} = \frac{c\sigma^2}{R_{yy}(0)}$$
Note: Ryy(0) may be calculated explicitly as:
$$R_{yy}(0) = \frac{\sigma^2 (c^2 + 2ac + 1)}{1 - a^2}$$
(exercise!)
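The asymptotic bias of Example 2 can be illustrated by simulation. The sketch below is only indicative: it assumes NumPy, and the values of a, c and σ, the sample size and the seed are arbitrary illustrative choices. It compares the observed offset of the least-squares estimate with the predicted value cσ²/Ryy(0).

```python
import numpy as np

rng = np.random.default_rng(1)         # fixed seed (assumption, for reproducibility only)
a, c, sigma = 0.5, 0.8, 1.0            # illustrative model parameters
n = 200000                             # illustrative sample size

e = sigma * rng.standard_normal(n + 1)
w = e[1:] + c * e[:-1]                 # w[k-1] holds w_k = e_k + c*e_{k-1}
y = np.zeros(n + 1)
for k in range(1, n + 1):
    y[k] = a * y[k - 1] + w[k - 1]

# Least-squares estimate: sum(y_k * y_{k-1}) / sum(y_{k-1}^2)
a_hat = np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1])

# Predicted asymptotic bias, using the closed-form expression for Ryy(0)
Ryy0 = sigma**2 * (c**2 + 2 * a * c + 1) / (1 - a**2)
bias_theory = c * sigma**2 / Ryy0
print("observed bias:", a_hat - a, "predicted bias:", bias_theory)
```

With these illustrative values the predicted bias is roughly 0.25, i.e. half the size of the parameter being estimated, so the effect is far from negligible.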
2.8 Persistence of excitation and choice of input
Consider the least-squares algorithm applied to the Moving-Average model:
yk = b0uk + b1uk−1 + . . . + bquk−q + ek, k = 1, 2, . . . , n
This may be written in full as:
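$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} u_1 & u_0 & \cdots & u_{1-q} \\ u_2 & u_1 & \cdots & u_{2-q} \\ \vdots & \vdots & & \vdots \\ u_n & u_{n-1} & \cdots & u_{n-q} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_q \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$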
or $y = X\theta + e$. It is clear that certain conditions must be imposed on the input sequence $\{u_k\}$
so that the parameters $b_0, b_1, \ldots, b_q$ can be successfully identified; for example, if $u_k = 0$ for all
$k$, identification of the $b_i$'s is impossible. In general, for a unique least-squares estimate, $X^T X$
must be non-singular. In this case,
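$$X^T X = \begin{pmatrix} u_1 & u_2 & \cdots & u_n \\ u_0 & u_1 & \cdots & u_{n-1} \\ \vdots & \vdots & & \vdots \\ u_{1-q} & u_{2-q} & \cdots & u_{n-q} \end{pmatrix} \begin{pmatrix} u_1 & u_0 & \cdots & u_{1-q} \\ u_2 & u_1 & \cdots & u_{2-q} \\ \vdots & \vdots & & \vdots \\ u_n & u_{n-1} & \cdots & u_{n-q} \end{pmatrix}$$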
and hence
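$$\frac{1}{n}(X^T X) = \begin{pmatrix} \frac{1}{n}\sum_{k=1}^{n} u_k^2 & \frac{1}{n}\sum_{k=1}^{n} u_k u_{k-1} & \cdots & \frac{1}{n}\sum_{k=1}^{n} u_k u_{k-q} \\ \frac{1}{n}\sum_{k=1}^{n} u_k u_{k-1} & \frac{1}{n}\sum_{k=1}^{n} u_{k-1}^2 & \cdots & \frac{1}{n}\sum_{k=1}^{n} u_{k-1} u_{k-q} \\ \vdots & \vdots & & \vdots \\ \frac{1}{n}\sum_{k=1}^{n} u_{k-q} u_k & \frac{1}{n}\sum_{k=1}^{n} u_{k-q} u_{k-1} & \cdots & \frac{1}{n}\sum_{k=1}^{n} u_{k-q}^2 \end{pmatrix}$$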
In the general case when $\{u_k\}$ is a (zero-mean) stochastic process, under ergodicity assumptions:
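$$\frac{1}{n}(X^T X) \longrightarrow \begin{pmatrix} R_{uu}(0) & R_{uu}(1) & \cdots & R_{uu}(q) \\ R_{uu}(1) & R_{uu}(0) & \cdots & R_{uu}(q-1) \\ \vdots & \vdots & & \vdots \\ R_{uu}(q) & R_{uu}(q-1) & \cdots & R_{uu}(0) \end{pmatrix} := R_{q+1}$$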
where Ruu(i) is the covariance function of {uk}. Hence, for asymptotic identifiability of a
(q + 1)-th order model we require that Rq+1 > 0 (positive definite). Note that if Rq > 0 then
Rq−i > 0 for all i = 1, 2, . . . , q.
Definition: A signal {uk} is called persistently exciting of order q if the matrix Rq is positive
definite.
Note: This condition allows us to decide whether a particular input signal is adequate to identify
the coefficients of a particular model.
Theorem: A signal {uk} is persistently exciting of order q + 1 iff:
$$\lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^{n} \{A(z^{-1}) u_k\}^2 > 0$$
for every polynomial $A(z^{-1}) \neq 0$ of degree $q$ or less (i.e. for $A(z^{-1})$ of the form
$A(z^{-1}) = a_0 + a_1 z^{-1} + \ldots + a_q z^{-q}$).
Proof: For A(z−1) of this form, define:
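$$a^T = [a_0 \;\; a_1 \;\; \ldots \;\; a_q] \quad \text{and} \quad U_k^T = [u_k \;\; u_{k-1} \;\; \ldots \;\; u_{k-q}]$$
Then
$$\frac{1}{n} \sum_{k=1}^{n} \{A(z^{-1}) u_k\}^2 = \frac{1}{n} \sum_{k=1}^{n} a^T U_k U_k^T a$$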
or,
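$$\frac{1}{n} \sum_{k=1}^{n} \{A(z^{-1}) u_k\}^2 = a^T \begin{pmatrix} \frac{1}{n}\sum_{k=1}^{n} u_k^2 & \frac{1}{n}\sum_{k=1}^{n} u_k u_{k-1} & \cdots & \frac{1}{n}\sum_{k=1}^{n} u_k u_{k-q} \\ \frac{1}{n}\sum_{k=1}^{n} u_k u_{k-1} & \frac{1}{n}\sum_{k=1}^{n} u_{k-1}^2 & \cdots & \frac{1}{n}\sum_{k=1}^{n} u_{k-1} u_{k-q} \\ \vdots & \vdots & & \vdots \\ \frac{1}{n}\sum_{k=1}^{n} u_{k-q} u_k & \frac{1}{n}\sum_{k=1}^{n} u_{k-q} u_{k-1} & \cdots & \frac{1}{n}\sum_{k=1}^{n} u_{k-q}^2 \end{pmatrix} a$$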
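or,
$$\frac{1}{n} \sum_{k=1}^{n} \{A(z^{-1}) u_k\}^2 \longrightarrow a^T R_{q+1} a$$
Hence $\lim_{n\to\infty} \frac{1}{n}\sum_{k=1}^{n} \{A(z^{-1}) u_k\}^2 > 0$ for every $A(z^{-1}) \neq 0$ of degree less than or equal to $q$ iff
$a^T R_{q+1} a > 0$ for all $a \neq 0$, i.e. iff $R_{q+1} > 0$.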
Example:
• Take $u_k = 1$ (unit step). Since $(1 - z^{-1})u_k = u_k - u_{k-1} = 0$, there exists a non-trivial
polynomial in $z^{-1}$ of degree 1 such that $A(z^{-1})u_k = 0$; hence $\{u_k\}$ can be persistently
exciting of order 1 at most.
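• Take $u_k = e^{j\omega k}$ (complex sinusoid) and define $A(z^{-1}) = 1 - 2z^{-1}\cos\omega + z^{-2}$. Now,
$$\begin{aligned} A(z^{-1})u_k &= \left(1 - z^{-1}(e^{j\omega} + e^{-j\omega}) + z^{-2}\right) e^{j\omega k} \\ &= e^{j\omega k} - e^{j\omega} e^{j\omega(k-1)} - e^{-j\omega} e^{j\omega(k-1)} + e^{j\omega(k-2)} \\ &= e^{j\omega k}\left(1 - 1 - e^{-2j\omega} + e^{-2j\omega}\right) = 0 \end{aligned}$$
Thus there exists a non-trivial polynomial of degree 2 such that $A(z^{-1})u_k = 0$, and hence
$e^{j\omega k}$ can be persistently exciting of order at most 2.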
• Take uk = ek (white noise of variance σ2). Then Rq = σ2Iq > 0 for all q. Hence ek is
persistently exciting of any order.
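The three examples above can be explored numerically by forming the sample version of the matrix Rq and testing whether it is positive definite. This is a rough sketch under stated assumptions: it uses NumPy, the helper names `sample_cov_matrix` and `pe_order`, the tolerance and the maximum order tested are all hypothetical choices, and a real sinusoid cos(ωk) is used in place of the complex exponential (it is annihilated by the same degree-2 polynomial).

```python
import numpy as np

def sample_cov_matrix(u, order):
    """Sample version of R_order: (1/n) * sum_k U_k U_k^T,
    with U_k = (u_k, u_{k-1}, ..., u_{k-order+1})^T."""
    u = np.asarray(u, dtype=float)
    n = len(u) - order + 1
    # Column i holds the signal delayed by i samples
    U = np.column_stack([u[order - 1 - i: order - 1 - i + n] for i in range(order)])
    return U.T @ U / n

def pe_order(u, max_order=6, tol=1e-6):
    """Largest q <= max_order for which the sample R_q is numerically positive definite."""
    best = 0
    for q in range(1, max_order + 1):
        if np.linalg.eigvalsh(sample_cov_matrix(u, q)).min() > tol:
            best = q
        else:
            break
    return best

k = np.arange(2000)
step = np.ones(2000)                                   # unit step
sinusoid = np.cos(1.0 * k)                             # real sinusoid, omega = 1 rad/sample
noise = np.random.default_rng(0).standard_normal(5000) # white noise, unit variance

print(pe_order(step), pe_order(sinusoid), pe_order(noise))
```

The white-noise result is capped by `max_order`, so the printed value only shows that the sample matrix stays positive definite up to the largest order tested.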
2.9 Generalised least-squares algorithm
We have seen that the direct application of least-squares gives biased estimates for correlated
disturbances. Consider the model:
A(z−1)yk = B(z−1)uk + wk
where
D(z−1)wk = ek, {ek} white
(Note: Here the noise is modelled as an AR process, not MA!). Suppose that $D(z^{-1})$ is known. In
this case,
A(z−1)D(z−1)yk = B(z−1)D(z−1)uk + ek
Let $y_k^\star$ and $u_k^\star$ be the output and input sequences filtered by $D(z^{-1})$, i.e. define $y_k^\star = D(z^{-1})y_k$
and $u_k^\star = D(z^{-1})u_k$. Then,
$$A(z^{-1}) y_k^\star = B(z^{-1}) u_k^\star + e_k$$
in which the noise is now white. Thus the least-squares algorithm can be applied to give
asymptotically unbiased estimates of the coefficients of A(z−1) and B(z−1).
The generalised least-squares algorithm is an iterative application of this idea when D(z−1) is
unknown. It involves two steps:
1. Least squares estimation applied to data filtered through the current estimate of D(z−1),
and
2. An iterative procedure for estimating D(z−1): Using the current estimates of A(z−1) and
B(z−1) we calculate wk := A(z−1)yk−B(z−1)uk; next treat {wk} as an output series and
estimate D(z−1) from the model D(z−1)wk = ek using least-squares.
A conceptual algorithm is summarised next:
Algorithm - generalised least-squares:
1. Choose D0 = 1; Set j ← 1.
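2. Choose:
$$A_j(z^{-1}) = 1 + a_1^{(j)} z^{-1} + \ldots + a_n^{(j)} z^{-n}$$
$$B_j(z^{-1}) = b_0^{(j)} + b_1^{(j)} z^{-1} + \ldots + b_m^{(j)} z^{-m}$$
$$D_j(z^{-1}) = 1 + d_1^{(j)} z^{-1} + \ldots + d_p^{(j)} z^{-p}$$
according to the following procedure:
3. Filter data through $D_{j-1}(z^{-1})$ (the most recent estimate of $D$), i.e. calculate series $y_k^\star = D_{j-1}(z^{-1})y_k$ and
$u_k^\star = D_{j-1}(z^{-1})u_k$.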
4. Choose $A_j(z^{-1})$, $B_j(z^{-1})$ using least-squares to fit model:
$$A(z^{-1}) y_k^\star = B(z^{-1}) u_k^\star + e_k$$
and form the resulting optimal polynomials $A_j(z^{-1})$ and $B_j(z^{-1})$.
5. Calculate the optimal residuals: wk = Aj(z−1)yk − Bj(z−1)uk.
6. Using $\{w_k\}$ as the output sequence, fit using least-squares the model $D(z^{-1})w_k = e_k$ and
form the resulting optimal polynomial $D_j(z^{-1})$.
7. Calculate the residuals $e_k = D_j(z^{-1})w_k$. Are these white (apply the whiteness test, see next),
or have the estimates converged? If yes, stop; else set $j \leftarrow j + 1$ and go to (3).
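The steps above can be sketched for the simplest case of first-order polynomials, A(z⁻¹) = 1 + a z⁻¹, B(z⁻¹) = b and D(z⁻¹) = 1 + d z⁻¹. This is an illustrative sketch only, assuming NumPy: the function names, the simulated system, the noise level and the seed are hypothetical, and a fixed number of iterations replaces the whiteness or convergence test of step 7.

```python
import numpy as np

def apply_D(d, x):
    """Filter x through D(z^-1) = 1 + d*z^-1 (the first sample is left unfiltered)."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    out[1:] += d * x[:-1]
    return out

def gls_first_order(y, u, n_iter=20):
    d = 0.0                                              # step 1: D_0 = 1
    for _ in range(n_iter):                              # fixed iteration count (assumption)
        ys, us = apply_D(d, y), apply_D(d, u)            # step 3: filter the data
        # step 4: fit y*_k = -a*y*_{k-1} + b*u*_k + e_k (skipping the unfiltered first sample)
        X = np.column_stack([-ys[1:-1], us[2:]])
        a, b = np.linalg.lstsq(X, ys[2:], rcond=None)[0]
        w = y[1:] + a * y[:-1] - b * u[1:]               # step 5: w_k = A(z^-1)y_k - B(z^-1)u_k
        # step 6: fit w_k = -d*w_{k-1} + e_k by least squares
        d = -np.dot(w[1:], w[:-1]) / np.dot(w[:-1], w[:-1])
    return a, b, d

# Simulated data from a hypothetical system with AR(1) disturbance
rng = np.random.default_rng(2)
n = 20000
a_true, b_true, d_true, sigma = -0.7, 1.0, -0.5, 0.3
u = rng.standard_normal(n)
e = sigma * rng.standard_normal(n)
w = np.zeros(n)
y = np.zeros(n)
for k in range(1, n):
    w[k] = -d_true * w[k - 1] + e[k]                     # D(z^-1) w_k = e_k
    y[k] = -a_true * y[k - 1] + b_true * u[k] + w[k]     # A(z^-1) y_k = B(z^-1) u_k + w_k

a_hat, b_hat, d_hat = gls_first_order(y, u)
print(a_hat, b_hat, d_hat)
```

Each pass alternates between two ordinary least-squares fits, so no new estimation machinery is needed beyond what has already been developed.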
A possible stopping condition of the generalised least-squares algorithm is a “whiteness test” on
the residual sequence {ek}. One such test is based on the statistical properties of the “sample
covariances” of ek.
Statistical properties of “white” sequences
Suppose that $\{e_k\}_{k=1}^{n}$ is a white sequence with variance $\mathrm{Var}(e_k) = \sigma^2$. Introduce the “sample
covariances” $\phi_{ee}(k)$,
$$\phi_{ee}(k) = \frac{1}{n - k_n} \sum_{i=1}^{n-k_n} e_i e_{i+k}, \qquad 0 \le k \le k_n$$
for some kn (fixed). Typically kn ≈ 0.15n. Notice that:
$$E\{\phi_{ee}(0)\} = \frac{1}{n - k_n} \sum_{i=1}^{n-k_n} E(e_i^2) = \frac{1}{n - k_n}\, \sigma^2 (n - k_n) = \sigma^2$$
and
$$E\{\phi_{ee}(k)\} = \frac{1}{n - k_n} \sum_{i=1}^{n-k_n} E(e_i e_{i+k}) = 0 \quad \text{for } k \neq 0$$
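Also for $k \neq 0$,
$$E\{\phi_{ee}^2(k)\} = \frac{1}{(n - k_n)^2} E\left( \sum_{i=1}^{n-k_n} e_i e_{i+k} \right)^2 = \frac{1}{(n - k_n)^2} E\left\{ \sum_{i=1}^{n-k_n} e_i^2 e_{i+k}^2 + \text{cross-terms} \right\} = \frac{1}{(n - k_n)^2}\, \sigma^4 (n - k_n) = \frac{\sigma^4}{n - k_n}$$
and $E\{\phi_{ee}(k)\phi_{ee}(l)\} = 0$ for $k \neq l$. We are interested in the (approximate) statistical properties
of the random variable,
$$\xi_k = \frac{\phi_{ee}(k)}{\phi_{ee}(0)}, \qquad k = 1, 2, \ldots$$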
If $n$ is “large” we can ignore the statistical fluctuations in the denominator and take
$\phi_{ee}(0) \approx E\{\phi_{ee}(0)\} = \sigma^2$ (constant). Hence,
$$E(\xi_k) \approx E\left( \frac{\phi_{ee}(k)}{\sigma^2} \right) = 0$$
and
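$$\mathrm{Var}(\xi_k) = \mathrm{Var}\left\{ \frac{\phi_{ee}(k)}{\phi_{ee}(0)} \right\} = E\left\{ \frac{\phi_{ee}^2(k)}{\phi_{ee}^2(0)} \right\} \approx \frac{E(\phi_{ee}^2(k))}{\sigma^4} = \frac{1}{n - k_n}, \qquad k = 1, 2, \ldots$$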
Now, since the $\phi_{ee}(k)$ are averages of a large number of identically distributed random variables,
$\phi_{ee}(k)/\phi_{ee}(0)$ is approximately normal (Central limit theorem). Hence, the probability that $\xi_k$
lies within two standard deviations of the mean is approximately 95 per cent, or
$$\mathrm{Prob}\left\{ -\frac{2}{\sqrt{n - k_n}} \le \frac{\phi_{ee}(k)}{\phi_{ee}(0)} \le \frac{2}{\sqrt{n - k_n}} \right\} \approx 0.95$$
This leads to the following statistical “whiteness test” of a sequence $\{e_k\}$: Plot the values
of $\phi_{ee}(k)/\phi_{ee}(0)$ for $k = 1, 2, \ldots, k_n$. If fewer than 95 per cent of these values lie inside the strip
$\left[ -\frac{2}{\sqrt{n - k_n}}, \; \frac{2}{\sqrt{n - k_n}} \right]$, then REJECT the hypothesis that $\{e_k\}$ is white (with 95 per cent confidence).
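The test described above is straightforward to code. The following sketch assumes NumPy and a zero-mean input sequence; the function name `whiteness_test` and the AR(1) sequence used as a non-white example are hypothetical illustrations, and the choice kn = 0.15n follows the typical value quoted earlier.

```python
import numpy as np

def whiteness_test(e, frac=0.15):
    """Return (passes, fraction_inside) for the normalised sample covariances
    xi_k = phi_ee(k)/phi_ee(0), k = 1..kn, against the strip +/- 2/sqrt(n - kn)."""
    e = np.asarray(e, dtype=float)
    n = len(e)
    kn = int(frac * n)
    m = n - kn                                     # number of terms in each sample covariance
    phi = np.array([np.dot(e[:m], e[k:k + m]) / m for k in range(kn + 1)])
    xi = phi[1:] / phi[0]
    bound = 2.0 / np.sqrt(m)
    inside = float(np.mean(np.abs(xi) <= bound))   # fraction of values inside the strip
    return bool(inside >= 0.95), inside

rng = np.random.default_rng(0)
white = rng.standard_normal(2000)
coloured = np.zeros(2000)                          # AR(1) sequence driven by the same noise
for k in range(1, 2000):
    coloured[k] = 0.9 * coloured[k - 1] + white[k]

white_ok, white_frac = whiteness_test(white)
coloured_ok, coloured_frac = whiteness_test(coloured)
print(white_frac, coloured_frac)
```

Since about 95 per cent of the values are expected inside the strip even for a genuinely white sequence, the reported fraction for white data will sit close to the threshold; the fraction itself is therefore returned alongside the decision.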
2.10 Recursive least-squares
We consider a modification of the least-squares algorithm which updates the old estimates as
new data come in. Consider the standard model:
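$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} \theta + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$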
which we now write as Yn = Xnθ + En to keep track of dimensions. The least-squares estimate
of θ at time t = n is given by:
$$\hat{\theta}_n = (X_n^T X_n)^{-1} X_n^T Y_n$$
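Define $P_n := (X_n^T X_n)^{-1}$; then
$$\hat{\theta}_n = P_n \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = P_n \sum_{i=1}^{n} x_i y_i = P_n \left( \sum_{i=1}^{n-1} x_i y_i + x_n y_n \right)$$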
Now
$$\hat{\theta}_{n-1} = P_{n-1} \sum_{i=1}^{n-1} x_i y_i \;\Rightarrow\; \sum_{i=1}^{n-1} x_i y_i = P_{n-1}^{-1} \hat{\theta}_{n-1}$$
Hence
$$\hat{\theta}_n = P_n \left\{ P_{n-1}^{-1} \hat{\theta}_{n-1} + x_n y_n \right\}$$
But,
$$P_n^{-1} = X_n^T X_n = X_{n-1}^T X_{n-1} + x_n x_n^T = P_{n-1}^{-1} + x_n x_n^T$$
Substituting we get
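$$\hat{\theta}_n = P_n \left\{ (P_n^{-1} - x_n x_n^T) \hat{\theta}_{n-1} + x_n y_n \right\}$$
or
$$\hat{\theta}_n = \hat{\theta}_{n-1} - P_n x_n x_n^T \hat{\theta}_{n-1} + P_n x_n y_n$$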
and thus
$$\hat{\theta}_n = \hat{\theta}_{n-1} + P_n x_n (y_n - x_n^T \hat{\theta}_{n-1})$$
To find a recursive update of Pn we have to use the “matrix inversion lemma”:
Lemma: For any three matrices A, B and C of compatible dimensions such that A and A+BC
are invertible,
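$$(A + BC)^{-1} = A^{-1} - A^{-1} B (I + C A^{-1} B)^{-1} C A^{-1}$$
Proof: Check identity by multiplying the RHS by $A + BC$:
$$\begin{aligned} (\text{RHS})(A + BC) &= I + A^{-1} B C - A^{-1} B (I + C A^{-1} B)^{-1} C - A^{-1} B (I + C A^{-1} B)^{-1} C A^{-1} B C \\ &= I + A^{-1} B (I + C A^{-1} B)^{-1} \left\{ (I + C A^{-1} B) - I - C A^{-1} B \right\} C = I \end{aligned}$$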
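The lemma is easy to sanity-check numerically. The sketch below assumes NumPy; the matrix sizes, the scaling used to keep the matrices well conditioned and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)                      # fixed seed (assumption)
# Well-conditioned illustrative matrices: A close to 4*I, B and C small
A = 4.0 * np.eye(4) + 0.3 * rng.standard_normal((4, 4))
B = 0.3 * rng.standard_normal((4, 2))
C = 0.3 * rng.standard_normal((2, 4))

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + B @ C)
# Right-hand side of the lemma: only a 2x2 matrix is inverted besides A itself
rhs = A_inv - A_inv @ B @ np.linalg.inv(np.eye(2) + C @ A_inv @ B) @ C @ A_inv

print(np.max(np.abs(lhs - rhs)))                    # difference at rounding-error level
```

The practical point is that, once the inverse of A is known, the update requires inverting only a matrix whose size equals the number of columns of B.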
Applying this to:
$$P_n = (X_n^T X_n)^{-1} = (X_{n-1}^T X_{n-1} + x_n x_n^T)^{-1}$$
gives
$$P_n = P_{n-1} - P_{n-1} x_n (I + x_n^T P_{n-1} x_n)^{-1} x_n^T P_{n-1}$$
or
$$P_n = P_{n-1} - \frac{P_{n-1} x_n x_n^T P_{n-1}}{1 + x_n^T P_{n-1} x_n}$$
on noting that xTnPn−1xn is a scalar. The updating formulae of recursive least squares can be
summarised as:
$$\hat{\theta}_n = \hat{\theta}_{n-1} + P_n x_n (y_n - x_n^T \hat{\theta}_{n-1})$$
$$P_n = P_{n-1} - \frac{P_{n-1} x_n x_n^T P_{n-1}}{1 + x_n^T P_{n-1} x_n}$$
Note 1: Computationally demanding matrix inversion is now completely avoided. Also, there
is no need to store the old data; at time t = n we receive new information from the data (xn, yn)
and we update the old estimates (Pn−1, θn−1) to produce the new ones (Pn, θn).
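The two updating formulae can be sketched in a few lines of code. This is a minimal illustration, assuming NumPy; the function name `rls_update`, the simulated regression data and the choice of starting the recursion from a batch solution on the first n0 = 10 samples are all assumptions made for the example.

```python
import numpy as np

def rls_update(theta, P, x, y):
    """One recursive least-squares step for a new data pair (x, y).
    P is assumed symmetric, so P x x^T P equals outer(Px, Px)."""
    Px = P @ x
    P_new = P - np.outer(Px, Px) / (1.0 + x @ Px)           # update of P_n
    theta_new = theta + P_new @ x * (y - x @ theta)          # correction by the prediction error
    return theta_new, P_new

# Illustrative regression data y_i = x_i^T theta + e_i
rng = np.random.default_rng(4)
n, q = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, q))
Y = X @ theta_true + 0.1 * rng.standard_normal(n)

# Start from a batch solution on the first n0 samples, then recurse
n0 = 10
P = np.linalg.inv(X[:n0].T @ X[:n0])
theta = P @ X[:n0].T @ Y[:n0]
for i in range(n0, n):
    theta, P = rls_update(theta, P, X[i], Y[i])

theta_batch = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.max(np.abs(theta - theta_batch)))                   # rounding-error level
```

Started this way, the recursion reproduces the batch least-squares estimate on all n samples up to rounding error, while only a scalar division is needed at each step.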
Note 2: The updating formula for $\hat{\theta}_n$ is intuitively appealing: The “predicted” value of $y_n$
given information up to time $t = n - 1$ is
$$\hat{y}_{n|n-1} = x_n^T \hat{\theta}_{n-1}$$
When new information arrives at time $t = n$ in the form of a new measurement ($y_n$), this is
compared with the predicted value to generate the “prediction-error”:
$$e_{n|n-1} = y_n - \hat{y}_{n|n-1} = y_n - x_n^T \hat{\theta}_{n-1}$$
The new estimate $\hat{\theta}_n$ is then a correction of the previous estimate $\hat{\theta}_{n-1}$ by an amount proportional
to the prediction error.
Note 3: Modified recursive least-squares algorithm: A slight disadvantage of the LS
recursion is that (in its present form) it cannot be started at time $t = 1$, since the matrix $P_n$
is only defined when $X_n^T X_n$ is non-singular, or equivalently when $X_n$ has full column rank. A
necessary condition for this is that $n \ge q$, i.e. that we have at least as many measurements as
the number of parameters to be estimated. A possible solution is to wait for at least $n_0 \ge q$
measurements until $X_{n_0}^T X_{n_0}$ is invertible, calculate $P_{n_0}$ and $\hat{\theta}_{n_0}$ by matrix inversion (i.e. as
in batch LS), and proceed from then on recursively. An alternative approach is to start the
recursions at time $t = 1$ by choosing reasonable initial conditions $\hat{\theta}_0$ and $P_0$ (e.g. in the absence
of any information we can take $\hat{\theta}_0 = 0$ and $P_0 = \frac{1}{\varepsilon} I_q$, where $\varepsilon$ is a small positive number). It
can be shown that this initialization corresponds to the minimisation of:
$$J_n(\theta) = \sum_{i=1}^{n} (y_i - x_i^T \theta)^2 + \frac{1}{2} (\theta - \hat{\theta}_0)^T P_0^{-1} (\theta - \hat{\theta}_0)$$
This performance index differs from the “true” LS cost by the additional term $\frac{1}{2}(\theta - \hat{\theta}_0)^T P_0^{-1} (\theta - \hat{\theta}_0)$.
As $n$ increases, the deviation from the true cost decreases in relative terms. For small $n$, $J_n(\theta)$
can be made approximately equal to the LS cost by choosing $P_0$ sufficiently large, e.g. $P_0 = \frac{1}{\varepsilon} I_q$
for a small positive $\varepsilon$, as suggested previously. Note that from a previous result the covariance
of the estimated vector is $\sigma^2 (X_n^T X_n)^{-1} = \sigma^2 P_n$ (assuming that $E_n$ is white with variance $\sigma^2$).
Thus, in statistical terms a “large” $P_0$ reflects a high initial uncertainty around the initial
estimates $\hat{\theta}_0$.
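The effect of the initial condition P0 = (1/ε)I can be seen in a small experiment. The sketch below assumes NumPy; the function name `rls`, the simulated data and the particular values of ε are illustrative assumptions. It starts the recursion at t = 1 from a zero initial estimate and measures how far the final estimate lies from the batch least-squares solution.

```python
import numpy as np

def rls(X, Y, theta0, P0):
    """Run the recursive least-squares updates over all rows of X, starting from (theta0, P0)."""
    theta, P = theta0.copy(), P0.copy()
    for x, y in zip(X, Y):
        Px = P @ x
        P = P - np.outer(Px, Px) / (1.0 + x @ Px)     # P symmetric, so P x x^T P = outer(Px, Px)
        theta = theta + P @ x * (y - x @ theta)
    return theta, P

# Illustrative regression data
rng = np.random.default_rng(5)
n, q = 100, 3
X = rng.standard_normal((n, q))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)
theta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]        # batch least-squares reference

gaps = {}
for eps in (1.0, 1e-2, 1e-4):
    theta_n, P_n = rls(X, Y, np.zeros(q), np.eye(q) / eps)
    gaps[eps] = np.linalg.norm(theta_n - theta_ls)     # distance from the batch estimate
print(gaps)
```

The distance from the batch estimate should shrink as ε decreases, which matches the interpretation of a large P0 as expressing little confidence in the initial estimate.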