
An Introduction to State Space Models

Marc Wildi

May 6, 2013

Contents

1 Introduction
   1.1 Model-Equations: State and Observation Equations
   1.2 Model Assumptions
2 Stochastic Level and Trend Models
   2.1 Constant Level
   2.2 Changing Level
      2.2.1 Exercises
   2.3 Changing Level and Slope
      2.3.1 Exercises
3 ARMA-Processes in State Space Form
   3.1 AR-Process
   3.2 MA-Process
   3.3 ARMA-Process
   3.4 Exercises
4 The Kalman-Filter
   4.1 The Recursion
   4.2 Initializing the Kalman Filter
      4.2.1 MA(1)-Process
      4.2.2 AR(1)-Process
      4.2.3 The Diffuse Prior
5 Optimization
   5.1 Example: ARMA-Model
   5.2 Least-Squares
   5.3 Maximum Likelihood
   5.4 In-Sample vs. Out-of-Sample Criteria
   5.5 Exercises Part I: Constant Level
   5.6 Exercises Part II: Changing Level and Optimization Criteria
   5.7 Exercises Part III: the Q/R Ratio
   5.8 Exercises Part IV: Optimization
6 Regression Analysis
   6.1 Classical Fixed Coefficients Regression
   6.2 Adaptive Regression
   6.3 A More General KF-Code
   6.4 Exercises
      6.4.1 Exercises: Replicating the Classical (Fixed Coefficient) Regression Model
      6.4.2 Exercises: Adaptive Regression
7 Forecasting, Smoothing and Interpolation
   7.1 Forecasting
   7.2 Smoothing
   7.3 Interpolation
   7.4 Exercises
      7.4.1 Exercises Forecasting Part I: Varying Level and Slope
      7.4.2 Exercises Interpolation: Varying Level and Slope
      7.4.3 Exercises Forecasting Part II: Changing Autocorrelation (AR(1))
8 Time Series Decomposition
9 A Completely Worked-Out Real-World Project from the Health-Care Sector
   9.1 The State Space Model
   9.2 Initial Values
   9.3 A More General Adaptive Optimization Criterion
   9.4 Robustification of the KF
   9.5 Constraining the Parameter Space
   9.6 Summary: the Generalized Kalman Filter (R-Code)
   9.7 Run this Code: Estimate Components
   9.8 Interpretation
   9.9 Plots of the Components
   9.10 Forecasts
10 R-Packages
   10.1 dlm-Package
   10.2 Package KFAS (as interfaced by dlmodeler)
   10.3 Summary
11 Appendix
   11.1 The Kalman-Gain

1 Introduction

The State Space Model Approach (SSMA) offers a very general and powerful framework to operate with time series data.

• Classical linear regression is embedded as a special case.

• The ARIMA-model class can be replicated.

• Multivariate VARIMA-models can be implemented.

• Models with time-varying parameters (level/slope/autocorrelation) can be operationalized.

• Heteroscedasticity can be accounted for.

The approach is very flexible and allows one to extract a whole range of interesting data features in 'nice' graphs. The main estimation algorithm, the Kalman Filter, is set up in recursive form: it is fast and neat. The approach allows a priori knowledge to be included through a suitable Bayesian formulation of the initial state vector. The SSMA allows a time series to be decomposed into relevant components - for example trend, cycle, seasonal - and each of these can be analyzed in real time (filtering), estimated historically (smoothing), and forecast together with the original series. Missing data can be interpolated straightforwardly.

Though very elegant, the approach has its own language and syntax which must be learned in order to fully develop its potential. Unfortunately, no unifying notation has imposed itself over time.


Unsurprisingly, therefore, all known R-packages use their own specific denominations/definitions/specifications. This will invariably lead to confusion!

1.1 Model-Equations: State and Observation Equations

A SSM is invariably specified by two sets of equations:

• An observation equation

yt = H′tξt + wt (1)

where yt is a data-vector, ξt is the generally unknown state-vector, Ht links the state vector to the observations, and wt is a noise process.

• A state equation

ξt+1 = Ft+1ξt + vt+1 (2)

which describes the dynamics of the state vector in Markovian form, via a state-transition matrix Ft+1. As for the observation equation, the state equation allows for a random disturbance vt+1.

Goal: one seeks to infer ξt, given observations yt.

Note the generality of this approach: all system matrices are allowed to change over time, which makes it possible to tackle all kinds of non-stationarities. In general there are many more unknown quantities than observation equations; therefore, some rules - model assumptions - will be needed to tackle the problem.
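To fix ideas, here is a minimal simulation sketch of a time-invariant special case of equations 1 and 2 (the particular matrices F, H, Q, R below - a level-plus-AR-slope system - are illustrative assumptions, not part of the general model):

> # Minimal sketch: simulate a time-invariant SSM (illustrative F, H, Q, R)
> set.seed(1)
> len <- 100
> F_mat <- matrix(c(1,0,1,0.9),nrow=2)    # state-transition matrix F
> H <- c(1,0)                             # observation vector H'
> Q <- diag(c(0.1,0.01))                  # Var(v_t), diagonal here
> R <- 1                                  # Var(w_t)
> xi <- matrix(0,nrow=2,ncol=len+1)       # state vectors xi_t (xi_1 = 0)
> for (t in 1:len)
+ {
+   # state equation 2 (elementwise noise draw is valid since Q is diagonal)
+   xi[,t+1] <- F_mat%*%xi[,t]+sqrt(diag(Q))*rnorm(2)
+ }
> # observation equation 1
> y <- as.vector(t(H)%*%xi[,2:(len+1)])+sqrt(R)*rnorm(len)
> ts.plot(y)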

1.2 Model Assumptions

Let the following SSM be given

yt = H′tξt + wt

ξt+1 = Ft+1ξt + vt+1

where yt is n-dimensional, ξt is r-dimensional, and Ft+1 and H′t are r × r- and n × r-dimensional, respectively. The noise vectors vt and wt are r- and n-dimensional with variance-covariance matrices

E[vt v′t+k] = Qt if k = 0, and 0 otherwise
E[wt w′t+k] = Rt if k = 0, and 0 otherwise

where Qt and Rt are r × r and n × n matrices which can depend on time t (heteroscedasticity). We require

E[vtw′t+k] = 0

for all t, k. Frequently it is assumed that the noise terms vt, wt are Gaussian. Under these assumptions, the classical estimation algorithm for ξt, the so-called Kalman Filter (KF), will provide the best possible estimate of ξt given data y1, ..., yt. Otherwise, the KF provides the best possible linear estimate (linear in the observations y1, ..., yt) which, in practice, is often a sufficient requirement.


2 Stochastic Level and Trend Models

We here consider simple illustrations of the above general SSM applied to the estimation of the possibly time-varying level and/or slope of a random variable yt. Let

yt = µt + wt

We would like to estimate the level µt from y1, ..., yt.

2.1 Constant Level

Assume that we know that the level µt = µ0 is a constant, see fig. 1. How could we possibly estimate the constant µ0 as a function of the data? One of the simplest estimates is the identity µ̂t = yt. A straightforward alternative would be the arithmetic mean

µ̂t = (1/t) ∑_{k=1}^{t} yk,

see the red line in the figure. Note that the estimate µ̂t changes randomly over time although µ0 is a fixed constant (it is equal to π in our example): as the sample size increases (t → 100) the estimate seems to converge towards the true µ0 = π. Assigning equal weight to all observations, as the arithmetic mean does, is meaningful here because the level was assumed constant: each yk contributes equally important information. This assumption fails when the variance of the noise term changes or when the level µt changes dynamically.

[Figure 1: Data y_t: constant level; plotted are the data, the true level and the arithmetic mean.]
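As a minimal sketch of the recursive flavor that the KF will later formalize, the arithmetic mean itself can be up-dated recursively as each new observation arrives (the simulated series below, with level π, is an illustrative stand-in for the data in fig. 1):

> # Recursive up-date of the arithmetic mean: mu_t = mu_{t-1} + (y_t - mu_{t-1})/t
> set.seed(10)
> y <- pi+rnorm(100)                 # constant level mu_0 = pi plus unit-variance noise
> mu_hat <- rep(NA,length(y))
> mu_hat[1] <- y[1]
> for (t in 2:length(y))
+ {
+   mu_hat[t] <- mu_hat[t-1]+(y[t]-mu_hat[t-1])/t
+ }
> max(abs(mu_hat-cumsum(y)/(1:length(y))))   # agrees with the batch mean (up to rounding)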


2.2 Changing Level

For simplicity assume that

µt = µt−1 + vt

The previous setting (constant level) can be replicated by setting Q = 0, where Q is the variance of vt. Fig. 2 plots a realization of the data yt (black) and the true level (blue) where R = Q = 1, i.e. both error terms vt, wt are standardized Gaussian noise. The red lines correspond to the arithmetic mean (upper graph) and to an optimal estimate of the level (lower graph).

[Figure 2: Data y_t: changing level. Upper panel: arithmetic mean; lower panel: optimal smoothed estimate (each plotted against the true level).]

In contrast to the previous example with a constant level, the arithmetic mean now performs much worse than the Identity: it is much too smooth and heavily delayed. We now briefly compute the mean-square performances of the three estimates (the Identity, the mean, and the optimal estimate) in both cases (constant/changing level), see table 1.

                   Constant level   Changing level
Id                      0.90             0.94
Arithmetic mean         0.08            30.72
Optimal Estimate        0.08             0.52

Table 1: Mean-square performances of estimates (Id, arithmetic mean, optimal estimate) for data with constant and changing level

The first column (constant level) distinguishes the arithmetic mean as the best estimate. For changing levels, in the second column, the Identity clearly outperforms the mean, as expected from an examination of fig. 2 (upper graph). The optimal estimate (lower graph) improves by an additional reduction of approximately 50% of the error variance relative to the Identity.


2.2.1 Exercises

1. Generate realizations for yt with R = 1 and Q = 0, 0.1, 0.5, 1, 5, 100:

> set.seed(10)

> len<-100

> Q<-1 # 0, 0.1,0.5,1,5,100

> R<-1

> mu<-cumsum(sqrt(Q)*rnorm(len))

> y<-mu+sqrt(R)*rnorm(len)

Plot the data yt as well as the level µt and describe the effect of the parameter Q on the observed dynamics.

2. Compute mean-square performances of the Identity as well as of the arithmetic mean in each case.

> a_mean<-cumsum(y)/(1:len)

> # Identity

> mean((y-mu)^2)

[1] 0.93949

> # Arithmetic mean

> mean((a_mean-mu)^2)

[1] 30.71903

3. What is the data generating process (DGP) of yt when Q = 0 or Q = 100? Try to identify a correct model.

> arorder<-0 #1

> maorder<-0 #1

> y.arima<-arima(y,order=c(arorder,0,maorder))

> # Diagnostic

> tsdiag(y.arima)

4. Try to explain what happens if R and Q are changed simultaneously but under the restriction of a constant ratio R/Q = 1 (keep the set.seed fixed). What happens if the ratio R/Q is small? What happens if the ratio is large?

We now try to express the above simple level model in terms of a SSM. For this purpose we have to distinguish the data yt from the state-vector ξt: the former is observed, the latter is our unobserved target (in this case: the level).

ξt = µt

yt = yt, vt = vt, wt = wt

H′t = 1

Ft = 1

then

yt = H′tξt + wt

ξt+1 = Ft+1ξt + vt+1

is the same as

yt = µt + wt

µt = µt−1 + vt


2.3 Changing Level and Slope

Let

yt = µt + wt

µt = d+ µt−1 + vt

where d ≠ 0. Fig. 3 shows a realization with Q = R = 1 and d = 0.5.

[Figure 3: Data y_t: changing level with constant slope d = 0.5. Arithmetic mean (top) vs. optimal estimate (bottom).]

In economic applications the

assumption of a constant growth or slope dt = d0 is very unrealistic. Therefore, we could make it variable too:

yt = µt + wt

µt = dt−1 + µt−1 + v1t

dt = adt−1 + v2t

where a is an autoregressive parameter. A realization of the process with a = 1, R = 1, Q11 = 1, Q22 = 0.1 (the variances of v1t and v2t) is plotted in fig. 4. One can see true and estimated levels (blue and red lines, left scale) as well as true and estimated slopes (orange and green lines, right scale). Modifying the variances in the state-equation to Q11 = 0.00001, Q22 = 0.000001 leads to the realization in fig. 5. As can be seen, the pure noise component wt now dominates; also, the slope estimate (orange line) is now almost constant (the red line has a fairly constant downward trend), which is OK because the data appears to be fitted quite well this way (red and blue lines are very close). Mean-square performances of the Identity, the mean and the optimal estimate are reported in table 2.


[Figure 4: Data y_t: changing level and changing slope (Q11 = 1, Q22 = 0.1). Left scale: true level and optimal level estimate; right scale: true slope and optimal slope estimate.]

                   Q=(1, 0.1)   Q=(0.00001, 0.000001)
Id                    0.94             0.94
Arithmetic mean   15131.83             0.19
Optimal Estimate      0.50             0.01

Table 2: Mean-square performances of estimates (Id, arithmetic mean, optimal estimate) for data with varying level and slope: the first column corresponds to Q11 = 1, Q22 = 0.1 and the second column to Q11 = 0.00001, Q22 = 0.000001

As for the previous example we now convert the above simple model into a SSM. For this purpose we have to distinguish the data yt from the state-vector ξt. In this example ξ′t = (µt, dt) is two-dimensional since we want to estimate the level µt as well as the slope dt.

ξt = ( µt
       dt )

yt = yt, vt = ( v1t
                v2t ), wt = wt

H′t = (1, 0)

Ft = ( 1  1
       0  a )

then

yt = H′tξt + wt
ξt+1 = Ft+1ξt + vt+1


[Figure 5: Data y_t: changing level and changing slope (Q11 = 1e-05, Q22 = 1e-06). Left scale: true level and optimal level estimate; right scale: true slope and optimal slope estimate.]

is the same as

yt = µt + wt

µt = dt−1 + µt−1 + v1t

dt = adt−1 + v2t

2.3.1 Exercises

1. How would you replicate the first example with a constant level in the above model? How would you replicate the second model with a changing level but a fixed zero slope dt = d0 = 0? How would you replicate the third model with a changing level but a fixed non-zero slope dt = d0 ≠ 0?

2. Simulate data yt based on the above model:

> set.seed(10)

> len<-100

> Q<-c(1,0.1)

> R<-1

> d<-0.5 # initial slope value; overwritten below by the random-walk slope

> d<-cumsum(sqrt(Q[2])*rnorm(len))

> mu<-cumsum(d+sqrt(Q[1])*rnorm(len))

> y<-mu+sqrt(R)*rnorm(len)

Try to understand this piece of code; then simulate series yt for various values of R and

Q = ( Q11   0
       0   Q22 ).


3. What happens if R is much larger than Q, say R = 1 and

Q = ( 10^−6    0
        0    10^−7 )?

Try to identify the DGP of yt in this case:

> arorder<-0 #1

> maorder<-0 #1

> y.arima<-arima(y,order=c(arorder,0,maorder))

> # Diagnostic

> tsdiag(y.arima)

4. Compute the mean-square error of the Identity and of the arithmetic mean for the two cases in table 2.

> a_mean<-cumsum(y)/(1:len)

> # Identity

> mean((y-mu)^2)

[1] 0.9378635

> # Arithmetic mean

> mean((a_mean-mu)^2)

[1] 15131.83

5. How would you forecast the series in figs. 4 and 5? How would you forecast the series in fig. 1? And how would you forecast the series in fig. 2?

3 ARMA-Processes in State Space Form

3.1 AR-Process

The AR(p)-model

xt = a1xt−1 + ... + apxt−p + εt (3)

with var(εt) = σ2 can be put into SS-form according to

yt = xt

wt = εt

H′t = (xt−1, xt−2, ..., xt−p)

ξ′t = (a1, a2, ..., ap)

Ft = Id

vt = 0

R = σ2

Q = 0

Please verify that

yt = H′tξt + wt

ξt+1 = Ft+1ξt + vt+1

is the same as 3.


Alternatively, the above AR-process can be expressed as

yt = xt
H′t = (1, 0, 0, ..., 0)
vt = (εt, 0, ..., 0)′

Ft = ( a1    1  0  ...  0
       a2    0  1  ...  0
       ...
       ap−1  0  0  ...  1
       ap    0  0  ...  0 )

Q = ( σ2  0  ...  0
      0   0  ...  0
      ...
      0   0  ...  0 )

R = 0

Note that we did not specify ξt so far: we now do so and briefly check the pertinence of this representation. We 'unfold' the state-equation ξt+1 = Ft+1ξt + vt+1 by starting with the bottom equation and then working up sequentially:

ξt+1,p = apξt,1

ξt+1,p−1 = ap−1ξt,1 + ξt,p

ξt+1,p−2 = ap−2ξt,1 + ξt,p−1

... = ...

ξt+1,2 = a2ξt,1 + ξt,3

ξt+1,1 = a1ξt,1 + ξt,2 + εt+1

We now insert the first equation into the second:

ξt+1,p−1 = ap−1ξt,1 + ξt,p = ap−1ξt,1 + apξt−1,1

This expression can be plugged into the third equation to obtain

ξt+1,p−2 = ap−2ξt,1 + ξt,p−1 = ap−2ξt,1 + ap−1ξt−1,1 + apξt−2,1

Continuing this way up to the first equation we obtain

ξt+1,1 = a1ξt,1 + ξt,2 + εt+1 = a1ξt,1 + a2ξt−1,1 + ...+ apξt−p,1 + εt+1

Thus we see that the first component ξt,1 of ξt is an AR(p)-process with noise-term εt: it corresponds exactly to the AR(p)-equation for the process yt. By extracting the first component from the state-vector, the system-matrix H′t := (1, 0, ..., 0) links the data yt with the AR(p)-process. Note that all other components ξt,2, ξt,3, ..., ξt,p of ξt are less straightforward to interpret. In a sense, these are just (artificial) 'by-products' necessary in the iterative build-up of the process.

The main difference between the two SS-representations is that ξt collects either the unknown parameters (first case) or the process (second case). In the latter case, the AR-parameters are embedded into Ft. Thus one has to rely on numerical optimization in order to infer estimates, see section 5. Note also that the first SS-representation allows the classical fixed-parameter AR-model to be extended to an adaptive-parameter setting by setting Q > 0, which is extremely simple and therefore very elegant. To summarize, my personal favorite for AR-processes is the first SS-representation.
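As a small illustration of why the first representation is convenient (a sketch; lm is used here as a stand-in for the KF with a diffuse prior and Q = 0): with Ft = Id and Q = 0 the state is constant, so estimating it amounts to a linear regression of xt on its own lags.

> # First SS-representation with Q = 0: the state (a1, a2) is a fixed regression coefficient
> set.seed(10)
> a <- c(0.5,-0.3)                          # illustrative AR(2) coefficients
> x <- as.vector(arima.sim(list(ar=a),n=500))
> X <- cbind(x[2:499],x[1:498])             # rows are H_t' = (x_{t-1}, x_{t-2})
> lm(x[3:500]~X-1)$coef                     # close to (0.5, -0.3)

Setting Q > 0 instead lets these 'regression coefficients' drift over time; section 6 develops this regression connection in detail.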

To conclude, we note that AR-processes with a non-vanishing mean can be centered (demeaned) prior to being cast into SS-form. This is the pragmatic way! Alternatively, suppose that

xt = c + a1xt−1 + ... + apxt−p + εt

with var(εt) = σ2. Then we can define

yt = xt

wt = εt

H′t = (1, xt−1, xt−2, ..., xt−p)

ξ′t = (c, a1, a2, ..., ap)

Ft = Id

vt = 0

R = σ2

Q = 0

The price to be paid is minor since we just have to inflate the dimension of the state-vector by one, in order to account for the additional intercept c.

3.2 MA-Process

Let now

xt = εt + b1εt−1 + ... + bqεt−q

be an MA(q) process. A SS-representation for this process is given by

yt = xt
wt = 0
H′ = (1, 0, ..., 0)
vt = (1, b1, ..., bq)′ εt

Q = σ2 ( 1    b1     ...  bq
         b1   b1^2   ...  b1 bq
         b2   b2 b1  ...  b2 bq
         ...
         bq   bq b1  ...  bq^2 )

R = 0

Ft = ( 0  1  0  ...  0
       0  0  1  ...  0
       ...
       0  0  0  ...  1
       0  0  0  ...  0 )


We now briefly check the pertinence of this representation. As in the previous AR-example we unfold the state-equation from the last to the first one:

ξt+1,q+1 = bqεt+1

ξt+1,q = ξt,q+1 + bq−1εt+1

ξt+1,q−1 = ξt,q + bq−2εt+1

... = ...

ξt+1,1 = ξt,2 + εt+1

Plugging the first into the second equation we obtain:

ξt+1,q = ξt,q+1 + bq−1εt+1 = bqεt + bq−1εt+1

This can be inserted into the third equation to obtain

ξt+1,q−1 = ξt,q + bq−2εt+1 = bqεt−1 + bq−1εt + bq−2εt+1

Unfolding the system of equations recursively up to the first equation we obtain

ξt+1,1 = ξt,2 + εt+1 = εt+1 + b1εt + b2εt−1 + ...+ bqεt+1−q

which is recognized as being the MA(q)-process yt.
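As a quick numerical check of this construction (a minimal sketch; the MA(1) coefficient b1 = 0.8 is an illustrative choice), one can run the state recursion and verify that its first component reproduces εt + b1εt−1:

> # Verify the SS-form of an MA(1): xi_{t,1} should equal eps_t + b1*eps_{t-1}
> set.seed(10)
> len <- 100
> b1 <- 0.8
> F_t <- matrix(c(0,0,1,0),nrow=2)          # shifts the second state component up
> eps <- rnorm(len)
> xi <- matrix(0,nrow=2,ncol=len+1)
> for (i in 1:len)
+ {
+   xi[,i+1] <- F_t%*%xi[,i]+c(1,b1)*eps[i]
+ }
> max(abs(xi[1,2:(len+1)]-(eps+b1*c(0,eps[-len]))))   # zero up to rounding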

3.3 ARMA-Process

A brief inspection of the second AR-representation in section 3.1 and the MA-representation in the previous section 3.2 suggests that the ARMA-process

xt = a1xt−1 + ...+ apxt−p + εt + b1εt−1 + ...+ bqεt−q

can be represented as

yt = xt
wt = 0
H′ = (1, 0, ..., 0)
vt = (1, b1, ..., br−1)′ εt

Q = σ2 ( 1      b1       ...  br−1
         b1     b1^2     ...  b1 br−1
         ...
         br−1   br−1 b1  ...  br−1^2 )

R = 0

Ft = ( a1    1  0  ...  0
       a2    0  1  ...  0
       ...
       ar−1  0  0  ...  1
       ar    0  0  ...  0 )

where r = max(p, q + 1). If p < r then ap+1 = ... = ar = 0; if q < r − 1, then bq+1 = ... = br−1 = 0. We leave it as an exercise to verify the pertinence of this representation (hint: unfold the state-equation in reversed order, recursively from the last to the first equation).


3.4 Exercises

1. Simulate the following ARMA(2,3)-process

xt = 1.2xt−1 − 0.6xt−2 + εt − 1.2εt−1 + 0.6εt−2 − 0.4εt−3

by relying on the above SSM-representation. Estimate model-coefficients by plugging the data into the arima-command and perform a diagnostic check in order to verify the pertinence of the SSM.

> len<-100

> p<-2

> q<-3

> r<-max(p,q+1)

> # Define AR and MA coefficients

> a_vec<-c(1.2,-0.6,0,0)

> b_vec<-c(1,-1.2,0.6,-0.4)

> # Define state transition matrix

> F_t<-matrix(ncol=r,nrow=r)

> F_t[r,]<-0

> F_t[1:(r-1),2:r]<-diag(rep(1,r-1))

> for (i in 1:r)

+ {

+ F_t[i,1]<-ifelse(i<p+1,a_vec[i],0)

+ }

> # Define Q (optional: it is not used in simulation)

> Q<-b_vec%*%t(b_vec)

> # Define H-matrix

> H<-c(1,rep(0,r-1))

> # Generate state vector

> xi<-matrix(ncol=len+1,nrow=r)

> xi[,1]<-0

> set.seed(10)

> eps<-rnorm(len)

> for (i in 1:len) #i<-1

+ {

+ xi[,i+1]<-F_t%*%xi[,i]+b_vec*eps[i]

+ }

> # Plot of first component of state vector

> ts.plot(xi[1,])

> # estimate ARMA(2,3) model

> arma_obj<-arima(xi[1,],order=c(2,0,3))

> # perform diagnostics

> tsdiag(arma_obj)

4 The Kalman-Filter

The r-dim state vector ξt in

ξt+1 = Ft+1ξt + vt+1

yt = H′tξt + wt

(with Var(wt) = Rt and Var(vt) = Qt) must be estimated based on data y1, y2, ..., yt. In order to do so we require the model assumptions in section 1.2 to be satisfied: essentially we require the noise terms vt and wt to be mutually uncorrelated Gaussian white noise sequences.


4.1 The Recursion

The Kalman-filter is a set of recursive linear estimation steps which computes the estimate of ξt based on the estimate of ξt−1 and the new data-point yt. Specifically:

ŷt = H′t ξt|t−1   (4)
εt = yt − ŷt   (5)
ξt|t = ξt|t−1 + Pt|t−1 Ht (H′t Pt|t−1 Ht + Rt)^(−1) εt   (6)
ξt+1|t = Ft+1 ξt|t   (7)
Pt|t = Pt|t−1 − Pt|t−1 Ht (H′t Pt|t−1 Ht + Rt)^(−1) H′t Pt|t−1   (8)
Pt+1|t = Ft+1 Pt|t F′t+1 + Qt+1   (9)

Hereby ξt|t is the estimate of ξt given data y1, y2, ..., yt up to time point t, and Pt|t is the variance of this estimate. The vector ξt+1|t designates the estimate of ξt+1 given the same information y1, y2, ..., yt up to t 'only': it is a forecast of ξt+1. The matrix Pt+1|t is the variance of this forecast. The above equations, the Kalman Filter (KF), up-date both estimates, ξt|t and ξt+1|t, as well as their variance-covariance matrices Pt|t and Pt+1|t, as new data flows in. If the noise disturbances are Gaussian, then the estimates are Gaussian too (they are linear combinations of Gaussian data) and therefore we can set up confidence bands

ξk,t|t ± 1.96 √pkk,t|t
ξk,t+1|t ± 1.96 √pkk,t+1|t

where ξk,t|t is the k-th component of ξt|t and pkk,t|t is the k-th diagonal element of Pt|t. The recursive scheme of the KF allows fast numerical computation of these four entities. Note that one needs starting values ξ1|0 and P1|0 to initialize the recursions, see section 4.2 below. We now attempt to justify and motivate the above equations.

Assume we have a good forecast ξt|t−1 of ξt based on data y1, y2, ..., yt−1 up to t − 1. The first equation 4 in the KF then just derives a good forecast for yt:

yt = H′t ξt + wt
ŷt = H′t ξt|t−1 + 0 = H′t ξt|t−1

Note that the best forecast of wt based on data y1, y2, ..., yt−1 is zero because of the white noise property. The main idea here is: if ξt|t−1 is a good estimate (forecast) of ξt, then ŷt must be a good estimate (forecast) of yt. The next equation 5 checks the quality of this forecast by computing the forecast error εt once the new data point yt becomes available. If we did a good job, then ε′tεt should be small. This gives us a criterion for estimating unknown parameters, see section 5 below. The next equation 6 is more tedious:

ξt|t = ξt|t−1 + Pt|t−1 Ht (H′t Pt|t−1 Ht + Rt)^(−1) εt

The main idea goes as follows: the up-date ξt|t of ξt based on the new observation yt is a weighted combination of the 'old' estimate (forecast) ξt|t−1 and the forecast error εt. If the forecast error is small, then we deduce that ξt|t−1 must be a good estimate of ξt and therefore we don't need to revise the old estimate (forecast) ξt|t−1, i.e.

ξt|t ≈ ξt|t−1

Otherwise, if the forecast error is 'unexpectedly' large, then ξt|t−1 cannot be considered a good estimate of ξt, i.e. the 'old' estimate must be revised. The weighting term Pt|t−1 Ht (H′t Pt|t−1 Ht + Rt)^(−1), called the Kalman-Gain, is the crucial element in this up-dating: it assigns an optimal weight to the unexpected forecast error εt. Note that since εt = yt − ŷt the up-dating mechanism really depends on the new observation yt or, to be more precise, it depends on the 'surprise' εt that we could not anticipate at the time of knowing y1, ..., yt−1 'only'. An exact derivation of the Kalman-Gain is provided in the appendix: it is shown that the Kalman-Gain is an optimal mean-square weighting scheme. Therefore the up-dated ξt|t is an optimal mean-square estimate of ξt given the new data yt.

We now proceed to the next equation 7. For this we first consider the original state-equation

ξt+1 = Ft+1ξt + vt+1

If we want to forecast ξt+1 based on data y1, ..., yt up to time point t, then we insert our up-dated ξt|t into this equation:

ξt+1|t = Ft+1ξt|t

Note that the forecast of vt+1 is zero because of the white noise assumption. We just closed the circle, starting at ξt|t−1 and ending at ξt+1|t. But we need to up-date the variances too.

Equation 8 up-dates Pt|t−1 once the new data yt becomes available:

Pt|t = Pt|t−1 − Pt|t−1 Ht (H′t Pt|t−1 Ht + Rt)^(−1) H′t Pt|t−1

The effect of the new information is to decrease uncertainty: we know more about ξt once yt has been observed. This explains why 'something' is subtracted from Pt|t−1: the variance decreases. The exact derivation of this equation is provided in the appendix.

Finally, equation 9 closes the circle for the second-order moments, since we obtain the new variance Pt+1|t of the new forecast ξt+1|t of ξt+1 based on data up to t:

Pt+1|t = Ft+1Pt|tF′t+1 + Qt+1

This is obtained by applying the variance operator to the state-equation:

Var(ξt+1 | y1, ..., yt) = Var(Ft+1 ξt | y1, ..., yt) + Var(vt+1 | y1, ..., yt)
                        = Ft+1 Pt|t F′t+1 + Qt+1

Remarks:

• Under the required model-assumptions the estimate ξt|t−1 is the conditional mean of ξt given data y1, ..., yt−1, and Pt|t−1 is the conditional variance. Thus, under the assumption of Gaussian noise disturbances, the whole conditional (Gaussian) distribution of ξt|t−1 is completely specified for all t.

• The up-dating equations for the variance, equations 8 and 9, are independent of the other equations and, in particular, of the data yt.

• If the Gaussian assumption does not hold, then the KF delivers the best linear unbiased estimate of the state vector ξt.

• The estimation procedure can be robustified by substituting a bounded function ψ(εt) for εt in the critical up-dating equation 6. Corresponding robustified code is provided in section 9.
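To make the recursions 4-9 concrete, here is a minimal generic sketch for a univariate observation and a time-invariant system (the function name kf_pass and its interface are illustrative assumptions; a fully commented special-purpose implementation follows in section 5.5):

> # One pass of the KF, equations 4-9, univariate y_t, time-invariant F, H, Q, R
> kf_pass <- function(y,F_mat,H,Q,R,xi10,P10)
+ {
+   r <- length(xi10)
+   xi_tp <- xi10                           # xi_{t|t-1}
+   P_tp <- P10                             # P_{t|t-1} (r x r matrix)
+   xitt <- matrix(0,nrow=r,ncol=length(y))
+   for (t in 1:length(y))
+   {
+     eps <- y[t]-sum(H*xi_tp)              # equations 4 and 5
+     Kg <- P_tp%*%H/as.numeric(t(H)%*%P_tp%*%H+R)  # Kalman-Gain
+     xi_tt <- xi_tp+Kg*eps                 # equation 6
+     P_tt <- P_tp-Kg%*%t(H)%*%P_tp         # equation 8
+     xi_tp <- F_mat%*%xi_tt                # equation 7
+     P_tp <- F_mat%*%P_tt%*%t(F_mat)+Q     # equation 9
+     xitt[,t] <- xi_tt
+   }
+   xitt
+ }
> # e.g. local-level model (all system matrices are scalars/1x1):
> # kf_pass(y,matrix(1),1,matrix(0.01),1,0,matrix(10000))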

4.2 Initializing the Kalman Filter

In order to start the KF-recursions one needs initial values P1|0 and ξ1|0. Formally, ξ1|0 is the conditional mean of ξ1 in the absence of data. It is thus the unconditional mean E[ξ1]. In analogy, P1|0 = Var(ξ1) must be the unconditional variance of ξ1. Let's illustrate the problem by relying on two examples.


4.2.1 MA(1)-Process

Consider the MA(1)-process in SS-form

ξt = ( 0  1
       0  0 ) ξt−1 + ( 1
                       b1 ) εt

yt = (1, 0) ξt

recall section 3.2. We saw that ξ′t = (εt + b1εt−1, b1εt). Therefore

ξ1|0 = E[ξ1] = 0

P1|0 = Var(ξ1) = σ2 ( 1 + b1^2   b1
                      b1         b1^2 )

In general, b1 and σ2 are unknown and must be estimated, see section 5.
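A minimal sketch of this exact initialization in R (the values b1 = 0.5 and σ2 = 1 are illustrative; in practice they are outputs of the optimization in section 5):

> # Exact initial values for the KF applied to an MA(1)
> b1 <- 0.5
> sigma2 <- 1
> xi10 <- c(0,0)                                       # E[xi_1]
> P10 <- sigma2*matrix(c(1+b1^2,b1,b1,b1^2),nrow=2)    # Var(xi_1)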

4.2.2 AR(1)-Process

The AR(1)-process admits a SS-representation as

ξt = ( 1  0
       0  1 ) ξt−1

yt = (1, yt−1) ξt + εt

where ξ′t = (c, a1), recall section 3.1. The initial value ξ1|0 now addresses the unconditional mean of (c, a1). The question is: what is the mean value of these parameters in the absence of information? A plausible value could be 0. But what is the variance in this case? A possible setting could be

P1|0 = ( ∞  0
         0  ∞ )

which is a so-called diffuse prior. The infinite variance specifies that we don't know anything about the true parameters (c, a1) and thus our initial guess ξ′1|0 := (0, 0) is totally arbitrary. Stated otherwise: the initial value ξ1|0 is completely irrelevant for the estimation of the state vector. If we know that the model is stationary, then |a1| < 1 and our initial variance can be refined to something like

P1|0 = ( ∞  0
         0  0.25 )

We are thus expressing the fact that the true parameter value a1 lies with probability 95% in the interval [0 − 2√0.25, 0 + 2√0.25] = [−1, 1].

A fundamental difference of the SS-framework and the KF, when compared to classical least-squares estimation, is that the user can supply a priori knowledge in the form of ξ1|0 and can specify the quality of this knowledge by a corresponding P1|0: the smaller P1|0, the better the initial estimate ξ1|0. The classic mean-square estimate (without a priori knowledge) corresponds to the diffuse prior

P1|0 = ( ∞  0
         0  ∞ ).

Remarks

• The KF up-dates a priori knowledge by incorporating data yt. The KF can be interpreted as a Bayesian estimation algorithm. This Bayesian framework is identical to the classical frequentist approach if a diffuse prior is used.

• The initial values - the unconditional means and variances - ξ1|0, P1|0 are not always known. In such a case, they could be estimated together with the other unknown parameters.

4.2.3 The Diffuse Prior

When nothing is known about the state vector, then

P1|0 = ( ∞  0
         0  ∞ )

is recommended in theory, since then the initial estimate ξ1|0 is discarded. Note, however, that excessively large values of P1|0 may lead to numerical rounding errors (negative variances may result). In practice we therefore recommend setting P1|0 to a large multiple of the maximum of the variances (diagonal elements) in R and Q: say 1000 or 10000 times this maximum variance.
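In R this pragmatic recommendation is a one-liner (a sketch, assuming scalar variances Q and R as in the level model; take the diagonals first if they are matrices):

> # Pragmatic 'diffuse' prior: a large multiple of the largest noise variance
> P10 <- 10000*max(Q,R)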

5 Optimization

5.1 Example: ARMA-Model

To illustrate the problem we here rely on the ARMA-process

xt = a1xt−1 + ...+ apxt−p + εt + b1εt−1 + ...+ bqεt−q

in section 3.3. We saw that the process can be represented as

yt = xt
wt = 0
H′ = (1, 0, ..., 0)
vt = (1, b1, ..., br−1)′ εt

Q = σ2 ( 1      b1       ...  br−1
         b1     b1^2     ...  b1 br−1
         ...
         br−1   br−1 b1  ...  br−1^2 )

R = 0

Ft = ( a1    1  0  ...  0
       a2    0  1  ...  0
       ...
       ar−1  0  0  ...  1
       ar    0  0  ...  0 )

where r = max(p, q + 1). If p < r then ap+1 = ... = ar = 0; if q < r − 1, then bq+1 = ... = br−1 = 0. If a1, ..., ap and b1, ..., bq are known, then the KF can be run to generate estimates of the state-vector ξt. However, the ARMA-parameters are generally unknown and must be determined in order to 'fit' the data.

5.2 Least-Squares

A straightforward optimization criterion would be to determine a1, ..., ap and b1, ..., bq such that

∑_{t=1}^{T} ε′t εt → min over a1, ..., ap, b1, ..., bq   (10)

i.e. we determine the ARMA-parameters such that the mean-square one-step-ahead forecast error defined in 5 is minimized.

5.3 Maximum Likelihood

A slightly more refined optimization relies on the maximum likelihood paradigm, assuming Gaussian noise disturbances. Then, the unconditional distribution of y1 as well as the sequence of conditional distributions of yt | yt−1, yt−2, ... are Gaussian with densities

f(y1) = (2π)^(−n/2) |H′1 P1|0 H1 + R1|^(−1/2) exp( −(y1 − H′1 ξ1|0)′ (H′1 P1|0 H1 + R1)^(−1) (y1 − H′1 ξ1|0) / 2 )

f(yt | yt−1, ...) = (2π)^(−n/2) |H′t Pt|t−1 Ht + Rt|^(−1/2) exp( −(yt − ŷt)′ (H′t Pt|t−1 Ht + Rt)^(−1) (yt − ŷt) / 2 )   (11)

where the notation |H′t Pt|t−1 Ht + Rt| means the determinant of the corresponding matrix and n is the dimension of the observation vector yt; note that f(y1) is just the case t = 1 with ŷ1 = H′1 ξ1|0. The yt | yt−1, yt−2, ... are independent Gaussian random variables with mean ŷt, as defined in 4, and (conditional) variance

Var(yt | yt−1, yt−2, ...) = Var(H′t ξt + wt | yt−1, yt−2, ...)
                          = H′t Pt|t−1 Ht + Rt

which justifies the expressions for the densities in 11. These Gaussian densities are determined by the initial values ξ1|0 and P1|0 and by the unknown parameters (in our example a1, ..., ap, b1, ..., bq). The KF provides all necessary terms to calculate the above densities. Therefore, one can maximize the joint density

f(y1, ..., yT) = f(y1) ∏_{t=2}^{T} f(yt | yt−1, ...) → max over θ

where θ groups all unknown parameters and possibly the initial values ξ1|0, P1|0. For numerical convenience one generally minimizes the negative log-likelihood function

− ln(f(y1)) − ∑_{t=2}^{T} ln(f(yt | yt−1, ...))

= (1/2) ∑_{t=1}^{T} [ ln(|H′t Pt|t−1 Ht + Rt|) + (yt − ŷt)′ (H′t Pt|t−1 Ht + Rt)^(−1) (yt − ŷt) ] + constant → min over θ   (12)

where the additive constant (T n/2) ln(2π) does not depend on θ and can be dropped.

5.4 In-Sample vs. Out-of-Sample Criteria

To illustrate the topic we here rely on the least-squares criterion:

∑_{t=1}^{T} ε′t εt → min over a1, ..., ap, b1, ..., bq   (13)

where εt = yt − H′t ξt|t−1. This is called an 'out-of-sample' criterion because the forecast H′t ξt|t−1 relies on data y1, ..., yt−1 up to t − 1 only. This is in contrast to classical regression approaches, including ARIMA-models, where the forecast error is invariably an in-sample error, i.e. at each time point t = 1, ..., T the whole data y1, ..., yT is known when computing the forecast error εt. Being able to rely on true out-of-sample performances is a great advantage of the SSM-approach. However, we could also define an in-sample least-squares criterion. For this purpose consider

ε_t^insample := yt − H′t ξt|t

where ξt|t−1 has been replaced by ξt|t: the latter estimate 'knows' yt and therefore ε_t^insample is truly an in-sample forecast error, since we 'forecast' yt while already knowing the observation.

Either of the above optimization criteria, least-squares and maximum likelihood, can be set up based on true out-of-sample forecast errors εt or on in-sample forecast errors ε_t^insample, thus giving a total of four different optimization criteria. These four criteria will be analyzed empirically in the next series of exercises and we shall see that the out-of-sample maximum likelihood criterion has some interesting advantages over the other three. We shall see, also, that in-sample criteria perform poorly in the case of adaptive models (with time-varying parameters).
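Given the output of the KF-function introduced in the next section (a sketch, assuming obj and y as defined there), the two error types are obtained directly from the two state-vector series:

> # Out-of-sample vs. in-sample one-step forecast errors from the KF output:
> # eps_out uses xi_{t|t-1} (y_t not yet observed), eps_in uses xi_{t|t}
> eps_out <- y-obj$xittm1
> eps_in <- y-obj$xitt
> c(mean(eps_out^2),mean(eps_in^2))   # the in-sample error is optimistically smaller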

5.5 Exercises Part I: Constant Level

1. Consider the univariate level-model

yt = µt + wt

µt = µt−1 + vt

Define ξt, Ft, Ht, Qt and Rt.

2. Write a piece of R-code which estimates ξt based on the KF-algorithm.

> # Function which implements the KF

> # -parma is a three-dim parameter vector: the two unknown variances R and Q as well as

> # the initial value xi10

> # -y is the data

> # -opti is boolean: if T then the function returns the criterion value;

> # otherwise the state vector and the variances are returned too

> # -outofsample is a boolean: if T then the out-of-sample forecast errors are used;

> # otherwise the in-sample forecast errors

> # -maxlik is a boolean: if T then the maximum likelihood algorithm is used;

> # otherwise least-squares

> # P10 is the starting value for the variance

>

> KF_level<-function(parma,y,opti,outofsample,maxlik,P10)

+ {

+ len<-length(y)

+ # note that we parametrize the variances such that they are always positive!

+ R<-parma[1]^2

+ Q<-parma[2]^2

+ xi10<-parma[3]

+ Pttm1<-0:len

+ Ptt<-1:len

+ Pttm1[1]<-P10

+ xtt<-xi10

+ # initialization of loglikelihood

+ logl<-0.

+ # we collect the state vectors xi_{t|t-1} and xi_{t|t}

+ xittm1<-1:len

+ xitt<-xittm1

+ # Start of KF recursions

+ for (i in 1:len) #i<-2

+ {

+ M<-Q

+ K<-Pttm1[i]

+ # The Kalman Gain


+ Kg<-1/(K+R)

+ # epsilon

+ epshatoutofsample<-y[i]-xtt

+ # xi_{t|t-1}

+ xittm1[i]<-xtt

+ # up-date xi_{t|t}

+ xtt<-xtt+Pttm1[i]*Kg*epshatoutofsample

+ # in-sample forecast error (after y_t has been observed)

+ epshatinsample<-y[i]-xtt

+ # compute P_{t|t}

+ Ptt[i]<-Pttm1[i]-Pttm1[i]*Kg*Pttm1[i]

+ # compute P_{t+1|t}

+ Pttm1[i+1]<-Ptt[i]+M

+ # trace xi_{t|t}

+ xitt[i]<-xtt

+ # The optimization criterion

+ if (outofsample)

+ {

+ if (maxlik)

+ {

+ logl<-logl+log(K+R)+epshatoutofsample^2*Kg

+ } else

+ {

+ logl<-logl+epshatoutofsample^2

+ }

+ } else

+ {

+ if (maxlik)

+ {

+ logl<-logl+log(K+R)+epshatinsample^2*Kg

+ } else

+ {

+ logl<-logl+epshatinsample^2

+ }

+ }

+ }

+ if (opti)

+ {

+ return(logl/len)

+ } else

+ {

+ return(list(logl=logl/len,

+ xitt=xitt,xittm1=xittm1,Pttm1=Pttm1,Ptt=Ptt))

+ }

+ }

3. Consider the special case of a constant level µt = µ0. Simulate a series yt of length 100 with µ0 = 10, R = 1, Q = 0.

> set.seed(10)

> len<-100

> mu<-10

> R<-1.

> y<-mu+sqrt(R)*rnorm(len)

> ts.plot(cbind(y,mu),lty=1:2)


4. Estimate ξt based on the above R-function. Hint: set P1|0 to a 'large' value (not too large because of numerical rounding issues), set ξ1|0 and Q, and plug these values into the vector parma<-c(sqrt(R),sqrt(Q),xi10); select the classical mean-square in-sample criterion (outofsample=maxlik=F).

> opti<-F

> # Diffuse prior

> P10<-100000000

> xi10<-0

> # Constant level

> Q<-0.0

> parma<-c(sqrt(R),sqrt(Q),xi10)

> # in-sample mean-square (not relevant here since we do not optimize)

> outofsample<-F

> maxlik=F

> # KF

> obj<-KF_level(parma,y,opti,outofsample,maxlik,P10)

5. Compare the estimated ξt|t and ξt|t−1 with the natural estimate of the level, the arithmetic mean:

> # Compare state vector and mean

> cbind(obj$xitt,obj$xittm1)[len,]

[1] 9.863451 9.849678

> mean(y)

[1] 9.863451

6. Plot the two state vectors ξt|t and ξt|t−1, see fig.6:

> ymin<-min(min(obj$xitt),min(obj$xittm1))

> ymax<-max(max(obj$xitt),max(obj$xittm1))

> file = paste("z_level_1.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> par(mfrow=c(2,1))

> ts.plot(obj$xitt,ylim=c(ymin,ymax),col="blue",main=

+ paste("State vector simple level model: P_{1|0}=",P10,", Q=",Q,sep=""))

> lines(obj$xittm1,col="green",xlab="",ylab="")

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> ts.plot(obj$xitt[2:len],col="blue",main="State vector simple level model:

+ without initial values",xlab="",ylab="")

> lines(obj$xittm1[2:len],col="green")

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> dev.off()

windows

2

These graphical tools are a characteristic strength of SSM because the estimates ξt|t and ξt|t−1 are available as time series. Thus, for example, structural breaks become easily identifiable by eye.


[Figure 6: Constant level model: state vectors ξt|t−1 (green) and ξt|t (blue). Upper panel: P_{1|0}=1e+08, Q=0; lower panel: the same without the initial values.]

7. Same graph, but add 95%-intervals for the data. Hint: yt = Hξt + wt = ξt + wt and ŷt = ξt|t; therefore Var(yt − ŷt) = Pt|t + R, see fig. 7.

> ymin<-min(min(y),min(obj$xitt-2*sqrt(R+obj$Ptt)))

> ymax<-max(max(y),max(obj$xitt+2*sqrt(R+obj$Ptt)))

> file = paste("z_level_1_sigma.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ts.plot(y,ylim=c(ymin,ymax),col="black",main=

+ paste("State vector simple level model: P_{1|0}=",P10,", Q=",Q,sep=""))

> lines(obj$xitt,col="green",xlab="",ylab="")

> lines(obj$xitt+2*sqrt(R+obj$Ptt),col="green",xlab="",ylab="",lty=2)

> lines(obj$xitt-2*sqrt(R+obj$Ptt),col="green",xlab="",ylab="",lty=2)

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> dev.off()

windows

2

8. Assign a larger importance to the initial value ξ1|0 = 0 by selecting a 'small' P1|0 = 0.1 and compare the resulting state vectors to the mean. Perform a plot of the state vectors, see fig. 8.

> P10<-0.1 #0

> obj<-KF_level(parma,y,opti,outofsample,maxlik,P10)

> cbind(obj$xitt,obj$xittm1)[len,]


[Figure 7: Constant level model: data (black) and upper/lower 95%-confidence bands; P_{1|0}=1e+08, Q=0.]

[1] 8.966774 8.946037

> mean(y)

[1] 9.863451

> ymin<-min(min(obj$xitt),min(obj$xittm1))

> ymax<-max(max(obj$xitt),max(obj$xittm1))

> file = paste("z_level_2.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ts.plot(obj$xitt,ylim=c(ymin,ymax),col="blue",

+ main=paste("State vector simple level model: P_{1|0}=",P10,", Q=",Q,sep=""))

> lines(obj$xittm1,col="green",xlab="",ylab="")

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> dev.off()

windows

2

What happens if P1|0 = 0? Try to explain this outcome and to interpret the meaning of the a priori knowledge summarized in ξ1|0 and P1|0.

9. Introduce a singular level shift in t = 50: shift the whole data by −10 for t = 51 : 100; the first half will be on a mean-level of 10 and the second part on a mean-level of zero. Use a large P1|0 to start. Compare the state vectors with the arithmetic mean and plot both state vectors together with the data, see fig. 9.


[Figure 8: Constant level model: state vectors ξt|t−1 (green) and ξt|t (blue); P_{1|0}=0.1, Q=0.]

> P10<-100000

> zy<-y-c(rep(0,length(y)/2),rep(mu,length(y)/2))

> ts.plot(zy)

> obj<-KF_level(parma,zy,opti,outofsample,maxlik,P10)

> obj$xitt[len]

[1] 4.863451

> mean(zy)

[1] 4.863451

> file = paste("z_level_3.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ts.plot(obj$xitt,ylim=c(min(zy),max(zy)),col="blue",

+ main=paste("State vector simple level model: P_{1|0}=",P10,", Q=",Q,sep=""))

> lines(obj$xittm1,col="green",xlab="",ylab="")

> lines(zy)

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> dev.off()

windows

2


[Figure 9: Constant level model (data with singular level shift): state vectors ξt|t−1 (green) and ξt|t (blue); P_{1|0}=1e+05, Q=0.]

5.6 Exercises Part II: Changing Level and Optimization Criteria

1. We take the same series yt as above (constant level µ0 = 10) but we now assume a SSM with varying level Q > 0. Specifically, set P1|0 to a large value (not too large because of rounding errors) - diffuse prior -, select ξ1|0 = 0 (which is irrelevant because of the diffuse prior), and try Q = 0.01, Q = 0.1, Q = 1, Q = 10. Plot the data yt together with the state vectors ξt|t and ξt|t−1 (fig. 10). What is the main difference when compared to the case Q = 0 (hint: consistency?)? What happens when Q gets large? What is the difference between ξt|t and ξt|t−1 (hint: it has to do with F = 1 being an identity)?

> P10<-1000000

> xi10<-0

> Q<-0.01# Q<-0.1 Q<-1 Q<-10

> parma<-c(sqrt(R),sqrt(Q),xi10)

> obj<-KF_level(parma,y,opti,outofsample,maxlik,P10)

> file = paste("z_level_change_1.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ts.plot(obj$xitt,ylim=c(min(y),max(y)),col="blue",

+ main=paste("State vector simple level model: P_{1|0}=",P10,", Q=",Q,sep=""),xlab="",ylab="")

> lines(y,col="black")

> lines(obj$xittm1,col="green",xlab="",ylab="")

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> dev.off()


windows

2

[Figure 10: Constant level model: state vectors ξt|t−1 (green) and ξt|t (blue); P_{1|0}=1e+06, Q=0.01.]

2. We now analyze the four different optimization criteria proposed in section 5.4: in- and out-of-sample least-squares and maximum likelihood criteria. These four criteria can be selected by the boolean parameters outofsample, maxlik in the head of the KF-function KF_level(parma, y, opti, outofsample, maxlik, P10).

> opti<-T

> # Out-of-sample least-squares

> outofsample<-T

> maxlik=F

> KF_level(parma,y,opti,outofsample,maxlik,P10)

[1] 1.912117

> # in sample least-squares

> outofsample<-F

> maxlik=F

> KF_level(parma,y,opti,outofsample,maxlik,P10)

[1] 0.7300567

> # out-of-sample maximum likelihood

> outofsample<-T

> maxlik=T

> KF_level(parma,y,opti,outofsample,maxlik,P10)


[1] 1.067867

> # in-sample maximum likelihood

> outofsample<-F

> maxlik=T

> KF_level(parma,y,opti,outofsample,maxlik,P10)

[1] 0.9100582

Try different values of Q = 0, 0.01, 0.1, 1, 10 and check the criterion values. Which criteria are problematic (hint: the true model is Q = 0, a constant-level model with µt = µ0 = 10)?

3. We now use our series zyt which has a singular shift in t = 50, see fig. 9. Try to fit a model with time-varying level Q > 0 to this series, plot the state vector and observe what happens when Q increases (fig. 11). Select Q 'by hand' according to one of the two out-of-sample optimization criteria (which Q would you choose based on in-sample criteria?).

> opti<-F

> R<-1

> Q<-0.1#Q<-0.1 Q<-10

> P10<-10000

> xi10<-0

> outofsample=T

> maxlik<-T

> parma<-c(sqrt(R),sqrt(Q),xi10)

> obj<-KF_level(parma,zy,opti,outofsample,maxlik,P10)

> file = paste("z_level_change_2.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ts.plot(obj$xitt,ylim=c(min(zy),max(zy)),col="blue",

+ main=paste("State vector simple level model: P_{1|0}=",P10,", Q=",Q,sep=""),xlab="",ylab="")

> lines(zy,col="black")

> lines(obj$xittm1,col="green",xlab="",ylab="")

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> dev.off()

windows

2

> # Criterion

> obj$logl

[1] 2.565157

5.7 Exercises Part III: the Q/R Ratio

1. We here analyze the effect of Q and R and, more specifically, the effect of the ratio Q/R, which is a signal-to-noise ratio: the variance of the innovation vt of the signal (the level) µt is compared to the noise variance R of wt in yt = µt + wt. For this purpose we assume that R and Q/R can be varied and that Q is determined by R * (Q/R). Write a short piece of code which applies this model to the series yt with constant level µt = µ0 = 10 and observe what happens when R is varied while keeping Q/R fixed. Plot the state vector and describe the effect of Q/R (fig. 12).


[Figure 11: Constant level model (shifted series zy): state vectors ξt|t−1 (green) and ξt|t (blue); P_{1|0}=10000, Q=0.1.]

> opti<-F

> QdR_1<-1.

> R<-1

> Q<-R*QdR_1

> P10<-1000*max(c(Q,R))

> xi10<-0

> # maxlik out-of-sample

> outofsample=T

> maxlik<-T

> parma<-c(sqrt(R),sqrt(Q),xi10)

> obj_1<-KF_level(parma,y,opti,outofsample,maxlik,P10)

> QdR_2<-.0001 #QdR<-10000

> Q<-R*QdR_2

> parma<-c(sqrt(R),sqrt(Q),xi10)

> obj_2<-KF_level(parma,y,opti,outofsample,maxlik,P10)

> file = paste("z_level_change_3.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> par(mfrow=c(2,1))

> ts.plot(obj_1$xitt,ylim=c(min(y),max(y)),col="blue",

+ main=paste("State vector simple level model: P_{1|0}=",P10,", Q/R=",QdR_1,sep=""),xlab="",ylab="")

> lines(y,col="black")

> lines(obj_1$xittm1,col="green",xlab="",ylab="")

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> ts.plot(obj_2$xitt,ylim=c(min(y),max(y)),col="blue",


+ main=paste("State vector simple level model: P_{1|0}=",P10,", Q/R=",QdR_2,sep=""),xlab="",ylab="")

> lines(y,col="black")

> lines(obj_2$xittm1,col="green",xlab="",ylab="")

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> dev.off()

windows

2

Figure 12: Constant level model: state vectors t|t-1 (green) and t|t (blue); panel titles: "State vector simple level model: P_{1|0}=1000, Q/R=1" (top) and "State vector simple level model: P_{1|0}=1000, Q/R=1e-04" (bottom)

5.8 Exercises Part IV: Optimization

Until now we have experimented 'by hand', providing more or less ad-hoc values for R and Q. We now try to estimate these parameters based on the data y1, ..., yt (no a priori knowledge is assumed). For this purpose we want to compare the four optimization criteria: least-squares or maximum likelihood, in- or out-of-sample. We analyze the series with constant level yt as well as the series zyt with the singular shift in t = 50.

1. Compute an optimal estimate of the parameter-vector parma <- c(√R, √Q, ξ1|0) for the constant-level series yt according to the maximum likelihood out-of-sample criterion. Use the numerical optimization 'workhorse' nlminb. Look at the obtained solution (√R^opt, √Q^opt, ξ^opt_{1|0}) (short comment) and plot the resulting state-vector obtained by plugging (√R^opt, √Q^opt, ξ^opt_{1|0}) into the KF (fig.13). Print the criterion value.

> P10=1000

> # For the numerical optimization we have to set opti<-T


> opti<-T

> outofsample<-T

> maxlik<-T

> objopt<-nlminb(start=c(0.5,0.5,0.0),objective=KF_level,

+ y=y,opti=opti,outofsample=outofsample,maxlik=maxlik,P10=P10)

> parma<-objopt$par

> parma

[1] 0.92461025 -0.04243299 9.72870914

> # After optimization we set opti<-F and we generate the state-vector based

> # on the optimal solution

> opti<-F

> obj<-KF_level(parma,y,opti,outofsample,maxlik,P10)

> # Criterion value

> obj$logl

[1] 0.9736482

> file = paste("z_level_change_4.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ts.plot(obj$xitt,ylim=c(min(y),max(y)),col="blue",

+ main=paste("maxlik=",maxlik,", outofsample=",outofsample,", solution=",

+ round(parma[1],3),",",round(parma[2],3),",",round(parma[3],2),sep=""),

+ xlab="",ylab="")

> lines(y,col="black")

> lines(obj$xittm1,col="green",xlab="",ylab="")

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> dev.off()

windows

2

2. Same exercise as above for the remaining optimization criteria: least-squares out-of-sample, maximum likelihood in-sample and least-squares in-sample. Analyze the resulting optimal solutions (√R^opt, √Q^opt, ξ^opt_{1|0}) and plot the corresponding state-vectors. Compare criterion values, solutions and state-vectors for all four criteria. Which criterion is the 'best'? The following piece of code computes the in-sample maximum likelihood (the least-squares estimates are not shown here), see fig.14.

> P10=1000

> # For the numerical optimization we have to set opti<-T

> opti<-T

> outofsample<-F

> maxlik<-T

> objopt<-nlminb(start=c(0.5,0.5,0.0),objective=KF_level,

+ y=y,opti=opti,outofsample=outofsample,maxlik=maxlik,P10=P10)

> parma<-objopt$par

> parma

[1] 3.057741e-15 -2.319308e-10 -9.816584e-08


Figure 13: Constant level model: state vectors t|t-1 (green) and t|t (blue); panel title: "maxlik=TRUE, outofsample=TRUE, solution=0.925,-0.042,9.73"

> # After optimization we set opti<-F and we generate the state-vector based

> # on the optimal solution

> opti<-F

> obj<-KF_level(parma,y,opti,outofsample,maxlik,P10)

> # Criterion value

> obj$logl

[1] -43.00578

> file = paste("z_level_change_5.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ts.plot(obj$xitt,ylim=c(min(y),max(y)),col="blue",

+ main=paste("maxlik=",maxlik,", outofsample=",outofsample,", solution=",

+ round(parma[1],3),",",round(parma[2],3),",",round(parma[3],2)),xlab="",ylab="")

> lines(y,col="black")

> lines(obj$xittm1,col="green",xlab="",ylab="")

> mtext("Xi_{t|t-1}", side = 3, line = -1,at=len/2,col="green")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> dev.off()

windows

2


Figure 14: Constant level model: state vectors t|t-1 (green) and t|t (blue); panel title: "maxlik=TRUE, outofsample=FALSE, solution=0, 0, 0"
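For the criteria whose code is not shown in the text, here is a hedged sketch for the out-of-sample least-squares optimization (our own code; the in-sample least-squares case follows by setting outofsample<-F; output omitted):

> opti<-T
> outofsample<-T
> maxlik<-F
> objopt_ls<-nlminb(start=c(0.5,0.5,0.0),objective=KF_level,
+ y=y,opti=opti,outofsample=outofsample,maxlik=maxlik,P10=P10)
> objopt_ls$par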

6 Regression Analysis

6.1 Classical Fixed Coefficients Regression

Let yt be determined by a set of explanatory variables x1t, ..., xpt:

yt = β0 + β1x1t + ...+ βpxpt + εt (14)

where εt is an iid Gaussian sequence with Var(εt) = σ² and where the unknown coefficients β0, β1, ..., βp must be determined. We can express this equation in terms of a SSM with system-components

yt = yt

ξ′t = (β0, β1, ..., βp)

wt = εt

H′_{t,p+1} = (1, x1t, ..., xpt)

Q_{(p+1)×(p+1)} = 0

R = σ²

F_{(p+1)×(p+1)} = Id_{(p+1)×(p+1)}

where we added the dimensions of the system matrices as indices. Thus

yt = H′tξt + wt

ξt+1 = Fξt


replicates (14) (note that vt = 0 since Q = 0). We can provide the initial values

ξ′1|0 = (0, 0, ..., 0)

P1|0 = ∞ · Id_{(p+1)×(p+1)}

(P1|0 is a diffuse prior) to replicate the classical least-squares linear regression estimate b of (β0, ..., βp), see the exercises below.

Remark: We don't need to know R = σ² to obtain the estimate b of β because Q = 0, i.e. the signal-to-noise ratio Q/R = 0 for any R > 0. We can thus set R = 1 or R = 0.1 or R = 10 without affecting the estimates, see the exercises below.
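To see why R drops out (this reasoning step is our addition): the Kalman gain is

Kt = P_{t|t−1} Ht (H′t P_{t|t−1} Ht + R)^{−1}

With Q = 0 and F = Id, rescaling R → c·R together with P1|0 → c·P1|0 rescales every P_{t|t−1} by c and leaves Kt, and hence the coefficient estimates, unchanged: only the ratio P1|0/R matters, and with a diffuse prior it is effectively infinite for any finite R. The KF then coincides with the classical recursive least-squares (RLS) algorithm.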

6.2 Adaptive Regression

The elegance of the SSM-approach is best illustrated by allowing the regression coefficients to become time-dependent in (14):

yt = β0t + β1tx1t + ...+ βptxpt + εt

This is obtained by letting Q > 0 in the state-equation i.e.

ξt+1 = ξt + vt

where now (some or all of) the variances Var(vkt) = Qkk of the innovations vkt are larger than zero in the state-equation. In contrast to the previous fixed-coefficient model, the signal-to-noise ratio Q/R > 0 is now crucial in determining the amount of adaptivity. Therefore, both variances must be estimated (by relying on any of the out-of-sample criteria), see the following exercises.

6.3 A More General KF-Code

We here consider a more general function which implements the KF for univariate yt when Ft and Ht are not necessarily identities.

> # -parma collects the square roots of the variances: parma=(sqrt(Q),sqrt(R)), squared inside.
> #  R is a scalar, Q is a diagonal matrix

> # -y is the data

> # -opti is a Boolean: if T then one of the four criteria (determined by maxlik and

> # outofsample) is returned;

> # otherwise, the criterion as well as the state vector and the variances are returned

> # -outofsample and maxlik are Booleans which determine the optimization criterion (

> # least-squares or ML, in- or out-of-sample)

> # -xi10 and P10 are the initial values for starting the Kalman recursions.

> # Their dimension must correspond to Q i.e. to length(parma)-1

> # -Fm and H correspond to the system matrices

> # -if time_varying<-T then the system matrix H is filled with the data

> # (model with time varying parameters)

>

> KF_gen<-function(parma,y,opti,outofsample,maxlik,xi10,P10,Fm,H,time_varying,x_data)

+ {

+ len<-length(y)

+ if (length(parma)>2)

+ {

+ Q<-diag(parma[1:(length(parma)-1)]^2)

+ } else

+ {


+ Q<-as.matrix(parma[1]^2)

+ }

+ R<-parma[length(parma)]^2

+ Pttm1<-array(dim=c(dim(Q),len+1))

+ Pttm1[,,1:(len+1)]<-P10

+ Ptt<-array(dim=c(dim(Q),len))

+ xttm1<-xi10

+ logl<-0.

+ xittm1<-matrix(nrow=len,ncol=(length(parma)-1))

+ xittm1[1,]<-xi10

+ xitt<-xittm1

+ # If time_varying==T then we fill data into H (SSM with time-varying coefficients)

+ if (time_varying)

+ {

+ if (is.null(x_data))

+ {

+ # For an autoregressive model we fill in past y's
+ H<-c(y[1:dim(Q)[2]])

+ # We need the first y[1:p] in H: therefore the first equation will be for t=p+1

+ anf<-dim(Q)[2]+1

+ } else

+ {

+ # For a regression model we fill in the explanatory data

+ H<-x_data[1,]

+ anf<-1

+ }

+ } else

+ {

+ anf<-1

+ }

+ # Kalman-Recursion: starts in i=dim(Q)[2]+1 for a time series model

+ # and in i=1 for a regression model

+

+ for (i in anf:len) #i<-1 H<-c(1,0) xitt[,2]

+ {

+ # Kalman-Gain

+ He<-(H%*%(Pttm1[,,i]%*%H))[1,1]+R

+ epshatoutofsample<-y[i]-(H%*%xttm1)[1,1]

+ xittm1[i,]<-xttm1

+ xtt<-xttm1+Pttm1[,,i]%*%H*epshatoutofsample/He

+ epshatinsample<-y[i]-(H%*%xtt)[1,1]

+ xitt[i,]<-xtt

+ xttm1<-Fm%*%xtt

+ Ptt[,,i]<-Pttm1[,,i]-((Pttm1[,,i]%*%H)%*%(H%*%Pttm1[,,i]))/He

+ Pttm1[,,i+1]<-Fm%*%Ptt[,,i]%*%t(Fm)+Q

+ if (time_varying)

+ {

+ if (is.null(x_data))

+ {

+ # For an autoregressive model we fill past y's in H

+ H<-c(y[i-dim(Q)[2]+1:(dim(Q)[2])])

+ } else

+ {

+ # For a regression model we fill the explanatory data in H


+ H<-x_data[min(len,i+1),]

+ }

+ }

+ # Here we specify the optimization criterion: least-squares

+ # or maximum likelihood, in-sample or out-of-sample

+ if (outofsample)

+ {

+ if (maxlik)

+ {

+ logl<-logl+log(He)+epshatoutofsample^2/He

+ } else

+ {

+ logl<-logl+epshatoutofsample^2

+ }

+ } else

+ {

+ if (maxlik)

+ {

+ logl<-logl+log(He)+epshatinsample^2/He

+ } else

+ {

+ logl<-logl+epshatinsample^2

+ }

+ }

+ }

+ if (opti)

+ {

+ return(logl/len)

+ } else

+ {

+ return(list(logl=logl/len,xitt=xitt,xittm1=xittm1,Ptt=Ptt,Pttm1=Pttm1))

+ }

+ }

>

6.4 Exercises

We here provide two series of exercises.

• The first one is concerned with a replication of the classical fixed-coefficient regression model. Specifically, we show that the general KF-function KF_gen proposed above is able to replicate the standard lm-function in R. We also explore the effect of R and/or P1|0 in this classical setting. Finally, we assume a completely general adaptive design and estimate R and Q: as we shall see, the estimates are remarkably close to the true values R = 1 and Q = 0, such that the resulting (in principle much more general) regression-estimates will be virtually identical with the optimal estimates obtained by lm.

• In the second series of exercises we simulate a process with time-varying coefficients. We then estimate the system matrices R and Q by KF_gen and we shall see that the estimates are once again remarkably close to the true values, particularly given the relatively short sample length (100 observations only). We show that the resulting filtered (real-time) coefficients track the true time-varying coefficients remarkably closely. The classical fixed-coefficient routine lm in R misses the true (time-varying) dependency-structure, as was to be expected.


6.4.1 Exercises: Replicating the Classical (Fixed Coefficient) Regression Model

1. Generate an artificial simulated regression example based on

yt = 10 + 2x1t + 3x2t + 4x3t + εt

where xit, i = 1, 2, 3 are three independent random-walk processes.

> len<-100

> set.seed(1)

> ndim<-4

> x_data<-matrix(ncol=ndim,nrow=len)

> x_data[,1]<-rep(1,len)

> for (i in 1:3)

+ {

+ x_data[,i+1]<-cumsum(rnorm(len))

+ }

> beta<-c(10,2,3,4)

> y<-x_data%*%beta+rnorm(len)

Estimate the unknown regression coefficients by the classical least-squares algorithm (function lm in R).

> lm_obj<-lm(y~x_data-1)

> summary(lm_obj)$coef

Estimate Std. Error t value Pr(>|t|)

x_data1 10.157851 0.20493584 49.56601 3.389436e-70

x_data2 1.979280 0.03004362 65.88021 1.015783e-81

x_data3 3.003579 0.03121184 96.23202 2.800366e-97

x_data4 4.028686 0.05083690 79.24728 2.774412e-89

2. Apply the new KF-code provided above and set time_varying = T (the matrix Ht is time-dependent since it is filled with the explanatory data). Set F = Id, R = 1, Q = 0, ξ1|0 = 0 and use a diffuse prior where P1|0 is 'large' (not too large because of numerical rounding errors).

> Fm<-diag(rep(1,ndim))

> xi10<-rep(0,ndim)

> # Diffuse prior

> P10<-diag(rep(1000000,ndim))

> # We have a time-varying design where the H-matrix depends on the explanatory data

> time_varying<-T

> R<-1

> # The diagonal of the Q-matrix corresponds to the first

> # ndim-elements of parma (they are zero)

> # The last element of parma is R (it is equal to one)

> parma<-c(rep(0,ndim),R)

> # H can be initialized arbitrarily since it will be specified in KF_gen

> H<-rep(0,ndim)

• Print the criterion value, the parameter estimates ξT|T obtained at the sample end T = len = 100 and the standard deviations (square roots of the diagonal of PT|T) of the estimates at the sample end. Compare these estimates with the above lm-output.

> obj_1<-KF_gen(parma,y,opti,outofsample,maxlik,xi10,P10,Fm,

+ H,time_varying,x_data)

> # Criterion

> obj_1$logl


[1] 1.443574

> # State vector at the sample end

> obj_1$xitt[len,]

[1] 10.157851 1.979280 3.003579 4.028686

> # Standard deviations (on the diagonal of Ptt[,,len] at the sample end t=len)

> sqrt(diag(obj_1$Ptt[,,len]))

[1] 0.20517021 0.03007797 0.03124754 0.05089504

• Plot the coefficient-estimates over the whole time span and briefly comment on the result, see fig.15.

> file = paste("z-reg_coef.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ymin<-min(apply(obj_1$xitt,1,min))

> ymax<-max(apply(obj_1$xitt,1,max))

> ts.plot(obj_1$xitt[,1],ylim=c(ymin,ymax),col="blue",

+ main=paste("R=",R,", P10=",P10[1],sep=""),xlab="",ylab="")

> colo<-c("blue","red","green","brown")

> for (i in 2:ndim)

+ lines(obj_1$xitt[,i],col=colo[i])

> dev.off()

windows

2

Figure 15: Regression estimates: classical fixed-coefficient model (diffuse prior); panel title: "R=1, P10=1e+06"

• What happens if R = 0.1 or R = 10 (instead of R = 1)?


• What happens if R = 1 and P1|0 is 'small', say P1|0 = Id_{(p+1)×(p+1)} or P1|0 = 0.01·Id_{(p+1)×(p+1)}? Compute the new parameter-estimates and plot the state-vector, see fig.16.

> P10<-diag(rep(1,ndim))

> obj<-KF_gen(parma,y,opti,outofsample,maxlik,xi10,P10,Fm,H,time_varying,x_data)

> file = paste("z-reg_coef_small.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ymin<-min(apply(obj$xitt,1,min))

> ymax<-max(apply(obj$xitt,1,max))

> ts.plot(obj$xitt[,1],ylim=c(ymin,ymax),col="blue",

+ main=paste("R=",R,", P10=",P10[1],sep=""),xlab="",ylab="")

> colo<-c("blue","red","green","brown")

> for (i in 2:ndim)

+ lines(obj$xitt[,i],col=colo[i])

> dev.off()

windows

2

Figure 16: Regression estimates: non-classical (P10 is 'small') but fixed-coefficient model; panel title: "R=1, P10=1"

3. We now assume a practically more realistic setting where we do not know whether the regression coefficients are fixed or not. In such a case we have to estimate R and Q.

• Optimize Q and R = σ². Use a diffuse prior P10 <- diag(rep(1000000, ndim)), set an arbitrary initialization value for the unknown variances, parma <- rep(sqrt(var(y)), ndim+1), and set opti<-T, outofsample<-T, maxlik<-T (out-of-sample maximum likelihood optimization).


> # Some arbitrary initialization

> parma<-rep(sqrt(var(y)),ndim+1)

> # Diffuse prior

> P10<-diag(rep(1000000,ndim))

> # Numerical optimization

> opti<-T

> outofsample<-T

> maxlik<-T

> objopt<-nlminb(start=parma,objective=KF_gen,

+ y=y,opti=opti,outofsample=outofsample,maxlik=maxlik,xi10=xi10,P10=P10,Fm=Fm,

+ H=H,time_varying=time_varying,x_data=x_data)

> parma<-objopt$par

>

• Print the optimized parameter-vector parma corresponding to the variances in Q and R (note that parma is squared in KF_gen, so we have to square it in order to obtain the variances); estimate the resulting regression coefficients and plot the optimal state-vector, see fig.17.

> parma^2

[1] 1.478343e-16 1.048485e-17 5.983261e-04 2.833801e-17

[5] 9.317224e-01

> opti<-F

> # Determine state-vector for optimal parameter

> obj<-KF_gen(parma,y,opti,outofsample,maxlik,xi10,P10,Fm,H,time_varying,x_data)

> file = paste("z-reg_coef_var.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ymin<-min(apply(obj$xitt,1,min))

> ymax<-max(apply(obj$xitt,1,max))

> ts.plot(obj$xitt[,1],ylim=c(ymin,ymax),col="blue",

+ main=paste("Optimized Variances: R=",round(parma[length(parma)]^2,3),", Q=",round(parma[1]^2,3),",",round(parma[2]^2,3),

+ ",",round(parma[3]^2,3),",",round(parma[4]^2,3),sep=""),xlab="",ylab="")

> colo<-c("blue","red","green","brown")

> for (i in 2:ndim)

+ lines(obj$xitt[,i],col=colo[i])

> dev.off()

null device

1

We see that the optimized variances Q = diag(0, 0, 0.000598326, 0) and R = 0.93 are remarkably close (up to rounding errors) to the true values. Therefore, the resulting regression estimates (10.171, 1.99, 3.012, 4.063) are also remarkably close to the classical least-squares estimates (10.158, 1.979, 3.004, 4.029). This is an interesting outcome because our SSM-model is much more general (it is adaptive) than the classical fixed-coefficient linear regression approach (which is optimal here because we simulated the data accordingly).

6.4.2 Exercises: Adaptive Regression

1. Generate an artificial simulated regression example based on

yt = b0t + b1tx1t + b2tx2t + b3tx3t + εt

where xit, i = 1, 2, 3 are three independent random-walk processes and the regression coefficients bit, i = 0, ..., 3 are now random-walks too (instead of being fixed coefficients).


Figure 17: Optimized hyper-parameters: regression estimates; panel title: "Optimized Variances: R=0.932, Q=0,0,0.001,0"

> len<-100

> set.seed(1)

> ndim<-4

> x_data<-matrix(ncol=ndim,nrow=len)

> x_data[,1]<-rep(1,len)

> for (i in 1:3)

+ {

+ x_data[,i+1]<-cumsum(rnorm(len))

+ }

> # The regression coefficients are random-walks too

> beta<-cbind(cumsum(rnorm(len)),cumsum(rnorm(len)),cumsum(rnorm(len)),cumsum(rnorm(len)))

> y<-apply(x_data*beta,1,sum)+rnorm(len)

2. Estimate the unknown coefficients by the classical least-squares algorithm (function lm in R) and print the estimates.

> lm_obj<-lm(y~x_data-1)

> summary(lm_obj)$coef

Estimate Std. Error t value Pr(>|t|)

x_data1 23.189350 3.9077993 5.934120 4.676920e-08

x_data2 -6.498934 0.5728838 -11.344245 1.996500e-19

x_data3 -1.735849 0.5951600 -2.916608 4.405676e-03

x_data4 -1.670489 0.9693785 -1.723258 8.806138e-02

3. Apply the above general KF_gen function to determine the time-varying coefficients.


• Optimize Q and R = σ². Use a diffuse prior P10 <- diag(rep(1000000, ndim)), set an arbitrary initialization value for the unknown variances, parma <- rep(var(y), ndim+1), and set opti<-T, outofsample<-T, maxlik<-T (out-of-sample maximum likelihood optimization); use ξ1|0 = 0 and F = Id.

> # Some arbitrary initialization

> parma<-rep(var(y),ndim+1)

> xi10<-rep(0,ndim)

> # Diffuse prior

> P10<-diag(rep(1000000,ndim))

> Fm<-diag(rep(1,ndim))

> # Numerical optimization

> opti<-T

> outofsample<-T

> maxlik<-T

> objopt<-nlminb(start=parma,objective=KF_gen,

+ y=y,opti=opti,outofsample=outofsample,maxlik=maxlik,xi10=xi10,P10=P10,Fm=Fm,

+ H=H,time_varying=time_varying,x_data=x_data)

> parma<-objopt$par

• Print the optimized squared parameter-vector parma² corresponding to the variances in Q and R; estimate the resulting regression coefficients and plot the optimal state-vector, see fig.18.

> parma^2

[1] 1.834509e+00 1.886701e+00 5.012875e-01 1.985483e+00

[5] 1.063802e-11

> opti<-F

> # Determine state-vector for optimal parameter

> obj<-KF_gen(parma,y,opti,outofsample,maxlik,xi10,P10,Fm,H,time_varying,x_data)

> file = paste("z-reg_coef_varvar.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> par(mfrow=c(2,1))

> ymin<-min(apply(obj$xitt,1,min))

> ymax<-max(apply(obj$xitt,1,max))

> ts.plot(obj$xitt[,1],ylim=c(ymin,ymax),col="blue",

+ main=paste("Optimized variances: R=",round(parma[length(parma)]^2,3),", Q=",round(parma[1]^2,3),",",round(parma[2]^2,3),

+ ",",round(parma[3]^2,3),",",round(parma[4]^2,3),sep=""),xlab="",ylab="")

> colo<-c("blue","red","green","brown")

> for (i in 2:ndim)

+ lines(obj$xitt[,i],col=colo[i])

> for (i in 1:ndim)

+ lines(beta[,i],col=colo[i],lty=2)

> ymin<-min(apply(beta,1,min))

> ymax<-max(apply(beta,1,max))

> ts.plot(rep(summary(lm_obj)$coef [1,1],len),ylim=c(ymin,ymax),col="blue",

+ main=paste("Classical linear (fixed-coefficient) estimates vs. true coefficients",sep=""),xlab="",ylab="")

> for (i in 2:ndim)

+ lines(rep(summary(lm_obj)$coef [i,1],len),col=colo[i])

> for (i in 1:ndim)

+ lines(beta[,i],col=colo[i],lty=2)

> dev.off()

null device

1



Figure 18: Top graph: optimized hyper-parameters and resulting regression estimates (solid lines) as compared to the true time-varying coefficients (dotted lines). Classical linear regression estimates (flat lines) are plotted in the bottom graph.

We see that the true time-varying coefficients (dotted) are tracked quite well by the filtered estimates ξt|t (solid lines) in the top graph. The classical fixed estimates obtained by lm in the bottom graph are misspecified and thus cannot track the true coefficients, as expected. Also, the optimized variances Q = diag(1.834508851, 1.886701131, 0.501287497, 1.985482506) are remarkably close to the true (unity) values, particularly since we have only a short sample of length 100 at our disposal. The estimated variance of the observation equation, R = 0, is misspecified, however, since the true value is R = 1. But this cosmetic issue should not distract from the relevant, remarkably good tracking of the time-varying coefficients by the general KF-function KF_gen in the top graph of fig.18.


7 Forecasting, Smoothing and Interpolation

Forecasting concerns the estimation of ξt+s or yt+s, s > 0, given data y1, ..., yt up to t. Filtering concerns the case s = 0, while smoothing addresses s < 0.

7.1 Forecasting

Consider the state and observation equations

ξt+1 = Ft+1ξt + vt+1

yt = H′tξt + wt

By recursive substitution this becomes

ξ_{t+s} = (∏_{k=1}^{s} F_{t+k}) ξt + (∏_{k=2}^{s} F_{t+k}) v_{t+1} + (∏_{k=3}^{s} F_{t+k}) v_{t+2} + ... + F_{t+s} v_{t+s−1} + v_{t+s}

Therefore, the best forecast of ξ_{t+s} is

ξ_{t+s|t} = (∏_{k=1}^{s} F_{t+k}) ξt|t    (15)

and the variance of the forecast error is

P_{t+s|t} = (∏_{k=1}^{s} F_{t+k}) P_{t|t} (∏_{k=1}^{s} F_{t+k})′
+ (∏_{k=2}^{s} F_{t+k}) Q_{t+1} (∏_{k=2}^{s} F_{t+k})′
+ (∏_{k=3}^{s} F_{t+k}) Q_{t+2} (∏_{k=3}^{s} F_{t+k})′
+ ... + F_{t+s} Q_{t+s−1} F′_{t+s} + Q_{t+s}    (16)

This expression can be written recursively as

P_{t+k|t} = F_{t+k} P_{t+k−1|t} F′_{t+k} + Q_{t+k}

starting with k = 1 and unfolding the whole sequence up to k = s.
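As a short worked illustration (our addition), for s = 2 the recursion unfolds as

P_{t+1|t} = F_{t+1} P_{t|t} F′_{t+1} + Q_{t+1}
P_{t+2|t} = F_{t+2} P_{t+1|t} F′_{t+2} + Q_{t+2} = F_{t+2}F_{t+1} P_{t|t} F′_{t+1}F′_{t+2} + F_{t+2} Q_{t+1} F′_{t+2} + Q_{t+2}

which is exactly (16) for s = 2.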

The best forecast of the data y_{t+s} is

ŷ_{t+s} = H′_{t+s} ξ_{t+s|t}    (17)

and the variance of the forecast error is

Var(y_{t+s} − ŷ_{t+s}) = Var(H′_{t+s}(ξ_{t+s} − ξ_{t+s|t}) + w_{t+s}) = H′_{t+s} P_{t+s|t} H_{t+s} + R_{t+s}    (18)

7.2 Smoothing

We did not find any practical utility in smoothing and therefore we ignore these issues. The interested reader is referred to section 13.6 in Hamilton (1994).

7.3 Interpolation

Missing observations yt = NA can be interpolated very effectively by replacing unobserved data (NA's) by forecasts (or smoothers when filling back in time). In such a case equations 6 and 8 in the KF are replaced by identities, i.e.

ξt|t = ξt|t−1

Pt|t = Pt|t−1

44

Page 45: An Introduction to State Space Models - ZHAW Blogs · PDF fileAn Introduction to State Space Models Marc Wildi May 6, ... (as interfaced by dlmodeler) ... • Classical linear regression

because no data is available for up-dating the state-vector. If several values yt, ..., yt+k are missing, then this proceeding can be applied k+1 times. The missing data can be interpolated via H′_{t+s} ξ_{t+s|t−1}, s = 0, ..., k, where ξ_{t+s|t−1} = (∏_{j=0}^{s} F_{t+j}) ξ_{t−1|t−1}.
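A minimal sketch of this device for the simple constant-level model (the function kf_level_na is our own illustration, not code from the text): when y[i] is NA the update step is skipped, i.e. ξt|t = ξt|t−1 and Pt|t = Pt|t−1, and the missing value is interpolated by the one-step ahead forecast.

> kf_level_na<-function(y,R=1,Q=0.01,xi10=0,P10=10000)
+ {
+ len<-length(y)
+ xitt<-numeric(len)
+ yhat<-numeric(len)
+ xttm1<-xi10
+ Pttm1<-P10
+ for (i in 1:len)
+ {
+ # one-step ahead forecast H'xi_{t|t-1}: interpolates y[i] when it is missing
+ yhat[i]<-xttm1
+ if (is.na(y[i]))
+ {
+ # no up-date possible: xi_{t|t}=xi_{t|t-1} and P_{t|t}=P_{t|t-1}
+ xtt<-xttm1
+ Ptt<-Pttm1
+ } else
+ {
+ He<-Pttm1+R
+ xtt<-xttm1+Pttm1*(y[i]-xttm1)/He
+ Ptt<-Pttm1-Pttm1^2/He
+ }
+ xitt[i]<-xtt
+ # state equation with F=1
+ xttm1<-xtt
+ Pttm1<-Ptt+Q
+ }
+ list(xitt=xitt,yhat=yhat)
+ }
> # Usage: interpolate a gap of five missing values in a constant-level series
> y_na<-c(rnorm(20,10),rep(NA,5),rnorm(20,10))
> kf_level_na(y_na)$yhat[21:25]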

7.4 Exercises

7.4.1 Exercises Forecasting Part I: Varying Level and Slope

1. Consider the variable level and slope model in section 2.3:

ξt = (µt, dt)′, yt = yt, vt = (v1t, v2t)′, wt = wt

H′t = (1, 0)

Ft = (1 1; 0 1)

where we set a = 1 in Ft = F (the state-transition matrix is fixed: it does not depend on time t). This SSM replicates the varying level and slope model:

yt = µt + wt
µt = dt−1 + µt−1 + v1t
dt = dt−1 + v2t

Simulate a series zt of length 100 with varying level and slope. Set Q = diag(0.00001, 0.000001) and R = 1:

> set.seed(10)

> len<-100

> parma<-c(sqrt(0.00001),sqrt(0.000001),1) # Q[1,1],Q[2,2], R

> Q<-diag(parma^2)

> R<-1

> d<-cumsum(sqrt(Q[1,1])*rnorm(len))

> mu<-cumsum(d+sqrt(Q[2,2])*rnorm(len))

> z<-mu+R*rnorm(len)

This series (with this set.seed) was plotted in fig.5.

2. Apply the new function KF_gen(parma, y, opti, outofsample, maxlik, xi10, P10, Fm, H, time_varying) to zt. Use a diffuse prior (a large P1|0) and try various parameter settings for parma = (√Q11, √Q22, √R). Check the value of the optimization criterion (maximum likelihood out-of-sample) and plot the 2-dimensional state-vector: the level as well as the slope, see fig.19.

> Fm<-diag(c(1,1))

> Fm[1,2]<-1

> H<-c(1,0)

> xi10<-c(0,0)

> P10<-diag(rep(10000,2))

> time_varying<-F

> # We don't have explanatory data: therefore we initialize x_data with NULL

> x_data<-NULL

> obj<-KF_gen(parma,z,opti,outofsample,maxlik,xi10,P10,Fm,H,time_varying,x_data)

> obj$logl


[1] 1.273896

> file = paste("z_forecast_1.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> par(mfrow=c(2,1))

> ts.plot(obj$xitt[,1],ylim=c(min(z),max(z)),col="blue",

+ main=paste("maxlik=",maxlik,", outofsample=",outofsample,", parm=",

+ as.integer(parma[1]*100)/100,",",

+ as.integer(parma[2]*100)/100,",",as.integer(parma[3]*100)/100,sep=""),xlab="",ylab="")

> lines(z,col="black")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> ts.plot(obj$xitt[,2],ylim=c(min(obj$xitt[,2]),max(obj$xitt[,2])),col="blue",

+ main=paste("Slope",sep=""),xlab="",ylab="")

> dev.off()

null device

1

Figure 19: Data and changing level (top) and slope (bottom); panel titles: "maxlik=TRUE, outofsample=TRUE, parm=0,0,1" and "Slope"

3. Compute 1-10 steps ahead 95% forecast-intervals of the data, of the level and of the slope, according to section 7.1. Plot the data, the level and the slope together with appended forecast intervals, see fig.20. We note that the forecast intervals for the slope and the level both start with a large gap because the estimate ξ_{100|100} might be more or less far away from the true unobserved state vector ξ_{100}. After the initial gap, the intervals spread increasingly with the forecast horizon, but the rate of divergence is relatively modest because the noise variances are quite small. The forecast interval for the future data y101, ..., y110 is broadest in this example, because we have to account for the additional noise-disturbances w101, ..., w110.

> horizon<-10

> fore_cast<-matrix(ncol=2,nrow=horizon)

> interval<-array(dim=c(2,2,horizon))

> std_state<-fore_cast

> std_data<-1:horizon

> Q<-diag(parma[1:2]^2)

> # Compute forecast intervals

> for (i in 1:horizon) #i<-1

+ {

+ if (i==1)

+ {

+ fore_cast[i,]<-Fm%*%obj$xitt[len,]

+ interval[,,i]<-Fm%*%obj$Ptt[,,len]%*%t(Fm)+Q

+ } else

+ {

+ fore_cast[i,]<-Fm%*%fore_cast[i-1,]

+ interval[,,i]<-Fm%*%interval[,,i-1]%*%t(Fm)+Q

+ }

+ # variances of state-vector forecasts

+ std_state[i,]<-diag(interval[,,i])

+ # Variance for data forecast

+ std_data[i]<-H%*%interval[,,i]%*%H+R

+ }

> # Plot forecasts

> file = paste("z_forecast_2.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ymin<-min(c(z,fore_cast[,1]-2*sqrt(std_data)))

> ymax<-max(c(z,fore_cast[,1]+2*sqrt(std_data)))

> par(mfrow=c(3,1))

> ts.plot(c(z,rep(NA,length(fore_cast[,1]))),ylim=c(ymin,ymax),col="black",

+ main=paste("parm=",

+ as.integer(parma[1]*100)/100,",",

+ as.integer(parma[2]*100)/100,",",as.integer(parma[3]*100)/100,sep=""),xlab="",ylab="")

> lines(c(rep(NA,length(z)),fore_cast[,1]),col="blue",lty=1)

> lines(c(rep(NA,length(z)),fore_cast[,1]-2*sqrt(std_data)),col="blue",lty=2)

> lines(c(rep(NA,length(z)),fore_cast[,1]+2*sqrt(std_data)),col="blue",lty=2)

> mtext("Original Data", side = 3, line = -1,at=len/2,col="black")

> mtext("Forecasts of Data", side = 3, line = -2,at=len,col="blue")

> ymin<-min(c(mu,fore_cast[,1]-2*sqrt(std_state[,1])))

> ymax<-max(c(mu,fore_cast[,1]+2*sqrt(std_state[,1])))

> ts.plot(c(mu,rep(NA,length(fore_cast[,2]))),ylim=c(ymin,ymax),col="black",

+ main=paste("Level",sep=""),xlab="",ylab="")

> lines(c(obj$xitt[,1],fore_cast[,1]),col="blue",lty=1)

> lines(c(obj$xitt[,1],fore_cast[,1]-2*sqrt(std_state[,1])),col="blue",lty=2)

> lines(c(obj$xitt[,1],fore_cast[,1]+2*sqrt(std_state[,1])),col="blue",lty=2)

> mtext("True Level", side = 3, line = -1,at=len/2,col="black")

> mtext("Estimated level and forecast appended", side = 3, line = -2,at=len/2,col="blue")

> ymin<-min(c(d,fore_cast[,2]-2*sqrt(std_state[,2])))

> ymax<-max(c(d,fore_cast[,2]+2*sqrt(std_state[,2])))

> ts.plot(c(d,rep(NA,length(fore_cast[,2]))),ylim=c(ymin,ymax),col="black",

+ main=paste("Slope",sep=""),xlab="",ylab="")


> lines(c(obj$xitt[,2],fore_cast[,2]),col="blue",lty=1)

> lines(c(obj$xitt[,2],fore_cast[,2]-2*sqrt(std_state[,2])),col="blue",lty=2)

> lines(c(obj$xitt[,2],fore_cast[,2]+2*sqrt(std_state[,2])),col="blue",lty=2)

> mtext("True slope", side = 3, line = -1,at=len/2,col="black")

> mtext("Estimated slope and forecast appended", side = 3, line = -2,at=len/2,col="blue")

> dev.off()

null device

1

Figure 20: Forecast intervals of data (top), level (middle) and slope (bottom) for the model with varying level and slope; top-panel title: "parm=0,0,1"

4. Simulate new data based on different (larger) Q and compute forecasts with plots as above.

7.4.2 Exercises Interpolation: Varying Level and Slope

Exercise is in preparation.


7.4.3 Exercises Forecasting Part II: Changing Autocorrelation (AR(1))

We here address a difficult estimation problem which is related to a changing autocorrelation structure (rather than a changing level/slope). For this purpose we generate data from an artificial AR(1)-process whose AR-parameter evolves according to a sinusoid:

xt = at·xt−1 + wt
at = 0.9·cos(2πt/100 + π/2)

We then attempt to track this non-stationarity by relying on the general KF-function KF_gen(parma, y, opti, outofsample, maxlik, xi10, P10, Fm, H, time_varying).

1. Generate a realization of the above AR(1) process of length 100, see fig.21 top graph.

> set.seed(100)

> len<-100

> # centered AR-process (mean zero)

> mu<-rep(0,len)

> # AR(1)-parameter

> a1<-0.9*cos(2*pi*(1:len)/len+pi/2)

> x<-1:len

> R<-1

> eps<-sqrt(R)*rnorm(len)

> x[1]<-mu[1]/(1-a1[1])

> # Generate artificial series

> for (i in 2:len)

+ {

+ x[i]<-mu[i]+a1[i]*x[i-1]+eps[i]

+ }

2. Assume we estimate the time-dependent autocorrelation structure of the above non-stationary process by relying on the following simple SSM

yt = yt−1·ξt + wt
ξt+1 = ξt + vt+1

where Ht = yt−1 and F = 1. Note that we do not assume knowledge of the particular (deterministic) cosine shape: we just have a general adaptive model.

• Estimate the unknown parameters R and Q by relying on the general KF-function KF_gen(parma, y, opti, outofsample, maxlik, xi10, P10, Fm, H, time_varying). Hint: set time_varying <- T before the function-call in order to obtain the time-varying system-matrix Ht = yt−1.

• Compute the state-vector based on the optimal solution and plot it (see fig.21).

> P10=(1000)

> xi10<-0

> outofsample<-T

> maxlik<-T

> parma<-c(0,100)

> opti<-T

> time_varying<-T

> Fm<-1

> H<-1

> # We don't have explanatory data: therefore we initialize x_data with NULL

> x_data<-NULL


> # Numerical optimization

> objopt<-nlminb(start=parma,objective=KF_gen,

+ y=x,opti=opti,outofsample=outofsample,maxlik=maxlik,xi10=xi10,P10=P10,Fm=Fm,

+ H=H,time_varying=time_varying,x_data=x_data)

> parma<-objopt$par

> opti<-F

> # Determine state-vector for optimal parameter

> obj<-KF_gen(parma,x,opti,outofsample,maxlik,xi10,P10,Fm,H,time_varying,x_data)

> file = paste("z_forecast_3.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> par(mfrow=c(2,1))

> ts.plot(x,main="AR(1) with changing autocorrelation structure",xlab="",ylab="")

> ts.plot(obj$xitt[,1],ylim=c(-1.2,1.2),col="blue",

+ main=paste("Tracking the varying AR(1): Optimized Parameters=",

+ as.integer(parma[1]*100)/100,",",

+ as.integer(parma[2]*100)/100,sep=""),xlab="",ylab="")

> lines(a1,col="black")

> mtext("True AR-Parameter", side = 3, line = -1,at=len/2,col="black")

> mtext("Xi_{t|t}", side = 3, line = -2,at=len/2,col="blue")

> dev.off()

null device

1

Figure 21: Tracking the non-stationary autocorrelation structure; panel titles: "AR(1) with changing autocorrelation structure" (top) and "Tracking the varying AR(1): Optimized Parameters=-0.11,0.99" (bottom)

3. Generate 1-10 steps ahead forecasts for the data as well as for the state-vector. Plot the data as well as the forecasts, see fig.22.


> horizon<-10

> fore_cast<-matrix(ncol=1,nrow=horizon)

> interval<-array(dim=c(1,1,horizon))

> std_state<-fore_cast

> std_data<-1:horizon

> Q<-as.matrix(parma[1]^2)

> # Compute forecast intervals

> for (i in 1:horizon) #i<-1

+ {

+ if (i==1)

+ {

+ fore_cast[i,]<-Fm%*%obj$xitt[len,]

+ interval[,,i]<-Fm%*%obj$Ptt[,,len]%*%t(Fm)+Q

+ } else

+ {

+ fore_cast[i,]<-Fm%*%fore_cast[i-1,]

+ interval[,,i]<-Fm%*%interval[,,i-1]%*%t(Fm)+Q

+ }

+ # variances of state-vector forecasts

+ if (length(parma)>2)

+ {

+ std_state[i,]<-diag(interval[,,i])

+ } else

+ {

+ std_state[i,]<-interval[,,i]

+ }

+

+ # Variance for data forecast

+ std_data[i]<-H%*%interval[,,i]%*%H+R

+ }

> # Plot forecasts

> file = paste("z_forecast_4.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ymin<-min(c(x,fore_cast[,1]-2*sqrt(std_data)))

> ymax<-max(c(x,fore_cast[,1]+2*sqrt(std_data)))

> par(mfrow=c(2,1))

> ts.plot(c(x,rep(NA,length(fore_cast[,1]))),ylim=c(ymin,ymax),col="black",

+ main=paste("parm=",

+ as.integer(parma[1]*100)/100,",",

+ as.integer(parma[2]*100)/100,sep=""),xlab="",ylab="")

> lines(c(rep(NA,length(x)),fore_cast[,1]),col="blue",lty=1)

> lines(c(rep(NA,length(x)),fore_cast[,1]-2*sqrt(std_data)),col="blue",lty=2)

> lines(c(rep(NA,length(x)),fore_cast[,1]+2*sqrt(std_data)),col="blue",lty=2)

> mtext("Original Data", side = 3, line = -1,at=len/2,col="black")

> mtext("Forecasts of Data", side = 3, line = -2,at=len,col="blue")

> ymin<-min(c(a1,fore_cast[,1]-2*sqrt(std_state[,1])))

> ymax<-max(c(a1,fore_cast[,1]+2*sqrt(std_state[,1])))

> ts.plot(c(a1,rep(NA,length(fore_cast[,1]))),ylim=c(ymin,ymax),col="black",

+ main=paste("AR(1)",sep=""),xlab="",ylab="")

> lines(c(obj$xitt[,1],fore_cast[,1]),col="blue",lty=1)

> lines(c(obj$xitt[,1],fore_cast[,1]-2*sqrt(std_state[,1])),col="blue",lty=2)

> lines(c(obj$xitt[,1],fore_cast[,1]+2*sqrt(std_state[,1])),col="blue",lty=2)

> mtext("True AR(1)", side = 3, line = -1,at=len/2,col="black")

> mtext("Estimated AR(1) and forecast appended", side = 3, line = -2,at=len/2,col="blue")


> dev.off()

null device

1

Figure 22: Forecast intervals of the data (top) and of the time-varying AR(1)-parameter (bottom); panel titles: "parm=-0.11,0.99" and "AR(1)"

As can be seen, the sinusoidal shape of the AR(1)-parameter cannot be anticipated and replicated by the forecasts because we did not assume any particular functional form; instead, our model relies on a general adaptive scheme (a random-walk) for the state-vector. To be specific: our stochastic adaptive model is misspecified.

8 Time Series Decomposition

An additive decomposition of a time series into independent components can be performed quite easily by specifying the interesting components through block-diagonal entries in the state-transition matrix Ft and by letting Ht 'extract' the relevant components or elements from the state-vector. Independence is realized by specifying the innovations belonging to the respective components to be orthogonal (uncorrelated or ideally independent). The general SSM

yt = H′tξt + wt

ξt+1 = Ft+1ξt + vt+1

is then structured as follows:

Ft = diag(B1t, B2t, ..., Blt) (block-diagonal)

Ht = (e1t, e2t, ..., elt)

where the Bit are ni × ni dimensional matrices and the eit are ni-dimensional vectors, typically picking out the relevant components of the state-vector. As an example, a monthly series could be decomposed into trend+slope, SARMA(1, 0, 1)(1, 0, 1)_{12} and AR(2)-cycle as follows

B_{level/slope} = (1 1; 0 1)

B_{seasonal} = the 14 × 14 companion matrix of the seasonal ARMA, with first column (a1, 0, ..., 0, a12, −a1·a12, 0)′ and ones on the superdiagonal

B_{cycle} = (a1 1; a2 0)

e′1 = (1, 0)
e′2 = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
e′3 = (1, 0)

where all entities are time-invariant. The seasonal component

(1 − a1B)(1 − a12B^{12}) ξ_{3,t} = (1 + b1B)(1 + b12B^{12}) ε_{t,seasonal}

generated in the third element of the state vector ξt relies on the ARMA-specification proposed in section 3.3 (of dimension max(p, q+1) = 14). The block-diagonal (time-invariant) state-transition matrix F is (2+14+2) × (2+14+2)-dimensional and the (time-invariant) H is (2+14+2)-dimensional. The ei-vectors pick out the relevant trend+slope, seasonal and cycle components, which are added to the observation noise wt to give yt in the observation equation. The innovations vit belonging to these components are given by

v′1t = (ε_{t,level}, ε_{t,slope})
v′2t = (1, b1, 0, ..., 0, b12, b1·b12) ε_{t,seasonal}
v′3t = (ε_{t,cycle})

All three block-specific innovation vectors are stacked into a single vt. Under the assumption of independent components, the variance-covariance matrix Q becomes block-diagonal too:

Q = diag(Q1, Q2, Q3)

with

Q1 = diag(σ²_{level}, σ²_{slope})

Q2 = σ²_{seasonal}·u·u′, where u′ = (1, b1, 0, ..., 0, b12, b1·b12) is the loading vector of v2t (a 14 × 14 rank-one matrix with entries 1, b1, b1², b12, b1·b12, b12², ...)

Q3 = (σ²_{cycle} 0; 0 0)

Of course, we could add as many (unobserved) components as we wish, but one must be aware of the fact that more components invariably complicate the identification: it is not infrequent, for example, that the level component gets mixed up with the cycle, since both address a slowly decaying positive autocorrelation pattern. Also, the numerical estimation problem (whose complexity is often underrated) gets more tedious.
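A minimal sketch of the bookkeeping (our own illustration; all parameter values are arbitrary): assembling the block-diagonal F and the observation vector H for a trend+slope block (dimension 2), the seasonal block (dimension 14, first column holding the AR-coefficients and ones on the superdiagonal, as above) and an AR(2)-cycle block (dimension 2).

> a1<-0.2; a12<-0.5                       # seasonal AR-parameters (illustrative)
> c1<-1.4; c2<--0.7                       # cycle AR(2)-parameters (illustrative)
> B1<-matrix(c(1,0,1,1),2,2)              # B_level/slope = (1 1; 0 1)
> B2<-matrix(0,14,14)
> B2[,1]<-c(a1,rep(0,10),a12,-a1*a12,0)   # first column of B_seasonal
> B2[1:13,2:14]<-diag(13)                 # ones on the superdiagonal
> B3<-matrix(c(c1,c2,1,0),2,2)            # B_cycle = (a1 1; a2 0)
> Fmat<-matrix(0,18,18)                   # (2+14+2)x(2+14+2) block-diagonal F
> Fmat[1:2,1:2]<-B1
> Fmat[3:16,3:16]<-B2
> Fmat[17:18,17:18]<-B3
> H<-rep(0,18)                            # H stacks e_1, e_2 and e_3
> H[c(1,3,17)]<-1                         # pick level, seasonal and cycle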

9 A Completely Worked-Out Real-World Project from the Health-Care Sector

The Swiss Medical Association (acronym FMH) represents Swiss physicians, with roughly 40000 members in 2013. In 2004 a new tariff-mode called TARMED was introduced nationwide. Since then, the FMH and the national insurer, SUVA, meet on a regular basis to determine the value of the so-called taxpoint, which assigns a fixed nationwide monetary value to an elementary medical 'unit': each medical act can be decomposed into 'units' and thus costs can be decomposed accordingly. The purpose of this unifying tariff-system is to allow for more transparency through cross-sectional comparisons of cost-dynamics. Negotiations between both players, the FMH and the SUVA, rely on an analysis of historical and recent cost-growth. Unfortunately, the relevant systemic growth, the drift, of the cost-series is masked by undesirable seasonal variations as well as by multiple breaks/shifts, in particular by a singular shift corresponding to the introduction of the TARMED, see fig.23.

> # Set paths

> Path_MHK.dat<-

+ paste("C:\\wia_desktop\\Projekte\\2013\\Unterricht\\Eco3\\FMH\\Daten\\",sep="")

> Path_MHK.Par<-

+ "C:\\wia_desktop\\Projekte\\2013\\Unterricht\\Eco3\\FMH\\Parameter\\"

> Path_MHK.out<-

+ paste("C:\\wia_desktop\\Projekte\\2013\\Unterricht\\Eco3\\FMH\\Output\\",sep="")

> # Read the data

> z<-read.table(paste(Path_MHK.dat,"TARMED_costs.txt",sep=""),sep="\t")

> len<-dim(z)[1]

> # Make monthly time series

> y<-ts(z,frequency=12,start=c(2000,1))

> # Plot the data

> file = paste("z-MHK.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> ts.plot(y,main="Historical Costs and TARMED-Shift",xlab="",ylab="")

> mtext("Introduction of TARMED", side = 3, line = -1,at=2004,col="red")

> abline(v=2004,col="red")

> dev.off()


null device

1


Figure 23: Historical Costs and Introduction of TARMED

This research project was concerned with the estimation (the extraction) of the relevant growth-dynamics based on the observed ‘noisy’ cost-series.

9.1 The State Space Model

For this purpose we need to account for

• a noise component wt

• a seasonal component

• the TARMED shift

• the changing level as well as

• the changing slope.

Specifically, we want to extract (estimate) the changing level and slope parameters. The structure of the state-space model must account for these data-features:

yt = θ·I_{t≥2004} + H′ξt + wt    (19)
ξt+1 = Fξt + vt+1

where I_{t≥2004} := 1 if t ≥ 2004 and 0 if t < 2004 is an intervention variable which accounts for the singular TARMED shift, θ is the TARMED-effect (it must be estimated), and where

F_{15×15} is block-diagonal: the upper-left 2×2 block is the level/slope matrix (1 1; 0 1); the lower-right 13×13 block is the companion matrix of the seasonal AR with first row (a1, 0, ..., 0, a12, −a1·a12) and ones on the subdiagonal

H′_{15} = (1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

Q_{15×15} = diag(Q11, Q22, Q33, 0, ..., 0)

where we added the dimensions of the matrices and the vector as an index. We now briefly look at the state vector and derive the relevant dynamics. The first two components obey the following difference equation:

(ξ_{1,t+1}, ξ_{2,t+1})′ = (1 1; 0 1)·(ξ_{1,t}, ξ_{2,t})′ + (v_{1,t+1}, v_{2,t+1})′

with Var(v1t) = Q11 and Var(v2t) = Q22. This is our simple changing level+slope model, see section 2.3. We now look at the remaining components:

(ξ_{3,t+1}, ξ_{4,t+1}, ..., ξ_{15,t+1})′ = B·(ξ_{3,t}, ξ_{4,t}, ..., ξ_{15,t})′ + (v_{3,t+1}, 0, ..., 0)′

where B is the 13×13 companion matrix with first row (a1, 0, ..., 0, a12, −a1·a12), ones on the subdiagonal, and where Var(v3t) = Q33. Solving this system of equations recursively backwards leads to

ξ_{15,t+1} = ξ_{14,t}
ξ_{14,t+1} = ξ_{13,t}
...
ξ_{4,t+1} = ξ_{3,t}
ξ_{3,t+1} = a1·ξ_{3,t} + a12·ξ_{14,t} − a1·a12·ξ_{15,t} + v_{3,t+1}

Substituting the identity equations back into the last one gives

ξ_{3,t+1} = a1·ξ_{3,t} + a12·ξ_{3,t−11} − a1·a12·ξ_{3,t−12} + v_{3,t+1}

or

(1 − a1B)(1 − a12B^{12}) ξ_{3,t+1} = v_{3,t+1}


Thus the third component ξ_{3,t} of ξt is a seasonal autoregressive (1, 0, 0)(1, 0, 0)_{12} process (a short numerical check follows the list below). As a consequence, we have all ingredients available:

• The observation noise wt.

• The intervention variable for the singular TARMED-shift: θIt≥2004 in 2004.

• The changing level: ξt,1.

• The changing slope: ξt,2.

• The adaptive seasonal component: ξt,3.
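A minimal numerical check of the above derivation (our own code; the parameter values are arbitrary): we iterate the 13-dimensional seasonal block of the state equation and verify that its first component follows the recursion ξ_{3,t} = a1·ξ_{3,t−1} + a12·ξ_{3,t−12} − a1·a12·ξ_{3,t−13} + v_{3,t}.

> a1<-0.5; a12<-0.3
> B<-matrix(0,13,13)
> B[1,1]<-a1; B[1,12]<-a12; B[1,13]<--a1*a12   # first row: AR-coefficients
> B[2:13,1:12]<-diag(12)                       # ones on the subdiagonal
> set.seed(1)
> len<-300
> v<-rnorm(len)
> x<-rep(0,13)
> xi3<-numeric(len)
> for (t in 1:len)
+ {
+ x<-B%*%x+c(v[t],rep(0,12))
+ xi3[t]<-x[1]
+ }
> # Direct recursion of the (1,0,0)(1,0,0)_12 process (13 zero initial values)
> z<-rep(0,len+13)
> for (t in 1:len)
+ {
+ z[t+13]<-a1*z[t+12]+a12*z[t+1]-a1*a12*z[t]+v[t]
+ }
> max(abs(xi3-z[14:(len+13)]))   # zero up to numerical precision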

9.2 Initial Values

We need some meaningful initialization ξ1|0 of the state vector. For this, we first consider the level and the slope:

ξ_{1|0,level} = (1/5)·Σ_{k=1}^{5} yk
ξ_{1|0,slope} = 0

We thus initialize the level with the mean of the first few (five) observations and we set the slope to zero. For the seasonal components we fit a seasonal airline model (0, 1, 1)(0, 1, 1)_{12} to the time-reverted series and we backcast twelve observations¹.

> # Initialisation of state vector xi1|0: mean level of first observations, slope and seasonal

> sais<-T

> if (sais)

+ {

+ fit<-arima(y[length(y):1], include.mean=F,order = c(0,1,1),

+ seasonal=list(order=c(0,1,1),period=12),method="ML")

+ fores<-predict(fit, n.ahead = 14)$pred

+ # Backcast the season

+ xi10<-c(mean(y[1:5]),rep(0,2),fores[1:12]-mean(fores[1:12]))

+ } else

+ {

+ xi10<-c(mean(y[1:5]),rep(0,14)) #ts.plot(xi10[3:length(xi10)])

+ }

Note that we supply the time-reverted series y[length(y):1] to the arima-function; also, we initialize the seasonal part of the state-vector with the centered (zero-mean) seasonal forecasts (fores[1:12]-mean(fores[1:12])) because the seasonal component ξ_{3,t} is a stationary zero-mean process. Indeed, the (non-zero) level is supposed to be extracted by the level component ξ_{1,t}. We explicitly attempt to avoid confusion of effects in our components, which have clearly defined meanings and purposes.

The variance P1|0 of ξ1|0 is initialized as follows:

P_{1|0,11} = min(parm[8]², var(yt)/100)
P_{1|0,22} = min(parm[9]², var(yt)/1000)
P_{1|0,kk} = var(yt)/10 if k > 2

As an example, we assume that our initial value for the slope, ξ_{1|0,slope} = 0, has a variance P_{1|0,slope} = min(parm[9]², var(yt)/1000) which is smaller than 1/1000 of the variance var(yt) of the data. This assumption is fine for many practically relevant applications where we expect the slope-innovation to be of (much) smaller magnitude than the data.

¹This proceeding is somewhat arbitrary, but experience suggests that it works fine in many applications with seasonal components.


9.3 A More General Adaptive Optimization Criterion

The unknown parameters are grouped in the parm-vector:

parm′ = (a1, a2, a12, R, Q11, Q22, Q33, P_{1|0,11}, P_{1|0,22}, θ)

The AR(2)-parameter a2 is not used in our example, but the code could account for an additional AR(2)-cycle (this would be relevant for business-cycle analysis, for example). We now provide initial values in order to proceed to optimization (since optimization might be lengthy, the initial value of parm below is chosen close to the optimum):

> # Set forecast horizon

> flen<-18

> # Forecast step in KF: The new procedure is very general and allows

> # for multi-step ahead optimization

> fstep<-1

> fore<-array(dim=c(fstep,flen))

> # Initialization of parm (since optimization might be long we here provide a good

> # initial estimate)

> parm<-c(149.96335194,97.93872433,-14.20951405,3.47020845,0.35075566,-0.03507273,

+ -0.51884736,84.77328378,0.06700115,25.09927957)

Note that we added an additional AR(2)-parameter a2 in parm, i.e. our code is slightly more general than needed (this additional parameter could be used to analyse cycles, for example). The unknown parameters grouped into parm must be estimated either by maximum likelihood (preferred) or by least-squares; section 5.6 clearly suggests the use of out-of-sample criteria when working with adaptive models.

We now introduce an additional useful optimization feature, which we illustrate with the least-squares criterion:

Σ_{t=1}^{T} α^{T−t}·ε_t² → min over a1, ..., ap, b1, ..., bq

The parameter 0 < α ≤ 1 allows to discount the remote past more heavily than the recent observations. We have found this feature to be useful in the context of series subject to frequent and/or large structural shifts, as is the case for many time series in the health care sector, due to continuous interventions into the health system (for example by political players). The maximum likelihood criterion can be modified accordingly:

α^{T−1}·ln(|P1|0|) + Σ_{t=2}^{T} α^{T−t}·ln(|H′t P_{t|t−1} Ht + Rt|)
+ α^{T−1}·(y1 − H′1 ξ1|0)′ (P1|0)^{−1} (y1 − H′1 ξ1|0)
+ Σ_{t=2}^{T} α^{T−t}·(yt − ŷt)′ (H′t P_{t|t−1} Ht + Rt)^{−1} (yt − ŷt) → min_θ

If α = 1, then the original least-squares or maximum likelihood criteria proposed in section 5 are replicated. For 0 < α < 1 the remote past is discounted more heavily than recent observations (we often select α = 0.98 when working with economic data).
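A minimal sketch of the recursive accumulation (the function discounted_lsq is our own illustration, not part of the kald code below): inside a filtering loop the weighted sum Σt α^{T−t}·ε²t can be updated with one multiplication per observation.

> discounted_lsq<-function(eps,alpha=0.98)
+ {
+ crit<-0
+ for (t in 1:length(eps))
+ {
+ # after step t: crit = sum_{s<=t} alpha^(t-s)*eps_s^2
+ crit<-alpha*crit+eps[t]^2
+ }
+ crit
+ }
> # Check against the direct formula
> eps<-rnorm(10)
> c(discounted_lsq(eps,0.98),sum(0.98^(10-(1:10))*eps^2))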

9.4 Robustification of the KF

One of the potential drawbacks of the KF lies hidden in the up-dating equation 6:

ξt|t = ξt|t−1 + P_{t|t−1} Ht (H′t P_{t|t−1} Ht + Rt)^{−1} εt



If vt, wt are Gaussian, then this up-dating scheme is optimal. Unfortunately, in practice, the Gaussian assumption is hardly credible, and large outliers εt can severely distort the up-date ξt|t as well as part or the whole of the sequence ξ_{t+k|t+k}, k > 0. This is particularly true in the presence of heavy shifts, as is the case in the above data (TARMED-effect). A simple robustification device replaces εt by a bounded function ψ(εt). Specifically, we select the following function ψ(·):

ψ(εt) = εt if |εt| < 2.5·median(|εt|, |εt−1|, ..., |ε1|), and ψ(εt) = sign(εt)·2.5·median(|εt|, |εt−1|, ..., |ε1|) otherwise.

We thus check whether |εt| is an outlier, i.e. whether |εt| > 2.5·median(|εt|, |εt−1|, ..., |ε1|) or not. Note that we are using a robust estimate of the scale, based on the median of the absolute values, instead of the classical sample standard deviation. If εt is identified as an outlier, then ψ(εt) trims the observation.
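A minimal sketch of the trimming device (the function name psi_trim is our own; in kald below the trimming is applied to the one-step ahead out-of-sample forecast errors):

> psi_trim<-function(eps_t,eps_past)
+ {
+ # robust scale: 2.5 times the median of the absolute errors up to time t
+ bound<-2.5*median(abs(c(eps_t,eps_past)))
+ if (abs(eps_t)<bound) eps_t else sign(eps_t)*bound
+ }
> # Usage: a large outlier is trimmed to the robust bound
> psi_trim(25,rnorm(50))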

9.5 Constraining the Parameter Space

Consider

parm′ = (a1, a2, a12, R, Q11, Q22, Q33, P_{1|0,11}, P_{1|0,22}, θ)

Note that a2 is not used in our example (but the procedure could work with an AR(2)-cycle). Trivially, the seasonal ARMA-model shouldn't be unstable, which puts constraints on the space of possible (meaningful) AR-parameters. Also, all variances should be positive, which translates into the following parameterizations:

(a1, a12) = 2·(1/(1 + exp(parm[c(1, 3)])) − 0.5)·(1/korarma)
(Q11, Q22, Q33) = (parm[5:7])²
R = parm[4]²
P_{1|0,11} = min(parm[8]², var(yt)/100)
P_{1|0,22} = min(parm[9]², var(yt)/1000)
P_{1|0,kk} = var(yt)/10 if k > 2

Remarks

• The function 2·(1/(1 + exp(parm[c(1, 3)])) − 0.5) constrains the parameter-space to [−1, 1]: the lower limit −1 is obtained for parm[c(1, 3)] = ∞; the upper limit +1 is obtained for parm[c(1, 3)] = −∞. The parameter korarma ≥ 1 allows to strengthen the restriction further by restricting the parameter space to [−1/korarma, 1/korarma]. In our example below we set korarma = 1.01: we thus restrict the parameter space to [−1/1.01, +1/1.01], thereby avoiding a non-stationary solution (such as, for example, a1 = a12 = 1).

• All variances must be positive: thus we square the corresponding parm-entries.

The above restrictions shape (impose structure on) the state-space model according to a priori knowledge and they constrain the parameter space to a meaningful subspace (positive variances). Therefore, optimization is simplified. Note that numerical optimization of state space models is a difficult computational task (parameters and components are not well-identified) and therefore the above restrictions sustain and facilitate numerical convergence to the optimum.
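A quick numerical check of the constraint mapping (illustrative only):

> korarma<-1.01
> constrain<-function(p) 2*(1/(1+exp(p))-0.5)/korarma
> # maps (-Inf, 0, +Inf) to the limits +1/korarma, 0, -1/korarma
> constrain(c(-1.e10,0,1.e10))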

9.6 Summary: the Generalized Kalman Filter (R-Code)

Here we obtain the R-code corresponding to the above modifications. The function kalmanrec reads the parameters and sets up all relevant system matrices by relying on the parameter constraints proposed in section 9.5. Once completed, it calls the KF-function.


> kalmanrec<-function(parm,li)#%%

+ {

+ y<-li$y

+ loglik<-li$loglik

+ opti<-li$opti

+ fstep<-li$fstep

+ xi10<-li$xi10

+ alpha<-li$alpha

+ korarma<-li$korarma

+ sais<-li$sais

+ inter<-li$intermhk

+

+ len<-length(y)

+ ndim=2+13

+ xttx <- matrix(0*1:(ndim*(len)),ncol=ndim)

+ # P1|0

+ Sigmattm1 <- diag(c(min(parm[8]^2,var(y)/100),min(parm[9]^2,var(y)/1000),

+ rep(ifelse(sais,var(y)/10,0),ndim-2)))

+ # Q

+ if (sais)

+ {

+ Q <- diag(c(parm[5:7]^2,rep(0,ndim-3)))

+ } else

+ {

+ Q <- diag(c(parm[5]^2,parm[6]^2,parm[7]^2,rep(0,ndim-3)))

+ }

+ # R

+ R<-parm[4]^2

+ # Fm (state transition matrix)

+ Fm<-diag(c(1,1,rep(0,ndim-2)))

+ Fm[1,2]<-1

+ Fm[4:ndim,3:(ndim-1)]<-diag(rep(1,ndim-3))

+ # AR(1)*AR(12)

+ parmh<-2*(1/(1+exp(parm[1:3]))-0.5)/korarma

+ Fm[3,3]<-parmh[1]

+ # In the absence of seasonals an AR(2)-cycle is estimated

+ Fm[3,4]<-ifelse(sais,0,min(ifelse(length(y)<50,0.6,0.97)-parmh[1],parmh[2]))

+ Fm[3,14]<-ifelse(sais,parmh[3],0) #parm[2]<-0

+ Fm[3,15]<-ifelse(sais,-parmh[3]*parmh[1],0)

+ # H (observation matrix)

+ H<-c(rep(1,3),rep(0,ndim-3))

+ # Intervention variable for TARMED-effect

+ inter<-inter*parm[length(parm)]

+ # Run the KF

+ f<-kald(len,y,Q,Fm,H,xttx,ndim,Sigmattm1,R,fstep,xi10,alpha,inter)

+ if (opti)

+ {

+ # f$logl is based on trimmed (robust) one-step ahead out-of-sample forecast errors

+ return(ifelse(loglik,ifelse(fstep==1,f$logl,f$loglmu),log(f$lsq)))

+ } else

+ {

+ return(list(f=f,Fm=Fm,H=H,Q=Q,R=R))

+ }

+ }


The KF-function kald(len, y, Q, Fm, H, xttx, ndim, Sigmattm1, R, fstep, xi10, alpha, inter) is given by:

> # kalman Filter

> kald<-function(len,y,Q,Fm,H,xttx,ndim,Sigmattm1,R,fstep,xi10,alpha,inter)

+ {

+ # fstep-ahead (multi-step) out of sample error

+ epshatms<-0*1:len

+ # one-step-ahead out of sample error

+ epshatos<-1:len

+ # multi-step out-of-sample forecast
+ outsampms<-0*1:len
+ # one-step out-of-sample forecast
+ outsampos<-0*1:len
+ # in-sample multi-step ahead forecast
+ echtms<-0*1:len
+ # in-sample one-step ahead forecast
+ insampos<-0*1:len

+ epsvar<-0

+ epsbias<-0

+ logl<-0.

+ loglmu<-0

+ lsq<-0. #H[4]<-0

+ snrrw<-1:(len)

+ snrme<-1:(len)

+ xttz1<-xi10

+ nu<-xttx*0

+ # Multi-step ahead forecasts

+ fFm<-diag(1,ndim)

+ if (fstep-1>0)

+ {

+ for (j in 1:(fstep-1))

+ {

+ fFm<-fFm%*%Fm

+ }

+ }

+ Ptt<-array(dim=c(dim(Sigmattm1),len))

+ Pttm1<-array(dim=c(dim(Sigmattm1),len+1))

+ Pttm1[,,1]<-Sigmattm1

+ if (len-fstep+1>1)

+ {

+ for (i in 1:len) #i<-47

+ {

+ Sigmattm1m<-fFm%*%Sigmattm1%*%t(fFm)

+ if (H%*%Sigmattm1%*%(H)+R<1.e-90)

+ {

+ print("!!!!!!!!!!!!!!!!!!!!Possible Singularity!!!!!!!!!!!!!!!!!!!!!!!!!!!")

+ He<-1.e-20

+ } else

+ {

+ He <- H%*%Sigmattm1%*%(H)+R

+ Hem<-H%*%Sigmattm1m%*%(H)+R

+ }

+ Hi <- 1/He


+ Him<-1/Hem

+ # Out-of-sample multi-step ahead forecasts

+ outsampms[i]<-H%*%fFm%*%xttz1+inter[i]

+ if (!(i+fstep-1>len))

+ {

+ epshatms[i] <- y[i+fstep-1]-outsampms[i]

+ }

+ # Out-of-sample one-step ahead forecasts

+ outsampos[i]<-H%*%xttz1+inter[i]

+ # Robustification of up-dating equation

+ if (i>40&abs(y[i]-outsampos[i])>2.5*median(abs(epshatos[10:(i-1)])))

+ {

+ epshatos[i] <- sign(y[i]-outsampos[i])*2.5*median(abs(epshatos[10:(i-1)]))

+ } else

+ {

+ epshatos[i] <- y[i]-outsampos[i]

+ }

+ xttz <- xttz1+Sigmattm1%*%(H)%*%Hi%*%epshatos[i]

+ xttz1<-Fm%*%xttz

+ Sigmatt<-Sigmattm1-((Sigmattm1%*%(H))%*%Hi%*%(H%*%Sigmattm1))

+ Ptt[,,i]<-Sigmatt

+ Sigmattm1 <- Fm%*%Sigmatt%*%t(Fm)+Q

+ Pttm1[,,i+1]<-Sigmattm1

+ # in sample state-vector

+ xttx[i,]<-xttz

+ # in sample one-step ahead forecast

+ insampos[i]<-H%*%xttz+inter[i]

+ # Multi-step forecasts

+ echtms[i]<-H%*%fFm%*%Fm%*%xttz+inter[i]

+ # -loglikelihood and least-squares criteria: remote past is downweighted by alpha,

+ # forecast error is trimmed (robust estimation)

+ logl <- alpha*logl+(log(He)+epshatos[i]*Hi*epshatos[i])/(2.*(len))

+ loglmu <- alpha*loglmu+(log(Hem)+epshatms[i]*Him*epshatms[i])/(2.*(len))

+ # multi-step out-of-sample least-squares: remote past is downweighted by alpha

+ lsq <- alpha*lsq+epshatms[i]^2/(len-fstep+1)

+ # nu's
+ if (i>1)

+ {

+ nu[i,]<-xttx[i,]-Fm%*%xttx[i-1,]

+ }

+ # signal-to-noise ratios

+ if(!(i+fstep-1>len))

+ {

+ epsvar<-epsvar+epshatms[i]^2/(len-fstep+1)

+ epsbias<-epsbias+epshatms[i]/(len-fstep+1)

+ if (i>1)

+ {

+ snrrw[i]<-sum(epshatms[1:i]^2)/sum((y[fstep-1+2:i]-y[1:(i-1)])^2)

+ snrme[i]<-sum(epshatms[1:i]^2)/sum((y[fstep-1+2:i]-mean(y[1:(i-1)]))^2) #@

+ }

+ }

+ }

+ return(list(logl = logl, lsq = lsq, loglmu=loglmu,xttx = xttx, Sigmatt = Sigmatt,

+ snrrw = snrrw, snrme = snrme, epsvar =


+ epsvar, epsbias = epsbias,nu = nu,outsampms=outsampms,

+ outsampos=outsampos,echtms=echtms,insampos=insampos,

+ epshatms=epshatms,epshatos=epshatos,Ptt=Ptt,Pttm1=Pttm1))

+

+ } else

+ {

+ print("Too few observations for Multi-step ahead estimation: (len-fstep+1)<1")

+ }

+ }

This very general routine implements the weighted optimization scheme proposed in section 9.3 as well as the robustification proposed in section 9.4. It even has one additional feature that we did not discuss so far: it can estimate parameters based on multi-step ahead (instead of one-step ahead) forecast performances. Depending on the particular application, this additional feature could be helpful⁴. We ignore this possibility here, i.e. we set fstep<-1 in our main code (which means that one-step ahead performances are computed).

9.7 Run this Code: Estimate Components

We now run this code and first estimate the unknown parameters in the vector parm. For this purpose we set α = 0.98, see section 9.3 (experience suggests that this is a fairly good discounting of the remote past for many economic time series⁵).

> # Likelihood or least-squares (least-squares with korarma<-1.25 works fine too)

> loglik<-T

> # Downweighting past residuals in optimization: assigning less weight to remote past:

> # alpha must be smaller than one

> alpha<-0.98

> alpha<-min(alpha,1)

> # ARMA-Parameters are divided by korarma, such that biggest value is 1/korarma

> # Reason: we don't want the non-stationary level to be confounded with the seasonal component

> # i.e. we want a stationary seasonal (or cycle) and a non-stationary level

> korarma<-1.01

> opti<-T

> # Intervention variable (level shift after TARMED)

> intermhk<-rep(0,length(y))

> intermhk[49:length(intermhk)]<-1

> # pack everything into a single list

> li<-list(y=y,loglik=loglik,opti=opti,fstep=fstep,xi10=xi10,alpha=alpha,

+ korarma=korarma,sais=sais,intermhk=intermhk)

> # Numerical optimization

> parme<-parm

> min<-1.e+98

> if (opti)

+ {

+ # BFGS

+ parm.obj<-optim(parme, kalmanrec,method="BFGS",

+ control = list(REPORT=1,trace=6,maxit=100),li=li)

+ parme<-parm.obj$par

+ # optimized parameter vector: initial values and variances in observation and state equations

+ parm<-parme

+ write.table(parme,paste(Path_MHK.out,"parm",sep=""))

⁴We used this feature for the NN3 and NN5 forecasting competitions.
⁵We used this value for the NN3 and NN5 forecasting competitions too.


+ } else

+ {

+ parm<-read.table(paste(Path_MHK.out,"parm",sep=""))[,1]

+ }

initial value 0.836994

iter 2 value 0.836755

iter 3 value 0.836413

iter 4 value 0.835189

iter 5 value 0.835189

iter 5 value 0.835189

iter 5 value 0.835189

final value 0.835189

converged

Remark

• David Friskin, a faithful SEFBlog reader, proposed the following slightly better solution:

> #parm<-c(149.96335194,97.93872433,-14.20951405,3.47020845,0.35075566,-0.03507273,

> # -0.51884736,84.77328378,0.06700115,25.09927957)

He relied on computationally more intensive simulated annealing and genetic algorithms. The (negative) likelihood decreases a bit and I prefer the resulting ARMA-model (as well as the forecasts that will be seen in fig.29 below). The TARMED-shift as obtained by David's solution is more pronounced and therefore 'growth' (the second element of the state-vector) is weaker than in 'my' solution (the result of the above numerical computations).

• If you want to look at David’s solution just use his settings, above, and run my code, below.

We now proceed to the decomposition of the time series based on the resulting optimal parm-vector:

> opti<-F

> li<-list(y=y,loglik=loglik,opti=opti,fstep=fstep,xi10=xi10,

+ alpha=alpha,korarma=korarma,sais=sais,intermhk=intermhk)

> kf<-kalmanrec(parm,li)#%%

> kf$f$logl

[,1]

[1,] 0.8351886

> xttx<-kf$f$xttx

> Q<-kf$Q

> R<-kf$R

> Ptt<-kf$f$Ptt

> Pttm1<-kf$f$Pttm1

> Fm<-kf$Fm

> Fn<-Fm

> H<-kf$H

The object kf contains all relevant entities: state-vector and variances. We now briefly look at the optimized parm-vector as well as at the relevant (optimized) system matrices which are obtained by transforming parm according to section 9.5:

> parm


[1] 149.96335194 97.93872433 -14.20951262 3.45694238

[5] 0.40010230 -0.03640980 -0.55053678 84.77328378

[9] 0.03791357 25.09807908

> Fm

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

[1,] 1 1 0.000000 0 0 0 0 0 0 0

[2,] 0 1 0.000000 0 0 0 0 0 0 0

[3,] 0 0 -0.990099 0 0 0 0 0 0 0

[4,] 0 0 1.000000 0 0 0 0 0 0 0

[5,] 0 0 0.000000 1 0 0 0 0 0 0

[6,] 0 0 0.000000 0 1 0 0 0 0 0

[7,] 0 0 0.000000 0 0 1 0 0 0 0

[8,] 0 0 0.000000 0 0 0 1 0 0 0

[9,] 0 0 0.000000 0 0 0 0 1 0 0

[10,] 0 0 0.000000 0 0 0 0 0 1 0

[11,] 0 0 0.000000 0 0 0 0 0 0 1

[12,] 0 0 0.000000 0 0 0 0 0 0 0

[13,] 0 0 0.000000 0 0 0 0 0 0 0

[14,] 0 0 0.000000 0 0 0 0 0 0 0

[15,] 0 0 0.000000 0 0 0 0 0 0 0

[,11] [,12] [,13] [,14] [,15]

[1,] 0 0 0 0.0000000 0.0000000

[2,] 0 0 0 0.0000000 0.0000000

[3,] 0 0 0 0.9900977 0.9802947

[4,] 0 0 0 0.0000000 0.0000000

[5,] 0 0 0 0.0000000 0.0000000

[6,] 0 0 0 0.0000000 0.0000000

[7,] 0 0 0 0.0000000 0.0000000

[8,] 0 0 0 0.0000000 0.0000000

[9,] 0 0 0 0.0000000 0.0000000

[10,] 0 0 0 0.0000000 0.0000000

[11,] 0 0 0 0.0000000 0.0000000

[12,] 1 0 0 0.0000000 0.0000000

[13,] 0 1 0 0.0000000 0.0000000

[14,] 0 0 1 0.0000000 0.0000000

[15,] 0 0 0 1.0000000 0.0000000

> H

[1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0

> Q

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]

[1,] 0.1600819 0.000000000 0.0000000 0 0 0 0 0

[2,] 0.0000000 0.001325674 0.0000000 0 0 0 0 0

[3,] 0.0000000 0.000000000 0.3030907 0 0 0 0 0

[4,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[5,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[6,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[7,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[8,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[9,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0


[10,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[11,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[12,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[13,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[14,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[15,] 0.0000000 0.000000000 0.0000000 0 0 0 0 0

[,9] [,10] [,11] [,12] [,13] [,14] [,15]

[1,] 0 0 0 0 0 0 0

[2,] 0 0 0 0 0 0 0

[3,] 0 0 0 0 0 0 0

[4,] 0 0 0 0 0 0 0

[5,] 0 0 0 0 0 0 0

[6,] 0 0 0 0 0 0 0

[7,] 0 0 0 0 0 0 0

[8,] 0 0 0 0 0 0 0

[9,] 0 0 0 0 0 0 0

[10,] 0 0 0 0 0 0 0

[11,] 0 0 0 0 0 0 0

[12,] 0 0 0 0 0 0 0

[13,] 0 0 0 0 0 0 0

[14,] 0 0 0 0 0 0 0

[15,] 0 0 0 0 0 0 0

> R

[1] 11.95045

According to the state-transition matrix F the optimal seasonal model is (1 + 0.99B)(1 − 0.99B¹²)ξ_{t,3} = v_{t,3}. The variances in the state-equation, Q11 = 0.1601, Q22 = 0.00133 and Q33 = 0.303, suggest that the innovation of the stationary seasonal component is largest, followed by the trend-innovation and then by the slope-innovation, which is smallest in magnitude. This result confirms expectations according to which the slope should adapt more gradually than the level, for example. The noise component w_t in the observation equation is dominant with a variance R = 12. Note that we relied on the maximum-likelihood optimization algorithm (loglik<-T). The least-squares result (not shown here) would be very close. We can now proceed to the plots.
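As a quick sanity check (a sketch run on the objects above, not part of the original analysis), row 3 of Fm indeed encodes the multiplicative AR-polynomial (1 − a1·B)(1 − a12·B¹²) = 1 − a1·B − a12·B¹² + a1·a12·B¹³:

> # Sketch: read off the AR-coefficients from row 3 of Fm; the lag-13
> # coefficient must equal minus the product of the other two
> a1<-Fm[3,3]
> a12<-Fm[3,14]
> c(a1,a12,-a1*a12)   # compare with Fm[3,c(3,14,15)]
> Fm[3,15]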

9.8 Interpretation

The TARMED-adjusted series y_t − θI_{t≥2004} is plotted in fig.24.

> lwde<-1

> y_adj<-ts(z-parm[length(parm)]*intermhk,frequency=12,start=c(2000,1))

> file = paste("z-y_adj.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> tit<-paste("TARMED-adjusted data",sep="")

> ts.plot(ts(y_adj,frequency=12,start=c(2001,1)),type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="black")

> abline(v=2004)

> dev.off()

null device

1

The estimated θ = parm[10] seems to capture the singular shift quite well.


Figure 24: TARMED-adjusted data

We next attempt to interpret the variances Q11 = 0.1601 and Q22 = 0.00133, which seem fairly small when compared to R = 11.95. The level-innovation is integrated in ξ_{1t} like a (first order) random-walk. The series length is len = 102: we thus have a total dynamic variance at the end of the sample corresponding to len · Q11 = 16.33. Similarly, the slope-innovation is integrated as a (first-order) random-walk in the slope ξ_{2t}, but since ξ_{2t} is itself integrated in the level ξ_{1t} we finally obtain a second order (integrated) random walk. The double sum implies a cumulated variance-effect of (len(len − 1)/2) · Q22 = 6.83 at the sample end.

We now briefly compare these numbers with

• the sample variance of the TARMED-adjusted series: Var(y_t − θI_{t≥2004}) = 93.22;

• the sample variance of the estimated level+slope component: Var(ξ_{t|t,1}) = 33.78, see top graph in fig.26 (note that P_{1|0} supplies additional variance to this aggregate);

• the sample variance of the seasonal component in the bottom graph of fig.26: Var(ξ_{t|t,3}) = 36.99.

When comparing these numbers we can recognize a fairly well-balanced design which conforms quite well with a visual inspection of the TARMED-adjusted series in fig.24. As we shall see, the alternative R-packages proposed in sections 10.1 and 10.2 introduce more or less severe distortions (either too adaptive or insufficiently adaptive level and slope components) which will affect forecast performances for the health-care costs under scrutiny.
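The comparison above can be reproduced with a few lines of R (a sketch assuming the objects y, xttx, Q, parm and intermhk from the preceding code are still in the workspace):

> # Sketch: reproduce the variance diagnostics discussed above
> len<-length(y)
> len*Q[1,1]                           # cumulated level-innovation variance
> (len*(len-1)/2)*Q[2,2]               # cumulated slope-innovation variance
> var(y-parm[length(parm)]*intermhk)   # TARMED-adjusted series
> var(xttx[,1])                        # level+slope component
> var(xttx[,3])                        # seasonal component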


9.9 Plots of the Components

We here first compare in- and out-of-sample fits of the data by the state-space model. Specifically we plot y_t (the data), H′ξ_{t|t−1} (out-of-sample: green line) and H′ξ_{t|t} (in-sample: blue lines) as well as the confidence intervals, see fig.25. Note that the confidence intervals are based on the variances

H′P_{t|t}H + R (in-sample)

H′P_{t|t−1}H + R (out-of-sample)

> std_insamp<-1:len

> std_outsamp<-1:len

> for (i in 1:len)

+ {

+ std_insamp[i]<-H%*%Ptt[,,i]%*%H+R

+ std_outsamp[i]<-H%*%Pttm1[,,i]%*%H+R

+ }

> lwde<-1

> file = paste("z-insample.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> par(mfrow=c(3,1))

> tit<-"Data vs. in-sample fit"

> ymin<-min(kf$f$insampos-2*sqrt(std_insamp))

> ymax<-max(kf$f$insampos+2*sqrt(std_insamp))

> ts.plot(y,col="black",lty=1,type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,ylim=c(ymin,ymax))

> lines(ts(kf$f$insampos-2*sqrt(std_insamp),frequency=12,start=c(2000,1)),

+ col="blue",lty=2,lwd=lwde+1)

> lines(ts(kf$f$insampos+2*sqrt(std_insamp),frequency=12,start=c(2000,1)),

+ col="blue",lty=2,lwd=lwde+1)

> lines(ts(kf$f$insampos,frequency=12,start=c(2000,1)),col="blue",lty=1,lwd=lwde+1)

> mtext("Data", side = 3, line = -1,at=2004,col="black")

> mtext("In-sample fit", side = 3, line = -2,at=2004,col="blue")

> tit<-"Data vs. out-of-sample fit"

> ymin<-min(kf$f$outsampms-2*sqrt(std_insamp))

> ymax<-max(kf$f$outsampms+2*sqrt(std_insamp))

> ts.plot(y,col="black",lty=1,type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,ylim=c(ymin,ymax))

> lines(ts(kf$f$outsampms,frequency=12,start=c(2000,1)),col="green",lty=1,lwd=lwde+1)

> lines(ts(kf$f$outsampms-2*sqrt(std_outsamp),frequency=12,start=c(2000,1)),

+ col="green",lty=2,lwd=lwde+1)

> lines(ts(kf$f$outsampms+2*sqrt(std_outsamp),frequency=12,start=c(2000,1)),

+ col="green",lty=2,lwd=lwde+1)

> mtext("Data", side = 3, line = -1,at=2004,col="black")

> mtext("out-of-sample fit", side = 3, line = -2,at=2004,col="green")

> tit<-"In-sample vs. out-of-sample fits"

> ts.plot(ts(kf$f$insampos,frequency=12,start=c(2000,1)),col="blue",lty=1,type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1)

> lines(ts(kf$f$outsampms,frequency=12,start=c(2000,1)),col="green",lty=1,lwd=lwde+1)

> mtext("In-sample fit", side = 3, line = -1,at=2004,col="blue")

> mtext("out-of-sample fit", side = 3, line = -2,at=2004,col="green")

> dev.off()

null device

1


Figure 25: Data (black), in-sample fit (blue) and out-of-sample fit (green)


We can see that the model fits the data nicely and that the out-of-sample fit is pretty close to the in-sample fit, which means that the model anticipates the future (one-step ahead) data quite well. A possible exception is towards 2004 (TARMED introduction): in fact some of the actors in the health-care sector implemented TARMED prior to January 2004 whereas our intervention variable is set to January 2004, but we can safely ignore this 'cosmetic' issue here. Note also the larger confidence intervals of the out-of-sample estimate (green line) at the start of the sampling period, which are due to P_{1|0}.

We now plot the components: the level ξ_{1t}, the slope ξ_{2t} and the seasonal component ξ_{3t}, see fig.26.

> ce<-1

> file = paste("z-components.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> par(mfrow=c(3,1))

> tit<-"Level"

> ymin<-min(xttx[,1]-2*sqrt(Ptt[1,1,]))

> ymax<-max(xttx[,1]+2*sqrt(Ptt[1,1,]))

> ts.plot(ts(xttx[,1],frequency=12,start=c(2000,1)),type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="brown",ylim=c(ymin,ymax))

> lines(ts(xttx[,1]+2*sqrt(Ptt[1,1,]),frequency=12,start=c(2000,1)),col="brown",lty=2)

> lines(ts(xttx[,1]-2*sqrt(Ptt[1,1,]),frequency=12,start=c(2000,1)),col="brown",lty=2)

> abline(v=2004)

> tit<-"Slope/drift"

> ymin<-min(xttx[,2]-2*sqrt(Ptt[2,2,]))

> ymax<-max(xttx[,2]+2*sqrt(Ptt[2,2,]))

> ts.plot(ts(xttx[,2],frequency=12,start=c(2000,1)),lty=1,type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="orange",ylim=c(ymin,ymax))

> lines(ts(xttx[,2]+2*sqrt(Ptt[2,2,]),frequency=12,start=c(2000,1)),col="orange",lty=2)

> lines(ts(xttx[,2]-2*sqrt(Ptt[2,2,]),frequency=12,start=c(2000,1)),col="orange",lty=2)

> abline(v=2004)

> tit<-"Season"

> ymin<-min(xttx[,3]-2*sqrt(Ptt[3,3,]))

> ymax<-max(xttx[,3]+2*sqrt(Ptt[3,3,]))

> ts.plot(ts(xttx[,3],frequency=12,start=c(2000,1)),lty=1,type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="violet",ylim=c(ymin,ymax))

> lines(ts(xttx[,3]+2*sqrt(Ptt[3,3,]),frequency=12,start=c(2000,1)),col="violet",lty=2)

> lines(ts(xttx[,3]-2*sqrt(Ptt[3,3,]),frequency=12,start=c(2000,1)),col="violet",lty=2)

> abline(v=2004)

> dev.off()

null device

1

The interesting structural slope parameter, which determines the critical systemic growth of health-care expenditures, increases steadily up to the TARMED-intervention. From then on, growth stabilizes and slightly decreases back to the growth-dynamics observed in 2002. The width of the confidence intervals reflects the adaptivity of the components.

We next plot the TARMED-intervention effect, as captured by θI_{t≥2004} in equation 19 (note that θ = parm[10]), see fig.27. We also print the optimized θ = parm[10].

> ce<-1

> file = paste("z-intervention.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)


Figure 26: Components: level (top), slope (middle) and season (bottom)


> tit<-"Level+TARMED intervention"

> ts.plot(ts(xttx[,1]+parm[length(parm)]*intermhk,frequency=12,start=c(2000,1)),type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="brown",ylim=c(min(y),max(y)))

> lines(y)

> abline(v=2004)

> dev.off()

null device

1

> parm[10]

[1] 25.09808

Figure 27: Data (black) and level + intervention variable (brown)

The singular TARMED effect is estimated as 25.1, which is close to a one-off 10%-increase of costs: a substantial undesirable effect, since the introduction of TARMED should have been achieved under the aspect of cost-neutrality. However, as seen in the previous graph, the critical growth-parameter has stabilized. To be more specific, the relevant relative growth (100·slope/level in %) peaked during the introduction of TARMED and has since tended back to historical levels, see fig.28: note that the (arbitrary) initialization at zero at the beginning of the time series should be ignored.

> ce<-1

> file = paste("z-relativegrowth.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> tit<-"Relative Growth: 100*Drift/Level (in %)"


> ts.plot(ts(100*xttx[,2]/(xttx[,1]+parm[length(parm)]*intermhk),

+ frequency=12,start=c(2000,1)),type="l",

+ main=tit,ylab="%-Growth",xlab="",lwd=lwde+1,col="orange")

> abline(v=2004)

> dev.off()

null device

1

Figure 28: Ratio of growth to level in percent

9.10 Forecasts

In order to discuss and possibly to fix the value of the taxpoint, the parties (FMH and SUVA) have to agree on the future growth-dynamics eighteen months ahead of the time-point of the meeting. The following forecasts ŷ_{T+s}, s = 1, ..., 18, are determined according to section 7.1

ξ_{T+s|T} = (∏_{k=1}^{s} F) ξ_{T|T}

ŷ_{T+s} = H′ξ_{T+s|T}

P_{T+s|T} = F P_{T+s−1|T} F′ + Q

Var(y_{T+s} − ŷ_{T+s}) = H′P_{T+s|T}H + R

see fig.29.


> sigma_for<-array(dim=c(dim(Ptt[,,len]),flen))

> sigma_intervall<-1:flen

> Fn<-Fm

> for (j in 1:flen)

+ {

+ fore[fstep,j]<-H%*%Fn%*%xttx[length(y),]+parm[length(parm)]*intermhk[length(intermhk)]

+ Fn<-Fm%*%Fn

+ if (j==1)

+ {

+ sigma_for[,,j]<-Fm%*%Ptt[,,len]%*%t(Fm)+Q

+ } else

+ {

+ sigma_for[,,j]<-Fm%*%sigma_for[,,j-1]%*%t(Fm)+Q

+ }

+ sigma_intervall[j]<-H%*%sigma_for[,,j]%*%H+R

+ }

> ce<-1

> file = paste("z-forecasts.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> tit<-paste("1-",flen," steps ahead 95%-forecast interval of costs",sep="")

> ymax<-max(fore[fstep,]+2*sqrt(sigma_intervall))

> ymin<-min(y)

> ts.plot(ts(c(y,rep(NA,flen)),frequency=12,start=c(2000,1)),type="l",

+ main=tit,ylab="%-Growth",xlab="",lwd=lwde+1,col="black",ylim=c(ymin,ymax))

> lines(ts(c(rep(NA,len),fore[fstep,]),frequency=12,start=c(2000,1)),col="red")

> lines(ts(c(rep(NA,len),fore[fstep,]-2*sqrt(sigma_intervall)),frequency=12,start=c(2000,1)),col="red",lty=2)

> lines(ts(c(rep(NA,len),fore[fstep,]+2*sqrt(sigma_intervall)),frequency=12,start=c(2000,1)),col="red",lty=2)

> abline(v=2004)

> dev.off()

null device

1

>

We infer that the relevant systemic growth seems to be under control (the true costs were very close to the reported forecasts and fully within the interval). Note also that the width of the forecast interval increases as a function of the forecast horizon because of the adaptivity of the SSM. This effect is due to the cumulative effect of Q in

P_{T+s|T} = F P_{T+s−1|T} F′ + Q
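A one-line check makes this widening visible (a sketch assuming the vector sigma_intervall from the forecast code above, with flen = 18):

> # Sketch: the forecast standard error grows with the horizon because Q
> # is added at every step of the recursion for P_{T+s|T}
> sqrt(sigma_intervall[c(1,6,12,18)])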

10 R-Packages

We here briefly discuss two R-packages dedicated to SSM, dlm and KFAS (the latter will be interfaced by the package dlmodeler). There are other packages available, such as dse, sspir and FKF, but space is limited and we do not want to discuss every available approach. The reader should be aware that each package relies on its own specific nomenclature, which translates into a confusing Babylonian specification of system matrices and noise variances. The package dlmodeler attempts to provide some kind of unifying interface to various other packages, although we did not verify the pertinence of this claim (we just interfaced KFAS).


Figure 29: Forecast interval of health-care expenditures 18 months ahead

10.1 dlm-Package

The dlm-package can handle adaptive level, slope and seasonal components and more, see Giovanni Petris (2011)⁶. A changing level-model can be specified by dlmModPoly(1) (order one stochastic polynomial) whereas a changing level+slope model is given by dlmModPoly(2) (order two stochastic polynomial). A seasonal component is obtained by dlmModSeas(s) where s is the seasonal length: 12 for monthly data or 4 for quarterly data. The following piece of code applies the package to the health-care time series in the previous section, though we adjust for the singular TARMED-shift in 2004.

First, the package dlm must be loaded: library(dlm).

> library(dlm)

Then the TARMED-adjusted series y_adj is defined. Components can be added to obtain the final SSM: dlmModPoly(2) + dlmModSeas(12):

> # Remove TARMED-Shift and Make time series

> y_adj<-ts(z-parm[length(parm)]*intermhk,frequency=12,start=c(2000,1))

> dlm_mhk <- dlmModPoly(2) + dlmModSeas(12)

The variance-matrix Q of the state-equation is called W whereas the variance R of the observation equation is called V. Both matrices are defined in the function buildFun().

> buildFun <- function(parm) {

+ diag(W(dlm_mhk))[2:3] <- exp(parm[1:2])

⁶www.jstatsoft.org/v41/i04/paper and http://definetti.uark.edu/gpetris/


+ V(dlm_mhk) <- exp(parm[3])

+ return(dlm_mhk)

+ }

Neither the seasonal and non-seasonal AR-parameters nor the initial values are estimated. Thus the parameter-vector parm is three-dimensional only. We can now estimate parm:

> fit <- dlmMLE(y_adj, parm = rep(0, 3), build = buildFun)

> dlm_mhk <- buildFun(fit$par)

> drop(V(dlm_mhk))

[1] 30.45436

> diag(W(dlm_mhk))[2:3]

[1] 4.013260e-04 8.860186e-06

A quick comparison with Q in section 9.7 reveals that dlm is more conservative or less adaptive, since Q11 = 0.1601 is larger than W11 = 4e−04 and Q22 = 0.00133 is larger than W22 = 8.9e−06.

Finally we compute filtered estimates of the components (level, slope and seasonal) and plot these together with the estimates obtained in the previous section, see fig.30. Note that we discard the first twelve observations because of undesirable initialization effects in the dlm-package.

> mhk_filter <- dlmFilter(y_adj, mod = dlm_mhk)

> x <- cbind(y_adj, dropFirst(mhk_filter$m[, 1:3]))

> anf<-13

> colnames(x) <- c("MHK", "Trend","slope", "Seasonal")

> file = paste("z-dlm_components.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> par(mfrow=c(2,2))

> tit<-paste("Linearized Data (TARMED intervention is substracted)",sep="")

> ts.plot(ts(x[anf:len,1],frequency=12,start=c(2001,1)),type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="black")

> tit<-paste("Trend",sep="")

> ts.plot(ts(x[anf:len,2],frequency=12,start=c(2001,1)),type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="black",ylim=c(min(xttx[anf:len,1]),max(xttx[anf:len,1])))

> lines(ts(xttx[anf:len,1],frequency=12,start=c(2001,1)),type="l",

+ lwd=lwde+1,col="brown")

> mtext("Trend DLM", side = 3, line = -1,at=2004,col="black")

> mtext("Trend SSM-MHK", side = 3, line = -2,at=2004,col="brown")

> tit<-paste("Slope",sep="")

> ts.plot(ts(x[anf:len,3],frequency=12,start=c(2001,1)),type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="black",ylim=c(min(xttx[anf:len,2]),max(x[anf:len,3])))

> lines(ts(xttx[anf:len,2],frequency=12,start=c(2001,1)),type="l",

+ lwd=lwde+1,col="orange")

> mtext("Slope DLM", side = 3, line = -1,at=2004,col="black")

> mtext("Slope SSM-MHK", side = 3, line = -2,at=2004,col="orange")

> tit<-paste("Seasonal",sep="")

> ts.plot(ts(x[anf:len,4],frequency=12,start=c(2001,1)),type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="black")

> lines(ts(xttx[anf:len,3],frequency=12,start=c(2001,1)),type="l",

+ lwd=lwde+1,col="violet")

> mtext("Seasonal DLM", side = 3, line = -1,at=2004,col="black")

> mtext("Seasonal SSM-MHK", side = 3, line = -2,at=2004,col="violet")

> dev.off()


null device

1

Figure 30: A comparison of components by dlm and the SSM in the previous section

We see that the level obtained by dlm is lower (which is compensated by a slight upward shift of the seasonal component) and that the slope estimate is less adaptive, as expected (the true slope is slightly underestimated by dlm, i.e. the true costs after 2008 are growing more markedly). All in all we observe quite a close match of both models, though a slight preference is assigned to the more adaptive SSM proposed in the previous section because the data is subject to frequent shifts and structural changes. We conjecture that the increased adaptivity is a consequence of selecting α = 0.98 < 1 in section 9.3.

To conclude we compare the forecasts of both SSM; please note that we add the intervention (TARMED-effect) back:

> mhk_forecast <- dlmForecast(mhk_filter, n = flen)

> file = paste("z-dlmforecasts.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> tit<-paste("1-",flen," steps ahead forecasts of dlm and benchmark SSM (previous section)",sep="")

> ymax<-max(fore[fstep,]+2*sqrt(sigma_intervall))

> ymin<-min(y)

> ts.plot(ts(c(y,rep(NA,flen)),frequency=12,start=c(2000,1)),type="l",

+ main=tit,ylab="%-Growth",xlab="",lwd=lwde+1,col="black",ylim=c(ymin,ymax))

> lines(ts(c(rep(NA,len),fore[fstep,]),frequency=12,start=c(2000,1)),col="red")

> lines(ts(c(rep(NA,len),fore[fstep,]-2*sqrt(sigma_intervall)),frequency=12,start=c(2000,1)),col="red",lty=2)

> lines(ts(c(rep(NA,len),fore[fstep,]+2*sqrt(sigma_intervall)),frequency=12,start=c(2000,1)),col="red",lty=2)

> lines(ts(c(rep(NA,len),parm[length(parm)]*intermhk[len]+mhk_forecast$a[,1]+mhk_forecast$a[,3]),frequency=12,start=c(2000,1)),col="blue",lty=1)


> lines(ts(c(rep(NA,len),parm[length(parm)]*intermhk[len]+mhk_forecast$a[,1]+mhk_forecast$a[,3]+2*sqrt(unlist(mhk_forecast$Q))),frequency=12,start=c(2000,1)),col="blue",lty=2)

> lines(ts(c(rep(NA,len),parm[length(parm)]*intermhk[len]+mhk_forecast$a[,1]+mhk_forecast$a[,3]-2*sqrt(unlist(mhk_forecast$Q))),frequency=12,start=c(2000,1)),col="blue",lty=2)

> abline(v=2004)

> mtext("DLM", side = 3, line = -1,at=2004,col="blue")

> mtext("SSM", side = 3, line = -2,at=2004,col="red")

> dev.off()

null device

1

>

Figure 31: Forecasts of dlm and benchmark SSM 18 months ahead

The main difference in the forecasts stems from the (negative) AR(1)-term of the SSM, which implies an alternating (negatively autocorrelated) forecast overlaying the seasonal and trend extrapolations. The similarity of both forecasts is a bit fortuitous and should not detract from the fact that the dlm-model prefers a fixed-slope constant-growth model (very small innovation-variance W22 = 8.9e−06 for the slope in the state-equation) vs. our SSM preferring an adaptive slope (Q22 = 0.00133). Clearly, the latter specification is to be preferred for the type of health-care dynamics under scrutiny here.

10.2 Package KFAS (as interfaced by dlmodeler)

The package dlmodeler is an interface to various SS-packages: here we use KFAS, see http://cran.r-project.org/web/packages/KFAS/index.html.

> library(dlmodeler)

> library(KFAS)


We first wanted to implement a SSM with variable level and slope components as well as a monthly seasonal, directly in KFAS. Unfortunately the structSSM-function didn't allow for all three components: either level+seasonal (without changing slope) or level+slope (without seasonal) were possible. The more general SSModel function should be able to handle the level-slope-seasonal Troika, but then one would have to specify all system matrices 'by hand', which is cumbersome, particularly so because the nomenclature is completely different (all system matrices and variance-covariance matrices are called differently). Therefore, we relied on the more comfortable dlmodeler interface where the Troika can be implemented as follows⁷:

> mod <- dlmodeler.build.structural(

+ pol.order=1,

+ pol.sigmaQ=NA,

+ dseas.order=12,

+ dseas.sigmaQ=NA,

+ sigmaH=NA,

+ name='KFAS')

The parameter pol.order=1 means that we allow for a variable level and slope (pol.order=0 would mean a variable level without slope). NA's indicate that the corresponding parameters must be estimated. Another idiosyncrasy is the specification of the time series, which must be a row-matrix:

> z_adj<-matrix(ts(z-parm[length(parm)]*intermhk,frequency=12,start=c(2000,1)),nrow=1)

Otherwise, the numerical optimization will complain! Next we fit the unknown parameters, i.e. the variances R and Q.

> fit <- dlmodeler.fit(z_adj, mod, method="MLE")

CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH : 0 in 35 iterations

> fit$model$Ht

[,1]

[1,] 13.00915

> fit$model$Qt[1,1]

[1] 11.07678

> fit$model$Qt[2,2]

[1] 2.353679e-08

> fit$model$Qt[3,3]

[1] 1.509263e-07

The variance of w_t, Ht = 13.009 (which corresponds to R = 11.95 in section 9.7), and the variance of the level, Qt[1, 1] = 11.1, dominate: the variance of the level as estimated by KFAS is much larger than Q11 = 0.16008 as estimated in section 9.7. This will result in a (very) noisy level-component, see fig.32 below (middle graph). We note, also, that the cumulated variance of the level-innovation within the random-walk specification of the level would result in a total variance of the level-component

Var(ξ_{1t}) = len · Qt[1, 1] = 1129.83

⁷The notation and nomenclature for the system-matrices in dlmodeler also partially depart from dlm or from our own notation: for example, the matrix H (called sigmaH in the code) corresponds to R, the variance of the w_t. Surprisingly, the matrix Q (called sigmaQ in the code) corresponds to our own Q, the variance of the innovations in the state-equation.


which heavily exceeds the variance of the TARMED-adjusted series

Var(y_t − θI_{t≥2004}) = 93.22

Possibly the numerical optimization did not converge to the global minimum here (although the corresponding summary did not indicate potential problems). In stark contrast, the variances of the slope, Qt[2, 2] = 2e−08, and of the seasonal, Qt[3, 3] = 1.5e−07, are close to zero, thus suggesting a constant-slope model which conflicts with the empirical findings in section 9 (where Q22 = 0.001 and Q33 = 0.303) as well as with our a priori knowledge about the continuously evolving health-care dynamics. Therefore we mark our preference for the SSM in section 9 (recall that the dlm-model in the previous section was too conservative, too).

We can compute filtered estimates of the components, see fig.32. Note that the slope cannot be separated from the trend (which is unfortunate because the slope is the most interesting component in this application) and therefore we plot the sum of both components.

> # compute the filtered and smoothed values

> f <- dlmodeler.filter(z_adj, fit$model, smooth=TRUE)

> # f.ce represents the filtered one step ahead observation

> # prediction expectations E[y(t) | y(1), y(2), ..., y(t-1)]

> f.ce <- dlmodeler.extract(f, fit$model,

+ type="observation", value="mean")

> # s.ce represents the smoothed observation expectations

> # E[y(t) | y(1), y(2), ..., y(n)]

> s.ce <- dlmodeler.extract(f$smooth, fit$model,

+ type="observation", value="mean")

> file = paste("z-KFAS_components.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> anf<-13

> par(mfrow=c(3,1))

> tit<-paste("Linearized Data (TARMED intervention is substracted)",sep="")

> ts.plot(ts(z_adj[anf:len],frequency=12,start=c(2001,1)),type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="black")

> tit<-paste("Trend+Slope",sep="")

> ts.plot(ts(t(f.ce[[1]])[anf:len],frequency=12,start=c(2001,1)),type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="black",

+ ylim=c(min(xttx[anf:len,1]),max(xttx[anf:len,1])))

> lines(ts(xttx[anf:len,1],frequency=12,start=c(2001,1)),type="l",

+ lwd=lwde+1,col="brown")

> mtext("Trend+slope KFAS", side = 3, line = -1,at=2004,col="black")

> mtext("Trend SSM-MHK", side = 3, line = -2,at=2004,col="brown")

> tit<-paste("Seasonal",sep="")

> ts.plot(ts(t(f.ce$seasonal)[anf:len],frequency=12,start=c(2001,1)),type="l",

+ main=tit,ylab="",xlab="",lwd=lwde+1,col="black")

> lines(ts(xttx[anf:len,3],frequency=12,start=c(2001,1)),type="l",

+ lwd=lwde+1,col="violet")

> mtext("Seasonal KFAS", side = 3, line = -1,at=2004,col="black")

> mtext("Seasonal SSM-MHK", side = 3, line = -2,at=2004,col="violet")

> dev.off()

null device

1

The components as estimated by KFAS are again benchmarked against the components obtained by our SSM in section 9. As expected, the level+slope component in the mid-graph is very noisy.


Figure 32: A comparison of components by KFAS and the SSM in the previous section

This noisy component does not match our research purpose. Note also that it seems to track the seasonal fluctuations around 2004-2006, which is clearly undesirable, since the regular yearly fluctuations should be tracked by the seasonal component (bottom graph).

To conclude we compare the forecasts of KFAS with the benchmark SSM of section 9: note that we add back the TARMED-intervention, see fig.33.

> f <- dlmodeler.forecast(z_adj, fit$model, start=len+1, ahead=flen)

> file = paste("z-KFAS_forecasts.pdf", sep = "")

> pdf(file = paste(path.out,file,sep=""), paper = "special", width = 6, height = 6)

> tit<-paste("KFAS vs. Bechmark SSM Forecasts",sep="")

> ymax<-max(f$upper+parm[length(parm)]*intermhk[len])

> ymin<-min(y)


> ts.plot(ts(c(y,rep(NA,flen)),frequency=12,start=c(2000,1)),type="l",

+ main=tit,ylab="%-Growth",xlab="",lwd=lwde+1,col="black",ylim=c(ymin,ymax))

> lines(ts(c(rep(NA,len),fore[fstep,]),frequency=12,start=c(2000,1)),col="red")

> lines(ts(c(rep(NA,len),fore[fstep,]-2*sqrt(sigma_intervall)),frequency=12,start=c(2000,1)),col="red",lty=2)

> lines(ts(c(rep(NA,len),fore[fstep,]+2*sqrt(sigma_intervall)),frequency=12,start=c(2000,1)),col="red",lty=2)

> lines(ts(c(rep(NA,len),parm[length(parm)]*intermhk[len]+f$yhat),frequency=12,start=c(2000,1)),col="dark blue")

> lines(ts(c(rep(NA,len),parm[length(parm)]*intermhk[len]+f$lower),frequency=12,start=c(2000,1)),col="blue",lty=2)

> lines(ts(c(rep(NA,len),parm[length(parm)]*intermhk[len]+f$upper),frequency=12,start=c(2000,1)),col="blue",lty=2)

> abline(v=2004)

> mtext("KFAS", side = 3, line = -1,at=2004,col="blue")

> mtext("SSM", side = 3, line = -2,at=2004,col="red")

> dev.off()

null device

1

Figure 33: Forecasts: KFAS (blue) vs. benchmark SSM (red)

The forecast-interval of KFAS is (unrealistically) much wider than the SSM forecast-interval. This is a direct consequence of the larger (unrealistic) innovation-variance Qt[1, 1] = 11.1 of the (noisier) level-component of KFAS as compared to the level-variance Q11 = 0.1601 of the benchmark SSM in section 9.

10.3 Summary

All three approaches differ in the specification of the components. The SSM proposed in section 9 is the most flexible but it is also the most complex because it involves more unknown parameters

parm′ = (a1, a2, a12, R, Q11, Q22, Q33, P_{1|0,11}, P_{1|0,22}, θ)


(note that a2 is not estimated). However, the added computational burden is rewarded by improved component-estimates, consistent with the time series dynamics, as well as better forecasting performances of this model. The dlm prefers a model with nearly fixed level and slope components, which conflicts with our a priori knowledge about the dynamics of the time series. In contrast, KFAS assumes a very noisy, heavily adaptive level-component which cannot be reconciled with the observed time series dynamics⁸. As a consequence, forecast intervals spread much too fast as a function of the forecast horizon.

Disclaimer:

We cannot completely exclude that we did not parametrize the dlm- and KFAS-packages correctly (or did not make use of their full potential). Moreover, SSMs attempt to solve a difficult identification (estimation) problem and therefore the numerical optimization algorithm may fail to converge to the global minimum (although the corresponding summaries did not suggest problems). In general, we tend to assign a strong preference to customized models, such as the one proposed in section 9, because 'standardized' approaches, such as dlm or KFAS, are rather difficult to control and to handle correctly. Moreover, one-size-fits-all designs are frequently challenged by 'typical' dynamics of economic time series. Experience invariably suggests that carefully implemented DIY-SSM (do-it-yourself designs) outperform standardized approaches!

11 Appendix

11.1 The Kalman-Gain

Equation 6 can be derived from a simple mean-square estimation of the unknown weight β in the regression equation

ξ_t = ξ_{t|t−1} + βε_t + u_t

The least-squares formula for the estimate of β is

b = Cov(ξ_t, ε_t|y_{t−1}, y_{t−2}, ..., y_1) Var(ε_t|y_{t−1}, y_{t−2}, ..., y_1)^{−1}   (20)

Now

Cov(ξ_t, ε_t|y_{t−1}, y_{t−2}, ..., y_1) = E[(ξ_t − ξ_{t|t−1})(y_t − y_{t|t−1})′]
= E[(ξ_t − ξ_{t|t−1})(H′(ξ_t − ξ_{t|t−1}) + w_t)′]
= P_{t|t−1}H

And

Var(ε_t|y_{t−1}, y_{t−2}, ..., y_1) = Var(y_t − y_{t|t−1})
= Var(H′(ξ_t − ξ_{t|t−1}) + w_t)
= H′P_{t|t−1}H + R

By plugging these expressions into equation 20 we thus obtain the Kalman-Gain in equation 6.

We now derive equation 8: P_{t|t} is the variance of ξ_{t|t}. Using equation 6 we obtain

P_{t|t} = Var(ξ_{t|t})
= P_{t|t−1} + 2Cov(ξ_{t|t−1}, P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}ε_t)
+ Var(P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}ε_t)   (21)

⁸The cumulated variance would exceed the effectively observed sample variance of the series by a factor of ten.


Now

Var(P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}ε_t)
= Var(P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}(y_t − H′ξ_{t|t−1}))
= Var(P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}(H′(ξ_t − ξ_{t|t−1}) + w_t))
= P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}(H′P_{t|t−1}H + R)(P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1})′
= P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}(H′P_{t|t−1}H + R)(H′P_{t|t−1}H + R)^{−1}H′P′_{t|t−1}
= P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}H′P_{t|t−1}

And

Cov(ξ_{t|t−1}, P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}ε_t)
= Cov(ξ_{t|t−1}, P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}(y_t − H′ξ_{t|t−1}))
= Cov(ξ_{t|t−1}, P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}(−H′(ξ_{t|t−1} − ξ_t) + w_t))
= P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}(−H′)Cov(ξ_{t|t−1}, ξ_{t|t−1})
= −P_{t|t−1}H(H′P_{t|t−1}H + R)^{−1}H′P_{t|t−1}

Plugging this into equation 21 we obtain equation 8.
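A tiny numerical check of this identity (a sketch with arbitrary, hypothetical example values for P_{t|t−1}, H and R):

> # Sketch: verify P_{t|t} = P - P H (H'PH+R)^{-1} H'P numerically
> P<-matrix(c(2,0.5,0.5,1),2,2)       # some P_{t|t-1}
> H<-matrix(c(1,0),2,1)
> R<-0.5
> K<-P%*%H%*%solve(t(H)%*%P%*%H+R)    # Kalman-Gain
> P-K%*%t(H)%*%P                      # P_{t|t} as in equation 8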
