
Page 1

Lecture 3: Simulation algorithms
Gaussian Markov random fields

David Bolin
Chalmers University of Technology

January 26, 2015

Page 2

The sparse precision matrix

Recall that

E(x_i - \mu_i \mid x_{-i}) = -\frac{1}{Q_{ii}} \sum_{j \sim i} Q_{ij}(x_j - \mu_j), \qquad Prec(x_i \mid x_{-i}) = Q_{ii}

In most cases:
• The total number of neighbours is O(n).
• Only O(n) of the n^2 terms in Q will be non-zero.
• Use this to construct exact simulation algorithms for GMRFs, using numerical algorithms for sparse matrices.
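As a quick numerical illustration (not from the slides): a NumPy sketch, on a small hypothetical 4-node precision matrix, checking that the conditional mean computed from Q via the formula above agrees with ordinary covariance-based Gaussian conditioning; the conditional precision is simply the diagonal entry Q_ii.

```python
import numpy as np

# A small SPD precision matrix on 4 nodes (hypothetical example)
Q = np.array([[ 2.5, -1.0,  0.0, -0.5],
              [-1.0,  2.5, -1.0,  0.0],
              [ 0.0, -1.0,  2.5, -1.0],
              [-0.5,  0.0, -1.0,  2.5]])
mu = np.zeros(4)
x = np.array([0.3, -1.2, 0.7, 0.1])          # values of the conditioning variables
i, rest = 1, [0, 2, 3]

# E(x_i | x_{-i}) computed from the precision matrix (the formula above)
m_prec = mu[i] - (Q[i, rest] @ (x[rest] - mu[rest])) / Q[i, i]

# The same conditional mean from the covariance matrix (standard Gaussian conditioning)
S = np.linalg.inv(Q)
m_cov = mu[i] + S[i, rest] @ np.linalg.solve(S[np.ix_(rest, rest)], x[rest] - mu[rest])

assert np.isclose(m_prec, m_cov)             # the two ways of computing E(x_i | x_{-i}) agree
```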

Introduction David Bolin

Page 3

Example of a typical precision matrix

Introduction David Bolin

Page 4

Example of a typical precision matrix

Introduction David Bolin

Page 5

Example of a typical precision matrix

Each node has on average 7 neighbours.

Introduction David Bolin

Page 6

Simulation algorithms for GMRFs

Can we take advantage of the sparse structure of Q?
• It is faster to factorise a sparse Q than a dense Q.
• The speedup depends on the “pattern” in Q, not only on the number of non-zero terms.

Our task
• Formulate all algorithms to use only sparse matrices.
• Unconditional simulation
• Conditional simulation
  • Condition on a subset of variables
  • Condition on linear constraints
  • Condition on linear constraints with normal noise
• Evaluation of the log-density in all cases.

Simulation algorithms for GMRFs — Task David Bolin

Page 7

The result

In most cases, the cost is
• O(n) for temporal GMRFs
• O(n^{3/2}) for spatial GMRFs
• O(n^2) for spatio-temporal GMRFs

including evaluation of the log-density.

To condition on k linear constraints, add O(k^3).

These are general algorithms that depend only on the graph G, not on the numerical values in Q.

The core is numerical algorithms for sparse matrices.

Simulation algorithms for GMRFs — Summary of the simulation algorithms David Bolin

Page 8

Cholesky factorisation

If A is an n × n positive-definite matrix (A > 0), then there exists a unique Cholesky triangle L, i.e. a lower-triangular matrix with positive diagonal elements, such that

A = LL^T

This factorisation is the basis for solving systems like

Ax = b or AX = B

for k right-hand sides, or equivalently, for computing

x = A^{-1}b or X = A^{-1}B

Basic numerical linear algebra — Cholesky factorisation and the Cholesky triangle David Bolin

Page 9

How to compute the Cholesky factorisation

Q_{ij} = \sum_{k=1}^{j} L_{ik} L_{jk}, \quad i \ge j.

v_i = Q_{ij} - \sum_{k=1}^{j-1} L_{ik} L_{jk}, \quad i \ge j.

Then
• L_{jj}^2 = v_j, and
• L_{ij} L_{jj} = v_i for i > j.

If we know {v_i} for fixed j, then

L_{jj} = \sqrt{v_j} and L_{ij} = v_i / \sqrt{v_j}, for i = j + 1, \dots, n.

This gives the jth column in L.

Basic numerical linear algebra — Cholesky factorisation and the Cholesky triangle David Bolin

Page 10

Cholesky factorization of Q > 0

Algorithm 1 Computing the Cholesky triangle L of Q

1: for j = 1 to n do
2:   v_{j:n} = Q_{j:n,j}
3:   for k = 1 to j − 1 do v_{j:n} = v_{j:n} − L_{j:n,k} L_{jk}
4:   L_{j:n,j} = v_{j:n} / \sqrt{v_j}
5: end for
6: Return L

The overall process involves n^3/3 flops.
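A direct NumPy transcription of Algorithm 1 might look as follows (a dense-matrix sketch for illustration; in practice one would call a library routine such as numpy.linalg.cholesky, and for sparse Q a sparse Cholesky solver):

```python
import numpy as np

def cholesky_triangle(Q):
    """Column-by-column Cholesky factorisation Q = L L^T (Algorithm 1)."""
    n = Q.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        v = Q[j:, j].astype(float)        # step 2: v_{j:n} = Q_{j:n, j}
        for k in range(j):                # step 3: subtract contributions of earlier columns
            v -= L[j:, k] * L[j, k]
        L[j:, j] = v / np.sqrt(v[0])      # step 4: L_jj = sqrt(v_j), L_ij = v_i / sqrt(v_j)
    return L

# Quick check against the library routine on a random SPD matrix
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
Q = B @ B.T + 5 * np.eye(5)
assert np.allclose(cholesky_triangle(Q), np.linalg.cholesky(Q))
```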

Basic numerical linear algebra — Cholesky factorisation and the Cholesky triangle David Bolin

Page 11

Solving linear equations

Algorithm 2 Solving Ax = b where A > 0

1: Compute the Cholesky factorisation, A = LL^T
2: Solve Lv = b
3: Solve L^T x = v
4: Return x

Step 2 is called forward-substitution and costs O(n^2) flops.

The solution v is computed in a forward-loop

v_i = \frac{1}{L_{ii}} \left( b_i - \sum_{j=1}^{i-1} L_{ij} v_j \right), \quad i = 1, \dots, n \qquad (1)

Basic numerical linear algebra — Solving linear equations David Bolin

Page 12

Solving linear equations

Algorithm 2 Solving Ax = b where A > 0

1: Compute the Cholesky factorisation, A = LL^T
2: Solve Lv = b
3: Solve L^T x = v
4: Return x

Step 3 is called back-substitution and costs O(n^2) flops.

The solution x is computed in a backward-loop

x_i = \frac{1}{L_{ii}} \left( v_i - \sum_{j=i+1}^{n} L_{ji} x_j \right), \quad i = n, \dots, 1 \qquad (2)
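Equations (1) and (2) translate directly into code; a NumPy sketch of Algorithm 2 (illustrative function names, dense arrays):

```python
import numpy as np

def forward_solve(L, b):
    """Forward-substitution, eq. (1): solve L v = b for lower-triangular L."""
    n = len(b)
    v = np.zeros(n)
    for i in range(n):
        v[i] = (b[i] - L[i, :i] @ v[:i]) / L[i, i]
    return v

def backward_solve(L, v):
    """Back-substitution, eq. (2): solve L^T x = v."""
    n = len(v)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (v[i] - L[i + 1:, i] @ x[i + 1:]) / L[i, i]
    return x

def solve_spd(A, b):
    """Algorithm 2: solve A x = b for A > 0 via the Cholesky triangle."""
    L = np.linalg.cholesky(A)
    return backward_solve(L, forward_solve(L, b))
```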

Basic numerical linear algebra — Solving linear equations David Bolin

Page 13

We do not need to compute inverses

To compute A^{-1}B, where B is an n × k matrix, we compute the solution X of

A X_j = B_j

for each of the k columns of X.

Algorithm 3 Solving AX = B where A > 0

1: Compute the Cholesky factorisation, A = LL^T
2: for j = 1 to k do
3:   Solve Lv = B_j
4:   Solve L^T X_j = v
5: end for
6: Return X
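With SciPy this is a factor-once, solve-many pattern; a small sketch with made-up example data:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_spd_multi(A, B):
    """Algorithm 3: compute X = A^{-1} B column by column, never forming A^{-1}."""
    c = cho_factor(A, lower=True)   # A = L L^T, factorised once
    return cho_solve(c, B)          # forward/back substitutions for all k right-hand sides

# Example with k = 3 right-hand sides
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
B = rng.standard_normal((6, 3))
X = solve_spd_multi(A, B)
assert np.allclose(A @ X, B)
```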

Basic numerical linear algebra — Avoid computing the inverse David Bolin

Page 14

Sample x ∼ N(µ, Q^{-1})

If Q = LL^T and z ∼ N(0, I), then x defined by

L^T x = z

has covariance

Cov(x) = Cov(L^{-T} z) = (LL^T)^{-1} = Q^{-1}

Algorithm 4 Sampling x ∼ N(µ, Q^{-1})

1: Compute the Cholesky factorisation, Q = LL^T
2: Sample z ∼ N(0, I)
3: Solve L^T v = z
4: Compute x = µ + v
5: Return x
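A NumPy/SciPy sketch of Algorithm 4 (the function name and the use of a numpy.random.Generator are my own choices):

```python
import numpy as np
from scipy.linalg import solve_triangular

def sample_gmrf(mu, Q, rng):
    """Algorithm 4: sample x ~ N(mu, Q^{-1}) using the Cholesky triangle of Q."""
    L = np.linalg.cholesky(Q)                          # step 1: Q = L L^T
    z = rng.standard_normal(len(mu))                   # step 2: z ~ N(0, I)
    v = solve_triangular(L, z, lower=True, trans='T')  # step 3: solve L^T v = z
    return mu + v                                      # step 4
```

For a sparse Q, the same four steps apply with a sparse Cholesky factorisation in step 1.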

Unconditional sampling — Simulation algorithm David Bolin

Page 15

Evaluating the log-density

For Bayesian computations, we often have to compute the log-density for the normal distribution:

log π(x) = -\frac{n}{2} \log 2\pi + \frac{1}{2} \log\det(Q) - \frac{1}{2} \underbrace{(x - \mu)^T Q (x - \mu)}_{=q}

If x is sampled, then q = z^T z; otherwise, compute this term as
• u = x − µ
• v = Qu
• q = u^T v

log det(Q) is also easy to evaluate: we get the determinant for free with the Cholesky decomposition:

\frac{1}{2} \log\det(Q) = \frac{1}{2} \log\det(LL^T) = \log\det(L) = \sum_{i=1}^{n} \log L_{ii}
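A matching NumPy sketch of the log-density evaluation, reusing an already computed Cholesky triangle when available:

```python
import numpy as np

def gmrf_logdensity(x, mu, Q, L=None):
    """Evaluate log pi(x) for x ~ N(mu, Q^{-1}); reuse a precomputed Cholesky triangle L if given."""
    if L is None:
        L = np.linalg.cholesky(Q)
    u = x - mu
    q = u @ (Q @ u)                         # quadratic form (x - mu)^T Q (x - mu)
    half_logdet = np.log(np.diag(L)).sum()  # (1/2) log det(Q) = sum_i log L_ii
    return -0.5 * len(x) * np.log(2.0 * np.pi) + half_logdet - 0.5 * q
```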

Unconditional sampling — Evaluating the log-density David Bolin

Page 16

Conditional simulation of a GMRF

Decompose x as

\begin{pmatrix} x_A \\ x_B \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} Q_{AA} & Q_{AB} \\ Q_{BA} & Q_{BB} \end{pmatrix}^{-1} \right)

Then x_A | x_B has canonical parameterisation

x_A - \mu_A \mid x_B \sim N_C(-Q_{AB}(x_B - \mu_B), Q_{AA}) \qquad (3)

Simulate using an algorithm for a canonical parameterisation.

Conditional sampling David Bolin

Page 17

Sample x ∼ N_C(b, Q)

Recall that

“N_C(b, Q) = N(Q^{-1}b, Q^{-1})”

so we need to compute the mean as well.

Algorithm 5 Sampling x ∼ N_C(b, Q)

1: Compute the Cholesky factorisation, Q = LL^T
2: Solve Lw = b
3: Solve L^T µ = w
4: Sample z ∼ N(0, I)
5: Solve L^T v = z
6: Compute x = µ + v
7: Return x
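A NumPy/SciPy sketch of Algorithm 5 (illustrative; a sparse implementation would replace the dense Cholesky and triangular solves):

```python
import numpy as np
from scipy.linalg import solve_triangular

def sample_canonical(b, Q, rng):
    """Algorithm 5: sample x ~ N_C(b, Q) = N(Q^{-1} b, Q^{-1})."""
    L = np.linalg.cholesky(Q)                           # step 1: Q = L L^T
    w = solve_triangular(L, b, lower=True)              # step 2: L w = b
    mu = solve_triangular(L, w, lower=True, trans='T')  # step 3: L^T mu = w, so mu = Q^{-1} b
    z = rng.standard_normal(len(b))                     # step 4
    v = solve_triangular(L, z, lower=True, trans='T')   # step 5: L^T v = z
    return mu + v                                       # step 6
```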

Conditional sampling — Sampling from a canonical parameterised GMRF David Bolin

Page 18

Posterior sampling and kriging

A common scenario is that we have a hierarchical model

y | x ∼ N(Ax, Q_ε^{-1})
x ∼ N(µ_x, Q_x^{-1})

and are interested in sampling x | y ∼ N(µ_{x|y}, Q_{x|y}^{-1}), with

µ_{x|y} = µ_x + Q_{x|y}^{-1} A^T Q_ε (y - Aµ_x)
Q_{x|y} = Q_x + A^T Q_ε A

Direct method for sampling: use Algorithm 5 for

x | y - µ_x ∼ N_C(A^T Q_ε (y - Aµ_x), Q_{x|y}).

This works well if we only have local observations.
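A sketch of the direct method, i.e. Algorithm 5 applied to the canonical parameterisation above (dense NumPy arrays, illustrative names):

```python
import numpy as np
from scipy.linalg import solve_triangular

def sample_posterior(y, A, Q_eps, mu_x, Q_x, rng):
    """Direct posterior sampling of x | y via the canonical form N_C(A^T Q_eps (y - A mu_x), Q_{x|y})."""
    Q_post = Q_x + A.T @ Q_eps @ A                      # posterior precision Q_{x|y}
    b = A.T @ Q_eps @ (y - A @ mu_x)                    # canonical mean parameter
    L = np.linalg.cholesky(Q_post)
    w = solve_triangular(L, b, lower=True)              # L w = b
    m = solve_triangular(L, w, lower=True, trans='T')   # L^T m = w, so m = Q_post^{-1} b
    z = rng.standard_normal(len(mu_x))
    v = solve_triangular(L, z, lower=True, trans='T')   # L^T v = z
    return mu_x + m + v                                 # shift back by the prior mean
```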

Conditional sampling — Posterior sampling David Bolin

Page 19

Conditioning by kriging

A popular approach to posterior sampling for covariance-based hierarchical models is the conditioning-by-kriging approach.

The idea behind it is to sample from the prior and then correct the sample so that it becomes a sample from the posterior.

The traditional form of the method is
• Sample x* ∼ N(µ_x, Q_x^{-1})
• Sample y* ∼ N(y, Q_ε^{-1})
• Return x = x* - Q_x^{-1} A^T (A Q_x^{-1} A^T + Q_ε^{-1})^{-1} (Ax* - y*)

The name comes from the fact that the final expression is the covariance-based kriging equation.

Conditional sampling — Conditioning by kriging David Bolin

Page 20

Conditioning by kriging II

(AQ_x^{-1}A^T + Q_ε^{-1}) is almost always a dense matrix.
• We can only handle a small number of observations.

However, we can modify the expression using the Woodbury identity to recover the precision-based kriging expression:

x = x* - Q_x^{-1}A^T(AQ_x^{-1}A^T + Q_ε^{-1})^{-1}(Ax* - y*)
  = x* + Q_{x|y}^{-1}A^T Q_ε(y* - Ax*)

which is calculated using forward and backward solves.

A quick operation count shows that this is almost always less efficient than the direct method for GMRF models.

The exception is when we have non-local observations.

Conditional sampling — Conditioning by kriging David Bolin

Page 21

Evaluating the conditional marginal data likelihood

To evaluate π(y|θ), we can use that, for any x*, we have

π(y|θ) = \frac{π(θ,y)}{π(θ)}
       = \left.\frac{π(θ,y)\,π(x|θ,y)}{π(θ)\,π(x|θ,y)}\right|_{x=x^*}
       = \left.\frac{π(θ,y,x)}{π(θ)\,π(x|θ,y)}\right|_{x=x^*}
       = \left.\frac{π(x|θ)\,π(y|θ,x)}{π(x|θ,y)}\right|_{x=x^*}

In practice, we use x* = µ_{x|y} for numerical stability.

Conditional sampling — Conditioning by kriging David Bolin

Page 22

Evaluating the conditional marginal data likelihood

With x* = µ_{x|y}, we get

log π(y|θ) = -\frac{n}{2}\log(2π) + \frac{1}{2}\log|Q_x| + \frac{1}{2}\log|Q_ε| - \frac{1}{2}\log|Q_{x|y}|
             - \frac{1}{2}(µ_{x|y} - µ_x)^T Q_x (µ_{x|y} - µ_x)
             - \frac{1}{2}(y - Aµ_{x|y})^T Q_ε (y - Aµ_{x|y})

Note that we do not need to evaluate Q_y, which is dense.

By comparing terms, we can note that
1 log|Q_y| = log|Q_x| + log|Q_ε| - log|Q_{x|y}|
2 (y - Aµ_x)^T Q_y (y - Aµ_x) is given by the sum of the quadratic forms above.

Conditional sampling — Conditioning by kriging David Bolin

Page 23

Linear constraints

Posterior sampling can be seen as sampling under linear constraints.

More generally, there are three basic classes of linear constraints that are important:

1 Non-interacting hard constraints
  Certain nodes are given explicitly
2 Interacting hard constraints
  Linear combinations of nodes are constrained
3 Soft constraints
  Linear combinations are constrained under uncertainty

Conditional sampling — Conditioning by kriging David Bolin

Page 24

Linear constraints

All these cases can be written as

Ax + ε = e

where

ε = \begin{cases} 0 & \text{for hard constraints} \\ N(0, Q_ε^{-1}) & \text{for soft constraints} \end{cases}

Aim: Sample from x | e when

x ∼ N(µ, Q^{-1})
e | x ∼ N(Ax, Q_ε^{-1})

Conditional sampling — Conditioning by kriging David Bolin

Page 25

Soft constraints

Soft constraints are equivalent to the posterior sampling case.

Thus, in general we use the direct method for sampling.

The exception is if we have non-local observations, when it is better to use conditioning by kriging, since the posterior precision is dense.

Example
If we observe the sum of the x_i with unit-variance noise, the posterior precision is

Q + 11^T

which is a dense matrix.

Sampling under soft linear constraints David Bolin

Page 26

Algorithm 6 Sampling x | Ax = ε when ε ∼ N(e, Σ_ε) and x ∼ N(µ, Q^{-1})

1: Compute the Cholesky factorisation, Q = LL^T
2: Sample z ∼ N(0, I)
3: Solve L^T v = z
4: Compute x = µ + v
5: Compute V_{n×k} = Q^{-1}A^T using Algorithm 3 with L
6: Compute W_{k×k} = AV + Σ_ε
7: Compute U_{k×n} = W^{-1}V^T using Algorithm 3
8: Sample ε ∼ N(e, Σ_ε)
9: Compute c = Ax − ε
10: Compute x* = x − U^T c
11: Return x*

When z = 0 and ε = e, then x* is the conditional mean.

This is computationally feasible if k ≪ n.
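A dense NumPy/SciPy sketch of Algorithm 6 (illustrative function name; step 8 is drawn with NumPy's multivariate_normal, and a sparse implementation would replace the dense factorisations):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

def sample_soft_constrained(mu, Q, A, e, Sigma_eps, rng):
    """Algorithm 6: sample x | Ax = eps with eps ~ N(e, Sigma_eps), for x ~ N(mu, Q^{-1})."""
    L = np.linalg.cholesky(Q)                               # step 1
    z = rng.standard_normal(len(mu))                        # step 2
    x = mu + solve_triangular(L, z, lower=True, trans='T')  # steps 3-4: unconditional sample
    V = cho_solve((L, True), A.T)                           # step 5: V = Q^{-1} A^T   (n x k)
    W = A @ V + Sigma_eps                                   # step 6                  (k x k)
    U = cho_solve(cho_factor(W, lower=True), V.T)           # step 7: U = W^{-1} V^T  (k x n)
    eps = rng.multivariate_normal(e, Sigma_eps)             # step 8
    c = A @ x - eps                                         # step 9
    return x - U.T @ c                                      # step 10
```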

Sampling under soft linear constraints David Bolin

Page 27

Evaluating the log-density

With only local observations, we evaluate the log-density of the posterior in the same way as we evaluated the prior log-density.

For non-local observations, the log-density can be computed via

π(x | e) = \frac{π(x)\,π(e | x)}{π(e)} \qquad (4)

All required Cholesky factors are available from the simulation algorithm.

π(x)-term. This is a GMRF.
π(e | x)-term. e | x is Gaussian with mean Ax and covariance Σ_ε.
π(e)-term. e is Gaussian with mean Aµ and covariance matrix AQ^{-1}A^T + Σ_ε.

Sampling under soft linear constraints — Evaluating the log-density David Bolin

Page 28

Non-interacting hard constraints

This is the simplest case: we split the joint distribution of x into non-constrained nodes x_n and constrained nodes x_c:

\begin{pmatrix} x_n \\ x_c \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_n \\ \mu_c \end{pmatrix}, \begin{pmatrix} Q_{nn} & Q_{nc} \\ Q_{cn} & Q_{cc} \end{pmatrix}^{-1} \right).

By Theorem 2.5 we have x_n | x_c = e ∼ N(µ_{n|c}, Q_{nn}^{-1}), where

µ_{n|c} = µ_n - Q_{nn}^{-1} Q_{nc}(e - µ_c)

Thus, we can sample the constrained distribution using the algorithm for the canonical parameterisation (Algorithm 5).
Note that the joint constrained distribution is degenerate:

\left( \begin{pmatrix} x_n \\ x_c \end{pmatrix} \,\middle|\, x_c = e \right) \sim N\left( \begin{pmatrix} \mu_{n|c} \\ e \end{pmatrix}, \begin{pmatrix} Q_{nn}^{-1} & 0 \\ 0 & 0 \end{pmatrix} \right).

Non-interacting hard constraints David Bolin

Page 29

Interacting hard constraints x | Ax = e

This case occurs quite frequently:

A sum-to-zero constraint corresponds to k = 1, A = 1^T and e = 0.

The linear constraint makes the conditional distribution Gaussian
• but it is singular, as the rank of the constrained covariance matrix is n − k
• more care must be exercised during sampling

We have that

E(x | Ax = e) = µ - Q^{-1}A^T(AQ^{-1}A^T)^{-1}(Aµ - e) \qquad (5)
Cov(x | Ax = e) = Q^{-1} - Q^{-1}A^T(AQ^{-1}A^T)^{-1}AQ^{-1} \qquad (6)

This is typically a dense-matrix case, which must be solved using general O(n^3) algorithms.

Interacting hard constraints David Bolin

Page 30

Conditioning via Kriging

However, this is where we can use conditioning by kriging and correct for the constraints, at nearly no cost if k ≪ n.

With Q_ε^{-1} = 0 in the earlier expression, we get:

Let x ∼ N(µ, Q^{-1}); then compute

x* = x - Q^{-1}A^T(AQ^{-1}A^T)^{-1}(Ax - e). \qquad (7)

Now x* has the correct conditional distribution!

AQ^{-1}A^T is a k × k matrix, hence its factorisation is fast to compute for small k.

Interacting hard constraints — Specific algorithm with not too many constraints David Bolin

Page 31

Algorithm 7 Sampling x | Ax = e when x ∼ N(µ, Q^{-1})

1: Compute the Cholesky factorisation, Q = LL^T
2: Sample z ∼ N(0, I)
3: Solve L^T v = z
4: Compute x = µ + v
5: Compute V_{n×k} = Q^{-1}A^T with Algorithm 3
6: Compute W_{k×k} = AV
7: Compute U_{k×n} = W^{-1}V^T using Algorithm 3
8: Compute c = Ax − e
9: Compute x* = x − U^T c
10: Return x*

If z = 0 in Algorithm 7, then x* is the conditional mean.
The extra cost is only O(k^3) for large k!
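A dense NumPy/SciPy sketch of Algorithm 7 (illustrative names); for a sum-to-zero constraint one would call it with A = np.ones((1, n)) and e = np.zeros(1):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

def sample_hard_constrained(mu, Q, A, e, rng):
    """Algorithm 7: sample x | Ax = e for x ~ N(mu, Q^{-1}) by correcting an unconstrained sample."""
    L = np.linalg.cholesky(Q)                               # step 1
    z = rng.standard_normal(len(mu))                        # step 2
    x = mu + solve_triangular(L, z, lower=True, trans='T')  # steps 3-4
    V = cho_solve((L, True), A.T)                           # step 5: V = Q^{-1} A^T
    W = A @ V                                               # step 6: W = A Q^{-1} A^T  (k x k)
    U = cho_solve(cho_factor(W, lower=True), V.T)           # step 7: U = W^{-1} V^T
    c = A @ x - e                                           # step 8
    return x - U.T @ c                                      # step 9: the result satisfies A x* = e
```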

Interacting hard constraints — Specific algorithm with not too many constraints David Bolin

Page 32

Evaluating the log-density

The log-density can be rapidly computed using the following identity:

π(x | Ax) = \frac{π(x)\,π(Ax | x)}{π(Ax)}, \qquad (8)

Each term on the right-hand side is easier to compute than the left-hand side.

π(x)-term. This is a GMRF and the log-density is easy to compute using L computed in Algorithm 7, step 1.

Interacting hard constraints — Evaluating the log-density David Bolin

Page 33

Evaluating the log-density

The log-density can be rapidly computed using the following identity:

π(x | Ax) = \frac{π(x)\,π(Ax | x)}{π(Ax)}, \qquad (8)

Each term on the right-hand side is easier to compute than the left-hand side.

π(Ax | x)-term. This is a degenerate density, which is either zero or a constant,

log π(Ax | x) = -\frac{1}{2} \log|AA^T| \qquad (9)

The determinant of a k × k matrix can be found from its Cholesky factorisation.

Interacting hard constraints — Evaluating the log-density David Bolin

Page 34

Evaluating the log-density

The log-density can be rapidly computed using the following identity:

π(x | Ax) = \frac{π(x)\,π(Ax | x)}{π(Ax)}, \qquad (8)

Each term on the right-hand side is easier to compute than the left-hand side.

π(Ax)-term. Ax is Gaussian with mean Aµ and covariance matrix AQ^{-1}A^T, with Cholesky triangle L̃ available from Algorithm 7, step 7.

Interacting hard constraints — Evaluating the log-density David Bolin

Page 35

Numerical methods for sparse matrices

We see that computations for GMRFs can be expressed such that the main tasks are

1 compute the Cholesky factorisation Q = LL^T, and
2 solve Lv = b and L^T x = z.

Now we will see how sparsity can be used to do these tasks efficiently.

The second task is much faster than the first, but sparsity is an advantage here as well.

Numerical methods for sparse matrices David Bolin

Page 36

The goals

The goal of this lecture and the next is to explain
• why a sparse Q allows for fast factorisation,
• how we can take advantage of it,
• why we gain if we permute the vertices before factorising the matrix,
• how statisticians can benefit from recent research in this area by numerical mathematicians.

At the end we present a small case study factorising some typical matrices for GMRFs, using classical and more recent methods for factorising matrices.

Numerical methods for sparse matrices David Bolin

Page 37

Interpretation of L (I)

Let Q = LL^T; then the solution of

L^T x = z, where z ∼ N(0, I),

is N(0, Q^{-1}) distributed.

Since L is lower triangular,

x_n = \frac{1}{L_{nn}} z_n
x_{n-1} = \frac{1}{L_{n-1,n-1}} (z_{n-1} - L_{n,n-1} x_n)
\dots

The Cholesky triangle — Interpretation David Bolin

Page 38

Interpretation of L (II)

Theorem
Let x be a GMRF wrt the labelled graph G, with mean µ and precision matrix Q > 0. Let L be the Cholesky triangle of Q. Then for i ∈ V,

E(x_i | x_{(i+1):n}) = \mu_i - \frac{1}{L_{ii}} \sum_{j=i+1}^{n} L_{ji}(x_j - \mu_j), and
Prec(x_i | x_{(i+1):n}) = L_{ii}^2.

The Cholesky triangle — Interpretation David Bolin

Page 39

Determine the zero-pattern in L (I)

Theorem

Let x be a GMRF wrt G, with mean µ and precision matrix Q > 0. Let L be the Cholesky triangle of Q and define, for 1 ≤ i < j ≤ n, the set

F(i, j) = {i+1, \dots, j−1, j+1, \dots, n},

which is the future of i except j. Then

x_i ⊥ x_j | x_{F(i,j)} ⟺ L_{ji} = 0.

If we can verify that L_{ji} is zero, we do not have to compute it when factorising Q.

The Cholesky triangle — The zero-pattern in L David Bolin

Page 40

Determine the zero-pattern in L (II)

The global Markov property provides a simple and sufficient criterion for checking whether L_{ji} = 0.

Corollary
If F(i, j) separates i and j (i < j) in G, then L_{ji} = 0.

Corollary
If i ∼ j, then F(i, j) does not separate i and j.

The idea is simple:
• Use the global Markov property to check if L_{ji} = 0.
• Compute only the non-zero terms in L, so that Q = LL^T.
• Note that the corollary does not use the numerical values of Q; thus it is true for any Q > 0 with the same graph.

The Cholesky triangle — The zero-pattern in L David Bolin

Page 41

Example

The graph is the 4-cycle

3 — 1
|   |
4 — 2

so 1 ∼ 2, 1 ∼ 3, 2 ∼ 4 and 3 ∼ 4.

Q =
× × × ·
× × · ×
× · × ×
· × × ×

L =
×
× ×
× ? ×
? × × ×

L =
×
× ×
× √ ×
· × × ×

The Cholesky triangle — Example: A simple graph David Bolin

Page 42

Example: AR(1)-process

x_t | x_{1:(t-1)} ∼ N(φ x_{t-1}, σ^2), t = 1, \dots, n

Q =
× ×
× × ×
  × × ×
    × × ×
      × × ×
        × × ×
          × ×

L =
×
× ×
  × ×
    × ×
      × ×
        × ×
          × ×
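As an illustration (my own construction, assuming the stationary AR(1) model with |φ| < 1), the precision matrix can be built in NumPy and its Cholesky triangle checked to be lower-bidiagonal:

```python
import numpy as np

phi, sigma2, n = 0.7, 1.0, 8                    # hypothetical parameter values
Q = np.zeros((n, n))
idx = np.arange(n)
Q[idx, idx] = 1.0 + phi**2                      # interior diagonal entries
Q[0, 0] = Q[n - 1, n - 1] = 1.0                 # boundary entries (stationary start, an assumption)
Q[idx[:-1], idx[:-1] + 1] = -phi                # super-diagonal
Q[idx[:-1] + 1, idx[:-1]] = -phi                # sub-diagonal
Q /= sigma2

L = np.linalg.cholesky(Q)
# Q is tridiagonal (bandwidth 1); its Cholesky triangle is lower-bidiagonal, with no fill-in
assert np.allclose(np.tril(L, -2), 0.0)
```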

The Cholesky triangle — Example: auto-regressive processes David Bolin

Page 43

Bandwidth is preserved

Similarly, for an AR(p)-process
• Q has bandwidth p.
• L has lower bandwidth p.

Theorem
Let Q > 0 be a band matrix with bandwidth p and dimension n. Then the Cholesky triangle of Q has (lower) bandwidth p.

...easy to modify existing Cholesky-factorisation code to use only entries where |i − j| ≤ p.

Band matrices — Bandwidth is preserved David Bolin

Page 44

Avoid computing Lij and reading Qij for |i− j| > p.

Algorithm 8 Band-Cholesky factorization of Q with bandwidth p

1: for j = 1 to n do
2:   λ = min{j + p, n}
3:   v_{j:λ} = Q_{j:λ,j}
4:   for k = max{1, j − p} to j − 1 do
5:     i = min{k + p, n}
6:     v_{j:i} = v_{j:i} − L_{j:i,k} L_{jk}
7:   end for
8:   L_{j:λ,j} = v_{j:λ} / \sqrt{v_j}
9: end for
10: Return L

Cost is now n(p^2 + 3p) flops, assuming n ≫ p.
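A NumPy transcription of Algorithm 8 (a dense-storage sketch; a real implementation would store only the band):

```python
import numpy as np

def band_cholesky(Q, p):
    """Algorithm 8: band-Cholesky factorisation of Q with bandwidth p.

    Only entries with |i - j| <= p are read from Q or written to L.
    """
    n = Q.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        lam = min(j + p, n - 1)                       # step 2 (0-based): last row in the band
        v = Q[j:lam + 1, j].astype(float)             # step 3
        for k in range(max(0, j - p), j):             # step 4: earlier columns touching the band
            i = min(k + p, n - 1)                     # step 5
            v[:i - j + 1] -= L[j:i + 1, k] * L[j, k]  # step 6
        L[j:lam + 1, j] = v / np.sqrt(v[0])           # step 8
    return L

# Check on a random SPD matrix with bandwidth p = 2
rng = np.random.default_rng(2)
n, p = 10, 2
B = np.tril(np.triu(rng.standard_normal((n, n)), -p))   # lower-triangular band matrix
Q = B @ B.T + np.eye(n)                                  # SPD with bandwidth p
assert np.allclose(band_cholesky(Q, p), np.linalg.cholesky(Q))
```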

Band matrices — Cholesky factorisation for band-matrices David Bolin