


7 The multivariate normal model

Up until now all of our statistical models have been univariate models, that is, models for a single measurement on each member of a sample of individuals or each run of a repeated experiment. However, datasets are frequently multivariate, having multiple measurements for each individual or experiment. This chapter covers what is perhaps the most useful model for multivariate data, the multivariate normal model, which allows us to jointly estimate population means, variances and correlations of a collection of variables. After first calculating posterior distributions under semiconjugate prior distributions, we show how the multivariate normal model can be used to impute data that are missing at random.

    7.1 The multivariate normal density

    Example: Reading comprehension

A sample of twenty-two children are given reading comprehension tests before and after receiving a particular instructional method. Each student i will then have two scores, Y_{i,1} and Y_{i,2}, denoting the pre- and post-instructional scores respectively. We denote each student's pair of scores as a 2 × 1 vector Y_i, so that

Y_i = \begin{pmatrix} Y_{i,1} \\ Y_{i,2} \end{pmatrix} = \begin{pmatrix} \text{score on first test} \\ \text{score on second test} \end{pmatrix}.

    Things we might be interested in include the population mean θ,

E[Y] = \begin{pmatrix} E[Y_{i,1}] \\ E[Y_{i,2}] \end{pmatrix} = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}

and the population covariance matrix Σ,

\Sigma = \mathrm{Cov}[Y] = \begin{pmatrix} E[Y_1^2] - E[Y_1]^2 & E[Y_1 Y_2] - E[Y_1]E[Y_2] \\ E[Y_1 Y_2] - E[Y_1]E[Y_2] & E[Y_2^2] - E[Y_2]^2 \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{1,2} \\ \sigma_{1,2} & \sigma_2^2 \end{pmatrix},



where the expectations above represent the unknown population averages. Having information about θ and Σ may help us in assessing the effectiveness of the teaching method, possibly evaluated with θ_2 − θ_1, or the consistency of the reading comprehension test, which could be evaluated with the correlation coefficient ρ_{1,2} = σ_{1,2}/\sqrt{σ_1^2 σ_2^2}.

    The multivariate normal density

Notice that θ and Σ are both functions of population moments, or population averages of powers of Y_1 and Y_2. In particular, θ and Σ are functions of first- and second-order moments:

first-order moments: E[Y_1], E[Y_2]
second-order moments: E[Y_1^2], E[Y_1 Y_2], E[Y_2^2]

Recall from Chapter 5 that a univariate normal model describes a population in terms of its mean and variance (θ, σ^2), or equivalently its first two moments (E[Y] = θ, E[Y^2] = σ^2 + θ^2). The analogous model for describing first- and second-order moments of multivariate data is the multivariate normal model. We say a p-dimensional data vector Y has a multivariate normal distribution if its sampling density is given by

p(y \mid \theta, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\{ -(y - \theta)^T \Sigma^{-1} (y - \theta)/2 \}

where

y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}, \quad \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{1,2} & \cdots & \sigma_{1,p} \\ \sigma_{1,2} & \sigma_2^2 & \cdots & \sigma_{2,p} \\ \vdots & & \ddots & \vdots \\ \sigma_{1,p} & \sigma_{2,p} & \cdots & \sigma_p^2 \end{pmatrix}.

Calculating this density requires a few operations involving matrix algebra. For a matrix A, the value of |A| is called the determinant of A, and measures how "big" A is. The inverse of A is the matrix A^{-1} such that AA^{-1} is equal to the identity matrix I_p, the p × p matrix that has ones for its diagonal entries but is otherwise zero. For a p × 1 vector b, b^T is its transpose, and is simply the 1 × p vector of the same values. Finally, the vector-matrix product b^T A is equal to the 1 × p vector (\sum_{j=1}^p b_j a_{j,1}, \ldots, \sum_{j=1}^p b_j a_{j,p}), and the value of b^T A b is the single number \sum_{j=1}^p \sum_{k=1}^p b_j b_k a_{j,k}. Fortunately, R can compute all of these quantities for us, as we shall see in the forthcoming example code.
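As a quick illustration, the following snippet (with made-up values for y, theta and Sigma, so it is only a sketch of the operations, not part of any analysis) evaluates the determinant, inverse, transpose and quadratic form that appear in the density above:

## illustrative values only
Sigma <- matrix(c(64, 48, 48, 144), nrow = 2)    # a 2 x 2 covariance matrix
theta <- c(50, 50)                               # mean vector
y     <- c(55, 62)                               # a data vector

det(Sigma)                                       # determinant |Sigma|
solve(Sigma)                                     # matrix inverse Sigma^{-1}
t(y - theta)                                     # transpose of a vector
t(y - theta) %*% solve(Sigma) %*% (y - theta)    # the quadratic form in the exponent

p <- length(y)                                   # the multivariate normal density itself
(2*pi)^(-p/2) * det(Sigma)^(-1/2) *
  exp(-t(y - theta) %*% solve(Sigma) %*% (y - theta) / 2)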

Figure 7.1 gives contour plots and 30 samples from each of three different two-dimensional multivariate normal densities. In each one θ = (50, 50)^T, σ_1^2 = 64 and σ_2^2 = 144, but the value of σ_{1,2} varies from plot to plot, with σ_{1,2} = −48 for the left density, 0 for the middle and +48 for the density on the right (giving correlations of −0.5, 0 and +0.5 respectively). An interesting feature of the multivariate normal distribution is that the marginal distribution of each variable Y_j is a univariate normal distribution, with mean θ_j and variance σ_j^2.


This means that the marginal distributions for Y_1 from the three populations in Figure 7.1 are identical (the same holds for Y_2). The only thing that differs across the three populations is the relationship between Y_1 and Y_2, which is controlled by the covariance parameter σ_{1,2}.

Fig. 7.1. Multivariate normal samples and densities.

    7.2 A semiconjugate prior distribution for the mean

Recall from Chapters 5 and 6 that if Y_1, . . . , Y_n are independent samples from a univariate normal population, then a convenient conjugate prior distribution for the population mean is also univariate normal. Similarly, a convenient prior distribution for the multivariate mean θ is a multivariate normal distribution, which we will parameterize as

    p(θ) = multivariate normal(µ0, Λ0),

where µ_0 and Λ_0 are the prior mean and variance of θ, respectively. What is the full conditional distribution of θ, given y_1, . . . , y_n and Σ? In the univariate case, having normal prior and sampling distributions resulted in a normal full conditional distribution for the population mean. Let's see if this result holds for the multivariate case. We begin by examining the prior distribution as a function of θ:

\begin{align}
p(\theta) &= (2\pi)^{-p/2} |\Lambda_0|^{-1/2} \exp\{ -\tfrac12 (\theta - \mu_0)^T \Lambda_0^{-1} (\theta - \mu_0) \} \\
&= (2\pi)^{-p/2} |\Lambda_0|^{-1/2} \exp\{ -\tfrac12 \theta^T \Lambda_0^{-1} \theta + \theta^T \Lambda_0^{-1} \mu_0 - \tfrac12 \mu_0^T \Lambda_0^{-1} \mu_0 \} \\
&\propto \exp\{ -\tfrac12 \theta^T \Lambda_0^{-1} \theta + \theta^T \Lambda_0^{-1} \mu_0 \} \\
&= \exp\{ -\tfrac12 \theta^T A_0 \theta + \theta^T b_0 \}, \tag{7.1}
\end{align}


where A_0 = Λ_0^{-1} and b_0 = Λ_0^{-1} µ_0. Conversely, Equation 7.1 says that if a random vector θ has a density on R^p that is proportional to exp{−θ^T A θ/2 + θ^T b} for some matrix A and vector b, then θ must have a multivariate normal distribution with covariance A^{-1} and mean A^{-1} b.

If our sampling model is that {Y_1, . . . , Y_n | θ, Σ} are i.i.d. multivariate normal(θ, Σ), then similar calculations show that the joint sampling density of the observed vectors y_1, . . . , y_n is

\begin{align}
p(y_1, \ldots, y_n \mid \theta, \Sigma) &= \prod_{i=1}^n (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\{ -(y_i - \theta)^T \Sigma^{-1} (y_i - \theta)/2 \} \\
&= (2\pi)^{-np/2} |\Sigma|^{-n/2} \exp\{ -\tfrac12 \sum_{i=1}^n (y_i - \theta)^T \Sigma^{-1} (y_i - \theta) \} \\
&\propto \exp\{ -\tfrac12 \theta^T A_1 \theta + \theta^T b_1 \}, \tag{7.2}
\end{align}

where A_1 = nΣ^{-1}, b_1 = nΣ^{-1} ȳ and ȳ is the vector of variable-specific averages ȳ = ( \tfrac1n \sum_{i=1}^n y_{i,1}, \ldots, \tfrac1n \sum_{i=1}^n y_{i,p} )^T. Combining Equations 7.1 and 7.2 gives

\begin{align}
p(\theta \mid y_1, \ldots, y_n, \Sigma) &\propto \exp\{ -\tfrac12 \theta^T A_0 \theta + \theta^T b_0 \} \times \exp\{ -\tfrac12 \theta^T A_1 \theta + \theta^T b_1 \} \\
&= \exp\{ -\tfrac12 \theta^T A_n \theta + \theta^T b_n \}, \quad\text{where} \tag{7.3} \\
A_n &= A_0 + A_1 = \Lambda_0^{-1} + n\Sigma^{-1} \quad\text{and} \\
b_n &= b_0 + b_1 = \Lambda_0^{-1} \mu_0 + n\Sigma^{-1} \bar y.
\end{align}

From the comments in the previous paragraph, Equation 7.3 implies that the conditional distribution of θ therefore must be a multivariate normal distribution with covariance A_n^{-1} and mean A_n^{-1} b_n, so

\begin{align}
\mathrm{Cov}[\theta \mid y_1, \ldots, y_n, \Sigma] &= \Lambda_n = (\Lambda_0^{-1} + n\Sigma^{-1})^{-1} \tag{7.4} \\
E[\theta \mid y_1, \ldots, y_n, \Sigma] &= \mu_n = (\Lambda_0^{-1} + n\Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n\Sigma^{-1} \bar y) \tag{7.5} \\
p(\theta \mid y_1, \ldots, y_n, \Sigma) &= \text{multivariate normal}(\mu_n, \Lambda_n). \tag{7.6}
\end{align}

It looks a bit complicated, but can be made more understandable by analogy with the univariate normal case: Equation 7.4 says that the posterior precision, or inverse variance, is the sum of the prior precision and the data precision, just as in the univariate normal case. Similarly, Equation 7.5 says that the posterior expectation is a weighted average of the prior expectation and the sample mean. Notice that, since the sample mean is consistent for the population mean, the posterior mean also will be consistent for the population mean even if the true distribution of the data is not multivariate normal.
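A minimal R sketch of this full conditional, assuming Y is an n × p data matrix, mu0 and L0 hold the prior mean and covariance, Sigma is a given covariance matrix, and using MASS::mvrnorm for the multivariate normal draw, might look like:

library(MASS)                                       # for mvrnorm()

n    <- nrow(Y) ; ybar <- colMeans(Y)
iL0  <- solve(L0) ; iSigma <- solve(Sigma)
Ln   <- solve(iL0 + n * iSigma)                     # posterior covariance, Equation 7.4
mun  <- Ln %*% (iL0 %*% mu0 + n * iSigma %*% ybar)  # posterior mean, Equation 7.5
theta <- mvrnorm(1, c(mun), Ln)                     # one draw from Equation 7.6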


    7.3 The inverse-Wishart distribution

Just as a variance σ^2 must be positive, a variance-covariance matrix Σ must be positive definite, meaning that

x^T Σ x > 0 for all vectors x.

Positive definiteness guarantees that σ_j^2 > 0 for all j and that all correlations are between −1 and 1. Another requirement of our covariance matrix is that it is symmetric, which means that σ_{j,k} = σ_{k,j}. Any valid prior distribution for Σ must put all of its probability mass on this complicated set of symmetric, positive definite matrices. How can we formulate such a prior distribution?

    Empirical covariance matrices

The sum of squares matrix of a collection of multivariate vectors z_1, . . . , z_n is given by

\sum_{i=1}^n z_i z_i^T = Z^T Z,

where Z is the n × p matrix whose ith row is z_i^T. Recall from matrix algebra that since z_i can be thought of as a p × 1 matrix, z_i z_i^T is the following p × p matrix:

z_i z_i^T = \begin{pmatrix} z_{i,1}^2 & z_{i,1} z_{i,2} & \cdots & z_{i,1} z_{i,p} \\ z_{i,2} z_{i,1} & z_{i,2}^2 & \cdots & z_{i,2} z_{i,p} \\ \vdots & & \ddots & \vdots \\ z_{i,p} z_{i,1} & z_{i,p} z_{i,2} & \cdots & z_{i,p}^2 \end{pmatrix}.

If the z_i's are samples from a population with zero mean, we can think of the matrix z_i z_i^T / n as the contribution of vector z_i to the estimate of the covariance matrix of all of the observations. In this mean-zero case, if we divide Z^T Z by n, we get a sample covariance matrix, an unbiased estimator of the population covariance matrix:

\tfrac{1}{n} [Z^T Z]_{j,j} = \tfrac{1}{n} \sum_{i=1}^n z_{i,j}^2 = s_{j,j} = s_j^2
\tfrac{1}{n} [Z^T Z]_{j,k} = \tfrac{1}{n} \sum_{i=1}^n z_{i,j} z_{i,k} = s_{j,k}.

If n > p and the z_i's are linearly independent, then Z^T Z will be positive definite and symmetric. This suggests the following construction of a "random" covariance matrix: For a given positive integer ν_0 and a p × p covariance matrix Φ_0,

1. sample z_1, . . . , z_{ν_0} ∼ i.i.d. multivariate normal(0, Φ_0);
2. calculate Z^T Z = \sum_{i=1}^{ν_0} z_i z_i^T.

We can repeat this procedure over and over again, generating matrices Z_1^T Z_1, . . . , Z_S^T Z_S. The population distribution of these sum of squares matrices is called a Wishart distribution with parameters (ν_0, Φ_0), which has the following properties:


• If ν_0 > p, then Z^T Z is positive definite with probability 1.
• Z^T Z is symmetric with probability 1.
• E[Z^T Z] = ν_0 Φ_0.

The Wishart distribution is a multivariate analogue of the gamma distribution (recall that if z is a mean-zero univariate normal random variable, then z^2 is a gamma random variable). In the univariate normal model, our prior distribution for the precision 1/σ^2 is a gamma distribution, and our full conditional distribution for the variance is an inverse-gamma distribution. Similarly, it turns out that the Wishart distribution is a semi-conjugate prior distribution for the precision matrix Σ^{-1}, and so the inverse-Wishart distribution is our semi-conjugate prior distribution for the covariance matrix Σ. With a slight reparameterization, to sample a covariance matrix Σ from an inverse-Wishart distribution we perform the following steps:

1. sample z_1, . . . , z_{ν_0} ∼ i.i.d. multivariate normal(0, S_0^{-1});
2. calculate Z^T Z = \sum_{i=1}^{ν_0} z_i z_i^T;
3. set Σ = (Z^T Z)^{-1}.

Under this simulation scheme, the precision matrix Σ^{-1} has a Wishart(ν_0, S_0^{-1}) distribution and the covariance matrix Σ has an inverse-Wishart(ν_0, S_0^{-1}) distribution. The expectations of Σ^{-1} and Σ are

E[\Sigma^{-1}] = \nu_0 S_0^{-1}
E[\Sigma] = \frac{1}{\nu_0 - p - 1} (S_0^{-1})^{-1} = \frac{1}{\nu_0 - p - 1} S_0.

If we are confident that the true covariance matrix is near some covariance matrix Σ_0, then we might choose ν_0 to be large and set S_0 = (ν_0 − p − 1)Σ_0, making the distribution of Σ concentrated around Σ_0. On the other hand, choosing ν_0 = p + 2 and S_0 = Σ_0 makes Σ only loosely centered around Σ_0.
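The constructive recipe above is easy to code directly. The sketch below uses a hypothetical 2 × 2 matrix Sigma0 as a prior guess and MASS::mvrnorm for the normal draws; it simulates covariance matrices from the inverse-Wishart distribution and checks the expectation E[Σ] = S_0/(ν_0 − p − 1) by Monte Carlo:

library(MASS)

p      <- 2
Sigma0 <- matrix(c(64, 48, 48, 144), p, p)    # hypothetical prior guess for Sigma
nu0    <- p + 2                               # loose concentration around Sigma0
S0     <- Sigma0                              # so that E[Sigma] = S0/(nu0 - p - 1) = Sigma0

rinvwish <- function(nu0, iS0) {
  Z <- mvrnorm(nu0, rep(0, nrow(iS0)), iS0)   # steps 1-2: z_i ~ MVN(0, S0^{-1}), then Z^T Z
  solve(t(Z) %*% Z)                           # step 3: Sigma = (Z^T Z)^{-1}
}

SIGMA <- replicate(5000, rinvwish(nu0, solve(S0)))
apply(SIGMA, c(1, 2), mean)                   # Monte Carlo estimate of E[Sigma]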

    Full conditional distribution of the covariance matrix

The inverse-Wishart(ν_0, S_0^{-1}) density is given by

p(\Sigma) = \left[ 2^{\nu_0 p/2} \, \pi^{\binom{p}{2}/2} \, |S_0|^{-\nu_0/2} \prod_{j=1}^p \Gamma([\nu_0 + 1 - j]/2) \right]^{-1} \times |\Sigma|^{-(\nu_0 + p + 1)/2} \times \exp\{ -\mathrm{tr}(S_0 \Sigma^{-1})/2 \}.   (7.7)

The normalizing constant is quite intimidating. Fortunately we will only have to work with the second line of the equation. The expression "tr" stands for trace, and for a square p × p matrix A, tr(A) = \sum_{j=1}^p a_{j,j}, the sum of the diagonal elements. We now need to combine the above prior distribution with the sampling distribution for Y_1, . . . , Y_n:


p(y_1, \ldots, y_n \mid \theta, \Sigma) = (2\pi)^{-np/2} |\Sigma|^{-n/2} \exp\{ -\sum_{i=1}^n (y_i - \theta)^T \Sigma^{-1} (y_i - \theta)/2 \}.   (7.8)

An interesting result from matrix algebra is that the sum \sum_{k=1}^K b_k^T A b_k = \mathrm{tr}(B^T B A), where B is the matrix whose kth row is b_k^T. This means that the term in the exponent of Equation 7.8 can be expressed as

\sum_{i=1}^n (y_i - \theta)^T \Sigma^{-1} (y_i - \theta) = \mathrm{tr}(S_\theta \Sigma^{-1}), \quad\text{where}\quad S_\theta = \sum_{i=1}^n (y_i - \theta)(y_i - \theta)^T.

The matrix S_θ is the residual sum of squares matrix for the vectors y_1, . . . , y_n if the population mean is presumed to be θ. Conditional on θ, \tfrac1n S_θ provides an unbiased estimate of the true covariance matrix Cov[Y] (more generally, when θ is not conditioned on, the sample covariance matrix is \sum (y_i − ȳ)(y_i − ȳ)^T/(n − 1) and is an unbiased estimate of Σ). Using the above result to combine Equations 7.7 and 7.8 gives the conditional distribution of Σ:

\begin{align}
p(\Sigma \mid y_1, \ldots, y_n, \theta) &\propto p(\Sigma) \times p(y_1, \ldots, y_n \mid \theta, \Sigma) \\
&\propto \left( |\Sigma|^{-(\nu_0+p+1)/2} \exp\{ -\mathrm{tr}(S_0 \Sigma^{-1})/2 \} \right) \times \left( |\Sigma|^{-n/2} \exp\{ -\mathrm{tr}(S_\theta \Sigma^{-1})/2 \} \right) \\
&= |\Sigma|^{-(\nu_0+n+p+1)/2} \exp\{ -\mathrm{tr}([S_0 + S_\theta]\Sigma^{-1})/2 \}.
\end{align}

Thus we have

\{\Sigma \mid y_1, \ldots, y_n, \theta\} \sim \text{inverse-Wishart}(\nu_0 + n, [S_0 + S_\theta]^{-1}).   (7.9)

Hopefully this result seems somewhat intuitive: We can think of ν_0 + n as the "posterior sample size," being the sum of the "prior sample size" ν_0 and the data sample size. Similarly, S_0 + S_θ can be thought of as the "prior" residual sum of squares plus the residual sum of squares from the data. Additionally, the conditional expectation of the population covariance matrix is

\begin{align}
E[\Sigma \mid y_1, \ldots, y_n, \theta] &= \frac{1}{\nu_0 + n - p - 1}(S_0 + S_\theta) \\
&= \frac{\nu_0 - p - 1}{\nu_0 + n - p - 1} \cdot \frac{1}{\nu_0 - p - 1} S_0 \;+\; \frac{n}{\nu_0 + n - p - 1} \cdot \frac{1}{n} S_\theta
\end{align}

and so the conditional expectation can be seen as a weighted average of the prior expectation and the unbiased estimator. Because it can be shown that \tfrac1n S_θ converges to the true population covariance matrix, the posterior expectation of Σ is a consistent estimator of the population covariance, even if the true population distribution is not multivariate normal.
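In R, a single draw from this full conditional can be obtained with the same constructive recipe as above. The following is only a sketch, under the assumption that Y, theta, nu0 and S0 are already in the workspace:

library(MASS)

n      <- nrow(Y) ; p <- ncol(Y)
Stheta <- (t(Y) - c(theta)) %*% t(t(Y) - c(theta))   # residual sum of squares about theta
Sn     <- S0 + Stheta
Z      <- mvrnorm(nu0 + n, rep(0, p), solve(Sn))
Sigma  <- solve(t(Z) %*% Z)                          # Sigma ~ inverse-Wishart(nu0 + n, Sn^{-1})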


    7.4 Gibbs sampling of the mean and covariance

    In the last two sections we showed that

\begin{align}
\{\theta \mid y_1, \ldots, y_n, \Sigma\} &\sim \text{multivariate normal}(\mu_n, \Lambda_n) \\
\{\Sigma \mid y_1, \ldots, y_n, \theta\} &\sim \text{inverse-Wishart}(\nu_n, S_n^{-1}),
\end{align}

where {Λ_n, µ_n} are defined in Equations 7.4 and 7.5, ν_n = ν_0 + n and S_n = S_0 + S_θ. These full conditional distributions can be used to construct a Gibbs sampler, providing us with an MCMC approximation to the joint posterior distribution p(θ, Σ | y_1, . . . , y_n). Given a starting value Σ^{(0)}, the Gibbs sampler generates {θ^{(s+1)}, Σ^{(s+1)}} from {θ^{(s)}, Σ^{(s)}} via the following two steps:

1. Sample θ^{(s+1)} from its full conditional distribution:
   a) compute µ_n and Λ_n from y_1, . . . , y_n and Σ^{(s)};
   b) sample θ^{(s+1)} ∼ multivariate normal(µ_n, Λ_n).
2. Sample Σ^{(s+1)} from its full conditional distribution:
   a) compute S_n from y_1, . . . , y_n and θ^{(s+1)};
   b) sample Σ^{(s+1)} ∼ inverse-Wishart(ν_0 + n, S_n^{-1}).

Steps 1.a and 2.a highlight the fact that {µ_n, Λ_n} depend on the value of Σ, and that S_n depends on the value of θ, and so these quantities need to be recalculated at every iteration of the sampler.

    Example: Reading comprehension

Let's return to the example from the beginning of the chapter in which each of 22 children were given two reading comprehension exams, one before a certain type of instruction and one after. We'll model these 22 pairs of scores as i.i.d. samples from a multivariate normal distribution. The exam was designed to give average scores of around 50 out of 100, so µ_0 = (50, 50)^T would be a good choice for our prior expectation. Since the true mean cannot be below 0 or above 100, it is desirable to use a prior variance for θ that puts little probability outside of this range. We'll take the prior variances on θ_1 and θ_2 to be λ_{0,1}^2 = λ_{0,2}^2 = (50/2)^2 = 625, so that the prior probability Pr(θ_j ∉ [0, 100]) is only 0.05. Finally, since the two exams are measuring similar things, whatever the true values of θ_1 and θ_2 are it is probable that they are close. We can reflect this with a prior correlation of 0.5, so that λ_{1,2} = 312.5. As for the prior distribution on Σ, some of the same logic about the range of exam scores applies. We'll take S_0 to be the same as Λ_0, but only loosely center Σ around this value by taking ν_0 = p + 2 = 4.
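In R these prior choices can be encoded as follows (the object names are just illustrative):

mu0 <- c(50, 50)                                    # prior mean for theta
L0  <- matrix(c(625, 312.5, 312.5, 625), nrow = 2)  # prior covariance Lambda_0 for theta
nu0 <- 4                                            # p + 2, only loosely centered
S0  <- L0                                           # prior "guess" for Sigma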



The observed values y_1, . . . , y_22 are plotted as dots in the second panel of Figure 7.2. The sample mean is ȳ = (47.18, 53.86)^T, the sample variances are s_1^2 = 182.16 and s_2^2 = 243.65, and the sample correlation is s_{1,2}/(s_1 s_2) = 0.70. Let's use the Gibbs sampler described above to combine this sample information with our prior distributions to obtain estimates and confidence intervals for the population parameters. We begin by setting Σ^{(0)} equal to the sample covariance matrix, and iterating from there. In the R-code below, Y is the 22 × 2 data matrix of the observed values.

data(chapter7) ; Y
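The following is a minimal sketch of steps 1 and 2, assuming Y and the prior parameters mu0, L0, nu0 and S0 defined above, and using MASS::mvrnorm for the multivariate normal draws:

library(MASS)
set.seed(1)

Y     <- as.matrix(Y)
n     <- nrow(Y) ; ybar <- colMeans(Y)
Sigma <- cov(Y)                                   # starting value Sigma^(0)
THETA <- SIGMA <- NULL

for (s in 1:5000) {
  ## step 1: sample theta from its full conditional (Equations 7.4-7.6)
  Ln    <- solve(solve(L0) + n * solve(Sigma))
  mun   <- Ln %*% (solve(L0) %*% mu0 + n * solve(Sigma) %*% ybar)
  theta <- mvrnorm(1, c(mun), Ln)

  ## step 2: sample Sigma from its full conditional (Equation 7.9)
  Sn    <- S0 + (t(Y) - c(theta)) %*% t(t(Y) - c(theta))
  Z     <- mvrnorm(nu0 + n, rep(0, 2), solve(Sn))
  Sigma <- solve(t(Z) %*% Z)

  THETA <- rbind(THETA, theta) ; SIGMA <- rbind(SIGMA, c(Sigma))
}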


The Monte Carlo approximation to Pr(θ_2 > θ_1 | y_1, . . . , y_n) obtained from these Gibbs samples is 0.99, indicating strong evidence that if this instruction were given to a larger population of children, then the average score on the second exam would be higher than that on the first. This evidence is displayed graphically in the first panel of Figure 7.2, which shows 97.5%, 75%, 50%, 25% and 2.5% highest posterior density contours for the joint posterior distribution of θ = (θ_1, θ_2)^T. A highest posterior density contour is a two-dimensional analogue of a confidence interval. The contours for the posterior distribution of θ are all mostly above the 45-degree line θ_1 = θ_2.

Fig. 7.2. Reading comprehension data and posterior distributions.

Now let's ask a slightly different question: what is the probability that a randomly selected child will score higher on the second exam than on the first? The answer to this question is a function of the posterior predictive distribution of a new sample (Y_1, Y_2)^T, given the observed values. The second panel of Figure 7.2 shows highest posterior density contours of the posterior predictive distribution, which, while mostly being above the line y_2 = y_1, still has substantial overlap with the region below this line, and in fact Pr(Y_2 > Y_1 | y_1, . . . , y_n) = 0.71. How should we evaluate the effectiveness of the between-exam instruction? On one hand, the fact that Pr(θ_2 > θ_1 | y_1, . . . , y_n) = 0.99 seems to suggest that there is a "highly significant difference" in exam scores before and after the instruction, yet Pr(Y_2 > Y_1 | y_1, . . . , y_n) = 0.71 says that almost a third of the students will get a lower score on the second exam. The difference between these two probabilities is that the first is measuring the evidence that θ_2 is larger than θ_1, without regard to whether or not the magnitude of the difference θ_2 − θ_1 is large compared to the sampling variability of the data. Confusion over these two different ways of comparing populations is common in the reporting of results from experiments or surveys: studies with very large values of n often result in values of Pr(θ_2 > θ_1 | y_1, . . . , y_n) that are very close to 1 (or p-values that are very close to zero), suggesting a "significant effect," even though such results say nothing about how large of an effect we expect to see for a randomly sampled individual.
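Both probabilities are straightforward to approximate from the Gibbs output. Assuming THETA and SIGMA hold the samples saved in the sketch above, one possibility is:

## Pr(theta_2 > theta_1 | y_1,...,y_n): compare the sampled population means
mean(THETA[, 2] > THETA[, 1])

## Pr(Y_2 > Y_1 | y_1,...,y_n): one posterior predictive draw per sampled (theta, Sigma)
library(MASS)
YPRED <- t(sapply(1:nrow(THETA), function(s) {
  mvrnorm(1, THETA[s, ], matrix(SIGMA[s, ], 2, 2))
}))
mean(YPRED[, 2] > YPRED[, 1])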

    7.5 Missing data and imputation

Fig. 7.3. Physiological data on 200 women.

Figure 7.3 displays univariate histograms and bivariate scatterplots for four variables taken from a dataset involving health-related measurements on 200 women of Pima Indian heritage living near Phoenix, Arizona (Smith et al, 1988). The four variables are glu (blood plasma glucose concentration), bp (diastolic blood pressure), skin (skin fold thickness) and bmi (body mass index). The first ten subjects in this dataset have the following entries:


     glu  bp  skin   bmi
 1    86  68    28  30.2
 2   195  70    33    NA
 3    77  82    NA  35.8
 4    NA  76    43  47.9
 5   107  60    NA    NA
 6    97  76    27    NA
 7    NA  58    31  34.3
 8   193  50    16  25.9
 9   142  80    15    NA
10   128  78    NA  43.3

The NA's stand for "not available," and so some data for some individuals are "missing." Missing data are fairly common in survey data: Sometimes people accidentally miss a page of a survey, sometimes a doctor forgets to write down a piece of medical data, sometimes the response is unreadable, and so on. Many surveys (such as the General Social Survey) have multiple versions with certain questions appearing in only a subset of the versions. As a result, all the subjects may have missing data.

In such situations it is not immediately clear how to do parameter estimation. The posterior distribution for θ and Σ depends on \prod_{i=1}^n p(y_i | θ, Σ), but p(y_i | θ, Σ) cannot be computed if components of y_i are missing. What can we do? Unfortunately, many software packages either throw away all subjects with incomplete data, or impute missing values with a population mean or some other fixed value, then proceed with the analysis. The first approach is bad because we are throwing away a potentially large amount of useful information. The second is statistically incorrect, as it says we are certain about the values of the missing data when in fact we have not observed them.

Let's carefully think about the information that is available from subjects with missing data. Let O_i = (O_{i,1}, . . . , O_{i,p})^T be a binary vector of zeros and ones such that O_{i,j} = 1 implies that Y_{i,j} is observed and not missing, whereas O_{i,j} = 0 implies Y_{i,j} is missing. Our observed information about subject i is therefore O_i = o_i and Y_{i,j} = y_{i,j} for variables j such that o_{i,j} = 1. For now, we'll assume that missing data are missing at random, meaning that O_i and Y_i are statistically independent and that the distribution of O_i does not depend on θ or Σ. In cases where the data are missing but not at random, then sometimes inference can be made by modeling the relationship between O_i, Y_i and the parameters (see Chapter 21 of Gelman et al (2004)).

In the case where data are missing at random, the sampling probability for the data from subject i is

\begin{align}
p(o_i, \{y_{i,j} : o_{i,j} = 1\} \mid \theta, \Sigma) &= p(o_i) \times p(\{y_{i,j} : o_{i,j} = 1\} \mid \theta, \Sigma) \\
&= p(o_i) \times \int p(y_{i,1}, \ldots, y_{i,p} \mid \theta, \Sigma) \prod_{y_{i,j} : o_{i,j} = 0} dy_{i,j}.
\end{align}


In words, our sampling probability for data from subject i is p(o_i) multiplied by the marginal probability of the observed variables, after integrating out the missing variables. To make this more concrete, suppose y_i = (y_{i,1}, NA, y_{i,3}, NA)^T, so o_i = (1, 0, 1, 0)^T. Then

\begin{align}
p(o_i, y_{i,1}, y_{i,3} \mid \theta, \Sigma) &= p(o_i) \times p(y_{i,1}, y_{i,3} \mid \theta, \Sigma) \\
&= p(o_i) \times \int p(y_i \mid \theta, \Sigma)\, dy_{i,2}\, dy_{i,4}.
\end{align}

So the correct thing to do when data are missing at random is to integrate over the missing data to obtain the marginal probability of the observed data. In this particular case of the multivariate normal model, this marginal probability is easily obtained: p(y_{i,1}, y_{i,3} | θ, Σ) is simply a bivariate normal density with mean (θ_1, θ_3)^T and a covariance matrix made up of (σ_1^2, σ_{1,3}, σ_3^2). But combining marginal densities from subjects having different amounts of information can be notationally awkward. Fortunately, our integration can alternatively be done quite easily using Gibbs sampling.

    Gibbs sampling with missing data

In Bayesian inference we use probability distributions to describe our information about unknown quantities. What are the unknown quantities for our multivariate normal model with missing data? The parameters θ and Σ are unknown as usual, but the missing data are also an unknown but key component of our model. Treating it as such allows us to use Gibbs sampling to make inference on θ and Σ, as well as to make predictions for the missing values.

Let Y be the n × p matrix of all the potential data, observed and unobserved, and let O be the n × p matrix in which o_{i,j} = 1 if Y_{i,j} is observed and o_{i,j} = 0 if Y_{i,j} is missing. The matrix Y can then be thought of as consisting of two parts:

• Y_obs = {y_{i,j} : o_{i,j} = 1}, the data that we do observe, and
• Y_miss = {y_{i,j} : o_{i,j} = 0}, the data that we do not observe.

From our observed data we want to obtain p(θ, Σ, Y_miss | Y_obs), the posterior distribution of unknown and unobserved quantities. A Gibbs sampling scheme for approximating this posterior distribution can be constructed by simply adding one step to the Gibbs sampler presented in the previous section: Given starting values {Σ^{(0)}, Y_miss^{(0)}}, we generate {θ^{(s+1)}, Σ^{(s+1)}, Y_miss^{(s+1)}} from {θ^{(s)}, Σ^{(s)}, Y_miss^{(s)}} by

1. sampling θ^{(s+1)} from p(θ | Y_obs, Y_miss^{(s)}, Σ^{(s)});
2. sampling Σ^{(s+1)} from p(Σ | Y_obs, Y_miss^{(s)}, θ^{(s+1)});
3. sampling Y_miss^{(s+1)} from p(Y_miss | Y_obs, θ^{(s+1)}, Σ^{(s+1)}).

Note that in steps 1 and 2, the fixed value of Y_obs combines with the current value of Y_miss^{(s)} to form a current version of a complete data matrix Y^{(s)} having no missing values. The n rows of the matrix Y^{(s)} can then be plugged into formulae 7.6 and 7.9 to obtain the full conditional distributions of θ and Σ. Step 3 is a bit more complicated:

\begin{align}
p(Y_{miss} \mid Y_{obs}, \theta, \Sigma) &\propto p(Y_{miss}, Y_{obs} \mid \theta, \Sigma) \\
&= \prod_{i=1}^n p(y_{i,miss}, y_{i,obs} \mid \theta, \Sigma) \\
&\propto \prod_{i=1}^n p(y_{i,miss} \mid y_{i,obs}, \theta, \Sigma),
\end{align}

so for each i we need to sample the missing elements of the data vector conditional on the observed elements. This is made possible via the following result about multivariate normal distributions: Let y ∼ multivariate normal(θ, Σ), let a be a subset of variable indices {1, . . . , p} and let b be the complement of a. For example, if p = 4 then perhaps a = {1, 2} and b = {3, 4}. If you know about inverses of partitioned matrices you can show that

\begin{align}
\{y_{[b]} \mid y_{[a]}, \theta, \Sigma\} &\sim \text{multivariate normal}(\theta_{b|a}, \Sigma_{b|a}), \quad\text{where} \\
\theta_{b|a} &= \theta_{[b]} + \Sigma_{[b,a]}(\Sigma_{[a,a]})^{-1}(y_{[a]} - \theta_{[a]}) \tag{7.10} \\
\Sigma_{b|a} &= \Sigma_{[b,b]} - \Sigma_{[b,a]}(\Sigma_{[a,a]})^{-1}\Sigma_{[a,b]}. \tag{7.11}
\end{align}

In the above formulae, θ_{[b]} refers to the elements of θ corresponding to the indices in b, and Σ_{[a,b]} refers to the matrix made up of the elements that are in rows a and columns b of Σ.

Let's try to gain a little bit of intuition about what is going on in Equations 7.10 and 7.11. Suppose y is a sample from our population of four variables glu, bp, skin and bmi. If we have glu and bp data for someone (a = {1, 2}) but are missing skin and bmi measurements (b = {3, 4}), then we would be interested in the conditional distribution of these missing measurements y_{[b]} given the observed information y_{[a]}. Equation 7.10 says that the conditional mean of skin and bmi start off at their unconditional mean θ_{[b]}, but then are modified by (y_{[a]} − θ_{[a]}). For example, if a person had higher than average values of glu and bp, then (y_{[a]} − θ_{[a]}) would be a 2 × 1 vector of positive numbers. For our data the 2 × 2 matrix Σ_{[b,a]}(Σ_{[a,a]})^{-1} has all positive entries, and so θ_{b|a} > θ_{[b]}. This makes sense: If all four variables are positively correlated, then if we observe higher than average values of glu and bp, we should also expect higher than average values of skin and bmi. Also note that Σ_{b|a} is equal to the unconditional variance Σ_{[b,b]} but with something subtracted off, suggesting that the conditional variance is less than the unconditional variance. Again, this makes sense: having information about some variables should decrease, or at least not increase, our uncertainty about the others.
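A small R function implementing Equations 7.10 and 7.11 might look like the following (rcond.mvnorm is a hypothetical helper name; a and b are index vectors partitioning 1:p, and ya holds the observed values):

library(MASS)

rcond.mvnorm <- function(ya, a, b, theta, Sigma) {
  iSa  <- solve(Sigma[a, a, drop = FALSE])
  beta <- Sigma[b, a, drop = FALSE] %*% iSa                 # Sigma[b,a] Sigma[a,a]^{-1}
  m    <- theta[b] + drop(beta %*% (ya - theta[a]))         # conditional mean, Equation 7.10
  V    <- Sigma[b, b, drop = FALSE] -
            beta %*% Sigma[a, b, drop = FALSE]              # conditional variance, Equation 7.11
  mvrnorm(1, m, V)
}

## e.g. impute skin and bmi (b = 3:4) from observed glu and bp (a = 1:2),
## given some current values of theta and Sigma:
## rcond.mvnorm(ya = c(120, 70), a = 1:2, b = 3:4, theta = theta, Sigma = Sigma)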

    The R code below implements the Gibbs sampling scheme for missing datadescribed in steps 1, 2 and 3 above:
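A minimal sketch of such a sampler is given below. It assumes Y is the 200 × 4 matrix with NA entries loaded by data(chapter7), that prior parameters mu0, L0, nu0 and S0 of the appropriate dimension have been specified (centered on plausible values for each variable), and it uses MASS::mvrnorm throughout:

library(MASS)
set.seed(1)

Yfull <- as.matrix(Y)                       # working copy: observed values plus current imputations
O     <- 1 * (!is.na(Yfull))                # o_{i,j} = 1 if y_{i,j} is observed
n     <- nrow(Yfull) ; p <- ncol(Yfull)
for (j in 1:p) {                            # starting values Y_miss^(0): fill with column means
  Yfull[is.na(Yfull[, j]), j] <- mean(Yfull[, j], na.rm = TRUE)
}
Sigma <- cov(Yfull)

THETA <- SIGMA <- YMISS <- NULL
for (s in 1:1000) {
  ## 1. sample theta | Y_obs, Y_miss, Sigma   (Equation 7.6)
  ybar  <- colMeans(Yfull)
  Ln    <- solve(solve(L0) + n * solve(Sigma))
  mun   <- Ln %*% (solve(L0) %*% mu0 + n * solve(Sigma) %*% ybar)
  theta <- mvrnorm(1, c(mun), Ln)

  ## 2. sample Sigma | Y_obs, Y_miss, theta   (Equation 7.9)
  Sn    <- S0 + (t(Yfull) - c(theta)) %*% t(t(Yfull) - c(theta))
  Z     <- mvrnorm(nu0 + n, rep(0, p), solve(Sn))
  Sigma <- solve(t(Z) %*% Z)

  ## 3. sample Y_miss | Y_obs, theta, Sigma, one subject at a time (Equations 7.10 and 7.11)
  for (i in which(rowSums(O) < p)) {
    a    <- which(O[i, ] == 1) ; b <- which(O[i, ] == 0)
    iSa  <- solve(Sigma[a, a, drop = FALSE])
    beta <- Sigma[b, a, drop = FALSE] %*% iSa
    m    <- theta[b] + drop(beta %*% (Yfull[i, a] - theta[a]))
    V    <- Sigma[b, b, drop = FALSE] - beta %*% Sigma[a, b, drop = FALSE]
    Yfull[i, b] <- mvrnorm(1, m, V)
  }

  ## save results
  THETA <- rbind(THETA, theta)
  SIGMA <- rbind(SIGMA, c(Sigma))
  YMISS <- rbind(YMISS, Yfull[O == 0])
}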


Fig. 7.4. Ninety-five percent posterior confidence intervals for correlations (left) and regression coefficients (right).

Recalling Equation 7.10, the conditional mean of y_{[b]} = glu, given numerical values of y_{[a]} = {bp, skin, bmi}, is given by

E[y_{[b]} \mid \theta, \Sigma, y_{[a]}] = \theta_{[b]} + \beta_{b|a}^T (y_{[a]} - \theta_{[a]})

where β_{b|a}^T = Σ_{[b,a]}(Σ_{[a,a]})^{-1}. Since this takes the form of a linear regression model, we call the value of β_{b|a} the regression coefficient for y_{[b]} given y_{[a]} based on Σ. Values of β_{b|a} can be computed for each posterior sample of Σ, allowing us to obtain posterior expectations and confidence intervals for these regression coefficients. Quantile-based 95% confidence intervals for each of {β_{1|234}, β_{2|134}, β_{3|124}, β_{4|123}} are shown graphically in the second column of Figure 7.4. The regression coefficients often tell a different story than the correlations: The bottom row of plots, for example, shows that while there is strong evidence that the correlations between bmi and each of the other variables are all positive, the plots on the right-hand side suggest that bmi is nearly conditionally independent of glu and bp given skin.
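These coefficients are simple functions of the posterior samples of Σ. A sketch of the computation for one choice of b, assuming SIGMA holds the saved draws (each row a flattened 4 × 4 matrix, as in the sketch above), is:

p <- 4
b <- 1 ; a <- setdiff(1:p, b)            # e.g. beta_{1|234}: glu regressed on bp, skin, bmi
BETA <- t(apply(SIGMA, 1, function(sig) {
  Sig <- matrix(sig, p, p)
  drop(Sig[b, a] %*% solve(Sig[a, a]))   # beta_{b|a}^T = Sigma[b,a] Sigma[a,a]^{-1}
}))
apply(BETA, 2, quantile, probs = c(0.025, 0.975))   # 95% intervals, as in Figure 7.4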

Fig. 7.5. True values of the missing data versus their posterior expectations.

    Out-of-sample validation

Actually, the dataset we just analyzed was created by taking a complete data matrix with no missing values and randomly replacing 10% of the entries with NA's. Since the original dataset is available, we can compare values predicted by the model to the actual sample values. This comparison is made graphically in Figure 7.5, which plots the true value of y_{i,j} against its posterior mean for each {i, j} such that o_{i,j} = 0. It looks like we are able to do a better job predicting missing values of skin and bmi than the other two variables. This makes sense, as these two variables have the highest correlation. If skin is missing, we can make a good prediction for it based on the observed value of bmi, and vice-versa. Such a procedure, where we evaluate how well a model does at predicting data that were not used to estimate the parameters, is called out-of-sample validation, and is often used to quantify the predictive performance of a model.

    7.6 Discussion and further references

The multivariate normal model can be justified as a sampling model for reasons analogous to those for the univariate normal model (see Section 5.7): It is characterized by independence between the sample mean and sample variance (Rao, 1958), it is a maximum entropy distribution and it provides consistent estimation of the population mean and variance, even if the population is not multivariate normal.

The multivariate normal and Wishart distributions form the foundation of multivariate data analysis. A classic text on the subject is Mardia et al (1979), and one with more coverage of Bayesian approaches is Press (1982). An area of much current Bayesian research involving the multivariate normal distribution is the study of graphical models (Lauritzen, 1996; Jordan, 1998). A graphical model allows elements of the precision matrix to be exactly equal to zero, implying some variables are conditionally independent of each other. A generalization of the Wishart distribution, known as the hyper-inverse-Wishart distribution, has been developed for such models (Dawid and Lauritzen, 1993; Letac and Massam, 2007).
