Course Background
• Office Hours: By Appointment
• Office: Room 5672, Oceanography, Life Sciences Complex
• Phone: 494-3491
• E-mail: [email protected]
• Class time: 10:05 am to 11:25 am on Tuesday and Thursday.
• Classroom: Dunn Building, room 302.
The textbook is the 6th edition of Applied Multivariate Statistical Analysis, published by Prentice Hall and written by R.A. Johnson and D.W. Wichern.
Marking Scheme
30% Assignments, 30% Midterm, 40% Final
I will use the software package MATLAB, but you can use S-Plus or R if you prefer.
Notes on Applied Multivariate Analysis
4350/5350
Keith R. Thompson
September, 2013
These notes are intended to complement, not replace, the course text Applied Multivariate Statistical Analysis by R.A. Johnson and D.W. Wichern. They are designed to guide class presentation and discussion. It is essential that you read the textbook in conjunction with my notes.
Notes, updates, assignments etc., are available from
http://www.phys.ocean.dal.ca/~keith/Multivariate_2013/
READING AHEAD PAYS OFF!
Contents

1 Preliminaries 7
1.1 Simple Descriptive Statistics 14
1.2 Linear Transformations of Variables 22
1.3 Spectral Decomposition of Covariance Matrices 26
1.4 Sample Principal Components 31
1.5 Simple Linear Regression and Principal Components 39
1.6 Statistical Distance 40
1.7 Generalized Sample Variance 42

2 Random Vectors and Matrices 44
2.1 Brief Review of Random Variables 44
2.2 Two Random Variables and Associated Pdfs 45
2.3 Generalization to Higher Dimensions 47
2.4 Expectations, Covariance, Correlation and Independence 48
2.5 Random Vectors and Matrices 50
2.6 Partitioning Random Vectors 51
2.7 Linear Transformations of Random Vectors 52
2.8 Independence and Random Samples 53

3 The Multivariate Normal 55
3.1 Useful Results Related to the Multivariate Normal 57
3.2 Assessing Normality and Detecting Outliers (Post Midterm) 64
3.2.1 Q-Q Plots 64
3.2.2 Q-Q Plots for Multivariate Data 66

4 Inferences About Multivariate Means 69
4.1 Review of Confidence Intervals for Univariate Means 69
4.2 Confidence Regions for Multivariate µ with Σ Known 71
4.3 Confidence Regions for Multivariate µ with Σ Unknown 73
4.4 Large Samples 75
4.5 Simultaneous Confidence Intervals and Regions 76
4.6 Bonferroni Confidence Intervals 78
4.7 Maximum Likelihood Estimation of µ and Σ 81

5 Multiple and Multivariate Regression 82
5.1 Multiple Regression 82
5.1.1 Least Squares Estimation of β and σ² 83
5.1.2 Means and Covariances of the Estimators β, ε and s² 85
5.1.3 Confidence Regions and Intervals 85
5.1.4 Tests Involving Linear Combinations of the βi 86
5.2 Multivariate Regression 87
5.2.1 Least Squares Estimation of β and Σ 87
5.2.2 Means and Covariances of the Estimators β, ε and Σ 88
5.2.3 Tests for Subsets of Regression Parameters 89
5.2.4 Confidence Intervals and Regions for z0′β 91

6 Relating Two Random Vectors 93
6.1 Principal Component Analysis 93
6.2 Canonical Correlation Analysis 95
6.3 Multivariate Linear Regression Using Random Vectors 97
6.4 Multivariate Regression on Principal Components 98
6.5 Redundancy Analysis 100
6.6 Partial Least Squares and Regression 102
Chapter 1
Preliminaries
Multivariate statistical analysis is the analysis of simultaneous measurements of more than one variable on a number of “items”, e.g., individuals, trials, experimental units, times.
Usually the measurements are organized in an n×p data matrix:
X =
    [ x11  x12  ...  x1p ]
    [ x21  x22  ...  x2p ]
    [  :    :          :  ]
    [ xn1  xn2  ...  xnp ]      (1.1)
where each row corresponds to a specific item and each column corresponds to a variable. More complex situations can include several data matrices with different dimensions and missing values.
EXAMPLE: The classic study of Jolicoeur and Mosimann examined the size and shape of female turtle shells. (See references to the carapace data set in JW.) Jolicoeur and Mosimann made p = 3 measurements (length, width and height in millimeters) on n = 24 turtles. The data are given in the following data matrix.
X =
98 81 38
103 84 38
103 86 42
105 86 42
109 88 44
123 92 50
123 95 46
133 99 51
133 102 51
133 102 51
134 100 48
136 102 49
138 98 51
138 99 51
141 105 53
147 108 57
149 107 55
153 107 56
155 115 63
155 117 60
158 115 62
159 118 63
162 124 61
177 132 67
In general xij is the element on the ith row and jth column of X. Here i = 1, ..., 24 defines a specific turtle and j = 1, ..., 3 defines the variable (length, width and height respectively).
QUESTIONS: What is the width of the shell of the second turtle? Eyeball the mean length of the shells. How much does width vary from turtle to turtle? Do you notice any structure in the data?
Always plot your data first! Here’s a scatterplot matrix:
Figure 1.1: Scatterplot matrix of the turtle data. All measurements are in millimeters. Histograms appear in the diagonal subpanels.
1. Which variable has the largest mean? The smallest standard deviation?

2. What do you notice about the relationship between subplots above and below the diagonal?

3. Which pair of variables is most highly correlated?

4. Do the observations appear to have come from a Gaussian distribution?
EXAMPLE: Amitriptyline is an antidepressant. There are concerns over its effectiveness with respect to gender (z1 = 1 for female, 0 for male) and amount taken (z2). The “response” variables are total TCAD (tricyclic antidepressant) plasma level (Y1) and amount of amitriptyline present in the TCAD plasma level (Y2). Data were collected on 17 patients admitted to hospital after amitriptyline overdose. The matrix below gives the values of Y1, Y2, z1 and z2 in columns 1 to 4 respectively. (For details see Exercise 7.25 on p426.)
X =
3389 3149 1 7500
1101 653 1 1975
1131 810 0 3600
596 448 1 675
896 844 1 750
1767 1450 1 2500
807 493 1 350
1111 941 0 1500
645 547 1 375
628 392 1 1050
1360 1283 1 3000
652 458 1 450
860 722 1 1750
500 384 0 2000
781 501 0 4500
1070 405 0 1500
1754 1520 1 3000
QUESTIONS: How does this differ from a multiple regression problem? What is special about the gender variable? How can we assess confidence in our predictions?
Here is a scatter plot matrix of the amitriptyline data (use plotmatrix(X) in MATLAB).
Figure 1.2: Scatterplots of the amitriptyline data. The variables are ordered as follows: total TCAD plasma level (Y1); amount of amitriptyline present in TCAD plasma level (Y2); gender (z1); amount taken (z2).
Partition the panels into “response” and “predictor” groups and discuss relations of variables within and between groups.
EXAMPLE: Altimeters can measure sea level variability from space with centimetre accuracy. Here the data matrix consists of snapshots of sea level measured weekly along the equator in the Pacific at p = 111 locations and n = 763 times (from 14 October, 1992 to 23 May, 2007). Thus each row of X corresponds to a particular time, and each column to sea level at a particular location.
A useful way to plot the data is shown below.
Figure 1.3: Plot of sea level measured weekly along the equator from 1992 to 2007. High sea level is shown in red, low sea level in blue. This is a “Hovmöller plot” but is also a graphical representation of the data matrix X. Longitude is measured in degrees east from the dateline.
Multivariate analysis addresses questions like¹:

1. Can the variability from individual to individual be described effectively using fewer than p variables?

How many variables are required to describe the variability in the turtle data effectively? Describe a three-dimensional view of this data set. What are the effective dimensions of the sea level data?

2. Can the multivariate observations be grouped or “clustered” in some way?

There is no evidence of clustering in the turtle data set. Can you think of an example where there may be clustering?

3. Given measurements on some variables, how well can the remaining variables be predicted?

If you know the gender and the amount of amitriptyline taken by the overdose patient, what can you tell me about TCAD blood plasma levels? If you know sea level in the western Pacific, what can you say about sea level in the eastern Pacific?

4. Are the population means and correlations equal?

The diagonal panels of the turtle scatterplot matrix provide strong evidence that the mean length is greater than the mean height. Many cases arise where it is more difficult to detect a statistically significant difference.

How do you find a 95% confidence region for the population mean length and width of turtle shells?

We are asking here for a generalization of a univariate confidence interval.

¹See p2-5 of JW.
1.1 Simple Descriptive Statistics
Assume the data are organized in the n × p data matrix

X =
    [ x11  x12  ...  x1p ]
    [ x21  x22  ...  x2p ]
    [  :    :          :  ]
    [ xn1  xn2  ...  xnp ]
SAMPLE MEAN:

x̄k = (1/n) Σ_{j=1}^n xjk,      k = 1, ..., p

This is the sample mean of the kth column. The p sample means for the p variables (x̄1, ..., x̄p) are arranged in the p × 1 sample mean vector:

x̄ = [x̄1, x̄2, ..., x̄p]′
SAMPLE VARIANCE AND COVARIANCE:

sk² = (1/(n−1)) Σ_{j=1}^n (xjk − x̄k)²,      k = 1, ..., p

This is the sample variance of the kth column. We sometimes use skk for sk². Note that the square root of a variance is a standard deviation, so sk = √skk.
The sample covariance between the ith and kth variables is

sik = (1/(n−1)) Σ_{j=1}^n (xji − x̄i)(xjk − x̄k)

The sample variances and covariances for the p variables are organized in the p × p sample covariance matrix

S =
    [ s11  s12  ...  s1p ]
    [ s21  s22  ...  s2p ]
    [  :    :          :  ]
    [ sp1  sp2  ...  spp ]
SAMPLE CORRELATION:

The sample correlation between the ith and kth variables is

rik = sik / (si sk) = sik / √(sii skk)

Note that correlation has no units, always lies between −1 and +1, and measures the strength of the linear relationship between the ith and kth variables. The sample correlations for the p variables are organized in the p × p sample correlation matrix:

R =
    [ 1    r12  ...  r1p ]
    [ r21  1    ...  r2p ]
    [  :    :          :  ]
    [ rp1  rp2  ...  1   ]
QUESTION: What do you notice about the form of S and R?
EXAMPLE:
Consider the turtle data consisting of measurements of the length, width and height (in millimeters) of the 24 turtle shells.

The sample means of length, width and height are

x̄ = [136.0, 102.6, 52.0]′
QUESTIONS: Locate the sample means on the scatterplot matrix.
The sample covariance matrix is
S =
452 271 166
271 172 102
166 102 65
QUESTIONS: What are the units of the elements of S? Which variable is most variable and what is its standard deviation? Relate to the scatterplot matrix shown earlier. What do you notice about the form of S?
The sample correlation matrix is
R =
1 0.973 0.971
0.973 1 0.966
0.971 0.966 1
QUESTIONS: Which pair of variables is most highly correlated? What proportion of the variance of the lengths can be predicted using a linear predictor based on width?
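The statistics quoted in this example are easy to reproduce. Here is a quick check sketched in Python/NumPy (the course itself uses MATLAB, so treat this as an illustrative translation rather than course material); the 24 × 3 data matrix is copied from the turtle data listed earlier.

```python
# Sample mean, covariance and correlation of the turtle carapace data,
# recomputed in Python/NumPy as a check on the values quoted in the notes.
import numpy as np

X = np.array([
    [98, 81, 38], [103, 84, 38], [103, 86, 42], [105, 86, 42],
    [109, 88, 44], [123, 92, 50], [123, 95, 46], [133, 99, 51],
    [133, 102, 51], [133, 102, 51], [134, 100, 48], [136, 102, 49],
    [138, 98, 51], [138, 99, 51], [141, 105, 53], [147, 108, 57],
    [149, 107, 55], [153, 107, 56], [155, 115, 63], [155, 117, 60],
    [158, 115, 62], [159, 118, 63], [162, 124, 61], [177, 132, 67],
], dtype=float)

xbar = X.mean(axis=0)             # sample mean vector
S = np.cov(X, rowvar=False)       # sample covariance (divisor n - 1)
s = np.sqrt(np.diag(S))           # standard deviations
R = S / np.outer(s, s)            # sample correlation matrix

print(np.round(xbar, 1))          # approx [136.0 102.6  52.0]
print(np.round(R, 3))             # all pairwise correlations > 0.96
```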
REVIEW OF PARTITIONED MATRICES: The above equations can be simplified greatly using matrix notation. Before simplifying, it is useful to review multiplication of partitioned matrices (addition and scalar multiplication are straightforward).
Partition A and B into submatrices, or blocks, as follows:

A =
    [ A11  A12  ...  A1p ]
    [ A21  A22  ...  A2p ]
    [  :    :          :  ]
    [ Aq1  Aq2  ...  Aqp ]

and

B =
    [ B11  B12  ...  B1r ]
    [ B21  B22  ...  B2r ]
    [  :    :          :  ]
    [ Bp1  Bp2  ...  Bpr ]
The product of these two matrices, C = AB, is given by

C =
    [ C11  C12  ...  C1r ]
    [ C21  C22  ...  C2r ]
    [  :    :          :  ]
    [ Cq1  Cq2  ...  Cqr ]

where

Cij = Σ_{k=1}^p Aik Bkj

assuming the dimensions of the submatrices (or blocks) allow multiplication (i.e., the partition is conformable).
QUESTION: What happens if you partition just A by its rows? Just B by its columns? A by its columns and B by its rows?
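The block-multiplication rule above can be verified numerically. A small sketch in Python/NumPy (the matrices and the particular partition here are arbitrary illustrations, not from the text):

```python
# Block (partitioned) matrix multiplication: with a conformable partition,
# C_ij = sum_k A_ik B_kj reproduces the ordinary product AB.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((4, 6))

# Partition A into a 2x2 grid of blocks, and B conformably.
A11, A12 = A[:2, :3], A[:2, 3:]
A21, A22 = A[2:, :3], A[2:, 3:]
B11, B12 = B[:3, :2], B[:3, 2:]
B21, B22 = B[3:, :2], B[3:, 2:]

C = np.block([
    [A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22],
    [A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22],
])

assert np.allclose(C, A @ B)   # blockwise product agrees with AB
```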
SAMPLE STATISTICS IN MATRIX FORM

Define the n × 1 vector of ones² by

1 = [1, 1, ..., 1]′      (1.2)

It follows³ that

x̄ = (1/n) X′1      (1.3)
QUESTION: Interpret the n × p matrix X̄ = 1x̄′. Show X − X̄ is a matrix of deviations about the mean (i.e., the column mean has been removed from each element of the data matrix).
The sample covariance matrix can be written in the form

S = (1/(n−1)) (X − X̄)′(X − X̄)
  = (1/(n−1)) [X′X − X′X̄ − X̄′X + X̄′X̄]
  = (1/(n−1)) [X′X − (1/n)X′11′X − (1/n)X′11′X + (1/n)X′11′X]

and hence

S = (1/(n−1)) X′[I − (1/n)11′]X      (1.4)
²Note use of boldface.
³Read Section 3.5, p137-140, of JW and the supplement to Chapter 2 if necessary. If you still have problems with the matrix algebra, see me.
Let

Q = I − (1/n)11′

QUESTION: Describe the elements of this matrix.

It is straightforward to show (can you?)

Q² = Q

This means that Q is an idempotent matrix.

The matrix of deviations about the mean can be written

X − X̄ = QX

and the covariance matrix S can be written

S = (1/(n−1)) X′QX = (1/(n−1)) X′Q²X

QUESTION: Interpret X′Q²X and hence S.
The sample correlation matrix is related to the sample covariance matrix by

R = D⁻¹SD⁻¹      (1.5)

where

D =
    [ s1  0   ...  0  ]
    [ 0   s2  ...  0  ]
    [ :            :  ]
    [ 0   0   ...  sp ]

is a diagonal matrix with the standard deviations on the diagonal.

QUESTION: What is D⁻¹? D^{1/2}? D^{−1/2}?
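Relation (1.5) is also easy to confirm numerically; a Python/NumPy sketch with arbitrary simulated data:

```python
# Check of R = D^{-1} S D^{-1}, with D the diagonal matrix of sample
# standard deviations.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 4)) @ rng.standard_normal((4, 4))

S = np.cov(X, rowvar=False)
Dinv = np.diag(1 / np.sqrt(np.diag(S)))   # D^{-1}: reciprocal std devs

R = Dinv @ S @ Dinv
assert np.allclose(R, np.corrcoef(X, rowvar=False))
assert np.allclose(np.diag(R), 1.0)       # unit diagonal, as it must be
```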
EXAMPLE:
The sample mean of the Pacific sea level data is a vector of length 111, with an element for each longitude (measured in degrees east from 180W). A similar vector can be made up of the standard deviations of the sea level (one element for the standard deviation of each column of the data matrix).
Figure 1.4: Sample standard deviation of sea level anomalies as a function of longitude along the equator (roughly the Philippines to Peru).
We can also plot the correlation between sea level at one grid point and sea level at all other grid points (corresponding to plotting a row or column of R). Here are two examples (each corresponding to a row/column of the correlation matrix):
Figure 1.5: Sample correlation of sea level at longitudes of 0° and 60° with all other locations along the equatorial Pacific.
QUESTION: For which location does sea level “decorrelate” more quickly with distance? What do you notice about the form of the correlation for the red line close to r = 1? The bottom line is that there is often a lot of information in basic statistical plots if you interpret them carefully.
1.2 Linear Transformations of Variables
How will the scatterplots and sample statistics change if we linearly transform the observations (e.g., change the units of the turtle widths from mm to cm)? More specifically, how will the scatterplots and statistics change if we replace the turtle length (x1), width (x2) and height (x3) by the following indices:⁴
z1 = 0.8139x1 + 0.4961x2 + 0.3025x3
z2 = −0.5549x1 + 0.8180x2 + 0.1514x3
z3 = 0.1723x1 + 0.2911x2 − 0.9411x3
If this transformation is applied to each multivariate observation (i.e., to each row of the data matrix), the result can be written in the following matrix expression:

Z = XC
where X is the matrix of turtle measurements and
C =
0.8139 −0.5549 0.1723
0.4961 0.8180 0.2911
0.3025 0.1514 −0.9411
(1.6)
QUESTION: What is the dimension of the matrix Z? What do the columns and rows correspond to? Suppose we had defined q such indices; what would then be the dimension of Z?
QUESTION: Interpret Z if C = 1 for the turtle example.
⁴These indices will turn out to be principal components.
Figure 1.6: Scatterplot matrix of the new indices z1, z2 and z3. Note that the correlation between the indices has been reduced, and z1 has by far the largest standard deviation.
Is there a simple way to calculate the sample means, variances and covariances of the transformed variables? Yes (read Section 3.6 of JW). The sample means of the columns of Z are
z̄ = (1/n) Z′1 = (1/n) C′X′1 = C′((1/n) X′1)

and so

z̄ = C′x̄

Thus to get the sample means of the columns of Z, simply transform the sample mean of the original data (x̄) in the same way as each multivariate observation (i.e., each row of X).
Similarly the q × q covariance matrix of Z is

Szz = (1/(n−1)) Z′[I − (1/n)11′]Z
    = (1/(n−1)) C′X′[I − (1/n)11′]XC

and so

Szz = C′SC
Thus to get the sample covariances of the new indices, pre- and post-multiply the sample covariance of the original data by C′ and C respectively. More generally, if we write C in partitioned form,

C = [c1 c2 ... cq]

then Szz = C′SC can be written

Szz =
    [ c1′Sc1  ...  c1′Scq ]
    [ c2′Sc1  ...  c2′Scq ]
    [   :             :   ]
    [ cq′Sc1  ...  cq′Scq ]
Thus if we define two linear combinations by Xci and Xcj, their sample variances are ci′Sci and cj′Scj, and their covariance is ci′Scj.
QUESTIONS: What is the sample correlation between the indices Xci and Xcj? (Read Section 3.6 of JW, p140-144.) Discuss the special case ci = cj = 1, where 1 denotes a vector of ones.
Result
The sample mean and covariance of Z = XC + b are

z̄ = C′x̄ + b

Szz = C′SC
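This result can be checked against the turtle example. The sketch below (Python/NumPy rather than the course's MATLAB) uses the S and C quoted in the text; since both are rounded, Szz comes out close to, not exactly, diagonal:

```python
# Check of Szz = C'SC for the turtle example, using the (rounded)
# covariance matrix S and transformation C quoted in the notes.
import numpy as np

S = np.array([[452.0, 271.0, 166.0],
              [271.0, 172.0, 102.0],
              [166.0, 102.0,  65.0]])
C = np.array([[0.8139, -0.5549,  0.1723],
              [0.4961,  0.8180,  0.2911],
              [0.3025,  0.1514, -0.9411]])

Szz = C.T @ S @ C
off_diag = Szz - np.diag(np.diag(Szz))
print(np.round(Szz, 2))   # near-diagonal, diagonal close to (678, 7, 3)
```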
EXAMPLE:
The sample mean of the turtle variables, after transforming by (1.6), is given by

z̄ = [177.35, 16.31, 4.32]′
The sample covariance matrix is
Szz =
678.37 0 0
0 6.77 0
0 0 2.85
and the sample correlation matrix is
Rzz =
1 0 0
0 1 0
0 0 1
Note that the correlation between the new indices is exactly zero. (See Figure 1.6 for a scatterplot of the 3 indices. Note the reduction in correlation compared to the original data shown in Figure 1.1.)
QUESTION: Calculate the standard deviations of the 3 indices and relate them to the histograms plotted in Figure 1.6. Note the high ratio of the largest to smallest standard deviation. How does this ratio compare to that of the original observations?
Can we always find a linear transformation that eliminates the correlations amongst the original variables? The answer, as you will see in the next section, is yes. This transformation greatly simplifies the analysis of multivariate data.
1.3 Spectral Decomposition of Covariance Matrices
In the last example the original data were linearly transformed in such a way that the correlations of the new indices were all zero. But how do we find these transformations? To answer this question we need to introduce orthogonal matrices and understand the spectral decomposition of positive definite matrices.⁵
Let E denote a square matrix with ith column denoted by ei.
Definition
The square matrix E = [e1 e2 ... ep] is said to be orthogonal if its columns, considered as vectors, are mutually orthogonal and have unit length, i.e., E′E = I.

Thus the columns of E are unit length (ei′ei = 1) and orthogonal (ei′ej = 0, i ≠ j), i.e., they are orthonormal.
Resultᵃ
A matrix E is orthogonal if and only if E⁻¹ = E′. For an orthogonal matrix

EE′ = E′E = I

and so the rows of E are also orthonormal.

ᵃSee p97, JW. Note that, in general, a matrix A cannot have different left and right inverses. To see this assume BA = I and AC = I. Then B(AC) = (BA)C and so BI = IC, i.e., B = C.
QUESTION: What can you say about the determinant of an orthogonal matrix? (Hint: |A′| = |A| and |AB| = |A||B|.)
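A concrete example, sketched in Python/NumPy: a 2-D rotation matrix is orthogonal, its inverse is its transpose, and its determinant is +1 (a reflection would give −1).

```python
# A rotation matrix as a concrete orthogonal matrix.
import numpy as np

theta = np.deg2rad(30)
E = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(E.T @ E, np.eye(2))      # columns orthonormal
assert np.allclose(E @ E.T, np.eye(2))      # rows orthonormal too
assert np.allclose(np.linalg.inv(E), E.T)   # E^{-1} = E'
assert np.isclose(np.linalg.det(E), 1.0)    # |E| = +-1; here +1
```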
⁵All of this material (and more) is covered in detail in Chapter 2 and its supplement. It's important that you understand this material. See me if you have problems.
REVIEW OF SINGULAR VALUE DECOMPOSITION: The SVD is one of the most powerful decompositions in Multivariate Analysis.
Resultᵃ
Let A be an m × k matrix of real numbers. There exist orthogonal matrices U and V such that

A = UΛV′

where Λ has {i, i} entry λi ≥ 0 for i = 1, ..., min(m, k), and 0 otherwise. The positive λi are called the singular values of A.

ᵃSee p100, JW.
QUESTION: What are the dimensions of U, Λ and V?
QUESTION: Assume m < k and write Λ = [Λm 0] where Λm is an m × m diagonal matrix. Show A = UmΛmVm′ where Um and Vm hold the first m columns of U and V. (This is the “thin” SVD.)

QUESTION: What are the eigenvectors and eigenvalues of AA′ and A′A in terms of U, V and Λ?
The SVD is useful when approximating A by a “simpler” matrix:
Eckart-Young Theoremᵃ
Let A be an m × k matrix of real numbers with singular value decomposition UΛV′. Let s be an integer less than rank(A). Then

B = Σ_{i=1}^s λi ui vi′

minimizes Σ_{i=1}^m Σ_{j=1}^k (aij − bij)² over all B with rank(B) ≤ s.

ᵃSee Result 2A.16, p102 of JW.
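Both results can be illustrated with NumPy's built-in SVD; a sketch with an arbitrary random matrix (not data from the course):

```python
# Truncating the SVD to the s largest singular values gives a rank-s
# approximation whose residual sum of squares equals the sum of the
# discarded squared singular values (the Eckart-Young optimum).
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 9))

U, lam, Vt = np.linalg.svd(A)                  # A = U Lambda V'
s = 2
B = U[:, :s] @ np.diag(lam[:s]) @ Vt[:s, :]    # rank-s truncation

rss = np.sum((A - B) ** 2)
assert np.isclose(rss, np.sum(lam[s:] ** 2))
assert np.linalg.matrix_rank(B) == s
```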
Orthogonal matrices also appear when we rotate datasets. To illustrate, let E denote an orthogonal matrix. An arbitrary vector x can always be written as the following sum of projections⁶:

x = EE′x = [e1 e2 ... ep]z = e1z1 + e2z2 + ... + epzp

where zi = ei′x. The zi can be interpreted as the coordinates of x with respect to the orthonormal basis defined by the columns of E.
Figure 1.7: Coordinates of x with respect to the orthonormal basis defined by the columns of E. The coordinates are the elements of z = E′x. In this example E defines a counterclockwise rotation through 30°.
QUESTION: What happens if we approximate a cloud of 2D points by setting z2 = 0? What happens to a 3D cloud of points, and 3 orthonormal vectors, if we set z3 = 0?
⁶See p54, JW.
Suppose we transform each row of our original data matrix X using the orthogonal matrix E as follows:

Z = XE

It follows that

Szz = E′SE
Can we find a matrix E that transforms the original covariance matrix into a diagonal matrix?
Result
Given a symmetric p × p matrix A, there exists an orthogonal matrix E such that

E′AE = Λ      (1.7)

where Λ is the diagonal matrix

Λ =
    [ λ1  0   ...  0  ]
    [ 0   λ2  ...  0  ]
    [ :            :  ]
    [ 0   0   ...  λp ]

If A is positive definite, the λi are greater than zero. If the λi are distinct, the columns of E are unique up to a sign.
This is an extremely important result. It means, for example, that any covariance matrix can be transformed into a diagonal matrix (implying uncorrelated variables) using an orthogonal transformation.
Pre-multiplying (1.7) by E, and using EE′ = I, gives

AE = EΛ

Equivalently (make sure you can show this using partitioned matrices)

Aei = λiei,      i = 1, ..., p

where ei is the ith column of E. Thus the columns of E are eigenvectors of A and the λi are the corresponding eigenvalues.⁷
Pre- and post-multiplying (1.7) by E and E′ respectively gives the following important result:

SPECTRAL DECOMPOSITION OF A SYMMETRIC MATRIXᵃ

A = EΛE′ = λ1e1e1′ + λ2e2e2′ + ... + λpepep′      (1.8)

ᵃSee p61 of JW.
This result shows A can be expressed as a sum of “simpler” matrices. This will prove useful later on.

QUESTION: What is the i, j element of A if λ2 = ... = λp = 0?

QUESTION: What is the spectral decomposition of a diagonal matrix A with distinct diagonal elements?
QUESTIONS: What are the eigenvectors and eigenvalues of the sample correlation matrix

R = [ 1  r ]
    [ r  1 ]

What is its spectral decomposition? Explain how you can always transform a bivariate data set into uncorrelated variables.
⁷See p97 and 98, JW.
1.4 Sample Principal Components
Transforming the dataset X using the eigenvectors of S has attractive properties beyond just providing uncorrelated variables. To illustrate, consider the following example.
EXAMPLE: The National Track records for men for n = 54 countries are listed in Table 8.6 of JW for the following p = 8 races:

Race   100m   200m   400m   800m   1500m  5000m  10,000m  Marathon
Units  (s)    (s)    (s)    (min)  (min)  (min)  (min)    (min)
Mean   10.22  20.54  45.83  1.77   3.65   13.62  28.54    133.48
Std    0.22   0.55   1.44   0.05   0.15   0.76   1.68     8.95

Table 1.1: Sample means and standard deviations of the National Track records data for men.
Figure 1.8: Scatterplot matrix of the men’s National Track data. (Histograms on the diagonal.)
The correlation matrix, R, is given below.
R =
1 0.91 0.80 0.71 0.77 0.74 0.71 0.68
0.91 1 0.85 0.80 0.80 0.76 0.75 0.72
0.80 0.84 1 0.77 0.77 0.78 0.77 0.71
0.71 0.80 0.77 1 0.90 0.86 0.84 0.81
0.77 0.80 0.77 0.90 1 0.92 0.90 0.88
0.74 0.76 0.78 0.86 0.92 1 0.99 0.94
0.71 0.75 0.77 0.84 0.90 0.99 1 0.95
0.68 0.72 0.71 0.81 0.88 0.94 0.95 1
QUESTION: What do you notice about the scatterplot matrix above, and the form of the sample correlation matrix?
QUESTION: How would you summarize the variability in words?
We will now summarize the variability in the run times for the p = 8 races using a small number of indices, i.e., principal components. Although principal component analysis is described in Chapter 8 of JW, bringing it forward helps greatly in understanding several basic concepts, including the spectral decomposition of a covariance matrix, statistical distance and the multivariate normal.
Let e denote an arbitrary p × 1 vector of unit length. What choice of e maximizes the variance of the transformed variable z = e′x, and what choice minimizes it? To answer such questions we first need to define, and examine the properties of, quadratic forms.
Definition
A symmetric matrix A, and the associated quadratic form

x′Ax,

are said to be positive definite if, for all x ≠ 0,

x′Ax > 0

x′Ax is called a quadratic form because it involves sums of terms like xi² and xixj. (Write out the quadratic form for p = 2.)
Quadratic forms arise when finding the variance of linearly transformed variables and play an important role in Multivariate Analysis. (The sample variance of Xe is e′Se, i.e., a quadratic form.)
QUESTION: Show that A is positive definite if and only if its eigenvalues are positive. (Hint: To prove that λi > 0 for all i implies A is positive definite, write A = EΛE′ and define y = E′x. Note y′y = x′x > 0, thus x ≠ 0 implies y ≠ 0.)
The following result shows how to maximize and minimize quadratic forms, and thus the variance of transformed variables.
Result
Let S denote a p × p positive definite matrix with eigenvalues

λ1 ≥ λ2 ≥ ... ≥ λp > 0

and corresponding orthonormal eigenvectors e1, e2, ..., ep. Then

max_{e ≠ 0} e′Se/e′e = λ1, with e = e1

min_{e ≠ 0} e′Se/e′e = λp, with e = ep

Further, the maximum of e′Se/e′e over e ≠ 0, subject to the constraints

e′e1 = e′e2 = ... = e′ek = 0,

is λ_{k+1} and is achieved when e = e_{k+1}.
Proof: See p80 of JW. To illustrate, let's prove the last part. First write S = EΛE′ where E = [e1 ... ep]. Define zj = e′ej. The constraints are equivalent to z1 = z2 = ... = zk = 0 and thus

e′Se/e′e = z′Λz/z′z = (Σ_{i=k+1}^p λi zi²) / (Σ_{i=1}^p zi²) = λ_{k+1} (Σ_{i=k+1}^p (λi/λ_{k+1}) zi²) / (Σ_{i=1}^p zi²)

Note Σ_{i=k+1}^p (λi/λ_{k+1}) zi² ≤ Σ_{i=1}^p zi² and the result follows. ∎
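The result can also be illustrated numerically; a Python/NumPy sketch that samples random directions and checks that the quotient e′Se/e′e never escapes the interval [λp, λ1]:

```python
# The Rayleigh quotient e'Se/e'e of a positive definite S is bounded by
# the smallest and largest eigenvalues, with the bounds attained at the
# corresponding eigenvectors. S here is an arbitrary sample covariance.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 4))
S = np.cov(A, rowvar=False)            # positive definite (generically)

lam, E = np.linalg.eigh(S)             # ascending eigenvalues
quotients = []
for _ in range(1000):
    e = rng.standard_normal(4)         # random (non-unit) direction
    quotients.append(e @ S @ e / (e @ e))
quotients = np.array(quotients)

assert np.all(quotients <= lam[-1] + 1e-10)
assert np.all(quotients >= lam[0] - 1e-10)
e1 = E[:, -1]                          # eigenvector of largest eigenvalue
assert np.isclose(e1 @ S @ e1, lam[-1])
```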
The above result implies that the transformed variable e′x with maximum variance, subject to the normalizing constraint e′e = 1, is defined by taking e = e1, i.e., e is the normalized eigenvector of S with the largest eigenvalue, λ1. The sample variance of e′x is λ1.

The linear combination e1′x is the first (sample) principal component. The jth principal component, ej′x, maximizes variance subject to the constraints ej′ej = 1 and ej′ek = 0 for k < j. The last principal component has minimum variance.
Principal components are the coordinates of the multivariate observations with respect to the eigenvectors of S:

x = Ez = e1z1 + e2z2 + ... + epzp
Figure 1.9: Scatterplot of a simulated bivariate data set. The eigenvectors of S, the covariance matrix of the simulated observations, are shown by the red lines. The principal components, z1 and z2, are coordinates of the points with respect to the red lines. Which eigenvector corresponds to e1? The green line is from the regression of x2 on x1. Note it is not aligned with e1.
PARTITIONING OF VARIANCE AND SCREE PLOTS: A simple scalar measure of the “spread” of a multivariate data set is the sum of the variances of the individual components, i.e., the trace of S:

tr S = s11 + s22 + ... + spp

This is sometimes called the total sample variance. Using the Spectral Decomposition Theorem, and the matrix identity tr AB = tr BA, it is straightforward to show (make sure you can)

tr S = λ1 + λ2 + ... + λp

i.e., the total sample variance is the sum of the variances of the p sample principal components.
The quantity Σ_{k=1}^q λk / Σ_{k=1}^p λk is the proportion of total variance accounted for by the first q principal components. Similarly, the total variance of the error made by approximating the observed x by e1z1 + e2z2 + ... + eqzq is Σ_{k=q+1}^p λk.

A plot of λk against k is called a scree plot and is used to visualize how effectively the dimension of a multivariate data set can be reduced by linearly transforming the original variables. (See p444-446 of JW.)
SMALL EIGENVALUES CAN BE INTERESTING TOO: Sometimes the eigenvectors with small (or possibly zero) eigenvalues are of interest because they correspond to linear combinations of the original variables that do not change amongst individuals. This corresponds in some applications to interesting conserved quantities.
Summary of Sample Principal Componentsᵃ

Let S denote a p × p positive definite matrix with eigenvalues

λ1 ≥ λ2 ≥ ... ≥ λp > 0

and corresponding orthonormal eigenvectors e1, e2, ..., ep.

The ith sample principal component is defined by

zi = ei′x

where x is a p × 1 observation vector.

The sample variance of zi is λi and the covariance amongst all distinct pairs of principal components is zero.

The total sample variance can be written

tr S = s11 + s22 + ... + spp = λ1 + λ2 + ... + λp

The sample correlation between the ith principal component and the observations of the kth variable is

r_{zi,xk} = eik √λi / √skk

ᵃSee p442 of JW.

QUESTION: Prove r_{zi,xk} = eik √λi / √skk.
EXAMPLE: Let's find the eigenvectors, and principal components, of the National Track record data plotted in Figure 1.8.

In general it makes little sense to work with the trace of S, or with linear combinations of the variables, if the original measurements are in different units (e.g., cm, kg). In such a case the variables must be transformed so that linear combinations are meaningful. It sometimes makes sense to scale the variables by their sample standard deviations, which is equivalent to performing a principal component analysis on the sample correlation matrix, R. (Note the eigenvectors of S and R are, in general, not simply related to each other.) Why use R in the track example?
Race  100m    200m     400m     800m     1500m    5000m    10,000m  Marathon
e1    0.3324  0.3461   0.3391   0.3530   0.3660   0.3698   0.3659   0.3543
e2    0.5294  0.4704   0.3453   −0.0895  −0.1537  −0.2948  −0.3336  −0.3866
e3    0.3439  −0.0038  −0.0671  −0.7827  −0.2443  0.1829   0.2440   0.3346

Table 1.2: First three eigenvectors of the correlation matrix of the National Track records data for men. Note the eigenvectors are unit length and orthogonal.
Figure 1.10: Scree plot of the men's National Track record data. Note the trace of a p × p correlation matrix is p, and so the sum of the eigenvalues in this example is p = 8. What does a high value of z1 correspond to? A low (negative) value?
1.5 Simple Linear Regression and Principal Components
Principal component analysis rotates the coordinate system to align the first axis with the direction of maximum variability, the second axis with the direction of maximum variability subject to the constraint that it is orthogonal to the first axis, and so on. See the figure below.
Figure 1.11: Scatterplot with principal axes (red lines) and regression line of y on x (green line). Note the major principal axis and the regression line do not coincide. Which line would you use to describe the relationship between x and y, and why?
Principal component analysis minimizes the perpendicular distance of points from the fitted (red) line; linear regression minimizes the vertical distance (green line). Beware: the slopes will, in general, be different!
QUESTION: Find e1, λ1, the slope of the regression line and R² for

S = s² [1 r; r 1]

for r > 0. Compare the slopes of the two fitted lines as r increases from 0 to 1.
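The QUESTION above can be explored numerically. A sketch using the illustrative values s² = 1 and r = 0.5 (any r in (0, 1) behaves the same way):

```python
import numpy as np

# S = s^2 [1 r; r 1]: compare the major principal axis with the
# regression line of y on x.  Illustrative values, not from the notes.
s2, r = 1.0, 0.5
S = s2 * np.array([[1.0, r],
                   [r, 1.0]])

lam, E = np.linalg.eigh(S)          # eigenvalues in ascending order
e1 = E[:, -1]                       # eigenvector of the largest eigenvalue
slope_pca = e1[1] / e1[0]           # slope of the major principal axis

slope_reg = S[0, 1] / S[0, 0]       # regression slope s_xy / s_xx
r_squared = S[0, 1] ** 2 / (S[0, 0] * S[1, 1])
```

For this S the major axis always has slope 1 while the regression slope equals r, so the two fitted lines only coincide in the limit r → 1.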
1.6 Statistical Distance
The squared Euclidean distance between two points is defined by

d² = (x − y)′(x − y) = (x1 − y1)² + (x2 − y2)² + ... + (xp − yp)²
Statistical distance allows for the variability within a data set, or population, when quantifying how far one point is from another.[8]
The concept of statistical distance is illustrated in Figure 1.12. It relies heavily on the ideas underlying principal component analysis and is a fundamental concept in Multivariate Analysis.
Figure 1.12: Illustration of statistical distance of a point relative to the origin. Left panel: Scatterplot of a simulated bivariate data set. The point P is 'unusual' compared to the others, even though its Euclidean distance from the sample mean is not unusual. Middle panel: To calculate the "statistical distance" of point P from the origin, first rotate the coordinate system to align the principal axes of variability with the x and y axes. Right panel: Scale each axis by the variability in each direction (i.e., divide the rotated data by their respective standard deviations) and then use Euclidean distance. The result is a "statistical distance" relative to the origin. Note the point P is now clearly identified as unusual. The ellipses in the left and middle panels, and the circles in the right panel, correspond to points that are a statistical distance of 1, 2 and 3 units from the origin.
[8] Section 1.5, p30 to 37.
MATHEMATICAL DEFINITION OF STATISTICAL DISTANCE: Rotating and scaling corresponds to the following transformation:

x̃ = Λ^(−1/2) E′x

where the columns of E are the eigenvectors of S,

Λ^(1/2) = diag(√λ1, √λ2, ..., √λp)

and λ1, ..., λp are the corresponding eigenvalues. E′x is the p × 1 vector of sample principal components of the multivariate observation x; multiplying by Λ^(−1/2) scales the sample principal components by their standard deviations.
The squared statistical distance between two points (x1 and x2) is measured in a Euclidean sense after rotation and scaling:

d² = (x̃1 − x̃2)′(x̃1 − x̃2) = (x1 − x2)′EΛ⁻¹E′(x1 − x2)

Noting EΛ⁻¹E′ = S⁻¹ leads to the following definition of the squared statistical distance between x1 and x2:

d² = (x1 − x2)′S⁻¹(x1 − x2)   (1.9)
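The equivalence between "rotate, scale, then take Euclidean distance" and the S⁻¹ form in (1.9) can be checked numerically. A sketch with an arbitrary positive definite covariance matrix (illustrative values only):

```python
import numpy as np

# Arbitrary positive definite "sample covariance" and two points.
S = np.array([[4.0, 1.5],
              [1.5, 1.0]])
x1 = np.array([2.0, 1.0])
x2 = np.array([-1.0, 0.5])

lam, E = np.linalg.eigh(S)                   # S = E Lambda E'
xt1 = (E.T @ x1) / np.sqrt(lam)              # rotated and scaled points
xt2 = (E.T @ x2) / np.sqrt(lam)

d2_rotated = np.sum((xt1 - xt2) ** 2)                 # Euclidean after transform
d2_direct = (x1 - x2) @ np.linalg.solve(S, x1 - x2)   # (x1-x2)' S^{-1} (x1-x2)
```

The two computations agree to machine precision, as (1.9) requires.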
QUESTION: Show d² is invariant with respect to transformations of the form Ax + b where A is a nonsingular matrix.
QUESTION: Show points that are a constant statistical distance d from the origin trace an ellipsoid that crosses the ith principal axis at zi = ±d√λi. See Figure 1.12.
QUESTION: Show d satisfies the conditions required to be a valid distance (JW, p37): (i) d(P,Q) = d(Q,P); (ii) d(P,Q) = 0 if P = Q; (iii) d(P,Q) > 0 if P ≠ Q; (iv) d(P,Q) ≤ d(P,R) + d(R,Q).
1.7 Generalized Sample Variance
Sometimes we need a single number to characterize the "spread" of a multivariate sample.[9] We have already seen one such measure:
TOTAL SAMPLE VARIANCE:

tr S = s11 + s22 + ... + spp = λ1 + λ2 + ... + λp

where λ1, ..., λp are the eigenvalues of S.
The determinant[10] of S is also used sometimes:

GENERALIZED SAMPLE VARIANCE:

|S| = λ1 λ2 ··· λp
To interpret the generalized sample variance, note the statistical distance of x from the origin is

d² = x′S⁻¹x = z1²/λ1 + z2²/λ2 + ... + zp²/λp

where z = E′x are the principal components of x. In 2D this is the equation of an ellipse (see Figure 1.12). The area of the ellipse defined by d² = 1 is

A = π√(λ1λ2)

and thus, for p = 2, we have

|S| = λ1λ2 = A²/π²

Thus |S| is proportional to the squared area of the ellipse defined by d = 1.
[9] Section 3.4, JW.
[10] |BC| = |B||C|
If p = 3 then setting d² = 1 defines an ellipsoid; if p > 3 it defines a hyper-ellipsoid. It can be shown[11] that for arbitrary p

|S| ∝ A²

where the coefficient of proportionality depends only on p, and A denotes the area (p = 2) or volume (p > 2) of the ellipsoid defined by d² = 1.
Thus, loosely speaking, the generalized sample variance |S| = λ1λ2 ··· λp increases with the squared "volume" of that part of the p-dimensional variable space peppered by data (see Figure 1.12).
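The two eigenvalue identities underlying these measures (tr S = Σλi and |S| = Πλi) can be verified directly. A sketch with illustrative numbers:

```python
import numpy as np

# Arbitrary positive definite sample covariance matrix (illustrative).
S = np.array([[3.0, 1.0, 0.5],
              [1.0, 2.0, 0.3],
              [0.5, 0.3, 1.5]])

lam = np.linalg.eigvalsh(S)

total_variance = np.trace(S)              # total sample variance
generalized_variance = np.linalg.det(S)   # generalized sample variance
```

The trace equals the sum of the eigenvalues and the determinant equals their product, whichever covariance matrix is used.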
QUESTION: Show |S| is zero ⇔ at least one eigenvalue of S is zero ⇔ measurements on one variable can be written as a linear combination of measurements on the remaining variables, after removal of the mean. (If |S| = 0 at least one column of Xa contains redundant information and the scatter of observations will lie in a subspace; e.g., in 2D (3D) the points lie on a straight line (plane).)
QUESTIONS: Discuss the total sample variance and generalized sample variance of

S = [3 0; 0 3]   and   S = [3 3; 3 3]

Draw possible scatterplots.
QUESTION: What is the generalized variance of R, the sample correlation matrix, for p = 2? What is the relationship between the generalized variances of S and R for arbitrary p?
READ SECTION 3.4, P123-137.
[11] See p126 of JW.
Chapter 2
Random Vectors and Matrices
So far we have focused on the data matrix X of observations and statistics such as the sample mean, covariance and correlation. Often we require inferences about the population from which the data are drawn. This requires concepts such as random variables, vectors and matrices, and also random samples. (Read Sections 2.5 and 2.6 of JW.)
2.1 Brief Review of Random Variables
Let X denote a continuous random variable (rv) with a probability density function (pdf) denoted by f(x).[1] The probability that X will be between a and a + ∆ is given by the area under the pdf:

Pr(a < X < a + ∆) = ∫_a^(a+∆) f(x) dx
To interpret the pdf note that if ∆ is small,
Pr(a < X < a + ∆) ≈ f (a)∆ (2.1)
Note probability has no units, so f(x) has units of [x]⁻¹.
[1] Note that we are denoting rvs by uppercase, and specific values by lowercase.
Let φ(X) denote some function of the rv X. The expected value of φ(X) is given by (provided the integral exists)

E[φ(X)] = ∫_{−∞}^{∞} φ(x) f(x) dx

If φ(X) = X we obtain µ, the mean of the rv X:

E(X) = ∫_{−∞}^{∞} x f(x) dx

If φ(X) = (X − µ)² we obtain σ², the variance of the rv X:

E[(X − µ)²] = ∫_{−∞}^{∞} (x − µ)² f(x) dx
The square root of the variance is the standard deviation, σ.
QUESTION: Make sure you can show E(aX + b) = aµ + b and Var(aX + b) = a²σ² where a and b are constants. Note the additive constant b has no effect on the variance. We will generalize these results to random vectors in subsequent sections.
2.2 Two Random Variables and Associated Pdfs
We denote the joint pdf of a pair of rvs, X1 and X2, by f12(x1, x2).
The probability X1 will be between a1 and a2, and X2 between b1 and b2, is given by the following volume integral (a straightforward generalization of the area integral for the univariate case):

Pr(a1 < X1 < a2, b1 < X2 < b2) = ∫_{a1}^{a2} ∫_{b1}^{b2} f12(x1, x2) dx2 dx1
Figure 2.1: Typical bivariate probability density function. If you think of the contours as defining a topographic map, Pr(2 < X1 < 2.5, 1 < X2 < 2) is just the volume of that part of the "mountain" above the rectangle 2 < x1 < 2.5, 1 < x2 < 2. The marginal densities are shown by the shaded regions. Note the marginal pdf is obtained by integrating the joint pdf and the conditional pdf is obtained by "slicing" it. This particular density is a skew-t density.
To interpret the bivariate pdf, note that for small intervals

Pr(a1 < X1 < a2, b1 < X2 < b2) ≈ f12(a1, b1)(a2 − a1)(b2 − b1)
QUESTIONS: Suppose X1 and X2 are the length and width in cm of a turtle to be selected at random. What are the units of f12(x1, x2)? What is the total volume (i.e., the value of the integral as a1, b1 → −∞ and a2, b2 → ∞)?
THE MARGINAL DENSITY of X1 is defined by

f1(x1) = ∫_{−∞}^{∞} f12(x1, x2) dx2

It corresponds to the probability that X1 will be between x1 and x1 + dx1, regardless of the value of X2.
THE CONDITIONAL DENSITY of X1 given X2 = x2 is defined by

f_{1|2}(x1|x2) = f12(x1, x2) / f2(x2)

To see where this definition comes from, note that for small ∆

f_{1|2}(x1|x2)∆ = f12(x1, x2)∆² / [f2(x2)∆]
               ≈ Pr(x1 < X1 < x1 + ∆, x2 < X2 < x2 + ∆) / Pr(x2 < X2 < x2 + ∆)
               = Pr(x1 < X1 < x1 + ∆ | x2 < X2 < x2 + ∆)

This is just the conditional probability Pr(A|B) where A is the event x1 < X1 < x1 + ∆, and B is the event x2 < X2 < x2 + ∆.
2.3 Generalization to Higher Dimensions
The generalization of marginal and conditional pdfs from the bivariate (p = 2) case to higher dimensions is straightforward.[2]

Let f1...p(x1, ..., xp) denote the joint density of X1, ..., Xp. The MARGINAL DENSITY of X1, ..., Xq for q < p is defined by

f1...q(x1, ..., xq) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f1...p(x1, ..., xp) dxq+1 ... dxp

Similarly the CONDITIONAL DENSITY of X1 through Xr, given Xr+1 = xr+1 through Xp = xp, is

f1...r|r+1...p(x1, ..., xr | xr+1, ..., xp) = f1...p(x1, ..., xp) / fr+1...p(xr+1, ..., xp)

[2] See p69 of JW.
2.4 Expectations, Covariance, Correlation and Independence
Let φ(X1, X2) denote a function of the two rvs X1 and X2. The expected value of φ(X1, X2) is

E[φ(X1, X2)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} φ(x1, x2) f12(x1, x2) dx2 dx1

The covariance of X1 and X2 is based on

φ(X1, X2) = (X1 − µ1)(X2 − µ2)

where µ1 and µ2 are the means of X1 and X2. This leads to the following definition of the covariance of X1 and X2:

σ12 = E[(X1 − µ1)(X2 − µ2)]   (2.2)
Note σ12 depends on the units in which X1 and X2 are measured. The correlation of X1 and X2 is

ρ12 = σ12 / (σ1σ2)   (2.3)
Correlation measures the strength of the linear association between X1 and X2. It has no units and corresponds to the covariance if we first scale X1 and X2 by their standard deviations.

Two continuous rvs are said to be statistically independent if their joint pdf factors into a product of marginals:

f12(x1, x2) = f1(x1)f2(x2)
QUESTION: Prove statistical independence implies zero correlation, but give an example showing the reverse is not always true.
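One standard counterexample for the second part (my own choice, not from the notes): let X take the values −1, 0, 1 with equal probability and set Y = X². A sketch using exact rational arithmetic:

```python
from fractions import Fraction

# X in {-1, 0, 1} with probability 1/3 each; Y = X^2.
xs = [-1, 0, 1]
p = Fraction(1, 3)

EX = sum(p * x for x in xs)              # E(X)   = 0
EY = sum(p * x * x for x in xs)          # E(Y)   = 2/3
EXY = sum(p * x * (x * x) for x in xs)   # E(XY)  = E(X^3) = 0

cov = EXY - EX * EY                      # exactly 0 -> zero correlation

# Yet X and Y are dependent: P(X=1, Y=1) != P(X=1) P(Y=1)
joint = p                                        # X = 1 forces Y = 1
product = p * sum(p for x in xs if x * x == 1)   # (1/3)(2/3) = 2/9
```

The covariance is exactly zero even though Y is a deterministic function of X.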
QUESTION: Make sure you can show the mean and variance of a linear combination of X1 and X2 are

E(a1X1 + a2X2 + a3) = a1µ1 + a2µ2 + a3
Var(a1X1 + a2X2 + a3) = a1²σ11 + a2²σ22 + 2a1a2σ12
QUESTION: Relate the above expressions for the mean and variance of linear combinations of random variables to the equivalent results obtained earlier from the data matrix.
2.5 Random Vectors and Matrices
Random vectors and matrices are vectors and matrices whose elements are random variables.[3]
The expected value of a random vector or matrix is found by replacing each element by its expected value; e.g., for the p × 1 random vector X = (X1, ..., Xp)′ the population mean is given by

E(X) = µ = [E(X1), E(X2), ..., E(Xp)]′ = [µ1, µ2, ..., µp]′   (2.4)
Similarly the population variance-covariance matrix of X is

E[(X − µ)(X − µ)′] = Σ = [σ11 σ12 ... σ1p; σ21 σ22 ... σ2p; ... ; σp1 σp2 ... σpp]   (2.5)
If each element of Σ is standardized by its associated standard deviations we obtain the population correlation matrix

ρ = [1 ρ12 ... ρ1p; ρ21 1 ... ρ2p; ... ; ρp1 ρp2 ... 1]   (2.6)
QUESTION: Relate Σ and ρ using a standardizing matrix.[4]

[3] Read section 2.5 of JW.
[4] See p72 of JW.
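As a numerical guide to the standardizing-matrix question: with V^(1/2) = diag(σ1, ..., σp) built from the diagonal of Σ, one has ρ = V^(−1/2) Σ V^(−1/2) and conversely Σ = V^(1/2) ρ V^(1/2). A sketch (the Σ below is illustrative):

```python
import numpy as np

# Illustrative population covariance matrix.
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 9.0, -0.6],
                  [0.8, -0.6, 1.0]])

V_half = np.diag(np.sqrt(np.diag(Sigma)))    # standardizing matrix
V_half_inv = np.linalg.inv(V_half)

rho = V_half_inv @ Sigma @ V_half_inv        # correlation matrix
Sigma_back = V_half @ rho @ V_half           # recovers Sigma exactly
```

The diagonal of rho is all ones, and the transformation is exactly invertible.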
2.6 Partitioning Random Vectors
We often need to partition[5] the random vector X:

X = [X1; X2]   (2.7)

The corresponding (population) mean and covariance of X are

µ = [E(X1); E(X2)] = [µ1; µ2]   (2.8)

Σ = [Σ11 Σ12; Σ21 Σ22]   (2.9)

where Σij = E[(Xi − µi)(Xj − µj)′] for i, j = 1, 2.
QUESTION: What are the dimensions of Σ11, Σ12, Σ21 and Σ22 if the random vector X is divided into subvectors of length q and p − q? What is the relationship between Σ12 and Σ21?
QUESTION: Let X denote a future sea level profile across the Equatorial Pacific (see Figure 1.3). Partition this rv into two subvectors with X1 corresponding to the western part and X2 corresponding to the eastern part of the profile. What information does Σ11 provide? Σ12?
[5] Read p73 to 75 of JW.
2.7 Linear Transformations of Random Vectors
Let X denote a p × 1 random vector with mean and covariance µ and Σ respectively. Consider the transformation

Z = AX + b

where Z and b are q × 1 vectors, and A is q × p. This compact matrix equation is equivalent to the definition of q linear combinations of the elements of X:

[Z1; Z2; ...; Zq] = [a′1; a′2; ...; a′q] X + [b1; b2; ...; bq] = [a′1X + b1; a′2X + b2; ...; a′qX + bq]
It is straightforward to show[6]

E(Z) = Aµ + b
Cov(Z) = AΣA′

The last equation shows, for example, that the covariance between a′iX and a′jX is a′iΣaj, and their correlation is

corr(a′iX, a′jX) = a′iΣaj / √[(a′iΣai)(a′jΣaj)]   (2.10)

These expressions for population means, variances and covariances are essentially identical to their sample counterparts discussed earlier.
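The sample counterparts of these identities hold exactly: for data rows xi, the transformed rows zi = Axi + b have sample mean Ax̄ + b and sample covariance ASA′. A sketch with simulated data (the matrices A and b are arbitrary illustrative choices):

```python
import numpy as np

# Simulated data matrix: n = 50 observations on p = 3 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

A = np.array([[1.0, -1.0, 0.0],
              [0.5, 0.5, 2.0]])      # q = 2 linear combinations
b = np.array([10.0, -3.0])

Z = X @ A.T + b                      # each row is A x_i + b

mean_identity = np.allclose(Z.mean(axis=0), A @ X.mean(axis=0) + b)
cov_identity = np.allclose(np.cov(Z, rowvar=False),
                           A @ np.cov(X, rowvar=False) @ A.T)
```

Both identities hold for any data set, not just normal samples; they are algebraic facts about means and covariances.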
[6] Read p75-77 of JW.
2.8 Independence and Random Samples
The random variables X1, X2, ..., Xn are mutually statistically independent if their joint density factors into the product of their marginal densities:[7]
f (x1, . . . xn) = f1(x1)f2(x2) . . . fn(xn)
The random vectors X1, X2, ..., Xn are statistically independent if their joint density factors into a product of their marginals:
f (x1,x2 . . .xn) = f1(x1)f2(x2) . . . fn(xn)
For many practical applications it makes sense to assume the rows of the random n × p matrix

X = [X′1; X′2; ...; X′n]

are statistically independent random vectors with the same marginal density, fX say. In this case

f(x1, x2, ..., xn) = fX(x1)fX(x2) ... fX(xn)
X1, ..., Xn are said to form a random sample[8] from the density fX.
Note we have not assumed elements from the same row are independent. The assumption of a random sample greatly simplifies inference because the complete distribution of the n × p rvs in X is now determined by the p-dimensional pdf fX.
[7] See p69 of JW.
[8] See p119 of JW.
The assumption of independence of the rows of the data matrix distinguishes Multivariate Analysis from Time Series Analysis (although most of the results from Multivariate Analysis can be generalized to allow for dependence, as shown by Brillinger).
There are several ways in which the assumption of random sampling can be violated, including (see Section 3.3 of JW for examples):

1. Multivariate observations drifting through time (or correlated in time, as in the Pacific sea level example, due to physical processes such as El Niño);

2. Sampling only part of a population (e.g., assessing the health of a widely dispersed fish population by only collecting samples from a single offshore bank).
Let X1, ..., Xn denote a random sample from a joint distribution with mean µ and covariance matrix Σ.

QUESTION: Show the sample mean vector X̄ is an unbiased estimator of µ, i.e., E(X̄) = µ.

QUESTION: Show the covariance matrix of the sample mean estimator is given by

Cov(X̄) = (1/n) Σ

QUESTION: Show the sample covariance matrix S is an unbiased estimator of Σ, i.e., E(S) = Σ.[9]

[9] Hint: Read Result 3.1 and proof, on p121 of JW.
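A simulation sketch (my own, not from the notes) of the first two questions: draw many random samples of size n, compute each sample mean, and examine the spread of those means against Σ/n.

```python
import numpy as np

# Illustrative population parameters.
rng = np.random.default_rng(42)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
n, reps = 5, 20000

# shape (reps, n, 2): reps independent random samples of size n
samples = rng.multivariate_normal(mu, Sigma, size=(reps, n))
means = samples.mean(axis=1)                 # one sample mean per replication

mean_of_means = means.mean(axis=0)           # close to mu (unbiasedness)
cov_of_means = np.cov(means, rowvar=False)   # close to Sigma / n
```

The empirical covariance of the sample means shrinks by the factor 1/n relative to Σ, exactly as the result states.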
Chapter 3
The Multivariate Normal
The multivariate normal pdf is often used to model the pdf of the random vector X, i.e., for fX. It is as important in Multivariate Analysis as the univariate normal is in elementary statistics. It not only provides a good approximation to pdfs that appear in many applications, but it also has attractive theoretical properties. This pdf is now defined and its properties are discussed.[1]
Let X denote a p × 1 random vector with mean µ and positive definite covariance matrix Σ. X is said to have a multivariate normal distribution (X ∼ Np(µ, Σ)) if its pdf is

f(x) = 1 / [(2π)^(p/2) |Σ|^(1/2)] exp(−d²/2)

where

d² = (x − µ)′Σ⁻¹(x − µ)

Note d² is the squared statistical distance of x from the population mean. Given the contours of constant statistical distance of x from µ are elliptical for p > 1, it follows that the contours of constant joint pdf will also be elliptical (and centered on µ).

[1] Read sections 4.1 to 4.3 of JW.
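A direct implementation of this density is a useful sanity check: for p = 1 it must reduce to the familiar univariate normal pdf. A sketch:

```python
import math
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density at x, straight from the definition."""
    x, mu, Sigma = np.atleast_1d(x), np.atleast_1d(mu), np.atleast_2d(Sigma)
    p = len(mu)
    # squared statistical distance of x from mu
    d2 = (x - mu) @ np.linalg.solve(Sigma, x - mu)
    norm_const = (2 * math.pi) ** (p / 2) * math.sqrt(np.linalg.det(Sigma))
    return math.exp(-d2 / 2) / norm_const

# p = 1 check against (1/sqrt(2 pi sigma^2)) exp(-(x-mu)^2 / (2 sigma^2))
x0, mu0, var0 = 1.3, 0.5, 2.0
univariate = math.exp(-(x0 - mu0) ** 2 / (2 * var0)) / math.sqrt(2 * math.pi * var0)
multivariate = mvn_pdf(x0, mu0, var0)
```

At the mean of a standard bivariate normal the density is 1/(2π), another easy check against the formula.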
Here are plots of some bivariate normal pdfs.
Figure 3.1: Contours of constant density for a bivariate random variable with a multivariate normal distribution. The top panels have Σ equal to the identity matrix; for the bottom left Σ = [1 0.7; 0.7 1] and for the bottom right panel Σ = [1 −0.7; −0.7 1]. What are the population means of the four pdfs? See Figure 4.2 of JW for some perspective plots.
3.1 Useful Results Related to the Multivariate Normal
The results below[2] are based primarily on the classic text by T. W. Anderson, An Introduction to Multivariate Statistical Analysis.

Result
Let X ∼ Np(µ, Σ) and Y = CX where C is nonsingular. Then

Y ∼ Np(Cµ, CΣC′)
Proof

X has a multivariate normal distribution and so

f(x) = 1 / [(2π)^(p/2)|Σ|^(1/2)] exp[−(x − µ)′Σ⁻¹(x − µ)/2]

Transforming from X to Y (and allowing for the change in volume) gives

fY(y) = fX(C⁻¹y) / |det C|

The exponent can be rewritten

(C⁻¹y − µ)′Σ⁻¹(C⁻¹y − µ) = [C⁻¹(y − Cµ)]′Σ⁻¹[C⁻¹(y − Cµ)]
                          = (y − Cµ)′(C′)⁻¹Σ⁻¹C⁻¹(y − Cµ)
                          = (y − Cµ)′(CΣC′)⁻¹(y − Cµ)

Substituting gives

fY(y) = 1 / [(2π)^(p/2)|CΣC′|^(1/2)] exp[−(y − Cµ)′(CΣC′)⁻¹(y − Cµ)/2]

and the result follows. ∎

[2] Read Section 4.2 of JW, p149-167.
Result
Let

[X1; X2] ∼ N( [µ1; µ2], [Σ11 Σ12; Σ21 Σ22] )

Then X1 and X2 are independent ⇐⇒ Σ12 = Σ′21 = 0.
Proof

Assume X1 and X2 are independent. This implies

fX1,X2(x1, x2) = fX1(x1) fX2(x2)

Let Xi, Xj denote arbitrary components of X1 and X2 respectively. The marginal density of (Xi, Xj) can then be written in the form

fXi,Xj(xi, xj) = fXi(xi) fXj(xj)

and thus

σij = E[(Xi − µi)(Xj − µj)] = 0

This holds for all i, j and so Σ12 = Σ′21 = 0.

Conversely, assume Σ12 = 0. The density of X can be written

fX1,X2(x1, x2) = 1 / [(2π)^(p/2)|Σ11|^(1/2)|Σ22|^(1/2)] exp[−(x1 − µ1)′Σ11⁻¹(x1 − µ1)/2 − (x2 − µ2)′Σ22⁻¹(x2 − µ2)/2]

Thus if Σ12 = 0 the joint density of X factors into the product of the marginal densities of X1 and X2, and so X1 and X2 are independent. ∎
Result
Let X ∼ N(µ, Σ). The marginal distribution of any set of components of X, X2 say, is N(µ2, Σ22).

Proof

Define

[Y1; Y2] = [I −Σ12Σ22⁻¹; O I] [X1; X2]

This transformation can be written in the form Y = CX where

C = [I −Σ12Σ22⁻¹; O I]

It follows

Y ∼ N(Cµ, CΣC′)

By straightforward substitution

Cµ = [µ1 − Σ12Σ22⁻¹µ2; µ2]

CΣC′ = [Σ11 − Σ12Σ22⁻¹Σ21  0; 0  Σ22]

Thus Y1 and Y2 are independent. It follows that the marginal distribution of Y2 = X2 is N(µ2, Σ22). ∎
Result
Let X ∼ Np(µ, Σ) and assume D is a q × p matrix of rank q ≤ p. Then

DX ∼ Nq(Dµ, DΣD′)

Proof

Using the Singular Value Decomposition Theorem[3]

D = UΛV′

where U is a q × q orthogonal matrix, V is a p × p orthogonal matrix, and Λ is q × p and zero except for the (i, i) elements, which equal the q positive singular values of D.

Partitioning, we can write

D = U [Λ1 0] [V′1; V′2]

Consider now the transformation

[Y1; Y2] = [D; V′2] X

The matrix [D; V′2] is p × p and of full rank (consider its SVD). Thus the marginal distribution of Y1 = DX is Nq(Dµ, DΣD′). ∎
[3] p100 of JW.
Notes on the last result

1. If q = 1, and hence D = d′ say, then

d′X ∼ N(d′µ, d′Σd)

Linear combinations of normal variates are normal.

2. All subsets of the random components of X = (X1, X2, ..., Xp)′ are normally distributed. To see this, relabel the Xi so that X1 is the q × 1 vector of components of interest. Then take

X1 = DX = [I O] [X1; X2]

Applying the Result gives

X1 ∼ Nq(µ1, Σ11)
3. Let

V1 = c1X1 + c2X2 + ... + cnXn
V2 = b1X1 + b2X2 + ... + bnXn

denote two random vectors that are linear combinations of independent random vectors distributed as follows:

Xi ∼ Np(µi, Σ)

To determine the joint density of V1 and V2, define the np × 1 stacked vector

Xs = [X1; X2; ...; Xn]

Then we can write

[V1; V2] = D Xs

where

D = [c1I c2I ... cnI; b1I b2I ... bnI]

Using the Result gives

[V1; V2] ∼ N2p( D [µ1; µ2; ...; µn], D diag(Σ, Σ, ..., Σ) D′ )

Substituting the above form for D gives[4]

[V1; V2] ∼ N2p( [Σⁿi=1 ciµi; Σⁿi=1 biµi], [(c′c)Σ (c′b)Σ; (b′c)Σ (b′b)Σ] )
QUESTION: Interpret the special case b′ = (1, 0, ..., 0), c′ = n⁻¹(1, 1, ..., 1) and µi = µ.
QUESTION: Use the result to explore the random walk model with correlated displacements. (I'll take the lead.)
[4] See Result 4.8 on p165 of JW.
Result
Let X = [X1; X2] ∼ Np( [µ1; µ2], [Σ11 Σ12; Σ21 Σ22] ) with Σ22 positive definite.

The conditional density of X1 given X2 = x2 is

Nq( µ1 + Σ12Σ22⁻¹(x2 − µ2), Σ11 − Σ12Σ22⁻¹Σ21 )

QUESTIONS: Make sure you understand, and can reproduce, the proof given on p161 of JW. Discuss the Result when X1 and X2 are random variables. Relate to linear regression.
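As a guide to the last question, here is the conditional mean and covariance computed for a concrete bivariate case (illustrative numbers): with scalar X1 and X2 the conditional mean is the regression line µ1 + (σ12/σ22)(x2 − µ2).

```python
import numpy as np

# Illustrative bivariate parameters.
mu = np.array([1.0, 2.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

mu1, mu2 = mu[0:1], mu[1:2]
S11, S12 = Sigma[0:1, 0:1], Sigma[0:1, 1:2]
S21, S22 = Sigma[1:2, 0:1], Sigma[1:2, 1:2]

x2 = np.array([3.0])                                   # conditioning value
cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)  # mu1 + S12 S22^{-1}(x2-mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)        # S11 - S12 S22^{-1} S21

slope = Sigma[0, 1] / Sigma[1, 1]   # regression slope sigma12 / sigma22
```

Note the conditional variance (2.56 here) is smaller than the unconditional variance σ11 = 4: conditioning on X2 reduces uncertainty about X1.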
Result
Let X ∼ Np(µ, Σ) with positive definite Σ. The squared statistical distance d² = (X − µ)′Σ⁻¹(X − µ) has a χ² distribution with p degrees of freedom:

d² ∼ χ²p

Proof

Let Σ = EΛE′ and define

Z = Λ^(−1/2)E′(X − µ)

It is straightforward to show (can you?)

Z ∼ Np(0, I)

It follows

d² = (X − µ)′Σ⁻¹(X − µ) = Z′Z = Z1² + ... + Zp²

Thus d² is the sum of p squared independent standard normal variates (Zi² for i = 1, ..., p) which, by definition, has a χ²p distribution. ∎
3.2 Assessing Normality and Detecting Outliers Post Midterm
Do the multivariate observations x1, ..., xn violate the assumption that they are a random sample from a multivariate normal distribution?

First plot the data, e.g., scatterplots, dotplots and histograms of the original data and sample principal components. Next check whether the distribution of a single column of the data matrix X, or a linear combination of the columns (e.g., the first or last sample principal component), has approximately a normal distribution.[5]
3.2.1 Q-Q Plots
Order the n observations from smallest to largest:

x(1) ≤ x(2) ≤ ... ≤ x(n)

The proportion of observations that are less than or equal to x(i) is taken to be pi = (i − 1/2)/n. For the standard normal pdf, φ(x), the ith quantile qi is given implicitly by

∫_{−∞}^{qi} φ(x) dx = pi

A Q-Q (quantile-quantile) plot is just a plot of the x(i) against the corresponding qi. The closer the points are to a straight line, the closer the "shape" of the observed histogram is to a standard normal. A straightforward test[6] for normality is based on the sample correlation between x(i) and qi. Beware: for small n only severe deviations from normality can be detected; for large n we will almost always reject the null hypothesis of normality.[7]
[5] Read Section 4.6 of JW.
[6] p178-181 of JW.
[7] p187, JW.
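The Q-Q plot quantities just described can be computed in a few lines. A sketch with simulated standard normal data, using only the Python standard library:

```python
import random
from statistics import NormalDist

# Simulated sample; for normal data the Q-Q correlation should be near 1.
random.seed(1)
n = 100
data = [random.gauss(0.0, 1.0) for _ in range(n)]

x_ordered = sorted(data)                       # x_(1) <= ... <= x_(n)
probs = [(i - 0.5) / n for i in range(1, n + 1)]   # p_i = (i - 1/2)/n
q = [NormalDist().inv_cdf(pr) for pr in probs]     # standard normal quantiles

# Sample correlation between x_(i) and q_i (basis of the normality test)
mx, mq = sum(x_ordered) / n, sum(q) / n
num = sum((a - mx) * (b - mq) for a, b in zip(x_ordered, q))
den = (sum((a - mx) ** 2 for a in x_ordered)
       * sum((b - mq) ** 2 for b in q)) ** 0.5
corr = num / den
```

Plotting `x_ordered` against `q` gives the Q-Q plot itself; `corr` is the statistic compared against the critical values tabulated in JW.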
EXAMPLE:
Figure 1.12 shown earlier is a scatterplot of a random sample of size n = 200 drawn from a normal distribution, with one of the values replaced by an "outlier". The following figure shows Q-Q plots of the original data and also of the two principal components.
Figure 3.2: Q-Q plots of the original data (top panels) and principal components of the simulated data plotted in Figure 1.12.
Note that the Q-Q plots of the original data do not suggest a significant deviation from normality (the points are close to a straight line). The outlier, however, is clearly evident in the Q-Q plot of the second principal component (i.e., the one associated with the smallest eigenvalue).
3.2.2 Q-Q plots for multivariate data
The following result, proved earlier, leads to a useful way of checking normality of multivariate observations.
Result
Assume X ∼ Np(µ, Σ). The squared statistical distance

d² = (X − µ)′Σ⁻¹(X − µ)

has a χ² distribution with p degrees of freedom:

d² ∼ χ²p
How do we use this result in practice? Let x1, ..., xn be a set of n multivariate observations drawn randomly from Np(µ, Σ). If n is large enough we approximate the population mean by x̄, and the covariance by S, and expect that approximately 100(1 − α)% of the sample will satisfy

(xi − x̄)′S⁻¹(xi − x̄) < χ²p(α)   i = 1, ..., n.

Chi-Square Plot
(i) Order, in an ascending sequence, the squared statistical distances of the observations from x̄ based on S;
(ii) Plot the ordered distances, d²(i), against χ²p(1 − (i − 1/2)/n);
(iii) If the population is multivariate normal, and (n − p) ≥ 30, the points should lie close to a straight line.
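Step (i) can be sketched as follows (simulated data, with n and p chosen to match the National Track records example). A handy exact check on the computation: when S is computed with divisor n − 1, the squared distances always sum to (n − 1)p.

```python
import numpy as np

# Simulated data matrix with the same dimensions as the track example.
rng = np.random.default_rng(7)
n, p = 54, 8
X = rng.normal(size=(n, p))

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)              # divisor n - 1

diffs = X - xbar
d2 = np.einsum('ij,ij->i', diffs @ np.linalg.inv(S), diffs)
d2_ordered = np.sort(d2)                 # plotted against chi-square quantiles
```

The ordered distances `d2_ordered` are what get plotted against the χ²p(1 − (i − 1/2)/n) values in step (ii).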
But remember, if the sample size n is small, we can only detect the most pronounced forms of non-normality. Conversely, if n is large, we will likely detect a statistically significant difference.[8]
Figure 3.3: Chi-square plot for the simulated data plotted in Figure 1.12.
Figure 3.4: Chi-square plot for the National Track records data plotted in Figure 1.8.
[8] For more details, and worked examples, read Section 4.6 of JW.
Here is a small simulation study to give an idea of how much chi-square plots will change due to sampling variability. Using the same sample size and number of variables as the National Track Records data set (n = 54, p = 8), four X matrices were randomly sampled from a multivariate normal population. For each sample a chi-square plot was generated using x̄ and S calculated from X.
Figure 3.5: Chi-square plots for four X matrices of size n = 54, p = 8 randomly drawn from a multivariate normal population. Compare to the chi-square plot of the real data shown on the previous page. Conclusion?
Chapter 4
Inferences About Multivariate Means
Confidence regions are a straightforward generalization of confidence intervals for univariate means. Hypothesis testing will not be covered beyond noting it is a straightforward multivariate generalization of univariate testing. For more detail on inferences about multivariate means, and hypothesis tests, see Chapter 5 of JW.
4.1 Review of Confidence Intervals for Univariate Means
Assume X1, ..., Xn is a univariate random sample drawn from N(µ, σ²). If σ is known, it follows that X̄ ∼ N(µ, σ²/n) and thus

Pr[ |X̄ − µ| / (σ/√n) < z(α/2) ] = 1 − α

This leads to the following 100(1 − α)% confidence interval for µ:

X̄ ± z(α/2) σ/√n

Classical (non-Bayesian) interpretation: Draw a large number of realizations of X1, ..., Xn from N(µ, σ²). For each realization, calculate a confidence interval. We expect 100(1 − α)% of the intervals to include µ, i.e., to be "good".
If σ is not known, the sample standard deviation s is used in place of σ and the confidence interval takes the form

X̄ ± tn−1(α/2) s/√n
If the sample size n is large (typically greater than 30) it is possible to replace tn−1(α/2) by z(α/2). It is also possible to relax the assumption of normality of the underlying probability density function by appealing to the Central Limit Theorem:

Central Limit Theorem[a]
Let X1, ..., Xn denote a random sample from any population with mean µ and variance σ². Then √n(X̄ − µ) has approximately an N(0, σ²) distribution for large sample sizes.

[a] p176 of JW.
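A large-sample interval of this kind is a one-liner once the sample mean and standard deviation are in hand. A sketch with simulated data (illustrative values, using z in place of t as allowed above):

```python
import math
import random
from statistics import NormalDist

# Simulated univariate sample (illustrative parameters).
random.seed(3)
n = 200
x = [random.gauss(5.0, 2.0) for _ in range(n)]

xbar = sum(x) / n
s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)      # 1.96 for alpha = 0.05
half_width = z * s / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)  # xbar +/- z(alpha/2) s / sqrt(n)
```

For small n, `z` would be replaced by the tn−1(α/2) critical value.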
4.2 Confidence Regions for Multivariate µ with Σ Known.
Let X1, ..., Xn denote a random sample from Np(µ, Σ). We have proved the sample mean has a normal distribution,

X̄ ∼ N(µ, ΣX̄)   where ΣX̄ := (1/n)Σ,

and the squared statistical distance of X̄ from the population mean µ has a χ²p distribution:

d² ∼ χ²p   where d² := (X̄ − µ)′ΣX̄⁻¹(X̄ − µ)

Thus the probability the squared statistical distance between X̄ and µ will be less than the critical value χ²p(α) is, by definition,

P(d² < χ²p(α)) = 1 − α

Interpretation: Repeatedly draw random samples of size n from Np(µ, Σ). For approximately 100(1 − α)% of the realizations, the squared statistical distance of X̄ from µ will be less than χ²p(α).

Consider the set of possible µ values (Figure 4.1). Draw an ellipsoid, centered on X̄, of possible µ values within a squared statistical distance χ²p(α) of X̄:

(X̄ − µ)′ΣX̄⁻¹(X̄ − µ) < χ²p(α)   (4.1)

The probability this region covers the true µ is 1 − α. The set of possible µ values satisfying (4.1) is called a 100(1 − α)% confidence region for the true mean.[1]
[1] Read Section 5.4 of JW.
Figure 4.1: Confidence region for the true population mean defined in the space of possible µ values. Each ellipse is centered on a random sample mean X̄. The ellipse encompasses possible µ values within a squared statistical distance χ²p(α) of X̄. The probability the region will cover the true mean is, by construction, equal to 1 − α.
4.3 Confidence Regions for Multivariate µ with Σ Unknown
The population covariance Σ is usually not known. The following result allows us to construct confidence regions when all that is available is the sample covariance matrix S.

Result
Let X̄ and S denote the sample mean and sample covariance of a random sample of size n drawn from Np(µ, Σ). Then

n(X̄ − µ)′S⁻¹(X̄ − µ) ∼ [(n − 1)p / (n − p)] Fp,n−p   (4.2)

Note that the statistic n(X̄ − µ)′S⁻¹(X̄ − µ) is the squared statistical distance of X̄ from µ based on the approximation

Cov(X̄) ≈ (1/n)S

The squared distance based on S is called the T² statistic:[2]

T² = n(X̄ − µ)′S⁻¹(X̄ − µ)

QUESTION: Show[3] that T² is invariant with respect to linear transformations of the form Ax + b.

Thus to find a confidence region when Σ is not known, all we have to change in (4.1) is the critical value:

n(X̄ − µ)′S⁻¹(X̄ − µ) < [(n − 1)p / (n − p)] Fp,n−p(α)   (4.3)
[2] p212 of JW.
[3] p215 of JW.
EXAMPLE:[4] An ecologist measures the tail (x1) and wing (x2) length (in mm) of n = 45 female hook-billed kites.
Figure 4.2: Scatter plot of the female kite data and the 95% confidence region for the mean tail and wing length of female kites. All measurements are in millimeters.
The sample means for the female kites are

x̄ = [193.6, 279.8]′

and the sample covariance matrix is

S = [120.7 122.3; 122.3 208.5]

[4] p285 of JW.
To find a 95% confidence region for µ we must find the elliptical boundary of the region defined by (4.3). Note

S = EΛE′   where   E = [0.8179 0.5754; −0.5754 0.8179]   and   Λ = [34.6 0; 0 294.6]

and for α = 0.05

[(n − 1)p / (n(n − p))] Fp,n−p(α) = [(44 × 2) / (45 × 43)] × 3.215

The major axis is aligned with the eigenvector of S corresponding to the largest eigenvalue. (See p221 of JW, and Example 5.3, for help constructing the ellipsoid.)
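The numbers above can be reproduced directly, and the ellipse axis half-lengths obtained as √(λi c²) where c² is the critical constant. A sketch using the sample values and the F critical value 3.215 quoted in the example:

```python
import numpy as np

# Kite example: sample covariance, sample size and F critical value
# taken from the worked example above.
S = np.array([[120.7, 122.3],
              [122.3, 208.5]])
n, p, F_crit = 45, 2, 3.215

lam, E = np.linalg.eigh(S)             # eigenvalues ~34.6 and ~294.6
c2 = (n - 1) * p * F_crit / (n * (n - p))
half_lengths = np.sqrt(lam * c2)       # half-lengths of the ellipse axes
```

The longer half-length pairs with the larger eigenvalue, confirming the remark about the major axis.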
4.4 Large Samples
Suppose you have a random sample of n multivariate observations. If n − p is large we can relax the assumption that the sample is drawn from a normal distribution and still use (4.1). The rationale is based on the following (unproved) result:[5]

Result
Let X1, X2, ..., Xn be a random sample from a population, possibly not normal, with mean µ and positive definite covariance Σ. For large n − p, the squared statistical distance n(X̄ − µ)′S⁻¹(X̄ − µ) is approximately distributed as χ²p.

Thus for large samples we can test hypotheses, and calculate confidence regions, using the χ²-based formulae even though the population from which the sample is drawn may not be normal. Note this parallels exactly the univariate situation.
[5] See p176 of JW.
4.5 Simultaneous Confidence Intervals and Regions
Let X1, ..., Xn be a random sample from Np(µ, Σ). Then

(X̄ − µ)′S⁻¹(X̄ − µ) < [(n − 1)p / (n(n − p))] Fp,n−p(α)

will hold with probability 1 − α. This led to confidence regions for µ based on the sample mean x̄ and sample covariance S.
Confidence regions are difficult to visualize for p ≥ 3. In addition we are often interested in confidence intervals/regions for subsets of the components of µ.
EXAMPLE: The length (x1), width (x2) and height (x3) of n = 24 female turtle shells were recorded. Find confidence intervals and regions for the following linear combinations of elements of µ:

Linear combination i   Ai                Aiµ
1                      [1 0 0]           µ1
2                      [0 1 −1]          µ2 − µ3
3                      [1 1 1]           µ1 + µ2 + µ3
4                      [1 0 0; 0 1 0]    [µ1; µ2]
How do we construct confidence intervals/regions for A1µ, ..., Amµ that will all be good simultaneously with probability 1 − α? The obvious approach is to use AiX̄ and its estimated covariance, n⁻¹AiSA′i, to define confidence regions of the form

(X̄ − µ)′A′i(AiSA′i)⁻¹Ai(X̄ − µ) < c²   i = 1, ..., m

But how do we choose c² such that the probability all m confidence regions/intervals are simultaneously good is 1 − α?
Using the extended Cauchy–Schwarz inequality[6] it can be shown

(X̄ − µ)′A′i(AiSA′i)⁻¹Ai(X̄ − µ) ≤ (X̄ − µ)′S⁻¹(X̄ − µ)

If the full confidence ellipsoid covers the true mean µ then

(X̄ − µ)′S⁻¹(X̄ − µ) < [(n − 1)p / (n(n − p))] Fp,n−p(α)

By design this will happen with probability 1 − α. But when this event occurs (i.e., the region is "good") the following also occurs:

(X̄ − µ)′A′i(AiSA′i)⁻¹Ai(X̄ − µ) < [(n − 1)p / (n(n − p))] Fp,n−p(α)   (4.4)

i.e., all m confidence intervals/regions for Aiµ, i = 1, ..., m will be good; they will hold simultaneously with confidence coefficient of at least 1 − α.
The bottom line is that if we use (4.4) we can examine as many linear combinations of the elements of µ as we want and still maintain an overall confidence level of 100(1 - α)%. For this reason they are called simultaneous confidence intervals or regions. They are excellent for "data snooping".
The reason simultaneous confidence intervals/regions work (i.e., have an overall confidence level of 1 - α) is that they are the "shadows" of the p-dimensional ellipsoid on the subspace spanned by the rows of Ai. Thus if µ is covered by the full ellipsoid, its projection onto the subspace will also be covered by the shadow cast by the ellipsoid.7
6 (b′d)² ≤ (b′Wb)(d′W⁻¹d). Take b = X̄ - µ, d = A′i(AiSA′i)⁻¹Ai(X̄ - µ) and W = S⁻¹.
7Read Supplement 5A of JW for more detail.
4.6 Bonferroni Confidence Intervals
The intervals and regions discussed in the previous section are good for "data snooping". But assume that, before we analyze the data, we decide to focus on a limited number of confidence intervals. Can we find tighter intervals that will hold with an overall confidence level of 1 - α? The answer is yes, as explained below.
Let Ci be the event that the ith confidence interval covers the population quantity of interest (i.e., it is "good"). If the complement C̄i occurs we have a bad interval. The following Bonferroni inequality turns out to be useful:
P(C1 ∩ C2 ∩ . . . ∩ Cm) ≥ 1 - P(C̄1) - P(C̄2) - . . . - P(C̄m)
Assume each individual interval covers the corresponding population quantity with probability 1 - α. This means the probability it is "bad" is α, i.e. P(C̄i) = α. Using the above inequality, the probability that all m intervals are good is at least 1 - mα, i.e.

P(C1 ∩ C2 ∩ . . . ∩ Cm) ≥ 1 - mα
Thus by requiring the individual confidence intervals to have a confidence coefficient of 1 - α/m we obtain an overall confidence level for the m intervals of at least 1 - α.
Bonferroni intervals are very straightforward to calculate and will generally be significantly narrower than their simultaneous counterparts for small m. See JW, and the following example, for more detail.
Example
The sample means of length, width and height of the n = 24 turtle shells are

x̄ = [136.0, 102.6, 52.0]′
and the sample covariance matrix is
S = [ 452 271 166
      271 172 102
      166 102  65 ]
TWO BONFERRONI INTERVALS WITH α = 0.05

x̄i ± t_{n-1}(α/2m) √(sii/n)

t_{n-1}(α/4) = 2.3979
t_{n-1}(α/4) √(s11/n) = 10.4006
t_{n-1}(α/4) √(s22/n) = 6.4143
SIMULTANEOUS INTERVALS WITH α = 0.05

x̄i ± √( [p(n-1)/(n-p)] F_{p,n-p}(α) sii/n )

F_{p,n-p}(α) = 3.0725
√( [p(n-1)/(n-p)] F_{p,n-p}(α) s11/n ) = 13.7813
√( [p(n-1)/(n-p)] F_{p,n-p}(α) s22/n ) = 8.4992
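The half-widths above can be reproduced in any statistics package. Here is a sketch in Python/SciPy (the course software is MATLAB or R; the values agree with the above up to rounding):

```python
import numpy as np
from scipy import stats

n, p, alpha, m = 24, 3, 0.05, 2
s11, s22 = 452.0, 172.0      # diagonal entries of S for length and width

# Bonferroni half-widths: t_{n-1}(alpha/(2m)) * sqrt(s_ii / n)
t_crit = stats.t.ppf(1 - alpha / (2 * m), df=n - 1)        # ~2.398
bonf1 = t_crit * np.sqrt(s11 / n)                          # ~10.4
bonf2 = t_crit * np.sqrt(s22 / n)                          # ~6.4

# Simultaneous half-widths: sqrt(p(n-1)/(n-p) * F_{p,n-p}(alpha) * s_ii/n)
F_crit = stats.f.ppf(1 - alpha, p, n - p)                  # ~3.07
sim1 = np.sqrt(p * (n - 1) / (n - p) * F_crit * s11 / n)   # ~13.8
sim2 = np.sqrt(p * (n - 1) / (n - p) * F_crit * s22 / n)   # ~8.5
```

Note the simultaneous intervals are wider, as they must be: they protect all linear combinations, not just the two chosen in advance.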
Figure 4.3: 95% confidence region for the mean length and width of female turtle shells (red line). The green lines are the two Bonferroni intervals, and the blue lines are the two simultaneous intervals. Note the simultaneous intervals define the shadow of the ellipse on the x1 and x2 axes (corresponding to A1 = (1, 0, 0) and A2 = (0, 1, 0) respectively).
4.7 Maximum Likelihood Estimation of µ and Σ
The density of a random sample X1, . . . , Xn from Np(µ, Σ) is

f(x1, . . . , xn) = (2π)^{-np/2} |Σ|^{-n/2} exp[ -Σ_{i=1}^n (xi - µ)′Σ⁻¹(xi - µ)/2 ]

The exponent can be rewritten (make sure you can show this)

Σ_{i=1}^n (xi - µ)′Σ⁻¹(xi - µ) = tr[ Σ⁻¹ Σ_{i=1}^n (xi - µ)(xi - µ)′ ]
                               = (n - 1) tr(Σ⁻¹S) + n(x̄ - µ)′Σ⁻¹(x̄ - µ)

The joint density of X1, . . . , Xn therefore depends only on x̄ and S: they are "sufficient statistics" (they alone specify the joint density of X1, . . . , Xn; this is not generally true for non-normal populations). Substituting the observed values of the Xi into the joint density, and treating it as a function of Σ and µ, gives the likelihood function8:
L(µ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp{ -[ (n - 1) tr(Σ⁻¹S) + n(x̄ - µ)′Σ⁻¹(x̄ - µ) ]/2 }

The maximum likelihood estimates (MLEs) of µ and Σ maximize L(µ, Σ).

Given Σ > 0, the MLE for µ is

µ̂ = x̄

The likelihood function then simplifies to

L(Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp[ -(n - 1) tr(Σ⁻¹S)/2 ]

Using Result 4.10 of JW, the MLE for Σ is

Σ̂ = ((n - 1)/n) S
8For details and additional information, read Section 4.3 of JW.
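A quick numerical check of the two estimates, sketched in Python/NumPy with simulated data (the mixing matrix A is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 1.0]])
X = rng.normal(size=(50, 3)) @ A.T        # sample from N(0, A A'), n = 50

n = X.shape[0]
mu_hat = X.mean(axis=0)                   # MLE of mu: the sample mean
S = np.cov(X, rowvar=False)               # unbiased sample covariance S
Sigma_hat = (n - 1) / n * S               # MLE of Sigma rescales S by (n-1)/n

# Equivalent direct form: average outer product about mu_hat
Xc = X - mu_hat
assert np.allclose(Sigma_hat, Xc.T @ Xc / n)
```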
Chapter 5
Multiple and Multivariate Regression
We first provide a brief overview of multiple regression (one dependent variable, m = 1) and then discuss multivariate regression (m > 1). The bottom line is that the methodology and theory of multiple regression carry over almost directly to multivariate regression.
5.1 Multiple Regression
According to the classical multiple regression model, the random vector of observations on the dependent (response) variable is of the form

Y = Zβ + ε
where Z is a fixed n × (r + 1) "design" matrix

Z = [ 1 z11 z12 . . . z1r
      1 z21 z22 . . . z2r
      ...
      1 zn1 zn2 . . . znr ]

holding the known values of the independent variables (predictors). ε is an n × 1 zero-mean random vector with covariance σ²I. In general β and σ² are unknown and estimated from observations.
QUESTIONS: Write out, and interpret, the equation for the ith observation, Yi. What is the role of the column of ones in Z? What are E(Y) and Cov(Y)? How might you transform the model if Cov(ε) = σ²W and W is known?
5.1.1 Least Squares Estimation of β and σ2
Let y denote an n × 1 vector of given observations of the dependent variable. Assume Z has rank r + 1 ≤ n (thereby ensuring (Z′Z)⁻¹ exists). The b that minimizes the sum of squares of errors, (y - Zb)′(y - Zb), is the familiar least-squares estimate1:

β̂ = (Z′Z)⁻¹Z′y
The predicted (or fitted) values of y are given by

ŷ = Hy

where

H = Z(Z′Z)⁻¹Z′
The n × n matrix H is often referred to as the hat matrix. H projects any n × 1 vector onto the subspace spanned by the columns of Z. To see this, note that if the SVD of Z is UΛV′, where U is n × (r + 1) and U′U = I, then H = UU′.
The residuals from the least squares regression are given by

ε̂ = y - ŷ = (I - H)y
I - H projects any vector onto the orthogonal complement of the column space of Z, i.e., the subspace of vectors orthogonal to the columns of Z.
QUESTION: Interpret M = n⁻¹11′, I - M and H - M. Show they are symmetric and idempotent.
1 See p. 364 of JW.
It is straightforward to show the residual vector is orthogonal to the predictors and the predictions2:

Z′ε̂ = 0,    ŷ′ε̂ = 0
The "sum of squares decomposition" follows immediately:

ε̂′ε̂ = y′y - ŷ′ŷ
To correct for the sample mean ȳ note we can write

ε̂′ε̂ = y′[(I - M) - (H - M)]y
     = [(I - M)y]′(I - M)y - [(H - M)y]′(H - M)y

and so

Σ_{i=1}^n ε̂i² = Σ_{i=1}^n (yi - ȳ)² - Σ_{i=1}^n (ŷi - ȳ)²
    (SSE)           (SST)               (SSR)
This leads to the definition of the "coefficient of determination":

R² = SSR/SST
Note 0 ≤ R² ≤ 1. It is the proportion of variance of the dependent variable observations accounted for by the r predictors.
The sample standard deviation of the residuals is defined by

s = √( SSE/(n - r - 1) )
Note that least squares estimation is just an exercise in geometry. None of the statistical assumptions about the model are used.
2 See p. 364 of JW.
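The geometry above is easy to verify numerically. A Python/NumPy sketch with simulated data (design and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 40, 2
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])  # intercept + r predictors
y = Z @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=n)

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)    # least-squares estimate
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)           # hat matrix
y_hat = H @ y
resid = y - y_hat

# Residuals are orthogonal to the predictors and to the fitted values
assert np.allclose(Z.T @ resid, 0)
assert np.isclose(y_hat @ resid, 0)

# Mean-corrected sum of squares decomposition: SST = SSR + SSE
ybar = y.mean()
SST = np.sum((y - ybar) ** 2)
SSR = np.sum((y_hat - ybar) ** 2)
SSE = resid @ resid
assert np.isclose(SST, SSR + SSE)

R2 = SSR / SST                                  # coefficient of determination
s = np.sqrt(SSE / (n - r - 1))                  # residual standard deviation
```

No distributional assumptions are used anywhere in this block, which is the point of the "just geometry" remark.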
5.1.2 Means and Covariances of the Estimators β̂, ε̂ and s²
If ε is a zero-mean rv with covariance σ²I, make sure you can show

E(β̂) = β    E(ε̂) = 0    E(s²) = σ²

Cov(β̂) = σ²(Z′Z)⁻¹    Cov(ε̂) = σ²(I - H)    Cov(β̂, ε̂) = 0
5.1.3 Confidence Regions and Intervals
If ε ∼ Nn(0, σ²I) it follows that

β̂ ∼ N_{r+1}[β, σ²(Z′Z)⁻¹],

and

s² ∼ [σ²/(n - r - 1)] χ²_{n-r-1}
Using Cov(β̂, ε̂) = 0 (so under normality β̂ and ε̂, and hence β̂ and s², are independent) it follows that

s⁻²(β̂ - β)′Z′Z(β̂ - β) ∼ (r + 1)F_{r+1,n-r-1}
This allows 100(1 - α)% confidence ellipsoids for β to be constructed in the usual way, i.e.,

(β̂ - β)′Z′Z(β̂ - β) < s²(r + 1)F_{r+1,n-r-1}(α)    (5.1)
Simultaneous and Bonferroni intervals can be readily constructed in the same way as before.
Suppose we want to predict the mean response when the predictors are given by z0, i.e., we want to predict z′0β. A 100(1 - α)% confidence interval for z′0β is given by

z′0β̂ ± t_{n-r-1}(α/2) s √( z′0(Z′Z)⁻¹z0 )
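This interval takes only a few lines to compute. A hedged Python/SciPy sketch (the function name and simulated data are mine):

```python
import numpy as np
from scipy import stats

def mean_response_ci(Z, y, z0, alpha=0.05):
    """100(1 - alpha)% confidence interval for z0' beta in y = Z beta + eps."""
    n, k = Z.shape                                  # k = r + 1 (with intercept)
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
    resid = y - Z @ beta_hat
    s = np.sqrt(resid @ resid / (n - k))
    half = (stats.t.ppf(1 - alpha / 2, n - k) * s
            * np.sqrt(z0 @ np.linalg.solve(Z.T @ Z, z0)))
    centre = z0 @ beta_hat
    return centre - half, centre + half

rng = np.random.default_rng(7)
n = 30
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
y = Z @ np.array([2.0, 1.0]) + 0.1 * rng.normal(size=n)
lo, hi = mean_response_ci(Z, y, z0=np.array([1.0, 0.5]))  # true z0'beta = 2.5
```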
Suppose we want to find a confidence region for (r - q) linear combinations of β of the form Cβ, where C is an (r - q) × (r + 1) matrix of constants3. Assuming ε ∼ Nn(0, σ²I),

Cβ̂ ∼ N_{r-q}(Cβ, σ²C(Z′Z)⁻¹C′)

Using the independence of β̂ and ε̂ it follows

(β̂ - β)′C′[C(Z′Z)⁻¹C′]⁻¹C(β̂ - β) ∼ s²(r - q)F_{r-q,n-r-1}

This leads to 100(1 - α)% confidence regions for Cβ.
5.1.4 Tests Involving Linear Combinations of the βi
The Likelihood Ratio test4 of

H0: Cβ = 0
H1: Cβ ≠ 0

is equivalent5 to checking whether the above confidence region includes 0, i.e., rejecting H0 if

β̂′C′[C(Z′Z)⁻¹C′]⁻¹Cβ̂ > s²(r - q)F_{r-q,n-r-1}(α)
β0 β1 . . . βq | βq+1 . . . βr

To test H0: βq+1 = βq+2 = . . . = βr = 0 take C = [0_{(r-q)×(q+1)}  I_{(r-q)×(r-q)}]. With a bit of linear algebra, the test reduces to rejecting H0 if
(SSEq - SSEr) / [(r - q)s²] > F_{r-q,n-r-1}(α)
where SSEq and SSEr are the sums of squares of errors based on the first q and all r predictors respectively6.
3 See JW p. 375.
4 See pp. 219-220 of JW.
5 See p. 376 of JW.
6 For details see pp. 374-376 of JW.
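The extra sum of squares test is easy to code. A Python/SciPy sketch (the simulated example makes predictor 2 strongly relevant, so the test should reject; the function name is mine):

```python
import numpy as np
from scipy import stats

def extra_ss_test(Z, y, q, alpha=0.05):
    """Test H0: beta_{q+1} = ... = beta_r = 0 by comparing the reduced
    model (intercept + first q predictors) against the full model."""
    n, k = Z.shape
    r = k - 1
    def sse(Zm):
        b = np.linalg.lstsq(Zm, y, rcond=None)[0]
        e = y - Zm @ b
        return e @ e
    SSE_r = sse(Z)                    # full model, all r predictors
    SSE_q = sse(Z[:, : q + 1])        # reduced model, first q predictors
    s2 = SSE_r / (n - r - 1)
    F = (SSE_q - SSE_r) / ((r - q) * s2)
    return F, F > stats.f.ppf(1 - alpha, r - q, n - r - 1)

rng = np.random.default_rng(3)
n = 60
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = Z @ np.array([1.0, 2.0, 3.0, 0.0]) + 0.3 * rng.normal(size=n)
F, reject = extra_ss_test(Z, y, q=1)  # jointly test predictors 2 and 3
```

Because the models are nested, SSEq ≥ SSEr and the statistic is always non-negative.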
5.2 Multivariate Regression
Generalize the multiple regression model to m response variables:
Y = Zβ + ε
Y is an n × m matrix of observations on m dependent variables, Z is the n × (r + 1) design matrix as before, and β is an (r + 1) × m matrix of regression coefficients. The zero-mean n × m matrix
ε = [ε1 | ε2 | . . . | εm]

is a random matrix of errors with covariances defined by

Cov(εi, εk) = σik I_{n×n}

Note the covariance of elements of ε from different rows is zero, and the covariance matrix of any row of ε is Σ = {σik}.
One can think of the multivariate model as m multiple regression models, one for each column of Y. The only connection between these multiple regression models is through the covariances of the εij from the same row.
5.2.1 Least Squares Estimation of β and Σ
Let Y denote the n × m matrix of n multivariate observations made on the m dependent variables, and assume Y is known. The B that minimizes the sum of squared errors across all observations, tr[(Y - ZB)′(Y - ZB)], is

β̂ = (Z′Z)⁻¹Z′Y
These are just the regression coefficients from the m multiple regressions, stacked side by side.
The n × m matrix of predicted or fitted response variables is

Ŷ = Zβ̂ = Z(Z′Z)⁻¹Z′Y = HY

where H is the hat matrix as before. The residuals are given by

ε̂ = Y - Ŷ = (I - H)Y
It is straightforward to show Z′ε̂ = 0, Ŷ′ε̂ = 0 and

Y′Y = Ŷ′Ŷ + ε̂′ε̂
This generalizes the sum of squares breakdown in multiple regression to a breakdown of covariance matrices.
QUESTION: Show the breakdown in sums of squares holds with column means removed. Divide by n - 1 and interpret.
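The m-regressions-side-by-side interpretation and the covariance breakdown can be checked numerically. A Python/NumPy sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, m = 50, 2, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
beta = rng.normal(size=(r + 1, m))
Y = Z @ beta + 0.2 * rng.normal(size=(n, m))

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)   # m regressions side by side
Y_hat = Z @ beta_hat
E = Y - Y_hat                                  # n x m residual matrix

# Orthogonality and the covariance breakdown, as in multiple regression
assert np.allclose(Z.T @ E, 0)
assert np.allclose(Y.T @ Y, Y_hat.T @ Y_hat + E.T @ E)

# Column-by-column equivalence with m separate multiple regressions
b0 = np.linalg.solve(Z.T @ Z, Z.T @ Y[:, 0])
assert np.allclose(beta_hat[:, 0], b0)
```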
5.2.2 Means and Covariances of the Estimators β̂, ε̂ and Σ̂
Let the ith column of β̂ be denoted by β̂i, the ith column of ε̂ by ε̂i, and the (i, k)th element of Σ by σik. Define the sample covariance matrix of the residuals by

Σ̂ = ε̂′ε̂ / (n - r - 1)

If ε is a zero-mean random matrix with uncorrelated rows that individually have covariance Σ, make sure you can show

E(β̂) = β    E(ε̂) = 0    E(Σ̂) = Σ
Cov(β̂i, β̂k) = σik(Z′Z)⁻¹    Cov(ε̂i, ε̂k) = σik(I - H)    Cov(β̂i, ε̂k) = 0
Note these results are straightforward generalizations of the corresponding results for multiple regression.
5.2.3 Tests for Subsets of Regression Parameters
Partition the matrix of regression parameters as follows7:

β = [ β1
      β2 ]

where β1 is (q + 1) × m and β2 is (r - q) × m.
β0 β1 . . . βq | βq+1 . . . βr
Suppose we want to test H0: β2 = 0. Thus q is the number of independent variables retained, and r - q is the number being tested.
H0 is more general than it appears because we can always reorder the independent variables. If q = 0 the null hypothesis is that all the independent variables have no effect (i.e., the regression model under H0 includes only the intercept).
To obtain a test statistic, fit the full model and calculate SSE = ε̂′ε̂. Then fit the model under the null hypothesis (i.e., set β2 = 0 before fitting). This gives residuals ε̂1 and sums of squares and cross products defined by SSE1 = ε̂′1ε̂1.

The quantity SSE1 - SSE is the extra sums of squares and cross products8. Clearly the "larger" SSE1 - SSE is compared to SSE, the more likely we are to reject the null hypothesis, because the extra predictor variables gave a big reduction in the sums of squares. (This is a generalization of the Likelihood Ratio approach discussed earlier.)

7 Read pp. 395-398 of JW.
8 See p. 396 of JW.
Several test statistics have been proposed:

Wilks' Lambda            |SSE|/|SSE1|               Likelihood Ratio test; a multiple of its log is ∼ χ² for large n (p. 396)
Pillai's trace           tr[(SSE1 - SSE) SSE1⁻¹]    (p. 398)
Hotelling-Lawley trace   tr[(SSE1 - SSE) SSE⁻¹]     (p. 398)
Tests based on Wilks' Lambda, Pillai's trace and the Hotelling-Lawley trace are nearly equivalent for large n. JW prefer Wilks' Lambda because it is based on the Likelihood Ratio test (see pp. 395-396, and pp. 219-220 for a general discussion of Likelihood Ratio tests).
5.2.4 Confidence Intervals and Regions for z′0β
The goal is to estimate E(Y ) when the predictor variables are
z′0 = [1 z01 z02 . . . z0r]
(Please read p399 to 401 of JW.)
Estimate z′0β by z′0β̂. Make sure you can show

E(z′0β̂) = z′0β    Cov(β̂′z0) = [z′0(Z′Z)⁻¹z0] Σ
If we assume the rows of ε are normally distributed with zero mean and covariance Σ, and the rows are uncorrelated, it follows that

β̂′z0 ∼ Nm[β′z0, (z′0(Z′Z)⁻¹z0)Σ]
The construction of confidence regions follows as before. For example, with 100(1 - α)% confidence,

(z′0β̂ - z′0β) Sp⁻¹ (z′0β̂ - z′0β)′ < [m(n - r - 1)/(n - r - m)] F_{m,n-r-m}(α)

where

Sp = [z′0(Z′Z)⁻¹z0] Σ̂,    Σ̂ = ε̂′ε̂/(n - r - 1)
Note setting m = 1 in this equation gives the equivalent results for multiple regression discussed in Section 7.5 of JW.
Simultaneous confidence intervals/regions for elements and linear combinations of z′0β follow in the usual way.
EXAMPLE: Antidepressant Drugs
Amitriptyline is prescribed as an antidepressant, but there are concerns over its effectiveness in relation to gender (z1, 1 for female) and amount of antidepressant taken (z2). The response variables are total TCAD plasma level (Y1) and amount of amitriptyline present in TCAD plasma level (Y2). Data were gathered on n = 17 patients admitted to hospital after amitriptyline overdose9. The data are plotted in Figure 1.3.

In class we will carry out a multiple (Y1 on z1 and z2) and a multivariate regression (Y1 and Y2 on z1 and z2) using Matlab.
EXAMPLE: National Track Records
The National Track records for men for n = 54 countries are listed in Table 8.6 of JW for p = 8 races. Time permitting, in class we will carry out a multiple regression of the sprint times on the long distance times using Matlab.
9 For more details see Exercise 7.25 of Johnson and Wichern.
Chapter 6
Relating Two Random Vectors
Assume the random vector X partitions as follows:

X = [X1; X2],    E(X) = [µ1; µ2],    Cov(X) = [Σ11 Σ12; Σ21 Σ22]

where X1 and X2 are of length p1 and p2 respectively. This chapter reviews some methods for quantifying, and simplifying, the relationship between X1 and X2. The first two approaches assume a symmetrical relationship between X1 and X2 (as in correlation); the remaining approaches treat one vector as independent and the other as dependent (as in regression).
The main focus of this chapter is high dimensional problems(p1, p2 large). I will illustrate with an analysis of global sea surfacetemperature and pressure data (monthly, gridded, 1948 to 2006).
6.1 Principal Component Analysis
PCA of Cov(X) requires careful weighting of the variables (to allow for different units and different numbers of elements in each vector). See the figure and discuss weaknesses.
Figure 6.1: First and second principal components of the coupled sea surface temperature and pressure data. For this data set both p and q exceed 10⁴.
6.2 Canonical Correlation Analysis
CCA maximizes the correlation between linear combinations of X1 and X2. The goal is to represent the high-dimensional correlation structure between X1 and X2 using a small number of canonical variates. The idea goes back to Hotelling (1935).
Define the transformed vector

X = [ U′Σ11^{-1/2} X1
      V′Σ22^{-1/2} X2 ]

It follows that

Cov(X) = [ I                             U′Σ11^{-1/2}Σ12Σ22^{-1/2}V
           V′Σ22^{-1/2}Σ21Σ11^{-1/2}U    I ]
If U and V are orthogonal matrices defined by the following SVD:

Σ11^{-1/2}Σ12Σ22^{-1/2} = ULV′,

where L is a diagonal matrix of non-negative singular values, then

Cov(X) = [ I   L
           L′  I ]
Note how the linear transformation of X1 and X2 by U′Σ11^{-1/2} and V′Σ22^{-1/2} has greatly simplified both the "within" and "between" correlation structure of X1 and X2.
The first elements of U′Σ11^{-1/2}X1 and V′Σ22^{-1/2}X2 are called the first pair of canonical variates. It can be shown they have the highest correlation amongst all pairs of linear combinations of X1 and X2. The second pair of canonical variates are the linear combinations with the highest correlation, subject to the constraint that they are uncorrelated with the first pair (and so on for the higher order pairs).
On the positive side, canonical variates are straightforward to calculate, and they are invariant with respect to linear transformations. On the negative side, they can be difficult to interpret. Part of the reason is that although the variates are associated with high correlations between X1 and X2, they may not provide a good description of the total variance of X1 or X2. This is evident in the following example. (Discuss the problem with large p1 and p2.)
EXAMPLE: Consider the canonical correlation analysis of

Σ = [ σ² 0 σ² 0
      0  1 0  0
      σ² 0 σ² 0
      0  0 0  1 ]
Note the first elements of X1 and X2 are perfectly correlated, but these components account for a vanishingly small proportion of their total variance as σ tends to zero.
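A Python/NumPy sketch of CCA via this SVD, applied to the example Σ above with σ² = 0.01 (the helper function names are mine):

```python
import numpy as np

def cca_from_cov(S11, S12, S22):
    """Canonical correlations L and weight matrices (A, B) such that the
    canonical variates are A' X1 and B' X2, from the SVD
    S11^{-1/2} S12 S22^{-1/2} = U L V'."""
    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)           # S symmetric positive definite
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    S11h, S22h = inv_sqrt(S11), inv_sqrt(S22)
    U, L, Vt = np.linalg.svd(S11h @ S12 @ S22h)
    return L, S11h @ U, S22h @ Vt.T

# The example covariance from the notes, with sigma^2 = 0.01
s2 = 0.01
S11 = np.array([[s2, 0.0], [0.0, 1.0]])
S22 = np.array([[s2, 0.0], [0.0, 1.0]])
S12 = np.array([[s2, 0.0], [0.0, 0.0]])
L, A, B = cca_from_cov(S11, S12, S22)
# First canonical correlation is 1 even though that variate explains
# almost none of the total variance of X1 or X2
```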
For more details on Canonical Correlation Analysis, including estimation and sampling distributions, see Chapter 10 of JW. Given their limited usefulness in practical applications we will not discuss them further.
6.3 Multivariate Linear Regression Using Random Vectors
Let X1 and X2 denote two zero-mean random vectors. Treat X1 as the response (dependent) and X2 as the predictor (independent) in the following linear model:

X̂1 = BX2
The value of B that minimizes the trace of Cov(X1 - BX2) is

B = Σ12Σ22⁻¹
and the associated prediction error is

R1 = X1 - X̂1 = X1 - Σ12Σ22⁻¹X2
QUESTION: Show Cov(R1,X2) = 0.
The variance of the response partitions into a part (X̂1) related to the predictor (X2) and a part (R1) uncorrelated with X2:

Σ11 = Σ12Σ22⁻¹Σ21 + ΣRR

where

ΣRR = Σ11 - Σ12Σ22⁻¹Σ21 =: Σ11·2

The covariance matrix of the residuals, ΣRR, contains information on the variability in X1 uncorrelated with X2. The correlations derived from Σ11·2 are called partial correlations.
1. Note the above results are consistent with X1|X2 ∼ N(µ1 + Σ12Σ22⁻¹(x2 - µ2), Σ11 - Σ12Σ22⁻¹Σ21) under the assumption of normality (see p. 56 of these notes).
2. The results are also consistent with multivariate regression. To see this, assume the columns of the Y and Z matrices have zero means; there is then no need for a column of ones in Z. The estimated regression coefficient matrix is given by

β̂ = (Z′Z)⁻¹Z′Y = Szz⁻¹Szy

Note the similarity to B. (The reason for the transpose is that the Y and Z matrices have the observations on the dependent and independent variables as row vectors.)
For more details on multivariate regression based on random independent variables, and its relationship to the classical form based on a fixed design matrix, read Sections 7.8 and 7.9 of JW.
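The zero-correlation property of the residual R1 is easy to verify numerically. A Python/NumPy sketch with simulated data (the mixing matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated sample
X1, X2 = X[:, :2], X[:, 2:]                              # p1 = 2, p2 = 3

S = np.cov(X, rowvar=False)
S12, S22 = S[:2, 2:], S[2:, 2:]

B = S12 @ np.linalg.inv(S22)              # sample version of Sigma12 Sigma22^{-1}
R1 = (X1 - X1.mean(0)) - (X2 - X2.mean(0)) @ B.T

# Sample covariance between residual and predictor is (numerically) zero
C = np.cov(np.column_stack([R1, X2]), rowvar=False)[:2, 2:]
assert np.allclose(C, 0, atol=1e-7)
```

The identity Cov(R1, X2) = S12 - B S22 = 0 holds exactly in the sample, not just in expectation, which is why the assertion succeeds to machine precision.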
6.4 Multivariate Regression on Principal Components
A problem with multivariate regression when p2 > n is that S22 will be singular. We need to reduce the number of independent variables.
A simple way to reduce the number of independent variables, and thus obtain a lower dimensional representation of the regression model, is to replace X2 by its first r principal components:

Z = Ur′X2

where Σ22 = UΛU′ and Ur holds the first r columns of U. Regression of X1 on Z gives

X̂1^(r) = Σ1Z Λr⁻¹ Z
The problem with regression on principal components is that the reduced subspace for X2 (i.e., the space spanned by the columns of Ur) was chosen to describe the variance amongst the elements of X2 and does not take into account what we actually want to simplify: the relationship between X1 and X2.
EXAMPLE: Consider the regression of X1 on the first principal component of X2 given the following covariance matrix and σ < 1:

Σ = [ σ² 0 σ² 0
      0  1 0  0
      σ² 0 σ² 0
      0  0 0  1 ]
This example clearly shows that regressing on principal components leads to the real danger of "throwing out the baby with the bath water".
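The pitfall can be seen directly by computing the leading PC of X2 from the Σ above. A Python/NumPy sketch with σ² = 0.25:

```python
import numpy as np

s2 = 0.25   # sigma^2 < 1
Sigma22 = np.array([[s2, 0.0], [0.0, 1.0]])    # covariance of X2
Sigma12 = np.array([[s2, 0.0], [0.0, 0.0]])    # cross-covariance with X1

# First principal component of X2: eigenvector with the largest eigenvalue
w, U = np.linalg.eigh(Sigma22)
u1 = U[:, np.argmax(w)]          # = (0, 1) since 1 > sigma^2

# Covariance between X1 and the leading PC score Z = u1' X2
Sigma_1Z = Sigma12 @ u1          # = (0, 0): the PC carries none of the signal
assert np.allclose(Sigma_1Z, 0)
```

The leading PC is the second element of X2 (variance 1), which is uncorrelated with X1; all the predictable signal sits in the discarded, low-variance direction.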
6.5 Redundancy Analysis
RA1 is based on PCA of the linearly predictable part of X1:

X̂1 = Σ12Σ22⁻¹X2

Approximate X̂1 using the first r principal components:

X̂1^(r) = UrUr′X̂1

where the columns of Ur are the first r eigenvectors of Cov(X̂1):

Σ12Σ22⁻¹Σ21 = UΛU′
It follows that X̂1^(r) can be written

X̂1^(r) = UrUr′Σ12Σ22⁻¹X2
        = UrΛr^{1/2}(Λr^{-1/2}Ur′Σ12Σ22⁻¹)X2
        = UrΛr^{1/2}Br′X2

where

Br := Σ22⁻¹Σ21UrΛr^{-1/2}
The columns of Br are the r loading vectors for the predictors. In general the columns of Br are not orthogonal.2
QUESTION: Define an r × 1 vector of "amplitudes" by α := Br′X2, so that X̂1^(r) = UrΛr^{1/2}α. Show Cov(α) = I.
1 See Von Storch and Zwiers, Statistical Analysis in Climate Research, pp. 327-331.
2 This leads to the use of "adjoint patterns" (columns of (B⁻¹)′) to describe the patterns of X2 related to the corresponding patterns in X1 (i.e., the columns of U). See Von Storch and Zwiers for details. I will discuss in class, time permitting.
The proportion of the total variance of X1 accounted for by the first r modes is given by the Redundancy Index:

R²(r) = tr Cov(X̂1^(r)) / tr Cov(X1) = Σ_{i=1}^r λi^(12) / Σ_{i=1}^{p1} λi^(11)

where λi^(12) and λi^(11) are the ith eigenvalues of Cov(X̂1) and Cov(X1) respectively.
Figure 6.2: First mode from a Redundancy Analysis of the coupled sea surface temperature andpressure data.
RA is attractive because it explains as much as possible of the predictable variance of the response using the smallest number of predictors. A weakness is that the columns of Br may be difficult to interpret, and bad for prediction, if Σ22 is poorly conditioned.
6.6 Partial Least Squares and Regression
PLS is really a class of methods for simplifying the relationship between sets of variables. It is technically challenging (and complicated). The following discussion just gives the flavour of the approach (and is based on the SVD, thus PLS-SVD).
First perform an SVD of Cov(X1, X2):

Σ12 = ULV′

Next approximate3 Σ12 by a lower rank (r) matrix:

Σ12 ≈ UrLrVr′
Note that Σ12 is approximated with increasing accuracy as r increases. This type of expansion is at the heart of PLS. (A similar approach is used to detect coupled modes of the atmosphere-ocean system, but it is not referred to as PLS.)
The above approximation of Σ12 is optimal, amongst rank r matrices, in terms of minimizing the sum of squared elements of Σ12 - UrLrVr′. It is also possible to show that the first elements of U′X1 and V′X2 have the highest squared covariance amongst all pairs of (normalized) linear combinations of X1 and X2 (and so on for the higher order pairs). The bottom line is that the SVD provides an elegant way of approximating the covariance between X1 and X2.
3See earlier review of SVD.
Consider now the following transformed variables, sometimes called latent variables, and their covariance:

α = [ U′X1
      V′X2 ]    Cov(α) = [ U′Σ11U   L
                           L′       V′Σ22V ]

Note how the transformation has simplified the covariance between X1 and X2.
We can use the latent variables to predict X1 from X2. For example, if we regress X1 on α2 := Vr′X2 we obtain

X̂1^(r) = Σ12Vr(Vr′Σ22Vr)⁻¹Vr′X2
        = UrLr(Vr′Σ22Vr)⁻¹Vr′X2
One attraction of this approach is that we don't need to invert Σ22. PLS is sometimes referred to as a "robust form of redundancy analysis" because it is biased towards more stable directions in predictor (X2) space. It is promising for many applications.
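A Python/NumPy sketch of the PLS-SVD predictor coefficients (the function name is mine). Note that at full rank r = p2 the predictor collapses to the multivariate regression coefficients Σ12Σ22⁻¹, since no information is discarded:

```python
import numpy as np

def pls_svd_predictor(S12, S22, r):
    """Rank-r PLS-SVD coefficient matrix for predicting X1 from X2:
    X1_hat = Ur Lr (Vr' S22 Vr)^{-1} Vr' X2, where S12 = U L V'."""
    U, L, Vt = np.linalg.svd(S12)
    Ur, Lr, Vr = U[:, :r], np.diag(L[:r]), Vt[:r].T
    return Ur @ Lr @ np.linalg.solve(Vr.T @ S22 @ Vr, Vr.T)

rng = np.random.default_rng(6)
A = rng.normal(size=(3, 3))
S22 = A @ A.T + 3.0 * np.eye(3)           # an arbitrary SPD covariance
S12 = rng.normal(size=(3, 3))

B1 = pls_svd_predictor(S12, S22, r=1)     # rank-1 predictor coefficients
B_full = pls_svd_predictor(S12, S22, r=3)

# Full-rank PLS reproduces the multivariate regression coefficients
assert np.allclose(B_full, S12 @ np.linalg.inv(S22))
```

Only the r × r matrix Vr′Σ22Vr is inverted, which is the source of the method's stability when Σ22 is poorly conditioned.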
Technique                 Predictor for X1 in terms of X2           Definition
Multivariate Regression   Σ12Σ22⁻¹X2
Regression on PC          Σ12UrΛr⁻¹Ur′X2                            UΛU′ = Σ22
Redundancy Analysis       UrΛr^{1/2}(Λr^{-1/2}Ur′Σ12Σ22⁻¹)X2        UΛU′ = Σ12Σ22⁻¹Σ21
Partial Least Squares     UrLr(Vr′Σ22Vr)⁻¹Vr′X2                     ULV′ = Σ12
Table 6.1: Summary of techniques for predicting X1 in terms of X2 using low dimensional representations.