1. Introduction to multivariate data
1.1 Books
Chatfield, C. and A. J. Collins, Introduction to Multivariate Analysis. Chapman & Hall.
Krzanowski, W. J., Principles of Multivariate Analysis. Oxford, 2000.
Johnson, R. A. and D. W. Wichern, Applied Multivariate Statistical Analysis. Prentice Hall.
1.2 Applications
The need often arises in science, medicine and social science (business, management) to analyse data on $p$ variables (note that $p = 2$ means the data are bivariate).
Suppose we have a simple random sample of size $n$. The sample consists of $n$ vectors of measurements on $p$ variates, i.e. $n$ $p$-vectors (by convention column vectors) $x_1, \dots, x_n$, which are inserted as rows $x_1^T, \dots, x_n^T$ into an $(n \times p)$ data matrix $X$. When $p = 2$ we can plot the rows in 2-dimensional space, but in higher dimensions, $p > 2$, other techniques are needed.
Example 1
Classification of plants (taxonomy)
Variables: ($p = 3$) leaf size ($x_1$), colour of flower ($x_2$), height of plant ($x_3$)
Sample items: $n = 4$ plants from a single species
Aims of analysis:
i) understand within-species variability
ii) classify a new plant species
The data matrix may appear as follows:

                 Variables
             x1     x2    x3
Plants   1   6.1    2     12
(Items)  2   8.2    1      8
         3   5.3    0      9
         4   6.4    2     10
Example 2
Credit scoring
Variables: personal data held by the bank
Items: sample of good/bad customers
Aims of analysis:
i) predict potential defaulters (CRM)
ii) risk assessment for a new applicant
Example 3
Image processing, e.g. for quality control
Variables: "features" extracted from an image
Items: sampled from a production line
Aims of analysis:
i) quantify "normal" variability
ii) reject faulty (off-specification) batches
1.3 Sample mean and covariance matrix
We shall adopt the following notation:
$x$ ($p \times 1$): a random vector of observations on $p$ variables.
$X$ ($n \times p$): a data matrix whose rows contain an independent random sample $x_1^T, \dots, x_n^T$ of observations on $x$.
$\bar{x}$ ($p \times 1$): sample mean vector, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
$S$ ($p \times p$): sample covariance matrix containing the sample covariances defined as
$$s_{jk} = \frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$
$R$ ($p \times p$): sample correlation matrix containing the sample correlations defined as
$$r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj} s_{kk}}} = \frac{s_{jk}}{s_j s_k}, \text{ say}$$
Notes
1. $\bar{x}_j$ is defined as the $j$th component of $\bar{x}$ (the mean of variable $j$)
2. the covariance matrix $S$ is square, symmetric ($S = S^T$), and holds the sample variances $s_{jj} = s_j^2 = \frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$ along its main diagonal
3. the diagonal elements of $R$ are $r_{jj} = 1$, and $-1 \le r_{jk} \le 1$ for each $j, k$
1.4 Matrix-vector representations
Given an $(n \times p)$ data matrix $X$, define the $n$-vector of ones
$$\mathbf{1} = (1, 1, \dots, 1)^T$$
The column sums of $X$ (i.e. the sums over the $n$ rows, one for each variable) are obtained by pre-multiplying $X$ by $\mathbf{1}^T$:
$$\mathbf{1}^T X = \left(\sum_{i=1}^{n} x_{i1}, \dots, \sum_{i=1}^{n} x_{ip}\right) = (n\bar{x}_1, \dots, n\bar{x}_p) = n\bar{x}^T$$
Hence
$$\bar{x} = \frac{1}{n} X^T \mathbf{1} \qquad (1.1)$$
The centred data matrix $X'$ is derived from $X$ by subtracting the variable mean from each element of $X$, i.e. $x'_{ij} = x_{ij} - \bar{x}_j$, or, equivalently, by subtracting the constant vector $\bar{x}^T$ from each row of $X$:
$$X' = X - \mathbf{1}\bar{x}^T = X - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T X = \left(I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T\right) X = HX \qquad (1.2)$$
where $H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T$ is known as the centring matrix. We now define the sample covariance matrix as $\tfrac{1}{n}$ times the centred sum of squares and products (SSP) matrix:
$$S = \tfrac{1}{n} X'^T X' \qquad (1.3a)$$
$$\;\; = \tfrac{1}{n}\sum_{i=1}^{n} x'_i x'^T_i \qquad (1.3b)$$
where $x'_i = x_i - \bar{x}$ denotes the $i$th mean-corrected data point.
For any real $p$-vector $y$ we then have
$$y^T S y = \tfrac{1}{n} y^T X'^T X' y = \tfrac{1}{n} z^T z \quad \text{where } z = X'y$$
$$\;\; = \tfrac{1}{n}\|z\|^2 \ge 0$$
Hence, from the definition of a p.s.d. matrix, we have
Proposition 1
The sample covariance matrix $S$ is positive semi-definite (p.s.d.).
Example
Two measurements $x_1, x_2$ made at the same position on each of 3 cans of food resulted in the following $X$-matrix:
$$X = \begin{pmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{pmatrix}$$
Find the sample mean vector $\bar{x}$ and covariance matrix $S$.
Solution
$$X = \begin{pmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{pmatrix} = (x_1, x_2, x_3)^T$$
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{3} x_i = \frac{1}{3}\left[\begin{pmatrix}4\\1\end{pmatrix} + \begin{pmatrix}-1\\3\end{pmatrix} + \begin{pmatrix}3\\5\end{pmatrix}\right] = \begin{pmatrix}2\\3\end{pmatrix}$$
$$X' = \begin{pmatrix} 2 & -2 \\ -3 & 0 \\ 1 & 2 \end{pmatrix}$$
$$S = \frac{1}{3} X'^T X' = \begin{pmatrix} \tfrac{14}{3} & -\tfrac{2}{3} \\ -\tfrac{2}{3} & \tfrac{8}{3} \end{pmatrix} = \begin{pmatrix} 4.67 & -0.67 \\ -0.67 & 2.67 \end{pmatrix}$$
Note also that $S$ is built up from the individual data points:
$$S = \frac{1}{3}\left[\begin{pmatrix}2\\-2\end{pmatrix}\begin{pmatrix}2 & -2\end{pmatrix} + \begin{pmatrix}-3\\0\end{pmatrix}\begin{pmatrix}-3 & 0\end{pmatrix} + \begin{pmatrix}1\\2\end{pmatrix}\begin{pmatrix}1 & 2\end{pmatrix}\right]$$
and
$$R = \begin{pmatrix} 1 & -0.189 \\ -0.189 & 1 \end{pmatrix}$$

1.5 Measures of multivariate scatter
It is useful to have a single number as a measure of spread in the data. Based on $S$ we define two scalar quantities.
The total variation is
$$\mathrm{tr}(S) = \mathrm{trace}(S) = \sum_{j=1}^{p} s_{jj} = \text{sum of diagonal elements} = \text{sum of eigenvalues of } S$$
The generalized variance is
$$|S| = \text{product of eigenvalues of } S \qquad (1.5)$$
In the above example
$$\mathrm{tr}(S) = \tfrac{14}{3} + \tfrac{8}{3} = 7.33$$
$$|S| = \tfrac{14}{3}\cdot\tfrac{8}{3} - \left(-\tfrac{2}{3}\right)^2 = 12$$
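The quantities in this example can be checked with a short numpy sketch (illustrative only; variable names are arbitrary):

```python
import numpy as np

# Data matrix for the cans-of-food example (n = 3 items, p = 2 variables)
X = np.array([[4.0, 1.0],
              [-1.0, 3.0],
              [3.0, 5.0]])
n = X.shape[0]

xbar = X.mean(axis=0)                    # sample mean vector (2, 3)
Xc = X - xbar                            # centred data matrix X'
S = Xc.T @ Xc / n                        # covariance matrix with divisor n
R = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))  # correlation matrix

print(xbar)              # [2. 3.]
print(S)                 # [[ 4.67 -0.67] [-0.67  2.67]]
print(R)                 # off-diagonal approx -0.189
print(np.trace(S))       # total variation approx 7.33
print(np.linalg.det(S))  # generalized variance approx 12
```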
1.6 Random vectors
We will in this course generally regard the data as an independent random sample from some continuous population distribution with a probability density function
$$f(x) = f(x_1, \dots, x_p) \qquad (1.6)$$
Here $x = (x_1, \dots, x_p)$ is regarded as a vector of $p$ random variables. Independence here refers to the rows of the data matrix. If two of the variables (columns) are, for example, height and weight of individuals (rows), then knowing one individual's weight says nothing about any other individual. However, the height and weight for any individual are correlated.
For any region $D$ in the $p$-space of the variables,
$$\Pr(x \in D) = \int_D f(x)\, dx$$
Mean vector
For any $j$ the population mean of $x_j$ is given by the $p$-fold integral
$$E(x_j) = \mu_j = \int x_j f(x)\, dx$$
where the region of integration is $\mathbb{R}^p$.
In vector form
$$\mu = E(x) = \begin{pmatrix} E(x_1) \\ \vdots \\ E(x_p) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} \qquad (1.7)$$
Covariance matrix
The covariance between $x_j, x_k$ is defined as
$$\sigma_{jk} = \mathrm{Cov}(x_j, x_k) = E\left[(x_j - \mu_j)(x_k - \mu_k)\right] = E[x_j x_k] - \mu_j \mu_k$$
When $j = k$ we obtain the variance of $x_j$:
$$\sigma_{jj} = E\left[(x_j - \mu_j)^2\right]$$
The covariance matrix is a $p \times p$ matrix
$$\Sigma = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}$$
The alternative notations $V(x) = \mathrm{Cov}(x) = \Sigma$ are used.
In matrix form
$$\Sigma = E\left[(x - \mu)(x - \mu)^T\right] \qquad (1.8a)$$
$$\;\; = E\left[x x^T\right] - \mu\mu^T \qquad (1.8b)$$
More generally, we define the covariance between two random vectors $x$ ($p \times 1$) and $y$ ($q \times 1$) as the $(p \times q)$ matrix
$$\mathrm{Cov}(x, y) = E\left[(x - \mu_x)(y - \mu_y)^T\right] \qquad (1.9)$$
Important property of $\Sigma$
$\Sigma$ is a positive semi-definite matrix.
Proof
Let $a$ ($p \times 1$) be a constant vector; then
$$E(a^T x) = a^T E(x) = a^T\mu$$
and
$$V(a^T x) = E\left[(a^T x - a^T\mu)^2\right] = a^T E\left[(x - \mu)(x - \mu)^T\right] a = a^T \Sigma a$$
Since a variance is always a non-negative quantity we find $a^T \Sigma a \ge 0$. From the definition (see handout), $\Sigma$ is a positive semi-definite (p.s.d.) matrix.
Suppose we have an independent random sample $x_1, x_2, \dots, x_n$ from a distribution with mean $\mu$ and covariance matrix $\Sigma$. What is the relation between (a) the sample and population means, (b) the sample and population covariance matrices?
Result 1
We first establish the mean and covariance of the sample mean $\bar{x}$:
$$E(\bar{x}) = \mu \qquad (1.10a)$$
$$V(\bar{x}) = \tfrac{1}{n}\Sigma \qquad (1.10b)$$
Proof
$$E(\bar{x}) = \frac{1}{n} E\left(\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \mu$$
$$V(\bar{x}) = \mathrm{Cov}\left(\frac{1}{n}\sum_{i=1}^{n} x_i,\ \frac{1}{n}\sum_{j=1}^{n} x_j\right) = \frac{1}{n^2}\cdot n\Sigma$$
noting that $\mathrm{Cov}(x_i, x_i) = \Sigma$ and $\mathrm{Cov}(x_i, x_j) = 0$ for $i \ne j$. Hence
$$V(\bar{x}) = \frac{1}{n}\Sigma$$
Result 2
We now examine $S$ and derive an unbiased estimator for $\Sigma$:
$$E(S) = \frac{n-1}{n}\Sigma \qquad (1.11)$$
Proof
$$S = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T - \bar{x}\bar{x}^T$$
since $\frac{1}{n}\sum_{i=1}^{n} x_i\bar{x}^T = \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\bar{x}^T = \bar{x}\bar{x}^T$.
From (1.8b) and (1.10b) we see that
$$E(x_i x_i^T) = \Sigma + \mu\mu^T$$
$$E(\bar{x}\bar{x}^T) = \tfrac{1}{n}\Sigma + \mu\mu^T$$
hence
$$E(S) = \Sigma + \mu\mu^T - \left(\tfrac{1}{n}\Sigma + \mu\mu^T\right) = \frac{n-1}{n}\Sigma$$
Therefore an unbiased estimate of $\Sigma$ is
$$S_u = \frac{n}{n-1} S = \frac{1}{n-1} X'^T X' \qquad (1.12)$$
1.7 Linear transformations
Let $x = (x_1, \dots, x_p)^T$ be a random $p$-vector. It is often natural and useful to consider linear combinations of the components of $x$, such as for example $y_1 = x_1 + x_2$ or $y_2 = x_1 + 2x_3 - x_4$. In general we consider a transformation from the $p$-component vector $x$ to a $q$-component vector $y$ ($q < p$) given by
$$y = Ax + b \qquad (1.13)$$
where $A$ ($q \times p$) is a constant matrix and $b$ ($q \times 1$) a constant vector.
Suppose that $E(x) = \mu$ and $V(x) = \Sigma$; the corresponding expressions for $y$ are
$$E(y) = A\mu + b \qquad (1.14a)$$
$$V(y) = A\Sigma A^T \qquad (1.14b)$$
These follow from the linearity of the expectation operator:
$$E(y) = E(Ax + b) = AE(x) + E(b) = A\mu + b = \mu_y, \text{ say}$$
and
$$V(y) = E(yy^T) - \mu_y\mu_y^T$$
$$= E\left[(Ax + b)(Ax + b)^T\right] - (A\mu + b)(A\mu + b)^T$$
$$= AE(xx^T)A^T + AE(x)b^T + bE(x^T)A^T + bb^T - A\mu\mu^T A^T - A\mu b^T - b\mu^T A^T - bb^T$$
$$= A\left[E(xx^T) - \mu\mu^T\right]A^T = A\Sigma A^T \text{ as required}$$
1.8 The Mahalanobis transformation
Given a $p$-variate random variable $x$ with $E(x) = \mu$ and $V(x) = \Sigma$, a transformation to a standardized set of uncorrelated variates is given by the Mahalanobis transformation.
Suppose $\Sigma$ is positive definite, i.e. there is no exact linear dependence in $x$. Then the inverse covariance matrix $\Sigma^{-1}$ has a "square root" $\Sigma^{-\frac{1}{2}}$ given by
$$\Sigma^{-\frac{1}{2}} = V\Lambda^{-\frac{1}{2}}V^T \qquad (1.15)$$
where $\Sigma = V\Lambda V^T$ is the spectral decomposition (see handout), i.e. $V$ is an orthogonal matrix ($V^T V = VV^T = I_p$) whose columns are the eigenvectors of $\Sigma$, and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ holds the corresponding eigenvalues. The Mahalanobis transformation takes the form
$$z = \Sigma^{-\frac{1}{2}}(x - \mu) \qquad (1.16)$$
Using results (1.14a) and (1.14b) we can show that
$$E(z) = 0, \qquad V(z) = I_p$$
Proof
$$E(z) = E\left[\Sigma^{-\frac{1}{2}}(x - \mu)\right] = \Sigma^{-\frac{1}{2}}\left[E(x) - \mu\right] = 0$$
$$V(z) = \Sigma^{-\frac{1}{2}}\Sigma\Sigma^{-\frac{1}{2}} = I_p$$
1.8.1 Sample Mahalanobis transformation
Given a data matrix $X^T = (x_1, \dots, x_n)$, the sample Mahalanobis transformation $z_i = S^{-\frac{1}{2}}(x_i - \bar{x})$ for $i = 1, \dots, n$, where $S = S_x$ is the sample covariance matrix $\frac{1}{n-1}X^T H X$, creates a transformed data matrix $Z^T = (z_1, \dots, z_n)$. The data matrices are related by
$$Z^T = S^{-\frac{1}{2}} X^T H \quad \text{or} \quad Z = HXS^{-\frac{1}{2}} \qquad (1.17)$$
where $H$ is the centring matrix. We may easily show (Ex.) that $Z$ is centred and that $S_z = I_p$.
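A minimal numpy sketch of the sample Mahalanobis transformation, assuming the divisor $n-1$ used in the text and obtaining $S^{-1/2}$ from the spectral decomposition:

```python
import numpy as np

def mahalanobis_transform(X):
    """Return Z = H X S^{-1/2}, which has zero mean and identity covariance."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    S = (X.T @ H @ X) / (n - 1)                  # sample covariance (divisor n-1)
    lam, V = np.linalg.eigh(S)                   # spectral decomposition S = V diag(lam) V^T
    S_inv_sqrt = V @ np.diag(lam ** -0.5) @ V.T  # S^{-1/2}
    return H @ X @ S_inv_sqrt

X = np.random.default_rng(0).normal(size=(100, 3))
Z = mahalanobis_transform(X)
print(np.allclose(Z.mean(axis=0), 0))                    # True: Z is centred
print(np.allclose(np.cov(Z, rowvar=False), np.eye(3)))   # True: S_z = I_p
```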
1.8.2 Sample scaling transformation
A transformation of the data that scales each variable to have mean zero and variance one, but preserves the correlation structure, is given by $y_i = D^{-1}(x_i - \bar{x})$ for $i = 1, \dots, n$, where $D = \mathrm{diag}(s_1, \dots, s_p)$. Now
$$Y^T = D^{-1}X^T H \quad \text{or} \quad Y = HXD^{-1} \qquad (1.18)$$
Ex. Show that $S_y = R_x$.
1.8.3 A useful matrix identity
Let $u, v$ be $n$-vectors and form the $n \times n$ matrix $A = uv^T$. Then
$$|I + uv^T| = 1 + v^T u \qquad (1.19)$$
Proof
First observe that $A$ and $I + A$ share a common set of eigenvectors, since $Av = \lambda v \Rightarrow (I + A)v = (1 + \lambda)v$. Moreover, the eigenvalues of $I + A$ are $1 + \lambda_i$, where $\lambda_i$ are the eigenvalues of $A$.
Now $uv^T$ is a rank one matrix and therefore has a single nonzero eigenvalue (see handout). Since $(uv^T)u = u(v^T u) = \lambda u$ where $\lambda = v^T u$, the eigenvalues of $I + uv^T$ are $1 + \lambda, 1, \dots, 1$. The determinant of $I + uv^T$ is the product of the eigenvalues, hence the result.
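The identity is easy to check numerically, for example:

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.normal(size=5), rng.normal(size=5)
lhs = np.linalg.det(np.eye(5) + np.outer(u, v))  # |I + u v^T|
rhs = 1 + v @ u                                  # 1 + v^T u
print(np.isclose(lhs, rhs))                      # True
```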
2. Principal Components Analysis
2.1 Outline of technique
Let $x^T = (x_1, x_2, \dots, x_p)$ be a random vector with mean $\mu$ and covariance matrix $\Sigma$. PCA is a technique for dimensionality reduction from $p$ dimensions to $k < p$ dimensions. It tries to find, in order, the $k$ most informative linear combinations $y_1, y_2, \dots, y_k$ of the variables. Here "information" will be interpreted as a percentage of the total variation (as previously defined) in $\Sigma$. The $k$ sample PCs that "explain" $x\%$ of the total variation in a sample covariance matrix $S$ may be similarly defined.
2.2 Formulation
Let
$$y_1 = a_1^T x, \quad y_2 = a_2^T x, \quad \dots, \quad y_p = a_p^T x$$
where the $y_j = a_{1j}x_1 + a_{2j}x_2 + \dots + a_{pj}x_p$ are a sequence of standardized linear combinations (SLCs) of the $x$'s such that $a_j^T a_j = 1$ and $a_j^T a_k = 0$ for $j \ne k$, i.e. $a_1, a_2, \dots, a_p$ form an orthonormal set of $p$-vectors. Equivalently we may define $A$, the $p \times p$ matrix formed from the columns $\{a_j\}$, as an orthogonal matrix, so that $A^T A = AA^T = I_p$.
We choose $a_1$ to maximize
$$\mathrm{Var}(y_1) = a_1^T\Sigma a_1$$
subject to $a_1^T a_1 = 1$. Then we choose $a_2$ to maximize
$$\mathrm{Var}(y_2) = a_2^T\Sigma a_2$$
subject to $a_2^T a_2 = 1$ and $a_2^T a_1 = 0$, which ensures that $y_2$ will be uncorrelated with $y_1$. Subsequent PCs are chosen as the SLCs that have maximum variance subject to being uncorrelated with previous PCs.
NB. Sometimes the PCs are taken to be "mean-corrected" linear transformations of the $x$'s, i.e.
$$y_j = a_j^T(x - \mu)$$
emphasizing that the PCs can be considered as direction vectors in $p$-space, relative to the "centre" of the distribution, along which the spread is maximized. In any case $\mathrm{Var}(y_j)$ is the same whichever definition is used.
2.3 Computation
To find the first PC we use the Lagrange multiplier technique for finding the maximum of a function $f(x)$ subject to an equality constraint $g(x) = 0$. We define the Lagrangean function
$$L(a_1) = a_1^T\Sigma a_1 - \lambda(a_1^T a_1 - 1)$$
where $\lambda$ is a Lagrange multiplier.
Differentiating, we obtain
$$\frac{\partial L}{\partial a_1} = 2\Sigma a_1 - 2\lambda a_1 = 0$$
$$\Sigma a_1 = \lambda a_1$$
Therefore $a_1$ should be chosen to be an eigenvector of $\Sigma$ with eigenvalue $\lambda$. Suppose the eigenvalues of $\Sigma$ are distinct and ranked in decreasing order $\lambda_1 > \lambda_2 > \dots > \lambda_p > 0$. Then
$$\mathrm{Var}(y_1) = a_1^T\Sigma a_1 = \lambda a_1^T a_1 = \lambda$$
Therefore $a_1$ should be chosen as the eigenvector corresponding to the largest eigenvalue of $\Sigma$.
2nd PC
The Lagrangean is
$$L(a_2) = a_2^T\Sigma a_2 - \lambda(a_2^T a_2 - 1) - \delta(a_2^T a_1)$$
where $\lambda, \delta$ are Lagrange multipliers.
$$\frac{\partial L}{\partial a_2} = 2(\Sigma - \lambda I_p)a_2 - \delta a_1 = 0$$
Premultiplying by $a_1^T$ gives
$$2a_1^T\Sigma a_2 - \delta = 0$$
since $a_2^T a_1 = 0$. However
$$a_1^T\Sigma a_2 = a_2^T\Sigma a_1 = \lambda_1 a_2^T a_1 = 0$$
Therefore $\delta = 0$ and $\Sigma a_2 = \lambda a_2$, so $a_2$ is the eigenvector of $\Sigma$ corresponding to the second largest eigenvalue $\lambda_2$.
2.4 Example
The covariance matrix corresponding to scaled (standardized) variables $x_1, x_2$ is
$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$
(in fact a correlation matrix). Note $\Sigma$ has total variation $= 2$.
The eigenvalues of $\Sigma$ are the roots of $|\Sigma - \lambda I| = 0$:
$$\begin{vmatrix} 1-\lambda & \rho \\ \rho & 1-\lambda \end{vmatrix} = 0, \qquad (1-\lambda)^2 - \rho^2 = 0$$
Hence $\lambda = 1 + \rho,\ 1 - \rho$. If $\rho > 0$ then $\lambda_1 = 1 + \rho$, $\lambda_2 = 1 - \rho$. To find $a_1$ we substitute $\lambda_1$ into $\Sigma a_1 = \lambda a_1$. Note: this gives just one equation in terms of the components of $a_1^T = (a_{11}, a_{21})$:
$$-\rho a_{11} + \rho a_{21} = 0$$
so $a_{11} = a_{21}$. Applying the normalization
$$a_1^T a_1 = a_{11}^2 + a_{21}^2 = 1$$
we obtain
$$a_1 = \begin{pmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{pmatrix}, \qquad \text{and similarly} \qquad a_2 = \begin{pmatrix} \tfrac{1}{\sqrt{2}} \\ -\tfrac{1}{\sqrt{2}} \end{pmatrix}$$
so that
$$y_1 = \tfrac{1}{\sqrt{2}}(x_1 + x_2), \qquad y_2 = \tfrac{1}{\sqrt{2}}(x_1 - x_2)$$
are the PCs, explaining respectively $\frac{100(1+\rho)}{2}\%$ and $\frac{100(1-\rho)}{2}\%$ of the total variation. Notice that the PCs are independent of $\rho$, while the proportion of the total variation explained by each PC does depend on $\rho$.
2.5 PCA and spectral decomposition
Since $\Sigma$ (also $S$) is a real symmetric matrix, we know that it has the spectral decomposition (eigenanalysis)
$$\Sigma = A\Lambda A^T = \sum_{i=1}^{p} \lambda_i a_i a_i^T$$
where $\{a_i\}$ are the eigenvectors of $\Sigma$, which we have inserted as columns of the $(p \times p)$ matrix $A$, and $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ are the corresponding eigenvalues.
If some eigenvalues are not distinct, so $\lambda_k = \lambda_{k+1} = \dots = \lambda_l = \lambda$, the eigenvectors are not unique, but we may choose an orthonormal set of eigenvectors to span a subspace of dimension $l - k + 1$ (cf. the major/minor axes of an ellipse $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$ as $b \to a$). Such a situation arises with the equicorrelation matrix (see Class Exercise 1).
The transformation of a random $p$-vector $x$ (corrected for its mean $\mu$) to its set of principal components (PCs) contained in the $p$-vector $y$ is
$$y = A^T(x - \mu)$$
$y_1$ is the linear combination (SLC) of $x$ having maximum variance, $y_2$ is the SLC having maximum variance subject to being uncorrelated with $y_1$, etc. We have seen that $\mathrm{Var}(y_1) = \lambda_1$, $\mathrm{Var}(y_2) = \lambda_2, \dots$
2.6 Explanation of variance
The interpretation of the PCs ($y$) as components of variance "explaining" the total variation, i.e. the sum of the variances of the original variables ($x$), is clarified by the following result.
Result
The sum of the variances of the original variables and the sum of the variances of their PCs are the same.
Proof
A note on trace: the sum of diagonal elements of a $(p \times p)$ square matrix $\Sigma$ is known as the trace of $\Sigma$,
$$\mathrm{tr}(\Sigma) = \sum_{i=1}^{p}\sigma_{ii}$$
We show from this definition that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ whenever $AB$ and $BA$ are defined [i.e. $A$ is $(m \times n)$ and $B$ is $(n \times m)$]:
$$\mathrm{tr}(AB) = \sum_i\sum_j a_{ij}b_{ji} = \sum_j\sum_i b_{ji}a_{ij} = \mathrm{tr}(BA)$$
The sum of the variances of the PCs is
$$\sum_i \mathrm{Var}(y_i) = \sum_i \lambda_i = \mathrm{tr}(\Lambda)$$
Now $\Sigma = A\Lambda A^T$ is the spectral decomposition and $A$ is orthogonal, so $A^T A = I_p$, hence
$$\mathrm{tr}(\Sigma) = \mathrm{tr}(A\Lambda A^T) = \mathrm{tr}(\Lambda A^T A) = \mathrm{tr}(\Lambda)$$
Since $\Sigma$ is the covariance matrix of $x$, the sum of its diagonal elements is the sum of the variances $\sigma_{ii}$ of the original variables. Hence the result is proved. $\square$
Consequence (interpretation of PCs)
It is therefore possible to interpret
$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_p}$$
as the proportion of the total variation in the original data explained by the $i$th principal component, and
$$\frac{\lambda_1 + \dots + \lambda_k}{\lambda_1 + \lambda_2 + \dots + \lambda_p}$$
as the proportion of the total variation explained by the first $k$ PCs.
From a PCA on a $(10 \times 10)$ sample covariance matrix $S$, we could for example conclude that the first 3 PCs (out of a total of $p = 10$ PCs) account for 80% of the total variation in the data. This would mean that the variation in the data is largely confined to a 3-dimensional subspace described by the PCs $y_1, y_2, y_3$.
2.7 Scale invariance
This unfortunately is a property that PCA does not possess!
In practice we often have to choose units of measurement for our individual variables $\{x_i\}$, and the amount of the total variation accounted for by a particular variable $x_i$ depends on this choice (tonnes, kg or grams).
In a practical study the data vector $x$ often comprises physically incomparable quantities (e.g. height, weight, temperature), so there is no "natural scaling" to adopt. One possibility is to perform PCA on a correlation matrix (effectively choosing each variable to have unit sample variance), but this is still an implicit choice of scaling. The main point is that the results of a PCA depend on the scaling adopted.
2.8 Principal component scores
The sample PC transform on a data matrix $X$ takes the form, for the $r$th individual ($r$th row of the sample),
$$y'_r = A^T(x_r - \bar{x})$$
where the columns of $A$ are the eigenvectors of the sample covariance matrix $S$. Notice that the first component of $y'_r$ corresponds to the scalar product of the first column of $A$ with $x'_r$, etc.
The components of $y'_r$ are known as the (mean-corrected) principal component scores for the $r$th individual. The quantities
$$y_r = A^T x_r$$
are the raw PC scores for that individual. Geometrically, the PC scores are the coordinates of each data point with respect to new axes defined by the PCs, i.e. w.r.t. a rotated frame of reference. The scores can provide qualitative information about individuals.
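As an illustration, a sketch of computing the loadings and mean-corrected PC scores from a data matrix (assumptions: covariance taken with divisor $n-1$, eigenvalues sorted into decreasing order; numpy only):

```python
import numpy as np

def pc_scores(X):
    """Return (eigenvalues, loadings A, mean-corrected PC scores Y)."""
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)          # sample covariance matrix
    lam, A = np.linalg.eigh(S)           # spectral decomposition (ascending order)
    order = np.argsort(lam)[::-1]        # sort into decreasing order
    lam, A = lam[order], A[:, order]
    Y = (X - xbar) @ A                   # row r of Y is y'_r = A^T (x_r - xbar)
    return lam, A, Y

X = np.random.default_rng(2).normal(size=(50, 4))
lam, A, Y = pc_scores(X)
print(np.allclose(np.cov(Y, rowvar=False), np.diag(lam)))  # scores are uncorrelated
print(lam / lam.sum())                                     # proportions of variation explained
```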
2.9 Correlation of PCs with original variables
The correlations $\rho(x_i, y_k)$ of the $k$th PC with variable $x_i$ are an aid to interpreting the PCs.
Since $y = A^T(x - \mu)$ we have
$$\mathrm{Cov}(x, y) = E\left[(x - \mu)y^T\right] = E\left[(x - \mu)(x - \mu)^T A\right] = \Sigma A$$
and from the spectral decomposition
$$\Sigma A = (A\Lambda A^T)A = A\Lambda$$
Post-multiplying $A$ by a diagonal matrix $\Lambda$ has the effect of scaling its columns, so that
$$\mathrm{Cov}(x_i, y_k) = \lambda_k a_{ik}$$
is the covariance between the $i$th variable and the $k$th PC.
The correlation is
$$\rho(x_i, y_k) = \frac{\mathrm{Cov}(x_i, y_k)}{\sqrt{\mathrm{Var}(x_i)\mathrm{Var}(y_k)}} = \frac{\lambda_k a_{ik}}{\sqrt{\sigma_{ii}}\sqrt{\lambda_k}} = a_{ik}\left(\frac{\lambda_k}{\sigma_{ii}}\right)^{\frac{1}{2}}$$
and its square can be interpreted as the proportion of the variation in $x_i$ explained by the $k$th PC.
Exercise
Find the PCs of the covariance matrix
$$\Sigma = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}$$
and show that they account for amounts
$$\lambda_1 = 5.83, \quad \lambda_2 = 2.00, \quad \lambda_3 = 0.17$$
of the total variation in $\Sigma$.
Compute the correlations $\rho(x_i, y_k)$ and try to interpret the PCs qualitatively.
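A sketch of how the exercise can be checked numerically: the eigen-decomposition gives the stated eigenvalues directly, and the correlations follow from $\rho(x_i, y_k) = a_{ik}(\lambda_k/\sigma_{ii})^{1/2}$.

```python
import numpy as np

Sigma = np.array([[1.0, -2.0, 0.0],
                  [-2.0, 5.0, 0.0],
                  [0.0,  0.0, 2.0]])
lam, A = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]            # decreasing order
lam, A = lam[order], A[:, order]
print(lam)                               # approx [5.83, 2.00, 0.17]

# correlations rho(x_i, y_k) = a_ik * sqrt(lam_k / sigma_ii)
rho = A * np.sqrt(lam) / np.sqrt(np.diag(Sigma))[:, None]
print(rho)
```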
3. Multivariate Normal Distribution
The MVN distribution is a generalization of the univariate normal distribution, which has the density function (p.d.f.)
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \qquad -\infty < x < \infty$$
where $\mu$ = mean of the distribution and $\sigma^2$ = variance. In $p$ dimensions the density becomes
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left\{-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\} \qquad (3.1)$$
Within the mean vector $\mu$ there are $p$ (independent) parameters and within the symmetric covariance matrix $\Sigma$ there are $\tfrac{1}{2}p(p+1)$ independent parameters [$\tfrac{1}{2}p(p+3)$ independent parameters in total]. We use the notation
$$x \sim N_p(\mu, \Sigma) \qquad (3.2)$$
to denote a random vector $x$ having the MVN distribution with
$$E(x) = \mu, \qquad \mathrm{Cov}(x) = \Sigma$$
Note that MVN distributions are entirely characterized by the first and second moments of the distribution.
Basic properties
If $x$ ($p \times 1$) is MVN with mean $\mu$ and covariance matrix $\Sigma$:
• Any linear combination of $x$ is MVN. Let $y = Ax + c$ with $A$ ($q \times p$) and $c$ ($q \times 1$); then
$$y \sim N_q(\mu_y, \Sigma_y)$$
where $\mu_y = A\mu + c$ and $\Sigma_y = A\Sigma A^T$.
• Any subset of variables in $x$ has a MVN distribution.
• If a set of variables is uncorrelated, then they are independently distributed. In particular,
i) if $\sigma_{ij} = 0$ then $x_i, x_j$ are independent;
ii) if $x$ is MVN with covariance matrix $\Sigma$, then $Ax$ and $Bx$ are independent if and only if
$$\mathrm{Cov}(Ax, Bx) = A\Sigma B^T = 0 \qquad (3.3)$$
• Conditional distributions are MVN.
Result 1
For the MVN distribution, variables are uncorrelated $\Leftrightarrow$ variables are independent.
Proof
Let $x$ ($p \times 1$) be partitioned as
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
where $x_1$ has $q$ components and $x_2$ has $p - q$ components, with mean vector
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$$
and covariance matrix (partitioned conformably)
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$
i) Independent $\Rightarrow$ uncorrelated (always holds).
Suppose $x_1, x_2$ are independent. Then
$$\Sigma_{12} = \mathrm{Cov}(x_1, x_2) = E\left[(x_1 - \mu_1)(x_2 - \mu_2)^T\right]$$
factorizes into the product of $E[(x_1 - \mu_1)]$ and $E[(x_2 - \mu_2)^T]$, which are both zero since $E(x_1) = \mu_1$ and $E(x_2) = \mu_2$. Hence $\Sigma_{12} = 0$.
ii) Uncorrelated $\Rightarrow$ independent (for MVN).
This result depends on factorizing the p.d.f. (3.1) when $\Sigma_{12} = 0$.
In this case $(x - \mu)^T\Sigma^{-1}(x - \mu)$ has the partitioned form
$$\begin{pmatrix} x_1^T - \mu_1^T & x_2^T - \mu_2^T \end{pmatrix}\begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}^{-1}\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}$$
$$= \begin{pmatrix} x_1^T - \mu_1^T & x_2^T - \mu_2^T \end{pmatrix}\begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}$$
$$= (x_1 - \mu_1)^T\Sigma_{11}^{-1}(x_1 - \mu_1) + (x_2 - \mu_2)^T\Sigma_{22}^{-1}(x_2 - \mu_2)$$
so that $\exp\{-\tfrac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\}$ factorizes into the product of $\exp\{-\tfrac{1}{2}(x_1 - \mu_1)^T\Sigma_{11}^{-1}(x_1 - \mu_1)\}$ and $\exp\{-\tfrac{1}{2}(x_2 - \mu_2)^T\Sigma_{22}^{-1}(x_2 - \mu_2)\}$.
Therefore the p.d.f. can be written as
$$f(x) = g(x_1)h(x_2)$$
proving that $x_1$ and $x_2$ are independent. $\square$
Result 2
Let $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ (with $q$ and $p - q$ components) be MVN with mean $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$.
The conditional distribution of $x_2$ given $x_1$ is MVN with
$$E(x_2|x_1) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1) \qquad (3.4a)$$
$$\mathrm{Cov}(x_2|x_1) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \qquad (3.4b)$$
Proof
Let $x'_2 = x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1$. We first show that $x'_2$ and $x_1$ are independent.
Consider the linear transformation
$$\begin{pmatrix} x_1 \\ x'_2 \end{pmatrix} = \begin{pmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \qquad (3.5a)$$
$$= Ax, \text{ say.} \qquad (3.5b)$$
This linear relationship shows that $x_1, x'_2$ are jointly MVN (by the first property of MVN stated above).
We may show that $x_1$ and $x'_2$ are uncorrelated in two ways.
Firstly,
$$\mathrm{Cov}(x_1, x'_2) = \mathrm{Cov}(x_1, x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1) = \mathrm{Cov}(x_1, x_2) - \mathrm{Cov}(x_1, x_1)\Sigma_{11}^{-1}\Sigma_{12} = \Sigma_{12} - \Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} = 0$$
or, if we write $A = \begin{pmatrix} B \\ C \end{pmatrix}$ in (3.5) and apply (3.3),
$$\mathrm{Cov}(x_1, x'_2) = \mathrm{Cov}(Bx, Cx) = B\Sigma C^T$$
$$= \begin{pmatrix} I & 0 \end{pmatrix}\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \end{pmatrix}\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = 0$$
Since $x_1$ and $x'_2$ are MVN and uncorrelated, we have shown that they are independent. Therefore
$$E(x'_2|x_1) = E(x'_2) = E(x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1) = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1$$
Now since $x'_2 = x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1$,
$$E(x_2|x_1) = E(x'_2|x_1) + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1 + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)$$
as required.
Because $x_1$ and $x'_2$ are independent,
$$\mathrm{Cov}(x'_2|x_1) = \mathrm{Cov}(x'_2)$$
Conditional on $x_1$ (a given constant), $x'_2 = x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1$, i.e. $x'_2$ and $x_2$ differ by a constant. Hence
$$\mathrm{Cov}(x_2|x_1) = \mathrm{Cov}(x'_2|x_1)$$
Therefore
$$\mathrm{Cov}(x_2|x_1) = \mathrm{Cov}(x'_2) = C\Sigma C^T$$
where $C = \begin{pmatrix} -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}$, so
$$\begin{pmatrix} -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = \begin{pmatrix} 0 & \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \end{pmatrix}\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$$
Example
Let $x$ have a MVN distribution with covariance matrix
$$\Sigma = \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & 0 \\ \rho^2 & 0 & 1 \end{pmatrix}$$
Show that the conditional distribution of $(x_1, x_2)$ given $x_3$ is also MVN, with mean
$$\begin{pmatrix} \mu_1 + \rho^2(x_3 - \mu_3) \\ \mu_2 \end{pmatrix}$$
and covariance matrix
$$\begin{pmatrix} 1 - \rho^4 & \rho \\ \rho & 1 \end{pmatrix}$$
3.1 Maximum-likelihood estimation
Let $X^T = (x_1, \dots, x_n)$ contain an independent random sample of size $n$ from $N_p(\mu, \Sigma)$. The maximum likelihood estimates (MLEs) of $\mu, \Sigma$ are
$$\hat{\mu} = \bar{x} \qquad (3.6a)$$
$$\hat{\Sigma} = S \qquad (3.6b)$$
The likelihood function is a function of the parameters $\mu, \Sigma$ given the data $X$:
$$L(\mu, \Sigma|X) = \prod_{r=1}^{n} f(x_r|\mu, \Sigma) \qquad (3.7)$$
The RHS is evaluated by substituting the individual data vectors $\{x_1, \dots, x_n\}$ in turn into the p.d.f. of $N_p(\mu, \Sigma)$ and taking the product:
$$\prod_{r=1}^{n} f(x_r|\mu, \Sigma) = (2\pi)^{-\frac{np}{2}}|\Sigma|^{-\frac{n}{2}}\exp\left\{-\frac{1}{2}\sum_{r=1}^{n}(x_r - \mu)^T\Sigma^{-1}(x_r - \mu)\right\}$$
Maximizing $L$ is equivalent to minimizing
$$l = -2\log L = -2\sum_{r=1}^{n}\log f(x_r|\mu, \Sigma) = K + n\log|\Sigma| + \sum_{r=1}^{n}(x_r - \mu)^T\Sigma^{-1}(x_r - \mu)$$
where $K$ is a constant independent of $\mu, \Sigma$.
Noting that $x_r - \mu = (x_r - \bar{x}) + (\bar{x} - \mu)$, the final term above may be written
$$\sum_{r=1}^{n}(x_r - \bar{x})^T\Sigma^{-1}(x_r - \bar{x}) + \sum_{r=1}^{n}(x_r - \bar{x})^T\Sigma^{-1}(\bar{x} - \mu) + \sum_{r=1}^{n}(\bar{x} - \mu)^T\Sigma^{-1}(x_r - \bar{x}) + n(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu)$$
and the two cross terms vanish because $\sum_{r=1}^{n}(x_r - \bar{x}) = 0$. Thus, dropping the constant $K$,
$$l(\mu, \Sigma) = \mathrm{tr}(\Sigma^{-1}A) + nd^T\Sigma^{-1}d \qquad (3.8a)$$
$$= n\left[\mathrm{tr}(\Sigma^{-1}S) + d^T\Sigma^{-1}d\right] \qquad (3.8b)$$
where we define, for ease of notation,
$$A = nS \qquad (3.9a)$$
$$d = \bar{x} - \mu \qquad (3.9b)$$
and $S$ is the sample covariance matrix (with divisor $n$). We have made use of $nS = C^T C$, where $C$ is the $(n \times p)$ centred data matrix
$$C^T = (x_1 - \bar{x}, x_2 - \bar{x}, \dots, x_n - \bar{x})$$
We see that
$$\sum_{r=1}^{n}(x_r - \bar{x})^T\Sigma^{-1}(x_r - \bar{x}) = \mathrm{tr}(C\Sigma^{-1}C^T) = \mathrm{tr}(\Sigma^{-1}C^T C) = \mathrm{tr}(\Sigma^{-1}A) = n\,\mathrm{tr}(\Sigma^{-1}S)$$
Notice that $l = l(\mu, \Sigma)$ and the dependence on $\mu$ is entirely through $d$ in (3.8). Now assume that $\Sigma$ is positive definite (p.d.); then so is $\Sigma^{-1}$ (why?). Thus for all $d \ne 0$ we have $d^T\Sigma^{-1}d > 0$, showing that $l$ is minimized with respect to $\mu$, for fixed $\Sigma$, when $d = 0$.
Hence $\hat{\mu} = \bar{x}$. To minimize the log-likelihood $l(\hat{\mu}, \Sigma)$ w.r.t. $\Sigma$:
$$l(\bar{x}, \Sigma) = n\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}A) = n\left[\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\right]$$
up to an arbitrary additive constant.
Let
$$\phi(\Sigma) = n\left[\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\right] \qquad (3.10)$$
We show that
$$\phi(\Sigma) - \phi(S) = n\left[\log|\Sigma| - \log|S| + \mathrm{tr}(\Sigma^{-1}S) - p\right]$$
$$= n\left[\mathrm{tr}(\Sigma^{-1}S) - \log|\Sigma^{-1}S| - p\right] \qquad (3.11)$$
$$\ge 0$$
Lemma 1
$\Sigma^{-1}S$ is positive definite. (Proved elsewhere.)
Lemma 2
For any set of positive numbers, $A \ge \log G + 1$, where $A$ and $G$ are respectively the arithmetic and geometric means.
Proof
For all $x$ we have $e^x \ge 1 + x$ (simple exercise), hence $y \ge 1 + \log y$ for $y > 0$. For each $y_i > 0$ of a set $i \in \{1, \dots, n\}$, therefore,
$$y_i \ge 1 + \log y_i$$
$$\sum y_i \ge n + \sum\log y_i$$
$$A \ge 1 + \log\left(\prod y_i\right)^{\frac{1}{n}} = 1 + \log G$$
as required.
In (3.11), the eigenvalues of $\Sigma^{-1}S$ are positive (Lemma 1); recall also that for any square matrix $A$ we have $\mathrm{tr}(A) = \sum\lambda_i$, the sum of the eigenvalues, and $|A| = \prod\lambda_i$, the product of the eigenvalues.
Let $\lambda_i$ ($i = 1, \dots, p$) be the eigenvalues of $\Sigma^{-1}S$ and substitute in (3.11):
$$\log|\Sigma^{-1}S| = \log\left(\prod\lambda_i\right) = p\log G$$
$$\mathrm{tr}(\Sigma^{-1}S) = \sum\lambda_i = pA$$
$$\phi(\Sigma) - \phi(S) = np\{A - \log G - 1\} \ge 0$$
This shows that the MLEs are as stated in (3.6).
3.2 Sampling distribution of x̄ and S
The Wishart distribution (definition)
If $M$ ($p \times p$) can be written $M = X^T X$ where $X$ ($m \times p$) is a data matrix from $N_p(0, \Sigma)$, then $M$ is said to have a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom $m$. We write
$$M \sim W_p(\Sigma, m) \qquad (3.12)$$
When $\Sigma = I_p$ the distribution is said to be in standard form.
Note: the Wishart distribution is the multivariate generalization of the chi-square ($\chi^2$) distribution.
Additive property of matrices with a Wishart distribution
Let $M_1$, $M_2$ be matrices having the Wishart distributions
$$M_1 \sim W_p(\Sigma, m_1), \qquad M_2 \sim W_p(\Sigma, m_2)$$
independently; then $M_1 + M_2 \sim W_p(\Sigma, m_1 + m_2)$.
This property follows from the definition of the Wishart distribution, because data matrices are additive in the sense that if
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$
is a combined data matrix consisting of $m_1 + m_2$ rows, then
$$X^T X = X_1^T X_1 + X_2^T X_2$$
is the matrix (known as the "Gram matrix") formed from the combined data matrix $X$.
Case of p = 1
When $p = 1$ we know, from the definition of $\chi^2_r$ as the distribution of the sum of squares of $r$ independent $N(0, 1)$ variates, that
$$M = \sum_{i=1}^{m} x_i^2 \sim \sigma^2\chi^2_m$$
so that $W_1(\sigma^2, m) \equiv \sigma^2\chi^2_m$.
Sampling distributions
Let $x_1, x_2, \dots, x_n$ be a random sample of size $n$ from $N_p(\mu, \Sigma)$. Then
1. The sample mean $\bar{x}$ has the normal distribution
$$\bar{x} \sim N_p\left(\mu, \tfrac{1}{n}\Sigma\right)$$
2. The sample covariance matrix $S$ (MLE: $S = \tfrac{1}{n}C^T C$) has the Wishart distribution
$$nS \sim W_p(\Sigma, n-1)$$
3. The distributions of $\bar{x}$ and $S$ are independent.
3.3 Estimators for special circumstances
3.3.1 $\mu$ proportional to a given vector
Sometimes $\mu$ is known to be proportional to a given vector, so $\mu = k\mu_0$. For example, if $x$ represents a sample of repeated measurements then $\mu = k\mathbf{1}$, where $\mathbf{1} = (1, 1, \dots, 1)^T$ is the $p$-vector of 1's.
We find the MLE of $k$ for this situation. Suppose $\Sigma$ is known and $\mu = k\mu_0$; the log likelihood is
$$l = -2\log L = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - k\mu_0)^T\Sigma^{-1}(\bar{x} - k\mu_0)\right\}$$
We set $\frac{\partial l}{\partial k} = 0$ to minimize $l$ w.r.t. $k$. Expanding the quadratic form,
$$(\bar{x} - k\mu_0)^T\Sigma^{-1}(\bar{x} - k\mu_0) = \bar{x}^T\Sigma^{-1}\bar{x} - 2k\mu_0^T\Sigma^{-1}\bar{x} + k^2\mu_0^T\Sigma^{-1}\mu_0$$
and differentiating with respect to $k$ gives $-2\mu_0^T\Sigma^{-1}\bar{x} + 2k\mu_0^T\Sigma^{-1}\mu_0 = 0$, from which
$$\hat{k} = \frac{\mu_0^T\Sigma^{-1}\bar{x}}{\mu_0^T\Sigma^{-1}\mu_0} \qquad (3.13)$$
We may show that $\hat{k}$ is an unbiased estimator of $k$ and determine the variance of $\hat{k}$.
In (3.13), $\hat{k}$ takes the form $\frac{1}{a}c^T\bar{x}$ with $c^T = \mu_0^T\Sigma^{-1}$ and $a = \mu_0^T\Sigma^{-1}\mu_0$, so
$$E[\hat{k}] = \frac{1}{a}c^T E[\bar{x}] = \frac{k}{a}c^T\mu_0$$
Hence
$$E[\hat{k}] = k \qquad (3.14)$$
showing that $\hat{k}$ is an unbiased estimator.
Noting that $\mathrm{Var}[\bar{x}] = \tfrac{1}{n}\Sigma$, and therefore that $\mathrm{Var}[c^T\bar{x}] = \tfrac{1}{n}c^T\Sigma c$, we have
$$\mathrm{Var}[\hat{k}] = \frac{1}{na^2}c^T\Sigma c = \frac{1}{n\,\mu_0^T\Sigma^{-1}\mu_0} \qquad (3.15)$$
3.3.2 Linear restriction on $\mu$
We determine an estimator for $\mu$ satisfying a linear restriction
$$A\mu = b$$
where $A$ is $(m \times p)$ and $b$ is $(m \times 1)$.
Introduce a vector $\lambda$ of $m$ Lagrange multipliers and seek to minimize
$$n(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu) + 2\lambda^T(A\mu - b)$$
Differentiating w.r.t. $\mu$,
$$-2\Sigma^{-1}(\bar{x} - \mu) + 2A^T\lambda = 0$$
$$\bar{x} - \mu = \Sigma A^T\lambda \qquad (3.16)$$
We use the constraint $A\mu = b$ to evaluate the Lagrange multipliers $\lambda$. Premultiplying by $A$,
$$A\bar{x} - b = A\Sigma A^T\lambda$$
$$\lambda = (A\Sigma A^T)^{-1}(A\bar{x} - b)$$
Substituting into (3.16),
$$\hat{\mu} = \bar{x} - \Sigma A^T(A\Sigma A^T)^{-1}(A\bar{x} - b) \qquad (3.17)$$
3.3.3 Covariance matrix $\Sigma$ proportional to a given matrix
We consider estimating $k$ when $\Sigma = k\Sigma_0$, where $\Sigma_0$ is given.
The likelihood (3.8) takes the form
$$l = n\left\{\log|k\Sigma_0| + \mathrm{tr}\left(\tfrac{1}{k}\Sigma_0^{-1}S\right)\right\}$$
plus terms not involving $k$, i.e.
$$l = n\left\{p\log k + \tfrac{1}{k}\mathrm{tr}(\Sigma_0^{-1}S)\right\} + \text{const}$$
$$\frac{dl}{dk} = n\left\{\frac{p}{k} - \frac{1}{k^2}\mathrm{tr}(\Sigma_0^{-1}S)\right\} = 0$$
Hence
$$\hat{k} = \frac{\mathrm{tr}(\Sigma_0^{-1}S)}{p} \qquad (3.18)$$
4. Hypothesis testing (Hotelling's $T^2$ statistic)
Consider the test of hypothesis
$$H_0: \mu = \mu_0 \quad \text{against} \quad H_A: \mu = \mu_1 \ne \mu_0$$
4.1 The Union-Intersection Principle
We accept the hypothesis $H_0$ as valid if and only if
$$H_0(a): a^T\mu = a^T\mu_0$$
is accepted for all $a$. [$H_0$ is, in this sense, the intersection of all such hypotheses.]
For fixed $a$ we set $y = a^T x$, so that in the population, under $H_0$,
$$E(y) = a^T\mu_0, \qquad \mathrm{Var}(y) = a^T\Sigma a$$
and in our sample
$$\bar{y} = a^T\bar{x}, \qquad s.e.(\bar{y}) = \sqrt{\frac{a^T S a}{n-1}}$$
The univariate t-statistic for testing $H_0(a)$ against the alternative $\mu(y) \ne a^T\mu_0$ is
$$t(a) = \frac{\bar{y} - a^T\mu_0}{s.e.(\bar{y})} = \frac{\sqrt{n-1}\,a^T(\bar{x} - \mu_0)}{\sqrt{a^T S a}}$$
The acceptance region for $H_0(a)$ takes the form $t^2(a) \le R$ for some $R$. The multivariate acceptance region is the intersection
$$\bigcap_a\left\{t^2(a) \le R\right\} \qquad (4.1)$$
which holds if and only if $\max_a t^2(a) \le R$. Therefore we adopt $\max_a t^2(a)$ as the test statistic for $H_0$. Equivalently:
$$\text{maximize } (n-1)\,a^T(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T a \quad \text{subject to} \quad a^T S a = 1 \qquad (4.2)$$
Writing $d = \bar{x} - \mu_0$, we introduce a Lagrange multiplier $\lambda$ and seek to determine $\lambda$ and $a$ to satisfy
$$\frac{d}{da}\left[a^T(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T a - \lambda a^T S a\right] = 0$$
$$dd^T a - \lambda Sa = 0 \qquad (4.3a)$$
$$(S^{-1}dd^T - \lambda I)a = 0 \qquad (4.3b)$$
$$|S^{-1}dd^T - \lambda I| = 0 \qquad (4.3c)$$
(4.3b) can be written $Ma = \lambda a$, showing that $a$ is an eigenvector of $M = S^{-1}dd^T$.
(4.3c) is the determinantal equation satisfied by the eigenvalues of $S^{-1}dd^T$.
Premultiplying (4.3a) by $a^T$ gives
$$a^Tdd^Ta - \lambda a^TSa = 0$$
$$\lambda = \frac{a^Tdd^Ta}{a^TSa} = \frac{t^2(a)}{n-1}$$
Therefore, in order to maximize $t^2(a)$ we choose $\lambda$ to be the largest eigenvalue of $S^{-1}dd^T$. This is a rank 1 matrix with the single non-zero eigenvalue
$$\mathrm{tr}(S^{-1}dd^T) = d^TS^{-1}d$$
and the maximum of (4.2) is known as Hotelling's $T^2$ statistic,
$$T^2 = (n-1)(\bar{x} - \mu_0)^TS^{-1}(\bar{x} - \mu_0) \qquad (4.4)$$
which is $(n-1)$ times the sample Mahalanobis distance between $\bar{x}$ and $\mu_0$.
4.2 Distribution of $T^2$
Under $H_0$ it can be shown that
$$\frac{T^2}{n-1} \sim \frac{p}{n-p}F_{p, n-p} \qquad (4.5)$$
where $F_{p, n-p}$ is the $F$ distribution on $p$ and $n-p$ degrees of freedom. Note that, depending on the covariance matrix used, $T^2$ has slightly different forms:
$$T^2 = (n-1)(\bar{x} - \mu_0)^TS^{-1}(\bar{x} - \mu_0) = n(\bar{x} - \mu_0)^TS_U^{-1}(\bar{x} - \mu_0)$$
where $S_U$ is the unbiased estimator of $\Sigma$ (with divisor $n-1$).
Example 1
In an investigation of adult intelligence, scores were obtained on two tests, "verbal" and "performance", for 101 subjects aged 60 to 64. Doppelt and Wallace (1955) reported the following mean scores and covariance matrix:
$$\begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \end{pmatrix} = \begin{pmatrix} 55.24 \\ 34.97 \end{pmatrix}, \qquad S_U = \begin{pmatrix} 210.54 & 126.99 \\ 126.99 & 119.68 \end{pmatrix}$$
At the $\alpha = .01$ (1%) level, test the hypothesis that
$$\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} 60 \\ 50 \end{pmatrix}$$
We first compute
$$S_U^{-1} = \begin{pmatrix} .01319 & -.01400 \\ -.01400 & .02321 \end{pmatrix}$$
and
$$d = \bar{x} - \mu_0 = (-4.76, -15.03)^T$$
The $T^2$ statistic is then
$$T^2 = 101\begin{pmatrix} -4.76 & -15.03 \end{pmatrix}\begin{pmatrix} .01319 & -.01400 \\ -.01400 & .02321 \end{pmatrix}\begin{pmatrix} -4.76 \\ -15.03 \end{pmatrix}$$
$$= 101\left[4.76^2\times .01319 - 2\times 4.76\times 15.03\times .01400 + 15.03^2\times .02321\right] = 357.4$$
This gives
$$F = \frac{99}{2}\times\frac{357.4}{100} = 176.9$$
The nearest tabulated 1% value corresponds to $F_{2,60}$ and is 4.98.
Therefore we conclude the null hypothesis should be rejected. The sample probably arose from a population with a much lower mean vector, rather closer to the sample mean.
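A sketch reproducing this calculation, using the summary statistics quoted above (scipy is assumed only for the $F$ quantile):

```python
import numpy as np
from scipy import stats

n, p = 101, 2
xbar = np.array([55.24, 34.97])
mu0 = np.array([60.0, 50.0])
SU = np.array([[210.54, 126.99],
               [126.99, 119.68]])

d = xbar - mu0
T2 = n * d @ np.linalg.solve(SU, d)      # T^2 using the unbiased covariance S_U
F = T2 * (n - p) / ((n - 1) * p)         # convert to an F statistic
crit = stats.f.ppf(0.99, p, n - p)       # 1% critical value of F_{2,99}
print(T2, F, crit)                       # approx 357, 177, 4.8: reject H0
```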
Example 2
The changes in levels of free fatty acid (FFA) were measured on 15 hypnotised subjects who had been asked to experience fear, depression and anger effects while under hypnosis. The mean FFA changes were
$$\bar{x}_1 = 2.699, \quad \bar{x}_2 = 2.178, \quad \bar{x}_3 = 2.558$$
Given that the covariance matrix of the stress differences $y_{i1} = x_{i1} - x_{i2}$ and $y_{i2} = x_{i1} - x_{i3}$ is
$$S_U = \begin{pmatrix} 1.7343 & 1.1666 \\ 1.1666 & 2.7733 \end{pmatrix}, \qquad S_U^{-1} = \begin{pmatrix} 0.8041 & -0.3382 \\ -0.3382 & 0.5029 \end{pmatrix}$$
test at the 0.05 level of significance whether each effect produced the same change in FFA.
[$T^2 = 2.68$ and $F = 1.24$ with degrees of freedom 2, 13. Do not reject the hypothesis "no emotion effect" at the $\alpha = .05$ level.]
4.3 Invariance of $T^2$
$T^2$ is unaffected by changes in the scale or origin of the (response) variables. Consider
$$y = Cx + d$$
where $C$ is $(p \times p)$ and non-singular.
The null hypothesis $H_0: \mu_x = \mu_0$ is equivalent to $H_0: \mu_y = C\mu_0 + d$.
Under the linear transformation we have
$$\bar{y} = C\bar{x} + d, \qquad S_y = CSC^T$$
so that
$$\frac{1}{n-1}T_y^2 = (\bar{y} - \mu_y)^TS_y^{-1}(\bar{y} - \mu_y) = (\bar{x} - \mu_0)^TC^T(CSC^T)^{-1}C(\bar{x} - \mu_0)$$
$$= (\bar{x} - \mu_0)^TC^T(C^T)^{-1}S^{-1}C^{-1}C(\bar{x} - \mu_0) = (\bar{x} - \mu_0)^TS^{-1}(\bar{x} - \mu_0)$$
which demonstrates invariance.
4.4 Confidence interval for a mean
A confidence region for $\mu$ can be obtained given the distribution of $T^2$:
$$(n-1)(\bar{x} - \mu)^TS^{-1}(\bar{x} - \mu) \sim \frac{p(n-1)}{n-p}F_{p, n-p} \qquad (4.6)$$
by substituting the data values $\bar{x}$ and $S^{-1}$.
In Example 1 above we have
$$\bar{x} = (55.24, 34.97)^T, \qquad 100S^{-1} = \begin{pmatrix} 1.32 & -1.40 \\ -1.40 & 2.32 \end{pmatrix}$$
and $F_{2,99}(.01)$ is approximately 4.83 (by interpolation). Hence the 99% confidence region is
$$1.32(\mu_1 - 55.24)^2 - 2.80(\mu_1 - 55.24)(\mu_2 - 34.97) + 2.32(\mu_2 - 34.97)^2 \le \frac{2\times 100}{99}\times 4.83 = 9.76$$
This is an ellipse in $p = 2$ dimensional space (which can be plotted). In higher dimensions an ellipsoidal confidence region is obtained.
4.5 Likelihood ratio test
Given a data matrix $X$ of observations on a random vector $x$ whose distribution depends on a vector of parameters $\theta$, the likelihood ratio for testing the null hypothesis $H_0: \theta \in \Omega_0$ against the alternative $H_1: \theta \in \Omega_1$ is defined as
$$\lambda = \frac{\sup_{\theta\in\Omega_0}L}{\sup_{\theta\in\Omega_1}L} \qquad (4.7)$$
where $L = L(\theta; X)$ is the likelihood function. In a likelihood ratio test (LRT) we reject $H_0$ for low values of $\lambda$, i.e. if $\lambda < c$ where $c$ is chosen so that the probability of a Type I error is $\alpha$.
If we define $l_0^* = -2\log L_0^*$, where $L_0^*$ is the value of the numerator, and similarly $l_1^* = -2\log L_1^*$, the rejection criterion takes the form
$$-2\log\lambda = -2\log\left(\frac{L_0^*}{L_1^*}\right) = l_0^* - l_1^* > k \qquad (4.8)$$
Result
When $H_0$ is true and for $n$ "large", the log likelihood ratio (4.8) has the $\chi^2$ distribution on $r$ degrees of freedom, $\chi^2_r$, where $r$ equals the number of free parameters under $H_1$ minus the number of free parameters under $H_0$.
4.6 LRT for a mean when $\Sigma$ is known
$H_0: \mu = \mu_0$, a given value, when $\Sigma$ is known.
Given a random sample from $N(\mu, \Sigma)$ resulting in $\bar{x}$ and $S$, the likelihood given in (3.8b) is (to within an additive constant)
$$l(\mu, \Sigma) = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu)\right\} \qquad (4.9)$$
Under $H_0$ the value of $\mu$ is known and
$$l_0^* = l(\mu_0, \Sigma) = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \mu_0)^T\Sigma^{-1}(\bar{x} - \mu_0)\right\}$$
Under $H_1$, with no restriction on $\mu$, the m.l.e. of $\mu$ is $\hat{\mu} = \bar{x}$. Thus
$$l_1^* = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\right\}$$
Therefore
$$-2\log\lambda = l_0^* - l_1^* = n(\bar{x} - \mu_0)^T\Sigma^{-1}(\bar{x} - \mu_0) \qquad (4.10)$$
which is $n$ times the Mahalanobis distance of $\bar{x}$ from $\mu_0$. Note the similarity with Hotelling's $T^2$ statistic. Given that the distribution of $\bar{x}$ under $H_0$ is
$$\bar{x} \sim N_p\left(\mu_0, \tfrac{1}{n}\Sigma\right)$$
and that (4.10) may be written, using the transformation $y = \left(\tfrac{1}{n}\Sigma\right)^{-\frac{1}{2}}(\bar{x} - \mu_0)$ to a standard set of independent $N(0,1)$ variates, as
$$-2\log\lambda = y^Ty = \sum_{i=1}^{p}y_i^2 \qquad (4.11)$$
we have the exact distribution
$$-2\log\lambda \sim \chi^2_p \qquad (4.12)$$
showing that in this case the asymptotic distribution of $-2\log\lambda$ is exact, even for small samples.
Example
Measurements of the length of skull were made on a sample of first and second sons from 25 families:
$$\bar{x} = \begin{pmatrix} 185.72 \\ 183.84 \end{pmatrix}, \qquad S = \begin{pmatrix} 91.48 & 66.88 \\ 66.88 & 96.78 \end{pmatrix}$$
Assuming that in fact
$$\Sigma = \begin{pmatrix} 100 & 0 \\ 0 & 100 \end{pmatrix}$$
test at the $\alpha = .05$ level the hypothesis
$$H_0: \mu = (182, 182)^T$$
Solution
$$-2\log\lambda = 25\begin{pmatrix} 3.72 & 1.84 \end{pmatrix}\begin{pmatrix} .01 & 0 \\ 0 & .01 \end{pmatrix}\begin{pmatrix} 3.72 \\ 1.84 \end{pmatrix} = 0.25\times(3.72^2 + 1.84^2) = 4.31$$
Since $\chi^2_2(.05) = 5.99$, do not reject $H_0$.
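The same arithmetic in a few lines (scipy assumed only for the $\chi^2$ quantile):

```python
import numpy as np
from scipy import stats

n = 25
xbar = np.array([185.72, 183.84])
mu0 = np.array([182.0, 182.0])
Sigma = np.diag([100.0, 100.0])

stat = n * (xbar - mu0) @ np.linalg.solve(Sigma, xbar - mu0)  # -2 log lambda
print(stat)                            # approx 4.31
print(stats.chi2.ppf(0.95, df=2))      # 5.99: do not reject H0
```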
4.7 LRT for mean when $\Sigma$ is unknown
Consider the test of hypothesis
$$H_0: \mu = \mu_0 \text{ when } \Sigma \text{ is unknown}, \qquad H_1: \mu \ne \mu_0$$
In this case $\Sigma$ must be estimated under $H_0$ and also under $H_1$.
Under $H_0$:
$$l(\mu_0, \Sigma) = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \mu_0)^T\Sigma^{-1}(\bar{x} - \mu_0)\right\} \qquad (4.13a)$$
$$= n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + d_0^T\Sigma^{-1}d_0\right\} \qquad (4.13b)$$
$$= n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + \mathrm{tr}(d_0^T\Sigma^{-1}d_0)\right\} \qquad (4.13c)$$
$$= n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + \mathrm{tr}(\Sigma^{-1}d_0d_0^T)\right\} \qquad (4.13d)$$
$$= n\left\{\log|\Sigma| + \mathrm{tr}\left(\Sigma^{-1}(S + d_0d_0^T)\right)\right\} \qquad (4.13e)$$
writing $d_0$ for $\bar{x} - \mu_0$.
Under $H_1$:
$$l(\hat{\mu}, \Sigma) = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \hat{\mu})^T\Sigma^{-1}(\bar{x} - \hat{\mu})\right\} \qquad (4.14a)$$
$$= n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\right\} \qquad (4.14b)$$
$$l(\hat{\mu}, \hat{\Sigma}) = n\left\{\log|S| + \mathrm{tr}(S^{-1}S)\right\} \qquad (4.14c)$$
$$= n\left\{\log|S| + \mathrm{tr}(I_p)\right\} \qquad (4.14d)$$
$$l_1^* = n\log|S| + np \qquad (4.14e)$$
after substitution of the m.l.e.'s $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$ obtained previously.
Comparing (4.13e) with (4.14b) we see that the m.l.e. of $\Sigma$ under $H_0$ must be
$$\hat{\Sigma} = S + d_0d_0^T$$
and that the corresponding value of $l = -2\log L$ is
$$l_0^* = n\log|S + d_0d_0^T| + np$$
$$l_0^* - l_1^* = n\log|S + d_0d_0^T| - n\log|S| = n\log\left|S^{-1}(S + d_0d_0^T)\right|$$
$$= n\log\left|I_p + S^{-1}d_0d_0^T\right| = n\log\left(1 + d_0^TS^{-1}d_0\right) \qquad (4.15)$$
making use of the useful matrix result proved in (1.8.3), that $|I_p + uv^T| = 1 + v^Tu$.
Since
$$-2\log\lambda = n\log\left(1 + \frac{T^2}{n-1}\right) \qquad (4.16)$$
we see that $\lambda$ and $T^2$ are monotonically related. Therefore we can conclude that the LRT of $H_0: \mu = \mu_0$ when $\Sigma$ is unknown is equivalent to the use of Hotelling's $T^2$ statistic.
4.8 LRT for $\Sigma = \Sigma_0$ with $\mu$ unknown
$$H_0: \Sigma = \Sigma_0 \text{ when } \mu \text{ is unknown}, \qquad H_1: \Sigma \ne \Sigma_0$$
Under $H_0$ we substitute $\hat{\mu} = \bar{x}$ into
$$l(\hat{\mu}, \Sigma_0) = n\left\{\log|\Sigma_0| + \mathrm{tr}(\Sigma_0^{-1}S) + (\bar{x} - \hat{\mu})^T\Sigma_0^{-1}(\bar{x} - \hat{\mu})\right\}$$
giving
$$l_0^* = n\left\{\log|\Sigma_0| + \mathrm{tr}(\Sigma_0^{-1}S)\right\} \qquad (4.17)$$
Under $H_1$ we substitute the unrestricted m.l.e.'s $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$ giving, as in (4.14e),
$$l_1^* = n\log|S| + np \qquad (4.18)$$
$$l_0^* - l_1^* = n\left\{\log|\Sigma_0| + \mathrm{tr}(\Sigma_0^{-1}S) - \log|S| - p\right\} = n\left\{-\log|\Sigma_0^{-1}S| + \mathrm{tr}(\Sigma_0^{-1}S) - p\right\} \qquad (4.19)$$
This statistic depends only on the eigenvalues of the positive definite matrix $\Sigma_0^{-1}S$ and has the property that $l_0^* - l_1^* = -2\log\lambda \to 0$ as $S$ approaches $\Sigma_0$.
Let $A$ be the arithmetic mean and $G$ the geometric mean of the eigenvalues of $\Sigma_0^{-1}S$:
$$\mathrm{tr}(\Sigma_0^{-1}S) = pA, \qquad |\Sigma_0^{-1}S| = G^p$$
then
$$-2\log\lambda = n\{pA - p\log G - p\} = np\{A - \log G - 1\} \qquad (4.20)$$
The general result for the distribution of (4.20) for large $n$ gives
$$l_0^* - l_1^* \sim \chi^2_r \qquad (4.21)$$
where $r = \tfrac{1}{2}p(p+1)$ is the number of independent parameters in $\Sigma$.
4.10 Test for sphericity
A covariance matrix is said to have the property of "sphericity" if
$$\Sigma = kI_p \qquad (4.22)$$
for some $k$. We see that this is a special case of the more general situation $\Sigma = k\Sigma_0$ treated in Section 3.3.3. The same procedure can be applied. The general likelihood expression for a sample from the MVN distribution is
$$-2\log L = n\left\{\log|\Sigma| + \mathrm{tr}\left(\Sigma^{-1}(S + dd^T)\right)\right\}$$
Under $H_0$, $\Sigma = kI_p$ and $\hat{\mu} = \bar{x}$ (so $d = 0$), so
$$-2\log L = n\left\{\log|kI_p| + \mathrm{tr}(k^{-1}S)\right\} = n\left\{p\log k + k^{-1}\mathrm{tr}(S)\right\} \qquad (4.23)$$
Setting $\frac{\partial}{\partial k}[-2\log L] = 0$ at a minimum,
$$\frac{p}{k} - \frac{1}{k^2}\mathrm{tr}(S) = 0$$
$$\hat{k} = \frac{\mathrm{tr}(S)}{p} \qquad (4.24)$$
which is in fact the arithmetic mean $A$ of the eigenvalues of $S$.
Substituting back into (4.23) gives
$$l_0^* = np(\log A + 1)$$
Under $H_1$, $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$:
$$l_1^* = n\log|S| + np = np(\log G + 1)$$
thus
$$-2\log\lambda = l_0^* - l_1^* = np\log\left(\frac{A}{G}\right) \qquad (4.25)$$
The number of free parameters contained in $\Sigma$ is 1 under $H_0$ and $\tfrac{1}{2}p(p+1)$ under $H_1$. Hence the appropriate distribution for comparing $-2\log\lambda$ is $\chi^2_r$, where
$$r = \tfrac{1}{2}p(p+1) - 1 = \tfrac{1}{2}(p-1)(p+2) \qquad (4.26)$$
4.11 Test for independence
Independence of the variables $x_1, \dots, x_p$ is manifested by a diagonal covariance matrix
$$\Sigma = \mathrm{diag}(\sigma_{11}, \dots, \sigma_{pp}) \qquad (4.27)$$
We consider $H_0$: $\Sigma$ is diagonal, against the general alternative $H_1$: $\Sigma$ is unrestricted.
Under $H_0$ it is in fact clear that we will find $\hat{\sigma}_{ii} = s_{ii}$, because the estimators of $\sigma_{ii}$ for each $x_i$ are independent. We can also show this formally:
$$-2\log L = n\left\{\log|\Sigma| + \mathrm{tr}\left(\Sigma^{-1}(S + dd^T)\right)\right\} = n\left\{\sum_{i=1}^{p}\log\sigma_{ii} + \sum_{i=1}^{p}\frac{s_{ii}}{\sigma_{ii}}\right\}$$
(with $\hat{\mu} = \bar{x}$, so $d = 0$). Setting $\frac{\partial}{\partial\sigma_{ii}}(-2\log L) = 0$:
$$\frac{1}{\sigma_{ii}} - \frac{s_{ii}}{\sigma_{ii}^2} = 0$$
$$\hat{\sigma}_{ii} = s_{ii}$$
Therefore
$$l_0^* = n\left\{\sum_{i=1}^{p}\log s_{ii} + p\right\} = n\left\{\log|D| + p\right\}$$
where $D = \mathrm{diag}(s_{11}, \dots, s_{pp})$.
Under $H_1$, as before, we find
$$l_1^* = n\log|S| + np$$
Therefore
$$l_0^* - l_1^* = n\left[\log|D| - \log|S|\right] = -n\log|D^{-1}S| = -n\log|D^{-\frac{1}{2}}SD^{-\frac{1}{2}}| = -n\log|R| \qquad (4.28)$$
The number of free parameters contained in $\Sigma$ is $p$ under $H_0$ and $\tfrac{1}{2}p(p+1)$ under $H_1$. Hence the appropriate distribution for comparing $-2\log\lambda$ is $\chi^2_r$, where
$$r = \tfrac{1}{2}p(p+1) - p = \tfrac{1}{2}p(p-1) \qquad (4.29)$$
4.12 Simultaneous confidence intervals (Scheffé, Roy & Bose)
The union-intersection method for deriving Hotelling's $T^2$ statistic provides "simultaneous confidence intervals" for the parameters when $\Sigma$ is unknown. Following Section 4.1, let
$$T^2 = (n-1)(\bar{x} - \mu)^TS^{-1}(\bar{x} - \mu) \qquad (4.30)$$
where $\mu$ is the unknown (true) mean. Let $t(a)$ be the univariate t-statistic corresponding to the linear compound $y = a^Tx$. Then
$$\max_a t^2(a) = T^2$$
and for all $p$-vectors $a$,
$$t^2(a) \le T^2 \qquad (4.31)$$
where
$$t(a) = \frac{\bar{y} - \mu_y}{s_y/\sqrt{n}} = \frac{\sqrt{n-1}\,a^T(\bar{x} - \mu)}{\sqrt{a^TSa}} \qquad (4.32)$$
From Section 4.2 the distribution of $T^2$ is
$$\frac{T^2}{n-1} \sim \frac{p}{n-p}F_{p,n-p}$$
so
$$\Pr\left[T^2 \le \frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right] = 1 - \alpha$$
therefore, from (4.31), for all $p$-vectors $a$,
$$\Pr\left[t^2(a) \le \frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right] = 1 - \alpha \qquad (4.33)$$
Substituting from (4.32), the confidence statement in (4.33) is:
with probability $1 - \alpha$, for all $p$-vectors $a$,
$$|a^T\bar{x} - a^T\mu| \le \left[\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right]^{1/2}\sqrt{\frac{a^TSa}{n-1}} = K_\alpha\sqrt{\frac{a^TSa}{n-1}}, \text{ say}, \qquad (4.34)$$
where $K_\alpha$ is the constant
$$K_\alpha = \left[\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right]^{1/2} \qquad (4.35)$$
A $100(1-\alpha)\%$ confidence interval for the linear compound $a^T\mu$ is therefore
$$a^T\bar{x} \pm K_\alpha\sqrt{\frac{a^TSa}{n-1}} \qquad (4.36)$$
How can we apply this result? We might be interested in a defined set of linear combinations (linear compounds) of $\mu$. The $i$th component of $\mu$ is, for example, the linear compound defined by $a^T = (0, \dots, 1, \dots, 0)$, the unit vector with a single 1 in the $i$th position. For a large number of such sets of CIs we would expect $100(1-\alpha)\%$ to contain no mis-statements, while $100\alpha\%$ would contain at least one mis-statement.
We can relate the $T^2$ confidence intervals to the $T^2$ test of $H_0: \mu = \mu_0$. If this $H_0$ is rejected at significance level $\alpha$, then there exists at least one vector $a$ such that the interval (4.36) does not include the value $a^T\mu_0$.
NB. If the covariance matrix $S_u$ (with denominator $n-1$) is supplied, then in (4.36) $\sqrt{\frac{a^TSa}{n-1}}$ may be replaced by $\sqrt{\frac{a^TS_ua}{n}}$.
4.13 The Bonferroni method
This provides another way to construct simultaneous CIs for a small number of linear compounds of $\mu$ whilst controlling the overall level of confidence.
Consider a set of events $A_1, A_2, \dots, A_m$:
$$\Pr(A_1 \cap \dots \cap A_m) = 1 - \Pr(\bar{A}_1 \cup \dots \cup \bar{A}_m)$$
From the additive law of probabilities,
$$\Pr(\bar{A}_1 \cup \dots \cup \bar{A}_m) \le \sum_{i=1}^{m}\Pr(\bar{A}_i)$$
Therefore
$$\Pr(A_1 \cap \dots \cap A_m) \ge 1 - \sum_{i=1}^{m}\Pr(\bar{A}_i) \qquad (4.37)$$
Let $C_k$ denote a confidence statement about the value of some linear compound $a_k^T\mu$ with $\Pr(C_k \text{ true}) = 1 - \alpha_k$. Then
$$\Pr(\text{all } C_k \text{ true}) \ge 1 - (\alpha_1 + \dots + \alpha_m) \qquad (4.38)$$
Therefore we can control the overall error rate, given by $\alpha_1 + \dots + \alpha_m = \alpha$, say. For example, in order to construct simultaneous $100(1-\alpha)\%$ CIs for all $p$ components $\mu_k$ of $\mu$ we could choose $\alpha_k = \frac{\alpha}{p}$ ($k = 1, \dots, p$), leading to
$$\bar{x}_1 \pm t_{n-1}\left(\tfrac{\alpha}{2p}\right)\sqrt{\tfrac{s_{11}}{n}}$$
$$\vdots$$
$$\bar{x}_p \pm t_{n-1}\left(\tfrac{\alpha}{2p}\right)\sqrt{\tfrac{s_{pp}}{n}}$$
if $s_{ii}$ derives from $S_u$.
Example
Intelligence scores data on $n = 101$ subjects:
$$\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \end{pmatrix} = \begin{pmatrix} 55.24 \\ 34.97 \end{pmatrix}, \qquad S_U = \begin{pmatrix} 210.54 & 126.99 \\ 126.99 & 119.68 \end{pmatrix}$$
1. Construct 99% simultaneous confidence intervals for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$.
For $\mu_1$ take $a^T = (1, 0)$:
$$a^T\bar{x} = 55.24, \qquad a^TS_Ua = 210.54$$
Now take $\alpha = .01$:
$$K_\alpha = \left[\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right]^{\frac{1}{2}} = \left[\frac{100\times 2}{99}F_{2,99}(.01)\right]^{\frac{1}{2}} = 3.12$$
taking $F_{2,99}(.01) = 4.83$ (approx). Therefore the CI for $\mu_1$ is
$$55.24 \pm 3.12\sqrt{\frac{210.54}{101}} = 55.24 \pm 4.50$$
giving the interval $(50.7, 59.7)$.
For $\mu_2$ we already have $K_\alpha$; take $a^T = (0, 1)$, then
$$a^T\bar{x} = 34.97, \qquad a^TS_Ua = 119.68$$
The CI for $\mu_2$ is
$$34.97 \pm 3.12\sqrt{\frac{119.68}{101}} = 34.97 \pm 3.40$$
giving the interval $(31.6, 38.4)$.
For $\mu_1 - \mu_2$ take $a^T = (1, -1)$:
$$a^T\bar{x} = 20.27, \qquad a^TS_Ua = 210.54 - 2\times 126.99 + 119.68 = 76.24$$
The CI for $\mu_1 - \mu_2$ is
$$20.27 \pm 3.12\sqrt{\frac{76.24}{101}} = 20.27 \pm 2.71 = (17.6, 23.0)$$
2. Construct CIs for $\mu_1, \mu_2$ by the Bonferroni method. Use $\alpha = .01$.
Individual CIs are constructed using $\alpha_k = \frac{.01}{2} = .005$ ($k = 1, 2$). Then
$$t_{100}\left(\tfrac{\alpha_k}{2}\right) = t_{100}(.0025) \simeq \Phi^{-1}(.9975) = 2.81$$
The CI for $\mu_1$ is
$$55.24 \pm 2.81\sqrt{\frac{210.54}{101}} = 55.24 \pm 4.06 = (51.2, 59.3)$$
and for $\mu_2$ is
$$34.97 \pm 2.81\sqrt{\frac{119.68}{101}} = 34.97 \pm 3.06 = (31.9, 38.0)$$
Comparing the CIs obtained by the two methods, we see that the simultaneous CIs for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$ are 8.7% wider than the corresponding Bonferroni CIs.
NB. If we had required 99% Bonferroni CIs for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$, then $m = 3$ in (4.38) and $\frac{\alpha}{m} = \frac{.01}{3}$, so $\frac{\alpha_k}{2} = \frac{.01}{6} \approx .0017$. The corresponding percentage point of $t$ would be
$$t_{100}(.0017) \simeq \Phi^{-1}(.9983) = 2.93$$
leading to slightly wider CIs than obtained above.
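Both constructions can be sketched as follows (scipy assumed for the $F$ and $t$ quantiles; note that scipy's exact $t_{100}$ quantile, about 2.87, is slightly larger than the normal approximation 2.81 used above, so the Bonferroni intervals come out marginally wider):

```python
import numpy as np
from scipy import stats

n, p, alpha = 101, 2, 0.01
xbar = np.array([55.24, 34.97])
SU = np.array([[210.54, 126.99],
               [126.99, 119.68]])

def simultaneous_ci(a):
    # K_alpha from (4.35); standard error sqrt(a' S_u a / n) as in the NB above
    K = np.sqrt((n - 1) * p / (n - p) * stats.f.ppf(1 - alpha, p, n - p))
    half = K * np.sqrt(a @ SU @ a / n)
    return a @ xbar - half, a @ xbar + half

def bonferroni_ci(a, m):
    t = stats.t.ppf(1 - alpha / (2 * m), df=n - 1)
    half = t * np.sqrt(a @ SU @ a / n)
    return a @ xbar - half, a @ xbar + half

for a in (np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, -1.0])):
    print(simultaneous_ci(a))              # the three 99% simultaneous intervals
for a in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    print(bonferroni_ci(a, m=2))           # Bonferroni intervals for mu_1, mu_2
```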
4.14 Two sample procedures
Suppose we have two independent random samples $\{x_{11}, \dots, x_{1n_1}\}$, $\{x_{21}, \dots, x_{2n_2}\}$ of sizes $n_1, n_2$ from two populations
$$\Pi_1: x \sim N_p(\mu_1, \Sigma), \qquad \Pi_2: x \sim N_p(\mu_2, \Sigma)$$
giving rise to sample means $\bar{x}_1, \bar{x}_2$ and sample covariance matrices $S_1, S_2$. Note the assumption of a common covariance matrix $\Sigma$.
We consider testing
$$H_0: \mu_1 = \mu_2 \quad \text{against} \quad H_1: \mu_1 \ne \mu_2$$
Let $d = \bar{x}_1 - \bar{x}_2$. Under $H_0$,
$$d \sim N\left(0, \left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)\Sigma\right)$$
(a) Case of $\Sigma$ known
Analogously to the one-sample case,
$$\left(\frac{n_1n_2}{n_1+n_2}\right)^{\frac{1}{2}}\Sigma^{-\frac{1}{2}}d \sim N(0, I_p)$$
$$\frac{n_1n_2}{n}d^T\Sigma^{-1}d \sim \chi^2_p \quad \text{where } n = n_1 + n_2$$
(b) Case of $\Sigma$ unknown
We have the Wishart-distributed quantities
$$n_1S_1 \sim W_p(\Sigma, n_1 - 1), \qquad n_2S_2 \sim W_p(\Sigma, n_2 - 1)$$
independently. Let
$$S_p = \frac{n_1S_1 + n_2S_2}{n-2}$$
be the pooled estimator of the covariance matrix $\Sigma$. Then, from the additive property of the Wishart distribution, $(n-2)S_p$ has the Wishart distribution $W_p(\Sigma, n-2)$, and
$$\left(\frac{n_1n_2}{n}\right)^{\frac{1}{2}}d \sim N(0, \Sigma)$$
It may be shown that
$$T^2 = \frac{n_1n_2}{n}d^TS_p^{-1}d$$
has the distribution of a Hotelling's $T^2$ statistic. In fact
$$T^2 \sim \frac{(n-2)p}{n-p-1}F_{p,n-p-1} \qquad (4.39)$$
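A sketch of the two-sample test on hypothetical data matrices X1, X2 (the $S_i$ here are taken with divisor $n_i$, matching the pooling formula above):

```python
import numpy as np
from scipy import stats

def two_sample_T2(X1, X2):
    n1, n2 = len(X1), len(X2)
    n, p = n1 + n2, X1.shape[1]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False, ddof=0)         # divisor n1
    S2 = np.cov(X2, rowvar=False, ddof=0)         # divisor n2
    Sp = (n1 * S1 + n2 * S2) / (n - 2)            # pooled estimator
    T2 = (n1 * n2 / n) * d @ np.linalg.solve(Sp, d)
    F = T2 * (n - p - 1) / ((n - 2) * p)          # from (4.39)
    return T2, F, stats.f.sf(F, p, n - p - 1)     # statistic, F value, p-value

rng = np.random.default_rng(3)
X1 = rng.normal(size=(30, 2))
X2 = rng.normal(loc=0.5, size=(40, 2))
print(two_sample_T2(X1, X2))
```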
4.15 Multi-sample procedures (MANOVA)
We consider the case of $k$ samples from populations $\Pi_1, \dots, \Pi_k$. The sample from population $\Pi_i$ is of size $n_i$. By analogy with the univariate case we can decompose the SSP matrix into "orthogonal" parts. This decomposition can be represented as a Multivariate Analysis of Variance (MANOVA) table.
The MANOVA model is
$$x_{ij} = \mu + \tau_i + e_{ij}, \qquad j = 1, \dots, n_i \text{ and } i = 1, \dots, k \qquad (4.40)$$
where the $e_{ij}$ are independent $N_p(0, \Sigma)$ variables. Here the parameter vector $\mu$ is the overall (grand) mean and $\tau_i$ is the $i$th treatment effect, with
$$\sum_{i=1}^{k}n_i\tau_i = 0 \qquad (4.41)$$
Define the $i$th sample mean as $\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}x_{ij}$.
The Between-Groups sum of squares and cross-products (SSP) matrix is
$$B = \sum_{i=1}^{k}n_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T \qquad (4.42)$$
The grand mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{k}n_i\bar{x}_i$, where $n = \sum_{i=1}^{k}n_i$, and the Total SSP matrix is
$$T = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x})(x_{ij} - \bar{x})^T \qquad (4.43)$$
It can be shown algebraically that $T = B + W$, where $W$ is the Within-Groups (or residual) SSP matrix given by
$$W = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T \qquad (4.44)$$
The MANOVA table is

Source of variation    Matrix of SS and cross-products (SSP)                                                  Degrees of freedom (d.f.)
Treatment              $B = \sum_{i=1}^{k}n_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$                     $k - 1$
Residual               $W = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T$          $\sum_{i=1}^{k}n_i - k$
Total (corrected       $T = B + W = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x})(x_{ij} - \bar{x})^T$      $\sum_{i=1}^{k}n_i - 1$
for the mean)
We are interested in testing the hypothesis
$$H_0: \mu_1 = \mu_2 = \dots = \mu_k$$
(whether the samples in fact come from the same population) against the general alternative
$$H_1: \text{the } \mu_i \text{ are not all equal}$$
We can derive a likelihood ratio test statistic known as Wilks' $\Lambda$.
Under $H_0$ the m.l.e.'s are
$$\hat{\mu} = \bar{x}, \qquad \hat{\Sigma} = S$$
leading to the maximized log likelihood (minimum of $-2\log L$)
$$l_0^* = np + n\log|S|$$
Under $H_1$ the m.l.e.'s are
$$\hat{\mu}_i = \bar{x}_i, \qquad \hat{\Sigma} = \frac{1}{n}W$$
where
$$W = \sum_{i=1}^{k}W_i = \sum_{i=1}^{k}n_iS_i$$
This follows from
$$l_1^* = \min_{\Sigma, d_i}\left\{n\log|\Sigma| + \sum_{i=1}^{k}n_i\,\mathrm{tr}\left(\Sigma^{-1}(S_i + d_id_i^T)\right)\right\} = \min_{\Sigma}\left\{n\log|\Sigma| + n\,\mathrm{tr}\left(\Sigma^{-1}\frac{1}{n}\sum_{i=1}^{k}n_iS_i\right)\right\}$$
since $\hat{d}_i = \bar{x}_i - \hat{\mu}_i = 0$. Hence $\hat{\Sigma} = \frac{1}{n}W$ and
$$l_1^* = np + n\log\left|\frac{1}{n}W\right| \qquad (4.45)$$
Therefore, since $T = nS$,
$$l_0^* - l_1^* = -n\log\left(\frac{|W|}{|T|}\right) = -n\log\Lambda \qquad (4.46)$$
where $\Lambda$ is known as Wilks' $\Lambda$ statistic. We reject $H_0$ for small values of $\Lambda$, or large values of $-n\log\Lambda$. Asymptotically, the rejection region is the upper tail of a $\chi^2_{p(k-1)}$: under $H_0$ the unknown $\mu$ has $p$ parameters, and under $H_1$ the number of parameters for $\mu_1, \dots, \mu_k$ is $pk$, hence the d.f. of the $\chi^2$ is $p(k-1)$. Apart from this asymptotic result, other approximate distributions (notably Bartlett's approximation) are available, but the details are outside the scope of this course.
4.15.1 Calculation of Wilks' $\Lambda$
Result
Let $\lambda_1, \dots, \lambda_p$ be the eigenvalues of $W^{-1}B$; then
$$\Lambda = \prod_{j=1}^{p}(1 + \lambda_j)^{-1} \qquad (4.47)$$
Proof
$$\Lambda = |T^{-1}W| = \left|(W + B)^{-1}W\right| = \left|W^{-1}(W + B)\right|^{-1} = \left|I + W^{-1}B\right|^{-1} = \prod_{j=1}^{p}(1 + \lambda_j)^{-1} \qquad (4.48)$$
since the eigenvalues of $I + W^{-1}B$ are $1 + \lambda_j$ (cf. the "useful" identity proved earlier in the notes).
4.15.2 Case k = 2
We show that the use of Wilks' $\Lambda$ for $k = 2$ groups is equivalent to using Hotelling's $T^2$ statistic. Specifically, we show that $\Lambda$ is a monotonic function of $T^2$; thus rejecting $H_0$ for $\Lambda < c_1$ is equivalent to rejecting $H_0$ for $T^2 > c_2$ (for some constants $c_1, c_2$).
Proof
For $k = 2$ we can show (Ex.) that
$$B = \frac{n_1n_2}{n}dd^T \qquad (4.49)$$
where $d = \bar{x}_1 - \bar{x}_2$. Then
$$\left|I + W^{-1}B\right| = \left|I + \frac{n_1n_2}{n}W^{-1}dd^T\right| = 1 + \frac{n_1n_2}{n}d^TW^{-1}d$$
Now $W$ is just $(n-2)S_p$, where $S_p$ is the pooled estimator of $\Sigma$. Thus
$$\Lambda^{-1} = 1 + \frac{T^2}{n-2} \qquad (4.50)$$
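A sketch computing Wilks' $\Lambda$ from the eigenvalues of $W^{-1}B$ for hypothetical group data, using the asymptotic $\chi^2_{p(k-1)}$ approximation for a p-value:

```python
import numpy as np
from scipy import stats

def wilks_lambda(groups):
    k = len(groups)
    n = sum(len(X) for X in groups)
    p = groups[0].shape[1]
    grand = np.vstack(groups).mean(axis=0)
    B = sum(len(X) * np.outer(X.mean(axis=0) - grand, X.mean(axis=0) - grand)
            for X in groups)                                   # between-groups SSP
    W = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0))
            for X in groups)                                   # within-groups SSP
    eig = np.linalg.eigvals(np.linalg.solve(W, B)).real        # eigenvalues of W^{-1}B
    Lam = np.prod(1.0 / (1.0 + eig))
    stat = -n * np.log(Lam)                                    # -n log(Lambda)
    return Lam, stat, stats.chi2.sf(stat, p * (k - 1))

rng = np.random.default_rng(4)
groups = [rng.normal(loc=m, size=(20, 3)) for m in (0.0, 0.2, 0.5)]
print(wilks_lambda(groups))
```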
5. Discriminant Analysis (Classification)
Given $k$ populations (groups) $\Pi_1, \dots, \Pi_k$, an individual from $\Pi_j$ has p.d.f. $f_j(x)$ for a set of $p$ measurements $x$.
The purpose of discriminant analysis is to allocate an individual to one of the groups $\{\Pi_j\}$ on the basis of $x$, making as few "mistakes" as possible. For example, a patient presents at a doctor's surgery with a set of symptoms $x$. The symptoms suggest a number of possible disease groups $\{\Pi_j\}$ to which the patient might belong. What is the most likely diagnosis?
The aim initially is to find a partition of $\mathbb{R}^p$ into disjoint regions $R_1, \dots, R_k$, together with a decision rule
$$x \in R_j \Rightarrow \text{allocate } x \text{ to } \Pi_j$$
The decision rule will be more accurate if "$\Pi_j$ has most of its probability concentrated in $R_j$" for each $j$.
5.1 The maximum likelihood (ML) rule
Allocate $x$ to the population $\Pi_j$ that gives the largest likelihood to $x$; choose $j$ by
$$L_j(x) = \max_{1\le i\le k}L_i(x)$$
(break ties arbitrarily).
Result 1
If the $\{\Pi_i\}$ are multivariate normal (MVN) populations $N_p(\mu_i, \Sigma)$ for $i = 1, \dots, k$, the ML rule allocates $x$ to the population $\Pi_i$ that minimizes the Mahalanobis distance between $x$ and $\mu_i$.
Proof
$$L_i(x) = |2\pi\Sigma|^{-\frac{1}{2}}\exp\left\{-\tfrac{1}{2}(x - \mu_i)^T\Sigma^{-1}(x - \mu_i)\right\}$$
so the likelihood is maximized when the exponent is minimized.
Result 2
When $k = 2$ the ML rule allocates $x$ to $\Pi_1$ if
$$d^T(x - \mu) > 0 \qquad (5.1)$$
where $d = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \tfrac{1}{2}(\mu_1 + \mu_2)$, and to $\Pi_2$ otherwise.
Proof
For the two-group case, the ML rule is to allocate $x$ to $\Pi_1$ if
$$(x - \mu_1)^T\Sigma^{-1}(x - \mu_1) < (x - \mu_2)^T\Sigma^{-1}(x - \mu_2)$$
which reduces to
$$2d^Tx > \mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2 = (\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 + \mu_2) = d^T(\mu_1 + \mu_2)$$
Hence the result. The function
$$h(x) = (\mu_1 - \mu_2)^T\Sigma^{-1}\left[x - \tfrac{1}{2}(\mu_1 + \mu_2)\right] \qquad (5.2)$$
is known as the discriminant function (DF). In this case the DF is linear in $x$.
5.2 Sample ML rule
In practice $\mu_1, \mu_2, \Sigma$ are estimated by, respectively, $\bar{x}_1, \bar{x}_2, S_P$, where $S_P$ is the pooled (unbiased) estimator of the covariance matrix.
Example
The eminent statistician R. A. Fisher took measurements on samples of size 50 of three types of iris. Two of the variables, $x_1$ = sepal length and $x_2$ = sepal width, gave the following data on species I and II:
$$\bar{x}_1 = \begin{pmatrix} 5.0 \\ 3.4 \end{pmatrix}, \quad \bar{x}_2 = \begin{pmatrix} 6.0 \\ 2.8 \end{pmatrix}, \quad S_1 = \begin{pmatrix} .12 & .10 \\ .10 & .14 \end{pmatrix}, \quad S_2 = \begin{pmatrix} .26 & .08 \\ .08 & .10 \end{pmatrix}$$
(The data have been rounded for clarity.)
$$S_p = \frac{50S_1 + 50S_2}{98} = \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix}$$
Hence
$$d = S_p^{-1}(\bar{x}_1 - \bar{x}_2) = \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix}^{-1}\begin{pmatrix} -1.0 \\ 0.6 \end{pmatrix} = \begin{pmatrix} -11.4 \\ 14.1 \end{pmatrix}$$
$$\hat{\mu} = \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) = \begin{pmatrix} 5.5 \\ 3.1 \end{pmatrix}$$
giving the rule: allocate $x$ to $\Pi_1$ if
$$-11.4(x_1 - 5.5) + 14.1(x_2 - 3.1) > 0$$
i.e. if
$$-11.4x_1 + 14.1x_2 + 19.0 > 0$$
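The sample rule can be reproduced from the (rounded) summary statistics; the computed d differs slightly from (-11.4, 14.1) because of the rounding:

```python
import numpy as np

xbar1 = np.array([5.0, 3.4])
xbar2 = np.array([6.0, 2.8])
S1 = np.array([[0.12, 0.10], [0.10, 0.14]])
S2 = np.array([[0.26, 0.08], [0.08, 0.10]])
Sp = (50 * S1 + 50 * S2) / 98                  # pooled covariance

d = np.linalg.solve(Sp, xbar1 - xbar2)         # close to the text's (-11.4, 14.1)
mid = 0.5 * (xbar1 + xbar2)                    # (5.5, 3.1)

def allocate(x):
    """Sample ML rule: group 1 if the discriminant function is positive."""
    return 1 if d @ (x - mid) > 0 else 2

print(d, allocate(np.array([5.1, 3.5])), allocate(np.array([6.2, 2.9])))
```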
5.3 Misclassification probabilities
The misclassification probabilities $p_{ij}$, defined as
$$p_{ij} = \Pr[\text{allocate to } \Pi_i \text{ when in fact from } \Pi_j]$$
form a $k \times k$ matrix, of which the diagonal elements $\{p_{ii}\}$ are a measure of the classifier's accuracy.
For the case $k = 2$,
$$p_{12} = \Pr[h(x) > 0 \mid \Pi_2]$$
Since $h(x) = d^T(x - \mu)$ is a linear compound of $x$, it has a (univariate) normal distribution.
Given that $x \in \Pi_2$:
$$E[h(x)] = d^T\left[\mu_2 - \tfrac{1}{2}(\mu_1 + \mu_2)\right] = \tfrac{1}{2}d^T(\mu_2 - \mu_1) = -\tfrac{1}{2}\Delta^2$$
where $\Delta^2 = (\mu_2 - \mu_1)^T\Sigma^{-1}(\mu_2 - \mu_1)$ is the Mahalanobis distance between $\mu_2$ and $\mu_1$.
The variance of $h(x)$ is
$$d^T\Sigma d = (\mu_2 - \mu_1)^T\Sigma^{-1}\Sigma\Sigma^{-1}(\mu_2 - \mu_1) = (\mu_2 - \mu_1)^T\Sigma^{-1}(\mu_2 - \mu_1) = \Delta^2$$
Hence
$$\Pr[h(x) > 0] = \Pr\left[\frac{h(x) + \tfrac{1}{2}\Delta^2}{\Delta} > \frac{\tfrac{1}{2}\Delta^2}{\Delta}\right] = \Pr\left[Z > \tfrac{1}{2}\Delta\right] = 1 - \Phi\left(\tfrac{1}{2}\Delta\right) = \Phi\left(-\tfrac{1}{2}\Delta\right) \qquad (5.3)$$
By symmetry this is also $p_{21}$, i.e. $p_{12} = p_{21}$.
Example (contd.)
We can estimate the misclassification probability from the sample Mahalanobis distance between $\bar{x}_2$ and $\bar{x}_1$:
$$D^2 = (\bar{x}_2 - \bar{x}_1)^TS_p^{-1}(\bar{x}_2 - \bar{x}_1) = \begin{pmatrix} -1.0 & 0.6 \end{pmatrix}\begin{pmatrix} -11.4 \\ 14.1 \end{pmatrix} \simeq 19.9$$
$$\Phi\left(-\tfrac{1}{2}D\right) = \Phi(-2.23) = 0.013$$
The estimated misclassification rate is 1.3%.
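Continuing the sketch above, the misclassification estimate $\Phi(-D/2)$ can be computed directly from the rounded summary statistics (scipy assumed for $\Phi$; the result is close to, but not exactly, the 19.9 and 0.013 quoted, again because of rounding):

```python
import numpy as np
from scipy import stats

xbar1, xbar2 = np.array([5.0, 3.4]), np.array([6.0, 2.8])
S1 = np.array([[0.12, 0.10], [0.10, 0.14]])
S2 = np.array([[0.26, 0.08], [0.08, 0.10]])
Sp = (50 * S1 + 50 * S2) / 98

D2 = (xbar2 - xbar1) @ np.linalg.solve(Sp, xbar2 - xbar1)  # sample Mahalanobis distance D^2
p_mis = stats.norm.cdf(-0.5 * np.sqrt(D2))                 # Phi(-D/2)
print(D2, p_mis)                                           # roughly 19.8 and about 0.013
```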