1. Introduction to multivariate data
1.1 Books
Chatfield, C. and A. J. Collins, Introduction to Multivariate Analysis. Chapman & Hall.
Krzanowski, W. J., Principles of Multivariate Analysis. Oxford, 2000.
Johnson, R. A. and D. W. Wichern, Applied Multivariate Statistical Analysis. Prentice Hall.
1.2 Applications
The need often arises in science, medicine and social science (business, management) to analyse data on $p$ variables (note that $p = 2$ means the data are bivariate).
Suppose we have a simple random sample of size $n$. The sample consists of $n$ vectors of measurements on $p$ variates, i.e. $n$ $p$-vectors (by convention column vectors) $x_1, \dots, x_n$, which are inserted as rows $x_1^T, \dots, x_n^T$ into an $(n \times p)$ data matrix $X$. When $p = 2$ we can plot the rows in 2-dimensional space, but in higher dimensions, $p > 2$, other techniques are needed.
Example 1
Classification of plants (taxonomy)
Variables: ($p = 3$) leaf size ($x_1$), colour of flower ($x_2$), height of plant ($x_3$)
Sample items: $n = 4$ plants from a single species
Aims of analysis:
i) understand within-species variability
ii) classify a new plant species
The data matrix may appear as follows:

                 Variables
             x1     x2    x3
Plants   1   6.1    2     12
(Items)  2   8.2    1      8
         3   5.3    0      9
         4   6.4    2     10
Example 2
Credit scoring
Variables: personal data held by the bank
Items: sample of good/bad customers
Aims of analysis:
i) predict potential defaulters (CRM)
ii) risk assessment for a new applicant
Example 3
Image processing, e.g. for quality control
Variables: "features" extracted from an image
Items: sampled from a production line
Aims of analysis:
i) quantify "normal" variability
ii) reject faulty (off-specification) batches
1.3 Sample mean and covariance matrix
We shall adopt the following notation:
$x$ ($p \times 1$): a random vector of observations on $p$ variables.
$X$ ($n \times p$): a data matrix whose rows contain an independent random sample $x_1^T, \dots, x_n^T$ of observations on $x$.
$\bar{x}$ ($p \times 1$): sample mean vector, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
$S$ ($p \times p$): sample covariance matrix containing the sample covariances defined as
$$s_{jk} = \frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$
$R$ ($p \times p$): sample correlation matrix containing the sample correlations defined as
$$r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj} s_{kk}}} = \frac{s_{jk}}{s_j s_k}, \text{ say}$$
Notes
1. $\bar{x}_j$ is defined as the $j$th component of $\bar{x}$ (the mean of variable $j$)
2. the covariance matrix $S$ is square, symmetric ($S = S^T$), and holds the sample variances $s_{jj} = s_j^2 = \frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$ along its main diagonal
3. the diagonal elements of $R$ are $r_{jj} = 1$, and $-1 \le r_{jk} \le 1$ for each $j, k$
1.4 Matrix-vector representations
Given an $(n \times p)$ data matrix $X$, define the $n$-vector of ones
$$\mathbf{1} = (1, 1, \dots, 1)^T$$
The column sums of $X$ (i.e. the sums over the $n$ rows, one for each variable) are obtained by pre-multiplying $X$ by $\mathbf{1}^T$:
$$\mathbf{1}^T X = \left(\sum_{i=1}^{n} x_{i1}, \dots, \sum_{i=1}^{n} x_{ip}\right) = (n\bar{x}_1, \dots, n\bar{x}_p) = n\bar{x}^T$$
Hence
$$\bar{x} = \frac{1}{n} X^T \mathbf{1} \qquad (1.1)$$
The centred data matrix $X'$ is derived from $X$ by subtracting the variable mean from each element of $X$, i.e. $x'_{ij} = x_{ij} - \bar{x}_j$, or, equivalently, by subtracting the constant vector $\bar{x}^T$ from each row of $X$:
$$X' = X - \mathbf{1}\bar{x}^T = X - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T X = \left(I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T\right) X = HX \qquad (1.2)$$
where $H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T$ is known as the centring matrix. We now define the sample covariance matrix as $\tfrac{1}{n}$ times the centred sum of squares and products (SSP) matrix:
$$S = \tfrac{1}{n} X'^T X' \qquad (1.3a)$$
$$\;\; = \tfrac{1}{n}\sum_{i=1}^{n} x'_i x'^T_i \qquad (1.3b)$$
where $x'_i = x_i - \bar{x}$ denotes the $i$th mean-corrected data point.
For any real $p$-vector $y$ we then have
$$y^T S y = \tfrac{1}{n} y^T X'^T X' y = \tfrac{1}{n} z^T z \quad \text{where } z = X'y$$
$$\;\; = \tfrac{1}{n}\|z\|^2 \ge 0$$
Hence, from the definition of a p.s.d. matrix, we have
Proposition 1
The sample covariance matrix $S$ is positive semi-definite (p.s.d.).
Example
Two measurements $x_1, x_2$ made at the same position on each of 3 cans of food resulted in the following $X$-matrix:
$$X = \begin{pmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{pmatrix}$$
Find the sample mean vector $\bar{x}$ and covariance matrix $S$.
Solution
$$X = \begin{pmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{pmatrix} = (x_1, x_2, x_3)^T$$
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{3} x_i = \frac{1}{3}\left[\begin{pmatrix}4\\1\end{pmatrix} + \begin{pmatrix}-1\\3\end{pmatrix} + \begin{pmatrix}3\\5\end{pmatrix}\right] = \begin{pmatrix}2\\3\end{pmatrix}$$
$$X' = \begin{pmatrix} 2 & -2 \\ -3 & 0 \\ 1 & 2 \end{pmatrix}$$
$$S = \frac{1}{3} X'^T X' = \begin{pmatrix} \tfrac{14}{3} & -\tfrac{2}{3} \\ -\tfrac{2}{3} & \tfrac{8}{3} \end{pmatrix} = \begin{pmatrix} 4.67 & -0.67 \\ -0.67 & 2.67 \end{pmatrix}$$
Note also that $S$ is built up from the individual data points:
$$S = \frac{1}{3}\left[\begin{pmatrix}2\\-2\end{pmatrix}\begin{pmatrix}2 & -2\end{pmatrix} + \begin{pmatrix}-3\\0\end{pmatrix}\begin{pmatrix}-3 & 0\end{pmatrix} + \begin{pmatrix}1\\2\end{pmatrix}\begin{pmatrix}1 & 2\end{pmatrix}\right]$$
and
$$R = \begin{pmatrix} 1 & -0.189 \\ -0.189 & 1 \end{pmatrix}$$

1.5 Measures of multivariate scatter
It is useful to have a single number as a measure of spread in the data. Based on $S$ we define two scalar quantities.
The total variation is
$$\mathrm{tr}(S) = \mathrm{trace}(S) = \sum_{j=1}^{p} s_{jj} = \text{sum of diagonal elements} = \text{sum of eigenvalues of } S$$
The generalized variance is
$$|S| = \text{product of eigenvalues of } S \qquad (1.5)$$
In the above example
$$\mathrm{tr}(S) = \tfrac{14}{3} + \tfrac{8}{3} = 7.33$$
$$|S| = \tfrac{14}{3}\cdot\tfrac{8}{3} - \left(-\tfrac{2}{3}\right)^2 = 12$$
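The quantities in this example can be checked with a short numpy sketch (illustrative only; variable names are arbitrary):

```python
import numpy as np

# Data matrix for the cans-of-food example (n = 3 items, p = 2 variables)
X = np.array([[4.0, 1.0],
              [-1.0, 3.0],
              [3.0, 5.0]])
n = X.shape[0]

xbar = X.mean(axis=0)                    # sample mean vector (2, 3)
Xc = X - xbar                            # centred data matrix X'
S = Xc.T @ Xc / n                        # covariance matrix with divisor n
R = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))  # correlation matrix

print(xbar)              # [2. 3.]
print(S)                 # [[ 4.67 -0.67] [-0.67  2.67]]
print(R)                 # off-diagonal approx -0.189
print(np.trace(S))       # total variation approx 7.33
print(np.linalg.det(S))  # generalized variance approx 12
```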
1.6 Random vectors
We will in this course generally regard the data as an independent random sample from some continuous population distribution with a probability density function
$$f(x) = f(x_1, \dots, x_p) \qquad (1.6)$$
Here $x = (x_1, \dots, x_p)$ is regarded as a vector of $p$ random variables. Independence here refers to the rows of the data matrix. If two of the variables (columns) are, for example, height and weight of individuals (rows), then knowing one individual's weight says nothing about any other individual. However, the height and weight for any individual are correlated.
For any region $D$ in the $p$-space of the variables,
$$\Pr(x \in D) = \int_D f(x)\, dx$$
Mean vector
For any $j$ the population mean of $x_j$ is given by the $p$-fold integral
$$E(x_j) = \mu_j = \int x_j f(x)\, dx$$
where the region of integration is $\mathbb{R}^p$.
In vector form
$$\mu = E(x) = \begin{pmatrix} E(x_1) \\ \vdots \\ E(x_p) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} \qquad (1.7)$$
Covariance matrix
The covariance between $x_j, x_k$ is defined as
$$\sigma_{jk} = \mathrm{Cov}(x_j, x_k) = E\left[(x_j - \mu_j)(x_k - \mu_k)\right] = E[x_j x_k] - \mu_j \mu_k$$
When $j = k$ we obtain the variance of $x_j$:
$$\sigma_{jj} = E\left[(x_j - \mu_j)^2\right]$$
The covariance matrix is a $p \times p$ matrix
$$\Sigma = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}$$
The alternative notations $V(x) = \mathrm{Cov}(x) = \Sigma$ are used.
In matrix form
$$\Sigma = E\left[(x - \mu)(x - \mu)^T\right] \qquad (1.8a)$$
$$\;\; = E\left[x x^T\right] - \mu\mu^T \qquad (1.8b)$$
More generally, we define the covariance between two random vectors $x$ ($p \times 1$) and $y$ ($q \times 1$) as the $(p \times q)$ matrix
$$\mathrm{Cov}(x, y) = E\left[(x - \mu_x)(y - \mu_y)^T\right] \qquad (1.9)$$
Important property of $\Sigma$
$\Sigma$ is a positive semi-definite matrix.
Proof
Let $a$ ($p \times 1$) be a constant vector; then
$$E(a^T x) = a^T E(x) = a^T\mu$$
and
$$V(a^T x) = E\left[(a^T x - a^T\mu)^2\right] = a^T E\left[(x - \mu)(x - \mu)^T\right] a = a^T \Sigma a$$
Since a variance is always a non-negative quantity we find $a^T \Sigma a \ge 0$. From the definition (see handout), $\Sigma$ is a positive semi-definite (p.s.d.) matrix.
Suppose we have an independent random sample $x_1, x_2, \dots, x_n$ from a distribution with mean $\mu$ and covariance matrix $\Sigma$. What is the relation between (a) the sample and population means, (b) the sample and population covariance matrices?
Result 1
We first establish the mean and covariance of the sample mean $\bar{x}$:
$$E(\bar{x}) = \mu \qquad (1.10a)$$
$$V(\bar{x}) = \tfrac{1}{n}\Sigma \qquad (1.10b)$$
Proof
$$E(\bar{x}) = \frac{1}{n} E\left(\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \mu$$
$$V(\bar{x}) = \mathrm{Cov}\left(\frac{1}{n}\sum_{i=1}^{n} x_i,\ \frac{1}{n}\sum_{j=1}^{n} x_j\right) = \frac{1}{n^2}\cdot n\Sigma$$
noting that $\mathrm{Cov}(x_i, x_i) = \Sigma$ and $\mathrm{Cov}(x_i, x_j) = 0$ for $i \ne j$. Hence
$$V(\bar{x}) = \frac{1}{n}\Sigma$$
Result 2
We now examine $S$ and derive an unbiased estimator for $\Sigma$:
$$E(S) = \frac{n-1}{n}\Sigma \qquad (1.11)$$
Proof
$$S = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T - \bar{x}\bar{x}^T$$
since $\frac{1}{n}\sum_{i=1}^{n} x_i\bar{x}^T = \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\bar{x}^T = \bar{x}\bar{x}^T$.
From (1.8b) and (1.10b) we see that
$$E(x_i x_i^T) = \Sigma + \mu\mu^T$$
$$E(\bar{x}\bar{x}^T) = \tfrac{1}{n}\Sigma + \mu\mu^T$$
hence
$$E(S) = \Sigma + \mu\mu^T - \left(\tfrac{1}{n}\Sigma + \mu\mu^T\right) = \frac{n-1}{n}\Sigma$$
Therefore an unbiased estimate of $\Sigma$ is
$$S_u = \frac{n}{n-1} S = \frac{1}{n-1} X'^T X' \qquad (1.12)$$
1.7 Linear transformations
Let $x = (x_1, \dots, x_p)^T$ be a random $p$-vector. It is often natural and useful to consider linear combinations of the components of $x$, such as for example $y_1 = x_1 + x_2$ or $y_2 = x_1 + 2x_3 - x_4$. In general we consider a transformation from the $p$-component vector $x$ to a $q$-component vector $y$ ($q < p$) given by
$$y = Ax + b \qquad (1.13)$$
where $A$ ($q \times p$) is a constant matrix and $b$ ($q \times 1$) a constant vector.
Suppose that $E(x) = \mu$ and $V(x) = \Sigma$; the corresponding expressions for $y$ are
$$E(y) = A\mu + b \qquad (1.14a)$$
$$V(y) = A\Sigma A^T \qquad (1.14b)$$
These follow from the linearity of the expectation operator:
$$E(y) = E(Ax + b) = AE(x) + E(b) = A\mu + b = \mu_y, \text{ say}$$
and
$$V(y) = E(yy^T) - \mu_y\mu_y^T$$
$$= E\left[(Ax + b)(Ax + b)^T\right] - (A\mu + b)(A\mu + b)^T$$
$$= AE(xx^T)A^T + AE(x)b^T + bE(x^T)A^T + bb^T - A\mu\mu^T A^T - A\mu b^T - b\mu^T A^T - bb^T$$
$$= A\left[E(xx^T) - \mu\mu^T\right]A^T = A\Sigma A^T \text{ as required}$$
1.8 The Mahalanobis transformation
Given a $p$-variate random variable $x$ with $E(x) = \mu$ and $V(x) = \Sigma$, a transformation to a standardized set of uncorrelated variates is given by the Mahalanobis transformation.
Suppose $\Sigma$ is positive definite, i.e. there is no exact linear dependence in $x$. Then the inverse covariance matrix $\Sigma^{-1}$ has a "square root" $\Sigma^{-\frac{1}{2}}$ given by
$$\Sigma^{-\frac{1}{2}} = V\Lambda^{-\frac{1}{2}}V^T \qquad (1.15)$$
where $\Sigma = V\Lambda V^T$ is the spectral decomposition (see handout), i.e. $V$ is an orthogonal matrix ($V^T V = VV^T = I_p$) whose columns are the eigenvectors of $\Sigma$, and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ holds the corresponding eigenvalues. The Mahalanobis transformation takes the form
$$z = \Sigma^{-\frac{1}{2}}(x - \mu) \qquad (1.16)$$
Using results (1.14a) and (1.14b) we can show that
$$E(z) = 0, \qquad V(z) = I_p$$
Proof
$$E(z) = E\left[\Sigma^{-\frac{1}{2}}(x - \mu)\right] = \Sigma^{-\frac{1}{2}}\left[E(x) - \mu\right] = 0$$
$$V(z) = \Sigma^{-\frac{1}{2}}\Sigma\Sigma^{-\frac{1}{2}} = I_p$$
1.8.1 Sample Mahalanobis transformation
Given a data matrix $X^T = (x_1, \dots, x_n)$, the sample Mahalanobis transformation $z_i = S^{-\frac{1}{2}}(x_i - \bar{x})$ for $i = 1, \dots, n$, where $S = S_x$ is the sample covariance matrix $\frac{1}{n-1}X^T H X$, creates a transformed data matrix $Z^T = (z_1, \dots, z_n)$. The data matrices are related by
$$Z^T = S^{-\frac{1}{2}} X^T H \quad \text{or} \quad Z = HXS^{-\frac{1}{2}} \qquad (1.17)$$
where $H$ is the centring matrix. We may easily show (Ex.) that $Z$ is centred and that $S_z = I_p$.
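A minimal numpy sketch of the sample Mahalanobis transformation, assuming the divisor $n-1$ used in the text and obtaining $S^{-1/2}$ from the spectral decomposition:

```python
import numpy as np

def mahalanobis_transform(X):
    """Return Z = H X S^{-1/2}, which has zero mean and identity covariance."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    S = (X.T @ H @ X) / (n - 1)                  # sample covariance (divisor n-1)
    lam, V = np.linalg.eigh(S)                   # spectral decomposition S = V diag(lam) V^T
    S_inv_sqrt = V @ np.diag(lam ** -0.5) @ V.T  # S^{-1/2}
    return H @ X @ S_inv_sqrt

X = np.random.default_rng(0).normal(size=(100, 3))
Z = mahalanobis_transform(X)
print(np.allclose(Z.mean(axis=0), 0))                    # True: Z is centred
print(np.allclose(np.cov(Z, rowvar=False), np.eye(3)))   # True: S_z = I_p
```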
1.8.2 Sample scaling transformation
A transformation of the data that scales each variable to have mean zero and variance one, but preserves the correlation structure, is given by $y_i = D^{-1}(x_i - \bar{x})$ for $i = 1, \dots, n$, where $D = \mathrm{diag}(s_1, \dots, s_p)$. Now
$$Y^T = D^{-1}X^T H \quad \text{or} \quad Y = HXD^{-1} \qquad (1.18)$$
Ex. Show that $S_y = R_x$.
1.8.3 A useful matrix identity
Let $u, v$ be $n$-vectors and form the $n \times n$ matrix $A = uv^T$. Then
$$|I + uv^T| = 1 + v^T u \qquad (1.19)$$
Proof
First observe that $A$ and $I + A$ share a common set of eigenvectors, since $Av = \lambda v \Rightarrow (I + A)v = (1 + \lambda)v$. Moreover, the eigenvalues of $I + A$ are $1 + \lambda_i$, where $\lambda_i$ are the eigenvalues of $A$.
Now $uv^T$ is a rank one matrix and therefore has a single nonzero eigenvalue (see handout). Since $(uv^T)u = u(v^T u) = \lambda u$ where $\lambda = v^T u$, the eigenvalues of $I + uv^T$ are $1 + \lambda, 1, \dots, 1$. The determinant of $I + uv^T$ is the product of the eigenvalues, hence the result.
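The identity is easy to check numerically, for example:

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.normal(size=5), rng.normal(size=5)
lhs = np.linalg.det(np.eye(5) + np.outer(u, v))  # |I + u v^T|
rhs = 1 + v @ u                                  # 1 + v^T u
print(np.isclose(lhs, rhs))                      # True
```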
2. Principal Components Analysis
2.1 Outline of technique
Let $x^T = (x_1, x_2, \dots, x_p)$ be a random vector with mean $\mu$ and covariance matrix $\Sigma$. PCA is a technique for dimensionality reduction from $p$ dimensions to $k < p$ dimensions. It tries to find, in order, the $k$ most informative linear combinations $y_1, y_2, \dots, y_k$ of the variables. Here "information" will be interpreted as a percentage of the total variation (as previously defined) in $\Sigma$. The $k$ sample PCs that "explain" $x\%$ of the total variation in a sample covariance matrix $S$ may be similarly defined.
2.2 Formulation
Let
$$y_1 = a_1^T x, \quad y_2 = a_2^T x, \quad \dots, \quad y_p = a_p^T x$$
where the $y_j = a_{1j}x_1 + a_{2j}x_2 + \dots + a_{pj}x_p$ are a sequence of standardized linear combinations (SLCs) of the $x$'s such that $a_j^T a_j = 1$ and $a_j^T a_k = 0$ for $j \ne k$, i.e. $a_1, a_2, \dots, a_p$ form an orthonormal set of $p$-vectors. Equivalently we may define $A$, the $p \times p$ matrix formed from the columns $\{a_j\}$, as an orthogonal matrix, so that $A^T A = AA^T = I_p$.
We choose $a_1$ to maximize
$$\mathrm{Var}(y_1) = a_1^T\Sigma a_1$$
subject to $a_1^T a_1 = 1$. Then we choose $a_2$ to maximize
$$\mathrm{Var}(y_2) = a_2^T\Sigma a_2$$
subject to $a_2^T a_2 = 1$ and $a_2^T a_1 = 0$, which ensures that $y_2$ will be uncorrelated with $y_1$. Subsequent PCs are chosen as the SLCs that have maximum variance subject to being uncorrelated with previous PCs.
NB. Sometimes the PCs are taken to be "mean-corrected" linear transformations of the $x$'s, i.e.
$$y_j = a_j^T(x - \mu)$$
emphasizing that the PCs can be considered as direction vectors in $p$-space, relative to the "centre" of the distribution, along which the spread is maximized. In any case $\mathrm{Var}(y_j)$ is the same whichever definition is used.
2.3 Computation
To find the first PC we use the Lagrange multiplier technique for finding the maximum of a function $f(x)$ subject to an equality constraint $g(x) = 0$. We define the Lagrangean function
$$L(a_1) = a_1^T\Sigma a_1 - \lambda(a_1^T a_1 - 1)$$
where $\lambda$ is a Lagrange multiplier.
Differentiating, we obtain
$$\frac{\partial L}{\partial a_1} = 2\Sigma a_1 - 2\lambda a_1 = 0$$
$$\Sigma a_1 = \lambda a_1$$
Therefore $a_1$ should be chosen to be an eigenvector of $\Sigma$ with eigenvalue $\lambda$. Suppose the eigenvalues of $\Sigma$ are distinct and ranked in decreasing order $\lambda_1 > \lambda_2 > \dots > \lambda_p > 0$. Then
$$\mathrm{Var}(y_1) = a_1^T\Sigma a_1 = \lambda a_1^T a_1 = \lambda$$
Therefore $a_1$ should be chosen as the eigenvector corresponding to the largest eigenvalue of $\Sigma$.
2nd PC
The Lagrangean is
$$L(a_2) = a_2^T\Sigma a_2 - \lambda(a_2^T a_2 - 1) - \delta(a_2^T a_1)$$
where $\lambda, \delta$ are Lagrange multipliers.
$$\frac{\partial L}{\partial a_2} = 2(\Sigma - \lambda I_p)a_2 - \delta a_1 = 0$$
Premultiplying by $a_1^T$ gives
$$2a_1^T\Sigma a_2 - \delta = 0$$
since $a_2^T a_1 = 0$. However
$$a_1^T\Sigma a_2 = a_2^T\Sigma a_1 = \lambda_1 a_2^T a_1 = 0$$
Therefore $\delta = 0$ and $\Sigma a_2 = \lambda a_2$, so $a_2$ is the eigenvector of $\Sigma$ corresponding to the second largest eigenvalue $\lambda_2$.
2.4 Example
The covariance matrix corresponding to scaled (standardized) variables $x_1, x_2$ is
$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$
(in fact a correlation matrix). Note $\Sigma$ has total variation $= 2$.
The eigenvalues of $\Sigma$ are the roots of $|\Sigma - \lambda I| = 0$:
$$\begin{vmatrix} 1-\lambda & \rho \\ \rho & 1-\lambda \end{vmatrix} = 0, \qquad (1-\lambda)^2 - \rho^2 = 0$$
Hence $\lambda = 1 + \rho,\ 1 - \rho$. If $\rho > 0$ then $\lambda_1 = 1 + \rho$, $\lambda_2 = 1 - \rho$. To find $a_1$ we substitute $\lambda_1$ into $\Sigma a_1 = \lambda a_1$. Note: this gives just one equation in terms of the components of $a_1^T = (a_{11}, a_{21})$:
$$-\rho a_{11} + \rho a_{21} = 0$$
so $a_{11} = a_{21}$. Applying the normalization
$$a_1^T a_1 = a_{11}^2 + a_{21}^2 = 1$$
we obtain
$$a_1 = \begin{pmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{pmatrix}, \qquad \text{and similarly} \qquad a_2 = \begin{pmatrix} \tfrac{1}{\sqrt{2}} \\ -\tfrac{1}{\sqrt{2}} \end{pmatrix}$$
so that
$$y_1 = \tfrac{1}{\sqrt{2}}(x_1 + x_2), \qquad y_2 = \tfrac{1}{\sqrt{2}}(x_1 - x_2)$$
are the PCs, explaining respectively $\frac{100(1+\rho)}{2}\%$ and $\frac{100(1-\rho)}{2}\%$ of the total variation. Notice that the PCs are independent of $\rho$, while the proportion of the total variation explained by each PC does depend on $\rho$.
2.5 PCA and spectral decomposition
Since $\Sigma$ (also $S$) is a real symmetric matrix, we know that it has the spectral decomposition (eigenanalysis)
$$\Sigma = A\Lambda A^T = \sum_{i=1}^{p} \lambda_i a_i a_i^T$$
where $\{a_i\}$ are the eigenvectors of $\Sigma$, which we have inserted as columns of the $(p \times p)$ matrix $A$, and $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ are the corresponding eigenvalues.
If some eigenvalues are not distinct, so $\lambda_k = \lambda_{k+1} = \dots = \lambda_l = \lambda$, the eigenvectors are not unique, but we may choose an orthonormal set of eigenvectors to span a subspace of dimension $l - k + 1$ (cf. the major/minor axes of an ellipse $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$ as $b \to a$). Such a situation arises with the equicorrelation matrix (see Class Exercise 1).
The transformation of a random $p$-vector $x$ (corrected for its mean $\mu$) to its set of principal components (PCs) contained in the $p$-vector $y$ is
$$y = A^T(x - \mu)$$
$y_1$ is the linear combination (SLC) of $x$ having maximum variance, $y_2$ is the SLC having maximum variance subject to being uncorrelated with $y_1$, etc. We have seen that $\mathrm{Var}(y_1) = \lambda_1$, $\mathrm{Var}(y_2) = \lambda_2, \dots$
2.6 Explanation of variance
The interpretation of the PCs ($y$) as components of variance "explaining" the total variation, i.e. the sum of the variances of the original variables ($x$), is clarified by the following result.
Result
The sum of the variances of the original variables and the sum of the variances of their PCs are the same.
Proof
A note on trace: the sum of diagonal elements of a $(p \times p)$ square matrix $\Sigma$ is known as the trace of $\Sigma$,
$$\mathrm{tr}(\Sigma) = \sum_{i=1}^{p}\sigma_{ii}$$
We show from this definition that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ whenever $AB$ and $BA$ are defined [i.e. $A$ is $(m \times n)$ and $B$ is $(n \times m)$]:
$$\mathrm{tr}(AB) = \sum_i\sum_j a_{ij}b_{ji} = \sum_j\sum_i b_{ji}a_{ij} = \mathrm{tr}(BA)$$
The sum of the variances of the PCs is
$$\sum_i \mathrm{Var}(y_i) = \sum_i \lambda_i = \mathrm{tr}(\Lambda)$$
Now $\Sigma = A\Lambda A^T$ is the spectral decomposition and $A$ is orthogonal, so $A^T A = I_p$, hence
$$\mathrm{tr}(\Sigma) = \mathrm{tr}(A\Lambda A^T) = \mathrm{tr}(\Lambda A^T A) = \mathrm{tr}(\Lambda)$$
Since $\Sigma$ is the covariance matrix of $x$, the sum of its diagonal elements is the sum of the variances $\sigma_{ii}$ of the original variables. Hence the result is proved. $\square$
Consequence (interpretation of PCs)
It is therefore possible to interpret
$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_p}$$
as the proportion of the total variation in the original data explained by the $i$th principal component, and
$$\frac{\lambda_1 + \dots + \lambda_k}{\lambda_1 + \lambda_2 + \dots + \lambda_p}$$
as the proportion of the total variation explained by the first $k$ PCs.
From a PCA on a $(10 \times 10)$ sample covariance matrix $S$, we could for example conclude that the first 3 PCs (out of a total of $p = 10$ PCs) account for 80% of the total variation in the data. This would mean that the variation in the data is largely confined to a 3-dimensional subspace described by the PCs $y_1, y_2, y_3$.
2.7 Scale invariance
This unfortunately is a property that PCA does not possess!
In practice we often have to choose units of measurement for our individual variables $\{x_i\}$, and the amount of the total variation accounted for by a particular variable $x_i$ depends on this choice (tonnes, kg or grams).
In a practical study the data vector $x$ often comprises physically incomparable quantities (e.g. height, weight, temperature), so there is no "natural scaling" to adopt. One possibility is to perform PCA on a correlation matrix (effectively choosing each variable to have unit sample variance), but this is still an implicit choice of scaling. The main point is that the results of a PCA depend on the scaling adopted.
2.8 Principal component scores
The sample PC transform on a data matrix $X$ takes the form, for the $r$th individual ($r$th row of the sample),
$$y'_r = A^T(x_r - \bar{x})$$
where the columns of $A$ are the eigenvectors of the sample covariance matrix $S$. Notice that the first component of $y'_r$ corresponds to the scalar product of the first column of $A$ with $x'_r$, etc.
The components of $y'_r$ are known as the (mean-corrected) principal component scores for the $r$th individual. The quantities
$$y_r = A^T x_r$$
are the raw PC scores for that individual. Geometrically, the PC scores are the coordinates of each data point with respect to new axes defined by the PCs, i.e. w.r.t. a rotated frame of reference. The scores can provide qualitative information about individuals.
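As an illustration, a sketch of computing the loadings and mean-corrected PC scores from a data matrix (assumptions: covariance taken with divisor $n-1$, eigenvalues sorted into decreasing order; numpy only):

```python
import numpy as np

def pc_scores(X):
    """Return (eigenvalues, loadings A, mean-corrected PC scores Y)."""
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)          # sample covariance matrix
    lam, A = np.linalg.eigh(S)           # spectral decomposition (ascending order)
    order = np.argsort(lam)[::-1]        # sort into decreasing order
    lam, A = lam[order], A[:, order]
    Y = (X - xbar) @ A                   # row r of Y is y'_r = A^T (x_r - xbar)
    return lam, A, Y

X = np.random.default_rng(2).normal(size=(50, 4))
lam, A, Y = pc_scores(X)
print(np.allclose(np.cov(Y, rowvar=False), np.diag(lam)))  # scores are uncorrelated
print(lam / lam.sum())                                     # proportions of variation explained
```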
2.9 Correlation of PCs with original variables
The correlations $\rho(x_i, y_k)$ of the $k$th PC with variable $x_i$ are an aid to interpreting the PCs.
Since $y = A^T(x - \mu)$ we have
$$\mathrm{Cov}(x, y) = E\left[(x - \mu)y^T\right] = E\left[(x - \mu)(x - \mu)^T A\right] = \Sigma A$$
and from the spectral decomposition
$$\Sigma A = (A\Lambda A^T)A = A\Lambda$$
Post-multiplying $A$ by a diagonal matrix $\Lambda$ has the effect of scaling its columns, so that
$$\mathrm{Cov}(x_i, y_k) = \lambda_k a_{ik}$$
is the covariance between the $i$th variable and the $k$th PC.
The correlation is
$$\rho(x_i, y_k) = \frac{\mathrm{Cov}(x_i, y_k)}{\sqrt{\mathrm{Var}(x_i)\mathrm{Var}(y_k)}} = \frac{\lambda_k a_{ik}}{\sqrt{\sigma_{ii}}\sqrt{\lambda_k}} = a_{ik}\left(\frac{\lambda_k}{\sigma_{ii}}\right)^{\frac{1}{2}}$$
and its square can be interpreted as the proportion of the variation in $x_i$ explained by the $k$th PC.
Exercise
Find the PCs of the covariance matrix
$$\Sigma = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}$$
and show that they account for amounts
$$\lambda_1 = 5.83, \quad \lambda_2 = 2.00, \quad \lambda_3 = 0.17$$
of the total variation in $\Sigma$.
Compute the correlations $\rho(x_i, y_k)$ and try to interpret the PCs qualitatively.
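A sketch of how the exercise can be checked numerically: the eigen-decomposition gives the stated eigenvalues directly, and the correlations follow from $\rho(x_i, y_k) = a_{ik}(\lambda_k/\sigma_{ii})^{1/2}$.

```python
import numpy as np

Sigma = np.array([[1.0, -2.0, 0.0],
                  [-2.0, 5.0, 0.0],
                  [0.0,  0.0, 2.0]])
lam, A = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]            # decreasing order
lam, A = lam[order], A[:, order]
print(lam)                               # approx [5.83, 2.00, 0.17]

# correlations rho(x_i, y_k) = a_ik * sqrt(lam_k / sigma_ii)
rho = A * np.sqrt(lam) / np.sqrt(np.diag(Sigma))[:, None]
print(rho)
```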
3. Multivariate Normal Distribution
The MVN distribution is a generalization of the univariate normal distribution, which has the density function (p.d.f.)
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \qquad -\infty < x < \infty$$
where $\mu$ = mean of the distribution and $\sigma^2$ = variance. In $p$ dimensions the density becomes
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left\{-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\} \qquad (3.1)$$
Within the mean vector $\mu$ there are $p$ (independent) parameters and within the symmetric covariance matrix $\Sigma$ there are $\tfrac{1}{2}p(p+1)$ independent parameters [$\tfrac{1}{2}p(p+3)$ independent parameters in total]. We use the notation
$$x \sim N_p(\mu, \Sigma) \qquad (3.2)$$
to denote a random vector $x$ having the MVN distribution with
$$E(x) = \mu, \qquad \mathrm{Cov}(x) = \Sigma$$
Note that MVN distributions are entirely characterized by the first and second moments of the distribution.
Basic properties
If $x$ ($p \times 1$) is MVN with mean $\mu$ and covariance matrix $\Sigma$:
• Any linear combination of $x$ is MVN. Let $y = Ax + c$ with $A$ ($q \times p$) and $c$ ($q \times 1$); then
$$y \sim N_q(\mu_y, \Sigma_y)$$
where $\mu_y = A\mu + c$ and $\Sigma_y = A\Sigma A^T$.
• Any subset of variables in $x$ has a MVN distribution.
• If a set of variables is uncorrelated, then they are independently distributed. In particular,
i) if $\sigma_{ij} = 0$ then $x_i, x_j$ are independent;
ii) if $x$ is MVN with covariance matrix $\Sigma$, then $Ax$ and $Bx$ are independent if and only if
$$\mathrm{Cov}(Ax, Bx) = A\Sigma B^T = 0 \qquad (3.3)$$
• Conditional distributions are MVN.
Result 1
For the MVN distribution, variables are uncorrelated $\Leftrightarrow$ variables are independent.
Proof
Let $x$ ($p \times 1$) be partitioned as
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
where $x_1$ has $q$ components and $x_2$ has $p - q$ components, with mean vector
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$$
and covariance matrix (partitioned conformably)
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$
i) Independent $\Rightarrow$ uncorrelated (always holds).
Suppose $x_1, x_2$ are independent. Then
$$\Sigma_{12} = \mathrm{Cov}(x_1, x_2) = E\left[(x_1 - \mu_1)(x_2 - \mu_2)^T\right]$$
factorizes into the product of $E[(x_1 - \mu_1)]$ and $E[(x_2 - \mu_2)^T]$, which are both zero since $E(x_1) = \mu_1$ and $E(x_2) = \mu_2$. Hence $\Sigma_{12} = 0$.
ii) Uncorrelated $\Rightarrow$ independent (for MVN).
This result depends on factorizing the p.d.f. (3.1) when $\Sigma_{12} = 0$.
In this case $(x - \mu)^T\Sigma^{-1}(x - \mu)$ has the partitioned form
$$\begin{pmatrix} x_1^T - \mu_1^T & x_2^T - \mu_2^T \end{pmatrix}\begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}^{-1}\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}$$
$$= \begin{pmatrix} x_1^T - \mu_1^T & x_2^T - \mu_2^T \end{pmatrix}\begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}$$
$$= (x_1 - \mu_1)^T\Sigma_{11}^{-1}(x_1 - \mu_1) + (x_2 - \mu_2)^T\Sigma_{22}^{-1}(x_2 - \mu_2)$$
so that $\exp\{-\tfrac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\}$ factorizes into the product of $\exp\{-\tfrac{1}{2}(x_1 - \mu_1)^T\Sigma_{11}^{-1}(x_1 - \mu_1)\}$ and $\exp\{-\tfrac{1}{2}(x_2 - \mu_2)^T\Sigma_{22}^{-1}(x_2 - \mu_2)\}$.
Therefore the p.d.f. can be written as
$$f(x) = g(x_1)h(x_2)$$
proving that $x_1$ and $x_2$ are independent. $\square$
Result 2
Let $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ (with $q$ and $p - q$ components) be MVN with mean $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$.
The conditional distribution of $x_2$ given $x_1$ is MVN with
$$E(x_2|x_1) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1) \qquad (3.4a)$$
$$\mathrm{Cov}(x_2|x_1) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \qquad (3.4b)$$
Proof
Let $x'_2 = x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1$. We first show that $x'_2$ and $x_1$ are independent.
Consider the linear transformation
$$\begin{pmatrix} x_1 \\ x'_2 \end{pmatrix} = \begin{pmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \qquad (3.5a)$$
$$= Ax, \text{ say.} \qquad (3.5b)$$
This linear relationship shows that $x_1, x'_2$ are jointly MVN (by the first property of MVN stated above).
We may show that $x_1$ and $x'_2$ are uncorrelated in two ways.
Firstly,
$$\mathrm{Cov}(x_1, x'_2) = \mathrm{Cov}(x_1, x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1) = \mathrm{Cov}(x_1, x_2) - \mathrm{Cov}(x_1, x_1)\Sigma_{11}^{-1}\Sigma_{12} = \Sigma_{12} - \Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} = 0$$
or, if we write $A = \begin{pmatrix} B \\ C \end{pmatrix}$ in (3.5) and apply (3.3),
$$\mathrm{Cov}(x_1, x'_2) = \mathrm{Cov}(Bx, Cx) = B\Sigma C^T$$
$$= \begin{pmatrix} I & 0 \end{pmatrix}\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \end{pmatrix}\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = 0$$
Since $x_1$ and $x'_2$ are MVN and uncorrelated, we have shown that they are independent. Therefore
$$E(x'_2|x_1) = E(x'_2) = E(x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1) = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1$$
Now since $x'_2 = x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1$,
$$E(x_2|x_1) = E(x'_2|x_1) + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1 + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)$$
as required.
Because $x_1$ and $x'_2$ are independent,
$$\mathrm{Cov}(x'_2|x_1) = \mathrm{Cov}(x'_2)$$
Conditional on $x_1$ (a given constant), $x'_2 = x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1$, i.e. $x'_2$ and $x_2$ differ by a constant. Hence
$$\mathrm{Cov}(x_2|x_1) = \mathrm{Cov}(x'_2|x_1)$$
Therefore
$$\mathrm{Cov}(x_2|x_1) = \mathrm{Cov}(x'_2) = C\Sigma C^T$$
where $C = \begin{pmatrix} -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}$, so
$$\begin{pmatrix} -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = \begin{pmatrix} 0 & \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \end{pmatrix}\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$$
Example
Let $x$ have a MVN distribution with covariance matrix
$$\Sigma = \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & 0 \\ \rho^2 & 0 & 1 \end{pmatrix}$$
Show that the conditional distribution of $(x_1, x_2)$ given $x_3$ is also MVN, with mean
$$\begin{pmatrix} \mu_1 + \rho^2(x_3 - \mu_3) \\ \mu_2 \end{pmatrix}$$
and covariance matrix
$$\begin{pmatrix} 1 - \rho^4 & \rho \\ \rho & 1 \end{pmatrix}$$
3.1 Maximum-likelihood estimation
Let $X^T = (x_1, \dots, x_n)$ contain an independent random sample of size $n$ from $N_p(\mu, \Sigma)$. The maximum likelihood estimates (MLEs) of $\mu, \Sigma$ are
$$\hat{\mu} = \bar{x} \qquad (3.6a)$$
$$\hat{\Sigma} = S \qquad (3.6b)$$
The likelihood function is a function of the parameters $\mu, \Sigma$ given the data $X$:
$$L(\mu, \Sigma|X) = \prod_{r=1}^{n} f(x_r|\mu, \Sigma) \qquad (3.7)$$
The RHS is evaluated by substituting the individual data vectors $\{x_1, \dots, x_n\}$ in turn into the p.d.f. of $N_p(\mu, \Sigma)$ and taking the product:
$$\prod_{r=1}^{n} f(x_r|\mu, \Sigma) = (2\pi)^{-\frac{np}{2}}|\Sigma|^{-\frac{n}{2}}\exp\left\{-\frac{1}{2}\sum_{r=1}^{n}(x_r - \mu)^T\Sigma^{-1}(x_r - \mu)\right\}$$
Maximizing $L$ is equivalent to minimizing
$$l = -2\log L = -2\sum_{r=1}^{n}\log f(x_r|\mu, \Sigma) = K + n\log|\Sigma| + \sum_{r=1}^{n}(x_r - \mu)^T\Sigma^{-1}(x_r - \mu)$$
where $K$ is a constant independent of $\mu, \Sigma$.
Noting that $x_r - \mu = (x_r - \bar{x}) + (\bar{x} - \mu)$, the final term above may be written
$$\sum_{r=1}^{n}(x_r - \bar{x})^T\Sigma^{-1}(x_r - \bar{x}) + \sum_{r=1}^{n}(x_r - \bar{x})^T\Sigma^{-1}(\bar{x} - \mu) + \sum_{r=1}^{n}(\bar{x} - \mu)^T\Sigma^{-1}(x_r - \bar{x}) + n(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu)$$
and the two cross terms vanish because $\sum_{r=1}^{n}(x_r - \bar{x}) = 0$. Thus, dropping the constant $K$,
$$l(\mu, \Sigma) = \mathrm{tr}(\Sigma^{-1}A) + nd^T\Sigma^{-1}d \qquad (3.8a)$$
$$= n\left[\mathrm{tr}(\Sigma^{-1}S) + d^T\Sigma^{-1}d\right] \qquad (3.8b)$$
where we define, for ease of notation,
$$A = nS \qquad (3.9a)$$
$$d = \bar{x} - \mu \qquad (3.9b)$$
and $S$ is the sample covariance matrix (with divisor $n$). We have made use of $nS = C^T C$, where $C$ is the $(n \times p)$ centred data matrix
$$C^T = (x_1 - \bar{x}, x_2 - \bar{x}, \dots, x_n - \bar{x})$$
We see that
$$\sum_{r=1}^{n}(x_r - \bar{x})^T\Sigma^{-1}(x_r - \bar{x}) = \mathrm{tr}(C\Sigma^{-1}C^T) = \mathrm{tr}(\Sigma^{-1}C^T C) = \mathrm{tr}(\Sigma^{-1}A) = n\,\mathrm{tr}(\Sigma^{-1}S)$$
Notice that $l = l(\mu, \Sigma)$ and the dependence on $\mu$ is entirely through $d$ in (3.8). Now assume that $\Sigma$ is positive definite (p.d.); then so is $\Sigma^{-1}$ (why?). Thus for all $d \ne 0$ we have $d^T\Sigma^{-1}d > 0$, showing that $l$ is minimized with respect to $\mu$, for fixed $\Sigma$, when $d = 0$.
Hence $\hat{\mu} = \bar{x}$. To minimize the log-likelihood $l(\hat{\mu}, \Sigma)$ w.r.t. $\Sigma$:
$$l(\bar{x}, \Sigma) = n\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}A) = n\left[\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\right]$$
up to an arbitrary additive constant.
Let
$$\phi(\Sigma) = n\left[\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\right] \qquad (3.10)$$
We show that
$$\phi(\Sigma) - \phi(S) = n\left[\log|\Sigma| - \log|S| + \mathrm{tr}(\Sigma^{-1}S) - p\right]$$
$$= n\left[\mathrm{tr}(\Sigma^{-1}S) - \log|\Sigma^{-1}S| - p\right] \qquad (3.11)$$
$$\ge 0$$
Lemma 1
$\Sigma^{-1}S$ is positive definite. (Proved elsewhere.)
Lemma 2
For any set of positive numbers, $A \ge \log G + 1$, where $A$ and $G$ are respectively the arithmetic and geometric means.
Proof
For all $x$ we have $e^x \ge 1 + x$ (simple exercise), hence $y \ge 1 + \log y$ for $y > 0$. For each $y_i > 0$ of a set $i \in \{1, \dots, n\}$, therefore,
$$y_i \ge 1 + \log y_i$$
$$\sum y_i \ge n + \sum\log y_i$$
$$A \ge 1 + \log\left(\prod y_i\right)^{\frac{1}{n}} = 1 + \log G$$
as required.
In (3.11), the eigenvalues of $\Sigma^{-1}S$ are positive (Lemma 1); recall also that for any square matrix $A$ we have $\mathrm{tr}(A) = \sum\lambda_i$, the sum of the eigenvalues, and $|A| = \prod\lambda_i$, the product of the eigenvalues.
Let $\lambda_i$ ($i = 1, \dots, p$) be the eigenvalues of $\Sigma^{-1}S$ and substitute in (3.11):
$$\log|\Sigma^{-1}S| = \log\left(\prod\lambda_i\right) = p\log G$$
$$\mathrm{tr}(\Sigma^{-1}S) = \sum\lambda_i = pA$$
$$\phi(\Sigma) - \phi(S) = np\{A - \log G - 1\} \ge 0$$
This shows that the MLEs are as stated in (3.6).
3.2 Sampling distribution of x̄ and S
The Wishart distribution (definition)
If $M$ ($p \times p$) can be written $M = X^T X$ where $X$ ($m \times p$) is a data matrix from $N_p(0, \Sigma)$, then $M$ is said to have a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom $m$. We write
$$M \sim W_p(\Sigma, m) \qquad (3.12)$$
When $\Sigma = I_p$ the distribution is said to be in standard form.
Note: the Wishart distribution is the multivariate generalization of the chi-square ($\chi^2$) distribution.
Additive property of matrices with a Wishart distribution
Let $M_1$, $M_2$ be matrices having the Wishart distributions
$$M_1 \sim W_p(\Sigma, m_1), \qquad M_2 \sim W_p(\Sigma, m_2)$$
independently; then $M_1 + M_2 \sim W_p(\Sigma, m_1 + m_2)$.
This property follows from the definition of the Wishart distribution, because data matrices are additive in the sense that if
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$
is a combined data matrix consisting of $m_1 + m_2$ rows, then
$$X^T X = X_1^T X_1 + X_2^T X_2$$
is the matrix (known as the "Gram matrix") formed from the combined data matrix $X$.
Case of p = 1
When $p = 1$ we know, from the definition of $\chi^2_r$ as the distribution of the sum of squares of $r$ independent $N(0, 1)$ variates, that
$$M = \sum_{i=1}^{m} x_i^2 \sim \sigma^2\chi^2_m$$
so that $W_1(\sigma^2, m) \equiv \sigma^2\chi^2_m$.
Sampling distributions
Let $x_1, x_2, \dots, x_n$ be a random sample of size $n$ from $N_p(\mu, \Sigma)$. Then
1. The sample mean $\bar{x}$ has the normal distribution
$$\bar{x} \sim N_p\left(\mu, \tfrac{1}{n}\Sigma\right)$$
2. The sample covariance matrix $S$ (MLE: $S = \tfrac{1}{n}C^T C$) has the Wishart distribution
$$nS \sim W_p(\Sigma, n-1)$$
3. The distributions of $\bar{x}$ and $S$ are independent.
3.3 Estimators for special circumstances
3.3.1 $\mu$ proportional to a given vector
Sometimes $\mu$ is known to be proportional to a given vector, so $\mu = k\mu_0$. For example, if $x$ represents a sample of repeated measurements then $\mu = k\mathbf{1}$, where $\mathbf{1} = (1, 1, \dots, 1)^T$ is the $p$-vector of 1's.
We find the MLE of $k$ for this situation. Suppose $\Sigma$ is known and $\mu = k\mu_0$; the log likelihood is
$$l = -2\log L = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - k\mu_0)^T\Sigma^{-1}(\bar{x} - k\mu_0)\right\}$$
We set $\frac{\partial l}{\partial k} = 0$ to minimize $l$ w.r.t. $k$. Expanding the quadratic form,
$$(\bar{x} - k\mu_0)^T\Sigma^{-1}(\bar{x} - k\mu_0) = \bar{x}^T\Sigma^{-1}\bar{x} - 2k\mu_0^T\Sigma^{-1}\bar{x} + k^2\mu_0^T\Sigma^{-1}\mu_0$$
and differentiating with respect to $k$ gives $-2\mu_0^T\Sigma^{-1}\bar{x} + 2k\mu_0^T\Sigma^{-1}\mu_0 = 0$, from which
$$\hat{k} = \frac{\mu_0^T\Sigma^{-1}\bar{x}}{\mu_0^T\Sigma^{-1}\mu_0} \qquad (3.13)$$
We may show that $\hat{k}$ is an unbiased estimator of $k$ and determine the variance of $\hat{k}$.
In (3.13), $\hat{k}$ takes the form $\frac{1}{a}c^T\bar{x}$ with $c^T = \mu_0^T\Sigma^{-1}$ and $a = \mu_0^T\Sigma^{-1}\mu_0$, so
$$E[\hat{k}] = \frac{1}{a}c^T E[\bar{x}] = \frac{k}{a}c^T\mu_0$$
Hence
$$E[\hat{k}] = k \qquad (3.14)$$
showing that $\hat{k}$ is an unbiased estimator.
Noting that $\mathrm{Var}[\bar{x}] = \tfrac{1}{n}\Sigma$, and therefore that $\mathrm{Var}[c^T\bar{x}] = \tfrac{1}{n}c^T\Sigma c$, we have
$$\mathrm{Var}[\hat{k}] = \frac{1}{na^2}c^T\Sigma c = \frac{1}{n\,\mu_0^T\Sigma^{-1}\mu_0} \qquad (3.15)$$
3.3.2 Linear restriction on $\mu$
We determine an estimator for $\mu$ satisfying a linear restriction
$$A\mu = b$$
where $A$ is $(m \times p)$ and $b$ is $(m \times 1)$.
Introduce a vector $\lambda$ of $m$ Lagrange multipliers and seek to minimize
$$n(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu) + 2\lambda^T(A\mu - b)$$
Differentiating w.r.t. $\mu$,
$$-2\Sigma^{-1}(\bar{x} - \mu) + 2A^T\lambda = 0$$
$$\bar{x} - \mu = \Sigma A^T\lambda \qquad (3.16)$$
We use the constraint $A\mu = b$ to evaluate the Lagrange multipliers $\lambda$. Premultiplying by $A$,
$$A\bar{x} - b = A\Sigma A^T\lambda$$
$$\lambda = (A\Sigma A^T)^{-1}(A\bar{x} - b)$$
Substituting into (3.16),
$$\hat{\mu} = \bar{x} - \Sigma A^T(A\Sigma A^T)^{-1}(A\bar{x} - b) \qquad (3.17)$$
3.3.3 Covariance matrix $\Sigma$ proportional to a given matrix
We consider estimating $k$ when $\Sigma = k\Sigma_0$, where $\Sigma_0$ is given.
The likelihood (3.8) takes the form
$$l = n\left\{\log|k\Sigma_0| + \mathrm{tr}\left(\tfrac{1}{k}\Sigma_0^{-1}S\right)\right\}$$
plus terms not involving $k$, i.e.
$$l = n\left\{p\log k + \tfrac{1}{k}\mathrm{tr}(\Sigma_0^{-1}S)\right\} + \text{const}$$
$$\frac{dl}{dk} = n\left\{\frac{p}{k} - \frac{1}{k^2}\mathrm{tr}(\Sigma_0^{-1}S)\right\} = 0$$
Hence
$$\hat{k} = \frac{\mathrm{tr}(\Sigma_0^{-1}S)}{p} \qquad (3.18)$$
4. Hypothesis testing (Hotelling's $T^2$ statistic)
Consider the test of hypothesis
$$H_0: \mu = \mu_0 \quad \text{against} \quad H_A: \mu = \mu_1 \ne \mu_0$$
4.1 The Union-Intersection Principle
We accept the hypothesis $H_0$ as valid if and only if
$$H_0(a): a^T\mu = a^T\mu_0$$
is accepted for all $a$. [$H_0$ is, in this sense, the intersection of all such hypotheses.]
For fixed $a$ we set $y = a^T x$, so that in the population, under $H_0$,
$$E(y) = a^T\mu_0, \qquad \mathrm{Var}(y) = a^T\Sigma a$$
and in our sample
$$\bar{y} = a^T\bar{x}, \qquad s.e.(\bar{y}) = \sqrt{\frac{a^T S a}{n-1}}$$
The univariate t-statistic for testing $H_0(a)$ against the alternative $\mu(y) \ne a^T\mu_0$ is
$$t(a) = \frac{\bar{y} - a^T\mu_0}{s.e.(\bar{y})} = \frac{\sqrt{n-1}\,a^T(\bar{x} - \mu_0)}{\sqrt{a^T S a}}$$
The acceptance region for $H_0(a)$ takes the form $t^2(a) \le R$ for some $R$. The multivariate acceptance region is the intersection
$$\bigcap_a\left\{t^2(a) \le R\right\} \qquad (4.1)$$
which holds if and only if $\max_a t^2(a) \le R$. Therefore we adopt $\max_a t^2(a)$ as the test statistic for $H_0$. Equivalently:
$$\text{maximize } (n-1)\,a^T(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T a \quad \text{subject to} \quad a^T S a = 1 \qquad (4.2)$$
Writing $d = \bar{x} - \mu_0$, we introduce a Lagrange multiplier $\lambda$ and seek to determine $\lambda$ and $a$ to satisfy
$$\frac{d}{da}\left[a^T(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T a - \lambda a^T S a\right] = 0$$
$$dd^T a - \lambda Sa = 0 \qquad (4.3a)$$
$$(S^{-1}dd^T - \lambda I)a = 0 \qquad (4.3b)$$
$$|S^{-1}dd^T - \lambda I| = 0 \qquad (4.3c)$$
(4.3b) can be written $Ma = \lambda a$, showing that $a$ is an eigenvector of $M = S^{-1}dd^T$.
(4.3c) is the determinantal equation satisfied by the eigenvalues of $S^{-1}dd^T$.
Premultiplying (4.3a) by $a^T$ gives
$$a^Tdd^Ta - \lambda a^TSa = 0$$
$$\lambda = \frac{a^Tdd^Ta}{a^TSa} = \frac{t^2(a)}{n-1}$$
Therefore, in order to maximize $t^2(a)$ we choose $\lambda$ to be the largest eigenvalue of $S^{-1}dd^T$. This is a rank 1 matrix with the single non-zero eigenvalue
$$\mathrm{tr}(S^{-1}dd^T) = d^TS^{-1}d$$
and the maximum of (4.2) is known as Hotelling's $T^2$ statistic,
$$T^2 = (n-1)(\bar{x} - \mu_0)^TS^{-1}(\bar{x} - \mu_0) \qquad (4.4)$$
which is $(n-1)$ times the sample Mahalanobis distance between $\bar{x}$ and $\mu_0$.
4.2 Distribution of $T^2$
Under $H_0$ it can be shown that
$$\frac{T^2}{n-1} \sim \frac{p}{n-p}F_{p, n-p} \qquad (4.5)$$
where $F_{p, n-p}$ is the $F$ distribution on $p$ and $n-p$ degrees of freedom. Note that, depending on the covariance matrix used, $T^2$ has slightly different forms:
$$T^2 = (n-1)(\bar{x} - \mu_0)^TS^{-1}(\bar{x} - \mu_0) = n(\bar{x} - \mu_0)^TS_U^{-1}(\bar{x} - \mu_0)$$
where $S_U$ is the unbiased estimator of $\Sigma$ (with divisor $n-1$).
Example 1
In an investigation of adult intelligence, scores were obtained on two tests, "verbal" and "performance", for 101 subjects aged 60 to 64. Doppelt and Wallace (1955) reported the following mean scores and covariance matrix:
$$\begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \end{pmatrix} = \begin{pmatrix} 55.24 \\ 34.97 \end{pmatrix}, \qquad S_U = \begin{pmatrix} 210.54 & 126.99 \\ 126.99 & 119.68 \end{pmatrix}$$
At the $\alpha = .01$ (1%) level, test the hypothesis that
$$\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} 60 \\ 50 \end{pmatrix}$$
We first compute
$$S_U^{-1} = \begin{pmatrix} .01319 & -.01400 \\ -.01400 & .02321 \end{pmatrix}$$
and
$$d = \bar{x} - \mu_0 = (-4.76, -15.03)^T$$
The $T^2$ statistic is then
$$T^2 = 101\begin{pmatrix} -4.76 & -15.03 \end{pmatrix}\begin{pmatrix} .01319 & -.01400 \\ -.01400 & .02321 \end{pmatrix}\begin{pmatrix} -4.76 \\ -15.03 \end{pmatrix}$$
$$= 101\left[4.76^2\times .01319 - 2\times 4.76\times 15.03\times .01400 + 15.03^2\times .02321\right] = 357.4$$
This gives
$$F = \frac{99}{2}\times\frac{357.4}{100} = 176.9$$
The nearest tabulated 1% value corresponds to $F_{2,60}$ and is 4.98.
Therefore we conclude the null hypothesis should be rejected. The sample probably arose from a population with a much lower mean vector, rather closer to the sample mean.
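A sketch reproducing this calculation, using the summary statistics quoted above (scipy is assumed only for the $F$ quantile):

```python
import numpy as np
from scipy import stats

n, p = 101, 2
xbar = np.array([55.24, 34.97])
mu0 = np.array([60.0, 50.0])
SU = np.array([[210.54, 126.99],
               [126.99, 119.68]])

d = xbar - mu0
T2 = n * d @ np.linalg.solve(SU, d)      # T^2 using the unbiased covariance S_U
F = T2 * (n - p) / ((n - 1) * p)         # convert to an F statistic
crit = stats.f.ppf(0.99, p, n - p)       # 1% critical value of F_{2,99}
print(T2, F, crit)                       # approx 357, 177, 4.8: reject H0
```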
Example 2
The changes in levels of free fatty acid (FFA) were measured on 15 hypnotised subjects who had been asked to experience fear, depression and anger effects while under hypnosis. The mean FFA changes were
$$\bar{x}_1 = 2.699, \quad \bar{x}_2 = 2.178, \quad \bar{x}_3 = 2.558$$
Given that the covariance matrix of the stress differences $y_{i1} = x_{i1} - x_{i2}$ and $y_{i2} = x_{i1} - x_{i3}$ is
$$S_U = \begin{pmatrix} 1.7343 & 1.1666 \\ 1.1666 & 2.7733 \end{pmatrix}, \qquad S_U^{-1} = \begin{pmatrix} 0.8041 & -0.3382 \\ -0.3382 & 0.5029 \end{pmatrix}$$
test at the 0.05 level of significance whether each effect produced the same change in FFA.
[$T^2 = 2.68$ and $F = 1.24$ with degrees of freedom 2, 13. Do not reject the hypothesis "no emotion effect" at the $\alpha = .05$ level.]
4.3 Invariance of $T^2$
$T^2$ is unaffected by changes in the scale or origin of the (response) variables. Consider
$$y = Cx + d$$
where $C$ is $(p \times p)$ and non-singular.
The null hypothesis $H_0: \mu_x = \mu_0$ is equivalent to $H_0: \mu_y = C\mu_0 + d$.
Under the linear transformation we have
$$\bar{y} = C\bar{x} + d, \qquad S_y = CSC^T$$
so that
$$\frac{1}{n-1}T_y^2 = (\bar{y} - \mu_y)^TS_y^{-1}(\bar{y} - \mu_y) = (\bar{x} - \mu_0)^TC^T(CSC^T)^{-1}C(\bar{x} - \mu_0)$$
$$= (\bar{x} - \mu_0)^TC^T(C^T)^{-1}S^{-1}C^{-1}C(\bar{x} - \mu_0) = (\bar{x} - \mu_0)^TS^{-1}(\bar{x} - \mu_0)$$
which demonstrates invariance.
4.4 Confidence interval for a mean
A confidence region for $\mu$ can be obtained given the distribution of $T^2$:
$$(n-1)(\bar{x} - \mu)^TS^{-1}(\bar{x} - \mu) \sim \frac{p(n-1)}{n-p}F_{p, n-p} \qquad (4.6)$$
by substituting the data values $\bar{x}$ and $S^{-1}$.
In Example 1 above we have
$$\bar{x} = (55.24, 34.97)^T, \qquad 100S^{-1} = \begin{pmatrix} 1.32 & -1.40 \\ -1.40 & 2.32 \end{pmatrix}$$
and $F_{2,99}(.01)$ is approximately 4.83 (by interpolation). Hence the 99% confidence region is
$$1.32(\mu_1 - 55.24)^2 - 2.80(\mu_1 - 55.24)(\mu_2 - 34.97) + 2.32(\mu_2 - 34.97)^2 \le \frac{2\times 100}{99}\times 4.83 = 9.76$$
This is an ellipse in $p = 2$ dimensional space (which can be plotted). In higher dimensions an ellipsoidal confidence region is obtained.
4.5 Likelihood ratio test
Given a data matrix $X$ of observations on a random vector $x$ whose distribution depends on a vector of parameters $\theta$, the likelihood ratio for testing the null hypothesis $H_0: \theta \in \Omega_0$ against the alternative $H_1: \theta \in \Omega_1$ is defined as
$$\lambda = \frac{\sup_{\theta\in\Omega_0}L}{\sup_{\theta\in\Omega_1}L} \qquad (4.7)$$
where $L = L(\theta; X)$ is the likelihood function. In a likelihood ratio test (LRT) we reject $H_0$ for low values of $\lambda$, i.e. if $\lambda < c$ where $c$ is chosen so that the probability of a Type I error is $\alpha$.
If we define $l_0^* = -2\log L_0^*$, where $L_0^*$ is the value of the numerator, and similarly $l_1^* = -2\log L_1^*$, the rejection criterion takes the form
$$-2\log\lambda = -2\log\left(\frac{L_0^*}{L_1^*}\right) = l_0^* - l_1^* > k \qquad (4.8)$$
Result
When $H_0$ is true and for $n$ "large", the log likelihood ratio (4.8) has the $\chi^2$ distribution on $r$ degrees of freedom, $\chi^2_r$, where $r$ equals the number of free parameters under $H_1$ minus the number of free parameters under $H_0$.
4.6 LRT for a mean when $\Sigma$ is known
$H_0: \mu = \mu_0$, a given value, when $\Sigma$ is known.
Given a random sample from $N(\mu, \Sigma)$ resulting in $\bar{x}$ and $S$, the likelihood given in (3.8b) is (to within an additive constant)
$$l(\mu, \Sigma) = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu)\right\} \qquad (4.9)$$
Under $H_0$ the value of $\mu$ is known and
$$l_0^* = l(\mu_0, \Sigma) = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \mu_0)^T\Sigma^{-1}(\bar{x} - \mu_0)\right\}$$
Under $H_1$, with no restriction on $\mu$, the m.l.e. of $\mu$ is $\hat{\mu} = \bar{x}$. Thus
$$l_1^* = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\right\}$$
Therefore
$$-2\log\lambda = l_0^* - l_1^* = n(\bar{x} - \mu_0)^T\Sigma^{-1}(\bar{x} - \mu_0) \qquad (4.10)$$
which is $n$ times the Mahalanobis distance of $\bar{x}$ from $\mu_0$. Note the similarity with Hotelling's $T^2$ statistic. Given that the distribution of $\bar{x}$ under $H_0$ is
$$\bar{x} \sim N_p\left(\mu_0, \tfrac{1}{n}\Sigma\right)$$
and that (4.10) may be written, using the transformation $y = \left(\tfrac{1}{n}\Sigma\right)^{-\frac{1}{2}}(\bar{x} - \mu_0)$ to a standard set of independent $N(0,1)$ variates, as
$$-2\log\lambda = y^Ty = \sum_{i=1}^{p}y_i^2 \qquad (4.11)$$
we have the exact distribution
$$-2\log\lambda \sim \chi^2_p \qquad (4.12)$$
showing that in this case the asymptotic distribution of $-2\log\lambda$ is exact, even for small samples.
Example
Measurements of the length of skull were made on a sample of first and second sons from 25 families:
$$\bar{x} = \begin{pmatrix} 185.72 \\ 183.84 \end{pmatrix}, \qquad S = \begin{pmatrix} 91.48 & 66.88 \\ 66.88 & 96.78 \end{pmatrix}$$
Assuming that in fact
$$\Sigma = \begin{pmatrix} 100 & 0 \\ 0 & 100 \end{pmatrix}$$
test at the $\alpha = .05$ level the hypothesis
$$H_0: \mu = (182, 182)^T$$
Solution
$$-2\log\lambda = 25\begin{pmatrix} 3.72 & 1.84 \end{pmatrix}\begin{pmatrix} .01 & 0 \\ 0 & .01 \end{pmatrix}\begin{pmatrix} 3.72 \\ 1.84 \end{pmatrix} = 0.25\times(3.72^2 + 1.84^2) = 4.31$$
Since $\chi^2_2(.05) = 5.99$, do not reject $H_0$.
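The same arithmetic in a few lines (scipy assumed only for the $\chi^2$ quantile):

```python
import numpy as np
from scipy import stats

n = 25
xbar = np.array([185.72, 183.84])
mu0 = np.array([182.0, 182.0])
Sigma = np.diag([100.0, 100.0])

stat = n * (xbar - mu0) @ np.linalg.solve(Sigma, xbar - mu0)  # -2 log lambda
print(stat)                            # approx 4.31
print(stats.chi2.ppf(0.95, df=2))      # 5.99: do not reject H0
```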
4.7 LRT for mean when $\Sigma$ is unknown
Consider the test of hypothesis
$$H_0: \mu = \mu_0 \text{ when } \Sigma \text{ is unknown}, \qquad H_1: \mu \ne \mu_0$$
In this case $\Sigma$ must be estimated under $H_0$ and also under $H_1$.
Under $H_0$:
$$l(\mu_0, \Sigma) = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \mu_0)^T\Sigma^{-1}(\bar{x} - \mu_0)\right\} \qquad (4.13a)$$
$$= n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + d_0^T\Sigma^{-1}d_0\right\} \qquad (4.13b)$$
$$= n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + \mathrm{tr}(d_0^T\Sigma^{-1}d_0)\right\} \qquad (4.13c)$$
$$= n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + \mathrm{tr}(\Sigma^{-1}d_0d_0^T)\right\} \qquad (4.13d)$$
$$= n\left\{\log|\Sigma| + \mathrm{tr}\left(\Sigma^{-1}(S + d_0d_0^T)\right)\right\} \qquad (4.13e)$$
writing $d_0$ for $\bar{x} - \mu_0$.
Under $H_1$:
$$l(\hat{\mu}, \Sigma) = n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \hat{\mu})^T\Sigma^{-1}(\bar{x} - \hat{\mu})\right\} \qquad (4.14a)$$
$$= n\left\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\right\} \qquad (4.14b)$$
$$l(\hat{\mu}, \hat{\Sigma}) = n\left\{\log|S| + \mathrm{tr}(S^{-1}S)\right\} \qquad (4.14c)$$
$$= n\left\{\log|S| + \mathrm{tr}(I_p)\right\} \qquad (4.14d)$$
$$l_1^* = n\log|S| + np \qquad (4.14e)$$
after substitution of the m.l.e.'s $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$ obtained previously.
Comparing (4.13e) with (4.14b) we see that the m.l.e. of $\Sigma$ under $H_0$ must be
$$\hat{\Sigma} = S + d_0d_0^T$$
and that the corresponding value of $l = -2\log L$ is
$$l_0^* = n\log|S + d_0d_0^T| + np$$
$$l_0^* - l_1^* = n\log|S + d_0d_0^T| - n\log|S| = n\log\left|S^{-1}(S + d_0d_0^T)\right|$$
$$= n\log\left|I_p + S^{-1}d_0d_0^T\right| = n\log\left(1 + d_0^TS^{-1}d_0\right) \qquad (4.15)$$
making use of the useful matrix result proved in (1.8.3), that $|I_p + uv^T| = 1 + v^Tu$.
Since
$$-2\log\lambda = n\log\left(1 + \frac{T^2}{n-1}\right) \qquad (4.16)$$
we see that $\lambda$ and $T^2$ are monotonically related. Therefore we can conclude that the LRT of $H_0: \mu = \mu_0$ when $\Sigma$ is unknown is equivalent to the use of Hotelling's $T^2$ statistic.
4.8 LRT for $\Sigma = \Sigma_0$ with $\mu$ unknown
$$H_0: \Sigma = \Sigma_0 \text{ when } \mu \text{ is unknown}, \qquad H_1: \Sigma \ne \Sigma_0$$
Under $H_0$ we substitute $\hat{\mu} = \bar{x}$ into
$$l(\hat{\mu}, \Sigma_0) = n\left\{\log|\Sigma_0| + \mathrm{tr}(\Sigma_0^{-1}S) + (\bar{x} - \hat{\mu})^T\Sigma_0^{-1}(\bar{x} - \hat{\mu})\right\}$$
giving
$$l_0^* = n\left\{\log|\Sigma_0| + \mathrm{tr}(\Sigma_0^{-1}S)\right\} \qquad (4.17)$$
Under $H_1$ we substitute the unrestricted m.l.e.'s $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$ giving, as in (4.14e),
$$l_1^* = n\log|S| + np \qquad (4.18)$$
$$l_0^* - l_1^* = n\left\{\log|\Sigma_0| + \mathrm{tr}(\Sigma_0^{-1}S) - \log|S| - p\right\} = n\left\{-\log|\Sigma_0^{-1}S| + \mathrm{tr}(\Sigma_0^{-1}S) - p\right\} \qquad (4.19)$$
This statistic depends only on the eigenvalues of the positive definite matrix $\Sigma_0^{-1}S$ and has the property that $l_0^* - l_1^* = -2\log\lambda \to 0$ as $S$ approaches $\Sigma_0$.
Let $A$ be the arithmetic mean and $G$ the geometric mean of the eigenvalues of $\Sigma_0^{-1}S$:
$$\mathrm{tr}(\Sigma_0^{-1}S) = pA, \qquad |\Sigma_0^{-1}S| = G^p$$
then
$$-2\log\lambda = n\{pA - p\log G - p\} = np\{A - \log G - 1\} \qquad (4.20)$$
The general result for the distribution of (4.20) for large $n$ gives
$$l_0^* - l_1^* \sim \chi^2_r \qquad (4.21)$$
where $r = \tfrac{1}{2}p(p+1)$ is the number of independent parameters in $\Sigma$.
4.10 Test for sphericity
A covariance matrix is said to have the property of "sphericity" if
$$\Sigma = kI_p \qquad (4.22)$$
for some $k$. We see that this is a special case of the more general situation $\Sigma = k\Sigma_0$ treated in Section 3.3.3. The same procedure can be applied. The general likelihood expression for a sample from the MVN distribution is
$$-2\log L = n\left\{\log|\Sigma| + \mathrm{tr}\left(\Sigma^{-1}(S + dd^T)\right)\right\}$$
Under $H_0$, $\Sigma = kI_p$ and $\hat{\mu} = \bar{x}$ (so $d = 0$), so
$$-2\log L = n\left\{\log|kI_p| + \mathrm{tr}(k^{-1}S)\right\} = n\left\{p\log k + k^{-1}\mathrm{tr}(S)\right\} \qquad (4.23)$$
Setting $\frac{\partial}{\partial k}[-2\log L] = 0$ at a minimum,
$$\frac{p}{k} - \frac{1}{k^2}\mathrm{tr}(S) = 0$$
$$\hat{k} = \frac{\mathrm{tr}(S)}{p} \qquad (4.24)$$
which is in fact the arithmetic mean $A$ of the eigenvalues of $S$.
Substituting back into (4.23) gives
$$l_0^* = np(\log A + 1)$$
Under $H_1$, $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$:
$$l_1^* = n\log|S| + np = np(\log G + 1)$$
thus
$$-2\log\lambda = l_0^* - l_1^* = np\log\left(\frac{A}{G}\right) \qquad (4.25)$$
The number of free parameters contained in $\Sigma$ is 1 under $H_0$ and $\tfrac{1}{2}p(p+1)$ under $H_1$. Hence the appropriate distribution for comparing $-2\log\lambda$ is $\chi^2_r$, where
$$r = \tfrac{1}{2}p(p+1) - 1 = \tfrac{1}{2}(p-1)(p+2) \qquad (4.26)$$
4.11 Test for independence
Independence of the variables $x_1, \dots, x_p$ is manifested by a diagonal covariance matrix
$$\Sigma = \mathrm{diag}(\sigma_{11}, \dots, \sigma_{pp}) \qquad (4.27)$$
We consider $H_0$: $\Sigma$ is diagonal, against the general alternative $H_1$: $\Sigma$ is unrestricted.
Under $H_0$ it is in fact clear that we will find $\hat{\sigma}_{ii} = s_{ii}$, because the estimators of $\sigma_{ii}$ for each $x_i$ are independent. We can also show this formally:
$$-2\log L = n\left\{\log|\Sigma| + \mathrm{tr}\left(\Sigma^{-1}(S + dd^T)\right)\right\} = n\left\{\sum_{i=1}^{p}\log\sigma_{ii} + \sum_{i=1}^{p}\frac{s_{ii}}{\sigma_{ii}}\right\}$$
(with $\hat{\mu} = \bar{x}$, so $d = 0$). Setting $\frac{\partial}{\partial\sigma_{ii}}(-2\log L) = 0$:
$$\frac{1}{\sigma_{ii}} - \frac{s_{ii}}{\sigma_{ii}^2} = 0$$
$$\hat{\sigma}_{ii} = s_{ii}$$
Therefore
$$l_0^* = n\left\{\sum_{i=1}^{p}\log s_{ii} + p\right\} = n\left\{\log|D| + p\right\}$$
where $D = \mathrm{diag}(s_{11}, \dots, s_{pp})$.
Under $H_1$, as before, we find
$$l_1^* = n\log|S| + np$$
Therefore
$$l_0^* - l_1^* = n\left[\log|D| - \log|S|\right] = -n\log|D^{-1}S| = -n\log|D^{-\frac{1}{2}}SD^{-\frac{1}{2}}| = -n\log|R| \qquad (4.28)$$
The number of free parameters contained in $\Sigma$ is $p$ under $H_0$ and $\tfrac{1}{2}p(p+1)$ under $H_1$. Hence the appropriate distribution for comparing $-2\log\lambda$ is $\chi^2_r$, where
$$r = \tfrac{1}{2}p(p+1) - p = \tfrac{1}{2}p(p-1) \qquad (4.29)$$
4.12 Simultaneous confidence intervals (Scheffé, Roy & Bose)
The union-intersection method for deriving Hotelling's $T^2$ statistic provides "simultaneous confidence intervals" for the parameters when $\Sigma$ is unknown. Following Section 4.1, let
$$T^2 = (n-1)(\bar{x} - \mu)^TS^{-1}(\bar{x} - \mu) \qquad (4.30)$$
where $\mu$ is the unknown (true) mean. Let $t(a)$ be the univariate t-statistic corresponding to the linear compound $y = a^Tx$. Then
$$\max_a t^2(a) = T^2$$
and for all $p$-vectors $a$,
$$t^2(a) \le T^2 \qquad (4.31)$$
where
$$t(a) = \frac{\bar{y} - \mu_y}{s_y/\sqrt{n}} = \frac{\sqrt{n-1}\,a^T(\bar{x} - \mu)}{\sqrt{a^TSa}} \qquad (4.32)$$
From Section 4.2 the distribution of $T^2$ is
$$\frac{T^2}{n-1} \sim \frac{p}{n-p}F_{p,n-p}$$
so
$$\Pr\left[T^2 \le \frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right] = 1 - \alpha$$
therefore, from (4.31), for all $p$-vectors $a$,
$$\Pr\left[t^2(a) \le \frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right] = 1 - \alpha \qquad (4.33)$$
Substituting from (4.32), the confidence statement in (4.33) is:
with probability $1 - \alpha$, for all $p$-vectors $a$,
$$|a^T\bar{x} - a^T\mu| \le \left[\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right]^{1/2}\sqrt{\frac{a^TSa}{n-1}} = K_\alpha\sqrt{\frac{a^TSa}{n-1}}, \text{ say}, \qquad (4.34)$$
where $K_\alpha$ is the constant
$$K_\alpha = \left[\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right]^{1/2} \qquad (4.35)$$
A $100(1-\alpha)\%$ confidence interval for the linear compound $a^T\mu$ is therefore
$$a^T\bar{x} \pm K_\alpha\sqrt{\frac{a^TSa}{n-1}} \qquad (4.36)$$
How can we apply this result? We might be interested in a defined set of linear combinations (linear compounds) of $\mu$. The $i$th component of $\mu$ is, for example, the linear compound defined by $a^T = (0, \dots, 1, \dots, 0)$, the unit vector with a single 1 in the $i$th position. For a large number of such sets of CIs we would expect $100(1-\alpha)\%$ to contain no mis-statements, while $100\alpha\%$ would contain at least one mis-statement.
We can relate the $T^2$ confidence intervals to the $T^2$ test of $H_0: \mu = \mu_0$. If this $H_0$ is rejected at significance level $\alpha$, then there exists at least one vector $a$ such that the interval (4.36) does not include the value $a^T\mu_0$.
NB. If the covariance matrix $S_u$ (with denominator $n-1$) is supplied, then in (4.36) $\sqrt{\frac{a^TSa}{n-1}}$ may be replaced by $\sqrt{\frac{a^TS_ua}{n}}$.
4.13 The Bonferroni method
This provides another way to construct simultaneous CIs for a small number of linear compounds of $\mu$ whilst controlling the overall level of confidence.
Consider a set of events $A_1, A_2, \dots, A_m$:
$$\Pr(A_1 \cap \dots \cap A_m) = 1 - \Pr(\bar{A}_1 \cup \dots \cup \bar{A}_m)$$
From the additive law of probabilities,
$$\Pr(\bar{A}_1 \cup \dots \cup \bar{A}_m) \le \sum_{i=1}^{m}\Pr(\bar{A}_i)$$
Therefore
$$\Pr(A_1 \cap \dots \cap A_m) \ge 1 - \sum_{i=1}^{m}\Pr(\bar{A}_i) \qquad (4.37)$$
Let $C_k$ denote a confidence statement about the value of some linear compound $a_k^T\mu$ with $\Pr(C_k \text{ true}) = 1 - \alpha_k$. Then
$$\Pr(\text{all } C_k \text{ true}) \ge 1 - (\alpha_1 + \dots + \alpha_m) \qquad (4.38)$$
Therefore we can control the overall error rate, given by $\alpha_1 + \dots + \alpha_m = \alpha$, say. For example, in order to construct simultaneous $100(1-\alpha)\%$ CIs for all $p$ components $\mu_k$ of $\mu$ we could choose $\alpha_k = \frac{\alpha}{p}$ ($k = 1, \dots, p$), leading to
$$\bar{x}_1 \pm t_{n-1}\left(\tfrac{\alpha}{2p}\right)\sqrt{\tfrac{s_{11}}{n}}$$
$$\vdots$$
$$\bar{x}_p \pm t_{n-1}\left(\tfrac{\alpha}{2p}\right)\sqrt{\tfrac{s_{pp}}{n}}$$
if $s_{ii}$ derives from $S_u$.
Example
Intelligence scores data on $n = 101$ subjects:
$$\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \end{pmatrix} = \begin{pmatrix} 55.24 \\ 34.97 \end{pmatrix}, \qquad S_U = \begin{pmatrix} 210.54 & 126.99 \\ 126.99 & 119.68 \end{pmatrix}$$
1. Construct 99% simultaneous confidence intervals for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$.
For $\mu_1$ take $a^T = (1, 0)$:
$$a^T\bar{x} = 55.24, \qquad a^TS_Ua = 210.54$$
Now take $\alpha = .01$:
$$K_\alpha = \left[\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right]^{\frac{1}{2}} = \left[\frac{100\times 2}{99}F_{2,99}(.01)\right]^{\frac{1}{2}} = 3.12$$
taking $F_{2,99}(.01) = 4.83$ (approx). Therefore the CI for $\mu_1$ is
$$55.24 \pm 3.12\sqrt{\frac{210.54}{101}} = 55.24 \pm 4.50$$
giving the interval $(50.7, 59.7)$.
For $\mu_2$ we already have $K_\alpha$; take $a^T = (0, 1)$, then
$$a^T\bar{x} = 34.97, \qquad a^TS_Ua = 119.68$$
The CI for $\mu_2$ is
$$34.97 \pm 3.12\sqrt{\frac{119.68}{101}} = 34.97 \pm 3.40$$
giving the interval $(31.6, 38.4)$.
For $\mu_1 - \mu_2$ take $a^T = (1, -1)$:
$$a^T\bar{x} = 20.27, \qquad a^TS_Ua = 210.54 - 2\times 126.99 + 119.68 = 76.24$$
The CI for $\mu_1 - \mu_2$ is
$$20.27 \pm 3.12\sqrt{\frac{76.24}{101}} = 20.27 \pm 2.71 = (17.6, 23.0)$$
2. Construct CIs for $\mu_1, \mu_2$ by the Bonferroni method. Use $\alpha = .01$.
Individual CIs are constructed using $\alpha_k = \frac{.01}{2} = .005$ ($k = 1, 2$). Then
$$t_{100}\left(\tfrac{\alpha_k}{2}\right) = t_{100}(.0025) \simeq \Phi^{-1}(.9975) = 2.81$$
The CI for $\mu_1$ is
$$55.24 \pm 2.81\sqrt{\frac{210.54}{101}} = 55.24 \pm 4.06 = (51.2, 59.3)$$
and for $\mu_2$ is
$$34.97 \pm 2.81\sqrt{\frac{119.68}{101}} = 34.97 \pm 3.06 = (31.9, 38.0)$$
Comparing the CIs obtained by the two methods, we see that the simultaneous CIs for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$ are 8.7% wider than the corresponding Bonferroni CIs.
NB. If we had required 99% Bonferroni CIs for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$, then $m = 3$ in (4.38) and $\frac{\alpha}{m} = \frac{.01}{3}$, so $\frac{\alpha_k}{2} = \frac{.01}{6} \approx .0017$. The corresponding percentage point of $t$ would be
$$t_{100}(.0017) \simeq \Phi^{-1}(.9983) = 2.93$$
leading to slightly wider CIs than obtained above.
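Both constructions can be sketched as follows (scipy assumed for the $F$ and $t$ quantiles; note that scipy's exact $t_{100}$ quantile, about 2.87, is slightly larger than the normal approximation 2.81 used above, so the Bonferroni intervals come out marginally wider):

```python
import numpy as np
from scipy import stats

n, p, alpha = 101, 2, 0.01
xbar = np.array([55.24, 34.97])
SU = np.array([[210.54, 126.99],
               [126.99, 119.68]])

def simultaneous_ci(a):
    # K_alpha from (4.35); standard error sqrt(a' S_u a / n) as in the NB above
    K = np.sqrt((n - 1) * p / (n - p) * stats.f.ppf(1 - alpha, p, n - p))
    half = K * np.sqrt(a @ SU @ a / n)
    return a @ xbar - half, a @ xbar + half

def bonferroni_ci(a, m):
    t = stats.t.ppf(1 - alpha / (2 * m), df=n - 1)
    half = t * np.sqrt(a @ SU @ a / n)
    return a @ xbar - half, a @ xbar + half

for a in (np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, -1.0])):
    print(simultaneous_ci(a))              # the three 99% simultaneous intervals
for a in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    print(bonferroni_ci(a, m=2))           # Bonferroni intervals for mu_1, mu_2
```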
4.14 Two sample procedures
Suppose we have two independent random samples $\{x_{11}, \dots, x_{1n_1}\}$, $\{x_{21}, \dots, x_{2n_2}\}$ of sizes $n_1, n_2$ from two populations
$$\Pi_1: x \sim N_p(\mu_1, \Sigma), \qquad \Pi_2: x \sim N_p(\mu_2, \Sigma)$$
giving rise to sample means $\bar{x}_1, \bar{x}_2$ and sample covariance matrices $S_1, S_2$. Note the assumption of a common covariance matrix $\Sigma$.
We consider testing
$$H_0: \mu_1 = \mu_2 \quad \text{against} \quad H_1: \mu_1 \ne \mu_2$$
Let $d = \bar{x}_1 - \bar{x}_2$. Under $H_0$,
$$d \sim N\left(0, \left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)\Sigma\right)$$
(a) Case of $\Sigma$ known
Analogously to the one-sample case,
$$\left(\frac{n_1n_2}{n_1+n_2}\right)^{\frac{1}{2}}\Sigma^{-\frac{1}{2}}d \sim N(0, I_p)$$
$$\frac{n_1n_2}{n}d^T\Sigma^{-1}d \sim \chi^2_p \quad \text{where } n = n_1 + n_2$$
(b) Case of $\Sigma$ unknown
We have the Wishart-distributed quantities
$$n_1S_1 \sim W_p(\Sigma, n_1 - 1), \qquad n_2S_2 \sim W_p(\Sigma, n_2 - 1)$$
independently. Let
$$S_p = \frac{n_1S_1 + n_2S_2}{n-2}$$
be the pooled estimator of the covariance matrix $\Sigma$. Then, from the additive property of the Wishart distribution, $(n-2)S_p$ has the Wishart distribution $W_p(\Sigma, n-2)$, and
$$\left(\frac{n_1n_2}{n}\right)^{\frac{1}{2}}d \sim N(0, \Sigma)$$
It may be shown that
$$T^2 = \frac{n_1n_2}{n}d^TS_p^{-1}d$$
has the distribution of a Hotelling's $T^2$ statistic. In fact
$$T^2 \sim \frac{(n-2)p}{n-p-1}F_{p,n-p-1} \qquad (4.39)$$
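A sketch of the two-sample test on hypothetical data matrices X1, X2 (the $S_i$ here are taken with divisor $n_i$, matching the pooling formula above):

```python
import numpy as np
from scipy import stats

def two_sample_T2(X1, X2):
    n1, n2 = len(X1), len(X2)
    n, p = n1 + n2, X1.shape[1]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False, ddof=0)         # divisor n1
    S2 = np.cov(X2, rowvar=False, ddof=0)         # divisor n2
    Sp = (n1 * S1 + n2 * S2) / (n - 2)            # pooled estimator
    T2 = (n1 * n2 / n) * d @ np.linalg.solve(Sp, d)
    F = T2 * (n - p - 1) / ((n - 2) * p)          # from (4.39)
    return T2, F, stats.f.sf(F, p, n - p - 1)     # statistic, F value, p-value

rng = np.random.default_rng(3)
X1 = rng.normal(size=(30, 2))
X2 = rng.normal(loc=0.5, size=(40, 2))
print(two_sample_T2(X1, X2))
```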
4.15 Multi-sample procedures (MANOVA)
We consider the case of $k$ samples from populations $\Pi_1, \dots, \Pi_k$. The sample from population $\Pi_i$ is of size $n_i$. By analogy with the univariate case we can decompose the SSP matrix into "orthogonal" parts. This decomposition can be represented as a Multivariate Analysis of Variance (MANOVA) table.
The MANOVA model is
$$x_{ij} = \mu + \tau_i + e_{ij}, \qquad j = 1, \dots, n_i \text{ and } i = 1, \dots, k \qquad (4.40)$$
where the $e_{ij}$ are independent $N_p(0, \Sigma)$ variables. Here the parameter vector $\mu$ is the overall (grand) mean and $\tau_i$ is the $i$th treatment effect, with
$$\sum_{i=1}^{k}n_i\tau_i = 0 \qquad (4.41)$$
Define the $i$th sample mean as $\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}x_{ij}$.
The Between-Groups sum of squares and cross-products (SSP) matrix is
$$B = \sum_{i=1}^{k}n_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T \qquad (4.42)$$
The grand mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{k}n_i\bar{x}_i$, where $n = \sum_{i=1}^{k}n_i$, and the Total SSP matrix is
$$T = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x})(x_{ij} - \bar{x})^T \qquad (4.43)$$
It can be shown algebraically that $T = B + W$, where $W$ is the Within-Groups (or residual) SSP matrix given by
$$W = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T \qquad (4.44)$$
The MANOVA table is

Source of variation    Matrix of SS and cross-products (SSP)                                                  Degrees of freedom (d.f.)
Treatment              $B = \sum_{i=1}^{k}n_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$                     $k - 1$
Residual               $W = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T$          $\sum_{i=1}^{k}n_i - k$
Total (corrected       $T = B + W = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x})(x_{ij} - \bar{x})^T$      $\sum_{i=1}^{k}n_i - 1$
for the mean)
We are interested in testing the hypothesis
$$H_0: \mu_1 = \mu_2 = \dots = \mu_k$$
(whether the samples in fact come from the same population) against the general alternative
$$H_1: \text{the } \mu_i \text{ are not all equal}$$
We can derive a likelihood ratio test statistic known as Wilks' $\Lambda$.
Under $H_0$ the m.l.e.'s are
$$\hat{\mu} = \bar{x}, \qquad \hat{\Sigma} = S$$
leading to the maximized log likelihood (minimum of $-2\log L$)
$$l_0^* = np + n\log|S|$$
Under $H_1$ the m.l.e.'s are
$$\hat{\mu}_i = \bar{x}_i, \qquad \hat{\Sigma} = \frac{1}{n}W$$
where
$$W = \sum_{i=1}^{k}W_i = \sum_{i=1}^{k}n_iS_i$$
This follows from
$$l_1^* = \min_{\Sigma, d_i}\left\{n\log|\Sigma| + \sum_{i=1}^{k}n_i\,\mathrm{tr}\left(\Sigma^{-1}(S_i + d_id_i^T)\right)\right\} = \min_{\Sigma}\left\{n\log|\Sigma| + n\,\mathrm{tr}\left(\Sigma^{-1}\frac{1}{n}\sum_{i=1}^{k}n_iS_i\right)\right\}$$
since $\hat{d}_i = \bar{x}_i - \hat{\mu}_i = 0$. Hence $\hat{\Sigma} = \frac{1}{n}W$ and
$$l_1^* = np + n\log\left|\frac{1}{n}W\right| \qquad (4.45)$$
Therefore, since $T = nS$,
$$l_0^* - l_1^* = -n\log\left(\frac{|W|}{|T|}\right) = -n\log\Lambda \qquad (4.46)$$
where $\Lambda$ is known as Wilks' $\Lambda$ statistic. We reject $H_0$ for small values of $\Lambda$, or large values of $-n\log\Lambda$. Asymptotically, the rejection region is the upper tail of a $\chi^2_{p(k-1)}$: under $H_0$ the unknown $\mu$ has $p$ parameters, and under $H_1$ the number of parameters for $\mu_1, \dots, \mu_k$ is $pk$, hence the d.f. of the $\chi^2$ is $p(k-1)$. Apart from this asymptotic result, other approximate distributions (notably Bartlett's approximation) are available, but the details are outside the scope of this course.
4.15.1 Calculation of Wilks' $\Lambda$
Result
Let $\lambda_1, \dots, \lambda_p$ be the eigenvalues of $W^{-1}B$; then
$$\Lambda = \prod_{j=1}^{p}(1 + \lambda_j)^{-1} \qquad (4.47)$$
Proof
$$\Lambda = |T^{-1}W| = \left|(W + B)^{-1}W\right| = \left|W^{-1}(W + B)\right|^{-1} = \left|I + W^{-1}B\right|^{-1} = \prod_{j=1}^{p}(1 + \lambda_j)^{-1} \qquad (4.48)$$
since the eigenvalues of $I + W^{-1}B$ are $1 + \lambda_j$ (cf. the "useful" identity proved earlier in the notes).
4.15.2 Case k = 2
We show that the use of Wilks' $\Lambda$ for $k = 2$ groups is equivalent to using Hotelling's $T^2$ statistic. Specifically, we show that $\Lambda$ is a monotonic function of $T^2$; thus rejecting $H_0$ for $\Lambda < c_1$ is equivalent to rejecting $H_0$ for $T^2 > c_2$ (for some constants $c_1, c_2$).
Proof
For $k = 2$ we can show (Ex.) that
$$B = \frac{n_1n_2}{n}dd^T \qquad (4.49)$$
where $d = \bar{x}_1 - \bar{x}_2$. Then
$$\left|I + W^{-1}B\right| = \left|I + \frac{n_1n_2}{n}W^{-1}dd^T\right| = 1 + \frac{n_1n_2}{n}d^TW^{-1}d$$
Now $W$ is just $(n-2)S_p$, where $S_p$ is the pooled estimator of $\Sigma$. Thus
$$\Lambda^{-1} = 1 + \frac{T^2}{n-2} \qquad (4.50)$$
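A sketch computing Wilks' $\Lambda$ from the eigenvalues of $W^{-1}B$ for hypothetical group data, using the asymptotic $\chi^2_{p(k-1)}$ approximation for a p-value:

```python
import numpy as np
from scipy import stats

def wilks_lambda(groups):
    k = len(groups)
    n = sum(len(X) for X in groups)
    p = groups[0].shape[1]
    grand = np.vstack(groups).mean(axis=0)
    B = sum(len(X) * np.outer(X.mean(axis=0) - grand, X.mean(axis=0) - grand)
            for X in groups)                                   # between-groups SSP
    W = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0))
            for X in groups)                                   # within-groups SSP
    eig = np.linalg.eigvals(np.linalg.solve(W, B)).real        # eigenvalues of W^{-1}B
    Lam = np.prod(1.0 / (1.0 + eig))
    stat = -n * np.log(Lam)                                    # -n log(Lambda)
    return Lam, stat, stats.chi2.sf(stat, p * (k - 1))

rng = np.random.default_rng(4)
groups = [rng.normal(loc=m, size=(20, 3)) for m in (0.0, 0.2, 0.5)]
print(wilks_lambda(groups))
```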
5. Discriminant Analysis (Classification)
Given $k$ populations (groups) $\Pi_1, \dots, \Pi_k$, an individual from $\Pi_j$ has p.d.f. $f_j(x)$ for a set of $p$ measurements $x$.
The purpose of discriminant analysis is to allocate an individual to one of the groups $\{\Pi_j\}$ on the basis of $x$, making as few "mistakes" as possible. For example, a patient presents at a doctor's surgery with a set of symptoms $x$. The symptoms suggest a number of possible disease groups $\{\Pi_j\}$ to which the patient might belong. What is the most likely diagnosis?
The aim initially is to find a partition of $\mathbb{R}^p$ into disjoint regions $R_1, \dots, R_k$, together with a decision rule
$$x \in R_j \Rightarrow \text{allocate } x \text{ to } \Pi_j$$
The decision rule will be more accurate if "$\Pi_j$ has most of its probability concentrated in $R_j$" for each $j$.
5.1 The maximum likelihood (ML) rule
Allocate $x$ to the population $\Pi_j$ that gives the largest likelihood to $x$; choose $j$ by
$$L_j(x) = \max_{1\le i\le k}L_i(x)$$
(break ties arbitrarily).
Result 1
If the $\{\Pi_i\}$ are multivariate normal (MVN) populations $N_p(\mu_i, \Sigma)$ for $i = 1, \dots, k$, the ML rule allocates $x$ to the population $\Pi_i$ that minimizes the Mahalanobis distance between $x$ and $\mu_i$.
Proof
$$L_i(x) = |2\pi\Sigma|^{-\frac{1}{2}}\exp\left\{-\tfrac{1}{2}(x - \mu_i)^T\Sigma^{-1}(x - \mu_i)\right\}$$
so the likelihood is maximized when the exponent is minimized.
Result 2
When $k = 2$ the ML rule allocates $x$ to $\Pi_1$ if
$$d^T(x - \mu) > 0 \qquad (5.1)$$
where $d = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \tfrac{1}{2}(\mu_1 + \mu_2)$, and to $\Pi_2$ otherwise.
Proof
For the two-group case, the ML rule is to allocate $x$ to $\Pi_1$ if
$$(x - \mu_1)^T\Sigma^{-1}(x - \mu_1) < (x - \mu_2)^T\Sigma^{-1}(x - \mu_2)$$
which reduces to
$$2d^Tx > \mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2 = (\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 + \mu_2) = d^T(\mu_1 + \mu_2)$$
Hence the result. The function
$$h(x) = (\mu_1 - \mu_2)^T\Sigma^{-1}\left[x - \tfrac{1}{2}(\mu_1 + \mu_2)\right] \qquad (5.2)$$
is known as the discriminant function (DF). In this case the DF is linear in $x$.
5.2 Sample ML rule
In practice $\mu_1, \mu_2, \Sigma$ are estimated by, respectively, $\bar{x}_1, \bar{x}_2, S_P$, where $S_P$ is the pooled (unbiased) estimator of the covariance matrix.
Example
The eminent statistician R. A. Fisher took measurements on samples of size 50 of three types of iris. Two of the variables, $x_1$ = sepal length and $x_2$ = sepal width, gave the following data on species I and II:
$$\bar{x}_1 = \begin{pmatrix} 5.0 \\ 3.4 \end{pmatrix}, \quad \bar{x}_2 = \begin{pmatrix} 6.0 \\ 2.8 \end{pmatrix}, \quad S_1 = \begin{pmatrix} .12 & .10 \\ .10 & .14 \end{pmatrix}, \quad S_2 = \begin{pmatrix} .26 & .08 \\ .08 & .10 \end{pmatrix}$$
(The data have been rounded for clarity.)
$$S_p = \frac{50S_1 + 50S_2}{98} = \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix}$$
Hence
$$d = S_p^{-1}(\bar{x}_1 - \bar{x}_2) = \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix}^{-1}\begin{pmatrix} -1.0 \\ 0.6 \end{pmatrix} = \begin{pmatrix} -11.4 \\ 14.1 \end{pmatrix}$$
$$\hat{\mu} = \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) = \begin{pmatrix} 5.5 \\ 3.1 \end{pmatrix}$$
giving the rule: allocate $x$ to $\Pi_1$ if
$$-11.4(x_1 - 5.5) + 14.1(x_2 - 3.1) > 0$$
i.e. if
$$-11.4x_1 + 14.1x_2 + 19.0 > 0$$
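The sample rule can be reproduced from the (rounded) summary statistics; the computed d differs slightly from (-11.4, 14.1) because of the rounding:

```python
import numpy as np

xbar1 = np.array([5.0, 3.4])
xbar2 = np.array([6.0, 2.8])
S1 = np.array([[0.12, 0.10], [0.10, 0.14]])
S2 = np.array([[0.26, 0.08], [0.08, 0.10]])
Sp = (50 * S1 + 50 * S2) / 98                  # pooled covariance

d = np.linalg.solve(Sp, xbar1 - xbar2)         # close to the text's (-11.4, 14.1)
mid = 0.5 * (xbar1 + xbar2)                    # (5.5, 3.1)

def allocate(x):
    """Sample ML rule: group 1 if the discriminant function is positive."""
    return 1 if d @ (x - mid) > 0 else 2

print(d, allocate(np.array([5.1, 3.5])), allocate(np.array([6.2, 2.9])))
```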
5.3 Misclassification probabilities
The misclassification probabilities $p_{ij}$, defined as
$$p_{ij} = \Pr[\text{allocate to } \Pi_i \text{ when in fact from } \Pi_j]$$
form a $k \times k$ matrix, of which the diagonal elements $\{p_{ii}\}$ are a measure of the classifier's accuracy.
For the case $k = 2$,
$$p_{12} = \Pr[h(x) > 0 \mid \Pi_2]$$
Since $h(x) = d^T(x - \mu)$ is a linear compound of $x$, it has a (univariate) normal distribution.
Given that $x \in \Pi_2$:
$$E[h(x)] = d^T\left[\mu_2 - \tfrac{1}{2}(\mu_1 + \mu_2)\right] = \tfrac{1}{2}d^T(\mu_2 - \mu_1) = -\tfrac{1}{2}\Delta^2$$
where $\Delta^2 = (\mu_2 - \mu_1)^T\Sigma^{-1}(\mu_2 - \mu_1)$ is the Mahalanobis distance between $\mu_2$ and $\mu_1$.
The variance of $h(x)$ is
$$d^T\Sigma d = (\mu_2 - \mu_1)^T\Sigma^{-1}\Sigma\Sigma^{-1}(\mu_2 - \mu_1) = (\mu_2 - \mu_1)^T\Sigma^{-1}(\mu_2 - \mu_1) = \Delta^2$$
Hence
$$\Pr[h(x) > 0] = \Pr\left[\frac{h(x) + \tfrac{1}{2}\Delta^2}{\Delta} > \frac{\tfrac{1}{2}\Delta^2}{\Delta}\right] = \Pr\left[Z > \tfrac{1}{2}\Delta\right] = 1 - \Phi\left(\tfrac{1}{2}\Delta\right) = \Phi\left(-\tfrac{1}{2}\Delta\right) \qquad (5.3)$$
By symmetry this is also $p_{21}$, i.e. $p_{12} = p_{21}$.
Example (contd.)
We can estimate the misclassification probability from the sample Mahalanobis distance between $\bar{x}_2$ and $\bar{x}_1$:
$$D^2 = (\bar{x}_2 - \bar{x}_1)^TS_p^{-1}(\bar{x}_2 - \bar{x}_1) = \begin{pmatrix} -1.0 & 0.6 \end{pmatrix}\begin{pmatrix} -11.4 \\ 14.1 \end{pmatrix} \simeq 19.9$$
$$\Phi\left(-\tfrac{1}{2}D\right) = \Phi(-2.23) = 0.013$$
The estimated misclassification rate is 1.3%.
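Continuing the sketch above, the misclassification estimate $\Phi(-D/2)$ can be computed directly from the rounded summary statistics (scipy assumed for $\Phi$; the result is close to, but not exactly, the 19.9 and 0.013 quoted, again because of rounding):

```python
import numpy as np
from scipy import stats

xbar1, xbar2 = np.array([5.0, 3.4]), np.array([6.0, 2.8])
S1 = np.array([[0.12, 0.10], [0.10, 0.14]])
S2 = np.array([[0.26, 0.08], [0.08, 0.10]])
Sp = (50 * S1 + 50 * S2) / 98

D2 = (xbar2 - xbar1) @ np.linalg.solve(Sp, xbar2 - xbar1)  # sample Mahalanobis distance D^2
p_mis = stats.norm.cdf(-0.5 * np.sqrt(D2))                 # Phi(-D/2)
print(D2, p_mis)                                           # roughly 19.8 and about 0.013
```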