A Tutorial on Multivariate Statistical Analysis
Craig A. Tracy
Collection of (real-valued) data from a sequence of experiments
X1, X2, . . . , Xn
Might make assumption underlying law is N(µ, σ2) with unknown
mean µ and variance σ2. Want to estimate µ and σ2 from the data.
Sample Mean & Sample Variance:
$$\bar X = \frac{1}{n}\sum_{j=1}^{n} X_j,\qquad S = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_j - \bar X\right)^2$$
Estimators are "unbiased":
$$E(\bar X) = \mu,\qquad E(S) = \sigma^2$$
Theorem: If $X_1, X_2, \ldots$ are independent $N(\mu,\sigma^2)$ variables, then $\bar X$ and $S$ are independent. We have that $\bar X$ is $N(\mu,\sigma^2/n)$ and $(n-1)S/\sigma^2$ is $\chi^2(n-1)$.
Recall $\chi^2(d)$ denotes the chi-squared distribution with $d$ degrees of freedom. Its density is
$$f(x) = \frac{1}{2^{d/2}\,\Gamma(d/2)}\,x^{d/2-1}\,e^{-x/2},\quad x \ge 0,$$
where
$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt,\quad \Re(z) > 0.$$
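As a quick numerical sanity check, here is a minimal simulation sketch (not part of the original slides; all parameter choices are mine). It verifies the unbiasedness of $\bar X$ and $S$ and checks the first two moments of $(n-1)S/\sigma^2$ against $\chi^2(n-1)$, which has mean $n-1$ and variance $2(n-1)$.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 2.0, 3.0, 10, 200_000

X = rng.normal(mu, sigma, size=(trials, n))   # each row: one experiment of n draws
Xbar = X.mean(axis=1)                         # sample means
S = X.var(axis=1, ddof=1)                     # sample variances with 1/(n-1)

print(Xbar.mean())                 # ≈ mu        (E(Xbar) = mu)
print(S.mean())                    # ≈ sigma^2   (E(S) = sigma^2)
T = (n - 1) * S / sigma**2         # should be chi^2(n-1)
print(T.mean(), T.var())           # ≈ n-1 and ≈ 2(n-1)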
From the classic textbook of Anderson:
Multivariate statistical analysis is concerned with data that
consists of sets of measurements on a number of individuals
or objects. The sample data may be heights and weights of
some individuals drawn randomly from a population of
school children in a given city, or the statistical treatment
may be made on a collection of measurements, such as
lengths and widths of petals and lengths and widths of
sepals of iris plants taken from two species, or one may
study the scores on batteries of mental tests administered
to a number of students.
p = # of sets of measurements on a given individual,
n = # of observations = sample size
• In the above examples, one can assume that p ≪ n since typically many observations will be taken.
• Today it is common for p ≫ 1, so n/p is no longer necessarily large. Some examples:
Vehicle Sound Signature Recognition: Vehicle noise is a
stochastic signal. The power spectrum is discretized to a
vector of length p = 1200 with n ≈ 1200 samples from the
same kind of vehicle.
Astrophysics: Sloan Digital Sky Survey typically has many
observations (say of quasar spectrum) with the spectra of
each quasar binned resulting in a large p.
Financial data: S&P 500 stocks observed over monthly
intervals for twenty years.
GAUSSIAN DATA MATRICES
The data are now n independent column vectors of length p
~x1, ~x2, . . . , ~xn
from which we construct the $n\times p$ data matrix
$$X = \begin{pmatrix} \leftarrow\ \vec{x}_1^{\,T}\ \rightarrow \\ \leftarrow\ \vec{x}_2^{\,T}\ \rightarrow \\ \vdots \\ \leftarrow\ \vec{x}_n^{\,T}\ \rightarrow \end{pmatrix}$$
The Gaussian assumption is that
~xj ∼ Np(µ,Σ)
Many applications assume the mean has already been subtracted out of the data, i.e. µ = 0.
Multivariate Gaussian Distribution
If x and y are vectors, the matrix x⊗ y is defined by
(x⊗ y)jk = xjyk
If µ = E(x) is the mean of the random vector x, then the
covariance matrix of x is the p× p matrix
$$\Sigma = E\left[(x-\mu)\otimes(x-\mu)\right]$$
Σ is a symmetric, non-negative definite matrix. If Σ > 0 (positive
definite) and X ∼ Np(µ,Σ), then the density function of X is
$$f_X(x) = (2\pi)^{-p/2}(\det\Sigma)^{-1/2}\exp\left(-\tfrac{1}{2}\left\langle x-\mu,\ \Sigma^{-1}(x-\mu)\right\rangle\right),\qquad x \in \mathbb{R}^p$$
Sample mean:
$$\bar x = \frac{1}{n}\sum_{j=1}^{n}\vec{x}_j,\qquad E(\bar x) = \mu$$
Sample covariance matrix:
$$S = \frac{1}{n-1}\sum_{j=1}^{n}(\vec{x}_j - \bar x)\otimes(\vec{x}_j - \bar x)$$
For µ = 0 the sample covariance matrix can be written simply as $\frac{1}{n}\,X^T X$.
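In numpy the sample covariance matrix is essentially one line; the sketch below (variable names and the test covariance are my own choices) checks the $1/(n-1)$ formula against np.cov and shows the µ = 0 simplification.

import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 5000
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)    # n x p data matrix

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / (n - 1)         # sample covariance matrix
print(np.allclose(S, np.cov(X, rowvar=False)))  # True: matches numpy's built-in
S0 = X.T @ X / n                                # the mu = 0 simplification
print(np.round(S, 2))                           # both ≈ Sigma for large n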
Some Notation: If X is an n × p data matrix formed from the n independent column vectors $x_j$, cov($x_j$) = Σ, we can form one column vector vec(X) of length pn.
The covariance of vec(X) is the np× np matrix
$$I_n \otimes \Sigma = \begin{pmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma \end{pmatrix}$$
In this case we say the data matrix X constructed from n
independent xj ∼ Np(µ,Σ) has distribution
Np(M, In ⊗ Σ)
where M = E(X) = 1⊗ µ, 1 is the column vector of all 1’s.
Definition: If $A = X^T X$ where the n × p matrix X is $N_p(0, I_n \otimes \Sigma)$, Σ > 0, then A is said to have the Wishart distribution with n degrees of freedom and covariance matrix Σ. We will say A is $W_p(n, \Sigma)$.
• The Wishart distribution is the multivariate generalization of
the chi-squared distribution.
• $A \sim W_p(n,\Sigma)$ is positive definite with probability one if and only if n ≥ p.
• The sample covariance matrix S is $W_p\left(n-1,\ \tfrac{1}{n-1}\Sigma\right)$.
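A Monte Carlo sketch of the definition (setup mine): draw X with independent $N_p(0,\Sigma)$ rows, form $A = X^T X \sim W_p(n,\Sigma)$, and check positive definiteness (n ≥ p) and the moment identity $E(A) = n\Sigma$.

import numpy as np

rng = np.random.default_rng(2)
p, n, trials = 3, 8, 5000
Sigma = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
C = np.linalg.cholesky(Sigma)      # Sigma = C C^T

A_sum = np.zeros((p, p))
for _ in range(trials):
    X = rng.standard_normal((n, p)) @ C.T       # rows ~ N_p(0, Sigma)
    A = X.T @ X                                 # A ~ W_p(n, Sigma)
    assert np.all(np.linalg.eigvalsh(A) > 0)    # positive definite since n >= p
    A_sum += A

print(np.round(A_sum / trials, 2))   # ≈ n * Sigma
print(n * Sigma)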
WISHART DENSITY FUNCTION, n ≥ p
Let $S_p$ denote the space of p × p positive definite (symmetric) matrices. If $A = (a_{jk}) \in S_p$, let
$$(dA) = \text{volume element of } A = \bigwedge_{1\le j\le k\le p} da_{jk}$$
The multivariate gamma function is
$$\Gamma_p(a) = \int_{S_p} e^{-\mathrm{tr}(A)}\,(\det A)^{a-(p+1)/2}\,(dA),\qquad \Re(a) > (p-1)/2.$$
Theorem: If A is $W_p(n,\Sigma)$ with n ≥ p, then the density function of A is
$$\frac{1}{2^{np/2}\,\Gamma_p(n/2)\,(\det\Sigma)^{n/2}}\;e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}A)}\,(\det A)^{(n-p-1)/2}$$
Sketch of Proof:
• The density function for X is the multivariate Gaussian (including the volume element (dX))
$$(2\pi)^{-np/2}\,(\det\Sigma)^{-n/2}\,e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}X^T X)}\,(dX)$$
• Recall the QR factorization: Let X denote an n × p matrix with n ≥ p and full column rank. Then there exist a unique n × p matrix Q with $Q^T Q = I_p$ and a unique p × p upper triangular matrix R with positive diagonal elements so that X = QR. Note $A = X^T X = R^T R$. (A small numerical sketch of this sign convention follows the proof.)
• A Jacobian calculation [1, 13]: If $A = X^T X$, then
$$(dX) = 2^{-p}\,(\det A)^{(n-p-1)/2}\,(dA)\,(Q^T dQ)$$
where $(Q^T dQ)$ denotes the exterior product of the forms $q_k^T\,dq_j$ and $Q = (q_1, \ldots, q_p)$ is the column representation of Q.
• Thus the joint distribution of A and Q is
$$(2\pi)^{-np/2}\,(\det\Sigma)^{-n/2}\,e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}A)}\times 2^{-p}\,(\det A)^{(n-p-1)/2}\,(dA)\,(Q^T dQ)$$
• Now integrate over all Q. Use the fact that
$$\int_{V_{n,p}} (Q^T dQ) = \frac{2^p\,\pi^{np/2}}{\Gamma_p(n/2)}$$
where $V_{n,p}$ is the set of real n × p matrices Q satisfying $Q^T Q = I_p$. (When n = p this is the orthogonal group.)
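The uniqueness convention above (positive diagonal of R) is easy to enforce numerically; the helper below is my own sketch, since numpy's qr fixes no sign convention.

import numpy as np

def qr_positive(X):
    # thin QR: Q is n x p with Q^T Q = I_p, R is p x p upper triangular
    Q, R = np.linalg.qr(X)
    d = np.sign(np.diag(R))
    d[d == 0] = 1.0                      # guard against exact zeros
    return Q * d, d[:, None] * R         # flip signs so that R_jj > 0

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 3))          # full column rank with probability one
Q, R = qr_positive(X)
print(np.allclose(Q.T @ Q, np.eye(3)), np.all(np.diag(R) > 0))
print(np.allclose(X.T @ X, R.T @ R))     # A = X^T X = R^T R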
Remarks regarding the Wishart density function
• Case p = 2 obtained by R. A. Fisher in 1915.
• General p by J. Wishart in 1928 by geometrical arguments.
• Proof outlined above came later. (See [1, 13] for complete proofs.)
• When Q is a p × p orthogonal matrix,
$$\frac{(Q^T dQ)}{\int_{O(p)} (Q^T dQ)} = \frac{\Gamma_p(p/2)}{2^p\,\pi^{p^2/2}}\,(Q^T dQ)$$
is normalized Haar measure for the orthogonal group O(p). We denote this Haar measure by (dQ).
• Siegel proved (see, e.g., [13])
$$\Gamma_p(a) = \pi^{p(p-1)/4}\,\prod_{j=1}^{p}\Gamma\!\left(a - \frac{j-1}{2}\right)$$
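Siegel's product formula makes $\Gamma_p(a)$ easy to evaluate; here is a short scipy sketch (the function name is mine), cross-checked against scipy's multigammaln.

import numpy as np
from scipy.special import gamma, multigammaln

def multivariate_gamma(a, p):
    # Siegel: Gamma_p(a) = pi^{p(p-1)/4} * prod_{j=1}^p Gamma(a - (j-1)/2)
    return np.pi ** (p * (p - 1) / 4) * np.prod(
        [gamma(a - (j - 1) / 2) for j in range(1, p + 1)])

print(np.isclose(multivariate_gamma(2.5, 1), gamma(2.5)))   # p = 1: ordinary gamma
print(np.isclose(multivariate_gamma(3.0, 4), np.exp(multigammaln(3.0, 4))))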
EIGENVALUES OF A WISHART MATRIX
Theorem: If A is Wp(n,Σ) with n ≥ p the joint density function
for the eigenvalues ℓ1,. . . , ℓp of A is
$$\frac{\pi^{p^2/2}\,2^{-np/2}\,(\det\Sigma)^{-n/2}}{\Gamma_p(p/2)\,\Gamma_p(n/2)}\;\prod_{j=1}^{p}\ell_j^{(n-p-1)/2}\;\prod_{j<k}\left|\ell_j-\ell_k\right|\;\times\;\int_{O(p)} e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}QLQ^T)}\,(dQ),\qquad (\ell_1 > \cdots > \ell_p)$$
where L = diag(ℓ1, . . . , ℓp) and (dQ) is normalized Haar measure.
Note that $\Delta(\ell) := \prod_{j<k}(\ell_j - \ell_k)$ is the Vandermonde determinant.
Corollary: If A is $W_p(n, I_p)$, then the integral over the orthogonal group in the previous theorem is
$$\int_{O(p)} e^{-\frac{1}{2}\mathrm{tr}(QLQ^T)}\,(dQ) = e^{-\frac{1}{2}\sum_j \ell_j}.$$
Proof: Recall that the Wishart density function (times the volume element (dA)) is
$$\frac{1}{2^{np/2}\,\Gamma_p(n/2)\,(\det\Sigma)^{n/2}}\;e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}A)}\,(\det A)^{(n-p-1)/2}\,(dA)$$
The idea is to diagonalize A by an orthogonal transformation and
then integrate over the orthogonal group thereby giving the density
function for the eigenvalues of A.
Let ℓ1 > · · · > ℓp be the ordered eigenvalues of A.
A = QLQT , L = diag(ℓ1, . . . , ℓp), Q ∈ O(p)
The jth column of Q is a normalized eigenvector of A. The
transformation is not 1–1 since Q = [±q1, . . . ,±qp] works for each
fixed A. The transformation is made 1–1 by requiring that the 1st
element of each qj is nonnegative. This restricts Q (as A varies) to
a $2^{-p}$ part of O(p). We compensate for this at the end.
We need an expression for the volume element (dA) in terms of Q
and L. First we compute the differential of A
$$dA = dQ\,L\,Q^T + Q\,dL\,Q^T + Q\,L\,dQ^T$$
$$Q^T dA\,Q = Q^T dQ\,L + dL + L\,dQ^T Q = -dQ^T Q\,L + L\,dQ^T Q + dL$$
(We used QTQ = I implies QT dQ = −dQT Q.)
We now use the following fact (see, e.g., page 58 in [13]): If $X = BYB^T$ where X and Y are p × p symmetric matrices and B is a nonsingular p × p matrix, then $(dX) = (\det B)^{p+1}(dY)$. In our case
Q is orthogonal so the volume element (dA) equals the volume
element (QT dAQ). The volume element is the exterior product of
the diagonal elements of QT dAQ times the exterior product of the
elements above the diagonal.
Since L is diagonal, the commutator $[L,\ dQ^T Q]$ has zero diagonal elements. Thus the exterior product of the diagonal elements of $Q^T dA\,Q$ is $\bigwedge_j d\ell_j$.
The exterior product of the elements above the diagonal coming from the commutator is
$$\prod_{j<k}(\ell_j - \ell_k)\;\bigwedge_{j<k} q_k^T\,dq_j,$$
and so
$$(dA) = \Delta(\ell)\;\bigwedge_j d\ell_j\;\bigwedge_{j<k} q_k^T\,dq_j.$$
The theorem now follows once we integrate over all of O(p) and divide the result by $2^p$.
• One is interested in limit laws as n, p → ∞. For Σ = $I_p$, Johnstone [11] proved, using RMT methods, that for the centering and scaling constants
$$\mu_{np} = \left(\sqrt{n-1}+\sqrt{p}\,\right)^2,\qquad \sigma_{np} = \left(\sqrt{n-1}+\sqrt{p}\,\right)\left(\frac{1}{\sqrt{n-1}}+\frac{1}{\sqrt{p}}\right)^{1/3},$$
the quantity
$$\frac{\ell_1 - \mu_{np}}{\sigma_{np}}$$
converges in distribution as n, p → ∞, n/p → γ < ∞, to the GOE largest eigenvalue distribution [15]. (A small simulation sketch follows these remarks.)
• El Karoui [6] has extended the result to γ ≤ ∞. The case p ≫ n appears, for example, in microarray data.
• Soshnikov [14] has lifted the Gaussian assumption under the additional restriction $n - p = O(p^{1/3})$.
• For Σ ≠ $I_p$, the difficulty in establishing limit theorems comes from the integral
$$\int_{O(p)} e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}Q\Lambda Q^T)}\,(dQ).$$
Using zonal polynomials, infinite series expansions have been derived for this integral, but these expansions are difficult to analyze. See Muirhead [13].
• For complex Gaussian data matrices X, similar density formulas are known for the eigenvalues of $X^*X$. Limit theorems for Σ ≠ $I_p$ are known since the analogous group integral, now over the unitary group, is known explicitly: the Harish-Chandra–Itzykson–Zuber (HCIZ) integral (see, e.g., [17]). See the work of Baik, Ben Arous and Peche [2, 3] and El Karoui [7].
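A minimal simulation sketch of Johnstone's theorem (all parameter choices mine): for Σ = $I_p$ the centered and scaled $\ell_1$ should be approximately Tracy–Widom GOE, whose mean is ≈ −1.21 and standard deviation ≈ 1.27.

import numpy as np

rng = np.random.default_rng(4)
n, p, trials = 400, 100, 1000

mu_np = (np.sqrt(n - 1) + np.sqrt(p)) ** 2
sigma_np = (np.sqrt(n - 1) + np.sqrt(p)) * (1 / np.sqrt(n - 1) + 1 / np.sqrt(p)) ** (1 / 3)

stats = np.empty(trials)
for t in range(trials):
    X = rng.standard_normal((n, p))            # white Gaussian data, Sigma = I_p
    ell1 = np.linalg.eigvalsh(X.T @ X)[-1]     # largest eigenvalue of A = X^T X
    stats[t] = (ell1 - mu_np) / sigma_np

print(stats.mean(), stats.std())   # ≈ -1.21 and ≈ 1.27 (Tracy-Widom GOE moments)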
PRINCIPAL COMPONENT ANALYSIS (PCA),
H. Hotelling, 1933
Population Principal Components: Let x be a p× 1 random
vector with E(x) = µ and cov(x) = Σ > 0. Let λ1 ≥ λ2 ≥ · · · ≥ λp
denote the eigenvalues of Σ and H an orthogonal matrix
diagonalizing Σ: HT ΣH = Λ = diag(λ1, . . . , λp). We write H in
column vector form
H = [h1, . . . , hp]
so that hj is the p× 1 eigenvector of Σ corresponding to eigenvalue
λj . Define the p× 1 vector
u = HTx = (u1, . . . , up)T
then
$$\mathrm{cov}(u) = E\big[(u - E(u))\otimes(u - E(u))\big] = H^T\,E\big((x-\mu)\otimes(x-\mu)\big)\,H = H^T\Sigma H = \Lambda\ \Rightarrow\ u_j\ \text{uncorrelated}.$$
Definition: uj is called the jth principal component of x. Note
var(uj) = λj .
Statistical interpretations: The claim is that u1 is that linear
combination of components of x that has maximum variance.
Proof: For simplicity of notation, set µ = 0. Let b denote any p× 1
vector, bT b = 1, and form bTx.
$$\mathrm{var}(b^T x) = E\left(b^T x\cdot b^T x\right) = E\left(b^T x\,(b^T x)^T\right) = b^T\,E(xx^T)\,b = b^T\,\Sigma\,b.$$
We want to maximize the right hand side subject to the constraint
bT b = 1. By the method of Lagrange multipliers we maximize
bT Σ b− λ(bT b− 1)
Since Σ is symmetric the vector of partial derivatives is
2Σ b− 2λb
Thus b must be an eigenvector with eigenvalue λ.
The largest variance corresponds to choosing the largest eigenvalue.
The general result is that ur has maximum variance of all
normalized combinations uncorrelated with u1,. . . , ur−1.
Sample principal components: Let S denote the sample covariance matrix of the data matrix X and let $Q = [q_1, \ldots, q_p]$ be a p × p orthogonal matrix diagonalizing S:
$$Q^T S\,Q = \mathrm{diag}(\ell_1, \ldots, \ell_p)$$
The ℓj are the sample variances that are estimates for λj . The
vectors qj are sample estimates for the vectors hj .
If x is the random vector and $u = H^T x$ is the vector of principal components, then $\hat u = Q^T x$ is the vector of sample principal components.
In applications: How many of the $\ell_j$'s are significant?
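A sample-PCA sketch in numpy (the population Σ here is my own test case): diagonalize S, read off the sample variances $\ell_j$, and confirm the sample principal components are uncorrelated.

import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 4
H = np.linalg.qr(rng.standard_normal((p, p)))[0]   # random orthogonal H
Sigma = H @ np.diag([10.0, 2.0, 1.0, 0.5]) @ H.T   # one dominant population direction
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

S = np.cov(X, rowvar=False)                 # sample covariance matrix
ell, Q = np.linalg.eigh(S)                  # ascending eigenvalues
ell, Q = ell[::-1], Q[:, ::-1]              # reorder: ell_1 >= ... >= ell_p
U = X @ Q                                   # rows are sample principal components Q^T x
print(np.round(ell, 2))                     # estimates of lambda_j
print(np.round(np.cov(U, rowvar=False), 2)) # ≈ diag(ell): components uncorrelated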
CANONICAL CORRELATION ANALYSIS (CCA)
H. Hotelling, 1936
Suppose a large data set is naturally decomposed into two groups.
For example, p× 1 random vectors ~x1,. . . , ~xn make up one set and
q × 1 random vectors ~y1,. . . , ~ym the other. We are interested in the
correlations between these two data sets. For example, in medicine
we might have n measurements of age, height, and weight (p = 3)
and m measurements of systolic and diastolic blood pressures
(q = 2). We are interested in what combination of the components
of x is most correlated with a combination of the components of y.
Population Canonical Correlations: Let x and y be two random vectors of size p × 1 and q × 1, respectively. We assume E(x) = E(y) = 0 and p ≤ q. Form the (p+q) × 1 vector
$$z = \begin{pmatrix} x \\ y \end{pmatrix}$$
and its (p+q) × (p+q) covariance matrix
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$
Set
$$u := \alpha^T x \in \mathbb{R},\qquad v := \gamma^T y \in \mathbb{R}$$
where α and γ are vectors to be determined. We want to maximize
$$\mathrm{corr}(u, v) = \frac{\mathrm{cov}(u, v)}{\sqrt{\mathrm{var}(u)\,\mathrm{var}(v)}}.$$
The correlation does not change under scale transformations u → cu, etc., so we can maximize this correlation subject to the constraints
$$E(u^2) = E(\alpha^T x\cdot\alpha^T x) = \alpha^T\Sigma_{11}\alpha = 1 \qquad (1)$$
$$E(v^2) = \gamma^T\Sigma_{22}\gamma = 1 \qquad (2)$$
Under these constraints
corr(u, v) = E(αTx · γT y) = αT Σ12γ.
$$\psi = \alpha^T\Sigma_{12}\gamma - \tfrac{1}{2}\rho\left(\alpha^T\Sigma_{11}\alpha - 1\right) - \tfrac{1}{2}\lambda\left(\gamma^T\Sigma_{22}\gamma - 1\right)$$
where ρ and λ are Lagrange multipliers. Set the vector of partial
derivatives to zero:
$$\frac{\partial\psi}{\partial\alpha} = \Sigma_{12}\gamma - \rho\,\Sigma_{11}\alpha = 0 \qquad (3)$$
$$\frac{\partial\psi}{\partial\gamma} = \Sigma_{21}\alpha - \lambda\,\Sigma_{22}\gamma = 0 \qquad (4)$$
If we left multiply (3) by $\alpha^T$ and (4) by $\gamma^T$ and use the normalization conditions (1) and (2), we conclude λ = ρ. Thus (3) and (4) become
$$\begin{pmatrix} -\rho\,\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho\,\Sigma_{22} \end{pmatrix}\begin{pmatrix} \alpha \\ \gamma \end{pmatrix} = 0 \qquad (5)$$
with corr(u, v) = ρ, and ρ ≥ 0 is a solution to
$$\det\begin{pmatrix} -\rho\,\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho\,\Sigma_{22} \end{pmatrix} = 0.$$
This is a polynomial equation in ρ of degree p + q. Let $\rho_1$ denote the maximum root and $\alpha_1$ and $\gamma_1$ corresponding solutions to (5).
Definition: u1 and v1 are called the first canonical variables and
their correlation $\rho_1 = \mathrm{corr}(u_1, v_1)$ is called the first canonical correlation coefficient.
More generally, the rth pair of canonical variables is the pair of
linear combinations ur = (α(r))Tx and vr = (γ(r))T y, each of unit
variance and uncorrelated with the first r − 1 pairs of canonical
variables and having maximum correlation. The correlation
corr(ur, vr) is the rth canonical correlation coefficient.
REDUCTION TO AN EIGENVALUE PROBLEM
Multiplying the block rows of (5) by $\Sigma_{11}^{-1}$ and $\Sigma_{22}^{-1}$ gives
$$\begin{pmatrix} -\rho I & \Sigma_{11}^{-1}\Sigma_{12} \\ \Sigma_{22}^{-1}\Sigma_{21} & -\rho I \end{pmatrix}\begin{pmatrix} \alpha \\ \gamma \end{pmatrix} = 0.$$
Thus the determinantal equation becomes
$$\det\left(\rho^2 I - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\right) = 0.$$
The nonzero roots ρ1, . . . , ρk are called the population canonical
correlation coefficients. k = rank(Σ12).
In applications Σ is not known. One uses the sample covariance
matrix S to obtain sample canonical correlation coefficients.
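A sketch of the sample version (the data generation is mine): build the blocks of S for the stacked vector and take the eigenvalues of $S_{11}^{-1}S_{12}S_{22}^{-1}S_{21}$ to get the squared sample canonical correlations.

import numpy as np

rng = np.random.default_rng(6)
n, p, q = 2000, 2, 3
z = rng.standard_normal((n, 1))                    # shared latent factor
x = z @ rng.standard_normal((1, p)) + rng.standard_normal((n, p))
y = z @ rng.standard_normal((1, q)) + rng.standard_normal((n, q))

S = np.cov(np.hstack([x, y]), rowvar=False)
S11, S12 = S[:p, :p], S[:p, p:]
S21, S22 = S[p:, :p], S[p:, p:]

M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
rho2 = np.sort(np.linalg.eigvals(M).real)[::-1]    # squared canonical correlations
print(np.sqrt(np.clip(rho2, 0, None)))             # r_1 >= r_2 >= ...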
The first application of Hotelling's canonical correlations is by F. Waugh [16] in 1942. He begins his paper with
Professor Hotelling’s paper, “Relations between Two Sets
of Variates,” should be widely known and his method used
by practical statisticians. Yet, few practical statisticians
seem to know of the paper, and perhaps those few are
inclined to regard it as a mathematical curiosity rather
than an important and useful method for analyzing
concrete problems. This may be due to . . .
The Practical Problem: Relation of wheat characteristics to
flour characteristics. Guiding principle
The grade of the raw material should give a good
indication of the probable grade of the finished product.
The data: 138 samples of Canadian Hard Red Spring wheat and the flour made from each of these samples.*
wheat quality flour quality
x1 = kernel texture y1 = wheat per barrel of flour
x2 = test weight y2 = ash in flour
x3 = damaged kernels y3 = crude protein in flour
x4 = crude protein in wheat y4 = gluten quality index
x5 = foreign material
$u_1 = \alpha_1^T x$ = index of wheat quality, $v_1 = \gamma_1^T y$ = index of flour quality
corr(u1, v1) = 0.909
*In this example p > q. The data are normalized to mean 0 and variance 1.
DISTRIBUTION OF SAMPLE CANONICAL CORRELATION COEFFICIENTS
Given a sample of n observations on $N_{p+q}(\mu, \Sigma)$, let A denote the (unnormalized) sample covariance matrix. Then A is $W_{p+q}(n, \Sigma)$. We partition Σ and A conformally:
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},\qquad A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.$$
Theorem (Constantine, 1963): Let A have the $W_{p+q}(n, \Sigma)$ distribution where p ≤ q, n ≥ p + q, and Σ and A are partitioned as above. Then the joint probability density function of $r_1^2, \ldots, r_p^2$, the eigenvalues of $A_{11}^{-1}A_{12}A_{22}^{-1}A_{21}$ (let $\xi_j := r_j^2$), is
$$c_{p,q,n}\;F(\xi)\;\Delta(\xi)\;\prod_{j=1}^{p}\xi_j^{(q-p-1)/2}\,(1-\xi_j)^{(n-p-q-1)/2}$$
• $c_{p,q,n}$ is a normalization constant.
• $\Delta(\xi)$ = Vandermonde determinant = $\prod_{j<k}(\xi_j - \xi_k)$.
• F(ξ) is a two-matrix hypergeometric function which can be expressed as an infinite series involving zonal polynomials. See Theorem 11.3.2 and Definition 7.3.2 in Muirhead [13].
Null Distribution: For $\Sigma_{12} = 0$ (x and y are independent), the above joint density for the sample canonical correlation coefficients reduces to
$$c_{p,q,n}\;\Delta(\xi)\;\prod_{j=1}^{p}\xi_j^{(q-p-1)/2}\,(1-\xi_j)^{(n-p-q-1)/2}.$$
In this case the distribution of the largest sample canonical
correlation coefficient r1 can be used for testing the null hypothesis:
H : Σ12 = 0. We reject H for large values of r1.
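Under H the null density above determines the law of $r_1$; a Monte Carlo sketch (sizes mine) estimates its 95th percentile, which serves as a critical value for the test.

import numpy as np

rng = np.random.default_rng(7)
n, p, q, trials = 100, 2, 3, 2000

r1 = np.empty(trials)
for t in range(trials):
    x = rng.standard_normal((n, p))      # x independent of y: Sigma_12 = 0
    y = rng.standard_normal((n, q))
    S = np.cov(np.hstack([x, y]), rowvar=False)
    M = np.linalg.solve(S[:p, :p], S[:p, p:]) @ np.linalg.solve(S[p:, p:], S[p:, :p])
    r1[t] = np.sqrt(np.max(np.linalg.eigvals(M).real))

print(np.quantile(r1, 0.95))   # reject H: Sigma_12 = 0 when observed r_1 exceeds this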
Zonal polynomials naturally arise in multivariate analysis when considering group integrals such as
$$\int_{O(m)} e^{-\mathrm{tr}(XHYH^T)}\,(dH)$$
where X and Y are m × m symmetric, positive definite matrices and (dH) is normalized Haar measure. Zonal polynomials can be defined either through the representation theory of GL(m, R) [12] or as eigenfunctions of certain Laplacians [5, 13].
Let Sm denote the space of m×m symmetric, positive definite
matrices. Zonal polynomials, Cλ(X), X ∈ Sm are certain
homogeneous polynomials in the eigenvalues of X that are indexed
by partitions λ.
To give the precise definition we need some preliminary definitions.
Definition: A partition λ of n is a sequence $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_\ell)$ where the $\lambda_j \ge 0$ are weakly decreasing and $\sum_j \lambda_j = n$. We denote this by λ ⊢ n.
For example, λ = (5, 3, 3, 1) is a partition of 12. The number of
nonzero parts of λ is called the length of λ, denoted ℓ(λ).
If λ and µ are two partitions of n, we say λ < µ (lexicographic order) if, for some index i, $\lambda_j = \mu_j$ for j < i and $\lambda_i < \mu_i$. For example,
(1, 1, 1, 1) < (2, 1, 1) < (2, 2) < (3, 1) < (4)
If λ ⊢ n with ℓ(λ) = m, we define the monomial
$$x^\lambda = x_1^{\lambda_1} x_2^{\lambda_2} \cdots x_m^{\lambda_m}$$
and say $x^\mu$ is of higher weight than $x^\lambda$ if µ > λ.
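These definitions translate directly into code; the tiny helper below (mine, not from the slides) compares partitions in lexicographic order after zero-padding, which is exactly Python's tuple comparison.

def lex_less(lam, mu):
    # pad with zeros so partitions of different lengths compare cleanly
    m = max(len(lam), len(mu))
    return tuple(lam) + (0,) * (m - len(lam)) < tuple(mu) + (0,) * (m - len(mu))

parts_of_4 = [(1, 1, 1, 1), (2, 1, 1), (2, 2), (3, 1), (4,)]
print(all(lex_less(a, b) for a, b in zip(parts_of_4, parts_of_4[1:])))   # True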
Metrics and Laplacians
Let $X \in S_m$ and $dX = (dx_{jk})$ the matrix of differentials of X. We define a metric on $S_m$ by
$$(ds)^2 = \mathrm{tr}\big((X^{-1}\,dX)^2\big)$$
A simple computation shows this metric is invariant under
X −→ LXLT , L ∈ GL(m,R).
Let n = m(m+1)/2 = # of independent elements of X. We denote by vec(X) ∈ $\mathbb{R}^n$ the column vector representation of X, e.g. for m = 2,
$$X = \begin{pmatrix} x_{11} & x_{12} \\ x_{12} & x_{22} \end{pmatrix},\qquad x := \mathrm{vec}(X) = \begin{pmatrix} x_{11} \\ x_{12} \\ x_{22} \end{pmatrix}.$$
The metric $(ds)^2$ is a quadratic differential form in dx. For m = 2,
$$(ds)^2 = \begin{pmatrix} dx_{11} & dx_{12} & dx_{22} \end{pmatrix} G(x) \begin{pmatrix} dx_{11} \\ dx_{12} \\ dx_{22} \end{pmatrix},$$
$$G(x) = (g_{ij}) = \frac{1}{(x_{11}x_{22} - x_{12}^2)^2}\begin{pmatrix} x_{22}^2 & -2x_{12}x_{22} & x_{12}^2 \\ -2x_{12}x_{22} & 2x_{11}x_{22} + 2x_{12}^2 & -2x_{11}x_{12} \\ x_{12}^2 & -2x_{11}x_{12} & x_{11}^2 \end{pmatrix}.$$
Labeling x = vec(X) with a single index, the metric is
$$(ds)^2 = \sum_{i,j} dx_i\,g_{ij}\,dx_j$$
and the Laplacian associated to this metric is
$$\Delta_X = (\det G)^{-1/2}\sum_{i,j=1}^{n}\frac{\partial}{\partial x_i}\left[(\det G)^{1/2}\,g^{ij}\,\frac{\partial}{\partial x_j}\right],\qquad G^{-1} = (g^{ij}).$$
More succinctly, if $\nabla_X = (\partial/\partial x_1, \ldots, \partial/\partial x_n)^T$, then
$$\Delta_X = (\det G)^{-1/2}\left(\nabla_X,\ (\det G)^{1/2}\,G^{-1}\nabla_X\right)$$
where (·, ·) is the standard inner product on $\mathbb{R}^n$.
One can show that ∆X is an invariant differential operator :
∆LXLT = ∆X , L ∈ GL(m,R)
We now diagonalize X
X = HYHT , Y = diag(y1, . . . , ym), H ∈ O(m).
The Laplacian $\Delta_X$ is now expressed in terms of a radial part and an angular part. The radial part of $\Delta_X$ is the differential operator
$$\sum_{j=1}^{m} y_j^2\frac{\partial^2}{\partial y_j^2} + \sum_{j=1}^{m} y_j\frac{\partial}{\partial y_j} + \frac{1}{2}\sum_{j<k}\frac{y_j + y_k}{y_j - y_k}\left(y_j\frac{\partial}{\partial y_j} - y_k\frac{\partial}{\partial y_k}\right)$$
We now let $\Delta_X$ denote only the radial part.
We can now define zonal polynomials!
Definition: Let $X \in S_m$ with eigenvalues $x_1, \ldots, x_m$ and let $\lambda = (\lambda_1, \ldots, \lambda_m) \vdash k$ be a partition of k into not more than m parts. Then $C_\lambda(X)$ is the symmetric, homogeneous polynomial of degree k in the $x_j$ such that:
1. The term of highest weight in $C_\lambda(X)$ is $x^\lambda$;
2. $C_\lambda(X)$ is an eigenfunction of the Laplacian $\Delta_X$;
3. $(x_1 + \cdots + x_m)^k = \sum_{\lambda \vdash k} C_\lambda(X)$.
• Must show there is a unique polynomial satisfying these three conditions.
• The eigenvalue in (2) equals $\alpha_\lambda := \sum_j \lambda_j(\lambda_j - j) + k(m+1)/2$.
• The program MOPS [5] computes zonal polynomials.
The following theorem is at the core of why zonal polynomials
appear in multivariate statistical analysis.
Theorem: If $X, Y \in S_p$, then
$$\int_{O(p)} C_\lambda(XHYH^T)\,(dH) = \frac{C_\lambda(X)\,C_\lambda(Y)}{C_\lambda(I_p)} \qquad (6)$$
where (dH) is normalized Haar measure.
Proof: Let $f_\lambda(Y)$ denote the left-hand side of (6) and let $Q \in O(p)$. Then $f_\lambda(QYQ^T) = f_\lambda(Y)$ (let H → HQ in the integral and use invariance of the measure). Thus $f_\lambda$ is a symmetric function of the eigenvalues of Y. Since $C_\lambda$ is homogeneous of degree |λ|, so is $f_\lambda$. Now apply the Laplacian to $f_\lambda$:
$$\begin{aligned}
\Delta_Y f_\lambda(Y) &= \int_{O(p)} \Delta_Y\,C_\lambda(XHYH^T)\,(dH)\\
&= \int_{O(p)} \Delta_Y\,C_\lambda(X^{1/2}HYH^TX^{1/2})\,(dH)\\
&= \int_{O(p)} \Delta_Y\,C_\lambda(LYL^T)\,(dH)\qquad (L = X^{1/2}H)\\
&= \int_{O(p)} \Delta_{LYL^T}\,C_\lambda(LYL^T)\,(dH)\qquad \text{(invariance of }\Delta_Y\text{)}\\
&= \alpha_\lambda \int_{O(p)} C_\lambda(LYL^T)\,(dH) = \alpha_\lambda\,f_\lambda(Y)
\end{aligned}$$
By definition of $C_\lambda(Y)$ we have $f_\lambda(Y) = d_\lambda\,C_\lambda(Y)$ for some constant $d_\lambda$. Since $f_\lambda(I_p) = C_\lambda(X)$, we find $d_\lambda = C_\lambda(X)/C_\lambda(I_p)$ and the theorem follows.
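For the simplest case λ = (1) the theorem can be checked numerically, since $C_{(1)}(X) = \mathrm{tr}\,X$ and $C_{(1)}(I_p) = p$, so (6) reads $\int \mathrm{tr}(XHYH^T)\,(dH) = \mathrm{tr}(X)\,\mathrm{tr}(Y)/p$. A Monte Carlo sketch (the test matrices are mine) using Haar-random orthogonal matrices:

import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(8)
p, trials = 4, 20_000
X = np.diag([1.0, 2.0, 3.0, 4.0])        # symmetric positive definite test matrices
Y = np.diag([0.5, 1.0, 1.5, 2.0])

vals = np.empty(trials)
for t in range(trials):
    H = ortho_group.rvs(p, random_state=rng)   # Haar measure on O(p)
    vals[t] = np.trace(X @ H @ Y @ H.T)

print(vals.mean())                        # ≈ tr(X) tr(Y) / p
print(np.trace(X) * np.trace(Y) / p)      # = 10 * 5 / 4 = 12.5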
Using Zonal Polynomials to Evaluate Group Integrals
Expanding the exponential and using $(\mathrm{tr}\,A)^k = \sum_{\lambda\vdash k} C_\lambda(A)$ together with (6),
$$\int_{O(p)} e^{-\rho\,\mathrm{tr}(XHYH^T)}\,(dH) = \sum_{k\ge 0}\frac{(-\rho)^k}{k!}\sum_{\lambda\vdash k}\int_{O(p)} C_\lambda(XHYH^T)\,(dH) = \sum_{k\ge 0}\frac{(-\rho)^k}{k!}\sum_{\lambda\vdash k}\frac{C_\lambda(X)\,C_\lambda(Y)}{C_\lambda(I_p)}.$$
[1] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, third edition, John Wiley & Sons, Inc., 2003.
[2] J. Baik, G. Ben Arous, and S. Peche, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, Ann. Probab. 33 (2005), 1643–1697.
[3] J. Baik, Painleve formulas of the limiting distributions for nonnull complex sample covariance matrices, Duke Math. J. 133 (2006).
[4] M. Dieng and C. A. Tracy, Application of random matrix theory to multivariate statistics, arXiv:math.PR/0603543.
[5] I. Dumitriu, A. Edelman and G. Shuman, MOPS: Multivariate Orthogonal Polynomials (symbolically), arXiv:math-ph/0409066.
[6] N. El Karoui, On the largest eigenvalue of Wishart matrices with identity covariance when n, p and p/n tend to infinity, arXiv preprint.
[7] N. El Karoui, Tracy–Widom limit for the largest eigenvalue of a large class of complex Wishart matrices, arXiv:math.PR/0503109.
[8] G. H. Golub and C. F. Van Loan, Matrix Computations, second edition, Johns Hopkins University Press, 1989.
[9] H. Hotelling, Analysis of a complex of statistical variables into principal components, J. of Educational Psychology 24 (1933).
[10] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936).
[11] I. M. Johnstone, On the distribution of the largest eigenvalue in principal component analysis, Ann. Stat. 29 (2001), 295–327.
[12] I. G. Macdonald, Symmetric Functions and Hall Polynomials, 2nd ed., Oxford University Press, 1995.
[13] R. J. Muirhead, Aspects of Multivariate Statistical Theory, John Wiley & Sons Inc., 1982.
[14] A. Soshnikov, A note on universality of the distribution of the largest eigenvalue in certain sample covariance matrices, J. Statistical Physics 108 (2002), 1033–1056.
[15] C. A. Tracy and H. Widom, On orthogonal and symplectic matrix ensembles, Commun. Math. Phys. 177 (1996), 727–754.
[16] F. V. Waugh, Regression between sets of variables, Econometrica 10 (1942), 290–310.
[17] P. Zinn-Justin and J.-B. Zuber, On some integrals over the U(N) unitary group and their large N limit, J. Phys. A: Math. Gen. 36.