




August 31, 2011

Chapter 4. Multivariate Models

The primary purpose of this chapter is to introduce some basic ideas from multivariate statistical analysis. Quite often, experiments produce data where measurements were obtained on more than one variable – hence the name: multivariate. In the Swiss head dimension example (Flury, 1997), in order to determine well-fitting masks, several different head-dimension measurements were obtained on the soldiers. In the next chapter on regression analysis, we will examine models that are defined in terms of several parameters. In order to properly understand the estimation of these model parameters, a foundation in multivariate statistics is needed. In particular, we need to understand concepts such as covariances and correlations between variables and estimators. An advantage of the multivariate approach is that it allows for designs of experiments where the resulting parameter estimators will be uncorrelated, thus making it easier to interpret results.

1 Multivariate Probability Density Functions

The probabilistic background for multivariate statistics requires multiple integration ideas, as seen in some of the formulas below. However, this chapter does not require multiple integration computations. We shall be concerned instead with statistical estimation computations which require simple (but tedious) arithmetic and some elementary matrix algebra. Fortunately, these computations can be done very easily on the computer. Data, particularly multivariate data, comes in the form of arrays of numbers and hence matrix algebra techniques are the natural way of handling such data. The appendix to this chapter contains a short review of some matrix algebra in case the reader needs to brush up on these ideas.

Suppose we are interested in two variables. For instance, in the Swiss head dimension data, let Y1 = MFB (Minimal frontal breadth, or forehead width) and let Y2 = BAM (Breadth of angulus mandibulae, or chin width). Data that consists of measurements on two different variables is called bivariate data (similarly, data collected on three variables is called trivariate and so on). We can define a joint probability density function f(y1, y2) that satisfies the following properties, which mirror the properties satisfied by the (univariate) pdf:

1. f(y1, y2) ≥ 0.

2. The total volume under the pdf must be 1:
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(y_1, y_2)\,dy_1\,dy_2 = 1.$$


3. Let A ⊂ ℝ², then
$$P((Y_1, Y_2) \in A) = \int\!\!\int_A f(y_1, y_2)\,dy_1\,dy_2.$$

Definition. The marginal pdf of Y1, denoted f1(y1), is just the pdf of the random variable Y1 considered alone. To determine the marginal pdf, we integrate out y2 in the joint pdf:
$$f_1(y_1) = \int_{-\infty}^{\infty} f(y_1, y_2)\,dy_2.$$

The marginal pdf of Y2 is defined similarly.
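As a brief illustration of these definitions (using a made-up joint pdf, and a calculation of a kind we will not need later), take
$$f(y_1, y_2) = y_1 + y_2, \qquad 0 \le y_1 \le 1,\ 0 \le y_2 \le 1,$$
and zero elsewhere. The density is nonnegative, the total volume is $\int_0^1\!\int_0^1 (y_1 + y_2)\,dy_1\,dy_2 = \tfrac{1}{2} + \tfrac{1}{2} = 1$, and integrating out y2 gives the marginal pdf $f_1(y_1) = \int_0^1 (y_1 + y_2)\,dy_2 = y_1 + \tfrac{1}{2}$ for $0 \le y_1 \le 1$.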

Our focus here is not so much on computing probabilities using multiple integration. Instead, we will focus on statistical measures of association between variables.

2 Covariance

Let Y1 and Y2 be two jointly distributed random variables with means µ1 and µ2 respectively and variances σ1² and σ2². A common measure of association between Y1 and Y2 is the covariance, denoted σ12:

Covariance: σ12 = cov(Y1, Y2) = E[(Y1 − µ1)(Y2 − µ2)].

The population covariance can be computed by
$$\sigma_{12} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (y_1 - \mu_1)(y_2 - \mu_2)\, f(y_1, y_2)\,dy_1\,dy_2.$$

A positive covariance indicates that if Y1 is above its average (Y1 − µ1 > 0), then Y2 tends to be above its average (Y2 − µ2 > 0), so that (Y1 − µ1)(Y2 − µ2) tends to be positive; also, if Y1 is below average, then Y2 tends to be below average as well, whereby (Y1 − µ1)(Y2 − µ2) is a negative times a negative, resulting in a positive value. Conversely, a negative covariance indicates that if Y1 tends to be small, then Y2 tends to be large, and vice-versa.

To illustrate, if Y1 is a measure of a person's height and Y2 is a measure of their weight, then these two variables tend to be associated. In particular, the covariance between them is usually positive since taller people tend to weigh more and shorter people tend to weigh less. On the other hand, if Y1 is the hours of training a technician receives for learning to operate a new machine and Y2 represents the number of errors the technician makes using the machine, then we would expect to see fewer errors corresponding with more training and hence Y1 and Y2 would have a negative covariance.

An important area where the covariance matters is when considering differences of jointly distributed random variables, Y1 − Y2. For instance, we will later discuss experiments looking at paired differences in situations where we may want to compare


two different experimental conditions. The statistical analysis requires that we know the variance of the difference: var(Y1 − Y2). There are two extreme cases:

Y1 = Y2 : var(Y1 − Y2) = var(0) = 0

σ12 = 0 : var(Y1 − Y2) = var(Y1) + var(Y2)

These two extremes are special cases of the following formula, which holds in all cases:
$$\mathrm{var}(Y_1 - Y_2) = \sigma_1^2 + \sigma_2^2 - 2\sigma_{12}. \qquad (1)$$

Exercise. Derive (1) using the definition of variance.
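As a quick numerical sanity check of (1) (not a substitute for the derivation), the following Matlab sketch simulates a large number of correlated pairs and compares the sample variance of the differences with the right-hand side of (1). The parameter values are made up for illustration, and mvnrnd requires the Statistics Toolbox.

% Monte Carlo check of var(Y1 - Y2) = sigma1^2 + sigma2^2 - 2*sigma12
mu    = [0 0];                        % means of Y1 and Y2 (illustrative values)
Sigma = [4 1.5; 1.5 9];               % sigma1^2 = 4, sigma2^2 = 9, sigma12 = 1.5
Y     = mvnrnd(mu, Sigma, 100000);    % 100000 simulated (Y1, Y2) pairs
d     = Y(:,1) - Y(:,2);              % the differences Y1 - Y2
var(d)                                % simulated variance of the difference
Sigma(1,1) + Sigma(2,2) - 2*Sigma(1,2)   % formula (1): 4 + 9 - 3 = 10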

3 Correlation

We can transform the covariance to obtain a well-known measure of association known as the correlation, which is denoted by the Greek letter ρ ("rho").

Correlation: ρ = σ12/(σ1σ2), where σ1 and σ2 are the standard deviations of Y1 and Y2 respectively.

Here are a couple of properties of ρ:

1. −1 ≤ ρ ≤ 1.

2. If ρ = ±1, then Y1 and Y2 are perfectly related by a linear transformation, that is, there exist constants a and b so that Y2 = a + bY1.

Property (1) highlights the fact that the correlation is a unitless quantity. Property (2) highlights the fact that the correlation is a measure of the strength of the linear relation between Y1 and Y2. A perfect linear relation produces a correlation of 1 or −1. A correlation of zero indicates no linear relation between the two random variables. Figure 1 shows scatterplots of data obtained from bivariate distributions with different correlations. The distribution for the top-left panel had a correlation of ρ = 0.95. The plot shows a strong positive relation between Y1 and Y2 with the points tightly clustered together in a linear pattern. The correlation for the top-right panel is also positive with ρ = 0.50 and again we see a positive relation between the two variables, but not as strong as in the top-left panel. The bottom-left panel corresponds to a correlation of ρ = 0 and consequently, we see no relationship evident between Y1 and Y2 in this plot. Finally, the bottom-right panel shows a negative linear relation with a correlation of ρ = −0.50.

Figure 1: Scatterplots of data obtained from bivariate distributions with different correlations.

A note of caution is in order: two variables Y1 and Y2 can be strongly related, but the relation may be nonlinear, in which case the correlation may not be a reasonable measure of association. Figure 2 shows a scatterplot of data from a bivariate distribution. There is clearly a very strong relation between y1 and y2, but the relation is nonlinear. The correlation is not an appropriate measure of association for this data. In fact, the correlation is nearly zero. To say y1 and y2 are unrelated because they are uncorrelated can be misleading if the relation is nonlinear. This is an error that is quite commonly made in everyday usage of the term correlation.

Figure 2: A scatterplot showing a very strong but nonlinear relationship between y1 and y2. The correlation is nearly zero.
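To see this phenomenon numerically, here is a small Matlab sketch (with simulated values, not data from the text) in which y2 is an exact quadratic function of y1 and yet the sample correlation is essentially zero:

% A strong nonlinear (quadratic) relation with essentially zero correlation
y1 = -3:0.1:3;            % equally spaced values, symmetric about zero
y2 = y1.^2;               % y2 is completely determined by y1
r  = corrcoef(y1, y2);    % 2 x 2 sample correlation matrix
r(1,2)                    % essentially zero despite the perfect relation
plot(y1, y2, '*')         % the scatterplot shows the strong curved pattern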

Caution: Another very common error made in practice is to assume that because two variables are highly correlated, one causes the other. Sometimes this will indeed be the case (e.g. more fertilizer leads to taller plants and hence a positive correlation). In other cases, the causation conclusion is silly. For example, do a survey of fires in a large city and note Y1, the dollar amount of fire damage, and also Y2, the number of fire-fighters called in to fight the fire. Will Y1 and Y2 be positively or negatively correlated? Does sending more fire fighters to a fire cause more fire damage? Or, could the association be due to something else?

Below is some Matlab code for obtaining plots and statistics for the multivariate Swiss head dimension data:

% Measurements on 200 Swiss soldiers, obtained to design new

% gas masks. 6 measurements were taken on each soldier (facial height,

% width, etc.)


% Put the correct path to the data swiss.dat:

load swiss.dat;

mfb = swiss(:,1); %Minimal frontal breadth (forehead width)

bam = swiss(:,2); % Breadth of angulus mandibulae (chin width)

tfh = swiss(:,3); % True facial height

lgan = swiss(:,4); % Length from glabella to apex nasi (tip of nose to top of forehead)

ltn = swiss(:,5); % length from tragion to nasion (top of nose to ear)

ltg = swiss(:,6); % Length from tragion to gnathion (bottom of chin to ear)

plot(mfb, bam, '*')
title('Swiss Head Data')
xlabel('Forehead Width')
ylabel('Chin Width')

cov(swiss) % Compute the sample covariance matrix

corr(swiss) % Compute the sample correlation matrix
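% Note: corr is part of the Statistics Toolbox; in base Matlab, corrcoef(swiss)
% computes the same sample correlation matrix.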

Note that to access a particular variable (i.e. a column) of the data set called swiss, we write "swiss(:,1)" for column 1, and so on.

4 Higher Dimensional Distributions

For higher dimensional data, it is helpful to employ matrix notation. Suppose we have p jointly distributed random variables Y1, Y2, . . . , Yp with means µ1, µ2, . . . , µp and variances σ1², σ2², . . . , σp². For instance, in the Swiss head dimension example, there were p = 6 head dimension variables recorded for each soldier. We can let the


boldfaced Y denote the column vector of random variables:

$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_p \end{pmatrix}$$

and let the boldfaced µ denote the corresponding vector of means:

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix}.$$

When we have more than two variables, we can compute covariances between each pair of variables. These covariances are collected together in a p × p matrix called the covariance matrix. The diagonal elements of a covariance matrix correspond to the variances of the random variables. The i-jth element of the covariance matrix is the covariance between Yi and Yj. The covariance matrix is a symmetric matrix because the covariance between Yi and Yj is the same as the covariance between Yj and Yi. To illustrate, suppose we have a tri-variate distribution for Y1, Y2 and Y3. Let σ12 = cov(Y1, Y2), σ13 = cov(Y1, Y3), and σ23 = cov(Y2, Y3). Then the covariance matrix, denoted by Ψ, is

Covariance Matrix: $$\Psi = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{pmatrix}.$$

A convenient way of defining the covariance matrix in terms of expectations is
$$E[(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})'].$$

When we take the expected value of a random vector or a random matrix, we compute the expected value of each term individually. For example,
$$E[\mathbf{Y}] = \begin{pmatrix} E[Y_1] \\ E[Y_2] \\ \vdots \\ E[Y_p] \end{pmatrix}.$$

For a bivariate random vector (Y1, Y2)′ with mean (µ1, µ2)′, we have
$$(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})' = \begin{pmatrix} Y_1 - \mu_1 \\ Y_2 - \mu_2 \end{pmatrix}\begin{pmatrix} Y_1 - \mu_1 & Y_2 - \mu_2 \end{pmatrix} = \begin{pmatrix} (Y_1 - \mu_1)^2 & (Y_1 - \mu_1)(Y_2 - \mu_2) \\ (Y_1 - \mu_1)(Y_2 - \mu_2) & (Y_2 - \mu_2)^2 \end{pmatrix}.$$

Therefore,
$$E[(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})'] = \begin{pmatrix} E[(Y_1 - \mu_1)^2] & E[(Y_1 - \mu_1)(Y_2 - \mu_2)] \\ E[(Y_1 - \mu_1)(Y_2 - \mu_2)] & E[(Y_2 - \mu_2)^2] \end{pmatrix},$$
which is the covariance matrix Ψ.
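The sample version of this expectation, discussed next, simply averages the outer products of the centered observations (with divisor n − 1). As a check, the following Matlab sketch (assuming the swiss.dat file from the earlier code listing is available) computes the covariance matrix this way and compares it with the built-in cov command:

% Sample analogue of E[(Y - mu)(Y - mu)']: average the centered outer products
load swiss.dat;
[n, p] = size(swiss);
ybar   = mean(swiss);                  % 1 x p vector of sample means
Yc     = swiss - repmat(ybar, n, 1);   % centered data matrix
S      = (Yc' * Yc) / (n - 1);         % p x p sample covariance matrix
max(max(abs(S - cov(swiss))))          % should be essentially zero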


Of course, the population covariances (e.g. σ12) and the population correlations are typically unknown population parameters which must be estimated from the data. Generally, multivariate data sets are organized so that each row corresponds to a new p-dimensional observation and each column corresponds to the measurement on one of the p variables. In other words, the data usually comes in the form of n rows for the sample size and p columns for the p measured variables. For a p-dimensional data set, let yi1 equal the ith observation on the first variable, yi2 the ith observation on the second variable, and so on, for i = 1, 2, . . . , n. The sample covariance between variables 1 and 2, denoted s12, is
$$s_{12} = \sum_{i=1}^{n} (y_{i1} - \bar{y}_1)(y_{i2} - \bar{y}_2)/(n-1) \qquad (2)$$
where ȳ1 and ȳ2 are the sample means of the first and second variables respectively. We can estimate the covariance matrix Ψ by replacing the population variances and covariances by their respective estimators – this will be called the sample covariance matrix and is generally denoted by S.

Example. Consider once again the Swiss head dimension data consisting of p = 6 head measurements. Denote these measurements by

Y1 = MFB = Minimal frontal breadth (forehead width)

Y2 = BAM = Breadth of angulus mandibulae (chin width)

Y3 = TFH = True facial height

Y4 = LGAN = Length from glabella to apex nasi (tip of nose to top of forehead)

Y5 = LTN = length from tragion to nasion (top of nose to ear)

Y6 = LTG = Length from tragion to gnathion (bottom of chin to ear).

To give an indication of what the data looks like, below is a list of the first 20 observations:


MFB     BAM     TFH     LGAN    LTN     LTG
113.2   111.7   119.6   53.9    127.4   143.6
117.6   117.3   121.2   47.7    124.7   143.9
112.3   124.7   131.6   56.7    123.4   149.3
116.2   110.5   114.2   57.9    121.6   140.9
112.9   111.3   114.3   51.5    119.9   133.5
104.2   114.3   116.5   49.9    122.9   136.7
110.7   116.9   128.5   56.8    118.1   134.7
105.0   119.2   121.1   52.2    117.3   131.4
115.9   118.5   120.4   60.2    123.0   146.8
96.8    108.4   109.5   51.9    120.1   132.2
110.7   117.5   115.4   55.2    125.0   140.6
108.4   113.7   122.2   56.2    124.5   146.3
104.1   116.0   124.3   49.8    121.8   138.1
107.9   115.2   129.4   62.2    121.6   137.9
106.4   109.0   114.9   56.8    120.1   129.5
112.7   118.0   117.4   53.0    128.3   141.6
109.9   105.2   122.2   56.6    122.2   137.8
116.6   119.5   130.6   53.0    124.0   135.3
109.9   113.5   125.7   62.8    122.7   139.5
107.1   110.7   121.7   52.1    118.6   141.6

To get a better feel for the data, Figure 3 shows scatterplots of each pair of variables.

The sample mean vector for the entire data set is given by
$$\bar{\mathbf{y}} = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \vdots \\ \bar{y}_6 \end{pmatrix} = \begin{pmatrix} 114.7245 \\ 115.9140 \\ 123.0550 \\ 57.9885 \\ 122.2340 \\ 138.8335 \end{pmatrix}$$

and the sample covariance matrix S is equal to
$$S = \begin{pmatrix}
26.9012 & 12.6229 & 5.3834 & 2.9313 & 8.1767 & 12.1073 \\
12.6229 & 27.2522 & 2.8805 & 2.0575 & 7.1255 & 11.4412 \\
5.3834 & 2.8805 & 35.2300 & 10.3692 & 6.0275 & 7.9725 \\
2.9313 & 2.0575 & 10.3692 & 17.8453 & 2.9194 & 4.9936 \\
8.1767 & 7.1255 & 6.0275 & 2.9194 & 15.3702 & 14.5213 \\
12.1073 & 11.4412 & 7.9725 & 4.9936 & 14.5213 & 31.8369
\end{pmatrix}.$$

Matlab can compute these statistics easily using the cov command to get the sample covariance matrix. Note that the covariances between the six head measurements are all positive. It is quite common to see all positive covariances in data of this sort. For example, people with larger than average forehead widths will tend to also have larger than average chin widths, and so on.

Figure 3: Scatterplot matrix of each pair of variables in the Swiss head data. Note that most pairs of variables are positively correlated.


The sample correlations, typically denoted by r, are the sample counterpart to the population correlation. For instance,
$$r_{12} = \frac{s_{12}}{s_1 s_2}. \qquad (3)$$

We can collect the sample correlations together into a correlation matrix, denoted by R, where the i-jth element of the matrix is rij, the sample correlation between the ith and the jth variables. Note that the correlation between a random variable and itself is always 1 (the same goes for sample correlations). Therefore, correlation matrices always have ones down the diagonal. For the Swiss head dimension data, the sample correlation matrix is

$$R = \begin{pmatrix}
1.0000 & 0.4662 & 0.1749 & 0.1338 & 0.4021 & 0.4137 \\
0.4662 & 1.0000 & 0.0930 & 0.0933 & 0.3482 & 0.3884 \\
0.1749 & 0.0930 & 1.0000 & 0.4135 & 0.2590 & 0.2381 \\
0.1338 & 0.0933 & 0.4135 & 1.0000 & 0.1763 & 0.2095 \\
0.4021 & 0.3482 & 0.2590 & 0.1763 & 1.0000 & 0.6564 \\
0.4137 & 0.3884 & 0.2381 & 0.2095 & 0.6564 & 1.0000
\end{pmatrix}.$$

Note that the highest correlation, r56 = 0.6564, is between LTN and LTG, the distance from the top of the nose to the ear and the distance from the bottom of the chin to the ear. The correlation between chin width (BAM) and facial height (TFH) is relatively quite small (r23 = 0.0930). Also, the correlation between chin width and the length from the glabella to the tip of the nose (LGAN) is relatively quite small (r24 = 0.0933). Looking at Figure 3, one can see a weak association between BAM and TFH and a strong association between LTN and LTG.
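As a check on these numbers, the correlation matrix can be recovered from the covariance matrix by dividing each covariance by the product of the corresponding standard deviations, as in (3). A short Matlab sketch (assuming the swiss data matrix has been loaded as before):

% Recover the sample correlation matrix R from the sample covariance matrix S
S = cov(swiss);        % 6 x 6 sample covariance matrix
d = sqrt(diag(S));     % sample standard deviations s1, ..., s6
R = S ./ (d * d');     % r_ij = s_ij / (s_i * s_j)
R(5,6)                 % correlation between LTN and LTG, about 0.6564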

4.4 The Multivariate Normal Density Function.

Recall that a normal random variable Y with mean µ and variance σ² has a probability density function (pdf) of
$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y - \mu)^2\right\},$$

for −∞ < y < ∞. It is easy to generalize this univariate pdf to a multivariate normal pdf. Let Y = (Y1, Y2, . . . , Yp)′ denote a multivariate normal random vector with mean vector µ = (µ1, µ2, . . . , µp)′ and covariance matrix Ψ. To obtain the multivariate density function, we replace (y − µ)²/σ² in the exponent by
$$(\mathbf{y} - \boldsymbol{\mu})'\Psi^{-1}(\mathbf{y} - \boldsymbol{\mu}),$$
and we replace the 1/σ scalar by the determinant of Ψ raised to the −1/2 power, |Ψ|^(−1/2). The p-dimensional normal pdf can be written
$$f(y_1, y_2, \ldots, y_p) = \left(\frac{1}{2\pi}\right)^{p/2} |\Psi|^{-1/2} \exp\left\{-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})'\Psi^{-1}(\mathbf{y} - \boldsymbol{\mu})\right\}, \qquad (4)$$
for y ∈ ℝ^p.
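To make (4) concrete, the following Matlab sketch evaluates a bivariate normal density directly from the formula, using the sample mean and covariance of LTN and LTG as illustrative parameter values and a made-up evaluation point (the Statistics Toolbox function mvnpdf would give the same answer):

% Evaluate the bivariate normal density (4) at a single point y
load swiss.dat;
X   = swiss(:, 5:6);          % LTN and LTG columns
mu  = mean(X)';               % 2 x 1 mean vector (illustrative parameter values)
Psi = cov(X);                 % 2 x 2 covariance matrix
y   = [122; 139];             % point at which to evaluate the density (made up)
p   = 2;
f = (1/(2*pi))^(p/2) * det(Psi)^(-1/2) * ...
    exp(-0.5 * (y - mu)' * inv(Psi) * (y - mu))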

Figure 4: The bivariate normal pdf (4) for the Swiss head dimension variables LTN and LTG.

It is informative to note that if we set the expression (y − µ)′Ψ⁻¹(y − µ) in the exponent of the multivariate normal density equal to a constant, the resulting equation describes an ellipsoid in p-dimensional space centered at the mean µ. These ellipsoid patterns are used for forming multivariate confidence regions and multivariate critical regions for hypothesis testing.

Introductory textbooks typically refrain from using matrix notation when expressing the multivariate normal pdf given in (4). However, (4) is fairly easy to write down in matrix notation compared to what one would get writing it down without matrix notation. For example, for p = 2 dimensions, we can write out the bivariate normal pdf as

$$f(y_1, y_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{y_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{y_1-\mu_1}{\sigma_1}\right)\left(\frac{y_2-\mu_2}{\sigma_2}\right) + \left(\frac{y_2-\mu_2}{\sigma_2}\right)^2\right]\right\},$$
for −∞ < y1 < ∞ and −∞ < y2 < ∞. This looks quite complicated. The expression for p = 3 or more dimensions becomes even more of a mess to write out without matrix notation, but (4) stays the same regardless of the dimension.

To get an idea of what a multivariate normal pdf looks like, Figure 4 shows a bivariate normal pdf for the LTN and LTG variables from the Swiss head dimension data. The bivariate normal pdf looks like a mountain centered over the mean of the distribution. In order to compute probabilities using the pdf, one needs to compute the volume under the pdf surface corresponding to the region of interest.


5 Confidence Regions

Confidence intervals were introduced for estimating a single parameter such as the mean of a distribution. In the multivariate setting, we can similarly define confidence regions for vectors of parameters, such as the mean vector µ. To illustrate matters, we shall consider two of the Swiss head dimension variables: LTN and LTG. Since they correspond to the 5th and 6th variables, the mean vector of interest is µ = (µ5, µ6)′.

There are two approaches. One method is to simply compute two univariate confidence intervals separately for µ1 and µ2 and form the Cartesian product of the two intervals to obtain a confidence rectangle. However, if we compute, say, 95% confidence intervals for µ1 and µ2, then the joint confidence region (the rectangle) has a lower confidence level. To understand why, consider an analogy: if there is a 5% chance I'll get a speeding ticket on a given day when I drive to work, then the probability I get at least one ticket during the year is certainly higher than 5%. Similarly, if there is a 5% probability that each random interval does not contain its respective mean, then the probability that at least one of the intervals does not contain its respective mean is higher than 5%. A simple (but not always efficient) fix to this problem is to use what is known as the Bonferroni adjustment. To form p confidence intervals for p parameters with a confidence level of 1 − α for the family of parameters, one can compute a confidence interval for each parameter separately using a confidence level of 1 − α/p; this guarantees that the confidence level is at least 1 − α for all p intervals considered jointly.
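Here is a Matlab sketch of the Bonferroni idea for the two means µ5 (LTN) and µ6 (LTG): with p = 2 intervals and an overall level of 95%, each interval is computed at level 1 − 0.05/2 = 97.5%. (The t quantile function tinv is part of the Statistics Toolbox.)

% Bonferroni-adjusted confidence intervals for the means of LTN and LTG
load swiss.dat;
X     = swiss(:, 5:6);                  % LTN and LTG columns
n     = size(X, 1);
alpha = 0.05;                           % overall (family) level 1 - alpha
p     = 2;                              % number of intervals
tcrit = tinv(1 - alpha/(2*p), n - 1);   % two-sided, Bonferroni-adjusted critical value
xbar  = mean(X);                        % the two sample means
se    = std(X) / sqrt(n);               % standard errors of the sample means
lower = xbar - tcrit * se
upper = xbar + tcrit * se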

A more efficient approach for estimating a mean vector is to incorporate the correlations between the estimated parameters. For multivariate normal data, the resulting confidence regions have ellipsoidal shapes. Instead of determining a random interval that covers a mean µ with high probability, we want to determine a region that covers the vector µ with high probability.

The solution to this problem requires introducing another probability distribution known as the F-distribution, which results when we look at statistics formed by ratios of variance estimates. The F-distribution is used extensively in analysis of variance (ANOVA) applications where we want to compare several means. Because an F random variable is defined in terms of a ratio of variance estimators, and variance estimators depend on degrees of freedom, the F-distribution is specified by a numerator and a denominator degrees of freedom. The F-distribution is skewed to the right and takes only positive values. Critical values for the F-distribution can be found beginning on page 202 in the Appendix. Let Fp,n−p(α) denote the α critical value of an F-distribution on p numerator degrees of freedom and n − p denominator degrees of freedom.

Returning to the confidence region problem, one can show (e.g., Johnson and Wichern, 1998, page 179) that for a sample of size n from a p-dimensional normal distribution,
$$P\left(n(\bar{\mathbf{Y}} - \boldsymbol{\mu})'S^{-1}(\bar{\mathbf{Y}} - \boldsymbol{\mu}) \le \frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha)\right) = 1 - \alpha.$$

This statement shows that a (1 − α)100% confidence region for the mean of a p-dimensional normal distribution is given by the set of µ ∈ ℝ^p that satisfy the inequality:
$$n(\bar{\mathbf{y}} - \boldsymbol{\mu})'S^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}) \le \frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha).$$

Figure 5: A 95% confidence ellipse for the Swiss head dimension data using only the variables LTN and LTG.

The inequality defines a p-dimensional ellipsoid centered at ȳ. To determine if a hypothesized value of µ lies in this region, simply plug it into the expression and see if the inequality is satisfied or not. Figure 5 shows a 95% confidence ellipse for the Swiss head data for variables Y5 = LTN and Y6 = LTG.
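A Matlab sketch of this check for the LTN and LTG means follows; the hypothesized mean vector is made up for illustration, and the F quantile function finv is part of the Statistics Toolbox.

% Check whether a hypothesized mean vector lies in the 95% confidence ellipse
load swiss.dat;
X      = swiss(:, 5:6);            % LTN and LTG columns
[n, p] = size(X);
ybar   = mean(X)';                 % 2 x 1 sample mean vector
S      = cov(X);
mu0    = [122; 139];               % hypothesized mean vector (made up for illustration)
alpha  = 0.05;
lhs    = n * (ybar - mu0)' * inv(S) * (ybar - mu0);
rhs    = (n - 1) * p / (n - p) * finv(1 - alpha, p, n - p);
lhs <= rhs                         % 1 (true) if mu0 lies inside the confidence region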

Multivariate statistics is a broad field of statistics and we have only introduced some of the most basic ideas. Additional topics in multivariate analysis (such as principal component analysis, discriminant analysis, cluster analysis, canonical correlations, MANOVA) take the correlations between variables into consideration to solve various problems.

Problems

1. Data on felled black cherry trees was collected (Ryan et al., 1976). The measured variables were the diameter (in inches, measured 4.5 feet above the ground), the height (measured in feet) and the volume (measured in cubic feet). The full data set appears in the following table:


x1        x2      x3
Diameter  Height  Volume  (xi1 − x̄1)  (xi2 − x̄2)  (xi1 − x̄1)(xi2 − x̄2)
11.0      75      18.2
11.1      80      22.6
11.2      75      19.9
11.3      79      24.2
11.4      76      21.0
11.4      76      21.4
11.7      69      21.3
12.0      75      19.1
12.9      74      22.2
12.9      85      33.8
13.3      86      27.4
13.7      71      25.7
13.8      64      24.9
14.0      78      34.5
14.2      80      31.7
14.5      74      36.3
16.0      72      38.3
16.3      77      42.6
17.3      81      55.4
17.5      82      55.7
17.9      80      58.3
18.0      80      51.5
18.0      80      51.0
20.6      87      77.0

a) Before analyzing the data, do you expect the correlations between these three variables to be negative, positive or zero? A scatterplot matrix of the data is plotted in Figure 6.

b) Instead of carrying out the computation of the covariance, which is rather tedious, we shall attempt to get a feel for the covariance between x1, the diameter, and x2, the height. The sample means for the three variables are x̄1 = 13.248, x̄2 = 76.00, and x̄3 = 30.17. In the table above, put a + in the column (xi1 − x̄1) if the ith diameter is higher than the average diameter and put a − if the ith diameter is lower than the mean value. Do the same thing for the heights in the column labelled (xi2 − x̄2). If both these differences are positive, or they are both negative, put a + in the column labelled (xi1 − x̄1)(xi2 − x̄2) and a − otherwise. To illustrate, here is how to do this for the first row:

x1        x2      x3
Diameter  Height  Volume  (xi1 − x̄1)  (xi2 − x̄2)  (xi1 − x̄1)(xi2 − x̄2)
11.0      75      18.2    −           −           +

The sample covariance is basically the average of the product (xi1 − x̄1)(xi2 − x̄2). From the list of +'s and −'s, does it appear the covariance will be positive or negative?

c) The sample covariance matrix for the entire data set is given by
$$S = \begin{pmatrix} 9.85 & 10.38 & 49.89 \\ 10.38 & 40.60 & 62.66 \\ 49.89 & 62.66 & 270.20 \end{pmatrix}.$$

Compute the sample correlation matrix (using (3)) from the covariance matrix.

d) The purpose of this study was to predict the volume of wood of the tree using the diameter and/or height. If you had to choose one of the variables (height or diameter) for predicting the volume of the tree, which would you choose from a purely statistical point of view (note that for trees that have not been cut, it would be much more difficult to measure the height than the diameter)? What was the basis for your choice?

e) If we convert the diameter measurements from units of inches to feet, then we would need to divide each diameter measurement by 12. Let xi1 = yi1/12 denote the diameter measurements in units of feet. Compute the sample variance of the xi1 measurements. Also, compute the sample correlation between the diameter (in feet) and the height of the cherry trees.

Appendix: Matrix Algebra

In this appendix we give a brief review of some of the basics of matrix algebra. A matrix is simply an array of numbers. Let n denote the number of rows and p denote the number of columns in an array. Matrices are denoted by boldface letters. For example, let A denote a matrix with n = 3 rows and p = 2 columns. Then we say that A is an n × p matrix; in this case, A is a 3 × 2 matrix. A special case of a matrix is a vector, which is simply a matrix with a single column (a column vector) or a single row (a row vector). By convention, whenever we denote a vector, we shall assume it is a column vector. One can regard an n × 1 column vector as a point in n-dimensional Euclidean space.

To illustrate matters, let x denote a 3 × 1 column vector and A denote a 3 × 2 matrix defined as follows:
$$\mathbf{x} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \qquad \mathbf{A} = \begin{pmatrix} 2 & 3 \\ 4 & 5 \\ 1 & 6 \end{pmatrix}.$$

We can perform operations on vectors and matrices such as summation, subtraction, and multiplication.

The transpose of a matrix means to simply change the columns to rows and is denoted by a prime: A′ is the transpose of A. Thus,

x′ = ( 1, 2, 3 )

and
$$\mathbf{A}' = \begin{pmatrix} 2 & 4 & 1 \\ 3 & 5 & 6 \end{pmatrix}.$$

Figure 6: Scatterplots of the black cherry tree data. Here, Girth = diameter.

To multiply a matrix by a number (i.e. a scalar), one just multiplies each element of the matrix by the scalar. For instance, if c = 2 then

$$c\mathbf{A} = 2\begin{pmatrix} 2 & 3 \\ 4 & 5 \\ 1 & 6 \end{pmatrix} = \begin{pmatrix} 4 & 6 \\ 8 & 10 \\ 2 & 12 \end{pmatrix}.$$

In order to add two matrices together, they must both be of the same dimensions, in which case you just add the corresponding components together (or subtract if you are subtracting matrices). We cannot add the vector x to the matrix A because they are not of the same dimension. However, if

$$\mathbf{y} = \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix}$$
then
$$\mathbf{x} + \mathbf{y} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} + \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix} = \begin{pmatrix} 5 \\ 7 \\ 9 \end{pmatrix}.$$


Let
$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{pmatrix},$$
then
$$\mathbf{A} + \mathbf{B} = \begin{pmatrix} a_{11}+b_{11} & a_{12}+b_{12} \\ a_{21}+b_{21} & a_{22}+b_{22} \\ a_{31}+b_{31} & a_{32}+b_{32} \end{pmatrix}.$$

Note that the ijth entry in the matrix A for the ith row and the jth column is denoted aij. Thus, the first index specifies the row number and the second index specifies the column number.

Matrix Multiplication. One needs to be a little careful when multiplying two matrices together. First of all, matrix multiplication is not commutative, as we shall see. If A and B are matrices and we want to form the product AB, then the number of columns of A must match the number of rows of B. Suppose A has dimension n × p and B has dimension p × q; then the product AB will have dimension n × q. To illustrate, let us first compute the product of two vectors a and b, say, where a = ( a11, a12, a13 ) and

$$\mathbf{b} = \begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \end{pmatrix}.$$

Since a is a 1 × 3 row vector and b is a 3 × 1 column vector, we can form the product ab since the number of columns of a equals the number of rows of b. The product ab is defined as

$$\mathbf{a}\mathbf{b} = ( a_{11},\ a_{12},\ a_{13} )\begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \end{pmatrix} = a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31}.$$

Now consider the product of two matrices A and B. Think of each row of A as a row vector and each column of B as a column vector. Then the ijth element of the product AB is defined to be the product of the ith row of A times the jth column of B. To illustrate, let A denote a 3 × 2 matrix and B denote a 2 × 4 matrix:

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} b_{11} & b_{12} & b_{13} & b_{14} \\ b_{21} & b_{22} & b_{23} & b_{24} \end{pmatrix},$$

then AB is a 3 × 4 matrix computed as

$$\mathbf{A}\mathbf{B} = \begin{pmatrix} a_{11}b_{11}+a_{12}b_{21} & a_{11}b_{12}+a_{12}b_{22} & a_{11}b_{13}+a_{12}b_{23} & a_{11}b_{14}+a_{12}b_{24} \\ a_{21}b_{11}+a_{22}b_{21} & a_{21}b_{12}+a_{22}b_{22} & a_{21}b_{13}+a_{22}b_{23} & a_{21}b_{14}+a_{22}b_{24} \\ a_{31}b_{11}+a_{32}b_{21} & a_{31}b_{12}+a_{32}b_{22} & a_{31}b_{13}+a_{32}b_{23} & a_{31}b_{14}+a_{32}b_{24} \end{pmatrix}.$$

In this example, we cannot form the product BA since the number of columns of B does not match the number of rows of A.

Consider again the multiplication of two vectors a = ( a11, a12, a13 ) and
$$\mathbf{b} = \begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \end{pmatrix}.$$


We saw how to compute the product ab. Note that we can also form the product ba since b is a 3 × 1 column vector and a is a 1 × 3 row vector, i.e. the number of columns of b matches the number of rows of a, and the product will be a matrix of dimension 3 × 3:

$$\mathbf{b}\mathbf{a} = \begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \end{pmatrix}( a_{11},\ a_{12},\ a_{13} ) = \begin{pmatrix} b_{11}a_{11} & b_{11}a_{12} & b_{11}a_{13} \\ b_{21}a_{11} & b_{21}a_{12} & b_{21}a_{13} \\ b_{31}a_{11} & b_{31}a_{12} & b_{31}a_{13} \end{pmatrix}.$$

Here are a few definitions:

A square matrix is a matrix with the same number of rows as columns.

A diagonal matrix is a matrix with all zeros except along the main diagonal.

An important special case of a square diagonal matrix is the identity matrix, denoted by I. The identity matrix is a square matrix whose diagonal elements are ones and whose off-diagonal elements are all zero. For instance, the 3 × 3 identity matrix is

$$\mathbf{I} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

The reason we call this the identity matrix is that it acts as the multiplicative identity element: for any matrix A, we have
$$\mathbf{A}\mathbf{I} = \mathbf{A} \quad \text{and} \quad \mathbf{I}\mathbf{A} = \mathbf{A},$$
provided the matrix multiplications are defined (one can easily verify these relations).

A symmetric matrix is any matrix A such that A = A′, that is, A is equal to its transpose. For example,
$$\mathbf{A} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
is symmetric. Covariance matrices are always symmetric.

Two (column) vectors a and b of the same dimension are orthogonal if a′b = 0. Geometrically speaking, if we think of a vector as an arrow extending from the origin to the point represented by the vector, then orthogonal vectors are perpendicular to each other. For example, if a = ( 1, −1 )′ and b = ( 1, 1 )′, then
$$\mathbf{a}'\mathbf{b} = ( 1,\ -1 )\begin{pmatrix} 1 \\ 1 \end{pmatrix} = 1 \cdot 1 - 1 \cdot 1 = 0.$$

Figure 7 illustrates the geometric property of the orthogonal vectors.

Inverses. For a scalar such as 5, its inverse is simply 1/5 = 5⁻¹ and 5(1/5) = 1. All real numbers have an inverse except zero. Let A denote a square p × p matrix. The inverse of A, if it exists, is denoted A⁻¹ and is the p × p matrix such that
$$\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I} \quad \text{(the identity matrix)}.$$


Figure 7: Two orthogonal vectors in ℝ².

In order for a matrix A to have an inverse, its columns must be linearly independent, which means that no column of A can be expressed as a linear combination of the other columns of A. Such matrices are called nonsingular. Thus a singular matrix does not have an inverse. Finding the inverse of a matrix is somewhat tedious for higher dimensional matrices. However, for a 2 × 2 matrix, there is a simple formula. If

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},$$
then
$$\mathbf{A}^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}}\begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}.$$

Inverse matrices are needed in order to define the multivariate normal density function and to understand multivariate distance. The inverses of diagonal matrices are easy to compute:
$$\begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{pp} \end{pmatrix}^{-1} = \begin{pmatrix} \frac{1}{a_{11}} & 0 & \cdots & 0 \\ 0 & \frac{1}{a_{22}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{a_{pp}} \end{pmatrix}.$$

In order for the inverse of a diagonal matrix to exist, all the diagonal elements must be nonzero. For higher dimensional non-diagonal matrices, computer software such as Matlab can be used to compute inverses of matrices.
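For example, in Matlab the inv command returns the inverse of a nonsingular square matrix; a quick sketch with an arbitrary 2 × 2 matrix:

% Inverse of a 2 x 2 matrix and a check that A * inv(A) equals the identity
A    = [3 4; 2 5];
Ainv = inv(A)        % same as (1/det(A)) * [5 -4; -2 3]
A * Ainv             % should be the 2 x 2 identity matrix (up to rounding)
det(A)               % determinant: 3*5 - 4*2 = 7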

Another matrix operation needed for the multivariate normal density is the determinant of a matrix, denoted by |A| (also denoted by det(A)). The computation of the determinant is rather tedious for square matrices of dimension higher than 3 × 3 and again, software such as Matlab can be used to compute determinants. In the case


Figure 8: The determinant is the area of the parallelogram formed from the column vectors that make up the matrix.

of a 2 × 2 matrix A where
$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},$$
the formula is quite simple:
$$|\mathbf{A}| = a_{11}a_{22} - a_{12}a_{21}.$$

Thus, if
$$\mathbf{A} = \begin{pmatrix} 3 & 4 \\ 2 & 5 \end{pmatrix},$$

then |A| = 3 · 5 − 4 · 2 = 15 − 8 = 7. One way to think of the determinant of a matrix is to look at the two column vectors of A. If we plot these two vectors, they form two edges of a parallelogram, as seen in Figure 8. The determinant of A is the (signed) area of the parallelogram. For higher dimensional matrices, the columns of the matrix form the edges of a parallelepiped and the determinant is equal to the (signed) volume of the parallelepiped.

The determinant of a singular matrix is zero. For instance, suppose
$$\mathbf{A} = \begin{pmatrix} 1 & 4 \\ 2 & 8 \end{pmatrix}.$$

Then the second column of A is just 2 times the first column of A and therefore A is singular. The determinant of A is |A| = 8 − 8 = 0. Since the second column of A is 2 times the first column of A, the two edges of the parallelogram formed by these columns lie along the same line, and hence the area of the resulting parallelogram is zero.

Note that in the formula for the multivariate normal distribution, we divide by the square root of the determinant of the covariance matrix. If the determinant is zero, then the distribution does not have a density. To understand what this means, consider a bivariate normal random vector (Y1, Y2)′. If the covariance matrix has determinant zero, then that means that Y2 is a linear function of Y1 and the two random variables are perfectly correlated (i.e. correlation equal to ±1). In a scatterplot such as Figure 5, if Y1 and Y2 were perfectly correlated, then the points would lie exactly on a line and the confidence ellipse would shrink to a line. The bivariate density assigns probability by computing the volume under the density surface (as shown in Figure 4). However, if the entire distribution is concentrated on a line in the plane, then the volume under the density, if it existed, would be zero. In other words, the distribution is degenerate.

Problems

1. Let
$$\mathbf{A} = \begin{pmatrix} 3 & 4 \\ 4 & 7 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}.$$

Find the following:

a) A + B.

b) AB.

c) BA. (Is AB = BA?)

d) A−1. Verify that your answer is correct by confirming that AA−1 = I.

e) |A|.

2. Let

$$\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \\ 1 & x_5 \end{pmatrix}.$$

Find the following:

a) X ′X.

b) (X ′X)−1.

References

Flury, B. (1997). A First Course in Multivariate Statistics. Springer, New York.

Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis. Prentice Hall, New Jersey.

Ryan, T. A., Joiner, B. L., and Ryan, B. F. (1976). The Minitab Student Handbook. Duxbury Press, California.