Description of Multivariate Data

Multivariate Analysis: The analysis of many variables

More precisely, and also more traditionally, this term stands for the study of a random sample of n objects (units or cases) such that on each object we measure p variables or characteristics.

So that for each object there is a vector:

\[ x_1, x_2, \ldots, x_n \]

each with p components:

\[ x_i = (x_{i1}, \ldots, x_{ij}, \ldots, x_{ip})' \]

The variables will be correlated as they are measured on the same object.

A common practice is to treat each variable separately by applying methods of univariate analysis.

This ignores the correlation between the variables and may lead to incorrect and inadequate analysis.

The challenge of multivariate analysis is to untangle the overlapping information provided by a set of correlated variables and to reveal the underlying structure.

This is done by a variety of methods, some of which are

• generalizations of univariate methods

and some of which are

• multivariate methods without univariate counterparts

The purpose of this course is

• to describe and perhaps justify these methods, and also

• to provide some guidance about how to select an appropriate method for a given multivariate data set.

Example

Randomly select n = 5 students as objects and for each student measure:

x1 = age (in years) at entry to university,

x2 = mark out of 100 in an exam at the end of the first year,

x3 = sex (0 = female, 1 = male)

The result may look something like this:

Object    x1     x2    x3
1         18.5   91    0
2         18.0   73    1
3         18.9   64    1
4         18.5   71    0
5         18.4   85    1

It is of interest to note that the variables in the example are not of the same type:

– x1 is a continuous variable,

– x2 is a discrete variable and

– x3 is a binary variable

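The Data Matrix

\[
X = \begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
\]

The rows correspond to the objects $o_1, \ldots, o_i, \ldots, o_n$ and the columns correspond to the variables $x_1, \ldots, x_j, \ldots, x_p$.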

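We can write

\[
X = \begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
= \begin{pmatrix} x_1' \\ \vdots \\ x_i' \\ \vdots \\ x_n' \end{pmatrix}
\]

where

\[ x_i' = (x_{i1}, \ldots, x_{ij}, \ldots, x_{ip}) \]

= the ith row of X.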

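We can also write

\[
X = \begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
= \left( x_{(1)}, \ldots, x_{(j)}, \ldots, x_{(p)} \right)
\]

where

\[ x_{(j)} = (x_{1j}, \ldots, x_{ij}, \ldots, x_{nj})' \]

= the jth column of X.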

In this notation $x_1$ is the p-vector denoting the p observations on the first object, while $x_{(1)}$ is the n-vector denoting the observations on the first variable.

The rows $x_1, x_2, x_3, \ldots, x_n$ form a random sample while the columns $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(p)}$ do not (this is emphasized in the notation by the use of parentheses).
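The row and column notation can be illustrated with the five-student example. The following is a minimal sketch in plain Python; the helper names `row` and `column` are made up for illustration and are not part of the course material.

```python
# Data matrix for the five-student example: one row per object (student),
# one column per variable (x1 = age, x2 = mark, x3 = sex).
X = [
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
]

n = len(X)     # number of objects
p = len(X[0])  # number of variables


def row(X, i):
    """Return x_i, the p-vector of observations on object i (1-based)."""
    return list(X[i - 1])


def column(X, j):
    """Return x_(j), the n-vector of observations on variable j (1-based)."""
    return [r[j - 1] for r in X]


x_1 = row(X, 1)        # all p measurements on the first student
x_col1 = column(X, 1)  # the ages of all n students
print(n, p)
print(x_1)
print(x_col1)
```

Here `x_1` has length p = 3 and `x_col1` has length n = 5, matching the distinction between $x_1$ and $x_{(1)}$.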

Sometimes the objective of multivariate analysis will be an attempt to find some feature of the variables (i.e. the columns of the data matrix).

At other times, the objective of multivariate analysis will be an attempt to find some feature of the individuals (i.e. the rows of the data matrix).

The feature that we often look for is grouping of the individuals or of the variables.

We will give a classification of multivariate methods later

Summarization of the data

Even when n and p are moderately large, the amount of information (np elements of the data matrix) can be overwhelming and it is necessary to find ways of summarizing data.

Later on we will discuss ways of representing the data graphically.

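Definitions:

1. The sample mean for the ith variable

\[ \bar{x}_i = \frac{1}{n} \sum_{r=1}^{n} x_{ri} \quad \text{for } i = 1, 2, \ldots, p \]

2. The sample variance for the ith variable

\[ s_i^2 = s_{ii} = \frac{1}{n-1} \sum_{r=1}^{n} (x_{ri} - \bar{x}_i)^2 \quad \text{for } i = 1, 2, \ldots, p \]

3. The sample covariance between the ith variable and the jth variable

\[ s_{ij} = \frac{1}{n-1} \sum_{r=1}^{n} (x_{ri} - \bar{x}_i)(x_{rj} - \bar{x}_j) \quad \text{for } i, j = 1, 2, \ldots, p \]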
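As a check on the three definitions, they can be evaluated directly on the five-student example. This is a small sketch in plain Python; the function names `mean` and `cov` are illustrative choices, and the divisor n - 1 follows the definitions above.

```python
# Five-student example: columns are x1 = age, x2 = mark, x3 = sex.
X = [
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
]
n, p = len(X), len(X[0])


def mean(j):
    """Sample mean of variable j (0-based column index)."""
    return sum(r[j] for r in X) / n


def cov(i, j):
    """Sample covariance s_ij with divisor n - 1; cov(i, i) is the variance."""
    mi, mj = mean(i), mean(j)
    return sum((r[i] - mi) * (r[j] - mj) for r in X) / (n - 1)


xbar = [mean(j) for j in range(p)]
S = [[cov(i, j) for j in range(p)] for i in range(p)]

print(xbar)  # approximately [18.46, 76.8, 0.6]
for r in S:
    print(r)
# S is approximately
# [[ 0.103,  -1.01, -0.02],
#  [-1.01,  120.2,  -2.1 ],
#  [-0.02,   -2.1,   0.3 ]]
```

The negative off-diagonal entries say that, in this tiny sample, older students and male students tended to score slightly lower; with n = 5 this is only an illustration of the arithmetic.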

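Putting the definitions together we are led to the following definitions:

Defn: The sample mean vector

\[ \bar{x} = (\bar{x}_1, \ldots, \bar{x}_i, \ldots, \bar{x}_p)' \]

Defn: The sample covariance matrix

\[
S = (s_{ij})_{p \times p} = \begin{pmatrix}
s_{11} & \cdots & s_{1i} & \cdots & s_{1p} \\
\vdots &        & \vdots &        & \vdots \\
s_{i1} & \cdots & s_{ii} & \cdots & s_{ip} \\
\vdots &        & \vdots &        & \vdots \\
s_{p1} & \cdots & s_{pi} & \cdots & s_{pp}
\end{pmatrix}
\]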

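Expressing the sample mean vector and the sample covariance matrix in terms of the data matrix

The sample mean vector

\[
\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_i \\ \vdots \\ \bar{x}_p \end{pmatrix}
= \frac{1}{n} \sum_{r=1}^{n} x_r = \frac{1}{n} X' \mathbf{1}
\]

where

\[ \mathbf{1} = (1, 1, \ldots, 1)' \]

is the n-vector whose components are all equal to 1.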

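The sample covariance matrix

Let $\tilde{X}$ denote the centered data matrix:

\[
\tilde{X} = \begin{pmatrix}
x_{11} - \bar{x}_1 & \cdots & x_{1j} - \bar{x}_j & \cdots & x_{1p} - \bar{x}_p \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} - \bar{x}_1 & \cdots & x_{ij} - \bar{x}_j & \cdots & x_{ip} - \bar{x}_p \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} - \bar{x}_1 & \cdots & x_{nj} - \bar{x}_j & \cdots & x_{np} - \bar{x}_p
\end{pmatrix}
\]

We can write

\[ \tilde{X} = X - \mathbf{1} \bar{x}' \]

because

\[
\mathbf{1} \bar{x}' = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)
= \begin{pmatrix}
\bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \\
\bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \\
\vdots    & \vdots    &        & \vdots \\
\bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p
\end{pmatrix}
\]

Since $\bar{x}' = \frac{1}{n} \mathbf{1}' X$, then

\[ \tilde{X} = X - \frac{1}{n} \mathbf{1}\mathbf{1}' X = \left( I_n - \frac{1}{n} \mathbf{1}\mathbf{1}' \right) X \]

It is easy to check that the matrix $I_n - \frac{1}{n} \mathbf{1}\mathbf{1}'$ is symmetric and that

\[
\left( I_n - \frac{1}{n} \mathbf{1}\mathbf{1}' \right)^2
= \left( I_n - \frac{1}{n} \mathbf{1}\mathbf{1}' \right) \left( I_n - \frac{1}{n} \mathbf{1}\mathbf{1}' \right)
= I_n - \frac{1}{n} \mathbf{1}\mathbf{1}'
\]

The final step is to realize that

\[
s_{ij} = \frac{1}{n-1} \sum_{r=1}^{n} (x_{ri} - \bar{x}_i)(x_{rj} - \bar{x}_j)
= \frac{1}{n-1} \sum_{r=1}^{n} \tilde{x}_{ri} \tilde{x}_{rj}
= \frac{1}{n-1} \left( \tilde{X}' \tilde{X} \right)_{ij}
\]

So that

\[
(n-1) S = \tilde{X}' \tilde{X}
= \left[ \left( I_n - \frac{1}{n} \mathbf{1}\mathbf{1}' \right) X \right]' \left[ \left( I_n - \frac{1}{n} \mathbf{1}\mathbf{1}' \right) X \right]
= X' \left( I_n - \frac{1}{n} \mathbf{1}\mathbf{1}' \right)^2 X
= X' \left( I_n - \frac{1}{n} \mathbf{1}\mathbf{1}' \right) X
\]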
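The matrix identities in this derivation can be checked numerically on the five-student example. The sketch below uses plain Python lists; `matmul` and `transpose` are small illustrative helpers, and `H` is a name chosen here for the matrix $I_n - \frac{1}{n}\mathbf{1}\mathbf{1}'$.

```python
# Five-student example: columns are x1 = age, x2 = mark, x3 = sex.
X = [
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
]
n, p = len(X), len(X[0])


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(c) for c in zip(*A)]


# H = I_n - (1/n) 1 1': identity minus a matrix whose entries are all 1/n.
H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

HH = matmul(H, H)  # equals H up to rounding, since H is idempotent
Xc = matmul(H, X)  # centered data matrix: each column now has mean zero

# (n - 1) S = X' H X
XtHX = matmul(matmul(transpose(X), H), X)
S = [[v / (n - 1) for v in r] for r in XtHX]

for r in S:
    print(r)
```

Multiplying by `H` removes the column means, so the single product `X' H X` replaces the explicit centering step.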

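In the textbook

\[
J = \mathbf{1}\mathbf{1}' = \begin{pmatrix}
1 & \cdots & 1 \\
\vdots &   & \vdots \\
1 & \cdots & 1
\end{pmatrix}_{n \times n}
\]

And then

\[ (n-1) \, S_{p \times p} = X' \left( I - \frac{1}{n} J \right) X \]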

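Another Expression for S

Note:

\[ (n-1) S = \tilde{X}' \tilde{X} \]

where $\tilde{X} = X - \mathbf{1}\bar{x}'$ is the centered data matrix, and

\[
\tilde{X} = \begin{pmatrix} \tilde{x}_1' \\ \vdots \\ \tilde{x}_i' \\ \vdots \\ \tilde{x}_n' \end{pmatrix}
= \begin{pmatrix} (x_1 - \bar{x})' \\ \vdots \\ (x_i - \bar{x})' \\ \vdots \\ (x_n - \bar{x})' \end{pmatrix}
\]

Thus

\[
(n-1) S = \left( \tilde{x}_1, \ldots, \tilde{x}_n \right)
\begin{pmatrix} \tilde{x}_1' \\ \vdots \\ \tilde{x}_n' \end{pmatrix}
= \sum_{i=1}^{n} \tilde{x}_i \tilde{x}_i'
= \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'
\]

Hence

\[ S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' \]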
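This form builds S one object at a time, as a sum of p × p outer products of the centered rows. A minimal sketch in plain Python on the five-student example (variable names are illustrative):

```python
# Five-student example: columns are x1 = age, x2 = mark, x3 = sex.
X = [
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
]
n, p = len(X), len(X[0])

# Sample mean vector.
xbar = [sum(r[j] for r in X) / n for j in range(p)]

# Accumulate the outer products (x_i - xbar)(x_i - xbar)' over the objects.
S = [[0.0] * p for _ in range(p)]
for x in X:
    d = [x[j] - xbar[j] for j in range(p)]  # centered row
    for i in range(p):
        for j in range(p):
            S[i][j] += d[i] * d[j]

# Divide by n - 1 to obtain the sample covariance matrix.
S = [[v / (n - 1) for v in r] for r in S]

for r in S:
    print(r)
```

Each object contributes a rank-one matrix, so the sum can be updated as new objects arrive without forming the whole data matrix.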

Data are frequently scaled as well as centered. The scaling is done by introducing:

Defn: the sample correlation coefficient for (between) the ith and the jth variables

\[ r_{ij} = \frac{s_{ij}}{s_i s_j} = \frac{s_{ij}}{\sqrt{s_{ii} s_{jj}}} \]

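Defn: the sample correlation matrix

\[
R = (r_{ij})_{p \times p} = \begin{pmatrix}
1      & r_{12} & \cdots & r_{1p} \\
r_{12} & 1      & \cdots & r_{2p} \\
\vdots & \vdots &        & \vdots \\
r_{1p} & r_{2p} & \cdots & 1
\end{pmatrix}
\]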

Obviously

\[ r_{ii} = \frac{s_{ii}}{s_i s_i} = \frac{s_{ii}}{\sqrt{s_{ii} s_{ii}}} = 1 \]

and, using the Cauchy-Schwarz inequality,

\[ |r_{ij}| \le 1 \]

If R = I then we say the variables are uncorrelated

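Note: if we denote

\[
D = \operatorname{diag}(s_1, s_2, \ldots, s_p) = \begin{pmatrix}
s_1    & 0      & \cdots & 0 \\
0      & s_2    & \cdots & 0 \\
\vdots & \vdots &        & \vdots \\
0      & 0      & \cdots & s_p
\end{pmatrix}
\]

then it can be checked that

\[ R = D^{-1} S D^{-1} \]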
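Because D is diagonal, the product $D^{-1} S D^{-1}$ simply divides each $s_{ij}$ by $s_i s_j$. The sketch below applies this to the five-student example in plain Python; the variable names are illustrative.

```python
import math

# Five-student example: columns are x1 = age, x2 = mark, x3 = sex.
X = [
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
]
n, p = len(X), len(X[0])

# Sample mean vector and sample covariance matrix (divisor n - 1).
xbar = [sum(r[j] for r in X) / n for j in range(p)]
S = [[sum((r[i] - xbar[i]) * (r[j] - xbar[j]) for r in X) / (n - 1)
      for j in range(p)] for i in range(p)]

# Diagonal of D: the sample standard deviations s_i = sqrt(s_ii).
s = [math.sqrt(S[i][i]) for i in range(p)]

# Entry (i, j) of D^{-1} S D^{-1} is s_ij / (s_i * s_j).
R = [[S[i][j] / (s[i] * s[j]) for j in range(p)] for i in range(p)]

for r in R:
    print([round(v, 3) for v in r])
```

The diagonal of `R` is 1 and every off-diagonal entry lies between -1 and 1, as the Cauchy-Schwarz bound requires.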

Measures of Multivariate Scatter

The sample variance-covariance matrix S is an obvious generalization of the univariate concept of variance, which measures scatter about the mean.

Sometimes it is convenient to have a single number to measure the overall multivariate scatter.

There are two common measures of this type:

Defn: The generalized sample variance

\[ |S| = \det S \]

Defn: The total sample variance

\[ \operatorname{tr} S = \sum_{i=1}^{p} s_{ii} = \sum_{i=1}^{p} s_i^2 \]

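In both cases, large values indicate a high degree of scatter about the centroid $\bar{x}$; low values indicate concentration about the centroid $\bar{x}$.

Using the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ of the matrix S, it can be shown that

\[ |S| = \lambda_1 \lambda_2 \cdots \lambda_p \]

\[ \operatorname{tr} S = \lambda_1 + \lambda_2 + \cdots + \lambda_p \]

If $\lambda_p = 0$ then

\[ |S| = 0 \]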

This says that there is a linear dependence amongst the variables.

Normally, S is positive definite and all the eigenvalues are positive.
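Both scatter measures, and the effect of a linear dependence, can be seen on the five-student example. This is a sketch in plain Python; `cov_matrix` and `det3` are illustrative helpers (the determinant routine only handles 3 × 3 matrices), and the dependent variable x1 + x2 is a constructed example, not part of the original data.

```python
# Five-student example: columns are x1 = age, x2 = mark, x3 = sex.
X = [
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
]


def cov_matrix(X):
    """Sample covariance matrix (divisor n - 1) of a list of rows."""
    n, p = len(X), len(X[0])
    xbar = [sum(r[j] for r in X) / n for j in range(p)]
    return [[sum((r[i] - xbar[i]) * (r[j] - xbar[j]) for r in X) / (n - 1)
             for j in range(p)] for i in range(p)]


def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion along row 1."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


S = cov_matrix(X)
gen_var = det3(S)                          # generalized sample variance |S|
total_var = sum(S[i][i] for i in range(3)) # total sample variance tr S

# Constructed example: replace x3 by x1 + x2, an exact linear dependence.
X_dep = [[r[0], r[1], r[0] + r[1]] for r in X]
gen_var_dep = det3(cov_matrix(X_dep))      # zero up to rounding error

print(gen_var, total_var, gen_var_dep)
```

For the original data |S| is positive, while for the constructed data it vanishes (up to floating-point rounding) because one column is a linear combination of the others.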

Linear combinations

Taking linear combinations of variables is one of the most important tools of multivariate analysis.

This is for basically two reasons:

1. A few appropriately chosen combinations may provide most of the information contained in a large number of the original variables. (This is called dimension reduction.)

2. Linear combinations can simplify the structure of the variance-covariance matrix, which can help in the interpretation of the data.

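For a given vector of constants:

\[ a = (a_1, a_2, \ldots, a_p)' \]

We consider a linear combination

\[ Y_i = a_1 x_{i1} + a_2 x_{i2} + \cdots + a_p x_{ip} = a' x_i \]

for i = 1, 2, … , n. Then

\[ \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i = \frac{1}{n} \sum_{i=1}^{n} a' x_i = a' \bar{x} \]

And the variance of the Y's is

\[
s_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2
= a' \left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' \right] a
= a' S a
\]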
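The two identities for a linear combination can be verified numerically. The sketch below uses the five-student example and an arbitrary coefficient vector `a = [1.0, 0.5, -2.0]`, which is a made-up choice for illustration only.

```python
# Five-student example: columns are x1 = age, x2 = mark, x3 = sex.
X = [
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
]
n, p = len(X), len(X[0])

a = [1.0, 0.5, -2.0]  # hypothetical coefficients, chosen arbitrarily

# Sample mean vector and sample covariance matrix (divisor n - 1).
xbar = [sum(r[j] for r in X) / n for j in range(p)]
S = [[sum((r[i] - xbar[i]) * (r[j] - xbar[j]) for r in X) / (n - 1)
      for j in range(p)] for i in range(p)]

# Direct route: form Y_i = a' x_i, then take its mean and variance.
Y = [sum(a[j] * r[j] for j in range(p)) for r in X]
ybar = sum(Y) / n
s2_Y = sum((y - ybar) ** 2 for y in Y) / (n - 1)

# Matrix route: a' xbar and the quadratic form a' S a.
a_xbar = sum(a[j] * xbar[j] for j in range(p))
aSa = sum(a[i] * S[i][j] * a[j] for i in range(p) for j in range(p))

print(ybar, a_xbar)  # both approximately 55.66
print(s2_Y, aSa)     # both approximately 34.623
```

Once $\bar{x}$ and S are available, the mean and variance of any linear combination follow without returning to the raw data.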
