
The Application of SAS in Multivariate Analysis

Author: Meng Xiangli & Peng Sisi

Supervisor: Karl-Erik Westergren

C-level in Statistics, November 2009

School of Economics and Social Sciences

Högskolan Dalarna, Sweden

Table of Contents

Abstract

1. Introduction

2. Background

2.1 Importance of Multivariate Statistics

2.2 Problem raised

2.3 Comparison of software

3. Format suggested

4. Application of the Format

4.1 Hierarchical Clustering Format

4.2 Principal Component Analysis Format

4.3 Canonical Correlation Analysis Format

5. Discussion

6. References

7. Appendix

Abstract

This paper suggests a format for instructive texts on multivariate data analysis methods. The target group is people with basic statistical knowledge who need to apply multivariate analysis to a specific problem. As background we discuss the problems of learning multivariate analysis; we then present and motivate a format that we believe helps people learn the techniques easily and in a short time. The format is applied to three methods (cluster analysis, principal component analysis and canonical correlation analysis) to show how it works. The paper ends with a statistical-pedagogical discussion of the format.

1. Introduction

The data that need to be analyzed are becoming more complicated. Most commercial and industrial data sets have many variables and a very large number of observations. Dealing with such multivariate data requires not only theory but also programming technique, because the calculations are too extensive to carry out by hand.

However, people with only basic statistical knowledge often find it hard to grasp multivariate techniques because the theories and computer programs seem difficult and complex. There are many books and papers on multivariate statistics, but it takes a lot of time to learn and use the techniques. There are two main reasons for this. One is that many materials do not present the necessary theory and the computer procedures together. Some give no program procedures, while others give procedures with only brief theory, so readers do not know how to put the theory into practice or which theory a program relies on. The other reason is that materials on this subject are often too long. The techniques are introduced from many angles, with many mathematical formulas, calculations and technical concepts. This is likely to be too difficult and time consuming for a person with an applied approach. People who need multivariate analysis in their busy work therefore cannot afford the time and energy to learn the techniques from these materials.

So there is a need to develop formats for introducing and explaining multivariate techniques to people with only basic statistical knowledge. Some books give good examples of how to lead people to learn statistical methods. We draw on the format and methods of these materials and suggest improvements for teaching the technique of multivariate analysis.

The decision of which software to use is an inevitable part of the process. There is

much software that can be used for multivariate analysis but we choose SAS because

of its advantages.

This paper suggests a format for teaching beginners the technique of multivariate analysis using SAS software. It aims to help readers learn the techniques easily and in a short time. The paper takes three multivariate analysis methods to illustrate how to apply the format. Readers can learn other multivariate techniques in a similar way. We also hope that textbook writers will consider our suggestions.

2. Background

2.1 Importance of Multivariate statistics

As society and the economy develop, phenomena are becoming more closely related to each other. As a result, the data that need to be analyzed are becoming more complex, containing more variables and observations. Traditional statistical methods can be used on some occasions, but they are limited when it comes to multivariate problems. The theories of multivariate analysis were developed long ago, but they were not widely used until powerful computers and corresponding software became available. Multivariate analysis is now widely used in business and social science, and it is becoming an important and unavoidable area to master, especially for an applied statistician.

2.2 Problem raised

However, although multivariate analysis is becoming more important in practice, many people find it hard to grasp. There are many books and papers introducing multivariate analysis, but people with basic statistical knowledge often find them abstruse and cannot afford the time that is needed. One reason is that some materials do not present the necessary theory and the computer procedures together. Classical textbooks like 'Multivariate Statistical Analysis' (He Xiaoqun 2003) and 'Applied Multivariate Statistical Analysis' (Gao Huixuan 2001) contain rich theory but no program procedures, while some papers and SAS textbooks give procedures with only brief theory, like the most commonly used course book in China, 'SAS statistical analysis and application' (Huang Yan & Wu Ping 2006). Beginners then do not understand the theory or how to put it into practice.

We also believe that materials on this subject are often too long. Because the techniques are introduced from many angles, with many mathematical formulas, calculations and technical concepts, it takes a long time to read them and find out how to apply a technique to a specific problem.

Some books give good examples of how to teach statistical methods, such as 'Statistical Analysis and The Application of SPSS' (Xue Wei 2007) and 'Statistics' (Jia Junping 2007). These books have the following advantages: (1) they give the necessary theory in a way that readers can easily understand; (2) they illustrate in detail how to use software to apply the theory, so that people who are not familiar with the software also know how to operate it; (3) they show exactly how to interpret the results obtained with the software. With these advantages readers can easily grasp the techniques introduced. We adopt the good features of the formats of these two books and use a similar form. We also improve the format by integrating the use of software into the process, to make the whole process clearer and easier to follow.

2.3 Comparison of software

Software is important in multivariate analysis because of the large amount of calculation. We choose SAS because of its advantages in functions and applications. SAS is also preferred by many companies and industries, so people who can use SAS well may find it easier to get a job. Other software can also be used for statistical problems; some comparisons are given below (Feng Xuenan, Cui Yujie 2008).

Table 1: Comparisons

                    SPSS                  Eviews            Minitab          SAS
Application fields  Social science        Economic fields   Quality control  Most statistical fields
Specialty           Cross-sectional and   Time-series data  Quality data     Large data sets and
                    time-series data                                         multivariate data

3. Format suggested

This paper suggests a format for beginners who have some basic statistical knowledge and need to apply multivariate analysis in their work. The aim is that the reader will grasp the basic techniques of a specific method using SAS easily and in a short time. Readers could go on to further studies using texts written in a similar format. We believe that the suggested format is worth considering when writing manuals and textbooks. The paper uses three methods of multivariate analysis (hierarchical clustering, principal component analysis and canonical correlation analysis) to illustrate the format.

The structure of the suggested format is as follows:

Ⅰ. The idea. First a brief overview shows what the method is used for and how it works, so that readers know what method they are learning and what they will be able to do with it.

Ⅱ. Necessary theory and steps for the procedure. The necessary theory, including the steps of the procedure, is introduced briefly and combined with SAS code. Complex mathematical calculations and large numbers of technical concepts are avoided; the modeling theory receives little attention because it is rather theoretical and takes much time to understand. We believe readers can then quickly understand how to put the theory into practice.

Ⅲ. Example. A practical example shows how the method actually works. For the program, the paper gives the meaning and effect of each SAS command to illustrate how to construct it.

Ⅳ. SAS code and program outcome. These are given with reference to the corresponding steps of the process introduced in the theory part, and the explanation is based on the theory part.

Ⅴ. Interpretation and decision. Finally, the interpretation of the outcome and how to make decisions based on it are exemplified. We believe that in this way readers will easily understand the outcome and know how to make decisions themselves.

4. Application of the Format

4.1 The format applied to hierarchical clustering

The idea.

Hierarchical clustering (He Xiaoqun 2003) groups similar cases into the same category and thereby partitions the sample space. One can then identify the features of the different categories and make corresponding decisions. “Observations are sorted into subsets (called clusters) so that observations in the same cluster are similar in some sense, and observations are significant different between clusters” (Wikipedia, Oct 2009). The method is widely used. In commerce it is used, for example, to identify different types of customers and then make different sales plans; in biology it can be used to classify different types of genes and to classify animals and plants.

Necessary theory and steps for the procedure.

There are two kinds of hierarchical clustering: agglomerative and divisive. As the names suggest, agglomerative hierarchical clustering successively merges the sample cases until they form one category, while divisive clustering successively splits the sample space into finer categories; both build a hierarchy in the process. The idea of the agglomerative approach can be seen clearly from its steps; the divisive approach simply proceeds in the reverse direction.

1. Pretreatment of data: Choose the appropriate variables for the cluster analysis.

The first step can be seen as the pretreatment of the data. A case usually has many variables, and we should choose the appropriate ones for the analysis. This cannot be done by software; the analyst has to choose them according to two rules: the variables should reflect the characteristics of the different clusters, and they should not be highly correlated with each other.

2. Treatment of data.

(1) Calculate the distance between each pair of the n cases.

(2) Construct n clusters, each containing only one case.

(3) Join the two clusters with the shortest distance into a new cluster; calculate the distance between the new cluster and the other clusters.

(4) Repeat step (3) until all the cases are joined into one big cluster.

In SAS the procedure 'proc cluster' performs this for us, but we need to tell the software which kind of distance to use.

3. Make decisions upon the results.

(1) Display the joining process as a dendrogram (a tree-like graph).

(2) Decide the number of clusters.

This step is done by asking the program for the output that helps us decide the number of clusters.
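The agglomerative steps listed under "Treatment of data" can be sketched in a few lines of Python. This is only an illustration of the idea, not what 'proc cluster' does internally: the function names, the toy points and the choice of single linkage with Euclidean distance are assumptions made for the example.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two cases."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerate(points):
    """Return the merge history [(cluster_a, cluster_b, distance), ...]."""
    # Step (2): start with n clusters, each holding one case (by index).
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Steps (1) and (3): find the two clusters with the shortest
        # single-linkage distance (distance of their closest members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Step (4): replace the two clusters by their union and repeat.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

# Hypothetical toy data: two tight pairs of cases far from each other.
toy = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
for left, right, dist in agglomerate(toy):
    print(left, right, round(dist, 3))
```

The merge history plays the role of the dendrogram: each row records which clusters were joined and at what distance.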

It can be seen that there are two important steps in cluster analysis: defining the distance and determining the number of clusters. The definition of distance determines how the similarity of two elements is calculated. It also influences the shape of the clusters, as some elements may be close to one another according to one definition of distance and farther apart according to another. There are several ways to define the distance, each with its own advantages and disadvantages, and the different clustering methods are named after the definition of distance they use. The corresponding commands are given in the example.

(i) The definition of distance to use

Complete and single linkage: the distance between two groups is the largest (complete) or smallest (single) distance between elements of the different groups. Both methods are easy to calculate but both are biased. Complete linkage is strongly influenced by extreme values, and single linkage has the notorious chaining tendency: if two elements from different groups are close to each other, the two groups can be joined even though all their other elements are far apart.

Average linkage: the average distance between pairs of observations, one from each group. It is less influenced by extreme values. Clusters with small variances tend to be joined, so the method is slightly biased toward forming clusters with the same variance.

Centroid linkage: the distance between the means of the two groups. It is the most effective way to reduce the effect of extreme values, but the mean of a merged group tends toward the mean of the larger original group.

Ward's method: the distance is defined through the error sum of squares; two groups are joined if the new group gives the smallest increase in the error sum of squares. Ward's method joins clusters to maximize the likelihood at each level of the hierarchy. It is widely used because of its good clustering performance, especially when the differences between groups are not very pronounced. However, it is biased toward producing groups of equal size and is sensitive to extreme values.

There are other ways to calculate distance, but those above are the most commonly used. In practice we should choose a method according to the structure of the sample space and the characteristics of each method. For example, when there are many extreme values, complete linkage and Ward's method are better avoided. We can also try several methods, compare their results, and then choose an appropriate one.
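To make the five definitions of distance concrete, the following Python sketch computes each of them for two small groups. The groups and function names are hypothetical, and the formulas follow the verbal definitions above; SAS may compute some of these on squared distances, so the numbers are not meant to reproduce 'proc cluster' output.

```python
import math

def dist(p, q):
    """Euclidean distance between two cases."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(group):
    """Mean of a group, coordinate by coordinate."""
    n = len(group)
    return tuple(sum(p[k] for p in group) / n for k in range(len(group[0])))

def ess(group):
    """Error sum of squares: squared distances of members to the centroid."""
    c = centroid(group)
    return sum(dist(p, c) ** 2 for p in group)

def linkage_distances(g1, g2):
    """Distance between two groups under each definition in the text."""
    pair = [dist(p, q) for p in g1 for q in g2]
    return {
        "single": min(pair),                    # closest pair of members
        "complete": max(pair),                  # farthest pair of members
        "average": sum(pair) / len(pair),       # mean over all pairs
        "centroid": dist(centroid(g1), centroid(g2)),
        # Ward: increase in the error sum of squares caused by the merge.
        "ward": ess(g1 + g2) - ess(g1) - ess(g2),
    }

# Hypothetical toy groups.
g1 = [(0.0, 0.0), (0.0, 2.0)]
g2 = [(4.0, 0.0), (4.0, 2.0)]
for name, value in linkage_distances(g1, g2).items():
    print(name, round(value, 3))
```

Comparing the values for the same pair of groups shows why the choice of method can change which clusters are joined first.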

(ii) What number of clusters to use?

The second important step is to determine the number of clusters. In step 3 we use the procedure 'proc tree' to request the dendrogram, which shows the joining process directly and intuitively. There are also several statistics that help us determine the number of clusters:

1. Fixed distance. Choose a fixed distance in the dendrogram; when the distances between all remaining groups are larger than this distance, the agglomeration stops. The result can be read directly from the graph.

2. Cubic clustering criterion (CCC). It tests the hypothesis that the data have been sampled from a uniform distribution. We look for a peak at which the CCC is greater than 3 and choose the corresponding number of clusters.

3. Pseudo F statistic (PSF). It measures the separation among the clusters at the current level of the hierarchy. The number of clusters can be chosen where PSF has a peak.

4. Pseudo t² statistic (PST2). It is similar to the pseudo F statistic, and the number of clusters is again determined by peak values.

In 'proc cluster' the CCC is requested with the option 'ccc', and the pseudo F and pseudo t² statistics with the option 'pseudo', so the program needs these options in addition to the request for the dendrogram. The chosen number of clusters should then be checked to see whether the result suits practical use and is easy to explain, or whether some adjustment is needed.

Example.

In geographical research, cities can be classified according to their climate. Here data on the meteorological conditions of 13 cities of the world (Wu Xizhi 2005) are used for such a classification. The climate variables are: altitude (meters), average annual temperature, average temperature of the coldest month, average temperature of the warmest month, average precipitation (mm/m), average precipitation in the wettest month, average precipitation in the driest month, average precipitation in the clearest month (%), and average precipitation in the cloudiest month (%).

SAS code and the program outcome.

The program is as follows:

data climate;
   length v01 $ 15;        /* city name; the default length 8 would truncate it */
   input v01 $ v02-v10;    /* v02-v10: the nine climate variables */
   cards;
Beijing 37.0 11.7 -4.7 26.0 632.0 25417.0 3.1 47150.0 20271.0
Tokyo 6.0 13.8 3.0 26.4 1625.0 2155.0 220.0 41.0 30103.0
Berlin 51.0 8.8 -.5 17.8 592.0 26816.0 11749.0 16923.0 30621.0
Singapore 17.0 27.2 25.6 27.8 24.0 258.0 169.0 581.0 24838.0
San_Diego 520.0 14.7 8.6 20.6 363.0 29373.0 7.0 37304.0 20607.0
London 5.0 10.6 4.3 17.7 593.0 21002.0 19725.0 22160.0 27039.0
San_Francisco 16.0 13.7 10.0 16.5 517.0 100.0 4.0 11871.0 22647.0
Rome 51.0 16.2 7.5 25.6 760.0 115.0 37506.0 37487.0 21520.0
Washington 22.0 13.8 3.1 25.4 1050.0 120.0 70.0 19268.0 25934.0
Montreal 57.0 6.5 -9.2 21.3 1048.0 102.0 25659.0 19937.0 27334.0
Paris 15.0 11.5 3.5 19.5 619.0 23590.0 12844.0 18111.0 26634.0
Honolulu 4.0 22.2 22.2 25.8 610.0 110.0 37457.0 17411.0 22678.0
Sydney 41.0 11.7 11.7 22.3 1181.0 140.0 70.0 14824.0 59.0
;
run;

/* Ward's method on standardized variables; save the tree and print CCC, PSF, PST2 */
proc cluster data=climate method=ward standard outtree=tree1 ccc pseudo;
   var v02-v10;            /* names must match those in the input statement */
run;

proc print data=tree1;
run;

/* draw the dendrogram from the saved tree */
proc tree data=tree1;
run;

The program starts with the data step. The 'data' statement defines the name of the data set. The 'input' statement tells SAS to read the values and names the variables. 'cards' marks the start of the data lines. In the second step, 'proc cluster' is the procedure for clustering. 'method=ward' selects Ward's method. The other methods are selected with 'method=ave', 'method=cen', 'method=com' and 'method=sin', which stand for average linkage, centroid linkage, complete linkage and single linkage. 'standard' standardizes the variables, because their magnitudes differ a lot; when the magnitudes are similar, it can be omitted. 'outtree=tree1' saves the clustering history in the data set tree1, from which 'proc tree' draws the dendrogram. 'ccc' and 'pseudo' ask for the cubic clustering criterion and the pseudo F and pseudo t² statistics in the output.
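The effect of the 'standard' option can be illustrated outside SAS. The sketch below standardizes the altitude column of the example data to mean 0 and standard deviation 1; the function name is hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption about the convention.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, divisor n - 1
    return [(v - mean) / sd for v in values]

# Altitude (first numeric column) of the 13 cities in the example data.
altitude = [37.0, 6.0, 51.0, 17.0, 520.0, 5.0, 16.0,
            51.0, 22.0, 57.0, 15.0, 4.0, 41.0]
z = standardize(altitude)
print([round(v, 2) for v in z])
```

After standardization every variable is on the same scale, so a variable measured in large units no longer dominates the distances between cities.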

The output mainly contains the cubic clustering criterion and the pseudo F statistic, as well as the dendrogram. It gives us important information on how to form the clusters.

Table 2: The cubic clustering criterion and pseudo F test statistics

NCL  Clusters Joined  FREQ  SPRSQ   RSQ    ERSQ   CCC   PSF   PST2
12   OB6    OB11      2     0.0029  0.997  .      .     31.8  .
11   OB3    CL12      3     0.0064  0.991  .      .     21.5  2.2
10   OB2    OB9       2     0.0204  0.970  .      .     10.9  .
 9   OB8    OB12      2     0.0277  0.943  .      .     8.2   .
 8   CL11   OB7       4     0.0455  0.897  .      .     6.2   9.9
 7   CL10   OB10      3     0.0561  0.841  .      .     5.3   2.7
 6   OB1    OB5       2     0.0808  0.760  .      .     4.4   .
 5   OB4    CL9       3     0.1047  0.656  .      .     3.8   3.8
 4   CL7    OB13      4     0.1071  0.548  .      .     3.6   2.8
 3   CL6    CL8       6     0.1336  0.415  .      .     3.5   3.9
 2   CL3    CL4       10    0.1747  0.240  0.356  -1.4  3.5   3.1
 1   CL2    CL5       13    0.2401  0.000  0.00   0.00  .     3.5
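The PSF column can be related to the RSQ column. The following Python sketch uses the usual pseudo F formula, PSF = (R²/(c - 1)) / ((1 - R²)/(n - c)) for c clusters and n observations; the formula and the function name are supplied here for illustration and are not stated in the original text. Because RSQ is printed with three decimals, the recomputed values agree with the table only approximately.

```python
def pseudo_f(rsq, n_clusters, n_obs):
    """Pseudo F statistic computed from R-squared at a given cluster count."""
    between = rsq / (n_clusters - 1)
    within = (1.0 - rsq) / (n_obs - n_clusters)
    return between / within

# RSQ values copied from the table above (n = 13 cities).
rsq_by_ncl = {2: 0.240, 3: 0.415, 4: 0.548, 5: 0.656, 6: 0.760}
for ncl, rsq in sorted(rsq_by_ncl.items()):
    print(ncl, round(pseudo_f(rsq, ncl, 13), 2))
```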

The table shows that the cubic clustering criterion (CCC) and the pseudo F statistic (PSF) do not give a useful peak: PSF reaches its maximum when the number of clusters is 12, and CCC is reported for only two rows. The pseudo t² statistic has local maxima at 3 and 8 clusters, but 8 clusters are too many for practical use, so we look for the number of clusters in the neighborhood of 3. In the next step we use the dendrogram to make a decision:

[Figure: Dendrogram of the cluster analysis]

Interpretation and decision.

From the dendrogram we can see that three or four clusters are appropriate, depending on practical needs. The temperature in Beijing has been increasing recently, as it has in San Diego, so the two cities are placed in one class, which we define as a subtropical climate. London, Paris and Berlin have a typical European climate, which coincides with their geographical position. San Francisco is close to the coast and its latitude does not differ greatly from theirs, so it is placed in the same category, which we define as a warm temperate maritime climate. These first two types can be joined if fewer clusters are needed, because the distance between them is shorter than the distance to the other types, which means they are more similar. Tokyo, Washington, Montreal and Sydney are located in the middle of the mainland and are placed in the same class, named temperate continental climate. Singapore, Rome and Honolulu have a typical Mediterranean climate. The results of the classification are largely in line with common sense; that is, the classification method is effective.

4.2 The format applied to principal component analysis

The idea.

“Principal component analysis(PCA) is appropriate when you have obtained measures

on a large number of observed variables and wish to develop a smaller number of

artificial variables (called principal components ) that will account for most of the

variance in the observed variables. The principal components may then be used as

predictor or criterion variables in subsequent analyses.”(support.sas 2009)

Principal component analysis (He Xiaoqun, 2003) is a variable reduction procedure. It is useful for anyone who needs to reduce a number of observed variables to a smaller number of principal components. This is possible because the variables often measure the same underlying structure and are therefore correlated with one another.


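The linear relationship between the factors and the original indicators is modeled theoretically as follows:

$$
\begin{aligned}
y_1 &= u_{11} x_1 + u_{12} x_2 + \dots + u_{1p} x_p \\
y_2 &= u_{21} x_1 + u_{22} x_2 + \dots + u_{2p} x_p \\
&\;\;\vdots \\
y_p &= u_{p1} x_1 + u_{p2} x_2 + \dots + u_{pp} x_p
\end{aligned}
$$

or, in matrix form, $y = u'x$,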

where $x_i$ is an original variate and $y_i$ is a principal component ($i = 1, 2, \dots, p$). The main idea of PCA is to let the first few $y_i$ (not all of them) explain most of the variance of x; that is, the first few y contain most of the information contained in x. Because not all the $y_i$ are used, the dimension is reduced. The aim of PCA is therefore to find the coefficients of x. The steps of PCA are as follows:

1. The pretreatment of data.

Generally PCA does not require special assumptions about the variables. There is one requirement, however: the correlation coefficients between different variates should be larger than 0.3 (He Xiaoqun, 2003), because if the variables are not strongly related, reducing the dimension will not be effective.

2. Treatment of data.

This step aims at finding the coefficients of x by mathematical methods. For the matrix constructed from X, the eigenvalues $\lambda_i$ are calculated; $\lambda_i$ represents the amount of variance that is accounted for by $y_i$. The $\lambda_i$ are ranked from largest to smallest, corresponding to $y_1, \dots, y_p$. The eigenvectors $u_i$ belonging to the eigenvalues $\lambda_i$ then form the coefficient matrix $(u_1, u_2, \dots, u_p) = u'$. The whole process can be run in SAS with the procedure 'proc factor', which gives us the coefficients u.

3. Making decisions upon results

With the eigenvalues and eigenvectors obtained in step 2, $v_i = \lambda_i / \sum_{k=1}^{p} \lambda_k$ denotes the proportion of the variance of x that is explained by $y_i$, and $\sum_{k=1}^{m} v_k$ denotes the cumulative proportion of variance explained by the first m components. The first few principal components are chosen so that this cumulative proportion is more than 0.80 (He Xiaoqun 2003, chap. 5). With the SAS output we can then decide the number of factors to retain.

Example.

A specific example based on the economic indicators of all the districts in Beijing (Beijing Statistical Yearbook 2006) is now presented to illustrate the process of principal component analysis. Six economic indicators are used: average wage of workers (yuan per worker), GDP of the area (ten thousand yuan), per capita disposable income (yuan), local financial revenue (ten thousand yuan), investment in fixed assets (ten thousand yuan), and retail sales of social consumer goods (ten thousand yuan). PCA is used to reduce the dimension of the variates for the convenience of further economic research.

In principal component analysis, the number of components extracted is equal to

the number of variables being analyzed. Because six variables are analyzed in the

present study, six components will be extracted. The first component can be expected

to account for a fairly large amount of the total variance. Each succeeding component

will account for progressively smaller amounts of variance. Although 6 components

are extracted in this way, only the first few components will be important enough to

be retained for interpretation.


data economic;

input aww gdp cdi lfr ifa rss;

cards;

32173.00 3294121 14880.10 313801.0 1702707 1660120

33856.00 3487500 14539.00 343806.0 1335592 1640614

……

15691.00 683842.0 12023.10 83336.00 726398.2 267492.0

16904.00 374521.6 12852.60 28128.00 255814.0 298766.0

;

run;

proc factor data=economic simple corr;

run;

'proc factor' is the procedure used for PCA; its default extraction method is principal components. 'simple' requests the means and standard deviations of the original variables, and 'corr' requests their correlation matrix.

When correlations among several variables are computed, they are typically summarized in the form of a correlation matrix. This is an appropriate opportunity to review how a correlation matrix is interpreted.

Table3: Correlations between variables

a b c d e f

a 1.00000 0.71054 0.75292 0.80475 0.62057 0.77472

b 0.71054 1.00000 0.61457 0.93428 0.94433 0.96403

c 0.75292 0.61457 1.00000 0.69087 0.56189 0.64015

d 0.80475 0.93428 0.69087 1.00000 0.93908 0.91717

e 0.62057 0.94433 0.56189 0.93908 1.00000 0.88326

f 0.77472 0.96403 0.64015 0.91717 0.88326 1.00000

All correlations are larger than 0.3, so PCA can be used.
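Both the 0.3 check and the eigenvalue calculation of step 2 can be sketched on the correlation matrix printed above. The Python code below is an illustration only: power iteration is a technique chosen here for simplicity and is not claimed to be what 'proc factor' uses, and the function names are hypothetical. The loading of each variable is taken as the eigenvector element times the square root of the eigenvalue.

```python
import math

# Correlation matrix of the six variables a-f, copied from the table above.
R = [
    [1.00000, 0.71054, 0.75292, 0.80475, 0.62057, 0.77472],
    [0.71054, 1.00000, 0.61457, 0.93428, 0.94433, 0.96403],
    [0.75292, 0.61457, 1.00000, 0.69087, 0.56189, 0.64015],
    [0.80475, 0.93428, 0.69087, 1.00000, 0.93908, 0.91717],
    [0.62057, 0.94433, 0.56189, 0.93908, 1.00000, 0.88326],
    [0.77472, 0.96403, 0.64015, 0.91717, 0.88326, 1.00000],
]

def smallest_correlation(matrix):
    """Smallest off-diagonal correlation, for the 0.3 rule of thumb."""
    n = len(matrix)
    return min(matrix[i][j] for i in range(n) for j in range(n) if i != j)

def largest_eigenpair(matrix, iterations=200):
    """Approximate the largest eigenvalue and its unit eigenvector."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    value = 0.0
    for _ in range(iterations):
        prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        value = math.sqrt(sum(x * x for x in prod))  # length of R * vec
        vec = [x / value for x in prod]              # renormalize
    return value, vec

eigenvalue, eigenvector = largest_eigenpair(R)
loadings = [math.sqrt(eigenvalue) * u for u in eigenvector]
print(round(smallest_correlation(R), 5), round(eigenvalue, 4))
print([round(x, 4) for x in loadings])
```

The resulting eigenvalue and loadings should be close to the first eigenvalue and the Factor1 loadings reported by SAS, up to the rounding of the printed correlations.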

Table 4: Eigenvalues and proportions

    Eigenvalue   Difference   Proportion  Cumulative
1   4.94666570   4.29921045   0.8244      0.8244
2   0.64745526   0.39770550   0.1079      0.9324
3   0.24974976   0.13679681   0.0416      0.9740
4   0.11295295   0.08697200   0.0188      0.9928
5   0.02598094   0.00878555   0.0043      0.9971
6   0.01719540                0.0029      1.0000

The first row shows that the proportion of the total variance accounted for by component 1 is 0.8244, which means that the first component extracts most of the variance and the remaining components account for rather small proportions. The first component can therefore be used to represent most of the information in the six original variates.

The factor loading matrix is given below. We keep only the factors that we need, which in this case is only one:
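The Proportion and Cumulative columns follow directly from the eigenvalues, as described in step 3. A minimal Python sketch, with hypothetical function names and the 0.80 threshold taken from the text:

```python
# Eigenvalues copied from the SAS output above.
eigenvalues = [4.94666570, 0.64745526, 0.24974976,
               0.11295295, 0.02598094, 0.01719540]

def explained(eigs):
    """Return (proportions, cumulative proportions) of explained variance."""
    total = sum(eigs)
    props = [e / total for e in eigs]
    cum = []
    running = 0.0
    for p in props:
        running += p
        cum.append(running)
    return props, cum

def components_needed(cum, threshold=0.80):
    """Smallest number of components whose cumulative proportion reaches the threshold."""
    for k, c in enumerate(cum, start=1):
        if c >= threshold:
            return k
    return len(cum)

props, cum = explained(eigenvalues)
print([round(p, 4) for p in props])
print([round(c, 4) for c in cum])
print(components_needed(cum))
```

With these eigenvalues a single component already passes the 0.80 threshold, which matches the decision made in the text.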

Table5: Factor loading matrix

Factor1

a 0.85064

b 0.95778

c 0.76852

d 0.97603

e 0.91900

f 0.95808

The matrix (a vector here, because only one component is selected) shows the loading of the principal component on each original variate. If we sum the squared loadings, we get the eigenvalue of the component.

For each original variate, the proportion of variance that is extracted by the principal component is given below:

Table 6: Final communality estimates: Total = 4.946666

a           b           c           d           e           f
0.72358523  0.91735171  0.59062243  0.95263854  0.84455722  0.91791057

All the original variates except c have a high proportion, which means that most of their information is extracted by the component.
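The link between the loadings, the communalities and the eigenvalue can be checked numerically. In the Python sketch below (variable names are illustrative), each communality is the squared loading, and their total equals the eigenvalue of the retained component up to rounding of the printed loadings.

```python
# Loadings of the first component, copied from the factor loading table.
loadings = {"a": 0.85064, "b": 0.95778, "c": 0.76852,
            "d": 0.97603, "e": 0.91900, "f": 0.95808}

# With one retained component, the communality of a variable is its squared loading.
communalities = {name: value ** 2 for name, value in loadings.items()}

# The sum of squared loadings reproduces the eigenvalue of the component.
eigenvalue = sum(communalities.values())

for name, h2 in communalities.items():
    print(name, round(h2, 5))
print("total", round(eigenvalue, 4))
```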


From the output above we can see that the first principal component can stand for most of the information contained in the six original variables. Because the component also has high loadings on all six variates, it can be considered a comprehensive economic factor and a measure of the economic development of the area. In further studies we can therefore use the first principal component to represent the six original variables.

4.3 The format applied to canonical correlation analysis

The idea.

Canonical correlation analysis is a widely used technique in multivariate statistical analysis. It has been applied in economics, meteorology and many modern information processing fields such as communication theory and statistical signal processing. Canonical correlation analysis is a multivariate statistical model that facilitates the study of interrelationships between a set of multiple dependent variables and a set of multiple independent variables. We extract canonical variables from each set and then explain the relationship between the two sets of variables through the first few pairs of canonical variables, which reduces the complexity of analyzing the relationship (Joseph F. Hair, Jr., Rolph E. 1998).


Canonical correlation analysis (CCA) can be seen as an extension of principal component analysis (PCA), and its main idea builds on PCA. From each group of variables, components are extracted so as to maximize the correlation coefficient between the components of the two groups. Components within the same group are uncorrelated. We can then use the correlation of the components to analyze the correlation between the two groups of variables, which makes it less complex to quantify the strength of the relationship. The extracted components appear in pairs, and each pair of corresponding components is called a pair of canonical variables. The mathematical formulation of CCA is as follows. Let $x_1, x_2, \dots, x_p$ and $y_1, y_2, \dots, y_q$ be two groups of variables. The canonical variables, denoted V and W, are given by

$$
\begin{aligned}
V &= a_1 x_1 + \dots + a_p x_p \\
W &= b_1 y_1 + \dots + b_q y_q
\end{aligned}
$$

The purpose of CCA is to find the coefficients $a_1, \dots, a_p$ and $b_1, \dots, b_q$ that maximize the correlation coefficient of V and W. In SAS the whole procedure can be carried out with 'proc cancorr', which also gives the results we need and the corresponding tests.
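The maximization described above can be written compactly. The following is a sketch in standard notation; the covariance matrices $\Sigma_{xx}$, $\Sigma_{yy}$ and $\Sigma_{xy}$ and the coefficient vectors a and b are symbols introduced here for illustration and do not appear in the original text.

```latex
\rho = \max_{a,\,b} \operatorname{corr}(V, W)
     = \max_{a,\,b}
       \frac{a' \Sigma_{xy}\, b}
            {\sqrt{a' \Sigma_{xx}\, a}\;\sqrt{b' \Sigma_{yy}\, b}},
\qquad V = a'x, \quad W = b'y .
```

The maximum $\rho$ is the first canonical correlation; further pairs of canonical variables are found in the same way, subject to being uncorrelated with the earlier pairs.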

The steps of CCA are given below:

1. Pretreatment of data

(1) Specify the target of CCA, that is, define the two groups of variables.

(2) Plan the analysis, that is, collect the sample data. Small samples sometimes lead to biased and unstable results, so at least 10 observations per variable are needed (Yu Xiulin, Ren Xuesong, 2007, chap. 10).

2. Treatment of data

(1) Assess the assumptions underlying CCA. First, the correlation coefficient
between any two variables is based on a linear relationship; if a relationship is
not linear, some variables have to be transformed. Second, the canonical correlation
is the linear relationship between the canonical variates.

(2) Estimate the canonical model and assess the overall model fit. The main
measures are: (a) the level of significance of the canonical function, usually
tested at the 0.05 level; (b) the redundancy measure of shared variance, obtained by
multiplying the shared variance of the variate by the squared canonical correlation.
It gives the amount of shared variance explained by each canonical function.

The SAS output needed for steps 3-4 appears under the titles "Test of H0: The
canonical correlations" (in the table "Eigenvalues of Inv(E)*H") and "Standardized
Canonical Coefficients for the VAR/WITH Variables", from which the corresponding
results can be read.

3. Make decisions based on the results.

(1) Interpret the canonical variates. This step focuses on determining the
relative importance of each original variable to the canonical variates. There are
two methods: (a) canonical loadings, which measure the linear relationship between
the original variables and the canonical variate of the same set; (b) canonical
cross-loadings, which directly measure the linear correlation between an original
variable and the canonical variates of the other set. SAS reports both under the
heading "Canonical Structure". They help us to understand the structure of the
canonical variates.

(2) Validation and diagnosis. This step ensures that the results are not specific
to the sample and can be generalized to the population. Generally there are two
methods: (a) create two subsamples of the data, perform the analysis on each
subsample separately, and compare the two sets of results; (b) remove one variable
and examine the sensitivity of the results.
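The checks in steps 2(1) and 3(2) can be sketched in SAS as follows. This is only a minimal sketch under stated assumptions: the data set name mydata, the variable names x1-x4 and y1-y4, and the seed value are hypothetical placeholders, and a random 50/50 split is just one possible way to form the two subsamples.

```sas
/* Hypothetical data set WORK.MYDATA with variables x1-x4 (first group)
   and y1-y4 (second group); all names here are illustrative. */

/* Step 2(1): scatter-plot matrix to screen for non-linear relationships */
proc sgscatter data=mydata;
   matrix x1-x4 y1-y4;
run;

/* Step 3(2)(a): split the sample at random into two halves.
   OUTALL keeps every observation and adds the 0/1 flag Selected. */
proc surveyselect data=mydata out=split method=srs samprate=0.5
                  seed=2009 outall;
run;

proc sort data=split;
   by Selected;
run;

/* Run the same canonical correlation analysis on each half
   and compare the two sets of output */
proc cancorr data=split;
   by Selected;
   var x1-x4;
   with y1-y4;
run;
```

With a small sample each half contains few observations, so the two sets of results should be compared with caution.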

Example.

We use the example of industrial scale and efficiency (Beijing Statistical Yearbook
2006) to illustrate how to use SAS in canonical correlation analysis.

The purpose is to find the relationship between industrial scale and efficiency. The
variables on industrial scale are: total industrial output value, total profits,
total assets, and sales revenue. The variables on efficiency are: value added rate,
total assets contribution rate, personnel labor productivity, and sales rate of
products. First the two sets of variables are defined. The sample covers 36
different industries with 4 variables in each group, which gives 9 observations per
variable; this is approximately enough for CCA, though a little below the 10-to-1
ratio. It is assumed that the correlations between the canonical variates are
linear. The complete data are given in the appendix.


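data differentindustry;
   /* industry names contain single blanks, so they are read with the & modifier;
      each name must be followed by at least two blanks in the data lines */
   length industry $ 40;
   input industry $ & x1-x4 y1-y4;
   cards;
coal mining  1152.99 28.71 3587.78 1144.27 47.60 4.92 19175.00 99.34
………
Tap water production and supply  213.33 5.68 1104.59 201.44 49.20 2.21 50803.00 97.73
;
run;

/* x1-x4: industrial scale variables, y1-y4: efficiency variables */
proc cancorr data=differentindustry;
   var x1-x4;
   with y1-y4;
run;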

"proc cancorr" is the canonical correlation analysis procedure; "var" and "with"
specify the two groups of variables. The test of the linear relationship in the
canonical correlation is:

Table 7: Test of H0: The canonical correlations

   Eigenvalue  Difference  Proportion  Cumulative  LRT        Approx. F  Num DF  Den DF  Pr > F
1  1.5709      0.8145      0.5618      0.5618      0.1492856  4.82       16      89.234  <.0001
2  0.7564      0.321       0.2705      0.8323      0.383797   3.92       9       73.163  0.0004
3  0.4353      0.4018      0.1557      0.988       0.674084   3.38       4       62      0.0145
4  0.0336                  0.0120      1.0000      0.9675388  1.07       1       32      0.3079

It can be seen that the tests for the first three pairs of canonical variables are
significant at the 0.05 level, which supports the assumption of a linear
relationship.
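The columns of Table 7 are linked by standard identities, which we sketch here as a worked check. The symbols $\lambda_k$ (eigenvalue of Inv(E)*H), $r_k$ (canonical correlation) and $\Lambda_k$ (likelihood ratio in row $k$) are our own notation, and the numbers are rounded from the table:

```latex
\lambda_k = \frac{r_k^2}{1 - r_k^2}
\quad\Longrightarrow\quad
r_k^2 = \frac{\lambda_k}{1 + \lambda_k},
\qquad
r_1^2 = \frac{1.5709}{2.5709} \approx 0.611, \quad r_1 \approx 0.782 .

\Lambda_k = \prod_{i \ge k} \left(1 - r_i^2\right)
          = \prod_{i \ge k} \frac{1}{1 + \lambda_i},
\qquad
\Lambda_1 = \frac{1}{2.5709 \times 1.7564 \times 1.4353 \times 1.0336} \approx 0.149 .
```

The value of $\Lambda_1$ agrees with the LRT entry 0.1492856 in the first row, up to rounding of the eigenvalues.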

For step 2(2), the SAS program gives the redundancy measure of shared variance.
The table above shows that the eigenvalues of the first two pairs of canonical
variates account for 83 percent of the total, i.e. most of the correlated
relationship. The result therefore has a satisfactory redundancy measure of shared
variance.
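To our understanding, PROC CANCORR prints the canonical redundancy analysis only when it is requested. A minimal sketch of such a call is given below; it assumes the data set differentindustry created by the data step above, and the labels passed to VNAME= and WNAME= are our own illustrative choices.

```sas
/* Sketch: request the redundancy analysis explicitly.
   Assumes the data set DIFFERENTINDUSTRY created earlier. */
proc cancorr data=differentindustry
             redundancy               /* print the canonical redundancy analysis */
             vprefix=V wprefix=W      /* prefixes for naming canonical variables */
             vname='Industrial scale' /* label for the VAR variables (our choice) */
             wname='Efficiency';      /* label for the WITH variables (our choice) */
   var x1-x4;
   with y1-y4;
run;
```

The redundancy tables report, for each canonical variable, the proportion of variance in one group of variables that is explained by the canonical variables of the other group, and they can be read alongside the eigenvalue proportions in Table 7.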


Next we need the coefficients of the canonical variables. SAS reports the canonical
coefficients in two forms: raw and standardized. Raw coefficients apply to the
original variables, and standardized coefficients apply to the standardized
variables. In this example the standardized coefficients are used because the
magnitudes of the variables differ greatly. The output coefficients, i.e. the
canonical weights, are given below:

Table 8: Standardized Canonical Coefficients for the VAR Variables

V1 V2 V3 V4

x1 0.4505 -2.6036 -1.6004 6.0199

x2 0.9485 0.5518 -0.5372 -0.0128

x3 0.9706 -1.1759 1.4022 2.6152

x4 -2.2066 4.0817 0.7718 -7.7569

Table 9: Standardized Canonical Coefficients for the WITH Variables

W1 W2 W3 W4

y1 1.0357 -0.0346 0.702 0.0107

y2 1.0806 -1.1882 -2.3763 1.257

y3 -1.3502 1.9062 1.4576 -1.7991

y4 0.0782 0.1893 -0.1431 1.1702

From the results above we only need the coefficients of the first two pairs of
canonical variates, which are functions of the standardized original variables:

$$\begin{aligned}
v_1 &= 0.45\,x_1 + 0.948\,x_2 + 0.971\,x_3 - 2.207\,x_4 \\
v_2 &= -2.604\,x_1 + 0.552\,x_2 - 1.176\,x_3 + 4.082\,x_4 \\
w_1 &= 1.036\,y_1 + 1.081\,y_2 - 1.350\,y_3 + 0.078\,y_4 \\
w_2 &= -0.035\,y_1 - 1.188\,y_2 + 1.906\,y_3 + 0.189\,y_4
\end{aligned}$$
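Because the coefficients are standardized, they apply to variables that have first been scaled to mean 0 and standard deviation 1. A minimal sketch of how the first pair of canonical variables could be computed by hand from the coefficients in Tables 8 and 9 is given below. The data set names std_industry and scores are hypothetical, and the data set differentindustry is assumed to exist as created by the data step above.

```sas
/* Standardize all eight variables to mean 0 and standard deviation 1 */
proc standard data=differentindustry mean=0 std=1 out=std_industry;
   var x1-x4 y1-y4;
run;

/* First pair of canonical variables, using the standardized
   coefficients of Tables 8 and 9 (columns V1 and W1) */
data scores;
   set std_industry;
   v1 = 0.4505*x1 + 0.9485*x2 + 0.9706*x3 - 2.2066*x4;
   w1 = 1.0357*y1 + 1.0806*y2 - 1.3502*y3 + 0.0782*y4;
run;

/* The correlation of v1 and w1 estimates the first canonical correlation */
proc corr data=scores;
   var v1;
   with w1;
run;
```

If the coefficients are as printed, the correlation reported by PROC CORR should be close to the first canonical correlation. PROC CANCORR can also write the canonical variable scores directly to a data set with its OUT= option.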

Step 3(1) is to interpret the canonical variates. The canonical weights are
already given above, and their values are acceptable. The canonical loadings are
given in the tables titled "Correlations Between the VAR Variables and Their
Canonical Variables" and "Correlations Between the WITH Variables and Their
Canonical Variables". The loadings of the VAR variables and the canonical
cross-loadings of the WITH variables are given in the following tables:

Table 10: Correlations Between the VAR Variables and Their Canonical Variables

V1 V2 V3 V4

x1 -0.4392 0.7385 -0.0291 0.5107

x2 0.4931 0.8105 -0.1649 0.2698

x3 -0.1049 0.7075 0.5306 0.4549

x4 -0.3771 0.8103 0.1566 0.4203

Table 11: Correlations Between the WITH Variables and the Canonical Variables of the VAR Variables

V1 V2 V3 V4

y1 0.6954 0.277 0.0741 -0.0201

y2 0.3498 0.3962 -0.3455 -0.0368

y3 0.233 0.5542 -0.2176 -0.0369

y4 -0.0238 0.4238 0.0359 0.1369

From the table we can read the correlation between the original variables in one
group and the canonical variates of the other group. For example, y1 is highly
related to v1, with a coefficient of 0.6954; the other correlations can be read from
the table in the same way.

From the two kinds of coefficients, the importance of each original variable
relative to the canonical variate can be obtained, and the canonical function is
then fully interpreted.
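Loadings and cross-loadings are linked by a standard identity, sketched here with our own notation: $r_k$ is the $k$-th canonical correlation, and $V_k$, $W_k$ are the $k$-th pair of canonical variables.

```latex
\operatorname{corr}(y_j, V_k) \;=\; r_k \,\operatorname{corr}(y_j, W_k),
\qquad
\operatorname{corr}(x_j, W_k) \;=\; r_k \,\operatorname{corr}(x_j, V_k).
```

As a rough illustration, if the values are as printed, the first eigenvalue 1.5709 gives $r_1 = \sqrt{1.5709/2.5709} \approx 0.782$, so the cross-loading 0.6954 of y1 on V1 would correspond to a loading of about $0.6954 / 0.782 \approx 0.89$ of y1 on W1. Since $r_k \le 1$, a cross-loading is never larger in absolute value than the corresponding loading.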


From the functions above, the canonical weights show that v1 is mainly related to
total profits and w1 is mainly related to personnel labor productivity. The same
reasoning applies to v2 and w2. It can then be concluded that the correlation
between scale and efficiency is mainly the correlation between total profits and
personnel labor productivity. If further research on the relationship between scale
and efficiency is needed, it is simple and clear to focus on the relationship
between total profits and personnel labor productivity.

5. Discussion

In this paper we use three examples to illustrate a format that could help people
learn multivariate analysis quickly and easily. The theory is kept to a minimum:
the models are introduced only briefly, because too much theory would make the
paper harder to read. After reading the paper, readers should have a general,
though perhaps not exact, idea of the models.

However, this format is not suggested for all beginners. The paper mainly addresses
beginners who have some basic statistical knowledge but do not need theoretical
analysis or research. They are expected to know the basics of probability theory
and mathematics; people with a bachelor's degree in science generally have this
knowledge. Professionals and people who intend to do theoretical research may not
want to use this format to learn the technique.

In this format the model is not introduced in detail; as a result, readers may not
be able to adapt the technique to different practical problems. In future work we
would try to cover more of the model while keeping it easy to understand. We could
also put the format into use and use questionnaires to assess its effect and
shortcomings. Although the paper has some limitations, it can work for the target
group and may prove helpful.


6. References

[1] Beijing Municipal Bureau of Statistics (2003). Beijing Statistical Yearbook. Beijing.

[2] Fang Kaitai (2004). Practical Multivariate Statistical Analysis. Beijing China Normal University Press.

[3] Feng Xuenan, Cui Yujie (2008). Comparative Study and Analysis of Statistical Software Packages. Journal of North China Industry University.

[4] Gao HuiXuan (2001). Applied Multivariate Statistical Analysis. Beijing Peking University Press.

[5] Gao HuiXuan (2000). SAS System • SAS / ETS Software Manual. Beijing Statistics Press.

[6] He Xiaoqun (2003). Multivariate Statistical Analysis. Beijing People's University Press.

[7] Huang Yan, Wu Ping (2006). SAS statistical analysis and application. Machinery Industry Press.

[8] Joseph F. Hair, Jr., Rolph E. Anderson, Ronald L. Tatham and William C. Black (1998). Multivariate Data Analysis, 5th edition. Prentice Hall, Inc.

[9] Jia Junping (2004). Statistics. Beijing People's University Press.

[10] Support.sas (2009). http://www.support.sas.com/publishing/pubcat/chaps/55129.pdf

[11] Wu Xizhi (2005). From Data to the Conclusion Learning Guidance and Exercise. Beijing China Statistics Press.

[12] Xue Wei (2007). Statistical Analysis and The Application of SPSS. Beijing People's University Press.

[13] Yu Xiulin, Ren Xuesong (2007). Multivariate Statistical Analysis. Beijing China Statistics Press.

[14] Wikipedia.org (Oct 2009). Cluster analysis, Types of clustering.


7. Appendix

ⅰ Economic indicators of all the districts in Beijing

Dongcheng District 32173.00 3294121 14880.10 313801.0 1702707 1660120 398566.0

Xicheng District 33856.00 3487500 14539.00 343806.0 1335592 1640614 564176.0

Chongwen District 20848.00 864012.7 13494.10 69658.00 622455.0 639490.0 138745.0

Xuanwu District 23095.00 2013163 13166.00 194970.0 968372.0 697703.0 263537.0

Chaoyang District 27661.00 6841968 13898.00 565625.0 2858851 2986623 851602.0

Fengtai District 20901.00 2310141 12281.90 122042.0 777305.0 1762322 446608.0

Shijingshan District 23939.00 1384952 11904.70 74402.00 323941.0 991338.0 164470.0

Haidian District 29984.90 8928950 14409.10 475348.0 2898447 3542556 954386.0

Mentougou 18495.00 463519.7 11051.30 36561.00 426365.0 245444.0 64376.00

Fangshan 17547.00 1964661 11712.30 89542.00 1101166 618040.7 144272.0

Tongzhou District 14702.00 1044963 11968.60 84397.00 890727.0 463931.0 122300.0

Shunyi District 14899.00 1830882 13203.00 86458.00 850903.2 503774.0 154533.0

Changping District 17595.00 1301430 10373.00 88541.00 812579.0 362498.0 167877.0

Daxing District 15634.00 1037035 10811.70 93497.00 1008939 413926.0 202267.0

Pinggu 13081.00 509968.2 11964.20 45794.00 912776.9 184335.0 59195.00

Huairou District 17284.10 622751.2 12785.00 54036.00 382200.4 200132.0 70325.00

Miyun County 15691.00 683842.0 12023.10 83336.00 726398.2 267492.0 74002.00

Yanqing County 16904.00 374521.6 12852.60 28128.00 255814.0 298766.0 40746.00

ⅱ Industrial scale and efficiency

Oil and gas extraction 2717.77 970.23 4060.70 2578.92 73.25 32.20 353475.0 99.66

Ferrous Metals Mining 61.96 2.31 219.09 66.39 44.06 4.77 23657.00 97.69

Nonferrous Metals Mining 158.71 9.87 331.23 153.03 38.05 6.89 26840.00 95.64

Nonmetal Mining 116.49 3.12 487.71 115.00 39.64 4.24 19202.00 97.06

Timber harvesting 88.70 2.42 320.28 76.16 52.21 3.61 7856.00 97.58

Food processing 1703.30 49.23 1602.56 1675.15 22.48 7.72 61532.00 97.41

Food Manufacturing 812.25 37.34 992.55 762.92 28.08 9.56 66252.00 97.35

Beverage Manufacturing 1292.59 97.29 2136.72 1261.02 37.45 14.09 81130.00 98.38

Tobacco Processing 1572.21 177.15 2259.93 1648.57 66.15 44.57 540853.0 99.34

Textiles 2579.62 59.46 3534.35 2462.08 25.48 6.15 27245.00 96.36

Apparel Manufacturing 656.24 39.87 687.56 620.87 27.27 9.91 39403.00 97.61

Leather Fur Processing 408.51 8.16 351.15 353.77 25.68 6.70 33697.00 97.86

Woodworking 201.44 3.81 326.32 185.41 25.82 5.22 48571.00 97.42

Furniture Manufacturing 72.14 4.57 117.00 73.22 31.56 7.91 42089.00 98.71

Paper products industry 885.08 31.82 1880.16 840.86 27.35 5.99 50133.00 98.68

Printing 311.06 32.95 537.56 295.25 38.24 11.03 61320.00 95.39

Sporting Goods Manufacturing 169.05 8.62 170.16 164.29 25.67 9.64 31201.00 98.25

Oil processing 3984.19 -22.60 3317.95 4048.10 18.76 10.26 183101.0 99.61

System of chemical raw materials 3773.49 69.20 6596.67 3702.15 25.17 5.31 51256.00 98.09

Pharmaceutical Industry 1300.66 118.92 2299.38 1303.76 36.00 10.88 75996.00 95.83

Chemical fiber manufacturing 768.52 15.91 1302.49 730.37 21.82 4.85 58418.00 96.75


rubber products industry 531.49 12.81 793.59 475.52 28.39 7.18 49230.00 97.25

Plastics industry 691.57 36.06 969.88 684.45 26.60 7.66 67878.00 96.10

Non-metallic mineral products 1558.40 48.19 3465.47 1480.46 30.92 5.55 33246.00 96.95

Ferrous metal smelting 4501.74 177.18 8781.46 4471.53 27.69 6.12 65043.00 99.52

Non-ferrous metal smelting 1462.41 53.30 2425.72 1427.34 27.09 6.62 53415.00 98.08

Fabricated metal products 813.77 32.21 1091.61 766.62 25.40 7.11 50839.00 97.74

Ordinary machinery manufacturing 1780.45 52.48 3357.88 1677.11 28.33 5.44 35867.00 97.33

Special Equipment 1236.25 28.10 2107.34 1162.21 26.03 4.96 31677.00 96.76

Transportation equipment 4982.51 242.12 7416.46 4838.52 25.25 7.98 61745.00 98.92

Electrical machinery 3088.08 125.48 3682.56 2891.65 25.48 7.69 76073.00 97.46

Telecommunications equipment 6722.51 369.16 6613.48 6706.67 21.92 9.17 129402.0 99.59

Instrumentation 506.64 25.09 659.81 521.55 27.67 11.5 44722.00 95.69

Power Steam production 3906.31 480.74 16391.02 6308.19 55.18 7.83 162631.0 99.53

Gas production and supply 124.24 -1.34 489.48 185.16 25.06 1.52 29799.00 102.11
