The Application of SAS in
Multivariate Analysis
Author: Meng Xiangli & Peng Sisi
Supervisor: Karl-Erik Westergren
C-level in Statistics, November 2009
School of Economics and Social Sciences
Högskolan Dalarna, Sweden
Table of Contents
Abstract
1. Introduction
2. Background
2.1 Importance of Multivariate Statistics
2.2 Problem raised
2.3 Comparison of software
3. Format suggested
4. Application of the Format
4.1 Hierarchical Clustering Format
4.2 Principal Component Analysis Format
4.3 Canonical Correlation Analysis Format
5. Discussion
6. References
7. Appendix
Abstract
This paper suggests a format for how an instructive text on multivariate data
analysis methods should be designed. The target group is people with basic statistical
knowledge who need to apply multivariate analysis to a specific problem. As
background we discuss the problems of learning multivariate analysis. We then
present and motivate a format that we believe lets people learn the techniques
easily and in a short time. The format is applied to three methods (cluster analysis,
principal component analysis and canonical correlation analysis) to show how it
works. The paper ends with a statistical-pedagogical discussion of the format.
1. Introduction
The data that need to be analyzed are becoming more complicated. Most
commercial and industrial data sets have many variables and a very large number of
observations. Dealing with such multivariate data requires not only theory but also
programming technique, because the amount of calculation is impossible to do by hand.
However, people with only basic statistical knowledge often find it hard to grasp
multivariate techniques because the theories and computer programs seem difficult
and complex. There are many books and papers on multivariate statistics, but it takes a
lot of time to learn and use the techniques. There are two main reasons for this. One
is that many materials do not present the necessary theory and the computer procedures
together. Some materials give no program procedures, while others give procedures
with only brief theory. Readers then do not know how to put the theory into practical
use, or which theory a program relies on. The other reason is that materials on this
subject are often too long. The techniques are introduced from many angles, with many
mathematical formulas, calculations and specialized concepts. This is likely to be too
difficult and time consuming for a reader with an applied approach. People who need
multivariate analysis in the course of busy work therefore cannot afford the time and
energy to learn the techniques from these materials.
So there is a need to develop formats for introducing and explaining multivariate
techniques to people with only basic statistical knowledge. Some books give good
examples of how to lead people to learn statistical methods. We draw on the format
and methods of these materials and suggest improvements for teaching the techniques
of multivariate analysis.
The choice of software is an unavoidable part of the process. Many software
packages can be used for multivariate analysis; we choose SAS because of its
advantages, which are discussed in section 2.3.
This paper suggests a format for teaching beginners the techniques of multivariate
analysis using SAS software. It aims to let readers learn the techniques easily and in
a short time. The paper applies the format to three multivariate analysis methods to
illustrate it. Readers can learn other multivariate techniques in a similar way. We also
hope that textbook writers will consider our suggestions.
2. Background
2.1 Importance of Multivariate statistics
As society and the economy develop, more and more phenomena are related to each
other. As a result, the data that need to be analyzed are becoming more complex,
containing more variables and observations. Traditional statistical methods can be
used in some cases, but they are limited when it comes to multivariate problems. The
theories of multivariate analysis were developed many years ago, but they were not
widely used until powerful computers and corresponding software became available.
Multivariate analysis is now widely used in business and social research, and it is
becoming an important and unavoidable area to master, especially for an applied
statistician.
2.2 Problem raised
However, although multivariate analysis is becoming more important in practice,
many people find the technique hard to grasp. There are many books and papers
introducing multivariate analysis, but people with basic statistical knowledge often
find them abstruse and cannot afford the time needed to study them. One reason is
that some materials do not present the necessary theory and the computer procedures
together. Classical textbooks like 'Multivariate Statistical Analysis' (He Xiaoqun 2003)
and 'Applied Multivariate Statistical Analysis' (Gao Huixuan 2001) contain rich theory
but no program procedures, while some papers and SAS textbooks give procedures with
only brief theory, such as the course book most commonly used in China, 'SAS
statistical analysis and application' (Huang Yan & Wu Ping, 2006). Beginners then do
not understand the theory, or do not know how to put it into practical use.
We also believe that materials on this subject are often too long. Because the
techniques are introduced from many angles, with many mathematical formulas,
calculations and specialized concepts, it takes a long time to read them and to find
out how to apply a technique to a specific problem.
Some books give good examples of how to teach statistical methods, such as
'Statistical Analysis and The Application of SPSS' (Xue Wei 2007) and 'Statistics'
(Jia Junping 2007). These books have the following advantages: (1) they give the
necessary theory in a way that readers can easily understand; (2) they illustrate how
to use software to put the theory into use, in enough detail that people who are not
familiar with the software also know how to operate it; (3) they show exactly how to
interpret the results obtained with the software. With these advantages readers can
easily grasp the techniques introduced. We adopt the strengths of the formats of these
two books and use a similar form. We also extend the format by integrating the use of
the software into the description of the procedure, to make the whole process clearer
and easier to follow.
2.3 Comparison of software
Software is important in multivariate analysis because of the large amount of
calculation. We choose SAS because of its advantages in functions and applications.
SAS is also preferred by companies and industries, so people who can use SAS well
will find it easier to get a job. Other software can also be used for statistical
problems; some comparisons are given in Table 1 (Feng Xuenan, Cui Yujie 2008).
Table 1: Comparison of statistical software
                     SPSS                Eviews             Minitab           SAS
Application fields   Social science      Economic fields    Quality control   Most statistical fields
Specialty            Cross-sectional &   Time-series data   Quality data      Large data sets &
                     time-series data                                         multivariate data
3. Format suggested
This paper suggests a format for beginners who have some basic statistical
knowledge and need to apply multivariate analysis in their work. The aim is that the
reader will grasp the basic techniques of a specific method using SAS easily and in a
short time. Readers can then go on to further studies using texts written in a similar
format. We believe that the suggested format is worth considering when writing
manuals and textbooks. The paper uses three methods of multivariate analysis
(hierarchical clustering, principal component analysis and canonical correlation
analysis) to illustrate the format.
The structure of the suggested format is as follows:
Ⅰ. The idea. First a brief overview shows what the method is used for and how it
works, so that readers know which method they are learning and what they will be
able to do with it.
Ⅱ. Necessary theory and steps for the procedure. The necessary theory, including
the steps of the procedure, is introduced briefly and is tied to the SAS code along the
way. Complex mathematical calculation and a large number of specialized concepts are
avoided. The modeling theory is given little attention because it is rather theoretical
and takes much time to understand. We believe that readers can then quickly
understand how to put the theory into practical use.
Ⅲ. Example. A practical example is given to show how the method actually works.
For the program, the paper explains the meaning and effect of each SAS command to
show how the program is constructed.
Ⅳ. SAS code and program outcome. These are presented with reference to the
corresponding steps of the procedure in the theory part, and the explanation is based
on the theory part.
Ⅴ. Interpretation and decision. Finally, the interpretation of the outcome and the
decisions based on it are exemplified. We believe that in this way readers will easily
understand the outcome and know how to make decisions themselves.
4. Application of the Format
4.1 The format applied to hierarchical clustering
The idea.
Hierarchical clustering (He Xiaoqun 2003) is a method that groups similar cases into
the same category and thereby partitions the sample space. We can then find the
features of the different categories and make corresponding decisions.
“Observations are sorted into subsets (called clusters) so that observations in the same
cluster are similar in some sense, and observations are significant different between
clusters” (Wikipedia, Oct 2009). The method is widely used. In commerce it is used,
for example, to identify different types of customers so that different sales plans can
be made for them; in biology it can be used to classify genes, animals and plants.
Necessary theory and steps for the procedure.
There are two kinds of hierarchical clustering: agglomerative and divisive. As the
names say, agglomerative clustering starts from single cases and successively joins
them until all cases are in one category, while divisive clustering starts from the whole
sample and splits it into finer and finer categories. Both build a hierarchy in the
process. The idea of the agglomerative approach can be seen clearly from its steps;
the divisive approach works in the reverse direction.
1. Pretreatment of data: choose the appropriate variables for the cluster analysis.
Many variables may be recorded for a case, but only the appropriate ones should enter
the analysis. This cannot be done by software; the variables have to be chosen by two
rules: they should reflect the characteristics that distinguish the clusters, and they
should not be highly correlated with each other.
2. Treatment of data.
(1) Calculate the distance between each pair of the n cases.
(2) Construct n clusters, each containing only one case.
(3) Join the two clusters with the shortest distance into a new cluster, and calculate
the distance between the new cluster and the other clusters.
(4) Repeat step (3) until all the cases are joined into one big cluster.
In SAS the command 'proc cluster' performs this step for us, but we need to tell the
software which kind of distance to use.
3. Make decisions upon the results.
(1) Display the joining process in a dendrogram (a tree-like graph).
(2) Decide the number of clusters.
This step is done by asking the program for the output that helps us to decide the
number of clusters.
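The agglomerative steps above can be traced by hand on a very small case. The worked example below is only a sketch under assumptions that are not in the original text: four hypothetical one-dimensional cases with values 0, 1, 5 and 7, Euclidean distance between cases, and the shortest distance between members as the distance between clusters (single linkage).

```latex
% Hypothetical cases (not from the paper's data): x_1 = 0, x_2 = 1, x_3 = 5, x_4 = 7.
% Euclidean distance between cases i and j measured on m variables:
\[
d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}
\]
% Step (1): all pairwise distances (m = 1 here, so d_{ij} = |x_i - x_j|).
\[
d_{12} = 1,\quad d_{34} = 2,\quad d_{23} = 4,\quad d_{13} = 5,\quad d_{24} = 6,\quad d_{14} = 7
\]
% Steps (2)-(3): start from four single-case clusters; the smallest distance is d_{12} = 1.
\[
\{x_1\},\{x_2\},\{x_3\},\{x_4\} \;\longrightarrow\; \{x_1,x_2\},\{x_3\},\{x_4\}
\qquad \text{(joined at distance 1)}
\]
% Updated single-linkage distances to the new cluster.
\[
D(\{x_1,x_2\},\{x_3\}) = \min(5,4) = 4,\qquad
D(\{x_1,x_2\},\{x_4\}) = \min(7,6) = 6,\qquad
d_{34} = 2
\]
% Step (4): repeat. The smallest distance is now d_{34} = 2, then the last two clusters join.
\[
\{x_1,x_2\},\{x_3,x_4\} \quad \text{(joined at distance 2)},\qquad
D(\{x_1,x_2\},\{x_3,x_4\}) = \min(5,7,4,6) = 4 \quad \text{(one cluster at distance 4)}
\]
```

The joining distances 1, 2 and 4 are the heights at which the branches would meet in a dendrogram of these four cases.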
It can be seen that there are two important steps in cluster analysis: defining the
distance and determining the number of clusters. The definition of distance
determines how the similarity of two elements is calculated. It also influences the
shape of the clusters, as some elements may be close to one another according to one
definition of distance and farther apart according to another. There are several ways
to define the distance, each with its own advantages and disadvantages, and the
clustering methods are named after the distance they use. The corresponding
commands are given in the example.
(i) The definition of distance to use
Complete and single linkage: the distance between two groups is the furthest and the
shortest distance, respectively, between elements of the two groups. Both methods
are easy to calculate but both are biased. Complete linkage is strongly affected by
extreme values, and single linkage has the notorious chaining tendency: two groups
can be joined because one element of each is close to the other, even though the rest
of their elements are far apart.
Average linkage: the distance is the average distance between pairs of observations,
one from each group. It is less influenced by extreme values. Clusters with small
variances tend to be joined, so the method is slightly biased toward forming clusters
with the same variance.
Centroid linkage: the distance between the means of the groups. It is the most
effective way to reduce the effect of extreme values, but the mean of a joined group
is pulled toward the mean of the larger original group.
Ward's method: the distance is defined through the error sum of squares; two groups
are joined if their union gives the smallest increase in the error sum of squares.
Under the assumptions of a multivariate normal mixture, spherical covariance matrices
and equal sampling probabilities, Ward's method joins clusters to maximize the
likelihood at each level of the hierarchy. It is widely used because of its good
clustering effect, especially when the differences between groups are not large. But it
is biased toward producing groups of equal size and is also sensitive to extreme values.
There are other ways to calculate the distance, but the ones above are the most
commonly used. In practice we should choose a method according to the structure of
the sample space and the characteristics of each method. For example, when there are
many extreme values, complete linkage and Ward's method are better avoided. We can
also try several methods, compare the results and then choose an appropriate one.
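To make the difference between the definitions concrete, the calculation below applies each of them to the same pair of groups. It is a sketch with hypothetical one-dimensional values that are not from the paper's data: group A contains the values 0 and 1, group B contains the values 5 and 7, and the symbols $D$, $\bar{x}$, $n$ and ESS are notation introduced here for illustration.

```latex
% Hypothetical groups: A = {0, 1}, B = {5, 7}.
% The four between-group distances are |0-5| = 5, |0-7| = 7, |1-5| = 4, |1-7| = 6.
\[
D_{\text{single}}(A,B) = \min(5,7,4,6) = 4, \qquad
D_{\text{complete}}(A,B) = \max(5,7,4,6) = 7
\]
\[
D_{\text{average}}(A,B) = \frac{5+7+4+6}{4} = 5.5
\]
% Centroid linkage: distance between the group means 0.5 and 6.
\[
D_{\text{centroid}}(A,B) = |\bar{x}_A - \bar{x}_B| = |0.5 - 6| = 5.5
\]
% Ward's method: increase in the error sum of squares (ESS) caused by joining A and B.
% ESS(A) = 0.5, ESS(B) = 2; the joined group has mean 3.25 and ESS = 32.75.
\[
\Delta \mathrm{ESS} = \mathrm{ESS}(A \cup B) - \mathrm{ESS}(A) - \mathrm{ESS}(B)
= 32.75 - 0.5 - 2 = 30.25
= \frac{n_A n_B}{n_A + n_B}\,(\bar{x}_A - \bar{x}_B)^2
\]
```

The same two groups are thus 4, 7 or 5.5 units apart depending on the definition, which is why the joining order, and with it the dendrogram, can change with the method.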
(ii) What number of clusters to use?
The second important step is to determine the number of clusters. In step 3 we use
the command 'proc tree' to ask for the dendrogram, which shows the joining process
directly and intuitively. There are also several statistics that help us to determine the
number of clusters:
1. Fixed distance. A threshold distance is chosen in the dendrogram; when the
distances between all remaining groups are larger than the threshold, the
agglomeration is stopped. This can be read directly from the graph.
2. Cubic clustering criterion (CCC). It tests the hypothesis that the data have been
sampled from a uniform distribution. We look for a peak at which CCC is greater
than 3; the corresponding number of clusters can then be chosen.
3. Pseudo F statistic (PSF). It measures the separation among the clusters at the
current level of the hierarchy. The number of clusters can be chosen where PSF has a
peak.
4. Pseudo t² statistic (PST2). It is similar to the pseudo F statistic, and the number
of clusters is also chosen at a peak value.
In the program, CCC is requested with the option 'ccc', and PSF and PST2 are both
requested with the option 'pseudo', so besides the dendrogram we need to ask for
'ccc' and 'pseudo'. The chosen number is then checked to see whether the result is fit
for practical use and easy to explain, or whether some adjustment is needed.
Example.
In geographical research, cities can be classified according to their climate. Data on
the meteorological conditions in 13 cities of the world (Wu Xizhi 2005) are used here
to classify the cities. The climate variables are: altitude (meters), average
temperature over the year, average temperature of the coldest month, average
temperature of the warmest month, average precipitation (mm/m), average
precipitation in the wettest month, average precipitation in the driest month, average
precipitation in the clearest month (%) and average precipitation in the cloudiest
month (%).
SAS code and the program outcome.
The program is as follows:
data climate;
   length v01 $ 13;   /* city names can be longer than the default 8 characters */
   input v01 $ v02-v10;
   cards;
Beijing 37.0 11.7 -4.7 26.0 632.0 25417.0 3.1 47150.0 20271.0
Tokyo 6.0 13.8 3.0 26.4 1625.0 2155.0 220.0 41.0 30103.0
Berlin 51.0 8.8 -.5 17.8 592.0 26816.0 11749.0 16923.0 30621.0
Singapore 17.0 27.2 25.6 27.8 24.0 258.0 169.0 581.0 24838.0
San_Diego 520.0 14.7 8.6 20.6 363.0 29373.0 7.0 37304.0 20607.0
London 5.0 10.6 4.3 17.7 593.0 21002.0 19725.0 22160.0 27039.0
San_Francisco 16.0 13.7 10.0 16.5 517.0 100.0 4.0 11871.0 22647.0
Rome 51.0 16.2 7.5 25.6 760.0 115.0 37506.0 37487.0 21520.0
Washington 22.0 13.8 3.1 25.4 1050.0 120.0 70.0 19268.0 25934.0
Montreal 57.0 6.5 -9.2 21.3 1048.0 102.0 25659.0 19937.0 27334.0
Paris 15.0 11.5 3.5 19.5 619.0 23590.0 12844.0 18111.0 26634.0
Honolulu 4.0 22.2 22.2 25.8 610.0 110.0 37457.0 17411.0 22678.0
Sydney 41.0 11.7 11.7 22.3 1181.0 140.0 70.0 14824.0 59.0
;
run;
/* Ward's method on standardized variables; ccc and pseudo request extra statistics */
proc cluster data=climate method=ward standard outtree=tree1 ccc pseudo;
   var v02-v10;       /* names must match those in the input statement */
run;
proc print data=tree1;
run;
/* draw the dendrogram from the tree data set saved by proc cluster */
proc tree data=tree1;
run;
The program starts with the data step. The 'data' statement names the data set.
The 'input' statement tells SAS which variables to read and what to call them; the '$'
after v01 marks it as a character variable. 'cards' marks the start of the data lines. In
the second step, 'proc cluster' is the clustering procedure. 'method=ward' selects
Ward's method; the other methods are requested with 'method=ave', 'cen', 'com' or
'sin', which stand for average, centroid, complete and single linkage. 'standard'
standardizes the variables, which is needed here because their magnitudes differ a
lot; it can be omitted when the magnitudes are similar. 'outtree=tree1' saves the
clustering history in the data set tree1, which 'proc tree' uses to draw the
dendrogram. 'ccc' and 'pseudo' ask for the cubic clustering criterion and the pseudo F
and pseudo t² statistics in the output.
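Trying a second linkage method only requires changing one option. The sketch below assumes that the data set climate from the data step above has been created with the variables v02 to v10; the data set names climate_std and tree_avg are hypothetical names chosen for illustration. It standardizes the variables explicitly with 'proc standard', which does the same job as the 'standard' option, and then repeats the clustering with average linkage.

```sas
/* Sketch only: assumes the data set climate (v01, v02-v10) already exists. */

/* Standardize each variable to mean 0 and standard deviation 1.          */
/* climate_std is a hypothetical name for the standardized data set.      */
proc standard data=climate mean=0 std=1 out=climate_std;
   var v02-v10;
run;

/* Cluster the standardized data with average linkage instead of Ward.    */
/* The 'standard' option is not needed because the data are already       */
/* standardized. tree_avg is a hypothetical name for the tree data set.   */
proc cluster data=climate_std method=average outtree=tree_avg ccc pseudo;
   var v02-v10;
run;

/* Dendrogram for the average-linkage solution. */
proc tree data=tree_avg;
run;
```

Comparing this dendrogram with the one from Ward's method shows whether the grouping of the cities depends strongly on the choice of distance.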
The output consists mainly of the cubic clustering criterion and the pseudo F
statistic together with the dendrogram. It gives the information needed to decide on
the clustering.
Table2: The cubic clustering criterion and Pseudo F test statistics
NCL --Clusters Joined--- FREQ SPRSQ RSQ ERSQ CCC PSF PST2
12 OB6 OB11 2 0.0029 0.997 . . 31.8 .
11 OB3 CL12 3 0.0064 0.991 . . 21.5 2.2
10 OB2 OB9 2 0.0204 0.970 . . 10.9 .
9 OB8 OB12 2 0.0277 0.943 . . 8.2 .
8 CL11 OB7 4 0.0455 0.897 . . 6.2 9.9
7 CL10 OB10 3 0.0561 0.841 . . 5.3 2.7
6 OB1 OB5 2 0.0808 0.760 . . 4.4 .
5 OB4 CL9 3 0.1047 0.656 . . 3.8 3.8
4 CL7 OB13 4 0.1071 0.548 . . 3.6 2.8
3 CL6 CL8 6 0.1336 0.415 . . 3.5 3.9
2 CL3 CL4 10 0.1747 0.240 0.356 -1.4 3.5 3.1
1 CL2 CL5 13 0.2401 0.000 0.00 0.00 . 3.5
The table shows that the cubic clustering criterion (CCC) and the pseudo F
statistic (PSF) do not give a usable peak: PSF is largest when the number of clusters
is 12, and CCC is reported for only two rows. The pseudo t² statistic (PST2) has local
maxima at 3 and 8 clusters, but 8 clusters are too many for practical use, so we look
for the number of clusters in the neighborhood of 3. In the next step we use the
dendrogram to make the decision:
Figure: dendrogram of the cluster analysis
Interpretation and decision.
From the dendrogram we can see that three or four clusters are appropriate; the
choice depends on the practical need. With four clusters the groups are as follows.
Beijing and San Diego form one class, which we label subtropical climate; the
temperature in Beijing has been rising recently, as it has in San Diego. London, Paris
and Berlin have the typical European climate, which coincides with their geographical
position. San Francisco is close to the coast and its latitude does not differ greatly, so
it is placed in the same category, which we label warm temperate maritime climate.
These first two classes can be joined if fewer clusters are needed, because the
distance between them is shorter than the distance to the other classes, which means
they are more similar. Tokyo, Washington, Montreal and Sydney are placed in the same
class, which we label temperate continental climate. Singapore, Rome and Honolulu
form the last class, which we label Mediterranean climate. The classification is largely
in line with common sense, which suggests that the clustering is effective.
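Once the number of clusters has been decided, the membership of each city can be written to a data set instead of being read off the dendrogram. The sketch below assumes that the tree data set tree1 from the program above exists; the data set name members is a hypothetical name chosen for illustration. Because the clustering was run without an identifying variable, the cities appear under the names OB1 to OB13, in the order of the data lines.

```sas
/* Sketch only: assumes tree1 was created by proc cluster with outtree=tree1. */

/* Cut the tree at four clusters and save the assignment.                 */
/* members is a hypothetical name; noprint suppresses the tree display.   */
proc tree data=tree1 nclusters=4 out=members noprint;
run;

/* List the observations cluster by cluster. */
proc sort data=members;
   by cluster;
run;

proc print data=members;
   var _name_ cluster;
run;
```

Changing 'nclusters=4' to 'nclusters=3' gives the three-cluster solution, in which the first two classes described above are joined.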
4.2 The format applied to principal component analysis
The idea.
“Principal component analysis (PCA) is appropriate when you have obtained measures
on a large number of observed variables and wish to develop a smaller number of
artificial variables (called principal components) that will account for most of the
variance in the observed variables. The principal components may then be used as
predictor or criterion variables in subsequent analyses.” (support.sas 2009)
Principal component analysis (He Xiaoqun, 2003) is a variable reduction procedure.
It is very useful for anyone who needs to reduce a number of observed variables
to a smaller number of principal components. It works because the variables often
measure the same underlying structure and are therefore likely to be correlated with
each other.
Necessary theory and steps for the procedure.
The linear relationship between the factors (principal components) and the original
indicators is modeled theoretically as follows:
\[
\begin{aligned}
y_1 &= u_{11} x_1 + u_{12} x_2 + \dots + u_{1p} x_p \\
y_2 &= u_{21} x_1 + u_{22} x_2 + \dots + u_{2p} x_p \\
    &\;\;\vdots \\
y_p &= u_{p1} x_1 + u_{p2} x_2 + \dots + u_{pp} x_p
\end{aligned}
\]
or, in matrix form, $y = u'x$,
where $x_i$ is an original variable and $y_i$ is a principal component
($i = 1, 2, \dots, p$). The main idea of PCA is to let the first few $y_i$ (not all of
them) explain most of the variance of $x$, that is, the first few $y_i$ contain most of
the information contained in $x$. Because not all the $y_i$ are used, the dimension is
reduced. The aim of PCA is therefore to find the coefficients of $x$. The steps of PCA
are as follows:
1. The pretreatment of data.
In general PCA does not require special assumptions about the variables. But there
is one requirement: the correlation coefficients between the variables should be
greater than 0.3 (He Xiaoqun, 2003), because if the variables are not strongly related
it will not be effective to reduce their dimension.
2. Treatment of data.
This step finds the coefficients of $x$ by mathematical methods. For the correlation
matrix of $x$, the eigenvalues $\lambda_i$ are calculated; $\lambda_i$ represents the
amount of variance that is accounted for by $y_i$. The $\lambda_i$ are ranked from
largest to smallest, corresponding to $y_1, \dots, y_p$. The eigenvectors $u_i$
belonging to the eigenvalues $\lambda_i$ form the columns of the coefficient matrix
$u = (u_1, u_2, \dots, u_p)$. The whole process can be run in SAS with the command
'proc factor', from which the coefficients can be obtained.
3. Making decisions upon results
With the eigenvalues and eigenvectors obtained in step 2,
$v_i = \lambda_i / \sum_{j=1}^{p} \lambda_j$ denotes the proportion of the variance of
$x$ that is explained by $y_i$, and $\sum_{k=1}^{i} v_k$ denotes the cumulative
proportion of variance that is explained by the first $i$ components. The first few
principal components are chosen so that the cumulative proportion is more than 0.80
(He Xiaoqun 2003, chap. 5). With the SAS output we can then decide how many
components to keep.
Example.
A specific example on economic indicators for all the districts of Beijing (Beijing
Statistical Yearbook 2006) is now presented to illustrate the process of principal
component analysis. Six economic indicators are measured: average wage of workers
(yuan per worker), GDP of the area (ten thousand yuan), per capita disposable income
(yuan), local financial revenue (ten thousand yuan), investment in fixed assets (ten
thousand yuan) and retail sales of social consumer goods (ten thousand yuan). PCA is
used to reduce the dimension of the variables for the convenience of further
economic research.
In principal component analysis, the number of components extracted is equal to
the number of variables being analyzed. Because six variables are analyzed in the
present study, six components will be extracted. The first component can be expected
to account for a fairly large amount of the total variance. Each succeeding component
will account for progressively smaller amounts of variance. Although six components
are extracted in this way, only the first few components will be important enough to
be retained for interpretation.
SAS code and the program outcome.
data economic;
input aww gdp cdi lfr ifa rss;
cards;
32173.00 3294121 14880.10 313801.0 1702707 1660120
33856.00 3487500 14539.00 343806.0 1335592 1640614
……
15691.00 683842.0 12023.10 83336.00 726398.2 267492.0
16904.00 374521.6 12852.60 28128.00 255814.0 298766.0
;
run;
proc factor data=economic simple corr;
run;
'proc factor' is the command used for PCA here. 'simple' asks for the means and
standard deviations of the original variables, and 'corr' asks for their correlation
matrix.
When correlations among several variables are computed, they are typically
summarized in a correlation matrix, in which the entry in row i and column j is the
correlation between variable i and variable j.
Table3: Correlations between variables
a b c d e f
a 1.00000 0.71054 0.75292 0.80475 0.62057 0.77472
b 0.71054 1.00000 0.61457 0.93428 0.94433 0.96403
c 0.75292 0.61457 1.00000 0.69087 0.56189 0.64015
d 0.80475 0.93428 0.69087 1.00000 0.93908 0.91717
e 0.62057 0.94433 0.56189 0.93908 1.00000 0.88326
f 0.77472 0.96403 0.64015 0.91717 0.88326 1.00000
All correlations are greater than 0.3, so PCA can be used.
Table4: Eigenvalues and Proportions
    Eigenvalue   Difference   Proportion   Cumulative
1   4.94666570   4.29921045   0.8244       0.8244
2   0.64745526   0.39770550   0.1079       0.9324
3   0.24974976   0.13679681   0.0416       0.9740
4   0.11295295   0.08697200   0.0188       0.9928
5   0.02598094   0.00878555   0.0043       0.9971
6   0.01719540                0.0029       1.0000
The first row shows that the first eigenvalue accounts for a proportion of 0.8244
of the total variance, which means that the first component extracts most of the
variance and the remaining components account for only small proportions. The first
component can therefore be used to represent most of the information in the six
original variables.
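The proportions in Table 4 can be checked by hand from the eigenvalues alone. The calculation below uses only the values printed in the table; $\lambda_j$ denotes the $j$-th eigenvalue and $v_j$ its proportion of the total variance. The eigenvalues sum to 6, the number of variables, which is what is expected when the analysis is based on the correlation matrix.

```latex
\[
\sum_{j=1}^{6} \lambda_j
= 4.94666570 + 0.64745526 + 0.24974976 + 0.11295295 + 0.02598094 + 0.01719540
\approx 6
\]
\[
v_1 = \frac{4.94666570}{6} \approx 0.8244, \qquad
v_2 = \frac{0.64745526}{6} \approx 0.1079
\]
\[
v_1 + v_2 = \frac{4.94666570 + 0.64745526}{6} = \frac{5.59412096}{6} \approx 0.9324
\]
```

The first component alone already passes the 0.80 threshold, which is why only one component is kept in this example.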
The factor loading matrix is given below. We keep only the factors that we need,
which in this case is one:
Table5: Factor loading matrix
Factor1
a 0.85064
b 0.95778
c 0.76852
d 0.97603
e 0.91900
f 0.95808
The matrix (a vector here, because only one component is selected) shows the
loading of the principal component on each original variable. The sum of the squared
loadings equals the eigenvalue of the component.
For each original variable, the proportion of variance that is extracted by the
principal component is given below:
Table6: Final Communality Estimates: Total = 4.946666
a b c d e f
0.72358523 0.91735171 0.59062243 0.95263854 0.84455722 0.91791057
All the original variables except c have a high proportion, which means that most of
their information is extracted by the component.
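The link between Tables 4, 5 and 6 can be verified with a short calculation. It uses only the loadings printed in Table 5; the symbol $l_j$ for the loading of variable $j$ is notation introduced here. Each communality in Table 6 is the square of the corresponding loading, and the squared loadings, not the loadings themselves, add up to the first eigenvalue.

```latex
% Squared loadings reproduce the communalities of Table 6 (rounded to four decimals).
\[
0.85064^2 \approx 0.7236,\quad
0.95778^2 \approx 0.9173,\quad
0.76852^2 \approx 0.5906,\quad
0.97603^2 \approx 0.9526,\quad
0.91900^2 \approx 0.8446,\quad
0.95808^2 \approx 0.9179
\]
% Their sum is the first eigenvalue in Table 4 and the total in Table 6.
\[
\sum_{j=1}^{6} l_j^2 \approx 4.9467 = \lambda_1
\]
% The plain sum of the loadings is a different number.
\[
\sum_{j=1}^{6} l_j = 0.85064 + 0.95778 + 0.76852 + 0.97603 + 0.91900 + 0.95808 = 5.43005
\]
```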
Interpretation and decision.
From the outcome above we can see that the first principal component stands for
most of the information contained in the six original variables. Because the
component also has high loadings on all six variables, it can be considered a
comprehensive economic factor and can be seen as a measure of the economic
development of the area. In further studies we can therefore use the first principal
component as a representative of the six original variables.
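To use the first component in further studies, its value has to be computed for every district. One way to do this, sketched here as an alternative to the 'proc factor' call used above and not as the authors' own program, is 'proc princomp', which can write the component scores to a data set. The sketch assumes that the data set economic has been created with the complete data; the data set name pcscores is a hypothetical name chosen for illustration.

```sas
/* Sketch only: assumes the data set economic holds the complete data.   */

/* Principal components of the correlation matrix (the default).         */
/* n=1 keeps only the first component; out= saves the scores.            */
/* pcscores is a hypothetical name for the output data set.              */
proc princomp data=economic out=pcscores n=1;
   var aww gdp cdi lfr ifa rss;
run;

/* The score of each district on the first component is stored in Prin1. */
proc print data=pcscores;
   var Prin1;
run;
```

Note that 'proc princomp' prints eigenvectors of unit length, whereas 'proc factor' prints loadings, which are the eigenvector entries multiplied by the square root of the eigenvalue, so the two outputs differ by that scale factor.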
4.3 The format applied to canonical correlation analysis
The idea.
Canonical correlation analysis is a widely used technique in multivariate statistical
analysis. It has been used in economics, meteorology and many modern information
processing fields such as communication theory and statistical signal processing.
Canonical correlation analysis is a multivariate statistical model that facilitates the
study of interrelationships between a set of multiple dependent variables and a set of
multiple independent variables. In the analysis we extract canonical variables from
each set and then explain the relationship between the two sets of variables through
the first few pairs of canonical variables, which makes the relationship between the
sets less complex to analyze (Joseph F. Hair, Jr., Rolph E. 1998).
Necessary theory and steps for the procedure.
Canonical correlation analysis (CCA) can be seen as an extension of principal
component analysis (PCA), and its main idea builds on PCA. From each group of
variables a component (a linear combination of the variables) is extracted, and the
coefficients are chosen to maximize the correlation coefficient between the
components of the two groups. Components in the same group are uncorrelated. We
can then use the correlation between components to analyze the correlation between
the two groups of variables, which makes it less complex to quantify the strength of
the relationship. The extracted components appear in pairs, and one pair of
corresponding components is called a pair of canonical variables. The mathematical
formulation of CCA is as follows. Let $x_1, x_2, \dots, x_p$ and
$y_1, y_2, \dots, y_q$ be two groups of variables. The canonical variables, denoted
$V$ and $W$, are given by
\[
V = a_1 x_1 + \dots + a_p x_p, \qquad W = b_1 y_1 + \dots + b_q y_q .
\]
The purpose of CCA is to find the coefficients $a_1, \dots, a_p$ and
$b_1, \dots, b_q$ that maximize the correlation coefficient of $V$ and $W$. In SAS
the whole procedure can be done by 'proc cancorr', which also gives the results and
the corresponding tests that we need.
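The maximization can be written compactly in matrix notation. The formulation below is the standard textbook one and is given only as a sketch; the vectors $a$ and $b$ and the covariance matrices $\Sigma_{xx}$, $\Sigma_{yy}$ and $\Sigma_{xy}$ are notation assumed here and do not appear in the original text.

```latex
% Assumed notation: a = (a_1,...,a_p)', b = (b_1,...,b_q)';
% \Sigma_{xx}, \Sigma_{yy}: covariance matrices within each group;
% \Sigma_{xy}: covariance matrix between the two groups.
\[
\rho(V, W) = \frac{a' \Sigma_{xy}\, b}{\sqrt{a' \Sigma_{xx}\, a}\;\sqrt{b' \Sigma_{yy}\, b}}
\]
% First pair of canonical variables: maximize the correlation under unit variances.
\[
\max_{a,\,b}\; a' \Sigma_{xy}\, b
\quad \text{subject to} \quad
a' \Sigma_{xx}\, a = 1, \qquad b' \Sigma_{yy}\, b = 1
\]
% Later pairs solve the same problem with the extra condition that they are
% uncorrelated with all earlier pairs; at most \min(p, q) pairs can be extracted.
```

With four variables in each group, as in the example of this section, at most four pairs of canonical variables can be extracted.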
The steps of CCA are given below:
1. Pretreatment of data
(1) Specify the objective of CCA, that is, define the two groups of variables.
(2) Design the analysis, that is, collect the sample data. A small sample can lead to
biased and unstable results, so at least 10 observations are needed for each variable
(Yu Xiulin, Ren Xuesong, 2007, chap 10).
2. Treatment of data
(1) Check the assumptions underlying CCA. First, the correlation coefficient
between any two variables is based on a linear relationship. If a relationship is not
linear, some variables have to be transformed. Second, the canonical correlation is
the linear relationship between the canonical variates.
(2) Estimate the canonical model and assess the overall model fit. There are three
main measures: (a) the level of significance of the canonical function, often tested at
the 0.05 level; (b) the magnitude of the canonical correlation; (c) the redundancy
measure of shared variance. The redundancy measure is obtained by multiplying the
shared variance of the variate by the squared canonical correlation. It gives the
amount of shared variance that can be explained by each canonical function.
In the SAS output, the results needed for this step appear under the titles that begin
'Test of H0: The canonical correlations in', 'Eigenvalues of Inv(E)*H' and
'Standardized Canonical Coefficients for the VAR/WITH Variables'.
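The redundancy measure described above can be written as a two-line formula. The sketch below follows the verbal definition in the text; the symbols $l_{ij}$, $m$ and $R_{c,i}$ are notation assumed here, and the numbers in the last line are hypothetical values chosen only to show the arithmetic, not results from the paper's data.

```latex
% Assumed notation: l_{ij} = canonical loading of variable j on canonical variate i,
% m = number of variables in the set, R_{c,i} = i-th canonical correlation.
\[
\text{shared variance}_i = \frac{1}{m} \sum_{j=1}^{m} l_{ij}^2
\]
\[
\text{redundancy}_i = \text{shared variance}_i \times R_{c,i}^2
\]
% Hypothetical numbers: shared variance 0.60 and canonical correlation 0.80.
\[
\text{redundancy} = 0.60 \times 0.80^2 = 0.60 \times 0.64 = 0.384
\]
```

In this hypothetical case the canonical variate of the other set explains about 38 percent of the variance of the variables in the set, even though the canonical correlation itself is 0.80.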
3. Make decisions upon results.
(1) Interpret the canonical variates. This step mainly focuses on determining the
relative importance of each original variable to the canonical variates. There are
two methods: (a) canonical loadings, which measure the linear correlation between the
original variables and the canonical variate of the same set; (b) canonical
cross-loadings, which directly measure the linear correlation between an original
variable and the canonical variate of the other set. SAS reports both under the
heading "Canonical Structure", which helps us to understand the structure of the
canonical variates.
(2) Validation and diagnosis. This step checks that the results are not specific
only to the sample and can be generalized to the population. Generally there are two
methods: (a) create two subsamples of the data, perform the analysis on each
subsample separately and compare the two results; (b) remove one variable and
examine the sensitivity of the results.
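The split-sample check in 3(2)(a) can be sketched as follows. This is a minimal illustration in Python rather than SAS, and it is not the authors' procedure: the function names `split_half` and `max_abs_difference_up_to_sign`, the fixed seed and the example loading values are all assumptions made for the example. The comparison ignores an overall sign flip, since a canonical variate is only determined up to its sign.

```python
import random


def split_half(n_obs, seed=0):
    """Randomly split observation indices 0..n_obs-1 into two subsamples.

    The seed is fixed only so that the illustration is reproducible.
    """
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    half = n_obs // 2
    return sorted(indices[:half]), sorted(indices[half:])


def max_abs_difference_up_to_sign(loadings_a, loadings_b):
    """Largest absolute gap between two loading vectors.

    Both orientations of the second vector are tried, because reversing the
    sign of a canonical variate does not change the analysis.
    """
    same = max(abs(a - b) for a, b in zip(loadings_a, loadings_b))
    flipped = max(abs(a + b) for a, b in zip(loadings_a, loadings_b))
    return min(same, flipped)


# Split a sample of 36 observations into two halves of 18.
first_half, second_half = split_half(36)
print(len(first_half), len(second_half))

# Hypothetical loadings from the two subsamples (not results from the paper).
gap = max_abs_difference_up_to_sign([0.70, 0.35, 0.23], [-0.66, -0.41, -0.20])
print(round(gap, 2))
```

Each half would then be analysed separately, and a small gap between the two sets of loadings suggests that the canonical structure is stable.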
Example.
We use the example of industrial scale and efficiency (Beijing Statistical Yearbook
2006) to illustrate how to use SAS in canonical correlation analysis.
The purpose is to find the relationship between industrial scale and efficiency. The
variables on industrial scale (x1 to x4) are: total industrial output value, total
profits, total assets and sales revenue. The variables on efficiency (y1 to y4) are:
value added rate, total assets contribution rate, personnel labor productivity and
sales rate of products. First the two sets of variables are defined. The sample covers
36 kinds of industries with 4 variables in each group, that is, 9 observations per
variable, so it is approximately enough for CCA although slightly below the 10-to-1
guideline. It is assumed that the correlations between the canonical variates are
linear. The complete data are given in the appendix.
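The sample-size arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `observations_per_variable` and the threshold constant are assumptions for illustration, and the second call shows how the ratio changes if all eight variables are counted together instead of the four in one group.

```python
GUIDELINE = 10  # observations per variable, as quoted in step 1(2)


def observations_per_variable(n_obs, n_vars):
    """Ratio of sample size to number of variables."""
    return n_obs / n_vars


per_group = observations_per_variable(36, 4)   # counting one group of 4
per_total = observations_per_variable(36, 8)   # counting all 8 variables
print(per_group, per_group >= GUIDELINE)  # 9.0 False
print(per_total, per_total >= GUIDELINE)  # 4.5 False
```

Under either way of counting the sample falls short of the guideline, which is why the results are best read with some caution.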
SAS code and the program outcome.
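/* Read the industry name and the eight variables.                    */
/* The & modifier lets the name contain single blanks, so at least    */
/* two blanks must separate the name from the first number.           */
data differentindustry;
input industry & $40. x1-x4 y1-y4;
cards;
coal mining  1152.99 28.71 3587.78 1144.27 47.60 4.92 19175.00 99.34
………
Tap water production and supply  213.33 5.68 1104.59 201.44 49.20 2.21 50803.00 97.73
;
run;
/* x1-x4: industrial scale (VAR set); y1-y4: efficiency (WITH set).   */
proc cancorr data=differentindustry;
var x1-x4;
with y1-y4;
run;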
'proc cancorr' is the procedure for canonical correlation analysis; 'var' and 'with'
give the two groups of variables. The test of the linear relationship in canonical
correlation is:
Table 7: Test of H0: The canonical correlations
   Eigenvalue  Difference  Proportion  Cumulative  LRT        Approx. F  Num DF  Den DF  Pr>F
1  1.5709      0.8145      0.5618      0.5618      0.1492856  4.82       16      89.234  <.0001
2  0.7564      0.321       0.2705      0.8323      0.383797   3.92       9       73.163  0.0004
3  0.4353      0.4018      0.1557      0.988       0.674084   3.38       4       62      0.0145
4  0.0336                  0.0120      1.0000      0.9675388  1.07       1       32      0.3079
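It can be seen that the tests for the first three rows are significant at the 0.05
level (p-values below 0.0001, 0.0004 and 0.0145), while the fourth is not (0.3079).
The first three pairs of canonical variables are therefore significantly correlated,
which supports the assumption of a linear relationship.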
For step 2(2) the SAS program gives the redundancy measure of shared
variance. The table above shows that the eigenvalues of the first two pairs of
canonical variates account for 83 percent of the total (cumulative proportion 0.8323),
i.e. most of the correlated relationship. The result therefore has a satisfactory
redundancy measure of shared variance.
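The columns of Table 7 are linked by simple arithmetic, which the following Python sketch reproduces from the four eigenvalues alone. It relies on the relation that SAS states in the heading of this output, eigenvalue = CanRsq/(1-CanRsq), so each squared canonical correlation is eigenvalue/(1+eigenvalue), and the likelihood ratio in row k is the product of (1 - CanRsq) over row k and all rows after it. The function names are assumptions made for this illustration.

```python
# Eigenvalues copied from Table 7 (rows 1 to 4).
eigenvalues = [1.5709, 0.7564, 0.4353, 0.0336]


def squared_canonical_correlations(eigs):
    """Invert eigenvalue = r^2 / (1 - r^2) to get r^2 for each pair."""
    return [lam / (1.0 + lam) for lam in eigs]


def cumulative_proportions(eigs):
    """Running share of the eigenvalue total (the 'Cumulative' column)."""
    total = sum(eigs)
    running, shares = 0.0, []
    for lam in eigs:
        running += lam
        shares.append(running / total)
    return shares


def likelihood_ratios(eigs):
    """Likelihood ratio for row k: product of (1 - r_i^2) over pairs k..last."""
    ratios = []
    for k in range(len(eigs)):
        product = 1.0
        for lam in eigs[k:]:
            product *= 1.0 / (1.0 + lam)  # equals 1 - r_i^2
        ratios.append(product)
    return ratios


r2 = squared_canonical_correlations(eigenvalues)
cum = cumulative_proportions(eigenvalues)
lrt = likelihood_ratios(eigenvalues)
for k in range(len(eigenvalues)):
    print(k + 1, round(r2[k], 4), round(cum[k], 4), round(lrt[k], 4))
```

The computed likelihood ratios agree with the LRT column to about three decimals, and the first squared canonical correlation comes out at about 0.61. Note that the cumulative proportion describes shares of the eigenvalue total, which is a different quantity from the redundancy index defined in step 2(2).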
Next we need the coefficients of the canonical variables. SAS gives the canonical
coefficients in two forms: raw and standardized. Raw coefficients apply to the
variables in their original units, standardized coefficients to the standardized
variables. In this example the standardized form is chosen because the magnitudes of
the variables differ greatly. The standardized coefficients, which are also the
canonical weights, are given below:
Table 8: Standardized Canonical Coefficients for the VAR Variables
V1 V2 V3 V4
x1 0.4505 -2.6036 -1.6004 6.0199
x2 0.9485 0.5518 -0.5372 -0.0128
x3 0.9706 -1.1759 1.4022 2.6152
x4 -2.2066 4.0817 0.7718 -7.7569
Table 9: Standardized Canonical Coefficients for the WITH Variables
W1 W2 W3 W4
y1 1.0357 -0.0346 0.702 0.0107
y2 1.0806 -1.1882 -2.3763 1.257
y3 -1.3502 1.9062 1.4576 -1.7991
y4 0.0782 0.1893 -0.1431 1.1702
From the results above we only need the coefficients of the first two pairs of
canonical variates, which are functions of the original (standardized) variables:
$$\begin{aligned}
v_1 &= 0.45x_1 + 0.948x_2 + 0.971x_3 - 2.207x_4 \\
v_2 &= -2.604x_1 + 0.552x_2 - 1.176x_3 + 4.082x_4 \\
w_1 &= 1.036y_1 + 1.081y_2 - 1.350y_3 + 0.078y_4 \\
w_2 &= -0.035y_1 - 1.188y_2 + 1.906y_3 + 0.189y_4
\end{aligned}$$
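To show how such a function is applied, the following Python sketch computes a canonical variate score as a weighted sum of standardized values. The weights are the first columns of Tables 8 and 9; the function names `standardize` and `canonical_score` and the z-scores in the example are hypothetical and serve only as an illustration.

```python
import math

# Standardized canonical coefficients of the first pair (Tables 8 and 9).
V1_WEIGHTS = [0.4505, 0.9485, 0.9706, -2.2066]   # for x1..x4
W1_WEIGHTS = [1.0357, 1.0806, -1.3502, 0.0782]   # for y1..y4


def standardize(column):
    """Convert one variable to z-scores using the sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def canonical_score(z_values, weights):
    """Weighted sum of standardized values for one observation."""
    return sum(w * z for w, z in zip(weights, z_values))


# Hypothetical z-scores of one industry on x1..x4 and y1..y4.
z_x = [0.5, -0.2, 1.1, 0.4]
z_y = [0.3, -0.1, 0.8, 0.0]
print(round(canonical_score(z_x, V1_WEIGHTS), 4))
print(round(canonical_score(z_y, W1_WEIGHTS), 4))
```

In practice each column of the data would first be passed through `standardize` over the whole sample, because standardized coefficients apply to z-scores and not to the raw values.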
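Step 3(1) is to interpret the canonical variates. The canonical weights are
already given, and they are acceptable according to their values. The canonical
loadings are given in the tables named 'Correlations Between the VAR Variables and
Their Canonical Variables' (Table 10) and 'Correlations Between the WITH Variables
and Their Canonical Variables'. The canonical cross-loadings are given in tables such
as 'Correlations Between the WITH Variables and the Canonical Variables of the VAR
Variables' (Table 11):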
Table 10: Correlations Between the VAR Variables and Their Canonical Variables
V1 V2 V3 V4
x1 -0.4392 0.7385 -0.0291 0.5107
x2 0.4931 0.8105 -0.1649 0.2698
x3 -0.1049 0.7075 0.5306 0.4549
x4 -0.3771 0.8103 0.1566 0.4203
Table 11: Correlations Between the WITH Variables and the Canonical Variables of
the VAR Variables
V1 V2 V3 V4
y1 0.6954 0.277 0.0741 -0.0201
y2 0.3498 0.3962 -0.3455 -0.0368
y3 0.233 0.5542 -0.2176 -0.0369
y4 -0.0238 0.4238 0.0359 0.1369
From this table we can read the correlation between the original variables in one
group and the canonical variates of the other group. For example, y1 is highly
correlated with v1 (coefficient 0.6954), and the other correlations can be read from
the table in the same way.
From the loadings and cross-loadings together, the relative importance of each
original variable to the canonical variates can be obtained, and the canonical
function is then fully interpreted.
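The redundancy measure defined in step 2(2) can be worked out from these tables. The Python sketch below takes the V1 loadings from Table 10 and the first eigenvalue from Table 7, computes the shared variance as the mean squared loading, and multiplies it by the squared canonical correlation. The function names are assumptions for illustration. PROC CANCORR can also print a redundancy analysis directly when its REDUNDANCY option is specified.

```python
# Loadings of x1..x4 on V1 (first column of Table 10).
loadings_v1 = [-0.4392, 0.4931, -0.1049, -0.3771]
# First eigenvalue from Table 7; squared canonical correlation = e / (1 + e).
eigenvalue_1 = 1.5709
r2_first_pair = eigenvalue_1 / (1.0 + eigenvalue_1)


def shared_variance(loadings):
    """Mean squared loading: variance of a set explained by its own variate."""
    return sum(value * value for value in loadings) / len(loadings)


def redundancy_index(loadings, squared_correlation):
    """Shared variance multiplied by the squared canonical correlation."""
    return shared_variance(loadings) * squared_correlation


own = shared_variance(loadings_v1)
redundancy = redundancy_index(loadings_v1, r2_first_pair)
print(round(own, 3), round(redundancy, 3))
```

For the first pair this gives a shared variance of about 0.147 and a redundancy of about 0.090, that is, roughly 9 percent of the variance in the scale variables is explained by the first canonical variate of the efficiency set.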
Interpretation and decision.
From the functions above, it can be seen through the canonical weights that v1 is
mainly related to total profits, and w1 is mainly related to personnel labor
productivity. The same reasoning applies to v2 and w2. It can then be concluded that
the correlation between scale and efficiency is mainly the correlation between total
profits and personnel labor productivity. If further research is needed on the
relationship between scale and efficiency, it is simple and clear to devote most
effort to the relationship between total profits and personnel labor productivity.
5. Discussion
In this paper we use three examples to illustrate a format that could help people
learn multivariate analysis quickly and easily. We do not go deeply into the theory:
the models are introduced only briefly, because too much theory would make the paper
harder to read. After reading the paper, readers should have a general, though
perhaps not exact, idea of the models.
However, this format is not suggested for all beginners; the paper mainly
addresses beginners who have some basic statistical knowledge but do not need
theoretical analysis or research. They are expected to know the basics of
probability theory and mathematics; generally, people who have a bachelor's degree in
science have that kind of knowledge. Professionals and people who intend to do
theoretical research may not want to use this format to learn the techniques.
In this format the model is not introduced in detail; as a result, readers may not be
able to adapt the technique to different practical issues. In future work we would
try to cover more of the model while keeping it easy to understand.
We could also put the format into use and use questionnaires to find out its effects
and shortcomings. Although the paper has some limitations, it can work for the
target group and may prove helpful.
6. References
[1] Beijing Municipal Bureau of Statistics (2003) Beijing Statistical Yearbook Beijing.
[2] Fang Kaitai (2004) Practical Multivariate Statistical Analysis Beijing China
Normal University Press.
[3] Feng Xuenan, Cui Yujie (2008) Comparative Study and Analysis of Statistical
Software Packages Journal of North China Industry University.
[4] Gao HuiXuan (2001) Applied Multivariate Statistical Analysis, Beijing Peking
University Press.
[5] Gao HuiXuan (2000) SAS System • SAS / ETS Software Manual Beijing Statistics
Press.
[6] He Xiaoqun (2003) Multivariate Statistical Analysis Beijing People's University
Press.
[7] Huang Yan, Wu Ping (2006) SAS statistical analysis and application Machinery
Industry Press.
[8] Joseph F. Hair, Jr., Rolph E. Anderson, Ronald L. Tatham and William C. Black.
(1998). Multivariate Data Analysis, 5th edition Prentice Hall, Inc.
[9] Jia Junping (2004) Statistics Beijing People's University Press.
[10] Support.sas (2009):
(http://www.support.sas.com/publishing/pubcat/chaps/55129.pdf )
[11] Wu Xizhi (2005) From Data to the Conclusion Learning Guidance and
Exercise, Beijing China Statistics Press.
[12] Xue Wei (2007) Statistical Analysis and the Application of SPSS Beijing
People's University Press.
[13] Yu Xiulin, Ren Xuesong (2007) Multivariate Statistical Analysis, Beijing China
Statistics Press.
[14] Wikipedia.org (Oct 2009) Cluster analysis, Types of clustering.
7. Appendix
ⅰ Economic indicators of all the districts in Beijing
Dongcheng District 32173.00 3294121 14880.10 313801.0 1702707 1660120 398566.0
Xicheng District 33856.00 3487500 14539.00 343806.0 1335592 1640614 564176.0
Chongwen District 20848.00 864012.7 13494.10 69658.00 622455.0 639490.0 138745.0
Xuanwu District 23095.00 2013163 13166.00 194970.0 968372.0 697703.0 263537.0
Chaoyang District 27661.00 6841968 13898.00 565625.0 2858851 2986623 851602.0
Fengtai District 20901.00 2310141 12281.90 122042.0 777305.0 1762322 446608.0
Shijingshan District 23939.00 1384952 11904.70 74402.00 323941.0 991338.0 164470.0
Haidian District 29984.90 8928950 14409.10 475348.0 2898447 3542556 954386.0
Mentougou 18495.00 463519.7 11051.30 36561.00 426365.0 245444.0 64376.00
Fangshan 17547.00 1964661 11712.30 89542.00 1101166 618040.7 144272.0
Tongzhou District 14702.00 1044963 11968.60 84397.00 890727.0 463931.0 122300.0
Shunyi District 14899.00 1830882 13203.00 86458.00 850903.2 503774.0 154533.0
Changping District 17595.00 1301430 10373.00 88541.00 812579.0 362498.0 167877.0
Daxing District 15634.00 1037035 10811.70 93497.00 1008939 413926.0 202267.0
Pinggu 13081.00 509968.2 11964.20 45794.00 912776.9 184335.0 59195.00
Huairou District 17284.10 622751.2 12785.00 54036.00 382200.4 200132.0 70325.00
Miyun County 15691.00 683842.0 12023.10 83336.00 726398.2 267492.0 74002.00
Yanqing County 16904.00 374521.6 12852.60 28128.00 255814.0 298766.0 40746.00
ⅱ Industrial scale and efficiency
Oil and gas extraction 2717.77 970.23 4060.70 2578.92 73.25 32.20 353475.0 99.66
Ferrous Metals Mining 61.96 2.31 219.09 66.39 44.06 4.77 23657.00 97.69
Nonferrous Metals Mining 158.71 9.87 331.23 153.03 38.05 6.89 26840.00 95.64
Nonmetal Mining 116.49 3.12 487.71 115.00 39.64 4.24 19202.00 97.06
Timber harvesting 88.70 2.42 320.28 76.16 52.21 3.61 7856.00 97.58
Food processing 1703.30 49.23 1602.56 1675.15 22.48 7.72 61532.00 97.41
Food Manufacturing 812.25 37.34 992.55 762.92 28.08 9.56 66252.00 97.35
Beverage Manufacturing 1292.59 97.29 2136.72 1261.02 37.45 14.09 81130.00 98.38
Tobacco Processing 1572.21 177.15 2259.93 1648.57 66.15 44.57 540853.0 99.34
Textiles 2579.62 59.46 3534.35 2462.08 25.48 6.15 27245.00 96.36
Apparel Manufacturing 656.24 39.87 687.56 620.87 27.27 9.91 39403.00 97.61
Leather Fur Processing 408.51 8.16 351.15 353.77 25.68 6.70 33697.00 97.86
Woodworking 201.44 3.81 326.32 185.41 25.82 5.22 48571.00 97.42
Furniture Manufacturing 72.14 4.57 117.00 73.22 31.56 7.91 42089.00 98.71
Paper products industry 885.08 31.82 1880.16 840.86 27.35 5.99 50133.00 98.68
Printing 311.06 32.95 537.56 295.25 38.24 11.03 61320.00 95.39
Sporting Goods Manufacturing 169.05 8.62 170.16 164.29 25.67 9.64 31201.00 98.25
Oil processing 3984.19 -22.60 3317.95 4048.10 18.76 10.26 183101.0 99.61
System of chemical raw materials 3773.49 69.20 6596.67 3702.15 25.17 5.31 51256.00 98.09
Pharmaceutical Industry 1300.66 118.92 2299.38 1303.76 36.00 10.88 75996.00 95.83
Chemical fiber manufacturing 768.52 15.91 1302.49 730.37 21.82 4.85 58418.00 96.75
rubber products industry 531.49 12.81 793.59 475.52 28.39 7.18 49230.00 97.25
Plastics industry 691.57 36.06 969.88 684.45 26.60 7.66 67878.00 96.10
Non-metallic mineral products 1558.40 48.19 3465.47 1480.46 30.92 5.55 33246.00 96.95
Ferrous metal smelting 4501.74 177.18 8781.46 4471.53 27.69 6.12 65043.00 99.52
Non-ferrous metal smelting 1462.41 53.30 2425.72 1427.34 27.09 6.62 53415.00 98.08
Fabricated metal products 813.77 32.21 1091.61 766.62 25.40 7.11 50839.00 97.74
Ordinary machinery manufacturing 1780.45 52.48 3357.88 1677.11 28.33 5.44 35867.00 97.33
Special Equipment 1236.25 28.10 2107.34 1162.21 26.03 4.96 31677.00 96.76
Transportation equipment 4982.51 242.12 7416.46 4838.52 25.25 7.98 61745.00 98.92
Electrical machinery 3088.08 125.48 3682.56 2891.65 25.48 7.69 76073.00 97.46
Telecommunications equipment 6722.51 369.16 6613.48 6706.67 21.92 9.17 129402.0 99.59
Instrumentation 506.64 25.09 659.81 521.55 27.67 11.5 44722.00 95.69
Power Steam production 3906.31 480.74 16391.02 6308.19 55.18 7.83 162631.0 99.53
Gas production and supply 124.24 -1.34 489.48 185.16 25.06 1.52 29799.00 102.11