Clustering Procedure Cheng Lei rexlei86@uvic.ca Department of Electrical and Computer Engineering University of Victoria April 16, 2015

Clustering Procedure

Cheng Lei

rexlei86@uvic.ca

Department of Electrical and Computer Engineering

University of Victoria

April 16, 2015

Outline
❖ Overview
❖ CLUSTER Procedure
❖ Clustering Methods

Overview

• Data:
  • Distances
  • Coordinates <Euclidean Distances>
• Clustering methods
  • 11 methods supported
• FASTCLUS Procedure
  • CPU time: proportional to the number of observations
  • Use FASTCLUS for a preliminary cluster analysis
  • Use CLUSTER to cluster the preliminary clusters hierarchically
• Principles
  • Each observation begins in a cluster by itself
  • The two closest clusters are merged to form a new one that replaces the two old ones
  • Repeat the merging step until only one cluster is left
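The merging principle can be sketched in Python as a naive loop. This is a minimal illustration, not how PROC CLUSTER is implemented: the function names (`agglomerate`, `cluster_distance`) and the sample points are hypothetical, and the distance between clusters is assumed to be the Euclidean distance between centroids. Any other clustering method could be swapped in at that point.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def cluster_distance(a, b):
    """Assumed choice: Euclidean distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))


def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    # Each observation begins in a cluster by itself.
    clusters = {i: [p] for i, p in enumerate(points)}
    next_id = len(points)
    history = []
    while len(clusters) > 1:
        # Find the two closest clusters among all current pairs.
        i, j = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        dist = cluster_distance(clusters[i], clusters[j])
        # The merged cluster replaces the two old ones.
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        history.append((i, j, next_id, dist))
        next_id += 1
    # The loop ends when only one cluster is left.
    return history


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
for left, right, new, dist in agglomerate(points):
    print(f"merge {left} + {right} -> {new} at distance {dist:.3f}")
```

Because every pass rescans all cluster pairs, the work in this naive version grows roughly with the cube of the number of observations, which is one reason hierarchical clustering becomes expensive on large data sets.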

Overview
• CLUSTER Procedure
  • Not practical for very large data sets, because CPU time is roughly proportional to the square or cube of the number of observations
  • Displays a history of the clustering process
  • Shows statistics for estimating the number of clusters
    • RMSSTD
    • Pseudo F
    • Pseudo T-square
  • Creates a dendrogram
  • Creates output data sets that the TREE procedure uses to output the cluster membership
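The idea of turning a clustering history into cluster membership can be sketched as follows. This is a hedged Python illustration only: the merge-history format `(left_id, right_id, new_id)` and the function `membership` are hypothetical and do not reproduce the layout of the data set that PROC CLUSTER writes or the options of PROC TREE.

```python
def membership(n_obs, merges, n_clusters):
    """Replay merges until n_clusters remain; return one label per observation."""
    if not 1 <= n_clusters <= n_obs:
        raise ValueError("n_clusters must be between 1 and n_obs")
    # Observations 0..n_obs-1 start as singleton clusters.
    clusters = {i: [i] for i in range(n_obs)}
    for left, right, new in merges:
        if len(clusters) == n_clusters:
            break  # stop replaying once the requested level is reached
        clusters[new] = clusters.pop(left) + clusters.pop(right)
    # Number the surviving clusters 1, 2, ... by their smallest observation index.
    labels = [0] * n_obs
    ordered = sorted(clusters.values(), key=min)
    for label, members in enumerate(ordered, start=1):
        for obs in members:
            labels[obs] = label
    return labels


# Hypothetical history for five observations: ids 5..8 are merged clusters.
merges = [(0, 1, 5), (2, 3, 6), (5, 6, 7), (4, 7, 8)]
print(membership(5, merges, 3))  # prints [1, 1, 2, 2, 3]
```

Cutting the same history at a different level only changes where the replay stops, so one history can serve any requested number of clusters.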

CLUSTER Procedure
• PROC CLUSTER METHOD=method-name <options>;
• BY variables;
• COPY variables;
• FREQ variable;
• ID variable;
• RMSSTD variable;
• VAR variables;

Options
• RMSSTD
  • Root mean square standard deviation of a cluster
• Pseudo F
  • The ratio of between-cluster variance to within-cluster variance
• Pseudo T-square
  • A measure of the separation between the two clusters that are merged to form a new cluster

RMSSTD

$$\mathrm{RMSSTD}_k = \sqrt{\frac{W_k}{v\,(N_k - 1)}}$$

• $W_k$: the within-group sum of squares of cluster k
• $N_k$: the number of elements in cluster k
• $v$: the number of variables
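As a small worked example, the Python sketch below computes RMSSTD for one cluster, assuming the usual definition sqrt(W_k / (v (N_k - 1))), where W_k is the within-group sum of squares, N_k the number of elements, and v the number of variables. The helper names and the sample data are hypothetical.

```python
import math


def within_ss(cluster):
    """Within-group sum of squares: squared deviations from the cluster mean,
    summed over all observations and all variables."""
    n = len(cluster)
    v = len(cluster[0])
    means = [sum(row[d] for row in cluster) / n for d in range(v)]
    return sum((row[d] - means[d]) ** 2 for row in cluster for d in range(v))


def rmsstd(cluster):
    """Root mean square standard deviation of one cluster (assumed formula)."""
    n = len(cluster)   # N_k: number of elements in the cluster
    v = len(cluster[0])  # v: number of variables
    if n < 2:
        raise ValueError("RMSSTD needs at least two observations")
    return math.sqrt(within_ss(cluster) / (v * (n - 1)))


# Three observations on two variables; the cluster mean is (3, 4).
cluster = [(1.0, 2.0), (3.0, 2.0), (5.0, 8.0)]
print(within_ss(cluster))  # prints 32.0
print(rmsstd(cluster))     # sqrt(32 / (2 * 2)), approximately 2.828
```

A small RMSSTD indicates a tight, homogeneous cluster; a sharp jump after a merge suggests that two dissimilar clusters were joined.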

Pseudo F

$$F = \frac{B/(c-1)}{W/(n-c)}$$

• $B$: the between-group sum of squares
• $W$: the within-group sum of squares
• $c$: the number of clusters at a certain step
• $n$: the number of observations
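The pseudo F statistic can be checked by hand with a short Python sketch. It assumes the common form F = (B / (c - 1)) / (W / (n - c)), with B obtained as the total sum of squares minus the within-group sum of squares W. The function names and the toy data are hypothetical.

```python
def mean_vector(rows):
    """Coordinate-wise mean of a list of observations."""
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]


def sum_sq_about(rows, center):
    """Sum of squared deviations of all rows from a given center."""
    return sum((r[d] - center[d]) ** 2 for r in rows for d in range(len(center)))


def pseudo_f(clusters):
    """Pseudo F for a partition given as a list of clusters (assumed formula)."""
    all_rows = [r for cluster in clusters for r in cluster]
    n, c = len(all_rows), len(clusters)
    if not 1 < c < n:
        raise ValueError("need more than one cluster and fewer clusters than observations")
    total = sum_sq_about(all_rows, mean_vector(all_rows))
    within = sum(sum_sq_about(rows, mean_vector(rows)) for rows in clusters)
    between = total - within
    return (between / (c - 1)) / (within / (n - c))


# Two tight, well-separated clusters give a large pseudo F.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(pseudo_f(clusters))  # prints 50.0
```

Here the total sum of squares is 104 and the within-group sum is 4, so B = 100 and F = (100 / 1) / (4 / 2) = 50. Relatively large values at a given step suggest a good number of clusters.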

Pseudo T-Square

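$$t^2 = \frac{B_{KL}}{(W_K + W_L)/(N_K + N_L - 2)}$$

• $W_K$, $W_L$: within-cluster sums of squares of clusters K and L
• $N_K$, $N_L$: numbers of observations in clusters K and L
• $B_{KL}$: between-cluster sum of squares of clusters K and L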
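A hedged Python sketch of the pseudo T-square computation for one merge follows. It assumes the form t² = B_KL / ((W_K + W_L) / (N_K + N_L - 2)) and takes B_KL as the increase in within-cluster sum of squares caused by merging K and L. All names and data are hypothetical.

```python
def within_ss(rows):
    """Sum of squared deviations from the mean, over all variables."""
    n = len(rows)
    v = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(v)]
    return sum((r[d] - means[d]) ** 2 for r in rows for d in range(v))


def pseudo_t2(cluster_k, cluster_l):
    """Pseudo T-square for merging clusters K and L (assumed formula)."""
    n_k, n_l = len(cluster_k), len(cluster_l)
    if n_k + n_l <= 2:
        raise ValueError("need more than two observations in total")
    w_k = within_ss(cluster_k)
    w_l = within_ss(cluster_l)
    # Between-cluster sum of squares: within SS of the merged cluster
    # minus the within SS of its two parts.
    b_kl = within_ss(cluster_k + cluster_l) - w_k - w_l
    return b_kl / ((w_k + w_l) / (n_k + n_l - 2))


cluster_k = [(1.0, 1.0), (1.0, 3.0), (4.0, 2.0)]
cluster_l = [(7.0, 2.0), (9.0, 2.0)]
print(pseudo_t2(cluster_k, cluster_l))  # approximately 12.96
```

In this example W_K = 8, W_L = 2, and B_KL = 43.2, so t² = 43.2 / (10 / 3) = 12.96. The denominator is zero when both clusters have no internal spread, so real data with duplicate points would need a guard. A large value at a merge suggests that the two clusters were well separated and perhaps should not have been merged.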

METHODS
• Average Linkage (AVE or AVERAGE)
• Centroid Method (CEN or CENTROID)
• Complete Linkage (COM or COMPLETE)
• Density Linkage (DEN or DENSITY)
• Maximum likelihood (EML)
• Flexible-Beta Method (FLE or FLEXIBLE)
• McQuitty’s Similarity Analysis (MCQ or MCQUITTY)
• Median Method (MED or MEDIAN)
• Single Linkage (SIN or SINGLE)
• Two-Stage Density Linkage (TWO or TWOSTAGE)
• Ward’s minimum-variance method (WAR or WARD)

Average Linkage
• Idea:
  • The distance between two clusters is defined as the average distance between pairs of observations, one in each cluster
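The average linkage idea can be written as a few lines of Python. This is a sketch under stated assumptions: the function name `average_linkage` is hypothetical, and plain Euclidean distance is used between observations. PROC CLUSTER may work with squared distances for this method by default, so the numbers are illustrative only.

```python
import math
from itertools import product


def average_linkage(cluster_a, cluster_b):
    """Average distance over all pairs with one observation from each cluster."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


cluster_a = [(0.0, 0.0), (0.0, 3.0)]
cluster_b = [(4.0, 0.0), (4.0, 3.0)]
# The four pairwise distances are 4, 5, 5 and 4.
print(average_linkage(cluster_a, cluster_b))  # prints 4.5
```

Every observation contributes to the result, so the measure is less sensitive to a single outlying point than a nearest-pair or farthest-pair rule would be.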

Centroid Method
• Idea:
  • The distance between two clusters is defined as the (squared) Euclidean distance between their centroids (means)
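A minimal Python sketch of the centroid idea follows, assuming the cluster distance is measured between the cluster means. The function names and sample clusters are hypothetical, and the sketch returns the unsquared distance; squaring the result gives the squared form.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of the observations in a cluster."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def centroid_distance(cluster_a, cluster_b):
    """Euclidean distance between the two cluster centroids."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))


cluster_a = [(0.0, 0.0), (0.0, 3.0)]
cluster_b = [(4.0, 0.0), (4.0, 3.0)]
# The centroids are (0, 1.5) and (4, 1.5).
print(centroid_distance(cluster_a, cluster_b))  # prints 4.0
```

Only the two means enter the calculation, so the spread of observations inside each cluster does not affect the distance.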

Next week’s work
• Do examples with SAS base language
• More reading about other procedures in SAS/STAT

Thank You!!!