Clustering Procedure
Cheng [email protected]
Department of Electrical and Computer Engineering, University of Victoria
April 16, 2015
Outline
❖ Overview
❖ CLUSTER Procedure
❖ Clustering Methods
Overview
• Data:
  • Distances
  • Coordinates (Euclidean distances are computed from them)
• Clustering methods
  • 11 methods supported
• FASTCLUS Procedure
  • CPU time: roughly proportional to the number of observations
  • Use FASTCLUS for a preliminary cluster analysis
  • Use CLUSTER to cluster the preliminary clusters hierarchically
• Principles
  • Each observation begins in a cluster by itself
  • The two closest clusters are merged to form a new cluster that replaces the two old ones
  • The merging step is repeated until only one cluster is left
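The merging principle above can be sketched in a few lines of Python. This is only an illustration, not how PROC CLUSTER is implemented: the function names `cluster_distance` and `agglomerate` are hypothetical, and the distance between two clusters is assumed here to be the smallest pairwise Euclidean distance (single linkage), which is just one of the supported methods.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def cluster_distance(a, b):
    """Assumed rule: smallest pairwise distance between two clusters (single linkage)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    # Each observation begins in a cluster by itself.
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance.
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # The merged cluster replaces the two old ones.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
history = agglomerate(points)
for left, right in history:
    print("merge", left, "+", right)
```

With n observations the loop always performs n - 1 merges. The naive pair search shown here is far more expensive than a production implementation would be, which is in line with the remark that hierarchical clustering does not scale to very large data sets.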
Overview
• CLUSTER Procedure
  • Not practical for very large data sets, because CPU time is roughly proportional to the square or cube of the number of observations
  • Displays a history of the clustering process
  • Shows statistics for estimating the number of clusters
    • RMSSTD
    • Pseudo F
    • Pseudo T-square
  • Creates a dendrogram
  • Creates output data sets that the TREE procedure can use to output the cluster membership
CLUSTER Procedure
• PROC CLUSTER METHOD=method-name <options>;
• BY variables;
• COPY variables;
• FREQ variable;
• ID variable;
• RMSSTD variable;
• VAR variables;
Options
• RMSSTD
  • Root mean square standard deviation of a cluster
• Pseudo F
  • The ratio of between-cluster variance to within-cluster variance
• Pseudo T-square
  • A measure of the difference between the two clusters that are merged to form a new cluster
RMSSTD
• $\mathrm{RMSSTD}_k = \sqrt{\dfrac{W_k}{v\,(N_k - 1)}}$
• $W_k$: the within-group sum of squares of cluster k
• $N_k$: the number of elements in cluster k
• $v$: the number of variables
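A small worked example may help make the RMSSTD definition concrete. The sketch below assumes the usual form of the statistic, the square root of the within-group sum of squares divided by (number of variables) × (cluster size − 1); the helper names `within_ss` and `rmsstd` are hypothetical and the data are made up for illustration.

```python
from math import sqrt


def within_ss(cluster):
    """Within-group sum of squares: squared deviations from the cluster mean, over all variables."""
    n = len(cluster)
    v = len(cluster[0])
    means = [sum(row[j] for row in cluster) / n for j in range(v)]
    return sum((row[j] - means[j]) ** 2 for row in cluster for j in range(v))


def rmsstd(cluster):
    """Assumed formula: sqrt(W_k / (v * (N_k - 1))); needs at least two observations."""
    n = len(cluster)
    v = len(cluster[0])
    return sqrt(within_ss(cluster) / (v * (n - 1)))


# Three observations on two variables; the cluster mean is (3, 4).
cluster = [(1.0, 2.0), (3.0, 2.0), (5.0, 8.0)]
print(within_ss(cluster))  # 8 from the first variable + 24 from the second = 32
print(rmsstd(cluster))     # sqrt(32 / (2 * 2)) = sqrt(8)
```

A small RMSSTD indicates a compact cluster, so a sharp jump in the value along the merge history suggests that two dissimilar clusters were just joined.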
Pseudo F
• $F = \dfrac{B / (c - 1)}{W / (n - c)}$
• $B$: the between-group sum of squares
• $W$: the within-group sum of squares
• $c$: the number of clusters at a certain step
• $n$: the number of observations
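The pseudo F statistic can be illustrated with a short calculation. The sketch assumes the common form, (between-group sum of squares / (clusters − 1)) divided by (within-group sum of squares / (observations − clusters)), with the between-group sum of squares obtained as total minus within. The function names and the toy data are hypothetical.

```python
def mean(rows):
    """Coordinate-wise mean of a list of observations."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def sum_sq(rows):
    """Sum of squared deviations of the observations from their own mean."""
    center = mean(rows)
    return sum((x - m) ** 2 for row in rows for x, m in zip(row, center))


def pseudo_f(clusters):
    """Assumed formula: (B / (c - 1)) / (W / (n - c)) with B = total SS - W."""
    everything = [row for cluster in clusters for row in cluster]
    n = len(everything)
    c = len(clusters)
    within = sum(sum_sq(cluster) for cluster in clusters)
    between = sum_sq(everything) - within
    return (between / (c - 1)) / (within / (n - c))


# Two well-separated clusters of two observations each.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
print(pseudo_f(clusters))  # total SS = 20, W = 4, B = 16, so (16 / 1) / (4 / 2) = 8
```

Relatively large values of pseudo F at some number of clusters are usually read as a hint that this number is a reasonable stopping point.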
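Pseudo T-Square
• $t^2 = \dfrac{B_{KL}}{(W_K + W_L) / (N_K + N_L - 2)}$
• $W_K$, $W_L$: within-cluster sums of squares of clusters K and L
• $N_K$, $N_L$: numbers of observations in clusters K and L
• $B_{KL}$: between-cluster sum of squares for clusters K and L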
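To see how the three quantities above combine, here is a minimal numeric sketch. It assumes the usual form of the statistic: the between-cluster sum of squares divided by (the sum of the two within-cluster sums of squares / (total observations in both clusters − 2)), where the between-cluster term is taken as the increase in within-cluster sum of squares caused by the merge. Function names and data are hypothetical.

```python
def sum_sq(rows):
    """Within-cluster sum of squares around the cluster mean."""
    n = len(rows)
    center = [sum(col) / n for col in zip(*rows)]
    return sum((x - m) ** 2 for row in rows for x, m in zip(row, center))


def pseudo_t2(cluster_k, cluster_l):
    """Assumed formula: B_KL / ((W_K + W_L) / (N_K + N_L - 2))."""
    w_k = sum_sq(cluster_k)
    w_l = sum_sq(cluster_l)
    # Between-cluster SS: within SS of the merged cluster minus the two parts.
    b_kl = sum_sq(cluster_k + cluster_l) - w_k - w_l
    return b_kl / ((w_k + w_l) / (len(cluster_k) + len(cluster_l) - 2))


# One variable: K has mean 1 (W_K = 2), L has mean 7 (W_L = 8).
cluster_k = [(0.0,), (2.0,)]
cluster_l = [(5.0,), (7.0,), (9.0,)]
t2 = pseudo_t2(cluster_k, cluster_l)
print(t2)  # B_KL = 43.2, so 43.2 / (10 / 3) = 12.96 up to rounding
```

A large value at a given step suggests that the two clusters just merged were quite different, so the clustering before that merge may be the better choice.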
METHODS
• Average Linkage (AVE or AVERAGE)
• Centroid Method (CEN or CENTROID)
• Complete Linkage (COM or COMPLETE)
• Density Linkage (DEN or DENSITY)
• Maximum Likelihood (EML)
• Flexible-Beta Method (FLE or FLEXIBLE)
• McQuitty’s Similarity Analysis (MCQ or MCQUITTY)
• Median Method (MED or MEDIAN)
• Single Linkage (SIN or SINGLE)
• Two-Stage Density Linkage (TWO or TWOSTAGE)
• Ward’s Minimum-Variance Method (WAR or WARD)
Average Linkage
• Idea:
  • The distance between two clusters is defined as the average distance between pairs of observations, one in each cluster
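The average-linkage idea translates almost directly into code. The sketch below uses plain Euclidean distances and a hypothetical function name, `average_linkage`. PROC CLUSTER may work with squared distances depending on its options, so the numbers are illustrative only.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def average_linkage(cluster_a, cluster_b):
    """Average distance over all pairs with one observation in each cluster."""
    total = sum(dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


cluster_a = [(0.0, 0.0), (0.0, 3.0)]
cluster_b = [(4.0, 0.0)]
d_avg = average_linkage(cluster_a, cluster_b)
print(d_avg)  # pairwise distances are 4 and 5, so the average is 4.5
```

Because every pair contributes to the average, a single outlying observation influences the result less than it would under single or complete linkage.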
Centroid Method
• Idea:
  • The distance between two clusters is defined as the (squared) Euclidean distance between their centroids (means)
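For the centroid method, the sketch below assumes that the distance between two clusters is measured between their centroids (coordinate-wise means). It returns the plain Euclidean distance; squaring the result gives the squared form. The names `centroid` and `centroid_distance` are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(cluster):
    """Coordinate-wise mean of the observations in a cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


def centroid_distance(cluster_a, cluster_b):
    """Euclidean distance between the two cluster centroids."""
    return dist(centroid(cluster_a), centroid(cluster_b))


cluster_a = [(0.0, 0.0), (0.0, 2.0)]  # centroid (0, 1)
cluster_b = [(3.0, 4.0), (3.0, 6.0)]  # centroid (3, 5)
d_cen = centroid_distance(cluster_a, cluster_b)
print(d_cen)  # sqrt(3**2 + 4**2) = 5.0
```

Unlike average linkage, only the two means enter the calculation, so the spread of observations inside each cluster has no effect on the distance.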
Next week’s work
• Do examples with the SAS base language
• More reading about other procedures in SAS/STAT
Thank You!!!