Important clustering methods used in microarray data analysis
Steve Horvath
Human Genetics and Biostatistics
UCLA
Contents
• Multidimensional scaling (MDS) plots – related to principal component analysis
• k-means clustering
• hierarchical clustering
Introduction to clustering
MDS plot of clusters
[Figure: MDS plot of clusters. Axes: First PLS Component (x) vs. Second PLS Component (y); points are labeled A, G, M, or O by group.]
MDS plot of clusters
[Figure: MDS plot, "133A UCLA groups, 133a". Axes: cmd1[,1] (x) vs. cmd1[,2] (y); points are labeled 3, 4, or 5 by cluster.]
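MDS plots like the ones above embed observations in two dimensions so that plotted distances approximate the entries of a dissimilarity matrix. As an illustrative sketch (a Python stand-in for R's cmdscale, not the code used for the slides), classical MDS double-centers the squared dissimilarities and takes the top eigenvectors:

```python
import numpy as np

def classical_mds(d, k=2):
    """Classical (Torgerson) MDS: embed n objects in k dimensions
    from an n x n matrix of pairwise dissimilarities d."""
    n = d.shape[0]
    # Double-center the squared dissimilarities: B = -1/2 * J D^2 J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # Coordinates: top-k eigenvectors scaled by sqrt of eigenvalues
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Usage: recover 2-D structure from Euclidean distances
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 2))
d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
coords = classical_mds(d, k=2)
# Pairwise distances among the embedded points match the input
d_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(d, d_hat))  # True for exact Euclidean input
```

For exact Euclidean distances in two dimensions the embedding is lossless; for general dissimilarities it is only an approximation.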
Two references for clustering
• T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer Series in Statistics
• L. Kaufman, P. Rousseeuw (1990) Finding Groups in Data. Wiley Series in Probability
Introduction to clustering
Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that those within each cluster are more closely related to one another than objects assigned to different clusters. An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects. Sometimes the goal is to arrange the clusters into a natural hierarchy, which involves successively grouping or merging the clusters themselves so that at each level of the hierarchy clusters within the same group are more similar to each other than those in different groups.
Proximity matrices are the input to most clustering algorithms
Proximity between pairs of objects: similarity or dissimilarity.
If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities.
Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does *not* have to hold. Triangle inequality:

d(i, i') <= d(i, k) + d(k, i') for all i, i', k
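To make the point concrete, here is a small illustrative check (hypothetical data, not from the slides) showing that Euclidean distances satisfy the triangle inequality while a hand-made dissimilarity matrix need not:

```python
import numpy as np

def satisfies_triangle_inequality(d):
    """Check d[i, i'] <= d[i, k] + d[k, i'] for all i, i', k."""
    n = d.shape[0]
    return all(
        d[i, j] <= d[i, k] + d[k, j] + 1e-12
        for i in range(n) for j in range(n) for k in range(n)
    )

# Euclidean distances form a metric, so the inequality holds
x = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
euclid = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
print(satisfies_triangle_inequality(euclid))  # True

# A clustering algorithm will also accept this symmetric dissimilarity,
# which violates the inequality (10 > 1 + 1):
d = np.array([[0.0, 1.0, 10.0],
              [1.0, 0.0, 1.0],
              [10.0, 1.0, 0.0]])
print(satisfies_triangle_inequality(d))  # False
```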
Different intergroup dissimilarities
Let G and H represent two groups, with N_G and N_H observations respectively.

Single linkage (SL):    d_SL(G, H) = min_{i in G, i' in H} d(i, i')
Complete linkage (CL):  d_CL(G, H) = max_{i in G, i' in H} d(i, i')
Group average (GA):     d_GA(G, H) = (1 / (N_G * N_H)) * sum_{i in G, i' in H} d(i, i')
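The three intergroup dissimilarities above are just the min, max, and mean of the cross-group block of the dissimilarity matrix. A minimal sketch (toy matrix and group indices chosen for illustration):

```python
import numpy as np

def single_linkage(d, g, h):
    """d_SL(G, H): minimum dissimilarity over i in G, i' in H."""
    return d[np.ix_(g, h)].min()

def complete_linkage(d, g, h):
    """d_CL(G, H): maximum dissimilarity over i in G, i' in H."""
    return d[np.ix_(g, h)].max()

def group_average(d, g, h):
    """d_GA(G, H): average dissimilarity over all cross-group pairs."""
    return d[np.ix_(g, h)].mean()

# Usage on a toy 4-object dissimilarity matrix with G = {0, 1}, H = {2, 3}
d = np.array([[0., 1., 4., 6.],
              [1., 0., 3., 5.],
              [4., 3., 0., 2.],
              [6., 5., 2., 0.]])
g, h = [0, 1], [2, 3]
print(single_linkage(d, g, h), complete_linkage(d, g, h), group_average(d, g, h))
# 3.0 6.0 4.5
```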
Agglomerative clustering, hierarchical clustering and dendrograms
Hierarchical clustering plot
Agglomerative clustering
• Agglomerative clustering algorithms begin with every observation representing a singleton cluster.
• At each of the N-1 subsequent steps, the two closest (least dissimilar) clusters are merged into a single cluster.
• Therefore a measure of dissimilarity between 2 clusters must be defined.
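The loop described above can be sketched as follows (a naive Python implementation for illustration, using single linkage as the intergroup dissimilarity; not the slide's own code and far slower than library routines):

```python
import numpy as np

def agglomerate(d, linkage=min):
    """Naive agglomerative clustering on an n x n dissimilarity matrix.
    Starts with n singleton clusters and performs n - 1 merges, each time
    joining the pair of clusters with the smallest intergroup dissimilarity
    (default: single linkage, the min over cross-cluster pairs)."""
    clusters = [[i] for i in range(d.shape[0])]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage(d[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a][:], clusters[b][:], dist))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return merges

# Toy example: two tight pairs {0, 1} and {2, 3}
d = np.array([[0., 1., 5., 6.],
              [1., 0., 4., 6.],
              [5., 4., 0., 2.],
              [6., 6., 2., 0.]])
merges = agglomerate(d)
print(len(merges))  # 3 merges for N = 4 observations (N - 1)
```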
Comparing different linkage methods

If there is a strong clustering tendency, all 3 methods produce similar results.

Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"). Good for elongated clusters.

Bad: complete linkage may lead to clusters where observations assigned to a cluster can be much closer to members of other clusters than they are to some members of their own cluster. Use for very compact clusters (like pearls on a string).

Group average clustering represents a compromise between the extremes of single and complete linkage. Use for ball-shaped clusters.
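The chaining behavior can be seen on a small made-up example (assuming SciPy is available; the data here are illustrative, not from the slides). For ten evenly spaced points on a line, single linkage joins the whole chain at the nearest-neighbor distance, while complete linkage only finishes merging at the full diameter of the data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Ten evenly spaced points on a line: an "elongated" cluster where each
# observation is close only to its neighbors.
x = np.arange(10.0).reshape(-1, 1)
single = linkage(x, method="single")
complete = linkage(x, method="complete")

# Single linkage chains through the intermediates: every merge happens
# at height 1, the nearest-neighbor distance.
print(single[:, 2].max())    # 1.0
# Complete linkage resists chaining: the final merge height equals the
# diameter of the data.
print(complete[:, 2].max())  # 9.0
```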
Dendrogram
Recursive binary splitting/agglomeration can be represented by a rooted binary tree.
The root node represents the entire data set. The N terminal nodes of the tree represent individual observations. Each nonterminal node ("parent") has two daughter nodes. The binary tree can thus be plotted so that the height of each node is proportional to the value of the intergroup dissimilarity between its 2 daughters.
A dendrogram provides a complete description of the hierarchical clustering in graphical format.
Comments on dendrograms
Caution: different hierarchical methods as well as small changes in the data can lead to different dendrograms.
Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.
In general, dendrograms are a description of the results of the algorithm and not a graphical summary of the data.

They are a valid summary only to the extent that the pairwise *observation* dissimilarities obey the ultrametric inequality:

d(i, i') <= max(d(i, k), d(i', k)) for all i, i', k
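Whatever the input dissimilarities look like, the *cophenetic* dissimilarities read off a dendrogram (the height at which two observations first join the same cluster) always satisfy the ultrametric inequality. A sketch of that check with SciPy (random illustrative data, assuming SciPy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

# The cophenetic dissimilarity C[i, i'] is the height at which observations
# i and i' are first merged into the same cluster.
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 3))
z = linkage(pdist(x), method="average")
c = squareform(cophenet(z))

# Unlike raw distances, cophenetic dissimilarities satisfy the
# ultrametric inequality C[i, i'] <= max(C[i, k], C[i', k]).
n = c.shape[0]
ultrametric = all(
    c[i, j] <= max(c[i, k], c[j, k]) + 1e-12
    for i in range(n) for j in range(n) for k in range(n)
)
print(ultrametric)  # True
```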
Figure 13

[Figure: three cluster dendrograms of the same data, produced in R by hclust(dist(xsimple), "average"), hclust(dist(xsimple), "complete"), and hclust(dist(xsimple), "single"); the vertical axis is Height. Panels from left to right: average, complete, single linkage.]
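The three panels above come from R's hclust. As a rough stand-in (the slide's xsimple data are not shown, so a small synthetic data set is used here), the same three linkage trees can be built in Python with SciPy; dendrogram(..., no_plot=True) returns the leaf ordering that a plot would draw:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# xsimple from the slide is unavailable; two well-separated blobs instead.
rng = np.random.default_rng(42)
x = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(5, 0.5, (5, 2))])
d = pdist(x)

for method in ("average", "complete", "single"):
    z = linkage(d, method=method)
    # no_plot=True returns the layout matplotlib would draw, without plotting
    leaves = dendrogram(z, no_plot=True)["leaves"]
    print(method, len(leaves))
```

With strongly separated blobs like these, all three methods agree on the top-level split, echoing the earlier point that a strong clustering tendency makes the linkage choice less critical.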