Important clustering methods used in microarray data analysis Steve Horvath Human Genetics and Biostatistics UCLA

Page 1: Important clustering methods used in microarray data analysis Steve Horvath Human Genetics and Biostatistics UCLA

Important clustering methods used in microarray data analysis

Steve Horvath

Human Genetics and Biostatistics

UCLA

Page 2

Contents

• Multidimensional scaling plots (related to principal component analysis)

• k-means clustering

• hierarchical clustering

Page 3

Introduction to clustering

Page 4

MDS plot of clusters

[Figure: scatter plot; x-axis: First PLS Component, y-axis: Second PLS Component; points labeled A, G, M, and O]

Page 5

MDS plot of clusters

[Figure: scatter plot titled "133A UCLA groups, 133a"; x-axis: cmd1[,1], y-axis: cmd1[,2]; points labeled 3, 4, and 5]

Page 6

2 references for clustering

• T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer Series

• L. Kaufman, P. Rousseeuw (1990) Finding Groups in Data. Wiley Series in Probability

 

Page 7

Introduction to clustering

• Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that objects within each cluster are more closely related to one another than to objects assigned to different clusters.

• An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects.

• Sometimes the goal is to arrange the clusters into a natural hierarchy. This involves successively grouping or merging the clusters themselves, so that at each level of the hierarchy clusters within the same group are more similar to each other than to those in different groups.

Page 8

Proximity matrices are the input to most clustering algorithms

 Proximity between pairs of objects: similarity or dissimilarity.

If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities.

Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does *not* have to hold. Triangle inequality:

$d_{ii'} \le d_{ik} + d_{i'k}$
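The points on this slide can be tried out on a toy example. The sketch below is only an illustration: the similarity values are made up, and the map d = 1 - s is just one possible monotone-decreasing conversion (it assumes similarities scaled to [0, 1]). It converts similarities to dissimilarities, checks symmetry, and lists the triples that violate the triangle inequality.

```python
from itertools import permutations


def similarity_to_dissimilarity(sim):
    """Apply the monotone-decreasing map d = 1 - s (assumes similarities in [0, 1])."""
    return [[1.0 - s for s in row] for row in sim]


def is_symmetric(d, tol=1e-12):
    """True if d[i][j] equals d[j][i] for every pair of objects."""
    n = len(d)
    return all(abs(d[i][j] - d[j][i]) <= tol for i in range(n) for j in range(n))


def triangle_violations(d, tol=1e-12):
    """List the triples (i, i2, k) with d[i][i2] > d[i][k] + d[i2][k]."""
    n = len(d)
    return [
        (i, i2, k)
        for i, i2, k in permutations(range(n), 3)
        if d[i][i2] > d[i][k] + d[i2][k] + tol
    ]


# Made-up similarities: objects 0 and 2 both resemble object 1 but not each other.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.9],
    [0.1, 0.9, 1.0],
]
dissim = similarity_to_dissimilarity(sim)
violations = triangle_violations(dissim)

print(is_symmetric(dissim))  # the matrix is symmetric
print(violations)            # triples where the triangle inequality fails
```

Here the dissimilarity between objects 0 and 2 is larger than the sum of their dissimilarities to object 1, so the matrix is a legitimate clustering input even though it is not a metric.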

Page 9

Different intergroup dissimilarities

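Let G and H represent 2 groups, with $N_G$ and $N_H$ observations respectively.

Group average (GA): $d_{GA}(G,H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$

Complete linkage (CL): $d_{CL}(G,H) = \max_{i \in G,\, i' \in H} d_{ii'}$

Single linkage (SL): $d_{SL}(G,H) = \min_{i \in G,\, i' \in H} d_{ii'}$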
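As a rough illustration of the three intergroup dissimilarities, the sketch below evaluates each one on a small hand-made dissimilarity matrix. The function names and the matrix values are hypothetical and chosen only for this example.

```python
def single_linkage(d, G, H):
    """Smallest pairwise dissimilarity between a member of G and a member of H."""
    return min(d[i][j] for i in G for j in H)


def complete_linkage(d, G, H):
    """Largest pairwise dissimilarity between a member of G and a member of H."""
    return max(d[i][j] for i in G for j in H)


def group_average(d, G, H):
    """Mean of all N_G * N_H pairwise dissimilarities between G and H."""
    return sum(d[i][j] for i in G for j in H) / (len(G) * len(H))


# Hypothetical symmetric dissimilarity matrix for 4 observations.
d = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
G, H = [0, 1], [2, 3]

print(single_linkage(d, G, H))    # 3
print(complete_linkage(d, G, H))  # 6
print(group_average(d, G, H))     # 4.5
```

For any pair of groups the group average lies between the single and complete linkage values, since it averages the same set of pairwise dissimilarities over which the other two take the minimum and maximum.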

Page 10

Agglomerative clustering, hierarchical clustering and dendrograms

Page 11

Hierarchical clustering plot

Page 12

Agglomerative clustering

• Agglomerative clustering algorithms begin with every observation representing a singleton cluster.

• At each of the N-1 steps, the closest 2 (least dissimilar) clusters are merged into a single cluster.

• Therefore a measure of dissimilarity between 2 clusters must be defined.
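The procedure in the bullets above can be sketched in a few lines. This is a deliberately naive illustration, not an efficient or reference implementation: the function name `agglomerate`, the `linkage` argument (a function that reduces the list of pairwise dissimilarities between two clusters to one number), and the one-dimensional data are all assumptions made for this example.

```python
def agglomerate(d, linkage=min):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix d.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns the list of merges as (cluster_a, cluster_b, height) tuples.
    """
    clusters = [(i,) for i in range(len(d))]  # start with N singleton clusters
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest intergroup dissimilarity.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                h = linkage([d[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional observations; dissimilarity = absolute difference.
x = [0.0, 1.0, 5.0, 6.0, 20.0]
d = [[abs(p - q) for q in x] for p in x]

merges = agglomerate(d, linkage=min)
for left, right, height in merges:
    print(left, right, height)
```

With N = 5 observations the loop performs exactly N-1 = 4 merges, and under single linkage the merge heights for this data are 1, 1, 4, and 14.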

 

Page 13

Comparing different linkage methods

• If there is a strong clustering tendency, all 3 methods produce similar results.

• Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"). Good for elongated clusters.

• Complete linkage has a drawback: it may lead to clusters in which observations are much closer to members of other clusters than they are to some members of their own cluster. Use for very compact clusters (like pearls on a string).

• Group average clustering represents a compromise between the extremes of single and complete linkage. Use for ball-shaped clusters.
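The difference between the linkage methods can be seen on a small constructed example. In the sketch below, the helper `cut_into_k` and the data are hypothetical: six points on a line with slowly growing gaps, merged until 2 clusters remain under each linkage.

```python
def mean(values):
    """Arithmetic mean, used here as the group average linkage."""
    return sum(values) / len(values)


def cut_into_k(d, k, linkage):
    """Merge the closest pair of clusters repeatedly until only k clusters remain."""
    clusters = [(i,) for i in range(len(d))]
    while len(clusters) > k:
        _, a, b = min(
            (linkage([d[i][j] for i in clusters[a] for j in clusters[b]]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical points on a line; consecutive gaps are 1.0, 1.1, 1.2, 1.3, 1.4.
x = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0]
d = [[abs(p - q) for q in x] for p in x]

single = cut_into_k(d, 2, min)     # single linkage
complete = cut_into_k(d, 2, max)   # complete linkage
average = cut_into_k(d, 2, mean)   # group average

print(single)    # [(0, 1, 2, 3, 4), (5,)]
print(complete)  # [(0, 1, 2, 3), (4, 5)]
print(average)   # [(0, 1, 2, 3), (4, 5)]
```

Single linkage chains the first five points together one at a time and leaves the last point alone. Complete linkage gives a more balanced split, but observation 3 (at 3.3) ends up only 1.3 away from observation 4 in the other cluster while being 3.3 away from observation 0 in its own cluster.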

Page 14

Dendrogram

• Recursive binary splitting/agglomeration can be represented by a rooted binary tree.

• The root node represents the entire data set.

• The N terminal nodes of the tree represent individual observations.

• Each nonterminal node ("parent") has two daughter nodes.

• Thus the binary tree can be plotted so that the height of each node is proportional to the value of the intergroup dissimilarity between its 2 daughters.

• A dendrogram provides a complete description of the hierarchical clustering in graphical format.
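To make the tree structure concrete, the sketch below builds a dendrogram as nested tuples. The representation is an assumption made for this example only: a terminal node is an integer (the observation index) and a parent node is a tuple (height, left daughter, right daughter). Group average linkage and the one-dimensional data are likewise illustrative choices.

```python
def mean(values):
    """Arithmetic mean, used here as the group average linkage."""
    return sum(values) / len(values)


def build_dendrogram(d, linkage=mean):
    """Return the root of the binary tree produced by agglomerative clustering."""
    nodes = [((i,), i) for i in range(len(d))]  # pairs of (members, subtree)
    while len(nodes) > 1:
        h, a, b = min(
            (linkage([d[i][j] for i in nodes[a][0] for j in nodes[b][0]]), a, b)
            for a in range(len(nodes))
            for b in range(a + 1, len(nodes))
        )
        # The parent's height is the intergroup dissimilarity of its 2 daughters.
        parent = (nodes[a][0] + nodes[b][0], (h, nodes[a][1], nodes[b][1]))
        nodes = [n for idx, n in enumerate(nodes) if idx not in (a, b)]
        nodes.append(parent)
    return nodes[0][1]


def leaves(tree):
    """Terminal nodes (observation indices) below a node."""
    if isinstance(tree, int):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)


def parent_heights(tree):
    """Heights of all nonterminal nodes below and including a node."""
    if isinstance(tree, int):
        return []
    height, left, right = tree
    return [height] + parent_heights(left) + parent_heights(right)


# Hypothetical one-dimensional observations; dissimilarity = absolute difference.
x = [0.0, 1.0, 5.0, 6.5, 20.0]
d = [[abs(p - q) for q in x] for p in x]

root = build_dendrogram(d)
print(root)                   # (16.875, 4, (5.25, (1.0, 0, 1), (1.5, 2, 3)))
print(sorted(leaves(root)))   # [0, 1, 2, 3, 4]
print(parent_heights(root))   # [16.875, 5.25, 1.0, 1.5]
```

The root covers all N = 5 observations, there are N-1 = 4 parent nodes, and each parent sits higher than its daughters, which is what allows the tree to be drawn with node height on the vertical axis.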

Page 15

Comments on dendrograms

Caution: different hierarchical methods as well as small changes in the data can lead to different dendrograms.

Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.

In general, dendrograms are a description of the results of the algorithm and not a graphical summary of the data.

A dendrogram is a valid summary only to the extent that the pairwise *observation* dissimilarities obey the ultrametric inequality

$C_{ii'} \le \max(C_{ik}, C_{i'k})$ for all $i, i', k$
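The ultrametric inequality can be checked numerically. The sketch below makes an assumption the slide does not state: it reads C as the dissimilarity implied by the dendrogram, that is, the height at which two observations are first joined in the same cluster (often called the cophenetic dissimilarity). Complete linkage and the data are illustrative choices.

```python
from itertools import permutations


def cophenetic(d, linkage=max):
    """Matrix of merge heights: C[i][j] is the height at which i and j first join."""
    n = len(d)
    clusters = [(i,) for i in range(n)]
    C = [[0.0] * n for _ in range(n)]
    while len(clusters) > 1:
        h, a, b = min(
            (linkage([d[i][j] for i in clusters[a] for j in clusters[b]]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # Every cross pair of the two merged clusters is joined at height h.
        for i in clusters[a]:
            for j in clusters[b]:
                C[i][j] = C[j][i] = h
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return C


def is_ultrametric(m, tol=1e-12):
    """True if m[i][i2] <= max(m[i][k], m[i2][k]) for all distinct i, i2, k."""
    n = len(m)
    return all(
        m[i][i2] <= max(m[i][k], m[i2][k]) + tol
        for i, i2, k in permutations(range(n), 3)
    )


# Hypothetical points on a line; dissimilarity = absolute difference.
x = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0]
d = [[abs(p - q) for q in x] for p in x]
C = cophenetic(d, linkage=max)  # complete linkage

print(is_ultrametric(C))  # True: the tree heights obey the inequality
print(is_ultrametric(d))  # False: the original dissimilarities do not
```

In this example the dendrogram heights satisfy the inequality while the original dissimilarities do not, which illustrates why the tree describes the output of the algorithm rather than the data itself.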

Page 16

Figure 13: Three cluster dendrograms (y-axis: Height) of dist(xsimple), produced by hclust with average, complete, and single linkage.