Important clustering methods used in microarray data analysis
Steve Horvath
Human Genetics and Biostatistics
UCLA
Contents
• Multidimensional scaling (MDS) plots – related to principal component analysis
• k-means clustering
• hierarchical clustering
Introduction to clustering
MDS plot of clusters
[Figure: MDS plot of clusters. Axes: First PLS Component (x) vs. Second PLS Component (y); points are labeled A, G, M, or O by group.]
MDS plot of clusters
[Figure: MDS plot, "133A UCLA groups, 133a". Axes: cmd1[,1] (x) vs. cmd1[,2] (y); points are labeled 3, 4, or 5 by cluster.]
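MDS plots like the ones above embed observations in two dimensions so that plotted distances approximate the entries of a dissimilarity matrix. As an illustrative sketch (a Python stand-in for R's cmdscale, not the code used for the slides), classical MDS double-centers the squared dissimilarities and takes the top eigenvectors:

```python
import numpy as np

def classical_mds(d, k=2):
    """Classical (Torgerson) MDS: embed n objects in k dimensions
    from an n x n matrix of pairwise dissimilarities d."""
    n = d.shape[0]
    # Double-center the squared dissimilarities: B = -1/2 * J D^2 J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # Coordinates: top-k eigenvectors scaled by sqrt of eigenvalues
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Usage: recover 2-D structure from Euclidean distances
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 2))
d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
coords = classical_mds(d, k=2)
# Pairwise distances among the embedded points match the input
d_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(d, d_hat))  # True for exact Euclidean input
```

For exact Euclidean distances in two dimensions the embedding is lossless; for general dissimilarities it is only an approximation.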
Two references for clustering
• T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer Series in Statistics
• L. Kaufman, P. Rousseeuw (1990) Finding Groups in Data. Wiley Series in Probability
Introduction to clustering
Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that those within each cluster are more closely related to one another than objects assigned to different clusters. An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects. Sometimes the goal is to arrange the clusters into a natural hierarchy, which involves successively grouping or merging the clusters themselves so that at each level of the hierarchy clusters within the same group are more similar to each other than those in different groups.
Proximity matrices are the input to most clustering algorithms
Proximity between pairs of objects: similarity or dissimilarity.
If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities.
Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does *not* have to hold. Triangle inequality:

d(i, i') <= d(i, k) + d(k, i') for all i, i', k
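To make the point concrete, here is a small illustrative check (hypothetical data, not from the slides) showing that Euclidean distances satisfy the triangle inequality while a hand-made dissimilarity matrix need not:

```python
import numpy as np

def satisfies_triangle_inequality(d):
    """Check d[i, i'] <= d[i, k] + d[k, i'] for all i, i', k."""
    n = d.shape[0]
    return all(
        d[i, j] <= d[i, k] + d[k, j] + 1e-12
        for i in range(n) for j in range(n) for k in range(n)
    )

# Euclidean distances form a metric, so the inequality holds
x = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
euclid = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
print(satisfies_triangle_inequality(euclid))  # True

# A clustering algorithm will also accept this symmetric dissimilarity,
# which violates the inequality (10 > 1 + 1):
d = np.array([[0.0, 1.0, 10.0],
              [1.0, 0.0, 1.0],
              [10.0, 1.0, 0.0]])
print(satisfies_triangle_inequality(d))  # False
```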
Different intergroup dissimilarities
Let G and H represent two groups, with N_G and N_H observations respectively.

Single linkage (SL):    d_SL(G, H) = min_{i in G, i' in H} d(i, i')
Complete linkage (CL):  d_CL(G, H) = max_{i in G, i' in H} d(i, i')
Group average (GA):     d_GA(G, H) = (1 / (N_G * N_H)) * sum_{i in G, i' in H} d(i, i')
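The three intergroup dissimilarities above are just the min, max, and mean of the cross-group block of the dissimilarity matrix. A minimal sketch (toy matrix and group indices chosen for illustration):

```python
import numpy as np

def single_linkage(d, g, h):
    """d_SL(G, H): minimum dissimilarity over i in G, i' in H."""
    return d[np.ix_(g, h)].min()

def complete_linkage(d, g, h):
    """d_CL(G, H): maximum dissimilarity over i in G, i' in H."""
    return d[np.ix_(g, h)].max()

def group_average(d, g, h):
    """d_GA(G, H): average dissimilarity over all cross-group pairs."""
    return d[np.ix_(g, h)].mean()

# Usage on a toy 4-object dissimilarity matrix with G = {0, 1}, H = {2, 3}
d = np.array([[0., 1., 4., 6.],
              [1., 0., 3., 5.],
              [4., 3., 0., 2.],
              [6., 5., 2., 0.]])
g, h = [0, 1], [2, 3]
print(single_linkage(d, g, h), complete_linkage(d, g, h), group_average(d, g, h))
# 3.0 6.0 4.5
```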
Agglomerative clustering, hierarchical clustering and dendrograms
Hierarchical clustering plot
Agglomerative clustering
• Agglomerative clustering algorithms begin with every observation representing a singleton cluster.
• At each of the N-1 subsequent steps, the two closest (least dissimilar) clusters are merged into a single cluster.
• Therefore a measure of dissimilarity between 2 clusters must be defined.
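The loop described above can be sketched as follows (a naive Python implementation for illustration, using single linkage as the intergroup dissimilarity; not the slide's own code and far slower than library routines):

```python
import numpy as np

def agglomerate(d, linkage=min):
    """Naive agglomerative clustering on an n x n dissimilarity matrix.
    Starts with n singleton clusters and performs n - 1 merges, each time
    joining the pair of clusters with the smallest intergroup dissimilarity
    (default: single linkage, the min over cross-cluster pairs)."""
    clusters = [[i] for i in range(d.shape[0])]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage(d[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a][:], clusters[b][:], dist))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return merges

# Toy example: two tight pairs {0, 1} and {2, 3}
d = np.array([[0., 1., 5., 6.],
              [1., 0., 4., 6.],
              [5., 4., 0., 2.],
              [6., 6., 2., 0.]])
merges = agglomerate(d)
print(len(merges))  # 3 merges for N = 4 observations (N - 1)
```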
Comparing different linkage methods

If there is a strong clustering tendency, all 3 methods produce similar results.

Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"). Good for elongated clusters.

Bad: complete linkage may lead to clusters where observations assigned to a cluster can be much closer to members of other clusters than they are to some members of their own cluster. Use for very compact clusters (like pearls on a string).

Group average clustering represents a compromise between the extremes of single and complete linkage. Use for ball-shaped clusters.
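The chaining behavior can be seen on a small made-up example (assuming SciPy is available; the data here are illustrative, not from the slides). For ten evenly spaced points on a line, single linkage joins the whole chain at the nearest-neighbor distance, while complete linkage only finishes merging at the full diameter of the data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Ten evenly spaced points on a line: an "elongated" cluster where each
# observation is close only to its neighbors.
x = np.arange(10.0).reshape(-1, 1)
single = linkage(x, method="single")
complete = linkage(x, method="complete")

# Single linkage chains through the intermediates: every merge happens
# at height 1, the nearest-neighbor distance.
print(single[:, 2].max())    # 1.0
# Complete linkage resists chaining: the final merge height equals the
# diameter of the data.
print(complete[:, 2].max())  # 9.0
```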
Dendrogram
Recursive binary splitting/agglomeration can be represented by a rooted binary tree.
The root node represents the entire data set. The N terminal nodes of the tree represent individual observations. Each nonterminal node ("parent") has two daughter nodes. The binary tree can thus be plotted so that the height of each node is proportional to the value of the intergroup dissimilarity between its 2 daughters.
A dendrogram provides a complete description of the hierarchical clustering in graphical format.
Comments on dendrograms
Caution: different hierarchical methods as well as small changes in the data can lead to different dendrograms.
Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.
In general, dendrograms are a description of the results of the algorithm and not a graphical summary of the data.

They are a valid summary only to the extent that the pairwise *observation* dissimilarities obey the ultrametric inequality:

d(i, i') <= max(d(i, k), d(i', k)) for all i, i', k
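Whatever the input dissimilarities look like, the *cophenetic* dissimilarities read off a dendrogram (the height at which two observations first join the same cluster) always satisfy the ultrametric inequality. A sketch of that check with SciPy (random illustrative data, assuming SciPy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

# The cophenetic dissimilarity C[i, i'] is the height at which observations
# i and i' are first merged into the same cluster.
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 3))
z = linkage(pdist(x), method="average")
c = squareform(cophenet(z))

# Unlike raw distances, cophenetic dissimilarities satisfy the
# ultrametric inequality C[i, i'] <= max(C[i, k], C[i', k]).
n = c.shape[0]
ultrametric = all(
    c[i, j] <= max(c[i, k], c[j, k]) + 1e-12
    for i in range(n) for j in range(n) for k in range(n)
)
print(ultrametric)  # True
```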
Figure 13

[Figure: three cluster dendrograms of the same data, produced in R by hclust(dist(xsimple), "average"), hclust(dist(xsimple), "complete"), and hclust(dist(xsimple), "single"); the vertical axis is Height. Panels from left to right: average, complete, single linkage.]
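The three panels above come from R's hclust. As a rough stand-in (the slide's xsimple data are not shown, so a small synthetic data set is used here), the same three linkage trees can be built in Python with SciPy; dendrogram(..., no_plot=True) returns the leaf ordering that a plot would draw:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# xsimple from the slide is unavailable; two well-separated blobs instead.
rng = np.random.default_rng(42)
x = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(5, 0.5, (5, 2))])
d = pdist(x)

for method in ("average", "complete", "single"):
    z = linkage(d, method=method)
    # no_plot=True returns the layout matplotlib would draw, without plotting
    leaves = dendrogram(z, no_plot=True)["leaves"]
    print(method, len(leaves))
```

With strongly separated blobs like these, all three methods agree on the top-level split, echoing the earlier point that a strong clustering tendency makes the linkage choice less critical.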