MMG991 Session 7 - Michigan State University


MMG991 Session 7

• Non-hierarchical cluster analysis
  – Review fundamental concepts
  – As implemented in S-Plus
    • Microarray data
    • Other applications
  – Open discussion on implementation
• Selection of projects

Cluster analysis revisited

• Hierarchical methods
  – Goals
  – Agglomerative
  – Divisive
  – Unsupervised
  – Output
• Partitioning methods
  – Goals
  – k-means
  – pam, clara and fanny
  – Supervised
• Selecting the number of groups
  – cutree()
  – Output

Cutting the tree

• cutree()
  – Returns a vector of group numbers for the objects clustered
  – Inputs: a tree (the output of hclust()) and either the height of the cut (h) or the number of groups (k)
• Visualizing the cuts
  – Currently no default plotting routine
  – So, what can we do?
    • Table of groupings
    • "decorate" the tree
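The trick behind cutree() can be sketched outside S-Plus: to get k groups, replay only the first n - k agglomerative merges. A minimal Python sketch (cut_tree and its merge-list input are hypothetical, assuming the merge sequence is given as pairs of leaf indices whose clusters were joined, earliest first):

```python
def cut_tree(merges, n, k):
    """Assign n leaves to k groups by replaying only the first n-k
    agglomerative merges; union-find tracks which cluster a leaf is in."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in merges[:n - k]:
        parent[find(a)] = find(b)

    # Relabel roots as consecutive group numbers, like cutree()'s output
    labels, groups = {}, []
    for i in range(n):
        r = find(i)
        labels.setdefault(r, len(labels) + 1)
        groups.append(labels[r])
    return groups
```

With k equal to the number of leaves, no merges are replayed and every object gets its own group, mirroring cutree()'s behavior at a cut below the first join.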

Setting up the example

set1a <- read.table("set1a.txt", header=T, sep="\t")
set1a[,1] <- paste(as.character(set1a[,1]), "1a", sep=".")
row.names(set1a) <- set1a[,1]
set1a <- set1a[sort(dimnames(set1a)[[1]]), -1]

set1a.norm <- (set1a - apply(set1a, 1, mean)) / apply(set1a, 1, stdev)
for(i in 1:ncol(set1a.norm))
  dimnames(set1a.norm)[[2]][i] <- paste("exp-", as.character(i), sep="")

graphsheet()
par(mfcol=c(2,1))

Before

[Figure: dendrogram with numbered leaves above an image of the unordered data matrix]

Drawing the figures

set1a.clust <- hclust(dist(set1a.norm, met="euc"), met="aver")
set1a.clust <- clorder(set1a.clust, apply(set1a.norm, 1, mean))
set1a.plclust <- plclust(set1a.clust, labels=FALSE)
set1a.cutree <- cutree(set1a.clust, k=14)
temp <- cbind(set1a.plclust$x, set1a.plclust$y, col=as.vector(set1a.cutree))
for(i in 1:14)
  points(temp[temp[,3]==i,1], temp[temp[,3]==i,2], col=i, pch=16)

set1a.norm <- set1a.norm[set1a.clust$order,]
image(list(x=1:dim(set1a.norm)[1], y=1:dim(set1a.norm)[2], z=as.matrix(set1a.norm)))
image.legend(as.matrix(set1a.norm), x=nrow(set1a.norm)*1.066, y=ncol(set1a.norm)*1.05, size=c(.1, 2.55), hor=F, cex=0.66, mgp=c(0,0.25,0))

Second dimension

set1a.tclust <- hclust(dist(t(set1a.norm), met="euc"), met="aver")
set1a.tclust <- clorder(set1a.tclust, apply(t(set1a.norm), 1, mean))
set1a.tplclust <- plclust(set1a.tclust, labels=FALSE)
set1a.tcutree <- cutree(set1a.tclust, k=6)
temp <- cbind(set1a.tplclust$x, set1a.tplclust$y, col=as.vector(set1a.tcutree))
for(i in 1:6)
  points(temp[temp[,3]==i,1], temp[temp[,3]==i,2], col=i, pch=16)

set1a.norm <- set1a.norm[,set1a.tclust$order]
image(list(x=1:dim(set1a.norm)[1], y=1:dim(set1a.norm)[2], z=as.matrix(set1a.norm)))
image.legend(as.matrix(set1a.norm), x=nrow(set1a.norm)*1.066, y=ncol(set1a.norm)*1.05, size=c(.1, 2.55), hor=F, cex=0.66, mgp=c(0,0.25,0))

Gene clusters identified

[Figure: decorated dendrogram and reordered data-matrix image, legend scale -3 to 3]

Experiment clusters identified

[Figure: decorated dendrogram and reordered data-matrix image, legend scale -3 to 3]

k-means

• Objective
  – Partition observations into groups that minimize the within-group sum of squared distances (withinss)
  – Centroids
  – Requires a defined number of groups
  – Determining the optimum number of groups
  – No graphical output
• The classic example
  – Ruspini's data set
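The objective above is what Lloyd's algorithm minimizes locally: alternate assigning each point to its nearest centroid and moving each centroid to the mean of its group. A pure-Python sketch of that idea (an illustration, not the S-Plus kmeans() implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm sketch: returns (centers, labels).  Iterating
    assignment and update steps locally minimizes the within-group
    sum of squared distances (withinss)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k random observations
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(p, centers[j])))
                  for p in points]
        # update step: move each centroid to the mean of its group
        new = []
        for j in range(k):
            grp = [p for p, l in zip(points, labels) if l == j]
            new.append(tuple(sum(c) / len(grp) for c in zip(*grp))
                       if grp else centers[j])
        if new == centers:  # converged: no centroid moved
            break
        centers = new
    return centers, labels
```

Because the result depends on the random starting centroids, S-Plus and this sketch alike can land in different local minima on different runs.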

kmeans(ruspini, 4)
Centers:
            x        y
[1,] 98.17647 114.8824
[2,] 20.15000  64.9500
[3,] 43.91304 146.0435
[4,] 68.93333  19.4000

Clustering vector:
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[40] 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Within cluster sum of squares:
[1] 4558.235 3689.500 3176.783 1456.533

Cluster sizes:
[1] 17 20 23 15

Available arguments:
[1] "cluster" "centers" "withinss" "size"

Ruspini dataset

[Figure: scatterplot of ruspini$x vs. ruspini$y]

Ruspini dataset, k=4

[Figure: scatterplot of ruspini[,1] vs. ruspini[,2], k=4 clustering]

Ruspini dataset, k=5

[Figure: scatterplot of ruspini[,1] vs. ruspini[,2], k=5 clustering]

So, how many clusters are there?

• Hartigan's recommendation
  – If (sum(k$withinss)/sum(kplus1$withinss) - 1) * (nrow(x) - k - 1) > 10, the addition of a group is justified
• Setting up a test…

kscore.ruspini <- as.list(2:21)
for(i in 2:20){
  k <- kmeans(ruspini, i)
  kscore.ruspini[[i]] <- k$withinss
}
for(i in 2:19){
  print((sum(kscore.ruspini[[i]])/sum(kscore.ruspini[[i+1]])-1)*(nrow(ruspini)-i-1))
}

[3] 53.74084
[4] 210.9672
[5] 8.920013
[6] 21.9182
[7] 13.646
[8] 10.26787
[9] 12.0679
[10] 6.488705
[11] 13.35045
[12] 9.935521
[13] 4.754963
[14] 7.834783
[15] 6.378573
[16] 3.886348
[17] 2.072307
[18] 4.197096
[19] 5.176718
[20] 4.354051
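Hartigan's rule can be restated compactly in Python (hartigan is a hypothetical helper for illustration; withinss maps each k to its total within-group sum of squares):

```python
def hartigan(withinss, n, threshold=10.0):
    """Hartigan's rule from the slide: going from k to k+1 groups is
    justified when (W_k / W_{k+1} - 1) * (n - k - 1) > threshold."""
    scores = {k: (withinss[k] / withinss[k + 1] - 1) * (n - k - 1)
              for k in sorted(withinss) if k + 1 in withinss}
    # keep the smallest k for which one more group is NOT justified
    for k in sorted(scores):
        if scores[k] <= threshold:
            return k, scores
    return max(scores), scores
```

On the Ruspini scores above, the first drop to 10 or below occurs at k = 5 (8.92), which matches the visual impression that four clusters suffice.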

k-means with array data

set1a.norm <- (set1a - apply(set1a, 1, mean)) / apply(set1a, 1, stdev)
set1a.norm <- set1a.norm[sort(dimnames(set1a.norm)[[1]]),]
set1a.kmeans <- kmeans(set1a.norm, 14)
gene.order <- cbind(dimnames(set1a)[[1]], set1a.kmeans$cluster)
gene.order <- gene.order[order(gene.order[,2]),1]
set1a.kmeans <- kmeans(t(set1a.norm), 6)
exp.order <- cbind(dimnames(set1a)[[2]], set1a.kmeans$cluster)
exp.order <- exp.order[order(exp.order[,2]),1]

# to visualize the output of the two analyses
temp <- set1a.norm[gene.order, exp.order]
image(list(x=1:dim(temp)[1], y=1:dim(temp)[2], z=as.matrix(temp)))

Kmeans, genes=14, exp=6

[Figure: reordered data-matrix image, legend scale -3 to 3]

Optimum number of clusters

       Genes      Experiments
 [3]  52.11781    82.84932
 [4]  61.86213   111.1703
 [5]  76.23372    52.30111
 [6]  57.22758    46.1943
 [7] 134.5601     46.47232
 [8] 115.662      49.47469
 [9] 111.2952     23.58832
[10]  89.475      39.69845
[11] 182.0612     28.4224
[12] 187.1214     28.27107
[13] 153.94       43.54086
[14] 316.5357     22.23373
[15]   8.525307   35.81718
[16]   4.961622   23.46791
[17]   6.483269   29.98056
[18]   6.932642   24.97635
[19]   3.641691   31.67716
[20]   3.420847   30.31866

Optimized k-means clustering

[Figure: reordered data-matrix image, legend scale -3 to 3]

Six steps of cluster analysis

• Obtaining the data matrix
  – Test data set
• Standardizing the data matrix
  – Normalization
• Computing the resemblance matrix
  – Similarity
  – Dissimilarity
  – Distance
  – Other measures
• Clustering the data
• Rearranging the data matrix
• Goodness of fit
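Steps two and three can be sketched in a few lines of Python (standardize and distance_matrix are illustrative helpers; the slides do the same jobs with apply()/stdev and dist() in S-Plus):

```python
import math

def standardize(row):
    """Row-wise z-score, like (set1a - mean) / stdev in the slides."""
    m = sum(row) / len(row)
    s = math.sqrt(sum((x - m) ** 2 for x in row) / (len(row) - 1))
    return [(x - m) / s for x in row]

def distance_matrix(rows):
    """Resemblance matrix of pairwise Euclidean distances between the
    rows of a data matrix, analogous to dist(x, met="euc")."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
    return d
```

Standardizing each gene first, as the slides do, keeps the distance step from being dominated by highly expressed genes.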

Partitioning around medoids (pam())

• Similar to k-means
  – Utilizes medoids rather than centroids
  – More robust
    • Minimizes the sum of dissimilarities rather than the sum of squared Euclidean distances
  – Provides graphical output to evaluate clustering
    • Silhouette plots
      – Denote the number of clusters, cluster width and quality
      – Ranked in decreasing order
      – Overall average silhouette width
        » Heuristics
  – pam(x, k, diss=F, metric="euclidean", stand=F, save.x=T, save.diss=T)
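The silhouette width behind these plots is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average dissimilarity of object i to its own cluster and b(i) its average dissimilarity to the nearest other cluster. A pure-Python sketch of the computation (illustrative, not the cluster-library code):

```python
import math

def silhouette_widths(points, labels):
    """s(i) near 1: object sits well inside its cluster; near 0:
    between clusters; negative: probably misclassified."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    widths = []
    for i, (p, l) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, m) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(m, []).append(dist(p, q))
        if l not in by_cluster:       # singleton cluster: s(i) = 0
            widths.append(0.0)
            continue
        a = sum(by_cluster[l]) / len(by_cluster[l])
        b = min(sum(v) / len(v) for m, v in by_cluster.items() if m != l)
        widths.append((b - a) / max(a, b))
    return widths
```

Averaging these widths over all objects gives the "overall average silhouette width" reported under each plot, which is the heuristic used to compare choices of k.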

pam() with array data

set1a.pam <- pam(set1a.norm, 14)
set1a.tpam <- pam(t(set1a.norm), 4)
gene.order <- cbind(dimnames(set1a.norm)[[1]], set1a.pam$clustering)
gene.order <- gene.order[order(gene.order[,2]),1]
exp.order <- cbind(dimnames(set1a.norm)[[2]], set1a.tpam$clustering)
exp.order <- exp.order[order(exp.order[,2]),1]
temp <- set1a.norm[gene.order, exp.order]
image(list(x=1:dim(temp)[1], y=1:dim(temp)[2], z=as.matrix(temp)))
image.legend(as.matrix(temp), x=nrow(temp)*1.075, y=ncol(temp)*1.05, size=c(.125, 6.1), hor=F, cex=0.66, tck=-0.01, mgp=c(0,0.5,0))

Silhouette plot, grouped by gene, k=14

[Figure: silhouette plot; average silhouette width: 0.83]

Silhouette plot, grouped by expt, k=4

[Figure: silhouette plot; average silhouette width: 0.29]

DNA array data by pam()

[Figure: reordered data-matrix image, legend scale -3 to 3]

Clustering large applications

• clara()
  – Optimized version of pam()
  – Limitations of k-means and pam()
    • Memory requirements are quadratic
  – Algorithm works with subsets
    • Divides data into k clusters
    • Remaining objects assigned to clusters
    • Subsequent iterations forced to contain the currently best medoids
  – clara(x, k, metric="euclidean", stand=F, samples=5, sampsize=40 + 2 * k, save.x=T, save.diss=T)
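The subset strategy can be sketched in Python (clara_sketch is an illustrative helper, not the S-Plus function; a brute-force medoid search on each sample stands in for pam's build/swap phases):

```python
import itertools
import math
import random

def clara_sketch(points, k, samples=5, sampsize=12, seed=0):
    """clara()'s strategy: find k medoids on several small samples and
    keep the medoid set with the lowest total dissimilarity over the
    FULL data set, so memory stays proportional to the sample size."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def cost(medoids, pts):
        return sum(min(dist(p, m) for m in medoids) for p in pts)

    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(samples):
        sub = rng.sample(points, min(sampsize, len(points)))
        meds = min(itertools.combinations(sub, k),
                   key=lambda ms: cost(ms, sub))   # search on the sample
        c = cost(meds, points)                     # score on all the data
        if c < best_cost:
            best, best_cost = list(meds), c
    # assign every object to its nearest winning medoid
    labels = [min(range(k), key=lambda j: dist(p, best[j])) for p in points]
    return best, labels
```

Only a sampsize-by-sampsize dissimilarity matrix is ever needed, which is the point of clara() over pam() on large arrays.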

Silhouette plots

[Figure: silhouette plot; average silhouette width: 0.84]

[Figure: silhouette plot; average silhouette width: 0.31]

DNA array data by clara()

[Figure: reordered data-matrix image, legend scale -3 to 3]

Silhouette plots

[Figure: silhouette plot; average silhouette width: 0.77]

[Figure: silhouette plot; average silhouette width: 0.22]

[Figure: reordered data-matrix image, legend scale -3 to 3]

DNA array data by fanny()

Summing up

• Cluster analysis provides a means of organizing the data based on common features

• Different algorithms may arrive at different solutions

• Homework for next week
  – Comparing the output of hierarchical and partitioning methods
    • Use Eisen's test data
  – Which genes consistently group together?
  – Which experiments consistently group together?
  – Projects
