MMG991 Session 7 - Michigan State University


MMG991 Session 7

• Non-hierarchical cluster analysis
  – Review fundamental concepts
  – As implemented in S-Plus
    • Microarray data
    • Other applications
  – Open discussion on implementation
• Selection of projects

Cluster analysis revisited

• Hierarchical methods
  – Goals
  – Agglomerative
  – Divisive
  – Unsupervised
  – Output
• Partitioning methods
  – Goals
  – k-means
  – pam, clara and fanny
  – Supervised
• Selecting the number of groups
  – cutree()
  – Output

Cutting the tree

• cutree()
  – Returns a vector of group numbers for the objects clustered
  – Inputs: a tree (the output of hclust()) and either the height of the cut (h) or the number of groups (k)
• Visualizing the cuts
  – Currently no default plotting routine
  – So, what can we do?
    • Table of groupings
    • "decorate" the tree
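The trick behind cutree() can be sketched outside S-Plus: to get k groups, replay only the first n - k agglomerative merges. A minimal Python sketch (cut_tree and its merge-list input are hypothetical, assuming the merge sequence is given as pairs of leaf indices whose clusters were joined, earliest first):

```python
def cut_tree(merges, n, k):
    """Assign n leaves to k groups by replaying only the first n-k
    agglomerative merges; union-find tracks which cluster a leaf is in."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in merges[:n - k]:
        parent[find(a)] = find(b)

    # Relabel roots as consecutive group numbers, like cutree()'s output
    labels, groups = {}, []
    for i in range(n):
        r = find(i)
        labels.setdefault(r, len(labels) + 1)
        groups.append(labels[r])
    return groups
```

With k equal to the number of leaves, no merges are replayed and every object gets its own group, mirroring cutree()'s behavior at a cut below the first join.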

Setting up the example

set1a <- read.table("set1a.txt", header=T, sep="\t")
set1a[,1] <- paste(as.character(set1a[,1]), "1a", sep=".")
row.names(set1a) <- set1a[,1]
set1a <- set1a[sort(dimnames(set1a)[[1]]), -1]

set1a.norm <- (set1a - apply(set1a, 1, mean)) / apply(set1a, 1, stdev)
for(i in 1:ncol(set1a.norm))
  dimnames(set1a.norm)[[2]][i] <- paste("exp-", as.character(i), sep="")

graphsheet()
par(mfcol=c(2,1))

Before

[Figure: dendrogram with numbered leaves above an image of the unordered data matrix]

Drawing the figures

set1a.clust <- hclust(dist(set1a.norm, met="euc"), met="aver")
set1a.clust <- clorder(set1a.clust, apply(set1a.norm, 1, mean))
set1a.plclust <- plclust(set1a.clust, labels=FALSE)
set1a.cutree <- cutree(set1a.clust, k=14)
temp <- cbind(set1a.plclust$x, set1a.plclust$y, col=as.vector(set1a.cutree))
for(i in 1:14)
  points(temp[temp[,3]==i,1], temp[temp[,3]==i,2], col=i, pch=16)

set1a.norm <- set1a.norm[set1a.clust$order,]
image(list(x=1:dim(set1a.norm)[1], y=1:dim(set1a.norm)[2], z=as.matrix(set1a.norm)))
image.legend(as.matrix(set1a.norm), x=nrow(set1a.norm)*1.066, y=ncol(set1a.norm)*1.05, size=c(.1, 2.55), hor=F, cex=0.66, mgp=c(0,0.25,0))

Second dimension

set1a.tclust <- hclust(dist(t(set1a.norm), met="euc"), met="aver")
set1a.tclust <- clorder(set1a.tclust, apply(t(set1a.norm), 1, mean))
set1a.tplclust <- plclust(set1a.tclust, labels=FALSE)
set1a.tcutree <- cutree(set1a.tclust, k=6)
temp <- cbind(set1a.tplclust$x, set1a.tplclust$y, col=as.vector(set1a.tcutree))
for(i in 1:6)
  points(temp[temp[,3]==i,1], temp[temp[,3]==i,2], col=i, pch=16)

set1a.norm <- set1a.norm[,set1a.tclust$order]
image(list(x=1:dim(set1a.norm)[1], y=1:dim(set1a.norm)[2], z=as.matrix(set1a.norm)))
image.legend(as.matrix(set1a.norm), x=nrow(set1a.norm)*1.066, y=ncol(set1a.norm)*1.05, size=c(.1, 2.55), hor=F, cex=0.66, mgp=c(0,0.25,0))

Gene clusters identified

[Figure: decorated dendrogram and reordered data-matrix image, legend scale -3 to 3]

Experiment clusters identified

[Figure: decorated dendrogram and reordered data-matrix image, legend scale -3 to 3]

k-means

• Objective
  – Partition observations into groups that minimize the within-group sum of squared distances (withinss)
  – Centroids
  – Requires a defined number of groups
  – Determining the optimum number of groups
  – No graphical output
• The classic example
  – Ruspini's data set
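The objective above is what Lloyd's algorithm minimizes locally: alternate assigning each point to its nearest centroid and moving each centroid to the mean of its group. A pure-Python sketch of that idea (an illustration, not the S-Plus kmeans() implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm sketch: returns (centers, labels).  Iterating
    assignment and update steps locally minimizes the within-group
    sum of squared distances (withinss)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k random observations
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(p, centers[j])))
                  for p in points]
        # update step: move each centroid to the mean of its group
        new = []
        for j in range(k):
            grp = [p for p, l in zip(points, labels) if l == j]
            new.append(tuple(sum(c) / len(grp) for c in zip(*grp))
                       if grp else centers[j])
        if new == centers:  # converged: no centroid moved
            break
        centers = new
    return centers, labels
```

Because the result depends on the random starting centroids, S-Plus and this sketch alike can land in different local minima on different runs.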

kmeans(ruspini, 4)
Centers:
            x        y
[1,] 98.17647 114.8824
[2,] 20.15000  64.9500
[3,] 43.91304 146.0435
[4,] 68.93333  19.4000

Clustering vector:
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[40] 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Within cluster sum of squares:
[1] 4558.235 3689.500 3176.783 1456.533

Cluster sizes:
[1] 17 20 23 15

Available arguments:
[1] "cluster" "centers" "withinss" "size"

Ruspini dataset

[Figure: scatterplot of ruspini$x vs. ruspini$y]

Ruspini dataset, k=4

[Figure: scatterplot of ruspini[,1] vs. ruspini[,2], k=4 clustering]

Ruspini dataset, k=5

[Figure: scatterplot of ruspini[,1] vs. ruspini[,2], k=5 clustering]

So, how many clusters are there?

• Hartigan's recommendation
  – If (sum(k$withinss)/sum(kplus1$withinss) - 1) * (nrow(x) - k - 1) > 10, the addition of a group is justified
• Setting up a test…

kscore.ruspini <- as.list(2:21)
for(i in 2:20){
  k <- kmeans(ruspini, i)
  kscore.ruspini[[i]] <- k$withinss
}
for(i in 2:19){
  print((sum(kscore.ruspini[[i]])/sum(kscore.ruspini[[i+1]])-1)*(nrow(ruspini)-i-1))
}

[3] 53.74084
[4] 210.9672
[5] 8.920013
[6] 21.9182
[7] 13.646
[8] 10.26787
[9] 12.0679
[10] 6.488705
[11] 13.35045
[12] 9.935521
[13] 4.754963
[14] 7.834783
[15] 6.378573
[16] 3.886348
[17] 2.072307
[18] 4.197096
[19] 5.176718
[20] 4.354051
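Hartigan's rule can be restated compactly in Python (hartigan is a hypothetical helper for illustration; withinss maps each k to its total within-group sum of squares):

```python
def hartigan(withinss, n, threshold=10.0):
    """Hartigan's rule from the slide: going from k to k+1 groups is
    justified when (W_k / W_{k+1} - 1) * (n - k - 1) > threshold."""
    scores = {k: (withinss[k] / withinss[k + 1] - 1) * (n - k - 1)
              for k in sorted(withinss) if k + 1 in withinss}
    # keep the smallest k for which one more group is NOT justified
    for k in sorted(scores):
        if scores[k] <= threshold:
            return k, scores
    return max(scores), scores
```

On the Ruspini scores above, the first drop to 10 or below occurs at k = 5 (8.92), which matches the visual impression that four clusters suffice.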

k-means with array data

set1a.norm <- (set1a - apply(set1a, 1, mean)) / apply(set1a, 1, stdev)
set1a.norm <- set1a.norm[sort(dimnames(set1a.norm)[[1]]),]
set1a.kmeans <- kmeans(set1a.norm, 14)
gene.order <- cbind(dimnames(set1a)[[1]], set1a.kmeans$cluster)
gene.order <- gene.order[order(gene.order[,2]),1]
set1a.kmeans <- kmeans(t(set1a.norm), 6)
exp.order <- cbind(dimnames(set1a)[[2]], set1a.kmeans$cluster)
exp.order <- exp.order[order(exp.order[,2]),1]

# to visualize the output of the two analyses
temp <- set1a.norm[gene.order, exp.order]
image(list(x=1:dim(temp)[1], y=1:dim(temp)[2], z=as.matrix(temp)))

Kmeans, genes=14, exp=6

[Figure: reordered data-matrix image, legend scale -3 to 3]

Optimum number of clusters

       Genes      Experiments
 [3]  52.11781    82.84932
 [4]  61.86213   111.1703
 [5]  76.23372    52.30111
 [6]  57.22758    46.1943
 [7] 134.5601     46.47232
 [8] 115.662      49.47469
 [9] 111.2952     23.58832
[10]  89.475      39.69845
[11] 182.0612     28.4224
[12] 187.1214     28.27107
[13] 153.94       43.54086
[14] 316.5357     22.23373
[15]   8.525307   35.81718
[16]   4.961622   23.46791
[17]   6.483269   29.98056
[18]   6.932642   24.97635
[19]   3.641691   31.67716
[20]   3.420847   30.31866

Optimized k-means clustering

[Figure: reordered data-matrix image, legend scale -3 to 3]

Six steps of cluster analysis

• Obtaining the data matrix
  – Test data set
• Standardizing the data matrix
  – Normalization
• Computing the resemblance matrix
  – Similarity
  – Dissimilarity
  – Distance
  – Other measures
• Clustering the data
• Rearranging the data matrix
• Goodness of fit
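Steps two and three can be sketched in a few lines of Python (standardize and distance_matrix are illustrative helpers; the slides do the same jobs with apply()/stdev and dist() in S-Plus):

```python
import math

def standardize(row):
    """Row-wise z-score, like (set1a - mean) / stdev in the slides."""
    m = sum(row) / len(row)
    s = math.sqrt(sum((x - m) ** 2 for x in row) / (len(row) - 1))
    return [(x - m) / s for x in row]

def distance_matrix(rows):
    """Resemblance matrix of pairwise Euclidean distances between the
    rows of a data matrix, analogous to dist(x, met="euc")."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
    return d
```

Standardizing each gene first, as the slides do, keeps the distance step from being dominated by highly expressed genes.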

Partitioning around medoids (pam())

• Similar to k-means
  – Utilizes medoids rather than centroids
  – More robust
    • Minimizes the sum of dissimilarities rather than the sum of squared Euclidean distances
  – Provides graphical output to evaluate clustering
    • Silhouette plots
      – Denote the number of clusters, cluster width and quality
      – Ranked in decreasing order
      – Overall average silhouette width
        » Heuristics
  – pam(x, k, diss=F, metric="euclidean", stand=F, save.x=T, save.diss=T)
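The silhouette width behind these plots is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average dissimilarity of object i to its own cluster and b(i) its average dissimilarity to the nearest other cluster. A pure-Python sketch of the computation (illustrative, not the cluster-library code):

```python
import math

def silhouette_widths(points, labels):
    """s(i) near 1: object sits well inside its cluster; near 0:
    between clusters; negative: probably misclassified."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    widths = []
    for i, (p, l) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, m) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(m, []).append(dist(p, q))
        if l not in by_cluster:       # singleton cluster: s(i) = 0
            widths.append(0.0)
            continue
        a = sum(by_cluster[l]) / len(by_cluster[l])
        b = min(sum(v) / len(v) for m, v in by_cluster.items() if m != l)
        widths.append((b - a) / max(a, b))
    return widths
```

Averaging these widths over all objects gives the "overall average silhouette width" reported under each plot, which is the heuristic used to compare choices of k.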

pam() with array data

set1a.pam <- pam(set1a.norm, 14)
set1a.tpam <- pam(t(set1a.norm), 4)
gene.order <- cbind(dimnames(set1a.norm)[[1]], set1a.pam$clustering)
gene.order <- gene.order[order(gene.order[,2]),1]
exp.order <- cbind(dimnames(set1a.norm)[[2]], set1a.tpam$clustering)
exp.order <- exp.order[order(exp.order[,2]),1]
temp <- set1a.norm[gene.order, exp.order]
image(list(x=1:dim(temp)[1], y=1:dim(temp)[2], z=as.matrix(temp)))
image.legend(as.matrix(temp), x=nrow(temp)*1.075, y=ncol(temp)*1.05, size=c(.125, 6.1), hor=F, cex=0.66, tck=-0.01, mgp=c(0,0.5,0))

Silhouette plot, grouped by gene, k=14

[Figure: silhouette plot; average silhouette width: 0.83]

Silhouette plot, grouped by expt, k=4

[Figure: silhouette plot; average silhouette width: 0.29]

DNA array data by pam()

[Figure: reordered data-matrix image, legend scale -3 to 3]

Clustering large applications

• clara()
  – Optimized version of pam()
  – Limitations of k-means and pam()
    • Memory requirements are quadratic
  – Algorithm works with subsets
    • Divides data into k clusters
    • Remaining objects assigned to clusters
    • Subsequent iterations forced to contain the currently best medoids
  – clara(x, k, metric="euclidean", stand=F, samples=5, sampsize=40 + 2 * k, save.x=T, save.diss=T)
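The subset strategy can be sketched in Python (clara_sketch is an illustrative helper, not the S-Plus function; a brute-force medoid search on each sample stands in for pam's build/swap phases):

```python
import itertools
import math
import random

def clara_sketch(points, k, samples=5, sampsize=12, seed=0):
    """clara()'s strategy: find k medoids on several small samples and
    keep the medoid set with the lowest total dissimilarity over the
    FULL data set, so memory stays proportional to the sample size."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def cost(medoids, pts):
        return sum(min(dist(p, m) for m in medoids) for p in pts)

    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(samples):
        sub = rng.sample(points, min(sampsize, len(points)))
        meds = min(itertools.combinations(sub, k),
                   key=lambda ms: cost(ms, sub))   # search on the sample
        c = cost(meds, points)                     # score on all the data
        if c < best_cost:
            best, best_cost = list(meds), c
    # assign every object to its nearest winning medoid
    labels = [min(range(k), key=lambda j: dist(p, best[j])) for p in points]
    return best, labels
```

Only a sampsize-by-sampsize dissimilarity matrix is ever needed, which is the point of clara() over pam() on large arrays.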

Silhouette plots

[Figure: silhouette plot; average silhouette width: 0.84]

[Figure: silhouette plot; average silhouette width: 0.31]

DNA array data by clara()

[Figure: reordered data-matrix image, legend scale -3 to 3]

Silhouette plots

[Figure: silhouette plot; average silhouette width: 0.77]

[Figure: silhouette plot; average silhouette width: 0.22]

[Figure: reordered data-matrix image, legend scale -3 to 3]

DNA array data by fanny()

Summing up

• Cluster analysis provides a means of organizing the data based on common features

• Different algorithms may arrive at different solutions

• Homework for next week
  – Comparing the output of hierarchical and partitioning methods
    • Use Eisen's test data
  – Which genes consistently group together?
  – Which experiments consistently group together?
  – Projects
