04/19/23 Raffaele Giancarlo 1
Microarray Data Analysis: Clustering and Validation Measures
Raffaele Giancarlo
Dipartimento di Matematica
Università di Palermo, Italy
What we want (typically)
Gene Expression Matrix (rows: genes, columns: expression levels)
• Group functionally related genes together
• Basic Axiom of Computational Biology, Guilt by Association: a high similarity among objects, as measured by mathematical functions, is a strong indication of functional relatedness… not always
• Clustering
What we want (typically)
Clustering Solution
Limitations in the Analysis Process
Limitations: Microarray Technology
• "MIAME, we have a problem", Robert Shields, Trends in Genetics, 2006
– …no amount of statistical or algorithmic knowledge can compensate for limitations of the technology itself
– A large proportion of the transcriptome is beyond the reach of current technology, i.e., the signal is too weak
Limitations: Visualization Tools
• One of these two clusters is random noise… which one???
Limitations: Statistics
• Towards sound epistemological foundations of statistical methods for high-dimensional biology- T. Mehta et al, Nature Genetics, 2004
– Many papers in omic research describe the development or application of statistical methods; many of those are questionable
Overview Of Remaining Part
• Clustering as a three-step process
• Internal validation techniques
• External validation techniques
• Experiments
• One-stop-shop software systems
• Some issues I really had to talk about
Cluster Analysis as a Three Step Process
What is clustering?
• Group similar objects together
         E1   E2   E3   E4
Gene 1   -2   +2   +2   -1
Gene 2   +8   +3    0   +4
Gene 3   -4   +5   +4   -2
Gene 4   -1   +4   +3   -1

Clustering genes (rows) vs. clustering experiments (columns)
What is Clustering?
• Goal: partition the observations {xi} so that
– C(i)=C(j) if xi and xj are “similar”
– C(i) ≠ C(j) if xi and xj are “dissimilar”
• Natural questions:
– What is a cluster?
– How do I choose a good similarity function?
– How do I choose a good algorithm?
• APPLICATION and DATA DEPENDENT
– How many clusters are REALLY present in the data?
What’s a Cluster?
• No rigorous definition
• Subjective
• Scale/resolution dependent (e.g., hierarchy)
Step One
• Choose a good similarity function:
– Euclidean Distance: captures magnitude and pattern of expression, i.e., direction
– Correlation functions: capture pattern of expression, i.e., direction
– Etc…
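As a toy illustration of the two choices above (hypothetical data, plain Python, no libraries assumed): Euclidean distance reacts to both the magnitude and the pattern of expression, while Pearson correlation reacts to the pattern (direction) only.

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance: sensitive to both magnitude and pattern."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation: sensitive to the pattern (direction) only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two hypothetical genes with the same expression pattern
# but very different magnitudes:
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]
print(euclidean(g1, g2))  # large: the magnitudes differ
print(pearson(g1, g2))    # close to 1.0: the patterns are identical
```

So a Euclidean-based clustering would separate g1 and g2, while a correlation-based one would group them: the "right" choice depends on the application and the data.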
Step Two
• Choose a good clustering algorithm. Algorithms may be broadly classified according to the objective function they optimize:
– Compactness: small intra-cluster variation. Favors well-separated or spherical clusters but fails on more complex cluster shapes. Examples: K-means, average-link hierarchical clustering
– Connectedness: neighboring items should share the same cluster. Robust with respect to cluster shapes, but fails when the separation in the data is poor. Examples: single-link hierarchical clustering, CAST, CLICK
– Spatial separation: a poor performer by itself, usually coupled with other criteria. Examples: simulated annealing, tabu search
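A minimal sketch of the compactness-driven family above: a bare-bones K-means (Lloyd's algorithm) on toy 1-D data. The explicit initial centroids are an assumption made here for determinism; real implementations use random or smarter initialization.

```python
def kmeans(points, centroids, iters=20):
    """Bare-bones Lloyd's algorithm on 1-D points."""
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

# Two well-separated 1-D groups; K-means recovers them.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
clusters, centers = kmeans(data, centroids=[0.0, 6.0])
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Minimizing the within-cluster variation is exactly the compactness criterion: note that on elongated or intertwined cluster shapes this procedure would split clusters incorrectly, as stated above.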
Step Three
• An index that tells us how many clusters are really present in the data: Consistency/Uniformity
more likely to be 2 than 3
more likely to be 2 than 3… or 6? (depends: what if each circle represents 1000 objects?)
Step Three
• An index that tells us: Separability
increasing confidence to be 2
Step Three
• An index that is:
– independent of cluster “volume”?
– independent of cluster size?
– independent of cluster shape?
– sensitive to outliers?
– etc…
• Theoretically sound: Gap Statistics
• Data driven and validated: many
Internal Validation Measures
• How many clusters are really present in the data?
• Assess cluster quality
• Internal: no external knowledge about the dataset is given
The Basic Scheme
• Given an index F, a function of a clustering solution
• A black box produces clustering solutions C_k with k = 2, …, m clusters
• Compute F(C_k) to decide which k is best
Internal Validation Measures
• Within-Cluster Sum of Squares [Folklore]
• Gap Statistics [Tibshirani, Walther, Hastie 2001]
• FOM [Yeung, Haynor, Ruzzo 2001]
• Consensus Clustering [Monti et al., 2003]
• Etc…
Within-Cluster Sum of Squares
D_r = \sum_{x_i \in C_r} \sum_{x_j \in C_r} \lVert x_i - x_j \rVert^2
Within-Cluster Sum of Squares
D_r = \sum_{x_i \in C_r} \sum_{x_j \in C_r} \lVert x_i - x_j \rVert^2 = 2 n_r \sum_{x_i \in C_r} \lVert x_i - \bar{x}_r \rVert^2

W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r
Measure of compactness of clusters
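A small sketch of the formulas above on toy 1-D data (the partition is assumed given here, since the clustering algorithm is a black box); the code also checks the identity D_r = 2 n_r Σ_{x ∈ C_r} ||x − x̄_r||².

```python
def D(cluster):
    """D_r: double sum of squared distances over all ordered pairs in C_r."""
    return sum((x - y) ** 2 for x in cluster for y in cluster)

def W(clusters):
    """W_k = sum over clusters of D_r / (2 * n_r)."""
    return sum(D(c) / (2 * len(c)) for c in clusters)

clusters = [[1.0, 2.0, 3.0], [10.0, 11.0]]

# Check the identity: D_r = 2 * n_r * sum of squared deviations from the mean.
for c in clusters:
    mean = sum(c) / len(c)
    assert abs(D(c) - 2 * len(c) * sum((x - mean) ** 2 for x in c)) < 1e-9

print(W(clusters))  # 2.5
```

In practice one computes W_k for each candidate k produced by the clustering black box and inspects the resulting curve, as described next.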
Using Wk to determine # clusters
Idea of the L-Curve Method: use the k corresponding to the “elbow” (the point beyond which increasing k yields little further improvement in goodness-of-fit)
Example
• Yeast Cell Cycle dataset: 698 genes and 72 conditions
• Five functional classes: the gold solution
• Algorithm: K-means with Av. Link input and Euclidean Distance
• We want to know how many clusters are predicted by Wk, with K-means as an “oracle”
Example
Problems with Use of Wk
• No reference clustering solution to compare against, i.e., no model
• The values of Wk are not normalized and therefore cannot be compared
• In a nutshell: we get values of Wk but we do not quite know how far we are from randomness
• Gap Statistics takes care of those problems
The Gap Statistics
• Based on solid statistical work for the 1-D case (i.e., the objects to be clustered are scalars); takes care of the problems outlined for Wk
• Extended to work in higher dimensions – No Theory
• Validated experimentally
Sample Uniformly and at Random
1. Align with feature axes (data-geometry independent)
(Figure: the observations, their bounding box aligned with the feature axes, and the Monte Carlo simulations sampled uniformly inside the box)
Computation of the Gap Statistic
for b = 1 to B
    Compute Monte Carlo sample X_{1b}, X_{2b}, …, X_{nb} (n is # obs.)
for k = 1 to K
    Cluster the observations into k groups and compute \log W_k
    for b = 1 to B
        Cluster the b-th M.C. sample into k groups and compute \log W_{kb}
    Compute Gap(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb} - \log W_k
    Compute sd(k), the s.d. of \{\log W_{kb}\}_{b=1,\ldots,B}
    Set the total s.e. s_k = sd(k) \sqrt{1 + 1/B}
Find the smallest k such that Gap(k) \ge Gap(k+1) - s_{k+1}
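The procedure above can be sketched as follows, as a simplified 1-D illustration under stated assumptions: the "black box" clusterer here is a trivial sort-and-split into contiguous groups (a stand-in, not K-means), and the reference samples are drawn uniformly from the 1-D bounding box.

```python
import math
import random

def split_cluster(xs, k):
    """Toy "black box" clusterer: sort the 1-D data and cut it
    into k contiguous groups (illustration only)."""
    xs = sorted(xs)
    n = len(xs)
    return [xs[i * n // k:(i + 1) * n // k] for i in range(k)]

def log_Wk(clusters):
    """log of the within-cluster sum of squares defined earlier."""
    total = 0.0
    for c in clusters:
        mean = sum(c) / len(c)
        total += sum((x - mean) ** 2 for x in c)
    return math.log(total)

def gap_statistic(xs, K=5, B=20, seed=0):
    rng = random.Random(seed)
    lo, hi = min(xs), max(xs)  # 1-D bounding box
    gap, s = [0.0], [0.0]      # dummy entries so gap[k] means Gap(k)
    for k in range(1, K + 1):
        lw = log_Wk(split_cluster(xs, k))
        # B Monte Carlo reference samples, uniform over the bounding box
        ref = [log_Wk(split_cluster([rng.uniform(lo, hi) for _ in xs], k))
               for _ in range(B)]
        mean_ref = sum(ref) / B
        sd = math.sqrt(sum((r - mean_ref) ** 2 for r in ref) / B)
        gap.append(mean_ref - lw)
        s.append(sd * math.sqrt(1 + 1 / B))
    # Smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}
    for k in range(1, K):
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k
    return K

# Two well-separated 1-D groups: the estimate should be 2.
rng = random.Random(1)
data = ([rng.gauss(0.0, 0.2) for _ in range(25)] +
        [rng.gauss(5.0, 0.2) for _ in range(25)])
print(gap_statistic(data))  # -> 2
```

Unlike the raw Wk curve, the Gap curve is normalized against the uniform reference, so it delivers an actual answer rather than an elbow to eyeball.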
Example
• The same experimental setting as for
Within-Sum of Squares
• We want to know whether the Gap Statistics predicts 5 clusters, with K-means as an “oracle”
Example
Figure of Merit
• A purely experimental approach, designed and validated specifically for microarray data
FOM

(Figure: the n × m expression matrix R, with genes 1, …, n as rows and experiments 1, …, m as columns; experiment e is left out, and the genes are clustered into C_1, …, C_k using the remaining experiments.)

FOM(e, k) = \sqrt{ \frac{1}{n} \sum_{i=1}^{k} \sum_{g \in C_i} \left( R(g, e) - \mu_{C_i}(e) \right)^2 }

FOM(k) = \sum_{e=1}^{m} FOM(e, k)

adjusted FOM(e, k) = FOM(e, k) / \sqrt{(n - k)/n}

where R(g, e) is the expression level of gene g in experiment e and \mu_{C_i}(e) is the average expression level in experiment e over the genes in cluster C_i.
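A sketch of FOM(e, k) on hypothetical toy data; the clustering (computed with condition e left out) is assumed to be given, since FOM treats the clustering algorithm as a black box.

```python
import math

def fom(R, clusters, e):
    """FOM(e, k): root mean squared deviation, in the left-out
    condition e, of each gene from its cluster mean.
    R: dict gene -> list of expression values (one per condition);
    clusters: list of lists of gene ids, computed WITHOUT condition e."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        mu = sum(R[g][e] for g in c) / len(c)  # cluster mean in condition e
        total += sum((R[g][e] - mu) ** 2 for g in c)
    return math.sqrt(total / n)

# Toy matrix: 4 genes x 3 conditions. Suppose that, leaving out
# condition 2, the algorithm grouped g1 with g2 and g3 with g4.
R = {"g1": [1, 1, 2], "g2": [1, 1, 4], "g3": [5, 5, 8], "g4": [5, 5, 10]}
clusters = [["g1", "g2"], ["g3", "g4"]]
print(fom(R, clusters, e=2))  # sqrt((1+1+1+1)/4) = 1.0
```

Summing fom(R, clusters_e, e) over every left-out condition e gives the aggregate FOM(k); a clustering that predicts the hidden column well yields a small value.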
FOM
Example
• Same experimental setting as in the Within Sum of Squares
• We want to know whether FOM indicates 5 clusters in the data set, with K-means as an “oracle”
• Hint: look for the elbow in the FOM plot, exactly as for the Wk curve.
Example
External Validation Measures
• Given two partitions of the same dataset, how close are they?
• Assess the quality of a partition against a given gold standard
• External: the gold standard, i.e., the reference partition, must be given and trusted. In the case of biology, the elements in a cluster must be biologically correlated, i.e., belong to the same functional group of genes
Some External Validation Measures
• The two partitions must have the same number of classes:
– Jaccard Index
– Minkowski score
– Rand Index [Rand 71]
• The two partitions can have a different number of classes:
– The Adjusted Rand Index [Hubert and Arabie 85]
– The F measure [van Rijsbergen 79]
Some External Validation Measures
• Problem with the mentioned indexes: what is their expected value?
• In very intuitive terms: if one blindly picks two partitions among the possible partitions of the data, what value of the index should we expect? This is the same problem we had with the Gap Statistics.
The Adjusted Rand Index
• It takes as input two partitions, not necessarily having the same number of classes.
– Value 1, its maximum, means perfect agreement
– The expected value of the index, i.e., its value on two partitions picked at random, is zero
• Note 1: the index may take negative values
• Note 2: the same property is not shared by the other mentioned indexes, including its relative, the Rand Index
– The index must be maximized
– We will see some of its uses later
Adjusted Rand index
• Compare clusters to classes
• Consider # pairs of objects

                  Same cluster   Different cluster
Same class             a                c
Different class        b                d
Example (Adjusted Rand)
Contingency table (rows: gold-standard classes, columns: clusters; sizes in parentheses):

             c#1(4)  c#2(5)  c#3(7)  c#4(4)
class#1(2)     2       0       0       0
class#2(3)     0       0       0       3
class#3(5)     1       4       0       0
class#4(10)    1       1       7       1

With C(·, 2) denoting “choose 2”:

a = C(2,2) + C(3,2) + C(4,2) + C(7,2) = 31
b = [C(4,2) + C(5,2) + C(7,2) + C(4,2)] - a = 43 - 31 = 12
c = [C(2,2) + C(3,2) + C(5,2) + C(10,2)] - a = 59 - 31 = 28
d = C(20,2) - a - b - c = 190 - 31 - 12 - 28 = 119

Rand: R = (a + d) / (a + b + c + d) = 150/190 = 0.789

Adjusted Rand: (R - E(R)) / (1 - E(R)) = 0.469

Closed form in the paper by Handl et al. (supplementary material)
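The example can be checked in code: a sketch computing both indexes from the same contingency table, using pair counts for the Rand index and the Hubert and Arabie closed form for the adjusted index.

```python
from math import comb

def rand_index(table):
    """Rand index R = (a + d) / (a + b + c + d) from a contingency table
    (rows: classes, columns: clusters)."""
    n = sum(sum(row) for row in table)
    pairs = comb(n, 2)
    a = sum(comb(v, 2) for row in table for v in row)
    same_class = sum(comb(sum(row), 2) for row in table)
    same_cluster = sum(comb(sum(row[j] for row in table), 2)
                       for j in range(len(table[0])))
    d = pairs + a - same_class - same_cluster
    return (a + d) / pairs

def adjusted_rand(table):
    """Adjusted Rand Index, Hubert and Arabie closed form."""
    n = sum(sum(row) for row in table)
    a = sum(comb(v, 2) for row in table for v in row)
    rows = sum(comb(sum(row), 2) for row in table)
    cols = sum(comb(sum(row[j] for row in table), 2)
               for j in range(len(table[0])))
    expected = rows * cols / comb(n, 2)
    return (a - expected) / ((rows + cols) / 2 - expected)

# The contingency table from the example slide:
table = [
    [2, 0, 0, 0],   # class#1 (2)
    [0, 0, 0, 3],   # class#2 (3)
    [1, 4, 0, 0],   # class#3 (5)
    [1, 1, 7, 1],   # class#4 (10)
]
print(round(rand_index(table), 3))     # 0.789
print(round(adjusted_rand(table), 3))  # 0.469
```

Both values match the slide, and the contrast is instructive: the raw Rand index (0.789) looks flattering, while the chance-corrected value (0.469) is far more sober.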
Some Experiments, or: On the Need for Benchmark Datasets
How Do I Pick:
• Distance and similarity functions, given the algorithm and data set
• The algorithm, given the data set
• Internal validation measures, given the data set
Different Distances-Same Algorithm and implementation
(k-means)
Same Distance-Two Different Implementations of the Same
Algorithm: not all k-means are equal
Performance of Different Algorithms- precision
Method               Clusters   Adjusted Rand
Max K-means Random   5          0.44
Min K-means Random   5          0.49
CAST                 5          0.529
K-means Avlink       5          0.508
Avlink               5          0.559
CLICK                8          0.51
Performance of Different Indexes-Precision
Performance of Different Indexes-Time
Measure   Time in ms
Wk        157672
FOM       3695437
Gap MC    28082500
Gap P     26468125
Performance Evaluation
• What conclusions can one draw from the experiments shown?
– Some indication of which distance, algorithm, and measure to pick
• A much more extensive analysis is needed, with well-designed benchmark datasets
Performance Evaluation
• Benchmark data sets
– Hard to design, in particular for Microarrays
– Worth the trouble (see Tompa et al, Nature Biotechnology, 2005)
One-Stop-Shop Systems for the Analysis of Microarray Data
MIDAS and MEV
• Filtering and data normalization tools
• Clustering Algorithms (K-means, Cast)
• Validation Measures (FOM)
• Statistical Analysis tools
Click and Expander
• Data Normalization and Filtering
• Clustering Algorithms (in particular CLICK)
• Biclustering Algorithms
• Validation Methods
• Statistical and Visualization Tools
Visualization Methods for Statistical Analysis of Microarray Data
• A system that combines statistical methods and data visualization
• Synoptic views and limited navigation of the data are supported
Some Issues I Should Have Talked About
• Issue 25: Over-expression and Under-expression of genes
– Problem: one gene subject to “normal” conditions; same gene subject to “different” conditions.
– Question: Are the measured expression levels different ?
– Sensitivity Analysis in Microarray Data: Quite a bit of work– see for instance
http://www-stat.stanford.edu/~tibs/SAM/
Advertisement
• Second Lipari International Summer School in Bioinformatics and Computational Biology
• Where and When- Lipari Island, Italy-June 14-21, 2008
• Theme- Biological Networks: Evolution, Interaction and Computation
• More Info at http://lipari.cs.unict.it/LipariSchool/Bio/index.php
Conclusions
• Data analysis for microarrays (and not only) is a complicated interactive process with no clear-cut recipe
• Reliable tools, or at least knowledge of their limitations, are a must
GOOD LUCK!!!