04/19/23 Raffaele Giancarlo 1
Microarray Data Analysis: Clustering and Validation Measures
Raffaele Giancarlo
Dipartimento di Matematica
Università di Palermo, Italy
What we want (typically)
Gene Expression Matrix (rows: genes, columns: expression levels)
• Group functionally related genes together
• Basic Axiom of Computational Biology, Guilt by Association: a high similarity among objects, as measured by mathematical functions, is a strong indication of functional relatedness… not always
• Clustering
What we want (typically)
Clustering Solution
Limitations in the Analysis Process
Limitations: Microarray Technology
• "MIAME, we have a problem", Robert Shields, Trends in Genetics, 2006
– …no amount of statistical or algorithmic knowledge can compensate for limitations of the technology itself
– A large proportion of the transcriptome is beyond the reach of current technology, i.e., the signal is too weak
Limitations: Visualization Tools
• One of these two clusters is random noise… which one???
Limitations: Statistics
• Towards sound epistemological foundations of statistical methods for high-dimensional biology- T. Mehta et al, Nature Genetics, 2004
– Many papers in omic research describe the development or application of statistical methods; many of those are questionable
Overview Of Remaining Part
• Clustering as a three-step process
• Internal validation techniques
• External validation techniques
• Experiments
• One-stop-shop software systems
• Some issues I really had to talk about
Cluster Analysis as a Three Step Process
What is clustering?
• Group similar objects together
         E1   E2   E3   E4
Gene 1   -2   +2   +2   -1
Gene 2   +8   +3    0   +4
Gene 3   -4   +5   +4   -2
Gene 4   -1   +4   +3   -1

Clustering genes (rows) vs. clustering experiments (columns)
What is Clustering?
• Goal: partition the observations {xi} so that
– C(i)=C(j) if xi and xj are “similar”
– C(i) ≠ C(j) if xi and xj are “dissimilar”
• Natural questions:
– What is a cluster?
– How do I choose a good similarity function?
– How do I choose a good algorithm?
• APPLICATION and DATA DEPENDENT
– How many clusters are REALLY present in the data?
What’s a Cluster?
• No rigorous definition
• Subjective
• Scale/resolution dependent (e.g., hierarchy)
Step One
• Choose a good similarity function:
– Euclidean Distance: captures magnitude and pattern of expression, i.e., direction
– Correlation functions: capture pattern of expression, i.e., direction
– Etc…
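As a toy illustration of the two choices above (hypothetical data, plain Python, no libraries assumed): Euclidean distance reacts to both the magnitude and the pattern of expression, while Pearson correlation reacts to the pattern (direction) only.

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance: sensitive to both magnitude and pattern."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation: sensitive to the pattern (direction) only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two hypothetical genes with the same expression pattern
# but very different magnitudes:
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]
print(euclidean(g1, g2))  # large: the magnitudes differ
print(pearson(g1, g2))    # close to 1.0: the patterns are identical
```

So a Euclidean-based clustering would separate g1 and g2, while a correlation-based one would group them: the "right" choice depends on the application and the data.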
Step Two
• Choose a good clustering algorithm. Algorithms may be broadly classified according to the objective function they optimize:
– Compactness: small intra-cluster variation. Favors well-separated or spherical clusters but fails on more complex cluster shapes. Examples: K-means, average-link hierarchical clustering
– Connectedness: neighboring items should share the same cluster. Robust with respect to cluster shapes, but fails when the separation in the data is poor. Examples: single-link hierarchical clustering, CAST, CLICK
– Spatial separation: a poor performer by itself, usually coupled with other criteria. Examples: simulated annealing, tabu search
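A minimal sketch of the compactness-driven family above: a bare-bones K-means (Lloyd's algorithm) on toy 1-D data. The explicit initial centroids are an assumption made here for determinism; real implementations use random or smarter initialization.

```python
def kmeans(points, centroids, iters=20):
    """Bare-bones Lloyd's algorithm on 1-D points."""
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

# Two well-separated 1-D groups; K-means recovers them.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
clusters, centers = kmeans(data, centroids=[0.0, 6.0])
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Minimizing the within-cluster variation is exactly the compactness criterion: note that on elongated or intertwined cluster shapes this procedure would split clusters incorrectly, as stated above.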
Step Three
• An index that tells us how many clusters are really present in the data: Consistency/Uniformity
more likely to be 2 than 3
more likely to be 2 than 3… or 6? (depends: what if each circle represents 1000 objects?)
Step Three
• An index that tells us: Separability
increasing confidence to be 2
Step Three
• An index that is:
– independent of cluster “volume”?
– independent of cluster size?
– independent of cluster shape?
– sensitive to outliers?
– etc…
• Theoretically sound: Gap Statistics
• Data driven and validated: many
Internal Validation Measures
• How many clusters are really present in the data?
• Assess cluster quality
• Internal: no external knowledge about the dataset is given
The Basic Scheme
• Given an index F, a function of a clustering solution
• A black box produces clustering solutions C_k with k = 2, …, m clusters
• Compute F(C_k) to decide which k is best
Internal Validation Measures
• Within-Cluster Sum of Squares [Folklore]
• Gap Statistics [Tibshirani, Walther, Hastie 2001]
• FOM [Yeung, Haynor, Ruzzo 2001]
• Consensus Clustering [Monti et al., 2003]
• Etc…
Within-Cluster Sum of Squares
D_r = \sum_{x_i \in C_r} \sum_{x_j \in C_r} \lVert x_i - x_j \rVert^2
Within-Cluster Sum of Squares
D_r = \sum_{x_i \in C_r} \sum_{x_j \in C_r} \lVert x_i - x_j \rVert^2 = 2 n_r \sum_{x_i \in C_r} \lVert x_i - \bar{x}_r \rVert^2

W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r
Measure of compactness of clusters
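A small sketch of the formulas above on toy 1-D data (the partition is assumed given here, since the clustering algorithm is a black box); the code also checks the identity D_r = 2 n_r Σ_{x ∈ C_r} ||x − x̄_r||².

```python
def D(cluster):
    """D_r: double sum of squared distances over all ordered pairs in C_r."""
    return sum((x - y) ** 2 for x in cluster for y in cluster)

def W(clusters):
    """W_k = sum over clusters of D_r / (2 * n_r)."""
    return sum(D(c) / (2 * len(c)) for c in clusters)

clusters = [[1.0, 2.0, 3.0], [10.0, 11.0]]

# Check the identity: D_r = 2 * n_r * sum of squared deviations from the mean.
for c in clusters:
    mean = sum(c) / len(c)
    assert abs(D(c) - 2 * len(c) * sum((x - mean) ** 2 for x in c)) < 1e-9

print(W(clusters))  # 2.5
```

In practice one computes W_k for each candidate k produced by the clustering black box and inspects the resulting curve, as described next.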
Using Wk to determine # clusters
Idea of the L-Curve Method: use the k corresponding to the “elbow” (the point beyond which increasing k yields little further improvement in goodness-of-fit)
Example
• Yeast Cell Cycle dataset: 698 genes and 72 conditions
• Five functional classes: the gold solution
• Algorithm: K-means with Av. Link input and Euclidean Distance
• We want to know how many clusters are predicted by Wk, with K-means as an “oracle”
Example
Problems with Use of Wk
• No reference clustering solution to compare against, i.e., no model
• The values of Wk are not normalized and therefore cannot be compared
• In a nutshell: we get values of Wk but we do not quite know how far we are from randomness
• Gap Statistics takes care of those problems
The Gap Statistics
• Based on solid statistical work for the 1-D case (i.e., the objects to be clustered are scalars); takes care of the problems outlined for Wk
• Extended to work in higher dimensions – No Theory
• Validated experimentally
Sample Uniformly and at Random
1. Align with feature axes (data-geometry independent)
(Figure: the observations, their bounding box aligned with the feature axes, and the Monte Carlo simulations sampled uniformly inside the box)
Computation of the Gap Statistic
for b = 1 to B
    Compute Monte Carlo sample X_{1b}, X_{2b}, …, X_{nb} (n is # obs.)
for k = 1 to K
    Cluster the observations into k groups and compute \log W_k
    for b = 1 to B
        Cluster the b-th M.C. sample into k groups and compute \log W_{kb}
    Compute Gap(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb} - \log W_k
    Compute sd(k), the s.d. of \{\log W_{kb}\}_{b=1,\ldots,B}
    Set the total s.e. s_k = sd(k) \sqrt{1 + 1/B}
Find the smallest k such that Gap(k) \ge Gap(k+1) - s_{k+1}
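The procedure above can be sketched as follows, as a simplified 1-D illustration under stated assumptions: the "black box" clusterer here is a trivial sort-and-split into contiguous groups (a stand-in, not K-means), and the reference samples are drawn uniformly from the 1-D bounding box.

```python
import math
import random

def split_cluster(xs, k):
    """Toy "black box" clusterer: sort the 1-D data and cut it
    into k contiguous groups (illustration only)."""
    xs = sorted(xs)
    n = len(xs)
    return [xs[i * n // k:(i + 1) * n // k] for i in range(k)]

def log_Wk(clusters):
    """log of the within-cluster sum of squares defined earlier."""
    total = 0.0
    for c in clusters:
        mean = sum(c) / len(c)
        total += sum((x - mean) ** 2 for x in c)
    return math.log(total)

def gap_statistic(xs, K=5, B=20, seed=0):
    rng = random.Random(seed)
    lo, hi = min(xs), max(xs)  # 1-D bounding box
    gap, s = [0.0], [0.0]      # dummy entries so gap[k] means Gap(k)
    for k in range(1, K + 1):
        lw = log_Wk(split_cluster(xs, k))
        # B Monte Carlo reference samples, uniform over the bounding box
        ref = [log_Wk(split_cluster([rng.uniform(lo, hi) for _ in xs], k))
               for _ in range(B)]
        mean_ref = sum(ref) / B
        sd = math.sqrt(sum((r - mean_ref) ** 2 for r in ref) / B)
        gap.append(mean_ref - lw)
        s.append(sd * math.sqrt(1 + 1 / B))
    # Smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}
    for k in range(1, K):
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k
    return K

# Two well-separated 1-D groups: the estimate should be 2.
rng = random.Random(1)
data = ([rng.gauss(0.0, 0.2) for _ in range(25)] +
        [rng.gauss(5.0, 0.2) for _ in range(25)])
print(gap_statistic(data))  # -> 2
```

Unlike the raw Wk curve, the Gap curve is normalized against the uniform reference, so it delivers an actual answer rather than an elbow to eyeball.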
Example
• The same experimental setting as for
Within-Sum of Squares
• We want to know whether the Gap Statistics predicts 5 clusters, with K-means as an “oracle”
Example
Figure of Merit
• A purely experimental approach, designed and validated specifically for microarray data
FOM

(Figure: the n × m expression matrix R, with genes 1, …, n as rows and experiments 1, …, m as columns; experiment e is left out, and the genes are clustered into C_1, …, C_k using the remaining experiments.)

FOM(e, k) = \sqrt{ \frac{1}{n} \sum_{i=1}^{k} \sum_{g \in C_i} \left( R(g, e) - \mu_{C_i}(e) \right)^2 }

FOM(k) = \sum_{e=1}^{m} FOM(e, k)

adjusted FOM(e, k) = FOM(e, k) / \sqrt{(n - k)/n}

where R(g, e) is the expression level of gene g in experiment e and \mu_{C_i}(e) is the average expression level in experiment e over the genes in cluster C_i.
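A sketch of FOM(e, k) on hypothetical toy data; the clustering (computed with condition e left out) is assumed to be given, since FOM treats the clustering algorithm as a black box.

```python
import math

def fom(R, clusters, e):
    """FOM(e, k): root mean squared deviation, in the left-out
    condition e, of each gene from its cluster mean.
    R: dict gene -> list of expression values (one per condition);
    clusters: list of lists of gene ids, computed WITHOUT condition e."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        mu = sum(R[g][e] for g in c) / len(c)  # cluster mean in condition e
        total += sum((R[g][e] - mu) ** 2 for g in c)
    return math.sqrt(total / n)

# Toy matrix: 4 genes x 3 conditions. Suppose that, leaving out
# condition 2, the algorithm grouped g1 with g2 and g3 with g4.
R = {"g1": [1, 1, 2], "g2": [1, 1, 4], "g3": [5, 5, 8], "g4": [5, 5, 10]}
clusters = [["g1", "g2"], ["g3", "g4"]]
print(fom(R, clusters, e=2))  # sqrt((1+1+1+1)/4) = 1.0
```

Summing fom(R, clusters_e, e) over every left-out condition e gives the aggregate FOM(k); a clustering that predicts the hidden column well yields a small value.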
FOM
Example
• Same experimental setting as in the Within Sum of Squares
• We want to know whether FOM indicates 5 clusters in the data set, with K-means as an “oracle”
• Hint: look for the elbow in the FOM plot, exactly as for the Wk curve.
Example
External Validation Measures
• Given two partitions of the same dataset, how close are they?
• Assess the quality of a partition against a given gold standard
• External: the gold standard, i.e., the reference partition, must be given and trusted. In the case of biology, the elements in a cluster must be biologically correlated, i.e., belong to the same functional group of genes
Some External Validation Measures
• The two partitions must have the same number of classes:
– Jaccard Index
– Minkowski score
– Rand Index [Rand 71]
• The two partitions can have a different number of classes:
– The Adjusted Rand Index [Hubert and Arabie 85]
– The F measure [van Rijsbergen 79]
Some External Validation Measures
• Problem with the mentioned indexes: what is their expected value?
• In very intuitive terms: if one blindly picks two partitions among the possible partitions of the data, what value of the index should we expect? This is the same problem we had with the Gap Statistics.
The Adjusted Rand Index
• It takes as input two partitions, not necessarily having the same number of classes.
– Value 1, its maximum, means perfect agreement
– The expected value of the index, i.e., its value on two partitions picked at random, is zero
• Note 1: the index may take negative values
• Note 2: the same property is not shared by the other mentioned indexes, including its relative, the Rand Index
– The index must be maximized
– We will see some of its uses later
Adjusted Rand index
• Compare clusters to classes
• Consider # pairs of objects

                  Same cluster   Different cluster
Same class             a                c
Different class        b                d
Example (Adjusted Rand)
Contingency table (rows: gold-standard classes, columns: clusters; sizes in parentheses):

             c#1(4)  c#2(5)  c#3(7)  c#4(4)
class#1(2)     2       0       0       0
class#2(3)     0       0       0       3
class#3(5)     1       4       0       0
class#4(10)    1       1       7       1

With C(·, 2) denoting “choose 2”:

a = C(2,2) + C(3,2) + C(4,2) + C(7,2) = 31
b = [C(4,2) + C(5,2) + C(7,2) + C(4,2)] - a = 43 - 31 = 12
c = [C(2,2) + C(3,2) + C(5,2) + C(10,2)] - a = 59 - 31 = 28
d = C(20,2) - a - b - c = 190 - 31 - 12 - 28 = 119

Rand: R = (a + d) / (a + b + c + d) = 150/190 = 0.789

Adjusted Rand: (R - E(R)) / (1 - E(R)) = 0.469

Closed form in the paper by Handl et al. (supplementary material)
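The example can be checked in code: a sketch computing both indexes from the same contingency table, using pair counts for the Rand index and the Hubert and Arabie closed form for the adjusted index.

```python
from math import comb

def rand_index(table):
    """Rand index R = (a + d) / (a + b + c + d) from a contingency table
    (rows: classes, columns: clusters)."""
    n = sum(sum(row) for row in table)
    pairs = comb(n, 2)
    a = sum(comb(v, 2) for row in table for v in row)
    same_class = sum(comb(sum(row), 2) for row in table)
    same_cluster = sum(comb(sum(row[j] for row in table), 2)
                       for j in range(len(table[0])))
    d = pairs + a - same_class - same_cluster
    return (a + d) / pairs

def adjusted_rand(table):
    """Adjusted Rand Index, Hubert and Arabie closed form."""
    n = sum(sum(row) for row in table)
    a = sum(comb(v, 2) for row in table for v in row)
    rows = sum(comb(sum(row), 2) for row in table)
    cols = sum(comb(sum(row[j] for row in table), 2)
               for j in range(len(table[0])))
    expected = rows * cols / comb(n, 2)
    return (a - expected) / ((rows + cols) / 2 - expected)

# The contingency table from the example slide:
table = [
    [2, 0, 0, 0],   # class#1 (2)
    [0, 0, 0, 3],   # class#2 (3)
    [1, 4, 0, 0],   # class#3 (5)
    [1, 1, 7, 1],   # class#4 (10)
]
print(round(rand_index(table), 3))     # 0.789
print(round(adjusted_rand(table), 3))  # 0.469
```

Both values match the slide, and the contrast is instructive: the raw Rand index (0.789) looks flattering, while the chance-corrected value (0.469) is far more sober.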
Some Experiments, or: On the Need for Benchmark Datasets
How Do I Pick:
• Distance and similarity functions, given the algorithm and data set
• The algorithm, given the data set
• Internal validation measures, given the data set
Different Distances-Same Algorithm and implementation
(k-means)
Same Distance-Two Different Implementations of the Same
Algorithm: not all k-means are equal
Performance of Different Algorithms- precision
Method               Clusters   Adjusted Rand
Max K-means Random   5          0.44
Min K-means Random   5          0.49
CAST                 5          0.529
K-means Avlink       5          0.508
Avlink               5          0.559
CLICK                8          0.51
Performance of Different Indexes-Precision
Performance of Different Indexes-Time
Measure   Time in ms
Wk        157672
FOM       3695437
Gap MC    28082500
Gap P     26468125
Performance Evaluation
• What conclusions can one draw from the experiments shown?
– Some indication of which distance, algorithm, and measure to pick
• A much more extensive analysis is needed, with well-designed benchmark datasets
Performance Evaluation
• Benchmark data sets
– Hard to design, in particular for Microarrays
– Worth the trouble (see Tompa et al, Nature Biotechnology, 2005)
One-Stop-Shop Systems for the Analysis of Microarray Data
MIDAS and MEV
• Filtering and data normalization tools
• Clustering Algorithms (K-means, Cast)
• Validation Measures (FOM)
• Statistical Analysis tools
Click and Expander
• Data Normalization and Filtering
• Clustering Algorithms (in particular CLICK)
• Biclustering Algorithms
• Validation Methods
• Statistical and Visualization Tools
Visualization Methods for Statistical Analysis of Microarray Data
• A system that combines statistical methods and data visualization
• Synoptic views and limited navigation of the data are supported
Some Issues I Should Have Talked About
• Issue 25: Over-expression and Under-expression of genes
– Problem: one gene subject to “normal” conditions; same gene subject to “different” conditions.
– Question: Are the measured expression levels different ?
– Sensitivity Analysis in Microarray Data: Quite a bit of work– see for instance
http://www-stat.stanford.edu/~tibs/SAM/
Advertisement
• Second Lipari International Summer School in Bioinformatics and Computational Biology
• Where and When- Lipari Island, Italy-June 14-21, 2008
• Theme- Biological Networks: Evolution, Interaction and Computation
• More Info at http://lipari.cs.unict.it/LipariSchool/Bio/index.php
Conclusions
• Data analysis for microarrays (and not only) is a complicated interactive process with no clear-cut recipe
• Reliable tools, or at least knowledge of their limitations, are a must
GOOD LUCK!!!