Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Evaluation of gene-expression clustering via mutual information

distance measureIdo Priness, Oded Maimon and Irad

Ben-GalBMC Bioinformatics, 2007

Outline

• Introduction• Methods• Experiment• Discussion• Conclusion

2

Introduction

• Many clustering algorithms depend on similarity or distance measures to quantify the degree of association between expression profiles.

• It’s a key factor for a successful identification of relationships between genes and networks.

3

Introduction

• In general, many clustering algorithms use the Euclidean distance or the Pearson correlation coefficient as default similarity measure.

• However, those measurement are sensitive to noise effects and outliers.

4

Introduction

• How to improve?• To evaluate the mutual information(MI)

between gene expression patterns.

5

Methods

• Similarity measures• The implement of mutual information• Assessment of clustering quality

6

Similarity measures

• Consider two vector: and • Euclidean Distance:

• Pearson Correlation coefficient:

7

Similarity measures

• The used Mutual Information requires the expression patterns to be represented by discrete random variable.

• Given two random variable X, Y with respective ranges and probability distribution functions , the Mutual Information is:

8

Mutual Information

• The MI is always non-negative.• A zero MI indicates that the patterns do not

follow any kind of dependence.• The MI treats each expression level equally,

regardless of actual level, and thus is less biased by outliers.

9

The implement of mutual information

• We use a two-dimensional histogram to approximate the joint probability density function of two expression patterns.

10

We use the same number of bins for all expression patterns. The number of bins should be moderate enough to allow good estimates of the probability function.

Expression pattern of y

Expression pattern of x

The implement of mutual information

• The joint probabilities are then estimated by the corresponding relative frequencies of expression values in each bin in the two-dimensional histogram.

• The number of bins is often obtained heuristically.

11

Assessment of clustering quality

• When the true solution is unknown, we often use the homogeneity and the separation functions to determine the quality of a clustering solution.

12


• Consider that: a set of N elements ,divided into k clusters.

• Denote by and the expression pattern of element and the expression pattern of its cluster.

13

Homogeneity

• The homogeneity is:

where represents a given similarity measure.

14

Separation

• The separation is:

where , are the number of elements in cluster , and the expression pattern of cluster are .

15


• High homogeneity implies that elements in the same cluster are very similar to each other.

• Low separation implies that elements from different clusters are very dissimilar to each other.

16

Experiment

• Experiment 1: Robustness of compared distance measures.

• Experiment 2: Comparison of known clustering algorithms by the MI measure.

17

Robustness of compared distance measures

• Evaluate the performance of the three distance measures based on clustering solutions with a known number of clustering errors.

• How to generate clustering errors?

18

Transferring samples from true cluster to the erroneous one.


• The smaller is the number of errors in a solution, the better should be its homogeneity and separation scores and vice versa.

• It is expected that scores of groups of different quality will significantly differ from each other.

• It means “robustness” for those distance measures.

19


• The datasets:

20


• Experimental results:

21

The MI outperform the Pearson correlation and the Euclidean distance.



22

The MI outperform the Pearson correlation and the Euclidean distance.



23

The homogeneity and separation of MI based are better than the ones of Pearson correlation or Euclidean distance.


• For any number of clustering errors higher than one, the obtained MI-based scores are statistically more significant than the Pearson-based or Euclidean-based scores.

• Therefore, the use of MI-based scores results in a smaller type-II error (false negative) in comparison to the other distance measures when used to evaluate the quality of a clustering solution.

24

Comparison of known clustering algorithms by the MI measure

• In this experiment, we compare the effectiveness of several known clustering algorithm.

• Four compared algorithms:– K-means– Self-Organizing Maps(SOM)– Click– sIB(a MI based clustering algorithm.)

25


• The dataset: The Yeast cell-cycle dataset– 72 experimental conditions.– Transcript level vary periodically within the cell

cycle.

• Spellman et al. assumed that those expression patterns can be correlated to five different profiles(G1, S G2, M and M/G1 stages).

26



27The sIB has higher homogeneity, lower separation.



28

For sIB, it could get better Homogeneity and Separation scores than other algorithms.



29However, sIB became worst when Pearson correlation based score used!


• Once the solutions are evaluated by a different distance measure, the ranking obtained is almost the opposite of the MI-based ranking.

30

Discussion

• In the first experiment, we show the statistical superiority of the average MI-based measure independently of the selected clustering algorithm.

• In the second experiment, we show that the use of different distance measures can yield very different results when evaluating the solutions of known clustering algorithms.

31

Discussion

• The use of equal-probability bins to estimate the MI score provides considerable protection against outlier, since the contributions of all expression values within a bin to this estimation are identical, regardless of their actual values.

32

Conclusion

• The MI measure is a generalized measure of statistical dependence in the data and is reasonably immune against missing data and outliers.

• The selection of a proper distance measure is more important in clustering algorithms.

33

Documents

Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007