33
Evaluation of gene- expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Embed Size (px)

Citation preview

Page 1: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Evaluation of gene-expression clustering via mutual information

distance measureIdo Priness, Oded Maimon and Irad

Ben-GalBMC Bioinformatics, 2007

Page 2: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Outline

• Introduction• Methods• Experiment• Discussion• Conclusion

2

Page 3: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Introduction

• Many clustering algorithms depend on similarity or distance measures to quantify the degree of association between expression profiles.

• It’s a key factor for a successful identification of relationships between genes and networks.

3

Page 4: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Introduction

• In general, many clustering algorithms use the Euclidean distance or the Pearson correlation coefficient as default similarity measure.

• However, those measurement are sensitive to noise effects and outliers.

4

Page 5: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Introduction

• How to improve?• To evaluate the mutual information(MI)

between gene expression patterns.

5

Page 6: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Methods

• Similarity measures• The implement of mutual information• Assessment of clustering quality

6

Page 7: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Similarity measures

• Consider two vector: and • Euclidean Distance:

• Pearson Correlation coefficient:

7

Page 8: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Similarity measures

• The used Mutual Information requires the expression patterns to be represented by discrete random variable.

• Given two random variable X, Y with respective ranges and probability distribution functions , the Mutual Information is:

8

Page 9: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Mutual Information

• The MI is always non-negative.• A zero MI indicates that the patterns do not

follow any kind of dependence.• The MI treats each expression level equally,

regardless of actual level, and thus is less biased by outliers.

9

Page 10: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

The implement of mutual information

• We use a two-dimensional histogram to approximate the joint probability density function of two expression patterns.

10

We use the same number of bins for all expression patterns. The number of bins should be moderate enough to allow good estimates of the probability function.

Expression pattern of y

Expression pattern of x

Page 11: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

The implement of mutual information

• The joint probabilities are then estimated by the corresponding relative frequencies of expression values in each bin in the two-dimensional histogram.

• The number of bins is often obtained heuristically.

11

Page 12: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Assessment of clustering quality

• When the true solution is unknown, we often use the homogeneity and the separation functions to determine the quality of a clustering solution.

12

Page 13: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Assessment of clustering quality

• Consider that: a set of N elements ,divided into k clusters.

• Denote by and the expression pattern of element and the expression pattern of its cluster.

13

Page 14: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Homogeneity

• The homogeneity is:

where represents a given similarity measure.

14

Page 15: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Separation

• The separation is:

where , are the number of elements in cluster , and the expression pattern of cluster are .

15

Page 16: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Assessment of clustering quality

• High homogeneity implies that elements in the same cluster are very similar to each other.

• Low separation implies that elements from different clusters are very dissimilar to each other.

16

Page 17: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Experiment

• Experiment 1: Robustness of compared distance measures.

• Experiment 2: Comparison of known clustering algorithms by the MI measure.

17

Page 18: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Robustness of compared distance measures

• Evaluate the performance of the three distance measures based on clustering solutions with a known number of clustering errors.

• How to generate clustering errors?

18

Transferring samples from true cluster to the erroneous one.

Page 19: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Robustness of compared distance measures

• The smaller is the number of errors in a solution, the better should be its homogeneity and separation scores and vice versa.

• It is expected that scores of groups of different quality will significantly differ from each other.

• It means “robustness” for those distance measures.

19

Page 20: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Robustness of compared distance measures

• The datasets:

20

Page 21: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Robustness of compared distance measures

• Experimental results:

21

The MI outperform the Pearson correlation and the Euclidean distance.

Page 22: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Robustness of compared distance measures

• Experimental results:

22

The MI outperform the Pearson correlation and the Euclidean distance.

Page 23: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Robustness of compared distance measures

• Experimental results:

23

The homogeneity and separation of MI based are better than the ones of Pearson correlation or Euclidean distance.

Page 24: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Robustness of compared distance measures

• For any number of clustering errors higher than one, the obtained MI-based scores are statistically more significant than the Pearson-based or Euclidean-based scores.

• Therefore, the use of MI-based scores results in a smaller type-II error (false negative) in comparison to the other distance measures when used to evaluate the quality of a clustering solution.

24

Page 25: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Comparison of known clustering algorithms by the MI measure

• In this experiment, we compare the effectiveness of several known clustering algorithm.

• Four compared algorithms:– K-means– Self-Organizing Maps(SOM)– Click– sIB(a MI based clustering algorithm.)

25

Page 26: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Comparison of known clustering algorithms by the MI measure

• The dataset: The Yeast cell-cycle dataset– 72 experimental conditions.– Transcript level vary periodically within the cell

cycle.

• Spellman et al. assumed that those expression patterns can be correlated to five different profiles(G1, S G2, M and M/G1 stages).

26

Page 27: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Comparison of known clustering algorithms by the MI measure

• Experimental results:

27The sIB has higher homogeneity, lower separation.

Page 28: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Comparison of known clustering algorithms by the MI measure

• Experimental results:

28

For sIB, it could get better Homogeneity and Separation scores than other algorithms.

Page 29: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Comparison of known clustering algorithms by the MI measure

• Experimental results:

29However, sIB became worst when Pearson correlation based score used!

Page 30: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Comparison of known clustering algorithms by the MI measure

• Once the solutions are evaluated by a different distance measure, the ranking obtained is almost the opposite of the MI-based ranking.

30

Page 31: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Discussion

• In the first experiment, we show the statistical superiority of the average MI-based measure independently of the selected clustering algorithm.

• In the second experiment, we show that the use of different distance measures can yield very different results when evaluating the solutions of known clustering algorithms.

31

Page 32: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Discussion

• The use of equal-probability bins to estimate the MI score provides considerable protection against outlier, since the contributions of all expression values within a bin to this estimation are identical, regardless of their actual values.

32

Page 33: Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007

Conclusion

• The MI measure is a generalized measure of statistical dependence in the data and is reasonably immune against missing data and outliers.

• The selection of a proper distance measure is more important in clustering algorithms.

33