
DNA Microarray Data Sets Biclustering using a Possibilistic Approach

Francesco Masulli

DISI Dept Computer and Information Science, University of Genova ITALY

November 2006


Outline

1 Introduction

2 Biclustering

3 Possibilistic Biclustering algorithm

4 Results/Conclusions


BIOINFORMATICS DATA SETS
Data representation

Nowadays, in the post-genomic era, many Bioinformatics data sets are available (most of them released in the public domain on the Internet).

The information embedded in most of them has not yet been fully exploited, owing to the lack of accurate machine learning tools and/or to their limited diffusion in the Bioinformatics community.


Most Bioinformatics data sets come from DNA microarray experiments and are normally given as a rectangular $m \times n$ matrix $X$, where each row represents a feature (e.g., a gene) and each column represents a data sample or condition (e.g., a patient):

$$X = (x_{ij})_{m \times n}, \tag{1}$$

where the value $x_{ij}$ is the expression of the $i$-th gene in the $j$-th condition.

The analysis of microarray data sets can give valuable information on the biological relevance of genes and on the correlations between them [Madeira & Oliveira, 2004].


BIOINFORMATICS DATA SETS
Major Machine Learning tasks

Clustering (Unsupervised): Given a set of samples, partition them into groups of similar data samples according to some similarity criterion (CLASS DISCOVERY).

Classification (Supervised): Find the classes of the test data set using the known classification of the training data set (CLASS PREDICTION).

Feature Selection (Dimensionality reduction): For each class, select a subset of features responsible for creating the condition corresponding to that class (GENE SELECTION, BIOMARKER SELECTION).

Outlier Detection: Some data samples are not good representatives of any class; it is therefore better to disregard them when performing data analysis.


BIOINFORMATICS DATA SETS
Major challenges in Machine Learning

The noisiness typical of data arising in many Machine Learning applications complicates the solution of Machine Learning tasks (robustness to noise).

The high dimensionality of the data makes a complete search computationally infeasible in most data mining problems (curse of dimensionality).

Some data values may be inaccurate or missing.

The available data may not be sufficient to obtain statistically significant conclusions.


Biclustering

The problem we focus on today: identifying genes with similar behavior with respect to different conditions. This is an instance of the problem of biclustering (also known as co-clustering, two-way clustering, ...) [Cheng & Church, 2000; Hartigan, 1972; Kung et al., 2005; Turner et al., 2005].


BICLUSTERING

Biclustering is a methodology that clusters the feature set and the data points simultaneously.

It finds clusters of samples possessing similar characteristics, together with the features creating these similarities.

The required consistency of sample and feature classification gives biclustering an advantage over methodologies that treat the samples and the features of a data set separately from each other.


BICLUSTERING
Applications

Biological and Medical:

  Microarray data analysis
  Analysis of drug activity [Liu & Wang, 2003]
  Analysis of nutritional data [Lazzeroni et al., 2000]

Text Mining [Dhillon, 2001, 2003]

Marketing [Gaul & Schader, 1996]

Dimensionality Reduction in Databases [Agrawal et al., 1998]

Others:

  electoral data [Hartigan, 1972]
  currency exchange [Lazzeroni et al., 2000]


BICLUSTERING
State of the art

Cheng & Church algorithm [2000]

The algorithm constructs one bicluster at a time using a statistical criterion: a low mean squared residue (the variance of the set of all elements in the bicluster, plus the mean row variance and the mean column variance).

Once a bicluster is created, its entries are replaced by random numbers, and the procedure is repeated iteratively.

Drawback: The masking procedure results in a phenomenon of random interference, which affects the subsequent discovery of large biclusters [Yang et al., 2003].


BICLUSTERING
State of the art

Direct Clustering [Hartigan, 1972]

Flexible Overlapped Clusters (FLOC) [Yang et al., 2003] (probabilistic algorithm)

Bipartite graphs [Tanay et al., 2002]

Genetic algorithms [Mitra et al., 2006]

Simulated Annealing [Bryan et al., 2005]


BICLUSTERING
Surveys

S. Madeira, A.L. Oliveira, Biclustering Algorithms for Biological Data Analysis: A Survey, 2004.

A. Tanay, R. Sharan, R. Shamir, Biclustering Algorithms: A Survey, 2004.

D. Jiang, C. Tang, A. Zhang, Cluster Analysis for Gene Expression Data: A Survey, 2004.


POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)

Joint work:

Maurizio Filippone, Francesco Masulli, Stefano Rovetta
DISI Dept Computer and Information Science, University of Genova ITALY

Haider Banka, Sushmita Mitra
Indian Statistical Institute, Kolkata INDIA


POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)

We propose a new approach to the biclustering problem using the possibilistic clustering paradigm [Krishnapuram & Keller, 1993].

The PBC algorithm finds one bicluster at a time, assigning to each data matrix element a membership to the bicluster.

The membership model is of the fuzzy possibilistic type.


POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
Definitions

Let $x_{ij}$ be the expression level of the $i$-th gene in the $j$-th condition.

A bicluster is defined as a subset of the $m \times n$ data matrix $X$, i.e., a bicluster is a pair $(g, c)$, where $g \subset \{1, \ldots, m\}$ is a subset of genes and $c \subset \{1, \ldots, n\}$ is a subset of conditions [Cheng & Church, 2000; Hartigan, 1972; Kung et al., 2005; Turner et al., 2005].

We are interested in the largest biclusters from DNA microarray data that do not exceed an assigned homogeneity constraint [Cheng & Church, 2000], as they can supply relevant biological information.


POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
Definitions

The size (or volume) $n$ of a bicluster is usually defined as the number of cells of the gene expression matrix $X$ belonging to it, that is, the product of the cardinalities $n_g = |g|$ and $n_c = |c|$ (here $n$ denotes the bicluster size, not the number of columns of $X$):

$$n = n_g \cdot n_c \tag{2}$$

Normalized square residual:

$$d_{ij}^2 = \frac{(x_{ij} + x_{IJ} - x_{iJ} - x_{Ij})^2}{n} \tag{3}$$

where the elements $x_{IJ}$, $x_{iJ}$ and $x_{Ij}$ are respectively the bicluster mean, the row mean and the column mean of $X$ for the selected genes and conditions:

bicluster mean:

$$x_{IJ} = \frac{1}{n} \sum_{i \in g} \sum_{j \in c} x_{ij} \tag{4}$$

bicluster row mean:

$$x_{iJ} = \frac{1}{n_c} \sum_{j \in c} x_{ij} \tag{5}$$

bicluster column mean:

$$x_{Ij} = \frac{1}{n_g} \sum_{i \in g} x_{ij} \tag{6}$$


POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
Definitions

Mean Square Residual [Cheng & Church, 2000]:

$$G = \sum_{i \in g} \sum_{j \in c} d_{ij}^2 \tag{7}$$

$G$ measures the bicluster homogeneity, i.e., the difference between the actual value of an element $x_{ij}$ and its expected value as predicted from the corresponding row mean, column mean, and bicluster mean.

OUR AIM: maximizing the bicluster cardinality $n$ while at the same time minimizing the residual $G$ (an NP-complete task [Peeters, 2003]), using the Possibilistic Clustering Paradigm.
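As a concrete reading of eqs. (2)-(7), the residual $G$ of a crisp bicluster can be computed directly from the data matrix. The following is only a minimal sketch: the function name `mean_squared_residue`, the list-of-lists layout of `X` (rows are genes, columns are conditions) and the toy values are assumptions made for illustration, not part of the original slides.

```python
def mean_squared_residue(X, genes, conds):
    """Residual G of eq. (7) for the crisp bicluster (genes, conds)."""
    n_g, n_c = len(genes), len(conds)
    n = n_g * n_c  # bicluster size, eq. (2)
    # Bicluster mean, row means and column means, eqs. (4)-(6).
    mean = sum(X[i][j] for i in genes for j in conds) / n
    row_mean = {i: sum(X[i][j] for j in conds) / n_c for i in genes}
    col_mean = {j: sum(X[i][j] for i in genes) / n_g for j in conds}
    # Sum of the normalized square residuals d_ij^2, eqs. (3) and (7).
    return sum(
        (X[i][j] + mean - row_mean[i] - col_mean[j]) ** 2
        for i in genes for j in conds
    ) / n


# Toy matrix (hypothetical values): the top-left 3x3 block is additive,
# i.e. x_ij = row effect + column effect, so its residual should vanish.
X = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 1.0, 8.0, 2.0],
]
additive = mean_squared_residue(X, genes=[0, 1, 2], conds=[0, 1, 2])
whole = mean_squared_residue(X, genes=[0, 1, 2, 3], conds=[0, 1, 2, 3])
print(additive, whole)
```

The additive block gives a residual that is zero up to rounding, while the whole matrix does not, which is the trade-off between size and homogeneity described above.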


Approaches to clustering Bioinformatics data sets

Data clustering is a routine step in biological data analysis, and a basic tool in Bioinformatics [Golub et al., 1999; Tamayo et al., 1999; Azuaje, 2003].

Main approaches:

Hierarchical Clustering [Eisen et al., 1998; Orengo et al., 2003]

Partitional (or Central) Clustering, including C-Means [Duda & Hart, 1973], Self Organizing Map [Kohonen, 2001], Fuzzy C-Means [Bezdek, 1981], Deterministic Annealing [Rose et al., 1990], Alternating Cluster Estimation [Runkler, 1999], etc.


Probabilistic constraint
From Probabilistic to Possibilistic Clustering

Let $X = \{x_1, \ldots, x_r\}$ be a set of unlabeled data points, $Y = \{y_1, \ldots, y_s\}$ a set of cluster centers (or prototypes) and $U = [u_{pq}]$ the fuzzy membership matrix, where $u_{pq}$ is the membership of point $x_q$ to cluster $p$.

Often, central clustering algorithms impose a probabilistic constraint on memberships, according to which the sum of the membership values of a point in all the clusters must be equal to one:

$$\sum_{p=1}^{s} u_{pq} = 1 \quad \forall q \tag{8}$$


C-Means Algorithm

The C-Means (CM) algorithm is an efficient approximate way to obtain the maximum likelihood estimate of the centers of clusters [Duda & Hart, 1973].

While maximizing the likelihood of the training set, CM at the same time minimizes a risk functional $J_w$ [Bezdek, 1981]:

$$J_w(U, Y) = \sum_{p=1}^{s} \sum_{q=1}^{r} u_{pq} E_{pq}, \tag{9}$$

i.e., the expectation of a loss function or distortion (local cost function).

The loss is often defined as $E_{pq} = \|x_q - y_p\|^2$.

$u_{pq} = P(w_p \mid x_q)$ is the membership value of pattern $x_q$ to cluster $w_p$, approximated to $u_{pq} \in \{0, 1\}$.


C-Means Algorithm

assign the number of clusters $c$ and the threshold $\epsilon$

initialize the centers of clusters (at random, or using available knowledge)

do until every center changes by less than $\epsilon$

  assign each point to the cluster with the smallest Euclidean distance
  recalculate the centers of clusters

end do.
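The loop above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the author's implementation: the names `c_means` and `squared_distance`, the random initialization from the data points, the `max_iter` safeguard and the toy data are all assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def c_means(points, c, eps=1e-6, max_iter=100, seed=0):
    """Crisp C-Means: returns (centers, labels)."""
    rng = random.Random(seed)
    # Initialize the centers with c distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, c)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign each point to the cluster with the closest center.
        labels = [
            min(range(c), key=lambda p: squared_distance(x, centers[p]))
            for x in points
        ]
        # Recalculate each center as the mean of its assigned points.
        new_centers = []
        for p in range(c):
            members = [x for x, lab in zip(points, labels) if lab == p]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[p])  # keep the center of an empty cluster
        # Stop when every center has moved by less than eps.
        shift = max(squared_distance(a, b) ** 0.5 for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:
            break
    return centers, labels


# Two well-separated toy groups (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = c_means(points, c=2)
print(centers, labels)
```

Because the result depends on the initial centers, running the sketch with different `seed` values on less clearly separated data is a simple way to observe the local-minimum behavior of CM.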


C-Means Algorithm
CM Stability

Main problem of CM: trapping in local minima of $J_w$ (i.e., in local maxima of the likelihood) ⇒ low reliability of its results.

⇒ Multiple runs / Cluster validation criteria

In order to overcome this problem:

Local search techniques based on a regularization framework (adding constraints on the solution, i.e., minimization of a modified risk functional). E.g., Isodata and Fuzzy clustering paradigms: Fuzzy C-Means (FCM) [Bezdek, 1981], Deterministic Annealing (DA) [Rose et al., 1990], Possibilistic Clustering [Krishnapuram & Keller, 1993, 1996], and Graded Possibilistic Clustering [Masulli & Rovetta, 2003].

Global search techniques, e.g., minimization of $J_w$ using Simulated Annealing [Bogus et al., 1999] or Evolutionary Computing [Fogel, 1993; Bezdek et al., 1994; Tseng & Yang, 1997; Egan, 1998; Kuncheva et al., 1998; Hall et al., 1999; Masulli et al., 1999].


Fuzzy C-Means Algorithm
[Bezdek, 1981]

Risk functional:

$$J_m(U, Y) = \sum_{p=1}^{s} \sum_{q=1}^{r} u_{pq}^m E_{pq}, \tag{10}$$

where $m \in (1, +\infty)$ is a control parameter of fuzziness, and $u_{pq} \in [0, 1]$.

The clustering problem can be defined as the constrained minimization of $J_m$ with respect to $U$ and $Y$, under the normalization (probabilistic constraint): $\sum_{p=1}^{s} u_{pq} = 1 \ \forall q$.


Fuzzy C-Means Algorithm

Picard iteration of the following equations:

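$$u_{pq} = \frac{1}{\sum_{r=1}^{s} \left( \frac{E_p(x_q)}{E_r(x_q)} \right)^{\frac{2}{m-1}}} \tag{11}$$

$$y_p = \frac{\sum_{q=1}^{r} x_q u_{pq}^m}{\sum_{q=1}^{r} u_{pq}^m}. \tag{12}$$

In the denominator of eq. (11) only, $r$ is a summation index running over the $s$ clusters; elsewhere $r$ is the number of data points.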
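One Picard iteration of eqs. (11)-(12) can be sketched as follows. The sketch assumes that $E_p(x_q)$ in eq. (11) denotes the Euclidean distance between $x_q$ and $y_p$, for which $2/(m-1)$ is the usual exponent; with squared distances the exponent would be $1/(m-1)$. The function name `fcm_step`, the small floor `tiny` that avoids a division by zero when a point coincides with a center, and the toy data are assumptions made for illustration.

```python
import math


def fcm_step(points, centers, m=2.0, tiny=1e-12):
    """One Picard iteration of Fuzzy C-Means: returns (U, new_centers)."""
    s, r = len(centers), len(points)
    # dist[p][q]: Euclidean distance between center p and point q.
    dist = [[max(math.dist(y, x), tiny) for x in points] for y in centers]
    expo = 2.0 / (m - 1.0)
    # Membership update, eq. (11); every column of U sums to one.
    U = [
        [1.0 / sum((dist[p][q] / dist[k][q]) ** expo for k in range(s))
         for q in range(r)]
        for p in range(s)
    ]
    # Center update, eq. (12): mean of the points weighted by u_pq^m.
    dim = len(points[0])
    new_centers = []
    for p in range(s):
        weights = [U[p][q] ** m for q in range(r)]
        total = sum(weights)
        new_centers.append(
            [sum(w * x[d] for w, x in zip(weights, points)) / total for d in range(dim)]
        )
    return U, new_centers


# One-dimensional toy data with two groups (hypothetical values).
points = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers = [[0.0], [10.0]]
for _ in range(20):
    U, centers = fcm_step(points, centers)
print(centers)
```

In practice the iteration would be stopped when the centers or the memberships change by less than a threshold, as in the C-Means loop; a fixed number of steps is used here only to keep the sketch short.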


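Fuzzy C-Means Algorithm
Remarks

$$\lim_{m \to 1^+} u_{pq} \in \{0, 1\}; \qquad \lim_{m \to 1^+} y_p = \frac{1}{r_p} \sum_{x_q \in X_p} x_q; \qquad r_p = |X_p| = \sum_{q=1}^{r} u_{pq}$$

$$\lim_{m \to +\infty} u_{pq} = \frac{1}{s}; \qquad \lim_{m \to +\infty} y_p = \frac{1}{r} \sum_{x_q \in X} x_q$$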
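The two limits can be checked numerically on the membership formula of eq. (11). This is a small illustration under the same assumption as before (the ratio in eq. (11) is a ratio of Euclidean distances); the function name `fcm_memberships` and the distance values are hypothetical.

```python
def fcm_memberships(distances, m):
    """Memberships of one point to s clusters, given its distances to the centers."""
    expo = 2.0 / (m - 1.0)
    return [1.0 / sum((dp / dk) ** expo for dk in distances) for dp in distances]


# Distances of a single point from three cluster centers.
d = [1.0, 2.0, 4.0]
near_crisp = fcm_memberships(d, m=1.05)     # m close to 1: almost crisp
near_uniform = fcm_memberships(d, m=200.0)  # large m: almost 1/s for every cluster
print(near_crisp, near_uniform)
```

For $m$ close to 1 the closest center takes essentially all the membership, while for large $m$ every membership approaches $1/s$ regardless of the distances.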


Fuzzy C-Means Algorithm

"universal" value: m = 2


From Probabilistic to Possibilistic Clustering

Probabilistic constraint $\sum_{p=1}^{s} u_{pq} = 1$:

PROS - competitive constraint allowing the unsupervised learning algorithms to find the barycenter of clusters

CONS - membership to clusters (a) is not interpretable as a degree of typicality; (b) can give sensitivity to outliers

[Figure with two panels, (a) and (b)]


From Probabilistic to Possibilistic Clustering

In the Possibilistic C-Means (PCM) Algorithm [Krishnapuram & Keller, 1993] the constraints on the elements of $U$ are relaxed to:

$$u_{pq} \in [0, 1] \quad \forall p, q; \tag{13}$$

$$0 < \sum_{q=1}^{r} u_{pq} < r \quad \forall p; \tag{14}$$

$$\sum_{p} u_{pq} > 0 \quad \forall q, \tag{15}$$

i.e., clusters cannot be empty and each pattern must be assigned to at least one cluster.

PCM is a mode-seeking algorithm [Krishnapuram & Keller, 1993].


From Probabilistic to Possibilistic Clustering

PCM objective function [Krishnapuram & Keller, 1996]:

$$J_m(U, Y) = \sum_{p=1}^{s} \sum_{q=1}^{r} u_{pq} E_{pq} + \sum_{p=1}^{s} \frac{1}{\beta_p} \sum_{q=1}^{r} (u_{pq} \log u_{pq} - u_{pq}), \tag{16}$$

where $E_{pq} = \|x_q - y_p\|^2$ is the squared Euclidean distance, and $\beta_p$ (scale) depends on the average size of the $p$-th cluster.

Thanks to the penalty term, points with a high degree of typicality have high $u_{pq}$ values, and points that are not very representative have low $u_{pq}$ values in all the clusters.

Note that if $\beta_p \to \infty \ \forall p$, we obtain the trivial solution $u_{pq} = 0 \ \forall p, q$, as no probabilistic constraint is assumed.


From Probabilistic to Possibilistic Clustering

The pair $(U, Y)$ minimizes $J_m$ under the possibilistic constraints (13)-(15) only if:

$$u_{pq} = e^{-E_{pq}/\beta_p} \quad \forall p, q, \tag{17}$$

and

$$y_p = \frac{\sum_{q=1}^{r} x_q u_{pq}}{\sum_{q=1}^{r} u_{pq}} \quad \forall p. \tag{18}$$

Picard iteration.

Membership refinement algorithm: the membership to clusters acts as a cluster typicality degree (initialization of centroids using, e.g., Fuzzy C-Means).

High outlier rejection capability, as PCM makes their membership very low.
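The Picard iteration of eqs. (17)-(18) and its outlier rejection can be illustrated with a short sketch. It applies eq. (17) exactly as written, treating each $\beta_p$ as a given scale parameter; the function names `pcm_step` and `squared_distance`, the value `betas = [1.0]`, the hand-picked initial center (in place of a Fuzzy C-Means initialization) and the toy data are assumptions.

```python
import math


def squared_distance(x, y):
    """Squared Euclidean distance E_pq between a point and a center."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def pcm_step(points, centers, betas):
    """One Picard iteration of Possibilistic C-Means: returns (U, new_centers)."""
    # Eq. (17): typicality of each point with respect to each cluster.
    U = [
        [math.exp(-squared_distance(x, y) / beta) for x in points]
        for y, beta in zip(centers, betas)
    ]
    # Eq. (18): each center is the membership-weighted mean of the points.
    dim = len(points[0])
    new_centers = [
        [sum(u * x[d] for u, x in zip(row, points)) / sum(row) for d in range(dim)]
        for row in U
    ]
    return U, new_centers


# Four points on a unit square plus one far outlier (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (50.0, 50.0)]
centers = [[0.4, 0.6]]
betas = [1.0]
for _ in range(30):
    U, centers = pcm_step(points, centers, betas)
print(centers, U[0])
```

The outlier receives a membership that is numerically zero, so the center settles at the middle of the square instead of being pulled toward the outlier as a plain mean would be.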


From Probabilistic to Possibilistic Clustering

PCM approach ⇒ equivalent to a set of $s$ independent estimation problems [Nasraoui, 1995]:

$$(u_{pq}, y) = \arg\min_{u_{pq},\, y} \left[ \sum_{q=1}^{r} u_{pq} E_{pq} + \frac{1}{\beta_p} \sum_{q=1}^{r} (u_{pq} \log u_{pq} - u_{pq}) \right] \quad \forall p, \tag{19}$$

that can be solved independently one at a time through a Picard iteration.


POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
PBC Formulation

For each bicluster we assign two membership vectors, one for the rows and one for the columns, denoted respectively $a$ and $b$.

In a crisp set framework, row $i$ and column $j$ can either belong to the bicluster ($a_i = 1$ and $b_j = 1$) or not ($a_i = 0$ or $b_j = 0$).

An element $x_{ij}$ of $X$ belongs to the bicluster if both $a_i = 1$ and $b_j = 1$, i.e., its membership $u_{ij}$ to the bicluster is:

$$u_{ij} = \mathrm{and}(a_i, b_j) \tag{20}$$

The cardinality of the bicluster is then defined as:

$$n = \sum_{i} \sum_{j} u_{ij} \tag{21}$$

Fuzzy set theory framework:

We allow the memberships $u_{ij}$, $a_i$ and $b_j$ to take values in the interval $[0, 1]$.

The membership $u_{ij}$ of an element $x_{ij}$ of $X$ to the bicluster can be obtained by aggregating the row and column memberships, using, e.g., the product t-norm:

$$u_{ij} = a_i b_j \quad \text{(product)} \tag{22}$$

or the arithmetic average (an averaging aggregator rather than a t-norm):

$$u_{ij} = \frac{a_i + b_j}{2} \quad \text{(average)} \tag{23}$$

The fuzzy cardinality of the bicluster is defined as the sum of the memberships $u_{ij}$ over all $i$ and $j$, as in eq. 21.
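To make the two aggregators and the fuzzy cardinality concrete, here is a small sketch in plain Python. The function names `aggregate` and `cardinality` and the membership values are hypothetical and chosen only for illustration.

```python
def aggregate(a, b, how="average"):
    """Membership matrix u_ij built from row memberships a and column memberships b."""
    if how == "product":      # eq. 22
        return [[ai * bj for bj in b] for ai in a]
    if how == "average":      # eq. 23
        return [[(ai + bj) / 2 for bj in b] for ai in a]
    raise ValueError(f"unknown aggregator: {how}")

def cardinality(u):
    """Fuzzy cardinality n: the sum of all memberships (eq. 21)."""
    return sum(sum(row) for row in u)

a = [1.0, 0.5, 0.0]   # illustrative row (gene) memberships
b = [1.0, 0.0]        # illustrative column (condition) memberships
n_product = cardinality(aggregate(a, b, "product"))   # 1.5
n_average = cardinality(aggregate(a, b, "average"))   # 3.0
```

With crisp memberships the product reduces to the `and` of eq. 20, while the average does not: a member row paired with a non-member column still receives membership 0.5.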

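Generalization of the homogeneity measures (eqs. 4 to 7):

Fuzzy normalized square residual:

$$d^2_{ij} = \frac{\left( x_{ij} + x_{IJ} - x_{iJ} - x_{Ij} \right)^2}{n} \tag{24}$$

where the fuzzy bicluster mean, the fuzzy bicluster row mean and the fuzzy bicluster column mean are defined as:

$$x_{IJ} = \frac{\sum_i \sum_j u_{ij} x_{ij}}{\sum_i \sum_j u_{ij}}, \qquad x_{iJ} = \frac{\sum_j u_{ij} x_{ij}}{\sum_j u_{ij}}, \qquad x_{Ij} = \frac{\sum_i u_{ij} x_{ij}}{\sum_i u_{ij}} \tag{25}$$

and the fuzzy mean square residual:

$$G = \sum_i \sum_j u_{ij} d^2_{ij} \tag{26}$$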
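The fuzzy means and residuals of eqs. 24 to 26 can be computed directly from a data matrix and a membership matrix. The sketch below uses plain Python lists; the function name `fuzzy_residuals` and the toy matrices are assumptions made for illustration, and the code assumes that every row and column has a non-zero total membership.

```python
def fuzzy_residuals(x, u):
    """Return (d2, G) for data matrix x and membership matrix u (eqs. 24-26)."""
    rows, cols = range(len(x)), range(len(x[0]))
    n = sum(u[i][j] for i in rows for j in cols)                    # eq. 21
    # eq. 25: fuzzy bicluster mean, row means and column means
    x_IJ = sum(u[i][j] * x[i][j] for i in rows for j in cols) / n
    x_iJ = [sum(u[i][j] * x[i][j] for j in cols) / sum(u[i][j] for j in cols)
            for i in rows]
    x_Ij = [sum(u[i][j] * x[i][j] for i in rows) / sum(u[i][j] for i in rows)
            for j in cols]
    # eq. 24: fuzzy normalized square residual
    d2 = [[(x[i][j] + x_IJ - x_iJ[i] - x_Ij[j]) ** 2 / n for j in cols]
          for i in rows]
    # eq. 26: fuzzy mean square residual
    G = sum(u[i][j] * d2[i][j] for i in rows for j in cols)
    return d2, G

# A purely additive pattern (x_ij = r_i + c_j) with uniform memberships.
x = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
u = [[1.0] * 3 for _ in range(3)]
_, G = fuzzy_residuals(x, u)

# The same matrix with one entry perturbed.
x_noisy = [[10.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
_, G_noisy = fuzzy_residuals(x_noisy, u)
```

For the additive matrix the residual is zero up to rounding, while the perturbed matrix gives G = 4.0, so G measures the departure from a coherent additive pattern.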

Possibilistic Biclustering Problem: maximizing the bicluster cardinality $n$ and minimizing the fuzzy residual $G$ under the fuzzy possibilistic paradigm.

To this aim we make the following assumptions:

we treat one bicluster at a time;
the fuzzy memberships $a_i$ and $b_j$ are interpreted as typicality degrees of gene $i$ and condition $j$ with respect to the bicluster;
we compute the membership $u_{ij}$ using the average aggregator (eq. 23).

All those requirements are fulfilled by minimizing the following functional $J_B$ with respect to $a$ and $b$:

$$J_B = \sum_{i} \sum_{j} \left( \frac{a_i + b_j}{2} \right) d^2_{ij} + \lambda \sum_i \left( a_i \ln a_i - a_i \right) + \mu \sum_j \left( b_j \ln b_j - b_j \right) \tag{27}$$

The first term is the fuzzy mean square residual $G$, while the other two are penalization terms.

The parameters λ and µ control the size of the bicluster. Their values can be estimated by simple statistics over the training set, and then hand-tuned to incorporate possible a priori knowledge and to obtain the desired results.

Setting the derivatives of $J_B$ with respect to the memberships $a_i$ and $b_j$ to zero we obtain:

$$a_i = \exp\left( -\frac{\sum_j d^2_{ij}}{2\lambda} \right) \tag{28}$$

$$b_j = \exp\left( -\frac{\sum_i d^2_{ij}}{2\mu} \right) \tag{29}$$

Those necessary conditions for the minimization of $J_B$, together with the definition of the fuzzy normalized square residual $d^2_{ij}$ (eq. 24), can be used to find a numerical solution of the optimization problem (Picard iteration).
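As a worked step, the stationarity conditions can be derived from eq. 27 as follows. The derivation assumes that $d^2_{ij}$ is held fixed while differentiating (its own dependence on $a$ and $b$ is neglected and handled by the Picard iteration instead); this assumption is not stated on the slide. It uses $\frac{d}{da}(a \ln a - a) = \ln a$.

```latex
\[
\frac{\partial J_B}{\partial a_i}
  = \frac{1}{2}\sum_j d^2_{ij} + \lambda \ln a_i = 0
  \quad\Longrightarrow\quad
  a_i = \exp\!\left(-\frac{\sum_j d^2_{ij}}{2\lambda}\right)
\]
\[
\frac{\partial J_B}{\partial b_j}
  = \frac{1}{2}\sum_i d^2_{ij} + \mu \ln b_j = 0
  \quad\Longrightarrow\quad
  b_j = \exp\!\left(-\frac{\sum_i d^2_{ij}}{2\mu}\right)
\]
```

As λ (or µ) grows, the exponent shrinks and the memberships move towards 1, which is how the two parameters control the size of the bicluster.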

Table: Possibilistic Biclustering (PBC) algorithm.

1 Initialize memberships $a$ and $b$ and threshold ε
2 Compute $d^2_{ij}$ ∀i, j (eq. 24)
3 Update $a_i$ ∀i (eq. 28)
4 Update $b_j$ ∀j (eq. 29)
5 if ‖a′ − a‖ < ε and ‖b′ − b‖ < ε then stop
6 else jump to step 2
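The steps in the table can be sketched in plain Python as below. This is an illustrative reading of the algorithm, not the author's implementation: it assumes the update forms $a_i = \exp(-\sum_j d^2_{ij} / 2\lambda)$ and $b_j = \exp(-\sum_i d^2_{ij} / 2\mu)$ obtained from the stationarity of eq. 27, a Euclidean norm in step 5, a constant initialization at 0.5, and a hypothetical `max_iter` safeguard. The toy matrix is made up.

```python
import math

def pbc(x, lam, mu, a=None, b=None, eps=1e-2, max_iter=200):
    """Picard iteration for a single bicluster; returns the memberships (a, b)."""
    rows, cols = range(len(x)), range(len(x[0]))
    # step 1: initialize memberships (constant start, an arbitrary choice)
    a = list(a) if a is not None else [0.5] * len(x)
    b = list(b) if b is not None else [0.5] * len(x[0])
    for _ in range(max_iter):
        # step 2: d2_ij from eq. 24, with u_ij = (a_i + b_j) / 2 (eq. 23)
        u = [[(a[i] + b[j]) / 2 for j in cols] for i in rows]
        n = sum(map(sum, u))
        x_IJ = sum(u[i][j] * x[i][j] for i in rows for j in cols) / n
        x_iJ = [sum(u[i][j] * x[i][j] for j in cols) / sum(u[i]) for i in rows]
        x_Ij = [sum(u[i][j] * x[i][j] for i in rows) / sum(u[i][j] for i in rows)
                for j in cols]
        d2 = [[(x[i][j] + x_IJ - x_iJ[i] - x_Ij[j]) ** 2 / n for j in cols]
              for i in rows]
        # steps 3 and 4: membership updates (eqs. 28 and 29)
        new_a = [math.exp(-sum(d2[i]) / (2 * lam)) for i in rows]
        new_b = [math.exp(-sum(d2[i][j] for i in rows) / (2 * mu)) for j in cols]
        # step 5: stop when both membership vectors have settled
        done = math.dist(new_a, a) < eps and math.dist(new_b, b) < eps
        a, b = new_a, new_b
        if done:
            break
    return a, b

# Three rows following an additive pattern and one incoherent row (toy values).
x = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0], [9.0, 0.0, 7.0]]
a, b = pbc(x, lam=1.0, mu=1.0)
```

On this toy matrix the incoherent last row ends up with a much lower membership than the three coherent rows, so an α-cut at 0.5 would exclude it.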

The memberships can be initialized:

randomly;
using some a priori information about relevant genes and conditions;
using the results already obtained from another biclustering algorithm (in this case PBC works as a refinement algorithm).

ε controls the convergence of the algorithm.

After convergence of the algorithm, the memberships $a$ and $b$ can be defuzzified by applying an α-cut, i.e., by comparing them with a threshold.
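The α-cut amounts to a simple thresholding of the converged memberships. In the sketch below the helper name `alpha_cut`, the use of a strict inequality, and the membership values are assumptions made for illustration.

```python
def alpha_cut(memberships, alpha=0.5):
    """Indices whose membership exceeds the threshold alpha."""
    return [k for k, m in enumerate(memberships) if m > alpha]

a = [0.98, 0.97, 0.03, 0.81]   # illustrative row memberships after convergence
b = [0.54, 0.06, 0.95]         # illustrative column memberships
genes = alpha_cut(a)           # [0, 1, 3]
conditions = alpha_cut(b)      # [0, 2]
n_crisp = len(genes) * len(conditions)   # crisp cardinality: 3 * 2 = 6
```

The crisp bicluster is the submatrix of the selected genes and conditions, so its cardinality is the product of the two counts.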

RESULTS
Yeast data set [Tavazoie et al., 1999] [Ball et al., 2000] [Aach et al., 2000]

2879 genes and 17 conditions.
α-cut = 0.5 for the defuzzification of $a$ and $b$; ε = $10^{-2}$ (results averaged over 20 runs).

[Figure: Size of biclusters n vs λ and µ. Axes: lambda (0.26 to 0.36), mu (90 to 105), n (0 to 15000).]

PBC is slightly sensitive to the initialization of the memberships, while it is strongly sensitive to the parameters λ and µ. PBC can find biclusters of a desired size just by tuning the parameters λ and µ (results averaged over 20 runs).

λ      µ     ng     nc   n       G
0.25   115   448    10   4480    56.07
0.19   200   457    16   7312    67.80
0.30   100   654    8    5232    82.20
0.32   100   840    9    7560    111.63
0.31   120   989    13   12857   146.89
0.34   120   1177   13   15301   181.57
0.37   110   1309   13   17017   207.20
0.42   100   1500   13   19500   245.50
0.45   95    1622   12   19464   260.25
0.46   95    1681   13   21853   285.00
0.47   95    1737   13   22581   297.40
0.48   95    1797   13   23361   310.72

[Figure: Plot of a small and a large bicluster. Two panels; x-axis: Conditions, y-axis: Expression Values.]

Table: Comparative study on Yeast data.

Method                                      avg. G   avg. n   avg. ng   avg. nc   Largest n
DBF [Zhang et al., 2004]                    115      1627     188       11        4000
FLOC [Yang et al., 2003]                    188      1826     195       12.8      2000
Cheng-Church [2000]                         204      1577     167       12        4485
Single-objective GA [Mitra & Banka, 2006]   52.9     571      191       5.13      1408
Multi-objective GA [Mitra & Banka, 2006]    235      10302    1095      9.29      14828
Possibilistic Biclustering                  297      22571    1736      13        22607

CONCLUSIONS

The Possibilistic Biclustering (PBC) algorithm extends the possibilistic clustering paradigm to the solution of the biclustering problem.

The membership $u_{ij}$ of an element $x_{ij}$ of $X$ to the bicluster is obtained by aggregating the memberships (typicalities) of its row (gene) and column (condition) with respect to the bicluster.

The quality of the large biclusters obtained is better than that obtained by other biclustering methods.

Further studies:

biological validation of the obtained results
automatic selection of the parameters λ and µ
other aggregators for obtaining $u_{ij}$
