
DATA MINING APPROACH

IGES 2008

What is Machine Learning?
– Algorithms that allow computers to learn.
– Unsupervised learning
  • E.g. cluster analysis
– Supervised learning
  • E.g. decision trees, logistic regression

Data Mining
– Machine learning in large databases.

Pattern Recognition
– An application of machine learning to
  • Speech recognition
  • Face recognition
  • Etc.

Principle

Decision tree and lazy classification

[Figure: example of a "play golf" dataset, plotted by temperature and humidity]

Concept : classification

Confusion Matrix : a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).

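                     Actual class
                     case             control
Predicted   case     True positive    False positive
class       control  False negative   True negative

Note that this table places the actual class in the columns and the predicted class in the rows, the transpose of the convention described above.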

Classification: performance measurement

Classification measures:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
What proportion of cases and controls are correctly classified?

Sensitivity = TP / (TP + FN)
What proportion of cases are correctly classified?

Specificity = TN / (TN + FP)
What proportion of controls are correctly classified?

Balanced Accuracy = (Sensitivity + Specificity) / 2

TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative
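The four measures can be computed directly from the confusion matrix counts. The following is a minimal sketch; the function name `classification_measures` and the example counts are illustrative assumptions, not part of the original slides.

```python
def classification_measures(tp, tn, fp, fn):
    """Compute the slide's four measures from confusion matrix counts."""
    sensitivity = tp / (tp + fn)   # proportion of cases correctly classified
    specificity = tn / (tn + fp)   # proportion of controls correctly classified
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts: 50 cases (40 detected) and 50 controls (30 detected).
measures = classification_measures(tp=40, tn=30, fp=20, fn=10)
```

With balanced classes, as in this made-up example, accuracy and balanced accuracy coincide; they diverge when cases and controls are unequal in number.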

Receiver Operating Characteristic (ROC) :


Cross-validation:

Leave One Out Cross Validation (LOOCV)
– Better for small datasets
– Approximately unbiased estimate of prediction error

n-fold CV (e.g. 5-fold or 10-fold CV)
– Better for larger datasets
– Biased estimate of prediction error
– Lower variance
– May need to repeat several times and average results
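Both schemes amount to partitioning the sample indices into folds and holding each fold out once. A minimal sketch follows, assuming samples are addressed by integer index and that no shuffling is wanted (in practice the indices are usually shuffled first); `k_fold_splits` is a hypothetical helper name. LOOCV is the special case where the number of folds equals the number of samples.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]   # k disjoint folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), test


five_fold = list(k_fold_splits(10, 5))   # 5-fold CV on 10 samples
loocv = list(k_fold_splits(4, 4))        # LOOCV on 4 samples
```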

Classification : validation

Classification : filters

Statistical filters

• Minor allele frequency (e.g. > 0.1)
• LD (e.g. r² < 0.9)
• Chi-square
• Information gain
• Interaction information
• Gain ratio
• Principal components
• ReliefF
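As an illustration of the simplest filter in the list, the sketch below keeps only SNPs whose minor allele frequency exceeds a threshold. It assumes genotypes are coded 0, 1 or 2 (copies of one allele per individual); the SNP names and the helper names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency for genotypes coded as 0, 1 or 2 allele copies."""
    freq = sum(genotypes) / (2 * len(genotypes))   # frequency of the counted allele
    return min(freq, 1 - freq)


def filter_by_maf(snps, threshold=0.1):
    """Return the names of SNPs whose MAF is strictly above the threshold."""
    return [name for name, g in snps.items()
            if minor_allele_frequency(g) > threshold]


# Hypothetical genotype data for three SNPs in five individuals.
snps = {
    "snp_a": [0, 1, 1, 2, 0],   # MAF 0.4, kept
    "snp_b": [0, 0, 0, 0, 0],   # monomorphic, dropped
    "snp_c": [2, 2, 1, 2, 2],   # MAF about 0.1, dropped
}
kept = filter_by_maf(snps)
```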

Knowledge filters

• Prior statistical results
• Prior experimental results
• Biochemical pathway
• Gene Ontology
• Protein-protein interactions


Classification : wrapper

Many search methods

– Exhaustive
– Hill climbing
– Neural networks (NN)
– Simulated annealing (SA)
– Beam
– Evolutionary algorithms (EA)
– Genetic algorithms (GA)
– Genetic programming (GP)
– Estimation of Distribution Algorithms (EDA)

Step 1: Random generation of an initial population (generation) of neural networks.

Step 2: The data are split into 10 parts. Nine tenths are used for training.

Step 3: Selection of the n best NNs.

[Figure: GENN workflow, steps 1 to 6. Labels: GE parameter settings (mutation rate, maximum number of generations, …); population of neural networks (NN); recombination, mutation, duplication; classification error and prediction error.]


Example of GENN : Grammatical Evolution to optimize Neural Network

Step 4 (GE part 1): The performance of the NNs is evaluated on the training data. The networks are ranked by error rate and the best NNs are selected.

Step 5 (GE part 2): Among the networks selected in step 4, recombination, duplication and mutation events are simulated, creating a new generation of networks.

Step 6: The final NN is evaluated on the remaining 1/10 of the data, measuring its ability to classify affected and unaffected individuals.
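The select, duplicate, recombine and mutate cycle of steps 3 to 5 can be sketched as a generic evolutionary loop. This is a toy illustration only: each individual is a plain bit string rather than a grammar-encoded neural network, and the fitness function (number of ones) is a stand-in for one minus the classification error. The function name `evolve` and all parameter values are assumptions.

```python
import random


def evolve(fitness, genome_length, pop_size=20, n_best=5,
           mutation_rate=0.05, max_generations=30, seed=0):
    """Toy evolutionary loop: rank, select the n best, then breed a new generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(pop_size)]
    history = []                                  # best fitness seen per generation
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best = ranked[:n_best]                    # selection of the n best
        history.append(fitness(best[0]))
        children = [list(g) for g in best]        # duplication (unchanged copies)
        while len(children) < pop_size:
            mother, father = rng.sample(best, 2)
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]   # recombination
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]            # mutation
            children.append(child)
        population = children
    return max(population, key=fitness), history


best_genome, history = evolve(fitness=sum, genome_length=16)
```

Because the selected individuals are copied unchanged into the next generation, the best fitness never decreases from one generation to the next. In GENN the final model would then be scored on the held-out tenth of the data (step 6), which this sketch omits.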

Available data mining packages

Orange: http://www.ailab.si/orange/
Weka: http://www.cs.waikato.ac.nz/ml/

Ensemble Learning & Random Forest


Machine Learning Methods

• Decision trees
• Linear regression
• Neural networks
• k-nearest neighbour
• Naïve Bayesian classifiers
• Support Vector Machines
• Ensemble Learning Methods
  – Bagging, Boosting, Random Forests & RuleFit
• and many more ...

Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions

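[Figure: Learning phase: training sets T1, T2, …, TS are derived from the training set T, and a learner h1, h2, …, hS is trained on each. Application phase: a new example (x, ?) is labeled (x, y*) by the combined model h* = F(h1, h2, …, hS).]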

Ensemble Learning Methods

• Accuracy: a more reliable mapping can be obtained by combining the output of multiple "experts".

• Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach). Mixture of experts, ensemble feature selection.

• There is no single model that works for all pattern recognition problems! (no free lunch theorem)

"To solve really hard problems, we'll have to use several different representations… It is time to stop arguing over which type of pattern-classification technique is best… Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." (Minsky, 1991)



• How to generate base classifiers? Generation strategy:
  – Decision tree learning: ID3, C4.5 & CART
  – Instance-based learning: k-nearest neighbor
  – Bayesian classification: Naïve Bayes
  – Neural networks
  – Regression analysis
  – Clustering, etc.

• How to integrate them? Integration strategy:
  – BAGGing = Bootstrap AGGregation (Breiman, 1996)
  – Boosting (Schapire and Singer, 1998)
  – Random Forests (Breiman, 2001)



Bootstrap aggregating (bagging): a meta-algorithm that improves the stability and accuracy of classification and regression models.

Bagging generates m new training sets from a standard training set D by sampling examples from D uniformly and with replacement (bootstrap samples).

The m models are fitted on these m bootstrap samples and combined by averaging their outputs (for regression) or by voting (for classification).
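The procedure can be sketched with a very simple base learner. The example below assumes a one-dimensional feature and a threshold rule ("predict 1 if x >= t") as the base classifier; these choices, the function names and the toy data are illustrative assumptions, and any base learner could be substituted.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the threshold t minimizing training errors for 'predict 1 if x >= t'."""
    candidates = sorted(set(xs)) + [float("inf")]   # inf means 'always predict 0'

    def errors(t):
        return sum((x >= t) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)


def bagging_fit(xs, ys, m, seed=0):
    """Fit m stumps, each on a bootstrap sample drawn with replacement from D."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of size n
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def bagging_predict(thresholds, x):
    """Combine the m models by majority vote (classification)."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy training set D: label is 1 when x is at least 5.
xs = list(range(10))
ys = [int(x >= 5) for x in xs]
model = bagging_fit(xs, ys, m=25)
```

For regression the vote would be replaced by the average of the m model outputs.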


Boosting is a machine learning meta-algorithm for performing supervised learning.

Most boosting algorithms iteratively learn weak classifiers with respect to a distribution over the training examples and add them to a final strong classifier.

After a weak learner is added, the data are reweighted: misclassified examples gain weight and correctly classified examples lose weight.
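The reweighting step can be illustrated with the AdaBoost update rule. The slides do not name a specific algorithm, so using AdaBoost here is an assumption, as is the function name `reweight`.

```python
import math


def reweight(weights, correct):
    """One AdaBoost-style reweighting step.

    weights: current example weights (summing to 1)
    correct: booleans, True where the weak learner classified the example correctly
    Returns (new_weights, alpha), where alpha is the weak learner's vote weight.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)     # assumes 0 < error < 1
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha


# Four equally weighted examples; the weak learner gets the last one wrong.
new_weights, alpha = reweight([0.25] * 4, [True, True, True, False])
```

After the update the misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.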

[Figure: Random Forest. Tree 1, Tree 2, Tree 3, …, Tree i each produce a yes/no decision; the final classification (final decision) is based on votes from all N trees.]

Random Forest

Each tree is constructed using the following algorithm, where N is the number of cases and M is the number of variables in the classifier:

1. Choose the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.

2. Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases to estimate the error of the tree, by predicting their classes.

3. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.

4. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
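The two sources of randomness in the algorithm, the bootstrap sample per tree and the random choice of m variables per node, can be sketched as follows. Tree growing itself is omitted, and the helper names are hypothetical.

```python
import random


def bootstrap_with_oob(n_cases, rng):
    """Draw N cases with replacement; the cases never drawn are 'out of bag'."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_cases) if i not in chosen]
    return in_bag, out_of_bag


def variables_for_node(n_variables, m, rng):
    """Randomly choose m distinct variables (out of M) to consider at one node."""
    return rng.sample(range(n_variables), m)


rng = random.Random(1)
in_bag, out_of_bag = bootstrap_with_oob(100, rng)      # N = 100 cases
node_vars = variables_for_node(20, 4, rng)             # M = 20, m = 4
```

The out-of-bag cases are the ones used in step 2 to estimate the error of the tree, since the tree never saw them during training.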

• Its accuracy is as good as Adaboost and sometimes better
• It is relatively robust to outliers and noise
• It is faster than bagging or boosting
• It gives useful internal estimates of error, strength, correlation and variable importance
• It is simple and can be easily parallelized

Heidema 2006, BMC Genetics

[Figure: PubMed publications per year, 2003 to 2007 (y-axis 0 to 100); slide title: Random Forest]

Use of Entropy based Methods

B.A. McKinney et al. Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics

Information Theory

Shannon Entropy

$H = -\sum_{i=1}^{N} p_i \log p_i$

Probability = uncertainty about a state
Entropy = uncertainty about the system

Surprisal or self-entropy of event i:

$s_i(p_i) = -\log p_i$

$s_i(p_i = 1) = 0$
$s_i(p_i \to 0) \to \infty$

Example for a system X with two possible states such that:
p1 = 10^-5, s1 = 5 (logarithm in base 10)
p2 ≈ 1, s2 ≈ 0
H(X) ≈ 0

If there is no uncertainty about a system, its entropy is 0.

Least biased (maximum entropy) probability is uniform: $p_i = 1/N$
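The two-state example can be checked numerically. The sketch below is a minimal illustration; the function names are assumptions, and base 10 is used only to reproduce the s1 = 5 figure.

```python
import math


def surprisal(p, base=2.0):
    """Self-information of an event with probability p."""
    return -math.log(p, base)


def shannon_entropy(probabilities, base=2.0):
    """H = sum of p_i * s_i over all states, with 0 * log(0) taken as 0."""
    return sum(p * surprisal(p, base) for p in probabilities if p > 0)


s1 = surprisal(1e-5, base=10)                         # rare state: large surprisal
h_two_state = shannon_entropy([1e-5, 1 - 1e-5], base=10)
h_uniform = shannon_entropy([0.25] * 4)               # maximum entropy for N = 4
```

The nearly certain system has entropy close to zero even though one of its states is very surprising, while the uniform distribution over four states reaches the maximum of log2(4) = 2 bits.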

Information Theory

Entropy is not disorder

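Sequence 1: 1111100000, p1 = 0.5, p0 = 0.5, H(sequence1) = log(2)

Sequence 2: 1100010110, p1 = 0.5, p0 = 0.5, H(sequence2) = log(2)

Both sequences have the same entropy, although the first is ordered and the second is not!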
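A quick check of the two sequences, computing entropy from symbol frequencies only. The function name `sequence_entropy` is an assumption; the natural logarithm is used so that the result equals log(2).

```python
import math
from collections import Counter


def sequence_entropy(sequence):
    """Entropy (natural log) of the symbol frequencies in a sequence."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


h1 = sequence_entropy("1111100000")
h2 = sequence_entropy("1100010110")
```

Because this entropy depends only on how often each symbol occurs, not on the order of the symbols, both values are equal.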

Maximum Entropy Principle (Jaynes, Physical Review 106, 620 (1957))

Least biased probability distribution is the one that maximizes the information entropy subject to prior information

Information Theory

B.A. McKinney et al. Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics

Mutual Information (correlation), usually called Information Gain

$I(A;B) = \sum_{a \in A} \sum_{b \in B} p(a,b) \log_2 \frac{p(a,b)}{p(a)\,p(b)} = H(A) + H(B) - H(A,B)$

Information = Removal of uncertainty

H(A): uncertainty from SNP A
H(B): uncertainty from SNP B
I(A;B): uncertainty removed because of correlation between A and B
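Both forms of the mutual information, the double sum and the entropy identity, can be estimated from paired observations. This is a minimal sketch using empirical frequencies as probabilities; the function names and the toy data are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def mutual_information_direct(a, b):
    """I(A;B) as the double sum over observed (a, b) pairs."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())


snp_a = [0, 0, 1, 1]
identical = mutual_information(snp_a, snp_a)            # fully correlated
independent = mutual_information(snp_a, [0, 1, 0, 1])   # no correlation
```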

Information Gain

• Consider two attributes, A and B (two SNPs), and a class label C (disease status): A = SNP1, B = SNP2, C = disease status.

• If IG(ABC) > 0
  – Evidence for an attribute interaction

• If IG(ABC) < 0
  – The information between A and B is redundant

• If IG(ABC) = 0
  – Evidence of conditional independence or a mixture of synergy and redundancy
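The formula for IG(ABC) did not survive in the slide text. The sketch below assumes the standard interaction information, IG(ABC) = I(A,B;C) − I(A;C) − I(B;C), which matches the sign interpretation given above; the function names and the toy data are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_gain(a, b, c):
    """Assumed definition: IG(ABC) = I(A,B;C) - I(A;C) - I(B;C)."""
    joint = list(zip(a, b))
    return (mutual_information(joint, c)
            - mutual_information(a, c)
            - mutual_information(b, c))


# Synergy: the class is the XOR of the two SNPs, neither is informative alone.
a, b = [0, 0, 1, 1], [0, 1, 0, 1]
synergy = interaction_gain(a, b, [0, 1, 1, 0])

# Redundancy: both SNPs carry the same information about the class.
redundancy = interaction_gain(a, a, a)
```

The XOR case gives a positive value (interaction), while duplicated attributes give a negative value (redundancy).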

Attribute Selection based on Entropy

• Entropy-based IG is estimated for each individual attribute (i.e. main effects) and each pairwise combination of attributes (i.e. interaction effects).

• Pairs of attributes are sorted and those with the highest IG, or percentage of entropy in the class removed, are selected for further consideration.

Information Theory


Ahola et al. A statistical score for assessing the quality of multiple sequence alignments. Bioinformatics
D'haeseleer. What are DNA sequence motifs? Nature Biotechnology

Kullback-Leibler entropy

$D(P, Q) = \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i}$

p_i = observed probability
q_i = prior probability (under H0, equal to 0.25 for each nucleotide)

Example application: measuring the quality of sequencing
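For one alignment column, the observed nucleotide frequencies can be compared with the uniform prior of 0.25. The sketch below assumes an RNA alphabet (A, C, G, U), ignores gap characters, and uses base-2 logarithms so the result is in bits; the function names are hypothetical.

```python
import math
from collections import Counter


def kl_divergence(p, q):
    """D(P, Q) = sum of p_i * log2(p_i / q_i), with 0 * log(0) taken as 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def column_relative_entropy(column, alphabet="ACGU", prior=0.25):
    """Relative entropy of one alignment column against a uniform prior."""
    symbols = [s for s in column if s in alphabet]   # gaps ('-') are skipped
    counts = Counter(symbols)
    n = len(symbols)
    p = [counts[s] / n for s in alphabet]
    q = [prior] * len(alphabet)
    return kl_divergence(p, q)


conserved = column_relative_entropy("CCCCCCCCC")   # every sequence has C
uninformative = column_relative_entropy("ACGU")    # matches the prior exactly
```

A fully conserved column reaches the maximum of 2 bits, while a column that matches the prior scores 0.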

Gorodkin et al. Comput. Appl. Biosci., Vol. 13, no. 6, pp 583-586, 1997.
Tuerk et al. PNAS 89, pp 6988-6992, 1992.

> CCAGAGGCCCAACUGGUAAACGGGC
> CCG-AAGCUCAACGGGAUAAUGAGC
> CCG-AAGCCGAACGGGAAAACCGGC
> CC-CAAGCGC-AGGGGAGAA-GCGC
> CCG-ACGCCA-ACGGGAGAA-UGGC
> CCGUUUUCAG-UCGGGAAAAACUGA
> CCGUUACUCC-UCGGGAUAAAGGAG
> CCGUAAGAGG-ACGGGAUAAACCUC
> CCG-UAGGAG-GCGGGAUAU-CUCC

Relative Uncertainty on base U for one specific position

MDR: Multifactor Dimensionality Reduction

MDR Step 1 & 2

• Step 1: partition the data into some number of equal parts for cross-validation
• Step 2: a set of N genetic and/or discrete environmental factors is selected from the list of all factors

MDR Step 3

• The N factors and their multifactor classes or cells are represented in N-dimensional space
• The ratio of the number of cases to the number of controls is evaluated within each multifactor cell

MDR Step 4

• Each multifactor cell in N-dimensional space is labeled as high-risk if the ratio meets or exceeds some threshold T (e.g. T = 1.0) and low-risk otherwise
• Those cells labeled high-risk are in one group and those labeled low-risk are in another group, which reduces the N-dimensional model to one dimension
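Steps 3 and 4 can be sketched as counting cases and controls per multifactor cell and comparing their ratio with T. In this minimal illustration a cell is a tuple of genotype codes; treating a cell with cases but no controls as high-risk is an assumption, as are the function name and the toy data.

```python
from collections import defaultdict


def label_cells(genotypes, status, threshold=1.0):
    """Label each multifactor cell high-risk or low-risk.

    genotypes: one tuple of factor values per individual (the cell)
    status:    1 for a case, 0 for a control
    A cell is high-risk when cases / controls >= threshold, written as
    cases >= threshold * controls to avoid dividing by zero.
    """
    counts = defaultdict(lambda: [0, 0])            # cell -> [cases, controls]
    for cell, is_case in zip(genotypes, status):
        counts[tuple(cell)][0 if is_case else 1] += 1
    return {cell: "high-risk" if cases >= threshold * controls else "low-risk"
            for cell, (cases, controls) in counts.items()}


# Hypothetical two-SNP data for eight individuals.
genotypes = [(0, 0), (0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (1, 1), (1, 0)]
status = [1, 1, 0, 1, 0, 0, 1, 0]
labels = label_cells(genotypes, status)
```

The resulting high-risk or low-risk label is the single new attribute that replaces the N original factors.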

MDR Step 5 & 6

• Step 5: all possible combinations of N factors are evaluated for their ability to classify affected and unaffected individuals in the training data, and the best N-factor model is selected.
• Step 6: the independent test data from the cross-validation is used to estimate the prediction error of the best model selected.

MDR overview

MDR final

• Steps 1 through 6 are repeated for each possible cross-validation interval
• The final step: determine which multifactor levels (e.g. genotypes) are high risk and which are low risk using the entire dataset.

Interpretation – Interaction Graphs

• Composed of a node for each attribute, with pairwise connections between them.

• Each node is labeled with the percentage of entropy removed (i.e. IG) by that attribute.

• Each connection is labeled with the percentage of entropy removed by the pairwise Cartesian product of the two attributes.

MDR : New features

IG > 0: Evidence for an attribute interaction

IG < 0: The information between SNP1 and SNP2 is redundant

IG = 0: Evidence of conditional independence or a mixture of synergy and redundancy

[Figure: interaction graph for SNP1, SNP2 and Class. Connections from SNP1 to Class and from SNP2 to Class: attribute effect. Connection between SNP1 and SNP2: attribute correlation. Center: attribute interaction I(SNP1;SNP2;Class).]

Interpretation – Dendrograms

• Hierarchical clustering is used to build a dendrogram that places strongly interacting attributes close together at the leaves of the tree.