DATA MINING APPROACH
IGES 2008
What is Machine Learning?
– Algorithms that allow computers to learn.
– Unsupervised learning
• E.g. cluster analysis
– Supervised learning
• E.g. decision trees, logistic regression
Data Mining
– Machine learning in large databases.
Pattern Recognition
– An application of machine learning to
• Speech recognition
• Face recognition
• Etc.
Principle
Decision tree
Lazy classification
[Figure: example of a "play golf" dataset, plotted by temperature and humidity]
Concept : classification
Confusion Matrix : a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).
                         Actual class
Predicted class     case              control
case                True positive     False positive
control             False negative    True negative
(Here rows are predicted classes and columns are actual classes.)
Classification : performance measurement
Classification Measures:
TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative
Accuracy = (TP + TN) / (TP + TN + FP + FN)
  What proportion of cases and controls are correctly classified?
Sensitivity = TP / (TP + FN)
  What proportion of cases are correctly classified?
Specificity = TN / (TN + FP)
  What proportion of controls are correctly classified?
Balanced Accuracy = (Sensitivity + Specificity) / 2
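The four measures above can be computed directly from the confusion-matrix counts. The sketch below is only an illustration: the function name and the counts (10 cases, 90 controls) are hypothetical and do not come from the slides.

```python
def classification_measures(tp, fn, fp, tn):
    """Return the four measures defined above from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # proportion of cases correctly classified
    specificity = tn / (tn + fp)   # proportion of controls correctly classified
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical, unbalanced example: 10 cases and 90 controls.
measures = classification_measures(tp=5, fn=5, fp=5, tn=85)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With unbalanced classes such as these, accuracy is dominated by the controls, while balanced accuracy weights cases and controls equally.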
Receiver Operating Characteristic (ROC) :
Cross-validation :
Leave One Out Cross Validation (LOOCV)
– Better for small datasets
– Unbiased estimate of prediction error
n-fold CV (e.g. 5-fold or 10-fold CV)
– Better for larger datasets
– Biased estimate of prediction error
– Lower variance
– May need to repeat several times and average results
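A minimal sketch of how the two schemes split the data, assuming samples are referred to by index. The helper name `k_fold_indices` is hypothetical; LOOCV is obtained as the special case where the number of folds equals the number of samples.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k nearly equal parts
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits_10fold = list(k_fold_indices(20, k=10))    # 10-fold CV
splits_loocv = list(k_fold_indices(20, k=20))     # LOOCV: one sample per fold

for train, test in splits_10fold[:2]:
    print(len(train), "training samples, test fold:", test)
```

Repeating n-fold CV, as suggested above, would amount to calling the helper with several different seeds and averaging the resulting error estimates.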
Classification : validation
Classification : filters
Statistical Filters
• Minor allele frequency (e.g. > 0.1)
• LD (e.g. r² < 0.9)
• Chi-square
• Information gain
• Interaction information
• Gain ratio
• Principal components
• ReliefF
Knowledge Filters
• Prior statistical results
• Prior experimental results
• Biochemical pathway
• Gene Ontology
• Protein-protein interactions
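As one illustration of a statistical filter, the minor allele frequency threshold (e.g. > 0.1) could be applied as sketched here. The sketch assumes genotypes are coded as 0, 1 or 2 copies of one allele; the SNP names and data are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """genotypes: list of 0/1/2 allele counts for one SNP (assumed coding)."""
    freq = sum(genotypes) / (2 * len(genotypes))   # frequency of the counted allele
    return min(freq, 1 - freq)                     # the minor allele is the rarer one

def maf_filter(snps, threshold=0.1):
    """Keep only the SNPs whose minor allele frequency exceeds the threshold."""
    return {name: g for name, g in snps.items()
            if minor_allele_frequency(g) > threshold}

# Hypothetical data: 10 individuals per SNP.
snps = {
    "rs_common":  [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],
    "rs_rare":    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    "rs_flipped": [2, 2, 2, 2, 1, 2, 2, 2, 2, 2],
}
kept = maf_filter(snps)
print(sorted(kept))
```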
Classification : wrapper
Many search methods
– Exhaustive
– Hill climbing
• Neural networks (NN)
• Simulated annealing (SA)
– Beam
• Evolutionary algorithms (EA)
• Genetic algorithms (GA)
• Genetic programming (GP)
• Estimation of Distribution Algorithms (EDA)
Example of GENN : Grammatical Evolution to optimize Neural Network
Step 1: Random generation of an initial generation of neural networks (NN).
Step 2: The data are split into 10 parts; 9/10 are used for training.
Step 3: Selection of the n best NNs.
Step 4 (GE part 1): The performance of the NNs is evaluated on the training data. The networks are ranked by error rate and the best NNs are selected.
Step 5 (GE part 2): Recombination, duplication and mutation are simulated among the networks selected in step 4, creating a new generation of networks.
Step 6: The final NN is evaluated on the remaining 1/10 of the data to measure its ability to classify affected and unaffected individuals.
[Figure: GENN workflow, steps 1 to 6. GE settings (mutation rate, maximum number of generations, …) drive a population of NNs through recombination, mutation and duplication; the outputs are the classification error and the prediction error.]
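The select / recombine / mutate / duplicate loop of the steps above can be sketched generically. This is not GENN itself: as a loudly stated assumption, each individual is a plain bit string and the fitness is a toy function (number of ones) standing in for the classification performance of a neural network on the training data. All names and parameter values are hypothetical.

```python
import random

def evolve(fitness, genome_length=20, pop_size=30, n_best=10,
           mutation_rate=0.05, max_generations=40, seed=0):
    """Toy evolutionary loop: selection, duplication, recombination, mutation."""
    rng = random.Random(seed)
    # Random initial generation.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(max_generations):
        # Rank the individuals and select the n best.
        population.sort(key=fitness, reverse=True)
        parents = population[:n_best]
        history.append(fitness(parents[0]))
        # Duplication: the selected individuals are copied unchanged.
        children = [p[:] for p in parents]
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_length)
            child = mum[:cut] + dad[cut:]                  # recombination
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]                       # mutation
            children.append(child)
        population = children
    return max(population, key=fitness), history

# Toy fitness (assumption): count of ones in the bit string.
best, history = evolve(fitness=sum)
print("best fitness per generation:", history)
```

In a GENN-like setting the fitness call would train and score a network on the 9/10 training part, and the final individual would then be assessed on the held-out 1/10.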
Available data mining packages
Orange: http://www.ailab.si/orange/
Weka: http://www.cs.waikato.ac.nz/ml/
Ensemble Learning & Random Forest
Machine Learning Methods
• Decision trees
• Linear regression
• Neural networks
• k-nearest neighbour
• Naïve Bayesian classifiers
• Support Vector Machines
• Ensemble Learning Methods
– Bagging, Boosting, Random Forests & RuleFit
• and many more ...
Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions
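[Figure: Learning phase: training sets T1, T2, …, TS are derived from the training set T, and a learner h1, h2, …, hS is trained on each. Application phase: for a new instance (x, ?), the combined model h* = F(h1, h2, …, hS) outputs (x, y*).]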
Ensemble Learning Methods
• Accuracy: a more reliable mapping can be obtained by combining the output of multiple "experts"
• Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach). Mixture of experts, ensemble feature selection.
• There is no single model that works for all pattern recognition problems! (no free lunch theorem)
"To solve really hard problems, we'll have to use several different representations… It is time to stop arguing over which type of pattern-classification technique is best… Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of comparing things." (Minsky, 1991)
Ensemble Learning Methods
• How to generate base classifiers? Generation strategy:
– Decision tree learning: ID3, C4.5 & CART
– Instance-based learning: k-nearest neighbor
– Bayesian classification: Naïve Bayes
– Neural networks
– Regression analysis
– Clustering, etc.
• How to integrate them? Integration strategy:
– BAGGing = Bootstrap AGGregation (Breiman, 1996)
– Boosting (Schapire and Singer, 1998)
– Random Forests (Breiman, 2001)
Ensemble Learning Methods
Bootstrap aggregating (bagging): a meta-algorithm that improves the stability and accuracy of classification and regression models.
Given a standard training set D, bagging generates m new training sets by sampling examples from D uniformly and with replacement (bootstrap samples).
The m models are fitted on these m bootstrap samples and combined by averaging their outputs (for regression) or by voting (for classification).
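A minimal sketch of bagging for classification. The base learner is an assumption made for brevity: a one-dimensional threshold placed halfway between the two class means. The data set D, the function names and the choice of m = 25 are all hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples from data uniformly, with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Toy base learner (assumption): threshold halfway between class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:                 # only one class drawn: constant model
        label = 0 if xs0 else 1
        return lambda x: label
    mean0, mean1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    threshold = (mean0 + mean1) / 2
    high = 1 if mean1 > mean0 else 0       # class predicted above the threshold
    return lambda x: high if x > threshold else 1 - high

def bagging(data, m=25, seed=0):
    """Fit m models on m bootstrap samples and combine them by majority vote."""
    rng = random.Random(seed)
    models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(m)]
    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict

# Hypothetical training set D of (x, class) pairs.
D = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
     (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
predict = bagging(D)
print(predict(0.0), predict(5.0))
```

For regression the vote in `predict` would be replaced by the mean of the model outputs.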
Ensemble Learning Methods
Boosting is a machine learning meta-algorithm for performing supervised learning.
Most boosting algorithms iteratively learn weak classifiers with respect to a distribution over the training examples and add them to a final strong classifier.
After a weak learner is added, the data are reweighted: misclassified examples gain weight and correctly classified examples lose weight.
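The reweighting step can be illustrated with the update used by AdaBoost, one common boosting algorithm; the slides do not say which variant is meant, so this choice is an assumption. The example weights are hypothetical.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style update.

    weights: current example weights, summing to 1.
    correct: correct[i] is True if the weak learner classified example i correctly.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)    # vote of this weak learner
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha      # renormalise to sum to 1

# Five equally weighted examples; the weak learner gets the last one wrong.
weights = [0.2] * 5
correct = [True, True, True, True, False]
new_weights, alpha = reweight(weights, correct)
print(new_weights, alpha)
```

After the update the misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.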
Random Forest
[Figure: decision trees Tree 1, Tree 2, Tree 3, …, Tree i, each with Yes/No branches. The final decision (classification) is based on votes from all N trees.]
Each tree is constructed using the following algorithm:
Number of cases = N, number of variables in the classifier = M.
1. Choose the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.
2. Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases to estimate the error of the tree, by predicting their classes.
3. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
4. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
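The two sources of randomness in this algorithm, the bootstrap sample per tree and the m candidate variables per node, can be sketched as follows. Tree growing itself is omitted, and the values N = 100, M = 16 and m = 4 are arbitrary, hypothetical choices.

```python
import random

def tree_sample(n_cases, rng):
    """Bootstrap sample of case indices, plus the cases left out of it."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]   # N draws with replacement
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))      # used to estimate error
    return in_bag, out_of_bag

def node_variables(n_variables, m, rng):
    """Randomly choose m of the M variables as split candidates at one node."""
    return rng.sample(range(n_variables), m)

rng = random.Random(0)
N, M, m = 100, 16, 4
in_bag, out_of_bag = tree_sample(N, rng)
candidates = node_variables(M, m, rng)
print(len(set(in_bag)), "distinct cases in the bootstrap sample")
print(len(out_of_bag), "cases left out for error estimation")
print("candidate variables at this node:", candidates)
```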
• Its accuracy is as good as Adaboost and sometimes better
• It is relatively robust to outliers and noise
• It is faster than bagging or boosting
• It gives useful internal estimates of error, strength, correlation and variable importance
• It is simple and can be easily parallelized
Random Forest
Heidema 2006 BMC Genetics
[Figure: number of PubMed publications per year, 2003 to 2007 (vertical axis from 0 to 100)]
Random Forest
Use of Entropy based Methods
B.A. McKinney et al. Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics
Information Theory
Shannon Entropy
$H = -\sum_{i=1}^{N} p_i \log p_i$
Probability = uncertainty about a state
Entropy = uncertainty about the system
Surprisal or Self-Entropy of event i:
$s_i(p_i) = -\log p_i$
$s_i(p_i \to 1) = 0$
$s_i(p_i \to 0) = \infty$
Example for a system X with two possible states such that:
p1 = 10^-5, s1 = 5
p2 ≈ 1, s2 ≈ 0
H(X) ≈ 0 (logarithms in base 10)
If there is no uncertainty about a system, entropy = 0.
Least biased (maximum entropy) probability is uniform:
$p_i = 1/N$
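The two-state example can be checked numerically. This short sketch assumes base-10 logarithms, which is what the value s1 = 5 for p1 = 10^-5 implies; the function names are hypothetical.

```python
import math

def surprisal(p, base=10):
    """Surprisal (self-entropy) of an event of probability p."""
    return -math.log(p, base)

def entropy(probs, base=10):
    """Shannon entropy: the expected surprisal over all states."""
    return sum(p * surprisal(p, base) for p in probs if p > 0)

s1 = surprisal(1e-5)                       # rare state: large surprisal, close to 5
s2 = surprisal(1 - 1e-5)                   # near-certain state: surprisal close to 0
H_skewed = entropy([1e-5, 1 - 1e-5])       # almost no uncertainty about the system
H_uniform = entropy([0.5, 0.5], base=2)    # maximum for two states (1 bit)

print(s1, s2, H_skewed, H_uniform)
```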
Information Theory
Entropy is not disorder
Sequence 1: 1111100000
p1 = 0.5, p0 = 0.5
H(sequence1) = log(2)
Sequence 2: 1100010110
p1 = 0.5, p0 = 0.5
H(sequence2) = log(2)
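The point that entropy depends only on symbol frequencies, not on their order, can be verified on the two sequences. A minimal sketch, with logarithms taken in base 2 so that log(2) is 1 bit:

```python
import math
from collections import Counter

def symbol_entropy(sequence):
    """Shannon entropy (in bits) of the symbol frequencies of a sequence."""
    n = len(sequence)
    return -sum(c / n * math.log2(c / n) for c in Counter(sequence).values())

h1 = symbol_entropy("1111100000")   # ordered sequence
h2 = symbol_entropy("1100010110")   # shuffled-looking sequence
print(h1, h2)
```

Both sequences contain five ones and five zeros, so both calls return the same value.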
Maximum Entropy Principle (Jaynes, Physical Review 106, 620 (1957))
Least biased probability distribution is the one that maximizes the information entropy subject to prior information
Information Theory
B.A. McKinney et al. Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics
Mutual Information (correlation)
Usually called Information Gain
$I(A;B) = \sum_{a \in A} \sum_{b \in B} p(a,b) \log_2 \frac{p(a,b)}{p(a)\,p(b)} = H(A) + H(B) - H(A,B)$
Information = Removal of uncertainty
Uncertainty from SNP A
Uncertainty from SNP B
Uncertainty removed because of correlation between A and B
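The identity I(A;B) = H(A) + H(B) - H(A,B) gives a simple way to estimate mutual information from paired observations. In this sketch probabilities are estimated by counting, and the SNP vectors are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits, with probabilities estimated by counting."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B) for paired observations a[i], b[i]."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

snp_a = [0, 0, 1, 1, 2, 2, 0, 1]
identical = mutual_information(snp_a, snp_a)                   # all uncertainty removed
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])   # nothing removed
print(identical, entropy(snp_a), independent)
```

When B is a copy of A the mutual information equals H(A); when the two are independent it is zero.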
• Consider two attributes, A and B (two SNPs), and a class label C (disease status).
Information Theory
Information Gain
• If IG(ABC) > 0
– Evidence for an attribute interaction
• If IG(ABC) < 0
– The information between A and B is redundant
• If IG(ABC) = 0
– Evidence of conditional independence or a mixture of synergy and redundancy
Information Theory
A = SNP1
B = SNP2
C = disease status
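The slides do not reproduce the formula for IG(ABC). The sketch below assumes the commonly used definition IG(ABC) = I(A,B;C) - I(A;C) - I(B;C), i.e. the information the pair carries about the class beyond the two main effects. The data are hypothetical toy vectors.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits, with probabilities estimated by counting."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def interaction_gain(a, b, c):
    """Assumed definition: IG(ABC) = I(A,B;C) - I(A;C) - I(B;C)."""
    joint = list(zip(a, b))
    return (mutual_information(joint, c)
            - mutual_information(a, c)
            - mutual_information(b, c))

a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c_xor = [0, 1, 1, 0]                       # class depends only on the combination
synergy = interaction_gain(a, b, c_xor)    # positive: attribute interaction
redundancy = interaction_gain(a, a, a)     # negative: B merely repeats A
print(synergy, redundancy)
```

The XOR-like class is invisible to each attribute alone but fully determined by the pair, which is the pure-interaction case.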
Attribute Selection based on Entropy
• Entropy-based IG is estimated for each individual attribute (i.e. main effects) and each pairwise combination of attributes (i.e. interaction effects).
• Pairs of attributes are sorted and those with the highest IG, or percentage of entropy in the class removed, are selected for further consideration.
Information Theory
Ahola et al. A statistical score for assessing the quality of multiple sequence alignments. Bioinformatics
D'haeseleer. What are DNA sequence motifs? Nature Biotechnology
Kullback-Leibler relative entropy
$D(P, Q) = \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i}$
p_i = observed probability
q_i = prior probability (under H0, q_i = 0.25 for each nucleotide)
Example application: measuring sequencing quality
Gorodkin et al. Comput. Appl. Biosci., Vol. 13, no. 6, pp 583-586, 1997.
Tuerk et al. PNAS 89, pp 6988-6992, 1992.
> CCAGAGGCCCAACUGGUAAACGGGC
> CCG-AAGCUCAACGGGAUAAUGAGC
> CCG-AAGCCGAACGGGAAAACCGGC
> CC-CAAGCGC-AGGGGAGAA-GCGC
> CCG-ACGCCA-ACGGGAGAA-UGGC
> CCGUUUUCAG-UCGGGAAAAACUGA
> CCGUUACUCC-UCGGGAUAAAGGAG
> CCGUAAGAGG-ACGGGAUAAACCUC
> CCG-UAGGAG-GCGGGAUAU-CUCC
Relative Uncertainty on base U for one specific position
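A sketch of the Kullback-Leibler relative entropy applied column by column to the alignment above, with a uniform prior of 0.25 per nucleotide. Ignoring gap characters and using base-2 logarithms are assumptions of this sketch, not choices stated in the slides.

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """D(P, Q) = sum_i p_i * log2(p_i / q_i), in bits."""
    return sum(pi * math.log2(pi / q[base]) for base, pi in p.items() if pi > 0)

def column_frequencies(sequences, position):
    """Observed base frequencies at one alignment position (gaps ignored)."""
    counts = Counter(seq[position] for seq in sequences if seq[position] != "-")
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

alignment = [
    "CCAGAGGCCCAACUGGUAAACGGGC", "CCG-AAGCUCAACGGGAUAAUGAGC",
    "CCG-AAGCCGAACGGGAAAACCGGC", "CC-CAAGCGC-AGGGGAGAA-GCGC",
    "CCG-ACGCCA-ACGGGAGAA-UGGC", "CCGUUUUCAG-UCGGGAAAAACUGA",
    "CCGUUACUCC-UCGGGAUAAAGGAG", "CCGUAAGAGG-ACGGGAUAAACCUC",
    "CCG-UAGGAG-GCGGGAUAU-CUCC",
]
prior = {base: 0.25 for base in "ACGU"}    # q_i under H0

conserved = kl_divergence(column_frequencies(alignment, 0), prior)   # all C
variable = kl_divergence(column_frequencies(alignment, 4), prior)    # mix of A and U
print(conserved, variable)
```

A fully conserved column reaches the maximum of 2 bits, while a mixed column scores lower.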
MDR :Multifactor Dimensionality Reduction
MDR Step 1 & 2
• Step 1: partition the data into some number of equal parts for cross-validation
• Step 2: a set of N genetic and/or discrete environmental factors is selected from the list of all factors
MDR Step 3
• The N factors and their multifactor classes or cells are represented in N-dimensional space
• The ratio of the number of cases to the number of controls is evaluated within each multifactor cell
MDR Step 4
• Each multifactor cell in N-dimensional space is labeled as high-risk if the ratio meets or exceeds some threshold T (e.g. T = 1.0) and low-risk otherwise
• Those cells labeled high-risk are in one group and those labeled low-risk are in another group, which reduces the N-dimensional model to one dimension
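The case/control ratio and the high-risk/low-risk labelling rule (threshold T) can be sketched for two factors as follows. The genotype coding, the data and the function name are hypothetical; this illustrates the rule only, not the MDR software.

```python
from collections import Counter

def label_cells(genotypes, status, threshold=1.0):
    """Label each multifactor cell as high or low risk.

    genotypes: one tuple per individual (the multifactor cell it falls in).
    status: 1 for a case, 0 for a control.
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    labels = {}
    for cell in set(cases) | set(controls):
        # cases/controls >= T, written without division to allow empty cells
        is_high = cases[cell] >= threshold * controls[cell]
        labels[cell] = "high" if is_high else "low"
    return labels

# Hypothetical two-SNP data: three occupied cells, four individuals each.
genotypes = [(0, 0)] * 4 + [(0, 1)] * 4 + [(1, 1)] * 4
status = [1, 1, 1, 0,   1, 0, 0, 0,   1, 1, 0, 0]

labels = label_cells(genotypes, status)
reduced = [labels[g] for g in genotypes]   # the single high/low-risk dimension
print(labels)
```

The list `reduced` is the one-dimensional variable that replaces the original N factors for classification.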
MDR Step 5 & 6
• Step 5: all possible combinations of N factors are evaluated for their ability to classify affected and unaffected individuals in the training data, and the best N-factor model is selected.
• Step 6: the independent test data from the cross-validation is used to estimate the prediction error of the best model selected.
MDR overview
MDR final
• Steps 1 through 6 are repeated for each possible cross-validation interval
• The final step: determine which multifactor levels (e.g. genotypes) are high risk and which are low risk using the entire dataset.
Interpretation – Interaction Graphs
• Comprised of a node for each attribute with pairwise connections between them.
• Each node is labeled with the percentage of entropy removed (i.e. IG) by each attribute.
• Each connection is labeled with the percentage of entropy removed for each pairwise Cartesian product of attributes.
MDR : New features
IG > 0: Evidence for an attribute interaction
IG < 0: The information between SNP1 and SNP2 is redundant
IG = 0: Evidence of conditional independence or a mixture of synergy and redundancy
[Figure: interaction graph with nodes Class, SNP1 and SNP2. The links SNP1-Class and SNP2-Class show the attribute effects, the link SNP1-SNP2 shows the attribute correlation, and the three-way term I(SNP1;SNP2;Class) shows the attribute interaction.]
• Hierarchical clustering is used to build a dendrogram that places strongly interacting attributes close together at the leaves of the tree.
Interpretation – Dendrograms
MDR : New features