Experimental Evaluation of Learning Algorithms
Part 1
Motivation

Evaluating the performance of learning systems is important because:
- Learning systems are usually designed to predict the class of "future" unlabeled data points.
- In some cases, evaluating hypotheses is an integral part of the learning process (for example, when pruning a decision tree).
Recommended Steps for Proper Evaluation

1. Identify the "interesting" properties of the classifier.
2. Choose an evaluation metric accordingly.
3. Choose the learning algorithms to involve in the study, along with the domain(s) on which the various systems will be compared.
4. Choose a confidence estimation method.
5. Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
6. Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
7. Interpret the results with respect to the domain(s).
The Classifier Evaluation Procedure

[Flowchart: Choose learning algorithm(s) to evaluate -> Select performance measure of interest -> Select datasets for comparison -> Select error-estimation/sampling method -> Select statistical test -> Perform evaluation. One arrow style means that knowledge of the earlier step is necessary for the later one; the other means that feedback from the later step should be used to adjust the earlier one.]
Typical (but not necessarily optimal) Choices I:

1. Identify the "interesting" properties of the classifier.
2. Choose an evaluation metric accordingly.
3. Choose a confidence estimation method.
4. Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
5. Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
6. Interpret the results with respect to the domain.

In practice, almost only the highlighted steps are performed; the remaining steps are typically considered, but only very lightly.
Typical (but not necessarily optimal) Choices II:

Typical choices for performance evaluation:
- Accuracy
- Precision/Recall

Typical choices for sampling methods:
- Train/test sets (why is this necessary?)
- k-fold cross-validation

Typical choices for significance estimation:
- t-test (often a very bad choice, in fact!)
Confusion Matrix / Common Performance Evaluation Metrics

A confusion matrix:

                          True class
  Hypothesized class      Pos          Neg
  Yes                     TP           FP
  No                      FN           TN
                          P = TP+FN    N = FP+TN

  Accuracy       = (TP+TN)/(P+N)
  Precision      = TP/(TP+FP)
  Recall/TP rate = TP/P
  FP rate        = FP/N
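The metrics above can be computed directly from the four cells of the confusion matrix. The following is a minimal sketch; the function and argument names are our own.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the standard metrics from the four cells of a confusion matrix."""
    p = tp + fn   # total actual positives: P = TP + FN
    n = fp + tn   # total actual negatives: N = FP + TN
    return {
        "accuracy":  (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,   # also called the TP rate
        "fp_rate":   fp / n,
    }

m = confusion_metrics(tp=200, fp=100, fn=300, tn=400)
print(m)  # accuracy 0.6, precision ~0.667, recall 0.4, fp_rate 0.2
```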
Sampling and Significance Estimation: Questions Considered

- Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples?
- Given that one hypothesis outperforms another over some sample of data, how probable is it that this hypothesis is more accurate in general?
- When data is limited, what is the best way to use this data to both learn a hypothesis and estimate its accuracy?
k-Fold Cross-Validationk-Fold Cross-Validation
1. Partition the available data 1. Partition the available data DD00 intointo kk disjoint subsets disjoint subsets TT11, T, T22, …, , …,
TTkk of equal size, where this size is at least 30. of equal size, where this size is at least 30.
2. For 2. For ii from from 11 to to kk, do, do
use use TTii for the test set, and the remaining data for training set for the test set, and the remaining data for training set SSii
SSii <- {D <- {D00 - T - Tii}} hhA A <- L<- LAA(S(Sii)) hhB B <- L<- LBB(S(Sii)) i <- errori <- errorTiTi(h(hAA)-error)-errorTiTi(h(hBB))
3. Return the value 3. Return the value avg(avg(),), where . where . avg(avg() = 1/k ) = 1/k i=1i=1
k k ii
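The procedure above can be sketched as follows. The dataset and the two learners are hypothetical stand-ins: a "learner" here is any function that maps a training set to a classifier.

```python
import random

def kfold_compare(data, learner_a, learner_b, k=10):
    """Return avg(delta), the mean difference in test-set error between
    the classifiers produced by learner_a and learner_b over k folds."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k disjoint subsets T_1..T_k
    deltas = []
    for i in range(k):
        test = folds[i]                                            # T_i
        train = [x for j in range(k) if j != i for x in folds[j]]  # S_i = D0 - T_i
        h_a, h_b = learner_a(train), learner_b(train)
        def error(h):                        # error_Ti(h): fraction misclassified
            return sum(h(x) != y for x, y in test) / len(test)
        deltas.append(error(h_a) - error(h_b))   # delta_i
    return sum(deltas) / k                       # avg(delta)

# Toy usage: two trivial "learners" that ignore the training data.
data = [((j,), j % 2) for j in range(100)]           # label = parity of j
always_one  = lambda train: (lambda x: 1)
always_zero = lambda train: (lambda x: 0)
print(kfold_compare(data, always_one, always_zero))  # ~0.0: labels are balanced
```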
Confidence of the k-fold Estimate

The most commonly used approach to confidence estimation in machine learning is:
- To run the algorithm using 10-fold cross-validation and to record the accuracy at each fold.
- To compute a confidence interval around the average of the difference between these reported accuracies and a given gold standard, using the t-test, i.e., the following formula:

    δ̄ +/- t_{N,9} * s_δ

  where
  - δ̄ is the average difference between the reported accuracy and the given gold standard,
  - t_{N,9} is a constant chosen according to the degree of confidence desired (9 degrees of freedom for 10 folds),
  - s_δ = sqrt( (1/90) Σi=1..10 (δi - δ̄)² ), where δi represents the difference between the reported accuracy and the given gold standard at fold i.
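A sketch of this interval in code. The constant 2.262 is the two-sided t value for 95% confidence with 9 degrees of freedom, the usual choice for 10 folds; the function name is ours.

```python
import math

def kfold_confidence_interval(deltas, t_const=2.262):
    """t-based confidence interval around the mean of per-fold differences.
    2.262 is the two-sided 95% t constant for 9 degrees of freedom (k = 10)."""
    k = len(deltas)
    mean = sum(deltas) / k
    # s_delta = sqrt( 1/(k(k-1)) * sum_i (delta_i - mean)^2 ); k(k-1) = 90 for k = 10
    s = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))
    return mean - t_const * s, mean + t_const * s

lo, hi = kfold_confidence_interval([0.05, -0.05] * 5)
print(lo, hi)  # a symmetric interval around 0
```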
What's wrong with Accuracy?

Classifier 1:
            True class
            Pos     Neg
  Yes       200     100
  No        300     400
            P=500   N=500

Classifier 2:
            True class
            Pos     Neg
  Yes       400     300
  No        100     200
            P=500   N=500

- Both classifiers obtain 60% accuracy.
- Yet they exhibit very different behaviours:
  - On the left: weak positive recognition rate / strong negative recognition rate.
  - On the right: strong positive recognition rate / weak negative recognition rate.
What's wrong with Precision/Recall?

Classifier 1:
            True class
            Pos     Neg
  Yes       200     100
  No        300     400
            P=500   N=500

Classifier 2:
            True class
            Pos     Neg
  Yes       200     100
  No        300     0
            P=500   N=100

- Both classifiers obtain the same precision and recall values of 66.7% and 40%.
- Yet they exhibit very different behaviours:
  - Same positive recognition rate.
  - Extremely different negative recognition rate: strong on the left / nil on the right.
- Note: accuracy has no problem catching this!
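The two matrices above can be checked numerically with a small helper (the helper name is ours):

```python
def summarize(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) for one confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

left  = summarize(tp=200, fp=100, fn=300, tn=400)  # strong negative recognition
right = summarize(tp=200, fp=100, fn=300, tn=0)    # nil negative recognition
print(left)   # precision ~0.667, recall 0.4, accuracy 0.6
print(right)  # same precision and recall, but accuracy ~0.333
```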
What's wrong with the t-test?

  Fold:   1     2     3     4     5     6     7     8     9     10
  C1:    +5%   -5%   +5%   -5%   +5%   -5%   +5%   -5%   +5%   -5%
  C2:   +10%   -5%   -5%    0%    0%    0%    0%    0%    0%     0%

- Classifiers 1 and 2 yield the same mean difference (0%) and very similar confidence intervals.
- Yet, Classifier 1 is relatively stable, while Classifier 2 is not.
- Problem: the t-test assumes a normal distribution. The difference in accuracy between Classifier 2 and the gold standard is not normally distributed.
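The point can be checked numerically on the fold table above: both difference sequences have a mean of 0%, so the t-statistic alone cannot distinguish the stable classifier from the unstable one (helper names are ours).

```python
import math

c1 = [0.05, -0.05] * 5                  # C1: stable, alternating +/-5%
c2 = [0.10, -0.05, -0.05] + [0.0] * 7   # C2: one large outlier, then flat

def mean_and_s(deltas):
    """Mean and the t-test's s_delta = sqrt(1/(k(k-1)) * sum (d_i - mean)^2)."""
    k = len(deltas)
    m = sum(deltas) / k
    s = math.sqrt(sum((d - m) ** 2 for d in deltas) / (k * (k - 1)))
    return m, s

print(mean_and_s(c1))  # mean 0.0
print(mean_and_s(c2))  # mean 0.0 as well, despite the very different behaviour
```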
So what can be done?

- Think about evaluation carefully prior to starting your experiments.
- Use performance measures other than accuracy and precision/recall, e.g., ROC analysis or combinations of measures. Also, think about the best measure for your problem.
- Use re-sampling methods other than cross-validation, when necessary: bootstrapping? Randomization?
- Use statistical tests other than the t-test: non-parametric tests; tests appropriate for many classifiers compared on many domains (the t-test is not appropriate for this case, which is the most common one).

We will try to discuss some of these issues next time.