
Experimental Evaluation of Learning Algorithms Part 1


Page 1: Experimental Evaluation of Learning Algorithms Part 1

Experimental Evaluation Experimental Evaluation of Learning Algorithmsof Learning Algorithms

Part 1Part 1

Page 2: Experimental Evaluation of Learning Algorithms Part 1


Motivation

Evaluating the performance of learning systems is important because:
- Learning systems are usually designed to predict the class of “future” unlabeled data points.
- In some cases, evaluating hypotheses is an integral part of the learning process (for example, when pruning a decision tree).

Page 3: Experimental Evaluation of Learning Algorithms Part 1


Recommended Steps for Proper Evaluation

1. Identify the “interesting” properties of the classifier.
2. Choose an evaluation metric accordingly.
3. Choose the learning algorithms to involve in the study, along with the domain(s) on which the various systems will be pitted against each other.
4. Choose a confidence estimation method.
5. Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
6. Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
7. Interpret the results with respect to the domain(s).

Page 4: Experimental Evaluation of Learning Algorithms Part 1


[Figure: The Classifier Evaluation Procedure. A flowchart linking the steps: Choose Learning Algorithm(s) to Evaluate, Select Performance Measure of Interest, Select Datasets for Comparison, Select Error-Estimation/Sampling Method, Select Statistical Test, and Perform Evaluation. One arrow type means that knowledge of step 1 is necessary for step 2; the other means that feedback from step 1 should be used to adjust step 2.]

Page 5: Experimental Evaluation of Learning Algorithms Part 1


Typical (but not necessarily optimal) Choices

- Identify the “interesting” properties of the classifier.
- Choose an evaluation metric accordingly.
- Choose a confidence estimation method.
- Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
- Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
- Interpret the results with respect to the domain.

Page 6: Experimental Evaluation of Learning Algorithms Part 1


Typical (but not necessarily optimal) Choices I:

1. Identify the “interesting” properties of the classifier.
2. Choose an evaluation metric accordingly.
3. Choose a confidence estimation method.
4. Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
5. Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
6. Interpret the results with respect to the domain.

Almost only the highlighted part is performed.
These steps are typically considered, but only very lightly.

Page 7: Experimental Evaluation of Learning Algorithms Part 1


Typical (but not necessarily optimal) Choices II:

Typical choices for performance evaluation:
- Accuracy
- Precision/Recall

Typical choices for sampling methods:
- Train/Test sets (Why is this necessary?)
- K-fold cross-validation

Typical choices for significance estimation:
- t-test (often a very bad choice, in fact!)

Page 8: Experimental Evaluation of Learning Algorithms Part 1


Confusion Matrix / Common Performance Evaluation Metrics

A Confusion Matrix (hypothesized class vs. true class):

                     True Pos     True Neg
  Hypothesized Yes      TP           FP
  Hypothesized No       FN           TN
  Totals              P = TP+FN    N = FP+TN

- Accuracy  = (TP+TN)/(P+N)
- Precision = TP/(TP+FP)
- Recall / TP rate = TP/P
- FP rate   = FP/N
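The definitions above map directly to code. Below is a minimal sketch in plain Python (the helper name confusion_metrics is my own, not from the slides) that computes these metrics from the four cells of a confusion matrix:

def confusion_metrics(tp, fp, fn, tn):
    """Common evaluation metrics from confusion-matrix cells."""
    p = tp + fn   # total actual positives
    n = fp + tn   # total actual negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,   # a.k.a. TP rate
        "fp_rate":   fp / n,
    }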

Page 9: Experimental Evaluation of Learning Algorithms Part 1


Sampling and Significance Estimation: Questions Considered

- Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples?
- Given that one hypothesis outperforms another over some sample of data, how probable is it that this hypothesis is more accurate in general?
- When data is limited, what is the best way to use this data to both learn a hypothesis and estimate its accuracy?

Page 10: Experimental Evaluation of Learning Algorithms Part 1


k-Fold Cross-Validation

1. Partition the available data D_0 into k disjoint subsets T_1, T_2, ..., T_k of equal size, where this size is at least 30.

2. For i from 1 to k, do:
   - Use T_i as the test set, and the remaining data as the training set S_i:
     S_i <- {D_0 - T_i}
     h_A <- L_A(S_i)
     h_B <- L_B(S_i)
     δ_i <- error_{T_i}(h_A) - error_{T_i}(h_B)

3. Return the value avg(δ), where avg(δ) = (1/k) Σ_{i=1}^{k} δ_i.
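A minimal sketch of this procedure in Python, assuming scikit-learn is available and that clf_A and clf_B are any two estimators with fit/predict (these names are placeholders of mine, not from the slides):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def paired_kfold_deltas(clf_A, clf_B, X, y, k=10, seed=0):
    """Per-fold error differences δ_i = error_{T_i}(h_A) - error_{T_i}(h_B)."""
    deltas = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        h_A = clf_A.fit(X[train_idx], y[train_idx])
        h_B = clf_B.fit(X[train_idx], y[train_idx])
        err_A = 1.0 - accuracy_score(y[test_idx], h_A.predict(X[test_idx]))
        err_B = 1.0 - accuracy_score(y[test_idx], h_B.predict(X[test_idx]))
        deltas.append(err_A - err_B)
    return np.array(deltas)   # avg(δ) is then deltas.mean()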

Page 11: Experimental Evaluation of Learning Algorithms Part 1


Confidence of the k-fold Estimate

The most commonly used approach to confidence estimation in machine learning is:
- To run the algorithm using 10-fold cross-validation and to record the accuracy at each fold.
- To compute a confidence interval around the average of the difference between these reported accuracies and a given gold standard, using the t-test, i.e., the following formula:

  δ ± t_{N,9} * s_δ

  where
  - δ is the average difference between the reported accuracy and the given gold standard,
  - t_{N,9} is a constant chosen according to the degree of confidence desired (with 9 degrees of freedom for 10 folds),
  - s_δ = sqrt( (1/90) Σ_{i=1}^{10} (δ_i − δ)² ), where δ_i is the difference between the reported accuracy and the given gold standard at fold i (note that 90 = k(k−1) for k = 10).
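A sketch of this computation in Python, using scipy.stats.t for the critical value (the function name and the 95% default are my own choices, not from the slides):

import numpy as np
from scipy import stats

def kfold_confidence_interval(deltas, confidence=0.95):
    """Confidence interval around the mean fold difference, per the formula above."""
    deltas = np.asarray(deltas, dtype=float)
    k = len(deltas)                                                # typically 10
    d_bar = np.mean(deltas)
    s_d = np.sqrt(np.sum((deltas - d_bar) ** 2) / (k * (k - 1)))   # 1/90 for k = 10
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=k - 1)       # two-sided t_{N,9}
    return d_bar - t_crit * s_d, d_bar + t_crit * s_d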

Page 12: Experimental Evaluation of Learning Algorithms Part 1


What’s wrong with Accuracy?

  Left classifier:                       Right classifier:
              True Pos   True Neg                    True Pos   True Neg
  Yes            200        100          Yes            400        300
  No             300        400          No             100        200
              P = 500    N = 500                     P = 500    N = 500

- Both classifiers obtain 60% accuracy.
- They exhibit very different behaviours:
  - On the left: weak positive recognition rate / strong negative recognition rate.
  - On the right: strong positive recognition rate / weak negative recognition rate.
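A quick check of these numbers, reusing the confusion_metrics sketch from the confusion-matrix slide (a hypothetical helper of mine, not part of the slides):

left  = confusion_metrics(tp=200, fp=100, fn=300, tn=400)
right = confusion_metrics(tp=400, fp=300, fn=100, tn=200)

print(left["accuracy"], right["accuracy"])     # both 0.6
print(left["recall"], 1 - left["fp_rate"])     # TP rate 0.4, TN rate 0.8 (left)
print(right["recall"], 1 - right["fp_rate"])   # TP rate 0.8, TN rate 0.4 (right)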

Page 13: Experimental Evaluation of Learning Algorithms Part 1


What’s wrong with Precision/Recall?

  Left classifier:                       Right classifier:
              True Pos   True Neg                    True Pos   True Neg
  Yes            200        100          Yes            200        100
  No             300        400          No             300          0
              P = 500    N = 500                     P = 500    N = 100

- Both classifiers obtain the same precision and recall values of 66.7% and 40%.
- They exhibit very different behaviours:
  - Same positive recognition rate.
  - Extremely different negative recognition rate: strong on the left / nil on the right.
- Note: Accuracy has no problem catching this!

Page 14: Experimental Evaluation of Learning Algorithms Part 1


What’s wrong with the t-test?

- Classifiers 1 and 2 yield the same average mean and confidence interval.
- Yet, Classifier 1 is relatively stable, while Classifier 2 is not.
- Problem: the t-test assumes a normal distribution. The difference in accuracy between Classifier 2 and the gold standard is not normally distributed.

       Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10
  C 1    +5%     -5%     +5%     -5%     +5%     -5%     +5%     -5%     +5%     -5%
  C 2   +10%     -5%     -5%      0%      0%      0%      0%      0%      0%      0%
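One way to make the problem concrete is to inspect the two sequences of per-fold differences directly, for example with a Shapiro-Wilk normality test from scipy (an illustration added here, not part of the original slides):

import numpy as np
from scipy import stats

c1 = np.array([+5, -5, +5, -5, +5, -5, +5, -5, +5, -5]) / 100.0
c2 = np.array([+10, -5, -5, 0, 0, 0, 0, 0, 0, 0]) / 100.0

for name, d in [("C1", c1), ("C2", c2)]:
    w_stat, p_value = stats.shapiro(d)   # a small p-value casts doubt on normality
    print(name, "mean:", d.mean(), "sample std:", d.std(ddof=1), "Shapiro p:", round(p_value, 3))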

Page 15: Experimental Evaluation of Learning Algorithms Part 1


So what can be done?

- Think about evaluation carefully prior to starting your experiments.
- Use performance measures other than accuracy and precision/recall, e.g., ROC analysis or combinations of measures. Also, think about the best measure for your problem.
- Use re-sampling methods other than cross-validation, when necessary: bootstrapping? Randomization?
- Use statistical tests other than the t-test: non-parametric tests; tests appropriate for many classifiers compared on many domains (the t-test is not appropriate for this case, which is the most common one). A small non-parametric example is sketched below.

We will try to discuss some of these issues next time.
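For instance, a non-parametric paired alternative to the t-test is the Wilcoxon signed-rank test (a minimal sketch; the accuracy values below are made up purely for illustration):

from scipy import stats

# Hypothetical per-fold accuracies of two classifiers on the same folds
acc_A = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]
acc_B = [0.78, 0.80, 0.80, 0.77, 0.76, 0.81, 0.79, 0.78, 0.77, 0.80]

# Paired test on the per-fold differences; no normality assumption is required
w_stat, p_value = stats.wilcoxon(acc_A, acc_B)
print("Wilcoxon statistic:", w_stat, "p-value:", p_value)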