Evaluation (5/18/2008)
Classifier Evaluation
• How to estimate classifier performance
• Learning curves
• Feature curves
• Rejects and ROC curves
Learning Curve
[Figure: true classification error ε vs. training set size; a Bayes-consistent classifier approaches the Bayes error ε*, while a sub-optimal classifier levels off above it.]
PRTools: cleval, testc
The Apparent Classification Error
[Figure: true error ε and apparent error εA vs. training set size; the gap between them is the bias.]
The apparent (or resubstitution) error, measured on the training set itself, is positively biased (optimistic). An independent test set is needed!
Error Estimation by Test Set
[Diagram: the design set is split into a training part (giving the classifier) and a test part (giving the classification error estimate). Another training set gives another classifier; another test set gives another error estimate.]

Standard deviation of the error estimate for test set size N and true error ε:

             ε = 0.01   ε = 0.03   ε = 0.1
  N = 10       0.031      0.054      0.095
  N = 100      0.010      0.017      0.030
  N = 1000     0.003      0.005      0.009

PRTools: gendat, testc
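The accuracy of a test-set error estimate follows from the binomial distribution of the error count on N independent test objects. A minimal Python sketch (illustrative only, not PRTools) that reproduces the table of standard deviations:

```python
import math

def error_std(eps, n):
    """Standard deviation of the test-set error estimate: counting
    errors on n independent test objects is a binomial experiment,
    so the estimate eps_hat has std sqrt(eps * (1 - eps) / n)."""
    return math.sqrt(eps * (1.0 - eps) / n)

# Reproduce the table: rows N = 10, 100, 1000; columns eps = 0.01, 0.03, 0.1.
for n in (10, 100, 1000):
    row = [round(error_std(e, n), 3) for e in (0.01, 0.03, 0.1)]
    print(n, row)
```

Note how a ten-fold larger test set shrinks the standard deviation by roughly a factor of three (the square root of ten).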
Training Set Size vs. Test Set Size
• The training set should be large to obtain a good classifier.
• The test set should be large for a reliable, unbiased error estimate.
• In practice, often just a single design set is given.
Using the same set for training and testing may give a good classifier, but will definitely yield an optimistically biased error estimate.
A small, independent test set yields an unbiased but unreliable (large-variance) error estimate for a well-trained classifier.
A large, independent test set yields an unbiased and reliable (small-variance) error estimate for a badly trained classifier.
A 50-50 split is an often-used standard solution for the training and test sizes. Is it really good? There are better alternatives.
PRTools: gendat
Cross-validation
[Diagram: rotate n times; in each rotation classifier_i is trained on part of the design set and the held-out part yields a classification error estimate.]
• Size of the test set: 1/n of the design set.
• Size of the training set: (n - 1)/n of the design set.
• Train and test n times; average the errors. (Good choice: n = 10.)
• All objects are tested once: the most reliable test result possible.
• Final classifier: trained on all objects, the best possible classifier.
• The error estimate is slightly pessimistically biased.
PRTools: crossval
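The rotation procedure above can be sketched in plain Python (illustrative, not PRTools; the nearest-mean classifier and the synthetic 1-D data are assumptions made for the example):

```python
import random

def nearest_mean_fit(train):
    """Train a nearest-mean classifier: store the mean per class."""
    means = {}
    for label in set(lab for _, lab in train):
        vals = [x for x, lab in train if lab == label]
        means[label] = sum(vals) / len(vals)
    return means

def nearest_mean_predict(means, x):
    """Assign x to the class with the closest mean."""
    return min(means, key=lambda lab: abs(x - means[lab]))

def crossval_error(data, n_folds=10, seed=0):
    """n-fold cross-validation: rotate n times, train on (n-1)/n of
    the design set, test on the held-out 1/n, average the errors."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    errors = []
    for i in range(n_folds):
        test = folds[i]
        train = [obj for j, fold in enumerate(folds) if j != i for obj in fold]
        means = nearest_mean_fit(train)
        wrong = sum(nearest_mean_predict(means, x) != lab for x, lab in test)
        errors.append(wrong / len(test))
    return sum(errors) / n_folds

# Two well-separated 1-D classes: the cross-validation error is near zero.
rng = random.Random(1)
data = [(rng.gauss(0, 1), 'A') for _ in range(50)] + \
       [(rng.gauss(6, 1), 'B') for _ in range(50)]
print(crossval_error(data, n_folds=10))
```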
Leave-one-out Procedure
[Diagram: rotate n times; classifier_i is trained on all but one object, and the single left-out object is tested.]
Cross-validation in which n equals the total number of objects: one object is tested at a time.
n classifiers have to be computed, which is in general infeasible for large n.
Doable for the k-NN classifier (it needs no training).
PRTools: crossval, testk
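For 1-NN, the leave-one-out rotation needs no training step at all; each object is simply matched against all the others. A Python sketch (illustrative, not PRTools; the toy 1-D data set is an assumption):

```python
def loo_error_1nn(data):
    """Leave-one-out error for the 1-NN classifier: classify each
    object by its nearest neighbour among all remaining objects."""
    wrong = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        nearest = min(rest, key=lambda obj: abs(x - obj[0]))
        wrong += nearest[1] != label
    return wrong / len(data)

# Toy data: two compact clusters, so every left-out object's
# nearest neighbour shares its class.
data = [(0.0, 'A'), (0.2, 'A'), (0.4, 'A'),
        (1.0, 'B'), (1.2, 'B'), (1.4, 'B')]
print(loo_error_1nn(data))  # 0.0
```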
Expected Learning Curves of the Estimated Errors
[Figure: classification error vs. training set size; the true error ε, the apparent error εA of the training set below it, and the cross-validation (rotation) and leave-one-out (n - 1) error estimates slightly above it.]
PRTools: cleval
Averaged Learning Curve

a = gendath([200 200]);
e = cleval(a,ldc,[2,3,5,7,10,15,20],500);
plote(e);

To obtain 'theoretically expected' curves, many repetitions are needed.
PRTools: cleval
Repeated Learning Curves

a = gendath([200 200]);
for j = 1:10
  e = cleval(a,ldc,[2,3,5,7,10,15,20],1);
  hold on; plote(e);
end

Small sample sizes show a very large variability.
PRTools: cleval, plote
Learning Curves for Different Classifier Complexity
[Figure: classification error vs. training set size for increasing classifier complexity; all curves approach the Bayes error.]
More complex classifiers are better in case of large training sets and worse in case of small training sets.
PRTools: cleval
Peaking Phenomenon, Overtraining, Curse of Dimensionality, Rao's Paradox
[Figure: classification error vs. feature set size (dimensionality) / classifier complexity, for several training set sizes.]
PRTools: clevalf
Example Overtraining: Polynomial Classifier
[Figure: six panels (1-6) showing a polynomial classifier of increasing degree on the same data.]
PRTools: svc([],'p',1)
Example Overtraining (2)
[Figure: classification error vs. complexity (polynomial degree).]
PRTools: svc([],'p',1), testc, gendat
Example Overtraining (3)
Run the smoothing parameter of the Parzen classifier from small to large.
Example Overtraining (4)
[Figure: classification error vs. complexity (decreasing smoothing parameter h).]
PRTools: parzenc([],h)
Overtraining: Increasing Bias
[Figure: true error and apparent error vs. feature set size (dimensionality) / classifier complexity; the gap between them, the bias, grows with complexity.]
Example Curse of Dimensionality
Fisher classifier for various feature rankings
Confusion Matrix (1)
[Diagram: confusion matrix with the real labels along the rows and the obtained labels along the columns.]
PRTools: confmat
Confusion Matrix (2)

                 classified to
  objects from     A    B    C
  class A          8    2    0    0.20 error in class A
  class B          6   23    1    0.23 error in class B
  class C          4    1   15    0.25 error in class C

  0.228 averaged error (the mean of the per-class errors)

PRTools: confmat, testc
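The per-class errors and the averaged error can be recomputed from the matrix itself. A Python sketch (illustrative, not the PRTools confmat routine):

```python
def per_class_errors(conf, labels):
    """Per-class error: off-diagonal count divided by the row total,
    where each row holds the objects of one real class."""
    errs = {}
    for i, lab in enumerate(labels):
        row = conf[i]
        errs[lab] = (sum(row) - row[i]) / sum(row)
    return errs

labels = ['A', 'B', 'C']
conf = [[8,  2,  0],   # objects from class A
        [6, 23,  1],   # objects from class B
        [4,  1, 15]]   # objects from class C

errs = per_class_errors(conf, labels)
avg = sum(errs.values()) / len(errs)   # class-averaged error
print({k: round(v, 2) for k, v in errs.items()}, round(avg, 3))
```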
Confusion Matrix (3)

                 classified to
  objects from     A    B    C
  class A          8    2    0    0.20
  class B          6   24    0    0.20
  class C          0    0   20    0.00
  averaged error: 0.133

                 classified to
  objects from     A    B    C
  class A          1    9    0    0.90
  class B          0   30    0    0.00
  class C          0    0   20    0.00
  averaged error: 0.30

                 classified to
  objects from     A    B    C
  class A          7    2    1    0.30
  class B          5   21    4    0.30
  class C          3    2   15    0.30
  averaged error: 0.30

Classification details are only observable in the confusion matrix!
Conclusions on Error Estimations
• Larger training sets yield better classifiers.
• Independent test sets are needed for obtaining unbiased error estimates.
• Larger test sets yield more accurate error estimates.
• Leave-one-out cross-validation seems to be an optimal compromise, but might be computationally infeasible.
• 10-fold cross-validation is a good practice.
• More complex classifiers need larger training sets to avoid overtraining.
• This holds in particular for larger feature sizes, due to the curse of dimensionality.
• For too small training sets, simpler classifiers or smaller feature sets are needed.
• Confusion matrices allow a detailed look at the per-class classification results.
Reject and ROC Analysis
• Reject Types
• Reject Curves
• Performance Measures
• Varying Costs and Priors
• ROC Analysis
[Figure: error vs. reject fraction r, dropping from ε0 at r = 0 to εr.]
Outlier Reject
Reject objects for which the probability density is low.
Note: in these areas the posterior probabilities might still be high!
PRTools: rejectc
Ambiguity Reject
Reject objects for which the classification is unsure, i.e. about equal posterior probabilities.
PRTools: rejectc
Reject Curve
The classification error ε0 can be reduced to εr by rejecting a fraction r of the objects.
[Figure: rejecting a band of width 2d around the decision boundary reduces the error from ε0 to εr; the curve shows the error as a function of the reject fraction r.]
PRTools: reject, plote
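A reject-curve sketch in Python (illustrative, not the PRTools reject routine; the scores and error flags are made-up, with the errors clustered near the decision boundary at score 0):

```python
def reject_curve(objects, widths):
    """Error-reject trade-off: for each half-width d, reject every
    object whose score lies within the band |score| <= d around the
    decision boundary at 0. Each object is a pair
    (score, is_error_if_accepted)."""
    curve = []
    n = len(objects)
    for d in widths:
        accepted = [err for s, err in objects if abs(s) > d]
        r = 1.0 - len(accepted) / n        # reject fraction
        eps = sum(accepted) / n            # remaining error over all objects
        curve.append((r, eps))
    return curve

# Made-up scores: the erroneously classified objects (flag 1) sit
# close to the boundary, so rejecting a band removes them first.
objects = [(-2.0, 0), (-1.5, 0), (-0.3, 1), (-0.1, 1),
           (0.2, 1), (0.4, 0), (1.1, 0), (1.8, 0)]
for r, eps in reject_curve(objects, [0.0, 0.5]):
    print(r, eps)
```

With no rejection the error is 3/8; rejecting the band of half-width 0.5 removes all three errors at the price of rejecting half the objects.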
Rejecting Classifier
[Figure: the original classifier vs. the rejecting classifier, which withholds a decision for a rejected fraction r of the objects.]
Reject Fraction
For reasonably well-trained classifiers, r > 2(ε(r=0) - ε(r)): to decrease the error by 1%, more than 2% of the objects has to be rejected.
[Figure: classes A and B along x; the error decrease is bought by rejecting a band around the decision boundary.]
How Much to Reject?
Compare the cost of a rejected object, cr, with the cost of a classification error, cε. For a given total cost c this is a linear function in the (r, ε) space; shift this line until a feasible operating point d* on the reject curve is reached.
[Figure: reject curve in the (r, ε) plane; the cost line touches it at the operating point d*.]
PRTools: reject, plote
Error / Performance Measures
[Figure: class-conditional densities along x, split by the classifier S(x) = 0.]
Objects of class A, with prior p(A), are split by S(x) = 0 into a part ηA, assigned to A, and a part εA, assigned to B:
p(A) = ηA + εA
Error / Performance Measures
Objects of class B, with prior p(B), are split by S(x) = 0 into a part ηB, assigned to B, and a part εB, assigned to A:
p(B) = ηB + εB
Error / Performance Measures
[Figure: classes A and B along x with the regions ηA1, ηA2, ηB1, ηB2, εA = ηB2, εB1 = ηA2 and εB2 around the decision boundaries.]
εA : error contribution of class A; εB : error contribution of class B
ε : classification error; η : performance; ε + η = 1
εB = εB1 + εB2; ε = εA + εB
ηA = ηA1 + ηA2; ηB = ηB1 + ηB2
η = ηA + ηB
p(A) + p(B) = 1; p(A) = ηA + εA; p(B) = ηB + εB
εA / p(A) : error in class A; εB / p(B) : error in class B
ηA / p(A) : sensitivity (recall) of class A (what goes correct in class A)
ηA / (ηA + εB) : precision of class A (what belongs to A among what is classified as A)
Error / Performance Measures
Given a trained classifier and a test set:

                   classified as
                   T        N
  true labels  T   TP       FN
               N   FP       TN

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Error = (FP + FN) / (TP + FN + FP + TN)
False Discovery Rate (FDR) = FP / (FP + TP)
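These measures follow directly from the four counts of the 2x2 confusion matrix. A Python sketch (illustrative; the counts are made-up):

```python
def metrics(tp, fn, fp, tn):
    """Standard measures computed from the 2x2 confusion matrix
    (T = target class, N = non-target class)."""
    return {
        'sensitivity': tp / (tp + fn),            # recall of the target class
        'specificity': tn / (tn + fp),            # performance outside the target
        'error':       (fp + fn) / (tp + fn + fp + tn),
        'fdr':         fp / (fp + tp),            # false discovery rate
        'precision':   tp / (tp + fp),
    }

# Made-up example counts for 100 test objects.
m = metrics(tp=40, fn=10, fp=5, tn=45)
print(m)
```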
Error / Performance Measures
• Error: the probability of erroneous classifications.
• Performance: 1 - error.
• Sensitivity of a target class (e.g. diseased patients): the performance for objects from that target class.
• Specificity: the performance for all objects outside the target class.
• Precision of a target class: the fraction of correct objects among all objects assigned to that class.
• Recall: the fraction of correctly classified objects. This is identical to the performance; related to a particular class, it is identical to the sensitivity.
ROC Curve
For every classifier S(x) - d = 0, two errors, εA and εB, are made. The ROC curve shows them as a function of the threshold d.
[Figure: the two error areas εA and εB shift as d moves.]
PRTools: roc, plote
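An ROC sketch in Python (illustrative, not the PRTools roc routine; the scores and the convention that class B scores high are assumptions made for the example):

```python
def roc_points(scores_a, scores_b, thresholds):
    """For each threshold d of S(x) - d = 0, compute the two errors:
    eps_A = fraction of class-A objects assigned to B,
    eps_B = fraction of class-B objects assigned to A.
    Convention: S(x) >= d means 'assign to class B'."""
    points = []
    for d in thresholds:
        eps_a = sum(s >= d for s in scores_a) / len(scores_a)
        eps_b = sum(s < d for s in scores_b) / len(scores_b)
        points.append((d, eps_a, eps_b))
    return points

scores_a = [0.1, 0.2, 0.3, 0.4]    # class A tends to score low
scores_b = [0.35, 0.5, 0.6, 0.7]   # class B tends to score high
for d, ea, eb in roc_points(scores_a, scores_b, [0.0, 0.375, 1.0]):
    print(d, ea, eb)
```

Sweeping d trades εA against εB: at the extremes one error is 0 and the other is 1, with intermediate operating points in between.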
ROC Curve
The ROC curve is independent of the class prior probabilities; a change of the priors is analogous to a shift of the decision boundary.
[Figure: shift in the decision boundary for priors = [0.5 0.5] vs. priors = [0.2 0.8].]
ROC Analysis
ROC: Receiver Operating Characteristic (from communication theory).
[Figure: two axis conventions for the same curve; performance of the target class vs. error in the non-targets (medical diagnostics, database retrieval), or error in class A vs. error in class B (2-class pattern recognition).]
PRTools: roc, plote
When Are ROC Curves Useful? (1)
[Figure: performance vs. cost curve.]
Trade-off between cost and performance:
• What is the performance for a given cost?
• What are the costs of a preset performance?
When Are ROC Curves Useful? (2)
[Figure: ROC curve in the (εA, εB) plane with operating points marked.]
Study of the effect of changing priors.
Operating points: the value of d in S(x) - d = 0.
When Are ROC Curves Useful? (3)
[Figure: ROC curves of classifiers S1(x), S2(x) and S3(x) in the (εA, εB) plane.]
Compare the behavior of different classifiers.

Combining Classifiers
Any 'virtual' classifier between S1(x) and S2(x) in the (εA, εB) space can be realized by using at random α times S1(x) and (1 - α) times S2(x):
εA = α εA1 + (1 - α) εA2
εB = α εB1 + (1 - α) εB2
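The linear interpolation between two operating points can be sketched in Python (the operating points for the two classifiers are made-up):

```python
def combined_error(e1, e2, alpha):
    """Operating point of the randomized combination: apply S1 with
    probability alpha and S2 with probability 1 - alpha.
    e1 and e2 are the (eps_A, eps_B) pairs of the two classifiers."""
    return (alpha * e1[0] + (1 - alpha) * e2[0],
            alpha * e1[1] + (1 - alpha) * e2[1])

s1 = (0.05, 0.30)   # S1: few class-A errors, many class-B errors
s2 = (0.25, 0.10)   # S2: the other way around
print(tuple(round(v, 3) for v in combined_error(s1, s2, 0.5)))
```

Varying α from 0 to 1 traces the straight line between the two operating points in the (εA, εB) plane.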
Summary: Reject and ROC
• Reject for solving ambiguity: rejecting objects close to the decision boundary lowers the costs.
• Reject option for protection against outliers.
• ROC analysis for performance – cost trade-off.
• ROC analysis in case of unknown or varying priors.
• ROC analysis for comparing / combining classifiers.