
Page 1: Classification Evaluation


Classification – Evaluation

Witten and Frank

Han and Kamber 

Page 2: Classification Evaluation


Adapted from Han and Kamber 

Classifier Accuracy Measures

Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M

Error rate (misclassification rate) of M = 1 − acc(M)

Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j

Alternative accuracy measures (e.g., for cancer diagnosis):

sensitivity = t-pos / pos          /* true positive recognition rate */

specificity = t-neg / neg          /* true negative recognition rate */

precision = t-pos / (t-pos + f-pos)

accuracy = sensitivity · pos/(pos + neg) + specificity · neg/(pos + neg)

This model can also be used for cost-benefit analysis

Example confusion matrix (rows = actual class, columns = predicted class):

classes              buy_computer = yes   buy_computer = no    total   recognition (%)
buy_computer = yes                 6954                  46     7000             99.34
buy_computer = no                   412                2588     3000             86.27
total                              7366                2634    10000             95.52

General form for two classes:

             predicted C1      predicted C2
actual C1    True positive     False negative
actual C2    False positive    True negative
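
A minimal sketch (not from the original slides) computing these measures from the confusion-matrix counts of the buy_computer example above; the variable names are illustrative:

```python
# Counts from the buy_computer example; "yes" is treated as the positive class.
t_pos, f_neg = 6954, 46      # actual yes: predicted yes / predicted no
f_pos, t_neg = 412, 2588     # actual no:  predicted yes / predicted no

pos, neg = t_pos + f_neg, f_pos + t_neg

sensitivity = t_pos / pos                      # true positive recognition rate
specificity = t_neg / neg                      # true negative recognition rate
precision   = t_pos / (t_pos + f_pos)
accuracy    = (sensitivity * pos + specificity * neg) / (pos + neg)

print(f"sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
print(f"precision={precision:.4f} accuracy={accuracy:.4f}")
```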

Page 3: Classification Evaluation


Adapted from Han and Kamber 

Predictor Error Measures

Measuring predictor accuracy: how far off the predicted value is from the actual known value

Loss function: measures the error between y_i and the predicted value y_i'

Absolute error: |y_i − y_i'|

Squared error: (y_i − y_i')²

Test error (generalization error): the average loss over the test set

Mean absolute error: $\dfrac{\sum_{i=1}^{d} |y_i - y_i'|}{d}$

Mean squared error: $\dfrac{\sum_{i=1}^{d} (y_i - y_i')^2}{d}$

Relative absolute error: $\dfrac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$

Relative squared error: $\dfrac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$

The mean squared error exaggerates the presence of outliers

The (square) root mean squared error is popularly used; similarly, the root relative squared error
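
A small illustrative sketch in plain Python (the toy values are made up) computing these error measures:

```python
# Actual values y and predictions y_pred are made-up toy data.
y      = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

d = len(y)
y_bar = sum(y) / d

mae = sum(abs(a - p) for a, p in zip(y, y_pred)) / d        # mean absolute error
mse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / d      # mean squared error
rae = sum(abs(a - p) for a, p in zip(y, y_pred)) / sum(abs(a - y_bar) for a in y)
rse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / sum((a - y_bar) ** 2 for a in y)
rmse = mse ** 0.5                                           # root mean squared error

print(mae, mse, rmse, rae, rse)
```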

Page 4: Classification Evaluation


Adapted from Han and Kamber 

Evaluating the Accuracy of a Classifier or Predictor (I)

Holdout method: the given data is randomly partitioned into two independent sets

Training set (e.g., 2/3) for model construction

Test set (e.g., 1/3) for accuracy estimation

Random sampling: a variation of holdout; repeat the holdout k times, and take the accuracy as the average of the accuracies obtained

Cross-validation (k-fold, where k = 10 is most popular): randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size

At the i-th iteration, use Di as the test set and the remaining subsets as the training set

Leave-one-out: k folds where k = the number of tuples; for small data sets

Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data (a code sketch follows below)
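
A sketch of holdout and stratified 10-fold cross-validation; scikit-learn and the iris data are assumptions for illustration, not something the slides prescribe:

```python
# Holdout vs. stratified k-fold CV on a toy dataset with a decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: 2/3 for training, 1/3 for accuracy estimation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# Stratified 10-fold cross-validation: average accuracy over the folds
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_acc = cross_val_score(clf, X, y, cv=skf).mean()

print(holdout_acc, cv_acc)
```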

Page 5: Classification Evaluation


Adapted from Han and Kamber 

Evaluating the Accuracy of a Classifier or Predictor (II)

Bootstrap

Works well with small data sets

Samples the given training tuples uniformly with replacement 

i.e., each time a tuple is selected, it is equally likely to be selected

again and re-added to the training set

Several bootstrap methods exist; a common one is the .632 bootstrap

Suppose we are given a data set of d tuples. The data set is sampled d times, with

replacement, resulting in a training set of d samples. The data tuples that did not

make it into the training set end up forming the test set. About 63.2% of the original

data will end up in the bootstrap, and the remaining 36.8% will form the test set

(since $(1 - 1/d)^d \approx e^{-1} = 0.368$)

Repeat the sampling procedure k times; the overall accuracy of the model is:

$\mathrm{acc}(M) = \sum_{i=1}^{k}\left(0.632 \cdot \mathrm{acc}(M_i)_{\text{test set}} + 0.368 \cdot \mathrm{acc}(M_i)_{\text{train set}}\right)$
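
A sketch of the .632 bootstrap estimate under stated assumptions (numpy and scikit-learn; the slides give only the formula):

```python
# One run of the .632 bootstrap: sample d tuples with replacement, test on the rest.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
d = len(y)
rng = np.random.default_rng(0)

k = 10
accs = []
for _ in range(k):
    idx = rng.integers(0, d, size=d)          # sample d tuples with replacement
    test_mask = np.ones(d, dtype=bool)
    test_mask[idx] = False                    # tuples never drawn form the test set
    clf = DecisionTreeClassifier().fit(X[idx], y[idx])
    acc_test = clf.score(X[test_mask], y[test_mask])
    acc_train = clf.score(X[idx], y[idx])
    accs.append(0.632 * acc_test + 0.368 * acc_train)

# Averaged here for readability; the slide's formula sums over the k repetitions.
print(sum(accs) / k)
```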

Page 6: Classification Evaluation


Adapted from Witten and Frank

Comparing data mining schemes

Frequent question: which of two learning

schemes performs better?

Note: this is domain dependent!

Obvious way: compare 10-fold CV estimates

Problem: variance in estimate

Variance can be reduced using repeated CV

However, we still don’t know whether the results are reliable

Page 7: Classification Evaluation


Adapted from Han and Kamber 

Model Selection: ROC Curves

ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models

Originated from signal detection theory

Shows the trade-off between the true positive rate and the false positive rate

The area under the ROC curve is a measure of the accuracy of the model

Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list

The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate; the plot also shows a diagonal line

The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model

A model with perfect accuracy will have an area of 1.0 (see the sketch below)
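
A sketch of computing an ROC curve and its area; scikit-learn, logistic regression, and the breast-cancer data are assumptions for illustration:

```python
# Score the test tuples for the positive class, then derive the ROC curve and AUC.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Probability of the positive class is used to rank the test tuples
scores = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)   # false / true positive rates
print("AUC =", roc_auc_score(y_te, scores))      # 1.0 = perfect, 0.5 ~ diagonal
```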

Page 8: Classification Evaluation


Adapted from Witten and Frank

Paired t-test

Student’s t-test tells whether the means of two samples are significantly different

Take individual samples using cross-validation

Use a paired t-test because the individual samples are paired (the same CV is applied twice)

William Gosset. Born: 1876 in Canterbury; died: 1937 in Beaconsfield, England.

Obtained a post as a chemist in the Guinness brewery in Dublin in

1899. Invented the t-test to handle small samples for quality

control in brewing. Wrote under the name "Student".

Page 9: Classification Evaluation


Adapted from Witten and Frank

Significance tests

Significance tests tell us how confident we can be that there really is a difference

Null hypothesis: there is no “real” difference

 Alternative hypothesis: there is a difference

A significance test measures how much evidence there is in favor of rejecting the null hypothesis

Let’s say we are using 10-fold CV

Question: do the two means of the 10 CV estimates differ significantly?

Page 10: Classification Evaluation


Adapted from Witten and Frank

Distribution of the means

x_1, x_2, …, x_k and y_1, y_2, …, y_k are the 2k samples for a k-fold CV

m_x and m_y are the means

With enough samples, the mean of a set of 

independent samples is normally distributed

Estimated variances of the means are σ_x²/k and σ_y²/k

If μ_x and μ_y are the true means, then

$\dfrac{m_x - \mu_x}{\sqrt{\sigma_x^2 / k}} \quad\text{and}\quad \dfrac{m_y - \mu_y}{\sqrt{\sigma_y^2 / k}}$

are approximately normally distributed with mean 0 and variance 1

Page 11: Classification Evaluation


Adapted from Witten and Frank

Student’s distribution

With small samples (k < 100) the mean

follows Student’s distribution with k–1

degrees of freedom

Confidence limits:

Student’s distribution, 9 degrees of freedom:

Pr[X ≥ z]   0.1%   0.5%   1%     5%     10%    20%
z           4.30   3.25   2.82   1.83   1.38   0.88

Normal distribution:

Pr[X ≥ z]   0.1%   0.5%   1%     5%     10%    20%
z           3.09   2.58   2.33   1.65   1.28   0.84
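
These critical values can be reproduced with scipy (an assumption; the slides only tabulate them):

```python
# Upper-tail critical values of Student's t with 9 degrees of freedom and of the
# standard normal distribution, matching the table above.
from scipy import stats

for p in [0.20, 0.10, 0.05, 0.01, 0.005, 0.001]:
    z_t = stats.t.ppf(1 - p, df=9)      # e.g. p = 5% -> ~1.83
    z_n = stats.norm.ppf(1 - p)         # e.g. p = 5% -> ~1.65
    print(f"Pr[X >= z] = {p:.1%}:  t(9) z = {z_t:.2f},  normal z = {z_n:.2f}")
```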

Page 12: Classification Evaluation


Adapted from Witten and Frank

Distribution of the differences

Let m_d = m_x − m_y

The difference of the means (m_d) also has a Student’s distribution with k−1 degrees of freedom

Let σ_d² be the variance of the difference

The standardized version of m_d is called the t-statistic:

$t = \dfrac{m_d}{\sqrt{\sigma_d^2 / k}}$

We use t to perform the t-test

Page 13: Classification Evaluation


Adapted from Witten and Frank

Performing the test

Fix a significance level α

If a difference is significant at the α% level, there is a (100 − α)% chance that there really is a difference

Divide the significance level by two because the test is two-tailed, i.e., the true difference can be positive or negative

Look up the value of z that corresponds to α/2

If t ≤ −z or t ≥ z, then the difference is significant, i.e., the null hypothesis can be rejected
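
A sketch of the paired t-test on two sets of 10-fold CV accuracies; scipy is an assumption and the accuracy values are made up:

```python
# Paired t-test over the fold-by-fold accuracies of two learning schemes.
from scipy import stats

acc_scheme_a = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]
acc_scheme_b = [0.78, 0.77, 0.80, 0.79, 0.76, 0.80, 0.79, 0.77, 0.78, 0.79]

t, p_value = stats.ttest_rel(acc_scheme_a, acc_scheme_b)   # paired, k-1 = 9 dof
print(t, p_value)

# Reject the null hypothesis (no real difference) when p_value < alpha,
# equivalently when |t| exceeds the two-tailed critical value z for alpha/2.
```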

Page 14: Classification Evaluation


Adapted from Witten and Frank

Unpaired observations

If the CV estimates are from different randomizations, they are no longer paired (or maybe we used k-fold CV for one scheme and j-fold CV for the other)

Then we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom

The t -statistic becomes:

$t = \dfrac{m_x - m_y}{\sqrt{\sigma_x^2 / k + \sigma_y^2 / j}}$
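
A sketch computing this unpaired t-statistic directly; the estimate values are made up for illustration:

```python
# Unpaired t-statistic as defined above, for CV estimates of different lengths.
from statistics import mean, variance

x = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]   # k = 10 estimates
y = [0.78, 0.77, 0.80, 0.79, 0.76]                                  # j = 5 estimates

k, j = len(x), len(y)
t = (mean(x) - mean(y)) / (variance(x) / k + variance(y) / j) ** 0.5
print(t, "with", min(k, j) - 1, "degrees of freedom")
```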

Page 15: Classification Evaluation


Adapted from Witten and Frank

Interpreting the result

All our cross-validation estimates are based on the same dataset

Samples are not independent

Should really use a different dataset sample for each of the k estimates used in the test, to judge performance across different training sets

Or use a heuristic test, e.g., the corrected resampled t-test

Page 16: Classification Evaluation


Combining Classifiers

Han and Kamber 

Russell and Norvig


Page 18: Classification Evaluation


Adapted from Han and Kamber 

Bagging: Bootstrap Aggregation

Analogy: Diagnosis based on multiple doctors’ majority vote

Training: given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)

A classifier model Mi is learned for each training set Di

Classification: to classify an unknown sample X, each classifier Mi returns its class prediction

The bagged classifier M* counts the votes and assigns the class with the most votes to X

Prediction: can be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple

Accuracy: often significantly better than a single classifier derived from D

For noisy data: not considerably worse, more robust

Provably improved accuracy for prediction
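
A bagging sketch; scikit-learn's BaggingClassifier and the iris data are assumptions used only for illustration:

```python
# Bagging: bootstrap samples of D, one tree per sample, majority vote at prediction time.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

print("single tree:", cross_val_score(single, X, y, cv=10).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=10).mean())
```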

Page 19: Classification Evaluation


Adapted from Han and Kamber 

Boosting

Analogy: Consult several doctors, based on a combination of weighted

diagnoses—weight assigned based on the previous diagnosis accuracy

How does boosting work?

Weights are assigned to each training tuple

A series of k classifiers is iteratively learned

After a classifier Mi is learned, the weights are updated to allow the subsequent

classifier, Mi+1, to pay more attention to the training tuples that were misclassified by

Mi

The final M* combines the votes of each individual classifier, where the weight of 

each classifier's vote is a function of its accuracy

The boosting algorithm can be extended for the prediction of continuous values

Comparing with bagging: boosting tends to achieve greater accuracy, but it

also risks overfitting the model to misclassified data

Can be shown to maximize the margin of the classifier
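
A sketch using AdaBoost, one widely used boosting algorithm (an assumption; the slides describe boosting generically); scikit-learn and the breast-cancer data are also assumptions:

```python
# k weak classifiers (decision stumps); each round re-weights misclassified tuples,
# and the final model combines the weighted votes of the individual classifiers.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=50, random_state=0)
print(cross_val_score(boosted, X, y, cv=10).mean())
```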

Page 20: Classification Evaluation


3-part boosting

Train classifier A on all data

Train classifier B on data that A makes an error on

Train classifier C on data that A and B don’t agree on

Break ties using C

Problem: a strong classifier A leaves fewer training points for B and C, making them unreliable
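
A literal sketch of the scheme just described; scikit-learn stumps and the breast-cancer data are assumptions, and the original scheme also balances B's training data:

```python
# Train A on all data, B on A's errors, C on tuples where A and B disagree;
# ties between A and B are broken by C.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

A = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)

err_mask = A.predict(X_tr) != y_tr                  # tuples A gets wrong
B = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr[err_mask], y_tr[err_mask])

dis_mask = A.predict(X_tr) != B.predict(X_tr)       # tuples A and B disagree on
C = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr[dis_mask], y_tr[dis_mask])

pa, pb, pc = A.predict(X_te), B.predict(X_te), C.predict(X_te)
final = np.where(pa == pb, pa, pc)                  # agreement wins; otherwise ask C
print("accuracy:", (final == y_te).mean())
```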

Page 21: Classification Evaluation


Decision Fusion

Train heterogeneous classifiers

Use a voting mechanism for deciding the final classification

Can learn the relative weights of the classifiers’ votes

Or can fix the weights according to classification accuracy
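
A decision-fusion sketch using weighted voting over heterogeneous classifiers; scikit-learn's VotingClassifier and the fixed weights are assumptions for illustration:

```python
# Heterogeneous classifiers combined by weighted (soft) voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Weights could instead be learned, e.g. set in proportion to each model's CV accuracy.
fused = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft", weights=[2, 1, 1])

print(cross_val_score(fused, X, y, cv=10).mean())
```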

Page 22: Classification Evaluation


Summary

Several methods for evaluating classifier accuracy: holdout methods, bootstrap

Comparing classifiers: confidence intervals, ROC curves

Combining classifiers: bagging, boosting, fusion