


CS513-Data Mining
Lecture 6: Performance Evaluation of Data Mining Algorithms

Waheed Noor

Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan


Outline

1 Performance Evaluation
   Training and Testing

2 Predicting Performance
   Cross-Validation
   The Bootstrap

3 Readings
   Quiz

4 Performance Measures for DM Methods

5 Cost Sensitivity


Performance Evaluation

Performance evaluation of a learned/trained data mining algorithm is important in practice.
Performance on the training set alone is not enough to claim results; instead we need an independent test set.
The test set can be generated from a portion of the training data that is not used for training, also called the hold-out set.
We need ways of predicting performance bounds in practice, based on experiments with whatever data can be obtained.
We need to be careful in choosing the technique for predicting the performance of our models, since the right choice depends on the underlying data.
Of course, we may need to pre-process the data to ensure quality, and the size of the data is also important.



Training and Testing

Error Rate
For classification problems, the error rate can be a good measure for assessing the performance of a classifier. The error rate is the fraction of all instances for which a classifier is unable to predict the correct class.

Important
What matters is the performance on future data (i.e., unseen and independent).
The training data is only used to learn the classifier, under the assumption that this data and future data are generated by the same process.
Therefore, the error rate on the training data (aka the resubstitution error) is not useful, and we need a way to estimate/predict the true error rate.
That is, we need a dataset that has not been used for training, which is called the test set.


Predicting Performance

Assume a test set yields an estimated error rate of 25%.
For the moment, we will use the success rate rather than the error rate, which is 75% in this case.
How confident are we, then? That is, how close is the estimate to the true success rate?
Our confidence in the estimate of the success rate depends on the size of the test set: the larger the test set, the more confident we are.
From a statistical perspective, we calculate a confidence interval and a confidence level (probability) that the true success rate lies in that interval around our estimate.
This would be a good read for you, but it does not come under the scope of this course.
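Although the statistics are out of scope, the idea can be illustrated with a short sketch. The snippet below (not from the slides) uses the common normal-approximation confidence interval p ± z·sqrt(p(1 − p)/n) for a success rate p estimated on n test instances; the function name and the example counts are hypothetical.

```python
import math

def success_rate_confidence_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a success rate.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# The same 75% success rate is far more trustworthy on a larger test set:
print(success_rate_confidence_interval(75, 100))     # about (0.665, 0.835)
print(success_rate_confidence_interval(750, 1000))   # about (0.723, 0.777)
```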


Holdout Method I

Definition
In the holdout method, a certain portion of the training data is reserved for testing and is not used during training. It is very common in practice to keep one-third of the whole dataset for testing and the remainder for training.

It may happen that either the training or the test sample is not representative; unfortunately, we cannot verify this in general.

However, we can check that each class is evenly represented in the full dataset.
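As a minimal sketch of the one-third holdout split, assuming a dataset of (features, label) pairs, the following Python is illustrative (the helper name and toy data are hypothetical):

```python
import random

def holdout_split(dataset, test_fraction=1/3, seed=42):
    """Reserve a fraction of the data for testing; the rest is for training."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    n_test = int(len(data) * test_fraction)
    return data[n_test:], data[:n_test]  # (training set, test set)

# Hypothetical dataset of (features, label) pairs:
data = [((i, i % 7), i % 2) for i in range(30)]
train, test = holdout_split(data)
print(len(train), len(test))  # 20 10
```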


Holdout Method II

Another problem: if, by chance, the training set has no instances of a particular class, it may be hopeless to learn anything about the missing class.

To guard against this, we should sample the training and test sets randomly in a way that keeps each class properly represented in both, which means we are applying stratification.

Another way to mitigate the bias is to repeat the holdout method several times with random sampling.

The overall error rate is then calculated by averaging the error rates of all iterations.
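A minimal sketch of repeated holdout, assuming a hypothetical train_and_error(train, test) callback that trains a model and returns its test error rate:

```python
import random

def repeated_holdout_error(data, train_and_error, test_fraction=1/3, repetitions=10):
    """Average the holdout error over several random splits to reduce sampling bias."""
    errors = []
    for rep in range(repetitions):
        rng = random.Random(rep)              # a different random split each repetition
        shuffled = list(data)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        train, test = shuffled[n_test:], shuffled[:n_test]
        errors.append(train_and_error(train, test))
    return sum(errors) / repetitions
```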


Cross-Validation I

Definition
The dataset is divided into a fixed number of partitions (called folds) of approximately equal size, and testing is carried out on each fold in turn while training on the remaining examples. The final error rate (or success rate) is again the average over the error rates of all test folds.

If we use three folds, then we are applying three-fold cross-validation.

The standard is to use 10-fold cross-validation, applying stratification first, i.e., stratified 10-fold cross-validation.

This method is useful since we exploit all the available information for training and testing.
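A minimal sketch of (non-stratified) k-fold cross-validation, again assuming a hypothetical train_and_error(train, test) callback:

```python
import random

def k_fold_indices(n, k=10, seed=42):
    """Shuffle the indices 0..n-1 and deal them into k folds of roughly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validation_error(data, train_and_error, k=10):
    """Average the test error over k folds; each instance is tested exactly once."""
    errors = []
    for fold in k_fold_indices(len(data), k):
        fold_set = set(fold)
        test = [data[i] for i in fold]
        train = [data[i] for i in range(len(data)) if i not in fold_set]
        errors.append(train_and_error(train, test))
    return sum(errors) / k
```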


Cross-Validation II

There is practical as well as theoretical evidence that 10 folds is about right.

However, there are no conclusive arguments in the literature.

The literature also encourages repeating 10-fold cross-validation a number of times, with some random sampling method such as stratification, to obtain better estimates.

In this case, if we repeat 10-fold cross-validation ten times, how many training-testing iterations will there be?

It may not be computationally feasible for very large datasets.


The Bootstrap I

Unlike cross-validation, the bootstrap method is based on the statistical technique of sampling with replacement. There are different variants of this method.

Definition (0.632 Bootstrap)
A dataset of n instances is sampled n times with replacement to generate another dataset of n instances. Since some instances are repeated in this sample, the instances that were never picked are used as the test set.

How?

On each draw, every instance has a probability of 1/n of being picked and (1 − 1/n) of not being picked.


The Bootstrap II

Since we repeat the sampling n times, we have

(1 − 1/n)^n ≈ e^(−1) ≈ 0.368,

which is the chance that a particular instance is never picked.

Therefore, about 36.8% of the instances will not be picked, while 63.2% will be. This is also the reason for the method's name.

Because of the repetition, duplicates fill in for the missing 36.8% of instances, so the new training set still has size n.

The error rate is calculated as:

Error = 0.632 × Error_test instances + 0.368 × Error_training instances


The Bootstrap III

where Error_training instances is the resubstitution error.

Finally, the whole bootstrap procedure is repeated several times, and the final error is obtained by averaging the individual errors.

This method is very useful for very small datasets.
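A minimal sketch of one round of the 0.632 bootstrap, assuming the same hypothetical train_and_error callback as before; in practice this round is repeated with different seeds and the results averaged:

```python
import random

def bootstrap_632_error(data, train_and_error, seed=42):
    """One round of the 0.632 bootstrap error estimate."""
    rng = random.Random(seed)
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]   # sample n times with replacement
    picked_set = set(picked)
    train = [data[i] for i in picked]
    test = [data[i] for i in range(n) if i not in picked_set]  # never-picked instances
    err_test = train_and_error(train, test)
    err_train = train_and_error(train, train)       # resubstitution error
    return 0.632 * err_test + 0.368 * err_train
```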

A disadvantage: suppose a learning algorithm memorizes the training data, giving 100% success, i.e., Error_training instances = 0, and assume the true error rate is 50%. Then the bootstrap error will be

0.632 × 0.5 + 0.368 × 0 = 0.316,

which is simply misleading.



Important Readings

– Read about the leave-one-out validation method and its properties.


Success rate or 0-1 Loss Function I

So far we have used the success rate as our performance measure for a DM method.

Under it, a prediction is either correct or incorrect.

It is also sometimes called the 0-1 loss function, since the loss is zero if the prediction is correct and one if it is incorrect.

But in some situations and applications, the success rate alone may not be enough to capture the actual predictive power of a DM method.


Success rate or 0-1 Loss Function II

For example, consider problems where the costs of different errors or incorrect predictions differ.

Example
1. Loan prediction: the cost of lending to a defaulter is far greater than the lost-business cost of refusing a loan to a non-defaulter.
2. Diagnosis: the cost of misidentifying problems with a machine that turns out to be free of faults is less than the cost of overlooking problems with one that is about to fail.


Success rate or 0-1 Loss Function III

A simple success rate/classification accuracy may also be misleading when the dataset is imbalanced.

Definition (Imbalanced Dataset)
We say a dataset is imbalanced when the examples of one class outnumber the examples of the other classes. Most DM methods assume that the underlying data is well balanced.

For instance, assume a dataset with a binary outcome variable where 90% of the examples belong to one class. A classifier that predicts this class all the time will then achieve 90% classification accuracy.
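A tiny illustration of this trap (the data is hypothetical):

```python
labels = [1] * 90 + [0] * 10   # imbalanced: 90% of examples are class 1
predictions = [1] * 100        # a "classifier" that always predicts the majority class
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.9 -- looks strong, yet class 0 is never detected
```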


Confusion Matrix I

In a binary class problem, a prediction can take one of the four possible forms shown below, denoted true positive (TP), true negative (TN), false positive (FP), and false negative (FN).

Figure: The different possible outcomes of a binary class prediction


Confusion Matrix II

TP and TN are correct classifications/predictions: an actual positive is predicted as positive by the model, and an actual negative is predicted as negative.

FP represents the misclassification of an actual negative as positive.

FN means the model has misclassified an actual positive as negative.


Confusion Matrix III

In the table above, our objective is to get maximum values along the diagonal and values close to zero for the off-diagonal elements.

True positive rate: TP divided by the total number of positives, i.e., TP / (TP + FN).

True negative rate: TN divided by the total number of negatives, i.e., TN / (TN + FP).

Success rate: the number of correct classifications divided by the total number of classifications, i.e., (TP + TN) / (TP + TN + FP + FN).
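These rates are straightforward to compute; a small sketch with hypothetical counts:

```python
def confusion_matrix_rates(tp, fn, fp, tn):
    """TP rate, TN rate, and success rate from a binary confusion matrix."""
    tp_rate = tp / (tp + fn)                    # fraction of actual positives caught
    tn_rate = tn / (tn + fp)                    # fraction of actual negatives caught
    success = (tp + tn) / (tp + tn + fp + fn)   # overall fraction correct
    return tp_rate, tn_rate, success

print(confusion_matrix_rates(tp=40, fn=10, fp=5, tn=45))  # (0.8, 0.9, 0.85)
```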


Confusion Matrix IV

Table: Class activity: calculate the TP rate, TN rate, and success rate of the following model.

              Predicted
Actual        Positive   Negative
Positive         52         15
Negative         18         45


Discussion: Confusion Matrix I

How good is this measure of success?

How many of the agreements could be due to chance alone? Is that possible at all?

Figure: Confusion matrices for (a) the actual and (b) the expected predictor


Discussion: Confusion Matrix II

The actual predictor correctly predicts 140 of the 200 test examples, while the expected (chance) predictor correctly predicts 82.
But we can see in the table above that the row and column totals have not changed.
How do we take such chance agreement into account?

Definition (Kappa Statistic)
Measures the level of agreement between the predicted and observed classification/categorization of a dataset, while correcting for the agreement that occurs by chance. Its maximum value is 100%, reached when perfect agreement occurs that is not due to chance.

For our example, we can calculate the kappa statistic as the percentage of the extra successes, beyond chance, out of the extra successes that are possible.


Discussion: Confusion Matrix III

The actual predictor is correct on 140 test examples, while the expected predictor correctly predicts 82 out of the 200 examples.

The actual predictor thus has 140 − 82 = 58 extra successes out of 200 − 82 = 118 possible, which amounts to 49.2%.
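The same calculation as a one-function sketch:

```python
def kappa_statistic(observed_correct, expected_correct, total):
    """Extra successes beyond chance, relative to the maximum possible extra successes."""
    return (observed_correct - expected_correct) / (total - expected_correct)

print(kappa_statistic(140, 82, 200))  # about 0.492, i.e. 49.2%
```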


Precision and Recall I

In information retrieval, such as querying web search engines, a model responds with the relevant documents.

In that case we can use two measures, called recall and precision, estimated as follows.

recall = (number of relevant documents retrieved) / (total number of relevant documents)

precision = (number of relevant documents retrieved) / (total number of documents retrieved)

Recall is also called sensitivity.

Both measures are based on capturing relevance.
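A minimal sketch over sets of document ids (the ids are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a set of retrieved documents."""
    hits = len(retrieved & relevant)             # relevant documents retrieved
    precision = hits / len(retrieved)            # exactness
    recall = hits / len(relevant)                # completeness
    return precision, recall

retrieved = {1, 2, 3, 4, 5}
relevant = {2, 3, 5, 8, 9, 10}
print(precision_recall(retrieved, relevant))     # (0.6, 0.5)
```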


Precision and Recall II

Maximum precision ensures no FPs, and maximum recall ensures no FNs.

We can also see precision as a measure of exactness, while recall can be seen as a measure of completeness.

So the trade-off between precision and recall depends on the relative costs of FPs and FNs.

Class Activity
How can you define precision and recall in terms of binary classification?


Performance Measures for Numeric Prediction

Figure: Measures for Numeric Prediction


Cost-sensitive Classification I

In a classification problem, if costs are known, they can be incorporated into the decision process.

In the confusion matrix, the errors FN and FP will have different costs.

Similarly, the correct classifications TP and TN will have different benefits.

The costs can be summarized in a similar fashion to the confusion matrix, in what is called a cost matrix.


Cost-sensitive Classification II

Figure: Default cost matrices for (a) a two-class and (b) a three-class problem.

The diagonal elements represent correct classifications, while the off-diagonal elements represent the errors.

In this case we replace the success rate with the average cost (or benefit) per decision.


Cost-sensitive Classification III

The cost may include the cost of the algorithm used, data collection, feature selection, and so on.

Given the cost matrix, the cost of a learned model on a test set can be calculated by simply summing the relevant elements of the cost matrix over the model's predictions.

Note: cost is taken into account when evaluating predictions, not at prediction time.

We want to minimize this cost.
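A minimal sketch of this summation, assuming a hypothetical 2×2 cost matrix indexed as cost[actual][predicted]:

```python
cost = [[0, 1],    # actual negative: TN costs 0, FP costs 1
        [10, 0]]   # actual positive: FN costs 10, TP costs 0

def total_cost(actuals, predictions, cost):
    """Sum the cost-matrix entries over a model's predictions on a test set."""
    return sum(cost[a][p] for a, p in zip(actuals, predictions))

actuals = [0, 1, 1, 0, 1]
predictions = [0, 1, 0, 1, 1]
print(total_cost(actuals, predictions, cost))  # one FN (10) + one FP (1) = 11
```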


Cost-sensitive Classification with Probabilities I

Consider models that output probabilities, such as naive Bayes, or even decision trees, from which we can extract classification scores/frequencies.

With such models, we want to minimize the expected cost of the predictions, rather than the misclassification error, using the predicted probabilities of each class for the test instances.

Consider the cost matrix of a three-class problem with classes a, b and c below.


Cost-sensitive Classification with Probabilities II

Further assume that the model assigns probabilities p_a, p_b and p_c to a test instance.

Then the expected cost of predicting class j is given by:

∑_k C_kj P(Y_k | x)


Cost-sensitive Classification with Probabilities III

The model will assign the class j ∈ {a, b, c} for which the expected cost is minimum.

How to do it
Consider class a first, i.e., j = a. We obtain the expected cost for this class by taking the dot product of the first column of the cost matrix, [0 1 1], with the probability vector [p_a p_b p_c]. We do the same for the other two classes and assign the class with the minimum expected cost.

Class Activity
Take the dot product of the above two vectors and show what you get.
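A minimal sketch of the whole procedure, using the default three-class cost matrix from the earlier figure and hypothetical probabilities:

```python
# C[k][j] is the cost of predicting class j when the actual class is k.
C = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
p = [0.2, 0.5, 0.3]  # predicted probabilities (p_a, p_b, p_c) for one instance

# Expected cost of predicting class j: sum over k of C[k][j] * p[k].
expected_costs = [sum(C[k][j] * p[k] for k in range(3)) for j in range(3)]
best = min(range(3), key=lambda j: expected_costs[j])
print(expected_costs, "-> predict class", "abc"[best])  # [0.8, 0.5, 0.7] -> b
```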


Cost-sensitive Classification with Probabilities IV

This method has its roots in decision theory, and we can use a loss function as an alternative to the cost as well.


Cost-sensitive Learning I

In the discussion above, we incorporated the costs (cost matrix) when evaluating the predictions, i.e., at testing time.

Definition
It may be better to incorporate costs during training (learning) rather than testing, thereby making the learning method itself cost-sensitive.

A general technique for building a cost-sensitive classifier is to weight the training instances of each class according to their costs.

For example, consider a two-class problem with classes yes and no, where an error on class no costs 10 times more than an error on class yes.


Cost-sensitive Learning II

We can then increase (artificially, perhaps by duplication) the number of instances of class no to 10 times the number of instances of class yes.

This biases the classifier towards avoiding errors on instances that belong to class no.

As a result, we will get fewer FPs than FNs, since FPs are penalized 10 times more heavily than FNs.
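A minimal sketch of this duplication-based weighting (the dataset and cost factors are hypothetical):

```python
def reweight_by_cost(dataset, class_costs):
    """Duplicate the instances of each class in proportion to its error cost.

    dataset: list of (features, label) pairs; class_costs: label -> integer factor.
    A crude way to make any learner cost-sensitive by oversampling.
    """
    weighted = []
    for features, label in dataset:
        weighted.extend([(features, label)] * class_costs[label])
    return weighted

data = [((0.1,), "yes"), ((0.9,), "no"), ((0.8,), "no")]
train = reweight_by_cost(data, {"yes": 1, "no": 10})
print(len(train))  # 1 + 10 + 10 = 21 instances; class "no" now dominates
```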


References I

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Morgan Kaufmann, San Francisco, CA, 2005.
