
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine Learning




Md. Main Uddin Rony, Software Developer, Infolytx, Inc.


Machine Learning Evaluation Metrics


ML Evaluation Metrics Are...
● tied to machine learning tasks
● methods which determine an algorithm's performance and behavior
● helpful for deciding which model best meets the target performance
● helpful for parameterizing the model so that it offers the best-performing algorithm


Evaluation Metrics Types...
● There are various types of ML algorithms (classification, regression, ranking, clustering)
● Different types of evaluation metrics suit different types of algorithms
● Some metrics can be useful for more than one type of algorithm (e.g., Precision-Recall)
● We will cover evaluation metrics for supervised learning models only (Classification, Regression, Ranking)


Classification Metrics


Classification Model Does...
Predict class labels given input data.

In binary classification, there are two possible output classes (0 or 1, True or False, Positive or Negative, Yes or No, etc.).

Spam detection in email is a good example of binary classification.


Some Popular Classification Metrics...
● Accuracy
● Confusion Matrix
● Log-Loss
● AUC


Accuracy
● Ratio between the number of correct predictions and the total number of predictions
● Example: Suppose we have 100 examples in the positive class and 200 examples in the negative class. Our model correctly declares 80 of the 100 positives as positive and 195 of the 200 negatives as negative.
● So, accuracy = (80 + 195) / (100 + 200) = 91.7%
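As a rough illustration of that calculation, here is a minimal Python sketch; the label vectors below are constructed only to reproduce the counts from the example above.

```python
# Hypothetical labels matching the example: 100 positives, 200 negatives.
y_true = [1] * 100 + [0] * 200
# Hypothetical predictions: 80 true positives, 20 false negatives,
# 195 true negatives, 5 false positives.
y_pred = [1] * 80 + [0] * 20 + [0] * 195 + [1] * 5

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"Accuracy: {accuracy:.3f}")  # 0.917
```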


Confusion Matrix
● Shows a more detailed breakdown of correct and incorrect classifications for each class.
● For our previous example, the confusion matrix looks like this:

                        Predicted as positive    Predicted as negative
  Labeled as positive            80                        20
  Labeled as negative             5                       195

● What is the accuracy for the positive class? And for the negative class?
● Clearly, the positive class has lower accuracy than the negative class.
● That information is lost if we calculate only the overall accuracy.
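A minimal sketch of the same breakdown, assuming scikit-learn is available; the label vectors are the same hypothetical ones used for the accuracy example.

```python
from sklearn.metrics import confusion_matrix

y_true = [1] * 100 + [0] * 200                      # 100 positives, 200 negatives
y_pred = [1] * 80 + [0] * 20 + [0] * 195 + [1] * 5  # 80 TP, 20 FN, 195 TN, 5 FP

# labels=[1, 0] orders rows and columns as positive first, then negative
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[ 80  20]   labeled positive: predicted positive / predicted negative
#  [  5 195]]  labeled negative: predicted positive / predicted negative
```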


Per-Class Accuracy
● Average per-class accuracy for the previous example:
  (80% + 97.5%) / 2 = 88.75%, which differs from the overall accuracy
● Why is it important?
  - It can reveal a different picture when the classes contain different numbers of examples
  - The class with more examples dominates the overall accuracy statistic, producing a distorted picture
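A minimal sketch of the per-class accuracy calculation, using the counts from the confusion matrix above.

```python
tp, fn = 80, 20   # positive class: correctly / incorrectly classified
tn, fp = 195, 5   # negative class: correctly / incorrectly classified

positive_accuracy = tp / (tp + fn)   # 0.80
negative_accuracy = tn / (tn + fp)   # 0.975
average_per_class = (positive_accuracy + negative_accuracy) / 2
print(f"Average per-class accuracy: {average_per_class:.4f}")  # 0.8875
```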


Log-Loss
● Very useful when the raw output of the classifier is a numeric probability instead of a class label 0 or 1
● Mathematically, the log-loss for a binary classifier is:

  log-loss = -(1/N) * Σ_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

  where y_i is the true label of example i and p_i is the predicted probability that it belongs to class 1
● The minimum is 0, reached when the prediction and the true label match up
● Exercise: calculate the log-loss for a data point whose true label is 1, first when the classifier predicts class 1 with probability 0.51 and then with probability 1
● Minimizing this value maximizes the accuracy of the classifier
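A minimal sketch of that exercise: the binary log-loss for a single data point whose true label is 1, predicted first with probability 0.51 and then with probability 1.0.

```python
import math

def log_loss_point(y_true, p, eps=1e-15):
    """Binary log-loss for one example; eps guards against log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(log_loss_point(1, 0.51))  # ~0.673: a barely-better-than-chance prediction is penalized
print(log_loss_point(1, 1.0))   # ~0.0:   a confident, correct prediction costs almost nothing
```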


AUC (Area Under Curve)
● The curve is the receiver operating characteristic curve, or ROC curve for short
● Provides nuanced details about the behavior of the classifier
● A bad ROC curve covers very little area
● A good ROC curve has a lot of space under it
● But how?

AUC (contd..) [slides 13-18: ROC curve illustration figures]


AUC (contd..)
● So, what is the advantage of using the ROC curve over a simpler metric?
● The ROC curve visualizes all possible classification thresholds, whereas other metrics only represent the error rate for a single threshold
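A minimal sketch of computing the ROC curve and AUC, assuming scikit-learn; the labels and scores below are illustrative, not taken from the slides.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # hypothetical labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3])   # hypothetical P(class 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (fpr, tpr) point per threshold
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")
# Plotting tpr against fpr over all thresholds traces the ROC curve;
# the area under that curve is the AUC.
```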


Ranking Metrics


Ranking...
● Is related to binary classification
● Internet search is a good example of a system that acts as a ranker: given a query, it returns a ranked list of web pages relevant to that query
● So here, ranking can be seen as a binary classification of each result as "relevant" or "irrelevant" to the query
● It also orders the results so that the most relevant ones appear at the top
● So, what can the underlying implementation do to account for both?
● Can we predict what ranking metrics will evaluate, and how?


Some Ranking Metrics...
● Precision-Recall
● Precision-Recall Curve and F1 Score
● NDCG


Precision - Recall
Considering the web search scenario, Precision answers this question:

"Out of the items that the ranker/classifier predicted to be relevant, how many are truly relevant?"

Whereas Recall answers this:

"Out of all the items that are truly relevant, how many are found by the ranker/classifier?"


Precision - Recall (Contd..)


Calculation Example of Precision-Recall

                     Predicted as Negative    Predicted as Positive
  Actual Negative         9760 (TN)                 140 (FP)
  Actual Positive           40 (FN)                  60 (TP)

Total Negative = 9760 + 140 = 9900
Total Positive = 40 + 60 = 100
Total Negative predictions = 9760 + 40 = 9800
Total Positive predictions = 140 + 60 = 200

Precision = TP / (TP + FP) = 60 / (60 + 140) = 30%
Recall    = TP / (TP + FN) = 60 / (60 + 40) = 60%
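A minimal sketch reproducing the precision and recall figures from the table above.

```python
tp, fp, fn, tn = 60, 140, 40, 9760   # counts from the confusion matrix above

precision = tp / (tp + fp)   # 60 / 200 = 0.30
recall = tp / (tp + fn)      # 60 / 100 = 0.60
print(f"Precision = {precision:.0%}, Recall = {recall:.0%}")
```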


Precision - Recall Curve
● When the number of answers returned by the ranker changes, the precision and recall scores also change
● By plotting precision versus recall over a range of k values, where k denotes the number of results returned, we get the precision-recall curve
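A minimal sketch of how the curve arises: sweep k, the number of results returned, and compute precision@k and recall@k at each cut-off. The ranked relevance labels below are illustrative, not from the slides (1 = relevant, 0 = not relevant).

```python
ranked_relevance = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]   # hypothetical ranked results
total_relevant = sum(ranked_relevance)

for k in range(1, len(ranked_relevance) + 1):
    relevant_in_top_k = sum(ranked_relevance[:k])
    precision_at_k = relevant_in_top_k / k
    recall_at_k = relevant_in_top_k / total_relevant
    print(f"k={k:2d}  precision={precision_at_k:.2f}  recall={recall_at_k:.2f}")
# Plotting precision against recall for all k traces the precision-recall curve.
```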


Computing Precision-Recall Point


Interpolating a Recall/Precision Curve


Trade-off between Recall and Precision


F-Measure
● One measure of performance that takes into account both recall and precision
● It is the harmonic mean of recall and precision:

  F1 = 2 * (Precision * Recall) / (Precision + Recall)

● Compared to the arithmetic mean, both precision and recall need to be high for the harmonic mean to be high
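A minimal sketch of the F1 calculation, using the precision (30%) and recall (60%) from the earlier example.

```python
precision, recall = 0.30, 0.60

f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2
print(f"F1 (harmonic mean): {f1:.2f}")               # 0.40
print(f"Arithmetic mean:    {arithmetic_mean:.2f}")  # 0.45, less sensitive to the low precision
```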


NDCG
● Precision and recall treat all retrieved items equally
● But do a relevant item in position 1 and a relevant item in position 5 bear the same significance?
● Think about a web search result
● NDCG tries to take this scenario into account


What?
● NDCG stands for Normalized Discounted Cumulative Gain
● First, just focus on DCG (Discounted Cumulative Gain)


Discounted Cumulative Gain
● Popular measure for evaluating web search and related tasks
● Discounts items that are further down the search result list
● Two assumptions:
  - Highly relevant documents are more useful than marginally relevant documents
  - The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined


Discounted Cumulative Gain
● Uses graded relevance as a measure of the usefulness, or gain, from examining a document
● Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
● Typical discount is 1/log(rank)
  - With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3


Discounted Cumulative Gain
● DCG is the total gain accumulated at a particular rank p:

  DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i)

● Alternative formulation:

  DCG_p = Σ_{i=1..p} (2^rel_i - 1) / log2(i + 1)

  - used by some web search companies
  - puts emphasis on retrieving highly relevant documents

* Equations used from Addison Wesley's presentation


DCG Example
● 10 ranked documents judged on a 0-3 relevance scale:
  3, 2, 3, 0, 0, 1, 2, 2, 3, 0
● Discounted gain:
  3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
  = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
● DCG (cumulative):
  3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61

* Example used from Addison Wesley's presentation
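A minimal sketch of the first DCG formulation above, reproducing the slide's running totals.

```python
import math

def dcg(relevances):
    """Cumulative DCG at every rank p: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    total, values = 0.0, []
    for i, rel in enumerate(relevances, start=1):
        total += rel if i == 1 else rel / math.log2(i)
        values.append(round(total, 2))
    return values

relevances = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(dcg(relevances))
# [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
```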


Normalized DCG
● Normalized version of discounted cumulative gain
● Often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking
● The normalized score always lies between 0.0 and 1.0


NDCG Example
● Let's look back at the list of ranked documents judged on the relevance scale:
  3, 2, 3, 0, 0, 1, 2, 2, 3, 0
● Perfect ranking:
  3, 3, 3, 2, 2, 2, 1, 0, 0, 0
● Perfect discounted gain:
  3, 3/1, 3/1.59, 2/2, 2/2.32, 2/2.59, 1/2.81, 0, 0, 0
  = 3, 3, 1.89, 1, 0.86, 0.77, 0.36, 0, 0, 0


NDCG Example (contd.)
● Ideal DCG values:
  3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88
● NDCG values (divide actual DCG by ideal DCG):
  3/3, 5/6, 6.89/7.89, 6.89/8.89, 6.89/9.75, 7.28/10.52, 7.99/10.88, 8.66/10.88, 9.61/10.88, 9.61/10.88
  = 1, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88
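A minimal sketch of the NDCG calculation: compute DCG for the actual ranking and for the perfect ranking, then divide rank by rank (the dcg() helper repeats the one sketched above).

```python
import math

def dcg(relevances):
    total, values = 0.0, []
    for i, rel in enumerate(relevances, start=1):
        total += rel if i == 1 else rel / math.log2(i)
        values.append(round(total, 2))
    return values

relevances = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
ideal = sorted(relevances, reverse=True)   # perfect ranking: 3, 3, 3, 2, 2, 2, 1, 0, 0, 0

ndcg = [round(actual / best, 2) for actual, best in zip(dcg(relevances), dcg(ideal))]
print(ndcg)
# [1.0, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88]
```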


Regression Metrics


What Do Regression Tasks Do?
● The model learns to predict numeric scores
● For example, we try to predict the price of a stock on future days given past price history and other useful information


Some Regression Metrics...
● RMSE (Root Mean Square Error)
● Quantiles of Errors


RMSE
● The most commonly used metric for regression tasks
● Also known as RMSD (root-mean-square deviation)
● Defined as the square root of the average squared distance between the actual score and the predicted score:

  RMSE = sqrt( (1/n) * Σ_i (y_i - ŷ_i)^2 )

  where y_i is the actual score and ŷ_i is the predicted score
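A minimal sketch of RMSE with NumPy; the actual and predicted values below are illustrative, not from the slides.

```python
import numpy as np

y_actual = np.array([10.0, 12.5, 14.0, 15.5])      # hypothetical true scores
y_predicted = np.array([11.0, 12.0, 13.0, 18.0])   # hypothetical predictions

rmse = np.sqrt(np.mean((y_actual - y_predicted) ** 2))
print(f"RMSE = {rmse:.3f}")
```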


Quantiles of Errors
● RMSE is an average, so it is sensitive to large outliers
● If the regressor performs really badly on even a single data point, the average error can be big; it is not robust
● Quantiles (or percentiles) are much more robust, because they are not affected by large outliers
● It is important to look at the median absolute percentage error:

  MAPE = median( |(y_i - ŷ_i) / y_i| )

● It gives us a relative measure of the typical error
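A minimal sketch contrasting RMSE with the median absolute percentage error when a single prediction is wildly off; the values are illustrative, not from the slides.

```python
import numpy as np

y_actual = np.array([10.0, 12.0, 15.0, 20.0, 25.0])
y_predicted = np.array([11.0, 11.5, 14.0, 19.0, 250.0])   # one huge outlier

rmse = np.sqrt(np.mean((y_actual - y_predicted) ** 2))
mape = np.median(np.abs((y_actual - y_predicted) / y_actual))

print(f"RMSE = {rmse:.2f}")                      # dominated by the single outlier
print(f"Median absolute % error = {mape:.2%}")   # stays close to the typical error
```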


Acknowledgement
● "Evaluating Machine Learning Models" by Alice Zheng
● Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE), who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
● Tutorial of Data School on ROC Curves and AUC by Kevin Markham


Questions???


Thank You