
Data Mining
Chapter 5

Credibility: Evaluating What’s Been Learned

Kirk Scott


The Lontara Script, an Example of an Abugida


Comparison of various abugidas descended from the Brahmi script: "May Śiva protect those who take delight in the language of the gods." (Kalidasa)


Bopomofo system of phonetic notation for the transcription of spoken Chinese


• Oracle bone script (from left):
• 馬 / 马 mǎ "horse"
• 虎 hǔ "tiger"
• 豕 shĭ "swine"
• 犬 quǎn "dog"
• 鼠 shǔ "rat and mouse"
• 象 xiàng "elephant"
• 豸 zhì "beasts of prey"
• 龜 / 龟 guī "turtle"
• 爿 qiáng "low table" (now 床 chuáng)
• 為 / 为 wèi "to lead" (now "do or for")
• 疾 jí "illness"


5.0 On to the Topic at Hand…


• One thing you’d like to do is evaluate the performance of a data mining algorithm

• More specifically, you’d like to evaluate results obtained by applying a data mining algorithm to a certain data set

• For example, can you estimate what percent of new instances it will classify correctly?


• Problems include:
• The performance level on the training set is not a good indicator of performance on other data sets
• Data may be scarce, so that enough data for separate training and test data sets may be hard to come by
• A related question is how to compare the performance of 2 different algorithms


• Parameters for evaluation and comparison:
• Are you predicting classification?
• Or are you predicting the probability that an instance falls into a classification?
• Or are you doing numeric prediction?
• What does performance evaluation mean for things other than prediction, like association rules?


• Can you define the cost of a misclassification or other “failure” of machine learning?

• There are false positives and false negatives

• Different kinds of misclassifications will have different costs in practice


• There are broad categories of answers to these questions

• For evaluation of one algorithm, a large amount of data makes estimating performance easy

• For smaller amounts of data, a technique called cross-validation is commonly used

• Comparing the performance of different data mining algorithms relies on statistics


5.1 Training and Testing


• For tasks like prediction, a natural performance measure is the error rate

• This is the proportion of predictions that are incorrect out of all predictions made

• How to estimate this?


• A rule set, a tree, etc. may be imperfect
• It may not classify all of the training set instances correctly
• But it has been derived from the training set
• In effect, it is optimized for the training set
• Its error rate on an independent test set is likely to be much higher


• The error rate on the training set is known as the resubstitution error

• You put the same data into the classifier that was used to create it

• The resubstitution error rate may be of interest

• But it is not a good estimate of the “true” error rate


• In a sense, you can say that the classifier, by definition, is overfitted on the training set

• If the training set is a very large sample, its error rate could be closer to the true rate

• Even so, as time passes and conditions change, new instances might not fall into quite the same distribution as the training instances and will have a higher rate of misclassification


• Not surprisingly, the way to estimate the error rate is with a test set

• You’d like both the training set and the test set to be representative samples of all possible instances

• It is important that the training set and the test set be independent

• Any test data should have played no role in the training of the classifier


• There is actually a third set to consider
• The validation set
• It may serve either of these two purposes:
• Selecting one of several possible data mining algorithms
• Optimizing the one selected


• All three sets should be independent
• Test only with the test set
• Train only with the training set
• Validate only with the validation set
• In general, error rate estimation is done on the test set
• After all decisions are made, it is permissible to combine all of the data sets and retrain on this superset for better results


5.2 Predicting Performance


• This section can be dealt with quickly
• Its main conclusions are based on statistical concepts, which are presented in a box
• You may be familiar with the derivations from statistics class
• They are beyond the scope of this course


• These are the basic ideas:
• Think in terms of a success rate rather than an error rate
• Based on a sample of n instances, you have an observed success rate in the test data set


• Statistically, you can derive a confidence interval around the observed rate that depends on the sample size

• Doing this provides more complete knowledge about the real success rate, whatever it might be, based on the observed success rate
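
• As a concrete sketch (mine, not the book's; the function name is illustrative and SciPy is assumed available), the normal-approximation confidence interval around an observed success rate can be computed like this:

    from math import sqrt
    from scipy.stats import norm

    def success_rate_ci(successes, n, confidence=0.95):
        """Normal-approximation confidence interval for an observed success rate."""
        p = successes / n                    # observed success rate on the test set
        z = norm.ppf((1 + confidence) / 2)   # e.g. about 1.96 for 95% confidence
        half_width = z * sqrt(p * (1 - p) / n)
        return p - half_width, p + half_width

    # Example: 750 correct predictions out of 1000 test instances
    print(success_rate_ci(750, 1000))        # roughly (0.723, 0.777)

• Note how the interval narrows as the sample size n grows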


5.3 Cross-Validation


• Cross-validation is the general term for techniques that can be used to select training/testing data sets and estimate performance when the overall data set is limited in size

• In simple terms, this might be a reasonable rule of thumb:

• Hold out 1/3 of the data for testing, for example, and use the rest for training (and validation)
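
• A minimal holdout sketch in Python (the function name and defaults are illustrative, not from the book):

    import random

    def holdout_split(instances, test_fraction=1/3, seed=0):
        """Randomly hold out a fraction of the data for testing; train on the rest."""
        rng = random.Random(seed)
        shuffled = list(instances)       # copy so the caller's data is untouched
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)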


Statistical Problems with the Sample

• Even if you do a proper random sample, poorly matched test and training sets can lead to poor error estimates

• Suppose that all of the instances of one classification fall into the test set and none fall into the training set


• The rules resulting from training will not be able to classify those test set instances
• The error rate will be high (justifiably)
• But a different selection of training set and test set would give rules that covered that classification
• And when the rules are evaluated using the test set, a more realistic error rate would result


Stratification

• Stratification can be used to better match the test set and the training set

• You don’t simply obtain the test set by random sampling

• You randomly sample from each of the classifications in the overall data set in proportion to their occurrence there

• This means all classifications are represented in both the test and training sets
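
• A rough sketch of stratified sampling (illustrative names; get_class is an assumed callback that returns an instance's classification):

    import random
    from collections import defaultdict

    def stratified_split(instances, get_class, test_fraction=1/3, seed=0):
        """Sample from each class in proportion to its frequency in the data."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for inst in instances:
            by_class[get_class(inst)].append(inst)
        train, test = [], []
        for members in by_class.values():
            rng.shuffle(members)
            n_test = round(len(members) * test_fraction)
            test.extend(members[:n_test])
            train.extend(members[n_test:])
        return train, test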


Repeated Holdout

• With additional computation, you can improve on the error estimate obtained from a stratified sample alone
• n times, randomly hold out 1/3 of the instances for testing, possibly using stratification
• Average the error rates over the n times
• This is called repeated holdout
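
• A sketch of repeated holdout, built on the holdout_split sketch above (train_and_measure is an assumed callback that trains a classifier on its first argument and returns its error rate on the second):

    def repeated_holdout(instances, train_and_measure, n_repeats=10):
        """Average the test-set error rate over several random holdout splits."""
        rates = []
        for i in range(n_repeats):
            train, test = holdout_split(instances, seed=i)   # from the sketch above
            rates.append(train_and_measure(train, test))     # error rate on `test`
        return sum(rates) / len(rates)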


Cross-Validation

• The idea of multiple samples can be generalized
• Partition the data into n “folds” (partitions)
• Do the following n times:
• Train on (n – 1) of the partitions
• Test on the remaining 1
• Average the error rates over the n times
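
• A sketch of n-fold cross-validation (same assumed train_and_measure callback as above; the partitioning here is simple slicing, without stratification):

    def cross_validation(instances, train_and_measure, n_folds=10):
        """Average the error rate over n rounds, testing on one fold per round."""
        folds = [instances[i::n_folds] for i in range(n_folds)]   # n partitions
        rates = []
        for i, test in enumerate(folds):
            train = [inst for j, fold in enumerate(folds) if j != i
                     for inst in fold]                            # the other n - 1 folds
            rates.append(train_and_measure(train, test))
        return sum(rates) / len(rates)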


Stratified 10-Fold Cross-Validation

• The final refinement of cross-validation is to make the partitions so that all classifications are roughly in proportion
• The standard rule of thumb is to use 10 partitions
• Why?
• The answer boils down to this, essentially:
• Human beings have 10 fingers…


• In general, it has been observed that 10 partitions lead to reasonable results

• 10 partitions is not computationally out of the question


• The final refinement presented in the book is this:

• If you want really good error estimates, do 10-fold cross-validation 10 times with different stratified partitions and average the results

• At this point, estimating error rates has become a computationally intensive task
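
• A sketch of 10-times 10-fold cross-validation, reusing the cross_validation sketch above (for brevity this version reshuffles the data each time rather than stratifying the partitions):

    import random

    def repeated_cross_validation(instances, train_and_measure, n_repeats=10):
        """10 x 10-fold CV: repartition, run 10-fold CV, average the estimates."""
        estimates = []
        for i in range(n_repeats):
            shuffled = list(instances)
            random.Random(i).shuffle(shuffled)   # a different partitioning each time
            estimates.append(cross_validation(shuffled, train_and_measure))
        return sum(estimates) / len(estimates)

• The sketch makes the cost concrete: 100 training runs in all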


5.4 Other Estimates


• 10 times 10-fold stratified cross-validation is the current standard for estimating error rates
• There are other methods, including these two:
• Leave-One-Out Cross-Validation
• The Bootstrap


Leave-One-Out Cross-Validation

• This is basically n-fold cross validation taken to the max

• For a data set with n instances, you hold out one for testing and train on the remaining (n – 1)

• You do this for each of the instances and then average the results
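
• A sketch of leave-one-out, again with the assumed train_and_measure callback from the earlier sketches:

    def leave_one_out(instances, train_and_measure):
        """Hold out each instance in turn; train on the rest; average the results."""
        rates = []
        for i, held_out in enumerate(instances):
            train = instances[:i] + instances[i+1:]            # all but one instance
            rates.append(train_and_measure(train, [held_out]))
        return sum(rates) / len(rates)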


• This has these advantages:
• It's deterministic:
• There's no sampling
• In a sense, you maximize the information you can squeeze out of the data set


• It has these disadvantages:
• It's computationally intensive
• By definition, a holdout of 1 can't be stratified
• By definition, the classification of a single instance will not conform to the distribution of classifications in the remaining instances
• This can lead to poor error rate estimates


The Bootstrap

• This is another technique that ultimately relies on statistics (presented in a box)

• The question underlying the development and use of the bootstrap technique is this:

• Is there a way of estimating the error rate that is especially well-suited to small data sets?


• The basic statistical idea is this:
• Do not just randomly select a test set from the data set
• Instead, select a training set (the larger of the two sets) using sampling with replacement


• Sampling with replacement in this way will lead to these results:
• Some of the instances will not be selected for the training set
• These instances will be the test set
• By definition, you expect duplicates in the training set


• Having duplicates in the training set will mean that it is a poorer match to the test set

• The error estimate is skewed to the high side

• This is corrected by combining this estimate with the resubstitution error rate, which is naturally skewed to the low side


• This is the statistically based formula for the overall error rate:
• Overall error rate = 0.632 × (test set error rate) + 0.368 × (training set error rate)
• To improve the estimate, randomly sample with replacement multiple times and average the results
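
• A sketch of the 0.632 bootstrap (illustrative; it assumes the instances are distinct objects, and reuses the train_and_measure callback from the earlier sketches):

    import random

    def bootstrap_632(instances, train_and_measure, n_repeats=10, seed=0):
        """0.632 bootstrap: train on a with-replacement sample, test on the rest."""
        rng = random.Random(seed)
        n = len(instances)
        estimates = []
        for _ in range(n_repeats):
            train = [rng.choice(instances) for _ in range(n)]  # duplicates expected
            chosen = {id(inst) for inst in train}
            test = [inst for inst in instances if id(inst) not in chosen]
            test_err = train_and_measure(train, test)
            resub_err = train_and_measure(train, train)        # resubstitution error
            estimates.append(0.632 * test_err + 0.368 * resub_err)
        return sum(estimates) / len(estimates)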


• Like the leave-one-out technique, there are cases where this technique does not give good results

• This depends on the distribution of the classifications in the overall data set

• Sampling with replacement in some cases can lead to the component error rates being particularly skewed in such a way that they don’t compensate for each other


5.5 Comparing Data Mining Schemes


• This section can be dealt with quickly because it is largely based on statistical derivations presented in a box
• The fundamental idea is this:
• Given a data set (or data sets) and two different data mining algorithms, you'd like to choose which one is best


• For researchers in the field, this is the broader question:

• Across all possible data sets in a given problem domain (however that may be defined), which algorithm is superior overall?

• We’re really only interested in the simpler, applied question


• At heart, to compare two algorithms, you compare their estimated error or success rates

• In a previous section it was noted that you can get a confidence interval for an estimated success rate

• In a situation like this, you have two probabilistic quantities you want to compare


• The paired t-test is an established statistical technique for comparing two such quantities and determining if they are significantly different
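
• As an illustration, SciPy's paired t-test applied to the per-fold error rates of two schemes measured on the same cross-validation folds (the numbers below are made up):

    from scipy.stats import ttest_rel

    # Per-fold error rates for two schemes on the SAME 10 folds (made-up values)
    scheme_a = [0.21, 0.19, 0.23, 0.20, 0.22, 0.18, 0.24, 0.21, 0.20, 0.22]
    scheme_b = [0.25, 0.22, 0.26, 0.24, 0.23, 0.22, 0.27, 0.25, 0.24, 0.26]

    t_stat, p_value = ttest_rel(scheme_a, scheme_b)
    print(t_stat, p_value)   # a small p-value suggests a significant difference

• Pairing matters: both schemes are measured on the same folds, so fold-to-fold variation cancels out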


5.6 Predicting Probabilities


• The discussion so far had to do with evaluating schemes that do simple classification

• Either an instance is predicted to be in a certain classification or it isn’t

• Numerically, you could say a successful classification was a 1 and an unsuccessful classification was a 0

• In this situation, evaluation boiled down to counting the number of successes


Quadratic Loss Function

• Some classification schemes are more finely tuned

• Instead of just predicting a classification, they produce the following:

• For each instance examined, a probability that that instance falls into each of the classes


• The set of k probabilities can be thought of as a vector of length k containing elements pi

• Each pi represents the probability of one of the classifications for that instance

• The pi sum to 1


• The following discussion will be based on the case of one instance, not initially considering the fact that a data set will contain many instances

• How do you judge the goodness of the vector of pi‘s that a data mining algorithm produces?


• Evaluation is based on what is known as a loss function

• Loss is a measure of the probabilities against the actual classification found

• A good prediction of probabilities should have a low loss value

• In other words, the difference is small between what is observed and the predicted probabilities


• Consider the loss for one instance
• Out of k possible classifications, the instance actually falls into one of them, say the ith
• This fact can be represented by a vector a with a 1 in position i (ai = 1) and 0 in all others


• The data mining algorithm produces a vector of length k containing the probabilities that the instance falls into each of the classifications

• This fact can be represented by a vector p containing values pi


• Now for a single instance, consider this sum over the different classes and their probabilities:

Σj (pj – aj)²


• Ignore the squaring for a moment
• Excluding the ith case, this is a sum of the probabilities that didn't come true
• For a good prediction, you'd like this to be small

Σj (pj – aj)²


• For the ith case you have pi – ai = pi – 1

• For a good scheme you’d like the probability of predicting the correct classification to be as near to 1 as possible

• In other words, you’d like the difference between the probability and 1 to be as small as possible

Σj (pj – aj)²


• Putting the two parts back together again, you'd like this sum to be small:
• As for the squaring, this is basically just statistical standard operating procedure
• Squaring makes the function amenable to calculus techniques

Σj (pj – aj)²


• The final outcome is this
• The goal is to minimize the mean square error (MSE)
• Because of the squaring this is called the quadratic loss function
• Recall that this discussion was for one instance
• Globally you want to minimize the MSE over all instances

Σj (pj – aj)²
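
• A minimal sketch of the quadratic loss for one instance, following the slide's notation (the function name is mine):

    def quadratic_loss(p, actual_index):
        """Sum over classes of (pj - aj)^2, where aj is 1 for the actual class, else 0."""
        return sum((p_j - (1 if j == actual_index else 0)) ** 2
                   for j, p_j in enumerate(p))

    # Example: three classes, instance actually in class 0
    print(quadratic_loss([0.7, 0.2, 0.1], 0))   # (0.7-1)^2 + 0.2^2 + 0.1^2 = 0.14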


Informational Loss Function

• The informational loss function is another way of evaluating a probabilistic predictor

• Like the discussion of the quadratic loss function, this will be discussed in terms of a single instance

• For a whole data set you would have to consider the losses over all instances


• The book explains that the informational loss function is related to the information function for tree formation

• It is also related to logistic regression
• Suppose that a particular instance is classified into the ith category
• For that instance, this is the loss function:
• –log2 pi
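
• A minimal sketch of the informational loss for one instance (function name is mine):

    from math import log2

    def informational_loss(p, actual_index):
        """-log2 of the probability assigned to the class that actually occurred."""
        return -log2(p[actual_index])            # grows without bound as pi -> 0

    # Same prediction as in the quadratic loss example
    print(informational_loss([0.7, 0.2, 0.1], 0))   # -log2(0.7), about 0.51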


• I will give a simplified information theoretic explanation at the end

• In the meantime, it is possible to understand what the function does with a few basic definitions and graphs

• Remember the following:
• The sum of all of the pi = 1

• Each individual pi < 1


• Now observe that the information loss function involves only pi, the probability value for the category that the instance actually classified into

• The information loss function does not include any information about the probabilities of the categories that the instance didn’t classify into


• So, restrict your attention to the ith category, the one the instance classified into

• A measure of a good predictor would be how close the prediction probability, pi, is to the value 1 (perfect prediction) for that instance


• Here is the standard log function:
• log2 pi
• Its graph is shown on the following overhead

[Graph of log2 pi]

• Here is the informational loss function:
• –log2 pi
• Its graph is shown on the following overhead
• We are only interested in the part of the graph where 0 < pi < 1, since that is the range of possible values for pi

[Graph of –log2 pi]

• The goal is to minimize the loss function
• What does that mean, visually?
• It means pushing pi as close to one as possible
• When pi = 1 the information loss function takes on the value 0
• An information loss function value of 0 is the optimal result


• In vague information theoretic terms, you might say this:

• The greater the probability of your prediction, the more (valid) information you apparently had to base it on

• Or, the less information you would have needed in order to make it optimal

• As pi → 1, –log2 pi → 0


• Conversely, the less the probability of your prediction, the less (valid) information you apparently had to base it on

• Or the more information you would have needed to make it optimal

• As pi → 0, –log2 pi → ∞


• The book observes that if you assign a probability of 0 to an event and that event happens, the informational loss function “cost” is punishing—namely, infinite

• A prediction that is flat wrong would require an unmeasurable amount of information to correct


• This leads to another very informal insight into the use of the logarithm in the function, and what that means

• If you view the graph from right to left instead of left to right, as the probability of a prediction gets smaller and smaller, the information needed in order to make it better grows “exponentially”


• In practice, because of the potential for an infinite informational loss function value, no event is given a probability of 0


• Consider the graph again
• Assume that information was measured in bits rather than decimal numbers
• A more theoretical statement about the relative costs would go like this:
• The informational loss function represents the relative number of bits of information that would be needed to effect a change in the probability of a prediction


Discussion

• Which is better, the quadratic loss function or the informational loss function?
• It's a question of preference
• The quadratic loss function is based on the probability of cases that didn't occur as well as cases that occurred
• The informational loss function is based only on the probability of the case that occurred


• The quadratic loss function is bounded (a finite sum of squares of differences)

• The informational loss function is unbounded

• Statisticians would probably find the quadratic loss function (MSE criterion) familiar and useful


• Information theorists would probably find the informational loss function familiar and useful

• The book points out that in theory you can measure the information contained in a structural representation in bits

• Advanced analyses can be based on formulas that combine the information in a representation and the bits involved in measuring the error rate (bits still needed)


The End