© sebastian thrun, CMU, 2000 1
10-610 The KDD Lab
Intro: Outcome Analysis
Sebastian Thrun
Carnegie Mellon University
www.cs.cmu.edu/~10610
Problem 1
On testing data, your speech recognizer achieves 68% word accuracy, whereas previous recognizers achieve 60%. Would you advise a company to adopt your speech recognizer?
Problem 2
On testing data, your data mining algorithm can predict emergency C-sections with 68% accuracy, whereas a previous $1,000 test achieves 60% accuracy. Do you recommend replacing the previous test with your new method?
Characterize: What Should We Worry About?
The cost/loss to minimize: ∫_D L(f(x), x) p(x) dx
Pattern classification (+ / −): classification error, FP/FN errors
Regression: quadratic error
Unsupervised learning: log likelihood
Error Types
Type I error (alpha error, false positive): probability of accepting the hypothesis when it is not true
Type II error (beta error, false negative): probability of rejecting the hypothesis when it is true
ROC Curves (ROC=Receiver Operating Characteristic)
Sensitivity: probability that a test result will be positive when the disease is present
Specificity: probability that a test result will be negative when the disease is not present
Positive likelihood ratio: ratio between the probability of a positive test result given the presence of the disease and the probability of a positive test result given the absence of the disease
Negative likelihood ratio: ratio between the probability of a negative test result given the presence of the disease and the probability of a negative test result given the absence of the disease
Positive predictive value (PPV): probability that the disease is present when the test is positive
Negative predictive value (NPV): probability that the disease is not present when the test is negative
Sensitivity = true positives / (true positives + false negatives)
Specificity = true negatives / (true negatives + false positives)
Positive likelihood ratio = sensitivity / (1 − specificity)
Negative likelihood ratio = (1 − sensitivity) / specificity
PPV = true positives / (true positives + false positives)
NPV = true negatives / (true negatives + false negatives)
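These quantities fall out directly from the four confusion-matrix counts. A minimal sketch in Python (the function name and the example counts are illustrative, not from the slides):

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the slide's test-quality measures from raw counts."""
    sensitivity = tp / (tp + fn)   # P(test positive | disease present)
    specificity = tn / (tn + fp)   # P(test negative | disease absent)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": sensitivity / (1 - specificity),  # positive likelihood ratio
        "lr_neg": (1 - sensitivity) / specificity,  # negative likelihood ratio
        "ppv": tp / (tp + fp),     # P(disease present | test positive)
        "npv": tn / (tn + fn),     # P(disease absent | test negative)
    }

metrics = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)
```

Sweeping a decision threshold and plotting sensitivity against (1 − specificity) at each setting gives the ROC curve itself.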
Evaluating Machine Learning Algorithms
[Overview diagram: plenty of data vs. little data]
Holdout Set
[Figure: the data is split into a training part and a holdout part; train on one, evaluate the error on the other]
Often also used for parameter optimization
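A holdout split can be sketched in a few lines (a minimal illustration; the function name and the 1/3 test fraction are my choices, not from the slides):

```python
import random

def holdout_split(data, test_fraction=1/3, seed=0):
    """Shuffle the data and split it into a training set and a holdout (test) set."""
    rng = random.Random(seed)
    shuffled = data[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # train, test

train, test = holdout_split(list(range(30)))
```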
Example:
Hypothesis misclassifies 12 out of 40 examples in cross validation set S.
Q: What will the “true” error be on future examples?
A:
Finite Cross-Validation Set
True error (true risk): e_D = ∫_D δ(f(x) ≠ y) p(x, y) dx dy
Test error (empirical risk): ê_S = (1/m) Σ_{(x,y)∈S} δ(f(x) ≠ y)
D = all data; S = test data; m = # test samples
Confidence Intervals (See Mitchell 97)
If
• S contains m examples, drawn independently
• m ≥ 30
Then
• With approximately 95% probability, the true error e_D lies in the interval
ê_S ± 1.96 √( ê_S (1 − ê_S) / m )
Example:
Hypothesis misclassifies 12 out of 40 examples in cross validation set S.
Q: What will the “true” error be on future examples?
A: With 95% confidence, the true error lies in the interval
ê_S ± 1.96 √( ê_S (1 − ê_S) / m ) = [0.16; 0.44]
since ê_S = 12/40 = 0.3, m = 40, and 1.96 √( ê_S (1 − ê_S) / m ) ≈ 0.14.
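The arithmetic on this slide can be checked in a few lines (a sketch; the helper name is mine, and z defaults to the 95% value):

```python
from math import sqrt

def error_confidence_interval(errors, m, z=1.96):
    """Approximate confidence interval for the true error, from the test
    error e_hat = errors / m (valid for m >= 30)."""
    e_hat = errors / m
    half_width = z * sqrt(e_hat * (1 - e_hat) / m)
    return e_hat - half_width, e_hat + half_width

lo, hi = error_confidence_interval(errors=12, m=40)   # e_hat = 0.3
# lo ≈ 0.16, hi ≈ 0.44, matching the slide
```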
Confidence Intervals (See Mitchell 97)
If
• S contains m examples, drawn independently
• m ≥ 30
Then
• With approximately N% probability, the true error e_D lies in the interval
ê_S ± z_N √( ê_S (1 − ê_S) / m )

N%:  50%   68%   80%   90%   95%   98%   99%
z_N: 0.67  1.0   1.28  1.64  1.96  2.33  2.58
Finite Cross-Validation Set
True error (true risk): e_D = ∫_D δ(f(x) ≠ y) p(x, y) dx dy
Test error (empirical risk): ê_S = (1/m) Σ_{(x,y)∈S} δ(f(x) ≠ y)
Number of test errors k is binomially distributed:
p(k) = [ m! / (k! (m − k)!) ] e_D^k (1 − e_D)^(m−k)
Binomial Distribution
[Plot: P(k), the binomial distribution for e_D = 0.3 and m = 40]
Approximates a Normal distribution (Central Limit Theorem)
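The plotted distribution can be reproduced from the formula on the previous slide (a sketch using only the standard library):

```python
from math import comb

def binomial_pmf(k, m, e_d):
    """P(k errors among m independent test examples) when the true error is e_d."""
    return comb(m, k) * e_d**k * (1 - e_d)**(m - k)

# For e_D = 0.3 and m = 40, the distribution peaks near m * e_D = 12 errors.
pmf = [binomial_pmf(k, 40, 0.3) for k in range(41)]
```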
95% Confidence Intervals
Question
What’s the difference between variance and confidence intervals?
Basically a factor: the interval ê_S ± z_N √( ê_S (1 − ê_S) / m ) extends z_N standard deviations on each side of the estimate, where ê_S (1 − ê_S) / m is the variance of ê_S.
Common Performance Plot
[Plot: testing error with 95% confidence intervals]
Comparing Different Hypotheses
True difference: d = e_D(1) − e_D(2)
Test set difference: d̂ = ê_S(1) − ê_S(2)
95% confidence interval:
d̂ ± 1.96 √( ê_S(1)(1 − ê_S(1))/m₁ + ê_S(2)(1 − ê_S(2))/m₂ )
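The interval for the difference of two independently measured test errors can be sketched as follows (the helper name and the example numbers are mine):

```python
from math import sqrt

def difference_confidence_interval(e1, m1, e2, m2, z=1.96):
    """CI for the true difference e_D(1) - e_D(2), given independent test
    errors e1 on m1 examples and e2 on m2 examples."""
    d_hat = e1 - e2
    half = z * sqrt(e1 * (1 - e1) / m1 + e2 * (1 - e2) / m2)
    return d_hat - half, d_hat + half

lo, hi = difference_confidence_interval(0.30, 100, 0.25, 100)
# If the interval contains 0, the difference is not significant at ~95%.
```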
Evaluating Machine Learning Algorithms
[Overview diagram: plenty of data vs. little data]
Holdout Set
[Figure: the data is split into a training part and a holdout part; train on one, evaluate the error on the other]
k-fold Cross Validation
[Figure: k-way split of the data; for each fold i = 1, …, k, train on yellow (the other k − 1 parts) and evaluate on pink (the held-out part) → error_i]
error = (1/k) Σ_i error_i
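The procedure above can be sketched generically; the `train_fn`/`error_fn` interface is illustrative, not from the slides:

```python
def k_fold_error(data, k, train_fn, error_fn):
    """k-fold cross validation: split the data k ways, train on k-1 parts,
    evaluate on the held-out part, and average the k error estimates.
    train_fn(train) -> model; error_fn(model, test) -> error rate."""
    folds = [data[i::k] for i in range(k)]   # k-way split
    total = 0.0
    for i in range(k):
        test = folds[i]                                                      # pink
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]  # yellow
        total += error_fn(train_fn(train), test)
    return total / k

# With an error function that always reports 0.25, the average is 0.25:
err = k_fold_error(list(range(20)), 4, lambda train: None, lambda model, test: 0.25)
```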
The Jackknife
[Figure: leave-one-out — train on all of the data except a single example, evaluate on that example, and repeat for every example]
The Bootstrap
[Figure: draw a training set from the data by sampling with replacement (yellow); evaluate on the examples not drawn (pink) → error]
Repeat and average.
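A bootstrap error estimate can be sketched as follows (sampling with replacement is the standard construction; the function names are mine):

```python
import random

def bootstrap_error(data, train_fn, error_fn, rounds=100, seed=0):
    """The bootstrap: repeatedly train on a sample drawn from the data with
    replacement and evaluate on the examples that were not drawn; average."""
    rng = random.Random(seed)
    errors = []
    for _ in range(rounds):
        sample = [rng.choice(data) for _ in data]       # yellow: with replacement
        drawn = set(sample)
        held_out = [x for x in data if x not in drawn]  # pink: left out this round
        if held_out:                                    # skip the rare empty round
            errors.append(error_fn(train_fn(sample), held_out))
    return sum(errors) / len(errors)
```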
What’s the Problem?
Confidence intervals assume independently drawn examples, but our individual error estimates are dependent, because the resampled training and testing sets overlap across rounds.
Comparing Different Hypotheses: Paired t test
True difference: d = e_D(1) − e_D(2)
Test error difference for partition i: d̂_i = ê_{S,i}(1) − ê_{S,i}(2)
Average: d̂ = (1/k) Σ_{i=1..k} d̂_i
N% confidence interval: d̂ ± t_{N,k−1} √( [1/(k(k−1))] Σ_{i=1..k} (d̂_i − d̂)² )
(k − 1 is the degrees of freedom; N is the confidence level)
t_{N,ν}:   N=90%  95%   98%   99%
ν = 2:     2.92   4.30  6.96  9.92
ν = 5:     2.02   2.57  3.36  4.03
ν = 10:    1.81   2.23  2.76  3.17
ν = 20:    1.72   2.09  2.53  2.84
ν = 30:    1.70   2.04  2.46  2.75
ν = 120:   1.66   1.98  2.36  2.62
ν = ∞:     1.64   1.96  2.33  2.58
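The paired computation can be sketched as follows, with the t value read off the table above for the desired confidence level (the helper name and the example differences are mine):

```python
from math import sqrt

def paired_confidence_interval(d, t):
    """Confidence interval for the true error difference, from the per-partition
    differences d = [d_1, ..., d_k] and the t value for k-1 degrees of freedom."""
    k = len(d)
    d_bar = sum(d) / k
    spread = sqrt(sum((d_i - d_bar) ** 2 for d_i in d) / (k * (k - 1)))
    return d_bar - t * spread, d_bar + t * spread

# Hypothetical differences from k = 6 partitions; t for 95% and nu = 5 is 2.57:
diffs = [0.02, 0.05, 0.01, 0.04, 0.03, 0.03]
lo, hi = paired_confidence_interval(diffs, t=2.57)
```

If the resulting interval excludes 0, the two hypotheses differ significantly at the chosen confidence level.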
Evaluating Machine Learning Algorithms
[Overview diagram: plenty of data vs. little data vs. unlimited data]
Asymptotic Prediction
Useful for very large data sets
Summary
Know your loss function!
Finite testing data: report confidence intervals.
Scarce data: repartition the training/testing set (cross validation, jackknife, bootstrap).
Asymptotic prediction: exponential.
Put thought into your evaluation, and be critical. Convince yourself!
Recommended