Discovering Unrevealed Properties of Probability Estimation Trees: on Algorithm Selection and Performance Explanation
Kun Zhang, Wei Fan, Bill Buckles, Xiaojing Yuan, and Zujia Xu
Dec. 21, 2006
What this Paper Offers
• Preference of a probability estimation tree (PET)
• Many important and previously unrevealed properties of PETs
• A practical guide for choosing the most appropriate PET algorithm
Statistical Supervised Learning: Bayesian Decision Rule

Learning seeks $\min_f E_{x,y}[L(f(x,\theta),y)]$. Since $E_{x,y}[L(f(x,\theta),y)] = E_x\,E_{y|x}[L(f(x,\theta),y)]$,

$$y^* = \arg\min_{y'} E_{y|x}[L(y',y)] = \arg\min_{y'} \sum_{j=1}^{\ell} P(y=j \mid x)\,L(y',y=j), \quad \text{where } y'=f(x,\theta),\ j=1,\dots,\ell.$$

0-1 Loss: $y^* = \arg\min_{y'} E_{y|x}[L(y',y)] = \arg\max_{y} P(y \mid x)$

Cost-Sensitive Loss: $y^* = \arg\min_{y'} \sum_j P(y=j \mid x)\,L(y',y=j)$
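The two decision rules above can be sketched numerically. The posterior vector and cost matrix below are hypothetical, chosen so the two rules disagree:

```python
import numpy as np

# Posterior probabilities P(y=j|x) for one test point x (hypothetical values).
posterior = np.array([0.2, 0.5, 0.3])

# 0-1 loss: the Bayes-optimal decision is simply argmax_y P(y|x).
y_star_01 = int(np.argmax(posterior))

# Cost-sensitive loss: L[y', j] is the cost of predicting y' when the true
# class is j (here, predicting class 1 when the truth is class 2 is costly).
L = np.array([[0.0, 5.0,  1.0],
              [1.0, 0.0, 10.0],
              [2.0, 3.0,  0.0]])

# Expected loss of each candidate decision y': sum_j P(y=j|x) * L(y', j).
expected_loss = L @ posterior
y_star_cost = int(np.argmin(expected_loss))

print(y_star_01, y_star_cost)  # 1 2
```

Note how the unequal costs move the optimal decision away from the most probable class, which is exactly why accurate posterior estimates matter.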
Why Probability Estimation? - Theoretical Necessity
Challenge: P(x,y) is unknown, and so is P(y|x)!

• Unknown distribution P(X,Y), with y = F(x)
• Learning set LS = {x_i, y_i}_{i=1}^N
• A loss/cost function L
• A learning algorithm produces a model f(x,θ) by solving min_f E_{x,y}[L(f(x,θ), y)]
Why Probability Estimation? - Practical Necessity
Medical domain, ozone-level prediction, direct marketing:
• Non-static, skewed distribution
• Unequal loss (Yang, ICDM05)
These settings call for direct estimation of probability and for decision-threshold determination.
Posterior Probability Estimation

• Parametric methods: assume the true and unknown distribution follows a "particular form" and fit it via maximum likelihood estimation. E.g. Naive Bayes, logistic regression.
• Non-parametric approaches: probabilities are calculated directly, without making any distributional assumption. E.g. decision trees, nearest neighbors.

Non-parametric posterior probability estimation offers a rather unbiased, flexible, and convenient solution.
PETs - A Probabilistic View of Decision Trees

• P(y|x,θ) = N_y / N, e.g. C4.5, CART
• Leaf probabilities serve as confidences in the predicted labels
• They allow appropriate thresholding for classification w.r.t. different loss functions
• The dependence of P(y|x,θ) on θ is non-trivial
Problems of Traditional PETs
1. Probability estimates through frequency tend to be too close to the extremes of 1 and 0.
2. Additional inaccuracies result from the small number of examples within a leaf.
3. The same probability is assigned to the entire region of space defined by a given leaf.
Popular PET Algorithms
• C4.4 (Provost, 03)
• CFT (Ling, 03)
• BPET (Breiman, 96)
• RDT (Fan, 03)

Algorithm | Model(s) | Feature Selection Criterion | Probability Estimation Method | Pruning Strategy | Diversity Acquisition
C4.5 (Quinlan, 93) | Single | Gain Ratio | Frequency Estimation | Error-based Pruning | N/A
C4.4 (Provost, 03) | Single | Gain Ratio | Laplace Correction | No | N/A
CFT (Ling, 03) | Single | Gain Ratio | Aggregation over Leaves (FE/LC at leaf node) | No or Error-based Pruning | N/A
RDT (Fan, 03) | Multiple | Randomly Chosen | Model Averaging | No or Depth Constraint | Random manipulation of feature set
BPET (Breiman, 96) | Multiple | Gain Ratio | Model Averaging | No | Random manipulation of training set
Estimation formulas:
• Frequency estimation (C4.5): P(y|x,θ) = N_y / N
• Laplace correction (C4.4): P(y|x,θ) = (N_y + 1) / (N + C), for C classes
• Aggregation over leaves (CFT): P(y|x,θ) is obtained by aggregating the leaf-level FE/LC estimates across leaves
• Model averaging (RDT/BPET): P(y|x,θ) = (1/B) Σ_{k=1}^{B} P_k(y|x,θ)
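A minimal sketch of the three closed-form estimators above, showing how the Laplace correction pulls an extreme frequency estimate away from 1:

```python
import numpy as np

def frequency_estimate(n_y, n):
    # C4.5-style frequency estimate at a leaf: P(y|x) = N_y / N.
    return n_y / n

def laplace_estimate(n_y, n, c):
    # C4.4-style Laplace correction with C classes: (N_y + 1) / (N + C).
    return (n_y + 1) / (n + c)

def model_average(probs):
    # RDT/BPET-style model averaging over B trees: (1/B) * sum_k P_k(y|x).
    return float(np.mean(probs))

# A leaf holding 3 positives out of 3 examples in a binary problem (C = 2):
# frequency gives an extreme 1.0, Laplace shrinks it toward 0.5.
print(frequency_estimate(3, 3))        # 1.0
print(laplace_estimate(3, 3, 2))       # 0.8
print(model_average([0.9, 0.7, 0.8]))  # approximately 0.8
```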
Which one to choose? What performance is to be expected? Why should one PET be preferred over another?
Contributions
• A large-scale learning curve study using multiple evaluation metrics
• Preference of a PET: the signal-noise separability of datasets
• Many important and previously unrevealed properties of PETs: in ensembles, RDT is preferable on low-signal separability datasets, while BPET is favorable when the signal separability is high
• A practical guide for choosing the most appropriate PET algorithm
Analytical Tool # 1: AUC - Index of Signal-Noise Separability

• Signal-noise separability: correct identification of the information of interest amid other noise factors that may interfere with this identification.
• A good analogy for the two different populations present in every learning domain with uncertainty.
• A synthetic scenario, tumor diagnosis:
  - Tumor: signal present
  - No tumor: signal absent
  - Based on a yes/no decision:
    1. P(yes|tumor): hit (TP)
    2. P(yes|no tumor): false alarm (FP)
    3. P(no|tumor): miss (FN)
    4. P(no|no tumor): correct reject (TN)
[Figure: ROC curves (TPR vs. FPR) for a low-separability and a high-separability distribution, alongside the score densities f(x|signal), f(x|noise1), f(x|noise2); a decision criterion over the overlapping signal and noise densities partitions the outcomes into hit, miss, false alarm, and correct reject.]
• An illustration: as the decision criterion moves, the relative areas of the four different outcomes vary, but the separation of the two distributions does not!
Analytical Tool # 1: AUC - Index of Signal-Noise Separability
• AUC: an index for the separability of signal from noise
• Domains with a high/low degree of signal separability:
  - High: deterministic / little noise
  - Low: stochastic / noisy
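The separability idea can be sketched empirically: AUC equals the probability that a randomly drawn signal score exceeds a randomly drawn noise score, so widening the gap between the two (hypothetical Gaussian) populations raises the AUC:

```python
import random

def empirical_auc(signal, noise):
    # AUC = P(a random signal score exceeds a random noise score);
    # ties count as 1/2. This is the two-population separability index.
    wins = sum((s > n) + 0.5 * (s == n) for s in signal for n in noise)
    return wins / (len(signal) * len(noise))

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(500)]
low   = [random.gauss(0.5, 1.0) for _ in range(500)]  # weakly separated signal
high  = [random.gauss(3.0, 1.0) for _ in range(500)]  # strongly separated signal

print(empirical_auc(low, noise) < empirical_auc(high, noise))  # True
```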
Analytical Tool # 2: Learning Curves
[Figure: an example MSE learning curve (MSE vs. percentage of 75% of the data examples) for BagPET (&), RDT (o), C4.4 ($), C4.5 (x), and CFT (#).]
• Used instead of CV or training-test splitting based on a fixed data set size.
• Show the generalization performance of different models as a function of the size of the training set.
• The correlation between performance metrics and training set sizes can be observed and possibly generalized over different data sets.
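The protocol implied above can be sketched as follows; the hold-out ratio and training fractions mirror the "percentage of 75% data examples" axis, while the toy model (a class prior) and the function names are assumptions for illustration:

```python
import random

def learning_curve(data, fit, score, fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    # Hold out 25% of the data for testing, then train on growing
    # fractions of the remaining 75% and score each resulting model.
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    split = int(0.75 * len(shuffled))
    train, test = shuffled[:split], shuffled[split:]
    return [(f, score(fit(train[: max(1, int(f * len(train)))]), test))
            for f in fractions]

# Toy model: predict the class-1 prior; evaluate with MSE (Brier score).
data = [(x, x % 2) for x in range(100)]
fit = lambda rows: sum(y for _, y in rows) / len(rows)
mse = lambda p, rows: sum((y - p) ** 2 for _, y in rows) / len(rows)
curve = learning_curve(data, fit, mse)
print(len(curve))  # one (fraction, MSE) point per training fraction: 5
```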
1. Area Under the ROC Curve (AUC)
   - Summarizes the "ranking capability" of a learning algorithm in ROC space.

2. MSE (Brier Score)
   - A proper assessment of the "accuracy" of probability estimation:
     MSE = (1/N) Σ_{i=1}^{N} Σ_y (T(y|x_i) − P(y|x_i,θ))², where T(y|x_i) is the true probability.
   - Calibration-refinement decomposition, with P̂ = P(y|x,θ):
     E_{P̂}[(P̂ − P(y=1|P̂))²] + E_{P̂}[P(y=1|P̂)(1 − P(y=1|P̂))]
     * Calibration measures the absolute precision of probability estimation.
     * Refinement indicates how confident the estimator is in its estimates.
     * Visualization tools: reliability plots and sharpness graphs.

3. Error Rate
   - An inappropriate criterion for evaluating probability estimates:
     ErrorRate = (1/N) [ Σ_{i|y_i=1} I(p(y=1|x_i) ≤ 0.5) + Σ_{i|y_i=0} I(p(y=0|x_i) ≤ 0.5) ]
Analytical Tool # 3: Multiple Evaluation Metrics
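The Brier score and its calibration component can be sketched as below; the binning is an assumed discretization of the expectation over P̂, analogous to how a reliability plot is drawn:

```python
def brier_score(probs, labels):
    # MSE / Brier score for binary labels: (1/N) * sum_i (y_i - p_i)^2.
    return sum((y - p) ** 2 for p, y in zip(probs, labels)) / len(labels)

def calibration_term(probs, labels, n_bins=10):
    # Binned estimate of the calibration component: within each score bin,
    # the squared gap between the mean predicted probability and the
    # empirical frequency of class 1, weighted by bin size.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            emp = sum(y for _, y in b) / len(b)
            total += len(b) * (mean_p - emp) ** 2
    return total / len(labels)

probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 0, 0]
print(round(brier_score(probs, labels), 4))  # 0.1133
```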
Experiment Results

Data set | Features | MAX AUC | AUC: Single | AUC: Ensemble | MSE: Single | MSE: Ensemble | ErrRate: Single | ErrRate: Ensemble
Mushroom | Categorical | 1 | C4.4/C4.5 | RDT/BPET | C4.5 | RDT/BPET | C4.5/C4.4 | RDT/BPET
BC_wdbc | Continuous | 0.995 | CFT | RDT/BPET | C4.4 | RDT/BPET | C4.5 | RDT/BPET
Chess | Categorical | 0.99 | C4.4/CFT | BPET | C4.5/C4.4 | BPET | C4.5/C4.4 | BPET
BC_wisc | Continuous | 0.99 | CFT | RDT | C4.5/C4.4 | RDT | C4.5/C4.4 | RDT
HouseVote | Categorical | 0.99 | CFT/C4.4 | RDT/BPET | C4.5 | BPET | C4.5/C4.4 | BPET
Tic | Categorical | 0.99 | C4.4/CFT | BPET | C4.4 | BPET | C4.5/C4.4 | BPET
Hypothyroid | Mixed | 0.989 | CFT/C4.4 | RDT/BPET | C4.5 | BPET | C4.5 | BPET
Spam | Continuous | 0.98 | CFT | RDT | C4.4 | BPET/RDT | C4.5 | RDT
SickEuthyroid | Mixed | 0.98 | C4.4/CFT | BPET | C4.5 | BPET | C4.5 | BPET
Ionosphere | Continuous | 0.966 | C4.4/CFT | RDT/BPET | C4.4 | BPET | C4.5/C4.4 | BPET
Spectf | Continuous | 0.96 | CFT | RDT | C4.4 | RDT/BPET | C4.5/C4.4 | RDT/BPET
Australian | Mixed | 0.934 | CFT | RDT/BPET | C4.5 | BPET | C4.5 | BPET
Adult | Mixed | 0.9 | CFT | RDT/BPET | C4.5 | BPET | C4.5 | BPET
Sonar | Continuous | 0.88 | CFT/C4.4 | RDT | CFT/C4.4 | RDT/BPET | CFT/C4.4 | RDT
Hepatitis | Mixed | 0.87 | C4.4/CFT | RDT/BPET | CFT | RDT/BPET | C4.5/CFT | RDT
Pima | Continuous | 0.825 | CFT | RDT/BPET | CFT/C4.4 | RDT | CFT | RDT
Spect | Categorical | 0.816 | CFT | RDT | CFT | RDT | C4.5/CFT | RDT/BPET
Liver | Continuous | 0.75 | CFT | RDT/BPET | CFT | RDT | C4.5/CFT | RDT/BPET
Conjectures in Summary
1. RDT and CFT are better on AUC.
2. RDT is preferable on low-signal separability datasets, while BPET is favorable on high-signal separability datasets.
3. High-separability categorical datasets with limited feature values hurt RDT.
4. Among single trees, CFT is preferable on low-signal separability datasets.
Behind the Scenes - Why are RDT and CFT better on AUC?
Superior capability of generating unique probabilities.

Unique probabilities (Win-Loss-Tie), averages:
AVG | RDT | C4.4 | C4.5 | CFT
BagPET | 0-14.9-3.1 | 18-0-0 | 18-0-0 | 11.6-4-2.4
RDT | | 18-0-0 | 18-0-0 | 16-0.6-1.4
C4.4 | | | 17.9-0-0.1 | 0.1-15.2-2.7
C4.5 | | | | 0-17.3-0.7

Unique probabilities (Win-Loss-Tie), standard deviations:
STDEV | RDT | C4.4 | C4.5 | CFT
BagPET | 0-0.9-0.9 | 0-0-0 | 0-0-0 | 1.9-1.8-0.5
RDT | | 0-0-0 | 0-0-0 | 0.8-0.5-0.7
C4.4 | | | 0.3-0-0.3 | 0.3-1.6-1.3
C4.5 | | | | 0-1.9-1.9
• AUC calculations:
  - Trapezoidal integration (Fawcett, 03)
  - Ranking-based formula (Hand, 01): AUC = (Σ_{i=1}^{n_1} r_i − n_1(n_1+1)/2) / (n_0 n_1), where r_i are the ranks of the positive examples in the pooled score ordering and n_1, n_0 are the numbers of positive and negative examples.
• For a larger AUC, P(y|x,θ) should vary from one test point to another; the number of unique probabilities is maximized as a result:
  RDT > BPET > CFT > C4.4 > C4.5
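The ranking formula can be sketched directly; this simple version assumes no tied scores:

```python
def auc_by_ranks(pos, neg):
    # Rank-based AUC (Hand & Till style): rank all scores in ascending
    # order, sum the ranks of the positives, subtract the minimum possible
    # rank sum n1(n1+1)/2, and normalise by n0*n1. Assumes no tied scores.
    pooled = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg])
    rank_sum = sum(r for r, (_, label) in enumerate(pooled, start=1) if label == 1)
    n1, n0 = len(pos), len(neg)
    return (rank_sum - n1 * (n1 + 1) / 2) / (n0 * n1)

# 8 of the 9 positive-negative pairs are correctly ordered: AUC = 8/9.
print(auc_by_ranks([0.9, 0.8, 0.4], [0.7, 0.3, 0.1]))
```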
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal separability datasets?
• The reasons:
1. RDT discards any criterion for optimal feature selection.
2. It is more like a structure for data summarization.
3. When the signal separability is low, this property protects RDT from the danger of identifying noise as signal, i.e. overfitting on noise, which is very likely to be caused by the massive search and optimization adopted by BPET.
4. RDT provides an averaged probability estimate that approaches the mean of the true probabilistic values as more individual trees are added.
[Figure: MSE and AUC learning curves on the low-signal separability domains (percentage of 75% of the data examples vs. MSE/AUC) for BagPET (&), RDT (o), C4.4 ($), C4.5 (x), and CFT (#).]
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal separability datasets?
• The evidence (I) – Spect and Sonar, low-signal separability domains
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal separability datasets?
• The evidence (II) – Pima, a low-signal separability domain
[Figure: reliability plots and sharpness graphs on Pima. RDT: Cal = 0.0036; BPET: Cal = 0.018.]
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal separability datasets?
• The evidence (III) – Spam, a high-signal separability domain
[Figure: reliability plots and sharpness graphs on Spam. RDT: Cal = 0.013; BPET: Cal = 0.0038.]
[Figure: MSE and AUC learning curves on Tic-tac-toe and Chess for BagPET (&), RDT (o), C4.4 ($), C4.5 (x), and CFT (#).]
Behind the Scenes - Why do high-separability categorical datasets with limited feature values hurt RDT?
• The observations – Tic-tac-toe and Chess
Behind the Scenes - Why do high-separability categorical datasets with limited feature values hurt RDT?
• The reason: high-separability categorical datasets with limited feature values tend to restrict the degree of diversity that RDT's random feature selection can explore.
  - Random feature selection mechanism of RDT:
    • A categorical feature may be chosen at most once along a decision path.
    • A continuous feature may be chosen multiple times, but with a different splitting value each time.
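The selection mechanism described above can be sketched as follows; the function and feature names are hypothetical, and split-point generation for continuous features is assumed to happen elsewhere:

```python
import random

def rdt_pick_feature(used_on_path, feature_types, rng):
    # Sketch of RDT's random feature selection at a node: a categorical
    # feature is eligible at most once per decision path, while a
    # continuous feature may recur (each time it would get a fresh random
    # split point). `feature_types` maps name -> 'categorical'|'continuous'.
    candidates = [f for f, t in feature_types.items()
                  if t == 'continuous' or f not in used_on_path]
    return rng.choice(candidates) if candidates else None

rng = random.Random(7)
types = {'color': 'categorical', 'size': 'continuous'}
# 'color' was already used on this path, so only 'size' remains eligible,
# which is exactly how few-valued categorical data limits RDT's diversity.
print(rdt_pick_feature({'color'}, types, rng))  # size
```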
Behind the Scenes - Why is CFT preferable on low-signal separability datasets?

The reasons:
1. Low-signal separability domains: good performance benefits from the probability aggregation mechanism, which rectifies errors introduced into the probability estimates by attribute noise.
2. High-signal separability domains: aggregating the estimated probabilities from the other, irrelevant leaves adversely affects the final probability estimates.
[Figure: MSE and AUC learning curves on Spect and Pima for BagPET (&), RDT (o), C4.4 ($), C4.5 (x), and CFT (#).]
• The evidence (I) – Spect and Pima, low-signal separability domains

Behind the Scenes - Why is CFT preferable on low-signal separability datasets?
[Figure: reliability plots and sharpness graphs on Liver. CFT: Cal = 0.0044; C4.4: Cal = 0.081.]
• The evidence (II) - Liver, a low-signal separability domain
Choosing the Appropriate PET Algorithm Given a New Problem

Given a dataset, first estimate its signal-noise separability through the AUC score of RDT or BPET, then decide between ensemble and single trees:

• AUC < 0.9 (low signal-noise separability):
  - Ensemble (AUC, MSE, Error Rate): RDT
  - Single trees (AUC, MSE, Error Rate): CFT
• AUC >= 0.9 (high signal-noise separability):
  - Single tree: CFT for AUC; C4.5 or C4.4 for MSE and Error Rate
  - Ensemble (AUC, MSE, Error Rate), depending on feature types and value characteristics:
    • Categorical features with limited values: BPET
    • Continuous features (or categorical features with a large number of values): RDT (BPET)
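The selection guide can be condensed into a small function; the 0.9 threshold and the branch outcomes follow the flow above, while the function and parameter names are assumptions:

```python
def choose_pet(auc_estimate, ensemble, categorical_limited_values=False, metric='AUC'):
    # Sketch of the selection guide: `auc_estimate` is the AUC score
    # obtained by running RDT or BPET on the given dataset, and 0.9 is
    # the separability threshold between the two regimes.
    if auc_estimate < 0.9:  # low signal-noise separability
        return 'RDT' if ensemble else 'CFT'
    # high signal-noise separability
    if ensemble:
        return 'BPET' if categorical_limited_values else 'RDT (BPET)'
    return 'CFT' if metric == 'AUC' else 'C4.5 or C4.4'

print(choose_pet(0.95, True, categorical_limited_values=True))  # BPET
print(choose_pet(0.7, False))                                   # CFT
```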
Summary
• AUC: an index of signal-noise separability.
• The preference of a PET on multiple evaluation metrics depends on the "signal-noise separability" of the dataset and on other observable statistics.
• Many important and previously unrevealed properties of PETs are analyzed.
• A practical guide for choosing the most appropriate PET algorithm.
Thank you!
Questions?