AdaBoost
Slides modified from: MLSS’03: Gunnar Rätsch, Introduction to Boosting
http://www.boosting.org
AdaBoost: Introduction
• Classifiers
• Supervised Classifiers
• Linear Classifiers
• Perceptron, Least Squares Methods
• Linear SVM
• Nonlinear Classifiers
• Part I: Multi Layer Neural Networks
• Part II: Pol. Class., RBF, Nonlinear SVM
• Nonmetric Methods - Decision Trees
• AdaBoost
• Unsupervised Classifiers
AdaBoost: Agenda
• Idea of AdaBoost
(Adaptive Boosting; R. Schapire and Y. Freund, ICML 1996):
– Combine many low-accuracy classifiers (weak learners)
to create a high-accuracy classifier (strong learner)
AdaBoost: Introduction
• Example: 2 classes of apples
The World: $P(x, y)$
Data: $\{(x_n, y_n)\}_{n=1}^{N}$, with $x_n \in \mathbb{R}^d$, $y_n \in \{\pm 1\}$
Unknown target function: $y = f(x)$ (or $y \sim P(y \mid x)$)
Unknown distribution: $x \sim p(x)$
Objective: Given a new $x$, predict $y$
Problem: $P(x, y)$ is unknown!
AdaBoost: Introduction
The Model:
• Hypothesis class: $\mathcal{H} = \{ h \mid h : \mathbb{R}^d \to \{\pm 1\} \}$
• Loss: $l(y, h(x))$, e.g. $l(y, h(x)) = I[y \neq h(x)]$
• Objective: Minimize the true (expected) loss ("generalization error")
  $L(h) := \mathbb{E}[\, l(y, h(x)) \,]$, with $h^* = \arg\min_{h \in \mathcal{H}} L(h)$
• Problem: Only a data sample is available, $P(x, y)$ is unknown!
• Solution: Find the empirical minimizer
  $\hat{h}_N = \arg\min_{h \in \mathcal{H}} \frac{1}{N} \sum_{n=1}^{N} l(y_n, h(x_n))$

How can we efficiently construct complex hypotheses with small generalization error?
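As a small illustration of the empirical loss being minimized, a minimal Python sketch (the data and the threshold hypothesis here are made up for the example, not taken from the slides):

def empirical_loss(h, X, y):
    """Empirical 0-1 loss: fraction of examples on which h(x) != y."""
    return sum(1 for x_n, y_n in zip(X, y) if h(x_n) != y_n) / len(y)

# Toy sample with labels in {-1, +1} and a simple threshold hypothesis
X = [0.5, 1.5, 2.5, 3.5]
y = [-1, -1, +1, +1]
h = lambda x: 1 if x > 2.0 else -1
print(empirical_loss(h, X, y))  # 0.0 on this toy sample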
AdaBoost: Framework

Algorithm

Idea:
• Simple hypotheses are not perfect!
• Combining hypotheses increases accuracy

Problems:
• How to generate different hypotheses?
• How to combine them?

Method:
• Compute a distribution d1, . . . , dN on the examples
• Find a hypothesis on the weighted training sample (x1, y1, d1), . . . , (xN, yN, dN)
• Combine the hypotheses h1, h2, . . . linearly (a sketch of this prediction step follows):
  $f = \sum_{t=1}^{T} \alpha_t h_t$
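A minimal sketch of the linear combination step, assuming each hypothesis maps an example to ±1 (the function name and interface are illustrative, not from the slides):

def ensemble_predict(x, hypotheses, alphas):
    """Weighted vote: sign of f(x) = sum over t of alpha_t * h_t(x)."""
    f = sum(alpha * h(x) for h, alpha in zip(hypotheses, alphas))
    return 1 if f >= 0 else -1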
AdaBoost: Framework

Input: N examples {(x1, y1), . . . , (xN, yN)},
L, a learning algorithm generating hypotheses $h_t(x)$ (classifiers),
T, the maximum number of hypotheses in the ensemble

Initialize: $d_n^{(1)} = 1/N$ for all $n = 1, \ldots, N$; $d_n$ is the weight of example $n$ ($d$ is a distribution, with $\sum_{n=1}^{N} d_n^{(t)} = 1$)

Do for t = 1, . . . , T:
1. Train the base learner according to the example distribution $d^{(t)}$ and obtain hypothesis $h_t : x \mapsto \{\pm 1\}$
2. Compute the weighted error $\varepsilon_t = \sum_{n=1}^{N} d_n^{(t)} \, I(y_n \neq h_t(x_n))$
3. Compute the hypothesis weight $\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$
4. Update the example distribution $d_n^{(t+1)} = d_n^{(t)} \exp(-\alpha_t \, y_n h_t(x_n)) / Z_t$, where $Z_t$ is a normalization factor

Output: final hypothesis $f_{Ens}(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
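A compact Python/NumPy sketch of this loop, assuming a helper `train_weak_learner(X, y, d)` that fits a weak hypothesis to the weighted sample (the helper name is hypothetical, and the early-stopping guard for ε_t = 0 or ε_t ≥ 1/2 is a common practical addition not shown on the slide):

import numpy as np

def adaboost(X, y, train_weak_learner, T):
    """AdaBoost as on this slide: labels y in {-1, +1}, d a distribution over the N examples."""
    y = np.asarray(y)
    N = len(y)
    d = np.full(N, 1.0 / N)                      # initialize d_n^(1) = 1/N
    hypotheses, alphas = [], []
    for t in range(T):
        h = train_weak_learner(X, y, d)          # 1. train base learner on d^(t)
        pred = np.array([h(x) for x in X])
        eps = np.sum(d * (pred != y))            # 2. weighted error
        if eps == 0 or eps >= 0.5:               # no usable weak hypothesis: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # 3. hypothesis weight
        d = d * np.exp(-alpha * y * pred)        # 4. reweight examples
        d = d / d.sum()                          #    divide by Z_t so d stays a distribution
        hypotheses.append(h)
        alphas.append(alpha)
    # final hypothesis: f_Ens(x) = sum_t alpha_t * h_t(x), thresholded at 0
    return lambda x: 1 if sum(a * h(x) for h, a in zip(hypotheses, alphas)) >= 0 else -1

Passing the decision-stump learner sketched after the next slide as `train_weak_learner` gives the kind of axis-parallel weak classifiers used in the apple example below.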
AdaBoost: Decision Stumps

• A family of weak learners,
e.g. decision stump:
– can perform a single test on a single attribute with threshold Θ
– parameterize all decision stumps as follows:

$f(x; j) = \begin{cases} +1 & \text{if } x_j > \Theta \\ -1 & \text{else} \end{cases}, \quad j = 1, \ldots, d$
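One possible weighted stump learner matching this parameterization, as a brute-force sketch over features, thresholds, and the two sign orientations (the function name and the exhaustive search are illustrative choices, not from the slides):

import numpy as np

def train_stump(X, y, d):
    """Return the stump h(x) = s if x[j] > theta else -s with minimal weighted error."""
    X, y = np.asarray(X), np.asarray(y)
    best_err, best = np.inf, None
    for j in range(X.shape[1]):                   # single attribute j
        for theta in np.unique(X[:, j]):          # candidate thresholds
            for s in (+1, -1):                    # orientation of the test
                pred = np.where(X[:, j] > theta, s, -s)
                err = np.sum(d * (pred != y))     # weighted 0-1 error
                if err < best_err:
                    best_err, best = err, (j, theta, s)
    j, theta, s = best
    return lambda x: s if x[j] > theta else -s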
AdaBoost: Example

• Example: natural apples vs. plastic apples
How to classify?
(Figure: training data with examples of class A and class B)
AdaBoost
• Example: natural apples vs. plastic apples
1st hypothesis
Weak classifier (cuts on coordinate axes)
AdaBoost
• Example:
Recomputing weightings of the training patterns
AdaBoost
• Example:
2nd hypothesis
AdaBoost
• Example:
Recompute weighting
AdaBoost
• Example:
3rd hypothesis
AdaBoost
• Example:
Recompute weighting
4th hypothesis
AdaBoost
• Example:
Combination of hypotheses
AdaBoost
• Example:
Decision surface
AdaBoost
• Example
Final decision function
AdaBoost: Framework

Input: N examples {(x1, y1), . . . , (xN, yN)},
L, a learning algorithm generating hypotheses $h_t(x)$ (classifiers),
T, the maximum number of hypotheses in the ensemble

Initialize: $d_n^{(1)} = 1/N$ for all $n = 1, \ldots, N$; $d_n$ is the weight of example $n$ ($d$ is a distribution, with $\sum_{n=1}^{N} d_n^{(t)} = 1$)

Do for t = 1, . . . , T:
1. Train the base learner according to the example distribution $d^{(t)}$ and obtain hypothesis $h_t : x \mapsto \{\pm 1\}$
2. Compute the weighted error $\varepsilon_t = \sum_{n=1}^{N} d_n^{(t)} \, I(y_n \neq h_t(x_n))$
3. Compute the hypothesis weight $\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$
4. Update the example distribution $d_n^{(t+1)} = d_n^{(t)} \exp(-\alpha_t \, y_n h_t(x_n)) / Z_t$, where $Z_t$ is a normalization factor

Output: final hypothesis $f_{Ens}(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
AdaBoost

t = 1

$d_n^{(1)} = 1/N = 1/10$

$\varepsilon_1 = \sum_{n=1}^{N} d_n^{(1)} \, I(y_n \neq h_1(x_n)) = 0.3$

$\alpha_1 = \frac{1}{2} \ln \frac{1 - \varepsilon_1}{\varepsilon_1} = 0.424$

$f_{Ens}(x) = \alpha_1 h_1(x)$
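As a quick check of the numbers on this slide (using $\varepsilon_1 = 0.3$ from above): $\alpha_1 = \tfrac{1}{2}\ln\tfrac{1-0.3}{0.3} = \tfrac{1}{2}\ln\tfrac{7}{3} \approx 0.424$, so the subsequent update multiplies the weights of misclassified examples by $e^{\alpha_1} \approx 1.53$ and those of correctly classified examples by $e^{-\alpha_1} \approx 0.65$ before renormalizing by $Z_1$.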
AdaBoost

t = 2

$d_n^{(2)} = d_n^{(1)} \exp(-\alpha_1 \, y_n h_1(x_n)) / Z_1$, where $Z_t$ is a normalization factor

$\varepsilon_2 = \sum_{n=1}^{N} d_n^{(2)} \, I(y_n \neq h_2(x_n))$

$\alpha_2 = \frac{1}{2} \ln \frac{1 - \varepsilon_2}{\varepsilon_2}$

$f_{Ens}(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x)$
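A detail that can be derived from the update rule (it is not stated on the slide): with $\alpha_t = \tfrac{1}{2}\ln\tfrac{1-\varepsilon_t}{\varepsilon_t}$, the normalization factor has the closed form $Z_t = \sum_{n=1}^{N} d_n^{(t)} \exp(-\alpha_t y_n h_t(x_n)) = 2\sqrt{\varepsilon_t(1-\varepsilon_t)}$; for the example above with $\varepsilon_1 = 0.3$, this gives $Z_1 = 2\sqrt{0.3 \cdot 0.7} \approx 0.917$.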
AdaBoost

t = 3

…
AdaBoost: Framework

• Weak learners used with boosting
• Decision stumps (axis-parallel splits)
• Decision trees (e.g. C4.5 by Quinlan, 1996)
• Multi-layer neural networks (e.g. for OCR)
• Radial basis function networks (e.g. UCI benchmarks, etc.)

Decision trees:
• Hierarchical and recursive partitioning of the input space
• Many approaches, usually with axis-parallel splits
AdaBoost: AdaBoost vs. SVM
• Comparison: AdaBoost vs. SVM
(Figure: AdaBoost's decision line vs. the SVM's decision line)
These decision lines are for a low-noise case with similar generalization errors. In AdaBoost, RBF networks with 13 centers were used.
AdaBoost: Application
• Applications
• Decision trees (C4.5) as weak classifiers
• Spam filtering, ZIP code OCR
• Text classification: Schapire and Singer - used stumps with normalized term frequency and multi-class encoding
• OCR: Schwenk and Bengio (neural networks)
• Natural Language Processing: Collins; Haruno, Shirai and Ooyama
• Image retrieval: Tieu and Viola
• Medical diagnosis: Merle et al.
• Fraud Detection: Rätsch & Müller 2001
• Drug Discovery: Rätsch, Demiriz, Bennett 2002
• Elect. Power Monitoring: Onoda, Rätsch & Müller 2000
AdaBoost: Information
Introduction http://informatik.unibas.ch/lehre/ws06/cs232/_Downloads/Schapire_A_Short_Introduction_to_Boosting.pdf
Internet http://www.boosting.org http://www.cs.princeton.edu/~schapire/boost.html
Conferences Computational Learning Theory (COLT), Neural Information Processing Systems (NIPS), Int. Conference on Machine Learning (ICML), . . .
Journals Machine Learning, Journal of Machine Learning Research, Information and Computation, Annals of Statistics
People List available at http://www.boosting.org
Software Only a few implementations (the algorithms are ‘too simple’) (cf. http://www.boosting.org)