
1

AdaBoost

Slides modified from: MLSS’03: Gunnar Rätsch, Introduction to Boosting

http://www.boosting.org

2

AdaBoost: Introduction

• Classifiers

• Supervised Classifiers

• Linear Classifiers

• Perceptron, Least Squares Methods

• Linear SVM

• Nonlinear Classifiers

• Part I: Multi Layer Neural Networks

• Part II: Polynomial Classifiers, RBF, Nonlinear SVM

• Nonmetric Methods - Decision Trees

• AdaBoost

• Unsupervised Classifiers


3

AdaBoost: Agenda

• Idea AdaBoost

(Adaptive Boosting, R. Schapire, Y. Freund, ICML, 1996):

– Combine many low-accuracy classifiers (weak learners)

to create a high-accuracy classifier (strong learner)

4

AdaBoost: Introduction

• Example: 2 classes of apples

The World:

Data: $\{(x_n, y_n)\}_{n=1}^{N}, \; x_n \in \mathbb{R}^d, \; y_n \in \{\pm 1\}$

Unknown target function: $y = f(x)$ (or $y \sim P(y \mid x)$)

Unknown distribution: $x \sim p(x)$

Objective: Given new x, predict y

Problem: P(x, y) is unknown!


5

AdaBoost: Introduction

The Model:

• Hypothesis class: $\mathcal{H} = \{\, h \mid h : \mathbb{R}^d \to \{\pm 1\} \,\}$

• Loss: $l(y, h(x))$, e.g. the 0/1 loss $I[y \neq h(x)]$

• Objective: Minimize the true (expected) loss ("generalization error")
  $L(h) := \mathbb{E}\,[\, l(y, h(x)) \,]$, i.e. find $h^{*} = \arg\min_{h \in \mathcal{H}} L(h)$

• Problem: Only a data sample is available, P(x, y) is unknown!

• Solution: Find the empirical minimizer
  $\hat{h}_N = \arg\min_{h \in \mathcal{H}} \frac{1}{N} \sum_{n=1}^{N} l(y_n, h(x_n))$

How can we efficiently construct complex hypotheses with small generalization error?
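As a small illustration of the empirical loss minimized above, the following sketch in Python/NumPy computes the empirical 0/1 risk of a hypothesis on a sample (the toy data and the threshold hypothesis are made-up illustrations, not from the slides):

```python
import numpy as np

def empirical_risk(h, X, y):
    """Empirical 0/1 loss: fraction of examples with h(x) != y."""
    return float(np.mean(h(X) != y))

# Toy sample with labels in {-1, +1} (illustrative values only)
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([-1, -1, +1, +1])

# A simple hypothesis: threshold the first feature at 2.0
h = lambda X: np.where(X[:, 0] >= 2.0, 1, -1)

print(empirical_risk(h, X, y))   # 0.0 on this separable toy sample
```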

6

AdaBoost: Framework

Algorithm

Idea: • Simple hypotheses are not perfect! • Combining hypotheses increases accuracy

Problems: • How to generate different hypotheses? • How to combine them?

Method:

• Compute a distribution d1, . . . , dN on the examples

• Find a hypothesis on the weighted training sample
  (x1, y1, d1), . . . , (xN, yN, dN)

• Combine the hypotheses h1, h2, . . . linearly: $f = \sum_{t=1}^{T} \alpha_t h_t$


7

AdaBoost: Framework

Input: N examples {(x1, y1), . . . , (xN, yN)},
L, a learning algorithm generating hypotheses $h_t : x \mapsto \{\pm 1\}$ (classifiers),
T, the maximum number of hypotheses in the ensemble.

Initialize: $d_n^{(1)} = 1/N$ for all n = 1, . . . , N
(dn is the weight of example n; d is a distribution with $\sum_{n=1}^{N} d_n^{(t)} = 1$)

Do for t = 1, . . . , T:

1. Train the base learner according to the example distribution $d^{(t)}$ and obtain hypothesis $h_t$

2. Compute the weighted error $\varepsilon_t = \sum_{n=1}^{N} d_n^{(t)} \, I\,(y_n \neq h_t(x_n))$

3. Compute the hypothesis weight $\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$

4. Update the example distribution $d_n^{(t+1)} = d_n^{(t)} \exp\!\big(-\alpha_t \, y_n \, h_t(x_n)\big) / Z_t$, where $Z_t$ is a normalization factor

Output: final hypothesis $f_{Ens}(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
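The steps above translate almost line for line into code. Below is a minimal Python/NumPy sketch using decision stumps (next slide) as the weak learner; the helper names (train_stump, adaboost, predict) and the exhaustive stump search are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def train_stump(X, y, d):
    """Weak learner: pick the feature j, threshold theta and sign that
    minimize the weighted 0/1 error under the distribution d."""
    best = None
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] >= theta, sign, -sign)
                err = float(np.sum(d[pred != y]))
                if best is None or err < best[0]:
                    best = (err, j, theta, sign)
    return best  # (weighted error, feature index, threshold, sign)

def stump_predict(stump, X):
    _, j, theta, sign = stump
    return np.where(X[:, j] >= theta, sign, -sign)

def adaboost(X, y, T=10):
    N = len(y)
    d = np.full(N, 1.0 / N)                  # d^(1): uniform example weights
    ensemble = []
    for t in range(T):
        stump = train_stump(X, y, d)          # step 1: weak learner on d^(t)
        eps = stump[0]                        # step 2: weighted error
        if eps >= 0.5:                        # no better than chance -> stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # step 3: hypothesis weight
        pred = stump_predict(stump, X)
        d = d * np.exp(-alpha * y * pred)     # step 4: reweight examples ...
        d = d / d.sum()                       # ... and renormalize (division by Z_t)
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """Final hypothesis: sign of the weighted vote f_Ens(x)."""
    f = np.zeros(len(X))
    for alpha, stump in ensemble:
        f += alpha * stump_predict(stump, X)
    return np.sign(f)
```

For example, predict(adaboost(X, y, T=20), X) on a labelled sample X, y with labels in {-1, +1} returns the ensemble's ±1 votes on the training points.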

8

AdaBoost: Decision Stumps

• A family of weak learners, e.g. the decision stump:

– performs a single test on a single attribute xj with threshold Θ

– all decision stumps can be parameterized as follows:

$f(x;\, j, \Theta) = \begin{cases} \;\;\,1 & \text{if } x_j \geq \Theta \\ -1 & \text{else} \end{cases}, \qquad j = 1, \ldots, d$


9

AdaBoost: Example

• Example: natural apples vs. plastic apples

How to classify?

(Scatter plot of the two apple classes: class A and class B)

10

AdaBoost

• Example: natural apples vs. plastic apples

1st hypothesis

Weak classifier (cuts on coordinate axes)


11

AdaBoost

• Example:

Recomputing the weights of the training patterns

12

AdaBoost

• Example:

2nd hypothesis


13

AdaBoost

• Example:

Recompute weighting

14

AdaBoost

• Example:

3rd hypothesis


15

AdaBoost

• Example:

Recompute weighting

4th hypothesis

16

AdaBoost

• Example:

Combination of hypotheses


17

AdaBoost

• Example:

Decision surface

18

AdaBoost

• Example

Final decision function


19

AdaBoost: Framework (recap)

The algorithm from slide 7 is repeated here as a reference for the numeric walkthrough on the following slides.

20

AdaBoost

t = 1

$d_n^{(1)} = 1/N = 1/10$

$\varepsilon_1 = \sum_{n=1}^{N} d_n^{(1)} \, I\,(y_n \neq h_1(x_n)) = 0.3$

$\alpha_1 = \frac{1}{2} \ln \frac{1 - \varepsilon_1}{\varepsilon_1} = 0.424$

$f_{Ens}(x) = \alpha_1 h_1(x)$
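For reference, these numbers follow directly from the update rules. Assuming (as ε1 = 0.3 with uniform weights 1/10 implies) that exactly 3 of the 10 examples are misclassified by h1, the weights for the next round become:

$\alpha_1 = \tfrac{1}{2} \ln \tfrac{0.7}{0.3} \approx 0.424, \qquad Z_1 = 3 \cdot \tfrac{1}{10} e^{\alpha_1} + 7 \cdot \tfrac{1}{10} e^{-\alpha_1} \approx 0.917$

$d_n^{(2)} \approx \dfrac{0.1 \, e^{0.424}}{0.917} \approx \tfrac{1}{6}$ for misclassified examples, $\quad d_n^{(2)} \approx \dfrac{0.1 \, e^{-0.424}}{0.917} \approx \tfrac{1}{14}$ for correctly classified ones.

Misclassified points thus carry more weight when h2 is trained in round t = 2.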


21

AdaBoost

t = 2

$d_n^{(2)} = d_n^{(1)} \exp\!\big(-\alpha_1 \, y_n \, h_1(x_n)\big) / Z_1$, where $Z_1$ is a normalization factor

$\varepsilon_2 = \sum_{n=1}^{N} d_n^{(2)} \, I\,(y_n \neq h_2(x_n))$

$\alpha_2 = \frac{1}{2} \ln \frac{1 - \varepsilon_2}{\varepsilon_2}$

$f_{Ens}(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x)$

22

AdaBoost

t =3

… (further rounds proceed in the same way, up to t = T)


23

AdaBoost: Framework

• Weak Learners used with Boosting

• Decision stumps (axis parallel splits)

• Decision trees (e.g. C4.5 by Quinlan 1996)

• Multi-layer Neural networks (e.g. for OCR)

• Radial basis function networks (e.g. on UCI benchmarks)

Decision trees:

• Hierarchical and recursive partitioning of the input space

• Many approaches, usually axis parallel splits

24

AdaBoost: AdaBoost vs. SVM

• Comparison: AdaBoost vs. SVM

(Figure: AdaBoost's decision line vs. the SVM's decision line)

These decision lines are for a low noise case with similar generalization errors. In AdaBoost, RBF networks with 13 centers were used.


25

AdaBoost: Application

• Applications

• Decision trees (C4.5) as weak classifier

• Spam, Zip Code OCR

• Text classification: Schapire and Singer - Used stumps with normalized term frequency and multi-class encoding

• OCR: Schwenk and Bengio (neural networks)

• Natural Language Processing: Collins; Haruno, Shirai and Ooyama

• Image retrieval: Tieu and Viola

• Medical diagnosis: Merle et al.

• Fraud Detection: Rätsch & Müller 2001

• Drug Discovery: Rätsch, Demiriz, Bennett 2002

• Elect. Power Monitoring: Onoda, Rätsch & Müller 2000

26

AdaBoost: Information

Introduction http://informatik.unibas.ch/lehre/ws06/cs232/_Downloads/Schapire_A_Short_Introduction_to_Boosting.pdf

Internet http://www.boosting.org http://www.cs.princeton.edu/~schapire/boost.html

Conferences Computational Learning Theory (COLT), Neural Information Processing Systems (NIPS), Int. Conference on Machine Learning (ICML), . . .

Journals Machine Learning, Journal of Machine Learning Research, Information and Computation, Annals of Statistics

People List available at http://www.boosting.org

Software Only a few implementations (the algorithms are 'too simple') (cf. http://www.boosting.org)