AdaBoost & Its Applications
Presenter: Tai-Wen Yu (虞台文)
Outline

- Overview
- The AdaBoost Algorithm
- How and why does AdaBoost work?
- AdaBoost for Face Detection
AdaBoost & Its Applications
Overview
Introduction

AdaBoost = Adaptive Boosting: a learning algorithm for building a strong classifier out of a lot of weaker ones.
AdaBoost Concept

Weak classifiers, each slightly better than random:

$h_1(x) \in \{-1,+1\}, \quad h_2(x) \in \{-1,+1\}, \quad \dots, \quad h_T(x) \in \{-1,+1\}$

Strong classifier:

$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
The Weak Classifiers
Each weak classifier learns by considering one simple feature, and the T features most beneficial for classification should be selected. How to:
– define features?
– select beneficial features?
– train weak classifiers?
– manage (weight) training samples?
– associate a weight with each weak classifier?
The Strong Classifier
How good will the strong classifier be?
AdaBoost & Its Applications
The AdaBoost Algorithm
The AdaBoost Algorithm

Given: $(x_1, y_1), \dots, (x_m, y_m)$, where $x_i \in X$, $y_i \in \{-1, +1\}$

Initialization: $D_1(i) = \frac{1}{m}, \quad i = 1, \dots, m$

For $t = 1, \dots, T$:

- Find the classifier $h_t : X \to \{-1, +1\}$ that minimizes the weighted error with respect to $D_t$, i.e., $h_t = \arg\min_{h_j} \epsilon_j$, where $\epsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_j(x_i)]$. ($D_t(i)$ is the probability distribution over the $x_i$'s at time $t$.)
- Weight the classifier: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$ (chosen to minimize the exponential loss).
- Update the distribution: $D_{t+1}(i) = \frac{D_t(i) \exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is for normalization. This gives misclassified patterns more chance for learning.

Output the final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
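The loop above translates almost line for line into code. The following is a minimal sketch in Python (an illustration, not the presenter's implementation), using exhaustively searched decision stumps as the weak classifiers; the data layout and helper names are assumptions.

```python
# Minimal AdaBoost sketch: decision stumps as weak learners h_t,
# alpha_t = 0.5*ln((1-eps_t)/eps_t), and the exponential reweighting
# D_{t+1}(i) proportional to D_t(i)*exp(-alpha_t*y_i*h_t(x_i)).
import numpy as np

def stump_predict(X, j, theta, p):
    # h(x) = +1 if p*x_j < p*theta, else -1
    return np.where(p * X[:, j] < p * theta, 1, -1)

def train_stump(X, y, D):
    # Pick (feature, threshold, parity) minimizing sum_i D(i)*[y_i != h(x_i)].
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for p in (+1, -1):
                err = D[stump_predict(X, j, theta, p) != y].sum()
                if err < best_err:
                    best_err, best = err, (j, theta, p)
    return best_err, best

def adaboost(X, y, T):
    m = len(y)                               # labels y in {-1, +1}
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        eps, (j, theta, p) = train_stump(X, y, D)
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))
        D *= np.exp(-alpha * y * stump_predict(X, j, theta, p))
        D /= D.sum()                         # Z_t normalization
        ensemble.append((alpha, j, theta, p))
    return ensemble

def predict(ensemble, X):
    # H(x) = sign(sum_t alpha_t * h_t(x))
    return np.sign(sum(a * stump_predict(X, j, th, p)
                       for a, j, th, p in ensemble))
```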
Boosting Illustration

[Figure sequence:]
- Weak classifier 1
- Weights increased
- Weak classifier 2
- Weights increased
- Weak classifier 3
- Final classifier is a combination of weak classifiers
AdaBoost & Its Applications
How and why does AdaBoost work?
The AdaBoost Algorithm

What goal does AdaBoost want to reach? In the algorithm above, the classifier weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ and the distribution update $D_{t+1}$ are goal dependent.
Goal

Minimize the exponential loss

$\mathrm{loss}_{\exp}(H(x)) = E_{x,y}\left[e^{-yH(x)}\right]$

with final classifier $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T}\alpha_t h_t(x)\right)$.

Minimizing $e^{-yH(x)}$ maximizes the margin $yH(x)$.
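To see why minimizing the exponential loss pushes the margin up, one can tabulate $e^{-yH(x)}$ against the 0/1 loss; the margins below are made-up values for illustration.

```python
# e^{-y*H(x)} upper-bounds the 0/1 loss and decays as the margin grows.
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])       # hypothetical y*H(x)
print(np.exp(-margins))                # [7.389 1.649 1.    0.607 0.135]
print((margins <= 0).astype(float))    # 0/1 loss: [1. 1. 1. 0. 0.]
```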
Define $H_t(x) = H_{t-1}(x) + \alpha_t h_t(x)$ with $H_0(x) = 0$. Then $H(x) = H_T(x)$, and

$E_{x,y}\left[e^{-yH_t(x)}\right] = E_x\left[E_{y|x}\left[e^{-yH_t(x)} \mid x\right]\right]$
$= E_x\left[E_{y|x}\left[e^{-y[H_{t-1}(x) + \alpha_t h_t(x)]} \mid x\right]\right]$
$= E_x\left[E_{y|x}\left[e^{-yH_{t-1}(x)}\, e^{-\alpha_t y h_t(x)} \mid x\right]\right]$
$= E_x\left[e^{-yH_{t-1}(x)}\left(e^{-\alpha_t} P(y = h_t(x)) + e^{\alpha_t} P(y \neq h_t(x))\right)\right]$

$\alpha_t = ?$ Set $\frac{\partial}{\partial \alpha_t} E_{x,y}\left[e^{-yH_t(x)}\right] = 0$:

$E_x\left[e^{-yH_{t-1}(x)}\left(-e^{-\alpha_t} P(y = h_t(x)) + e^{\alpha_t} P(y \neq h_t(x))\right)\right] = 0$
Solving for $\alpha_t$:

$\alpha_t = \frac{1}{2}\ln\frac{P(y = h_t(x))}{P(y \neq h_t(x))} = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$

where $\epsilon_t = P(\text{error}) = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_t(x_i)]$, taking $P(x_i, y_i) = D_t(i)$. This is exactly the classifier weight used in the algorithm.
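As a quick numerical check (not in the slides), one can minimize the per-round objective $e^{-\alpha}(1-\epsilon_t) + e^{\alpha}\epsilon_t$ on a grid and compare with the closed form:

```python
# Grid-search the objective e^{-a}(1-eps) + e^{a}*eps and compare its
# minimizer with alpha = 0.5*ln((1-eps)/eps).
import numpy as np

eps = 0.2                                    # example weighted error
a = np.linspace(0.0, 2.0, 200001)
objective = np.exp(-a) * (1 - eps) + np.exp(a) * eps
print(a[objective.argmin()])                 # ~0.6931
print(0.5 * np.log((1 - eps) / eps))         # 0.6931...
```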
$D_{t+1} = ?$
Expand $e^{-\alpha_t y h_t}$ to second order:

$E_{x,y}\left[e^{-yH_t}\right] = E_{x,y}\left[e^{-yH_{t-1}}\, e^{-\alpha_t y h_t}\right]$
$\approx E_{x,y}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h_t + \tfrac{1}{2}\alpha_t^2 y^2 h_t^2\right)\right]$

$h_t = \arg\min_{h} E_{x,y}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h + \tfrac{1}{2}\alpha_t^2 y^2 h^2\right)\right]$

Since $y^2 h^2 = 1$,

$h_t = \arg\min_{h} E_{x,y}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h + \tfrac{1}{2}\alpha_t^2\right)\right]$
$= \arg\min_{h} E_x\left[E_{y|x}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h(x) + \tfrac{1}{2}\alpha_t^2\right) \mid x\right]\right]$
Dropping the terms that do not depend on $h$,

$h_t = \arg\min_{h} E_x\left[E_{y|x}\left[-e^{-yH_{t-1}}\, \alpha_t\, y\, h(x) \mid x\right]\right]$
$= \arg\max_{h} E_x\left[E_{y|x}\left[e^{-yH_{t-1}}\, y\, h(x) \mid x\right]\right]$
$= \arg\max_{h} E_x\left[(+1)\, h(x)\, e^{-H_{t-1}(x)} P(y = 1 \mid x) + (-1)\, h(x)\, e^{H_{t-1}(x)} P(y = -1 \mid x)\right]$
$h_t = \arg\max_{h} E_{x,\, y \sim e^{-yH_{t-1}(x)} P(y|x)}\left[y\, h(x)\right]$

This is maximized when $h(x) = y$ for each $x$, so

$h_t(x) = \mathrm{sign}\left(E_{y \sim e^{-yH_{t-1}(x)} P(y|x)}\left[y \mid x\right]\right)$
$= \mathrm{sign}\left(P_{y \sim e^{-yH_{t-1}(x)} P(y|x)}(y = 1 \mid x) - P_{y \sim e^{-yH_{t-1}(x)} P(y|x)}(y = -1 \mid x)\right)$
At time $t$: $\quad x, y \sim e^{-yH_{t-1}(x)}\, P(y \mid x)$
At time 1: $x, y \sim P(y \mid x)$ with $P(y_i \mid x_i) = 1$, so $D_1(i) = \frac{1}{m}$.

At time $t+1$: $x, y \sim e^{-yH_t(x)}\, P(y \mid x) = e^{-yH_{t-1}(x)}\, P(y \mid x)\cdot e^{-\alpha_t y h_t(x)}$, which corresponds to weighting each sample by $D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}$. Normalizing gives exactly the update rule of the algorithm:

$D_{t+1}(i) = \frac{D_t(i) \exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is for normalization.
AdaBoost & Its Applications
AdaBoost for Face Detection
The Task of Face Detection
Many slides adapted from P. Viola
Basic Idea
Slide a window across the image and evaluate a face model at every location.
Challenges

- A sliding-window detector must evaluate tens of thousands of location/scale combinations.
- Faces are rare: 0–10 per image.
  – For computational efficiency, we should spend as little time as possible on the non-face windows.
  – A megapixel image has ~10^6 pixels and a comparable number of candidate face locations.
  – To avoid having a false positive in every image, our false positive rate has to be less than 10^-6.
The Viola/Jones Face Detector

A seminal approach to real-time object detection. Training is slow, but detection is very fast. Key ideas:
– Integral images for fast feature evaluation
– Boosting for feature selection
– Attentional cascade for fast rejection of non-face windows

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.
Image Features

Rectangle filters:

Feature value = Σ(pixels in white area) − Σ(pixels in black area)
Size of Feature Space

How many possible rectangle features are there in a 24×24 detection region?

Two-rectangle features (types A and B): $\displaystyle\sum_{w=1}^{12}\sum_{h=1}^{24} 2\,(24 - 2w + 1)(24 - h + 1)$

Three-rectangle features (type C): $\displaystyle\sum_{w=1}^{8}\sum_{h=1}^{24} 2\,(24 - 3w + 1)(24 - h + 1)$

Four-rectangle features (type D): $\displaystyle\sum_{w=1}^{12}\sum_{h=1}^{12} (24 - 2w + 1)(24 - 2h + 1)$

Total: ≈160,000
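The three sums are easy to evaluate directly; a small Python check (assuming the sums as reconstructed above) gives the often-quoted total of 162,336 ≈ 160,000:

```python
# Count the rectangle features in a 24x24 window by feature type.
two = sum(2 * (24 - 2 * w + 1) * (24 - h + 1)
          for w in range(1, 13) for h in range(1, 25))      # types A+B
three = sum(2 * (24 - 3 * w + 1) * (24 - h + 1)
            for w in range(1, 9) for h in range(1, 25))     # type C
four = sum((24 - 2 * w + 1) * (24 - 2 * h + 1)
           for w in range(1, 13) for h in range(1, 13))     # type D
print(two, three, four, two + three + four)  # 86400 55200 20736 162336
```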
Feature Selection

What features are good for face detection? Can we create a good classifier using just a small subset of the ~160,000 possible features? How do we select such a subset?
Integral Images

The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive:

$ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y')$

Computing the Integral Image

The integral image can be computed in one pass over the image, using the recurrences

$s(x, y) = s(x - 1, y) + i(x, y)$
$ii(x, y) = ii(x, y - 1) + s(x, y)$

where $s(x, y)$ is the cumulative sum along the row.
Computing Sum within a Rectangle
A B C Dsum ii ii ii ii D B
C AOnly 3 additions are
required for any size of
rectangle!
Only 3 additions are
required for any size of
rectangle!
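A short sketch of both operations (the recurrences are folded into numpy's cumulative sums; the indexing convention is an assumption):

```python
# Integral image and 4-corner rectangle sum.
import numpy as np

def integral_image(img):
    # ii(x, y) = sum of img over x' <= x and y' <= y (inclusive)
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    # Sum over rows top..bottom and cols left..right (inclusive):
    # sum = ii(D) - ii(B) - ii(C) + ii(A)
    D = ii[bottom, right]
    B = ii[top - 1, right] if top > 0 else 0
    C = ii[bottom, left - 1] if left > 0 else 0
    A = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
    return D - B - C + A

img = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()  # 30
```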
Scaling

The integral image enables us to evaluate rectangle features of any size in constant time. Therefore, no image scaling is necessary: scale the rectangular features instead!
Boosting

Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier. A weak learner need only do better than chance.

Training consists of multiple boosting rounds:
– During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners.
– "Hardness" is captured by weights attached to training examples.

Y. Freund and R. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.
Weak Learners for Face Detection

What base learner is appropriate for face detection?
A weak classifier is a single rectangle feature with a threshold:

$h_t(x) = \begin{cases} 1 & \text{if } p_t f_t(x) < p_t \theta_t \\ 0 & \text{otherwise} \end{cases}$

where $x$ is a sub-window, $f_t$ is the value of the rectangle feature, $p_t$ is a parity bit, and $\theta_t$ is a threshold.
Boosting for face detection: the training set contains face and non-face examples, initially with equal weight. For each round of boosting:
– Evaluate each rectangle filter on each example
– Select the best threshold for each filter
– Select the best filter/threshold combination
– Reweight the examples

Computational complexity of learning: O(MNK), with M rounds, N examples, and K features.
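Per round, the dominant cost is finding the best threshold for each filter. Sorting the feature values lets every candidate threshold be scored in one sweep; the sketch below (labels in {1, 0} for face/non-face, matching the weak-learner definition above; the function name is hypothetical) illustrates the idea.

```python
# Best threshold and parity for one rectangle feature in one round:
# sort the feature responses, then sweep once, maintaining the weight
# of faces/non-faces at or below each candidate cut.
import numpy as np

def best_threshold(values, labels, weights):
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]
    total_pos, total_neg = w[y == 1].sum(), w[y == 0].sum()
    below_pos = np.cumsum(w * (y == 1))   # face weight at or below cut
    below_neg = np.cumsum(w * (y == 0))   # non-face weight at or below cut
    # error if "below cut" is classified face, vs classified non-face
    err_face_below = below_neg + (total_pos - below_pos)
    err_face_above = below_pos + (total_neg - below_neg)
    err = np.minimum(err_face_below, err_face_above)
    i = err.argmin()
    parity = 1 if err_face_below[i] <= err_face_above[i] else -1
    return v[i], parity, err[i]
```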
Features Selected by Boosting

First two features selected by boosting: this feature combination can yield a 100% detection rate and a 50% false positive rate.
ROC Curve for a 200-Feature Classifier

A 200-feature classifier can yield a 95% detection rate and a false positive rate of 1 in 14,084.

Not good enough! To be practical for real applications, the false positive rate must be closer to 1 in 1,000,000.
Attentional Cascade

[Figure: image sub-window → Classifier 1 → Classifier 2 → Classifier 3 → FACE; a negative response at any classifier sends the sub-window to NON-FACE]

- We start with simple classifiers that reject many of the negative sub-windows while detecting almost all positive sub-windows.
- A positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on.
- A negative outcome at any point leads to the immediate rejection of the sub-window.
Chain classifiers that are progressively more complex and have lower false positive rates.

[Figure: ROC curve, % detection vs. % false positives]
Detection Rate and False Positive Rate for Chained Classifiers

The detection rate $D$ and the false positive rate $F$ of the cascade are found by multiplying the respective rates $d_i$ and $f_i$ of the individual stages:

$F = \prod_{i=1}^{K} f_i, \qquad D = \prod_{i=1}^{K} d_i$

A detection rate of 0.9 and a false positive rate on the order of $10^{-6}$ can be achieved by a 10-stage cascade if each stage has a detection rate of 0.99 ($0.99^{10} \approx 0.9$) and a false positive rate of about 0.30 ($0.3^{10} \approx 6 \times 10^{-6}$).
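The arithmetic is easy to reproduce:

```python
# Overall rates of a 10-stage cascade with per-stage d = 0.99, f = 0.30.
detection = 0.99 ** 10          # ~0.904
false_positive = 0.30 ** 10     # ~5.9e-06
print(detection, false_positive)
```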
Training the Cascade

- Set target detection and false positive rates for each stage.
- Keep adding features to the current stage until its target rates have been met.
  – Need to lower the AdaBoost threshold to maximize detection (as opposed to minimizing total classification error).
  – Test on a validation set.
- If the overall false positive rate is not low enough, add another stage.
- Use the false positives from the current stage as the negative training examples for the next stage.
ROC Curves: Cascaded Classifier vs. Monolithic Classifier

- There is little difference between the two in terms of accuracy.
- There is a big difference in terms of speed: the cascaded classifier is nearly 10 times faster, since its first stage throws out most non-faces so that they are never evaluated by subsequent stages.
The Implemented System

Training data:
- 5,000 faces: all frontal, rescaled to 24×24 pixels, normalized for scale and translation
- 300 million non-face sub-windows drawn from 9,500 non-face images
- Many variations: across individuals, illumination, pose
Structure of the Detector Cascade

Successively more complex classifiers are combined in a cascade:
- 38 stages
- 6,060 features in total

[Figure: all sub-windows → stages 1, 2, 3, ..., 38 → face; a negative result at any stage rejects the sub-window]
The early stages are cheap and aggressive:
- Stage 1: 2 features, rejects 50% of non-faces, detects 100% of faces
- Stage 2: 10 features, rejects 80% of non-faces, detects 100% of faces
- Subsequent stages: 25 features, 50 features, ... (selected by the algorithm)
Speed of the Final Detector

On a 700 MHz Pentium III processor, the face detector can process a 384×288-pixel image in about 0.067 seconds:
- 15 Hz
- 15 times faster than previous detectors of comparable accuracy (Rowley et al., 1998)

An average of 8 features is evaluated per window on the test set.
Image Processing

During training, all example sub-windows were variance normalized to minimize the effect of different lighting conditions. During detection, sub-windows are variance normalized as well:

$\sigma^2 = \frac{1}{N}\sum x^2 - m^2$

where $m$ is the mean and $N$ the number of pixels in the window. The sum $\sum x^2$ can be computed using the integral image of the squared image.
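Using the `integral_image` and `rect_sum` helpers sketched earlier, the per-window variance needs only two rectangle sums, one over the image and one over its square; this is an illustrative sketch, not the paper's code.

```python
# Window variance from two integral images: var = E[x^2] - m^2.
import numpy as np

def window_variance(ii, ii_sq, top, left, bottom, right):
    n = (bottom - top + 1) * (right - left + 1)
    s = rect_sum(ii, top, left, bottom, right)       # sum of pixels
    s2 = rect_sum(ii_sq, top, left, bottom, right)   # sum of squared pixels
    m = s / n
    return s2 / n - m * m

# usage:
#   ii = integral_image(img.astype(np.float64))
#   ii_sq = integral_image(img.astype(np.float64) ** 2)
```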
Scanning the Detector

- Scaling is achieved by scaling the detector itself, rather than scaling the image. Good detection results are obtained for a scaling factor of 1.25.
- The detector is also scanned across locations: subsequent locations are obtained by shifting the window by [sΔ] pixels, where s is the current scale and the brackets denote rounding. Results for Δ = 1.0 and Δ = 1.5 were reported.
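A sketch of the scan loop just described (parameter names and defaults are assumptions):

```python
# Enumerate sub-windows: grow the 24x24 detector by 1.25x per level and
# shift by round(s * delta) pixels at scale s.
def scan_windows(img_w, img_h, base=24, scale_factor=1.25, delta=1.5):
    s = 1.0
    while round(base * s) <= min(img_w, img_h):
        size = int(round(base * s))
        step = max(1, int(round(s * delta)))
        for y in range(0, img_h - size + 1, step):
            for x in range(0, img_w - size + 1, step):
                yield x, y, size
        s *= scale_factor
```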
Merging Multiple Detections
ROC Curves for Face Detection
Output of Face Detector on Test Images
Other Detection Tasks

- Facial feature localization
- Profile detection
- Male vs. female classification
Conclusions

- How does AdaBoost work? Why does AdaBoost work?
- AdaBoost for face detection:
  – Rectangle features
  – Integral images for fast computation
  – Boosting for feature selection
  – Attentional cascade for fast rejection of negative windows