
AdaBoost & Its Applications

Presenter: 虞台文

Outline

Overview
The AdaBoost Algorithm
How and why AdaBoost works?
AdaBoost for Face Detection

AdaBoost & Its Applications

Overview

Introduction

AdaBoost = Adaptive Boosting

A learning algorithm for building a strong classifier from a lot of weaker ones.

AdaBoost Concept

Weak classifiers, each only slightly better than random:
$h_1(x) \in \{-1, +1\},\; h_2(x) \in \{-1, +1\},\; \dots,\; h_T(x) \in \{-1, +1\}$

Strong classifier, a weighted vote of the weak ones:
$H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Weaker Classifiers

Weak classifiers $h_1(x), h_2(x), \dots, h_T(x) \in \{-1, +1\}$, each only slightly better than random, are combined into the strong classifier $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

Each weak classifier learns by considering one simple feature.

The $T$ most beneficial features for classification should be selected. How to
– define features?
– select beneficial features?
– train weak classifiers?
– manage (weight) training samples?
– associate a weight with each weak classifier?

The Strong Classifiers

The strong classifier $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$ combines the weak classifiers $h_1, \dots, h_T$, each only slightly better than random.

How good will the strong one be?

AdaBoost & Its Applications

The AdaBoost Algorithm

The AdaBoost Algorithm

Given: $(x_1, y_1), \dots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$

Initialization: $D_1(i) = \frac{1}{m},\; i = 1, \dots, m$

For $t = 1, \dots, T$:

• Find the classifier $h_t : X \to \{-1, +1\}$ which minimizes the error with respect to $D_t$, i.e., $h_t = \arg\min_{h_j} \epsilon_j$, where $\epsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_j(x_i)]$
  ($D_t(i)$: probability distribution over the $x_i$'s at time $t$ — this step minimizes the weighted error)

• Weight the classifier: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
  (chosen to minimize the exponential loss)

• Update the distribution: $D_{t+1}(i) = \frac{D_t(i)\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor
  (gives misclassified patterns more chance of being learned)

Output the final classifier: $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

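To make the loop above concrete, here is a minimal NumPy sketch that uses single-feature decision stumps as the weak learners. The helper names (train_stump, adaboost) and the stump form are illustrative assumptions, not part of the original slides.

import numpy as np

def train_stump(X, y, D):
    # Exhaustively pick the (feature, threshold, parity) with lowest weighted error.
    # X: (m, n) array, y: array of labels in {-1, +1}, D: weights D_t(i).
    m, n = X.shape
    best = (None, None, None, np.inf)            # (feature, threshold, parity, error)
    for f in range(n):
        for thr in np.unique(X[:, f]):
            for parity in (+1, -1):
                pred = np.where(parity * X[:, f] < parity * thr, 1, -1)
                err = np.sum(D[y != pred])       # weighted error under D_t
                if err < best[3]:
                    best = (f, thr, parity, err)
    return best

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        f, thr, parity, eps = train_stump(X, y, D)
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # alpha_t
        pred = np.where(parity * X[:, f] < parity * thr, 1, -1)
        D = D * np.exp(-alpha * y * pred)        # D_{t+1}(i) ∝ D_t(i) e^{-alpha_t y_i h_t(x_i)}
        D = D / D.sum()                          # Z_t normalization
        stumps.append((f, thr, parity))
        alphas.append(alpha)
    def H(Xnew):                                 # strong classifier: sign of weighted vote
        votes = sum(a * np.where(p * Xnew[:, f] < p * thr, 1, -1)
                    for (f, thr, p), a in zip(stumps, alphas))
        return np.sign(votes)
    return H

Usage (with hypothetical arrays X_train, y_train, X_test): H = adaboost(X_train, y_train, T=50); y_pred = H(X_test).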

Boosting illustration

Weak Classifier 1

Boosting illustration

Weights Increased

Boosting illustration

Weak Classifier 2

Boosting illustration

Weights Increased

Boosting illustration

Weak Classifier 3

Boosting illustration

Final classifier is a combination of weak classifiers

AdaBoost & Its Applications

How and why AdaBoost works?

The AdaBoost Algorithm


What goal does AdaBoost want to reach?

The weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, the weak classifier $h_t$, and the distribution update $D_{t+1}$ are all goal dependent.

Goal

Minimize the exponential loss:

$\mathrm{loss}_{\exp}(H(x)) = E_{x,y}\!\left[e^{-yH(x)}\right]$

Final classifier: $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Goal

Minimizing the exponential loss

$\mathrm{loss}_{\exp}(H(x)) = E_{x,y}\!\left[e^{-yH(x)}\right]$

amounts to maximizing the margin $yH(x)$.

Goal

Final classifier: $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Minimize $\mathrm{loss}_{\exp}(H(x)) = E_{x,y}\!\left[e^{-yH(x)}\right]$.

Define $H_t(x) = H_{t-1}(x) + \alpha_t h_t(x)$ with $H_0(x) = 0$; then $H(x) = H_T(x)$.

$E_{x,y}\!\left[e^{-yH_t(x)}\right] = E_x\!\left[E_y\!\left[e^{-yH_t(x)} \mid x\right]\right]$
$\quad = E_x\!\left[E_y\!\left[e^{-y[H_{t-1}(x) + \alpha_t h_t(x)]} \mid x\right]\right]$
$\quad = E_x\!\left[E_y\!\left[e^{-yH_{t-1}(x)}\, e^{-\alpha_t y h_t(x)} \mid x\right]\right]$
$\quad = E_x\!\left[e^{-yH_{t-1}(x)}\!\left(e^{-\alpha_t} P(y = h_t(x)) + e^{\alpha_t} P(y \neq h_t(x))\right)\right]$

$\alpha_t = \;?$

To choose $\alpha_t$, set $\dfrac{\partial}{\partial \alpha_t} E_{x,y}\!\left[e^{-yH_t(x)}\right] = 0$, i.e.,

$E_x\!\left[e^{-yH_{t-1}(x)}\!\left(-e^{-\alpha_t} P(y = h_t(x)) + e^{\alpha_t} P(y \neq h_t(x))\right)\right] = 0$

Solving

$E_x\!\left[e^{-yH_{t-1}(x)}\!\left(-e^{-\alpha_t} P(y = h_t(x)) + e^{\alpha_t} P(y \neq h_t(x))\right)\right] = 0$

gives

$\alpha_t = \frac{1}{2}\ln\frac{P(y = h_t(x))}{P(y \neq h_t(x))} = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$,

where $\epsilon_t = P(\text{error}) = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_t(x_i)]$, taking $P(x_i, y_i) = D_t(i)$.
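As a quick numerical illustration (numbers of my own choosing, not from the slides): a weak classifier with weighted error $\epsilon_t = 0.3$ receives weight

$\alpha_t = \frac{1}{2}\ln\frac{1-0.3}{0.3} = \frac{1}{2}\ln\frac{7}{3} \approx 0.42,$

while $\epsilon_t = 0.5$ (no better than chance) gives $\alpha_t = 0$, and $\epsilon_t \to 0$ gives $\alpha_t \to \infty$.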

The derived weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ is exactly the classifier weight used in the algorithm.


$D_{t+1} = \;?$

Given $\epsilon_t$ and $\alpha_t$, how should the distribution be updated, and which weak classifier $h_t$ should be selected?



To see which $h_t$ should be picked, and which distribution it should be trained on, expand the loss at round $t$:

$E_{x,y}\!\left[e^{-yH_t}\right] = E_{x,y}\!\left[e^{-yH_{t-1}}\, e^{-\alpha_t y h_t}\right]$
$\quad \approx E_{x,y}\!\left[e^{-yH_{t-1}}\!\left(1 - \alpha_t y h_t + \tfrac{\alpha_t^2 y^2 h_t^2}{2}\right)\right]$  (second-order expansion of $e^{-\alpha_t y h_t}$)

so

$h_t = \arg\min_h E_{x,y}\!\left[e^{-yH_{t-1}}\!\left(1 - \alpha_t y h + \tfrac{\alpha_t^2 y^2 h^2}{2}\right)\right]$
$\quad = \arg\min_h E_{x,y}\!\left[e^{-yH_{t-1}}\!\left(1 - \alpha_t y h + \tfrac{\alpha_t^2}{2}\right)\right]$  (since $y^2 h^2 = 1$)
$\quad = \arg\min_h E_x\!\left[E_y\!\left[e^{-yH_{t-1}}\!\left(1 - \alpha_t y h + \tfrac{\alpha_t^2}{2}\right) \mid x\right]\right]$

Final classifier:1

( ) ( )T

t tt

sign H x h x

( )exp ~ ,( ) yH x

x D yloss H x E e Minimize

Define 1( ) ( ) ( )t t t tH x H x h x 0 ( ) 0H x

Then, ( ) ( )TH x H x

1 ?tD

1 21arg min 1 |

2tyH

t x y t th

h E E e y h x

1arg min |tyHt x y t

hh E E e y h x

1arg max |tyHt x y

hh E E e yh x

1 1( ) ( )arg max 1 ( ) ( 1| ) ( 1) ( ) ( 1| )t tH x H xt x

hh E h x e P y x h x e P y x

Equivalently, under the reweighted distribution $x, y \sim e^{-yH_{t-1}(x)}\, P(y \mid x)$,

$h_t = \arg\max_h E_{x,\, y \sim e^{-yH_{t-1}(x)} P(y \mid x)}\!\left[y\, h(x)\right]$,

which is maximized when $h(x)$ agrees with $y$ as often as possible, i.e.,

$h_t(x) = \mathrm{sign}\!\left(P_{x,\, y \sim e^{-yH_{t-1}(x)} P(y \mid x)}(y=1 \mid x) - P_{x,\, y \sim e^{-yH_{t-1}(x)} P(y \mid x)}(y=-1 \mid x)\right) = \mathrm{sign}\!\left(E_{x,\, y \sim e^{-yH_{t-1}(x)} P(y \mid x)}\!\left[y \mid x\right]\right)$

So at round $t$, the weak classifier should be trained on samples drawn as $x, y \sim e^{-yH_{t-1}(x)}\, P(y \mid x)$. What does this sampling distribution look like at time 1 and at time $t+1$?


At time 1: $x, y \sim P(y \mid x)$ with $P(y_i \mid x_i) = 1$ on the training set, so $D_1(i) = \frac{1}{m}$.

At time $t+1$: $x, y \sim e^{-yH_t(x)}\, P(y \mid x)$, i.e., the weight of sample $i$ becomes $D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}$, which after normalization is exactly the algorithm's update

$D_{t+1}(i) = \frac{D_t(i)\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor.

AdaBoost & Its Applications

AdaBoost for

Face Detection

The Task of Face Detection

Many slides adapted from P. Viola

Basic Idea

Slide a window across the image and evaluate a face model at every location.

Challenges

Slide a window across the image and evaluate a face model at every location.

A sliding-window detector must evaluate tens of thousands of location/scale combinations.

Faces are rare: 0–10 per image.
– For computational efficiency, we should try to spend as little time as possible on the non-face windows.
– A megapixel image has ~$10^6$ pixels and a comparable number of candidate face locations.
– To avoid having a false positive in every image, the false positive rate has to be less than $10^{-6}$.

The Viola/Jones Face Detector

A seminal approach to real-time object detection. Training is slow, but detection is very fast. Key ideas:
– Integral images for fast feature evaluation
– Boosting for feature selection
– Attentional cascade for fast rejection of non-face windows

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

Image Features

Rectangle filters.

Feature value = (sum of pixels in white area) − (sum of pixels in black area)


Size of Feature Space

How many possible rectangle features are there for a 24×24 detection region?

Two-rectangle features (A+B): $\sum_{w=1}^{12}\sum_{h=1}^{24} 2\,(24 - 2w + 1)(24 - h + 1)$

Three-rectangle features (C): $\sum_{w=1}^{8}\sum_{h=1}^{24} 2\,(24 - 3w + 1)(24 - h + 1)$

Four-rectangle features (D): $\sum_{w=1}^{12}\sum_{h=1}^{12} (24 - 2w + 1)(24 - 2h + 1)$

Total: roughly 160,000 features.
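The count can be checked with a few lines of Python evaluating the sums above (a sketch; the function name is mine):

def count_rectangle_features(W=24, H=24):
    # two-rectangle features (horizontal + vertical orientations)
    two = sum(2 * (W - 2 * w + 1) * (H - h + 1)
              for w in range(1, W // 2 + 1) for h in range(1, H + 1))
    # three-rectangle features
    three = sum(2 * (W - 3 * w + 1) * (H - h + 1)
                for w in range(1, W // 3 + 1) for h in range(1, H + 1))
    # four-rectangle features
    four = sum((W - 2 * w + 1) * (H - 2 * h + 1)
               for w in range(1, W // 2 + 1) for h in range(1, H // 2 + 1))
    return two + three + four

print(count_rectangle_features())   # on the order of 160,000 for a 24x24 window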

Feature Selection

There are roughly 160,000 possible rectangle features for a 24×24 detection region. What features are good for face detection?

Feature Selection

Can we create a good classifier using just a small subset of all possible features? How can we select such a subset?

Integral images

The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive.

$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')$

Computing the Integral Image

The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive.

This can quickly be computed in one pass through the image.

$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')$

Recurrences (one pass over the image):

$s(x, y) = s(x-1, y) + i(x, y)$, with $s(-1, y) = 0$
$ii(x, y) = ii(x, y-1) + s(x, y)$, with $ii(x, -1) = 0$
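A direct transcription of the two recurrences into Python might look like this (a sketch; in practice two cumulative sums, e.g. np.cumsum along each axis, do the same job):

import numpy as np

def integral_image(img):
    H, W = img.shape
    s = np.zeros((H, W), dtype=np.int64)    # cumulative row sums s(x, y)
    ii = np.zeros((H, W), dtype=np.int64)   # integral image ii(x, y)
    for y in range(H):
        for x in range(W):
            s[y, x] = (s[y, x - 1] if x > 0 else 0) + img[y, x]
            ii[y, x] = (ii[y - 1, x] if y > 0 else 0) + s[y, x]
    return ii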

Computing Sum within a Rectangle

With corner values A (top-left), B (top-right), C (bottom-left), and D (bottom-right) of the rectangle in the integral image,

$\text{sum} = ii(D) - ii(B) - ii(C) + ii(A)$

Only 3 additions/subtractions are required for any size of rectangle!
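Using the integral image from the sketch above, the constant-time rectangle sum can be written as follows (the corner convention is my own: (x0, y0) is the top-left and (x1, y1) the bottom-right corner, inclusive):

def rect_sum(ii, x0, y0, x1, y1):
    A = ii[y0 - 1, x0 - 1] if (x0 > 0 and y0 > 0) else 0
    B = ii[y0 - 1, x1] if y0 > 0 else 0
    C = ii[y1, x0 - 1] if x0 > 0 else 0
    D = ii[y1, x1]
    return D - B - C + A    # three additions/subtractions, any rectangle size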

Scaling

The integral image enables us to evaluate rectangle sums of all sizes in constant time.

Therefore, no image scaling is necessary: scale the rectangular features instead!

Boosting

Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier.
– A weak learner need only do better than chance.

Training consists of multiple boosting rounds.
– During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners.
– "Hardness" is captured by weights attached to the training examples.

Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September, 1999.

The AdaBoost Algorithm

At each round $t$: pick the weak classifier $h_t$ with the lowest weighted error $\epsilon_t$ under $D_t$, weight it by $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, and reweight the samples by $D_{t+1}(i) \propto D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}$, giving misclassified patterns more chance of being learned. Output $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T}\alpha_t h_t(x)\right)$.

Weak Learners for Face Detection


What base learner is appropriate for face detection?

Weak Learners for Face Detection

$h_t(x) = \begin{cases} 1 & \text{if } p_t f_t(x) < p_t \theta_t \\ 0 & \text{otherwise} \end{cases}$

where $x$ is a (24×24) window, $f_t$ is the value of a rectangle feature, $p_t \in \{+1, -1\}$ is a parity, and $\theta_t$ is a threshold.
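In code, this weak learner is just a signed threshold test on one feature value (a sketch with my own names):

def weak_classifier(feature_value, parity, theta):
    # 1 = face, 0 = non-face; parity in {+1, -1} chooses the inequality direction
    return 1 if parity * feature_value < parity * theta else 0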

Boosting

The training set contains face and non-face examples, initially with equal weight.

For each round of boosting:
– Evaluate each rectangle filter on each example
– Select the best threshold for each filter
– Select the best filter/threshold combination
– Reweight the examples

Computational complexity of learning: O(MNK), for M rounds, N examples, and K features.
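The "select the best threshold for each filter" step can be done in one pass over the sorted feature values. The following sketch (names and interface are my own assumptions) returns the threshold, parity, and weighted error for a single rectangle filter; doing this for every filter and keeping the best pair gives the round's weak classifier.

import numpy as np

def best_threshold(values, labels, weights):
    # values: feature values f(x_i); labels: 1 face / 0 non-face; weights: D_t(i)
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]
    total_pos, total_neg = w[y == 1].sum(), w[y == 0].sum()
    pos_below = neg_below = 0.0
    best_err, best_thr, best_parity = np.inf, None, None
    for i in range(len(v)):
        # error if everything below v[i] is called "face" (parity +1),
        # or if everything above v[i] is called "face" (parity -1)
        err_plus = neg_below + (total_pos - pos_below)
        err_minus = pos_below + (total_neg - neg_below)
        if min(err_plus, err_minus) < best_err:
            best_err = min(err_plus, err_minus)
            best_thr = v[i]
            best_parity = +1 if err_plus <= err_minus else -1
        if y[i] == 1:
            pos_below += w[i]
        else:
            neg_below += w[i]
    return best_thr, best_parity, best_err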

Features Selected by Boosting

First two features selected by boosting:

This feature combination can yield 100% detection rate and 50% false positive rate

ROC Curve for 200-Feature Classifier

A 200-feature classifier can yield 95% detection rate and a false positive rate of 1 in 14084.

Not good enough!

To be practical for real applications, the false positive rate must be closer to 1 in 1,000,000.

Attentional Cascade

(Cascade diagram: IMAGE SUB-WINDOW → Classifier 1 → Classifier 2 → Classifier 3 → ... → FACE; a rejection (F) at any classifier immediately labels the sub-window NON-FACE.)

We start with simple classifiers which reject many of the negative sub-windows while detecting almost all positive sub-windows

Positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on

A negative outcome at any point leads to the immediate rejection of the sub-window
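The control flow just described is a short loop; the following sketch assumes each stage is a callable returning True (pass) or False (reject):

def cascade_detect(stages, window):
    for stage in stages:
        if not stage(window):   # any negative outcome rejects immediately
            return False        # NON-FACE
    return True                 # survived every stage: FACE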

Attentional Cascade


Chain classifiers that are progressively more complex and have lower false positive rates

(ROC curve sketch: % detection vs. % false positives.)

Detection Rate and False Positive Rate for Chained Classifiers


The detection rate and the false positive rate of the cascade are found by multiplying the respective rates of the individual stages

A detection rate of 0.9 and a false positive rate on the order of $10^{-6}$ can be achieved by a 10-stage cascade if each stage has a detection rate of 0.99 ($0.99^{10} \approx 0.9$) and a false positive rate of about 0.30 ($0.3^{10} \approx 6\times10^{-6}$).

With per-stage false positive rates $f_i$ and detection rates $d_i$ (stage pairs $(f_1, d_1), (f_2, d_2), (f_3, d_3), \dots$), the cascade's overall rates are

$F = \prod_{i=1}^{K} f_i, \qquad D = \prod_{i=1}^{K} d_i$
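As a quick check of the arithmetic above:

$0.99^{10} = e^{10 \ln 0.99} \approx 0.904 \approx 0.9, \qquad 0.3^{10} \approx 5.9 \times 10^{-6} \approx 6 \times 10^{-6}.$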

Training the Cascade

Set target detection and false positive rates for each stage.

Keep adding features to the current stage until its target rates have been met.
– The AdaBoost threshold needs to be lowered to maximize detection (as opposed to minimizing total classification error).
– Test on a validation set.

If the overall false positive rate is not low enough, add another stage.

Use the false positives from the current stage as the negative training examples for the next stage.

Training the Cascade

ROC Curves: Cascaded Classifier vs. Monolithic Classifier

There is little difference between the two in terms of accuracy, but a big difference in terms of speed.
– The cascaded classifier is nearly 10 times faster, since its first stage throws out most non-faces so that they are never evaluated by subsequent stages.

The Implemented System

Training data
– 5,000 faces, all frontal, rescaled to 24×24 pixels
– 300 million non-face sub-windows drawn from 9,500 non-face images
– Faces are normalized for scale and translation

Many variations
– Across individuals
– Illumination
– Pose

Structure of the Detector Cascade

Successively more complex classifiers are combined in a cascade:
– 38 stages
– 6,060 features in total

(Cascade diagram: all sub-windows enter stage 1; each stage 1, 2, 3, ..., 38 either passes the sub-window (T) on to the next stage or rejects it (F); only sub-windows surviving all 38 stages are labeled Face.)

Stage sizes, chosen by the training algorithm:
– 2 features: rejects 50% of non-faces, detects 100% of faces
– 10 features: rejects 80% of non-faces, detects 100% of faces
– then 25 features, 50 features, ...

Speed of the Final Detector

On a 700 MHz Pentium III processor, the face detector can process a 384×288 pixel image in about 0.067 seconds.
– 15 Hz
– 15 times faster than a previous detector of comparable accuracy (Rowley et al., 1998)

An average of 8 features are evaluated per window on the test set.

Image Processing

Training: all example sub-windows were variance-normalized to minimize the effect of different lighting conditions.

Detection: sub-windows are variance-normalized as well,

$\sigma^2 = \frac{1}{N}\sum x^2 - m^2$,

where $m$ is the window mean and the sums run over the window's pixels. The mean can be computed from the integral image, and $\sum x^2$ from an integral image of the squared image.
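Using the rect_sum sketch from the integral-image section, per-window variance normalization can be computed in constant time per window (a sketch; ii and ii_sq are assumed to be integral images of the image and of the squared image):

def window_variance(ii, ii_sq, x0, y0, x1, y1):
    n = (x1 - x0 + 1) * (y1 - y0 + 1)           # number of pixels in the window
    mean = rect_sum(ii, x0, y0, x1, y1) / n      # m
    return rect_sum(ii_sq, x0, y0, x1, y1) / n - mean ** 2   # (1/N) Σ x^2 − m^2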

Scanning the Detector

Scaling is achieved by scaling the detector itself rather than the image; good detection results were obtained with a scaling factor of 1.25.

The detector is also scanned across locations.
– Subsequent locations are obtained by shifting the window by $[s\Delta]$ pixels, where $s$ is the current scale and $[\cdot]$ denotes rounding.
– Results for $\Delta = 1.0$ and $\Delta = 1.5$ were reported.
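Put together, the scan described above is two nested loops over shifted positions inside an outer loop over scales; a sketch follows (detect_at is a placeholder for evaluating the scaled cascade at one position, and all names are my own):

def scan(image_w, image_h, detect_at, base=24, scale_factor=1.25, delta=1.0):
    detections = []
    s = 1.0
    while round(base * s) <= min(image_w, image_h):
        size = int(round(base * s))              # scaled detector size
        step = max(1, int(round(s * delta)))     # shift by [s * delta] pixels
        for y in range(0, image_h - size + 1, step):
            for x in range(0, image_w - size + 1, step):
                if detect_at(x, y, size):
                    detections.append((x, y, size))
        s *= scale_factor
    return detections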

Merging Multiple Detections

ROC Curves for Face Detection

Output of Face Detector on Test Images

Other Detection Tasks

Facial Feature Localization

Male vs. female

Profile Detection

Other Detection Tasks

Facial Feature Localization Profile Detection

Other Detection Tasks

Male vs. Female

Conclusions

How does AdaBoost work? Why does AdaBoost work?

AdaBoost for face detection
– Rectangle features
– Integral images for fast computation
– Boosting for feature selection
– Attentional cascade for fast rejection of negative windows
