The Rate of Convergence of AdaBoost


Indraneel Mukherjee, Cynthia Rudin, Rob Schapire

AdaBoost (Freund and Schapire 1997)

Basic properties of AdaBoost’s convergence are still not fully understood.

We address one of these basic properties: convergence rates with no assumptions.

• AdaBoost is known for its ability to combine “weak classifiers” into a “strong” classifier.

• AdaBoost iteratively minimizes “exponential loss” (Breiman 1999; Frean and Downs 1998; Friedman et al. 2000; Friedman 2001; Mason et al. 2000; Onoda et al. 1998; Rätsch et al. 2001; Schapire and Singer 1999).

Examples: {(x_i, y_i)}_{i=1,...,m}, with each (x_i, y_i) ∈ X × {−1, 1}

Hypotheses: H = {h_1, ..., h_N}, where h_j : X → [−1, 1]

Combination: F(x) = λ_1 h_1(x) + … + λ_N h_N(x)

Misclassification error ≤ exponential loss:

  (1/m) Σ_{i=1}^{m} 1[y_i F(x_i) ≤ 0]  ≤  (1/m) Σ_{i=1}^{m} exp(−y_i F(x_i))

Exponential loss:

  L(λ) = (1/m) Σ_{i=1}^{m} exp(−Σ_{j=1}^{N} λ_j y_i h_j(x_i))

[Figure: illustration of the exponential loss over two coordinates λ_1 and λ_2.]
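To make the setup concrete, here is a minimal NumPy sketch (mine, not the authors' code) of AdaBoost viewed as greedy coordinate descent on the exponential loss L(λ). The margin matrix A with entries y_i h_j(x_i) and the exact line-search step for ±1-valued hypotheses are assumptions of this sketch.

```python
import numpy as np

def adaboost_exp_loss(A, T):
    """Greedy coordinate descent on L(lam) = mean(exp(-A @ lam)).

    A: (m, N) array with A[i, j] = y_i * h_j(x_i), entries in [-1, 1].
    T: number of boosting rounds.
    Returns the coefficient vector lam and the loss after each round.
    """
    m, N = A.shape
    lam = np.zeros(N)
    losses = []
    for _ in range(T):
        w = np.exp(-(A @ lam))             # unnormalized weights exp(-y_i F(x_i))
        w = w / w.sum()                    # AdaBoost's distribution over examples
        edges = w @ A                      # weighted "edge" of every hypothesis
        j = int(np.argmax(np.abs(edges)))  # greedily pick the largest |edge|
        delta = edges[j]                   # the edge of the chosen hypothesis
        # Exact line-search step when h_j is {-1,+1}-valued (assumes |delta| < 1);
        # for confidence-rated hypotheses in [-1, 1] this is only an approximation.
        alpha = 0.5 * np.log((1 + delta) / (1 - delta))
        lam[j] += alpha
        losses.append(float(np.mean(np.exp(-(A @ lam)))))
    return lam, losses
```

On data where no finite minimizer of L exists (e.g., a separable dataset), the recorded losses keep decreasing toward the infimum while the ℓ1-norm of lam grows without bound, which is exactly the regime the convergence-rate question below is about.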

Known:

• AdaBoost converges asymptotically to the minimum of the exponential loss (Collins et al. 2002, Zhang and Yu 2005).

• Convergence rates under strong assumptions:
  – the “weak learning” assumption holds, hypotheses are better than random guessing (Freund and Schapire 1997, Schapire and Singer 1999)
  – a finite minimizer is assumed to exist (Rätsch et al. 2002, many classic results)

• Conjectured by Schapire (2010) that fast convergence rates hold without any assumptions.

• The convergence rate is relevant for the consistency of AdaBoost (Bartlett and Traskin 2007).

Outline

• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

• Convergence Rate 2: Convergence to optimal loss. “Can we get within ε of an optimal solution?”

Main Messages

• Usual approaches assume a finite minimizer
  – Much more challenging not to assume this!

• Separated two different modes of analysis
  – comparison to reference, comparison to optimal
  – different rates of convergence are possible in each

• Analyses of convergence rates often ignore the “constants”
  – we show they can be extremely large in the worst case


Based on a conjecture that says...

"At iteration t, L(λ_t) will be at most ε more than the loss of any parameter vector of ℓ1-norm bounded by B, within a number of rounds t that is at most a polynomial in log N, m, B, and 1/ε."

[Figure: a ball of radius B containing a reference vector λ* with loss L(λ*); AdaBoost's iterate λ_t reaches loss L(λ_t) within ε of L(λ*).]

This happens at:

  t ≤ poly(log N, m, B, 1/ε)

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13‖λ*‖₁⁶ ε⁻⁵ rounds.

This is poly(log N, m, B, 1/ε).

The best known previous result is that it takes at most order e^(1/ε²) rounds (Bickel et al.).
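For a rough sense of scale (my own back-of-the-envelope numbers; B = 10 is an arbitrary choice), the two bounds can be compared in log scale so the exponential one does not overflow:

```python
import math

# Theorem 1's bound 13 * B**6 * eps**-5 versus the earlier e**(1/eps**2) bound,
# compared via log10. The choice B = 10 is illustrative only.
B = 10.0
for eps in (0.1, 0.01):
    log10_thm1 = math.log10(13.0) + 6 * math.log10(B) + 5 * math.log10(1.0 / eps)
    log10_prev = (1.0 / eps**2) * math.log10(math.e)
    print(f"eps={eps}: Theorem 1 ~ 10^{log10_thm1:.1f} rounds, previous ~ 10^{log10_prev:.1f} rounds")
```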

Intuition behind proof of Theorem 1

• Old fact: if AdaBoost takes a large step, it makes a lot of progress:

  L(λ_t) ≤ L(λ_{t−1}) √(1 − δ_t²)

  δ_t is called the “edge.” It is related to the step size.

[Figure: the ball of radius B containing λ*, the current iterate λ_t, and the two quantities R_t and S_t defined below.]

  R_t := ln L(λ_t) − ln L(λ*)   (measures progress)

  S_t := inf_λ { ‖λ − λ_t‖₁ : L(λ) ≤ L(λ*) }   (measures distance)

• Old Fact: L(λ_t) ≤ L(λ_{t−1}) √(1 − δ_t²). If the δ_t's are large, we make progress.

• First lemma says: if S_t is small, then δ_t is large.

• Second lemma says: S_t remains small (unless R_t is already small).

• Combining: the δ_t's are large at each t (unless R_t is already small); specifically, δ_t ≥ R_{t−1}³ / B³ in each round t.
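A rough sketch (my reading of the slides, ignoring constants; the paper's argument is more careful) of how these pieces combine to give the shape of Theorem 1's bound:

```latex
% Taking logs in the old fact and using ln(1 - x) <= -x:
\ln L(\lambda_t) \;\le\; \ln L(\lambda_{t-1}) + \tfrac{1}{2}\ln\!\big(1-\delta_t^2\big)
                 \;\le\; \ln L(\lambda_{t-1}) - \tfrac{1}{2}\delta_t^2 ,
\qquad\text{i.e.}\qquad
R_t \;\le\; R_{t-1} - \tfrac{1}{2}\delta_t^2 .

% Plugging in the combined bound \delta_t \ge R_{t-1}^3 / B^3:
R_t \;\le\; R_{t-1} - \frac{R_{t-1}^6}{2B^6}.

% Treating this as dR/dt <= -R^6/(2B^6) and integrating gives t = O(B^6 / R_t^5).
% Since L(\lambda^*) <= 1, reaching R_t <= ln(1+\varepsilon) (roughly \varepsilon)
% guarantees L(\lambda_t) <= L(\lambda^*) + \varepsilon, so on the order of
% B^6 \varepsilon^{-5} rounds suffice, matching the form of Theorem 1.
```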

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13‖λ*‖₁⁶ ε⁻⁵ rounds.

• The dependence on ‖λ*‖₁ is necessary for many datasets.

Lemma: There are simple datasets for which the number of rounds required to achieve loss L* is at least (roughly) the norm of the smallest solution achieving loss L*:

  inf { ‖λ‖₁ : L(λ) ≤ L* } / (2 ln m)

Lemma: There are simple datasets for which the norm of the smallest solution achieving loss L* is exponential in the number of examples:

  inf { ‖λ‖₁ : L(λ) ≤ 2/m + ε } ≥ (2^(m−2) − 1) ln(1/(3ε))
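To illustrate how quickly these lower bounds blow up, here is a small calculation (my own numbers) that simply evaluates the two displayed quantities for a few values of m and ε:

```python
import math

# Evaluate the lower bounds displayed above: the smallest-solution norm
# (2**(m-2) - 1) * ln(1/(3*eps)), and the implied round count norm / (2 ln m).
for m in (8, 16, 24):
    for eps in (1e-2, 1e-4):
        norm_lb = (2 ** (m - 2) - 1) * math.log(1.0 / (3.0 * eps))
        rounds_lb = norm_lb / (2.0 * math.log(m))
        print(f"m={m:2d}  eps={eps:g}  norm >= {norm_lb:.3g}  rounds >= {rounds_lb:.3g}")
```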

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13‖λ*‖₁⁶ ε⁻⁵ rounds.

Conjecture: AdaBoost achieves loss at most L(λ*) + ε in at most O(B²/ε) rounds.

[Figure: Rate on a Simple Dataset (log scale): loss minus the optimal loss versus the number of rounds.]

Outline

• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

• Convergence Rate 2: Convergence to optimal loss. “Can we get within ε of an optimal solution?”

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C/ε rounds, where C only depends on the data.

• Better dependence on ε than Theorem 1 (actually optimal).

• Doesn’t depend on the size of the best solution within a ball.

• Can’t be used to prove the conjecture, because in some cases C > 2^m. (Mostly it’s much smaller.)

• Main tool is the “decomposition lemma”
  – Says that examples fall into 2 categories:
    • the zero-loss set Z
    • the finite-margin set F
  – A similar approach was taken independently by Telgarsky (2011).

[Figure: a toy dataset of + and − examples, partitioned into the finite-margin set F and the zero-loss set Z.]

Decomposition Lemma

For any dataset, there exists a partition of the training examples into Z and F such that the following hold simultaneously:

1.) For some γ > 0, there exists a vector η⁺ with ‖η⁺‖₁ = 1 such that:

  ∀ i ∈ Z: Σ_j η⁺_j y_i h_j(x_i) ≥ γ   (margins are at least γ on Z)

  ∀ i ∈ F: Σ_j η⁺_j y_i h_j(x_i) = 0   (examples in F have zero margin)

2.) The optimal loss considering only the examples in F is achieved by some finite vector η*.

[Figure: the toy dataset again; η⁺ attains margin at least γ on Z and zero margin on F, while η* achieves the optimal loss restricted to F.]
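One way to see (a sketch of mine, not stated on the slides) why this decomposition determines the optimal loss: move from η* to infinity along the direction η⁺.

```latex
% For c > 0, consider the combination \eta^* + c\,\eta^+. Examples in F have zero
% margin under \eta^+, so their loss terms do not depend on c; examples in Z have
% margin at least \gamma under \eta^+, so their terms are damped by at least e^{-c\gamma}:
L(\eta^* + c\,\eta^+)
  \;=\; \frac{1}{m}\sum_{i\in F} \exp\!\Big(-\sum_j \eta^*_j\, y_i h_j(x_i)\Big)
  \;+\; \frac{1}{m}\sum_{i\in Z} \exp\!\Big(-\sum_j \eta^*_j\, y_i h_j(x_i) \;-\; c\sum_j \eta^+_j\, y_i h_j(x_i)\Big),
% and the second sum vanishes as c \to \infty. So the infimum of L is at most the
% optimal loss restricted to F (achieved by the finite \eta^*), with Z contributing
% zero loss in the limit; this is why Z is called the zero-loss set.
```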

• We provide a conjecture about the dependence on m.

Lemma: There are simple datasets for which the constant C is doubly exponential, at least 2^Ω(2^m / m).

Conjecture: If the hypotheses are {−1, 0, 1}-valued, AdaBoost converges to within ε of the optimal loss within 2^O(m ln m) ε^(−1+o(1)) rounds.

• This would give optimal dependence on m and ε simultaneously.

To summarize

• Two rate bounds: one depends on the size of the best solution within a ball and has ε⁻⁵ dependence.

• The other, C/ε, depends only on ε and the data, but C can be doubly exponential in m.

• Many lower bounds and conjectures are in the paper.

Thank you

Intuition behind proof of Theorem 2

• Old Fact: L(λ_t) ≤ L(λ_{t−1}) √(1 − δ_t²). If the δ_t's are large, we make progress.

• First lemma says: the δ_t's are large whenever the loss on Z is large.

• Second lemma says: the δ_t's are large whenever the loss on F is large. This translates into: the δ_t's are large whenever the loss on Z is small.

