Transcript
Page 1: The Rate of Convergence of  AdaBoost

The Rate of Convergence of AdaBoost

Indraneel Mukherjee, Cynthia Rudin, Rob Schapire

Page 2: The Rate of Convergence of  AdaBoost

AdaBoost (Freund and Schapire 97)


Page 4: The Rate of Convergence of  AdaBoost

Basic properties of AdaBoost’s convergence are still not fully understood.

Page 5: The Rate of Convergence of  AdaBoost

Basic properties of AdaBoost’s convergence are still not fully understood.

We address one of these basic properties: convergence rates with no assumptions.

Page 6: The Rate of Convergence of  AdaBoost

• AdaBoost is known for its ability to combine “weak classifiers” into a “strong” classifier

• AdaBoost iteratively minimizes “exponential loss” (Breiman, 1999; Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000; Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999)

Page 7: The Rate of Convergence of  AdaBoost

• AdaBoost is known for its ability to combine “weak classifiers” into a “strong” classifier

• AdaBoost iteratively minimizes “exponential loss” (Breiman, 1999; Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000; Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999)

Examples: {(xi , yi )}i=1,...,m , with each (xi,yi)∈X ×{−1,1}

Hypotheses: H ={h1,...,hN} , where hj :X → [−1,1]

Page 8: The Rate of Convergence of  AdaBoost

• AdaBoost is known for its ability to combine “weak classifiers” into a “strong” classifier

• AdaBoost iteratively minimizes “exponential loss” (Breiman, 1999; Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000; Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999)

Examples: {(xi , yi )}i=1,...,m , with each (xi,yi)∈X ×{−1,1}

Hypotheses: H ={h1,...,hN} , where hj :X → [−1,1]

Combination: F(x)=λ1h1(x)+…+λNhN(x)
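
To make the setup concrete, here is a minimal AdaBoost sketch in Python. It is an illustrative implementation, not the authors' code: it assumes a small made-up dataset, a finite pool of ±1-valued threshold hypotheses, and an exhaustive weak learner that picks the hypothesis with the largest edge in each round.

    import math

    # Toy data: m examples (x_i, y_i) with y_i in {-1, +1}.  (Hypothetical dataset.)
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [+1, -1, +1, -1]

    # Finite hypothesis space H = {h_1, ..., h_N}, each h_j: X -> {-1, +1}.
    H = [lambda x, c=c: +1 if x < c else -1 for c in (0.5, 1.5, 2.5)]
    H += [lambda x, c=c: -1 if x < c else +1 for c in (0.5, 1.5, 2.5)]

    m, N = len(xs), len(H)
    lam = [0.0] * N                      # combination weights lambda_1 .. lambda_N

    for t in range(20):
        # Distribution over examples: D_t(i) proportional to exp(-y_i F(x_i)).
        F = lambda x: sum(lam[j] * H[j](x) for j in range(N))
        w = [math.exp(-ys[i] * F(xs[i])) for i in range(m)]
        D = [wi / sum(w) for wi in w]

        # Exhaustive weak learner: pick the hypothesis with the largest edge
        # d_j = sum_i D(i) y_i h_j(x_i).
        edges = [sum(D[i] * ys[i] * H[j](xs[i]) for i in range(m)) for j in range(N)]
        j_best = max(range(N), key=lambda j: abs(edges[j]))
        d = edges[j_best]
        if abs(d) >= 1.0:                # a perfect hypothesis; the step would be infinite
            break

        # AdaBoost's step on coordinate j_best.
        alpha = 0.5 * math.log((1 + d) / (1 - d))
        lam[j_best] += alpha

    exp_loss = sum(math.exp(-ys[i] * sum(lam[j] * H[j](xs[i]) for j in range(N)))
                   for i in range(m)) / m
    print("exponential loss after 20 rounds:", exp_loss)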

Page 9: The Rate of Convergence of  AdaBoost

• AdaBoost is known for its ability to combine “weak classifiers” into a “strong” classifier

• AdaBoost iteratively minimizes “exponential loss” (Breiman, 1999; Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000; Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999)

Examples: {(xi , yi )}i=1,...,m , with each (xi,yi)∈X ×{−1,1}

Hypotheses: H ={h1,...,hN} , where hj :X → [−1,1]

Combination: F(x)=λ1h1(x)+…+λNhN(x)

misclassification error ≤ exponential loss:

(1/m) ∑_{i=1}^m 1[ y_i F(x_i) ≤ 0 ]  ≤  (1/m) ∑_{i=1}^m exp( −y_i F(x_i) )

Page 10: The Rate of Convergence of  AdaBoost

• AdaBoost is known for its ability to combine “weak classifiers” into a “strong” classifier

• AdaBoost iteratively minimizes “exponential loss” (Breiman, 1999; Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000; Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999)

Exponential loss:

L(λ) = (1/m) ∑_{i=1}^m exp( − ∑_{j=1}^N λ_j y_i h_j(x_i) )

Examples: {(xi , yi )}i=1,...,m , with each (xi,yi)∈X ×{−1,1}

Hypotheses: H ={h1,...,hN} , where hj :X → [−1,1]
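
As a quick numeric illustration of why minimizing L(λ) is sensible, the sketch below evaluates both sides of the bound on the previous page for an arbitrary weight vector: since 1[y F(x) ≤ 0] ≤ exp(−y F(x)) pointwise, the training error is never larger than the exponential loss. The dataset, hypotheses, and weights are made up for the example.

    import math

    # Made-up data, hypotheses, and weights, just to evaluate the two quantities.
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [+1, -1, +1, -1]
    H = [lambda x: +1 if x < 0.5 else -1,
         lambda x: +1 if x < 2.5 else -1,
         lambda x: -1 if x < 1.5 else +1]
    lam = [0.7, 0.7, 0.4]                     # an arbitrary lambda in R^N

    m, N = len(xs), len(H)
    F = lambda x: sum(lam[j] * H[j](x) for j in range(N))

    train_error = sum(1 for i in range(m) if ys[i] * F(xs[i]) <= 0) / m
    exp_loss    = sum(math.exp(-ys[i] * F(xs[i])) for i in range(m)) / m

    print("misclassification error:", train_error)
    print("exponential loss L(lambda):", exp_loss)
    assert train_error <= exp_loss + 1e-12    # error <= exponential loss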

Page 11: The Rate of Convergence of  AdaBoost

Exponential loss:

L(λ) = (1/m) ∑_{i=1}^m exp( − ∑_{j=1}^N λ_j y_i h_j(x_i) )

[Figure: surface plot of the exponential loss over two coordinates, λ1 and λ2]

Examples: {(xi , yi )}i=1,...,m , with each (xi,yi)∈X ×{−1,1}

Hypotheses: H ={h1,...,hN} , where hj :X → [−1,1]


Page 13: The Rate of Convergence of  AdaBoost

Known:

• AdaBoost converges asymptotically to the minimum of the exponential loss (Collins et al., 2002; Zhang and Yu, 2005)

• Convergence rates under strong assumptions:
  • the “weak learning” assumption holds, i.e., hypotheses are better than random guessing (Freund and Schapire, 1997; Schapire and Singer, 1999)
  • a finite minimizer is assumed to exist (Rätsch et al., 2002, and many classic results)

• Conjectured by Schapire (2010) that fast convergence rates hold without any assumptions.

• Convergence rate is relevant for consistency of AdaBoost (Bartlett and Traskin, 2007).

Page 14: The Rate of Convergence of  AdaBoost

Known:

• AdaBoost converges asymptotically to the minimum of the exponential loss (Collins et al., 2002; Zhang and Yu, 2005)

• Convergence rates under assumptions:
  • the “weak learning” assumption holds, i.e., hypotheses are better than random guessing (Freund and Schapire, 1997; Schapire and Singer, 1999)
  • a finite minimizer is assumed to exist (Rätsch et al., 2002, and many classic results)

• Conjectured by Schapire (2010) that fast convergence rates hold without any assumptions.

• Convergence rate is relevant for consistency of AdaBoost (Bartlett and Traskin, 2007).



Page 17: The Rate of Convergence of  AdaBoost

Outline

• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

• Convergence Rate 2: Convergence to optimal loss. “Can we get within ε of an optimal solution?”

Page 18: The Rate of Convergence of  AdaBoost

Main Messages

• Usual approaches assume a finite minimizer
  – Much more challenging not to assume this!

• Separated two different modes of analysis
  – comparison to a reference, comparison to the optimal
  – different rates of convergence are possible in each

• Analyses of convergence rates often ignore the “constants”
  – we show they can be extremely large in the worst case

Page 19: The Rate of Convergence of  AdaBoost

• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

• Convergence Rate 2: Convergence to optimal loss. “Can we get within ε of an optimal solution?”

Based on a conjecture that says...

Page 20: The Rate of Convergence of  AdaBoost

"At iteration t, L(λ t) wiλλ be at m ost Ú m ore than that of any param eter vector of λ1-norm bounded by B

in a num ber of rounds that is at m ost a poλynom iaλinλogN,m , B, and 1/Ú."

Page 21: The Rate of Convergence of  AdaBoost

[Figure: an ℓ1-ball of radius B containing a reference vector λ*]

“At iteration t, L(λ_t) will be at most ε more than that of any parameter vector of ℓ1-norm bounded by B, in a number of rounds that is at most a polynomial in log N, m, B, and 1/ε.”

Page 22: The Rate of Convergence of  AdaBoost

[Figure: an ℓ1-ball of radius B containing a reference vector λ* with loss L(λ*)]

“At iteration t, L(λ_t) will be at most ε more than that of any parameter vector of ℓ1-norm bounded by B, in a number of rounds that is at most a polynomial in log N, m, B, and 1/ε.”

Page 23: The Rate of Convergence of  AdaBoost

[Figure: an ℓ1-ball of radius B containing a reference vector λ* with loss L(λ*); the current iterate λ_t has loss L(λ_t)]

“At iteration t, L(λ_t) will be at most ε more than that of any parameter vector of ℓ1-norm bounded by B, in a number of rounds that is at most a polynomial in log N, m, B, and 1/ε.”

Page 24: The Rate of Convergence of  AdaBoost

[Figure: an ℓ1-ball of radius B containing a reference vector λ* with loss L(λ*); the current iterate λ_t has loss L(λ_t) within ε of L(λ*)]

“At iteration t, L(λ_t) will be at most ε more than that of any parameter vector of ℓ1-norm bounded by B, in a number of rounds that is at most a polynomial in log N, m, B, and 1/ε.”

Page 25: The Rate of Convergence of  AdaBoost

[Figure: an ℓ1-ball of radius B containing a reference vector λ* with loss L(λ*); the current iterate λ_t has loss L(λ_t) within ε of L(λ*)]

This happens at:

t ≤ poly(log N, m, B, 1/ε)



Page 28: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖_1^6 ε^−5 rounds.

Page 29: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖_1^6 ε^−5 rounds.

(This is poly(log N, m, B, 1/ε).)

Page 30: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖_1^6 ε^−5 rounds.

(This is poly(log N, m, B, 1/ε).)

The best known previous result is that it takes at most order e^(1/ε^2) rounds (Bickel et al.).
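
To get a feel for the gap between the two bounds, the short calculation below compares the logarithms of the new bound 13 B^6 ε^−5 and the previous e^(1/ε^2) bound for a fixed, made-up value of B = ‖λ*‖_1 and a range of ε; the specific numbers are only illustrative.

    import math

    B = 10.0                                   # hypothetical value of ||lambda*||_1
    for eps in (0.5, 0.1, 0.05, 0.01):
        log_new = math.log(13.0) + 6 * math.log(B) - 5 * math.log(eps)   # log of 13 B^6 eps^-5
        log_old = 1.0 / eps**2                                           # log of e^(1/eps^2)
        print(f"eps={eps:>5}: log(13 B^6 eps^-5) = {log_new:8.1f},  log(e^(1/eps^2)) = {log_old:10.1f}")

For small ε the polynomial bound is astronomically smaller, although for large ε (and large B) the exponential bound can still be the smaller of the two.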

Page 31: The Rate of Convergence of  AdaBoost

Intuition behind proof of Theorem 1

• Old fact: if AdaBoost takes a large step, it makes a lot of progress:

L(λ_t) ≤ L(λ_{t−1}) √(1 − d_t^2)

d_t is called the “edge.” It is related to the step size.

[Figure: surface plot of the exponential loss over two coordinates, λ1 and λ2]
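
The sketch below runs one AdaBoost round on a made-up dataset with ±1-valued hypotheses and checks the old fact numerically: with the usual step size α = ½ ln((1+d)/(1−d)), the exponential loss drops by exactly the factor √(1 − d^2) when hypotheses are ±1-valued (and by at most that factor in general). Data and hypotheses are hypothetical.

    import math

    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [+1, -1, +1, -1]
    H = [lambda x: +1 if x < 0.5 else -1,      # a small, made-up pool of +/-1 hypotheses
         lambda x: +1 if x < 2.5 else -1,
         lambda x: -1 if x < 1.5 else +1]
    m, N = len(xs), len(H)
    lam = [0.3, 0.0, 0.1]                      # current iterate lambda_{t-1} (arbitrary)

    def loss(lam):
        return sum(math.exp(-ys[i] * sum(lam[j] * H[j](xs[i]) for j in range(N)))
                   for i in range(m)) / m

    L_prev = loss(lam)

    # Distribution D_t(i) proportional to the per-example losses at lambda_{t-1}.
    w = [math.exp(-ys[i] * sum(lam[j] * H[j](xs[i]) for j in range(N))) for i in range(m)]
    D = [wi / sum(w) for wi in w]

    # Edge of each hypothesis under D_t; pick the largest in absolute value.
    edges = [sum(D[i] * ys[i] * H[j](xs[i]) for i in range(m)) for j in range(N)]
    j_t = max(range(N), key=lambda j: abs(edges[j]))
    d_t = edges[j_t]

    # AdaBoost's step and the resulting loss.
    alpha_t = 0.5 * math.log((1 + d_t) / (1 - d_t))
    lam[j_t] += alpha_t
    L_new = loss(lam)

    print("edge d_t:", d_t)
    print("L(lambda_t) =", L_new, " bound:", L_prev * math.sqrt(1 - d_t**2))
    assert L_new <= L_prev * math.sqrt(1 - d_t**2) + 1e-12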

Page 32: The Rate of Convergence of  AdaBoost

[Figure: the ℓ1-ball of radius B with the reference λ* and the current iterate λ_t, and their losses L(λ*) and L(λ_t)]

R_t := ln L(λ_t) − ln L(λ*)      (measures progress)

S_t := inf_λ { ‖λ − λ_t‖_1 : L(λ) ≤ L(λ*) }      (measures distance)

Page 33: The Rate of Convergence of  AdaBoost

Intuition behind proof of Theorem 1

• Old Fact: L(λ_t) ≤ L(λ_{t−1}) √(1 − d_t^2). If the d_t's are large, we make progress.

• First lemma says: if S_t is small, then d_t is large.

Page 34: The Rate of Convergence of  AdaBoost

Intuition behind proof of Theorem 1

• Old Fact: L(λ_t) ≤ L(λ_{t−1}) √(1 − d_t^2). If the d_t's are large, we make progress.

• First lemma says: if S_t is small, then d_t is large.

• Second lemma says: S_t remains small (unless R_t is already small).

• Combining: the d_t's are large at each t (unless R_t is already small); in each round t, d_t ≥ R_{t−1}^3 / B^3.
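
Taking logs in the old fact gives R_t ≤ R_{t−1} + ½ ln(1 − d_t^2) ≤ R_{t−1} − d_t^2 / 2, and plugging in d_t ≥ R_{t−1}^3 / B^3 yields the recurrence R_t ≤ R_{t−1} − R_{t−1}^6 / (2 B^6). The sketch below simply iterates this worst-case recurrence for made-up values of B, R_0 and ε, showing that R_t falls below ε after on the order of B^6 / ε^5 rounds, which is where the shape of Theorem 1's bound comes from (the exact constants here are not the paper's).

    # Worst-case recurrence implied by the two facts above:
    #   R_t <= R_{t-1} - R_{t-1}**6 / (2 * B**6)
    # Iterate it for illustrative values of B, R_0 and eps (all made up).
    B, R, eps = 1.0, 1.0, 0.1
    t = 0
    while R > eps:
        R -= R**6 / (2 * B**6)
        t += 1
    print("rounds until R_t <= eps:", t)
    print("B^6 / eps^5 for comparison:", B**6 / eps**5)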

Page 35: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖_1^6 ε^−5 rounds.

• Dependence on ‖λ*‖_1 is necessary for many datasets.

Lemma: There are simple datasets for which the number of rounds required to achieve loss L* is at least (roughly) the norm of the smallest solution achieving loss L*.

Page 36: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖_1^6 ε^−5 rounds.

• Dependence on ‖λ*‖_1 is necessary for many datasets.

Lemma: There are simple datasets for which the number of rounds required to achieve loss L* is at least

inf { ‖λ‖_1 : L(λ) ≤ L* } / (2 ln m)

Page 37: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖_1^6 ε^−5 rounds.

• Dependence on ‖λ*‖_1 is necessary for many datasets.

Lemma: There are simple datasets for which the norm of the smallest solution achieving loss L* is exponential in the number of examples.

Lemma: There are simple datasets for which the number of rounds required to achieve loss L* is at least

inf { ‖λ‖_1 : L(λ) ≤ L* } / (2 ln m)

Page 38: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖_1^6 ε^−5 rounds.

• Dependence on ‖λ*‖_1 is necessary for many datasets.

Lemma: There are simple datasets for which the number of rounds required to achieve loss L* is at least

inf { ‖λ‖_1 : L(λ) ≤ L* } / (2 ln m)

Lemma: There are simple datasets for which

inf { ‖λ‖_1 : L(λ) ≤ 2/m + ε }  ≥  (2^(m−2) − 1) ln(1/(3ε))
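
To see how quickly the right-hand side above grows, the snippet below evaluates (2^(m−2) − 1) · ln(1/(3ε)) for a few small values of m at a fixed ε; the choices of m and ε are arbitrary illustrations.

    import math

    eps = 0.01
    for m in (8, 16, 32, 64):
        bound = (2 ** (m - 2) - 1) * math.log(1.0 / (3 * eps))
        print(f"m = {m:>2}:  required ||lambda||_1 is at least about {bound:.3g}")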

Page 39: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖_1^6 ε^−5 rounds.

Page 40: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖_1^6 ε^−5 rounds.

Conjecture: AdaBoost achieves loss at most L(λ*) + ε in at most O(B^2 / ε) rounds.
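
For a sense of the gap between the proven bound and this conjecture, the snippet below evaluates 13 B^6 ε^−5 and B^2 / ε side by side for one made-up value of B and a few values of ε.

    B = 10.0                                  # hypothetical ||lambda*||_1
    for eps in (0.1, 0.01, 0.001):
        proven = 13 * B**6 / eps**5
        conjectured = B**2 / eps
        print(f"eps={eps}:  proven bound {proven:.3g},  conjectured bound {conjectured:.3g}")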

Page 41: The Rate of Convergence of  AdaBoost

[Plot: “Rate on a Simple Dataset (log scale).” x-axis: number of rounds (10 to 1e+05); y-axis: loss − (optimal loss), from 3e-06 to 3e-02.]

Page 42: The Rate of Convergence of  AdaBoost

Outline

• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

• Convergence Rate 2: Convergence to optimal loss. “Can we get within ε of an optimal solution?”

Page 43: The Rate of Convergence of  AdaBoost

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C / ε rounds, where C only depends on the data.

Page 44: The Rate of Convergence of  AdaBoost

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C / ε rounds, where C only depends on the data.

• Better dependence on ε than Theorem 1, actually optimal.

• Doesn't depend on the size of the best solution within a ball.

• Can't be used to prove the conjecture because in some cases C > 2^m. (Mostly it's much smaller.)

Page 45: The Rate of Convergence of  AdaBoost

• Main tool is the “decomposition lemma”
  – Says that examples fall into 2 categories:
    • the zero-loss set Z
    • the finite-margin set F
  – A similar approach was taken independently by Telgarsky (2011)

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C / ε rounds, where C only depends on the data.

Page 46: The Rate of Convergence of  AdaBoost

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C / ε rounds, where C only depends on the data.

[Figure: a scatter of positive (+) and negative (−) training examples]

Page 47: The Rate of Convergence of  AdaBoost

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C / ε rounds, where C only depends on the data.

[Figure: the scatter of positive and negative examples, with the finite-margin set F marked]

Page 48: The Rate of Convergence of  AdaBoost

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C / ε rounds, where C only depends on the data.

[Figure: the scatter of positive and negative examples, with the zero-loss set Z marked]

Page 49: The Rate of Convergence of  AdaBoost

Decomposition Lemma

For any dataset, there exists a partition of the training examples into Z and F such that the following hold simultaneously:

1.) For some γ > 0, there exists a vector η⁺ with ‖η⁺‖_1 = 1 such that:

∀ i ∈ Z:  ∑_j η_j⁺ y_i h_j(x_i) ≥ γ      (margins are at least γ on Z)

∀ i ∈ F:  ∑_j η_j⁺ y_i h_j(x_i) = 0      (examples in F have zero margin)

2.) The optimal loss considering only the examples in F is achieved by some finite vector η*.
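
The decomposition is easiest to see on a tiny made-up example. Below, three examples and two ±1-valued hypotheses are summarized by their margin entries y_i h_j(x_i); the unit-norm vector η⁺ = (1/2, 1/2) gives margin γ = 1 on the first example (the zero-loss set Z) and margin exactly 0 on the other two (the finite-margin set F), and on F alone the loss is minimized by the finite vector η* = (0, 0). Everything here is a constructed illustration, not a dataset from the paper.

    import math

    # Rows: examples i.  Columns: hypotheses j.  Entry: y_i * h_j(x_i).  (Toy numbers.)
    M = [[+1, +1],    # example 1
         [+1, -1],    # example 2
         [-1, +1]]    # example 3

    eta_plus = [0.5, 0.5]          # candidate vector with ||eta_plus||_1 = 1
    margins = [sum(e * row[j] for j, e in enumerate(eta_plus)) for row in M]
    print("margins under eta_plus:", margins)      # [1.0, 0.0, 0.0]

    Z = [i for i, g in enumerate(margins) if g > 0]     # zero-loss set
    F = [i for i, g in enumerate(margins) if g == 0]    # finite-margin set
    print("Z =", Z, " F =", F)

    # Condition 2: on F alone, the loss is minimized by a finite vector (here eta_star = 0),
    # since any move with eta_1 != eta_2 increases one of the two opposite-margin terms.
    def loss_on(indices, lam):
        return sum(math.exp(-sum(l * M[i][j] for j, l in enumerate(lam))) for i in indices) / len(indices)

    eta_star = [0.0, 0.0]
    print("loss on F at eta_star:", loss_on(F, eta_star))           # 1.0
    print("loss on F at a nearby point:", loss_on(F, [0.3, -0.2]))  # larger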

Page 50: The Rate of Convergence of  AdaBoost

[Figure: the scatter of positive and negative examples with the vector η⁺ drawn in and margins of γ marked on either side]


Page 52: The Rate of Convergence of  AdaBoost

[Figure: the scatter of positive and negative examples, with the finite-margin set F marked]

Page 53: The Rate of Convergence of  AdaBoost

[Figure: only the examples of the finite-margin set F]

Page 54: The Rate of Convergence of  AdaBoost

[Figure: only the examples of the finite-margin set F, together with the finite combination η* achieving the optimal loss on F]


Page 56: The Rate of Convergence of  AdaBoost

• We provide a conjecture about the dependence on m.

Lemma: There are simple datasets for which the constant C is doubly exponential, at least 2^(Ω(2^m / m)).

Conjecture: If hypotheses are {−1, 0, +1}-valued, AdaBoost converges to within ε of the optimal loss within 2^(O(m ln m)) ε^(−1+o(1)) rounds.

• This would give optimal dependence on m and ε simultaneously.

Page 57: The Rate of Convergence of  AdaBoost

To summarize

• Two rate bounds: one depends on the size of the best solution within a ball and has dependence ε^−5.

• The other bound is C / ε, but C can be doubly exponential in m.

• Many lower bounds and conjectures in the paper.

Page 58: The Rate of Convergence of  AdaBoost


Thank you

Page 59: The Rate of Convergence of  AdaBoost

Intuition behind proof of Theorem 2

• Old Fact: L(λ_t) ≤ L(λ_{t−1}) √(1 − d_t^2). If the d_t's are large, we make progress.

• First lemma says: the d_t's are large whenever the loss on Z is large.

• Second lemma says: the d_t's are large whenever the loss on F is large. This translates into: the d_t's are large whenever the loss on Z is small.
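
Continuing the toy example from the decomposition lemma, the sketch below moves along the direction η⁺ on top of a minimizer for F and tracks the two pieces of the loss separately: the contribution from Z decays like e^(−cγ), while the contribution from F stays at its optimal value, so the total loss approaches the optimum from above. As before, the numbers are a constructed illustration.

    import math

    # Same toy margin matrix as before: rows are examples, entries are y_i * h_j(x_i).
    M = [[+1, +1],    # the single example in Z
         [+1, -1],    # examples in F ...
         [-1, +1]]
    Z, F = [0], [1, 2]
    eta_plus = [0.5, 0.5]                     # unit-norm direction with margin 1 on Z, 0 on F

    def contrib(indices, lam):
        # Sum (not average) of the exponential-loss terms of the given examples.
        return sum(math.exp(-sum(l * M[i][j] for j, l in enumerate(lam))) for i in indices)

    m = len(M)
    for c in (0, 2, 4, 8):
        lam = [c * e for e in eta_plus]       # lambda = c * eta_plus (eta_star = 0 on F)
        loss_Z = contrib(Z, lam) / m
        loss_F = contrib(F, lam) / m
        print(f"c={c}:  loss from Z = {loss_Z:.5f},  loss from F = {loss_F:.5f},  total = {loss_Z + loss_F:.5f}")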


Page 61: The Rate of Convergence of  AdaBoost

• see notes

