
Theory & Applications of Online Learning

Shai Shalev-Shwartz, Yoram Singer

ICML, July 5th 2008

Motivation - Spam Filtering

For t = 1, 2, …, T
  Receive an email
  Expert advice: apply d spam filters to get $x_t \in \{+1,-1\}^d$
  Predict $\hat y_t \in \{+1,-1\}$
  Receive the true label $y_t \in \{+1,-1\}$
  Suffer loss $\ell(\hat y_t, y_t)$


Goal – Low Regret

We don’t know in advance the best performing expert

We’d like to find the best expert in an online manner

We’d like to make as few filtering errors as possible

This setting is called "regret analysis". Our goal:

$\sum_{t=1}^{T} \ell(\hat y_t, y_t) - \min_i \sum_{t=1}^{T} \ell(x_{t,i}, y_t) \le o(T)$

Regret Analysis

Low regret means that we do not lose much from not knowing future events

We can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight

No statistical assumptions

We can also compete with changing environment

Why Online ?

In many cases, data arrives sequentially while predictions are required on-the-fly

Applicable also in adversarial and competitive environments (e.g. spam filtering, stock market)

Can adapt to a changing environment

Simple algorithms

Theoretical guarantees

Online-to-batch conversions, generalization properties

Outline

Tutorial's goals: provide design and analysis tools for online algorithms

Part I: What prediction tasks are possible
• Regression with squared-loss; classification with 0-1 loss
• Convexity is a key property
• Online convex optimization
• Coping with non-convex problems (experts' advice, classification, structured output) via convexification and randomization

Part II: An algorithmic framework for online convex optimization
• Update type: additive vs. multiplicative
• Update complexity: gradient based vs. follow-the-leader

Part III: Derived algorithms
• Perceptrons (aggressive, conservative)
• Passive-Aggressive algorithms for the hinge-loss
• Follow the regularized leader (online SVM)
• Prediction with expert advice using multiplicative updates
• Online logistic regression with multiplicative updates

Part IV: Application - Mail filtering
• Algorithms derived from the framework for online convex optimization: additive & multiplicative dual steppers; aggressive update schemes (instantaneous dual maximizers)
• Mail filtering by online multiclass categorization

Part V: Not covered due to lack of time
• Improved algorithms and regret bounds: self-tuning; logarithmic regret for strongly convex losses
• Other notions of regret: internal regret, drifting hypotheses
• Partial feedback: bandit problems, reinforcement learning
• Online-to-batch conversions

Problem I: Regression

Task: guess the next element of a real-valued sequence

What could constitute a good prediction strategy ?

Online Regression

For t = 1, 2, …
  Predict a real number $\hat y_t \in \mathbb{R}$
  Receive $y_t \in \mathbb{R}$
  Suffer loss $(\hat y_t - y_t)^2$

Regression (cont.) – Follow-The-Leader

Predict: $\hat y_t = \frac{1}{t-1}\sum_{i=1}^{t-1} y_i$

Similar to Maximum Likelihood
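The FTL predictor above can be sketched in a few lines. This is a minimal illustration, not from the slides; the function name `ftl_regression` is mine, and the first-round prediction (before any target is seen) is arbitrarily set to 0.

```python
# Follow-The-Leader for online squared-loss regression: a minimal sketch.
# The prediction is the running mean of past targets, as on the slide.

def ftl_regression(targets):
    """Yield (prediction, squared loss) pairs for the FTL strategy."""
    total, count = 0.0, 0
    for y in targets:
        y_hat = total / count if count > 0 else 0.0  # arbitrary first guess
        yield y_hat, (y_hat - y) ** 2
        total += y
        count += 1

preds = [p for p, _ in ftl_regression([1.0, 0.0, 1.0, 0.0])]
# preds[1] is the mean of the first target, preds[2] the mean of the first two
```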

Regret Analysis

The FTL predictor satisfies:

$\forall y^\star,\quad \sum_{t=1}^{T} (\hat y_t - y_t)^2 - \sum_{t=1}^{T} (y^\star - y_t)^2 \le O(\log T)$

FTL is minimax optimal (outside scope)

Regression (cont.)

Proof Sketch

Be-The-Leader: $\tilde y_t = \frac{1}{t}\sum_{i=1}^{t} y_i$

The regret of BTL is at most 0 (elementary)

FTL is close enough to BTL (simple algebra):

$(\hat y_t - y_t)^2 - (\tilde y_t - y_t)^2 \le O\!\left(\tfrac{1}{t}\right)$

Summing over t (a harmonic series) completes the proof

Problem II: Classification

Guess the next element of a binary sequence

Online Prediction

For t = 1, 2, …
  Predict a binary number $\hat y_t \in \{+1,-1\}$
  Receive $y_t \in \{+1,-1\}$
  Suffer the 0-1 loss

$\ell(\hat y_t, y_t) = \begin{cases} 1 & \text{if } \hat y_t \ne y_t \\ 0 & \text{otherwise} \end{cases}$

Classification (cont.)

No algorithm can guarantee low regret !

Proof Sketch

The adversary can force the cumulative loss of the learner to be as large as T by using $y_t = -\hat y_t$

The loss of the constant prediction $y^\star = \mathrm{sign}\!\big(\sum_t y_t\big)$ is at most $T/2$

Hence the regret is at least $T/2$

Intermediate Conclusion

Two similar problems

Predict the next real-valued element with squared loss

Predict the next binary-valued element with 0-1 loss

The size of the decision set does not matter!

In the first problem, loss is convex and decision set is convex

Is convexity sufficient for predictability ?

Online Convex Optimization

Online Convex Optimization

For t = 1, 2, …, T
  Learner picks $w_t \in S$
  Environment responds with a convex loss $\ell_t : S \to \mathbb{R}$
  Learner suffers loss $\ell_t(w_t)$

An abstract game between learner and environment:
  The game board is a convex set S
  The learner plays with vectors in S
  The environment plays with convex functions over S
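The abstract game above is just a loop. The sketch below instantiates it (my choice, not the slides'): the environment plays squared losses $\ell_t(w) = (w - y_t)^2$ over $S = \mathbb{R}$, and the learner responds with online gradient descent with a fixed step, one concrete strategy among the many the protocol allows.

```python
# The online convex optimization game as a loop. Environment: squared
# losses l_t(w) = (w - y_t)^2 over S = R. Learner: online gradient
# descent with fixed step eta (one possible strategy; the protocol
# itself fixes only the interaction pattern).

def oco_game(targets, eta=0.1):
    w, cumulative = 0.0, 0.0
    for y in targets:
        loss = (w - y) ** 2        # environment reveals a convex loss
        cumulative += loss          # learner suffers l_t(w_t)
        w -= eta * 2 * (w - y)      # learner picks the next point in S
    return cumulative

total = oco_game([1.0, 1.0, 1.0])
# the per-round loss shrinks as w moves toward the repeated target
```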

Online Convex Optimization -- Example I

Regression

$S = \mathbb{R}$

Learner predicts the element $\hat y_t = w_t \in S$

A true target $y_t \in \mathbb{R}$ defines a loss function $\ell_t(w) = (w - y_t)^2$

Online Convex Optimization -- Example II

Regression with Experts Advice

$S = \{w \in \mathbb{R}^d : w_i \ge 0,\ \|w\|_1 = 1\}$

Learner picks $w_t \in S$

Learner predicts $\hat y_t = \langle w_t, x_t\rangle$

A pair $(x_t, y_t)$ defines a loss function over S: $\ell_t(w) = (\langle w, x_t\rangle - y_t)^2$

Coping with Non-convex Loss Functions

Method I: Convexification

Find a surrogate convex loss function

Mistake bound model

Method II: Randomization

Allow randomized predictions

Analyze the expected regret

Loss in expectation is convex

Convexification and Mistake Bound

Non-convex loss: the mistake indicator, a.k.a. the 0-1 loss

$\ell_{0\text{-}1}(\hat y_t, y_t) = \begin{cases} 1 & \text{if } \hat y_t \ne y_t \\ 0 & \text{otherwise} \end{cases}$

Recall that regret can be as large as T/2

Surrogate loss function: the hinge-loss

$\ell_{hi}(w, (x_t, y_t)) = [1 - y_t\langle w, x_t\rangle]_+$ where $[a]_+ = \max\{a, 0\}$

By construction, $\ell_{0\text{-}1}(\hat y_t, y_t) \le \ell_{hi}(w_t, (x_t, y_t))$

[Figure: $\ell_{0\text{-}1}$ and $\ell_{hi}$ plotted as functions of the margin $y_t\langle w_t, x_t\rangle$; the hinge-loss upper bounds the 0-1 loss everywhere.]
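The surrogate property can be checked numerically: for every margin value the hinge-loss dominates the 0-1 loss. A tiny sweep, with the convention (my assumption) that a zero margin counts as a mistake:

```python
# The convexification step in one check: the hinge-loss upper bounds
# the 0-1 loss for every value of the margin y<w,x>.

def loss01(margin):
    # 1 iff the prediction is wrong; margin <= 0 treated as a mistake
    return 1.0 if margin <= 0 else 0.0

def hinge(margin):
    # [1 - y<w,x>]_+
    return max(0.0, 1.0 - margin)

# sweep margins in [-5, 5] and confirm the domination
assert all(loss01(m / 10) <= hinge(m / 10) for m in range(-50, 51))
```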

Randomization and Expected Regret

Example – Classification with Expert Advice

Learner receives expert advice $x_t \in [0,1]^d$

Should predict $\hat y_t \in \{+1,-1\}$

Receive $y_t \in \{+1,-1\}$

Suffer the 0-1 loss $\ell_{0\text{-}1}(\hat y_t, y_t)$

Convexify by randomization:

Learner picks $w_t$ in the d-dim probability simplex

Predict $\hat y_t = 1$ with probability $\langle w_t, x_t\rangle$

The expected 0-1 loss is convex w.r.t. $w_t$:

$\mathbb{E}[\hat y_t \ne y_t] = \frac{y_t + 1}{2} - y_t\langle w_t, x_t\rangle$
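The closed form above is easy to verify by direct case analysis over both labels. A minimal check (the helper name `expected_mistake` is mine):

```python
# Convexification by randomization: with P(y_hat = +1) = p = <w, x>,
# the expected 0-1 loss equals (y+1)/2 - y*p, which is linear (hence
# convex) in w. Verified for both labels over several values of p.

def expected_mistake(p_plus, y):
    # P(mistake) = P(y_hat = -1) when y = +1, else P(y_hat = +1)
    return 1.0 - p_plus if y == 1 else p_plus

for p in [0.0, 0.25, 0.5, 0.9]:
    for y in (+1, -1):
        closed_form = (y + 1) / 2 - y * p
        assert abs(expected_mistake(p, y) - closed_form) < 1e-12
```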

Part II: An Algorithmic Framework for Online Convex Optimization

Online Learning with the Perceptron

Perceptron (Rosenblatt, 1958)

"Get a PhD in 3 month! A better job, more income and a better life can all be yours. No books to buy, no classes to go …" — Spam?

• emails encoded as vectors
• hypothesis - a linear separator (spam / not spam)
• prediction: "No spam"; feedback: "Spam!"
• update: $w \leftarrow w + y\,x$
• update if $y\langle w, x\rangle < 1$ (aggressive perceptron)
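The update rule above fits in a few lines. A minimal sketch (names and toy data are mine): emails are feature vectors, the hypothesis is a linear separator, and the aggressive variant updates whenever the margin falls below 1.

```python
# The aggressive Perceptron from the slide: update w <- w + y*x
# whenever the margin y<w,x> is below 1.

def aggressive_perceptron(examples):
    """examples: list of (x, y) with x a list of floats, y in {+1,-1}."""
    w = [0.0] * len(examples[0][0])
    for x, y in examples:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1:  # aggressive: update on any insufficient margin
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# toy "spam" data: the first coordinate correlates with the label
data = [([1.0, 0.0], 1), ([-1.0, 0.0], -1), ([1.0, 1.0], 1)]
w = aggressive_perceptron(data)
```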

Regret

[Animation: the learner's linear separator (spam / not spam) evolves over the rounds; the learner accumulates loss as the environment reveals examples, and its cumulative loss is compared with the best loss in hindsight of a fixed separator.]

More Stringent Form of Regret

Original regret goal:

$\sum_{t=1}^{T} \ell_{hi}(w_t, (x_t, y_t)) \le \min_{w:\|w\|\le D} \sum_{t=1}^{T} \ell_{hi}(w, (x_t, y_t)) + o(T)$

A stronger requirement:

$\sum_{t=1}^{T} \ell_t(w_t) \le \min_w\; \frac{\lambda}{2}\|w\|^2 + \sum_{t=1}^{T} \ell_{hi}(w, (x_t, y_t))$

From Regret to SVM

Rewriting $\ell_{hi}(\cdot)$:

$\xi_t = \ell_{hi}(w_t, (x_t, y_t)) \;\Leftrightarrow\; \xi_t \ge 0 \,\wedge\, \xi_t \ge 1 - y_t\langle w_t, x_t\rangle$

The target regret

$\min_w\; \frac{\lambda}{2}\|w\|^2 + \sum_{t=1}^{T} \ell_{hi}(w, (x_t, y_t))$

can be rewritten as

$\min_{w,\,\xi\ge 0}\; \frac{\lambda}{2}\|w\|^2 + \sum_{t=1}^{T} \xi_t \quad \text{s.t. } \xi_t \ge 1 - y_t\langle w, x_t\rangle$

This is exactly the SVM objective.

Regret and Duality

The loss of the Perceptron should be smaller than the SVM objective

SVM duality:

Primal SVM: $P(w) = \frac{\lambda}{2}\|w\|^2 + \sum_{t=1}^{T} \ell_{hi}(w, (x_t, y_t))$

Constrained form:

$\min_{w,\,\xi\ge 0}\; \frac{\lambda}{2}\|w\|^2 + \sum_{t=1}^{T} \xi_t \quad \text{s.t. } 1 - y_t\langle w, x_t\rangle \le \xi_t$

Dual objective: $D(\alpha) = \sum_t \alpha_t - \frac{1}{2\lambda}\Big\|\sum_t \alpha_t y_t x_t\Big\|^2$

Properties of Dual Problem

$D(\alpha) = \sum_{t=1}^{T} \alpha_t - \frac{1}{2\lambda}\Big\|\sum_{t=1}^{T} \alpha_t y_t x_t\Big\|^2$

A dedicated variable for each online round

If $\alpha_t = \cdots = \alpha_T = 0$ then $D(\alpha)$ can be optimized without knowledge of $(x_t, y_t), \ldots, (x_T, y_T)$

$D(\alpha)$ can be optimized along the online process

Weak duality: $\max_{\alpha \in [0,1]^m} D(\alpha) \le \min_w P(w)$

Key analysis tool — core idea: online learning by incremental dual ascent

Online Learning by Dual Ascent

Abstract Dual Ascent Learner

Initialize $\alpha_1 = \cdots = \alpha_T = 0$

For t = 1, 2, …, T
  Construct $w_t$ from the dual variables (how?)
  Receive $(x_t, y_t)$ from the environment
  Inform the dual optimizer of the new example
  Obtain $\alpha_t$ from the dual optimizer

Online Learning by Dual Ascent

Lemma

Let $D_t$ be the dual value at round t

Let $\Delta_t = D_{t+1} - D_t$ be the dual increase

Assume that $\Delta_t \ge \ell(w_t, (x_t, y_t)) - \frac{1}{2\lambda}$

Then,

$\sum_{t=1}^{T} \ell(w_t, (x_t, y_t)) - \sum_{t=1}^{T} \ell(w^\star, (x_t, y_t)) \le O(\sqrt{T})$

Proof follows from weak duality

Proof by animation

[Animation: the primal objective $P(w^\star)$ stays fixed while the dual objective grows round by round: $D(\alpha_1, 0, \ldots, 0)$, $D(\alpha_1, \alpha_2, 0, \ldots, 0)$, …, $D(\alpha_1, \ldots, \alpha_t, 0, \ldots)$.]

Each round the dual increases by $\Delta_t \ge \ell(w_t, (x_t, y_t)) - \frac{1}{2\lambda}$

Summing, and using weak duality:

$\sum_t \ell_t(w_t) - \frac{T}{2\lambda} \le \sum_t \Delta_t = D(\alpha_1, \ldots, \alpha_T) \le P(w^\star)$

Interim Recap

To design an online algorithm:

Write an “SVM-like” problem

Switch to dual problem

Incrementally increase the dual

Remains to describe:

How to construct $w_t$ from the dual variables

Sufficient dual increase procedures

The scheme works only if we can guarantee a sufficient increase in the dual

$\alpha \to w$

At the optimum: $w^\star = \frac{1}{\lambda}\sum_t \alpha^\star_t y_t x_t$

Along the online learning process: $w_t = \frac{1}{\lambda}\sum_{i<t} \alpha_i y_i x_i$

Recursive form (weight update): $w_{t+1} = w_t + \frac{1}{\lambda}\alpha_t y_t x_t$

Note that the dual can be rewritten as

$D_t = \sum_{i<t} \alpha_i - \frac{\lambda}{2}\|w_t\|^2$

Sufficient Dual Increase – For the Aggressive Perceptron

$\alpha_t = \begin{cases} 1 & \text{if } 1 - y_t\langle w_t, x_t\rangle > 0 \\ 0 & \text{else} \end{cases}$

If $\alpha_t = 0$ then $0 = \Delta_t = \ell_t(w_t)$ and we're good

If $\alpha_t = 1$ then

$\Delta_t = \Big(\sum_{i\le t}\alpha_i - \frac{\lambda}{2}\big\|w_t + \tfrac{1}{\lambda} y_t x_t\big\|^2\Big) - \Big(\sum_{i<t}\alpha_i - \frac{\lambda}{2}\|w_t\|^2\Big) = 1 - y_t\langle w_t, x_t\rangle - \frac{\|x_t\|^2}{2\lambda} \ge \ell_t(w_t) - \frac{1}{2\lambda}$

(the last step uses $\|x_t\| \le 1$)

Thus, in both cases we're good

Three Directions for Generalization

Regularization

Aggressive dual updates: PA, FoReL

General loss functions

Thus far: specific settings

Next: a primal-dual apparatus for online learning

Background – Fenchel Duality

Fenchel Conjugate

The Fenchel conjugate of the function $f : S \to \mathbb{R}$ is

$f^\star(\theta) = \max_{w \in S}\; \langle w, \theta\rangle - f(w)$

[Figure: $f(w)$ with a supporting line of slope $\theta$; the negated intercept is $f^\star(\theta)$.]

Background – Fenchel Duality

$\max_\theta\; -f^\star(-\theta) - g^\star(\theta) \;\le\; \min_w\; f(w) + g(w)$

[Figure: tangents of slope $-\theta$ to $f$ and of slope $\theta$ to $g$; the gap between the intercepts $-f^\star(-\theta)$ and $-g^\star(\theta)$ lower bounds $\min_w f(w) + g(w)$.]

Regret and Duality

$\max_{\lambda_1,\ldots,\lambda_T}\; -f^\star\Big(-\sum_t \lambda_t\Big) - \sum_t \ell_t^\star(\lambda_t) \;\le\; \min_{w\in S}\; f(w) + \sum_{t=1}^{T} \ell_t(w)$

Decomposability of the dual:

A different dual variable is associated with each online round

Future loss functions do not affect the dual variables of current and past rounds

Therefore, the dual can be improved incrementally

To optimize $\lambda_1, \ldots, \lambda_t$, it is enough to know $\ell_1, \ldots, \ell_t$

Primal-Dual Online Prediction Strategy

Online Learning by Dual Ascent

Initialize $\lambda_1 = \cdots = \lambda_T = 0$

For t = 1, 2, …, T
  Construct $w_t$ from the dual variables
  Receive $\ell_t$
  Update the dual variables $\lambda_1, \ldots, \lambda_t$

Sufficient Dual Ascent => Low Regret

Lemma

Let $D_t$ be the dual value at round t.

Assume that $D_{t+1} - D_t \ge \ell_t(w_t) - \frac{a}{\sqrt{T}}$

Assume that $\max_{w \in S} f(w) \le a\sqrt{T}$

Then, the regret is bounded by $2a\sqrt{T}$

Proof follows directly from weak duality !

Proof Sketch of Low Regret

On one hand:

$D_{T+1} = \sum_{t=1}^{T} (D_{t+1} - D_t) \ge \sum_t \ell_t(w_t) - T\,\frac{a}{\sqrt{T}}$

On the other hand, from weak duality:

$D_{T+1} \le f(u) + \sum_t \ell_t(u) \le a\sqrt{T} + \sum_t \ell_t(u)$

Comparing the lower and upper bounds on $D_{T+1}$ (note $T\,\frac{a}{\sqrt{T}} = a\sqrt{T}$):

$\sum_t \ell_t(w_t) - a\sqrt{T} \le a\sqrt{T} + \sum_t \ell_t(u) \;\Rightarrow\; \sum_t \ell_t(w_t) \le \sum_t \ell_t(u) + 2a\sqrt{T}$

Strong Convexity => Sufficient Dual Increase

Definition – Strong Convexity

A function f is $\sigma$-strongly convex over S w.r.t. $\|\cdot\|$ if

$\forall u, v \in S,\quad \frac{f(u)+f(v)}{2} \ge f\!\Big(\frac{u+v}{2}\Big) + \frac{\sigma}{8}\|u - v\|^2$

[Figure: the midpoint gap $\frac{f(u)+f(v)}{2} - f\big(\frac{u+v}{2}\big)$ between the chord and the graph.]

Example: $f(w) = \frac{1}{2}\|w\|_2^2$ is 1-strongly convex w.r.t. $\|\cdot\|_2$

L-Lipschitz => Sufficient Dual Increase

Definition – Lipschitz

A function $\ell$ is L-Lipschitz w.r.t. $\|\cdot\|$ if

$\forall u, v \in S,\quad |\ell(u) - \ell(v)| \le L\,\|u - v\|$

Example: $\ell(w) = |y - \langle w, x\rangle|$ is L-Lipschitz w.r.t. $\|\cdot\|$ with $L = \|x\|_\star$ (the dual norm)

Strong Convexity => Sufficient Dual Increase

Sufficient Dual Increase for Gradient Descent

Assume:

f is $\sigma$-strongly convex w.r.t. $\|\cdot\|$

$\ell_t$ is convex and L-Lipschitz w.r.t. $\|\cdot\|_\star$

$w_t = \nabla f^\star\big(-\sum_{i<t} \lambda_i\big)$

Set $\lambda_t$ to be a subgradient of $\ell_t$ at $w_t$

Keep $\lambda_1, \ldots, \lambda_{t-1}$ intact

Then,

$D_{t+1} - D_t \ge \ell_t(w_t) - \frac{L^2}{2\sigma}$

General Algorithmic Framework

Online Learning by Dual Ascent

Choose a $\sigma$-strongly convex complexity function f

For t = 1, 2, …, T
  Predict $w_t = \nabla f^\star\big(-\sum_{i<t} \lambda_i\big)$
  Receive $\ell_t$
  Update the dual variables $\lambda_1, \ldots, \lambda_t$ s.t. $D_{t+1} - D_t \ge \ell_t(w_t) - \frac{L^2}{2\sigma}$ (e.g. by gradient descent)

Gradient descent on the (primal) loss results in a sufficient dual increase if f is strongly convex and the losses are L-Lipschitz (do not grow excessively fast)
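For the Euclidean choice $f(w) = \frac{\sigma}{2}\|w\|_2^2$ we have $\nabla f^\star(\theta) = \theta/\sigma$, so the prediction rule $w_t = \nabla f^\star(-\sum_{i<t}\lambda_i)$ reduces to online gradient descent with step $1/\sigma$. The sketch below shows that special case on hinge losses; the function name and toy data are mine, and this is one instantiation of the framework, not the framework in full generality.

```python
# The framework with f(w) = (sigma/2)||w||^2, where grad f*(theta)
# = theta/sigma: predicting from the negated sum of past subgradients
# is exactly online gradient descent with step 1/sigma. Losses here
# are hinge losses on (x, y) pairs -- one concrete choice.

def dual_ascent_hinge(examples, sigma):
    d = len(examples[0][0])
    grad_sum = [0.0] * d               # running sum of subgradients
    losses = []
    for x, y in examples:
        w = [-g / sigma for g in grad_sum]   # w_t = grad f*(-sum)
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        losses.append(max(0.0, 1.0 - margin))
        if margin < 1:                        # hinge subgradient: -y*x
            grad_sum = [g - y * xi for g, xi in zip(grad_sum, x)]
    return losses

losses = dual_ascent_hinge([([1.0], 1), ([1.0], 1), ([-1.0], -1)], sigma=1.0)
```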

General Regret Bound

Theorem – General Regret Bound

Assume:

f is $\sigma$-strongly convex w.r.t. $\|\cdot\|$

$\ell_t$ is convex and L-Lipschitz w.r.t. $\|\cdot\|_\star$

Then, the regret of all algorithms derived from the general framework is upper bounded by $f(w^\star) + \frac{T L^2}{2\sigma}$

Corollary – Euclidean norm regularization

If S is the Euclidean ball of radius W and $\ell_t$ is convex and L-Lipschitz w.r.t. $\|\cdot\|_2$

Set $f = \frac{\sigma}{2}\|w\|^2$ with $\sigma = \frac{\sqrt{T}\,L}{W}$

Then, the regret is upper bounded by $L\,W\sqrt{T}$

Corollary – Entropic regularization

If S is the d-dim probability simplex and $\ell_t$ is convex and L-Lipschitz w.r.t. $\|\cdot\|_\infty$

Set $f = \sigma \sum_i w_i \log(d\,w_i)$ with $\sigma = \frac{\sqrt{T}\,L}{\sqrt{\log(d)}}$

Then, the regret is upper bounded by $L\sqrt{\log(d)\,T}$

Generality and Related Work

Family of loss functions ($\ell_t$):

Online learning (Perceptron, linear regression, multiclass prediction, structured output, …)

Game theory (playing repeated games, correlated equilibrium)

Information theory (prediction of individual sequences)

Convex optimization (SGD, dual decomposition)

Generality and Related Work

Regularization function (f):

Online learning (Grove, Littlestone, Schuurmans; Kivinen, Warmuth; Gentile; Vovk)

Game theory (Hart and Mas-Colell)

Optimization (Nemirovsky, Yudin; Beck, Teboulle; Nesterov)

Unified frameworks (Cesa-Bianchi and Lugosi)

Generality and Related Work

Dual update schemes:

Only two extremes were studied:

Gradient update (naive update of a single dual variable)

Follow the leader (equivalent to full optimization)

Our analysis enables the use of the entire spectrum of possible updates

Part III: Derived Algorithms

Fenchel Dual of SVM

SVM primal: $\underbrace{\tfrac{\lambda}{2}\|w\|^2}_{f(w)} + \sum_{i=1}^{T} \underbrace{[1 - y_i\langle w, x_i\rangle]_+}_{\ell_i(w)}$

Fenchel dual of f(w): $f^\star(\theta) = \max_w \langle w, \theta\rangle - \frac{\lambda}{2}\|w\|^2$. Setting the gradient to zero: $\theta - \lambda w = 0 \Rightarrow w = \theta/\lambda$, so $f^\star(\theta) = \langle \theta/\lambda, \theta\rangle - \frac{\lambda}{2}\|\theta/\lambda\|^2 = \frac{1}{2\lambda}\|\theta\|^2$

Fenchel dual of the hinge-loss: $\ell^\star(\lambda) = \begin{cases} -\alpha & \text{if } \lambda = -\alpha y x \text{ with } \alpha \in [0,1] \\ \infty & \text{otherwise} \end{cases}$

The Fenchel dual of SVM:

$-f^\star\Big(-\sum_t \lambda_t\Big) - \sum_t \ell_t^\star(\lambda_t) = \sum_t \alpha_t - \frac{1}{2\lambda}\Big\|\sum_t \alpha_t y_t x_t\Big\|^2 \quad \text{s.t. } \alpha_i \in [0,1]$

Online SVM Revisited

Since $f^\star(v) = f^\star(-v)$ and $\nabla f^\star(v) = v/\lambda$,

$w_{t+1} = \nabla f^\star\Big(-\sum_{i\le t} \lambda_i\Big) = \frac{1}{\lambda}\sum_{i\le t} \alpha_i y_i x_i = w_t + \frac{1}{\lambda}\alpha_t y_t x_t$

We obtain a regret bound if $D_{t+1} - D_t \ge \ell_t(w_t) - \frac{L^2}{2\lambda}$, where L is the Lipschitz constant of $\ell_t$

We can use gradient descent (on the primal) to achieve a sufficient increase of the dual objective:

Gradient descent:
1. $\lambda_t = -y_t x_t$ (i.e. $\lambda_t = -\alpha_t y_t x_t$ with $\alpha_t = 1$) when $[1 - y_t\langle w_t, x_t\rangle]_+ > 0$
2. $\lambda_t = 0$ otherwise

Dual increase: $D_{t+1} - D_t \ge \ell_t(w_t) - \frac{1}{2\lambda}$

[Figure: the hinge-loss $\ell_{hi}$ as a function of $y_t\langle w_t, x_t\rangle$; the gradient is 0 on the flat part and $-x_t$ where the loss is positive.]

Can we potentially make faster progress in the dual while maintaining the regret bound?

Aggressive Dual Ascent Schemes (I)

Locally aggressive update:

1. Leave $\lambda_1, \ldots, \lambda_{t-1}$ intact from previous rounds
2. $\lambda_{t+1} = \cdots = \lambda_T = 0$: yet to observe future examples
3. Set $\lambda_t = -\alpha_t y_t x_t$ to maximize the increase in the dual

$\lambda_t = \arg\max_{\mu} D(\lambda_1, \ldots, \lambda_{t-1}, \mu, 0, \ldots, 0)$

Maximizing the "instantaneous" dual w.r.t. $\lambda_t$ is a scalar optimization problem that can often be solved analytically

The increase in the dual is at least as large as the increase due to gradient descent. The locally aggressive scheme achieves at least as good a regret bound as the aggressive Perceptron.

Aggressive Dual Ascent Schemes (II)

Follow the regularized leader (FoReL):

1. $\lambda_{t+1} = \cdots = \lambda_T = 0$ as before
2. Set $\lambda_1, \ldots, \lambda_t$ so as to maximize the resulting dual

The primal of the dual with $\lambda_{t+1} = \cdots = \lambda_T = 0$ is

$P_t(w) = f(w) + \sum_{i=1}^{t} \ell_i(w)$

Strong duality: $D(\lambda_1^\star, \ldots, \lambda_t^\star) = P_t(w^\star)$

Thus, on round t we set $w_t$ to be the optimum of an instantaneous primal problem: $w_t^\star = \arg\min_w f(w) + \sum_{i=1}^{t} \ell_i(w)$

The increase in the dual is at least as large as that of the locally aggressive update. FoReL is at least as good as scheme I.

Locally Aggressive Update for Online SVM

The Fenchel dual of SVM is $D(\alpha) = \sum_{t=1}^{T} \alpha_t - \frac{1}{2\lambda}\Big\|\sum_{t=1}^{T} \alpha_t y_t x_t\Big\|^2$

We obtain a regret bound if $D_{t+1} - D_t \ge \ell_t(w_t) - \frac{L^2}{2\lambda}$, where L is the Lipschitz constant of $\ell_t$

We can use gradient descent (on the primal) to achieve a sufficient increase of the dual objective:

Gradient descent: $\alpha_t = 1$ if $[1 - y_t\langle w_t, x_t\rangle]_+ > 0$

Dual increase: $D_{t+1} - D_t \ge \ell_t(w_t) - \frac{1}{2\lambda}$

Aggressively increase the dual by choosing $\alpha_t$ to maximize $\Delta_t = D_{t+1} - D_t$

Passive-Aggressive: Locally Aggressive Online SVM

Recall once more SVM's dual: $D(\alpha) = \sum_{t=1}^{T} \alpha_t - \frac{1}{2\lambda}\Big\|\sum_{t=1}^{T} \alpha_t y_t x_t\Big\|^2$

The change in the dual due to a change of $\alpha_t$:

$\Delta_t = \Big(\sum_{i\le t}\alpha_i - \frac{\lambda}{2}\big\|w_t + \tfrac{\alpha_t}{\lambda} y_t x_t\big\|^2\Big) - \Big(\sum_{i<t}\alpha_i - \frac{\lambda}{2}\|w_t\|^2\Big) = \alpha_t(1 - y_t\langle w_t, x_t\rangle) - \frac{\alpha_t^2\|x_t\|^2}{2\lambda}$

A quadratic in $\alpha_t$ with the boundary constraint $\alpha_t \in [0,1]$:

$\alpha_t^\star = \max\Big\{0,\; \min\Big\{1,\; \lambda\,\frac{1 - y_t\langle w_t, x_t\rangle}{\|x_t\|^2}\Big\}\Big\}$

Passive-Aggressive: if the margin is $\ge 1$ do nothing, otherwise use $\alpha_t^\star$
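The closed-form step above is short enough to code directly. A minimal sketch (the function name `pa_step` is mine): compute the margin, clip the optimal dual increment to [0, 1], and apply the corresponding primal update $w \leftarrow w + (\alpha_t/\lambda)\,y_t x_t$.

```python
# The Passive-Aggressive step from the slide:
# alpha = max(0, min(1, lam*(1 - y<w,x>)/||x||^2)), then
# w <- w + (alpha/lam) * y * x; do nothing when the margin is >= 1.

def pa_step(w, x, y, lam):
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin >= 1:                      # passive: margin already large
        return w
    sq_norm = sum(xi * xi for xi in x)
    alpha = max(0.0, min(1.0, lam * (1.0 - margin) / sq_norm))
    return [wi + (alpha / lam) * y * xi for wi, xi in zip(w, x)]

w = pa_step([0.0, 0.0], [1.0, 0.0], 1, lam=1.0)
# margin 0 gives alpha = 1, so w becomes [1.0, 0.0] (new margin 1)
```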

Online SVM by Following the Leader

Instantaneous primal: $P_t(w) = \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{t} \ell_i(w)$

Dual of $P_t(w)$:

$D(\alpha_1, \ldots, \alpha_t \,|\, \alpha_{t+1} = \cdots = 0) = \sum_{i=1}^{t} \alpha_i - \frac{1}{2\lambda}\Big\|\sum_{i=1}^{t} \alpha_i y_i x_i\Big\|^2$

Follow the regularized leader (FoReL):

$(\alpha_1^\star, \ldots, \alpha_t^\star) = \arg\max_{\alpha_1, \ldots, \alpha_t} D(\alpha_1, \ldots, \alpha_t \,|\, \alpha_{t+1} = \cdots = 0)$

From strong duality:

$w_t^\star = \arg\min_w P_t(w) \;\Rightarrow\; w_t^\star = \frac{1}{\lambda}\sum_{i=1}^{t} \alpha_i^\star y_i x_i$

The regret of FoReL is at least as good as PA's regret

Entropic Regularization

Motivation – Prediction with expert advice:

Learner receives a vector $x_t = (x_t^1, \ldots, x_t^d) \in [-1,1]^d$ of experts' advice

Learner needs to predict a target $\hat y_t \in \mathbb{R}$

Environment gives the correct target $y_t \in \mathbb{R}$

Learner suffers loss $|y_t - \hat y_t|$

Goal: predict almost as well as the best committee of experts:

$\sum_t |y_t - \hat y_t| - \sum_t |y_t - \langle w^\star, x_t\rangle| \le o(T)$

Modeling:

S is the d-dimensional probability simplex

Loss functions: $\ell_t(w) = |y_t - \langle w, x_t\rangle|$

Entropic Regularization (cont.)

Prediction with expert advice – regret with Euclidean regularization:

Consider working with $f(w) = \frac{\sigma}{2}\|w\|^2$

Regret is $L\,W\sqrt{T}$ where:

S is the probability simplex, thus $W = \max_{w \in S}\|w\| = 1$

The Lipschitz constant is $L = \max \|x\| = \sqrt{d}$

Regret is $O(\sqrt{d\,T})$

Is this the best we can do in terms of the dependency on d?

Entropic Regularization (cont.)

Prediction with expert advice – entropic regularization:

Consider working with

$f(w) = \sum_{j=1}^{d} w_j \log\frac{w_j}{1/d} = \log(d) + \sum_j w_j \log(w_j)$

Use $\|\cdot\|_1$, $\|\cdot\|_\infty$ for assessing convexity and Lipschitz constants

f is 1-strongly convex w.r.t. $\|\cdot\|_1$

Regret is $L\,W\sqrt{T}$ where:

S is the probability simplex, thus $W = \max_{w \in S} f(w) = \log(d)$

The Lipschitz constant of $\ell_t(w) = |y_t - \langle w, x_t\rangle|$ is $L = 1$ since $\|x_t\|_\infty \le 1$

Regret is $O(\sqrt{\log(d)\,T})$
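With the entropic f, the framework's prediction rule turns into a multiplicative (exponentiated-gradient style) update over the simplex. The sketch below shows one round for the absolute loss above; `eta` plays the role of $1/\sigma$, and the function name and toy data are mine.

```python
import math

# Entropic regularization in action: a multiplicative update over the
# probability simplex for the expert-advice loss l_t(w) = |y_t - <w, x_t>|.
# Each weight is scaled by exp(-eta * g_j), where g is a subgradient of
# the loss, then the weights are renormalized. eta stands in for 1/sigma.

def eg_round(w, x, y, eta):
    pred = sum(wj * xj for wj, xj in zip(w, x))
    sign = 1.0 if pred > y else (-1.0 if pred < y else 0.0)
    # subgradient of |y - <w, x>| w.r.t. w is sign(<w, x> - y) * x
    new = [wj * math.exp(-eta * sign * xj) for wj, xj in zip(w, x)]
    z = sum(new)                      # renormalize back onto the simplex
    return [wj / z for wj in new], abs(y - pred)

w, loss = eg_round([0.5, 0.5], [1.0, -1.0], 1.0, eta=0.1)
# the expert whose advice matches y = 1 gains weight: w[0] > w[1]
```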

Entropic Regularization – Multiplicative PA

Generalized hinge loss: $[\gamma - y_t\langle w, x_t\rangle]_+$

Use $f(w) = \log(d) + \sum_{j=1}^{d} w_j \log(w_j)$ (w in the probability simplex)

Fenchel dual of f: $f^\star(\theta) = \log\Big(\frac{1}{d}\sum_{j=1}^{d} e^{\theta_j}\Big)$

Primal problem:

$P(w) = \sigma\Big(\log(d) + \sum_{j=1}^{d} w_j \log(w_j)\Big) + \sum_{t=1}^{T} [\gamma - y_t\langle w, x_t\rangle]_+$

Define $\theta = \frac{1}{\sigma}\sum_i \alpha_i y_i x_i$ to write the dual problem:

$D(\alpha) = \gamma\sum_i \alpha_i - \sigma\log\Big(\frac{1}{d}\sum_{j=1}^{d} e^{\theta_j}\Big) \quad \text{s.t. } \alpha_i \in [0,1]$

PA Update with Entropic Regularization

Find $\alpha_t$ with the maximal local dual increase (closed form for the maximal increase if $x_t \in \{-1, 0, 1\}^d$):

$\alpha_t^\star = \arg\max_{\alpha \in [0,1]}\; \gamma\alpha - \sigma\log\Big(\frac{1}{d}\sum_{j=1}^{d} \exp\Big(\frac{1}{\sigma}\Big[\sum_{i=1}^{t-1} \alpha_i^\star y_i x_{i,j} + \alpha\, y_t x_{t,j}\Big]\Big)\Big)$

Define $\theta_t = \frac{1}{\sigma}\sum_{i=1}^{t} \alpha_i^\star y_i x_i$

Update $w_t = \nabla f^\star(\theta_t)$:

$w_{t,j} = \exp(\theta_{t,j})/Z_t$ where $Z_t = \sum_r \exp(\theta_{t,r})$

Use $w_{t,j} \propto \exp(\theta_{t,j})$ to obtain a multiplicative update:

$w_{t+1,j} = w_{t,j}\,\exp(\alpha_t^\star y_t x_{t,j}/\sigma)/Z_t$

Regret is $O(\sqrt{\log(d)\,T})$

Online Logistic Regression

Loss: log(1 + exp(−yt⟨w, xt⟩))

Primal problem

σ f(w) + Σ_{t=1}^T log(1 + exp(−yt⟨w, xt⟩))

Define θ = Σ_i (αi/σ) yi xi

Dual problem (for f(w) = DKL(w‖u))

D(α) = Σ_t H(αt) − σ log( (1/n) Σ_{j=1}^n e^{θj} )

Find αt with sufficient dual increase using binary search for αt ∈ [0, 1]

Update (Zt ensures wt+1 ∈ Δn)

wt+1,j = wt,j e^{(αt/σ) yt xt,j} / Zt

Same update form as multiplicative PA for SVM

Regret is O(√(log(d) T))
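The one-dimensional search for αt can be done with a simple bracketing scheme, since the per-round dual increase is concave in αt. A minimal sketch, using a generic ternary search on a toy concave objective (binary entropy plus a linear margin term, standing in for the actual dual increase; the function names are ours, not from the tutorial):

```python
import math

def maximize_concave_on_unit_interval(g, iters=60):
    """Ternary search for the maximizer of a concave function g on [0, 1].

    A stand-in for the slide's 'binary search for alpha_t in [0, 1]': each
    iteration shrinks the bracket by a third, so ~60 iterations give
    near machine-precision accuracy on a concave scalar objective.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

def H(a):
    """Binary entropy, the H(alpha_t) term in the dual above."""
    return 0.0 if a in (0.0, 1.0) else -a * math.log(a) - (1 - a) * math.log(1 - a)

# Toy per-round objective: entropy plus a linear term with coefficient 0.5.
alpha_star = maximize_concave_on_unit_interval(lambda a: H(a) + 0.5 * a)
```

For this toy objective the maximizer has the closed form 1/(1 + e^{−0.5}), which the search recovers numerically.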

Back to “Classical” Perceptron

Focus on rounds with mistakes (yt⟨wt, xt⟩ ≤ 0)

Assume norm of instances bounded by 1 (∀t: ‖xt‖ ≤ 1)

Recall Δt = αt − αt yt⟨wt, xt⟩ − αt² ‖xt‖² / (2σ)

From the assumptions, Δt ≥ αt − αt² / (2σ)

Two versions of the Perceptron:

Aggressive Perceptron: αt = 1 whenever ℓt(wt) > 0
Scaled version of classical Perceptron: αt = 1 only when ℓt(wt) ≥ 1

Upon an update, Δt ≥ 1 − 1/(2σ) for both versions

Achieves a Mistake Bound

[Figure: hinge loss as a function of yt⟨wt, xt⟩, marking where the aggressive update fires (loss > 0) vs. where the classical update fires (yt⟨wt, xt⟩ ≤ 0)]

Universality of Classical Perceptron

Resulting update – “scaled” Perceptron:

wt+1 = wt + (1/σ) yt xt   if yt⟨wt, xt⟩ ≤ 0
wt+1 = wt                 otherwise

Use weak duality to obtain  (1 − 1/(2σ)) M ≤ Σ_t Δt ≤ P(w⋆),  where M is the number of mistakes

Performance is the same regardless of the choice of σ

Choose σ so as to minimize the regret bound

M(T) ≤ Σ_{t=1}^T ℓhinge(⟨u, xt⟩, yt) + ‖u‖ √M(T)

Bound implies that

M(T) ≤ L⋆ + ‖u‖ √L⋆ + ‖u‖²   where   L⋆ = Σ_t ℓhinge(⟨u, xt⟩, yt)

Perceptron is an approximate universal online SVM
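The classical Perceptron referred to above fits in a few lines. A minimal sketch; the slide's scaled version multiplies the update by 1/σ, but since performance is the same for any σ we use the unit step here:

```python
import numpy as np

def perceptron(stream):
    """Classical Perceptron over a stream of (x, y) pairs, y in {-1, +1}.

    Updates only on rounds with a mistake (y * <w, x> <= 0).
    Returns the final weight vector and the number of mistakes.
    """
    w, mistakes = None, 0
    for x, y in stream:
        if w is None:
            w = np.zeros_like(x, dtype=float)
        if y * np.dot(w, x) <= 0:   # mistake round: additive update
            w = w + y * x
            mistakes += 1
    return w, mistakes
```

On separable data the mistake counter is bounded by the L⋆ = 0 case of the bound above, i.e. by ‖u‖² for any separating u with unit margins.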

Part IV: A Case Study: Online Email Categorization

The Task - Email Categorization

On each round:

Receive an email message

Recommend the user a folder to which this email should go

Pay a unit loss if user does not agree with prediction

Learn the “true” folder the email should go to

Goal

Minimize cumulative loss

Modeling (highlights)

Feature representation

Represent email as bag-of-words (d-dimensional binary vectors)

Multi-vector multiclass construction

The loss function

The 0-1 loss function is not convex. Use hinge-loss as surrogate

The regularization

Euclidean & Entropic

Dual update

Three dual update schemes

Modeling: Multiple Vector Construction

Represent the email as a bag-of-words vector, e.g. xt = [1, 0, 0, 1, 0, . . .]
(an email such as “... Brush the eggplant slices with olive oil and season with pepper. Toss the peppers with a little olive oil. Place both on the ...” turns on the coordinate for “oil”)

Place xt in the r-th block of a long vector, one block per folder:

φ(xt, r) = [0, . . . , 0, xt, 0, . . . , 0]

Prediction: ŷt = argmax_r ⟨w, φ(xt, r)⟩
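The block construction and the argmax prediction can be sketched directly; a minimal sketch with our own function names (`phi`, `predict` are not from the tutorial):

```python
import numpy as np

def phi(x, r, num_classes):
    """Multi-vector construction: embed instance x into the r-th block of a
    (num_classes * d)-dimensional vector, zeros elsewhere."""
    d = len(x)
    out = np.zeros(num_classes * d)
    out[r * d:(r + 1) * d] = x
    return out

def predict(w, x, num_classes):
    """Predict the folder whose block of w scores highest on x."""
    return max(range(num_classes),
               key=lambda r: np.dot(w, phi(x, r, num_classes)))
```

Because φ(x, r) is zero outside block r, ⟨w, φ(x, r)⟩ is just the inner product of x with the r-th block of w, so a single long w encodes one prototype per folder.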

Modeling: Loss Functions

[Figure: the scores ⟨w, φ(xt, yt)⟩ and ⟨w, φ(xt, r)⟩, with a margin requirement of 1 between the correct-folder score and the runner-up]

ℓt(w) = max_{r≠yt} [1 − ⟨w, φ(xt, yt) − φ(xt, r)⟩]+  ≥  ℓ0−1(ŷt, yt)
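The multiclass hinge surrogate above is easy to compute from the per-folder scores; a minimal sketch, where `scores[r]` stands for ⟨w, φ(x, r)⟩:

```python
def multiclass_hinge(scores, y):
    """Multiclass hinge: max over r != y of [1 - (scores[y] - scores[r])]_+.

    Upper-bounds the 0-1 loss of the argmax prediction: if any wrong folder
    scores within margin 1 of the correct one, the loss is positive.
    """
    margins = [1.0 - (scores[y] - s) for r, s in enumerate(scores) if r != y]
    return max(0.0, max(margins))
```

When the correct folder wins by a margin of at least 1 over every other folder the loss is 0; when the prediction is wrong the loss is at least 1, which is the convex-surrogate property the slide uses.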

Modeling: Regularization

Euclidean regularization: f(w) = (1/2) ‖w‖2²

Entropic regularization: f(w) = Σ_i wi log(d wi)

Expected Performance

Recall the regret bounds we derived:

Euclidean: (max_t ‖xt‖2) ‖w⋆‖2 √T

Entropic: (max_t ‖xt‖∞) ‖w⋆‖1 √(log(d) T)

Let s be the length of the longest email

Let r be the number of non-zero elements of w⋆

Then, Entropic / Euclidean ≈ √( r log(d) / s )

Modeling: Dual Update Schemes

DA1: Fixed sub-gradient

αt = vt ∈ ∂ℓt(wt)

DA2: Sub-gradient with optimal step size

αt = βt vt   where   βt = argmax_β D(α1, . . . , αt−1, β vt, 0, . . .)

DA3: Optimizing the current dual vector

αt = argmax_α D(α1, . . . , αt−1, α, 0, . . .)

Results: 3 dual updates

[Figure: relative error of DA1, DA2, DA3 under entropic (left) and Euclidean (right) regularization]

7 different users from the Enron data set

Results: 2 regularizations

[Figure: relative error of Euclidean vs. entropic regularization under the DA1 (left) and DA3 (right) update schemes]

Part V: Further directions (not covered)

Self-Tuned Parameters

Our algorithmic framework relies on the strong convexity parameter σ

The optimal choice of σ depends on unknown parameters such as the horizon T and the Lipschitz constants of ℓt(·)

It is possible to infer these parameters “on-the-fly”

Logarithmic Regret for Strongly Convex Losses

The dependence of the regret on T in the bounds we derived is O(√T)

This dependence is tight in a minimax sense

It is possible to obtain O(log(T)) regret if the loss function is strongly convex

Main idea: when the loss function is strongly convex, additional regularization is not required (the function f can be omitted), and diminishing step sizes ∝ 1/t suffice
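The diminishing-step idea can be sketched as plain online gradient descent with step size 1/(μt); a minimal sketch under the assumption that each round's loss is μ-strongly convex (the function and parameter names are ours):

```python
import numpy as np

def ogd_strongly_convex(grad, dim, T, mu=1.0):
    """Online gradient descent with step size 1/(mu * t), the schedule that
    yields O(log T) regret for mu-strongly convex losses.

    grad(w, t) returns a (sub)gradient of the round-t loss at w.
    No extra regularizer f is added; the curvature of the losses suffices.
    Returns the list of iterates w_0, ..., w_T.
    """
    w = np.zeros(dim)
    iterates = [w.copy()]
    for t in range(1, T + 1):
        w = w - grad(w, t) / (mu * t)   # eta_t = 1 / (mu t)
        iterates.append(w.copy())
    return iterates
```

For the toy strongly convex loss ℓt(w) = ½‖w − c‖², the very first step already lands on the minimizer c, and later (zero-gradient) steps leave it there.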

Online-to-Batch Conversions

Online algorithms can be used in batch settings

Main idea: if an online algorithm performs well on a sequence of i.i.d. examples, then an ensemble of online hypotheses should generalize well

Thus, we need to construct a single hypothesis from the sequence of online generated hypotheses

This process is called an “Online-to-Batch” conversion

Popular conversions: pick the averaged hypothesis, take a majority vote, use a validation set for choosing a good hypothesis, or simply pick a hypothesis at random from the ensemble
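The simplest of these conversions, averaging the iterates, can be sketched generically; a minimal sketch where `online_update` is any online update rule (the interface is our assumption, not the tutorial's):

```python
import numpy as np

def online_to_batch_average(online_update, data, dim):
    """Online-to-batch conversion by averaging.

    Runs the online learner once over the i.i.d. sample and returns the
    average of its iterates as a single batch hypothesis.
    online_update(w, x, y) must return the next weight vector.
    """
    w = np.zeros(dim)
    total = np.zeros(dim)
    for x, y in data:
        w = online_update(w, x, y)
        total += w
    return total / len(data)
```

Plugging in a Perceptron-style step gives a batch linear classifier whose expected risk is controlled by the online algorithm's regret divided by the sample size.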

References

There are numerous relevant papers by:

Littlestone, Warmuth, Kivinen, Vovk, Azoury, Freund, Schapire, Gentile, Auer, Grove, Schuurmans, Long, Smola, Williamson, Herbster, Kalai, Vempala, Hazan, ...

A comprehensive book on online prediction that also covers the connections to game theory and information theory:

Prediction, Learning, and Games. N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.

The “online convex optimization” model was introduced by Zinkevich

The use of duality for online learning is due to Shalev-Shwartz and Singer

Most of the topics covered in the tutorial can be found in:

Online Learning: Theory, Algorithms, and Applications. S. Shalev-Shwartz. PhD thesis, The Hebrew University, 2007. Advisor: Yoram Singer.

Recommended