Transcript
Page 1: Tutorial: PART 1 (Princeton University Computer Science, ehazan/tutorial/OCO-tutorial-part1.pdf)

Tutorial: PART 1

Online Convex Optimization: A Game-Theoretic Approach to Learning

Elad Hazan (Princeton University)        Satyen Kale (Yahoo Research)

http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm

Page 2

Agenda

1. Motivating examples
2. Unified framework & why it makes sense
3. Algorithms & main results / techniques
4. Research directions & open questions

Page 3

Section 1: Motivating Examples (inherently adversarial & online)

Page 4

Sequential Spam Classification: online & adversarial learning

• Observe n features (words): a_t ∈ R^n
• Predict label b̂_t ∈ {−1, 1}; feedback: b_t
• Objective: average error → best linear model in hindsight

min_{x ∈ R^n} Σ_t log(1 + e^{−b_t · x^T a_t})
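The minimization above can be pursued one email at a time with online gradient steps on the logistic loss. A minimal sketch, not from the slides; the step size `eta` and the pure-Python vector handling are illustrative choices:

```python
import math

def logistic_loss(x, a, b):
    """Loss of linear model x on example (a, b), with b in {-1, +1}."""
    z = -b * sum(xi * ai for xi, ai in zip(x, a))
    return math.log(1 + math.exp(z))

def sgd_step(x, a, b, eta=0.1):
    """One online gradient step on the logistic loss above."""
    z = -b * sum(xi * ai for xi, ai in zip(x, a))
    s = 1 / (1 + math.exp(-z))            # sigmoid(z): derivative of log(1+e^z)
    grad = [-b * ai * s for ai in a]      # d/dx_i of the loss
    return [xi - eta * gi for xi, gi in zip(x, grad)]
```

One step on a single example already reduces that example's loss, which is the whole mechanism the slide relies on.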

Page 5

Spam Classifier Selection

Stream of emails: a ∈ R^d

Objective: make predictions with accuracy → best classifier in hindsight

Candidate classifiers (different regularization strengths):
log(1 + e^{−b · x^T a}) + λ_1 ∥x∥²
log(1 + e^{−b · x^T a}) + λ_2 ∥x∥²
log(1 + e^{−b · x^T a}) + λ_3 ∥x∥²

Page 6

Universal portfolio selection

Market: no statistical assumptions.

Price relatives: r_t(i) = (closing price)/(opening price) for commodity i

Example round: price relatives vector r_t = (1.5, 1, 1, 0.5); distribution of wealth x_t = (1/2, 0, 1/2, 0).

Wealth multiplier = r_t^T x_t = (1/2 · 1.5 + 0 · 1 + 1/2 · 1 + 0 · 0.5) = 1 1/4

Change in log-wealth: log W_{t+1} = log W_t + log(r_t^T x_t)

Objective: choose portfolios s.t. avg. log-wealth → avg. log-wealth of the best CRP

Page 7

Constant rebalancing portfolio

• A single currency depreciates exponentially
• The 50/50 Constant Rebalanced Portfolio (CRP) makes 5% every day

Asset 1: × 0.1, × 2, × 0.1, × 2, × 0.1, ...
Asset 2: × 2, × 0.1, × 2, × 0.1, × 2, ...
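The 5%-per-day claim is easy to check numerically: each round the 50/50 CRP multiplies its wealth by ½·2 + ½·0.1 = 1.05, while holding either single asset alone decays. A small sketch (the 10-round horizon is an arbitrary choice):

```python
def crp_wealth(price_relatives, x):
    """Wealth after rebalancing to the fixed portfolio x before each round."""
    wealth = 1.0
    for r in price_relatives:
        wealth *= sum(xi * ri for xi, ri in zip(x, r))
    return wealth

# The two alternating assets from the slide, over 10 days.
rounds = [(2.0, 0.1), (0.1, 2.0)] * 5

print(crp_wealth(rounds, (0.5, 0.5)))  # 1.05 per day: 1.05**10
print(crp_wealth(rounds, (1.0, 0.0)))  # holding asset 1 alone: (2 * 0.1)**5
```

Holding any single asset multiplies wealth by 0.2 every two days, while the 50/50 CRP grows steadily.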

Page 8

Recommendation systems

[Figure: a 480,000 users × 18,000 movies ratings matrix, mostly unobserved entries (X) with a few observed ±1 ratings]

• Objective: predict ratings with accuracy → best low-rank matrix
• Computational challenge: optimizing over low-rank matrices

Page 9

Ad Selection

• Yahoo, Google, etc. display ads on pages; clicks generate revenue
• Abstraction:
  • Observe a stream of users
  • Given a user, select one ad to display
  • Feedback: click (or not) on the ad shown
• Objective: average revenue per ad → best policy in a given class
• Additional difficulty: partial (bandit) feedback (feedback only for the ad actually shown)

Page 10

Section 2: Methodology: Online Convex Optimization
- definition & relation to PAC learning
- go through examples
- argue it makes sense

Page 11

Statistical (PAC) learning

Nature: i.i.d. from distribution D over A × B = {(a, b)}

Learner: picks a hypothesis h from a class {h_1, h_2, ..., h_N}, given examples (a_1, b_1), ..., (a_M, b_M)

Loss, e.g. ℓ(h, (a, b)) = (h(a) − b)²

A hypothesis class H: X → Y is learnable if for all ε, δ > 0 there exists an algorithm s.t. after seeing m examples, for m = poly(1/δ, 1/ε, dimension(H)), it finds h s.t. with probability 1 − δ:

err(h) ≤ min_{h* ∈ H} err(h*) + ε,  where err(h) = E_{(a,b) ∼ D}[ℓ(h, (a, b))]

Page 12

Online Learning in Games

Iteratively, for t = 1, 2, ..., T:
  Player: h_t ∈ H
  Adversary: (a_t, b_t) ∈ A × B
  Loss: ℓ(h_t, (a_t, b_t))

Goal: minimize (average, expected) regret:

(1/T) [ Σ_t ℓ(h_t, (a_t, b_t)) − min_{h* ∈ H} Σ_t ℓ(h*, (a_t, b_t)) ] → 0 as T → ∞

Can we learn in games efficiently? Regret → o(T)?
(i.e., an analogue of the fundamental theorem of statistical learning)

Page 13

Online  vs.  Batch  (PAC,  Statistical)

Online convex optimization & regret minimization:

• Learning against an adversary
• Regret
• Streaming setting: process examples sequentially
• Fewer assumptions, stronger guarantees
• Harder algorithmically

Batch learning, PAC/statistical learning:

• Nature is benign: i.i.d. samples from an unknown distribution
• Generalization error
• Batch: process the dataset multiple times
• Weaker guarantees (under changing distribution, etc.)
• Easier algorithmically

Page 14

Online Convex Optimization

Online player vs. adversary:
• Player chooses a point x_t in a convex set K ⊆ R^n
• Adversary chooses a convex cost function f_t
• Loss: f_t(x_t); total loss Σ_t f_t(x_t)

(What access does the player have to K and f_t?)

Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)

Page 15

Prediction from expert advice

• Decision set = set of all distributions over n experts:
  K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}
• Cost functions? Let c_t(i) = loss of expert i in round t; then
  f_t(x) = c_t^T x = Σ_i c_t(i) x(i) = expected loss when choosing an expert by distribution x
• Regret = difference in expected loss vs. the best expert:
  Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)

Page 16

Parameterized Hypothesis Classes

• H is parameterized by a vector x in a convex set K ⊆ R^n
• In round t, define the loss function based on the example (a_t, b_t)

1. Online linear spam filtering:
   • K = {x : ∥x∥ ≤ ω}
   • Loss function f_t(x) = (a_t^T x − b_t)²
2. Online soft-margin SVM:
   • K = R^n
   • Loss function f_t(x) = max(0, 1 − b_t a_t^T x) + λ ∥x∥²
3. Online matrix completion:
   • K = {X ∈ R^{n×n} : ∥X∥_* ≤ k}, matrices with bounded nuclear norm
   • At time t, if a_t = (i_t, j_t), then loss function f_t(X) = (X(i_t, j_t) − b_t)²
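Each of the three loss functions above is a one-liner. A plain-Python sketch (the regularization weight `lam` and the list-of-lists matrix representation are illustrative choices, not from the slides):

```python
def spam_filter_loss(x, a, b):
    """Online linear spam filtering: squared loss (a^T x - b)^2."""
    return (sum(xi * ai for xi, ai in zip(x, a)) - b) ** 2

def soft_margin_svm_loss(x, a, b, lam=0.1):
    """Online soft-margin SVM: hinge loss plus L2 regularization."""
    margin = b * sum(xi * ai for xi, ai in zip(x, a))
    return max(0.0, 1.0 - margin) + lam * sum(xi * xi for xi in x)

def matrix_completion_loss(X, i, j, b):
    """Online matrix completion: squared error on the observed entry (i, j)."""
    return (X[i][j] - b) ** 2
```

In all three cases the adversary fixes the example (a_t, b_t), which determines the convex function f_t the player is charged with.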

Page 17

Universal portfolio selection

A "Universal Portfolio" algorithm:

K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}

Recall the change in log-wealth: log W_{t+1} = log W_t + log(r_t^T x_t)

Thus take: f_t(x) = −log(r_t^T x)

Then:

Regret/T = (1/T) Σ_t f_t(x_t) − (1/T) min_{x*} Σ_t f_t(x*) → 0

i.e., the average log-wealth approaches that of the best CRP in hindsight.

Page 18

Bandit  Online  Convex  Optimization

• Same  as  OCO,  except  ft is  not  revealed  to  learner,  only  ft(xt) is.

• E.g.  Ad  selection:  a  bandit  version  of  “prediction  with  expert  advice”

• Decision  set  K  =  set  of  distributions  over  possible  ads  to  be  shown

• Loss  function:    let  ct(i)  = 0 if  ad  i would  be  clicked  if  it  were  shown,  and  1 otherwise

K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}

f_t(x) = c_t^T x = Σ_i c_t(i) x(i)

Page 19

Part 1: Foundations

Algorithms:
• randomized weighted majority
• follow-the-leader
• online gradient descent, regularization?
• log-regret, Online-Newton-Step
• Bandit algorithms (FKM, barriers, volumetric spanners)
• Projection-free methods (FW algorithm, online FW)

Page 20

Prediction  from  expert  advice

• N "experts" (different regularizations)
• Can we perform as well as the best expert? (non-stochastic, adversarial setting)
• Loss: 0 if correct, 1 for a mistake
• Features: a ∈ R^d

Page 21

Weighted  Majority  [Littlestone-­‐Warmuth ‘89]

Binary classification, {0, 1} loss; n experts with weights x_t(h_1), x_t(h_2), ..., x_t(h_n).

Initially, x_1(h) = 1 for all h.
In round t, predict by weighted majority vote.
Update weights:
  x_{t+1}(h) = (1 − η) x_t(h)   if expert h errs
  x_{t+1}(h) = x_t(h)           otherwise

Thm: Total errors ≤ 2(1 + η) · (Best expert's errors) + (2 log n)/η
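The update rule above fits in a few lines. A minimal sketch for the {−1, +1} prediction setting (the tie-breaking rule on a zero vote is an arbitrary choice, not from the slides):

```python
def weighted_majority(expert_preds, truths, eta=0.5):
    """Deterministic Weighted Majority [Littlestone-Warmuth '89].

    expert_preds[t][i] = prediction in {-1, +1} of expert i at round t.
    Returns (algorithm mistakes, per-expert mistake counts)."""
    n = len(expert_preds[0])
    w = [1.0] * n                      # initially all weights are 1
    alg_mistakes = 0
    expert_mistakes = [0] * n
    for preds, b in zip(expert_preds, truths):
        vote = sum(wi * p for wi, p in zip(w, preds))
        guess = 1 if vote >= 0 else -1  # weighted majority vote
        if guess != b:
            alg_mistakes += 1
        for i, p in enumerate(preds):
            if p != b:                  # expert i erred: shrink its weight
                expert_mistakes[i] += 1
                w[i] *= (1 - eta)
    return alg_mistakes, expert_mistakes
```

With one perfect expert, the algorithm's mistake count stays within the theorem's bound of 2 log n / η.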

Page 22

Weighted  Majority:  Analysis

• Total weight: Φ_t = Σ_{h ∈ H} x_t(h)

1. An algorithm mistake → at least half the weight is on experts who erred, so
   Φ_{t+1} ≤ (1/2) Φ_t (1 − η) + (1/2) Φ_t = Φ_t (1 − η/2)
2. If h* is the best expert,
   Φ_T ≥ x_T(h*) = (1 − η)^{#(h* errors)}

• Thus:
  (1 − η)^{#(best expert errors)} ≤ Φ_T ≤ Φ_0 · (1 − η/2)^{#(alg errors)}
  Taking logarithms and rearranging (with Φ_0 = n):
  #(alg errors) ≤ 2(1 + η) · #(best expert errors) + (2 log n)/η

Page 23

Randomized  Weighted  Majority  [LW‘89]

General losses: expert h at time t incurs loss c_t(h) ∈ [0, 1]; n experts with weights x_t(h_1), x_t(h_2), ..., x_t(h_n).

Initially, x_1(h) = 1 for all h.
In round t, predict h with probability x_t(h) / Σ_{h'} x_t(h').
Update weights: x_{t+1}(h) = e^{−η c_t(h)} · x_t(h)

Thm: For η = √(log N / T), Regret = O(√(T log N))
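The randomized variant above is equally short; instead of sampling, the sketch below tracks the algorithm's expected loss directly (adversarial alternating losses are used only as a demonstration):

```python
import math

def rwm(losses, eta):
    """Randomized Weighted Majority: play expert h w.p. x_t(h) / sum of weights.

    losses[t][i] in [0, 1] is the loss of expert i in round t.
    Returns (expected algorithm loss, regret vs. best expert)."""
    n = len(losses[0])
    w = [1.0] * n
    exp_loss = 0.0
    for c in losses:
        total = sum(w)
        # Expected loss of sampling from the current weight distribution.
        exp_loss += sum(wi * ci for wi, ci in zip(w, c)) / total
        # Multiplicative update: x_{t+1}(h) = exp(-eta * c_t(h)) * x_t(h).
        w = [wi * math.exp(-eta * ci) for wi, ci in zip(w, c)]
    best = min(sum(c[i] for c in losses) for i in range(n))
    return exp_loss, exp_loss - best
```

Even against alternating adversarial losses, the regret stays around √(T log N), far below the T/2 that deterministic play would suffer here.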

Page 24

Reminder: Online Convex Optimization

Online player vs. adversary:
• Player chooses a point x_t in a convex set K ⊆ R^n
• Adversary chooses a convex cost function f_t
• Loss: f_t(x_t); total loss Σ_t f_t(x_t)

(What access does the player have to f_t? What is the benchmark?)

Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)

Page 25

Minimize regret: best-in-hindsight

Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)

• Most natural: play the best decision so far (follow the leader):
  x_{t+1} = argmin_{x ∈ K} Σ_{i=1}^{t} f_i(x)
• Provably works with regularization, even for non-convex sets [Kalai-Vempala '05]:
  x_t = argmin { Σ_{i=1}^{t} f_i(x) + (1/η) R(x) }
• Is x_t close to x_{t+1}?
• Decisions may be unstable! (teaser: this can be fixed with regularization)
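The instability can be seen concretely: on K = [−1, 1] with linear losses f_t(x) = c_t · x, alternating costs make the unregularized leader jump between the endpoints and lose every round. A sketch of the standard illustrative construction (the specific cost sequence is not from the slides):

```python
def ftl(costs, K=(-1.0, 1.0)):
    """Follow-The-Leader for linear losses f_t(x) = c_t * x on an interval K."""
    plays, cum = [], 0.0
    for c in costs:
        # argmin of cum * x over the interval is an endpoint (ties -> left).
        x = K[0] if cum >= 0 else K[1]
        plays.append(x)
        cum += c
    return plays

# c_1 = 0.5, then costs alternate -1, +1, -1, +1, ...
costs = [0.5] + [-1.0, 1.0] * 49
plays = ftl(costs)
ftl_loss = sum(c * x for c, x in zip(costs, plays))
best_fixed = min(sum(costs) * x for x in (-1.0, 1.0))
print(ftl_loss - best_fixed)  # regret grows linearly in T
```

From round 2 on, FTL pays loss 1 every round while the best fixed point pays about 0: linear regret, exactly the instability the regularization term is meant to fix.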

Page 26

Offline: Steepest Descent

• Move in the direction of steepest descent, which is:
  −[∇f(x)]_i = −∂f(x)/∂x_i

[Figure: iterates p_1, p_2, p_3 descending toward the optimum p*]

Page 27

Online gradient descent [Zinkevich '03]

y_{t+1} = x_t − η ∇f_t(x_t)
x_{t+1} = argmin_{x ∈ K} ∥y_{t+1} − x∥    (projection onto K)

Theorem: Regret = O(√T)
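The two lines above translate directly into code. A minimal sketch with K taken to be the unit Euclidean ball, whose projection is a simple rescaling; the fixed step size and the demo loss f_t(x) = ∥x − z∥² are illustrative choices, not from the slides:

```python
import math

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= radius:
        return y
    return [v * radius / norm for v in y]

def ogd(loss_grads, eta):
    """Online gradient descent (Zinkevich '03) on the unit ball.

    loss_grads[t](x) returns the gradient of f_t at x."""
    x = [0.0, 0.0]
    plays = []
    for g in loss_grads:
        plays.append(list(x))
        y = [xi - eta * gi for xi, gi in zip(x, g(x))]  # gradient step
        x = project_ball(y)                             # projection back into K
    return plays
```

Against the repeated loss f_t(x) = ∥x − (0.6, 0)∥², the iterates converge to the minimizer inside K, as expected.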

Page 28

Analysis

Notation: ∇_t := ∇f_t(x_t), and suppose ∥∇_t∥ ≤ G.

Observation 1:
∥y_{t+1} − x*∥² = ∥x_t − x*∥² − 2η ∇_t^T (x_t − x*) + η² ∥∇_t∥²

Observation 2 (Pythagoras):
∥x_{t+1} − x*∥ ≤ ∥y_{t+1} − x*∥

Thus:
∥x_{t+1} − x*∥² ≤ ∥x_t − x*∥² − 2η ∇_t^T (x_t − x*) + η² ∥∇_t∥²

Convexity:
Σ_t [f_t(x_t) − f_t(x*)] ≤ Σ_t ∇_t^T (x_t − x*)
    ≤ (1/2η) Σ_t (∥x_t − x*∥² − ∥x_{t+1} − x*∥²) + (η/2) Σ_t ∥∇_t∥²
    ≤ (1/2η) ∥x_1 − x*∥² + (η/2) T G² = O(√T)    [for η = Θ(1/√T)]

Page 29

Lower bound

• 2 experts, T iterations:
  • First expert has random loss in {−1, 1}
  • Second expert's loss = first's × (−1)
• Expected loss of any algorithm = 0
• Regret (think of the first expert only):
  E[|#1's − #(−1)'s|] = Ω(√T), so Regret = Ω(√T)
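The Ω(√T) quantity is just the expected absolute deviation of a ±1 random walk, which a quick Monte-Carlo estimate confirms (the trial count and seed are arbitrary; for T = 400 the mean should come out near √(2T/π) ≈ 16):

```python
import random

def expected_regret(T, trials=2000, seed=0):
    """Monte-Carlo estimate of E[ |#1's - #(-1)'s| ] for random {-1,1} losses."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(T))
        total += abs(s)
    return total / trials

print(expected_regret(400))  # grows like sqrt(T): roughly 16 for T = 400
```

Since any algorithm's expected loss is 0 while the better of the two experts gains this much in hindsight, no algorithm can beat √T regret in general.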

Page 30

Algorithms:
• randomized weighted majority
• follow-the-leader
• online gradient descent, regularization?
• log-regret, Online-Newton-Step
• Bandit algorithms (FKM, barriers, volumetric spanners)
• Projection-free methods (FW algorithm, online FW)

Part  2  coming  up…