Tutorial: PART 1 - Princeton University Computer Science (ehazan/tutorial/OCO-tutorial-part1.pdf)


  • Tutorial: PART 1

    Online Convex Optimization, A Game-Theoretic Approach to Learning

    Elad Hazan (Princeton University)        Satyen Kale (Yahoo Research)

    http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm

  • Agenda

    1. Motivating examples
    2. Unified framework & why it makes sense
    3. Algorithms & main results / techniques
    4. Research directions & open questions

  • Section 1: Motivating Examples

    Inherently adversarial & online

  • Sequential Spam Classification: online & adversarial learning

    • Observe n features (words): a_t ∈ R^n

    • Predict label b̂_t ∈ {−1, 1}; feedback b_t
    • Objective: average error → best linear model in hindsight:

    $\min_{x \in \mathbb{R}^n} \sum_t \log\left(1 + e^{-b_t \, a_t^\top x}\right)$
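
    A minimal sketch (not from the slides) of this hindsight objective in code, assuming synthetic numpy data and a single candidate model x; the function name logistic_loss is illustrative:

        import numpy as np

        def logistic_loss(x, a_t, b_t):
            # log(1 + e^{-b_t * a_t^T x}); labels b_t are in {-1, +1}
            return np.log1p(np.exp(-b_t * (a_t @ x)))

        rng = np.random.default_rng(0)
        T, n = 100, 5
        A = rng.normal(size=(T, n))            # feature vectors a_1, ..., a_T
        b = np.sign(A @ rng.normal(size=n))    # labels in {-1, +1}

        x = np.zeros(n)                        # one candidate linear model
        avg_loss = np.mean([logistic_loss(x, A[t], b[t]) for t in range(T)])
        print(avg_loss)                        # objective value of this fixed x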

  • Spam  Classifier  Selection

    Stream  of  emails:  

    Objective: make predictions with accuracy → best classifier in hindsight

    Loss per email: $\log\left(1 + e^{-b_t \, a_t^\top x}\right)$

  • Distribution of wealth x_t

    Objective: choose portfolios s.t. avg. log-wealth → avg. log-wealth of best CRP

    Example: price-relatives vector r_t = (1.5, 1, 1, 0.5), wealth distribution x_t = (1/2, 0, 1/2, 0).

    Wealth multiplier = r_t^T x_t = ½·1.5 + 0·1 + ½·1 + 0·0.5 = 1¼

    Market: no statistical assumptions

    Universal portfolio selection. Price relatives:

    $r_t(i) = \frac{\text{closing price}}{\text{opening price}}$ for commodity i

    $\log W_{t+1} = \log W_t + \log(r_t^\top x_t)$
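
    A quick numerical check of the wealth-multiplier example and the log-wealth update above (the vectors are the ones from the example; variable names are illustrative):

        import numpy as np

        r_t = np.array([1.5, 1.0, 1.0, 0.5])    # price-relatives vector r_t
        x_t = np.array([0.5, 0.0, 0.5, 0.0])    # wealth distribution x_t

        multiplier = r_t @ x_t                  # 0.5*1.5 + 0*1 + 0.5*1 + 0*0.5 = 1.25
        log_wealth = 0.0                        # log W_t
        log_wealth += np.log(multiplier)        # log W_{t+1} = log W_t + log(r_t^T x_t)
        print(multiplier, log_wealth)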

  • Constant  rebalancing  portfolio

    • A single currency depreciates exponentially
    • The 50/50 Constant Rebalanced Portfolio (CRP) makes 5% every day

    Currency 1: × 0.1, × 2, × 0.1, × 2, × 0.1, ×…
    Currency 2: × 2, × 0.1, × 2, × 0.1, × 2, ×…
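
    A short numerical check of this example: each single currency decays, while daily 50/50 rebalancing multiplies wealth by 1.05 per day (the alternating factors are taken from the slide):

        # Alternating price relatives taken from the slide:
        # currency 1: x0.1, x2, x0.1, x2, ...    currency 2: x2, x0.1, x2, x0.1, ...
        days = 10
        w1, w2, w_crp = 1.0, 1.0, 1.0
        for t in range(days):
            r1, r2 = (0.1, 2.0) if t % 2 == 0 else (2.0, 0.1)
            w1 *= r1                            # holding only currency 1
            w2 *= r2                            # holding only currency 2
            w_crp *= 0.5 * r1 + 0.5 * r2        # rebalance to 50/50 daily: factor 1.05
        print(w1, w2, w_crp)                    # single currencies decay, CRP grows ~1.05^days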

  • Recommendation systems

    [Figure: partially observed ratings matrix, 480,000 users × 18,000 movies; a few entries are ±1 ratings, the rest are unknown]

    • Objective: ratings with accuracy → best low-rank matrix
    • Computational challenge: optimizing over low-rank matrices

  • Ad  Selection

    • Yahoo, Google, etc. display ads on pages; clicks generate revenue
    • Abstraction:
      • Observe a stream of users
      • Given a user, select one ad to display
      • Feedback: click (or not) on the ad shown

    • Objective: average revenue per ad → best policy in a given class

    • Additional difficulty: partial (bandit) feedback (feedback only for the ad actually shown)

  • 2. Methodology: Online Convex Optimization

    Definition & relation to PAC learning
    Go through examples
    Argue it makes sense

  • Statistical (PAC) learning

    Nature: i.i.d. from distribution D over A × B = {(a, b)}

    Learner: hypothesis h ∈ H

    Loss, e.g. $\ell(h, (a, b)) = (h(a) - b)^2$

    [Figure: learner sees samples (a_1, b_1), …, (a_M, b_M) and picks a hypothesis from H = {h_1, h_2, …, h_N}]

    Hypothesis class H: X → Y is learnable if ∀ε, δ > 0 there exists an algorithm s.t. after seeing m examples, for m = poly(δ, ε, dimension(H)), it finds h s.t. w.p. 1 − δ:

    $\mathrm{err}(h) \le \min_{h^* \in H} \mathrm{err}(h^*) + \epsilon$

    where $\mathrm{err}(h) = \mathbb{E}_{(a,b) \sim D}\left[\ell(h, (a, b))\right]$

  • Online Learning in Games

    Iteratively, for t = 1, 2, …, T:

    Player: h_t ∈ H;  Adversary: (a_t, b_t) ∈ A × B

    Loss ℓ(h_t, (a_t, b_t))

    Goal: minimize (average, expected) regret:

    $\frac{1}{T}\left[\sum_t \ell(h_t, (a_t, b_t)) - \min_{h^* \in H} \sum_t \ell(h^*, (a_t, b_t))\right] \;\xrightarrow{T \to \infty}\; 0$

    Can we learn in games efficiently? Regret → o(T)?
    (i.e., an analogue of the fundamental theorem of statistical learning)
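
    A minimal sketch of this protocol and the regret quantity, assuming a toy finite hypothesis class and 0/1 loss; the player here just picks hypotheses uniformly at random, so it only illustrates the quantity being minimized, not a learning algorithm:

        import numpy as np

        rng = np.random.default_rng(1)
        T = 200
        H = [lambda a: 1, lambda a: -1, lambda a: int(np.sign(a))]   # toy finite class
        loss = lambda h, a, b: float(h(a) != b)                      # 0/1 loss

        cum_player, cum_h = 0.0, np.zeros(len(H))
        for t in range(T):
            h_t = H[rng.integers(len(H))]                # player: here, a uniformly random pick
            a_t = rng.normal(); b_t = int(np.sign(a_t))  # adversary's example (a_t, b_t)
            cum_player += loss(h_t, a_t, b_t)
            cum_h += [loss(h, a_t, b_t) for h in H]

        print((cum_player - cum_h.min()) / T)            # average regret; a no-regret
                                                         # algorithm drives this to 0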

  • Online  vs.  Batch  (PAC,  Statistical)

    Online  convex  optimization  &  regret   minimization

    • Learning  against  an  adversary

    • Regret

    • Streaming  setting:  process   examples  sequentially

    • Fewer assumptions, stronger guarantees

    • Harder  algorithmically

    Batch  learning,  PAC/statistical  learning

    • Nature  is  benign,  i.i.d.  samples   from  unknown  distribution

    • Generalization  error

    • Batch:  process  dataset  multiple   times

    • Weaker  guarantees  (changing   distribution,  etc.)

    • Easier  algorithmically  

  • Online  Convex  Optimization

    Point x_t in convex set K ⊆ R^n; convex cost function f_t

    Total loss Σ_t f_t(x_t)

    [Figure: Adversary reveals cost f_t, Online Player plays x_t and suffers loss f_t(x_t)]

    Access to K, f_t?

    $\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$
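
    An illustrative sketch of the regret quantity for simple quadratic losses f_t(x) = (x − a_t)², where the best fixed decision in hindsight has the closed form x* = mean of the a_t; the naive player and the choice K = R are assumptions made only to keep the example short:

        import numpy as np

        rng = np.random.default_rng(2)
        T = 500
        a = rng.normal(size=T)                   # adversary's choices defining f_t(x) = (x - a_t)^2

        x_played = np.zeros(T)                   # naive player: x_t = 0 every round
        player_loss = np.sum((x_played - a) ** 2)        # sum_t f_t(x_t)

        x_star = a.mean()                        # best fixed decision in hindsight (closed form)
        best_loss = np.sum((x_star - a) ** 2)    # min_{x*} sum_t f_t(x*)

        print((player_loss - best_loss) / T)     # average regret of the naive player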

  • Prediction  from  expert  advice

    • Decision  set  =  set  of  all  distributions  over  n  experts  

    • Cost  functions?   Let:    ct(i)  =  loss  of  expert  i in  round  t

    f_t(x) = expected loss if choosing experts by distribution x_t
    Regret = difference in expected loss vs. best expert:

    $K = \Delta_n = \{x \in \mathbb{R}^n : \sum_i x_i = 1,\; x_i \ge 0\}$

    $\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$

    $f_t(x) = c_t^\top x = \sum_i c_t(i)\, x(i)$
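
    A small sketch of this setting with a multiplicative-weights update, as a preview of the randomized weighted majority algorithm listed at the end of this part; the step size and the loss vectors are illustrative:

        import numpy as np

        rng = np.random.default_rng(3)
        n, T = 5, 1000
        eta = np.sqrt(np.log(n) / T)              # illustrative step size

        w = np.ones(n)                            # weights over the n experts
        cum_player, cum_experts = 0.0, np.zeros(n)
        for t in range(T):
            x = w / w.sum()                       # x_t in the simplex Delta_n
            c = rng.uniform(size=n)               # c_t(i) = loss of expert i in round t
            cum_player += c @ x                   # f_t(x_t) = c_t^T x_t
            cum_experts += c
            w *= np.exp(-eta * c)                 # multiplicative-weights update

        print((cum_player - cum_experts.min()) / T)   # average regret vs. best expert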

  • Parameterized  Hypothesis  Classes

    • H parameterized by a vector x in convex set K ⊆ R^n

    • In  round  t,  define  loss  function  based  on  example  (at,  bt)

    1. Online Linear Spam Filtering:
       • K = {x : ‖x‖ ≤ ω}
       • Loss function $f_t(x) = (a_t^\top x - b_t)^2$

    2. Online Soft-Margin SVM:
       • K = R^n
       • Loss function $f_t(x) = \max\{0,\, 1 - b_t a_t^\top x\} + \lambda \|x\|^2$

    3. Online Matrix Completion:
       • K = {X ∈ R^{n×n} : ‖X‖_* ≤ k}, matrices with bounded nuclear norm
       • At time t, if a_t = (i_t, j_t), then loss function $f_t(X) = (X(i_t, j_t) - b_t)^2$
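
    For concreteness, the three loss functions written out as plain functions (a sketch; the function names and the value of λ are illustrative):

        import numpy as np

        def spam_filter_loss(x, a_t, b_t):
            # online linear spam filtering: f_t(x) = (a_t^T x - b_t)^2
            return (a_t @ x - b_t) ** 2

        def soft_margin_svm_loss(x, a_t, b_t, lam=0.1):
            # online soft-margin SVM: f_t(x) = max(0, 1 - b_t a_t^T x) + lambda * ||x||^2
            return max(0.0, 1.0 - b_t * (a_t @ x)) + lam * (x @ x)

        def matrix_completion_loss(X, i_t, j_t, b_t):
            # online matrix completion: f_t(X) = (X[i_t, j_t] - b_t)^2
            return (X[i_t, j_t] - b_t) ** 2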

  • Universal  portfolio  selection

    Recall change in log-wealth:

    $\log W_{t+1} = \log W_t + \log(r_t^\top x_t)$

    A "Universal Portfolio" algorithm:

    $K = \Delta_n = \{x \in \mathbb{R}^n : \sum_i x_i = 1,\; x_i \ge 0\}$

    $f_t(x) = -\log(r_t^\top x)$

    Thus:

    $\frac{\mathrm{Regret}}{T} = \frac{1}{T} \sum_t f_t(x_t) - \frac{1}{T} \min_{x^*} \sum_t f_t(x^*) \;\to\; 0$
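
    A small numerical sketch, assuming synthetic price relatives and a fixed uniform portfolio, checking that the total OCO loss Σ_t f_t(x_t) equals minus the accumulated log-wealth:

        import numpy as np

        rng = np.random.default_rng(4)
        T, n = 100, 4
        R = rng.uniform(0.5, 1.5, size=(T, n))       # synthetic price-relative vectors r_t
        x = np.full(n, 1.0 / n)                      # a fixed (uniform) portfolio played every round

        total_loss = -sum(np.log(R[t] @ x) for t in range(T))   # sum_t f_t(x_t)
        log_wealth = sum(np.log(R[t] @ x) for t in range(T))    # log W_{T+1} - log W_1

        print(np.isclose(total_loss, -log_wealth))   # low total loss <=> high final log-wealth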

  • Bandit  Online  Convex  Optimization

    • Same  as  OCO,  except  ft is  not  revealed  to  learner,  only  ft(xt) is.

    • E.g.  Ad  selection:  a  bandit  version  of  “prediction  with  expert  advice”

    • Decision  set  K  =  set  of  distributions  over  possible  ads  to  be  shown

    • Loss  function:    let  ct(i)  = 0 if  ad  i would  be  clicked  if  it  were  shown,   and  1 otherwise

    $K = \Delta_n = \{x \in \mathbb{R}^n : \sum_i x_i = 1,\; x_i \ge 0\}$

    $f_t(x) = c_t^\top x = \sum_i c_t(i)\, x(i)$
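
    A sketch of the information structure in the bandit version of ad selection: the learner holds a distribution over ads, shows one sampled ad, and observes only that ad's loss; the distribution is kept fixed here, since the slides do not specify the bandit algorithm:

        import numpy as np

        rng = np.random.default_rng(5)
        n, T = 10, 100
        x = np.full(n, 1.0 / n)                 # distribution over ads (kept fixed here)

        for t in range(T):
            c = rng.integers(0, 2, size=n)      # c_t(i) = 0 if ad i would be clicked, else 1
            i_t = rng.choice(n, p=x)            # show one ad sampled from x_t
            observed = c[i_t]                   # bandit feedback: only the shown ad's loss
            # the full vector c_t (i.e., f_t) is never revealed to the learner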

  • Algorithms

    randomized weighted majority
    follow-the-leader
    online gradient descent, regularization?
    log-regret, Online-Newton-Step
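
    A minimal sketch of online gradient descent, one of the algorithms listed above: x_{t+1} = Π_K(x_t − η ∇f_t(x_t)). The losses, the Euclidean-ball decision set, and the step size below are illustrative assumptions:

        import numpy as np

        rng = np.random.default_rng(6)
        T, n = 1000, 3
        A = rng.uniform(-0.5, 0.5, size=(T, n))     # defines f_t(x) = ||x - a_t||^2

        def project_ball(x, radius=1.0):
            # Euclidean projection onto K = {x : ||x|| <= radius}
            nrm = np.linalg.norm(x)
            return x if nrm <= radius else x * (radius / nrm)

        eta = 1.0 / np.sqrt(T)                      # illustrative fixed step size
        x = np.zeros(n)
        player_loss = 0.0
        for t in range(T):
            player_loss += np.sum((x - A[t]) ** 2)  # suffer loss f_t(x_t)
            grad = 2.0 * (x - A[t])                 # gradient of f_t at x_t
            x = project_ball(x - eta * grad)        # x_{t+1} = Pi_K(x_t - eta * grad)

        x_star = A.mean(axis=0)                     # best fixed point in hindsight (lies in K)
        best_loss = np.sum((A - x_star) ** 2)
        print((player_loss - best_loss) / T)        # average regret, small for large T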