Transcript
Page 1: Tutorial: PART 1 (Princeton University Computer Science, ehazan/tutorial/OCO-tutorial-part1.pdf)

Tutorial: PART 1

Online Convex Optimization: A Game-Theoretic Approach to Learning

Elad Hazan (Princeton University)        Satyen Kale (Yahoo Research)

http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm

Page 2

Agenda

1. Motivating examples
2. Unified framework & why it makes sense
3. Algorithms & main results / techniques
4. Research directions & open questions

Page 3

Section 1: Motivating Examples (inherently adversarial & online)

Page 4

Sequential Spam Classification: online & adversarial learning

• Observe n features (words): a_t ∈ R^n
• Predict label b̂_t ∈ {−1, 1}; feedback: b_t
• Objective: average error → best linear model in hindsight

min_{x ∈ R^n} Σ_t log(1 + e^{−b_t · x^T a_t})
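The minimization above can be pursued one email at a time with online gradient steps on the logistic loss. A minimal sketch, not from the slides; the step size `eta` and the pure-Python vector handling are illustrative choices:

```python
import math

def logistic_loss(x, a, b):
    """Loss of linear model x on example (a, b), with b in {-1, +1}."""
    z = -b * sum(xi * ai for xi, ai in zip(x, a))
    return math.log(1 + math.exp(z))

def sgd_step(x, a, b, eta=0.1):
    """One online gradient step on the logistic loss above."""
    z = -b * sum(xi * ai for xi, ai in zip(x, a))
    s = 1 / (1 + math.exp(-z))            # sigmoid(z): derivative of log(1+e^z)
    grad = [-b * ai * s for ai in a]      # d/dx_i of the loss
    return [xi - eta * gi for xi, gi in zip(x, grad)]
```

One step on a single example already reduces that example's loss, which is the whole mechanism the slide relies on.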

Page 5

Spam Classifier Selection

Stream of emails: a ∈ R^d

Objective: make predictions with accuracy → best classifier in hindsight

Candidate classifiers (different regularization strengths):
log(1 + e^{−b · x^T a}) + λ_1 ∥x∥²
log(1 + e^{−b · x^T a}) + λ_2 ∥x∥²
log(1 + e^{−b · x^T a}) + λ_3 ∥x∥²

Page 6

Universal portfolio selection

Market: no statistical assumptions.

Price relatives: r_t(i) = (closing price)/(opening price) for commodity i

Example round: price relatives vector r_t = (1.5, 1, 1, 0.5); distribution of wealth x_t = (1/2, 0, 1/2, 0).

Wealth multiplier = r_t^T x_t = (1/2 · 1.5 + 0 · 1 + 1/2 · 1 + 0 · 0.5) = 1 1/4

Change in log-wealth: log W_{t+1} = log W_t + log(r_t^T x_t)

Objective: choose portfolios s.t. avg. log-wealth → avg. log-wealth of the best CRP

Page 7

Constant rebalancing portfolio

• A single currency depreciates exponentially
• The 50/50 Constant Rebalanced Portfolio (CRP) makes 5% every day

Asset 1: × 0.1, × 2, × 0.1, × 2, × 0.1, ...
Asset 2: × 2, × 0.1, × 2, × 0.1, × 2, ...
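The 5%-per-day claim is easy to check numerically: each round the 50/50 CRP multiplies its wealth by ½·2 + ½·0.1 = 1.05, while holding either single asset alone decays. A small sketch (the 10-round horizon is an arbitrary choice):

```python
def crp_wealth(price_relatives, x):
    """Wealth after rebalancing to the fixed portfolio x before each round."""
    wealth = 1.0
    for r in price_relatives:
        wealth *= sum(xi * ri for xi, ri in zip(x, r))
    return wealth

# The two alternating assets from the slide, over 10 days.
rounds = [(2.0, 0.1), (0.1, 2.0)] * 5

print(crp_wealth(rounds, (0.5, 0.5)))  # 1.05 per day: 1.05**10
print(crp_wealth(rounds, (1.0, 0.0)))  # holding asset 1 alone: (2 * 0.1)**5
```

Holding any single asset multiplies wealth by 0.2 every two days, while the 50/50 CRP grows steadily.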

Page 8

Recommendation systems

[Figure: a 480,000 users × 18,000 movies ratings matrix, mostly unobserved entries (X) with a few observed ±1 ratings]

• Objective: predict ratings with accuracy → best low-rank matrix
• Computational challenge: optimizing over low-rank matrices

Page 9

Ad Selection

• Yahoo, Google, etc. display ads on pages; clicks generate revenue
• Abstraction:
  • Observe a stream of users
  • Given a user, select one ad to display
  • Feedback: click (or not) on the ad shown
• Objective: average revenue per ad → best policy in a given class
• Additional difficulty: partial (bandit) feedback (feedback only for the ad actually shown)

Page 10

Section 2: Methodology: Online Convex Optimization
- definition & relation to PAC learning
- go through examples
- argue it makes sense

Page 11

Statistical (PAC) learning

Nature: i.i.d. from distribution D over A × B = {(a, b)}

Learner: picks a hypothesis h from a class {h_1, h_2, ..., h_N}, given examples (a_1, b_1), ..., (a_M, b_M)

Loss, e.g. ℓ(h, (a, b)) = (h(a) − b)²

A hypothesis class H: X → Y is learnable if for all ε, δ > 0 there exists an algorithm s.t. after seeing m examples, for m = poly(1/δ, 1/ε, dimension(H)), it finds h s.t. with probability 1 − δ:

err(h) ≤ min_{h* ∈ H} err(h*) + ε,  where err(h) = E_{(a,b) ∼ D}[ℓ(h, (a, b))]

Page 12

Online Learning in Games

Iteratively, for t = 1, 2, ..., T:
  Player: h_t ∈ H
  Adversary: (a_t, b_t) ∈ A × B
  Loss: ℓ(h_t, (a_t, b_t))

Goal: minimize (average, expected) regret:

(1/T) [ Σ_t ℓ(h_t, (a_t, b_t)) − min_{h* ∈ H} Σ_t ℓ(h*, (a_t, b_t)) ] → 0 as T → ∞

Can we learn in games efficiently? Regret → o(T)?
(i.e., an analogue of the fundamental theorem of statistical learning)

Page 13

Online  vs.  Batch  (PAC,  Statistical)

Online convex optimization & regret minimization:

• Learning against an adversary
• Regret
• Streaming setting: process examples sequentially
• Fewer assumptions, stronger guarantees
• Harder algorithmically

Batch learning, PAC/statistical learning:

• Nature is benign: i.i.d. samples from an unknown distribution
• Generalization error
• Batch: process the dataset multiple times
• Weaker guarantees (under changing distribution, etc.)
• Easier algorithmically

Page 14

Online Convex Optimization

Online player vs. adversary:
• Player chooses a point x_t in a convex set K ⊆ R^n
• Adversary chooses a convex cost function f_t
• Loss: f_t(x_t); total loss Σ_t f_t(x_t)

(What access does the player have to K and f_t?)

Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)

Page 15

Prediction from expert advice

• Decision set = set of all distributions over n experts:
  K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}
• Cost functions? Let c_t(i) = loss of expert i in round t; then
  f_t(x) = c_t^T x = Σ_i c_t(i) x(i) = expected loss when choosing an expert by distribution x
• Regret = difference in expected loss vs. the best expert:
  Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)

Page 16

Parameterized Hypothesis Classes

• H is parameterized by a vector x in a convex set K ⊆ R^n
• In round t, define the loss function based on the example (a_t, b_t)

1. Online linear spam filtering:
   • K = {x : ∥x∥ ≤ ω}
   • Loss function f_t(x) = (a_t^T x − b_t)²
2. Online soft-margin SVM:
   • K = R^n
   • Loss function f_t(x) = max(0, 1 − b_t a_t^T x) + λ ∥x∥²
3. Online matrix completion:
   • K = {X ∈ R^{n×n} : ∥X∥_* ≤ k}, matrices with bounded nuclear norm
   • At time t, if a_t = (i_t, j_t), then loss function f_t(X) = (X(i_t, j_t) − b_t)²
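Each of the three loss functions above is a one-liner. A plain-Python sketch (the regularization weight `lam` and the list-of-lists matrix representation are illustrative choices, not from the slides):

```python
def spam_filter_loss(x, a, b):
    """Online linear spam filtering: squared loss (a^T x - b)^2."""
    return (sum(xi * ai for xi, ai in zip(x, a)) - b) ** 2

def soft_margin_svm_loss(x, a, b, lam=0.1):
    """Online soft-margin SVM: hinge loss plus L2 regularization."""
    margin = b * sum(xi * ai for xi, ai in zip(x, a))
    return max(0.0, 1.0 - margin) + lam * sum(xi * xi for xi in x)

def matrix_completion_loss(X, i, j, b):
    """Online matrix completion: squared error on the observed entry (i, j)."""
    return (X[i][j] - b) ** 2
```

In all three cases the adversary fixes the example (a_t, b_t), which determines the convex function f_t the player is charged with.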

Page 17

Universal portfolio selection

A "Universal Portfolio" algorithm:

K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}

Recall the change in log-wealth: log W_{t+1} = log W_t + log(r_t^T x_t)

Thus take: f_t(x) = −log(r_t^T x)

Then:

Regret/T = (1/T) Σ_t f_t(x_t) − (1/T) min_{x*} Σ_t f_t(x*) → 0

i.e., the average log-wealth approaches that of the best CRP in hindsight.

Page 18

Bandit  Online  Convex  Optimization

• Same  as  OCO,  except  ft is  not  revealed  to  learner,  only  ft(xt) is.

• E.g.  Ad  selection:  a  bandit  version  of  “prediction  with  expert  advice”

• Decision  set  K  =  set  of  distributions  over  possible  ads  to  be  shown

• Loss  function:    let  ct(i)  = 0 if  ad  i would  be  clicked  if  it  were  shown,  and  1 otherwise

K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}

f_t(x) = c_t^T x = Σ_i c_t(i) x(i)

Page 19

Part 1: Foundations

Algorithms:
• randomized weighted majority
• follow-the-leader
• online gradient descent, regularization?
• log-regret, Online-Newton-Step
• Bandit algorithms (FKM, barriers, volumetric spanners)
• Projection-free methods (FW algorithm, online FW)

Page 20

Prediction  from  expert  advice

• N "experts" (different regularizations)
• Can we perform as well as the best expert? (non-stochastic, adversarial setting)
• Loss: 0 if correct, 1 for a mistake
• Features: a ∈ R^d

Page 21

Weighted  Majority  [Littlestone-­‐Warmuth ‘89]

Binary classification, {0, 1} loss; n experts with weights x_t(h_1), x_t(h_2), ..., x_t(h_n).

Initially, x_1(h) = 1 for all h.
In round t, predict by weighted majority vote.
Update weights:
  x_{t+1}(h) = (1 − η) x_t(h)   if expert h errs
  x_{t+1}(h) = x_t(h)           otherwise

Thm: Total errors ≤ 2(1 + η) · (Best expert's errors) + (2 log n)/η
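The update rule above fits in a few lines. A minimal sketch for the {−1, +1} prediction setting (the tie-breaking rule on a zero vote is an arbitrary choice, not from the slides):

```python
def weighted_majority(expert_preds, truths, eta=0.5):
    """Deterministic Weighted Majority [Littlestone-Warmuth '89].

    expert_preds[t][i] = prediction in {-1, +1} of expert i at round t.
    Returns (algorithm mistakes, per-expert mistake counts)."""
    n = len(expert_preds[0])
    w = [1.0] * n                      # initially all weights are 1
    alg_mistakes = 0
    expert_mistakes = [0] * n
    for preds, b in zip(expert_preds, truths):
        vote = sum(wi * p for wi, p in zip(w, preds))
        guess = 1 if vote >= 0 else -1  # weighted majority vote
        if guess != b:
            alg_mistakes += 1
        for i, p in enumerate(preds):
            if p != b:                  # expert i erred: shrink its weight
                expert_mistakes[i] += 1
                w[i] *= (1 - eta)
    return alg_mistakes, expert_mistakes
```

With one perfect expert, the algorithm's mistake count stays within the theorem's bound of 2 log n / η.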

Page 22

Weighted  Majority:  Analysis

• Total weight: Φ_t = Σ_{h ∈ H} x_t(h)

1. An algorithm mistake → at least half the weight is on experts who erred, so
   Φ_{t+1} ≤ (1/2) Φ_t (1 − η) + (1/2) Φ_t = Φ_t (1 − η/2)
2. If h* is the best expert,
   Φ_T ≥ x_T(h*) = (1 − η)^{#(h* errors)}

• Thus:
  (1 − η)^{#(best expert errors)} ≤ Φ_T ≤ Φ_0 · (1 − η/2)^{#(alg errors)}
  Taking logarithms and rearranging (with Φ_0 = n):
  #(alg errors) ≤ 2(1 + η) · #(best expert errors) + (2 log n)/η

Page 23

Randomized  Weighted  Majority  [LW‘89]

General losses: expert h at time t incurs loss c_t(h) ∈ [0, 1]; n experts with weights x_t(h_1), x_t(h_2), ..., x_t(h_n).

Initially, x_1(h) = 1 for all h.
In round t, predict h with probability x_t(h) / Σ_{h'} x_t(h').
Update weights: x_{t+1}(h) = e^{−η c_t(h)} · x_t(h)

Thm: For η = √(log N / T), Regret = O(√(T log N))
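The randomized variant above is equally short; instead of sampling, the sketch below tracks the algorithm's expected loss directly (adversarial alternating losses are used only as a demonstration):

```python
import math

def rwm(losses, eta):
    """Randomized Weighted Majority: play expert h w.p. x_t(h) / sum of weights.

    losses[t][i] in [0, 1] is the loss of expert i in round t.
    Returns (expected algorithm loss, regret vs. best expert)."""
    n = len(losses[0])
    w = [1.0] * n
    exp_loss = 0.0
    for c in losses:
        total = sum(w)
        # Expected loss of sampling from the current weight distribution.
        exp_loss += sum(wi * ci for wi, ci in zip(w, c)) / total
        # Multiplicative update: x_{t+1}(h) = exp(-eta * c_t(h)) * x_t(h).
        w = [wi * math.exp(-eta * ci) for wi, ci in zip(w, c)]
    best = min(sum(c[i] for c in losses) for i in range(n))
    return exp_loss, exp_loss - best
```

Even against alternating adversarial losses, the regret stays around √(T log N), far below the T/2 that deterministic play would suffer here.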

Page 24

Reminder: Online Convex Optimization

Online player vs. adversary:
• Player chooses a point x_t in a convex set K ⊆ R^n
• Adversary chooses a convex cost function f_t
• Loss: f_t(x_t); total loss Σ_t f_t(x_t)

(What access does the player have to f_t? What is the benchmark?)

Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)

Page 25

Minimize regret: best-in-hindsight

Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)

• Most natural: play the best decision so far (follow the leader):
  x_{t+1} = argmin_{x ∈ K} Σ_{i=1}^{t} f_i(x)
• Provably works with regularization, even for non-convex sets [Kalai-Vempala '05]:
  x_t = argmin { Σ_{i=1}^{t} f_i(x) + (1/η) R(x) }
• Is x_t close to x_{t+1}?
• Decisions may be unstable! (teaser: this can be fixed with regularization)
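The instability can be seen concretely: on K = [−1, 1] with linear losses f_t(x) = c_t · x, alternating costs make the unregularized leader jump between the endpoints and lose every round. A sketch of the standard illustrative construction (the specific cost sequence is not from the slides):

```python
def ftl(costs, K=(-1.0, 1.0)):
    """Follow-The-Leader for linear losses f_t(x) = c_t * x on an interval K."""
    plays, cum = [], 0.0
    for c in costs:
        # argmin of cum * x over the interval is an endpoint (ties -> left).
        x = K[0] if cum >= 0 else K[1]
        plays.append(x)
        cum += c
    return plays

# c_1 = 0.5, then costs alternate -1, +1, -1, +1, ...
costs = [0.5] + [-1.0, 1.0] * 49
plays = ftl(costs)
ftl_loss = sum(c * x for c, x in zip(costs, plays))
best_fixed = min(sum(costs) * x for x in (-1.0, 1.0))
print(ftl_loss - best_fixed)  # regret grows linearly in T
```

From round 2 on, FTL pays loss 1 every round while the best fixed point pays about 0: linear regret, exactly the instability the regularization term is meant to fix.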

Page 26

Offline: Steepest Descent

• Move in the direction of steepest descent, which is:
  −[∇f(x)]_i = −∂f(x)/∂x_i

[Figure: iterates p_1, p_2, p_3 descending toward the optimum p*]

Page 27

Online gradient descent [Zinkevich '03]

y_{t+1} = x_t − η ∇f_t(x_t)
x_{t+1} = argmin_{x ∈ K} ∥y_{t+1} − x∥    (projection onto K)

Theorem: Regret = O(√T)
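The two lines above translate directly into code. A minimal sketch with K taken to be the unit Euclidean ball, whose projection is a simple rescaling; the fixed step size and the demo loss f_t(x) = ∥x − z∥² are illustrative choices, not from the slides:

```python
import math

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= radius:
        return y
    return [v * radius / norm for v in y]

def ogd(loss_grads, eta):
    """Online gradient descent (Zinkevich '03) on the unit ball.

    loss_grads[t](x) returns the gradient of f_t at x."""
    x = [0.0, 0.0]
    plays = []
    for g in loss_grads:
        plays.append(list(x))
        y = [xi - eta * gi for xi, gi in zip(x, g(x))]  # gradient step
        x = project_ball(y)                             # projection back into K
    return plays
```

Against the repeated loss f_t(x) = ∥x − (0.6, 0)∥², the iterates converge to the minimizer inside K, as expected.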

Page 28

Analysis

Notation: ∇_t := ∇f_t(x_t), and suppose ∥∇_t∥ ≤ G.

Observation 1:
∥y_{t+1} − x*∥² = ∥x_t − x*∥² − 2η ∇_t^T (x_t − x*) + η² ∥∇_t∥²

Observation 2 (Pythagoras):
∥x_{t+1} − x*∥ ≤ ∥y_{t+1} − x*∥

Thus:
∥x_{t+1} − x*∥² ≤ ∥x_t − x*∥² − 2η ∇_t^T (x_t − x*) + η² ∥∇_t∥²

Convexity:
Σ_t [f_t(x_t) − f_t(x*)] ≤ Σ_t ∇_t^T (x_t − x*)
    ≤ (1/2η) Σ_t (∥x_t − x*∥² − ∥x_{t+1} − x*∥²) + (η/2) Σ_t ∥∇_t∥²
    ≤ (1/2η) ∥x_1 − x*∥² + (η/2) T G² = O(√T)    [for η = Θ(1/√T)]

Page 29

Lower bound

• 2 experts, T iterations:
  • First expert has random loss in {−1, 1}
  • Second expert's loss = first's × (−1)
• Expected loss of any algorithm = 0
• Regret (think of the first expert only):
  E[|#1's − #(−1)'s|] = Ω(√T), so Regret = Ω(√T)
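The Ω(√T) quantity is just the expected absolute deviation of a ±1 random walk, which a quick Monte-Carlo estimate confirms (the trial count and seed are arbitrary; for T = 400 the mean should come out near √(2T/π) ≈ 16):

```python
import random

def expected_regret(T, trials=2000, seed=0):
    """Monte-Carlo estimate of E[ |#1's - #(-1)'s| ] for random {-1,1} losses."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(T))
        total += abs(s)
    return total / trials

print(expected_regret(400))  # grows like sqrt(T): roughly 16 for T = 400
```

Since any algorithm's expected loss is 0 while the better of the two experts gains this much in hindsight, no algorithm can beat √T regret in general.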

Page 30

Algorithms:
• randomized weighted majority
• follow-the-leader
• online gradient descent, regularization?
• log-regret, Online-Newton-Step
• Bandit algorithms (FKM, barriers, volumetric spanners)
• Projection-free methods (FW algorithm, online FW)

Part  2  coming  up…