Lecture 6: Stochastic gradient descent
Sanjeev Arora, Elad Hazan
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016
Admin
• Exercise 2 (implementation): this Thu, in class
• Exercise 3 (written): this Thu, in class
• Movie – "Ex Machina" + discussion panel w. Prof. Hasson (PNI), Wed Oct. 5th, 19:30; tickets from Bella; room 204 COS
• Today: special guest – Dr. Yoram Singer @ Google
Recap
• Definition + fundamental theorem of statistical learning, which motivated efficient algorithms/optimization
• Convexity and its computational importance
• Local greedy optimization – gradient descent
Agenda
• Stochastic gradient descent
• Dr. Singer on optimization @ Google & beyond
Mathematical optimization
Input: a function $f : K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$
Output: a point $x \in K$ such that $f(x) \le f(y)$ for all $y \in K$
Prefer convex problems
What is Optimization?
But generally speaking... we're screwed:
• Local (non-global) minima of $f_0$
• All kinds of constraints (even restricting to continuous functions), e.g. $h(x) = \sin(2\pi x) = 0$
[Figure: 3-D surface plot of a nonconvex function with many local minima]
(Slide: Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009)
Convex functions: local → global
Sum of convex functions → also convex
Convex Sets
Definition: A set $C \subseteq \mathbb{R}^n$ is convex if for $x, y \in C$ and any $\alpha \in [0, 1]$,
$\alpha x + (1 - \alpha) y \in C$.
[Figure: two points $x$, $y$ joined by a segment inside a convex set]
(Slide: Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009)
Convex Functions and Sets
A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for $x, y \in \operatorname{dom} f$ and any $\alpha \in [0, 1]$,
$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$.
A set $C \subseteq \mathbb{R}^n$ is convex if for $x, y \in C$ and any $\alpha \in [0, 1]$,
$\alpha x + (1 - \alpha) y \in C$.
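As a quick numeric sanity check (not from the slides; $f(x) = \|x\|^2$ is an assumed example of a convex function), one can sample points and verify the defining inequality:

```python
import numpy as np

# Spot-check the convexity inequality
#   f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)
# for f(x) = ||x||^2, which is convex.
f = lambda x: np.dot(x, x)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    a = rng.uniform()
    assert f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
print("convexity inequality holds on all sampled points")
```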
Special case: optimization for linear classification
Given a sample $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, find a hyperplane (through the origin w.l.o.g.) such that:
$\min_w$ (# of mistakes),
relaxed to
$\min_w \sum_i \ell(w^\top x_i, y_i)$ for a convex loss function $\ell$
[Figure: separating hyperplane with normal vector $w$]
Convex relaxation for 0-1 loss
1. Ridge/linear regression: $\ell(w, x_i, y_i) = (w^\top x_i - y_i)^2$
2. SVM: $\ell(w, x_i, y_i) = \max\{0, 1 - y_i\, w^\top x_i\}$
i.e., for $|w| = |x_i| = 1$ we have:
$1 - y_i w^\top x_i \begin{cases} = 0 & \text{if } y_i = w^\top x_i \\ \le 2 & \text{if } y_i \ne w^\top x_i \end{cases}$
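To see concretely that the hinge loss is a convex upper bound on the 0-1 loss under these normalizations, here is a minimal check (the synthetic data and $y_i \in \{-1, +1\}$ labels are assumptions for illustration):

```python
import numpy as np

# With |w| = |x_i| = 1 and y in {-1, +1}: a mistake (y * w.x <= 0)
# forces the hinge term to be >= 1, and Cauchy-Schwarz bounds it by 2.
rng = np.random.default_rng(1)
w = rng.normal(size=5); w /= np.linalg.norm(w)      # |w| = 1
for _ in range(1000):
    x = rng.normal(size=5); x /= np.linalg.norm(x)  # |x_i| = 1
    y = rng.choice([-1.0, 1.0])
    zero_one = 1.0 if y * np.dot(w, x) <= 0 else 0.0
    hinge = max(0.0, 1.0 - y * np.dot(w, x))
    assert hinge >= zero_one      # convex upper bound on the 0-1 loss
    assert hinge <= 2.0 + 1e-12  # bounded by 2 under these normalizations
```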
Greedy optimization: gradient descent
• Move in the direction of steepest descent:
$x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$
[Figure: iterates $p_1, p_2, p_3$ descending toward the minimizer $p^*$]
We saw: for a certain step-size choice,
$f\left(\frac{1}{T} \sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + O\left(\frac{1}{\sqrt{T}}\right)$
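A minimal sketch of this update in code (the quadratic objective below is an assumed illustration, not an example from the lecture):

```python
import numpy as np

# Gradient descent: x_{t+1} <- x_t - eta * grad f(x_t).
def gradient_descent(grad_f, x0, eta=0.1, T=100):
    x = x0.copy()
    iterates = [x.copy()]
    for _ in range(T):
        x = x - eta * grad_f(x)       # steepest-descent step
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)  # the guarantee is for the average iterate

# Example: f(x) = ||x - 1||^2 with gradient 2 (x - 1); minimizer is (1, 1, 1).
x_avg = gradient_descent(lambda x: 2 * (x - np.ones(3)), np.zeros(3))
print(x_avg)  # close to (1, 1, 1)
```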
GD for linear classification
$\min_w \frac{1}{m} \sum_i \ell(w^\top x_i, y_i)$
$w_{t+1} = w_t - \eta \cdot \frac{1}{m} \sum_i \ell'(w_t^\top x_i, y_i)\, x_i$
• Complexity? $O\left(\frac{1}{\epsilon^2}\right)$ iterations, each taking roughly linear time in the dataset
• Overall $O\left(\frac{md}{\epsilon^2}\right)$ running time, where m = # of examples in $\mathbb{R}^d$
• Can we speed it up?
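A sketch of the full-batch update, where each step costs $O(md)$ (the squared loss and synthetic data are assumptions for illustration):

```python
import numpy as np

# Full-batch GD with the squared loss l(z, y) = (z - y)^2, l'(z, y) = 2 (z - y).
def gd_linear(X, y, eta=0.01, T=500):
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        residual = X @ w - y               # shape (m,): touches every example
        grad = (2.0 / m) * X.T @ residual  # (1/m) sum_i l'(w.x_i, y_i) x_i
        w -= eta * grad
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5))
w = gd_linear(X, y)
print(np.mean(np.sign(X @ w) == y))  # training accuracy
```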
GD for linear classification
• What if we take a single example, and compute the gradient only w.r.t. its loss?
• Which example?
• → uniformly at random…
• Why would this work?
SGD for linear classification
• Uniformly at random: $i_t \sim U[1, \dots, m]$
• Each iteration is much faster: O(md) → O(d). Convergence?
$\min_w \frac{1}{m} \sum_i \ell(w^\top x_i, y_i)$
$w_{t+1} = w_t - \eta\, \ell'(w_t^\top x_{i_t}, y_{i_t})\, x_{i_t}$
The stochastic gradient has expectation = the full gradient.
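A minimal SGD sketch under the same assumed squared loss; each step touches a single example, so it costs $O(d)$:

```python
import numpy as np

# SGD: sample i_t uniformly, step on that example's gradient only.
def sgd_linear(X, y, eta=0.01, T=5000, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w, avg = np.zeros(d), np.zeros(d)
    for t in range(T):
        i = rng.integers(m)                    # i_t ~ U[1, ..., m]
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]  # l'(w.x_i, y_i) x_i, cost O(d)
        w -= eta * grad
        avg += (w - avg) / (t + 1)             # running average of iterates
    return avg

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5))
w = sgd_linear(X, y)
print(np.mean(np.sign(X @ w) == y))  # training accuracy
```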
Crucial for SGD: linearity of expectation and derivatives
Let $f(w) = \frac{1}{m} \sum_i \ell_i(w)$; then for $i_t \sim U[1, \dots, m]$ chosen uniformly at random, we have
$E[\nabla \ell_{i_t}(w)] = \sum_{i=1}^{m} \frac{1}{m} \nabla \ell_i(w) = \nabla\left(\frac{1}{m} \sum_i \ell_i(w)\right) = \nabla f(w)$
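A quick numeric confirmation of this identity (the squared loss and random data are assumptions):

```python
import numpy as np

# Averaging the per-example gradients recovers the full gradient.
rng = np.random.default_rng(3)
X, y = rng.normal(size=(50, 4)), rng.normal(size=50)
w = rng.normal(size=4)

full_grad = (2.0 / len(y)) * X.T @ (X @ w - y)
per_example = np.array([2.0 * (X[i] @ w - y[i]) * X[i] for i in range(len(y))])
assert np.allclose(per_example.mean(axis=0), full_grad)
```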
Greedy optimization: gradient descent
• Move in a random direction whose expectation is the steepest descent:
$x_{t+1} \leftarrow x_t - \eta \tilde{\nabla} f(x_t)$
• Denote by $\tilde{\nabla} f(w)$ a vector random variable whose expectation is the gradient:
$E[\tilde{\nabla} f(w)] = \nabla f(w)$
Stochastic gradient descent – constrained case
$y_{t+1} \leftarrow x_t - \eta \tilde{\nabla} f(x_t)$, where $E[\tilde{\nabla} f(x_t)] = \nabla f(x_t)$
$x_{t+1} = \arg\min_{x \in K} |y_{t+1} - x|$
Stochastic gradient descent – constrained set
Let:
• G = upper bound on the norm of the gradient estimators: $|\tilde{\nabla} f(x_t)| \le G$
• D = diameter of the constraint set: $\forall x, y \in K,\ |x - y| \le D$
Theorem: for step size $\eta = \frac{D}{G \sqrt{T}}$,
$E\left[f\left(\frac{1}{T} \sum_t x_t\right)\right] \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$
where the iterates follow
$y_{t+1} \leftarrow x_t - \eta \tilde{\nabla} f(x_t)$, $E[\tilde{\nabla} f(x_t)] = \nabla f(x_t)$
$x_{t+1} = \arg\min_{x \in K} |y_{t+1} - x|$
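A sketch of the projected update. The Euclidean-ball constraint set and its closed-form projection are assumptions for illustration; the theorem itself holds for any convex $K$ with a projection oracle:

```python
import numpy as np

# Projection onto K = {x : |x| <= R}, which has a closed form.
def project_ball(y, R=1.0):
    n = np.linalg.norm(y)
    return y if n <= R else (R / n) * y

def projected_sgd(stoch_grad, x0, eta, T, R=1.0):
    x, avg = x0.copy(), np.zeros_like(x0)
    for t in range(T):
        y_next = x - eta * stoch_grad(x)  # stochastic gradient step
        x = project_ball(y_next, R)       # x_{t+1} = argmin_{x in K} |y_{t+1} - x|
        avg += (x - avg) / (t + 1)
    return avg                            # theorem bounds E[f(average iterate)]

# Tiny demo: noisy gradients of f(x) = |x - (2, 0)|^2, constrained to |x| <= 1.
rng = np.random.default_rng(5)
target = np.array([2.0, 0.0])
noisy_grad = lambda x: 2 * (x - target) + rng.normal(scale=0.1, size=2)
print(projected_sgd(noisy_grad, np.zeros(2), eta=0.05, T=2000))  # approx (1, 0)
```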
Proof:
1. We have proved (for any sequence of vectors $\nabla_t$; here, the stochastic gradients):
$\frac{1}{T} \sum_t \nabla_t^\top x_t \le \min_{x^* \in K} \frac{1}{T} \sum_t \nabla_t^\top x^* + \frac{DG}{\sqrt{T}}$
2. By convexity of $f$ and properties of expectation (using $E[\tilde{\nabla} f(x_t) \mid x_t] = \nabla f(x_t)$):
$E\left[f\left(\frac{1}{T} \sum_t x_t\right) - \min_{x^* \in K} f(x^*)\right] \le \frac{1}{T} \sum_t E\left[\nabla f(x_t)^\top (x_t - x^*)\right] \le \frac{DG}{\sqrt{T}}$
for the update
$y_{t+1} \leftarrow x_t - \eta \tilde{\nabla} f(x_t)$, $E[\tilde{\nabla} f(x_t)] = \nabla f(x_t)$
$x_{t+1} = \arg\min_{x \in K} |y_{t+1} - x|$
Summary
• Mathematical & convex optimization
• Gradient descent algorithm, linear classification
• Stochastic gradient descent