Lecture 6: Stochastic gradient descent
Sanjeev Arora, Elad Hazan
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016
Admin
• Exercise 2 (implementation): this Thu, in class
• Exercise 3 (written): this Thu, in class
• Movie – "Ex Machina" + discussion panel w. Prof. Hasson (PNI), Wed Oct. 5th, 19:30; tickets from Bella; room 204 COS
• Today: special guest – Dr. Yoram Singer @ Google
Recap
• Definition + fundamental theorem of statistical learning, which motivated efficient algorithms/optimization
• Convexity and its computational importance
• Local greedy optimization – gradient descent
Agenda
• Stochastic gradient descent
• Dr. Singer on optimization @ Google & beyond
Mathematical optimization
Input: a function $f : K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$
Output: a point $x \in K$ such that $f(x) \le f(y)$ for all $y \in K$
Prefer convex problems
What is Optimization?
But generally speaking... we're screwed:
• Local (non-global) minima of $f_0$
• All kinds of constraints (even restricting to continuous functions), e.g. $h(x) = \sin(2\pi x) = 0$
[Figure: 3-D surface plot of a nonconvex function with many local minima]
(Slide: Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009)
Convex functions: local → global
Sum of convex functions → also convex
Convex Sets
Definition: A set $C \subseteq \mathbb{R}^n$ is convex if for $x, y \in C$ and any $\alpha \in [0, 1]$,
$\alpha x + (1 - \alpha) y \in C$.
[Figure: two points $x$, $y$ joined by a segment inside a convex set]
(Slide: Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009)
Convex Functions and Sets
A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for $x, y \in \operatorname{dom} f$ and any $\alpha \in [0, 1]$,
$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$.
A set $C \subseteq \mathbb{R}^n$ is convex if for $x, y \in C$ and any $\alpha \in [0, 1]$,
$\alpha x + (1 - \alpha) y \in C$.
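As a quick numeric sanity check (not from the slides; $f(x) = \|x\|^2$ is an assumed example of a convex function), one can sample points and verify the defining inequality:

```python
import numpy as np

# Spot-check the convexity inequality
#   f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)
# for f(x) = ||x||^2, which is convex.
f = lambda x: np.dot(x, x)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    a = rng.uniform()
    assert f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
print("convexity inequality holds on all sampled points")
```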
Special case: optimization for linear classification
Given a sample $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, find a hyperplane (through the origin w.l.o.g.) such that:
$\min_w$ (# of mistakes),
relaxed to
$\min_w \sum_i \ell(w^\top x_i, y_i)$ for a convex loss function $\ell$
[Figure: separating hyperplane with normal vector $w$]
Convex relaxation for 0-1 loss
1. Ridge/linear regression: $\ell(w, x_i, y_i) = (w^\top x_i - y_i)^2$
2. SVM: $\ell(w, x_i, y_i) = \max\{0, 1 - y_i\, w^\top x_i\}$
i.e., for $|w| = |x_i| = 1$ we have:
$1 - y_i w^\top x_i \begin{cases} = 0 & \text{if } y_i = w^\top x_i \\ \le 2 & \text{if } y_i \ne w^\top x_i \end{cases}$
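To see concretely that the hinge loss is a convex upper bound on the 0-1 loss under these normalizations, here is a minimal check (the synthetic data and $y_i \in \{-1, +1\}$ labels are assumptions for illustration):

```python
import numpy as np

# With |w| = |x_i| = 1 and y in {-1, +1}: a mistake (y * w.x <= 0)
# forces the hinge term to be >= 1, and Cauchy-Schwarz bounds it by 2.
rng = np.random.default_rng(1)
w = rng.normal(size=5); w /= np.linalg.norm(w)      # |w| = 1
for _ in range(1000):
    x = rng.normal(size=5); x /= np.linalg.norm(x)  # |x_i| = 1
    y = rng.choice([-1.0, 1.0])
    zero_one = 1.0 if y * np.dot(w, x) <= 0 else 0.0
    hinge = max(0.0, 1.0 - y * np.dot(w, x))
    assert hinge >= zero_one      # convex upper bound on the 0-1 loss
    assert hinge <= 2.0 + 1e-12  # bounded by 2 under these normalizations
```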
Greedy optimization: gradient descent
• Move in the direction of steepest descent:
$x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$
[Figure: iterates $p_1, p_2, p_3$ descending toward the minimizer $p^*$]
We saw: for a certain step-size choice,
$f\left(\frac{1}{T} \sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + O\left(\frac{1}{\sqrt{T}}\right)$
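A minimal sketch of this update in code (the quadratic objective below is an assumed illustration, not an example from the lecture):

```python
import numpy as np

# Gradient descent: x_{t+1} <- x_t - eta * grad f(x_t).
def gradient_descent(grad_f, x0, eta=0.1, T=100):
    x = x0.copy()
    iterates = [x.copy()]
    for _ in range(T):
        x = x - eta * grad_f(x)       # steepest-descent step
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)  # the guarantee is for the average iterate

# Example: f(x) = ||x - 1||^2 with gradient 2 (x - 1); minimizer is (1, 1, 1).
x_avg = gradient_descent(lambda x: 2 * (x - np.ones(3)), np.zeros(3))
print(x_avg)  # close to (1, 1, 1)
```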
GD for linear classification
$\min_w \frac{1}{m} \sum_i \ell(w^\top x_i, y_i)$
$w_{t+1} = w_t - \eta \cdot \frac{1}{m} \sum_i \ell'(w_t^\top x_i, y_i)\, x_i$
• Complexity? $O\left(\frac{1}{\epsilon^2}\right)$ iterations, each taking roughly linear time in the dataset
• Overall $O\left(\frac{md}{\epsilon^2}\right)$ running time, where m = # of examples in $\mathbb{R}^d$
• Can we speed it up?
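A sketch of the full-batch update, where each step costs $O(md)$ (the squared loss and synthetic data are assumptions for illustration):

```python
import numpy as np

# Full-batch GD with the squared loss l(z, y) = (z - y)^2, l'(z, y) = 2 (z - y).
def gd_linear(X, y, eta=0.01, T=500):
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        residual = X @ w - y               # shape (m,): touches every example
        grad = (2.0 / m) * X.T @ residual  # (1/m) sum_i l'(w.x_i, y_i) x_i
        w -= eta * grad
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5))
w = gd_linear(X, y)
print(np.mean(np.sign(X @ w) == y))  # training accuracy
```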
GD for linear classification
• What if we take a single example, and compute the gradient only w.r.t. its loss?
• Which example?
• → uniformly at random…
• Why would this work?
SGD for linear classification
• Uniformly at random: $i_t \sim U[1, \dots, m]$
• Each iteration is much faster: O(md) → O(d). Convergence?
$\min_w \frac{1}{m} \sum_i \ell(w^\top x_i, y_i)$
$w_{t+1} = w_t - \eta\, \ell'(w_t^\top x_{i_t}, y_{i_t})\, x_{i_t}$
The stochastic gradient has expectation = the full gradient.
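A minimal SGD sketch under the same assumed squared loss; each step touches a single example, so it costs $O(d)$:

```python
import numpy as np

# SGD: sample i_t uniformly, step on that example's gradient only.
def sgd_linear(X, y, eta=0.01, T=5000, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w, avg = np.zeros(d), np.zeros(d)
    for t in range(T):
        i = rng.integers(m)                    # i_t ~ U[1, ..., m]
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]  # l'(w.x_i, y_i) x_i, cost O(d)
        w -= eta * grad
        avg += (w - avg) / (t + 1)             # running average of iterates
    return avg

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5))
w = sgd_linear(X, y)
print(np.mean(np.sign(X @ w) == y))  # training accuracy
```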
Crucial for SGD: linearity of expectation and derivatives
Let $f(w) = \frac{1}{m} \sum_i \ell_i(w)$; then for $i_t \sim U[1, \dots, m]$ chosen uniformly at random, we have
$E[\nabla \ell_{i_t}(w)] = \sum_{i=1}^{m} \frac{1}{m} \nabla \ell_i(w) = \nabla\left(\frac{1}{m} \sum_i \ell_i(w)\right) = \nabla f(w)$
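A quick numeric confirmation of this identity (the squared loss and random data are assumptions):

```python
import numpy as np

# Averaging the per-example gradients recovers the full gradient.
rng = np.random.default_rng(3)
X, y = rng.normal(size=(50, 4)), rng.normal(size=50)
w = rng.normal(size=4)

full_grad = (2.0 / len(y)) * X.T @ (X @ w - y)
per_example = np.array([2.0 * (X[i] @ w - y[i]) * X[i] for i in range(len(y))])
assert np.allclose(per_example.mean(axis=0), full_grad)
```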
Greedy optimization: gradient descent
• Move in a random direction whose expectation is the steepest descent:
$x_{t+1} \leftarrow x_t - \eta \tilde{\nabla} f(x_t)$
• Denote by $\tilde{\nabla} f(w)$ a vector random variable whose expectation is the gradient:
$E[\tilde{\nabla} f(w)] = \nabla f(w)$
Stochastic gradient descent – constrained case
$y_{t+1} \leftarrow x_t - \eta \tilde{\nabla} f(x_t)$, where $E[\tilde{\nabla} f(x_t)] = \nabla f(x_t)$
$x_{t+1} = \arg\min_{x \in K} |y_{t+1} - x|$
Stochastic gradient descent – constrained set
Let:
• G = upper bound on the norm of the gradient estimators: $|\tilde{\nabla} f(x_t)| \le G$
• D = diameter of the constraint set: $\forall x, y \in K,\ |x - y| \le D$
Theorem: for step size $\eta = \frac{D}{G \sqrt{T}}$,
$E\left[f\left(\frac{1}{T} \sum_t x_t\right)\right] \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$
where the iterates follow
$y_{t+1} \leftarrow x_t - \eta \tilde{\nabla} f(x_t)$, $E[\tilde{\nabla} f(x_t)] = \nabla f(x_t)$
$x_{t+1} = \arg\min_{x \in K} |y_{t+1} - x|$
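A sketch of the projected update. The Euclidean-ball constraint set and its closed-form projection are assumptions for illustration; the theorem itself holds for any convex $K$ with a projection oracle:

```python
import numpy as np

# Projection onto K = {x : |x| <= R}, which has a closed form.
def project_ball(y, R=1.0):
    n = np.linalg.norm(y)
    return y if n <= R else (R / n) * y

def projected_sgd(stoch_grad, x0, eta, T, R=1.0):
    x, avg = x0.copy(), np.zeros_like(x0)
    for t in range(T):
        y_next = x - eta * stoch_grad(x)  # stochastic gradient step
        x = project_ball(y_next, R)       # x_{t+1} = argmin_{x in K} |y_{t+1} - x|
        avg += (x - avg) / (t + 1)
    return avg                            # theorem bounds E[f(average iterate)]

# Tiny demo: noisy gradients of f(x) = |x - (2, 0)|^2, constrained to |x| <= 1.
rng = np.random.default_rng(5)
target = np.array([2.0, 0.0])
noisy_grad = lambda x: 2 * (x - target) + rng.normal(scale=0.1, size=2)
print(projected_sgd(noisy_grad, np.zeros(2), eta=0.05, T=2000))  # approx (1, 0)
```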
Proof:
1. We have proved (for any sequence of vectors $\nabla_t$; here, the stochastic gradients):
$\frac{1}{T} \sum_t \nabla_t^\top x_t \le \min_{x^* \in K} \frac{1}{T} \sum_t \nabla_t^\top x^* + \frac{DG}{\sqrt{T}}$
2. By convexity of $f$ and properties of expectation (using $E[\tilde{\nabla} f(x_t) \mid x_t] = \nabla f(x_t)$):
$E\left[f\left(\frac{1}{T} \sum_t x_t\right) - \min_{x^* \in K} f(x^*)\right] \le \frac{1}{T} \sum_t E\left[\nabla f(x_t)^\top (x_t - x^*)\right] \le \frac{DG}{\sqrt{T}}$
for the update
$y_{t+1} \leftarrow x_t - \eta \tilde{\nabla} f(x_t)$, $E[\tilde{\nabla} f(x_t)] = \nabla f(x_t)$
$x_{t+1} = \arg\min_{x \in K} |y_{t+1} - x|$
Summary
• Mathematical & convex optimization
• Gradient descent algorithm, linear classification
• Stochastic gradient descent