
Page 1

Stochastic Optimization
Online and batch stochastic optimization methods

Taiji Suzuki†‡

†Tokyo Institute of Technology, Graduate School of Information Science and Engineering, Department of Mathematical and Computing Sciences
‡JST, PRESTO

Intensive course @ Nagoya University

1 / 78

Page 2

Outline

1 Online stochastic optimization
  - Stochastic gradient descent
  - Stochastic regularized dual averaging

2 Getting stochastic gradient methods faster
  - Bregman divergence and AdaGrad
  - Acceleration of stochastic gradient methods
  - Minimax optimality of first order online stochastic methods

3 Batch stochastic methods
  - Dual method: stochastic dual coordinate ascent
  - Primal method: SVRG, SAG and SAGA
  - Minimax optimality of first order batch stochastic methods

2 / 78

Page 3

Outline

1 Online stochastic optimization
  - Stochastic gradient descent
  - Stochastic regularized dual averaging

2 Getting stochastic gradient methods faster
  - Bregman divergence and AdaGrad
  - Acceleration of stochastic gradient methods
  - Minimax optimality of first order online stochastic methods

3 Batch stochastic methods
  - Dual method: stochastic dual coordinate ascent
  - Primal method: SVRG, SAG and SAGA
  - Minimax optimality of first order batch stochastic methods

3 / 78

Page 4

Two types of stochastic optimization

Online-type stochastic optimization:
- We observe data sequentially.
- We don't need to wait until the whole sample is obtained.
- Each observation is obtained just once (basically).

min_x E_Z[ℓ(Z, x)]

Batch-type stochastic optimization:
- The whole sample has already been observed.
- We can make use of the (finite and fixed) sample size.
- We may use each sample multiple times.

min_x (1/n) Σ_{i=1}^n ℓ(z_i, x)

4 / 78

Page 5

Online method

You don't need to wait until the whole sample arrives: the parameter is updated at each data observation.

[Figure: new data z_{t-1}, z_t, … arrive sequentially from P(z) (or from storage); the parameter is updated x_t → x_{t+1} after each observation.]

min_x E[ℓ(Z, x)] + ψ(x) ≃ min_x (1/T) Σ_{t=1}^T ℓ(z_t, x) + ψ(x)

5 / 78

Page 6

The objective of online stochastic optimization

Let ℓ(z, x) be the loss of x for an observation z.

(Expected loss) L(x) = E_Z[ℓ(Z, x)]

or

(Expected loss with regularization) L_ψ(x) = E_Z[ℓ(Z, x)] + ψ(x)

The distribution of Z could be
- the true population → L(x) is the generalization error;
- the empirical distribution of the data in storage → L (or L_ψ) is the (regularized) empirical risk.

Online stochastic optimization itself is learning!

6 / 78

Page 7

Outline

1 Online stochastic optimization
  - Stochastic gradient descent
  - Stochastic regularized dual averaging

2 Getting stochastic gradient methods faster
  - Bregman divergence and AdaGrad
  - Acceleration of stochastic gradient methods
  - Minimax optimality of first order online stochastic methods

3 Batch stochastic methods
  - Dual method: stochastic dual coordinate ascent
  - Primal method: SVRG, SAG and SAGA
  - Minimax optimality of first order batch stochastic methods

7 / 78

Page 8

Three steps to stochastic gradient descent

E[ℓ(Z, x)] = ∫ ℓ(Z, x) dP(Z)
  ≃(1) ℓ(z_t, x)                      (sampling)
  ≃(2) ⟨∇_x ℓ(z_t, x_{t-1}), x⟩       (linearization)

The approximation is accurate only around x_{t-1}.

min_x E[ℓ(Z, x)] ≃(3) min_x { ⟨∇_x ℓ(z_t, x_{t-1}), x⟩ + (1/(2η_t)) ‖x − x_{t-1}‖² }   (proximal step)

8 / 78

Page 9

Stochastic gradient descent (SGD)

SGD (without regularization)
- Observe z_t ∼ P(Z), and let ℓ_t(x) := ℓ(z_t, x).
- Calculate a subgradient: g_t ∈ ∂_x ℓ_t(x_{t-1}).
- Update x as x_t = x_{t-1} − η_t g_t.

We only need to observe one training example z_t at each iteration → O(1) computation per iteration (O(n) for batch gradient descent). We do not need to go through the whole sample {z_i}_{i=1}^n.

Reminder: prox(q|ψ) := argmin_x { ψ(x) + (1/2)‖x − q‖² }.

9 / 78
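The plain SGD update above can be sketched in a few lines of Python. This is a toy illustration of mine, not the slides' code: a squared loss on a synthetic 1-D stream with b = 3a + noise, so the minimizer of the expected loss is x* = 3, with step size η_t = η0/√t and Polyak-Ruppert averaging.

```python
import random

# Toy SGD sketch: loss l((a, b), x) = 0.5*(a*x - b)**2 on a synthetic stream.
random.seed(0)

def sgd(T=5000, eta0=0.5):
    x = 0.0
    xs = []
    for t in range(1, T + 1):
        a = random.choice([1.0, 2.0])
        b = 3.0 * a + random.gauss(0.0, 0.1)
        g = (a * x - b) * a            # (sub)gradient of the instantaneous loss at x_{t-1}
        x = x - eta0 / t ** 0.5 * g    # step size eta_t = eta0 / sqrt(t)
        xs.append(x)
    return sum(xs) / len(xs)           # Polyak-Ruppert average of the iterates

x_bar = sgd()
```

The averaged iterate lands close to the minimizer x* = 3 of the expected loss.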

Page 10

Stochastic gradient descent (SGD)

SGD (with regularization)
- Observe z_t ∼ P(Z), and let ℓ_t(x) := ℓ(z_t, x).
- Calculate a subgradient: g_t ∈ ∂_x ℓ_t(x_{t-1}).
- Update x as x_t = prox(x_{t-1} − η_t g_t | η_t ψ).

We only need to observe one training example z_t at each iteration → O(1) computation per iteration (O(n) for batch gradient descent). We do not need to go through the whole sample {z_i}_{i=1}^n.

Reminder: prox(q|ψ) := argmin_x { ψ(x) + (1/2)‖x − q‖² }.

9 / 78
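As a concrete illustration of the proximal mapping (mine, not from the slides): for ψ(x) = λ‖x‖₁ the prox is coordinatewise soft-thresholding, which is what lets proximal SGD keep iterates sparse.

```python
def prox_l1(q, lam):
    # prox(q | lam*||.||_1): coordinatewise soft-thresholding
    return [max(abs(v) - lam, 0.0) * (1.0 if v > 0 else -1.0) for v in q]

def prox_sgd_step(x, g, eta, lam):
    # One proximal SGD step: x_t = prox(x_{t-1} - eta*g_t | eta*lam*||.||_1)
    return prox_l1([xi - eta * gi for xi, gi in zip(x, g)], eta * lam)
```

Coordinates whose magnitude falls below the threshold eta*lam are set exactly to zero.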

Page 11

Convergence analysis of SGD

Assumption
(A1) E[‖g_t‖²] ≤ G².
(A2) E[‖x_t − x*‖²] ≤ D².

Theorem
Let x̄_T = (1/(T+1)) Σ_{t=0}^T x_t (Polyak-Ruppert averaging). For η_t = η0/√t, it holds

E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ (η0 G² + D²/η0)/√T.

For η0 = D/G, we have 2GD/√T.

This is minimax optimal (up to a constant).
G is independent of ψ thanks to the proximal mapping. Note that ‖∂ψ(x)‖ ≤ C√p for L1 regularization.

10 / 78

Page 12

Convergence analysis of SGD (strongly convex)

Assumption
(A1) E[‖g_t‖²] ≤ G².
(A3) L_ψ is µ-strongly convex.

Theorem
Let x̄_T = (1/(T+1)) Σ_{t=0}^T x_t. For η_t = 1/(µt), it holds

E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ G² log(T)/(µT).

Better than the non-strongly convex situation, but this is not minimax optimal. The bound is tight (Rakhlin et al., 2012).

11 / 78

Page 13

Polynomial averaging for strongly convex risk

Assumption
(A1) E[‖g_t‖²] ≤ G².
(A3) L_ψ is µ-strongly convex.

Modify the update rule as

x_t = prox( x_{t-1} − η_t (t/(t+1)) g_t | η_t ψ ),

and take the weighted average

x̄_T = (2/((T+1)(T+2))) Σ_{t=0}^T (t+1) x_t.

Theorem
For η_t = 2/(µt), it holds E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ 2G²/(µT).

log(T) is removed. This is minimax optimal (explained later).

12 / 78

Page 14

Remark on polynomial averaging

x̄_T = (2/((T+1)(T+2))) Σ_{t=0}^T (t+1) x_t

O(T) computation? No: x̄_T can be updated efficiently:

x̄_t = (t/(t+2)) x̄_{t-1} + (2/(t+2)) x_t.

13 / 78
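The recursion for the polynomial average can be checked numerically. This small sketch (mine, not the slides') compares the direct weighted sum with the recursive O(1)-per-step update.

```python
def poly_average_direct(xs):
    # xbar_T = 2/((T+1)(T+2)) * sum_{t=0}^T (t+1)*x_t
    T = len(xs) - 1
    return 2.0 / ((T + 1) * (T + 2)) * sum((t + 1) * x for t, x in enumerate(xs))

def poly_average_recursive(xs):
    # xbar_t = t/(t+2)*xbar_{t-1} + 2/(t+2)*x_t, starting from xbar_0 = x_0
    xbar = xs[0]
    for t in range(1, len(xs)):
        xbar = t / (t + 2) * xbar + 2 / (t + 2) * xs[t]
    return xbar
```

Both computations agree up to floating-point error, so the weighted average costs O(1) extra work per iteration.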

Page 15

General step size and weighting policy

Let s_t (t = 1, 2, …, T+1) be a positive sequence such that Σ_{t=1}^{T+1} s_t = 1.

x_t = prox( x_{t-1} − η_t (s_t/s_{t+1}) g_t | η_t ψ )   (t = 1, …, T),

x̄_T = Σ_{t=0}^T s_{t+1} x_t.

Assumption: (A1) E[‖g_t‖²] ≤ G², (A2) E[‖x_t − x*‖²] ≤ D², (A3) L_ψ is µ-strongly convex.

Theorem
E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)]
  ≤ Σ_{t=1}^T (s_{t+1} η_{t+1}/2) G² + Σ_{t=0}^{T-1} max{ s_{t+2}/η_{t+1} − s_{t+1}(1/η_t + µ), 0 } (D²/2).

As for t = 0, we set 1/η_0 = 0.

14 / 78

Page 16

Special case

Let the weights be proportional to the step sizes (the step size can be seen as an importance):

s_t = η_t / Σ_{τ=1}^{T+1} η_τ.

In this setting, the previous theorem gives

E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ (Σ_{t=1}^T η_t² G² + D²) / (2 Σ_{t=1}^T η_t).

Σ_{t=1}^∞ η_t = ∞ and Σ_{t=1}^∞ η_t² < ∞ ensure the convergence.

15 / 78
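A toy check of the two step-size conditions (my illustration, not from the slides): η_t = t^{-0.75} satisfies both Ση_t = ∞ and Ση_t² < ∞ (whereas η_t = 1/√t does not satisfy the second, since Σ1/t diverges). We run SGD on E[0.5(x − Z)²] with Z ~ N(2, 1), whose minimizer is x* = 2, and track the η-weighted average iterate as on this slide.

```python
import random

# SGD with eta_t = t**-0.75 and the eta-proportional weighting s_t.
random.seed(0)
x, wsum, xbar = 10.0, 0.0, 0.0
for t in range(1, 20001):
    eta = t ** -0.75
    g = x - random.gauss(2.0, 1.0)     # stochastic gradient of the instantaneous loss
    x -= eta * g
    wsum += eta
    xbar += (eta / wsum) * (x - xbar)  # running eta-weighted average of the iterates
```

The weighted average converges to the minimizer x* = 2 despite the gradient noise.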

Page 17

Outline

1 Online stochastic optimization
  - Stochastic gradient descent
  - Stochastic regularized dual averaging

2 Getting stochastic gradient methods faster
  - Bregman divergence and AdaGrad
  - Acceleration of stochastic gradient methods
  - Minimax optimality of first order online stochastic methods

3 Batch stochastic methods
  - Dual method: stochastic dual coordinate ascent
  - Primal method: SVRG, SAG and SAGA
  - Minimax optimality of first order batch stochastic methods

16 / 78

Page 18

Stochastic regularized dual averaging (SRDA)

The second assumption E[‖x_t − x*‖²] ≤ D² can be removed by using dual averaging (Nesterov, 2009; Xiao, 2009).

SRDA
- Observe z_t ∼ P(Z), and let ℓ_t(x) := ℓ(z_t, x).
- Calculate a subgradient: g_t ∈ ∂_x ℓ_t(x_{t-1}).
- Take the average of the gradients: ḡ_t = (1/t) Σ_{τ=1}^t g_τ.
- Update as

  x_t = argmin_{x∈R^p} { ⟨ḡ_t, x⟩ + ψ(x) + (1/(2η_t))‖x‖² } = prox(−η_t ḡ_t | η_t ψ).

The information of old observations is maintained by taking the average of the gradients.

17 / 78
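A minimal SRDA sketch (mine, not the slides' code) for ψ(x) = λ‖x‖₁, where prox(−η_t ḡ_t | η_t λ‖·‖₁) is coordinatewise soft-thresholding. `sample_grad` is an assumed user-supplied stochastic (sub)gradient oracle, and η_t = η0√t as in the analysis on the next slide.

```python
import random

def srda_l1(sample_grad, dim, T, eta0=0.5, lam=0.1):
    # SRDA with psi = lam*||.||_1: average all past gradients, then
    # x_t = prox(-eta_t*gbar_t | eta_t*lam*||.||_1) with eta_t = eta0*sqrt(t).
    x = [0.0] * dim
    gbar = [0.0] * dim
    for t in range(1, T + 1):
        g = sample_grad(x)
        gbar = [gb + (gi - gb) / t for gb, gi in zip(gbar, g)]  # running gradient mean
        eta = eta0 * t ** 0.5
        # soft-threshold -eta*gbar at level eta*lam
        x = [max(eta * abs(gb) - eta * lam, 0.0) * (1.0 if gb < 0 else -1.0)
             for gb in gbar]
    return x

# Usage: loss 0.5*||x - z||^2 with z ~ N((2, 0.05), 0.5^2 I); the regularized
# optimum soft-thresholds the mean at lam = 0.1, giving roughly (1.9, 0).
random.seed(1)
x = srda_l1(lambda x: [x[0] - random.gauss(2.0, 0.5),
                       x[1] - random.gauss(0.05, 0.5)], 2, 5000)
```

Note how the small second coordinate is driven to (near) zero by the averaged-gradient thresholding.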

Page 19

Convergence analysis of SRDA

Assumption
(A1) E[‖g_t‖²] ≤ G².

Theorem
Let x̄_T = (1/(T+1)) Σ_{t=0}^T x_t. For η_t = η0√t, it holds

E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ (η0 G² + ‖x* − x0‖²/η0)/√T.

If ‖x* − x0‖ ≤ R, then for η0 = R/G, we have 2RG/√T.

This is minimax optimal (up to a constant). The norm of the intermediate solutions x_t is well controlled, so (A2) is not required.

18 / 78

Page 20

Convergence analysis of SRDA (strongly convex)

Assumption
(A1) E[‖g_t‖²] ≤ G².
(A3) ψ is µ-strongly convex.

Modify the update rule as

ḡ_t = (2/((t+1)(t+2))) Σ_{τ=1}^t τ g_τ,   x_t = prox(−η_t ḡ_t | η_t ψ),

and take the weighted average x̄_T = (2/((T+1)(T+2))) Σ_{t=0}^T (t+1) x_t.

Theorem
For η_t = (t+1)(t+2)/ξ, it holds

E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)] ≤ ξ‖x* − x0‖²/T² + 2G²/(µT).

ξ → 0 yields 2G²/(µT): minimax optimal.

19 / 78

Page 21

General convergence analysis

Let s_t > 0 (t = 1, 2, …) be any positive sequence. We generalize the update rule as

ḡ_t = (Σ_{τ=1}^t s_τ g_τ) / (Σ_{τ=1}^{t+1} s_τ),
x_t = prox(−η_t ḡ_t | η_t ψ)   (t = 1, …, T).

Let the weighted average of (x_t)_t be x̄_T = (Σ_{τ=0}^T s_{τ+1} x_τ) / (Σ_{τ=0}^T s_{τ+1}).

Assumption: (A1) E[‖g_t‖²] ≤ G², (A3) L_ψ is µ-strongly convex (µ can be 0).

Theorem
Suppose that η_t/(Σ_{τ=1}^{t+1} s_τ) is non-decreasing. Then

E_{z_{1:T}}[L_ψ(x̄_T) − L_ψ(x*)]
  ≤ (1/Σ_{t=1}^{T+1} s_t) ( Σ_{t=1}^{T+1} s_t²/(2(Σ_{τ=1}^t s_τ)(µ + 1/η_{t-1})) G² + (Σ_{t=1}^{T+2} s_t)/(2η_{T+1}) ‖x* − x0‖² ).

20 / 78

Page 22

Computational cost and generalization error

The optimal rate for a strongly convex expected risk (generalization error) is O(1/n) (n is the sample size). To achieve O(1/n) generalization error, we need to decrease the training error to O(1/n).

                                       Normal gradient descent | SGD
Time per iteration:                            n               |  1
Number of iterations until ε error:            log(1/ε)        |  1/ε
Time until ε error:                            n log(1/ε)      |  1/ε
Time until 1/n error:                          n log(n)        |  n

(Bottou, 2010)

SGD is O(log(n)) times faster with respect to the generalization error.

21 / 78

Page 23

Typical behavior

[Figure: training error and generalization error vs. elapsed time (s), comparing SGD and the batch method.]

Normal gradient descent vs. SGD. Logistic regression with L1 regularization: n = 10,000, p = 2.

SGD decreases the objective rapidly; after a while, the batch gradient method catches up and slightly surpasses it.

22 / 78

Page 24

Outline

1 Online stochastic optimization
  - Stochastic gradient descent
  - Stochastic regularized dual averaging

2 Getting stochastic gradient methods faster
  - Bregman divergence and AdaGrad
  - Acceleration of stochastic gradient methods
  - Minimax optimality of first order online stochastic methods

3 Batch stochastic methods
  - Dual method: stochastic dual coordinate ascent
  - Primal method: SVRG, SAG and SAGA
  - Minimax optimality of first order batch stochastic methods

23 / 78

Page 25

Outline

1 Online stochastic optimization
  - Stochastic gradient descent
  - Stochastic regularized dual averaging

2 Getting stochastic gradient methods faster
  - Bregman divergence and AdaGrad
  - Acceleration of stochastic gradient methods
  - Minimax optimality of first order online stochastic methods

3 Batch stochastic methods
  - Dual method: stochastic dual coordinate ascent
  - Primal method: SVRG, SAG and SAGA
  - Minimax optimality of first order batch stochastic methods

24 / 78

Page 26

Changing the metric (divergence)

min_x L(x) + ψ(x)

x^(t) = argmin_{x∈R^p} { ⟨g_t, x⟩ + ψ(x) + (1/(2η))‖x − x^(t-1)‖² }

25 / 78

Page 27

Changing the metric (divergence)

min_x L(x) + ψ(x)

x^(t) = argmin_{x∈R^p} { ⟨g_t, x⟩ + ψ(x) + (1/(2η))‖x − x^(t-1)‖²_{H_t} },   ‖x‖²_H := x⊤Hx.

25 / 78

Page 28

Changing the metric (divergence)

min_x L(x) + ψ(x)

x^(t) = argmin_{x∈R^p} { ⟨g_t, x⟩ + ψ(x) + (1/(2η))‖x − x^(t-1)‖²_{H_t} },   ‖x‖²_H := x⊤Hx.

Choice of H_t:
- Hessian H_t = ∇∇⊤L(x^(t-1)): Newton method.
- Fisher information matrix H_t = E_{Z|x^(t-1)}[−∇_x∇_x⊤ log p_x(Z)|_{x=x^(t-1)}]: natural gradient (x is the parameter of a parametric model p_x).

c.f. Bregman divergence:

B_ϕ(x‖x′) := ϕ(x) − ϕ(x′) − ⟨∇ϕ(x′), x − x′⟩  →  mirror descent.

25 / 78
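Replacing the squared distance by a Bregman divergence B_ϕ gives mirror descent. A classic instance (my illustration, not from the slides): taking ϕ as the negative entropy on the probability simplex yields the exponentiated-gradient update.

```python
import math

def eg_step(x, g, eta):
    # One mirror-descent step with the negative-entropy Bregman divergence
    # on the probability simplex (exponentiated gradient):
    # x_j <- x_j * exp(-eta * g_j), then renormalize so the entries sum to 1.
    w = [xj * math.exp(-eta * gj) for xj, gj in zip(x, g)]
    s = sum(w)
    return [wj / s for wj in w]
```

The multiplicative update keeps the iterate inside the simplex without any projection step, which is the point of matching the divergence to the geometry of the domain.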

Page 29

26 / 78

Page 30

AdaGrad (Duchi et al. (2011))

Let

H_t = G_t^{1/2} + δI

for some δ ≥ 0, where G_t is either of the following:

(Full) G_t = Σ_{τ=1}^t g_τ g_τ⊤,
(Diag) G_t = diag( Σ_{τ=1}^t g_τ g_τ⊤ ).

AdaGrad stretches flat directions and shrinks steep directions.

Ada-SGD:

x^(t) = argmin_{x∈R^p} { ⟨g_t, x⟩ + ψ(x) + (1/(2η))‖x − x^(t-1)‖²_{H_t} }.

Ada-SRDA: for ḡ_t = (1/t) Σ_{τ=1}^t g_τ,

x^(t) = argmin_x { ⟨ḡ_t, x⟩ + ψ(x) + (1/(2tη))‖x‖²_{H_t} }.

27 / 78
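The Diag variant is the one commonly used in practice. Here is a minimal sketch of mine (no regularization term): each coordinate accumulates its own squared gradients, so steep coordinates get small steps and flat ones get large steps.

```python
import math

def adagrad_diag_step(x, g, state, eta=0.1, delta=1e-8):
    # One diagonal AdaGrad step: accumulate coordinatewise squared gradients
    # (the diagonal of G_t), then x <- x - eta * g / (sqrt(G_t) + delta).
    state = [s + gj * gj for s, gj in zip(state, g)]
    x = [xj - eta * gj / (math.sqrt(s) + delta)
         for xj, gj, s in zip(x, g, state)]
    return x, state

# Usage on a badly scaled quadratic f(x) = 0.5*(x_1^2 + 100*x_2^2): after the
# per-coordinate normalization, both coordinates make comparable progress.
x, state = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    grad = [x[0], 100.0 * x[1]]
    x, state = adagrad_diag_step(x, grad, state)
```

Without the normalization, a single step size would either crawl along x_1 or diverge along x_2.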

Page 31

Analysis of AdaGrad

Theorem
Let q = 2 for Full, and q = ∞ for Diag. Define the regret as

Q(T) := (1/T) Σ_{t=1}^T ( ℓ_{t+1}(x^(t)) + ψ(x^(t)) − ℓ_t(x*) − ψ(x*) ).

Ada-SGD: ∀δ ≥ 0,

Q(T) ≤ (δ/(Tη))‖x*‖₂² + (max_{t≤T}‖x* − x^(t)‖_q²/η + 2η)/(2T) · tr[G_T^{1/2}].

Ada-SRDA: for δ ≥ max_t ‖g_t‖₂,

Q(T) ≤ (δ/(Tη))‖x*‖₂² + (‖x*‖_q²/η + 2η)/(2T) · tr[G_T^{1/2}].

28 / 78

Page 32

Analysis of AdaGrad

Suppose the gradient is unbalanced:

|g_{t,j}|² ≤ G j^{-2}   (j = 1, …, p, ∀t).

(Ada-SGD)      E[L(x^(T)) − L(x*)] ≤ C log(p)/√T.
(ordinary SGD) E[L(x^(T)) − L(x*)] ≤ C G E[max_t ‖x^(t)‖]/√T ≤ C√p/√T.

√p → log(p): a big improvement.

29 / 78

Page 33

AdaGrad is used in various applications, including sparse learning and deep learning.

In deep learning we often encounter a phenomenon called a plateau: the iterate gets stuck in a flat region. It is hard to escape a plateau by standard SGD; AdaGrad adaptively adjusts the search metric to get out of it.

AdaGrad is one of the standard optimization methods for deep learning. Related methods: AdaDelta (Zeiler, 2012), RMSProp (Tieleman and Hinton, 2012), Adam (Kingma and Ba, 2014).

30 / 78

Page 34

Outline

1 Online stochastic optimization
  - Stochastic gradient descent
  - Stochastic regularized dual averaging

2 Getting stochastic gradient methods faster
  - Bregman divergence and AdaGrad
  - Acceleration of stochastic gradient methods
  - Minimax optimality of first order online stochastic methods

3 Batch stochastic methods
  - Dual method: stochastic dual coordinate ascent
  - Primal method: SVRG, SAG and SAGA
  - Minimax optimality of first order batch stochastic methods

31 / 78

Page 35

Nesterov's acceleration of SGD

Assumption:
- the expected loss L(x) is γ-smooth;
- the variance of the gradient is bounded by σ²:

  E_Z[‖∇_x ℓ(Z, x) − ∇L(x)‖²] ≤ σ².

→ Combining with Nesterov's acceleration, the convergence can be made faster.

- Acceleration for SGD: Hu et al. (2009)
- Acceleration for SRDA: Xiao (2010), Chen et al. (2012)
- General method and analysis (including non-convex): Lan (2012), Ghadimi and Lan (2012, 2013)

E_{z_{1:T}}[L_ψ(x^(T))] − L_ψ(x*) ≤ C( σD/√T + D²γ/T² )

(D is the diameter: E[‖x^(t) − x*‖²] ≤ D² (∀t))

32 / 78

Page 36

Speed up of accelerated SGD

E_{z_{1:T}}[L_ψ(x^(T))] − L_ψ(x*) ≤ C( σD/√T + D²γ/T² )

σ² is the variance of the gradient estimate:

E_Z[‖∇_x ℓ(Z, x) − ∇L(x)‖²] ≤ σ².

The variance can be reduced by simply taking an average:

g = ∇ℓ(z, x^(t-1))  ⇒  ḡ = (1/K) Σ_{k=1}^K ∇ℓ(z_k, x^(t-1)),

(Variance) σ²  ⇒  σ²/K.

Computing the K independent gradients can be parallelized. As σ → 0, the bound approaches O(1/T²): non-stochastic Nesterov acceleration.

33 / 78
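The σ² → σ²/K variance reduction can be checked empirically. In this sketch of mine the "stochastic gradient" is the true gradient 0 plus standard normal noise per sample, and we compare the empirical variance of the 1-sample and 16-sample estimates.

```python
import random

# Averaging K independent stochastic gradients divides the estimator's variance by K.
random.seed(0)

def grad_estimate(K):
    # one mini-batch gradient estimate: mean of K noisy samples of the true gradient 0
    return sum(random.gauss(0.0, 1.0) for _ in range(K)) / K

def empirical_var(K, reps=2000):
    vals = [grad_estimate(K) for _ in range(reps)]
    m = sum(vals) / reps
    return sum((v - m) ** 2 for v in vals) / reps

v1, v16 = empirical_var(1), empirical_var(16)
```

With K = 16 the empirical variance drops to roughly 1/16 of the single-sample variance, and in practice the K samples can be processed in parallel.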

Page 37

[Figure: (a) objective for L1, (b) objective for Elastic-net.]

Numerical comparison on synthetic data with (a) L1 regularization (Lasso) and (b) Elastic-net regularization (figure from Chen et al. (2012)).

SAGE: accelerated SGD (Hu et al., 2009); AC-RDA: accelerated stochastic RDA (Xiao, 2010); AC-SA: accelerated stochastic approximation (Ghadimi and Lan, 2012); ORDA: optimal stochastic RDA (Chen et al., 2012).

34 / 78

Page 38

Accelerated SA for strongly convex objective

Assumption: the objective is µ-strongly convex and γ-smooth.

Accelerated stochastic approximation (Hu et al., 2009; Ghadimi and Lan, 2012):

E_{z_{1:T}}[L_ψ(x^(T))] − L_ψ(x*) ≤ C( σ²/(µT) + γR²/T² ).

Multi-stage accelerated stochastic approximation (Chen et al., 2012; Ghadimi and Lan, 2013):

E_{z_{1:T}}[L_ψ(x^(T))] − L_ψ(x*) ≤ C( σ²/(µT) + exp(−C√(µ/γ) T) ).

σ = 0 gives the batch optimal rate.

35 / 78

Page 39

Summary of convergence rates

Online methods (expected risk minimization):

- GR/√T (non-smooth, non-strongly convex): Polyak-Ruppert averaging
- G²/(µT) (non-smooth, strongly convex): polynomial averaging
- σR/√T + R²L/T² (smooth, non-strongly convex): acceleration
- σ²/(µT) + exp(−√(µ/L) T) (smooth, strongly convex): acceleration

G: upper bound on the norm of the gradient; R: diameter of the domain; L: smoothness; µ: strong convexity; σ: variance of the gradient.

36 / 78

Page 40

Outline

1 Online stochastic optimization
  - Stochastic gradient descent
  - Stochastic regularized dual averaging

2 Getting stochastic gradient methods faster
  - Bregman divergence and AdaGrad
  - Acceleration of stochastic gradient methods
  - Minimax optimality of first order online stochastic methods

3 Batch stochastic methods
  - Dual method: stochastic dual coordinate ascent
  - Primal method: SVRG, SAG and SAGA
  - Minimax optimality of first order batch stochastic methods

37 / 78

Page 41

Minimax optimal rate of stochastic first order methods

min_{x∈B} L(x) = min_{x∈B} E_Z[ℓ(Z, x)]

Condition:
- g_x ∈ ∂_x ℓ(Z, x) is bounded as ‖E[g_x]‖ ≤ G (∀x ∈ B);
- the domain B contains a ball with radius R;
- L(x) is µ-strongly convex (µ = 0 is allowed).

Theorem (Minimax optimality (Agarwal et al., 2012; Nemirovsky and Yudin, 1983))
For any first order algorithm, there exist a loss function ℓ and a distribution P(Z) satisfying the assumptions on which the algorithm must suffer

E[L(x^(T)) − L(x*)] ≥ c min{ GR/√T, G²/(µT), GR√p }.

SGD and SRDA achieve this optimal rate.
First order algorithm: an algorithm that depends only on the loss and its (sub)gradient (ℓ(Z, x), g_x) at a query point x. (SGD and SRDA are included.)

38 / 78

Page 42

Outline

1 Online stochastic optimization
  - Stochastic gradient descent
  - Stochastic regularized dual averaging

2 Getting stochastic gradient methods faster
  - Bregman divergence and AdaGrad
  - Acceleration of stochastic gradient methods
  - Minimax optimality of first order online stochastic methods

3 Batch stochastic methods
  - Dual method: stochastic dual coordinate ascent
  - Primal method: SVRG, SAG and SAGA
  - Minimax optimality of first order batch stochastic methods

39 / 78

Page 43

From expectation to finite sum

Online:

P(x) = E[ℓ(Z, x)] = ∫ ℓ(Z, x) dP(Z)

Batch:

P(x) = (1/n) Σ_{i=1}^n ℓ(z_i, x)

40 / 78

Page 44

From online to batch

In the batch setting the data are fixed; we just minimize the objective function

P(x) = (1/n) Σ_{i=1}^n ℓ_i(x) + ψ(x).

We construct a method that
- uses few observations per iteration (like an online method),
- converges linearly (unlike online methods):

  T > (n + γ/λ) log(1/ε)

iterations suffice to achieve ε accuracy for a γ-smooth loss and λ-strongly convex regularization.

41 / 78

Page 45

Three methods that must be remembered

- Stochastic Average Gradient descent, SAG (Le Roux et al., 2012; Schmidt et al., 2013; Defazio et al., 2014)
- Stochastic Variance Reduced Gradient descent, SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014)
- Stochastic Dual Coordinate Ascent, SDCA (Shalev-Shwartz and Zhang, 2013a)

SAG and SVRG are methods performed on the primal; SDCA works on the dual.

42 / 78

Page 46

Assumptions

P(x) = (1/n) Σ_{i=1}^n ℓ_i(x) + ψ(x),  where the loss part is smooth and the regularizer is strongly convex.

Assumption:
- ℓ_i: the loss is γ-smooth.
- ψ: the regularization function is λ-strongly convex. Typically λ = O(1/n) or O(1/√n).

Examples:
- Loss functions: smoothed hinge loss, logistic loss.
- Regularization functions: L2 regularization, elastic-net regularization, ψ(x) + λ‖x‖² (with small λ).

43 / 78

Page 47

Outline

1 Online stochastic optimization
  - Stochastic gradient descent
  - Stochastic regularized dual averaging

2 Getting stochastic gradient methods faster
  - Bregman divergence and AdaGrad
  - Acceleration of stochastic gradient methods
  - Minimax optimality of first order online stochastic methods

3 Batch stochastic methods
  - Dual method: stochastic dual coordinate ascent
  - Primal method: SVRG, SAG and SAGA
  - Minimax optimality of first order batch stochastic methods

44 / 78

Page 48

Coordinate Descent

45 / 78

Page 49

Note on CD

[Figure: two contour plots — left: CD fails; right: CD succeeds.]

Left-hand side: CD fails, since there is no descent direction along any single coordinate.

To make CD succeed, the objective should admit coordinate-wise descent directions; ideally it is separable: f(x) = Σ_{j=1}^p f_j(x_j).

46 / 78

Page 50

Coordinate descent in primal

min_x P(x) = min_x f(x) + ψ(x) = min_x f(x) + Σ_{j=1}^p ψ_j(x_j)

Coordinate descent (sketch)
1. Choose j ∈ {1, …, p} in some way (typically, a random choice).
2. The j-th coordinate x_j is updated so that the objective is decreased.

Usually a block of coordinates is updated instead of one coordinate (block coordinate descent).

47 / 78

Page 51

Coordinate descent in primal

min_x P(x) = min_x f(x) + ψ(x) = min_x f(x) + Σ_{j=1}^p ψ_j(x_j)

Coordinate descent (sketch)
1. Choose j ∈ {1, …, p} in some way (typically, a random choice).
2. The j-th coordinate x_j is updated so that the objective is decreased, e.g.,

   x_j^(t) ← argmin_{x_j} P(x_1^(t-1), …, x_j, …, x_p^(t-1)),

   or, for g_j = ∂f(x^(t-1))/∂x_j,

   x_j^(t) ← argmin_{x_j} { ⟨g_j, x_j⟩ + ψ_j(x_j) + (1/(2η_t))‖x_j − x_j^(t-1)‖² }.

Usually a block of coordinates is updated instead of one coordinate (block coordinate descent).

47 / 78
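To make the exact coordinate update concrete, here is a cyclic coordinate-descent sketch of mine (not the slides' code) for the Lasso, where each 1-D subproblem has a closed-form soft-thresholding solution.

```python
def cd_lasso(A, b, lam, iters=200):
    # Cyclic coordinate descent for  min_x 0.5/n * ||A x - b||^2 + lam * ||x||_1.
    # Each coordinate update solves the 1-D problem exactly by soft-thresholding.
    n, p = len(A), len(A[0])
    x = [0.0] * p
    # residual r = b - A x, maintained incrementally
    r = [bi - sum(aij * xj for aij, xj in zip(Ai, x)) for Ai, bi in zip(A, b)]
    for _ in range(iters):
        for j in range(p):
            aj = [Ai[j] for Ai in A]
            sj = sum(a * a for a in aj) / n                  # 1/n * ||a_j||^2
            # rho = 1/n * a_j^T (b - A x + a_j x_j): the 1-D problem's linear term
            rho = sum(a * ri for a, ri in zip(aj, r)) / n + sj * x[j]
            xj_new = max(abs(rho) - lam, 0.0) * (1.0 if rho > 0 else -1.0) / sj
            if xj_new != x[j]:
                d = xj_new - x[j]
                r = [ri - a * d for ri, a in zip(r, aj)]     # keep r = b - A x
                x[j] = xj_new
    return x
```

On a problem with orthogonal columns each coordinate converges in a single sweep; in general the sweeps are repeated until the iterates stabilize.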

Page 52

Convergence of primal CD method

We consider a separable regularization:

min_x P(x) = min_x f(x) + ψ(x) = min_x f(x) + Σ_{j=1}^p ψ_j(x_j).

Assumption: f is γ-smooth (‖∇f(x) − ∇f(x′)‖ ≤ γ‖x − x′‖).

- Cyclic choice (Saha and Tewari, 2013; Beck and Tetruashvili, 2013):

  P(x^(t)) − P(x*) ≤ γp‖x^(0) − x*‖²/(2t) = O(1/t) (with isotonicity).

- Random choice (Nesterov, 2012; Richtarik and Takac, 2014):
  - no acceleration: O(1/t);
  - Nesterov's acceleration: O(1/t²) (Fercoq and Richtarik, 2013);
  - f is α-strongly convex: O(exp(−C(α/γ)t));
  - f is α-strongly convex + acceleration: O(exp(−C√(α/γ) t)) (Lin et al., 2014).

A nice review is given by Wright (2015).

48 / 78

Page 53

Stochastic Dual Coordinate Ascent, SDCA

Suppose that there exist f_i: R → R such that ℓ(z_i, x) = f_i(a_i⊤x). Let A = [a_1, …, a_n].

(Primal)  inf_{x∈R^p} { (1/n) Σ_{i=1}^n f_i(a_i⊤x) + ψ(x) }

[Fenchel's duality theorem]

inf_{x∈R^p} { f(A⊤x) + nψ(x) } = − inf_{y∈R^n} { f*(y) + nψ*(−Ay/n) }

(Dual)  inf_{y∈R^n} { (1/n) Σ_{i=1}^n f_i*(y_i) + ψ*(−(1/n)Ay) }

We used the following facts:
- For f(α) = Σ_{i=1}^n f_i(α_i), we have f*(β) = Σ_{i=1}^n f_i*(β_i).
- For ψ̃(x) = nψ(x), we have ψ̃*(y) = nψ*(y/n).

49 / 78


Remarks

  (Dual)  inf_{y∈R^n} (1/n) Σ_{i=1}^n f_i*(y_i) + ψ*(−Ay/n)

- The dual loss term Σ_{i=1}^n f_i*(y_i) is separable.
- Each coordinate y_i affects the objective through only the i-th data point:

    f_i*(y_i),   ψ*(−(1/n)(a_1 y_1 + ··· + a_i y_i + ··· + a_n y_n)).

  → Coordinate descent behaves like online methods!
- The loss f_i is smooth ⇔ f_i* is strongly convex.
- The regularization function ψ is strongly convex ⇔ ψ* is smooth.

Algorithm of SDCA

SDCA (Shalev-Shwartz and Zhang, 2013a)

Iterate the following for t = 1, 2, ...:
1. Pick an index i ∈ {1, ..., n} uniformly at random.
2. Update the i-th coordinate y_i so that the objective function is decreased
   (let A_{\i} = [a_1, ..., a_{i−1}, a_{i+1}, ..., a_n] and y_{\i} = (y_j)_{j≠i}):

     y_i^(t) ∈ argmin_{y_i∈R} { f_i*(y_i) + nψ*(−(1/n)(a_i y_i + A_{\i} y_{\i}^(t−1))) + (1/(2η))|y_i − y_i^(t−1)|² },
     y_j^(t) = y_j^(t−1)  (for j ≠ i).

Algorithm of SDCA

SDCA (linearized version) (Shalev-Shwartz and Zhang, 2013a)

Iterate the following for t = 1, 2, ...:
1. Pick an index i ∈ {1, ..., n} uniformly at random.
2. Calculate x^(t−1) = ∇ψ*(−Ay^(t−1)/n).
3. Update the i-th coordinate y_i:

     y_i^(t) ∈ argmin_{y_i∈R} { f_i*(y_i) − ⟨x^(t−1), a_i y_i⟩ + (1/(2η))|y_i − y_i^(t−1)|² }
             = prox(y_i^(t−1) + η a_i^⊤ x^(t−1) | η f_i*),
     y_j^(t) = y_j^(t−1)  (for j ≠ i).

- If the regularization function ψ is λ-strongly convex, ψ* is (1/λ)-smooth and thus differentiable: x^(t) = ∇ψ*(−Ay^(t)/n).
- x^(t) is actually the primal variable.
- The computational complexity per iteration is the same as that of online methods!
- Important relation (Moreau decomposition): prox(q|g*) = q − prox(q|g), so the dual prox can be computed via the primal prox.
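To make the linearized update concrete, here is a minimal sketch (not the authors' implementation) for ridge regression, where both conjugates have closed forms; all names and the toy data are illustrative.

```python
import numpy as np

def sdca_ridge(A, b, lam, T):
    """Linearized SDCA sketch for min_x (1/n) sum_i (a_i^T x - b_i)^2/2 + (lam/2)||x||^2.

    Here f_i(u) = (u - b_i)^2/2, so f_i*(y) = y^2/2 + b_i * y, and
    psi(x) = (lam/2)||x||^2, so grad psi*(v) = v / lam.
    """
    n, p = A.shape
    y = np.zeros(n)                            # dual variables
    R2 = np.max(np.sum(A ** 2, axis=1))        # R^2 = max_i ||a_i||^2
    eta = lam * n / R2                         # step size from the convergence theorem
    rng = np.random.default_rng(0)
    for t in range(T):
        x = -A.T @ y / (n * lam)               # x = grad psi*(-A y / n)
        i = rng.integers(n)
        # prox(y_i + eta * a_i^T x | eta * f_i*) has a closed form here:
        y[i] = (y[i] + eta * (A[i] @ x - b[i])) / (1.0 + eta)
    return -A.T @ y / (n * lam)                # primal solution from the dual

# toy usage
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
b = A @ np.ones(5) + 0.1 * rng.standard_normal(200)
x_hat = sdca_ridge(A, b, lam=0.1, T=20000)
```

At the fixed point y_i = a_i^⊤ x − b_i and x = −A^⊤y/(nλ), which is exactly the primal optimality condition of the ridge problem.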

Convergence analysis of SDCA

Assumption:
- f_i is γ-smooth.
- ψ is λ-strongly convex.

Theorem
Suppose there exists R such that ∥a_i∥ ≤ R (∀i). Then, for η = λn/R², we have

  E[P(x^(T)) − D(y^(T))] ≤ (n + R²γ/λ) exp(−T/(n + R²γ/λ)) (D(y^(0)) − D(y*)),

where E[·] is taken w.r.t. the random choice of coordinates.

- Linear convergence!
- Required number of iterations to achieve accuracy ϵ:

    T ≥ C (n + R²γ/λ) log((n + γ/λ)/ϵ).

Comparison with the non-stochastic method

How much computation is required to achieve E[P(x^(T)) − P(x*)] ≤ ϵ?
Let κ = γ/λ (the condition number).

SDCA: (n + κ) log((n + κ)/ϵ)
  Ω((n + κ) log(1/ϵ)) iterations × Ω(1) computation per iteration.

Non-stochastic first-order method: nκ log(1/ϵ)
  Ω(κ log(1/ϵ)) iterations × Ω(n) computation per iteration.

Example: sample size n = 100,000, regularization parameter λ = 1/1000, smoothness γ = 1 (so κ = 1000):

  n × κ = 10^8,   n + κ ≈ 10^5.

(Figure) Numerical comparison between SDCA, SDCA-Perm (randomly shuffled cyclic), and SGD; figure from Shalev-Shwartz and Zhang (2013a).

Nesterov's acceleration of SDCA

Accelerated SDCA (Lin et al., 2014)

Set α = (1/n)√(λ/γ). Iterate for t = 1, 2, ...:
1. y^(t) = (y^(t−1) + α w^(t−1))/(1 + α).
2. Pick an index i ∈ {1, ..., n} uniformly at random.
3. Calculate x^(t−1) = ∇ψ*(−Ay^(t−1)/n).
4. Update the i-th coordinate:

     w_i^(t) ∈ argmin_{w_i∈R} { f_i*(w_i) − ⟨x^(t−1), a_i w_i⟩ + (αn/(2γ))|w_i − y_i^(t) − (1 − α)w_i^(t−1)|² },
     w_j^(t) = (1 − α)w_j^(t−1) + y_j^(t)  (for j ≠ i).

5. y_i^(t) ← y_i^(t) + nα(w_i^(t) − (1 − α)w_i^(t−1) − y_i^(t)),
   y_j^(t) unchanged (for j ≠ i).

Shalev-Shwartz and Zhang (2014) also proposed a double-loop acceleration method.

Convergence of accelerated SDCA

Assume ∥a_i∥ ≤ R (∀i); ∥A∥ denotes the spectral norm of A.

Theorem (convergence of accelerated SDCA)
If

  T ≥ (n + √(γnR²/λ)) log(Cγ∥A∥²/(λnϵ)),

then the duality gap satisfies E[P(x^(T)) − D(y^(T))] ≤ ϵ.

Comparison of iteration complexities:

  (normal)       (n + γ/λ) log((n + κ)/ϵ)
  (accelerated)  (n + √(γn/λ)) log((n + κ)/ϵ)

Mini-batch SDCA

Instead of choosing one coordinate y_i, we may choose a block of coordinates y_I, where I ⊆ {1, ..., n}.
Typically, {1, ..., n} is divided into K equally sized groups I_1, ..., I_K such that

  |I_k| = n/K,   ∪_k I_k = {1, ..., n},   I_k ∩ I_{k′} = ∅ (k ≠ k′).

Mini-batch technique (Takac et al., 2013; Shalev-Shwartz and Zhang, 2013b).
If K = n, we observe only one data point at each iteration.

Mini-batch SDCA

Mini-batch SDCA (stochastic block coordinate descent)

For t = 1, 2, ..., iterate the following:
1. Randomly pick a mini-batch I ⊆ {1, ..., n} so that P(i ∈ I) = 1/K (∀i).
2. x^(t−1) = ∇ψ*(−Ay^(t−1)/n).
3. Update y^(t) as

     y_I^(t) ∈ argmin_{y_i (i∈I)} Σ_{i∈I} [ f_i*(y_i) − ⟨x^(t−1), a_i y_i⟩ + (1/(2η))|y_i − y_i^(t−1)|² ],
     y_i^(t) = y_i^(t−1)  (i ∉ I).

The update of each y_i can be parallelized:

     y_i = prox(y_i^(t−1) + η a_i^⊤ x^(t−1) | η f_i*)  (i ∈ I).

Convergence of mini-batch SDCA

Assumption:
- f_i is γ-smooth.
- ψ is λ-strongly convex.

Theorem
Suppose there exists R such that ∥A_I^⊤ A_I∥ ≤ R² (∀I). Then, for η = λn/R², we have

  E[P(x^(T)) − D(y^(T))] ≤ (K + R²γ/λ) exp(−T/(K + R²γ/λ)) (D(y^(0)) − D(y*)),

where E[·] is taken w.r.t. the random choice of coordinates.

  T ≥ C (K + R²γ/λ) log((n + κ)/ϵ)

achieves accuracy ϵ. → The iteration complexity is improved (if R² is not large and parallelization is used).


Primal methods

The key idea: reduce the variance of the gradient estimate.

  min_x (1/n) Σ_{i=1}^n ℓ_i(x) + ψ(x)    (how to approximate the loss term by ⟨g, x⟩?)

Online method: pick i ∈ {1, ..., n} randomly, and use the linear approximation

  g = ∇ℓ_i(x)  ⇒  E[g] = (1/n) Σ_{i=1}^n ∇ℓ_i(x).

This is an unbiased estimator of the full gradient.

How about the variance?
→ The variance is the problem!
→ In the batch setting, it is easy to reduce the variance.

Stochastic Variance Reduced Gradient descent, SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014)

  min_x L(x) + ψ(x) = min_x (1/n) Σ_{i=1}^n ℓ_i(x) + ψ(x)

With a fixed reference point x̄ which is close to x, a reduced-variance gradient estimator is given as

  g = ∇ℓ_i(x) − ∇ℓ_i(x̄) + (1/n) Σ_{j=1}^n ∇ℓ_j(x̄),   where (1/n) Σ_j ∇ℓ_j(x̄) = ∇L(x̄).

Bias: g is unbiased,

  E[g] = (1/n) Σ_{i=1}^n [∇ℓ_i(x) − ∇ℓ_i(x̄) + ∇L(x̄)] = (1/n) Σ_{i=1}^n ∇ℓ_i(x) = ∇L(x).

Variance?

A key observation

  g = ∇ℓ_i(x) − ∇ℓ_i(x̄) + ∇L(x̄).

Variance:

  Var[g] = (1/n) Σ_{i=1}^n ∥∇ℓ_i(x) − ∇ℓ_i(x̄) + ∇L(x̄) − ∇L(x)∥²
         = (1/n) Σ_{i=1}^n ∥∇ℓ_i(x) − ∇ℓ_i(x̄)∥² − ∥∇L(x̄) − ∇L(x)∥²
           (∵ Var[X] = E[∥X∥²] − ∥E[X]∥²)
         ≤ (1/n) Σ_{i=1}^n ∥∇ℓ_i(x) − ∇ℓ_i(x̄)∥²
         ≤ γ²∥x − x̄∥².

The variance becomes small if x and x̄ are close and each ℓ_i is smooth.

Main strategy:
- Calculate the full gradient at x̄.
- Update x_t several times, say, O(n) times.
- Set x̄ = x_t.
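The key observation above can be checked empirically; here is a short sketch on toy least-squares losses ℓ_i(x) = (a_i^⊤x − b_i)²/2 (the data and names are illustrative) showing that the estimator is unbiased and that its variance shrinks as x̄ approaches x.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 10
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
grads = lambda x: A * (A @ x - b)[:, None]      # row i is the gradient of l_i at x

x = rng.standard_normal(p)
full = grads(x).mean(axis=0)                    # full gradient of L at x
variances = []
for dist in [1.0, 0.1, 0.01]:
    x_bar = x + dist * rng.standard_normal(p)   # reference point at distance ~ dist
    g_all = grads(x) - grads(x_bar) + grads(x_bar).mean(axis=0)
    variances.append(np.mean(np.sum((g_all - full) ** 2, axis=1)))
print(variances)   # shrinks roughly like dist^2
```

The averaged estimator equals the full gradient exactly (unbiasedness), while the squared deviation drops with ∥x − x̄∥², matching the bound above.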

(Figures) Variance of the gradient estimates.

Algorithm of SVRG

The algorithm consists of an inner loop and an outer loop.

SVRG

For t = 1, 2, ..., iterate the following:
1. Set x̄ = x^(t−1), x_[0] = x̄, and

     ḡ = ∇L(x̄) = (1/n) Σ_{i=1}^n ∇ℓ_i(x̄).   (full gradient)

2. For k = 1, ..., m, execute the following:
   1. Uniformly sample i ∈ {1, ..., n}.
   2. Set g = ∇ℓ_i(x_[k−1]) − ∇ℓ_i(x̄) + ḡ.   (variance reduction)
   3. Update x_[k] as x_[k] = prox(x_[k−1] − ηg | ηψ).
3. Set x^(t) = (1/m) Σ_{k=1}^m x_[k].

Computational complexity up to the t-th iteration: O(t × (n + m)).
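The double loop above can be sketched in a few lines; this is a minimal prox-SVRG for the ridge problem (not the reference implementation; the function name, step size, and toy data are illustrative).

```python
import numpy as np

def svrg_ridge(A, b, lam, eta, m, T):
    """Prox-SVRG sketch for min_x (1/n) sum_i (a_i^T x - b_i)^2/2 + (lam/2)||x||^2."""
    n, p = A.shape
    grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])      # gradient of the i-th loss
    rng = np.random.default_rng(0)
    x = np.zeros(p)
    for t in range(T):                                  # outer loop
        x_bar = x.copy()
        g_bar = A.T @ (A @ x_bar - b) / n               # full gradient at x_bar
        xk = x_bar.copy()
        x_sum = np.zeros(p)
        for k in range(m):                              # inner loop
            i = rng.integers(n)
            g = grad_i(xk, i) - grad_i(x_bar, i) + g_bar  # variance-reduced gradient
            xk = (xk - eta * g) / (1.0 + eta * lam)       # prox step for (lam/2)||x||^2
            x_sum += xk
        x = x_sum / m                                   # average of the inner iterates
    return x

# toy usage
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)           # unit-norm rows, so gamma = 1
b = A @ np.ones(5) + 0.1 * rng.standard_normal(100)
x_hat = svrg_ridge(A, b, lam=0.1, eta=0.1, m=1000, T=100)
```

Only one full-gradient pass per outer iteration is needed; the inner loop touches a single data point at a time, as claimed in the complexity O(t × (n + m)).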

Convergence analysis

Assumption: ℓ_i is γ-smooth, and ψ is λ-strongly convex.

Theorem
If the step size η and the inner-loop length m satisfy η < 1/(4γ) and

  ρ := 1/(λη(1 − 4γη)m) + 4γη(m + 1)/((1 − 4γη)m) < 1,

then, after T outer iterations, the objective is bounded as

  E[P(x^(T)) − P(x*)] ≤ ρ^T (P(x^(0)) − P(x*)).

The assumption of the theorem is satisfied by m ≥ Ω(γ/λ).

- Inner-loop computation is O(n + m) for each t.
- Outer-loop iteration count is T = O(log(1/ϵ)) until accuracy ϵ.

⇒ The whole computational cost is

  O((n + m) log(1/ϵ)) = O((n + γ/λ) log(1/ϵ)).

(Figure) Numerical comparison between several stochastic methods in the batch setting on the rcv1, covertype, and sido0 datasets; figure from Xiao and Zhang (2014).

Related method: SAGA

SAGA (Defazio et al., 2014) does not require the double loop, but requires more memory.
The difference is the gradient estimate g:

  (SAGA) g = ∇ℓ_i(x^(t−1)) − ∇ℓ_i(x̄_i) + (1/n) Σ_{j=1}^n ∇ℓ_j(x̄_j)
  (SVRG) g = ∇ℓ_i(x^(t−1)) − ∇ℓ_i(x̄) + (1/n) Σ_{j=1}^n ∇ℓ_j(x̄)

The reference point x̄_i depends on the data index i ∈ {1, ..., n}, and x̄_i is updated at every iteration:

  x̄_i = x^(t−1) (if i is chosen at the t-th round),   x̄_j is not changed (∀j ≠ i).

- The update rule of x^(t) is the same: x^(t) = prox(x^(t−1) − ηg | ηψ).
- We need to store all the gradients ∇ℓ_i(x̄_i) (i = 1, ..., n).

Algorithm of SAGA

SAGA (Defazio et al., 2014)

For t = 1, 2, ..., iterate the following:
1. Pick i ∈ {1, ..., n} uniformly at random.
2. Update g_j^(t) (j = 1, ..., n) as

     g_j^(t) = ∇ℓ_j(x^(t−1))  (j = i),
     g_j^(t) = g_j^(t−1)       (otherwise).

3. Update x^(t) as

     v_t = g_i^(t) − g_i^(t−1) + (1/n) Σ_{j=1}^n g_j^(t−1),
     x^(t) = prox(x^(t−1) − η v_t | ηψ).
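A minimal sketch of the single-loop update above for the ridge problem (names, step size, and toy data are illustrative; keeping the gradient average incrementally avoids the O(n) sum per step):

```python
import numpy as np

def saga_ridge(A, b, lam, T):
    """SAGA sketch for min_x (1/n) sum_i (a_i^T x - b_i)^2/2 + (lam/2)||x||^2."""
    n, p = A.shape
    gamma = np.max(np.sum(A ** 2, axis=1))   # smoothness of each loss
    eta = 1.0 / (3.0 * gamma)                # step size from the theorem
    g_table = np.zeros((n, p))               # stored gradients grad l_i(x_bar_i)
    g_avg = np.zeros(p)                      # (1/n) sum_j g_j, kept incrementally
    x = np.zeros(p)
    rng = np.random.default_rng(0)
    for t in range(T):
        i = rng.integers(n)
        g_new = A[i] * (A[i] @ x - b[i])
        v = g_new - g_table[i] + g_avg       # unbiased variance-reduced direction
        x = (x - eta * v) / (1.0 + eta * lam)  # prox step for (lam/2)||x||^2
        g_avg += (g_new - g_table[i]) / n    # update the average afterwards
        g_table[i] = g_new
    return x

# toy usage
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)
b = A @ np.ones(5) + 0.1 * rng.standard_normal(100)
x_hat = saga_ridge(A, b, lam=0.1, T=20000)
```

The n × p table g_table is exactly the extra memory cost mentioned on the previous slide.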

Convergence of SAGA

Assumption: each ℓ_i is γ-smooth and λ-strongly convex (λ = 0 is allowed) (∀i = 1, ..., n).

Theorem
Set η = 1/(3γ). Then:
- λ = 0: for x̄^(T) = (1/T) Σ_{t=1}^T x^(t), it holds that

    E[P(x̄^(T)) − P(x*)] ≤ (4n/T) C_0.

- λ > 0:

    E[∥x^(T) − x*∥²] ≤ (1 − min{1/(4n), λ/(3γ)})^T C_1.

SAGA is adaptive to the strong convexity λ (the step size does not depend on λ).

Stochastic Average Gradient descent, SAG

- Like SAGA, SAG is also a single-loop method (Le Roux et al., 2012; Schmidt et al., 2013).
- Historically, SAG was proposed earlier than SVRG and SAGA.
- The proximal technique cannot be incorporated into SAG
  → SAGA was proposed to overcome this drawback.

Approximate the loss term of P(x) = (1/n) Σ_{i=1}^n ℓ_i(x) by ⟨g, x⟩, where

  (SAG)  g = (∇ℓ_i(x^(t−1)) − ∇ℓ_i(x̄_i))/n + (1/n) Σ_{j=1}^n ∇ℓ_j(x̄_j);

here g is biased. Compare:

  (SAGA) g = ∇ℓ_i(x^(t−1)) − ∇ℓ_i(x̄_i) + (1/n) Σ_{j=1}^n ∇ℓ_j(x̄_j)
  (SVRG) g = ∇ℓ_i(x^(t−1)) − ∇ℓ_i(x̄) + (1/n) Σ_{j=1}^n ∇ℓ_j(x̄)

Algorithm of SAG

SAG

Initialize g_i^(0) = 0 (i = 1, ..., n).
For t = 1, 2, ..., iterate the following:
1. Pick i ∈ {1, ..., n} uniformly at random.
2. Update g_{i′}^(t) (i′ = 1, ..., n) as

     g_{i′}^(t) = ∇ℓ_{i′}(x^(t−1))  (i′ = i),
     g_{i′}^(t) = g_{i′}^(t−1)       (otherwise).

3. Update x^(t) as

     x^(t) = x^(t−1) − (η/n) Σ_{j=1}^n g_j^(t).
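A minimal sketch of SAG for ridge regression; since SAG has no proximal step, the smooth ridge term is folded into each loss, ℓ_i(x) = (a_i^⊤x − b_i)²/2 + (λ/2)∥x∥² (names, step size, and toy data are illustrative).

```python
import numpy as np

def sag_ridge(A, b, lam, T):
    """SAG sketch for P(x) = (1/n) sum_i [(a_i^T x - b_i)^2/2 + (lam/2)||x||^2]."""
    n, p = A.shape
    gamma = np.max(np.sum(A ** 2, axis=1)) + lam   # smoothness of each l_i
    eta = 1.0 / (16.0 * gamma)                     # step size from the theorem
    g_table = np.zeros((n, p))                     # stored gradients, g_i^(0) = 0
    g_sum = np.zeros(p)                            # sum_j g_j, kept incrementally
    x = np.zeros(p)
    rng = np.random.default_rng(0)
    for t in range(T):
        i = rng.integers(n)
        g_new = A[i] * (A[i] @ x - b[i]) + lam * x
        g_sum += g_new - g_table[i]
        g_table[i] = g_new
        x = x - (eta / n) * g_sum                  # (biased) averaged-gradient step
    return x

# toy usage
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)      # unit-norm rows
b = A @ np.ones(5) + 0.1 * rng.standard_normal(100)
x_hat = sag_ridge(A, b, lam=0.1, T=40000)
```

Note the only structural difference from SAGA: the fresh gradient enters the step with weight 1/n (through the running sum) rather than weight 1, which is what makes the estimator biased.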

Convergence analysis of SAG

Assumption: ℓ_i is γ-smooth and P(x) = (1/n) Σ_{i=1}^n ℓ_i(x) is λ-strongly convex (λ = 0 is allowed).
This is a milder condition than SAGA's, because the strong convexity is required of P(x) rather than of each loss function ℓ_i.

Theorem (convergence rate of SAG)
Set η = 1/(16γ). Then SAG converges as follows:
- λ = 0: for x̄^(T) = (1/T) Σ_{t=1}^T x^(t),

    E[P(x̄^(T)) − P(x*)] ≤ (32n/T) C_0.

- λ > 0:

    E[∥x^(T) − x*∥²] ≤ (1 − min{1/(8n), λ/(16γ)})^T C_0.

SAG also has this adaptivity.

Catalyst: acceleration of SVRG, SAG, SAGA

Catalyst (Lin et al., 2015)

Iterate the following for t = 1, 2, ...:
1. Find an approximate solution of a modified problem that has higher strong convexity:

     x^(t) ≃ argmin_x { P(x) + (α/2)∥x − y^(t−1)∥² }   (up to precision ϵ_t).

2. Accelerate the solution: y^(t) = x^(t) + β_t (x^(t) − x^(t−1)).

- Catalyst is an acceleration method for an inexact proximal point algorithm.
- For ϵ_t = C(1 − √(λ/(2(λ + α))))^t, it holds that

    P(x^(t)) − P(x*) ≤ C′(1 − √(λ/(2(λ + α))))^t.

- Using SVRG, SAG, or SAGA with α = max{cγ/n, λ} in the inner loop achieves
  (n + √(γn/λ)) log(1/ϵ) overall computation.
- This is a universal method, but it is sensitive to the choice of the inner-loop iteration number and of α.

Summary and comparison of batch methods

Properties of the batch methods:

  Method                SDCA                    SVRG          SAG
  Primal/Dual           Dual                    Primal        Primal
  Memory                stores n duals          efficient     stores n gradients
  Acceleration (λ > 0)  direct                  Catalyst      Catalyst
  Other remark          ℓ_i(β) = f_i(x_i^⊤β)    double loop   smooth reg. only

Convergence rate (up to log terms in γ, λ):

  Method   λ > 0                   λ = 0    Accelerated (λ > 0)
  SDCA     (n + γ/λ) log(1/ϵ)      -        (n + √(nγ/λ)) log(1/ϵ)
  SVRG     (n + γ/λ) log(1/ϵ)      -        (n + √(nγ/λ)) log(1/ϵ) (*)
  SAG      (n + γ/λ) log(1/ϵ)      γn/ϵ     (n + √(nγ/λ)) log(1/ϵ) (*)

  (*): via Catalyst.

For λ = 0, Catalyst gives an acceleration with convergence rate O(n√(γ/ϵ)).


Minimax optimal convergence rate

Let κ = γ/λ be the condition number.

The iteration number

  T ≥ (n + κ) log(1/ϵ)

is almost minimax, but not minimax.

The accelerated version

  T ≥ (n + √(nκ)) log(1/ϵ)

is minimax up to the log(1/ϵ) factor (Agarwal and Bottou, 2015).

Minimax optimality in the batch setting

  P(x) = (1/n) Σ_{i=1}^n ℓ_i(x) + (λ/2)∥x∥²

Assumption: ℓ_i is (γ − λ)-smooth (γ > λ).

First-order oracle: for an input (x, i), it returns the pair (ℓ_i(x), ∇ℓ_i(x)).
First-order algorithm: an algorithm that depends only on the returns of the first-order oracle at its query points.
(SAG, SAGA, and SVRG are included; SDCA is not.)

Theorem (minimax optimal rate for first-order algorithms (Agarwal and Bottou, 2015))
For any first-order algorithm, there exist functions ℓ_i (i = 1, ..., n) satisfying the assumption on which the algorithm must perform at least

  T ≥ Ω(n + √(n(κ − 1)) log(1/ϵ))

calls to the first-order oracle to get ∥x^(T) − x*∥ ≤ ϵ∥x*∥.

References

A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. In Proceedings of the 32nd International Conference on Machine Learning, pages 78–86, 2015.

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.

A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.

L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

X. Chen, Q. Lin, and J. Pena. Optimal regularized dual averaging methods for stochastic optimization. In Advances in Neural Information Processing Systems 25, pages 395–403. Curran Associates, Inc., 2012.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654. Curran Associates, Inc., 2014.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

O. Fercoq and P. Richtarik. Accelerated, parallel and proximal coordinate descent. Technical report, 2013. arXiv:1312.5799.

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.

C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems 22, pages 781–789. Curran Associates, Inc., 2009.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.

N. Le Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2663–2671. Curran Associates, Inc., 2012.

H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. Technical report, 2015. arXiv:1506.02186.

Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. Technical report, 2014. arXiv:1407.1296.

A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, New York, 1983.

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 449–456. Omnipress, 2012.

P. Richtarik and M. Takac. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144:1–38, 2014.

A. Saha and A. Tewari. On the non-asymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization, 23(1):576–601, 2013.

M. Schmidt, N. Le Roux, and F. R. Bach. Minimizing finite sums with the stochastic average gradient, 2013. hal-00860051.

S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013a.

S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, 2013b.

S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning, pages 64–72, 2014.

M. Takac, A. Bijral, P. Richtarik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.

T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24:2057–2075, 2014.

M. D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012. URL http://arxiv.org/abs/1212.5701.