Nesterov’s Acceleration, Marco Cuturi (marcocuturi.net/Teaching/ENSAE/2017/DSOML/DSOML4.pdf)


Page 1:

Nesterov’s Acceleration

$$\min_x\ f(x) + \psi(x)$$

Nesterov Accelerated Gradient, a.k.a. FISTA

$f$ is $\gamma$-smooth. Set $s_1 = 1$ and $\eta = 1/\gamma$. Set $y_0$. Iterate by increasing $t$:

• $g_t \in \partial f(y_t)$
• $x_t = \mathrm{prox}(y_t - \eta\, g_t \mid \eta\,\psi)$
• $s_{t+1} = \dfrac{1 + \sqrt{1 + 4 s_t^2}}{2}$
• $y_{t+1} = x_t + \dfrac{s_t - 1}{s_{t+1}}\,(x_t - x_{t-1})$

Page 2:

Nesterov’s Acceleration

$f$ is $\gamma$-smooth. Set $s_1 = 1$ and $\eta = 1/\gamma$. Set $y_0$. Iterate:

• $g_t \in \partial f(y_t)$
• $x_t = \mathrm{prox}(y_t - \eta\, g_t \mid \eta\,\psi)$
• $s_{t+1} = \dfrac{1 + \sqrt{1 + 4 s_t^2}}{2}$
• $y_{t+1} = x_t + \dfrac{s_t - 1}{s_{t+1}}\,(x_t - x_{t-1})$

Convergence rate:

$$f(x_t) - f(x^\star) \;\le\; \frac{2\gamma\,\|x_0 - x^\star\|^2}{t^2}$$

source: T. Suzuki
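To make the recursion concrete, here is a minimal NumPy sketch of FISTA applied to the lasso problem used in the experiment on the next slide. The helper names (`fista_lasso`, `soft_threshold`), the iteration count, and the computation of the smoothness constant are our own illustrative choices, not taken from the slides.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_lasso(X, y, lam, n_iter=200):
    """FISTA for min_theta (1/n) ||X theta - y||^2 + lam ||theta||_1.
    The smooth part is gamma-smooth with gamma = 2 ||X||_2^2 / n, so eta = 1/gamma."""
    n, p = X.shape
    gamma = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    eta = 1.0 / gamma
    x_prev = np.zeros(p)          # x_{t-1}
    y_t = np.zeros(p)             # extrapolated point y_t
    s = 1.0                       # s_1
    for _ in range(n_iter):
        g = 2.0 * X.T @ (X @ y_t - y) / n                    # g_t = grad f(y_t)
        x = soft_threshold(y_t - eta * g, eta * lam)         # x_t = prox(y_t - eta g_t | eta psi)
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s ** 2)) / 2.0   # s_{t+1}
        y_t = x + (s - 1.0) / s_next * (x - x_prev)          # y_{t+1}
        x_prev, s = x, s_next
    return x_prev
```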

Page 3:

Nesterov’s Acceleration

[Figure: relative objective $f(x_t) - f^\star$ on a log scale ($10^{-8}$ to $10^{4}$) over iterations 0 to 120, comparing normal gradient descent and Nesterov’s acceleration.]

Nesterov’s acceleration vs. normal gradient descent. Lasso: $n = 8{,}000$, $p = 500$,

$$\min_\theta\ \frac{1}{n}\sum_i (\theta^T x_i - y_i)^2 + \lambda\,\|\theta\|_1.$$

source: T. Suzuki

Page 4:

Stochastic Gradient

We want to minimize

$$\min_{\theta\in\mathbb{R}^p}\ L(\theta) := \mathbb{E}[l_\theta(Z)].$$

Due to practical constraints, samples arrive one at a time (one $z_t$ per step $t$) and cannot be stored; only the previous parameter is kept. We use a double approximation:

$$\mathbb{E}[l(\theta, Z)] \;\approx\; l(\theta, z_t) \;\approx\; l(\theta_{t-1}, z_t) + \langle \nabla l(\theta_{t-1}, z_t),\, \theta - \theta_{t-1} \rangle.$$

Page 5:

Stochastic Gradient

To approximate the minimization of

$$\min_{\theta\in\mathbb{R}^p}\ \mathbb{E}[l_\theta(Z)],$$

we use an approximate problem that is only valid around the previous iterate:

$$\theta_t := \underset{\theta\in\mathbb{R}^p}{\arg\min}\ \langle \nabla l(\theta_{t-1}, z_t),\, \theta - \theta_{t-1} \rangle + \frac{1}{2\eta_t}\,\|\theta - \theta_{t-1}\|^2.$$
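Because this surrogate is a strongly convex quadratic in $\theta$, setting its gradient to zero yields the update in closed form, which is exactly the stochastic gradient step stated on the next slide:

$$\nabla l(\theta_{t-1}, z_t) + \frac{1}{\eta_t}\,(\theta_t - \theta_{t-1}) = 0 \quad\Longleftrightarrow\quad \theta_t = \theta_{t-1} - \eta_t\, \nabla l(\theta_{t-1}, z_t).$$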

Page 6:

SG (no regularization)

We want to minimize

$$\min_{\theta\in\mathbb{R}^p}\ L(\theta) := \mathbb{E}[l_\theta(Z)].$$

Stochastic Gradient Method (no regularization)

Set $\theta_0$ and a step-size sequence $\eta_t$. Repeat:
• Sample $z_t \sim P(Z)$.
• Compute a subgradient $g_t \in \partial_\theta\, l(\theta_{t-1}, z_t)$.
• Update $\theta_t = \theta_{t-1} - \eta_t\, g_t$.

Output: $\bar\theta_T = \frac{1}{T+1}\sum_{t=0}^{T} \theta_t$

Page 7:

SG (regularization)

We now want to minimize:

$$\min_{\theta\in\mathbb{R}^p}\ L_\psi(\theta) := \mathbb{E}[l_\theta(Z)] + \psi(\theta).$$

Stochastic Gradient Method (regularization)

Set $\theta_0$ and a step-size sequence $\eta_t$. Repeat:
• Sample $z_t \sim P(Z)$.
• Compute a subgradient $g_t \in \partial_\theta\, l(\theta_{t-1}, z_t)$.
• Update $\theta_t = \mathrm{prox}(\theta_{t-1} - \eta_t\, g_t \mid \eta_t\,\psi)$.

Output: $\bar\theta_T = \frac{1}{T+1}\sum_{t=0}^{T} \theta_t$
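A minimal NumPy sketch of this proximal stochastic gradient loop, instantiated for an $\ell_1$ regularizer $\psi = \lambda\|\cdot\|_1$; the sampling interface, function names, and the $1/\sqrt{t}$ step-size schedule are illustrative assumptions rather than prescriptions from the slides.

```python
import numpy as np

def prox_l1(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_sgd(sample, grad_loss, theta0, lam, n_steps=1000):
    """sample(): draws one observation z_t ~ P(Z);
    grad_loss(theta, z): a (sub)gradient of l(theta, z) with respect to theta."""
    theta = theta0.copy()
    theta_sum = np.zeros_like(theta0)
    for t in range(n_steps):
        z = sample()
        eta = 1.0 / np.sqrt(t + 1.0)                  # decreasing step-size sequence
        g = grad_loss(theta, z)
        theta = prox_l1(theta - eta * g, eta * lam)   # theta_t = prox(theta_{t-1} - eta_t g_t | eta_t psi)
        theta_sum += theta
    return theta_sum / n_steps                        # uniform average of the iterates
```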

Page 8:

Polynomial Averaging

Stochastic Gradient Method (regularization)

Set $\theta_0$ and a step-size sequence $\eta_t$. Repeat:
• Sample $z_t \sim P(Z)$.
• Compute a subgradient $g_t \in \partial_\theta\, l(\theta_{t-1}, z_t)$.
• Update $\theta_t = \mathrm{prox}(\theta_{t-1} - \eta_t\, g_t \mid \eta_t\,\psi)$.

Output: $\bar\theta_T = \frac{2}{(T+1)(T+2)}\sum_{t=0}^{T} (t+1)\,\theta_t$
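The polynomially averaged output can be maintained online, without storing all iterates, via the running update $\bar\theta_t = (1 - \tfrac{2}{t+2})\,\bar\theta_{t-1} + \tfrac{2}{t+2}\,\theta_t$; the small sketch below (our rewriting of the sum, not an extra algorithm from the slides) checks this against the direct formula.

```python
import numpy as np

def polynomial_average(iterates):
    """Computes 2/((T+1)(T+2)) * sum_{t=0}^T (t+1) * theta_t with a running update."""
    theta_bar = None
    for t, theta in enumerate(iterates):
        w = 2.0 / (t + 2.0)            # relative weight of the newest iterate
        theta_bar = theta.astype(float) if theta_bar is None else (1.0 - w) * theta_bar + w * theta
    return theta_bar

# Sanity check on T = 4 (five iterates).
thetas = [np.array([float(t)]) for t in range(5)]
direct = 2.0 / (5 * 6) * sum((t + 1) * thetas[t] for t in range(5))
assert np.allclose(polynomial_average(thetas), direct)
```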

Page 9:

Batch Problems

• SG methods have several drawbacks, chief among them the choice of a step size.
• Is there a setting where this can be mitigated? Yes, when the expectation is in fact a large sum: the problem

$$\min_{\theta\in\mathbb{R}^p}\ L_\psi(\theta) := \mathbb{E}[l_\theta(Z)] + \psi(\theta)$$

becomes

$$\min_{\theta\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^{n} l(\theta, z_i) + \psi(\theta).$$

Page 10:

Batch Methods

• We would like to have the benefits of SGM (low cost per iteration) without its disadvantages (slow convergence near the optimum, step-size selection).

source: https://wikidocs.net/3413

Page 11:

Batch Methods

Typical behavior:

[Figure: training error (left panel) and generalization error (right panel), both from 0 to 0.5, against elapsed time from 0 to 3.5 s, for SGD and the batch gradient method.]

Normal gradient descent vs. SGD. Logistic regression with L1 regularization: $n = 10{,}000$, $p = 2$.

SGD decreases the objective rapidly; after a while, the batch gradient method catches up and slightly surpasses it.

source: T. Suzuki

Page 12:

Three Methods

• Primal methods
  • Stochastic Average Gradient descent, SAG(A) (Le Roux et al., 2012; Schmidt et al., 2013; Defazio et al., 2014)
  • Stochastic Variance Reduced Gradient descent, SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014)
• Dual methods (see Fenchel duality)
  • Stochastic Dual Coordinate Ascent, SDCA (Shalev-Shwartz and Zhang, 2013a)

Page 13:

Primal Methods

$$\min_{\theta\in\mathbb{R}^p}\ \underbrace{\frac{1}{n}\sum_{i=1}^{n} l_i(\theta)}_{\text{smooth}} \;+\; \underbrace{\psi(\theta)}_{\text{strongly convex}}$$

We want to approximate $\nabla\, \frac{1}{n}\sum_{i=1}^{n} l_i(\theta)$.

Sampling an index uniformly gives an unbiased estimator:

$$\mathbb{E}_{i\sim\mathrm{unif}\{1,\dots,n\}}[\nabla l_i(\theta)] \;=\; \frac{1}{n}\sum_i \nabla l_i(\theta) \;=\; \nabla L(\theta)$$

Randomizing over points in the dataset therefore gives a way to get an unbiased estimator of the gradient.

Problem: variance!

Page 16:

SVRG

The SVRG gradient estimate, for a reference point $\hat\theta$ and a uniformly sampled index $i$:

$$g = \nabla l_i(\theta) - \nabla l_i(\hat\theta) + \frac{1}{n}\sum_{j=1}^{n} \nabla l_j(\hat\theta)$$

• It is easy to show that this gradient estimate is unbiased.
• Its variance is controlled by how far apart $\theta$ and $\hat\theta$ are.
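A one-line check of the first bullet, taking the expectation over the uniformly sampled index $i$ (here $L$ denotes the unregularized average loss $\frac{1}{n}\sum_j l_j$):

$$\mathbb{E}_i[g] \;=\; \underbrace{\mathbb{E}_i[\nabla l_i(\theta)]}_{=\,\nabla L(\theta)} \;-\; \underbrace{\mathbb{E}_i[\nabla l_i(\hat\theta)]}_{=\,\nabla L(\hat\theta)} \;+\; \frac{1}{n}\sum_{j=1}^{n}\nabla l_j(\hat\theta) \;=\; \nabla L(\theta).$$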

Page 17:

SVRG

The variance of this estimate (with each $l_i$ assumed $\gamma$-smooth):

$$
\begin{aligned}
\mathrm{var}[g] &= \frac{1}{n}\sum_{i=1}^{n} \big\|\nabla l_i(\theta) - \nabla l_i(\hat\theta) + \nabla L(\hat\theta) - \nabla L(\theta)\big\|^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} \big\|\nabla l_i(\theta) - \nabla l_i(\hat\theta)\big\|^2 - \big\|\nabla L(\theta) - \nabla L(\hat\theta)\big\|^2 \\
&\le \frac{1}{n}\sum_{i=1}^{n} \big\|\nabla l_i(\theta) - \nabla l_i(\hat\theta)\big\|^2 \\
&\le \gamma^2\,\|\theta - \hat\theta\|^2
\end{aligned}
$$

Page 18:

SVRG

Set $\hat\theta_0$. For $t = 1, \dots, T$:

• Set $\hat\theta = \hat\theta_{t-1}$ and $\theta_0 = \hat\theta$.
• $\bar g = \frac{1}{n}\sum_{i=1}^{n} \nabla l_i(\hat\theta)$: full gradient, at $\hat\theta$.
• For $k = 1, \dots, m$:
  – Sample $i \sim \mathrm{unif}\{1, \dots, n\}$.
  – $g = \nabla l_i(\theta_{k-1}) - \nabla l_i(\hat\theta) + \bar g$: variance reduction.
  – $\theta_k = \mathrm{prox}(\theta_{k-1} - \eta\, g \mid \eta\,\psi)$.
• $\hat\theta_t = \frac{1}{m}\sum_{k=1}^{m} \theta_k$.
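A minimal NumPy sketch of this double loop; the interface (`grad_i(theta, i)` returning $\nabla l_i(\theta)$, `prox(v, tau)` returning the prox of $\tau\psi$) and the default values of $T$ and $m$ are illustrative assumptions.

```python
import numpy as np

def svrg(grad_i, prox, theta0, n, eta, lam, T=20, m=100):
    """grad_i(theta, i): gradient of l_i at theta; prox(v, tau): prox of tau * psi."""
    theta_hat = theta0.copy()
    for _ in range(T):
        # Full gradient at the reference point theta_hat (the expensive outer step).
        g_bar = np.mean([grad_i(theta_hat, i) for i in range(n)], axis=0)
        theta = theta_hat.copy()
        running_sum = np.zeros_like(theta0)
        for _ in range(m):
            i = np.random.randint(n)
            g = grad_i(theta, i) - grad_i(theta_hat, i) + g_bar   # variance-reduced estimate
            theta = prox(theta - eta * g, eta * lam)              # proximal step
            running_sum += theta
        theta_hat = running_sum / m                               # average of the inner iterates
    return theta_hat
```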

Page 19:

SAGA

$$\text{(SVRG)}\qquad g = \nabla l_i(\theta_{t-1}) - \nabla l_i(\hat\theta) + \frac{1}{n}\sum_{j=1}^{n} \nabla l_j(\hat\theta)$$

$$\text{(SAGA)}\qquad g = \nabla l_i(\theta_{t-1}) - \nabla l_i(\hat\theta_i) + \frac{1}{n}\sum_{j=1}^{n} \nabla l_j(\hat\theta_j)$$

In SAGA the reference point $\hat\theta_i$ depends on the data index and is updated at every iteration:

$$\hat\theta_i = \theta_{t-1} \ \text{if } i \text{ is chosen}, \qquad \hat\theta_i \ \text{unchanged otherwise}.$$

Consequence: larger storage is necessary, but there is no double loop.

Page 20:

SAGA

Set stored gradients $\bar g_i = 0$ for $i \in \{1, \dots, n\}$, their average $\bar g = 0$, and an initial $\theta$. Repeat:

• Pick $i \in \{1, \dots, n\}$ uniformly at random.
• Compute the fresh gradient $g_i = \nabla l_i(\theta)$.
• Estimate the gradient: $g = g_i - \bar g_i + \bar g$.
• Update the average gradient: $\bar g \leftarrow \bar g + \frac{1}{n}(g_i - \bar g_i)$.
• Update the stored gradient: $\bar g_i \leftarrow g_i$.
• Update $\theta \leftarrow \mathrm{prox}(\theta - \eta\, g \mid \eta\,\psi)$.

Step size $\approx 1/\gamma$, convergence guaranteed. In practice it is important to use mini-batches.
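A matching NumPy sketch of the SAGA loop, keeping one stored gradient per data point; the same caveats as for the SVRG sketch apply (interface and defaults are ours).

```python
import numpy as np

def saga(grad_i, prox, theta0, n, eta, lam, n_steps=10000):
    """grad_i(theta, i): gradient of l_i at theta; prox(v, tau): prox of tau * psi."""
    theta = theta0.copy()
    stored = np.zeros((n, theta0.size))       # one stored gradient per data point
    g_avg = np.zeros_like(theta0)             # average of the stored gradients
    for _ in range(n_steps):
        i = np.random.randint(n)
        g_new = grad_i(theta, i)              # fresh gradient at the current iterate
        g = g_new - stored[i] + g_avg         # variance-reduced gradient estimate
        g_avg += (g_new - stored[i]) / n      # update the running average
        stored[i] = g_new                     # update the stored gradient for index i
        theta = prox(theta - eta * g, eta * lam)
    return theta
```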

Page 21:

Recent Extensions: Point SAGA

No explicit regularizer is required; the method uses the proximal operator of each individual function $l_i$.

Page 22:

Dual Methods

Reminders on Coordinate Descent

• Set $x^0 = (x^0_1, \dots, x^0_n)$.
• For $k = 1, \dots, K$:
  – $x^{k+1}_i = \underset{y\in\mathbb{R}}{\arg\min}\ f(x^{k+1}_1, \dots, x^{k+1}_{i-1},\, y,\, x^k_{i+1}, \dots, x^k_n)$

To ensure the success of CD, some progress must be guaranteed. Separability of the objective function helps.

source: wikipedia


Page 28:

Dual Methods

Coordinate Descent on the Primal Problem

• Set $\theta^0 = (\theta^0_1, \dots, \theta^0_p)$.
• For $k = 1, \dots, K$:
  – Sample $j$.
  – Compute $g_j = \partial f(\theta) / \partial \theta_j$.
  – $\theta_j \leftarrow \underset{y\in\mathbb{R}}{\arg\min}\ g_j\, y + \psi_j(y) + \frac{1}{2\eta_t}\,(y - \theta_j)^2$

The regularizer must be separable: $\psi(\theta) = \sum_j \psi_j(\theta_j)$.

source: wikipedia
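A minimal sketch of this randomized proximal coordinate update, instantiated for the lasso, where the one-dimensional subproblem is solved in closed form by soft-thresholding; the step size and iteration count are illustrative.

```python
import numpy as np

def coordinate_descent_lasso(X, y, lam, eta, n_steps=5000):
    """Randomized proximal coordinate descent on (1/n) ||X theta - y||^2 + lam ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_steps):
        j = np.random.randint(p)
        g_j = 2.0 * X[:, j] @ (X @ theta - y) / n        # partial derivative of the smooth part
        v = theta[j] - eta * g_j
        # argmin_y  g_j*y + lam*|y| + (y - theta_j)^2 / (2*eta)  =>  soft-thresholding
        theta[j] = np.sign(v) * max(abs(v) - eta * lam, 0.0)
    return theta
```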


Page 30:

Fenchel Duality Theorem

Theorem. Let $f : \mathbb{R}^p \to \bar{\mathbb{R}}$ and $g : \mathbb{R}^q \to \bar{\mathbb{R}}$ be closed convex, and $A \in \mathbb{R}^{q\times p}$ a linear map. Suppose that either condition (a) or (b) is satisfied. Then

$$\inf_{x\in\mathbb{R}^p} f(x) + g(Ax) \;=\; \sup_{y\in\mathbb{R}^q} -f^*(A^T y) - g^*(-y).$$

Applied to the regularized empirical risk minimization problem

$$\min_{\theta\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^{n} l_\theta(z_i) + \psi(\theta), \qquad l_\theta(z_i) = l(y_i, x_i^T\theta),$$

this gives the dual problem

$$\sup_{y\in\mathbb{R}^n}\ -\frac{1}{n}\sum_i l_i^*(y_i) - \psi^*(-X^T y / n),$$

with the primal solution recovered from the dual one by

$$\theta^\star = \nabla \psi^*(-X^T y^\star / n).$$
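As a standard special case (our illustration, not on the slide): with the ridge regularizer $\psi(\theta) = \frac{\lambda}{2}\|\theta\|^2$ one has $\psi^*(v) = \frac{1}{2\lambda}\|v\|^2$, so the primal solution is an explicit weighted combination of the data points:

$$\theta^\star = \nabla\psi^*\!\Big(-\tfrac{1}{n}X^T y^\star\Big) = -\frac{1}{\lambda n}\, X^T y^\star = -\frac{1}{\lambda n}\sum_{i=1}^{n} y_i^\star\, x_i.$$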