Optimal Mini-Batch and Step Sizes for SAGA
Nidham Gazagnadou¹
joint work with Robert M. Gower¹ & Joseph Salmon²

¹ Télécom Paris, Institut Polytechnique de Paris, Paris, France
² IMAG, Université de Montpellier, CNRS, Montpellier, France

This work was supported by grants from Région Île-de-France.
Goals of this Work

• Finite-sum minimization problem

Find the optimal parameter/model $w^*$ s.t.
$$w^* = \arg\min_{w \in \mathbb{R}^d} \Big[ f(w) := \frac{1}{n} \sum_{i=1}^n f_i(w) \Big] \qquad (P)$$

• Stochastic Gradient Descent (SGD) and practitioners
– SGD and its variance-reduced variants are widely used to solve (P) ...
– ... yet hyper-parameter tuning remains painful

• This presentation
→ Provide theoretically optimal step and mini-batch sizes for the SAGA algorithm

with
– $f$ $L$–smooth and $\mu$–strongly convex
– $f_i$ $L_i$–smooth, $\forall i \in [n]$

• Includes problems such as
• Ridge regression: $f_i(w) = \frac{1}{2}(a_i^\top w - y_i)^2 + \frac{\lambda}{2}\|w\|_2^2$
• Regularized logistic regression: $f_i(w) = \log(1 + e^{-y_i a_i^\top w}) + \frac{\lambda}{2}\|w\|_2^2$

where $f_i$ is the loss function of the $i$-th data sample, with
– $a_i \in \mathbb{R}^d$: feature vector (input)
– $y_i \in \mathbb{R}$ or $\{-1, 1\}$: label (output)
– $\lambda > 0$: ridge/Tikhonov regularization parameter
Expected Smoothness Constant
Supervised Learning Optimization Problem

• Empirical Risk Minimization (ERM)

Find the optimal parameter/model $w^*$ s.t.
$$w^* = \arg\min_{w \in \mathbb{R}^d} \Big[ f(w) := \frac{1}{n} \sum_{i=1}^n f_i(w) \Big] \qquad (P)$$

with
– $f$ is $L$–smooth and $\mu$–strongly convex
– $f_i$ is $L_i$–smooth, $\forall i \in [n] := \{1, \dots, n\}$

• Includes problems such as
– Ridge regression: $f_i(w) = \frac{1}{2}(a_i^\top w - y_i)^2 + \frac{\lambda}{2}\|w\|_2^2$
– Regularized logistic regression: $f_i(w) = \log(1 + e^{-y_i a_i^\top w}) + \frac{\lambda}{2}\|w\|_2^2$

where $f_i$ is the loss function of the $i$-th data sample, with
– $a_i \in \mathbb{R}^d$: feature vector (input)
– $y_i \in \mathbb{R}$ or $\{-1, 1\}$: label (output)
– $\lambda > 0$: ridge/Tikhonov regularization parameter
Solving the ERM Problem

Given a precision $\varepsilon > 0$

• Gradient Descent (GD)
– Step update: $w^{k+1} = w^k - \frac{1}{L}\nabla f(w^k)$
– Iteration complexity: $O\big(\frac{L}{\mu}\log(1/\varepsilon)\big)$

• SAGA (or SVRG/SARAH)
Let $L_{\max} := \max_{i\in[n]} L_i$
– Step update: $w^{k+1} = w^k - \frac{1}{3(n\mu + L_{\max})}\big[\nabla f_i(w^k) - J^k_{:i} + \frac{1}{n}J^k\mathbf{1}\big]$,
where the columns $J^k_{:i}$ are previous gradients stored in the Jacobian estimate $J^k$
– Iteration complexity: $O\big(\big(\frac{L_{\max}}{\mu} + n\big)\log(1/\varepsilon)\big)$

• Distance between $L$ and $L_{\max}$
$$L \le L_{\max} \le nL,$$
and when $n$ is big, possibly $L \ll L_{\max}$

→ Can we benefit from mini-batching to find an interpolating smoothness constant $\mathcal{L}$ s.t. $L \overset{?}{\le} \mathcal{L} \overset{?}{\le} L_{\max}$?
Key Constant: Expected Smoothness

Definition (Subsampled/batch function)
Let $B \subseteq [n]$ be a mini-batch of size $|B| = b$ and define
$$f_B(w) := \frac{1}{b} \sum_{i \in B} f_i(w),$$
and let $L_B$ denote the smallest constant s.t. $f_B$ is $L_B$–smooth.

Recovering known smoothness constants
– $B = [n] \implies L_B = L$
– $B = \{i\} \implies L_B = L_i$

Assumption (Expected Smoothness)
Let $S \subseteq [n]$ be a random set of $b$ points sampled without replacement. There exists $\mathcal{L}(b) > 0$ s.t.
$$\mathbb{E}\big[\|\nabla f_S(w) - \nabla f_S(w^*)\|_2^2\big] \le 2\mathcal{L}(b)\,\big(f(w) - f(w^*)\big)$$
Key Constant: Expected Smoothness (cont'd)

Definition ($b$–sampling without replacement)
$S$ (a random set-valued mapping) is a $b$–sampling without replacement if
$$\mathbb{P}[S = B] = \frac{1}{\binom{n}{b}} \qquad \forall B \subseteq [n] : |B| = b$$

Lemma (Expected smoothness formula)
For a $b$–sampling without replacement,
$$\mathcal{L}(b) := \frac{1}{\binom{n-1}{b-1}} \max_{i=1,\dots,n} \sum_{B \subseteq [n] :\, |B| = b \,\wedge\, i \in B} L_B$$

This is a consequence of $f_B$ being $L_B$–smooth and convex for every $B \subseteq [n]$.

Problem: computing $\mathcal{L}(b)$ is intractable for large $n$
Extreme Values of the Expected Smoothness

Let $S$ be a $b$–sampling without replacement:
$$\mathcal{L}(b) = \frac{1}{\binom{n-1}{b-1}} \max_{i=1,\dots,n} \sum_{B \subseteq [n] :\, |B| = b \,\wedge\, i \in B} L_B$$

• If $b = 1$
  • Recovered algorithm: SAGA
  • $B \in \{\{1\}, \dots, \{n\}\}$, so
$$\mathcal{L}(1) = \frac{1}{\binom{n-1}{0}} \max_{i=1,\dots,n} \sum_{B \in \{\{1\},\dots,\{n\}\} :\, i \in B} L_B = \max_{i=1,\dots,n} L_i = L_{\max}$$

• If $b = n$
  • Recovered algorithm: Gradient Descent
  • $B = [n]$, so
$$\mathcal{L}(n) = \frac{1}{\binom{n-1}{n-1}} \max_{i=1,\dots,n} \sum_{B = [n]} L_B = L$$

→ $\mathcal{L}(b)$ interpolates between $L_{\max}$ and $L$
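The exact formula is easy to evaluate by brute force when $n$ is tiny, which makes the interpolation claim checkable numerically. Below is a minimal Julia sketch (illustrative only, not from the StochOpt.jl code base) assuming ridge regression, for which $f_B$ is $L_B$–smooth with $L_B = \lambda_{\max}(A_B A_B^\top / b) + \lambda$, where $A_B$ is the sub-matrix of features in $B$:

```julia
# Exact L(b) by enumerating all mini-batches; feasible only for tiny n.
# Assumes ridge regression, so L_B = λ_max(A_B A_B'/b) + λ (a sketch).
using LinearAlgebra, Combinatorics   # Combinatorics.jl provides `combinations`

function exact_expected_smoothness(A::AbstractMatrix, λ::Real, b::Int)
    n = size(A, 2)
    L_B(B) = eigmax(Symmetric(A[:, B] * A[:, B]' / b)) + λ   # smoothness of f_B
    sums = zeros(n)                  # sums[i] accumulates Σ_{B : i ∈ B} L_B
    for B in combinations(1:n, b)
        sums[B] .+= L_B(B)
    end
    return maximum(sums) / binomial(n - 1, b - 1)
end

A, λ = randn(10, 8), 0.1             # toy data: d = 10, n = 8
Li = [norm(A[:, i])^2 + λ for i in 1:8]                      # ridge: L_i = ‖a_i‖² + λ
@assert exact_expected_smoothness(A, λ, 1) ≈ maximum(Li)                       # L(1) = Lmax
@assert exact_expected_smoothness(A, λ, 8) ≈ eigmax(Symmetric(A * A' / 8)) + λ # L(n) = L
```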
Optimal Mini-Batch and Step Sizes for SAGA
Mini-Batch SAGA

• The algorithm
For $k = 0, 1, 2, \dots$
– Sample a mini-batch $B \subseteq [n]$ s.t. $|B| = b$
– Compute a gradient estimate
$$g(w^k) = \frac{1}{b}\sum_{i\in B} \nabla f_i(w^k) - \frac{1}{b}\sum_{i\in B} J^k_{:i} + \frac{1}{n}J^k\mathbf{1}$$
– Take a step
$$w^{k+1} = w^k - \gamma\, g(w^k)$$
– Update the Jacobian estimate:
$$J^{k+1}_{:i} = \nabla f_i(w^k), \quad \forall i \in B \qquad \text{(columns } i \notin B \text{ are left unchanged)}$$

• What is the optimal mini-batch size?
→ Find the "best" value of $b$
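For concreteness, here is a minimal, hypothetical Julia implementation of the update above for ridge regression (a sketch, not the reference StochOpt.jl code; the default values of b, γ and the iteration count are arbitrary placeholders):

```julia
# Mini-batch SAGA sketch for ridge regression f_i(w) = ½(aᵢᵀw − yᵢ)² + λ/2‖w‖².
using LinearAlgebra, Random

function minibatch_saga(A, y, λ; b = 16, γ = 1e-2, iters = 5_000)
    d, n = size(A)
    ∇f(i, w) = (dot(A[:, i], w) - y[i]) * A[:, i] + λ * w   # ridge ∇f_i
    w    = zeros(d)
    J    = zeros(d, n)               # Jacobian estimate: column i stores last ∇f_i
    Jbar = zeros(d)                  # running (1/n) J 1, maintained incrementally
    for k in 1:iters
        B = randperm(n)[1:b]         # b-sampling without replacement
        grads = [∇f(i, w) for i in B]                      # fresh ∇f_i(w^k)
        g = Jbar + sum(grads[t] - J[:, B[t]] for t in eachindex(B)) / b
        for t in eachindex(B)        # J^{k+1}_{:i} = ∇f_i(w^k) for i ∈ B
            i = B[t]
            Jbar += (grads[t] - J[:, i]) / n
            J[:, i] = grads[t]
        end
        w -= γ * g                   # take the step
    end
    return w
end

# Usage: A, y = randn(20, 500), randn(500); w = minibatch_saga(A, y, 0.1; b = 32)
```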
Mini-Batch SAGA Iteration Complexity

Theorem (Convergence of mini-batch SAGA¹)
Consider the iterates $w^k$ of the mini-batch SAGA algorithm. Let the step size be
$$\gamma(b) = \frac{1}{4} \cdot \frac{1}{\max\Big\{\mathcal{L}(b),\; \frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{\mu}{4}\frac{n}{b}\Big\}}$$
Given $\varepsilon > 0$, if $k \ge K_{\mathrm{iter}}(b)$, where
$$K_{\mathrm{iter}}(b) := \max\Big\{\frac{4\mathcal{L}(b)}{\mu},\; \frac{n}{b} + \frac{n-b}{n-1}\frac{4L_{\max}}{b\mu}\Big\} \log\Big(\frac{1}{\varepsilon}\Big),$$
then $\mathbb{E}\big[\|w^k - w^*\|^2\big] \le \varepsilon C$, with $C > 0$ a constant:
$$C := \|w^0 - w^*\|^2 + \frac{\gamma}{2L_{\max}} \sum_{i\in[n]} \|J^0_{:i} - \nabla f_i(w^*)\|^2 .$$

¹ Gower et al. (2018), arXiv:1805.02632, "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching"
Methodology

Given a precision $\varepsilon > 0$,

• Optimal mini-batch size
$$\text{find } b^* \in \arg\min_{b\in[n]} K_{\mathrm{total}}(b) = b \times K_{\mathrm{iter}}(b),$$
where $b$ is the number of gradients per iteration and $K_{\mathrm{iter}}(b)$ is the number of iterations $k$ needed to achieve $\mathbb{E}\big[\|w^k - w^*\|^2\big] \le \varepsilon C$

• Total complexity¹
$$K_{\mathrm{total}}(b) = \max\Big\{\frac{4b\mathcal{L}(b)}{\mu},\; n + \frac{n-b}{n-1}\frac{4L_{\max}}{\mu}\Big\} \log\Big(\frac{1}{\varepsilon}\Big)$$

• Importance of expected smoothness
  • $\mathcal{L}(b)$ embodies the complexity
  • Gives larger step sizes $\gamma$
→ Need to estimate $\mathcal{L}(b)$

¹ Gower et al. (2018), arXiv:1805.02632, "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching"
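A toy Julia sketch of this search (reusing exact_expected_smoothness from the earlier sketch, so only feasible for tiny n; taking μ = λ is a lower-bound assumption valid for ridge, not a value from the slides):

```julia
# Brute-force the optimal mini-batch size b* from K_total(b) on toy ridge data.
using LinearAlgebra

K_total(b, Lb, Lmax, μ, n, ε) =
    max(4b * Lb / μ, n + (n - b) / (n - 1) * 4Lmax / μ) * log(1 / ε)

A, λ, ε = randn(10, 8), 0.1, 1e-4
n    = size(A, 2)
Li   = [norm(A[:, i])^2 + λ for i in 1:n]
Lmax = maximum(Li)
μ    = λ                         # ridge: f is at least λ-strongly convex
costs  = [K_total(b, exact_expected_smoothness(A, λ, b), Lmax, μ, n, ε) for b in 1:n]
b_star = argmin(costs)           # b* = arg min_b K_total(b)
```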
First Proven Upper-Bound

Lemma (Simple bound)
If $S$ is a $b$–sampling without replacement,
$$\mathcal{L}(b) \le L_{\mathrm{simple}}(b) := \frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{n}{b}\frac{b-1}{n-1}\bar{L}, \qquad \text{where } \bar{L} := \frac{1}{n}\sum_{i=1}^n L_i$$

– $L_{\mathrm{simple}}(b)$ interpolates between $L_{\max}$ and $\bar{L} \ge L$ ...
– ... but $\mathcal{L}(b)$ interpolates between $L_{\max}$ and $L$

Problem: $L$ and $\bar{L}$ can be far from each other
Exploiting Matrix Concentration Inequalities

Lemma (Bernstein bound)
If $S$ is a $b$–sampling without replacement,
$$\mathcal{L}(b) \le L_{\mathrm{Bernstein}}(b) := \frac{1}{b}\Big(\frac{n-b}{n-1} + \frac{4}{3}\log(d)\Big)L_{\max} + 2\,\frac{b-1}{b}\frac{n}{n-1}L$$

• Proof idea
$$\mathcal{L}(b) = \frac{1}{\binom{n-1}{b-1}} \max_{i=1,\dots,n} \sum_{B\subseteq[n] :\, |B|=b \,\wedge\, i\in B} L_B \;\le\; [\dots] + \max_{i=1,\dots,n} \mathbb{E}\Big[\lambda_{\max}\Big(\sum_k M^i_k\Big)\Big]$$
where $(M^i_k)_k$ is a sequence of random matrices

• Technical detail
Adapt the Matrix Bernstein Inequality to sampling without replacement²

Problem: the $L_{\mathrm{Bernstein}}(b)$ approximation interpolates between $\approx \log(d)\,L_{\max}$ and $2L$

² Gross & Nesme (2010), Tropp (2011, 2015)
The "Practical" Estimate

$$L_{\mathrm{practical}}(b) := \frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{n}{b}\frac{b-1}{n-1}L$$
($L$ instead of $\bar{L}$ in the simple bound)

– Nice interpolation between $L_{\max}$ and $L$ ...
– ... yet not a proven bound: $\mathcal{L}(b) \overset{?}{\le} L_{\mathrm{practical}}(b)$

[Figure: the upper bounds and $\mathcal{L}(b)$ (smoothness constant, y-axis) vs. mini-batch size (x-axis), computed on artificial data ($n = d = 24$)]

→ Numerically, $L_{\mathrm{practical}}(b) \approx \mathcal{L}(b)$
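To see how the three estimates compare, here is a small Julia sketch (illustrative only; reuses exact_expected_smoothness from the earlier sketch, and the data and λ are made up) that tabulates them against the exact $\mathcal{L}(b)$ on tiny random ridge data:

```julia
# Compare L_simple, L_Bernstein and L_practical with the exact L(b).
using LinearAlgebra

L_simple(b, n, Lmax, Lbar)    = (n - b) / (b * (n - 1)) * Lmax + n * (b - 1) / (b * (n - 1)) * Lbar
L_practical(b, n, Lmax, L)    = (n - b) / (b * (n - 1)) * Lmax + n * (b - 1) / (b * (n - 1)) * L
L_bernstein(b, n, d, Lmax, L) = ((n - b) / (n - 1) + (4 / 3) * log(d)) / b * Lmax +
                                2 * (b - 1) / b * n / (n - 1) * L

d, n, λ = 10, 10, 0.1
A    = randn(d, n)
Li   = [norm(A[:, i])^2 + λ for i in 1:n]
Lmax, Lbar = maximum(Li), sum(Li) / n
L    = eigmax(Symmetric(A * A' / n)) + λ          # smoothness of the full f
for b in (1, 3, 5, 10)
    println("b = $b: exact = ",  round(exact_expected_smoothness(A, λ, b); digits = 2),
            ", practical = ",    round(L_practical(b, n, Lmax, L); digits = 2),
            ", simple = ",       round(L_simple(b, n, Lmax, Lbar); digits = 2),
            ", Bernstein = ",    round(L_bernstein(b, n, d, Lmax, L); digits = 2))
end
```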
Optimal Mini-Batch from the "Practical" Estimate

Given a precision $\varepsilon > 0$,

• Total complexity bound
Since $\mathcal{L}(b) \le L_{\mathrm{practical}}(b)$,
$$K_{\mathrm{total}}(b) \le \max\Big\{\underbrace{n\frac{b-1}{n-1}\frac{4L}{\mu} + \frac{n-b}{n-1}\frac{4L_{\max}}{\mu}}_{\text{linearly increasing}},\; \underbrace{n + \frac{n-b}{n-1}\frac{4L_{\max}}{\mu}}_{\text{linearly decreasing}}\Big\} \log\Big(\frac{1}{\varepsilon}\Big)$$

• Optimal mini-batch size
$$\implies b_{\mathrm{practical}} = \Big\lfloor 1 + \frac{\mu(n-1)}{4L} \Big\rfloor$$
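The closed-form rule is a one-liner; a hedged Julia sketch (μ and L must be supplied, or estimated from data):

```julia
# Closed-form optimal mini-batch size from the practical estimate.
b_practical(n, μ, L) = floor(Int, 1 + μ * (n - 1) / (4L))

b_practical(10_000, 1e-2, 1.0)   # -> 25 with these made-up values
```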
Mini-Batch SAGA Step Sizes

• Link between the step size and the expected smoothness
$$\gamma(b) = \frac{1}{4} \cdot \frac{1}{\max\Big\{\mathcal{L}(b),\; \frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{\mu}{4}\frac{n}{b}\Big\}}$$

• The smaller $\mathcal{L}(b)$ (the smoother $f_B$), the larger $\gamma(b)$
• Plug $L_{\mathrm{practical}}(b)$, $L_{\mathrm{simple}}(b)$ or $L_{\mathrm{Bernstein}}(b)$ into $\gamma(b)$

[Figure: step size (y-axis), increasing with the mini-batch size (x-axis), on the slice data set ($n = 53{,}500$, $d = 384$)]

→ Straightforward and larger step sizes for large $b$
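A small sketch of the step-size rule with an estimate plugged in (here L_practical from the earlier sketch; L_simple or L_Bernstein would be swapped in the same way):

```julia
# Step size γ(b) with L_practical substituted for the exact L(b).
γ_saga(b, n, μ, Lmax, L) =
    1 / (4 * max(L_practical(b, n, Lmax, L),
                 (n - b) / (b * (n - 1)) * Lmax + μ * n / (4b)))

# γ grows with b on made-up constants:
# γ_saga(1, 10_000, 1e-2, 50.0, 1.0) < γ_saga(100, 10_000, 1e-2, 50.0, 1.0)
```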
Numerical Experiments
Convergence Results on Real Data

[Figure: residual vs. epochs (log scale, $10^{-4}$ to $10^0$), comparing SAGA settings on the slice³ data set ($n = 53{,}500$, $d = 384$): $b_{\mathrm{Defazio}} = 1$, $\gamma_{\mathrm{Defazio}} = 6.10\mathrm{e}{-}05$; $b_{\mathrm{practical}} = 70$, $\gamma_{\mathrm{practical}} = 1.20\mathrm{e}{-}02$; $b_{\mathrm{practical}} = 70$, $\gamma_{\mathrm{gridsearch}} = 3.13\mathrm{e}{-}02$; $b_{\mathrm{Hofmann}} = 20$, $\gamma_{\mathrm{Hofmann}} = 1.59\mathrm{e}{-}03$]

Grid search over $\gamma \in \{2^{-33}, 2^{-31}, \dots, 2^{23}, 2^{25}\}$

→ Larger mini-batch and step sizes: faster convergence
→ Competitive with grid search!

³ UCI Machine Learning Repository
Optimality of Our Mini-Batch Size

[Figure: empirical total complexity ($10^{5.0}$ to $10^{7.0}$, log scale, y-axis) vs. mini-batch size ($1$ to $16384$, log scale, x-axis) for the slice data set ($n = 53{,}500$, $d = 384$); markers: $b_{\mathrm{empirical}} = 2$, $b_{\mathrm{practical}} = 70$]

Complexity explosion for the slice data set ($n = 53{,}500$, $d = 384$)

→ Regime change observed & predicting the largest mini-batch size before the complexity explosion
Optimality of Our Mini-Batch Size (cont'd)

[Figure: empirical total complexity ($10^{5.5}$ to $10^{6.0}$, log scale, y-axis) vs. mini-batch size ($1$ to $65536$, log scale, x-axis) for the real-sim⁴ data set ($n = 72{,}309$, $d = 20{,}958$); markers: $b_{\mathrm{empirical}} = 16384$, $b_{\mathrm{practical}} = 8898$]

Complexity explosion for the real-sim⁴ data set ($n = 72{,}309$, $d = 20{,}958$)

→ Regime change observed & predicting the largest mini-batch size before the complexity explosion

⁴ LIBSVM Data
Conclusion
Summary of Our Contributions

What was done for SAGA
• Estimates of the expected smoothness
  (it later turned out that $L_{\mathrm{practical}}(b)$ is an actual upper bound!⁴)
• Simple formula for the step size $\gamma(b)$
• Optimal mini-batch size $b_{\mathrm{practical}}$
• Convincing numerics verifying the optimality of our parameters on real data sets⁵

Julia code available at https://github.com/gowerrobert/StochOpt.jl/

What has been done since then
• Extended study to variants of SVRG⁶

⁴ Gower et al. (2019), ICML, "SGD: General Analysis and Improved Rates"
⁵ LIBSVM and UCI repositories
⁶ Sebbouh et al. (2019), arXiv:1908.02725, "Towards closing the gap between the theory and practice of SVRG"
References (1/2)

• Defazio, Bach and Lacoste-Julien (2014), NIPS, "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives"
• Gower, Richtárik and Bach (2018), preprint, arXiv:1805.02632, "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching"
• Gower, Loizou, Qian, Sailanbayev, Shulgin and Richtárik (2019), ICML, "SGD: General Analysis and Improved Rates"
• Sebbouh, Gazagnadou, Jelassi, Bach and Gower (2019), preprint, arXiv:1908.02725, "Towards closing the gap between the theory and practice of SVRG"
• Robbins and Monro (1951), Annals of Mathematical Statistics, "A stochastic approximation method"
• Schmidt, Le Roux and Bach (2017), Mathematical Programming, "Minimizing finite sums with the stochastic average gradient"
• Johnson and Zhang (2013), NIPS, "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction"
References (2/2)

• Tropp (2015), Foundations and Trends in Machine Learning, "An Introduction to Matrix Concentration Inequalities"
• Tropp (2012), Foundations of Computational Mathematics, "User-Friendly Tail Bounds for Sums of Random Matrices"
• Tropp (2011), Advances in Adaptive Data Analysis, "Improved analysis of the subsampled randomized Hadamard transform"
• Bach (2013), COLT, "Sharp analysis of low-rank kernel matrix approximations"
• Gross and Nesme (2010), preprint, arXiv:1001.2738, "Note on sampling without replacing from a finite collection of matrices"
• Hoeffding (1963), Journal of the American Statistical Association, "Probability inequalities for sums of bounded random variables"
• Chang and Lin (2011), ACM TIST, "LIBSVM: A library for support vector machines"
Questions?
Time Plot Convergence Results on Real Data

[Figure: residual vs. time (log scale, $10^{-4}$ to $10^0$), comparing SAGA settings on the slice⁵ data set ($n = 53{,}500$, $d = 384$): $b_{\mathrm{Defazio}} = 1$, $\gamma_{\mathrm{Defazio}} = 6.10\mathrm{e}{-}05$; $b_{\mathrm{practical}} = 70$, $\gamma_{\mathrm{practical}} = 1.20\mathrm{e}{-}02$; $b_{\mathrm{practical}} = 70$, $\gamma_{\mathrm{gridsearch}} = 3.13\mathrm{e}{-}02$; $b_{\mathrm{Hofmann}} = 20$, $\gamma_{\mathrm{Hofmann}} = 1.59\mathrm{e}{-}03$]

⁵ UCI Machine Learning Repository
JacSketch⁷ Lyapunov Function

Definition (Stochastic Lyapunov function)
$$\Psi^k := \|w^k - w^*\|_2^2 + \frac{\gamma}{2bL_{\max}} \|\mathbf{J}^k - \nabla F(w^*)\|_F^2$$

• $\|\cdot\|_F$: Frobenius norm
• $\nabla F(w) = [\nabla f_1(w), \dots, \nabla f_n(w)] \in \mathbb{R}^{d\times n}$: Jacobian matrix
• $\{w^k, \mathbf{J}^k\}_{k\ge 0}$ are the iterates and the Jacobian estimates

If $\varepsilon > 0$ denotes the desired precision, Theorem 3.6 ensures that, for a step size
$$\gamma = \min\Big\{\frac{1}{4\mathcal{L}},\; \frac{1}{\frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{\mu}{4}\frac{n}{b}}\Big\},$$
$$\mathbb{E}\big[\Psi^k\big] \le \varepsilon \Psi^0 .$$

⁷ Gower et al. (2018), arXiv:1805.02632, "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching"
Proof of the "Practical" Bound (1/2)

Lemma (Practical bound)
If $S$ is a $b$–sampling without replacement,
$$\mathcal{L}(b) \le L_{\mathrm{practical}}(b) := \frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{n}{b}\frac{b-1}{n-1}L$$

Proof: see Proposition 3.8 in Gower et al. (2019), "SGD: General Analysis and Improved Rates".
Let $S$ be a $b$–sampling without replacement with $b \in [n]$, and let $w, z \in \mathbb{R}^d$. Denote
$$p_B := \mathbb{P}[S = B] = \frac{1}{\binom{n}{b}}, \qquad p_i := \mathbb{P}[i \in S] = \frac{b}{n}, \qquad P_{ij} := \mathbb{P}[i, j \in S] = \begin{cases} \frac{b(b-1)}{n(n-1)} & \text{if } i \ne j \\ p_i = \frac{b}{n} & \text{otherwise,} \end{cases}$$
for all $B \subseteq [n]$ and all $i, j \in [n]$. Writing $\delta_i := \nabla f_i(w) - \nabla f_i(z)$,
$$\|\nabla f_S(w) - \nabla f_S(z)\|_2^2 = \frac{1}{n^2}\Big\|\sum_{i\in S} \frac{1}{p_i}\,\delta_i\Big\|_2^2 = \sum_{i,j\in S} \Big\langle \frac{1}{np_i}\,\delta_i,\; \frac{1}{np_j}\,\delta_j \Big\rangle .$$
Then we take the expectation over all possible mini-batches ($B \subseteq [n] : |B| = b$):
$$\mathbb{E}\big[\|\nabla f_S(w) - \nabla f_S(z)\|_2^2\big] = \sum_B p_B \sum_{i,j\in B} \Big\langle \frac{1}{np_i}\,\delta_i, \frac{1}{np_j}\,\delta_j \Big\rangle \overset{\text{double counting}}{=} \sum_{i,j=1}^n \sum_{B :\, i,j\in B} p_B \Big\langle \frac{1}{np_i}\,\delta_i, \frac{1}{np_j}\,\delta_j \Big\rangle \overset{\sum_{B :\, i,j\in B} p_B = P_{ij}}{=} \sum_{i,j=1}^n P_{ij} \Big\langle \frac{1}{np_i}\,\delta_i, \frac{1}{np_j}\,\delta_j \Big\rangle$$
Proof of the "Practical" Bound (2/2)

Now consider the two disjoint cases $i \ne j$ and $i = j$ (still writing $\delta_i := \nabla f_i(w) - \nabla f_i(z)$):
$$\mathbb{E}\big[\|\nabla f_S(w) - \nabla f_S(z)\|_2^2\big] = \sum_{i,j=1}^n \frac{P_{ij}}{p_i p_j} \Big\langle \frac{1}{n}\delta_i, \frac{1}{n}\delta_j \Big\rangle = \sum_{i,j :\, i \ne j} \frac{n(b-1)}{b(n-1)} \Big\langle \frac{1}{n}\delta_i, \frac{1}{n}\delta_j \Big\rangle + \sum_{i=1}^n \frac{1}{n^2}\frac{1}{p_i}\,\|\delta_i\|_2^2$$
$$= \frac{n(b-1)}{b(n-1)} \sum_{i,j=1}^n \Big\langle \frac{1}{n}\delta_i, \frac{1}{n}\delta_j \Big\rangle + \sum_{i=1}^n \frac{1}{n^2}\frac{1}{p_i}\Big(1 - \frac{n(b-1)}{b(n-1)}\,p_i\Big)\|\delta_i\|_2^2$$
Since each $f_i$ is smooth and convex, and $\sum_{i,j}\big\langle \frac{1}{n}\delta_i, \frac{1}{n}\delta_j\big\rangle = \|\nabla f(w) - \nabla f(z)\|_2^2$:
$$\le \frac{n(b-1)}{b(n-1)}\,\|\nabla f(w) - \nabla f(z)\|_2^2 + 2\sum_{i=1}^n \frac{L_i}{n^2 p_i}\Big(1 - \frac{n(b-1)}{b(n-1)}\,p_i\Big)\big(f_i(w) - f_i(z) - \langle \nabla f_i(z), w - z\rangle\big)$$
Since $f$ is smooth and convex:
$$\le 2\Big(\frac{n(b-1)}{b(n-1)}\,L + \max_{i=1,\dots,n} \frac{L_i}{n p_i}\Big(1 - \frac{n(b-1)}{b(n-1)}\,p_i\Big)\Big)\big(f(w) - f(z) - \langle \nabla f(z), w - z\rangle\big)$$
$$= 2\Big(\frac{n(b-1)}{b(n-1)}\,L + \max_{i=1,\dots,n} \frac{L_i}{b}\Big(1 - \frac{b-1}{n-1}\Big)\Big)\big(f(w) - f(z) - \langle \nabla f(z), w - z\rangle\big)$$
$$= 2\underbrace{\Big(\frac{1}{b}\frac{n-b}{n-1}\,L_{\max} + \frac{n}{b}\frac{b-1}{n-1}\,L\Big)}_{L_{\mathrm{practical}}(b)}\big(f(w) - f(z) - \langle \nabla f(z), w - z\rangle\big)$$
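As a sanity check, the final inequality can be verified numerically on toy ridge data by enumerating every mini-batch. A hedged Julia sketch (all data and constants below are made up):

```julia
# Verify E‖∇f_S(w) − ∇f_S(z)‖² ≤ 2 L_practical(b) (f(w) − f(z) − ⟨∇f(z), w − z⟩).
using LinearAlgebra, Combinatorics

d, n, b, λ = 5, 8, 3, 0.1
A, y = randn(d, n), randn(n)
fi(i, w)  = 0.5 * (dot(A[:, i], w) - y[i])^2 + λ / 2 * norm(w)^2
∇fi(i, w) = (dot(A[:, i], w) - y[i]) * A[:, i] + λ * w
f(w)  = sum(fi(i, w) for i in 1:n) / n
∇f(w) = sum(∇fi(i, w) for i in 1:n) / n
∇fS(B, w) = sum(∇fi(i, w) for i in B) / b

w, z = randn(d), randn(d)
lhs  = sum(norm(∇fS(B, w) - ∇fS(B, z))^2 for B in combinations(1:n, b)) / binomial(n, b)
Li   = [norm(A[:, i])^2 + λ for i in 1:n]
Lmax = maximum(Li)
L    = eigmax(Symmetric(A * A' / n)) + λ
Lp   = (n - b) / (b * (n - 1)) * Lmax + n * (b - 1) / (b * (n - 1)) * L
rhs  = 2Lp * (f(w) - f(z) - dot(∇f(z), w - z))
@assert lhs ≤ rhs
```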
From Sampling Without to With Replacement

Lemma (Domination of the trace of the mgf of a sample without replacement)
Consider two finite sequences of the same length, $\{X_k\}$ and $\{M_k\}$, of Hermitian random matrices of the same size, sampled respectively with and without replacement from a finite set $\mathcal{X}$. Let $\theta \in \mathbb{R}$; then
$$\mathbb{E}\,\mathrm{tr}\exp\Big(\theta\sum_k M_k\Big) \le \mathbb{E}\,\mathrm{tr}\exp\Big(\theta\sum_k X_k\Big).$$

See Gross and Nesme (2010), arXiv:1001.2738, "Note on sampling without replacing from a finite collection of matrices"
Proof Sketch of the Bernstein Bound (1/2)

(i) Write $\mathcal{L}(b)$ as an expectation: with $S^i$ a $(b-1)$–sampling without replacement from $[n]\setminus\{i\}$,
$$\mathcal{L}(b) = \max_{i=1,\dots,n} \mathbb{E}\big[L_{S^i\cup\{i\}}\big] = \max_{i=1,\dots,n} \mathbb{E}\Big[\lambda_{\max}\Big(\frac{1}{b}\sum_{j\in S^i\cup\{i\}} a_j a_j^\top\Big)\Big]$$
$$\le \underbrace{\frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{n}{b}\frac{b-1}{n-1}L}_{\text{practical approximation}} + \max_{i=1,\dots,n} \mathbb{E}\Big[\lambda_{\max}\Big(\underbrace{\frac{1}{b}\sum_{j\in S^i} a_j a_j^\top - \frac{1}{b}\frac{b-1}{n-1}\sum_{j\in[n]\setminus\{i\}} a_j a_j^\top}_{N \,=\, \sum_k M_k}\Big)\Big]$$
Proof Sketch of the Bernstein Bound (2/2)

(ii) Write $N$ as a sum of random matrices and apply

Theorem (Matrix Bernstein Inequality Without Replacement)
Let $\mathcal{X}$ be a finite set of Hermitian matrices with dimension $d$ s.t. $\lambda_{\max}(X) \le L$ for all $X \in \mathcal{X}$. Sample $\{X_k\}$ and $\{M_k\}$ uniformly at random from $\mathcal{X}$, respectively with and without replacement, s.t. $\mathbb{E}X_k = 0$ for all $k$. Let $Y := \sum_k X_k$ and $N := \sum_k M_k$. Then
$$\mathbb{E}\,\lambda_{\max}(N) \le \sqrt{2v(Y)\log d} + \frac{1}{3}L\log d,$$
where $v(Y) := \|\mathbb{E}Y^2\| = \big\|\sum_k \mathbb{E}X_k^2\big\| = \lambda_{\max}\big(\sum_k \mathbb{E}X_k^2\big)$.

(Tropp, 2011, 2012, 2015; Gross and Nesme, 2010; Bach, 2013)
Stochastic Reformulation of the ERM

• Sampling vector
Let $v \in \mathbb{R}^n$ be a random vector with distribution $\mathcal{D}$ s.t. $\mathbb{E}_\mathcal{D}[v] = \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector

• Unbiased subsampled function
$$f_v(w) := \frac{1}{n}\sum_{i=1}^n f_i(w)\,v_i = \frac{1}{n}\langle F(w), v\rangle, \qquad \text{where } F(w) := (f_1(w), \dots, f_n(w))^\top \in \mathbb{R}^n$$
$$\implies \mathbb{E}_\mathcal{D}[f_v(w)] = \frac{1}{n}\langle F(w), \mathbb{E}_\mathcal{D}[v]\rangle = f(w)$$

• ERM reformulation
$$\text{solving ERM} \iff \text{find } w^* \in \arg\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f_i(w) = \mathbb{E}_\mathcal{D}[f_v(w)]$$
Arbitrary Sampling

Examples of sampling vectors (see the sketch after this list)
Let $\nabla F(w) := [\nabla f_1(w), \dots, \nabla f_n(w)] \in \mathbb{R}^{d\times n}$ be the Jacobian

• Let $v = \mathbf{1}$ (deterministic):
$$\nabla f_v(w) = \frac{1}{n}\nabla F(w)\mathbf{1} = \nabla f(w) \qquad \text{(GD)}$$

• Let $v = n e_i$ (where $e_i$ is the $i$-th basis vector) with probability $1/n$ for each $i \in [n]$:
$$\nabla f_v(w) = \frac{1}{n}\nabla F(w)\,n e_i = \nabla f_i(w) \qquad \text{(SGD)}$$

• Let $v = (n/b)\sum_{i\in B} e_i$ with $B$ drawn uniformly from $[n]$ s.t. $|B| = b$:
$$\nabla f_v(w) = \frac{1}{n}\nabla F(w)\,\frac{n}{b}\sum_{i\in B} e_i = \frac{1}{b}\sum_{i\in B}\nabla f_i(w) \qquad \text{($b$–SGD without replacement)}$$

→ Arbitrary sampling covers many common methods
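A small Julia sketch of these sampling vectors (illustrative only; the matrix G below stands in for the Jacobian at some fixed w, and the Monte-Carlo check of unbiasedness uses made-up sizes):

```julia
# Sampling vectors v with E[v] = 1, and the estimator ∇f_v(w) = (1/n) ∇F(w) v.
using LinearAlgebra, Statistics, Random

d, n, b = 4, 6, 2
G  = randn(d, n)                        # stands in for the Jacobian ∇F(w)
∇f = vec(mean(G, dims = 2))             # full gradient (1/n) ∇F(w) 1

v_gd()  = ones(n)                       # GD: v = 1
v_sgd() = n .* ((1:n) .== rand(1:n))    # SGD: v = n e_i, i uniform
function v_bsgd()                       # b-SGD: v = (n/b) Σ_{i∈B} e_i
    v = zeros(n); v[randperm(n)[1:b]] .= n / b; v
end

∇fv(v) = G * v / n
@assert ∇fv(v_gd()) ≈ ∇f                # v = 1 recovers the full gradient
est = mean(∇fv(v_bsgd()) for _ in 1:200_000)   # E[∇f_v] ≈ ∇f (v_sgd behaves alike)
@assert isapprox(est, ∇f; atol = 1e-2)
```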
Unbiased Gradient Estimate

• Unbiased estimates
– Function: $\mathbb{E}_\mathcal{D}[f_v(w)] = \frac{1}{n}\langle F(w), \mathbb{E}_\mathcal{D}[v]\rangle = f(w)$
– Gradient: $\mathbb{E}_\mathcal{D}[\nabla f_v(w)] = \frac{1}{n}\nabla F(w)\,\mathbb{E}_\mathcal{D}[v] = \nabla f(w)$,
where $\nabla F(w) := [\nabla f_1(w), \dots, \nabla f_n(w)] \in \mathbb{R}^{d\times n}$

→ Motivates the iteration $w^{k+1} = w^k - \gamma_k \nabla f_v(w^k)$, where $(\gamma_k)_k$ is a step size sequence
Reducing the Variance With Control Variates

• Controlled stochastic reformulation of the ERM
$$\text{find } w^* \in \arg\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f_i(w) = \mathbb{E}_\mathcal{D}\big[f_v(w) \underbrace{-\,z_v(w) + \mathbb{E}_\mathcal{D}[z_v(w)]}_{\text{unbiased correction term}}\big],$$
with $z_v(\cdot)$ a random function

• Stochastic variance-reduced gradient estimator
$$g_v(w) := \nabla f_v(w) - \nabla z_v(w) + \mathbb{E}_\mathcal{D}[\nabla z_v(w)]$$
Iteration: $w^{k+1} = w^k - \gamma\, g_v(w^k)$

→ Recovers stochastic variance-reduced methods
Advantages of Stochastic Variance-Reduced Methods

Brief recap

• Iterative methods
$$w^{k+1} = w^k - \gamma\, g_v(w^k)$$
such that
– Unbiased gradient: $\mathbb{E}\big[g_v(w^k)\big] = \nabla f(w^k)$
– Decreasing variance: $\mathbb{E}\big[\|g_v(w^k) - \nabla f(w^k)\|_2^2\big] \xrightarrow[w^k \to w^*]{} 0$

• Advantages
– No decreasing step sizes $(\gamma_k)_k$ needed
– $f_v(w)$ determines the smoothness of the controlled stochastic reformulation

Assumption (Expected Smoothness)
$$\mathbb{E}\big[\|\nabla f_v(w) - \nabla f_v(w^*)\|_2^2\big] \le 2\mathcal{L}\,\big(f(w) - f(w^*)\big)$$

– No "bounded gradient" assumption such as $\mathbb{E}\big[\|\nabla f_v(w^k)\|_2^2\big] \le \mathrm{cst}$
Example: Recovering SAGA Algorithm

• SAGA⁶
$$f_v(w) = \frac{1}{n}\langle F(w), v\rangle \implies \nabla f_v(w) = \frac{1}{n}\nabla F(w)\,v$$
$$z_v(w) = \frac{1}{n}\big\langle \underbrace{\mathbf{J}^\top w}_{\text{linear estimation of } F(w)},\, v\big\rangle \implies \nabla z_v(w) = \frac{1}{n}\mathbf{J}v,$$
with $\mathbf{J} \in \mathbb{R}^{d\times n}$ an estimate of the Jacobian

• If $v = n e_i$, where $e_i$ is the $i$-th basis vector:
$$g_v(w) := \nabla f_v(w) - \nabla z_v(w) + \mathbb{E}_\mathcal{D}[\nabla z_v(w)] = \frac{1}{n}\nabla F(w)\,n e_i - \frac{1}{n}\mathbf{J}\,n e_i + \mathbb{E}\Big[\frac{1}{n}\mathbf{J}v\Big] = \nabla f_i(w) - \mathbf{J}_{:i} + \frac{1}{n}\mathbf{J}\mathbf{1},$$
where $\mathbf{1}$ is the all-ones vector and $\mathbf{J}_{:i}$ the $i$-th column of $\mathbf{J}$

• Variance term
Convergence analysis: $\mathbb{E}_\mathcal{D}\big[\|g_v(w) - \nabla f(w)\|_2^2\big]$ is low for $\mathbf{J} \approx \nabla F(w)$

⁶ Defazio, Bach and Lacoste-Julien (2014), NIPS, "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives"
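A short Julia sketch of the recovered SAGA estimator (illustrative; Gw stands in for the true Jacobian at a fixed point, J for its estimate, and the sizes are made up):

```julia
# SAGA estimator g_v(w) = ∇f_i(w) − J_{:i} + (1/n) J 1 from the control-variate
# template with v = n e_i.
using LinearAlgebra, Statistics

d, n = 4, 6
Gw = randn(d, n)                        # stands in for ∇F(w)
J  = randn(d, n)                        # Jacobian estimate
g_saga(i) = Gw[:, i] - J[:, i] + vec(mean(J, dims = 2))

# Unbiasedness for fixed J: averaging over i = 1..n returns ∇f(w).
@assert mean(g_saga(i) for i in 1:n) ≈ vec(mean(Gw, dims = 2))
# And the variance term vanishes when J = ∇F(w): then g_saga(i) ≡ ∇f(w).
```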