Optimal Mini-Batch and Step Sizes for SAGA
Nidham Gazagnadou¹
joint work with Robert M. Gower¹ & Joseph Salmon²

¹ Télécom Paris, Institut Polytechnique de Paris, Paris, France
² IMAG, Université de Montpellier, CNRS, Montpellier, France

This work was supported by grants from Région Île-de-France.
Goals of this Work

• Finite-sum minimization problem

Find the optimal parameter/model $w^*$ s.t.
$$w^* = \arg\min_{w \in \mathbb{R}^d} \Big[ f(w) := \frac{1}{n} \sum_{i=1}^n f_i(w) \Big] \qquad (P)$$

• Stochastic Gradient Descent (SGD) and practitioners
– SGD and its variance-reduced variants are widely used to solve (P) ...
– ... yet hyper-parameter tuning remains painful

• This presentation
→ Provide theoretically optimal step and mini-batch sizes for the SAGA algorithm

with
– $f$ $L$–smooth and $\mu$–strongly convex
– $f_i$ $L_i$–smooth, $\forall i \in [n]$

• Includes problems such as
• Ridge regression: $f_i(w) = \frac{1}{2}(a_i^\top w - y_i)^2 + \frac{\lambda}{2}\|w\|_2^2$
• Regularized logistic regression: $f_i(w) = \log(1 + e^{-y_i a_i^\top w}) + \frac{\lambda}{2}\|w\|_2^2$

where $f_i$ is the loss function of the $i$-th data sample, with
– $a_i \in \mathbb{R}^d$: feature vector (input)
– $y_i \in \mathbb{R}$ or $\{-1, 1\}$: label (output)
– $\lambda > 0$: ridge/Tikhonov regularization parameter
Expected Smoothness Constant
Supervised Learning Optimization Problem

• Empirical Risk Minimization (ERM)

Find the optimal parameter/model $w^*$ s.t.
$$w^* = \arg\min_{w \in \mathbb{R}^d} \Big[ f(w) := \frac{1}{n} \sum_{i=1}^n f_i(w) \Big] \qquad (P)$$

with
– $f$ is $L$–smooth and $\mu$–strongly convex
– $f_i$ is $L_i$–smooth, $\forall i \in [n] := \{1, \dots, n\}$

• Includes problems such as
– Ridge regression: $f_i(w) = \frac{1}{2}(a_i^\top w - y_i)^2 + \frac{\lambda}{2}\|w\|_2^2$
– Regularized logistic regression: $f_i(w) = \log(1 + e^{-y_i a_i^\top w}) + \frac{\lambda}{2}\|w\|_2^2$

where $f_i$ is the loss function of the $i$-th data sample, with
– $a_i \in \mathbb{R}^d$: feature vector (input)
– $y_i \in \mathbb{R}$ or $\{-1, 1\}$: label (output)
– $\lambda > 0$: ridge/Tikhonov regularization parameter
Solving the ERM Problem

Given a precision $\varepsilon > 0$

• Gradient Descent (GD)
– Step update: $w^{k+1} = w^k - \frac{1}{L}\nabla f(w^k)$
– Iteration complexity: $O\big(\frac{L}{\mu}\log(1/\varepsilon)\big)$

• SAGA (or SVRG/SARAH)
Let $L_{\max} := \max_{i\in[n]} L_i$
– Step update: $w^{k+1} = w^k - \frac{1}{3(n\mu + L_{\max})}\big[\nabla f_i(w^k) - J^k_{:i} + \frac{1}{n}J^k\mathbf{1}\big]$,
where the columns $J^k_{:i}$ are previous gradients stored in the Jacobian estimate $J^k$
– Iteration complexity: $O\big(\big(\frac{L_{\max}}{\mu} + n\big)\log(1/\varepsilon)\big)$

• Distance between $L$ and $L_{\max}$
$$L \le L_{\max} \le nL,$$
and when $n$ is big, possibly $L \ll L_{\max}$

→ Can we benefit from mini-batching to find an interpolating smoothness constant $\mathcal{L}$ s.t. $L \overset{?}{\le} \mathcal{L} \overset{?}{\le} L_{\max}$?
Key Constant: Expected Smoothness

Definition (Subsampled/batch function)
Let $B \subseteq [n]$ be a mini-batch of size $|B| = b$ and define
$$f_B(w) := \frac{1}{b} \sum_{i \in B} f_i(w),$$
and let $L_B$ denote the smallest constant s.t. $f_B$ is $L_B$–smooth.

Recovering known smoothness constants
– $B = [n] \implies L_B = L$
– $B = \{i\} \implies L_B = L_i$

Assumption (Expected Smoothness)
Let $S \subseteq [n]$ be a random set of $b$ points sampled without replacement. There exists $\mathcal{L}(b) > 0$ s.t.
$$\mathbb{E}\big[\|\nabla f_S(w) - \nabla f_S(w^*)\|_2^2\big] \le 2\mathcal{L}(b)\,\big(f(w) - f(w^*)\big)$$
Key Constant: Expected Smoothness (cont'd)

Definition ($b$–sampling without replacement)
$S$ (a random set-valued mapping) is a $b$–sampling without replacement if
$$\mathbb{P}[S = B] = \frac{1}{\binom{n}{b}} \qquad \forall B \subseteq [n] : |B| = b$$

Lemma (Expected smoothness formula)
For a $b$–sampling without replacement,
$$\mathcal{L}(b) := \frac{1}{\binom{n-1}{b-1}} \max_{i=1,\dots,n} \sum_{B \subseteq [n] :\, |B| = b \,\wedge\, i \in B} L_B$$

This is a consequence of $f_B$ being $L_B$–smooth and convex for every $B \subseteq [n]$.

Problem: computing $\mathcal{L}(b)$ is intractable for large $n$
Extreme Values of the Expected Smoothness

Let $S$ be a $b$–sampling without replacement:
$$\mathcal{L}(b) = \frac{1}{\binom{n-1}{b-1}} \max_{i=1,\dots,n} \sum_{B \subseteq [n] :\, |B| = b \,\wedge\, i \in B} L_B$$

• If $b = 1$
  • Recovered algorithm: SAGA
  • $B \in \{\{1\}, \dots, \{n\}\}$, so
$$\mathcal{L}(1) = \frac{1}{\binom{n-1}{0}} \max_{i=1,\dots,n} \sum_{B \in \{\{1\},\dots,\{n\}\} :\, i \in B} L_B = \max_{i=1,\dots,n} L_i = L_{\max}$$

• If $b = n$
  • Recovered algorithm: Gradient Descent
  • $B = [n]$, so
$$\mathcal{L}(n) = \frac{1}{\binom{n-1}{n-1}} \max_{i=1,\dots,n} \sum_{B = [n]} L_B = L$$

→ $\mathcal{L}(b)$ interpolates between $L_{\max}$ and $L$
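The exact formula is easy to evaluate by brute force when $n$ is tiny, which makes the interpolation claim checkable numerically. Below is a minimal Julia sketch (illustrative only, not from the StochOpt.jl code base) assuming ridge regression, for which $f_B$ is $L_B$–smooth with $L_B = \lambda_{\max}(A_B A_B^\top / b) + \lambda$, where $A_B$ is the sub-matrix of features in $B$:

```julia
# Exact L(b) by enumerating all mini-batches; feasible only for tiny n.
# Assumes ridge regression, so L_B = λ_max(A_B A_B'/b) + λ (a sketch).
using LinearAlgebra, Combinatorics   # Combinatorics.jl provides `combinations`

function exact_expected_smoothness(A::AbstractMatrix, λ::Real, b::Int)
    n = size(A, 2)
    L_B(B) = eigmax(Symmetric(A[:, B] * A[:, B]' / b)) + λ   # smoothness of f_B
    sums = zeros(n)                  # sums[i] accumulates Σ_{B : i ∈ B} L_B
    for B in combinations(1:n, b)
        sums[B] .+= L_B(B)
    end
    return maximum(sums) / binomial(n - 1, b - 1)
end

A, λ = randn(10, 8), 0.1             # toy data: d = 10, n = 8
Li = [norm(A[:, i])^2 + λ for i in 1:8]                      # ridge: L_i = ‖a_i‖² + λ
@assert exact_expected_smoothness(A, λ, 1) ≈ maximum(Li)                       # L(1) = Lmax
@assert exact_expected_smoothness(A, λ, 8) ≈ eigmax(Symmetric(A * A' / 8)) + λ # L(n) = L
```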
Optimal Mini-Batch and Step Sizes for SAGA
Mini-Batch SAGA

• The algorithm
For $k = 0, 1, 2, \dots$
– Sample a mini-batch $B \subseteq [n]$ s.t. $|B| = b$
– Compute a gradient estimate
$$g(w^k) = \frac{1}{b}\sum_{i\in B} \nabla f_i(w^k) - \frac{1}{b}\sum_{i\in B} J^k_{:i} + \frac{1}{n}J^k\mathbf{1}$$
– Take a step
$$w^{k+1} = w^k - \gamma\, g(w^k)$$
– Update the Jacobian estimate:
$$J^{k+1}_{:i} = \nabla f_i(w^k), \quad \forall i \in B \qquad \text{(columns } i \notin B \text{ are left unchanged)}$$

• What is the optimal mini-batch size?
→ Find the "best" value of $b$
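For concreteness, here is a minimal, hypothetical Julia implementation of the update above for ridge regression (a sketch, not the reference StochOpt.jl code; the default values of b, γ and the iteration count are arbitrary placeholders):

```julia
# Mini-batch SAGA sketch for ridge regression f_i(w) = ½(aᵢᵀw − yᵢ)² + λ/2‖w‖².
using LinearAlgebra, Random

function minibatch_saga(A, y, λ; b = 16, γ = 1e-2, iters = 5_000)
    d, n = size(A)
    ∇f(i, w) = (dot(A[:, i], w) - y[i]) * A[:, i] + λ * w   # ridge ∇f_i
    w    = zeros(d)
    J    = zeros(d, n)               # Jacobian estimate: column i stores last ∇f_i
    Jbar = zeros(d)                  # running (1/n) J 1, maintained incrementally
    for k in 1:iters
        B = randperm(n)[1:b]         # b-sampling without replacement
        grads = [∇f(i, w) for i in B]                      # fresh ∇f_i(w^k)
        g = Jbar + sum(grads[t] - J[:, B[t]] for t in eachindex(B)) / b
        for t in eachindex(B)        # J^{k+1}_{:i} = ∇f_i(w^k) for i ∈ B
            i = B[t]
            Jbar += (grads[t] - J[:, i]) / n
            J[:, i] = grads[t]
        end
        w -= γ * g                   # take the step
    end
    return w
end

# Usage: A, y = randn(20, 500), randn(500); w = minibatch_saga(A, y, 0.1; b = 32)
```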
Mini-Batch SAGA Iteration Complexity

Theorem (Convergence of mini-batch SAGA¹)
Consider the iterates $w^k$ of the mini-batch SAGA algorithm. Let the step size be
$$\gamma(b) = \frac{1}{4} \cdot \frac{1}{\max\Big\{\mathcal{L}(b),\; \frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{\mu}{4}\frac{n}{b}\Big\}}$$
Given $\varepsilon > 0$, if $k \ge K_{\mathrm{iter}}(b)$, where
$$K_{\mathrm{iter}}(b) := \max\Big\{\frac{4\mathcal{L}(b)}{\mu},\; \frac{n}{b} + \frac{n-b}{n-1}\frac{4L_{\max}}{b\mu}\Big\} \log\Big(\frac{1}{\varepsilon}\Big),$$
then $\mathbb{E}\big[\|w^k - w^*\|^2\big] \le \varepsilon C$, with $C > 0$ a constant:
$$C := \|w^0 - w^*\|^2 + \frac{\gamma}{2L_{\max}} \sum_{i\in[n]} \|J^0_{:i} - \nabla f_i(w^*)\|^2 .$$

¹ Gower et al. (2018), arXiv:1805.02632, "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching"
Methodology

Given a precision $\varepsilon > 0$,

• Optimal mini-batch size
$$\text{find } b^* \in \arg\min_{b\in[n]} K_{\mathrm{total}}(b) = b \times K_{\mathrm{iter}}(b),$$
where $b$ is the number of gradients per iteration and $K_{\mathrm{iter}}(b)$ is the number of iterations $k$ needed to achieve $\mathbb{E}\big[\|w^k - w^*\|^2\big] \le \varepsilon C$

• Total complexity¹
$$K_{\mathrm{total}}(b) = \max\Big\{\frac{4b\mathcal{L}(b)}{\mu},\; n + \frac{n-b}{n-1}\frac{4L_{\max}}{\mu}\Big\} \log\Big(\frac{1}{\varepsilon}\Big)$$

• Importance of expected smoothness
  • $\mathcal{L}(b)$ embodies the complexity
  • Gives larger step sizes $\gamma$
→ Need to estimate $\mathcal{L}(b)$

¹ Gower et al. (2018), arXiv:1805.02632, "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching"
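A toy Julia sketch of this search (reusing exact_expected_smoothness from the earlier sketch, so only feasible for tiny n; taking μ = λ is a lower-bound assumption valid for ridge, not a value from the slides):

```julia
# Brute-force the optimal mini-batch size b* from K_total(b) on toy ridge data.
using LinearAlgebra

K_total(b, Lb, Lmax, μ, n, ε) =
    max(4b * Lb / μ, n + (n - b) / (n - 1) * 4Lmax / μ) * log(1 / ε)

A, λ, ε = randn(10, 8), 0.1, 1e-4
n    = size(A, 2)
Li   = [norm(A[:, i])^2 + λ for i in 1:n]
Lmax = maximum(Li)
μ    = λ                         # ridge: f is at least λ-strongly convex
costs  = [K_total(b, exact_expected_smoothness(A, λ, b), Lmax, μ, n, ε) for b in 1:n]
b_star = argmin(costs)           # b* = arg min_b K_total(b)
```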
First Proven Upper-Bound

Lemma (Simple bound)
If $S$ is a $b$–sampling without replacement,
$$\mathcal{L}(b) \le L_{\mathrm{simple}}(b) := \frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{n}{b}\frac{b-1}{n-1}\bar{L}, \qquad \text{where } \bar{L} := \frac{1}{n}\sum_{i=1}^n L_i$$

– $L_{\mathrm{simple}}(b)$ interpolates between $L_{\max}$ and $\bar{L} \ge L$ ...
– ... but $\mathcal{L}(b)$ interpolates between $L_{\max}$ and $L$

Problem: $L$ and $\bar{L}$ can be far from each other
Exploiting Matrix Concentration Inequalities

Lemma (Bernstein bound)
If $S$ is a $b$–sampling without replacement,
$$\mathcal{L}(b) \le L_{\mathrm{Bernstein}}(b) := \frac{1}{b}\Big(\frac{n-b}{n-1} + \frac{4}{3}\log(d)\Big)L_{\max} + 2\,\frac{b-1}{b}\frac{n}{n-1}L$$

• Proof idea
$$\mathcal{L}(b) = \frac{1}{\binom{n-1}{b-1}} \max_{i=1,\dots,n} \sum_{B\subseteq[n] :\, |B|=b \,\wedge\, i\in B} L_B \;\le\; [\dots] + \max_{i=1,\dots,n} \mathbb{E}\Big[\lambda_{\max}\Big(\sum_k M^i_k\Big)\Big]$$
where $(M^i_k)_k$ is a sequence of random matrices

• Technical detail
Adapt the Matrix Bernstein Inequality to sampling without replacement²

Problem: the $L_{\mathrm{Bernstein}}(b)$ approximation interpolates between $\approx \log(d)\,L_{\max}$ and $2L$

² Gross & Nesme (2010), Tropp (2011, 2015)
The "Practical" Estimate

$$L_{\mathrm{practical}}(b) := \frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{n}{b}\frac{b-1}{n-1}L$$
($L$ instead of $\bar{L}$ in the simple bound)

– Nice interpolation between $L_{\max}$ and $L$ ...
– ... yet not a proven bound: $\mathcal{L}(b) \overset{?}{\le} L_{\mathrm{practical}}(b)$

[Figure: the upper bounds and $\mathcal{L}(b)$ (smoothness constant, y-axis) vs. mini-batch size (x-axis), computed on artificial data ($n = d = 24$)]

→ Numerically, $L_{\mathrm{practical}}(b) \approx \mathcal{L}(b)$
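To see how the three estimates compare, here is a small Julia sketch (illustrative only; reuses exact_expected_smoothness from the earlier sketch, and the data and λ are made up) that tabulates them against the exact $\mathcal{L}(b)$ on tiny random ridge data:

```julia
# Compare L_simple, L_Bernstein and L_practical with the exact L(b).
using LinearAlgebra

L_simple(b, n, Lmax, Lbar)    = (n - b) / (b * (n - 1)) * Lmax + n * (b - 1) / (b * (n - 1)) * Lbar
L_practical(b, n, Lmax, L)    = (n - b) / (b * (n - 1)) * Lmax + n * (b - 1) / (b * (n - 1)) * L
L_bernstein(b, n, d, Lmax, L) = ((n - b) / (n - 1) + (4 / 3) * log(d)) / b * Lmax +
                                2 * (b - 1) / b * n / (n - 1) * L

d, n, λ = 10, 10, 0.1
A    = randn(d, n)
Li   = [norm(A[:, i])^2 + λ for i in 1:n]
Lmax, Lbar = maximum(Li), sum(Li) / n
L    = eigmax(Symmetric(A * A' / n)) + λ          # smoothness of the full f
for b in (1, 3, 5, 10)
    println("b = $b: exact = ",  round(exact_expected_smoothness(A, λ, b); digits = 2),
            ", practical = ",    round(L_practical(b, n, Lmax, L); digits = 2),
            ", simple = ",       round(L_simple(b, n, Lmax, Lbar); digits = 2),
            ", Bernstein = ",    round(L_bernstein(b, n, d, Lmax, L); digits = 2))
end
```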
Optimal Mini-Batch from the "Practical" Estimate

Given a precision $\varepsilon > 0$,

• Total complexity bound
Since $\mathcal{L}(b) \le L_{\mathrm{practical}}(b)$,
$$K_{\mathrm{total}}(b) \le \max\Big\{\underbrace{n\frac{b-1}{n-1}\frac{4L}{\mu} + \frac{n-b}{n-1}\frac{4L_{\max}}{\mu}}_{\text{linearly increasing}},\; \underbrace{n + \frac{n-b}{n-1}\frac{4L_{\max}}{\mu}}_{\text{linearly decreasing}}\Big\} \log\Big(\frac{1}{\varepsilon}\Big)$$

• Optimal mini-batch size
$$\implies b_{\mathrm{practical}} = \Big\lfloor 1 + \frac{\mu(n-1)}{4L} \Big\rfloor$$
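The closed-form rule is a one-liner; a hedged Julia sketch (μ and L must be supplied, or estimated from data):

```julia
# Closed-form optimal mini-batch size from the practical estimate.
b_practical(n, μ, L) = floor(Int, 1 + μ * (n - 1) / (4L))

b_practical(10_000, 1e-2, 1.0)   # -> 25 with these made-up values
```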
Mini-Batch SAGA Step Sizes

• Link between the step size and the expected smoothness
$$\gamma(b) = \frac{1}{4} \cdot \frac{1}{\max\Big\{\mathcal{L}(b),\; \frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{\mu}{4}\frac{n}{b}\Big\}}$$

• The smaller $\mathcal{L}(b)$ (the smoother $f_B$), the larger $\gamma(b)$
• Plug $L_{\mathrm{practical}}(b)$, $L_{\mathrm{simple}}(b)$ or $L_{\mathrm{Bernstein}}(b)$ into $\gamma(b)$

[Figure: step size (y-axis), increasing with the mini-batch size (x-axis), on the slice data set ($n = 53{,}500$, $d = 384$)]

→ Straightforward and larger step sizes for large $b$
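A small sketch of the step-size rule with an estimate plugged in (here L_practical from the earlier sketch; L_simple or L_Bernstein would be swapped in the same way):

```julia
# Step size γ(b) with L_practical substituted for the exact L(b).
γ_saga(b, n, μ, Lmax, L) =
    1 / (4 * max(L_practical(b, n, Lmax, L),
                 (n - b) / (b * (n - 1)) * Lmax + μ * n / (4b)))

# γ grows with b on made-up constants:
# γ_saga(1, 10_000, 1e-2, 50.0, 1.0) < γ_saga(100, 10_000, 1e-2, 50.0, 1.0)
```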
Numerical Experiments
Convergence Results on Real Data

[Figure: residual vs. epochs (log scale, $10^{-4}$ to $10^0$), comparing SAGA settings on the slice³ data set ($n = 53{,}500$, $d = 384$): $b_{\mathrm{Defazio}} = 1$, $\gamma_{\mathrm{Defazio}} = 6.10\mathrm{e}{-}05$; $b_{\mathrm{practical}} = 70$, $\gamma_{\mathrm{practical}} = 1.20\mathrm{e}{-}02$; $b_{\mathrm{practical}} = 70$, $\gamma_{\mathrm{gridsearch}} = 3.13\mathrm{e}{-}02$; $b_{\mathrm{Hofmann}} = 20$, $\gamma_{\mathrm{Hofmann}} = 1.59\mathrm{e}{-}03$]

Grid search over $\gamma \in \{2^{-33}, 2^{-31}, \dots, 2^{23}, 2^{25}\}$

→ Larger mini-batch and step sizes: faster convergence
→ Competitive with grid search!

³ UCI Machine Learning Repository
Optimality of Our Mini-Batch Size

[Figure: empirical total complexity ($10^{5.0}$ to $10^{7.0}$, log scale, y-axis) vs. mini-batch size ($1$ to $16384$, log scale, x-axis) for the slice data set ($n = 53{,}500$, $d = 384$); markers: $b_{\mathrm{empirical}} = 2$, $b_{\mathrm{practical}} = 70$]

Complexity explosion for the slice data set ($n = 53{,}500$, $d = 384$)

→ Regime change observed & predicting the largest mini-batch size before the complexity explosion
Optimality of Our Mini-Batch Size (cont'd)

[Figure: empirical total complexity ($10^{5.5}$ to $10^{6.0}$, log scale, y-axis) vs. mini-batch size ($1$ to $65536$, log scale, x-axis) for the real-sim⁴ data set ($n = 72{,}309$, $d = 20{,}958$); markers: $b_{\mathrm{empirical}} = 16384$, $b_{\mathrm{practical}} = 8898$]

Complexity explosion for the real-sim⁴ data set ($n = 72{,}309$, $d = 20{,}958$)

→ Regime change observed & predicting the largest mini-batch size before the complexity explosion

⁴ LIBSVM Data
Conclusion
Summary of Our Contributions

What was done for SAGA
• Estimates of the expected smoothness
  (it later turned out that $L_{\mathrm{practical}}(b)$ is an actual upper bound!⁴)
• Simple formula for the step size $\gamma(b)$
• Optimal mini-batch size $b_{\mathrm{practical}}$
• Convincing numerics verifying the optimality of our parameters on real data sets⁵

Julia code available at https://github.com/gowerrobert/StochOpt.jl/

What has been done since then
• Extended study to variants of SVRG⁶

⁴ Gower et al. (2019), ICML, "SGD: General Analysis and Improved Rates"
⁵ LIBSVM and UCI repositories
⁶ Sebbouh et al. (2019), arXiv:1908.02725, "Towards closing the gap between the theory and practice of SVRG"
References (1/2)

• Defazio, Bach and Lacoste-Julien (2014), NIPS, "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives"
• Gower, Richtárik and Bach (2018), preprint, arXiv:1805.02632, "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching"
• Gower, Loizou, Qian, Sailanbayev, Shulgin and Richtárik (2019), ICML, "SGD: General Analysis and Improved Rates"
• Sebbouh, Gazagnadou, Jelassi, Bach and Gower (2019), preprint, arXiv:1908.02725, "Towards closing the gap between the theory and practice of SVRG"
• Robbins and Monro (1951), Annals of Mathematical Statistics, "A stochastic approximation method"
• Schmidt, Le Roux and Bach (2017), Mathematical Programming, "Minimizing finite sums with the stochastic average gradient"
• Johnson and Zhang (2013), NIPS, "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction"
References (2/2)

• Tropp (2015), Foundations and Trends in Machine Learning, "An Introduction to Matrix Concentration Inequalities"
• Tropp (2012), Foundations of Computational Mathematics, "User-Friendly Tail Bounds for Sums of Random Matrices"
• Tropp (2011), Advances in Adaptive Data Analysis, "Improved analysis of the subsampled randomized Hadamard transform"
• Bach (2013), COLT, "Sharp analysis of low-rank kernel matrix approximations"
• Gross and Nesme (2010), preprint, arXiv:1001.2738, "Note on sampling without replacing from a finite collection of matrices"
• Hoeffding (1963), Journal of the American Statistical Association, "Probability inequalities for sums of bounded random variables"
• Chang and Lin (2011), ACM TIST, "LIBSVM: A library for support vector machines"
Questions?
Time Plot Convergence Results on Real Data

[Figure: residual vs. time (log scale, $10^{-4}$ to $10^0$), comparing SAGA settings on the slice⁵ data set ($n = 53{,}500$, $d = 384$): $b_{\mathrm{Defazio}} = 1$, $\gamma_{\mathrm{Defazio}} = 6.10\mathrm{e}{-}05$; $b_{\mathrm{practical}} = 70$, $\gamma_{\mathrm{practical}} = 1.20\mathrm{e}{-}02$; $b_{\mathrm{practical}} = 70$, $\gamma_{\mathrm{gridsearch}} = 3.13\mathrm{e}{-}02$; $b_{\mathrm{Hofmann}} = 20$, $\gamma_{\mathrm{Hofmann}} = 1.59\mathrm{e}{-}03$]

⁵ UCI Machine Learning Repository
JacSketch⁷ Lyapunov Function

Definition (Stochastic Lyapunov function)
$$\Psi^k := \|w^k - w^*\|_2^2 + \frac{\gamma}{2bL_{\max}} \|\mathbf{J}^k - \nabla F(w^*)\|_F^2$$

• $\|\cdot\|_F$: Frobenius norm
• $\nabla F(w) = [\nabla f_1(w), \dots, \nabla f_n(w)] \in \mathbb{R}^{d\times n}$: Jacobian matrix
• $\{w^k, \mathbf{J}^k\}_{k\ge 0}$ are the iterates and the Jacobian estimates

If $\varepsilon > 0$ denotes the desired precision, Theorem 3.6 ensures that, for a step size
$$\gamma = \min\Big\{\frac{1}{4\mathcal{L}},\; \frac{1}{\frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{\mu}{4}\frac{n}{b}}\Big\},$$
$$\mathbb{E}\big[\Psi^k\big] \le \varepsilon \Psi^0 .$$

⁷ Gower et al. (2018), arXiv:1805.02632, "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching"
Proof of the "Practical" Bound (1/2)

Lemma (Practical bound)
If $S$ is a $b$–sampling without replacement,
$$\mathcal{L}(b) \le L_{\mathrm{practical}}(b) := \frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{n}{b}\frac{b-1}{n-1}L$$

Proof: see Proposition 3.8 in Gower et al. (2019), "SGD: General Analysis and Improved Rates".
Let $S$ be a $b$–sampling without replacement with $b \in [n]$, and let $w, z \in \mathbb{R}^d$. Denote
$$p_B := \mathbb{P}[S = B] = \frac{1}{\binom{n}{b}}, \qquad p_i := \mathbb{P}[i \in S] = \frac{b}{n}, \qquad P_{ij} := \mathbb{P}[i, j \in S] = \begin{cases} \frac{b(b-1)}{n(n-1)} & \text{if } i \ne j \\ p_i = \frac{b}{n} & \text{otherwise,} \end{cases}$$
for all $B \subseteq [n]$ and all $i, j \in [n]$. Writing $\delta_i := \nabla f_i(w) - \nabla f_i(z)$,
$$\|\nabla f_S(w) - \nabla f_S(z)\|_2^2 = \frac{1}{n^2}\Big\|\sum_{i\in S} \frac{1}{p_i}\,\delta_i\Big\|_2^2 = \sum_{i,j\in S} \Big\langle \frac{1}{np_i}\,\delta_i,\; \frac{1}{np_j}\,\delta_j \Big\rangle .$$
Then we take the expectation over all possible mini-batches ($B \subseteq [n] : |B| = b$):
$$\mathbb{E}\big[\|\nabla f_S(w) - \nabla f_S(z)\|_2^2\big] = \sum_B p_B \sum_{i,j\in B} \Big\langle \frac{1}{np_i}\,\delta_i, \frac{1}{np_j}\,\delta_j \Big\rangle \overset{\text{double counting}}{=} \sum_{i,j=1}^n \sum_{B :\, i,j\in B} p_B \Big\langle \frac{1}{np_i}\,\delta_i, \frac{1}{np_j}\,\delta_j \Big\rangle \overset{\sum_{B :\, i,j\in B} p_B = P_{ij}}{=} \sum_{i,j=1}^n P_{ij} \Big\langle \frac{1}{np_i}\,\delta_i, \frac{1}{np_j}\,\delta_j \Big\rangle$$
Proof of the "Practical" Bound (2/2)

Now consider the two disjoint cases $i \ne j$ and $i = j$ (still writing $\delta_i := \nabla f_i(w) - \nabla f_i(z)$):
$$\mathbb{E}\big[\|\nabla f_S(w) - \nabla f_S(z)\|_2^2\big] = \sum_{i,j=1}^n \frac{P_{ij}}{p_i p_j} \Big\langle \frac{1}{n}\delta_i, \frac{1}{n}\delta_j \Big\rangle = \sum_{i,j :\, i \ne j} \frac{n(b-1)}{b(n-1)} \Big\langle \frac{1}{n}\delta_i, \frac{1}{n}\delta_j \Big\rangle + \sum_{i=1}^n \frac{1}{n^2}\frac{1}{p_i}\,\|\delta_i\|_2^2$$
$$= \frac{n(b-1)}{b(n-1)} \sum_{i,j=1}^n \Big\langle \frac{1}{n}\delta_i, \frac{1}{n}\delta_j \Big\rangle + \sum_{i=1}^n \frac{1}{n^2}\frac{1}{p_i}\Big(1 - \frac{n(b-1)}{b(n-1)}\,p_i\Big)\|\delta_i\|_2^2$$
Since each $f_i$ is smooth and convex, and $\sum_{i,j}\big\langle \frac{1}{n}\delta_i, \frac{1}{n}\delta_j\big\rangle = \|\nabla f(w) - \nabla f(z)\|_2^2$:
$$\le \frac{n(b-1)}{b(n-1)}\,\|\nabla f(w) - \nabla f(z)\|_2^2 + 2\sum_{i=1}^n \frac{L_i}{n^2 p_i}\Big(1 - \frac{n(b-1)}{b(n-1)}\,p_i\Big)\big(f_i(w) - f_i(z) - \langle \nabla f_i(z), w - z\rangle\big)$$
Since $f$ is smooth and convex:
$$\le 2\Big(\frac{n(b-1)}{b(n-1)}\,L + \max_{i=1,\dots,n} \frac{L_i}{n p_i}\Big(1 - \frac{n(b-1)}{b(n-1)}\,p_i\Big)\Big)\big(f(w) - f(z) - \langle \nabla f(z), w - z\rangle\big)$$
$$= 2\Big(\frac{n(b-1)}{b(n-1)}\,L + \max_{i=1,\dots,n} \frac{L_i}{b}\Big(1 - \frac{b-1}{n-1}\Big)\Big)\big(f(w) - f(z) - \langle \nabla f(z), w - z\rangle\big)$$
$$= 2\underbrace{\Big(\frac{1}{b}\frac{n-b}{n-1}\,L_{\max} + \frac{n}{b}\frac{b-1}{n-1}\,L\Big)}_{L_{\mathrm{practical}}(b)}\big(f(w) - f(z) - \langle \nabla f(z), w - z\rangle\big)$$
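As a sanity check, the final inequality can be verified numerically on toy ridge data by enumerating every mini-batch. A hedged Julia sketch (all data and constants below are made up):

```julia
# Verify E‖∇f_S(w) − ∇f_S(z)‖² ≤ 2 L_practical(b) (f(w) − f(z) − ⟨∇f(z), w − z⟩).
using LinearAlgebra, Combinatorics

d, n, b, λ = 5, 8, 3, 0.1
A, y = randn(d, n), randn(n)
fi(i, w)  = 0.5 * (dot(A[:, i], w) - y[i])^2 + λ / 2 * norm(w)^2
∇fi(i, w) = (dot(A[:, i], w) - y[i]) * A[:, i] + λ * w
f(w)  = sum(fi(i, w) for i in 1:n) / n
∇f(w) = sum(∇fi(i, w) for i in 1:n) / n
∇fS(B, w) = sum(∇fi(i, w) for i in B) / b

w, z = randn(d), randn(d)
lhs  = sum(norm(∇fS(B, w) - ∇fS(B, z))^2 for B in combinations(1:n, b)) / binomial(n, b)
Li   = [norm(A[:, i])^2 + λ for i in 1:n]
Lmax = maximum(Li)
L    = eigmax(Symmetric(A * A' / n)) + λ
Lp   = (n - b) / (b * (n - 1)) * Lmax + n * (b - 1) / (b * (n - 1)) * L
rhs  = 2Lp * (f(w) - f(z) - dot(∇f(z), w - z))
@assert lhs ≤ rhs
```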
From Sampling Without to With Replacement

Lemma (Domination of the trace of the mgf of a sample without replacement)
Consider two finite sequences of the same length, $\{X_k\}$ and $\{M_k\}$, of Hermitian random matrices of the same size, sampled respectively with and without replacement from a finite set $\mathcal{X}$. Let $\theta \in \mathbb{R}$; then
$$\mathbb{E}\,\mathrm{tr}\exp\Big(\theta\sum_k M_k\Big) \le \mathbb{E}\,\mathrm{tr}\exp\Big(\theta\sum_k X_k\Big).$$

See Gross and Nesme (2010), arXiv:1001.2738, "Note on sampling without replacing from a finite collection of matrices"
Proof Sketch of the Bernstein Bound (1/2)

(i) Write $\mathcal{L}(b)$ as an expectation: with $S^i$ a $(b-1)$–sampling without replacement from $[n]\setminus\{i\}$,
$$\mathcal{L}(b) = \max_{i=1,\dots,n} \mathbb{E}\big[L_{S^i\cup\{i\}}\big] = \max_{i=1,\dots,n} \mathbb{E}\Big[\lambda_{\max}\Big(\frac{1}{b}\sum_{j\in S^i\cup\{i\}} a_j a_j^\top\Big)\Big]$$
$$\le \underbrace{\frac{1}{b}\frac{n-b}{n-1}L_{\max} + \frac{n}{b}\frac{b-1}{n-1}L}_{\text{practical approximation}} + \max_{i=1,\dots,n} \mathbb{E}\Big[\lambda_{\max}\Big(\underbrace{\frac{1}{b}\sum_{j\in S^i} a_j a_j^\top - \frac{1}{b}\frac{b-1}{n-1}\sum_{j\in[n]\setminus\{i\}} a_j a_j^\top}_{N \,=\, \sum_k M_k}\Big)\Big]$$
Proof Sketch of the Bernstein Bound (2/2)

(ii) Write $N$ as a sum of random matrices and apply

Theorem (Matrix Bernstein Inequality Without Replacement)
Let $\mathcal{X}$ be a finite set of Hermitian matrices with dimension $d$ s.t. $\lambda_{\max}(X) \le L$ for all $X \in \mathcal{X}$. Sample $\{X_k\}$ and $\{M_k\}$ uniformly at random from $\mathcal{X}$, respectively with and without replacement, s.t. $\mathbb{E}X_k = 0$ for all $k$. Let $Y := \sum_k X_k$ and $N := \sum_k M_k$. Then
$$\mathbb{E}\,\lambda_{\max}(N) \le \sqrt{2v(Y)\log d} + \frac{1}{3}L\log d,$$
where $v(Y) := \|\mathbb{E}Y^2\| = \big\|\sum_k \mathbb{E}X_k^2\big\| = \lambda_{\max}\big(\sum_k \mathbb{E}X_k^2\big)$.

(Tropp, 2011, 2012, 2015; Gross and Nesme, 2010; Bach, 2013)
Stochastic Reformulation of the ERM

• Sampling vector
Let $v \in \mathbb{R}^n$ be a random vector with distribution $\mathcal{D}$ s.t. $\mathbb{E}_\mathcal{D}[v] = \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector

• Unbiased subsampled function
$$f_v(w) := \frac{1}{n}\sum_{i=1}^n f_i(w)\,v_i = \frac{1}{n}\langle F(w), v\rangle, \qquad \text{where } F(w) := (f_1(w), \dots, f_n(w))^\top \in \mathbb{R}^n$$
$$\implies \mathbb{E}_\mathcal{D}[f_v(w)] = \frac{1}{n}\langle F(w), \mathbb{E}_\mathcal{D}[v]\rangle = f(w)$$

• ERM reformulation
$$\text{solving ERM} \iff \text{find } w^* \in \arg\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f_i(w) = \mathbb{E}_\mathcal{D}[f_v(w)]$$
Arbitrary Sampling

Examples of sampling vectors (see the sketch after this list)
Let $\nabla F(w) := [\nabla f_1(w), \dots, \nabla f_n(w)] \in \mathbb{R}^{d\times n}$ be the Jacobian

• Let $v = \mathbf{1}$ (deterministic):
$$\nabla f_v(w) = \frac{1}{n}\nabla F(w)\mathbf{1} = \nabla f(w) \qquad \text{(GD)}$$

• Let $v = n e_i$ (where $e_i$ is the $i$-th basis vector) with probability $1/n$ for each $i \in [n]$:
$$\nabla f_v(w) = \frac{1}{n}\nabla F(w)\,n e_i = \nabla f_i(w) \qquad \text{(SGD)}$$

• Let $v = (n/b)\sum_{i\in B} e_i$ with $B$ drawn uniformly from $[n]$ s.t. $|B| = b$:
$$\nabla f_v(w) = \frac{1}{n}\nabla F(w)\,\frac{n}{b}\sum_{i\in B} e_i = \frac{1}{b}\sum_{i\in B}\nabla f_i(w) \qquad \text{($b$–SGD without replacement)}$$

→ Arbitrary sampling covers many common methods
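A small Julia sketch of these sampling vectors (illustrative only; the matrix G below stands in for the Jacobian at some fixed w, and the Monte-Carlo check of unbiasedness uses made-up sizes):

```julia
# Sampling vectors v with E[v] = 1, and the estimator ∇f_v(w) = (1/n) ∇F(w) v.
using LinearAlgebra, Statistics, Random

d, n, b = 4, 6, 2
G  = randn(d, n)                        # stands in for the Jacobian ∇F(w)
∇f = vec(mean(G, dims = 2))             # full gradient (1/n) ∇F(w) 1

v_gd()  = ones(n)                       # GD: v = 1
v_sgd() = n .* ((1:n) .== rand(1:n))    # SGD: v = n e_i, i uniform
function v_bsgd()                       # b-SGD: v = (n/b) Σ_{i∈B} e_i
    v = zeros(n); v[randperm(n)[1:b]] .= n / b; v
end

∇fv(v) = G * v / n
@assert ∇fv(v_gd()) ≈ ∇f                # v = 1 recovers the full gradient
est = mean(∇fv(v_bsgd()) for _ in 1:200_000)   # E[∇f_v] ≈ ∇f (v_sgd behaves alike)
@assert isapprox(est, ∇f; atol = 1e-2)
```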
Unbiased Gradient Estimate

• Unbiased estimates
– Function: $\mathbb{E}_\mathcal{D}[f_v(w)] = \frac{1}{n}\langle F(w), \mathbb{E}_\mathcal{D}[v]\rangle = f(w)$
– Gradient: $\mathbb{E}_\mathcal{D}[\nabla f_v(w)] = \frac{1}{n}\nabla F(w)\,\mathbb{E}_\mathcal{D}[v] = \nabla f(w)$,
where $\nabla F(w) := [\nabla f_1(w), \dots, \nabla f_n(w)] \in \mathbb{R}^{d\times n}$

→ Motivates the iteration $w^{k+1} = w^k - \gamma_k \nabla f_v(w^k)$, where $(\gamma_k)_k$ is a step size sequence
Reducing the Variance With Control Variates

• Controlled stochastic reformulation of the ERM
$$\text{find } w^* \in \arg\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f_i(w) = \mathbb{E}_\mathcal{D}\big[f_v(w) \underbrace{-\,z_v(w) + \mathbb{E}_\mathcal{D}[z_v(w)]}_{\text{unbiased correction term}}\big],$$
with $z_v(\cdot)$ a random function

• Stochastic variance-reduced gradient estimator
$$g_v(w) := \nabla f_v(w) - \nabla z_v(w) + \mathbb{E}_\mathcal{D}[\nabla z_v(w)]$$
Iteration: $w^{k+1} = w^k - \gamma\, g_v(w^k)$

→ Recovers stochastic variance-reduced methods
Advantages of Stochastic Variance-Reduced Methods

Brief recap

• Iterative methods
$$w^{k+1} = w^k - \gamma\, g_v(w^k)$$
such that
– Unbiased gradient: $\mathbb{E}\big[g_v(w^k)\big] = \nabla f(w^k)$
– Decreasing variance: $\mathbb{E}\big[\|g_v(w^k) - \nabla f(w^k)\|_2^2\big] \xrightarrow[w^k \to w^*]{} 0$

• Advantages
– No decreasing step sizes $(\gamma_k)_k$ needed
– $f_v(w)$ determines the smoothness of the controlled stochastic reformulation

Assumption (Expected Smoothness)
$$\mathbb{E}\big[\|\nabla f_v(w) - \nabla f_v(w^*)\|_2^2\big] \le 2\mathcal{L}\,\big(f(w) - f(w^*)\big)$$

– No "bounded gradient" assumption such as $\mathbb{E}\big[\|\nabla f_v(w^k)\|_2^2\big] \le \mathrm{cst}$
Example: Recovering SAGA Algorithm

• SAGA⁶
$$f_v(w) = \frac{1}{n}\langle F(w), v\rangle \implies \nabla f_v(w) = \frac{1}{n}\nabla F(w)\,v$$
$$z_v(w) = \frac{1}{n}\big\langle \underbrace{\mathbf{J}^\top w}_{\text{linear estimation of } F(w)},\, v\big\rangle \implies \nabla z_v(w) = \frac{1}{n}\mathbf{J}v,$$
with $\mathbf{J} \in \mathbb{R}^{d\times n}$ an estimate of the Jacobian

• If $v = n e_i$, where $e_i$ is the $i$-th basis vector:
$$g_v(w) := \nabla f_v(w) - \nabla z_v(w) + \mathbb{E}_\mathcal{D}[\nabla z_v(w)] = \frac{1}{n}\nabla F(w)\,n e_i - \frac{1}{n}\mathbf{J}\,n e_i + \mathbb{E}\Big[\frac{1}{n}\mathbf{J}v\Big] = \nabla f_i(w) - \mathbf{J}_{:i} + \frac{1}{n}\mathbf{J}\mathbf{1},$$
where $\mathbf{1}$ is the all-ones vector and $\mathbf{J}_{:i}$ the $i$-th column of $\mathbf{J}$

• Variance term
Convergence analysis: $\mathbb{E}_\mathcal{D}\big[\|g_v(w) - \nabla f(w)\|_2^2\big]$ is low for $\mathbf{J} \approx \nabla F(w)$

⁶ Defazio, Bach and Lacoste-Julien (2014), NIPS, "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives"
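A short Julia sketch of the recovered SAGA estimator (illustrative; Gw stands in for the true Jacobian at a fixed point, J for its estimate, and the sizes are made up):

```julia
# SAGA estimator g_v(w) = ∇f_i(w) − J_{:i} + (1/n) J 1 from the control-variate
# template with v = n e_i.
using LinearAlgebra, Statistics

d, n = 4, 6
Gw = randn(d, n)                        # stands in for ∇F(w)
J  = randn(d, n)                        # Jacobian estimate
g_saga(i) = Gw[:, i] - J[:, i] + vec(mean(J, dims = 2))

# Unbiasedness for fixed J: averaging over i = 1..n returns ∇f(w).
@assert mean(g_saga(i) for i in 1:n) ≈ vec(mean(Gw, dims = 2))
# And the variance term vanishes when J = ∇F(w): then g_saga(i) ≡ ∇f(w).
```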