
Page 1

Statistical machine learning and convex optimization

Francis Bach

INRIA - Ecole Normale Superieure, Paris, France


Machine learning summer school - Cadiz 2016

Slides available: www.di.ens.fr/~fbach/mlss_2016_cadiz_slides.pdf

Page 2

“Big data” revolution?

A new scientific context

• Data everywhere: size does not (always) matter

• Science and industry

• Size and variety

• Learning from examples

– n observations in dimension d

Page 3

Search engines - Advertising

Page 4

Visual object recognition

Page 5

Personal photos

Page 6

Bioinformatics

• Proteins: crucial elements of cell life

• Massive data: 2 million for humans

• Complex data

Page 7

Context

Machine learning for “big data”

• Large-scale machine learning: large d, large n

– d : dimension of each observation (input)

– n : number of observations

• Examples: computer vision, bioinformatics, advertising

– Ideal running-time complexity: O(dn)

– Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

– Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

Page 8

Context

Machine learning for “big data”

• Large-scale machine learning: large d, large n

– d : dimension of each observation (input)

– n : number of observations

• Examples: computer vision, bioinformatics, advertising

• Ideal running-time complexity: O(dn)

– Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

– Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

Page 9

Context

Machine learning for “big data”

• Large-scale machine learning: large d, large n

– d : dimension of each observation (input)

– n : number of observations

• Examples: computer vision, bioinformatics, advertising

• Ideal running-time complexity: O(dn)

• Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

– Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

Page 10

Outline - I

1. Introduction

• Large-scale machine learning and optimization

• Classes of functions (convex, smooth, etc.)

• Traditional statistical analysis through Rademacher complexity

2. Classical methods for convex optimization

• Smooth optimization (gradient descent, Newton method)

• Non-smooth optimization (subgradient descent)

3. Classical stochastic approximation (not covered)

• Robbins-Monro algorithm (1951)

Page 11

Outline - II

4. Non-smooth stochastic approximation

• Stochastic (sub)gradient and averaging

• Non-asymptotic results and lower bounds

5. Smooth stochastic approximation algorithms

• Non-asymptotic analysis for smooth functions

• Least-squares regression without decaying step-sizes

6. Finite data sets

• Gradient methods with exponential convergence rates

Page 12

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd

• (regularized) empirical risk minimization: find θ solution of

min_{θ∈Rd}  (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

convex data fitting term + regularizer

Page 13

Usual losses

• Regression: y ∈ R, prediction ŷ = θ⊤Φ(x)

– quadratic loss ½(y − ŷ)² = ½(y − θ⊤Φ(x))²

Page 14

Usual losses

• Regression: y ∈ R, prediction ŷ = θ⊤Φ(x)

– quadratic loss ½(y − ŷ)² = ½(y − θ⊤Φ(x))²

• Classification: y ∈ {−1, 1}, prediction ŷ = sign(θ⊤Φ(x))

– loss of the form ℓ(y θ⊤Φ(x))
– “True” 0−1 loss: ℓ(y θ⊤Φ(x)) = 1_{y θ⊤Φ(x) < 0}

– Usual convex losses:

[Plot: 0−1, hinge, square and logistic losses as functions of y θ⊤Φ(x)]

Page 15

Main motivating examples

• Support vector machine (hinge loss): non-smooth

ℓ(Y, θ⊤Φ(X)) = max{1 − Y θ⊤Φ(X), 0}

• Logistic regression: smooth

ℓ(Y, θ⊤Φ(X)) = log(1 + exp(−Y θ⊤Φ(X)))

• Least-squares regression

ℓ(Y, θ⊤Φ(X)) = ½ (Y − θ⊤Φ(X))²

• Structured output regression

– See Tsochantaridis et al. (2005); Lacoste-Julien et al. (2013)
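A minimal NumPy sketch of the three losses above (the code and function names are ours, not from the slides); it makes the smooth vs. non-smooth distinction concrete:

```python
# Illustrative implementations of the hinge, logistic and squared losses.
import numpy as np

def hinge_loss(y, score):
    """SVM loss max{1 - y*score, 0}: convex, non-smooth at y*score = 1."""
    return np.maximum(1.0 - y * score, 0.0)

def logistic_loss(y, score):
    """log(1 + exp(-y*score)): convex and smooth (logaddexp for stability)."""
    return np.logaddexp(0.0, -y * score)

def squared_loss(y, score):
    """Least-squares loss 0.5*(y - score)^2."""
    return 0.5 * (y - score) ** 2

if __name__ == "__main__":
    s = np.linspace(-3.0, 4.0, 8)   # scores s = theta^T Phi(x)
    print(hinge_loss(1.0, s))
    print(logistic_loss(1.0, s))
    print(squared_loss(1.0, s))
```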

Page 16

Usual regularizers

• Main goal: avoid overfitting

• (squared) Euclidean norm: ‖θ‖2² = ∑_{j=1}^{d} |θj|²

– Numerically well-behaved

– Representer theorem and kernel methods: θ = ∑_{i=1}^{n} αi Φ(xi)

– See, e.g., Scholkopf and Smola (2001); Shawe-Taylor and

Cristianini (2004)

Page 17

Usual regularizers

• Main goal: avoid overfitting

• (squared) Euclidean norm: ‖θ‖2² = ∑_{j=1}^{d} |θj|²

– Numerically well-behaved

– Representer theorem and kernel methods: θ = ∑_{i=1}^{n} αi Φ(xi)

– See, e.g., Scholkopf and Smola (2001); Shawe-Taylor and

Cristianini (2004)

• Sparsity-inducing norms

– Main example: ℓ1-norm ‖θ‖1 = ∑_{j=1}^{d} |θj|
– Performs model selection as well as regularization

– Non-smooth optimization and structured sparsity

– See, e.g., Bach, Jenatton, Mairal, and Obozinski (2012b,a)

Page 18

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd

• (regularized) empirical risk minimization: find θ solution of

min_{θ∈Rd}  (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

convex data fitting term + regularizer

Page 19

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd

• (regularized) empirical risk minimization: find θ solution of

min_{θ∈Rd}  (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

convex data fitting term + regularizer

• Empirical risk: f̂(θ) = (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi))  (training cost)

• Expected risk: f(θ) = E(x,y) ℓ(y, θ⊤Φ(x))  (testing cost)

• Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂

– May be tackled simultaneously

Page 20

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd

• (regularized) empirical risk minimization: find θ solution of

min_{θ∈Rd}  (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

convex data fitting term + regularizer

• Empirical risk: f̂(θ) = (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi))  (training cost)

• Expected risk: f(θ) = E(x,y) ℓ(y, θ⊤Φ(x))  (testing cost)

• Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂

– May be tackled simultaneously

Page 21

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd

• (regularized) empirical risk minimization: find θ solution of

min_{θ∈Rd}  (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi))  such that Ω(θ) ≤ D

convex data fitting term + constraint

• Empirical risk: f̂(θ) = (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi))  (training cost)

• Expected risk: f(θ) = E(x,y) ℓ(y, θ⊤Φ(x))  (testing cost)

• Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂

– May be tackled simultaneously

Page 22

General assumptions

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Bounded features Φ(x) ∈ Rd: ‖Φ(x)‖2 ≤ R

• Empirical risk: f̂(θ) = (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi))  (training cost)

• Expected risk: f(θ) = E(x,y) ℓ(y, θ⊤Φ(x))  (testing cost)

• Loss for a single observation: fi(θ) = ℓ(yi, θ⊤Φ(xi)) ⇒ ∀i, f(θ) = Efi(θ)

• Properties of fi, f̂, f

– Convex on Rd

– Additional regularity assumptions: Lipschitz-continuity,

smoothness and strong convexity

Page 23

Convexity

• Global definitions

[Figure: graph of a convex function g(θ)]

Page 24

Convexity

• Global definitions (full domain)

[Figure: the chord of g between θ1 and θ2 lies above the graph]

– Not assuming differentiability:

∀θ1, θ2, α ∈ [0, 1], g(αθ1 + (1− α)θ2) ≤ αg(θ1) + (1− α)g(θ2)

Page 25

Convexity

• Global definitions (full domain)

[Figure: graph of a convex function g(θ) with points θ1, θ2]

– Assuming differentiability:

∀θ1, θ2, g(θ1) ≥ g(θ2) + g′(θ2)⊤(θ1 − θ2)

• Extensions to all functions with subgradients / subdifferential

Page 26

Convexity

• Global definitions (full domain)

[Figure: graph of a convex function g(θ)]

• Local definitions

– Twice differentiable functions

– ∀θ, g′′(θ) ≽ 0 (positive semi-definite Hessians)

Page 27

Convexity

• Global definitions (full domain)

[Figure: graph of a convex function g(θ)]

• Local definitions

– Twice differentiable functions

– ∀θ, g′′(θ) ≽ 0 (positive semi-definite Hessians)

• Why convexity?

Page 28

Why convexity?

• Local minimum = global minimum

– Optimality condition (non-smooth): 0 ∈ ∂g(θ)

– Optimality condition (smooth): g′(θ) = 0

• Convex duality

– See Boyd and Vandenberghe (2003)

• Recognizing convex problems

– See Boyd and Vandenberghe (2003)

Page 29

Lipschitz continuity

• Bounded gradients of g (⇔ Lipschitz-continuity): the function g is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D:

∀θ ∈ Rd, ‖θ‖2 ≤ D ⇒ ‖g′(θ)‖2 ≤ B

∀θ, θ′ ∈ Rd, ‖θ‖2, ‖θ′‖2 ≤ D ⇒ |g(θ)− g(θ′)| ≤ B‖θ − θ′‖2

• Machine learning

– with g(θ) = (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi))

– G-Lipschitz loss and R-bounded data: B = GR

Page 30

Smoothness and strong convexity

• A function g : Rd → R is L-smooth if and only if it is differentiable

and its gradient is L-Lipschitz-continuous

∀θ1, θ2 ∈ Rd, ‖g′(θ1)− g′(θ2)‖2 ≤ L‖θ1 − θ2‖2

• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) ≼ L · Id

[Figure: a smooth and a non-smooth function]

Page 31

Smoothness and strong convexity

• A function g : Rd → R is L-smooth if and only if it is differentiable

and its gradient is L-Lipschitz-continuous

∀θ1, θ2 ∈ Rd, ‖g′(θ1)− g′(θ2)‖2 ≤ L‖θ1 − θ2‖2

• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) ≼ L · Id

• Machine learning

– with g(θ) = (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi))

– Hessian ≈ covariance matrix (1/n) ∑_{i=1}^{n} Φ(xi)Φ(xi)⊤

– Lloss-smooth loss and R-bounded data: L = Lloss R²

Page 32

Smoothness and strong convexity

• A function g : Rd → R is µ-strongly convex if and only if

∀θ1, θ2 ∈ Rd, g(θ1) ≥ g(θ2) + g′(θ2)⊤(θ1 − θ2) + (µ/2)‖θ1 − θ2‖2²

• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) ≽ µ · Id

[Figure: a convex and a strongly convex function]

Page 33

Smoothness and strong convexity

• A function g : Rd → R is µ-strongly convex if and only if

∀θ1, θ2 ∈ Rd, g(θ1) ≥ g(θ2) + g′(θ2)⊤(θ1 − θ2) + (µ/2)‖θ1 − θ2‖2²

• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) ≽ µ · Id

[Figure: well-conditioned (large µ/L) vs. ill-conditioned (small µ/L) level sets]

Page 34

Smoothness and strong convexity

• A function g : Rd → R is µ-strongly convex if and only if

∀θ1, θ2 ∈ Rd, g(θ1) ≥ g(θ2) + g′(θ2)⊤(θ1 − θ2) + (µ/2)‖θ1 − θ2‖2²

• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) ≽ µ · Id

• Machine learning

– with g(θ) = (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi))

– Hessian ≈ covariance matrix (1/n) ∑_{i=1}^{n} Φ(xi)Φ(xi)⊤

– Data with invertible covariance matrix (low correlation/dimension)

Page 35

Smoothness and strong convexity

• A function g : Rd → R is µ-strongly convex if and only if

∀θ1, θ2 ∈ Rd, g(θ1) ≥ g(θ2) + g′(θ2)⊤(θ1 − θ2) + (µ/2)‖θ1 − θ2‖2²

• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) ≽ µ · Id

• Machine learning

– with g(θ) = (1/n) ∑_{i=1}^{n} ℓ(yi, θ⊤Φ(xi))

– Hessian ≈ covariance matrix (1/n) ∑_{i=1}^{n} Φ(xi)Φ(xi)⊤

– Data with invertible covariance matrix (low correlation/dimension)

• Adding regularization by (µ/2)‖θ‖2²

– creates additional bias unless µ is small

Page 36

Summary of smoothness/convexity assumptions

• Bounded gradients of g (Lipschitz-continuity): the function g is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D:

∀θ ∈ Rd, ‖θ‖2 ≤ D ⇒ ‖g′(θ)‖2 ≤ B

• Smoothness of g: the function g is convex, differentiable with L-Lipschitz-continuous gradient g′ (e.g., bounded Hessians):

∀θ1, θ2 ∈ Rd, ‖g′(θ1)− g′(θ2)‖2 ≤ L‖θ1 − θ2‖2

• Strong convexity of g: the function g is strongly convex with respect to the norm ‖ · ‖, with convexity constant µ > 0:

∀θ1, θ2 ∈ Rd, g(θ1) ≥ g(θ2) + g′(θ2)⊤(θ1 − θ2) + (µ/2)‖θ1 − θ2‖2²

Page 37

Analysis of empirical risk minimization

• Approximation and estimation errors: Θ = {θ ∈ Rd, Ω(θ) ≤ D}

f(θ̂)− min_{θ∈Rd} f(θ) = [ f(θ̂)− min_{θ∈Θ} f(θ) ] + [ min_{θ∈Θ} f(θ)− min_{θ∈Rd} f(θ) ]

(estimation error)                  (approximation error)

– NB: may replace min_{θ∈Rd} f(θ) by best (non-linear) predictions

Page 38

Analysis of empirical risk minimization

• Approximation and estimation errors: Θ = {θ ∈ Rd, Ω(θ) ≤ D}

f(θ̂)− min_{θ∈Rd} f(θ) = [ f(θ̂)− min_{θ∈Θ} f(θ) ] + [ min_{θ∈Θ} f(θ)− min_{θ∈Rd} f(θ) ]

(estimation error)                  (approximation error)

1. Uniform deviation bounds, with θ̂ ∈ argmin_{θ∈Θ} f̂(θ):

f(θ̂)− min_{θ∈Θ} f(θ) ≤ 2 · sup_{θ∈Θ} |f̂(θ)− f(θ)|

– Typically slow rate O(1/√n)

2. More refined concentration results with faster rates O(1/n)

Page 39

Slow rate for supervised learning

• Assumptions (f is the expected risk, f̂ the empirical risk)

– Ω(θ) = ‖θ‖2 (Euclidean norm)

– “Linear” predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖2 ≤ R a.s.

– G-Lipschitz loss: f and f̂ are GR-Lipschitz on Θ = {‖θ‖2 ≤ D}
– No assumptions regarding convexity

Page 40

Slow rate for supervised learning

• Assumptions (f is the expected risk, f̂ the empirical risk)

– Ω(θ) = ‖θ‖2 (Euclidean norm)

– “Linear” predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖2 ≤ R a.s.

– G-Lipschitz loss: f and f̂ are GR-Lipschitz on Θ = {‖θ‖2 ≤ D}
– No assumptions regarding convexity

• With probability greater than 1− δ,

sup_{θ∈Θ} |f̂(θ)− f(θ)| ≤ (ℓ0 + GRD)/√n · [ 2 + √(2 log(2/δ)) ]

• Expected estimation error: E[ sup_{θ∈Θ} |f̂(θ)− f(θ)| ] ≤ (4ℓ0 + 4GRD)/√n

• Using Rademacher averages (see, e.g., Boucheron et al., 2005)

• Lipschitz functions ⇒ slow rate

Page 41

Motivation from mean estimation

• Estimator: θ̂ = (1/n) ∑_{i=1}^{n} zi = argmin_{θ∈R} f̂(θ), with f̂(θ) = (1/(2n)) ∑_{i=1}^{n} (θ − zi)²

• From before:

– f(θ) = ½ E(θ − z)² = ½(θ − Ez)² + ½ var(z) = f̂(θ) + O(1/√n)

– f(θ̂) = ½(θ̂ − Ez)² + ½ var(z) = f(Ez) + O(1/√n)

Page 42

Motivation from mean estimation

• Estimator: θ̂ = (1/n) ∑_{i=1}^{n} zi = argmin_{θ∈R} f̂(θ), with f̂(θ) = (1/(2n)) ∑_{i=1}^{n} (θ − zi)²

• From before:

– f(θ) = ½ E(θ − z)² = ½(θ − Ez)² + ½ var(z) = f̂(θ) + O(1/√n)

– f(θ̂) = ½(θ̂ − Ez)² + ½ var(z) = f(Ez) + O(1/√n)

• More refined/direct bound:

f(θ̂)− f(Ez) = ½(θ̂ − Ez)²

E[ f(θ̂)− f(Ez) ] = ½ E( (1/n) ∑_{i=1}^{n} zi − Ez )² = (1/(2n)) var(z)

• Bound only at θ + strong convexity (instead of uniform bound)
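A quick numerical sanity check of the identity above (this simulation is ours, not part of the slides; the Gaussian data and constants are arbitrary):

```python
# Check that E[f(theta_hat) - f(E z)] = var(z)/(2n) for the empirical mean.
import numpy as np

rng = np.random.default_rng(0)
n, n_repeats = 50, 100_000
z = rng.normal(loc=1.0, scale=2.0, size=(n_repeats, n))  # var(z) = 4

theta_hat = z.mean(axis=1)                  # one estimator per repeat
excess_risk = 0.5 * (theta_hat - 1.0) ** 2  # f(theta_hat) - f(E z)

print("empirical mean excess risk:", excess_risk.mean())
print("theory var(z)/(2n):        ", 4.0 / (2 * n))
```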

Page 43

Fast rate for supervised learning

• Assumptions (f is the expected risk, f̂ the empirical risk)

– Same as before (bounded features, Lipschitz loss)

– Regularized risks: fµ(θ) = f(θ) + (µ/2)‖θ‖2² and f̂µ(θ) = f̂(θ) + (µ/2)‖θ‖2²
– Convexity

• For any a > 0, with probability greater than 1− δ, for all θ ∈ Rd,

fµ(θ)− min_{η∈Rd} fµ(η) ≤ (1 + a) [ f̂µ(θ)− min_{η∈Rd} f̂µ(η) ] + 8(1 + 1/a) G²R² (32 + log(1/δ)) / (µn)

• Results from Sridharan, Srebro, and Shalev-Shwartz (2008)

– see also Boucheron and Massart (2011) and references therein

• Strongly convex functions ⇒ fast rate

– Warning: µ should decrease with n to reduce approximation error

Page 44

Outline - I

1. Introduction

• Large-scale machine learning and optimization

• Classes of functions (convex, smooth, etc.)

• Traditional statistical analysis through Rademacher complexity

2. Classical methods for convex optimization

• Smooth optimization (gradient descent, Newton method)

• Non-smooth optimization (subgradient descent)

3. Classical stochastic approximation (not covered)

• Robbins-Monro algorithm (1951)

Page 45

Outline - II

4. Non-smooth stochastic approximation

• Stochastic (sub)gradient and averaging

• Non-asymptotic results and lower bounds

5. Smooth stochastic approximation algorithms

• Non-asymptotic analysis for smooth functions

• Least-squares regression without decaying step-sizes

6. Finite data sets

• Gradient methods with exponential convergence rates

Page 46

Complexity results in convex optimization

• Assumption: g convex on Rd

• Classical generic algorithms

– Gradient descent and accelerated gradient descent

– Newton method

– Subgradient method and ellipsoid algorithm

• Key additional properties of g

– Lipschitz continuity, smoothness or strong convexity

• Key insight from Bottou and Bousquet (2008)

– In machine learning, no need to optimize below estimation error

• Key references: Nesterov (2004), Bubeck (2015)

Page 47

(smooth) gradient descent

• Assumptions

– g convex with L-Lipschitz-continuous gradient (e.g., L-smooth)

– Minimum attained at θ∗

• Algorithm: θt = θt−1 − (1/L) g′(θt−1)

Page 48

(smooth) gradient descent - strong convexity

• Assumptions

– g convex with L-Lipschitz-continuous gradient (e.g., L-smooth)

– g µ-strongly convex

• Algorithm: θt = θt−1 − (1/L) g′(θt−1)

• Bound: g(θt)− g(θ∗) ≤ (1− µ/L)^t [ g(θ0)− g(θ∗) ]

• Three-line proof

• Line search, steepest descent or constant step-size
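A minimal sketch of this recursion on a strongly convex quadratic (the test problem is ours, not from the slides), checking the (1 − µ/L)^t bound on function values:

```python
# Constant step-size gradient descent theta_t = theta_{t-1} - (1/L) g'(theta_{t-1}).
import numpy as np

rng = np.random.default_rng(0)
d = 20
A = rng.standard_normal((d, d))
H = A.T @ A + 0.1 * np.eye(d)          # Hessian: mu = lambda_min, L = lambda_max
c = rng.standard_normal(d)
theta_star = np.linalg.solve(H, c)     # optimum of g(theta) = 0.5 theta'H theta - c'theta

def g(theta):
    return 0.5 * theta @ H @ theta - c @ theta

eigs = np.linalg.eigvalsh(H)
mu, L = eigs.min(), eigs.max()

theta = np.zeros(d)
gap0 = g(theta) - g(theta_star)
for t in range(1, 201):
    theta = theta - (H @ theta - c) / L            # gradient step
    if t % 50 == 0:
        print(t, g(theta) - g(theta_star), (1 - mu / L) ** t * gap0)  # gap <= bound
```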

Page 49

(smooth) gradient descent - slow rate

• Assumptions

– g convex with L-Lipschitz-continuous gradient (e.g., L-smooth)

– Minimum attained at θ∗

• Algorithm: θt = θt−1 − (1/L) g′(θt−1)

• Bound: g(θt)− g(θ∗) ≤ 2L‖θ0 − θ∗‖² / (t+ 4)

• Four-line proof

• Adaptivity of gradient descent to problem difficulty

• Not best possible convergence rates after O(d) iterations

Page 50

Gradient descent - Proof for quadratic functions

• Quadratic convex function: g(θ) = ½ θ⊤Hθ − c⊤θ

– µ and L are the smallest and largest eigenvalues of H

– Global optimum θ∗ = H⁻¹c (or H†c)

• Gradient descent:

θt = θt−1 − (1/L)(Hθt−1 − c) = θt−1 − (1/L)(Hθt−1 − Hθ∗)

θt − θ∗ = (I − H/L)(θt−1 − θ∗) = (I − H/L)^t (θ0 − θ∗)

• Strong convexity µ > 0: eigenvalues of (I − H/L)^t in [0, (1− µ/L)^t]

– Convergence of iterates: ‖θt − θ∗‖² ≤ (1− µ/L)^{2t} ‖θ0 − θ∗‖²
– Function values: g(θt)− g(θ∗) ≤ (1− µ/L)^{2t} [ g(θ0)− g(θ∗) ]
– Function values: g(θt)− g(θ∗) ≤ (L/t) ‖θ0 − θ∗‖²

Page 51

Gradient descent - Proof for quadratic functions

• Quadratic convex function: g(θ) = ½ θ⊤Hθ − c⊤θ

– µ and L are the smallest and largest eigenvalues of H

– Global optimum θ∗ = H⁻¹c (or H†c)

• Gradient descent:

θt = θt−1 − (1/L)(Hθt−1 − c) = θt−1 − (1/L)(Hθt−1 − Hθ∗)

θt − θ∗ = (I − H/L)(θt−1 − θ∗) = (I − H/L)^t (θ0 − θ∗)

• Convexity µ = 0: eigenvalues of (I − H/L)^t in [0, 1]

– No convergence of iterates: ‖θt − θ∗‖² ≤ ‖θ0 − θ∗‖²
– Function values: g(θt)− g(θ∗) ≤ max_{v∈[0,L]} v(1− v/L)^{2t} ‖θ0 − θ∗‖²
– Function values: g(θt)− g(θ∗) ≤ (L/t) ‖θ0 − θ∗‖²

Page 52

Accelerated gradient methods (Nesterov, 1983)

• Assumptions

– g convex with L-Lipschitz-cont. gradient , min. attained at θ∗

• Algorithm: θt = ηt−1 − (1/L) g′(ηt−1)

ηt = θt + ((t− 1)/(t+ 2)) (θt − θt−1)

• Bound: g(θt)− g(θ∗) ≤ 2L‖θ0 − θ∗‖² / (t+ 1)²

• Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011)

• Not improvable

• Extension to strongly-convex functions
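A minimal sketch of the two-sequence recursion above (the quadratic test function and all constants are our own illustration):

```python
# Nesterov's accelerated gradient method with the (t-1)/(t+2) momentum term.
import numpy as np

rng = np.random.default_rng(1)
d = 50
A = rng.standard_normal((100, d))
H = A.T @ A / 100.0                       # smooth convex quadratic
c = H @ rng.standard_normal(d)            # guarantees a minimizer exists
L = np.linalg.eigvalsh(H).max()
g = lambda th: 0.5 * th @ H @ th - c @ th
theta_star = np.linalg.pinv(H) @ c

theta = np.zeros(d)
eta = np.zeros(d)
for t in range(1, 501):
    theta_new = eta - (H @ eta - c) / L                            # gradient step at eta
    eta = theta_new + (t - 1.0) / (t + 2.0) * (theta_new - theta)  # momentum step
    theta = theta_new

print("gap:", g(theta) - g(theta_star))
print("bound 2L||theta_0 - theta_*||^2/(t+1)^2:",
      2 * L * np.sum(theta_star ** 2) / 501 ** 2)
```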

Page 53

Accelerated gradient methods - strong convexity

• Assumptions

– g convex with L-Lipschitz-cont. gradient, min. attained at θ∗
– g µ-strongly convex

• Algorithm: θt = ηt−1 − (1/L) g′(ηt−1)

ηt = θt + ((1−√(µ/L))/(1 +√(µ/L))) (θt − θt−1)

• Bound: g(θt)− g(θ∗) ≤ L‖θ0 − θ∗‖² (1−√(µ/L))^t

– Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011)

– Not improvable

– Relationship with conjugate gradient for quadratic functions

Page 54

Optimization for sparsity-inducing norms

(see Bach, Jenatton, Mairal, and Obozinski, 2012b)

• Gradient descent as a proximal method (differentiable functions)

– θt+1 = arg min_{θ∈Rd} { f(θt) + (θ − θt)⊤∇f(θt) + (L/2)‖θ − θt‖2² }

– θt+1 = θt − (1/L)∇f(θt)

Page 55

Optimization for sparsity-inducing norms

(see Bach, Jenatton, Mairal, and Obozinski, 2012b)

• Gradient descent as a proximal method (differentiable functions)

– θt+1 = arg min_{θ∈Rd} { f(θt) + (θ − θt)⊤∇f(θt) + (L/2)‖θ − θt‖2² }

– θt+1 = θt − (1/L)∇f(θt)

• Problems of the form: min_{θ∈Rd} f(θ) + µΩ(θ)

– θt+1 = arg min_{θ∈Rd} { f(θt) + (θ − θt)⊤∇f(θt) + µΩ(θ) + (L/2)‖θ − θt‖2² }

– Ω(θ) = ‖θ‖1 ⇒ Thresholded gradient descent

• Similar convergence rates as smooth optimization

– Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)

Page 56

Soft-thresholding for the ℓ1-norm

• Example 1: quadratic problem in 1D, i.e., min_{x∈R} ½x² − xy + λ|x|

• Piecewise quadratic function with a kink at zero

– Derivative at 0+: g+ = λ− y and at 0−: g− = −λ− y

– x = 0 is the solution iff g+ ≥ 0 and g− ≤ 0 (i.e., |y| ≤ λ)

– x ≥ 0 is the solution iff g+ ≤ 0 (i.e., y ≥ λ) ⇒ x∗ = y − λ

– x ≤ 0 is the solution iff g− ≥ 0 (i.e., y ≤ −λ) ⇒ x∗ = y + λ

• Solution x∗ = sign(y)(|y| − λ)+ = soft thresholding
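A minimal sketch of the soft-thresholding operator and of one thresholded (proximal) gradient step, here on a made-up quadratic data-fitting term (the code and problem are illustrative, not the slides' implementation):

```python
# Soft-thresholding and an ISTA-style proximal gradient step for f + lambda*||.||_1.
import numpy as np

def soft_threshold(y, lam):
    """Solves min_x 0.5*x^2 - x*y + lam*|x|: sign(y) * max(|y| - lam, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def ista_step(theta, grad_f, L, lam):
    """Gradient step on f with step 1/L, followed by soft-thresholding."""
    return soft_threshold(theta - grad_f(theta) / L, lam / L)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((40, 10))
    b = A @ np.array([3.0, -2.0] + [0.0] * 8) + 0.1 * rng.standard_normal(40)
    L = np.linalg.eigvalsh(A.T @ A).max()
    grad_f = lambda th: A.T @ (A @ th - b)     # f(theta) = 0.5*||A theta - b||^2
    theta = np.zeros(10)
    for _ in range(500):
        theta = ista_step(theta, grad_f, L, lam=5.0)
    print(np.round(theta, 2))                  # approximately sparse solution
```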

Page 57

Soft-thresholding for the ℓ1-norm

• Example 1: quadratic problem in 1D, i.e., min_{x∈R} ½x² − xy + λ|x|

• Piecewise quadratic function with a kink at zero

• Solution x∗ = sign(y)(|y| − λ)+ = soft thresholding

[Figure: soft-thresholding function x∗(y), equal to zero on [−λ, λ]]

Page 58

Newton method

• Given θt−1, minimize second-order Taylor expansion

g̃(θ) = g(θt−1) + g′(θt−1)⊤(θ − θt−1) + ½(θ − θt−1)⊤ g′′(θt−1)(θ − θt−1)

• Expensive iteration: θt = θt−1 − g′′(θt−1)⁻¹ g′(θt−1)

– Running-time complexity: O(d³) in general

• Quadratic convergence: if ‖θt−1 − θ∗‖ is small enough, for some constant C, we have

(C‖θt − θ∗‖) ≤ (C‖θt−1 − θ∗‖)²

– See Boyd and Vandenberghe (2003)
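A minimal sketch of the Newton iteration on regularized logistic regression (synthetic data; the code is our own illustration, not from the slides):

```python
# Newton's method theta_t = theta_{t-1} - g''(theta_{t-1})^{-1} g'(theta_{t-1}).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.3 * rng.standard_normal(n))
mu = 0.1                                       # regularization constant

def gradient(theta):
    p = 1.0 / (1.0 + np.exp(y * (X @ theta)))  # sigma(-y * x'theta)
    return -(X.T @ (y * p)) / n + mu * theta

def hessian(theta):
    p = 1.0 / (1.0 + np.exp(y * (X @ theta)))
    return (X.T * (p * (1.0 - p))) @ X / n + mu * np.eye(d)

theta = np.zeros(d)
for t in range(10):                            # quadratic convergence: few steps suffice
    theta = theta - np.linalg.solve(hessian(theta), gradient(theta))
print("gradient norm:", np.linalg.norm(gradient(theta)))
```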

Page 59

Summary: minimizing smooth convex functions

• Assumption: g convex

• Gradient descent: θt = θt−1 − γt g′(θt−1)

– O(1/t) convergence rate for smooth convex functions

– O(e^{−tµ/L}) convergence rate for smooth strongly convex functions

– Optimal rates O(1/t²) and O(e^{−t√(µ/L)})

• Newton method: θt = θt−1 − f′′(θt−1)⁻¹ f′(θt−1)

– O(e^{−ρ2^t}) convergence rate

Page 60

Summary: minimizing smooth convex functions

• Assumption: g convex

• Gradient descent: θt = θt−1 − γt g′(θt−1)

– O(1/t) convergence rate for smooth convex functions

– O(e^{−tµ/L}) convergence rate for smooth strongly convex functions

– Optimal rates O(1/t²) and O(e^{−t√(µ/L)})

• Newton method: θt = θt−1 − f′′(θt−1)⁻¹ f′(θt−1)

– O(e^{−ρ2^t}) convergence rate

• From smooth to non-smooth

– Subgradient method and ellipsoid (not covered)

Page 61

Counter-example (Bertsekas, 1999)

Steepest descent for nonsmooth objectives

• g(θ1, θ2) = −5(9θ1² + 16θ2²)^{1/2}  if θ1 > |θ2|
             = −(9θ1 + 16|θ2|)^{1/2}   if θ1 ≤ |θ2|

• Steepest descent starting from any θ such that θ1 > |θ2| > (9/16)²|θ1|

[Figure: steepest-descent iterates on this function]

Page 62

Subgradient method/“descent” (Shor et al., 1985)

• Assumptions

– g convex and B-Lipschitz-continuous on {‖θ‖2 ≤ D}

• Algorithm: θt = ΠD( θt−1 − (2D/(B√t)) g′(θt−1) )

– ΠD : orthogonal projection onto {‖θ‖2 ≤ D}

[Figure: projection onto the constraint set]

Page 63

Subgradient method/“descent” (Shor et al., 1985)

• Assumptions

– g convex and B-Lipschitz-continuous on {‖θ‖2 ≤ D}

• Algorithm: θt = ΠD( θt−1 − (2D/(B√t)) g′(θt−1) )

– ΠD : orthogonal projection onto {‖θ‖2 ≤ D}

• Bound: g( (1/t) ∑_{k=0}^{t−1} θk ) − g(θ∗) ≤ 2DB/√t

• Three-line proof

• Best possible convergence rate after O(d) iterations (Bubeck, 2015)
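A minimal sketch of the projected subgradient method with averaging on a non-smooth convex problem (the test function, the crude Lipschitz bound and all constants are our own illustration):

```python
# Projected subgradient method with step 2D/(B*sqrt(t)) and uniform averaging,
# on g(theta) = ||A theta - b||_1 restricted to the ball ||theta||_2 <= D.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = A @ rng.standard_normal(10)
D = 5.0
B = np.abs(A).sum()                 # crude upper bound on the Lipschitz constant

def g(theta):
    return np.abs(A @ theta - b).sum()

def subgrad(theta):
    return A.T @ np.sign(A @ theta - b)

def project(theta):
    nrm = np.linalg.norm(theta)
    return theta if nrm <= D else theta * (D / nrm)

theta = np.zeros(10)
avg = np.zeros(10)
T = 5000
for t in range(1, T + 1):
    theta = project(theta - 2 * D / (B * np.sqrt(t)) * subgrad(theta))
    avg += theta / T                # uniform average of the iterates
print("g(average iterate):", g(avg), " vs. bound 2DB/sqrt(T):", 2 * D * B / np.sqrt(T))
```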

Page 64

Subgradient method/“descent” - proof - I

• Iteration: θt = ΠD(θt−1 − γt g′(θt−1)) with γt = 2D/(B√t)

• Assumption: ‖g′(θ)‖2 ≤ B and ‖θ‖2 ≤ D

‖θt − θ∗‖2² ≤ ‖θt−1 − θ∗ − γt g′(θt−1)‖2²  (by contractivity of projections)

≤ ‖θt−1 − θ∗‖2² + B²γt² − 2γt (θt−1 − θ∗)⊤ g′(θt−1)  (because ‖g′(θt−1)‖2 ≤ B)

≤ ‖θt−1 − θ∗‖2² + B²γt² − 2γt [ g(θt−1)− g(θ∗) ]  (property of subgradients)

• leading to

g(θt−1)− g(θ∗) ≤ B²γt/2 + (1/(2γt)) [ ‖θt−1 − θ∗‖2² − ‖θt − θ∗‖2² ]

Page 65

Subgradient method/“descent” - proof - II

• Starting from g(θt−1)− g(θ∗) ≤ B²γt/2 + (1/(2γt)) [ ‖θt−1 − θ∗‖2² − ‖θt − θ∗‖2² ]

• Constant step-size γt = γ:

∑_{u=1}^{t} [ g(θu−1)− g(θ∗) ] ≤ ∑_{u=1}^{t} B²γ/2 + ∑_{u=1}^{t} (1/(2γ)) [ ‖θu−1 − θ∗‖2² − ‖θu − θ∗‖2² ]

≤ t B²γ/2 + (1/(2γ)) ‖θ0 − θ∗‖2² ≤ t B²γ/2 + (2/γ) D²

• Optimized step-size γt = 2D/(B√t) depends on the “horizon”

– Leads to a bound of 2DB√t

• Using convexity: g( (1/t) ∑_{k=0}^{t−1} θk ) − g(θ∗) ≤ 2DB/√t

Page 66

Subgradient method/“descent” - proof - III

• Starting from g(θt−1)− g(θ∗) ≤ B²γt/2 + (1/(2γt)) [ ‖θt−1 − θ∗‖2² − ‖θt − θ∗‖2² ]

• Decreasing step-size:

∑_{u=1}^{t} [ g(θu−1)− g(θ∗) ]
≤ ∑_{u=1}^{t} B²γu/2 + ∑_{u=1}^{t} (1/(2γu)) [ ‖θu−1 − θ∗‖2² − ‖θu − θ∗‖2² ]
= ∑_{u=1}^{t} B²γu/2 + ∑_{u=1}^{t−1} ‖θu − θ∗‖2² ( 1/(2γ_{u+1}) − 1/(2γu) ) + ‖θ0 − θ∗‖2²/(2γ1) − ‖θt − θ∗‖2²/(2γt)
≤ ∑_{u=1}^{t} B²γu/2 + ∑_{u=1}^{t−1} 4D² ( 1/(2γ_{u+1}) − 1/(2γu) ) + 4D²/(2γ1)
= ∑_{u=1}^{t} B²γu/2 + 4D²/(2γt) ≤ 2DB√t  with γt = 2D/(B√t)

• Using convexity: g( (1/t) ∑_{k=0}^{t−1} θk ) − g(θ∗) ≤ 2DB/√t

Page 67

Subgradient descent for machine learning

• Assumptions (f is the expected risk, f̂ the empirical risk)

– “Linear” predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖2 ≤ R a.s.

– f̂(θ) = (1/n) ∑_{i=1}^{n} ℓ(yi, Φ(xi)⊤θ)

– G-Lipschitz loss: f and f̂ are GR-Lipschitz on Θ = {‖θ‖2 ≤ D}

• Statistics: with probability greater than 1− δ,

sup_{θ∈Θ} |f̂(θ)− f(θ)| ≤ (GRD/√n) [ 2 + √(2 log(2/δ)) ]

• Optimization: after t iterations of the subgradient method,

f̂(θ̂)− min_{η∈Θ} f̂(η) ≤ GRD/√t

• t = n iterations, with total running-time complexity of O(n2d)

Page 68

Subgradient descent - strong convexity

• Assumptions

– g convex and B-Lipschitz-continuous on {‖θ‖2 ≤ D}
– g µ-strongly convex

• Algorithm: θt = ΠD( θt−1 − (2/(µ(t+ 1))) g′(θt−1) )

• Bound: g( (2/(t(t+ 1))) ∑_{k=1}^{t} k θk−1 ) − g(θ∗) ≤ 2B²/(µ(t+ 1))

• Three-line proof

• Best possible convergence rate after O(d) iterations (Bubeck, 2015)
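A minimal sketch of this strongly convex variant, with the decaying step 2/(µ(t+1)) and the weighted average used in the bound (the non-smooth test function and constants are our own illustration):

```python
# Subgradient method for a mu-strongly convex non-smooth function, with the
# weighted average (2/(T(T+1))) * sum_t t*theta_{t-1}.
import numpy as np

rng = np.random.default_rng(0)
d = 10
w = rng.standard_normal(d)
mu, D = 0.5, 10.0

def g(theta):                                   # ||theta - w||_1 + (mu/2)||theta||^2
    return np.abs(theta - w).sum() + 0.5 * mu * theta @ theta

def subgrad(theta):
    return np.sign(theta - w) + mu * theta

def project(theta):
    nrm = np.linalg.norm(theta)
    return theta if nrm <= D else theta * (D / nrm)

theta = np.zeros(d)
weighted_avg = np.zeros(d)
T = 20000
for t in range(1, T + 1):
    weighted_avg += 2.0 * t / (T * (T + 1)) * theta      # weight t on theta_{t-1}
    theta = project(theta - 2.0 / (mu * (t + 1)) * subgrad(theta))

theta_star = np.clip(w, -1.0 / mu, 1.0 / mu)             # closed-form minimizer here
B = np.sqrt(d) + mu * D                                  # subgradient-norm bound on the ball
print("gap:", g(weighted_avg) - g(theta_star), " vs. bound:", 2 * B**2 / (mu * (T + 1)))
```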

Page 69

Subgradient method - strong convexity - proof - I

• Iteration: θt = ΠD(θt−1 − γt g′(θt−1)) with γt = 2/(µ(t+ 1))

• Assumption: ‖g′(θ)‖2 ≤ B and ‖θ‖2 ≤ D and µ-strong convexity of g

‖θt − θ∗‖2² ≤ ‖θt−1 − θ∗ − γt g′(θt−1)‖2²  (by contractivity of projections)

≤ ‖θt−1 − θ∗‖2² + B²γt² − 2γt (θt−1 − θ∗)⊤ g′(θt−1)  (because ‖g′(θt−1)‖2 ≤ B)

≤ ‖θt−1 − θ∗‖2² + B²γt² − 2γt [ g(θt−1)− g(θ∗) + (µ/2)‖θt−1 − θ∗‖2² ]  (property of subgradients and strong convexity)

• leading to

g(θt−1)− g(θ∗) ≤ B²γt/2 + ½ [ 1/γt − µ ] ‖θt−1 − θ∗‖2² − (1/(2γt)) ‖θt − θ∗‖2²

≤ B²/(µ(t+ 1)) + (µ/2) [ (t− 1)/2 ] ‖θt−1 − θ∗‖2² − (µ(t+ 1)/4) ‖θt − θ∗‖2²

Page 70

Subgradient method - strong convexity - proof - II

• From g(θt−1)− g(θ∗) ≤ B²/(µ(t+ 1)) + (µ/2) [ (t− 1)/2 ] ‖θt−1 − θ∗‖2² − (µ(t+ 1)/4) ‖θt − θ∗‖2²:

∑_{u=1}^{t} u [ g(θu−1)− g(θ∗) ]
≤ ∑_{u=1}^{t} B²u/(µ(u+ 1)) + (µ/4) ∑_{u=1}^{t} [ u(u− 1)‖θu−1 − θ∗‖2² − u(u+ 1)‖θu − θ∗‖2² ]
≤ B²t/µ + (µ/4) [ 0− t(t+ 1)‖θt − θ∗‖2² ] ≤ B²t/µ

• Using convexity: g( (2/(t(t+ 1))) ∑_{u=1}^{t} u θu−1 ) − g(θ∗) ≤ 2B²/(µ(t+ 1))

• NB: with step-size γn = 1/(nµ), extra logarithmic factor

Page 71

Summary: minimizing convex functions

• Assumption: g convex

• Gradient descent: θt = θt−1 − γt g′(θt−1)

– O(1/√t) convergence rate for non-smooth convex functions

– O(1/t) convergence rate for smooth convex functions

– O(e^{−ρt}) convergence rate for smooth strongly convex functions

• Newton method: θt = θt−1 − g′′(θt−1)⁻¹ g′(θt−1)

– O(e^{−ρ2^t}) convergence rate

Page 72

Summary: minimizing convex functions

• Assumption: g convex

• Gradient descent: θt = θt−1 − γt g′(θt−1)

– O(1/√t) convergence rate for non-smooth convex functions

– O(1/t) convergence rate for smooth convex functions

– O(e^{−ρt}) convergence rate for smooth strongly convex functions

• Newton method: θt = θt−1 − g′′(θt−1)⁻¹ g′(θt−1)

– O(e^{−ρ2^t}) convergence rate

• Key insights from Bottou and Bousquet (2008)

1. In machine learning, no need to optimize below statistical error

2. In machine learning, cost functions are averages

⇒ Stochastic approximation

Page 73

Summary of rates of convergence

• Problem parameters

– D diameter of the domain

– B Lipschitz-constant

– L smoothness constant

– µ strong convexity constant

              convex                        strongly convex

nonsmooth     deterministic: BD/√t          deterministic: B²/(tµ)
              stochastic:    BD/√n          stochastic:    B²/(nµ)

smooth        deterministic: LD²/t²         deterministic: exp(−t√(µ/L))
              stochastic:    LD²/√n         stochastic:    L/(nµ)
              finite sum:    n/t            finite sum:    exp(−min{1/n, µ/L} t)

quadratic     deterministic: LD²/t²         deterministic: exp(−t√(µ/L))
              stochastic:    d/n + LD²/n    stochastic:    d/n + LD²/n

Page 74

Outline - I

1. Introduction

• Large-scale machine learning and optimization

• Classes of functions (convex, smooth, etc.)

• Traditional statistical analysis through Rademacher complexity

2. Classical methods for convex optimization

• Smooth optimization (gradient descent, Newton method)

• Non-smooth optimization (subgradient descent)

3. Classical stochastic approximation (not covered)

• Robbins-Monro algorithm (1951)

Page 75

Outline - II

4. Non-smooth stochastic approximation

• Stochastic (sub)gradient and averaging

• Non-asymptotic results and lower bounds

5. Smooth stochastic approximation algorithms

• Non-asymptotic analysis for smooth functions

• Least-squares regression without decaying step-sizes

6. Finite data sets

• Gradient methods with exponential convergence rates

Page 76

Stochastic approximation

• Goal: Minimizing a function f defined on Rd

– given only unbiased estimates f ′n(θn) of its gradients f ′(θn) at

certain points θn ∈ Rd

Page 77

Stochastic approximation

• Goal: Minimizing a function f defined on Rd

– given only unbiased estimates f ′n(θn) of its gradients f ′(θn) at

certain points θn ∈ Rd

• Machine learning - statistics

– loss for a single pair of observations: fn(θ) = ℓ(yn, θ⊤Φ(xn))

– f(θ) = Efn(θ) = E ℓ(yn, θ⊤Φ(xn)) = generalization error

– Expected gradient: f′(θ) = Ef′n(θ) = E[ ℓ′(yn, θ⊤Φ(xn)) Φ(xn) ]

– Non-asymptotic results

• Number of iterations = number of observations

Page 78

Stochastic approximation

• Goal: Minimizing a function f defined on Rd

– given only unbiased estimates f ′n(θn) of its gradients f ′(θn) at

certain points θn ∈ Rd

• Stochastic approximation

– (much) broader applicability beyond convex optimization

θn = θn−1 − γn hn(θn−1) with E[ hn(θn−1) | θn−1 ] = h(θn−1)

– Beyond convex problems, i.i.d assumption, finite dimension, etc.

– Typically asymptotic results

– See, e.g., Kushner and Yin (2003); Benveniste et al. (2012)

Page 79

Relationship to online learning

• Stochastic approximation

– Minimize f(θ) = Ezℓ(θ, z) = generalization error of θ

– Using the gradients of single i.i.d. observations

Page 80

Relationship to online learning

• Stochastic approximation

– Minimize f(θ) = Ezℓ(θ, z) = generalization error of θ

– Using the gradients of single i.i.d. observations

• Batch learning

– Finite set of observations: z1, . . . , zn
– Empirical risk: f̂(θ) = (1/n) ∑_{i=1}^{n} ℓ(θ, zi)

– Estimator θ̂ = minimizer of f̂(θ) over a certain class Θ

– Generalization bound using uniform concentration results

Page 81

Relationship to online learning

• Stochastic approximation

– Minimize f(θ) = Ezℓ(θ, z) = generalization error of θ

– Using the gradients of single i.i.d. observations

• Batch learning

– Finite set of observations: z1, . . . , zn
– Empirical risk: f̂(θ) = (1/n) ∑_{i=1}^{n} ℓ(θ, zi)

– Estimator θ̂ = minimizer of f̂(θ) over a certain class Θ

– Generalization bound using uniform concentration results

• Online learning

– Update θn after each new (potentially adversarial) observation zn
– Cumulative loss: (1/n) ∑_{k=1}^{n} ℓ(θk−1, zk)

– Online to batch through averaging (Cesa-Bianchi et al., 2004)

Page 82

Convex stochastic approximation

• Key properties of f and/or fn

– Smoothness: f B-Lipschitz continuous, f ′ L-Lipschitz continuous

– Strong convexity: f µ-strongly convex

Page 83

Convex stochastic approximation

• Key properties of f and/or fn

– Smoothness: f B-Lipschitz continuous, f ′ L-Lipschitz continuous

– Strong convexity: f µ-strongly convex

• Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro)

θn = θn−1 − γn f′n(θn−1)

– Polyak-Ruppert averaging: θ̄n = (1/n) ∑_{k=0}^{n−1} θk

– Which learning rate sequence γn? Classical setting: γn = Cn^{−α}

Page 84

Convex stochastic approximation

• Key properties of f and/or fn

– Smoothness: f B-Lipschitz continuous, f ′ L-Lipschitz continuous

– Strong convexity: f µ-strongly convex

• Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro)

θn = θn−1 − γnf′n(θn−1)

– Polyak-Ruppert averaging: θn = 1n

∑n−1k=0 θk

– Which learning rate sequence γn? Classical setting: γn = Cn−α

• Desirable practical behavior

– Applicable (at least) to classical supervised learning problems

– Robustness to (potentially unknown) constants (L,B,µ)

– Adaptivity to difficulty of the problem (e.g., strong convexity)
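A minimal sketch of this recursion with the classical step-size schedule γn = Cn^{−α} and Polyak-Ruppert averaging, on streaming least-squares data (the constants C and α and the synthetic data are our own illustrative choices):

```python
# Stochastic gradient descent with gamma_n = C * n**(-alpha) and iterate averaging.
import numpy as np

rng = np.random.default_rng(0)
d = 10
theta_true = rng.standard_normal(d)

def sample():
    x = rng.standard_normal(d)
    y = x @ theta_true + 0.1 * rng.standard_normal()
    return x, y

C, alpha = 0.02, 0.5
theta = np.zeros(d)
theta_bar = np.zeros(d)
N = 100_000
for n in range(1, N + 1):
    x, y = sample()
    grad = (x @ theta - y) * x               # gradient of 0.5*(y - x'theta)^2
    theta = theta - C * n ** (-alpha) * grad
    theta_bar += (theta - theta_bar) / n     # running average of the iterates

print("error of last iterate:    ", np.linalg.norm(theta - theta_true))
print("error of averaged iterate:", np.linalg.norm(theta_bar - theta_true))
```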

Page 85

Stochastic subgradient “descent”/method

• Assumptions

– fn convex and B-Lipschitz-continuous on {‖θ‖2 ≤ D}
– (fn) i.i.d. functions such that Efn = f

– θ∗ global optimum of f on C = {‖θ‖2 ≤ D}

• Algorithm: θn = ΠD( θn−1 − (2D/(B√n)) f′n(θn−1) )

Page 86

Stochastic subgradient “descent”/method

• Assumptions

– fn convex and B-Lipschitz-continuous on {‖θ‖2 ≤ D}
– (fn) i.i.d. functions such that Efn = f

– θ∗ global optimum of f on C = {‖θ‖2 ≤ D}

• Algorithm: θn = ΠD( θn−1 − (2D/(B√n)) f′n(θn−1) )

• Bound: Ef( (1/n) ∑_{k=0}^{n−1} θk ) − f(θ∗) ≤ 2DB/√n

• “Same” three-line proof as in the deterministic case

• Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

• Running-time complexity: O(dn) after n iterations
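A minimal sketch of this projected stochastic subgradient recursion with averaging, on the (non-smooth) hinge loss with synthetic data (the bound R, the radius D and the data model are our own illustrative assumptions):

```python
# Projected stochastic subgradient method with uniform averaging for the hinge loss.
import numpy as np

rng = np.random.default_rng(0)
d = 20
theta_true = rng.standard_normal(d)
R = np.sqrt(d)              # rough bound on ||Phi(x)||_2 for standard normal features
G, D = 1.0, 5.0             # hinge loss is 1-Lipschitz
B = G * R

def sample():
    x = rng.standard_normal(d)
    return x, np.sign(x @ theta_true)

def project(theta):
    nrm = np.linalg.norm(theta)
    return theta if nrm <= D else theta * (D / nrm)

theta = np.zeros(d)
avg = np.zeros(d)
N = 50_000
for n in range(1, N + 1):
    x, y = sample()
    subgrad = -y * x if y * (x @ theta) < 1 else np.zeros(d)   # hinge subgradient
    theta = project(theta - 2 * D / (B * np.sqrt(n)) * subgrad)
    avg += (theta - avg) / n                                   # running average

x_test, y_test = zip(*(sample() for _ in range(2000)))
acc = np.mean(np.sign(np.array(x_test) @ avg) == np.array(y_test))
print("test accuracy of averaged iterate:", acc)
```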

Page 87

Stochastic subgradient method - proof - I

• Iteration: θn = ΠD(θn−1 − γn f′n(θn−1)) with γn = 2D/(B√n)

• Fn : information up to time n

• ‖f′n(θ)‖2 ≤ B and ‖θ‖2 ≤ D, unbiased gradients/functions E(fn|Fn−1) = f

‖θn − θ∗‖2² ≤ ‖θn−1 − θ∗ − γn f′n(θn−1)‖2²  (by contractivity of projections)

≤ ‖θn−1 − θ∗‖2² + B²γn² − 2γn (θn−1 − θ∗)⊤ f′n(θn−1)  (because ‖f′n(θn−1)‖2 ≤ B)

E[ ‖θn − θ∗‖2² | Fn−1 ] ≤ ‖θn−1 − θ∗‖2² + B²γn² − 2γn (θn−1 − θ∗)⊤ f′(θn−1)

≤ ‖θn−1 − θ∗‖2² + B²γn² − 2γn [ f(θn−1)− f(θ∗) ]  (subgradient property)

E‖θn − θ∗‖2² ≤ E‖θn−1 − θ∗‖2² + B²γn² − 2γn [ Ef(θn−1)− f(θ∗) ]

• leading to Ef(θn−1)− f(θ∗) ≤ B²γn/2 + (1/(2γn)) [ E‖θn−1 − θ∗‖2² − E‖θn − θ∗‖2² ]

Page 88

Stochastic subgradient method - proof - II

• Starting from Ef(θn−1)− f(θ∗) ≤ B²γn/2 + (1/(2γn)) [ E‖θn−1 − θ∗‖2² − E‖θn − θ∗‖2² ]

∑_{u=1}^{n} [ Ef(θu−1)− f(θ∗) ] ≤ ∑_{u=1}^{n} B²γu/2 + ∑_{u=1}^{n} (1/(2γu)) [ E‖θu−1 − θ∗‖2² − E‖θu − θ∗‖2² ]

≤ ∑_{u=1}^{n} B²γu/2 + 4D²/(2γn) ≤ 2DB√n  with γn = 2D/(B√n)

• Using convexity: Ef( (1/n) ∑_{k=0}^{n−1} θk ) − f(θ∗) ≤ 2DB/√n

Page 89

Stochastic subgradient descent - strong convexity - I

• Assumptions

– fn convex and B-Lipschitz-continuous

– (fn) i.i.d. functions such that Efn = f

– f µ-strongly convex on {‖θ‖2 ≤ D}
– θ∗ global optimum of f over {‖θ‖2 ≤ D}

• Algorithm: θn = ΠD( θn−1 − (2/(µ(n+ 1))) f′n(θn−1) )

• Bound: Ef( (2/(n(n+ 1))) ∑_{k=1}^{n} k θk−1 ) − f(θ∗) ≤ 2B²/(µ(n+ 1))

• “Same” proof as the deterministic case (Lacoste-Julien et al., 2012)

• Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

Page 90

Stochastic subgradient - strong convexity - proof - I

• Iteration: θn = ΠD(θn−1 − γn f′n(θn−1)) with γn = 2/(µ(n+ 1))

• Assumption: ‖f′n(θ)‖2 ≤ B and ‖θ‖2 ≤ D and µ-strong convexity of f

‖θn − θ∗‖2² ≤ ‖θn−1 − θ∗ − γn f′n(θn−1)‖2²  (by contractivity of projections)

≤ ‖θn−1 − θ∗‖2² + B²γn² − 2γn (θn−1 − θ∗)⊤ f′n(θn−1)  (because ‖f′n(θn−1)‖2 ≤ B)

E(·|Fn−1) ≤ ‖θn−1 − θ∗‖2² + B²γn² − 2γn [ f(θn−1)− f(θ∗) + (µ/2)‖θn−1 − θ∗‖2² ]  (property of subgradients and strong convexity)

• leading to

Ef(θn−1)− f(θ∗) ≤ B²γn/2 + ½ [ 1/γn − µ ] E‖θn−1 − θ∗‖2² − (1/(2γn)) E‖θn − θ∗‖2²

≤ B²/(µ(n+ 1)) + (µ/2) [ (n− 1)/2 ] E‖θn−1 − θ∗‖2² − (µ(n+ 1)/4) E‖θn − θ∗‖2²

Page 91

Stochastic subgradient - strong convexity - proof - II

• From Ef(θn−1)− f(θ∗) ≤ B²/(µ(n+ 1)) + (µ/2) [ (n− 1)/2 ] E‖θn−1 − θ∗‖2² − (µ(n+ 1)/4) E‖θn − θ∗‖2²:

∑_{u=1}^{n} u [ Ef(θu−1)− f(θ∗) ]
≤ ∑_{u=1}^{n} B²u/(µ(u+ 1)) + (µ/4) ∑_{u=1}^{n} [ u(u− 1) E‖θu−1 − θ∗‖2² − u(u+ 1) E‖θu − θ∗‖2² ]
≤ B²n/µ + (µ/4) [ 0− n(n+ 1) E‖θn − θ∗‖2² ] ≤ B²n/µ

• Using convexity: Ef( (2/(n(n+ 1))) ∑_{u=1}^{n} u θu−1 ) − f(θ∗) ≤ 2B²/(µ(n+ 1))

• NB: with step-size γn = 1/(nµ), extra logarithmic factor (see later)

Page 92

Stochastic subgradient descent - strong convexity - II

• Assumptions

– fn convex and B-Lipschitz-continuous

– (fn) i.i.d. functions such that Efn = f

– θ∗ global optimum of g = f + (µ/2)‖ · ‖2²

– No compactness assumption - no projections

• Algorithm:

θn = θn−1 − (2/(µ(n+ 1))) g′n(θn−1) = θn−1 − (2/(µ(n+ 1))) [ f′n(θn−1) + µθn−1 ]

• Bound: Eg( (2/(n(n+ 1))) ∑_{k=1}^{n} k θk−1 ) − g(θ∗) ≤ 2B²/(µ(n+ 1))

• Minimax convergence rate

Page 93

Beyond convergence in expectation

• Typical result: Ef( (1/n) ∑_{k=0}^{n−1} θk ) − f(θ∗) ≤ 2DB/√n

– Obtained with simple conditioning arguments

• High-probability bounds

– Markov inequality: P( f( (1/n) ∑_{k=0}^{n−1} θk ) − f(θ∗) ≥ ε ) ≤ 2DB/(√n ε)

Page 94

Beyond convergence in expectation

• Typical result: Ef( (1/n) ∑_{k=0}^{n−1} θk ) − f(θ∗) ≤ 2DB/√n

– Obtained with simple conditioning arguments

• High-probability bounds

– Markov inequality: P( f( (1/n) ∑_{k=0}^{n−1} θk ) − f(θ∗) ≥ ε ) ≤ 2DB/(√n ε)

– Concentration inequality (Nemirovski et al., 2009; Nesterov and Vial, 2008):

P( f( (1/n) ∑_{k=0}^{n−1} θk ) − f(θ∗) ≥ (2DB/√n)(2 + 4t) ) ≤ 2 exp(−t²)

• See also Bach (2013) for logistic regression

Page 95

Beyond stochastic gradient method

• Adding a proximal step

– Goal: min_{θ∈Rd} f(θ) + Ω(θ) = Efn(θ) + Ω(θ)

– Replace the recursion θn = θn−1 − γn f′n(θn−1) by

θn = arg min_{θ∈Rd} ‖θ − θn−1 + γn f′n(θn−1)‖2² + CΩ(θ)

– Xiao (2010); Hu et al. (2009)

– May be accelerated (Ghadimi and Lan, 2013)

• Related frameworks

– Regularized dual averaging (Nesterov, 2009; Xiao, 2010)

– Mirror descent (Nemirovski et al., 2009; Lan et al., 2012)
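A minimal sketch of such a proximal stochastic gradient update for an ℓ1 penalty: a plain stochastic gradient step followed by soft-thresholding, the proximal operator of the ℓ1-norm (the synthetic data, step sizes and regularization level are our own illustrative choices):

```python
# Proximal stochastic gradient step for min_theta E f_n(theta) + lambda*||theta||_1.
import numpy as np

rng = np.random.default_rng(0)
d = 20
theta_true = np.zeros(d)
theta_true[:3] = [2.0, -1.5, 1.0]              # sparse ground truth
lam = 0.1

def sample():
    x = rng.standard_normal(d)
    y = x @ theta_true + 0.1 * rng.standard_normal()
    return x, y

def prox_l1(v, threshold):                     # soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

theta = np.zeros(d)
for n in range(1, 50_001):
    x, y = sample()
    gamma = 0.02 / np.sqrt(n)
    grad = (x @ theta - y) * x                 # stochastic gradient of the squared loss
    theta = prox_l1(theta - gamma * grad, gamma * lam)

print(np.round(theta, 2))                      # coefficients of irrelevant features end up near zero
```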

Page 96

Minimax rates (Agarwal et al., 2012)

• Model of computation (i.e., algorithms): first-order oracle

– Queries a function f by obtaining f(θk) and f ′(θk) with zero-mean

bounded variance noise, for k = 0, . . . , n− 1 and outputs θn

• Class of functions

– convex B-Lipschitz-continuous (w.r.t. ℓ2-norm) on a compact

convex set C containing an ℓ∞-ball

• Performance measure

– for a given algorithm and function: εn(algo, f) = f(θn)− inf_{θ∈C} f(θ)
– for a given algorithm: sup_{functions f} εn(algo, f)

• Minimax performance: inf_{algo} sup_{functions f} εn(algo, f)

Page 97

Minimax rates (Agarwal et al., 2012)

• Convex functions: domain C that contains an ℓ∞-ball of radius D

inf_{algo} sup_{functions f} εn(algo, f) ≥ cst × min{ BD√(d/n), BD }

– Consequences for ℓ2-ball of radius D: BD/√n

– Upper-bound through stochastic subgradient

• µ-strongly-convex functions:

inf_{algo} sup_{functions f} εn(algo, f) ≥ cst × min{ B²/(µn), B²/(µd), BD√(d/n), BD }

Page 98

Summary of rates of convergence

• Problem parameters

– D diameter of the domain

– B Lipschitz-constant

– L smoothness constant

– µ strong convexity constant

              convex                        strongly convex

nonsmooth     deterministic: BD/√t          deterministic: B²/(tµ)
              stochastic:    BD/√n          stochastic:    B²/(nµ)

smooth        deterministic: LD²/t²         deterministic: exp(−t√(µ/L))
              stochastic:    LD²/√n         stochastic:    L/(nµ)
              finite sum:    n/t            finite sum:    exp(−min{1/n, µ/L} t)

quadratic     deterministic: LD²/t²         deterministic: exp(−t√(µ/L))
              stochastic:    d/n + LD²/n    stochastic:    d/n + LD²/n

Page 99

Outline - I

1. Introduction

• Large-scale machine learning and optimization

• Classes of functions (convex, smooth, etc.)

• Traditional statistical analysis through Rademacher complexity

2. Classical methods for convex optimization

• Smooth optimization (gradient descent, Newton method)

• Non-smooth optimization (subgradient descent)

3. Classical stochastic approximation (not covered)

• Robbins-Monro algorithm (1951)

Page 100

Outline - II

4. Non-smooth stochastic approximation

• Stochastic (sub)gradient and averaging

• Non-asymptotic results and lower bounds

5. Smooth stochastic approximation algorithms

• Non-asymptotic analysis for smooth functions

• Least-squares regression without decaying step-sizes

6. Finite data sets

• Gradient methods with exponential convergence rates

Page 101

Convex stochastic approximation

Existing work

• Known global minimax rates of convergence for non-smooth

problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)−1)

Attained by averaged stochastic gradient descent with γn ∝ (µn)−1

– Non-strongly convex: O(n−1/2)

Attained by averaged stochastic gradient descent with γn ∝ n−1/2

– Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan

et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz

et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov

and Vial (2008); Nemirovski et al. (2009)


Convex stochastic approximation

Existing work

• Known global minimax rates of convergence for non-smooth

problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)−1)

Attained by averaged stochastic gradient descent with γn ∝ (µn)−1

– Non-strongly convex: O(n−1/2)

Attained by averaged stochastic gradient descent with γn ∝ n−1/2

• Many contributions in optimization and online learning: Bottou

and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al.

(2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al.

(2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and

Vial (2008); Nemirovski et al. (2009)


Convex stochastic approximation

Existing work

• Known global minimax rates of convergence for non-smooth

problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)−1)

Attained by averaged stochastic gradient descent with γn ∝ (µn)−1

– Non-strongly convex: O(n−1/2)

Attained by averaged stochastic gradient descent with γn ∝ n−1/2

• Asymptotic analysis of averaging (Polyak and Juditsky, 1992;

Ruppert, 1988)

– All step sizes γn = Cn−α with α ∈ (1/2, 1) lead to O(n−1) for

smooth strongly convex problems

A single algorithm with global adaptive convergence rate for

smooth problems?


Convex stochastic approximation

Existing work

• Known global minimax rates of convergence for non-smooth

problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)−1)

Attained by averaged stochastic gradient descent with γn ∝ (µn)−1

– Non-strongly convex: O(n−1/2)

Attained by averaged stochastic gradient descent with γn ∝ n−1/2

• Asymptotic analysis of averaging (Polyak and Juditsky, 1992;

Ruppert, 1988)

– All step sizes γn = Cn−α with α ∈ (1/2, 1) lead to O(n−1) for

smooth strongly convex problems

• Non-asymptotic analysis for smooth problems with convergence rate O(min{1/(µn), 1/√n})?


Smoothness/convexity assumptions

• Iteration: θn = θn−1 − γnf′n(θn−1)

– Polyak-Ruppert averaging: θ̄n = (1/n) Σ_{k=0}^{n−1} θk

• Smoothness of fn: For each n > 1, the function fn is a.s. convex,

differentiable with L-Lipschitz-continuous gradient f ′n:

– Smooth loss and bounded data

• Strong convexity of f : The function f is strongly convex with

respect to the norm ‖ · ‖, with convexity constant µ > 0:

– Invertible population covariance matrix

– or regularization by (µ/2)‖θ‖²
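
A minimal Python sketch of the iteration and of the running Polyak-Ruppert average above; the quadratic objective, constants, and variable names are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Minimal sketch of SGD with Polyak-Ruppert averaging on a hypothetical quadratic
# objective f(theta) = 0.5 * E ||theta - x_n||^2, whose stochastic gradient is (theta - x_n).
rng = np.random.default_rng(0)
d, n_iter, C, alpha = 5, 10_000, 1.0, 0.5
theta_star = rng.standard_normal(d)

theta = np.zeros(d)          # theta_0
theta_avg = np.zeros(d)      # running average of the iterates
for n in range(1, n_iter + 1):
    x_n = theta_star + rng.standard_normal(d)      # noisy observation
    grad_n = theta - x_n                           # f'_n(theta_{n-1})
    gamma_n = C * n ** (-alpha)                    # step size gamma_n = C n^{-alpha}
    theta = theta - gamma_n * grad_n               # SGD step
    theta_avg += (theta - theta_avg) / n           # online Polyak-Ruppert average

print(np.linalg.norm(theta - theta_star), np.linalg.norm(theta_avg - theta_star))
```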


Summary of new results (Bach and Moulines, 2011)

• Stochastic gradient descent with learning rate γn = Cn−α

• Strongly convex smooth objective functions

– Old: O(n−1) rate achieved without averaging for α = 1

– New: O(n−1) rate achieved with averaging for α ∈ [1/2, 1]

– Non-asymptotic analysis with explicit constants

– Forgetting of initial conditions

– Robustness to the choice of C


Summary of new results (Bach and Moulines, 2011)

• Stochastic gradient descent with learning rate γn = Cn−α

• Strongly convex smooth objective functions

– Old: O(n−1) rate achieved without averaging for α = 1

– New: O(n−1) rate achieved with averaging for α ∈ [1/2, 1]

– Non-asymptotic analysis with explicit constants

– Forgetting of initial conditions

– Robustness to the choice of C

• Convergence rates for E‖θn − θ∗‖² and E‖θ̄n − θ∗‖²

– no averaging: O(σ²γn/µ) + O(e^(−µnγn)) ‖θ0 − θ∗‖²

– averaging: tr H(θ∗)^(−1)/n + µ^(−1) O(n^(−2α) + n^(−2+α)) + O(‖θ0 − θ∗‖²/(µ²n²))


Robustness to wrong constants for γn = Cn−α

• f(θ) = ½|θ|² with i.i.d. Gaussian noise (d = 1)

• Left: α = 1/2

• Right: α = 1

[Figure: log[f(θn) − f∗] vs. log(n), for α = 1/2 (left) and α = 1 (right), comparing sgd and ave with C ∈ {1/5, 1, 5}]

• See also http://leon.bottou.org/projects/sgd


Summary of new results (Bach and Moulines, 2011)

• Stochastic gradient descent with learning rate γn = Cn−α

• Strongly convex smooth objective functions

– Old: O(n−1) rate achieved without averaging for α = 1

– New: O(n−1) rate achieved with averaging for α ∈ [1/2, 1]

– Non-asymptotic analysis with explicit constants


Summary of new results (Bach and Moulines, 2011)

• Stochastic gradient descent with learning rate γn = Cn−α

• Strongly convex smooth objective functions

– Old: O(n−1) rate achieved without averaging for α = 1

– New: O(n−1) rate achieved with averaging for α ∈ [1/2, 1]

– Non-asymptotic analysis with explicit constants

• Non-strongly convex smooth objective functions

– Old: O(n−1/2) rate achieved with averaging for α = 1/2

– New: O(max{n^(1/2−3α/2), n^(−α/2), n^(α−1)}) rate achieved without

averaging for α ∈ [1/3, 1]

• Take-home message

– Use α = 1/2 with averaging to be adaptive to strong convexity


Robustness to lack of strong convexity

• Left: f(θ) = |θ|2 between −1 and 1

• Right: f(θ) = |θ|4 between −1 and 1

• affine outside of [−1, 1], continuously differentiable.

[Figure: log[f(θn) − f∗] vs. log(n), for power 2 (left) and power 4 (right), comparing sgd and ave with α ∈ {1/3, 1/2, 2/3, 1}]


Convex stochastic approximation

Existing work

• Known global minimax rates of convergence for non-smooth

problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)−1)

Attained by averaged stochastic gradient descent with γn ∝ (µn)−1

– Non-strongly convex: O(n−1/2)

Attained by averaged stochastic gradient descent with γn ∝ n−1/2

• Asymptotic analysis of averaging (Polyak and Juditsky, 1992;

Ruppert, 1988)

– All step sizes γn = Cn−α with α ∈ (1/2, 1) lead to O(n−1) for

smooth strongly convex problems

• A single adaptive algorithm for smooth problems with

convergence rate O(min{1/(µn), 1/√n}) in all situations?


Adaptive algorithm for logistic regression

• Logistic regression: (Φ(xn), yn) ∈ ℝ^d × {−1, 1}

– Single data point: fn(θ) = log(1 + exp(−ynθ⊤Φ(xn)))

– Generalization error: f(θ) = Efn(θ)


Adaptive algorithm for logistic regression

• Logistic regression: (Φ(xn), yn) ∈ ℝ^d × {−1, 1}

– Single data point: fn(θ) = log(1 + exp(−ynθ⊤Φ(xn)))

– Generalization error: f(θ) = Efn(θ)

• Cannot be strongly convex ⇒ local strong convexity

– unless restricted to |θ⊤Φ(xn)| ⩽ M (with constants scaling as e^M)

– µ = lowest eigenvalue of the Hessian at the optimum f ′′(θ∗)

[Figure: logistic loss]


Adaptive algorithm for logistic regression

• Logistic regression: (Φ(xn), yn) ∈ ℝ^d × {−1, 1}

– Single data point: fn(θ) = log(1 + exp(−ynθ⊤Φ(xn)))

– Generalization error: f(θ) = Efn(θ)

• Cannot be strongly convex ⇒ local strong convexity

– unless restricted to |θ⊤Φ(xn)| ⩽ M (with constants scaling as e^M)

– µ = lowest eigenvalue of the Hessian at the optimum f ′′(θ∗)

• n steps of averaged SGD with constant step-size 1/(2R²√n)

– with R = radius of data (Bach, 2013):

E f(θ̄n) − f(θ∗) ⩽ min{ 1/√n, R²/(nµ) } (15 + 5R‖θ0 − θ∗‖)⁴

– Proof based on self-concordance (Nesterov and Nemirovski, 1994)
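
A sketch of the recipe above — averaged SGD for logistic regression with the constant step size 1/(2R²√n) — on synthetic data; the data generation and the sample-based estimate of R are illustrative assumptions.

```python
import numpy as np

# Sketch: averaged SGD for logistic regression with constant step 1/(2 R^2 sqrt(n)).
rng = np.random.default_rng(1)
d, n = 10, 50_000
theta_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
R = np.max(np.linalg.norm(X, axis=1))                 # radius of the data (estimated here)
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_true)), 1.0, -1.0)

gamma = 1.0 / (2.0 * R ** 2 * np.sqrt(n))
theta, theta_avg = np.zeros(d), np.zeros(d)
for i in range(n):
    margin = y[i] * X[i].dot(theta)
    grad = -y[i] * X[i] / (1.0 + np.exp(margin))      # gradient of log(1 + exp(-y theta.x))
    theta -= gamma * grad
    theta_avg += (theta - theta_avg) / (i + 1)        # Polyak-Ruppert average

print("estimation error of averaged iterate:", np.linalg.norm(theta_avg - theta_true))
```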


Self-concordance

• Usual definition for convex ϕ : ℝ → ℝ: |ϕ′′′(t)| ⩽ 2ϕ′′(t)^(3/2)

– Affine invariant

– Extendable to all convex functions on Rd by looking at rays

– Used for the sharp proof of quadratic convergence of Newton

method (Nesterov and Nemirovski, 1994)

• Generalized notion: |ϕ′′′(t)| ⩽ ϕ′′(t)

– Applicable to logistic regression (with extensions)

– ϕ(t) = log(1 + e^(−t)), ϕ′(t) = −(1 + e^t)^(−1), etc.

• Important properties

– Allows global Taylor expansions

– Relates expansions of derivatives of different orders


Adaptive algorithm for logistic regression

• Logistic regression: (Φ(xn), yn) ∈ ℝ^d × {−1, 1}

– Single data point: fn(θ) = log(1 + exp(−ynθ⊤Φ(xn)))

– Generalization error: f(θ) = Efn(θ)

• Cannot be strongly convex ⇒ local strong convexity

– unless restricted to |θ⊤Φ(xn)| ⩽ M (and with constants scaling as e^M)

– µ = lowest eigenvalue of the Hessian at the optimum f ′′(θ∗)

• n steps of averaged SGD with constant step-size 1/(2R²√n)

– with R = radius of data (Bach, 2013):

E f(θ̄n) − f(θ∗) ⩽ min{ 1/√n, R²/(nµ) } (15 + 5R‖θ0 − θ∗‖)⁴

– A single adaptive algorithm for smooth problems with

convergence rate O(1/n) in all situations?


Least-mean-square algorithm

• Least-squares: f(θ) = ½ E[(yn − 〈Φ(xn), θ〉)²] with θ ∈ ℝ^d

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)

– usually studied without averaging and decreasing step-sizes

– with strong convexity assumption E[Φ(xn)⊗Φ(xn)] = H ≽ µ · I_d


Least-mean-square algorithm

• Least-squares: f(θ) = ½ E[(yn − 〈Φ(xn), θ〉)²] with θ ∈ ℝ^d

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)

– usually studied without averaging and decreasing step-sizes

– with strong convexity assumption E[Φ(xn)⊗Φ(xn)] = H ≽ µ · I_d

• New analysis for averaging and constant step-size γ = 1/(4R2)

– Assume ‖Φ(xn)‖ ⩽ R and |yn − 〈Φ(xn), θ∗〉| ⩽ σ almost surely

– No assumption regarding lowest eigenvalues of H

– Main result: E f(θ̄n−1) − f(θ∗) ⩽ 4σ²d/n + 4R²‖θ0 − θ∗‖²/n

• Matches statistical lower bound (Tsybakov, 2003)

– Non-asymptotic robust version of Gyorfi and Walk (1996)
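
A sketch of the least-mean-square recursion above with constant step size γ = 1/(4R²) and averaging; the synthetic data and the sample-based bound R² are illustrative assumptions.

```python
import numpy as np

# Sketch: constant-step LMS (SGD for least-squares) with Polyak-Ruppert averaging.
rng = np.random.default_rng(2)
d, n, sigma = 20, 100_000, 0.5
theta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ theta_star + sigma * rng.standard_normal(n)

R2 = np.max(np.sum(X ** 2, axis=1))        # bound on ||Phi(x_n)||^2 (estimated here)
gamma = 1.0 / (4.0 * R2)
theta, theta_avg = np.zeros(d), np.zeros(d)
for i in range(n):
    residual = X[i].dot(theta) - y[i]
    theta -= gamma * residual * X[i]       # theta_n = theta_{n-1} - gamma (<x, theta> - y) x
    theta_avg += (theta - theta_avg) / (i + 1)

# Excess risk 0.5 (theta - theta_*)^T H (theta - theta_*), with H = I for this design
print("excess risk of averaged iterate:", 0.5 * np.sum((theta_avg - theta_star) ** 2))
```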


Least-squares - Proof technique - I

• LMS recursion:

θn − θ∗ = [I − γ Φ(xn)⊗Φ(xn)] (θn−1 − θ∗) + γ εn Φ(xn)

• Simplified LMS recursion: with H = E[Φ(xn)⊗Φ(xn)],

θn − θ∗ = [I − γH] (θn−1 − θ∗) + γ εn Φ(xn)

– Direct proof technique of Polyak and Juditsky (1992), e.g.,

θn − θ∗ = [I − γH]^n (θ0 − θ∗) + γ Σ_{k=1}^{n} [I − γH]^(n−k) εk Φ(xk)

• Infinite expansion of Aguech, Moulines, and Priouret (2000) in powers

of γ


Markov chain interpretation of constant step sizes

• LMS recursion for fn(θ) = ½ (yn − 〈Φ(xn), θ〉)²:

θn = θn−1 − γ (〈Φ(xn), θn−1〉 − yn) Φ(xn)

• The sequence (θn)n is a homogeneous Markov chain

– convergence to a stationary distribution πγ

– with expectation θ̄γ := ∫ θ πγ(dθ)

– For least-squares, θ̄γ = θ∗

[Figure: iterates θn starting from θ0 and oscillating around θ̄γ]


Markov chain interpretation of constant step sizes

• LMS recursion for fn(θ) = ½ (yn − 〈Φ(xn), θ〉)²:

θn = θn−1 − γ (〈Φ(xn), θn−1〉 − yn) Φ(xn)

• The sequence (θn)n is a homogeneous Markov chain

– convergence to a stationary distribution πγ

– with expectation θ̄γ := ∫ θ πγ(dθ)

• For least-squares, θ̄γ = θ∗

[Figure: iterates θn starting from θ0 and oscillating around θ̄γ = θ∗]


Markov chain interpretation of constant step sizes

• LMS recursion for fn(θ) = ½ (yn − 〈Φ(xn), θ〉)²:

θn = θn−1 − γ (〈Φ(xn), θn−1〉 − yn) Φ(xn)

• The sequence (θn)n is a homogeneous Markov chain

– convergence to a stationary distribution πγ

– with expectation θ̄γ := ∫ θ πγ(dθ)

• For least-squares, θ̄γ = θ∗

[Figure: iterates θn oscillating around θ∗, together with the averaged iterate θ̄n]


Markov chain interpretation of constant step sizes

• LMS recursion for fn(θ) = ½ (yn − 〈Φ(xn), θ〉)²:

θn = θn−1 − γ (〈Φ(xn), θn−1〉 − yn) Φ(xn)

• The sequence (θn)n is a homogeneous Markov chain

– convergence to a stationary distribution πγ

– with expectation θ̄γ := ∫ θ πγ(dθ)

• For least-squares, θ̄γ = θ∗

– θn does not converge to θ∗ but oscillates around it

– oscillations of order √γ

• Ergodic theorem:

– Averaged iterates converge to θ̄γ = θ∗ at rate O(1/n)


Simulations - synthetic examples

• Gaussian distributions - d = 20

[Figure: synthetic square — log10[f(θ̄) − f(θ∗)] vs. log10(n), step-sizes 1/(2R²), 1/(8R²), 1/(32R²), 1/(2R²n^(1/2))]


Simulations - benchmarks

• alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000)

[Figure: alpha square and news square test curves, for C = 1 and C = opt — log10[f(θ̄) − f(θ∗)] vs. log10(n), comparing C/R², C/(R²n^(1/2)), and SAG]


Optimal bounds for least-squares?

• Least-squares: cannot beat σ2d/n (Tsybakov, 2003). Really?

– What if d ≫ n?

• Refined assumptions with adaptivity (Dieuleveut and Bach, 2014)

– Beyond strong convexity or lack thereof

f(θ̄n) − f(θ∗) ⩽ 16σ² (γn)^(1/α) tr H^(1/α) / n + 4‖H^(1/2−r)(θ0 − θ∗)‖² / (γ^(2r) n^(2 min{r,1}))

– Extension to non-parametric estimation

– Previous results: α = +∞ and r = 1/2


Finer assumptions (Dieuleveut and Bach, 2014)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λm less than a constant

– Actual decay as λm = o(m^(−α)) with tr H^(1/α) = Σ_m λ_m^(1/α) small

– New result: replace σ²d/n by σ²(γn)^(1/α) tr H^(1/α) / n

[Figure: eigenvalue decay — log(λm) vs. log(m)]


Finer assumptions (Dieuleveut and Bach, 2014)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λm less than a constant

– Actual decay as λm = o(m^(−α)) with tr H^(1/α) = Σ_m λ_m^(1/α) small

– New result: replace σ²d/n by σ²(γn)^(1/α) tr H^(1/α) / n

[Figures: eigenvalue decay — log(λm) vs. log(m); and (γn)^(1/α) tr H^(1/α) / n as a function of α]


Finer assumptions (Dieuleveut and Bach, 2014)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λm less than a constant

– Actual decay as λm = o(m^(−α)) with tr H^(1/α) = Σ_m λ_m^(1/α) small

– New result: replace σ²d/n by σ²(γn)^(1/α) tr H^(1/α) / n

• Optimal predictor

– Pessimistic assumption: ‖θ0 − θ∗‖² finite

– Finer assumption: ‖H^(1/2−r)(θ0 − θ∗)‖² small

– Replace ‖θ0 − θ∗‖²/(γn) by 4‖H^(1/2−r)(θ0 − θ∗)‖² / (γ^(2r) n^(2 min{r,1}))


Optimal bounds for least-squares?

• Least-squares: cannot beat σ2d/n (Tsybakov, 2003). Really?

– What if d ≫ n?

• Refined assumptions with adaptivity (Dieuleveut and Bach, 2014)

– Beyond strong convexity or lack thereof

f(θ̄n) − f(θ∗) ⩽ 16σ² (γn)^(1/α) tr H^(1/α) / n + 4‖H^(1/2−r)(θ0 − θ∗)‖² / (γ^(2r) n^(2 min{r,1}))

– Previous results: α = +∞ and r = 1/2

– Valid for all α and r

– Optimal step-size potentially decaying with n

– Extension to non-parametric estimation (kernels) with optimal rates


From least-squares to non-parametric estimation - I

• Extension to Hilbert spaces: Φ(x), θ ∈ H

θn = θn−1 − γ (〈Φ(xn), θn−1〉 − yn) Φ(xn)

• If θ0 = 0, θn is a linear combination of Φ(x1), . . . , Φ(xn)

θn = Σ_{k=1}^{n} αk Φ(xk) with αn = −γ Σ_{k=1}^{n−1} αk 〈Φ(xk), Φ(xn)〉 + γ yn

– Kernel trick: k(x, x′) = 〈Φ(x), Φ(x′)〉

– Reproducing kernel Hilbert spaces and non-parametric estimation

– See, e.g., Scholkopf and Smola (2001); Shawe-Taylor and

Cristianini (2004); Dieuleveut and Bach (2014)

– Still O(n2)


From least-squares to non-parametric estimation - I

• Extension to Hilbert spaces: Φ(x), θ ∈ H

θn = θn−1 − γ (〈Φ(xn), θn−1〉 − yn) Φ(xn)

• If θ0 = 0, θn is a linear combination of Φ(x1), . . . , Φ(xn)

θn = Σ_{k=1}^{n} αk Φ(xk) with αn = −γ Σ_{k=1}^{n−1} αk 〈Φ(xk), Φ(xn)〉 + γ yn

• Kernel trick: k(x, x′) = 〈Φ(x), Φ(x′)〉

– Reproducing kernel Hilbert spaces and non-parametric estimation

– See, e.g., Scholkopf and Smola (2001); Shawe-Taylor and

Cristianini (2004); Dieuleveut and Bach (2014)

– Still O(n2)
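
A sketch of the kernelized recursion above, keeping only the coefficients αk; the Gaussian kernel, bandwidth, target function, and step size are illustrative assumptions, and the cost is O(n²) as noted on the slide.

```python
import numpy as np

# Sketch: kernel form of constant-step SGD for least-squares,
# theta_n = sum_k alpha_k Phi(x_k), alpha_n = -gamma * (prediction at x_n - y_n).
rng = np.random.default_rng(3)
n, gamma, bandwidth = 500, 0.5, 0.2
x = rng.random(n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

def k(a, b):                                   # kernel k(x, x') = <Phi(x), Phi(x')>
    return np.exp(-(a - b) ** 2 / (2 * bandwidth ** 2))

alpha = np.zeros(n)
for i in range(n):
    pred = alpha[:i].dot(k(x[:i], x[i]))       # <Phi(x_i), theta_{i-1}>
    alpha[i] = -gamma * (pred - y[i])

x_new = 0.3                                    # prediction of theta_n at a new point
print(alpha.dot(k(x, x_new)), "vs. target", np.sin(2 * np.pi * x_new))
```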


From least-squares to non-parametric estimation - II

• Simple example: Sobolev space on X = [0, 1]

– Φ(x) = weighted Fourier basis Φ(x)j = ϕj cos(2jπx) (plus sine)

– kernel k(x, x′) = Σ_j ϕ_j² cos[2jπ(x − x′)]

– Optimal prediction function θ∗ has norm ‖θ∗‖² = Σ_j |F(θ∗)_j|² ϕ_j^(−2)

– Depending on smoothness, may or may not be finite

– Adapted norm ‖H^(1/2−r)θ∗‖² = Σ_j |F(θ∗)_j|² ϕ_j^(−4r) may be finite

f(θ̄n) − f(θ∗) ⩽ 16σ² (γn)^(1/α) tr H^(1/α) / n + 4‖H^(1/2−r)(θ0 − θ∗)‖² / (γ^(2r) n^(2 min{r,1}))

– Same effect as ℓ2-regularization with weight λ equal to 1/(γn)


From least-squares to non-parametric estimation - II

• Simple example: Sobolev space on X = [0, 1]

– Φ(x) = weighted Fourier basis Φ(x)j = ϕj cos(2jπx) (plus sine)

– kernel k(x, x′) = Σ_j ϕ_j² cos[2jπ(x − x′)]

– Optimal prediction function θ∗ has norm ‖θ∗‖² = Σ_j |F(θ∗)_j|² ϕ_j^(−2)

– Depending on smoothness, may or may not be finite

• Adapted norm ‖H^(1/2−r)θ∗‖² = Σ_j |F(θ∗)_j|² ϕ_j^(−4r) may be finite

f(θ̄n) − f(θ∗) ⩽ 16σ² (γn)^(1/α) tr H^(1/α) / n + 4‖H^(1/2−r)(θ0 − θ∗)‖² / (γ^(2r) n^(2 min{r,1}))

• Same effect as ℓ2-regularization with weight λ equal to 1/(γn)


Simulations - synthetic examples

• Gaussian distributions - d = 20

[Figure: synthetic square — log10[f(θ̄) − f(θ∗)] vs. log10(n), step-sizes 1/(2R²), 1/(8R²), 1/(32R²), 1/(2R²n^(1/2))]

• Explaining actual behavior for all n


Bias-variance decomposition

(Defossez and Bach, 2015)

• Simplification: dominating (but exact) term when n → ∞ and γ → 0

• Variance (e.g., starting from the solution)

f(θ̄n) − f(θ∗) ∼ (1/n) E[ε² Φ(x)⊤H^(−1)Φ(x)]

– NB: if noise ε is independent, then we obtain σ²d/n

– Exponentially decaying remainder terms (strongly convex problems)

• Bias (e.g., no noise)

f(θ̄n) − f(θ∗) ∼ (1/(n²γ²)) (θ0 − θ∗)⊤H^(−1)(θ0 − θ∗)


Bias-variance decomposition (synthetic data d = 25)

[Figure: synthetic data (d = 25) — bias and variance terms for γ = γ0, f(θ̄n) − f(θ∗) vs. iteration n]


Bias-variance decomposition (synthetic data d = 25)

[Figure: synthetic data (d = 25) — bias and variance terms for γ = γ0 and γ = γ0/10, f(θ̄n) − f(θ∗) vs. iteration n; bias curves have slope −2]


Optimal sampling (Defossez and Bach, 2015)

• Sampling from a different distribution with importance weights

E_{p(x)p(y|x)} |y − Φ(x)⊤θ|² = E_{q(x)p(y|x)} [dp(x)/dq(x)] |y − Φ(x)⊤θ|²

– Recursion: θn = θn−1 − γ [dp(xn)/dq(xn)] (Φ(xn)⊤θn−1 − yn) Φ(xn)


Optimal sampling (Defossez and Bach, 2015)

• Sampling from a different distribution with importance weights

E_{p(x)p(y|x)} |y − Φ(x)⊤θ|² = E_{q(x)p(y|x)} [dp(x)/dq(x)] |y − Φ(x)⊤θ|²

– Recursion: θn = θn−1 − γ [dp(xn)/dq(xn)] (Φ(xn)⊤θn−1 − yn) Φ(xn)

– Specific to least-squares: = E_{q(x)p(y|x)} | √(dp(x)/dq(x)) y − √(dp(x)/dq(x)) Φ(x)⊤θ |²

– Reweighting of the data: same bounds apply!


Optimal sampling (Defossez and Bach, 2015)

• Sampling from a different distribution with importance weights

E_{p(x)p(y|x)} |y − Φ(x)⊤θ|² = E_{q(x)p(y|x)} [dp(x)/dq(x)] |y − Φ(x)⊤θ|²

– Recursion: θn = θn−1 − γ [dp(xn)/dq(xn)] (Φ(xn)⊤θn−1 − yn) Φ(xn)

– Specific to least-squares: = E_{q(x)p(y|x)} | √(dp(x)/dq(x)) y − √(dp(x)/dq(x)) Φ(x)⊤θ |²

– Reweighting of the data: same bounds apply!

• Optimal for variance: dq(x)/dp(x) ∝ √(Φ(x)⊤H^(−1)Φ(x))

– Same density as active learning (Kanamori and Shimodaira, 2003)

– Limited gains: difference between first and second moments

– Caveat: need to know H


Optimal sampling (Defossez and Bach, 2015)

• Sampling from a different distribution with importance weights

E_{p(x)p(y|x)} |y − Φ(x)⊤θ|² = E_{q(x)p(y|x)} [dp(x)/dq(x)] |y − Φ(x)⊤θ|²

– Recursion: θn = θn−1 − γ [dp(xn)/dq(xn)] (Φ(xn)⊤θn−1 − yn) Φ(xn)

– Specific to least-squares: = E_{q(x)p(y|x)} | √(dp(x)/dq(x)) y − √(dp(x)/dq(x)) Φ(x)⊤θ |²

– Reweighting of the data: same bounds apply!

• Optimal for bias: dq(x)/dp(x) ∝ ‖Φ(x)‖²

– Simply allows the biggest possible step size γ < 2/tr H

– Large gains in practice

– Corresponds to normalized least-mean-squares
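
A sketch of the importance-weighted recursion above with the "optimal for bias" choice dq/dp ∝ ‖Φ(x)‖², run over a finite data pool; the pool construction, step size of order 1/tr H, and all names are illustrative assumptions.

```python
import numpy as np

# Sketch: importance-weighted LMS, sampling points with probability proportional
# to ||Phi(x)||^2 and reweighting the update by dp/dq (normalized-LMS flavour).
rng = np.random.default_rng(4)
d, n_pool, n_iter = 10, 5_000, 50_000
theta_star = rng.standard_normal(d)
X = rng.standard_normal((n_pool, d)) * rng.uniform(0.1, 3.0, size=(n_pool, 1))
y = X @ theta_star + 0.1 * rng.standard_normal(n_pool)

sq_norms = np.sum(X ** 2, axis=1)
q = sq_norms / np.sum(sq_norms)               # dq/dp proportional to ||Phi(x)||^2
weights = (1.0 / n_pool) / q                  # importance weights dp(x)/dq(x)
gamma = 1.0 / np.mean(sq_norms)               # step of order 1/tr(H) (illustrative)

theta = np.zeros(d)
for _ in range(n_iter):
    i = rng.choice(n_pool, p=q)               # sample from q instead of p
    residual = X[i].dot(theta) - y[i]
    theta -= gamma * weights[i] * residual * X[i]

print("distance to theta_star:", np.linalg.norm(theta - theta_star))
```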


Convergence on Sido dataset (d = 4932)

[Figure: f(θ̄n) − f(θ∗) vs. iteration n on Sido, comparing uniform sampling (step 0.00013) with c^(−1)(X) ∝ X⊤X (steps 0.00013 and 0.0017)]


Achieving optimal bias and variance terms

• Current results with averaged SGD (ill-conditioned problems)

– Variance (starting from optimal θ∗) = σ²d/n

– Bias (no noise) = min{ R²‖θ0 − θ∗‖²/n , R⁴〈θ0 − θ∗, H^(−1)(θ0 − θ∗)〉/n² }


Achieving optimal bias and variance terms

• Current results with averaged SGD (ill-conditioned problems)

– Variance (starting from optimal θ∗) = σ²d/n

– Bias (no noise) = min{ R²‖θ0 − θ∗‖²/n , R⁴〈θ0 − θ∗, H^(−1)(θ0 − θ∗)〉/n² }


Achieving optimal bias and variance terms

• Current results with averaged SGD (ill-conditioned problems)

– Variance (starting from optimal θ∗) = σ²d/n

– Bias (no noise) = min{ R²‖θ0 − θ∗‖²/n , R⁴〈θ0 − θ∗, H^(−1)(θ0 − θ∗)〉/n² }

                                           Bias                     Variance
Averaged gradient descent
(Bach and Moulines, 2013)                  R²‖θ0 − θ∗‖²/n           σ²d/n


Achieving optimal bias and variance terms

                                           Bias                     Variance
Averaged gradient descent
(Bach and Moulines, 2013)                  R²‖θ0 − θ∗‖²/n           σ²d/n


Achieving optimal bias and variance terms

                                           Bias                     Variance
Averaged gradient descent
(Bach and Moulines, 2013)                  R²‖θ0 − θ∗‖²/n           σ²d/n
Accelerated gradient descent
(Nesterov, 1983)                           R²‖θ0 − θ∗‖²/n²          σ²d

• Acceleration is notoriously non-robust to noise (d’Aspremont,

2008; Schmidt et al., 2011)

– For non-structured noise, see Lan (2012)


Achieving optimal bias and variance terms

                                           Bias                     Variance
Averaged gradient descent
(Bach and Moulines, 2013)                  R²‖θ0 − θ∗‖²/n           σ²d/n
Accelerated gradient descent
(Nesterov, 1983)                           R²‖θ0 − θ∗‖²/n²          σ²d
“Between” averaging and acceleration
(Flammarion and Bach, 2015)                R²‖θ0 − θ∗‖²/n^(1+α)     σ²d/n^(1−α)


Achieving optimal bias and variance terms

                                           Bias                     Variance
Averaged gradient descent
(Bach and Moulines, 2013)                  R²‖θ0 − θ∗‖²/n           σ²d/n
Accelerated gradient descent
(Nesterov, 1983)                           R²‖θ0 − θ∗‖²/n²          σ²d
“Between” averaging and acceleration
(Flammarion and Bach, 2015)                R²‖θ0 − θ∗‖²/n^(1+α)     σ²d/n^(1−α)
Averaging and acceleration
(Dieuleveut, Flammarion, and Bach, 2016)   R²‖θ0 − θ∗‖²/n²          σ²d/n


Beyond least-squares - Markov chain interpretation

• Recursion θn = θn−1 − γf ′n(θn−1) also defines a Markov chain

– Stationary distribution πγ such that ∫ f′(θ) πγ(dθ) = 0

– When f′ is not linear, f′( ∫ θ πγ(dθ) ) ≠ ∫ f′(θ) πγ(dθ) = 0


Beyond least-squares - Markov chain interpretation

• Recursion θn = θn−1 − γf ′n(θn−1) also defines a Markov chain

– Stationary distribution πγ such that ∫ f′(θ) πγ(dθ) = 0

– When f′ is not linear, f′( ∫ θ πγ(dθ) ) ≠ ∫ f′(θ) πγ(dθ) = 0

• θn oscillates around the wrong value θ̄γ ≠ θ∗

[Figure: iterates θn starting from θ0 and oscillating around θ̄γ ≠ θ∗]


Beyond least-squares - Markov chain interpretation

• Recursion θn = θn−1 − γf ′n(θn−1) also defines a Markov chain

– Stationary distribution πγ such that ∫ f′(θ) πγ(dθ) = 0

– When f′ is not linear, f′( ∫ θ πγ(dθ) ) ≠ ∫ f′(θ) πγ(dθ) = 0

• θn oscillates around the wrong value θ̄γ ≠ θ∗

– moreover, ‖θ∗ − θn‖ = Op(√γ)

– Linear convergence up to the noise level for strongly-convex problems (Nedic and Bertsekas, 2000)

• Ergodic theorem

– averaged iterates converge to θ̄γ ≠ θ∗ at rate O(1/n)

– moreover, ‖θ∗ − θ̄γ‖ = O(γ) (Bach, 2013)
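
A sketch illustrating the effect above on logistic regression: with a constant step size, the averaged iterate settles near a γ-dependent point rather than at θ∗, and gets closer to θ∗ as γ decreases; the data model and step sizes are illustrative assumptions.

```python
import numpy as np

# Sketch: constant-step SGD on logistic regression; the averaged iterate approaches
# theta_bar_gamma, which moves toward theta_star as gamma -> 0.
rng = np.random.default_rng(5)
d, n = 5, 200_000
theta_star = rng.standard_normal(d)

def run(gamma):
    theta, theta_avg = np.zeros(d), np.zeros(d)
    for i in range(n):
        x = rng.standard_normal(d)
        y = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-x.dot(theta_star))) else -1.0
        grad = -y * x / (1.0 + np.exp(y * x.dot(theta)))
        theta -= gamma * grad
        theta_avg += (theta - theta_avg) / (i + 1)
    return np.linalg.norm(theta_avg - theta_star)

for gamma in (0.2, 0.05):
    print(gamma, run(gamma))     # distance typically shrinks as gamma decreases
```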


Simulations - synthetic examples

• Gaussian distributions - d = 20

[Figure: synthetic logistic − 1 — log10[f(θ̄) − f(θ∗)] vs. log10(n), step-sizes 1/(2R²), 1/(8R²), 1/(32R²), 1/(2R²n^(1/2))]


Restoring convergence through online Newton steps

• Known facts

1. Averaged SGD with γn ∝ n−1/2 leads to robust rate O(n−1/2)

for all convex functions

2. Averaged SGD with γn constant leads to robust rate O(n−1)

for all convex quadratic functions

3. Newton’s method squares the error at each iteration

for smooth functions

4. A single step of Newton’s method is equivalent to minimizing the

quadratic Taylor expansion

– Online Newton step

– Rate: O((n−1/2)2 + n−1) = O(n−1)

– Complexity: O(d) per iteration


Restoring convergence through online Newton steps

• Known facts

1. Averaged SGD with γn ∝ n−1/2 leads to robust rate O(n−1/2)

for all convex functions

2. Averaged SGD with γn constant leads to robust rate O(n−1)

for all convex quadratic functions ⇒ O(n−1)

3. Newton’s method squares the error at each iteration

for smooth functions ⇒ O((n−1/2)2)

4. A single step of Newton’s method is equivalent to minimizing the

quadratic Taylor expansion

• Online Newton step

– Rate: O((n−1/2)2 + n−1) = O(n−1)

– Complexity: O(d) per iteration


Restoring convergence through online Newton steps

• The Newton step for f = E fn(θ) := E[ℓ(yn, 〈θ, Φ(xn)〉)] at a support point θ̃ is equivalent to minimizing the quadratic approximation

g(θ) = f(θ̃) + 〈f′(θ̃), θ − θ̃〉 + ½〈θ − θ̃, f′′(θ̃)(θ − θ̃)〉

     = f(θ̃) + 〈E f′n(θ̃), θ − θ̃〉 + ½〈θ − θ̃, E f′′n(θ̃)(θ − θ̃)〉

     = E[ f(θ̃) + 〈f′n(θ̃), θ − θ̃〉 + ½〈θ − θ̃, f′′n(θ̃)(θ − θ̃)〉 ]


Restoring convergence through online Newton steps

• The Newton step for f = E fn(θ) := E[ℓ(yn, 〈θ, Φ(xn)〉)] at a support point θ̃ is equivalent to minimizing the quadratic approximation

g(θ) = f(θ̃) + 〈f′(θ̃), θ − θ̃〉 + ½〈θ − θ̃, f′′(θ̃)(θ − θ̃)〉

     = f(θ̃) + 〈E f′n(θ̃), θ − θ̃〉 + ½〈θ − θ̃, E f′′n(θ̃)(θ − θ̃)〉

     = E[ f(θ̃) + 〈f′n(θ̃), θ − θ̃〉 + ½〈θ − θ̃, f′′n(θ̃)(θ − θ̃)〉 ]

• Complexity of the least-mean-square recursion for g is O(d)

θn = θn−1 − γ[ f′n(θ̃) + f′′n(θ̃)(θn−1 − θ̃) ]

– f′′n(θ̃) = ℓ′′(yn, 〈θ̃, Φ(xn)〉) Φ(xn)⊗Φ(xn) has rank one

– New online Newton step without computing/inverting Hessians


Choice of support point for online Newton step

• Two-stage procedure

(1) Run n/2 iterations of averaged SGD to obtain θ

(2) Run n/2 iterations of averaged constant step-size LMS

– Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)

– Provable convergence rate of O(d/n) for logistic regression

– Additional assumptions but no strong convexity


Choice of support point for online Newton step

• Two-stage procedure

(1) Run n/2 iterations of averaged SGD to obtain θ

(2) Run n/2 iterations of averaged constant step-size LMS

– Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)

– Provable convergence rate of O(d/n) for logistic regression

– Additional assumptions but no strong convexity

• Update at each iteration using the current averaged iterate

– Recursion: θn = θn−1 − γ[ f′n(θ̄n−1) + f′′n(θ̄n−1)(θn−1 − θ̄n−1) ]

– No provable convergence rate (yet) but best practical behavior

– Note (dis)similarity with regular SGD: θn = θn−1 − γf ′n(θn−1)
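 
A sketch of the online Newton-step recursion above for logistic regression, with the running average as support point; the constant step size and the data generation are illustrative assumptions (the slides report no proven rate for this variant).

```python
import numpy as np

# Sketch: online Newton step with the current averaged iterate as support point.
rng = np.random.default_rng(6)
d, n, gamma = 5, 100_000, 0.05
theta_star = rng.standard_normal(d)

theta, theta_avg = np.zeros(d), np.zeros(d)
for i in range(n):
    x = rng.standard_normal(d)
    y = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-x.dot(theta_star))) else -1.0
    s = 1.0 / (1.0 + np.exp(y * x.dot(theta_avg)))           # sigmoid(-y <x, theta_avg>)
    grad = -y * s * x                                        # f'_n at the support point
    hess_dir = s * (1.0 - s) * x.dot(theta - theta_avg) * x  # rank-one Hessian action, O(d)
    theta -= gamma * (grad + hess_dir)
    theta_avg += (theta - theta_avg) / (i + 1)

print("error of averaged iterate:", np.linalg.norm(theta_avg - theta_star))
```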


Simulations - synthetic examples

• Gaussian distributions - d = 20

[Figures: synthetic logistic − 1 (step-sizes 1/(2R²), 1/(8R²), 1/(32R²), 1/(2R²n^(1/2))) and synthetic logistic − 2 (support point updated every 2^p iterations, every iteration, 2-step, 2-step-dbl.) — log10[f(θ̄) − f(θ∗)] vs. log10(n)]


Simulations - benchmarks

• alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000)

[Figure: alpha logistic and news logistic test curves, for C = 1 and C = opt — log10[f(θ̄) − f(θ∗)] vs. log10(n), comparing C/R², C/(R²n^(1/2)), SAG, Adagrad, and Newton]


Summary of rates of convergence

• Problem parameters

– D diameter of the domain

– B Lipschitz-constant

– L smoothness constant

– µ strong convexity constant

                 convex                      strongly convex

nonsmooth        deterministic: BD/√t        deterministic: B²/(tµ)
                 stochastic:    BD/√n        stochastic:    B²/(nµ)

smooth           deterministic: LD²/t²       deterministic: exp(−t√(µ/L))
                 stochastic:    LD²/√n       stochastic:    L/(nµ)
                 finite sum:    n/t          finite sum:    exp(−min{1/n, µ/L} t)

quadratic        deterministic: LD²/t²       deterministic: exp(−t√(µ/L))
                 stochastic:    d/n + LD²/n  stochastic:    d/n + LD²/n


Summary of rates of convergence

• Problem parameters

– D diameter of the domain

– B Lipschitz-constant

– L smoothness constant

– µ strong convexity constant

                 convex                      strongly convex

nonsmooth        deterministic: BD/√t        deterministic: B²/(tµ)
                 stochastic:    BD/√n        stochastic:    B²/(nµ)

smooth           deterministic: LD²/t²       deterministic: exp(−t√(µ/L))
                 stochastic:    LD²/√n       stochastic:    L/(nµ)
                 finite sum:    n/t          finite sum:    exp(−min{1/n, µ/L} t)

quadratic        deterministic: LD²/t²       deterministic: exp(−t√(µ/L))
                 stochastic:    d/n + LD²/n  stochastic:    d/n + LD²/n


Outline - II

1. Introduction

• Large-scale machine learning and optimization

• Classes of functions (convex, smooth, etc.)

• Traditional statistical analysis through Rademacher complexity

2. Classical methods for convex optimization

• Smooth optimization (gradient descent, Newton method)

• Non-smooth optimization (subgradient descent)

3. Classical stochastic approximation (not covered)

• Robbins-Monro algorithm (1951)


Outline - II

4. Non-smooth stochastic approximation

• Stochastic (sub)gradient and averaging

• Non-asymptotic results and lower bounds

5. Smooth stochastic approximation algorithms

• Non-asymptotic analysis for smooth functions

• Least-squares regression without decaying step-sizes

6. Finite data sets

• Gradient methods with exponential convergence rates


Going beyond a single pass over the data

• Stochastic approximation

– Assumes infinite data stream

– Observations are used only once

– Directly minimizes testing cost E(x,y) ℓ(y, θ⊤Φ(x))


Going beyond a single pass over the data

• Stochastic approximation

– Assumes infinite data stream

– Observations are used only once

– Directly minimizes testing cost E(x,y) ℓ(y, θ⊤Φ(x))

• Machine learning practice

– Finite data set (x1, y1, . . . , xn, yn)

– Multiple passes

– Minimizes training cost (1/n) Σ_{i=1}^n ℓ(yi, θ⊤Φ(xi))

– Need to regularize (e.g., by the ℓ2-norm) to avoid overfitting

• Goal: minimize g(θ) = (1/n) Σ_{i=1}^n fi(θ)


Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) Σ_{i=1}^n f′i(θt−1)

– Linear (e.g., exponential) convergence rate in O(e^(−αt))

– Iteration complexity is linear in n (with line search)


Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) Σ_{i=1}^n f′i(θt−1)


Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) Σ_{i=1}^n f′i(θt−1)

– Linear (e.g., exponential) convergence rate in O(e^(−αt))

– Iteration complexity is linear in n (with line search)

• Stochastic gradient descent: θt = θt−1 − γt f′_{i(t)}(θt−1)

– Sampling with replacement: i(t) random element of {1, . . . , n}

– Convergence rate in O(1/t)

– Iteration complexity is independent of n (step size selection?)


Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) Σ_{i=1}^n f′i(θt−1)

• Stochastic gradient descent: θt = θt−1 − γt f′_{i(t)}(θt−1)


Stochastic vs. deterministic methods

• Goal = best of both worlds: Linear rate with O(1) iteration cost

Robustness to step size

[Figure: schematic log(excess cost) vs. time for stochastic and deterministic methods]


Stochastic vs. deterministic methods

• Goal = best of both worlds: Linear rate with O(1) iteration cost

Robustness to step size

[Figure: schematic log(excess cost) vs. time for stochastic, deterministic, and hybrid methods]


Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n

– Random selection i(t) ∈ {1, . . . , n} with replacement

– Iteration: θt = θt−1 − (γt/n) Σ_{i=1}^n y_i^t with y_i^t = f′_i(θt−1) if i = i(t), y_i^(t−1) otherwise


Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n

– Random selection i(t) ∈ {1, . . . , n} with replacement

– Iteration: θt = θt−1 − (γt/n) Σ_{i=1}^n y_i^t with y_i^t = f′_i(θt−1) if i = i(t), y_i^(t−1) otherwise

• Stochastic version of incremental average gradient (Blatt et al., 2008)

• Extra memory requirement

– Supervised machine learning

– If fi(θ) = ℓi(yi, Φ(xi)⊤θ), then f′i(θ) = ℓ′i(yi, Φ(xi)⊤θ) Φ(xi)

– Only need to store n real numbers
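
A sketch of the SAG iteration above for ℓ2-regularized logistic regression, storing one scalar ℓ′_i per data point as described; the data, the handling of the regularizer outside the memory, and the number of passes are illustrative assumptions.

```python
import numpy as np

# Sketch: SAG with the stored gradients represented by n scalars l'_i.
rng = np.random.default_rng(7)
n, d, mu = 1_000, 10, 1e-2
theta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star)), 1.0, -1.0)

R2 = np.max(np.sum(X ** 2, axis=1))
gamma = 1.0 / (16.0 * R2)                 # constant step size from the analysis above
scal = np.zeros(n)                        # stored l'_i(y_i, Phi(x_i)^T theta): n real numbers
grad_sum = np.zeros(d)                    # running sum_i scal_i * Phi(x_i)
theta = np.zeros(d)

for t in range(20 * n):                   # roughly 20 effective passes
    i = rng.integers(n)
    new_scal = -y[i] / (1.0 + np.exp(y[i] * X[i].dot(theta)))   # l'_i at current theta
    grad_sum += (new_scal - scal[i]) * X[i]
    scal[i] = new_scal
    theta -= gamma * (grad_sum / n + mu * theta)                # averaged memory + l2 gradient

print("distance to generating theta_star:", np.linalg.norm(theta - theta_star))
```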


Stochastic average gradient - Convergence analysis

• Assumptions

– Each fi is R²-smooth, i = 1, . . . , n

– g = (1/n) Σ_{i=1}^n fi is µ-strongly convex (with potentially µ = 0)

– constant step size γt = 1/(16R²)

– initialization with one pass of averaged SGD


Stochastic average gradient - Convergence analysis

• Assumptions

– Each fi is R²-smooth, i = 1, . . . , n

– g = (1/n) Σ_{i=1}^n fi is µ-strongly convex (with potentially µ = 0)

– constant step size γt = 1/(16R²)

– initialization with one pass of averaged SGD

• Strongly convex case (Le Roux et al., 2012, 2013)

E[g(θt) − g(θ∗)] ⩽ ( 8σ²/(nµ) + 4R²‖θ0 − θ∗‖²/n ) exp( −t min{ 1/(8n), µ/(16R²) } )

– Linear (exponential) convergence rate with O(1) iteration cost

– After one pass, reduction of cost by exp( −min{ 1/8, nµ/(16R²) } )


Stochastic average gradient - Convergence analysis

• Assumptions

– Each fi is R²-smooth, i = 1, . . . , n

– g = (1/n) Σ_{i=1}^n fi is µ-strongly convex (with potentially µ = 0)

– constant step size γt = 1/(16R²)

– initialization with one pass of averaged SGD

• Non-strongly convex case (Le Roux et al., 2013)

E[g(θt) − g(θ∗)] ⩽ 48 (σ² + R²‖θ0 − θ∗‖²)/√n × n/t

– Improvement over regular batch and stochastic gradient

– Adaptivity to potentially hidden strong convexity


Convergence analysis - Proof sketch

• Main step: find “good” Lyapunov function J(θt, y_1^t, . . . , y_n^t)

– such that E[ J(θt, y_1^t, . . . , y_n^t) | F_{t−1} ] < J(θt−1, y_1^(t−1), . . . , y_n^(t−1))

– no natural candidates

• Computer-aided proof

– Parameterize function J(θt, y_1^t, . . . , y_n^t) = g(θt) − g(θ∗) + quadratic

– Solve semidefinite program to obtain candidates (that depend on n, µ, L)

– Check validity with symbolic computations


Rate of convergence comparison

• Assume that L = 100, µ = .01, and n = 80000 (L ≠ R²)

– Full gradient method has rate (1 − µ/L) = 0.9999

– Accelerated gradient method has rate (1 − √(µ/L)) = 0.9900

– Running n iterations of SAG for the same cost has rate (1 − 1/(8n))^n = 0.8825

– Fastest possible first-order method has rate ((√L − √µ)/(√L + √µ))² = 0.9608

• Beating two lower bounds (with additional assumptions)

– (1) stochastic gradient and (2) full gradient
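
The four rates above can be checked directly; a short script with the same values of L, µ, and n:

```python
import math

# Reproducing the arithmetic of the comparison above (L = 100, mu = 0.01, n = 80000).
L, mu, n = 100.0, 0.01, 80_000
print("full gradient      :", 1 - mu / L)                                  # 0.9999
print("accelerated        :", 1 - math.sqrt(mu / L))                       # 0.9900
print("SAG (n iterations) :", (1 - 1 / (8 * n)) ** n)                      # ~0.8825
print("first-order bound  :", ((math.sqrt(L) - math.sqrt(mu)) /
                               (math.sqrt(L) + math.sqrt(mu))) ** 2)       # ~0.9608
```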


Stochastic average gradient

Implementation details and extensions

• The algorithm can use sparsity in the features to reduce the storage

and iteration cost

• Grouping functions together can further reduce the memory

requirement

• We have obtained good performance when R2 is not known with a

heuristic line-search

• Algorithm allows non-uniform sampling

• Possibility of making proximal, coordinate-wise, and Newton-like

variants


spam dataset (n = 92 189, d = 823 470)


protein dataset (n = 145751, d = 74)

• Dataset split in two (training/testing)

[Figure: protein dataset — training cost (objective minus optimum) and testing cost (test logistic loss) vs. effective passes, comparing Steepest, AFG, L-BFGS, pegasos, RDA, SAG (2/(L+nµ)), and SAG-LS]


Extensions and related work

• Exponential convergence rate for strongly convex problems

• Need to store gradients

– SVRG (Johnson and Zhang, 2013)

• Adaptivity to non-strong convexity

– SAGA (Defazio, Bach, and Lacoste-Julien, 2014)

• Simple proof

– SVRG, SAGA, random coordinate descent (Nesterov, 2012; Shalev-

Shwartz and Zhang, 2012)

• Lower bounds

– Agarwal and Bottou (2014)


Variance reduction

• Principle: reducing variance of sample of X by using a sample from

another random variable Y with known expectation

Zα = α(X − Y ) + EY

– EZα = αEX + (1− α)EY

– varZα = α2[

varX + varY − 2cov(X,Y )]

– α = 1: no bias, α < 1: potential bias (but reduced variance)

– Useful if Y positively correlated with X
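
A tiny numeric check of the identity above, Zα = α(X − Y) + EY, with a toy pair (X, Y) that is strongly positively correlated; the variables and correlation level are illustrative assumptions.

```python
import numpy as np

# Variance reduction: Z_alpha = alpha (X - Y) + E Y, with Y correlated with X.
rng = np.random.default_rng(10)
m = 1_000_000
Y = rng.standard_normal(m)
X = Y + 0.3 * rng.standard_normal(m)          # E X = E Y = 0, strong positive correlation
EY = 0.0                                      # known expectation of Y

for alpha in (1.0, 0.5):
    Z = alpha * (X - Y) + EY
    print(alpha, "mean:", Z.mean(), "var:", Z.var(), "(var X =", X.var(), ")")
```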


Variance reduction

• Principle: reducing variance of sample of X by using a sample from

another random variable Y with known expectation

Zα = α(X − Y ) + EY

– EZα = αEX + (1− α)EY

– varZα = α2[

varX + varY − 2cov(X,Y )]

– α = 1: no bias, α < 1: potential bias (but reduced variance)

– Useful if Y positively correlated with X

• Application to gradient estimation : SVRG (Johnson and Zhang,

2013)

– Estimating the averaged gradient g′(θ) = (1/n) Σ_{i=1}^n f′i(θ)

– Using the gradients of a previous iterate θ̃


Stochastic variance reduced gradient (SVRG)

• Algorithm divided into “epochs”

• At each epoch, starting from θ0 = θ̃, perform the iteration

– Sample i_t uniformly at random

– Gradient step: θt = θt−1 − γ[ f′_{it}(θt−1) − f′_{it}(θ̃) + g′(θ̃) ]

• Proposition: if each fi is R²-smooth and g = (1/n) Σ_{i=1}^n fi is µ-strongly convex, then after k = 20R²/µ steps with γ = 1/(10R²), f(θ̃) − f(θ∗) is reduced by 10%
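
A sketch of the epoch structure above for ℓ2-regularized logistic regression; the data, the number of epochs, and the reuse of the slide's constants (γ = 1/(10R²), epoch length 20R²/µ) are illustrative assumptions.

```python
import numpy as np

# Sketch: SVRG epochs, correcting each stochastic gradient with the full gradient
# evaluated at the reference point theta_tilde.
rng = np.random.default_rng(8)
n, d, mu = 1_000, 10, 1e-2
theta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star)), 1.0, -1.0)

def grad_i(theta, i):                      # gradient of f_i (logistic + l2)
    return -y[i] * X[i] / (1.0 + np.exp(y[i] * X[i].dot(theta))) + mu * theta

def full_grad(theta):                      # g'(theta) = (1/n) sum_i f_i'(theta)
    s = -y / (1.0 + np.exp(y * (X @ theta)))
    return (X * s[:, None]).mean(axis=0) + mu * theta

R2 = np.max(np.sum(X ** 2, axis=1))
gamma, epoch_len = 1.0 / (10.0 * R2), int(20.0 * R2 / mu)
theta_tilde = np.zeros(d)
for epoch in range(3):
    g_tilde = full_grad(theta_tilde)
    theta = theta_tilde.copy()
    for t in range(epoch_len):
        i = rng.integers(n)
        theta -= gamma * (grad_i(theta, i) - grad_i(theta_tilde, i) + g_tilde)
    theta_tilde = theta                    # reference point for the next epoch

print("distance to generating theta_star (rough indicator):",
      np.linalg.norm(theta_tilde - theta_star))
```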


Outline - II

1. Introduction

• Large-scale machine learning and optimization

• Classes of functions (convex, smooth, etc.)

• Traditional statistical analysis through Rademacher complexity

2. Classical methods for convex optimization

• Smooth optimization (gradient descent, Newton method)

• Non-smooth optimization (subgradient descent)

3. Classical stochastic approximation (not covered)

• Robbins-Monro algorithm (1951)


Outline - II

4. Non-smooth stochastic approximation

• Stochastic (sub)gradient and averaging

• Non-asymptotic results and lower bounds

5. Smooth stochastic approximation algorithms

• Non-asymptotic analysis for smooth functions

• Least-squares regression without decaying step-sizes

6. Finite data sets

• Gradient methods with exponential convergence rates


Subgradient descent for machine learning

• Assumptions (f is the expected risk, f̂ the empirical risk)

– “Linear” predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖₂ ⩽ R a.s.

– f̂(θ) = (1/n) Σ_{i=1}^n ℓ(yi, Φ(xi)⊤θ)

– G-Lipschitz loss: f and f̂ are GR-Lipschitz on Θ = {‖θ‖₂ ⩽ D}

• Statistics: with probability greater than 1 − δ,

sup_{θ∈Θ} |f̂(θ) − f(θ)| ⩽ (GRD/√n) [2 + √(2 log(2/δ))]

• Optimization: after t iterations of subgradient method

f̂(θ) − min_{η∈Θ} f̂(η) ⩽ GRD/√t

• t = n iterations, with total running-time complexity of O(n²d)


Stochastic subgradient “descent”/method

• Assumptions

– fn convex and B-Lipschitz-continuous on {‖θ‖₂ ⩽ D}

– (fn) i.i.d. functions such that E fn = f

– θ∗ global optimum of f on {‖θ‖₂ ⩽ D}

• Algorithm: θn = Π_D( θn−1 − (2D/(B√n)) f′n(θn−1) )

• Bound: E f( (1/n) Σ_{k=0}^{n−1} θk ) − f(θ∗) ⩽ 2DB/√n

• “Same” three-line proof as in the deterministic case

• Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

• Running-time complexity: O(dn) after n iterations
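
A sketch of the projected stochastic subgradient method above on a non-smooth (hinge-loss) objective, with step 2D/(B√n), projection onto the ℓ2-ball of radius D, and uniform averaging; the data, D, and the sample-based B are illustrative assumptions.

```python
import numpy as np

# Sketch: projected stochastic subgradient method with averaging (hinge loss).
rng = np.random.default_rng(9)
d, n, D = 10, 50_000, 5.0
theta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ theta_star + 0.3 * rng.standard_normal(n))

B = np.max(np.linalg.norm(X, axis=1))        # Lipschitz constant of the hinge loss in theta
gamma = 2.0 * D / (B * np.sqrt(n))
theta, theta_avg = np.zeros(d), np.zeros(d)
for i in range(n):
    subgrad = -y[i] * X[i] if y[i] * X[i].dot(theta) < 1.0 else np.zeros(d)
    theta = theta - gamma * subgrad
    norm = np.linalg.norm(theta)
    if norm > D:                              # projection Pi_D onto {||theta||_2 <= D}
        theta *= D / norm
    theta_avg += (theta - theta_avg) / (i + 1)

print("accuracy of averaged iterate on the data:",
      np.mean(np.sign(X @ theta_avg) == y))
```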


Summary of new results (Bach and Moulines, 2011)

• Stochastic gradient descent with learning rate γn = Cn−α

• Strongly convex smooth objective functions

– Old: O(n−1) rate achieved without averaging for α = 1

– New: O(n−1) rate achieved with averaging for α ∈ [1/2, 1]

– Non-asymptotic analysis with explicit constants

– Forgetting of initial conditions

– Robustness to the choice of C

• Convergence rates for E‖θn − θ∗‖² and E‖θ̄n − θ∗‖²

– no averaging: O(σ²γn/µ) + O(e^(−µnγn)) ‖θ0 − θ∗‖²

– averaging: tr H(θ∗)^(−1)/n + µ^(−1) O(n^(−2α) + n^(−2+α)) + O(‖θ0 − θ∗‖²/(µ²n²))


Least-mean-square algorithm

• Least-squares: f(θ) = ½ E[(yn − 〈Φ(xn), θ〉)²] with θ ∈ ℝ^d

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)

– usually studied without averaging and decreasing step-sizes

– with strong convexity assumption E[Φ(xn)⊗Φ(xn)] = H ≽ µ · I_d

• New analysis for averaging and constant step-size γ = 1/(4R2)

– Assume ‖Φ(xn)‖ ⩽ R and |yn − 〈Φ(xn), θ∗〉| ⩽ σ almost surely

– No assumption regarding lowest eigenvalues of H

– Main result: E f(θ̄n−1) − f(θ∗) ⩽ 4σ²d/n + 4R²‖θ0 − θ∗‖²/n

• Matches statistical lower bound (Tsybakov, 2003)

– Non-asymptotic robust version of Gyorfi and Walk (1996)


Choice of support point for online Newton step

• Two-stage procedure

(1) Run n/2 iterations of averaged SGD to obtain θ̃

(2) Run n/2 iterations of averaged constant step-size LMS

– Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)

– Provable convergence rate of O(d/n) for logistic regression

– Additional assumptions but no strong convexity

• Update at each iteration using the current averaged iterate

– Recursion: θₙ = θₙ₋₁ − γ [ f′ₙ(θ̄ₙ₋₁) + f″ₙ(θ̄ₙ₋₁)(θₙ₋₁ − θ̄ₙ₋₁) ]

– No provable convergence rate (yet) but best practical behavior

– Note (dis)similarity with regular SGD: θₙ = θₙ₋₁ − γ f′ₙ(θₙ₋₁)
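
A minimal sketch of the second variant (the recursion expanded around the current averaged iterate), instantiated for logistic regression; the constant step size γ and the data layout are illustrative assumptions of the sketch.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def online_newton_logistic(X, y, gamma):
        """Newton-type online recursion using the averaged iterate as support point.

        X: (n, d) features; y: labels in {-1, +1}; gamma: constant step size.
        """
        n, d = X.shape
        theta = np.zeros(d)
        theta_bar = np.zeros(d)
        for k in range(n):
            phi, label = X[k], y[k]
            z_bar = label * (phi @ theta_bar)
            grad = -(1.0 - sigmoid(z_bar)) * label * phi           # f_k'(theta_bar)
            curvature = sigmoid(z_bar) * (1.0 - sigmoid(z_bar))    # scalar part of f_k''(theta_bar)
            hess_times_diff = curvature * (phi @ (theta - theta_bar)) * phi
            theta = theta - gamma * (grad + hess_times_diff)
            theta_bar += (theta - theta_bar) / (k + 1)             # update the averaged iterate
        return theta_bar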


Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n

– Random selection i(t) ∈ {1, . . . , n} with replacement

– Iteration: θₜ = θₜ₋₁ − (γₜ/n) ∑ᵢ₌₁ⁿ yᵢᵗ, with yᵢᵗ = f′ᵢ(θₜ₋₁) if i = i(t), and yᵢᵗ = yᵢᵗ⁻¹ otherwise

• Stochastic version of incremental average gradient (Blatt et al., 2008)

• Extra memory requirement

– Supervised machine learning

– If fᵢ(θ) = ℓᵢ(yᵢ, Φ(xᵢ)⊤θ), then f′ᵢ(θ) = ℓ′ᵢ(yᵢ, Φ(xᵢ)⊤θ) Φ(xᵢ)

– Only need to store n real numbers
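
A minimal sketch of the SAG recursion for logistic regression with a constant step size (a simplification of the γₜ on the slide); because f′ᵢ(θ) = ℓ′ᵢ(yᵢ, Φ(xᵢ)⊤θ) Φ(xᵢ), only one scalar per example has to be stored.

    import numpy as np

    def sag_logistic(X, y, gamma, n_iters, seed=0):
        """Stochastic average gradient for the logistic loss log(1 + exp(-y * x^T theta)).

        Stores one scalar per example and maintains the sum of the memorized gradients.
        """
        rng = np.random.default_rng(seed)
        n, d = X.shape
        theta = np.zeros(d)
        stored_scalar = np.zeros(n)            # last computed l'_i for each example
        grad_sum = np.zeros(d)                 # sum_i stored_scalar[i] * X[i]
        for _ in range(n_iters):
            i = rng.integers(n)                # uniform sampling with replacement
            z = y[i] * (X[i] @ theta)
            new_scalar = -y[i] / (1.0 + np.exp(z))          # derivative of the logistic loss
            grad_sum += (new_scalar - stored_scalar[i]) * X[i]
            stored_scalar[i] = new_scalar
            theta = theta - (gamma / n) * grad_sum          # SAG step with the averaged memory
        return theta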


Summary of rates of convergence

• Problem parameters

– D diameter of the domain

– B Lipschitz-constant

– L smoothness constant

– µ strong convexity constant

                 convex                        strongly convex

nonsmooth        deterministic: BD/√t          deterministic: B²/(tµ)
                 stochastic:    BD/√n          stochastic:    B²/(nµ)

smooth           deterministic: LD²/t²         deterministic: exp(−t√(µ/L))
                 stochastic:    LD²/√n         stochastic:    L/(nµ)
                 finite sum:    n/t            finite sum:    exp(−min{1/n, µ/L} t)

quadratic        deterministic: LD²/t²         deterministic: exp(−t√(µ/L))
                 stochastic:    d/n + LD²/n    stochastic:    d/n + LD²/n


Conclusions

Machine learning and convex optimization

• Statistics with or without optimization?

– Significance of mixing algorithms with analysis

– Benefits of mixing algorithms with analysis

• Open problems

– Non-parametric stochastic approximation

– Characterization of implicit regularization of online methods

– Structured prediction

– Going beyond a single pass over the data (testing performance)

– Further links between convex optimization and online learning/bandits

– Parallel and distributed optimization


References

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.
A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. arXiv preprint arXiv:1410.0723, 2014.
R. Aguech, E. Moulines, and P. Priouret. On a perturbation approach for the analysis of stochastic tracking algorithms. SIAM J. Control and Optimization, 39(3):872–899, 2000.
F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Technical Report 00804431, HAL, 2013.
F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Adv. NIPS, 2011.
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization, 2012a.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012b.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer, 2012.
D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2008.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
L. Bottou and Y. Le Cun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2):137–151, 2005.
S. Boucheron and P. Massart. A high-dimensional Wilks phenomenon. Probability Theory and Related Fields, 150(3-4):405–433, 2011.
S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM Probability and Statistics, 9:323–375, 2005.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2003.
S. Bubeck. Convex optimization: algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015. URL http://dx.doi.org/10.1561/2200000050.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
A. d’Aspremont. Smooth optimization with approximate gradient. SIAM J. Optim., 19(3):1171–1183, 2008.
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Adv. NIPS, 2014.
A. Defossez and F. Bach. Constant step size least-mean-square: bias-variance trade-offs and optimal sampling distributions. 2015.
A. Dieuleveut and F. Bach. Non-parametric stochastic approximation with large step sizes. Technical report, arXiv, 2014.
A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Technical Report 1602.05419, arXiv, 2016.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
N. Flammarion and F. Bach. From averaging to acceleration, there is only a step-size. arXiv preprint arXiv:1504.01577, 2015.
S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
L. Gyorfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31–61, 1996.
E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.
C. Hu, J. T. Kwok, and W. Pan. Accelerated gradient methods for stochastic optimization and online learning. In Adv. NIPS, 2009.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Adv. NIPS, 2013.
T. Kanamori and H. Shimodaira. Active learning algorithm using the maximum weighted log-likelihood estimator. Journal of Statistical Planning and Inference, 116(1):149–162, 2003.
H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, second edition, 2003.
S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for projected stochastic subgradient descent. Technical Report 1212.2002, arXiv, 2012.
S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proc. ICML, 2013.
G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1-2):365–397, 2012.
G. Lan, A. Nemirovski, and A. Shapiro. Validation analysis of mirror descent stochastic approximation method. Mathematical Programming, 134(2):425–458, 2012.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Adv. NIPS, 2012.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical Report 00674995, HAL, 2013.
O. Macchi. Adaptive Processing: The Least Mean Squares Approach with Applications in Transmission. Wiley, 1995.
A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. Stochastic Optimization: Algorithms and Applications, pages 263–304, 2000.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
Y. Nesterov and A. Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. SIAM Studies in Applied Mathematics, 1994.
Y. Nesterov and J.-P. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):1559–1568, 2008.
Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407, 1951.
R. T. Rockafellar. Convex Analysis. Princeton University Press, 1997.
D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.
M. Schmidt, N. Le Roux, and F. Bach. Optimization with approximate gradients. Technical report, HAL, 2011.
B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.
S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proc. ICML, 2008.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Technical Report 1209.1873, arXiv, 2012.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proc. ICML, 2007.
S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proc. COLT, 2009.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
N. Z. Shor, K. C. Kiwiel, and A. Ruszczyński. Minimization Methods for Non-differentiable Functions. Springer-Verlag, 1985.
K. Sridharan, N. Srebro, and S. Shalev-Shwartz. Fast rates for regularized objectives. 2008.
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
A. B. Tsybakov. Optimal rates of aggregation. 2003.
A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 9:2543–2596, 2010.