
Page 1

Optimization for Machine Learning

Proximal operator and proximal gradient methods

Lecturers: Francis Bach & Robert M. Gower

Tutorials: Hadrien Hendrikx, Rui Yuan, Nidham Gazagnadou

African Master's in Machine Intelligence (AMMI), Kigali

Page 2

References for today's class

Amir Beck and Marc Teboulle (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sciences.

Sébastien Bubeck (2015). Convex Optimization: Algorithms and Complexity.

Chapter 1 and Section 5.1

Page 3

A Datum Function

Finite Sum Training Problem

Optimization Sum of Terms

Page 4

The Training Problem

Page 5

Convergence GD

Theorem

Let f be m-strongly convex and L-smooth. Then gradient descent with step size γ = 1/L satisfies

||w^t - w^*||^2 ≤ (1 - m/L)^t ||w^0 - w^*||^2,

where w^* is the unique minimizer of f.
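As a quick numerical sanity check (not part of the slides), gradient descent on a diagonal quadratic with curvatures m and L exhibits exactly this linear rate; all variable names below are ours:

```python
import numpy as np

# f(w) = 0.5 * w^T diag(m, L) w is m-strongly convex and L-smooth,
# with unique minimizer w* = 0 and gradient diag(m, L) * w.
m, L = 1.0, 10.0
curv = np.array([m, L])
w0 = np.array([1.0, 1.0])

w = w0.copy()
T = 50
for _ in range(T):
    w = w - (1.0 / L) * (curv * w)   # gradient step with gamma = 1/L

err = np.linalg.norm(w) ** 2                      # ||w^T - w*||^2
bound = (1 - m / L) ** T * np.linalg.norm(w0) ** 2  # theoretical bound
```

Running this, `err` sits below `bound`: the squared distance to the minimizer contracts by at least (1 - m/L) per step.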

Page 6

Convergence GD I

Theorem

Let f be convex and L-smooth. Then gradient descent with step size γ = 1/L satisfies

f(w^t) - f(w^*) ≤ L ||w^0 - w^*||^2 / (2t),

where w^* is a minimizer of f.

But these assumptions are not true for many problems.

Is f always differentiable?

Page 7

Change of notation: keep the loss and the regularizer separate

Loss function

The Training problem

If L or R is not differentiable, then L + R is not differentiable.

If L or R is not smooth, then L + R is not smooth.

Page 8

Non-smooth Example

This does not fit our assumptions: it is not smooth.

Need more tools

Page 9

Assumptions for this class

The Training problem

What does this mean?

Assume this is easy to solve

Examples all together!!

Page 10

Examples

SVM with soft margin

Lasso

Not smooth, but prox is easy

Not smooth

Page 11

Convexity: Subgradient

w^* is a minimizer if and only if the zero vector is a subgradient: 0 ∈ ∂f(w^*), i.e., g = 0 for some subgradient g.

Page 12

Examples: L1 norm
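One convenient subgradient of the L1 norm is the coordinatewise sign, choosing g_i = 0 (an allowed value in [-1, 1]) wherever w_i = 0. A minimal sketch, with our own helper name `l1_subgradient`:

```python
import numpy as np

def l1_subgradient(w):
    """Return one valid subgradient of ||.||_1 at w: sign(w_i) where
    w_i != 0, and the allowed choice g_i = 0 where w_i == 0."""
    return np.sign(w)

# Subgradient inequality: ||u||_1 >= ||w||_1 + <g, u - w> for every u.
w = np.array([2.0, -0.5, 0.0])
g = l1_subgradient(w)
u = np.array([-1.0, 3.0, 0.7])
lhs = np.abs(u).sum()
rhs = np.abs(w).sum() + g @ (u - w)
```

The inequality `lhs >= rhs` holds for any choice of u, which is exactly what makes g a subgradient.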

Page 13

Optimality conditions

The Training problem

Page 14

Working example: Lasso

Lasso

This inclusion is difficult to solve directly, so we solve it iteratively.

Page 15

Proximal method I

The w that minimizes the upper bound gives gradient descent

But what about R(w)? Adding it on to the upper bound:

Can we minimize the right-hand side?

Page 16

Proximal method II

Minimizing the right-hand side of

What is this prox operator?

Page 17

Proximal Operator I

Rearranging

Exe: Is this proximal operator well defined? Is it even a function?

Page 18

Proximal Operator II: Optimality conditions

The Training problem

The optimal point is a fixed point of this prox-gradient map.

Page 19

Proximal Operator III: Properties

Exe:

Page 20

Proximal Operator IV: Soft thresholding

Exe:
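The prox of τ||·||_1 has the closed form S_τ(w)_i = sign(w_i) max(|w_i| - τ, 0), i.e., soft thresholding. A minimal sketch (the function name `soft_threshold` is ours):

```python
import numpy as np

def soft_threshold(w, tau):
    """prox_{tau * ||.||_1}(w): shrink every coordinate toward zero by
    tau, and set coordinates with |w_i| <= tau exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

shrunk = soft_threshold(np.array([3.0, -0.2, 0.5]), 0.5)
```

Coordinates with magnitude at most τ = 0.5 are set exactly to zero, which is what makes the L1 prox produce sparse iterates.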

Page 21

Proximal Operator V: Singular value thresholding

Similarly, the prox of the nuclear norm for matrices soft-thresholds the singular values:
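Singular value thresholding can be sketched by composing an SVD with the scalar soft-thresholding rule; `svt` is our own illustrative name:

```python
import numpy as np

def svt(W, tau):
    """prox of tau * (nuclear norm): soft-threshold the singular
    values of W while keeping its singular vectors."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Diagonal example: singular values (3, 1) shrink to (1.5, 0).
Z = svt(np.diag([3.0, 1.0]), 1.5)
```

Just as the L1 prox zeroes small coordinates, this zeroes small singular values, so the result has lower rank.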

Page 22

Proximal method V

Minimizing the right-hand side of

We build an iterative method based on this upper-bound minimization.

Page 23

The Proximal Gradient Method

Page 24

Iterative Soft Thresholding Algorithm (ISTA)

Lasso

Amir Beck and Marc Teboulle (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sciences.

ISTA:
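A minimal ISTA sketch for the lasso objective 0.5||Aw - b||^2 + λ||w||_1 (the function name `ista` and default iteration count are our own choices): each iteration takes a gradient step on the smooth least-squares part and then applies the L1 prox, i.e., soft thresholding.

```python
import numpy as np

def ista(A, b, lam, iters=500):
    """ISTA for min_w 0.5*||A w - b||^2 + lam*||w||_1."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # step 1/L, L = ||A||_2^2
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        z = w - gamma * (A.T @ (A @ w - b))   # gradient step on the loss
        # prox of gamma * lam * ||.||_1: soft thresholding
        w = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)
    return w
```

On a synthetic sparse-recovery problem (b generated from a sparse w), the iterates converge to a vector close to the sparse ground truth, up to the small shrinkage bias introduced by λ.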

Page 25

Convergence of Prox-GD

Theorem (Beck & Teboulle 2009)

Then

F(w^t) - F(w^*) ≤ L ||w^0 - w^*||^2 / (2t),

where F denotes the full objective (loss plus regularizer) and w^* is a minimizer of F.

Can we do better?

Amir Beck and Marc Teboulle (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sciences.

Page 26

The FISTA Method

Weird, but it works
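FISTA takes the same prox-gradient step, but from an extrapolated point: with t_1 = 1, it updates t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2 and y = w^k + ((t_k - 1)/t_{k+1}) (w^k - w^{k-1}). A sketch for the lasso (naming is ours):

```python
import numpy as np

def fista(A, b, lam, iters=500):
    """FISTA for min_w 0.5*||A w - b||^2 + lam*||w||_1: an ISTA step
    taken from the extrapolated (momentum) point y."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # step 1/L
    w = np.zeros(A.shape[1])
    w_prev = w
    y = w.copy()
    t = 1.0
    for _ in range(iters):
        z = y - gamma * (A.T @ (A @ y - b))   # gradient step from y
        w, w_prev = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0), w
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = w + ((t - 1) / t_next) * (w - w_prev)  # extrapolation
        t = t_next
    return w
```

The per-iteration cost is essentially the same as ISTA; only the extra extrapolation sequence changes the O(1/t) rate into O(1/t^2).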

Page 27

Convergence of FISTA

Theorem (Beck & Teboulle 2009)

Then

F(w^t) - F(w^*) ≤ 2L ||w^0 - w^*||^2 / (t + 1)^2,

where w^t are given by the FISTA algorithm, F is the full objective, and w^* is a minimizer of F.

Is this as good as it gets?

Amir Beck and Marc Teboulle (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sciences.

Page 28

Lab Session 01.10

Bring your laptop!

Page 29

Introduction to Stochastic Gradient Descent

Page 30

A Datum Function

Finite Sum Training Problem

Optimization Sum of Terms

Page 31

The Training Problem

Page 32

The Training Problem

Page 33

Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single data function at each iteration?

Unbiased Estimate

Let j be a random index sampled uniformly at random from {1, …, n}. Then

E_j[∇f_j(w)] = (1/n) Σ_{i=1}^n ∇f_i(w) = ∇f(w).
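The unbiasedness is just the observation that averaging the single-datum gradients over all indices recovers the full gradient exactly. A check on a small least-squares example (data and names are ours):

```python
import numpy as np

# f_i(w) = 0.5*(a_i^T w - b_i)^2 and f = (1/n) * sum_i f_i.
rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
w = rng.standard_normal(d)

def grad_i(i):
    """Gradient of the single datum function f_i at w."""
    return (A[i] @ w - b[i]) * A[i]

full_grad = A.T @ (A @ w - b) / n   # gradient of the averaged loss f

# E_j[grad f_j(w)] under uniform sampling is the average of the
# per-datum gradients, which equals the full gradient exactly.
expectation = sum(grad_i(i) for i in range(n)) / n
```

So one cheap gradient ∇f_j(w) is a noisy but unbiased estimate of ∇f(w): this is the key property behind stochastic gradient descent.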

Page 34

Stochastic Gradient Descent