
Optimization for Machine Learning

Proximal operator and proximal gradient methods

Lecturers: Francis Bach & Robert M. Gower

Tutorials: Hadrien Hendrikx, Rui Yuan, Nidham Gazagnadou

African Master's in Machine Intelligence (AMMI), Kigali

References for today's class

Amir Beck and Marc Teboulle (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences.

Sébastien Bubeck (2015). Convex Optimization: Algorithms and Complexity.

Chapter 1 and Section 5.1

A Datum Function

Finite Sum Training Problem

Optimization Sum of Terms

The Training Problem
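For reference, the finite-sum training problem on these slides has the standard form (a reconstruction in the usual notation, with \( f_i \) the loss of the model \( w \) on the i-th datum):

\[ \min_{w \in \mathbb{R}^d} \; f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w). \]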

Convergence of GD (strongly convex case)

Theorem

Let f be m-strongly convex and L-smooth. Then gradient descent with step size 1/L satisfies

\[ \|w^t - w^\star\|^2 \le \Big(1 - \frac{m}{L}\Big)^{t} \|w^0 - w^\star\|^2, \]

where \( w^\star \) denotes the unique minimizer of f.

Convergence of GD (convex case)

Theorem

Let f be convex and L-smooth. Then gradient descent with step size 1/L satisfies

\[ f(w^t) - f(w^\star) \le \frac{L\,\|w^0 - w^\star\|^2}{2t}, \]

where \( w^\star \) denotes a minimizer of f.
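As a reminder of the method these rates refer to, here is a minimal sketch (the names grad_f and L_smooth and the least-squares usage below are illustrative assumptions, not from the slides):

    import numpy as np

    def gradient_descent(grad_f, w0, L_smooth, n_iters=100):
        # Gradient descent with the constant step size 1/L used in the theorems above
        w = np.array(w0, dtype=float)
        for _ in range(n_iters):
            w = w - grad_f(w) / L_smooth
        return w

For least squares one could take, for example, grad_f = lambda w: X.T @ (X @ w - y) / n and L_smooth = np.linalg.norm(X, 2) ** 2 / n.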

These assumptions do not hold for many problems

Is f always differentiable?

Change notation: keep the loss and the regularizer separate

Loss function

The Training problem
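Keeping the two terms separate, the training problem takes the composite form (again a reconstruction in the lecture's notation, with L the data-fit loss and R the regularizer):

\[ \min_{w \in \mathbb{R}^d} \; F(w) := L(w) + R(w). \]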

If L or R is not differentiable, then in general L + R is not differentiable.

If L or R is not smooth, then in general L + R is not smooth.

Non-smooth Example

A quadratic upper bound does not fit: the function is not smooth

Need more tools

Assumptions for this class

The Training problem

What does this mean?

Assume this is easy to solve

Examples all together!!

Examples

SVM with soft margin

Lasso

Not smooth, but prox is easy

Not smooth
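For example, the Lasso instance of this template reads (standard notation assumed here: X the data matrix, y the targets, \( \lambda > 0 \) the regularization strength):

\[ \min_{w} \; \frac{1}{2n}\|Xw - y\|_2^2 + \lambda \|w\|_1, \]

with a smooth quadratic loss and an \( \ell_1 \) regularizer that is not smooth but whose prox is easy.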

Convexity: Subgradient

A vector g is a subgradient of a convex f at w if \( f(z) \ge f(w) + \langle g, z - w\rangle \) for all z. A point w minimizes f exactly when g = 0 is a subgradient at w, i.e. \( 0 \in \partial f(w) \).

Examples: L1 norm
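Concretely, for a single coordinate the subdifferential of the absolute value is the standard set-valued map

\[ \partial |w| = \begin{cases} \{\operatorname{sign}(w)\}, & w \neq 0, \\ [-1, 1], & w = 0, \end{cases} \]

and the subdifferential of \( \|w\|_1 \) applies this coordinate-wise.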

Optimality conditions

The Training problem

Working example: Lasso

Lasso

Solving the inclusion \( 0 \in \nabla L(w) + \partial R(w) \) directly is difficult, so we do it iteratively.

Proximal method I

The w that minimizes the quadratic upper bound given by smoothness recovers gradient descent.

But what about R(w)? Adding R(z) to both sides of the upper bound (with step size \( \gamma \) the inverse of the smoothness constant of the loss):

\[ L(z) + R(z) \;\le\; L(w^t) + \langle \nabla L(w^t), z - w^t\rangle + \frac{1}{2\gamma}\|z - w^t\|^2 + R(z). \]

Can we minimize the right-hand side?

Proximal method II

Minimizing the right-hand side of the upper bound over z gives the update \( w^{t+1} = \operatorname{prox}_{\gamma R}\big(w^t - \gamma \nabla L(w^t)\big) \).

What is this prox operator?

Proximal Operator I

Rearranging (completing the square and dropping terms that do not depend on z), the minimizer of the upper bound is

\[ \arg\min_z \Big\{ \frac{1}{2\gamma}\big\|z - \big(w^t - \gamma \nabla L(w^t)\big)\big\|^2 + R(z) \Big\} = \operatorname{prox}_{\gamma R}\!\big(w^t - \gamma \nabla L(w^t)\big), \]

where for any point u

\[ \operatorname{prox}_{\gamma R}(u) := \arg\min_z \Big\{ \gamma R(z) + \frac{1}{2}\|z - u\|^2 \Big\}. \]

Exe: Is this proximal operator well defined? Is it even a function?

Proximal Operator II: Optimality conditions

The Training problem

The optimal point is a fixed point of the prox-gradient map.
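In symbols (a standard characterization, with \( w^\star \) a minimizer of L + R and any \( \gamma > 0 \)):

\[ 0 \in \nabla L(w^\star) + \partial R(w^\star) \quad\Longleftrightarrow\quad w^\star = \operatorname{prox}_{\gamma R}\!\big(w^\star - \gamma \nabla L(w^\star)\big). \]

So running the prox-gradient update from the optimum leaves it unchanged.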

Proximal Operator III: Properties

Exe:

Proximal Operator IV: Soft thresholding

Exe:
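As a concrete illustration (a minimal NumPy sketch, not taken from the slides), the prox of \( \gamma \|\cdot\|_1 \) acts coordinate-wise as soft thresholding:

    import numpy as np

    def soft_threshold(w, gamma):
        # prox of gamma * ||.||_1: shrink each coordinate towards zero by gamma,
        # zeroing out coordinates whose magnitude is below gamma
        return np.sign(w) * np.maximum(np.abs(w) - gamma, 0.0)

Coordinates inside [−γ, γ] are mapped exactly to zero, which is what produces sparse solutions for the Lasso.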

Proximal Operator V: Singular value thresholding

Similarly, the prox of the nuclear norm of a matrix is obtained by soft-thresholding its singular values:
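A minimal NumPy sketch of this singular value thresholding (the function name is illustrative):

    import numpy as np

    def singular_value_threshold(W, gamma):
        # prox of gamma * (nuclear norm): soft-threshold the singular values of W
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return (U * np.maximum(s - gamma, 0.0)) @ Vt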

Proximal method V

Minimizing the right-hand side of the upper bound at each iterate, we can make an iterative method based on this upper-bound minimization.

The Proximal Gradient Method

Iterative Soft Thresholding Algorithm (ISTA)

Lasso

Amir Beck and Marc Teboulle (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences.

ISTA: the proximal gradient method applied to the Lasso, where \( R = \lambda\|\cdot\|_1 \), so each step is a gradient step on the quadratic loss followed by soft thresholding of the result.
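For instance, a minimal NumPy sketch of ISTA for the Lasso objective \( \frac{1}{2n}\|Xw - y\|_2^2 + \lambda \|w\|_1 \) (the step-size choice and iteration count are illustrative assumptions):

    import numpy as np

    def soft_threshold(w, gamma):
        return np.sign(w) * np.maximum(np.abs(w) - gamma, 0.0)

    def ista(X, y, lam, n_iters=500):
        # Proximal gradient (ISTA): gradient step on the quadratic loss,
        # followed by the prox of gamma * lam * ||.||_1 (soft thresholding)
        n, d = X.shape
        gamma = n / np.linalg.norm(X, 2) ** 2   # step size 1/L, with L = ||X||_2^2 / n
        w = np.zeros(d)
        for _ in range(n_iters):
            grad = X.T @ (X @ w - y) / n
            w = soft_threshold(w - gamma * grad, gamma * lam)
        return w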

Convergence of Prox-GD

Theorem (Beck & Teboulle 2009)

Let the loss L be convex and smooth, and let R be convex. Then the proximal gradient iterates w^t with step size \( \gamma \) equal to the inverse of the smoothness constant of L satisfy

\[ F(w^t) - F(w^\star) \le \frac{\|w^0 - w^\star\|^2}{2\gamma\, t}, \]

where \( F := L + R \) and \( w^\star \) is a minimizer of F.

Can we do better?

Amir Beck and Marc Teboulle (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences.

The FISTA Method

Weird, but it works
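A hedged sketch of FISTA on the same Lasso example (it assumes the soft_threshold helper from the ISTA sketch above; the extrapolation sequence t is the one of Beck and Teboulle):

    import numpy as np

    def fista(X, y, lam, n_iters=500):
        # Accelerated proximal gradient: a prox-gradient step at the extrapolated
        # point z, followed by a momentum update of z
        n, d = X.shape
        gamma = n / np.linalg.norm(X, 2) ** 2   # step size 1/L
        w = np.zeros(d)
        z = w.copy()
        t = 1.0
        for _ in range(n_iters):
            grad = X.T @ (X @ z - y) / n
            w_next = soft_threshold(z - gamma * grad, gamma * lam)
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            z = w_next + ((t - 1.0) / t_next) * (w_next - w)
            w, t = w_next, t_next
        return w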

Convergence of FISTA

Theorem (Beck & Teboulle 2009)

Under the same assumptions and with the same step size \( \gamma \),

\[ F(w^t) - F(w^\star) \le \frac{2\,\|w^0 - w^\star\|^2}{\gamma\,(t+1)^2}, \]

where the w^t are given by the FISTA algorithm.

Is this as good as it gets?

Amir Beck and Marc Teboulle (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences.

Lab Session 01.10

Bring your laptop!

Introduction to Stochastic Gradient Descent

A Datum Function

Finite Sum Training Problem

Optimization Sum of Terms

The Training Problem


Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single data function at each iteration?

Unbiased Estimate

Let j be an index sampled uniformly at random from {1, …, n}. Then

\[ \mathbb{E}_j\big[\nabla f_j(w)\big] = \nabla f(w). \]
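A one-line check, using the finite-sum form \( f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w) \) from the start of the lecture:

\[ \mathbb{E}_j\big[\nabla f_{j}(w)\big] \;=\; \sum_{i=1}^{n} \frac{1}{n}\, \nabla f_i(w) \;=\; \nabla\!\Big( \frac{1}{n}\sum_{i=1}^{n} f_i(w) \Big) \;=\; \nabla f(w). \]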

Stochastic Gradient Descent
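As a minimal illustration (not from the slides; the least-squares datum functions and the decreasing step sizes are assumptions made only for this example), SGD uses the gradient of one randomly sampled datum per iteration:

    import numpy as np

    def sgd(X, y, n_iters=10000, gamma0=0.1, seed=0):
        # SGD for f(w) = (1/n) sum_i f_i(w) with f_i(w) = 0.5 * (x_i^T w - y_i)^2
        n, d = X.shape
        w = np.zeros(d)
        rng = np.random.default_rng(seed)
        for t in range(n_iters):
            j = rng.integers(n)                       # sample one index uniformly at random
            grad_j = (X[j] @ w - y[j]) * X[j]         # gradient of the single datum function f_j
            w = w - gamma0 / np.sqrt(t + 1) * grad_j  # decreasing step size
        return w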

Recommended