
Optimization methods

Optimization-Based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_spring16

Carlos Fernandez-Granda

2/8/2016

Introduction

Aim: Overview of optimization methods that

- Tend to scale well with the problem dimension
- Are widely used in machine learning and signal processing
- Are (reasonably) well understood theoretically

Outline:

Differentiable functions: gradient descent, convergence analysis of gradient descent, accelerated gradient descent, projected gradient descent

Nondifferentiable functions: subgradient method, proximal gradient method, coordinate descent

Differentiable functions

Gradient

[Figure: contour plot of a function with level sets from 0.08 to 3.68; the gradient points in the direction of maximum variation.]

Gradient descent (aka steepest descent)

Method to solve the optimization problem

minimize f(x),

where f is differentiable

Gradient-descent iteration:

x^(0) = arbitrary initialization
x^(k+1) = x^(k) − α_k ∇f(x^(k))

where α_k is the step size
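As a concrete illustration, here is a minimal numpy sketch of this iteration; the quadratic objective and the constant step size in the example are placeholder choices, not part of the lecture:

```python
import numpy as np

def gradient_descent(grad_f, x0, step=0.1, iters=100):
    """Run gradient descent with a constant step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad_f(x)  # x^(k+1) = x^(k) - alpha * grad f(x^(k))
    return x

# Example: minimize f(x) = ||x||_2^2 / 2, whose gradient is x.
x_hat = gradient_descent(lambda x: x, x0=np.array([3.0, -2.0]))
```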

Gradient descent (1D)

[Figure: gradient-descent iterates on a one-dimensional function.]

Gradient descent (2D)

[Figure: gradient-descent iterates on the contour lines of a two-dimensional function, for a small step size and a large step size.]

Line search

- Exact:

  α_k := argmin_{β ≥ 0} f(x^(k) − β ∇f(x^(k)))

- Backtracking (Armijo rule):

  Given α_0 ≥ 0 and β ∈ (0, 1), set α_k := α_0 β^i for the smallest i such that

  f(x^(k+1)) ≤ f(x^(k)) − (α_k / 2) ||∇f(x^(k))||_2^2
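A minimal sketch of the backtracking rule in numpy, assuming f and grad_f are callables supplied by the caller (the names and defaults are illustrative):

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha0=1.0, beta=0.5):
    """Shrink alpha0 by beta until the sufficient-decrease condition
    f(x+) <= f(x) - (alpha/2) ||grad f(x)||_2^2 holds (Armijo rule)."""
    g = grad_f(x)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - 0.5 * alpha * np.dot(g, g):
        alpha *= beta
    return alpha
```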

Backtracking line search

[Figure: backtracking line search on the contour lines of a two-dimensional function.]

Convergence analysis of gradient descent

Lipschitz continuity

A function f : R^n → R^m is Lipschitz continuous with Lipschitz constant L if for any x, y ∈ R^n

||f(y) − f(x)||_2 ≤ L ||y − x||_2

Example:

f(x) := Ax is Lipschitz continuous with L = σ_max(A)

Quadratic upper bound

If the gradient of f : R^n → R is Lipschitz continuous with constant L,

||∇f(y) − ∇f(x)||_2 ≤ L ||y − x||_2,

then for any x, y ∈ R^n

f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ||y − x||_2^2

Consequence of quadratic bound

Since x^(k+1) = x^(k) − α_k ∇f(x^(k)),

f(x^(k+1)) ≤ f(x^(k)) − α_k (1 − α_k L / 2) ||∇f(x^(k))||_2^2

If α_k ≤ 1/L, the value of the function always decreases:

f(x^(k+1)) ≤ f(x^(k)) − (α_k / 2) ||∇f(x^(k))||_2^2

Gradient descent with constant step size

Conditions:

- f is convex
- ∇f is L-Lipschitz continuous
- There exists a solution x* such that f(x*) is finite

If α_k = α ≤ 1/L,

f(x^(k)) − f(x*) ≤ ||x^(0) − x*||_2^2 / (2 α k)

We need O(1/ε) iterations to get an ε-optimal solution

Proof

Recall that if α ≤ 1/L,

f(x^(i)) ≤ f(x^(i−1)) − (α/2) ||∇f(x^(i−1))||_2^2

By the first-order characterization of convexity,

f(x^(i−1)) − f(x*) ≤ ∇f(x^(i−1))^T (x^(i−1) − x*)

This implies

f(x^(i)) − f(x*) ≤ ∇f(x^(i−1))^T (x^(i−1) − x*) − (α/2) ||∇f(x^(i−1))||_2^2
               = (1/(2α)) ( ||x^(i−1) − x*||_2^2 − ||x^(i−1) − x* − α ∇f(x^(i−1))||_2^2 )
               = (1/(2α)) ( ||x^(i−1) − x*||_2^2 − ||x^(i) − x*||_2^2 )

Proof

Because the value of f never increases, summing the telescoping bound above gives

f(x^(k)) − f(x*) ≤ (1/k) Σ_{i=1}^k ( f(x^(i)) − f(x*) )
               ≤ (1/(2αk)) ( ||x^(0) − x*||_2^2 − ||x^(k) − x*||_2^2 )
               ≤ ||x^(0) − x*||_2^2 / (2 α k)

Backtracking line search

If the gradient of f : R^n → R is Lipschitz continuous with constant L, the step size in the backtracking line search satisfies

α_k ≥ α_min := min{ α_0, β/L }

Proof

Line search ends when

f(x^(k+1)) ≤ f(x^(k)) − (α_k / 2) ||∇f(x^(k))||_2^2

but we know that this holds whenever α_k ≤ 1/L

Since i is the smallest exponent that works, the search stops no later than the first i with α_0 β^i ≤ 1/L, at which point β/L ≤ α_0 β^i ≤ 1/L (unless α_0 ≤ 1/L already, in which case α_k = α_0)

Gradient descent with backtracking

Under the same conditions as before, gradient descent with backtracking line search achieves

f(x^(k)) − f(x*) ≤ ||x^(0) − x*||_2^2 / (2 α_min k)

so we need O(1/ε) iterations to get an ε-optimal solution

Strong convexity

A function f : R^n → R is S-strongly convex if for any x, y ∈ R^n

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (S/2) ||y − x||_2^2

Example:

f(x) := ||Ax − y||_2^2, where A ∈ R^{m×n} has full column rank (which requires m ≥ n), is strongly convex with S = 2 σ_min(A)^2

Gradient descent for strongly convex functions

If f is S-strongly convex and ∇f is L-Lipschitz continuous,

f(x^(k)) − f(x*) ≤ c^k (L/2) ||x^(0) − x*||_2^2,    c := (L/S − 1) / (L/S + 1)

We need O(log(1/ε)) iterations to get an ε-optimal solution

Accelerated gradient descent

Lower bounds for convergence rate

There exist convex functions with L-Lipschitz-continuous gradients such that for any algorithm that selects x^(k) from

x^(0) + span{ ∇f(x^(0)), ∇f(x^(1)), …, ∇f(x^(k−1)) }

we have

f(x^(k)) − f(x*) ≥ 3 L ||x^(0) − x*||_2^2 / (32 (k + 1)^2)

Nesterov’s accelerated gradient method

Achieves the lower bound, i.e. O(1/√ε) convergence

Uses a momentum variable:

y^(k+1) = x^(k) − α_k ∇f(x^(k))
x^(k+1) = β_k y^(k+1) + γ_k y^(k)

Despite the guarantees, why this works is not completely understood
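The slides leave β_k and γ_k unspecified; one standard choice (the same weight FISTA uses later in this lecture) extrapolates with k/(k+3). A minimal numpy sketch under that assumption:

```python
import numpy as np

def nesterov_gd(grad_f, x0, step=0.1, iters=100):
    """Accelerated gradient descent; momentum weight k/(k+3) is one
    standard choice for the unspecified beta_k, gamma_k."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    for k in range(iters):
        y_next = x - step * grad_f(x)              # gradient step
        x = y_next + (k / (k + 3)) * (y_next - y)  # momentum extrapolation
        y = y_next
    return y
```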


Projected gradient descent

Optimization problem

minimize f(x)
subject to x ∈ S,

where f is differentiable and S is convex

Projected-gradient-descent iteration:

x^(0) = arbitrary initialization
x^(k+1) = P_S ( x^(k) − α_k ∇f(x^(k)) )
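A minimal sketch, with projection onto the nonnegative orthant as an illustrative choice of S (any convex set with a computable projection works):

```python
import numpy as np

def projected_gd(grad_f, project, x0, step=0.1, iters=100):
    """Projected gradient descent: gradient step, then projection onto S."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = project(x - step * grad_f(x))
    return x

# Example: minimize ||x - c||_2^2 / 2 subject to x >= 0.
c = np.array([1.0, -2.0])
x_hat = projected_gd(lambda x: x - c, lambda v: np.maximum(v, 0.0),
                     x0=np.zeros(2))
```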

Projected gradient descent

[Figure: projected-gradient-descent iterates on the contour lines of a two-dimensional function over a convex constraint set.]

Nondifferentiable functions

Subgradient method

Optimization problem

minimize f(x)

where f is convex but nondifferentiable

Subgradient-method iteration:

x^(0) = arbitrary initialization
x^(k+1) = x^(k) − α_k q^(k)

where q^(k) is a subgradient of f at x^(k)

Least-squares regression with ℓ1-norm regularization

minimize (1/2) ||Ax − y||_2^2 + λ ||x||_1

A sum of subgradients is a subgradient of the sum:

q^(k) = A^T (A x^(k) − y) + λ sign(x^(k))

Subgradient-method iteration:

x^(0) = arbitrary initialization
x^(k+1) = x^(k) − α_k ( A^T (A x^(k) − y) + λ sign(x^(k)) )
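A minimal numpy sketch of this iteration with the diminishing schedule α_0/√k used in the experiment below (α_0 and the iteration count are placeholder values):

```python
import numpy as np

def lasso_subgradient(A, y, lam, alpha0=1e-3, iters=500):
    """Subgradient method for (1/2)||Ax - y||_2^2 + lam * ||x||_1."""
    x = np.zeros(A.shape[1])
    for k in range(1, iters + 1):
        q = A.T @ (A @ x - y) + lam * np.sign(x)  # subgradient at x^(k)
        x = x - (alpha0 / np.sqrt(k)) * q         # diminishing step size
    return x
```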

Convergence of subgradient method

It is not a descent method

The convergence rate can be shown to be O(1/ε²)

Diminishing step sizes are necessary for convergence

Experiment:

minimize (1/2) ||Ax − y||_2^2 + λ ||x||_1

A ∈ R^{2000×1000}, y = A x_0 + z, where x_0 is 100-sparse and z is iid Gaussian

Convergence of subgradient method

[Figure: relative error (f(x^(k)) − f(x*)) / f(x*) versus iteration k over 100 iterations, for step sizes α_0, α_0/√k, and α_0/k.]

Convergence of subgradient method

[Figure: relative error (f(x^(k)) − f(x*)) / f(x*) versus iteration k over 5000 iterations, for step sizes α_0, α_0/√k, and α_0/k.]


Composite functions

An interesting class of functions for data analysis:

f(x) + g(x)

where f is convex and differentiable, and g is convex but not differentiable

Example:

Least-squares regression (f) + ℓ1-norm regularization (g):

(1/2) ||Ax − y||_2^2 + λ ||x||_1

Interpretation of gradient descent

Solution of a local first-order approximation:

x^(k+1) := x^(k) − α_k ∇f(x^(k))
        = argmin_x ||x − (x^(k) − α_k ∇f(x^(k)))||_2^2
        = argmin_x f(x^(k)) + ∇f(x^(k))^T (x − x^(k)) + (1/(2 α_k)) ||x − x^(k)||_2^2

Proximal gradient method

Idea: Minimize the local first-order approximation plus g

x^(k+1) = argmin_x f(x^(k)) + ∇f(x^(k))^T (x − x^(k)) + (1/(2 α_k)) ||x − x^(k)||_2^2 + g(x)
        = argmin_x (1/2) ||x − (x^(k) − α_k ∇f(x^(k)))||_2^2 + α_k g(x)
        = prox_{α_k g} ( x^(k) − α_k ∇f(x^(k)) )

Proximal operator:

prox_g(y) := argmin_x g(x) + (1/2) ||y − x||_2^2

Proximal gradient method

Method to solve the optimization problem

minimize f(x) + g(x),

where f is differentiable and prox_g is tractable

Proximal-gradient iteration:

x^(0) = arbitrary initialization
x^(k+1) = prox_{α_k g} ( x^(k) − α_k ∇f(x^(k)) )
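A minimal generic sketch, assuming prox_g is supplied as a callable taking the point and the scale α (this two-argument interface is an illustrative choice, not part of the slides):

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, step, iters=100):
    """Proximal-gradient iteration: gradient step on f, then prox of alpha*g."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = prox_g(x - step * grad_f(x), step)  # prox_{alpha g}(gradient step)
    return x
```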

Interpretation as a fixed-point method

A vector x̂ is a solution to

minimize f(x) + g(x),

if and only if it is a fixed point of the proximal-gradient iteration for any α > 0:

x̂ = prox_{α g} ( x̂ − α ∇f(x̂) )

Projected gradient descent as a proximal method

The proximal operator of the indicator function

I_S(x) := 0 if x ∈ S, ∞ if x ∉ S

of a convex set S ⊆ R^n is the projection onto S

Proximal-gradient iteration:

x^(k+1) = prox_{α_k I_S} ( x^(k) − α_k ∇f(x^(k)) ) = P_S ( x^(k) − α_k ∇f(x^(k)) )

Proximal operator of the ℓ1 norm

The proximal operator of the ℓ1 norm is the soft-thresholding operator

prox_{β ||·||_1}(y) = S_β(y)

where β > 0 and

S_β(y)_i := y_i − sign(y_i) β if |y_i| ≥ β, and 0 otherwise
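In code, soft thresholding is a direct one-line transcription of this definition:

```python
import numpy as np

def soft_threshold(y, beta):
    """Entrywise prox of beta * ||.||_1: shrink each entry toward zero by beta."""
    return np.sign(y) * np.maximum(np.abs(y) - beta, 0.0)
```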

Iterative Shrinkage-Thresholding Algorithm (ISTA)

The proximal gradient method for the problem

minimize (1/2) ||Ax − y||_2^2 + λ ||x||_1

is called ISTA

ISTA iteration:

x^(0) = arbitrary initialization
x^(k+1) = S_{α_k λ} ( x^(k) − α_k A^T (A x^(k) − y) )
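A minimal sketch of ISTA with the constant step α = 1/L, where L = σ_max(A)² is the Lipschitz constant of the gradient of the least-squares term (the iteration count is a placeholder):

```python
import numpy as np

def ista(A, y, lam, iters=200):
    """ISTA for (1/2)||Ax - y||_2^2 + lam * ||x||_1."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2  # step 1/L, L = sigma_max(A)^2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        v = x - alpha * A.T @ (A @ x - y)    # gradient step on f
        x = np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)  # S_{alpha*lam}
    return x
```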

Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)

ISTA can be accelerated using Nesterov’s accelerated gradient method

FISTA iteration:

x^(0) = arbitrary initialization
z^(0) = x^(0)
x^(k+1) = S_{α_k λ} ( z^(k) − α_k A^T (A z^(k) − y) )
z^(k+1) = x^(k+1) + (k / (k + 3)) (x^(k+1) − x^(k))
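The same sketch as ISTA with the momentum step added; only the extrapolation line is new:

```python
import numpy as np

def fista(A, y, lam, iters=200):
    """FISTA: ISTA plus Nesterov momentum with weight k/(k+3)."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(iters):
        v = z - alpha * A.T @ (A @ z - y)                           # gradient step at z
        x_next = np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)
        z = x_next + (k / (k + 3)) * (x_next - x)                   # momentum extrapolation
        x = x_next
    return x
```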

Convergence of proximal gradient method

Without acceleration:

- Descent method
- Convergence rate can be shown to be O(1/ε) with a constant step size or backtracking line search

With acceleration:

- Not a descent method
- Convergence rate can be shown to be O(1/√ε) with a constant step size or backtracking line search

Experiment: minimize (1/2) ||Ax − y||_2^2 + λ ||x||_1

A ∈ R^{2000×1000}, y = A x_0 + z, x_0 100-sparse and z iid Gaussian

Convergence of proximal gradient method

[Figure: relative error (f(x^(k)) − f(x*)) / f(x*) versus iteration k for the subgradient method with step α_0/√k, ISTA, and FISTA.]


Coordinate descent

Idea: Solve the n-dimensional problem

minimize h(x_1, x_2, …, x_n)

by solving a sequence of 1D problems

Coordinate-descent iteration:

x^(0) = arbitrary initialization
x^(k+1)_i = argmin_α h( x^(k)_1, …, α, …, x^(k)_n )    for some 1 ≤ i ≤ n

Coordinate descent

Convergence is guaranteed for functions of the form

f(x) + Σ_{i=1}^n g_i(x_i)

where f is convex and differentiable and g_1, …, g_n are convex

Least-squares regression with ℓ1-norm regularization

h(x) := (1/2) ||Ax − y||_2^2 + λ ||x||_1

The solution to the subproblem min_{x_i} h(x_1, …, x_i, …, x_n) is

x̂_i = S_λ(γ_i) / ||A_i||_2^2

where A_i is the ith column of A and

γ_i := Σ_{l=1}^m A_{li} ( y_l − Σ_{j≠i} A_{lj} x_j )
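A minimal sketch of cyclic coordinate descent using this closed-form update (the sweep count is a placeholder; practical implementations update the residual incrementally rather than recomputing it per coordinate):

```python
import numpy as np

def lasso_coordinate_descent(A, y, lam, sweeps=50):
    """Cyclic coordinate descent for (1/2)||Ax - y||_2^2 + lam * ||x||_1."""
    m, n = A.shape
    x = np.zeros(n)
    col_sq = np.sum(A ** 2, axis=0)           # ||A_i||_2^2 for each column
    for _ in range(sweeps):
        for i in range(n):
            r = y - A @ x + A[:, i] * x[i]    # residual excluding coordinate i
            gamma = A[:, i] @ r               # gamma_i from the slide
            x[i] = np.sign(gamma) * max(abs(gamma) - lam, 0.0) / col_sq[i]
    return x
```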

Computational experiments

From "Statistical Learning with Sparsity: The Lasso and Generalizations" by Hastie, Tibshirani and Wainwright
