
First-order methods for structured nonsmooth optimization

Sangwoon Yun

Department of Mathematics Education
Sungkyunkwan University

Oct 19, 2016
Center for Mathematical Analysis & Computation, Yonsei University


Outline

I. Coordinate (Gradient) Descent Method

II. Incremental Gradient Method

III. (Linearized) Alternating Direction Method of Multipliers



I. Coordinate (Gradient) Descent Method


Structured Nonsmooth Optimization

min_x F(x) = f(x) + P(x),

f: real-valued, (convex) smooth on dom f

P: proper, convex, lsc

In particular, P is separable, i.e., P(x) = ∑_{j=1}^n P_j(x_j)

Bound constraint:

P(x) = 0 if l ≤ x ≤ u; ∞ else,

where l ≤ u (possibly with −∞ or ∞ components).

ℓ1-norm: P(x) = λ‖x‖_1 with λ > 0

or indicator function of linear constraints (Ax = b).


Bound-constrained Optimization

min_{l ≤ x ≤ u} f(x),

where f: ℝ^N → ℝ is smooth and l ≤ u (possibly with −∞ or ∞ components).

Can be reformulated as the following unconstrained optimization problem:

min_x f(x) + P(x),

where P(x) = 0 if l ≤ x ≤ u; ∞ else.


ℓ1-regularized Convex Minimization

1. ℓ1-regularized linear least squares problem

Find x so that Ax − b ≈ 0 and x has “few” nonzeros. Formulate this as an unconstrained convex optimization problem:

min_{x∈ℝ^n} ‖Ax − b‖_2^2 + λ‖x‖_1

2. ℓ1-regularized logistic regression problem

min_{w∈ℝ^{n−1}, v∈ℝ} (1/m) ∑_{i=1}^m log(1 + exp(−(w^T a_i + v b_i))) + λ‖w‖_1,

where a_i = b_i z_i and (z_i, b_i) ∈ ℝ^{n−1} × {−1, 1}, i = 1, ..., m, are a given set of (observed or training) examples.
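To make the composite structure concrete, here is a minimal proximal-gradient (ISTA-type) sketch for the ℓ1-regularized least squares problem above, assuming NumPy and a fixed stepsize equal to the reciprocal of the gradient's Lipschitz constant; this is only an illustration, not the coordinate gradient descent method discussed later, and all names are ours.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t*||.||_1 (componentwise shrinkage)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_l1_least_squares(A, b, lam, num_iters=500):
    # minimize ||Ax - b||_2^2 + lam*||x||_1 by proximal gradient steps
    L = 2.0 * np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = 2.0 * A.T @ (A @ x - b)           # gradient of ||Ax - b||_2^2
        x = soft_threshold(x - grad / L, lam / L)
    return x
```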


Support Vector Machines


Support Vector Machines

Support Vector Classification

Training points: z_i ∈ ℝ^p, i = 1, ..., n.

Consider a simple case with two classes (linearly separable case). Define a vector a:

a_i = 1 if z_i is in class 1; −1 if z_i is in class 2

A hyperplane (0 = w^T z − b) separates the data with maximal margin. The margin is the distance from the hyperplane to the nearest of the positive and negative points. The nearest points lie on the planes ±1 = w^T z − b.


Support Vector Machines

Convex Quadratic Programming Problem: Support Vector Classification

The (original) Optimization Problem

min_{w,b} (1/2)‖w‖_2^2

subject to a_i (w^T z_i − b) ≥ 1, i = 1, ..., n.

The Modified Optimization Problem (allows, but penalizes, the failure of a point to reach the correct margin)

min_{w,b,ξ} (1/2)‖w‖_2^2 + C ∑_{i=1}^n ξ_i

subject to a_i (w^T z_i − b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., n.


Support Vector Machines

SVM (Dual) Optimization Problem (Convex Quadratic Program)

min_x (1/2) x^T Q x − e^T x

subject to 0 ≤ x_i ≤ C, i = 1, ..., n, and a^T x = 0,

where a ∈ {−1, 1}^n, 0 < C ≤ ∞, e = [1, ..., 1]^T, Q ∈ ℝ^{n×n} is symmetric positive semidefinite with Q_ij = a_i a_j K(z_i, z_j), K: ℝ^p × ℝ^p → ℝ (the “kernel function”), and z_i ∈ ℝ^p (the “i-th data point”), i = 1, ..., n.

Popular Choices of K:

linear kernel K(z_i, z_j) = z_i^T z_j

radial basis function kernel K(z_i, z_j) = exp(−γ‖z_i − z_j‖_2^2)

sigmoid kernel K(z_i, z_j) = tanh(γ z_i^T z_j)

where γ is a constant.
Q is an n × n fully dense matrix and can even be indefinite. (n ≥ 5000)
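A minimal sketch of how the dense matrix Q is assembled in practice, assuming NumPy and the radial basis function kernel listed above; the function names and the value of gamma are illustrative, not taken from any particular SVM package.

```python
import numpy as np

def rbf_kernel(zi, zj, gamma):
    # K(z_i, z_j) = exp(-gamma * ||z_i - z_j||_2^2)
    return np.exp(-gamma * np.sum((zi - zj) ** 2))

def build_Q(Z, a, gamma=0.1):
    # Q_ij = a_i * a_j * K(z_i, z_j); Z has shape (n, p), a in {-1, +1}^n
    n = Z.shape[0]
    Q = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            Q[i, j] = a[i] * a[j] * rbf_kernel(Z[i], Z[j], gamma)
    return Q   # fully dense n-by-n matrix: O(n^2) storage
```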


Rank Minimization

Now, imagine that we only observe a few entries of a data matrix. Then is itpossible to accurately guess the entries that we have not seen?

Netflix problem: Given a sparse matrix where M_ij is the rating given by user i on movie j, predict the rating a user would assign to a movie he has not seen, i.e., we would like to infer users' preferences for unrated movies. (Impossible in general!)

The problem is ill-posed. Intuitively, users' preferences depend only on a few factors, i.e., rank(M) is small.

Thus it can be formulated as the low-rank matrix completion problem (affine rank minimization):

min_{X∈ℝ^{m×n}} { rank(X) | X_ij = M_ij, (i, j) ∈ Ω },   (NP-hard!)

where Ω is an index set of p observed entries.


Rank Minimization

Nuclear norm minimization:

min_{X∈ℝ^{m×n}} { ‖X‖_* := ∑_{i=1}^m σ_i(X) | X_ij = M_ij, (i, j) ∈ Ω },

where the σ_i(X)’s are the singular values of X.

A more general nuclear norm minimization problem:

min_{X∈ℝ^{m×n}} { ‖X‖_* : A(X) = b }.

When the matrix variable is restricted to be diagonal, the above problem reduces to the following ℓ1-minimization problem:

min_{x∈ℝ^n} { ‖x‖_1 : Ax = b }.


Nuclear Norm Regularized Least Squares Problem

If the observation b is contaminated with noise,

consider the following nuclear norm regularized least squares problem:

min_{X∈ℝ^{m×n}} (1/2)‖A(X) − b‖_2^2 + µ‖X‖_*,

where µ > 0 is a given parameter.

This problem appears in many applications of engineering and science, including

collaborative filtering

global positioning

system identification

remote sensing

computer vision
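First-order methods for this problem are built around the proximal map of the nuclear norm, which is singular value soft-thresholding. Below is a minimal NumPy sketch of that standard operation (the name svt is ours); a proximal gradient method would alternate a gradient step on (1/2)‖A(X) − b‖_2^2 with a call to it.

```python
import numpy as np

def svt(Y, tau):
    # prox of tau*||.||_* : soft-threshold the singular values of Y
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```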


Sparse Portfolio Selection

How to allocate an investor's available capital into a prefixed set of assets with the aims of maximizing the expected return and minimizing the investment risk.

Traditional Markowitz portfolio selection model:

min_x x^T Q x

subject to µ^T x = β, e^T x = 1, x ≥ 0,

where β is the desired expected return of the portfolio.

Modified models:

min_x x^T Q x

subject to µ^T x = β, e^T x = 1, x ≥ 0, ‖x‖_0 ≤ K.

min_x ‖x‖_0

subject to µ^T x = β, x^T Q x ≤ α, e^T x = 1, x ≥ 0.


Sparse Covariance Selection

Undirected graphical models offer a way to describe and explain relationships among a set of variables, a central element of multivariate data analysis.

From a sample covariance matrix, we wish to estimate the true covariance matrix, some entries of whose inverse are zero, by maximizing the ℓ1-regularized log-likelihood:

max_{X∈S^n} log det X − ⟨Σ, X⟩ − ∑_{(i,j)∉V} ρ_ij |X_ij|
subject to X_ij = 0, ∀(i, j) ∈ V,

where

Σ ∈ S^n_+ is an empirical covariance matrix, and Σ is singular or nearly so.

We may want to impose structural conditions on Σ^{−1}, such as conditional independence, which is reflected as zero entries in Σ^{−1}.

V is a collection of all pairs of conditionally independent nodes.

ρ_ij > 0: a parameter controlling the trade-off between goodness-of-fit and the sparsity of X.


Sparse Covariance Selection

− log det X + ⟨Σ, X⟩ is strictly convex and continuously differentiable on its domain S^n_{++}, and takes O(n^3) operations to evaluate. In applications, n can exceed 5000.

The dual problem can be expressed as:

min_{X∈S^n} − log det X − n

subject to |(X − Σ)_ij| ≤ υ_ij, i, j = 1, ..., n,

where υ_ij = ρ_ij for all (i, j) ∉ V and υ_ij = ∞ for all (i, j) ∈ V.


Coordinate Descent Method

When P ≡ 0: given x ∈ ℝ^n, choose i ∈ N = {1, ..., n} and update

x_new = arg min_{u : u_j = x_j ∀ j ≠ i} f(u).

Repeat until convergence.

Gauss-Seidel rule: choose i cyclically, 1, 2, ..., n, 1, 2, ...

Gauss-Southwell rule: choose i with |∂f(x)/∂x_i| maximum.
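A minimal sketch of this exact coordinate descent iteration with the Gauss-Seidel (cyclic) rule, assuming NumPy and the strongly convex quadratic f(x) = (1/2) x^T A x − b^T x with A symmetric positive definite, for which the one-dimensional subproblem has a closed form; names and the fixed number of cycles are illustrative.

```python
import numpy as np

def coordinate_descent_quadratic(A, b, num_cycles=100):
    # minimize f(x) = 0.5*x^T A x - b^T x by exactly minimizing
    # over one coordinate at a time, cycling i = 1, 2, ..., n (Gauss-Seidel rule)
    n = len(b)
    x = np.zeros(n)
    for _ in range(num_cycles):
        for i in range(n):
            # df/dx_i = (A x)_i - b_i = 0  =>  shift x_i by the residual / A_ii
            x[i] += (b[i] - A[i, :] @ x) / A[i, i]
    return x
```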


Coordinate Descent Method

Properties:

If f is convex, then every cluster point of the x-sequence is a minimizer.

If f is nonconvex, then Gauss-Seidel can cycle [1], but Gauss-Southwell still converges.

Convergence is possible even when P ≢ 0 [2].

[1] M. J. D. Powell, On search directions for minimization algorithms, Math. Program. 4 (1973), 193–201.

[2] P. Tseng, Convergence of block coordinate descent method for nondifferentiable minimization, J. Optim. Theory Appl. 109 (2001), 473–492.


Coord. Gradient Descent Method

Descent direction. For x ∈ dom P, choose J (≠ ∅) ⊆ N and H ≻ 0. Then solve

min_{d : d_j = 0 ∀ j ∉ J} ∇f(x)^T d + (1/2) d^T H d + P(x + d) − P(x)   (direc. subprob)

Let d_H(x; J) and q_H(x; J) be the optimal solution and optimal objective value of the direction subproblem.

Facts:

d_H(x; N) = 0 ⇔ F′(x; d) ≥ 0 ∀ d ∈ ℝ^n.

H is diagonal ⇒ d_H(x; J) = ∑_{j∈J} d_H(x; j), q_H(x; J) = ∑_{j∈J} q_H(x; j).

q_H(x; J) ≤ −(1/2) d^T H d, where d = d_H(x; J).


Coord. Gradient Descent Method

This coordinate gradient descent approach may be viewed as a hybrid of gradient projection and coordinate descent. In particular,

if J = N and P(x) = 0 if l ≤ x ≤ u; ∞ else, then d_H(x; N) is a scaled gradient-projection direction for bound-constrained optimization.

if f is quadratic and we choose H = ∇²f(x), then d_H(x; J) is a (block) coordinate descent direction.

If H is diagonal, then the subproblems can be solved in parallel.

If P ≡ 0, then d_H(x)_j = −∇f(x)_j / H_jj.

If P(x) = 0 if l ≤ x ≤ u; ∞ else, then

d_H(x)_j = median{l_j − x_j, −∇f(x)_j / H_jj, u_j − x_j}.

If P is the ℓ1-norm, then
d_H(x)_j = −median{(∇f(x)_j − λ)/H_jj, x_j, (∇f(x)_j + λ)/H_jj}.
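The closed-form directions above translate directly into code. The sketch below, assuming NumPy and a diagonal H stored as a vector, evaluates d_H(x) componentwise for the three cases just listed; function and argument names are illustrative.

```python
import numpy as np

def cgd_direction(grad, x, H_diag, l=None, u=None, lam=None):
    # coordinate gradient descent direction for diagonal H (closed forms)
    if lam is not None:
        # P = lam*||.||_1 : d_j = -median{(g_j - lam)/H_jj, x_j, (g_j + lam)/H_jj}
        lo = (grad - lam) / H_diag
        hi = (grad + lam) / H_diag
        return -np.median(np.stack([lo, x, hi]), axis=0)
    if l is not None and u is not None:
        # bound constraints: d_j = median{l_j - x_j, -g_j/H_jj, u_j - x_j}
        return np.median(np.stack([l - x, -grad / H_diag, u - x]), axis=0)
    # P = 0 : plain scaled steepest descent direction
    return -grad / H_diag
```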


Coord. Gradient Descent Method

Stepsize: Armijo rule

Choose α to be the largest element of {β^k}_{k=0,1,...} satisfying

F(x + αd) − F(x) ≤ σ α q_H(x; J)   (0 < β < 1, 0 < σ < 1).

For the ℓ1-regularized linear least squares problem, the minimization rule

α ∈ arg min{F(x + td) | t ≥ 0}

or the limited minimization rule

α ∈ arg min{F(x + td) | 0 ≤ t ≤ s},

where 0 < s < ∞, can also be used.
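A minimal backtracking sketch of the Armijo rule above, assuming the objective F is available as a callable and the model decrease q = q_H(x; J) ≤ 0 has already been computed; the default parameter values are illustrative.

```python
def armijo_stepsize(F, x, d, q, beta=0.5, sigma=0.1, max_backtracks=50):
    # largest alpha in {1, beta, beta^2, ...} with
    # F(x + alpha*d) - F(x) <= sigma * alpha * q
    Fx = F(x)
    alpha = 1.0
    for _ in range(max_backtracks):
        if F(x + alpha * d) - Fx <= sigma * alpha * q:
            break
        alpha *= beta
    return alpha
```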


Coord. Gradient Descent Method

Choose J:

Gauss-Seidel rule: J cycles through {1}, {2}, ..., {n}.

Gauss-Southwell-r rule:

‖d_D(x; J)‖_∞ ≥ υ ‖d_D(x; N)‖_∞,

where 0 < υ ≤ 1 and D ≻ 0 is diagonal (e.g., D = diag(H)).

Gauss-Southwell-q rule:

q_D(x; J) ≤ υ q_D(x; N),

where 0 < υ ≤ 1 and D ≻ 0 is diagonal (e.g., D = diag(H)).


Coord. Gradient Descent Method

Advantage of CGD

The CGD method is simple, highly parallelizable, and well suited for solving large-scale problems.

CGD not only has cheaper iterations than exact coordinate descent, it also has stronger global convergence properties.


Convergence Results

Global convergence. If

0 ≺ λI ⪯ D, H ⪯ λ̄I,

J is chosen by the Gauss-Seidel, Gauss-Southwell-r, or Gauss-Southwell-q rule,

α is chosen by the Armijo rule,

then every cluster point of the x-sequence generated by the CGD method is a stationary point of F.


Convergence Results

Local convergence rate. If

0 ≺ λI ⪯ D, H ⪯ λ̄I,

J is chosen by the Gauss-Seidel or Gauss-Southwell-q rule,

α is chosen by the Armijo rule,

and, in addition, P and f satisfy any of the following assumptions, then the x-sequence generated by the CGD method converges at an R-linear rate.

C1 f is strongly convex and ∇f is Lipschitz continuous on dom P.

C2 f is (nonconvex) quadratic; P is polyhedral.

C3 f(x) = g(Ex) + q^T x, where E ∈ ℝ^{m×N}, q ∈ ℝ^N, g is strongly convex, and ∇g is Lipschitz continuous on ℝ^m; P is polyhedral.

C4 f(x) = max_{y∈Y} {(Ex)^T y − g(y)} + q^T x, where Y ⊆ ℝ^m is polyhedral, E ∈ ℝ^{m×N}, q ∈ ℝ^N, g is strongly convex, and ∇g is Lipschitz continuous on ℝ^m; P is polyhedral.


Convergence Results

Complexity Bound

If f is convex with Lipschitz continuous gradient, then the number of iterations needed to achieve ε-optimality is

Gauss-Seidel rule:

O(n^2 L r_0 / ε),

where L is a Lipschitz constant and r_0 = max{dist(x, X*)^2 | F(x) ≤ F(x^0)}.

Gauss-Southwell-q rule:

O(L r_0 / (υε) + max{0, ln(e_0 / r_0)}),

where e_0 = F(x^0) − min_{x∈X} F(x).



II. Incremental Gradient Method


Sum of several functions

min_x F_c(x) := f(x) + cP(x),

where c > 0, P: ℝ^n → (−∞, ∞] is a proper, convex, lower semicontinuous (lsc) function, and

f(x) := ∑_{i=1}^m f_i(x),

where each function f_i is real-valued and smooth (i.e., continuously differentiable) on ℝ^n.


Sum of several functions

In applications, m is often large (exceeding 10^4).

In this case, traditional gradient methods would be inefficient, since they require evaluating ∇f_i(x) for all i before updating x.

Incremental gradient methods (IGM), in contrast, update x after evaluating ∇f_i(x) for only one or a few i.

In the unconstrained case (P ≡ 0), this method has the basic form

x^{k+1} = x^k − α_k ∇f_{i_k}(x^k), k = 0, 1, ...,

where i_k is chosen to cycle through 1, ..., m (i.e., i_0 = 1, i_1 = 2, ..., i_{m−1} = m, i_m = 1, ...) and α_k > 0.
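A minimal sketch of this basic unconstrained incremental gradient loop, assuming the component gradients are supplied as a list of callables and a diminishing stepsize α_k = a/(k+1); the names and the particular stepsize schedule are illustrative.

```python
import numpy as np

def incremental_gradient(grad_fs, x0, a=0.1, num_passes=50):
    # x^{k+1} = x^k - alpha_k * grad f_{i_k}(x^k), i_k cycling through 1, ..., m
    x = x0.copy()
    m = len(grad_fs)
    k = 0
    for _ in range(num_passes):
        for i in range(m):              # cyclic order i_k = 1, 2, ..., m, 1, 2, ...
            alpha = a / (k + 1)         # diminishing stepsize
            x = x - alpha * grad_fs[i](x)
            k += 1
    return x
```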


Sum of several functions

For global convergence of IGMs, the stepsize α_k (also called the “learning rate”) needs to diminish to zero, which can lead to slow convergence [3].

If a constant stepsize is used, only convergence to an approximate solution can be shown [4].

Methods [5] to overcome this difficulty have been proposed. However, these methods need additional assumptions, such as ∇f_i(x) = 0 for all i at a stationary point x, to achieve global convergence without the stepsize tending to zero.

Moreover, their extension to P ≢ 0 is problematic.

[3] D. P. Bertsekas, A new class of incremental gradient methods for least squares problems, SIAM J. Optim. 7 (1997), 913–926.

[4] M. V. Solodov, Incremental gradient algorithms with stepsizes bounded away from zero, Comput. Optim. Appl. 11 (1998), 23–35.

[5] P. Tseng, An incremental gradient(-projection) method with momentum term and adaptive stepsize rule, SIAM J. Optim. 8 (1998), 506–531.


Sum of several functions

For the case P ≡ 0, Blatt, Hero, and Gauchman [6] proposed a method:

g^k = g^{k−1} + ∇f_{i_k}(x^k) − ∇f_{i_k}(x^{k−m}),
x^{k+1} = x^k − α g^k,

with α > 0 and g^{−1} = ∑_{i=1}^m ∇f_i(x^{i−m−1}), x^0, x^{−1}, ..., x^{−m} ∈ ℝ^n given.

It computes the gradient of a single component function at each iteration.

But instead of updating x using this gradient, it uses the sum of the m most recently computed component gradients.

This method requires more storage (O(mn) instead of O(n)) and slightly more communication/computation per iteration.

[6] D. Blatt, A. O. Hero, and H. Gauchman, A convergent incremental gradient method with a constant step size, SIAM J. Optim. 18 (2007), 29–51.
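A sketch of this aggregated-gradient idea, assuming NumPy and the component gradients given as callables: one component gradient is evaluated per iteration, the m most recent ones are stored, and the step uses their sum with a constant stepsize. The initialization here (all component gradients evaluated once at x^0) differs slightly from the description above, so treat it as illustrative only.

```python
import numpy as np

def aggregated_incremental_gradient(grad_fs, x0, alpha=1e-3, num_iters=1000):
    m = len(grad_fs)
    x = x0.copy()
    stored = [g(x) for g in grad_fs]     # O(m n) storage of component gradients
    g = np.sum(stored, axis=0)           # aggregated gradient g^k
    for k in range(num_iters):
        i = k % m                        # i_k = (k mod m) + 1, zero-based here
        new_grad = grad_fs[i](x)         # one component gradient per iteration
        g += new_grad - stored[i]        # g^k = g^{k-1} + new - old
        stored[i] = new_grad
        x = x - alpha * g                # constant stepsize update
    return x
```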


IGM: Constant Stepsize

0. Choose x^0, x^{−1}, ... ∈ dom P and α ∈ (0, 1]. Initialize k = 0. Go to 1.

1. Choose H^k ≻ 0 and 0 ≤ τ_i^k ≤ k for i = 1, ..., m, and compute g^k, d^k, and x^{k+1} by

g^k = ∑_{i=1}^m ∇f_i(x^{τ_i^k}),

d^k = arg min_{d∈ℝ^n} { ⟨g^k, d⟩ + (1/2)⟨d, H^k d⟩ + P(x^k + d) },

x^{k+1} = x^k + α d^k.

Increment k by 1 and return to 1.

The method of Blatt et al. corresponds to the special case of P ≡ 0, H^k = I, K = m − 1, and

τ_i^k = k if i = (k mod m) + 1; τ_i^{k−1} otherwise,   1 ≤ i ≤ m, k ≥ m.


IGM: Constant Stepsize

Assumption 1

(a) τ_i^k ≥ k − K for all i and k, where K ≥ 0 is an integer.

(b) λI ⪯ H^k ⪯ λ̄I for all k, where 0 < λ ≤ λ̄.

Assumption 2

‖∇f_i(y) − ∇f_i(z)‖ ≤ L_i ‖y − z‖ ∀ y, z ∈ dom P,

for some L_i ≥ 0, i = 1, ..., m. Let L = ∑_{i=1}^m L_i.


IGM: Constant Stepsize

Global Convergence

Let {x^k}, {d^k}, {H^k} be sequences generated by Algorithm 1 under Assumptions 1 and 2, with α < 2λ/(L(2K + 1)). Then d^k → 0 and every cluster point of {x^k} is a stationary point.


IGM: Adaptive Stepsize

0. Choose x^0, x^{−1}, ... ∈ dom P, β ∈ (0, 1), σ > 1/2, and α ∈ (0, 1]. Initialize k = 0. Go to 1.

1. Choose H^k ≻ 0 and 0 ≤ τ_i^k ≤ k for i = 1, ..., m, and compute g^k, d^k by

g^k = ∑_{i=1}^m ∇f_i(x^{τ_i^k}),

d^k = arg min_{d∈ℝ^n} { ⟨g^k, d⟩ + (1/2)⟨d, H^k d⟩ + cP(x^k + d) }.

Choose α_k^init ∈ [α, 1] and let α_k be the largest element of {α_k^init β^j}_{j=0,1,...} satisfying the descent-like condition

F_c(x^k + α_k d^k) − F_c(x^k) ≤ −σKL‖α_k d^k‖^2 + (L/2) ∑_{j=(k−K)^+}^{k−1} ‖α_j d^j‖^2

and set x^{k+1} = x^k + α_k d^k. Increment k by 1 and return to 1.


IGM: Adaptive Stepsize

Global Convergence

Let {x^k}, {d^k}, {H^k}, {α_k} be sequences generated by Algorithm 2 under Assumptions 1 and 2. Then the following results hold.

(a) For each k ≥ 0, the descent-like condition holds whenever α_k ≤ ᾱ, where ᾱ = λ/(L(σK + K/2 + 1/2)).

(b) We have α_k ≥ min{α, βᾱ} for all k.

(c) d^k → 0 and every cluster point of {x^k} is a stationary point.


IGM: 1-memory with Adaptive Stepsize

0. Choose x^0 ∈ dom P. Initialize k = 0 and g^{−1} = 0. Go to 1.

1. Choose H^k ≻ 0 and compute g^k, d^k, and x^{k+1} by

g^k = (k/(k+1)) g^{k−1} + (m/(k+1)) ∇f_{i_k}(x^k), with i_k = (k mod m) + 1,

d^k = arg min_{d∈ℝ^n} { ⟨g^k, d⟩ + (1/2)⟨d, H^k d⟩ + cP(x^k + d) },

x^{k+1} = x^k + α_k d^k,

with α_k ∈ (0, 1]. Increment k by 1 and return to 1.


IGM: 1-memory with Adaptive Stepsize

Assumption 3

(a) ∑_{k=0}^∞ α_k = ∞.

(b) lim_{ℓ→∞} ∑_{j=0}^ℓ ((j+1)/(ℓ+1)) δ_j = 0, where δ_j := max_{i=0,1,...,m} ‖x^{k+i} − x^{k+m}‖ evaluated at k = jm − 1 (with x^{−1} = x^0).


IGM: 1-memory with Adaptive Stepsize

Global Convergence

Let {x^k}, {d^k}, {H^k}, {α_k} be sequences generated by Algorithm 3 under Assumptions 1(b), 2, and 3. Then the following results hold.

(a) ‖x^{k+1} − x^k‖ → 0 and ‖∇f(x^k) − g^k‖ → 0.

(b) lim inf_{k→∞} ‖d^k‖ = 0.

(c) If {x^k} is bounded, then there exists a cluster point of {x^k} that is a stationary point.



III. (Linearized) Alternating Direction Method of Multipliers


Total Variation Regularized Linear Least Squares Problem

Image restoration tasks such as image deconvolution, image inpainting, and image denoising are often formulated as an inverse problem

b = Au + η,

where

the unknown true image is u ∈ ℝ^n,

the observed image (or measurements) is b ∈ ℝ^ℓ,

η is Gaussian noise,

A ∈ ℝ^{ℓ×n} is a linear operator, typically a convolution operator in deconvolution, a projection in inpainting, and the identity in denoising.

The unknown image can be recovered by solving the TV regularized linear least squares problem (Primal):

min_{x∈ℝ^n} (1/2)‖Ax − b‖_2^2 + µ‖∇x‖,

where µ > 0 and ‖∇x‖ = ∑_{i=1}^n ‖(∇x)_i‖_2 with (∇x)_i ∈ ℝ^2.
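To make the TV term concrete, here is a minimal NumPy sketch that evaluates the discrete gradient (∇x)_i ∈ ℝ^2 of a 2-D image and the isotropic TV value ∑_i ‖(∇x)_i‖_2, assuming forward differences with replicated boundary values; the talk does not specify a discretization, so this choice is only illustrative.

```python
import numpy as np

def discrete_gradient(img):
    # forward differences with replicated boundary: one (gx, gy) pair per pixel
    gx = np.diff(img, axis=1, append=img[:, -1:])   # horizontal differences
    gy = np.diff(img, axis=0, append=img[-1:, :])   # vertical differences
    return gx, gy

def total_variation(img):
    # isotropic TV: sum over pixels of the 2-norm of the per-pixel gradient
    gx, gy = discrete_gradient(img)
    return np.sum(np.sqrt(gx ** 2 + gy ** 2))
```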


Gaussian Noise

Figure: denoising

(a) original: 512 × 512 (b) noisy image (c) recovered image


Gaussian Noise

Figure: deblurring

(a) original: 512 × 512 (b) motion blurred image (c) recovered image


Gaussian Noise

Figure: inpainting

(a) original: 512 × 512 (b) scratched image (c) recovered image


Algorithms

Saddle Point Formulation and Dual Problem

1. Dual Norm:

min_u max_{‖p‖_* ≤ 1} ⟨∇u, p⟩ + (1/2)‖Au − b‖_2^2.

2. Convex Conjugate (Legendre-Fenchel transform) of J(u) = ‖∇u‖:

min_u sup_p ⟨p, ∇u⟩ − J*(p) + (1/2)‖Au − b‖_2^2,

where J*(p) = sup_z ⟨p, z⟩ − J(z).

3. Lagrangian Function:

max_p inf_{u,z} L_1(u, p, z),

where
L_1(u, p, z) := (1/2)‖Au − b‖_2^2 + ‖z‖ + ⟨p, ∇u − z⟩


Algorithms

4. Dual Problem:

max_p { inf_u ⟨p, ∇u⟩ − J*(p) + (1/2)‖Au − b‖_2^2 } = max_p { −J*(p) − H*(div p) },

where H(u) = (1/2)‖Au − b‖_2^2.

5. Lagrangian Function with y = div p:

max_u inf_{p,y} L_2(u, p, y),

where
L_2(u, p, y) := J*(p) + H*(y) + ⟨u, div p − y⟩


ADMM

Alternating Direction Method of Multipliers [7]

Augmented Lagrangian Function:

L_α(u, p, z) := (1/2)‖Au − b‖_2^2 + ‖z‖ + ⟨p, ∇u − z⟩ + (α/2)‖∇u − z‖_2^2

u^{k+1} = arg min_u (1/2)‖Au − b‖_2^2 + ⟨p^k, ∇u − z^k⟩ + (α/2)‖∇u − z^k‖_2^2

z^{k+1} = arg min_z ‖z‖ + ⟨p^k, ∇u^{k+1} − z⟩ + (α/2)‖∇u^{k+1} − z‖_2^2

p^{k+1} = p^k + α(z^{k+1} − ∇u^{k+1})

ADMM on the primal (or 3) is equivalent to Douglas-Rachford [8] on the dual (4). ADMM on the dual (or 5) is equivalent to Douglas-Rachford on the primal.

[7] E. Esser, X. Zhang, and T. Chan, A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science, SIAM J. Imaging Sci., 2010.

[8] J. Douglas and H. H. Rachford, On the numerical solution of heat conduction problems in two or three space variables, Trans. Amer. Math. Soc., 1956.


ADMM

Alternating Minimization Algorithm & Forward-Backward Splitting Method

AMA [9] on the primal (or 3):

u^{k+1} = arg min_u (1/2)‖Au − b‖_2^2 + ⟨p^k, ∇u − z^k⟩

z^{k+1} = arg min_z ‖z‖ + ⟨p^k, ∇u^{k+1} − z⟩ + (α/2)‖∇u^{k+1} − z‖_2^2

p^{k+1} = p^k + α(z^{k+1} − ∇u^{k+1})

FBS [10] on the dual (4):

u^{k+1} = arg min_u (1/2)‖Au − b‖_2^2 + ⟨p^k, ∇u⟩

p^{k+1} = arg min_p J*(p) − ⟨p, ∇u^{k+1}⟩ + (1/(2α))‖p − p^k‖_2^2

FBS on the dual is equivalent to AMA on the primal (or 3).

[9] P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM J. Control Optim., 1991.

[10] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul., 2005.


Proximal Splitting Method

Linearly Constrained Separable Convex Programming Problem:

min_{x,y} { f(x) + g(y) | Qx = y }.

Augmented Lagrangian Function:

L_α(x, y, z) := f(x) + g(y) + ⟨z, y − Qx⟩ + (α/2)‖Qx − y‖_2^2.

x^{k+1} = arg min_x L_α(x, y^k, z^k)

y^{k+1} = arg min_y L_α(x^{k+1}, y, z^k)

z^{k+1} = z^k + α(Qx^{k+1} − y^{k+1})


Proximal Splitting Method

If f(x) = (1/2)‖Ax − b‖_2^2, then

x^{k+1} = (A^T A + αQ^T Q)^{−1}(A^T b + Q^T z^k + αQ^T y^k).

If f(x) = ⟨1, Ax⟩ − ⟨b, log(Ax)⟩: an inner solver or an inversion involving Q and/or A is needed.

If f(x) = ⟨x + b e^{−x}, 1⟩: an inner solver is needed.
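A minimal sketch of this splitting for the case f(x) = (1/2)‖Ax − b‖_2^2, written in the standard scaled-multiplier form of ADMM (algebraically equivalent to the updates above up to the sign and scaling convention of the multiplier), and with g = λ‖·‖_1 chosen purely for illustration so that the y-update is componentwise soft-thresholding; all names and parameter values are assumptions.

```python
import numpy as np

def admm_ls_l1(A, b, Q, lam, alpha=1.0, num_iters=200):
    # minimize 0.5*||Ax - b||_2^2 + lam*||y||_1  subject to  Qx = y
    y = np.zeros(Q.shape[0])
    u = np.zeros(Q.shape[0])                    # scaled multiplier u = z/alpha
    M = A.T @ A + alpha * Q.T @ Q               # system matrix of the x-update
    for _ in range(num_iters):
        # x-update: solve (A^T A + alpha Q^T Q) x = A^T b + alpha Q^T (y - u)
        x = np.linalg.solve(M, A.T @ b + alpha * Q.T @ (y - u))
        # y-update: prox of (lam/alpha)*||.||_1 at Qx + u (soft-thresholding)
        v = Q @ x + u
        y = np.sign(v) * np.maximum(np.abs(v) - lam / alpha, 0.0)
        # multiplier update
        u = u + Q @ x - y
    return x, y
```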


Poisson Noise

Figure: deblurring

(a) original (b) blurred image (c) recovered image


Multiplicative Noise


Copyright: Sandia Nat. Lab. (http://www.sandia.gov/RADAR/sar.html)


4M (2^22) pixels @ 20 sec.


Proximal Splitting Method

In order to avoid inner iterations or inversions involving the Laplacian operator required in algorithms based on the augmented Lagrangian:

Alternating minimization algorithm
x^{k+1} = arg min_x L_0(x, y^k, z^k)
y^{k+1} = arg min_y L_α(x^{k+1}, y, z^k)
z^{k+1} = z^k + α(Qx^{k+1} − y^{k+1})

Linearized augmented Lagrangian with proximal function
x^{k+1} = arg min_x LL_α(x, x^k, y^k, z^k)
y^{k+1} = arg min_y L_α(x^{k+1}, y, z^k)
z^{k+1} = z^k + α(Qx^{k+1} − y^{k+1})

LL_α(x, x^k, y^k, z^k) := ⟨∇_x f(x^k), x − x^k⟩ + g(y^k) + ⟨z^k, y^k − Qx⟩ + ⟨αQ^T(Qx^k − y^k), x − x^k⟩ + (1/(2δ))‖x − x^k‖_2^2
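Because LL_α is a simple quadratic in x, its minimizer is an explicit step that needs no matrix inversion; setting its gradient with respect to x to zero gives the update sketched below (a NumPy illustration under the reconstruction of LL_α above, with illustrative names).

```python
import numpy as np

def linearized_x_update(x, y, z, grad_f, Q, alpha, delta):
    # x^{k+1} = argmin_x LL_alpha(x, x^k, y^k, z^k)
    #         = x^k - delta*( grad f(x^k) - Q^T z^k + alpha*Q^T (Q x^k - y^k) )
    return x - delta * (grad_f(x) - Q.T @ z + alpha * Q.T @ (Q @ x - y))
```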


Thank you!