Accelerated primal-dual methods for linearly constrained convex problems
Yangyang Xu
SIAM Conference on Optimization
May 24, 2017
Accelerated proximal gradient
For the convex composite problem:
\[
\operatorname*{minimize}_x\; F(x) := f(x) + g(x)
\]
• $f$: convex and Lipschitz differentiable
• $g$: closed convex (possibly nondifferentiable) and simple
Proximal gradient:
\[
x^{k+1} = \operatorname*{arg\,min}_x\; \langle \nabla f(x^k), x\rangle + \frac{L_f}{2}\|x - x^k\|^2 + g(x)
\]
• convergence rate: $F(x^k) - F(x^*) = O(1/k)$
Accelerated proximal gradient [Beck-Teboulle '09, Nesterov '14]:
\[
x^{k+1} = \operatorname*{arg\,min}_x\; \langle \nabla f(\bar{x}^k), x\rangle + \frac{L_f}{2}\|x - \bar{x}^k\|^2 + g(x)
\]
• $\bar{x}^k$: extrapolated point
• convergence rate (with smart extrapolation): $F(x^k) - F(x^*) = O(1/k^2)$; a sketch follows below
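For concreteness, a minimal sketch of the accelerated update on the lasso instance $f(x) = \frac12\|Mx - y\|^2$, $g(x) = \tau\|x\|_1$ (an illustrative choice; $M$, $y$, $\tau$ are not from the slides), with the standard Beck-Teboulle extrapolation rule:

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t*||.||_1: componentwise shrinkage
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def accel_prox_grad(M, y, tau, num_iters=500):
    """Accelerated proximal gradient (FISTA-style) for the illustrative
    lasso instance min_x 0.5*||Mx - y||^2 + tau*||x||_1."""
    Lf = np.linalg.norm(M, 2) ** 2          # Lipschitz constant of grad f
    x = x_prev = np.zeros(M.shape[1])
    t = 1.0
    for _ in range(num_iters):
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2
        x_bar = x + ((t - 1) / t_next) * (x - x_prev)   # extrapolated point
        grad = M.T @ (M @ x_bar - y)                    # grad f at x_bar
        x_prev, x = x, soft_threshold(x_bar - grad / Lf, tau / Lf)
        t = t_next
    return x
```

Setting $\bar{x}^k = x^k$ (no extrapolation) recovers the plain proximal gradient method and its $O(1/k)$ rate.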
This talk: ways to accelerate primal-dual methods
Part I: accelerated linearized augmented Lagrangian
Affinely constrained composite convex problems
\[
\operatorname*{minimize}_x\; F(x) = f(x) + g(x), \quad \text{subject to } Ax = b \tag{LCP}
\]
• f : convex and Lipschitz differentiable
• g: closed convex and simple
Examples
• nonnegative quadratic programming: $f = \frac12 x^\top Q x + c^\top x$, $g = \iota_{\mathbb{R}^n_+}$
• TV image denoising: $\min\big\{\frac12\|X - B\|_F^2 + \lambda\|Y\|_1,\ \text{s.t. } D(X) = Y\big\}$
Augmented Lagrangian method (ALM)
At iteration $k$,
\[
\begin{aligned}
x^{k+1} &\leftarrow \operatorname*{arg\,min}_x\; f(x) + g(x) - \langle \lambda^k, Ax\rangle + \frac{\beta}{2}\|Ax - b\|^2,\\
\lambda^{k+1} &\leftarrow \lambda^k - \gamma(Ax^{k+1} - b)
\end{aligned}
\]
• augmented dual gradient ascent with stepsize γ
• β: penalty parameter; dual gradient Lipschitz constant 1/β
• 0 < γ < 2β: convergence guaranteed
• also popular for (nonlinear, nonconvex) constrained problems
The $x$-subproblem is as difficult as the original problem; the sketch below hands it to a generic inner solver.
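A minimal ALM sketch, assuming a smooth $f$ with $g \equiv 0$ so the $x$-subproblem can be passed to scipy's generic solver; this inner solve is exactly the cost the linearized variants on the next slide avoid:

```python
import numpy as np
from scipy.optimize import minimize

def alm(f, grad_f, A, b, beta=10.0, gamma=10.0, num_iters=100):
    """Augmented Lagrangian method for min f(x) s.t. Ax = b (g omitted
    for simplicity); each x-update needs a full inner minimization."""
    m, n = A.shape
    x, lam = np.zeros(n), np.zeros(m)
    for _ in range(num_iters):
        def aug_lagr(x):
            r = A @ x - b
            return f(x) - lam @ r + 0.5 * beta * (r @ r)
        def aug_lagr_grad(x):
            return grad_f(x) + A.T @ (beta * (A @ x - b) - lam)
        x = minimize(aug_lagr, x, jac=aug_lagr_grad, method="L-BFGS-B").x
        lam = lam - gamma * (A @ x - b)       # dual gradient ascent step
    return x, lam
```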
Linearized augmented Lagrangian method
• Linearize the smooth term $f$:
\[
x^{k+1} \leftarrow \operatorname*{arg\,min}_x\; \langle \nabla f(x^k), x\rangle + \frac{\eta}{2}\|x - x^k\|^2 + g(x) - \langle \lambda^k, Ax\rangle + \frac{\beta}{2}\|Ax - b\|^2.
\]
• Linearize both $f$ and $\|Ax - b\|^2$:
\[
x^{k+1} \leftarrow \operatorname*{arg\,min}_x\; \langle \nabla f(x^k), x\rangle + g(x) - \langle \lambda^k, Ax\rangle + \langle \beta A^\top r^k, x\rangle + \frac{\eta}{2}\|x - x^k\|^2,
\]
where $r^k = Ax^k - b$ is the residual.
Easier updates and a nice $O(1/k)$ convergence speed; a sketch follows below.
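With both terms linearized, the $x$-update collapses to a single prox step. A minimal sketch assuming $g = \tau\|\cdot\|_1$ (an illustrative choice) and the same dual update as in ALM:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def linearized_alm(grad_f, A, b, tau, eta, beta, gamma, n, num_iters=1000):
    """Fully linearized ALM: f and ||Ax - b||^2 are both linearized,
    so the x-update is one prox step of g = tau*||.||_1."""
    x, lam = np.zeros(n), np.zeros(A.shape[0])
    for _ in range(num_iters):
        r = A @ x - b                                  # residual r^k
        v = grad_f(x) - A.T @ lam + beta * (A.T @ r)   # linearized terms
        x = soft_threshold(x - v / eta, tau / eta)     # prox_{g/eta} step
        lam = lam - gamma * (A @ x - b)                # dual update as in ALM
    return x, lam
```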
Accelerated linearized augmented Lagrangian method
At iteration $k$,
\[
\begin{aligned}
\bar{x}^k &\leftarrow (1-\alpha_k)x^k + \alpha_k \hat{x}^k,\\
\hat{x}^{k+1} &\leftarrow \operatorname*{arg\,min}_x\; \langle \nabla f(\bar{x}^k) - A^\top\lambda^k, x\rangle + g(x) + \frac{\beta_k}{2}\|Ax - b\|^2 + \frac{\eta_k}{2}\|x - \hat{x}^k\|^2,\\
x^{k+1} &\leftarrow (1-\alpha_k)x^k + \alpha_k \hat{x}^{k+1},\\
\lambda^{k+1} &\leftarrow \lambda^k - \gamma_k(A\hat{x}^{k+1} - b).
\end{aligned}
\]
• Inspired by [Lan '12] on accelerated stochastic approximation
• reduces to the linearized ALM if $\alpha_k = 1$, $\beta_k = \beta$, $\eta_k = \eta$, $\gamma_k = \gamma$, $\forall k$
• convergence rate: $O(1/k)$ if $\eta \ge L_f$ and $0 < \gamma < 2\beta$
• adaptive parameters give $O(1/k^2)$ (next slides); a sketch follows below
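A minimal sketch with the adaptive parameters of the convergence theorem ($\alpha_k = 2/(k+1)$, $\gamma_k = k\gamma$, $\beta_k = \gamma_k/2$, $\eta_k = \eta/k$), specialized to $g \equiv 0$ and quadratic $f(x) = \frac12 x^\top Qx + c^\top x$ so that the $x$-subproblem is a linear solve; the instance is illustrative, in the spirit of the QP experiment below:

```python
import numpy as np

def accel_linearized_alm(Q, c, A, b, gamma=0.1, num_iters=1000):
    """Accelerated linearized ALM for min 0.5*x'Qx + c'x s.t. Ax = b
    (g = 0), using alpha_k = 2/(k+1), gamma_k = k*gamma,
    beta_k = gamma_k/2, eta_k = eta/k from the theorem."""
    n, m = Q.shape[0], A.shape[0]
    eta = 2 * np.linalg.norm(Q, 2)             # eta >= 2*L_f
    x = x_hat = np.zeros(n)
    lam = np.zeros(m)
    for k in range(1, num_iters + 1):
        alpha, gamma_k = 2.0 / (k + 1), k * gamma
        beta_k, eta_k = gamma_k / 2, eta / k
        x_bar = (1 - alpha) * x + alpha * x_hat
        grad = Q @ x_bar + c                   # grad f at extrapolated point
        # x-subproblem: (eta_k*I + beta_k*A'A) x = eta_k*x_hat - grad + A'(lam + beta_k*b)
        H = eta_k * np.eye(n) + beta_k * (A.T @ A)
        rhs = eta_k * x_hat - grad + A.T @ lam + beta_k * (A.T @ b)
        x_hat = np.linalg.solve(H, rhs)
        x = (1 - alpha) * x + alpha * x_hat
        lam = lam - gamma_k * (A @ x_hat - b)  # dual update
    return x
```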
Better numerical performance
[Figure: two semilog plots over iteration numbers 0–1000 comparing Nonaccelerated ALM and Accelerated ALM. Left: objective error, |objective minus optimal value|. Right: feasibility violation.]
• Tested on quadratic programming (subproblems solved exactly)
• Parameters set according to theorem (see next slide)
• Accelerated ALM significantly better
Guaranteed fast convergence
Assumptions:
• There is a primal-dual solution pair $(x^*, \lambda^*)$.
• $\nabla f$ is Lipschitz continuous: $\|\nabla f(x) - \nabla f(y)\| \le L_f\|x - y\|$.

Convergence rate of order $O(1/k^2)$:
• Set parameters to
\[
\forall k:\; \alpha_k = \frac{2}{k+1}, \quad \gamma_k = k\gamma, \quad \beta_k \ge \frac{\gamma_k}{2}, \quad \eta_k = \frac{\eta}{k},
\]
where $\gamma > 0$ and $\eta \ge 2L_f$. Then
\[
|F(x^{k+1}) - F(x^*)| \le \frac{1}{k(k+1)}\Big(\eta\|x^1 - x^*\|^2 + \frac{4\|\lambda^*\|^2}{\gamma}\Big),
\]
\[
\|Ax^{k+1} - b\| \le \frac{1}{k(k+1)\max(1, \|\lambda^*\|)}\Big(\eta\|x^1 - x^*\|^2 + \frac{4\|\lambda^*\|^2}{\gamma}\Big).
\]
Sketch of proof
Let $\Phi(\hat{x}, x, \lambda) = F(\hat{x}) - F(x) - \langle \lambda, A\hat{x} - b\rangle$.

1. Fundamental inequality (for any $\lambda$):
\[
\begin{aligned}
&\Phi(x^{k+1}, x^*, \lambda) - (1-\alpha_k)\Phi(x^k, x^*, \lambda)\\
&\le -\frac{\alpha_k\eta_k}{2}\big[\|\hat{x}^{k+1} - x^*\|^2 - \|\hat{x}^k - x^*\|^2 + \|\hat{x}^{k+1} - \hat{x}^k\|^2\big] + \frac{\alpha_k^2 L_f}{2}\|\hat{x}^{k+1} - \hat{x}^k\|^2\\
&\quad + \frac{\alpha_k}{2\gamma_k}\big[\|\lambda^k - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2 + \|\lambda^{k+1} - \lambda^k\|^2\big] - \frac{\alpha_k\beta_k}{\gamma_k^2}\|\lambda^{k+1} - \lambda^k\|^2.
\end{aligned}
\]

2. Set $\alpha_k = \frac{2}{k+1}$, $\gamma_k = k\gamma$, $\beta_k \ge \frac{\gamma_k}{2}$, $\eta_k = \frac{\eta}{k}$, and multiply the above inequality by $k(k+1)$:
\[
\begin{aligned}
&k(k+1)\Phi(x^{k+1}, x^*, \lambda) - k(k-1)\Phi(x^k, x^*, \lambda)\\
&\le -\eta\big[\|\hat{x}^{k+1} - x^*\|^2 - \|\hat{x}^k - x^*\|^2\big] + \frac{1}{\gamma}\big[\|\lambda^k - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2\big].
\end{aligned}
\]

3. Set $\lambda^1 = 0$ and sum the above inequality over $k$ (both sides telescope):
\[
\Phi(x^{k+1}, x^*, \lambda) \le \frac{1}{k(k+1)}\Big(\eta\|x^1 - x^*\|^2 + \frac{1}{\gamma}\|\lambda\|^2\Big).
\]

4. Take $\lambda = \max\big(1 + \|\lambda^*\|,\, 2\|\lambda^*\|\big)\frac{Ax^{k+1} - b}{\|Ax^{k+1} - b\|}$ and use the optimality condition $\Phi(x, x^*, \lambda^*) \ge 0$, which gives $F(x^{k+1}) - F(x^*) \ge -\|\lambda^*\| \cdot \|Ax^{k+1} - b\|$.
Literature
• [He-Yuan '10]: accelerated ALM to $O(1/k^2)$ for smooth problems
• [Kang et al. '13]: accelerated ALM to $O(1/k^2)$ for nonsmooth problems
• [Huang-Ma-Goldfarb '13]: accelerated linearized ALM (with linearization of the augmented term) to $O(1/k^2)$ for strongly convex problems
Part II: accelerated linearized ADMM
Two-block structured problems
The variable is partitioned into two blocks; the smooth part involves one block, and the nonsmooth part is separable:
\[
\operatorname*{minimize}_{y,z}\; h(y) + f(z) + g(z), \quad \text{subject to } By + Cz = b \tag{LCP-2}
\]
• f convex and Lipschitz differentiable
• g and h closed convex and simple
Examples:
• Total-variation regularized regression: $\min_{y,z}\big\{\lambda\|y\|_1 + f(z),\ \text{s.t. } Dz = y\big\}$
Alternating direction method of multipliers (ADMM)
At iteration $k$,
\[
\begin{aligned}
y^{k+1} &\leftarrow \operatorname*{arg\,min}_y\; h(y) - \langle \lambda^k, By\rangle + \frac{\beta}{2}\|By + Cz^k - b\|^2,\\
z^{k+1} &\leftarrow \operatorname*{arg\,min}_z\; f(z) + g(z) - \langle \lambda^k, Cz\rangle + \frac{\beta}{2}\|By^{k+1} + Cz - b\|^2,\\
\lambda^{k+1} &\leftarrow \lambda^k - \gamma(By^{k+1} + Cz^{k+1} - b)
\end{aligned}
\]
• $0 < \gamma < \frac{1+\sqrt{5}}{2}\beta$: convergence guaranteed [Glowinski-Marrocco '75]
• updating $y$ and $z$ alternately is easier than updating them jointly
• but the $z$-subproblem can still be difficult; see the sketch below
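A minimal ADMM sketch on the TV-regularized regression example, with the illustrative choice $f(z) = \frac12\|Mz - w\|^2$ and $g \equiv 0$; the constraint $Dz = y$ corresponds to $B = -I$, $C = D$, $b = 0$ in (LCP-2), so the $y$-update is soft-thresholding and the $z$-update is a linear solve:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_tv_regression(M, w, D, reg, beta=1.0, gamma=1.0, num_iters=500):
    """ADMM for min_{y,z} reg*||y||_1 + 0.5*||Mz - w||^2 s.t. Dz = y,
    i.e. B = -I, C = D, b = 0 in (LCP-2). Illustrative instance."""
    n = M.shape[1]
    z = np.zeros(n)
    lam = np.zeros(D.shape[0])
    H = M.T @ M + beta * (D.T @ D)     # z-update system matrix (fixed)
    for _ in range(num_iters):
        # y-update: prox of reg*||.||_1 around D z^k - lam/beta
        y = soft_threshold(D @ z - lam / beta, reg / beta)
        # z-update: smooth quadratic subproblem, solved exactly
        z = np.linalg.solve(H, M.T @ w + D.T @ lam + beta * (D.T @ y))
        # dual update on the residual Dz - y
        lam = lam - gamma * (D @ z - y)
    return y, z
```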
Accelerated linearized ADMM
At iteration $k$,
\[
\begin{aligned}
y^{k+1} &\leftarrow \operatorname*{arg\,min}_y\; h(y) - \langle \lambda^k, By\rangle + \frac{\beta_k}{2}\|By + Cz^k - b\|^2,\\
z^{k+1} &\leftarrow \operatorname*{arg\,min}_z\; \langle \nabla f(z^k) - C^\top\lambda^k + \beta_k C^\top r^{k+\frac12}, z\rangle + g(z) + \frac{\eta_k}{2}\|z - z^k\|^2,\\
\lambda^{k+1} &\leftarrow \lambda^k - \gamma_k(By^{k+1} + Cz^{k+1} - b),
\end{aligned}
\]
where $r^{k+\frac12} = By^{k+1} + Cz^k - b$.
• reduces to the linearized ADMM if $\beta_k = \beta$, $\eta_k = \eta$, $\gamma_k = \gamma$, $\forall k$
• convergence rate: $O(1/k)$ if $0 < \gamma \le \beta$ and $\eta \ge L_f + \beta\|C\|^2$
• $O(1/k^2)$ with adaptive parameters and strong convexity on $z$ (next two slides); a sketch follows below
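A minimal sketch on the same illustrative TV-regression instance ($h = \text{reg}\cdot\|\cdot\|_1$, $f(z) = \frac12\|Mz - w\|^2$, $g \equiv 0$; $f$ is strongly convex when $M^\top M \succ 0$), with the adaptive parameters $\beta_k = \gamma_k = (k+1)\gamma$ and $\eta_k = (k+1)\eta + L_f$ from the next slide; the $z$-update is now a single explicit step:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def accel_lin_admm_tv(M, w, D, reg, gamma, eta, num_iters=500):
    """Accelerated linearized ADMM for min reg*||y||_1 + 0.5*||Mz - w||^2
    s.t. Dz = y (B = -I, C = D, b = 0), with beta_k = gamma_k = (k+1)*gamma
    and eta_k = (k+1)*eta + L_f; the theorem asks for gamma < eta <= mu_f/2."""
    n = M.shape[1]
    Lf = np.linalg.norm(M, 2) ** 2             # Lipschitz constant of grad f
    z, lam = np.zeros(n), np.zeros(D.shape[0])
    for k in range(1, num_iters + 1):
        beta_k = gamma_k = (k + 1) * gamma
        eta_k = (k + 1) * eta + Lf
        # y-update: exact prox, as in plain ADMM
        y = soft_threshold(D @ z - lam / beta_k, reg / beta_k)
        # z-update: f and the augmented term both linearized at z^k
        r_half = D @ z - y                     # r^{k+1/2} = By^{k+1} + Cz^k - b
        v = M.T @ (M @ z - w) - D.T @ lam + beta_k * (D.T @ r_half)
        z = z - v / eta_k
        lam = lam - gamma_k * (D @ z - y)      # dual update
    return y, z
```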
Accelerated convergence speed
Assumptions:
• Existence of a primal-dual solution $(y^*, z^*, \lambda^*)$
• $\nabla f$ Lipschitz continuous: $\|\nabla f(z) - \nabla f(\tilde{z})\| \le L_f\|z - \tilde{z}\|$
• $f$ strongly convex with modulus $\mu_f$ (not required for $y$)
Convergence rate of order $O(1/k^2)$:
• Set parameters as follows (with $\gamma > 0$ and $\gamma < \eta \le \mu_f/2$):
\[
\forall k:\; \beta_k = \gamma_k = (k+1)\gamma, \qquad \eta_k = (k+1)\eta + L_f.
\]
Then
\[
\max\big(\|z^k - z^*\|^2,\; |F(y^k, z^k) - F^*|,\; \|By^k + Cz^k - b\|\big) \le O(1/k^2),
\]
where $F(y, z) = h(y) + f(z) + g(z)$ and $F^* = F(y^*, z^*)$.
Sketch of proof
1. Fundamental inequality from the optimality conditions of each iterate:
\[
\begin{aligned}
&F(y^{k+1}, z^{k+1}) - F(y, z) - \langle \lambda, By^{k+1} + Cz^{k+1} - b\rangle\\
&\le -\Big\langle \tfrac{1}{\gamma_k}(\lambda^k - \lambda^{k+1}),\; \lambda - \lambda^k + \tfrac{\beta_k}{\gamma_k}(\lambda^k - \lambda^{k+1}) - \beta_k C(z^{k+1} - z^k)\Big\rangle\\
&\quad + \frac{L_f}{2}\|z^{k+1} - z^k\|^2 - \frac{\mu_f}{2}\|z^k - z\|^2 - \eta_k\langle z^{k+1} - z, z^{k+1} - z^k\rangle.
\end{aligned}
\]

2. Plug in the parameters and bound the cross terms:
\[
\begin{aligned}
&F(y^{k+1}, z^{k+1}) - F(y^*, z^*) - \langle \lambda, By^{k+1} + Cz^{k+1} - b\rangle\\
&\quad + \frac12\big(\eta(k+1) + L_f\big)\|z^{k+1} - z^*\|^2 + \frac{1}{2\gamma(k+1)}\|\lambda - \lambda^{k+1}\|^2\\
&\le \frac12\big(\eta(k+1) + L_f - \mu_f\big)\|z^k - z^*\|^2 + \frac{1}{2\gamma(k+1)}\|\lambda - \lambda^k\|^2.
\end{aligned}
\]

3. Multiply by $k + k_0$ (here $k_0 \sim \frac{2L_f}{\mu_f}$) and sum the inequality over $k$:
\[
F(y^{k+1}, z^{k+1}) - F(y^*, z^*) - \langle \lambda, By^{k+1} + Cz^{k+1} - b\rangle \le \frac{\phi(y^*, z^*, \lambda)}{k^2}
\]

4. Take a special $\lambda$ and use the KKT conditions.
Literature
• [Ouyang et al. '15]: $O(L_f/k^2 + C_0/k)$ with only weak convexity
• [Goldstein et al. '14]: $O(1/k^2)$ with strong convexity on both $y$ and $z$
• [Chambolle-Pock '11, Chambolle-Pock '16, Dang-Lan '14, Bredies-Sun '16]: accelerated first-order methods on bilinear saddle-point problems

Open question: the weakest conditions under which $O(1/k^2)$ holds
Numerical experiments (more results in the paper)
Accelerated (linearized) ADMM
Tested problem: total-variation regularized image denoising
\[
\operatorname*{minimize}_{X,Y}\; \frac12\|X - B\|_F^2 + \mu\|Y\|_1, \quad \text{subject to } DX = Y. \tag{TVDN}
\]
• $B$ is the observed noisy Cameraman image, and $D$ is the finite difference operator; a sketch of $D$ follows below
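For concreteness, one common discretization of the finite difference operator $D$ (the slides do not pin down the exact stencil, so this choice is an assumption):

```python
import numpy as np

def finite_differences(X):
    """Forward finite-difference operator D applied to an image X:
    stacked horizontal and vertical differences with replicated
    boundary (one common discretization; an assumed choice)."""
    dx = np.diff(X, axis=1, append=X[:, -1:])   # horizontal differences
    dy = np.diff(X, axis=0, append=X[-1:, :])   # vertical differences
    return np.stack([dx, dy])

def tv_term(X, mu):
    # the mu*||Y||_1 term of (TVDN) evaluated at Y = D(X)
    return mu * np.abs(finite_differences(X)).sum()
```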
Compared methods:
• original ADMM
• accelerated ADMM
• linearized ADMM
• accelerated linearized ADMM
• accelerated Chambolle-Pock
Performance of compared methods
[Figure: |objective minus optimal value| on a log scale for Accelerated ADMM, Accelerated Linearized ADMM, Nonaccelerated ADMM, Nonaccelerated Linearized ADMM, and Chambolle-Pock; left: versus iteration numbers (0–500), right: versus running time (0–50 sec.).]
• Accelerated (linearized) ADMM is significantly better than its nonaccelerated counterpart
• (accelerated) ADMM beats (accelerated) linearized ADMM in iteration count (but the latter takes less running time)
Conclusions
• accelerated the linearized ALM from $O(1/k)$ to $O(1/k^2)$ with mere convexity
• accelerated the (linearized) ADMM from $O(1/k)$ to $O(1/k^2)$ with strong convexity on one block variable
• numerical experiments confirm the acceleration
References
1. Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization, 2017.
2. T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 2014.
3. B. He and X. Yuan. On the acceleration of augmented Lagrangian method for linearly constrained optimization. Optimization Online, 2010.
4. B. Huang, S. Ma, and D. Goldfarb. Accelerated linearized Bregman method. Journal of Scientific Computing, 2013.
5. M. Kang, S. Yun, H. Woo, and M. Kang. Accelerated Bregman method for linearly constrained $\ell_1$-$\ell_2$ minimization. Journal of Scientific Computing, 2013.