Accelerated primal-dual methods for linearly constrained convex problems
Yangyang Xu
SIAM Conference on Optimization
May 24, 2017
Accelerated proximal gradient
For the convex composite problem:
\[
\operatorname*{minimize}_x\; F(x) := f(x) + g(x)
\]
• $f$: convex and Lipschitz differentiable
• $g$: closed convex (possibly nondifferentiable) and simple
Proximal gradient:
\[
x^{k+1} = \operatorname*{arg\,min}_x\; \langle \nabla f(x^k), x\rangle + \frac{L_f}{2}\|x - x^k\|^2 + g(x)
\]
• convergence rate: $F(x^k) - F(x^*) = O(1/k)$
Accelerated proximal gradient [Beck-Teboulle '09, Nesterov '14]:
\[
x^{k+1} = \operatorname*{arg\,min}_x\; \langle \nabla f(\bar{x}^k), x\rangle + \frac{L_f}{2}\|x - \bar{x}^k\|^2 + g(x)
\]
• $\bar{x}^k$: extrapolated point
• convergence rate (with smart extrapolation): $F(x^k) - F(x^*) = O(1/k^2)$; a sketch follows below
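For concreteness, a minimal sketch of the accelerated update on the lasso instance $f(x) = \frac12\|Mx - y\|^2$, $g(x) = \tau\|x\|_1$ (an illustrative choice; $M$, $y$, $\tau$ are not from the slides), with the standard Beck-Teboulle extrapolation rule:

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t*||.||_1: componentwise shrinkage
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def accel_prox_grad(M, y, tau, num_iters=500):
    """Accelerated proximal gradient (FISTA-style) for the illustrative
    lasso instance min_x 0.5*||Mx - y||^2 + tau*||x||_1."""
    Lf = np.linalg.norm(M, 2) ** 2          # Lipschitz constant of grad f
    x = x_prev = np.zeros(M.shape[1])
    t = 1.0
    for _ in range(num_iters):
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2
        x_bar = x + ((t - 1) / t_next) * (x - x_prev)   # extrapolated point
        grad = M.T @ (M @ x_bar - y)                    # grad f at x_bar
        x_prev, x = x, soft_threshold(x_bar - grad / Lf, tau / Lf)
        t = t_next
    return x
```

Setting $\bar{x}^k = x^k$ (no extrapolation) recovers the plain proximal gradient method and its $O(1/k)$ rate.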
This talk: ways to accelerate primal-dual methods
Part I: accelerated linearized augmented Lagrangian
Affinely constrained composite convex problems
\[
\operatorname*{minimize}_x\; F(x) = f(x) + g(x), \quad \text{subject to } Ax = b \tag{LCP}
\]
• f : convex and Lipschitz differentiable
• g: closed convex and simple
Examples
• nonnegative quadratic programming: $f = \frac12 x^\top Q x + c^\top x$, $g = \iota_{\mathbb{R}^n_+}$
• TV image denoising: $\min\big\{\frac12\|X - B\|_F^2 + \lambda\|Y\|_1,\ \text{s.t. } D(X) = Y\big\}$
Augmented Lagrangian method (ALM)
At iteration $k$,
\[
\begin{aligned}
x^{k+1} &\leftarrow \operatorname*{arg\,min}_x\; f(x) + g(x) - \langle \lambda^k, Ax\rangle + \frac{\beta}{2}\|Ax - b\|^2,\\
\lambda^{k+1} &\leftarrow \lambda^k - \gamma(Ax^{k+1} - b)
\end{aligned}
\]
• augmented dual gradient ascent with stepsize γ
• β: penalty parameter; dual gradient Lipschitz constant 1/β
• 0 < γ < 2β: convergence guaranteed
• also popular for (nonlinear, nonconvex) constrained problems
The $x$-subproblem is as difficult as the original problem; the sketch below hands it to a generic inner solver.
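A minimal ALM sketch, assuming a smooth $f$ with $g \equiv 0$ so the $x$-subproblem can be passed to scipy's generic solver; this inner solve is exactly the cost the linearized variants on the next slide avoid:

```python
import numpy as np
from scipy.optimize import minimize

def alm(f, grad_f, A, b, beta=10.0, gamma=10.0, num_iters=100):
    """Augmented Lagrangian method for min f(x) s.t. Ax = b (g omitted
    for simplicity); each x-update needs a full inner minimization."""
    m, n = A.shape
    x, lam = np.zeros(n), np.zeros(m)
    for _ in range(num_iters):
        def aug_lagr(x):
            r = A @ x - b
            return f(x) - lam @ r + 0.5 * beta * (r @ r)
        def aug_lagr_grad(x):
            return grad_f(x) + A.T @ (beta * (A @ x - b) - lam)
        x = minimize(aug_lagr, x, jac=aug_lagr_grad, method="L-BFGS-B").x
        lam = lam - gamma * (A @ x - b)       # dual gradient ascent step
    return x, lam
```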
Linearized augmented Lagrangian method
• Linearize the smooth term $f$:
\[
x^{k+1} \leftarrow \operatorname*{arg\,min}_x\; \langle \nabla f(x^k), x\rangle + \frac{\eta}{2}\|x - x^k\|^2 + g(x) - \langle \lambda^k, Ax\rangle + \frac{\beta}{2}\|Ax - b\|^2.
\]
• Linearize both $f$ and $\|Ax - b\|^2$:
\[
x^{k+1} \leftarrow \operatorname*{arg\,min}_x\; \langle \nabla f(x^k), x\rangle + g(x) - \langle \lambda^k, Ax\rangle + \langle \beta A^\top r^k, x\rangle + \frac{\eta}{2}\|x - x^k\|^2,
\]
where $r^k = Ax^k - b$ is the residual.
Easier updates and a nice $O(1/k)$ convergence speed; a sketch follows below.
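With both terms linearized, the $x$-update collapses to a single prox step. A minimal sketch assuming $g = \tau\|\cdot\|_1$ (an illustrative choice) and the same dual update as in ALM:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def linearized_alm(grad_f, A, b, tau, eta, beta, gamma, n, num_iters=1000):
    """Fully linearized ALM: f and ||Ax - b||^2 are both linearized,
    so the x-update is one prox step of g = tau*||.||_1."""
    x, lam = np.zeros(n), np.zeros(A.shape[0])
    for _ in range(num_iters):
        r = A @ x - b                                  # residual r^k
        v = grad_f(x) - A.T @ lam + beta * (A.T @ r)   # linearized terms
        x = soft_threshold(x - v / eta, tau / eta)     # prox_{g/eta} step
        lam = lam - gamma * (A @ x - b)                # dual update as in ALM
    return x, lam
```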
Accelerated linearized augmented Lagrangian method
At iteration $k$,
\[
\begin{aligned}
\bar{x}^k &\leftarrow (1-\alpha_k)x^k + \alpha_k \hat{x}^k,\\
\hat{x}^{k+1} &\leftarrow \operatorname*{arg\,min}_x\; \langle \nabla f(\bar{x}^k) - A^\top\lambda^k, x\rangle + g(x) + \frac{\beta_k}{2}\|Ax - b\|^2 + \frac{\eta_k}{2}\|x - \hat{x}^k\|^2,\\
x^{k+1} &\leftarrow (1-\alpha_k)x^k + \alpha_k \hat{x}^{k+1},\\
\lambda^{k+1} &\leftarrow \lambda^k - \gamma_k(A\hat{x}^{k+1} - b).
\end{aligned}
\]
• Inspired by [Lan '12] on accelerated stochastic approximation
• reduces to the linearized ALM if $\alpha_k = 1$, $\beta_k = \beta$, $\eta_k = \eta$, $\gamma_k = \gamma$, $\forall k$
• convergence rate: $O(1/k)$ if $\eta \ge L_f$ and $0 < \gamma < 2\beta$
• adaptive parameters give $O(1/k^2)$ (next slides); a sketch follows below
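A minimal sketch with the adaptive parameters of the convergence theorem ($\alpha_k = 2/(k+1)$, $\gamma_k = k\gamma$, $\beta_k = \gamma_k/2$, $\eta_k = \eta/k$), specialized to $g \equiv 0$ and quadratic $f(x) = \frac12 x^\top Qx + c^\top x$ so that the $x$-subproblem is a linear solve; the instance is illustrative, in the spirit of the QP experiment below:

```python
import numpy as np

def accel_linearized_alm(Q, c, A, b, gamma=0.1, num_iters=1000):
    """Accelerated linearized ALM for min 0.5*x'Qx + c'x s.t. Ax = b
    (g = 0), using alpha_k = 2/(k+1), gamma_k = k*gamma,
    beta_k = gamma_k/2, eta_k = eta/k from the theorem."""
    n, m = Q.shape[0], A.shape[0]
    eta = 2 * np.linalg.norm(Q, 2)             # eta >= 2*L_f
    x = x_hat = np.zeros(n)
    lam = np.zeros(m)
    for k in range(1, num_iters + 1):
        alpha, gamma_k = 2.0 / (k + 1), k * gamma
        beta_k, eta_k = gamma_k / 2, eta / k
        x_bar = (1 - alpha) * x + alpha * x_hat
        grad = Q @ x_bar + c                   # grad f at extrapolated point
        # x-subproblem: (eta_k*I + beta_k*A'A) x = eta_k*x_hat - grad + A'(lam + beta_k*b)
        H = eta_k * np.eye(n) + beta_k * (A.T @ A)
        rhs = eta_k * x_hat - grad + A.T @ lam + beta_k * (A.T @ b)
        x_hat = np.linalg.solve(H, rhs)
        x = (1 - alpha) * x + alpha * x_hat
        lam = lam - gamma_k * (A @ x_hat - b)  # dual update
    return x
```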
Better numerical performance
[Figure: two semilog plots over iteration numbers 0–1000 comparing Nonaccelerated ALM and Accelerated ALM. Left: objective error, |objective minus optimal value|. Right: feasibility violation.]
• Tested on quadratic programming (subproblems solved exactly)
• Parameters set according to theorem (see next slide)
• Accelerated ALM significantly better
Guaranteed fast convergence
Assumptions:
• There is a primal-dual solution pair $(x^*, \lambda^*)$.
• $\nabla f$ is Lipschitz continuous: $\|\nabla f(x) - \nabla f(y)\| \le L_f\|x - y\|$.

Convergence rate of order $O(1/k^2)$:
• Set parameters to
\[
\forall k:\; \alpha_k = \frac{2}{k+1}, \quad \gamma_k = k\gamma, \quad \beta_k \ge \frac{\gamma_k}{2}, \quad \eta_k = \frac{\eta}{k},
\]
where $\gamma > 0$ and $\eta \ge 2L_f$. Then
\[
|F(x^{k+1}) - F(x^*)| \le \frac{1}{k(k+1)}\Big(\eta\|x^1 - x^*\|^2 + \frac{4\|\lambda^*\|^2}{\gamma}\Big),
\]
\[
\|Ax^{k+1} - b\| \le \frac{1}{k(k+1)\max(1, \|\lambda^*\|)}\Big(\eta\|x^1 - x^*\|^2 + \frac{4\|\lambda^*\|^2}{\gamma}\Big).
\]
Sketch of proof
Let $\Phi(\hat{x}, x, \lambda) = F(\hat{x}) - F(x) - \langle \lambda, A\hat{x} - b\rangle$.

1. Fundamental inequality (for any $\lambda$):
\[
\begin{aligned}
&\Phi(x^{k+1}, x^*, \lambda) - (1-\alpha_k)\Phi(x^k, x^*, \lambda)\\
&\le -\frac{\alpha_k\eta_k}{2}\big[\|\hat{x}^{k+1} - x^*\|^2 - \|\hat{x}^k - x^*\|^2 + \|\hat{x}^{k+1} - \hat{x}^k\|^2\big] + \frac{\alpha_k^2 L_f}{2}\|\hat{x}^{k+1} - \hat{x}^k\|^2\\
&\quad + \frac{\alpha_k}{2\gamma_k}\big[\|\lambda^k - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2 + \|\lambda^{k+1} - \lambda^k\|^2\big] - \frac{\alpha_k\beta_k}{\gamma_k^2}\|\lambda^{k+1} - \lambda^k\|^2.
\end{aligned}
\]

2. Set $\alpha_k = \frac{2}{k+1}$, $\gamma_k = k\gamma$, $\beta_k \ge \frac{\gamma_k}{2}$, $\eta_k = \frac{\eta}{k}$, and multiply the above inequality by $k(k+1)$:
\[
\begin{aligned}
&k(k+1)\Phi(x^{k+1}, x^*, \lambda) - k(k-1)\Phi(x^k, x^*, \lambda)\\
&\le -\eta\big[\|\hat{x}^{k+1} - x^*\|^2 - \|\hat{x}^k - x^*\|^2\big] + \frac{1}{\gamma}\big[\|\lambda^k - \lambda\|^2 - \|\lambda^{k+1} - \lambda\|^2\big].
\end{aligned}
\]

3. Set $\lambda^1 = 0$ and sum the above inequality over $k$ (both sides telescope):
\[
\Phi(x^{k+1}, x^*, \lambda) \le \frac{1}{k(k+1)}\Big(\eta\|x^1 - x^*\|^2 + \frac{1}{\gamma}\|\lambda\|^2\Big).
\]

4. Take $\lambda = \max\big(1 + \|\lambda^*\|,\, 2\|\lambda^*\|\big)\frac{Ax^{k+1} - b}{\|Ax^{k+1} - b\|}$ and use the optimality condition $\Phi(x, x^*, \lambda^*) \ge 0$, which gives $F(x^{k+1}) - F(x^*) \ge -\|\lambda^*\| \cdot \|Ax^{k+1} - b\|$.
Literature
• [He-Yuan '10]: accelerated ALM to $O(1/k^2)$ for smooth problems
• [Kang et al. '13]: accelerated ALM to $O(1/k^2)$ for nonsmooth problems
• [Huang-Ma-Goldfarb '13]: accelerated linearized ALM (with linearization of the augmented term) to $O(1/k^2)$ for strongly convex problems
Part II: accelerated linearized ADMM
Two-block structured problems
The variable is partitioned into two blocks; the smooth part involves one block, and the nonsmooth part is separable:
\[
\operatorname*{minimize}_{y,z}\; h(y) + f(z) + g(z), \quad \text{subject to } By + Cz = b \tag{LCP-2}
\]
• f convex and Lipschitz differentiable
• g and h closed convex and simple
Examples:
• Total-variation regularized regression: $\min_{y,z}\big\{\lambda\|y\|_1 + f(z),\ \text{s.t. } Dz = y\big\}$
Alternating direction method of multipliers (ADMM)
At iteration $k$,
\[
\begin{aligned}
y^{k+1} &\leftarrow \operatorname*{arg\,min}_y\; h(y) - \langle \lambda^k, By\rangle + \frac{\beta}{2}\|By + Cz^k - b\|^2,\\
z^{k+1} &\leftarrow \operatorname*{arg\,min}_z\; f(z) + g(z) - \langle \lambda^k, Cz\rangle + \frac{\beta}{2}\|By^{k+1} + Cz - b\|^2,\\
\lambda^{k+1} &\leftarrow \lambda^k - \gamma(By^{k+1} + Cz^{k+1} - b)
\end{aligned}
\]
• $0 < \gamma < \frac{1+\sqrt{5}}{2}\beta$: convergence guaranteed [Glowinski-Marrocco '75]
• updating $y$ and $z$ alternately is easier than updating them jointly
• but the $z$-subproblem can still be difficult; see the sketch below
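A minimal ADMM sketch on the TV-regularized regression example, with the illustrative choice $f(z) = \frac12\|Mz - w\|^2$ and $g \equiv 0$; the constraint $Dz = y$ corresponds to $B = -I$, $C = D$, $b = 0$ in (LCP-2), so the $y$-update is soft-thresholding and the $z$-update is a linear solve:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_tv_regression(M, w, D, reg, beta=1.0, gamma=1.0, num_iters=500):
    """ADMM for min_{y,z} reg*||y||_1 + 0.5*||Mz - w||^2 s.t. Dz = y,
    i.e. B = -I, C = D, b = 0 in (LCP-2). Illustrative instance."""
    n = M.shape[1]
    z = np.zeros(n)
    lam = np.zeros(D.shape[0])
    H = M.T @ M + beta * (D.T @ D)     # z-update system matrix (fixed)
    for _ in range(num_iters):
        # y-update: prox of reg*||.||_1 around D z^k - lam/beta
        y = soft_threshold(D @ z - lam / beta, reg / beta)
        # z-update: smooth quadratic subproblem, solved exactly
        z = np.linalg.solve(H, M.T @ w + D.T @ lam + beta * (D.T @ y))
        # dual update on the residual Dz - y
        lam = lam - gamma * (D @ z - y)
    return y, z
```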
Accelerated linearized ADMM
At iteration $k$,
\[
\begin{aligned}
y^{k+1} &\leftarrow \operatorname*{arg\,min}_y\; h(y) - \langle \lambda^k, By\rangle + \frac{\beta_k}{2}\|By + Cz^k - b\|^2,\\
z^{k+1} &\leftarrow \operatorname*{arg\,min}_z\; \langle \nabla f(z^k) - C^\top\lambda^k + \beta_k C^\top r^{k+\frac12}, z\rangle + g(z) + \frac{\eta_k}{2}\|z - z^k\|^2,\\
\lambda^{k+1} &\leftarrow \lambda^k - \gamma_k(By^{k+1} + Cz^{k+1} - b),
\end{aligned}
\]
where $r^{k+\frac12} = By^{k+1} + Cz^k - b$.
• reduces to the linearized ADMM if $\beta_k = \beta$, $\eta_k = \eta$, $\gamma_k = \gamma$, $\forall k$
• convergence rate: $O(1/k)$ if $0 < \gamma \le \beta$ and $\eta \ge L_f + \beta\|C\|^2$
• $O(1/k^2)$ with adaptive parameters and strong convexity on $z$ (next two slides); a sketch follows below
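A minimal sketch on the same illustrative TV-regression instance ($h = \text{reg}\cdot\|\cdot\|_1$, $f(z) = \frac12\|Mz - w\|^2$, $g \equiv 0$; $f$ is strongly convex when $M^\top M \succ 0$), with the adaptive parameters $\beta_k = \gamma_k = (k+1)\gamma$ and $\eta_k = (k+1)\eta + L_f$ from the next slide; the $z$-update is now a single explicit step:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def accel_lin_admm_tv(M, w, D, reg, gamma, eta, num_iters=500):
    """Accelerated linearized ADMM for min reg*||y||_1 + 0.5*||Mz - w||^2
    s.t. Dz = y (B = -I, C = D, b = 0), with beta_k = gamma_k = (k+1)*gamma
    and eta_k = (k+1)*eta + L_f; the theorem asks for gamma < eta <= mu_f/2."""
    n = M.shape[1]
    Lf = np.linalg.norm(M, 2) ** 2             # Lipschitz constant of grad f
    z, lam = np.zeros(n), np.zeros(D.shape[0])
    for k in range(1, num_iters + 1):
        beta_k = gamma_k = (k + 1) * gamma
        eta_k = (k + 1) * eta + Lf
        # y-update: exact prox, as in plain ADMM
        y = soft_threshold(D @ z - lam / beta_k, reg / beta_k)
        # z-update: f and the augmented term both linearized at z^k
        r_half = D @ z - y                     # r^{k+1/2} = By^{k+1} + Cz^k - b
        v = M.T @ (M @ z - w) - D.T @ lam + beta_k * (D.T @ r_half)
        z = z - v / eta_k
        lam = lam - gamma_k * (D @ z - y)      # dual update
    return y, z
```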
Accelerated convergence speed
Assumptions:
• Existence of a primal-dual solution $(y^*, z^*, \lambda^*)$
• $\nabla f$ Lipschitz continuous: $\|\nabla f(z) - \nabla f(\tilde{z})\| \le L_f\|z - \tilde{z}\|$
• $f$ strongly convex with modulus $\mu_f$ (not required for $y$)
Convergence rate of order $O(1/k^2)$:
• Set parameters as follows (with $\gamma > 0$ and $\gamma < \eta \le \mu_f/2$):
\[
\forall k:\; \beta_k = \gamma_k = (k+1)\gamma, \qquad \eta_k = (k+1)\eta + L_f.
\]
Then
\[
\max\big(\|z^k - z^*\|^2,\; |F(y^k, z^k) - F^*|,\; \|By^k + Cz^k - b\|\big) \le O(1/k^2),
\]
where $F(y, z) = h(y) + f(z) + g(z)$ and $F^* = F(y^*, z^*)$.
Sketch of proof
1. Fundamental inequality from the optimality conditions of each iterate:
\[
\begin{aligned}
&F(y^{k+1}, z^{k+1}) - F(y, z) - \langle \lambda, By^{k+1} + Cz^{k+1} - b\rangle\\
&\le -\Big\langle \tfrac{1}{\gamma_k}(\lambda^k - \lambda^{k+1}),\; \lambda - \lambda^k + \tfrac{\beta_k}{\gamma_k}(\lambda^k - \lambda^{k+1}) - \beta_k C(z^{k+1} - z^k)\Big\rangle\\
&\quad + \frac{L_f}{2}\|z^{k+1} - z^k\|^2 - \frac{\mu_f}{2}\|z^k - z\|^2 - \eta_k\langle z^{k+1} - z, z^{k+1} - z^k\rangle.
\end{aligned}
\]

2. Plug in the parameters and bound the cross terms:
\[
\begin{aligned}
&F(y^{k+1}, z^{k+1}) - F(y^*, z^*) - \langle \lambda, By^{k+1} + Cz^{k+1} - b\rangle\\
&\quad + \frac12\big(\eta(k+1) + L_f\big)\|z^{k+1} - z^*\|^2 + \frac{1}{2\gamma(k+1)}\|\lambda - \lambda^{k+1}\|^2\\
&\le \frac12\big(\eta(k+1) + L_f - \mu_f\big)\|z^k - z^*\|^2 + \frac{1}{2\gamma(k+1)}\|\lambda - \lambda^k\|^2.
\end{aligned}
\]

3. Multiply by $k + k_0$ (here $k_0 \sim \frac{2L_f}{\mu_f}$) and sum the inequality over $k$:
\[
F(y^{k+1}, z^{k+1}) - F(y^*, z^*) - \langle \lambda, By^{k+1} + Cz^{k+1} - b\rangle \le \frac{\phi(y^*, z^*, \lambda)}{k^2}
\]

4. Take a special $\lambda$ and use the KKT conditions.
Literature
• [Ouyang et al. '15]: $O(L_f/k^2 + C_0/k)$ with only weak convexity
• [Goldstein et al. '14]: $O(1/k^2)$ with strong convexity on both $y$ and $z$
• [Chambolle-Pock '11, Chambolle-Pock '16, Dang-Lan '14, Bredies-Sun '16]: accelerated first-order methods on bilinear saddle-point problems

Open question: the weakest conditions under which $O(1/k^2)$ holds
Numerical experiments (more results in the paper)
Accelerated (linearized) ADMM
Tested problem: total-variation regularized image denoising
\[
\operatorname*{minimize}_{X,Y}\; \frac12\|X - B\|_F^2 + \mu\|Y\|_1, \quad \text{subject to } DX = Y. \tag{TVDN}
\]
• $B$ is the observed noisy Cameraman image, and $D$ is the finite difference operator; a sketch of $D$ follows below
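For concreteness, one common discretization of the finite difference operator $D$ (the slides do not pin down the exact stencil, so this choice is an assumption):

```python
import numpy as np

def finite_differences(X):
    """Forward finite-difference operator D applied to an image X:
    stacked horizontal and vertical differences with replicated
    boundary (one common discretization; an assumed choice)."""
    dx = np.diff(X, axis=1, append=X[:, -1:])   # horizontal differences
    dy = np.diff(X, axis=0, append=X[-1:, :])   # vertical differences
    return np.stack([dx, dy])

def tv_term(X, mu):
    # the mu*||Y||_1 term of (TVDN) evaluated at Y = D(X)
    return mu * np.abs(finite_differences(X)).sum()
```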
Compared methods:
• original ADMM
• accelerated ADMM
• linearized ADMM
• accelerated linearized ADMM
• accelerated Chambolle-Pock
Performance of compared methods
[Figure: |objective minus optimal value| on a log scale for Accelerated ADMM, Accelerated Linearized ADMM, Nonaccelerated ADMM, Nonaccelerated Linearized ADMM, and Chambolle-Pock; left: versus iteration numbers (0–500), right: versus running time (0–50 sec.).]
• Accelerated (linearized) ADMM is significantly better than its nonaccelerated counterpart
• (accelerated) ADMM beats (accelerated) linearized ADMM in iteration count (but the latter takes less running time)
Conclusions
• accelerated the linearized ALM from $O(1/k)$ to $O(1/k^2)$ with mere convexity
• accelerated the (linearized) ADMM from $O(1/k)$ to $O(1/k^2)$ with strong convexity on one block variable
• numerical experiments confirm the acceleration
References
1. Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization, 2017.
2. T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 2014.
3. B. He and X. Yuan. On the acceleration of augmented Lagrangian method for linearly constrained optimization. Optimization Online, 2010.
4. B. Huang, S. Ma, and D. Goldfarb. Accelerated linearized Bregman method. Journal of Scientific Computing, 2013.
5. M. Kang, S. Yun, H. Woo, and M. Kang. Accelerated Bregman method for linearly constrained $\ell_1$-$\ell_2$ minimization. Journal of Scientific Computing, 2013.