Algorithms for sparse optimization
Michael P. Friedlander
University of California, Davis

Google, May 19, 2015
Sparse representations
y = Hx        y = Dx        y = [H D] x

[Figure: a test signal and its coefficients in the Haar basis, the DCT basis, and the combined Haar/DCT dictionary (coefficient vectors of length 100, 100, and 200).]
Sparsity via a transform
Represent a signal y as a superposition of elementary signal atoms:
y = Σ_j φ_j x_j,   ie, y = Φx

Orthogonal bases yield a unique representation:

x = Φᵀy

Overcomplete dictionaries, eg, A = [Φ1 Φ2], have more flexibility, but the representation is not unique. One approach:

minimize nnz(x) subj to Ax ≈ y
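For concreteness, a minimal NumPy sketch (not from the slides) of the orthogonal-transform picture: build an orthonormal DCT-II basis Φ, compute x = Φᵀy for a smooth test signal, and count how few coefficients carry most of the energy (the effect plotted on the next slide). The signal and basis are illustrative choices.

    import numpy as np

    n = 100
    t = np.linspace(0, 1, n)
    y = np.cos(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 7 * t)  # smooth test signal

    # Orthonormal DCT-II basis: columns of Phi are the atoms, so y = Phi @ x.
    k = np.arange(n)[:, None]          # frequency index
    j = np.arange(n)[None, :]          # sample index
    C = np.cos(np.pi * (j + 0.5) * k / n)
    C[0] *= np.sqrt(1.0 / n)
    C[1:] *= np.sqrt(2.0 / n)
    Phi = C.T                          # synthesis operator: y = Phi x

    x = Phi.T @ y                      # unique coefficients, since Phi is orthogonal
    assert np.allclose(Phi @ x, y)

    # Energy compaction: a few coefficients carry almost all the energy.
    energy = np.sort(x**2)[::-1].cumsum() / (x**2).sum()
    print("coefficients for 99% energy:", np.searchsorted(energy, 0.99) + 1)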
Efficient transforms
[Figure: three panels (Identity, Fourier, Wavelet), each plotting the percentage of signal energy captured against the percentage of coefficients kept.]
Prototypical workflow
Measure structured signal y via linear measurements:
b_i = ⟨f_i, y⟩,   i = 1, …, m,    ie, b = My

Decode the m measurements:

MΦx ≈ b,   x “sparse”

Reconstruct the signal:  y := Φx

Guarantees: number of samples m needed to reconstruct the signal w.h.p.
• compressed sensing (eg, M Gaussian, Φ orthogonal): O(k log(n/k))
• matrix completion: # matrix samples O(n^{5/4} r log n)
Sparse optimization
Problem: find a sparse solution x

minimize nnz(x) subj to Ax ≈ b

Convex relaxations, eg:

basis pursuit [Chen et al. (1998); Candes et al. (2006)]

minimize_{x∈Rⁿ} ‖x‖₁ subj to Ax ≈ b   (m ≪ n)

nuclear norm [Fazel (2002); Candes and Recht (2009)]

minimize_{X∈R^{n×p}} Σ_i σ_i(X) subj to AX ≈ b   (m ≪ np)
APPLICATIONS and FORMULATIONS
Compressed sensing
Measurements: b = (⟨φ1, y⟩, …, ⟨φm, y⟩) = Φy, with Φ Gaussian.

Representing y in a Haar/DCT dictionary:  Φy ≈ Φ [Haar DCT] x,   ie, b ≈ Ax

minimize ‖x‖₁ subj to Ax ≈ b
Image deblurring
Observe b = My, where y = image and M = blurring operator. Recover the significant coefficients via

minimize ½‖MWx − b‖² + λ‖x‖₁

    λ          2.3e+0    2.3e-2    2.3e-4
    ‖Ax − b‖   2.6e+2    3.0e+0    3.1e-2
    nnz(x)         72     1,741    28,484

Sparco problem blurrycam [Figueiredo et al. (2007)]
Joint sparsity: source localization
[Figure: sources s1, …, s5 impinging on a sensor array (wavelength λ, arrival angle θ); the observations B = [b1 · · · bp] equal a phase-shift gain matrix, with columns indexed by arrival angle, times X = [x1 · · · xp].]

minimize ‖X‖_{1,2} subj to ‖B − AX‖_F ≤ σ

[Malioutov et al. (2005)]
Mass spectrometry – separating mixtures
[Figure: mass spectra (relative intensity vs m/z). A measured mixture decomposes as the sum of the spectral fingerprints of three molecules; the fingerprints form the columns of A. A bar plot of component percentages compares the actual mixture against the recovered ones, with and without noise.]

minimize Σ_i x_i subj to Ax ≈ b, x ≥ 0   [Du-Angeletti ’06, Berg-F. ’11]
Phase retrieval

Observe a signal x ∈ Cⁿ through magnitudes:

b_k = |⟨a_k, x⟩|²,   k = 1, …, m

Lift.

b_k = |⟨a_k, x⟩|² = ⟨a_k a_k*, xx*⟩ = ⟨a_k a_k*, X⟩ with X = xx*

Decode. Find X such that

⟨a_k a_k*, X⟩ ≈ b_k,   X ⪰ 0,   rank(X) = 1

Convex relaxation.

minimize_{X∈Hⁿ} tr(X) subj to ⟨a_k a_k*, X⟩ ≈ b_k, X ⪰ 0

[Chai et al. (2011); Candes et al. (2012); Waldspurger et al. (2015)]
Matrix completion
Observe random entries of an m × p matrix Y:

b_k = Y_{i_k,j_k} ≡ e_{i_k}ᵀ Y e_{j_k} = ⟨e_{i_k} e_{j_k}ᵀ, Y⟩,   k = 1, …, m

Decode. Find an m × p matrix X such that

⟨e_{i_k} e_{j_k}ᵀ, X⟩ ≈ b_k,   rank(X) minimal

Convex relaxation.

minimize_{X∈R^{m×p}} Σ_i σ_i(X) subj to ⟨e_{i_k} e_{j_k}ᵀ, X⟩ ≈ b_k

[Fazel (2002); Candes and Recht (2009)]
CONVEX OPTIMIZATION
Convex optimization
minimize_{x∈C} f(x)

f : Rⁿ → R is a convex function, C ⊆ Rⁿ is a convex set.

Essential objective. Extend f to f̄ : Rⁿ → R̄ := R ∪ {+∞} with

f̄(x) = { f(x) if x ∈ C;  +∞ otherwise }

No loss of generality in restricting ourselves to unconstrained problems:

minimize_x f(x) + g(Ax)

f : Rⁿ → R̄, g : Rᵐ → R̄, A an m × n matrix
Convex functions
Definition. f : Rⁿ → R̄ is convex if its epigraph

epi f = { (x, τ) ∈ Rⁿ × R | f(x) ≤ τ }

is convex.
Examples
Indicator on a convex set.

δ_C(x) = { 0 if x ∈ C;  +∞ otherwise },    epi δ_C = C × R₊

Supremum of convex functions.

f := sup_{j∈J} f_j,  {f_j}_{j∈J} convex functions,    epi f = ⋂_{j∈J} epi f_j

Infimal convolution (epi-addition).

(f □ g)(x) = inf_z { f(z) + g(x − z) },    epi(f □ g) = epi f + epi g
Examples
Generic form.

minimize_x f(x) + g(Ax)

Basis pursuit.

min_x ‖x‖₁ st Ax = b:    f = ‖·‖₁,  g = δ_b

Basis pursuit denoising.

min_x ‖x‖₁ st ‖Ax − b‖₂ ≤ σ:    f = ‖·‖₁,  g = δ_{‖·−b‖₂≤σ}

Lasso.

min_x ½‖Ax − b‖² st ‖x‖₁ ≤ τ:    f = δ_{τB₁},  g = ½‖· − b‖²
DUALITY
Duality
minimize_x f(x) + g(Ax)

Decouple the objective terms:

minimize_{x,z} f(x) + g(z) subj to Ax = z

Lagrangian function.

L(x, z, y) := f(x) + g(z) + ⟨y, Ax − z⟩

encapsulates all the information about the problem:

sup_y L(x, z, y) = { f(x) + g(z) if Ax = z;  +∞ otherwise }

p* = inf_{x,z} sup_y L(x, z, y)
Primal-dual pair.
p* = inf_{x,z} sup_y L(x, z, y)   (primal problem)

d* = sup_y inf_{x,z} L(x, z, y)   (dual problem)

The following always holds:

p* ≥ d*   (weak duality)

Under some constraint qualification,

p* = d*   (strong duality)
Fenchel dual

Dual problem.

sup_y inf_{x,z} L(x, z, y) = sup_y inf_{x,z} { f(x) + g(z) + ⟨y, Ax − z⟩ }
                           = sup_y { − sup_x [⟨−Aᵀy, x⟩ − f(x)] − sup_z [⟨y, z⟩ − g(z)] }

Conjugate function.

h*(u) = sup_x { ⟨u, x⟩ − h(x) }

Relabeling y ↦ −y yields the Fenchel-dual pair:

minimize_x f(x) + g(Ax)        (primal)
minimize_y f*(Aᵀy) + g*(−y)    (dual)
PROXIMAL METHODS
Proximal algorithm
minimize_x f(x)   (g ≡ 0)

Proximal operator.

prox_{αf}(x) := argmin_z { f(z) + (1/2α)‖z − x‖² }

(its optimal value is the Moreau envelope (f □ (1/2α)‖·‖²)(x); see below)

Optimality. Every optimal solution x* is a fixed point:

x* = prox_{αf}(x*),   ∀α > 0

Proximal iteration.

x⁺ = prox_{αf}(x)

[Martinet (1970)]
x_{k+1} = prox_{α_k f}(x_k) := argmin_z { f(z) + (1/2α_k)‖z − x_k‖² }

Descent. Because x_{k+1} is the minimizing argument,

f(x_{k+1}) + (1/2α_k)‖x_{k+1} − x_k‖² ≤ f(x) + (1/2α_k)‖x − x_k‖²   ∀x

Setting x = x_k gives

f(x_{k+1}) − f(x_k) ≤ −(1/2α_k)‖x_{k+1} − x_k‖²

Convergence.
• finite convergence for f polyhedral [Polyak and Tretyakov (1973)]
• subproblems may be solved approximately [Rockafellar (1976)]
• f(x_k) − p* ≤ O(1/k) [Guler (1991)]
• f(x_k) − p* ≤ O(1/k²) via an accelerated variant [Guler (1992)]
Moreau-Yosida regularization
M-Y envelope:    f_α(x) = min_z { f(z) + (1/2α)‖z − x‖² }
proximal map:    prox_{αf}(x) = argmin_z { f(z) + (1/2α)‖z − x‖² }

Useful properties:
• differentiable: ∇f_α(x) = (1/α)[x − prox_{αf}(x)]
• preserves minima: argmin f_α = argmin f
• decomposition: x = prox_f(x) + prox_{f*}(x)

Examples

vector 1-norm:    prox_{α‖·‖₁}(x) = sgn(x) .* max{|x| − α, 0}
Schatten 1-norm:  prox_{α‖·‖₁}(X) = U Σ̄ Vᵀ,   Σ̄ = Diag[prox_{α‖·‖₁}(σ(X))]
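A minimal NumPy sketch of these two prox maps (not from the slides; the test vector and rank-1 test matrix are arbitrary illustrations):

    import numpy as np

    def prox_l1(x, alpha):
        """Soft-thresholding: prox of alpha*||.||_1, applied componentwise."""
        return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

    def prox_nuclear(Z, alpha):
        """Prox of alpha * Schatten 1-norm: soft-threshold the singular values."""
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        return U @ np.diag(prox_l1(s, alpha)) @ Vt

    x = np.array([3.0, -0.5, 1.2])
    print(prox_l1(x, 1.0))                   # -> [ 2.  -0.   0.2]

    Z = np.outer([1.0, 2.0], [1.0, -1.0])    # rank-1 test matrix
    print(np.linalg.matrix_rank(prox_nuclear(Z, 0.1)))   # -> 1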
Augmented Lagrangian (AL) algorithm
Rockafellar (’76) connects AL to the proximal method applied to the dual.

minimize f(x) subj to Ax = b,   ie, f(x) + δ_b(Ax)

Lagrangian:       L(x, y)  = f(x) + yᵀ(Ax − b)
aug Lagrangian:   L_ρ(x, y) = f(x) + yᵀ(Ax − b) + (ρ/2)‖Ax − b‖²

AL algorithm:

x_{k+1} = argmin_x L_ρ(x, y_k)     [may be inexact]
y_{k+1} = y_k + ρ(Ax_{k+1} − b)    [multiplier update]

• for linear constraints, AL algorithm ≡ Bregman iteration
• used by L1-Bregman for f(x) = ‖x‖₁ [Yin et al. (2008)]
• proposed by Hestenes/Powell (’69) for smooth nonlinear optimization
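As an illustrative sketch (not from the slides), the AL iteration for the toy problem min ½‖x‖² subj to Ax = b, where the x-step has a closed form; the data is random and ρ = 1 is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, rho = 5, 20, 1.0
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    y = np.zeros(m)
    for _ in range(50):
        # x-step: minimize (1/2)||x||^2 + y'(Ax-b) + (rho/2)||Ax-b||^2,
        # whose optimality condition is (I + rho A'A) x = A'(rho*b - y).
        x = np.linalg.solve(np.eye(n) + rho * A.T @ A, A.T @ (rho * b - y))
        y = y + rho * (A @ x - b)          # multiplier update

    print("feasibility:", np.linalg.norm(A @ x - b))   # -> near 0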
SPLITTING METHODS
projected gradient, proximal-gradient, iterative soft-thresholding, alternating projections
Proximal gradient
minimize_x f(x) + g(x)

Algorithm.

x⁺ = prox_{αg}(x − α∇f(x))

with α ∈ (µ, 2/L), µ > 0, where L is the Lipschitz constant of ∇f:
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖

Convergence.
• f(x_k) − p* ≤ O(1/k)    constant stepsize (α ≡ 1/L)
• f(x_k) − p* ≤ O(1/k²)   accelerated variant [Beck and Teboulle (2009)]
• f(x_k) − p* ≤ O(γᵏ), γ < 1, with stronger assumptions
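A minimal sketch of this iteration in NumPy; grad_f, prox_g, the step α, and the iteration count are caller-supplied, and the worked example (f = ½‖x − c‖², g = ½‖x‖₁) is a hypothetical choice whose solution is soft-thresholding of c.

    import numpy as np

    def prox_gradient(grad_f, prox_g, x0, alpha, iters=100):
        """Iterate x+ = prox_{alpha*g}(x - alpha*grad_f(x))."""
        x = x0
        for _ in range(iters):
            x = prox_g(x - alpha * grad_f(x), alpha)  # forward step, then backward step
        return x

    # Example: f = (1/2)||x - c||^2, so grad_f(x) = x - c and L = 1;
    # g = 0.5*||x||_1, whose prox is soft-thresholding at level 0.5*alpha.
    c = np.array([2.0, -0.3, 0.8])
    x = prox_gradient(lambda x: x - c,
                      lambda v, a: np.sign(v) * np.maximum(np.abs(v) - 0.5 * a, 0.0),
                      np.zeros(3), alpha=1.0)
    print(x)   # -> [1.5, 0., 0.3]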
Interpretations
Majorization minimization. The Lipschitz assumption on ∇f implies

f(z) ≤ f(x) + ⟨∇f(x), z − x⟩ + (L/2)‖z − x‖²   ∀x, z

The proximal-gradient iteration minimizes the majorant: for α ∈ (0, 1/L),

x⁺ = prox_{αg}(x − α∇f(x))
   ≡ argmin_z { g(z) + f(x) + ⟨∇f(x), z − x⟩ + (1/2α)‖z − x‖² }

Forward-backward splitting.

x⁺ = prox_{αg}(x − α∇f(x)) = (I + α∂g)⁻¹ (I − α∇f)(x),

where I − α∇f is the forward (explicit) step and (I + α∂g)⁻¹ is the backward (implicit) step.
Projected gradient
minimize f(x) subj to x ∈ C:    x_{k+1} = proj_C(x_k − α_k ∇f(x_k))

LASSO

minimize_{‖x‖₁≤τ} ½‖Ax − b‖²   via   f(x) = ½‖Ax − b‖², g(x) = δ_{τB₁}(x)

• proj_C(·) can be computed in O(n) (randomized) [Duchi et al. (2008)]; a sort-based variant is sketched below
• used by SPGL1 for Lasso subproblems [vdBerg and Friedlander (2008)]

BPDN (Lagrangian form)

minimize_{x∈Rⁿ} ½‖Ax − b‖² + λ‖x‖₁   via   minimize_{x∈R²ⁿ₊} ½‖Ax − b‖² + λeᵀx
(splitting x into its positive and negative parts)

• proj_{≥0}(·) can be computed in O(n)
• GPSR uses Barzilai-Borwein for the step α_k [Figueiredo-Nowak-Wright ’07]
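A minimal sketch of projected gradient for the LASSO formulation above, using the standard sort-based O(n log n) projection onto the 1-norm ball (the O(n) randomized variant of Duchi et al. is more involved); the data is synthetic.

    import numpy as np

    def proj_l1(x, tau):
        """Euclidean projection of x onto the l1-ball of radius tau."""
        if np.abs(x).sum() <= tau:
            return x
        u = np.sort(np.abs(x))[::-1]                     # magnitudes, descending
        css = np.cumsum(u)
        k = np.nonzero(u > (css - tau) / np.arange(1, x.size + 1))[0][-1]
        theta = (css[k] - tau) / (k + 1.0)               # shrinkage threshold
        return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

    rng = np.random.default_rng(1)
    m, n, tau = 40, 100, 1.0
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    x = np.zeros(n)
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2              # step < 2/L, L = ||A||^2
    for _ in range(200):
        x = proj_l1(x - alpha * A.T @ (A @ x - b), tau)
    print("||x||_1 =", np.abs(x).sum())                  # <= tau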
Iterative soft-thresholding (IST)
minimize_x ½‖Ax − b‖² + λ‖x‖₁

x⁺ = prox_{αλ‖·‖₁}(x − αAᵀr) = S_{αλ}(x − αAᵀr),   r ≡ Ax − b

where α ∈ (0, 2/‖A‖²) and S_τ(x) = sgn(x) · max{|x| − τ, 0}

[aka iterative shrinkage, iterative Landweber]

Derived from various viewpoints:
• expectation-maximization [Figueiredo & Nowak ’03]
• surrogate functionals [Daubechies et al ’04]
• forward-backward splitting [Combettes & Wajs ’05]
• FISTA (accelerated variant) [Beck-Teboulle ’09]
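A minimal IST sketch on synthetic data (the sparsity level, λ, and iteration count are arbitrary illustrations):

    import numpy as np

    def ist(A, b, lam, iters=500):
        alpha = 1.0 / np.linalg.norm(A, 2) ** 2          # step in (0, 2/||A||^2)
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            r = A @ x - b
            v = x - alpha * A.T @ r                      # gradient (Landweber) step
            x = np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)  # shrinkage
        return x

    rng = np.random.default_rng(2)
    A = rng.standard_normal((50, 200))
    x0 = np.zeros(200)
    x0[:5] = 1.0                                         # 5-sparse ground truth
    x = ist(A, A @ x0, lam=0.1)
    print("nonzeros:", np.count_nonzero(np.round(x, 3))) # close to the 5-sparse truth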
Low-rank approximation via the nuclear norm:

minimize ½‖AX − b‖² + λ‖X‖_n,   ‖X‖_n = Σ_j σ_j(X)

Singular-value soft-thresholding algorithm:

X⁺ = S_{αλ}(X − αA*(r)),   r := AX − b

where S_λ(Z) = U diag(S_λ(σ(Z))) Vᵀ for the SVD Z = U diag(σ(Z)) Vᵀ.

Computing S_λ: no need to compute the full SVD of Z, only the leading singular values/vectors
• Monte-Carlo approach [Ma-Goldfarb-Chen ’08]
• Lanczos bidiagonalization (via PROPACK) [Cai-Candes-Shen ’08; Jain et al ’10]
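A minimal sketch of the iteration with A taken, for concreteness, to be entrywise sampling on a mask Ω (a common instance; this toy version computes the full SVD, whereas the methods above use only the leading triples). Data and parameters are synthetic.

    import numpy as np

    def svt(Z, tau):
        """Soft-threshold the singular values of Z (full SVD for simplicity)."""
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

    rng = np.random.default_rng(3)
    n, lam, alpha = 50, 1.0, 1.0            # alpha in (0, 2): ||A|| = 1 for a mask
    Y = rng.standard_normal((n, 4)) @ rng.standard_normal((4, n))   # rank-4 target
    mask = rng.random((n, n)) < 0.5                                 # observed entries

    X = np.zeros((n, n))
    for _ in range(200):
        r = mask * (X - Y)                   # residual A(X) - b, mapped back by A*
        X = svt(X - alpha * r, alpha * lam)  # gradient step + singular-value shrinkage
    print("rank:", np.linalg.matrix_rank(X, tol=1e-3))   # small, near the true rank 4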
Typical behavior
minimize_{X∈R^{n₁×n₂}} ‖X‖₁ subj to X_Ω ≈ b

recover a 1000 × 1000 matrix X with rank(X*) = 4

[Figure: number of singular triples computed at each iteration over 30 iterations.]
Backward-backward splitting
Approximate one of the functions by its (smooth) Moreau envelope:

minimize_x f_α(x) + g(x)   (both f, g nonsmooth)

f_α = f □ (1/2α)‖·‖²,   ∇f_α(x) = (1/α)[x − prox_{αf}(x)]

Proximal gradient.

x⁺ = prox_{αg}(x − α∇f_α(x)) = prox_{αg}(prox_{αf}(x))

Projection onto convex sets (POCS). With f = δ_C and g = δ_D,

x⁺ = proj_D(proj_C(x))

solves the problem

minimize_{x,z} ½‖z − x‖² subj to x ∈ D, z ∈ C
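A minimal POCS sketch with two illustrative sets whose projections are closed-form: a hyperplane C = {x : ⟨a, x⟩ = c} and the unit box D = [0, 1]²; the iterates converge to their intersection.

    import numpy as np

    a, c = np.array([1.0, 2.0]), 3.0
    proj_C = lambda x: x - (a @ x - c) / (a @ a) * a     # onto the hyperplane
    proj_D = lambda x: np.clip(x, 0.0, 1.0)              # onto the unit box

    x = np.array([5.0, -4.0])
    for _ in range(100):
        x = proj_D(proj_C(x))
    print(x, "residual:", a @ x - c)                     # -> [1. 1.], residual near 0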
Alternating direction method of multipliers (ADMM)

minimize_x f(x) + g(x)

ADMM algorithm.

x⁺ = prox_{αf}(z − u)
z⁺ = prox_{αg}(x⁺ + u)
u⁺ = u + x⁺ − z⁺

• used by YALL1 [Zhang et al. (2011)]
• ADMM ≡ Douglas-Rachford ≡ split Bregman (for linear constraints)
• Esser’s UCLA PhD thesis (’10); survey by Boyd et al. (’11)
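A minimal ADMM sketch for an illustrative instance with closed-form prox maps: f = ½‖x − c‖² and g = λ‖·‖₁, so the consensus solution is soft-thresholding of c.

    import numpy as np

    c, lam, alpha = np.array([1.0, -2.0, 0.2]), 0.5, 1.0
    prox_f = lambda v, a: (v + a * c) / (1.0 + a)        # prox of (1/2)||x - c||^2
    prox_g = lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a * lam, 0.0)

    z, u = np.zeros(3), np.zeros(3)
    for _ in range(100):
        x = prox_f(z - u, alpha)
        z = prox_g(x + u, alpha)
        u = u + x - z
    print(z)   # -> soft-thresholding of c: [0.5, -1.5, 0.]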
PROXIMAL SMOOTHING
Primal smoothing
minimize_x f(x) + g(Ax)

Partial smoothing via the Moreau envelope:

minimize_x f_µ(x) + g(Ax),   f_µ = f □ (1/2µ)‖·‖²

Example (Basis Pursuit). Huber approximation to the 1-norm:

f = ‖·‖₁, g = δ_b   =⇒   minimize_x ‖x‖₁ subj to Ax = b

The smoothed function is the Huber function with parameter µ (applied componentwise):

f_µ(x) = { x²/2µ if |x| ≤ µ;  |x| − µ/2 otherwise }

µ ↓ 0   =⇒   x*_µ → x*
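A quick numerical check (illustrative, scalar case on a grid) that the Moreau envelope of |·| is this Huber function:

    import numpy as np

    mu = 0.5
    t = np.linspace(-2, 2, 401)

    # Envelope by direct minimization over a fine grid of z.
    z = np.linspace(-3, 3, 2001)
    env = np.min(np.abs(z)[None, :] + (t[:, None] - z[None, :])**2 / (2 * mu), axis=1)

    # Closed-form Huber function with parameter mu.
    huber = np.where(np.abs(t) <= mu, t**2 / (2 * mu), np.abs(t) - mu / 2)
    print("max error:", np.abs(env - huber).max())   # -> tiny (grid resolution)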
Dual smoothing
Regularize the primal problem.

minimize_x f(x) + (µ/2)‖x‖² + g(Ax)

Dualize. Regularizing f smoothes the conjugate:

(f + (µ/2)‖·‖²)* = f* □ (1/2µ)‖·‖² = f*_µ

minimize_y f*_µ(Aᵀy) + g*(−y)

Example (Basis Pursuit).

f = ‖·‖₁, g = δ_b   =⇒   min_x ‖x‖₁ + (µ/2)‖x‖² subj to Ax = b

f*_µ = (1/2µ) dist²_{B∞}   =⇒   max_y ⟨b, y⟩ − (1/2µ) dist²_{B∞}(Aᵀy)

Exact regularization: x*_µ = x* for all µ ∈ [0, µ̄), for some µ̄ > 0
[Mangasarian and Meyer (1979); Friedlander and Tseng (2007)]

[Nesterov (2005); Beck and Teboulle (2012)]; TFOCS [Becker et al. (2011)]
CONCLUSION
Much left unsaid, eg:
• optimal first-order methods [Nesterov (1988, 2005)]
• block coordinate descent [Sardy et al. (2004)]
• active-set methods (eg, LARS & Homotopy) [Osborne et al. (2000)]

My own work:
• avoiding proximal operators [Friedlander et al. (2014) (SIOPT)]
• greedy coordinate descent [Nutini et al. (2015) (ICML)]

email: [email protected]
web: www.math.ucdavis.edu/~mpf
REFERENCES
References I
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.

A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM J. Optim., 22(2):557–580, 2012.

S. Becker, E. Candes, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comp., 3:165–218, 2011. doi: 10.1007/s12532-011-0029-5.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

J.-F. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010. doi: 10.1137/080738970.

E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, December 2009.
References II

E. J. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2006.

E. J. Candes, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Commun. Pur. Appl. Ana., 2012.

A. Chai, M. Moscoso, and G. Papanicolaou. Array imaging using intensity-only measurements. Inverse Problems, 27(1):015005, 2011. URL http://stacks.iop.org/0266-5611/27/i=1/a=015005.

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61, 1998.

P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul., 4(4):1168–1200, 2005.

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57:1413–1457, 2004.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
References III

J. E. Esser. Primal Dual Algorithms for Convex Models and Applications to Image Restoration, Registration and Nonlocal Inpainting. PhD thesis, University of California, Los Angeles, 2010.

M. Fazel. Matrix rank minimization with applications. PhD thesis, Elec. Eng. Dept., Stanford University, 2002.

M. Figueiredo, R. Nowak, and S. J. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signal Process., 1(4):586–597, 2007.

M. A. T. Figueiredo and R. D. Nowak. An EM algorithm for wavelet-based image restoration. IEEE Trans. Image Process., 12(8):906–916, August 2003.

M. P. Friedlander and P. Tseng. Exact regularization of convex programs. SIAM J. Optim., 18(4):1326–1350, 2007. doi: 10.1137/060675320.

M. P. Friedlander, I. Macedo, and T. K. Pong. Gauge optimization and duality. SIAM J. Optim., 24(4):1999–2022, 2014. doi: 10.1137/130940785.

O. Guler. On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control Optim., 29(2):403–419, 1991. doi: 10.1137/0329022.
References IV

O. Guler. New proximal point algorithms for convex minimization. SIAM J. Optim., 2(4):649–664, 1992. doi: 10.1137/0802032.

M. R. Hestenes. Multiplier and gradient methods. J. Optim. Theory Appl., 4(5):303–320, 1969.

S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program., 128:321–353, 2011. doi: 10.1007/s10107-009-0306-5.

D. Malioutov, M. Cetin, and A. S. Willsky. A sparse signal reconstruction perspective for source localization with sensor arrays. IEEE Trans. Sig. Proc., 53(8):3010–3022, August 2005.

O. L. Mangasarian and R. R. Meyer. Nonlinear perturbation of linear programs. SIAM J. Control Optim., 17(6):745–752, November 1979.

B. Martinet. Régularisation d'inéquations variationnelles par approximations successives. Rev. Française Informat. Recherche Opérationnelle, 4(Sér. R-3):154–158, 1970.

Y. Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonom. i Mat. Metody, 24, 1988.
References V

Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103:127–152, 2005.

P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2796–2804. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5041-pha.

J. Nutini, I. Laradji, M. Schmidt, and M. Friedlander. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In Inter. Conf. Mach. Learning, 2015.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20(3):389–403, 2000.

B. Polyak and N. Tretyakov. The method of penalty bounds for constrained extremum problems. Zh. Vych. Mat. i Mat. Fiz., 13:34–46, 1973.

M. J. D. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, chapter 19. Academic Press, London and New York, 1969.
References VI

R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1(2):97–116, 1976.

S. Sardy, A. Antoniadis, and P. Tseng. Automatic smoothing with wavelets for a wide class of distributions. J. Comput. Graph. Stat., 13, 2004.

E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput., 31(2):890–912, 2008. doi: 10.1137/080714488.

I. Waldspurger, A. d'Aspremont, and S. Mallat. Phase recovery, MaxCut and complex semidefinite programming. Math. Program., 149(1-2):47–81, 2015.

W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for ℓ1 minimization with applications to compressed sensing. SIAM J. Imag. Sci., 1, 2008.

Y. Zhang, W. Deng, J. Yang, and W. Yin, 2011. URL http://yall1.blogs.rice.edu/.