
Coordinate Descent

Lecturer: Pradeep Ravikumar

Co-instructor: Aarti Singh

Convex Optimization 10-725/36-725

Slides adapted from Tibshirani


Coordinate descent

We’ve seen some pretty sophisticated methods thus far

We now focus on a very simple technique that can be surprisingly efficient, scalable: coordinate descent, or more appropriately called coordinatewise minimization

Q: Given convex, differentiable $f : \mathbb{R}^n \to \mathbb{R}$, if we are at a point $x$ such that $f(x)$ is minimized along each coordinate axis, then have we found a global minimizer?

I.e., does $f(x + \delta e_i) \geq f(x)$ for all $\delta, i$ imply $f(x) = \min_z f(z)$?

(Here $e_i = (0, \ldots, 1, \ldots, 0) \in \mathbb{R}^n$, the $i$th standard basis vector)

4



[Figure: surface plot of a smooth convex $f$ over coordinates $(x_1, x_2)$]

A: Yes! Proof:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_n}(x) \right) = 0$$

Q: Same question, but now for $f$ convex, and not differentiable?

5



[Figure: surface and contour plots of a convex but nondifferentiable $f$ over $(x_1, x_2) \in [-4, 4]^2$ (the counterexample)]

A: No! Look at the above counterexample

Q: Same question again, but now $f(x) = g(x) + \sum_{i=1}^n h_i(x_i)$, with $g$ convex, differentiable and each $h_i$ convex? (Here the nonsmooth part is called separable)

6



[Figure: surface and contour plots over $(x_1, x_2) \in [-4, 4]^2$ for the case $f = g + \sum_i h_i$]

A: Yes! Proof: for any $y$,

$$f(y) - f(x) \;\geq\; \nabla g(x)^T (y - x) + \sum_{i=1}^n \big[ h_i(y_i) - h_i(x_i) \big]
\;=\; \sum_{i=1}^n \underbrace{\big[ \nabla_i g(x)(y_i - x_i) + h_i(y_i) - h_i(x_i) \big]}_{\geq 0} \;\geq\; 0$$

(Each bracketed term is $\geq 0$ because $x$ minimizes $f$ along coordinate $i$, i.e., $-\nabla_i g(x) \in \partial h_i(x_i)$)

7



Coordinate descent

This suggests that for $f(x) = g(x) + \sum_{i=1}^n h_i(x_i)$, with $g$ convex, differentiable and each $h_i$ convex, we can use coordinate descent to find a minimizer: start with some initial guess $x^{(0)}$, and repeat

$$x_1^{(k)} \in \operatorname*{argmin}_{x_1} \; f\big(x_1, x_2^{(k-1)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_2^{(k)} \in \operatorname*{argmin}_{x_2} \; f\big(x_1^{(k)}, x_2, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_3^{(k)} \in \operatorname*{argmin}_{x_3} \; f\big(x_1^{(k)}, x_2^{(k)}, x_3, \ldots, x_n^{(k-1)}\big)$$
$$\ldots$$
$$x_n^{(k)} \in \operatorname*{argmin}_{x_n} \; f\big(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, \ldots, x_n\big)$$

for $k = 1, 2, 3, \ldots$. Important: after solving for $x_i^{(k)}$, we use its new value from then on!

8
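As a minimal illustration of this scheme (not from the slides), the sketch below cycles over coordinates and solves each one-dimensional problem numerically; the function name `coordinate_descent`, the use of `scipy.optimize.minimize_scalar` as the inner argmin, and the stopping tolerance are all illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, max_cycles=100, tol=1e-8):
    """Cyclic coordinatewise minimization: minimize f over one
    coordinate at a time, immediately reusing each new value."""
    x = np.array(x0, dtype=float)
    for _ in range(max_cycles):
        x_old = x.copy()
        for i in range(len(x)):
            def f_i(z, i=i):           # one-dimensional slice of f in coordinate i
                x[i] = z
                return f(x)
            x[i] = minimize_scalar(f_i).x
        if np.max(np.abs(x - x_old)) < tol:
            break
    return x

# Example on a smooth convex quadratic: answer should match -Q^{-1} b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ Q @ x + b @ x
print(coordinate_descent(f, np.zeros(2)), -np.linalg.solve(Q, b))
```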


Tseng (2001) proves that for such $f$ (provided $f$ is continuous on the compact set $\{x : f(x) \leq f(x^{(0)})\}$ and $f$ attains its minimum), any limit point of $x^{(k)}$, $k = 1, 2, 3, \ldots$ is a minimizer of $f$¹

Notes:

• Order of cycle through coordinates is arbitrary, can use any permutation of $\{1, 2, \ldots, n\}$

• Can everywhere replace individual coordinates with blocks of coordinates

• “One-at-a-time” update scheme is critical, and “all-at-once” scheme does not necessarily converge

• The analogy for solving linear systems: Gauss-Seidel versus Jacobi method

¹Using real analysis, we know that $x^{(k)}$ has a subsequence converging to $x^\star$ (Bolzano-Weierstrass), and $f(x^{(k)})$ converges to $f^\star$ (monotone convergence)

9



Example: linear regression

Given $y \in \mathbb{R}^n$, and $X \in \mathbb{R}^{n \times p}$ with columns $X_1, \ldots, X_p$, consider linear regression:

$$\min_\beta \; \frac{1}{2} \|y - X\beta\|_2^2$$

Minimizing over $\beta_i$, with all $\beta_j$, $j \neq i$ fixed:

$$0 = \nabla_i f(\beta) = X_i^T (X\beta - y) = X_i^T (X_i \beta_i + X_{-i} \beta_{-i} - y)$$

i.e., we take

$$\beta_i = \frac{X_i^T (y - X_{-i} \beta_{-i})}{X_i^T X_i}$$

Coordinate descent repeats this update for $i = 1, 2, \ldots, p, 1, 2, \ldots$

10
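A minimal numpy sketch of this update (illustrative, not the slides' code; the function name `cd_linreg` and the fixed number of cycles are arbitrary choices):

```python
import numpy as np

def cd_linreg(X, y, n_cycles=50):
    """Cyclic coordinate descent for least squares: exactly minimize
    over beta_i with all other coordinates held fixed."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)            # X_i^T X_i for each column
    for _ in range(n_cycles):
        for i in range(p):
            # partial residual with column i's contribution removed
            r_i = y - X @ beta + X[:, i] * beta[i]
            beta[i] = X[:, i] @ r_i / col_sq[i]
    return beta

# Example on random data: should approach the least-squares solution
rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 20)), rng.standard_normal(100)
print(np.max(np.abs(cd_linreg(X, y) - np.linalg.lstsq(X, y, rcond=None)[0])))
```

As written, each coordinate update recomputes $X\beta$ and so costs $O(np)$; the next slide shows how to bring this down to $O(n)$ per coordinate.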


Coordinate descent vs gradient descent for linear regression: 100 instances with $n = 100$, $p = 20$

[Figure: suboptimality $f(k) - f^\star$ (log scale, 1e$-$10 to 1e+02) versus iteration number $k$ (0 to 40), for gradient descent (GD) and coordinate descent (CD)]

11


Is it fair to compare 1 cycle of coordinate descent to 1 iteration of gradient descent? Yes, if we're clever

• Gradient descent: $\beta \leftarrow \beta + t X^T (y - X\beta)$, costs $O(np)$ flops

• Coordinate descent, one coordinate update:
$$\beta_i \leftarrow \frac{X_i^T (y - X_{-i} \beta_{-i})}{X_i^T X_i} = \frac{X_i^T r}{\|X_i\|_2^2} + \beta_i$$
where $r = y - X\beta$

• Each coordinate costs $O(n)$ flops: $O(n)$ to update $r$, $O(n)$ to compute $X_i^T r$

• One cycle of coordinate descent costs $O(np)$ operations, same as gradient descent

12
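A minimal sketch of the residual bookkeeping just described (illustrative, not from the slides): keep $r = y - X\beta$ up to date so that each coordinate update touches only one column of $X$.

```python
import numpy as np

def cd_linreg_fast(X, y, n_cycles=50):
    """Cyclic coordinate descent for least squares with a cached
    residual r = y - X @ beta, so each coordinate update is O(n)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    r = y.copy()                                     # residual for beta = 0
    for _ in range(n_cycles):
        for i in range(p):
            old = beta[i]
            beta[i] = X[:, i] @ r / col_sq[i] + old  # slide's update: X_i^T r / ||X_i||^2 + beta_i
            r -= X[:, i] * (beta[i] - old)           # O(n) residual update
    return beta
```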


[Figure: $f(k) - f^\star$ (log scale) versus $k$ for gradient descent (GD), accelerated gradient descent (AGD), and coordinate descent (CD)]

Same example, but now with accelerated gradient descent for comparison

Is this contradicting the optimality of accelerated gradient descent? No! Coordinate descent uses more than first-order information

13



Example: lasso regression

Consider the lasso problem:

$$\min_\beta \; \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

Note that the nonsmooth part is separable: $\|\beta\|_1 = \sum_{i=1}^p |\beta_i|$

Minimizing over $\beta_i$, with $\beta_j$, $j \neq i$ fixed:

$$0 = X_i^T X_i \beta_i + X_i^T (X_{-i} \beta_{-i} - y) + \lambda s_i$$

where $s_i \in \partial |\beta_i|$. Solution is simply given by soft-thresholding

$$\beta_i = S_{\lambda / \|X_i\|_2^2}\!\left( \frac{X_i^T (y - X_{-i} \beta_{-i})}{X_i^T X_i} \right)$$

Repeat this for $i = 1, 2, \ldots, p, 1, 2, \ldots$

14
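A minimal sketch of this lasso update (illustrative, not the slides' code), reusing the cached-residual trick and a small soft-thresholding helper defined here:

```python
import numpy as np

def soft_threshold(z, t):
    """S_t(z) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_lasso(X, y, lam, n_cycles=100):
    """Cyclic coordinate descent for 0.5*||y - X beta||_2^2 + lam*||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    r = y.copy()                               # residual y - X @ beta
    for _ in range(n_cycles):
        for i in range(p):
            old = beta[i]
            z = X[:, i] @ r / col_sq[i] + old  # unpenalized coordinatewise minimizer
            beta[i] = soft_threshold(z, lam / col_sq[i])
            r -= X[:, i] * (beta[i] - old)     # O(n) residual update
    return beta
```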


Proximal gradient vs coordinate descent for lasso regression: 100 instances with $n = 100$, $p = 20$

[Figure: $f(k) - f^\star$ (log scale) versus $k$ for proximal gradient (PG), accelerated proximal gradient (APG), and coordinate descent (CD)]

For same reasons as before:

• All methods use $O(np)$ flops per iteration

• Coordinate descent uses much more information

15


Example: box-constrained QP

Given $b \in \mathbb{R}^n$, $Q \in \mathbb{S}^n_+$, consider the box-constrained quadratic program

$$\min_x \; \frac{1}{2} x^T Q x + b^T x \quad \text{subject to} \quad l \leq x \leq u$$

Fits into our framework, as $I\{l \leq x \leq u\} = \sum_{i=1}^n I\{l_i \leq x_i \leq u_i\}$

Minimizing over $x_i$ with all $x_j$, $j \neq i$ fixed: same basic steps give

$$x_i = T_{[l_i, u_i]}\!\left( \frac{-b_i - \sum_{j \neq i} Q_{ij} x_j}{Q_{ii}} \right)$$

where $T_{[l_i, u_i]}$ is the truncation (projection) operator onto $[l_i, u_i]$:

$$T_{[l_i, u_i]}(z) = \begin{cases} u_i & \text{if } z > u_i \\ z & \text{if } l_i \leq z \leq u_i \\ l_i & \text{if } z < l_i \end{cases}$$

16
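A minimal numpy sketch of this update (illustrative, not the slides' code), with `np.clip` playing the role of the truncation operator $T_{[l_i,u_i]}$:

```python
import numpy as np

def cd_box_qp(Q, b, l, u, n_cycles=100):
    """Cyclic coordinate descent for
    min 0.5 * x^T Q x + b^T x  subject to  l <= x <= u  (Q_ii > 0 assumed)."""
    n = len(b)
    x = np.clip(np.zeros(n), l, u)
    for _ in range(n_cycles):
        for i in range(n):
            # unconstrained coordinatewise minimizer, then project onto [l_i, u_i]
            z = -(b[i] + Q[i] @ x - Q[i, i] * x[i]) / Q[i, i]
            x[i] = np.clip(z, l[i], u[i])
    return x
```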


Example: support vector machines

A coordinate descent strategy can be applied to the SVM dual:

$$\min_\alpha \; \frac{1}{2} \alpha^T \tilde{X} \tilde{X}^T \alpha - 1^T \alpha \quad \text{subject to} \quad 0 \leq \alpha \leq C1, \;\; \alpha^T y = 0$$

Sequential minimal optimization or SMO (Platt 1998) is basically blockwise coordinate descent in blocks of 2. Instead of cycling, it chooses the next block greedily

Recall the complementary slackness conditions

$$\alpha_i \big( 1 - \xi_i - (\tilde{X}\beta)_i - y_i \beta_0 \big) = 0, \quad i = 1, \ldots, n \qquad (1)$$

$$(C - \alpha_i)\, \xi_i = 0, \quad i = 1, \ldots, n \qquad (2)$$

where $\beta, \beta_0, \xi$ are the primal coefficients, intercept, and slacks. Recall that $\beta = \tilde{X}^T \alpha$, $\beta_0$ is computed from (1) using any $i$ such that $0 < \alpha_i < C$, and $\xi$ is computed from (1), (2)

17


SMO repeats the following two steps:

• Choose $\alpha_i, \alpha_j$ that do not satisfy complementary slackness, greedily (using heuristics)

• Minimize over $\alpha_i, \alpha_j$ exactly, keeping all other variables fixed

Using the equality constraint, this reduces to minimizing a univariate quadratic over an interval (from Platt 1998)

Note this does not meet separability assumptions for convergence from Tseng (2001), and a different treatment is required

Many further developments on coordinate descent for SVMs have been made; e.g., a recent one is Hsieh et al. (2008)

18


Coordinate descent in statistics and ML

History in statistics:

• Idea appeared in Fu (1998), and again in Daubechies et al. (2004), but was inexplicably ignored

• Three papers around 2007, especially Friedman et al. (2007), really sparked interest in statistics and ML communities

Why is it used?

• Very simple and easy to implement

• Careful implementations can be near state-of-the-art

• Scalable, e.g., don’t need to keep full data in memory

Examples: lasso regression, lasso GLMs (under proximal Newton), SVMs, group lasso, graphical lasso (applied to the dual), additive modeling, matrix completion, regression with nonconvex penalties

19


What’s in a name?

The name coordinate descent is confusing. For a smooth function $f$, the method that repeats

$$x_1^{(k)} = x_1^{(k-1)} - t_{k,1} \cdot \nabla_1 f\big(x_1^{(k-1)}, x_2^{(k-1)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_2^{(k)} = x_2^{(k-1)} - t_{k,2} \cdot \nabla_2 f\big(x_1^{(k)}, x_2^{(k-1)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_3^{(k)} = x_3^{(k-1)} - t_{k,3} \cdot \nabla_3 f\big(x_1^{(k)}, x_2^{(k)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$\ldots$$
$$x_n^{(k)} = x_n^{(k-1)} - t_{k,n} \cdot \nabla_n f\big(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, \ldots, x_n^{(k-1)}\big)$$

for $k = 1, 2, 3, \ldots$ is also (rightfully) called coordinate descent. If $f = g + h$, where $g$ is smooth and $h$ is separable, then the proximal version of the above is also called coordinate descent

These versions are often easier to apply than exact coordinatewise minimization, but the latter makes more progress per step

22
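A minimal sketch of this coordinate (proximal) gradient scheme (illustrative, not from the slides); `grad_i` and `prox_i` are caller-supplied callables introduced here, and the step size is taken fixed rather than per-iteration $t_{k,i}$:

```python
import numpy as np

def coordinate_gradient_descent(grad_i, x0, t, n_cycles=100, prox_i=None):
    """Cyclic coordinate gradient descent; with prox_i, its proximal version.

    grad_i(x, i):     i-th partial derivative of the smooth part g
    prox_i(z, t, i):  prox of t * h_i evaluated at z (omit if h = 0)
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_cycles):
        for i in range(len(x)):
            z = x[i] - t * grad_i(x, i)
            x[i] = z if prox_i is None else prox_i(z, t, i)
    return x

# Example: lasso via the proximal version, on hypothetical random data
rng = np.random.default_rng(0)
X, y, lam = rng.standard_normal((50, 10)), rng.standard_normal(50), 0.1
grad_i = lambda beta, i: X[:, i] @ (X @ beta - y)
prox_i = lambda z, t, i: np.sign(z) * max(abs(z) - t * lam, 0.0)
t = 1.0 / np.max(np.sum(X**2, axis=0))        # safe fixed step for every coordinate
beta = coordinate_gradient_descent(grad_i, np.zeros(10), t, prox_i=prox_i)
```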


Convergence analyses

Theory for coordinate descent moves quickly. The list given below is incomplete (may not be the latest and greatest). Warning: some references below treat coordinatewise minimization, some do not

• Convergence of coordinatewise minimization for solving linear systems, the Gauss-Seidel method, is a classic topic. E.g., see Golub and van Loan (1996), or Ramdas (2014) for a modern twist that looks at randomized coordinate descent

• Nesterov (2010) considers randomized coordinate descent for smooth functions and shows that it achieves a rate $O(1/\epsilon)$ under a Lipschitz gradient condition, and a rate $O(\log(1/\epsilon))$ under strong convexity

• Richtarik and Takac (2011) extend and simplify these results, considering smooth plus separable functions, where now each coordinate descent update applies a prox operation

23


• Saha and Tewari (2013) consider minimizing $\ell_1$-regularized functions of the form $g(\beta) + \lambda\|\beta\|_1$, for smooth $g$, and study both cyclic coordinate descent and cyclic coordinatewise minimization. Under (very strange) conditions on $g$, they show both methods dominate proximal gradient descent in iteration progress

• Beck and Tetruashvili (2013) study cyclic coordinate descent for smooth functions in general. They show that it achieves a rate $O(1/\epsilon)$ under a Lipschitz gradient condition, and a rate $O(\log(1/\epsilon))$ under strong convexity. They also extend these results to a constrained setting with projections

• Nutini et al. (2015) analyze greedy coordinate descent (called the Gauss-Southwell rule), and show it achieves a faster rate than randomized coordinate descent for certain problems

• Wright (2015) provides some unification and a great summary. Also covers parallel versions (even asynchronous ones)

• General theory is still not complete; still unanswered questions (e.g., are descent and minimization strategies the same?)

24


Screening rules

In some problems, screening rules can be used in combination with coordinate descent to further whittle down the active set. Screening rules themselves have amassed a sizeable literature recently. Here is an example, the SAFE rule for the lasso²:

$$|X_i^T y| < \lambda - \|X_i\|_2 \|y\|_2 \, \frac{\lambda_{\max} - \lambda}{\lambda_{\max}} \;\;\Longrightarrow\;\; \hat\beta_i = 0, \quad \text{all } i = 1, \ldots, p$$

where $\lambda_{\max} = \|X^T y\|_\infty$ (the smallest value of $\lambda$ such that $\hat\beta = 0$)

Note: this is not an if and only if statement! But it does give us a way of eliminating features a priori, without solving the lasso

(There have been many advances in screening rules for the lasso, but SAFE is the simplest, and was the first)

²El Ghaoui et al. (2010), "Safe feature elimination in sparse learning"

30
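A minimal sketch of applying the SAFE rule as a preprocessing step (illustrative, not from the slides); it returns a boolean mask of features that are guaranteed to have $\hat\beta_i = 0$ and can be dropped before running coordinate descent:

```python
import numpy as np

def safe_rule_discard(X, y, lam):
    """SAFE screening for the lasso: feature i can be discarded if
    |X_i^T y| < lam - ||X_i||_2 * ||y||_2 * (lam_max - lam) / lam_max."""
    corr = np.abs(X.T @ y)                 # |X_i^T y| for each feature
    lam_max = np.max(corr)                 # ||X^T y||_inf
    bound = lam - np.linalg.norm(X, axis=0) * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return corr < bound                    # True => hat(beta)_i = 0 at the solution
```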


Why is the SAFE rule true? Construction comes from the lasso dual:

$$\max_u \; g(u) \quad \text{subject to} \quad \|X^T u\|_\infty \leq \lambda$$

where $g(u) = \frac{1}{2}\|y\|_2^2 - \frac{1}{2}\|y - u\|_2^2$. Suppose that $u_0$ is dual feasible (e.g., take $u_0 = y \cdot \lambda / \lambda_{\max}$). Then $\gamma = g(u_0)$ is a lower bound on the dual optimal value, so the dual problem is equivalent to

$$\max_u \; g(u) \quad \text{subject to} \quad \|X^T u\|_\infty \leq \lambda, \;\; g(u) \geq \gamma$$

Now consider computing

$$m_i = \max_u \; |X_i^T u| \quad \text{subject to} \quad g(u) \geq \gamma, \quad \text{for } i = 1, \ldots, p$$

Then we would have

$$m_i < \lambda \;\Longrightarrow\; |X_i^T u| < \lambda \;\Longrightarrow\; \hat\beta_i = 0, \quad i = 1, \ldots, p$$

The last implication comes from the KKT conditions

31


Another dual argument shows that

$$\max_u \; X_i^T u \;\;\text{subject to}\;\; g(u) \geq \gamma
\;=\; \min_{\mu > 0} \; -\gamma\mu + \frac{1}{2\mu}\|\mu y + X_i\|_2^2
\;=\; \|X_i\|_2 \sqrt{\|y\|_2^2 - 2\gamma} + X_i^T y$$

where the last equality comes from direct calculation

Thus $m_i$ is given by the maximum of the above quantity over $\pm X_i$,

$$m_i = \|X_i\|_2 \sqrt{\|y\|_2^2 - 2\gamma} + |X_i^T y|, \quad i = 1, \ldots, p$$

Lastly, substitute $\gamma = g(y \cdot \lambda / \lambda_{\max})$. Then $m_i < \lambda$ is precisely the SAFE rule given on the previous slide

33
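To spell out that last substitution (a short worked step, not on the original slide): with $u_0 = y \cdot \lambda/\lambda_{\max}$,

$$\gamma = g\Big(\tfrac{\lambda}{\lambda_{\max}} y\Big) = \tfrac{1}{2}\|y\|_2^2 - \tfrac{1}{2}\Big(1 - \tfrac{\lambda}{\lambda_{\max}}\Big)^2 \|y\|_2^2
\;\;\Longrightarrow\;\;
\|y\|_2^2 - 2\gamma = \Big(1 - \tfrac{\lambda}{\lambda_{\max}}\Big)^2 \|y\|_2^2,$$

so $m_i < \lambda$ becomes $\|X_i\|_2 \|y\|_2 \big(1 - \lambda/\lambda_{\max}\big) + |X_i^T y| < \lambda$, i.e., $|X_i^T y| < \lambda - \|X_i\|_2 \|y\|_2 \,(\lambda_{\max} - \lambda)/\lambda_{\max}$, which is exactly the SAFE rule.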


References

Early coordinate descent references in optimization:

• D. Bertsekas and J. Tsitsiklis (1989), "Parallel and distributed computation: numerical methods"

• Z. Luo and P. Tseng (1992), "On the convergence of the coordinate descent method for convex differentiable minimization"

• J. Ortega and W. Rheinboldt (1970), "Iterative solution of nonlinear equations in several variables"

• P. Tseng (2001), "Convergence of a block coordinate descent method for nondifferentiable minimization"

35


Early coordinate descent references in statistics and ML:

• I. Daubechies and M. Defrise and C. De Mol (2004), "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint"

• J. Friedman and T. Hastie and H. Hoefling and R. Tibshirani (2007), "Pathwise coordinate optimization"

• W. Fu (1998), "Penalized regressions: the bridge versus the lasso"

• T. Wu and K. Lange (2008), "Coordinate descent algorithms for lasso penalized regression"

• A. van der Kooij (2007), "Prediction accuracy and stability of regression with optimal scaling transformations"

36


Applications of coordinate descent:

• O. Banerjee and L. Ghaoui and A. d’Aspremont (2007), "Model selection through sparse maximum likelihood estimation"

• J. Friedman and T. Hastie and R. Tibshirani (2007), "Sparse inverse covariance estimation with the graphical lasso"

• J. Friedman and T. Hastie and R. Tibshirani (2009), "Regularization paths for generalized linear models via coordinate descent"

• C.J. Hsieh and K.W. Chang and C.J. Lin and S. Keerthi and S. Sundararajan (2008), "A dual coordinate descent method for large-scale linear SVM"

• R. Mazumder and J. Friedman and T. Hastie (2011), "SparseNet: coordinate descent with non-convex penalties"

• J. Platt (1998), "Sequential minimal optimization: a fast algorithm for training support vector machines"

37


Recent theory for coordinate descent:

• A. Beck and L. Tetruashvili (2013), "On the convergence of block coordinate descent type methods"

• Y. Nesterov (2010), "Efficiency of coordinate descent methods on huge-scale optimization problems"

• J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, H. Koepke (2015), "Coordinate descent converges faster with the Gauss-Southwell rule than random selection"

• A. Ramdas (2014), "Rows vs columns for linear systems of equations—randomized Kaczmarz or coordinate descent?"

• P. Richtarik and M. Takac (2011), "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function"

• A. Saha and A. Tewari (2013), "On the nonasymptotic convergence of cyclic coordinate descent methods"

• S. Wright (2015), "Coordinate descent algorithms"

38


Screening rules and graphical lasso references:

• L. El Ghaoui and V. Viallon and T. Rabbani (2010), "Safe feature elimination in sparse supervised learning"

• R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani (2011), "Strong rules for discarding predictors in lasso-type problems"

• R. Mazumder and T. Hastie (2011), "The graphical lasso: new insights and alternatives"

• R. Mazumder and T. Hastie (2011), "Exact covariance thresholding into connected components for large-scale graphical lasso"

• J. Wang, P. Wonka, and J. Ye (2015), "Lasso screening rules via dual polytope projection"

• D. Witten and J. Friedman and N. Simon (2011), "New insights and faster computations for the graphical lasso"

39