
Coordinate Descent

Lecturer: Pradeep Ravikumar

Co-instructor: Aarti Singh

Convex Optimization 10-725/36-725

Slides adapted from Tibshirani


Coordinate descent

We’ve seen some pretty sophisticated methods thus far

We now focus on a very simple technique that can be surprisingly efficient, scalable: coordinate descent, or more appropriately called coordinatewise minimization

Q: Given convex, differentiable $f : \mathbb{R}^n \to \mathbb{R}$, if we are at a point $x$ such that $f(x)$ is minimized along each coordinate axis, then have we found a global minimizer?

I.e., does $f(x + \delta e_i) \geq f(x)$ for all $\delta, i$ imply $f(x) = \min_z f(z)$?

(Here $e_i = (0, \ldots, 1, \ldots, 0) \in \mathbb{R}^n$, the $i$th standard basis vector)

4



[Figure: surface plot of a smooth convex $f$ over coordinates $(x_1, x_2)$]

A: Yes! Proof:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_n}(x) \right) = 0$$

Q: Same question, but now for $f$ convex, and not differentiable?

5



[Figure: surface and contour plots of a convex but nondifferentiable $f$ over $(x_1, x_2) \in [-4, 4]^2$ (the counterexample)]

A: No! Look at the above counterexample

Q: Same question again, but now $f(x) = g(x) + \sum_{i=1}^n h_i(x_i)$, with $g$ convex, differentiable and each $h_i$ convex? (Here the nonsmooth part is called separable)

6



[Figure: surface and contour plots over $(x_1, x_2) \in [-4, 4]^2$ for the case $f = g + \sum_i h_i$]

A: Yes! Proof: for any $y$,

$$f(y) - f(x) \;\geq\; \nabla g(x)^T (y - x) + \sum_{i=1}^n \big[ h_i(y_i) - h_i(x_i) \big]
\;=\; \sum_{i=1}^n \underbrace{\big[ \nabla_i g(x)(y_i - x_i) + h_i(y_i) - h_i(x_i) \big]}_{\geq 0} \;\geq\; 0$$

(Each bracketed term is $\geq 0$ because $x$ minimizes $f$ along coordinate $i$, i.e., $-\nabla_i g(x) \in \partial h_i(x_i)$)

7



Coordinate descent

This suggests that for $f(x) = g(x) + \sum_{i=1}^n h_i(x_i)$, with $g$ convex, differentiable and each $h_i$ convex, we can use coordinate descent to find a minimizer: start with some initial guess $x^{(0)}$, and repeat

$$x_1^{(k)} \in \operatorname*{argmin}_{x_1} \; f\big(x_1, x_2^{(k-1)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_2^{(k)} \in \operatorname*{argmin}_{x_2} \; f\big(x_1^{(k)}, x_2, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_3^{(k)} \in \operatorname*{argmin}_{x_3} \; f\big(x_1^{(k)}, x_2^{(k)}, x_3, \ldots, x_n^{(k-1)}\big)$$
$$\ldots$$
$$x_n^{(k)} \in \operatorname*{argmin}_{x_n} \; f\big(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, \ldots, x_n\big)$$

for $k = 1, 2, 3, \ldots$. Important: after solving for $x_i^{(k)}$, we use its new value from then on!

8
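As a minimal illustration of this scheme (not from the slides), the sketch below cycles over coordinates and solves each one-dimensional problem numerically; the function name `coordinate_descent`, the use of `scipy.optimize.minimize_scalar` as the inner argmin, and the stopping tolerance are all illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, max_cycles=100, tol=1e-8):
    """Cyclic coordinatewise minimization: minimize f over one
    coordinate at a time, immediately reusing each new value."""
    x = np.array(x0, dtype=float)
    for _ in range(max_cycles):
        x_old = x.copy()
        for i in range(len(x)):
            def f_i(z, i=i):           # one-dimensional slice of f in coordinate i
                x[i] = z
                return f(x)
            x[i] = minimize_scalar(f_i).x
        if np.max(np.abs(x - x_old)) < tol:
            break
    return x

# Example on a smooth convex quadratic: answer should match -Q^{-1} b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ Q @ x + b @ x
print(coordinate_descent(f, np.zeros(2)), -np.linalg.solve(Q, b))
```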


Tseng (2001) proves that for such $f$ (provided $f$ is continuous on the compact set $\{x : f(x) \leq f(x^{(0)})\}$ and $f$ attains its minimum), any limit point of $x^{(k)}$, $k = 1, 2, 3, \ldots$ is a minimizer of $f$¹

Notes:

• Order of cycle through coordinates is arbitrary, can use any permutation of $\{1, 2, \ldots, n\}$

• Can everywhere replace individual coordinates with blocks of coordinates

• “One-at-a-time” update scheme is critical, and “all-at-once” scheme does not necessarily converge

• The analogy for solving linear systems: Gauss-Seidel versus Jacobi method

¹Using real analysis, we know that $x^{(k)}$ has a subsequence converging to $x^\star$ (Bolzano-Weierstrass), and $f(x^{(k)})$ converges to $f^\star$ (monotone convergence)

9



Example: linear regression

Given $y \in \mathbb{R}^n$, and $X \in \mathbb{R}^{n \times p}$ with columns $X_1, \ldots, X_p$, consider linear regression:

$$\min_\beta \; \frac{1}{2} \|y - X\beta\|_2^2$$

Minimizing over $\beta_i$, with all $\beta_j$, $j \neq i$ fixed:

$$0 = \nabla_i f(\beta) = X_i^T (X\beta - y) = X_i^T (X_i \beta_i + X_{-i} \beta_{-i} - y)$$

i.e., we take

$$\beta_i = \frac{X_i^T (y - X_{-i} \beta_{-i})}{X_i^T X_i}$$

Coordinate descent repeats this update for $i = 1, 2, \ldots, p, 1, 2, \ldots$

10
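A minimal numpy sketch of this update (illustrative, not the slides' code; the function name `cd_linreg` and the fixed number of cycles are arbitrary choices):

```python
import numpy as np

def cd_linreg(X, y, n_cycles=50):
    """Cyclic coordinate descent for least squares: exactly minimize
    over beta_i with all other coordinates held fixed."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)            # X_i^T X_i for each column
    for _ in range(n_cycles):
        for i in range(p):
            # partial residual with column i's contribution removed
            r_i = y - X @ beta + X[:, i] * beta[i]
            beta[i] = X[:, i] @ r_i / col_sq[i]
    return beta

# Example on random data: should approach the least-squares solution
rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 20)), rng.standard_normal(100)
print(np.max(np.abs(cd_linreg(X, y) - np.linalg.lstsq(X, y, rcond=None)[0])))
```

As written, each coordinate update recomputes $X\beta$ and so costs $O(np)$; the next slide shows how to bring this down to $O(n)$ per coordinate.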


Coordinate descent vs gradient descent for linear regression: 100 instances with $n = 100$, $p = 20$

[Figure: suboptimality $f(k) - f^\star$ (log scale, 1e$-$10 to 1e+02) versus iteration number $k$ (0 to 40), for gradient descent (GD) and coordinate descent (CD)]

11


Is it fair to compare 1 cycle of coordinate descent to 1 iteration of gradient descent? Yes, if we're clever

• Gradient descent: $\beta \leftarrow \beta + t X^T (y - X\beta)$, costs $O(np)$ flops

• Coordinate descent, one coordinate update:
$$\beta_i \leftarrow \frac{X_i^T (y - X_{-i} \beta_{-i})}{X_i^T X_i} = \frac{X_i^T r}{\|X_i\|_2^2} + \beta_i$$
where $r = y - X\beta$

• Each coordinate costs $O(n)$ flops: $O(n)$ to update $r$, $O(n)$ to compute $X_i^T r$

• One cycle of coordinate descent costs $O(np)$ operations, same as gradient descent

12
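A minimal sketch of the residual bookkeeping just described (illustrative, not from the slides): keep $r = y - X\beta$ up to date so that each coordinate update touches only one column of $X$.

```python
import numpy as np

def cd_linreg_fast(X, y, n_cycles=50):
    """Cyclic coordinate descent for least squares with a cached
    residual r = y - X @ beta, so each coordinate update is O(n)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    r = y.copy()                                     # residual for beta = 0
    for _ in range(n_cycles):
        for i in range(p):
            old = beta[i]
            beta[i] = X[:, i] @ r / col_sq[i] + old  # slide's update: X_i^T r / ||X_i||^2 + beta_i
            r -= X[:, i] * (beta[i] - old)           # O(n) residual update
    return beta
```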


[Figure: $f(k) - f^\star$ (log scale) versus $k$ for gradient descent (GD), accelerated gradient descent (AGD), and coordinate descent (CD)]

Same example, but now with accelerated gradient descent for comparison

Is this contradicting the optimality of accelerated gradient descent? No! Coordinate descent uses more than first-order information

13



Example: lasso regression

Consider the lasso problem:

$$\min_\beta \; \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

Note that the nonsmooth part is separable: $\|\beta\|_1 = \sum_{i=1}^p |\beta_i|$

Minimizing over $\beta_i$, with $\beta_j$, $j \neq i$ fixed:

$$0 = X_i^T X_i \beta_i + X_i^T (X_{-i} \beta_{-i} - y) + \lambda s_i$$

where $s_i \in \partial |\beta_i|$. Solution is simply given by soft-thresholding

$$\beta_i = S_{\lambda / \|X_i\|_2^2}\!\left( \frac{X_i^T (y - X_{-i} \beta_{-i})}{X_i^T X_i} \right)$$

Repeat this for $i = 1, 2, \ldots, p, 1, 2, \ldots$

14
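A minimal sketch of this lasso update (illustrative, not the slides' code), reusing the cached-residual trick and a small soft-thresholding helper defined here:

```python
import numpy as np

def soft_threshold(z, t):
    """S_t(z) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_lasso(X, y, lam, n_cycles=100):
    """Cyclic coordinate descent for 0.5*||y - X beta||_2^2 + lam*||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    r = y.copy()                               # residual y - X @ beta
    for _ in range(n_cycles):
        for i in range(p):
            old = beta[i]
            z = X[:, i] @ r / col_sq[i] + old  # unpenalized coordinatewise minimizer
            beta[i] = soft_threshold(z, lam / col_sq[i])
            r -= X[:, i] * (beta[i] - old)     # O(n) residual update
    return beta
```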


Proximal gradient vs coordinate descent for lasso regression: 100 instances with $n = 100$, $p = 20$

[Figure: $f(k) - f^\star$ (log scale) versus $k$ for proximal gradient (PG), accelerated proximal gradient (APG), and coordinate descent (CD)]

For same reasons as before:

• All methods use $O(np)$ flops per iteration

• Coordinate descent uses much more information

15


Example: box-constrained QP

Given $b \in \mathbb{R}^n$, $Q \in \mathbb{S}^n_+$, consider the box-constrained quadratic program

$$\min_x \; \frac{1}{2} x^T Q x + b^T x \quad \text{subject to} \quad l \leq x \leq u$$

Fits into our framework, as $I\{l \leq x \leq u\} = \sum_{i=1}^n I\{l_i \leq x_i \leq u_i\}$

Minimizing over $x_i$ with all $x_j$, $j \neq i$ fixed: same basic steps give

$$x_i = T_{[l_i, u_i]}\!\left( \frac{-b_i - \sum_{j \neq i} Q_{ij} x_j}{Q_{ii}} \right)$$

where $T_{[l_i, u_i]}$ is the truncation (projection) operator onto $[l_i, u_i]$:

$$T_{[l_i, u_i]}(z) = \begin{cases} u_i & \text{if } z > u_i \\ z & \text{if } l_i \leq z \leq u_i \\ l_i & \text{if } z < l_i \end{cases}$$

16
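A minimal numpy sketch of this update (illustrative, not the slides' code), with `np.clip` playing the role of the truncation operator $T_{[l_i,u_i]}$:

```python
import numpy as np

def cd_box_qp(Q, b, l, u, n_cycles=100):
    """Cyclic coordinate descent for
    min 0.5 * x^T Q x + b^T x  subject to  l <= x <= u  (Q_ii > 0 assumed)."""
    n = len(b)
    x = np.clip(np.zeros(n), l, u)
    for _ in range(n_cycles):
        for i in range(n):
            # unconstrained coordinatewise minimizer, then project onto [l_i, u_i]
            z = -(b[i] + Q[i] @ x - Q[i, i] * x[i]) / Q[i, i]
            x[i] = np.clip(z, l[i], u[i])
    return x
```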


Example: support vector machines

A coordinate descent strategy can be applied to the SVM dual:

$$\min_\alpha \; \frac{1}{2} \alpha^T \tilde{X} \tilde{X}^T \alpha - 1^T \alpha \quad \text{subject to} \quad 0 \leq \alpha \leq C1, \;\; \alpha^T y = 0$$

Sequential minimal optimization or SMO (Platt 1998) is basically blockwise coordinate descent in blocks of 2. Instead of cycling, it chooses the next block greedily

Recall the complementary slackness conditions

$$\alpha_i \big( 1 - \xi_i - (\tilde{X}\beta)_i - y_i \beta_0 \big) = 0, \quad i = 1, \ldots, n \qquad (1)$$

$$(C - \alpha_i)\, \xi_i = 0, \quad i = 1, \ldots, n \qquad (2)$$

where $\beta, \beta_0, \xi$ are the primal coefficients, intercept, and slacks. Recall that $\beta = \tilde{X}^T \alpha$, $\beta_0$ is computed from (1) using any $i$ such that $0 < \alpha_i < C$, and $\xi$ is computed from (1), (2)

17


SMO repeats the following two steps:

• Choose $\alpha_i, \alpha_j$ that do not satisfy complementary slackness, greedily (using heuristics)

• Minimize over $\alpha_i, \alpha_j$ exactly, keeping all other variables fixed

Using the equality constraint, this reduces to minimizing a univariate quadratic over an interval (from Platt 1998)

Note this does not meet separability assumptions for convergence from Tseng (2001), and a different treatment is required

Many further developments on coordinate descent for SVMs have been made; e.g., a recent one is Hsieh et al. (2008)

18


Coordinate descent in statistics and ML

History in statistics:

• Idea appeared in Fu (1998), and again in Daubechies et al. (2004), but was inexplicably ignored

• Three papers around 2007, especially Friedman et al. (2007), really sparked interest in statistics and ML communities

Why is it used?

• Very simple and easy to implement

• Careful implementations can be near state-of-the-art

• Scalable, e.g., don’t need to keep full data in memory

Examples: lasso regression, lasso GLMs (under proximal Newton), SVMs, group lasso, graphical lasso (applied to the dual), additive modeling, matrix completion, regression with nonconvex penalties

19


What’s in a name?

The name coordinate descent is confusing. For a smooth function $f$, the method that repeats

$$x_1^{(k)} = x_1^{(k-1)} - t_{k,1} \cdot \nabla_1 f\big(x_1^{(k-1)}, x_2^{(k-1)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_2^{(k)} = x_2^{(k-1)} - t_{k,2} \cdot \nabla_2 f\big(x_1^{(k)}, x_2^{(k-1)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$x_3^{(k)} = x_3^{(k-1)} - t_{k,3} \cdot \nabla_3 f\big(x_1^{(k)}, x_2^{(k)}, x_3^{(k-1)}, \ldots, x_n^{(k-1)}\big)$$
$$\ldots$$
$$x_n^{(k)} = x_n^{(k-1)} - t_{k,n} \cdot \nabla_n f\big(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, \ldots, x_n^{(k-1)}\big)$$

for $k = 1, 2, 3, \ldots$ is also (rightfully) called coordinate descent. If $f = g + h$, where $g$ is smooth and $h$ is separable, then the proximal version of the above is also called coordinate descent

These versions are often easier to apply than exact coordinatewise minimization, but the latter makes more progress per step

22
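A minimal sketch of this coordinate (proximal) gradient scheme (illustrative, not from the slides); `grad_i` and `prox_i` are caller-supplied callables introduced here, and the step size is taken fixed rather than per-iteration $t_{k,i}$:

```python
import numpy as np

def coordinate_gradient_descent(grad_i, x0, t, n_cycles=100, prox_i=None):
    """Cyclic coordinate gradient descent; with prox_i, its proximal version.

    grad_i(x, i):     i-th partial derivative of the smooth part g
    prox_i(z, t, i):  prox of t * h_i evaluated at z (omit if h = 0)
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_cycles):
        for i in range(len(x)):
            z = x[i] - t * grad_i(x, i)
            x[i] = z if prox_i is None else prox_i(z, t, i)
    return x

# Example: lasso via the proximal version, on hypothetical random data
rng = np.random.default_rng(0)
X, y, lam = rng.standard_normal((50, 10)), rng.standard_normal(50), 0.1
grad_i = lambda beta, i: X[:, i] @ (X @ beta - y)
prox_i = lambda z, t, i: np.sign(z) * max(abs(z) - t * lam, 0.0)
t = 1.0 / np.max(np.sum(X**2, axis=0))        # safe fixed step for every coordinate
beta = coordinate_gradient_descent(grad_i, np.zeros(10), t, prox_i=prox_i)
```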


Convergence analyses

Theory for coordinate descent moves quickly. The list given below is incomplete (may not be the latest and greatest). Warning: some references below treat coordinatewise minimization, some do not

• Convergence of coordinatewise minimization for solving linear systems, the Gauss-Seidel method, is a classic topic. E.g., see Golub and van Loan (1996), or Ramdas (2014) for a modern twist that looks at randomized coordinate descent

• Nesterov (2010) considers randomized coordinate descent for smooth functions and shows that it achieves a rate $O(1/\epsilon)$ under a Lipschitz gradient condition, and a rate $O(\log(1/\epsilon))$ under strong convexity

• Richtarik and Takac (2011) extend and simplify these results, considering smooth plus separable functions, where now each coordinate descent update applies a prox operation

23


• Saha and Tewari (2013) consider minimizing $\ell_1$-regularized functions of the form $g(\beta) + \lambda\|\beta\|_1$, for smooth $g$, and study both cyclic coordinate descent and cyclic coordinatewise minimization. Under (very strange) conditions on $g$, they show both methods dominate proximal gradient descent in iteration progress

• Beck and Tetruashvili (2013) study cyclic coordinate descent for smooth functions in general. They show that it achieves a rate $O(1/\epsilon)$ under a Lipschitz gradient condition, and a rate $O(\log(1/\epsilon))$ under strong convexity. They also extend these results to a constrained setting with projections

• Nutini et al. (2015) analyze greedy coordinate descent (called the Gauss-Southwell rule), and show it achieves a faster rate than randomized coordinate descent for certain problems

• Wright (2015) provides some unification and a great summary. Also covers parallel versions (even asynchronous ones)

• General theory is still not complete; still unanswered questions (e.g., are descent and minimization strategies the same?)

24


Screening rules

In some problems, screening rules can be used in combination with coordinate descent to further whittle down the active set. Screening rules themselves have amassed a sizeable literature recently. Here is an example, the SAFE rule for the lasso²:

$$|X_i^T y| < \lambda - \|X_i\|_2 \|y\|_2 \, \frac{\lambda_{\max} - \lambda}{\lambda_{\max}} \;\;\Longrightarrow\;\; \hat\beta_i = 0, \quad \text{all } i = 1, \ldots, p$$

where $\lambda_{\max} = \|X^T y\|_\infty$ (the smallest value of $\lambda$ such that $\hat\beta = 0$)

Note: this is not an if and only if statement! But it does give us a way of eliminating features a priori, without solving the lasso

(There have been many advances in screening rules for the lasso, but SAFE is the simplest, and was the first)

²El Ghaoui et al. (2010), "Safe feature elimination in sparse learning"

30
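A minimal sketch of applying the SAFE rule as a preprocessing step (illustrative, not from the slides); it returns a boolean mask of features that are guaranteed to have $\hat\beta_i = 0$ and can be dropped before running coordinate descent:

```python
import numpy as np

def safe_rule_discard(X, y, lam):
    """SAFE screening for the lasso: feature i can be discarded if
    |X_i^T y| < lam - ||X_i||_2 * ||y||_2 * (lam_max - lam) / lam_max."""
    corr = np.abs(X.T @ y)                 # |X_i^T y| for each feature
    lam_max = np.max(corr)                 # ||X^T y||_inf
    bound = lam - np.linalg.norm(X, axis=0) * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return corr < bound                    # True => hat(beta)_i = 0 at the solution
```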


Why is the SAFE rule true? Construction comes from the lasso dual:

$$\max_u \; g(u) \quad \text{subject to} \quad \|X^T u\|_\infty \leq \lambda$$

where $g(u) = \frac{1}{2}\|y\|_2^2 - \frac{1}{2}\|y - u\|_2^2$. Suppose that $u_0$ is dual feasible (e.g., take $u_0 = y \cdot \lambda / \lambda_{\max}$). Then $\gamma = g(u_0)$ is a lower bound on the dual optimal value, so the dual problem is equivalent to

$$\max_u \; g(u) \quad \text{subject to} \quad \|X^T u\|_\infty \leq \lambda, \;\; g(u) \geq \gamma$$

Now consider computing

$$m_i = \max_u \; |X_i^T u| \quad \text{subject to} \quad g(u) \geq \gamma, \quad \text{for } i = 1, \ldots, p$$

Then we would have

$$m_i < \lambda \;\Longrightarrow\; |X_i^T u| < \lambda \;\Longrightarrow\; \hat\beta_i = 0, \quad i = 1, \ldots, p$$

The last implication comes from the KKT conditions

31


Another dual argument shows that

$$\max_u \; X_i^T u \;\;\text{subject to}\;\; g(u) \geq \gamma
\;=\; \min_{\mu > 0} \; -\gamma\mu + \frac{1}{2\mu}\|\mu y + X_i\|_2^2
\;=\; \|X_i\|_2 \sqrt{\|y\|_2^2 - 2\gamma} + X_i^T y$$

where the last equality comes from direct calculation

Thus $m_i$ is given by the maximum of the above quantity over $\pm X_i$,

$$m_i = \|X_i\|_2 \sqrt{\|y\|_2^2 - 2\gamma} + |X_i^T y|, \quad i = 1, \ldots, p$$

Lastly, substitute $\gamma = g(y \cdot \lambda / \lambda_{\max})$. Then $m_i < \lambda$ is precisely the SAFE rule given on the previous slide

33
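To spell out that last substitution (a short worked step, not on the original slide): with $u_0 = y \cdot \lambda/\lambda_{\max}$,

$$\gamma = g\Big(\tfrac{\lambda}{\lambda_{\max}} y\Big) = \tfrac{1}{2}\|y\|_2^2 - \tfrac{1}{2}\Big(1 - \tfrac{\lambda}{\lambda_{\max}}\Big)^2 \|y\|_2^2
\;\;\Longrightarrow\;\;
\|y\|_2^2 - 2\gamma = \Big(1 - \tfrac{\lambda}{\lambda_{\max}}\Big)^2 \|y\|_2^2,$$

so $m_i < \lambda$ becomes $\|X_i\|_2 \|y\|_2 \big(1 - \lambda/\lambda_{\max}\big) + |X_i^T y| < \lambda$, i.e., $|X_i^T y| < \lambda - \|X_i\|_2 \|y\|_2 \,(\lambda_{\max} - \lambda)/\lambda_{\max}$, which is exactly the SAFE rule.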


References

Early coordinate descent references in optimization:

• D. Bertsekas and J. Tsitsiklis (1989), "Parallel and distributed computation: numerical methods"

• Z. Luo and P. Tseng (1992), "On the convergence of the coordinate descent method for convex differentiable minimization"

• J. Ortega and W. Rheinboldt (1970), "Iterative solution of nonlinear equations in several variables"

• P. Tseng (2001), "Convergence of a block coordinate descent method for nondifferentiable minimization"

35


Early coordinate descent references in statistics and ML:

• I. Daubechies and M. Defrise and C. De Mol (2004), "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint"

• J. Friedman and T. Hastie and H. Hoefling and R. Tibshirani (2007), "Pathwise coordinate optimization"

• W. Fu (1998), "Penalized regressions: the bridge versus the lasso"

• T. Wu and K. Lange (2008), "Coordinate descent algorithms for lasso penalized regression"

• A. van der Kooij (2007), "Prediction accuracy and stability of regression with optimal scaling transformations"

36


Applications of coordinate descent:

• O. Banerjee and L. Ghaoui and A. d’Aspremont (2007), "Model selection through sparse maximum likelihood estimation"

• J. Friedman and T. Hastie and R. Tibshirani (2007), "Sparse inverse covariance estimation with the graphical lasso"

• J. Friedman and T. Hastie and R. Tibshirani (2009), "Regularization paths for generalized linear models via coordinate descent"

• C.J. Hsieh and K.W. Chang and C.J. Lin and S. Keerthi and S. Sundararajan (2008), "A dual coordinate descent method for large-scale linear SVM"

• R. Mazumder and J. Friedman and T. Hastie (2011), "SparseNet: coordinate descent with non-convex penalties"

• J. Platt (1998), "Sequential minimal optimization: a fast algorithm for training support vector machines"

37


Recent theory for coordinate descent:

• A. Beck and L. Tetruashvili (2013), "On the convergence of block coordinate descent type methods"

• Y. Nesterov (2010), "Efficiency of coordinate descent methods on huge-scale optimization problems"

• J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, H. Koepke (2015), "Coordinate descent converges faster with the Gauss-Southwell rule than random selection"

• A. Ramdas (2014), "Rows vs columns for linear systems of equations—randomized Kaczmarz or coordinate descent?"

• P. Richtarik and M. Takac (2011), "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function"

• A. Saha and A. Tewari (2013), "On the nonasymptotic convergence of cyclic coordinate descent methods"

• S. Wright (2015), "Coordinate descent algorithms"

38


Screening rules and graphical lasso references:

• L. El Ghaoui and V. Viallon and T. Rabbani (2010), "Safe feature elimination in sparse supervised learning"

• R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani (2011), "Strong rules for discarding predictors in lasso-type problems"

• R. Mazumder and T. Hastie (2011), "The graphical lasso: new insights and alternatives"

• R. Mazumder and T. Hastie (2011), "Exact covariance thresholding into connected components for large-scale graphical lasso"

• J. Wang, P. Wonka, and J. Ye (2015), "Lasso screening rules via dual polytope projection"

• D. Witten and J. Friedman and N. Simon (2011), "New insights and faster computations for the graphical lasso"

39