
Coordinate Minimization

Daniel P. Robinson
Department of Applied Mathematics and Statistics

Johns Hopkins University

November 19, 2019

Outline

1 Introduction

2 Algorithms
   Cyclic order with exact minimization
   Cyclic order with fixed step size
   Steepest direction (Gauss-Southwell rule) with fixed step size
   Alternatives
   Summary

3 Examples
   Linear equations
   Logistic regression

Given a function f : Rn → R, consider the unconstrained optimization problem

    minimize_{x∈Rn} f(x)    (1)

We will consider various assumptions on f:
- nonconvex and differentiable f
- convex and differentiable f
- strongly convex and differentiable f

We will not consider general non-smooth f, since no meaningful convergence guarantees can be proved in that generality.

We will briefly consider structured non-smooth problems, i.e., problems that use an additional (separable) regularizer.

Notation: fk := f(xk) and gk := ∇f(xk)

Basic idea (coordinate minimization): Compute the next iterate using the update

xk+1 = xk − αkei(k)

Algorithm 1 General coordinate minimization framework.

1: Choose x0 ∈ Rn and set k ← 0.
2: loop
3:    Choose i(k) ∈ {1, 2, . . . , n}.
4:    Choose αk > 0.
5:    Set xk+1 ← xk − αk ei(k).
6:    Set k ← k + 1.
7: end loop

αk is the step size. Options include:
- fixed, but sufficiently small
- inexact linesearch
- exact linesearch

i(k) ∈ {1, 2, . . . , n} has to be chosen. Options include:
- cycle through the entire set
- choose it uniformly at random
- choose it based on which element of ∇f(xk) is the largest in absolute value

ei(k) is the i(k)-th coordinate vector

this update seeks better points in span{ei(k)}.
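To make the framework concrete, the following is a minimal Python sketch of Algorithm 1 (not from the slides); the coordinate rule and the signed coordinate step are passed in as functions, and all names used here are illustrative.

    import numpy as np

    def coordinate_minimization(x0, index_rule, coordinate_step, num_iters=1000):
        # Minimal sketch of Algorithm 1: xk+1 = xk - alpha_k * e_{i(k)}.
        #   index_rule(k, x)        : returns the coordinate i(k) in {0, ..., n-1}
        #   coordinate_step(k, i, x): returns the signed displacement alpha_k along e_{i(k)}
        x = np.asarray(x0, dtype=float).copy()
        for k in range(num_iters):
            i = index_rule(k, x)
            x[i] -= coordinate_step(k, i, x)
        return x

    # Example: cyclic order with step (1/Lmax) * grad_i f(xk) on the separable quadratic
    # f(x) = 0.5 * sum(lam_i * x_i**2), whose i-th partial derivative is lam_i * x_i.
    lam = np.array([1.0, 2.0, 5.0])
    x_final = coordinate_minimization(
        x0=np.ones(3),
        index_rule=lambda k, x: k % lam.size,
        coordinate_step=lambda k, i, x: (1.0 / lam.max()) * lam[i] * x[i],
    )
    print(x_final)  # approaches the minimizer (0, 0, 0)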


Algorithm 2 Coordinate minimization with cyclic order and exact minimization.

1: Choose x0 ∈ Rn and set k ← 0.
2: loop
3:    Choose i(k) = mod(k, n) + 1.
4:    Calculate the exact coordinate minimizer: αk ← argmin_{α∈R} f(xk − α ei(k)).
5:    Set xk+1 ← xk − αk ei(k).
6:    Set k ← k + 1.
7: end loop

Comments:
- This algorithm assumes that the exact minimizers exist and that they are unique.
- A reasonable stopping condition should be incorporated, such as

    ‖∇f(xk)‖₂ ≤ 10⁻⁶ max{1, ‖∇f(x0)‖₂}
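The following is a minimal Python sketch of Algorithm 2 (not from the slides), using scipy.optimize.minimize_scalar for the one-dimensional exact minimization and the stopping condition above; f and grad_f are assumed user-supplied callables, and the quadratic at the end is only an illustrative test problem.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def cyclic_exact_minimization(f, grad_f, x0, max_iters=10000, tol=1e-6):
        # Sketch of Algorithm 2: cyclic order with exact 1-D minimization.
        x = np.asarray(x0, dtype=float).copy()
        n = x.size
        g0 = np.linalg.norm(grad_f(x))
        for k in range(max_iters):
            if np.linalg.norm(grad_f(x)) <= tol * max(1.0, g0):  # stopping condition above
                return x
            e_i = np.zeros(n)
            e_i[k % n] = 1.0                                      # i(k) = mod(k, n) + 1 (0-based here)
            # alpha_k = argmin_alpha f(xk - alpha * e_i), computed numerically
            alpha = minimize_scalar(lambda a: f(x - a * e_i)).x
            x = x - alpha * e_i
        return x

    # Example usage on a small strictly convex quadratic f(x) = 0.5 x^T H x - b^T x.
    H = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    f = lambda x: 0.5 * x @ H @ x - b @ x
    grad_f = lambda x: H @ x - b
    print(cyclic_exact_minimization(f, grad_f, np.zeros(2)))  # close to np.linalg.solve(H, b)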

An interesting example introduced by Powell [5, formula 2] is

    minimize_{(x1,x2,x3)} f(x1, x2, x3) := −(x1x2 + x2x3 + x1x3) + ∑_{i=1}^{3} (|xi| − 1)₊²,

where (t)₊ := max{t, 0}.

- f is continuously differentiable and nonconvex.
- f has minimizers at the vertices (−1,−1,−1) and (1, 1, 1) of the unit cube.
- Coordinate descent with exact minimization started just outside the unit cube near any nonoptimal vertex cycles around neighborhoods of all 6 non-optimal vertices.
- Powell shows that the cyclic nonconvergence behavior is special and is destroyed by small perturbations on this particular example.

Figure: Three-dimensional example given above, showing the possible lack of convergence of a coordinate descent method with exact minimization. This example and others in [5] show that we cannot expect a general convergence result for nonconvex functions similar to that for full-gradient descent. (Picture taken from [7].)

Theorem 2.1 (see [6, Theorem 5.32])
Assume that the following hold:
- f is continuously differentiable;
- the level set L0 := {x ∈ Rn : f(x) ≤ f(x0)} is bounded; and
- for every x ∈ L0 and all j ∈ {1, 2, . . . , n}, the optimization problem minimize_{ζ∈R} f(x + ζej) has a unique minimizer.
Then, any limit point x∗ of the sequence {xk} generated by Algorithm 2 satisfies

    ∇f(x∗) = 0.

Proof: Since f(xk+1) ≤ f(xk) for all k ∈ N, we know that the sequence {xk} ⊂ L0. Since L0 is bounded, {xk} has at least one limit point; let x∗ be any such limit point. Thus, there exists a subsequence K ⊆ N satisfying

    lim_{k∈K} xk = x∗.    (2)

Combining this with monotonicity of {f(xk)} and continuity of f also shows that

    lim_{k→∞} f(xk) = f(x∗) and f(xk) ≥ f(x∗) for all k ∈ N.    (3)

We assume ∇f(x∗) ≠ 0, and then reach a contradiction.

First, consider the subsets Ki ⊆ K, i = 0, . . . , n− 1 defined as

Ki := {k ∈ K : k ≡ i mod n}.

Since K is an infinite subsequence of the natural numbers, one of the Ki must be an infinite set. Without loss of generality, we assume it is K0 (the argument is very similar for any other i, because we are using cyclic order).

Let us perform a hypothetical “sweep” of coordinate minimization starting from x∗, so that we would obtain

    y∗ := x∗n, with x∗ℓ := x∗ + ∑_{j=1}^{ℓ} [τ∗]j ej for all ℓ = 1, . . . , n,

where [τ∗]j denotes the exact coordinate minimizer along ej computed at the j-th step of this sweep. Note that since ∇f(x∗) ≠ 0 by assumption, we must have

    f(y∗) < f(x∗). (why?)    (4)

NOTE: If Ki were infinite for some i ≠ 0, then we would do the above “sweep” at x∗ starting with coordinate i and going in cyclic order to cover all n coordinates.


Next, notice that, by construction of the coordinate minimization scheme,

    xk+ℓ = xk + ∑_{j=1}^{ℓ} τk+j−1 ej for all k ∈ K0 and 1 ≤ ℓ ≤ n,    (5)

where τi is the step length at iteration i. Therefore,

    ‖xk+ℓ − xk‖ = ‖(τk, τk+1, . . . , τk+ℓ−1)‖ ≤ 2 max{‖x‖ : x ∈ L0} < ∞

for all k ∈ K0 and 1 ≤ ℓ ≤ n, where we used the assumption that L0 is bounded (both xk and xk+ℓ lie in L0). Since this shows that the set {(τk, τk+1, . . . , τk+n−1)ᵀ}k∈K0 is bounded, we may pass to a subsequence K′ ⊆ K0 with

    lim_{k∈K′} (τk, τk+1, . . . , τk+n−1)ᵀ = τ^L for some τ^L ∈ Rn.    (6)

Taking the limit of (5) over k ∈ K′ ⊆ K0 for each ℓ, and using (2) and (6), we find that

    lim_{k∈K′} xk+ℓ = x∗ + ∑_{j=1}^{ℓ} [τ^L]j ej for each 1 ≤ ℓ ≤ n.    (7)

We next claim the following, which we will prove by induction:

    [τ^L]p = [τ∗]p for all 1 ≤ p ≤ n, and    (8)
    lim_{k∈K′} xk+p = x∗p for all 1 ≤ p ≤ n.    (9)

Base case: p = 1. We know from the coordinate minimization that

    f(xk+1) ≤ f(xk + τ e1) for all k ∈ K0 and τ ∈ R.

Taking limits over k ∈ K′ ⊆ K and using continuity of f, (7) with ℓ = 1, and (2) yields

    f(x∗ + [τ^L]1 e1) = f(lim_{k∈K′} xk+1) = lim_{k∈K′} f(xk+1) ≤ lim_{k∈K′} f(xk + τ e1)
                      = f(lim_{k∈K′} xk + τ e1) = f(x∗ + τ e1) for all τ ∈ R.

Since the minimizations in coordinate directions are unique by assumption, we know that [τ^L]1 = [τ∗]1, which is the first desired result. Also, combining it with (7) gives

    lim_{k∈K′} xk+1 = x∗ + [τ^L]1 e1 = x∗ + [τ∗]1 e1 ≡ x∗1,

which completes the base case.

Induction step: assume that (8) and (9) hold for all 1 ≤ p ≤ p̄, for some p̄ ≤ n − 1. We know from the coordinate minimization that

    f(xk+p̄+1) ≤ f(xk+p̄ + τ ep̄+1) for all k ∈ K0 and τ ∈ R.

Taking the limit over k ∈ K′, continuity of f, (7) with ℓ = p̄ + 1, and (9) give

    f(x∗ + ∑_{j=1}^{p̄+1} [τ^L]j ej) = f(lim_{k∈K′} xk+p̄+1) = lim_{k∈K′} f(xk+p̄+1) ≤ lim_{k∈K′} f(xk+p̄ + τ ep̄+1)
                                    = f(lim_{k∈K′} xk+p̄ + τ ep̄+1) = f(x∗p̄ + τ ep̄+1) for all τ ∈ R.

Thus, the definition of x∗p̄ and the fact that (8) holds for all 1 ≤ p ≤ p̄ show that

    f(x∗p̄ + [τ^L]p̄+1 ep̄+1) = f(x∗ + ∑_{j=1}^{p̄} [τ∗]j ej + [τ^L]p̄+1 ep̄+1)
                            = f(x∗ + ∑_{j=1}^{p̄} [τ^L]j ej + [τ^L]p̄+1 ep̄+1)
                            = f(x∗ + ∑_{j=1}^{p̄+1} [τ^L]j ej) ≤ f(x∗p̄ + τ ep̄+1) for all τ ∈ R.

Uniqueness of the minimizer implies [τ^L]p̄+1 = [τ∗]p̄+1, and combining this with (7) gives

    lim_{k∈K′} xk+p̄+1 = x∗ + ∑_{j=1}^{p̄+1} [τ^L]j ej = x∗ + ∑_{j=1}^{p̄+1} [τ∗]j ej ≡ x∗p̄+1,

which completes the proof by induction.

From our induction proof, we have that τ∗ = τ^L. Combining this with (7) and the definition of y∗ gives

    lim_{k∈K′} xk+n = x∗ + ∑_{j=1}^{n} [τ^L]j ej = x∗ + ∑_{j=1}^{n} [τ∗]j ej ≡ x∗n ≡ y∗.    (10)

Finally, combining (3), continuity of f, (10), and (4) shows that

    f(x∗) = lim_{k∈K′} f(xk+n) = f(lim_{k∈K′} xk+n) = f(y∗) < f(x∗),

which is a contradiction. This completes the proof.


Notation:
- Let Lj denote the jth component Lipschitz constant, i.e., it satisfies

    |∇j f(x + t ej) − ∇j f(x)| ≤ Lj |t| for all x ∈ Rn and t ∈ R.

- Let Lmax denote the coordinate Lipschitz constant, i.e., Lmax := max_{1≤i≤n} Li.
- Let L denote the Lipschitz constant for ∇f.
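As a quick illustration (not from the slides): for a quadratic f(x) = gᵀx + ½ xᵀHx with H symmetric positive semidefinite, we have ∇f(x) = g + Hx, so ∇j f(x + t ej) − ∇j f(x) = t Hjj. Hence Lj = Hjj, Lmax = max_j Hjj, and L = ‖H‖₂ = λmax(H); in particular, if H is diagonal then L = Lmax.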

Algorithm 3 Coordinate minimization with cyclic order and a fixed step size.

1: Choose α ∈ (0, 1/Lmax].
2: Choose x0 ∈ Rn and set k ← 0.
3: loop
4:    Choose i(k) = mod(k, n) + 1.
5:    Set xk+1 ← xk − α ∇i(k) f(xk) ei(k).
6:    Set k ← k + 1.
7: end loop

Comments:
- A reasonable stopping condition should be incorporated, such as

    ‖∇f(xk)‖₂ ≤ 10⁻⁶ max{1, ‖∇f(x0)‖₂}

- A maximum number of allowed iterations should be included in practice.

Theorem 2.2 (see [1, Theorems 3.6 and 3.9] and [7, Theorem 3])
Suppose that α = 1/Lmax and let the following assumptions hold:
- ∇f is globally Lipschitz continuous;
- f has a minimizer x∗ and f∗ := f(x∗) = min_{x∈Rn} f(x);
- there exists a scalar R0 such that the diameter of the level set {x ∈ Rn : f(x) ≤ f(x0)} is bounded by R0.
Then, the iterate sequence {xk} of Algorithm 3 satisfies

    min_{k=0,...,T} ‖∇f(xk)‖ ≤ √( 4nLmax(1 + nL²/Lmax²)(f(x0) − f∗) / (T + 1) )  ∀T ∈ {n, 2n, 3n, . . . }.    (11)

If f is convex, then

    f(xT) − f∗ ≤ 4nLmax(1 + nL²/Lmax²)R0² / (T + 8)  ∀T ∈ {n, 2n, 3n, . . . }.    (12)

If f is µ-strongly convex, then

    f(xT) − f∗ ≤ (1 − µ/(2Lmax(1 + nL²/Lmax²)))^(T/n) (f(x0) − f∗)  ∀T ∈ {n, 2n, 3n, . . . }.

Proof: See [1, Lemma 3.3, Theorem 3.6 and Theorem 3.9] and use (i) each iteration “k” in [1] is a cycle of n iterations; (ii) choose in [1] the values Li = Lmax for all i; (iii) in [1] we have p = 1 since our blocks of variables are singletons, i.e., coordinate descent.

Comments on Theorem 2.2:
- The numerator in (11) and (12) is O(n²), while the numerator in the analogous result for the random coordinate choice with fixed step size is O(n) (see the SGD notes and HW 4). But Theorem 2.2 is a deterministic result, while the result for random coordinate choice is in expectation.
- Recall also that with the full gradient the iteration complexity has no dependence on n, but every iteration is itself n times more expensive than a cyclic or random coordinate iteration.
- It can be shown that L ≤ ∑_{j=1}^{n} Lj (see [3, Lemma 2 with α = 1]).
- It follows from the fact that

    |∇j f(x + t ej) − ∇j f(x)| ≤ ‖∇f(x + t ej) − ∇f(x)‖₂ ≤ L|t|

  holds for all j, t, and x that Lj ≤ L.
- By combining the previous two bullet points, we find that

    Lmax ≡ max_j Lj ≤ L ≤ ∑_{j=1}^{n} Lj ≤ nLmax, so that 1 ≤ L/Lmax ≤ n.

- Roughly speaking, L/Lmax is closer to 1 when the coordinates are more “decoupled”. In light of (12), the complexity result for coordinate descent becomes better as the variables become more decoupled. This makes sense!


Notation (recall):
- Lj denotes the jth component Lipschitz constant, i.e., |∇j f(x + t ej) − ∇j f(x)| ≤ Lj |t| for all x ∈ Rn and t ∈ R.
- Lmax := max_{1≤i≤n} Li denotes the coordinate Lipschitz constant.

Algorithm 4 Coordinate minimization with Gauss-Southwell Rule and a fixed step size.

1: Choose α ∈ (0, 1/Lmax].
2: Choose x0 ∈ Rn and set k ← 0.
3: loop
4:    Calculate i(k) as the steepest coordinate direction, i.e., i(k) ← argmax_{1≤i≤n} |∇i f(xk)|.
5:    Set xk+1 ← xk − α ∇i(k) f(xk) ei(k).
6:    Set k ← k + 1.
7: end loop

Comments:
- A reasonable stopping condition should be incorporated, such as

    ‖∇f(xk)‖₂ ≤ 10⁻⁶ max{1, ‖∇f(x0)‖₂}
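A minimal Python sketch of Algorithm 4 (not from the slides) on a small strictly convex quadratic; the full gradient is formed at each iteration only to make the argmax rule explicit.

    import numpy as np

    # Gauss-Southwell rule with fixed step alpha = 1/Lmax on f(x) = 0.5 x^T H x - b^T x.
    H = np.array([[3.0, 0.5, 0.0],
                  [0.5, 2.0, 0.3],
                  [0.0, 0.3, 1.0]])
    b = np.array([1.0, -2.0, 0.5])
    Lmax = np.max(np.diag(H))        # component Lipschitz constants are L_i = H_ii
    alpha = 1.0 / Lmax

    x = np.zeros(3)
    for k in range(500):
        g = H @ x - b                # gradient of f at xk
        i = np.argmax(np.abs(g))     # steepest coordinate: i(k) = argmax_i |grad_i f(xk)|
        x[i] -= alpha * g[i]         # xk+1 = xk - alpha * grad_{i(k)} f(xk) * e_{i(k)}
    print(x, np.linalg.solve(H, b))  # the two should agree closely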

Theorem 2.3
Suppose that α = 1/Lmax and let the following assumptions hold:
- ∇f is globally Lipschitz continuous;
- f has a minimizer x∗ and f∗ := f(x∗) = min_{x∈Rn} f(x);
- there exists a scalar R0 such that the diameter of the level set {x ∈ Rn : f(x) ≤ f(x0)} is bounded by R0.
Then, the iterate sequence {xk} computed from Algorithm 4 satisfies

    min_{k=0,...,T} ‖∇f(xk)‖ ≤ √( 2nLmax(f(x0) − f∗) / (T + 1) )  ∀T ≥ 1.

If f is convex, then

    f(xT) − f∗ ≤ 2nLmaxR0² / T  ∀T ≥ 1.

If f is µ-strongly convex, then

    f(xT) − f∗ ≤ (1 − µ/(nLmax))^T (f(x0) − f∗)  ∀T ≥ 1.

Proof: We recall our fundamental inequality (see the “Stochastic Gradient Descent” lecture notes on random coordinate choice)

    f(xk+1) ≤ f(xk) − (1/(2Lmax)) (∇i(k) f(xk))².

Combining this with the choice i(k) ← argmax_{1≤i≤n} |∇i f(xk)| and the standard norm inequality ‖v‖₂ ≤ √n ‖v‖∞, it holds that

    f(xk+1) ≤ f(xk) − (1/(2Lmax)) (∇i(k) f(xk))²
            = f(xk) − (1/(2Lmax)) ‖∇f(xk)‖∞²    (13)
            ≤ f(xk) − (1/(2nLmax)) ‖∇f(xk)‖₂².    (14)

Recall equation (2.1) from the “Smooth Convex Optimization” lecture notes:

    f(xk) − f∗ ≤ R0 ‖∇f(xk)‖₂.

Substituting the above into (14) gives

    f(xk+1) − f∗ ≤ f(xk) − f∗ − (1/(2nLmax)) ‖∇f(xk)‖₂² ≤ f(xk) − f∗ − (1/(2nLmaxR0²)) (f(xk) − f∗)².

Using the notation ∆k := f(xk) − f∗, this is equivalent to

    ∆k+1 ≤ ∆k − (1/(2nLmaxR0²)) ∆k².

From the previous slide, we have

    ∆k+1 ≤ ∆k − (1/(2nLmaxR0²)) ∆k²,

which is exactly the same as inequality (2.2) from the “Smooth Convex Optimization” lecture notes, except with an extra factor of the dimension n in the denominator of the last term. Then, as shown in that proof, we have

    f(xk) − f∗ = ∆k ≤ 2nLmaxR0² / k,

which is the desired result for convex f.


Next, assume that f is µ-strongly convex, and recall the following inequality we established in the proof of Theorem 2.3 of the “Smooth Convex Optimization” lecture notes:

    f∗ ≥ f(xk) − (1/(2µ)) ‖∇f(xk)‖₂².

Subtracting f∗ from each side of (14) and then using the previous inequality shows that

    f(xk+1) − f∗ ≤ f(xk) − f∗ − (1/(2nLmax)) ‖∇f(xk)‖₂²
                 ≤ f(xk) − f∗ − (µ/(nLmax)) (f(xk) − f∗)
                 = (1 − µ/(nLmax)) (f(xk) − f∗),

so that

    f(xk) − f∗ ≤ (1 − µ/(nLmax))^k (f(x0) − f∗),

which is the last desired result.

Comments so far for fixed step size:
- Cyclic has the worst dependence on n:
  - Cyclic: O(n²)
  - Random and Gauss-Southwell: O(n)
    - Random is a rate in expectation.
    - Gauss-Southwell is a deterministic rate.
- But Gauss-Southwell has O(n) complexity for each iteration (similar to full-gradient descent), whereas cyclic and random choice have O(1) complexity for each iteration.
- There is a better analysis for Gauss-Southwell when we assume that f is strongly convex that changes the above comment! (See [4].) We show this next.

Theorem 2.4
Suppose that α = 1/Lmax and let the following assumptions hold:
- f is ℓ1-strongly convex, i.e., there exists µ1 > 0 such that

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (µ1/2) ‖y − x‖₁² for all {x, y} ⊂ Rn;

- ∇f is globally Lipschitz continuous;
- the minimum value of f is attained.
Then, the iterate sequence {xk} computed from Algorithm 4 satisfies

    f(xk) − f∗ ≤ (1 − µ1/Lmax)^k (f(x0) − f∗).

Proof (see [4]): Using ℓ1-strong convexity means that

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (µ1/2) ‖y − x‖₁² for all {x, y} ⊂ Rn

for the ℓ1-strong convexity parameter µ1. If we now minimize both sides with respect to y and replace x by xk, we find that

    f∗ = minimize_{y∈Rn} f(y)
       ≥ minimize_{y∈Rn} { f(xk) + ∇f(xk)ᵀ(y − xk) + (µ1/2) ‖y − xk‖₁² }
       = f(xk) + ∇f(xk)ᵀ(y∗k − xk) + (µ1/2) ‖y∗k − xk‖₁² (why? exercise)
       = f(xk) − (1/(2µ1)) ‖∇f(xk)‖∞²,

where y∗k := xk + z∗k with

    [z∗k]i := 0 if i ≠ ℓ, and [z∗k]i := −∇i f(xk)/µ1 if i = ℓ,

and ℓ is any index satisfying ℓ ∈ { j : |∇j f(xk)| = ‖∇f(xk)‖∞ }. Therefore, we have that

    ‖∇f(xk)‖∞² ≥ 2µ1 (f(xk) − f∗).


From the previous slide, we showed that

    ‖∇f(xk)‖∞² ≥ 2µ1 (f(xk) − f∗).

Subtracting f∗ from both sides of (13) and using the previous inequality shows that

    f(xk+1) − f∗ ≤ f(xk) − f∗ − (1/(2Lmax)) ‖∇f(xk)‖∞²
                 ≤ f(xk) − f∗ − (µ1/Lmax) (f(xk) − f∗)
                 = (1 − µ1/Lmax) (f(xk) − f∗).

Applying this inequality recursively gives

    f(xk) − f∗ ≤ (1 − µ1/Lmax)^k (f(x0) − f∗),

which is the desired result.

For strongly convex functions:
- Random coordinate choice has the expected rate of

    E[f(xk)] − f∗ ≤ (1 − µ/(nLmax))^k (f(x0) − f∗).

- Gauss-Southwell coordinate choice has the deterministic rate of

    f(xk) − f∗ ≤ (1 − µ1/Lmax)^k (f(x0) − f∗).    (15)

The bound for Gauss-Southwell is better since

    µ/n ≤ µ1 ≤ µ,

so that

    µ1 ≥ µ/n  ⇐⇒  µ1/Lmax ≥ µ/(nLmax)  ⇐⇒  (1 − µ1/Lmax) ≤ (1 − µ/(nLmax)).

Example: A Simple Diagonal Quadratic Function

Consider the problem

    minimize_{x∈Rn} gᵀx + ½ xᵀHx, where H = diag(λ1, λ2, . . . , λn)

with λi > 0 for all i ∈ {1, 2, . . . , n}. For this problem, we know that

    µ = min{λ1, λ2, . . . , λn} and µ1 = ( ∑_{i=1}^{n} 1/λi )⁻¹.

Case 1: For λ1 = α for some α > 0, the minimum value for µ1 occurs when α = λ1 = λ2 = · · · = λn, which gives

    µ = α and µ1 = α/n.

Thus, the convergence constants are:

    (random selection):           1 − µ/(nLmax) = 1 − α/(nLmax)
    (Gauss-Southwell selection):  1 − µ1/Lmax = 1 − α/(nLmax)

so the convergence constants are the same; this is the worst case for Gauss-Southwell.

Case 2: For this other extreme case, let us suppose that

    λ1 = β and λ2 = λ3 = · · · = λn = α

with α ≥ β. For this case, it can be shown that

    µ = β and µ1 = βα^(n−1) / (α^(n−1) + (n − 1)βα^(n−2)) = βα / (α + (n − 1)β).

If we now take the limit as α → ∞, we find that

    µ = β and µ1 → β = µ.

Thus, the convergence constants (in the limit) are:

    (random selection):           1 − µ/(nLmax) = 1 − β/(nLmax)
    (Gauss-Southwell selection):  1 − µ1/Lmax = 1 − β/Lmax

so that Gauss-Southwell is a factor n faster than using a random coordinate selection.
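A small numeric check of Case 2 (a sketch, not from the slides): for a diagonal Hessian the quantities µ, µ1, and Lmax are immediate, so the two convergence constants can be compared directly.

    import numpy as np

    n, alpha_, beta = 10, 1.0e4, 1.0    # lambda_1 = beta, lambda_2 = ... = lambda_n = alpha >> beta
    lam = np.full(n, alpha_)
    lam[0] = beta                        # eigenvalues of H = diag(lam)

    mu = lam.min()                       # strong convexity parameter
    mu1 = 1.0 / np.sum(1.0 / lam)        # l1-strong convexity parameter for a diagonal quadratic
    Lmax = lam.max()

    print("random (uniform sampling):", 1 - mu / (n * Lmax))
    print("Gauss-Southwell:          ", 1 - mu1 / Lmax)
    # With alpha >> beta we get mu1 ~ beta = mu, so the gap 1 - constant is roughly
    # a factor n larger for Gauss-Southwell, i.e., Gauss-Southwell is about n times faster.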


Alternative 1 (strongly convex): individual coordinate Lipschitz constants.

The iteration update is

    xk+1 = xk − (1/Li(k)) ∇i(k) f(xk) ei(k).

Using a similar analysis as before, it can be shown that

    f(xk) − f∗ ≤ [ ∏_{j=1}^{k} (1 − µ1/Lij) ] (f(x0) − f∗).

This gives a better decrease than the prior analysis since (see (15))

    new rate = ∏_{j=1}^{k} (1 − µ1/Lij) ≤ (1 − µ1/Lmax)^k = previous rate,

i.e., it is faster provided at least one of the used Lij satisfies Lij < Lmax.

Alternative 2 (strongly convex): Lipschitz sampling.

Use a random coordinate direction chosen using a non-uniform probability distribution:

    P(i(k) = j) = Lj / ∑_{ℓ=1}^{n} Lℓ for all j ∈ {1, 2, . . . , n}.

Using an analysis similar to the previous one, but using the new probability distribution when computing the expectation, it can be shown that

    E[f(xk+1)] − f∗ ≤ (1 − µ/(nL̄)) (E[f(xk)] − f∗)

with L̄ being the average component Lipschitz constant, i.e., L̄ := (1/n) ∑_{i=1}^{n} Li.

The analysis was first performed in [2].

This rate is faster than uniform random sampling if not all of the component Lipschitz constants are the same.

Alternative 3 (strongly convex): Gauss-Southwell-Lipschitz rule.

Choose i(k) according to the rule

    i(k) ← argmax_{1≤i≤n} (∇i f(xk))² / Li.    (16)

We recall our fundamental inequality for coordinate descent with step size αk = 1/Li(k):

    f(xk+1) ≤ f(xk) − (1/(2Li(k))) (∇i(k) f(xk))².    (17)

The update (16) is designed to choose i(k) to maximize the guaranteed decrease given by (17), which uses the component Lipschitz constants.

It may be shown, using this update, that

    f(xk+1) − f∗ ≤ (1 − µL) (f(xk) − f∗),

where µL is the strong convexity parameter with respect to the norm ‖v‖L := ∑_{i=1}^{n} √Li |vi|. It is shown in [4, Appendix 6.2] that

    max{ µ/(nL̄), µ1/Lmax } ≤ µL ≤ µ1 / min_{1≤i≤n} Li.

This rule is therefore at least as fast as the fastest of the Gauss-Southwell and Lipschitz sampling options.

Ordering of the methods by the constant in the linear convergence results when f is strongly convex (from slowest to fastest):

    random (uniform sampling, Lmax)
        <
    Gauss-Southwell (Lmax)
        <
    Gauss-Southwell with {Li};  random (Lipschitz sampling, {Li})
        <
    Gauss-Southwell-Lipschitz (rule: argmax_{1≤i≤n} (∇i f(xk))²/Li)

Comments:

Gauss-Southwell-Lipschitz has the best rate, but is the most expensive per iteration.

Better rates if you know and use {Li} instead of just using their max, i.e., Lmax.
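The three alternative rules can each be written in a couple of lines; the following sketch (not from the slides) uses an example gradient vector g at the current iterate and component Lipschitz constants L.

    import numpy as np

    rng = np.random.default_rng(0)
    g = np.array([0.3, -2.0, 0.7, 1.5])    # gradient at the current iterate (example values)
    L = np.array([1.0, 4.0, 2.0, 0.5])     # component Lipschitz constants L_i

    # Alternative 1: Gauss-Southwell index combined with the individual step 1/L_i.
    i_gs = np.argmax(np.abs(g))
    step_gs = g[i_gs] / L[i_gs]            # displacement along e_{i_gs}

    # Alternative 2: Lipschitz sampling, P(i(k) = j) proportional to L_j.
    i_ls = rng.choice(L.size, p=L / L.sum())

    # Alternative 3: Gauss-Southwell-Lipschitz rule (16).
    i_gsl = np.argmax(g**2 / L)

    print(i_gs, i_ls, i_gsl)               # note i_gs = 1 but i_gsl = 3 here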


Linear Equations

Let m ≤ n, b ∈ Rm, and Aᵀ = (a1 . . . am) ∈ Rn×m with ‖ai‖₂ = 1 for all i. Furthermore, suppose that Aᵀ has full column rank, meaning that the linear system

    Aw = b

has infinitely many solutions. To seek the least-length solution, we wish to solve

    minimize_{w∈Rn} ½ ‖w‖₂² subject to Aw = b.

The Lagrangian dual problem is

    minimize_{x∈Rm} f(x) := ½ ‖Aᵀx‖₂² − bᵀx,

where we note that ∇f(x) = AAᵀx − b and ∇i f(x) = aiᵀAᵀx − bi. The solutions to the primal and dual are related via w∗ = Aᵀx∗. Coordinate descent gives

    xk+1 = xk − α (aiᵀAᵀxk − bi) ei.

If we maintain an estimate wk = Aᵀxk, then we see that

    wk+1 = Aᵀxk+1 = Aᵀ(xk − α(aiᵀAᵀxk − bi)ei) = Aᵀxk − α(aiᵀAᵀxk − bi)ai = wk − α(aiᵀwk − bi)ai.

Note that if α = 1, then it follows by using ‖ai‖₂ = 1 that

    aiᵀwk+1 = aiᵀ(wk − (aiᵀwk − bi)ai) = aiᵀwk − (aiᵀwk − bi)aiᵀai = bi,

so that the i-th equation is satisfied exactly.

Linear Equations: Summary

- Coordinate minimization for solving the dual problem associated with linear equations along the direction ei with α = 1 satisfies the ith linear equation exactly.
- This is sometimes called the method of successive projections.
- The update wk+1 = wk − α(aiᵀwk − bi)ai requires
  - (n + 1) additions/subtractions
  - (2n + 1) multiplications
  - (3n + 2) total floating-point operations
- Computing ∇f(x) requires a multiplication with A, which is much more expensive.
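A minimal Python sketch of the dual coordinate update above with α = 1 (successive projections; not from the slides). Rows are assumed normalized so that ‖ai‖₂ = 1, and the function name is illustrative.

    import numpy as np

    def successive_projections(A, b, num_sweeps=50):
        # Cyclic sweeps of w <- w - (a_i^T w - b_i) a_i, i.e., project onto the i-th equation.
        m, n = A.shape
        w = np.zeros(n)
        for _ in range(num_sweeps):
            for i in range(m):                       # cyclic choice of equation
                a_i = A[i]
                w = w - (a_i @ w - b[i]) * a_i       # O(n) work; afterwards a_i^T w = b_i exactly
        return w

    # Example usage on a random consistent, row-normalized system.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 20))
    A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalize rows so ||a_i||_2 = 1
    b = A @ rng.standard_normal(20)                  # consistent right-hand side
    w = successive_projections(A, b)
    print(np.linalg.norm(A @ w - b))                 # residual should be small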

Logistic Regression

Given data {dj}_{j=1}^{N} ⊂ Rn and labels {yj}_{j=1}^{N} ⊂ {−1, 1} associated with the data, solve

    minimize_{x∈Rn} f(x) := (1/N) ∑_{j=1}^{N} log(1 + e^(−yj djᵀx)).

If we define the data matrix D ∈ R^{N×n} whose jth row is djᵀ, then it follows that

    ∇i f(x) = −(1/N) ∑_{j=1}^{N} [ e^(−yj djᵀx) / (1 + e^(−yj djᵀx)) ] yj dji.

Consider the coordinate minimization update

    xk+1 = xk − α ∇i(k) f(xk) ei(k) for some i(k) ∈ {1, 2, . . . , n} and α ∈ R.

For efficiency, we store and update the required quantity Dxk using

    Dxk+1 (new value) = D(xk − α ∇i(k) f(xk) ei(k)) = Dxk − α ∇i(k) f(xk) Dei(k) = Dxk (old value) − α ∇i(k) f(xk) D(:, i(k)),

where D(:, i(k)) denotes the i(k)-th column of D; if x0 ← 0, then we can set Dx0 ← 0.

Logistic Regression: Summary

- The update to obtain Dxk+1 requires a single vector-vector add.
- Computing ∇i f(xk) only requires accessing a single column of the data matrix D.
- Computing ∇f(x) requires accessing the entire data matrix D.
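A minimal Python sketch (not from the slides) of the coordinate update with the cached vector Dx; the fixed step size and the cyclic coordinate rule are illustrative choices, and scipy's expit is used for a numerically stable sigmoid.

    import numpy as np
    from scipy.special import expit

    def coord_descent_logreg(D, y, alpha=1.0, num_iters=5000):
        # Coordinate descent for f(x) = (1/N) sum_j log(1 + exp(-y_j d_j^T x)), caching Dx.
        N, n = D.shape
        x = np.zeros(n)
        Dx = np.zeros(N)                              # since x0 = 0, we can set Dx0 = 0
        for k in range(num_iters):
            i = k % n                                 # cyclic coordinate choice
            gi = -np.mean(expit(-y * Dx) * y * D[:, i])   # grad_i f(xk); uses one column of D
            x[i] -= alpha * gi
            Dx -= alpha * gi * D[:, i]                # cache update: Dx <- Dx - alpha*grad_i*D(:,i)
        return x

    # Example usage on synthetic linearly separable data.
    rng = np.random.default_rng(1)
    D = rng.standard_normal((200, 5))
    y = np.sign(D @ rng.standard_normal(5))
    x_hat = coord_descent_logreg(D, y)
    print(np.mean(np.sign(D @ x_hat) == y))           # training accuracy, should be close to 1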


References I

[1] A. BECK AND L. TETRUASHVILI, On the convergence of block coordinate descent type methods, SIAM Journal on Optimization, 23 (2013), pp. 2037–2060.

[2] D. LEVENTHAL AND A. S. LEWIS, Randomized methods for linear constraints: convergence rates and conditioning, Mathematics of Operations Research, 35 (2010), pp. 641–654.

[3] Y. NESTEROV, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 22 (2012), pp. 341–362.

[4] J. NUTINI, M. SCHMIDT, I. H. LARADJI, M. FRIEDLANDER, AND H. KOEPKE, Coordinate descent converges faster with the Gauss-Southwell rule than random selection, in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 1632–1641.

[5] M. J. POWELL, On search directions for minimization algorithms, Mathematical Programming, 4 (1973), pp. 193–201.

[6] A. P. RUSZCZYŃSKI, Nonlinear Optimization, vol. 13, Princeton University Press, 2006.

[7] S. J. WRIGHT, Coordinate descent algorithms, Mathematical Programming, 151 (2015), pp. 3–34.