
Lectures on Greedy-type Algorithms in Convex Optimization

prepared for the Machine Learning Summer School 2015

University of Texas, Austin

by Robert M. Freund, MIT

© 2015 Massachusetts Institute of Technology. All rights reserved.

January 12, 2015


1 Motivation: Binary Classification and Convex Optimization

Suppose we are given m pixelated digital images of faces; some of the m faces are female faces and the others are male faces. For each i = 1, . . . , m let ai ∈ Rn denote the digital image of face i, where each of the n coefficients (ai)j , j = 1, . . . , n, corresponds to a pixel. We also have the classification yi ∈ {−1,+1} for each face i, where yi = −1 corresponds to a male face and yi = +1 corresponds to a female face, for i = 1, . . . , m. We would like to compute a linear decision rule, comprised of a vector λ ∈ Rn, that satisfies λTai > 0 if yi = +1, and λTai < 0 if yi = −1, for all i = 1, . . . , m. If we can compute a λ that satisfies these conditions, we can then use λ as a prediction procedure for the gender of any other digital face image: given a face image c ∈ Rn, we will declare whether c is the image of a male or female face as follows:

• if λT c > 0, then we declare that c is a female face image, and

• if λT c < 0, then we declare that c is a male face image.

The accuracy of this prediction procedure is a nontrivial matter. We will use convex optimization to try to compute an accurate linear predictor model λ.

Much more generally, suppose we are given training data a1, . . . , am ∈ Rn and yi ∈ {−1,+1} for i = 1, . . . , m for which:

• if yi = +1, then data record i has property “P”, and

• if yi = −1, then data record i does not have property “P”.

We would like to use the m training data records to develop a linear rule λ that can be used to predict whether or not other points c have property P. In particular, we seek a vector λ for which:

• if yi = +1, then λTai > 0 , and

• if yi = −1, then λTai < 0 .

We will then use λ to predict whether or not any other point c has property P. If we are given another vector c, we will declare whether c has property P as follows:

• if λT c > 0, then we declare that c has property P, and


• if λT c < 0, then we declare that c does not have property P.

Let us now simplify our notation a bit. Note that our criterion for selecting λ can be rewritten as:

yi(ai)Tλ > 0 , i = 1, . . . ,m .

We say that the training data is separable if there exists λ for which the above strict inequalities are satisfied.

Let us write this in matrix form. Define the m× n matrix A as follows:

Aij := yi(ai)j ,   i = 1, . . . , m ,   j = 1, . . . , n .

The ith row of A is simply the vector ai multiplied by yi. Then we seek a vector λ for which

Aλ > 0 .

The above strict inequality is satisfiable if and only if the training data is separable. Of course, there is no guarantee that the data points will be separable, in which case we say that the training data is not separable. If the training data is not separable, we may seek to compute λ for which "most" of the components of Aλ are positive, and for which the negative components of Aλ are "not too negative." We will write this rather informally as:

Aλ ≳ 0 ;

there are several ways to make the above notions mathematically precise. For now it suffices to keep these ideas informal.
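To make the construction concrete, the matrix A and the separability check can be expressed in a few lines of NumPy. The following is a minimal sketch on synthetic data; the array names and dimensions are our own illustration, not from the notes.

```python
import numpy as np

# Hypothetical training data: m records a_1, ..., a_m in R^n with labels y_i in {-1, +1}.
rng = np.random.default_rng(0)
m, n = 200, 400
X = rng.normal(size=(m, n))           # row i is the data record a_i
y = rng.choice([-1.0, 1.0], size=m)   # y_i = +1 ("has property P") or -1

# A_ij := y_i (a_i)_j, i.e., row i of A is y_i times a_i.
A = y[:, None] * X

# A candidate rule lambda separates the data exactly when every component of A @ lam is positive.
lam = rng.normal(size=n)
print("smallest component of A @ lam:", (A @ lam).min())
```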

1.1 Separable Training Data, and the Maximum Margin Optimization Problem

When the training data is separable, we seek a vector λ for which

Aλ > 0 .

It makes reasonable sense to want to find λ for which each of the quantities yi(ai)Tλ = (Aλ)i is very positive, so that the decision rule will work "very well" on the training data. We are thus motivated to make the smallest component of Aλ, namely min_{i∈{1,...,m}} (Aλ)i, as large as possible. This is written as:

max_λ  min_{i=1,...,m} (Aλ)i .


The quantity min_{i=1,...,m} (Aλ)i is called the margin of λ, so this optimization problem is maximizing the margin. One aspect of the margin is that it is homogeneous in λ, which is to say that if we scale λ by a factor of θ > 0 then the margin, which is the objective function above, scales by θ as well. Since we can scale λ by an arbitrarily large θ and obtain an arbitrarily large margin, we impose a normalization criterion, and ours will be that ‖λ‖1 := ∑_{j=1}^n |λj| ≤ 1 .

Our optimization problem now is:

MMP:   maximize_λ   min_{i=1,...,m} (Aλ)i

        s.t.   ‖λ‖1 ≤ 1

               λ ∈ Rn .

Notice that MMP (for Maximum Margin Problem) will have a positive optimal objective function value if and only if the training data is separable, that is, if and only if there exists λ for which Aλ > 0.
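Because the objective of MMP is the minimum of m linear functions and the feasible region is a polytope, MMP can be solved as a linear program by introducing an auxiliary margin variable t and splitting λ into nonnegative parts. The sketch below is our own illustration of this reformulation using scipy.optimize.linprog; it is not the algorithmic approach developed in these notes.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mmp(A):
    """Solve MMP: maximize_lambda min_i (A lambda)_i subject to ||lambda||_1 <= 1,
    reformulated as: maximize t s.t. A lambda >= t*1, ||lambda||_1 <= 1,
    with lambda = p - q and p, q >= 0. Variable vector: [p (n), q (n), t (1)]."""
    m, n = A.shape
    c = np.zeros(2 * n + 1)
    c[-1] = -1.0                                     # minimize -t, i.e., maximize t
    # t - (A lambda)_i <= 0 for every i
    A_ub1 = np.hstack([-A, A, np.ones((m, 1))])
    # sum(p) + sum(q) <= 1, i.e., ||lambda||_1 <= 1
    A_ub2 = np.concatenate([np.ones(2 * n), [0.0]]).reshape(1, -1)
    res = linprog(c,
                  A_ub=np.vstack([A_ub1, A_ub2]),
                  b_ub=np.concatenate([np.zeros(m), [1.0]]),
                  bounds=[(0, None)] * (2 * n) + [(None, None)],
                  method="highs")
    p, q, t = res.x[:n], res.x[n:2 * n], res.x[-1]
    return p - q, t    # (lambda*, optimal margin)
```

The sign of the returned margin is the separability test of the text: it is positive exactly when the training data is separable.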

1.2 Non-Separable Training Data, and the Exponential Loss and Logistic Loss Optimization Problems

Let us now presume that the training data is not separable, as is the case in most practical applications. Consider the following optimization problem:

LELP :   L∗exp := minimize_λ   Lexp(λ) := ln( (1/m) ∑_{i=1}^m exp((−Aλ)i) )

          s.t.   λ ∈ Rn .

The objective function Lexp(λ) of LELP is called the log-exponential loss function. This function is a monotone decreasing function of (Aλ)i for i = 1, . . . , m; therefore minimizing this function will encourage larger values of (Aλ)i for i = 1, . . . , m.

Another loss function that is often used is the logistic loss function Llogit(·) : Rn → R, which is defined by:

Llogit(λ) := (1/m) ∑_{i=1}^m ln(1 + exp((−Aλ)i)) ,

which has very appealing statistical properties in addition to some very special optimization properties. This function is also a monotone decreasing function of (Aλ)i for i = 1, . . . , m.


The logistic loss optimization problem is that of minimizing the logistic loss function Llogit(·) and is formulated as:

LOGIT :   L∗logit := min_λ   Llogit(λ) := (1/m) ∑_{i=1}^m ln(1 + exp((−Aλ)i))

           s.t.   λ ∈ Rn .

1.3 Computing Sparse Solutions, and Optimization Algorithms for Binary Classification

The above optimization problems are illustrative of optimization problems that arise quite naturally in machine learning. Let λ∗ be an optimal (or a nearly optimal) solution to one of the optimization problems MMP, LELP, or LOGIT. Let ‖λ∗‖0 denote the number of non-zero components of λ∗. In very many contexts it is important that ‖λ∗‖0 be small, that is, that λ∗ is a sparse vector (λ∗ has very few non-zero components). This might be for computational reasons – to manage computations and memory, especially when n ≫ 0 is huge-scale. But in most instances the sparsity of λ∗ arises from the notion that the solution ought to depend on only a very few underlying features (components). And in the presence of potential over-fitting of the solution based on the training data, experience in many contexts has shown that sparse solutions usually are more accurate at prediction on out-of-sample data. Therefore in addition to optimizing the loss functions in MMP, LELP, and/or LOGIT, we also seek a solution that is sparse in the sense that ‖λ∗‖0 is relatively small. We could try to enforce the sparsity requirement by adding a constraint of the type "‖λ‖0 ≤ r" to the optimization problem, for some small value of r. However, this direct approach to tackling the sparsity issue renders the resulting problem very hard – indeed NP-hard in the language of theoretical computer science. Instead, we will see that certain algorithms, usually referred to as "greedy" methods, are by their design able to encourage or guarantee sparse solutions.

In these notes we discuss three basic types of "greedy" algorithms in convex optimization, namely:

1. Greedy Coordinate Descent. Coordinate descent methods are methods for solving convex optimization problems that adjust one coordinate at each iteration. These methods were first developed in the early days of optimization as ways to keep computation and memory costs low. Today such methods have experienced a huge resurgence in interest due to their ability to produce sparse solutions, that is, solutions that use very few coordinates. See Beck and Tetruashvili [1] and Nesterov [9] for two excellent modern treatments of these methods.

2. Frank-Wolfe Method. The Frank-Wolfe method is originally due to Marguerite Frank and Philip Wolfe [5], and is also known as the "conditional gradient" method, see Levitin and Polyak [8]. Some older references for the method include Polyak [10], Demyanov and Rubinov [3], and Dunn and Harshbarger [4]. More modern references include Jaggi [7] and Freund and Grigas [6]. See Clarkson [2] especially for a treatment that is more focused on computer science applications.

3. Randomized Coordinate Descent. The treatment of randomized coordinate descent that is developed in these notes is based very specifically on Richtarik and Takac [11], which itself is motivated by the modern treatment of this topic by Nesterov [9].


2 Greedy Coordinate Descent Method for Unconstrained Smooth Optimization

2.1 Problem Setting and Basics for Coordinate Descent Method

Our problem of interest is

P :   minimize_x   f(x)
      s.t.   x ∈ Rn ,     (1)

where f(·) is a differentiable convex function defined on Rn. We denote the optimal objective value of P by f∗, and let x∗ denote an optimal solution of P, when such a solution exists.

Let ∇f(x) denote the gradient of f(·) at x, and recall the gradient inequality:

f(y) ≥ f(x) +∇f(x)T (y − x) for all x, y ∈ Rn . (2)

We will measure distances between points using the ℓ1 norm ‖x‖1 := ∑_{j=1}^n |xj|. We will measure the size of the gradients with the ℓ∞ norm ‖g‖∞ := max_j |gj|.

We assume that f(·) has a Lipschitz gradient. Given our use of the above norms, this means that there is a constant L for which:

‖∇f(y)−∇f(x)‖∞ ≤ L‖y − x‖1 for all x, y ∈ Rn . (3)

Let B1(c, R) denote the ℓ1 ball centered at c with radius R, namely:

B1(c, R) := {x ∈ Rn : ‖x− c‖1 ≤ R} .

Let ei denote the ith unit vector in Rn, namely ei = (0, . . . , 0, 1, 0, . . . , 0)T where the 1 appears in the ith coefficient.

Algorithm 1 presents the greedy coordinate descent method for solving (P).

Suppose that Algorithm 1 is initiated at x0 ← 0. Then due to the update rule in Step (3.) of the algorithm, it will hold that ‖xk+1‖0 ≤ ‖xk‖0 + 1, whereby it holds by induction that ‖xk‖0 ≤ k for all k. This sparsity guarantee is particularly useful in the case of huge-scale problems, where n ≫ 0 might be on the order of 10^6 or larger.


Algorithm 1 Greedy Coordinate Descent Method

Initialize. Initialize at x0, i ← 0

At iteration i:

1. Compute Gradient. Compute ∇f(xi)

2. Compute Coordinate. Compute:

   ji ∈ arg max_{j∈{1,...,n}} |∇f(xi)j| ,   and   si ← sgn(∇f(xi)ji)

3. Compute Step-Size and New Point. Compute:

   αi ← |∇f(xi)ji| / L

   xi+1 ← xi − αi si eji
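A minimal sketch of Algorithm 1 in NumPy follows, using the logistic loss of Section 1.2 as the objective. The synthetic data, the helper names, and the (deliberately conservative) estimate of the Lipschitz constant L are our own illustration.

```python
import numpy as np

def greedy_coordinate_descent(grad, x0, L, iters):
    """Algorithm 1: at each iteration step along the coordinate whose partial
    derivative is largest in magnitude, with step length |grad_j| / L."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        j = int(np.argmax(np.abs(g)))          # greedy coordinate choice
        x[j] -= np.sign(g[j]) * np.abs(g[j]) / L
    return x

# Example use on the logistic loss (synthetic data).
rng = np.random.default_rng(1)
m, n = 100, 50
A = rng.normal(size=(m, n))

def grad_logistic(lam):
    s = 1.0 / (1.0 + np.exp(A @ lam))          # equals sigmoid(-(A lam))
    return -(A.T @ s) / m

# Crude upper bound on the (l1 -> l_inf) Lipschitz constant of the gradient.
L_est = np.max(np.abs(A).T @ np.abs(A)) / (4.0 * m)
x = greedy_coordinate_descent(grad_logistic, np.zeros(n), L_est, iters=200)
print("nonzero coordinates after 200 iterations:", np.count_nonzero(x))
```

Starting from x0 = 0, the number of nonzeros printed at the end can never exceed the iteration count, which is the sparsity guarantee discussed above.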

2.2 Analysis and Complexity of the Greedy Coordinate Descent Method

We start with the following:

Proposition 1. If f(·) has Lipschitz gradient with constant L, then

f(y) ≤ f(x) + ∇f(x)T(y − x) + (L/2)‖y − x‖1²   for all x, y ∈ Rn .

Proof. We use the fundamental theorem of calculus applied in a multivariate setting. We have

f(y) = f(x) + ∫_0^1 ∇f(x + t(y − x))T(y − x) dt

     = f(x) + ∇f(x)T(y − x) + ∫_0^1 [∇f(x + t(y − x)) − ∇f(x)]T(y − x) dt

     ≤ f(x) + ∇f(x)T(y − x) + ∫_0^1 ‖∇f(x + t(y − x)) − ∇f(x)‖∞ ‖y − x‖1 dt

     ≤ f(x) + ∇f(x)T(y − x) + ∫_0^1 t L ‖y − x‖1² dt

     = f(x) + ∇f(x)T(y − x) + (L/2)‖y − x‖1² .


We will also use the following property of a nonnegative series, whose proof is given at the end of this subsection.

Proposition 2. Suppose there is a constant C > 0 for which the nonnegative series {ai} satisfies

ai+1 ≤ ai − ai²/C   for all i ≥ 0 .

Then it holds that

ak ≤ 1 / ( 1/a0 + k/C )   for all k ≥ 0 .

Let S0 denote the set of all x whose objective value is at most f(x0), namely S0 := {x ∈ Rn : f(x) ≤ f(x0)}, and let S∗ denote the set of optimal solutions of (1), namely S∗ := {x ∈ Rn : f(x) = f∗}. Now let Dist0 denote the largest distance of points in S0 to the set of optimal solutions S∗:

Dist0 := max_{x∈S0} min_{x∗∈S∗} ‖x − x∗‖1 .     (4)

Our main algorithmic analysis result for the greedy coordinate descent method is:

Theorem 1. Let {xk} be generated according to the greedy coordinate descent method (Algorithm 1). Then for all k ≥ 0 it holds that:

f(xk) − f∗ ≤ 1 / ( 1/(f(x0) − f∗) + k/(2L(Dist0)²) ) ≤ 2L(Dist0)²/k .     (5)

Proof: From Proposition 1 we have for each i ≥ 0:

f(xi+1) ≤ f(xi) + ∇f(xi)T(xi+1 − xi) + (L/2)‖xi+1 − xi‖1²

        = f(xi) − αi|∇f(xi)ji| + (L/2)αi²

        = f(xi) − (1/(2L))|∇f(xi)ji|²

        = f(xi) − (1/(2L))‖∇f(xi)‖∞² .     (6)

This shows that the values f(xi) are decreasing and in particular f(xi) ≤ f(x0), whereby xi ∈ S0 for all i ≥ 0. Therefore it follows from (4) that there exists x∗ ∈ S∗ for which ‖xi − x∗‖1 ≤ Dist0, and from the gradient inequality for the convex function f(·) it holds that

f∗ = f(x∗) ≥ f(xi) + ∇f(xi)T(x∗ − xi)   (from Gradient Inequality)

           ≥ f(xi) − ‖∇f(xi)‖∞ ‖x∗ − xi‖1   (from Norm Inequality)

           ≥ f(xi) − ‖∇f(xi)‖∞ Dist0 ,   (from above discussion)

and rearranging the above yields ‖∇f(xi)‖∞ ≥ (f(xi) − f∗)/Dist0 . Substituting this inequality into (6) and subtracting f∗ from both sides yields:

f(xi+1) − f∗ ≤ f(xi) − f∗ − (f(xi) − f∗)² / (2L(Dist0)²) .

Define ai := f(xi) − f∗, and it follows that the nonnegative series {ai} satisfies ai+1 ≤ ai − ai²/(2L(Dist0)²). It then follows from Proposition 2 using C = 2L(Dist0)² that

ak ≤ 1 / ( 1/a0 + k/(2L(Dist0)²) ) ,

which establishes that:

f(xk) − f∗ ≤ 1 / ( 1/(f(x0) − f∗) + k/(2L(Dist0)²) ) .

Proof of Proposition 2: Notice that the conclusion of the proposition holds for k = 0. By induction, suppose that the conclusion is true for some k ≥ 0. Then:

ak+1 ≤ ak − ak²/C ≤ ak − ak ak+1/C ,

and collecting terms and rearranging yields:

ak+1 (1 + ak/C) ≤ ak .

Taking reciprocals and rearranging yields:

1/ak+1 ≥ (1/ak)(1 + ak/C) = 1/ak + 1/C ≥ 1/a0 + k/C + 1/C = 1/a0 + (k + 1)/C ,


which rearranges to yield:

ak+1 ≤ 1 / ( 1/a0 + (k + 1)/C ) ,

completing the proof.

2.3 Comments and Extensions

1. How might the method and the analysis be modified if L is not explicitly known?

2. How might the method and the analysis be modified if one can efficiently do a line-search to compute:

   αi := arg min_α { f(xi − α si eji) } ,

   instead of assigning the value αi ← |∇f(xi)ji| / L at each iteration? (A sketch of such a line-search step is given below.)
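As one possible answer to question 2, the coordinate step-size can be found by a one-dimensional solver in place of the fixed choice |∇f(xi)ji|/L. The sketch below is our own illustration using scipy.optimize.minimize_scalar; it is not a construction from the notes.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def greedy_step_with_line_search(f, grad, x):
    """One iteration of greedy coordinate descent where alpha_i is chosen by an
    exact line-search along the selected coordinate direction."""
    g = grad(x)
    j = int(np.argmax(np.abs(g)))      # greedy coordinate, as in Algorithm 1
    s = np.sign(g[j])

    def phi(alpha):                    # one-dimensional restriction of f
        z = x.copy()
        z[j] -= alpha * s
        return f(z)

    alpha = minimize_scalar(phi).x     # Brent's method; no knowledge of L needed
    x_new = x.copy()
    x_new[j] -= alpha * s
    return x_new
```

One observation relevant to question 1 is that this variant never uses L, at the price of extra function evaluations per iteration.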


3 The Frank-Wolfe Method for Constrained Smooth Optimization

Let us now consider the following constrained optimization problem:

P :   minimize_x   f(x)
      s.t.   x ∈ S ,

where f(x) is a differentiable convex function on S ⊂ Rn, and S is a closed and bounded convex set. Let f∗ denote the optimal objective value of P. We assume that f(·) has a Lipschitz gradient on S, which means there exists L for which:

‖∇f(y)−∇f(x)‖ ≤ L‖y − x‖ for all x, y ∈ S . (7)

(Throughout this section the norm is the Euclidean norm ‖v‖ = √(vTv).)

We now describe the Frank-Wolfe method for solving P. This method is based on the premise that the set S is well-suited for linear optimization. This means that either S is itself described by a system of linear inequalities S = {x | Ax ≤ b} over which linear optimization is an easy task, or more generally that the problem:

LOc :   minimize_x   cTx
        s.t.   x ∈ S

is easy to solve for any given objective function vector c. This being the case, suppose that we have a given iterate value xk ∈ S. The linearization of the function f(x) at x = xk is:

z1(x) := f(xk) +∇f(xk)T (x− xk) ,

which is the first-order Taylor expansion of f(·) at xk. Since we can easily do linear optimization on S, let us solve:

LP :   minimize_x   f(xk) + ∇f(xk)T(x − xk)
       s.t.   x ∈ S ,


which of course is equivalent to:

LP :   minimize_x   ∇f(xk)Tx
       s.t.   x ∈ S .

Let x̄k denote the optimal solution to this problem. The Frank-Wolfe method proceeds by choosing the next iterate as xk+1 ← xk + α(x̄k − xk) for some step-size α. Since S is a convex set and both xk and x̄k are contained in S, then xk + α(x̄k − xk) ∈ S for all α ∈ [0, 1]. Let αk ∈ [0, 1] denote the step-size at iteration k of the method. We have:

xk+1 ← xk + αk(x̄k − xk) for some αk ∈ [0, 1] .

The value of αk can be determined in a number of different ways. One way to set the step-size αk is according to some pre-determined rule, such as the following often-used rule:

αk = 2/(k + 2) .

(We will see shortly that this rule leads to a very good computational bound on the convergence of the Frank-Wolfe method.) Another way is to choose αk by performing a line-search of f(·) over the interval α ∈ [0, 1]. That is, we might determine αk as the optimal solution to the following line-search problem:

αk ← arg min_{0≤α≤1} f(xk + α(x̄k − xk)) .

Algorithm 2 presents a formal statement of the Frank-Wolfe method just described.

Let us now see how the Frank-Wolfe method can be used to generate sparse solutions in the context of certain structured problems. Suppose our problem of interest is the constrained minimization problem:

Pr :   minimize_x   f(x)
       s.t.   ‖x‖1 ≤ r ,

where f(·) is a differentiable convex loss function associated with a machine learning binary classification problem, such as the log-exponential loss function:

Lexp(x) := ln( (1/m) ∑_{i=1}^m exp((−Ax)i) )


Algorithm 2 Frank-Wolfe Method for minimizing f(x) over x ∈ S

Initialize at x0 ∈ S, k ← 0 .

At iteration k:

1. Compute ∇f(xk), and then solve the linear optimization problem:

   x̄k ← arg min_{x∈S} { f(xk) + ∇f(xk)T(x − xk) } .

2. Set xk+1 ← xk + αk(x̄k − xk), where αk ∈ [0, 1] .

or the logistic loss function:

Llogit(x) := (1/m) ∑_{i=1}^m ln(1 + exp((−Ax)i)) ,

or perhaps some other context-specific loss function. And here the feasible region is described by the ℓ1 norm constraint ‖x‖1 ≤ r, namely ∑_{j=1}^n |xj| ≤ r. At each iteration of the Frank-Wolfe method, the algorithm will compute x̄k ← arg min_{‖x‖1≤r} { f(xk) + ∇f(xk)T(x − xk) }. It is straightforward to see that x̄k can always be chosen as x̄k ← −r · sk ejk where:

jk ∈ arg max_{j∈{1,...,n}} |∇f(xk)j|   and   sk ← sgn(∇f(xk)jk) .

Suppose that Algorithm 2 is initiated at x0 = 0, which is feasible for Pr. Then due to the update rule in Step (2.) of the algorithm, it holds by induction that xk will be a convex combination of x0 = 0 and at most k of the vectors ±re1, . . . , ±ren. Therefore after k iterations xk will have at most k non-zeroes, namely ‖xk‖0 ≤ k. This structural guarantee is particularly useful in the case of huge-scale problems, where n ≫ 0 might be on the order of 10^6 or larger, and when the extreme points of S are themselves characterized in special ways.
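The ℓ1-ball specialization just described is short enough to state in code. The following is a minimal NumPy sketch of Algorithm 2 with the step-size rule αk = 2/(k + 2); the logistic-loss objective and the synthetic data are our own illustration.

```python
import numpy as np

def frank_wolfe_l1(grad, n, r, iters):
    """Frank-Wolfe (Algorithm 2) over the l1 ball {x : ||x||_1 <= r}, started at x0 = 0."""
    x = np.zeros(n)
    for k in range(iters):
        g = grad(x)
        j = int(np.argmax(np.abs(g)))
        x_bar = np.zeros(n)
        x_bar[j] = -r * np.sign(g[j])     # minimizer of the linearization over the l1 ball
        alpha = 2.0 / (k + 2.0)
        x = x + alpha * (x_bar - x)       # convex combination, so feasibility is preserved
    return x

# Example: logistic loss with an l1-ball constraint of radius r (synthetic data).
rng = np.random.default_rng(2)
m, n, r = 100, 50, 5.0
A = rng.normal(size=(m, n))
grad_logistic = lambda lam: -(A.T @ (1.0 / (1.0 + np.exp(A @ lam)))) / m
x = frank_wolfe_l1(grad_logistic, n, r, iters=50)
print("l1 norm:", np.abs(x).sum(), " nonzeros:", np.count_nonzero(x))
```

After k iterations the iterate is a convex combination of the origin and at most k signed, scaled unit vectors, so the printed number of nonzeros cannot exceed the iteration count.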

3.1 Analysis and Complexity of the Frank-Wolfe Method

Before stating the main algorithmic analysis result for the Frank-Wolfe method, we first present two important preliminary results. The first result states that at each iteration k of the Frank-Wolfe method, the solution of the linear optimization problem leads to a valid lower bound on the optimal value f∗ of the optimization problem P.

Proposition 3. At iteration k of the Frank-Wolfe method (Algorithm 2), it holds that

f∗ ≥ f(xk) + ∇f(xk)T(x̄k − xk) .


Proof: From the gradient inequality for convex functions one has:

f(x) ≥ f(xk) +∇f(xk)T (x− xk) for any x ∈ S .

Therefore:

f∗ = min_{x∈S} f(x) ≥ min_{x∈S} { f(xk) + ∇f(xk)T(x − xk) } = f(xk) + ∇f(xk)T(x̄k − xk) .

We will also use the following property of certain types of nonnegative series, whose proof is given at the end of this subsection.

Proposition 4. Suppose there is a constant C > 0 for which the nonnegative series {ai} satisfies

ai+1 ≤ ai (1 − 2/(i + 2)) + C/(i + 2)²   for all i ≥ 0 .

Then it holds that

ak ≤ C/(k + 2)   for all k ≥ 1 .

Let Diam(S) denote the largest distance between any two points in S, namely:

Diam(S) := max_{x,y∈S} ‖x − y‖ .     (8)

Our main algorithmic analysis result for the Frank-Wolfe method is:

Theorem 2. Let {xk} be generated according to the Frank-Wolfe method (Algorithm 2) using the step-size rule αk = 2/(k + 2).

Then for all k ≥ 1 it holds that:

f(xk) − f∗ ≤ 2L(Diam(S))²/(k + 2) .     (9)

Proof: For every iteration k ≥ 0 it holds that:


f(xk+1) ≤ f(xk) + ∇f(xk)T(xk+1 − xk) + (L/2)‖xk+1 − xk‖²   (from Prop. 1)

        = f(xk) + αk∇f(xk)T(x̄k − xk) + (L αk²/2)‖x̄k − xk‖²

        ≤ f(xk) + αk∇f(xk)T(x̄k − xk) + (L αk²/2)(Diam(S))²

        = (1 − αk) f(xk) + αk ( f(xk) + ∇f(xk)T(x̄k − xk) ) + (L αk²/2)(Diam(S))²

        ≤ (1 − αk) f(xk) + αk f∗ + (L αk²/2)(Diam(S))² .   (from Prop. 3)

Subtracting f ∗ from each side and rearranging the right-hand side, one obtains:

f(xk+1) − f∗ ≤ (f(xk) − f∗)(1 − αk) + (L αk²/2)(Diam(S))² ,

and substituting the value of αk = 2/(k + 2) we arrive at:

f(xk+1) − f∗ ≤ (f(xk) − f∗)(1 − 2/(k + 2)) + 2L(Diam(S))²/(k + 2)² .

Define ak := f(xk) − f∗, and define C := 2L(Diam(S))², and notice that the above inequality is:

ak+1 ≤ ak (1 − 2/(k + 2)) + C/(k + 2)²   for all k ≥ 0 .

Notice that the series {ak} satisfies the condition of Proposition 4. It therefore follows from Proposition 4 that ak ≤ C/(k + 2) for k ≥ 1. Therefore:

f(xk) − f∗ = ak ≤ C/(k + 2) = 2L(Diam(S))²/(k + 2) .

Proof of Proposition 4: Notice that the conclusion of the proposition holds for k = 1, since using i = 0 we have:

a1 ≤ a0 (1 − 2/(0 + 2)) + C/(0 + 2)² = C/4 < C/3 = C/(1 + 2) .


By induction, suppose that the conclusion is true for some k ≥ 1, namely ak ≤ C/(k + 2). Then:

ak+1 ≤ ak (1 − 2/(k + 2)) + C/(k + 2)²

      ≤ (C/(k + 2)) (1 − 2/(k + 2)) + C/(k + 2)²

      = C(k + 1)/(k + 2)²

      < C/(k + 3) ,

3.2 Comments and Extensions

1. How might the Frank-Wolfe method and its analysis be modified if L is not explicitly known?

2. How might the method and the analysis be modified if one can efficiently do a line-search to compute:

   αk := arg min_{α∈[0,1]} { f(xk + α(x̄k − xk)) } ,

   instead of assigning the value αk = 2/(k + 2) at each iteration?


4 Randomized Block-Coordinate Descent for Unconstrained Smooth Optimization

4.1 Randomized Block-Coordinate Descent Method

Herein we consider the following unconstrained smooth optimization problem:

f∗ := min_{x∈RN} f(x) ,     (10)

where f(·) is convex and differentiable on RN, and x and/or f(·) are assumed to have special block structure. We model the block structure of the problem by decomposing the space RN into n subspaces as follows. Let U ∈ RN×N be a column permutation of the N × N identity matrix, and further let U = [U1, U2, . . . , Un] be a decomposition of U into n submatrices, with Ui being of size N × Ni, where ∑_i Ni = N. Clearly, any vector x ∈ RN can be written uniquely as x = ∑_{i=1}^n Ui x(i), where x(i) = UiT x ∈ RNi. Notice that:

UiT Uj =  the Ni × Ni identity matrix if i = j , and the Ni × Nj zero matrix otherwise .     (11)

For simplicity we will write x = (x(1), . . . , x(n))T . We equip RNi with the Euclidean norm:

‖t‖(i) =√〈t, t〉 for t ∈ RNi , (12)

where 〈·, ·〉 is the standard Euclidean inner product. For x = ∑_{i=1}^n Ui x(i) ∈ RN we have

‖x‖ := √( ∑_{i=1}^n ‖x(i)‖(i)² ) = √( ∑_{i=1}^n 〈x(i), x(i)〉 ) .     (13)

The ith block component of the gradient of f(·) is denoted by ∇if(x) and is formally defined by:

∇if(x) := UiT ∇f(x) ∈ RNi   for i = 1, . . . , n .     (14)

We assume that the gradient of f(·) is block coordinate-wise Lipschitz, uniformly in x, with Lipschitz constant L > 0. This means that for all x ∈ RN, i = 1, 2, . . . , n, and t ∈ RNi, it holds that:

‖∇if(x+ Uit)−∇if(x)‖(i) ≤ L‖t‖(i) . (15)

Almost identical to Proposition 1, we have:

Proposition 5. If f(·) satisfies (15), then

f(x + Uit) ≤ f(x) + 〈∇if(x), t〉 + (L/2)‖t‖(i)² ,   for i = 1, . . . , n .


Proof. Similar to the proof of Proposition 1, we use the fundamental theorem of calculus applied in a multivariate setting. We have:

f(x + Uit) = f(x) + ∫_0^1 〈∇f(x + αUit), Uit〉 dα

           = f(x) + ∫_0^1 〈∇if(x + αUit), t〉 dα

           = f(x) + 〈∇if(x), t〉 + ∫_0^1 〈∇if(x + αUit) − ∇if(x), t〉 dα

           ≤ f(x) + 〈∇if(x), t〉 + ∫_0^1 ‖∇if(x + αUit) − ∇if(x)‖(i) ‖t‖(i) dα

           ≤ f(x) + 〈∇if(x), t〉 + ∫_0^1 αL‖t‖(i)² dα

           = f(x) + 〈∇if(x), t〉 + (L/2)‖t‖(i)² .

We present and study the Randomized Block Coordinate Descent Method, shown here in Algorithm 3.

Algorithm 3 Randomized Block Coordinate Descent Method

Initialize. Initialize at x0 ∈ RN, k ← 0

At iteration k:

1. Choose Random Block. Choose ik = i ∈ {1, 2, . . . , n} according to the uniform distribution on {1, 2, . . . , n}: P(ik = i) = 1/n, for i = 1, . . . , n .

2. Take Step in Chosen Block. xk+1 ← xk − (1/L) Ui ∇if(xk) .
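A minimal sketch of Algorithm 3 follows, with the blocks Ui represented simply as lists of coordinate indices. The quadratic test objective, the block sizes, and the (conservative) Lipschitz constant are our own illustration.

```python
import numpy as np

def randomized_block_coordinate_descent(grad, blocks, x0, L, iters, seed=0):
    """Algorithm 3: pick a block uniformly at random and take a gradient step of
    length 1/L restricted to that block."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        i = rng.integers(len(blocks))       # uniform random block index i_k
        idx = blocks[i]                     # coordinates selected by U_i
        x[idx] -= grad(x)[idx] / L          # nabla_i f(x) = U_i^T nabla f(x)
    return x

# Example: convex quadratic f(x) = 0.5 * ||Q x - b||^2 on R^N,
# with N = 12 coordinates split into n = 4 blocks of size 3.
N, n = 12, 4
rng = np.random.default_rng(3)
Q = rng.normal(size=(N, N))
b = rng.normal(size=N)
grad = lambda x: Q.T @ (Q @ x - b)
blocks = np.array_split(np.arange(N), n)
L = np.linalg.norm(Q.T @ Q, 2)              # valid (conservative) blockwise Lipschitz constant
x = randomized_block_coordinate_descent(grad, blocks, np.zeros(N), L, iters=500)
print("residual norm:", np.linalg.norm(Q @ x - b))
```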

Let S0 denote the set of all x whose objective value is at most f(x0), namely S0 := {x ∈ RN : f(x) ≤ f(x0)}, and let S∗ denote the set of optimal solutions of (10), namely S∗ := {x ∈ RN : f(x) = f∗}. Now let Dist0 denote the largest distance of points in S0 to the set of optimal solutions S∗:

Dist0 := max_{x∈S0} min_{x∗∈S∗} ‖x − x∗‖ .     (16)

The main computational guarantee for Algorithm 3 is presented in Theorem 3.

Theorem 3. Given the initial point x0, a target accuracy ε satisfying 0 < ε < min{f(x0) − f∗, 2nL(Dist0)²}, and an error tolerance ρ ∈ (0, 1), let the Randomized Block Coordinate Descent Method (Algorithm 3) be run for k iterations where:

k ≥ (2nL(Dist0)²/ε)(1 + ln(1/ρ)) + 2 .     (17)


Then

P(f(xk) ≤ f∗ + ε) ≥ 1 − ρ .

We will prove this theorem in Section 4.2.
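As a quick numerical illustration of the bound (17), with made-up problem constants (these numbers are hypothetical, chosen only to show the scaling):

```python
import math

# Hypothetical constants: n blocks, blockwise Lipschitz constant L, initial
# distance Dist0, target accuracy eps, and failure probability rho.
n, L, Dist0, eps, rho = 1000, 1.0, 10.0, 0.01, 0.05
k = 2 * n * L * Dist0**2 / eps * (1 + math.log(1 / rho)) + 2
print(f"run at least {math.ceil(k):,} iterations")   # roughly 8 * 10^7 here
```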

Suppose that Algorithm 3 is initiated at x0 ← 0. Then due to the update rule in Step (2.) of the algorithm, it holds by induction that xk will use at most k different blocks, for all k. This sparsity guarantee is particularly useful in the case of huge-scale problems, where N ≫ 0 might be on the order of 10^6 or larger.

4.2 Proof of Theorem 3

Let us first develop some straightforward properties of the iterate step of random block coordinate descent. Let xk be the current iterate and let i denote the randomly chosen index at iteration k. Then it holds that:

f(xk+1) = f(xk − (1/L) Ui ∇if(xk))

         ≤ f(xk) − (1/L)〈∇if(xk), ∇if(xk)〉 + (L/2)‖(1/L)∇if(xk)‖(i)²   (from Prop. 5)

         = f(xk) − (1/(2L))‖∇if(xk)‖(i)² ,     (18)

which shows in particular that the iterate values f(xk) are decreasing and therefore xk ∈ S0 for all k ≥ 0.

We now provide a lower bound on ‖∇f(xk)‖. Because xk ∈ S0 there exists x∗ ∈ S∗ that satisfies ‖x∗ − xk‖ ≤ Dist0. From the gradient inequality for convex functions we have:

f ∗ = f(x∗) ≥ f(xk)+〈∇f(xk), x∗−xk〉 ≥ f(xk)−‖∇f(xk)‖‖x∗−xk‖ ≥ f(xk)−‖∇f(xk)‖Dist0 ,

which rearranges to:

‖∇f(xk)‖ ≥ (f(xk) − f∗)/Dist0 .     (19)

We will also need the following technical result, whose proof is deferred to the end of this section.

Lemma 1. Fix x0 ∈ RN and let {xk}k≥0 be a sequence of random vectors in RN with xk+1 depending only on xk. Let φ : RN → R be a nonnegative function and define ξk = φ(xk). Lastly, choose accuracy level 0 < ε < ξ0 and error tolerance ρ ∈ (0, 1), and assume that the sequence of random variables {ξk}k≥0 is nonincreasing and has the following property:


(i) E[ξk+1 | xk] ≤ ξk − ξk²/c1 , for all k, where c1 > 0 is a constant.

If ε < c1 and

K ≥ (c1/ε)(1 + ln(1/ρ)) + 2 − c1/ξ0 ,     (20)

then

P(ξK ≤ ε) ≥ 1 − ρ .     (21)

We are now ready to prove Theorem 3.

Proof. We first provide a lower bound on the expected decrease of the objective function during one iteration of the method:

E[f(xk+1) | xk] − f(xk) ≤ ∑_{i=1}^n (1/n) [ −(1/(2L))‖∇if(xk)‖(i)² ]   (from (18))

                        = −(1/(2nL)) ‖∇f(xk)‖²

                        ≤ −(f(xk) − f∗)² / (2nL(Dist0)²) ,   (from (19))

which upon subtracting f ∗ from both sides rearranges to:

E[f(xk+1) − f∗ | xk] ≤ f(xk) − f∗ − (f(xk) − f∗)²/(2nL(Dist0)²) .

We now apply Lemma 1 with ξk = f(xk) − f∗ and c1 = 2nL(Dist0)² to obtain that P(f(xk) ≤ f∗ + ε) = P(ξk ≤ ε) ≥ 1 − ρ holds for all k whose value is at least:

(c1/ε)(1 + ln(1/ρ)) + 2 − c1/ξ0 ≤ (c1/ε)(1 + ln(1/ρ)) + 2 = (2nL(Dist0)²/ε)(1 + ln(1/ρ)) + 2 .

We end this section with the proof of Lemma 1.

Proof of Lemma 1: Define the sequence {ξεk}k≥0 by:

ξεk = ξk if ξk ≥ ε , and ξεk = 0 otherwise ,     (22)

and notice that ξεk > ε ⇔ ξk > ε and also ξεk ≤ ξk for all k. Therefore by the Markov inequality it holds that P(ξk > ε) = P(ξεk > ε) ≤ E[ξεk]/ε, and hence it suffices to show that

θK ≤ ερ ,     (23)


where θk := E[ξεk]. Towards the proof of (23) we first prove the following two inequalities:

E[ξεk+1 | xk] ≤ ξεk − (ξεk)²/c1 ,   k ≥ 0 ,     (24)

and

E[ξεk+1 | xk] ≤ (1 − ε/c1) ξεk ,   k ≥ 0 .     (25)

We consider two cases, where the first case is ξk ≥ ε. Then ξεk = ξk ≥ ε and ξεk+1 ≤ ξεk = ξk. Therefore it follows from the hypothesis (i) of the lemma that:

E[ξεk+1 | xk] ≤ E[ξk+1 | xk] ≤ ξk − ξk²/c1 = ξεk − (ξεk)²/c1 ≤ ξεk − (ξεk)ε/c1 = (1 − ε/c1) ξεk ,

which proves (24) and (25) in this case. Next consider the second case wherein ξk < ε. In this case ξεk = 0 and also ξεk+1 = 0, from which (24) and (25) follow simply since all terms therein are identically zero.

Next notice from the convexity of the function t ↦ t² that E[(ξεk)²] ≥ (E[ξεk])² = θk². By taking expectations in (24) and (25) and using the above inequality we obtain:

θk+1 ≤ θk − θk²/c1 ,   k ≥ 0 ,     (26)

and

θk+1 ≤ (1 − ε/c1) θk ,   k ≥ 0 .     (27)

Notice that the right-hand side of (26) is less than the right-hand side of (27) precisely when θk > ε. Since

1/θk+1 − 1/θk = (θk − θk+1)/(θk+1 θk) ≥ (θk − θk+1)/θk² ≥ 1/c1 ,

(where the last inequality above uses (26)), we have 1/θk ≥ 1/θ0 + k/c1 = 1/ξ0 + k/c1. Therefore, if we let k1 ≥ c1/ε − c1/ξ0, we obtain θk1 ≤ ε. Finally, letting k2 ≥ (c1/ε) ln(1/ρ), we obtain (23) as follows:

θK ≤ θk1+k2 ≤ (1 − ε/c1)^{k2} θk1 ≤ ( (1 − ε/c1)^{1/ε} )^{c1 ln(1/ρ)} ε ≤ ( e^{−1/c1} )^{c1 ln(1/ρ)} ε = ερ ,

where the first inequality above uses (20), the second inequality uses (27), and the third inequality uses the convexity of the function t ↦ e^t.


5 Computation Exercises

1. Binary Classification with a particular data set. This is under construction.

2. Binary Classification with another data set. This is under construction.


References

[1] A. Beck and L. Tetruashvili, On the convergence of block coordinate descent type methods, SIAM Journal on Optimization 23 (2013), no. 2, 2037–2060.

[2] K.L. Clarkson, Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm, 19th ACM-SIAM Symposium on Discrete Algorithms (2008), 922–931.

[3] V. Demyanov and A. Rubinov, Approximate methods in optimization problems, American Elsevier Publishing Co., New York, 1970.

[4] J. Dunn and S. Harshbarger, Conditional gradient algorithms with open loop step size rules, Journal of Mathematical Analysis and Applications 62 (1978), 432–444.

[5] M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly 3 (1956), 95–110.

[6] Robert M. Freund and Paul Grigas, New analysis and results for the Frank-Wolfe method, to appear in Mathematical Programming (2014).

[7] Martin Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization, Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 427–435.

[8] E. Levitin and B. Polyak, Constrained minimization methods, USSR Computational Mathematics and Mathematical Physics 6 (1966), 1.

[9] Yurii Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization 22 (2012), no. 2, 341–362.

[10] B. Polyak, Introduction to optimization, Optimization Software, Inc., New York, 1987.

[11] Peter Richtarik and Martin Takac, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Mathematical Programming 144 (2014), no. 1-2, 1–38.
