34
Iteration-complexity of first-order penalty methods for convex programming * Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing (CP) problems whose feasible region is a simple compact convex set intersected with the inverse image of a closed convex cone under an affine transformation. We study two first-order penalty methods for solving the above class of problems, namely: the quadratic penalty method and the exact penalty method. In addition to one or two gradient evaluations, an iteration of these methods requires one or two projections onto the simple convex set. We establish the iteration-complexity bounds for these methods to obtain two types of near optimal solutions, namely: near primal and near primal- dual optimal solutions. Finally, we present variants, with possibly better iteration-complexity bounds than the aforementioned methods, which consist of applying penalty-based methods to the perturbed problem obtained by adding a suitable perturbation term to the objective function of the original CP problem. Keywords: Convex programming, quadratic penalty method, exact penalty method, Lagrange multiplier AMS 2000 subject classification: 90C25, 90C06, 90C22, 49M37 1 Introduction The basic problem of interest in this paper is the convex programming (CP) problem f * := inf {f (x): A(x) ∈K * ,x X }, (1) where f : X IR is a convex function with Lipschitz continuous gradient, X ⊆< n is a sufficiently simple closed convex set, A : < n →< m is an affine function, and K * denotes the dual cone of a closed convex cone K⊆< m , i.e., K * := {s ∈< m : hs, xi≥ 0, x ∈ K}. For the case where the feasible region consists only of the set X , or equivalently A≡ 0, Nesterov ([2, 4]) developed a method which finds a point x X such that f (x) - f * in at most O( -1/2 ) * The work of both authors were partially supported by NSF Grants CCF-0430644 and CCF-0808863 and ONR Grant N00014-08-1-0033. School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332-0205. (email: [email protected]). School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332-0205. (email: [email protected]). 1

Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

Iteration-complexity of first-order

penalty methods for convex programming ∗

Guanghui Lan † Renato D.C. Monteiro ‡

July 24, 2008

Abstract

This paper considers a special but broad class of convex programing (CP) problems whosefeasible region is a simple compact convex set intersected with the inverse image of a closed convexcone under an affine transformation. We study two first-order penalty methods for solving theabove class of problems, namely: the quadratic penalty method and the exact penalty method.In addition to one or two gradient evaluations, an iteration of these methods requires one or twoprojections onto the simple convex set. We establish the iteration-complexity bounds for thesemethods to obtain two types of near optimal solutions, namely: near primal and near primal-dual optimal solutions. Finally, we present variants, with possibly better iteration-complexitybounds than the aforementioned methods, which consist of applying penalty-based methods tothe perturbed problem obtained by adding a suitable perturbation term to the objective functionof the original CP problem.

Keywords: Convex programming, quadratic penalty method, exact penalty method, Lagrangemultiplier

AMS 2000 subject classification: 90C25, 90C06, 90C22, 49M37

1 Introduction

The basic problem of interest in this paper is the convex programming (CP) problem

f∗ := inff(x) : A(x) ∈ K∗, x ∈ X, (1)

where f : X → IR is a convex function with Lipschitz continuous gradient, X ⊆ <n is a sufficientlysimple closed convex set, A : <n → <m is an affine function, and K∗ denotes the dual cone of aclosed convex cone K ⊆ <m, i.e., K∗ := s ∈ <m : 〈s, x〉 ≥ 0, ∀x ∈ K.

For the case where the feasible region consists only of the set X, or equivalently A ≡ 0, Nesterov([2, 4]) developed a method which finds a point x ∈ X such that f(x)− f ∗ ≤ ε in at most O(ε−1/2)

∗The work of both authors were partially supported by NSF Grants CCF-0430644 and CCF-0808863 and ONRGrant N00014-08-1-0033.

†School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332-0205. (email:[email protected]).

‡School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332-0205. (email:[email protected]).

1

Page 2: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

iterations. Moreover, each iteration of his method requires one gradient evaluation of f and computa-tion of two projections onto X. It is shown that his method achieves, uniformly in the dimension, thelower bound on the number of iterations for minimizing convex functions with Lipschitz continuousgradient over a closed convex set. When A is not identically 0, Nesterov’s optimal method can stillbe applied directly to problem (1) but this approach would require the computation of projectionsonto the feasible region X ∩ x : A(x) ∈ K∗, which for most practical problems is as expensive assolving the original problem itself.

In this paper, we are interested in alternative first-order methods for (1) whose iterations consistof, in addition to a couple of gradient evaluations, only projections onto the simple set X. Morespecifically, we present two penalty-based approaches for solving (1), namely: the quadratic andthe exact penalty methods, and establish their iteration-complexity bounds for computing two typesof near optimal solutions of (1). It is well-known that the penalty parameters of the penalizationproblems for the above penalty-based approaches must be chosen sufficiently large so that nearoptimal solutions of the penalization problems yield near optimal solutions for the original problem(1). Accordingly, we establish theoretical lower bounds on the penalty parameters that depend notonly on the desired solution accuracies but also on the size ‖λ∗‖ of the minimum norm Lagrangemultiplier associated with the constraint A(x) ∈ K∗. Theoretically, setting the penalty parametersto their corresponding lower bounds would yield the lowest provably iteration-complexity bounds,and hence would be the best strategy to pursue. But since ‖λ∗‖ is not known a priori, we presentalternative penalty-based approaches based on simple “guess-and-check” procedures for these penaltyparameters whose iteration-complexity bounds are of the same order as the approach in which ‖λ∗‖is known as priori, or equivalently, the theoretical lower bound on the penalty parameter is available.Finally, we present variants of the aforementioned methods, with possibly better iteration-complexitybounds, which consist of applying penalty-based methods to the problem obtained by adding asuitable perturbation term to the objective function of the original CP problem.

The paper is organized as follows. In Section 2, we give a more detailed description of problem(1) and definitions of two types of approximate solutions of (1). Section 3 discusses some technicalresults that will be used in our analysis and reviews two first order algorithms due to Nesterov[2, 4] for solving special classes of CP problems. In Section 4, we establish the iteration-complexityof quadratic penalty based methods for solving (1). More specifically, we present the iteration-complexity of the quadratic penalty method applied to (1) in Subsection 4.1, and of a variant, whichconsists of applying the quadratic penalty approach to the perturbed problem obtained by addinga suitable perturbation term to the objective function of (1), in Subsection 4.2. In Section 5, weestablish the iteration-complexity of exact penalty based methods for solving (1). More specifically,we present the iteration-complexity of the exact penalty applied directly to (1) in Subsection 5.1,and of a variant, which consists of applying the exact penalty method to the perturbed problemcorresponding to (1), in Subsection 5.2.

1.1 Notation and terminology

We denote the p-dimensional Euclidean space by IRp. Also, IR

p+ and IR

p++ denote the nonnegative

and the positive orthants of IRp, respectively. In this paper, we use the notation <p to denote a p-

dimensional vector space inherited with a inner product space 〈·, ·〉. The dual norm ‖ · ‖∗ associatedwith an arbitrary norm ‖ · ‖ on <p is defined as

‖s‖∗ := maxx〈s, x〉 : ‖x‖ ≤ 1, ∀s ∈ <p.

2

Page 3: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

Note that if the norm ‖·‖ is the one induced by the inner product, i.e. ‖·‖ = 〈·, ·〉1/2, then ‖·‖∗ ≡ ‖·‖.Given a closed convex set C ∈ <p, we define the distance function dC : <p → IR to C with respect

to ‖·‖ as dC(u) := min‖u−c‖ : c ∈ C for every u ∈ <p. It is well-known that this minimum is alwaysachieved at some c ∈ C. Moreover, this minimizer is unique whenever ‖ · ‖ is a inner product norm.In such a case, we denote this unique minimizer by ΠC(u), i.e., ΠC(u) = argmin‖u− c‖ : c ∈ C forevery u ∈ <p. The support function of a set C ⊂ <p is defined as σC(u) := sup〈u, c〉 : c ∈ C.

2 Problem of interest

Consider inner product spaces <n and <m endowed with arbitrary norms, both denoted simply by‖ · ‖. In this paper, we consider the CP problem (1) where f : X → IR is a convex function withLf -Lipschitz-continuous gradient, i.e.:

‖∇f(x)−∇f(x)‖∗ ≤ Lf‖x− x‖, ∀x, x ∈ X. (2)

We define the norm of the map A as being the operator norm of its linear part A0 := A(·) −A(0),i.e.:

‖A‖ := ‖A0‖ = max‖A0(x)‖∗ : ‖x‖ ≤ 1 = max‖A(x)−A(0)‖∗ : ‖x‖ ≤ 1.The Lagrangian dual function and value function associated with (1) are defined as

d(λ) := inff(x) + 〈λ,A(x)〉 : x ∈ X, ∀λ ∈ −K, (3)

v(u) := inff(x) : A(x) + u ∈ K∗, x ∈ X, ∀u ∈ <m. (4)

We make the following assumptions throughout the paper:

Assumption 1 A.1) the set X is bounded (and hence f ∗ ∈ IR);

A.2) there exists a Lagrange multiplier for (1), i.e., a vector λ∗ ∈ −K such that f ∗ = d(λ∗).

Throughout this paper, we assume that the norm ‖ · ‖ in <m is induced by the inner pr. Firstnote that x∗ is an optimal solution of (1) if, and only if, x∗ ∈ X, A(x∗) ∈ K∗ and f(x∗) ≤ f∗. Thisobservation leads us to our first definition of a near optimal solution x ∈ X of (1), which essentiallyrequires the primal infeasibility measure dK∗(A(x)) and the primal optimality gap [f(x) − f ∗]+ tobe both small.

Definition 1 For a given pair (εp, εo) ∈ IR++ × IR++, x ∈ X is called an (εp, εo)-primal solution for(1) if

dK∗(A(x)) ≤ εp and f(x)− f ∗ ≤ ε0. (5)

One drawback of the above notion of near optimality of x is that it says nothing about the size of[f(x)− f ∗]−. We will now show that this quantity can be bounded as [f(x)− f ∗]− ≤ εp‖λ∗‖, whereλ∗ is an arbitrary Lagrange multiplier for (1). We start by stating the following result whose proofis given in the Appendix.

Proposition 1 Let K ⊆ <m be a closed convex cone. Then, the following statements hold:

a) dK∗ = σC , where C := (−K) ∩B(0, 1), where B(0, 1) := u ∈ <m : ‖u‖ ≤ 1;

3

Page 4: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

b) for every u ∈ <m and λ ∈ K, we have 〈u, λ〉 ≥ −‖λ‖ dK∗(u).

As a consequence of the above result, we obtain the following technical inequality which will befrequently used in our analysis.

Corollary 2 Let λ∗ be an Lagrange multiplier for (1). Then, for every x ∈ X, we have f(x)− f ∗ ≥−‖λ∗‖ dK∗(A(x)).

Proof. It is well-known that our assumptions imply that v is a convex function such that λ∗ ∈∂v(0), and hence that

v(u)− v(0) ≥ 〈λ∗, u〉 = 〈 − λ∗,−u〉 ≥ −‖ − λ∗‖ dK∗(−u) = −‖λ∗‖ dK∗(−u), ∀u ∈ <m,

where the inequality follows from Proposition 1(b) and the fact that −λ∗ ∈ K. Now, let x ∈ X begiven. Since x is clearly feasible for problem (4) with u = −A(x), the definition of v(·) in (4) impliesthat v(u) ≤ f(x). Hence,

f(x)− f ∗ ≥ v(u)− v(0) ≥ −‖λ∗‖ dK∗(−u) = −‖λ∗‖ dK∗(A(x)).

Corollary 3 If x ∈ X is an εp-feasible solution, i.e., if dK∗(A(x)) ≤ εp, then f(x)− f ∗ ≥ −εp ‖λ∗‖,where λ∗ is an arbitrary Lagrange multiplier for (1).

Another way of defining a near optimal solution of (1) is based on the following optimalityconditions: x∗ ∈ X is an optimal solution of (1) and λ∗ ∈ −K is a Lagrange multiplier for (1) if, andonly if, (x, λ) = (x∗, λ∗) satisfies

A(x) ∈ K∗, 〈λ,A(x)〉 = 0,

∇f(x) + (A0)∗ λ ∈ −NX(x),

(6)

where NX(x) := s ∈ <n : 〈s, x − x〉 ≤ 0, ∀x ∈ X denotes the normal cone of X at x. Based onthis observation, we can introduce our second definition of a near optimal solution of (1).

Definition 2 For a given pair (εp, εd) ∈ IR++× IR++, (x, λ) ∈ X × (−K) is called an (εp, εd)-primal-

dual solution of (1) if

dK∗(A(x)) ≤ εp, 〈λ,ΠK∗(A(x))〉 = 0, (7)

∇f(x) + (A0)∗ λ ∈ −NX(x) + B(εd), (8)

where B(η) := x ∈ <n : ‖x‖ ≤ η for every η ≥ 0.

In Sections 4 and 5, we will study the iteration-complexities of well-known penalty methods,namely the quadratic and the exact penalty methods, for computing the two types of near optimalsolutions defined in this section.

4

Page 5: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

3 Technical results

This section discusses some technical results that will be used in our analysis and reviews two firstorder algorithms due to Nesterov [2, 4] for solving special classes of CP problems. It consists of threesubsections. The first one develops several technical results involving projected gradients. The secondsubsection reviews Nesterov’s optimal method for solving a class of smooth CP problems. The thirdsubsection reviews Nesterov’s smooth approximation scheme for solving a class of non-smooth CPproblems that can be reformulated as smooth convex-concave saddle point (i.e., min-max) problems.

3.1 Projected gradient and the optimality conditions

In this subsection, we consider an inner product space <n endowed with the norm ‖ · ‖ associatedwith its inner product. We consider the CP problem

φ∗ := minx∈X

φ(x), (9)

where X ⊂ <n is a convex set and φ : X → IR is a convex function that has Lφ-Lipschitz-continuousgradient over X with respect to the norm ‖ · ‖.

It is well-known that x∗ ∈ X is an optimal solution of (9) if and only if ∇φ(x∗) ∈ −NX(x∗).Moreover, this optimality condition is in turn related to the projected gradient of the function φ overX defined as follows.

Definition 3 Given a fixed constant τ > 0, we define the projected gradient of φ at x ∈ X withrespect to X as (see, for example, [3])

∇φ(x)]τX :=1

τ[x−ΠX(x− τ∇φ(x))] , (10)

where ΠX(·) is the projection map onto X defined in terms of the inner product norm ‖ · ‖ (seeSubsection 1.1).

The following proposition relates the projected gradient to the aforementioned optimality condi-tion.

Proposition 4 Let x ∈ X be given and define x+ := ΠX(x− τ∇φ(x)). Then, for any given ε ≥ 0,the following statements hold:

a) ‖∇φ(x)]τX‖ ≤ ε if, and only if, ∇φ(x) ∈ −NX(x+) + B(ε);

b) ‖∇φ(x)]τX‖ ≤ ε implies that ∇φ(x+) ∈ −NX(x+) + B ((1 + τLφ)ε).

Proof. To simplify notation, define v := ∇φ(x)]τX . By (10) and well-known properties of theprojection operator ΠX , we have

v = ∇φ(x)]τX ⇔ x− τv = ΠX (x− τ∇φ(x))

⇔ 〈x− τ∇φ(x)− (x− τv) , y − (x− τv)〉 ≤ 0, ∀y ∈ X⇔ 〈v −∇φ(x), y − (x− τv)〉 ≤ 0, ∀y ∈ X⇔ 〈v −∇φ(x), y − x+〉 ≤ 0, ∀y ∈ X⇔ ∇φ(x)− v ∈ −NX(x+).

5

Page 6: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

from which statement a) clearly follows. To show b), assume that ‖v‖ ≤ ε. Then, we have

‖x+ − x‖ ≤ ‖ΠX(x− τ∇φ(x))− x‖ = ‖(x− τv)− x‖ = τ‖v‖ ≤ τε.

which, together with the assumption that φ has Lφ-Lipschitz-continuous gradient, implies that‖∇φ(x+) − ∇φ(x)‖ ≤ τLφε. Statement b) now follows immediately from the latter conclusionand statement a).

The following lemma summarizes some interesting properties of the projected gradient.

Lemma 5 Let x ∈ X be given and x+ := ΠX(x − τ∇φ(x)). Denoting g(·) := ∇φ(·), gX(·) :=∇φ(·)]τX . We have:

a) φ(x+)− φ(x) ≤ −τ‖gX(x)‖2/2 for any τ ≤ 1/Lφ;

b) For any x ∈ X, there holds〈g(x)− gX(x), x− x+〉 ≥ 0; (11)

in particular, setting x = x, we obtain

〈g(x)− gX(x), gX (x)〉 ≥ 0; (12)

c) For any x ∈ X, there holds

φ(x+)− φ(x) ≤ (1 + τLφ) ‖gX(x)‖ ‖x+ − x‖. (13)

d) If τ = 1/Lφ in definition (10), then

φ(x)− φ(x∗) ≥ 1

2Lφ‖gX(x)‖2, ∀x ∈ X, (14)

where x∗ ∈ Argminx∈Xφ(x).

Proof. a) This statement is proved in p. 87 of [3].b) Noting that x+ = x− τgX(x) = ΠX(x − τg(x)), it follows from well-known properties of the

projection map ΠX that

〈x− (x− τgX), x− τg(x)− (x− τgX(x))〉 = 〈x− (x− τgX(x)), τ(gX (x)− g(x))〉= 〈x− x+, τ(gX(x)− g(x))〉 ≤ 0, ∀x ∈ X,

which clearly implies statement b).c) It follows from from the convexity of φ(·), (11), the assumption that φ(·) has Lφ-Lipschitz-

continuous gradient, and definition (10) that

φ(x+)− φ(x) ≤ 〈g(x+), x+ − x〉= 〈g(x)− gX(x), x+ − x〉+ 〈gX(x), x+ − x〉+ 〈g(x+)− g(x), x+ − x〉≤ 〈gX(x), x+ − x〉+ 〈g(x+)− g(x), x+ − x〉≤ 〈gX(x), x+ − x〉+ Lφ‖x+ − x‖ ‖x+ − x‖= 〈gX(x), x+ − x〉+ τLφ‖gX(x)‖ ‖x+ − x‖≤ (1 + τLφ) ‖gX(x)‖ ‖x+ − x‖, ∀x ∈ X.

6

Page 7: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

d) Using the fact that Φ(x∗) ≤ Φ(x), ∀x ∈ X and the assumption that φ(·) has Lφ-Lipschitz-continuous gradient, we conclude that

φ(x∗) ≤ φ

(

x− 1

LφgX(x)

)

≤ φ(x)− 1

Lφ〈g(x), gX (x)〉+ 1

2Lφ‖gX(x)‖2

≤ φ(x)− 1

2Lφ‖gX(x)‖2

for any x ∈ X, where the last inequality follows from (12).

3.2 Nesterov’s Optimal Method

In this subsection, we discuss Nesterov’s smooth first-order method for solving a class of smooth CPproblems.

Our problem of interest is still the CP problem (9), which is assumed to satisfy the same assump-tions mentioned in Subsection 3.1, except that now we allow ‖ · ‖ to be an arbitrary norm. Moreover,we assume throughout our discussion that the optimal value φ∗ of problem (9) is finite and that itsset of optimal solutions is nonempty. Let h : X → IR be a differentiable strongly convex functionwith modulus σh > 0 with respect to ‖ · ‖, i.e.,

h(x) ≥ h(x) + 〈∇h(x), x− x〉+ σh

2‖x− x‖2, ∀x, x ∈ X. (15)

The Bregman distance dh : X ×X → IR associated with h is defined as

dh(x; x) ≡ h(x)− lh(x; x), ∀x, x ∈ X, (16)

where lh : <n ×X → IR is the “linear approximation” of h defined as

lh(x; x) = h(x) + 〈∇h(x), x− x〉, ∀(x, x) ∈ <n ×X.We are now ready to state Nesterov’s smooth first-order method for solving (9). We use the

superscript ‘sd’ in the sequence obtained by taking a steepest descent step and the superscript ‘ag’(which stands for ‘aggregated gradient’) in the sequence obtained by using all past gradients.

Nesterov’s Algorithm:

0) Let xsd0 = xag

0 ∈ X be given and set k = 0

1) Set xk = 2k+2x

agk + k

k+2xsdk and compute φ(xk) and φ′(xk).

2) Compute (xsdk+1, x

agk+1) ∈ X ×X as

xsdk+1 ∈ Argmin

lφ(x;xk) +Lφ

2‖x− xk‖2 : x ∈ X

, (17)

xagk+1 ≡ argmin

σhdh(x;x0) +

k∑

i=0

i+ 1

2[lφ(x;xi)] : x ∈ X

. (18)

3) Set k ← k + 1 and go to step 1.

end

The main convergence result established by Nesterov [4] regarding the above algorithm is sum-

7

Page 8: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

marized in the following theorem.

Theorem 6 The sequence xsdk generated by Nesterov’s optimal method satisfies

φ(xsdk )− φ∗ ≤ 4Lφ dh(x∗;xsd

0 )

σh k(k + 1), ∀k ≥ 1,

where x∗ is an optimal solution of (9). As a consequence, given any ε > 0, an iterate xsdk ∈ X

satisfying φ(xsdk )− φ∗ ≤ ε can be found in no more than

2

dh(x∗;xsd0 )Lφ

σhε

(19)

iterations.

The following result is as an immediate special case of Theorem 6.

Corollary 7 Suppose that ‖ · ‖ is a inner product norm and h : X → < is chosen as h(·) = ‖ · ‖2/2in Nesterov’s optimal method. Then, for any ε > 0, an iterate xsd

k ∈ X satisfying φ(xsdk ) − φ∗ ≤ ε

can be found in no more than⌈

‖xsd0 − x∗‖

2Lφ

ε

(20)

iterations, where x∗ is an optimal solution of (9).

Proof. If h(x) = ‖x‖2/2, then (16) implies that dh(x∗;xsd0 ) = ‖xsd

0 − x∗‖2/2. The corollary nowfollows from this bound and relation (19).

Now assume that the objective function φ is strongly convex over X, i.e., for some µ > 0,

〈∇φ(x)−∇φ(x), x− x〉 ≥ µ‖x− x‖2, ∀x, x ∈ X. (21)

Nesterov shows in Theorem 2.2.2 of [3] that, under the assumptions of Corollary 7, a variant of hisoptimal method finds a solution xk ∈ X satisfying φ(xk)− φ∗ ≤ ε in no more than

⌈√

µlog

Lφ ‖xsd0 − x∗‖2ε

(22)

iterations. The following result gives an iteration-complexity bound for Nesterov’s optimal methodthat replaces the term log(Lφ‖xsd

0 −x∗‖2/ε) in (22) with log(µ‖xsd0 −x∗‖2/ε). The resulting iteration-

complexity bound is not only sharper but also more suitable since it makes it easier for us to comparethe quality of the different bounds obtained in our analysis of first-order penalty methods.

Theorem 8 Let ε > 0 be given and suppose that the assumptions of Corollary 7 hold and that thefunction φ is strongly convex with modulus µ. Then, the variant where we restart Nesterov’s optimalmethod, with proximal function h(·) = ‖ · ‖2/2, every

K :=

⌈√

8Lφ

µ

(23)

8

Page 9: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

iterations finds a solution x ∈ X satisfying φ(x)− φ∗ ≤ ε in no more than

⌈√

8Lφ

µ

max1, dlogQe (24)

iterations, where

Q :=µ ‖xsd

0 − x∗‖22ε

(25)

and x∗ := argminx∈Xφ(x).

Proof. Denote D := ‖xsd0 −x∗‖. First consider the case where Q ≤ 1. Clearly we have ε ≥ µD2/2,

which, in view of Corollary 7, implies that the number of iterations is bounded by⌈

2√

Lφ/µ⌉

andhence bounded by (24).

We now show that bound (24) holds for the case when Q > 1. Let xj be the iterate obtainedat the end of (j − 1)-th restart after K iterations of Nesterov’s optimal method are performed. ByTheorem 6 with h(·) = ‖ · ‖2/2 and inequality (21), we have

φ(x1)− φ(x∗) ≤ 2Lφ ‖xsd0 − x∗‖2K2

≤ 2LφD2

K2

φ(xj)− φ(x∗) ≤ 2Lφ ‖xj−1 − x∗‖2K2

≤ 4Lφ

µK2

[

φ(xj−1)− φ(x∗)]

, ∀ j ≥ 2.

Using the above relations inductively, we conclude that

φ(xj)− φ(x∗) ≤ (4Lφ)j D2

2µj−1K2j.

Setting j = dlogQe and observing that

K2j ≥[

(

8Lφ

µ

)1

2

]2j

= 2j

(

4Lφ

µ

)j

≥ 2log Q

(

4Lφ

µ

)j

≥ µD2

(4Lφ)j

µj=D2

(4Lφ)j

µj−1,

we conclude that φ(xj)−φ(x∗) ≤ ε. Hence, the overall number of iterations is bounded by KdlogQe,or equivalently, by (24).

It is interesting to compare the complexity bounds of Corollary 7 and Theorem 8. Indeed, it canbe easily seen that there exists a threshold value Q > 0 such that the condition Q ≥ Q, where Qis defined in (25), implies that the bound (20) is always greater than or equal to the bound (24).Moreover, as Q goes to infinity, the ratio between (20) and (24) converges to infinity. Hence, whenthe function φ is strongly convex and Q satisfies (25), we will always use the bound (24) in ouranalysis.

3.3 Nesterov’s smooth approximation scheme

In this subsection, we discuss Nesterov’s smooth approximation scheme for solving a class of non-smooth CP problems which can be reformulated as special smooth convex-concave saddle point (i.e.,min-max) problems.

9

Page 10: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

Our problem of interest in this subsection is still problem (9). We assume that X ⊆ <n is aclosed convex set and φ : X → IR is a convex function given by

φ(x) := φ(x) + sup〈Ex, y〉 − ψ(y) : y ∈ Y , ∀x ∈ X, (26)

where φ : X → IR is a convex function with Lφ-Lipschitz-continuous gradient with respect to an

arbitrary norm ‖ · ‖ on <n, Y ⊆ <m is a compact convex set, φ : Y → IR is a continuous convexfunction, and E : <n → <m is an affine operator.

Unless stated explicitly otherwise, we assume that the CP problem (9) referred to in this subsec-tion is the one with its objective function given by (26). Also, we assume throughout our discussionthat f∗ is finite and that the set of optimal solutions of (9) is nonempty.

The function φ defined as in (26) is generally non-differentiable but can be closely approximatedby a function with Lipschitz-continuous gradient using the following construction due to Nesterov[4]. Let h : Y → IR be a continuous strongly convex function with modulus σh > 0 with respect to

an arbitrary norm ‖ · ‖ on <m satisfying minh(y) : y ∈ Y = 0. For some smoothness parameterη > 0, consider the following function

φη(x) = φ(x) + maxy

〈Ex, y〉 − ψ(y)− ηh(y) : y ∈ Y

. (27)

The next result, due to Nesterov [4], shows that φη is a function with Lipschitz-continuous gradientwith respect to ‖ · ‖, whose “closeness” to φ depends linearly on the parameter η.

Proposition 9 The following statements hold:

a) For every x ∈ X, we have φη(x) ≤ φ(x) ≤ φη(x) + ηDh, where

Dh := max

h(y) : y ∈ Y

; (28)

b) The function φη(x) has Lη-Lipschitz-continuous gradient with respect to ‖ · ‖, where

Lη := Lφ +‖E‖2ησh

. (29)

We are now ready to state Nesterov’s smooth approximation scheme to solve (9).

Nesterov’s smooth approximation scheme:

0) Let ε > 0 be given.

1) Let η := ε/(2Dh) and consider the approximation problem

φ∗η ≡ infφη(x) : x ∈ X. (30)

2) Apply Nesterov’s optimal method (or its variant) to (30) and terminate whenever an iteratexsd

k satisfying φ(xsdk )− φ∗ ≤ ε is found.

end

The following result describes the convergence behavior of the above scheme.

10

Page 11: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

Theorem 10 Nesterov’s smooth approximation scheme generates an iterate xsdk satisfying φ(xsd

k )−φ∗ ≤ ε in no more than

⌈√

8Dh(xsd0 )

σhε

(

2Dh‖E‖2σhε

+ Lφ

)

(31)

iterations, where Dh(xsd0 ) := maxx∈X dh(x;xsd

0 ) and Dh is defined in (28).

4 The quadratic penalty method

The goal of this section is to establish the iteration-complexity (in terms of Nesterov’s optimalmethod iterations) of the quadratic penalty method for solving (1). Specifically, we present theiteration-complexity of the classical quadratic penalty method in Subsection 4.1, and a variant ofthis method, for which a perturbation term is added into the objective function of (1), is discussedand analyzed in Subsection 4.2.

The basic idea underlying penalty methods is rather simple, namely: instead of solving problem(1) directly, we solve certain relaxations of (1) obtained by penalizing some violation of the constraintA(x) ∈ K∗. More specifically, in the case of the quadratic penalty method, given a penalty parameterρ > 0, we solve the relaxation

Ψ∗ρ := inf

x∈X

Ψρ(x) := f(x) +ρ

2[dK∗(A(x))]2

. (32)

In order to guarantee that the function Ψρ is smooth, we assume throughout this section that thedistance function dK∗ is defined with respect to an inner product norm ‖ · ‖ in <m. Note that in thiscase, we have ‖ · ‖∗ = ‖ · ‖.

We will now see that the objective function Ψρ of (32) has Lipschitz continuous gradient. Wefirst state the following well-known result which guarantees that the distance function has Lipschitzcontinuous gradient (see for example Proposition 15 of [1] for its proof).

Proposition 11 Given a closed convex set C ⊆ <m, consider the distance function dC : <m → IR

to C with respect some inner product norm ‖ · ‖ on <m. Then, the function ψ : <m → < definedas ψ(u) = [dC(u)]

2 is convex and its gradient is given by ∇ψ(u) = 2[u − ΠC(u)] for every u ∈ <m.Moreover, ‖∇ψ(u)−∇ψ(u)‖ ≤ 2‖u− u‖ for every u, u ∈ <m.

As an immediate consequence of Proposition 11, we obtain the following result.

Corollary 12 The function Ψρ has Mρ-Lipschitz continuous gradient where Mρ := Lf + ρ‖A‖2.

Proof. The differentiability of Ψρ follows immediately from the assumption that f is differen-tiable and Proposition 11. Moreover, it easily follows from the chain rule that ∇Ψρ(x) = ∇f(x) +ρA∗

0∇(d2K)(A(x))/2, which together with (2) and Proposition 11, then imply that

‖∇Ψρ(x1)−∇Ψρ(x2)‖∗ ≤ ‖∇f(x1)−∇f(x2)‖∗ +ρ

2‖A∗

0∇(d2K)(A(x1))−A∗

0∇(d2K)(A(x2))‖∗

≤ Lf‖x1 − x2‖+ρ

2‖A∗

0‖‖∇d2K(A(x1))−∇d2

K(A(x2))‖≤ Lf‖x1 − x2‖+ ρ‖A∗

0‖‖A0(x1 − x2)‖≤ Lf‖x1 − x2‖+ ρ‖A∗

0‖‖A0‖‖x1 − x2‖ = Mρ‖x1 − x2‖,

11

Page 12: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

for every x1, x2 ∈ X, where the last equality follows from the fact that ‖A‖ = ‖A0‖ = ‖A∗0‖.

Therefore, problem (32) can be solved, for example, by Nesterov’s optimal method or one of itsvariants (see for example [1, 2, 4]).

4.1 Quadratic penalty method applied to the original problem

In this subsection, we consider the application of the quadratic penalty method directly to the originalproblem (1), which consists of solving the penalized problem (32) for a suitably chosen penaltyparameter ρ. This subsection contains two subsubsections. The first one considers the complexity ofobtaining an (εp, εo)-primal solution of (1) while the second one discusses the complexity of computingan (εp, εd)-primal-dual solution of (1).

4.1.1 Computation of a primal approximate solution

In this subsection, we are interested in deriving the iteration-complexity involved in the computationof an (εp, εo)-primal solution of problem (1) by approximately solving the penalized problem (32) forsome value of the penalty parameter ρ.

Given an approximate solution x ∈ X for the penalized problem (32), the following result providesbounds on the primal infeasibility and optimality gap of x with respect to (1).

Proposition 13 If x ∈ X is a δ-approximate solution of (32), i.e., it satisfies

Ψρ(x)−Ψ∗ρ ≤ δ, (33)

then

dK∗(A(x)) ≤ 2

ρ‖λ∗‖+

ρ(34)

f(x)− f ∗ ≤ δ, (35)

where λ∗ is an arbitrary Lagrange multiplier associated with (1).

Proof. Using the fact that v(0) = f ∗ ≥ Ψ∗ρ, Corollary 2 and assumption (33), we conclude that

δ ≥ Ψρ(x)−Ψ∗ρ = f(x) +

ρ

2[dK∗(A(x))]2 −Ψ∗

ρ

≥ f(x)− f ∗ +ρ

2[dK∗(A(x))]2 ≥ −‖λ∗‖ dK∗(A(x)) +

ρ

2[dK∗(A(x))]2,

which clearly implies both (34) and (35).

For the sake of future reference, we state the following simple result.

Lemma 14 Let αi, i = 0, 1, 2, be given positive constants. Then, the only positive scalar ρ satisfyingthe equation α2ρ

−1 + α1ρ−1/2 = α0 is given by

ρ =

(

α1 +√

α21 + 4α0α2

2α0

)2

.

12

Page 13: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

With the aid of Proposition 13, we can now derive an iteration-complexity bound for Nesterov’smethod applied to the quadratic penalty problem (32) to compute an (εp, εo)-primal solution of (1).

Theorem 15 Let λ∗ be an arbitrary Lagrange multiplier for (1) and let (εp, εo) ∈ IR++ × IR++ begiven. If

ρ = ρp(t) :=

(√εo +

εo + 4εp t√2εp

)2

(36)

for some t ≥ ‖λ∗‖, then Nesterov’s optimal method applied to problem (32) finds an (εp, εo)-primalsolution of (1) in at most

Np(t) :=

2

Dh(xsd0 )

σh

Lf

εo+

√2‖A‖εp

+ ‖A‖√

2t

εpεo

, (37)

iterations, where Dh(xsd0 ) is defined in Theorem 10. In particular, if ‖ · ‖ is the inner product norm

on <n and h(·) = ‖ · ‖2/2, then the above bound reduces to

Np(t) :=

√2DX

Lf

εo+

√2‖A‖εp

+ ‖A‖√

2t

εpεo

, (38)

where DX := maxx1,x2∈X ‖x1 − x2‖.

Proof. Let x ∈ X satisfies (33) with δ = εo. Proposition 13, the assumption that t ≥ ‖λ∗‖,relation (36) and Lemma 14 with α0 = εp, α1 =

√εo and α2 = t then imply that f(x)− f ∗ ≤ εo and

dK∗(A(x)) ≤ 2

ρp(t)‖λ∗‖+

2εoρp(t)

≤ 2t

ρp(t)+

2εoρp(t)

= εp,

and hence that x is an (εp, εo)-primal solution of (1). In view of Corollary 7, Nesterov’s optimalmethod finds an approximate solution x as above in at most

2

Dh(xsd0 )Mρ

σhε0

iterations, where Mρ := Lf +ρ‖A‖2. Substituting the value of ρ given by (36) in this iteration boundand using some trivial majorization, we obtain bound (37).

We now make a few observations regarding Theorem 15. First, the choice of ρ given by (36)requires that t ≥ ‖λ∗‖ so as to guarantee that an εo-approximate solution x of (32) satisfiesdK∗(A(x)) ≤ εp. Second, note that the iteration-complexity Np(t) obtained in Theorem 15 achievesits minimum possible value over the interval t ≥ ‖λ∗‖ exactly when t = ‖λ∗‖. However, since thequantity ‖λ∗‖ is not known a priori, it is necessary to use a “guess and check” procedure for t so asto develop a scheme for computing an (εp, εo)-primal solution of (1) whose iteration-complexity hasthe same order of magnitude as the ideal one in which t = ‖λ∗‖.

We now describe the aforementioned “guess and check” procedure for t.

Search Procedure 1:

13

Page 14: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

1) Set k = 0 and define

β0 := 2

Dh(xsd0 )

σh

(

Lf

εo+

√2‖A‖εp

)

, β1 := 2‖A‖√

2Dh(xsd0 )

σhεpεo, t0 :=

[

max(1, β0)

β1

]2

. (39)

2) Set ρ = ρp(tk), and perform at most dNp(tk)e iterations of Nesterov’s optimal method appliedto problem (32). If an iterate x is obtained such that (33) with δ = εo and dK∗(A(x)) ≤ εp aresatisfied, then stop; otherwise, go to step 3;

3) Set tk+1 = 2tk, k = k + 1, and go to step 2.

Before establishing the iteration-complexity of the above procedure, we state a technical result,namely Lemma 17, which will be used more than once in the paper. Lemma 16 states a simpleinequality which is also used in a few times in our presentation.

Lemma 16 For any scalars τ ≥ 0, x > 0, and α ≥ 0, we have τx+ α ≤ (τ + α)dxe.

Lemma 17 Let a positive scalar p be given. Then, there exists a constant C = C(p) such that forany positive scalars β1, β2, t, we have

K∑

k=0

dβ1 + β2tpke ≤ Cdβ1 + β2 t

pe, (40)

where tk = t02k for k = 1, . . . ,K, and

t0 :=

(

max(β1, 1)

β2

)1/p

, K := max

0,

log

(

t

t0

)⌉

. (41)

Proof. Assume first that t ≤ t0. Due to the definition of K in (41), we have K = 0 in this case,and hence that

K∑

k=0

dβ1 + β2tpke = dβ1 + β2t

p0e = dβ1 + max(β1, 1)e ≤ 2β1 + 2 ≤ 4dβ1e ≤ 4dβ1 + β2 t

pe,

where the second equality follows from the definition of t0. Hence, in the case where t ≤ t0, inequality(40) holds with C = 4.

Assume now that t > t0. By the definition of K in (41), we have K = dlog(t/t0)e, from which weconclude that K < log(t/t0) + 1, and hence that t02

K+1 < 4t. Using these relations, the inequality

14

Page 15: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

log x = (log xp)/p ≤ xp/p for any x > 0, and the definition of t0 in (41), we obtain

K∑

k=0

dβ1 + β2tpke ≤

K∑

k=0

1 + β1 + β2 tp0 2pk ≤ (1 + β1)(1 +K) + β2t

p0

2(K+1)p

2p − 1

≤ (1 + β1)

[

2 + log

(

t

t0

)]

+ β2(4t)p

2p − 1

≤ (1 + β1)

[

2 +1

p

(

t

t0

)p]

+4p

2p − 1β2t

p

≤ (1 + β1)

[

2 +1

p

(

β2 tp

max(β1, 1)

)]

+4p

2p − 1β2 t

p

≤ 2(1 + β1) +2

pβ2 t

p +4p

2p − 1β2 t

p

≤ 2 + max

2,2

p+

4p

2p − 1

(β1 + β2tp),

which, in view of Lemma 16, implies that (40) holds with

C = C(p) := 2 + max

2,2

p+

4p

2p − 1

.

The following result gives the iteration-complexity of Search Procedure 1 for obtaining an (εp, εo)-primal solution of (1).

Corollary 18 Let λ∗ be the minimum norm Lagrange multiplier for (1). Then, the overall numberof iterations of Search Procedure 1 for obtaining an (εp, εo)-primal solution of (1) is bounded byO(Np(‖λ∗‖)), where Np(·) is defined in (37).

Proof. In view of Theorem 15, the iteration count k in Search Procedure 1 can not exceedK := max0, dlog(‖λ∗‖/t0)e, and hence its overall number of inner (i.e., Nesterov’s optimal method)

iterations is bounded by∑K

k=0Np(tk) =∑K

k=0dβ0 + β1t1/2k e, where β0 and β1 are defined by (39).

The result now follows from the definition of t0 in (39) and (40) with p1 = 1/2 and t = ‖λ∗‖.

4.1.2 Computation of a primal-dual approximate solution

In this subsection, we are interested in deriving the iteration-complexity involved in the computationof an (εp, εd)-primal-dual solution of problem (1) by approximately solving the penalized problem(32) for certain value of the penalty parameter ρ. We assume throughout this subsection that thenorms on <n and <m are the ones associated with their respective inner products.

Given an approximate solution x ∈ X for the penalized problem (32), the following result showsthat there exists a pair (x+, λ) ∈ X×(−K) depending on x that approximately satisfies the optimalityconditions (6).

15

Page 16: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

Proposition 19 If x ∈ X is a δ-approximate solution of (32), i.e., it satisfies (33), then the pair(x+, λ) defined as

x+ := ΠX(x−∇Ψρ(x)/Mρ), (42)

λ := ρ[A(x+)−ΠK∗(A(x+))] (43)

is in X × (−K) and satisfies the second relation in (7) and the relations

dK∗(A(x+)) ≤ 1

ρ‖λ∗‖+

δ

ρ, (44)

∇f(x+) + (A0)∗ λ ∈ −NX(x+) + B(2

2Mρδ), (45)

where λ∗ is an arbitrary Lagrange multiplier for (1).

Proof. It follows from Lemma 5(a) with φ = Ψρ and τ = Mρ that Ψρ(x+) ≤ Ψρ(x), and hence

that Ψρ(x+)−Ψ∗

ρ ≤ δ in view of (33). By Proposition 13, this implies that (44) holds. Moreover, byLemma 5(d) with φ = Ψρ and Lφ = Mρ and assumption (33), we have

‖∇Ψρ(x)]1/Mρ

X ‖ ≤√

2Mρ[Ψρ(x)−Ψ∗ρ] ≤

2Mρδ.

Relation (45) now follows from this inequality, Proposition 4(b) with Lφ = Mρ, τ = 1/Mρ andε =

2Mρδ, and the fact that ∇Ψρ(x+) = ∇f(x+) + (A0)

∗ λ, where λ is given by (43). Finally,λ ∈ (−K) and the second relation of (7) holds in view of the definition of λ and well-known propertiesof the projection operator onto a cone.

With the aid of Proposition 19, we can now derive an iteration-complexity bound for Nesterov’smethod applied to the quadratic penalty problem (32) to compute an (εp, εd)-primal-dual solutionof (1).

Theorem 20 Let λ∗ be an arbitrary Lagrange multiplier for (1) and let (εp, εd) ∈ IR++ × IR++ begiven. If

ρ = ρpd(t) :=1

εp

(

t+εd√8‖A‖

)

(46)

for some t ≥ ‖λ∗‖, then Nesterov’s optimal method with h(·) = ‖ · ‖2/2 applied to problem (32) findsan (εp, εd)-primal-dual solution of (1) in at most

Npd(t) :=

4DX

(

Lf

εd+‖A‖2tεpεd

+‖A‖√8εp

)⌉

(47)

iterations, where DX is defined in Theorem 15.

Proof. Let x ∈ X be a δ-approximate solution of (32) where δ := ε2d/(8Mρ). Noting thatδ ≤ ε2d/(8ρ‖A‖2) in view of the fact that Mρ := Lf + ρ‖A‖2 ≥ ρ‖A‖2, we conclude from Proposition19 that the pair (x+, λ) defined as x+ := ΠX(x − ∇Ψρ(x)/Mρ) and λ := ρ[A(x+) − ΠK∗(A(x+))]satisfies ∇f(x+) +A∗λ ∈ −NX(x+) + B(εd) and

dK∗(A(x+)) ≤ 1

ρ‖λ∗‖+

δ

ρ≤ 1

ρ

(

‖λ∗‖+εd√8‖A‖

)

≤ εp,

16

Page 17: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

where the last inequality is due to (46) and the assumption that t ≥ ‖λ∗‖. We have thus shown that(x+, λ) is an (εp, εd)-primal-dual solution of (1). In view of Corollary 7, Nesterov’s optimal methodfinds an approximate solution x as above in at most

√2DX

(

δ

)1/2⌉

=

√2DX

(

8M2ρ

ε2d

)1/2

=

4DXMρ

εd

=

4DXLf + ρ‖A‖2

εd

iterations. Substituting the value of ρ given by (36) in the above bound, we obtain bound (37).

The following result states the iteration-complexity of a “guess and check” procedure based onTheorem 20. Its proof is based on similar arguments as those used in the proof of Corollary 18.

Corollary 21 Let λ∗ be the minimum norm Lagrange multiplier for (1). Consider the “guess andcheck” procedure similar to Search Procedure 1 where the functions Np and ρp are replaced by thefunctions Npd and ρpd defined in Theorem 20, δ is set to ε2d/(8Mρ), and t0 is set to [max(β0, 1)/β1]with

β0 = 4DX

(

Lf

εd+‖A‖√8εp

)

, β1 =4DX‖A‖2

εpεd.

Then, the overall number of iterations of this “guess and check” procedure for obtaining an (εp, εd)-primal-dual solution of (1) is bounded by O(Npd(‖λ∗‖)), where Npd(·) is defined in (47).

4.2 Quadratic penalty method applied to a perturbed problem

Throughout this subsection, we assume that the norm on <n is the one associated with its innerproduct. (Recall that throughout Section 4, we are also assuming that the norm on <m is the oneassociated with its inner product.) Consider the perturbation problem

f∗γ := minfγ(x) := f(x) +γ

2‖x− x0‖2 : A(x) ∈ K∗, x ∈ X, (48)

where x0 is a fixed point in X and γ > 0 is a pre-specified perturbation parameter. It is well-knownthat if γ is sufficiently small, then an approximate solution of (48) will also be an approximatesolution of (1).

In this subsection, we will derive the iteration-complexity (in terms of Nesterov’s optimal methoditerations) of computing an approximate solution of (1) by applying the quadratic penalty approachto the perturbation problem (48) for a conveniently chosen perturbation parameter γ > 0. Thesubsection is divided into two subsubsections: the computation of a primal approximate solution of(1) is treated in the first subsubsection while the computation of a primal-dual approximate solutionof (1) is treated in the second one.

4.2.1 Computation of a primal approximate solution

The quadratic penalty problem associated with (48) is given by

Ψ∗ρ,γ := min

x∈X

Ψρ,γ(x) := f(x) +γ

2‖x− x0‖2 +

ρ

2dK∗(A(x))2

. (49)

17

Page 18: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

It can be easily seen that the function Ψρ,γ has Mρ,γ-Lipschitz continuous gradient where

Mρ,γ := Lf + ρ‖A‖2 + γ. (50)

The following simple lemma relates the optimal values of the perturbation problem (48) and theoriginal problem (1).

Lemma 22 Let f ∗, Ψ∗ρ, f

∗γ , and Ψ∗

ρ,γ be the optimal values defined in (1), (32), (48), and (49),respectively. Then,

0 ≤ f∗γ − f∗ ≤ γD2X/2 (51)

0 ≤ Ψ∗ρ,γ −Ψ∗

ρ ≤ γD2X/2, (52)

where DX := maxx1,x2∈X ‖x1 − x2‖.

Proof. The first inequalities in both relations (51) and (52) follow immediately from the fact thatfγ ≥ f and Ψρ,γ ≥ Ψρ. Now, let x∗ and x∗γ be optimal solutions of (1) and (48), respectively. Then,

f∗γ = f(x∗γ) +γ

2‖x∗γ − x0‖2 ≤ f(x∗) +

γ

2‖x∗ − x0‖2 ≤ f∗ +

γD2X

2,

from which the second inequality in (51) immediately follows. The second inequality in (52) can besimilarly shown.

The following theorem gives the iteration-complexity of computing an (εp, εo)-primal solution of(1) by applying the quadratic penalty approach to the perturbation problem (48) with an appropriatechoice of perturbation parameter γ > 0.

Theorem 23 Let λ∗γ be an arbitrary Lagrange multiplier for (48) and let (εp, εo) ∈ IR++ × IR++ begiven. If

ρ = ρp(t) :=4t

εpand γ =

εoD2

X

, (53)

for some t ≥ ‖λ∗γ‖, then the variant of Nesterov’s optimal method of Theorem 8 with µ = γ andLφ = Mρ,γ applied to problem (49) finds an (εp, εo)-primal solution of (1) in at most

Np(t) :=

2√

2DX

(

Lf

εo+ 2‖A‖

t

εpεo

)

+ 1

max

1,

logεotεp

(54)

iterations, where DX is defined in Theorem 15.

Proof. Assume that ρ and γ are given by (53) for some t ≥ ‖λ∗γ‖. Let δ := minεo/4, εpt/2

and x ∈ X be a δ-approximate solution of (49), i.e., Ψρ,γ(x)−Ψ∗ρ,γ ≤ δ. Then, Proposition 13 with

f = fγ, Ψρ = Ψρ,γ and ρ = ρp(t) and the assumption that t ≥ ‖λ∗γ‖ imply that

dK∗(A(x)) ≤ 2

ρp(t)‖λ∗γ‖+

ρp(t)≤

εp‖λ∗γ‖2t

+

tεp4t/εp

≤ εp2

+εp2

= εp

18

Page 19: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

and fγ(x)− f ∗γ ≤ δ ≤ εo/4. The later inequality together with (51) and (53) in turn imply that

f(x)− f ∗ ≤ fγ(x)− f ∗ ≤ [fγ(x)− f ∗γ ] + [f∗γ − f∗] ≤εo4

+εo2< εo.

We have thus proved that x is an (εp, εo)-primal solution of (1). Now, using Theorem 8 with φ = Ψρ,γ ,µ = γ and ε = δ, and noting that the definition of γ in (53) and the definition of δ imply that

Q =µ‖xsd

0 − x∗‖22ε

=γ‖xsd

0 − x∗‖22δ

≤ γD2X

2δ=

εominεo/2, εpt

= max

2,εoεpt

,

we conclude that the number of iterations performed by Nesterov’s optimal method (with the restart-ing feature) for finding a δ-approximate solution x as above is bounded by

⌈√

8Mρ,γ

γ

dlogQe =

⌈√

8Mρ,γ

γ

max

1,

logεotεp

.

Now, using relations (50) and (53) and some trivial majorization, we easily see that the latter boundis majorized by (54).

Note that the complexity bound (54) derived in Theorem 23 is guaranteed only under the assump-tion that t ≥ ‖λ∗γ‖, where λ∗γ is the minimum norm Lagrange multiplier for (48). Since the bound(54) is a monotonically increasing function of t, the ideal (theoretical) choice of t would be to sett = ‖λ∗γ‖. Without assuming any knowledge of this Lagrange multiplier, the following result shows

that a “guess and check” procedure similar to Search Procedure 1 still has an O(Np(‖λ∗γ‖)) iteration-

complexity bound, where Np(·) is defined in (54). Before establishing the iteration-complexity of this“guess and check” procedure, we state the following technical result which substantially generalizesthe one stated in Lemma 17. The proof of this result is given in the Appendix.

Proposition 24 For some positive integer L, let positive scalars p1, p2, · · · , pL be given. Then, thereexists a constant C = C(p1, · · · , pL) such that for any positive scalars β0, β1, · · · , βL, ν, and t, wehave

K∑

k=0

β0 +

L∑

l=1

(

βl tpl

k

)

max

1,

logν

tk

≤ C⌈

β0 +

L∑

l=1

(βl tpl)

max

1,⌈

logν

t

, (55)

where

K := max

0,

log

(

t

t0

)⌉

, t0 := min1≤l≤L

(

max(β0, 1)

βl

)1/pl

, tk = t02k, ∀k = 1, . . . ,K. (56)

In particular, if ν = t, then (55) implies that

K∑

k=0

β0 +L∑

l=1

(

βl tpl

k

)

≤ C⌈

β0 +L∑

l=1

(βl tpl)

. (57)

We are now ready to discuss the iteration-complexity bound of the above-mentioned “guess andcheck” procedure.

19

Page 20: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

Corollary 25 Let λ∗γ be the minimum norm Lagrange multiplier for (48). Consider the “guess andcheck” procedure similar to Search Procedure 1 where the functions Np and ρp are replaced by thefunctions Np and ρp defined in Theorem 23, δ is set to δ := minεo/4, εptk/2, and t0 is set to[max(β0, 1)/β1]

2 with

β0 = 2√

2DXLf

εo+ 1, β1 =

4√

2DX‖A‖√εpεo

. (58)

Then, the overall number of iterations of this “guess and check” procedure for obtaining an (εp, εo)-primal-dual solution of (1) is bounded by O(Np(‖λ∗γ‖)), where Np(·) is defined in (54).

Proof. In view of Theorem 23, the iteration count k in Search Procedure 1 can not exceedK := max0, dlog(‖λ∗γ‖/t0)e, and hence its overall number of inner (i.e., Nesterov’s optimal method)

iterations is bounded by∑K

k=0 Np(tk) =∑K

k=0dβ0 + β1t1/2k emax1, dlog(ν/tk)e, where β0 and β1

are defined by (58), and ν = ε0/εp. The result now follows from the definition of t0 and Proposition24 with L = 1, p1 = 1/2, and t = ‖λ∗γ‖.

It is interesting to compare the two functions Np(t) and Np(t) defined in Theorems 15 and 23,respectively. Indeed, both functions (up to some constant factor) are comparable for the case whenrt := εo/(tεp) = O(1). Otherwise, if rt = Ω(1) and, we further assume that the term involving tdominates the other terms in the right hand side of (54), i.e.,

2√

2DX

Lf

εo+ 1 = O

(

4√

2DX‖A‖√

t

εpεo

)

,

thenNp(t)

Np(t)≤ O(1)

(

log rt√rt

)

,

for some universal constant O(1). Hence, in the latter case, we conclude that the function Np(t) isasymptotically smaller than the function Np(t).

4.2.2 Computation of a primal-dual approximate solution

In this subsection, we derive the iteration-complexity of computing a primal-dual approximate solu-tion of (1) by applying the quadratic penalty method to the perturbation problem (48). We will showthat the resulting approach has a substantially better iteration-complexity than the one discussedin Subsection 4.1.2, which consists of applying the quadratic penalty method directly to the originalproblem (1).

Theorem 26 Let λ∗γ be an arbitrary Lagrange multiplier for (48) and let (εp, εd) ∈ IR++ × IR++ begiven. Also assume that

ρ = ρpd(t) :=1

εp

(

t+εd√

32‖A‖

)

, γ =εd

2DX, (59)

for some t ≥ ‖λ∗γ‖. Then, the variant of Nesterov’s optimal method of Theorem 8 with µ = γ andLφ = Mρ,γ applied to problem (49) finds an (εp, εd)-primal-dual solution of (1) in at most

Npd(t) := d8S(t)e d2 log(2S(t))e , (60)

20

Page 21: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

where DX is defined in Theorem 15 and

S(t) :=

[

2DX

(

Lf

εd+‖A‖√32εp

)

+ 1

]1

2

+

√2D

1/2X ‖A‖√εpεd

t1/2. (61)

Proof. Define δ := ε2d/(32Mρ,γ) and let x ∈ X be a δ-approximate solution for (49), i.e., Ψρ,γ(x)−Ψ∗

ρ,γ ≤ δ. It then follows from Proposition 19 with Ψρ = Ψρ,γ , f = fγ , and Mρ = Mρ,γ that the pair(x+, λ) defined as x+ := ΠX(x−∇Ψρ,γ(x)/Mρ,γ) and λ := ρ[A(x+)−ΠK∗(A(x+))] satisfies

∇fγ(x+) +A∗λ ∈ −NX(x+) + B(2√

2Mρ,γδ) = −NX(x+) + B(εd/2).

This together with the fact that γ‖x+ − x0‖ ≤ γDX ≤ εd/2 then imply that

∇f(x+) +A∗λ = [∇fγ(x+)− γ(x+ − x0)] +A∗λ

∈ −γ(x+ − x0)−NX(x+) + B(εd/2) ⊆ −NX(x+) + B(εd).

Moreover, noting that δ := ε2d/(32Mρ,γ) ≤ ε2d/(32ρ‖A‖2), we conclude that Proposition 19 alsoimplies that

dK∗(A(x+)) ≤ 1

ρ‖λ∗γ‖+

δ

ρ≤ 1

ρ

(

‖λ∗γ‖+εd√

32‖A‖

)

≤ εp,

where the last inequality is due to (59) and the assumption that t ≥ ‖λ∗γ‖. We have thus shown that

(x+, λ) is an (εp, εd)-primal-dual solution of (1). Now, using Theorem 8 with φ = Ψρ,γ , µ = γ andε = δ, and noting that the definition of δ and the definition of γ in (59) imply that

Q =µ‖xsd

0 − x∗‖22ε

=γ‖xsd

0 − x∗‖22δ

≤ γD2X

2δ=

γD2X

ε2d/(16Mρ,γ)=

4Mρ,γ

γ,

we conclude that the number of iterations performed by Nesterov’s optimal method (with the restart-ing feature) for finding a δ-approximate solution x as above is bounded by

8

Mρ,γ

γ

dlogQ e =

8

Mρ,γ

γ

log

(

4Mρ,γ

γ

)⌉

. (62)

Now, using (50) and the definitions of ρ and γ in (53), we easily see that the latter bound is majorizedby (60).

Note that the complexity bound (60) derived in Theorem 26 is guaranteed only under the as-sumption that t ≥ ‖λ∗γ‖, where λ∗γ is the minimum norm Lagrange multiplier for (48). Since thebound (60) is a monotonically increasing function of t, the ideal (theoretical) choice of t would beto set t = ‖λ∗γ‖. Without assuming any knowledge of this Lagrange multiplier, the following result

shows that a “guess and check” procedure similar to Search Procedure 1 still has an O(Npd(‖λ∗γ‖))iteration-complexity bound, where Npd(·) is defined in (60).

Corollary 27 Let λ∗γ be the minimum norm Lagrange multiplier for (48). Consider the “guess andcheck” procedure similar to Search Procedure 1 where the functions Np and ρp are replaced by thefunctions Npd and ρpd defined in Theorem 26, Nesterov’s optimal method is replaced by its variant

21

Page 22: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

of Theorem 8 with µ = γ and Lφ = Mρ,γ (see step 2 of Search Procedure 1), δ is set to ε2d/(32Mρ,γ),and t0 is set to [max(β0, 1)/β1]

2 with

β0 = 8

[

2DX

(

Lf

εd+‖A‖√32εp

)

+ 1

]1/2

, β1 =8√

2D1/2X ‖A‖

(εpεd)1/2. (63)

Then, the overall number of iterations of this “guess and check” procedure for obtaining an (εp, εd)-primal-dual solution of (1) is bounded by O(Npd(‖λ∗γ‖)), where Npd(·) is defined in (60).

Proof. In view of Theorem 26, the iteration count k of the above “guess and check” (seeSearch Procedure 1) for obtaining an (εp, εd)-primal-dual solution of (1) can not exceed K :=max0, dlog(‖λ∗‖/t0)e, and hence its overall number of inner (i.e., Nesterov’s optimal method) iter-ations is bounded by

K∑

k=0

Npd(tk) =

K∑

k=0

d8S(tk)e d2 log(2S(tk))e ≤ d2 log(2S(tK))eK∑

k=0

d8S(tk)e . (64)

Now, the definition of t0 and relation (40) with p = 1/2 and t = ‖λ∗γ‖ imply that

K∑

k=0

d8S(tk)e =K∑

k=0

β0 + β1t1/2k

= O(⌈

β0 + β1‖λ∗γ‖1/2⌉)

= O(⌈

8S(‖λ∗γ‖)⌉)

,

where β0 and β1 are defined by (63). Moreover, using the fact that tK ≤ 2‖λ∗γ‖, we easily seethat d2 log(2S(tK))e = O(

2 log(2S(‖λ∗γ‖))⌉

). Substituting the last two bounds into (64), we then

conclude that∑K

k=0 Npd(tk) = O(Npd(‖λ∗γ‖)), and hence that the corollary holds.

It is interesting to compare the functions Npd(t) and Npd(t) defined in Theorems 20 and 26,respectively. It follows from (47), (60), and (61) that

Npd(t)

Npd(t)≤ (d8S(t)e d2 log(2S(t))e)

(S(t))2 − 1⌉ ≤ O(1)

log S(t)

S(t)− 1, (65)

where O(1) denotes an absolute constant. Hence, when S(t) is large, the bound Npd(t) can besubstantially smaller than the bound Npd(t).

Note that we can not use the previous observation to compare the iteration-complexity of Corol-lary 21 with that obtained in Corollary 27 since the first one is expressed in terms of ‖λ∗‖ and thelater in terms of ‖λ∗γ‖. However, if ‖λ∗γ‖ = O(‖λ∗‖), then it can be easily seen that (65) implies that

Npd(‖λ∗γ‖)Npd(‖λ∗‖)

≤ O(1)log S(‖λ∗‖)S(‖λ∗‖)− 1

.

Hence, the first complexity is better than the second one whenever ‖λ∗γ‖ = O(‖λ∗‖) and S(‖λ∗‖) is

sufficiently large.The following result describes an alternative way of choosing the penalty parameter ρ in sub-

problem (49) as a function of t, but requires that t ≥ ‖λ∗‖ instead of t ≥ ‖λ∗γ‖.

22

Page 23: Iteration-complexity of rst-order penalty methods for convex programmingmonteiro/publications/tech_reports/... · 2008-09-17 · Iteration-complexity of rst-order penalty methods

Theorem 28 Let λ∗ be an arbitrary Lagrange multiplier for (1) and let (εp, εd) ∈ IR++ × IR++ begiven. Assume that

ρ = ρpd(t) :=

(

εdDX/2 +√

(εdDX/2) + 4α(t)εp2εp

)2

, γ =εd

2DX, (66)

where α(t) := t+εd/(√

32 ‖A‖) for some t ≥ ‖λ∗‖. Then, the variant of Nesterov’s optimal method ofTheorem 8 with µ = γ and Lφ = Mρ,γ applied to problem (49) finds an (εp, εd)-primal-dual solutionof (1) in at most

Npd(t) :=⌈

8S(t)⌉ ⌈

2 log(2S(t))⌉

, (67)

where DX is defined in Theorem 15 and

S(t) :=

2LfDX

εd+ ‖A‖

(

DX

εp+

2tDX

εpεd

)

+

‖A‖DX

81/4√εp+ 1. (68)

Proof. Define δ := ε²d/(32Mρ,γ) and let x ∈ X be a δ-approximate solution of (49), i.e., Ψρ,γ(x) − Ψ∗ρ,γ ≤ δ. As shown in the proof of Theorem 26, the pair (x+, λ) defined as x+ := ΠX(x − ∇Ψρ,γ(x)/Mρ,γ) and λ := ρ[A(x+) − ΠK∗(A(x+))] satisfies

∇f(x+) + A∗λ ∈ −NX(x+) + B(εd).

Moreover, it follows from Lemma 5(a) with φ = Ψρ,γ and τ = Mρ,γ that Ψρ,γ(x+) ≤ Ψρ,γ(x), and hence that Ψρ,γ(x+) − Ψ∗ρ,γ ≤ Ψρ,γ(x) − Ψ∗ρ,γ ≤ δ. This observation together with (32), (49), (52) and (66) then implies that

Ψρ(x+) − Ψ∗ρ = Ψρ,γ(x+) − (γ/2)‖x+ − x0‖² − Ψ∗ρ ≤ [Ψρ,γ(x+) − Ψ∗ρ,γ] + [Ψ∗ρ,γ − Ψ∗ρ] ≤ δ + γD²X ≤ ε²d/(32ρ‖A‖²) + εdDX/2,

where the last inequality follows from the definition of δ and the fact that Mρ,γ ≥ ρ‖A‖². The above inequality together with Proposition 13, the assumption that t ≥ ‖λ∗‖, relation (66) and Lemma 14 with α0 = εp, α1 = εdDX/2 and α2 = t + εd/(√32 ‖A‖) =: α(t) then implies that

dK∗(A(x+)) ≤ (1/ρ)‖λ∗‖ + √( ε²d/(32ρ²‖A‖²) + DXεd/(2ρ) ) ≤ (1/ρ)( t + εd/(√32 ‖A‖) ) + √( DXεd/(2ρ) ) = εp.

We have thus shown that (x+, λ) is an (εp, εd)-primal-dual solution of (1). Now, using Theorem 8 with φ = Ψρ,γ, μ = γ and ε = δ, and noting that the definitions of δ and of γ in (66) imply that

Q = μ‖x0sd − x∗‖²/(2ε) = γ‖x0sd − x∗‖²/(2δ) ≤ γD²X/(2δ) = γD²X/(ε²d/(16Mρ,γ)) = 4Mρ,γ/γ,

we conclude that the number of iterations performed by Nesterov's optimal method (with the restarting feature) for finding a δ-approximate solution x as above is bounded by (62). Since relations (50) and (66) and the definition of α(t) imply that

√(Mρ,γ/γ) = √( (Lf + ρ‖A‖² + γ)/γ ) ≤ √(Lf/γ) + ‖A‖√(ρ/γ) + 1 = √(2LfDX/εd) + ‖A‖√(ρ/γ) + 1

and

√(ρ/γ) ≤ ( √(εdDX/2)/εp + √(α(t)/εp) ) √(2DX/εd) = DX/εp + √( 2(t + εd/(√32 ‖A‖))DX/(εpεd) ) ≤ DX/εp + √(2tDX/(εpεd)) + √( DX/(√8 εp‖A‖) ),

it follows that (62) can be bounded by (67).
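The recovery step used at the start of this proof — a single projected-gradient step followed by a scaled penalty residual — is itself an implementable recipe for extracting a primal-dual pair from an approximate penalty minimizer. The sketch below is our illustration, not code from the paper, specialized to K = K∗ = IR^m_+ with the Euclidean norm; grad_Psi, M, Amap and proj_X are hypothetical stand-ins for ∇Ψρ,γ, Mρ,γ, A and ΠX.

```python
import numpy as np

def recover_primal_dual(x, grad_Psi, M, rho, Amap, proj_X):
    """Primal-dual recovery as in the proof of Theorem 28:
    x+ = Pi_X(x - grad_Psi(x)/M),  lambda = rho*(A(x+) - Pi_{K*}(A(x+))).
    Illustrative case K* = R^m_+, where Pi_{K*} clips negative entries at zero."""
    x_plus = proj_X(x - grad_Psi(x) / M)
    residual = Amap(x_plus)
    lam = rho * (residual - np.maximum(residual, 0.0))
    return x_plus, lam
```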

The following result states the iteration-complexity of a “guess and check” procedure based on Theorem 28. Its proof relies on arguments similar to those used in the proof of Corollary 27.

Corollary 29 Let λ∗ be the minimum norm Lagrange multiplier for (1). Consider the “guess and check” procedure similar to Search Procedure 1 where the functions Np and ρp are replaced by the functions Ñpd and ρ̃pd defined in Theorem 28, Nesterov's optimal method is replaced by its variant of Theorem 8 with μ = γ and Lφ = Mρ,γ (see step 2 of Search Procedure 1), δ is set to ε²d/(32Mρ,γ), and t0 is set to [max(β0, 1)/β1]² with

β0 = 8 ( √(2LfDX/εd) + ‖A‖DX/εp + √(‖A‖DX)/(8^{1/4}√εp) + 1 ),   β1 = 8‖A‖ √(2DX/(εpεd)).

Then, the overall number of iterations of this “guess and check” procedure for obtaining an (εp, εd)-primal-dual solution of (1) is bounded by O(Ñpd(‖λ∗‖)), where Ñpd(·) is defined in (67).

It is interesting to compare the functions Npd(t) and Ñpd(t) defined in Theorems 20 and 28, respectively. When the second term in the right-hand side of (68) is dominated by the other terms, i.e.,

‖A‖DX/εp = O( ‖A‖ t^{1/2} DX^{1/2}/√(εpεd) + √(LfDX/εd) ),   (69)

then it can be easily seen that

Ñpd(t)/Npd(t) ≤ O(1) N̄pd(t)/Npd(t) ≤ O(1) log S(t)/(S(t) − 1).

Hence, if (69) holds and S(t) is large, then Ñpd(t) can be considerably smaller than Npd(t). It is also interesting to observe that, when εp = εd = ε, the dependence on ε of the function Ñpd is O((1/ε) log(1/ε)), while that of Npd is O(1/ε²).

5 The exact penalty method

The goal of this section is to establish the iteration-complexity (in terms of iterations of Nesterov's optimal method) of first-order exact penalty methods for computing an εp-primal solution of (1). It consists of two subsections. The first one considers the complexity of a first-order exact penalty method applied directly to the original problem (1), while the second one deals with the complexity of a first-order exact penalty method applied to the perturbed problem (48) for a suitably chosen perturbation parameter γ.


5.1 Exact penalty method applied to the original problem

In this subsection, we consider the complexity of a first-order exact penalty method applied directly to the original problem (1).

Instead of using the quadratic penalty term ρ[dK∗(A(x))]²/2, the exact penalty method employs the penalty term θ dK∗(A(x)) for some θ > 0, and solves the relaxation problem

Φ∗θ := inf_{x∈X} { Φθ(x) := f(x) + θ dK∗(A(x)) }.   (70)

Unless explicitly mentioned otherwise, we assume throughout Section 5 that the norm used to define the distance function dK∗ is arbitrary.

Given an approximate solution x ∈ X of the penalized problem (70), the following result provides bounds on the primal infeasibility and the optimality gap of x with respect to (1).

Proposition 30 If x ∈ X is a δ-approximate solution of (70), i.e., it satisfies

Φθ(x) − Φ∗θ ≤ δ,   (71)

for some θ > ‖λ∗‖, where λ∗ is an arbitrary Lagrange multiplier for (1), then

dK∗(A(x)) ≤ δ/(θ − ‖λ∗‖),   (72)

f(x) − f∗ ≤ δ.   (73)

Proof. Using the fact that v(0) = f∗ ≥ Φ∗θ, Corollary 2 and assumption (71), we conclude that

δ ≥ Φθ(x) − Φ∗θ = f(x) + θ dK∗(A(x)) − Φ∗θ ≥ f(x) − f∗ + θ dK∗(A(x)) ≥ −‖λ∗‖ dK∗(A(x)) + θ dK∗(A(x)),

which clearly implies both (72) and (73).

Note that, in view of Proposition 1(a), we can reformulate the penalized problem (70) as the smooth saddle-point problem

min_{x∈X} { f(x) + θ max_{y∈Y} ⟨A(x), y⟩ },   (74)

where Y := (−K) ∩ B(0, 1). Clearly, this problem can be solved by means of Nesterov's smooth approximation scheme discussed in Subsection 3.3. Using Proposition 30 and applying the results of Subsection 3.3 to the above reformulation, we can now derive an iteration-complexity bound for Nesterov's smooth approximation scheme to compute an (εp, εo)-primal solution of (1) by approximately solving the saddle-point problem (74) for a suitable value of the penalty parameter θ.
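To make the penalty function and its saddle-point reformulation concrete, the following sketch (ours, not from the paper) evaluates Φθ for the special case K = K∗ = IR^m_+ with the Euclidean norm, in which ΠK∗ clips at zero and dK∗(u) = ‖min(u, 0)‖; the quadratic f and the affine map A(x) = Ax − b are illustrative assumptions. It also checks the identity of Proposition 1(a) underlying (74).

```python
import numpy as np

# Illustrative instance: K = K* = R^m_+, Euclidean norm; f and A(x) = Ax - b
# are our own assumptions, not data from the paper.
rng = np.random.default_rng(0)
n, m = 5, 3
C, d = rng.standard_normal((4, n)), rng.standard_normal(4)
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

f = lambda x: 0.5 * np.linalg.norm(C @ x - d) ** 2
Amap = lambda x: A @ x - b

def dist_to_dual_cone(u):
    # d_{K*}(u) = ||u - Pi_{K*}(u)||; for K* = R^m_+ the projection clips at 0.
    return np.linalg.norm(u - np.maximum(u, 0.0))

def phi_theta(x, theta):
    # Exact penalty objective (70): f(x) + theta * d_{K*}(A(x)).
    return f(x) + theta * dist_to_dual_cone(Amap(x))

# Proposition 1(a): d_{K*}(u) = max{<u, y> : y in (-K), ||y|| <= 1};
# for the orthant the maximizer is min(u, 0), normalized.
x = rng.standard_normal(n)
u = Amap(x)
y_star = np.minimum(u, 0.0)
if np.linalg.norm(y_star) > 0:
    y_star /= np.linalg.norm(y_star)
assert abs(u @ y_star - dist_to_dual_cone(u)) < 1e-12
print(phi_theta(x, theta=10.0))
```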

Theorem 31 Let λ∗ be an arbitrary Lagrange multiplier for (1) and let (εp, εo) ∈ IR++ × IR++ be given. If

θ = θ(t) := εo/εp + t,   (75)

for some t ≥ ‖λ∗‖, then Nesterov's smooth approximation scheme applied to problem (74) finds an (εp, εo)-primal solution of (1) in at most

Ne(t) := √(8Dh(x0sd)/σh) [ √(Lf/εo) + ‖A‖ √(2Dh/σh) ( 1/εp + t/εo ) ]   (76)

iterations, where Dh(x0sd) and Dh are defined in Theorem 10 and relation (28), respectively. In particular, if the norm ‖·‖ on IR^m is the inner product norm and h(·) = ‖·‖²/2, then the above bound reduces to

Ne(t) := √(8Dh(x0sd)/σh) [ √(Lf/εo) + ‖A‖ ( 1/εp + t/εo ) ].   (77)

Proof. Let x ∈ X be a point satisfying (71) with δ = εo. It follows from (75), the assumption that t ≥ ‖λ∗‖ and Proposition 30 that f(x) − f∗ ≤ εo and

dK∗(A(x)) ≤ δ/(θ − ‖λ∗‖) ≤ εo/(εo/εp) = εp,

and hence that x is an (εp, εo)-primal solution of (1). Using Theorem 10 with ‖E‖ = θ‖A‖ and Lφ = Lf, we conclude that Nesterov's smooth approximation scheme finds an approximate solution x as above in at most

√( (8Dh(x0sd)/(σhεo)) ( 2θ²Dh‖A‖²/(σhεo) + Lf ) )

iterations. Substituting the value of θ given by (75) into this iteration bound and using some trivial majorization, we obtain bound (76). The last statement follows from the facts that σh = 1 and Dh = 1/2 when the norm ‖·‖ on IR^m is the inner product norm and h(·) = ‖·‖²/2.

We now make a few observations about the iteration-complexity bound Ne(t) obtained in Theorem 31. First, note that if the norm on IR^n is the inner product one, then the bound Ne(t) in (77) reduces to

Ne(t) := 2DX [ √(Lf/εo) + ‖A‖ ( 1/εp + t/εo ) ].   (78)

Second, it is interesting to compare the bound Ne(t) in (77) with the bound Np(t) obtained in Theorem 15. Indeed, if

χt := tεp/εo = O( 1 + (εp/‖A‖) √(Lf/εo) ),

then it is easy to see that Np(t)/Ne(t) = O(1). Otherwise, we have

Np(t)/Ne(t) = O(1)/√χt.

Moreover, observations similar to the ones made in the paragraph following Theorem 15 apply to the above result as well. In particular, a “guess and check” procedure similar to Search Procedure 1 in Subsection 4.1.1 can be stated as follows.

Search Procedure 2:

1) Set k = 0 and define

β1 := √(8Dh(x0sd)/σh) ( √(Lf/εo) + (‖A‖/εp) √(2Dh/σh) ),   β2 := (4‖A‖/εo) √(Dh(x0sd)Dh)/σh,   t0 := max(1, β1)/β2.

2) Set θ = θ(tk), and perform at most ⌈Ne(tk)⌉ iterations of Nesterov's smooth approximation scheme applied to problem (74). If an iterate x is obtained such that (71) with δ = εo and dK∗(A(x)) ≤ εp are satisfied, then stop; otherwise, go to step 3.

3) Set tk+1 = 2tk and k = k + 1, and go to step 2.
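In code form, Search Procedure 2 is a plain doubling loop. The sketch below is our rendering of its control flow, not an implementation from the paper: smoothing_scheme stands for Nesterov's smooth approximation scheme applied to (74), and gap_bound for whatever computable upper bound on Φθ(x) − Φ∗θ is used to test (71); note that Ne(t) = β1 + β2 t by the definitions in step 1.

```python
import math

def search_procedure_2(eps_p, eps_o, beta1, beta2,
                       smoothing_scheme, gap_bound, dist_to_dual_cone, Amap):
    """Guess-and-check doubling loop (Search Procedure 2).

    smoothing_scheme(theta, max_iters) returns an iterate x; gap_bound(x, theta)
    is assumed to upper-bound Phi_theta(x) - Phi*_theta, cf. (71)."""
    t = max(1.0, beta1) / beta2                   # step 1: initial guess t0
    while True:
        theta = eps_o / eps_p + t                 # penalty parameter (75)
        max_iters = math.ceil(beta1 + beta2 * t)  # ceil(Ne(t)), cf. (76)
        x = smoothing_scheme(theta, max_iters)    # step 2
        if gap_bound(x, theta) <= eps_o and dist_to_dual_cone(Amap(x)) <= eps_p:
            return x, t                           # (eps_p, eps_o)-primal solution
        t *= 2.0                                  # step 3: double and retry
```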

The following result, whose proof is similar to that of Corollary 18, gives the iteration-complexity of Search Procedure 2 for obtaining an (εp, εo)-primal solution of (1).

Corollary 32 Let λ∗ be the minimum norm Lagrange multiplier for (1). Then, the overall number of iterations of Search Procedure 2 for obtaining an (εp, εo)-primal solution of (1) is bounded by O(Ne(‖λ∗‖)), where Ne(·) is defined in (76).

5.2 Exact penalty method applied to a perturbed problem

In this subsection, we derive the iteration-complexity of solving (1) by applying the exact penalty method to the perturbed problem (48) for a properly chosen perturbation parameter γ > 0. We assume throughout this subsection that the norm on IR^n is the inner product one.

The exact penalty problem associated with (48) is given by

Φ∗θ,γ := inf_{x∈X} { Φθ,γ(x) := fγ(x) + θ dK∗(A(x)) = f(x) + θ dK∗(A(x)) + (γ/2)‖x − x0‖² }.   (79)

Problem (79) is a non-smooth optimization problem which can be reformulated as the smooth saddle-point problem

inf_{x∈X} { Φθ,γ(x) = fγ(x) + max_{y∈Y} ⟨θA(x), y⟩ },   (80)

where Y := (−K) ∩ B(0, 1). The function Φθ,γ is non-smooth but, in view of Nesterov's smooth approximation scheme, it can be closely approximated by the following function with Lipschitz-continuous gradient:

Φη,∗θ,γ := inf_{x∈X} { Φηθ,γ(x) := fγ(x) + max_{y∈Y} [ ⟨θA(x), y⟩ − ηh(y) ] },   (81)

where η > 0 is the smoothing parameter (see (27)) and h : Y → IR is as in Subsection 3.3. In view of Proposition 9(b) with E = θA, ψ = 0, φ = fγ and φη = Φηθ,γ, the function Φηθ,γ has Lηθ,γ-Lipschitz-continuous gradient, where

Lηθ,γ := Lf + γ + θ²‖A‖²/(σhη).   (82)
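For the Euclidean choice h(·) = ‖·‖²/2, the maximizer in (81) is the projection of θA(x)/η onto Y, so Φηθ,γ and its gradient are cheap to evaluate. The sketch below is ours, again for the illustrative case K = IR^m_+, where Y = {y ≤ 0 : ‖y‖ ≤ 1} and ΠY composes the orthant projection with a rescale into the unit ball (projections onto the intersection of a closed convex cone and a centered ball compose this way).

```python
import numpy as np

def smoothed_penalty(x, theta, eta, gamma, f, grad_f, Amap, A, x0):
    """Value and gradient of (81) for K = R^m_+ and h(y) = ||y||^2/2 (our
    illustration). The inner maximizer is y* = Pi_Y(theta*A(x)/eta)."""
    v = theta * Amap(x) / eta
    y = np.minimum(v, 0.0)              # project onto -K = R^m_-
    ny = np.linalg.norm(y)
    if ny > 1.0:                        # then rescale into the unit ball
        y /= ny
    val = (f(x) + 0.5 * gamma * np.linalg.norm(x - x0) ** 2
           + theta * (Amap(x) @ y) - 0.5 * eta * (y @ y))
    grad = grad_f(x) + gamma * (x - x0) + theta * (A.T @ y)  # Danskin's rule
    return val, grad
```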

With the aid of reformulation (80) and Proposition 30, we can now derive an iteration-complexity bound for solving (1) by applying the exact penalty method to the perturbed problem (48) with an appropriately chosen parameter γ > 0.

Theorem 33 Let λ∗γ be an arbitrary Lagrange multiplier for (48) and let (εp, εo) ∈ IR++ × IR++ be given. If

θ = θ̄(t) := 2t,   η = min{εo/2, tεp}/(2Dh),   and   γ = εo/D²X,   (83)

for some t ≥ ‖λ∗γ‖, then the variant of Nesterov's optimal method of Theorem 8 with μ = γ and Lφ = Lηθ,γ applied to problem (81) finds an (εp, εo)-primal solution of (1) in at most

N̄e(t) := ⌈ 8DX [ √(Lf/εo) + ‖A‖ √(2Dh/σh) ( √8 t/εo + 2√t/√(εpεo) ) ] + 8 ⌉ max{ 1, ⌈log(εo/(tεp))⌉ }   (84)

iterations, where DX is defined in Theorem 15. In particular, if the norm on IR^m is the inner product one and h(·) = ‖·‖²/2, then the above bound reduces to

N̄e(t) := ⌈ 8DX [ √(Lf/εo) + ‖A‖ ( √8 t/εo + 2√t/√(εpεo) ) ] + 8 ⌉ max{ 1, ⌈log(εo/(tεp))⌉ }.   (85)

Proof. Define δ := min{εo/2, tεp} = 2ηDh, where the last equality is due to (83). Also, let x ∈ X be a δ/2-approximate solution of (81), i.e., Φηθ,γ(x) − Φη,∗θ,γ ≤ δ/2. Using the fact that ηDh = δ/2 and Proposition 9(a) with φ = Φθ,γ and φη = Φηθ,γ, we easily see that x is a δ-approximate solution of (80), i.e., Φθ,γ(x) − Φ∗θ,γ ≤ δ. It then follows from (83), the assumption t ≥ ‖λ∗γ‖, the definition of δ, and Proposition 30 with f = fγ and Φθ = Φθ,γ that

dK∗(A(x)) ≤ δ/(θ − ‖λ∗γ‖) = δ/(2t − ‖λ∗γ‖) ≤ δ/t ≤ tεp/t = εp,

fγ(x) − f∗γ ≤ δ ≤ εo/2.

The latter inequality together with (51) and (83) implies that

f(x) − f∗ ≤ fγ(x) − f∗ ≤ [fγ(x) − f∗γ] + [f∗γ − f∗] ≤ εo/2 + εo/2 = εo.

We have thus shown that x is an (εp, εo)-primal solution of (1).

To end the proof, it remains to derive the iteration-complexity of the variant of Nesterov's optimal method of Theorem 8 with μ = γ and Lφ = Lηθ,γ for obtaining a δ/2-approximate solution of problem (81). First, observe that Φηθ,γ is strongly convex with modulus γ. Now, using Theorem 8 with φ = Φηθ,γ, μ = γ = εo/D²X and ε = δ/2, and noting that the definitions of δ and γ in (83) imply that

Q = μ‖x0sd − x∗‖²/(2ε) = γ‖x0sd − x∗‖²/δ ≤ γD²X/δ = εo/min{εo/2, εpt} = max{ 2, εo/(εpt) },

we conclude that the number of iterations needed by the aforementioned variant to find an ε = δ/2-approximate solution x as above is bounded by

⌈ 8√(Lηθ,γ/γ) ⌉ ⌈log Q⌉ = ⌈ 8√(Lηθ,γ/γ) ⌉ max{ 1, ⌈log(εo/(tεp))⌉ }.

The result now follows by noting that (82) and (83) imply that

√(Lηθ,γ/γ) = √( Lf/γ + 1 + θ²‖A‖²/(γσhη) ) = √( D²XLf/εo + 1 + 8t²D²XDh‖A‖²/(εoσh min{εo/2, tεp}) )
≤ DX [ √(Lf/εo) + ‖A‖ √(2Dh/σh) ( √8 t/εo + 2√t/√(εpεo) ) ] + 1.


The following result, whose proof is similar to that of Corollary 25, gives the iteration-complexity of a “guess and check” procedure based on Theorem 33 for obtaining an (εp, εo)-primal solution of (1).

Corollary 34 Let λ∗γ be the minimum norm Lagrange multiplier for (48). Consider the “guess and check” procedure similar to Search Procedure 2 where the functions Ne and θ are replaced by the functions N̄e and θ̄ defined in Theorem 33, δ is set to min{εo/2, tkεp}, and t0 is set to min{ max(β0, 1)/β1, [max(β0, 1)/β2]² } with

β0 = 8DX √(Lf/εo) + 8,   β1 = 8√8 DX‖A‖/εo,   β2 = 16DX‖A‖/√(εpεo).

Then, the overall number of iterations of this “guess and check” procedure for obtaining an (εp, εo)-primal solution of (1) is bounded by O(N̄e(‖λ∗γ‖)), where N̄e(·) is defined in (85).

It is interesting to compare the two functions Ne(t) and N̄e(t) defined in (78) and (85), respectively. Indeed, both functions are comparable in the case where rt := εo/(tεp) = O(1). Otherwise, if rt = Ω(1) and we further assume that the terms involving t dominate the other terms in the right-hand side of (85), i.e.,

DX √(Lf/εo) + 1 = O( DX‖A‖ ( √8 t/εo + 2√t/√(εpεo) ) ),

then

N̄e(t)/Ne(t) ≤ O(1) (log rt)/√rt,

where O(1) denotes an absolute constant. Hence, in the latter case, we conclude that the function N̄e(t) is asymptotically smaller than the function Ne(t).

Appendix

In this section, we prove Propositions 1 and 24.

Proof of Proposition 1. Define C := {(v, t) ∈ IR^m × IR : ‖v‖ ≤ t} and let C∗ denote the dual cone of C. It is easy to see that C∗ = {(v, t) ∈ IR^m × IR : ‖v‖∗ ≤ t}. By the definition of dK∗ and conic duality, we have

dK∗(u) = inf_{k,t} { t : ‖u − k‖∗ ≤ t, k ∈ K∗ } = inf_{k,t,v} { t : v + k = u, (v, t) ∈ C∗, k ∈ K∗ }
       = sup { ⟨u, y⟩ : (−y, 1) ∈ C, −y ∈ K } = sup { ⟨u, y⟩ : y ∈ (−K) ∩ B(0, 1) }.

Statement (a) follows from the above identity and the definition of the support function of a set (see Subsection 1.1).

To show statement (b), let u ∈ IR^m and λ ∈ K be given and assume without loss of generality that λ ≠ 0. Noting that −λ/‖λ‖ ∈ (−K) ∩ B(0, 1), we conclude from the above identity that dK∗(u) ≥ ⟨u, −λ/‖λ‖⟩, or equivalently, ⟨u, λ⟩ ≥ −‖λ‖ dK∗(u).

Our goal in the remainder of this section is to prove Proposition 24. We start with an easy case of the result.

Lemma 35 For some positive integer L, let positive scalars p1, p2, . . . , pL be given and define

C0 := 2 + max{ 2, max_{1≤l≤L} ( 2/pl + 4^{pl}/(2^{pl} − 1) ) }.

Then, for any positive scalars β0, β1, . . . , βL and any t > t0, we have

∑_{k=0}^{K} ⌈ β0 + ∑_{l=1}^{L} βl tk^{pl} ⌉ ≤ C0 ⌈ β0 + ∑_{l=1}^{L} βl t^{pl} ⌉,   (86)

where K and t0, . . . , tK are defined in (56).

Proof. Without loss of generality, assume that t0 = (max(β0, 1)/β1)^{1/p1}. Clearly, due to the definition of K in (56) and the assumption t > t0, we have K < log(t/t0) + 1, and hence t0 2^{K+1} < 4t. Using these relations and the inequality log x = (log x^p)/p ≤ x^p/p for any x > 0 and p > 0, we obtain

∑_{k=0}^{K} ⌈ β0 + ∑_{l=1}^{L} βl tk^{pl} ⌉ ≤ ∑_{k=0}^{K} ( 1 + β0 + ∑_{l=1}^{L} βl t0^{pl} 2^{pl k} )
≤ (1 + β0)(1 + K) + ∑_{l=1}^{L} βl t0^{pl} 2^{(K+1)pl}/(2^{pl} − 1)
≤ (1 + β0) [ 2 + log(t/t0) ] + ∑_{l=1}^{L} βl (4t)^{pl}/(2^{pl} − 1)
≤ (1 + β0) [ 2 + (1/p1)(t/t0)^{p1} ] + ∑_{l=1}^{L} ( 4^{pl}/(2^{pl} − 1) ) βl t^{pl}
= (1 + β0) [ 2 + (1/p1) β1 t^{p1}/max(β0, 1) ] + ∑_{l=1}^{L} ( 4^{pl}/(2^{pl} − 1) ) βl t^{pl}
≤ 2(1 + β0) + (2/p1) β1 t^{p1} + ∑_{l=1}^{L} ( 4^{pl}/(2^{pl} − 1) ) βl t^{pl}
≤ [ 2 + max{ 2, 2/p1 + 4^{p1}/(2^{p1} − 1), max_{2≤l≤L} 4^{pl}/(2^{pl} − 1) } ] ( β0 + ∑_{l=1}^{L} βl t^{pl} ).

Inequality (86) now clearly follows from the above conclusion, Lemma 16, and some trivial majorization.
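Since C0 is fully explicit, inequality (86) can be probed numerically. The check below is ours; it takes t0 = min_l (max(β0, 1)/βl)^{1/pl} and K = ⌈log₂(t/t0)⌉, which is our reading of definition (56) (stated earlier in the paper) and is consistent with the bounds log(t/t0) ≤ K < log(t/t0) + 1 used in the proofs.

```python
import math, random

random.seed(0)
for _ in range(2000):
    L = random.randint(1, 4)
    p = [random.uniform(0.3, 2.0) for _ in range(L)]
    beta0 = random.uniform(0.1, 5.0)
    beta = [random.uniform(0.1, 5.0) for _ in range(L)]
    t0 = min((max(beta0, 1.0) / b) ** (1.0 / q) for b, q in zip(beta, p))
    t = t0 * random.uniform(1.001, 50.0)       # Lemma 35 assumes t > t0
    K = math.ceil(math.log2(t / t0))           # our reading of (56)
    C0 = 2 + max(2.0, max(2 / q + 4 ** q / (2 ** q - 1) for q in p))
    lhs = sum(math.ceil(beta0 + sum(b * (t0 * 2 ** k) ** q
                                    for b, q in zip(beta, p)))
              for k in range(K + 1))
    rhs = C0 * math.ceil(beta0 + sum(b * t ** q for b, q in zip(beta, p)))
    assert lhs <= rhs, (lhs, rhs)              # inequality (86)
```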

The following technical lemma provides a useful inequality for proving a more difficult case of our result.


Lemma 36 Let positive scalars a and p be given. For any 0 ≤ K ≤ a − 1/(p ln 2) − 1, we have

∑_{k=0}^{K} 2^{pk}(a − k) ≤ ( 2^{p(K+1)}/(p ln 2) ) [ a − (K + 1) + 1/(p ln 2) ].

Proof. Noting that the function ψ(s) := 2^{ps}(a − s) is non-decreasing for any s ≤ a − 1/(p ln 2), we obtain

∑_{k=0}^{K} 2^{pk}(a − k) ≤ ∫₀^{K+1} 2^{ps}(a − s) ds = ( 2^{ps}/(p ln 2) ) ( a − s + 1/(p ln 2) ) |₀^{K+1}
≤ ( 2^{p(K+1)}/(p ln 2) ) [ a − (K + 1) + 1/(p ln 2) ].
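The integral comparison above is likewise easy to sanity-check numerically (our snippet, not from the paper):

```python
import math, random

random.seed(0)
ln2 = math.log(2)
for _ in range(2000):
    p = random.uniform(0.2, 3.0)
    a = random.uniform(1.0, 40.0)
    K_max = a - 1.0 / (p * ln2) - 1.0
    if K_max < 0:
        continue                               # Lemma 36 requires K <= K_max
    K = random.randint(0, int(K_max))
    lhs = sum(2 ** (p * k) * (a - k) for k in range(K + 1))
    rhs = 2 ** (p * (K + 1)) / (p * ln2) * (a - (K + 1) + 1.0 / (p * ln2))
    assert lhs <= rhs * (1 + 1e-9), (lhs, rhs)
```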

We now prove a more difficult case of the result.

Lemma 37 For some positive integer L, let positive scalars p1, p2, . . . , pL be given and define

C1 := 1 + 1/( (ln 2) min_{1≤l≤L} pl ),   (87)

C2 := 5 + max{ 5, max_{1≤l≤L} ( 4/pl² + 7/pl + 2C1 4^{pl} ) }.   (88)

Then, for any positive scalars β0, β1, . . . , βL and ν, we have

∑_{k=0}^{K} ⌈ β0 + ∑_{l=1}^{L} βl tk^{pl} ⌉ max{ 1, ⌈log(ν/tk)⌉ } ≤ C2 ⌈ β0 + ∑_{l=1}^{L} βl t^{pl} ⌉ max{ 1, ⌈log(ν/t)⌉ }   (89)

for every t ∈ (t0, t̄], where t̄ := 2^{−C1}ν and the scalars K and t0, . . . , tK are defined in (56).

Proof. Without loss of generality, assume that t0 = (max(β0, 1)/β1)^{1/p1}. Clearly, due to the definition of K in (56) and the assumption t > t0, we have log(t/t0) ≤ K ≤ log(t/t0) + 1. Using these relations, the inequalities log x = (log x^p)/p ≤ x^p/p for any x > 0, p > 0, and (log x)² = ( (2/p) log x^{p/2} )² ≤ 4x^p/p² for any x ≥ 1, p > 0, and the facts that log(ν/t) ≥ C1 ≥ 1 and t/t0 ≥ 1 due to the assumption t0 < t ≤ 2^{−C1}ν, we obtain

∑_{k=0}^{K} ( log(ν/tk) + 1 ) = ∑_{k=0}^{K} ( log(ν/t0) + 1 − k ) = (K + 1) ( log(ν/t0) + 1 − K/2 )
= (K + 1) ( log(ν/t0) − K + 1 + K/2 ) ≤ (K + 1) ( log(ν/t) + 1 + K/2 )
= (1/2) [ 2(K + 1) log(ν/t) + K² + 3K + 2 ]
≤ (1/2) ( K² + 5K + 4 ) log(ν/t)
≤ (1/2) ( log²(t/t0) + 7 log(t/t0) + 10 ) log(ν/t)
≤ (1/2) [ ( 4/p1² + 7/p1 )(t/t0)^{p1} + 10 ] log(ν/t)
= (1/2) [ ( 4/p1² + 7/p1 ) β1 t^{p1}/max(β0, 1) + 10 ] log(ν/t)
≤ [ ( 4/p1² + 7/p1 ) β1 t^{p1}/(1 + β0) + 5 ] log(ν/t).   (90)

Moreover, it follows from the fact that K ≤ log(t/t0) + 1 and the assumption t0 < t ≤ 2^{−C1}ν that

K + 1 ≤ log(t/t0) + 2 = log(2ν/t0) + log(t/(2ν)) + 2 ≤ log(2ν/t0) − C1 + 1
     = log(2ν/t0) − 1/( (ln 2) min_{1≤l≤L} pl ) ≤ log(2ν/t0) − 1/( (ln 2) pl ),   ∀ 1 ≤ l ≤ L,

which, together with Lemma 36 (applied with a = log(2ν/t0) and p = pl) and the fact that K ≥ log(t/t0), then implies that

∑_{k=0}^{K} 2^{pl k} ( log(2ν/t0) − k ) ≤ ( 2^{pl(K+1)}/(pl ln 2) ) ( log(2ν/t0) − (K + 1) + 1/(pl ln 2) )
≤ ( 2^{pl(K+1)}/(pl ln 2) ) ( log(2ν/t0) − ( log(t/t0) + 1 ) + 1/(pl ln 2) )
= ( 2^{pl(K+1)}/(pl ln 2) ) ( log(ν/t) + 1/(pl ln 2) )
≤ ( 2/(pl ln 2) ) 2^{pl(K+1)} log(ν/t) ≤ 2C1 2^{pl(K+1)} log(ν/t),   (91)

where the last two inequalities follow from the relation log(ν/t) ≥ log(ν/t̄) = C1 ≥ 1/(pl ln 2).

By the definitions of K and tk, k = 1, . . . , K, in (56) and the assumption t ≤ 2^{−C1}ν, we have tk = t0 2^k ≤ t0 2^{log(t/t0)+1} = 2t ≤ 2^{−C1+1}ν, which implies that log(ν/tk) ≥ C1 − 1 > 0, and hence that max{1, ⌈log(ν/tk)⌉} = ⌈log(ν/tk)⌉ ≤ log(ν/tk) + 1. Using these relations, (90), (91), and the relation t0 2^{K+1} < 4t (due to K < log(t/t0) + 1), we obtain

∑_{k=0}^{K} ⌈ β0 + ∑_{l=1}^{L} βl tk^{pl} ⌉ max{ 1, ⌈log(ν/tk)⌉ } ≤ ∑_{k=0}^{K} ( 1 + β0 + ∑_{l=1}^{L} βl tk^{pl} ) ( log(ν/tk) + 1 )
= (1 + β0) ∑_{k=0}^{K} ( log(ν/tk) + 1 ) + ∑_{l=1}^{L} ∑_{k=0}^{K} βl t0^{pl} 2^{k pl} ( log(ν/tk) + 1 )
= (1 + β0) ∑_{k=0}^{K} ( log(ν/tk) + 1 ) + ∑_{l=1}^{L} ∑_{k=0}^{K} βl t0^{pl} 2^{k pl} ( log(2ν/t0) − k )
≤ [ 5(1 + β0) + ( 4/p1² + 7/p1 ) β1 t^{p1} + 2C1 ∑_{l=1}^{L} βl t0^{pl} 2^{pl(K+1)} ] log(ν/t)
≤ [ 5(1 + β0) + ( 4/p1² + 7/p1 ) β1 t^{p1} + 2C1 ∑_{l=1}^{L} 4^{pl} βl t^{pl} ] log(ν/t)
≤ [ 5 + max{ 5, 4/p1² + 7/p1 + 2C1 4^{p1}, 2C1 max_{2≤l≤L} 4^{pl} } ] ( β0 + ∑_{l=1}^{L} βl t^{pl} ) log(ν/t).

Inequality (89) now immediately follows from the above conclusion, Lemma 16, the fact that max{1, ⌈log(ν/t)⌉} = ⌈log(ν/t)⌉ due to the assumption t ≤ 2^{−C1}ν, and some trivial majorization.

We are now ready to prove Proposition 24.

Proof of Proposition 24. Assume first that t ≤ t0. Due to the definition of K in (56), we have K = 0 in this case, which, in view of the definition of t0 in (56) and Lemma 16, then implies that

∑_{k=0}^{K} ⌈ β0 + ∑_{l=1}^{L} βl tk^{pl} ⌉ max{ 1, ⌈log(ν/tk)⌉ } = ⌈ β0 + ∑_{l=1}^{L} βl t0^{pl} ⌉ max{ 1, ⌈log(ν/t0)⌉ }
≤ ⌈ β0 + ∑_{l=1}^{L} βl t0^{pl} ⌉ max{ 1, ⌈log(ν/t)⌉ }
≤ ⌈ β0 + L max(β0, 1) ⌉ max{ 1, ⌈log(ν/t)⌉ }
≤ ⌈ (L + 1)β0 + L ⌉ max{ 1, ⌈log(ν/t)⌉ }
≤ (2L + 1) ⌈β0⌉ max{ 1, ⌈log(ν/t)⌉ }
≤ (2L + 1) ⌈ β0 + ∑_{l=1}^{L} βl t^{pl} ⌉ max{ 1, ⌈log(ν/t)⌉ }.

Hence, in the case where t ≤ t0, inequality (55) holds with C = 2L + 1.

Assume now that t > t0. Denote t̄ := 2^{−C1}ν, where C1 is given in (87), and consider two subcases: a) t ≤ t̄; b) t > t̄. In subcase a), we have t̄ ≥ t > t0, which, in view of Lemma 37, clearly implies that inequality (55) holds with C = C2.

We now consider the remaining subcase b), where t̄ ≤ t. Denoting K̄ := max{0, ⌈log(t̄/t0)⌉}, we have tk ≥ t̄ for every k ≥ K̄, and hence

max{ 1, ⌈log(ν/tk)⌉ } ≤ max{ 1, ⌈log(ν/t̄)⌉ } = max{ 1, ⌈C1⌉ } ≤ C1 + 1,   ∀ k ≥ K̄,

which, together with Lemma 35, then implies that

∑_{k=K̄}^{K} ⌈ β0 + ∑_{l=1}^{L} βl tk^{pl} ⌉ max{ 1, ⌈log(ν/tk)⌉ } ≤ (C1 + 1) ∑_{k=K̄}^{K} ⌈ β0 + ∑_{l=1}^{L} βl tk^{pl} ⌉
≤ (C1 + 1) ∑_{k=0}^{K} ⌈ β0 + ∑_{l=1}^{L} βl tk^{pl} ⌉ ≤ C0(C1 + 1) ⌈ β0 + ∑_{l=1}^{L} βl t^{pl} ⌉
≤ C0(C1 + 1) ⌈ β0 + ∑_{l=1}^{L} βl t^{pl} ⌉ max{ 1, ⌈log(ν/t)⌉ }.

Moreover, if K̄ ≥ 1, i.e., t̄ > t0, it then follows from Lemma 37 with t = t̄ that

∑_{k=0}^{K̄−1} ⌈ β0 + ∑_{l=1}^{L} βl tk^{pl} ⌉ ⌈log(ν/tk)⌉ ≤ C2 ⌈ β0 + ∑_{l=1}^{L} βl t̄^{pl} ⌉ ⌈log(ν/t̄)⌉ = C2 ⌈ β0 + ∑_{l=1}^{L} βl t̄^{pl} ⌉ ⌈C1⌉
≤ C2 ⌈ β0 + ∑_{l=1}^{L} βl t^{pl} ⌉ ⌈C1⌉ ≤ (C1 + 1) C2 ⌈ β0 + ∑_{l=1}^{L} βl t^{pl} ⌉ max{ 1, ⌈log(ν/t)⌉ },

where the second inequality follows from the assumption t̄ ≤ t. Combining the previous two conclusions, we conclude that (55) holds with C = (C1 + 1)(C0 + C2).

References

[1] G. Lan, Z. Lu, and R.D.C. Monteiro. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Manuscript, School of ISyE, Georgia Tech, Atlanta, GA, 30332, USA, December 2006. Conditionally accepted in Mathematical Programming.

[2] Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269:543–547, 1983. Translated as Soviet Math. Dokl.

[3] Y. E. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Massachusetts, 2004.

[4] Y. E. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103:127–152, 2005.
