35
Noname manuscript No. (will be inserted by the editor) Self-concordant inclusions: A unified framework for path-following generalized Newton-type algorithms Quoc Tran-Dinh · Tianxiao Sun · Shu Lu Received: date / Accepted: date Abstract We study a class of monotone inclusions called “self-concordant inclusion” which covers three fundamental convex optimization formulations as special cases. We develop a new generalized Newton-type framework to solve this inclusion. Our framework subsumes three schemes: full-step, damped-step, and path-following methods as specific instances, while allows one to use inexact computation to form generalized Newton directions. We prove a local quadratic convergence of both full-step and damped-step algorithms. Then, we propose a new two-phase inexact path-following scheme for solving this monotone inclusion which possesses an O( p log(1/"))-worst-case iteration-complexity to achieve an "-solution, where is the barrier parameter and " is a desired accuracy. As byproducts, we customize our scheme to solve three convex problems: convex-concave saddle-point, nonsmooth constrained convex program, and nonsmooth convex program with linear constraints. We also provide three numerical examples to illustrate our theory and compare with existing methods. Keywords Self-concordant inclusion · generalized Newton-type methods · path-following schemes · monotone inclusion · constrained convex programming · saddle-point problems Mathematics Subject Classification (2000) 90C25 · 90C06 · 90-08 1 Introduction 1.1 Problem statement This paper is devoted to studying the following monotone inclusion which covers three important convex optimization templates [2,16,46]: Find z ? 2 R p such that: 0 2 A Z (z ? ) := A(z ? )+ N Z (z ? ), (1) where Z is a nonempty, closed, and convex set in R p ; A : R p 2 R p is a multivalued and maximally monotone operator (cf. Definition 1); N Z (z) is the normal cone of Z at z given by {w 2 R p | hw, z - ˆ zi≥ 0, 8ˆ z 2 Z} if z 2 Z , and ; otherwise; and “ :=” stands for “is defined as”. Throughout this paper, we assume that Z is endowed with a “ - self-concordant barrier” F (cf. Definition 3). We denote by Z ? := {z ? | 0 2 A(z ? )+ N Z (z ? )} the solution set of (1). Without the self-concordance of Z , (1) is a classical monotone inclusion [2, 46], and can be reformulated into a multivalued variational inequality problem [16]. In particular, (1) covers the optimality (or KKT) conditions of unconstrained and constrained convex programs, and convex-concave saddle-point problems as described in Subsection 1.2. Therefore, (1) can be Corresponding author ([email protected]) Q. Tran-Dinh, T. Sun and S. Lu Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill (UNC). 318 Hanes Hall, UNC Chapel Hill, NC 27599, USA. E-mail: {quoctd, tianxias, shulu}@email.unc.edu

Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

Noname manuscript No.(will be inserted by the editor)

Self-concordant inclusions: A unified framework for path-following

generalized Newton-type algorithms

Quoc Tran-Dinh⇤ · Tianxiao Sun · Shu Lu

Received: date / Accepted: date

Abstract We study a class of monotone inclusions called “self-concordant inclusion” whichcovers three fundamental convex optimization formulations as special cases. We develop anew generalized Newton-type framework to solve this inclusion. Our framework subsumesthree schemes: full-step, damped-step, and path-following methods as specific instances,while allows one to use inexact computation to form generalized Newton directions. Weprove a local quadratic convergence of both full-step and damped-step algorithms. Then, wepropose a new two-phase inexact path-following scheme for solving this monotone inclusionwhich possesses an O(

p⌫ log(1/"))-worst-case iteration-complexity to achieve an "-solution,

where ⌫ is the barrier parameter and " is a desired accuracy. As byproducts, we customize ourscheme to solve three convex problems: convex-concave saddle-point, nonsmooth constrainedconvex program, and nonsmooth convex program with linear constraints. We also providethree numerical examples to illustrate our theory and compare with existing methods.

Keywords Self-concordant inclusion · generalized Newton-type methods · path-followingschemes · monotone inclusion · constrained convex programming · saddle-point problems

Mathematics Subject Classification (2000) 90C25 · 90C06 · 90-08

1 Introduction1.1 Problem statementThis paper is devoted to studying the following monotone inclusion which covers threeimportant convex optimization templates [2,16,46]:

Find z

? 2 Rp such that: 0 2 AZ(z?) := A(z

?) +NZ(z

?), (1)

where Z is a nonempty, closed, and convex set in Rp; A : Rp ◆ 2

Rp

is a multivalued andmaximally monotone operator (cf. Definition 1); NZ(z) is the normal cone of Z at z given by{w 2 Rp | hw, z� ˆ

zi � 0, 8ˆz 2 Z} if z 2 Z, and ; otherwise; and “ :=” stands for “is definedas”. Throughout this paper, we assume that Z is endowed with a “⌫ - self-concordant barrier”F (cf. Definition 3). We denote by Z?

:= {z? | 0 2 A(z

?) +NZ(z

?)} the solution set of (1).

Without the self-concordance of Z, (1) is a classical monotone inclusion [2,46], and can bereformulated into a multivalued variational inequality problem [16]. In particular, (1) coversthe optimality (or KKT) conditions of unconstrained and constrained convex programs, andconvex-concave saddle-point problems as described in Subsection 1.2. Therefore, (1) can be

⇤Corresponding author ([email protected])

Q. Tran-Dinh, T. Sun and S. LuDepartment of Statistics and Operations Research, The University of North Carolina at Chapel Hill (UNC).318 Hanes Hall, UNC Chapel Hill, NC 27599, USA. E-mail: {quoctd, tianxias, shulu}@email.unc.edu

This is a post-peer-review, pre-copyedit version of an article published in Mathematical Programming. The final authenticated version is available online at: https://link.springer.com/article/10.1007/s10107-018-1264-6
Page 2: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

2 Q. Tran-Dinh et al.

used as a unified tool to study and develop numerical methods for these problems [2,16].Methods for solving (1) and its special instances are well-developed under different structureassumptions imposed on A and Z [2,16]. See Section 6 for a more thorough discussion.

We instead focus on a class of (1), where Z is equipped with a “self-concordant” barrier(cf. Definition 3). The self-concordance notion was introduced by Nesterov and Nemirovskii[32,37] in the 1990s to develop a unified theory and polynomial time algorithms in interior-point methods for structural convex programming, but has not been well exploited in otherclasses of optimization methods in both the convex and nonconvex cases.

Our approach in this paper can briefly be described as follows. Let Z be equipped witha ⌫-self-concordant barrier F . Since NZ(z) = {0p} for any z 2 int (Z), the interior of Z, wecan define the following barrier problem associated with (1):

Find z

?t 2 int (Z) such that: 0 2 At(z

?t ) := trF (z

?t ) +A(z

?t ), (2)

where t > 0 is a penalty parameter. For any t > 0, At remains a maximally monotoneoperator. Hence, (2) is a parametric monotone inclusion depending on the parameter t. Aswe will show in Lemma 1 that the solution z

?t of (2) exists and is unique for any t > 0 under

mild conditions. By perturbation theory [13,43], one can show that z

?t is continuous w.r.t.

t > 0. The set {z?t | t > 0} containing solutions of (2) for each t generates a trajectory calledthe central path of (1). Each point z

?t on this path is called a central point. Our objective

is to design efficiently numerical methods for solving (1) from the linearization of (2).

1.2 Three fundamental convex optimization templatesWe present three basic problems in convex optimization covered by (1) to motivate our work.

1.2.1 Constrained convex programs

Consider a general constrained convex optimization problem as studied in [52,53]:

g? := min

x

{g(x) | x 2 X} , (3)

where g : Rn!R [ {+1} is proper, closed, and convex, and X is a nonempty, closed andconvex set in Rn endowed with a ⌫-self-concordant barrier f (cf. Definition 3). Let @g bethe subdifferential of g (cf. Section 2). The following optimality condition is necessary andsufficient for x? 2 Rn to be an optimal solution of (3) under a given constraint qualification:

0 2 @g(x?) +NX (x

?).

By letting z := x, A := @g and Z := X , this inclusion exactly has the same form as (1).The barrier problem associated with (3) becomes

B?(t) := min

x2Rn{B(x; t) := g(x) + tf(x) | x 2 int (X )} ,

where t > 0 is a penalty parameter. The optimality condition of this barrier problem is0 2 trf(x?t ) + @g(x?t ) which is exactly (2) with F := f .1.2.2 Constrained convex programs with linear constraints

We are interested in the following constrained convex optimization problem:

G? := max

x2Rn,s2Rm{G(x, s) := hc,xi � g(s) | Lx�W s = b, x 2 K} , (4)

where c 2 Rn, b 2 Rp, L : Rn ! Rp and W : Rns ! Rp are linear operators, g :

Rns ! R [ {+1} is a proper, closed, convex and possibly nonsmooth function, and Kis a proper, nonempty, closed, pointed, and convex cone endowed with a ⌫-self-concordantlogarithmically homogeneous barrier f (cf. Definition 3). In addition, we assume that n p.

The corresponding dual problem of (4) can be written as follows:

H⇤:= min

y

{H(y) := g⇤(W ⇤y) + hb,yi | L⇤

y � c 2 K⇤} , (5)

Page 3: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

Self-concordant inclusions: A unified framework for generalized interior-point methods 3

where K⇤:= {u 2 Rn | hx,ui � 0, 8x 2 K} is the dual cone of K, L⇤ and W ⇤ are the adjoint

operators of L and W , respectively, and g⇤(u) := sup

s

{hu, si � g(s)} is the conjugate of g.Let Y := {y 2 Rp | L⇤

y � c 2 K⇤}. Then, the optimality condition of (5) becomes0 2 @g⇤(W ⇤

y

?) + b+NY(y

?),

which fits the form of (1). The barrier problem associated with the dual problem (5) ismin

y2Rp{g⇤(W ⇤

y) + hb,yi+ tf⇤(c� L⇤

y)} , (6)

where f⇤ is the Fenchel conjugate of f . If we define (·) = g⇤(W ⇤(·)) + hb, ·i and '(·) :=

f⇤(c� L⇤

(·)) the barrier of Y, then the optimality condition of (6) becomes0 2 �tL (rf⇤

(c� L⇤y

?t )) + @ (y?t ), (7)

which falls into the form (2) with z := y, F (·) := '(·) = f⇤(c� L⇤

(·)), and A(·) := @ (·).1.2.3 Convex-concave saddle-point problems

Consider the following convex-concave saddle-point problem that covers many applicationsincluding signal/image processing and duality theory [9,11]:

�? := min

y2Y

�(y) := (y) + max

x2X{hy, Lxi � g(x)}

, (8)

where g : Rn ! R [ {+1} is a proper, closed and convex function; : Rm ! R [ {+1}is also a proper, closed and convex function; X and Y are two nonempty, closed and convexsets in Rn and Rm, respectively; and L : Rn ! Rm is a given linear operator. The optimalitycondition of (8) for a saddle point (x

?,y?) is(

0 2 @g(x?)� L⇤y

?+NX (x

?),

0 2 @ (y?) + Lx? +NY(y?).

(9)

where NX and NY are the normal cone of X and Y, respectively. If we define z := (x,y),Z := X ⇥ Y,

A(z) :=

@g(x)� L⇤y

@ (y) + Lx

, and NZ(z) := NX (x)⇥NY(y), (10)

then (9) can be cast into the form (1).Let X and Y be endowed with self-concordant barriers f and ', respectively. Then, we

can write down the barrier problem of (8) as

B?(t) := min

y2int(Y)

n

B(y; t) := (y) + t'(y) + max

x2int(X )

{hy, Lxi � g(x)� tf(x)}o

,

where t > 0 is a penalty parameter. Hence, its optimality condition becomes(

0 2 trf(x?t )� L⇤y

?t + @g(x?t )

0 2 tr'(y?t ) + Lx?t + @ (y?t ).(11)

If we define F (z) := f(x) + '(y), then (11) can be written into the form (2).1.3 Our contributionWe unify the proximal-point and the path-following interior-point schemes to design a jointtreatment between these methods for solving the monotone inclusion (1). Our approach isfundamentally different from existing methods, where we use the means of self-concordantbarriers of the feasible set Z in (1) to develop generalized Newton-type algorithms.

We propose a unified framework that covers three fundamental convex problems as pre-viously described. We develop three different generalized Newton-type methods for solving(1). Our framework covers the previous work in [52,53] for the convex problem (3) as specialcases. Our approach relies on specific structure of Z in (1) where we can treat (1) via thelinearization of its barrier formulation (2). By introducing a new scaled resolvent mappingand generalized proximal Newton decrement, we develop a generalized Newton frameworkfor solving (1). Then, we combine it and a homotopy strategy for the penalty parameter t toobtain a path-following scheme for solving (1). Our approach relates to classical proximal-point and interior-point methods in the literature as discussed in Section 6.

Page 4: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

4 Q. Tran-Dinh et al.

Contribution: To this end, we can summarize the contribution of this paper as follows:(a) (Theory) We study a class of monotone inclusions, which we call “self-concordant inclu-

sions”, that provides a unified framework using self-concordant barriers to investigatethree fundamental classes of convex optimization problems. We prove the existence anduniqueness of the central path of (2) under mild assumptions.

(b) (Algorithms) We propose a generalized Newton-type framework for solving (1). Thisframework covers three methods: full-step generalized Newton, damped-step generalizedNewton, and full-step path-following generalized Newton schemes. Our methods allowone to use inexact computation to form generalized Newton-search directions, and adap-tively update the contraction factor for the penalty parameter t associated with F .

(c) (Convergence theory) We prove a local quadratic convergence of the first two inexactgeneralized Newton-methods, and estimate the worst-case iteration-complexity of thethird inexact path-following scheme to achieve an "-solution, where " is a desired accu-racy. Surprisingly, this worst-case complexity is O(

p⌫ log(1/✏)) which is the same as in

standard path-following methods for smooth convex programming [32,36].(d) (Special instances) We customize our path-following framework to solve three convex

problems: (3), (4) and (8), and investigate the overall worst-case iteration-complexity foreach method. In addition, we provide an explicit scheme to recover primal solutions fromthe dual ones in the linear constrained case (4) with a rigorous convergence guarantee.Let us emphasize the following points of our contribution. First, using a barrier function

for the constraint set Z in (1) allows us to handle a wide class of problems where projectionsonto Z are no longer efficient, e.g., when Z is a general polyhedron, or a hyperbolic cone.Second, these are second-order methods which often achieve high accuracy solutions andhave a fast local convergence rate. This is an advantage when the evaluation of barrierfunction values and its derivatives is expensive. In addition, they are known to be robust toinexact computation and noise. However, as a compensation, the complexity-per-iteration isoften higher than first-order methods. Fortunately, inexact computation allows us to applyiterative methods for computing generalized Newton search directions. Third, when appliedto (3), (4), and (8), the efficiency of our algorithms depends on the cost of the scaledproximal operator of g and which is a key component in first-order, primal-dual, andsplitting methods. Finally, our framework is sufficiently general and can be customized tospecific classes of structural convex problems such as conic and geometric programming.

1.4 Outline of the paperThe rest of this paper is organized as follows. In Section 2, we recall some preliminaryresults including monotone operators and self-concordance notions [36] used in this paper.Section 3 presents a unified generalized Newton-type framework that covers three differentmethods and analyzes their local convergence properties as well as their worst-case iteration-complexity. Section 4 customizes our path-following framework to solve the convex-concaveminimax problem (8), the primal constrained convex problem (3), and the linear constrainedconvex problem (4). Section 5 deals with specific applications and illustrates numerically theperformance of our algorithm. For clarity of exposition, technical proofs of the results in themain text are deferred to the appendix.

2 Preliminaries: monotonicity, convexity, and self-concordanceWe recall some preliminary results from classical convex analysis including monotonicity,convexity and self-concordance which will be used in the sequel.

2.1 Basic definitionsLet hu,vi or u

>v denote the inner product, and kuk

2

denote the Euclidean norm for anyu,v 2 Rp. For a proper, closed, and convex function F : Rp ! R [ {+1}, dom(F ) :=

{z 2 Rp | F (z) < +1} denotes its domain, and Dom(F ) := cl(dom(F )) denotes the closure

Page 5: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

Self-concordant inclusions: A unified framework for generalized interior-point methods 5

of dom(F ), @F (z) :=

w 2 Rp | F (u) � F (z)+ hw,u�zi, 8u 2 dom(F )

denotes its subd-ifferential at z [45]. We also use C3

(Z) to denote the class of three-time continuously differen-tiable functions from Z ✓ Rp to R. Given a multivalued operator A : Rp ◆ 2

Rp

, dom(A) :=

{z 2 Rp | A(z) 6= ;} denotes the domain of A, and gr (A) := {(z,w) 2 Rp ⇥ Rp | w 2 A(z)}denotes the graph of A. Sp

+

stands for the symmetric positive semidefinite cone of di-mension p, and Sp

++

is its interior, i.e., Sp++

= int

Sp+

. For any Q 2 Sp++

, we denotekzk

Q

:= hQz, zi1/2 the weighted norm of z, and kzk⇤Q

:= hQ�1

z, zi1/2 is its dual norm.For the three-time continuously differentiable and convex function F : Rp ! R defined in

(2) such that r2F (z) � 0 at some z 2 dom(F ) (i.e., r2F (z) is symmetric positive definite),we define a local norm, and its dual norm, respectively as

kukz

:= hr2F (z)u,ui1/2, and kvk⇤z

:= hr2F (z)

�1

v,vi1/2, (12)

for given u,v 2 Rp. Clearly, with this definition, the well-known Cauchy-Schwarz inequalityhu,vi kuk

z

kvk⇤z

holds.

2.2 Maximally monotone, resolvent, and proximal operatorsDefinition 1 Given a multivalued operator A : Rp ◆ 2

Rp

, we say that A is monotone if

for any z, ˆz 2 dom(A), hw � ˆ

w, z� ˆ

zi � 0 for w 2 A(z) and

ˆ

w 2 A(

ˆ

z); and A is maximal

if its graph is not properly contained in the graph of any other monotone operator.

Given a maximally monotone operator A : Rp ◆ 2

Rp

, and Q 2 Sp++

, we define

JQ

�1A(z) = (I+Q

�1A)

�1

(z) := {w 2 Rp | 0 2 Q(w � z) +A(w)} , (13)

the scaled resolvent operator of A [2,46]. It is well-known that dom(JQ

�1A) = Rp and JQ

�1Ais well-defined and single-valued. If Q = I, the identity operator, then JI�1A ⌘ JA is thestandard resolvent of A. When A = @g, the subdifferential of a proper, closed and convexfunction g, J

Q

�1A becomes a scaled proximal operator of g, which is defined as follows:

proxQ

�1g(x) := argmin

u

g(u) + (1/2)ku� xk2Q

| u 2 dom(g)

. (14)

Methods for evaluating proxQ

�1g have been discussed in the literature, see, e.g., [4,18]. IfQ = I, then prox

Q

�1g = proxg, the standard proximal operator of g. Examples of suchfunctions can be found, e.g., in [2,10,40].

2.3 Self-concordant functions and self-concordant barriersWe also use the self-concordance concept introduced by Nesterov and Nemirovskii [32,36].

Definition 2 A univariate convex function ' 2 C3

(dom(')) is called standard self-concordantif |'000

(⌧)| 2'00(⌧)3/2 for all ⌧ 2 dom('), where dom(') is an open set in R. A function

F : dom(F ) ✓ Rp ! R is standard self-concordant if for any z 2 dom(F ) and v 2 Rp, the

univariate function ' defined by ⌧ 7! '(⌧) := F (z+ ⌧v) is standard self-concordant.

Definition 3 A standard self-concordant function F : Z ⇢ Rp ! R is a ⌫-self-concordantbarrier for a convex set Z with parameter ⌫ > 0 if dom(F ) = int (Z) and

sup

u2Rp

2hrF (z),ui � kuk2z

⌫, 8z 2 dom(F ).

In addition, F (z) tends to +1 as z approaches the boundary of Z. A function F is called a ⌫-

self-concordant logarithmically homogeneous barrier function of Z if F (⌧z) = F (z)�⌫ log(⌧)for all z 2 int (Z) and ⌧ > 0.

Page 6: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

6 Q. Tran-Dinh et al.

Several simple sets are equipped with a self-concordant logarithmically homogeneous bar-rier. For instance, FRp

+

(z) := �Pp

i=1

log(zi) is a p-self-concordant barrier of Rp+

, FSn+

(Z) :=

� log det(Z) is an n-self-concordant barrier of Sn+

, and F (z, t) = � log(t2 � kzk22

) is a 2-self-concordant barrier of the Lorentz cone Lp+1

:= {(z, t) 2 Rp ⇥ R+

| kzk2

t}.When Z is bounded and F is a ⌫-self-concordant barrier for Z, the analytical center ¯

z

?f

of f exists and is unique. It is defined by

¯

z

?F := argmin

z2int(Z)

F (z), (and its optimality condition is rF (

¯

z

?F ) = 0). (15)

Let us define := ⌫ + 2

p⌫ for a general self-concordant barrier, and := 1 for a self-

concordant logarithmically homogeneous barrier. Then, we have kvk⇤z

kvk⇤¯z

?F

for anyz 2 int(Z) and v 2 Rp.

Let K be a proper, closed and pointed convex cone. If K is endowed with a ⌫-self-concordant logarithmically homogeneous barrier function F , then its Fenchel conjugate (alsocalled Legendre transformation [36])

F ⇤(w) := sup

z

{hw, zi � F (z) | z 2 K}

is also a ⌫-self-concordant logarithmically homogeneous barrier of the anti-dual cone �K⇤

of K. For instance, if K = Sn+

, then K⇤= Sn

+

= K (self-dual cone). A barrier function of Sn+

is F (z) := � log det(z). Hence, F ⇤(w) = �n� log det(�w) is a barrier function of �K⇤.

3 Generalized Newton-type methods for self-concordant inclusionsWe propose a novel generalized Newton-type scheme for solving (1). Then, we develop threeinexact generalized Newton-type schemes: full-step, damped-step, and path-following algo-rithms based on the linearization of (2). We provide a unified analysis for convergence.

3.1 Fundamental assumptions and fixed-point characterizationThroughout this paper, we rely on the following fundamental, but standard assumption.Assumption A. 1 (a) The feasible set Z is nonempty, closed, and convex. Moreover, Z is

equipped with a ⌫-self-concordant barrier F .

(b) The operator A is maximally monotone, int (Z)\dom(A) 6= ;, and dom(A) is either an

open set or a closed set.

(c) The solution set Z?of (1) is nonempty.

Note that since dom(rF ) = int (Z), Assumption A.1 is sufficient for At defined by (2) tobe maximally monotone [2, Corollary 25.5]. This assumption can be relaxed to differentconditions as discussed in [2, Section 25.1], which we omit here.

Our aim is to compute an approximate solution of (1) up to a given accuracy as follows:

Definition 4 Given " � 0, we say that

˜

z

?" 2 int (Z) is an "-solution to (1) if

dist

˜z

?"

0,A(

˜

z

?")�

:= min

e

n

kek⇤˜z

?"| e 2 A(

˜

z

?")

o

".

Here, distz

(w,⌦) defines a weighted distance from w 2 Rp to a nonempty, closed and convexset ⌦ in Rp, and 0 is the zero vector. Since ˜

z

?" 2 int (Z), we have NZ(˜z

?") = {0}. Hence,

AZ(˜z?") ⌘ A(

˜

z

?").

If " = 0, then Definition 4 says that 0 2 A(

˜

z

?"). Hence, ˜z?" is an exact solution of (1) in

the interior of Z. If all solutions z

? of (1) are on the boundary of Z, then Definition 4 onlyworks if " > 0.

We can modify Definition 4 as dist

z

0,AZ(˜z?")�

", where z 2 int (Z) is fixed a priori.Then, all the results in the next sections remain preserved but require a slight justification.In the sequel, we develop different numerical methods to generate a sequence

z

k

from theinterior of Z.

Page 7: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

Self-concordant inclusions: A unified framework for generalized interior-point methods 7

The scaled resolvent operator of A: Let us fix ˆ

z 2 int (Z) and t > 0. Then, we haver2F (

ˆ

z) 2 Sp++

. For simplicity of presentation, using (13) we denote by

Pˆz

(·; t) := J(tr2F (ˆz))

�1A(·) =�

I+ t�1r2F (

ˆ

z)

�1A��1

(·), (16)

the scaled resolvent of A. Using Pˆz

(·; t), we can formulate the monotone inclusion (2) as afixed-point equation

z

?t = P

ˆz

z

?t �r2F (

ˆ

z)

�1rF (z

?t ); t

. (17)

Clearly, if we define Rˆz

(·) := Pˆz

(·�r2F (

ˆ

z)

�1rF (·); t), then z

?t is a fixed-point of R

ˆz

(·).The existence of the central path: We prove in Appendix 7.1 the following existenceresult for (2). Let us recall that the horizon cone of a convex set C consists of vectors ! suchthat z+ ⌧! 2 cl(C) for any z 2 C and any ⌧ > 0, where cl(C) stands for the closure of C.

Lemma 1 Suppose that for any nonzero ! in the horizon cone of int (Z) \ dom(A), there

exists some

ˆ

z 2 int (Z) \ dom(A) with

ˆ

a 2 A(

ˆ

z) such that hˆa,!i > 0. Then, for each t > 0,

problem (2) has a unique solution. Moreover, we have dist

z

?t

0,A(z

?t )�

tp⌫, which shows

that z

?t is an "-solution to (1) in the sense of Definition 4 if t "p

⌫.

The assumption in Lemma 1 is quite general. There are two special cases in which thisassumption holds. First, if int (Z)\dom(A) is bounded, then the only element in the horizoncone of int (Z)\dom(A) is 0, and the assumption trivially holds. Second, if the solution setZ? of (1) is nonempty and bounded, and the set-valued map A is continuous at points inbdry(Z \ dom(A)) relative to Z \dom(A), then this assumption also holds as can be shownusing [46, Theorem 12.51]. Here, bdry(Z) stands for the boundary of Z, and we refer to [46,Definition 5.4] for the definition of the continuity of a set-valued map.Generalized gradient mapping: Fix ˆ

z 2 int (Z) with r2F (

ˆ

z) � 0. We consider thefollowing linear monotone inclusion in s:

0 2 trF (z) + tr2F (

ˆ

z)(s� z) +A(s). (18)

If we take z =

ˆ

z, then it becomes a linearization (with respect to rF ) of (2) at a given pointz. It is obvious that (18) is strongly and maximally monotone so that its solution exists andis unique. We denote this solution by s

ˆz

(z; t), and, by using Pˆz

(·; t), it can be written as

s

ˆz

(z; t) := Pˆz

z�r2F (

ˆ

z)

�1rF (z); t�

. (19)

Next, we define the following mapping

Gˆz

(z; t) := r2F (

ˆ

z) (z� s

ˆz

(z; t)) ⌘ r2F (

ˆ

z)

z� Pˆz

z�r2F (

ˆ

z)

�1rF (z); t��

, (20)

When A = 0, Gˆz

(z; t) = rF (z), which is exactly the gradient of F . Then, we adopt thename in [32] to call G

ˆz

(·; t) a generalized gradient mapping.Given G

z

(z; t) as in (20) with ˆ

z = z, we define the following generalized Newton decre-ment �t(z) to analyze the convergence of generalized Newton-type methods below:

�t(z) := kGz

(z; t)k⇤z

= kz� Pz

z�r2F (z)

�1rF (z); t�

kz

. (21)

If A(z) = c, a constant operator, then �t(z) = kt�1

c+rF (z)k⇤z

, which is exactly the Newtondecrement defined in [32, Formula 4.2.16].

To conclude, we summarize the result of this subsection in the following lemma. Thisresult is a direct consequence of the definition of G

z

(·; t) and �t(·). We omit the proof.

Lemma 2 The solution s

ˆz

(·; t) of (18) exists and is unique for any z 2 dom(F ). Conse-

quently, Gˆz

(·; t) given by (20) is well-defined on dom(F ).

Let z

?t 2 int (Z) be a given point and �t(·) be defined by (21). Then, �t(z

?t ) = 0 if and

only if z

?t is a solution to (2).

Page 8: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

8 Q. Tran-Dinh et al.

In the sequel, we only work with the solution s

ˆz

(z; t) of (18) which exists and is unique.However, we assume throughout this paper that the assumptions of Lemma 1 hold so thatthe solution z

?t of (2) exists and is unique for each t > 0. We do not use z

?t of (2) at any

step of our algorithms. Since z

?t is on the central path of (1) at t > 0, z?t 2 Z. If t > 0 is

sufficiently small, e.g., t := "/p⌫, then we can say that z

?t is also an "-solution of (1) as

stated in Lemma 1 in the sense of Definition 4.3.2 Inexact generalized Newton-type schemesThe main step of the generalized Newton method is presented as follows: For a fixed valuet > 0, and a given iterate z 2 int (Z), we approximate F by its Taylor’s expansion and define

bAt(w; z) := t⇥

rF (z) +r2F (z)(w � z)

+A(w). (22)

Since r2F (z) � 0, we can compute the unique solution of the linearized inclusion:

s

z

(z; t) :=�

w 2 int (Z) | 0 2 bAt(w; z)

⌘ (

bAt(·; z))�1

(0). (23)

Computing s

z

(z; t) exactly is often impractical, so we allow one to approximate it as follows.

Definition 5 Given an accuracy � 2 [0, 1), we say that z

+

is a �-approximation to the true

solution

¯

z

+

:= s

z

(z; t) defined in (23) (and is denoted by z

+

⇡ ¯

z

+

) if

dist

z

0, bAt(z+; z)�

= min

e

n

kek⇤z

| e 2 bAt(z+; z)

o

t�. (24)

First, we show that, under (24), we have kz+

� ¯

z

+

kz

�. Next, since we are working withthe linearization (23) of (2), the following lemma, whose proof is in Appendix 7.2, showsthat an approximate solution of (23) is also an approximate solution of problem (1).

Lemma 3 Let z

+

be a �-approximate solution to

¯

z

+

of (23) in the sense of Definition 5.

Then, we have kz+

� ¯

z

+

kz

�. Furthermore, if �t(z) + � < 1, then

dist

z

+

(0,AZ(z+)) (1� �t(z)� �)�1

�p⌫ + �t(z) + 2�

t. (25)

If we choose t > 0 such that t (1��t(z)� �) (p⌫ + �t(z) + 2�)

�1

" for a given " > 0, then

z

+

2 int (Z), and z

+

is an "-solution to (1) in the sense of Definition 4.

We now investigate the convergence of the inexact full-step, damped-step, and path-followinggeneralized Newton methods.3.2.1 A key estimate

The following theorem provides a key estimate to analyze the convergence of the generalizedNewton-type scheme above, whose proof can be found in Appendix 7.3.

Theorem 1 For a given z 2 int (Z), let z

+

be the point generated by the inexact generalized

Newton scheme (in the sense of Definition 5):

z

+

⇡ ¯

z

+

:= Pz

z�r2F (z)

�1rF (z); t+

. (26)

Then, if �t+

(z) + �(z) < 1, where �t+

(z) is defined by (21) and �(z) := kz+

� ¯

z

+

kz

, then

z

+

2 int (Z), and the following estimate holds:

�t+

(z

+

) ✓

�t+

(z) + �(z)

1� �t+

(z)� �(z)

2

+

�(z)

(1� �t+

(z)� �(z))3. (27)

Moreover, the right-hand side of (27) is monotonically increasing w.r.t. �t+

(z) and �(z).

Clearly, if z+

=

¯

z

+

(i.e., the subproblem (26) is solved exactly), then (27) reduces to

�t+

(z

+

) ✓

�t+

(z)

1� �t+

(z)

2

, (28)

which is in the form of [32, Theorem 4.1.14], but for the exact variant of (26).

Page 9: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

Self-concordant inclusions: A unified framework for generalized interior-point methods 9

3.2.2 Neighborhood of the central path and quadratic convergence region

Given the generalized Newton decrement �t(·) defined by (21), we consider the following set

⌦t(�) := {z 2 int (Z) | �t(z) �} , (29)

where � 2 (0, 1). We call ⌦t(�) a neighborhood of the central path of (2) with the radius �.If we can choose � 2 (0, 1) such that:

(i) the sequence�

z

k

generated by a generalized Newton scheme starting from z

0 2 ⌦t(�)

belongs to ⌦t(�), and(ii) the corresponding sequence of the generalized Newton decrements

�t(zk)

convergesquadratically to zero,

then we call ⌦t(�) a quadratic convergence region of this method, and denote it by Qt(�).Next, we propose two inexact generalized Newton schemes: full-step and damped step, to

generate a sequence�

z

k

starting from z

0 2 Qt(�) for some predefined � 2 (0, 1), and showthat

�t(zk)

converges quadratically to zero. In these schemes, the penalty parameter t isfixed at a sufficiently small value a priori, which may cause some difficulty for computingz

k+1 from z

k due to the ill-condition of r2F (z

k). To avoid this situation, we then suggest

using a path-following scheme to gradually decrease t starting from a larger value t = t0

> 0.

3.2.3 Inexact full-step generalized Newton method (FGN): Local convergence

We investigate the convergence of the FGN and maximize the radius of its quadratic con-vergence region Qt(�). The following theorem shows a quadratic convergence of the inexactgeneralized Newton scheme, whose proof is deferred to Appendix 7.4.

Theorem 2 Given a fixed parameter t > 0, let

z

k

be a sequence generated by the following

inexact full-step generalized Newton scheme (FGN):

z

k+1 ⇡ ¯

z

k+1

:= Pz

k

z

k �r2F (z

k)

�1rF (z

k); t�

. (FGN)

where the approximation ⇡ is in the sense of Definition 5. Then, we have three statements:

(a) Let 0 < � < 1

2

(3 �p5) be a given radius, and ⌦t(�) be defined by (29). If we choose

z

0 2 ⌦t(�) and the tolerance �k in Definition 5 such that

kzk+1 � ¯

z

k+1kz

k �k ¯�k(�) :=�(1� 3� + �2

)(1� �)4

2�3 � 5�2

+ 3� + 1

,

then

z

k

generated by FGN belongs to ⌦t(�).

(b) If we choose �k �t(zk)

2

1��t(zk)

, then, for k � 0 and �t(z0

) < 1, we have

�t(zk+1

) ✓

2� 4�t(zk) + �t(z

k)

(1� 2�t(zk))3

�t(zk)

2 < 1. (30)

For any � 2 (0, 0.18858], if we choose z

0 2 Qt(�), where Qt(�) is the quadratic con-

vergence region of FGN, then

z

k

⇢ Qt(�), and {�t(zk)} quadratically converges to

zero.

(c) Let c :=

2�4�+�2

(1�2�)3 2 (0, 1), and " > 0 be a given tolerance for � 2 (0, 0.18858]. If we

choose t := (1� ✏)(p⌫+ ✏+2✏2/(1� ✏))�1" for a sufficiently small ✏ 2 (0,�), and update

�k :=

2�2

k1��k

with �k := c2k�1�2

k

, then after at most k := O(ln(ln(1/✏))) iterations, z

kis

an "-solution of (1) in the sense of Definition 4.

By a numerical experiment, we can show that ¯�k defined in Theorem 2(a) is maxi-mized at �⇤ = 0.0997 2 (0, 0.18858] with ¯�⇤k = 0.0372. Therefore, if we choose these val-ues, we can maximize the tolerance �k. Note that ¯�k is decreasing when � is increasing in�

0.0997, 1

2

(3�p5)

and vice versa. Hence, we can trade-off between the radius � of ⌦t(�)

and the tolerance �k of the subproblem in (FGN).

Page 10: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

10 Q. Tran-Dinh et al.

3.2.4 Inexact damped-step GN method (DGN): Local convergence

We now consider a damped-step generalized Newton scheme. The following theorem sum-marizes the result whose proof is moved to Appendix 7.5.Theorem 3 Given a fixed parameter t > 0, let

z

k

be the sequence generated by the

following inexact damped-step generalized Newton scheme (DGN):8

>

>

<

>

>

:

˜

z

k+1 ⇡ ¯

z

k+1

:= Pz

k

z

k �r2F (z

k)

�1rF (z

k); t�

,

↵k :=

1

(1+

˜�t(zk))

with

˜�t(zk) := k˜zk+1 � z

kkz

k ,

z

k+1

:= (1� ↵k)zk+ ↵k˜z

k+1.

(DGN)

Then, we have the following three statements:

(a) If we choose �k such that �k ˜�t(zk)

2

1+

˜�t(zk)

, then

˜�t(zk+1

)

0

@

2

˜�t(zk)

2

+ 4

˜�t(zk) + 3

1� ˜�t(zk)2⇣

2

˜�t(zk)2 + 4

˜�t(zk) + 3

1

A

˜�t(zk)

2. (31)

For any � 2 (0, 0.21027], the sequence

z

k

generated by DGN starting from any z

0 2⌦t(�) belongs to ⌦t(�), i.e.,

z

k

⇢ ⌦t(�).

(b) If we choose � 2 (0, 0.21027], then the sequence

˜�(zk)

generated by DGN starting from

any z

0 2 Qt(�) also converges quadratically to zero.

(c) For � 2 (0, 0.21027], let c := 2�2

+4�+3

1��2

(2�2

+4�+3)

2 (0, 1), and " > 0 be a given tolerance. If

we choose t := (1� 2✏2)�p⌫(1 + ✏) + ✏+ 3✏2

��1

" for a sufficiently small ✏ 2 (0,�), and

update �k := (1 + �k)�1�2

k with �k := c2k�1�2

k

, then after at most k := O (ln (ln(1/✏)))

iterations, z

kis an "-solution of (1) in the sense of Definition 4.

Note that the quadratic convergence stated in Theorem 3 is given through�

˜�t(zk)

,which is computable as opposed to {�t(zk)} in Theorem 2. Due to the fact that �t(zk) ˜�t(z

k)+�(zk) ˜�t(z

k)+

˜�t(zk)

2

1+

˜�t(zk)

! 0

+ as k ! 1, we conclude that {�t(zk)} also convergesto zero at a quadratic rate in the DGN scheme.3.2.5 Inexact path-following GN method (PFGN): The worst-case iteration-complexity

We consider the following inexact path-following generalized Newton scheme (PFGN) forsolving (1) directly by simultaneously updating both z and t at each iteration:

(

tk+1

:= (1� ��)tk

z

k+1 ⇡ ¯

z

k+1

:= Pz

k

z

k �r2F (z

k)

�1rF (z

k); tk+1

,(PFGN)

where �� 2 (0, 1) is a given factor. As before, the approximation z

k+1 ⇡ ¯

z

k+1 is in the senseof Definition 5 with a tolerance �k � 0.

We emphasize that our PFGN scheme updates t by decreasing it at each iteration, whilethe standard path-following scheme in [32, 4.2.23] increases the penalty parameter at eachiteration. When A(z) = c is constant, we can define s := 1

t to obtain the scheme [32, 4.2.23],and it allows us to start from s = 0. This is not the case in our scheme when A(z) 6= c.

Given � 2 (0, 1

2

(3�p5)), we first find �� 2 (0, 1) such that if zk 2 ⌦tk(�), then the new

point z

k+1 at a new parameter tk+1

still satisfies z

k+1 2 ⌦tk+1

(�). The following lemmaproves this key property, whose proof is deferred to Appendix 7.6.

Lemma 4 Let

(z

k, tk)

be the sequence generated by the inexact path-following generalized

Newton scheme (PFGN). Then, for z

kwith �tk(z

k) < 1, we have

�tk+1

(z

k) �tk(z

k) +

��1� ��

h

krF (z

k)k⇤

z

k + �tk(zk)

i

�tk(zk) +

��1� ��

hp⌫ + �tk(z

k)

i

. (32)

Page 11: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

Self-concordant inclusions: A unified framework for generalized interior-point methods 11

Let us fix c 2 (0, 1]. Then, for any 0 < � < 0.5(1+ 2c2 �p1 + 4c2), if the factor �� and the

tolerance �k are respectively chosen such that

0 < �� �� :=

cp���(1+c

p�)

(1+cp�)

p⌫+c

p�, and

0 �k ¯�t(�) :=(1�c2)�

(1+cp�)3

[

3cp�+c2�+(1+c

p�)3

]

,(33)

then �tk(zk) � implies �tk+1

(z

k+1

) �. In addition, �tk+1

(z

k) c

p�

1+cp�.

As an example, if we choose c := 0.95, then the possible interval for � is (0, 0.32895).Now, if we choose � :=

1

9c2 2 (0, 0.32895) (i.e., � ⇡ 0.12311), then �� =

5

36

p⌫+9

, which is thesame as in the standard path-following method in [32]. In this case, the tolerance �k for thesubproblem at the second line of PFGN must be chosen such that 0 �k 7.45933⇥ 10

�4.Figure 1 plots the values of ¯�t(�) and �� in (33) as a function of �, respectively for givenc = 0.95 and ⌫ = 1000. This figure shows that ¯�t is an increasing function of �, while �� has

0 < β < 0.5(1 + 2c2 −√

1 + 4c2)

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35

δ t(β

)

×10-3

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

0 < β < 0.5(1 + 2c2 −√

1 + 4c2)

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35

σβ

×10-4

0

0.2

0.4

0.6

0.8

1

1.2

1.4

Fig. 1: The graph of the two functions ¯�t and �� with respect to �.

the maximum point at � = 0.0870. Hence, a good choice of � is � = 0.0870.The following theorem investigates the worst-case iteration-complexity of PFGN using

the update rule (33) for �� . The proof of this theorem can be found in Appendix 7.7.

Theorem 4 Let

(z

k, tk)

be generated by PFGN under the following configuration:

(i) c 2 (0, 1] is given, and � is chosen such that 0 < � < 0.5(1 + 2c2 �p1 + 4c2); and

(ii) the initial points z

0

and t0

> 0 are chosen such that z

0 2 int (Z) and �t0

(z

0

) �.

Then, the following conclusions hold:

(a) �tk(zk) � for all k � 0 (i.e., z

k 2 ⌦tk(�) for k � 0).

(b) The number of iterations k to achieve an "-solution z

kof (1) in the sense of Definition

4 does not exceed

kmax

:=

�✓

(1 + cp�)

p⌫ + c

p�

cp� � �(1 + c

p�)

ln

M0

t0

"

◆⌫

+ 1,

where M0

:=

1� cp�

1+cp�� ¯�t(�)

⌘�1

⇣p⌫ + c

p�

1+cp�+ 2

¯�t(�)⌘

= O(

p⌫).

(c) Consequently, the worst-case iteration-complexity of PFGN is O⇣p

⌫ ln⇣p

⌫t0

"

⌘⌘

.

Theorem 4 requires a starting point z0 2 ⌦t0

(�) at a given penalty parameter t0

> 0. Inorder to find z

0, we perform an initial phase (called Phase 1) as described below.

Page 12: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

12 Q. Tran-Dinh et al.

3.2.6 Finding an initial point with the path-following iterations using auxiliary problem

When A(·) = @H(·) the subgradient of a proper, closed, and convex function H, we canfind z

0 2 ⌦t0

(�) for PFGN by applying the [inexact] damped-step generalized Newtonscheme (FGN) to solve (2) for a fixed penalty parameter t = t

0

. This scheme has a sublinearconvergence rate [52]. However, it is still unclear how to generalize this method to (1).

We instead propose a new auxiliary problem for (2), and then apply PFGN to solve thisauxiliary problem in order to obtain z

0. Then, we estimate the maximum number of thepath-following iterations needed in this phase.

Let us fix some ˆ

z

0 2 int (Z). We first compute a vector ˆ⇠0 2 A(

ˆ

z

0

) and evaluate rF (

ˆ

z

0

).Then, we define ˆ⇣0 := t

0

rF (

ˆ

z

0

) +

ˆ⇠0, and consider the following auxiliary problem of (2):

Find ˆ

z

?⌧ 2 int (Z) such that: 0 2 t

0

rF (

ˆ

z

?⌧ )� ⌧ ˆ⇣0 +A(

ˆ

z

?⌧ ), (34)

where ⌧ 2 [0, 1] is a new homotopy parameter. Clearly, (34) has a similar form as (2).(a) When ⌧ = 1, we have 0 2 t

0

rF (

ˆ

z

0

) � ˆ⇣0 + A(

ˆ

z

0

) due to the choice of ˆ⇣0. Hence, ˆz0 isan exact solution of (34) at ⌧ = 1.

(b) When ⌧ = 0, we have 0 2 t0

rF (

ˆ

z

?⌧ ) + A(

ˆ

z

?⌧ ). Hence, any solution ˆ

z

?⌧ of (34) is also a

solution of (2) at t = t0

.Now, we can apply PFGN to solve (34) starting from ⌧

0

= 1 but using a different updaterule for ⌧ . More precisely, this scheme can be written as follows:

8

<

:

⌧j+1

:= ⌧j ��j ,

ˆ

z

j+1 ⇡ ¯

ˆ

z

j+1

:= Pˆz

j

ˆ

z

j �r2F (

ˆ

z

j)

�1

rF (

ˆ

z

j)� ⌧j+1

t�1

0

ˆ⇣0⌘

; t0

,(35)

where �j > 0 is a given decrement. Here, the approximation ˆ

z

j+1 ⇡ ¯

ˆ

z

j+1 is in the sense ofDefinition 5 with a given tolerance ˆ�j � 0. We also use the index j to distinguish with theindex k in PFGN, and using the notation “hat” for the iterate vectors.

Similar to (21), we define the following generalized Newton decrement for (34):ˆ�⌧ (ˆz) :=

ˆ

z� Pˆz

ˆ

z�r2F (

ˆ

z)

�1

rF (

ˆ

z)� ⌧ t�1

0

ˆ⇣0�

; t0

ˆz

. (36)

The theorem below provides the number of iterations j needed to find an initial pointz

0 2 ⌦t0

(�) for PFGN using (35), whose proof can be found in Appendix 7.8.

Theorem 5 Let c 2 (0, 1] and � be chosen as in Theorem 4, and ⌘ be chosen such that

0 < ⌘ < �. Let

(

ˆ

z

j , ⌧j)

be generated by (35). If �j and

ˆ�j are chosen such that

0 �j µ⌘

kˆ⇣0

k⇤ˆzj

with µ⌘ :=

t0

kˆ⇣0

k⇤ˆzj

cp⌘

1+cp⌘ � ⌘

, and

0 ˆ�j ¯�⌧ (⌘) :=(1�c2)⌘

(1+cp⌘)3

[

3cp⌘+c2⌘+(1+c

p⌘)3

]

,(37)

then

ˆ�⌧j (ˆzj) defined in (36) satisfies

ˆ�⌧j (ˆzj) ⌘ for all j � 0.

Let z

0

:=

ˆ

z

jmax

be obtained after jmax

iterations. Then,

ˆ�⌧0

(

ˆ

z

0

) = 0 and we have

�t0

(z

0

) ˆ�⌧j (ˆzj) + t�1

0

⌧jkˆ⇣0k⇤ˆz

j ⌘ +kˆ⇣0k⇤

¯z

?F

t0

� j

cp⌘

1 + cp⌘� ⌘

, 8j � jmax

, (38)

where �t(z) is defined by (21), and

¯

z

?F and are defined by (15) and below (15), respectively.

The number of iterations j to achieve z

0

:=

ˆ

z

jsuch that �t

0

(z

0

) � does not exceed

jmax

:=

$

(1 + cp⌘)kˆ⇣0k⇤

¯z

?F

t0

cp⌘ � ⌘(1 + c

p⌘)� � (� � ⌘)(1 + c

p⌘)

cp⌘ � ⌘(1 + c

p⌘

%

+ 1.

The worst-case iteration-complexity of (35) to obtain z

0

such that �t0

(z

0

) � is O✓

kˆ⇣0k⇤¯z?F

t0

.

Page 13: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

Self-concordant inclusions: A unified framework for generalized interior-point methods 13

Theorem 5 suggests us to choose t0

:= . In this case, the maximum number of iterations

in Phase 1 does not exceed(1+c

p⌘)kˆ⇣0k⇤

¯z?F

cp⌘�⌘(1+c

p⌘) , which is a constant.

Remark 1 From (38), we can compute kˆ⇣0k⇤ˆz

j directly in order to terminate (35) by checking⌧jkˆ⇣0k⇤

ˆz

j t0

(� � ⌘). Hence, (35) does not require to evaluate the analytical center ¯

z

?F of

F . If F is a self-concordant logarithmically homogeneous barrier, then we can simply chooset0

:= 1. Otherwise, we can choose t0

:= ⌫ + 2

p⌫.

3.2.7 Two-phase inexact path-following generalized Newton algorithm

Putting two schemes (35) and PFGN together, we obtain a two-phase inexact path-followinggeneralized Newton algorithm for solving (1) as described in Algorithm 1.

Algorithm 1 (Two-phase inexact path-following generalized Newton algorithm)1: Initialization:2: Choose an arbitrary initial point ˆ

z

0 2 int (Z) and a desired accuracy " > 0

3: Fix t0

> 0 and c 2 (0, 1] (e.g., t0

:= , and c := 0.95).4: Compute ˆ⇠0 2 A(

ˆ

z

0

) and evaluate rF (

ˆ

z

0

). Set ˆ⇣0 := t0

rF (

ˆ

z

0

) +

ˆ⇠0 and ⌧0

:= 1.5: Fix � as in Theorem 4 (e.g., � :=

1

9c2 ) and choose ⌘ < � (e.g., ⌘ := 0.5�).6: Compute ¯�⌧ , µ⌘, ¯�t, and �� by (37) and (33), respectively, and M

0

from Theorem 4.

7: Phase 1: Computing an initial point by path-following iterations

8: For j = 0, · · · , jmax

, perform:9: If ⌧jkˆ⇣0k⇤

ˆz

j t0

(� � ⌘), then set z

0

:=

ˆ

z

j and TERMINATE.10: Update (

ˆ

z

j+1, ⌧j+1

) by (35) with �j :=µ⌘

kˆ⇣0

k⇤ˆzj

up to an accuracy ˆ�j ¯�⌧ .

11: End for

12: Phase 2: Inexact path-following generalized Newton iterations

13: For k = 0, · · · , kmax

, perform:14: If M

0

tk ", then return z

k as an "-solution of (1), and TERMINATE.15: Update (z

k+1, tk+1

) by (PFGN) up to an accuracy �k ¯�t.16: End for

The main computational cost of Algorithm 1 is the solution of the two linear monotoneinclusions in (35) and PFGN, respectively. When A = @H the subdifferential of a convexfunction H, various methods including fast gradient, primal-dual methods, and splittingtechniques can be used to solve these problems [3,4,7,15,18,35].

The overall worst-case iteration-complexity of Algorithm 1 is given in the following the-orem which is a direct consequence of Lemma 3, Theorem 4, and Theorem 5.

Theorem 6 Let us choose t0

:= as defined below (15). Then, the overall worst-case

iteration-complexity of Algorithm 1 to achieve an "-solution z

kof (1) as in Definition 4 is

O

kˆ⇣0k⇤¯z

?F

t0

+

p⌫ ln

M0

t0

"

!

or simpler O✓p

⌫ ln

p⌫

"

◆◆◆

,

where t0

> 0 is an initial penalty parameter and M0

= O(

p⌫) is defined in Theorem 4.

Proof The total number of iterations requires in Phase 1 and Phase 2 of Algorithm 1 is

Kmax

�(1 + c

p⌘)kˆ⇣0k⇤

¯z

?F

t0

cp⌘ � ⌘(1 + c

p⌘)�

+ C2

ln

M0

t0

"

� C1

.

Page 14: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

14 Q. Tran-Dinh et al.

where C1

:=

(��⌘)(1+cp⌘)

cp⌘�⌘(1+c

p⌘ , and C

2

:=

(1+cp�)

p⌫+c

p�

cp���(1+c

p�)

. Note that C1

is a constant, while

C2

= O(

p⌫). Hence, K

max

� C3

kˆ⇣0k⇤¯z?F

t0

+O(

p⌫) ln

M0

t0

"

�C1

, where C3

:=

1+cp⌘

cp⌘�⌘(1+c

p⌘) .

We can write this as Kmax

� O✓

kˆ⇣0k⇤¯z?F

t0

+

p⌫ ln

M0

t0

"

. We finally note that M0

=

O(

p⌫) and t

0

= , and the first term is a constant and independent of ", which is dominatedby the second term. Hence, we obtain the second estimate of Theorem 6. ⇤

The complexity bound in Theorem 6 also depends on the choice of �, ⌘ and t0

. Adjustingthese parameters allows us to trade-off between Phase 1 and Phase 2 in Algorithm 1. Clearly,if t

0

is large, the number of iterations required in Phase 1 is small, but the number ofiterations in Phase 2 is large, and vice versa.

Remark 2 Note that we can recover the convergence guarantee of the exact generalizedNewton-type schemes as consequences of Theorems 2, 3 and 6, respectively. For instance, inthe exact variant of (PFGN), if we can choose � 2 (0, 0.5(3 �

p5)), then the upper bound

�� in (33) reduces to �� :=

p���(1+

p�)

(1+

p�)

p⌫+

p�. Hence, we can show that the worst-case iteration-

complexity estimate of this exact scheme coincides with the standard path-following schemefor smooth structural convex programming given in [32, Theorem 4.2.9].

4 Inexact path-following proximal Newton algorithmsWe now specify our framework, Algorithm 1, to solve three problems: (3), (4), and (8).

4.1 Inexact primal-dual path-following algorithm for saddle-point problemsWe recall the convex-concave saddle-point problem (8). Our primal-dual path-following prox-imal Newton method relies on the following assumption.

Assumption A. 2 (a) The feasible set X (respectively, Y) is a nonempty, closed, and con-

vex cone with nonempty interior, and is endowed with a ⌫f -self-concordant barrier f

(respectively, a ⌫'-self-concordant barrier ') such that Dom(f) = X (respectively, Dom(') =

Y).

(b) Both g and in (8) are proper, closed, and convex such that int (X ) \ dom(g) 6= ; and

int (Y) \ dom( ) 6= ;.(c) The solution set Z?

of (8) is nonempty.

For any z = (x,y), ˆz = (

ˆ

x, ˆy), (⇠g, ⇠ ) 2 @g(x)⇥ @ (y), and (

ˆ⇠g, ˆ⇠ ) 2 @g(ˆx)⇥ @ (ˆy):

⇠g � L⇤y � ˆ⇠g + L⇤

ˆ

y

⇠ + Lx� ˆ⇠ � Lˆx

!>✓

x� ˆ

x

y � ˆ

y

� 0.

This shows that A defined by (10) is maximally monotone. In addition, F is a self-concordantbarrier of Z = X ⇥ Y with the barrier parameter ⌫ := ⌫f + ⌫'.

4.1.1 Inexact primal-dual path-following proximal Newton method

We specify PFGN to solve (8). Let z

k:= (x

k,yk) 2 int (Z) be a given point at tk > 0. We

update z

k+1

:= (x

k+1,yk+1

) and tk+1

using PFGN, which reduces to the following form:(

0 2 tk+1

rf(xk) +r2f(xk

)(x� x

k)

� L⇤y + @g(x),

0 2 tk+1

r'(yk) +r2'(yk

)(y � y

k)

+ Lx+ @ (y).(39)

Here, we approximately solve (39) as done in PFGN. Hence, PFGN can be rewritten as8

<

:

tk+1

:= (1� ��)tk,

z

k+1 ⇡ ¯

z

k+1

:=argmin

y

max

x

n

tk+1

Q'(y;yk)+ (y)+hLx,yi�tk+1

Qf (x;xk)�g(x)

o

,(40)

Page 15: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

Self-concordant inclusions: A unified framework for generalized interior-point methods 15

where Qf (·;xk) and Q'(·;yk

) are quadratic surrogates of f and ', respectively, i.e.:(

Qf (x;xk) := hrf(xk

),x� x

ki+ 1

2

hr2f(xk)(x� x

k),x� x

ki,Q'(y;y

k) := hr'(yk

),y � y

ki+ 1

2

hr2'(yk)(y � y

k),y � y

ki.(41)

The second line of (40) is again a linear convex-concave saddle-point problem with stronglyconvex objectives. Methods for solving this problem can be found, e.g., in [2,9,15].

4.1.2 Finding initial point

We specify (35) for finding an initial point as follows.

1. Provide a value t0

> 0 (e.g., t0

:= ), and an initial point ˆ

z

0

:= (

ˆ

x

0, ˆy0

) 2 int (Z).2. Compute a subgradient ˆ⇠0g 2 @g(ˆx0

) and ˆ⇠0 2 @ (ˆy0

), and evaluate rf(ˆx0

) and r'(ˆy0

).3. Define ˆ⇣0g := t

0

rf(ˆx0

)� L⇤ˆ

y

0

+

ˆ⇠0g and ˆ⇣0 := t0

r'(ˆy0

) + Lˆx0

+

ˆ⇠0 .4. Perform Phase 1 of Algorithm 1 applied to solve (8) as follows:

8

>

>

>

>

<

>

>

>

>

:

⌧j+1

:= ⌧j � ¯�j ,

ˆ

z

j+1 ⇡ ¯

ˆ

z

j+1

:= argmin

y

max

x

n

t0

Q'(y; ˆyj)� ⌧j+1

hˆ⇣0 ,yi+ (y) + hLx,yi

� t0

Qf (x; ˆxj) + ⌧j+1

hˆ⇣0g ,xi � g(x)o

.

(42)

Here, ⌧ 2 (0, 1] is referred to as a new homotopy parameter starting from ⌧0

:= 1.

Now, we substitute this scheme into Phase 1, and (40) into Step 15 of Algorithm 1, respec-tively, to obtain a new variant to solve (8), which we call Algorithm 1(a).

The worst-case iteration-complexity of Algorithm 1(a) to achieve an "-primal-dual solu-tion z

k:= (x

k,yk) in the sense of Definition 4 for the optimality condition (9) instead of

(1) is guaranteed by Theorem 6. We omit the detailed proof in this paper.

4.2 Inexact path-following primal proximal Newton algorithmWe present an inexact primal path-following proximal Newton method obtained from Algo-rithm 1 to solve (3). This algorithm has several new features compared to [52,53].

First, associated with the barrier function f of X in (3), we define a local norm kukx

:=

hr2f(x)u,ui1/2 and its corresponding dual norm kvk⇤x

:= hr2f(x)�1

v,vi1/2 for a givenx 2 dom(f). Next, let Qf be the quadratic surrogate of f around x

k as defined in (41).Then, the main step of Algorithm 1 applied to (3) performs the following inexact path-following proximal Newton scheme:

(

tk+1

:= (1� ��)tk,

x

k+1 ⇡ ¯

x

k+1

:= argmin

x

hk(x) := Qf (x;xk) + t�1

k+1

g(x)

,(43)

Here, the approximation ⇡ is in the sense of Definition 5, and implies kxk+1� ¯

x

k+1kx

k �kfor a given tolerance �k � 0. As shown in [52], this condition is satisfied if

hk(xk+1

)� hk(¯xk+1

) 0.5�2k,

where hk(·) is the objective function of (43). This condition is different from Definition 5,where we can check it by evaluating the objective values.

We redefine the following proximal Newton decrement using (14) as follows:

�t(x) :=�

x� prox

(tr2f(x))�1g

x�r2f(x)�1rf(x)�

x

. (44)

Although the scheme (43) has been studied in [52,53], the following features are new.

Page 16: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

16 Q. Tran-Dinh et al.

1. Phase 1: Finding initial point x

0: We solve the following auxiliary problem by applying(35) to obtain an initial point x

0 2 int (X ) such that �t0

(x

0

) �:

min

x

n

f(x)� ⌧hrf(ˆx0

) + t�1

0

ˆ⇠0,xi+ t�1

0

g(x)o

, (45)

where ˆ

x

0 2 int (X ) is an arbitrary initial point, and ˆ⇠0 2 @g(ˆx0

). The inexact proximalpath-following scheme for solving (45) rendering from (35) becomes8

<

:

⌧j+1

:= ⌧j � ¯�j ,

ˆ

x

j+1 ⇡ ¯

ˆ

x

j+1

:= argmin

x

n

Qf (x; ˆxj)� ⌧j+1

hrf(ˆx0

) + t�1

0

ˆ⇠0,xi+ t�1

0

g(x)o

.(46)

2. New neighborhood of the central path: We choose � 2 (0, 0.329), which is approximatelytwice larger than � 2 (0, 0.15] as given in [52].

3. Adaptive rule for t: We can update tk in (43) adaptively using the value krf(xk)k⇤

x

k as

tk+1

:= (1� �k)tk, where �k :=

cp� � �(1 + c

p�)

(1 + cp�)krf(xk

)k⇤x

k + cp�

� �� .

4. Implementable stopping condition: We can terminate Phase 1 using ⌧jkrf(ˆx0

)+t�1

0

ˆ⇠0k⇤ˆx

j (� � ⌘), which is implementable without incurring significantly computational cost.

Let us denote this algorithmic variant by Algorithm 1(b). Instead of terminating this algo-rithmic variant with tk "

M0

, we use �(�, ⌫)tk " to terminate Algorithm 1(b), where�(�, ⌫) is a function defined as in [52, Lemma 5.1.]. The following corollary provides theworst-case iteration-complexity of Algorithm 1(b) as a direct consequence of Theorem 6.

Corollary 1 Let us choose t0

:= defined below the formula (15). Then, the worst-case

iteration-complexity of Algorithm 1(b) to achieve an "-solution x

kof (3) such that x

k 2 Xand g(xk

)� g? " is

O

0

@

krf(ˆx0

) + t�1

0

ˆ⇠0k⇤¯x

?f

t0

+

p⌫ ln

�(�, ⌫)t0

"

1

A

or simpler O⇣p

⌫ ln⇣⌫

"

⌘⌘⌘

.

Note that the worst-case iteration-complexity bound in Corollary 1 is the overall iteration-complexity. It is similar to the one given in [53] but the method is different.

4.3 Inexact dual path-following proximal Newton algorithm

We develop an inexact dual path-following scheme to solve (4), which works in the dual space. For simplicity of presentation, we assume that $W = \mathbb{I}$. Otherwise, we can use $g(\cdot) := g(W(\cdot))$. We first write the barrier formulation of the dual problem (5) as follows:

  \min_{y\in\mathbb{R}^p} \{ t f^*(c - L^* y) + g^*(y) + \langle b, y\rangle \},

where $t > 0$ is a penalty parameter. This problem can be shortly rewritten as

  \Phi^\star_t := \min_{y\in\mathbb{R}^p} \{ \Phi_t(y) := \varphi(y) + t^{-1}\psi(y) \},    (47)

where $\varphi$ and $\psi$ are two convex functions defined by

  \varphi(y) := f^*(c - L^* y), \quad and \quad \psi(y) := g^*(y) + \langle b, y\rangle.    (48)

In order to characterize the relation between the primal problem (4) and its dual form (5), we formally impose the following assumption.


Assumption A.3 The objective function g in (4) is proper, closed, and convex. The linear operator $L : \mathbb{R}^n \to \mathbb{R}^p$ is full-row rank with $p \le n$. The following Slater condition holds:

  (\mathrm{int}(K) \times \mathrm{ri}(\mathrm{dom}(g))) \cap \{(x, s) \mid Lx - s = b\} \neq \emptyset.

In addition, K is a nonempty, closed, and pointed convex cone such that $\mathrm{int}(K) \neq \emptyset$, and K is endowed with a $\nu$-self-concordant logarithmically homogeneous barrier f with $\mathrm{Dom}(f) = K$. The solution set $S^\star$ of (4) is nonempty.

The following lemma shows that $\varphi(\cdot) := f^*(c - L^*(\cdot))$ remains a $\nu$-self-concordant barrier associated with the dual feasible set, while the scaled proximal operator of $\psi$ can be computed from that of g. The proof of this lemma is classical and is omitted; see [2,36].

Lemma 5 Under Assumption A.3, $\varphi(\cdot)$ defined by (48) is a $\nu$-self-concordant barrier of the dual feasible set $D_Y := \{y \in \mathbb{R}^p \mid L^* y - c \in K^*\}$. The proximal operator of $\psi$ defined in (47) is computed as $\mathrm{prox}_{Q\psi}(y) = y - Qb - Q\,\mathrm{prox}_{Q^{-1}g}\big(Q^{-1}y - b\big)$ for any $Q \in S^p_{++}$.

Together with the primal local norm $\|\cdot\|_x$ given in Subsection 4.2, we also define a local norm with respect to $\varphi(\cdot)$ as $\|u\|_y := \langle\nabla^2\varphi(y)u, u\rangle^{1/2}$ and its dual norm $\|v\|^*_y := \langle\nabla^2\varphi(y)^{-1}v, v\rangle^{1/2}$ for a given $y \in \mathrm{dom}(\varphi)$. Under Assumption A.3, any primal-dual solution $(x^\star, s^\star) \in S^\star$ and $y^\star \in \mathbb{R}^p$ of (4) is also a KKT point of (4) and vice versa, i.e.:

  0 \in L^* y^\star - c + N_K(x^\star), \quad y^\star \in \partial g(s^\star), \quad and \quad Lx^\star - s^\star = b.    (49)

In practice, we cannot solve (4) and (5) (or, equivalently, (49)) exactly to obtain an optimal solution $(x^\star, s^\star) \in S^\star$ and $y^\star \in \mathbb{R}^p$ as indicated by the KKT condition (49). We can only find an $\varepsilon$-approximate solution $(x^\star_\varepsilon, s^\star_\varepsilon)$ and $y^\star_\varepsilon$ as defined in Definition 6.

Definition 6 Given a tolerance $\varepsilon > 0$, we say that $(x^\star_\varepsilon, s^\star_\varepsilon)$ is an $\varepsilon$-solution for (4) associated with an $\varepsilon$-dual solution $y^\star_\varepsilon \in \mathbb{R}^p$ of (5) if $x^\star_\varepsilon \in \mathrm{int}(K)$ and

  y^\star_\varepsilon \in \partial g(s^\star_\varepsilon), \quad \|L^* y^\star_\varepsilon - c\|^*_{x^\star_\varepsilon} \le \varepsilon, \quad \|L x^\star_\varepsilon - s^\star_\varepsilon - b\|^*_{y_\varepsilon} \le \varepsilon.

Note that our path-following method always generates $x^\star_\varepsilon \in \mathrm{int}(K)$, which implies $x^\star_\varepsilon \in K$.

Next, we specify Algorithm 1 to solve the dual problem (5) and provide a recovery strategy to obtain an $\varepsilon$-solution of (4).

4.3.1 The inexact path-following proximal Newton scheme for the dual problem

By Lemma 5, the function $\varphi$ defined by (48) is also self-concordant, and its gradient and Hessian-vector product are given explicitly as

  \nabla\varphi(y) = -L\nabla f^*(c - L^* y) \quad and \quad \nabla^2\varphi(y)d = L\nabla^2 f^*(c - L^* y)L^* d.    (50)

Let us denote by $Q_\varphi$ the quadratic surrogate of $\varphi$ at $y^k$ defined by (41). Under Assumption A.3, $\nabla^2\varphi$ is positive definite, and hence $Q_\varphi(\cdot; y^k)$ is strongly convex. The main step of our inexact dual path-following proximal Newton method can be presented as follows:

  t_{k+1} := (1 - \Delta_\beta)\,t_k,
  y^{k+1} \approx \bar{y}^{k+1} := \mathrm{argmin}_{y\in\mathbb{R}^p} \{ \Phi(y; y^k) := Q_\varphi(y; y^k) + t_{k+1}^{-1}\psi(y) \},    (51)

where the approximation $\approx$ is defined as in Definition 5 with a given tolerance $\delta_k$, and $\Delta_\beta \in (0, 1)$ is a given factor.

The second line of (51) is a composite convex quadratic minimization problem of the same form as (43). To analyze (51), we define

  \lambda_t(y) := \|y - P_y(y - \nabla^2\varphi(y)^{-1}\nabla\varphi(y); t)\|_y,    (52)

where $P_y(\cdot; t) = \mathrm{prox}_{t^{-1}\nabla^2\varphi(y)^{-1}\psi}(\cdot)$ is defined by (16).
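Both quantities in (50) reduce to matrix products once $\nabla f^*$ and $\nabla^2 f^*$ are available; the following sketch, with hypothetical callbacks grad_f_conj and hess_f_conj, spells out the chain rule.

```python
import numpy as np

def grad_phi(y, L, c, grad_f_conj):
    """Gradient of phi(y) = f*(c - L^T y), first identity in (50)."""
    return -L @ grad_f_conj(c - L.T @ y)

def hess_phi_vec(y, d, L, c, hess_f_conj):
    """Hessian-vector product of phi at y applied to d, second identity in (50)."""
    return L @ (hess_f_conj(c - L.T @ y) @ (L.T @ d))
```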


4.3.2 Finding a starting point via an auxiliary problem

Let us fix $t_0 > 0$ (e.g., $t_0 := \kappa$, the constant below (15)), and choose $\beta \in (0, 1)$ such that $\Omega_{t_0}(\beta)$ defined in (29) is a central path neighborhood of (52). The aim is to find a starting point $y^0 \in \Omega_{t_0}(\beta)$. We again apply (51) to solve an auxiliary problem of (47) for finding $y^0 \in \Omega_{t_0}(\beta)$.

Given an arbitrary $\hat{y}^0 \in \mathrm{int}(D_Y)$, let $\hat{\xi}^0 \in \partial\psi(\hat{y}^0)$ be an arbitrary subgradient of $\psi$ at $\hat{y}^0$, and $\hat{\zeta}^0 := \nabla\varphi(\hat{y}^0) + t_0^{-1}\hat{\xi}^0$. We consider the following auxiliary convex problem:

  \min_{y\in\mathbb{R}^p} \{ \hat{\Phi}_\tau(y) := \varphi(y) - \tau\langle\hat{\zeta}^0, y\rangle + t_0^{-1}\psi(y) \},    (53)

where $\tau \in [0, 1]$ is a given continuation parameter.

As seen before, when $\tau = 0$, (53) becomes (47) at $t := t_0$, while with $\tau = 1$ we have $\nabla\varphi(\hat{y}^0) - \hat{\zeta}^0 = \nabla\varphi(\hat{y}^0) - \nabla\varphi(\hat{y}^0) - t_0^{-1}\hat{\xi}^0 = -t_0^{-1}\hat{\xi}^0 \in -t_0^{-1}\partial\psi(\hat{y}^0)$. Hence, $0 \in \nabla\varphi(\hat{y}^0) - \hat{\zeta}^0 + t_0^{-1}\partial\psi(\hat{y}^0)$, which implies that $\hat{y}^0$ is a solution of (53) at $\tau = 1$.

We customize (51) to solve (53) by tracking the path $\{\tau_j\}$ starting from $\tau_0 := 1$ such that $\{\tau_j\}$ converges to zero. We use the index j instead of k to distinguish Phase 1 from Phase 2. Given $\hat{y}^j \in \mathrm{int}(D_Y)$ and $\tau_j > 0$, similar to (51), we update

  \tau_{j+1} := \tau_j - \Delta_j,
  \hat{y}^{j+1} \approx \bar{\hat{y}}^{j+1} := \mathrm{argmin}_{y\in\mathbb{R}^p} \{ \hat{\Phi}(y; \hat{y}^j) := Q_\varphi(y; \hat{y}^j) - \tau_{j+1}\langle\hat{\zeta}^0, y\rangle + t_0^{-1}\psi(y) \},    (54)

where $\Delta_j > 0$ is given, and $\approx$ is in the sense of Definition 5 with a given tolerance $\delta_j$. Since (54) has the same form as (51) applied to (53), we define

  \hat{\lambda}_j := \big\|\hat{y}^j - \mathrm{prox}_{t_0^{-1}\nabla^2\varphi(\hat{y}^j)^{-1}\psi}\big(\hat{y}^j - \nabla^2\varphi(\hat{y}^j)^{-1}\nabla\hat{\varphi}(\hat{y}^j)\big)\big\|_{\hat{y}^j},    (55)

where $\hat{\varphi}(\cdot) := \varphi(\cdot) - \tau_{j+1}\langle\hat{\zeta}^0, \cdot\rangle$ is the smooth part of (54), as the dual proximal Newton decrement for (54).
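The Phase-1 continuation step amounts to one scalar update of $\tau$; a sketch is below, where $\mu_\eta := t_0(c\sqrt{\eta}/(1+c\sqrt{\eta}) - \eta)$ follows the bound (37) and zeta0_norm_dual plays the role of $\|\hat{\zeta}^0\|^*_{\hat{y}^j}$. All names are ours.

```python
import numpy as np

def phase1_tau_update(tau_j, t0, eta, c, zeta0_norm_dual):
    """Continuation update tau_{j+1} = tau_j - Delta_j for (54)."""
    mu_eta = t0 * (c * np.sqrt(eta) / (1.0 + c * np.sqrt(eta)) - eta)
    Delta_j = mu_eta / zeta0_norm_dual           # step suggested by (37)
    return tau_j - Delta_j
```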

4.3.3 Primal solution recovery and the worst-case complexity

Our next step is to recover an approximate primal solution $(x^\star_\varepsilon, s^\star_\varepsilon)$ of the primal problem (3) from the dual one $y^\star_\varepsilon$ of (5). The following theorem provides such a strategy; its proof can be found in Appendix 7.9. The notation $\pi_{\partial g^*(y^k)}(Lx^k - b)$ stands for the projection of $Lx^k - b$ onto $\partial g^*(y^k)$, which is a nonempty, closed, and convex set.

Theorem 7 Let $\{(y^k, t_k)\}$ be the sequence generated by (51) and (54) to approximate a solution of the dual problem (5). Then, $(x^k, s^k)$ computed by

  x^k := \nabla f^*\big(t_k^{-1}(c - L^* y^k)\big) \in \mathrm{int}(K) \quad and \quad s^k = \pi_{\partial g^*(y^k)}(Lx^k - b),    (56)

together with $y^k$, satisfy the following estimate:

  y^k \in \partial g(s^k), \quad \|L^* y^k - c\|^*_{x^k} \le \sqrt{\nu}\,t_k, \quad \|Lx^k - s^k - b\|^*_{y^k} \le \theta(c,\beta)\,t_k,    (57)

where

  \theta(c,\beta) := \frac{(1-c^2)\beta}{(1+c\sqrt{\beta})^2\big[3c\sqrt{\beta}+c^2\beta+(1+c\sqrt{\beta})^3\big] - (1-c^2)\beta} + \Big(\frac{(1-c^2)\beta + c\sqrt{\beta}(1+c\sqrt{\beta})^2\big[3c\sqrt{\beta}+c^2\beta+(1+c\sqrt{\beta})^3\big]}{(1+c\sqrt{\beta})^2\big[3c\sqrt{\beta}+c^2\beta+(1+c\sqrt{\beta})^3\big] - (1-c^2)\beta}\Big)^2 \le 1,    (58)

is a constant for fixed c and $\beta$ chosen as in Lemma 4. Consequently, if $\max\{\sqrt{\nu}, \theta(c,\beta)\}\,t_k = \sqrt{\nu}\,t_k \le \varepsilon$, then $(x^k, s^k)$ is an $\varepsilon$-solution to (3) in the sense of Definition 6 associated with the $\varepsilon$-dual solution $y^k$ of (5).
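The recovery (56) is a direct evaluation once $\nabla f^*$ and the projection onto $\partial g^*(y^k)$ are available in closed form; a hedged sketch with placeholder callbacks follows.

```python
import numpy as np

def recover_primal(y_k, t_k, L, b, c, grad_f_conj, proj_dg_conj):
    """Primal recovery from a dual iterate, cf. (56).

    proj_dg_conj(v, y) is assumed to project v onto the set dg*(y).
    """
    x_k = grad_f_conj((c - L.T @ y_k) / t_k)     # x^k in int(K)
    s_k = proj_dg_conj(L @ x_k - b, y_k)         # s^k = projection of L x^k - b
    return x_k, s_k
```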


4.3.4 Two-phase inexact dual path-following proximal Newton algorithm

Now, we specify Algorithm 1 to solve (4) using (54) and (51), as in Algorithm 2.

Algorithm 2 (Two-phase inexact dual path-following proximal Newton algorithm)
1: Initialization:
2: Choose $\hat{y}^0 \in \mathbb{R}^p$ such that $L^*\hat{y}^0 - c \in \mathrm{int}(K^*)$. Fix $t_0 := \kappa$, and an accuracy $\varepsilon > 0$.
3: Compute a vector $\hat{\xi}^0 \in \partial\psi(\hat{y}^0)$ and evaluate $\nabla\varphi(\hat{y}^0)$.
4: Set $\hat{\zeta}^0 := \nabla\varphi(\hat{y}^0) + t_0^{-1}\hat{\xi}^0$ and $\tau_0 := 1$.
5: Choose $\beta$, $\eta$, then compute $\bar{\delta}_\tau$, $\mu_\eta$, $\bar{\delta}_t$, $\Delta_\beta$ as in Algorithm 1. Compute $\theta(c,\beta)$ by (58).
6: Phase 1: Computing an initial point
7: For $j = 0, \dots, j_{\max}$, perform:
8:   If $\tau_j\|\hat{\zeta}^0\|^*_{\hat{y}^j} \le (\beta - \eta)$, then TERMINATE.
9:   Perform (54) with $\Delta_j := \mu_\eta / \|\hat{\zeta}^0\|^*_{\hat{y}^j}$ up to an accuracy $\hat{\delta}_j \le \bar{\delta}_\tau$.
10: End for
11: Phase 2: Inexact dual path-following proximal Newton iterations
12: For $k = 0, \dots, k_{\max}$, perform:
13:   If $\sqrt{\nu}\,t_k \le \varepsilon$, then TERMINATE.
14:   Perform (51) up to an accuracy $\delta_k \le \bar{\delta}_t$.
15: End for
16: Primal recovery: Recover $(x^k, s^k)$ from $y^k$ as in (56). Then, return $(x^k, s^k, y^k)$.
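A minimal driver skeleton of Algorithm 2 is sketched below; the two inner solvers and the Phase-1 test of Step 8 are passed in as callbacks, since their closed forms depend on the problem instance. All names are our own placeholders, not the paper's code.

```python
import numpy as np

def two_phase_dual_pf(y0_hat, t0, eps, nu, phase1_step, phase2_step, stop1,
                      max_j=500, max_k=500):
    """Skeleton of Algorithm 2 (Phases 1 and 2 only; recovery is separate)."""
    y, tau = y0_hat, 1.0
    for _ in range(max_j):                       # Phase 1: find a starting point
        if stop1(y, tau):                        # Step 8 test
            break
        y, tau = phase1_step(y, tau)             # one update (54)
    t = t0
    for _ in range(max_k):                       # Phase 2: path-following
        if np.sqrt(nu) * t <= eps:               # Step 13 test
            break
        y, t = phase2_step(y, t)                 # one update (51)
    return y, t
```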

The main per-iteration complexity of Algorithm 2 lies at Steps 9 and 14, where we need to solve two composite, strongly convex quadratic programs, (54) and (51), respectively. The primal solution recovery at Step 16 does not significantly increase the computational cost of Algorithm 2. The worst-case iteration-complexity of Algorithm 2 remains the same as in Theorem 6 with $M_0 := \sqrt{\nu}$, and we do not restate it here.

5 Preliminary numerical experiments

We present three numerical examples to illustrate the three algorithmic variants described in the previous sections, respectively. We compare our methods with three state-of-the-art interior-point solvers: SDPT3 [51], SeDuMi [49], and Mosek (a commercial software package). We also compare our methods with the SDPNAL+ solver (version 0.5) in [62], which implements a majorized semi-smooth Newton-CG augmented Lagrangian method. For the second example, we implement an alternating direction method of multipliers (ADMM) [7] to compare with our method. Our numerical experiments are carried out in a Matlab R2014b environment, running on a MacBook Pro laptop (Retina, 2.7GHz Intel Core i5, 16GB memory).

5.1 Example 1: Minimizing the maximum eigenvalue with constraints

We illustrate Algorithm 1(a) via the well-known maximum eigenvalue problem [34]:

  \lambda^\star_{\max} := \min_{y\in\mathcal{Y}} \{\lambda_{\max}(C + \mathcal{L}y)\},    (59)

where $\lambda_{\max}(U)$ is the maximum eigenvalue of a symmetric matrix $U \in S^n$, $C \in S^n$ is a given matrix, $\mathcal{L}$ is a linear operator from $\mathbb{R}^p$ to $S^n$, and $\mathcal{Y}$ is a nonempty, closed, and convex set in $\mathbb{R}^p$ endowed with a self-concordant barrier $\varphi$.

As a consequence of J. von Neumann's trace inequality, we can show that $\lambda_{\max}(U) = \max_x\{x^\top U x \mid \|x\|_2 = 1\} = \max\{\mathrm{trace}(UX) \mid \mathrm{trace}(X) = 1, X \in S^n_+\}$. Hence, if we define


$\mathcal{X} := S^n_+$ and $g(X) := \delta_{\{X \mid \mathrm{trace}(X)=1\}}(X) - \mathrm{trace}(CX)$, and using $\langle\mathcal{L}y, X\rangle = \mathrm{trace}((\mathcal{L}y)X)$, then we can rewrite (59) as (8), which is of the form:

  \tilde{\lambda}^\star_{\max} := \min_{y\in\mathcal{Y}} \Big\{ \max_{X\in S^n_+} \{\langle\mathcal{L}y, X\rangle - g(X)\} \Big\}.    (60)

The corresponding barrier for $S^n_+$ is $f(X) := -\log\det(X)$.

Now, we can apply Algorithm 1(a) to solve (60). The main computation of this algorithm is the solution of (40) and (42), which can be written explicitly as follows for (60):

  \min_{y\in\mathbb{R}^p} \Big\{ \max_{X\in S^n} \big\{\langle\mathcal{L}y, X\rangle - tQ_f(X; X_k) - g(X)\big\} + tQ_\varphi(y; y^k) \Big\},    (61)

where $Q_f(X; X_k) := \langle\nabla f(X_k), X - X_k\rangle + \frac{1}{2}\langle\nabla^2 f(X_k)(X - X_k), X - X_k\rangle$ and $Q_\varphi(y; y^k) := \langle\nabla\varphi(y^k), y - y^k\rangle + \frac{1}{2}\langle\nabla^2\varphi(y^k)(y - y^k), y - y^k\rangle$. We can solve (61) in closed form as follows:

  X^*_k := \mathrm{mat}\Big( \frac{\mathrm{trace}(\mathrm{mat}(H_k^{-1}h_k)) + 1}{\mathrm{trace}(\mathrm{mat}(H_k^{-1}\mathrm{vec}(\mathbb{I})))}\, H_k^{-1}\mathrm{vec}(\mathbb{I}) - H_k^{-1}h_k \Big),
  y^*_k := y^k - \nabla^2\varphi(y^k)^{-1}\big(\nabla\varphi(y^k) + t^{-1}\mathcal{L}^* X^*_k\big),    (62)

where $H_k := \nabla^2 f(X_k) + t^{-2}\mathcal{L}\nabla^2\varphi(y^k)^{-1}\mathcal{L}^* \succ 0$ and $h_k := \nabla f(X_k) - \nabla^2 f(X_k)\mathrm{vec}(X_k) - t^{-1}\mathrm{vec}\big(\mathcal{L}(y^k - \nabla^2\varphi(y^k)^{-1}\nabla\varphi(y^k))\big) - t^{-1}\mathrm{vec}(C)$.

We consider a simple case, where $\mathcal{Y} := \{y \in \mathbb{R}^p \mid \|y\|_\infty \le 1\}$. Then, the barrier function of $\mathcal{Y}$ is simply $\varphi(y) := -\sum_{i=1}^p \log(1 - y_i^2)$. In this case, we can compute both $\nabla\varphi(\cdot)$ and $\nabla^2\varphi(\cdot)^{-1}$ in closed form. The barrier parameter for $F := f + \varphi$ is $\nu := 2p + n$.
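For this choice of $\mathcal{Y}$, the derivatives of $\varphi$ are available entrywise; the following sketch records the closed forms just stated (the function name is ours).

```python
import numpy as np

def box_barrier_derivatives(y):
    """phi(y) = -sum_i log(1 - y_i^2) on ||y||_inf < 1.

    Returns (value, gradient, Hessian diagonal); the Hessian is
    diagonal, so its inverse is available entrywise.
    """
    val = -np.sum(np.log(1.0 - y**2))
    grad = 2.0 * y / (1.0 - y**2)
    hess_diag = 2.0 * (1.0 + y**2) / (1.0 - y**2)**2
    return val, grad, hess_diag
```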

We test five solvers: Algorithm 1(a), SDPT3, SeDuMi, Mosek, and SDPNAL+ on 10 medium-size problems, where n varies from 5 to 50, and $p = 10n^2$ (varying from 250 to 25,000). For the three IP solvers and SDPNAL+, we reformulate (60) into the following SDP problem:

  \min_{s,X,y} \{s \mid s\mathbb{I} - X - \mathcal{L}y = C,\; X \succeq 0,\; -1 \le y_j \le 1,\; j = 1, \dots, p\}.

The linear operator $\mathcal{L}$ and the matrix C are generated randomly using the standard Gaussian distribution (randn in Matlab) and are completely dense. In Phase 1 of Algorithm 1(a), instead of performing a path-following scheme on the auxiliary problem, we simply perform a damped-step variant on the original problem. We set the initial penalty parameter $t_0 := 0.1$. We terminate our algorithm if $t_k \le 10^{-6}$ and $\tilde{\lambda}_k \le 10^{-8}$. When $t_k \le 10^{-6}$, if $\tilde{\lambda}_k$ does not reach the $10^{-8}$ accuracy, we fix $t_k = 10^{-6}$ and perform at most 15 additional iterations to decrease $\tilde{\lambda}_k$. We terminate SDPT3, SeDuMi, and Mosek with the same accuracy $\sqrt{\epsilon} = 1.49 \times 10^{-8}$, where $\epsilon$ is Matlab's machine precision. We terminate SDPNAL+ using its default setting, but set the maximum number of iterations at 1000.

The result and performance of these solvers are reported in Table 1, where $n \times p$ is the size of $\mathcal{L}$, iter is the number of iterations in Phase 1 and Phase 2, and $\lambda^\star_\bullet$ is the reported objective value of each solver. The most intensive computation of Algorithm 1(a) is $\mathcal{L}\,\mathrm{diag}(\nabla^2\varphi(y^k))^{-1}\mathcal{L}^\top$, which costs from 40% to 80% of the overall computational time.

Table 1: Summary of the result and performance of 5 solvers for solving problem (59).

  Problem      | Algorithm 1(a)              | SDPT3                | SeDuMi              | Mosek               | SDPNAL+
  n    p       | iter    time[s]  λ*ours     | time[s]  λ*sdpt3     | time[s]  λ*sedumi   | time[s]  λ*mosek    | time[s]   λ*sdpnal+
  5    250     | 19/40   0.37     -80.34     | 2.49     -80.34      | 1.32     -80.34     | 3.12     -80.34     | 27.53     -80.34
  10   1000    | 32/60   0.30     -255.92    | 3.66     -255.92     | 1.56     -255.92    | 2.06     -255.92    | 12.25     -255.92
  15   2250    | 41/72   1.27     -453.51    | 13.35    -453.52     | 4.95     -453.52    | 2.53     -453.52    | 36.99     -453.52
  20   4000    | 49/81   6.88     -684.18    | 55.58    -684.18     | 21.67    -684.18    | 4.14     -684.18    | 38.38     -684.19
  25   6250    | 57/81   23.06    -952.12    | 309.29   -952.12     | 135.91   -952.12    | 7.91     -952.12    | 315.46    -952.13
  30   9000    | 65/86   155.37   -1265.71   | 518.82   -1265.71    | 209.27   -1265.71   | 14.53    -1265.71   | 1202.07   -1257.17
  35   12250   | 71/104  181.21   -1582.48   | 1262.64  -1582.49    | 494.84   -1582.49   | 30.40    -1582.49   | 2912.92   -1582.21
  40   16000   | 78/110  400.89   -1931.65   | 2795.90  -1931.66    | 1064.91  -1931.66   | 68.78    -1931.66   | 2487.30   -1925.06
  45   20250   | 84/117  831.43   -2322.09   | 4777.96  -2322.12    | 2052.40  -2322.12   | 77.91    -2322.11   | 1840.10   -2322.91
  50   25000   | 89/125  1367.36  -2694.27   | 9474.14  -2694.29    | 4184.44  -2694.29   | 130.61   -2694.29   | 13948.03  -2696.67


In this test, Mosek is the fastest as the size increases, while SDPNAL+ is the slowest. SDPT3 is slow, but slightly better than SDPNAL+ in this test. Our algorithm produces a nearly optimal objective value while requiring reasonable computational time compared to the other solvers. SDPNAL+ gives a slightly lower objective value on some problems, but it also violates the bound constraint $\|y\|_\infty \le 1$. We emphasize that our algorithm is naively implemented in Matlab, without optimizing the code or using mex files as the other solvers do. Mosek is a well-known commercial package implemented in C++ using several advanced heuristic strategies, and SDPNAL+ has been developed through several releases using both Matlab and C code.

5.2 Example 2: Sparse and low-rank matrix approximation

The problem of approximating a given $n\times n$ symmetric matrix M as the sum of a low-rank positive semidefinite matrix X with bounded magnitudes and a sparse matrix $M - X$ can be formulated into the following convex optimization problem (see [52]):

  \min_X \; \rho\|\mathrm{vec}(X - M)\|_1 + (1 - \rho)\,\mathrm{trace}(X)
  s.t. \; X \succeq 0, \; L_{ij} \le X_{ij} \le U_{ij}, \; 1 \le i < j \le n.    (63)

Here, $\rho \in (0, 1)$ is a regularization parameter, and L and U are the lower and upper bounds. Let us define $\mathcal{X} := S^n_{++}$ and $g(X) := \rho\|\mathrm{vec}(X - M)\|_1 + (1 - \rho)\mathrm{trace}(X) + \delta_{[L,U]}(X)$, where $\delta_{[L,U]}$ is the indicator of $[L, U] := \{X \in S^n \mid L_{ij} \le X_{ij} \le U_{ij}, 1 \le i < j \le n\}$. Then, we can formulate (63) into the form (3) with $f(X) := -\log\det(X)$.

We implement Algorithm 1(b) for solving (3) and compare it with Mosek, SDPNAL+, and ADMM. The initial parameter is set to $t_0 := 0.1$. We use a restarting accelerated proximal-gradient algorithm proposed in [50] to solve the subproblems (43) and (46) with at most 150 iterations; a sketch of the entrywise proximal operator involved is given below. We apply the same strategy as in Subsection 5.1 to terminate this algorithm. For Mosek and SDPNAL+, we use their default configuration to solve an equivalent mixed SDP reformulation of (63) obtained by transforming the $\ell_1$-norm term into second-order cone constraints.

Since problems of the form (63) have been successfully solved by ADMM [58], we also implement an ADMM variant [7] to solve (63). We terminate ADMM using a criterion in [7, page 19] with a tolerance $\varepsilon = 0.5\times 10^{-5}$. With this $\varepsilon$, ADMM nearly reaches the same order of accuracy as the other methods. We also set the maximum number of iterations at 5000. We tune ADMM to find a reasonable penalty parameter for all problems, which is set to 10.0.
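The subproblems (43) and (46) for this g only require an entrywise proximal operator; a hedged sketch is given below. In one dimension, the box-constrained prox equals the clipped unconstrained prox, so soft-thresholding toward $M_{ij}$ followed by clipping is exact; the linear coefficient is $c = 1-\rho$ on diagonal entries (the trace term, no box) and $c = 0$ off the diagonal. The function is our illustration, not the paper's solver.

```python
import numpy as np

def prox_entry(v, m, lam, rho, lo, hi, c=0.0):
    """Entrywise prox of lam*(rho*|x - m| + c*x) over the box [lo, hi]."""
    w = v - lam * c                              # absorb the linear term
    x = m + np.sign(w - m) * np.maximum(np.abs(w - m) - lam * rho, 0.0)
    return np.clip(x, lo, hi)                    # 1-D prox + box = clipped prox
```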

We test the four algorithms on 12 problems with the sizes reported in Table 2.

Table 2: Summary of the result and performance of 4 solvers on 12 problem instances.

      | Algorithm 1(b)                      | SDPNAL+ [62]                   | Mosek                      | ADMM [7,58]
  n   | iter    time   g(X^k)    spr/rank   | iter  time  g*       spr/rank  | time     g*       spr/rank | time   g*       spr/rank
  20  | 11/74   0.39   11.37     0.39/13    | 294   0.91  11.37    0.73/18   | 0.13     11.37    0.46/4   | 0.88   11.37    0.39/3
  40  | 16/101  0.93   57.06     0.36/32    | 500   4.54  57.06    0.82/39   | 0.65     57.06    0.60/19  | 1.67   57.06    0.80/15
  60  | 23/124  2.16   162.75    0.27/53    | 350   1.93  162.74   0.96/59   | 3.07     162.75   0.89/44  | 3.87   162.75   0.83/25
  80  | 25/147  2.36   245.39    0.21/73    | 210   1.62  245.38   0.88/79   | 8.99     245.39   0.85/53  | 4.61   245.39   0.86/36
  100 | 29/166  3.24   427.29    0.18/92    | 216   1.71  427.27   0.90/100  | 23.32    427.29   0.79/66  | 5.48   427.29   0.89/46
  120 | 33/184  3.71   662.17    0.14/113   | 211   2.66  662.14   0.89/119  | 57.58    662.17   0.89/87  | 5.99   662.17   0.90/55
  140 | 35/200  4.45   893.47    0.12/133   | 191   2.74  893.44   0.90/139  | 141.89   893.47   0.82/96  | 11.24  893.49   0.90/65
  160 | 38/216  5.98   1185.71   0.10/154   | 193   3.82  1185.67  0.90/159  | 266.47   1185.71  0.80/127 | 9.75   1185.73  0.90/75
  180 | 41/244  8.01   1493.43   0.10/173   | 191   3.89  1493.38  0.90/179  | 542.29   1493.43  0.89/169 | 12.73  1493.45  0.91/85
  200 | 43/272  10.97  1741.32   0.08/196   | 194   4.30  1741.26  0.90/199  | 1049.39  1741.33  0.92/194 | 15.54  1741.35  0.91/99
  220 | 46/297  11.52  2082.55   0.06/214   | 191   7.13  2082.47  0.90/218  | 1374.15  2082.55  0.86/218 | 16.62  2082.58  0.91/111
  240 | 49/329  15.31  2577.35   0.04/236   | 194   6.21  2577.25  0.90/237  | 2644.77  2577.36  0.91/219 | 17.83  2577.38  0.91/117

We limit our test to $n = 240$ since Mosek can only solve up to this size on our computer. The data is generated as follows. We generate a symmetric matrix $M_0$ using the standard Gaussian distribution with rank $\lfloor 0.25n\rfloor$ and sparsity 25%. Then, we add a sparse Gaussian noise E with sparsity 10% and variance $10^{-4}$ as $M := M_0 + E$ to obtain M. We generate the lower bound $L_{ij} := 0.9\min\{M_{ij} \mid 1 \le i < j \le n\}$ and the upper bound $U_{ij} := 1.1\max\{M_{ij} \mid 1 \le i < j \le n\}$. We choose $\rho = 0.2$ for all problem instances.

The performance and results of the four algorithms are reported in Table 2. Here, iter is the number of iterations for Phase 1 and Phase 2 of Algorithm 1(b); time is the computational time in seconds; $g(X^k)$ (respectively, $g^\star_{\mathrm{sdpnal+}}$, $g^\star_{\mathrm{mosek}}$, and $g^\star_{\mathrm{admm}}$) is the objective value of Algorithm 1(b) (respectively, SDPNAL+, Mosek, and ADMM); and spr/rank is the sparsity level of $X^k - M$ (i.e., spr := $\mathrm{nnz}(X^k - M)/n^2$) and the rank of $X^k$ (rounded at $10^{-6}$ accuracy).

As we can see from Table 2, the four solvers give similar results in terms of the objective value. The computational times of Algorithm 1(b), SDPNAL+, and ADMM are of the same order, while Mosek is much slower than the other three solvers as p increases. More precisely, Algorithm 1(b) is slightly slower than SDPNAL+, but slightly faster than ADMM in terms of time. It is also much faster than Mosek.

Both Algorithm 1(b) and Mosek perfectly satisfy the positive definiteness constraint with $\lambda_{\min}(X^k) > 0$, while SDPNAL+ still slightly violates this constraint with $\lambda_{\min}(X^k) \approx -O(10^{-4})$. ADMM also slightly violates the positivity constraint $X \succeq 0$ in 6 problem instances out of 12. Algorithm 1(b) gives the best results in terms of the sparsity of $X^k - M$, while achieving a similar rank to Mosek and SDPNAL+ in the majority of the tests. ADMM gives a better rank than the other methods, but its solution leads to a dense residual $X^k - M$.

As often observed in first-order methods, ADMM easily achieves a low-accuracy solution (i.e., $10^{-4}$ accuracy) in fewer than five hundred iterations (from 282 to 476 iterations). However, in order to reach a high-accuracy solution comparable to Algorithm 1(b) or the IP solvers, it requires many iterations (from 700 to 3210 iterations, depending on the problem instance).

5.3 Example 3: Cluster recovery

Finally, we test Algorithm 2 on the following well-studied cluster recovery problem via an SDP relaxation, as studied in [22]:

  \max_{X\in S^n_+} \; \mathrm{trace}(A^\top X)
  s.t. \; X_{ii} \le 1, \; X_{ij} \ge 0, \; i, j = 1, \dots, n,
       \; \mathrm{trace}(X) = s_1, \; \mathrm{trace}(E_n X) = s_2,    (64)

where A is the adjacency matrix of a given graph, $E_n$ is the all-one matrix in $\mathbb{R}^{n\times n}$, $s_1 = \sum_{i=1}^r K_i$, and $s_2 := \sum_{i=1}^r K_i^2$ with $K_1, K_2, \dots, K_r$ being the sizes of the r clusters.

Let us define $K := S^n_+$, $\mathcal{L}X := [\mathrm{trace}(X), \mathrm{trace}(E_n X), X_{ii}, X_{ij}] : \mathbb{R}^{n\times n} \to \mathbb{R}^{n(n+1)+2}$, $g(s) := \delta_{\{0\}^2}(s_{1:2}) + \delta_{-\mathbb{R}^n_+}(s_{3:n+2}) + \delta_{\mathbb{R}^{n^2}_+}(s_{n+3:n(n+1)+2})$, and $b := (s_1, s_2, e_n, 0_{n^2})^\top \in \mathbb{R}^{n(n+1)+2}$, where $e_n$ is the all-one vector, $\delta_{\mathcal{X}}$ is the indicator function of $\mathcal{X}$, and $s_{k_1:k_2}$ is the subvector of s concatenating the $k_1$-th through the $k_2$-th entries. Using these notations, we can reformulate (64) into the constrained convex problem (4); the projection realizing the proximal operator of g is sketched below.
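Since g is a sum of indicator functions, its proximal operator is the corresponding projection, computed blockwise; a sketch under the block ordering stated above:

```python
import numpy as np

def prox_g_cluster(s, n):
    """Projection realizing prox_g for (64) under the stated block ordering."""
    p = s.copy()
    p[:2] = 0.0                                  # s_{1:2} pinned to {0}^2
    p[2:n + 2] = np.minimum(p[2:n + 2], 0.0)     # s_{3:n+2} in -R^n_+
    p[n + 2:] = np.maximum(p[n + 2:], 0.0)       # remaining n^2 entries in R^{n^2}_+
    return p
```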

Since $K := S^n_+$ is a self-dual cone, i.e., $K^* = S^n_+$, the corresponding self-concordant logarithmically homogeneous barrier of $-K^*$ is $f^*(S) := -\log\det(-S) - n$.

We implement Algorithm 2 to solve (64). We use a restarting proximal-gradient algorithm [3] to solve the corresponding subproblems in (51) and (54). Since we can compute the dual objective values, we use a damped-step Newton scheme in Phase 1 to compute an initial point $y^0$. As our algorithm uses a first-order method for the subproblems, we set the precision of these inner solvers low enough that the relative error is guaranteed to be less than $O(10^{-4})$ at termination. We choose the number of clusters such that the average number of points per cluster is between 10 and 20. The initial value of $t_0$ is set to $t_0 := 0.5$ if $n \le 120$, and to $t_0 := 1.0$ otherwise. All optimal values of our algorithm have a relative error smaller than $10^{-4}$ when the algorithm is terminated, which matches the precision of the inner solvers.

The results and performance of these solvers are reported in Table 3 for small-scale problems. In this table, we summarize the results and performance on 12 problems of size from 60 to 250, where K is the number of clusters; iter is the number of iterations in Phase 1 and Phase 2 of Algorithm 2; and spu is the speed-up ratio of each solver relative to our method in terms of time (i.e., spu := time(solver)/time(Algorithm 2)).

Table 3: Summary of performance of 4 solvers on 12 problem instances.

  Problem           | Algorithm 2             | SDPT3                     | Mosek                    | SDPNAL+
  n    K   ρ[%]     | iter    time[s]  G^k    | time[s]   G*sdpt3   spu   | time[s]  G*mosek   spu   | time[s]  G*sdpnal+  spu
  60   5   18.6     | 52/215  3.63     660.07 | 6.28      660.00    1.7   | 4.41     660.00    1.2   | 1.58     660.00     0.44
  70   7   13.0     | 42/232  3.37     630.02 | 9.27      630.00    2.8   | 6.62     630.00    2.0   | 3.09     630.00     0.92
  80   8   11.3     | 45/247  4.30     720.02 | 14.46     720.00    3.4   | 8.39     720.00    2.0   | 3.37     720.00     0.78
  90   9   10.1     | 48/262  6.61     810.03 | 27.80     810.00    4.2   | 14.40    810.00    2.2   | 3.47     810.00     0.53
  100  10  9.1      | 50/276  7.54     900.03 | 44.24     900.00    5.9   | 23.45    900.00    3.1   | 3.21     900.00     0.43
  110  10  9.2      | 57/289  8.91    1100.05 | 57.26    1100.00    6.4   | 33.01   1100.00    3.7   | 2.34    1100.00     0.26
  120  10  9.2      | 78/302  11.13   1320.13 | 88.61    1320.00    8.0   | 50.80   1320.00    4.6   | 3.58    1320.00     0.32
  140  10  9.4      | 57/359  20.18   1820.13 | 203.74   1820.00    10.1  | 110.82  1820.00    5.5   | 4.22    1820.00     0.21
  160  10  9.4      | 76/383  26.57   2400.23 | 340.58   2400.00    12.8  | 214.13  2400.00    8.1   | 5.02    2400.00     0.19
  180  15  6.2      | 63/406  36.73   1980.16 | 808.40   1980.00    22.0  | 423.28  1980.00    11.5  | 4.52    1980.00     0.12
  200  20  4.5      | 64/428  66.34   1800.14 | 1945.14  1798.69    29.3  | 787.78  1800.00    11.9  | 4.31    1800.00     0.07
  250  25  3.6      | 72/478  184.18  2250.21 | 11378.33 2250.00    61.8  | 3415.16 2250.00    18.5  | 5.48    2250.00     0.03

Table 3 shows that Algorithm 2 can achieve the same order of accuracy as the other three solvers while significantly outperforming SDPT3 and Mosek in terms of computational time: Algorithm 2 is up to 18 times faster than Mosek and 62 times faster than SDPT3. This is due to the low cost of the projections when we work directly on the dual of the original problem. The other IP solvers need to convert the problem into an appropriate SDP format, which substantially increases its size, as seen in Subsection 5.2. Since problem (64) fits SDPNAL+ very well, we also compare it with our method. Clearly, SDPNAL+ takes advantage of the splitting techniques of ADMM, semi-smooth Newton-CG methods, and heuristic procedures to perform well in this test. It has a performance similar to Algorithm 2 on small problems, and it becomes faster as the problem size increases.

6 Discussion

We have studied a class of self-concordant inclusions of the form (1) and designed an inexact generalized Newton-type framework for solving it. Problem (1) is sufficiently general to cope with the three fundamental convex optimization formulations discussed in Subsection 1.2. Moreover, since this problem can be reformulated into a multivalued variational inequality problem [46], theory and methods from this area can be used to deal with (1); see [2,16,46].

Most existing methods for solving (1) exploit specific structures of A and Z. When A is single-valued, (1) is a standard single-valued variational inequality, and it becomes a complementarity problem if additionally Z is a box. The most commonly used methods to solve complementarity problems are based on generalized Newton methods developed for nonsmooth equations, including path-following methods [42] and semi-smooth Newton-type methods [12,28,26,41,62]. The basic idea is to reformulate the complementarity problem as an equation defined by nonsmooth functions; at each iteration, one approximately solves the equation obtained by some first-order approximation or a generalized Jacobian matrix of the nonsmooth functions. Another important class of methods to solve (1) is based on projection and splitting [55,56,60]. These methods can be considered as special cases of the forward-backward splitting scheme when the second operator is simply the normal cone of the convex set Z. When A is maximally monotone, its resolvent is well-defined and single-valued. Splitting methods using proximal-point and projected schemes, such as Douglas-Rachford's method, can be applied to solve (1). Other approaches such as augmented Lagrangian [62], extragradient, mirror descent, hybrid-gradient, gap functions, smoothing techniques, and interior-point proximal methods are also widely studied in the literature for different classes of (1); see, e.g., [1,6,15,16,19,20,25,29,31,33,48,59] and the references quoted therein.

From a theoretical point of view, the setting (1) can be used as a unified tool to handle a wide range of convex problems. Three specific instances (3), (4), and (8) of (1) are well studied and have a great impact in different fields, including operations research, statistics, machine learning, signal and image processing, and control [2,5,8,32]. Methods for solving these instances include sequential quadratic programming [38], interior-point methods [36], augmented Lagrangian-type methods [57,62] (e.g., as implemented in SDPAD and SDPNAL/SDPNAL+), first-order/second-order primal-dual and splitting methods [2,7,9,11,14,47,54,55], Frank-Wolfe-type algorithms [17,23], and stochastic gradient descent [30,24], just to name a few.

Perhaps the interior-point method [5,32,37,61] is one of the most well-developed methods for solving standard conic programs covered by (3) and (4). Interior-point methods, together with the disciplined programming approach [21], allow us to solve a large class of convex optimization problems arising in different fields. These techniques have been systematically implemented in several off-the-shelf software packages such as CVX [21], YALMIP [27], CPLEX, and Gurobi for both commercial and academic use. While interior-point methods provide a powerful framework to solve a large class of constrained convex problems with high accuracy and numerically robust performance, their high per-iteration complexity prevents them from tackling large-scale modern applications.

Although the interior-point method and the proximal-type method have been separately well developed for several classes of convex problems, their joint treatment was, to the best of our knowledge, first proposed in [52,53]. In these papers, the authors proposed a novel path-following proximal Newton framework for the instance (3) of (1). They characterized an $O(\sqrt{\nu}\log(1/\varepsilon))$ worst-case iteration-complexity, as in standard path-following methods [32], to achieve an $\varepsilon$-solution of (3), where $\nu$ is the barrier parameter of a barrier function of the feasible set $\mathcal{X}$, and $\varepsilon$ is a desired accuracy. However, [52,53] obtained a smaller neighborhood of the central path than standard path-following methods [32]. In addition, these algorithms used the points on the central path to measure the proximal Newton decrement, which leads to an unimplementable stopping criterion. In contrast, this paper focuses on developing a unified theory using the self-concordant inclusion (1) as a generic framework. The main component of our methods is the generalized Newton method studied, e.g., in [6,39,43,44], which we have extended to self-concordant settings. Moreover, we use a generalized gradient mapping to measure a neighborhood of the central path as well as a quadratic convergence region of the generalized Newton iterations, and this neighborhood has the same size as in standard path-following methods [32]. When this framework is specialized to solve (3) and (4), such a generalized gradient mapping allows us to obtain an implementable stopping criterion, an adaptive rule for the penalty parameter, and overall polynomial-time worst-case iteration-complexity bounds.

Acknowledgements This work was supported in part by the NSF grant, no. DMS-16-2044 (USA).

7 Appendix: The proofs of technical results

This appendix provides the full proofs of all lemmas and theorems in the main text.

7.1 The proof of Lemma 1: The existence and uniqueness of the solution of (2)

Under Assumption A.1, the operator $t\nabla F(\cdot) + A(\cdot)$ is maximally monotone for any $t > 0$. We use [46, Theorem 12.51] to prove the solution existence of (2). To this end, let $\omega \neq 0$ be chosen from the horizon cone of $\mathrm{int}(Z) \cap \mathrm{dom}(A)$. We need to find $z \in \mathrm{int}(Z) \cap \mathrm{dom}(A)$ with $v \in t\nabla F(z) + A(z)$ such that $\langle v, \omega\rangle > 0$. By assumption, there exists $\hat{z} \in \mathrm{int}(Z) \cap \mathrm{dom}(A)$ with $\hat{a} \in A(\hat{z})$ such that $\langle\hat{a}, \omega\rangle > 0$.


First, we show that $z_\tau = \hat{z} + \tau\omega$ belongs to $\mathrm{int}(Z) \cap \mathrm{dom}(A)$ for any $\tau > 0$. To see this, note that the assumption $\mathrm{int}(Z) \cap \mathrm{dom}(A) \neq \emptyset$ implies that $\mathrm{int}(Z) \cap \mathrm{ri}(\mathrm{dom}(A)) \neq \emptyset$, which implies that the closure of $\mathrm{int}(Z) \cap \mathrm{dom}(A)$ is exactly $Z \cap \mathrm{cl}(\mathrm{dom}(A))$. Choose $\tau' > \tau$; by definition of the horizon cone, $z_{\tau'}$ belongs to the closure of $\mathrm{int}(Z) \cap \mathrm{dom}(A)$, so $z_{\tau'} \in Z$ and $z_{\tau'} \in \mathrm{cl}(\mathrm{dom}(A))$. Since $z_\tau$ is a convex combination of $\hat{z}$ and $z_{\tau'}$, it belongs to $\mathrm{int}(Z) \cap \mathrm{dom}(A)$, where we use the assumption that $\mathrm{dom}(A)$ is either closed or open.

Next, for any $a_\tau \in A(z_\tau)$, we have

  \langle a_\tau, \omega\rangle = \langle a_\tau - \hat{a}, \omega\rangle + \langle\hat{a}, \omega\rangle = \langle a_\tau - \hat{a}, \tau^{-1}(z_\tau - \hat{z})\rangle + \langle\hat{a}, \omega\rangle \ge \langle\hat{a}, \omega\rangle > 0.

On the other hand, $\langle t\nabla F(z_\tau), \omega\rangle = \langle t\nabla F(z_\tau), \tau^{-1}(z_\tau - \hat{z})\rangle \ge -\tau^{-1}t\nu$ by [32, Theorem 4.2.4]. Combining the above two inequalities, we can see that

  \langle t\nabla F(z_\tau) + a_\tau, \omega\rangle \ge -\tau^{-1}t\nu + \langle\hat{a}, \omega\rangle > 0

as long as $\tau^{-1}t\nu < \langle\hat{a}, \omega\rangle$. We have thereby verified the condition in [46, Theorem 12.51] needed to guarantee that (2) has a nonempty (and bounded) solution set. Since $\nabla F$ is strictly monotone, the solution of (2) is unique.

Note that $z^\star_t$ is the solution of (2) and $z^\star_t \in \mathrm{int}(Z)$; hence $-t\nabla F(z^\star_t) \in A(z^\star_t) = A_Z(z^\star_t)$. Therefore, $\mathrm{dist}_{z^\star_t}(0, A_Z(z^\star_t)) \le t\|\nabla F(z^\star_t)\|^*_{z^\star_t} \le t\sqrt{\nu}$ due to the properties of F [32]. Using Definition 4, we have the last conclusion. ⇤

7.2 The proof of Lemma 3: Approximate solution

First, since $\bar{z}^+$ is a zero point of $\hat{A}_t(\cdot; z)$, i.e., $0 \in \hat{A}_t(\bar{z}^+, z)$, we have $-t\nabla F(z) - t\nabla^2 F(z)(\bar{z}^+ - z) \in A(\bar{z}^+)$. Second, since $z^+$ is a $\delta$-solution to (23), there exists e such that $e \in t\nabla F(z) + t\nabla^2 F(z)(z^+ - z) + A(z^+)$ with $\|e\|^*_z \le t\delta$ by Definition 5. Combining these expressions and using the monotonicity of A in Definition 1, we can show that $\langle t[\nabla F(z) + \nabla^2 F(z)(z^+ - z) - \nabla F(z) - \nabla^2 F(z)(\bar{z}^+ - z)] - e, \bar{z}^+ - z^+\rangle \ge 0$. This inequality leads to

  t\|z^+ - \bar{z}^+\|_z^2 \le \langle e, z^+ - \bar{z}^+\rangle \le \|e\|^*_z\,\|z^+ - \bar{z}^+\|_z,    (65)

which implies $\|z^+ - \bar{z}^+\|_z \le t^{-1}\|e\|^*_z$. Hence, $\|e\|^*_z \le t\delta$ implies $\|z^+ - \bar{z}^+\|_z \le \delta$.

Next, since $z^+$ is a $\delta$-approximate solution to (23) at t in the sense of Definition 5 up to the accuracy $\delta$, there exists $e \in \mathbb{R}^p$ such that

  e \in t[\nabla F(z) + \nabla^2 F(z)(z^+ - z)] + A(z^+) \quad with \quad \|e\|^*_z \le t\delta.

In addition, we have $z^+ \in \mathrm{int}(Z)$ due to Theorem 1 below. Hence, we have $A_Z(z^+) = A(z^+)$. Using this relation and the above inclusion, we can show that

  dist_z(0, A_Z(z^+)) \le \|e - t[\nabla F(z) + \nabla^2 F(z)(z^+ - z)]\|^*_z
   \le \|e\|^*_z + t\|\nabla F(z)\|^*_z + t\|\nabla^2 F(z)(z^+ - z)\|^*_z
   \le t\big[\delta + \sqrt{\nu} + \|\nabla^2 F(z)(\bar{z}^+ - z)\|^*_z + \|\nabla^2 F(z)(\bar{z}^+ - z^+)\|^*_z\big]
   \le t\big[\delta + \sqrt{\nu} + \lambda_t(z) + \|z^+ - \bar{z}^+\|_z\big]
   \le t\big(\sqrt{\nu} + \lambda_t(z) + 2\delta\big).    (66)

Here, we have used $\|\nabla F(z)\|^*_z \le \sqrt{\nu}$ and $\|z^+ - \bar{z}^+\|_z \le \delta$ from the first part of this lemma. Note that if $\lambda_t(z) + \delta < 1$, then $\mathrm{dist}_{z^+}(0, A_Z(z^+)) \le (1 - \lambda_t(z) - \delta)^{-1}\mathrm{dist}_z(0, A_Z(z^+))$. Combining this inequality and the last estimate, we obtain (25). Finally, if we choose $t \le (1 - \lambda_t(z) - \delta)(\sqrt{\nu} + \lambda_t(z) + 2\delta)^{-1}\varepsilon$, then $\mathrm{dist}_{z^+}(0, A_Z(z^+)) \le \varepsilon$. Hence, $z^+$ is an $\varepsilon$-solution to (1) in the sense of Definition 4. ⇤

is an "-solutionto (1) in the sense of Definition 4. ⇤


7.3 The proof of Theorem 1: Key estimate of generalized Newton-type schemes

First, similar to [2], we can easily show that the following non-expansiveness property holds:

  \|P_{\hat{z}}(u; t) - P_{\hat{z}}(v; t)\|_{\hat{z}} \le \|u - v\|_{\hat{z}}, \quad \forall u, v \in \mathbb{R}^p.    (67)

Note that $\|z^+ - z\|_z \le \|\bar{z}^+ - z\|_z + \|z^+ - \bar{z}^+\|_z = \lambda_{t_+}(z) + \delta(z) < 1$ by our assumption. This shows that $z^+ \in \mathrm{int}(Z)$ due to [32, Theorem 4.1.5(1)].

Next, we consider the generalized gradient mappings $G_z(z; t_+)$ and $G_{z^+}(z^+; t_+)$ at z and $z^+$, respectively, defined by (20) as follows:

  G_z(z; t_+) := \nabla^2 F(z)\big[z - P_z\big(z - \nabla^2 F(z)^{-1}\nabla F(z); t_+\big)\big],
  G_{z^+}(z^+; t_+) := \nabla^2 F(z^+)\big[z^+ - P_{z^+}\big(z^+ - \nabla^2 F(z^+)^{-1}\nabla F(z^+); t_+\big)\big].    (68)

Let $r_z(\bar{z}^+) := \nabla F(z) + \nabla^2 F(z)(\bar{z}^+ - z)$. Then, by using $\bar{z}^+ := P_z(z - \nabla^2 F(z)^{-1}\nabla F(z); t_+)$ from (26), we can show that

  -r_z(\bar{z}^+) = -\big[\nabla F(z) + \nabla^2 F(z)(\bar{z}^+ - z)\big] \in t_+^{-1}A(\bar{z}^+).    (69)

Clearly, we can rewrite (69) as $\bar{z}^+ - \nabla^2 F(z^+)^{-1}r_z(\bar{z}^+) \in \bar{z}^+ + t_+^{-1}\nabla^2 F(z^+)^{-1}A(\bar{z}^+)$. Then, using the definition (16) of $P_{z^+}(\cdot) := (\mathbb{I} + t_+^{-1}\nabla^2 F(z^+)^{-1}A)^{-1}(\cdot)$, we can derive

  z^+ = P_{z^+}\big(\bar{z}^+ - \nabla^2 F(z^+)^{-1}r_z(\bar{z}^+); t_+\big) + (z^+ - \bar{z}^+).    (70)

Now, we can estimate $\lambda_{t_+}(z^+)$ defined by (21) using (68), (70), (67), and (69) as follows:

  \lambda_{t_+}(z^+) := \|G_{z^+}(z^+; t_+)\|^*_{z^+}
   = \|z^+ - P_{z^+}(z^+ - \nabla^2 F(z^+)^{-1}\nabla F(z^+); t_+)\|_{z^+}   [by (68)]
   = \|P_{z^+}(\bar{z}^+ - \nabla^2 F(z^+)^{-1}r_z(\bar{z}^+); t_+) - P_{z^+}(z^+ - \nabla^2 F(z^+)^{-1}\nabla F(z^+); t_+) + (z^+ - \bar{z}^+)\|_{z^+}   [by (70)]
   \le \|P_{z^+}(\bar{z}^+ - \nabla^2 F(z^+)^{-1}r_z(\bar{z}^+); t_+) - P_{z^+}(z^+ - \nabla^2 F(z^+)^{-1}\nabla F(z^+); t_+)\|_{z^+} + \|z^+ - \bar{z}^+\|_{z^+}
   \le \|\nabla^2 F(z^+)^{-1}[\nabla F(z^+) - r_z(\bar{z}^+)] + (\bar{z}^+ - z^+)\|_{z^+} + \|z^+ - \bar{z}^+\|_{z^+}   [by (67)]
   = \|\nabla^2 F(z^+)^{-1}\big[\nabla F(z^+) - \nabla F(z) - \nabla^2 F(z)(z^+ - z) + (\nabla^2 F(z^+) - \nabla^2 F(z))(\bar{z}^+ - z^+)\big]\|_{z^+} + \|z^+ - \bar{z}^+\|_{z^+}   [by (69)]
   \le \|\nabla F(z^+) - \nabla F(z) - \nabla^2 F(z)(z^+ - z)\|^*_{z^+} + \|(\nabla^2 F(z^+) - \nabla^2 F(z))(\bar{z}^+ - z^+)\|^*_{z^+} + \|z^+ - \bar{z}^+\|_{z^+}
   \le \frac{1}{1 - \|z^+ - z\|_z}\Big[\|\nabla F(z^+) - \nabla F(z) - \nabla^2 F(z)(z^+ - z)\|^*_z + \|(\nabla^2 F(z^+) - \nabla^2 F(z))(\bar{z}^+ - z^+)\|^*_z\Big] + \frac{\|z^+ - \bar{z}^+\|_z}{1 - \|z^+ - z\|_z}.    (71)

Here, in the last inequality of (71), we have used the fact that $\|w\|^2_{z^+} = \langle\nabla^2 F(z^+)w, w\rangle \le (1 - \|z^+ - z\|_z)^{-2}\langle\nabla^2 F(z)w, w\rangle = (1 - \|z^+ - z\|_z)^{-2}\|w\|^2_z$ for any w and z, $z^+$ such that $\|z^+ - z\|_z < 1$, and the analogous fact for the dual norms. Both facts can be derived from [32, Theorem 4.1.6]. The condition $\|z^+ - z\|_z < 1$ is guaranteed since $\|z^+ - z\|_z \le \|z - \bar{z}^+\|_z + \|z^+ - \bar{z}^+\|_z = \lambda_{t_+}(z) + \delta(z) < 1$ by our assumption.

Similar to the proof of [32, Theorem 4.1.14], we can show that

  \|\nabla F(z^+) - \nabla F(z) - \nabla^2 F(z)(z^+ - z)\|^*_z \le \frac{\|z^+ - z\|_z^2}{1 - \|z^+ - z\|_z}.    (72)


Next, we need to estimate $B := \|(\nabla^2 F(z^+) - \nabla^2 F(z))(\bar{z}^+ - z^+)\|^*_z$. We define

  \Sigma := \nabla^2 F(z)^{-1/2}\big(\nabla^2 F(z^+) - \nabla^2 F(z)\big)\nabla^2 F(z)^{-1/2}.

By [32, Theorem 4.1.6], we can show that

  \|\Sigma\| \le \max\big\{1 - (1 - \|z^+ - z\|_z)^2,\; (1 - \|z^+ - z\|_z)^{-2} - 1\big\} = \frac{2\|z^+ - z\|_z - \|z^+ - z\|_z^2}{(1 - \|z^+ - z\|_z)^2}.

Using this inequality, we can estimate B as

  B^2 = (\bar{z}^+ - z^+)^\top\nabla^2 F(z)^{1/2}\Sigma^2\nabla^2 F(z)^{1/2}(\bar{z}^+ - z^+) \le \|\Sigma\|^2\|\bar{z}^+ - z^+\|_z^2 \le \Big(\frac{2\|z^+ - z\|_z - \|z^+ - z\|_z^2}{(1 - \|z^+ - z\|_z)^2}\Big)^2\|\bar{z}^+ - z^+\|_z^2,

which implies

  B \le \Big(\frac{2\|z^+ - z\|_z - \|z^+ - z\|_z^2}{(1 - \|z^+ - z\|_z)^2}\Big)\|\bar{z}^+ - z^+\|_z.    (73)

Substituting (72) and (73) into (71), we get

  \lambda_{t_+}(z^+) \le \frac{\|z^+ - z\|_z^2}{(1 - \|z^+ - z\|_z)^2} + \frac{(2\|z^+ - z\|_z - \|z^+ - z\|_z^2)\|\bar{z}^+ - z^+\|_z}{(1 - \|z^+ - z\|_z)^3} + \frac{\|z^+ - \bar{z}^+\|_z}{1 - \|z^+ - z\|_z}
   = \frac{\|z^+ - z\|_z^2}{(1 - \|z^+ - z\|_z)^2} + \frac{\|z^+ - \bar{z}^+\|_z}{(1 - \|z^+ - z\|_z)^3}.    (74)

Finally, we note that $\lambda_{t_+}(z) := \|G_z(z; t_+)\|^*_z = \|z - P_z(z - \nabla^2 F(z)^{-1}\nabla F(z); t_+)\|_z = \|z - \bar{z}^+\|_z$ due to (26). Using the triangle inequality, we have $\|z^+ - z\|_z \le \|z - \bar{z}^+\|_z + \|z^+ - \bar{z}^+\|_z = \lambda_{t_+}(z) + \delta(z) < 1$. Since the right-hand side of (74) is monotonically increasing with respect to $\|z^+ - z\|_z$, using the last inequality in (74), we obtain (27). ⇤

7.4 The proof of Theorem 2: Local quadratic convergence of FGN

We first prove (a). Given a fixed parameter $t > 0$ sufficiently small, our objective is to find $\beta \in (0, 1)$ such that if $\lambda_t(z^k) \le \beta$, then $\lambda_t(z^{k+1}) \le \beta$. Indeed, using the key estimate (27) with t instead of $t_+$, we can see that to guarantee $\lambda_t(z^{k+1}) \le \beta$, we require

  \Big(\frac{\lambda_t(z^k) + \delta(z^k)}{1 - \lambda_t(z^k) - \delta(z^k)}\Big)^2 + \frac{\delta(z^k)}{(1 - \lambda_t(z^k) - \delta(z^k))^3} \le \beta.

Since the left-hand side of this inequality is monotonically increasing when $\lambda_t(z^k)$ and $\delta(z^k)$ increase, we can overestimate it to require

  \Big(\frac{\beta + \delta}{1 - \beta - \delta}\Big)^2 + \frac{\delta}{(1 - \beta - \delta)^3} \le \beta.

Using the identity $\frac{\beta+\delta}{1-\beta-\delta} = \frac{\beta}{1-\beta} + \frac{\delta}{(1-\beta)(1-\beta-\delta)}$, we can write the last inequality as

  \Big[\frac{2\beta}{(1-\beta)^2(1-\beta-\delta)} + \frac{\delta}{(1-\beta)^2(1-\beta-\delta)^2} + \frac{1}{(1-\beta-\delta)^3}\Big]\delta \le \beta - \Big(\frac{\beta}{1-\beta}\Big)^2.    (75)

Clearly, the left-hand side of (75) is positive if $0 < \delta < 1 - \beta$. Hence, we need to choose $\beta \in (0, 0.5(3 - \sqrt{5}))$ such that the right-hand side of (75) is also positive. Now, we choose $\delta \ge 0$ such that $\delta \le \beta(1 - \beta) < 1 - \beta$. Then, (75) can once more be overestimated by

  \Big(\frac{2\beta^3 - 5\beta^2 + 3\beta + 1}{(1 - \beta)^4}\Big)\delta \le \beta(1 - 3\beta + \beta^2),

which implies

  0 \le \delta \le \frac{\beta(1 - 3\beta + \beta^2)(1 - \beta)^4}{2\beta^3 - 5\beta^2 + 3\beta + 1} < \beta(1 - \beta), \quad \forall\beta \in \big(0, 0.5(3 - \sqrt{5})\big).

This inequality suggests that we can choose $\delta := \frac{\beta(1-3\beta+\beta^2)(1-\beta)^4}{2\beta^3-5\beta^2+3\beta+1} > 0$. In this case, we also have $\delta(z) + \lambda_t(z) \le \delta + \beta < 1$, which guarantees the condition of Theorem 1. Hence, we can conclude that $\lambda_t(z^k) \le \beta$ implies $\lambda_t(z^{k+1}) \le \beta$. In other words, $\{z^k\}$ belongs to $Q_t(\beta)$.

(b) Next, to guarantee quadratic convergence, we can choose $\delta_k$ such that $\delta(z^k) \le \delta_k \le \bar{\delta}_k := \frac{\lambda_t(z^k)^2}{1-\lambda_t(z^k)}$. Substituting the upper bound $\bar{\delta}_k$ of $\delta(z^k)$ into (27), we obtain

  \lambda_t(z^{k+1}) \le \Big(\frac{2 - 4\lambda_t(z^k) + \lambda_t(z^k)^2}{(1 - 2\lambda_t(z^k))^3}\Big)\lambda_t(z^k)^2.    (30)

Let us consider the function $s(r) := \frac{(2-4r+r^2)r^2}{(1-2r)^3}$. One can check numerically that $s(r) < 1$ for all $r \in [0, 0.25]$, which covers the region used below. Hence, $\lambda_t(z^{k+1}) < 1$ as long as $\lambda_t(z^k)$ remains in this range. This proves the estimate (30). Now, let us choose some $\beta \in (0, 1)$ such that $\lambda_t(z^k) \le \beta$. Then (30) leads to

  \lambda_t(z^{k+1}) \le \Big(\frac{2 - 4\beta + \beta^2}{(1 - 2\beta)^3}\Big)\lambda_t(z^k)^2 = c\,\lambda_t(z^k)^2,

where $c := \frac{2-4\beta+\beta^2}{(1-2\beta)^3} > 0$. We need to choose $\beta \in (0, 1)$ such that $c\lambda_t(z^k) < 1$. Since $\lambda_t(z^k) \le \beta$, we choose $\beta$ such that $c\beta < 1$, which is equivalent to $9\beta^3 - 16\beta^2 + 8\beta - 1 < 0$. If $\beta \in (0, 0.18858]$, then $9\beta^3 - 16\beta^2 + 8\beta - 1 < 0$. Therefore, the radius of the quadratic convergence region of $\{\lambda_t(z^k)\}$ is $r := 0.18858$.

(c) Finally, for any $\beta \in (0, 0.18858]$, we can write $c\lambda_t(z^{k+1}) \le (c\lambda_t(z^k))^2$. By induction, $c\lambda_t(z^k) \le (c\lambda_t(z^0))^{2^k} \le c^{2^k}\beta^{2^k} < 1$. We obtain $\lambda_t(z^k) \le c^{2^k-1}\beta^{2^k}$. Let us choose $\delta_k := \frac{\lambda_t(z^k)^2}{1-\lambda_t(z^k)}$. For $\epsilon \in (0, \beta)$, assume that $c^{2^k-1}\beta^{2^k} \le \epsilon$. From Lemma 3, we can choose $t := (1-\epsilon)\big(\sqrt{\nu} + \epsilon + 2\epsilon^2/(1-\epsilon)\big)^{-1}\varepsilon$. Then, $z^k$ is an $\varepsilon$-solution of (1). It remains to use the fact that $c^{2^k-1}\beta^{2^k} \le \epsilon$ to upper bound the number of iterations by $k := O(\ln(\ln(1/\epsilon)))$. ⇤

7.5 The proof of Theorem 3: Local quadratic convergence of DGN

(a) Given a fixed parameter $t > 0$ sufficiently small, it follows from DGN and (70) that

  \bar{z}^{k+2} = P_{z^{k+1}}\big(z^{k+1} - \nabla^2 F(z^{k+1})^{-1}\nabla F(z^{k+1}); t\big),
  z^{k+1} = P_{z^{k+1}}\big(\bar{z}^{k+1} - \nabla^2 F(z^{k+1})^{-1}r_{z^k}(\bar{z}^{k+1}); t\big) + (z^{k+1} - \bar{z}^{k+1}).

Hence, using these relations and the same proof as (74) with t instead of $t_+$, and supposing $\|z^{k+1} - z^k\|_{z^k} < 1$, we can derive

  \|\bar{z}^{k+2} - z^{k+1}\|_{z^{k+1}} \le \Big(\frac{\|z^{k+1} - z^k\|_{z^k}}{1 - \|z^{k+1} - z^k\|_{z^k}}\Big)^2 + \frac{\|z^{k+1} - \bar{z}^{k+1}\|_{z^k}}{(1 - \|z^{k+1} - z^k\|_{z^k})^3}.    (76)

Now, let us define $\tilde{\lambda}_t(z^k) := \|\tilde{z}^{k+1} - z^k\|_{z^k}$ and $\alpha_k := (1 + \tilde{\lambda}_t(z^k))^{-1}$ as in DGN. From the update $z^{k+1} := (1 - \alpha_k)z^k + \alpha_k\tilde{z}^{k+1}$ of DGN, we have

  \|z^{k+1} - z^k\|_{z^k} = \alpha_k\|\tilde{z}^{k+1} - z^k\|_{z^k} = \alpha_k\tilde{\lambda}_t(z^k), \quad and
  \|z^{k+1} - \bar{z}^{k+1}\|_{z^k} \le \|z^{k+1} - \tilde{z}^{k+1}\|_{z^k} + \|\tilde{z}^{k+1} - \bar{z}^{k+1}\|_{z^k} = (1 - \alpha_k)\tilde{\lambda}_t(z^k) + \delta(z^k).

Substituting these expressions into (76), we get

  \|\bar{z}^{k+2} - z^{k+1}\|_{z^{k+1}} \le \Big(\frac{\alpha_k\tilde{\lambda}_t(z^k)}{1 - \alpha_k\tilde{\lambda}_t(z^k)}\Big)^2 + \frac{\delta(z^k) + (1 - \alpha_k)\tilde{\lambda}_t(z^k)}{(1 - \alpha_k\tilde{\lambda}_t(z^k))^3}.

Substituting $\alpha_k := (1 + \tilde{\lambda}_t(z^k))^{-1}$ into the last inequality and simplifying the result, we get

  \|\bar{z}^{k+2} - z^{k+1}\|_{z^{k+1}} \le \big(2 + 2\tilde{\lambda}_t(z^k) + \tilde{\lambda}_t(z^k)^2\big)\tilde{\lambda}_t(z^k)^2 + \big(1 + \tilde{\lambda}_t(z^k)\big)^3\delta(z^k).

Next, by the triangle inequality, it follows from (68) and the definitions of $\lambda_t(z)$ and $\tilde{\lambda}_t(z)$ that $\tilde{\lambda}_t(z^{k+1}) = \|\tilde{z}^{k+2} - z^{k+1}\|_{z^{k+1}} \le \|\bar{z}^{k+2} - z^{k+1}\|_{z^{k+1}} + \|\tilde{z}^{k+2} - \bar{z}^{k+2}\|_{z^{k+1}} = \|\bar{z}^{k+2} - z^{k+1}\|_{z^{k+1}} + \delta(z^{k+1})$. Combining this estimate with the above inequality, we get

  \tilde{\lambda}_t(z^{k+1}) \le \big(2 + 2\tilde{\lambda}_t(z^k) + \tilde{\lambda}_t(z^k)^2\big)\tilde{\lambda}_t(z^k)^2 + \big(1 + \tilde{\lambda}_t(z^k)\big)^3\delta(z^k) + \delta(z^{k+1}).

If we choose $\delta(z^k) \le \delta_k \le \frac{\tilde{\lambda}_t(z^k)^2}{1+\tilde{\lambda}_t(z^k)}$, then, by induction, $\delta(z^{k+1}) \le \delta_{k+1} \le \frac{\tilde{\lambda}_t(z^{k+1})^2}{1+\tilde{\lambda}_t(z^{k+1})}$. Substituting these bounds into the last inequality and simplifying the result, we obtain

  \tilde{\lambda}_t(z^{k+1}) \le \Big(\frac{2\tilde{\lambda}_t(z^k)^2 + 4\tilde{\lambda}_t(z^k) + 3}{1 - \tilde{\lambda}_t(z^k)^2\big(2\tilde{\lambda}_t(z^k)^2 + 4\tilde{\lambda}_t(z^k) + 3\big)}\Big)\tilde{\lambda}_t(z^k)^2,

which is indeed (31).

From (31), after a few elementary calculations, we can see that $\tilde{\lambda}_t(z^{k+1}) \le \tilde{\lambda}_t(z^k)$ if $\tilde{\lambda}_t(z^k)(1 + \tilde{\lambda}_t(z^k))(2\tilde{\lambda}_t(z^k)^2 + 4\tilde{\lambda}_t(z^k) + 3) \le 1$. Note that the function $s(\tau) := \tau(1+\tau)(2\tau^2 + 4\tau + 3)$ is increasing on $[0, 0.5(3-\sqrt{5}))$. By numerically evaluating s, we can observe that if $\tilde{\lambda}_t(z^k) \in [0, 0.21027]$, then $\tilde{\lambda}_t(z^{k+1}) \le \tilde{\lambda}_t(z^k)$. Hence, if $\tilde{\lambda}_t(z^k) \le \beta$ then $\tilde{\lambda}_t(z^{k+1}) \le \beta$. In other words, we can say that $\{z^k\} \subset \Omega_t(\beta)$.

We now prove (b). Indeed, if we take any $\beta \in (0, 0.21027]$, we can show from (31) that

  \tilde{\lambda}_t(z^{k+1}) \le \Big(\frac{2\beta^2 + 4\beta + 3}{1 - \beta^2(2\beta^2 + 4\beta + 3)}\Big)\tilde{\lambda}_t(z^k)^2 = c\,\tilde{\lambda}_t(z^k)^2,

where $c := \frac{2\beta^2+4\beta+3}{1-\beta^2(2\beta^2+4\beta+3)} \in (0, +\infty)$. To guarantee $c\beta < 1$, we need to choose $\beta > 0$ such that $2\beta^4 + 6\beta^3 + 7\beta^2 + 3\beta - 1 < 0$. This condition leads to $\beta \in (0, 0.21027]$. Hence, for any $0 < \beta \le 0.21027$, if $z^0 \in Q_t(\beta)$, then $\tilde{\lambda}_t(z^{k+1}) \le c\tilde{\lambda}_t(z^k)^2 < 1$ and, therefore, $\{\tilde{\lambda}_t(z^k)\}$ converges quadratically to zero.

(c) To prove the last conclusion in (c), from (66), we can show that

  dist_{z^k}(0, A_Z(z^{k+1})) \le t\delta_k + t\|\nabla F(z^k)\|^*_{z^k} + t\|z^{k+1} - z^k\|_{z^k} \le t\big(\delta_k + \sqrt{\nu} + \alpha_k\tilde{\lambda}_t(z^k)\big).

Since $\tilde{\lambda}_t(z^k) \le c^{2^k-1}\tilde{\lambda}_t(z^0)^{2^k} \le c^{2^k-1}\beta^{2^k}$, $\delta_k \le \frac{\tilde{\lambda}_t(z^k)^2}{1+\tilde{\lambda}_t(z^k)}$, and $\alpha_k\tilde{\lambda}_t(z^k) = \frac{\tilde{\lambda}_t(z^k)}{1+\tilde{\lambda}_t(z^k)}$, we obtain the last conclusion as a consequence of Lemma 3 with the same proof as in Theorem 2. ⇤

7.6 The proof of Lemma 4: The update rule for the penalty parameter

Let us define $\bar{u}^k := P_{z^k}(z^k - \nabla^2 F(z^k)^{-1}\nabla F(z^k); t_k)$. Then, $\lambda_{t_k}(z^k)$ defined by (21) becomes $\lambda_{t_k}(z^k) := \|G_{z^k}(z^k; t_k)\|^*_{z^k} = \|z^k - P_{z^k}(z^k - \nabla^2 F(z^k)^{-1}\nabla F(z^k); t_k)\|_{z^k} = \|z^k - \bar{u}^k\|_{z^k}$. Note that the definition of $\bar{u}^k$ leads to

  -t_k\big[\nabla F(z^k) + \nabla^2 F(z^k)(\bar{u}^k - z^k)\big] \in A(\bar{u}^k).

Combining this inclusion with (69) and using the monotonicity of A, we can derive

  \langle t_{k+1}\big[\nabla F(z^k) + \nabla^2 F(z^k)(\bar{z}^{k+1} - z^k)\big] - t_k\big[\nabla F(z^k) + \nabla^2 F(z^k)(\bar{u}^k - z^k)\big], \bar{z}^{k+1} - \bar{u}^k\rangle \le 0.

By rearranging this expression using $t_{k+1} := (1 - \Delta_\beta)t_k$ from PFGN, we finally obtain

  \|\bar{z}^{k+1} - \bar{u}^k\|^2_{z^k} \le \frac{\Delta_\beta}{1-\Delta_\beta}\langle\nabla F(z^k) + \nabla^2 F(z^k)(\bar{u}^k - z^k), \bar{z}^{k+1} - \bar{u}^k\rangle
   \le \frac{\Delta_\beta}{1-\Delta_\beta}\|\nabla F(z^k) + \nabla^2 F(z^k)(\bar{u}^k - z^k)\|^*_{z^k}\,\|\bar{z}^{k+1} - \bar{u}^k\|_{z^k},

where the last inequality follows from the elementary Cauchy-Schwarz inequality. This inequality eventually leads to

  \|\bar{z}^{k+1} - \bar{u}^k\|_{z^k} \le \frac{\Delta_\beta}{1-\Delta_\beta}\|\nabla F(z^k) + \nabla^2 F(z^k)(\bar{u}^k - z^k)\|^*_{z^k}
   \le \frac{\Delta_\beta}{1-\Delta_\beta}\big(\|\nabla F(z^k)\|^*_{z^k} + \|\nabla^2 F(z^k)(\bar{u}^k - z^k)\|^*_{z^k}\big)
   \le \frac{\Delta_\beta}{1-\Delta_\beta}\big(\|\nabla F(z^k)\|^*_{z^k} + \|\bar{u}^k - z^k\|_{z^k}\big).

Now, by the triangle inequality, we have $\|\bar{z}^{k+1} - z^k\|_{z^k} \le \|\bar{z}^{k+1} - \bar{u}^k\|_{z^k} + \|\bar{u}^k - z^k\|_{z^k}$. This inequality is equivalent to $\lambda_{t_{k+1}}(z^k) \le \|\bar{z}^{k+1} - \bar{u}^k\|_{z^k} + \lambda_{t_k}(z^k)$ due to the definitions $\lambda_{t_{k+1}}(z^k) = \|\bar{z}^{k+1} - z^k\|_{z^k}$ and $\lambda_{t_k}(z^k) = \|\bar{u}^k - z^k\|_{z^k}$. Using the last estimate in this inequality, we get

  \lambda_{t_{k+1}}(z^k) \le \lambda_{t_k}(z^k) + \frac{\Delta_\beta}{1-\Delta_\beta}\big(\|\nabla F(z^k)\|^*_{z^k} + \lambda_{t_k}(z^k)\big),

which is (32). The second inequality of (32) follows from the fact that $\|\nabla F(z^k)\|^*_{z^k} \le \sqrt{\nu}$.

Let us denote $\Delta_k := \frac{\Delta_\beta}{1-\Delta_\beta}\big(\sqrt{\nu} + \lambda_{t_k}(z^k)\big)$. For a given $\beta \in (0, 1)$, we now assume that $\lambda_{t_k}(z^k) \le \beta$. Then, by using (32) in (27), and the monotone increase of its right-hand side with respect to $\lambda_{t_{k+1}}(z^k)$, we can derive

  \lambda_{t_{k+1}}(z^{k+1}) \le \Big(\frac{\lambda_{t_k}(z^k) + |\Delta_k| + \delta_k}{1 - \lambda_{t_k}(z^k) - |\Delta_k| - \delta_k}\Big)^2 + \frac{\delta_k}{(1 - \lambda_{t_k}(z^k) - |\Delta_k| - \delta_k)^3}
   \le \Big(\frac{\beta + |\Delta_k| + \delta_k}{1 - \beta - |\Delta_k| - \delta_k}\Big)^2 + \frac{\delta_k}{(1 - \beta - |\Delta_k| - \delta_k)^3},

as long as $\beta + |\Delta_k| + \delta_k < 1$. Let us denote $\theta_k := \beta + |\Delta_k|$. By using the identity $\frac{\beta+|\Delta_k|+\delta_k}{1-\beta-|\Delta_k|-\delta_k} = \frac{\beta+|\Delta_k|}{1-\beta-|\Delta_k|} + \frac{\delta_k}{(1-\theta_k)(1-\theta_k-\delta_k)}$, we can rewrite the last inequality as

  \lambda_{t_{k+1}}(z^{k+1}) \le \Big(\frac{\theta_k}{1-\theta_k}\Big)^2 + \Big[\frac{2\theta_k}{(1-\theta_k)^2(1-\theta_k-\delta_k)} + \frac{\delta_k}{(1-\theta_k)^2(1-\theta_k-\delta_k)^2} + \frac{1}{(1-\theta_k-\delta_k)^3}\Big]\delta_k.

If we choose $\delta_k$ such that $0 \le \delta_k \le \theta_k(1-\theta_k) < 1-\theta_k$, then the above inequality implies

  \lambda_{t_{k+1}}(z^{k+1}) \le \Big(\frac{\theta_k}{1-\theta_k}\Big)^2 + \Big[\frac{2\theta_k(1-\theta_k)^2 + \theta_k(1-\theta_k) + 1}{(1-\theta_k)^6}\Big]\delta_k =: \Big(\frac{\theta_k}{1-\theta_k}\Big)^2 + M_k\delta_k.    (77)

Take any $c \in (0, 1)$, e.g., $c := 0.95$, and choose $\delta_k$ such that $0 \le \delta_k \le \frac{(1-c^2)}{c^2 M_k}\big(\frac{\theta_k}{1-\theta_k}\big)^2$. Hence, in order to guarantee $\lambda_{t_{k+1}}(z^{k+1}) \le \beta$, by using (77), we can impose the condition $\big(\frac{\theta_k}{1-\theta_k}\big)^2 + M_k\delta_k \le \frac{1}{c^2}\big(\frac{\theta_k}{1-\theta_k}\big)^2 \le \beta$, which is equivalent to $\frac{\theta_k}{1-\theta_k} \le c\sqrt{\beta}$. This condition leads to $\theta_k \le \frac{c\sqrt{\beta}}{1+c\sqrt{\beta}}$, and therefore $|\Delta_k| \le \frac{c\sqrt{\beta}}{1+c\sqrt{\beta}} - \beta$. Since $|\Delta_k| > 0$, we need to choose $\beta$ such that $0 < \beta < 0.5(1 + 2c^2 - \sqrt{1+4c^2})$.

Next, by the choice of $\delta_k$, we require $0 \le \delta_k \le \min\big\{\frac{(1-c^2)}{c^2 M_k}\big(\frac{\theta_k}{1-\theta_k}\big)^2,\; \theta_k(1-\theta_k)\big\}$. Using the fact that $M_k = \frac{2\theta_k(1-\theta_k)^2+\theta_k(1-\theta_k)+1}{(1-\theta_k)^6}$ from (77) and $0 \le \theta_k \le \frac{c\sqrt{\beta}}{1+c\sqrt{\beta}}$, we can show that the condition on $\delta_k$ holds if we choose

  \delta_k \le \bar{\delta} := \frac{(1-c^2)\beta}{(1+c\sqrt{\beta})^3\big[3c\sqrt{\beta} + c^2\beta + (1+c\sqrt{\beta})^3\big]}.

On the other hand, we have $|\Delta_k| = \frac{\Delta_\beta}{1-\Delta_\beta}\big(\sqrt{\nu} + \lambda_{t_k}(z^k)\big) \le \frac{\Delta_\beta}{1-\Delta_\beta}(\sqrt{\nu} + \beta)$. In order to guarantee that $|\Delta_k| \le \frac{c\sqrt{\beta}}{1+c\sqrt{\beta}} - \beta$, we use the above estimate to impose the condition $\frac{\Delta_\beta}{1-\Delta_\beta} \le \frac{1}{\sqrt{\nu}+\beta}\big(\frac{c\sqrt{\beta}}{1+c\sqrt{\beta}} - \beta\big)$, which leads to

  \Delta_\beta \le \bar{\Delta}_\beta := \frac{c\sqrt{\beta} - \beta(1+c\sqrt{\beta})}{(1+c\sqrt{\beta})\sqrt{\nu} + c\sqrt{\beta}}.

This estimate is exactly the right-hand side of (33). Finally, using (32) and the definition of $\Delta_k$, we can easily show that $\lambda_{t_{k+1}}(z^k) \le \lambda_{t_k}(z^k) + |\Delta_k| \le \beta + |\Delta_k| \equiv \theta_k \le \frac{c\sqrt{\beta}}{1+c\sqrt{\beta}}$. ⇤

7.7 The proof of Theorem 4: The worst-case iteration-complexity of PFGN

By Lemma 3 and $\lambda_{t_{k+1}}(z^k) \le \frac{c\sqrt{\beta}}{1+c\sqrt{\beta}}$, we can see that $z^k$ is an $\varepsilon$-solution of (1) if $t_k \le M_0^{-1}\varepsilon$, where

  M_0 := \Big(1 - \frac{c\sqrt{\beta}}{1+c\sqrt{\beta}}\Big)^{-1}\Big(\sqrt{\nu} + \frac{c\sqrt{\beta}}{1+c\sqrt{\beta}} + 2\bar{\delta}_t(\beta)\Big) = O(\sqrt{\nu}).

On the other hand, by induction, it follows from the update rule $t_{k+1} = (1 - \Delta_\beta)t_k$ of PFGN that $t_k = (1-\Delta_\beta)^k t_0$. Hence, $z^k$ is an $\varepsilon$-solution of (1) if we have $t_k = (1-\Delta_\beta)^k t_0 \le \frac{\varepsilon}{M_0}$. This condition leads to $k\ln(1-\Delta_\beta) \le \ln\big(\frac{\varepsilon}{M_0 t_0}\big)$, which implies $k \ge \frac{\ln(\varepsilon/(M_0 t_0))}{\ln(1-\Delta_\beta)}$. Using the elementary inequality $\ln(1-\Delta_\beta) \le -\Delta_\beta$, we can upper bound k as

  k \le \Big\lfloor\frac{1}{\Delta_\beta}\ln\Big(\frac{M_0 t_0}{\varepsilon}\Big)\Big\rfloor = \Big\lfloor\frac{(1+c\sqrt{\beta})\sqrt{\nu} + c\sqrt{\beta}}{c\sqrt{\beta} - \beta(1+c\sqrt{\beta})}\ln\Big(\frac{M_0 t_0}{\varepsilon}\Big)\Big\rfloor.

Consequently, the worst-case iteration-complexity of PFGN is $O\Big(\sqrt{\nu}\ln\Big(\frac{\sqrt{\nu}\,t_0}{\varepsilon}\Big)\Big)$. ⇤

7.8 The proof of Theorem 5: Finding an initial point for PFGN

From (35), if we define $\nabla\hat{F}(\hat{z}^j) := \nabla F(\hat{z}^j) - t_0^{-1}\tau_{j+1}\hat{\zeta}^0$, then we still have $\nabla^2\hat{F}(\hat{z}^j) = \nabla^2 F(\hat{z}^j)$. Hence, the estimate (27) still holds for $\hat{\lambda}_\tau(\hat{z}^j)$.

Next, if we define $\bar{v}^j := P_{\hat{z}^j}\big(\hat{z}^j - \nabla^2 F(\hat{z}^j)^{-1}(\nabla F(\hat{z}^j) - \tau_j t_0^{-1}\hat{\zeta}^0); t_0\big)$, then, by the definition of $P_{\hat{z}^j}$, we have

  -t_0\big[\nabla^2 F(\hat{z}^j)(\bar{v}^j - \hat{z}^j) + \nabla F(\hat{z}^j) - \tau_j t_0^{-1}\hat{\zeta}^0\big] \in A(\bar{v}^j).    (78)

Similarly, since $\bar{\hat{z}}^{j+1} := P_{\hat{z}^j}\big(\hat{z}^j - \nabla^2 F(\hat{z}^j)^{-1}(\nabla F(\hat{z}^j) - \tau_{j+1}t_0^{-1}\hat{\zeta}^0); t_0\big)$, we have

  -t_0\big[\nabla^2 F(\hat{z}^j)(\bar{\hat{z}}^{j+1} - \hat{z}^j) + \nabla F(\hat{z}^j) - \tau_{j+1}t_0^{-1}\hat{\zeta}^0\big] \in A(\bar{\hat{z}}^{j+1}).    (79)

Using (78), (79), and the monotonicity of A, we have

  t_0\langle\nabla^2 F(\hat{z}^j)(\bar{\hat{z}}^{j+1} - \bar{v}^j), \bar{\hat{z}}^{j+1} - \bar{v}^j\rangle \le (\tau_j - \tau_{j+1})\langle\hat{\zeta}^0, \bar{v}^j - \bar{\hat{z}}^{j+1}\rangle.

Using $\tau_{j+1} := \tau_j - \Delta_j$ and the Cauchy-Schwarz inequality, the last inequality leads to

  t_0\|\bar{\hat{z}}^{j+1} - \bar{v}^j\|_{\hat{z}^j} \le \Delta_j\|\hat{\zeta}^0\|^*_{\hat{z}^j}.    (80)

Now, similar to the proof of Lemma 4, using (80), we can derive

  \hat{\lambda}_{\tau_{j+1}}(\hat{z}^j) \le \hat{\lambda}_{\tau_j}(\hat{z}^j) + \frac{\Delta_j}{t_0}\|\hat{\zeta}^0\|^*_{\hat{z}^j}.    (81)

By the same argument as in the proof of (33), we can show that with $\Delta_k := \frac{\Delta_j}{t_0}\|\hat{\zeta}^0\|^*_{\hat{z}^j}$, we have $|\Delta_k| \le \frac{c\sqrt{\eta}}{1+c\sqrt{\eta}} - \eta$. This shows that $\Delta_j \le t_0\big(\frac{c\sqrt{\eta}}{1+c\sqrt{\eta}} - \eta\big)/\|\hat{\zeta}^0\|^*_{\hat{z}^j}$, which is the first estimate of (37). The second estimate of (37) can be derived as in Lemma 4, using $\eta$ in place of $\beta$.

We prove (38). From (21) and (36), using the triangle inequality, we can upper bound $\lambda_{t_0}(z^0) := \|z^0 - P_{z^0}(z^0 - \nabla^2 F(z^0)^{-1}\nabla F(z^0); t_0)\|_{z^0}$ with $z^0 := \hat{z}^j$ as

  \lambda_{t_0}(z^0) = \|\hat{z}^j - P_{\hat{z}^j}(\hat{z}^j - \nabla^2 F(\hat{z}^j)^{-1}\nabla F(\hat{z}^j); t_0)\|_{\hat{z}^j}
   \le \|\hat{z}^j - P_{\hat{z}^j}\big(\hat{z}^j - \nabla^2 F(\hat{z}^j)^{-1}(\nabla F(\hat{z}^j) - \tau_j t_0^{-1}\hat{\zeta}^0); t_0\big)\|_{\hat{z}^j}
     + \|P_{\hat{z}^j}(\hat{z}^j - \nabla^2 F(\hat{z}^j)^{-1}\nabla F(\hat{z}^j); t_0) - P_{\hat{z}^j}\big(\hat{z}^j - \nabla^2 F(\hat{z}^j)^{-1}(\nabla F(\hat{z}^j) - \tau_j t_0^{-1}\hat{\zeta}^0); t_0\big)\|_{\hat{z}^j}
   \le \hat{\lambda}_{\tau_j}(\hat{z}^j) + \|t_0^{-1}\tau_j\nabla^2 F(\hat{z}^j)^{-1}\hat{\zeta}^0\|_{\hat{z}^j}   [by (36) and (67)]
   = \hat{\lambda}_{\tau_j}(\hat{z}^j) + \tau_j t_0^{-1}\|\hat{\zeta}^0\|^*_{\hat{z}^j},

which proves the first inequality of (38).

By [32, Corollary 4.2.1], we have $\|\hat{\zeta}^0\|^*_{\hat{z}^j} \le \kappa\|\hat{\zeta}^0\|^*_{\bar{z}^\star_F}$, where $\bar{z}^\star_F$ and $\kappa$ are given by (15) and below (15), respectively. Hence, $\bar{\Delta}_\eta := \frac{\mu_\eta}{\kappa\|\hat{\zeta}^0\|^*_{\bar{z}^\star_F}} \le \bar{\Delta}_j$. The second estimate of (38) follows from $\tau_j := \tau_0 - \sum_{l=0}^{j-1}\Delta_l \le 1 - j\bar{\Delta}_\eta$ due to the update rule (35) with $\Delta_j := \bar{\Delta}_j \ge \bar{\Delta}_\eta$.

In order to guarantee $\lambda_{t_0}(z^0) \le \beta$, it follows from (38) and the update rule of $\tau_j$ that

  j \ge \frac{1}{\bar{\Delta}_\eta}\Big(1 - \frac{(\beta - \eta)\,t_0}{\kappa\|\hat{\zeta}^0\|^*_{\bar{z}^\star_F}}\Big).

Finally, substituting $\bar{\Delta}_\eta = \frac{t_0}{\kappa\|\hat{\zeta}^0\|^*_{\bar{z}^\star_F}}\big(\frac{c\sqrt{\eta}}{1+c\sqrt{\eta}} - \eta\big)$ into this estimate and simplifying the result, we obtain the remaining conclusion of Theorem 5. ⇤

7.9 The proof of Theorem 7: Primal recovery for (4) in Algorithm 2

'(y) = t�1

max

x2int(K)

{hc� L⇤y,xi � tf(x)}� ⌫ ln(t).

We show that the point x

k given by (56) solves the above maximization problem. We canwrite down the optimality condition of the above maximization problem as

c� L⇤y

k+1 � tk+1rf(xk+1) = 0,

which leads to rf(xk+1) = t�1

k+1(c�L⇤y

k+1). On the other hand, by the well-known property

of f [32], we have x

k+1= rf⇤

(rf(xk+1)) = rf⇤ �t�1

k+1(c� L⇤y

k+1)

2 int (K).Now, we prove (57). Note that c � L⇤

y

k+1 � tk+1rf(xk+1) = 0 and krf(x)k⇤

x

p⌫,

which leads tokL⇤

y

k+1 � ck⇤x

k+1

= tk+1krf(xk+1)k⇤

x

k+1

tk+1p⌫.

Since tk+1 ", this estimate leads to the first inequality of (57).From (24), there exists e

k 2 Rp such that e

k 2 r'(yk) + r2'(yk

)(y

k+1 � y

k) +

t�1

k+1@ (yk+1

) and kekk⇤y

k �k. This condition leads to

e

k+r'(yk+1

)�r'(yk)�r2'(yk

)(y

k+1 � y

k) 2 r'(yk+1

) + t�1

k+1@ (yk+1

).

Therefore, we have

dist

y

k+1

0,r'(yk+1) + t�1

k+1@ (yk+1

)

kek +r'(yk+1)�r'(yk

)�r2'(yk)(y

k+1 � y

k)k⇤

y

k+1

kekk⇤y

k+1

+ kr'(yk+1)�r'(yk

)�r2'(yk)(y

k+1 � y

k)k⇤

y

k+1

. (82)

Page 33: Self-concordant inclusions: A unified framework for path ...shulu.web.unc.edu/files/2019/07/SelfConInclusion_Jan2018.pdf · third inexact path-following scheme to achieve an "-solution,

Self-concordant inclusions: A unified framework for generalized interior-point methods 33

To estimate the right-hand side of this inequality, we define Mk := kr'(yk+1)�r'(yk

)�r2'(yk

)(y

k+1 � y

k)k⇤

y

k+1

. With the same proof as [32, Theorem 4.1.14], we can show that

Mk �

1� kyk+1 � y

kky

k

��2 kyk+1 � y

kk2y

k �

�(yk) + �tk+1

(y

k)

2

1� �tk+1

(y

k)� �(yk

)

2

. (83)

Here, we use kyk+1 � y

kky

k kyk+1 � ¯

y

k+1ky

k + k¯yk+1 � y

kky

k = �(yk) + �tk+1

(y

k) by the

definitions of �t+

(y) in (21) and of �(y) above (27). Substituting (83) into (82) we get

dist

y

k+1

0,r'(yk+1) + t�1

k+1@ (yk+1

)

kekk⇤y

k+1

+

�(yk) + �tk+1

(y

k)

2

1� �tk+1

(y

k)� �(yk

)

2

. (84)

Next, it remains to estimate $\|e^k\|^*_{y^{k+1}}$. Indeed, we have
\[
\|e^k\|^*_{y^{k+1}} \le \big(1 - \|y^{k+1} - y^k\|_{y^k}\big)^{-1}\|e^k\|^*_{y^k} \le \big(1 - \lambda_{t_{k+1}}(y^k) - \delta(y^k)\big)^{-1}\|e^k\|^*_{y^k} \le \frac{\delta_k}{1 - \lambda_{t_{k+1}}(y^k) - \delta_k}.
\]
Using this estimate in (84) and $\lambda_{t_{k+1}}(y^k) \le c\sqrt{\beta}(1 + c\sqrt{\beta})^{-1}$ from Lemma 4, we obtain
\[
\mathrm{dist}^*_{y^{k+1}}\big(0, \nabla\varphi(y^{k+1}) + t_{k+1}^{-1}\partial\psi(y^{k+1})\big) \le \frac{\delta_k(1 + c\sqrt{\beta})}{1 - \delta_k(1 + c\sqrt{\beta})} + \frac{\big(\delta_k(1 + c\sqrt{\beta}) + c\sqrt{\beta}\big)^2}{\big(1 - \delta_k(1 + c\sqrt{\beta})\big)^2}.
\]

Substituting an upper bound $\bar{\delta} := \frac{(1-c^2)\beta}{(1+c\sqrt{\beta})^3\left[3c\sqrt{\beta} + c^2\beta + (1+c\sqrt{\beta})^3\right]}$ of $\delta_k$ from Lemma 4 into the last estimate and simplifying the result, we get
\[
\mathrm{dist}^*_{y^{k+1}}\big(0, \nabla\varphi(y^{k+1}) + t_{k+1}^{-1}\partial\psi(y^{k+1})\big) \le \theta(c,\beta), \tag{85}
\]
where $\theta(c,\beta)$ is defined as
\[
\theta(c,\beta) := \frac{(1-c^2)\beta}{(1+c\sqrt{\beta})^2\left[3c\sqrt{\beta} + c^2\beta + (1+c\sqrt{\beta})^3\right] - (1-c^2)\beta} + \left(\frac{(1-c^2)\beta + c\sqrt{\beta}(1+c\sqrt{\beta})^2\left[3c\sqrt{\beta} + c^2\beta + (1+c\sqrt{\beta})^3\right]}{(1+c\sqrt{\beta})^2\left[3c\sqrt{\beta} + c^2\beta + (1+c\sqrt{\beta})^3\right] - (1-c^2)\beta}\right)^2. \tag{86}
\]
Using the fact that $c \in (0,1)$ and $0 \le \beta < 0.5\big(1 + 2c^2 - \sqrt{1+4c^2}\big)$, we have $\theta(c,\beta) \le 1$.
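This claim is easy to check numerically; the following sketch scans a grid over the stated range of $(c,\beta)$ and evaluates $\theta(c,\beta)$ exactly as in (86) (the grid resolution is an arbitrary choice).

\begin{verbatim}
import numpy as np

def theta(c, beta):
    s = c * np.sqrt(beta)
    A = (1 + s)**2 * (3*s + c**2 * beta + (1 + s)**3)
    u = (1 - c**2) * beta
    return u / (A - u) + ((u + s * A) / (A - u))**2

worst = 0.0
for c in np.linspace(0.05, 0.95, 19):
    beta_max = 0.5 * (1 + 2*c**2 - np.sqrt(1 + 4*c**2))
    for beta in np.linspace(0.0, 0.999 * beta_max, 50):
        worst = max(worst, theta(c, beta))
print(worst)   # remains below 1 on this grid
\end{verbatim}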

Since $\nabla\varphi(\cdot) = -L\nabla f^*(c - L^*(\cdot)) = -t_{k+1}^{-1}L\nabla f^*\big(t_{k+1}^{-1}(c - L^*(\cdot))\big)$ due to (48), using (56) we can show that $\nabla\varphi(y^{k+1}) = -t_{k+1}^{-1}Lx^{k+1}$. Plugging this expression into (85) and noting that $\partial\psi(\cdot) = \partial g^*(\cdot) + b$, we obtain
\[
\mathrm{dist}^*_{y^{k+1}}\big(Lx^{k+1} - b, \partial g^*(y^{k+1})\big) = \mathrm{dist}^*_{y^{k+1}}\big(0, b - Lx^{k+1} + \partial g^*(y^{k+1})\big) \le t_{k+1}\theta(c,\beta).
\]
Let $s^{k+1} = \pi_{\partial g^*(y^{k+1})}(Lx^{k+1} - b)$ be the projection of $Lx^{k+1} - b$ onto $\partial g^*(y^{k+1})$. Then $s^{k+1} \in \partial g^*(y^{k+1})$, and hence $y^{k+1} \in \partial g(s^{k+1})$, which shows the second term of (57). Using this relation in the last inequality together with the definition of $s^{k+1}$, we obtain $\|Lx^{k+1} - b - s^{k+1}\|^*_{y^{k+1}} \le t_{k+1}\theta(c,\beta)$, which is the third term of (57). Finally, since $\theta(c,\beta) \le 1$ and $\nu \ge 1$, we have $\max\{\sqrt{\nu}, \theta(c,\beta)\} = \sqrt{\nu}$. Using (57), we can conclude that $(x^k, s^k)$ is an $\varepsilon$-solution of (3) if $\sqrt{\nu}\,t_k \le \varepsilon$. $\square$
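The stopping criterion $\sqrt{\nu}\,t_k \le \varepsilon$ translates directly into an iteration count once the decrease rate of $t_k$ is fixed; as a rough illustration, assume a geometric short-step update $t_{k+1} = (1 - \sigma/\sqrt{\nu})\,t_k$ (an assumption made for this sketch, not the paper's exact update rule).

\begin{verbatim}
import math

nu, eps, t0 = 100.0, 1e-6, 1.0   # illustrative barrier parameter, accuracy, t_0
sigma = 0.1                      # assumed decrease coefficient

# Smallest k with sqrt(nu) * t_k <= eps for t_k = (1 - sigma/sqrt(nu))^k * t0.
k = math.log(eps / (math.sqrt(nu) * t0)) / math.log(1 - sigma / math.sqrt(nu))
print(math.ceil(k))              # grows like O(sqrt(nu) * log(1/eps))
\end{verbatim}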


References

1. A. Auslender, M. Teboulle, and S. Ben-Tiba. A logarithmic-quadratic proximal method for variational inequalities. In Comput. Optim., pages 31–40. Springer US, 1999.
2. H.H. Bauschke and P. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer-Verlag, 2nd edition, 2017.
3. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
4. S. Becker and M.J. Fadili. A quasi-Newton proximal splitting method. In Proceedings of Neural Information Processing Systems Foundation (NIPS), 2012.
5. A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 3 of MPS/SIAM Series on Optimization. SIAM, 2001.
6. J.F. Bonnans. Local analysis of Newton-type methods for variational inequalities and nonlinear programming. Appl. Math. Optim., 29:161–186, 1994.
7. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
8. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
9. A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120–145, 2011.
10. P.L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer-Verlag, 2011.
11. P.L. Combettes and V.R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul., 4:1168–1200, 2005.
12. D. Sun, R.S. Womersley, and H. Qi. A feasible semismooth asymptotically Newton method for mixed complementarity problems. Math. Program., 94(1):167–187, 2002.
13. A.L. Dontchev and R.T. Rockafellar. Implicit Functions and Solution Mappings: A View from Variational Analysis. Springer-Verlag, 2014.
14. J. Eckstein and D. Bertsekas. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program., 55:293–318, 1992.
15. J.E. Esser. Primal-dual algorithm for convex models and applications to image restoration, registration and nonlocal inpainting. PhD thesis, University of California, Los Angeles, Los Angeles, USA, 2010.
16. F. Facchinei and J.-S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems, volumes 1–2. Springer-Verlag, 2003.
17. M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
18. M. Friedlander and G. Goh. Efficient evaluation of scaled proximal operators. Electron. Trans. Numer. Anal., 46:1–22, 2017.
19. M. Fukushima. Equivalent differentiable optimization problems and descent methods for asymmetric variational inequality problems. Math. Program., 53:99–110, 1992.
20. T. Goldstein, E. Esser, and R. Baraniuk. Adaptive primal-dual hybrid gradient methods for saddle point problems. Tech. Report, pages 1–26, 2013. http://arxiv.org/pdf/1305.0546v1.pdf.
21. M. Grant, S. Boyd, and Y. Ye. Disciplined convex programming. In L. Liberti and N. Maculan, editors, Global Optimization: From Theory to Implementation, Nonconvex Optimization and its Applications, pages 155–210. Springer, 2006.
22. B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Trans. Inf. Theory, 62:2788–2797, 2016.
23. M. Jaggi. Revisiting Frank–Wolfe: Projection-free sparse convex optimization. JMLR W&CP, 28(1):427–435, 2013.
24. R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), pages 315–323, 2013.
25. G.M. Korpelevic. An extragradient method for finding saddle-points and for other problems. Èkonom. i Mat. Metody, 12(4):747–756, 1976.
26. B. Kummer. Newton's method for non-differentiable functions. Advances in Mathematical Optimization, 45:114–125, 1988.
27. J. Löfberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.
28. T. De Luca, F. Facchinei, and C. Kanzow. A semismooth equation approach to the solution of nonlinear complementarity problems. Math. Program., 75(3):407–439, 1996.
29. R.D.C. Monteiro and B.F. Svaiter. Iteration-complexity of a Newton proximal extragradient method for monotone variational inequalities and inclusion problems. SIAM J. Optim., 22(3):914–935, 2012.
30. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
31. A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim., 15(1):229–251, 2004.
32. Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.
33. Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program., 109(2–3):319–344, 2007.
34. Y. Nesterov. Smoothing technique and its applications in semidefinite optimization. Math. Program., 110(2):245–259, 2007.
35. Y. Nesterov. Gradient methods for minimizing composite objective function. Math. Program., 140(1):125–161, 2013.
36. Y. Nesterov and A. Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994.
37. Y. Nesterov and M.J. Todd. Self-scaled barriers and interior-point methods for convex programming. Math. Oper. Res., 22(1):1–42, 1997.
38. J. Nocedal and S.J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2nd edition, 2006.
39. J.-S. Pang. A B-differentiable equation-based, globally and locally quadratically convergent algorithm for nonlinear programs, complementarity and variational inequality problems. Math. Program., 51(1):101–131, 1991.
40. N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.
41. L. Qi and J. Sun. A nonsmooth version of Newton's method. Math. Program., 58:353–367, 1993.
42. D. Ralph. Global convergence of damped Newton's method for nonsmooth equations via the path search. Math. Oper. Res., 19(2):352–389, 1994.
43. S.M. Robinson. Strongly regular generalized equations. Math. Oper. Res., 5(1):43–62, 1980.
44. S.M. Robinson. Newton's method for a class of nonsmooth functions. Set-Valued Var. Anal., 2:291–305, 1994.
45. R.T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1970.
46. R.T. Rockafellar and R.J.-B. Wets. Variational Analysis. Springer-Verlag, 1997.
47. R. Shefi and M. Teboulle. Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization. SIAM J. Optim., 24(1):269–297, 2014.
48. M.V. Solodov and B.F. Svaiter. A hybrid approximate extragradient–proximal point algorithm using the enlargement of a maximal monotone operator. Set-Valued Var. Anal., 7(4):323–345, 1999.
49. J.F. Sturm. Using SeDuMi 1.02: A Matlab toolbox for optimization over symmetric cones. Optim. Methods Softw., 11–12:625–653, 1999.
50. W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems (NIPS), pages 2510–2518, 2014.
51. K.-Ch. Toh, M.J. Todd, and R.H. Tütüncü. On the implementation and usage of SDPT3 – a Matlab software package for semidefinite-quadratic-linear programming. Tech. Report 4, NUS Singapore, 2010.
52. Q. Tran-Dinh, A. Kyrillidis, and V. Cevher. An inexact proximal path-following algorithm for constrained convex minimization. SIAM J. Optim., 24(4):1718–1745, 2014.
53. Q. Tran-Dinh, A. Kyrillidis, and V. Cevher. A single phase proximal path-following framework. Math. Oper. Res., (accepted), 2018.
54. Q. Tran-Dinh, I. Necoara, C. Savorgnan, and M. Diehl. An inexact perturbed path-following method for Lagrangian decomposition in large-scale separable convex optimization. SIAM J. Optim., 23(1):95–125, 2013.
55. P. Tseng. Applications of splitting algorithm to decomposition in convex programming and variational inequalities. SIAM J. Control Optim., 29:119–138, 1991.
56. P. Tseng. Alternating projection-proximal methods for convex programming and variational inequalities. SIAM J. Optim., 7(4):951–965, 1997.
57. Z. Wen, D. Goldfarb, and W. Yin. Alternating direction augmented Lagrangian methods for semidefinite programming. Math. Program. Comput., 2:203–230, 2010.
58. Z. Wen, W. Yin, and Y. Zhang. Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Math. Program. Comput., 4(4):333–361, 2012.
59. S.J. Wright. Applying new optimization algorithms to model predictive control. In J.C. Kantor, C.E. Garcia, and B. Carnahan, editors, Fifth International Conference on Chemical Process Control – CPC V, pages 147–155. American Institute of Chemical Engineers, 1996.
60. N. Xiu and J. Zhang. Some recent advances in projection-type methods for variational inequalities. J. Comput. Appl. Math., 152(1):559–585, 2003.
61. H. Yamashita, H. Yabe, and K. Harada. A primal-dual interior point method for nonlinear semidefinite programming. Math. Program., 135:89–121, 2012.
62. L. Yang, D. Sun, and K.-C. Toh. SDPNAL+: a majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints. Math. Program. Comput., 7(3):331–366, 2015.