
Advanced Convex Optimization (PGMO)

Yurii Nesterov, CORE/INMA (UCL)

January 20-22, 2016 (Ecole Polytechnique, Paris)

Yu. Nesterov Advanced Convex Optimization (PGMO)

Structure of the course

Main goals:

Theoretical justification of efficiency of optimization methods.

No gap between theory and practice.

Part 1: Black-Box Optimization

Lecture 1. Complexity of Black-Box Optimization

Difficult problems

Lower complexity bounds for Convex Optimization

Optimal methods

Lecture 2. Second order methods. Systems of nonlinear equations

Globally convergent second-order schemes

Cubic regularization for Newton Method

Modified Gauss-Newton method

Part 2: Structural Optimization

Lecture 3. Interior-point methods

Self-concordant functions

Self-concordant barriers

Application examples

Lecture 4. Smoothing Technique

Explicit model of objective function

Smoothing

Application examples

Lecture 5. Huge-scale optimization

Sparsity in optimization problems

Coordinate-descent schemes

Gradient methods with sublinear cost of iteration

References

Books:

Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, 2004.

Yu. Nesterov, A. Nemirovskii. Interior point polynomial methods in convex programming: Theory and Applications. SIAM, Philadelphia, 1994.


Papers

1 Yu. Nesterov. Subgradient methods for huge-scale optimization problems. Mathematical Programming, 146(1-2), 275-297 (2014).

2 Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125-161 (2013).

3 Yu. Nesterov. Efficiency of coordinate-descent methods on huge-scale optimization problems. SIOPT, 22(2), 341-362 (2012).

4 Yu. Nesterov. Simple bounds for boolean quadratic problems. EUROPT Newsletters, 18, 19-23 (2009).

5 Yu. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1), 261-283 (2009).


6 Yu. Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1), 159-181 (2008).

7 Yu. Nesterov, J.-Ph. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6), 1559-1568 (2008).

8 Yu. Nesterov. Modified Gauss-Newton scheme with worst-case guarantees for its global performance. Optimization Methods and Software, 22(3), 469-483 (2007).

9 Yu. Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2), 245-259 (2007).


10 Yu. Nesterov. Dual extrapolation and its application for solving variational inequalities and related problems. Mathematical Programming, 109(2-3), 319-344 (2007).

11 Yu. Nesterov, B. Polyak. Cubic regularization of Newton's method and its global performance. Mathematical Programming, 108(1), 177-205 (2006).

12 Yu. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim., 16(1), 235-249 (2005).

13 Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming (A), 103(1), 127-152 (2005).


Advanced Convex Optimization (PGMO 2016)

Lecture 1. Intrinsic complexity of Black-Box Optimization

Yurii Nesterov, CORE/INMA (UCL)

January 20-22, 2016 (Ecole Polytechnique, Paris)

Outline

1 Basic NP-hard problem

2 NP-hardness of some popular problems

3 Lower complexity bounds for Global Minimization

4 Nonsmooth Convex Minimization. Subgradient scheme.

5 Smooth Convex Minimization. Lower complexity bounds

6 Methods for Smooth Minimization with Simple Constraints

Standard Complexity Classes

Let the data be coded in a matrix A, and let n be the dimension of the problem.

Combinatorial Optimization

NP-hard problems: 2^n operations. Solvable in O(p(n)·‖A‖).

Fully polynomial approximation schemes: O(p(n)·(1/ε)^k·ln^α ‖A‖).

Polynomial-time problems: O(p(n)·ln^α ‖A‖).

Continuous Optimization

Sublinear complexity: O(p(n)·(1/ε)^α·‖A‖^β), α, β > 0.

Polynomial-time complexity: O(p(n)·ln((1/ε)‖A‖)).

Basic NP-hard problem: Problem of stones

Given n stones of integer weights a_1, . . . , a_n, decide whether it is possible to divide them into two parts of equal weight.

Mathematical formulation

Find a Boolean solution x_i = ±1, i = 1, . . . , n, to a single linear equation:

∑_{i=1}^n a_i x_i = 0.

Another variant: ∑_{i=2}^n a_i x_i = a_1.

NB: Solvable in O(ln n · ∑_{i=1}^n |a_i|) operations by the Fast Fourier Transform.
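A pseudo-polynomial check of this decision problem can be sketched with the standard subset-sum dynamic program (our own illustration, not the FFT-based approach mentioned above; the function name is hypothetical):

```python
def can_partition(weights):
    """Decide if integer weights split into two parts of equal sum.

    Standard subset-sum dynamic program: O(n * sum(weights)) time,
    i.e. polynomial in the *value* of the input rather than in its
    bit length -- which is exactly why the problem is still NP-hard.
    """
    total = sum(weights)
    if total % 2:          # odd total: no equal split possible
        return False
    target = total // 2
    reachable = {0}        # subset sums achievable so far
    for w in weights:
        reachable |= {s + w for s in reachable if s + w <= target}
    return target in reachable
```

For example, weights (3, 1, 1, 2, 2, 1) split as {3, 2} vs {1, 1, 2, 1}, while (1, 2, 4) has odd total weight and cannot be split.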

Immediate consequence: quartic polynomial

Theorem: Minimization of a quartic polynomial in n variables is NP-hard.

Proof: Consider the following function:

f(x) = ∑_{i=1}^n x_i^4 − (1/n)(∑_{i=1}^n x_i^2)^2 + (∑_{i=1}^n a_i x_i)^4 + (1 − x_1)^4.

The first part is ⟨A[x]^2, [x]^2⟩, where A = I − (1/n) e_n e_n^T ⪰ 0 with A e_n = 0, and [x]^2_i = x_i^2, i = 1, . . . , n.

Thus, f(x) = 0 iff all x_i = τ, ∑_{i=1}^n a_i x_i = 0, and x_1 = 1.

Corollary: Minimization of a convex quartic polynomial over the unit sphere is NP-hard.

Nonlinear Optimal Control: NP-hard

Problem: min_u { f(x(1)) : x′ = g(x, u), 0 ≤ t ≤ 1, x(0) = x_0 }.

Consider g(x, u) = (1/n)·x·⟨x, u⟩ − u.

Lemma. Let ‖x_0‖^2 = n. Then ‖x(t)‖^2 = n, 0 ≤ t ≤ 1.

Proof. Consider ĝ(x, u) = (xx^T/‖x‖^2 − I)·u and let x′ = ĝ(x, u). Then

⟨x′, x⟩ = ⟨(xx^T/‖x‖^2 − I)·u, x⟩ = 0.

Thus, ‖x(t)‖^2 = ‖x_0‖^2. The same is true for x(t) defined by g.

Note: We have enough degrees of freedom to put x(1) at any position on the sphere.

Hence, our problem is: min { f(y) : ‖y‖^2 = n }.

Descent direction of nonsmooth nonconvex function

Consider φ(x) = (1 − 1/γ)·max_{1≤i≤n} |x_i| − min_{1≤i≤n} |x_i| + |⟨a, x⟩|,

where a ∈ Z^n_+ and γ := ∑_{i=1}^n a_i ≥ 1. Clearly, φ(0) = 0.

Lemma. It is NP-hard to decide whether φ(x) < 0 for some x ∈ R^n.

Proof: 1. Assume that σ ∈ R^n with σ_i = ±1 satisfies ⟨a, σ⟩ = 0. Then φ(σ) = −1/γ < 0.

2. Assume φ(x) < 0 and max_{1≤i≤n} |x_i| = 1. Denote δ = |⟨a, x⟩|.

Then |x_i| > 1 − 1/γ + δ, i = 1, . . . , n.

Denoting σ_i = sign x_i, we have σ_i x_i > 1 − 1/γ + δ. Therefore,

|σ_i − x_i| = 1 − σ_i x_i < 1/γ − δ, and we conclude that

|⟨a, σ⟩| ≤ |⟨a, x⟩| + |⟨a, σ − x⟩| ≤ δ + γ·max_{1≤i≤n} |σ_i − x_i| < (1 − γ)δ + 1 ≤ 1.

Since a ∈ Z^n, this is possible iff ⟨a, σ⟩ = 0.

Black-box optimization

Oracle: Special unit for computing the function value and derivatives at test points (0-1-2 order).

Analytic complexity: Number of oracle calls necessary (sufficient) for solving any problem from the class.

(Lower/Upper complexity bounds.)

Solution: ε-approximation of the minimum.

Resisting oracle: creates the worst problem instance for aparticular method.

Starts from an "empty" problem.

Answers must be compatible with the description of the problem class.

The bad problem is created after the method stops.

Bounds for Global Minimization

Problem: f* = min_x { f(x) : x ∈ B_n }, B_n = { x ∈ R^n : 0 ≤ x ≤ e_n }.

Problem Class: |f(x) − f(y)| ≤ L‖x − y‖_∞ ∀x, y ∈ B_n.

Oracle: f (x) (zero order).

Goal: Find x ∈ Bn: f (x)− f ∗ ≤ ε.

Theorem: N(ε) ≥ (L/(2ε))^n.

Proof. Divide B_n into p^n ℓ_∞-balls of radius 1/(2p).

Resisting oracle: at each test point reply f(x) = 0.

Assume N < p^n. Then ∃ a ball with no questions. Hence, we can take f* = −L/(2p), and thus ε ≥ L/(2p).

Corollary: Uniform Grid method is worst-case optimal.

Nonsmooth Convex Minimization (NCM)

Problem: f* = min_x { f(x) : x ∈ Q }, where

Q ⊆ R^n is a convex set: x, y ∈ Q ⇒ [x, y] ⊆ Q. It is simple.

f(x) is a subdifferentiable convex function:

f(y) ≥ f(x) + ⟨f′(x), y − x⟩, x, y ∈ Q,

for a certain subgradient f′(x) ∈ R^n.

Oracle: f (x), f ′(x) (first order).

Solution: ε-approximation in function value.

Main inequality: 〈f ′(x), x − x∗〉 ≥ f (x)− f ∗ ≥ 0, ∀x ∈ Q.

NB: Anti-subgradient decreases the distance to the optimum.

Computation of subgradients

Denote by ∂f (x) the subdifferential of f at x .

This is the set of all subgradients at x .

1. For f = α_1 f_1 + α_2 f_2 with α_1, α_2 > 0, we have

∂f(x) = α_1 ∂f_1(x) + α_2 ∂f_2(x).

2. For f = max{f_1, f_2}, we have

∂f(x) = Conv{∂f_1(x), ∂f_2(x)} (at points where f_1(x) = f_2(x)).

NCM: Lower Complexity Bounds

Let Q ≡ { ‖x‖ ≤ 2R } and x_{k+1} ∈ x_0 + Lin{f′(x_0), . . . , f′(x_k)}.

Consider the function f_m(x) = L·max_{1≤i≤m} x_i + (μ/2)‖x‖^2 with μ = L/(R m^{1/2}).

From the problem min_τ (Lτ + (μm/2)τ^2), we get

τ* = −L/(μm) = −R/m^{1/2}, f*_m = −L^2/(2μm) = −LR/(2m^{1/2}), ‖x*‖^2 = m τ*^2 = R^2.

NB: If x_0 = 0, then after k iterations we can keep x_i = 0 for i > k.

Lipschitz continuity: f_{k+1}(x_k) − f*_{k+1} ≥ −f*_{k+1} = LR/(2(k+1)^{1/2}).

Strong convexity: f_{k+1}(x_k) − f*_{k+1} ≥ −f*_{k+1} = L^2/(2(k+1)μ).

Both lower bounds are exact!

Subgradient Method (SG)

Problem: min_{x∈Q} { f(x) : g(x) ≤ 0 },

where Q is a closed convex set, and convex f, g ∈ C^{0,0}_L(Q).

SG: If g(x_k)/‖g′(x_k)‖ > h, then a) x_{k+1} = π_Q(x_k − (g(x_k)/‖g′(x_k)‖^2)·g′(x_k)),

else b) x_{k+1} = π_Q(x_k − (h/‖f′(x_k)‖)·f′(x_k)).

Denote f*_N = min_{0≤k≤N} { f(x_k) : k ∈ b) }. Let N = N_a + N_b.

Theorem: If N > (1/h^2)‖x_0 − x*‖^2, then f*_N − f* ≤ hL. (h = ε/L.)

Proof: Denote r_k = ‖x_k − x*‖.

a): r_{k+1}^2 − r_k^2 ≤ −2(g(x_k)/‖g′(x_k)‖^2)·⟨g′(x_k), x_k − x*⟩ + g^2(x_k)/‖g′(x_k)‖^2 ≤ −h^2.

b): r_{k+1}^2 − r_k^2 ≤ −2h·⟨f′(x_k), x_k − x*⟩/‖f′(x_k)‖ + h^2 ≤ −(2h/L)(f(x_k) − f*) + h^2.

Thus, (2hN_b/L)(f*_N − f*) ≤ r_0^2 + h^2(N_b − N_a) = r_0^2 + h^2(2N_b − N).
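The switching scheme a)/b) above can be sketched as follows (a minimal illustration on a toy instance of our own choosing; all names are hypothetical):

```python
import numpy as np

def switching_subgradient(f, fsub, g, gsub, proj, x0, h, N):
    """Switching subgradient scheme for min {f(x): g(x) <= 0, x in Q}:
    step (a) reduces infeasibility when g(x_k)/||g'(x_k)|| > h,
    step (b) takes a normalized f-step of length h otherwise.
    Returns the best f-value among the 'productive' (b)-iterates.
    """
    x = np.array(x0, dtype=float)
    best = np.inf
    for _ in range(N):
        gs = gsub(x)
        if g(x) / np.linalg.norm(gs) > h:                 # step (a)
            x = proj(x - g(x) / np.linalg.norm(gs) ** 2 * gs)
        else:                                             # step (b)
            best = min(best, f(x))
            fs = fsub(x)
            x = proj(x - h / np.linalg.norm(fs) * fs)
    return best

# Toy instance (our own choice): min |x1-2| + |x2|  s.t.  x1 + x2 <= 1,
# with Q = R^2 (trivial projection). Optimum: x* = (1, 0), f* = 1.
f = lambda x: abs(x[0] - 2) + abs(x[1])
fsub = lambda x: np.array([np.sign(x[0] - 2), np.sign(x[1])])
g = lambda x: x[0] + x[1] - 1
gsub = lambda x: np.array([1.0, 1.0])
best = switching_subgradient(f, fsub, g, gsub, lambda y: y,
                             [0.0, 0.0], h=0.01, N=20000)
```

With h = ε/L and N > ‖x_0 − x*‖²/h², the theorem guarantees f*_N ≤ f* + hL over near-feasible points, which is what the run above exhibits.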

Smooth Convex Minimization (SCM)

Lipschitz-continuous gradient: ‖f′(x) − f′(y)‖ ≤ L‖x − y‖.

Geometric interpretation: for all x, y ∈ dom f we have

0 ≤ f(y) − f(x) − ⟨f′(x), y − x⟩ = ∫_0^1 ⟨f′(x + τ(y − x)) − f′(x), y − x⟩ dτ ≤ (L/2)‖x − y‖^2.

Sufficient condition: 0 ⪯ f″(x) ⪯ L·I_n, x ∈ dom f.

Equivalent definition:

f(y) ≥ f(x) + ⟨f′(x), y − x⟩ + (1/(2L))‖f′(x) − f′(y)‖^2.

Hint: Prove first that f(x) − f* ≥ (1/(2L))‖f′(x)‖^2.

SCM: Lower complexity bounds

Consider the family of functions (k ≤ n):

f_k(x) = (1/2)[x_1^2 + ∑_{i=1}^{k−1}(x_i − x_{i+1})^2 + x_k^2] − x_1 ≡ (1/2)⟨A_k x, x⟩ − x_1.

Let R^n_k = { x ∈ R^n : x_i = 0, i > k }. Then f_{k+p}(x) = f_k(x) for x ∈ R^n_k.

Clearly, 0 ≤ ⟨A_k h, h⟩ ≤ h_1^2 + ∑_{i=1}^{k−1} 2(h_i^2 + h_{i+1}^2) + h_k^2 ≤ 4‖h‖^2,

where A_k is zero except for its leading k×k block, which is tridiagonal with 2 on the diagonal and −1 on the off-diagonals.

Hence, A_k x = e_1 has the solution x̄^k with x̄^k_i = (k+1−i)/(k+1) for 1 ≤ i ≤ k, and x̄^k_i = 0 for i > k.

Thus f*_k = (1/2)⟨A_k x̄^k, x̄^k⟩ − ⟨e_1, x̄^k⟩ = −(1/2)⟨e_1, x̄^k⟩ = −k/(2(k+1)), and

‖x̄^k‖^2 = ∑_{i=1}^k ((k+1−i)/(k+1))^2 = (1/(k+1)^2)·∑_{i=1}^k i^2 = k(2k+1)/(6(k+1)).

Let x_0 = 0 and p ≤ n be fixed.

Lemma. If x_k ∈ L_k := Lin{f′_p(x_0), . . . , f′_p(x_{k−1})}, then L_k ⊆ R^n_k.

Proof: x_0 = 0 ∈ R^n_0, f′_p(0) = −e_1 ∈ R^n_1 ⇒ x_1 ∈ R^n_1, f′_p(x_1) ∈ R^n_2, etc.

Corollary 1: f_p(x_k) = f_k(x_k) ≥ f*_k.

Corollary 2: Take p = 2k + 1. Then

(f_p(x_k) − f*_p)/(L‖x_0 − x̄^p‖^2) ≥ [−k/(2(k+1)) + (2k+1)/(2(2k+2))] / [(2k+1)(4k+3)/(3(k+1))] = 3/(4(2k+1)(4k+3)),

‖x_k − x̄^p‖^2 ≥ ∑_{i=k+1}^{2k+1} (x̄^{2k+1}_i)^2 = (2k+3)(k+2)/(24(k+1)) ≥ (1/8)‖x̄^p‖^2.
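The closed-form minimizer of the worst-case function can be verified numerically (our own illustration; names hypothetical):

```python
import numpy as np

def worst_case_quadratic(k, n):
    """Build A_k (tridiagonal [-1, 2, -1] in the leading k-by-k block,
    zero elsewhere) and the closed-form solution of A_k x = e_1:
    x_i = (k+1-i)/(k+1) for i <= k, and 0 for i > k."""
    A = np.zeros((n, n))
    for i in range(k):
        A[i, i] = 2.0
        if i + 1 < k:
            A[i, i + 1] = A[i + 1, i] = -1.0
    x = np.array([(k + 1 - i) / (k + 1) if i <= k else 0.0
                  for i in range(1, n + 1)])
    return A, x

A, x = worst_case_quadratic(k=5, n=8)
e1 = np.zeros(8); e1[0] = 1.0
fstar = 0.5 * x @ A @ x - x[0]      # should equal -k/(2(k+1)) = -5/12
```

The spectral bound 0 ⪯ A_k ⪯ 4I and the norm formula ‖x̄^k‖² = k(2k+1)/(6(k+1)) can be checked the same way.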

Some remarks

1. The rate of convergence of any black-box gradient method applied to f ∈ C^{1,1} cannot be higher than O(1/k^2).

2. We cannot guarantee any rate of convergence in the argument.

3. Let A = LL^T and f(x) = (1/2)⟨Ax, x⟩ − ⟨b, x⟩. Then

f(x) − f* = (1/2)‖L^T x − d‖^2, where d = L^T x*.

Thus, the residual of the linear system L^T x = d cannot be decreased faster than with the rate O(1/k) (provided that we are allowed to multiply by L and L^T).

4. Optimization problems with nontrivial linear equality constraints cannot be solved faster than with the rate O(1/k).

Methods for Smooth Minimization with Simple Constraints

Consider the problem: min_x { f(x) : x ∈ Q },

where convex f ∈ C^{1,1}_L(Q), and Q is a simple closed convex set (allows projections).

Gradient mapping: for M > 0 define

T_M(x) = arg min_{y∈Q} [f(x) + ⟨f′(x), y − x⟩ + (M/2)‖x − y‖^2].

If M ≥ L, then f(T_M(x)) ≤ f(x) + ⟨f′(x), T_M(x) − x⟩ + (M/2)‖x − T_M(x)‖^2.

Reduced gradient: g_M(x) = M·(x − T_M(x)).

Since ⟨f′(x) + M(T_M(x) − x), y − T_M(x)⟩ ≥ 0 for all y ∈ Q,

f(x) − f(T_M(x)) ≥ (M/2)‖x − T_M(x)‖^2 = (1/(2M))‖g_M(x)‖^2, (→ 0)

f(y) ≥ f(x) + ⟨f′(x), T_M(x) − x⟩ + ⟨f′(x), y − T_M(x)⟩ ≥ f(T_M(x)) − (1/(2M))‖g_M(x)‖^2 + ⟨g_M(x), y − T_M(x)⟩.

Primal Gradient Method (PGM)

Main scheme: x_0 ∈ Q, x_{k+1} = T_L(x_k), k ≥ 0.

Primal interpretation: x_{k+1} = π_Q(x_k − (1/L)·f′(x_k)).

Rate of convergence. f(x_k) − f(x_{k+1}) ≥ (1/(2L))‖g_L(x_k)‖^2.

f(T_L(x)) − f* ≤ (1/(2L))‖g_L(x)‖^2 + ⟨g_L(x), T_L(x) − x*⟩ ≤ (1/(2L))(‖g_L(x)‖ + LR)^2 − (L/2)R^2.

Hence, ‖g_L(x)‖ ≥ [2L(f(T_L(x)) − f*) + L^2R^2]^{1/2} − LR

= 2L(f(T_L(x)) − f*) / ([2L(f(T_L(x)) − f*) + L^2R^2]^{1/2} + LR) ≥ (c/R)·(f(T_L(x)) − f*).

Thus, f(x_k) − f(x_{k+1}) ≥ (c^2/(LR^2))·(f(x_{k+1}) − f*)^2.

Similar situation: a′(t) = −a^2(t) ⇒ a(t) ≈ 1/t.

Conclusion: PGM converges as O(1/k). This is far from the lower complexity bounds.
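A minimal sketch of the primal scheme on a toy box-constrained quadratic (our own example; all names are hypothetical):

```python
import numpy as np

def pgm(grad, proj, x0, L, iters):
    """Primal gradient method: x_{k+1} = proj_Q(x_k - f'(x_k)/L),
    the projected form of x_{k+1} = T_L(x_k)."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = proj(x - grad(x) / L)
    return x

# Toy problem (our own choice): f(x) = 0.5*||x - c||^2 over the box
# Q = [0,1]^2, with c outside the box; the solution is proj_Q(c) = (1, 0).
c = np.array([2.0, -0.5])
grad = lambda x: x - c                  # f'(x); Lipschitz constant L = 1
proj = lambda y: np.clip(y, 0.0, 1.0)   # Euclidean projection onto the box
x = pgm(grad, proj, [0.5, 0.5], L=1.0, iters=200)
```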

Dual Gradient Method (DGM)

Model: Let λ^k_i ≥ 0, i = 0, . . . , k, and S_k := ∑_{i=0}^k λ^k_i. Then

S_k f(y) ≥ ℓ_k(y) := ∑_{i=0}^k λ^k_i [f(x_i) + ⟨f′(x_i), y − x_i⟩], y ∈ Q.

DGM: x_{k+1} = arg min_{y∈Q} { ψ_k(y) := ℓ_k(y) + (M/2)‖y − x_0‖^2 }.

Let us choose λ^k_i ≡ 1 and M = L. We prove by induction:

(∗): F*_k := ∑_{i=0}^k f(y_i) ≤ ψ*_k := min_{y∈Q} ψ_k(y). (≤ (k+1)f* + (L/2)R^2)

1. k = 0. Then y_0 = T_L(x_0).

2. Assume (∗) is true for some k ≥ 0. Then

ψ*_{k+1} = min_{y∈Q} [ψ_k(y) + f(x_{k+1}) + ⟨f′(x_{k+1}), y − x_{k+1}⟩]

≥ min_{y∈Q} [ψ*_k + (L/2)‖y − x_{k+1}‖^2 + f(x_{k+1}) + ⟨f′(x_{k+1}), y − x_{k+1}⟩].

We can take y_{k+1} = T_L(x_{k+1}). Thus, (1/(k+1))·∑_{i=0}^k f(y_i) ≤ f* + LR^2/(2(k+1)).
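For Q = R^n the minimizer of ψ_k has a closed form, x_{k+1} = x_0 − (1/L)·∑_{i≤k} f′(x_i), which gives a short sketch of DGM (our own toy instance; names hypothetical):

```python
import numpy as np

def dgm(f, grad, x0, L, iters):
    """Dual gradient method for Q = R^n: x_{k+1} minimizes
    psi_k(y) = sum_i [f(x_i) + <f'(x_i), y - x_i>] + (L/2)||y - x0||^2,
    i.e. x_{k+1} = x0 - (sum of gradients seen so far)/L.
    The certified points are y_k = T_L(x_k) = x_k - f'(x_k)/L."""
    x0 = np.array(x0, dtype=float)
    x, gsum = x0.copy(), np.zeros_like(x0)
    best = np.inf
    for _ in range(iters):
        gx = grad(x)
        best = min(best, f(x - gx / L))   # f(y_k), y_k = T_L(x_k)
        gsum += gx
        x = x0 - gsum / L                 # argmin of the model psi_k
    return best

# Toy problem: f(x) = 0.5*<Ax,x> with A = diag(1, 10); f* = 0, L = 10.
A = np.array([1.0, 10.0])
f = lambda x: 0.5 * np.dot(A * x, x)
best = dgm(f, lambda x: A * x, [1.0, 1.0], L=10.0, iters=500)
```

The run illustrates the O(1/k) guarantee: the best certified value lies below f* + LR²/(2(k+1)).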

Some remarks

1. The dual gradient method works with a model of the objective function.

2. The minimizing sequence {y_k} is not necessary for the algorithmic scheme. We can generate it if needed.

3. Both primal and dual methods have the same rate of convergence O(1/k). It is not optimal.

Maybe we can combine them in order to get a better rate?

Comparing PGM and DGM

Primal Gradient method

Monotonically improves the current state using the local model of the objective.

Interpretation: Practitioners, industry.

Dual Gradient Method

The main goal is to construct a model of the objective.

It is updated by new experience collected around the predicted test points (x_k).

Practical verification of the advice (y_k) is not essential for the procedure.

Interpretation: Science.

Hint: A combination of theory and practice should give better results.

Estimating sequences

Def. Sequences {φ_k(x)}_{k=0}^∞ and {λ_k}_{k=0}^∞, λ_k ≥ 0, are called estimating sequences if λ_k → 0 and, ∀x ∈ Q, k ≥ 0,

(∗): φ_k(x) ≤ (1 − λ_k)f(x) + λ_k φ_0(x).

Lemma: If (∗∗): f(x_k) ≤ φ*_k ≡ min_{x∈Q} φ_k(x), then

f(x_k) − f* ≤ λ_k [φ_0(x*) − f*] → 0.

Proof. f(x_k) ≤ φ*_k = min_{x∈Q} φ_k(x) ≤ min_{x∈Q} [(1 − λ_k)f(x) + λ_k φ_0(x)] ≤ (1 − λ_k)f(x*) + λ_k φ_0(x*).

The rate of λ_k → 0 defines the rate of f(x_k) → f*.

Questions

How do we construct the estimating sequences?

How can we ensure (∗∗)?

Updating estimating sequences

Let φ_0(x) = (L/2)‖x − x_0‖^2, λ_0 = 1, let {y_k}_{k=0}^∞ be a sequence in Q, and let

{α_k}_{k=0}^∞ satisfy α_k ∈ (0, 1), ∑_{k=0}^∞ α_k = ∞. Then {φ_k(x)}_{k=0}^∞, {λ_k}_{k=0}^∞ with

λ_{k+1} = (1 − α_k)λ_k,

φ_{k+1}(x) = (1 − α_k)φ_k(x) + α_k [f(y_k) + ⟨f′(y_k), x − y_k⟩]

are estimating sequences.

Proof: φ_0(x) ≤ (1 − λ_0)f(x) + λ_0 φ_0(x) ≡ φ_0(x).

If (∗) holds for some k ≥ 0, then

φ_{k+1}(x) ≤ (1 − α_k)φ_k(x) + α_k f(x)

= (1 − (1 − α_k)λ_k)f(x) + (1 − α_k)(φ_k(x) − (1 − λ_k)f(x))

≤ (1 − (1 − α_k)λ_k)f(x) + (1 − α_k)λ_k φ_0(x)

= (1 − λ_{k+1})f(x) + λ_{k+1} φ_0(x).

Updating the points

Denote φ*_k = min_{x∈Q} φ_k(x), v_k = arg min_{x∈Q} φ_k(x). Suppose φ*_k ≥ f(x_k). Then

φ*_{k+1} = min_{x∈Q} { (1 − α_k)φ_k(x) + α_k [f(y_k) + ⟨f′(y_k), x − y_k⟩] }

≥ min_{x∈Q} { (1 − α_k)[φ*_k + (λ_k L/2)‖x − v_k‖^2] + α_k [f(y_k) + ⟨f′(y_k), x − y_k⟩] }

≥ min_{x∈Q} { f(y_k) + ((1 − α_k)λ_k L/2)‖x − v_k‖^2 + ⟨f′(y_k), α_k(x − y_k) + (1 − α_k)(x_k − y_k)⟩ }

(choosing y_k := (1 − α_k)x_k + α_k v_k = x_k + α_k(v_k − x_k))

= min_{x∈Q} { f(y_k) + ((1 − α_k)λ_k L/2)‖x − v_k‖^2 + α_k ⟨f′(y_k), x − v_k⟩ }

= min_{y = x_k + α_k(x − x_k), x∈Q} { f(y_k) + ((1 − α_k)λ_k L/(2α_k^2))‖y − y_k‖^2 + ⟨f′(y_k), y − y_k⟩ } (?) ≥ f(x_{k+1}).

Answer: α_k^2 = (1 − α_k)λ_k, x_{k+1} = T_L(y_k).

Optimal method

Choose v_0 = x_0 ∈ Q, λ_0 = 1, φ_0(x) = (L/2)‖x − x_0‖^2.

For k ≥ 0 iterate:

Compute α_k: α_k^2 = (1 − α_k)λ_k ≡ λ_{k+1}.

Define y_k = (1 − α_k)x_k + α_k v_k.

Compute x_{k+1} = T_L(y_k).

φ_{k+1}(x) = (1 − α_k)φ_k(x) + α_k [f(y_k) + ⟨f′(y_k), x − y_k⟩].

Convergence: Denote a_k = λ_k^{−1/2}. Then

a_{k+1} − a_k = (λ_k^{1/2} − λ_{k+1}^{1/2})/(λ_k^{1/2} λ_{k+1}^{1/2}) = (λ_k − λ_{k+1})/(λ_k^{1/2} λ_{k+1}^{1/2}(λ_k^{1/2} + λ_{k+1}^{1/2})) ≥ (λ_k − λ_{k+1})/(2λ_k λ_{k+1}^{1/2}) = α_k/(2λ_{k+1}^{1/2}) = 1/2.

Thus, a_k ≥ 1 + k/2. Hence, λ_k ≤ 4/(k+2)^2.
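For Q = R^n the scheme above is commonly implemented in an equivalent momentum form; a hedged sketch (our own toy example, using the standard t_k stepsize recursion rather than the λ_k bookkeeping):

```python
import numpy as np

def accelerated_gradient(grad, x0, L, iters):
    """Optimal method for unconstrained smooth convex minimization,
    written in its common momentum form (for Q = R^n this matches the
    estimating-sequence scheme up to the stepsize sequence):
    rate O(1/k^2) instead of the O(1/k) of the plain gradient method."""
    x = np.array(x0, dtype=float)
    y = x.copy()
    t = 1.0
    for _ in range(iters):
        x_new = y - grad(y) / L                      # x_{k+1} = T_L(y_k)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + (t - 1.0) / t_new * (x_new - x)  # extrapolation step
        x, t = x_new, t_new
    return x

# Toy problem: ill-conditioned quadratic f(x) = 0.5*<Ax,x>, A = diag(1, 100).
A = np.array([1.0, 100.0])
x = accelerated_gradient(lambda z: A * z, [1.0, 1.0], L=100.0, iters=300)
f_acc = 0.5 * np.dot(A * x, x)
```

After k iterations the standard guarantee is f(x_k) − f* ≤ 2L‖x_0 − x*‖²/(k+1)², consistent with λ_k ≤ 4/(k+2)² above.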

Interpretation

1. φ_k(x) accumulates all previously computed information about the objective. This is a current model of our problem.

2. v_k = arg min_{x∈Q} φ_k(x) is a prediction of the optimal strategy.

3. φ*_k = φ_k(v_k) is an estimate of the optimal value.

4. Acceleration condition: f(x_k) ≤ φ*_k. We need a firm which is at least as good as the best theoretical prediction.

5. Then we create a startup y_k = (1 − α_k)x_k + α_k v_k, and allow it to work for one year.

6. Theorem: Next year, its performance will be at least as good as the new theoretical prediction. And we can continue!

Acceleration result: 10 years instead of 100.

Who is in the right position to arrange Step 5? Government, political institutions.

Advanced Convex Optimization (PGMO 2016)

Lecture 2. Second-order methods. Solving systems of nonlinear equations

Yurii Nesterov, CORE/INMA (UCL)

January 20-22, 2016 (Ecole Polytechnique, Paris)

Outline

1 Historical remarks

2 Trust region methods

3 Cubic regularization of second-order model

4 Local and global convergence

5 Accelerated Cubic Newton

6 Solving the system of nonlinear equations

7 Numerical experiments

Historical remarks

Problem: f(x) → min, x ∈ R^n,

treated as a nonlinear system f′(x) = 0.

Newton method: x_{k+1} = x_k − [f″(x_k)]^{−1} f′(x_k).

Standard objections:

The method is not always well defined (det f″(x_k) = 0).

Possible divergence.

Possible convergence to saddle points or even to local maxima.

Chaotic global behavior.
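The "possible divergence" is easy to demonstrate: on f(x) = √(1+x²) the Newton step reduces to x_{k+1} = −x_k³, which converges cubically for |x_0| < 1 and diverges for |x_0| > 1 (a classical toy example, not from the slides):

```python
import math

def newton_step(x):
    """One Newton step for f(x) = sqrt(1 + x^2), whose minimizer is x = 0.
    f'(x) = x / sqrt(1 + x^2),  f''(x) = (1 + x^2)**(-1.5),
    so x_{k+1} = x - f'(x)/f''(x) = -x**3 (up to roundoff)."""
    fp = x / math.sqrt(1.0 + x * x)
    fpp = (1.0 + x * x) ** -1.5
    return x - fp / fpp

x_conv = 0.5
for _ in range(5):
    x_conv = newton_step(x_conv)            # |x0| < 1: rapid convergence to 0

x_div = newton_step(newton_step(1.1))       # |x0| > 1: iterates blow up
```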

Pre-History (see Ortega, Rheinboldt [1970].)

Bennett [1916]: Newton's method in general analysis.

Levenberg [1944]: Regularization. If f″(x) is not positive semidefinite, then use d = G^{−1} f′(x) with G = f″(x) + γI ≻ 0. (See also Marquardt [1963].)

Kantorovich [1948]: Proof of local quadratic convergence. Assumptions:

a) f ∈ C^3(R^n). b) ‖f″(x) − f″(y)‖ ≤ L_2‖x − y‖. c) f″(x*) ≻ 0. d) x_0 ≈ x*.

Global convergence: Use line search (good advice).

Global performance: Not addressed.

Modern History (see in Conn, Gould and Toint [2000])

Main idea: Trust Region Approach.

1. By some norm ‖·‖_k define the trust region B_k = { x ∈ R^n : ‖x − x_k‖_k ≤ Δ_k }.

2. Denote m_k(x) = f(x_k) + ⟨f′(x_k), x − x_k⟩ + (1/2)⟨G_k(x − x_k), x − x_k⟩.

Variants: G_k = f″(x_k), G_k = f″(x_k) + γ_k I ≻ 0, etc.

3. Compute the trial point x̄_k = arg min_{x∈B_k} m_k(x).

4. Compute the ratio ρ_k = (f(x_k) − f(x̄_k)) / (f(x_k) − m_k(x̄_k)).

5. According to ρ_k, either accept x_{k+1} = x̄_k or update the value Δ_k and repeat the steps above.

Comment

Advantages:

More parameters ⇒ Flexibility

Convergence to a point which satisfies the second-order necessary optimality condition:

f′(x*) = 0, f″(x*) ⪰ 0.

Disadvantages:

Complicated strategies for parameters’ coordination.

For certain ‖ · ‖k the auxiliary problem is difficult.

Line search abilities are quite limited.

Unselective theory.

Global complexity issues are not addressed.

Development of numerical schemes

1. Classical style: Problem formulation ⇒ Method.

Examples:

Gradient and Newton methods in optimization.

Runge-Kutta methods for ODEs, etc.

2. Modern style: (Problem formulation + Problem class) ⇒ Method.

Examples:

Non-smooth convex minimization.

Smooth minimization: min_{x∈Q} f(x), with f ∈ C^{1,1}.

Gradient mapping (Nemirovsky & Yudin, 1977):

x^+ = T(x) ≡ arg min_{y∈Q} m_1(y),

m_1(y) ≡ f(x) + ⟨f′(x), y − x⟩ + (L_1/2)‖y − x‖^2.

Justification: f(y) ≤ m_1(y) for all y ∈ Q.

Using the second-order model

Problem: f(x) → min, x ∈ R^n.

Assumption: Let F be an open convex set. Then

‖f″(x) − f″(y)‖ ≤ L_2‖x − y‖ ∀x, y ∈ F, and L(x_0) = { x ∈ R^n : f(x) ≤ f(x_0) } ⊂ F.

Define m_2(x, y) = f(x) + ⟨f′(x), y − x⟩ + (1/2)⟨f″(x)(y − x), y − x⟩,

m′_2(x, y) = f′(x) + f″(x)(y − x).

Lemma 1. For any x, y ∈ F,

‖f′(y) − m′_2(x, y)‖ ≤ (1/2)L_2‖y − x‖^2,

|f(y) − m_2(x, y)| ≤ (1/6)L_2‖y − x‖^3.

Corollary: For any x and y from F,

f(y) ≤ m_2(x, y) + (1/6)L_2‖y − x‖^3.

Cubic regularization

For M > 0 define fM(x, y) = m2(x, y) + (M/6)‖y − x‖³,

TM(x) ∈ Arg min_y fM(x, y),

where “Arg” indicates that TM(x) is a global minimum.

Computability: If ‖ · ‖ is a Euclidean norm, then TM(x) can be computed from a convex problem.

For r ∈ D ≡ {r ∈ R : f ′′(x) + (Mr/2) I ⪰ 0, r ≥ 0}, denote

v(r) = −½ 〈(f ′′(x) + (Mr/2) I)⁻¹ f ′(x), f ′(x)〉 − (M/12) r³.

Lemma. For M > 0, min_{h∈Rn} fM(x, x + h) = sup_{r∈D} v(r).

If the sup is attained at r∗ with f ′′(x) + (Mr∗/2) I ≻ 0, then

h∗ = −(f ′′(x) + (Mr∗/2) I)⁻¹ f ′(x),

where r∗ > 0 is the unique solution of r = ‖(f ′′(x) + (Mr/2) I)⁻¹ f ′(x)‖.
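The scalar fixed-point characterization of r∗ can be checked numerically. A minimal sketch (not part of the slides; the test Hessian and gradient are hypothetical) that locates r∗ by bisection and returns the global step h∗:

```python
import numpy as np

def cubic_newton_step(H, g, M):
    """Global minimizer h* of <g,h> + 0.5<Hh,h> + (M/6)||h||^3.
    Solves the scalar fixed-point equation from the lemma,
    r = ||(H + (M r/2) I)^{-1} g||, by bisection, then returns
    h* = -(H + (M r*/2) I)^{-1} g."""
    n = len(g)
    lam_min = np.linalg.eigvalsh(H)[0]
    step = lambda r: np.linalg.solve(H + 0.5 * M * r * np.eye(n), g)
    lo = max(0.0, -2.0 * lam_min / M) + 1e-10   # keep H + (M r/2) I positive definite
    hi = lo + 1.0
    while np.linalg.norm(step(hi)) > hi:        # bracket the fixed point
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(step(mid)) > mid else (lo, mid)
    return -step(0.5 * (lo + hi))

# Sanity check of the stationarity condition (H + (M r*/2) I) h* = -g, r* = ||h*||.
H = np.array([[2.0, 0.0], [0.0, -1.0]])         # an indefinite Hessian is fine
g = np.array([1.0, 1.0])
h = cubic_newton_step(H, g, M=2.0)
r = np.linalg.norm(h)
assert np.allclose((H + r * np.eye(2)) @ h, -g, atol=1e-6)
```

Note that the bisection works because ‖(f ′′(x) + (Mr/2)I)⁻¹ f ′(x)‖ − r is strictly decreasing in r on D.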

Simple properties

1. Denote rM(x) = ‖x − TM(x)‖. Then

f ′(x) + f ′′(x)(TM(x) − x) + (M rM(x)/2)(TM(x) − x) = 0,

f ′′(x) + ½ M rM(x) I ⪰ 0.

2. We have 〈f ′(x), x − TM(x)〉 ≥ 0, and

f(x) − fM(x, TM(x)) ≥ (M/12) rM(x)³,

rM(x)² ≥ (2/(L + M)) ‖f ′(TM(x))‖.

3. If M ≥ L, then f(TM(x)) ≤ fM(x, TM(x)).

4. fM(x, TM(x)) ≤ min_y [ f(y) + ((L + M)/6) ‖y − x‖³ ].

Compare with the prox-method: x+ = arg min_y [ f(y) + (M/2) ‖y − x‖² ].

Cubic regularization of Newton method

Consider the process: xk+1 = TL(xk), k = 0, 1, . . . . Note that f(xk+1) ≤ f(xk).

Saddle points. Let f ′(x∗) = 0 and f ′′(x∗) ⋡ 0. Then ∃ε, δ > 0 such that

‖x − x∗‖ ≤ ε, f(x) ≥ f(x∗) ⇒ f(TL(x)) ≤ f(x∗) − δ.

Local convergence. If L(x0) is bounded, then the set of limit points X∗ ≡ lim_{k→∞} xk ≠ ∅.

For any x∗ ∈ X∗ we have f(x∗) = f ∗, f ′(x∗) = 0, f ′′(x∗) ⪰ 0.

Global convergence: gk ≡ min_{1≤i≤k} ‖f ′(xi)‖ ≤ O(1/k^{2/3}).

For the gradient method we can guarantee only gk ≤ O(1/k^{1/2}).

Local rate of convergence: Quadratic.

Global performance: Star-convex functions

Def. For any x∗ ∈ X∗ and any x ∈ F, α ∈ [0, 1] we have f(αx∗ + (1 − α)x) ≤ αf(x∗) + (1 − α)f(x).

Th 1. Let diam F ≤ D. Then

1. If f(x0) − f ∗ ≥ (3/2) L D³, then f(x1) − f ∗ ≤ ½ L D³.

2. If f(x0) − f ∗ ≤ (3/2) L D³, then f(xk) − f ∗ ≤ 3 L D³ / (2(1 + k/3)²).

Let X∗ be non-degenerate: f(x) − f ∗ ≥ (γ/2) ρ²(x, X∗). Denote ω = (1/L²)(γ/2)³.

Th 2. Denote by k0 the first number for which f(xk0) − f ∗ ≤ (4/9) ω.

If k ≤ k0, then f(xk) − f ∗ ≤ [ (f(x0) − f ∗)^{1/4} − (k/6)·√(2/3)·ω^{1/4} ]⁴.

For k ≥ k0 we have f(xk+1) − f ∗ ≤ ½ (f(xk) − f ∗) · √( (f(xk) − f ∗)/ω ).

NB The Hessian f ′′(x∗) can be degenerate!

Global performance: Gradient-dominated functions

Definition. For any x ∈ F and x∗ ∈ X∗ we have f(x) − f(x∗) ≤ τf ‖f ′(x)‖ᵖ

with τf > 0 and p ∈ [1, 2] (degree of domination).

Example 1. Convex functions: f(x) − f ∗ ≤ 〈f ′(x), x − x∗〉 ≤ R‖f ′(x)‖

for ‖x − x∗‖ ≤ R. Thus, p = 1, τf = R.

Example 2. Strongly convex functions: ∀x, y ∈ Rn

f(x) ≤ f(y) + 〈f ′(y), x − y〉 + (1/(2γ)) ‖f ′(x) − f ′(y)‖².

Thus, f(x) − f ∗ ≤ (1/(2γ)) ‖f ′(x)‖² ⇒ p = 2, τf = 1/(2γ).
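Example 2 is easy to verify numerically. A minimal sketch (not from the slides; the strongly convex quadratic below is a hypothetical test function, with γ the smallest eigenvalue of its Hessian):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite Hessian
gamma = np.linalg.eigvalsh(A)[0]         # strong convexity parameter
b = np.array([1.0, -2.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)           # the unique minimizer
f_star = f(x_star)

# check f(x) - f* <= (1/(2*gamma)) ||f'(x)||^2 at random points
for _ in range(100):
    x = rng.normal(size=2) * 10
    assert f(x) - f_star <= grad(x) @ grad(x) / (2 * gamma) + 1e-9
```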

Gradient dominated functions, II

Example 3. Sum of squares. Consider the system g(x) = 0 ∈ Rm, x ∈ Rn.

Assume that m ≤ n and the Jacobian J(x) = (g′1(x), . . . , g′m(x)) is uniformly non-degenerate:

σ ≡ inf_{x∈F} λmin(Jᵀ(x)J(x)) > 0.

Consider the function f(x) = Σ_{i=1}^{m} gi²(x). Then

f(x) − f ∗ ≤ (1/(2σ)) ‖f ′(x)‖².

Thus, p = 2 and τf = 1/(2σ).

Gradient dominated functions: rate of convergence

Theorem 3. Let p = 1. Denote ω = (2/3) L (6τf)³. Let k0 be defined

by f(xk0) − f ∗ ≤ ξ²ω for some ξ > 1. Then for k ≤ k0 we have

ln( (1/ω)(f(xk) − f ∗) ) ≤ (2/3)ᵏ · ln( (1/ω)(f(x0) − f ∗) ).

Otherwise, f(xk) − f ∗ ≤ ω · ξ²(2 + (3/2)ξ)² / (2 + (k + 3/2)ξ)².

Theorem 4. Let p = 2. Denote ω = 1/((144 L)² τf³). Let k0 be defined

by f(xk0) − f ∗ ≤ ω. Then for k ≤ k0 we have

f(xk) − f ∗ ≤ (f(x0) − f ∗) · e^{−kσ}

with σ = ω^{1/4} / (ω^{1/4} + (f(x0) − f ∗)^{1/4}). Otherwise,

f(xk+1) − f ∗ ≤ ω · ((f(xk) − f ∗)/ω)^{4/3}.

NB: Superlinear convergence without a direct nondegeneracy assumption on the Hessian.

Transformations of convex functions

Let u(x) : Rn → Rn be non-degenerate. Denote by v(u) its inverse: v(u(x)) ≡ x.

Consider the function f(x) = φ(u(x)), where φ(u) is a convex function. Denote

σ = max_u {‖v ′(u)‖ : φ(u) ≤ f(x0)},

D = max_u {‖u − u∗‖ : φ(u) ≤ f(x0)}.

Theorem 5.

1. If f(x0) − f ∗ ≥ (3/2) L (σD)³, then f(x1) − f ∗ ≤ ½ L (σD)³.

2. If f(x0) − f ∗ ≤ (3/2) L (σD)³, then f(xk) − f ∗ ≤ 3 L (σD)³ / (2(1 + k/3)²).

Example. u1(x) = x1, u2(x) = x2 + φ1(x1), . . . , un(x) = xn + φn−1(x1, . . . , xn−1),

where φi(·) are arbitrary functions.

Accelerated Newton: Cubic prox-function

Denote d(x) = ⅓‖x − x0‖³.

Lemma. The cubic prox-function is uniformly convex: for all x, y ∈ Rn,

〈d ′(x) − d ′(y), x − y〉 ≥ ½‖x − y‖³,

d(x) − d(y) − 〈d ′(y), x − y〉 ≥ ⅙‖x − y‖³.

Moreover, its Hessian is Lipschitz continuous:

‖d ′′(x) − d ′′(y)‖ ≤ 2‖x − y‖, x, y ∈ Rn.

Remark. In our constructions, we are going to use d(x) instead ofthe standard strongly convex prox-functions.
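The two uniform-convexity inequalities of the Lemma can be spot-checked numerically for the Euclidean norm (a minimal sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = np.zeros(2)

d = lambda x: np.linalg.norm(x - x0) ** 3 / 3.0
d_grad = lambda x: np.linalg.norm(x - x0) * (x - x0)   # d'(x) = ||x - x0|| (x - x0)

for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    r3 = np.linalg.norm(x - y) ** 3
    assert np.dot(d_grad(x) - d_grad(y), x - y) >= 0.5 * r3 - 1e-12
    assert d(x) - d(y) - np.dot(d_grad(y), x - y) >= r3 / 6.0 - 1e-12
```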

Linear estimate functions (Compare with 1st-order methods)

We recursively update the following sequences.

Sequence of estimate functions ψk(x) = lk(x) + (N/2) d(x),

k ≥ 1, where lk(x) are linear and N > 0.

A minimizing sequence {xk}_{k=1}^∞.

A sequence of scaling parameters {Ak}_{k=1}^∞:

Ak+1 ≝ Ak + ak, k ≥ 1.

These objects have to satisfy the following relations:

(∗): Ak f(xk) ≤ ψ∗k ≡ min_x ψk(x),

ψk(x) ≤ Ak f(x) + (L2 + ½N) d(x), ∀x ∈ Rn,

for all k ≥ 1. (⇒ Ak (f(xk) − f(x∗)) ≤ (L2 + ½N) d(x∗).)

For k = 1, we can choose x1 = TL2(x0), l1(x) ≡ f (x1), A1 = 1.

Denote vk = arg min_x ψk(x).

For some ak > 0 and M ≥ 2L2, define

αk = ak/(Ak + ak) ∈ (0, 1),

yk = (1 − αk) xk + αk vk,

xk+1 = TM(yk),

ψk+1(x) = ψk(x) + ak [f(xk+1) + 〈f ′(xk+1), x − xk+1〉].

Theorem. For M = 2L2, N = 12L2, and ak = (k+1)(k+2)/2, k ≥ 1,

relations (∗) hold recursively.

Corollary. For any k ≥ 1 we have f(xk) − f(x∗) ≤ 14 L2 ‖x0 − x∗‖³ / (k(k+1)(k+2)).

Accelerated CNM

Initialization: Set x1 = TL2(x0). Define ψ1(x) = f(x1) + 6L2 · d(x).

Iteration k (k ≥ 1): vk = arg min_{x∈Rn} ψk(x),

yk = (k/(k+3)) xk + (3/(k+3)) vk,   xk+1 = T_{2L2}(yk),

ψk+1(x) = ψk(x) + ((k+1)(k+2)/2) [f(xk+1) + 〈f ′(xk+1), x − xk+1〉].

Remark:

Instead of recursive computation of ψk(x), we can update only one vector:

s1 = 0, sk+1 = sk + ((k+1)(k+2)/2) f ′(xk+1), k ≥ 1.

Then vk can be computed by an explicit expression.

Global non-degeneracy

Standard setting: for convex f ∈ C²(Rn) define positive constants σ1 and L1 such that

σ1‖h‖² ≤ 〈f ′′(x)h, h〉 ≤ L1‖h‖²

for all x, h ∈ Rn. The value γ1(f) = σ1/L1 is called the condition number of f.

(Compatible with the definition in Linear Algebra.)

Geometric interpretation: 〈f ′(x), x − x∗〉 / (‖f ′(x)‖ · ‖x − x∗‖) ≥ 2√γ1(f) / (1 + γ1(f)), x ∈ Rn.

Complexity (1st-order methods):

PGM: O( (1/γ1(f)) · ln(1/ε) ),  FGM: O( (1/√γ1(f)) · ln(1/ε) ).

It does not work for 2nd-order schemes: f(xk) − f ∗ ≤ 14 L2 R³ / (k(k+1)(k+2)).

Global 2nd-order non-degeneracy

Assumption: for any x, y ∈ Rn, the function f ∈ C²(Rn) satisfies the inequalities

‖f ′′(x) − f ′′(y)‖ ≤ L2‖x − y‖,   〈f ′(x) − f ′(y), x − y〉 ≥ σ2‖x − y‖³,

where σ2 > 0. We call the value γ2(f) = σ2/L2 ∈ (0, 1) the 2nd-order

condition number of the function f. (Invariant w.r.t. addition of convex quadratic functions.)

Example: γ2(d) = ¼.

Justification: (σ2/3)‖xk − x∗‖³ ≤ f(xk) − f ∗ ≤ 14 L2 ‖x0 − x∗‖³ / (k(k+1)(k+2)).

Hence, in O( 1/[γ2(f)]^{1/3} ) iterations we halve the distance to x∗.

Complexity bound (Accelerated CNM with restart):

O( (1/[γ2(f)]^{1/3}) · ln(1/ε) ) iterations.

Open questions

1. Problem classes.

2. Lower complexity bounds and optimal methods.

3. Non-degenerate problems: geometric interpretation?

4. Complexity of strongly convex functions.(1st-order schemes?)

5. Consequences for polynomial-time methods.

Solving systems of nonlinear equations

1. Standard Gauss-Newton method

Problem: Find x ∈ Rn satisfying the system F(x) = 0 ∈ Rm.

Assumption: ∀x, y ∈ Rn: ‖F ′(x) − F ′(y)‖ ≤ L‖x − y‖.

Gauss-Newton method: Choose a merit function φ(u) ≥ 0, φ(0) = 0, u ∈ Rm.

Compute x+ ∈ Arg min_y [ φ(F(x) + F ′(x)(y − x)) ].

Usual choice: φ(u) = Σ_{i=1}^{m} ui². (Justification: Why not?)

Remarks

Local quadratic convergence (m ≥ n, non-degeneracy and F(x∗) = 0 (?)).

If m < n, then the method is not well-defined.

No global complexity results.

Modified Gauss-Newton method

Lemma. For all x, y ∈ Rn we have

‖F(y) − F(x) − F ′(x)(y − x)‖ ≤ ½L‖y − x‖².

Corollary. Denote f(y) = ‖F(y)‖. Then

f(y) ≤ ‖F(x) + F ′(x)(y − x)‖ + ½L‖y − x‖².

Modified method:

xk+1 = arg min_y [ ‖F(xk) + F ′(xk)(y − xk)‖ + ½L‖y − xk‖² ].

Remarks

The merit function is non-smooth.

Nevertheless, f (xk+1) < f (xk) unless xk is a stationary point.

Quadratic convergence for non-degenerate solutions.

Global efficiency bounds.

Problem of finding xk+1 is convex.

Different norms in Rn and Rm can be used.
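The convex subproblem of the modified method can be reduced to a scalar equation (a sketch, not from the slides: at a nonzero optimal residual ρ = ‖F + Jh‖, stationarity gives (JᵀJ + LρI)h = −JᵀF, and ρ is found by bisection; the test system below is hypothetical):

```python
import numpy as np

def mgn_step(Fx, Jx, L):
    """One modified Gauss-Newton step: h = argmin ||Fx + Jx h|| + (L/2)||h||^2.
    The scalar residual rho = ||Fx + Jx h(rho)|| is located by bisection on
    [0, ||Fx||], with h(rho) = -(Jx^T Jx + L rho I)^{-1} Jx^T Fx."""
    n = Jx.shape[1]
    h = lambda rho: -np.linalg.solve(Jx.T @ Jx + (L * rho + 1e-15) * np.eye(n),
                                     Jx.T @ Fx)
    lo, hi = 0.0, np.linalg.norm(Fx)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(Fx + Jx @ h(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return h(0.5 * (lo + hi))

# Hypothetical test system with a nondegenerate root at (1, 1):
# F(x) = (x1^2 - 1, x2^2 - 1); its Jacobian is 2-Lipschitz, so L = 2.
F = lambda x: np.array([x[0]**2 - 1.0, x[1]**2 - 1.0])
J = lambda x: np.diag([2.0 * x[0], 2.0 * x[1]])
x = np.array([2.0, 2.0])
for _ in range(30):
    x = x + mgn_step(F(x), J(x), L=2.0)
assert np.linalg.norm(F(x)) < 1e-8
```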

Testing CNM: Chebyshev oscilator

Consider f(x) = ¼ (1 − x⁽¹⁾)² + Σ_{i=1}^{n−1} ( x⁽ⁱ⁺¹⁾ − p2(x⁽ⁱ⁾) )², with

p2(τ) = 2τ² − 1.

Note that p2 is a Chebyshev polynomial: pk(τ) = cos(k arccos τ).

Hence, the equations of the “central path” are

x⁽ⁱ⁺¹⁾ = p2(x⁽ⁱ⁾) = p4(x⁽ⁱ⁻¹⁾) = · · · = p_{2^i}(x⁽¹⁾).

This is an exponential oscillation! However, all coefficients and derivatives are small.

NB: f(x) is unimodal and x∗ = (1, . . . , 1).

In our experiments we usually take x0 = (−1, 1, . . . , 1). Drawback: x0 − 2∇f(x0) = x∗. Hence, sometimes we use x0 = (−1, 0.9, . . . , 0.9).
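The test function is easy to code; the sketch below (not from the slides) also reproduces the “drawback” x0 − 2∇f(x0) = x∗:

```python
import numpy as np

def p2(t):
    return 2 * t**2 - 1                   # Chebyshev polynomial of degree 2

def cheb_f(x):
    chain = x[1:] - p2(x[:-1])
    return 0.25 * (1 - x[0])**2 + np.sum(chain**2)

def cheb_grad(x):
    g = np.zeros_like(x)
    chain = x[1:] - p2(x[:-1])
    g[0] = -0.5 * (1 - x[0])
    g[1:] += 2 * chain                    # d/dx_{i+1} of (x_{i+1} - p2(x_i))^2
    g[:-1] += 2 * chain * (-4 * x[:-1])   # d/dx_i: p2'(t) = 4t
    return g

n = 6
x_star = np.ones(n)
x0 = np.concatenate(([-1.0], np.ones(n - 1)))
assert cheb_f(x_star) == 0.0
assert np.allclose(x0 - 2 * cheb_grad(x0), x_star)   # the "drawback" above
```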

Solving the Chebyshev oscillator by CNM: stopping criterion ‖∇f(x)‖₂ ≤ 10⁻⁸

n Iter DF GNorm NumF Time (s)

2 14 7.0 · 10−19 4.2 · 10−09 18 0.032

3 33 1.1 · 10−24 7.5 · 10−12 51 0.031

4 82 1.7 · 10−20 9.3 · 10−10 148 0.047

5 207 4.5 · 10−19 1.2 · 10−09 395 0.078

6 541 1.0 · 10−17 5.6 · 10−09 1062 0.266

7 1490 1.4 · 10−18 2.9 · 10−09 2959 0.609

8 4087 2.7 · 10−17 9.1 · 10−09 8153 1.782

9 11205 1.6 · 10−16 9.6 · 10−09 22389 5.922

10 30678 2.7 · 10−15 9.6 · 10−09 61335 18.89

11 79292 7.7 · 10−14 1.0 · 10−08 158563 57.813

12 171522 9.7 · 10−13 9.9 · 10−09 343026 144.266

13 385353 1.3 · 10−11 9.9 · 10−09 770691 347.094

14 938758 2.1 · 10−11 1.0 · 10−08 1877500 1232.953

15 2203700 7.8 · 10−11 1.0 · 10−08 4407385 3204.359

Other methods

        Trust region    Knitro    Minos 5.5        Snopt
 n      Inner     Iter    Iter     Iter     NFG    Iter#    NFG
 3        129       50      30       44     120      106     78
 4        431      123      80      136     309      268    204
 5       1310      299     203      339     793      647    509
 6       3963      722     531      871    2022     1417   1149∗
 7      12672     1921    1467     2291    5404      ∗∗∗
 8      40036     5234    4040     6109   14680
 9     120873    13907   11062    11939   28535
10     358317    36837  29729∗      ∗∗∗
11     842368    78854     ∗∗∗
12    2121780   182261

Notation: ∗ early termination, ∗∗∗ numerical difficulties / inaccurate solution, # needs an alternative starting point.

Trust region: very reliable, but T(12) = 2577 sec (Matlab), T(n) = Const · (4.5)ⁿ.

Advanced Convex Optimization (PGMO 2016)

Lecture 3. Structural Optimization: InteriorPoint Methods

Yurii Nesterov, CORE/INMA (UCL)

January 20-22, 2016 (Ecole Polytechnique, Paris)

Outline

1 Interior-point methods: standard problem

2 Newton method

3 Self-concordant functions and barriers

4 Minimizing self-concordant functions

5 Conic optimization problems

6 Primal-dual barriers

7 Primal-dual central path and path-following methods

Interior Point Methods

Black-Box Methods: Main assumptions represent bounds for the size of certain derivatives.

Example

Consider the function f(x1, x2) = x2²/(2x1) for x1 > 0, with f(0, 0) = 0.

It is closed and convex, but discontinuous at the origin.

However, its epigraph {x ∈ R³ : x1 x3 ≥ x2²} is a simple convex set:

x1 = u1 + u3, x2 = u2, x3 = u1 − u3 ⇒ u1 ≥ √(u2² + u3²).

(Lorentz cone)

Question: Can we always replace the functional components byconvex sets?

Standard formulation

Problem: f ∗ = min_{x∈Q} 〈c, x〉,

where Q ⊂ E is a closed convex set with nonempty interior.

How can we measure the quality of x ∈ Q?

1. The residual 〈c, x〉 − f ∗ is not very informative, since it does not depend on the position of x inside Q.

2. The boundary of a convex set can be very complicated.

3. It is easy to travel inside, provided that we keep a sufficient distance to the boundary.

Conclusion: we need a barrier function f(x):

dom f = int Q,

f(x) → ∞ as x → ∂Q.

Path-following method

Central path: for t > 0 define x∗(t) by tc + f ′(x∗(t)) = 0

(hence x∗(t) = arg min_x [ Ψt(x) ≝ t〈c, x〉 + f(x) ]).

Lemma. Suppose 〈f ′(x), y − x〉 ≤ A for all x, y ∈ dom f. Then

〈c, x∗(t) − x∗〉 = (1/t) 〈f ′(x∗(t)), x∗ − x∗(t)〉 ≤ A/t.

Method: tk > 0, xk ≈ x∗(tk) ⇒ tk+1 > tk, xk+1 ≈ x∗(tk+1).

For approximating x∗(tk+1), we need a powerful minimization scheme.

Main candidate: the Newton Method. (Very good local convergence.)

Classical results on the Newton Method

Method: xk+1 = xk − [f ′′(xk)]⁻¹ f ′(xk).

Assume that:

f ′′(x∗) ⪰ ℓ · In,

‖f ′′(x) − f ′′(y)‖ ≤ M‖x − y‖ ∀x, y ∈ Rn,

and the starting point x0 is close to x∗: ‖x0 − x∗‖ < r = 2ℓ/(3M).

Then ‖xk − x∗‖ < r for all k, and the Newton method converges quadratically:

‖xk+1 − x∗‖ ≤ M‖xk − x∗‖² / (2(ℓ − M‖xk − x∗‖)).

Note:

The description of the region of quadratic convergence is given in terms of the metric 〈·, ·〉. The resulting neighborhood changes when we choose another metric.

Simple observation

Let f(x) satisfy our assumptions. Consider φ(y) = f(Ay), where A is a non-degenerate (n × n)-matrix.

Lemma: Let {xk} be the sequence generated by the Newton Method for the function f.

Consider the sequence {yk} generated by the Newton Method for the function φ with y0 = A⁻¹x0.

Then yk = A⁻¹xk for all k ≥ 0.

Proof: Assume yk = A⁻¹xk for some k ≥ 0. Then

yk+1 = yk − [φ′′(yk)]⁻¹φ′(yk) = yk − [Aᵀ f ′′(Ayk) A]⁻¹ Aᵀ f ′(Ayk)

= A⁻¹xk − A⁻¹[f ′′(xk)]⁻¹f ′(xk) = A⁻¹xk+1.

Conclusion: The method is affine-invariant. Its region of quadratic convergence does not depend on the metric!
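The Lemma can be confirmed numerically. A sketch (not from the slides; the smooth convex test function and the matrix A are hypothetical):

```python
import numpy as np

# Newton for f(x) = sum(exp(x_i) - x_i) and for phi(y) = f(Ay) should
# produce sequences related by yk = A^{-1} xk.
def grad_hess(x):
    e = np.exp(x)
    return e - 1.0, np.diag(e)

A = np.array([[2.0, 1.0], [0.5, 3.0]])       # non-degenerate matrix
x = np.array([0.7, -0.4])
y = np.linalg.solve(A, x)                    # y0 = A^{-1} x0

for _ in range(5):
    g, H = grad_hess(x)
    gy = A.T @ grad_hess(A @ y)[0]           # phi'(y)  = A^T f'(Ay)
    Hy = A.T @ grad_hess(A @ y)[1] @ A       # phi''(y) = A^T f''(Ay) A
    x = x - np.linalg.solve(H, g)
    y = y - np.linalg.solve(Hy, gy)
    assert np.allclose(y, np.linalg.solve(A, x), atol=1e-8)
```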

What was wrong?

Old assumption: ‖f ′′(x) − f ′′(y)‖ ≤ M‖x − y‖.

Let f ∈ C³(Rn). Denote f ′′′(x)[u] = lim_{α→0} (1/α)[f ′′(x + αu) − f ′′(x)].

This is a matrix!

Then the old assumption is equivalent to ‖f ′′′(x)[u]‖ ≤ M‖u‖. Hence, at any point x ∈ Rn we have

(∗): |〈f ′′′(x)[u]v, v〉| ≤ M‖u‖ · ‖v‖² for all u, v ∈ Rn.

Note:

The LHS of (∗) is an affine-invariant directional derivative.

The norm ‖ · ‖ has nothing in common with our particular f.

However, there exists a local norm which is closely related to f: ‖u‖_{f ′′(x)} = 〈f ′′(x)u, u〉^{1/2}.

Let us make a similar assumption in terms of ‖ · ‖_{f ′′(x)}.

Definition of Self-Concordant Function

Let f(x) ∈ C³(dom f) be closed and convex, with open domain.

Let us fix a point x ∈ dom f and a direction u ∈ Rn.

Consider the function φ(x; t) = f(x + tu). Denote

Df(x)[u] = φ′t(x; 0) = 〈f ′(x), u〉,

D²f(x)[u, u] = φ′′tt(x; 0) = 〈f ′′(x)u, u〉 = ‖u‖²_{f ′′(x)},

D³f(x)[u, u, u] = φ′′′ttt(x; 0) = 〈f ′′′(x)[u]u, u〉.

Def. We call a function f self-concordant if the inequality |D³f(x)[u, u, u]| ≤ 2‖u‖³_{f ′′(x)} holds for any x ∈ dom f, u ∈ Rn.

Note:

We cannot expect that these functions are very common.

We hope that they are good for the Newton Method.

Examples

1. A linear function is s.c., since f ′′(x) ≡ 0, f ′′′(x) ≡ 0.

2. A convex quadratic function is s.c. (f ′′′(x) ≡ 0).

3. Logarithmic barrier for a ray {x > 0}:

f(x) = −ln x, f ′(x) = −1/x, f ′′(x) = 1/x², f ′′′(x) = −2/x³.

4. Logarithmic barrier for a quadratic region. Consider a concave function φ(x) = α + 〈a, x〉 − ½〈Ax, x〉. Define f(x) = −ln φ(x).

Df(x)[u] = −(1/φ(x)) [〈a, u〉 − 〈Ax, u〉] ≝ ω1,

D²f(x)[u]² = (1/φ²(x)) [〈a, u〉 − 〈Ax, u〉]² + (1/φ(x)) 〈Au, u〉,

D³f(x)[u]³ = −(2/φ³(x)) [〈a, u〉 − 〈Ax, u〉]³ − (3〈Au, u〉/φ²(x)) [〈a, u〉 − 〈Ax, u〉].

With ω2 = 〈Au, u〉/φ(x): D² = ω1² + ω2, D³ = 2ω1³ + 3ω1ω2. Hence, |D³| ≤ 2|D²|^{3/2}.
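For the one-dimensional barrier of Example 3 the self-concordance inequality holds with equality, which is easy to check numerically (a minimal sketch, not part of the slides):

```python
import numpy as np

# For f(x) = -ln x:  f''(x) = 1/x^2,  f'''(x) = -2/x^3,
# so |f'''(x)| = 2 f''(x)^{3/2} exactly, for every x > 0.
for x in [0.1, 0.5, 1.0, 3.0, 10.0]:
    f2 = 1.0 / x**2
    f3 = -2.0 / x**3
    assert abs(f3) <= 2 * f2**1.5 + 1e-12
    assert np.isclose(abs(f3), 2 * f2**1.5)
```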

Simple properties

1. If f1, f2 are s.c. functions, then f1 + f2 is an s.c. function.

2. If f(y) is s.c., then φ(x) = f(Ax + b) is also an s.c. function.

Proof: Denote y = y(x) = Ax + b, v = Au. Then

Dφ(x)[u] = 〈f ′(y(x)), Au〉 = 〈f ′(y), v〉,

D²φ(x)[u]² = 〈f ′′(y(x))Au, Au〉 = 〈f ′′(y)v, v〉,

D³φ(x)[u]³ = D³f(y(x))[Au]³ = D³f(y)[v]³.

Example: f(x) = 〈c, x〉 − Σ_{i=1}^{m} ln(ai − ‖Aix − bi‖²) is an s.c. function.

Main properties

Let x ∈ dom f and u ∈ Rn, u ≠ 0. For x + tu ∈ dom f, consider

φ(t) = 1 / 〈f ′′(x + tu)u, u〉^{1/2}.

Lemma 1. For all feasible t we have |φ′(t)| ≤ 1.

Proof: Indeed, φ′(t) = −f ′′′(x + tu)[u]³ / (2〈f ′′(x + tu)u, u〉^{3/2}).

Corollary 1: dom φ contains the interval (−φ(0), φ(0)).

Proof: Since f(x + tu) → ∞ as x + tu → ∂ dom f, the same is true for 〈f ′′(x + tu)u, u〉. Hence dom φ ≡ {t | φ(t) > 0}.

Denote ‖h‖²x = 〈f ′′(x)h, h〉 and W 0(x; r) = {y ∈ Rn | ‖y − x‖x < r}. Then

W 0(x; r) ⊆ dom f for r < 1.

Main inequalities

Denote W(x; r) = {y ∈ Rn | ‖y − x‖x < r}.

Theorem. For all x, y ∈ dom f the following inequality holds:

‖y − x‖y ≥ ‖y − x‖x / (1 + ‖y − x‖x).

If ‖y − x‖x < 1, then ‖y − x‖y ≤ ‖y − x‖x / (1 − ‖y − x‖x).

Proof. 1. Let us choose u = y − x. Then

φ(1) = 1/‖y − x‖y, φ(0) = 1/‖y − x‖x,

and φ(1) ≤ φ(0) + 1 in view of Lemma 1.

2. If ‖y − x‖x < 1, then φ(0) > 1, and in view of Lemma 1, φ(1) ≥ φ(0) − 1.

Useful inequalities

Theorem. For any x, y ∈ dom f we have:

〈f ′(y) − f ′(x), y − x〉 ≥ ‖y − x‖²x / (1 + ‖y − x‖x),

f(y) ≥ f(x) + 〈f ′(x), y − x〉 + ω(‖y − x‖x),

where ω(t) = t − ln(1 + t).

Proof. Denote yτ = x + τ(y − x), τ ∈ [0, 1], and r = ‖y − x‖x. Then

〈f ′(y) − f ′(x), y − x〉 = ∫₀¹ 〈f ′′(yτ)(y − x), y − x〉 dτ

= ∫₀¹ (1/τ²) ‖yτ − x‖²_{yτ} dτ ≥ ∫₀¹ r²/(1 + τr)² dτ = r ∫₀ʳ dt/(1 + t)² = r²/(1 + r),

f(y) − f(x) − 〈f ′(x), y − x〉 = ∫₀¹ 〈f ′(yτ) − f ′(x), y − x〉 dτ

≥ ∫₀¹ ‖yτ − x‖²x / (τ(1 + ‖yτ − x‖x)) dτ = ∫₀¹ τr²/(1 + τr) dτ = ∫₀ʳ t dt/(1 + t) = ω(r).

Similar inequalities

Theorem. Let x ∈ dom f and ‖y − x‖x < 1. Then

〈f ′(y) − f ′(x), y − x〉 ≤ ‖y − x‖²x / (1 − ‖y − x‖x),

f(y) ≤ f(x) + 〈f ′(x), y − x〉 + ω∗(‖y − x‖x),

where ω∗(t) = −t − ln(1 − t).

Main Theorem: for any y ∈ W(x; r), r ∈ [0, 1), we have

(1 − r)² f ′′(x) ⪯ f ′′(y) ⪯ (1/(1 − r)²) f ′′(x).

Corollary. For G = ∫₀¹ f ′′(x + τ(y − x)) dτ, we have

(1 − r + r²/3) f ′′(x) ⪯ G ⪯ (1/(1 − r)) f ′′(x).

Observation: If dom f contains no straight line, then f ′′(x) ≻ 0 for any x ∈ dom f. (If not, then W(x; 1) is unbounded.)

Minimizing the self-concordant function

Consider the problem: min{f(x) | x ∈ dom f}. Assume dom f contains no straight line.

Theorem. Let λf(x) < 1 for some x ∈ dom f. Then the solution x∗f of this problem exists and is unique.

Proof. Indeed, for any y ∈ dom f we have:

f(y) ≥ f(x) + 〈f ′(x), y − x〉 + ω(‖y − x‖x)

≥ f(x) − ‖f ′(x)‖∗x · ‖y − x‖x + ω(‖y − x‖x)

= f(x) − λf(x) · ‖y − x‖x + ω(‖y − x‖x).

Since ω(t) = t − ln(1 + t), the level sets are bounded. ⇒ ∃ x∗f.

It is unique since f(y) ≥ f(x∗f) + ω(‖y − x∗f‖_{x∗f}) and f ′′(x∗f) is non-degenerate.

Example: f(x) = (1 − ε)x − ln x with ε ∈ (0, 1) and x = 1 (here λf(1) = ε < 1).

Damped Newton Method

Consider the following scheme: x0 ∈ dom f,

xk+1 = xk − (1/(1 + λf(xk))) [f ′′(xk)]⁻¹ f ′(xk).

Theorem. For any k ≥ 0 we have f(xk+1) ≤ f(xk) − ω(λf(xk)).

Proof. Denote λ = λf(xk). Then ‖xk+1 − xk‖_{xk} = λ/(1 + λ). Therefore,

f(xk+1) ≤ f(xk) + 〈f ′(xk), xk+1 − xk〉 + ω∗(‖xk+1 − xk‖_{xk})

= f(xk) − ω(λ).

Consequence: we come to the region λf(xk) ≤ const in O(f(x0) − f ∗) iterations.
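A one-dimensional sketch of this scheme (the test function f(x) = x − ln x, with minimizer x∗ = 1, is an assumption of this sketch; here λf(x) = |f ′(x)|/√(f ′′(x))):

```python
import numpy as np

# Damped Newton for the self-concordant f(x) = x - ln x on x > 0.
f = lambda x: x - np.log(x)
df = lambda x: 1.0 - 1.0 / x
d2f = lambda x: 1.0 / x**2

x = 0.05                                   # deep inside the domain, far from x* = 1
for _ in range(50):
    lam = abs(df(x)) / np.sqrt(d2f(x))     # local norm of the gradient
    x_new = x - df(x) / ((1 + lam) * d2f(x))
    # guaranteed decrease f(x+) <= f(x) - omega(lambda), omega(t) = t - ln(1+t)
    assert f(x_new) <= f(x) - (lam - np.log(1 + lam)) + 1e-12
    x = x_new
assert abs(x - 1.0) < 1e-10
```

Note that the damped step keeps every iterate strictly inside the domain, exactly as the theory promises.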

Local convergence

For x close to x∗ (f ′(x∗) = 0), the function f(x) is almost quadratic:

f(x) ≈ f ∗ + ½〈f ′′(x∗)(x − x∗), x − x∗〉.

Therefore, f(x) − f ∗ ≈ ½‖x − x∗‖²_{x∗} ≈ ½‖x − x∗‖²x

≈ ½〈f ′(x), [f ′′(x)]⁻¹f ′(x)〉 ≝ ½ (‖f ′(x)‖∗x)² ≝ ½ λ²f(x).

The last value is the local norm of the gradient. It is computable.

Theorem: Let x ∈ dom f and λf(x) < 1.

Then the point x+ = x − [f ′′(x)]⁻¹f ′(x) belongs to dom f and

λf(x+) ≤ ( λf(x) / (1 − λf(x)) )².

Proof

Denote p = x+ − x, λ = λf(x). Then ‖p‖x = λ < 1, so x+ ∈ dom f, and

λf(x+) ≤ (1/(1 − ‖p‖x)) ‖f ′(x+)‖∗x = (1/(1 − λ)) ‖f ′(x+)‖∗x.

Note that f ′(x+) = f ′(x+) − f ′(x) − f ′′(x)(x+ − x) = Gp, where

G = ∫₀¹ [f ′′(x + τp) − f ′′(x)] dτ. Therefore

(‖f ′(x+)‖∗x)² = 〈[f ′′(x)]⁻¹Gp, Gp〉 ≤ ‖H‖² · ‖p‖²x,

where H = [f ′′(x)]^{−1/2} G [f ′′(x)]^{−1/2}. In view of the Corollary,

(−λ + ⅓λ²) f ′′(x) ⪯ G ⪯ (λ/(1 − λ)) f ′′(x).

Therefore ‖H‖ ≤ max{ λ/(1 − λ), λ − ⅓λ² } = λ/(1 − λ), and

λ²f(x+) ≤ (1/(1 − λ)²) (‖f ′(x+)‖∗x)² ≤ λ⁴/(1 − λ)⁴.

NB: The region of quadratic convergence is λf(x) < λ̄, where λ̄/(1 − λ̄)² = 1, i.e. λ̄ = (3 − √5)/2.

It is affine-invariant!

Following the cental path

Consider Ψt(x) = t〈c, x〉 + f(x) with a s.c. function f.

For Ψt, the Newton Method has local quadratic convergence.

The region of quadratic convergence (RQC) is given by λ_{Ψt}(x) ≤ β < λ̄.

Assume we know x = x∗(t). We want to update t: t+ = t + ∆, keeping x in the RQC of Ψ_{t+∆}: λ_{Ψ_{t+∆}}(x) ≤ β.

Question: How large can ∆ be? Since tc + f ′(x) = 0, we have:

λ_{Ψ_{t+∆}}(x) = ‖t+ c + f ′(x)‖∗x = |∆| · ‖c‖∗x = (|∆|/t) ‖f ′(x)‖∗x ≤ β.

Conclusion: for a linear rate of growth of t, we need 〈[f ′′(x)]⁻¹f ′(x), f ′(x)〉 to be uniformly bounded on dom f.

Thus, we come to the definition of a self-concordant barrier.

Definition of Self-Concordant Barrier

Let F(x) be an s.c. function. It is a ν-self-concordant barrier if

max_{u∈Rn} [2〈F ′(x), u〉 − 〈F ′′(x)u, u〉] ≤ ν for all x ∈ dom F.

The value ν is called the parameter of the barrier.

If F ′′(x) is non-degenerate, then 〈F ′(x), [F ′′(x)]⁻¹F ′(x)〉 ≤ ν.

Another form: 〈F ′(x), u〉² ≤ ν 〈F ′′(x)u, u〉.

Main property: 〈F ′(x), y − x〉 ≤ ν, x, y ∈ int Q.

NB: ν is responsible for the rate of the path-following method: t+ = t ± α t / ν^{1/2}.

Complexity: O(√ν ln(ν/ε)) iterations of the Newton method.

Calculus:

1. Affine transformations do not change ν.

2. Restriction to a subspace can only decrease ν.

3. F = F1 + F2 ⇒ ν = ν1 + ν2.
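A tiny end-to-end sketch of the resulting path-following method (not from the slides: min{x : 0 ≤ x ≤ 1} with the barrier F(x) = −ln x − ln(1 − x), ν = 2; the constants γ = 5/36 and the starting point x ≈ x∗(1) are the standard short-step choices and are assumptions of this sketch):

```python
import numpy as np

# Per update t+ = t(1 + gamma/sqrt(nu)) we do one full Newton step
# on Psi_t(x) = t*x + F(x), F(x) = -ln x - ln(1-x).
nu, gamma = 2.0, 5.0 / 36.0
dF = lambda x: -1.0 / x + 1.0 / (1.0 - x)
d2F = lambda x: 1.0 / x**2 + 1.0 / (1.0 - x)**2

t, x = 1.0, (3.0 - np.sqrt(5.0)) / 2.0   # x*(1): root of 1 - 1/x + 1/(1-x) = 0
for _ in range(200):
    t *= 1.0 + gamma / np.sqrt(nu)
    x -= (t + dF(x)) / d2F(x)            # one Newton step for Psi_t
    assert 0.0 < x < 1.0                 # iterates stay strictly feasible
assert x < 2.0 * nu / t                  # gap shrinks like nu/t
```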

Examples

1. Barrier for a ray: F(t) = −ln t, F ′(t) = −1/t, F ′′(t) = 1/t², ν = 1.

2. Polytope {x : 〈ai, x〉 ≤ bi}: F(x) = −Σ_{i=1}^{m} ln(bi − 〈ai, x〉), ν = m.

3. ℓ2-ball: F(x) = −ln(1 − ‖x‖²), D1 = ω1, D2 = ω1² + ω2, ν = 1.

4. Intersection of ellipsoids: F(x) = −Σ_{i=1}^{m} ln(ri² − ‖Aix − bi‖²), ν = m.

5. Epigraph {t ≥ eˣ}: F(x, t) = −ln(t − eˣ) − ln(ln t − x), ν = 4.

6. Universal barrier. Define the polar set P(x) = {s : 〈s, y − x〉 ≤ 1 ∀y ∈ Q}.

Then F(x) = −ln vol_n P(x) is an O(n)-s.c. barrier for Q.

7. Lorentz cone {t ≥ ‖x‖}: F(x, t) = −ln(t² − ‖x‖²), ν = 2.

8. LMI cone {X = Xᵀ ⪰ 0}: F(X) = −ln det X, ν = n.

Conic minimization problems

Problem: f ∗ = min{〈c, x〉 : Ax = b, x ∈ K},

where A ∈ Rm×n : Rn → Rm, m < n, c ∈ Rn, b ∈ Rm,

and K, int K ≠ ∅, is a closed convex pointed cone:

1. ∀x1, x2 ∈ K ⇒ x1 + x2 ∈ K.
2. ∀x ∈ K, τ ≥ 0 ⇒ τx ∈ K.
3. K contains no straight line.

Assumptions:

A is non-degenerate, b ≠ 0.

There is no y ∈ Rm such that c = Aᵀy.

Explanations:

If b = 0, then either f ∗ = 0 or f ∗ = −∞.

If ∃y ∈ Rm : c = Aᵀy, then for all x with Ax = b we have:

〈c, x〉 = 〈Aᵀy, x〉 = 〈y, Ax〉 = 〈b, y〉.

Main Assumption:

There exists a computable ν-normal barrier F(x) for K, i.e.:

F(x) is a ν-self-concordant barrier for K.

F(x) is logarithmically homogeneous: ∀x ∈ int K, τ > 0,

(∗) F(τx) = F(x) − ν ln τ.

Examples:

1. Positive orthant: K = Rn₊ ≝ {x ∈ Rn : x⁽ⁱ⁾ ≥ 0, i = 1, . . . , n},

F(x) = −Σ_{i=1}^{n} ln x⁽ⁱ⁾, ν = n.

2. Cone of positive semidefinite matrices:

K = Sⁿ₊ ≝ {X ∈ Rn×n : X = Xᵀ, 〈Xu, u〉 ≥ 0 ∀u ∈ Rn},

F(X) = −ln det X, ν = n.

3. 2nd-order cone: K = Lⁿ ≝ {z = (x, τ) ∈ Rn+1 : τ ≥ ‖x‖},

F(z) = −ln(τ² − ‖x‖²), ν = 2.

4. Direct sums of these cones.

Properties of logarithmically homogeneous barriers

For any x ∈ int K and τ > 0 we have:

(1): F ′(τx) = (1/τ) F ′(x),   (2): F ′′(τx) = (1/τ²) F ′′(x),

(3): 〈F ′(x), x〉 = −ν,   (4): F ′′(x)x = −F ′(x),

(5): 〈F ′′(x)x, x〉 = ν,   (6): 〈[F ′′(x)]⁻¹F ′(x), F ′(x)〉 = ν.

Proof:

1. Differentiate (∗) in x: τF ′(τx) = F ′(x).

2. Differentiate 1) in x: τ²F ′′(τx) = F ′′(x).

3. Differentiate (∗) in τ: 〈F ′(τx), x〉 = −ν/τ. Take τ = 1 to get 3).

4. Differentiate 3) in x: F ′′(x)x + F ′(x) = 0.

5. Substitute 4) in 3).

6. Substitute 4) in 5).
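Properties (3)-(6) can be verified numerically for the barrier of the positive orthant (a minimal sketch, not part of the slides):

```python
import numpy as np

# For F(x) = -sum ln x_i on the positive orthant (nu = n):
x = np.array([0.5, 2.0, 3.0])
n = len(x)
Fp = -1.0 / x                  # F'(x)
Fpp = np.diag(1.0 / x**2)      # F''(x)

assert np.isclose(Fp @ x, -n)                         # (3)
assert np.allclose(Fpp @ x, -Fp)                      # (4)
assert np.isclose(x @ Fpp @ x, n)                     # (5)
assert np.isclose(Fp @ np.linalg.solve(Fpp, Fp), n)   # (6)
```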

Dual cones

Definition. Let K be a closed convex cone. The set

K ∗ = {s : 〈s, x〉 ≥ 0 ∀x ∈ K}

is called the dual cone to K.

Theorem. If K is a proper cone, then K ∗ is also proper and (K ∗)∗ = K.

Proof: K ∗ is closed and convex as an intersection of half-spaces.

If K ∗ contains a straight line {s = τ s̄, τ ∈ R}, then 〈s̄, x〉 = 0 ∀x ∈ K (contradiction).

For all s ∈ K ∗ and x ∈ K we have 〈s, x〉 ≥ 0. Therefore K ⊆ (K ∗)∗. If ∃u ∈ (K ∗)∗ \ K, then

∃ s̄ : 〈s̄, u〉 < 〈s̄, x〉 ∀x ∈ K.

Hence, s̄ ∈ K ∗ and u ∉ (K ∗)∗. Contradiction.

If int K ∗ = ∅, then ∃ x̄ : 〈s, x̄〉 = 0 ∀s ∈ K ∗.

Therefore ± x̄ ∈ (K ∗)∗ = K. Contradiction.

Conjugate barriers

Definition. Let K be a proper cone and F(x) a ν-s.c.b. for K. The function

F∗(s) = max{−〈s, x〉 − F(x) : x ∈ K}

is called the conjugate (or dual) barrier.

Main properties

dom F∗ ≡ int K ∗.

F∗(s) is a ν-normal barrier for K ∗.

For any x ∈ int K and s ∈ int K ∗ we have:

F(x) + F∗(s) ≥ −ν ln〈s, x〉 − ν + ν ln ν.

Equality is attained iff s = −τF ′(x) for some τ > 0.

Examples: The barriers for Rn₊, Lⁿ and Sⁿ₊ are self-dual.

Primal–Dual Problems

Primal problem: f ∗ = min_x {〈c, x〉 : Ax = b, x ∈ K}.

Dual problem: f∗ = max_{y,s} {〈b, y〉 : s + Aᵀy = c, s ∈ K ∗}.

Denote by FP and FD the feasible sets of the primal and dual problems.

Note:

For any x ∈ FP and (s, y) ∈ FD we have:

0 ≤ 〈s, x〉 = 〈c − Aᵀy, x〉 = 〈c, x〉 − 〈b, y〉.

Therefore we always have f ∗ ≥ f∗.

Main Assumption (we always assume this):

There exists a strictly feasible primal-dual solution (x̄, s̄, ȳ): Ax̄ = b, x̄ ∈ int K, s̄ + Aᵀȳ = c, s̄ ∈ int K ∗.

Primal central path: x(t) = arg min_x {t〈c, x〉 + F(x) : Ax = b}.

Dual central path: (s(t), y(t)) = arg min {−t〈b, y〉 + F∗(s) : s + Aᵀy = c}.

Primal–dual central path: (x(t), s(t), y(t)), t > 0.

Lemma. The primal-dual central path is well defined.

Proof: Note that ∀x ∈ FP:

F(x) ≥ −t〈s̄, x〉 − F∗(t s̄) = −t(〈c, x〉 − 〈b, ȳ〉) − F∗(t s̄).

Therefore t〈c, x〉 + F(x) ≥ t〈b, ȳ〉 − F∗(t s̄), i.e. the primal objective is bounded below on FP.

Thus, x(t) exists. The proof for the dual path is symmetric.

Properties of primal-dual central path

Theorem. For any t > 0 we have the following:

〈c, x(t)〉 − 〈b, y(t)〉 = ν/t,

s(t) = −(1/t) F ′(x(t)),   x(t) = −(1/t) F ′∗(s(t)),

F(x(t)) + F∗(s(t)) = −ν + ν ln t.

Proof: Let us write down the optimality conditions for x(t):

tc + F ′(x(t)) = Aᵀ ȳ(t), Ax(t) = b.

Denote s(t) = −(1/t) F ′(x(t)), y(t) = (1/t) ȳ(t). Then

x(t) = −F ′∗(t s(t)) = −(1/t) F ′∗(s(t)).

Thus, c = s(t) + Aᵀ y(t) and tb + A F ′∗(s(t)) = 0.

These are the optimality conditions for the dual path. In view of uniqueness, (s(t), y(t)) is the dual central point.

The rest: 〈c, x(t)〉 − 〈b, y(t)〉 = 〈s(t), x(t)〉 = −(1/t)〈F ′(x(t)), x(t)〉 = ν/t,

F(x(t)) + F∗(s(t)) = F(x(t)) + F∗(−F ′(x(t))) + ν ln t = −ν + ν ln t.

Remarks

Under our Main Assumption f ∗ = f∗.

The set FP ×FD is never bounded.

We have a complete characterization of the duality gap 〈c, x〉 − 〈b, y〉 and the barrier F(x) + F∗(s) along the central path.

This information forms the basis for all primal-dual schemes.

That is not for free: We assume that F∗(s) is computable.


Primal–dual potential

Φ(x, s) = 2ν ln〈s, x〉 + F(x) + F∗(s)

= 2ν ln[〈c, x〉 − 〈b, y〉] + F(x) + F∗(s).

Lemma. For any (x, s, y) ∈ F0_PD ≡ F0_P × F0_D we have:

〈c, x〉 − 〈b, y〉 ≤ (1/ν) exp{1 + (1/ν) Φ(x, s)}.

Proof:

Φ(x, s) = 2ν ln〈s, x〉 + F(x) + F∗(s)

≥ 2ν ln〈s, x〉 − ν + ν ln ν − ν ln〈s, x〉

= ν ln[〈c, x〉 − 〈b, y〉] − ν + ν ln ν.

Main question: What can be the rate of decrease of Φ(x , s)?

Decrease along the central path z(t) = (x(t), s(t), y(t))

We want to have ‖z(t + ∆t) − z(t)‖_{z(t)} ≤ 1.

This is approximately |∆t| ≤ 1/‖z ′(t)‖_{z(t)}.

Note that 〈s ′(t), x ′(t)〉 = 0 and

s ′(t) = −(1/t) s(t) − (1/t) F ′′(x(t)) x ′(t),

x ′(t) = −(1/t) x(t) − (1/t) F ′′∗(s(t)) s ′(t).

Therefore 〈F ′′(x(t)) x ′(t), x ′(t)〉 = −〈s(t), x ′(t)〉,

〈F ′′∗(s(t)) s ′(t), s ′(t)〉 = −〈s ′(t), x(t)〉.

Hence ‖z ′(t)‖²_{z(t)} = 〈F ′′(x(t))x ′(t), x ′(t)〉 + 〈F ′′∗(s(t))s ′(t), s ′(t)〉

= −(〈s(t), x(t)〉)′_t = −(ν/t)′_t = ν/t².

Thus, we can take ∆t = t/√ν.

For the potential: ∆t · (Φ(x(t), s(t)))′_t = (t/√ν) · (2ν ln(ν/t) + ν ln t)′_t

= (t/√ν) · (−ν ln t)′_t = (t/√ν) · (−ν/t) = −√ν.

Proximity measure

Ω(x, s) = ν ln〈s, x〉 + F(x) + F∗(s) + ν − ν ln ν

= ν ln[〈c, x〉 − 〈b, y〉] + F(x) + F∗(s) + ν − ν ln ν.

Properties:

Ω(x, s) ≥ 0 for all (x, s, y) ∈ F0_PD.

Ω(x, s) = 0 only along the central path.

The restriction of Ω(x, s) onto the hyperplane 〈c, x〉 − 〈b, y〉 = const is a convex self-concordant function.

Note: (x(t), s(t), y(t)) = arg min_{x,s,y} {F(x) + F∗(s) : Ax = b, s + Aᵀy = c, 〈c, x〉 − 〈b, y〉 = ν/t}.

Proof: z(t) is feasible and F(x(t)) + F∗(s(t)) = −ν + ν ln t.

On the other hand, for any feasible (x, s, y) we have:

F(x) + F∗(s) ≥ −ν + ν ln ν − ν ln〈s, x〉 = −ν + ν ln ν − ν ln(ν/t) = −ν + ν ln t.

The minimum is unique since FPD contains no straight line.

Primal-dual path-following scheme

If we are close to the central path, we can move along the tangent direction as long as Ω(x, s) ≤ β.

Then we fix 〈c , x〉 − 〈b, y〉 and go back to the central path byminimizing the barrier.

Efficiency estimate: O(√ν ln 1

ε

).

Advantages:

The tangent step typically is large.

The level β bounds the number of Newton steps for thecorrector process by an absolute constant.

NB. These schemes are currently the most efficient for solving Linear and Quadratic Optimization problems and Linear Matrix Inequalities of moderate size.

Advanced Convex Optimization (PGMO 2016)

Lecture 4. Structural Optimization: SmoothingTechnique

Yurii Nesterov, CORE/INMA (UCL)

January 20-22, 2016 (Ecole Polytechnique, Paris)

Outline

1 Nonsmooth Optimization

2 Smoothing technique

3 Application examples

Nonsmooth Unconstrained Optimization

Problem: min{f(x) : x ∈ Rn} ⇒ x∗, f ∗ = f(x∗), where f(x) is a nonsmooth convex function.

Subgradients: g ∈ ∂f(x) ⇔ f(y) ≥ f(x) + 〈g, y − x〉 ∀y ∈ Rn.

Main difficulties:

For g ∈ ∂f(x), the direction −g need not be a descent direction at x.

g ∈ ∂f(x∗) does not imply g = 0.

Example:

f(x) = max_{1≤j≤m} {〈aj, x〉 + bj},

∂f(x) = Conv{aj : 〈aj, x〉 + bj = f(x)}.

Subgradient methods in Nonsmooth Optimization

Advantages

Very simple iteration scheme.

Low memory requirements.

Optimal rate of convergence (uniformly in the dimension).

Interpretation of the process.

Objections:

Low rate of convergence. (Confirmed by theory!)

No acceleration.

High sensitivity to the step-size strategy.

Lower complexity bounds

Nemirovsky, Yudin 1976

If f(x) is given by a local black box, it is impossible to converge faster than O(1/√k) uniformly in n (k is the number of oracle calls).

NB: Convergence is very slow.

Question: We want to find an ε-solution of the problem

min_x max_{1≤j≤m} {〈aj, x〉 + bj}, x ∈ Rn,

by a gradient scheme (n and m are big).

What is the worst-case complexity bound?

“Right answer” (Complexity Theory): O(1/ε²) calls of the oracle.

Our target: a gradient scheme with O(1/ε) complexity bound.

Reason for the speed-up: our problem is not in a black box.

Complexity of Smooth Minimization

Problem: f(x) → min_x, x ∈ Rn, where f is a convex function

and ‖∇f(x) − ∇f(y)‖∗ ≤ L(f)‖x − y‖ for all x, y ∈ Rn.

(For measuring gradients we use the dual norm: ‖s‖∗ = max_{‖x‖=1} 〈s, x〉.)

Rate of convergence: the optimal method gives O(L(f)/k²).

Complexity: O(√(L(f)/ε)). The difference with O(1/ε²) is very big.

Smoothing the convex function

For a function f define its Fenchel conjugate: f∗(s) = max_{x∈Rn} [〈s, x〉 − f(x)].

It is a closed convex function with dom f∗ = Conv{f ′(x) : x ∈ Rn}.

Moreover, under very mild conditions, (f∗(s))∗ ≡ f(x).

Define fµ(x) = max_{s∈dom f∗} [〈s, x〉 − f∗(s) − (µ/2)‖s‖²∗], where ‖ · ‖∗ is a Euclidean norm.

Note: f ′µ(x) = sµ(x), the maximizer above, and x = f ′∗(sµ(x)) + µ sµ(x). Therefore,

‖x1 − x2‖² = ‖f ′∗(s1) − f ′∗(s2)‖² + 2µ〈f ′∗(s1) − f ′∗(s2), s1 − s2〉 + µ²‖s1 − s2‖² ≥ µ²‖s1 − s2‖²,

where si = sµ(xi). Thus, fµ ∈ C^{1,1}_{1/µ} and f(x) ≥ fµ(x) ≥ f(x) − µD²,

where D = Diam(dom f∗).
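A concrete instance (an assumption of this sketch, not from the slides): for f(x) = |x| we have dom f∗ = [−1, 1], and fµ is the classical Huber function, which can be checked numerically:

```python
import numpy as np

# Euclidean smoothing of f(x) = |x|:
#   f_mu(x) = max_{|s|<=1} [ s*x - (mu/2) s^2 ]
# equals x^2/(2 mu) for |x| <= mu, and |x| - mu/2 otherwise.
mu = 0.3

def f_mu(x):
    s = np.clip(x / mu, -1.0, 1.0)          # the maximizer s_mu(x)
    return s * x - 0.5 * mu * s**2

for x in np.linspace(-2, 2, 401):
    huber = x**2 / (2 * mu) if abs(x) <= mu else abs(x) - mu / 2
    assert np.isclose(f_mu(x), huber)
    assert abs(x) >= f_mu(x) - 1e-12        # f >= f_mu
    assert f_mu(x) >= abs(x) - mu / 2 - 1e-12
```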

Main questions

1. Given a non-smooth convex f(x), can we form a computable smooth ε-approximation fε(x) with

L(fε) = O(1/ε)?

If yes, we need only O(√(L(fε)/ε)) = O(1/ε) iterations.

2. Can we do this in a systematic way?

Conclusion: We need a convenient model of our problem.

Adjoint problem

Primal problem: Find f̄ ∗ = min_x {f̄(x) : x ∈ Q1}, where

Q1 ⊂ E1 is convex, closed and bounded.

Objective: f̄(x) = f̂(x) + max_u {〈Ax, u〉₂ − φ̂(u) : u ∈ Q2}, where

f̂(x) is differentiable and convex on Q1,

Q2 ⊂ E2 is closed, convex and bounded,

φ̂(u) is a continuous convex function on Q2,

A : E1 → E∗2 is a linear operator.

Adjoint problem: max_u {φ(u) : u ∈ Q2}, where

φ(u) = −φ̂(u) + min_x {〈Ax, u〉₂ + f̂(x) : x ∈ Q1}.

NB: The adjoint problem is not unique!

Example

Consider f(x) = max_{1≤j≤m} |〈aj, x〉₁ − bj|.

1. Q2 = E∗1, A = I, φ̂(u) ≡ f∗(u) = max_x {〈u, x〉₁ − f(x) : x ∈ E1}

= min_{s∈Rm} { Σ_{j=1}^{m} sj bj : u = Σ_{j=1}^{m} sj aj, Σ_{j=1}^{m} |sj| ≤ 1 }.

2. E2 = Rm, φ̂(u) = 〈b, u〉₂:

f(x) = max_{u∈Rm} { Σ_{j=1}^{m} uj [〈aj, x〉₁ − bj] : Σ_{j=1}^{m} |uj| ≤ 1 }.

3. E2 = R²ᵐ, φ̂(u) is linear, Q2 is a simplex:

f(x) = max_{u∈R²ᵐ} { Σ_{j=1}^{m} (u¹j − u²j)[〈aj, x〉₁ − bj] : Σ_{j=1}^{m} (u¹j + u²j) = 1, u ≥ 0 }.

NB: An increase in dim E2 decreases the complexity of the representation.

Smooth approximations

Prox-function: d2(u) is continuous and strongly convex on Q2:

d2(v) ≥ d2(u) + 〈∇d2(u), v − u〉₂ + ½σ2‖v − u‖²₂.

Assume: d2(u0) = 0 and d2(u) ≥ 0 ∀u ∈ Q2.

Fix µ > 0, the smoothing parameter, and define

fµ(x) = max_u {〈Ax, u〉₂ − φ̂(u) − µ d2(u) : u ∈ Q2}.

Denote by u(x) the solution of this problem.

Theorem: fµ(x) is convex and differentiable for x ∈ E1. Its gradient ∇fµ(x) = A∗u(x) is Lipschitz continuous with

L(fµ) = (1/(µσ2)) ‖A‖²₁,₂,

where ‖A‖₁,₂ = max_{x,u} {〈Ax, u〉₂ : ‖x‖₁ = 1, ‖u‖₂ = 1}.

NB: 1. For any x ∈ E1 we have f0(x) ≥ fµ(x) ≥ f0(x) − µD2, where D2 = max_u {d2(u) : u ∈ Q2}.

2. All norms are very important.

Optimal method

Problem: min_x {f(x) : x ∈ Q₁} with f ∈ C^{1,1}(Q₁).

Prox-function: strongly convex d₁(x), d₁(x₀) = 0, d₁(x) ≥ 0 for x ∈ Q₁.

Gradient mapping:
T_L(x) = arg min_{y∈Q₁} { 〈∇f(x), y − x〉₁ + (L/2)‖y − x‖₁² }.

Method. For k ≥ 0 do:
1. Compute f(x_k), ∇f(x_k).
2. Find y_k = T_{L(f)}(x_k).
3. Find z_k = arg min_{x∈Q₁} { (L(f)/σ₁) d₁(x) + Σ_{i=0}^k ((i+1)/2) 〈∇f(x_i), x〉₁ }.
4. Set x_{k+1} = (2/(k+3)) z_k + ((k+1)/(k+3)) y_k.

Convergence: f(y_k) − f(x∗) ≤ 4L(f) d₁(x∗) / (σ₁(k+1)²), where x∗ is the optimal solution.

Applications

Smooth problem: f̄_µ(x) = f̂(x) + f_µ(x) → min, x ∈ Q₁.

Lipschitz constant: L_µ = L(f̂) + ‖A‖₁,₂²/(µσ₂). Denote D₁ = max_x {d₁(x) : x ∈ Q₁}.

Theorem: Let us choose N ≥ 1. Define
µ = µ(N) = (2‖A‖₁,₂/(N+1)) · √(D₁/(σ₁σ₂D₂)).

After N iterations set x̄ = y_N ∈ Q₁ and
ū = Σ_{i=0}^N (2(i+1)/((N+1)(N+2))) u(x_i) ∈ Q₂.

Then 0 ≤ f(x̄) − φ(ū) ≤ (4‖A‖₁,₂/(N+1)) · √(D₁D₂/(σ₁σ₂)) + 4L(f̂)D₁/(σ₁(N+1)²).

Corollary: Let L(f̂) = 0. For getting an ε-solution, we choose
µ = ε/(2D₂),  L = 2D₂‖A‖₁,₂²/(σ₂ε),  N ≥ (4‖A‖₁,₂/ε) · √(D₁D₂/(σ₁σ₂)).

Example: Equilibrium in matrix games (1)

Denote ∆_n = {x ∈ Rⁿ : x ≥ 0, Σ_{i=1}^n x^{(i)} = 1}. Consider the problem
min_{x∈∆_n} max_{u∈∆_m} { 〈Ax, u〉₂ + 〈c, x〉₁ + 〈b, u〉₂ }.

Minimization form:
min_{x∈∆_n} f(x),  f(x) = 〈c, x〉₁ + max_{1≤j≤m} [〈a_j, x〉₁ + b_j],
max_{u∈∆_m} φ(u),  φ(u) = 〈b, u〉₂ + min_{1≤i≤n} [〈â_i, u〉₂ + c_i],
where a_j are the rows and â_i are the columns of A.

1. Euclidean distance: Let us take
‖x‖₁² = Σ_{i=1}^n x_i²,  ‖u‖₂² = Σ_{j=1}^m u_j²,
d₁(x) = ½‖x − (1/n)e_n‖₁²,  d₂(u) = ½‖u − (1/m)e_m‖₂².

Then ‖A‖₁,₂ = λ_max^{1/2}(AᵀA) and f(x̄) − φ(ū) ≤ 4λ_max^{1/2}(AᵀA)/(N+1).

Example: Equilibrium in matrix games (2)

2. Entropy distance: Let us choose
‖x‖₁ = Σ_{i=1}^n |x_i|,  d₁(x) = ln n + Σ_{i=1}^n x_i ln x_i,
‖u‖₂ = Σ_{j=1}^m |u_j|,  d₂(u) = ln m + Σ_{j=1}^m u_j ln u_j.

Lemma: σ₁ = σ₂ = 1. (Hint: 〈d₁″(x)h, h〉 = Σ_{i=1}^n h_i²/x_i, and its minimum over x ∈ ∆_n equals ‖h‖₁².)

Moreover, since D₁ = ln n, D₂ = ln m, and
‖A‖₁,₂ = max_x { max_{1≤j≤m} |〈a_j, x〉| : ‖x‖₁ = 1 } = max_{i,j} |A_{i,j}|,
we have f(x̄) − φ(ū) ≤ (4√(ln n · ln m)/(N+1)) · max_{i,j} |A_{i,j}|.

NB: 1. Usually max_{i,j} |A_{i,j}| ≪ λ_max^{1/2}(AᵀA).

2. We have f_µ(x) = 〈c, x〉₁ + µ ln( (1/m) Σ_{j=1}^m e^{[〈a_j,x〉+b_j]/µ} ).
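As an illustration (not part of the slides; all function names are mine), here is a minimal Python sketch of this entropy-smoothed objective, with the log-sum-exp evaluated in the standard numerically stable way:

```python
import math

def f_exact(A, b, c, x):
    # f(x) = <c, x> + max_j (<a_j, x> + b_j)
    cx = sum(ci * xi for ci, xi in zip(c, x))
    return cx + max(sum(aij * xi for aij, xi in zip(row, x)) + bj
                    for row, bj in zip(A, b))

def f_smooth(A, b, c, x, mu):
    # f_mu(x) = <c, x> + mu * ln( (1/m) * sum_j exp((<a_j, x> + b_j)/mu) )
    m = len(A)
    z = [sum(aij * xi for aij, xi in zip(row, x)) + bj
         for row, bj in zip(A, b)]
    zmax = max(z)                      # shift for numerical stability
    s = sum(math.exp((zj - zmax) / mu) for zj in z)
    return (sum(ci * xi for ci, xi in zip(c, x))
            + zmax + mu * (math.log(s) - math.log(m)))
```

By construction f(x) − µ ln m ≤ f_µ(x) ≤ f(x), matching the bound f₀ ≥ f_µ ≥ f₀ − µD₂ with D₂ = ln m.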

Example 2: Continuous location problem

Problem: p cities with populations m_j are located at points c_j ∈ Rⁿ, j = 1, . . . , p.

Goal: Construct a service center at a point x∗ minimizing the total (population-weighted) distance to the center.

That is, find f∗ = min_x { f(x) = Σ_{j=1}^p m_j ‖x − c_j‖₁ : ‖x‖₁ ≤ r }.

Primal space: ‖x‖₁² = Σ_{i=1}^n (x^{(i)})²,  d₁(x) = ½‖x‖₁²,  σ₁ = 1,  D₁ = ½r².

Adjoint space: E₂ = (E₁∗)ᵖ,  ‖u‖₂² = Σ_{j=1}^p m_j (‖u_j‖₁∗)²,
Q₂ = { u = (u₁, . . . , u_p) ∈ E₂ : ‖u_j‖₁∗ ≤ 1, j = 1, . . . , p },
d₂(u) = ½‖u‖₂²,  σ₂ = 1,  D₂ = ½P, with P ≡ Σ_{j=1}^p m_j the total size of the population.

Operator norm: ‖A‖₁,₂ = P^{1/2}.

Rate of convergence: f(x̄) − f∗ ≤ 2Pr/(N+1).

Smoothed objective: f_µ(x) = Σ_{j=1}^p m_j ψ_µ(‖x − c_j‖₁), where
ψ_µ(τ) = τ²/(2µ) for τ ≤ µ, and ψ_µ(τ) = τ − µ/2 for µ ≤ τ.
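The smoothed distance ψ_µ is a Huber-type function: quadratic near the center, linear with unit slope beyond µ. A small illustrative Python sketch (names are mine, Euclidean distance assumed as on the slide):

```python
def psi_mu(tau, mu):
    # Huber-type smoothing of t -> t (for t >= 0):
    # quadratic on [0, mu], linear with slope 1 beyond mu.
    return tau * tau / (2 * mu) if tau <= mu else tau - mu / 2

def f_smooth(points, weights, x, mu):
    # f_mu(x) = sum_j m_j * psi_mu(||x - c_j||)
    total = 0.0
    for c, m in zip(points, weights):
        dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c)) ** 0.5
        total += m * psi_mu(dist, mu)
    return total
```

Since τ − µ/2 ≤ ψ_µ(τ) ≤ τ, we get f(x) − (µ/2)P ≤ f_µ(x) ≤ f(x), consistent with D₂ = ½P.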

Example 3: Variational inequalities (linear operator)

Consider B(w) = Bw + c : E → E∗, which is monotone: 〈Bh, h〉 ≥ 0 for all h ∈ E.

Problem: Find w∗ ∈ Q : 〈B(w∗), w − w∗〉 ≥ 0 for all w ∈ Q, where Q is a bounded closed convex set.

Merit function: ψ(w) = max_v {〈B(v), w − v〉 : v ∈ Q}.

ψ(w) is convex on E.

ψ(w) ≥ 0 for all w ∈ Q.

ψ(w) = 0 if and only if w solves the VI-problem.

〈B(v), v〉 is a convex function. Thus, ψ is exactly in our form.

Primal smoothing: ψ_µ(w) = max_v {〈B(v), w − v〉 − µd₂(v) : v ∈ Q}.

Dual smoothing: φ_µ(v) = min_w {〈B(v), w − v〉 + µd₁(w) : w ∈ Q}. (Looks better.)

Example 4: Piece-wise linear functions

1. Maximum of absolute values. Consider
min_x { f(x) = max_{1≤j≤m} |〈a_j, x〉₁ − b^{(j)}| : x ∈ Q₁ }.

For simplicity choose ‖x‖₁² = Σ_{i=1}^n (x^{(i)})², d₁(x) = ½‖x‖₁².

It is convenient to choose E₂ = R²ᵐ,
‖u‖₂ = Σ_{j=1}^{2m} |u^{(j)}|,  d₂(u) = ln(2m) + Σ_{j=1}^{2m} u^{(j)} ln u^{(j)}.

Denote by A the matrix with rows a_j. Then
f(x) = max_u {〈Âx, u〉₂ − 〈b̂, u〉₂ : u ∈ ∆_{2m}},
where Â = (A; −A) and b̂ = (b; −b).

Thus, σ₁ = σ₂ = 1, D₂ = ln(2m), D₁ = ½r², r = max_x {‖x‖₁ : x ∈ Q₁}.

Operator norm: ‖Â‖₁,₂ = max_{1≤j≤m} ‖a_j‖₁∗.

Complexity: 2√2 · r · max_{1≤j≤m} ‖a_j‖₁∗ · √(ln(2m)) · (1/ε).

Approximation: for ξ(τ) = ½[e^τ + e^{−τ}], define
f_µ(x) = µ ln( (1/m) Σ_{j=1}^m ξ( (1/µ)[〈a_j, x〉 − b^{(j)}] ) ).

Piece-wise linear functions: Sum of absolute values.

min_x { f(x) = Σ_{j=1}^m |〈a_j, x〉₁ − b^{(j)}| : x ∈ Q₁ }.

Let us choose E₂ = Rᵐ, Q₂ = {u ∈ Rᵐ : |u^{(j)}| ≤ 1, j = 1, . . . , m}, and
d₂(u) = ½‖u‖₂² = ½ Σ_{j=1}^m ‖a_j‖₁∗ · (u^{(j)})².

Then f_µ(x) = Σ_{j=1}^m ‖a_j‖₁∗ · ψ_µ( |〈a_j, x〉₁ − b^{(j)}| / ‖a_j‖₁∗ ),
‖A‖₁,₂² = P ≡ Σ_{j=1}^m ‖a_j‖₁∗.

On the other hand, D₂ = ½P and σ₂ = 1. Thus, we get the following complexity bound:
(1/ε) · √(8D₁/σ₁) · Σ_{j=1}^m ‖a_j‖₁∗.

NB: The bound and the scheme allow m → ∞.

Computational experiments

Test problem: min_{x∈∆_n} max_{u∈∆_m} 〈Ax, u〉₂. The entries of A are uniformly distributed in [−1, 1].

Goal: Test of computational stability. Computer: 2.6GHz. Complexity of one iteration: 2mn operations.

[Tables 1–3: for ε = 0.01, ε = 0.001 and ε = 0.0001 respectively, with m ∈ {100, 300, 1000} and n ∈ {100, 300, 1000, 3000, 10000}, each cell reports the number of iterations and the CPU time in seconds.]

Number of iterations: 40–50% of the predicted values.

Comparing the bounds

Smoothing + FGM: 2 · 4 · (mn/ε) · √(ln n · ln m) operations.

Short-step path-following method (n ≥ m): (7.2√n · ln(1/ε)) · (m(m+1)/2) · n operations.

Right digits (winner per target accuracy):

m     n      2  3  4  5
100   100    g  g  b  b
300   300    g  g  b  b
300   1000   g  g  b  b
300   3000   g  g  =  b
300   10000  g  g  g  b
1000  1000   g  g  g  b
1000  3000   g  g  g  b
1000  10000  g  g  g  =

g — Smoothing+FGM, b — barrier method

Advanced Convex Optimization (PGMO 2016)

Lecture 5. Huge-scale optimization

Yurii Nesterov, CORE/INMA (UCL)

January 20-22, 2016 (Ecole Polytechnique, Paris)

Outline

1 Problem sizes

2 Random coordinate search

3 Confidence level of solutions

4 Sparse Optimization problems

5 Sparse updates for linear operators

6 Fast updates in computational trees

7 Simple subgradient methods

8 Application examples

Nonlinear Optimization: problem sizes

Class        Operations  Dimension   Iter. cost   Memory
Small-size   All         10⁰ − 10²   n⁴ → n³      Kilobyte: 10³
Medium-size  A⁻¹         10³ − 10⁴   n³ → n²      Megabyte: 10⁶
Large-scale  Ax          10⁵ − 10⁷   n² → n       Gigabyte: 10⁹
Huge-scale   x + y       10⁸ − 10¹²  n → log n    Terabyte: 10¹²

Sources of Huge-Scale problems

Internet (New)

Telecommunications (New)

Finite-element schemes (Old)

Partial differential equations (Old)

Very old optimization idea: Coordinate Search

Problem: min_{x∈Rⁿ} f(x) (f is convex and differentiable).

Coordinate relaxation algorithm. For k ≥ 0 iterate:

1 Choose active coordinate i_k.

2 Update x_{k+1} = x_k − h_k ∇_{i_k} f(x_k) e_{i_k}, ensuring f(x_{k+1}) ≤ f(x_k). (e_i is the ith coordinate vector in Rⁿ.)

Main advantage: Very simple implementation.

Possible strategies:

1 Cyclic moves. (Difficult to analyze.)

2 Random choice of coordinate. (Why?)

3 Choose the coordinate with the maximal directional derivative.

Complexity estimate for strategy 3: assume ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ Rⁿ. Let us choose h_k = 1/L. Then

f(x_k) − f(x_{k+1}) ≥ (1/(2L)) |∇_{i_k} f(x_k)|² ≥ (1/(2nL)) ‖∇f(x_k)‖² ≥ (1/(2nLR²)) (f(x_k) − f∗)².

Hence, f(x_k) − f∗ ≤ 2nLR²/k, k ≥ 1. (For the Gradient Method, drop the factor n.)

This is the only theoretical result known for CDM!
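The steps above can be sketched in a few lines of Python (an illustration of strategy 3, not from the slides; `coord_descent` and its arguments are my naming):

```python
def coord_descent(grad, L, x0, iters):
    # Strategy 3: move along the coordinate with the largest
    # directional derivative, with step size h = 1/L.
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        i = max(range(len(x)), key=lambda j: abs(g[j]))
        x[i] -= g[i] / L
    return x
```

On the quadratic f(x) = ½Σ d_i x_i² − Σ b_i x_i (gradient d_i x_i − b_i, L = max_i d_i), the iterates converge to x_i = b_i/d_i.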

Criticism

Theoretical justification:

Complexity bounds are not known for most of the schemes.

The only justified scheme needs computation of the whole gradient. (Why not use the Gradient Method then?)

Computational complexity:

Fast differentiation: if a function is defined by a sequence of operations, then C(∇f) ≤ 4C(f).

Can we do anything without computing the function's values?

Result: CDM are almost out of computational practice.

Google problem

Let E ∈ Rⁿˣⁿ be an incidence matrix of a graph. Denote e = (1, . . . , 1)ᵀ and
Ē = E · diag(Eᵀe)⁻¹.

Thus, Ēᵀe = e. Our problem is as follows:

Find x∗ ≥ 0 : Ē x∗ = x∗.

Optimization formulation:

f(x) := ½‖Ē x − x‖² + (γ/2)[〈e, x〉 − 1]² → min_{x∈Rⁿ}

Huge-scale problems

Main features

The size is very big (n ≥ 10⁷).

The data is distributed in space.

The requested parts of data are not always available.

The data may be changing in time.

Consequences

Simplest operations are expensive or infeasible:

Update of the full vector of variables.

Matrix-vector multiplication.

Computation of the objective function’s value, etc.

Structure of the Google Problem

Let us look at the gradient of the objective:

∇_i f(x) = 〈a_i, g(x)〉 + γ[〈e, x〉 − 1], i = 1, . . . , n,

g(x) = Ē x − x ∈ Rⁿ, Ē = (a₁, . . . , a_n).

Main observations:

The coordinate move x₊ = x − h_i ∇_i f(x) e_i needs O(p_i) a.o. (p_i is the number of nonzero elements in a_i).

The diagonal entries d_i := diag(∇²f)_i, with ∇²f := ĒᵀĒ + γeeᵀ, satisfy d_i = γ + 1/p_i and are readily available.

We can use them for choosing the step sizes (h_i = 1/d_i).

Reasonable coordinate choice strategy? Random!

Random coordinate descent methods (RCDM)

min_{x∈R^N} f(x) (f is convex and differentiable)

Main Assumption:

|f′_i(x + h_i e_i) − f′_i(x)| ≤ L_i |h_i|, h_i ∈ R, i = 1, . . . , N,

where e_i is a coordinate vector. Then

f(x + h_i e_i) ≤ f(x) + f′_i(x) h_i + (L_i/2) h_i², x ∈ R^N, h_i ∈ R.

Define the coordinate steps: T_i(x) := x − (1/L_i) f′_i(x) e_i. Then

f(x) − f(T_i(x)) ≥ (1/(2L_i)) [f′_i(x)]², i = 1, . . . , N.

Random choice of coordinates

We need a special random counter R_α, α ∈ R:

Prob[i] = p_α^{(i)} = L_i^α · [ Σ_{j=1}^N L_j^α ]⁻¹, i = 1, . . . , N.

Note: R₀ generates the uniform distribution.

Method RCDM(α, x₀)

For k ≥ 0 iterate:

1) Choose i_k = R_α.

2) Update x_{k+1} = T_{i_k}(x_k).
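A minimal Python sketch of this scheme (illustrative only; `rcdm`, `grad_i` and the fixed seed are my assumptions, with the weighted coordinate choice done naively rather than with the O(log N) counter discussed later):

```python
import random

def rcdm(grad_i, L, alpha, x0, iters, seed=0):
    # RCDM(alpha, x0): pick coordinate i with probability ~ L_i^alpha,
    # then apply the coordinate step T_i(x) = x - f'_i(x)/L_i * e_i.
    rng = random.Random(seed)
    w = [Li ** alpha for Li in L]
    x = list(x0)
    for _ in range(iters):
        i = rng.choices(range(len(x)), weights=w)[0]
        x[i] -= grad_i(x, i) / L[i]
    return x
```

On the separable quadratic f(x) = ½Σ L_i x_i² (with f′_i(x) = L_i x_i), each chosen coordinate is zeroed exactly in one step.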

Complexity bounds for RCDM

We need to introduce the following norms for x, g ∈ R^N:

‖x‖_α = [ Σ_{i=1}^N L_i^α [x^{(i)}]² ]^{1/2},  ‖g‖∗_α = [ Σ_{i=1}^N (1/L_i^α) [g^{(i)}]² ]^{1/2}.

After k iterations, RCDM(α, x₀) generates a random output x_k, which depends on ξ_k = {i₀, . . . , i_k}. Denote φ_k = E_{ξ_{k−1}} f(x_k).

Theorem. For any k ≥ 1 we have

φ_k − f∗ ≤ (2/k) · [ Σ_{j=1}^N L_j^α ] · R²_{1−α}(x₀),

where R_β(x₀) = max_x { max_{x∗∈X∗} ‖x − x∗‖_β : f(x) ≤ f(x₀) }.

Interpretation

Denote S_α = Σ_{i=1}^N L_i^α.

1. α = 0. Then S₀ = N, and we get

φ_k − f∗ ≤ (2N/k) · R₁²(x₀).

Note

We use the metric ‖x‖₁² = Σ_{i=1}^N L_i [x^{(i)}]².

The matrix with diagonal {L_i}_{i=1}^N can have its norm equal to N.

Hence, for GM we can guarantee the same bound.

But its cost of iteration is much higher!

Interpretation

2. α = ½. Denote

D_∞(x₀) = max_x { max_{y∈X∗} max_{1≤i≤N} |x^{(i)} − y^{(i)}| : f(x) ≤ f(x₀) }.

Then R²_{1/2}(x₀) ≤ S_{1/2} D²_∞(x₀), and we obtain

φ_k − f∗ ≤ (2/k) · [ Σ_{i=1}^N L_i^{1/2} ]² · D²_∞(x₀).

Note:

For first-order methods, the worst-case complexity of minimizing over a box depends on N.

Since S_{1/2} can be bounded, RCDM can be applied in situations where the usual GM fails.

Interpretation

3. α = 1. Then R₀(x₀) is the size of the initial level set in the standard Euclidean norm. Hence,

φ_k − f∗ ≤ (2/k) · [ Σ_{i=1}^N L_i ] · R₀²(x₀) ≡ (2N/k) · [ (1/N) Σ_{i=1}^N L_i ] · R₀²(x₀).

The rate of convergence of GM can be estimated as

f(x_k) − f∗ ≤ (γ/k) R₀²(x₀),

where γ satisfies the condition f″(x) ⪯ γ·I, x ∈ R^N.

Note: the maximal eigenvalue of a symmetric matrix can reach its trace.

In the worst case, the rate of convergence of GM is the same as that of RCDM.

Minimizing strongly convex functions

Theorem. Let f(x) be strongly convex with respect to ‖·‖_{1−α} with convexity parameter σ_{1−α} > 0. Then, for {x_k} generated by RCDM(α, x₀) we have

φ_k − f∗ ≤ (1 − σ_{1−α}/S_α)^k (f(x₀) − f∗).

Proof: Let x_k be generated by RCDM after k iterations. Let us estimate the expected result of the next iteration:

f(x_k) − E_{i_k}(f(x_{k+1})) = Σ_{i=1}^N p_α^{(i)} · [f(x_k) − f(T_i(x_k))]

≥ Σ_{i=1}^N (p_α^{(i)}/(2L_i)) [f′_i(x_k)]² = (1/(2S_α)) (‖f′(x_k)‖∗_{1−α})²

≥ (σ_{1−α}/S_α) (f(x_k) − f∗).

It remains to compute the expectation in ξ_{k−1}.

Confidence level of the answers

Note: We have proved that the expected values of the random f(x_k) are good. Can we guarantee anything after a single run?

Confidence level: probability β ∈ (0, 1) that some statement about the random output is correct.

Main tool: Chebyshev inequality (for ξ ≥ 0):

Prob[ξ ≥ T] ≤ E(ξ)/T.

Our situation:

Prob[f(x_k) − f∗ ≥ ε] ≤ (1/ε)[φ_k − f∗] ≤ 1 − β.

We need φ_k − f∗ ≤ ε·(1 − β). Too expensive for β → 1?

Regularization technique

Consider f_µ(x) = f(x) + (µ/2)‖x − x₀‖²_{1−α}. It is strongly convex. Therefore, we can obtain φ_k − f∗_µ ≤ ε·(1 − β) in

O( (S_α/µ) ln( 1/(ε·(1−β)) ) ) iterations.

Theorem. Define α = 1, µ = ε/(4R₀²(x₀)), and choose

k ≥ 1 + (8S₁R₀²(x₀)/ε) · [ ln(2S₁R₀²(x₀)/ε) + ln(1/(1−β)) ].

Let x_k be generated by RCDM(1, x₀) as applied to f_µ. Then

Prob(f(x_k) − f∗ ≤ ε) ≥ β.

Note: β = 1 − 10⁻ᵖ ⇒ ln 10ᵖ ≈ 2.3p.

Implementation details: Random Counter

Given the values L_i, i = 1, . . . , N, generate efficiently a random i ∈ {1, . . . , N} with probabilities Prob[i = k] = L_k / Σ_{j=1}^N L_j.

Solution: a) Trivial ⇒ O(N) operations.

b) Assume N = 2ᵖ. Define p + 1 vectors S_k ∈ R^{2^{p−k}}, k = 0, . . . , p:

S₀^{(i)} = L_i, i = 1, . . . , N,

S_k^{(i)} = S_{k−1}^{(2i)} + S_{k−1}^{(2i−1)}, i = 1, . . . , 2^{p−k}, k = 1, . . . , p.

Algorithm: Make the choice in p steps, from top to bottom. If element i of S_k is chosen, then choose in S_{k−1} either 2i or 2i − 1, with probabilities S_{k−1}^{(2i)}/S_k^{(i)} and S_{k−1}^{(2i−1)}/S_k^{(i)} respectively.

Difference: for N = 2²⁰ > 10⁶ we have p = log₂ N = 20.
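An illustrative Python implementation of this counter (my naming; the partial sums S_k are stored in a single array, so building costs O(N), while sampling and updating one L_i both cost O(log N); `sample` takes an explicit uniform draw u ∈ [0, 1) to keep it deterministic):

```python
class RandomCounter:
    # Binary tree of partial sums over the weights L_i.
    def __init__(self, L):
        self.n = len(L)
        size = 1
        while size < self.n:         # pad to a power of two
            size *= 2
        self.size = size
        self.tree = [0.0] * (2 * size)
        for i, Li in enumerate(L):
            self.tree[size + i] = float(Li)
        for i in range(size - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self, u):
        # Walk from the root, descending left/right in proportion
        # to the subtree sums; u is uniform on [0, 1).
        target = u * self.tree[1]
        i = 1
        while i < self.size:
            left = self.tree[2 * i]
            if target < left:
                i = 2 * i
            else:
                target -= left
                i = 2 * i + 1
        return i - self.size

    def update(self, i, Li):
        # Change one weight and refresh the O(log N) sums above it.
        i += self.size
        self.tree[i] = float(Li)
        while i > 1:
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
```

For example, with weights [1, 2, 3, 2] the draw u = 0.5 lands in the third cell, since 0.5 · 8 = 4 falls into the cumulative interval [3, 6).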

Sparse problems

Problem: min_{x∈Q} f(x), where Q is closed and convex in R^N, and f(x) = Ψ(Ax), where Ψ is a simple convex function:

Ψ(y₁) ≥ Ψ(y₂) + 〈Ψ′(y₂), y₁ − y₂〉, y₁, y₂ ∈ R^M,

and A : R^N → R^M is a sparse matrix.

Let p(x) := number of nonzeros in x. Sparsity coefficient: γ(A) := p(A)/(MN).

Example 1: Matrix-vector multiplication

Computation of the vector Ax needs p(A) operations.

The initial complexity MN is reduced by a factor of γ(A).

Gradient Method

x₀ ∈ Q, x_{k+1} = π_Q(x_k − h f′(x_k)), k ≥ 0.

Main computational expenses

Projection onto the simple set Q needs O(N) operations.

The displacement x_k → x_k − h f′(x_k) needs O(N) operations.

f′(x) = AᵀΨ′(Ax). If Ψ is simple, then the main effort is spent on two matrix-vector multiplications: 2p(A).

Conclusion: As compared with full matrices, we accelerate by a factor of γ(A).
Note: For Large- and Huge-scale problems we often have γ(A) ≈ 10⁻⁴ . . . 10⁻⁶. Can we get more?

Sparse updating strategy

Main idea: after the update x₊ = x + d we have y₊ := Ax₊ = Ax + Ad = y + Ad.

What happens if d is sparse?

Denote σ(d) = {j : d^{(j)} ≠ 0}. Then y₊ = y + Σ_{j∈σ(d)} d^{(j)} · Ae_j.

Its complexity, κ_A(d) := Σ_{j∈σ(d)} p(Ae_j), can be VERY small!

κ_A(d) = M Σ_{j∈σ(d)} γ(Ae_j) = γ(d) · (1/p(d)) Σ_{j∈σ(d)} γ(Ae_j) · MN ≤ γ(d) · max_j γ(Ae_j) · MN.

If γ(d) ≤ c·γ(A) and γ(Ae_j) ≤ c·γ(A), then κ_A(d) ≤ c² · γ²(A) · MN.

Expected acceleration: (10⁻⁶)² = 10⁻¹² ⇒ 1 sec instead of ≈ 32 000 years!
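A minimal Python sketch of this column-wise update (illustrative, my naming; A is stored as a dict of sparse columns, so the cost is exactly κ_A(d) = Σ_{j∈σ(d)} p(Ae_j)):

```python
def sparse_update(y, cols, d_sparse):
    # y_new = y + A d, touching only the columns A e_j with d_j != 0.
    # cols[j] is the sparse column j of A: a list of (row, value) pairs.
    # d_sparse maps column index j -> d_j for the nonzero entries of d.
    for j, dj in d_sparse.items():
        for i, aij in cols[j]:
            y[i] += dj * aij
    return y
```

For a 2×3 matrix A = [[1, 0, 2], [0, 3, 0]] and y = Ax with x = (1, 1, 1), the update d = −e₃ touches only the single nonzero of column 3.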

When can it work?

Simple methods: no full-vector operations! (Is it possible?)

Simple problems: functions with sparse gradients.

Examples

1 Quadratic function f(x) = ½〈Ax, x〉 − 〈b, x〉. The gradient

f′(x) = Ax − b, x ∈ R^N,

is not sparse even if A is sparse.

2 Piece-wise linear function g(x) = max_{1≤i≤m} [〈a_i, x〉 − b^{(i)}]. Its subgradient g′(x) = a_{i(x)}, with i(x) such that g(x) = 〈a_{i(x)}, x〉 − b^{(i(x))}, can be sparse if a_{i(x)} is sparse!

But: We need a fast procedure for updating the max-operations.

Fast updates in short computational trees

Def: A function f(x), x ∈ Rⁿ, is short-tree representable if it can be computed by a short binary tree of height ≈ ln n.

Let n = 2ᵏ and let the tree have k + 1 levels: v_{0,i} = x^{(i)}, i = 1, . . . , n. The size of each level is half the size of the previous one:

v_{i+1,j} = ψ_{i+1,j}(v_{i,2j−1}, v_{i,2j}), j = 1, . . . , 2^{k−i−1}, i = 0, . . . , k − 1,

where the ψ_{i,j} are some bivariate functions.

[Figure: a binary tree with leaves v_{0,1}, . . . , v_{0,n}, internal nodes v_{i,j}, and root v_{k,1}.]

Main advantages

Important examples (symmetric functions):

f(x) = ‖x‖_p, p ≥ 1: ψ_{i,j}(t₁, t₂) ≡ [ |t₁|ᵖ + |t₂|ᵖ ]^{1/p},

f(x) = ln( Σ_{i=1}^n e^{x^{(i)}} ): ψ_{i,j}(t₁, t₂) ≡ ln(e^{t₁} + e^{t₂}),

f(x) = max_{1≤i≤n} x^{(i)}: ψ_{i,j}(t₁, t₂) ≡ max{t₁, t₂}.

The binary tree requires only n − 1 auxiliary cells.

Computing its value needs n − 1 applications of ψ_{i,j}(·,·) (≡ operations).

If x₊ differs from x in one entry only, then re-computing f(x₊) needs only k ≡ log₂ n operations.

Thus, we can have pure subgradient minimization schemes with sublinear iteration cost.
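As an illustration (not from the slides; the class name is mine), here is the short computational tree for f(x) = max_i x^{(i)}: a full evaluation costs n − 1 max-operations, while changing one entry re-evaluates only the log₂ n nodes on the path to the root:

```python
class MaxTree:
    # Short binary computational tree for f(x) = max_i x(i).
    def __init__(self, x):
        self.n = len(x)              # assume n is a power of two
        self.tree = [float('-inf')] * (2 * self.n)
        self.tree[self.n:] = list(x)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])

    def value(self):
        # The root holds f(x).
        return self.tree[1]

    def update(self, i, xi):
        # Change entry i and refresh only the path to the root.
        i += self.n
        self.tree[i] = xi
        while i > 1:
            i //= 2
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])
```

The same skeleton works for the other symmetric examples above by replacing `max` with the corresponding ψ_{i,j}.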

Simple subgradient methods

I. Problem: f∗ := min_{x∈Q} f(x), where

Q is closed and convex, and ‖f′(x)‖ ≤ L(f) for x ∈ Q,

the optimal value f∗ is known.

Consider the following optimization scheme (B. Polyak, 1967):

x₀ ∈ Q, x_{k+1} = π_Q( x_k − ((f(x_k) − f∗)/‖f′(x_k)‖²) f′(x_k) ), k ≥ 0.

Denote f∗_k = min_{0≤i≤k} f(x_i). Then for any k ≥ 0 we have:

f∗_k − f∗ ≤ L(f) ‖x₀ − π_{X∗}(x₀)‖ / (k+1)^{1/2},

‖x_k − x∗‖ ≤ ‖x₀ − x∗‖, for all x∗ ∈ X∗.
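The scheme above can be sketched in Python (illustrative only; `polyak_method` and its arguments are my naming, and the projection is passed in as a function):

```python
def polyak_method(f, subgrad, f_star, project, x0, iters):
    # B. Polyak (1967): step of length (f(x_k) - f*)/||f'(x_k)||^2
    # along the subgradient, followed by projection onto Q.
    x = list(x0)
    best = f(x)                       # record value f*_k
    for _ in range(iters):
        g = subgrad(x)
        gap = f(x) - f_star
        if gap == 0:
            break                     # already optimal
        h = gap / sum(gi * gi for gi in g)
        x = project([xi - h * gi for xi, gi in zip(x, g)])
        best = min(best, f(x))
    return x, best
```

On f(x) = |x| with f∗ = 0 the very first step lands exactly at the minimizer, which illustrates why knowing f∗ makes this step size so effective.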

Proof:

Let us fix x∗ ∈ X∗. Denote r_k(x∗) = ‖x_k − x∗‖. Then

r²_{k+1}(x∗) ≤ ‖ x_k − ((f(x_k) − f∗)/‖f′(x_k)‖²) f′(x_k) − x∗ ‖²

= r²_k(x∗) − 2 ((f(x_k) − f∗)/‖f′(x_k)‖²) 〈f′(x_k), x_k − x∗〉 + (f(x_k) − f∗)²/‖f′(x_k)‖²

≤ r²_k(x∗) − (f(x_k) − f∗)²/‖f′(x_k)‖² ≤ r²_k(x∗) − (f∗_k − f∗)²/L²(f).

From this reasoning, ‖x_{k+1} − x∗‖² ≤ ‖x_k − x∗‖² for all x∗ ∈ X∗.

Corollary: Assume X∗ has a recession direction d∗. Then

‖x_k − π_{X∗}(x₀)‖ ≤ ‖x₀ − π_{X∗}(x₀)‖, 〈d∗, x_k〉 ≥ 〈d∗, x₀〉.

(Proof: consider x∗ = π_{X∗}(x₀) + αd∗, α ≥ 0.)

Constrained minimization (N. Shor (1964) & B. Polyak)

II. Problem: min_x {f(x) : g(x) ≤ 0, x ∈ Q}, where

Q is closed and convex,

f, g have uniformly bounded subgradients.

Consider the following method, with step-size parameter h > 0:

If g(x_k) > h‖g′(x_k)‖, then (A): x_{k+1} = π_Q( x_k − (g(x_k)/‖g′(x_k)‖²) g′(x_k) ),

else (B): x_{k+1} = π_Q( x_k − (h/‖f′(x_k)‖) f′(x_k) ).

Let F_k ⊆ {0, . . . , k} be the set of (B)-iterations, and f∗_k = min_{i∈F_k} f(x_i).

Theorem: If k > ‖x₀ − x∗‖²/h², then F_k ≠ ∅ and

f∗_k − f∗ ≤ hL(f), max_{i∈F_k} g(x_i) ≤ hL(g).

Computational strategies

1. Constants L(f), L(g) are known (e.g. Linear Programming).

We can take h = ε / max{L(f), L(g)}. Then we only need to decide on the number of steps N (easy!).

Note: The standard advice is h = R/√(N+1) (much more difficult!).

2. Constants L(f), L(g) are not known.

Start from a guess.

Restart from scratch each time we see the guess is wrong.

The guess is doubled after each restart.

3. Tracking the record value f∗_k.

Double run. Other ideas are welcome!

Application examples

Observations:

1 Very often, Large- and Huge-scale problems have repetitive sparsity patterns and/or limited connectivity:

Social networks.
Mobile phone networks.
Truss topology design (local bars).
Finite-element models (2D: four neighbors, 3D: six neighbors).

2 For p-diagonal matrices, κ(A) ≤ p².

Nonsmooth formulation of the Google Problem

Main property of the spectral radius (A ≥ 0):

If A ∈ R^{n×n}_+, then ρ(A) = min_{x≥0} max_{1≤i≤n} (1/x^{(i)}) 〈e_i, Ax〉.

The minimum is attained at the corresponding eigenvector.

Since ρ(Ē) = 1, our problem is as follows:

f(x) := max_{1≤i≤N} [〈e_i, Ē x〉 − x^{(i)}] → min_{x≥0}.

Interpretation: Maximizing the self-esteem!

Since f∗ = 0, we can apply Polyak's method with sparse updates.

Additional features: the optimal set X∗ is a convex cone. If x₀ = e, then the whole sequence is separated from zero:

〈x∗, e〉 ≤ 〈x∗, x_k〉 ≤ ‖x∗‖₁ · ‖x_k‖_∞ = 〈x∗, e〉 · ‖x_k‖_∞.

Goal: Find x ≥ 0 such that ‖x‖_∞ ≥ 1 and f(x) ≤ ε. (The first condition is satisfied automatically.)

Computational experiments: Iteration Cost

We compare Polyak's GM with sparse updates (GM_s) with the standard one (GM).

Setup: Each agent has exactly p random friends. Thus, κ(A) ≈ p².

Iteration cost: GM_s ≈ p² log₂ N, GM ≈ pN.

Time for 10⁴ iterations (p = 32):

N      κ(A)   GM_s   GM
1024   1632   3.00   2.98
2048   1792   3.36   6.41
4096   1888   3.75   15.11
8192   1920   4.20   139.92
16384  1824   4.69   408.38

Time for 10³ iterations (p = 16):

N        κ(A)  GM_s   GM
131072   576   0.19   213.9
262144   592   0.25   477.8
524288   592   0.32   1095.5
1048576  608   0.40   2590.8

1 sec ≈ 100 min!

Convergence of GM_s: Medium Size

Let N = 131072, p = 16, κ(A) = 576, and L(f) = 0.21.

Iterations  f − f∗   Time (sec)
1.0·10⁵     0.1100   16.44
3.0·10⁵     0.0429   49.32
6.0·10⁵     0.0221   98.65
1.1·10⁶     0.0119   180.85
2.2·10⁶     0.0057   361.71
4.1·10⁶     0.0028   674.09
7.6·10⁶     0.0014   1249.54
1.0·10⁷     0.0010   1644.13

Dimension and accuracy are sufficiently high, but the time is still reasonable.

Convergence of GM_s: Large Scale

Let N = 1048576, p = 8, κ(A) = 192, and L(f) = 0.21.

Iterations  f − f∗     Time (sec)
0           2.000000   0.00
1.0·10⁵     0.546662   7.69
4.0·10⁵     0.276866   30.74
1.0·10⁶     0.137822   76.86
2.5·10⁶     0.063099   192.14
5.1·10⁶     0.032092   391.97
9.9·10⁶     0.016162   760.88
1.5·10⁷     0.010009   1183.59

Final point x∗: ‖x∗‖_∞ = 2.941497, R₀² := ‖x∗ − e‖₂² = 1.2·10⁵.

Theoretical bound: L²(f)R₀²/ε² = 5.3·10⁷. Time for GM: ≈ 1 year!

Conclusion

1 Sparse GM is an efficient and reliable method for solving Large- and Huge-scale problems with uniform sparsity.

2 We can also treat dense rows. Assume that the inequality 〈a, x〉 ≤ b is dense. It is equivalent to the following system:

y^{(1)} = a^{(1)} x^{(1)}, y^{(j)} = y^{(j−1)} + a^{(j)} x^{(j)}, j = 2, . . . , n, y^{(n)} ≤ b.

We need new variables y^{(j)} for all nonzero coefficients of a. This introduces p(a) additional variables and p(a) additional equality constraints. (No problem!)
Hidden drawback: the above equalities are satisfied with errors. Maybe it is not too bad?

3 A similar technique can be applied to dense columns.

Theoretical consequences

Assume that κ(A) ≈ γ²(A)n². Compare three methods:

Sparse updates (SU). Complexity γ²(A)n² · (L²R²/ε²) · log n operations.

Smoothing technique (ST). Complexity γ(A)n² · (LR/ε) operations.

Polynomial-time methods (PT). Complexity (γ(A)n + n³) · n · ln(LR/ε) operations.

There are three possibilities:

Low accuracy: γ(A)·LR/ε < 1. Then we choose SU.

Moderate accuracy: 1 < γ(A)·LR/ε < n². We choose ST.

High accuracy: γ(A)·LR/ε > n². We choose PT.

NB: For Huge-Scale problems usually γ(A) ≈ 1/n, so the choice is governed by comparing LR/ε with n.

Recommended