
Short course: Optimality Conditions and Algorithms in Nonlinear Optimization

Part I - Introduction to nonlinear optimization

Gabriel Haeser

Department of Applied Mathematics
Institute of Mathematics and Statistics
University of São Paulo
São Paulo, SP, Brazil

Santiago de Compostela, Spain, October 28-31, 2014

www.ime.usp.br/~ghaeser

Outline

Part I - Introduction to nonlinear optimization
    Examples and historical notes
    First and Second order optimality conditions
    Penalty methods
    Interior point methods

Part II - Optimality Conditions
    Algorithmic proof of Karush-Kuhn-Tucker conditions
    Sequential Optimality Conditions
    Algorithmic discussion

Part III - Constraint Qualifications
    Geometric Interpretation
    First and Second order constraint qualifications

Part IV - Algorithms
    Augmented Lagrangian methods
    Inexact Restoration algorithms
    Dual methods



Optimization

Optimization is a mathematical problem with many “real world” applications. The goal is to find minimizers or maximizers of a multivariable real function over a restricted domain.

Examples:
    To draw a map of America with areas proportional to the real areas.
    Hard-spheres problem: to place m points on an n-dimensional sphere in such a way that the smallest distance between two points is maximized.


Problem America

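To draw a map of America, similar to the usual map, with areas proportional to real areas.

\[
\begin{aligned}
\text{Minimize } & \tfrac{1}{2} \sum_{i=1}^{m} \| p_i - \bar{p}_i \|^2, \\
\text{Subject to } & \tfrac{1}{2} \sum_{i=1}^{n_j} \left( p_i^x \, p_{i+1}^y - p_{i+1}^x \, p_i^y \right) = \beta_j, \quad j = 1, \dots, c.
\end{aligned}
\]

$c = 17$ countries
$\beta_j$ is the real area of country $j$
$m = 132$ given points $\bar{p}_i$ on the frontiers of the usual map
Green-Gauss formula to compute areas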
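The area constraint above is the Green-Gauss (shoelace) formula applied to the boundary points of each country. A minimal sketch of that formula follows, assuming the vertices are listed in order around the boundary and that the last vertex connects back to the first. The function name `polygon_area` and the two test polygons are illustrative and not part of the course material.

```python
def polygon_area(vertices):
    """Signed area of a simple polygon (positive if counterclockwise)."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x_i, y_i = vertices[i]
        x_next, y_next = vertices[(i + 1) % n]  # wrap around to close the polygon
        total += x_i * y_next - x_next * y_i
    return 0.5 * total


# Illustrative polygons (not data from the map problem).
unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]

square_area = polygon_area(unit_square)
triangle_area = polygon_area(triangle)
print(square_area, triangle_area)
```

In the map problem each such area is a quadratic function of the unknown points, which is why the constraints are nonlinear.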


Problem America

United States (without Alaska and Hawaii) = 8.080.464 km2

Brazil = 8.514.876 km2

Usual map ratio ≈ 1.32

Real ratio ≈ 0.95

[Figures: usual map; map with areas proportional to real areas]

Problem America

[Figures: map with areas proportional to GDP; map with areas proportional to population]


Kissing and hard-spheres problems

The kissing number of dimension n is the largest number of unit spheres that may be put touching an n-dimensional unit sphere without overlapping.

The hard-spheres problem consists of maximizing the smallest distance d between m points on the n-dimensional sphere of radius 2.

n     Kissing number
2     6
3     12
4     24
5     40–44
6     72–78
7     126–134
8     240
9     306–364
10    500–554

$d^* \ge 2 \Rightarrow$ kissing number $\ge m$

n = 2, m = 6, $d^* = 2$
n = 3, m = 12, $d^* \approx 2.194$
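The n = 2 case can be checked directly: six equally spaced points on a circle of radius 2 are pairwise at least 2 apart. The sketch below only evaluates that one configuration and does not solve the max-min problem. The helper name `min_pairwise_distance` is an assumption introduced for illustration.

```python
import math
from itertools import combinations


def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two distinct points."""
    return min(math.dist(p, q) for p, q in combinations(points, 2))


radius = 2.0
m = 6
# m equally spaced points on the circle of radius 2 (the case n = 2).
hexagon = [
    (radius * math.cos(2.0 * math.pi * k / m),
     radius * math.sin(2.0 * math.pi * k / m))
    for k in range(m)
]

d = min_pairwise_distance(hexagon)
print(d)
```

Since this configuration reaches d = 2 with m = 6 points, it is consistent with the kissing number 6 listed for n = 2.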


Applications: Packing

Initial configuration for molecular dynamics


Large scale problems: Finance

Jacek Gondzio and Andreas Grothey (May 2005): convex quadratic program with 353 million constraints and 1010 million variables.

Tool: Interior Point Method


Large scale problems: Localization

Find a point in the rectangle but not in the ellipse such that the sum of the distances to the polygons is minimized.

1.567.804 polygons.
3.135.608 variables.
1.567.804 upper level constraints.
12.833.106 lower level constraints.
Convergence in 10 outer iterations, 56 inner iterations, 133 function evaluations, 185 seconds.

Tool: Augmented Lagrangian method


TANGO Project - www.ime.usp.br/~egbirgin/tango

Trustable Algorithms for Nonlinear General Optimization



40.370 visits registered by Google Analytics since 2007 (more than 3.000 downloads)

USA: 7.969, Brazil: 7.230, Germany: 2.974



Spain: 733


Historical Notes

Military Programs formulated as a system of linear inequalities gave rise to the term Programming in a linear structure (title of the first paper by G. Dantzig, 1948).
Koopmans shortened the term to Linear Programming.
Dorfman (in 1949) thought that Linear Programming was too restrictive and suggested the more general term Mathematical Programming, now called Mathematical Optimization.
Nonlinear Programming is the title of the 1951 paper by Kuhn and Tucker that deals with Optimality Conditions.
These results are the extension of the Lagrange rule of multipliers (1813) to the case of equality and inequality constraints. They had previously been considered in the 1939 unpublished master’s thesis of Karush (KKT conditions).
These works are particularly important because they suggest the development of algorithms to deal with practical problems.


Historical Notes

Linear Programming is part of a revolutionary development that gave humanity the capability to formulate an objective and determine a path of detailed decisions to reach this goal in the best way possible.
Tools: models, algorithms, computers and software.
The impossibility of performing large computations is the main reason, according to Dantzig, for the lack of interest in optimization before 1947.

Important topics in computing: (a) Dealing with sparsity allows for solving larger problems; (b) Global optimization; (c) Automatic differentiation of a function represented in a programming language.


Automatic Differentiation

$f(x_1, x_2) = \sin(x_1) + x_1 x_2$
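One way to differentiate this function automatically is forward mode with dual numbers. The slide does not say which mode it illustrates, so forward mode is an assumption here. The class `Dual` and the helpers `sin` and `gradient` are illustrative names, and only addition, multiplication and sine are supported.

```python
import math


class Dual:
    """Number of the form value + deriv * eps, with eps**2 = 0."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule carried along with the value.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def sin(u):
    """Sine of a dual number (chain rule: d sin(u) = cos(u) du)."""
    return Dual(math.sin(u.value), math.cos(u.value) * u.deriv)


def f(x1, x2):
    return sin(x1) + x1 * x2


def gradient(x1, x2):
    """One forward pass per input variable, seeding its derivative with 1."""
    d1 = f(Dual(x1, 1.0), Dual(x2, 0.0)).deriv
    d2 = f(Dual(x1, 0.0), Dual(x2, 1.0)).deriv
    return d1, d2


g = gradient(0.5, 2.0)
print(g)  # compare with the analytic gradient (cos(x1) + x2, x1)
```

Forward mode needs one pass per input variable; for functions with many inputs and one output, reverse mode is usually preferred.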


Duality

Game theory and linear programming:
1948 - G. Dantzig visited John von Neumann in Princeton.

J. von Neumann, 1963. Discussion of a maximum problem.
D. Gale, H. W. Kuhn, A. W. Tucker, 1951. Linear programming and the theory of games.

Elements of duality:

a pair of optimization problems, one a maximum problem with objective function f and the other a minimum problem with objective function h, based on the same data

for feasible solutions to the pair of problems, always h ≥ f

necessary and sufficient conditions for optimality are h = f


Duality

(Fermat, XVII century): Given 3 points $p_1$, $p_2$ and $p_3$ on the plane, find the point x that minimizes the sum of the distances from x to $p_1$, $p_2$ and $p_3$.


Duality

(Thomas Moss, The Ladies Diary, 1755): “In the three sides of an equiangular field stand three trees, at the distances of 10, 12 and 16 chains from one another: to find the content of the field, it being the greatest the data will admit.”


Duality

(J.D. Gergonne (ed), Annales de Mathématiques Pures et Appliquées, 1810-1811): Given any triangle, circumscribe the largest possible equilateral triangle about it.

Solution given in the 1811-1812 edition by Rochat, Vecten, Fauguier and Pilatte, where duality was acknowledged.


The problem (NLP)

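\[
\begin{aligned}
\text{Minimize } & f(x), \\
\text{Subject to } & h_i(x) = 0, \quad i = 1, \dots, m, \\
& g_j(x) \le 0, \quad j = 1, \dots, p.
\end{aligned}
\]

$f, h_i, g_j : \mathbb{R}^n \to \mathbb{R}$ are (twice) continuously differentiable functions.

$\Omega = \{ x \in \mathbb{R}^n \mid h(x) = 0,\ g(x) \le 0 \}$ (feasible set)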


Solution

Global Solution: A feasible point $x^* \in \Omega$ is a global minimizer of NLP when

\[ f(x^*) \le f(x), \quad \forall x \in \Omega. \]

Local Solution: A feasible point $x^* \in \Omega$ is a local minimizer of NLP when there exists a neighbourhood $B(x^*, \varepsilon)$ of $x^*$ such that

\[ f(x^*) \le f(x), \quad \forall x \in \Omega \cap B(x^*, \varepsilon). \]

$A(x) = \{ j \in \{1, \dots, p\} \mid g_j(x) = 0 \}$ (set of active inequalities at $x \in \Omega$)


Example

\[
\begin{aligned}
\text{Minimize } & x^2 + y^2, \\
\text{Subject to } & x + y - 1 = 0.
\end{aligned}
\]


First order optimality condition - Lagrange multipliers

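\[
\begin{aligned}
\text{Minimize } & x^2 + y^2, \\
\text{Subject to } & x + y - 1 = 0.
\end{aligned}
\]

\[
x = \tfrac{1}{2}, \quad y = \tfrac{1}{2}, \qquad
\begin{pmatrix} 1 \\ 1 \end{pmatrix} + (-1) \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0
\]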
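As a worked check of the multiplier $-1$ above, the stationarity conditions can be solved by hand. The Lagrangian $L$ and the symbol $\lambda$ for the multiplier are notation assumed here, chosen to match the KKT slide.

```latex
\[
L(x, y, \lambda) = x^2 + y^2 + \lambda (x + y - 1)
\]
\[
\frac{\partial L}{\partial x} = 2x + \lambda = 0, \qquad
\frac{\partial L}{\partial y} = 2y + \lambda = 0, \qquad
x + y - 1 = 0
\]
\[
\Rightarrow \quad x = y = -\frac{\lambda}{2}, \qquad -\lambda = 1
\quad \Rightarrow \quad x = y = \tfrac{1}{2}, \quad \lambda = -1 .
\]
```

With these values, $\nabla f = (2x, 2y)^T = (1, 1)^T$ and $\nabla h = (1, 1)^T$, which gives the identity $\nabla f + \lambda \nabla h = 0$ shown on the slide.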


Example

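\[
\begin{aligned}
\text{Maximize } & x^2 + y^2, \\
\text{Subject to } & x + 2y - 2 \le 0, \\
& x \ge 0, \quad y \ge 0.
\end{aligned}
\]

\[
\begin{aligned}
\text{Minimize } & -x^2 - y^2, \\
\text{Subject to } & x + 2y - 2 \le 0, \\
& -x \le 0, \quad -y \le 0.
\end{aligned}
\]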

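\[
x = 2,\ y = 0: \qquad
\begin{pmatrix} -4 \\ 0 \end{pmatrix} + 4 \begin{pmatrix} 1 \\ 2 \end{pmatrix} + 8 \begin{pmatrix} 0 \\ -1 \end{pmatrix} = 0
\]

\[
x = 0,\ y = 1: \qquad
\begin{pmatrix} 0 \\ -2 \end{pmatrix} + 1 \begin{pmatrix} 1 \\ 2 \end{pmatrix} + 1 \begin{pmatrix} -1 \\ 0 \end{pmatrix} = 0
\]

\[
x = 0.4,\ y = 0.8: \qquad
\begin{pmatrix} -0.8 \\ -1.6 \end{pmatrix} + 0.8 \begin{pmatrix} 1 \\ 2 \end{pmatrix} = 0
\]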
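The three identities above can be verified numerically. The sketch below evaluates the gradient condition, the products $\mu_j g_j$ and feasibility for the minimization form of the example. The constraint ordering g1 = x + 2y - 2, g2 = -x, g3 = -y and the helper name `kkt_residuals` are assumptions made for illustration.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = -x^2 - y^2."""
    return (-2.0 * x, -2.0 * y)


# Constraint gradients are constant because all constraints are linear.
GRAD_G = [(1.0, 2.0),    # g1 = x + 2y - 2
          (-1.0, 0.0),   # g2 = -x
          (0.0, -1.0)]   # g3 = -y


def constraints(x, y):
    return [x + 2.0 * y - 2.0, -x, -y]


def kkt_residuals(point, mu):
    """Return (stationarity, complementarity, infeasibility), each as a max-abs value."""
    x, y = point
    gf = grad_f(x, y)
    lagr_grad = [gf[i] + sum(m * g[i] for m, g in zip(mu, GRAD_G)) for i in range(2)]
    g_vals = constraints(x, y)
    stationarity = max(abs(v) for v in lagr_grad)
    complementarity = max(abs(m * g) for m, g in zip(mu, g_vals))
    infeasibility = max(max(g, 0.0) for g in g_vals)
    return stationarity, complementarity, infeasibility


# Points and multipliers (mu1, mu2, mu3) read off from the three identities.
candidates = {
    (2.0, 0.0): (4.0, 0.0, 8.0),
    (0.0, 1.0): (1.0, 1.0, 0.0),
    (0.4, 0.8): (0.8, 0.0, 0.0),
}

results = {p: kkt_residuals(p, mu) for p, mu in candidates.items()}
for point, res in results.items():
    print(point, res)
```

All three points pass the first-order test even though their objective values differ, which is why the second-order condition is needed to tell them apart.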


First order optimality condition - KKT condition

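(Karush-Kuhn-Tucker) Under some condition (constraint qualification), if $x^*$ is a local solution, there exist Lagrange multipliers $\lambda \in \mathbb{R}^m$ and $\mu \in \mathbb{R}^p$ such that:

\[
\begin{aligned}
& \nabla f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla h_i(x^*) + \sum_{j=1}^{p} \mu_j \nabla g_j(x^*) = 0, && \text{(Lagrange condition)} \\
& \mu_j g_j(x^*) = 0, \quad j = 1, \dots, p, && \text{(complementarity)} \\
& h(x^*) = 0, \quad g(x^*) \le 0, && \text{(feasibility)} \\
& \mu \ge 0. && \text{(dual feasibility)}
\end{aligned}
\]

Interpretation: up to first order, a feasible direction cannot be a descent direction.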


Second order optimality condition

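\[
x^* = \begin{pmatrix} 0.4 \\ 0.8 \end{pmatrix}, \quad
\nabla g_1(x^*) = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad
\nabla^2 f(x^*) = \begin{pmatrix} -2 & 0 \\ 0 & -2 \end{pmatrix}.
\]

There exists some $d \in \mathbb{R}^n$ with $\nabla g_1(x^*)^T d \le 0$ and $d^T \nabla^2 f(x^*) d < 0$.

Theorem: Under some conditions, if $x^*$ is a local minimizer, then

\[
d^T \left( \nabla^2 f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla^2 h_i(x^*) + \sum_{j=1}^{p} \mu_j \nabla^2 g_j(x^*) \right) d \ge 0
\]

for every $d \in \mathbb{R}^n$ such that

\[
\begin{aligned}
& \nabla f(x^*)^T d \le 0, \\
& \nabla h_i(x^*)^T d = 0, \quad i = 1, \dots, m, \\
& \nabla g_j(x^*)^T d \le 0, \quad j \in A(x^*).
\end{aligned}
\]

Interpretation: All critical directions must be of ascent nature.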
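For the point (0.4, 0.8) of the earlier example, the second-order condition can be tested on one explicit direction. The sketch below assumes the multiplier 0.8 from the first-order identity and uses the fact that the only active constraint is linear, so the Hessian of the Lagrangian equals the Hessian of f. The direction d = (2, -1) is a hand-picked choice, orthogonal to the gradient of the active constraint.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quad_form(H, d):
    """Value of d^T H d for a small dense matrix H."""
    return sum(d[i] * H[i][j] * d[j] for i in range(len(d)) for j in range(len(d)))


x_star = (0.4, 0.8)
grad_f = (-2.0 * x_star[0], -2.0 * x_star[1])   # f(x, y) = -x^2 - y^2
grad_g1 = (1.0, 2.0)                            # g1 = x + 2y - 2 (active at x_star)
hess_lagrangian = ((-2.0, 0.0), (0.0, -2.0))    # g1 is linear, so only f contributes

d = (2.0, -1.0)  # hand-picked direction with grad_g1 . d = 0

# d is critical if it satisfies the inequalities of the theorem.
is_critical = dot(grad_f, d) <= 1e-12 and dot(grad_g1, d) <= 1e-12
curvature = quad_form(hess_lagrangian, d)
print(is_critical, curvature)
```

A critical direction with negative curvature violates the necessary condition, so (0.4, 0.8) cannot be a local minimizer of the minimization form, although it satisfies the first-order conditions.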


History of nonlinear programming

Kuhn, Tucker, 1951. Nonlinear programming.

Albert William Tucker (1905 - 1995)
Princeton University
Topology

Harold William Kuhn (1925 - 2014)
Princeton University
PhD 1950, Algebra
Game Theory, Optimization

Saddle point problem

φ(x∗, u) ≤ φ(x∗, u∗) ≤ φ(x, u∗), ∀x, u



William Karush (1917-1997)

1939. Minima of Functions of Several Variables with Inequalities as Side Conditions. M.Sc. thesis, Department of Mathematics, University of Chicago

Calculus of Variations and Optimization

University of Chicago and California State University (also Manhattan Project)

“I concluded that you two had exploited and developed the subject so much further than I, that there was no justification for my announcing to the world, ‘Look what I did, first.’”, 1975.



Fritz John (1910 - 1994)

1948. Extremum problems with inequalities assubsidiary conditions.

PhD 1933 in Göttingen under Courant
New York University

Partial differential equations, convex geometry,nonlinear elasticity



Fritz John (1910 - 1994)

Let S be a bounded set in $\mathbb{R}^m$. Find the sphere of least positive radius enclosing S.

\[
\begin{aligned}
\text{Minimize } & F(x) := x_{m+1}, \\
\text{Subject to } & G(x, y) := x_{m+1} - \sum_{i=1}^{m} (x_i - y_i)^2 \ge 0 \quad \text{for all } y \in S.
\end{aligned}
\]

The boundary of a compact convex set S in $\mathbb{R}^n$ lies between two homothetic ellipsoids of ratio $\le n$, and the outer ellipsoid can be taken to be the ellipsoid of least volume containing S.


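Snell’s law of refraction

\[ \frac{\sin\theta_y}{v_y} = \frac{\sin\theta_z}{v_z} \]

\[
\begin{aligned}
\text{Minimize } & T(x) := \frac{\|x - y\|}{v_y} + \frac{\|x - z\|}{v_z}, \\
\text{Subject to } & h(x) = 0.
\end{aligned}
\]

At the solution $x^*$,

\[ \nabla T(x^*) = \frac{x^* - y}{v_y \|y - x^*\|} + \frac{x^* - z}{v_z \|z - x^*\|} \]

is parallel to $\nabla h(x^*)$, the normal vector to the surface.

Define $\bar{y} = x^* + \dfrac{y - x^*}{v_y \|y - x^*\|}$ and $\bar{z} = x^* + \dfrac{z - x^*}{v_z \|z - x^*\|}$. Hence $-\nabla T(x^*) = (\bar{y} - x^*) + (\bar{z} - x^*)$ is the diagonal of the parallelogram with sides $\bar{y} - x^*$ and $\bar{z} - x^*$.

[Figure: parallelogram with vertex $x^*$ and sides $\bar{y} - x^*$ and $\bar{z} - x^*$]

By triangular similarity, $\bar{y}$ and $\bar{z}$ are equally far from the normal line. Hence $\|\bar{y} - x^*\| \sin\theta_y = \|\bar{z} - x^*\| \sin\theta_z$. The calculation $\|\bar{y} - x^*\| = \dfrac{1}{v_y}$ and $\|\bar{z} - x^*\| = \dfrac{1}{v_z}$ yields Snell’s law.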
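The derivation can be checked numerically on a flat interface. The sketch below assumes the surface is the line $x_2 = 0$ in the plane, with illustrative points y, z and speeds $v_y$, $v_z$ that are not taken from the slides. It minimizes the travel time by ternary search, which is valid here because T restricted to the line is convex, and then compares both sides of Snell's law.

```python
import math

# Illustrative data: y above the interface, z below it.
vy, vz = 1.0, 0.5
y = (0.0, 1.0)
z = (1.0, -1.0)


def travel_time(t):
    """Time along y -> (t, 0) -> z, where (t, 0) lies on the interface x2 = 0."""
    return math.hypot(t - y[0], y[1]) / vy + math.hypot(t - z[0], z[1]) / vz


def ternary_search(func, lo, hi, iterations=200):
    """Minimize a unimodal function on [lo, hi]."""
    for _ in range(iterations):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if func(a) < func(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)


t_star = ternary_search(travel_time, 0.0, 1.0)

# Angles are measured from the normal to the interface (the x2 axis).
sin_theta_y = (t_star - y[0]) / math.hypot(t_star - y[0], y[1])
sin_theta_z = (z[0] - t_star) / math.hypot(t_star - z[0], z[1])

lhs = sin_theta_y / vy
rhs = sin_theta_z / vz
print(t_star, lhs, rhs)
```

The two printed ratios agree up to the accuracy of the one-dimensional search, which is limited by the flatness of T near its minimizer.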


External Penalty Method

Choose a sequence $\{\rho_k\}$ with $\rho_k \to +\infty$ and for each k solve the problem

\[ \text{Minimize } f(x) + \rho_k P(x), \]

obtaining the (global) solution $x_k$, if it exists.

P is a smooth function
$P(x) \ge 0$
$P(x) = 0 \Leftrightarrow h(x) = 0,\ g(x) \le 0$

For example: $P(x) = \|h(x)\|_2^2 + \|\max\{0, g(x)\}\|_2^2$

Theorem: If $\{x_k\}$ is well defined, then every limit point of $\{x_k\}$ is a global solution to Minimize $P(x)$.

Theorem: If $\{x_k\}$ is well defined and there exists a point where the function P vanishes (feasible region is not empty), then every limit point of $\{x_k\}$ is a global solution of
Minimize $f(x)$, Subject to $h(x) = 0$, $g(x) \le 0$.

The External Penalty Method can be used as a theoretical tool to prove the KKT conditions, and it can also be adjusted to become an efficient algorithm (augmented Lagrangian method).



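\[
\begin{aligned}
\text{Minimize } & x_1^2 + x_2^2, \\
\text{Subject to } & x_1 - 1 = 0, \\
& x_2 - 1 \le 0.
\end{aligned}
\]

\[
\text{Minimize } x_1^2 + x_2^2 + \rho_k \left( (x_1 - 1)^2 + \max\{0, x_2 - 1\}^2 \right) \quad (= \Phi_k(x)).
\]

Solving $\nabla \Phi_k(x) = 0$ we get $x_k = \left( \dfrac{\rho_k}{1 + \rho_k},\ 0 \right) \to (1, 0)$.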
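The convergence of $x_k$ to (1, 0) can be reproduced numerically. The sketch below minimizes the penalized function of this example for an increasing sequence of penalty parameters, using a Newton iteration that exploits the separable, piecewise quadratic structure of the function. The starting point, the sequence of parameters and the function names are illustrative assumptions.

```python
def grad_phi(x, rho):
    """Gradient of x1^2 + x2^2 + rho * ((x1 - 1)^2 + max(0, x2 - 1)^2)."""
    x1, x2 = x
    violation = max(0.0, x2 - 1.0)
    return (2.0 * x1 + 2.0 * rho * (x1 - 1.0),
            2.0 * x2 + 2.0 * rho * violation)


def hess_diag(x, rho):
    """Diagonal of the (generalized) Hessian; the function is separable."""
    x2 = x[1]
    return (2.0 + 2.0 * rho,
            2.0 + (2.0 * rho if x2 > 1.0 else 0.0))


def minimize_phi(rho, x0, max_iter=50):
    """Newton iteration on the penalized function, started at x0."""
    x = x0
    for _ in range(max_iter):
        g = grad_phi(x, rho)
        if max(abs(g[0]), abs(g[1])) <= 1e-10 * (1.0 + rho):
            break
        h = hess_diag(x, rho)
        x = (x[0] - g[0] / h[0], x[1] - g[1] / h[1])
    return x


x = (2.0, 2.0)  # infeasible starting point (illustrative)
iterates = []
for rho in [1.0, 10.0, 100.0, 1e4, 1e6]:
    x = minimize_phi(rho, x)  # warm start from the previous solution
    iterates.append((rho, x))
    print(rho, x)
```

Each computed $x_k$ has first component $\rho_k / (1 + \rho_k)$, so the iterates approach the solution from outside the feasible set, which is what "external" refers to.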



Internal Penalty Method

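Choose a sequence $\{\mu_k\}$ with $\mu_k \to 0^+$ and for each k solve the problem

\[
\begin{aligned}
\text{Minimize } & f(x) + \mu_k B(x), \\
\text{Subject to } & h(x) = 0, \\
& g(x) < 0.
\end{aligned}
\]

B is smooth
$B(x) \ge 0$
$B(x) \to +\infty$ if some $g_i(x) \to 0$ with $g(x) < 0$.

For example: $B(x) = -\sum_{i=1}^{p} \log(-g_i(x))$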
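A one-dimensional worked example shows how the barrier subproblems behave. The problem below (minimize $x$ subject to $1 - x \le 0$) is a hypothetical illustration, not taken from the slides, and it uses the logarithmic barrier.

```latex
\[
\text{Minimize } x \quad \text{Subject to } g(x) = 1 - x \le 0,
\qquad B(x) = -\log(-g(x)) = -\log(x - 1).
\]
\[
\text{Subproblem: } \min_{x > 1} \ \varphi_k(x) = x - \mu_k \log(x - 1),
\qquad \varphi_k'(x) = 1 - \frac{\mu_k}{x - 1} = 0
\ \Rightarrow \ x_k = 1 + \mu_k .
\]
\[
\mu_k \to 0^+ \ \Rightarrow \ x_k \to 1 = x^* .
\]
```

Every $x_k$ is strictly feasible, so the solution is approached from the interior of the feasible set, in contrast with the external penalty method.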


Interior Point Method

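Consider the convex quadratic problem

\[
\begin{aligned}
\text{Minimize } & c^T x + \tfrac{1}{2} x^T Q x, \\
\text{Subject to } & Ax = b, \\
& x \ge 0,
\end{aligned}
\]

and the barrier subproblem

\[
\begin{aligned}
\text{Minimize } & c^T x + \tfrac{1}{2} x^T Q x - \mu \sum_{j=1}^{n} \log x_j, \\
\text{Subject to } & Ax = b, \\
& x > 0.
\end{aligned}
\]

KKT condition

\[
\begin{aligned}
c - A^T \lambda + Qx - \mu X^{-1} e &= 0, \\
Ax &= b,
\end{aligned}
\]

where $X^{-1} = \mathrm{diag}\{x_1^{-1}, \dots, x_n^{-1}\}$ and $e = (1, \dots, 1)^T$. Denoting $s = \mu X^{-1} e$ we get

\[
\begin{aligned}
A^T \lambda + s - Qx &= c, \\
Ax &= b, \\
XSe &= \mu e, \quad (x, s) > 0.
\end{aligned}
\]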



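Active-set methods

\[
\begin{aligned}
A^T \lambda + s - Qx &= c, \\
Ax &= b, \\
XSe &= 0, \\
(x, s) &\ge 0.
\end{aligned}
\]

Interior point methods

\[
\begin{aligned}
A^T \lambda + s - Qx &= c, \\
Ax &= b, \\
XSe &= \mu e, \\
(x, s) &> 0.
\end{aligned}
\]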



Complementarity: $x_i s_i = 0,\ \forall i = 1, \dots, n$.

Active-set methods try to guess the optimal active subset $A \subseteq \{1, \dots, n\}$ and set $x_i = 0$ for $i \in A$ (active constraints), $s_i = 0$ for $i \notin A$ (inactive constraints).

Interior point methods use ε-mathematics:
Replace $x_i s_i = 0,\ \forall i = 1, \dots, n$
by $x_i s_i = \mu,\ \forall i = 1, \dots, n$.

Force convergence by letting $\mu \to 0^+$.



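Solve the nonlinear system of equations

\[ f(x, \lambda, s) = 0, \]

where $f : \mathbb{R}^{2n+m} \to \mathbb{R}^{2n+m}$ is the mapping:

\[
f(x, \lambda, s) =
\begin{pmatrix}
A^T \lambda + s - Qx - c \\
Ax - b \\
XSe - \mu e
\end{pmatrix}.
\]

Newton direction:

\[
\begin{pmatrix}
-Q & A^T & I \\
A & 0 & 0 \\
S & 0 & X
\end{pmatrix}
\begin{pmatrix}
\Delta x \\ \Delta \lambda \\ \Delta s
\end{pmatrix}
=
\begin{pmatrix}
c - A^T \lambda - s + Qx \\
b - Ax \\
\mu e - XSe
\end{pmatrix}.
\]

Reduce µ at each Newton iteration.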



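Algorithm:

Step 0: Choose $(x_0, \lambda_0, s_0)$ with $(x_0, s_0) > 0$, $\mu_0 > 0$ and parameters $0 < \gamma < 1$ and $\varepsilon > 0$. Set $k = 0$.

Step 1: Compute the Newton direction $(\Delta x, \Delta \lambda, \Delta s)$ at $(x, \lambda, s) := (x_k, \lambda_k, s_k)$ with $\mu := \mu_k$.

Step 2: Choose a stepsize $\alpha$ such that $(x_k + \alpha \Delta x, s_k + \alpha \Delta s) > 0$ and set $(x_{k+1}, \lambda_{k+1}, s_{k+1}) = (x_k, \lambda_k, s_k) + \alpha (\Delta x, \Delta \lambda, \Delta s)$.

Step 3: Update $\mu_{k+1} = \gamma \mu_k$.

Step 4: If $x_{k+1}^T s_{k+1} \le \varepsilon\, x_0^T s_0$, stop. Else set $k := k + 1$ and go to Step 1.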
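The steps above can be sketched in plain Python for a tiny problem. This is only an illustration under stated assumptions: the stepsize is a simple rule (at most 1, and at most 0.9 of the distance to the boundary), not an exact line search; the linear solver is a basic Gaussian elimination; and the test problem, the starting point and all function names are invented for the example.

```python
import math


def solve_linear(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (aug[r][n] - sum(aug[r][c] * z[c] for c in range(r + 1, n))) / aug[r][r]
    return z


def newton_direction(Q, A, b, c, x, lam, s, mu):
    """Assemble and solve the Newton system in the unknowns (dx, dlam, ds)."""
    n, m = len(x), len(lam)
    size = 2 * n + m
    M = [[0.0] * size for _ in range(size)]
    rhs = [0.0] * size
    for i in range(n):
        for j in range(n):
            M[i][j] = -Q[i][j]                 # block -Q
        for j in range(m):
            M[i][n + j] = A[j][i]              # block A^T
        M[i][n + m + i] = 1.0                  # block I
        rhs[i] = (c[i] - sum(A[j][i] * lam[j] for j in range(m)) - s[i]
                  + sum(Q[i][j] * x[j] for j in range(n)))
        M[n + m + i][i] = s[i]                 # block S
        M[n + m + i][n + m + i] = x[i]         # block X
        rhs[n + m + i] = mu - x[i] * s[i]
    for j in range(m):
        M[n + j][:n] = A[j]                    # block A
        rhs[n + j] = b[j] - sum(A[j][i] * x[i] for i in range(n))
    z = solve_linear(M, rhs)
    return z[:n], z[n:n + m], z[n + m:]


def interior_point_qp(Q, A, b, c, x, lam, s, mu, gamma, eps=1e-8, max_iter=500):
    gap0 = sum(xi * si for xi, si in zip(x, s))
    for k in range(max_iter):
        dx, dlam, ds = newton_direction(Q, A, b, c, x, lam, s, mu)   # Step 1
        alpha = 1.0                                                  # Step 2
        for v, dv in zip(x + s, dx + ds):
            if dv < 0.0:
                alpha = min(alpha, -0.9 * v / dv)   # stay strictly positive
        x = [xi + alpha * d for xi, d in zip(x, dx)]
        lam = [li + alpha * d for li, d in zip(lam, dlam)]
        s = [si + alpha * d for si, d in zip(s, ds)]
        mu *= gamma                                                  # Step 3
        if sum(xi * si for xi, si in zip(x, s)) <= eps * gap0:       # Step 4
            break
    return x, lam, s, k + 1


# Illustrative problem: minimize -x1 + x2 + (x1^2 + x2^2)/2, x1 + x2 = 1, x >= 0.
Q = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
b = [1.0]
c = [-1.0, 1.0]
n = 2
gamma = n / (n + math.sqrt(n))

x_opt, lam_opt, s_opt, iters = interior_point_qp(
    Q, A, b, c, x=[0.5, 0.5], lam=[0.0], s=[1.0, 1.0], mu=0.5, gamma=gamma)
print(x_opt, s_opt, iters)
```

For this problem the KKT system can be solved by hand, giving x = (1, 0), s = (0, 1) and λ = 0, which the iterates should approach as µ decreases.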



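Consider the merit function

\[
\psi(x, s) = (n + \sqrt{n}) \log(x^T s) - \sum_{i=1}^{n} \log(x_i s_i).
\]

(Note that $\psi(x, s) \to -\infty \Rightarrow x^T s \to 0$.)

Choosing the stepsize $\alpha$ that minimizes $\psi(x_k + \alpha \Delta x, s_k + \alpha \Delta s)$ (exact line search) we get:

Theorem: If $\gamma = \dfrac{n}{n + \sqrt{n}}$, we have $x_k^T s_k \le \varepsilon\, x_0^T s_0$ in $O\!\left(\sqrt{n} \log\left(\dfrac{n}{\varepsilon}\right)\right)$ iterations.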


Algorithms

There is no “direct method” to solve NLP.
NLP is solved using iterative methods.
An iterative method generates a sequence of points $x_k \in \mathbb{R}^n$ that converges (or not) to a solution of the problem.
Iterative methods are programmed and implemented on computers, where real mathematical operations are replaced by floating point operations.


Algorithms

Theory is necessary to avoid performing an infinite number of experiments.
Useful theory should be able to predict the behavior of many experiments.
Usually, the theory does not refer to the real sequences generated by the computer, but to theoretical sequences defined by the algorithms.
The analogy between real sequences and theoretical sequences is not perfect.
There are practical phenomena that the theory is not able to predict, but relevant theory is the one that contributes to explaining practical phenomena.

