
Overview on Optimization Algorithms - NTNU


Page 1: Overview on Optimization Algorithms - NTNU


Overview on Optimization Algorithms

Markus A. Köbis

NTNU, TMA4180, Optimering I, Spring 2021

2021-02-26

1

Page 2: Overview on Optimization Algorithms - NTNU


A Review Session

Idea
• OPTION A - a live session with a presentation highlighting the ‘big picture’ of what we have learned so far

What we don’t do today
• Learn new material ;-)
• Go through proofs

Questions are welcome!

2

Page 3: Overview on Optimization Algorithms - NTNU


Talking about the ‘Big Picture’ (I)

Topics of Today:

3

Page 4: Overview on Optimization Algorithms - NTNU


Talking about the ‘Big Picture’ (II)

Topics of Today:

Recap of Theory and Methods for Unconstrained Optimization Problems

min_{x ∈ R^n} f(x)

• (Practical) Problems and Challenges
• Theoretical Considerations
• Algorithms

4

Page 5: Overview on Optimization Algorithms - NTNU


Talking about the ‘Big Picture’ (III)

[Overview diagram] Organization:
• Problem: Convexity, Continuity, Optimality Conditions
• (Descent) Methods: Gradient, Line Search, Newton’s Method, Linear CG, Gauss-Newton

5

Page 6: Overview on Optimization Algorithms - NTNU


Problems and Challenges (I)

Optimization Problems in Science and Engineering

• Physics: Hamilton principle
• Engineering: Shape/Trajectory Optimization
• Economics: Portfolio Optimization, Return Maximization, Cost Minimization
• Social Sciences: Experiment Design
• Data Fitting
• Chemistry/Biology: Reactor Optimization

6

Page 7: Overview on Optimization Algorithms - NTNU


Taxonomy of Optimization Problems (not exhaustive)

[Taxonomy tree, cf. the NEOS optimization guide] Main branches: Uncertainty (Stochastic Programming, Robust Optimization), Deterministic (Continuous, Discrete), Multiobjective Optimization. Discrete: Integer Programming, Combinatorial Optimization. Continuous: Unconstrained, Constrained. Further classes in the tree: Nonlinear Least Squares, Nonlinear Equations, Nondifferentiable Optimization, Global Optimization, Nonlinear Programming, Network Optimization, Bound Constrained, Linearly Constrained, Mixed Integer Nonlinear Programming, Semidefinite Programming, Semiinfinite Programming, Mathematical Programs with Equilibrium Constraints, Derivative-Free Optimization, Linear Programming, Quadratic Programming, Second-Order Cone Programming, Complementarity Problems, Quadratically-Constrained Quadratic Programming.

Source: https://neos-guide.org/content/optimization-taxonomy

7

Page 8: Overview on Optimization Algorithms - NTNU


Quality of Algorithms
Properties of ‘good’ algorithms (Why isn’t there a best one?)
• Accuracy
• Efficiency (also: scalability)
  • Speed
    • Total execution time (avoid NP if possible, parallel vs. sequential)
    • Alternative: approximation quality after a fixed time
    • Number of computer operations
      • Core function / linear algebra / GPU-routine calls
      • Floating point operations (plus, times, roots, exp./trig., ...)
      • Bit operations
      • Number of f / ∇f / ∇²f calls
  • Memory
• Robustness
  • Reliably find all solutions for a variety of problems
  • Feedback if the algorithm is unsuccessful, reproducibility
  • Provide additional information on problem structure
• User-friendliness, ability to include expert knowledge, interactivity
• Easily extendable and maintainable
• Cheap, trusted, supported, ...

8

Page 9: Overview on Optimization Algorithms - NTNU


Building the ‘Big Picture’

Mathematical Foundations ↔ Algorithms

9

Page 10: Overview on Optimization Algorithms - NTNU


When Do Minimizers Exist?
Weierstrass’ theorem

Weierstrass Theorem
A continuous function on a compact domain attains its minimum.

• In practice: very often enough
• Can one weaken the ‘continuity’ requirement? (e.g. complicated functions in technical applications)
• Can one weaken the ‘compactness’ requirement? (e.g. optimal control / problems in mathematical physics)

Karl Weierstrass, 1815–1897. Source: Public Domain, https://commons.wikimedia.org/w/index.php?curid=324146

10

Page 11: Overview on Optimization Algorithms - NTNU


Convexity

Convex Set
A set S ⊆ R^n is convex if for all x1, x2 ∈ S and λ ∈ [0, 1] the convex combination stays in the set:
    λ·x1 + (1 − λ)·x2 ∈ S.

Convex Function
A function f : S → R is convex if its domain S is a convex set and if for any two points x1, x2 ∈ S and λ ∈ [0, 1] the following property is satisfied:
    f(λ·x1 + (1 − λ)·x2) ≤ λ·f(x1) + (1 − λ)·f(x2).

[Figure: graph of a convex function f(x); the chord connecting (x1, f(x1)) and (x2, f(x2)) lies above the graph]
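The defining inequality can be probed numerically on sampled convex combinations. A minimal Python sketch (the example functions are assumptions for illustration, not from the slides); note that such a test can only refute convexity, never prove it:

import numpy as np

def looks_convex(f, dim, trials=10_000, seed=0):
    """Sample random convex combinations and test the convexity inequality."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x1, x2 = rng.normal(size=(2, dim))
        lam = rng.uniform()
        lhs = f(lam * x1 + (1 - lam) * x2)
        rhs = lam * f(x1) + (1 - lam) * f(x2)
        if lhs > rhs + 1e-12:
            return False   # found a violating pair: f is not convex
    return True

print(looks_convex(lambda x: np.sum(x**2), dim=3))        # True (convex)
print(looks_convex(lambda x: np.sin(np.sum(x)), dim=3))   # False (not convex)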

11

Page 12: Overview on Optimization Algorithms - NTNU


(Lower Semi-) Continuity

Definition
A function f : R^n → R is called lower semi-continuous if for every x ∈ R^n and every sequence (x_k)_{k∈N} converging to x we have
    f(x) ≤ lim inf_{k→∞} f(x_k).

[Figure: a lower semi-continuous function f(x) with level α and lower level set Ω_α]

Remark
An alternative (equivalent) definition of lower semi-continuity is the following: a function f : R^n → R is lower semi-continuous if the lower level set
    Ω_α(f) := {x ∈ R^n : f(x) ≤ α}
is closed for every α ∈ R.

12

Page 13: Overview on Optimization Algorithms - NTNU


Coercivity

Definition
A function f : R^n → R is called coercive if for every sequence (x_k)_{k∈N} with ‖x_k‖ → ∞ we have f(x_k) → ∞.

[Figure: graph of a coercive function f(x)]

13

Page 14: Overview on Optimization Algorithms - NTNU


Weierstrass revisited

Proposition
Assume that f : R^n → R is lower semi-continuous and that Ω ⊂ R^n is compact. Then the optimization problem
    min_{x ∈ Ω} f(x)
has at least one global minimizer.

Theorem
Assume that the function f : R^n → R is lower semi-continuous and coercive. Then the optimization problem min_{x ∈ R^n} f(x) admits at least one global minimizer.

14

Page 15: Overview on Optimization Algorithms - NTNU


The Role of Convexity

Theorem
If f : R^n → R is convex, any local minimizer is a global minimizer. (This implies that optimality conditions (see below) can be used for global optimization.)

15

Page 16: Overview on Optimization Algorithms - NTNU


Designing Algorithms

Optimization Problem:
    min_{x ∈ R^n} f(x)

How to proceed?
Design a computer method that finds good points in R^n until you are finished.
(a) Can the search be based on existing mathematical tools? (e.g. solvers for (non-)linear systems, feasibility problem solvers, simulation methods, etc.)
(b) How can you know that the algorithm is ‘finished’?
→ Optimality Conditions

16

Page 17: Overview on Optimization Algorithms - NTNU


Optimality Conditions (I)
First Order Necessary Condition
Let x∗ ∈ R^n be a local minimizer, and let f : R^n → R be continuously differentiable in an open neighborhood of x∗. Then
    ∇f(x∗) = 0.
This condition classifies x∗ as a stationary point.

17

Page 18: Overview on Optimization Algorithms - NTNU


Optimality Conditions (II)

Second Order Necessary Condition
Let x∗ ∈ R^n be a local minimizer of f : R^n → R. If ∇²f exists and is continuous in an open neighborhood of x∗, then ∇f(x∗) = 0 and ∇²f(x∗) is positive semi-definite.

Second Order Sufficient Condition
Let ∇²f be continuous in an open neighborhood of x∗ ∈ R^n, ∇f(x∗) = 0, and let ∇²f(x∗) be positive definite. Then x∗ is a strict local minimizer.

Geometry/Practice
• Properties of a quadratic approximation of f
• In (large-scale) algorithms often not used, or only used approximately
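These conditions can be verified numerically at a candidate point. A minimal Python sketch (the quadratic objective and its derivatives below are assumed for illustration):

import numpy as np

def grad_f(x):                      # gradient of f(x) = x1^2 + 2*x2^2
    return np.array([2*x[0], 4*x[1]])

def hess_f(x):                      # Hessian of the same f
    return np.array([[2.0, 0.0], [0.0, 4.0]])

x_star = np.zeros(2)                # candidate point
g = grad_f(x_star)
eigs = np.linalg.eigvalsh(hess_f(x_star))

first_order = np.linalg.norm(g) < 1e-10              # ∇f(x∗) = 0
second_order_sufficient = first_order and np.all(eigs > 0)
print(first_order, eigs, second_order_sufficient)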

18

Page 19: Overview on Optimization Algorithms - NTNU


Designing Algorithms (II)

min_{x ∈ R^n} f(x)

Algorithm: Computerized function (input → output scheme) with limited internal memory, calculation power and precision.

Picture: Point cloud

19

Page 20: Overview on Optimization Algorithms - NTNU


Designing Algorithms (III)
Versatile methods for ‘any’ optimization problem

1. First Attempt: Grid Search: Evaluate f along a predefined grid and choose some good grid points around which to refine further. ‘Curse of Dimension’: the number of grid points grows exponentially with the dimension n of x (see the sketch after this list).

2. Improvement (a): Solve a series of 1d-problems using ‘classical methods’. How to define these?

3. Improvement (b): Algorithms based on heuristics/physics/biology, e.g. the Nelder-Mead simplex method, simulated annealing, evolutionary algorithms.

4. (Often) Better: Iterative Procedure: Choose a starting point x_0 ∈ R^n and search along a sequence
       x_0, x_1, x_2, ..., x_k, x_{k+1}, ... → x∗
   with improved properties of x_{k+1} over x_k.

Given the scarce information (f is just input → output), the only way to measure ‘improved properties’ is through function values → descent methods.
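To make the ‘curse of dimension’ in item 1 concrete, here is a small sketch (the grid resolution N = 10 is an arbitrary choice): a full grid with N points per coordinate needs N^n function evaluations in dimension n.

N = 10                      # grid points per coordinate (assumed resolution)
for n in (1, 2, 5, 10, 20):
    print(f"n = {n:2d}: {N**n} grid points")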

Scaling invariance: also important to account for easy extension and maintenance of code.
20

Page 21: Overview on Optimization Algorithms - NTNU


Descent Methods (I)
Basic Idea of Descent Methods
1. Start with an initial x_0 ∈ R^n.
2. For k = 0, 1, 2, ...:
   find x_{k+1} such that f(x_{k+1}) < f(x_k);
   if some stopping criterion is fulfilled: STOP.
Output: x∗ = x_{k+1}
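A minimal Python sketch of this loop (the toy objective, the fixed step length, and the gradient-based stopping criterion are assumptions; the slides leave the choice of x_{k+1} open):

import numpy as np

def f(x):
    return 0.5 * x @ x            # toy objective

def grad_f(x):
    return x

def descent_method(x0, alpha=0.5, tol=1e-8, max_iter=1000):
    x = x0
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # stopping criterion
            break
        p = -g                        # a descent direction (steepest descent)
        x = x + alpha * p             # step so that f decreases
    return x

x_star = descent_method(np.array([3.0, -2.0]))
print(x_star)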

How to find a better x? Define the search direction:
    λ·p_k := x_{k+1} − x_k  ⇔  x_{k+1} = x_k + λ·p_k
When is p_k good?

Definition
For a given x_k ∈ R^n, a direction (a nonzero vector of undefined length) p_k ∈ R^n is called a descent direction if there exists ᾱ > 0 such that
    f(x_k + α·p_k) < f(x_k)   for all α ∈ (0, ᾱ).
[Figure: the one-dimensional slice α ↦ f(x_k + α·p_k), decreasing for small α > 0]
21

Page 22: Overview on Optimization Algorithms - NTNU


Descent Methods (II)

Lemma: Characterizing descent directions for differentiable f
Let f : R^n → R be continuously differentiable.
1. Any direction p satisfying ∇f(x)^T·p < 0 is a descent direction.
2. The negative gradient −∇f(x) is a descent direction (if not zero).
3. For any positive definite matrix B ∈ R^{n×n}, the direction −B·∇f(x) is a descent direction (if x is not stationary).
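A quick numerical illustration of item 3 (the data are assumed, randomly generated): for a positive definite B, the direction p = −B·∇f(x) satisfies ∇f(x)^T·p < 0.

import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=3)              # some nonzero gradient ∇f(x)
M = rng.normal(size=(3, 3))
B = M @ M.T + 3 * np.eye(3)         # positive definite matrix
p = -B @ g                          # candidate search direction

print(g @ p)                        # negative, so p is a descent direction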

22

Page 23: Overview on Optimization Algorithms - NTNU


Line Search (I)

In the general form of the descent algorithm,
    x_{k+1} := x_k + α_k·p_k,
what is a good α_k?
1. Constant α:
   • Good for some special algorithms and when the problem structure is well understood
   • α too large: ‘overshooting’, the method may diverge
   • α too small: potentially many steps and slow convergence
2. Determining a perfect value, exact line search: α_k := argmin_α Φ_k(α), where Φ_k(α) := f(x_k + α·p_k)
   • Good if the Φ_k-minimization is easy.
3. Inexact line searches: use heuristics to obtain adequate step lengths at justifiable cost.

23

Page 24: Overview on Optimization Algorithms - NTNU


Line Search (I): Backtracking

(Sub-)Algorithm: Backtracking
Problem: Find a good/quick approximation to the solution of min_α Φ_k(α).
Parameters: initial guess ᾱ ∈ R, reduction constant ρ ∈ (0, 1)
    Set α = ᾱ.
    While some stopping criterion for Φ_k(α) is not fulfilled:
        set α = ρ·α.

In Practice
• Also set a maximum number of iterations
• Sometimes: re-use α from a previous run of the algorithm
• Improvements based on 1d equation solvers (regula falsi, interpolation, etc.)
• ρ ∈ (0.4, 0.7)
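A minimal Python sketch of backtracking (the Armijo condition from the next slide is used as the stopping criterion; f, grad_f and the parameter values are assumptions for illustration):

import numpy as np

def backtracking(f, grad_f, x, p, alpha_init=1.0, rho=0.5, c1=1e-4, max_iter=50):
    """Shrink alpha until f(x + alpha*p) satisfies sufficient decrease."""
    alpha = alpha_init
    fx = f(x)
    slope = grad_f(x) @ p                   # Φ'_k(0) = ∇f(x)^T p, must be < 0
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + c1 * alpha * slope:   # Armijo condition
            break
        alpha *= rho                        # reduce the step length
    return alpha

# usage on a toy quadratic
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([2.0, -1.0]); p = -grad_f(x)
print(backtracking(f, grad_f, x, p))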

24

Page 25: Overview on Optimization Algorithms - NTNU


Line Search (II)

Armijo (Sufficient Decrease) Condition
Given c1 ∈ (0, 1), we require
    f(x_k + α_k·p_k) ≤ f(x_k) + c1·α_k·∇f(x_k)^T·p_k,
i.e. Φ_k(α_k) ≤ Φ_k(0) + α_k·c1·Φ'_k(0) =: l(α_k).

[Figures: the graph of α ↦ f(x_k + α·p_k) with the step lengths acceptable under the Armijo condition, and with the desired minimal slope of the curvature condition]

Wolfe (Curvature) Condition(s)
Given, furthermore, a constant c2 ∈ (c1, 1), we require additionally
    ∇f(x_k + α_k·p_k)^T·p_k ≥ c2·∇f(x_k)^T·p_k,
i.e. Φ'_k(α_k) ≥ c2·Φ'_k(0).

Strong Wolfe condition: also restrict the positive slope.

Existence of such an α_k is guaranteed if f is smooth and bounded from below along the search line.
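A small sketch (the toy objective is assumed) that checks whether a given step length satisfies both the Armijo and the Wolfe curvature condition:

import numpy as np

def satisfies_wolfe(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    phi0 = f(x)
    dphi0 = grad_f(x) @ p                                  # Φ'_k(0)
    armijo = f(x + alpha * p) <= phi0 + c1 * alpha * dphi0
    curvature = grad_f(x + alpha * p) @ p >= c2 * dphi0
    return armijo and curvature

f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([1.0, 1.0]); p = -grad_f(x)
print(satisfies_wolfe(f, grad_f, x, p, alpha=1.0))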

25

Page 26: Overview on Optimization Algorithms - NTNU


Line Search (III)
Fundamental Result for Line Search Methods

Theorem (Zoutendijk)
Consider any iteration of the form
    x_{k+1} = x_k + α_k·p_k,
where p_k is a descent direction and α_k satisfies the Wolfe conditions for 0 < c1 < c2 < 1. Suppose that f is bounded below and differentiable with Lipschitz-continuous gradient on an open set containing the lower level set Ω_{f(x_0)}(f). Then
    Σ_{k≥0} (cos θ_k)²·‖∇f(x_k)‖² < ∞,
where θ_k := ∠(−∇f(x_k), p_k).

Result
    cos θ_k ≥ δ > 0  ⇒  ‖∇f(x_k)‖ → 0 as k → ∞

26

Page 27: Overview on Optimization Algorithms - NTNU


Gradient Descent
Given a (continuously) differentiable objective f : R^n → R, we have information concerning the slope(s) of f.

Algorithm: Gradient Descent
Use a descent method (with or without line search) with the search direction
    p_k := −∇f(x_k)   for all k.
Locally (i.e. for α → 0), this is the best possible choice.

27

Page 28: Overview on Optimization Algorithms - NTNU


Gradient Descent: Convergence
Gradient descent for the quadratic test problem
    f(x) := (1/2)·x^T·Q·x − b^T·x
with a positive definite matrix Q.

Gradient Descent with Exact Line Search
Update formula:
    x_{k+1} = x_k − [∇f(x_k)^T·∇f(x_k)] / [∇f(x_k)^T·Q·∇f(x_k)] · ∇f(x_k)

Theorem
When gradient descent with exact line search is applied to f as above, then
    ‖x_{k+1} − x∗‖²_Q ≤ ((λ_n − λ_1)/(λ_n + λ_1))² · ‖x_k − x∗‖²_Q,
where 0 < λ_1 ≤ λ_2 ≤ ... ≤ λ_n are the eigenvalues of Q (linear convergence).

A similar result holds in the nonlinear case: Q ↔ ∇²f(x∗), and (·)² is replaced by r² for any r with (λ_n − λ_1)/(λ_n + λ_1) < r < 1.
28
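A minimal sketch of the update formula above on a small positive definite test problem (the matrix Q and vector b are assumed example data):

import numpy as np

Q = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive definite test matrix
b = np.array([1.0, -2.0])
x = np.zeros(2)

for k in range(100):
    g = Q @ x - b                        # ∇f(x) = Q·x − b
    if np.linalg.norm(g) < 1e-10:
        break
    alpha = (g @ g) / (g @ Q @ g)        # exact line search step length
    x = x - alpha * g

print(x, np.linalg.solve(Q, b))          # the two should agree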

Page 29: Overview on Optimization Algorithms - NTNU


Gradient Descent: Convergence

[Figure: gradient descent iterates over elliptic level curves in the (x1, x2)-plane]
Perfect (convergence in one step) if all level curves are equally spaced along all axes.

29

Page 30: Overview on Optimization Algorithms - NTNU


Newton’s Method

Sir Isaac Newton, 1643 - 1727

Motivation
• Instant method for f(x) = x^T·Q·x − b^T·x with an s.p.d. matrix Q
• Solve ∇f = 0 with the ‘classical’ Newton’s method
• (Sequential) approximation of f by quadratic functions

Global Convergence Properties
• Oscillations
• Divergence
• Stuck in local minima
• Often not even a descent direction
• Not defined if ∇²f(x_k) is singular

Source: By Godfrey Kneller - http://www.phys.uu.nl/ vgent/astrology/images/newton1689.jpg, Public Domain, https://commons.wikimedia.org/w/index.php?curid=146431
30

Page 31: Overview on Optimization Algorithms - NTNU


Newton’s Method

Newton’s Method
Apply the descent method algorithm with the update
    x_{k+1} := x_k − α_k·(∇²f(x_k))^{−1}·∇f(x_k),
with α_k = 1 (or α_k → 1); note that ∇²f(x_k) is not always s.p.d.

(Additional) Costs:
• approximation/calculation of the Hessian ∇²f(x_k), and its storage
• inversion of ∇²f(x_k) (never invert explicitly(!); solve the linear system instead)

Why all this trouble?

Theorem
If f is twice differentiable with a Lipschitz-continuous Hessian, ∇²f(x∗) is s.p.d. at a local minimum x∗, and x_0 is sufficiently close to x∗, then Newton’s method converges quadratically in x and ∇f towards x∗.
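A minimal Python sketch of the Newton iteration on an assumed strictly convex toy objective; the Newton step is obtained by solving the linear system rather than inverting the Hessian:

import numpy as np

def grad_f(x):                            # f(x) = exp(x1 + x2) + x1^2 + x2^2
    e = np.exp(x[0] + x[1])
    return np.array([e + 2*x[0], e + 2*x[1]])

def hess_f(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2, e], [e, e + 2]])

x = np.array([1.0, -0.5])                 # starting point
for k in range(20):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-12:
        break
    p = np.linalg.solve(hess_f(x), -g)    # Newton direction, no explicit inverse
    x = x + 1.0 * p                       # full step, α_k = 1
print(x, grad_f(x))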

31

Page 32: Overview on Optimization Algorithms - NTNU


Modifications of Newton’s Method (*different names in the literature)

Modified Newton’s Method
Replace
    ∇²f(x_k) → ∇²f(x_0) or ∇²f(x_m)   (m < k).
Lose quadratic convergence; fewer evaluations/inversions.

Simplified Newton
Replace the Hessian by a (good and easy to invert) approximation:
    ∇²f(x_k) → Q_k ≈ ∇²f(x_k).

Damped Newton
Use a reduced step length α_k < 1 in the Newton updates.
Lose quadratic convergence; (often:) increase the domain of convergence.

In Practice
Use Newton’s method only when the Hessian is cheap and/or to finalize the iteration.

32

Page 33: Overview on Optimization Algorithms - NTNU


(Linear) Conjugate Gradient Method (I)

Back to the quadratic problem:
    f(x) = (1/2)·x^T·A·x − b^T·x
with an s.p.d. matrix A ∈ R^{n×n}. Alternatively: solve the (large) linear system A·x = b, with the residual r(x) := A·x − b = ∇f(x).

Conjugate Direction Methods
Use a set of conjugate directions p_0, p_1, ..., p_{n−1} (always linearly independent) as search directions in the descent algorithm, i.e. p_i^T·A·p_j = 0 for i ≠ j.
• Exact line search as for the gradient method is possible
• Convergence after n steps is guaranteed
• After k steps: minimization of f over a k-dimensional subspace (a so-called Krylov space (→ Cayley-Hamilton))

33

Page 34: Overview on Optimization Algorithms - NTNU


(Linear) Conjugate Gradient Method (II)

CG method: Basic idea
• Use only the previous direction p_{k−1} and the residual r(x_k) to construct the next conjugate direction p_k:
    p_k = −r(x_k) + β_k·p_{k−1}   (β_k uniquely determined by p_k being a conjugate direction)

Linear CG method
    r_0 = A·x_0 − b,  p_0 = −r_0
    While r_k ≠ 0:
        α_k = −(r_k^T·p_k) / (p_k^T·A·p_k)
        x_{k+1} = x_k + α_k·p_k
        r_{k+1} = A·x_{k+1} − b
        β_{k+1} = (r_{k+1}^T·A·p_k) / (p_k^T·A·p_k)
        p_{k+1} = −r_{k+1} + β_{k+1}·p_k
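A direct Python transcription of the loop above (the s.p.d. test matrix is randomly generated for illustration):

import numpy as np

def linear_cg(A, b, x0, tol=1e-10, max_iter=None):
    x = x0.copy()
    r = A @ x - b
    p = -r
    for k in range(max_iter or len(b)):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = -(r @ p) / (p @ Ap)       # exact line search along p_k
        x = x + alpha * p
        r = A @ x - b                     # new residual r_{k+1}
        beta = (r @ Ap) / (p @ Ap)        # enforce A-conjugacy of p_{k+1}
        p = -r + beta * p
    return x

rng = np.random.default_rng(1)
M = rng.normal(size=(50, 50))
A = M @ M.T + 50 * np.eye(50)             # s.p.d. test matrix
b = rng.normal(size=50)
x = linear_cg(A, b, np.zeros(50), max_iter=200)
print(np.linalg.norm(A @ x - b))          # small residual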

34

Page 35: Overview on Optimization Algorithms - NTNU


Convergence and Preconditioned CG

Theorem
If A has only r distinct eigenvalues, then the (linear) conjugate gradient method terminates at the solution after at most r steps.

Theorem
The (theoretical) convergence rate is similar to that of the gradient method:
    ‖x_{k+1} − x∗‖²_A ≤ ((λ_{n−k} − λ_1)/(λ_{n−k} + λ_1))² · ‖x_0 − x∗‖²_A

PCG
Improve the convergence behavior by linearly transforming the search space, x̂ := C·x:
    f̂(x̂) = (1/2)·x̂^T·(C^{−T}·A·C^{−1})·x̂ − (C^{−T}·b)^T·x̂

NB: Extension (and convergence analysis) to the nonlinear case (relatively) easy.

35

Page 36: Overview on Optimization Algorithms - NTNU


Least Squares Problems
Form of the objective function:
    f(x) = (1/2)·Σ_{j=1}^{m} r_j(x)²,
where each r_j : R^n → R is a real-valued smooth function called a residual. We assume that m ≥ n.

Remark
• Least-squares problems arise in many areas of application and may in fact be the largest source of unconstrained optimization problems.
• The residuals measure the discrepancy between the model and the observed behavior of the system.
• By minimizing f, we select values for the parameters that best match the model to the data.
• We show how to devise efficient, robust minimization algorithms by exploiting the special structure of the function f and its derivatives.

36

Page 37: Overview on Optimization Algorithms - NTNU


Least Squares Problems (cont.)
Residual vector r : R^n → R^m,
    r(x) := (r_1(x), ..., r_m(x))^T.
Then f(x) = (1/2)·‖r(x)‖²_2. The Jacobian (an m × n matrix) is defined as
    J(x) := [∂r_j/∂x_i]_{j=1,...,m; i=1,...,n},
i.e. the matrix whose rows are ∇r_1(x)^T, ..., ∇r_m(x)^T. Then
    ∇f(x) = Σ_{j=1}^{m} r_j(x)·∇r_j(x) = J(x)^T·r(x),
    ∇²f(x) = Σ_{j=1}^{m} ∇r_j(x)·∇r_j(x)^T + Σ_{j=1}^{m} r_j(x)·∇²r_j(x)
           = J(x)^T·J(x) + Σ_{j=1}^{m} r_j(x)·∇²r_j(x).    (1)
The first part of ∇²f(x), J(x)^T·J(x), comes “for free” once the Jacobian is known.

37

Page 38: Overview on Optimization Algorithms - NTNU


Gauss-Newton Methods
The Gauss-Newton Method
The standard Newton equations ∇²f(x_k)·p = −∇f(x_k) are replaced by the following system to obtain the search direction p_k^GN:
    J_k^T·J_k·p_k^GN = −J_k^T·r_k.

Advantages
Using the approximation ∇²f(x_k) ≈ J_k^T·J_k saves the trouble of computing the individual residual Hessians ∇²r_j from (1). Often, the first term in (1) dominates the second one, so that J_k^T·J_k is a good approximation of ∇²f(x_k). In practice, many least-squares problems have small residuals at the solution, leading to rapid local convergence of Gauss-Newton.
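A minimal sketch of a Gauss-Newton iteration on an assumed exponential data-fitting problem; each step solves J_k^T·J_k·p = −J_k^T·r_k, here via a least-squares solve:

import numpy as np

t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)               # synthetic data for the model a*exp(b*t)

def residuals(x):                         # r_j(x) = model(t_j; x) - y_j
    a, b = x
    return a * np.exp(b * t) - y

def jacobian(x):                          # rows are ∇r_j(x)^T
    a, b = x
    e = np.exp(b * t)
    return np.column_stack([e, a * t * e])

x = np.array([1.0, 0.0])                  # starting point (assumed)
for _ in range(20):
    r = residuals(x)
    J = jacobian(x)
    if np.linalg.norm(J.T @ r) < 1e-10:   # ∇f(x) = J^T r ≈ 0: stop
        break
    p, *_ = np.linalg.lstsq(J, -r, rcond=None)   # solves J^T J p = -J^T r
    x = x + p
print(x)                                  # ≈ (2.0, -1.5)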

Convergence
In theory: (linear) convergence under mild regularity and full-rank conditions on the system matrices.

In practice: good global and fast local convergence.
38

Page 39: Overview on Optimization Algorithms - NTNU


What Next?

39