OPER 627: Nonlinear Optimization
Lecture 14: Mid-term Review
Department of Statistical Sciences and Operations Research
Virginia Commonwealth University
Oct 16, 2013
Exam begins now...
Try to find Professor Song's technical mistakes (not including typos) in his terrible slides:
- If you find one that nobody else could find, you get one extra point
- Maximum extra points: 5
- Submit the exam paper with these mistakes (I will allocate some space for you to fill out)
An overall summary
1 Theory: optimality conditions in various cases
  - In general, FONC, SONC, SOSC apply for functions defined in an open set
  - Optimality conditions with convexity
2 Algorithms: line search and trust region
  - All we learn is Newton's method
  - Algorithms in this class only guarantee convergence to a stationary point from any initial point.
How do we use optimality conditions?
1 Use FONC to rule out non-stationary solutions
  - FONC is used in all algorithms that have global convergence
2 Use SONC to rule out saddle points
3 Use SOSC to validate quadratic convergence of Newton's method (the three conditions are recapped after this list)
  - SOSC is used to show fast convergence to a local minimizer
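For quick reference, the unconstrained optimality conditions reviewed above, in their standard form for a twice continuously differentiable f (a recap, not copied verbatim from the slides):

```latex
\begin{align*}
&\text{FONC (necessary): } && \nabla f(x^*) = 0\\
&\text{SONC (necessary): } && \nabla f(x^*) = 0, \quad \nabla^2 f(x^*) \succeq 0\\
&\text{SOSC (sufficient): } && \nabla f(x^*) = 0, \ \nabla^2 f(x^*) \succ 0
  \;\Longrightarrow\; x^* \text{ is a strict local minimizer}
\end{align*}
```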
Convexity
1 First-order characterization of convex functions
2 Second-order characterization of convex functions defined in an open set
3 Free lunch, free dinner, ultimate gift
4 Strongly convex: ∇²f is PD, why important? (see the recap below)
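For quick reference, standard statements of items 1, 2, and 4 for a differentiable f (the strong-convexity modulus m > 0 is notation introduced here, not on the slides):

```latex
\begin{align*}
&\text{First-order: } && f \text{ convex} \iff f(y) \ge f(x) + \nabla f(x)^\top (y - x) \quad \forall\, x, y\\
&\text{Second-order: } && f \text{ convex} \iff \nabla^2 f(x) \succeq 0 \quad \forall\, x
  \quad (f \text{ twice differentiable on an open convex set})\\
&\text{Strong convexity: } && \nabla^2 f(x) \succeq m I \quad \forall\, x, \text{ for some } m > 0
\end{align*}
```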
Optimization algorithms
Motivation: optimize for the next step using information from the current step
f(xk + pk) ≈ m(pk) := f(xk) + ∇f(xk)ᵀpk + (1/2) pkᵀBk pk
1 Line search: if Bk is "nice", we can find a descent direction pk easily, and the minimizer along that direction is our next iterate
2 Trust region: m(pk) only approximates f(xk + pk) well locally; we will look for the next iterate based on our confidence level in how well m(pk) approximates f(xk + pk), and adjust our confidence level adaptively
Line search
1 Wolfe conditions:
  - Sufficient descent: φ(α) ≤ φ(0) + c1 φ′(0) α
  - Sufficient curvature: φ′(α) ≥ c2 φ′(0)
  - What are their purposes? (a small checker is sketched after this list)
2 Fundamental result for line search:
  - Just assume the search direction pk is a descent direction, and use the Wolfe conditions for the line search
  - Then ∑_{k=0}^∞ cos²θk ‖∇f(xk)‖² < ∞, where cos θk = −∇f(xk)ᵀpk / (‖∇f(xk)‖ ‖pk‖)
  - How to use this result to prove global convergence for steepest descent? Newton?
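A minimal sketch of checking the two Wolfe conditions for a candidate step length, assuming f and its gradient are available as Python callables (the names f, grad_f and the constants c1, c2 below are illustrative defaults, not taken from the slides):

```python
import numpy as np

def wolfe_conditions_hold(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the two Wolfe conditions for step length alpha along direction p.

    phi(alpha) = f(x + alpha * p), so phi'(alpha) = grad_f(x + alpha * p) @ p.
    """
    phi0 = f(x)
    dphi0 = grad_f(x) @ p                    # phi'(0), negative if p is a descent direction
    phi_a = f(x + alpha * p)
    dphi_a = grad_f(x + alpha * p) @ p       # phi'(alpha)

    sufficient_descent = phi_a <= phi0 + c1 * alpha * dphi0
    sufficient_curvature = dphi_a >= c2 * dphi0
    return sufficient_descent and sufficient_curvature

# Illustrative usage on the quadratic f(x) = 0.5 * x'x
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([1.0, -2.0])
p = -grad_f(x)                               # steepest descent direction
print(wolfe_conditions_hold(f, grad_f, x, p, alpha=1.0))   # alpha = 1 minimizes phi here -> True
```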
Line search is all about Newton
Model function
min m(pk) := f(xk) + ∇f(xk)ᵀpk + (1/2) pkᵀBk pk
Approximated Hessian Bk: this is an unconstrained QP; any stationary point is optimal if and only if Bk is PD
1 Choice 1: Bk = I, corresponds to steepest descent pk = −∇f(xk)
  - Only first-order information is used
  - Linear local convergence
  - Convergence could be very slow if the condition number of the Hessian is large
2 Choice 2: Bk = ∇²f(xk), corresponds to the pure Newton step pk = −[∇²f(xk)]⁻¹∇f(xk) (sketched below)
  - No line search is needed, the step size is always αk = 1
  - Quadratic local convergence to x∗ if x∗ satisfies SOSC
  - Fragile: may run into trouble if the Hessian is not PD
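A minimal sketch of the two choices of search direction above, assuming the gradient and Hessian are available as callables (names are illustrative; no line search or PD safeguard is included):

```python
import numpy as np

def search_direction(grad, hess, x, choice="newton"):
    """Return the search direction p_k for the quadratic model above.

    choice="steepest": B_k = I, so p_k = -grad(x).
    choice="newton":   B_k = Hessian, so p_k solves  hess(x) p = -grad(x).
    Assumes the Hessian is PD when choice="newton"; otherwise the solve
    may not give a descent direction.
    """
    g = grad(x)
    if choice == "steepest":
        return -g
    return np.linalg.solve(hess(x), -g)   # solve the linear system; avoid forming the inverse

# Illustrative usage: f(x) = 0.5 x'Ax with a PD matrix A, minimizer at 0
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
hess = lambda x: A
x = np.array([1.0, 1.0])
print(x + search_direction(grad, hess, x))               # pure Newton step lands at [0, 0]
print(x + search_direction(grad, hess, x, "steepest"))
```

On this quadratic example the pure Newton step reaches the minimizer in one step, which is the fast local convergence behavior in its simplest form.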
Line search is all about Newton (contd)
Model function
min m(pk) := f(xk) + ∇f(xk)ᵀpk + (1/2) pkᵀBk pk
1 Choice 3: Modified Newton, Bk = ∇²f(xk) + Ek
  - If ∇²f(xk) is PD, Ek = 0
  - Otherwise, Ek is "big enough" to ensure Bk is PD
  - Loses quadratic convergence, because we need line search
2 Choice 4: Quasi-Newton, construct/update a PD matrix Bk as we go
  - The updating formula ensures the secant equation Bk+1 sk = yk, so using Bk to approximate the Hessian makes sense
  - Bk is PD by the Wolfe/curvature condition and the updating formula
  - BFGS approximates the inverse Hessian Hk (update sketched below)
  - Superlinear local convergence, but no global convergence in general
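A minimal sketch of the BFGS update for the inverse-Hessian approximation Hk (standard formula; it assumes skᵀyk > 0, which the Wolfe curvature condition guarantees, and the variable names are illustrative):

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H.

    s = x_{k+1} - x_k,  y = grad_{k+1} - grad_k.
    Requires s @ y > 0 (guaranteed by the Wolfe curvature condition),
    which keeps the updated H symmetric positive definite.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Illustrative usage: the quasi-Newton step is then p_k = -H_k @ grad_k
H = np.eye(2)
s = np.array([0.5, -0.2])
y = np.array([1.0, 0.3])
H = bfgs_inverse_update(H, s, y)
print(H @ y, "should equal s:", s)   # the secant equation H_{k+1} y_k = s_k holds
```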
Trust region
All about solving the trust region subproblem (TRP)
min_p m(p) := f(xk) + ∇f(xk)ᵀp + (1/2) pᵀBk p
s.t. ‖p‖ ≤ ∆
Solving TRP:
1 Direct method: needs matrix factorization, iterative root-finding procedure
2 Cauchy points
3 Improved Cauchy points, dogleg methods
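Whichever way the TRP is solved, the step is then accepted or rejected and ∆ adjusted according to how well the model predicted the actual decrease; a minimal sketch of that update (the thresholds 1/4, 3/4 and the shrink/expand factors are common illustrative choices, not taken from the slides):

```python
import numpy as np

def trust_region_update(f, x, p, model_decrease, delta, delta_max=10.0):
    """Accept/reject a TRP step p and update the radius delta.

    model_decrease = m(0) - m(p) > 0 is the reduction predicted by the model.
    """
    rho = (f(x) - f(x + p)) / model_decrease     # actual vs. predicted reduction
    if rho < 0.25:
        delta = 0.25 * delta                     # poor agreement: shrink the region
    elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
        delta = min(2.0 * delta, delta_max)      # good agreement at the boundary: expand
    x_next = x + p if rho > 0.0 else x           # accept the step only if f actually decreased
    return x_next, delta
```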
Cauchy points and dogleg
Cauchy point: the best solution along the steepest descent direction within the trust region
A constrained step size problem
min_{τk} fk + τk gkᵀpk^s + (1/2) τk² (pk^s)ᵀBk pk^s
s.t. τk ≤ 1
where pk^s = −(∆k/‖gk‖) gk
Improvement: Dogleg (only when Bk is PD)
  - Use two line segments to approximate the full trajectory from the global minimizer pB = −B⁻¹g to the minimizer along the steepest descent direction pU (see the sketch below)
  - Optimization over the two line segments is easy because of the monotone structure
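A minimal sketch of the dogleg step, assuming Bk is PD (the names B, g, delta are illustrative; the boundary intersection is found by solving a scalar quadratic):

```python
import numpy as np

def dogleg_step(B, g, delta):
    """Dogleg approximation to the TRP solution, assuming B is PD.

    p_b: unconstrained minimizer of the model (full Newton-like step).
    p_u: minimizer of the model along the steepest descent direction.
    """
    p_b = np.linalg.solve(B, -g)
    if np.linalg.norm(p_b) <= delta:
        return p_b                                  # full step fits inside the region
    p_u = -(g @ g) / (g @ B @ g) * g
    if np.linalg.norm(p_u) >= delta:
        return -delta / np.linalg.norm(g) * g       # Cauchy-like step: scaled steepest descent
    # Otherwise find tau in [0, 1] with ||p_u + tau (p_b - p_u)|| = delta
    d = p_b - p_u
    a, b, c = d @ d, 2 * (p_u @ d), p_u @ p_u - delta**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p_u + tau * d
```

The monotone structure mentioned above is what makes the last case safe: along the second segment the model decreases while the step norm increases, so the boundary intersection is the minimizer within the region.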
Line search vs. Trust region
1 Line search first finds a direction, then chooses the step length
2 Line search methods have an "easy" problem to solve in each iteration
3 Line search methods may not allow the true Hessian in the model function

1 Trust region first finds a length, then chooses the direction
2 Trust region methods have a hard TRP in each iteration
3 Trust region allows the true Hessian in the model function
Important concepts
Condition number
1 Condition number and convergence
  - Steepest descent
  - Newton/quasi-Newton
  - Conjugate gradient (preconditioned CG)
2 Condition number and numerical stability
  - Least squares: solving the normal equations JᵀJ x = Jᵀy (see the example below)
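A small numerical illustration of the stability issue: the 2-norm condition number of JᵀJ is the square of that of J, so forming the normal equations amplifies conditioning (the matrix below is an arbitrary ill-conditioned example, not from the slides):

```python
import numpy as np

# An ill-conditioned 3x2 Jacobian (its columns are nearly parallel)
J = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])

print(np.linalg.cond(J))          # condition number of J
print(np.linalg.cond(J.T @ J))    # the square of the above (up to rounding)
# Solving J'J x = J'y therefore amplifies rounding errors much more than
# working with J directly (e.g., via a QR factorization of J).
```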
Wolfe condition
1 Global convergence for inexact line search
2 Guarantee for PD quasi-Newton Hessian matrices
Optimization for large-scale problems
When problems get bigger, we have to compromise
1 Quasi-Newton: hard to compute the Hessian when the problem is large-scale; use first-order information to mimic the behavior of the Hessian
2 L-BFGS: use a limited-memory list of vectors sk, yk to approximate the quasi-Newton matrix (two-loop recursion sketched below)
3 Inexact Newton: solve for the Newton direction via conjugate gradient (CG)
  - CG is hunky-dory
  - CG enables inexact solutions, which is sufficient for superlinear convergence if the error bound ηk → 0
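A minimal sketch of the L-BFGS two-loop recursion, which computes the search direction from the stored (sk, yk) pairs without ever forming a matrix (assumes at least one stored pair; list names and the H0 scaling heuristic are illustrative):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """L-BFGS two-loop recursion: approximate -H_k @ grad using only the
    stored (s_i, y_i) pairs (oldest first in the lists).
    """
    q = grad.copy()
    stack = []
    # First loop: newest pair to oldest
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        stack.append((rho, alpha, s, y))
    # Initial scaling H_0 = gamma * I (a common heuristic choice)
    s_last, y_last = s_list[-1], y_list[-1]
    gamma = (s_last @ y_last) / (y_last @ y_last)
    r = gamma * q
    # Second loop: oldest pair to newest
    for rho, alpha, s, y in reversed(stack):
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return -r    # search direction p_k = -H_k grad_k
```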
Choice of algorithms for unconstrained nonlinear optimization
If first-order information is not available, you need to take another course!
1 Second-order information is not available
  - Steepest descent (if you are lazy)
  - Quasi-Newton: n ≤ 100
  - Large-scale: L-BFGS, nonlinear CG, inexact quasi-Newton, etc.
2 Second-order information is available
  - Newton, if you know your problem is strongly convex, or you know you are very close to optimal
  - Trust-region methods
  - Large-scale: Newton-CG, CG-trust
In practice, choose an implementation/software that best matches your application!
Good luck!