OPER 627: Nonlinear Optimization
Lecture 14: Mid-term Review
Department of Statistical Sciences and Operations Research
Virginia Commonwealth University
Oct 16, 2013
Exam begins now...
Try to find Professor Song's technical mistakes (not including typos) in his terrible slides:
- If you find one that nobody else could find, you get one extra point
- Maximum extra points: 5
- Submit the exam paper with these mistakes (I will allocate some space for you to fill out)
An overall summary
1 Theory: optimality conditions in various cases
  - In general, FONC, SONC, SOSC apply for functions defined in an open set
  - Optimality conditions with convexity
2 Algorithms: line search and trust region
  - All we learn is Newton's method
  - Algorithms in this class only guarantee convergence to a stationary point from any initial point.
How do we use optimality conditions?
1 Use FONC to rule out non-stationary solutions
  - FONC is used in all algorithms that have global convergence
2 Use SONC to rule out saddle points
3 Use SOSC to validate quadratic convergence of Newton's method (the three conditions are recapped after this list)
  - SOSC is used to show fast convergence to a local minimizer
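For quick reference, the unconstrained optimality conditions reviewed above, in their standard form for a twice continuously differentiable f (a recap, not copied verbatim from the slides):

```latex
\begin{align*}
&\text{FONC (necessary): } && \nabla f(x^*) = 0\\
&\text{SONC (necessary): } && \nabla f(x^*) = 0, \quad \nabla^2 f(x^*) \succeq 0\\
&\text{SOSC (sufficient): } && \nabla f(x^*) = 0, \ \nabla^2 f(x^*) \succ 0
  \;\Longrightarrow\; x^* \text{ is a strict local minimizer}
\end{align*}
```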
Convexity
1 First-order characterization of convex functions
2 Second-order characterization of convex functions defined in an open set
3 Free lunch, free dinner, ultimate gift
4 Strongly convex: ∇²f is PD, why important? (see the recap below)
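For quick reference, standard statements of items 1, 2, and 4 for a differentiable f (the strong-convexity modulus m > 0 is notation introduced here, not on the slides):

```latex
\begin{align*}
&\text{First-order: } && f \text{ convex} \iff f(y) \ge f(x) + \nabla f(x)^\top (y - x) \quad \forall\, x, y\\
&\text{Second-order: } && f \text{ convex} \iff \nabla^2 f(x) \succeq 0 \quad \forall\, x
  \quad (f \text{ twice differentiable on an open convex set})\\
&\text{Strong convexity: } && \nabla^2 f(x) \succeq m I \quad \forall\, x, \text{ for some } m > 0
\end{align*}
```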
Optimization algorithms
Motivation: optimize for the next step using information from the current step
f(xk + pk) ≈ m(pk) := f(xk) + ∇f(xk)ᵀpk + (1/2) pkᵀBk pk
1 Line search: if Bk is "nice", we can find a descent direction pk easily, and the minimizer along that direction is our next iterate
2 Trust region: m(pk) only approximates f(xk + pk) well locally; we will look for the next iterate based on our confidence level in how well m(pk) approximates f(xk + pk), and adjust our confidence level adaptively
Line search
1 Wolfe conditions:
  - Sufficient descent: φ(α) ≤ φ(0) + c1 φ′(0) α
  - Sufficient curvature: φ′(α) ≥ c2 φ′(0)
  - What are their purposes? (a small checker is sketched after this list)
2 Fundamental result for line search:
  - Just assume the search direction pk is a descent direction, and use the Wolfe conditions for the line search
  - Then ∑_{k=0}^∞ cos²θk ‖∇f(xk)‖² < ∞, where cos θk = −∇f(xk)ᵀpk / (‖∇f(xk)‖ ‖pk‖)
  - How to use this result to prove global convergence for steepest descent? Newton?
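A minimal sketch of checking the two Wolfe conditions for a candidate step length, assuming f and its gradient are available as Python callables (the names f, grad_f and the constants c1, c2 below are illustrative defaults, not taken from the slides):

```python
import numpy as np

def wolfe_conditions_hold(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the two Wolfe conditions for step length alpha along direction p.

    phi(alpha) = f(x + alpha * p), so phi'(alpha) = grad_f(x + alpha * p) @ p.
    """
    phi0 = f(x)
    dphi0 = grad_f(x) @ p                    # phi'(0), negative if p is a descent direction
    phi_a = f(x + alpha * p)
    dphi_a = grad_f(x + alpha * p) @ p       # phi'(alpha)

    sufficient_descent = phi_a <= phi0 + c1 * alpha * dphi0
    sufficient_curvature = dphi_a >= c2 * dphi0
    return sufficient_descent and sufficient_curvature

# Illustrative usage on the quadratic f(x) = 0.5 * x'x
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([1.0, -2.0])
p = -grad_f(x)                               # steepest descent direction
print(wolfe_conditions_hold(f, grad_f, x, p, alpha=1.0))   # alpha = 1 minimizes phi here -> True
```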
Line search is all about Newton
Model function
min m(pk) := f(xk) + ∇f(xk)ᵀpk + (1/2) pkᵀBk pk
Approximated Hessian Bk: this is an unconstrained QP; any stationary point is optimal if and only if Bk is PD
1 Choice 1: Bk = I, corresponds to steepest descent pk = −∇f(xk)
  - Only first-order information is used
  - Linear local convergence
  - Convergence could be very slow if the condition number of the Hessian is large
2 Choice 2: Bk = ∇²f(xk), corresponds to the pure Newton step pk = −[∇²f(xk)]⁻¹∇f(xk) (sketched below)
  - No line search is needed, the step size is always αk = 1
  - Quadratic local convergence to x∗ if x∗ satisfies SOSC
  - Fragile: may run into trouble if the Hessian is not PD
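A minimal sketch of the two choices of search direction above, assuming the gradient and Hessian are available as callables (names are illustrative; no line search or PD safeguard is included):

```python
import numpy as np

def search_direction(grad, hess, x, choice="newton"):
    """Return the search direction p_k for the quadratic model above.

    choice="steepest": B_k = I, so p_k = -grad(x).
    choice="newton":   B_k = Hessian, so p_k solves  hess(x) p = -grad(x).
    Assumes the Hessian is PD when choice="newton"; otherwise the solve
    may not give a descent direction.
    """
    g = grad(x)
    if choice == "steepest":
        return -g
    return np.linalg.solve(hess(x), -g)   # solve the linear system; avoid forming the inverse

# Illustrative usage: f(x) = 0.5 x'Ax with a PD matrix A, minimizer at 0
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
hess = lambda x: A
x = np.array([1.0, 1.0])
print(x + search_direction(grad, hess, x))               # pure Newton step lands at [0, 0]
print(x + search_direction(grad, hess, x, "steepest"))
```

On this quadratic example the pure Newton step reaches the minimizer in one step, which is the fast local convergence behavior in its simplest form.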
Line search is all about Newton (contd)
Model function
min m(pk) := f(xk) + ∇f(xk)ᵀpk + (1/2) pkᵀBk pk
1 Choice 3: Modified Newton, Bk = ∇²f(xk) + Ek
  - If ∇²f(xk) is PD, Ek = 0
  - Otherwise, Ek is "big enough" to ensure Bk is PD
  - Loses quadratic convergence, because we need line search
2 Choice 4: Quasi-Newton, construct/update a PD matrix Bk as we go
  - The updating formula ensures the secant equation Bk+1 sk = yk, so using Bk to approximate the Hessian makes sense
  - Bk is PD by the Wolfe/curvature condition and the updating formula
  - BFGS approximates the inverse Hessian Hk (update sketched below)
  - Superlinear local convergence, but no global convergence in general
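A minimal sketch of the BFGS update for the inverse-Hessian approximation Hk (standard formula; it assumes skᵀyk > 0, which the Wolfe curvature condition guarantees, and the variable names are illustrative):

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H.

    s = x_{k+1} - x_k,  y = grad_{k+1} - grad_k.
    Requires s @ y > 0 (guaranteed by the Wolfe curvature condition),
    which keeps the updated H symmetric positive definite.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Illustrative usage: the quasi-Newton step is then p_k = -H_k @ grad_k
H = np.eye(2)
s = np.array([0.5, -0.2])
y = np.array([1.0, 0.3])
H = bfgs_inverse_update(H, s, y)
print(H @ y, "should equal s:", s)   # the secant equation H_{k+1} y_k = s_k holds
```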
Trust region
All about solving the trust region subproblem (TRP)
min_p m(p) := f(xk) + ∇f(xk)ᵀp + (1/2) pᵀBk p
s.t. ‖p‖ ≤ ∆
Solving TRP:
1 Direct method: needs matrix factorization, iterative root-finding procedure
2 Cauchy points
3 Improved Cauchy points, dogleg methods
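Whichever way the TRP is solved, the step is then accepted or rejected and ∆ adjusted according to how well the model predicted the actual decrease; a minimal sketch of that update (the thresholds 1/4, 3/4 and the shrink/expand factors are common illustrative choices, not taken from the slides):

```python
import numpy as np

def trust_region_update(f, x, p, model_decrease, delta, delta_max=10.0):
    """Accept/reject a TRP step p and update the radius delta.

    model_decrease = m(0) - m(p) > 0 is the reduction predicted by the model.
    """
    rho = (f(x) - f(x + p)) / model_decrease     # actual vs. predicted reduction
    if rho < 0.25:
        delta = 0.25 * delta                     # poor agreement: shrink the region
    elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
        delta = min(2.0 * delta, delta_max)      # good agreement at the boundary: expand
    x_next = x + p if rho > 0.0 else x           # accept the step only if f actually decreased
    return x_next, delta
```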
Cauchy points and dogleg
Cauchy point: the best solution along the steepest descent direction within the trust region
A constrained step size problem
min_{τk} fk + τk gkᵀpk^s + (1/2) τk² (pk^s)ᵀBk pk^s
s.t. τk ≤ 1
where pk^s = −(∆k/‖gk‖) gk
Improvement: Dogleg (only when Bk is PD)
  - Use two line segments to approximate the full trajectory from the global minimizer pB = −B⁻¹g to the minimizer along the steepest descent direction pU (see the sketch below)
  - Optimization over the two line segments is easy because of the monotone structure
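A minimal sketch of the dogleg step, assuming Bk is PD (the names B, g, delta are illustrative; the boundary intersection is found by solving a scalar quadratic):

```python
import numpy as np

def dogleg_step(B, g, delta):
    """Dogleg approximation to the TRP solution, assuming B is PD.

    p_b: unconstrained minimizer of the model (full Newton-like step).
    p_u: minimizer of the model along the steepest descent direction.
    """
    p_b = np.linalg.solve(B, -g)
    if np.linalg.norm(p_b) <= delta:
        return p_b                                  # full step fits inside the region
    p_u = -(g @ g) / (g @ B @ g) * g
    if np.linalg.norm(p_u) >= delta:
        return -delta / np.linalg.norm(g) * g       # Cauchy-like step: scaled steepest descent
    # Otherwise find tau in [0, 1] with ||p_u + tau (p_b - p_u)|| = delta
    d = p_b - p_u
    a, b, c = d @ d, 2 * (p_u @ d), p_u @ p_u - delta**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p_u + tau * d
```

The monotone structure mentioned above is what makes the last case safe: along the second segment the model decreases while the step norm increases, so the boundary intersection is the minimizer within the region.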
Line search vs. Trust region
1 Line search first finds a direction, then chooses the step length
2 Line search methods have an "easy" problem to solve in each iteration
3 Line search methods may not allow the true Hessian in the model function

1 Trust region first finds a length, then chooses the direction
2 Trust region methods have a hard TRP in each iteration
3 Trust region allows the true Hessian in the model function
Important concepts
Condition number
1 Condition number and convergence
  - Steepest descent
  - Newton/quasi-Newton
  - Conjugate gradient (preconditioned CG)
2 Condition number and numerical stability
  - Least squares: solving the normal equations JᵀJ x = Jᵀy (see the example below)
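A small numerical illustration of the stability issue: the 2-norm condition number of JᵀJ is the square of that of J, so forming the normal equations amplifies conditioning (the matrix below is an arbitrary ill-conditioned example, not from the slides):

```python
import numpy as np

# An ill-conditioned 3x2 Jacobian (its columns are nearly parallel)
J = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])

print(np.linalg.cond(J))          # condition number of J
print(np.linalg.cond(J.T @ J))    # the square of the above (up to rounding)
# Solving J'J x = J'y therefore amplifies rounding errors much more than
# working with J directly (e.g., via a QR factorization of J).
```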
Wolfe condition
1 Global convergence for inexact line search
2 Guarantee for PD quasi-Newton Hessian matrices
Optimization for large-scale problems
When problems get bigger, we have to compromise
1 Quasi-Newton: hard to compute the Hessian when the problem is large-scale; use first-order information to mimic the behavior of the Hessian
2 L-BFGS: use a limited-memory list of vectors sk, yk to approximate the quasi-Newton matrix (two-loop recursion sketched below)
3 Inexact Newton: solve for the Newton direction via conjugate gradient (CG)
  - CG is hunky-dory
  - CG enables inexact solutions, which is sufficient for superlinear convergence if the error bound ηk → 0
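A minimal sketch of the L-BFGS two-loop recursion, which computes the search direction from the stored (sk, yk) pairs without ever forming a matrix (assumes at least one stored pair; list names and the H0 scaling heuristic are illustrative):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """L-BFGS two-loop recursion: approximate -H_k @ grad using only the
    stored (s_i, y_i) pairs (oldest first in the lists).
    """
    q = grad.copy()
    stack = []
    # First loop: newest pair to oldest
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        stack.append((rho, alpha, s, y))
    # Initial scaling H_0 = gamma * I (a common heuristic choice)
    s_last, y_last = s_list[-1], y_list[-1]
    gamma = (s_last @ y_last) / (y_last @ y_last)
    r = gamma * q
    # Second loop: oldest pair to newest
    for rho, alpha, s, y in reversed(stack):
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return -r    # search direction p_k = -H_k grad_k
```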
Choice of algorithms for unconstrained nonlinear optimization
If first-order information is not available, you need to take another course!
1 Second-order information is not available
  - Steepest descent (if you are lazy)
  - Quasi-Newton: n ≤ 100
  - Large-scale: L-BFGS, nonlinear CG, inexact quasi-Newton, etc.
2 Second-order information is available
  - Newton, if you know your problem is strongly convex, or you know you are very close to optimal
  - Trust-region methods
  - Large-scale: Newton-CG, CG-trust
In practice, choose an implementation/software that best matches your application!
Good luck!