
  • Lecture 2

    C7B Optimization Hilary 2011 A. Zisserman

    • Optimization for general functions

    • Downhill simplex (amoeba) algorithm

    • Newton’s method
    • Line search
    • Quasi-Newton methods

    • Least-Squares and Gauss-Newton methods

    Review – unconstrained optimization

    • Line minimization

    • Minimizing quadratic functions
    • line search

    • problem with steepest descent

    Assumptions:

    • single minimum

    • positive definite Hessian

  • Optimization for General Functions

    f(x, y) = e^x (4x² + 2y² + 4xy + 2x + 1)

    [Figure: plot of f(x, y)]

    Apply methods developed using quadratic Taylor series expansion

    Rosenbrock’s function

    f(x, y) = 100 (y − x²)² + (1 − x)²

    [Figure: contour plot of the Rosenbrock function]

    Minimum is at [1,1]
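    For the numerical sketches that follow, it is handy to have the Rosenbrock function, its gradient and its Hessian as code. This is a minimal numpy sketch (the helper names rosenbrock, rosenbrock_grad and rosenbrock_hess are my own, not from the slides); the Hessian agrees with the "true Hessian" worked out later in the lecture.

```python
import numpy as np

def rosenbrock(p):
    """f(x, y) = 100 (y - x^2)^2 + (1 - x)^2."""
    x, y = p
    return 100.0 * (y - x**2) ** 2 + (1.0 - x) ** 2

def rosenbrock_grad(p):
    """Gradient [df/dx, df/dy]."""
    x, y = p
    return np.array([-400.0 * x * (y - x**2) - 2.0 * (1.0 - x),
                     200.0 * (y - x**2)])

def rosenbrock_hess(p):
    """Hessian of f; matches the 'true Hessian' slide later on."""
    x, y = p
    return np.array([[1200.0 * x**2 - 400.0 * y + 2.0, -400.0 * x],
                     [-400.0 * x, 200.0]])

print(rosenbrock([1.0, 1.0]))       # 0.0
print(rosenbrock_grad([1.0, 1.0]))  # [0. 0.]  -- confirms the minimum at (1, 1)
```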

  • Steepest descent

    [Figure: steepest descent path on the Rosenbrock function, full view and zoomed detail showing the zig-zag behaviour]

    • The zig-zag behaviour is clear in the zoomed view (100 iterations)

    • The algorithm crawls down the valley

    • The 1D line minimization must be performed using one of the earlier methods (usually cubic polynomial interpolation)
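    A minimal steepest-descent sketch in the spirit of this slide (my own function names, reusing the rosenbrock helpers above; SciPy's bounded scalar minimizer stands in for the cubic-interpolation line search, and the starting point is an assumption):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x0, tol=1e-3, max_iter=5000):
    """Steepest descent: step along -grad(x) with a 1D line minimization."""
    x = np.asarray(x0, dtype=float)
    for n in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, n
        d = -g                                     # downhill direction
        # line search: minimize phi(a) = f(x + a d) over a small bracket
        a = minimize_scalar(lambda a: f(x + a * d),
                            bounds=(0.0, 1.0), method='bounded').x
        x = x + a * d
    return x, max_iter

# x, n = steepest_descent(rosenbrock, rosenbrock_grad, [-1.9, 2.0])
# n is large: the iterates zig-zag slowly down the Rosenbrock valley
```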

    Conjugate Gradients

    • Again, an explicit line minimization must be used at every step

    • The algorithm converges in 101 iterations

    • Far superior to steepest descent

    [Figure: Conjugate Gradient Descent on the Rosenbrock function; gradient < 1e-3 after 101 iterations]
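    For reference, a similar experiment can be run with SciPy's nonlinear conjugate gradient routine (the slides do not name a particular implementation, and the iteration count will differ in detail from the 101 quoted above); scipy.optimize ships the Rosenbrock function and its gradient as rosen and rosen_der:

```python
from scipy.optimize import minimize, rosen, rosen_der

# Nonlinear conjugate gradients, stopping once the gradient norm is below 1e-3
res = minimize(rosen, x0=[-1.9, 2.0], jac=rosen_der, method='CG',
               options={'gtol': 1e-3})
print(res.x, res.nit)   # converges close to [1, 1]
```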

  • The downhill simplex (amoeba) algorithm

    • Due to Nelder and Mead (1965)

    • A direct method: only uses function evaluations (no derivatives)

    • a simplex is the polytope in N dimensions with N+1 vertices, e.g.

    • 2D: triangle

    • 3D: tetrahedron

    • basic idea: move by reflections, expansions or contractions

    [Figure: basic simplex moves: start, reflect, expand, contract]

  • One iteration of the simplex algorithm

    • Reorder the points so that f(x_{n+1}) > ... > f(x_2) > f(x_1) (i.e. x_{n+1} is the worst point).

    • Generate a trial point x_r by reflection

        x_r = x̄ + α(x̄ − x_{n+1})

      where x̄ = (Σ_i x_i)/(N + 1) is the centroid and α > 0. Compute f(x_r); there are then 3 possibilities:

      1. f(x_1) < f(x_r) < f(x_n) (i.e. x_r is neither the new best nor the new worst point): replace x_{n+1} by x_r.

      2. f(x_r) < f(x_1) (i.e. x_r is the new best point): assume the direction of reflection is good and generate a new point by expansion

           x_e = x_r + β(x_r − x̄)

         where β > 0. If f(x_e) < f(x_r) then replace x_{n+1} by x_e; otherwise the expansion has failed, so replace x_{n+1} by x_r.

      3. f(x_r) > f(x_n): assume the polytope is too large and generate a new point by contraction

           x_c = x̄ + γ(x_{n+1} − x̄)

         where γ (0 < γ < 1) is the contraction coefficient. If f(x_c) < f(x_{n+1}) then the contraction has succeeded and x_{n+1} is replaced by x_c; otherwise contract again.

    Standard values are α = 1, β = 1, γ = 0.5.
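    A minimal Python sketch of one such iteration, following the reflection / expansion / contraction rules above (the function and variable names are my own, and the "contract again" branch of case 3 is collapsed into a single extra contraction for brevity):

```python
import numpy as np

def simplex_iteration(f, X, alpha=1.0, beta=1.0, gamma=0.5):
    """One downhill-simplex step. X is an (N+1) x N array of vertices."""
    X = np.asarray(X, dtype=float)
    X = X[np.argsort([f(x) for x in X])]        # sort: X[0] best, X[-1] worst
    worst = X[-1].copy()
    xbar = X.mean(axis=0)                       # centroid of all N+1 vertices

    xr = xbar + alpha * (xbar - worst)          # reflection
    if f(X[0]) < f(xr) < f(X[-2]):              # neither new best nor new worst
        X[-1] = xr
    elif f(xr) < f(X[0]):                       # new best point: try expansion
        xe = xr + beta * (xr - xbar)
        X[-1] = xe if f(xe) < f(xr) else xr
    else:                                       # polytope too large: contract
        xc = xbar + gamma * (worst - xbar)
        if f(xc) < f(worst):
            X[-1] = xc
        else:                                   # contraction failed: contract again
            X[-1] = xbar + gamma * (xc - xbar)
    return X
```

    In practice one would call Matlab's fminsearch (as in the example below) or an equivalent library routine, which adds a shrink step and termination tests on top of this basic move.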

    Example

    [Figure: Downhill Simplex on the Rosenbrock function: path of the best vertex, Matlab fminsearch with 200 iterations, with zoomed detail]
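    The fminsearch run above could be reproduced approximately with SciPy's Nelder-Mead implementation (a sketch; the starting point is assumed and the iterate path will differ in detail):

```python
from scipy.optimize import minimize, rosen

# Downhill simplex on the Rosenbrock function, capped at 200 iterations
res = minimize(rosen, x0=[-1.9, 2.0], method='Nelder-Mead',
               options={'maxiter': 200})
print(res.x, res.nit)
```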

  • Example 2: contraction about a minimum

    [Figure: eight snapshots of the simplex contracting about a minimum, showing reflections followed by contractions]

    Summary

    • no derivatives required

    • deals well with noise in the cost function

    • is able to crawl out of some local minima (though, of course, can still get stuck)


  • Newton’s method

    Expand f(x) by its Taylor series about the point x_n:

      f(x_n + δx) ≈ f(x_n) + g_n^T δx + (1/2) δx^T H_n δx

    where the gradient is the vector

      g_n = ∇f(x_n) = [∂f/∂x_1, ..., ∂f/∂x_N]^T

    and the Hessian is the symmetric matrix

      H_n = H(x_n) = [ ∂²f/∂x_1²      ...   ∂²f/∂x_1∂x_N ]
                     [ ...            ...   ...           ]
                     [ ∂²f/∂x_1∂x_N   ...   ∂²f/∂x_N²     ]

    For a minimum we require that ∇f(x) = 0, and so

      ∇f(x) = g_n + H_n δx = 0

    with solution δx = −H_n^{-1} g_n. This gives the iterative update

      x_{n+1} = x_n − H_n^{-1} g_n

    • If f(x) is quadratic, then the solution is found in one step.

    • The method has quadratic convergence (as in the 1D case).

    • The solution δx = −H_n^{-1} g_n is guaranteed to be a downhill direction, assuming that H_n is positive definite (all eigenvalues greater than zero).

    • Rather than jump straight to the minimum, it is better to perform a line minimization, which ensures global convergence:

        x_{n+1} = x_n − α_n H_n^{-1} g_n

    • If H = I then this reduces to steepest descent.
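    As an illustration, a minimal numpy sketch of the damped Newton update x_{n+1} = x_n − α_n H_n^{-1} g_n, with a simple backtracking line search standing in for the cubic-interpolation search used in the lecture (function names and the starting point are my own assumptions, reusing the rosenbrock helpers sketched earlier):

```python
import numpy as np

def newton_with_line_search(f, grad, hess, x0, tol=1e-3, max_iter=100):
    """Newton's method with a crude backtracking line search on the step length."""
    x = np.asarray(x0, dtype=float)
    for n in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, n
        d = -np.linalg.solve(hess(x), g)        # Newton direction -H^{-1} g
        alpha = 1.0
        while f(x + alpha * d) > f(x) and alpha > 1e-8:
            alpha *= 0.5                        # backtrack until f decreases
        x = x + alpha * d
    return x, max_iter

# x, n = newton_with_line_search(rosenbrock, rosenbrock_grad,
#                                rosenbrock_hess, [-1.9, 2.0])
```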

    Newton’s method - example

    [Figure: Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 15 iterations; ellipses show successive quadratic approximations]

    • The algorithm converges in only 15 iterations compared to the 101 for conjugate gradients, and 200 for simplex

    • However, the method requires computing the Hessian matrix at each iteration – this is not always feasible

  • Quasi-Newton methods

    • If the problem size is large and the Hessian matrix is dense then it may be infeasible/inconvenient to compute it directly.

    • Quasi-Newton methods avoid this problem by keeping a “rolling estimate” of H(x), updated at each iteration using new gradient information.

    • Common schemes are due to Broyden, Fletcher, Goldfarb and Shanno (BFGS), and also Davidon, Fletcher and Powell (DFP).

    First derivatives:

      f′(x_0 + h/2) = (f_1 − f_0)/h   and   f′(x_0 − h/2) = (f_0 − f_{−1})/h

    Second derivative:

      f′′(x_0) = [ (f_1 − f_0)/h − (f_0 − f_{−1})/h ] / h = (f_1 − 2f_0 + f_{−1}) / h²

    For H_{n+1}, build an approximation from H_n, g_n, g_{n+1}, x_n, x_{n+1}; e.g. in 1D:

    [Figure: samples f_{−1}, f_0, f_1 of f(x) at points x_{−1}, x_0, x_{+1}, spaced h apart]

    Quasi-Newton: BFGS

    • Set H_0 = I.

    • Update according to

        H_{n+1} = H_n + (q_n q_n^T) / (q_n^T s_n) − (H_n s_n)(H_n s_n)^T / (s_n^T H_n s_n)

      where

        s_n = x_{n+1} − x_n
        q_n = g_{n+1} − g_n

    • The matrix itself is not stored, but rather represented compactly by a few stored vectors.

    • The estimate H_{n+1} is used to form a local quadratic approximation as before.
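    The update rule transcribes directly into numpy (a sketch only; practical BFGS codes update the inverse Hessian, or a compact factorization, rather than H itself, and guard against a non-positive q_n^T s_n):

```python
import numpy as np

def bfgs_update(H, s, q):
    """BFGS update of the Hessian estimate H, given the step s = x_{n+1} - x_n
    and the gradient change q = g_{n+1} - g_n."""
    Hs = H @ s
    return H + np.outer(q, q) / (q @ s) - np.outer(Hs, Hs) / (s @ Hs)
```

    Starting from H_0 = I and combining this update with the line-searched Newton-style step gives the quasi-Newton iteration used in the example below.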

  • Example

    [Figure: quasi-Newton (BFGS) path on the Rosenbrock function]

    • The method converges in 25 iterations, compared to 15 for the full-Newton method
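    For reference, a similar quasi-Newton run with SciPy's BFGS implementation (a sketch; the slides do not specify a library, and the iteration count depends on the line-search details):

```python
from scipy.optimize import minimize, rosen, rosen_der

res = minimize(rosen, x0=[-1.9, 2.0], jac=rosen_der, method='BFGS',
               options={'gtol': 1e-3})
print(res.x, res.nit)
```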

    Non-linear least squares

    • It is very common in applications for a cost function f(x) to be the sum of a large number of squared residuals

        f(x) = Σ_{i=1}^{M} r_i²

    • If each residual depends non-linearly on the parameters x then the minimization of f(x) is a non-linear least squares problem.

    • Examples in computer vision are maximum likelihood estimators of image relations (such as homographies and fundamental matrices) from point correspondences.

  • Example: aligning a 3D model to an image

    Transformation parameters:

    • 3D rotation matrix R
    • translation 3-vector T = (T_x, T_y, T_z)^T

    Image generation:

    • rotate and translate the 3D model by R and T
    • project to generate the image Î_{R,T}(x, y)

    Cost function: a sum of squared residuals comparing the generated image Î_{R,T}(x, y) with the observed image.

    [Figure: camera geometry for the alignment example, showing the translation components T_x, T_y and the focal length f]

  • Non-linear least squares

    f(x) = Σ_{i=1}^{M} r_i²,   assume M > N

    The M × N Jacobian of the vector of residuals r is defined as

      J(x) = [ ∂r_1/∂x_1   ...   ∂r_1/∂x_N ]
             [ ...         ...   ...        ]
             [ ∂r_M/∂x_1   ...   ∂r_M/∂x_N ]

    Consider

      ∂/∂x_k Σ_i r_i² = Σ_i 2 r_i ∂r_i/∂x_k

    Hence

      ∇f(x) = 2 J^T r

    [Diagram: ∇f written as the matrix-vector product of J^T with the residual vector r]

    For the Hessian we require

      ∂²/(∂x_l ∂x_k) Σ_i r_i² = 2 ∂/∂x_l Σ_i r_i ∂r_i/∂x_k
                              = 2 Σ_i (∂r_i/∂x_k)(∂r_i/∂x_l) + 2 Σ_i r_i ∂²r_i/(∂x_k ∂x_l)

    Hence

      H(x) = 2 J^T J + 2 Σ_{i=1}^{M} r_i R_i

    where R_i denotes the Hessian of the individual residual r_i.

    • Note that the second-order term in the Hessian H(x) is multiplied by the residuals r_i.

    • In most problems, the residuals will typically be small.

    • Also, at the minimum, the residuals will typically be distributed with mean = 0.

    • For these reasons, the second-order term is often ignored, giving the Gauss-Newton approximation to the Hessian:

        H(x) = 2 J^T J

    • Hence, explicit computation of the full Hessian can again be avoided.

    Example – Gauss-Newton

    The minimization of the Rosenbrock function

      f(x, y) = 100 (y − x²)² + (1 − x)²

    can be written as a least-squares problem with residual vector

      r = [ 10(y − x²) ]
          [ (1 − x)    ]

    and Jacobian

      J(x) = [ ∂r_1/∂x   ∂r_1/∂y ]  =  [ −20x   10 ]
             [ ∂r_2/∂x   ∂r_2/∂y ]     [ −1      0 ]

  • The true Hessian is

      H(x) = [ ∂²f/∂x²    ∂²f/∂x∂y ]  =  [ 1200x² − 400y + 2   −400x ]
             [ ∂²f/∂x∂y   ∂²f/∂y²  ]     [ −400x                200  ]

    The Gauss-Newton approximation of the Hessian is

      2 J^T J = 2 [ −20x   −1 ] [ −20x   10 ]  =  [ 800x² + 2   −400x ]
                  [ 10      0 ] [ −1      0 ]     [ −400x        200  ]
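    A quick numeric sanity check of the decomposition H = 2 J^T J + 2 Σ_i r_i R_i for these residuals (a sketch with my own variable names; R_1 is the Hessian of r_1 = 10(y − x²), and r_2 is linear so its Hessian vanishes):

```python
import numpy as np

x, y = -1.2, 1.0                                  # arbitrary test point

r  = np.array([10.0 * (y - x**2), 1.0 - x])       # residual vector
J  = np.array([[-20.0 * x, 10.0], [-1.0, 0.0]])   # Jacobian of r
R1 = np.array([[-20.0, 0.0], [0.0, 0.0]])         # Hessian of r_1 (r_2 is linear)

H_true = np.array([[1200.0 * x**2 - 400.0 * y + 2.0, -400.0 * x],
                   [-400.0 * x, 200.0]])
H_gn   = 2.0 * J.T @ J                            # Gauss-Newton approximation
H_full = H_gn + 2.0 * r[0] * R1                   # add back the second-order term

print(np.allclose(H_full, H_true))                # True
print(H_true - H_gn)                              # gap vanishes as r_1 -> 0
```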

    [Figure: Gauss-Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 14 iterations, full view and zoomed detail]

    • Minimization with the Gauss-Newton approximation and a line search takes only 14 iterations:

        x_{n+1} = x_n − α_n H_n^{-1} g_n   with   H_n = 2 J_n^T J_n
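    A minimal Gauss-Newton sketch for this residual form (helper names and the starting point are my own assumptions; a crude backtracking step stands in for the line search used in the slides):

```python
import numpy as np

def residuals(p):
    x, y = p
    return np.array([10.0 * (y - x**2), 1.0 - x])

def jacobian(p):
    x, y = p
    return np.array([[-20.0 * x, 10.0], [-1.0, 0.0]])

def gauss_newton(x0, tol=1e-3, max_iter=100):
    """x <- x - alpha (2 J^T J)^{-1} (2 J^T r), with backtracking on alpha."""
    x = np.asarray(x0, dtype=float)
    f = lambda p: float(np.sum(residuals(p) ** 2))
    for n in range(max_iter):
        r, J = residuals(x), jacobian(x)
        g = 2.0 * J.T @ r                        # gradient of the cost
        if np.linalg.norm(g) < tol:
            return x, n
        d = -np.linalg.solve(2.0 * J.T @ J, g)   # Gauss-Newton direction
        alpha = 1.0
        while f(x + alpha * d) > f(x) and alpha > 1e-8:
            alpha *= 0.5                         # backtracking line search
        x = x + alpha * d
    return x, max_iter

# x, n = gauss_newton([-1.9, 2.0])   # converges near [1, 1]
```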

  • Comparison

    [Figure: Gauss-Newton method with line search (gradient < 1e-3 after 14 iterations) beside Newton method with line search (gradient < 1e-3 after 15 iterations)]

    Newton:
    • requires computing the Hessian
    • exact solution if quadratic

    Gauss-Newton:
    • approximates the Hessian by the Jacobian product J^T J
    • requires only first derivatives

    [Figure: convergence comparison of conjugate gradient, Gauss-Newton, gradient descent and Newton (NB axes do not have equal scales)]

  • [Figure: convergence comparison of conjugate gradient, Gauss-Newton (no line search), gradient descent and Newton]