
Page 1: Optimization tutorial

1

Sebastian Bernasek, 7-14-2015

Intro to Optimization: Part 1

Page 2: Optimization tutorial

2

What is optimization?

Identify variable values that minimize or maximize some objective while satisfying constraints.

minimize f(x)                  (objective)
where x = {x1, x2, …, xn}      (variables)
s.t. Ax < b                    (constraints)

Page 3: Optimization tutorial

3

What for?

Finance
• maximize profit, minimize risk
• constraints: budgets, regulations

Engineering
• maximize IRR, minimize emissions
• constraints: resources, safety

Data modeling

Page 4: Optimization tutorial

4

Given a proposed model: y(x) = θ1 sin(θ2 x)

which parameters (θi) best describe the data?

Data modeling

Page 5: Optimization tutorial

5

Which parameters (θi) best describe the data?

We must quantify goodness-of-fit

Data modeling

Page 6: Optimization tutorial

6

Goodness-of-fit metrics

A good model will have minimal residual error:

e_i = Y_i − y(X_i)

where X_i, Y_i are data and y(X_i) is the model prediction, e.g. y(X_i) = θ1 sin(θ2 X_i)

Page 7: Optimization tutorial

7

Goodness-of-fit metrics

Least Squares (all data equally important):

SSE = Σ_i (Y_i − y(X_i))²

Weighted Least Squares (gives greater importance to more precise data, e.g. w_i = 1/σ_i²):

WSSE = Σ_i w_i (Y_i − y(X_i))²

We seek to minimize SSE and WSSE.
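
As a rough sketch, these two metrics might be coded as below for the example model (the 1/σ² weights are one common choice, not prescribed above):

import numpy as np

def model(theta, x):
    """Proposed model y(x) = theta1 * sin(theta2 * x)."""
    return theta[0] * np.sin(theta[1] * x)

def sse(theta, x, y):
    """Sum of squared errors: all data equally important."""
    residuals = y - model(theta, x)
    return np.sum(residuals ** 2)

def wsse(theta, x, y, sigma):
    """Weighted SSE with weights 1/sigma^2, favoring more precise data."""
    residuals = y - model(theta, x)
    return np.sum((residuals / sigma) ** 2)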

Page 8: Optimization tutorial

8

Log likelihood

Define the likelihood L(θ|Y) = p(Y|θ) as the likelihood of θ being the true parameters given the observed data.

Goodness-of-fit metrics

Page 9: Optimization tutorial

9

Log likelihood

Given that the Y_i are i.i.d., we can compute p(Y|θ) = Π_i p(Y_i|θ).

The log transform is for convenience: ln L(θ|Y) = Σ_i ln p(Y_i|θ)

Goodness-of-fit metrics

We seek to maximize ln L(θ | Y)

Page 10: Optimization tutorial

10

Log likelihood

So what is p(Yi|θ) ?

Assume each residual is drawn from a distribution. For example, assume the e_i are Gaussian distributed with mean zero and variance σ_i², so that

p(Y_i|θ) = (1/√(2π σ_i²)) exp(−(Y_i − y(X_i))² / (2σ_i²))

Goodness-of-fit metrics

Page 11: Optimization tutorial

11

Log likelihood

Goodness-of-fit metrics

maximize ln L(θ | Y):

ln L(θ|Y) = −Σ_i ln(√(2π) σ_i) − Σ_i (Y_i − y(X_i))² / (2σ_i²)

For constant σ this is equivalent to minimizing SSE; for known, varying σ_i it is equivalent to minimizing WSSE with w_i = 1/σ_i².
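
A sketch of the corresponding objective; in practice one minimizes the negative log-likelihood (library minimizers minimize), and σ is assumed known here:

import numpy as np

def model(theta, x):
    return theta[0] * np.sin(theta[1] * x)

def neg_log_likelihood(theta, x, y, sigma):
    """Negative Gaussian log-likelihood of the residuals; minimizing this maximizes ln L."""
    residuals = y - model(theta, x)
    return np.sum(0.5 * np.log(2 * np.pi * sigma ** 2)
                  + 0.5 * (residuals / sigma) ** 2)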

Page 12: Optimization tutorial

12

Least Squares
• simple and straightforward to implement
• requires large N for high accuracy

Weighted Least Squares
• accounts for variability in the precision of the data
• converges to least squares for high N

Log Likelihood
• requires an assumption for the residuals' PDF

Goodness-of-fit metrics

Page 13: Optimization tutorial

13

Given a proposed model: y(x) = θ1 sin(θ2 x)

which parameters (θi) best describe the data?

Data modeling

minimize SSE(θ)                (objective)
where θ = {θ1, θ2, …, θn}      (variables)
s.t. Aθ < b                    (constraints)

Page 14: Optimization tutorial

14

Given a proposed model: y(x) = θ1 sin(θ2 x)

which parameters (θi) best describe the data?

Data modeling

optimum variables: θ = {5, 1}

minimum: SSE(θ) = 277
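
A minimal fitting sketch with scipy.optimize.fmin; the synthetic data and starting guess below are illustrative, not the data behind the slide:

import numpy as np
from scipy.optimize import fmin

def model(theta, x):
    return theta[0] * np.sin(theta[1] * x)

def sse(theta, x, y):
    return np.sum((y - model(theta, x)) ** 2)

# Illustrative synthetic data near theta = {5, 1}
rng = np.random.default_rng(0)
x_data = np.linspace(0, 10, 50)
y_data = 5.0 * np.sin(1.0 * x_data) + rng.normal(0, 1.5, x_data.size)

theta0 = np.array([4.0, 0.9])                      # starting guess
theta_opt = fmin(sse, theta0, args=(x_data, y_data))
print(theta_opt, sse(theta_opt, x_data, y_data))

Note that the starting guess matters here: SSE(θ) for a sinusoidal model has many local minima.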

Page 15: Optimization tutorial

15

minimize f(x)
where x = {x1, x2, …, xn}
s.t. Ax < b

So how do we optimize?

Page 16: Optimization tutorial

16

Types of problems

There are many classes of optimization problems

1. constrained vs unconstrained
2. static vs dynamic
3. continuous vs discrete variables
4. deterministic vs stochastic variables
5. single vs multiple objective functions

Page 17: Optimization tutorial

17

Types of algorithms

There are many more classes of algorithms that attempt to solve these problems

(algorithm taxonomy figure: NEOS, UW)

Page 18: Optimization tutorial

18

Types of algorithms

There are many more classes of algorithms that attempt to solve these problems

(same taxonomy figure, highlighting the current scope: NEOS, UW)

Page 19: Optimization tutorial

19

Unconstrained Optimization

Here we classify algorithms by the derivative information utilized.

Zero-Order Methods (function calls only)
• Nelder-Mead Simplex (direct search)
• Powell Conjugate Directions

First-Order Methods
• Steepest Descent
• Nonlinear Conjugate Gradients
• Broyden-Fletcher-Goldfarb-Shanno (BFGS)

Second-Order Methods
• Newton's Method
• Newton Conjugate Gradient

scipy.optimize.fmin

Page 20: Optimization tutorial

20

Unconstrained Optimization in 1-D

All but the simplex and Newton methods call one-dimensional line searches as a subroutine.

General Iterative Scheme: x_{n+1} = x_n + α_n d_n
where α = step size and d_n = search direction

Common option:
• Bisection methods (e.g. Golden Section Search): linear convergence, but robust
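
A minimal golden-section line-search sketch (the bracket and tolerance below are illustrative):

import numpy as np

def golden_section(f, a, b, tol=1e-6):
    """Minimize a unimodal 1-D function f on [a, b] by golden-section search."""
    invphi = (np.sqrt(5) - 1) / 2          # 1/phi ~ 0.618
    while abs(b - a) > tol:
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
        if f(c) < f(d):
            b = d                          # minimum lies in [a, d]
        else:
            a = c                          # minimum lies in [c, b]
    return (a + b) / 2

# e.g. minimize f(x) = (x - 2)^2 on [0, 5]
print(golden_section(lambda x: (x - 2) ** 2, 0.0, 5.0))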

Page 21: Optimization tutorial

21

Unconstrained Optimization in 1-D

Calculus-based option:
• Newton-Raphson

Treats 1-D optimization as root finding: we want f'(x) = 0, so let g(x) = f'(x) and find the root of g.

Can use explicit derivatives or a numerical approximation.

Page 22: Optimization tutorial

22

Newton-Raphson

Move to the minimum of a quadratic fit at each point:

x_{n+1} = x_n − f'(x_n) / f''(x_n)

Can achieve quadratic convergence for twice-differentiable functions.

(figure: COS 323 Course Notes, Princeton U.)
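
A 1-D Newton-Raphson sketch on g(x) = f'(x); note that it converges to a stationary point, not necessarily a minimum (the example function is illustrative):

def newton_raphson_1d(fprime, fsecond, x0, tol=1e-8, max_iter=50):
    """Find a stationary point of f by Newton-Raphson on f'(x) = 0."""
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)      # jump to the minimum of the local quadratic fit
        x -= step
        if abs(step) < tol:
            break
    return x

# e.g. f(x) = x^4 - 3x^2, so f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6
print(newton_raphson_1d(lambda x: 4*x**3 - 6*x, lambda x: 12*x**2 - 6, x0=1.0))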

Page 23: Optimization tutorial

23

Newton's Method in N-Dimensions

1. Construct a locally quadratic model (m_n) via Taylor expansion about x_n, for points p near x_n:

   m_n(p) = f(x_n) + ∇f(x_n)ᵀ p + ½ pᵀ H(x_n) p

2. At each step we want to move toward the minimum of this model, where ∇m_n(p) = 0.

   Differentiating: ∇m_n(p) = ∇f(x_n) + H(x_n) p

   Solving: p_n = −H(x_n)⁻¹ ∇f(x_n)

Page 24: Optimization tutorial

24

Newton's Method in N-Dimensions

3. The minimum of the local second-order model lies in the direction p_n, which becomes the search direction d_n in the general iterative scheme x_{n+1} = x_n + α_n d_n.

   Determine the optimal step size, α, by 1-D optimization (golden search, Newton's method, Brent's method, Nelder-Mead simplex, etc.)

Page 25: Optimization tutorial

25

Newton's Method in N-Dimensions

4. Take the step: x_{n+1} = x_n + α_n p_n

5. Check termination criteria; if not met, return to step 1 and rebuild the local model at the new point.

Possible criteria:
• Maximum iterations reached
• Change in objective function below threshold
• Change in local gradient below threshold
• Change in local Hessian below threshold
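
A compact sketch of steps 1-5; the crude backtracking rule for α stands in for a full 1-D line search, the gradient-norm stopping test is one of the listed criteria, and the test function is illustrative:

import numpy as np

def newton_minimize(f, grad, hess, x0, tol=1e-8, max_iter=100):
    """Newton's method: step along p_n = -H(x_n)^-1 grad f(x_n) with a backtracking step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:            # gradient-based termination
            break
        p = np.linalg.solve(hess(x), -g)       # solve H p = -g (avoids an explicit inverse)
        alpha = 1.0
        while f(x + alpha * p) > f(x) and alpha > 1e-8:
            alpha *= 0.5                       # crude backtracking line search
        x = x + alpha * p
    return x

# e.g. f(x, y) = (x - 1)^2 + 10 (y + 2)^2
f = lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton_minimize(f, grad, hess, x0=[5.0, 5.0]))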

Page 26: Optimization tutorial

26

Newton's Method in N-Dimensions

How do we compute the Hessian?

Newton's Method
• Define H(x_n) expressions
• Invert it and multiply
• Accurate
• Costly for high N
• Requires 2nd derivatives

BFGS Algorithm (quasi-Newton method)
• Numerically approximate H⁻¹(x_n)
• Multiply matrices
• Avoids solving the system
• Only requires 1st derivatives
• Crazy math I don't get

scipy.optimize.fmin_bfgs
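
A usage sketch: fprime supplies an analytical gradient, and omitting it makes fmin_bfgs fall back on finite differences (the function here is illustrative):

import numpy as np
from scipy.optimize import fmin_bfgs

f = lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2
fprime = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])

x_opt = fmin_bfgs(f, x0=[5.0, 5.0], fprime=fprime)
print(x_opt)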

Page 27: Optimization tutorial

27

Gradient Descent

Newton/BFGS make use of the local Hessian. Alternatively, we could just use the gradient:

1. Pick a starting point, x_0
2. Evaluate the local gradient
3. Perform a line search along the (negative) gradient direction to find the step size
4. Move directly along the gradient: x_{n+1} = x_n − α_n ∇f(x_n)
5. Check convergence criteria and return to step 2

Page 28: Optimization tutorial

28

Gradient Descent

• Function must be differentiable
• Subsequent steps are always perpendicular
• Can get caught in narrow valleys
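
A sketch of steps 1-5, delegating the line search to scipy.optimize.minimize_scalar (an implementation choice, not specified above; the test function is illustrative):

import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad, x0, tol=1e-6, max_iter=500):
    """Steepest descent with a 1-D line search along -grad f at each step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = minimize_scalar(lambda a: f(x - a * g)).x   # step size from a 1-D search
        x = x - alpha * g
    return x

f = lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
print(gradient_descent(f, grad, x0=[5.0, 5.0]))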

Page 29: Optimization tutorial

29

Conjugate Gradient Method

Avoids reversing previous iterations by ensuring that each step is conjugate to all previous steps, creating a linearly independent set of basis vectors.

1. Pick a starting point and evaluate the local derivative
2. First step follows gradient descent
3. Compute the weight for the previous step, β_n (Polak-Ribiere version):

   β_n = Δx_nᵀ (Δx_n − Δx_{n−1}) / (Δx_{n−1}ᵀ Δx_{n−1})

   where Δx_n = −∇f(x_n) is the steepest direction

Page 30: Optimization tutorial

30

Conjugate Gradient Method

Creates a set, s_i, of linearly independent vectors that span the parameter space of the x_i.

4. Compute the search direction: s_n = Δx_n + β_n s_{n−1}

5. Move to the optimal point along s_n: x_{n+1} = x_n + α_n s_n

6. Check convergence criteria and return to step 3

*Note that setting βi = 0 yields the gradient descent algorithm

Page 31: Optimization tutorial

31

Conjugate Gradient Method

• For properly conditioned problems, guaranteed to converge in N iterations
• Very commonly used

scipy.optimize.fmin_cg
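
A sketch of steps 1-6 using the Polak-Ribiere weight above; the max(0, ·) restart safeguard is an extra assumption not on the slides, and in practice scipy.optimize.fmin_cg handles all of this:

import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(f, grad, x0, tol=1e-6, max_iter=200):
    """Nonlinear conjugate gradients with the Polak-Ribiere weight."""
    x = np.asarray(x0, dtype=float)
    d = -grad(x)                     # step 2: first step follows gradient descent
    s = d.copy()
    for _ in range(max_iter):
        alpha = minimize_scalar(lambda a: f(x + a * s)).x   # step 5: line search along s_n
        x = x + alpha * s
        d_new = -grad(x)
        if np.linalg.norm(d_new) < tol:
            break
        beta = max(0.0, d_new @ (d_new - d) / (d @ d))      # step 3: Polak-Ribiere (clipped at 0)
        s = d_new + beta * s                                # step 4: new search direction
        d = d_new
    return x

f = lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
print(conjugate_gradient(f, grad, x0=[5.0, 5.0]))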

Page 32: Optimization tutorial

32

Powell's Conjugate Directions

Performs N line searches along N basis vectors in order to determine an optimal search direction. Preserves the minimization achieved by previous steps by retaining the basis vector set between iterations.

1. Pick a starting point, x_0, and a set of basis vectors, u_i (the coordinate unit vectors are the convention)
2. Determine the optimum step size, α_i, along each vector
3. Let the search vector be the linear combination of basis vectors: s_n = Σ_i α_i u_i

Page 33: Optimization tutorial

33

Powell's Conjugate Directions

4. Move along the search vector: x_{n+1} = x_n + s_n
5. Add s_n to the basis and drop the oldest basis vector
6. Check the convergence criteria and return to step 2

Problem: the algorithm tends toward a linearly dependent basis set

Solutions:
1. Reset to an orthogonal basis every N iterations
2. At step 5, replace the basis vector corresponding to the largest change in f(x)

Page 34: Optimization tutorial

34

Powell's Conjugate Directions

Advantages:
• No derivatives required; only uses function calls
• Quadratic convergence

Accessible via scipy.optimize.fmin_powell
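
A quick usage sketch on the 2-D test function from the comparison slides (the starting point is arbitrary):

import numpy as np
from scipy.optimize import fmin_powell

# f(x, y) = sin(x) + cos(y), the function used in the algorithm comparison below
f = lambda v: np.sin(v[0]) + np.cos(v[1])

x_opt = fmin_powell(f, x0=[1.0, 1.0])
print(x_opt, f(x_opt))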

Page 35: Optimization tutorial

35

Nelder-Mead Simplex Algorithm

Direct search algorithm. Default method of scipy.optimize.fmin.

The method consists of a simplex crawling around the parameter space until it finds and brackets a local minimum.

Page 36: Optimization tutorial

36

Nelder-Mead Simplex Algorithm

Simplex: the convex hull of N+1 vertices in N-space.
• 2D: a triangle
• 3D: a tetrahedron

Page 37: Optimization tutorial

37

Nelder-Mead Simplex Algorithm

1. Pick a starting point and define a simplex around it with N+1 vertices x_i

2. Evaluate f(x_i) at each vertex and rank-order the vertices such that x_1 is the best and x_{N+1} is the worst

3. Evaluate the centroid of the best N vertices: x̄ = (1/N) Σ_{i=1…N} x_i

Page 38: Optimization tutorial

38

Nelder-Mead Simplex Algorithm

4. Reflection: let x_r = x̄ + α(x̄ − x_{N+1}), where x_{N+1} is the worst point (highest function value)

   If f(x_1) ≤ f(x_r) < f(x_N), replace x_{N+1} with x_r

(figure: COS 323 Course Notes, Princeton U.)

Page 39: Optimization tutorial

39

Nelder-Mead Simplex Algorithm

5. Expansion: if reflection resulted in the best point so far (f(x_r) < f(x_1)), try x_e = x̄ + β(x_r − x̄)

   If f(x_e) < f(x_r), then replace x_{N+1} (the worst point) with x_e

   If not, replace x_{N+1} with x_r

(figure: COS 323 Course Notes, Princeton U.)

Page 40: Optimization tutorial

40

Nelder-Mead Simplex Algorithm

6. Contraction: if the reflected point is still the worst, then try x_c = x̄ + γ(x_{N+1} − x̄); if f(x_c) < f(x_{N+1}), replace x_{N+1} with x_c

(figure: COS 323 Course Notes, Princeton U.)

Page 41: Optimization tutorial

41

Nelder-Mead Simplex Algorithm

7. Shrinkage: if contraction fails, scale all vertices toward the best vertex:

   x_i ← x_1 + δ(x_i − x_1)   for i = 2, …, N+1

(figure: COS 323 Course Notes, Princeton U.)

Page 42: Optimization tutorial

42

Nelder-Mead Simplex Algorithm

Advantages:
• Doesn't require any derivatives
• Few function calls at each iteration
• Works with rough surfaces

Disadvantages:
• Can require many iterations
• Does not always converge; its convergence properties are poorly understood
• Inefficient in very high N

Page 43: Optimization tutorial

43

Nelder-Mead Simplex Algorithm

Parameter    Required       Typical
α            α > 0          1
β            β > 1          2
γ            0 < γ < 1      0.5
δ            0 < δ < 1      0.5
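
scipy.optimize.fmin does not expose these coefficients directly; a usage sketch controlling tolerances and iteration limits instead, on the illustrative test function from the comparison slides:

import numpy as np
from scipy.optimize import fmin

# f(x, y) = sin(x) + cos(y)
f = lambda v: np.sin(v[0]) + np.cos(v[1])

x_opt, f_opt, n_iter, n_calls, flag = fmin(
    f, x0=[1.0, 1.0],
    xtol=1e-6,          # convergence tolerance on the simplex vertices
    ftol=1e-6,          # convergence tolerance on the function values
    maxiter=500,
    full_output=True)
print(x_opt, f_opt, n_iter, n_calls)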

Page 44: Optimization tutorial

44

Algorithm Comparison

Algorithm            min f(x)    iterations    f(x) evals    f'(x) evals
powell               -2          2             43            0
conjugate gradient   -2          4             40            10
gradient descent     -2          3             32            8
bfgs                 -2          6             48            12
simplex              -2          45            87            0

f(x) = sin(x) + cos(y)

Page 45: Optimization tutorial

45

Algorithm Comparison

f(x) = sin(xy) + cos(y)

Simplex & Powell seem to follow valleys similarly, with a more “local” focus

BFGS/CG readily transcend valleys

Page 46: Optimization tutorial

46

2-D Rosenbrock Function

Algorithm            min f(x)    iterations    f(x) evals    f'(x) evals
powell               3.8E-28     25            719           0
conjugate gradient   9.5E-08     33            368           89
gradient descent     1.1E+01     400           1712          428
bfgs                 1.8E-11     47            284           71
simplex              5.6E-10     106           201           0
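
A sketch that re-runs this comparison with scipy's built-in rosen/rosen_der; the starting point is an assumption (the slides don't give one), gradient descent has no direct scipy counterpart here, and exact counts will vary by version and settings:

import numpy as np
from scipy.optimize import rosen, rosen_der, fmin, fmin_powell, fmin_bfgs, fmin_cg

x0 = np.array([-1.2, 1.0])   # assumed starting point

for name, solver, kwargs in [
        ("simplex", fmin, {}),
        ("powell", fmin_powell, {}),
        ("bfgs", fmin_bfgs, {"fprime": rosen_der}),
        ("conjugate gradient", fmin_cg, {"fprime": rosen_der})]:
    x_opt = solver(rosen, x0, disp=False, **kwargs)
    print(f"{name:>20s}: f(x*) = {rosen(x_opt):.2e}")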