There and Back Again: A Tale of Expectations and Slopes. A NeurIPS 2020 tutorial by Marc Deisenroth and Cheng Soon Ong.

Method of Lagrange - Mathematics for Machine Learning


Page 1: Method of Lagrange - Mathematics for Machine Learning

There and Back Again: A Tale of Expectations and Slopes. A NeurIPS 2020 tutorial by Marc Deisenroth and Cheng Soon Ong.

[Cover figure: a map of the tutorial's topics, with landmarks including Bay of Backprop, Monte Carlo Heights, Normalizing Flow, Swamps of numerical integration, Vale of Implicit, Forward-backward Passage, Adjoint Beach, and Unrolling Hills of Time.]

Page 2: Method of Lagrange - Mathematics for Machine Learning

[Cover figure repeated, with Lagrange Lake added to the map's landmarks.]

Page 3: Method of Lagrange - Mathematics for Machine Learning

Method of Lagrange

Cheng Soon Ong
Marc Peter Deisenroth

December 2020

Marc Deisenroth and Cheng Soon Ong's Tutorial at CM

Page 4: Method of Lagrange - Mathematics for Machine Learning

Motivation

▶ In machine learning, we use gradients to train predictors
▶ For a function f(x) we can directly obtain its gradient ∇f(x)
▶ We may wish to calculate a gradient for functions with constraints
▶ Assume that we can calculate the gradient for the corresponding unconstrained problem

Key idea
How to take a gradient with respect to a constrained optimization problem.


Page 5: Method of Lagrange - Mathematics for Machine Learning

History

▶ Joseph-Louis Lagrange (b. 1736) worked on the calculus of variations and founded the idea
▶ "Nonlinear programming" was coined by Harold W. Kuhn and Albert W. Tucker in 1951
▶ It turns out William Karush had already written about similar conditions in his master's thesis in 1939; his work was reintroduced into the literature by Akira Takayama in 1974
▶ The classic backpropagation paper by Yann LeCun in 1988 observed the relation between backpropagation and Lagrange multipliers


Page 6: Method of Lagrange - Mathematics for Machine Learning

Constrained optimization

Primal optimization problem
Given f : R^D → R and g_i : R^D → R for i = 1, …, m,

min_x f(x)
subject to g_i(x) ⩽ 0 for all i = 1, …, m
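As an illustration, a primal problem of this form can be solved numerically. The problem below is an assumed toy example, not one from the slides: minimize f(x) = x₁² + x₂² subject to g(x) = 1 − x₁ − x₂ ⩽ 0, whose optimum is x* = (0.5, 0.5).

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy problem: minimize f(x) = x1^2 + x2^2
# subject to g(x) = 1 - x1 - x2 <= 0, i.e. the half-plane x1 + x2 >= 1.
f = lambda x: x[0] ** 2 + x[1] ** 2
g = lambda x: 1.0 - x[0] - x[1]

# scipy uses the convention c(x) >= 0 for "ineq" constraints, so we pass -g.
res = minimize(f, x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: -g(x)}])
print(res.x)  # close to the optimum x* = (0.5, 0.5)
```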


Page 7: Method of Lagrange - Mathematics for Machine Learning

Lagrange multipliers as relaxation

We can convert the primal optimization problem to an unconstrained problem:

J(x) = f(x) + ∑_{i=1}^m 1(g_i(x)) ,

where 1(z) is an infinite step function

1(z) = 0 if z ⩽ 0, and ∞ otherwise.

Key idea
Approximate the indicator function by a linear function.
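The relaxation rests on a simple bound: for any λ ⩾ 0, the linear function λz lower-bounds the step 1(z). A minimal numerical sketch (the sampled λ and z values are arbitrary):

```python
# The infinite step function 1(z): 0 on the feasible side, infinity otherwise.
indicator = lambda z: 0.0 if z <= 0 else float("inf")

# For any lam >= 0 and any z, lam * z <= 1(z):
#   z <= 0 gives lam * z <= 0 = 1(z);  z > 0 gives a finite value < infinity.
for lam in [0.0, 0.5, 2.0]:
    for z in [-1.0, -0.1, 0.0, 0.3, 2.0]:
        assert lam * z <= indicator(z)
print("lambda * z lower-bounds 1(z) at all sampled points")
```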


Page 8: Method of Lagrange - Mathematics for Machine Learning

Lagrangian

We associate the primal optimization problem with a Lagrangian by introducing a Lagrange multiplier λ_i for each constraint g_i.

Lagrangian

L(x, λ) = f(x) + ∑_{i=1}^m λ_i g_i(x)
        = f(x) + λᵀ g(x)
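In code the Lagrangian is just f(x) + λᵀg(x). A minimal sketch for an assumed toy problem (f(x) = ‖x‖², one constraint g(x) = 1 − x₁ − x₂ ⩽ 0, so m = 1):

```python
import numpy as np

def lagrangian(x, lam):
    """L(x, lam) = f(x) + lam^T g(x) for the assumed toy problem."""
    f = x @ x                        # f(x) = ||x||^2
    g = np.array([1.0 - x.sum()])    # constraints stacked into a vector (m = 1)
    return f + lam @ g               # f(x) + lam^T g(x)

x = np.array([0.5, 0.5])
lam = np.array([1.0])
print(lagrangian(x, lam))  # 0.5: the constraint is active here, so lam^T g(x) = 0
```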


Page 9: Method of Lagrange - Mathematics for Machine Learning

Why Lagrangian?

▶ min_{x∈R^D} L(x, λ) is an unconstrained optimization problem for a given value of λ
▶ Turns out: if solving min_{x∈R^D} L(x, λ) is easy, then the overall problem is easy


Page 10: Method of Lagrange - Mathematics for Machine Learning

Dual optimization problem

Key idea
Convert a problem in x ∈ R^D to another problem in λ ∈ R^m.

▶ Primal and dual problems are related
▶ The objective values at optimality are related via gradients
▶ The dual problem (maximization over λ) is a maximum over a set of affine functions, and hence is a concave function, even though f(·) and g_i(·) may be non-convex


Page 11: Method of Lagrange - Mathematics for Machine Learning

Primal dual pair

The problem

min_x f(x)
subject to g_i(x) ⩽ 0 for all i = 1, …, m

is known as the primal problem, corresponding to the primal variables x. The associated Lagrangian dual problem is given by

max_{λ∈R^m} D(λ)
subject to λ ⩾ 0 ,

where λ are the dual variables and D(λ) = min_{x∈R^D} L(x, λ).
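The dual function can be evaluated numerically by solving the inner unconstrained minimization. A sketch for the assumed toy problem f(x) = ‖x‖² with g(x) = 1 − x₁ − x₂ ⩽ 0, where analytically D(λ) = λ − λ²/2, maximized at λ = 1 with D(1) = 0.5:

```python
import numpy as np
from scipy.optimize import minimize

def D(lam):
    """Dual function D(lam) = min_x L(x, lam), computed numerically."""
    L = lambda x: x @ x + lam * (1.0 - x.sum())  # L(x, lam) for the toy problem
    return minimize(L, x0=np.zeros(2)).fun

# Analytically D(lam) = lam - lam^2 / 2, so the dual optimum is D(1) = 0.5.
print(round(D(1.0), 4))
```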


Page 12: Method of Lagrange - Mathematics for Machine Learning

Minimax inequality

Why is Lagrangian duality useful? Because primal values are greater than (or equal to) dual values.

Key idea
For any function of two arguments, the maximin is at most the minimax:

max_y min_x φ(x, y) ⩽ min_x max_y φ(x, y) .
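The inequality is easy to check numerically on a grid for any φ; here φ(x, y) = sin(x)cos(y) is an arbitrary choice, not a function from the slides:

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 101)
ys = np.linspace(-2.0, 2.0, 101)
phi = np.sin(xs)[:, None] * np.cos(ys)[None, :]  # phi[i, j] = phi(x_i, y_j)

maximin = phi.min(axis=0).max()  # max over y of (min over x)
minimax = phi.max(axis=1).min()  # min over x of (max over y)
assert maximin <= minimax
print(maximin, minimax)
```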


Page 13: Method of Lagrange - Mathematics for Machine Learning

Weak duality

Key idea
Primal values f(x) are always greater than (or equal to) dual values D(λ).

We can see this by applying the minimax inequality:

min_{x∈R^D} max_{λ⩾0} L(x, λ) ⩾ max_{λ⩾0} min_{x∈R^D} L(x, λ) ,

where the inner maximum on the left equals f(x) for feasible x, and the inner minimum on the right is D(λ).


Page 14: Method of Lagrange - Mathematics for Machine Learning

Motivating the Karush Kuhn Tucker conditions

Lagrangian
L(x, λ) = f(x) + ∑_{i=1}^m λ_i g_i(x)

Stationarity
Solve for the optimal value by taking the gradient w.r.t. x and setting it to zero.

Lagrange multipliers are orthogonal to constraints
▶ Recall that we have one Lagrange multiplier per constraint
▶ If λ_i = 0, then g_i(x) can take any value (has slack)
▶ If g_i(x) = 0 (active), then λ_i has slack

Retain feasibility


Page 15: Method of Lagrange - Mathematics for Machine Learning

Karush Kuhn Tucker conditions

Stationarity
0 ∈ ∇f(x) + ∑_{i=1}^m λ_i ∇g_i(x) .   (1)

Complementary slackness
λ_i g_i(x) = 0 for all i = 1, …, m .   (2)

Primal feasibility
g_i(x) ⩽ 0 for all i = 1, …, m .   (3)

Dual feasibility
λ_i ⩾ 0 for all i = 1, …, m .   (4)
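The four conditions can be checked directly at a known optimum. For the assumed toy problem f(x) = ‖x‖² with g(x) = 1 − x₁ − x₂ ⩽ 0 (not from the slides), the optimum is x* = (0.5, 0.5) with λ* = 1:

```python
import numpy as np

# Known optimum of the assumed toy problem.
x = np.array([0.5, 0.5])
lam = 1.0
grad_f = 2 * x                    # gradient of f(x) = ||x||^2
grad_g = np.array([-1.0, -1.0])   # gradient of g(x) = 1 - x1 - x2
g = 1.0 - x.sum()

assert np.allclose(grad_f + lam * grad_g, 0)  # (1) stationarity
assert np.isclose(lam * g, 0)                 # (2) complementary slackness
assert g <= 0                                 # (3) primal feasibility
assert lam >= 0                               # (4) dual feasibility
print("all four KKT conditions hold at (x*, lambda*)")
```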


Page 16: Method of Lagrange - Mathematics for Machine Learning

Constrained optimization

Key idea
The objective gradient is a conic combination of active constraint gradients.

By the stationarity condition,

∇f(x) = −∑_{i=1}^m λ_i ∇g_i(x)

for a given value of λ.


Page 17: Method of Lagrange - Mathematics for Machine Learning

KKT implies strong duality

Key idea
The KKT conditions are sufficient for a zero duality gap.

Proof sketch:
▶ By weak duality, f(x) ⩾ D(λ). Let x* and λ* denote the optimal values.
▶ By the stationarity condition,

D(λ*) = f(x*) + ∑_{i=1}^m λ*_i g_i(x*) .

▶ By complementary slackness,

f(x*) + ∑_{i=1}^m λ*_i g_i(x*) = f(x*) .

▶ Hence the duality gap is zero.
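The zero duality gap can be confirmed numerically on an assumed toy problem, f(x) = ‖x‖² with g(x) = 1 − x₁ − x₂ ⩽ 0, where both the primal and dual optima equal 0.5:

```python
import numpy as np
from scipy.optimize import minimize

# Primal optimum: minimize ||x||^2 subject to x1 + x2 >= 1.
f_star = minimize(lambda x: x @ x, np.zeros(2), method="SLSQP",
                  constraints=[{"type": "ineq",
                                "fun": lambda x: x.sum() - 1.0}]).fun

# Dual optimum: maximize D(lam) = min_x L(x, lam) over a grid of lam >= 0.
D = lambda lam: minimize(lambda x: x @ x + lam * (1.0 - x.sum()),
                         np.zeros(2)).fun
d_star = max(D(lam) for lam in np.linspace(0.0, 2.0, 201))

print(f_star, d_star)  # both close to 0.5: the duality gap is zero
```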

Page 18: Method of Lagrange - Mathematics for Machine Learning

KKT implies strong duality

Key idea
The challenge is to express geometric constraints as algebraic equations.

A constraint qualification is additionally needed for the KKT conditions to necessarily imply a zero duality gap. Examples include:
▶ Mangasarian–Fromovitz
▶ Linear independence
▶ Fritz John
▶ Slater's condition


Page 19: Method of Lagrange - Mathematics for Machine Learning

Summary

Key idea
How to take a gradient with respect to a constrained optimization problem.

▶ The objective gradient is a conic combination of active constraint gradients:

∇f(x) = −∑_{i=1}^m λ_i ∇g_i(x)

▶ Check that, for each constraint, one of the Lagrange multiplier or the primal constraint value is zero (complementary slackness)
▶ The KKT conditions are sufficient for a zero duality gap
