Automatic Differentiation (1)

Slides Prepared By: Atılım Güneş Baydin (gunes@robots.ox.ac.uk)


Page 1: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Automatic Differentiation (1)

Slides Prepared By:

Atılım Güneş Baydin (gunes@robots.ox.ac.uk)

Page 2: gunes@robots.ox.ac.uk Automatic Differentiation (1)

2

Outline

This lecture:
- Derivatives in machine learning
- Review of essential concepts (what is a derivative, Jacobian, etc.)
- How do we compute derivatives
- Automatic differentiation

Next lecture:
- Current landscape of tools
- Implementation techniques
- Advanced concepts (higher-order API, checkpointing, etc.)

Page 3: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Derivatives and machine learning

3

Page 4: gunes@robots.ox.ac.uk Automatic Differentiation (1)

4

Derivatives in machine learning
“Backprop” and gradient descent are at the core of all recent advances:

- Computer vision: top-5 error rate for ImageNet (NVIDIA devblog), NVIDIA DRIVE PX 2 segmentation, Faster R-CNN (Ren et al., 2015)
- Speech recognition/synthesis: word error rates (Huang et al., 2014)
- Machine translation: Google Neural Machine Translation System (GNMT)

Page 5: gunes@robots.ox.ac.uk Automatic Differentiation (1)

5

Derivatives in machine learning
“Backprop” and gradient descent are at the core of all recent advances.

Probabilistic programming (and modeling): Edward (2016), Pyro (2017), ProbTorch (2017), TensorFlow Probability (2018)

- Variational inference
- “Neural” density estimation
- Transformed distributions via bijectors
- Normalizing flows (Rezende & Mohamed, 2015)
- Masked autoregressive flows (Papamakarios et al., 2017)

Page 6: gunes@robots.ox.ac.uk Automatic Differentiation (1)

6

Derivatives in machine learning
At the core of all of them: differentiable functions (programs) whose parameters are tuned by gradient-based optimization.

(Ruder, 2017) http://ruder.io/optimizing-gradient-descent/
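As a concrete sketch of that tuning loop (my own minimal illustration, not from the slides; the toy objective and learning rate below are made up), plain gradient descent looks like this:

import numpy as np

def loss(theta):
    # Toy differentiable objective, minimized at theta = 3
    return np.sum((theta - 3.0) ** 2)

def grad_loss(theta):
    # Hand-written gradient; in practice this is what automatic differentiation provides
    return 2.0 * (theta - 3.0)

theta = np.zeros(5)              # parameters to tune
learning_rate = 0.1
for step in range(100):          # gradient descent: follow the negative gradient
    theta = theta - learning_rate * grad_loss(theta)
print(theta)                     # close to [3, 3, 3, 3, 3]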

Page 7: gunes@robots.ox.ac.uk Automatic Differentiation (1)

7

Automatic differentiation
Execute differentiable functions (programs) via automatic differentiation.

A word on naming:
- Differentiable programming, a generalization of deep learning (Olah, LeCun):
  “Neural networks are just a class of differentiable functions”
- Automatic differentiation
- Algorithmic differentiation
- AD
- Autodiff
- Algodiff
- Autograd

Also remember:
- Backprop
- Backpropagation (backward propagation of errors)

Page 8: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Essential concepts: refresher

8

Page 9: gunes@robots.ox.ac.uk Automatic Differentiation (1)

9

Derivative
Function of a real variable, y = f(x), where y is the dependent variable and x the independent variable.

Sensitivity of the function value w.r.t. a change in its argument (the instantaneous rate of change).

Newton, c. 1665; Leibniz, c. 1675. Common notations: Leibniz (dy/dx), Lagrange (f′(x)), Newton (ẏ).
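The definition behind this (the slide's formula is an image and does not survive extraction; this is the standard limit definition):

f'(x) = \frac{dy}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}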

Page 10: gunes@robots.ox.ac.uk Automatic Differentiation (1)

10

Derivative
Function of a real variable (Newton, c. 1665; Leibniz, c. 1675).

The derivatives of the elementary operations are given by a small table of rules; around 15 such rules suffice.

Note: the derivative is a linear operator, a.k.a. a higher-order function in programming languages.
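To make the higher-order-function view concrete, here is a small sketch of my own (not from the slides): a function that takes a function and returns an approximation of its derivative. A symbolic or automatic differentiator would return the derivative exactly, but the type is the same: function in, function out.

import math

def derivative(f, h=1e-6):
    """Higher-order function: maps f to an approximation of f' (central difference)."""
    def f_prime(x):
        return (f(x + h) - f(x - h)) / (2 * h)
    return f_prime

d_sin = derivative(math.sin)   # itself a function, numerically close to cos
print(d_sin(0.0))              # approximately 1.0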

Page 11: gunes@robots.ox.ac.uk Automatic Differentiation (1)

11

Partial derivative
Function of several real variables.

A derivative w.r.t. one independent variable, with the others held constant.

Written with the symbol ∂, read “del”.

Page 12: gunes@robots.ox.ac.uk Automatic Differentiation (1)

12

Partial derivative
Function of several real variables.

The gradient of a scalar-valued function f(x1, ..., xn) is the vector of all its partial derivatives, written ∇f (“nabla” or “del”).

∇ (nabla) is the higher-order function mapping f to its gradient ∇f.

∇f points in the direction with the largest rate of change.
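Written out (a standard reconstruction; the slide's formula is an image):

\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right), \qquad f : \mathbb{R}^n \to \mathbb{R}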

Page 13: gunes@robots.ox.ac.uk Automatic Differentiation (1)

13

Total derivative
Function of several real variables.

The derivative w.r.t. all variables (independent and dependent).

Consider all partial derivatives simultaneously and accumulate all direct and indirect contributions. (Important: this will be useful later.)
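For example, if f depends on x directly and also indirectly through intermediate variables u(x) and v(x), the total derivative accumulates the direct and indirect contributions (a standard chain-rule statement, reconstructed here since the slide's formula is an image):

\frac{df}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial u}\frac{du}{dx} + \frac{\partial f}{\partial v}\frac{dv}{dx}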

Page 14: gunes@robots.ox.ac.uk Automatic Differentiation (1)

14

Matrix calculus and machine learning
Extension to multivariable functions:

- scalar input, scalar output: f : R → R
- vector input, scalar output: f : R^n → R (a scalar field)
- scalar input, vector output: f : R → R^m
- vector input, vector output: f : R^n → R^m (a vector field)

In machine learning, we construct (deep) compositions of
- vector-to-vector functions f : R^n → R^m, e.g., a neural network
- vector-to-scalar functions f : R^n → R, e.g., a loss function, KL divergence, or log joint probability

Page 15: gunes@robots.ox.ac.uk Automatic Differentiation (1)

15

Matrix calculus and machine learning

Generalization to tensors (multi-dimensional arrays) for efficient batching, handling of sequences, channels in convolutions, etc.

And many, many more rules

Page 16: gunes@robots.ox.ac.uk Automatic Differentiation (1)

16

Matrix calculus and machine learning

Finally, two constructs especially relevant to machine learning: the Jacobian and the Hessian.
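Their standard definitions (reconstructed here; the slide shows them as images): for f : R^n → R^m the Jacobian collects all first-order partial derivatives, and for a scalar-valued f the Hessian collects all second-order partial derivatives.

J_f = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}
\qquad
H_f = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}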

Page 17: gunes@robots.ox.ac.uk Automatic Differentiation (1)

How to compute derivatives

17

Page 18: gunes@robots.ox.ac.uk Automatic Differentiation (1)

18

Derivatives as code

We can compute the derivatives not just of mathematical functions, but of general programs (with control flow).

Page 19: gunes@robots.ox.ac.uk Automatic Differentiation (1)

19

Derivatives as code

Page 20: gunes@robots.ox.ac.uk Automatic Differentiation (1)

20

Manual

Analytic derivatives are needed for theoretical insight:
- analytic solutions, proofs
- mathematical analysis, e.g., stability of fixed points

Unnecessary when we just need derivative evaluations for optimization

You can see papers like this:

Page 21: gunes@robots.ox.ac.uk Automatic Differentiation (1)

21

Symbolic differentiation
Symbolic computation with Mathematica, Maple, Maxima, and deep learning frameworks such as Theano.
Problem: expression swell

Page 22: gunes@robots.ox.ac.uk Automatic Differentiation (1)

22

Symbolic differentiation
Symbolic computation with Mathematica, Maple, Maxima, and deep learning frameworks such as Theano.
Problem: expression swell

Graph optimization (e.g., in Theano)
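A small illustration of expression swell using SymPy (my own example, not from the slides): symbolically differentiating a few iterations of the logistic map already produces a much larger expression, and simplifying it back down costs extra work.

import sympy as sp

x = sp.symbols('x')

# Iterate the logistic map l <- 4 * l * (1 - l) a few times, symbolically
expr = x
for _ in range(4):
    expr = 4 * expr * (1 - expr)

d = sp.diff(expr, x)                         # naive symbolic derivative
print(sp.count_ops(expr), sp.count_ops(d))   # the derivative expression is much larger
print(sp.count_ops(sp.simplify(d)))          # simplification helps, at extra cost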

Page 23: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Symbolic differentiation

Symbolic graph builders such as Theano and TensorFlow have limited, unintuitive control flow, loops, and recursion.

Problem: only applicable to closed-form mathematical functions.

You can find the derivative of a closed-form mathematical expression, but not of a general program with control flow.

Page 24: gunes@robots.ox.ac.uk Automatic Differentiation (1)

24

Numerical differentiation

Finite difference approximation of the gradient, using a small step size h.

Problem: we must select h, and we face approximation errors.

Problem: the function needs to be evaluated n times, once with each standard basis vector, to obtain a full gradient in n dimensions.
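The forward-difference formula being referred to (reconstructed; the slide's equation is an image):

\frac{\partial f(\mathbf{x})}{\partial x_i} \approx \frac{f(\mathbf{x} + h\,\mathbf{e}_i) - f(\mathbf{x})}{h}, \qquad 0 < h \ll 1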

Page 25: gunes@robots.ox.ac.uk Automatic Differentiation (1)

25

Numerical differentiation

Finite difference approximation of the gradient.

Better approximations exist:
- higher-order finite differences, e.g., the center difference
- Richardson extrapolation
- differential quadrature

These increase rapidly in complexity and never completely eliminate the error.
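The center-difference formula mentioned above (reconstructed) is:

\frac{\partial f(\mathbf{x})}{\partial x_i} \approx \frac{f(\mathbf{x} + h\,\mathbf{e}_i) - f(\mathbf{x} - h\,\mathbf{e}_i)}{2h}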

Page 26: gunes@robots.ox.ac.uk Automatic Differentiation (1)

26

Numerical differentiation

Finite difference approximation of the gradient.

Better approximations exist:
- higher-order finite differences, e.g., the center difference
- Richardson extrapolation
- differential quadrature

These increase rapidly in complexity and never completely eliminate the error.

Still extremely useful as a quick check of our gradient implementations. Good to learn.
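A sketch of the kind of gradient check this suggests (my own minimal version, built on the center difference above):

import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Center-difference approximation of the gradient of a scalar-valued f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def check_gradient(f, grad_f, x, tol=1e-4):
    """Compare an analytic (or AD) gradient against the finite-difference one."""
    return np.allclose(grad_f(x), numerical_grad(f, x), atol=tol)

# Example: f(x) = sum(x ** 2) has gradient 2x
x = np.random.randn(4)
print(check_gradient(lambda v: np.sum(v ** 2), lambda v: 2 * v, x))  # True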

Page 27: gunes@robots.ox.ac.uk Automatic Differentiation (1)

27

Automatic differentiation

If we don’t need analytic derivative expressions, we can evaluate a gradient exactly with only one forward and one reverse execution.

In machine learning, this is known as backpropagation or “backprop”.

- Automatic differentiation is more than backprop
- Or: backprop is a specialized reverse-mode automatic differentiation
- We will come back to this shortly

(Rumelhart, Hinton & Williams, Nature 323, 533–536, 9 October 1986)

Page 28: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Backprop or automatic differentiation?

28

Page 29: gunes@robots.ox.ac.uk Automatic Differentiation (1)

29

1960s, 1970s, 1980s

Precursors: Kelley, 1960; Bryson, 1961; Pontryagin et al., 1961; Dreyfus, 1962

- Wengert, 1964: forward mode
- Linnainmaa, 1970, 1976: backpropagation
- Dreyfus, 1973: control parameters
- Werbos, 1974: reverse mode
- Speelpenning, 1980: automatic reverse mode
- Werbos, 1982: first NN-specific backprop
- Parker, 1985
- LeCun, 1985
- Rumelhart, Hinton, Williams, 1986: revived backprop
- Griewank, 1989: revived reverse mode

Page 30: gunes@robots.ox.ac.uk Automatic Differentiation (1)


Recommended reading:

Griewank, A., 2012. Who Invented the Reverse Mode of Differentiation? Documenta Mathematica, Extra Volume ISMP, pp.389-400.

Schmidhuber, J., 2015. Who Invented Backpropagation? http://people.idsia.ch/~juergen/who-invented-backpropagation.html

Page 31: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Automatic differentiation

31

Page 32: gunes@robots.ox.ac.uk Automatic Differentiation (1)

32

Automatic differentiation
All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives.

- Called a trace or a Wengert list (Wengert, 1964)
- Alternatively represented as a computational graph showing dependencies

Page 33: gunes@robots.ox.ac.uk Automatic Differentiation (1)

33

Automatic differentiation
All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives.

- Called a trace or a Wengert list (Wengert, 1964)
- Alternatively represented as a computational graph showing dependencies

Page 34: gunes@robots.ox.ac.uk Automatic Differentiation (1)

34

Automatic differentiation
All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives.

- Called a trace or a Wengert list (Wengert, 1964)
- Alternatively represented as a computational graph showing dependencies

f(a, b):
    c = a * b
    d = log(c)
    return d

[Computational graph: a and b feed a * node producing c; c feeds a log node producing d]

Page 35: gunes@robots.ox.ac.uk Automatic Differentiation (1)

35

Automatic differentiation
All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives.

- Called a trace or a Wengert list (Wengert, 1964)
- Alternatively represented as a computational graph showing dependencies

f(a, b):
    c = a * b
    d = log(c)
    return d

1.791 = f(2, 3)

[Computational graph annotated with primal values: a = 2, b = 3, c = 6, d = 1.791]

Page 36: gunes@robots.ox.ac.uk Automatic Differentiation (1)

36

Automatic differentiation
All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives.

- Called a trace or a Wengert list (Wengert, 1964)
- Alternatively represented as a computational graph showing dependencies

f(a, b):
    c = a * b
    d = log(c)
    return d

1.791 = f(2, 3)
[0.5, 0.333] = f’(2, 3)

[Computational graph annotated with primal values (a = 2, b = 3, c = 6, d = 1.791) and derivative values (tangent/adjoint, “gradient”): a = 0.5, b = 0.333, c = 0.166, d = 1]


Page 38: gunes@robots.ox.ac.uk Automatic Differentiation (1)

38

Automatic differentiation

Two main flavors:
- Forward mode: primals and derivatives (tangents) both propagate from inputs to outputs
- Reverse mode (a.k.a. backprop): primals propagate forward, derivatives (adjoints) propagate backward from outputs to inputs

Nested combinations give higher-order derivatives, Hessian–vector products, etc.:
- Forward-on-reverse
- Reverse-on-forward
- ...

Page 39: gunes@robots.ox.ac.uk Automatic Differentiation (1)

39

What happens to control flow?

f(a, b):
    c = a * b
    if c > 0:
        d = log(c)
    else:
        d = sin(c)
    return d

It disappears: branches are taken, loops are unrolled, functions are inlined, etc., until we are left with the linear trace of execution.

Page 40: gunes@robots.ox.ac.uk Automatic Differentiation (1)

40

What happens to control flow?

f(a = 2, b = 3):
    c = a * b = 6
    if c > 0:
        d = log(c) = 1.791
    else:
        d = sin(c)
    return d

[Trace taken: a = 2, b = 3 → c = 6 → d = log(6) = 1.791]

It disappears: branches are taken, loops are unrolled, functions are inlined, etc., until we are left with the linear trace of execution.

Page 41: gunes@robots.ox.ac.uk Automatic Differentiation (1)

41

What happens to control flow?

f(a = 2, b = -1):
    c = a * b = -2
    if c > 0:
        d = log(c)
    else:
        d = sin(c) = -0.909
    return d

[Trace taken: a = 2, b = -1 → c = -2 → d = sin(-2) = -0.909]

It disappears: branches are taken, loops are unrolled, functions are inlined, etc., until we are left with the linear trace of execution.

Page 42: gunes@robots.ox.ac.uk Automatic Differentiation (1)

42

What happens to control flow?

f(a = 2, b = -1):
    c = a * b = -2
    if c > 0:
        d = log(c)
    else:
        d = sin(c) = -0.909
    return d

It disappears: branches are taken, loops are unrolled, functions are inlined, etc., until we are left with the linear trace of execution.

The resulting trace is a directed acyclic graph (DAG), and its statements form a topological ordering of that graph.

Page 43: gunes@robots.ox.ac.uk Automatic Differentiation (1)

43

Forward mode

f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    y1 = sin(v1)
    y2 = v1 + v2
    return (y1, y2)

[Computational graph: x1, x2 feed * producing v1; x2 feeds log producing v2; v1 feeds sin producing y1; v1, v2 feed + producing y2]

Primals: propagate from independent to dependent variables.
Derivatives (tangents): propagate from independent to dependent variables.

Pages 44–56: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Forward mode

f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    y1 = sin(v1)
    y2 = v1 + v2
    return (y1, y2)

Evaluating f(2, 3) with the input tangents seeded as 1 for x1 and 0 for x2, the trace computes, step by step:

- x1 = 2, tangent of x1 = 1
- x2 = 3, tangent of x2 = 0
- v1 = x1 * x2 = 6, tangent of v1 = 1 * 3 + 2 * 0 = 3
- v2 = log(x2) = 1.098, tangent of v2 = 0 / 3 = 0
- y1 = sin(v1) = -0.279, tangent of y1 = cos(6) * 3 = 2.880
- y2 = v1 + v2 = 7.098, tangent of y2 = 3 + 0 = 3

Primals: propagate from independent to dependent variables.
Derivatives (tangents): propagate from independent to dependent variables.

Page 57: gunes@robots.ox.ac.uk Automatic Differentiation (1)

57

Forward mode

[Computational graph annotated with the primal and tangent values computed above]

In general, forward mode evaluates a Jacobian–vector product.

So we evaluated: the first column of the Jacobian of f at (2, 3), that is (∂y1/∂x1, ∂y2/∂x1) = (2.880, 3), by seeding the input tangents with (1, 0).

Primals: propagate from independent to dependent variables.
Derivatives (tangents): propagate from independent to dependent variables.

Page 58: gunes@robots.ox.ac.uk Automatic Differentiation (1)

58

Forward mode

In general, forward mode evaluates a Jacobian–vector product.

So we evaluated one column of the Jacobian by seeding the tangents with a standard basis vector; the seed can be any vector, not only unit vectors.

Primals: propagate from independent to dependent variables.
Derivatives (tangents): propagate from independent to dependent variables.

Page 59: gunes@robots.ox.ac.uk Automatic Differentiation (1)

59

Forward mode

In general, forward mode evaluates a Jacobian–vector product; the seed can be any vector, not only a unit vector.

For a general seed vector, the result is a directional derivative of f in the direction of that vector.

Primals: propagate from independent to dependent variables.
Derivatives (tangents): propagate from independent to dependent variables.
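A minimal sketch of how such tangents can be carried alongside primals (my own illustration, not code from the slides), using a dual-number class for the operations used in this example:

import math

class Dual:
    """Carry a primal value and its tangent together (forward-mode AD sketch)."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)
    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)

def log(x): return Dual(math.log(x.val), x.tan / x.val)
def sin(x): return Dual(math.sin(x.val), math.cos(x.val) * x.tan)

def f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    return sin(v1), v1 + v2

# Seed the tangent of x1 with 1 to get the first Jacobian column
y1, y2 = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(y1.val, y2.val)   # -0.279..., 7.098...
print(y1.tan, y2.tan)   # 2.880..., 3.0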

Page 60: gunes@robots.ox.ac.uk Automatic Differentiation (1)

60

Reverse mode

f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    y1 = sin(v1)
    y2 = v1 + v2
    return (y1, y2)

[Computational graph: x1, x2 feed * producing v1; x2 feeds log producing v2; v1 feeds sin producing y1; v1, v2 feed + producing y2]

Primals: propagate from independent to dependent variables.
Derivatives (adjoints): propagate from dependent back to independent variables.

Pages 61–77: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Reverse mode

f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    y1 = sin(v1)
    y2 = v1 + v2
    return (y1, y2)

Forward (primal) pass, evaluating f(2, 3):
- x1 = 2
- x2 = 3
- v1 = x1 * x2 = 6
- v2 = log(x2) = 1.098
- y1 = sin(v1) = -0.279
- y2 = v1 + v2 = 7.098

Reverse (adjoint) pass, seeding the output adjoints with 1 for y1 and 0 for y2:
- adjoint of y1 = 1
- adjoint of y2 = 0
- adjoint of v1 = 1 * cos(6) + 0 = 0.960
- adjoint of v2 = 0
- adjoint of x1 = 0.960 * x2 = 2.880
- adjoint of x2 = 0.960 * x1 + 0 / x2 = 1.920

Primals: propagate from independent to dependent variables.
Derivatives (adjoints): propagate from dependent back to independent variables.

Page 78: gunes@robots.ox.ac.uk Automatic Differentiation (1)

78

Reverse mode

[Computational graph annotated with the primal and adjoint values computed above]

In general, reverse mode evaluates a transposed Jacobian–vector product.

So we evaluated: the first row of the Jacobian of f at (2, 3), that is (∂y1/∂x1, ∂y1/∂x2) = (2.880, 1.920), by seeding the output adjoints with (1, 0).

Primals: propagate from independent to dependent variables.
Derivatives (adjoints): propagate from dependent back to independent variables.

Page 79: gunes@robots.ox.ac.uk Automatic Differentiation (1)

79

Reverse mode

In general, reverse mode evaluates a transposed Jacobian–vector product.

So we evaluated one row of the Jacobian; for a scalar-valued function (a single output), this row is the gradient.

Primals: propagate from independent to dependent variables.
Derivatives (adjoints): propagate from dependent back to independent variables.
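A minimal sketch of a reverse-mode tape for the operations used here (my own illustration, not the slides' implementation; a real system accumulates adjoints in reverse topological order rather than path by path):

import math

class Var:
    """Minimal reverse-mode node: value, adjoint, and local derivatives to parents."""
    def __init__(self, val):
        self.val, self.adj, self.parents = val, 0.0, []
    def __mul__(self, other):
        out = Var(self.val * other.val)
        out.parents = [(self, other.val), (other, self.val)]
        return out
    def __add__(self, other):
        out = Var(self.val + other.val)
        out.parents = [(self, 1.0), (other, 1.0)]
        return out

def log(x):
    out = Var(math.log(x.val)); out.parents = [(x, 1.0 / x.val)]; return out

def sin(x):
    out = Var(math.sin(x.val)); out.parents = [(x, math.cos(x.val))]; return out

def backward(y):
    """Propagate adjoint contributions from the output back through the graph."""
    stack = [(y, 1.0)]                 # (node, contribution to its adjoint)
    while stack:
        node, contrib = stack.pop()
        node.adj += contrib
        for parent, local in node.parents:
            stack.append((parent, contrib * local))

x1, x2 = Var(2.0), Var(3.0)
v1 = x1 * x2
y1 = sin(v1)
backward(y1)
print(x1.adj, x2.adj)   # 2.880..., 1.920...  (the gradient of y1)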

Page 80: gunes@robots.ox.ac.uk Automatic Differentiation (1)

80

Forward vs reverse summary

In the extreme case of a single input and many outputs (f : R → R^m), use forward mode to evaluate all the derivatives in one pass.

In the extreme case of many inputs and a single output (f : R^n → R), use reverse mode to evaluate the full gradient in one pass.

Page 81: gunes@robots.ox.ac.uk Automatic Differentiation (1)

81

Forward vs reverse summary

In the extreme case of a single input and many outputs, use forward mode; in the extreme case of many inputs and a single output, use reverse mode.

In general, for f : R^n → R^m the full Jacobian can be evaluated in
- n passes with forward mode
- m passes with reverse mode

Reverse mode performs better when the number of outputs is much smaller than the number of inputs, as with a scalar loss in machine learning.

Page 82: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Backprop through normal PDF

82

Page 83: gunes@robots.ox.ac.uk Automatic Differentiation (1)

83

Backprop through normal PDF

[Computational graph of the normal probability density of x with mean µ and standard deviation σ, built from elementary operations: subtraction, squaring, multiplication, division, exp, sqrt, reciprocal, and the constants 2, π, 0.5; example values µ = 0, σ = 1]
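A small sketch of what backprop through this graph yields (my own illustration; the closed-form adjoints below are the standard results one obtains by propagating adjoints from the density back to x, µ, and σ):

import math

def normal_pdf(x, mu, sigma):
    # Built from the elementary operations in the slide's graph
    z = (x - mu) ** 2 / sigma ** 2
    return math.exp(-0.5 * z) / (sigma * math.sqrt(2 * math.pi))

def normal_pdf_grad(x, mu, sigma):
    # Adjoints of the inputs, i.e. d pdf / d(x, mu, sigma)
    p = normal_pdf(x, mu, sigma)
    dx = p * (mu - x) / sigma ** 2
    dmu = p * (x - mu) / sigma ** 2
    dsigma = p * ((x - mu) ** 2 / sigma ** 3 - 1.0 / sigma)
    return dx, dmu, dsigma

print(normal_pdf_grad(0.5, 0.0, 1.0))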

Page 84: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Summary

84

Page 85: gunes@robots.ox.ac.uk Automatic Differentiation (1)

85

Summary

This lecture:
- Derivatives in machine learning
- Review of essential concepts (what is a derivative, etc.)
- How do we compute derivatives
- Automatic differentiation

Next lecture:
- Current landscape of tools
- Implementation techniques
- Advanced concepts (higher-order API, checkpointing, etc.)

Page 86: gunes@robots.ox.ac.uk Automatic Differentiation (1)

86

ReferencesBaydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., 2017. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research (JMLR), 18(153), pp.1-153.

Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “Tricks from Deep Learning.” In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.

Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “DiffSharp: An AD Library for .NET Languages.” In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.

Baydin, Atılım Güneş, Robert Cornish, David Martínez Rubio, Mark Schmidt, and Frank Wood. 2018. “Online Learning Rate Adaptation with Hypergradient Descent.” In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada, April 30 – May 3, 2018.

Griewank, A. and Walther, A., 2008. Evaluating derivatives: principles and techniques of algorithmic differentiation (Vol. 105). SIAM.

Nocedal, J. and Wright, S.J., 1999. Numerical Optimization. Springer.

Page 87: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Extra slides

87

Page 88: gunes@robots.ox.ac.uk Automatic Differentiation (1)

88

Forward mode
Primals: propagate from independent to dependent variables.
Derivatives (tangents): propagate from independent to dependent variables.

Pages 89–99: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Forward mode

f(a, b):
    c = a * b
    d = log(c)
    return d

Evaluating f(2, 3) with the input tangents seeded as 1 for a and 0 for b:

- a = 2, tangent of a = 1
- b = 3, tangent of b = 0
- c = a * b = 6, tangent of c = 1 * 3 + 2 * 0 = 3
- d = log(c) = 1.791, tangent of d = 3 / 6 = 0.5

In general, forward mode evaluates a Jacobian–vector product.
We evaluated the partial derivative ∂d/∂a = 0.5 with the tangent seed (1, 0).

Primals: propagate from independent to dependent variables.
Derivatives (tangents): propagate from independent to dependent variables.

Pages 100–114: gunes@robots.ox.ac.uk Automatic Differentiation (1)

Reverse mode

f(a, b):
    c = a * b
    d = log(c)
    return d

Forward (primal) pass, evaluating f(2, 3):
- a = 2
- b = 3
- c = a * b = 6
- d = log(c) = 1.791

Reverse (adjoint) pass, seeding the output adjoint of d with 1:
- adjoint of d = 1
- adjoint of c = 1 / c = 0.166
- adjoint of a = 0.166 * b = 0.5
- adjoint of b = 0.166 * a = 0.333

In general, reverse mode evaluates a transposed Jacobian–vector product.
We evaluated the gradient ∇f(2, 3) = (0.5, 0.333) with the output adjoint seeded as 1.

Primals: propagate from independent to dependent variables.
Derivatives (adjoints): propagate from dependent back to independent variables.