8/12/2019 Gockenbach Diff Calculus
1/55
Optimization and Engineering, 2, 75-129, 2001
© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.
A Primer on Differentiation
MARK S. GOCKENBACH
Department of Mathematical Sciences, Michigan Technological University, 1400 Townsend Drive, Houghton, MI 49931-1295, USA
Received February 4, 2000; Revised April 4, 2001
Abstract. The central idea of differential calculus is that the derivative of a function defines the best local linear
approximation to the function near a given point. This basic idea, together with some representation theorems
from linear algebra, unifies the various derivatives (gradients, Jacobians, Hessians, and so forth) encountered
in engineering and optimization. The basic differentiation rules presented in calculus classes, notably the product
and chain rules, allow the computation of the gradients and Hessians needed by optimization algorithms, even
when the underlying operators are quite complex. Examples include the solution operators of time-dependent and
steady-state partial differential equations. Alternatives to the hand-coding of derivatives are finite differences and
automatic differentiation, both of which save programming time at the possible cost of run-time efficiency.
Keywords: differentiation, solution operators, finite differences, automatic differentiation
1. Introduction
Throughout their study of calculus, students are introduced to derivatives of various types.
These include:
- The (ordinary) derivative $f'(x)$ of a real-valued function $f$ of a single variable. The number $f'(x_0)$ is the slope of the line tangent to the graph of $y = f(x)$ at $x = x_0$. It is also interpreted as the instantaneous rate of change of $y = f(x)$ at $x = x_0$.
- The partial derivatives
\[
\frac{\partial g}{\partial x_1}(x_1, x_2, \ldots, x_n),\ \frac{\partial g}{\partial x_2}(x_1, x_2, \ldots, x_n),\ \ldots,\ \frac{\partial g}{\partial x_n}(x_1, x_2, \ldots, x_n)
\]
of a real-valued function of several variables. These numbers are interpreted as the instantaneous rates of change of $y = g(x_1, x_2, \ldots, x_n)$ as one variable is changed and the others held fixed.
- The gradient vector
\[
\nabla g(x_1, x_2, \ldots, x_n) =
\begin{pmatrix}
\dfrac{\partial g}{\partial x_1}(x_1, x_2, \ldots, x_n) \\
\dfrac{\partial g}{\partial x_2}(x_1, x_2, \ldots, x_n) \\
\vdots \\
\dfrac{\partial g}{\partial x_n}(x_1, x_2, \ldots, x_n)
\end{pmatrix}.
\]
least in the United States) without encountering a course that makes this principle explicit.1
Moreover, the elementary rules of differentiation as learned in calculus courses (the product rule, chain rule, and so forth) can leave a student ill-prepared to compute derivatives
of the complicated functions and operators that arise in advanced engineering and applied
mathematics research.
The purpose of this paper is to explain the concept of derivative from the point of view
of local linear approximation, to show how the various types of derivatives mentioned
above fit into the concept, and to work through several important and nontrivial exam-
ples. In the following section, I discuss the basic definitions and notation needed. The
setting for these definitions is a normed vector space, that is, a vector space with a norm. For
this reason, linear algebra is important. In Section 3, I present the elementary represen-
tation theorems of linear algebra, and show how they lead to the various scalars, vectors,
and matrices that arise in calculus courses in the context of differentiation. This is followed by a brief discussion of the rules of differentiation (Section 4), simple representations for operators on infinite-dimensional spaces (Section 5), and second derivatives (Section 6). In addition to several examples included in the sections described above, I discuss two more involved examples: the adjoint state method for handling finite difference solution operators (Section 7), and a direct computation of the derivative (and its adjoint) of a finite element solution operator (Section 8). Finally, in Section 9, I discuss two alternatives to programming derivatives by hand: finite differences and automatic
differentiation.
Throughout this paper, the emphasis is on the structure of maps and their derivatives, not
on the analytic details. Therefore, most technical proofs are omitted.
2. Definitions and notation
2.1. Normed vector spaces; inner products
The various derivatives described in the introduction can all be discussed in the context of
a function (operator, map) $f$ mapping one Euclidean space into another. I will write $\mathbf{R}^n$ for Euclidean $n$-space, and denote a vector $x \in \mathbf{R}^n$ as $x = (x_1, x_2, \ldots, x_n)$ or
\[
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.
\]
Note that $\mathbf{R}^1$ is (isomorphic to) $\mathbf{R}$, the set of real numbers.
The following examples were discussed in the introduction:
- $f : \mathbf{R} \to \mathbf{R}$, a real-valued function of a single variable;
- $f : \mathbf{R}^n \to \mathbf{R}$, a real-valued function of several variables;
- $f : \mathbf{R}^n \to \mathbf{R}^m$, a vector-valued function of several variables;
- $f : \mathbf{R} \to \mathbf{R}^n$, a vector-valued function of a single variable.
From now on I adopt vector notation and write, for example, $g(x)$ instead of $g(x_1, x_2, \ldots, x_n)$. Also, I distinguish vectors from scalars only by context.
Now, Euclidean $n$-space is equipped with an inner product, namely, the dot product:
\[
(x, y) = x \cdot y = \sum_{i=1}^n x_i y_i, \quad x, y \in \mathbf{R}^n.
\]
A more general setting is an inner product space (which need not be Euclidean or finite-dimensional). An inner product space is just a vector space $V$ with an inner product $(\cdot, \cdot)_V$, which is a mapping from $V \times V$ into $\mathbf{R}$ satisfying the following properties:
- $(\alpha u + \beta v, w)_V = \alpha (u, w)_V + \beta (v, w)_V$ for all $u, v, w \in V$, $\alpha, \beta \in \mathbf{R}$;
- $(u, v)_V = (v, u)_V$ for all $u, v \in V$;
- $(v, v)_V \ge 0$ for all $v \in V$, and $(v, v)_V = 0$ if and only if $v = 0$.
An inner product on $V$ induces a norm $\|\cdot\|_V$ on $V$:
\[
\|v\|_V = \sqrt{(v, v)_V} \quad \text{for all } v \in V.
\]
It is sometimes necessary to work with norms that are not defined by inner products. A general norm $\|\cdot\|_U$ on a vector space $U$ is a mapping from $U$ into $\mathbf{R}$ satisfying
- $\|u\|_U \ge 0$ for all $u \in U$, and $\|u\|_U = 0$ if and only if $u = 0$;
- $\|\alpha u\|_U = |\alpha| \|u\|_U$ for all $u \in U$, $\alpha \in \mathbf{R}$;
- $\|u + v\|_U \le \|u\|_U + \|v\|_U$ for all $u, v \in U$ (the triangle inequality).
It can be shown that if $(\cdot, \cdot)_V$ is an inner product on a vector space $V$, then $\|v\|_V = \sqrt{(v, v)_V}$ defines a norm on $V$.
The reason I discuss vector spaces more general than Euclidean space is that many practical problems cannot be described using finite-dimensional spaces. For example, suppose $\Omega$ is an open subset of $\mathbf{R}^n$ and $f : \Omega \to \mathbf{R}^n$. Then under appropriate conditions on $f$, for any closed and bounded subset $W$ of $\Omega$, there exists $\epsilon > 0$ such that, for each $x_0 \in W$, the Initial Value Problem (IVP)
\[
\begin{aligned}
x' &= f(x), \\
x(0) &= x_0
\end{aligned}
\tag{1}
\]
has a unique solution $x : [-\epsilon, \epsilon] \to \mathbf{R}^n$. Thus the IVP (1) defines an operator $S : W \to (C[-\epsilon, \epsilon])^n$, where $(C[-\epsilon, \epsilon])^n$ is the space of all continuous functions $u : [-\epsilon, \epsilon] \to \mathbf{R}^n$. This space is infinite-dimensional and therefore cannot be identified with any Euclidean space. I pursue this example in Section 5.5, where I compute the derivative of $S$.
2.2. Definition of the derivative
Now suppose $X$ and $Y$ are normed linear spaces, suppose $U$ is an open subset of $X$, and assume that $f : U \to Y$. As I explained in Section 2.1, the types of functions encountered in calculus all fit under this description, as do many other important examples.
First recall the following definition.
Definition 2.1. Suppose $X$ and $Y$ are vector spaces, and $L : X \to Y$. The operator $L$ is linear if
\[
L(x + z) = Lx + Lz \quad \text{for all } x, z \in X
\]
and
\[
L(\alpha x) = \alpha Lx \quad \text{for all } x \in X,\ \alpha \in \mathbf{R}
\]
(or, more concisely, $L(\alpha x + \beta z) = \alpha Lx + \beta Lz$ for all $x, z \in X$ and $\alpha, \beta \in \mathbf{R}$).
Next is the fundamental definition in this paper.
Definition 2.2. Let $x \in U$. Suppose there is a continuous linear operator $L : X \to Y$ such that
\[
\lim_{\Delta x \to 0} \frac{\| f(x + \Delta x) - f(x) - L \Delta x \|_Y}{\|\Delta x\|_X} = 0.
\]
Then $f$ is said to be differentiable at $x$, and $L$ is called the derivative of $f$ at $x$, denoted $L = Df(x)$.
According to this definition, if $f$ is differentiable at $x$, then $Df(x)$ defines a linear approximation to $f$ near $x$; indeed, if
\[
E(x, \Delta x) = f(x + \Delta x) - f(x) - Df(x)\Delta x,
\]
then
\[
f(x + \Delta x) = f(x) + Df(x)\Delta x + E(x, \Delta x)
\]
and
\[
\frac{\|E(x, \Delta x)\|_Y}{\|\Delta x\|_X} \to 0 \quad \text{as } \Delta x \to 0.
\]
This last condition is abbreviated by
\[
E(x, \Delta x) = o(\|\Delta x\|_X) \quad \text{as } \Delta x \to 0
\]
(read "$E(x, \Delta x)$ is little-oh of $\|\Delta x\|_X$"), which indicates that the error $E(x, \Delta x)$ is small compared to $\|\Delta x\|_X$ when $\|\Delta x\|_X$ is small. It is easy to show that, if $f$ is differentiable at $x$, then no other linear map $K : X \to Y$ defines a better local linear approximation to $f$ near $x$; that is, if $K \ne Df(x)$, then the error in the approximation
\[
f(x + \Delta x) \approx f(x) + K \Delta x
\]
is larger than the error in the approximation
\[
f(x + \Delta x) \approx f(x) + Df(x)\Delta x,
\]
in the sense that
\[
\lim_{\Delta x \to 0} \frac{\| f(x + \Delta x) - f(x) - K \Delta x \|_Y}{\|\Delta x\|_X} \ne 0.
\]
Now, in addition to the basic definition of derivative just given, there is really just one key idea in this paper: the linear map $Df(x)$ has different representations, depending on the particular $X$ and $Y$ involved. There is an underlying question here, which properly belongs to linear algebra (or functional analysis when the spaces are infinite-dimensional): Given two normed vector spaces $X$ and $Y$, find a convenient representation for a continuous linear map $L : X \to Y$. I will address this question in Sections 3 and 5 below; here I preview those sections by answering the question for $X = Y = \mathbf{R}$.
Now suppose that $L : \mathbf{R} \to \mathbf{R}$ is linear. If $a = L(1)$, which is a real number, then, since $x = x \cdot 1$ for all $x \in \mathbf{R}$,
\[
Lx = x L(1) = a x.
\]
Thus, if $L : \mathbf{R} \to \mathbf{R}$ is linear, there is a real number $a \in \mathbf{R}$ such that
\[
Lx = a x \quad \text{for all } x \in \mathbf{R}.
\]
That is, a linear map from $\mathbf{R}$ to $\mathbf{R}$ is represented by a real number. Therefore, if $f : \mathbf{R} \to \mathbf{R}$ has a derivative at $x$, it is customary in elementary calculus courses to define $f'(x)$ to be the number
\[
f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x},
\]
which is equivalent to
\[
\lim_{\Delta x \to 0} \frac{| f(x + \Delta x) - f(x) - f'(x)\Delta x |}{|\Delta x|} = 0.
\]
It now becomes clear that, under this definition, the number $f'(x)$ is just the representer of the linear map $Df(x)$.
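This one-dimensional picture is easy to test numerically. The following sketch (my illustration, not from the paper; the choice $f = \sin$ and the step sizes are arbitrary) shows the difference quotient converging to the representer $f'(x) = \cos x$:

```python
import math

# The number f'(x) is the representer of the linear map Df(x): R -> R.
# For f(x) = sin(x), the representer is cos(x); the difference quotient
# should approach it as dx -> 0.
f = math.sin
x = 1.0
exact = math.cos(x)  # known representer of Df(x)

quotients = [(f(x + dx) - f(x)) / dx for dx in (1e-1, 1e-3, 1e-5)]
errors = [abs(q - exact) for q in quotients]
# Forward differences are O(dx), so the error shrinks roughly linearly with dx.
```

The shrinking errors illustrate exactly the little-oh condition above: the residual of the linear approximation vanishes faster than $\Delta x$ itself.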
It may seem overly pedantic to distinguish between the linear map and its representer.
However, when the vector spaces $X$ and $Y$ are not both one-dimensional, I believe it is
essential to make the distinction. Before I go on to more examples in Section 3, where this
should become clearer, I need to define continuity of the derivative, and also the concept of
partial derivatives.
2.3. Continuity of the derivative
I assume again that $f : U \to Y$, where $X$ and $Y$ are normed vector spaces and $U \subset X$ is open. If $f$ is differentiable at each $x \in U$, then $Df$ becomes an operator; for each $x \in U$, $Df(x)$ belongs to $L(X, Y)$, the space of all continuous linear maps from $X$ into $Y$:
\[
Df : U \to L(X, Y).
\]
When $f$ is differentiable at every $x \in U$, $f$ is simply said to be differentiable. Now, $L(X, Y)$ is a vector space, since there is a natural way to add operators and multiply them by scalars, and continuity and linearity are obviously preserved by these operations. Also, $L(X, Y)$ has a natural norm:
\[
\|L\|_{L(X,Y)} = \sup \left\{ \frac{\|Lx\|_Y}{\|x\|_X} : x \in X,\ x \ne 0 \right\}.
\]
Note that this definition of norm implies
\[
\|Lx\|_Y \le \|L\|_{L(X,Y)} \|x\|_X \quad \text{for all } x \in X.
\]
The norm of an operator thus measures the largest factor by which the operator stretches or magnifies any vector in its domain. It can be shown that a linear operator $L : X \to Y$ is continuous if and only if $\|L\|_{L(X,Y)} < \infty$.
2.4. Partial derivatives

Suppose now that $f : X \times Y \to Z$, where $X$, $Y$, and $Z$ are normed vector spaces. (Of course, $f$ might be defined on a subset of $X \times Y$, but the exposition is simpler if I assume that the domain of $f$ is all of $X \times Y$.) Since $X \times Y$ is a vector space, an operator like $f$ is just another example that fits into the discussion above: $f$ is differentiable at $(x, y)$ if there is a continuous linear operator $L : X \times Y \to Z$ such that
\[
\lim_{(\Delta x, \Delta y) \to (0,0)} \frac{\| f(x + \Delta x, y + \Delta y) - f(x, y) - L(\Delta x, \Delta y) \|_Z}{\|(\Delta x, \Delta y)\|_{X \times Y}} = 0.
\]
(For the norm on $X \times Y$, the obvious choices are $\|(x, y)\| = \sqrt{\|x\|_X^2 + \|y\|_Y^2}$, $\|(x, y)\| = \max\{\|x\|_X, \|y\|_Y\}$, and $\|(x, y)\| = \|x\|_X + \|y\|_Y$. I will only use the property that $\|(x, 0)\|_{X \times Y} = \|x\|_X$ and $\|(0, y)\|_{X \times Y} = \|y\|_Y$, which holds for any of the above.) On the other hand, given any $y \in Y$,
\[
g(x) = f(x, y) \quad \text{for all } x \in X
\]
defines an operator $g : X \to Z$. Similarly, for any $x \in X$,
\[
h(y) = f(x, y) \quad \text{for all } y \in Y
\]
defines an operator $h : Y \to Z$. The question now arises: What is the relationship between $Df$, $Dg$, and $Dh$?
The answer to this question is very simple when the structure of operators in $L(X \times Y, Z)$ is understood.

Theorem 2.3. Let $X$, $Y$, and $Z$ be normed linear spaces. Then $L \in L(X \times Y, Z)$ if and only if there exist $L_1 \in L(X, Z)$ and $L_2 \in L(Y, Z)$ such that
\[
L(x, y) = L_1 x + L_2 y \quad \text{for all } x \in X,\ y \in Y.
\]
Proof: Suppose $L \in L(X \times Y, Z)$. Define $L_1 \in L(X, Z)$ by
\[
L_1 x = L(x, 0) \quad \text{for all } x \in X,
\]
and $L_2 \in L(Y, Z)$ by
\[
L_2 y = L(0, y) \quad \text{for all } y \in Y.
\]
It is easy to prove that $L_1$ and $L_2$ are indeed linear and bounded. Moreover, for any $(x, y) \in X \times Y$,
\[
L(x, y) = L((x, 0) + (0, y)) = L(x, 0) + L(0, y) = L_1 x + L_2 y,
\]
as desired.
On the other hand, it is easy to verify that if $L_1 \in L(X, Z)$, $L_2 \in L(Y, Z)$, and $L : X \times Y \to Z$ is defined by
\[
L(x, y) = L_1 x + L_2 y \quad \text{for all } (x, y) \in X \times Y,
\]
then $L \in L(X \times Y, Z)$.
It is now easy to prove the following theorem.
Theorem 2.4. Let $X$, $Y$, and $Z$ be normed linear spaces, suppose $f : X \times Y \to Z$, and let $(x_0, y_0) \in X \times Y$. Define $g : X \to Z$ by $g(x) = f(x, y_0)$ and $h : Y \to Z$ by $h(y) = f(x_0, y)$. Suppose $f$ is differentiable at $(x_0, y_0)$. Then $g$ is differentiable at $x_0$, $h$ is differentiable at $y_0$, and
\[
Df(x_0, y_0)(\Delta x, \Delta y) = Dg(x_0)\Delta x + Dh(y_0)\Delta y.
\]
The operators $Dg(x_0)$ and $Dh(y_0)$ are called the partial derivatives of $f$, and are denoted $D_x f(x_0, y_0)$ and $D_y f(x_0, y_0)$, respectively. Thus
\[
Df(x_0, y_0)(\Delta x, \Delta y) = D_x f(x_0, y_0)\Delta x + D_y f(x_0, y_0)\Delta y.
\]
Proof: By the preceding theorem, there exist $L_1 \in L(X, Z)$ and $L_2 \in L(Y, Z)$ such that
\[
Df(x_0, y_0)(\Delta x, \Delta y) = L_1 \Delta x + L_2 \Delta y \quad \text{for all } \Delta x \in X,\ \Delta y \in Y.
\]
In particular,
\[
Df(x_0, y_0)(\Delta x, 0) = L_1 \Delta x,
\]
so
\[
\lim_{\Delta x \to 0} \frac{\| f(x_0 + \Delta x, y_0) - f(x_0, y_0) - L_1 \Delta x \|_Z}{\|(\Delta x, 0)\|_{X \times Y}} = 0.
\]
This is equivalent to
\[
\lim_{\Delta x \to 0} \frac{\| g(x_0 + \Delta x) - g(x_0) - L_1 \Delta x \|_Z}{\|\Delta x\|_X} = 0,
\]
so $L_1 = Dg(x_0)$. Similarly, $L_2 = Dh(y_0)$, and the proof is complete.
Note that, for example, $D_x f(x, y) \in L(X, Z)$, that is,
\[
D_x f : X \times Y \to L(X, Z).
\]
Similarly,
\[
D_y f : X \times Y \to L(Y, Z).
\]
The following theorem is only slightly harder to prove.
Theorem 2.5. Suppose $X$, $Y$, and $Z$ are normed linear spaces, $f : X \times Y \to Z$, and the partial derivatives of $f$, $D_x f(x, y)$ and $D_y f(x, y)$, exist and are continuous on an open set $U \subset X \times Y$. Then $f$ is $C^1$ on $U$, and
\[
Df(x, y)(\Delta x, \Delta y) = D_x f(x, y)\Delta x + D_y f(x, y)\Delta y.
\]
Note that the continuity of $D_x f$ and $D_y f$ is necessary; it is not the case that if $D_x f(x_0, y_0)$ and $D_y f(x_0, y_0)$ exist, then $Df(x_0, y_0)$ must exist.
These results obviously generalize to an operator of the form $f : X_1 \times X_2 \times \cdots \times X_n \to Z$; the basic equation is
\[
Df(x)\Delta x = D_{x_1} f(x)\Delta x_1 + D_{x_2} f(x)\Delta x_2 + \cdots + D_{x_n} f(x)\Delta x_n,
\]
where $x, \Delta x \in X_1 \times X_2 \times \cdots \times X_n$.
3. Representation of linear operators on Euclidean spaces
3.1. The basic theorem
I will now give the fundamental representation theorem for linear operators on Euclidean spaces. Specializing this result to the various contexts described in the introduction ($\mathbf{R} \to \mathbf{R}$, $\mathbf{R}^n \to \mathbf{R}$, $\mathbf{R}^n \to \mathbf{R}^m$, and $\mathbf{R} \to \mathbf{R}^n$) will account for the various types of derivatives described there.
Theorem 3.1. Let $L : \mathbf{R}^n \to \mathbf{R}^m$. Then $L$ is linear if and only if there is an $m \times n$ matrix $A$ such that
\[
Lx = Ax \quad \text{for all } x \in \mathbf{R}^n.
\]
Proof: Let $\{e_1, e_2, \ldots, e_n\}$ be the standard basis for $\mathbf{R}^n$ (so the $i$th component of $e_i$ is one and all other components are zero). Let $c_1 = Le_1$, $c_2 = Le_2$, $\ldots$, $c_n = Le_n$, and define $A$ to be the $m \times n$ matrix whose columns are the vectors $c_1, c_2, \ldots, c_n$. That is, $A_{ij}$ is $(c_j)_i$, the $i$th component of the vector $c_j$. Then, since
\[
x = x_1 e_1 + x_2 e_2 + \cdots + x_n e_n,
\]
the linearity of $L$ yields
\[
Lx = x_1 Le_1 + x_2 Le_2 + \cdots + x_n Le_n = x_1 c_1 + x_2 c_2 + \cdots + x_n c_n.
\]
However, by the definition of matrix multiplication,
\[
Ax = x_1 c_1 + x_2 c_2 + \cdots + x_n c_n
\]
also holds. Thus
\[
Lx = Ax \quad \text{for all } x \in \mathbf{R}^n.
\]
Thus every linear operator $L : \mathbf{R}^n \to \mathbf{R}^m$ can be represented by an $m \times n$ matrix. If $f : \mathbf{R}^n \to \mathbf{R}^m$ is differentiable, then $Df(x) : \mathbf{R}^n \to \mathbf{R}^m$ is linear. Therefore, there is an $m \times n$ matrix $J$ representing $Df(x)$. This matrix $J$ turns out to be the Jacobian matrix mentioned in the introduction. To see this, it is convenient to first consider certain special cases.
3.2. Representation of derivatives in special cases
3.2.1. $m = n = 1$. In the special case $m = n = 1$, so that $f$ is a real-valued function of a real variable, $Df(x)$ is represented by a single number $f'(x)$, as was already shown.
3.2.2. $m = 1$, $n > 1$. In the case $m = 1$, $n > 1$, so that $f$ is a real-valued function of several variables, the result of Section 2.4 applies:
\[
Df(x)\Delta x = D_{x_1} f(x)\Delta x_1 + D_{x_2} f(x)\Delta x_2 + \cdots + D_{x_n} f(x)\Delta x_n.
\]
Here $\Delta x_1, \Delta x_2, \ldots, \Delta x_n$ are the components of the vector $\Delta x \in \mathbf{R}^n$. Moreover, regarded as a function of $x_i$ with the other components of $x$ held fixed, $f$ defines a real-valued function of a real variable. Thus $D_{x_i} f(x)$ can be represented by a single number, which is usually denoted
\[
\frac{\partial f}{\partial x_i}(x).
\]
Thus
\[
Df(x)\Delta x = \frac{\partial f}{\partial x_1}(x)\Delta x_1 + \frac{\partial f}{\partial x_2}(x)\Delta x_2 + \cdots + \frac{\partial f}{\partial x_n}(x)\Delta x_n, \tag{2}
\]
which can be recognized as a matrix-vector product if the numbers
\[
\frac{\partial f}{\partial x_1}(x),\ \frac{\partial f}{\partial x_2}(x),\ \ldots,\ \frac{\partial f}{\partial x_n}(x)
\]
are gathered in a row, that is, a $1 \times n$ matrix:
\[
\left( \frac{\partial f}{\partial x_1}(x),\ \frac{\partial f}{\partial x_2}(x),\ \ldots,\ \frac{\partial f}{\partial x_n}(x) \right).
\]
This is the representer of $Df(x)$ suggested by Theorem 3.1. There is a slightly different representer of $Df(x)$ that is to be preferred because it generalizes to the infinite-dimensional case: Eq. (2) is recognized as the inner product of $\Delta x$ with the vector
\[
\nabla f(x) =
\begin{pmatrix}
\dfrac{\partial f}{\partial x_1}(x) \\
\dfrac{\partial f}{\partial x_2}(x) \\
\vdots \\
\dfrac{\partial f}{\partial x_n}(x)
\end{pmatrix},
\]
which is called the gradient of $f$ at $x$. Thus
\[
Df(x)\Delta x = (\nabla f(x), \Delta x)_{\mathbf{R}^n} \quad \text{for all } \Delta x \in \mathbf{R}^n.
\]
The gradient $\nabla f(x)$ is the usual representer of $Df(x)$.
3.2.3. $m > 1$, $n = 1$. In the case $m > 1$, $n = 1$, so that $f$ is a vector-valued function of a real variable (often called a curve, since its image is a curve in $m$-space), $Df(t) : \mathbf{R} \to \mathbf{R}^m$ is represented by an $m \times 1$ matrix. Now, $f$ can be written
\[
f(t) =
\begin{pmatrix}
f_1(t) \\ f_2(t) \\ \vdots \\ f_m(t)
\end{pmatrix},
\]
where each $f_i$ is a real-valued function of a real variable. By considering the definition of the derivative in this case, which implies that
\[
\lim_{\Delta t \to 0} \frac{| f_i(t + \Delta t) - f_i(t) - (Df(t)\Delta t)_i |}{|\Delta t|} = 0,
\]
it is easy to see that $(Df(t)\Delta t)_i = f_i'(t)\Delta t$. Therefore the representer of $Df(t)$ is
\[
\begin{pmatrix}
f_1'(t) \\ f_2'(t) \\ \vdots \\ f_m'(t)
\end{pmatrix},
\]
which is written as $f'(t)$ or $\dot{f}(t)$ in calculus courses. Since an $m \times 1$ matrix can be thought of as a vector, the usual interpretation of $f'(t)$ as the tangent vector to the curve $x = f(t)$ holds. If the curve is traced out by a particle as $t$ varies, then, at time $t$, the particle is at $x = f(t)$, while at time $t + \Delta t$, it is approximately at $f(t) + f'(t)\Delta t$.
3.2.4. The general case $m > 1$, $n > 1$. Finally, consider the case $m > 1$, $n > 1$, so that $f$ is a vector-valued function of a vector variable. Then, by the results on partial derivatives,
\[
Df(x)\Delta x = D_{x_1} f(x)\Delta x_1 + D_{x_2} f(x)\Delta x_2 + \cdots + D_{x_n} f(x)\Delta x_n.
\]
Now, regarded as a function of $x_i$ with the other components of $x$ held fixed, $f$ defines a function of the type considered in Section 3.2.3. The representer of $D_{x_i} f(x)$ is the column vector
\[
\frac{\partial f}{\partial x_i}(x) =
\begin{pmatrix}
\dfrac{\partial f_1}{\partial x_i}(x) \\
\dfrac{\partial f_2}{\partial x_i}(x) \\
\vdots \\
\dfrac{\partial f_m}{\partial x_i}(x)
\end{pmatrix},
\]
and it follows from this that the matrix $J$ representing $Df(x)$ has
\[
\frac{\partial f}{\partial x_1}(x),\ \frac{\partial f}{\partial x_2}(x),\ \ldots,\ \frac{\partial f}{\partial x_n}(x)
\]
as columns. Thus
\[
J =
\begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1}(x) & \dfrac{\partial f_1}{\partial x_2}(x) & \cdots & \dfrac{\partial f_1}{\partial x_n}(x) \\
\dfrac{\partial f_2}{\partial x_1}(x) & \dfrac{\partial f_2}{\partial x_2}(x) & \cdots & \dfrac{\partial f_2}{\partial x_n}(x) \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1}(x) & \dfrac{\partial f_m}{\partial x_2}(x) & \cdots & \dfrac{\partial f_m}{\partial x_n}(x)
\end{pmatrix},
\]
which is the Jacobian matrix mentioned in the introduction. (Note that the gradients of the component functions $f_i(x)$ form the rows of $J$.)
3.3. Summary
Here is a summary of the results that I have presented:
- If $f : \mathbf{R} \to \mathbf{R}$, then the representer of $Df(x)$ is the scalar $f'(x)$.
- If $f : \mathbf{R}^n \to \mathbf{R}$, then the representer of $Df(x)$ is the vector $\nabla f(x)$. Recall that this is a slight departure from the general framework, as $Df(x)$ is represented via inner product with a (column) vector rather than via matrix multiplication with a row vector.
- If $f : \mathbf{R} \to \mathbf{R}^m$, then the representer of $Df(t)$ is the (column) vector $f'(t)$.
- If $f : \mathbf{R}^n \to \mathbf{R}^m$, then the representer of $Df(x)$ is the Jacobian matrix $J$ defined by
\[
J_{ij} = \frac{\partial f_i}{\partial x_j}(x).
\]
3.4. Example: A quadratic function
Suppose $A$ is an $n \times n$ symmetric matrix ($A^T = A$), $b \in \mathbf{R}^n$, $c \in \mathbf{R}$, and $f : \mathbf{R}^n \to \mathbf{R}$ is defined by
\[
f(x) = \tfrac{1}{2}(x, Ax)_{\mathbf{R}^n} + (b, x)_{\mathbf{R}^n} + c.
\]
To compute $\nabla f(x)$, one method is to write $f$ in terms of the components of $x$,
\[
f(x) = \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j + \sum_{i=1}^n b_i x_i + c,
\]
and compute the partial derivatives of $f$. While this is possible, it is easier to proceed from the definition: Write $f(x + \Delta x) - f(x)$ as a term that is linear in $\Delta x$ plus a smaller remainder. Now,
\[
\begin{aligned}
f(x + \Delta x) - f(x) &= \tfrac{1}{2}(x + \Delta x, A(x + \Delta x)) + (b, x + \Delta x) - \tfrac{1}{2}(x, Ax) - (b, x) \\
&= (Ax + b, \Delta x) + \tfrac{1}{2}(\Delta x, A\Delta x).
\end{aligned}
\]
Note that the symmetry of $A$ was used to conclude that $(\Delta x, Ax) = (x, A\Delta x)$. The term $(\Delta x, A\Delta x)/2$ is small compared to $\Delta x$ when $\Delta x$ is small; in fact,
\[
\tfrac{1}{2}(\Delta x, A\Delta x) = O(\|\Delta x\|^2) = o(\|\Delta x\|).
\]
Therefore,
\[
f(x + \Delta x) - f(x) = (Ax + b, \Delta x) + o(\|\Delta x\|),
\]
that is,
\[
Df(x)\Delta x = (Ax + b, \Delta x).
\]
This equation exhibits the representer $\nabla f(x)$ of $Df(x)$:
\[
\nabla f(x) = Ax + b.
\]
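The identity $\nabla f(x) = Ax + b$ is easy to check numerically. A NumPy sketch (my illustration, not from the paper; the random matrix, vector, and constant term are arbitrary choices) compares it against a centered finite-difference gradient:

```python
import numpy as np

# Check grad f(x) = A x + b for f(x) = (1/2)(x, Ax) + (b, x) + c with A symmetric.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T                      # symmetrize, as the derivation requires
b = rng.standard_normal(4)
c = 1.5                          # arbitrary constant; it drops out of the gradient

f = lambda x: 0.5 * x @ A @ x + b @ x + c
x = rng.standard_normal(4)
grad_exact = A @ x + b

# centered differences along each coordinate direction
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])
```

Because $f$ is quadratic, the centered difference is exact up to round-off, so the two gradients agree to high precision.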
4. Rules for differentiation
I will now review the important rules for differentiating functions.
4.1. The derivative of a linear function
Suppose $f : X \to Y$ is linear and continuous. Then
\[
f(x + \Delta x) - f(x) = f(x) + f(\Delta x) - f(x) = f(\Delta x).
\]
It follows that
\[
Df(x)\Delta x = f(\Delta x),
\]
that is, $Df(x) = f$. This holds independently of $x \in X$, and is the analogue of the rule from calculus which states that the derivative of a linear function is a constant.
4.2. The chain rule
Now suppose that $h : Y \to Z$ and $g : X \to Y$ are $C^1$, and $f : X \to Z$ is the composition of $h$ and $g$:
\[
f(x) = h(g(x)) \quad \text{for all } x \in X.
\]
Then
\[
\begin{aligned}
f(x + \Delta x) &= h(g(x + \Delta x)) \\
&= h(g(x) + Dg(x)\Delta x + o(\|\Delta x\|)) \\
&= h(g(x)) + Dh(g(x))(Dg(x)\Delta x + o(\|\Delta x\|)) + o(\|Dg(x)\Delta x + o(\|\Delta x\|)\|) \\
&= f(x) + Dh(g(x))Dg(x)\Delta x + o(\|\Delta x\|).
\end{aligned}
\]
Thus
\[
Df(x)\Delta x = Dh(g(x))Dg(x)\Delta x.
\]
This is the chain rule.
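In the Euclidean case the chain rule says the Jacobian of $h \circ g$ is the matrix product of the Jacobians. A quick NumPy check (my illustration, not from the paper; the maps $g$ and $h$ are arbitrary choices) confirms this with finite-difference Jacobians:

```python
import numpy as np

# Chain rule check: the Jacobian of f = h o g at x equals
# (Jacobian of h at g(x)) @ (Jacobian of g at x).
g = lambda x: np.array([x[0] * x[1], np.sin(x[0])])   # g: R^2 -> R^2
h = lambda y: np.array([y[0] + y[1] ** 2])            # h: R^2 -> R^1

def jac(f, x, step=1e-6):
    """Forward-difference Jacobian, column by column."""
    x = np.asarray(x, float)
    fx = f(x)
    cols = [(f(x + step * e) - fx) / step for e in np.eye(x.size)]
    return np.column_stack(cols)

x = np.array([0.7, -1.2])
J_direct = jac(lambda v: h(g(v)), x)     # Jacobian of the composition
J_chain = jac(h, g(x)) @ jac(g, x)       # chain rule: Dh(g(x)) Dg(x)
```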
4.3. The product rule
Suppose $X$, $Y$, $Z$, and $W$ are normed vector spaces, and $P : Y \times Z \to W$ is continuous and bilinear; that is,
\[
P(\alpha_1 y_1 + \alpha_2 y_2, z) = \alpha_1 P(y_1, z) + \alpha_2 P(y_2, z) \quad \text{for all } y_1, y_2 \in Y,\ z \in Z,\ \alpha_1, \alpha_2 \in \mathbf{R},
\]
\[
P(y, \alpha_1 z_1 + \alpha_2 z_2) = \alpha_1 P(y, z_1) + \alpha_2 P(y, z_2) \quad \text{for all } y \in Y,\ z_1, z_2 \in Z,\ \alpha_1, \alpha_2 \in \mathbf{R}.
\]
Thus
\[
\nabla h(x) = G^T f(x) + F^T g(x).
\]
4.5. Example: Differentiating an inverse
Suppose $L : X \to L(Y, Z)$, and assume that $L(x)^{-1}$ exists and is continuous for each $x \in X$. Define $f : X \to L(Z, Y)$ by
\[
f(x) = L(x)^{-1}.
\]
Assuming that $L$ is $C^1$, what is $Df(x)$?
Both the chain rule and the product rule are involved in the answer. Define $\Phi : U \to L(Z, Y)$ by $\Phi(K) = K^{-1}$, where
\[
U = \{ K \in L(Y, Z) : K^{-1} \text{ exists and is continuous} \}.
\]
Then
\[
K \Phi(K) = I,
\]
where $I : Z \to Z$ is the identity operator. Differentiating both sides yields
\[
K\, D\Phi(K)\Delta K + \Delta K\, \Phi(K) = 0. \tag{3}
\]
To obtain this result, the product rule was applied to the mapping $P : L(Y, Z) \times L(Z, Y) \to L(Z, Z)$ defined by $P(K, L) = KL$. Also, note that $I \in L(Z, Z)$ is constant, so its derivative is zero. Now, $\Phi(K) = K^{-1}$, so (3) yields
\[
K\, D\Phi(K)\Delta K = -\Delta K\, K^{-1}
\]
or
\[
D\Phi(K)\Delta K = -K^{-1} \Delta K\, K^{-1}. \tag{4}
\]
(The reader will notice the shadow of the calculus rule
\[
f(x) = \frac{1}{x} \implies f'(x) = -\frac{1}{x^2}
\]
here.)
Equation (4) can now be combined with the chain rule to find $Df(x)$ for $f(x) = L(x)^{-1}$. Since $f$ is the composition of $\Phi$ and $L$, the chain rule yields
\[
Df(x)\Delta x = D\Phi(L(x))\, DL(x)\Delta x = -L(x)^{-1}\, (DL(x)\Delta x)\, L(x)^{-1}.
\]
This expression for $Df(x)\Delta x$ is the product (composition) of three linear operators, $L(x)^{-1}$, $DL(x)\Delta x$, and $L(x)^{-1}$. Note that since the product of linear operators is not commutative, order is important in this formula.
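Formula (4) can be verified numerically for matrices: $(K + t\,\Delta K)^{-1} - K^{-1}$ should approach $-t\,K^{-1}\Delta K\,K^{-1}$ as $t \to 0$. A NumPy sketch (my illustration, not from the paper; the matrices are arbitrary, with $K$ shifted to be safely invertible):

```python
import numpy as np

# Check D Phi(K) dK = -K^{-1} dK K^{-1} for Phi(K) = K^{-1} on 3x3 matrices.
rng = np.random.default_rng(1)
K = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # comfortably invertible
dK = rng.standard_normal((3, 3))
Kinv = np.linalg.inv(K)

t = 1e-6
fd = (np.linalg.inv(K + t * dK) - Kinv) / t   # finite-difference directional derivative
formula = -Kinv @ dK @ Kinv                    # the derived formula (4)
```

Reversing the order of the three factors in `formula` would generally break the agreement, which illustrates the remark about non-commutativity.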
5. Simple representations on infinite-dimensional spaces
If $X$ and $Y$ are both infinite-dimensional, then little can be said in general about the representation of $L \in L(X, Y)$. However, if one of the spaces is finite-dimensional, then it is not difficult to derive some useful results.
5.1. Real-valued functions defined on Hilbert spaces
First I take the special case of $f : X \to \mathbf{R}$, where $X$ is a Hilbert space, that is, a complete inner product space. In this case, the Riesz Representation Theorem is available.

Theorem 5.1 (Riesz Representation Theorem). Let $X$ be a Hilbert space. If $L : X \to \mathbf{R}$ is linear and continuous, there exists a unique $v \in X$ such that
\[
L(x) = (x, v)_X \quad \text{for all } x \in X.
\]
This theorem follows immediately from Theorem 3.1 if $X$ is finite-dimensional. For a proof in the infinite-dimensional case, see any book on Hilbert spaces or functional analysis.
Now suppose $f : X \to \mathbf{R}$ is differentiable at $x \in X$. Then $Df(x)$ is a continuous linear function defined on $X$, and, by the Riesz representation theorem, there exists a vector $v \in X$ satisfying
\[
Df(x)\Delta x = (\Delta x, v)_X \quad \text{for all } \Delta x \in X.
\]
Just as in the finite-dimensional case, the vector $v$ is called the gradient of $f$ at $x$, and is denoted by $\nabla f(x)$.
5.2. Example: The gradient of a nonlinear least-squares function
A common optimization problem is to minimize the nonlinear least-squares function
A common optimization problem is to minimize the nonlinear least-squares function
\[
f(x) = \tfrac{1}{2} \|F(x)\|_Y^2 = \tfrac{1}{2} (F(x), F(x))_Y,
\]
where $F : X \to Y$ is a nonlinear operator and $X$, $Y$ are Hilbert spaces. I will now apply the results developed above to compute the gradient of $f$. I will also specialize the results to $X = \mathbf{R}^n$, $Y = \mathbf{R}^m$.
By the product rule,
\[
Df(x)\Delta x = \tfrac{1}{2}(F(x), DF(x)\Delta x)_Y + \tfrac{1}{2}(DF(x)\Delta x, F(x))_Y = (DF(x)\Delta x, F(x))_Y.
\]
Now, an $m \times n$ matrix $A$ has the property that
\[
(Ax, y)_{\mathbf{R}^m} = (x, A^T y)_{\mathbf{R}^n} \quad \text{for all } x \in \mathbf{R}^n,\ y \in \mathbf{R}^m.
\]
Similarly, for every operator $L \in L(X, Y)$, there is a unique adjoint operator $L^*$ defined by the equation
\[
(Lx, y)_Y = (x, L^* y)_X \quad \text{for all } x \in X,\ y \in Y.
\]
(The existence and uniqueness of $L^*$ can be proved using the Riesz representation theorem.) Therefore,
\[
Df(x)\Delta x = (DF(x)\Delta x, F(x))_Y = (\Delta x, DF(x)^* F(x))_X,
\]
which shows that
\[
\nabla f(x) = DF(x)^* F(x).
\]
Computing the adjoint of $DF(x)$ can be quite challenging in some applications; see Section 7 for a nontrivial example. In the case of $X = \mathbf{R}^n$, $Y = \mathbf{R}^m$, $DF(x)$ is represented by the Jacobian matrix $J$, and therefore $DF(x)^*$ is represented by its transpose. It follows that
\[
\nabla f(x) = J^T F(x),
\]
where $J$ is the Jacobian matrix of $F$ at $x$.
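The formula $\nabla f(x) = J^T F(x)$ can be checked directly. In the sketch below (my illustration, not from the paper; the residual map $F$ is an arbitrary choice), the hand-computed Jacobian is compared against a centered finite-difference gradient of $f$:

```python
import numpy as np

# grad f(x) = J^T F(x) for f(x) = (1/2)||F(x)||^2, with F: R^2 -> R^3.
F = lambda x: np.array([x[0] ** 2 - x[1], np.exp(x[1]) - 1.0, x[0] * x[1]])
f = lambda x: 0.5 * F(x) @ F(x)

x = np.array([0.8, -0.3])
J = np.array([[2 * x[0], -1.0],
              [0.0, np.exp(x[1])],
              [x[1], x[0]]])       # Jacobian of F at x, computed by hand
grad = J.T @ F(x)                  # the least-squares gradient formula

# centered finite-difference gradient of f for comparison
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
```

This is the same gradient that Gauss-Newton-type optimization algorithms assemble from the residual vector and its Jacobian.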
5.3. Finite-dimensional operators on Hilbert space
Next I consider the case of $F : X \to \mathbf{R}^m$, where $X$ is again a Hilbert space. Clearly $F$ can be represented as
\[
F(x) =
\begin{pmatrix}
F_1(x) \\ F_2(x) \\ \vdots \\ F_m(x)
\end{pmatrix},
\]
where $F_i : X \to \mathbf{R}$, $i = 1, 2, \ldots, m$. It follows that
\[
DF(x)\Delta x =
\begin{pmatrix}
DF_1(x)\Delta x \\ DF_2(x)\Delta x \\ \vdots \\ DF_m(x)\Delta x
\end{pmatrix}
=
\begin{pmatrix}
(\nabla F_1(x), \Delta x)_X \\ (\nabla F_2(x), \Delta x)_X \\ \vdots \\ (\nabla F_m(x), \Delta x)_X
\end{pmatrix}.
\]
Thus the derivative of $F$ can be represented by $m$ vectors in $X$, namely,
\[
\nabla F_1(x), \nabla F_2(x), \ldots, \nabla F_m(x).
\]
By analogy with the finite-dimensional case, these vectors can be thought of as forming the rows of a matrix (with infinitely many columns).
5.4. Operators with afinite-dimensional domain
5.4.1. The case of a one-dimensional domain. Suppose $Y$ is a normed linear space, and assume $F : \mathbf{R} \to Y$ is differentiable. Then $DF(t)$ is a continuous linear operator from $\mathbf{R}$ into $Y$. It is simple to represent such operators, for if $L \in L(\mathbf{R}, Y)$ and $z = L(1)$, then
\[
Lt = t L(1) = t z \quad \text{for all } t \in \mathbf{R}.
\]
Thus $L$ is represented by an element $z$ of $Y$.
It follows that $DF(t)$ is represented by an element of $Y$, which is denoted $F'(t)$:
\[
DF(t)\Delta t = \Delta t\, F'(t) \quad \text{for all } \Delta t \in \mathbf{R}.
\]
5.4.2. The case of a finite-dimensional domain. Now suppose $F : \mathbf{R}^n \to Y$ is differentiable. Then $DF(x) \in L(\mathbf{R}^n, Y)$, and the structure of such an operator must be determined. Let $L \in L(\mathbf{R}^n, Y)$ and let $\{e_1, e_2, \ldots, e_n\}$ be the standard basis for $\mathbf{R}^n$. Then for any $x \in \mathbf{R}^n$,
\[
x = \sum_{i=1}^n x_i e_i,
\]
and so
\[
Lx = \sum_{i=1}^n x_i L e_i.
\]
That is, there are $n$ vectors $Le_1, Le_2, \ldots, Le_n$ in $Y$, and each image $Lx$ is a linear combination of these $n$ vectors. These $n$ vectors represent $L$. By analogy with the case $Y = \mathbf{R}^m$, one can think of the representer of $L$ as a matrix with $n$ columns (each of which is a vector in $Y$).
It is now easy to see that $DF(x)$ is represented by $n$ vectors, each of which is the representer of a partial derivative of $F$ at $x$. Again, one can think of the representer of $DF(x)$ as a matrix with $n$ columns, each a representer of a partial derivative of $F$ at $x$.
5.5. Example: The solution operator of an IVP
As an important example of the previous section, consider a vector field $f : \Omega \to \mathbf{R}^n$, where $\Omega \subset \mathbf{R}^n$ is an open set. The vector field $f$ defines an autonomous (i.e. time-independent) ordinary differential equation (ODE)
\[
x' = f(x).
\]
By a standard result of the theory of ODEs, if $W \subset \Omega$ is closed and bounded and $f$ is $C^1$, then there exists a positive number $\epsilon$ such that, for each $x_0 \in W$, there exists $x \in (C[-\epsilon, \epsilon])^n$ satisfying the Initial Value Problem (IVP)
\[
\begin{aligned}
x' &= f(x), \\
x(0) &= x_0.
\end{aligned}
\tag{5}
\]
That is, there is an operator $S : W \to (C[-\epsilon, \epsilon])^n$ with $S(x_0) = x$, the solution of (5). I call $S$ the solution operator of the IVP.
Recall that $(C[-\epsilon, \epsilon])^n$ is the space of all continuous, vector-valued functions defined on $[-\epsilon, \epsilon]$. The usual norm on $(C[-\epsilon, \epsilon])^n$ is
\[
\|u\|_\infty = \max\{ \|u(t)\|_2 : t \in [-\epsilon, \epsilon] \}.
\]
This definition implies that if $\|u - v\|_\infty$ is small, then $\|u(t) - v(t)\|_2$ is uniformly small on the interval $[-\epsilon, \epsilon]$. For this reason, $\|\cdot\|_\infty$ is sometimes called the uniform norm.
The derivative $DS(x_0)$ is computed by finding the local linear approximation to $S(x_0 + \Delta x_0) - S(x_0)$. Write $z = S(x_0 + \Delta x_0)$ and $x = S(x_0)$. Then $z$ satisfies
\[
z' = f(z), \quad z(0) = x_0 + \Delta x_0,
\]
and $x$ satisfies (5). Therefore, if $w = z - x$, then
\[
w' = z' - x' = f(z) - f(x) = Df(x)(z - x) + o(\|z - x\|) = Df(x)w + o(\|w\|)
\]
and
\[
w(0) = z(0) - x(0) = x_0 + \Delta x_0 - x_0 = \Delta x_0.
\]
Since a linear (in $\Delta x_0$) approximation to $w$ is desired, it is reasonable to drop the $o(\|w\|)$ term from the ODE and consider the solution $u$ of
\[
\begin{aligned}
u' &= Df(x(t))u, \\
u(0) &= \Delta x_0.
\end{aligned}
\tag{6}
\]
Note that $u$ really does depend linearly on $\Delta x_0$. Indeed, if $u$ solves (6) and $v$ solves
\[
v' = Df(x(t))v, \quad v(0) = \Delta y_0,
\]
then $y = u + v$ satisfies
\[
y' = u' + v' = Df(x(t))u + Df(x(t))v = Df(x(t))(u + v) = Df(x(t))y
\]
and
\[
y(0) = u(0) + v(0) = \Delta x_0 + \Delta y_0.
\]
Therefore $u$, the solution of (6), depends linearly on $\Delta x_0$, and it is an approximation to $w$ since it is obtained by solving an IVP with the same initial condition as that satisfied by $w$ and with a slightly changed vector field. It can be proved, in fact, that $w = u + o(\|\Delta x_0\|_2)$. (This is a standard theorem about the continuous dependence of the solution to an IVP on the vector field.) Therefore $DS(x_0)\Delta x_0 = u$, where $u$ is the solution of the IVP (6).
6. Second derivatives
In elementary calculus, if $f : \mathbf{R} \to \mathbf{R}$ is twice differentiable, then the scalar $f'(x)$ representing $Df(x)$ is called the first derivative. Since, in this way of looking at things, $f' : \mathbf{R} \to \mathbf{R}$ is the same type of function as is $f$ itself, it is natural to define $f''(x)$ as the derivative of $f'$ at $x$, so that $f''(x)$ is also a scalar and $f''$, like $f$ and $f'$, maps $\mathbf{R}$ into $\mathbf{R}$. As I will now explain, this is another instance in which the one-dimensional case gives a completely misleading picture.
Suppose $X$ and $Y$ are normed linear spaces, $U \subset X$ is open, and $f : U \to Y$ is differentiable. Then the derivative $Df$ is also an operator mapping one normed linear space into another; however, it is not of the same type as $f$, since $Df$ maps $U$ into $L(X, Y)$. It does make sense to ask whether $Df : U \to L(X, Y)$ is differentiable; to examine this question, Definition 2.2 is applied. The operator $Df$ is differentiable at $x \in U$ if there exists a continuous linear operator $L \in L(X, L(X, Y))$ such that
\[
\lim_{\Delta x \to 0} \frac{\| Df(x + \Delta x) - Df(x) - L \Delta x \|_{L(X,Y)}}{\|\Delta x\|_X} = 0.
\]
If such an $L$ exists, then $f$ is said to be twice-differentiable at $x$, and $L$ is denoted by $D^2 f(x)$. If $f$ is twice-differentiable at each $x \in U$, then $f$ is called twice-differentiable, in which
case $D^2 f$ is an operator mapping $U$ into $L(X, L(X, Y))$. If this operator is continuous, then $f$ is called $C^2$.
Now, clearly Definition 2.2 can be used to discuss derivatives of order three and higher.
However, things become quite awkward. For example, if $f$ is three times differentiable, then
\[
D^3 f(x) \in L(X, L(X, L(X, Y))),
\]
and if $D^4 f(x)$ exists, then
\[
D^4 f(x) \in L(X, L(X, L(X, L(X, Y)))).
\]
Fortunately, a simplification is afforded by the nature of the spaces
\[
L(X, L(X, Y)),\ L(X, L(X, L(X, Y))),\ \ldots
\]
Consider $L \in L(X, L(X, Y))$. By definition, $L(x) \in L(X, Y)$ and $L(x)z \in Y$ for each $x, z \in X$. In other words, $L$ defines an operator $B : X \times X \to Y$ by
\[
B(x, z) = L(x)z.
\]
It is easy to see that $B$ is bilinear, that is, that
\[
B(\alpha_1 x_1 + \alpha_2 x_2, z) = \alpha_1 B(x_1, z) + \alpha_2 B(x_2, z) \quad \text{for all } x_1, x_2, z \in X,\ \alpha_1, \alpha_2 \in \mathbf{R},
\]
\[
B(z, \alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 B(z, x_1) + \alpha_2 B(z, x_2) \quad \text{for all } x_1, x_2, z \in X,\ \alpha_1, \alpha_2 \in \mathbf{R}.
\]
Indeed,
\[
L(\alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 L(x_1) + \alpha_2 L(x_2),
\]
from which it follows that
\[
B(\alpha_1 x_1 + \alpha_2 x_2, z) = L(\alpha_1 x_1 + \alpha_2 x_2)z = \alpha_1 L(x_1)z + \alpha_2 L(x_2)z = \alpha_1 B(x_1, z) + \alpha_2 B(x_2, z).
\]
Similarly, $L(z)$ is linear, so
\[
B(z, \alpha_1 x_1 + \alpha_2 x_2) = L(z)(\alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 L(z)x_1 + \alpha_2 L(z)x_2 = \alpha_1 B(z, x_1) + \alpha_2 B(z, x_2).
\]
If $X$, $Y$, and $Z$ are normed linear spaces, then the space of continuous bilinear operators $B : X \times Y \to Z$ is denoted by $L^2(X, Y, Z)$. This space has a natural norm:
\[
\|B\|_{L^2(X,Y,Z)} = \sup \left\{ \frac{\|B(x, y)\|_Z}{\|x\|_X \|y\|_Y} : x \in X,\ y \in Y,\ x \ne 0,\ y \ne 0 \right\}.
\]
It is a standard result that a bilinear operator $B : X \times Y \to Z$ is continuous if and only if $\|B\|_{L^2(X,Y,Z)} < \infty$.
Note that the notation $D^2_{xy} f$ denotes the partial derivative with respect to $y$ of $D_x f$, and similarly for $D^2_{yx} f$. By Theorem 6.1, each of $D^2 f(x, y)$, $D^2_{xx} f(x, y)$, and $D^2_{yy} f(x, y)$ is symmetric. It follows easily that
\[
D^2_{xy} f(x, y)\Delta z \Delta x = D^2_{yx} f(x, y)\Delta x \Delta z \quad \text{for all } \Delta x \in X,\ \Delta z \in Y.
\]
Of course, these results can be generalized to the case of $f : X_1 \times X_2 \times \cdots \times X_n \to Z$, in which case the fundamental formulas are
\[
D^2 f(x)\Delta x \Delta r = \sum_{i=1}^n \sum_{j=1}^n D^2_{x_i x_j} f(x)\Delta x_j \Delta r_i
\]
and
\[
D^2_{x_i x_j} f(x)\Delta x_j \Delta r_i = D^2_{x_j x_i} f(x)\Delta r_i \Delta x_j.
\]
6.3. Representation of second derivatives on finite-dimensional spaces
Now I return to the case of f : R^n → R^m and derive the formula for the 3-tensor representing D^2 f(x). Since R^n can be regarded as the product of n copies of R, D^2 f(x) can be expressed in terms of the second partial derivatives of f:

D^2 f(x) δx r = Σ_{j=1}^n Σ_{k=1}^n D^2_{x_j x_k} f(x) δx_k r_j.
Recall that D^2_{x_j x_k} f is the derivative with respect to x_k of D_{x_j} f; also recall that

D_{x_j} f : R^n → L(R, R^m),

or, effectively,

D_{x_j} f : R^n → R^m

(since each operator in L(R, R^m) is represented by a vector in R^m, and vice versa). Specifically, D_{x_j} f(x) is represented by the vector

∂f/∂x_j (x).

By the same reasoning, D^2_{x_j x_k} f(x) is also represented by a vector in R^m, namely,

∂^2 f/∂x_k ∂x_j (x).
It follows that

D^2 f(x) δx r = Σ_{j=1}^n Σ_{k=1}^n ∂^2 f/∂x_k ∂x_j (x) δx_k r_j.
Thus,

(D^2 f(x) δx r)_i = Σ_{j=1}^n Σ_{k=1}^n ∂^2 f_i/∂x_k ∂x_j (x) δx_k r_j,

which shows that D^2 f(x) is represented by the 3-tensor T, with

T_{ijk} = ∂^2 f_i/∂x_k ∂x_j (x).
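This tensor representation is easy to check numerically. The following sketch (my own illustration; the map f and the helper second_diff_tensor are invented for this purpose) estimates T_{ijk} by central differences for a small map f : R^2 → R^2 and can be compared against the hand-computed second partials:

```python
import numpy as np

# Invented example map f : R^2 -> R^2, chosen for easy hand differentiation:
# f(x) = (x1^2 * x2, x1 * x2^2).
def f(x):
    return np.array([x[0] ** 2 * x[1], x[0] * x[1] ** 2])

def second_diff_tensor(f, x, h=1e-5):
    """Estimate T[i, j, k] ~ d^2 f_i / dx_k dx_j by central differences."""
    n = len(x)
    m = len(f(x))
    T = np.empty((m, n, n))
    I = np.eye(n)
    for j in range(n):
        for k in range(n):
            T[:, j, k] = (f(x + h * (I[j] + I[k])) - f(x + h * (I[j] - I[k]))
                          - f(x - h * (I[j] - I[k])) + f(x - h * (I[j] + I[k]))) / (4 * h ** 2)
    return T

x = np.array([1.0, 2.0])
T = second_diff_tensor(f, x)
# By hand, at (1, 2): T[0] = [[4, 2], [2, 0]] and T[1] = [[0, 4], [4, 2]].
```

Each slice T[i] is symmetric, as Theorem 6.1 requires.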
6.4. The Hessian
In Sections 3.2.2 and 5.1, I showed that, in the case of f : X → R, the usual representer of Df(x) is the gradient vector ∇f(x). When X = R^n, this is not quite the same as the Jacobian matrix (which is a row matrix in that case); the gradient is adopted instead precisely because it generalizes to the case in which X is a Hilbert space. In the same way, the representer for D^2 f(x) has a special form when the range of f is R. Indeed, in this case D^2 f(x) is a bilinear operator mapping X × X into R, and the following theorem holds.
Theorem 6.4. Suppose X is a Hilbert space and B ∈ L_2(X, X, R) is a bilinear form. Then there exists a linear operator L ∈ L(X, X) such that

B(x, y) = (Lx, y)_X for all x, y ∈ X.
This theorem can be proved using the Riesz representation theorem, since, for fixed x, the map

y ↦ B(x, y)

defines a continuous, linear, real-valued function on X. In the case of D^2 f(x), where f : X → R, the linear operator representing the bilinear operator is called the Hessian operator and is denoted ∇^2 f(x). That is,

∇^2 f(x) ∈ L(X, X)

is defined by

D^2 f(x) δx δy = (∇^2 f(x) δx, δy)_X for all δx, δy ∈ X.
In the case of f : R^n → R, the Hessian matrix is just the specialization of the tensor T discussed above. Since m = 1 in this case, the 3-tensor can obviously be identified with a 2-tensor, i.e. a matrix (just as the Jacobian matrix can be identified with a vector, the gradient, in this case). Therefore,

(∇^2 f(x))_{ij} = ∂^2 f/∂x_j ∂x_i (x).
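As a sanity check, the Hessian matrix can be estimated by central differences and its symmetry observed directly; the function f below is my own small example, not one from the text:

```python
import numpy as np

# Invented scalar example f(x) = x1^2 * x2 + sin(x2); its Hessian at (1, 0.5)
# is [[2*x2, 2*x1], [2*x1, -sin(x2)]] = [[1, 2], [2, -sin(0.5)]].
def f(x):
    return x[0] ** 2 * x[1] + np.sin(x[1])

def hessian_fd(f, x, h=1e-5):
    """Central-difference estimate of the Hessian matrix of f at x."""
    n = len(x)
    H = np.empty((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h * (I[i] + I[j])) - f(x + h * (I[i] - I[j]))
                       - f(x - h * (I[i] - I[j])) + f(x - h * (I[i] + I[j]))) / (4 * h ** 2)
    return H

H = hessian_fd(f, np.array([1.0, 0.5]))
# H agrees with the analytic Hessian, and is symmetric by construction.
```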
6.5. Example: The Hessian of a nonlinear least-squares function
I now return to the example of Section 5.2. Let F : X → Y, where X and Y are Hilbert spaces, and define

f(x) = (1/2)(F(x), F(x))_Y.
I showed earlier that
Df(x) δx = (DF(x) δx, F(x))_Y.
By the product rule, it follows that

D^2 f(x) δx r = (DF(x) δx, DF(x) r)_Y + (D^2 F(x) δx r, F(x))_Y. (7)
This gives a formula for the second derivative of f, but it must be rearranged to exhibit the Hessian operator. The first term is easy to handle, since

(DF(x) δx, DF(x) r)_Y = (δx, DF(x)* DF(x) r)_X.
The operator DF(x)* DF(x) thus forms part of the Hessian; indeed, in small-residual least-squares problems, this operator is a good approximation to the Hessian, at least for x near the minimizer. For this reason, it is often used as an approximation to the Hessian, and is referred to as the Gauss-Newton Hessian.
To handle the second term, write

B(δx, r) = D^2 F(x) δx r;

then

(D^2 F(x) δx r, F(x))_Y = (B(δx, r), F(x))_Y = (δx, (B(·, r))* F(x))_X,

where I write B(·, r) for the linear operator defined by

δx ↦ B(δx, r).
Now, it is easy to see that

r ↦ (B(·, r))* F(x)

defines a linear operator mapping X to X, and it can be shown to be bounded (continuous). This operator depends on x through F(x) and D^2 F(x), and I will denote it by S(x), so that

S(x) r = (B(·, r))* F(x).
With this notation,

(D^2 F(x) δx r, F(x))_Y = (δx, S(x) r)_X,

and therefore

∇^2 f(x) = DF(x)* DF(x) + S(x).
Lastly, I will compute S(x) in the case F : R^n → R^m. In that case, D^2 F(x) is represented by the 3-tensor T, where

T_{ijk} = ∂^2 F_i/∂x_k ∂x_j (x).
Therefore, setting z = F(x),

(D^2 F(x) δx r, z)_{R^m} = Σ_{i=1}^m (D^2 F(x) δx r)_i z_i
= Σ_{i=1}^m Σ_{j=1}^n Σ_{k=1}^n T_{ijk} δx_k r_j z_i
= Σ_{k=1}^n ( Σ_{j=1}^n Σ_{i=1}^m T_{ijk} z_i r_j ) δx_k
= (δx, w)_{R^n}, where w_k = Σ_{j=1}^n Σ_{i=1}^m T_{ijk} z_i r_j,

and so

(S(x) r)_k = Σ_{j=1}^n Σ_{i=1}^m T_{ijk} z_i r_j = Σ_{j=1}^n Σ_{i=1}^m ∂^2 F_i/∂x_k ∂x_j (x) F_i(x) r_j.
This shows that S(x) is represented by the matrix whose (k, j) entry is

Σ_{i=1}^m ∂^2 F_i/∂x_k ∂x_j (x) F_i(x),

and hence

S(x) = Σ_{i=1}^m F_i(x) ∇^2 F_i(x).
The matrix representing S(x) has been referred to as the mess matrix, and the above
formula shows that it is expensive to compute. This explains the popularity of the Gauss-
Newton Hessian. However, in large-residual least-squares problems, use of the full Hessian
(or an approximation to it) is necessary.
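The decomposition ∇^2 f(x) = DF(x)* DF(x) + S(x) can be illustrated numerically. In the sketch below (the residual map F is an invented example), the Gauss-Newton term J^T J and the correction S(x) = Σ F_i(x) ∇^2 F_i(x) are assembled by hand and compared with a finite-difference Hessian of f(x) = (1/2)‖F(x)‖^2:

```python
import numpy as np

# Invented residual map F : R^2 -> R^2 for f(x) = 0.5 * ||F(x)||^2.
def F(x):
    return np.array([x[0] ** 2 - x[1], np.sin(x[0]) + x[1] ** 2])

x = np.array([0.7, -0.3])
J = np.array([[2 * x[0], -1.0],
              [np.cos(x[0]), 2 * x[1]]])            # Jacobian DF(x)
H1 = np.array([[2.0, 0.0], [0.0, 0.0]])             # Hessian of F_1
H2 = np.array([[-np.sin(x[0]), 0.0], [0.0, 2.0]])   # Hessian of F_2

gauss_newton = J.T @ J                              # DF(x)* DF(x)
full = gauss_newton + F(x)[0] * H1 + F(x)[1] * H2   # ... + S(x)

# Compare with a central-difference Hessian of f.
f = lambda x: 0.5 * np.dot(F(x), F(x))
h, I = 1e-5, np.eye(2)
H_fd = np.array([[(f(x + h * (I[i] + I[j])) - f(x + h * (I[i] - I[j]))
                   - f(x - h * (I[i] - I[j])) + f(x - h * (I[i] + I[j]))) / (4 * h * h)
                  for j in range(2)] for i in range(2)])
# full matches H_fd; gauss_newton alone is off by S(x), which is not small here.
```

At this point the residuals F_i(x) are not small, so the Gauss-Newton Hessian alone is a poor approximation, as the text warns for large-residual problems.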
7. Example: The adjoint state method
As a more involved example, I will discuss the computation of DG(c) and DG(c)* for a nonlinear operator G defined by an (explicit) finite-difference simulation. This discussion is taken from the paper (Gockenbach et al., in press). The problem described here arises, for
example, when one or more coefficients in a partial differential equation are to be estimated
by the Output Least-Squares (OLS) technique. In this technique, the parameters are chosen
to produce simulated data as close as possible (in a norm induced by an inner product) to
observed data. Specifically, the OLS problem is
min_c J(c), J(c) = (1/2)‖G(c) − D_obs‖^2, (8)
where c ∈ C denotes the unknown parameters, D_obs is the observed data, and G is the forward map, that is, the operator embodying the mathematical model of the dependence of the data on the parameters. Thus the OLS problem is just a nonlinear least-squares problem of the type discussed above, and

∇J(c) = DG(c)*(G(c) − D_obs).
In the application I consider here, G is defined by an explicit finite-difference simulation followed by sampling (in many applications, only part of the field simulated by finite differences is observable). I will therefore assume that

G(c) = SU = Σ_{n=0}^N S_n U^n,

where U^n ∈ U is (related to) the nth time level of the simulated field, S : U^{N+1} → D is the sampling operator, and D is the data space (that is, D_obs ∈ D).
Note that S is defined by

SU = Σ_{n=0}^N S_n U^n,

where S_n : U → D for n = 0, 1, . . . , N. That is, each time level of the computed field is sampled, and the results are accumulated as the data. This formalism provides an efficient
way to abstractly represent several different sampling possibilities. For example, the entire
time level U^n may be recorded for certain values of n, in which case S_n is the zero operator for all other values of n. Alternatively, every time level could be sampled at a few receiver
locations (as in the typical seismic experiment), and the results recorded as time series. At
the other extreme, the entire history of the field could be retained. All of these possibilities can be accommodated within the above formalism by appropriate choice of S.
Any finite-difference scheme can be considered to be formally two-level, by concatenating several time levels if necessary. Therefore,

U^{n+1} = H_n(c, U^n), n = 0, 1, . . . , N − 1.

I call H_n : C × U → U the stencil operator.
7.1. A convection-diffusion example
I will now pause to give an explicit example of the situation described above. Consider the following initial-boundary value problem for the convection-diffusion equation:

u_t + a(x)u_x = 0, 0 < x ≤ 1, 0 < t ≤ T,
u(0, t) = 0, 0 < t ≤ T,
u(x, 0) = φ(x), 0 < x ≤ 1,

where a(x) > 0 for all x ∈ [0, 1]. Define a grid on the rectangle (x, t) ∈ [0, 1] × [0, T] by setting

x_j = j Δx, Δx = 1/M, t_n = n Δt, Δt = T/N,
and write u^n_j for the approximation to u(x_j, t_n). Since the characteristics of the PDE point up and to the right in the (x, t) plane, it is natural to discretize using a forward difference in time and a backward difference in space to obtain

(u^{n+1}_j − u^n_j)/Δt + a_j (u^n_j − u^n_{j−1})/Δx = 0,
where a_j = a(x_j). Taking into account the initial and boundary values yields

u^{n+1}_j =
0, j = 0, n = 0, 1, . . . , N − 1,
φ_j − a_j (Δt/Δx)(φ_j − φ_{j−1}), n = 0, j = 1, 2, . . . , M,
u^n_1 − a_1 (Δt/Δx) u^n_1, n = 1, 2, . . . , N − 1, j = 1,
u^n_j − a_j (Δt/Δx)(u^n_j − u^n_{j−1}), n = 1, 2, . . . , N − 1, j = 2, 3, . . . , M. (9)

The stencil operator H for this example is therefore defined by H(a, u^n) = u^{n+1}, where u^{n+1} is defined by (9). In terms of the above notation, U is (M + 1)-dimensional space, while C is M-dimensional space.
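A minimal sketch of the stencil operator for scheme (9), assuming the coefficient a is carried on the full grid x_0, . . . , x_M (the function name stencil_step is my own invention):

```python
import numpy as np

# One explicit upwind time step u^{n+1} = H(a, u^n) of scheme (9).
# Arrays have length M + 1 (values at x_0, ..., x_M); only a_1, ..., a_M
# are actually used, and the boundary value u^{n+1}_0 = 0 is set directly.
def stencil_step(a, u, dt, dx):
    unew = np.empty_like(u)
    unew[0] = 0.0                                            # boundary condition
    unew[1:] = u[1:] - a[1:] * (dt / dx) * (u[1:] - u[:-1])  # upwind update
    return unew

# A spatially constant field is unchanged away from the boundary,
# since the upwind difference u_j - u_{j-1} vanishes.
a = np.ones(6)
u = np.ones(6)
unew = stencil_step(a, u, dt=0.1, dx=0.2)
```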
To introduce sampling, suppose that sensors are placed at several grid points on the spatial grid, say at x_{j_1}, x_{j_2}, . . . , x_{j_ℓ}, and that the observed data consist of the time series

u^1_{j_i}, u^2_{j_i}, . . . , u^N_{j_i}, i = 1, 2, . . . , ℓ.
If each time series forms a column of a matrix, then the data D is an (N + 1) × ℓ matrix, and we have

D = Σ_{n=0}^N S_n U^n,

where S_n U^n is the matrix with every row equal to zero except the nth row, which has entries

u^n_{j_1}, u^n_{j_2}, . . . , u^n_{j_ℓ}.
7.2. Back to the general case
The linearization of the map c ↦ G(c) is the result of first-order perturbation of the time-stepping equations:

DG(c) δc = Σ_{n=0}^N S_n δU^n,

where

δU^{n+1} = D_c H_n(c, U^n) δc + D_U H_n(c, U^n) δU^n

and δU^n = (DU(c) δc)^n. Note that if the original finite-difference scheme is linear (really affine: linear plus constant), then it can be written as

U^{n+1} = A(c) U^n + F^n,
where A(c) = D_U H_n(c, U) (D_U H_n is independent of the time level n in this case). It follows that δU satisfies

δU^{n+1} = A(c) δU^n + (DA(c) δc) U^n.

Therefore, in this common case, the linearization is computed by a finite-difference simulation identical to the original, except that the right-hand side F^n is replaced by

(DA(c) δc) U^n.
I will now show how to compute the adjoint of DG(c). The spaces C and U require inner products; these inner products will be denoted (·, ·)_C and (·, ·)_U, respectively. The field U belongs to U^{N+1}, and I define the inner product on U^{N+1} by

(U, V)_{U^{N+1}} = Σ_{n=0}^N (U^n, V^n)_U.
For convenience, and suppressing the dependence on c, write A_n for D_U H_n(c, U^n) and F^{n+1} for D_c H_n(c, U^n) δc, with F^0 = 0, so that the linearized scheme can be written as

δU^0 = 0, δU^{n+1} − A_n δU^n = F^{n+1}, n = 0, 1, . . . , N − 1.

This can also be written as

M δU = F,
where M : U^{N+1} → U^{N+1} is the block linear operator

M =
[  I      0      0    · · ·     0 ]
[ −A_0    I      0    · · ·     0 ]
[  0    −A_1     I    · · ·     0 ]
[  ⋮             ⋱      ⋱       ⋮ ]
[  0    · · ·    0   −A_{N−1}   I ]

(note that M depends on c, but I suppress this dependence). Then δU = M^{−1} F, and the explicit time-stepping scheme is equivalent to solving M δU = F by forward substitution.
Now, write B for the operator mapping δc to F:

(B δc)^n =
0, n = 0,
D_c H_{n−1}(c, U^{n−1}) δc, n = 1, 2, . . . , N

(again suppressing the fact that B depends on c). Then

DG(c) = S M^{−1} B,
and so

DG(c)* = B* (M*)^{−1} S*.
Assuming that S_n, S_n*, and the stencil operator H_n and its derivatives and adjoints D_c H_n(c, U), D_U H_n(c, U), D_c H_n(c, U)*, and D_U H_n(c, U)* are known (the reader might find it instructive to compute these derivatives and adjoints for the convection-diffusion example given above), I will now show how to compute DG(c)* from them. Note that DG(c)* D = B* (M*)^{−1} S* D. Write V = S* D. Then, as is easy to verify,

V^n = (S* D)^n = S_n* D, n = 0, 1, . . . , N.
From my choice of inner product on U^{N+1}, it follows that M* is the block linear operator

M* =
[ I   −A_0*    0     · · ·      0 ]
[ 0     I    −A_1*   · · ·      0 ]
[ ⋮            ⋱       ⋱        ⋮ ]
[ 0   · · ·    0       I   −A_{N−1}* ]
[ 0   · · ·    0       0        I ]
Write W = (M*)^{−1} V, so that W solves M* W = V. Since M* is block upper triangular, W can be found by back substitution, which is equivalent to the following reverse time-stepping scheme:

W^N = V^N, W^{n−1} = A_{n−1}* W^n + V^{n−1}, n = N, N − 1, . . . , 1.

I will refer to W as the adjoint state and to the equation M* W = V as the adjoint state equation.
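The claim that back substitution on M* is the adjoint of forward substitution on M can be verified with a dot-product test; the random matrices A_n below are stand-ins for the linearized stencils:

```python
import numpy as np

# Dot-product (adjoint) test: forward substitution applies M^{-1}, back
# substitution applies (M*)^{-1}, and (M^{-1}F, V) = (F, (M*)^{-1}V).
rng = np.random.default_rng(0)
N, d = 4, 3                                  # N time steps, state dimension d
A = [rng.standard_normal((d, d)) for _ in range(N)]

def forward_sub(F):
    """Solve M dU = F: dU^0 = F^0, dU^{n+1} = A_n dU^n + F^{n+1}."""
    dU = [F[0]]
    for n in range(N):
        dU.append(A[n] @ dU[n] + F[n + 1])
    return dU

def back_sub(V):
    """Solve M* W = V: W^N = V^N, W^{n-1} = A_{n-1}^T W^n + V^{n-1}."""
    W = [None] * (N + 1)
    W[N] = V[N]
    for n in range(N, 0, -1):
        W[n - 1] = A[n - 1].T @ W[n] + V[n - 1]
    return W

F = [rng.standard_normal(d) for _ in range(N + 1)]
V = [rng.standard_normal(d) for _ in range(N + 1)]
lhs = sum(u @ v for u, v in zip(forward_sub(F), V))   # (M^{-1}F, V)
rhs = sum(f @ w for f, w in zip(F, back_sub(V)))      # (F, (M*)^{-1}V)
# lhs and rhs agree to rounding error.
```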
Next I compute B*. Note that

(B δc, W)_{U^{N+1}} = (0, W^0)_U + Σ_{n=1}^N (D_c H_{n−1}(c, U^{n−1}) δc, W^n)_U
= Σ_{n=1}^N (δc, D_c H_{n−1}(c, U^{n−1})* W^n)_C
= ( δc, Σ_{n=1}^N D_c H_{n−1}(c, U^{n−1})* W^n )_C.

This shows that

B* W = Σ_{n=1}^N D_c H_{n−1}(c, U^{n−1})* W^n.
Thus the procedure for computing DG(c)* D, for D ∈ D, is:

1. Solve the simulation problem to produce the field U (needed in steps 3b and 3c).
2. Set c̄ to zero.
3. For n = N, N − 1, . . . , 1:
(a) Compute V^n = S_n* D.
(b) Compute W^n by taking one step (backward in time) on the adjoint state equation (or simply W^N = V^N).
(c) Add D_c H_{n−1}(c, U^{n−1})* W^n to the output vector c̄.
A logistical problem immediately asserts itself: U is produced by stepping forward in time, W by stepping backwards. Unless the state space has small dimension (which is certainly not the typical case), storage of the entire time history of the reference field U is very expensive in terms of memory. On the other hand, one could, at each step of the backward time-stepping algorithm, re-compute the needed time level U^n by forward time-stepping from U^0. This is obviously expensive in terms of computation time.
To balance the need for storage and recomputation, a checkpointing scheme due to Andreas Griewank (1992), extended in Symes et al. (1998), can be employed. The idea is to save (checkpoint) various time levels U^n to use as intermediate initial data to restart the computation of U during the solution of the adjoint state system. A complete description of the algorithm appears in Gockenbach et al. (in press).
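A toy version of the idea (a fixed-interval checkpointing schedule, which is simpler than Griewank's optimal binomial schedule) might look as follows; the stand-in step function is invented:

```python
# Fixed-interval checkpointing: store every k-th time level on the forward
# sweep; recompute intermediate levels from the nearest checkpoint when the
# reverse sweep needs them. step() is a stand-in for U^{n+1} = H_n(c, U^n).
def step(u):
    return 0.5 * u + 1.0

def forward(u0, N, k):
    """Forward sweep, returning {n: U^n} for n = 0, k, 2k, ..."""
    checkpoints, u = {0: u0}, u0
    for n in range(1, N + 1):
        u = step(u)
        if n % k == 0:
            checkpoints[n] = u
    return checkpoints

def recompute(checkpoints, n, k):
    """Recover U^n by forward time-stepping from the nearest earlier checkpoint."""
    m = (n // k) * k
    u = checkpoints[m]
    for _ in range(n - m):
        u = step(u)
    return u

N, k, u0 = 10, 3, 2.0
cps = forward(u0, N, k)
full = [u0]                      # the fully stored trajectory, for comparison
for n in range(N):
    full.append(step(full[-1]))
# recompute(cps, n, k) reproduces full[n] exactly for every n.
```

Storage drops from N + 1 levels to about N/k checkpoints, at the cost of at most k − 1 extra steps per recomputed level.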
8. Example: Differentiating a finite element solution operator in an inverse problem
As my final example, I will discuss the computation of the derivative and its adjoint when the operator is the (approximate) solution operator, as implemented using the finite element method, of an elliptic partial differential equation. Suppose Ω is a bounded polygonal region in the plane and consider the boundary value problem (BVP)

−∇ · (a∇u) = p in Ω, (10)
u = 0 on ∂Ω.
This BVP models, for example, the small transverse displacements u of an elastic membrane under a transverse pressure p. The coefficient a describes the elastic properties of the membrane, and when the membrane is heterogeneous, a is a function of space: a = a(x). The usual direct problem is to compute u given the functions p and a; that is, given the elastic properties of the membrane and the pressure to which it is subjected, determine its displacement. In many applications, it is necessary to solve an inverse problem, such as: Given p and a measurement of u, estimate a; that is, by observing the displacement of the membrane under a known pressure, estimate the elastic properties of the membrane. (It would also be possible to consider p as needing to be measured, that is, that it forms part of the data of the problem. To simplify the presentation, I will assume that the pressure p is known.)
One way to solve the inverse problem numerically is to use the Output Least-Squares approach, as described in the previous section, in conjunction with the finite element method for solving the BVP. Suppose that the measured data is denoted u_obs and a is to be chosen so that the predicted displacement u, as simulated by piecewise linear finite elements, is as close to u_obs as possible in the L^2(Ω) norm. It is necessary to have a representation for the unknown coefficient a, and I will represent it using a piecewise linear function. Let T^(h) be a triangulation of Ω, and define

P^(h) = {φ : Ω → R | φ is continuous and piecewise linear relative to T^(h)},
P^(h)_0 = {φ ∈ P^(h) | φ = 0 on ∂Ω}.
Suppose the nodes of the triangulation T^(h) are x_1, x_2, . . . , x_m and φ_i is the element of P^(h) defined by

φ_i(x_j) = 1 if j = i, 0 if j ≠ i.
Then {φ_1, φ_2, . . . , φ_m} is the standard basis for the space P^(h), and every element u ∈ P^(h) satisfies

u = Σ_{i=1}^m u(x_i) φ_i.
The basis functions that correspond to interior nodes comprise a basis for P^(h)_0; I will denote this basis by {ψ_1, ψ_2, . . . , ψ_n} (there exists a sequence i_1, i_2, . . . , i_n such that ψ_j = φ_{i_j}, j = 1, 2, . . . , n).
The finite element method for estimating the solution of (10) takes the form:

find u ∈ P^(h)_0 such that

∫_Ω a∇u · ∇ψ_i = ∫_Ω p ψ_i, i = 1, 2, . . . , n. (11)
Upon substituting

u = Σ_{j=1}^n U_j ψ_j,

(11) can be written as the matrix-vector equation KU = P, where

K_{ij} = ∫_Ω a∇ψ_j · ∇ψ_i, i, j = 1, 2, . . . , n,
P_i = ∫_Ω p ψ_i, i = 1, 2, . . . , n.
Note that K ∈ R^{n×n} is symmetric and positive definite.
Now define the (approximate) solution operator of (10) as

f : P^(h) → P^(h)_0,

where

f : a ↦ u = Σ_{i=1}^n U_i ψ_i, U = K^{−1} P,

and K ∈ R^{n×n} and P ∈ R^n are defined as above. The OLS approach is then to minimize the function J : P^(h) → R defined by

J(a) = (1/2)‖f(a) − u_obs‖^2_{L^2(Ω)}.
This is another nonlinear least-squares function, and its gradient is given by

∇J(a) = Df(a)*(f(a) − u_obs).
It is easier to compute Df(a) and Df(a)* if I explicitly recognize the fact that the bases for P^(h) and P^(h)_0 make it possible to identify them with R^m and R^n, respectively. Define E : R^m → P^(h) by

EA = Σ_{i=1}^m A_i φ_i,

and note that, as discussed above, E^{−1} is defined by

(E^{−1} a)_i = a(x_i), i = 1, 2, . . . , m.
Similarly, define E_0 : R^n → P^(h)_0 by

E_0 U = Σ_{i=1}^n U_i ψ_i;

then (E_0^{−1} u)_i = u(x_{j_i}), i = 1, 2, . . . , n. I can then write

f = E_0 F E^{−1},
where F : R^m → R^n is defined by

F(A) = U, where a = Σ_{i=1}^m A_i φ_i, u = Σ_{i=1}^n U_i ψ_i, U = K^{−1} P.
I will now show how to compute DF(A) and DF(A)*. The matrix K depends on A, so I will write K = K(A). With

a = Σ_{k=1}^m A_k φ_k,

it follows that

K_{ij}(A) = ∫_Ω a∇ψ_j · ∇ψ_i = ∫_Ω ( Σ_{k=1}^m A_k φ_k ) ∇ψ_j · ∇ψ_i = Σ_{k=1}^m ( ∫_Ω φ_k ∇ψ_j · ∇ψ_i ) A_k = Σ_{k=1}^m T_{ijk} A_k,

where

T_{ijk} = ∫_Ω φ_k ∇ψ_j · ∇ψ_i, i, j = 1, 2, . . . , n, k = 1, 2, . . . , m.
It then follows that, for any A, δA ∈ R^m,

(DK(A) δA)_{ij} = Σ_{k=1}^m T_{ijk} δA_k, that is, DK(A) δA = K(δA).

This result, DK(A) δA = K(δA), is to be expected because the operator K : A ↦ K(A) is linear in A. Since F is defined by

F(A) = K(A)^{−1} P,
the result from Section 4.5 applies, and

DF(A) δA = −K(A)^{−1} (DK(A) δA) K(A)^{−1} P = −K(A)^{−1} K(δA) U,

where U = F(A). This formula shows that computing DF(A) δA for a given δA is no more expensive than computing the simulated displacement U (assuming U is computed first, so K(A) and U are already known), and may be much less expensive if the matrix K(A) has already been factored.
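The formula DF(A) δA = −K(A)^{−1} K(δA) U can be checked against finite differences; the tensor T below is random synthetic data rather than a finite element matrix:

```python
import numpy as np

# Check DF(A)dA = -K(A)^{-1} K(dA) U for F(A) = K(A)^{-1} P, where K depends
# linearly on A: K(A)_ij = sum_k T_ijk A_k. T is random synthetic data whose
# slices are symmetric positive definite, so K(A) is invertible for A > 0.
rng = np.random.default_rng(1)
n, m = 4, 3
T = np.empty((n, n, m))
for k in range(m):
    B = rng.standard_normal((n, n))
    T[:, :, k] = B @ B.T + n * np.eye(n)

K = lambda A: T @ A                          # contracts the last index of T
P = rng.standard_normal(n)
F = lambda A: np.linalg.solve(K(A), P)

A = 1.0 + rng.random(m)
dA = rng.standard_normal(m)
U = F(A)
exact = -np.linalg.solve(K(A), K(dA) @ U)    # -K(A)^{-1} K(dA) U

h = 1e-6
fd = (F(A + h * dA) - F(A - h * dA)) / (2 * h)   # central difference
# exact and fd agree to truncation/round-off error.
```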
I will now turn to the computation of DF(A)*. Note that

(K(δA) U)_i = Σ_{j=1}^n (K(δA))_{ij} U_j = Σ_{j=1}^n Σ_{k=1}^m T_{ijk} δA_k U_j = Σ_{k=1}^m ( Σ_{j=1}^n T_{ijk} U_j ) δA_k.

I now define the matrix L = L(U) by
(L(U))_{ik} = Σ_{j=1}^n T_{ijk} U_j,

which allows me to write

K(δA) U = L(U) δA

and

DF(A) δA = −K(A)^{−1} L(U) δA.
The formula for DF(A)* now follows:

DF(A)* δU = −L(U)^T K(A)^{−1} δU

(where I used the fact that K is symmetric).
The relationship between Df(a) and DF(A) is straightforward, and, indeed, is exactly analogous to the relationship between f and F. If a = EA and δa = E δA, that is,

a = Σ_{i=1}^m A_i φ_i, δa = Σ_{i=1}^m δA_i φ_i,

then δu = Df(a) δa and δU = DF(A) δA satisfy

δu = Σ_{i=1}^n δU_i ψ_i.

Indeed, this follows from the chain rule applied to the relationship

f(a) = E_0 F(E^{−1} a).
by the fundamental rule (AB)* = B* A*, which shows that

(E^{−1})* = (E*)^{−1}. (14)
The calculation of E* is exactly the same as for E_0*; the result is

E* = M E^{−1}, (15)

where M ∈ R^{m×m} is defined by

M_{ij} = (φ_i, φ_j)_{L^2(Ω)}.
Together, (14) and (15) yield

(E^{−1})* = E M^{−1}.
The matrix M is symmetric and positive definite and hence invertible; this follows from the fact that {φ_1, φ_2, . . . , φ_m} is linearly independent. Using the expressions for E_0* and (E^{−1})*, (13) yields

Df(a)* = E M^{−1} DF(E^{−1} a)* M_0 E_0^{−1}.

The appearance of the trivial mappings E and E_0^{−1} in this formula is no more significant than it was in the formula for Df(a). On the other hand, the Gram matrices M^{−1} and M_0 appear because of the different inner products used for the two pairs of isomorphic spaces.
9. Avoiding the need to program derivatives
The user of a software package implementing numerical optimization algorithms is required
to provide some computer code (usually a subroutine written in a given language) to evaluate
the objective and constraint functions. (This is how the user specifies his or her problem
to the optimization code.) Typically, the optimization code will need values of various
derivatives, which can be obtained in several ways:
1. The user can provide hand-written computer codes to evaluate the derivatives.
2. The optimization code can estimate the derivatives using finite differences.
3. The derivatives can be produced, either by the user or the optimization code, using
automatic differentiation.
The emphasis of my presentation so far has been on understanding the basic theory of
derivatives, particularly the linear algebraic foundations, and on using this theory to derive
formulas for derivatives of specific functions. Such understanding is essential for hand-
coding derivatives.
Suppose, though, that a user wishes to avoid the labor (and risk²) of programming the derivatives of the problem functions. In this final section, I will briefly discuss the advantages
and disadvantages of the other two approaches to the computation of derivatives, finite
differences and automatic differentiation.
9.1. Finite difference estimation of derivatives
Optimization codes generally use the representers of the relevant derivatives: the gradient
and Hessian of a real-valued function, the Jacobian matrix of a vector-valued function. In
order to be concise, I will mostly limit my discussion to the computation of the gradient of
a real-valued function.
Suppose³ f : R^n → R. Then ∇f(x) is the vector in R^n whose ith component is
∂f/∂x_i (x) = lim_{h→0} [f(x + h e_i) − f(x)]/h, (16)
where e_i is the ith standard basis vector (that is, the vector with every component equal to zero, except the ith, which is one). When the only information available about f is a black box that will return its value for a given x, it is not possible to implement (16) exactly, as the limit operation implies an infinite calculation.
A natural way of approximating ∂f/∂x_i is to simply truncate the limit operation by choosing a small but nonzero value of h:

∂f/∂x_i (x) ≈ [f(x + h e_i) − f(x)]/h. (17)
Indeed, Taylor's theorem,

f(x + h e_i) = f(x) + ∂f/∂x_i (x) h + (1/2) ∂^2 f/∂x_i^2 (x + θ h e_i) h^2, θ ∈ (0, 1),

can easily be rearranged to show that

∂f/∂x_i (x) = [f(x + h e_i) − f(x)]/h − (1/2) ∂^2 f/∂x_i^2 (x + θ h e_i) h, θ ∈ (0, 1).
Thus, the error in (17) is O(h); this error is referred to as the truncation error. The question
now arises: What value ofh should be chosen in practice?
At first glance, it would appear that smaller values of h (the smaller, the better) would tend to lead to better approximations of the partial derivative. Though this is true in exact arithmetic, it does not take into account the effects of floating point (computer) arithmetic. First of all, h cannot be chosen too small in comparison to x_i; otherwise, the values of x_i and x_i + h, rounded to the nearest floating point number, will be identical (and therefore, necessarily, so will be f(x + h e_i) and f(x)). More subtly, the magnitude of ∂^2 f/∂x_i^2 plays a part.
A computer subroutine implementing the evaluation of f will inevitably return inexact results, because of round-off error if for no other reason. Suppose the implemented function
actually returns

f̂(x) = f(x) + e(x),

with

|e(x)| ≤ ε

for all relevant values of x. Then formula (17) will be implemented as
∂f/∂x_i (x) ≈ [f̂(x + h e_i) − f̂(x)]/h
= [f(x + h e_i) + e(x + h e_i) − f(x) − e(x)]/h
= [f(x + h e_i) − f(x)]/h + [e(x + h e_i) − e(x)]/h
= ∂f/∂x_i (x) + (1/2) ∂^2 f/∂x_i^2 (x + θ h e_i) h + [e(x + h e_i) − e(x)]/h.
There is no reason to expect that the function e is differentiable, so all that can be said about the last term in the above expression is that

| [e(x + h e_i) − e(x)]/h | ≤ 2ε/h.
If the second partial derivative of f is bounded by M, then

| ∂f/∂x_i (x) − [f̂(x + h e_i) − f̂(x)]/h | ≤ Mh/2 + 2ε/h. (18)

This bound suggests that the total error in the approximation can grow as h → 0, since the round-off error (or at least the bound for it) grows as h is decreased.
Thus smaller values of h are not necessarily better in practice, and so the question remains: How should h be chosen in practice? One idea would be to choose h to minimize the bound in (18). This leads to the value

h = 2√(ε/M).
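The trade-off can be seen in a few lines; for f(x) = e^x at x = 0 one has M ≈ 1 and ε ≈ 10^{−16}, so the bound (18) is minimized near h = 2·10^{−8}:

```python
import math

# Forward-difference error for f(x) = exp(x) at x = 0 (f'(0) = 1).
def fd_error(h):
    return abs((math.exp(h) - math.exp(0.0)) / h - 1.0)

h_opt = 2.0 * math.sqrt(1e-16)       # h = 2*sqrt(eps/M) with M ~ 1
# The error near h_opt is far smaller than at either much larger h
# (truncation dominates) or much smaller h (round-off dominates).
```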
However, this result is of limited use, since the value M is not available in general. It does suggest, however, that

h = O(√ε)
is reasonable, and an estimate of ε might be available. The usual choice is

h = sign(x_i) |x_i| √ε, (19)

with some adjustment made if |x_i| is too close to zero. With h determined by some variation on (19), the error in the computed partial derivative
is O(√ε). This leads to the first disadvantage of using finite difference estimates for partial derivatives, and thus for gradients: the attainable accuracy in the computed minimizer is limited. After all, algorithms for numerical optimization are based on the necessary condition that

∇f(x) = 0

at a minimizer (or the analogous Lagrange multiplier conditions, which also involve ∇f(x), for a constrained minimization problem). It is easy to see that the minimizer cannot be reliably computed to an accuracy greater than the accuracy with which the gradient is computed.
The foregoing disadvantage of finite differences is only important for small problems, when it is reasonable (and may be important) to compute the solution to a high degree of accuracy. A more serious objection is related to the computational cost of using finite differences: to estimate ∇f(x) costs n evaluations of the function f (assuming that f(x) must be computed anyway in the course of the optimization algorithm). For any problem in which it is expensive to evaluate f, or n is large, or both, this cost may be unacceptable. By comparison, the examples given in Sections 7 and 8 yield formulas that will result in the computation of the gradient at a cost equal to a small multiple of the cost of computing the function itself. (Note, though, in the case of the adjoint state method detailed in Section 7, this efficiency depends on the use of the checkpointing scheme that was only briefly mentioned.)
The major advantage of using finite differences is obvious: the user need only implement the problem functions and not their derivatives. The optimization code can then take care of all details concerning the estimation of derivatives, including the choice of the step size h (although the user may need to provide an estimate of ε). When the cost is affordable and there is no need for high-accuracy solutions, this makes finite differences an attractive option. Although I will not discuss it here, finite difference methods can also be devised for computing Jacobian and Hessian matrices.
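A sketch of forward-difference gradient estimation in the spirit of (19) (the safeguard for small |x_i| is one of several reasonable choices):

```python
import math

# Forward-difference gradient with step h_i = sign(x_i)|x_i|sqrt(eps),
# safeguarded when |x_i| is small (one of several reasonable safeguards).
def fd_gradient(f, x, eps=2.2e-16):
    g = []
    fx = f(x)
    for i in range(len(x)):
        h = math.sqrt(eps) * max(abs(x[i]), 1.0)
        if x[i] < 0.0:
            h = -h
        xp = list(x)
        xp[i] += h
        h = xp[i] - x[i]             # use the exactly representable step
        g.append((f(xp) - fx) / h)
    return g

# Invented example: f(x) = x1^2 + 3*x1*x2 has gradient (2*x1 + 3*x2, 3*x1).
f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]
g = fd_gradient(f, [1.0, 2.0])       # approximately [8.0, 3.0]
```

Note the n extra function evaluations beyond f(x) itself, the cost discussed above.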
9.2. Automatic differentiation
Automatic, or algorithmic, differentiation (AD) is a term applied to a collection of techniques for automatically producing derivatives of functions implemented in computer subroutines.
AD tools can analyze a computer program that implements a mathematical function or oper-
ator, and systematically apply the rules of differentiation, notably the chain rule, producing
a new computer program implementing the desired derivative.
There are two primary approaches to automatic differentiation, operator overloading and
source transformation. I will only discuss the source transformation approach, and, indeed,
will focus on a single AD tool, TAMC (Giering, 1999). Another source transformation
tool is ADIFOR (Bischof et al., 1992). An AD package that uses the operator overloading
approach is ADOL-C (Griewank et al., 1996). For a more complete discussion of automatic
differentiation, see Griewank (2000). The following discussion is taken from the report
(Gockenbach, 2000), which may be consulted for more details.
The Tangent linear and Adjoint Model Compiler (TAMC), designed and implemented by Ralf Giering (1999), is an Automatic Differentiation (AD) package that produces linearized and adjoint code for nonlinear operators. To be more precise, given a Fortran subroutine implementing an operator of the form F : R^n → R^m, TAMC can produce code that computes DF(x) δx and DF(x)* ȳ. TAMC can also produce derivatives and adjoints for operators defined on product spaces, such as G : R^n × R^k → R^m.
Although TAMC produces correct and efficient code, the exact operation of TAMC-
generated code can be slightly counter-intuitive to those not well-versed in AD. I will
present an explicit mathematical model for an operator as implemented by a computer program, and explain the output of TAMC in terms of this model.
The following simple example will serve to introduce some of the issues encountered in using TAMC. Define F : R → R by y = F(x) = x^2. This operator is implemented in the following Fortran subroutine:

      subroutine F(x,y)
      double precision x,y
      y = x*x
      return
      end
TAMC generates the following adjoint code (stripped of TAMC-generated comments):
subroutine adF(x,adx,ady)
implicit none
double precision adx
double precision ady
double precision x
adx = adx+2*ady*x
ady = 0.d0
end
This code correctly computes the adjoint of DF(x); however, the value DF(x)* ȳ is added to (rather than assigned to) the output variable. Moreover, the input variable is assigned the value of zero after it is used. That is, instead of implementing

x̄ ← DF(x)* ȳ,
the TAMC-generated code implements
x̄ ← x̄ + DF(x)* ȳ,
ȳ ← 0.
Below I will show how this result could have been predicted.
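The convention can be mimicked in a few lines of Python (a toy reproduction for illustration, not TAMC output): the adjoint routine accumulates DF(x)* ȳ into the adjoint input variable and zeroes the adjoint output variable, and a dot-product test confirms the adjoint relationship:

```python
# Toy Python reproduction of the TAMC convention for F(x) = x^2:
# the adjoint routine computes (adx, ady) <- (adx + 2*x*ady, 0).
def F(x):
    return x * x

def adF(x, adx, ady):
    adx = adx + 2.0 * ady * x        # accumulate DF(x)* ady into adx
    ady = 0.0                        # the adjoint output variable is consumed
    return adx, ady

# Dot-product check: (DF(x) dx) * ybar == dx * (DF(x)* ybar).
x, dx, ybar = 3.0, 0.7, 1.3
tangent = 2.0 * x * dx               # DF(x) dx
adx, ady = adF(x, 0.0, ybar)         # adx = DF(x)* ybar since adx started at 0
# tangent * ybar and dx * adx agree to rounding error.
```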
9.2.1. The mathematical structure of a subroutine implementing an operator. Consider an operator F : R^n → R^m. A Fortran subroutine implementing y ← F(x) would have arguments x, y, as well as (possibly) other arguments involved in the definition of the operator (grid parameters, constants, etc.). The subroutine (which may call other subroutines) consists of a sequence of statements which together perform the desired calculation. A number of variables are involved: x and y, any variables required to hold intermediate quantities, loop control indices, etc. Some of the variables merely serve to control the flow of the executable statements, and are not important in developing a mathematical model of the subroutine. The crucial variables are the active variables, which are necessarily of floating point type. A variable u is active if

- the final value of an output variable (i.e. one of the components of y) depends on the value of u at some step i; and
- the value of u at step i depends on the initial value of an input variable (one of the components of x).
The input and output variables are active by definition. The phrase "variable w at step j depends on the value of variable z at step i" will be left undefined. The intuitive meaning is that the value of w at step j is linked to the value of z by a sequence of assignment statements, the last of which assigns a value to w, and the first of which (and perhaps others) has z, holding its value from step i, in the right-hand side. (A precise definition of this concept would be required to implement a package such as TAMC, but is not needed to understand how it works.)
Now let S be the set of all active variables appearing in the subroutine (or in subroutines called by it; I will not bother to make this distinction). Identifying S with a Euclidean space R^N, F : R^n → R^m can be viewed as the composition of operators

F = P ∘ F_M ∘ F_{M−1} ∘ · · · ∘ F_1 ∘ Q,

where

F_i : S → S, i = 1, 2, . . . , M

and Q and P are the natural projections onto the domain and range, respectively, of F. (That is, P : S → R^m is defined by

(Ps)_i = s_{j_i}, i = 1, 2, . . . , m,
where y_i = s_{j_i} (recall that each active variable, including every component of y, is identified with a component of s). The injection Q : R^n → S is defined similarly.) Each statement assigning a value to an active variable can be thought of as implicitly defining one of the operators F_i, and it is in the sense of these operators that I spoke of steps in the previous paragraph. Of course, most steps will only involve a few variables, so most of the active variables will retain their previous values. The role of the projectors Q and P should be clear: Q assigns to the input variables their initial values and assigns zero to all other variables; P extracts from the set of all active variables the output F(x).
There may be other floating-point variables that are not active by this definition. For example, if the final value of an output variable depends on the value of z at some step, but that value of z does not depend on the initial value of any input variable, then z plays the role of a constant. (Input arguments to the subroutine other than x can be constants.) It is also possible to have variables which depend on the input variables but do not influence any output variable. Such a variable can be called diagnostic. For an example of a diagnostic variable, consider the variable z in the following program fragment:
z = x(1)*x(1) + x(2)*x(2)
if (z .gt. 1.0d0) then
   y(1) = 2.0d0*x(1)
else
   y(1) = x(1)*x(1)
endif
Constant and diagnostic variables are called passive variables.
By way of example, consider an assignment statement of the form

w = g(u).

Assuming u and w are active variables, say u = s_{i1}, w = s_{i2} (s ∈ S), this implicitly defines the operator F_k: S → S,

(F_k(s))_j = g(s_{i1}) for j = i2, and (F_k(s))_j = s_j for j ≠ i2.
It is instructive to compute D F_k(s) and (D F_k(s))^*. The derivative is given by

(D F_k(s) δs)_j = g'(s_{i1}) δs_{i1} for j = i2, and (D F_k(s) δs)_j = δs_j for j ≠ i2.

Therefore,

⟨D F_k(s) δs, r⟩ = Σ_{j ≠ i2} δs_j r_j + g'(s_{i1}) δs_{i1} r_{i2}
                = Σ_{j ≠ i1, i2} δs_j r_j + δs_{i1} (r_{i1} + g'(s_{i1}) r_{i2}) + δs_{i2} · 0,

from which it follows that the adjoint is given by

((D F_k(s))^* r)_j = r_{i1} + g'(s_{i1}) r_{i2} for j = i1, 0 for j = i2, and r_j otherwise.
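This computation encodes the adjoint identity ⟨D F_k(s) δs, r⟩ = ⟨δs, (D F_k(s))^* r⟩, which can be checked numerically. The following is a minimal Python sketch (mine, not the paper's; the choice g = sin and the index values are arbitrary) of the step F_k, its derivative, and its adjoint:

```python
import math

def Fk(s, i1, i2, g):
    # the step w = g(u): overwrite component i2 with g(s[i1])
    t = list(s)
    t[i2] = g(s[i1])
    return t

def DFk(s, ds, i1, i2, gprime):
    # derivative: (DFk(s) ds)_j = g'(s_i1)*ds_i1 if j == i2, else ds_j
    dt = list(ds)
    dt[i2] = gprime(s[i1]) * ds[i1]
    return dt

def DFk_adj(s, r, i1, i2, gprime):
    # adjoint: r_i1 += g'(s_i1)*r_i2, then r_i2 = 0
    ra = list(r)
    ra[i1] = r[i1] + gprime(s[i1]) * r[i2]
    ra[i2] = 0.0
    return ra

# dot-product test: <DFk(s) ds, r> == <ds, DFk(s)* r>
s  = [1.0, 2.0, 3.0]
ds = [0.1, 0.2, 0.3]
r  = [4.0, 5.0, 6.0]
lhs = sum(a * b for a, b in zip(DFk(s, ds, 0, 2, math.cos), r))
rhs = sum(a * b for a, b in zip(ds, DFk_adj(s, r, 0, 2, math.cos)))
assert abs(lhs - rhs) < 1e-12
```

Agreement of lhs and rhs (the "dot-product test") is the standard check that an adjoint routine is consistent with the corresponding derivative routine.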
F_1(x,y) = (x, x²),  P(x,y) = y.

The interpretation adopted by TAMC is that the subroutine implements the operator

G_1(x,y) = (x, x²).

Note that

D G_1(x,y)(δx, δy) = (δx, 2x δx),
(D G_1(x,y))^* (δx, δy) = (δx + 2x δy, 0).

The derivative of F is given by

D F(x) δx = 2x δx.
TAMC produces the following subroutine for the derivative:
subroutine g_f(x,g_x,g_y)
implicit none
double precision g_x
double precision g_y
double precision x
g_y = 2*g_x*x
end
This subroutine performs the computation

D G_1(x,y)(δx, δy) = (δx, 2x δx);

that is, it performs the operation

δx ↦ δx (implicitly),
δy ↦ 2x δx.

In this case, the subroutine can also be regarded as performing the desired operation

δy ← D F(x) δx.
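For illustration, the effect of g_f can be mimicked in Python (a sketch of mine, not TAMC output); the single statement implements δy = 2x δx:

```python
def g_f(x, dx):
    # mimics the TAMC routine g_f for F(x) = x^2: g_y = 2*g_x*x
    return 2.0 * dx * x

# forward mode: dy = DF(x) dx = 2*x*dx
dy = g_f(3.0, 0.5)
assert dy == 2.0 * 3.0 * 0.5  # = 3.0
```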
The adjoint of D F(x) is given by

D F(x)^* δy = 2x δy.
TAMC produces the following adjoint code:
subroutine adf(x,adx,ady)
implicit none
double precision adx
double precision ady
double precision x
adx = adx+2*ady*x
ady = 0.d0
end
This subroutine performs the operation
(δx, δy) ↦ (D G_1(x,y))^* (δx, δy),
as would be predicted by my discussion above.
In terms of the original operator F, the Fortran command
call adf(x,dx,dy)
does the following:
- adds the value D F(x)^* δy to dx, assuming that x and dy have been initialized to hold the values of x and δy, respectively;
- sets dy to zero.
To use adf in the desired manner (given x, δy, compute δx = D F(x)^* δy), one would have to perform the following steps:
1. initialize x and dy to the values of x and δy, respectively;
2. set dx to zero;
3. save dy (assuming that its value is wanted later);
4. call adf(x,dx,dy);
5. restore the value of dy.
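These steps can be sketched in Python (my translation, for illustration; a dictionary stands in for Fortran's pass-by-reference arguments):

```python
def adf(state):
    # Python rendering of the TAMC routine adf:
    # adx = adx + 2*ady*x ; ady = 0
    state["adx"] = state["adx"] + 2.0 * state["ady"] * state["x"]
    state["ady"] = 0.0

def apply_adjoint(x, dy):
    """Compute dx = DF(x)* dy = 2*x*dy via the five-step protocol."""
    state = {"x": x, "ady": dy, "adx": 0.0}  # steps 1-2: initialize, zero dx
    saved_dy = state["ady"]                  # step 3: save dy
    adf(state)                               # step 4: call adf
    state["ady"] = saved_dy                  # step 5: restore dy
    return state["adx"]

assert apply_adjoint(3.0, 1.5) == 2.0 * 3.0 * 1.5  # DF(x)* dy = 2x dy
```

The save/restore of dy is needed only because the generated adjoint zeroes its input.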
Alternatively, one could hand-edit the routine adf so as to
1. replace the "add-to" statements with simple assignments;
2. remove the statements that change the input variable (dy in this example).
As a second example, suppose the operator
F(x,y) = x²y
is implemented so that the result F(x,y) overwrites one of the inputs, say (arbitrarily) y.
This is done in the following subroutine:
subroutine F(x, y)
double precision x,y,w
w = x*x
y = w*y
return
end
The active variables are now x, y, w, and F can be written as F = P ∘ F_2 ∘ F_1 ∘ Q, where

F_1(x,y,w) = (x, y, x²),  F_2(x,y,w) = (x, wy, w).

TAMC regards the subroutine as implementing G_2 ∘ G_1, where, again, G_1 = F_1 and G_2 = F_2. Now,

D G_1(x,y,w)(δx, δy, δw) = (δx, δy, 2x δx),
D G_2(x,y,w)(δx, δy, δw) = (δx, w δy + y δw, δw),

and

(D G_1(x,y,w))^* (δx, δy, δw) = (δx + 2x δw, δy, 0),
(D G_2(x,y,w))^* (δx, δy, δw) = (δx, w δy, δw + y δy).
The TAMC-generated code for the derivative is:
subroutine g_f(x,y,g_x,g_y)
implicit none
double precision g_x
double precision g_y
double precision x
double precision y
double precision g_w
double precision w
g_w = 2*g_x*x
w = x*x
g_y = g_w*y+g_y*w
end
The first executable statement computes the action of D G_1(x,y,w), the second computes G_1(x,y,w), and the third computes the action of D G_2(G_1(x,y,w)). The behavior of this subroutine is exactly as expected, and as desired; the Fortran command

call g_f(x,y,dx,dy)

overwrites dy with D F(x,y)(δx, δy), assuming that x, y, dx, and dy have previously been initialized with the values x, y, δx, and δy, respectively.
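In Python (again my translation, for illustration), the three executable statements and the claimed behavior can be checked as follows:

```python
def g_f(x, y, dx, dy):
    # mimics the TAMC routine g_f for F(x,y) = x^2 * y (result overwrites dy)
    g_w = 2.0 * dx * x       # action of DG1
    w = x * x                # recompute G1
    return g_w * y + dy * w  # action of DG2

x, y, dx, dy = 2.0, 3.0, 0.5, 0.25
# DF(x,y)(dx,dy) = 2*x*y*dx + x**2*dy
assert g_f(x, y, dx, dy) == 2.0 * x * y * dx + x * x * dy  # = 7.0
```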
TAMC generates the following code for the adjoint:
subroutine adf(x,y,adx,ady)
implicit none
double precision adx
double precision ady
double precision x
double precision y
double precision adw
double precision w
adw = 0.d0
w = x*x
adw = adw+ady*y
ady = ady*w
adx = adx+2*adw*x
adw = 0.d0
end
This subroutine initializes the local variable adw to zero (the first executable statement), computes G_1(x,y,w) (the second statement), applies (D G_2(G_1(x,y,w)))^* (statements three and four), and applies (D G_1(x,y,w))^* (statements five and six). This behavior is consistent with my description above; note that its effect is the following:

δy ↦ x² δy,
δx ↦ δx + 2xy δy.

The desired behavior of the subroutine is

δy ↦ x² δy,
δx ↦ 2xy δy.
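The discrepancy is easy to see in a Python rendering of adf (my translation, for illustration):

```python
def adf(x, y, adx, ady):
    # mimics the TAMC adjoint routine for F(x,y) = x^2 * y
    adw = 0.0                  # initialize local adjoint variable
    w = x * x                  # recompute G1
    adw = adw + ady * y        # (DG2)*, statements three and four
    ady = ady * w
    adx = adx + 2.0 * adw * x  # (DG1)*, statements five and six
    adw = 0.0
    return adx, ady

x, y, dx, dy = 2.0, 3.0, 0.5, 0.25
adx, ady = adf(x, y, dx, dy)
assert ady == x * x * dy             # as desired: x^2 * dy
assert adx == dx + 2.0 * x * y * dy  # unwanted add of the initial dx
```

Initializing dx to zero before the call yields the desired δx = 2xy δy.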
The apparent advantages of AD are:
- The user avoids the labor-intensive and error-prone task of implementing derivatives by hand.
- If it is necessary to modify the original function, its derivatives can be modified automatically by the AD tool, again saving time spent writing and debugging code.
- Modern AD tools can handle very complex code and produce, in many cases, efficient derivative code.
The disadvantages are not quite so obvious, and stem from the fact that the foregoing advantages are not quite realized:

- My discussion of the code generated by TAMC shows that, in fact, the code produced by a fully automatic AD tool may need some modification by hand to achieve the desired results as efficiently as possible.
- Code produced by a fully automatic AD tool can be inefficient for certain applications. An example of this is provided by the adjoint state calculation of Section 7; an AD tool would tend to either save or re-compute all of the intermediate computations needed for the reverse time-stepping calculation. Either approach is significantly inefficient in some applications.
In summary, my view is that automatic differentiation is very useful in a variety of situations, but it must be used with care if efficiency is a prime consideration. In particular, fully automatic differentiation may not be satisfactory for some applications.⁴
Notes
1. I am far from the first to notice this. See, for example, Groetsch (1980):

   A closely guarded secret in some elementary calculus courses is the fact that the basic idea of differential calculus is the local approximation of a nonlinear function by a linear function. To quote from Dieudonne (1969), "In the classical teaching of calculus, this idea is immediately obscured by the accidental fact that, on a one-dimensional vector space, there is a one-to-one correspondence between linear forms and numbers, and therefore the derivative at a point is defined to be a number instead of a linear form."

   The texts of Dieudonne and Groetsch are general references for the material in this paper; see also the survey paper by Tapia (1971).
2. Probably the most common difficulty encountered when using packaged optimization software results from the user providing incorrect derivatives.
3. It is possible to compute gradients without referring to coordinates explicitly, for instance in the formula ∇J(x) = D F(x)^* (F(x) − d) for J(x) = (1/2)‖F(x) − d‖². However, finite-difference derivatives depend on an explicit coordinate representation, and so I may as well assume that the function is defined on R^n.
4. See Griewank (2000), page 92:
   As a rule, a general-purpose AD tool will not produce transformed code as efficient as that produced by a special-purpose translator designed to work only with underlying code of a particular structure, since the latter can make assumptions (often with far-reaching consequences), whereas the former can only guess.
A tool that requires the user to rewrite the underlying code in order to make explicit such factors as variable
dependence, structural sparsity, interface width, and memory access patterns will be able to produce efficient
transformed code more easily than a tool that must use internal analysis of the underlying program to deduce
these structural features, but that does not require such a great effort from the user. On the other hand, the
first sort of tool is difficult to apply to legacy code. A possible compromise is a tool that allows, but does not
require, the user to insert directives into the program text.
The latest version of TAMC does allow user-defined directives, making efficient code more attainable.
References
C. Bischof, A. Carle, G. Corliss, A. Griewank, and P. Hovland, "Adifor: Generating derivative code from Fortran programs," Scientific Programming, Vol. 1, pp. 1-29, 1992.
J. Dieudonne, Foundations of Modern Analysis, Academic Press: New York, London, 1969.
R. Giering, Tangent Linear and Adjoint Model Compiler, User Manual 1.4, 1999. URL: http://puddle.mit.edu/ralf/tamc
M. S. Gockenbach, "Understanding code generated by TAMC," Department of Computational and Applied Mathematics, Rice University, Houston, TX, Technical Report TR00-30, 2000.
M. S. Gockenbach, D. R. Reynolds, and W. W. Symes, "Efficient and automatic implementation of the adjoint state method," in press.
A. Griewank, "Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation," Optimization Methods and Software, Vol. 1, pp. 35-54, 1992.
A. Griewank, D. Juedes, and J. Utke, "ADOL-C, a package for the automatic differentiation of algorithms written in C/C++," ACM TOMS, Vol. 22, pp. 131-167, 1996.
A. Griewank, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, SIAM: Philadelphia, 2000.
C. W. Groetsch, Elements of Applicable Functional Analysis, Marcel Dekker: New York, 1980.
W. W. Symes, J. O. Blanch, and R. Versteeg, "A numerical study of linear inversion in layered viscoacoustic media," in Comparison of Seismic Inversion Methods on a Single Real Dataset, R. Keys and D. Foster, eds., Society of Exploration Geophysicists: Tulsa, 1998.
R. Tapia, "The differentiation and integration of nonlinear operators," in Nonlinear Functional Analysis and Applications, L. Rall, ed., Academic Press: New York, 1971.