8/12/2019 Gockenbach Diff Calculus
1/55
Optimization and Engineering, 2, 75-129, 2001
© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.
A Primer on Differentiation
MARK S. GOCKENBACH
Department of Mathematical Sciences, Michigan Technological University, 1400 Townsend Drive, Houghton, MI 49931-1295, USA
Received February 4, 2000; Revised April 4, 2001
Abstract. The central idea of differential calculus is that the derivative of a function defines the best local linear
approximation to the function near a given point. This basic idea, together with some representation theorems
from linear algebra, unifies the various derivatives (gradients, Jacobians, Hessians, and so forth) encountered
in engineering and optimization. The basic differentiation rules presented in calculus classes, notably the product
and chain rules, allow the computation of the gradients and Hessians needed by optimization algorithms, even
when the underlying operators are quite complex. Examples include the solution operators of time-dependent and
steady-state partial differential equations. Alternatives to the hand-coding of derivatives are finite differences and
automatic differentiation, both of which save programming time at the possible cost of run-time efficiency.
Keywords: differentiation, solution operators, finite differences, automatic differentiation
1. Introduction
Throughout their study of calculus, students are introduced to derivatives of various types.
These include:
- The (ordinary) derivative $f'(x)$ of a real-valued function $f$ of a single variable. The number $f'(x_0)$ is the slope of the line tangent to the graph of $y = f(x)$ at $x = x_0$. It is also interpreted as the instantaneous rate of change of $y = f(x)$ at $x = x_0$.
- The partial derivatives
\[
\frac{\partial g}{\partial x_1}(x_1, x_2, \ldots, x_n),\ \frac{\partial g}{\partial x_2}(x_1, x_2, \ldots, x_n),\ \ldots,\ \frac{\partial g}{\partial x_n}(x_1, x_2, \ldots, x_n)
\]
of a real-valued function of several variables. These numbers are interpreted as the instantaneous rates of change of $y = g(x_1, x_2, \ldots, x_n)$ as one variable is changed and the others held fixed.
- The gradient vector
\[
\nabla g(x_1, x_2, \ldots, x_n) =
\begin{pmatrix}
\dfrac{\partial g}{\partial x_1}(x_1, x_2, \ldots, x_n) \\
\dfrac{\partial g}{\partial x_2}(x_1, x_2, \ldots, x_n) \\
\vdots \\
\dfrac{\partial g}{\partial x_n}(x_1, x_2, \ldots, x_n)
\end{pmatrix}.
\]
least in the United States) without encountering a course that makes this principle explicit.1
Moreover, the elementary rules of differentiation as learned in calculus courses (the product rule, chain rule, and so forth) can leave a student ill-prepared to compute derivatives
of the complicated functions and operators that arise in advanced engineering and applied
mathematics research.
The purpose of this paper is to explain the concept of derivative from the point of view
of local linear approximation, to show how the various types of derivatives mentioned
above fit into the concept, and to work through several important and nontrivial exam-
ples. In the following section, I discuss the basic definitions and notation needed. The
setting for these definitions is a normed vector space, that is, a vector space with a norm. For
this reason, linear algebra is important. In Section 3, I present the elementary represen-
tation theorems of linear algebra, and show how they lead to the various scalars, vectors,
and matrices that arise in calculus courses in the context of differentiation. This is followed by a brief discussion of the rules of differentiation (Section 4), simple representations for operators on infinite-dimensional spaces (Section 5), and second derivatives (Section 6). In addition to several examples included in the sections described above, I discuss two more involved examples: the adjoint state method for handling finite difference solution operators (Section 7), and a direct computation of the derivative (and its adjoint) of a finite element solution operator (Section 8). Finally, in Section 9, I discuss two alternatives to programming derivatives by hand: finite differences and automatic
differentiation.
Throughout this paper, the emphasis is on the structure of maps and their derivatives, not
on the analytic details. Therefore, most technical proofs are omitted.
2. Definitions and notation
2.1. Normed vector spaces; inner products
The various derivatives described in the introduction can all be discussed in the context of
a function (operator, map) $f$ mapping one Euclidean space into another. I will write $\mathbf{R}^n$ for Euclidean $n$-space, and denote a vector $x \in \mathbf{R}^n$ as $x = (x_1, x_2, \ldots, x_n)$ or
\[
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.
\]
Note that $\mathbf{R}^1$ is (isomorphic to) $\mathbf{R}$, the set of real numbers.
The following examples were discussed in the introduction:
- $f : \mathbf{R} \to \mathbf{R}$, a real-valued function of a single variable;
- $f : \mathbf{R}^n \to \mathbf{R}$, a real-valued function of several variables;
- $f : \mathbf{R}^n \to \mathbf{R}^m$, a vector-valued function of several variables;
- $f : \mathbf{R} \to \mathbf{R}^n$, a vector-valued function of a single variable.
From now on I adopt vector notation and write, for example, $g(x)$ instead of $g(x_1, x_2, \ldots, x_n)$. Also, I distinguish vectors from scalars only by context.
Now, Euclidean $n$-space is equipped with an inner product, namely, the dot product:
\[
(x, y) = x \cdot y = \sum_{i=1}^n x_i y_i, \quad x, y \in \mathbf{R}^n.
\]
A more general setting is an inner product space (which need not be Euclidean or finite-dimensional). An inner product space is just a vector space $V$ with an inner product $(\cdot, \cdot)_V$, which is a mapping from $V \times V$ into $\mathbf{R}$ satisfying the following properties:
- $(\alpha u + \beta v, w)_V = \alpha (u, w)_V + \beta (v, w)_V$ for all $u, v, w \in V$, $\alpha, \beta \in \mathbf{R}$;
- $(u, v)_V = (v, u)_V$ for all $u, v \in V$;
- $(v, v)_V \ge 0$ for all $v \in V$, and $(v, v)_V = 0$ if and only if $v = 0$.
An inner product on $V$ induces a norm $\|\cdot\|_V$ on $V$:
\[
\|v\|_V = \sqrt{(v, v)_V} \quad \text{for all } v \in V.
\]
It is sometimes necessary to work with norms that are not defined by inner products. A general norm $\|\cdot\|_U$ on a vector space $U$ is a mapping from $U$ into $\mathbf{R}$ satisfying
- $\|u\|_U \ge 0$ for all $u \in U$, and $\|u\|_U = 0$ if and only if $u = 0$;
- $\|\alpha u\|_U = |\alpha| \|u\|_U$ for all $u \in U$, $\alpha \in \mathbf{R}$;
- $\|u + v\|_U \le \|u\|_U + \|v\|_U$ for all $u, v \in U$ (the triangle inequality).
It can be shown that if $(\cdot, \cdot)_V$ is an inner product on a vector space $V$, then $\|v\|_V = \sqrt{(v, v)_V}$ defines a norm on $V$.
The reason I discuss vector spaces more general than Euclidean space is that many practical problems cannot be described using finite-dimensional spaces. For example, suppose $\Omega$ is an open subset of $\mathbf{R}^n$ and $f : \Omega \to \mathbf{R}^n$. Then under appropriate conditions on $f$, for any closed and bounded subset $W$ of $\Omega$, there exists $\epsilon > 0$ such that, for each $x_0 \in W$, the Initial Value Problem (IVP)
\[
\begin{aligned}
x' &= f(x), \\
x(0) &= x_0
\end{aligned}
\tag{1}
\]
has a unique solution $x : [-\epsilon, \epsilon] \to \mathbf{R}^n$. Thus the IVP (1) defines an operator $S : W \to (C[-\epsilon, \epsilon])^n$, where $(C[-\epsilon, \epsilon])^n$ is the space of all continuous functions $u : [-\epsilon, \epsilon] \to \mathbf{R}^n$. This space is infinite-dimensional and therefore cannot be identified with any Euclidean space. I pursue this example in Section 5.5, where I compute the derivative of $S$.
2.2. Definition of the derivative
Now suppose $X$ and $Y$ are normed linear spaces, suppose $U$ is an open subset of $X$, and assume that $f : U \to Y$. As I explained in Section 2.1, the types of functions encountered in calculus all fit under this description, as do many other important examples.
First recall the following definition.
Definition 2.1. Suppose $X$ and $Y$ are vector spaces, and $L : X \to Y$. The operator $L$ is linear if
\[
L(x + z) = Lx + Lz \quad \text{for all } x, z \in X
\]
and
\[
L(\alpha x) = \alpha Lx \quad \text{for all } x \in X,\ \alpha \in \mathbf{R}
\]
(or, more concisely, $L(\alpha x + \beta z) = \alpha Lx + \beta Lz$ for all $x, z \in X$ and $\alpha, \beta \in \mathbf{R}$).
Next is the fundamental definition in this paper.
Definition 2.2. Let $x \in U$. Suppose there is a continuous linear operator $L : X \to Y$ such that
\[
\lim_{\Delta x \to 0} \frac{\| f(x + \Delta x) - f(x) - L \Delta x \|_Y}{\|\Delta x\|_X} = 0.
\]
Then $f$ is said to be differentiable at $x$, and $L$ is called the derivative of $f$ at $x$, denoted $L = Df(x)$.
According to this definition, if $f$ is differentiable at $x$, then $Df(x)$ defines a linear approximation to $f$ near $x$; indeed, if
\[
E(x, \Delta x) = f(x + \Delta x) - f(x) - Df(x)\Delta x,
\]
then
\[
f(x + \Delta x) = f(x) + Df(x)\Delta x + E(x, \Delta x)
\]
and
\[
\frac{\|E(x, \Delta x)\|_Y}{\|\Delta x\|_X} \to 0 \quad \text{as } \Delta x \to 0.
\]
This last condition is abbreviated by
\[
E(x, \Delta x) = o(\|\Delta x\|_X) \quad \text{as } \Delta x \to 0
\]
(read "$E(x, \Delta x)$ is little-oh of $\|\Delta x\|_X$"), which indicates that the error $E(x, \Delta x)$ is small compared to $\|\Delta x\|_X$ when $\|\Delta x\|_X$ is small. It is easy to show that, if $f$ is differentiable at $x$, then no other linear map $K : X \to Y$ defines a better local linear approximation to $f$ near $x$; that is, if $K \ne Df(x)$, then the error in the approximation
\[
f(x + \Delta x) \approx f(x) + K \Delta x
\]
is larger than the error in the approximation
\[
f(x + \Delta x) \approx f(x) + Df(x)\Delta x,
\]
in the sense that
\[
\lim_{\Delta x \to 0} \frac{\| f(x + \Delta x) - f(x) - K \Delta x \|_Y}{\|\Delta x\|_X} \ne 0.
\]
Now, in addition to the basic definition of derivative just given, there is really just one key idea in this paper: the linear map $Df(x)$ has different representations, depending on the particular $X$ and $Y$ involved. There is an underlying question here, which properly belongs to linear algebra (or functional analysis when the spaces are infinite-dimensional): Given two normed vector spaces $X$ and $Y$, find a convenient representation for a continuous linear map $L : X \to Y$. I will address this question in Sections 3 and 5 below; here I preview those sections by answering the question for $X = Y = \mathbf{R}$.
Now suppose that $L : \mathbf{R} \to \mathbf{R}$ is linear. If $a = L(1)$, which is a real number, then, since $x = x \cdot 1$ for all $x \in \mathbf{R}$,
\[
Lx = x L(1) = a x.
\]
Thus, if $L : \mathbf{R} \to \mathbf{R}$ is linear, there is a real number $a \in \mathbf{R}$ such that
\[
Lx = a x \quad \text{for all } x \in \mathbf{R}.
\]
That is, a linear map from $\mathbf{R}$ to $\mathbf{R}$ is represented by a real number. Therefore, if $f : \mathbf{R} \to \mathbf{R}$ has a derivative at $x$, it is customary in elementary calculus courses to define $f'(x)$ to be the number
\[
f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x},
\]
which is equivalent to
\[
\lim_{\Delta x \to 0} \frac{| f(x + \Delta x) - f(x) - f'(x)\Delta x |}{|\Delta x|} = 0.
\]
It now becomes clear that, under this definition, the number $f'(x)$ is just the representer of the linear map $Df(x)$.
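This one-dimensional picture is easy to test numerically. The following sketch (my illustration, not from the paper; the choice $f = \sin$ and the step sizes are arbitrary) shows the difference quotient converging to the representer $f'(x) = \cos x$:

```python
import math

# The number f'(x) is the representer of the linear map Df(x): R -> R.
# For f(x) = sin(x), the representer is cos(x); the difference quotient
# should approach it as dx -> 0.
f = math.sin
x = 1.0
exact = math.cos(x)  # known representer of Df(x)

quotients = [(f(x + dx) - f(x)) / dx for dx in (1e-1, 1e-3, 1e-5)]
errors = [abs(q - exact) for q in quotients]
# Forward differences are O(dx), so the error shrinks roughly linearly with dx.
```

The shrinking errors illustrate exactly the little-oh condition above: the residual of the linear approximation vanishes faster than $\Delta x$ itself.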
It may seem overly pedantic to distinguish between the linear map and its representer.
However, when the vector spaces $X$ and $Y$ are not both one-dimensional, I believe it is
essential to make the distinction. Before I go on to more examples in Section 3, where this
should become clearer, I need to define continuity of the derivative, and also the concept of
partial derivatives.
2.3. Continuity of the derivative
I assume again that $f : U \to Y$, where $X$ and $Y$ are normed vector spaces and $U \subset X$ is open. If $f$ is differentiable at each $x \in U$, then $Df$ becomes an operator; for each $x \in U$, $Df(x)$ belongs to $L(X, Y)$, the space of all continuous linear maps from $X$ into $Y$:
\[
Df : U \to L(X, Y).
\]
When $f$ is differentiable at every $x \in U$, $f$ is simply said to be differentiable. Now, $L(X, Y)$ is a vector space, since there is a natural way to add operators and multiply them by scalars, and continuity and linearity are obviously preserved by these operations. Also, $L(X, Y)$ has a natural norm:
\[
\|L\|_{L(X,Y)} = \sup \left\{ \frac{\|Lx\|_Y}{\|x\|_X} : x \in X,\ x \ne 0 \right\}.
\]
Note that this definition of norm implies
\[
\|Lx\|_Y \le \|L\|_{L(X,Y)} \|x\|_X \quad \text{for all } x \in X.
\]
The norm of an operator thus measures the largest factor by which the operator stretches or magnifies any vector in its domain. It can be shown that a linear operator $L : X \to Y$ is continuous if and only if $\|L\|_{L(X,Y)} < \infty$.
2.4. Partial derivatives

Suppose now that $f : X \times Y \to Z$, where $X$, $Y$, and $Z$ are normed vector spaces. (Of course, $f$ might be defined on a subset of $X \times Y$, but the exposition is simpler if I assume that the domain of $f$ is all of $X \times Y$.) Since $X \times Y$ is a vector space, an operator like $f$ is just another example that fits into the discussion above: $f$ is differentiable at $(x, y)$ if there is a continuous linear operator $L : X \times Y \to Z$ such that
\[
\lim_{(\Delta x, \Delta y) \to (0,0)} \frac{\| f(x + \Delta x, y + \Delta y) - f(x, y) - L(\Delta x, \Delta y) \|_Z}{\|(\Delta x, \Delta y)\|_{X \times Y}} = 0.
\]
(For the norm on $X \times Y$, the obvious choices are $\|(x, y)\| = \sqrt{\|x\|_X^2 + \|y\|_Y^2}$, $\|(x, y)\| = \max\{\|x\|_X, \|y\|_Y\}$, and $\|(x, y)\| = \|x\|_X + \|y\|_Y$. I will only use the property that $\|(x, 0)\|_{X \times Y} = \|x\|_X$ and $\|(0, y)\|_{X \times Y} = \|y\|_Y$, which holds for any of the above.) On the other hand, given any $y \in Y$,
\[
g(x) = f(x, y) \quad \text{for all } x \in X
\]
defines an operator $g : X \to Z$. Similarly, for any $x \in X$,
\[
h(y) = f(x, y) \quad \text{for all } y \in Y
\]
defines an operator $h : Y \to Z$. The question now arises: What is the relationship between $Df$, $Dg$, and $Dh$?
The answer to this question is very simple when the structure of operators in $L(X \times Y, Z)$ is understood.

Theorem 2.3. Let $X$, $Y$, and $Z$ be normed linear spaces. Then $L \in L(X \times Y, Z)$ if and only if there exist $L_1 \in L(X, Z)$ and $L_2 \in L(Y, Z)$ such that
\[
L(x, y) = L_1 x + L_2 y \quad \text{for all } x \in X,\ y \in Y.
\]
Proof: Suppose $L \in L(X \times Y, Z)$. Define $L_1 \in L(X, Z)$ by
\[
L_1 x = L(x, 0) \quad \text{for all } x \in X,
\]
and $L_2 \in L(Y, Z)$ by
\[
L_2 y = L(0, y) \quad \text{for all } y \in Y.
\]
It is easy to prove that $L_1$ and $L_2$ are indeed linear and bounded. Moreover, for any $(x, y) \in X \times Y$,
\[
L(x, y) = L((x, 0) + (0, y)) = L(x, 0) + L(0, y) = L_1 x + L_2 y,
\]
as desired.
On the other hand, it is easy to verify that if $L_1 \in L(X, Z)$, $L_2 \in L(Y, Z)$, and $L : X \times Y \to Z$ is defined by
\[
L(x, y) = L_1 x + L_2 y \quad \text{for all } (x, y) \in X \times Y,
\]
then $L \in L(X \times Y, Z)$.
It is now easy to prove the following theorem.
Theorem 2.4. Let $X$, $Y$, and $Z$ be normed linear spaces, suppose $f : X \times Y \to Z$, and let $(x_0, y_0) \in X \times Y$. Define $g : X \to Z$ by $g(x) = f(x, y_0)$ and $h : Y \to Z$ by $h(y) = f(x_0, y)$. Suppose $f$ is differentiable at $(x_0, y_0)$. Then $g$ is differentiable at $x_0$, $h$ is differentiable at $y_0$, and
\[
Df(x_0, y_0)(\Delta x, \Delta y) = Dg(x_0)\Delta x + Dh(y_0)\Delta y.
\]
The operators $Dg(x_0)$ and $Dh(y_0)$ are called the partial derivatives of $f$, and are denoted $D_x f(x_0, y_0)$ and $D_y f(x_0, y_0)$, respectively. Thus
\[
Df(x_0, y_0)(\Delta x, \Delta y) = D_x f(x_0, y_0)\Delta x + D_y f(x_0, y_0)\Delta y.
\]
Proof: By the preceding theorem, there exist $L_1 \in L(X, Z)$ and $L_2 \in L(Y, Z)$ such that
\[
Df(x_0, y_0)(\Delta x, \Delta y) = L_1 \Delta x + L_2 \Delta y \quad \text{for all } \Delta x \in X,\ \Delta y \in Y.
\]
In particular,
\[
Df(x_0, y_0)(\Delta x, 0) = L_1 \Delta x,
\]
so
\[
\lim_{\Delta x \to 0} \frac{\| f(x_0 + \Delta x, y_0) - f(x_0, y_0) - L_1 \Delta x \|_Z}{\|(\Delta x, 0)\|_{X \times Y}} = 0.
\]
This is equivalent to
\[
\lim_{\Delta x \to 0} \frac{\| g(x_0 + \Delta x) - g(x_0) - L_1 \Delta x \|_Z}{\|\Delta x\|_X} = 0,
\]
so $L_1 = Dg(x_0)$. Similarly, $L_2 = Dh(y_0)$, and the proof is complete.
Note that, for example, $D_x f(x, y) \in L(X, Z)$, that is,
\[
D_x f : X \times Y \to L(X, Z).
\]
Similarly,
\[
D_y f : X \times Y \to L(Y, Z).
\]
The following theorem is only slightly harder to prove.
Theorem 2.5. Suppose $X$, $Y$, and $Z$ are normed linear spaces, $f : X \times Y \to Z$, and the partial derivatives of $f$, $D_x f(x, y)$ and $D_y f(x, y)$, exist and are continuous on an open set $U \subset X \times Y$. Then $f$ is $C^1$ on $U$, and
\[
Df(x, y)(\Delta x, \Delta y) = D_x f(x, y)\Delta x + D_y f(x, y)\Delta y.
\]
Note that the continuity of $D_x f$ and $D_y f$ is necessary; it is not the case that if $D_x f(x_0, y_0)$ and $D_y f(x_0, y_0)$ exist, then $Df(x_0, y_0)$ must exist.
These results obviously generalize to an operator of the form $f : X_1 \times X_2 \times \cdots \times X_n \to Z$; the basic equation is
\[
Df(x)\Delta x = D_{x_1} f(x)\Delta x_1 + D_{x_2} f(x)\Delta x_2 + \cdots + D_{x_n} f(x)\Delta x_n,
\]
where $x, \Delta x \in X_1 \times X_2 \times \cdots \times X_n$.
3. Representation of linear operators on Euclidean spaces
3.1. The basic theorem
I will now give the fundamental representation theorem for linear operators on Euclidean spaces. Specializing this result to the various contexts described in the introduction ($\mathbf{R} \to \mathbf{R}$, $\mathbf{R}^n \to \mathbf{R}$, $\mathbf{R}^n \to \mathbf{R}^m$, and $\mathbf{R} \to \mathbf{R}^n$) will account for the various types of derivatives described there.
Theorem 3.1. Let $L : \mathbf{R}^n \to \mathbf{R}^m$. Then $L$ is linear if and only if there is an $m \times n$ matrix $A$ such that
\[
Lx = Ax \quad \text{for all } x \in \mathbf{R}^n.
\]
Proof: Let $\{e_1, e_2, \ldots, e_n\}$ be the standard basis for $\mathbf{R}^n$ (so the $i$th component of $e_i$ is one and all other components are zero). Let $c_1 = Le_1$, $c_2 = Le_2$, $\ldots$, $c_n = Le_n$, and define $A$ to be the $m \times n$ matrix whose columns are the vectors $c_1, c_2, \ldots, c_n$. That is, $A_{ij}$ is $(c_j)_i$, the $i$th component of the vector $c_j$. Then, since
\[
x = x_1 e_1 + x_2 e_2 + \cdots + x_n e_n,
\]
the linearity of $L$ yields
\[
Lx = x_1 Le_1 + x_2 Le_2 + \cdots + x_n Le_n = x_1 c_1 + x_2 c_2 + \cdots + x_n c_n.
\]
However, by the definition of matrix multiplication,
\[
Ax = x_1 c_1 + x_2 c_2 + \cdots + x_n c_n
\]
also holds. Thus
\[
Lx = Ax \quad \text{for all } x \in \mathbf{R}^n.
\]
Thus every linear operator $L : \mathbf{R}^n \to \mathbf{R}^m$ can be represented by an $m \times n$ matrix. If $f : \mathbf{R}^n \to \mathbf{R}^m$ is differentiable, then $Df(x) : \mathbf{R}^n \to \mathbf{R}^m$ is linear. Therefore, there is an $m \times n$ matrix $J$ representing $Df(x)$. This matrix $J$ turns out to be the Jacobian matrix mentioned in the introduction. To see this, it is convenient to first consider certain special cases.
3.2. Representation of derivatives in special cases
3.2.1. $m = n = 1$. In the special case $m = n = 1$, so that $f$ is a real-valued function of a real variable, $Df(x)$ is represented by a single number $f'(x)$, as was already shown.
3.2.2. $m = 1$, $n > 1$. In the case $m = 1$, $n > 1$, so that $f$ is a real-valued function of several variables, the result of Section 2.4 applies:
\[
Df(x)\Delta x = D_{x_1} f(x)\Delta x_1 + D_{x_2} f(x)\Delta x_2 + \cdots + D_{x_n} f(x)\Delta x_n.
\]
Here $\Delta x_1, \Delta x_2, \ldots, \Delta x_n$ are the components of the vector $\Delta x \in \mathbf{R}^n$. Moreover, regarded as a function of $x_i$ with the other components of $x$ held fixed, $f$ defines a real-valued function of a real variable. Thus $D_{x_i} f(x)$ can be represented by a single number, which is usually denoted
\[
\frac{\partial f}{\partial x_i}(x).
\]
Thus
\[
Df(x)\Delta x = \frac{\partial f}{\partial x_1}(x)\Delta x_1 + \frac{\partial f}{\partial x_2}(x)\Delta x_2 + \cdots + \frac{\partial f}{\partial x_n}(x)\Delta x_n, \tag{2}
\]
which can be recognized as a matrix-vector product if the numbers
\[
\frac{\partial f}{\partial x_1}(x),\ \frac{\partial f}{\partial x_2}(x),\ \ldots,\ \frac{\partial f}{\partial x_n}(x)
\]
are gathered in a row, that is, a $1 \times n$ matrix:
\[
\left( \frac{\partial f}{\partial x_1}(x),\ \frac{\partial f}{\partial x_2}(x),\ \ldots,\ \frac{\partial f}{\partial x_n}(x) \right).
\]
This is the representer of $Df(x)$ suggested by Theorem 3.1. There is a slightly different representer of $Df(x)$ that is to be preferred because it generalizes to the infinite-dimensional case: Eq. (2) is recognized as the inner product of $\Delta x$ with the vector
\[
\nabla f(x) =
\begin{pmatrix}
\dfrac{\partial f}{\partial x_1}(x) \\
\dfrac{\partial f}{\partial x_2}(x) \\
\vdots \\
\dfrac{\partial f}{\partial x_n}(x)
\end{pmatrix},
\]
which is called the gradient of $f$ at $x$. Thus
\[
Df(x)\Delta x = (\nabla f(x), \Delta x)_{\mathbf{R}^n} \quad \text{for all } \Delta x \in \mathbf{R}^n.
\]
The gradient $\nabla f(x)$ is the usual representer of $Df(x)$.
3.2.3. $m > 1$, $n = 1$. In the case $m > 1$, $n = 1$, so that $f$ is a vector-valued function of a real variable (often called a curve, since its image is a curve in $m$-space), $Df(t) : \mathbf{R} \to \mathbf{R}^m$ is represented by an $m \times 1$ matrix. Now, $f$ can be written
\[
f(t) =
\begin{pmatrix}
f_1(t) \\ f_2(t) \\ \vdots \\ f_m(t)
\end{pmatrix},
\]
where each $f_i$ is a real-valued function of a real variable. By considering the definition of the derivative in this case, which implies that
\[
\lim_{\Delta t \to 0} \frac{| f_i(t + \Delta t) - f_i(t) - (Df(t)\Delta t)_i |}{|\Delta t|} = 0,
\]
it is easy to see that $(Df(t)\Delta t)_i = f_i'(t)\Delta t$. Therefore the representer of $Df(t)$ is
\[
\begin{pmatrix}
f_1'(t) \\ f_2'(t) \\ \vdots \\ f_m'(t)
\end{pmatrix},
\]
which is written as $f'(t)$ or $\dot{f}(t)$ in calculus courses. Since an $m \times 1$ matrix can be thought of as a vector, the usual interpretation of $f'(t)$ as the tangent vector to the curve $x = f(t)$ holds. If the curve is traced out by a particle as $t$ varies, then, at time $t$, the particle is at $x = f(t)$, while at time $t + \Delta t$, it is approximately at $f(t) + f'(t)\Delta t$.
3.2.4. The general case $m > 1$, $n > 1$. Finally, consider the case $m > 1$, $n > 1$, so that $f$ is a vector-valued function of a vector variable. Then, by the results on partial derivatives,
\[
Df(x)\Delta x = D_{x_1} f(x)\Delta x_1 + D_{x_2} f(x)\Delta x_2 + \cdots + D_{x_n} f(x)\Delta x_n.
\]
Now, regarded as a function of $x_i$ with the other components of $x$ held fixed, $f$ defines a function of the type considered in Section 3.2.3. The representer of $D_{x_i} f(x)$ is the column vector
\[
\frac{\partial f}{\partial x_i}(x) =
\begin{pmatrix}
\dfrac{\partial f_1}{\partial x_i}(x) \\
\dfrac{\partial f_2}{\partial x_i}(x) \\
\vdots \\
\dfrac{\partial f_m}{\partial x_i}(x)
\end{pmatrix},
\]
and it follows from this that the matrix $J$ representing $Df(x)$ has
\[
\frac{\partial f}{\partial x_1}(x),\ \frac{\partial f}{\partial x_2}(x),\ \ldots,\ \frac{\partial f}{\partial x_n}(x)
\]
as columns. Thus
\[
J =
\begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1}(x) & \dfrac{\partial f_1}{\partial x_2}(x) & \cdots & \dfrac{\partial f_1}{\partial x_n}(x) \\
\dfrac{\partial f_2}{\partial x_1}(x) & \dfrac{\partial f_2}{\partial x_2}(x) & \cdots & \dfrac{\partial f_2}{\partial x_n}(x) \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1}(x) & \dfrac{\partial f_m}{\partial x_2}(x) & \cdots & \dfrac{\partial f_m}{\partial x_n}(x)
\end{pmatrix},
\]
which is the Jacobian matrix mentioned in the introduction. (Note that the gradients of the component functions $f_i(x)$ form the rows of $J$.)
3.3. Summary
Here is a summary of the results that I have presented:
- If $f : \mathbf{R} \to \mathbf{R}$, then the representer of $Df(x)$ is the scalar $f'(x)$.
- If $f : \mathbf{R}^n \to \mathbf{R}$, then the representer of $Df(x)$ is the vector $\nabla f(x)$. Recall that this is a slight departure from the general framework, as $Df(x)$ is represented via inner product with a (column) vector rather than via matrix multiplication with a row vector.
- If $f : \mathbf{R} \to \mathbf{R}^m$, then the representer of $Df(t)$ is the (column) vector $f'(t)$.
- If $f : \mathbf{R}^n \to \mathbf{R}^m$, then the representer of $Df(x)$ is the Jacobian matrix $J$ defined by
\[
J_{ij} = \frac{\partial f_i}{\partial x_j}(x).
\]
3.4. Example: A quadratic function
Suppose $A$ is an $n \times n$ symmetric matrix ($A^T = A$), $b \in \mathbf{R}^n$, $c \in \mathbf{R}$, and $f : \mathbf{R}^n \to \mathbf{R}$ is defined by
\[
f(x) = \tfrac{1}{2}(x, Ax)_{\mathbf{R}^n} + (b, x)_{\mathbf{R}^n} + c.
\]
To compute $\nabla f(x)$, one method is to write $f$ in terms of the components of $x$,
\[
f(x) = \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j + \sum_{i=1}^n b_i x_i + c,
\]
and compute the partial derivatives of $f$. While this is possible, it is easier to proceed from the definition: Write $f(x + \Delta x) - f(x)$ as a term that is linear in $\Delta x$ plus a smaller remainder. Now,
\[
\begin{aligned}
f(x + \Delta x) - f(x) &= \tfrac{1}{2}(x + \Delta x, A(x + \Delta x)) + (b, x + \Delta x) - \tfrac{1}{2}(x, Ax) - (b, x) \\
&= (Ax + b, \Delta x) + \tfrac{1}{2}(\Delta x, A\Delta x).
\end{aligned}
\]
Note that the symmetry of $A$ was used to conclude that $(\Delta x, Ax) = (x, A\Delta x)$. The term $(\Delta x, A\Delta x)/2$ is small compared to $\Delta x$ when $\Delta x$ is small; in fact,
\[
\tfrac{1}{2}(\Delta x, A\Delta x) = O(\|\Delta x\|^2) = o(\|\Delta x\|).
\]
Therefore,
\[
f(x + \Delta x) - f(x) = (Ax + b, \Delta x) + o(\|\Delta x\|),
\]
that is,
\[
Df(x)\Delta x = (Ax + b, \Delta x).
\]
This equation exhibits the representer $\nabla f(x)$ of $Df(x)$:
\[
\nabla f(x) = Ax + b.
\]
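The identity $\nabla f(x) = Ax + b$ is easy to check numerically. A NumPy sketch (my illustration, not from the paper; the random matrix, vector, and constant term are arbitrary choices) compares it against a centered finite-difference gradient:

```python
import numpy as np

# Check grad f(x) = A x + b for f(x) = (1/2)(x, Ax) + (b, x) + c with A symmetric.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T                      # symmetrize, as the derivation requires
b = rng.standard_normal(4)
c = 1.5                          # arbitrary constant; it drops out of the gradient

f = lambda x: 0.5 * x @ A @ x + b @ x + c
x = rng.standard_normal(4)
grad_exact = A @ x + b

# centered differences along each coordinate direction
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])
```

Because $f$ is quadratic, the centered difference is exact up to round-off, so the two gradients agree to high precision.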
4. Rules for differentiation
I will now review the important rules for differentiating functions.
4.1. The derivative of a linear function
Suppose $f : X \to Y$ is linear and continuous. Then
\[
f(x + \Delta x) - f(x) = f(x) + f(\Delta x) - f(x) = f(\Delta x).
\]
It follows that
\[
Df(x)\Delta x = f(\Delta x),
\]
that is, $Df(x) = f$. This holds independently of $x \in X$, and is the analogue of the rule from calculus which states that the derivative of a linear function is a constant.
4.2. The chain rule
Now suppose that $h : Y \to Z$ and $g : X \to Y$ are $C^1$, and $f : X \to Z$ is the composition of $h$ and $g$:
\[
f(x) = h(g(x)) \quad \text{for all } x \in X.
\]
Then
\[
\begin{aligned}
f(x + \Delta x) &= h(g(x + \Delta x)) \\
&= h(g(x) + Dg(x)\Delta x + o(\|\Delta x\|)) \\
&= h(g(x)) + Dh(g(x))(Dg(x)\Delta x + o(\|\Delta x\|)) + o(\|Dg(x)\Delta x + o(\|\Delta x\|)\|) \\
&= f(x) + Dh(g(x))Dg(x)\Delta x + o(\|\Delta x\|).
\end{aligned}
\]
Thus
\[
Df(x)\Delta x = Dh(g(x))Dg(x)\Delta x.
\]
This is the chain rule.
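In the Euclidean case the chain rule says the Jacobian of $h \circ g$ is the matrix product of the Jacobians. A quick NumPy check (my illustration, not from the paper; the maps $g$ and $h$ are arbitrary choices) confirms this with finite-difference Jacobians:

```python
import numpy as np

# Chain rule check: the Jacobian of f = h o g at x equals
# (Jacobian of h at g(x)) @ (Jacobian of g at x).
g = lambda x: np.array([x[0] * x[1], np.sin(x[0])])   # g: R^2 -> R^2
h = lambda y: np.array([y[0] + y[1] ** 2])            # h: R^2 -> R^1

def jac(f, x, step=1e-6):
    """Forward-difference Jacobian, column by column."""
    x = np.asarray(x, float)
    fx = f(x)
    cols = [(f(x + step * e) - fx) / step for e in np.eye(x.size)]
    return np.column_stack(cols)

x = np.array([0.7, -1.2])
J_direct = jac(lambda v: h(g(v)), x)     # Jacobian of the composition
J_chain = jac(h, g(x)) @ jac(g, x)       # chain rule: Dh(g(x)) Dg(x)
```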
4.3. The product rule
Suppose $X$, $Y$, $Z$, and $W$ are normed vector spaces, and $P : Y \times Z \to W$ is continuous and bilinear; that is,
\[
P(\alpha_1 y_1 + \alpha_2 y_2, z) = \alpha_1 P(y_1, z) + \alpha_2 P(y_2, z) \quad \text{for all } y_1, y_2 \in Y,\ z \in Z,\ \alpha_1, \alpha_2 \in \mathbf{R},
\]
\[
P(y, \alpha_1 z_1 + \alpha_2 z_2) = \alpha_1 P(y, z_1) + \alpha_2 P(y, z_2) \quad \text{for all } y \in Y,\ z_1, z_2 \in Z,\ \alpha_1, \alpha_2 \in \mathbf{R}.
\]
Thus
\[
\nabla h(x) = G^T f(x) + F^T g(x).
\]
4.5. Example: Differentiating an inverse
Suppose $L : X \to L(Y, Z)$, and assume that $L(x)^{-1}$ exists and is continuous for each $x \in X$. Define $f : X \to L(Z, Y)$ by
\[
f(x) = L(x)^{-1}.
\]
Assuming that $L$ is $C^1$, what is $Df(x)$?
Both the chain rule and the product rule are involved in the answer. Define $\Phi : U \to L(Z, Y)$ by $\Phi(K) = K^{-1}$, where
\[
U = \{ K \in L(Y, Z) : K^{-1} \text{ exists and is continuous} \}.
\]
Then
\[
K \Phi(K) = I,
\]
where $I : Z \to Z$ is the identity operator. Differentiating both sides yields
\[
K\, D\Phi(K)\Delta K + \Delta K\, \Phi(K) = 0. \tag{3}
\]
To obtain this result, the product rule was applied to the mapping $P : L(Y, Z) \times L(Z, Y) \to L(Z, Z)$ defined by $P(K, L) = KL$. Also, note that $I \in L(Z, Z)$ is constant, so its derivative is zero. Now, $\Phi(K) = K^{-1}$, so (3) yields
\[
K\, D\Phi(K)\Delta K = -\Delta K\, K^{-1}
\]
or
\[
D\Phi(K)\Delta K = -K^{-1} \Delta K\, K^{-1}. \tag{4}
\]
(The reader will notice the shadow of the calculus rule
\[
f(x) = \frac{1}{x} \implies f'(x) = -\frac{1}{x^2}
\]
here.)
Equation (4) can now be combined with the chain rule to find $Df(x)$ for $f(x) = L(x)^{-1}$. Since $f$ is the composition of $\Phi$ and $L$, the chain rule yields
\[
Df(x)\Delta x = D\Phi(L(x))\, DL(x)\Delta x = -L(x)^{-1}\, (DL(x)\Delta x)\, L(x)^{-1}.
\]
This expression for $Df(x)\Delta x$ is the product (composition) of three linear operators, $L(x)^{-1}$, $DL(x)\Delta x$, and $L(x)^{-1}$. Note that since the product of linear operators is not commutative, order is important in this formula.
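Formula (4) can be verified numerically for matrices: $(K + t\,\Delta K)^{-1} - K^{-1}$ should approach $-t\,K^{-1}\Delta K\,K^{-1}$ as $t \to 0$. A NumPy sketch (my illustration, not from the paper; the matrices are arbitrary, with $K$ shifted to be safely invertible):

```python
import numpy as np

# Check D Phi(K) dK = -K^{-1} dK K^{-1} for Phi(K) = K^{-1} on 3x3 matrices.
rng = np.random.default_rng(1)
K = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # comfortably invertible
dK = rng.standard_normal((3, 3))
Kinv = np.linalg.inv(K)

t = 1e-6
fd = (np.linalg.inv(K + t * dK) - Kinv) / t   # finite-difference directional derivative
formula = -Kinv @ dK @ Kinv                    # the derived formula (4)
```

Reversing the order of the three factors in `formula` would generally break the agreement, which illustrates the remark about non-commutativity.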
5. Simple representations on infinite-dimensional spaces
If $X$ and $Y$ are both infinite-dimensional, then little can be said in general about the representation of $L \in L(X, Y)$. However, if one of the spaces is finite-dimensional, then it is not difficult to derive some useful results.
5.1. Real-valued functions defined on Hilbert spaces
First I take the special case of $f : X \to \mathbf{R}$, where $X$ is a Hilbert space, that is, a complete inner product space. In this case, the Riesz Representation Theorem is available.

Theorem 5.1 (Riesz Representation Theorem). Let $X$ be a Hilbert space. If $L : X \to \mathbf{R}$ is linear and continuous, there exists a unique $v \in X$ such that
\[
L(x) = (x, v)_X \quad \text{for all } x \in X.
\]
This theorem follows immediately from Theorem 3.1 if $X$ is finite-dimensional. For a proof in the infinite-dimensional case, see any book on Hilbert spaces or functional analysis.
Now suppose $f : X \to \mathbf{R}$ is differentiable at $x \in X$. Then $Df(x)$ is a continuous linear function defined on $X$, and, by the Riesz representation theorem, there exists a vector $v \in X$ satisfying
\[
Df(x)\Delta x = (\Delta x, v)_X \quad \text{for all } \Delta x \in X.
\]
Just as in the finite-dimensional case, the vector $v$ is called the gradient of $f$ at $x$, and is denoted by $\nabla f(x)$.
5.2. Example: The gradient of a nonlinear least-squares function
A common optimization problem is to minimize the nonlinear least-squares function
A common optimization problem is to minimize the nonlinear least-squares function
\[
f(x) = \tfrac{1}{2} \|F(x)\|_Y^2 = \tfrac{1}{2} (F(x), F(x))_Y,
\]
where $F : X \to Y$ is a nonlinear operator and $X$, $Y$ are Hilbert spaces. I will now apply the results developed above to compute the gradient of $f$. I will also specialize the results to $X = \mathbf{R}^n$, $Y = \mathbf{R}^m$.
By the product rule,
\[
Df(x)\Delta x = \tfrac{1}{2}(F(x), DF(x)\Delta x)_Y + \tfrac{1}{2}(DF(x)\Delta x, F(x))_Y = (DF(x)\Delta x, F(x))_Y.
\]
Now, an $m \times n$ matrix $A$ has the property that
\[
(Ax, y)_{\mathbf{R}^m} = (x, A^T y)_{\mathbf{R}^n} \quad \text{for all } x \in \mathbf{R}^n,\ y \in \mathbf{R}^m.
\]
Similarly, for every operator $L \in L(X, Y)$, there is a unique adjoint operator $L^*$ defined by the equation
\[
(Lx, y)_Y = (x, L^* y)_X \quad \text{for all } x \in X,\ y \in Y.
\]
(The existence and uniqueness of $L^*$ can be proved using the Riesz representation theorem.) Therefore,
\[
Df(x)\Delta x = (DF(x)\Delta x, F(x))_Y = (\Delta x, DF(x)^* F(x))_X,
\]
which shows that
\[
\nabla f(x) = DF(x)^* F(x).
\]
Computing the adjoint of $DF(x)$ can be quite challenging in some applications; see Section 7 for a nontrivial example. In the case of $X = \mathbf{R}^n$, $Y = \mathbf{R}^m$, $DF(x)$ is represented by the Jacobian matrix $J$, and therefore $DF(x)^*$ is represented by its transpose. It follows that
\[
\nabla f(x) = J^T F(x),
\]
where $J$ is the Jacobian matrix of $F$ at $x$.
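The formula $\nabla f(x) = J^T F(x)$ can be checked directly. In the sketch below (my illustration, not from the paper; the residual map $F$ is an arbitrary choice), the hand-computed Jacobian is compared against a centered finite-difference gradient of $f$:

```python
import numpy as np

# grad f(x) = J^T F(x) for f(x) = (1/2)||F(x)||^2, with F: R^2 -> R^3.
F = lambda x: np.array([x[0] ** 2 - x[1], np.exp(x[1]) - 1.0, x[0] * x[1]])
f = lambda x: 0.5 * F(x) @ F(x)

x = np.array([0.8, -0.3])
J = np.array([[2 * x[0], -1.0],
              [0.0, np.exp(x[1])],
              [x[1], x[0]]])       # Jacobian of F at x, computed by hand
grad = J.T @ F(x)                  # the least-squares gradient formula

# centered finite-difference gradient of f for comparison
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
```

This is the same gradient that Gauss-Newton-type optimization algorithms assemble from the residual vector and its Jacobian.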
5.3. Finite-dimensional operators on Hilbert space
Next I consider the case of $F : X \to \mathbf{R}^m$, where $X$ is again a Hilbert space. Clearly $F$ can be represented as
\[
F(x) =
\begin{pmatrix}
F_1(x) \\ F_2(x) \\ \vdots \\ F_m(x)
\end{pmatrix},
\]
where $F_i : X \to \mathbf{R}$, $i = 1, 2, \ldots, m$. It follows that
\[
DF(x)\Delta x =
\begin{pmatrix}
DF_1(x)\Delta x \\ DF_2(x)\Delta x \\ \vdots \\ DF_m(x)\Delta x
\end{pmatrix}
=
\begin{pmatrix}
(\nabla F_1(x), \Delta x)_X \\ (\nabla F_2(x), \Delta x)_X \\ \vdots \\ (\nabla F_m(x), \Delta x)_X
\end{pmatrix}.
\]
Thus the derivative of $F$ can be represented by $m$ vectors in $X$, namely,
\[
\nabla F_1(x), \nabla F_2(x), \ldots, \nabla F_m(x).
\]
By analogy with the finite-dimensional case, these vectors can be thought of as forming the rows of a matrix (with infinitely many columns).
5.4. Operators with afinite-dimensional domain
5.4.1. The case of a one-dimensional domain. Suppose $Y$ is a normed linear space, and assume $F : \mathbf{R} \to Y$ is differentiable. Then $DF(t)$ is a continuous linear operator from $\mathbf{R}$ into $Y$. It is simple to represent such operators, for if $L \in L(\mathbf{R}, Y)$ and $z = L(1)$, then
\[
Lt = t L(1) = t z \quad \text{for all } t \in \mathbf{R}.
\]
Thus $L$ is represented by an element $z$ of $Y$.
It follows that $DF(t)$ is represented by an element of $Y$, which is denoted $F'(t)$:
\[
DF(t)\Delta t = \Delta t\, F'(t) \quad \text{for all } \Delta t \in \mathbf{R}.
\]
5.4.2. The case of a finite-dimensional domain. Now suppose $F : \mathbf{R}^n \to Y$ is differentiable. Then $DF(x) \in L(\mathbf{R}^n, Y)$, and the structure of such an operator must be determined. Let $L \in L(\mathbf{R}^n, Y)$ and let $\{e_1, e_2, \ldots, e_n\}$ be the standard basis for $\mathbf{R}^n$. Then for any $x \in \mathbf{R}^n$,
\[
x = \sum_{i=1}^n x_i e_i,
\]
and so
\[
Lx = \sum_{i=1}^n x_i L e_i.
\]
That is, there are $n$ vectors $Le_1, Le_2, \ldots, Le_n$ in $Y$, and each image $Lx$ is a linear combination of these $n$ vectors. These $n$ vectors represent $L$. By analogy with the case $Y = \mathbf{R}^m$, one can think of the representer of $L$ as a matrix with $n$ columns (each of which is a vector in $Y$).
It is now easy to see that $DF(x)$ is represented by $n$ vectors, each of which is the representer of a partial derivative of $F$ at $x$. Again, one can think of the representer of $DF(x)$ as a matrix with $n$ columns, each a representer of a partial derivative of $F$ at $x$.
5.5. Example: The solution operator of an IVP
As an important example of the previous section, consider a vector field $f : \Omega \to \mathbf{R}^n$, where $\Omega \subset \mathbf{R}^n$ is an open set. The vector field $f$ defines an autonomous (i.e. time-independent) ordinary differential equation (ODE)
\[
x' = f(x).
\]
By a standard result of the theory of ODEs, if $W \subset \Omega$ is closed and bounded and $f$ is $C^1$, then there exists a positive number $\epsilon$ such that, for each $x_0 \in W$, there exists $x \in (C[-\epsilon, \epsilon])^n$ satisfying the Initial Value Problem (IVP)
\[
\begin{aligned}
x' &= f(x), \\
x(0) &= x_0.
\end{aligned}
\tag{5}
\]
That is, there is an operator $S : W \to (C[-\epsilon, \epsilon])^n$ with $S(x_0) = x$, the solution of (5). I call $S$ the solution operator of the IVP.
Recall that $(C[-\epsilon, \epsilon])^n$ is the space of all continuous, vector-valued functions defined on $[-\epsilon, \epsilon]$. The usual norm on $(C[-\epsilon, \epsilon])^n$ is
\[
\|u\|_\infty = \max\{ \|u(t)\|_2 : t \in [-\epsilon, \epsilon] \}.
\]
This definition implies that if $\|u - v\|_\infty$ is small, then $\|u(t) - v(t)\|_2$ is uniformly small on the interval $[-\epsilon, \epsilon]$. For this reason, $\|\cdot\|_\infty$ is sometimes called the uniform norm.
The derivative $DS(x_0)$ is computed by finding the local linear approximation to $S(x_0 + \Delta x_0) - S(x_0)$. Write $z = S(x_0 + \Delta x_0)$ and $x = S(x_0)$. Then $z$ satisfies
\[
z' = f(z), \quad z(0) = x_0 + \Delta x_0,
\]
and $x$ satisfies (5). Therefore, if $w = z - x$, then
\[
w' = z' - x' = f(z) - f(x) = Df(x)(z - x) + o(\|z - x\|) = Df(x)w + o(\|w\|)
\]
and
\[
w(0) = z(0) - x(0) = x_0 + \Delta x_0 - x_0 = \Delta x_0.
\]
Since a linear (in $\Delta x_0$) approximation to $w$ is desired, it is reasonable to drop the $o(\|w\|)$ term from the ODE and consider the solution $u$ of
\[
\begin{aligned}
u' &= Df(x(t))u, \\
u(0) &= \Delta x_0.
\end{aligned}
\tag{6}
\]
Note that $u$ really does depend linearly on $\Delta x_0$. Indeed, if $u$ solves (6) and $v$ solves
\[
v' = Df(x(t))v, \quad v(0) = \Delta y_0,
\]
then $y = u + v$ satisfies
\[
y' = u' + v' = Df(x(t))u + Df(x(t))v = Df(x(t))(u + v) = Df(x(t))y
\]
and
\[
y(0) = u(0) + v(0) = \Delta x_0 + \Delta y_0.
\]
Therefore $u$, the solution of (6), depends linearly on $\Delta x_0$, and it is an approximation to $w$ since it is obtained by solving an IVP with the same initial condition as that satisfied by $w$ and with a slightly changed vector field. It can be proved, in fact, that $w = u + o(\|\Delta x_0\|_2)$. (This is a standard theorem about the continuous dependence of the solution to an IVP on the vector field.) Therefore $DS(x_0)\Delta x_0 = u$, where $u$ is the solution of the IVP (6).
6. Second derivatives
In elementary calculus, if $f : \mathbf{R} \to \mathbf{R}$ is twice differentiable, then the scalar $f'(x)$ representing $Df(x)$ is called the first derivative. Since, in this way of looking at things, $f' : \mathbf{R} \to \mathbf{R}$ is the same type of function as is $f$ itself, it is natural to define $f''(x)$ as the derivative of $f'$ at $x$, so that $f''(x)$ is also a scalar and $f''$, like $f$ and $f'$, maps $\mathbf{R}$ into $\mathbf{R}$. As I will now explain, this is another instance in which the one-dimensional case gives a completely misleading picture.
Suppose $X$ and $Y$ are normed linear spaces, $U \subset X$ is open, and $f : U \to Y$ is differentiable. Then the derivative $Df$ is also an operator mapping one normed linear space into another; however, it is not of the same type as $f$, since $Df$ maps $U$ into $L(X, Y)$. It does make sense to ask whether $Df : U \to L(X, Y)$ is differentiable; to examine this question, Definition 2.2 is applied. The operator $Df$ is differentiable at $x \in U$ if there exists a continuous linear operator $L \in L(X, L(X, Y))$ such that
\[
\lim_{\Delta x \to 0} \frac{\| Df(x + \Delta x) - Df(x) - L \Delta x \|_{L(X,Y)}}{\|\Delta x\|_X} = 0.
\]
If such an $L$ exists, then $f$ is said to be twice-differentiable at $x$, and $L$ is denoted by $D^2 f(x)$. If $f$ is twice-differentiable at each $x \in U$, then $f$ is called twice-differentiable, in which
case $D^2 f$ is an operator mapping $U$ into $L(X, L(X, Y))$. If this operator is continuous, then $f$ is called $C^2$.
Now, clearly Definition 2.2 can be used to discuss derivatives of order three and higher.
However, things become quite awkward. For example, if $f$ is three times differentiable, then
\[
D^3 f(x) \in L(X, L(X, L(X, Y))),
\]
and if $D^4 f(x)$ exists, then
\[
D^4 f(x) \in L(X, L(X, L(X, L(X, Y)))).
\]
Fortunately, a simplification is afforded by the nature of the spaces
\[
L(X, L(X, Y)),\ L(X, L(X, L(X, Y))),\ \ldots
\]
Consider $L \in L(X, L(X, Y))$. By definition, $L(x) \in L(X, Y)$ and $L(x)z \in Y$ for each $x, z \in X$. In other words, $L$ defines an operator $B : X \times X \to Y$ by
\[
B(x, z) = L(x)z.
\]
It is easy to see that $B$ is bilinear, that is, that
\[
B(\alpha_1 x_1 + \alpha_2 x_2, z) = \alpha_1 B(x_1, z) + \alpha_2 B(x_2, z) \quad \text{for all } x_1, x_2, z \in X,\ \alpha_1, \alpha_2 \in \mathbf{R},
\]
\[
B(z, \alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 B(z, x_1) + \alpha_2 B(z, x_2) \quad \text{for all } x_1, x_2, z \in X,\ \alpha_1, \alpha_2 \in \mathbf{R}.
\]
Indeed,
\[
L(\alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 L(x_1) + \alpha_2 L(x_2),
\]
from which it follows that
\[
B(\alpha_1 x_1 + \alpha_2 x_2, z) = L(\alpha_1 x_1 + \alpha_2 x_2)z = \alpha_1 L(x_1)z + \alpha_2 L(x_2)z = \alpha_1 B(x_1, z) + \alpha_2 B(x_2, z).
\]
Similarly, $L(z)$ is linear, so
\[
B(z, \alpha_1 x_1 + \alpha_2 x_2) = L(z)(\alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 L(z)x_1 + \alpha_2 L(z)x_2 = \alpha_1 B(z, x_1) + \alpha_2 B(z, x_2).
\]
If $X$, $Y$, and $Z$ are normed linear spaces, then the space of continuous bilinear operators $B : X \times Y \to Z$ is denoted by $L^2(X, Y, Z)$. This space has a natural norm:
\[
\|B\|_{L^2(X,Y,Z)} = \sup \left\{ \frac{\|B(x, y)\|_Z}{\|x\|_X \|y\|_Y} : x \in X,\ y \in Y,\ x \ne 0,\ y \ne 0 \right\}.
\]
It is a standard result that a bilinear operator $B : X \times Y \to Z$ is continuous if and only if $\|B\|_{L^2(X,Y,Z)} < \infty$.
Note that the notation $D^2_{xy} f$ denotes the partial derivative with respect to $y$ of $D_x f$, and similarly for $D^2_{yx} f$. By Theorem 6.1, each of $D^2 f(x, y)$, $D^2_{xx} f(x, y)$, and $D^2_{yy} f(x, y)$ is symmetric. It follows easily that
\[
D^2_{xy} f(x, y)\Delta z \Delta x = D^2_{yx} f(x, y)\Delta x \Delta z \quad \text{for all } \Delta x \in X,\ \Delta z \in Y.
\]
Of course, these results can be generalized to the case of $f : X_1 \times X_2 \times \cdots \times X_n \to Z$, in which case the fundamental formulas are
\[
D^2 f(x)\Delta x \Delta r = \sum_{i=1}^n \sum_{j=1}^n D^2_{x_i x_j} f(x)\Delta x_j \Delta r_i
\]
and
\[
D^2_{x_i x_j} f(x)\Delta x_j \Delta r_i = D^2_{x_j x_i} f(x)\Delta r_i \Delta x_j.
\]
6.3. Representation of second derivatives on finite-dimensional spaces
Now I return to the case of f : R^n → R^m and derive the formula for the 3-tensor representing D^2 f(x). Since R^n can be regarded as the product of n copies of R, D^2 f(x) can be expressed in terms of the second partial derivatives of f:

D^2 f(x) δx r = Σ_{j=1}^n Σ_{k=1}^n D^2_{x_j x_k} f(x) δx_k r_j.
Recall that D^2_{x_j x_k} f is the derivative with respect to x_k of D_{x_j} f; also recall that

D_{x_j} f : R^n → L(R, R^m),

or, effectively,

D_{x_j} f : R^n → R^m

(since each operator in L(R, R^m) is represented by a vector in R^m, and vice versa). Specifically, D_{x_j} f(x) is represented by the vector

∂f/∂x_j (x).

By the same reasoning, D^2_{x_j x_k} f(x) is also represented by a vector in R^m, namely,

∂^2 f/∂x_k ∂x_j (x).
It follows that

D^2 f(x) δx r = Σ_{j=1}^n Σ_{k=1}^n ∂^2 f/∂x_k ∂x_j (x) δx_k r_j.
Thus,

(D^2 f(x) δx r)_i = Σ_{j=1}^n Σ_{k=1}^n ∂^2 f_i/∂x_k ∂x_j (x) δx_k r_j,

which shows that D^2 f(x) is represented by the 3-tensor T, with

T_{ijk} = ∂^2 f_i/∂x_k ∂x_j (x).
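This tensor representation is easy to check numerically. The following sketch (my own illustration; the map f and the helper second_diff_tensor are invented for this purpose) estimates T_{ijk} by central differences for a small map f : R^2 → R^2 and can be compared against the hand-computed second partials:

```python
import numpy as np

# Invented example map f : R^2 -> R^2, chosen for easy hand differentiation:
# f(x) = (x1^2 * x2, x1 * x2^2).
def f(x):
    return np.array([x[0] ** 2 * x[1], x[0] * x[1] ** 2])

def second_diff_tensor(f, x, h=1e-5):
    """Estimate T[i, j, k] ~ d^2 f_i / dx_k dx_j by central differences."""
    n = len(x)
    m = len(f(x))
    T = np.empty((m, n, n))
    I = np.eye(n)
    for j in range(n):
        for k in range(n):
            T[:, j, k] = (f(x + h * (I[j] + I[k])) - f(x + h * (I[j] - I[k]))
                          - f(x - h * (I[j] - I[k])) + f(x - h * (I[j] + I[k]))) / (4 * h ** 2)
    return T

x = np.array([1.0, 2.0])
T = second_diff_tensor(f, x)
# By hand, at (1, 2): T[0] = [[4, 2], [2, 0]] and T[1] = [[0, 4], [4, 2]].
```

Each slice T[i] is symmetric, as Theorem 6.1 requires.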
6.4. The Hessian
In Sections 3.2.2 and 5.1, I showed that, in the case of f : X → R, the usual representer of Df(x) is the gradient vector ∇f(x). When X = R^n, this is not quite the same as the Jacobian matrix (which is a row matrix in that case); the gradient is adopted instead precisely because it generalizes to the case in which X is a Hilbert space. In the same way, the representer for D^2 f(x) has a special form when the range of f is R. Indeed, in this case D^2 f(x) is a bilinear operator mapping X × X into R, and the following theorem holds.
Theorem 6.4. Suppose X is a Hilbert space and B ∈ L_2(X, X, R) is a bilinear form. Then there exists a linear operator L ∈ L(X, X) such that

B(x, y) = (Lx, y)_X for all x, y ∈ X.
This theorem can be proved using the Riesz representation theorem, since, for fixed x, the map

y ↦ B(x, y)

defines a continuous, linear, real-valued function on X. In the case of D^2 f(x), where f : X → R, the linear operator representing the bilinear operator is called the Hessian operator and is denoted ∇^2 f(x). That is,

∇^2 f(x) ∈ L(X, X)

is defined by

D^2 f(x) δx δy = (∇^2 f(x) δx, δy)_X for all δx, δy ∈ X.
In the case of f : R^n → R, the Hessian matrix is just the specialization of the tensor T discussed above. Since m = 1 in this case, the 3-tensor can obviously be identified with a 2-tensor, i.e. a matrix (just as the Jacobian matrix can be identified with a vector, the gradient, in this case). Therefore,

(∇^2 f(x))_{ij} = ∂^2 f/∂x_j ∂x_i (x).
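As a sanity check, the Hessian matrix can be estimated by central differences and its symmetry observed directly; the function f below is my own small example, not one from the text:

```python
import numpy as np

# Invented scalar example f(x) = x1^2 * x2 + sin(x2); its Hessian at (1, 0.5)
# is [[2*x2, 2*x1], [2*x1, -sin(x2)]] = [[1, 2], [2, -sin(0.5)]].
def f(x):
    return x[0] ** 2 * x[1] + np.sin(x[1])

def hessian_fd(f, x, h=1e-5):
    """Central-difference estimate of the Hessian matrix of f at x."""
    n = len(x)
    H = np.empty((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h * (I[i] + I[j])) - f(x + h * (I[i] - I[j]))
                       - f(x - h * (I[i] - I[j])) + f(x - h * (I[i] + I[j]))) / (4 * h ** 2)
    return H

H = hessian_fd(f, np.array([1.0, 0.5]))
# H agrees with the analytic Hessian, and is symmetric by construction.
```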
6.5. Example: The Hessian of a nonlinear least-squares function
I now return to the example of Section 5.2. Let F : X → Y, where X and Y are Hilbert spaces, and define

f(x) = (1/2)(F(x), F(x))_Y.
I showed earlier that
Df(x) δx = (DF(x) δx, F(x))_Y.
By the product rule, it follows that

D^2 f(x) δx r = (DF(x) δx, DF(x) r)_Y + (D^2 F(x) δx r, F(x))_Y. (7)
This gives a formula for the second derivative of f, but it must be rearranged to exhibit the Hessian operator. The first term is easy to handle, since

(DF(x) δx, DF(x) r)_Y = (δx, DF(x)* DF(x) r)_X.
The operator DF(x)* DF(x) thus forms part of the Hessian; indeed, in small-residual least-squares problems, this operator is a good approximation to the Hessian, at least for x near the minimizer. For this reason, it is often used as an approximation to the Hessian, and is referred to as the Gauss-Newton Hessian.
To handle the second term, write

B(δx, r) = D^2 F(x) δx r;

then

(D^2 F(x) δx r, F(x))_Y = (B(δx, r), F(x))_Y = (δx, (B(·, r))* F(x))_X,

where I write B(·, r) for the linear operator defined by

δx ↦ B(δx, r).
Now, it is easy to see that

r ↦ (B(·, r))* F(x)

defines a linear operator mapping X to X, and it can be shown to be bounded (continuous). This operator depends on x through F(x) and D^2 F(x), and I will denote it by S(x), so that

S(x) r = (B(·, r))* F(x).
With this notation,

(D^2 F(x) δx r, F(x))_Y = (δx, S(x) r)_X,

and therefore

∇^2 f(x) = DF(x)* DF(x) + S(x).
Lastly, I will compute S(x) in the case F : R^n → R^m. In that case, D^2 F(x) is represented by the 3-tensor T, where

T_{ijk} = ∂^2 F_i/∂x_k ∂x_j (x).
Therefore, setting z = F(x),

(D^2 F(x) δx r, z)_{R^m} = Σ_{i=1}^m (D^2 F(x) δx r)_i z_i
= Σ_{i=1}^m Σ_{j=1}^n Σ_{k=1}^n T_{ijk} δx_k r_j z_i
= Σ_{k=1}^n ( Σ_{j=1}^n Σ_{i=1}^m T_{ijk} z_i r_j ) δx_k
= (δx, w)_{R^n}, where w_k = Σ_{j=1}^n Σ_{i=1}^m T_{ijk} z_i r_j,

and so

(S(x) r)_k = Σ_{j=1}^n Σ_{i=1}^m T_{ijk} z_i r_j = Σ_{j=1}^n Σ_{i=1}^m ∂^2 F_i/∂x_k ∂x_j (x) F_i(x) r_j.
This shows that S(x) is represented by the matrix whose (k, j) entry is

Σ_{i=1}^m ∂^2 F_i/∂x_k ∂x_j (x) F_i(x),

and hence

S(x) = Σ_{i=1}^m F_i(x) ∇^2 F_i(x).
The matrix representing S(x) has been referred to as the mess matrix, and the above
formula shows that it is expensive to compute. This explains the popularity of the Gauss-
Newton Hessian. However, in large-residual least-squares problems, use of the full Hessian
(or an approximation to it) is necessary.
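The decomposition ∇^2 f(x) = DF(x)* DF(x) + S(x) can be illustrated numerically. In the sketch below (the residual map F is an invented example), the Gauss-Newton term J^T J and the correction S(x) = Σ F_i(x) ∇^2 F_i(x) are assembled by hand and compared with a finite-difference Hessian of f(x) = (1/2)‖F(x)‖^2:

```python
import numpy as np

# Invented residual map F : R^2 -> R^2 for f(x) = 0.5 * ||F(x)||^2.
def F(x):
    return np.array([x[0] ** 2 - x[1], np.sin(x[0]) + x[1] ** 2])

x = np.array([0.7, -0.3])
J = np.array([[2 * x[0], -1.0],
              [np.cos(x[0]), 2 * x[1]]])            # Jacobian DF(x)
H1 = np.array([[2.0, 0.0], [0.0, 0.0]])             # Hessian of F_1
H2 = np.array([[-np.sin(x[0]), 0.0], [0.0, 2.0]])   # Hessian of F_2

gauss_newton = J.T @ J                              # DF(x)* DF(x)
full = gauss_newton + F(x)[0] * H1 + F(x)[1] * H2   # ... + S(x)

# Compare with a central-difference Hessian of f.
f = lambda x: 0.5 * np.dot(F(x), F(x))
h, I = 1e-5, np.eye(2)
H_fd = np.array([[(f(x + h * (I[i] + I[j])) - f(x + h * (I[i] - I[j]))
                   - f(x - h * (I[i] - I[j])) + f(x - h * (I[i] + I[j]))) / (4 * h * h)
                  for j in range(2)] for i in range(2)])
# full matches H_fd; gauss_newton alone is off by S(x), which is not small here.
```

At this point the residuals F_i(x) are not small, so the Gauss-Newton Hessian alone is a poor approximation, as the text warns for large-residual problems.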
7. Example: The adjoint state method
As a more involved example, I will discuss the computation of DG(c) and DG(c)* for a nonlinear operator G defined by an (explicit) finite-difference simulation. This discussion is taken from the paper (Gockenbach et al., in press). The problem described here arises, for
example, when one or more coefficients in a partial differential equation are to be estimated
by the Output Least-Squares (OLS) technique. In this technique, the parameters are chosen
to produce simulated data as close as possible (in a norm induced by an inner product) to
observed data. Specifically, the OLS problem is
min_c J(c), J(c) = (1/2)‖G(c) − D_obs‖^2, (8)
where c ∈ C denotes the unknown parameters, D_obs is the observed data, and G is the forward map, that is, the operator embodying the mathematical model of the dependence of the data on the parameters. Thus the OLS problem is just a nonlinear least-squares problem of the type discussed above, and

∇J(c) = DG(c)*(G(c) − D_obs).
In the application I consider here, G is defined by an explicit finite-difference simulation followed by sampling (in many applications, only part of the field simulated by finite differences is observable). I will therefore assume that

G(c) = SU = Σ_{n=0}^N S_n U^n,

where U^n ∈ U is (related to) the nth time level of the simulated field, S : U^{N+1} → D is the sampling operator, and D is the data space (that is, D_obs ∈ D).
Note that S is defined by

SU = Σ_{n=0}^N S_n U^n,

where S_n : U → D for n = 0, 1, . . . , N. That is, each time level of the computed field is sampled, and the results are accumulated as the data. This formalism provides an efficient
way to abstractly represent several different sampling possibilities. For example, the entire
time level U^n may be recorded for certain values of n, in which case S_n is the zero operator for all other values of n. Alternatively, every time level could be sampled at a few receiver
locations (as in the typical seismic experiment), and the results recorded as time series. At
the other extreme, the entire history of the field could be retained. All of these possibilities can be accommodated within the above formalism by appropriate choice of S.
Any finite-difference scheme can be considered to be formally two-level, by concatenating several time levels if necessary. Therefore,

U^{n+1} = H_n(c, U^n), n = 0, 1, . . . , N − 1.

I call H_n : C × U → U the stencil operator.
7.1. A convection-diffusion example
I will now pause to give an explicit example of the situation described above. Consider the following initial-boundary value problem for the convection-diffusion equation:

u_t + a(x)u_x = 0, 0 < x ≤ 1, 0 < t ≤ T,
u(0, t) = 0, 0 < t ≤ T,
u(x, 0) = φ(x), 0 < x ≤ 1,

where a(x) > 0 for all x ∈ [0, 1]. Define a grid on the rectangle (x, t) ∈ [0, 1] × [0, T] by setting

x_j = j Δx, Δx = 1/M, t_n = n Δt, Δt = T/N,
and write u^n_j for the approximation to u(x_j, t_n). Since the characteristics of the PDE point up and to the right in the (x, t) plane, it is natural to discretize using a forward difference in time and a backward difference in space to obtain

(u^{n+1}_j − u^n_j)/Δt + a_j (u^n_j − u^n_{j−1})/Δx = 0,
where a_j = a(x_j). Taking into account the initial and boundary values yields

u^{n+1}_j =
0, j = 0, n = 0, 1, . . . , N − 1,
φ_j − a_j (Δt/Δx)(φ_j − φ_{j−1}), n = 0, j = 1, 2, . . . , M,
u^n_1 − a_1 (Δt/Δx) u^n_1, n = 1, 2, . . . , N − 1, j = 1,
u^n_j − a_j (Δt/Δx)(u^n_j − u^n_{j−1}), n = 1, 2, . . . , N − 1, j = 2, 3, . . . , M. (9)

The stencil operator H for this example is therefore defined by H(a, u^n) = u^{n+1}, where u^{n+1} is defined by (9). In terms of the above notation, U is (M + 1)-dimensional space, while C is M-dimensional space.
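A minimal sketch of the stencil operator for scheme (9), assuming the coefficient a is carried on the full grid x_0, . . . , x_M (the function name stencil_step is my own invention):

```python
import numpy as np

# One explicit upwind time step u^{n+1} = H(a, u^n) of scheme (9).
# Arrays have length M + 1 (values at x_0, ..., x_M); only a_1, ..., a_M
# are actually used, and the boundary value u^{n+1}_0 = 0 is set directly.
def stencil_step(a, u, dt, dx):
    unew = np.empty_like(u)
    unew[0] = 0.0                                            # boundary condition
    unew[1:] = u[1:] - a[1:] * (dt / dx) * (u[1:] - u[:-1])  # upwind update
    return unew

# A spatially constant field is unchanged away from the boundary,
# since the upwind difference u_j - u_{j-1} vanishes.
a = np.ones(6)
u = np.ones(6)
unew = stencil_step(a, u, dt=0.1, dx=0.2)
```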
To introduce sampling, suppose that sensors are placed at several grid points on the spatial grid, say at x_{j_1}, x_{j_2}, . . . , x_{j_ℓ}, and that the observed data consist of the time series

u^1_{j_i}, u^2_{j_i}, . . . , u^N_{j_i}, i = 1, 2, . . . , ℓ.
If each time series forms a column of a matrix, then the data D is an (N + 1) × ℓ matrix, and we have

D = Σ_{n=0}^N S_n U^n,

where S_n U^n is the matrix with every row equal to zero except the nth row, which has entries

u^n_{j_1}, u^n_{j_2}, . . . , u^n_{j_ℓ}.
7.2. Back to the general case
The linearization of the map c ↦ G(c) is the result of first-order perturbation of the time-stepping equations:

DG(c) δc = Σ_{n=0}^N S_n δU^n,

where

δU^{n+1} = D_c H_n(c, U^n) δc + D_U H_n(c, U^n) δU^n

and δU^n = (DU(c) δc)^n. Note that if the original finite-difference scheme is linear (really affine: linear plus constant), then it can be written as

U^{n+1} = A(c) U^n + F^n,
where A(c) = D_U H_n(c, U) (D_U H_n is independent of the time level n in this case). It follows that δU satisfies

δU^{n+1} = A(c) δU^n + (DA(c) δc) U^n.

Therefore, in this common case, the linearization is computed by a finite-difference simulation identical to the original, except that the right-hand side F^n is replaced by

(DA(c) δc) U^n.
I will now show how to compute the adjoint of DG(c). The spaces C and U require inner products; these inner products will be denoted (·, ·)_C and (·, ·)_U, respectively. The field U belongs to U^{N+1}, and I define the inner product on U^{N+1} by

(U, V)_{U^{N+1}} = Σ_{n=0}^N (U^n, V^n)_U.
For convenience, and suppressing the dependence on c, write A_n for D_U H_n(c, U^n) and F^{n+1} for D_c H_n(c, U^n) δc, with F^0 = 0, so that the linearized scheme can be written as

δU^0 = 0, δU^{n+1} − A_n δU^n = F^{n+1}, n = 0, 1, . . . , N − 1.

This can also be written as

M δU = F,
where M : U^{N+1} → U^{N+1} is the block linear operator

M =
[  I      0      0    · · ·     0 ]
[ −A_0    I      0    · · ·     0 ]
[  0    −A_1     I    · · ·     0 ]
[  ⋮             ⋱      ⋱       ⋮ ]
[  0    · · ·    0   −A_{N−1}   I ]

(note that M depends on c, but I suppress this dependence). Then δU = M^{−1} F, and the explicit time-stepping scheme is equivalent to solving M δU = F by forward substitution.
Now, write B for the operator mapping δc to F:

(B δc)^n =
0, n = 0,
D_c H_{n−1}(c, U^{n−1}) δc, n = 1, 2, . . . , N

(again suppressing the fact that B depends on c). Then

DG(c) = S M^{−1} B,
and so

DG(c)* = B* (M*)^{−1} S*.
Assuming that S_n, S_n*, and the stencil operator H_n and its derivatives and adjoints D_c H_n(c, U), D_U H_n(c, U), D_c H_n(c, U)*, and D_U H_n(c, U)* are known (the reader might find it instructive to compute these derivatives and adjoints for the convection-diffusion example given above), I will now show how to compute DG(c)* from them. Note that DG(c)* D = B* (M*)^{−1} S* D. Write V = S* D. Then, as is easy to verify,

V^n = (S* D)^n = S_n* D, n = 0, 1, . . . , N.
From my choice of inner product on U^{N+1}, it follows that M* is the block linear operator

M* =
[ I   −A_0*    0     · · ·      0 ]
[ 0     I    −A_1*   · · ·      0 ]
[ ⋮            ⋱       ⋱        ⋮ ]
[ 0   · · ·    0       I   −A_{N−1}* ]
[ 0   · · ·    0       0        I ]
Write W = (M*)^{−1} V, so that W solves M* W = V. Since M* is block upper triangular, W can be found by back substitution, which is equivalent to the following reverse time-stepping scheme:

W^N = V^N, W^{n−1} = A_{n−1}* W^n + V^{n−1}, n = N, N − 1, . . . , 1.

I will refer to W as the adjoint state and to the equation M* W = V as the adjoint state equation.
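The claim that back substitution on M* is the adjoint of forward substitution on M can be verified with a dot-product test; the random matrices A_n below are stand-ins for the linearized stencils:

```python
import numpy as np

# Dot-product (adjoint) test: forward substitution applies M^{-1}, back
# substitution applies (M*)^{-1}, and (M^{-1}F, V) = (F, (M*)^{-1}V).
rng = np.random.default_rng(0)
N, d = 4, 3                                  # N time steps, state dimension d
A = [rng.standard_normal((d, d)) for _ in range(N)]

def forward_sub(F):
    """Solve M dU = F: dU^0 = F^0, dU^{n+1} = A_n dU^n + F^{n+1}."""
    dU = [F[0]]
    for n in range(N):
        dU.append(A[n] @ dU[n] + F[n + 1])
    return dU

def back_sub(V):
    """Solve M* W = V: W^N = V^N, W^{n-1} = A_{n-1}^T W^n + V^{n-1}."""
    W = [None] * (N + 1)
    W[N] = V[N]
    for n in range(N, 0, -1):
        W[n - 1] = A[n - 1].T @ W[n] + V[n - 1]
    return W

F = [rng.standard_normal(d) for _ in range(N + 1)]
V = [rng.standard_normal(d) for _ in range(N + 1)]
lhs = sum(u @ v for u, v in zip(forward_sub(F), V))   # (M^{-1}F, V)
rhs = sum(f @ w for f, w in zip(F, back_sub(V)))      # (F, (M*)^{-1}V)
# lhs and rhs agree to rounding error.
```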
Next I compute B*. Note that

(B δc, W)_{U^{N+1}} = (0, W^0)_U + Σ_{n=1}^N (D_c H_{n−1}(c, U^{n−1}) δc, W^n)_U
= Σ_{n=1}^N (δc, D_c H_{n−1}(c, U^{n−1})* W^n)_C
= ( δc, Σ_{n=1}^N D_c H_{n−1}(c, U^{n−1})* W^n )_C.

This shows that

B* W = Σ_{n=1}^N D_c H_{n−1}(c, U^{n−1})* W^n.
Thus the procedure for computing DG(c)* D, for D ∈ D, is:

1. Solve the simulation problem to produce the field U (needed in steps 3b and 3c).
2. Set c̄ to zero.
3. For n = N, N − 1, . . . , 1:
(a) Compute V^n = S_n* D.
(b) Compute W^n by taking one step (backward in time) on the adjoint state equation (or simply W^N = V^N).
(c) Add D_c H_{n−1}(c, U^{n−1})* W^n to the output vector c̄.
A logistical problem immediately asserts itself: U is produced by stepping forward in time, W by stepping backwards. Unless the state space has small dimension (which is certainly not the typical case), storage of the entire time history of the reference field U is very expensive in terms of memory. On the other hand, one could, at each step of the backward time-stepping algorithm, re-compute the needed time level U^n by forward time-stepping from U^0. This is obviously expensive in terms of computation time.
To balance the need for storage and recomputation, a checkpointing scheme due to Andreas Griewank (1992), extended in Symes et al. (1998), can be employed. The idea is to save (checkpoint) various time levels U^n to use as intermediate initial data to restart the computation of U during the solution of the adjoint state system. A complete description of the algorithm appears in Gockenbach et al. (in press).
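A toy version of the idea (a fixed-interval checkpointing schedule, which is simpler than Griewank's optimal binomial schedule) might look as follows; the stand-in step function is invented:

```python
# Fixed-interval checkpointing: store every k-th time level on the forward
# sweep; recompute intermediate levels from the nearest checkpoint when the
# reverse sweep needs them. step() is a stand-in for U^{n+1} = H_n(c, U^n).
def step(u):
    return 0.5 * u + 1.0

def forward(u0, N, k):
    """Forward sweep, returning {n: U^n} for n = 0, k, 2k, ..."""
    checkpoints, u = {0: u0}, u0
    for n in range(1, N + 1):
        u = step(u)
        if n % k == 0:
            checkpoints[n] = u
    return checkpoints

def recompute(checkpoints, n, k):
    """Recover U^n by forward time-stepping from the nearest earlier checkpoint."""
    m = (n // k) * k
    u = checkpoints[m]
    for _ in range(n - m):
        u = step(u)
    return u

N, k, u0 = 10, 3, 2.0
cps = forward(u0, N, k)
full = [u0]                      # the fully stored trajectory, for comparison
for n in range(N):
    full.append(step(full[-1]))
# recompute(cps, n, k) reproduces full[n] exactly for every n.
```

Storage drops from N + 1 levels to about N/k checkpoints, at the cost of at most k − 1 extra steps per recomputed level.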
8. Example: Differentiating a finite element solution operator in an inverse problem
As my final example, I will discuss the computation of the derivative and its adjoint when the operator is the (approximate) solution operator, as implemented using the finite element method, of an elliptic partial differential equation. Suppose Ω is a bounded polygonal region in the plane and consider the boundary value problem (BVP)

−∇ · (a∇u) = p in Ω, (10)
u = 0 on ∂Ω.
This BVP models, for example, the small transverse displacements u of an elastic membrane under a transverse pressure p. The coefficient a describes the elastic properties of the membrane, and when the membrane is heterogeneous, a is a function of space: a = a(x). The usual direct problem is to compute u given the functions p and a; that is, given the elastic properties of the membrane and the pressure to which it is subjected, determine its displacement. In many applications, it is necessary to solve an inverse problem, such as: Given p and a measurement of u, estimate a; that is, by observing the displacement of the membrane under a known pressure, estimate the elastic properties of the membrane. (It would also be possible to consider p as needing to be measured, that is, that it forms part of the data of the problem. To simplify the presentation, I will assume that the pressure p is known.)
One way to solve the inverse problem numerically is to use the Output Least-Squares approach, as described in the previous section, in conjunction with the finite element method for solving the BVP. Suppose that the measured data is denoted u_obs and a is to be chosen so that the predicted displacement u, as simulated by piecewise linear finite elements, is as close to u_obs as possible in the L^2(Ω) norm. It is necessary to have a representation for the unknown coefficient a, and I will represent it using a piecewise linear function. Let T^(h) be a triangulation of Ω, and define

P^(h) = {φ : Ω → R | φ is continuous and piecewise linear relative to T^(h)},
P^(h)_0 = {φ ∈ P^(h) | φ = 0 on ∂Ω}.
Suppose the nodes of the triangulation T^(h) are x_1, x_2, . . . , x_m and φ_i is the element of P^(h) defined by

φ_i(x_j) = 1 if j = i, 0 if j ≠ i.
Then {φ_1, φ_2, . . . , φ_m} is the standard basis for the space P^(h), and every element u ∈ P^(h) satisfies

u = Σ_{i=1}^m u(x_i) φ_i.
The basis functions that correspond to interior nodes comprise a basis for P^(h)_0; I will denote this basis by {ψ_1, ψ_2, . . . , ψ_n} (there exists a sequence i_1, i_2, . . . , i_n such that ψ_j = φ_{i_j}, j = 1, 2, . . . , n).
The finite element method for estimating the solution of (10) takes the form:

find u ∈ P^(h)_0 such that

∫_Ω a∇u · ∇ψ_i = ∫_Ω p ψ_i, i = 1, 2, . . . , n. (11)
Upon substituting

u = Σ_{j=1}^n U_j ψ_j,

(11) can be written as the matrix-vector equation KU = P, where

K_{ij} = ∫_Ω a∇ψ_j · ∇ψ_i, i, j = 1, 2, . . . , n,
P_i = ∫_Ω p ψ_i, i = 1, 2, . . . , n.
Note that K ∈ R^{n×n} is symmetric and positive definite.
Now define the (approximate) solution operator of (10) as

f : P^(h) → P^(h)_0,

where

f : a ↦ u = Σ_{i=1}^n U_i ψ_i, U = K^{−1} P,

and K ∈ R^{n×n} and P ∈ R^n are defined as above. The OLS approach is then to minimize the function J : P^(h) → R defined by

J(a) = (1/2)‖f(a) − u_obs‖^2_{L^2(Ω)}.
This is another nonlinear least-squares function, and its gradient is given by

∇J(a) = Df(a)*(f(a) − u_obs).
It is easier to compute Df(a) and Df(a)* if I explicitly recognize the fact that the bases for P^(h) and P^(h)_0 make it possible to identify them with R^m and R^n, respectively. Define E : R^m → P^(h) by

EA = Σ_{i=1}^m A_i φ_i,

and note that, as discussed above, E^{−1} is defined by

(E^{−1} a)_i = a(x_i), i = 1, 2, . . . , m.
Similarly, define E_0 : R^n → P^(h)_0 by

E_0 U = Σ_{i=1}^n U_i ψ_i;

then (E_0^{−1} u)_i = u(x_{j_i}), i = 1, 2, . . . , n. I can then write

f = E_0 F E^{−1},
where F : R^m → R^n is defined by

F(A) = U, where a = Σ_{i=1}^m A_i φ_i, u = Σ_{i=1}^n U_i ψ_i, U = K^{−1} P.
I will now show how to compute DF(A) and DF(A)*. The matrix K depends on A, so I will write K = K(A). With

a = Σ_{k=1}^m A_k φ_k,

it follows that

K_{ij}(A) = ∫_Ω a∇ψ_j · ∇ψ_i = ∫_Ω ( Σ_{k=1}^m A_k φ_k ) ∇ψ_j · ∇ψ_i = Σ_{k=1}^m ( ∫_Ω φ_k ∇ψ_j · ∇ψ_i ) A_k = Σ_{k=1}^m T_{ijk} A_k,

where

T_{ijk} = ∫_Ω φ_k ∇ψ_j · ∇ψ_i, i, j = 1, 2, . . . , n, k = 1, 2, . . . , m.
It then follows that, for any A, δA ∈ R^m,

(DK(A) δA)_{ij} = Σ_{k=1}^m T_{ijk} δA_k, that is, DK(A) δA = K(δA).

This result, DK(A) δA = K(δA), is to be expected because the operator K : A ↦ K(A) is linear in A. Since F is defined by

F(A) = K(A)^{−1} P,
the result from Section 4.5 applies, and

DF(A) δA = −K(A)^{−1} (DK(A) δA) K(A)^{−1} P = −K(A)^{−1} K(δA) U,

where U = F(A). This formula shows that computing DF(A) δA for a given δA is no more expensive than computing the simulated displacement U (assuming U is computed first, so K(A) and U are already known), and may be much less expensive if the matrix K(A) has already been factored.
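The formula DF(A) δA = −K(A)^{−1} K(δA) U can be checked against finite differences; the tensor T below is random synthetic data rather than a finite element matrix:

```python
import numpy as np

# Check DF(A)dA = -K(A)^{-1} K(dA) U for F(A) = K(A)^{-1} P, where K depends
# linearly on A: K(A)_ij = sum_k T_ijk A_k. T is random synthetic data whose
# slices are symmetric positive definite, so K(A) is invertible for A > 0.
rng = np.random.default_rng(1)
n, m = 4, 3
T = np.empty((n, n, m))
for k in range(m):
    B = rng.standard_normal((n, n))
    T[:, :, k] = B @ B.T + n * np.eye(n)

K = lambda A: T @ A                          # contracts the last index of T
P = rng.standard_normal(n)
F = lambda A: np.linalg.solve(K(A), P)

A = 1.0 + rng.random(m)
dA = rng.standard_normal(m)
U = F(A)
exact = -np.linalg.solve(K(A), K(dA) @ U)    # -K(A)^{-1} K(dA) U

h = 1e-6
fd = (F(A + h * dA) - F(A - h * dA)) / (2 * h)   # central difference
# exact and fd agree to truncation/round-off error.
```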
I will now turn to the computation of DF(A)*. Note that

(K(δA) U)_i = Σ_{j=1}^n (K(δA))_{ij} U_j = Σ_{j=1}^n Σ_{k=1}^m T_{ijk} δA_k U_j = Σ_{k=1}^m ( Σ_{j=1}^n T_{ijk} U_j ) δA_k.

I now define the matrix L = L(U) by
(L(U))_{ik} = Σ_{j=1}^n T_{ijk} U_j,

which allows me to write

K(δA) U = L(U) δA

and

DF(A) δA = −K(A)^{−1} L(U) δA.
The formula for DF(A)* now follows:

DF(A)* δU = −L(U)^T K(A)^{−1} δU

(where I used the fact that K is symmetric).
The relationship between Df(a) and DF(A) is straightforward, and, indeed, is exactly analogous to the relationship between f and F. If a = EA and δa = E δA, that is,

a = Σ_{i=1}^m A_i φ_i, δa = Σ_{i=1}^m δA_i φ_i,

then δu = Df(a) δa and δU = DF(A) δA satisfy

δu = Σ_{i=1}^n δU_i ψ_i.

Indeed, this follows from the chain rule applied to the relationship

f(a) = E_0 F(E^{−1} a).
by the fundamental rule (AB)* = B* A*, which shows that

(E^{−1})* = (E*)^{−1}. (14)
The calculation of E* is exactly the same as for E_0*; the result is

E* = M E^{−1}, (15)

where M ∈ R^{m×m} is defined by

M_{ij} = (φ_i, φ_j)_{L^2(Ω)}.
Together, (14) and (15) yield

(E^{−1})* = E M^{−1}.
The matrix M is symmetric and positive definite and hence invertible; this follows from the fact that {φ_1, φ_2, . . . , φ_m} is linearly independent. Using the expressions for E_0* and (E^{−1})*, (13) yields

Df(a)* = E M^{−1} DF(E^{−1} a)* M_0 E_0^{−1}.

The appearance of the trivial mappings E and E_0^{−1} in this formula is no more significant than it was in the formula for Df(a). On the other hand, the Gram matrices M^{−1} and M_0 appear because of the different inner products used for the two pairs of isomorphic spaces.
9. Avoiding the need to program derivatives
The user of a software package implementing numerical optimization algorithms is required
to provide some computer code (usually a subroutine written in a given language) to evaluate
the objective and constraint functions. (This is how the user specifies his or her problem
to the optimization code.) Typically, the optimization code will need values of various
derivatives, which can be obtained in several ways:
1. The user can provide hand-written computer codes to evaluate the derivatives.
2. The optimization code can estimate the derivatives using finite differences.
3. The derivatives can be produced, either by the user or the optimization code, using
automatic differentiation.
The emphasis of my presentation so far has been on understanding the basic theory of
derivatives, particularly the linear algebraic foundations, and on using this theory to derive
formulas for derivatives of specific functions. Such understanding is essential for hand-
coding derivatives.
Suppose, though, that a user wishes to avoid the labor (and risk²) of programming the derivatives of the problem functions. In this final section, I will briefly discuss the advantages
and disadvantages of the other two approaches to the computation of derivatives, finite
differences and automatic differentiation.
9.1. Finite difference estimation of derivatives
Optimization codes generally use the representers of the relevant derivatives: the gradient
and Hessian of a real-valued function, the Jacobian matrix of a vector-valued function. In
order to be concise, I will mostly limit my discussion to the computation of the gradient of
a real-valued function.
Suppose³ f : R^n → R. Then ∇f(x) is the vector in R^n whose ith component is
∂f/∂x_i (x) = lim_{h→0} [f(x + h e_i) − f(x)]/h, (16)
where e_i is the ith standard basis vector (that is, the vector with every component equal to zero, except the ith, which is one). When the only information available about f is a black box that will return its value for a given x, it is not possible to implement (16) exactly, as the limit operation implies an infinite calculation.
A natural way of approximating ∂f/∂x_i is to simply truncate the limit operation by choosing a small but nonzero value of h:

∂f/∂x_i (x) ≈ [f(x + h e_i) − f(x)]/h. (17)
Indeed, Taylor's theorem,

f(x + h e_i) = f(x) + ∂f/∂x_i (x) h + (1/2) ∂^2 f/∂x_i^2 (x + θ h e_i) h^2, θ ∈ (0, 1),

can easily be rearranged to show that

∂f/∂x_i (x) = [f(x + h e_i) − f(x)]/h − (1/2) ∂^2 f/∂x_i^2 (x + θ h e_i) h, θ ∈ (0, 1).
Thus, the error in (17) is O(h); this error is referred to as the truncation error. The question
now arises: What value ofh should be chosen in practice?
At first glance, it would appear that smaller values of h (the smaller, the better) would tend to lead to better approximations of the partial derivative. Though this is true in exact arithmetic, it does not take into account the effects of floating point (computer) arithmetic. First of all, h cannot be chosen too small in comparison to x_i; otherwise, the values of x_i and x_i + h, rounded to the nearest floating point number, will be identical (and therefore, necessarily, so will be f(x + h e_i) and f(x)). More subtly, the magnitude of ∂^2 f/∂x_i^2 plays a part.
A computer subroutine implementing the evaluation of f will inevitably return inexact results, because of round-off error if for no other reason. Suppose the implemented function
actually returns

f̂(x) = f(x) + e(x),

with

|e(x)| ≤ ε

for all relevant values of x. Then formula (17) will be implemented as
∂f/∂x_i (x) ≈ [f̂(x + h e_i) − f̂(x)]/h
= [f(x + h e_i) + e(x + h e_i) − f(x) − e(x)]/h
= [f(x + h e_i) − f(x)]/h + [e(x + h e_i) − e(x)]/h
= ∂f/∂x_i (x) + (1/2) ∂^2 f/∂x_i^2 (x + θ h e_i) h + [e(x + h e_i) − e(x)]/h.
There is no reason to expect that the function e is differentiable, so all that can be said about the last term in the above expression is that

| [e(x + h e_i) − e(x)]/h | ≤ 2ε/h.
If the second partial derivative of f is bounded by M, then

| ∂f/∂x_i (x) − [f̂(x + h e_i) − f̂(x)]/h | ≤ Mh/2 + 2ε/h. (18)

This bound suggests that the total error in the approximation can grow as h → 0, since the round-off error (or at least the bound for it) grows as h is decreased.
Thus smaller values of h are not necessarily better in practice, and so the question remains: How should h be chosen in practice? One idea would be to choose h to minimize the bound in (18). This leads to the value

h = 2√(ε/M).
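The trade-off can be seen in a few lines; for f(x) = e^x at x = 0 one has M ≈ 1 and ε ≈ 10^{−16}, so the bound (18) is minimized near h = 2·10^{−8}:

```python
import math

# Forward-difference error for f(x) = exp(x) at x = 0 (f'(0) = 1).
def fd_error(h):
    return abs((math.exp(h) - math.exp(0.0)) / h - 1.0)

h_opt = 2.0 * math.sqrt(1e-16)       # h = 2*sqrt(eps/M) with M ~ 1
# The error near h_opt is far smaller than at either much larger h
# (truncation dominates) or much smaller h (round-off dominates).
```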
However, this result is of limited use, since the value M is not available in general. It does suggest, however, that

h = O(√ε)
is reasonable, and an estimate of ε might be available. The usual choice is

h = sign(x_i) |x_i| √ε, (19)

with some adjustment made if |x_i| is too close to zero. With h determined by some variation on (19), the error in the computed partial derivative
is O(√ε). This leads to the first disadvantage of using finite difference estimates for partial derivatives, and thus for gradients: the attainable accuracy in the computed minimizer is limited. After all, algorithms for numerical optimization are based on the necessary condition that

∇f(x) = 0

at a minimizer (or the analogous Lagrange multiplier conditions, which also involve ∇f(x), for a constrained minimization problem). It is easy to see that the minimizer cannot be reliably computed to an accuracy greater than the accuracy with which the gradient is computed.
The foregoing disadvantage of finite differences is only important for small problems, when it is reasonable (and may be important) to compute the solution to a high degree of accuracy. A more serious objection is related to the computational cost of using finite differences: to estimate ∇f(x) costs n evaluations of the function f (assuming that f(x) must be computed anyway in the course of the optimization algorithm). For any problem in which it is expensive to evaluate f, or n is large, or both, this cost may be unacceptable. By comparison, the examples given in Sections 7 and 8 yield formulas that will result in the computation of the gradient at a cost equal to a small multiple of the cost of computing the function itself. (Note, though, in the case of the adjoint state method detailed in Section 7, this efficiency depends on the use of the checkpointing scheme that was only briefly mentioned.)
The major advantage of using finite differences is obvious: the user need only implement the problem functions and not their derivatives. The optimization code can then take care of all details concerning the estimation of derivatives, including the choice of the step size h (although the user may need to provide an estimate of ε). When the cost is affordable and there is no need for high-accuracy solutions, this makes finite differences an attractive option. Although I will not discuss it here, finite difference methods can also be devised for computing Jacobian and Hessian matrices.
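A sketch of forward-difference gradient estimation in the spirit of (19) (the safeguard for small |x_i| is one of several reasonable choices):

```python
import math

# Forward-difference gradient with step h_i = sign(x_i)|x_i|sqrt(eps),
# safeguarded when |x_i| is small (one of several reasonable safeguards).
def fd_gradient(f, x, eps=2.2e-16):
    g = []
    fx = f(x)
    for i in range(len(x)):
        h = math.sqrt(eps) * max(abs(x[i]), 1.0)
        if x[i] < 0.0:
            h = -h
        xp = list(x)
        xp[i] += h
        h = xp[i] - x[i]             # use the exactly representable step
        g.append((f(xp) - fx) / h)
    return g

# Invented example: f(x) = x1^2 + 3*x1*x2 has gradient (2*x1 + 3*x2, 3*x1).
f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]
g = fd_gradient(f, [1.0, 2.0])       # approximately [8.0, 3.0]
```

Note the n extra function evaluations beyond f(x) itself, the cost discussed above.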
9.2. Automatic differentiation
Automatic, or algorithmic, differentiation (AD) is a term applied to a collection of techniques for automatically producing derivatives of functions implemented in computer subroutines.
AD tools can analyze a computer program that implements a mathematical function or oper-
ator, and systematically apply the rules of differentiation, notably the chain rule, producing
a new computer program implementing the desired derivative.
There are two primary approaches to automatic differentiation, operator overloading and
source transformation. I will only discuss the source transformation approach, and, indeed,
will focus on a single AD tool, TAMC (Giering, 1999). Another source transformation
tool is ADIFOR (Bischof et al., 1992). An AD package that uses the operator overloading
approach is ADOL-C (Griewank et al., 1996). For a more complete discussion of automatic
differentiation, see Griewank (2000). The following discussion is taken from the report
(Gockenbach, 2000), which may be consulted for more details.
The Tangent linear and Adjoint Model Compiler (TAMC), designed and implemented by Ralf Giering (1999), is an Automatic Differentiation (AD) package that produces linearized and adjoint code for nonlinear operators. To be more precise, given a Fortran subroutine implementing an operator of the form F : R^n → R^m, TAMC can produce code that computes DF(x) δx and DF(x)* ȳ. TAMC can also produce derivatives and adjoints for operators defined on product spaces, such as G : R^n × R^k → R^m.
Although TAMC produces correct and efficient code, the exact operation of TAMC-
generated code can be slightly counter-intuitive to those not well-versed in AD. I will
present an explicit mathematical model for an operator as implemented by a computer program, and explain the output of TAMC in terms of this model.
The following simple example will serve to introduce some of the issues encountered in using TAMC. Define F : R → R by y = F(x) = x^2. This operator is implemented in the following Fortran subroutine:

      subroutine F(x,y)
      double precision x,y
      y = x*x
      return
      end
TAMC generates the following adjoint code (stripped of TAMC-generated comments):
subroutine adF(x,adx,ady)
implicit none
double precision adx
double precision ady
double precision x
adx = adx+2*ady*x
ady = 0.d0
end
This code correctly computes the adjoint of DF(x); however, the value DF(x)* ȳ is added to (rather than assigned to) the output variable. Moreover, the input variable is assigned the value of zero after it is used. That is, instead of implementing

x̄ ← DF(x)* ȳ,
the TAMC-generated code implements
x̄ ← x̄ + DF(x)* ȳ,
ȳ ← 0.
Below I will show how this result could have been predicted.
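The convention can be mimicked in a few lines of Python (a toy reproduction for illustration, not TAMC output): the adjoint routine accumulates DF(x)* ȳ into the adjoint input variable and zeroes the adjoint output variable, and a dot-product test confirms the adjoint relationship:

```python
# Toy Python reproduction of the TAMC convention for F(x) = x^2:
# the adjoint routine computes (adx, ady) <- (adx + 2*x*ady, 0).
def F(x):
    return x * x

def adF(x, adx, ady):
    adx = adx + 2.0 * ady * x        # accumulate DF(x)* ady into adx
    ady = 0.0                        # the adjoint output variable is consumed
    return adx, ady

# Dot-product check: (DF(x) dx) * ybar == dx * (DF(x)* ybar).
x, dx, ybar = 3.0, 0.7, 1.3
tangent = 2.0 * x * dx               # DF(x) dx
adx, ady = adF(x, 0.0, ybar)         # adx = DF(x)* ybar since adx started at 0
# tangent * ybar and dx * adx agree to rounding error.
```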
9.2.1. The mathematical structure of a subroutine implementing an operator. Consider an operator F : R^n → R^m. A Fortran subroutine implementing y ← F(x) would have arguments x, y, as well as (possibly) other arguments involved in the definition of the operator (grid parameters, constants, etc.). The subroutine (which may call other subroutines) consists of a sequence of statements which together perform the desired calculation. A number of variables are involved: x and y, any variables required to hold intermediate quantities, loop control indices, etc. Some of the variables merely serve to control the flow of the executable statements, and are not important in developing a mathematical model of the subroutine. The crucial variables are the active variables, which are necessarily of floating point type. A variable u is active if

- the final value of an output variable (i.e. one of the components of y) depends on the value of u at some step i; and
- the value of u at step i depends on the initial value of an input variable (one of the components of x).
The input and output variables are active by definition. The phrase "variable w at step j depends on the value of variable z at step i" will be left undefined. The intuitive meaning is that the value of w at step j is linked to the value of z by a sequence of assignment statements, the last of which assigns a value to w, and the first of which (and perhaps others) has z, holding its value from step i, in the right-hand side. (A precise definition of this concept would be required to implement a package such as TAMC, but is not needed to understand how it works.)
Now let S be the set of all active variables appearing in the subroutine (or in subroutines called by it; I will not bother to make this distinction). Identifying S with a Euclidean space R^N, F : R^n → R^m can be viewed as the composition of operators

F = P ∘ F_M ∘ F_{M−1} ∘ · · · ∘ F_1 ∘ Q,

where

F_i : S → S, i = 1, 2, . . . , M

and Q and P are the natural projections onto the domain and range, respectively, of F. (That is, P : S → R^m is defined by

(Ps)_i = s_{j_i}, i = 1, 2, . . . , m,
where y_i = s_{j_i} (recall that each active variable, including every component of y, is identified with a component of s). The injection Q : R^n → S is defined similarly.) Each statement assigning a value to an active variable can be thought of as implicitly defining one of the operators F_i, and it is in the sense of these operators that I spoke of steps in the previous paragraph. Of course, most steps will only involve a few variables, so most of the active variables will retain their previous values. The role of the projectors Q and P should be clear: Q assigns to the input variables their initial values and assigns zero to all other variables; P extracts from the set of all active variables the output F(x).
There may be other floating-point variables that are not active by this definition. For example, if the final value of an output variable depends on the value of z at some step, but that value of z does not depend on the initial value of any input variable, then z plays the role of a constant. (Input arguments to the subroutine other than x can be constants.) It is also possible to have variables which depend on the input variables but do not influence any output variable. Such a variable can be called diagnostic. For an example of a diagnostic variable, consider the variable z in the following program fragment:
z = x(1)*x(1) + x(2)*x(2)
if (z .gt. 1.0d0) then
   y(1) = 2.0d0*x(1)
else
   y(1) = x(1)*x(1)
endif
Constant and diagnostic variables are called passive variables.
By way of example, consider an assignment statement of the form

w = g(u).

Assuming u and w are active variables, say u = s_{i1}, w = s_{i2} (s ∈ S), this implicitly defines the operator F_k: S → S,

(F_k(s))_j = g(s_{i1}) for j = i2, and (F_k(s))_j = s_j for j ≠ i2.
It is instructive to compute D F_k(s) and (D F_k(s))^*. The derivative is given by

(D F_k(s) δs)_j = g'(s_{i1}) δs_{i1} for j = i2, and (D F_k(s) δs)_j = δs_j for j ≠ i2.

Therefore,

⟨D F_k(s) δs, r⟩ = Σ_{j ≠ i2} δs_j r_j + g'(s_{i1}) δs_{i1} r_{i2}
                = Σ_{j ≠ i1, i2} δs_j r_j + δs_{i1} (r_{i1} + g'(s_{i1}) r_{i2}) + δs_{i2} · 0,

from which it follows that the adjoint is given by

((D F_k(s))^* r)_j = r_{i1} + g'(s_{i1}) r_{i2} for j = i1, 0 for j = i2, and r_j otherwise.
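This computation encodes the adjoint identity ⟨D F_k(s) δs, r⟩ = ⟨δs, (D F_k(s))^* r⟩, which can be checked numerically. The following is a minimal Python sketch (mine, not the paper's; the choice g = sin and the index values are arbitrary) of the step F_k, its derivative, and its adjoint:

```python
import math

def Fk(s, i1, i2, g):
    # the step w = g(u): overwrite component i2 with g(s[i1])
    t = list(s)
    t[i2] = g(s[i1])
    return t

def DFk(s, ds, i1, i2, gprime):
    # derivative: (DFk(s) ds)_j = g'(s_i1)*ds_i1 if j == i2, else ds_j
    dt = list(ds)
    dt[i2] = gprime(s[i1]) * ds[i1]
    return dt

def DFk_adj(s, r, i1, i2, gprime):
    # adjoint: r_i1 += g'(s_i1)*r_i2, then r_i2 = 0
    ra = list(r)
    ra[i1] = r[i1] + gprime(s[i1]) * r[i2]
    ra[i2] = 0.0
    return ra

# dot-product test: <DFk(s) ds, r> == <ds, DFk(s)* r>
s  = [1.0, 2.0, 3.0]
ds = [0.1, 0.2, 0.3]
r  = [4.0, 5.0, 6.0]
lhs = sum(a * b for a, b in zip(DFk(s, ds, 0, 2, math.cos), r))
rhs = sum(a * b for a, b in zip(ds, DFk_adj(s, r, 0, 2, math.cos)))
assert abs(lhs - rhs) < 1e-12
```

Agreement of lhs and rhs (the "dot-product test") is the standard check that an adjoint routine is consistent with the corresponding derivative routine.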
F_1(x,y) = (x, x²),  P(x,y) = y.

The interpretation adopted by TAMC is that the subroutine implements the operator

G_1(x,y) = (x, x²).

Note that

D G_1(x,y)(δx, δy) = (δx, 2x δx),
(D G_1(x,y))^* (δx, δy) = (δx + 2x δy, 0).

The derivative of F is given by

D F(x) δx = 2x δx.
TAMC produces the following subroutine for the derivative:
subroutine g_f(x,g_x,g_y)
implicit none
double precision g_x
double precision g_y
double precision x
g_y = 2*g_x*x
end
This subroutine performs the computation

D G_1(x,y)(δx, δy) = (δx, 2x δx);

that is, it performs the operation

δx ↦ δx (implicitly),
δy ↦ 2x δx.

In this case, the subroutine can also be regarded as performing the desired operation

δy ← D F(x) δx.
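For illustration, the effect of g_f can be mimicked in Python (a sketch of mine, not TAMC output); the single statement implements δy = 2x δx:

```python
def g_f(x, dx):
    # mimics the TAMC routine g_f for F(x) = x^2: g_y = 2*g_x*x
    return 2.0 * dx * x

# forward mode: dy = DF(x) dx = 2*x*dx
dy = g_f(3.0, 0.5)
assert dy == 2.0 * 3.0 * 0.5  # = 3.0
```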
The adjoint of D F(x) is given by

D F(x)^* δy = 2x δy.
TAMC produces the following adjoint code:
subroutine adf(x,adx,ady)
implicit none
double precision adx
double precision ady
double precision x
adx = adx+2*ady*x
ady = 0.d0
end
This subroutine performs the operation
(δx, δy) ↦ (D G_1(x,y))^* (δx, δy),
as would be predicted by my discussion above.
In terms of the original operator F, the Fortran command
call adf(x,dx,dy)
does the following:
- adds the value D F(x)^* δy to dx, assuming that x and dy have been initialized to hold the values of x and δy, respectively;
- sets dy to zero.
To use adf in the desired manner (given x, δy, compute δx = D F(x)^* δy), one would have to perform the following steps:
1. initialize x and dy to the values of x and δy, respectively;
2. set dx to zero;
3. save dy (assuming that its value is wanted later);
4. call adf(x,dx,dy);
5. restore the value of dy.
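These steps can be sketched in Python (my translation, for illustration; a dictionary stands in for Fortran's pass-by-reference arguments):

```python
def adf(state):
    # Python rendering of the TAMC routine adf:
    # adx = adx + 2*ady*x ; ady = 0
    state["adx"] = state["adx"] + 2.0 * state["ady"] * state["x"]
    state["ady"] = 0.0

def apply_adjoint(x, dy):
    """Compute dx = DF(x)* dy = 2*x*dy via the five-step protocol."""
    state = {"x": x, "ady": dy, "adx": 0.0}  # steps 1-2: initialize, zero dx
    saved_dy = state["ady"]                  # step 3: save dy
    adf(state)                               # step 4: call adf
    state["ady"] = saved_dy                  # step 5: restore dy
    return state["adx"]

assert apply_adjoint(3.0, 1.5) == 2.0 * 3.0 * 1.5  # DF(x)* dy = 2x dy
```

The save/restore of dy is needed only because the generated adjoint zeroes its input.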
Alternatively, one could hand-edit the routine adf so as to
1. replace the "add-to" statements with simple assignments;
2. remove the statements that change the input variable (dy in this example).
As a second example, suppose the operator
F(x,y) = x²y
is implemented so that the result F(x,y) overwrites one of the inputs, say (arbitrarily) y.
This is done in the following subroutine:
subroutine F(x, y)
double precision x,y,w
w = x*x
y = w*y
return
end
The active variables are now x, y, w, and F can be written as F = P ∘ F_2 ∘ F_1 ∘ Q, where

F_1(x,y,w) = (x, y, x²),  F_2(x,y,w) = (x, wy, w).

TAMC regards the subroutine as implementing G_2 ∘ G_1, where, again, G_1 = F_1 and G_2 = F_2. Now,

D G_1(x,y,w)(δx, δy, δw) = (δx, δy, 2x δx),
D G_2(x,y,w)(δx, δy, δw) = (δx, w δy + y δw, δw),

and

(D G_1(x,y,w))^* (δx, δy, δw) = (δx + 2x δw, δy, 0),
(D G_2(x,y,w))^* (δx, δy, δw) = (δx, w δy, δw + y δy).
The TAMC-generated code for the derivative is:
subroutine g_f(x,y,g_x,g_y)
implicit none
double precision g_x
double precision g_y
double precision x
double precision y
double precision g_w
double precision w
g_w = 2*g_x*x
w = x*x
g_y = g_w*y+g_y*w
end
The first executable statement computes the action of D G_1(x,y,w), the second computes G_1(x,y,w), and the third computes the action of D G_2(G_1(x,y,w)). The behavior of this subroutine is exactly as expected, and as desired; the Fortran command

call g_f(x,y,dx,dy)

overwrites dy with D F(x,y)(δx, δy), assuming that x, y, dx, and dy have previously been initialized with the values x, y, δx, and δy, respectively.
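In Python (again my translation, for illustration), the three executable statements and the claimed behavior can be checked as follows:

```python
def g_f(x, y, dx, dy):
    # mimics the TAMC routine g_f for F(x,y) = x^2 * y (result overwrites dy)
    g_w = 2.0 * dx * x       # action of DG1
    w = x * x                # recompute G1
    return g_w * y + dy * w  # action of DG2

x, y, dx, dy = 2.0, 3.0, 0.5, 0.25
# DF(x,y)(dx,dy) = 2*x*y*dx + x**2*dy
assert g_f(x, y, dx, dy) == 2.0 * x * y * dx + x * x * dy  # = 7.0
```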
TAMC generates the following code for the adjoint:
subroutine adf(x,y,adx,ady)
implicit none
double precision adx
double precision ady
double precision x
double precision y
double precision adw
double precision w
adw = 0.d0
w = x*x
adw = adw+ady*y
ady = ady*w
adx = adx+2*adw*x
adw = 0.d0
end
This subroutine initializes the local variable adw to zero (the first executable statement), computes G_1(x,y,w) (the second statement), applies (D G_2(G_1(x,y,w)))^* (statements three and four), and applies (D G_1(x,y,w))^* (statements five and six). This behavior is consistent with my description above; note that its effect is the following:

δy ↦ x² δy,
δx ↦ δx + 2xy δy.

The desired behavior of the subroutine is

δy ↦ x² δy,
δx ↦ 2xy δy.
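The discrepancy is easy to see in a Python rendering of adf (my translation, for illustration):

```python
def adf(x, y, adx, ady):
    # mimics the TAMC adjoint routine for F(x,y) = x^2 * y
    adw = 0.0                  # initialize local adjoint variable
    w = x * x                  # recompute G1
    adw = adw + ady * y        # (DG2)*, statements three and four
    ady = ady * w
    adx = adx + 2.0 * adw * x  # (DG1)*, statements five and six
    adw = 0.0
    return adx, ady

x, y, dx, dy = 2.0, 3.0, 0.5, 0.25
adx, ady = adf(x, y, dx, dy)
assert ady == x * x * dy             # as desired: x^2 * dy
assert adx == dx + 2.0 * x * y * dy  # unwanted add of the initial dx
```

Initializing dx to zero before the call yields the desired δx = 2xy δy.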
The apparent advantages of AD are:
- The user avoids the labor-intensive and error-prone task of implementing derivatives by hand.
- If it is necessary to modify the original function, its derivatives can be modified automatically by the AD tool, again saving time spent writing and debugging code.
- Modern AD tools can handle very complex code and produce, in many cases, efficient derivative code.
The disadvantages are not quite so obvious, and stem from the fact that the foregoing advantages are not quite realized:

- My discussion of the code generated by TAMC shows that, in fact, the code produced by a fully automatic AD tool may need some modification by hand to achieve the desired results as efficiently as possible.
- Code produced by a fully automatic AD tool can be inefficient for certain applications. An example of this is provided by the adjoint state calculation of Section 7; an AD tool would tend to either save or re-compute all of the intermediate computations needed for the reverse time-stepping calculation. Either approach is significantly inefficient in some applications.
In summary, my view is that automatic differentiation is very useful in a variety of situations, but it must be used with care if efficiency is a prime consideration. In particular, fully automatic differentiation may not be satisfactory for some applications.⁴
Notes
1. I am far from the first to notice this. See, for example, Groetsch (1980):

   A closely guarded secret in some elementary calculus courses is the fact that the basic idea of differential calculus is the local approximation of a nonlinear function by a linear function. To quote from Dieudonne (1969), "In the classical teaching of calculus, this idea is immediately obscured by the accidental fact that, on a one-dimensional vector space, there is a one-to-one correspondence between linear forms and numbers, and therefore the derivative at a point is defined to be a number instead of a linear form."

   The texts of Dieudonne and Groetsch are general references for the material in this paper; see also the survey paper by Tapia (1971).
2. Probably the most common difficulty encountered when using packaged optimization software results from the user providing incorrect derivatives.
3. It is possible to compute gradients without referring to coordinates explicitly, for instance in the formula ∇J(x) = D F(x)^* (F(x) − d) for J(x) = (1/2)‖F(x) − d‖². However, finite-difference derivatives depend on an explicit coordinate representation, and so I may as well assume that the function is defined on R^n.
4. See Griewank (2000), page 92:
   As a rule, a general-purpose AD tool will not produce transformed code as efficient as that produced by a special-purpose translator designed to work only with underlying code of a particular structure, since the latter can make assumptions (often with far-reaching consequences), whereas the former can only guess.
A tool that requires the user to rewrite the underlying code in order to make explicit such factors as variable
dependence, structural sparsity, interface width, and memory access patterns will be able to produce efficient
transformed code more easily than a tool that must use internal analysis of the underlying program to deduce
these structural features, but that does not require such a great effort from the user. On the other hand, the
first sort of tool is difficult to apply to legacy code. A possible compromise is a tool that allows, but does not
require, the user to insert directives into the program text.
The latest version of TAMC does allow user-defined directives, making efficient code more attainable.
References
C. Bischof, A. Carle, G. Corliss, A. Griewank, and P. Hovland, "Adifor: Generating derivative code from Fortran programs," Scientific Programming, Vol. 1, pp. 1-29, 1992.
J. Dieudonne, Foundations of Modern Analysis, Academic Press: New York, London, 1969.
R. Giering, Tangent Linear and Adjoint Model Compiler, User Manual 1.4, 1999. URL: http://puddle.mit.edu/ralf/tamc
M. S. Gockenbach, "Understanding code generated by TAMC," Department of Computational and Applied Mathematics, Rice University, Houston, TX, Technical Report TR00-30, 2000.
M. S. Gockenbach, D. R. Reynolds, and W. W. Symes, "Efficient and automatic implementation of the adjoint state method," in press.
A. Griewank, "Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation," Optimization Methods and Software, Vol. 1, pp. 35-54, 1992.
A. Griewank, D. Juedes, and J. Utke, "ADOL-C, a package for the automatic differentiation of algorithms written in C/C++," ACM TOMS, Vol. 22, pp. 131-167, 1996.
A. Griewank, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, SIAM: Philadelphia, 2000.
C. W. Groetsch, Elements of Applicable Functional Analysis, Marcel Dekker: New York, 1980.
W. W. Symes, J. O. Blanch, and R. Versteeg, "A numerical study of linear inversion in layered viscoacoustic media," in Comparison of Seismic Inversion Methods on a Single Real Dataset, R. Keys and D. Foster, eds., Society of Exploration Geophysicists: Tulsa, 1998.
R. Tapia, "The differentiation and integration of nonlinear operators," in Nonlinear Functional Analysis and Applications, L. Rall, ed., Academic Press: New York, 1971.