The solution of nonlinear inverse problems and the Levenberg-Marquardt method

Jose Pujol¹

GEOPHYSICS, VOL. 72, NO. 4 (JULY-AUGUST 2007); P. W1–W16, 6 FIGS., 1 TABLE. 10.1190/1.2732552

Manuscript received by the Editor June 27, 2006; revised manuscript received January 10, 2007; published online May 30, 2007.
¹University of Memphis, Department of Earth Sciences, Memphis, Tennessee. E-mail: [email protected].
© 2007 Society of Exploration Geophysicists. All rights reserved.

ABSTRACT

Although the Levenberg-Marquardt damped least-squares method is an extremely powerful tool for the iterative solution of nonlinear problems, its theoretical basis has not been described adequately in the literature. This is unfortunate, because Levenberg and Marquardt approached the solution of nonlinear problems in different ways and presented results that go far beyond the simple equation that characterizes the method. The idea of damping the solution was introduced by Levenberg, who also showed that it is possible to do that while at the same time reducing the value of a function that must be minimized iteratively. This result is not obvious, although it is taken for granted. Moreover, Levenberg derived a solution more general than the one currently used. Marquardt started with the current equation and showed that it interpolates between the ordinary least-squares method and the steepest-descent method. In this tutorial, the two papers are combined into a unified presentation, which will help the reader gain a better understanding of what happens when solving nonlinear problems. Because the damped least-squares and steepest-descent methods are intimately related, the latter is also discussed, in particular in its relation to the gradient. When the inversion parameters have the same dimensions (and units), the direction of steepest descent is equal to the direction of minus the gradient. In other cases, it is necessary to introduce a metric (i.e., a definition of distance) in the parameter space to establish a relation between the two directions. Although neither Levenberg nor Marquardt discussed these matters, their results imply the introduction of a metric. Some of the concepts presented here are illustrated with the inversion of synthetic gravity data corresponding to a buried sphere of unknown radius and depth. Finally, the work done by early researchers that rediscovered the damped least-squares method is put into a historical context.

INTRODUCTION

Anyone working on inverse problems will immediately recognize the equation

(A^T A + λI)δ = A^T c,  (1)

where λ > 0 and I is the identity matrix. The symbols used may be different, but the meaning of equation 1 will be clear, namely it represents the Levenberg-Marquardt damped least-squares solution of the equation

Aδ = c.  (2)

Equation 1 arises in a number of situations. The one to be discussed here is the solution of linearized nonlinear inverse problems, with δ a vector of adjustments (or corrections) to the values of the parameters about which the problem is linearized. Once equation 1 is solved, δ is added to the initial vector of parameters and the resulting vector is used as a new initial vector. In this way an iterative process is established, but as early workers noted, convergence is not assured when δ is computed using ordinary least squares (i.e., when equation 1 with λ = 0 is used). A typical reason for the lack of convergence is that the initial values of the parameters are far from the values that actually solve the nonlinear problem, which means that the assumptions behind the linearization of the problem are not valid. As a consequence, the absolute values of the components of δ may become larger or oscillate as the iterations proceed, instead of going through the steady decrease that characterizes a convergent process.
This problem is solved by Levenberg (1944) and independently by Marquardt (1963), who used different approaches that led to similar solutions. The purpose of this tutorial is to go through their arguments in detail, which will help to gain a better understanding of what really



happens when solving a linearized nonlinear problem. To give the readers a flavor of the matters to be discussed, the basic features of the two approaches are summarized below.

Levenberg (1944) solves the problem of the lack of convergence by introducing (and naming) the damped least-squares method. The basic idea was to damp (i.e., limit) the values of the parameters at each iteration. Specifically, instead of using the function S (see below) whose minimization leads to the ordinary least-squares solution, Levenberg minimizes the function S̄ = wS + Q, where w > 0 and Q is a linear combination of the components of δ squared. The result of this minimization is a generalization of equation 1, with λI replaced by a diagonal matrix D with nonnegative elements. Levenberg's contribution to the solution of the problem did not stop here, however. Equally important are his proofs that the minimization of S̄ leads to a decrease in the values of S and the function whose linearization leads to S (i.e., the function s below). The reduction in the value of s does not occur for all values of w (which is equal to 1/λ when D = I), and Levenberg suggests a way to find the value of w leading to a reduction in the value of s. Another important result is that the Q corresponding to the damped least-squares solution is always (i.e., for all w) smaller than when damping is not applied. These results are not obvious, but are rarely considered when the damped least-squares method is introduced in the literature.

Marquardt (1963), on the other hand, starts with equation 1 and investigates the angle between the computed δ and the direction of steepest descent of s (equal to −∇s, where ∇ stands for gradient). When λ goes to infinity, the contribution of A^T A in equation 1 becomes negligible and the result is the equation used in the method of steepest descent. This method generally produces a significant reduction in the value of s in the early iterations, but becomes extremely slow after that, to the point that convergence to the solution may not be achieved even after a large number (hundreds or more) of iterations (e.g., Gill et al., 1981, and the example below). On the other hand, the ordinary least-squares method (known as the Gauss-Newton method) has the ability to converge to the solution quickly when the starting point is close to the solution (and even when far from it, as the example below shows). Marquardt proves that equation 1 interpolates between the two methods, and shows that the angle between δ and −∇s is a monotonically decreasing function of λ, with the angle going to zero as λ goes to infinity. Based on this fact, Marquardt proposes a simple algorithm in which at each iteration the value of λ is modified to assure that the corresponding value of s becomes smaller than in the previous iteration. Marquardt also recognizes that the term λI in equation 1 assures that the matrix in parentheses is better conditioned than A^T A and that the angle between δ and −∇s is always less than 90°. If this condition is not met, the iterative process may not be convergent.

Although neither Levenberg nor Marquardt discusses the steepest-descent method itself, this tutorial would be incomplete without consideration of the relation between the direction of steepest descent and the gradient, which is not unique when inversion parameters have different dimensions or units. In such cases, it is not obvious how to measure the distance between two points in parameter space, and as a result, equating the direction of steepest descent to the direction of minus the gradient becomes meaningless (Feder, 1963). It is only when a definition of distance (i.e., a metric) is introduced that the two directions become uniquely related. These questions will be discussed first to put Levenberg's and Marquardt's approaches into a broader perspective, as they involve, either directly or indirectly, the introduction of a metric.

The concepts introduced here are illustrated with a simple example involving the inversion of gravity data corresponding to a buried sphere. In this case, the unknown parameters are the radius of the sphere and the depth to its center. By limiting the number of inversion parameters to two, it is easy to visualize both the function s and the path followed by the parameters as a function of the iterations for different initial values of the parameters and λ, and for solutions obtained using the damped and ordinary least-squares methods and the steepest-descent method.

This tutorial concludes with a historical note. Although Levenberg's paper was published almost twenty years before Marquardt's, it went almost unnoticed in spite of its practical importance. Interestingly, an internet search uncovered a paper by Feder (1963) on computerized lens design that shows that ideas similar to that of Levenberg had been rediscovered more than once. Feder's paper, in turn, led to a paper by Wynne (1959), which anticipates some of the ideas in Marquardt's approach. Yet the fact remains that it was Marquardt's paper that popularized the damped least-squares method, a fact he attributed to his having distributed hundreds of copies of his FORTRAN code!

THE GAUSS-NEWTON METHOD

Let f be a function of the independent variables v_k, k = 1,2,..., and the parameters x_j, j = 1,2,...,n. For convenience, the variables and parameters will be considered the components of vectors v and x, respectively. To identify a particular set of values of the variables we will use symbols such as v_i. Let us assume that f is a mathematical model for observations of interest to us, and that o_i is the observation corresponding to the set of variables v_i, so that

o_i ≈ f(v_i, x) ≡ f_i(x);  i = 1,...,m.  (3)

Let us define the residual ε_i(x) as

ε_i(x) = o_i − f_i(x);  i = 1,...,m.  (4)

We are interested in finding the set of parameters x_j that minimize the sum of residuals squared, namely

s(x) = Σ_{i=1}^m ε_i²(x).  (5)

A function that measures the misfit between observations and model values, such as s(x), is known as a merit function. Other terms found in the parameter estimation and optimization literature are objective, loss, and risk function. If f_i(x) is a nonlinear function of the parameters, the minimization of equation 5 generally requires the use of numerical methods. A typical approach is to express f_i(x) in terms of its linearized Taylor expansion about an initial solution x_j^o (j = 1,...,n) at which s does not have a stationary point. This gives

f_i(x) ≈ f_i(x_o) + Σ_{j=1}^n (∂f_i/∂x_j)|_{x=x_o} (x_j − x_j^o);  i = 1,...,m,  (6)

where x_o has components x_j^o. Using this expression with equation 3, we can introduce a new set of residuals


r_i = o_i − f_i(x_o) − Σ_{j=1}^n a_ij δ_j,  (7)

where

δ_j = x_j − x_j^o,  (8)

and

a_ij = (∂f_i/∂x_j)|_{x=x_o}.  (9)

Note that f_i(x_o) and the derivatives have specific numerical values, while the δ_j are unknown. Equation 7 will be written in matrix form as

r = c − Aδ,  (10)

where r and δ have components r_i and δ_i, A denotes the matrix of derivatives, and c is the vector with components

c_i = o_i − f_i(x_o).  (11)

Now we will look for the vector δ that minimizes

S(x) = Σ_{i=1}^m r_i² = r^T r = (c^T − δ^T A^T)(c − Aδ) = c^T c − 2c^T Aδ + δ^T A^T Aδ.  (12)

The superscript T indicates transposition. Before proceeding, however, it is necessary to make a comment on notation. Strictly speaking, the right-hand side of equation 12 is a function of δ, but because the x that minimizes equation 5 will be determined iteratively, we will be interested in deriving results involving x, which from equation 8 is equal to

x = x_o + δ.  (13)

Therefore, S can be considered a function of x. The minimization of S requires computing its derivative with respect to δ and setting it equal to zero:

∂S/∂δ = (∂S/∂δ_1  ∂S/∂δ_2  ...)^T = −2A^T c + 2A^T Aδ = 0  (14)

(Seber, 1977), which leads to the well-known ordinary least-squares equation

A^T Aδ = A^T c.  (15)

In this section, we will assume that (A^T A)^{−1} exists, which means that equation 15 can be solved for δ. When this assumption is not valid, a different method (such as damped least squares) should be used.

Now it remains to show that the δ obtained from equation 15 minimizes S. To see that, we must examine the Hessian of S, H_S, which is the matrix of second derivatives of S with respect to the components of δ. Because the quadratic terms in equation 12 are of the form (A^T A)_{mn} δ_m δ_n and A^T A is symmetric,

(H_S)_{kl} = ∂²S/(∂δ_k ∂δ_l) = 2(A^T A)_{kl}.  (16)

A condition for a minimum of S to exist is that H_S be positive definite (e.g., Apostol, 1969), which is the case when (A^T A)^{−1} exists (see Appendix A).

These results allow us to establish, in principle, the following iterative process. Solve equation 15 for δ, use equation 13 to compute x, which becomes a new x_o, and use it to compute new values of A and c (using equations 9 and 11). Then solve for δ again. To make the process clear, we will describe the two steps that lead to the estimate x^(p+1) for the (p+1)th iteration. First, solve

(A^T A)^(p) δ^(p) = (A^T c)^(p),  (17)

where the superscript (p) indicates iteration number and A and c have components

a_ij^(p) = (∂f_i/∂x_j)|_{x=x^(p)};  c_i^(p) = o_i − f_i(x^(p)).  (18)

Then compute the updated estimate using

x^(p+1) = x^(p) + δ^(p).  (19)

We will let x^(0) = x_o and will apply the superscript (p) to any function of x computed at x^(p).

To stop the iterative process, one can introduce conditions such as

|δ_j^(p)| ≤ δ_j^min;  j = 1,2,...,n,  (20)

s^(p+1) ≤ s^min,  (21)

p > p^max,  (22)

where the values on the right sides of these equations are preestablished values.

The iterative minimization method with the δ in equation 19 computed using equation 17 is known as the Gauss-Newton method, and can be derived using a different approach (e.g., Gill et al., 1981; see also the note at the end of the next section). As noted in the Introduction, however, a problem with this method is that δ^(0) may be too large if x_o is too far from its optimal value, which in turn may lead to a nonconvergent iterative process. The following example will illustrate some of the features of the method.
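As a concrete illustration of equations 17-19 with the stopping condition of equation 20, here is a minimal Python sketch of the Gauss-Newton loop. The function and variable names are mine, and the forward-difference Jacobian is an assumption for generality; in the example below the derivatives can also be computed analytically.

import numpy as np

def jacobian(f, x, h=1e-6):
    # Forward-difference approximation to a_ij = df_i/dx_j (equation 9).
    fx = f(x)
    J = np.empty((fx.size, x.size))
    for j in range(x.size):
        xj = x.copy()
        xj[j] += h
        J[:, j] = (f(xj) - fx) / h
    return J

def gauss_newton(f, x0, obs, tol=1e-5, max_iter=50):
    # Iterative minimization of s(x) = sum_i (obs_i - f_i(x))^2.
    x = np.asarray(x0, dtype=float)
    for p in range(max_iter):
        c = obs - f(x)                             # residual vector, equation 11
        A = jacobian(f, x)                         # matrix of derivatives, equation 9
        delta = np.linalg.solve(A.T @ A, A.T @ c)  # normal equations, equation 17
        x = x + delta                              # update, equation 19
        if np.all(np.abs(delta) <= tol):           # stopping condition, equation 20
            break
    return x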

Example 1a

Consider a buried homogeneous sphere of radius a with center at (y_o, z), where y_o is measured along the y-axis (horizontal) and z is depth. The vertical component of the gravitational attraction caused by the sphere at a point y_i at zero depth is given by

g(y_i, z, a) = (4π/3) γ D a³ z / [(y_i − y_o)² + z²]^{3/2}  (23)

(e.g., Dobrin, 1976), where γ is the gravitational constant and D is the density contrast (equal to the difference of the densities of the sphere and the surrounding medium, assumed homogeneous). For distance and density in km and g/cm³ and gravity in mGal (used here), the numerical value of γ is 6.672.

The inverse problem that we will solve is the following. Given m gravity values G_i corresponding to points y_i along the y axis, find the values of a and z of the sphere whose gravity best fits the G_i. It will be assumed that D and y_o are known. Clearly, this problem is

nonlinear in both a and z, which play the role of the parameters x_1 and x_2. In practice, the G_i should be observed values, but for the purposes of this tutorial they will be synthetic data generated using equation 23 with the following values: z = 7, a = 5, y_o = 0, all in km; D = 0.25 g/cm³, m = 20, y_1 = −10 km, and y_{i+1} − y_i = 1 km. To stop the iterative process, the condition that the adjustments δ_1 and δ_2 become smaller than or equal to 1×10⁻⁵ km was assumed.

In this example, the estimated variance of the residuals, given by

σ²(z,a) = [1/(m − 2)] Σ_{i=1}^m [G_i − g(y_i, z, a)]²;  2 ≤ z ≤ 12, 2 ≤ a ≤ 10,  (24)

plays the role of the merit function s to be minimized. The 2 in the denominator is introduced to make σ² an unbiased estimate (Jenkins and Watts, 1968). Clearly, a cannot be larger than z when the y_i are assumed to be at the same elevation, but for the analysis that follows we will be concerned with the mathematical, not the physical, aspects of the problem. There are two reasons for the use of σ². One is its statistical significance and the other is that σ² is a normalized form of s, which allows a comparison of results obtained for different numbers of observations or for different models. The following results, however, are shown in terms of the standard deviation σ, which has the same units as g (i.e., mGal).
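For readers who want to reproduce the setup, the following Python sketch (my own naming, a plausible rendering of equations 23 and 24 rather than code from the paper) generates the synthetic data and evaluates the merit function:

import numpy as np

GAMMA = 6.672  # gravitational constant for km, g/cm^3, and mGal units (equation 23)

def sphere_gravity(y, z, a, y0=0.0, D=0.25):
    # Vertical gravity of a buried sphere at surface points y (equation 23).
    return (4.0 * np.pi / 3.0) * GAMMA * D * a**3 * z / ((y - y0)**2 + z**2)**1.5

# Synthetic data of Example 1a: z = 7 km, a = 5 km, 20 stations from -10 km at 1-km spacing
y = np.arange(-10.0, 10.0, 1.0)
G = sphere_gravity(y, 7.0, 5.0)

def sigma2(z, a):
    # Estimated variance of the residuals (equation 24); its square root is sigma in mGal.
    return np.sum((G - sphere_gravity(y, z, a))**2) / (len(y) - 2)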

Representative contour lines of σ(z,a) are shown in Figure 1. Note that the shapes of the contours are highly variable. For values of σ less than about eight they are close to highly elongated ellipses (closed or open), although the other contours are mostly straight or slightly curved with changing slopes. This fact must be kept in mind because solving the inverse problem is equivalent to finding a path in the (z,a) plane that connects an initial point (z_o, a_o) and the point (z_M, a_M) that minimizes σ (and thus σ²). In our example, (z_M, a_M) = (7,5), and at this point σ = 0. It may happen, however, that because of the complexity of σ, no path can be found, in which case the inverse problem has not been solved. To investigate this question the initial points labeled A, B, C, and D in Figure 1 were used. Some of these initial values are too far from the true values (see Figure 2), but they were chosen for demonstration purposes, not as reasonable initial estimates for this problem. In addition, the corresponding results can be useful for cases where the function to be minimized is not equal to zero at its minimum, and there is no easy way to assess whether the initial estimates are reasonably close to the optimal values.

The results of the inversion are summarized in Table 1 and the paths followed by the intermediate pairs (z_p, a_p) (p = iteration number) are shown in Figure 1. For the initial point D there was no convergence, but for the other three points, the minimum was reached in five iterations (points B and C) or 10 iterations (point A). These results are interesting for several reasons. First, convergence can be achieved even when the assumptions behind the linearization of the problem are completely violated. Second, convergence is not always achieved. Third, whether an initial point leads to convergence or not is not directly related to its distance to the point that minimizes σ. Finally, inspection of the inversion paths does not give any clue as to the path corresponding to any other initial point within the range of Figure 1. These facts are typical of nonlinear problems, and the other methods discussed below have been designed to address some of them.

[Figure 1. Contour lines (cyan curves) of the function σ (see equation 24, axes: depth (km) and radius (km)) and paths followed by the points (z_i, a_i) (indicated by circles), where i is the iteration number, for the Gauss-Newton inversion method (equation 17). The numbers next to the contours indicate the value of σ (in mGal). The contours between the numbered ones are equispaced. The points labeled A, B, C, and D are initial points (x_o, y_o) for the inversion; Figure 2 shows the corresponding gravity values. For D the method did not converge; the value of σ for this point is 11.5. See Table 1 for additional inversion results. The large + is centered at (7,5), which is at the minimum of σ (σ = 0).]

[Figure 2. Gravity values g(x,z,a) (mGal) versus x (km), computed using equation 23 for several (z,a) pairs, listed in the upper-right corner of the figure: T: (7,5), A: (2,10), B: (10,10), C: (2,2), D: (10,2). Each pair is identified by a different symbol and by a letter. The gravity values identified by T are the true values, while the others correspond to the initial values used for the inversion of the true values. The gravity scale is logarithmic.]


Table 1. Gravity inversion results obtained using different methods and values of the initial parameters.

Point | z_o | a_o | g_oM  | z_1  | a_1  | g_1M | N     | z_N  | a_N  | λ_o or μ | λ_N

Gauss-Newton method (equation 17, Figure 1)
A     | 2   | 10  | 1747  | 2.0  | 6.8  | 531  | 10    | 7.00 | 5.00 |          |
B     | 10  | 10  | 70    | 9.1  | 6.9  | 28   | 5     | 7.00 | 5.00 |          |
C     | 2   | 2   | 14    | 5.9  | 4.6  | 19   | 5     | 7.00 | 5.00 |          |

Steepest-descent method (equation 30, μ^(p) = μ, Figure 3)
A     | 2   | 10  | 1747  | 2.3  | 9.9  | 1317 | 28904 | 6.52 | 5.00 | 2×10⁷    |
A     | 2   | 10  | 1747  | 4.7  | 8.9  | 229  | 7141  | 7.06 | 5.04 | 2×10⁶    |
B     | 10  | 10  | 70    | 10.3 | 9.4  | 54   | 13326 | 7.02 | 5.01 | 2×10⁴    |
C     | 2   | 2   | 14    | 2.0  | 2.0  | 15   | 8481  | 6.98 | 4.99 | 2×10⁴    |
D     | 10  | 2   | 1     | 10.0 | 2.0  | 1    | 10265 | 7.02 | 5.01 | 2×10⁴    |

Levenberg-Marquardt method (equation 108, λ^(p+1) = λ^(p)/2, Figure 5)
A     | 2   | 10  | 1747  | 2.8  | 9.5  | 762  | 23    | 7.00 | 5.00 | 1×10⁶    | 0.24
A     | 2   | 10  | 1747  | 2.2  | 7.1  | 550  | 17    | 7.00 | 5.00 | 1×10⁴    | 0.15
B     | 10  | 10  | 70    | 10.0 | 10.0 | 70   | 24    | 7.00 | 5.00 | 1×10⁶    | 0.12
B     | 10  | 10  | 70    | 10.3 | 7.6  | 29   | 11    | 7.00 | 5.00 | 1×10²    | 0.10
C     | 2   | 2   | 14    | 2.0  | 2.0  | 14   | 24    | 7.00 | 5.00 | 1×10⁶    | 0.12
C     | 2   | 2   | 14    | 2.7  | 2.9  | 22   | 11    | 7.00 | 5.00 | 1×10²    | 0.10
D     | 10  | 2   | 1     | 10.0 | 2.0  | 1    | 24    | 7.00 | 5.00 | 1×10⁶    | 0.12
D     | 10  | 2   | 1     | 9.9  | 3.2  | 2    | 11    | 7.00 | 5.00 | 1×10²    | 0.10

Point: (see Figure 1); z_o, a_o: initial values of z, a for the inversion; g_oM, g_1M: maximum values of g (mGal, equation 23) for z_o, a_o and z_1, a_1, respectively; z_i, a_i: values of z, a after the ith iteration; N: total number of iterations; λ_o, λ_N: initial and final values of λ. The gravity values to be inverted were generated using equation 23 with z = z_M = 7 and a = a_M = 5; the corresponding maximum value is 18.

THE GRADIENT AND THE METHOD OF STEEPEST DESCENT

To motivate the following discussion, let us consider the gradient of s, to be indicated with ∇s, which is the column vector with components

(∇s)_j = ∂s/∂x_j = −2 Σ_{i=1}^m [o_i − f_i(x)] ∂f_i(x)/∂x_j,  (25)

where equations 4 and 5 were used. Now let x = x_o. Then

(∇s)_j = −2 Σ_{i=1}^m c_i ∂f_i/∂x_j = −2(A^T c)_j,  (26)

where equations 9 and 11 were used. Writing in matrix form, we have

∇s = −2A^T c.  (27)

Now consider equation 1 with δ and c the vectors that appear in equation 15 and let λ go to infinity. In this case, the first term on the left side of the equation becomes negligible and the solution becomes

δ_g ≈ (1/λ)A^T c = −(1/2λ)∇s → 0;  λ → ∞.  (28)

Therefore, in this limiting case the damped least-squares solution δ_g is in the direction of minus the gradient. This fact is emphasized by use of the subscript g. In addition, δ_g goes to zero. The direction −∇s is the basis of the steepest-descent method of minimization, which is one of the oldest methods used.

A heuristic introduction of the steepest-descent method is as follows. Using the notation introduced above, we are interested in an iterative approach such that

s^(p+1) < s^(p).  (29)

To achieve this goal, we will use the fact that ∇s points in the direction of steepest ascent. This is a well-known result from calculus (e.g., Apostol, 1969) and will be proved below in a more general context. Therefore, the initial estimate for the (p+1)th iteration will be computed using

x^(p+1) = x^(p) − (1/μ^(p)) ∇s(x^(p)),  (30)

where μ^(p) is a scalar that assures that equation 29 is satisfied. The question is how to choose the value of μ^(p). A general discussion of this question is presented below, but for the time being, we note that the gradient of a function is a local property, which means that, in general, the direction of steepest descent will change as a function of position. Therefore, if μ^(p) is not selected carefully, it may happen that the value of s is not reduced, as desired. For this reason, a number of strategies for the selection of μ^(p) have been designed (i.e., Beveridge and Schechter, 1970; Dorny, 1975), but for the example considered next we will use a very simple approach, based on the use of equation 30 with a large constant value of μ^(p), say μ, which will assure a small step in the steepest direction. In this way, we will be able to see the steepest-descent path clearly, which will be used for


comparison with paths corresponding to the Gauss-Newton method and the Levenberg-Marquardt solutions obtained for different choices of λ. This example will also show the problems that affect the method of steepest descent, which are removed when the Levenberg-Marquardt method is used.

Example 1b

The merit function is the σ² introduced in equation 24. For the initial points B, C, and D the same value of μ was used, while for point A two other values were used (see Figure 3). Let us consider the most salient aspects of this example. For three of the initial points (B, C, D) the corresponding endpoints are very close to, although not exactly at, (z_M, a_M) (see Table 1), but the number of iterations is extremely large (>7000). Recall that with the Gauss-Newton method, convergence for points B and C is achieved in five iterations. Note that for the three points the paths have sharp bends, after which the paths follow the major axes of the roughly elliptical contours. These bends occur when the paths become approximately tangent to the contours, which is a general property of the method (see the discussion following equation 48 and Figure 4). For point A, the results are different. First, the value of μ used for the other three points did not lead to convergence. Second, when using the larger value of μ in Table 1, the path reaches a point close to where it should bend, but this bending does not occur even after a very large number of iterations (about 29,000). When a somewhat smaller value of μ is used, the path reaches a point close to the minimum with a much smaller number of iterations (about 7000), but the path between (x_o, y_o) and (x_1, y_1) is clearly different from the steepest-descent path.

The previous example illustrates the well-known slow convergence of the steepest-descent method, which makes it computationally inefficient, particularly when compared to the Levenberg-Marquardt method.

[Figure 3. Similar to Figure 1, showing the paths corresponding to the steepest-descent method (equation 30 with μ^(p) = μ = constant) for μ = 2×10⁴, 2×10⁶, and 2×10⁷. For point A the dot-dash path was far from the minimum value (see Table 1). The large + on each path corresponds to (z_1, a_1). The two contours corresponding to σ ≤ 3 are different from those in Figure 1, and were drawn to show their relations to the bend in the paths for points B and D (see Figure 4 and corresponding text for further details).]

On the other hand, a study of some of the properties of the gradient and the steepest-descent method is very fruitful because it sheds light on certain questions that arise when solving inverse problems that involve parameters with different dimensions. This type of problem is not uncommon. In seismology, for example, the parameters may be time, position, velocity, and density, among others. If the dimensions of two or more of the parameters are different, a question that arises is how to define distance in the parameter space. When all the parameters have the same dimensions and are measured in the same units, the gradient of a function s(x) gives the direction along which s has the largest rate of change. In other words, for a given x, [s(x + Δx) − s(x)]/‖Δx‖ is largest when Δx is in the direction of ∇s computed at x. In this case, it is meaningful to speak of the direction of steepest ascent and to equate it to the gradient direction. For any other case, however, a distance in parameter space must be defined. Once this is done, the direction of steepest ascent is well defined, as we now show. The following results originate with Crockett and Chernoff (1955). Although in this paper we are interested in the steepest-descent direction, here we consider its opposite direction (corresponding to the steepest ascent) to avoid introducing an inconvenient minus sign.

A general definition of distance d between two points (represented by vectors α and β) is

d = [Σ_{i,j} (α_i − β_i) b_ij (α_j − β_j)]^{1/2} ≡ [(α − β)^T B(α − β)]^{1/2},  (31)

where B is a positive definite symmetric matrix (see Appendix A). With this condition on B, d is always a nonnegative real number, with d = 0 only if α = β. The definition of distance is known as the metric of the space of points under consideration.
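For instance, a minimal helper for this generalized distance (illustrative code, my own naming):

import numpy as np

def metric_distance(alpha, beta, B):
    # Distance under the metric B (equation 31); B = I gives the Euclidean distance.
    diff = np.asarray(alpha, dtype=float) - np.asarray(beta, dtype=float)
    return float(np.sqrt(diff @ B @ diff))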

[Figure 4. Elliptical contour lines corresponding to a 2D quadratic merit function (given by equation 55 with x_1 a point of minimum). The contour spacing is not uniform. The points corresponding to the centers of the small circles are identified by the letters A, B, and C next to them. The solid and dashed lines are tangent to the contours at points B and C. The segments AB and BC are in the directions of the gradient at A and B. The positions of points B and C were determined using equations 47, 54, and 56. The two segments are perpendicular to each other. The two pairs of closely spaced contours were drawn to show that if the segments AB and BC extend beyond points B and C, the value of the quadratic function becomes larger.]


If B = I, d is the usual Euclidean distance. Given a point with coordinates x_o, the points x at a distance d from it are on the ellipsoid

(x − x_o)^T B(x − x_o) = d².  (32)

Given a function s, the direction of steepest ascent in the d neighborhood of x_o is defined as the direction from x_o to the point on the ellipsoid of equation 32 for which the value of s is greatest. Let δ represent that direction. To find it, we will maximize s(x_o + δ) under the constraint

δ^T Bδ = d²,  (33)

where

δ = x − x_o.  (34)

The notation used here has been chosen to emphasize the connection of this section to the preceding one. In particular, s and δ could be those introduced in equations 5 and 13. To solve this problem, we will use the method of Lagrange multipliers, which requires introducing the function

u = s(x_o + δ) + λ(d² − δ^T Bδ),  (35)

where λ is the Lagrange multiplier, computing its derivatives with respect to x_i and λ, and setting them equal to zero. This gives

∂s(x_o + δ)/∂x_i = 2λ Σ_j b_ij δ_j,  (36)

which is equivalent to

∇s(x_o + δ) = 2λBδ,  (37)

and

d² = δ^T Bδ.  (38)

Now we will go over the following steps. Solve equation 37 for δ,

δ = (1/2λ) B^{−1}∇s,  (39)

introduce this result in equation 38, use the symmetry of B, and apply the square root to both sides of the resulting equation. Thus we get

d = (1/2λ)[(∇s)^T B^{−1}∇s]^{1/2}.  (40)

Next, solve this equation for 1/2λ and introduce the result in equation 39. This gives

δ = d B^{−1}∇s / [(∇s)^T B^{−1}∇s]^{1/2},  (41)

where ∇s is computed at x_o + δ. Now we will let d go to zero, which means that δ goes to the zero vector, and linearize ∇s(x_o + δ) in the vicinity of x_o. Writing in component form we have

∂s/∂x_i (x_o + δ) ≈ ∂s/∂x_i (x_o) + Σ_j [∂²s/(∂x_i ∂x_j)](x_o) δ_j.  (42)

In vector form, this equation becomes

∇s(x_o + δ) ≈ ∇s(x_o) + Hδ,  (43)

where H is the Hessian of s. Therefore,

∇s(x_o + δ) = ∇s(x_o);  d → 0  (44)

and, from equation 41,

δ = d B^{−1}∇s(x_o) / [(∇s(x_o))^T B^{−1}∇s(x_o)]^{1/2};  d → 0.  (45)

Here, we are assuming that ∇s(x_o) ≠ 0. If ∇s(x_o) = 0, x_o corresponds to a critical point (i.e., a point where s has an extremal value or a saddle point). In conclusion, the direction of steepest ascent at any point x is given by the vector

δ̂(x) = B^{−1}∇s(x).  (46)

When B = I, δ̂ is in the direction of the gradient, as expected.

Finally, we will address the choice of B. In principle, the choice is somewhat arbitrary, because different metrics should lead to the same minimum. In fact, equation 46 shows that δ̂(x) = 0 implies ∇s = 0 and vice versa (Feder, 1963). However, Crockett and Chernoff (1955) showed that the most computationally efficient steepest-ascent method requires that B = H. The following proof has two parts, originating from Davies and Whitting (1972) and Greenstadt (1967), respectively. For the first part, consider the iterative minimization (or maximization) of a function s(x). At a given iteration, we have a point x_o and move to a new point x_1 in a direction defined by a vector u. Let

x_1 = x_o + αu  (47)

and

s(x_1) = s(x_o + αu) ≡ F(α).  (48)

From a computational point of view, we are interested in an iterative process with the least number of steps. This requires finding the α that reduces (or increases in the case of maximization) the value of F as much as possible in the direction u at every step, which in turn means that x_1 becomes a point of tangency to one of the contours of s. This situation is illustrated in Figure 4, which shows the contours corresponding to a 2D quadratic merit function with a point of minimum. The points indicated by A and B correspond to x_o and x_1, respectively, for a given iteration. Going from A to B, the value of s keeps decreasing, while moving past B leads to an increase in the value of s. For the next iteration B and C become the new x_o and x_1. The points B and C correspond to tangency points and were determined using equations 54 and 56 below. The directions at A and B are given by −∇s computed at those points. Because the gradient is perpendicular to the contours, the segment BC is perpendicular to AB. If the contours were circular, the minimum would be reached in one step.

To find the value of α that will lead to a point of tangency, we will expand F to second order about x_o,

F ≈ F_o + α(dF/dα)_o + (α²/2)(d²F/dα²)_o,  (49)

where the subscript o indicates evaluation at x_o. Then, expanding dF/dα to first order about x_o and setting it equal to zero at the point of tangency we obtain


dF/dα ≈ (dF/dα)_o + α(d²F/dα²)_o = 0.  (50)

If F were a quadratic function, these relations would be exact. Now, differentiating equation 48 with respect to α gives

dF/dα = Σ_i (∂s/∂x_i) u_i = u^T ∇s  (51)

and

d²F/dα² = Σ_{i,j} [∂²s/(∂x_i ∂x_j)] u_i u_j = u^T Hu,  (52)

where H is the Hessian matrix. Introducing these two expressions in equation 50 and solving for α gives

α ≈ −(dF/dα)_o/(d²F/dα²)_o = −(u^T ∇s / u^T Hu)|_{x=x_o}.  (53)

If u = −∇s, this expression becomes

α ≈ [(∇s)^T ∇s / (∇s)^T H∇s]|_{x=x_o}.  (54)

This expression is exact when s is a quadratic function. For example, s may be of the form

s = (x − x_1)^T P(x − x_1),  (55)

where x_1 is a constant vector and P is a symmetric matrix. In this case,

∇s = 2P(x − x_1);  H = 2P.  (56)

If x_1 minimizes s, ∇s(x_1) = 0 and P is positive definite (e.g., Apostol, 1969).
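The tangency construction of Figure 4 is easy to verify numerically. The following sketch (the values of P, x_1, and the starting point are my own, illustrative choices) applies steepest descent with the exact step of equation 54 to the quadratic of equations 55 and 56; consecutive steps are mutually perpendicular and the iterates converge to x_1:

import numpy as np

P = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # symmetric positive definite (illustrative)
x1 = np.array([7.0, 5.0])         # point of minimum

def grad(x):
    return 2.0 * P @ (x - x1)     # gradient, equation 56

H = 2.0 * P                       # Hessian, equation 56

x = np.array([2.0, 10.0])
for _ in range(25):
    g = grad(x)
    alpha = (g @ g) / (g @ H @ g) # exact step, equation 54
    x = x - alpha * g             # descend along u = -grad s
print(x)                          # approaches x1 = (7, 5)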

For the second part of the proof we will consider the difference ΔF between F and F_o, which is determined from equations 49 and 51-53:

ΔF = F − F_o = −(1/2)[(u^T ∇s)² / u^T Hu]|_{x=x_o}.  (57)

This result applies to both the minimization and maximization of s, with the sign of ΔF depending on whether H is positive or negative definite (see Appendix A), corresponding to whether s has a minimum or a maximum (e.g., Apostol, 1969). Now we will set u = δ̂ (see equation 46) for two cases: (1) B is any positive definite symmetric matrix, and (2) B = H. The latter case corresponds to the so-called Newton (or Newton-Raphson) method, and to distinguish between the two possibilities we will use subscripts B and N. Thus,

ΔF_B = −(1/2)[(∇s)^T B^{−1}∇s]² / [(∇s)^T B^{−1}HB^{−1}∇s]  (58)

and

ΔF_N = −(1/2)(∇s)^T H^{−1}∇s.  (59)

To investigate the relative efficiency of the methods represented by the two choices we will consider the ratio

η = ΔF_B/ΔF_N = [(∇s)^T B^{−1}∇s]² / {[(∇s)^T B^{−1}HB^{−1}∇s][(∇s)^T H^{−1}∇s]},  (60)

but before proceeding we will introduce the following vector

p = B^{−1/2}∇s,  (61)

so that

∇s = B^{1/2}p.  (62)

Because B is positive definite and symmetric, so is B^{−1/2} (see Appendix A). Using these two equations, η becomes

η = (p^T p)² / [(p^T Mp)(p^T M^{−1}p)],  (63)

where

M = B^{−1/2}HB^{−1/2}.  (64)

Matrix M is positive definite (see Appendix A). An upper bound to η can be established by using the following generalization of Schwarz's inequality,

(a^T b)² ≤ (a^T Ca)(b^T C^{−1}b)  (65)

(see Appendix A), where C is a positive definite matrix. Application of this expression to η gives

η ≤ 1.  (66)

Now we will apply the Kantorovich inequality (Luenberger, 1973) to the right side of equation 63, which immediately gives

η ≥ 4λ_n λ_1/(λ_n + λ_1)² = 4(λ_1/λ_n)/(1 + λ_1/λ_n)² = 4κ/(1 + κ)²,  (67)

where λ_1 and λ_n are the largest and smallest eigenvalues of M and

κ = λ_1/λ_n.  (68)

In summary,

4κ/(1 + κ)² ≤ η ≤ 1.  (69)

This result shows that the efficiency of the method depends on κ, which is the condition number of M (e.g., Gill et al., 1981). The better conditioned this matrix is, the higher the efficiency. In particular, η = 1 when κ = 1, which in turn requires M = cI. Without losing generality, we can take c = 1, in which case B = H (see equation 64). Therefore, the Newton step is the most efficient (assuming that ∇s is arbitrary). Crockett and Chernoff's (1955) proof of this result is based on a different approach (and resulting expressions). This choice of B, however, is not always the most advisable for two reasons. First, when minimizing s, H may not always be positive definite; in fact, some of its eigenvalues may be negative and ΔF_N may become positive. Second, even though H may be positive definite through the iterative process, the required computation of second derivatives increases the computational costs. On the other hand, this choice has special relevance in statistics because, as Crockett and Chernoff (1955) note, when solving maximum likelihood estimation problems the function s is the logarithm of the likelihood function, in which case H^{−1} represents an estimate of the covariance matrix of the maximum likelihood estimate (Seber and Wild, 1989).


The covariance matrix is also related to our discussion of the metric via the Mahalanobis distance, named after the Indian statistician who introduced the concept (in 1936). If x is a random vector from a population with mean μ and covariance matrix V, then the distance between x and μ is given by

d_M = [(x − μ)^T V^{−1}(x − μ)]^{1/2}.  (70)

A good qualitative justification for this definition can be found in Krzanowski (1988), who also notes the relation between this distance and the maximum likelihood function.

Finally, it is worth noting that the choice B = H_S (see equation 16) leads to the Gauss-Newton method. In fact, using equations 46, 16, 27, and 15 gives

δ̂ = −(A^T A)^{−1}A^T c = −δ  (71)

(provided that the inverse exists). Now using equation 34 with −δ̂ instead of δ (the minus sign being used to specialize to the steepest-descent case) and then using equation 71 we have

x = x_o − δ̂ = x_o + δ.  (72)

Comparison of this expression with equation 13 shows that we have recovered the Gauss-Newton method. This result is consistent with the fact that one way to derive this method is to assume that H ≈ H_S in the Newton method (e.g., Gill et al., 1981).

THE LEVENBERG-MARQUARDT DAMPED LEAST-SQUARES METHOD

As noted in the discussion of the Gauss-Newton method, if the initial solution is far from the optimal solution, the iterative process may not converge. Although this problem is addressed by several authors (see Historical Note below), Levenberg's (1944) and Marquardt's (1963) papers are much more thorough than the others and for this reason they will be examined in detail here. However, their approaches are so different that it is convenient to discuss them separately.

The Levenberg approach

The notation used here is that introduced in the discussion of the Gauss-Newton method. To solve the problem of parameter adjustments too large, Levenberg introduced the idea of damping the absolute values of the δ_i by minimizing

S̄(x) = wS(x) + Q(x),  (73)

where

Q(x) = d_1 δ_1² + ⋯ + d_n δ_n² = δ^T Dδ,  (74)

w and the d_i are positive weighting factors independent of x, and D is a diagonal matrix with elements (D)_ii = d_i. A comparison of equations 74 and 31 shows that Levenberg's method implicitly introduces a non-Euclidean norm in the parameter space. Moreover, the results of the analysis below are valid when D is a symmetric positive definite matrix.

Let us establish two important results concerning S̄, S, and Q. Let x_w be the value of x that minimizes S̄ for a given value of w, i.e.,

S̄(x_w) ≤ S̄(x);  x ≠ x_w.  (75)

Because Q is nonnegative and

Q(x_o) = 0  (76)

(see equation 8), we can write

wS(x_w) ≤ wS(x_w) + Q(x_w) = S̄(x_w) ≤ S̄(x_o) = wS(x_o) + Q(x_o) = wS(x_o),  (77)

so that

S(x_w) ≤ S(x_o),  (78)

which means that the minimization of S̄ will lead to a decrease in S. Now, letting x_∞ denote the ordinary least-squares solution (the reason for this notation is explained below), we have

wS(x_w) + Q(x_w) = S̄(x_w) ≤ S̄(x_∞) = wS(x_∞) + Q(x_∞) ≤ wS(x_w) + Q(x_∞),  (79)

so that

Q(x_w) ≤ Q(x_∞).  (80)

The second inequality in equation 79 arises because x_∞ minimizes S. Inequality 80 shows that the minimization of S̄ also leads to a decrease in the weighted sum of adjustments squared.

Next, we will derive the equation for the solution that minimizes S̄, but before proceeding we note that Levenberg derived his results using scalar notation, not the more familiar matrix notation used here. The starting point is equation 73, which will be rewritten using equation 12:

S̄ = w(c^T c − 2c^T Aδ + δ^T A^T Aδ) + δ^T Dδ = w{c^T c − 2c^T Aδ + δ^T (A^T A + (1/w)D)δ}.  (81)

Aside from a factor of w, equation 81 is formally similar to equation 12, with A^T A in the latter replaced by the symmetric matrix in parentheses in the former. Therefore, by analogy with equation 15, the minimization of S̄ leads to the damped least-squares solution

(A^T A + (1/w)D)δ = A^T c,  (82)

so that the only difference from the ordinary least-squares solution is the addition of a diagonal matrix to A^T A. Because the inverse of the matrix in parentheses always exists for finite w (see Appendix A), equation 82 has a solution even when (A^T A)^{−1} does not exist and the Gauss-Newton method is not applicable. Also note that for w = ∞ the second term on the left side of equation 82 vanishes and we get the ordinary Gauss-Newton solution (provided it exists). This is why we introduced the x_∞ used in equations 79 and 80. On the other hand, if w goes to zero, 1/w goes to infinity and the first term on the left becomes negligible, which means that we can write

(1/w)Dδ_g ≈ A^T c;  w → 0.  (83)


In addition, because the diagonal elements of D are nonzero, its inverse always exists and we can write

δ_g ≈ wD^{−1}A^T c = −(1/2)wD^{−1}∇s → 0;  w → 0  (84)

(see equation 27). This result is also valid when D is symmetric and positive definite (so that its inverse exists, see Appendix A), in which case it agrees with equation 46. The difference in the signs of δ and δ̂ is due to the fact that they are in the directions of steepest descent and ascent, respectively.

So far, we have concentrated on S and S̄, but as we will see next, we can derive several important results concerning s, which is the quantity that is of most interest to us. In the following we will focus on the case of w going to zero, which means that we can use equation 84. Then, letting

δ_g = x_w − x_o  (85)

we find that

dx_w/dw = dδ_g/dw = D^{−1}A^T c;  w → 0.  (86)

Furthermore,

ds(x_w)/dw = Σ_{j=1}^n (∂s/∂x_j)(dx_j/dw)|_{x=x_w} = (∇s)^T dx_w/dw.  (87)

Because of equations 84 and 85, x_w ≈ x_o. Then, introducing equations 86 and 27 in equation 87 and operating gives

[ds(x_w)/dw]_{w=0} = −2(A^T c)^T D^{−1}A^T c = −2(D^{−1/2}A^T c)^T (D^{−1/2}A^T c) = −2‖D^{−1/2}A^T c‖² < 0  (88)

(see also equation 44). The inequality arises because of the assumption that x_o is not a stationary point of s, which means that the partial derivatives cannot all be equal to zero. Therefore, because s(x_w) is decreasing at w = 0, there are values of w (positive) that will reduce the value of s. In principle, the value of w that minimizes s could be determined by setting ds/dw equal to zero, but because of the complexity of this equation in practical cases, Levenberg proposed to write s(x_w) in terms of its linearized Taylor expansion

s(x_w) ≈ s(x_o) + w(ds/dw)|_{w=0}  (89)

and to assume that s(x_w) will be small. Under these conditions,

w ≈ −s(x_o)/(ds/dw)|_{w=0} = s(x_o)/(2‖D^{−1/2}A^T c‖²),  (90)

where equation 88 was used. According to Levenberg, this type of approximation was published by Cauchy in 1847.

The results derived above do not depend on the values of the weights d_j. To determine them, Levenberg proposed two approaches. One was to choose the d_i such that the directional derivative of s along the curve defined by x = x_w, taken at w = 0, has a minimum value. The directional derivative is given by equation 87 because dx_w/dw is a vector tangent to x_w (e.g., Spiegel, 1959). Furthermore, because the product on the right side is the matrix form of the scalar product, we can write

ds(x_w)/dw = ‖∇s‖ ‖dx_w/dw‖ cos θ,  (91)

where θ is the angle between the two vectors. The minimum value of the derivative is attained for θ = π. Introducing this value of θ as well as equations 27, 86, and 88 into equation 91 we obtain

2‖D^{−1/2}A^T c‖² = 2‖A^T c‖ ‖D^{−1}A^T c‖.  (92)

This equation is satisfied when D = dI, with d equal to a constant, which results in a factor of d^{−1} on both sides of the equation. Taking d = 1 and letting λ = 1/w, we find that equation 82 becomes the well-known equation

(A^T A + λI)δ = A^T c;  λ = 1/w.  (93)

The second approach proposed by Levenberg is to choose

d_i = (A^T A)_ii,  (94)

in which case the matrix in parentheses in equation 82 becomes the matrix A^T A with its diagonal elements multiplied by 1 + λ. Levenberg did not give a motivation for this choice, but it is directly related to the scaling introduced by Marquardt.

The Marquardt approach

Marquardt approaches the problem from a point of view different from that of Levenberg. His starting point is the following series of results. Unless otherwise noted, the notation used here is that introduced earlier, except for the fact that S will be assumed to be a function of δ, as indicated by the right side of equation 12.

Let λ > 0 be arbitrary (unrelated to the w above) and let δ_o satisfy

(A^T A + λI)δ_o = A^T c.  (95)

Then δ_o minimizes S on the sphere whose radius ‖δ‖ satisfies

‖δ‖² = ‖δ_o‖².  (96)

This result was proved using the method of Lagrange multipliers, which requires minimizing the function

u(δ,λ) = S + λ(‖δ‖² − ‖δ_o‖²)  (97)

with respect to δ and λ, where λ is a Lagrange multiplier. This requires finding the derivatives of u with respect to δ and λ and setting them equal to zero. Because δ_o does not depend on δ, the derivative with respect to δ can be determined as done in connection with the minimization of S̄ in equation 81. In fact, setting w = 1 and D = λI in equation 82 immediately gives

(A^T A + λI)δ = A^T c.  (98)

This proof is more general than that provided by Marquardt, which assumed the existence of (A^T A)^{−1}. Next, setting the derivative of u with respect to λ equal to zero gives

‖δ‖² = ‖δ_o‖²,  (99)
Page 11: Solution of NonLinear Inverse Problems and the Levenberg-Marquardt Method

ww

d

w�ddff

w

Taf

w

T

Frq

w1

sWintis

�mps

astf

a

w

tttbtfi

t

a

w

Nt

A

F

U

a

Levenberg-Marquardt nonlinear inversion W11

which proves the result. For the sake of simplicity, the subscript in δ_o will be dropped.

The second result requires writing A^T A in terms of its eigenvalue decomposition (e.g., Noble and Daniel, 1977), namely

A^T A = UΛU^T,  (100)

where U is a matrix whose columns are the eigenvectors of A^T A and Λ is the diagonal matrix of its eigenvalues λ_i (all real numbers). This decomposition applies to any symmetric matrix. There should be no confusion between the damping parameter λ and the λ_i because the former is never subscripted. Using equation 100 and the property

UU^T = U^T U = I  (101)

we obtain

A^T A + λI = U(Λ + λI)U^T.  (102)

The matrix in parentheses is diagonal and, as shown in Appendix A, all of its diagonal elements are always positive when λ > 0. Therefore its inverse always exists, which allows writing

δ = [U(Λ + λI)U^T]^{−1}A^T c = U(Λ + λI)^{−1}u,  (103)

where equation 95 was used and

u = U^T A^T c.  (104)

Then

‖δ‖² = δ^T δ = u^T [(Λ + λI)²]^{−1}u = Σ_i u_i²/(λ_i + λ)².  (105)

From this equation we see that ‖δ‖ is a decreasing function of λ. This result and the previous one are from Morrison (1960, unpublished, quoted by Marquardt).

Marquardt's final result concerns the angle γ between δ and A^T c, which is proportional to −∇s (see equation 27). Using equations 103, 104, and 101, we can write

cos γ(λ) = δ^T A^T c/(‖δ‖ ‖A^T c‖) = u^T (Λ + λI)^{−1}u / {[u^T ((Λ + λI)²)^{−1}u]^{1/2} ‖A^T c‖}.  (106)

The angle γ is a function of λ. When λ goes to infinity, we already saw that δ becomes proportional to −∇s (see equation 28), so that γ goes to zero. When λ = 0, two cases must be considered. If the inverse of A^T A exists, all the λ_i are positive, cos γ > 0, and γ < 90°. If the inverse does not exist, then equation 15 cannot be solved. Now we will address the question of what happens to γ(λ) for other values of λ. To answer it, Marquardt investigates the sign of the derivative of cos γ with respect to λ and finds that

(d/dλ) cos γ(λ) > 0  (107)

(Appendix B). The main consequence of this result is that γ is a monotonic decreasing function of λ, which assures that it is always possible to find a value of λ that will reduce the value of s. This observation leads to the algorithm introduced by Marquardt, which he applies to a scaled version of the problem. However, because this scaling is not essential (and is not always used; e.g., Gill et al., 1981), the basis of the algorithm is described first. At the pth iteration the following equation is solved for δ^(p):

[(A^T A)^(p) + λ^(p)I]δ^(p) = (A^T c)^(p),  (108)

and the updated value of x is computed, i.e.,

x^(p+1) = x^(p) + δ^(p),  (109)

which in turn is used to compute s^(p+1). Then, if

s^(p+1) < s^(p),  (110)

the value of λ^(p) is reduced. Otherwise, its value is increased. After this step a new iteration is started. Marquardt introduced three tests that determined the value of λ^(p+1), but a simpler approach, described below, works well. In any case, the important point to note here is that the Marquardt algorithm is based on a trial-and-error approach for the selection of the appropriate value of λ at each iteration, which is simpler than Levenberg's approach (equation 90).
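A minimal sketch of this trial-and-error loop (equations 108-110), using the simple λ-update strategy described later in Example 1c (λ divided by a constant after a successful step, increased otherwise) rather than Marquardt's three tests; jacobian is the forward-difference helper from the Gauss-Newton sketch above, and the default values here are my own assumptions:

import numpy as np

def levenberg_marquardt(f, x0, obs, lam0=1e2, c_factor=2.0, tol=1e-5, max_iter=200):
    x = np.asarray(x0, dtype=float)
    lam = lam0
    s_old = np.sum((obs - f(x))**2)
    for p in range(max_iter):
        c = obs - f(x)
        A = jacobian(f, x)
        delta = np.linalg.solve(A.T @ A + lam * np.eye(x.size), A.T @ c)  # equation 108
        s_new = np.sum((obs - f(x + delta))**2)
        if s_new < s_old:            # condition 110: accept the step, reduce lambda
            x, s_old = x + delta, s_new
            lam /= c_factor
            if np.all(np.abs(delta) <= tol):
                break
        else:                        # reject the step, increase lambda
            lam *= c_factor
    return x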

Marquardt applies his algorithm after introducing the scaled matrix (A^T A)* and vector (A^T c)*, with components given by

[(A^T A)*]_ij = s_ii s_jj (A^T A)_ij  (111)

and

[(A^T c)*]_i = s_ii (A^T c)_i,  (112)

where

s_ii = 1/[(A^T A)_ii]^{1/2}.  (113)

Note that the diagonal elements of (A^T A)* are all equal to one. In terms of this scaling, equation 15 becomes

(A^T A)* δ* = (A^T c)*.  (114)

After solving this equation, δ is computed using

δ_i = s_ii δ_i*.  (115)

To verify that equation 115 is correct, we will proceed as follows. First, introduce a diagonal matrix S with elements

(S)_ii = s_ii.  (116)

Using this matrix, equations 111, 112, and 114 become

(A^T A)* = SA^T AS,  (117)

(A^T c)* = SA^T c,  (118)

and

SA^T AS δ* = SA^T c.  (119)

Page 12: Solution of NonLinear Inverse Problems and the Levenberg-Marquardt Method

M

s

tti

FdipDhevt=

D

ID

weqlesao

E

wrafsGu

sc�wlfi�te

smto�lme

�rsctewAcat

faf�dNflptdtcap

ew2iontenFdpsppD�tt

e�

W12 Pujol

ultiplying both sides of equation 119 by S−1 on the left gives

ATA�S�*� = ATc , �120�

o that

� = S�* �121�

provided that �ATA�−1 exists�.Marquardt’s algorithm is based on the solution of equation 108 af-

er introducing the scaling described above. Solving the scaled equa-ion gives �*�p�, which is converted to ��p� using equation 115, whichn turn is used in equation 109.

The reasons given by Marquardt to scale the problem are twofold. First, it was known (Curry, 1944) that the properties of the steepest-descent method are not scale invariant. Second, the particular scaling he chose was widely used in linear least-squares problems to improve their numerical aspects. These questions are discussed in, e.g., Draper and Smith (1981) and Gill et al. (1981). What is not obvious, however, is that equation 115 is applicable after the scaled version of equation 108 is solved, but this fact can be proved using the ideas developed by Levenberg. To see that, let us multiply both sides of equation 82 by D^{-1/2} on the left, rewrite it slightly, and operate. Using λ = 1/w, this gives

$$D^{-1/2}\left(A^TA + \lambda D^{1/2}D^{1/2}\right)\left(D^{-1/2}D^{1/2}\right)\delta = \left(D^{-1/2}A^TA\,D^{-1/2} + \lambda I\right)\left(D^{1/2}\delta\right) = D^{-1/2}A^Tc. \qquad (122)$$

If D is the diagonal matrix with elements given by equation 94, then D^{-1/2} = S and equation 122 becomes

$$\left(S\,A^TA\,S + \lambda I\right)\delta^* = S\,A^Tc, \qquad (123)$$

where δ* = S^{-1}δ (see equation 121). This result has two consequences. First, its comparison with equations 117-121 justifies Marquardt's procedure. Second, it shows that this procedure is equivalent to Levenberg's second choice of matrix D, namely D = S^{-2}. This equivalence, noted without proof by Davies and Whitting (1972), shows that Marquardt's method implicitly introduces a non-Euclidean norm that changes at each iteration because it is based on the value of x for that iteration.
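This equivalence is easy to confirm numerically. In the sketch below (our construction, using a random test matrix), the solution of the scaled equation 123 converted through equation 121 agrees with the direct solution of equation 82 for λ = 1/w and D = S^{-2}:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
c = rng.normal(size=20)
lam = 0.5

ATA, ATc = A.T @ A, A.T @ c
S = np.diag(1.0 / np.sqrt(np.diag(ATA)))

# Scaled form, equation 123: (S A^T A S + lam I) delta* = S A^T c
delta_star = np.linalg.solve(S @ ATA @ S + lam * np.eye(3), S @ ATc)
delta_scaled = S @ delta_star                   # equation 121

# Unscaled form: (A^T A + lam D) delta = A^T c with D = S^{-2}
D = np.diag(np.diag(ATA))
delta_direct = np.linalg.solve(ATA + lam * D, ATc)

print(np.allclose(delta_scaled, delta_direct))  # True
```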

Example 1c

Here, we will apply the Levenberg-Marquardt method with and without scaling to the gravity data introduced before. There are two reasons for using the two options. One is that because both of them are used in practice, a comparison of their performances will be useful. The second reason is that scaling is equivalent to changing the shape of s at each iteration, so that a direct comparison with the Gauss-Newton and steepest-descent methods is possible only for the unscaled version.

First, we will consider the results obtained using the unscaled version (corresponding to equation 108). The initial values of λ (indicated by λ_o) are given in Table 1. The following procedure to handle λ at each iteration is simpler than that proposed by Marquardt but was found to be effective when applied to a variety of inverse problems. As before, the squared misfit plays the role of s. Then, if equation 110 is satisfied, the value of λ^(p+1) is set equal to λ^(p)/c, where c is a constant (here, c = 2). If not, the values of λ and the parameters are set equal to those they had in iteration p − 1. Then a new iteration is started. The selection of λ_o depends on the type of inverse problem being solved and on the initial values given to the parameters to be determined. For points A, B, and C, equation 110 was always satisfied and the choice of λ_o was not critical (recall that the Gauss-Newton method converged for these points). For point D, the situation is different (see below). For other inverse problems, the best approach to the selection of λ_o is to invert synthetic data that resemble the actual data as much as possible and to experiment with different values of λ_o and even the constant c.

Two values of λ_o for each initial point were used. The first value (equal to 1×10^6 for all the points) was chosen very large to see the relation between the convergence paths and the corresponding steepest-descent paths. Interestingly, the two paths are extremely close to each other for the four initial values, but the number of iterations is several orders of magnitude smaller (just 23 or 24) and the endpoints coincide with (z_M, a_M) (Table 1). This similarity of paths was not expected, and it is not clear how the changes in λ, A^T A, and A^T c at each iteration combine to produce the observed paths. For a comparison, the largest initial values of A^T A for the points A, B, C, and D are close to 5×10^6, 5×10^3, 1×10^3, and 8, respectively, and the corresponding value at the minimum point is 920.

The second value of λ_o is equal to 1×10^4 for point A and 1×10^2 for the other ones. In all cases, the point (z_M, a_M) is reached exactly, and the number of iterations becomes smaller (17 for point A and 11 for the other points) than for the previous value of λ_o. The values of λ_o used here were chosen so that the convergence paths are intermediate between the previous ones and those obtained using the Gauss-Newton method. For smaller values of λ_o, convergence is even faster for all points except D. For this point, values of λ_o less than about 10 lead to a larger number of iterations. Recall that this was the only point for which the Gauss-Newton method did not converge. Although it is not possible to give a conclusive explanation for these differences in convergence speeds, they may be related to the fact that point D is in a region of the (z, a) plane with a very slow rate of change in the value of the misfit, so that to assure a decrease in its value, the adjustments δz and δa must be smaller than for some of the other points, thus requiring larger values of λ_o.

The application of the scaled version of the method is based on

equation 123. Using λ_o = 1, convergence to the true values of z and a was achieved in 12 iterations for points A, B, and C (Figure 6) and in 27 iterations for point D. The problem for point D is that the misfit initially increases for this value of λ_o, which requires an increase in the values of λ used in subsequent iterations. Therefore, to reduce the total number of iterations a larger λ_o is needed. For this particular point, the smallest number of iterations is 17, obtained for λ_o = 20 (Figure 6). Let us examine the convergence paths. For points A and C there are no significant qualitative differences with the corresponding paths seen in Figure 5 for the smaller values of λ_o, but for the other two points the differences are significant. For point B, the first three points of the path do not interpolate between the Gauss-Newton and steepest-descent paths. Recall that the interpolation property discussed here applies to the unscaled version of the method, so that it cannot be expected that it will always apply when scaling is introduced. For point D, the path is completely unexpected, with (z_1, a_1) equal to (2.79, 2.79), much further away from (z_o, a_o) than for any of the other three initial points. In addition, after this first point has been reached, the path is similar to that for point C.

The inversion was repeated using equation 82 with the diagonal elements of D given by equation 94, 1/w = λ, and the same values of λ_o. The results obtained in this way agree with those shown in Figure 6, thus providing a numerical confirmation of the equivalence of this choice of D and the Marquardt scaling proved analytically.

In summary, for this particular example, the unscaled and scaled versions of the Levenberg-Marquardt method perform similarly. It may be argued that the scaled version makes it easier to choose the value of λ_o, which can be taken close to one, but as point D showed, this value may not lead to a smaller number of iterations. Obviously, this is not a major concern in our case, but it may be so when solving inverse problems involving large numbers of parameters. Also note that the results for points C and D in Figure 6 show that it is difficult to make general statements regarding the convergence paths for linearized nonlinear problems, even for a relatively simple 2D case. Again, convergence to a solution may become more of an issue as the number of inversion parameters increases. In particular, the function s may have local minima in addition to an absolute minimum, in which case the inversion results may depend on the initial solution and on the selection of λ_o. These facts must be borne in mind by those beginning their work in nonlinear inverse problems. Each problem will have features that make it different from other problems and, as noted above, the best way to investigate it is through the inversion of realistic synthetic data (i.e., data for which the model is realistic). In addition, because actual data are always affected by measurement or observational errors, representative errors should be added to the synthetic data.

HISTORICAL NOTE

In spite of its importance, Levenberg's (1944) paper went largely unnoticed until it was referred to in Marquardt's (1963) paper. When Levenberg published his paper he was working at the Engineering Research Section of the U. S. Army Frankford Arsenal (Philadelphia), and according to Curry (1944) the engineers there preferred Levenberg's method over the steepest-descent method. Interestingly, the Frankford Arsenal supported the work of Rosen and Eldert (1954) on lens design using least squares, but they did not use Levenberg's method. The computerized design of lenses was an area of research with military and civilian applications, with early results summarized by Feder (1963). Regarding Levenberg's paper, Feder notes that it had come to his attention in 1956 and that other people had rediscovered the damped least-squares method, although some of the work was supported by the military and could not be made public until several years later because of its classified nature.

One of the rediscoverers of the damped least-squares method was Wynne (1959), who notes that the problems affecting the ordinary least-squares method when the initial solutions did not lead to an approximate linearization could be addressed by limiting the size of the adjustment vector. Using the notation introduced here, Wynne added the constraint

$$p\,\delta = 0, \qquad (124)$$

where p is a weighting factor, to an equation similar to

$$A\,\delta = c. \qquad (125)$$

Wynne noted that the least-squares solution of the combined equations 124 and 125 minimizes a function similar to

$$\bar{S} = r^Tr + p^2\,\delta^T\delta. \qquad (126)$$

Note that this is a special case of equation 73 with w = 1 and D = p^2 I. Wynne, however, did not give an explicit expression for the solution, which can be derived as follows. Equations 124 and 125 will be written as a single equation involving partitioned matrices, namely

$$B\,\delta = u, \qquad (127)$$

where

$$B = \begin{pmatrix} A \\ pI \end{pmatrix}; \qquad u = \begin{pmatrix} c \\ 0 \end{pmatrix}. \qquad (128)$$

[Figure 5: convergence paths in the depth-radius plane for initial points A, B, C, and D; axes Depth (km) and Radius (km); circles mark λ_o = 1×10^6, 1×10^4, and 1×10^2.]

Figure 5. Similar to Figure 1 showing the paths corresponding to the unscaled Levenberg-Marquardt method (equation 108) using the initial values of λ (i.e., λ_o) given at the top of the figure (circles). For a comparison, some of the steepest-descent paths in Figure 3 are also shown here (black lines).

[Figure 6: convergence paths in the depth-radius plane for initial points A, B, C, and D; axes Depth (km) and Radius (km); circles mark λ_o = 1 (points A, B, C) and λ_o = 20 (point D).]

Figure 6. Similar to Figure 5 showing the paths corresponding to the scaled Levenberg-Marquardt method (based on equation 123) using the initial values of λ (λ_o) given at the top of the figure (circles).

Solving equation 127 by least squares gives

$$B^TB\,\delta = B^Tu, \qquad (129)$$

which after performing the multiplications indicated (B^T B = A^T A + p^2 I and B^T u = A^T c) becomes equation 93 with λ replaced by p^2. This result is quoted without proof in Wynne and Wormell (1963). Feder (1963), however, provides a proof that starts with equation 126. Wynne's (1959) paper is also interesting because it notes that for p going to infinity the solution approaches the one obtained using the method of steepest descent, thus anticipating Marquardt's results. In Wynne's method, the selection of p was empirical, and was based on the condition that the computed value of δ was small enough to assure that the linearization of the nonlinear problem was approximately satisfied. A simpler approach, suggested by Feder (1963), is to start with a large value of p and to reduce it gradually so as to assure convergence to a solution. An application of Wynne's method was provided in Nunn and Wynne (1959).
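The equivalence between the stacked system and the damped normal equations can be verified numerically with a sketch like the following (our construction, with random test data):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(15, 4))
c = rng.normal(size=15)
p = 2.0

# Equation 128: stack A over pI and c over a zero vector
B = np.vstack([A, p * np.eye(4)])
u = np.concatenate([c, np.zeros(4)])

# Least-squares solution of equation 127 ...
delta_stacked = np.linalg.lstsq(B, u, rcond=None)[0]

# ... equals the damped solution, because B^T B = A^T A + p^2 I
# and B^T u = A^T c
delta_damped = np.linalg.solve(A.T @ A + p**2 * np.eye(4), A.T @ c)

print(np.allclose(delta_stacked, delta_damped))  # True
```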

Another rediscoverer was Girard (1958), although his work was not as extensive as that of Wynne. Girard's approach was to add the constraint

$$K\,\delta = 0, \qquad (130)$$

where K is a diagonal matrix, to an equation similar to equation 125. This led to a merit function similar to S̄ and to a matrix equation similar to equation 82 with (1/w)D replaced by a diagonal matrix. Neither the derivation of the equation nor an expression for the elements of the matrix were given, although the latter can be derived by replacing pI with K in the matrix B in equation 128 and proceeding as before.

This brief summary shows that Levenberg's method was known among people working on optical design, but this knowledge did not spread further. The widespread lack of recognition of Levenberg's work may have to do with the unavailability of adequate computational capabilities at the time his paper appeared, and the possibility that his way of finding the optimal value of w at each iteration was deemed too complicated for its computer implementation. For example, Hartley (1961) notes the need to know higher derivatives of the merit function and concludes that the method was not well suited for computer programming. Interestingly, it was Hartley who brought Levenberg's paper to the attention of Marquardt as a reviewer of the latter's paper. Marquardt's work, on the other hand, became popular rather quickly, which raises the question of why this happened. According to Davis (1993), Marquardt explained the success of his method by the fact that he implemented it in a FORTRAN code and that he gave away hundreds of copies of it. Other interesting comments by Marquardt on his paper can be found at http://garfield.library.upenn.edu/classics1979/A1979HZ24400001.pdf.

ACKNOWLEDGMENT

I gratefully acknowledge the constructive comments of one of the reviewers, Bill Rodi, which led to an improved presentation of the paper, his careful checking of the equations, and the comments that motivated the note on the relation between the Newton and Gauss-Newton methods.

APPENDIX A

SOME RESULTS CONCERNING POSITIVE DEFINITE AND SEMIDEFINITE MATRICES

(1) A square symmetric matrix C is said to be positive semidefinite if

$$y^TCy \geq 0; \quad y \neq 0. \qquad (A-1)$$

The matrix C is positive definite if the ≥ sign above is replaced by >. If v_i and λ_i are an eigenvector of C and its corresponding eigenvalue, then

$$v_i^TCv_i = \lambda_i\,v_i^Tv_i = \lambda_i\,|v_i|^2. \qquad (A-2)$$

Because v_i ≠ 0, if C is positive semidefinite, λ_i ≥ 0. If C is positive definite, λ_i > 0 and its inverse exists because C^{-1} = UΛ^{-1}U^T (see equations 100 and 101).

If the > sign in equation A-1 is replaced by <, the matrix C is said to be negative definite and its eigenvalues are negative.

(1a) Given any matrix A, A^T A is either positive definite or semidefinite, as can be seen from

$$y^T(A^TA)y = (Ay)^TAy = |Ay|^2 \geq 0; \quad y \neq 0. \qquad (A-3)$$

If the inverse of A^T A exists, all of its eigenvalues will be larger than zero and A^T A will be positive definite. If the inverse does not exist, some of the eigenvalues will be equal to zero and the matrix will be positive semidefinite.

(1b) A diagonal matrix D with elements (D)_ii = d_i > 0 is positive definite because

$$y^TDy = \sum_i d_i\,y_i^2 > 0; \quad y \neq 0,\ d_i > 0. \qquad (A-4)$$

(1c) If matrices C and P are positive semidefinite and definite, respectively, and λ > 0, then C + λP is positive definite because

$$y^T(C + \lambda P)y = y^TCy + \lambda\,y^TPy > 0; \quad y \neq 0,\ \lambda > 0. \qquad (A-5)$$

These three results are important in the context of the damped least-squares method because if C = A^T A, then the matrices in parentheses in equations 82 and 93 will be positive definite as long as w > 0 and λ > 0, and their inverses will exist. For the particular case of P = I the eigenvalues of C + λI are λ_i + λ, which are always positive as long as λ > 0 (Feder, 1963).
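As a quick numerical illustration of result 1c for the case P = I (a sketch with a random test matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(10, 4))
C = A.T @ A                 # positive (semi)definite by result 1a
lam = 0.3

# The eigenvalues of C + lam I are those of C shifted by lam,
# so they are all positive for lam > 0
evals_C = np.linalg.eigvalsh(C)
evals_damped = np.linalg.eigvalsh(C + lam * np.eye(4))
print(np.allclose(evals_damped, evals_C + lam))  # True
```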

(2) If B is a symmetric positive (semi)definite matrix, there exists a unique matrix C such that

$$B = C^2, \qquad (A-6)$$

where C is symmetric positive (semi)definite. To see that, start with the eigenvalue decomposition of B, given by

$$B = U\Lambda U^T \qquad (A-7)$$

(see equation 100) and introduce the matrix

$$C = U\Lambda^{1/2}U^T, \qquad (A-8)$$


where Λ^{1/2} is a diagonal matrix with its iith element given by λ_i^{1/2}. Then,

$$C^2 = U\Lambda^{1/2}U^T\,U\Lambda^{1/2}U^T = U\Lambda U^T = B, \qquad (A-9)$$

where equation 101 has been used. The matrix C is known as the square root of B, and is indicated by B^{1/2}. This matrix is unique (for a proof see, e.g., Harville, 1997; Zhang, 1999).

If B is positive definite,

$$B^{-1/2} = C^{-1} = U\Lambda^{-1/2}U^T, \qquad (A-10)$$

where equation 101 has been used. This matrix is also symmetric and positive definite.
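The construction in equations A-7 to A-10 can be illustrated with a short sketch (ours, using NumPy's symmetric eigensolver):

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(size=(6, 4))
B = G.T @ G + np.eye(4)     # symmetric positive definite test matrix

# Equations A-7 and A-8: B = U Lambda U^T, C = U Lambda^{1/2} U^T
evals, U = np.linalg.eigh(B)
C = U @ np.diag(np.sqrt(evals)) @ U.T

print(np.allclose(C @ C, B))                     # equation A-9
B_inv_half = U @ np.diag(evals**-0.5) @ U.T      # equation A-10
print(np.allclose(B_inv_half @ B_inv_half, np.linalg.inv(B)))  # True
```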

(3) In its standard form, the Schwarz inequality states that

$$(a^Tb)^2 \leq (a^Ta)(b^Tb) \qquad (A-11)$$

for any vectors a and b (e.g., Arfken, 1985).

Let C be a symmetric positive definite matrix. In equation A-11, replace a and b by C^{1/2}a and C^{-1/2}b, respectively. Because C^{1/2} is symmetric, this immediately gives equation 65 (e.g., Zhang, 1999).

(4) The matrix M in equation 64 is symmetric (because so are the matrices on the right) and positive definite. To show that, let y be an arbitrary nonzero vector. Then

$$y^TMy = y^TB^{-1/2}HB^{-1/2}y = (B^{-1/2}y)^TH(B^{-1/2}y) > 0, \qquad (A-12)$$

because H is assumed to be positive definite. The vector in parentheses on the right-hand side of this equation is arbitrary.

APPENDIX B

PROOF OF EQUATION 107

In component form, equation 106 becomes

$$\cos\gamma(\lambda) = \frac{\displaystyle\sum_i \frac{u_i^2}{\lambda_i + \lambda}}{\left[\displaystyle\sum_i \frac{u_i^2}{(\lambda_i + \lambda)^2}\right]^{1/2} |A^Tc|}. \qquad (B-1)$$

The derivative of cos γ with respect to λ is given by

$$\frac{d}{d\lambda}\cos\gamma(\lambda) = \frac{1}{C}\left\{\left(\sum_i \frac{u_i^2}{\lambda_i + \lambda}\right)\left(\sum_i \frac{u_i^2}{(\lambda_i + \lambda)^3}\right) - \left(\sum_i \frac{u_i^2}{(\lambda_i + \lambda)^2}\right)^2\right\}, \qquad (B-2)$$

where

$$C = \left[\sum_i \frac{u_i^2}{(\lambda_i + \lambda)^2}\right]^{3/2} |A^Tc|. \qquad (B-3)$$

The factor 1/C is positive. To find the sign of the factor in braces in equation B-2 we must perform all the operations indicated. The resulting expression is

$$\frac{\left(\displaystyle\sum_i u_i^2\beta_{1i}\right)\left(\displaystyle\sum_i u_i^2\beta_{3i}\right) - \left(\displaystyle\sum_i u_i^2\beta_{2i}\right)^2}{\left[\displaystyle\prod_i (\lambda_i + \lambda)^2\right]^2}, \qquad (B-4)$$

where

$$\beta_{ni} = \prod_{i' \neq i} (\lambda_{i'} + \lambda)^n; \quad n = 1,2,3. \qquad (B-5)$$

To show how this result is derived it will be assumed that the number of terms in the sums is three; the extension to any other number is straightforward. Let

$$a_i = \lambda_i + \lambda. \qquad (B-6)$$

Then, the fractions within braces in equation B-2 can be rewritten as follows:

$$\frac{u_1^2}{a_1^n} + \frac{u_2^2}{a_2^n} + \frac{u_3^2}{a_3^n} = \frac{u_1^2a_2^na_3^n + u_2^2a_1^na_3^n + u_3^2a_1^na_2^n}{a_1^na_2^na_3^n} = \frac{u_1^2\beta_{n1} + u_2^2\beta_{n2} + u_3^2\beta_{n3}}{a_1^na_2^na_3^n} = \frac{\displaystyle\sum_i u_i^2\beta_{ni}}{\displaystyle\prod_i a_i^n}, \qquad (B-7)$$

where

$$\beta_{n1} = a_2^na_3^n; \quad \beta_{n2} = a_1^na_3^n; \quad \beta_{n3} = a_1^na_2^n; \quad n = 1,2,3. \qquad (B-8)$$

Note that the first subindex in β_ni refers to the power to which the factors in the product are raised and the second one indicates which factor is excluded from the product. The denominator in equation B-4 is common to the two terms within the braces in equation B-2.

The denominator in equation B-4 is also positive. To find out the sign of the numerator, a new transformation is needed, based on the fact that

$$\beta_{1i}\,\beta_{3i} = (\beta_{2i})^2. \qquad (B-9)$$

This allows writing the numerator of expression B-4 as

$$\left[\sum_i \left(u_i\beta_{1i}^{1/2}\right)^2\right]\left[\sum_i \left(u_i\beta_{3i}^{1/2}\right)^2\right] - \left[\sum_i \left(u_i\beta_{1i}^{1/2}\right)\left(u_i\beta_{3i}^{1/2}\right)\right]^2, \qquad (B-10)$$

which is always positive because of the Schwarz inequality (see equation A-11). To apply it to equation B-10 let a and b be vectors with components u_iβ_1i^{1/2} and u_iβ_3i^{1/2}, respectively. This result shows that

$$\frac{d}{d\lambda}\cos\gamma(\lambda) > 0. \qquad (B-11)$$
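The monotonic behavior established by this proof is easy to verify numerically. The sketch below evaluates equation B-1 on a grid of values of λ for arbitrarily chosen positive eigenvalues λ_i and coefficients u_i (both are assumptions of the sketch); the constant factor |A^Tc| is omitted because it does not affect monotonicity:

```python
import numpy as np

lam_i = np.array([5.0, 1.0, 0.2])   # eigenvalues of A^T A (assumed)
u = np.array([0.7, -1.3, 0.4])      # components of u = U^T A^T c (assumed)

def cos_gamma(lam):
    """Equation B-1 up to the positive constant |A^T c|."""
    num = np.sum(u**2 / (lam_i + lam))
    den = np.sqrt(np.sum(u**2 / (lam_i + lam)**2))
    return num / den

lams = np.linspace(0.0, 100.0, 2001)
vals = np.array([cos_gamma(lam) for lam in lams])
print(np.all(np.diff(vals) > 0))    # True: cos(gamma) increases with lambda
```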

REFERENCES

Arfken, G., 1985, Mathematical methods for physicists: Academic Press Inc.
Apostol, T., 1969, Calculus, vol. II: Blaisdell Publishing Company.
Beveridge, G., and R. Schechter, 1970, Optimization: Theory and practice: McGraw-Hill Book Company.
Crockett, J., and H. Chernoff, 1955, Gradient methods of maximization: Pacific Journal of Mathematics, 5, 33–50.
Curry, H., 1944, The method of steepest descent for non-linear minimization problems: Quarterly of Applied Mathematics, 2, 258–261.
Davies, M., and I. Whitting, 1972, A modified form of Levenberg's correction, in F. Lootsma, ed., Numerical methods for non-linear optimization: Academic Press Inc., 191–201.
Davis, P., 1993, Levenberg-Marquart (sic) methods and nonlinear estimation: SIAM News, 26, no. 2.
Dobrin, M., 1976, Introduction to geophysical prospecting: McGraw-Hill Book Co.
Dorny, C., 1975, A vector space approach to models and optimization: John Wiley & Sons.
Draper, N., and H. Smith, 1981, Applied regression analysis: John Wiley & Sons.
Feder, D., 1963, Automatic optical design: Applied Optics, 2, 1209–1226.
Gill, P., W. Murray, and M. Wright, 1981, Practical optimization: Academic Press Inc.
Girard, A., 1958, Calcul automatique en optique géométrique: Revue d'Optique, 37, 225–241.
Greenstadt, J., 1967, On the relative efficiencies of gradient methods: Mathematics of Computation, 21, 360–367.
Hartley, H., 1961, The modified Gauss-Newton method for the fitting of non-linear regression functions by least squares: Technometrics, 3, 269–280.
Harville, D., 1997, Matrix algebra from a statistician's perspective: Springer.
Jenkins, G., and D. Watts, 1968, Spectral analysis: Holden-Day.
Krzanowski, W., 1988, Principles of multivariate analysis: Oxford University Press.
Levenberg, K., 1944, A method for the solution of certain non-linear problems in least squares: Quarterly of Applied Mathematics, 2, 164–168.
Luenberger, D., 1973, Introduction to linear and nonlinear programming: Addison-Wesley Publishing Company.
Marquardt, D., 1963, An algorithm for least-squares estimation of nonlinear parameters: SIAM Journal, 11, 431–441.
Noble, B., and J. Daniel, 1977, Applied linear algebra: Prentice-Hall.
Nunn, M., and C. Wynne, 1959, Lens designing by electronic digital computer: II: Proceedings of the Physical Society, 74, 316–329.
Rosen, S., and C. Eldert, 1954, Least-squares method for optical correction: Journal of the Optical Society of America, 44, 250–252.
Seber, G., 1977, Linear regression analysis: John Wiley & Sons, Inc.
Seber, G., and C. Wild, 1989, Nonlinear regression: John Wiley & Sons, Inc.
Spiegel, M., 1959, Vector analysis: McGraw-Hill Book Co.
Wynne, C., 1959, Lens designing by electronic digital computer: I: Proceedings of the Physical Society (London), 73, 777–787.
Wynne, C., and P. Wormell, 1963, Lens design by computer: Applied Optics, 2, 1233–1238.
Zhang, F., 1999, Matrix theory: Springer.