Mathematical Methods in the Applied Sciences, Vol. 11, 331-342 (1989) MOS subject classification: 65 R 20
On Properties of the Iterative Maximum Likelihood Reconstruction Method
H. N. Mülthei
Fachbereich Mathematik, Johannes Gutenberg-Universität, Saarstraße 21, D-6500 Mainz, F.R. Germany
and
B. Schorr
CERN, CH-1211 Geneva 23, Switzerland
Communicated by W. Törnig
In this paper, we continue our investigations⁶ on the iterative maximum likelihood reconstruction method applied to a special class of integral equations of the first kind, where one of the essential assumptions is the positivity of the kernel and the given right-hand side. Equations of this type often occur in connection with the determination of density functions from measured data. There are certain relations between the directed Kullback-Leibler divergence and the iterative maximum likelihood reconstruction method, some of which were already observed by other authors. Using these relations, further properties of the iterative scheme are shown and, in particular, a new short and elementary proof of convergence of the iterative method is given for the discrete case. Numerical examples have already been given in Reference 6. Here, an example is considered which can be worked out analytically and which demonstrates fundamental properties of the algorithm.
1. Introduction
The iterative maximum likelihood reconstruction method for image reconstruction in emission tomography was first systematically studied by Shepp and Vardi.⁷ The proof of convergence follows from a general result of Csiszár and Tusnády² for a class of algorithms of which the iterative maximum likelihood reconstruction method is only a special case. An outline of the proof of convergence for this special case is given by Vardi, Shepp and Kaufman.⁸ As has been shown,⁷ the method is an EM algorithm and converges to a maximum likelihood estimate of the discretized density of emission counts if these counts are Poisson distributed. EM algorithms have been studied by Dempster, Laird and Rubin.³ Kondor⁴ applied the iterative method to the continuous case, in other words, to integral equations of the form
0170-4214/89/030331-12$06.00 © 1989 by B. G. Teubner Stuttgart-John Wiley & Sons, Ltd.
∫₀¹ k(x,y) f(y) dy = g(x),  x ∈ [0,1],    (1.1)
Received 9 February 1988
with given positive k and g and unknown non-negative f. (For convenience, x and y are taken to be one-dimensional; they may have several dimensions.) Kondor, who does not refer to Shepp and Vardi’s work, justified the use of this method purely by the success he claims to have had with it in practical applications. In Reference 6, the authors of this paper gave a motivation of the iterative method for the continuous case, based on ideas from Shepp and Vardi and from Dempster et al.
Starting from the functional Λ with

Λ(f) = ∫₀¹ g(x) ln(∫₀¹ k(x,y) f(y) dy) dx,    (1.2)
it has been shown⁶ that under certain conditions the iterative method, if it converges, tends to a non-negative function f which maximizes Λ on the space of density functions. The motivation for the use of the functional comes from the fact that Λ can be considered as the limiting case of the log-likelihood function of a sample from independent Poisson distributions. In Reference 6, many properties of the continuous and the discrete iterative scheme have been shown. Convergence in the continuous case has been an open question, and it is also not answered in this paper. However, further properties of the continuous and the discrete iterative method will be shown by using the directed Kullback-Leibler⁵ divergence (also called average information) of two densities h₁ and h₂:
d(h₁, h₂) := ∫ h₁(x) ln(h₁(x)/h₂(x)) dx.    (1.3)
As will be seen later, this functional induces some kind of a directed distance measure on the space of density functions. Using this pseudo-distance, we are able to formulate a minimization problem whose solution is linked to the iterative scheme. In particular, we can explain the relations of the iterative scheme to the integral equation (1.1). Furthermore, we shall give an elementary and short proof of convergence of the iterative method in the discrete case.
2. The iterative method
In this section, we briefly give the main definitions. More details may be found in Reference 6. In order to avoid unnecessary complications, we shall assume as in Reference 6 that k ∈ C([0,1]²), g ∈ C[0,1], and that k and g are positive. In this paper, without loss of generality, we normalize to obtain
∫₀¹ k(x,y) dx = ∫₀¹ g(x) dx = 1,  y ∈ [0,1].

In what follows, the requirements for a solution f of (1.1) are explicitly stated. The iterative method is then defined by

f_{n+1} = G(f_n),  n ∈ ℕ₀,    (2.1)

G(h)(y) := h(y) T(h)(y),  T(h)(y) := ∫₀¹ k(x,y) g(x) / (∫₀¹ k(x,z) h(z) dz) dx,
f, f₀ ∈ X := {h ∈ C[0,1]\{0} : h(x) ≥ 0 for all x ∈ [0,1]}.
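As an editorial illustration (not part of the original paper), the iteration can be approximated numerically with a midpoint quadrature rule; the kernel k, the right-hand side g and all grid parameters below are hypothetical choices, normalized in the manner required above.

```python
import math

# Midpoint discretization of [0,1] with m nodes (all choices here are
# hypothetical; any positive k and g would do).
m = 50
xs = [(i + 0.5) / m for i in range(m)]

def bump(u):
    return math.exp(-50.0 * u * u) + 0.1   # positive kernel profile

# k[i][j] ~ k(x_i, y_j), column-normalized so that (1/m) * sum_i k[i][j] = 1.
k = [[bump(x - y) for y in xs] for x in xs]
for j in range(m):
    col = sum(k[i][j] for i in range(m)) / m
    for i in range(m):
        k[i][j] /= col

g = [1.0 + 0.5 * math.sin(2.0 * math.pi * x) for x in xs]
s = sum(g) / m
g = [v / s for v in g]                      # (1/m) * sum g = 1

def T(f):
    # T(h)(y_j): outer and inner integrals by the midpoint rule
    Kf = [sum(k[i][l] * f[l] for l in range(m)) / m for i in range(m)]
    return [sum(k[i][j] * g[i] / Kf[i] for i in range(m)) / m for j in range(m)]

def Lam(f):
    # discretized functional (1.2)
    Kf = [sum(k[i][l] * f[l] for l in range(m)) / m for i in range(m)]
    return sum(g[i] * math.log(Kf[i]) for i in range(m)) / m

f = [1.0] * m                               # starting function f_0 = 1
lam0 = Lam(f)
for _ in range(200):
    t = T(f)
    f = [f[j] * t[j] for j in range(m)]     # f_{n+1} = f_n * T(f_n)
```

Under this quadrature every step maps exactly into the discretized X₁, i.e. (1/m)·Σ_j f_{n+1}(y_j) = 1, which mirrors the statement below that G maps X into X₁.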
If f ∈ X is a solution of (1.1), it trivially belongs to the set of functions

ℱ := {h ∈ X : T(h)(y) ≤ 1, y ∈ [0,1], where equality holds for y with h(y) > 0}.

The set ℱ corresponds to the Kuhn-Tucker conditions for the discrete case (Reference 8, p. 119). For f ∈ ℱ the fixed-point equation f = G(f) holds. Let
X₁ := {h ∈ X : ∫₀¹ h(y) dy = 1}.
Every solution f ∈ X of (1.1) belongs to X₁, and the operator G maps X into X₁ ⊂ X. In Reference 6 (Theorem 1 and its proof), it is shown that every f* ∈ ℱ is a global solution of the maximization problem MP: maximize Λ(f) for f ∈ X₁. Furthermore, we have
Λ(f) ≤ ∫₀¹ g(x) ln g(x) dx,  f ∈ X₁,    (2.2)
and for a solution f* ∈ X₁ of (1.1) the equation

Λ(f*) = ∫₀¹ g(x) ln g(x) dx

holds.
holds. In addition, if X + denotes all positive f~ X , the following properties of the iterative scheme (2.1) hold (see Reference 6, p. 151* and Theorem 8):
(2.3)
(ii) If for& EX +, f, converges tof* E C [0,1] with respect to L, [0,1] as n tends to infinity, then f *EF, which means that f* is a solution of the maximization problem MP.
These properties show the connection between the iterative method (2.1), the functional A and the maximization problem MP. In the following section we shall give a connection between the iterative scheme (2. l), the directed Kullback-Leibler divergence (1.3) and a certain minimization problem.
1
(0 A(L) G N L + l ) G j Q(x)lng(x)dx,foEX+, n E N 0
3. A minimization problem equivalent to MP
Before we formulate the minimization problem, we give a few properties of the directed Kullback-Leibler divergence which are proved in Reference 5. Let h₁, h₂ ∈
*The lower bound of Λ(f_n) in Reference 6, p. 151, is a misprint and has to be cancelled.
L₁[0,1] be probability density functions. If H_i is the hypothesis that the random variable X has a distribution with density h_i, i = 1, 2, then d(h₁, h₂) defined by (1.3) can be interpreted as the average information per observation from the distribution with the density h₁ for discrimination in favour of H₁ against H₂. The sum d(h₁, h₂) + d(h₂, h₁) is called the Kullback-Leibler divergence of h₁ and h₂. It has all the properties of a metric except the triangle inequality property. The functional d has all the properties of the divergence except the symmetry property, and is therefore called the directed Kullback-Leibler divergence. Both kinds of divergence introduce a sort of distance measure in the space of probability density functions. Furthermore, if
H(h) := −∫₀¹ h(x) ln h(x) dx,  h ∈ X₁,

which is called the entropy of h, it can easily be verified that
Λ(f) = −d(g, F) − H(g),  f ∈ X₁,    (3.1)

where F(x) := ∫₀¹ k(x,y) f(y) dy. Trivially, we then have the
Theorem 1. The maximization problem MP is equivalent to the minimization problem: minimize d(g, F) for f ∈ X₁.

Remark. If the directed Kullback-Leibler divergence d(g, F) is considered as a measure of distance between g and F, then Theorem 1 yields a geometrical interpretation of the problem MP, in particular in the case where the integral equation (1.1) has no solution in X₁. Using Theorem 1 we shall show further properties of the iterative scheme (2.1).
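The identity (3.1) on which Theorem 1 rests can be checked numerically in a discretized setting; the matrix p, the density g and the trial density f below are illustrative values, not taken from the paper.

```python
import math

# Hypothetical 3x3 column-stochastic matrix p and densities g, phi.
p = [[0.5, 0.2, 0.3],
     [0.3, 0.5, 0.3],
     [0.2, 0.3, 0.4]]
g = [0.5, 0.3, 0.2]
phi = [0.2, 0.5, 0.3]                  # any density f

F = [sum(p[i][j] * phi[j] for j in range(3)) for i in range(3)]  # F = P f

Lam_val = sum(g[i] * math.log(F[i]) for i in range(3))           # Λ(f)
d_gF = sum(g[i] * math.log(g[i] / F[i]) for i in range(3))       # d(g, F)
H_g = -sum(g[i] * math.log(g[i]) for i in range(3))              # entropy H(g)
```

The three quantities satisfy Λ(f) = −d(g, F) − H(g) exactly, since the g·ln g terms cancel.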
4. Properties of the iterative method
Let
K(h; x, y) := k(x,y) h(y) / ∫₀¹ k(x,z) h(z) dz,  x, y ∈ [0,1],  h ∈ X,
F_n(x) := ∫₀¹ k(x,y) f_n(y) dy,  n ∈ ℕ₀.
Then K(h; x, ·) ∈ X₁ for h ∈ X and x ∈ [0,1], and we can prove
Theorem 2. Let f₀ ∈ X⁺. We then have the relations
Λ(f_{n+1}) − Λ(f_n) = d(g, F_n) − d(g, F_{n+1})
  = d(f_{n+1}, f_n) + ∫₀¹ g(x) d(K(f_n; x, ·), K(f_{n+1}; x, ·)) dx,  n ∈ ℕ₀.    (4.1)

Proof. The first relation immediately follows from (3.1) and the second one is easy to verify. It is identical to the relation (3.13) in Reference 6, for f := f_n.
Corollary. Let f₀ ∈ X⁺. The following statements then hold:

(i) d(g, F_{n+1}) ≤ d(g, F_n), n ∈ ℕ₀, and therefore d(g, F_n), n ∈ ℕ₀, converges to a non-negative real number as n tends to infinity;

(ii) d(f_{n+1}, f_n) ≤ d(g, F_n) − d(g, F_{n+1}), n ∈ ℕ₀.
Proof. From (2.3) and (4.1) it follows that d(g, F_n) is monotonically decreasing, and hence convergent since it is bounded from below by zero. Thus (i) is proved. The assertion (ii) immediately follows from (4.1) and the non-negativity of d.
Remark. According to (i) of the Corollary the iterative scheme (2.1) is a method of descent with respect to the directed Kullback-Leibler divergence in the range of the integral operator defined by the kernel k.
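The descent property (i) carries over to the discrete analogue of the scheme (Section 5); the following sketch traces d(g, F_n) along the iteration for an illustrative 3x3 system (hypothetical values).

```python
import math

# Illustrative 3x3 column-stochastic matrix p and density g.
p = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.3], [0.2, 0.3, 0.4]]
g = [0.5, 0.3, 0.2]

def F_of(phi):
    return [sum(p[i][j] * phi[j] for j in range(3)) for i in range(3)]

def step(phi):
    # one step of the (discrete) maximum likelihood iteration
    F = F_of(phi)
    return [phi[j] * sum(p[i][j] * g[i] / F[i] for i in range(3))
            for j in range(3)]

def d(a, b):
    # directed Kullback-Leibler divergence
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

phi = [1/3, 1/3, 1/3]
divs = []
for _ in range(30):
    divs.append(d(g, F_of(phi)))
    phi = step(phi)
```

The recorded sequence d(g, F_n) is monotonically decreasing, as asserted by (i).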
Theorem 3. Let f = G(f), f ∈ X₁, and f₀ ∈ X⁺. Then,
holds, and by integrating it the equality (4.2) follows.
Theorem 4. Let f = G(f), f ∈ X₁, and f₀ ∈ X⁺. Then

d(f, f_{n+1}) ≤ d(f, f_n) − (Λ(f) − Λ(f_n)),  n ∈ ℕ₀,    (4.3)

and therefore for f ∈ ℱ

d(f, f_{n+1}) ≤ d(f, f_n),  n ∈ ℕ₀.    (4.4)
Proof. Set

t(h; x, z) := k(z,x) g(z) / ∫₀¹ k(z,u) h(u) du,  h ∈ X⁺.
Because of the non-negativity of d and f T(f) = G(f) = f we have
Thus (4.3) is proved. (4.4) immediately follows from (2.2).
Remark. According to (4.4) the iterative scheme (2.1) is a method of descent with respect to d.
Corollary 1. Let f ∈ ℱ and f₀ ∈ X⁺. Then, for the correction term in (4.2), the inequality

0 ≤ Λ(f) − Λ(f_n) ≤ ∫₀¹ f(y) ln T(f_n)(y) dy,  n ∈ ℕ₀,

holds.
Proof. The inequality is a trivial consequence of Theorems 3 and 4.
Corollary 2. Let f ∈ X₁ be a solution of the integral equation (1.1) and f₀ ∈ X⁺. Then, the inequality

d(f, f_{n+1}) ≤ d(f, f_n) − d(g, F_n),  n ∈ ℕ₀,    (4.5)

holds.

Proof. Since f is a solution of (1.1) we have f = G(f) and, by (3.1), Λ(f) − Λ(f_n) = d(g, F_n); with Theorem 4, (4.5) follows.
The inequality (4.5) yields the following information, which is of particular interest for the applications. If the directed Kullback-Leibler divergence d(f, f_n) is interpreted as a measure of goodness of the approximation f_n to f, it follows that if f_n is a 'bad' approximation to f, inducing the same relation between F_n and g, step n + 1 yields a rather large correction. This corresponds to the situation observed in numerical examples when a 'bad' starting function f₀ is taken. The iterative scheme yields approximations after a few steps which may be improved only minimally when carrying on with the iterations. This shows in particular that the choice of the starting function f₀ is not so much a problem. In fact, the choice f₀ = 1 is often a good one.
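This practical remark can be illustrated numerically. In the sketch below (a hypothetical 3x3 system constructed so that an interior solution exists, since g = P·(0.4, 0.35, 0.25)), a 'flat' start and a deliberately bad start arrive at essentially the same reconstruction.

```python
import math

# Hypothetical 3x3 column-stochastic matrix; g = P (0.4, 0.35, 0.25),
# so an interior solution exists and both runs must approach it.
p = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.3], [0.2, 0.3, 0.4]]
g = [0.345, 0.37, 0.285]

def step(phi):
    F = [sum(p[i][j] * phi[j] for j in range(3)) for i in range(3)]
    return [phi[j] * sum(p[i][j] * g[i] / F[i] for i in range(3))
            for j in range(3)]

def run(phi, n):
    for _ in range(n):
        phi = step(phi)
    return phi

a = run([1/3, 1/3, 1/3], 20000)      # 'flat' start, analogue of f_0 = 1
b = run([0.98, 0.01, 0.01], 20000)   # deliberately bad start

def d(x, y):
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)

Fa = [sum(p[i][j] * a[j] for j in range(3)) for i in range(3)]
Fb = [sum(p[i][j] * b[j] for j in range(3)) for i in range(3)]
```

Both runs drive d(g, F_n) down to machine precision and agree with each other, so the choice of starting vector hardly matters here.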
5. The discrete problem
The following system of equations is considered:
Σ_{j=1}^m p_{ij} φ_j = g_i,  i = 1(1)r,    (5.1)
under the constraints φ_j ≥ 0, j = 1(1)m, where

Σ_{i=1}^r g_i = 1,  g_i ≥ 0,  i = 1(1)r,

Σ_{i=1}^r p_{ij} = 1,  p_{ij} ≥ 0,  i = 1(1)r,  j = 1(1)m.
It immediately follows that a solution of (5.1) satisfying the constraints belongs to the simplex

S_m := {ψ := (ψ_j)_{j=1(1)m} ∈ ℝ^m : Σ_{j=1}^m ψ_j = 1 and ψ ≥ 0 (by components)}.
By discretization of (1.1), using a quadrature formula, and by appropriate scaling, a system of equations of the type (5.1) with r = m is obtained, as has been shown in Reference 6. The integers m and r in (5.1) are not further specified, so that underdetermined or overdetermined systems are admitted. The discrete analogue of (2.1), the so-called iterative maximum likelihood reconstruction method, then reads formally:
φ^{n+1} = G_m(φ^n),  n ∈ ℕ₀,    (5.2)

φ^n := (φ_j^n)_{j=1(1)m},  G_m(φ) := (G_{m,j}(φ))_{j=1(1)m},  φ := (φ_j)_{j=1(1)m},

G_{m,j}(φ) := φ_j T_{m,j}(φ),

with

T_{m,j}(φ) := Σ_{i=1}^r p_{ij} g_i / (Σ_{s=1}^m p_{is} φ_s)    (5.3)

for φ, φ⁰ ∈ ℝ^m \ {0} and φ, φ⁰ ≥ 0.
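The iteration (5.2)-(5.3) is straightforward to implement for general r and m; the sketch below uses a hypothetical underdetermined system with r = 2 < m = 3 (underdetermined systems being explicitly admitted above).

```python
# One step of (5.2)-(5.3) for general r and m.
def T(phi, p, g):
    r, mm = len(p), len(p[0])
    denom = [sum(p[i][s] * phi[s] for s in range(mm)) for i in range(r)]
    return [sum(p[i][j] * g[i] / denom[i] for i in range(r))
            for j in range(mm)]

def G(phi, p, g):
    return [x * t for x, t in zip(phi, T(phi, p, g))]

# Hypothetical underdetermined system, r = 2, m = 3.
p = [[0.6, 0.3, 0.5],          # column sums equal 1, as required in (5.1)
     [0.4, 0.7, 0.5]]
g = [0.55, 0.45]               # non-negative, sums to 1

phi = [1/3, 1/3, 1/3]
for _ in range(20000):
    phi = G(phi, p, g)

F = [sum(p[i][j] * phi[j] for j in range(3)) for i in range(2)]
```

Each step maps the iterate back into S_m, since Σ_j φ_j T_{m,j}(φ) = Σ_i g_i = 1; for this example a non-negative solution of (5.1) exists, so the residual P φ − g vanishes in the limit.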
In the definition of T_{m,j}(φ), the sum in the denominator on the right-hand side of (5.3) can be zero. In order to clarify this situation, we use the following definitions:
A := {i ∈ {1, 2, ..., r} : g_i > 0},  B_i := {j ∈ {1, 2, ..., m} : p_{ij} > 0},  i ∈ A.
These sets are not empty according to our assumptions. The dependence of the index sets on m is not explicitly indicated. We then define
D_m := {ψ := (ψ_j)_{j=1(1)m} ∈ ℝ^m : ψ ≥ 0 and Σ_{s∈B_i} ψ_s > 0, i ∈ A},

i.e. D_m is the domain of G_m; on D_m the sum (5.3) takes the form

T_{m,j}(φ) = Σ_{i∈A} p_{ij} g_i / P_i(φ),  P_i(φ) := Σ_{s∈B_i} p_{is} φ_s.    (5.4)

Furthermore, let
C := {j ∈ {1, 2, ..., m} : p_{ij} = 0 for all i ∈ A},

where then C ∩ B_i = ∅ for all i ∈ A.
Suppose we always start the iteration with φ⁰ > 0. Then the iteration procedure (5.2) is well defined, and in particular we have
0 = φ_j^{n+1} = G_{m,j}(φ^n),  j ∈ C,  n ∈ ℕ₀,    (5.5)

0 < φ_j^n,  j ∈ C* := {1, 2, ..., m} \ C,  n ∈ ℕ₀.
Note that

C* = ∪_{i∈A} B_i,
which means that only the positive components of the iterates φ^n, n ∈ ℕ, contribute to P_i(φ) in (5.4). Furthermore, all iterates φ^n, n ∈ ℕ, belong to the class
D_m^* := {φ ∈ D_m ∩ S_m : φ_i = 0, i ∈ C},
which follows from (5.2), (5.4) and (5.5). Related to the system (5.1), we consider the discrete analogue of the functional Λ:

Λ_m(φ) := Σ_{i=1}^r g_i ln(Σ_{j=1}^m p_{ij} φ_j).
Trivially, we have

Λ_m(φ) = Σ_{i∈A} g_i ln(Σ_{j∈B_i} p_{ij} φ_j).
That means that Λ_m(φ) is independent of the components φ_j, j ∈ C, and finite for φ ∈ D_m^*. It follows that a φ maximizing Λ_m on S_m belongs to D_m^*. Carrying over point (i) of Theorem 7 in Reference 6 to the discrete case yields that the sequence Λ_m(φ^n), n ∈ ℕ, φ⁰ > 0, converges monotonically increasing to some real number as n → ∞. In particular, we have
Λ_m(φ^{n+1}) − Λ_m(φ^n) = Λ_m(G_m(φ^n)) − Λ_m(φ^n) → 0,  n → ∞.    (5.6)
In an analogous manner to the proof of Theorem 2 and its corollary, it is easy to verify that

Λ_m(G_m(ψ)) − Λ_m(ψ) ≥ d_m(G_m(ψ), ψ),  ψ ∈ D_m^*,    (5.7)

with

d_m(φ, ψ) := Σ_{j=1}^m φ_j ln(φ_j/ψ_j),  φ = (φ_j) ∈ S_m,  ψ = (ψ_j) ∈ S_m,
where, by definition, 0 ln 0 := 0. d_m is the discrete directed Kullback-Leibler divergence, with properties analogous to the continuous case. (5.7) is proved by Cover (Reference 1, Theorem 1) in a more general form.
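The divergence d_m, with the convention 0 ln 0 = 0, and the inequality (5.7) can be checked numerically; the 3x3 system below is illustrative, not taken from the paper.

```python
import math

def d_m(a, b):
    # discrete directed Kullback-Leibler divergence, with 0 ln 0 := 0
    return sum(aj * math.log(aj / bj) for aj, bj in zip(a, b) if aj > 0)

# Illustrative 3x3 column-stochastic system.
p = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.3], [0.2, 0.3, 0.4]]
g = [0.5, 0.3, 0.2]

def Lam(phi):
    return sum(g[i] * math.log(sum(p[i][j] * phi[j] for j in range(3)))
               for i in range(3) if g[i] > 0)

def G(phi):
    denom = [sum(p[i][s] * phi[s] for s in range(3)) for i in range(3)]
    return [phi[j] * sum(p[i][j] * g[i] / denom[i] for i in range(3))
            for j in range(3)]

phi = [0.6, 0.3, 0.1]
ok = True
for _ in range(20):
    nxt = G(phi)
    # check (5.7): the gain in Λ_m dominates d_m(G_m(φ), φ)
    ok = ok and (Lam(nxt) - Lam(phi) >= d_m(nxt, phi) - 1e-12)
    phi = nxt
```

The inequality holds at every step; it follows from Jensen's inequality applied to the weights p_{ij} φ_j / Σ_s p_{is} φ_s.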
In Reference 8 we find the following theorem, given in our notation.

Theorem. Let φ⁰ > 0. The sequence φ^n, n ∈ ℕ, converges to a φ* ∈ S_m which maximizes Λ_m on S_m.

Note that Λ_m is equal to l in Reference 8 up to a constant. We give below a complete proof of this theorem. The proof is elementary and substantially shorter than the one in Reference 8. In particular, we do not need Csiszár's and Tusnády's general results. In addition, we give an elementary analogue to Cover's¹ proof that φ* satisfies the Kuhn-Tucker conditions, without using Fatou's lemma.

Proof. By the Bolzano-Weierstrass theorem the existence of an accumulation point φ* ∈ S_m of the sequence φ^n is guaranteed. Therefore, there exists a subsequence φ^{n_i}, i ∈ ℕ, converging to φ* as i → ∞. We have φ* ∈ D_m^*, for the supposition of the opposite would lead to a contradiction to the fact that the sequence Λ_m(φ^{n_i}) is convergent as i → ∞. The proof of the theorem is obtained in the following four steps:
(i) φ* = G_m(φ*),

(ii) d_m(φ*, φ^{n+1}) ≤ d_m(φ*, φ^n),  n ∈ ℕ,
(iii) φ^n → φ* as n → ∞,
(iv) φ* satisfies the Kuhn-Tucker conditions for optimality:⁹

T_{m,j}(φ*) = 1, if φ_j* > 0,    (5.8)
T_{m,j}(φ*) ≤ 1, if φ_j* = 0,    (5.9)

j = 1(1)m.
Step (i): according to (5.4), T_{m,j}(φ), j = 1(1)m, depends at most on the components φ_i, i ∈ C*. Therefore, T_{m,j}, j = 1(1)m, is continuous on D_m^*. Following an idea of Cover¹ we use the inequality

0 ≤ d_m(G_m(φ^n), φ^n) ≤ Λ_m(G_m(φ^n)) − Λ_m(φ^n),

which follows from (5.7). Using the continuity of T_{m,j}, j ∈ C*, and (5.6) we then obtain in the limiting case

0 = Σ_{j∈C*} φ_j* T_{m,j}(φ*) ln T_{m,j}(φ*) = d_m(G_m(φ*), φ*),
from which (i) immediately follows.
Step (ii): let, for h = (h_j)_{j=1(1)m} ∈ D_m^*,
Using (i) and
we obtain for n ∈ ℕ
= Λ_m(φ^n) − Λ_m(φ*) + d_m(φ*, φ^n) − d_m(φ*, φ^{n+1}).    (5.10)
Since there exists a subsequence φ^{n_i} converging to φ* = G_m(φ*) as i → ∞, and since Λ_m(φ^n) is monotonically increasing and convergent to a real number as n → ∞, we obtain Λ_m(φ^n) ≤ Λ_m(φ*) due to the continuity of Λ_m on D_m^*. Therefore, from (5.10) follows (ii), which is the assertion of Lemma A.1 of Vardi, Shepp and Kaufman.⁸
Step (iii): with the subsequence φ^{n_i} converging to φ* as i → ∞ we have

d_m(φ*, φ^{n_i}) → 0,  i → ∞.
(ii) then implies that d_m(φ*, φ^n) → 0 as n → ∞, from which φ^n → φ* as n → ∞ immediately follows.
Step (iv): (i) immediately implies (5.8). (5.9) is trivial for j ∈ C. Let φ_j* = 0 for some j ∈ C*. We then have

T_{m,j}(φ^n) → T_{m,j}(φ*) > 0,  n → ∞.

The supposition T_{m,j}(φ*) > 1 immediately leads to a contradiction. This completes the proof of the theorem.
The proof shows that the theorem can be supplemented by the statement φ* ∈ D_m^*. The formulae (5.10) and (ii) are the discrete analogues of (4.3) and (4.4), respectively, in Theorem 4. The other assertions on the continuous case can be carried over as well.
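The Kuhn-Tucker conditions (5.8)-(5.9) can be observed numerically at a limit point on the boundary of S_m; in the hypothetical system below, g is concentrated on i = 1, so that A = {1} and the limit has zero components with strict inequality in (5.9).

```python
# Numerical check of the Kuhn-Tucker conditions at the limit.
# Hypothetical 3x3 matrix with column sums 1; g concentrated on i = 1.
p = [[0.6, 0.3, 0.1],
     [0.2, 0.4, 0.5],
     [0.2, 0.3, 0.4]]
g = [1.0, 0.0, 0.0]

def T(phi):
    denom = [sum(p[i][s] * phi[s] for s in range(3)) for i in range(3)]
    # sum only over i in A = {i : g_i > 0}, as in (5.4)
    return [sum(p[i][j] * g[i] / denom[i] for i in range(3) if g[i] > 0)
            for j in range(3)]

phi = [1/3, 1/3, 1/3]
for _ in range(100):
    t = T(phi)
    phi = [phi[j] * t[j] for j in range(3)]

t = T(phi)   # T_{m,j} evaluated (approximately) at the limit point
```

Here the iterates concentrate on the first component; at the limit T_{m,1} = 1 (active component) while T_{m,2} and T_{m,3} stay strictly below 1 (zero components).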
6. An example
We consider the special case g_i = δ_{1i}, i = 1(1)r, which implies A = {1} for (5.1); δ_{1i} is the Kronecker symbol. In addition, it is assumed that the p_{1j}, j = 1(1)m, are numbered in such a way that

p_{11} = ⋯ = p_{1s} > p_{1,s+1} ≥ ⋯ ≥ p_{1t} > p_{1,t+1} = ⋯ = p_{1m} = 0,  1 ≤ s ≤ t ≤ m.
For s = t the inequalities in the middle are not present, and for t = m all p_{1j}, j = 1(1)m, are positive. It immediately follows that

B₁ = {1, 2, ..., t} = C*,

C = {t+1, t+2, ..., m} for t < m,  C = ∅ for t = m.
Owing to the fact that ln is strictly monotonically increasing, and that Λ_m does not depend on the components φ_μ, μ ∈ C, it immediately follows that every maximum point of Λ_m on S_m is an element of D_m^*. Using the monotonicity of ln, we can even conclude that the maximum points of Λ_m on S_m are given by
D_m^{**} := {φ ∈ D_m^* : φ_μ = 0, μ = s+1(1)m}.
For φ ∈ D_m^{**}

Λ_m(φ) = ln p_{11}
trivially holds. The system of equations (5.1) belonging to the matrix and the right-hand side defined in this example is somewhat extreme when considered in the framework of density measurements, which was the motivation for the study of this iterative scheme. For it can easily be verified that the system has no solution in S_m if p_{11} < 1, and all solutions in S_m are given by D_m^{**} if p_{11} = 1. However, we shall see that fundamental properties of the iterative scheme can be demonstrated with this example.
How does the algorithm perform when starting with φ⁰ > 0? We have

φ_j^{n+1} = φ_j^n p_{1j} / (Σ_{l=1}^m p_{1l} φ_l^n),  j = 1(1)m,  n ∈ ℕ₀.
It immediately follows that

(i) φ_j^n = φ_j* = 0, j ∈ C, n ∈ ℕ, and hence φ^n ∈ D_m^*, n ∈ ℕ, where φ* is the maximum point of Λ_m on S_m to which the method converges,

(ii) φ_j^n > 0, j ∈ C*, n ∈ ℕ₀.

For j ∈ C* the iterates become:
By induction we obtain, for j ∈ C*,

φ_j^n = φ_j^0 (p_{1j}/p_{11})^n / (Σ_{l=1}^s φ_l^0 + Σ_{l=s+1}^t φ_l^0 (p_{1l}/p_{11})^n),  n ∈ ℕ₀,

with

p_{11}/p_{1j} > 1,  j = s+1(1)t.
Summarizing we obtain the following convergence statements:

φ_j^n → φ_j* = φ_j^0 / Σ_{l=1}^s φ_l^0,  n → ∞,  j = 1(1)s,

0 < φ_j^n → φ_j* = 0,  n → ∞,  j = s+1(1)t,

φ_j^n = φ_j* = 0,  n ∈ ℕ,  j = t+1(1)m;

in particular, by starting with φ⁰ > 0, not all maximum points of Λ_m in D_m^{**} are obtained.
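The closed-form representation of the iterates obtained above by induction can be verified numerically; the row (p_{1j}) and the starting vector below are hypothetical, with s = 2 and t = m = 4.

```python
# Row (p_1j) with s = 2, t = 4 (hypothetical values); starting vector phi^0 > 0.
p1 = [0.4, 0.4, 0.15, 0.05]    # p_11 = p_12 > p_13 > p_14 > 0
phi0 = [0.1, 0.3, 0.4, 0.2]

def iterate(phi, n):
    for _ in range(n):
        D = sum(p1[l] * phi[l] for l in range(4))
        phi = [phi[j] * p1[j] / D for j in range(4)]
    return phi

def closed_form(n):
    # phi_j^n = phi_j^0 (p_1j/p_11)^n / sum_l phi_l^0 (p_1l/p_11)^n
    w = [phi0[j] * (p1[j] / p1[0]) ** n for j in range(4)]
    s = sum(w)
    return [wj / s for wj in w]

ten = iterate(list(phi0), 10)
cf10 = closed_form(10)
limit = iterate(list(phi0), 200)
```

The iterated and closed-form values coincide, and the limit is (0.1/0.4, 0.3/0.4, 0, 0) = (0.25, 0.75, 0, 0), i.e. the start renormalized on the leading components, as stated above.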
342 H. N. Mdthei and B. Schon
Remark. For this example it is obvious how to choose φ⁰ in order to obtain all elements in D_m^{**}: φ⁰ ≥ 0 and Σ_{j=1}^s φ_j^0 > 0. This is analogous to vector iteration: φ⁰ must 'take part' in the 'leading' components of the row vector (p_{1j}), j = 1(1)m.

In the following, we shall discuss the speed of convergence. From the representation of φ_j^n and φ_j*, respectively, j ∈ C*, n ∈ ℕ, it immediately follows that

φ_j^n − φ_j* = O((p_{1,s+1}/p_{11})^n),  n → ∞,  j = 1(1)s,

which implies linear convergence with the rate of convergence p_{1,s+1}/p_{11}. Further, we have
φ_j^n − φ_j* = O((p_{1j}/p_{11})^n),  n → ∞,  j = s+1(1)t;

this means linear convergence with the rates of convergence p_{1j}/p_{11}, j = s+1(1)t.
Conclusion. The maximum rate of convergence is p_{1,s+1}/p_{11}. The convergence of the φ_j^n, j = s+2(1)t, can, in fact, be better than the convergence of one of the φ_j^n, j = 1(1)s+1. In the case s = t it follows that φ¹ = φ*.
The above results also explain why, as in the case of vector iteration, the iterative scheme converges badly, in general, and in this example in particular, if p_{1,s+1}/p_{11} ≈ 1.
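The dependence of the speed on the ratio p_{1,s+1}/p_{11} is easy to observe numerically; the two 2x2 systems below (hypothetical values) have ratios 0.5 and 0.99, and we count the steps until the trailing component drops below 0.01.

```python
# Effect of the ratio p_{1,s+1}/p_11 on convergence speed, for two
# hypothetical 2x2 systems with g = (1, 0): ratio 0.5 versus 0.99.
def steps_until_small(p11, p12, tol=0.01):
    phi = [0.5, 0.5]
    n = 0
    while phi[1] >= tol:
        D = p11 * phi[0] + p12 * phi[1]
        phi = [phi[0] * p11 / D, phi[1] * p12 / D]
        n += 1
    return n

fast = steps_until_small(0.6, 0.3)    # ratio 0.5
slow = steps_until_small(0.5, 0.495)  # ratio 0.99
# with these values: fast = 7, slow = 458
```

Since φ_2^n = ρ^n/(1 + ρ^n) with ρ = p_{12}/p_{11}, the step counts grow like ln(tol)/ln(ρ), which explodes as ρ approaches 1.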
References

1. Cover, Th. M., 'An algorithm for maximizing expected log investment return', IEEE Trans. Information Theory, IT-30 (2), 369-373 (1984).
2. Csiszár, I. and Tusnády, G., 'Information geometry and alternating minimization procedures', Statistics and Decisions, Suppl. Issue No. 1, 205-237 (1984).
3. Dempster, A. P., Laird, N. M. and Rubin, D. B., 'Maximum likelihood from incomplete data via the EM algorithm', J. Roy. Stat. Soc., 39, 1-38 (1977).
4. Kondor, A., 'Method of convergent weights - an iterative procedure of solving Fredholm's integral equations of the first kind', Nuclear Instrum. Methods, 216, 177-181 (1983).
5. Kullback, S., Information Theory and Statistics, Dover Publications, New York, 1968.
6. Mülthei, H. N. and Schorr, B., 'On an iterative method for a class of integral equations of the first kind', Math. Meth. in the Appl. Sci., 9, 137-168 (1987).
7. Shepp, L. A. and Vardi, Y., 'Maximum likelihood reconstruction for emission tomography', IEEE Trans. Medical Imaging, MI-1, 113-122 (1982).
8. Vardi, Y., Shepp, L. A. and Kaufman, L., 'A statistical model for positron emission tomography', JASA, 80, 8-37 (1985).
9. Zangwill, W. I., Nonlinear Programming: A Unified Approach, Prentice-Hall, Englewood Cliffs, New Jersey, 1969.