A connectionist model for diagnostic problem solving

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. 19, NO. 2. MARCH/APRIL 1989 285

A Connectionist Model For Diagnostic Problem Solving YUN PENG AND JAMES A. REGGIA

Ahtract -A competition-based connectionist model for solving diagnostic problems is described. The class of problems under consideration are computationally difficult in that multiple disorders may occur simultaneously and a global optimum in the space exponential to the total number of possible disorders is sought as a solution. The diagnostic problem is thus viewed as a nonlinear optimization problem. To solve this problem, global optimization criteria are decomposed into local optimization criteria that are used to govern node activation updating in the connectionist model. Nodes representing disorders compete with each other to account for each “individual” present manifestation, yet complement each other to account for all present manifestations through parallel node interactions. When equilibrium is reached, the network settles into a locally optimal state in which some disorder nodes (winners) are fully activated and compose the diagnosis for the given case, while all other disorder nudes are fully deactivated. A “resettling process” is proposed to improve accuracy. Three randomly generated examples of diagnostic problems, each of which has 1024 cases, were tested. This “decomposition plus competition plus resettling” approach yielded very high accuracy for these experimental examples.

I. INTRODUCTION

EVERAL models have been proposed during the past S few years for combining probabilistic inference with artificial intelligence (AI) symbol processing methods for diagnostic problem solving when multiple disorders may occur simultaneously. These include the authors’ own probabilistic causal model of parsimonious covering theory as well as some others [4], [15], [17], [18]. In these models disorders and manifestations are connected by causal links associated with probabilities representing the strengths of the causal associations. A hypothesis, consisting of zero or more disorders, with the highest posterior probability under the given set of manifestations (findings) is taken as the problem solution. In other words, the diagnostic problem formulated in this fashion can be viewed as a global optimization problem. To solve diagnostic problems formulated in this fashion, conventional sequential search

Manuscript received November 27, 1987, revised October 27, 1988. This work supported in part by NASA Award NAG1-885, in part by NSF Award IRI-8451430 with matching funds from AT&T Information Systems, and in part by NIH award NS-16332.

Y . Peng is with the Institute for Software, Academia Sinica, P.O. Box 8718, Beijing. People’s Republic of China. J. A. Reggia is with the Department of Computer Science, University of Maryland, College Park, MD 20742 and with Dept. of Neurology, UMAB and University of Maryland Institute of Advanced Computer Studies.

IEEE Log Number 8825839.

approaches in AI suffer from potential combinatorial ex- plosion when the number of possible disorders is large. This is because they potentially must compare the posterior probabilities of all possible combinations of disorders. Pearl proposed a parallel version of problem solving that finds global optimal solutions for the special case of singly connected causal networks within polynomial time of the network diameter [15]. However, for a nonsingly- connected causal network, that is the case for most diagnostic problems, Pearl’s approach requires parallel computation for each instantiation of the “cutpoint set”, and thus still leads to combinatorial difficulty when ths cutpoint set is large (see Section V).

Connectionist modeling based on a “neural style” of computing offers an alternative approach for solving such problems with reduced time complexity [26]. Connectionist (neural network) modeling techniques are often considered to be useful for “low level” information processing tasks, but of limited value for supporting the “high level” problem solving methods used in contemporary AI systems. Recently, advances have been made both in applying connectionist methods to a number of AI tasks [8], [29]-[31] as well as to solving global optimization problems [lo], [12]. In this paper a specific connectionist model using a competition-based activation rule is applied to find the most probable hypothesis as formulated in our probabilistic causal theory [17], [18]. A “local” or “unit-value” representation is used: nodes, representing concepts, are connected by links whose weights represent association strengths between nodes. These links are also viewed as channels for sending information between nodes. At each moment of time, each node receives information about the activation levels of its immediate neighboring nodes (nodes connected to it via direct links), and then uses this information to calculate its own activation level. Through this process of spreading activation, the network settles down to equilibrium representing a solution to a problem.

To focus information processing and prevent spreading activation from saturating a network with activation, connectionist models often employ inhibitory links (links with negative weights) between incompatible or competing nodes [5], [7], [13], [27]. In contrast, the work described in this paper uses a competitive activation mechanism to control spreading activation [23], [25]. A distinct character- istic of the competition-based connectionist model is that

0018-9472/89/0300-0285$01.00 01989 IEEE

~

286 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. 19, NO. 2, MARCH/APRIL 1989

there is no need for inhibitory links between conceptually incompatible or competing nodes. The mutual inhibition between these nodes is realized not through explicit inhibitory links but through competition: neighboring nodes of a source node actively compete for the output of that source node, and the ability of a neighboring node to compete for activation increases as its own activation level increases [23 ] , [25].

In many connectionist model applications, when equilibrium is reached, a single “winner” node that is fully activated is desired, while “losing” nodes should be fully inactivated (a single-winner-takes-all or choice phenomenon [2], [ 5 ] ) . A competitive activation mechanism has at least two advantages in the context of an associative network with a local representation where single-winner- takes-all behavior is desired. First, most connections/links in semantic/associative networks correspond to empirical associations with at least theoretically measurable frequen- cies of occurrence. In contrast to this well-defined correspondence between connectionist model components (excitatory connections) and the entities being modeled (measurable associations), there is no generally recognized analogous real-world correspondence to inhibitory connections. In this sense, the models without inhibitory links are closer to current psychological perspectives and AI models than those with inhibitory links.’ Second, by eliminating inhibitory links, not only the enormous storage for these links is saved,2 but also the fanout of each node may be drastically reduced, thus leading to less computation at each node. The experience of our group in implementing connectionist/neural network models on a parallel machine suggests that reduced fanout is an important factor in producing speed-up on parallel architectures, especially with large networks [28].

In situations such as diagnostic problem solving, where a hypothesis or solution usually consists of more than one disorder, implementing competitive dynamics via a competitive activation mechanism would appear to be even more desirable. This is because, in general, a multiple- winner-take-all phenomenon rather than a single-winner- takes-all phenomenon is desired. In other words, multiple disorder nodes should be fully activated simultaneously. If a set of competing nodes with direct, mutually inhibitory links tries to sustain multiple winners simultaneously, these winners tend to extinguish each others’ activation. Further, in some situations, two nodes may be considered to be in competition, while in other situations the same two nodes may not be competitors and may actually “cooperate” to formulate a solution to a problem. This is illustrated by the simple diagnostic problem shown in Fig. l(a). When manifestation m , alone is present, disorders d , and d , would “compete” with each other to explain or account for the

‘While inhibitory connections might be considered to represent “negative correlations,” the authors are unaware of any psychological or AI theories that incorporate such a concept a5 a significant aspect of the theory.

’Another way to decrease but not eliminate inhibitory connections is through the use of regulatory units [30].

d l d2 d3 0.08 0.15 0.1

m l m 2 (b) m3 m4 Fig. 1. (a) Simple diagnostic problem consisting of two disorders

( d , , d 2 ) and three manifestations { ml, m 2 , m 3 } . Arcs represent causal relations. (b) Network for a simple diagnostic problem. D =

{ d,,!,, d,) is set of all disorders. M = (m , , m 2 , m3. m 4 ) is set of all mamfestations, and causal relation C is represented by links between nodes. Values of prior probabilities for disorders and causal strengths for causal links are indicated.

presence of m,. However, if instead manifestations m, and m3 are present, both d , and d , should “cooperate” with each other and both would be necessary in the solution hypothesis to account for the present manifestations. In other words, the interrelationship between disorders during diagnostic inference is not simply a static mutually excitatory or mutually inhibitory relationship, but a more complex dynamic function of the network and the problem input (the latter being the set of manifestations given to be present). Thus it is at least very difficult, if not impossible, to model these relationships through simple inhibitory links with static weights.

The conjecture underlying this research is that the dynamically changing functional relationships between disorders and appropriate multiple-winners-take-all behavior may be realized through the use of a competitive activation mechanism. This seems plausible because, as noted earlier, one can dispense with the inhibitory links usually utilized to support winner selection. The goal of t h s research was thus to construct a specific competition-based connectionist model to carry out diagnostic problem solving with high accuracy, i.e., to yield the most probable hypothesis in most cases. The emphasis of this work was thus on the computational and problem solving aspects of connectionist modeling rather than its cognitive or neurobiological aspects. If our conjecture is correct, this approach can both serve as an computationally efficient problem solver for diagnostic problems and to shed further insights on competition-based models so that their applications, especially

PENG AND REGGIA: CONNECTIONIST MODEL FOR DIAGNOSTIC PROBLEM SOLVING

~

287

in areas involving “high level” problem solving in AI, can be further extended.

The rest of the paper is organized as follows. Section I1 gives a general description of this work, including the diagnostic problem formulation used. Section I11 is de- voted to the derivation of specific competitive activation updating functions. Section IV describes the experimental simulations carried out to verify our conjecture. Finally, Section V concludes with a comparison to recent related work and gives suggestions for future research.

11. GENERAL DESCRIPTION AND PROBLEM FORMULATION

A. Probabilistic Causal Model

The diagnostic inference method whose computation is to be carried out by the connectionist model is based on a formalization of the causal and probabilistic associative knowledge underlying diagnostic problem solving. The particular formulation we use is called parsimonious covering theory [16]-[19], [21]-[22]. In parsimonious covering theory there is a set of disorders D, a set of manifestations M , and a relation C 5 D X M representing the causal associations between disorders and manifestations (see Fig. l(b) for example). A pair (d,, m,) is in C iff “disorder d, may cause manifestation m,” (corresponds to a link in Fig. l(b)). A subset of M , denoted as M+, represents the set of all manifestations given to be present, while M - = M - M + represents the set of all manifestations assumed to be absent. Based on the relation C, two sets, “effects” and “causes”, are defined for each d, E D and each m, E

M , respectively: effects(d,) = { rn,J(d,, m,) E C } and causes(rn,) = { d,J(d,, m , ) E C}. When a set of disorders D, c D is said to be a hypothesis for a problem, that means that all d, E D, are hypothesized to be occurring and all d, 4 D, are hypothesized to be not occurring. A hypothesis D, is called a “cover” of the given M + iff M + C Ud, E ,,effects (d,). In other words, a hypothesis D, represents a potential explanation for a set of manifestations M + in that it is a set of disorders, which when present, could cause or account for M’.

Each d, E D is associated with a number p , E (O,l), its prior probability; each causal link (dl, m , ) E C is associated with a number c,, E (0,1], the causal strength from d, to rn, representing how frequently d, causes m,, i.e., c l J = P(d, causes rn,Jd,). For any (d,,m,) PC, c,, is assumed to be zero. Three assumptions are made:

1) the disorders d, are independent of each other; 2) causal strengths are invariant, i.e., whenever d, oc-

curs, it always causes rn, with the same probability cl,; and

3) no manifestation can be present without being caused by some disorder.

These assumptions do not apply in some domains. How- ever, they are a useful approximation that is consistent with parsimonious covering theory and are less restrictive

than those traditionally made with Bayesian classification (see [17] for a discussion).

Based on these three assumptions, a “ relative likelihood measure” for a hypothesis D, D given M + was derived in [17] as

where informally the first product here, over all present manifestations ( M + ) , can be thought of as a weight re- flecting how likely D, is to cause the presence of manifestations given in M + ; the second product can be thought of as a weight based on absent manifestations ( M - ) (more precisely, since c,, = 0 if (d,, m , ) E C, this product can be reduced to be over M-neffects(D,), and thus can be viewed as a weight based on manifestations expected with D, but that are actually absent); and the third product represents a weight based on prior probabilities of disorders in D,. It can be proven that the relative likelihood L(D,, M + ) differs from the posterior probability P(D,IM+) only by a constant [17]. Also note that if D, is not a cover of M + , then L(D,, M + ) = 0.

In summary, a diagnostic problem can be defined as a probabilistic causal network consisting of D, M , C, and p , and c,, together with the problem features or input M + . The solution to a problem is a most probable hypothesis, designated D’, one with the highest relative likelihood ( L value) among all possible hypotheses. Since the presence/absence of disorders is independent of one another, any combination of disorders is possible, so the potential space for searching for a solution is 2O, generally an extremely large search space. Thus, diagnostic problem solving in this formulation can be viewed as a combinato- rially difficult nonlinear optimization problem: among all hypotheses D, E 2,, find the one that maximizes (1).

B. The Connectionist Model Architecture

A two-layer probabilistic causal network as defined in the previous section (e.g., Fig. l(b)) can be used directly as a connectionist model network: D and M are two sets of nodes, and C is the set of links connecting nodes between these two layers. There are no inhibitory or other intra-set links between any pair of disorder nodes or between any pair of manifestation nodes.

In the formulation of a diagnostic problem as a connectionist model that follows, each node is associated with two values: a constant input value and a dynamically changing value for its activation level. For a disorder node d, E D, its constant input value is simply its prior probability p , ; its activation level at time t during the computation, denoted as d,(t) E [O,l], serves as a measure of confirma- tion. Initially, d,(O) = p , . When d,(t) -1.0 at time t, the disorder d, is confirmed to be occurring and thus becomes


an element of the solution; when d , ( t ) = 0.0 then d , is disconfirmed or rejected. For a manifestation node m, E M , its constant input value s, equals 1.0 if m, E M + , and 0.0 if m, E M-. This value represents the external input to the connectionist network. In other words, s, = 1.0 indicates that manifestation m, is present, while s, = 0.0 indicates that m, is absent. The activation level of m, at time t , denoted as m,(t) E [0,1], is the measure of its activation caused by activation of all of its causative disorders.

Each link ( d , , m,) E C is associated with a constant weight equal to the corresponding causal strength c,, that during the computation, can be accessed solely by the two nodes d , and m,. Note that all weights c,, are positive or excitatory: there are no inhibitory links. Link ( d , , m,) serves as a bidirectional communication channel for passing information between d , and m,. After receiving information from all of its immediate neighboring nodes, a node updates its activation level a small amount (the specific activation rule used is given in the next section), and then sends its current activation level to all of its immediate neighbors. For each m, E M+, its neighboring nodes (nodes in causes(m,)) are competing with each other for m,’s activity. This relaxation process continues until equilibrium is reached at time t,. If the model converges on a set of “winners,” i.e., for each d,, d l ( t e ) approximates either one or zero, then the set of disorders D, = { d,ld,(t,) -1} is taken to be the connectionist model’s problem solution, i.e., the connectionist model derives D, as its candidate for the most probable cover D+ of the given features M+. The crucial point here is that a set of disorders D, is to be derived through the concurrent local interactions occurring in the connectionist model, where D, represents a hypothesis about the identity of the globally optimal set of disorders D + as determined by the likelihood measure (1). Since each node’s processing is driven solely by local information, it is not guaranteed a priori that the resultant locally optimal solution will correspond to a globally optimal solution.

Now the question remains as to what the local activation rules should be for disorder and manifestation nodes such that the model will have the desired behavior, i.e., it will reach equilibrium, converge on a set of multiple “winners” D,, and D, will be consistent with D+, the most probable hypothesis. This question is answered in the next section.

111. EQUATIONS FOR UPDATING NODE ACTIVATIONS

In this section, an approach to deriving local activation rules for use in connectionist models is developed. Viewing a diagnostic problem as a nonlinear optimization problem, this approach decomposes the globally optimal criteria (a continuous form of (1)) into local optimal criteria for each disorder node, and then uses these local criteria to derive activation rules for individual nodes. Thus, the global optimization problem of finding a set of disorders ( D + ) to maximize (1) is decomposed into a set of concurrent local problems for all individual disorders.

A . Optimization Problem

Let ID1 = n, i.e., let there be n possible disorders. Then any hypothesis D, c D can be considered as one of the 2” corners of the n-dimensional hypercube [0,1] “, and t l p can be represented as an n-dimensional vector X = (xl, x2,. . . , x,) where x, = 1 if d , E D, and x, = 0 otherwise. Thus, a diagnostic problem can be viewed as a discrete optimization problem: finding one of the corners of hypercube [0,1]” that maximizes (1). Instead of jumping from corner to corner in searcking for a solution, as does sequential search, if we allow X to go through the interior of the hypercube and still guarantee that the computation settles down into one of the corners, we can reformulate the problem as a continuous optimization problem. This can be done by generalizing the optimization function (1) from a discrete dymain {0,1}“ to the continuous domain [0,1]“. Define Q( X ) to be

r n 1

where x’ is a vector in [OL1IN. Noticing that c,, = 0 if ( d , , m,) 4 C, then for any X in {0,1}“ (i.e., for any corner of the hypercube), (2) becomes (1). The numerator of the last product in (2), l - x , ( l - p,), differs from the value x,p, one might anticipate here from casual inspection of (1). This is because the last product is now over i = 1,2; . * , n rather than d, E D, as in (l), and the numerator should be one rather than zero when x, = 0.

Based on the previous generalization of (l), the problem of finding D+ with the greatest value of (1) for the given M + is thus transformed into a continuous nonlinear 2pti- mization problem where the nonlinear equation Q ( X ) is the objective function against which the maximum value is desired; and x, E {O,l}, i = 1,2; . e , n ,ae constraints [3]. In other words, among all vectors X in [0,1]“ whose elements are either one or zero (i.e., those that satisfy the cons_traints) we want to find the one that will maximize Q( X ) , the objective function.

The way that our connectionist model approaches this optimization problem is to continuously update d , ( t ) and m, ( t ) based on~‘‘10cal” information using equations derived from Q ( X ) . At time t , the activation levels d , ( t ) of all disorder nodes d, compose a vector +&t) =

( d , ( t ) , d 2 ( t ) , - . -, d , ( t ) ) where d , ( t ) E [0,1]. Thus D ( t ) is a Yctor in [O,l]” . Applying the objective function (2) to D ( t ) , we have

n

(3)

PENG AND REGGIA: CONNECTIONIST MODEL FOR DIAGNOSTIC PROBLEM SOLVING 289

Initially, 5 ( 0 ) = (d,(O), d2(0), . . , d,(O)) is the vector of prior probabilities p , of d , , i = 1,2; . ., n . Ideally, when equilibrium is achieved the model should converge on a set of multiple winners at time t,. Then D(t,) =

(d,( t,), d2( t,), - . . , d,( r e ) ) , where d,( t,) approaches either zero or one, represents a hypothesis D, c D where_d, E D, iff d , ( t , ) = 1. At this time, as noted earlier, Q(F( t , ) ) = L(D,, M + ) . Therefore, D, will maximize (1) iff D ( t , ) will maximize (3). How d , ( t ) should be updated based on local information is derived shortly, but we first consider the updating of m ,( t ).

B. Updating m,(t)

The value m,(t) is the current activation of node m, caused by or induced from the current activations of all of its causative disorders. Motivated by (1) and (3), we define a local activation rule for each m, E M to be

n

= I - n ( l - c I J ’ d I ( r ) ) (4) d, € causes ( m , )

where the second equality is derived from c,, = 0 if ( d , , m , ) 4 C. This activation rule is a local computation since it only depends on current activation levels of m,’s causative disorders (those in causes (m,)) that are directly connected to m, in the causal network. Initially, m,(O) =

1 -n:=,(l - c,, - p , ) = P(m,). At equilibrium time t,, one would anticipate m,(t,) =1-17:El(l- c , , .d , ( t , ) ) =

1 -17d,GDs(l- c,,) = P(m,lDs) if a multiple-winners-take- all result occurs. These initial and final activation values are consistent with the underlying probabilistic causal model [17].

C. Updating d,(t)

A local activation rule for updating d , ( t ) is now derived in :WO steps. First, the global optimization function Q( D( t ) ) is decomposed into partially localized optimization functions q , ( d , ( t ) ) for each disorder node d,; then, q, is used to derive a local activation rule for d , ( t ) that only depends on information from its immediate neighbors (i.e., information only from m, E effects(d,)). Let M: =

M + n effects( d , ) and MI- = M - n effects(d,). These are the present manifestations that disorder d, could cause and the manifestations expected by d, but actually absent, respectively. Then from the global optimization function (3), we define

r n 1

n

where d , ( t ) is the only argument for function q,, while all d k ( t ) , k # I , are considered to be its parameters. Note that the first two products in ( 5 ) are taken over M,+ and MI- , whch are local to d, , not over M + and M - as in (3). In this sense, ( 5 ) is a partially localized version of (3) (“partially” because the parameters d k ( t ) for k # i are still present).

Viewing q , ( d , ( t ) ) as an objective function and d , ( t ) as being constrained to {OLl}, we decompose the global optimization problem of D ( t ) into local optimization problems of its elements d , ( t ) : derive whichever of d , ( t ) = 1 or d , ( t ) = 0 will maximize q,, i.e., whichever of q,(1) or q,(O) is greater, if all other d k ( t ) are fixed. If q , ( l ) > q,(O), d , ( t ) should increase in order to obtain local optimization. Like- wise, if q,(1) < q,(O), d , ( t ) should decrease. Thus we define the ratio r , ( t ) = q,(l)/q,(O) that will be used to govern the updating of d , ( t ) in the connectionist model. It can be proven (see Appendix) that

Note that r,( t ) is completely determined by information local to node d,; all d k ( t ) , ckJ , and ckl for k # i are no longer present. The quantity r , ( t ) is used to indicate whether d,( t ) should increase (r , ( t ) > 1) or decrease (r, ( t ) < 1) and by how much. In t h s sense r, ( t ) represents the “input activation” to the ith disorder from its manifestations. Moreover, the three terms in the product of ( 6 ) represent the influence on disorder d , ( t ) from d,’s present manifestations, expected but absent manifestations, and prior probability, respectively. This is also consistent with the underlying probabilistic causal model (see (1) and the discussion after it).

Note that the last two terms in (6) are constants for an individual disorder node for a given M’, so (6) can be rewritten as

where K, = I I m , € M; (1 - c l / ) . ( p , / l - p , ) is a constant that can be computed and stored at each disorder node before the iterative process starts during simulations. Therefore, at any time t , computation for the current r , ( t ) only requires the current m,(t) of all m, EM,+. Whether a manifestation m, is in M-, and thus is involved in computing K , once (before the iterative process starts) or it is in M’, and thus is involved in iteratively computing r , ( t ) thereafter, can be determined by m,’s input value s,. Thus, s, can be considered as a gate to communication channels of m,. At the beginning of a simulation, all m, send their sJ values to their respective disorders. For each disorder node d,, those s, = 0 it receives indicate manifestations in

290 IEEE TRAYSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. 19, NO. 2, MARCH/APRIL 1989

M - and are first used to compute K i once, and then to the local activation rule for d i ( t ) “close” the corresponding links during the subsequent iterations.

Another point is that division by m,(t)- c , ; d , ( t ) = 0 in (6a) might arise for a manifestation m, E M,f . By (4) this could occur only in two situations: d, is the only disorder capable of causing m, having a nonzero activation, or all disorders capable of causing m, have zero activations. The first situation implies that all disorders in causes(m,) except d , are rejected, leaving d , as the only possible winner to account for the presence of m,; thus d,’s activation should increase toward 1.0 to be eventually accepted. Thus, using a very large r , ( t ) value as an approximation in this situation is appropriate. The second situa-

a;( t ) =f(r;( t ) -1) .(1- d;( t ) )

where d , ( t ) is the derivative or rate of change of d , ( t ) . The dynamics of this system are approximated in simulations that follow by

d, ( t + A ) = d, ( t ) + f( r, ( t ) - 1) . (I - d , ( t ) ) . A . (8b)

If d, ( t + A ) is less than 0.0 from (fib), then d,( t + A ) is set to 0.0. Thus, as desired, d , ( t ) is guaranteed to be in [0,1] at any time t .

To see how competition plays a role in this computation, note that by (4) we can rewrite (6a) as

tion is impossible if activations of disorders change smoothly. This is because when activations of all disorders in causes( m,) are close to zero, m, becomes very close to zero and the term (1- m,(r) ) / (m,(r ) - c , ; d , ( t ) ) in (6a) therefore becomes very large. This would make r,( t ) greater than one for some of these disorders d, and increase their activations. This in turn would increase mJ( t ) . Therefore, to prevent this “dividing by zero” exception in (6a) testing for m,(t) = c , , - d , ( t ) was done before the division in the simulations that follows. If equality holds, then a very large constant (say, 99 999.0) is used to approximate “in- finity” as the quotient. This will make r , ( t ) greater than 1.0 and cause the increase of d , ( t ) as desired.

As discussed earlier, r, ( t ) = q, ( l ) / q , (0) indicates the desired direction of change of d , ( t ) in order to achieve local optimization, i.e., d , ( t ) should increase when r , ( t ) > 1 and decrease wheli r , ( t ) <1. Therefore, r , ( t ) - l , when bounded by -1 and 1, is taken to be the change rate of d , ( t ) . The reason to bound the change rate is to ensure that for each iteration, d , ( t ) only changes a small amount. Put more precisely, define the ramp function f (x ) to be

1 i f x > 1

x otherwise. -1 i f x < - 1 (7)

This equation shows that for any disorder d , , r , ( t ) will tend to decrease if activation d k ( t ) of some d , E causes(M:) iccreases, and r , ( t ) will tend to increase if d k ( t ) decreases. In this sense, d , competes with all other d , E causes( M,+ ), and whether d, ( t ) should increase or decrease depends on this competition as well as the constant K,. Some simple example simulations are given in the next section to illustrate the competitive dynamics involving disorders more clearly (see Table I).

JV. EXPERIMENTAL ANALYSIS

After deriving activation rules for manifestation and disorder nodes ((4) and @a)), a number of experiments were conducted in which the parallel computation of the connectionist model was simulated by a computer program on a sequential machine. The purpose of these experiments was to test whether the model would converge on a set of multiple winners at equilibrium time t , (defined as d,(re) approximating to one or zero for all d , ) , and whether the resultant hypothesis D, would compare favorably with the most probable hypothesis D +. This section describes these experiments and analyzes the experimental results.

A . Basic Experimendal Methods and Results

The parallel computation of the connectionist model was simulated by a sequence of iterations, each of which consists of two steps. At step 1, all disorder nodes d , send This leads to the following differential equation expressing

PENG AND REGGIA: CONNECTIONIST MODEL FOR DIAGNOSTIC PROBLEM SOLVING

TABLE I

291

Number of Iterations

0 1 2 5

10 20 30 40 50 60 65

M + = { m , )

4 4 0.08 0.15 0.076 0.148 0.069 0.145 0.027 0.139 0.0 0.293 0.0 0.685 0.0 0.92 0.0 0.989 0.0 1.0

d3 0.1 0.096 0.087 0.034 0.0 0.0 0.0 0.0 0.0

0.08 0.15 0.085 0.154 0.094 0.163 0.147 0.212 0.305 0.358 0.69 0.714 0.921 0.927 0.989 0.99 1.0 1.0

0.1 0.096 0.089 0.04 0.0 0.0 0.0 0.0 0.0

0.08 0.15 0.1 0.085 0.154 0.105 0.094 0.163 0.113 0.147 0.212 0.166 0.305 0.356 0.32 0.69 0.698 0.697 0.92 0.746 0.923 0.989 0.659 0.989 1 .o 0.479 1.0 1 .o 0.192 1.0 1 .o 0.0 1 .o

their activation d , ( r ) to their immediate neighbors (their respective manifestation nodes). Each manifestation node m, E M + then updates its current activation m,(t) by (4) using d , ( t ) from just its causative disorder nodes. At step 2, all manifestation nodes m, E M + send their m,(t) to their immediate neighbors (their respective disorder nodes). Each disorder node d , then updates its current activation by (8b). This iterative process stops when all disorder nodes reach equilibrium states at time t,. (A disorder reaches equilibrium at time t iff Id,( t ) - d , ( t - A)l < 0.0001 in the simulation, i.e., the change of activation in one iteration is less than 0.0001 for every disorder node d,.) Then all disorder nodes d , with d , ( t , ) > 0.95 are accepted and compose the solution set D,, while all disorders d , with d k ( t e ) < 0.05 are rejected. If there is any disorder with its activation level stabilized between 0.05 and 0.95, it is reported as a case of not converging on a set of multiple winners.

One factor of importance to the experiments is the size of A , the incremental interval for updating node activation. Too small a value of A , say 0.01, resulted in a very slow rate of convergence, typically taking 800 to 1,000 iterations for a case to settle down to a solution. Too large a value, say 0.1, could lead to incorrect solutions in some cases. Based on empirical observations, we did the following. Initially, A was set to 0.005. After each iteration, A was incremented by 0.005 if it was less than 0.2. Then A stayed at 0.2 until equilibrium was reached. The sma!l values of A early in the simulation make the approximation equation (8b) closer to the differential equation (Sa), thus avoiding some error due to discretization of t in the simulation. After about 40 iterations, d , ( t ) are close to their final values in most situations, so the larger A (0.2) then helps fast convergence at the end of a simulation.

To provide the reader with a concrete example of what these simulations were like, Table I shows representative example of simulations for different M + sets using the simple causal network given in Fig. l(b). For the three different M + sets shown at the top of Table 1, three different D, sets, { d , } , { d,, d , } , and { d,, d , } , were generated as Ds’s by the connectionist model. Note that in each of these examples the connectionist simulation converges with a single- or multiple-winners-take-all solution,

and each D, turns out to be the same as the respective most probable hypothesis D+. When M + = { m , } , the single present manifestation m , makes d , and d , com petitors since only one of them is sufficient to account for the presence of m 2 . Eventually, one of them, d, , wins, and activations at all other disorder nodes die out. When M + = { m,, m,, m , } , none of the disorders alone is capable of covering or accounting for the given M’, but { d,, d , } or { d,, d , } could, so in this case d , and d , can be viewed as competitors. Which of these two disorders is better was determined competitively. In this particular case d , won and d , lost. The competitive dynamics is even more interesting when M i = { m,, m,, m,, m 4 } . Comparing activations of individual disorders early in the simulation, d , seems to be in the best position (it has a larger prior probability, and it gets more support from M + initially through two stronger links). However, since d , is the only disorder which can account for m,, and d , the only disorder which can account for m4, their activations gradually increase and thus they become more competitive, eventually taking all support from m , and m 3 away from d , , so d 2 ( t ) drops to zero. This simple example shows how the competitive model using the activation rule of (4) and (8) can work properly in different situations.

A thorough set of experiments was done for three significantly larger examples of diagnostic problems which we will call example 1, example 2, and example 3. Each of these examples has a causal network of 10 disorder nodes and 10 manifestation nodes. Example 1 and example 2 were randomly generated, i.e., for each d , , its prior probability p , , how many manifestations are causally associated with it, which ones they are, and what their respective causal strengths are, were all generated randomly and independently. The only restriction was that all pi's in example 1 be less than 0.10, while in example 2, the pi's were restricted to be less than 0.20. Finally, example 3 is the same as example 2 except that all prior probabilities in example 3 are assigned to be twice those in example 2. The reason for restricting p I ’ s to small numbers is that in most real-world diagnostic problems prior probabilities of disorders are very small [17]. For example, in medicine p , < 10- even for very common disorders in the general population, such as a cold or the flu, and p , is much much smaller

292 IEEE TRAYSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. 19, NO. 2, MARCH/APRIL 1989

TABLE I1

Example 1 Example 2 Example 3

p 1 = 0.026 p , = 0.054 ps = 0.003 p7 = 0.048 p y = 0.098

cI2 = 0.31 cI7 = 0.30 cj2 = 0.29

= 0.72 c4, = 0.88 cS1 = 0.72 cs4 = 0.07 cS9 = 0.73 e,, = 0.32

c75 = 0.73 cu2 = 0.04 cR6 = 0.23

cx,lo = 0.51 cg7 = 0.95

c10,4 = 0.43

cjs = 0.15

c71 = 0.05

c,o.6 = 0.11

p2 = 0.014 p4 = 0.06 p6 = 0.023 p R = 0.079

plo = 0.027

CI4 = 0.85 c27 = 0.50

Cj6 = 0.11 c34 = 0.64

= 0.62 c4, = 0.27 c52 = 0.92 csss = 0.47

cs,lo = 0.96 c67 = 0.26 c72 = 0.80 c~~ = 0.58 cR4 = 0.26 cR7 = 0.69

cy,lo = 0.67 cy2 = 0.12

c,o.s = 0.18

p1 = 0.17 p , = 0.03 ps = 0.135 p , = 0.075 p,, = 0.14

cI2 = 0.06

c21 = 0.53 c24 = 0.09 c2x = 0.13

c2,10 = 0.85 = 0.45

cj7 = 0.59 c42 = 0.74 c47 = 0.65 cs3 = 0.72 c6, = 0.09

c6,10 = 0.44 c74 = 0.46 c76 = 0.76 csl = 0.29 egg = 0.25

cy6 = 0.48 c10,2 = 0.74

CI6 = 0.10

c94 = 0.20

p 2 = 0.07

p6 = 0.18 p x = 0.03

p4 = 0.12

plo = 0.05

CI7 = 0.51 c2, = 0.81 c25 = 0.85

cI4 = 0.68

c2q = 0.34 cj2 = 0.54

= 0.90 c3,,,, = 0.29

c4s = 0.52 c4q = 0.32 cSU = 0.49 c65 = 0.66 c7, = 0.22 c75 = 0.21

c7.10 = 0.43 cg2 = 0.34 cyl = 0.39 cys = 0.90 cy7 = 0.38

elO,R = 0.27

p 1 = 0.34 p3 = 0.06 p s = 0.27 p7 = 0.30 p q = 0.28

cI2 = 0.06

c21 = 0.53 c24 = 0.09

= 0.13 c2,10 = 0.85

= 0.45 cj7 = 0.59 c42 = 0.74 c47 = 0.65 c5, = 0.72 e,, = 0.09

c74 = 0.46 c76 = 0.76 cgl = 0.29 cRx = 0.25

cy6 = 0.48 c10,2 = 0.74

CI6 = 0.10

('6.10 = 0.44

cy4 = 0.20

p 2 = 0.14 p4 = 0.24 p6 = 0.36 p8 = 0.06

p,o = 0.10

CI7 = 0.51

e25 = 0.85

cI4 = 0.68

c2, = 0.81

c2q = 0.34 cj2 = 0.54

= 0.90 c3.10 = 0.29

c45 = 0.52 cq9 = 0.32 csx = 0.49 c6s = 0.66 e73 = 0.22 e74 = 0.21

c7,10 = 0.43 cR2 = 0.34 cyl = 0.39 cys = 0.90 cq7 = 0.38

= 0.27

TABLE I11

Example 1 Example 2 Example 3 Total

Total number of cases 1024 1024 1024 3072

Number of correct 809 692 748 2249 cases after the first (79.0 percent) (67.6 percent) (73.0 percent) (73.2 percent ) parallel computation

cases after a partial (95.7 percent) (92.9 percent) (94.4 percent) (94.3 percent) resettling process

cases after a full (99.8 percent) (99.4 percent) (99.4 percent) (99.5 percent) resettling process

Number of correct 980 951 967 2898


(e.g., for rare disorders. The details of these three example networks are given in Table 11.

For each of these example networks, all possible 1024 cases, each corresponding to a distinct M + _C M (i.e., 0, { m,}, { m 2 } , { m,, m 2 } . . . ), were tested. For each Mf, a sequential search algorithm, developed and presented else- where [18], was used to find target hypothesis D+, the globally most probable hypothesis for the given M+. D + was then compared with D,, the hypothesis obtained by the program simulating the connectionist model. The cases in which D, = Df were recorded.

In these experiments all 3072 cases converged (no oscil- lations were found), and each case resulted in a set of winning disorders being fully activated and all other disorders being fully deactivated (no disorder had activation between 0.05 and 0.95 at equilibrium). In other words, multiple-winners-take-all behavior was realized in all of these experimental cases. Most of these cases (more than 95 percent) converged in 80 to 100 iterations. A few took several hundreds of iterations or more to converge. The

simulation results for these experiments are given in the first row of Table 111. This table shows that the solution D, from the connectionist model exactly agreed with the most probable hypothesis Df (i.e., D, = 0') in about 70-80 cases for the three example networks (73.2 percent correct for all 3072 cases). This demonstrates that global optimization can be achieved in the majority of cases via local optimization in the competitive activation model. For those mismatched cases, the solution D, resulting from the competitive connectionist model do not exactly agree with their respective globally optimal D+. Further study re- vealed that most of these mismatched solutions D, are the second- or third-best hypotheses in terms of the posterior probabilities. For example, among the 215 mismatched cases in example 1, 134 and 53 of them are the second and the third best hypotheses, respectively. Thus, only 3 percent of all 1024 cases resulted in hypotheses worse than the third- best one, so 97 percent achieved globally or near- globally optimal solutions. Similar results were also obtained for example 2 and example 3 (percentages of cases


TABLE IV


of cases 1024 1024 1024 3012


Total number

cases after the first (77.0 percent) (67.7 percent) (71.0 percent) (71.9 percent) parellel computation


cases after a full (99.4 percent) (99.2 percent) (99.0 percent) (99.2 percent) resettling process



where D, is worse than the third best hypothesis are 8 percent and 8.6 percent, respectively).

B. Resettling Process

For cases where D, # D+, the inconsistency is due to misbehavior of some disorder nodes whose local optimization does not lead to a globally optimal solution. That is, either some node d , is fully activated based on its local information and competition but does not belong to D + (it should have been rejected), or d , is fully deactivated but it does belong to Dt (it should have been accepted).

The hypothesis D, generated by the parallel computation gives us a mechanism for seeking a remedy to such inconsistencies. Even though D, does not indicate which disorder nodes might have gone wrong when D, was computed, we know that if any node d j E D, went wrong, its d i ( t , ) should be zero, not one; and if d j P D, went wrong, its d , ( t , ) should be one, not zero. This information, which is not available before D, is generated, can be used as the basis for a “resettling process,” repeat simulations involving partially instantiating the causal network. The resettling process is performed after D, is generated when the first parallel computation reaches equilibrium and goes as follows ( n is the total number of disorders):

After the network stabilizes with D,, n more parallel computations are performed, each of which is done on the given original network with one disorder node dj’s activation being instantiated: d,(O) is set and clamped to 0 if it is in D,, or to 1 if it is not in D,. The n hypotheses Oil), Di2), . . . , D i n ) resulting from these n partial-instantiated networks are then compared with D,, and the one among the n + 1 hypotheses maximizing (1) or ( 3 ) is taken as the solution for the given case.

This full resettling process can be viewed as a way of escaping from the local optimum reached in the first run of the computation. Now, n more local optima are generated and they are guaranteed to be different from D, because of the way the partial-instantiation is implemented (although there may be duplicates among these n new hypotheses). Therefore, this process is expected to greatly improve accuracy assuming D, from the first computation at least approximates D’. Experimental results (see the third row of Table 111) show that after the full resettling

process, the total number of cases where the connectionist model resulted in the most probable hypothesis was markedly increased (99.5 percent of all 3072 cases were correct). Furthermore, in all of the cases where D, # D + after resettling, the solutions obtained from the connectionist model were the second-most probable hypotheses.

Unfortunately, the price we would pay for such a high accuracy via a full resettling process is that instead of one parallel computation, now n + 1 computations would be required, where n is the total number of disorders. A trade-off between accuracy and cost can be achieved by only resettling with d,(O) to zero for all d , E D, so that only lDsl more computations need be performed. Usually lDsl is drastically smaller than ID1 = n , especially in large real-world problems, so this would cost much less than resettling with every d , in D. In the three examples, as shown in the second row of Table 111, this partial resettling process resulted in D, = Dt in over 94 percent of cases.

C. Convergence with Multiple- Winners-Take-All

We now discuss the issue of convergence, i.e., whether this model will always reach equilibrium, and when equilibrium is reached, whether it guarantees that every disorder node will approximate zero/one activation. Although we do not have analytical results and existing analysis of competition-based models are not directly applicable [33] , all 3072 cases of the three examples converged on multiple winners. To further investigate this problem, the same 3012 cases were tested again with randomly assigned initial activations for all disorder nodes. Now, instead of setting the initial activation d,(O) to p , , the prior probability for each disorder d , , d,(O) is set to a very small positive random number (0.00001 to 0.0001). All cases again converged on multiple winners, and almost the same degree of accuracy was achieved. Table IV shows the experimental results for the same three examples as in Table I11 except the activation of disorder nodes were initialized to small random numbers close to zero. Correct rates very close to those in Table I11 were obtained. This not only shows that the model is somewhat insensitive to the initial activations of disorder nodes, but also suggests a way to overcome potential situations where some disorders stabilized on activation values somewhere in between 0.05 and 0.95.

294 IEEE TRAP:SACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. 19, NO. 2, MARCH/APRIL 1989

TABLE V


of cases 1024 1024 1024 3072


Total number

cases after the first (90.5 percent) (83.9 percent) (87.3 percent) (87.2 percent) parellel computation


cases after a full (99.7 percent) (100 percent) (99.9 percent) (99.9 percent) resettling process



Since such situation represents a dedicate and uncommon symmetry of a problem ( r , ( t ) =l), changes to the initial d j (0 ) for all d , by random assignment may well bring the system out from the symmetry.

D. Altered Activation Rule Using a Sigmoid Function

It is also interesting to ask how sensitive the results obtained above are to the specific form of the activation updating rule (8a) for d , ( t ) . To consider this issue, we repeated some of the experimental simulations after alter- ing the local activation rule for disorders. Consider (7), the ramp function f used to bound the activation change rate of d , ( t ) , which looks somewhat like a sigmoid function. One of the nice properties of the sigmoid function is that it asymptotically reaches its minimal and maximal values when its argument approaches negative and positive infin- ity, respectively. In the next set of experiments, a different activation rule for d , ( t ) based on a sigmoid function was applied to the same three examples as before. The activation rule is defined as

) (9) d , ( t +1) =1/(1+ e-(r,(l)-l)/T(r)

where t is the number of iterations, and T ( t ) is a parame- ter sometimes called temperature in simulated annealing systems [l], [12]. In the experiments that follow, similar to simulated annealing, we allowed the computation to start with a high T value that was gradually reduced. This new activation rule and the one used in previous experiments (8), are both based on r , ( t ) . The two differences are that in (9) a sigmoid function replaces the ramp function f (7), and the activation d , ( t + 1) is directly computed from the current r , ( t ) rather than via its derivative as in (8a). The activation d , ( t ) on each iteration is determined by r , ( t ) and T ( t ) . The smaller T ( t ) is, the steeper the curve of the sigmoid function will be. When T ( t ) approaches 0.0, the magnitude of r, ( t ) - 1 becomes relatively insignificant, so d , ( 1 + 1) should always approach either 1 .O or 0.0, depend- ing on the sign of r,( t ) - 1. In the experiments, we set the initial value of T(0) = 2.0, and after each iteration let T ( t + 1) = T ( t ) . 0 . 8 . Equilibrium was reached as T ( t ) ap- proached 0.0.

As might be expected, this new activation rule takes less time for the system to converge than before. Except for a

few cases, convergence was reached within 20 iterations, which is much faster than the 80-100 iterations typically needed when (8b) was used as the activation rule. For those exceptional cases that did not converge with T(0) =

2.0 but resulted in oscillating behavior, in order to make the model converge the initial T value needed to be set initially to a value much higher than 2.0, sometimes as high as 100.0, and thus reaching equilibrium took a much longer time.3

Surprisingly to us, not only did this new rule work as well as the original one, but in fact it resulted in much higher accuracy: in 87.2 percent of all 3072 cases D, = D+ without any resettling. This figure rose to 99.4 percent correct with partial resettling, as reported in Table V. Accuracy in both situations is significantly higher than the corresponding 73.2 percent and 94.3 percent achieved using (8b).

As discussed previously, (9) reflects two changes to (8). A sigmoid function replaces ramp function f; and disorder activations d , ( t ) are directly computed rather than using the derivative or change rates of activations. To see which of these two changes is responsible for the improved accuracy, some additional experiments were performed. Using the same three experimental networks as before, the sigmoid function, both with and without gradual temperature reduction, was directly substituted for function f in (8b). Otherwise (8b) was still used unchanged in repeating the simulation described earlier. In other words, the sigmoid function was used to compute the derivative d ' , ( t ) , not directly used to compute the new activation value d, ( t + 1) itself. The results of these experiments showed little change from the results using (8b) with the ramp function. Furthermore, we did experiments to see if using the ramp function to directly compute the new activations (i.e., the ramp function f substituted for the sigmoid function in (9)) produced better results. The results were disastrous: in hundreds of cases the system either oscil- lated or stabilized with some disorders having their activations somewhere in between 0.05 and 0.95.

3The issue of an appropriate initial T value and an appropriate schedule for decreasing T during a simulation is more complex than i t might appear from these casual comments, See [32] for discussion of this issue.


V. CONCLUSION

In this research, a specific competition-based connectionist model is constructed to solve diagnostic problems formulated as causal networks. The diagnostic problem solving is regarded as a nonlinear optimization problem, and the equations governing node activation updating in the connectionist model are derived by decomposing the global optimization criterion to criteria local to individual nodes. This model was run over all 1024 possible different cases for each of three example networks. The experiments show that the diagnostic problems can be solved by this connectionist model within acceptable accuracy (73.2 percent of exactly correct answers if (8b) is used, and 87.2 percent correct if (9) is used) when only a single parallel computation to equilibrium is performed, and can be done with a very high accuracy when a partial resettling process is also performed (greater than 99 percent with (9)). Equa- tions (8) and (9), both based on r , ( t ) , are quite different activation rules. The fact that both of these activation rules work well for solving problems demonstrates the robust- ness of the model. Also, this competitive model is relatively insensitive to the variation of prior probabilities of disorders, as shown by the same degree of accuracy with example 2 and example 3 (the latter is the same as the former except all prior probabilities are doubled).

The computation time required for the experiments reported here on a sequential computer was large: a simulated parallel computation for D, and a sequential search for D+ were performed over thousands of cases, both being time consuming for large causal networks when run on a conventional sequential computer. Thus, we restricted simulations to three example networks that are relatively small compared with the networks occurring in real-world problems. Real-world problems would typically have similar sized or smaller sets M + , but the portion of the networks of relevance would be larger (formally, C n (causes( M + ) x effects(causes( M + ))). Whether this model will perform better or worse in the real world than with these experiments needs to be investigated (see [32]).

In summary, as far as diagnostic problem solving is concerned, the competitive connectionist model developed in this work offers a good alternative to existing sequential approaches [l l] , [20], [21]. The advantage of this model is that it avoids the combinatorial difficulties faced by existing sequential approaches, yet performs the problem solving with quite high accuracy, i.e., provides globally optimal solutions in most cases. The disadvantage is that global optimization is not guaranteed. In some cases, the solution obtained by this model may be one of the good alterna- tives, but not the globally optimal one.

The single-winner-takes-all phenomenon was an important feature demonstrated by earlier work on competition- based connectionist models [23], and it has subsequently been shown to scale up well to large, complex networks (e.g., associative networks involving 2000 nodes with 12000 connections [24]). With this approach, appropriate inhibition is achieved by competition between nodes, not through

mutual inhibitory connections. This current research shows that a multiple-winners-take-all phenomenon can also be achieved by competitive activation mechanisms, provided that appropriate activation rules are developed. In this context, internode relationships are much more compli- cated than the mutually-inhibitory relations that can be captured by static inhibitory links. An implication of this is that the competitive model not only can be applied to solve “low level” cognitive tasks involving single-winner- takes-all behavior such as word recognition, associative memory retrieval, or print-to-sound mapping, but may also be applicable to “higher level” AI tasks such as diagnosis and various optimization problems. The approach of “decomposing plus competition plus partial resettling” seems to be a sensible avenue to pursue in that direction.

To the authors’ knowledge, the only other model using parallel computation to solve diagnostic problems formulated as causal networks is Pearl’s belief propagation model [14]. The belief propagation model can be used to compute posterior probabilities for individual disorders as well as categorically find the most probable hypothesis (our D’) for the special case of causal networks called singly-connected causal networks [15]. Problem solving uses a local and autonomous message-passing process which, when implemented in parallel, only takes time proportional to the network diameter. The computations are provably correct.

In the belief propagation model, a singly-connected network refers to a network in which there is at most one path between any two nodes if links are considered as nondirec- tional. For example the network in Fig. l(b) is a singly- connected network. However, if a link ( d , , m 3 ) is added to Fig. l(b), it is no longer singly-connected because there would be two paths between d, and m3: one is the direct link between d , and m3, the other is a chain of links through nodes m, and d,. The causal networks of typical real world diagnostic problems are not singly- but multiconnected. (The three example networks used in this paper are strongly multiconnected also, as inspection of Table I1 will indicate.) One way to apply the belief propagation model to multiconnected networks is to find a smallest cutpoint set of the given network. The cutpoint set is a set of nodes whose removal would result in the network be- coming singly-connected. Then for each possible instantiation of the cutpoint set (instantiating the belief of each node in the cutpoint set to 0 or l), the parallel computation can be applied and a hypothesis obtained. Among all hypotheses resulting from all instantiations, the one with highest posterior probability is then chosen as the solution for the problem. If the cutpoint set contains k nodes, then there are 2k different instantiations. When causal networks are extensively multiconnected, which is the case for many large real world causal networks [21], the size of a smallest cutpoint set is of the order of ID[. For instance, the size of a smallest cutpoint set of the network of example 1 used in our experiment is five (e.g., one such set is { d,, d , , d , , d,, d , } ) which is half the size of ID1 =lo . In these situations, the belief propagation model could lead to combinatorial difficulty. The approach we have outlined in

296 IEEE TRAE’SACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. 1 9 , NO. 2, MARCH/APRIL 1989

this paper does not face this problem with large and extensively multiconnected networks, even with a full resettling process (full resettling only multiplies the complexity by order of ID1 which is still better than exponential complexity of the belief propagation model in these situations). With partial resettling, we have seen that accuracy as high as 99 percent may be achieved ((9) and Table V) with even less computation, and this would presumably be sufficient for many applications. An interesting question would be whether an approximation method for belief propagation models can be developed that is directly applicable to multiconnected networks with accuracy similar to or better than the results we obtained for our competitive connectionist model. This remains to be seen.

This work can also be compared to energy minimizing connectionist models proposed to solve nonlinear optimization problems. For example, in [lo], a dedicated network of nodes is developed to solve an optimization problem by first casting the constraints of the given problem in terms of a global energy function of the network. A better solution to the problem corresponds to a lower value of the energy function if appropriate weights are used for all links. Then, similar to our model, the collective local interactions among nodes will eventually bring the network to equilibrium that is one of the local energy minima corresponding to a good solution for the given problem.

A well-known application example of energy minimizing networks is the traveling salesman problem (TSP) [lo]. From the n cities to be traveled and the distance matrix among these cities, a network of n 2 nodes is constructed and represented by an n-by-n matrix. A node (i, j) means that the salesman travels to city i as the j t h city in the tour, and competitive dynamics are implemented via inhibitory links from each node (i, j) to all nodes in the ith row and the j t h column of the node matrix. Each node updates its activation level based on information from all of its neighbors, and the system eventually settles down into one of the local minima, representing a presumably good tour. One difference between Hopfield’s energy minimizing model and our competitive model is that, in our model, there is no need to add inhibitory links to the semantic/associative network modeling the real- world problem, something that is especially important in situations like diagnosis where competitive relationships are determined dynamically rather than statically as in the TSP. Formulating a diagnostic problem in an analogous fashion to the approach taken in [lo] would also presumably require 2” nodes (one per elements of the power set 2 D ) rather than n 2 as with the TSP.4

Simulated annealing with energy minimizing connectionist networks is capable of deriving globally rather than locally optimal solutions for some problems, provided a proper annealling process is used [l], [12]. For example, for a Boltzmann machne [l] the knowledge used to solve a

4After completing the work described in this paper, we learned of an ongoing study [6] of connectionist models for abductive inference (hypothesis formation) that uses methods similar to those in [lo].

problem is entirely represented by the connection strengths in a network that must be learned in a training session prior to use. During the training session a large number of patterns, e.g., M + and D + pairs in diagnosis, would be needed to train the machine. There are a number of difficulties with this approach to diagnostic problem solving. First, it is computationally very time-consuming, and the number of links and hidden units required in this approach makes us pessimistic that it would scale up to reasonably-sized problems. Second, which and how many patterns should be selected so that a machine can be trained to work properly is currently an unsolved problem for practical applications. It is practically intractable to train a machine with all possible ( M + , D+) pairs (an astronomically large number in real world problems.) Moreover, training by patterns abandons preexisting knowledge, such as causal relations and causal strengths, that has been accumulated over years and could readily be represented in semantic networks using the approach we described.

Finally, the partial resettling process we used in which each d, E D, is clamped in turn to zero activation and the network reruns to equilibrium bears at least a superficial resemblance to the searching based on a “short-term memory reset wave” used in adaptive resonance theory (ART) systems [2] , [7]. The causal network we describe is analogous to the “attentional subsystem” in ART terminology, although unlike ART systems, we use a local representation, a competitive activation mechanism, and are not concerned in this paper with self-organization. The partial resettling we describe has some similarity to the search triggered by “arousal bursts” with selective inhibition of an active output node in ART, except our resettling process is not triggered by a mismatch between D, and M + via an “orienting subsystem.” Further, resettling in our model generalizes the “search” for an optimal solution as it is usually described in ART in that it cycles through sets of multiple simultaneous winners rather than involving a single-winner-takes-all (“choice”) situation. An interesting but unresolved issue at present is whether an “orienting subsystem” similar to that in ART theory could be used to selectively and automatically guide resettling during multiple-winners-take-all diagnostic problem solving in a pro- ductive manner.

Several other directions for future research are possible. The experiments we have conducted here were on causal networks having only two layers: A layer of disorder nodes and a layer of manifestation nodes. This work could be extended to handle problems on more general networks where nodes representing intermediate concepts such as syndromes are also connected by causal links [4], [19]. Secondly, the approach we took in this work, namely, “decomposition plus competition plus resettling process,” seems to be quite general. A number of questions may be raised regarding this approach. For example, can we derive a general schema for decomposition? Can the connectionist model dynamically evaluate the quality of the solution obtained from the connectionist model with respect to the


globally optimal solution? Can we obtain even higher accuracy by carrying out the resettling process further, e.g., to instantiate each pair of nodes after single nodes were instantiated as we did in this work, and if so when should the process stop? Exploring this approach further and applying it to other nonlinear optimization problems as well as real-world diagnostic applications should shed further light on its strengths and weaknesses.

APPENDIX

Proof of Equation (6)

By the definition of r i ( t ) and q , (d , ( t ) )

ri ( t ) = qi ( ‘ ) /q i (0 )

I n 1

I n I

k = l k # i

To simplify (10) consider a factor of the first product. By (41,

n 1 - mi( t ) 1- n ( l - c k , . d k ( t ) ) = l -

k =1 1- C i i d i ( t ) k t f

and

1- m i ( t ) = I - . (1 - C i j )

l - C i j . d ; ( t )

Dividing (13) by (12), we have n

k + i

REFERENCES

D. Ackley, G. Hinton, and T. Sejnowski, “A learning algorithm for Boltzmann machines,” Cog. Sei., vol. 9, pp. 147-169, 1985. G. Carpenter, and S. Grossberg, “A massively parallel architecture for a self-organizing neural pattern recognition machine,” Computer Vision, Graphics and Image Processing, vol. 37, pp. 54-115. 1987. L. Collatz and W. Wetterling, Optimizution Problems. New Yorh: Springer-Verlag, 1975. G. Cooper, “NESTOR: A computer-based medical diagnostic aid that integrates causal and probabilistic knowledge,” STAN-CS-84- 1031, Ph.D. dissertation, Dept. of Computer Sci., Stanford Univer- sity, Stanford, CA, Nov. 1984. J. Feldman and D. Ballard, “Connectionist models and their properties,” Cog. Sei., vol. 6 , pp. 205-254, 1982. A. Goel, J. Ramanujam, and P. Sadayappan, “Towards a neural architecture for abductive reasoning,” in Proc. Sec. IEEE I n / . Conf. Neural Networks, 1988. pp. 681-688. S. Grossberg, “How does a brain build a cognitive code,” Ps.vch. Reo., vol. 87, pp. 1-51, 1980. G. Hinton, “Implementing semantic networks in parallel hardware,” in Parallel Models of Associative Memoiy, G. Hinton and J. Anderson, Eds. Hillsdale, NJ: Erlbaum, 1981, pp. 161-187. J. Hopfield, “Neurons with graded response have collective computational properties like those of two-state neurons,” Proc. Nut/. Acad. Sci. USA, vol. 81, 1984, pp. 3088-3092. J. Hopfield and D. Tank, “Neural computation of decisions in optimization problems,” Biol. Cybern, vol. 52, pp. 141-152, 1985. J. Josephson, B. Chandrasekaran, J. Smith, and M. Tanner, “A mechanism for forming composite explanatory hypotheses,” IEEE Trans. Syst. Man Cybern., vol. 17, pp. 445-454, 1987. S. Kirkpatrick, C. Gelatt, and M. Vecchi, “Optimization by simulated annealing,” Sci., vol. 220, pp. 671-680, 1983. J. McClelland, and D. Rumelhart, “An interactive activation model of context effects in letter perception: 1. Account of basic findings.” Psych. Rev., vol. 88, pp. 375-407, 1981.


J. Pearl, “Fusion, propagation, and structuring in belief networks,” Artificial Intell. J . , vol. 29, pp. 241-288, Sept. 1986. -, “Distributed revision of composite beliefs,” Arttficial Intell. J . , vol. 33, pp. 173-215, Oct. 1987. Y. Peng, “Formalization of Parsimonious Covering and Probabilis- tic Reasoning in Abductive Diagnostic Inference,” Dept. of Comp. Sci., Univ. of Maryland, TR-1615, 1986. Y. Peng and J. Reggia, “A probabilistic causal model for diagnostic problem solving. Part one: Integrating symbolic causal inference with numeric probabilistic inference,” IEEE Trans. Syst. Man Cyhern., vol. 17, pp. 146-162, 1987. -, “A probabilistic causal model for diagnostic problem solving. Part two: Diagnostic strategy,” IEEE Trans. Syst., Man CY- bern., vol. 17, pp. 395-406, 1987. -, “Diagnostic problemsolving with causal chaining,” Int. J . Intelligent Syst., vol. 2, pp. 265-302, 1987. H. Pople, “Formation of composite hypotheses in diagnostic problem solving,” in Proc. Fifth Int. Joint Conf. in Artificial Intell., pp. 1030-1037,1977, J. Reggia, D. Nau, and P. Wang, “Diagnostic expert systems based on a set covering model,” Int . J . Man-Mach. Studies, vol. 19, pp.

J. Reggia, D. Nau, P. Wang, and Y. Peng, “A formal model of diagnostic inference,” Inform. Sci., vol. 37, pp. 227-285, 1985. J . Reggia, “Virtual lateral inhibition in parallel activation models of associative memory,” in Proc. 9th Int. Joint Con/. Artificial Intell., Los Angeles, CA, Aug. 1985, pp. 244-248. J. Reggia, P. Marsland, and R. Berndt, “Competitive dynamics in a dual-route connectionist model of print-to-sound transformation,” Complex Syst., in press. J. Reggia, “Properties of a competition-based activation mechanism in neuromimetic network models,” Proc. 1st Int. Conf. Neural Networks, vol. 11, San Diego, CA, pp. 131-138, 1987. J. Reggia and G. Sutton, “Self-processing networks and their biomedical implications,” Proc. IEEE, vol. 76, 1988, pp. 680-692. D. Rumelhart and J. McClelland, “An interactive activation model of context effects in letter perception: Part 2-The contextual enhancement effect and some tests and extension of the model,” Psych. Rev., vol. 89, 1982, pp. 60-94. M. Tagamets. and J. Reggia, “A data flow implementation of a competition-based connectionist model,” J . Parallel and Distributed Computing, in press, 1989. D. Touretzky and G. Hinton, “Symbols among the neurons,” Proc. Ninth Int. Joint Con!. Artificial Intell., 1985, pp. 238-243. D. Touretzky and G. Hinton, “A distributed connectionist produc- tion system,” TR-CMUCS-86-172, Dept. of Computer Science, Carnegie-Mellon University, Pittsburgh. PA, 1986.

437-460, 1983.

[31] D. Touretzky, “BoltzCONS: Reconciling connectionism with the recursive nature of stacks and trees,” in Proc. Eighth Annu. Con/. Cognitiue Sci. Soc., 1986, 522-530. J. Wald, M. Farach, M. Tagamets, and J. Reggia, “Generating plausible diagnostic hypotheses with self processing causal networks,” 1988, J . Experimental and Theoretical Artifical Intelligance. 1989, in press. P. Wang, S. Seidman, and J. Reggia, “Analysis of competition-based spreading activation in connectionist models,” Int. J . Man- Mach.

I321

[33]

Stud., vol. 28, pp. 77-97, 1988.

Yun Peng received the B.S. in electrical engineering from the Harbin Engineering Institute, Harbin, China, in 1970. He received the M.S. degree from Wayne State University, Detroit, MI, and the Ph.D. degree from the University of Maryland, College Park, in 1981 and 1985, respectively, both in computer science.

He is currently a Senior Research Scientist at the Institute for Software, Academia Sinica. Bei- jing, China. His interests include artificial intelligence, diagnostic expert systems, uncertainty rea-

soning, and neural network modeling

James A. Reggia received the M.D degree In 1975 and the Ph.D. degree in computer science in 1981, both from the Umversity of Maryland, College Park. He is a member of the faculty of the University of Maryland. He is jointly ap- pointed as Associate Professor of Computer Sci- ence and Neurology, and holds a position in the University of Maryland Institute for Advanced Computer Studies He is actively engaged In teaching and research in artificial intelligence with an emphasis on connectionist models, diag-

nostic problem solving, cognitive modeling, knowledge-based systems, and biomedical applications, and has authored many publications in these areas.

Documents

A connectionist model for diagnostic problem solving