
Journal of Artificial Intelligence Research 13 (2000) 155-188 Submitted 6/00; published 10/00

AIS-BN: An Adaptive Importance Sampling Algorithm for Evidential Reasoning in Large Bayesian Networks

Jian Cheng [email protected]

Marek J. Druzdzel [email protected]

Decision Systems Laboratory, School of Information Sciences and Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260 USA

Abstract

Stochastic sampling algorithms, while an attractive alternative to exact algorithms in very large Bayesian network models, have been observed to perform poorly in evidential reasoning with extremely unlikely evidence. To address this problem, we propose an adaptive importance sampling algorithm, AIS-BN, that shows promising convergence rates even under extreme conditions and seems to outperform the existing sampling algorithms consistently. Three sources of this performance improvement are (1) two heuristics for initialization of the importance function that are based on the theoretical properties of importance sampling in finite-dimensional integrals and the structural advantages of Bayesian networks, (2) a smooth learning method for the importance function, and (3) a dynamic weighting function for combining samples from different stages of the algorithm.

We tested the performance of the AIS-BN algorithm along with two state of the art general purpose sampling algorithms, likelihood weighting (Fung & Chang, 1989; Shachter & Peot, 1989) and self-importance sampling (Shachter & Peot, 1989). We used in our tests three large real Bayesian network models available to the scientific community: the CPCS network (Pradhan et al., 1994), the PathFinder network (Heckerman, Horvitz, & Nathwani, 1990), and the ANDES network (Conati, Gertner, VanLehn, & Druzdzel, 1997), with evidence as unlikely as 10^-41. While the AIS-BN algorithm always performed better than the other two algorithms, in the majority of the test cases it achieved orders of magnitude improvement in the precision of the results. Improvement in speed given a desired precision is even more dramatic, although we are unable to report numerical results here, as the other algorithms almost never achieved the precision reached even by the first few iterations of the AIS-BN algorithm.

1. Introduction

Bayesian networks (Pearl, 1988) are increasingly popular tools for modeling uncertainty in intelligent systems. With practical models reaching the size of several hundreds of variables (e.g., Pradhan et al., 1994; Conati et al., 1997), it becomes increasingly important to address the problem of feasibility of probabilistic inference. Even though several ingenious exact algorithms have been proposed, in very large models they all stumble on the theoretically demonstrated NP-hardness of inference (Cooper, 1990). The significance of this result can be observed in practice: exact algorithms applied to large, densely connected practical networks require either a prohibitive amount of memory or a prohibitive amount of computation and are unable to complete. While approximating inference to any desired precision has been shown to be NP-hard as well (Dagum & Luby, 1993), for very complex networks it is the only alternative that will produce any result at all. Furthermore, while obtaining the result is crucial in all applications, precision guarantees may not be critical for some types of problems and can be traded off against the speed of computation.

A prominent subclass of approximate algorithms is the family of stochastic sampling algorithms (also called stochastic simulation or Monte Carlo algorithms). The precision obtained by stochastic sampling generally increases with the number of samples generated and is fairly unaffected by the network size. Execution time is fairly independent of the topology of the network and is linear in the number of samples. Computation can be interrupted at any time, yielding an anytime property of the algorithms, important in time-critical applications.

While stochastic sampling performs very well in predictive inference, diagnostic reasoning, i.e., reasoning from observed evidence nodes to their ancestors in the network, often exhibits poor convergence. When the number of observations increases, especially if these observations are unlikely a priori, stochastic sampling often fails to converge to reasonable estimates of the posterior probabilities. Although this problem has been known since the very first sampling algorithm was proposed by Henrion (1988), little has been done to address it effectively. Furthermore, the various sampling algorithms proposed so far have been tested on simple and small networks, or on networks with special topology, without the presence of extremely unlikely evidence, and the practical significance of this problem has been underestimated. Given a number of samples that is feasible in real time on today's hardware, say 10^6 samples, the behavior of a stochastic sampling algorithm will be drastically different for networks of different sizes. While in a network consisting of 10 nodes and a few observations it may be possible to converge to the exact probabilities, in very large networks only a negligibly small fraction of the total sample space will be probed. One of the practical Bayesian network models that we used in our tests, a subset of the CPCS network (Pradhan et al., 1994), consists of 179 nodes. Its total sample space is larger than 10^61. With 10^6 samples, we can sample only a 10^-55 fraction of the sample space.

We believe that it is crucial (1) to study the feasibility and convergence properties of sampling algorithms on very large practical networks, and (2) to develop sampling algorithms that will show good convergence under extreme, yet practical conditions, such as evidential reasoning given extremely unlikely evidence. After all, small networks can be updated using any of the existing exact algorithms; it is precisely the very large networks where stochastic sampling can be most useful. As to the likelihood of evidence, we know that stochastic sampling will generally perform well when it is high (Henrion, 1988), so it is important to look at those cases in which evidence is very unlikely. In this paper, we test two existing state of the art stochastic sampling algorithms for Bayesian networks, likelihood weighting (Fung & Chang, 1989; Shachter & Peot, 1989) and self-importance sampling (Shachter & Peot, 1989), on a subset of the CPCS network with extremely unlikely evidence. We show that they both exhibit similarly poor convergence rates. We propose a new sampling algorithm, which we call adaptive importance sampling for Bayesian networks (AIS-BN), that is suitable for evidential reasoning in large multiply-connected Bayesian networks. The AIS-BN algorithm is based on importance sampling, a widely applied method for variance reduction in simulation that has also been applied in Bayesian networks (e.g., Shachter & Peot, 1989). We demonstrate empirically on three large practical Bayesian network models that the AIS-BN algorithm consistently outperforms the other two algorithms. In the majority of the test cases, it achieved over two orders of magnitude improvement in convergence. Improvement in speed given a desired precision is even more dramatic, although we are unable to report numerical results here, as the other algorithms never achieved the precision reached even by the first few iterations of the AIS-BN algorithm. The main sources of improvement are: (1) two heuristics for the initialization of the importance function that are based on the theoretical properties of importance sampling in finite-dimensional integrals and the structural advantages of Bayesian networks, (2) a smooth learning method for updating the importance function, and (3) a dynamic weighting function for combining samples from different stages of the algorithm. We study the value of the two heuristics used in the AIS-BN algorithm, (1) initialization of the probability distributions of parents of evidence nodes to uniform distributions and (2) adjusting very small probabilities in the conditional probability tables, and show that they both play an important role in the AIS-BN algorithm but only a moderate role in the existing algorithms.

The remainder of this paper is structured as follows. Section 2 first gives a general introduction to importance sampling in the domain of finite-dimensional integrals, where it was originally proposed. We show how importance sampling can be used to compute probabilities in Bayesian networks and how it can draw additional benefits from the graphical structure of the network. Then we develop a generalized sampling scheme that will aid us in reviewing the previously proposed sampling algorithms and in describing the AIS-BN algorithm. Section 3 describes the AIS-BN algorithm. We propose two heuristics for initialization of the importance function and discuss their theoretical foundations. We describe a smooth learning method for the importance function and a dynamic weighting function for combining samples from different stages of the algorithm. Section 4 describes the empirical evaluation of the AIS-BN algorithm. Finally, Section 5 suggests several possible improvements to the AIS-BN algorithm, possible applications of our learning scheme, and directions for future work.

2. Importance Sampling Algorithms for Bayesian Networks

We feel that it is useful to go back to the theoretical roots of importance sampling in order to be able to understand the source of speedup of the AIS-BN algorithm relative to the existing state of the art importance sampling algorithms for Bayesian networks. We first review the general idea of importance sampling in finite-dimensional integrals and how it can reduce the sampling variance. We then discuss the application of importance sampling to Bayesian networks. Readers interested in more details are directed to the literature on Monte Carlo methods in computation of finite integrals, such as the excellent exposition by Rubinstein (1981) that we are essentially following in the first section.

2.1 Mathematical Foundations

Let g(X) be a function of m variables X = (X_1, ..., X_m) over a domain Ω ⊂ R^m, such that computing g(X) for any X is feasible. Consider the problem of approximate computation of the integral

I = ∫_Ω g(X) dX .   (1)


Importance sampling approaches this problem by writing the integral (1) as

I = ∫_Ω [ g(X) / f(X) ] f(X) dX ,

where f(X), often referred to as the importance function, is a probability density function over Ω. f(X) can be used in importance sampling if there exists an algorithm for generating samples from f(X) and if the importance function is zero only when the original function is zero, i.e., g(X) ≠ 0 ⇒ f(X) ≠ 0.

After we have independently sampled n points s_1, s_2, ..., s_n, s_i ∈ Ω, according to the probability density function f(X), we can estimate the integral I by

Î_n = (1/n) ∑_{i=1}^{n} g(s_i) / f(s_i)   (2)

and estimate the variance of Î_n by

σ̂²(Î_n) = [ 1 / (n (n − 1)) ] ∑_{i=1}^{n} ( g(s_i) / f(s_i) − Î_n )² .   (3)

It is straightforward to show that this estimator has the following properties:

1. E(Î_n) = I

2. lim_{n→∞} Î_n = I

3. √n · (Î_n − I) converges, as n → ∞, to Normal(0, σ²_{f(X)}), where

   σ²_{f(X)} = ∫_Ω ( g(X) / f(X) − I )² f(X) dX   (4)

4. E(σ̂²(Î_n)) = σ²(Î_n) = σ²_{f(X)} / n

The variance of Î_n is proportional to σ²_{f(X)} and inversely proportional to the number of samples. To minimize the variance of Î_n, we can either increase the number of samples or try to decrease σ²_{f(X)}. With respect to the latter, Rubinstein (1981) reports the following useful theorem and corollary.

Theorem 1 The minimum of σ²_{f(X)} is equal to

σ²_{f(X)} = ( ∫_Ω |g(X)| dX )² − I²

and occurs when X is distributed according to the following probability density function

f(X) = |g(X)| / ∫_Ω |g(X)| dX .


Corollary 1 If g(X) > 0, then the optimal probability density function is

f(X) = g(X) / I

and σ²_{f(X)} = 0.

Although in practice sampling from precisely f(X) = g(X)/I will occur rarely, we expect that functions that are close enough to it can still reduce the variance effectively. Usually, the closer the shape of the function f(X) is to the shape of the function g(X), the smaller σ²_{f(X)} is. In high-dimensional integrals, selection of the importance function, f(X), is far more critical than increasing the number of samples, since the former can dramatically affect σ²_{f(X)}. It seems prudent to put more energy into choosing an importance function whose shape is as close as possible to that of g(X) than to apply the brute force method of increasing the number of samples.

It is worth noting here that if f(X) is uniform, importance sampling becomes general Monte Carlo sampling. Another noteworthy property of importance sampling that can be derived from Equation 4 is that we should avoid f(X) ≪ |g(X) − I · f(X)| in any part of the domain of sampling, even if f(X) matches g(X)/I well in important regions. If f(X) ≪ |g(X) − I · f(X)|, the variance can become very large or even infinite. We can avoid this by adjusting f(X) to be larger in unimportant regions of the domain of X.

While in this section we discussed importance sampling for continuous variables, the results stated are valid for discrete variables as well, in which case integration should be substituted by summation.
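To make the estimator of Equation 2 and the variance estimate of Equation 3 concrete, the following minimal Python sketch (our illustration, not part of the original paper; the integrand and both importance functions are made up) estimates a one-dimensional integral with a uniform importance function and with the optimal importance function of Corollary 1:

```python
import math
import random

def importance_sampling(g, sample_f, pdf_f, n=100_000, seed=0):
    """Estimate I = integral of g using importance sampling (Equation 2)
    and return the estimate together with the estimated variance of the
    estimator (Equation 3)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n):
        s = sample_f(rng)                  # draw s ~ f(X)
        ratios.append(g(s) / pdf_f(s))
    i_hat = sum(ratios) / n
    var_hat = sum((r - i_hat) ** 2 for r in ratios) / (n * (n - 1))
    return i_hat, var_hat

# Toy integrand on (0, 1); its exact integral is 1.
g = lambda x: 2.0 * x

# Uniform importance function f(x) = 1 on (0, 1).
print(importance_sampling(g, lambda r: r.random(), lambda x: 1.0))

# Importance function proportional to the integrand, f(x) = 2x, sampled by
# inverse transform x = sqrt(u).  Per Corollary 1 this is optimal for g > 0,
# so the sampled ratios are constant and the estimated variance is ~0.
print(importance_sampling(g, lambda r: math.sqrt(r.random()),
                          lambda x: 2.0 * x))
```

With the optimal importance function the sampled ratios are constant, so the estimated variance collapses to (numerically) zero, exactly as Corollary 1 predicts.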

2.2 A Generic Importance Sampling Algorithm for Bayesian Networks

In the following discussion, all random variables used are multiple-valued, discrete variables. Capital letters, such as A, B, or C, denote random variables. Bold capital letters, such as A, B, or C, denote sets of variables. The bold capital letter E will usually be used to denote the set of evidence variables. Lower case letters a, b, c denote particular instantiations of variables A, B, and C, respectively. Bold lower case letters, such as a, b, c, denote particular instantiations of the sets A, B, and C, respectively. The bold lower case letter e, in particular, will be used to denote the observations, i.e., instantiations of the set of evidence variables E. Anc(A) denotes the set of ancestors of node A. Pa(A) denotes the set of parents (direct ancestors) of node A. pa(A) denotes a particular instantiation of Pa(A). \ denotes set difference. The extended vertical bar, as in Pa(A)|E=e, indicates substitution of e for E in the preceding expression.

We know that the joint probability distribution over all variables of a Bayesian network model, Pr(X), is the product of the probability distributions over each of the nodes conditional on their parents, i.e.,

Pr(X) = ∏_{i=1}^{n} Pr(X_i | Pa(X_i)) .   (5)

In order to calculate Pr(E = e), we need to sum Pr(X\E, E = e) over all instantiations of X\E:

Pr(E = e) = ∑_{X\E} Pr(X\E, E = e) .   (6)


We can see that Equation 6 is almost identical to Equation 1, except that integration is replaced by summation and the domain Ω is replaced by X\E. The theoretical results derived for importance sampling that we reviewed in the previous section can thus be directly applied to computing probabilities in Bayesian networks.
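As a concrete illustration of Equations 5 and 6, the following short Python sketch evaluates the joint probability of a full instantiation as a product of CPT entries and obtains Pr(E = e) by brute-force summation. The two-node network and its numbers are hypothetical, chosen only to keep the example self-contained:

```python
# A hypothetical two-variable network A -> B with binary nodes; the CPT
# numbers are made up and serve only to illustrate Equations 5 and 6.
cpt_a = {(): {0: 0.7, 1: 0.3}}                   # Pr(A)
cpt_b = {(0,): {0: 0.9, 1: 0.1},                 # Pr(B | A)
         (1,): {0: 0.2, 1: 0.8}}

def joint(a, b):
    """Equation 5: the joint probability is the product of CPT entries."""
    return cpt_a[()][a] * cpt_b[(a,)][b]

def prob_evidence(b_obs):
    """Equation 6: sum the joint over the non-evidence variables (here A)."""
    return sum(joint(a, b_obs) for a in (0, 1))

print(prob_evidence(1))   # Pr(B = 1) = 0.7*0.1 + 0.3*0.8 = 0.31
```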

While there has been previous work on importance sampling-based algorithms for Bayesian networks, we will postpone the discussion of this work until the next section. Here we will present a generic stochastic sampling algorithm that will help us both in reviewing the prior work and in presenting our algorithm.

The posterior probability Pr(a|e) can be obtained by first computing Pr(a, e) and Pr(e) and then combining these based on the definition of conditional probability,

Pr(a|e) = Pr(a, e) / Pr(e) .   (7)

In order to increase the accuracy of the results of importance sampling in computing the posterior probabilities over different network variables given evidence, we should in general use different importance functions for Pr(a, e) and for Pr(e). Doing so increases the computation time only linearly, while the gain in accuracy may be significant given that obtaining a desired accuracy is exponential in nature. In practice, however, it is common to use the same importance function (usually the one for Pr(e)) to sample both probabilities.

1. Order the nodes according to their topological order.
2. Initialize the importance function Pr_0(X\E), the desired number of samples m, the updating interval l, and the score arrays for every node.
3. k ← 0, T ← ∅
4. for i ← 1 to m do
5.   if (i mod l == 0) then
6.     k ← k + 1
7.     Update the importance function Pr_k(X\E) based on T.
     end if
8.   s_i ← generate a sample according to Pr_k(X\E)
9.   T ← T ∪ {s_i}
10.  Calculate Score(s_i, Pr(X\E, e), Pr_k(X\E)) and add it to the corresponding entry of every score array according to the instantiated states.
   end for
11. Normalize the score arrays for every node.

Figure 1: A generic importance sampling algorithm.


If the difference between the optimal importance functions for these two quantities is large, the performance may deteriorate significantly. Although the estimators of Pr(a, e) and Pr(e) are unbiased according to Property 1 (Section 2.1), the estimator of Pr(a|e) obtained by means of Equation 7 is not unbiased. However, as the number of samples increases, the bias decreases and can be ignored altogether when the sample size is large enough (Fishman, 1995).

Figure 1 presents a generic stochastic sampling algorithm that captures most of the existing sampling algorithms. Without loss of generality, we restrict ourselves in our description to so-called forward sampling, i.e., generation of samples in the topological order of the nodes in the network. The forward sampling order is accomplished by the initialization performed in Step 1, where the parents of each node are placed before the node itself. In forward sampling, Step 8 of the algorithm, the actual generation of samples, works as follows: (i) each evidence node is instantiated to its observed state and is further omitted from sample generation; (ii) each root node is randomly instantiated to one of its possible states, according to the importance prior probability of this node, which can be derived from Pr_k(X\E); (iii) each node whose parents are instantiated is randomly instantiated to one of its possible states, according to the importance conditional probability distribution of this node given the values of the parents, which can also be derived from Pr_k(X\E); (iv) this procedure is followed until all nodes are instantiated. A complete instantiation s_i of the network obtained by this method is one sample from the joint importance probability distribution Pr_k(X\E) over all variables of the network. The scoring of Step 10 amounts to calculating Pr(s_i, e)/Pr_k(s_i), as required by Equation 2. The ratio between the total score sum and the number of samples is an unbiased estimator of Pr(e). In Step 10, if we also count the score sum under the condition A = a, i.e., that some unobserved variables A have the values a, the ratio between this score sum and the number of samples is an unbiased estimator of Pr(a, e).
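The following Python sketch shows what Steps 8-10 look like under one possible (hypothetical) encoding of a network as plain dictionaries; it illustrates the forward sampling and scoring just described and is not code from the paper:

```python
import random

def forward_sample_and_score(order, parents, cpts, icpts, evidence, rng):
    """One pass of Steps 8-10 of the generic algorithm: forward-sample the
    non-evidence nodes from the importance distribution and return
    (sample, score), where score = Pr(sample, e) / Pr_k(sample).
    Assumed encoding: cpts[node][parent_state] -> {value: probability},
    and icpts has the same structure for the importance function."""
    sample = dict(evidence)                  # (i) evidence nodes are clamped
    p_model, p_sampling = 1.0, 1.0
    for node in order:                       # topological order: parents first
        pa_state = tuple(sample[p] for p in parents[node])
        if node in evidence:
            # Evidence contributes to Pr(sample, e) but is never sampled.
            p_model *= cpts[node][pa_state][sample[node]]
            continue
        dist = icpts[node][pa_state]         # (ii)-(iii) draw from the ICPT
        value = rng.choices(list(dist), weights=list(dist.values()))[0]
        sample[node] = value
        p_model *= cpts[node][pa_state][value]
        p_sampling *= dist[value]
    return sample, p_model / p_sampling      # Step 10 score

# Example call (with the two-node toy network of the earlier sketch):
# forward_sample_and_score(['A', 'B'], {'A': [], 'B': ['A']},
#                          {'A': cpt_a, 'B': cpt_b}, {'A': cpt_a, 'B': cpt_b},
#                          {'B': 1}, random.Random(0))
```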

Most existing algorithms focus on the posterior probability distributions of individual nodes. As we mentioned above, for the sake of efficiency they count the score sum corresponding to Pr(A = a, e), A ∈ X\E, and record it in a score array for node A. Each entry of this array corresponds to a specified state of A. This method introduces additional variance, as opposed to using the importance function derived from Pr_k(X\E) to sample Pr(A = a, e), A ∈ X\E, directly.

2.3 Existing Importance Sampling Algorithms for Bayesian Networks

The main difference between the various stochastic sampling algorithms is in how they process Steps 2, 7, and 8 in the generic importance sampling algorithm of Figure 1.

Probabilistic logic sampling (Henrion, 1988) is the simplest and the first proposed sampling algorithm for Bayesian networks. The importance function is initialized in Step 2 to Pr(X) and never updated (Step 7 is null). Without evidence, Pr(X) is the optimal importance function for the evidence set, which is empty anyway. It escapes most authors that Pr(X) may not be the optimal importance function for Pr(A = a), A ∈ X, when A is not a root node. A mismatch between the optimal and the actually used importance function may result in a large variance. The sampling process with evidence is the same as without evidence, except that in Step 10 we do not count the scores for those samples that are inconsistent with the observed evidence, which amounts to discarding them. When the evidence is very unlikely, there is a large difference between Pr(X) and the optimal importance function. Effectively, most samples are discarded and the performance of logic sampling deteriorates badly.

Likelihood weighting (LW) (Fung & Chang, 1989; Shachter & Peot, 1989) enhances logic sampling in that it never discards samples. In likelihood weighting, the importance function in Step 2 is

Pr(X\E) = ∏_{X_i ∉ E} Pr(x_i | Pa(X_i)) |_{E=e} .

Likelihood weighting does not update the importance function in Step 7. Although likelihood weighting is an improvement over logic sampling, its convergence rate can still be very slow when there is a large difference between the optimal importance function and Pr(X\E), again especially in situations when evidence is very unlikely. Because of its simplicity, the likelihood weighting algorithm has been the most commonly used simulation method for Bayesian network inference. It often matches the performance of other, more sophisticated schemes because it is simple and able to increase its precision by generating more samples than other algorithms in the same amount of time.
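In terms of the dictionary-based sketch in Section 2.2, likelihood weighting samples every unobserved node from its own CPT, so the score of a sample reduces to the likelihood of the evidence given its sampled parents. A minimal sketch of that weight computation (again assuming the same hypothetical encoding, not an implementation from the paper) is:

```python
def likelihood_weight(evidence, parents, cpts, sample):
    """Weight of a single likelihood-weighting sample.  Because every
    unobserved node is drawn from its own CPT, the general score
    Pr(sample, e) / Pr_0(sample) cancels down to the product of the
    evidence likelihoods given their sampled parents."""
    w = 1.0
    for e_node, e_value in evidence.items():
        pa_state = tuple(sample[p] for p in parents[e_node])
        w *= cpts[e_node][pa_state][e_value]
    return w
```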

Backward sampling (Fung & del Favero, 1994) changes Step 1 of our generic algorithm and allows for generating samples from evidence nodes in the direction that is opposite to the topological order of nodes in the network. In Step 2, backward sampling uses the likelihood of some of the observed evidence and some instantiated nodes to calculate Pr_0(X\E). Although Fung and del Favero mentioned the possibility of dynamic node ordering, they did not propose any scheme for updating the importance function in Step 7. Backward sampling suffers from problems that are similar to those of likelihood weighting, i.e., a possible mismatch between its importance function and the optimal importance function can lead to poor convergence.

Importance sampling (Shachter & Peot, 1989) is the same as our generic sampling algorithm. Shachter and Peot introduced two variants of importance sampling: self-importance (SIS) and heuristic importance. The importance function used in the first step of the self-importance algorithm is

Pr_0(X\E) = ∏_{X_i ∉ E} Pr(x_i | Pa(X_i)) |_{E=e} .

This function is updated in Step 7. The algorithm tries to revise the conditional probability tables (CPTs) periodically in order to make the sampling distribution gradually approach the posterior distribution. Since the same data are used to update the importance function and to compute the estimator, this process introduces bias in the estimator. Heuristic importance first removes edges from the network until it becomes a polytree, and then uses a modified version of the polytree algorithm (Pearl, 1986) to compute the likelihood functions for each of the unobserved nodes. Pr_0(X\E) is a combination of these likelihood functions with Pr(X\E, e). In Step 7, heuristic importance does not update Pr_k(X\E). As Shachter and Peot (1989) point out, this heuristic importance function can still lead to a bad approximation of the optimal importance function. There also exist other algorithms, such as a combination of self-importance and heuristic importance (Shachter & Peot, 1989; Shwe & Cooper, 1991).


Although some researchers have suggested that this may be a promising direction for work on sampling algorithms, we have not seen any results that would follow up on this.

A separate group of stochastic sampling methods is formed by so-called Markov Chain Monte Carlo (MCMC) methods, which are divided into Gibbs sampling, Metropolis sampling, and Hybrid Monte Carlo sampling (Geman & Geman, 1984; Gilks, Richardson, & Spiegelhalter, 1996; MacKay, 1998). Roughly speaking, these methods draw random samples from an unknown target distribution f(X) by biasing the search for this distribution towards higher probability regions. When applied to Bayesian networks (Pearl, 1987; Chavez & Cooper, 1990) this approach determines the sampling distribution of a variable from its previous sample given its Markov blanket (Pearl, 1988). This corresponds to updating Pr_k(X\E) when sampling every node. Pr_k(X\E) will converge to the optimal importance function for Pr(e) if Pr_0(X\E) satisfies some ergodic properties (York, 1992). Since the convergence to the limiting distribution is very slow and calculating updates of the sampling distribution is costly, these algorithms are not used in practice as often as the simple likelihood weighting scheme.

There are also some other simulation algorithms, such as the bounded variance algorithm (Dagum & Luby, 1997) and the AA algorithm (Dagum et al., 1995), which are essentially based on the LW algorithm and the Stopping-Rule Theorem (Dagum et al., 1995). Cano et al. (1996) proposed another importance sampling algorithm that performed somewhat better than LW in cases with extreme probability distributions, but, as the authors state, in general cases it "produced similar results to the likelihood weighting algorithm." Hernandez et al. (1998) also applied importance sampling and reported a moderate improvement over likelihood weighting.

2.4 Practical Performance of the Existing Sampling Algorithms

The largest network that has been tested using sampling algorithms is QMR-DT (Quick Medical Reference - Decision Theoretic) (Shwe et al., 1991; Shwe & Cooper, 1991), which contains 534 adult diseases and 4,040 findings, with 40,740 arcs depicting disease-to-finding dependencies. The QMR-DT network belongs to a class of special bipartite networks and its structure is often referred to as BN2O (Henrion, 1991), because of its two-layer composition: disease nodes in the top layer and finding nodes in the bottom layer. Shwe and colleagues used an algorithm combining self-importance and heuristic importance and tested its convergence properties on the QMR-DT network. However, since the heuristic method, iterative tabular Bayes (ITB), which makes use of a version of Bayes' rule, is designed for BN2O networks, it cannot be generalized to arbitrary networks. Although Shwe and colleagues concluded that Markov blanket scoring and self-importance sampling significantly improve the convergence rate in their model, we cannot extend this conclusion to general networks. The computation of Markov blanket scoring is more complex in a general multiply connected network than in a BN2O network. Also, the experiments conducted lacked a gold-standard posterior probability distribution that could serve to judge the convergence rate.

Pradhan and Dagum (1996) tested an efficient version of the LW algorithm, the bounded variance algorithm (Dagum & Luby, 1997), and the AA algorithm (Dagum et al., 1995) on a 146-node, multiply connected medical diagnostic Bayesian network. One limitation of their tests is that the probability of evidence in the cases selected for testing was rather high. Although over 10% of the cases had a probability of evidence on the order of 10^-8 or smaller, a simple calculation based on the reported mean of µ = 34.5 evidence nodes shows that the average probability of an observed state of an evidence node conditional on its direct predecessors was on the order of (10^-8)^(1/34.5) ≈ 0.59. Given that their algorithm is essentially based on the LW algorithm, based on our tests we suspect that its performance will deteriorate on cases where the evidence is very unlikely. Both algorithms focus on the marginal probability of one hypothesis node. If there are many queried nodes, the efficiency may deteriorate.

We have tested the algorithms discussed in Section 2.3 on several large networks. Our experimental results show that in cases with very unlikely evidence, none of these algorithms converges to reasonable estimates of the posterior probabilities within a reasonable amount of time. The convergence becomes worse as the number of evidence nodes increases. Thus, when using these algorithms in very large networks, we simply cannot trust the results. We will present results of tests of the LW and SIS algorithms in more detail in Section 4.

3. AIS-BN: Adaptive Importance Sampling for Bayesian Networks

The main reason why the existing stochastic sampling algorithms converge so slowly is that they fail to learn a good importance function during the sampling process and, effectively, fail to reduce the sampling variance. When the importance function is optimal, such as in probabilistic logic sampling without any evidence, each of the algorithms is capable of converging to fairly good estimates of the posterior probabilities within relatively few samples. For example, assuming that the posterior probabilities are not extreme (i.e., larger than, say, 0.01), as few as 1,000 samples may be sufficient to obtain good estimates. In this section, we present the adaptive importance sampling algorithm for Bayesian networks (AIS-BN) that, as we will demonstrate in the next section, performs very well on most tests. We will first describe the details of the algorithm and prove two theorems that are useful in learning the optimal importance sampling function.

3.1 Basic Algorithm — AIS-BN

Compared with importance sampling used in normal finite-dimensional integrals, importance sampling used in Bayesian networks has several significant advantages. First, the network joint probability distribution Pr(X) is decomposable and can be factored into component parts. Second, the network has a clear structure, which represents many conditional independence relationships. These properties are very helpful in estimating the optimal importance function.

The basic AIS-BN algorithm is presented in Figure 2. The main differences between the AIS-BN algorithm and the basic importance sampling algorithm in Figure 1 are that we introduce a monotonically increasing weight function w_k and two effective heuristic initialization methods in Step 2. We also introduce a special learning component in Step 7 to let the updating process run more smoothly, avoiding oscillation of the parameters.


1. Order the nodes according to their topological order.
2. Initialize the importance function Pr_0(X\E) using some heuristic methods, initialize the weight w_0, set the desired number of samples m and the updating interval l, and initialize the score arrays for every node.
3. k ← 0, T ← ∅, wTScore ← 0, wsum ← 0
4. for i ← 1 to m do
5.   if (i mod l == 0) then
6.     k ← k + 1
7.     Update the importance function Pr_k(X\E) and w_k based on T.
     end if
8.   s_i ← generate a sample according to Pr_k(X\E)
9.   T ← T ∪ {s_i}
10.  wiScore ← Score(s_i, Pr(X\E, e), Pr_k(X\E), w_k)
11.  wTScore ← wTScore + wiScore
     (Optional: add wiScore to the corresponding entry of every score array)
12.  wsum ← wsum + w_k
   end for
13. Output the estimate of Pr(E = e) as wTScore / wsum.
    (Optional: normalize the score arrays for every node)

Figure 2: The adaptive importance sampling for Bayesian networks (AIS-BN) algorithm.

The score processing in Step 10 is

wiScore = w_k · Pr(s_i, e) / Pr_k(s_i) .

Note that in this respect the algorithm in Figure 1 becomes a special case of AIS-BN when w_k = 1. The reason why we use w_k is that we want to give different weights to the sampling results obtained at different stages of the algorithm. As each stage updates the importance function, the stages will all be at different distances from the optimal importance function. We recommend w_k ∝ 1/σ_k, where σ_k is the standard deviation estimated in stage k using Equation 3 (see footnote 1). In order to keep w_k monotonically increasing, if w_k is smaller than w_{k-1}, we adjust its value to w_{k-1}.

1. A similar weighting scheme based on variance was apparently developed independently by Ortiz and Kaelbling (2000), who recommend the weight w_k ∝ 1/(σ_k)².


This weighting scheme may introduce bias into the final result. Since the initial importance sampling functions are often inefficient and introduce a large variance into the results, we also recommend setting w_k = 0 in the first few stages of the algorithm. We designed this weighting scheme to reflect the fact that, in practice, estimates with a very small estimated variance are usually good estimates.
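A minimal Python sketch of this dynamic weighting heuristic is given below. It is our own illustration: the burn-in length and the per-stage bookkeeping (score sums and sample counts kept per stage) are assumptions about how one might organize the computation, not details prescribed by the paper.

```python
def stage_weights(stage_sigmas, burn_in=5):
    """Dynamic weights w_k: zero during an initial burn-in, then
    proportional to 1/sigma_k, forced to be monotonically non-decreasing.
    `burn_in` is an illustrative parameter; the paper only says that
    w_k = 0 in the first few stages."""
    weights, prev = [], 0.0
    for k, sigma in enumerate(stage_sigmas):
        w = 0.0 if k < burn_in else 1.0 / max(sigma, 1e-12)
        w = max(w, prev)                 # keep w_k monotonically increasing
        weights.append(w)
        prev = w
    return weights

def combine_stages(stage_score_sums, stage_counts, weights):
    """Weighted estimate of Pr(e): sum_k w_k * (score sum of stage k)
    divided by sum_k w_k * (number of samples in stage k), which is what
    the accumulators wTScore and wsum of Figure 2 compute."""
    num = sum(w * s for w, s in zip(weights, stage_score_sums))
    den = sum(w * n for w, n in zip(weights, stage_counts))
    return num / den
```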

3.2 Modifying the Sampling Distribution in AIS-BN

Based on the theoretical considerations of Section 2.1, we know that the crucial element of the algorithm is converging on a good approximation of the optimal importance function. In what follows, we first give the optimal importance function for calculating Pr(E = e) and then discuss how to use the structural advantages of Bayesian networks to approximate this function. In the sequel, we will use the symbol ρ to denote the importance sampling function and ρ* to denote the optimal importance sampling function.

Since Pr(X\E, E = e) > 0, from Corollary 1 we have

ρ*(X\E) = Pr(X\E, E = e) / Pr(E = e) = Pr(X | E = e) .

The following corollary captures this result.

Corollary 2 The optimal importance sampling function ρ*(X\E) for calculating Pr(E = e) in Equation 6 is Pr(X | E = e).

Although we know the mathematical expression for the optimal importance sampling function, it is difficult to obtain this function exactly. In our algorithm, we use the following importance sampling function

ρ(X\E) = ∏_{i=1}^{n} Pr(X_i | Pa(X_i), E) .   (8)

This function partially considers the effect of all the evidence on every node during the sampling process. When the network structure is the same as that of the network that has absorbed the evidence, this function is the optimal importance sampling function. It is easy to learn and, as our experimental results show, it is a good approximation to the optimal importance sampling function. Theoretically, when the posterior structure of the model changes drastically as the result of observed evidence, this importance sampling function may perform poorly. We have tried to find practical networks where this would happen, but to date have not encountered a drastic example of this effect.

From Section 2.2, we know that the score sums corresponding to (x_i, pa(X_i), e) can yield an unbiased estimator of Pr(x_i, pa(X_i), e). According to the definition of conditional probability, we can then obtain an estimator Pr′(x_i | pa(X_i), e). This can be achieved by maintaining an updating table for every node, the structure of which mimics the structure of the CPT. Such tables allow us to decompose the above importance function into components that can be learned individually. We will call these tables the importance conditional probability tables (ICPT).

Definition 1 An importance conditional probability table (ICPT) of a node X is a table of posterior probabilities Pr(X | Pa(X), E = e) conditional on the evidence and indexed by its immediate predecessors, Pa(X).


The ICPT tables will be modified during the process of learning the importance function. Now we will prove a useful theorem that will lead to considerable savings in the learning process.

Theorem 2

X_i ∈ X, X_i ∉ Anc(E) ⇒ Pr(X_i | Pa(X_i), E) = Pr(X_i | Pa(X_i)) .   (9)

Proof: Suppose we have set the values of all the parents of node X_i to pa(X_i). Node X_i is dependent on evidence E given pa(X_i) only when X_i is d-connected with E given pa(X_i) (Pearl, 1988). According to the definition of d-connectivity, this happens only when there exists a member of X_i's descendants that belongs to the set of evidence nodes E, in other words, only when X_i ∈ Anc(E). □

Theorem 2 is very important for the AIS-BN algorithm. It states essentially that the ICPT tables of those nodes that are not ancestors of the evidence nodes are equal to the CPT tables throughout the learning process. We only need to learn the ICPT tables for the ancestors of the evidence nodes. Very often this can lead to significant savings in computation. If, for example, all evidence nodes are root nodes, we have our ICPT tables for every node already and the AIS-BN algorithm becomes identical to the likelihood weighting algorithm. Without evidence, the AIS-BN algorithm becomes identical to the probabilistic logic sampling algorithm.
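A short Python sketch of how one might exploit Theorem 2 is shown below; it simply collects Anc(E) with a graph traversal so that learning can be restricted to those nodes. The dictionary-based network encoding is the same hypothetical one used in the earlier sketches:

```python
def nodes_needing_icpt_learning(parents, evidence_nodes):
    """Collect Anc(E) by walking parent links from the evidence nodes.
    By Theorem 2, only these nodes need learned ICPTs; all other nodes
    keep Pr(X_i | Pa(X_i)) unchanged."""
    ancestors, stack = set(), list(evidence_nodes)
    while stack:
        node = stack.pop()
        for p in parents.get(node, []):
            if p not in ancestors:
                ancestors.add(p)
                stack.append(p)
    return ancestors - set(evidence_nodes)

# Example: chain A -> B -> C with evidence observed on C.
print(nodes_needing_icpt_learning({'A': [], 'B': ['A'], 'C': ['B']}, {'C'}))
# -> {'A', 'B'}
```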

It is worth pointing out that for some X_i, Pr(X_i | Pa(X_i), E) (i.e., the ICPT table for X_i) can be easily calculated using exact methods. For example, when X_i is the only parent of an evidence node E_j and E_j is the only child of X_i, the posterior probability distribution of X_i is straightforward to compute exactly.

Input: Initialized importance function Pr_0(X\E), learning rate η(k).
Output: An estimated importance function Pr_S(X\E).
for stage k ← 0 to S do

1. Sample l points s_1^k, s_2^k, ..., s_l^k independently according to the current importance function Pr_k(X\E).

2. For every node X_i such that X_i ∈ X\E and X_i ∈ Anc(E), count the score sums corresponding to (x_i, pa(X_i), e) and estimate Pr′(x_i | pa(X_i), e) based on s_1^k, s_2^k, ..., s_l^k.

3. Update Pr_k(X\E) according to the following formula:

   Pr_{k+1}(x_i | pa(X_i), e) = Pr_k(x_i | pa(X_i), e) + η(k) · ( Pr′(x_i | pa(X_i), e) − Pr_k(x_i | pa(X_i), e) )

end for

Figure 3: The AIS-BN algorithm for learning the optimal importance function.


Since the focus of the current paper is on sampling, the test results reported in this paper do not include this improvement of the AIS-BN algorithm.

Figure 3 lists an algorithm that implements Step 7 of the basic AIS-BN algorithm listed in Figure 2. When we estimate Pr′(x_i | pa(X_i), e), we only use the samples obtained at the current stage. One reason for this is that the information obtained in previous stages has already been absorbed by Pr_k(X\E). The other reason is that, in principle, each successive iteration is more accurate than the previous one and its importance function is closer to the optimal importance function. Thus, the samples generated by Pr_{k+1}(X\E) are better than those generated by Pr_k(X\E). The difference Pr′(X_i | pa(X_i), e) − Pr_k(X_i | pa(X_i), e) corresponds to the vector of first partial derivatives in the direction of the maximum decrease in the error. η(k) is a positive function that determines the learning rate. When η(k) = 0 (lower bound), we do not update our importance function. When η(k) = 1 (upper bound), at each stage we discard the old function. The convergence speed is directly related to η(k). If it is small, convergence will be very slow due to the large number of updating steps needed to reach a local minimum. On the other hand, if it is large, the convergence rate will initially be very fast, but the algorithm will eventually start to oscillate and thus may not reach a minimum. There are many papers in the field of neural network learning that discuss how to choose the learning rate so that the estimated importance function converges quickly to the destination function. Any method that can improve the learning rate should be applicable to this algorithm. Currently, we use the following function proposed by Ritter et al. (1991):

η(k) = a · (b/a)^{k/k_max} ,   (10)

where a is the initial learning rate and b is the learning rate in the last step. This function has been reported to perform well in neural network learning (Ritter et al., 1991).
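For concreteness, a minimal Python sketch of the update in Step 3 of Figure 3, using the learning rate of Equation 10, is given below. The parameter values are the ones reported in Section 4.1; everything else (function names, the dictionary representation of an ICPT column) is our own illustrative choice:

```python
def learning_rate(k, k_max=10, a=0.4, b=0.14):
    """Equation 10: eta(k) = a * (b/a)**(k / k_max), where a is the
    initial and b the final learning rate (values from Section 4.1)."""
    return a * (b / a) ** (k / k_max)

def update_icpt_column(current, estimated, k, **kwargs):
    """Step 3 of Figure 3 for one ICPT column Pr_k(X_i | pa(X_i), e):
    move the current importance distribution toward the distribution
    estimated from the samples of stage k.  `current` and `estimated`
    are dicts mapping states to probabilities."""
    eta = learning_rate(k, **kwargs)
    return {x: p + eta * (estimated[x] - p) for x, p in current.items()}

# Example: current column (0.9, 0.1), sample estimate (0.5, 0.5), stage 0.
print(update_icpt_column({0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}, k=0))
# eta(0) = 0.4, so the updated column is approximately {0: 0.74, 1: 0.26}.
```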

3.3 Heuristic Initialization in AIS-BN

The dimensionality of the problem of Bayesian network inference is equal to the number of variables in a network, which in the networks considered in this paper can be very high. As a result, the learning space of the optimal importance function is very large. The choice of the initial importance function Pr_0(X\E) is an important factor affecting the learning: an initial value of the importance function that is close to the optimal importance function can greatly affect the speed of convergence. In this section, we present two heuristics that help to achieve this goal.

Due to their explicit encoding of the structure of a decomposable joint probability distribution, Bayesian networks offer computational advantages compared to finite-dimensional integrals. A possible first approximation of the optimal importance function is the prior probability distribution over the network variables, Pr(X). We propose an improvement on this initialization. We know that the effect of evidence nodes on a node is attenuated as the path length from that node to the evidence nodes increases (Henrion, 1989), and the most affected nodes are the direct ancestors of the evidence nodes. Initializing the ICPT tables of the parents of the evidence nodes to uniform distributions in our experience improves the convergence rate.


Furthermore, the CPT tables of the parents of an evidence node E may not be favorable to the observed state e if the probability of E = e without any condition is less than a small value, such as Pr(E = e) < 1/(2 · n_E), where n_E is the number of outcomes of node E. Based on this observation, in our experiments we change the CPT tables of the parents of an evidence node E to uniform distributions only when Pr(E = e) < 1/(2 · n_E); otherwise we leave them unchanged. This kind of initialization involves knowledge of Pr(E = e), the marginal probability without evidence. Probabilistic logic sampling (Henrion, 1988) enhanced by Latin hypercube sampling (Cheng & Druzdzel, 2000b) or quasi-Monte Carlo methods (Cheng & Druzdzel, 2000a) will produce a very good estimate of Pr(E = e). This is a one-time effort that can be made at the model building stage and is worth pursuing to any desired precision.

Another serious problem related to sampling is extremely small probabilities. Suppose there exists a root node with a state s that has the prior probability Pr(s) = 0.0001. Let the posterior probability of this state given evidence be Pr(s|E) = 0.8. A simple calculation shows that if we update the importance function every 1,000 samples, we can expect to hit s only once every 10 updates. Thus the convergence rate for s will be very slow. We can overcome this problem by setting a threshold θ and replacing every probability p < θ in the network by θ (see footnote 2). At the same time, we subtract (θ − p) from the largest probability in the same conditional probability distribution. For example, the value of θ = 10/l, where l is the updating interval, will allow us to sample such a state ten times more often in the first stage of the algorithm. If this state turns out to be more likely (having a large weight), we can increase its probability even more in order to converge to the correct answer faster. Considering that we should avoid f(X) ≪ |g(X) − I · f(X)| in unimportant regions, as discussed in Section 2.1, we need to make this threshold larger. We have found that the convergence rate is quite sensitive to this threshold. Based on our empirical tests, we suggest using θ = 0.04 in networks whose maximum number of outcomes per node does not exceed five. A smaller threshold might lead to fast convergence in some cases but slow convergence in others. If one threshold does not work, changing it for a specific network will usually improve the convergence rate.
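A minimal sketch of this thresholding heuristic, applied to one conditional probability distribution (one CPT column), might look as follows. The dictionary representation and the assumption that the largest entry stays well above θ are our own:

```python
def raise_small_probabilities(column, theta=0.04):
    """Raise every probability below `theta` in one conditional
    distribution to `theta`, taking the total increase away from the
    largest entry so the column still sums to one.  Assumes the largest
    entry stays well above `theta`; theta = 0.04 is the value used in
    the paper's experiments."""
    adjusted = dict(column)
    largest = max(adjusted, key=adjusted.get)
    for state, p in column.items():
        if state != largest and p < theta:
            adjusted[largest] -= theta - p
            adjusted[state] = theta
    return adjusted

print(raise_small_probabilities({0: 0.0001, 1: 0.9999}))
# -> approximately {0: 0.04, 1: 0.96}
```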

3.4 Selection of Parameters

There are several tunable parameters in the AIS-BN algorithm. We base the choice of these parameters on the Central Limit Theorem (CLT). According to the CLT, if Z_1, Z_2, ..., Z_n are independent and identically distributed random variables with E(Z_i) = µ_Z and Var(Z_i) = σ²_Z, i = 1, ..., n, then Z̄ = (Z_1 + ... + Z_n)/n is approximately normally distributed when n is sufficiently large. Thus,

lim_{n→∞} P( |Z̄ − µ_Z| / µ_Z ≥ (σ_Z/√n) / µ_Z · t ) = (2/√(2π)) ∫_t^∞ e^{−x²/2} dx .   (11)

Although this approximation holds when n approaches infinity, the CLT is known to be very robust and to lead to excellent approximations even for small n. The formula of Equation 11 is an (ε_r, δ) relative approximation, that is, an estimate µ̂ of µ that satisfies

P( |µ̂ − µ| / µ ≥ ε_r ) ≤ δ .

2. This initialization heuristic was apparently developed independently by Ortiz and Kaelbling (2000).


If δ has been fixed, then

ε_r = (σ_Z/√n) / µ_Z · Φ_Z^{−1}(δ/2) ,

where Φ_Z(z) = (1/√(2π)) ∫_z^∞ e^{−x²/2} dx. Since in our sampling problem µ_Z (corresponding to Pr(E = e) in Figure 2) is fixed, setting ε_r to a smaller value amounts to making σ_Z/√n smaller. So, we can adjust the parameters based on σ_Z/√n, which can be estimated using Equation 3. This is also the theoretical intuition behind our recommendation w_k ∝ 1/σ_k in Section 3.1. While we expect that this should work well in most networks, no guarantees can be given here; there always exist some extreme cases in sampling algorithms in which no good estimate of the variance can be obtained.
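The following Python sketch shows one way to evaluate the relative error ε_r of Section 3.4 from an estimated standard deviation, sample size, and estimated mean. The bisection-based inverse of the upper-tail normal probability and the example numbers are our own choices, used only to keep the sketch dependency-free:

```python
import math

def relative_error(sigma_hat, n, mu_hat, delta=0.05):
    """epsilon_r = (sigma/sqrt(n)) / mu * Phi^{-1}(delta/2), where Phi(z)
    is the upper-tail standard normal probability (a sketch of the
    formula in Section 3.4)."""
    def upper_tail(z):                      # Phi_Z(z) = P(N(0,1) >= z)
        return 0.5 * math.erfc(z / math.sqrt(2.0))
    lo, hi = 0.0, 10.0
    target = delta / 2.0
    for _ in range(100):                    # bisection for Phi^{-1}(delta/2)
        mid = (lo + hi) / 2.0
        if upper_tail(mid) > target:
            lo = mid
        else:
            hi = mid
    return (sigma_hat / math.sqrt(n)) / mu_hat * lo

# Example: sample std 0.02, 10,000 samples, estimated mean 0.001, delta 0.05.
print(relative_error(0.02, 10_000, 0.001))   # roughly 0.39
```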

3.5 A Generalization of AIS-BN: The Problem of Estimating Pr(a|e)

A typical focus of systems based on Bayesian networks is the posterior probability of various outcomes of individual variables given evidence, Pr(a|e). This can be generalized to the computation of the posterior probability of a particular instantiation of a set of variables given evidence, i.e., Pr(A = a|e). There are two methods that are capable of performing this computation. The first method is very efficient at the expense of precision. The second method is less efficient, but offers in general better convergence rates. Both methods are based on Equation 7.

The first method reuses the samples generated to estimate Pr(e) in estimating Pr(a, e). Estimation of Pr(a, e) amounts to counting the score sum under the condition A = a. The main advantage of this method is its efficiency: we can use the same set of samples to estimate the posterior probability of any state of a subset of the network given evidence. Its main disadvantage is that the variance of the estimated Pr(a, e) can be large, especially when the numerical value of Pr(a|e) is extreme. This method is the most widely used approach in the existing stochastic sampling algorithms.

The second method, used much more rarely (e.g., Cano et al., 1996; Pradhan & Dagum, 1996; Dagum & Luby, 1997), calls for estimating Pr(e) and Pr(a, e) separately. After estimating Pr(e), an additional call to the algorithm is made for each instantiation a of the set of variables of interest A. Pr(a, e) is estimated by sampling the network with the set of observations e extended by A = a. The main advantage of this method is that it is much better at reducing variance than the first method. Its main disadvantage is the computational cost associated with sampling for possibly many combinations of states of nodes of interest.

Cano et al. (1996) suggested a modified version of the second method. Suppose that we are interested in the posterior distribution Pr(a_i|e) for all possible values a_i of A, i = 1, 2, ..., k. We can estimate Pr(a_i, e) for each i = 1, ..., k separately, and use the value ∑_{i=1}^{k} Pr(a_i, e) as an estimate for Pr(e). The assumption behind this approach is that the estimate of Pr(e) will be very accurate because of the large sample from which it is drawn. However, even if we can guarantee small variance in every Pr(a_i, e), we cannot guarantee that their sum will also have a small variance. So, in the AIS-BN algorithm we only use the pure form of each of the methods. The algorithm listed in Figure 2 is based on the first method when the optional computations in Steps 12 and 13 are performed.


An algorithm corresponding to the second method skips the optional steps and calls the basic AIS-BN algorithm twice to estimate Pr(e) and Pr(a, e) separately.

The first method is very attractive because of its simplicity and possible computational efficiency. However, as we have shown in Section 2.2, the performance of a sampling algorithm that uses just one set of samples (as in the first method above) to estimate Pr(a|e) will deteriorate if the difference between the optimal importance functions for Pr(a, e) and Pr(e) is large. If the main focus of the computation is high accuracy of the posterior probability distribution of a small number of nodes, we strongly recommend using the algorithm based on the second method. Also, this algorithm can easily be used to estimate confidence intervals of the solution.
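A minimal sketch of the second method follows. The function `estimate_probability_of_evidence` is a hypothetical stand-in for the basic AIS-BN algorithm of Figure 2; everything here is our own illustration of how the two calls combine via Equation 7:

```python
def posterior_by_second_method(network, evidence, node, value,
                               estimate_probability_of_evidence):
    """Second method of Section 3.5: call the sampler once for Pr(e) and
    once for Pr(a, e), obtained by extending the evidence with A = a,
    then take the ratio (Equation 7)."""
    p_e = estimate_probability_of_evidence(network, evidence)
    extended = dict(evidence)
    extended[node] = value
    p_ae = estimate_probability_of_evidence(network, extended)
    return p_ae / p_e
```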

4. Experimental Results

In this section, we first describe the experimental method used in our tests. Our tests focus on the CPCS network, which is one of the largest and most realistic networks available and for which we know precisely which nodes are observable. We were, therefore, able to generate very realistic test cases. Since the AIS-BN algorithm uses two initialization heuristics, we designed an experiment that studies the contribution of each of these two heuristics to the performance of the algorithm. To probe the extent of the AIS-BN algorithm's excellent performance, we test it on several real and large networks.

4.1 Experimental Method

We performed empirical tests comparing the AIS-BN algorithm to the likelihood weighting (LW) and the self-importance sampling (SIS) algorithms. The two algorithms are basically the state of the art general purpose belief updating algorithms. The AA (Dagum et al., 1995) and the bounded variance (Dagum & Luby, 1997) algorithms, which were suggested by a reviewer, are essentially enhanced special purpose versions of the basic LW algorithm. Our implementation of the three algorithms relied on essentially the same code, with separate functions only where the algorithms differed. It is fair to assume, therefore, that the observed differences are purely due to the theoretical differences among the algorithms and not due to the efficiency of implementation. In order to make the comparison of the AIS-BN algorithm to LW and SIS fair, we used the first method of computation (Section 3.5), i.e., one that relies on single sampling rather than calling the basic AIS-BN algorithm twice.

We measured the accuracy of approximation achieved by the simulation in terms of the Mean Square Error (MSE), i.e., the square root of the mean of squared differences between Pr′(x_ij) and Pr(x_ij), the sampled and the exact marginal probabilities of state j (j = 1, 2, ..., n_i) of node i, such that X_i ∉ E. More precisely,

MSE = √[ (1 / ∑_{X_i ∈ N\E} n_i) ∑_{X_i ∈ N\E} ∑_{j=1}^{n_i} ( Pr′(x_ij) − Pr(x_ij) )² ] ,

where N is the set of all nodes, E is the set of evidence nodes, and n_i is the number of outcomes of node i. In all diagrams, the reported MSE is averaged over 10 runs. We used the clustering algorithm (Lauritzen & Spiegelhalter, 1988) to compute the gold standard results for our comparisons of the mean square error. We performed all experiments on a Pentium II, 333 MHz Windows computer.
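The MSE computation itself is straightforward; a minimal Python sketch (our illustration, with a made-up one-node example) is:

```python
import math

def mean_square_error(approx, exact):
    """MSE of Section 4.1: the root of the average squared difference
    between approximate and exact marginals over all states of all
    non-evidence nodes.  Both arguments map each non-evidence node to a
    list of state probabilities."""
    total, count = 0.0, 0
    for node, exact_marginal in exact.items():
        for p_approx, p_exact in zip(approx[node], exact_marginal):
            total += (p_approx - p_exact) ** 2
            count += 1
    return math.sqrt(total / count)

print(mean_square_error({'A': [0.32, 0.68]}, {'A': [0.3, 0.7]}))
# -> approximately 0.02
```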

While MSE is not perfect, it is the simplest way of capturing error that lends itself to further theoretical analysis. For example, it is possible to derive analytically the idealized convergence rate in terms of MSE, which, in turn, can be used to judge the quality of the algorithm. MSE has been used in virtually all previous tests of sampling algorithms, which allows interested readers to tie the current results to past studies. A reviewer offered an interesting suggestion of using cross-entropy or some other technique that weights small changes near zero much more strongly than an equivalent size change in the middle of the [0, 1] interval. Such a measure would penalize the algorithm for imprecisions of possibly several orders of magnitude in very small probabilities. While this idea is interesting, we are not aware of any theoretical reasons as to why this measure would make a difference in comparisons between the AIS-BN, LW, and SIS algorithms. The MSE, as we mentioned above, will allow us to compare the empirically determined convergence rate to the theoretically derived ideal convergence rate. Theoretically, the MSE is inversely proportional to the square root of the sample size.

Since there are several tunable parameters used in the AIS-BN algorithm, we list the values of the parameters used in our tests: l = 2,500; w_k = 0 for k ≤ 9 and w_k = 1 otherwise. We stopped the updating process in Step 7 of Figure 2 once k ≥ 10. In other words, we used only the samples collected in the last stage of the algorithm. The learning parameters used in our algorithm are k_max = 10, a = 0.4, and b = 0.14 (see Equation 10). We used an empirically determined value of the threshold θ = 0.04 (Section 3.3). We only change the CPT tables of the parents of an evidence node A to uniform distributions when Pr(A = a) < 1/(2 · n_A). Some of the parameters are a matter of design decision (e.g., the number of samples in our tests), others were chosen empirically. Although we have found that these parameters may have different optimal values for different Bayesian networks, we used the above values in all our tests of the AIS-BN algorithm described in this paper. Since the same set of parameters led to spectacular improvement in accuracy in all tested networks, it is fair to say that the superiority of the AIS-BN algorithm over the other algorithms is not too sensitive to the values of the parameters.

For the SIS algorithm, wk = 1 by the design of the algorithm. We used l = 2,500. The updating function in Step 7 of Figure 1 is that of (Shwe et al., 1991; Cousins, Chen, & Frisse, 1993):

$$
\Pr{}^{k}_{\mathrm{new}}(x_i \mid \mathrm{pa}(X_i), \mathbf{e}) \;=\; \frac{\Pr(x_i \mid \mathrm{pa}(X_i)) \;+\; k \cdot \Pr{}_{\mathrm{current}}(x_i \mid \mathrm{pa}(X_i), \mathbf{e})}{1 + k}\,,
$$

where Pr(xi|pa(Xi)) is the original sampling distribution, Prcurrent(xi|pa(Xi), e) is an equivalent of our ICPT table estimator based on all currently available information, and k is the updating step.
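This update rule is straightforward to implement. A minimal sketch is given below; it assumes a hypothetical table representation in which each conditional distribution maps a parent configuration to a list of state probabilities, which is not the original implementation's data structure.

```python
def sis_update(prior_cpt, current_estimate, k):
    """One SIS importance-function update (Shwe et al., 1991; Cousins et al., 1993):
    Pr_new = (Pr_prior + k * Pr_current) / (1 + k), applied element-wise.

    `prior_cpt` and `current_estimate` map a parent configuration to a list of
    state probabilities for one node; `k` is the updating step.
    """
    updated = {}
    for parent_config, prior_probs in prior_cpt.items():
        current_probs = current_estimate[parent_config]
        updated[parent_config] = [
            (p_prior + k * p_cur) / (1.0 + k)
            for p_prior, p_cur in zip(prior_probs, current_probs)
        ]
    return updated
```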

4.2 Results for the CPCS Network

The main network used in our tests is a subset of the CPCS (Computer-based Patient Case Study) model (Pradhan et al., 1994), a large multiply-connected multi-layer network consisting of 422 multi-valued nodes and covering a subset of the domain of internal medicine. Among the 422 nodes, 14 nodes describe diseases, 33 nodes describe history and risk factors, and the remaining 375 nodes describe various findings related to the diseases. The CPCS network is among the largest real networks available to the research community at the present time. The CPCS network contains many extreme probabilities, typically on the order of 10^-4. Our analysis is based on a subset of 179 nodes of the CPCS network, created by Max Henrion and Malcolm Pradhan. We used this smaller version in order to be able to compute the exact solution for the purpose of measuring approximation error in the sampling algorithms.

The AIS-BN algorithm has some learning overhead. The following comparison of execution time versus number of samples may give the reader an idea of this overhead. Updating the CPCS network with 20 evidence nodes on our system takes the AIS-BN algorithm a total of 8.4 seconds to learn. It subsequently generates 3,640 samples per second, while the SIS algorithm generates 2,631 samples per second and the LW algorithm generates 4,167 samples per second. In order to remain conservative towards the AIS-BN algorithm, in all our experiments we fixed the execution time of the algorithms (our limit was 60 seconds) rather than the number of samples. In the CPCS network with 20 evidence nodes, in 60 seconds AIS-BN generates about 188,000 samples, SIS generates about 158,000 samples, and LW generates about 250,000 samples.

Figure 4: The probability distribution of evidence Pr(E = e) in our experiments.

We generated a total of 75 test cases consisting of five sequences of 15 test cases each. We ran each test case 10 times, each time with a different setting of the random number seed. Each sequence had a progressively higher number of evidence nodes: 15, 20, 25, 30, and 35 evidence nodes, respectively. The evidence nodes were chosen randomly (equiprobable sampling without replacement) from those nodes that described various plausible medical findings. Almost all of these nodes were leaf nodes in the network. We believe that this constituted a very realistic set of test cases for the algorithms. The distribution of the prior probability of evidence, Pr(E = e), across all test runs of our experiments is shown in Figure 4. The least likely evidence was 5.54 × 10^-42, the most likely evidence was 1.37 × 10^-9, and the median was 7 × 10^-24.

Figure 5: A typical plot of convergence of the tested sampling algorithms in our experiments: Mean Square Error as a function of the execution time for a subset of the CPCS network with 20 evidence nodes chosen randomly among plausible medical observations (Pr(E = e) = 3.33 × 10^-26 in this particular case) for the AIS-BN, the SIS, and the LW algorithms. The curve for the AIS-BN algorithm is very close to the horizontal axis.

Figures 5 and 6 show a typical plot of convergence of the tested sampling algorithms in our experiments. The case illustrated involves updating the CPCS network with 20 evidence nodes. We plot the MSE after the initial 15 seconds, during which the algorithms start converging. In particular, the learning step of the AIS-BN algorithm is usually completed within the first 9 seconds. We ran the three algorithms in this case for 150 seconds rather than the 60 seconds used in the actual experiment in order to be able to observe a wider range of convergence. The plot of the MSE for the AIS-BN algorithm almost touches the X axis in Figure 5. Figure 6 shows the same plot at a finer scale in order to show more detail in the AIS-BN convergence curve. It is clear that the AIS-BN algorithm dramatically improves the convergence rate. We can also see that the results of AIS-BN converge to the exact results very fast as the sampling time increases.


Figure 6: The lower part of the plot of Figure 5, showing the convergence of the AIS-BN algorithm to the correct posterior probabilities.

In the case captured in Figures 5 and 6, a tenfold increase in the sampling time (which, after subtracting the learning overhead of the AIS-BN algorithm, corresponds to a 21.5-fold increase in the number of samples) results in a 4.55-fold decrease of the MSE (to MSE ≤ 0.00048). The observed convergence of both the SIS and LW algorithms was poor; a tenfold increase in sampling time had practically no effect on their accuracy. Please note that this is a very typical case observed in our experiments.
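This observed rate agrees well with the theoretical behaviour mentioned earlier: if the MSE decreases as the inverse square root of the number of samples, then a 21.5-fold increase in the number of samples should yield roughly

$$
\frac{\mathrm{MSE}_{\mathrm{before}}}{\mathrm{MSE}_{\mathrm{after}}} \;\approx\; \sqrt{21.5} \;\approx\; 4.64\,,
$$

which is close to the 4.55-fold decrease observed.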

             Original CPT   Exact ICPT   Learned ICPT
 "Absent"    0.99631        0.0037       0.015
 "Mild"      0.00183        0.1560       0.164
 "Moderate"  0.00093        0.1190       0.131
 "Severe"    0.00093        0.7213       0.690

Table 1: A fragment of the conditional probability table of a node of the CPCS network (node gasAcute, parents hepAcute=Mild and wbcTotTho=False) for the case captured in Figure 6.

Figure 7 illustrates the ICPT learning process of the AIS-BN algorithm for the sample case shown in Figure 6. The displayed conditional probabilities belong to the node gasAcute, which is a parent of two evidence nodes, difInfGasMuc and abdPaiExaMea. The node gasAcute has four states, "absent," "mild," "moderate," and "severe," and two parents. We randomly chose a combination of its parents' states as our displayed configuration. The original CPT for this configuration without evidence, the exact ICPT with evidence, and the learned ICPT with evidence are summarized numerically in Table 1.


Figure 7: Convergence of the conditional probabilities during the example run of the AIS-BN algorithm captured in Figure 6. The displayed fragment of the conditional probability table belongs to node gasAcute, which is a parent of one of the evidence nodes.

Figure 7 illustrates that the learned importance conditional probabilities begin to converge stably to the exact results after three updating steps. The learned probabilities in Step 10 are close to the exact results. In this example, the difference between Pr(xi|pa(Xi), e) and Pr(xi|pa(Xi)) is very large. Sampling from Pr(xi|pa(Xi)) instead of Pr(xi|pa(Xi), e) would introduce large variance into our results.

          AIS-BN     SIS      LW
 µ        0.00082    0.110    0.148
 σ        0.00022    0.076    0.093
 min      0.00049    0.0016   0.0031
 median   0.00078    0.105    0.154
 max      0.00184    0.316    0.343

Table 2: Summary of the simulation results for all of the 75 simulation cases on the CPCS network. Figure 8 shows each of the 75 cases graphically.

Figure 8 shows the MSE for all 75 test cases in our experiments, with the summary statistics in Table 2. A paired one-tailed t-test showed statistically highly significant differences between the AIS-BN and SIS algorithms (p < 3.1 × 10^-20).


Figure 8: Performance of the AIS-BN, SIS, and LW algorithms: Mean Square Error for each of the 75 individual test cases plotted against the probability of evidence. The sampling time is 60 seconds.

The difference between the SIS and LW algorithms was also statistically significant (p < 1.7 × 10^-8). As far as the magnitude of the difference is concerned, AIS-BN was significantly better than SIS. SIS was better than LW, but the difference was small. The mean MSEs of the SIS and LW algorithms were both greater than 0.1, which suggests that neither of these algorithms is suitable for large Bayesian networks.
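The significance test used here is a standard paired one-tailed t-test over the per-case MSE values. A sketch of how such a comparison can be carried out with SciPy is shown below; the arrays are hypothetical placeholders, not the actual 75 per-case errors plotted in Figure 8.

```python
# Sketch of a paired one-tailed t-test over per-case MSE values.
from scipy import stats

mse_ais_bn = [0.0008, 0.0007, 0.0009]  # hypothetical per-case MSEs for AIS-BN
mse_sis    = [0.11,   0.09,   0.13]    # hypothetical per-case MSEs for SIS

t_stat, p_two_sided = stats.ttest_rel(mse_sis, mse_ais_bn)
# One-tailed p-value for the hypothesis that SIS has the larger error:
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(t_stat, p_one_sided)
```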

The graph in Figure 9 shows the MSE ratio between the SIS and AIS-BN algorithms. The percentage of cases whose ratio was greater than 100 (two orders of magnitude improvement!) is 60%. In other words, we obtained two orders of magnitude improvement in MSE in more than half of the cases. In 80% of the cases, the ratio was greater than 50. The smallest ratio in our experiments was 2.67, which happened when the posterior probabilities were dominated by the prior probabilities. In that case, even though the LW and SIS algorithms converged very fast, their MSE was still far larger than that of AIS-BN.

Our next experiment aimed at showing how closely the AIS-BN algorithm can approach the best possible sampling results. If we knew the optimal importance sampling function, the convergence of the AIS-BN algorithm would be the same as that of forward sampling without evidence. In other words, the results of the probabilistic logic sampling algorithm without evidence approach the limit of how well stochastic sampling can perform. We ran the logic sampling algorithm on the CPCS network without evidence, mimicking the test runs of the AIS-BN algorithm, i.e., 5 blocks of 15 runs, each repeated 10 times with a different random number seed. The number of samples generated was equal to the average number of samples generated by the AIS-BN algorithm for each series of 15 test runs.


Figure 9: The ratio of MSE between SIS and AIS-BN versus the percentage of cases.

We obtained an average MSE of µ = 0.00057, with σ = 0.000025, min = 0.00052, and max = 0.00065. The best achievable results should be around this range. From Table 2, we can see that the minimum MSE for the AIS-BN algorithm was 0.00049, within the range of the optimal result. The mean MSE of AIS-BN is 0.00082, not too far from the optimal results. The standard deviation, σ, is significantly larger for the AIS-BN algorithm, but this is understandable given that the process of learning the optimal importance function is heuristic in nature. It is not difficult to understand why there exists a difference between the AIS-BN results and the optimal results. First, the AIS-BN algorithm in our tests updated the sampling distribution only 10 times, which may be too few iterations to let it converge to the optimal importance distribution. Second, even if the algorithm has converged to the optimal importance distribution, the sampling algorithm will still let the learned parameters oscillate around this distribution, so there will always be small differences between the two distributions.

Figure 10 shows the convergence rate for all tested cases for a four-fold increase in sampling time (between 15 and 60 seconds). We adjusted the convergence ratio of the AIS-BN algorithm by dividing it by a constant. According to Equation 3, the theoretically expected convergence ratio for a four-fold increase in the number of samples should be around two. About 96% of the AIS-BN runs have a ratio that lies in the interval (1.75, 2.25], in sharp contrast to 11% and 13% of the cases for the SIS and LW algorithms. The ratios of the remaining 4% of the AIS-BN cases lie in the interval [2.25, 2.5]. For the SIS and LW algorithms, the percentage of cases whose ratio was smaller than 1.5 was 71% and 77%, respectively. A ratio of less than 1.5 means that the number of samples was too small to estimate the variance and the results cannot be trusted.


Figure 10: The distribution of the convergence ratio of the AIS-BN, SIS, and LW algorithms when the number of samples increases four times.

A ratio greater than 2.25 possibly means that 60 seconds was long enough to estimate the variance, but 15 seconds was too short.

4.3 The Role of AIS-BN Heuristics in Performance Improvement

From the above experimental results we can see that the AIS-BN algorithm can improve the sampling performance significantly. Our next series of tests focused on studying the role of the two AIS-BN initialization heuristics. The first is initializing the ICPT tables of the parents of evidence nodes to uniform distributions, denoted by U. The second is adjusting small probabilities, denoted by S. We denote AIS-BN without any heuristic initialization method as the AIS algorithm; AIS+U+S equals AIS-BN. We compared the following versions of the algorithms: SIS, AIS, SIS+U, AIS+U, SIS+S, AIS+S, SIS+U+S, and AIS+U+S. All algorithms based on SIS used the same number of samples as SIS. All algorithms based on AIS used the same number of samples as AIS-BN. We tested these algorithms on the same 75 test cases used in the previous experiment. Figure 11 shows the MSE for each of the sampling algorithms, with the summary statistics in Table 3. Even though the AIS algorithm is better than the SIS algorithm, the difference is not as large as in the case of the AIS+U, AIS+S, and AIS-BN algorithms. It seems that the heuristic initialization methods help a great deal. The results for the SIS+S, SIS+U, and SIS+U+S algorithms suggest that although heuristic initialization can improve performance, it alone cannot improve it very much. It is fair to say that the significant performance improvement in the AIS-BN algorithm comes from the combination of AIS with the heuristic methods, not from any method alone.


This is not difficult to understand: only with good heuristic initialization methods is it possible for the learning process to quickly exit oscillation areas. Although both the S and U methods alone can improve the performance, the improvement is moderate compared to the combination of the two.

Figure 11: A comparison of different algorithms in the CPCS network. Each bar is based on 75 test cases. The dotted bar shows the MSE for the SIS algorithm, while the gray bar shows the MSE for the AIS algorithm.

          SIS      AIS       SIS+U    AIS+U     SIS+S     AIS+S     SIS+U+S   AIS-BN
 µ        0.110    0.060     0.050    0.0084    0.075     0.0015    0.050     0.00082
 σ        0.076    0.049     0.052    0.025     0.074     0.0016    0.059     0.00022
 min      0.0016   0.00074   0.0011   0.00058   0.00072   0.00056   0.00086   0.00049
 median   0.105    0.045     0.031    0.0014    0.052     0.00087   0.028     0.00078
 max      0.316    0.207     0.212    0.208     0.279     0.0085    0.265     0.0018

Table 3: Summary of the simulation results for different algorithms in the CPCS network.

4.4 Results for Other Networks

In order to make sure that the AIS-BN algorithm performs well in general, we tested it on two other large networks.

The first network that we used in our tests is the PathFinder network (Heckerman et al., 1990), which is the core element of an expert system that assists surgical pathologists with the diagnosis of lymph-node diseases. There are two versions of this network. We used the larger version, consisting of 135 nodes. In contrast to the CPCS network, PathFinder contains many conditional probabilities that are equal to 1, which reflects deterministic relationships in certain settings. To make the sampling challenging, we randomly selected 20 evidence nodes from among the leaf nodes. Each of these was an observable node (David Heckerman, personal communication). We verified in each case that the probability of the evidence so selected was not equal to zero.

We fixed the execution time of the algorithms to be 60 seconds. The learning overhead for the AIS-BN algorithm in the PathFinder network was about 3.5 seconds. In 60 seconds, AIS-BN generated about 366,000 samples, SIS generated about 250,000 samples, and LW generated about 2,700,000 samples. The reason why LW could generate more than 10 times as many samples as SIS within the same amount of time is that the LW algorithm terminates sample generation at a very early stage in many samples, when the weight of a sample becomes zero. This is a result of the determinism in the probability tables mentioned above. We will see that LW benefits greatly from generating more samples. The other parameters used in AIS-BN were the same as those used in the CPCS network.

We tested 20 cases, each with 20 randomly selected evidence nodes. The reported MSE for each case was averaged over 10 runs. Some of the runs of the SIS and LW algorithms did not manage to generate any effective samples (the sum of the sample weights was equal to zero). SIS had only 75% effective runs and LW had only 89% effective runs, which means that in some runs SIS and LW were unable to yield any information about the posterior distributions. In all those cases, we discarded the run and averaged only over the effective runs. All runs of the AIS-BN algorithm were effective. We report our experimental results with the summary statistics in Table 4. From these data, we can see that the AIS-BN algorithm is still significantly better than the SIS and LW algorithms. Since the LW algorithm can generate more than ten times as many samples as the SIS algorithm, its performance is better than that of the SIS algorithm.

                 AIS-BN     SIS       LW
 µ               0.00050    0.166     0.089
 σ               0.00037    0.107     0.0707
 min             0.00025    0.00116   0.00080
 median          0.00037    0.184     0.0866
 max             0.0017     0.467     0.294
 effective runs  200        150       178

Table 4: Summary of the simulation results for all of the 20 simulation cases on the PathFinder network.

The second network that we tested was one of the ANDES networks (Conati et al., 1997). ANDES is an intelligent tutoring system for classical Newtonian physics that is being developed by a team of researchers at the Learning Research and Development Center at the University of Pittsburgh and researchers at the United States Naval Academy. The student model in ANDES uses a Bayesian network to do long-term knowledge assessment, plan recognition, and prediction of students' actions during problem solving. We selected the largest ANDES network that was available to us, consisting of 223 nodes.

In contrast to the previous two networks, the depth of the ANDES network was significantly larger and so was its connectivity. There were only 22 leaf nodes. It is quite predictable that this kind of network will pose difficulties for learning. We selected 20 evidence nodes randomly from the potential evidence nodes and tested 20 cases. All parameters were the same as those used in the CPCS network. We fixed the execution time of the algorithms to be 60 seconds. The learning overhead for the AIS-BN algorithm in the ANDES network was 13.4 seconds. In 60 seconds, AIS-BN generated about 114,000 samples, SIS generated about 98,000 samples, and LW generated about 180,000 samples. In this network, LW can still generate almost twice the number of samples generated by the SIS algorithm.

We report our experimental results with the summary statistics in Table 5. The results show that in the ANDES network, too, the AIS-BN algorithm was significantly better than the SIS and LW algorithms. Since LW generated almost twice the number of samples generated by the SIS algorithm, its performance was better than that of the SIS algorithm.

          AIS-BN    SIS      LW
 µ        0.0059    0.0628   0.0404
 σ        0.0049    0.102    0.0539
 min      0.0023    0.0028   0.0028
 median   0.0045    0.0190   0.0198
 max      0.0237    0.321    0.221

Table 5: Summary of the simulation results for all of the 20 simulation cases on the ANDES network.

While the AIS-BN algorithm is on average an order of magnitude more precise than the other two algorithms, the performance improvement is smaller than in the other two networks. There are several reasons why the improvement of the AIS-BN algorithm over the SIS and LW algorithms in the ANDES network is smaller than in the CPCS and PathFinder networks: (1) The ANDES network used in our tests was apparently not challenging enough for sampling algorithms in general; in the ANDES network, SIS and LW also perform well in some cases, and the minimum MSE of SIS and LW in our test cases is almost the same as that of AIS-BN. (2) The number of samples generated by AIS-BN in this network is significantly smaller than in the previous two networks, and AIS-BN needs more time to learn. Although increasing the number of samples would improve the performance of all three algorithms, it improves the performance of AIS-BN more, since the convergence ratio of the AIS-BN algorithm is usually larger than that of SIS and LW (see Figure 10). (3) The parameters that we used in this network were tuned for the CPCS network. (4) The large depth and the smaller number of leaf nodes of the ANDES network pose some difficulties for learning.


5. Discussion

There is a fundamental trade-off in the AIS-BN algorithm between the time spent on learning the importance function and the time spent on sampling. Our current approach, which we believe to be reasonable, is to stop learning at the point when the importance function is good enough. In our experiments we stopped learning after 10 iterations.

There are several ways of improving the initialization of the conditional probability tables at the outset of the AIS-BN algorithm. In the current version of the algorithm, we initialize the ICPT table of every parent N of an evidence node E (N ∈ Pa(E), E ∈ E) to the uniform distribution when Pr(E = e) < 1/(2 · nE). This can be improved further. We can extend the initialization to those nodes that are severely affected by the evidence. They can be identified by examining the network structure and the local CPTs.
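A minimal sketch of the current initialization rule is given below. It assumes that Pr(E = e) is the prior probability of the observed state and that ICPTs are stored as dictionaries of probability vectors; the accessor names (parents, num_outcomes, prior_prob, icpt) are ours and purely illustrative, not the original implementation's API.

```python
def initialize_icpt_for_evidence(network, evidence):
    """Heuristic initialization: for every parent N of an evidence node E,
    set N's importance CPT (ICPT) to uniform distributions whenever the
    observed state is unlikely, i.e., Pr(E = e) < 1 / (2 * n_E),
    where n_E is the number of outcomes of E.
    """
    for e_node, e_state in evidence.items():
        n_e = network.num_outcomes(e_node)
        # Assumed meaning: prior probability of the observed state of E.
        if network.prior_prob(e_node, e_state) < 1.0 / (2 * n_e):
            for parent in network.parents(e_node):
                icpt = network.icpt(parent)          # dict: parent config -> probs
                k = network.num_outcomes(parent)
                for parent_config in icpt:
                    icpt[parent_config] = [1.0 / k] * k
    return network
```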

We can view the learning process of the AIS-BN algorithm as a network rebuilding process. The algorithm constructs a new network whose structure is the same as that of the original network (except that we delete the evidence nodes and the corresponding arcs). The constructed network models the joint probability distribution ρ(X\E) in Equation 8, which approaches the optimal importance function. We use the learned ρ′ to approximate this distribution. If ρ′ approximates Pr(X|E) accurately enough, we can use this new network to solve other approximation tasks, such as computing the Maximum A Posteriori assignment (MAP) (Pearl, 1988), finding the k most likely scenarios (Seroussi & Golmard, 1994), etc. A large advantage of this approach is that we can solve each of these problems as if the network had no evidence nodes.

We know that Markov blanket scoring can improve convergence rates in some sampling algorithms (Shwe & Cooper, 1991). It may also be applied to the AIS-BN algorithm to improve its convergence rate. According to Property 4 (Section 2.1), any technique that can reduce the variance σ²_{Pr(e)} will reduce the variance of Pr(e) and correspondingly improve the sampling performance. Since the variance of stratified sampling (Rubinstein, 1981) is never much worse than that of random sampling, and can be much better, it can improve the convergence rate. We expect that other variance reduction methods from statistics, such as (i) using the expected value of a random variable, (ii) antithetic variates and other correlation-based methods (stratified sampling, Latin hypercube sampling, etc.), and (iii) systematic sampling, will also improve the sampling performance.

The current learning algorithm uses a simple approach. Heuristic learning methods, such as adjusting the learning rate according to changes of the error (Jacobs, 1988), should also be applicable to our algorithm. There are several tunable parameters in the AIS-BN algorithm; finding the optimal values of these parameters for any given network is another interesting research topic.

It is worth observing that the plots presented in Figure 8 are fairly flat. In other words, in our tests the convergence of the sampling algorithms did not depend too strongly on the probability of evidence. This seems to contradict the common belief that forward sampling schemes suffer from unlikely evidence. AIS-BN, for one, shows a fairly flat plot. The convergence of the SIS and LW algorithms seems to decrease slightly with unlikely evidence. It is possible that all three algorithms will perform much worse when the probability of evidence drops below some threshold value, which our tests failed to approach. Until this relationship has been studied carefully, we conjecture that the probability of evidence is not a good measure of the difficulty of approximate inference.

Given that the problem of approximating probabilistic inference is NP-hard, there exist networks that will be challenging for any algorithm, and we have no doubt that even the AIS-BN algorithm will perform poorly on them. To date, we have not found such networks. There is one characteristic of networks that may be challenging to the AIS-BN algorithm: in general, when the number of parameters that need to be learned by the AIS-BN algorithm increases, its performance will deteriorate. Nodes with many parents, for example, are challenging to the AIS-BN learning algorithm, as it has to update the ICPT tables under all combinations of the parent nodes. It is possible that conditional probability distributions with causal independence properties, such as Noisy-OR distributions (Pearl, 1988; Henrion, 1989; Diez, 1993; Srinivas, 1993; Heckerman & Breese, 1994), which are common in very large practical networks, can be treated differently and lead to considerable savings in the learning time.

One direction of testing approximate algorithms, suggested to us by a reviewer, is to use very large networks for which the exact solution cannot be computed at all. In this case, one can try to infer from the difference in variance at various stages of the algorithm whether it is converging or not. This is a very interesting idea that is worth exploring, especially when combined with theoretical work on stopping criteria in the line of the work of Dagum and Luby (1997).

6. Conclusion

Computational complexity remains a major problem in the application of probability theory and decision theory in knowledge-based systems. It is important to develop schemes that improve the performance of updating algorithms: even though the theoretically demonstrated worst case will remain NP-hard, many practical cases may become tractable.

In this paper, we studied importance sampling in Bayesian networks. After reviewing the most important theoretical results related to importance sampling in finite-dimensional integrals, we proposed a new algorithm for importance sampling in Bayesian networks that we call adaptive importance sampling (AIS-BN). While the process of learning the optimal importance function for the AIS-BN algorithm is computationally intractable, based on the theory of importance sampling in finite-dimensional integrals we proposed several heuristics that seem to work very well in practice. We proposed heuristic methods for initializing the importance function that we have shown to accelerate the learning process, a smooth learning method for updating the importance function that uses the structural advantages of Bayesian networks, and a dynamic weighting function for combining samples from different stages of the algorithm. All of these methods help the AIS-BN algorithm to obtain fairly accurate estimates of the posterior probabilities in a limited time. Of the two heuristics applied, the adjustment of small probabilities seems to lead to the larger improvement in performance, although the largest decrease in MSE is achieved by the combination of the two heuristics with the AIS-BN algorithm.

The AIS-BN algorithm can lead to a dramatic improvement in convergence rates in large Bayesian networks with evidence compared to the existing state of the art algorithms. We compared the performance of the AIS-BN algorithm to the performance of likelihood weighting and self-importance sampling on a large practical model, the CPCS network, with evidence as unlikely as 5.54 × 10^-42 and typically around 7 × 10^-24. In our experiments, we observed that the AIS-BN algorithm was always better than likelihood weighting and self-importance sampling, and in over 60% of the cases it reached over two orders of magnitude improvement in accuracy. Tests performed on the other two networks, PathFinder and ANDES, yielded similar results.

Although there may exist other approximate algorithms that will prove superior to AIS-BN in networks with a special structure or distribution, the AIS-BN algorithm is simple and robust for general evidential reasoning problems in large multiply-connected Bayesian networks.

Acknowledgments

We thank the anonymous referees for several insightful comments that led to a substantial improvement of the paper. This research was supported by the National Science Foundation under the Faculty Early Career Development (CAREER) Program, grant IRI-9624629, and by the Air Force Office of Scientific Research grants F49620-97-1-0225 and F49620-00-1-0112. An earlier version of this paper received the 2000 School of Information Sciences Robert R. Korfhage Award, University of Pittsburgh. Malcolm Pradhan and Max Henrion of the Institute for Decision Systems Research shared with us the CPCS network with the kind permission of the developers of the Internist system at the University of Pittsburgh. We thank David Heckerman for the PathFinder network and Abigail Gertner for the ANDES network used in our tests. All experimental data have been obtained using SMILE, a Bayesian inference engine developed at the Decision Systems Laboratory and available at http://www2.sis.pitt.edu/~genie.

References

Cano, J. E., Hernandez, L. D., & Moral, S. (1996). Importance sampling algorithms for the propagation of probabilities in belief networks. International Journal of Approximate Reasoning, 15, 77–92.

Chavez, M. R., & Cooper, G. F. (1990). A randomized approximation algorithm for probabilistic inference on Bayesian belief networks. Networks, 20(5), 661–685.

Cheng, J., & Druzdzel, M. J. (2000a). Computational investigations of low-discrepancy sequences in simulation algorithms for Bayesian networks. In Proceedings of the Sixteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI–2000), pp. 72–81, San Francisco, CA. Morgan Kaufmann Publishers.

Cheng, J., & Druzdzel, M. J. (2000b). Latin hypercube sampling in Bayesian networks. In Proceedings of the 13th International Florida Artificial Intelligence Research Symposium Conference (FLAIRS-2000), pp. 287–292, Orlando, Florida.

Conati, C., Gertner, A. S., VanLehn, K., & Druzdzel, M. J. (1997). On-line student modeling for coached problem solving using Bayesian networks. In Proceedings of the Sixth International Conference on User Modeling (UM–96), pp. 231–242, Vienna, New York. Springer Verlag.

Cooper, G. F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2–3), 393–405.

Cousins, S. B., Chen, W., & Frisse, M. E. (1993). A tutorial introduction to stochastic simulation algorithms for belief networks. In Artificial Intelligence in Medicine, chap. 5, pp. 315–340. Elsevier Science Publishers B.V.

Dagum, P., Karp, R., Luby, M., & Ross, S. (1995). An optimal algorithm for Monte Carlo estimation (extended abstract). In Proceedings of the 36th IEEE Symposium on Foundations of Computer Science, pp. 142–149, Portland, Oregon.

Dagum, P., & Luby, M. (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60(1), 141–153.

Dagum, P., & Luby, M. (1997). An optimal approximation algorithm for Bayesian inference. Artificial Intelligence, 93, 1–27.

Diez, F. J. (1993). Parameter adjustment in Bayes networks. The generalized noisy OR-gate. In Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence (UAI–93), pp. 99–105, San Francisco, CA. Morgan Kaufmann Publishers.

Fishman, G. S. (1995). Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag.

Fung, R., & Chang, K.-C. (1989). Weighing and integrating evidence for stochastic simulation in Bayesian networks. In Uncertainty in Artificial Intelligence 5, pp. 209–219, New York, NY. Elsevier Science Publishing Company, Inc.

Fung, R., & del Favero, B. (1994). Backward simulation in Bayesian networks. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI–94), pp. 227–234, San Francisco, CA. Morgan Kaufmann Publishers.

Geman, S., & Geman, D. (1984). Stochastic relaxations, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721–742.

Gilks, W., Richardson, S., & Spiegelhalter, D. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall.

Heckerman, D., & Breese, J. S. (1994). A new look at causal independence. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI–94), pp. 286–292, San Mateo, CA. Morgan Kaufmann Publishers, Inc.

Heckerman, D. E., Horvitz, E. J., & Nathwani, B. N. (1990). Toward normative expert systems: The Pathfinder project. Tech. rep. KSL–90–08, Medical Computer Science Group, Section on Medical Informatics, Stanford University, Stanford, CA.


Henrion, M. (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Uncertainty in Artificial Intelligence 2, pp. 149–163, New York, NY. Elsevier Science Publishing Company, Inc.

Henrion, M. (1989). Some practical issues in constructing belief networks. In Kanal, L., Levitt, T., & Lemmer, J. (Eds.), Uncertainty in Artificial Intelligence 3, pp. 161–173. Elsevier Science Publishers B.V., North Holland.

Henrion, M. (1991). Search-based methods to bound diagnostic probabilities in very large belief nets. In Proceedings of the Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI–91), pp. 142–150, San Mateo, California. Morgan Kaufmann Publishers.

Hernandez, L. D., Moral, S., & Antonio, S. (1998). A Monte Carlo algorithm for probabilistic propagation in belief networks based on importance sampling and stratified simulation techniques. International Journal of Approximate Reasoning, 18, 53–91.

Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1, 295–307.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), 50(2), 157–224.

MacKay, D. (1998). Introduction to Monte Carlo methods. In Jordan, M. I. (Ed.), Learning in Graphical Models. The MIT Press, Cambridge, Massachusetts.

Ortiz, L. E., & Kaelbling, L. P. (2000). Adaptive importance sampling for estimation in structured domains. In Proceedings of the Sixteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI–2000), pp. 446–454, San Francisco, CA. Morgan Kaufmann Publishers.

Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3), 241–288.

Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32, 245–257.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., San Mateo, CA.

Pradhan, M., & Dagum, P. (1996). Optimal Monte Carlo inference. In Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence (UAI–96), pp. 446–453, San Francisco, CA. Morgan Kaufmann Publishers.

Pradhan, M., Provan, G., Middleton, B., & Henrion, M. (1994). Knowledge engineering for large belief networks. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI–94), pp. 484–490, San Francisco, CA. Morgan Kaufmann Publishers.


Ritter, H., Martinetz, T., & Schulten, K. (1991). Neuronale Netze. Addison-Wesley, München.

Rubinstein, R. Y. (1981). Simulation and the Monte Carlo Method. John Wiley & Sons.

Seroussi, B., & Golmard, J. L. (1994). An algorithm directly finding the K most probable configurations in Bayesian networks. International Journal of Approximate Reasoning, 11, 205–233.

Shachter, R. D., & Peot, M. A. (1989). Simulation approaches to general probabilistic inference on belief networks. In Uncertainty in Artificial Intelligence 5, pp. 221–231, New York, NY. Elsevier Science Publishing Company, Inc.

Shwe, M. A., & Cooper, G. F. (1991). An empirical analysis of likelihood-weighting simulation on a large, multiply-connected medical belief network. Computers and Biomedical Research, 24(5), 453–475.

Shwe, M., Middleton, B., Heckerman, D., Henrion, M., Horvitz, E., & Lehmann, H. (1991). Probabilistic diagnosis using a reformulation of the INTERNIST–1/QMR knowledge base: I. The probabilistic model and inference algorithms. Methods of Information in Medicine, 30(4), 241–255.

Srinivas, S. (1993). A generalization of the noisy-OR model. In Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence (UAI–93), pp. 208–215, San Francisco, CA. Morgan Kaufmann Publishers.

York, J. (1992). Use of the Gibbs sampler in expert systems. Artificial Intelligence, 56, 115–130.
