Computer Science CPSC 502
Lecture 9
(up-to Ch. 6.4.2.4)
Lecture Overview
• Exact Inference: Variable Elimination
  • Factors
  • Algorithm
• Approximate Inference: sampling methods
  • Forward (prior) sampling
  • Rejection sampling
  • Likelihood weighting
Bayesian Networks: Types of Inference

Network: Fire → Alarm → Leaving (plus, for intercausal reasoning, Smoking at Sensor → Alarm)

• Diagnostic: People are leaving (L=t). P(F|L=t) = ?
• Predictive: Fire happens (F=t). P(L|F=t) = ?
• Intercausal: Alarm goes off (P(a) = 1.0) and a person smokes next to the sensor (S=t). P(F|A=t,S=t) = ?
• Mixed: People are leaving (L=t) and there is no fire (F=f). P(A|F=f,L=t) = ?

We will use the same reasoning procedure for all of these types
Inference

Def. of conditional probability:

P(Y | E1=e1, …, Ej=ej) = P(Y, E1=e1, …, Ej=ej) / P(E1=e1, …, Ej=ej)

• Y is the query variable;
• E1=e1, …, Ej=ej are the observed variables (with their values)
• Z1, …, Zk are the remaining variables

We need to compute the numerator for each value yi of Y, by marginalizing over all the variables Z1, …, Zk not involved in the query:

P(Y=yi, E1=e1, …, Ej=ej) = Σ_Z1 … Σ_Zk P(Y=yi, Z1, …, Zk, E1=e1, …, Ej=ej)

To compute the denominator, marginalize over Y. It has the same value for every P(Y=yi): the normalization constant ensuring that Σ_i P(Y=yi) = 1.

• Variable Elimination is an algorithm that efficiently performs these operations by casting them as operations between factors
Factors
• A factor is a function from a tuple of random variables to the real numbers R
• We write a factor on variables X1, …, Xj as f(X1, …, Xj)
• A factor denotes one or more (possibly partial) distributions over the given tuple of variables, e.g.,
  • P(X1, X2) is a factor f(X1, X2): a distribution
  • P(Z | X,Y) is a factor f(Z,X,Y): a set of distributions, one for each combination of values for X and Y
  • P(Z=f | X,Y) is a factor f(X,Y)Z=f: a set of partial distributions
• Note: Factors do not have to sum to one

Example factor f(X,Y,Z):

X Y Z | val
t t t | 0.1
t t f | 0.9
t f t | 0.2
t f f | 0.8
f t t | 0.4
f t f | 0.6
f f t | 0.3
f f f | 0.7
Operation 1: assigning a variable
• We can make new factors out of an existing factor
• Our first operation: we can assign some or all of the variables of a factor.

f(X,Y,Z):

X Y Z | val
t t t | 0.1
t t f | 0.9
t f t | 0.2
t f f | 0.8
f t t | 0.4
f t f | 0.6
f f t | 0.3
f f f | 0.7

What is the result of assigning X = t?

f(X=t,Y,Z) = f(X,Y,Z)X=t is a factor of Y and Z:

Y Z | val
t t | 0.1
t f | 0.9
f t | 0.2
f f | 0.8
More examples of assignment

f(X,Y,Z):

X Y Z | val
t t t | 0.1
t t f | 0.9
t f t | 0.2
t f f | 0.8
f t t | 0.4
f t f | 0.6
f f t | 0.3
f f f | 0.7

f(X=t,Y,Z) — a factor of Y, Z:

Y Z | val
t t | 0.1
t f | 0.9
f t | 0.2
f f | 0.8

f(X=t,Y,Z=f) — a factor of Y:

Y | val
t | 0.9
f | 0.8

f(X=t,Y=f,Z=f) — a number: 0.8
Operation 2: Summing out a variable
• Our second operation on factors: we can marginalize out (or sum out) a variable
  – Exactly as before. Only difference: factors don’t sum to 1
  – Marginalizing out a variable X from a factor f(X, X1, …, Xj) yields a new factor defined on {X1, …, Xj}:

    (Σ_X f)(X1, …, Xj) = Σ_{x ∈ dom(X)} f(X=x, X1, …, Xj)

f3:

B A C | val
t t t | 0.03
t t f | 0.07
f t t | 0.54
f t f | 0.36
t f t | 0.06
t f f | 0.14
f f t | 0.48
f f f | 0.32

What is (Σ_B f3)(A,C)?

A C | val
t t |
t f |
f t |
f f |
Answer:

(Σ_B f3)(A,C):

A C | val
t t | 0.57
t f | 0.43
f t | 0.54
f f | 0.46
Operation 3: multiplying factors

f1(A,B):

A B | val
t t | 0.1
t f | 0.9
f t | 0.2
f f | 0.8

f2(B,C):

B C | val
t t | 0.3
t f | 0.7
f t | 0.6
f f | 0.4

f1(A,B) × f2(B,C):

A B C | val
t t t |
t t f |
t f t |
t f f | …
f t t |
f t f |
f f t |
f f f |
Recap: Factors and Operations on Them
• If we assign variable A=a in factor f7(A,B), what is the correct form for the resulting factor?
  – f(B). When we assign variable A we remove it from the factor’s domain
• If we marginalize variable A out from factor f7(A,B), what is the correct form for the resulting factor?
  – f(B). When we marginalize out variable A we remove it from the factor’s domain
• If we multiply factors f4(X,Y) and f6(Z,Y), what is the correct form for the resulting factor?
  – f(X,Y,Z)
  – When multiplying factors, the resulting factor’s domain is the union of the multiplicands’ domains
• What is the correct form for Σ_B f5(A,B) × f6(B,C)?
  – As usual, product before sum: Σ_B ( f5(A,B) × f6(B,C) )
  – Result of multiplication: f(A,B,C). Then marginalize out B: f’(A,C)
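The three factor operations recapped above can be sketched in code. This is a minimal illustrative implementation (the dict-of-tuples representation is my own choice, not prescribed by the slides), using the f(X,Y,Z) and f1 × f2 tables from the previous slides.

```python
from itertools import product

def assign(factor, var, value):
    """Operation 1: set var = value and drop var from the factor's domain."""
    vars_, table = factor
    i = vars_.index(var)
    new_vars = vars_[:i] + vars_[i + 1:]
    new_table = {k[:i] + k[i + 1:]: v for k, v in table.items() if k[i] == value}
    return (new_vars, new_table)

def sum_out(factor, var):
    """Operation 2: marginalize (sum) var out of the factor."""
    vars_, table = factor
    i = vars_.index(var)
    new_vars = vars_[:i] + vars_[i + 1:]
    new_table = {}
    for k, v in table.items():
        key = k[:i] + k[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + v
    return (new_vars, new_table)

def multiply(f_a, f_b):
    """Operation 3: pointwise product; the result's domain is the
    union of the two domains (binary variables assumed here)."""
    vars_a, t_a = f_a
    vars_b, t_b = f_b
    new_vars = vars_a + [v for v in vars_b if v not in vars_a]
    new_table = {}
    for vals in product([True, False], repeat=len(new_vars)):
        asg = dict(zip(new_vars, vals))
        new_table[vals] = t_a[tuple(asg[v] for v in vars_a)] * \
                          t_b[tuple(asg[v] for v in vars_b)]
    return (new_vars, new_table)

# The f(X,Y,Z) table from the slides (t = True, f = False).
f = (['X', 'Y', 'Z'], {
    (True, True, True): 0.1, (True, True, False): 0.9,
    (True, False, True): 0.2, (True, False, False): 0.8,
    (False, True, True): 0.4, (False, True, False): 0.6,
    (False, False, True): 0.3, (False, False, False): 0.7,
})

g = assign(f, 'X', True)    # f(X=t,Y,Z): a factor of Y, Z
h = sum_out(g, 'Z')         # a factor of Y; h(Y=t) = 0.1 + 0.9 = 1.0

# The f1(A,B) x f2(B,C) example from the multiplication slide.
f1 = (['A', 'B'], {(True, True): 0.1, (True, False): 0.9,
                   (False, True): 0.2, (False, False): 0.8})
f2 = (['B', 'C'], {(True, True): 0.3, (True, False): 0.7,
                   (False, True): 0.6, (False, False): 0.4})
p = multiply(f1, f2)        # f(A,B,C); e.g. p(t,t,t) = 0.1 * 0.3 = 0.03
print(g[1][(True, True)], h[1][(True,)], p[1][(True, True, True)])
```

Note that `multiply` fills in the table that the exercise slide above leaves blank.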
Remember our goal

Def. of conditional probability:

P(Y | E=e) = P(Y, E=e) / P(E=e)

• Y: subset of variables that is queried
• E: subset of variables that are observed, E = e
• Z1, …, Zk: remaining variables in the JPD

• All we need to compute is the numerator: the joint probability of the query variable(s) and the evidence!

We need to compute this numerator for each value yi of Y, by marginalizing over all the variables Z1, …, Zk not involved in the query:

P(Y=yi, E=e) = Σ_Z1 … Σ_Zk P(Y=yi, Z1, …, Zk, E=e)

To compute the denominator, marginalize over Y. It has the same value for every P(Y=yi): the normalization constant ensuring that Σ_i P(Y=yi | E=e) = 1.

• Variable Elimination is an algorithm that efficiently performs this operation by casting it as operations between factors
Lecture Overview
• Exact Inference: Variable Elimination
  • Factors
  • Algorithm
• Approximate Inference: sampling methods
  • Forward (prior) sampling
  • Rejection sampling
  • Likelihood weighting
Variable Elimination: Intro (1)
• We can express the joint probability as a factor
  – f(Y, E1, …, Ej, Z1, …, Zk)
• We can compute P(Y, E1=e1, …, Ej=ej) by
  – Assigning the observed values E1=e1, …, Ej=ej
  – Marginalizing out the other variables Z1, …, Zk (those not involved in the query), one at a time
    • the order in which we do this is called our elimination ordering

P(Y, E1=e1, …, Ej=ej) = Σ_Zk … Σ_Z1 f(Y, E1, …, Ej, Z1, …, Zk)E1=e1, …, Ej=ej

• Are we done? No, this still represents the whole JPD (as a single factor)! We need to exploit the compactness of Bayesian networks.
Variable Elimination: Intro (2)

Recall the JPD of a Bayesian network:

P(X1, …, Xn) = Π_{i=1..n} P(Xi | X1, …, Xi-1) = Π_{i=1..n} P(Xi | pa(Xi))

• So we can express the joint factor as a product of factors, one for each conditional probability:

P(Xi | pa(Xi)) = f(Xi, pa(Xi))

P(Y, E1=e1, …, Ej=ej) = Σ_Zk … Σ_Z1 f(Y, E1, …, Ej, Z1, …, Zk)E1=e1, …, Ej=ej
                      = Σ_Zk … Σ_Z1 (Π_{i=1..n} fi)E1=e1, …, Ej=ej
Computing sums of products
• Inference in Bayesian networks thus reduces to computing sums of products:

P(Y, E1=e1, …, Ej=ej) = Σ_Zk … Σ_Z1 (Π_{i=1..n} fi)E1=e1, …, Ej=ej

• To compute these efficiently
  – Factor out those terms that don't involve Zk, e.g.:

Σ_Zk f1(Zk) f2(Y) f3(Zk,Y) f4(X,Y) = f2(Y) f4(X,Y) Σ_Zk f1(Zk) f3(Zk,Y)
Decompose sum of products

General case — separate the factors that contain Z1 from those that do not:

Σ_Zk … Σ_Z1 f1 × … × fn = Σ_Zk … Σ_Z2 (f1 × … × fi) × Σ_Z1 (fi+1 × … × fn)

where fi+1, …, fn are the factors that contain Z1. Then repeat, separating the factors that contain Z2 from those that contain neither Z2 nor Z1:

= Σ_Zk … Σ_Z3 (f1 × … × fj) × Σ_Z2 (fj+1 × … × fi) × Σ_Z1 (fi+1 × … × fn)

Etc., continuing given a predefined simplification ordering of the variables: the variable elimination ordering.
The variable elimination algorithm

To compute P(Y=yi | E=e):
1. Construct a factor for each conditional probability.
2. For each factor, assign the observed variables E to their observed values.
3. Given an elimination ordering, decompose the sum of products.
4. Sum out all variables Zi not involved in the query.
5. Multiply the remaining factors (which only involve Y).
6. Normalize: divide the resulting factor f(Y) by Σ_y f(y).

See the algorithm VE_BN in the P&M text, Section 6.4.1, Figure 6.8, p. 254.
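The six steps above can be traced end to end on a tiny hypothetical two-node network A → B, computing P(A | B=true). This is an illustrative sketch; the CPT numbers are invented, not taken from the slides.

```python
P_A = {True: 0.3, False: 0.7}                            # f0(A) = P(A)
P_B_given_A = {(True, True): 0.7, (True, False): 0.3,    # f1(B,A), keyed
               (False, True): 0.1, (False, False): 0.9}  # (a, b) -> P(B=b|A=a)

# Step 1: one factor per conditional probability: f0(A) and f1(B,A) above.
# Step 2: assign the observed value B = true in f1(B,A), leaving a factor of A.
f2 = {a: P_B_given_A[(a, True)] for a in (True, False)}
# Steps 3-4: there are no other variables Z to decompose over or sum out here.
# Step 5: multiply the remaining factors, all defined on A only.
f3 = {a: P_A[a] * f2[a] for a in (True, False)}          # f3(A) = P(A, B=true)
# Step 6: normalize by the sum over the query variable.
z = sum(f3.values())
posterior = {a: f3[a] / z for a in (True, False)}
print(posterior[True])   # 0.21 / (0.21 + 0.07) = 0.75
```

With more variables, steps 3 and 4 do the real work; here they are vacuous, which keeps the trace short.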
Variable elimination example

Compute P(G | H=h1).

P(G,H) = Σ_A,B,C,D,E,F,I P(A,B,C,D,E,F,G,H,I)
       = Σ_A,B,C,D,E,F,I P(A) P(B|A) P(C) P(D|B,C) P(E|C) P(F|D) P(G|F,E) P(H|G) P(I|G)

Step 1: Construct a factor for each conditional probability:

P(G,H) = Σ_A,B,C,D,E,F,I f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f7(H,G) f8(I,G)

Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G)
Compute P(G | H=h1).

Step 2: Assign the observed variables their observed values.

Previous state:
P(G,H) = Σ_A,B,C,D,E,F,I f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f7(H,G) f8(I,G)

Observe H = h1: f7(H,G) becomes the new factor f9(G):
P(G,H=h1) = Σ_A,B,C,D,E,F,I f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f9(G) f8(I,G)
Compute P(G | H=h1).

Step 3: Decompose the sum of products.

Previous state:
P(G,H=h1) = Σ_A,B,C,D,E,F,I f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f9(G) f8(I,G)

Elimination ordering A, C, E, I, B, D, F:
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C) Σ_A f0(A) f1(B,A)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4: Sum out non-query variables (one at a time).

Previous state:
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C) Σ_A f0(A) f1(B,A)

Eliminate A: perform the product f0(A) f1(B,A) and sum out A, yielding the new factor f10(B):
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C)

f10(B) does not depend on C, E, or I, so we can push it outside of those sums.
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C)

Eliminate C: perform the product f2(C) f3(D,B,C) f4(E,C) and sum out C, yielding the new factor f11(B,D,E):
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) f11(B,D,E)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) f11(B,D,E)

Eliminate E: perform the product f6(G,F,E) f11(B,D,E) and sum out E, yielding the new factor f12(B,D,F,G):
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G) Σ_I f8(I,G)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G) Σ_I f8(I,G)

Eliminate I: sum out I from f8(I,G), yielding the new factor f13(G):
P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G)

Eliminate B: perform the product f10(B) f12(B,D,F,G) and sum out B, yielding the new factor f14(D,F,G):
P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) f14(D,F,G)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) f14(D,F,G)

Eliminate D: perform the product f5(F,D) f14(D,F,G) and sum out D, yielding the new factor f15(F,G):
P(G,H=h1) = f9(G) f13(G) Σ_F f15(F,G)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) f13(G) Σ_F f15(F,G)

Eliminate F: sum out F from f15(F,G), yielding the new factor f16(G):
P(G,H=h1) = f9(G) f13(G) f16(G)
Compute P(G | H=h1).

Step 5: Multiply the remaining factors.

Previous state:
P(G,H=h1) = f9(G) f13(G) f16(G)

Multiply the remaining factors (all defined on G only), yielding f17(G):
P(G,H=h1) = f17(G)
Compute P(G | H=h1).

Step 6: Normalize.

P(G=g | H=h1) = P(G=g, H=h1) / P(H=h1) = f17(g) / Σ_{g' ∈ dom(G)} f17(g')
VE and conditional independence
• So far, we haven’t used conditional independence!
  – Before running VE, we can prune all variables Z that are conditionally independent of the query Y given evidence E: Z ╨ Y | E
  – They cannot change the belief over Y given E!
• Example: which variables can we prune for the query P(G=g | C=c1, F=f1, H=h1)?
• Answer: A, B, and D. Both paths from these nodes to G are blocked:
  • F is an observed node in a chain structure
  • C is an observed common parent
• Thus, we only need to consider this subnetwork
Variable elimination: pruning

Thus, if the query is P(G=g | C=c1, F=f1, H=h1), we only need to consider this subnetwork.

• We can also prune unobserved leaf nodes
  • Since they are unobserved and not predecessors of the query nodes, they cannot influence the posterior probability of the query nodes
One last trick
• We can also prune unobserved leaf nodes, and we can do so recursively
• E.g., which nodes can we prune if the query is P(A)?
  • Recursively prune unobserved leaf nodes: we can prune all nodes other than A!
Complexity of Variable Elimination (VE)
• A factor over n binary variables has to store 2^n numbers
  – The initial factors are typically quite small (variables typically have only a few parents in Bayesian networks)
  – But variable elimination constructs larger factors by multiplying factors together
• The complexity of VE is exponential in the maximum number of variables in any factor during its execution
  – This number is called the treewidth of the graph (along an ordering)
  – The elimination ordering influences the treewidth
• Finding the best ordering, i.e., the ordering that generates the minimum treewidth, is NP-complete
  – Heuristics work well in practice (e.g., least connected variables first)
  – Even with the best ordering, inference is sometimes infeasible
• In those cases, we need approximate inference.
VE in AISpace
• To see how variable elimination works in the AISpace applet:
  • Select “Network options -> Query Models -> verbose”
  • Compare what happens when you select “Prune Irrelevant variables” or not in the VE window that pops up when you query a node
  • Try different heuristics for elimination ordering
Lecture Overview
• Exact Inference: Variable Elimination
  • Factors
  • Algorithm
• Approximate Inference: sampling methods
  • Forward (prior) sampling
  • Rejection sampling
  • Likelihood weighting
Sampling: What is it?
• Problem: how to estimate probability distributions that are hard to compute via exact methods.
• Idea: estimate probabilities from sample data (samples) of the (unknown) probability distribution
• Use the frequency of each event in the sample data to approximate its probability
  – Frequencies are good approximations only if based on large samples
• But these samples are often not easy to obtain from real-world observations. How do we get the samples?

We use Sampling
• Sampling is a process to obtain samples adequate to estimate an unknown probability
• The samples are generated from a known probability distribution P(x1), …, P(xn)
Generating Samples from a Distribution
• For a random variable X with
  – values {x1, …, xk}
  – probability distribution P(X) = {P(x1), …, P(xk)}
• Partition the interval (0, 1] into k intervals pi, one for each xi, with length P(xi)
• To generate one sample
  • Randomly generate a value y in (0, 1] (i.e., generate a value from a uniform distribution over (0, 1]).
  • Select the value of the sample based on the interval pi that includes y
• From probability theory: P(y ∈ pi) = Length(pi) = P(xi)
Example
• Consider a random variable Lecture with
  • 3 values <good, bad, soso>
  • with probabilities 0.7, 0.1 and 0.2 respectively.
• We can build a sampler for this distribution by:
  – Using a random number generator that outputs numbers over (0, 1]
  – Partitioning (0, 1] into 3 intervals corresponding to the probabilities of the three Lecture values: (0, 0.7], (0.7, 0.8] and (0.8, 1]
  – To obtain a sample, generate a random number n and pick the value for Lecture based on which interval n falls into:
    • P(0 < n ≤ 0.7) = 0.7 = P(Lecture = good)
    • P(0.7 < n ≤ 0.8) = 0.1 = P(Lecture = bad)
    • P(0.8 < n ≤ 1) = 0.2 = P(Lecture = soso)
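The interval-partition sampler above can be sketched as follows, applied to the Lecture variable. This is an illustrative implementation; the function name `make_sampler` is my own.

```python
import random

def make_sampler(values, probs):
    """Partition (0, 1] into one interval per value, with length P(x_i);
    a uniform draw selects the value whose interval it lands in."""
    cumulative = []
    total = 0.0
    for p in probs:
        total += p
        cumulative.append(total)
    def sample():
        y = random.random()
        for v, c in zip(values, cumulative):
            if y < c:
                return v
        return values[-1]   # guard against floating-point round-off
    return sample

random.seed(0)
lecture = make_sampler(['good', 'bad', 'soso'], [0.7, 0.1, 0.2])
counts = {'good': 0, 'bad': 0, 'soso': 0}
n = 10000
for _ in range(n):
    counts[lecture()] += 1
print(counts['good'] / n)   # frequency approaches P(Lecture = good) = 0.7
```

With enough draws the frequencies converge to the probabilities, which is exactly the point of the next slide.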
Example

Random n | sample
0.73     | bad
0.2      | good
0.87     | soso
0.1      | good
0.9      | soso
0.5      | good
0.3      | good

If we generate enough samples, the frequencies of the three values will get close to their probabilities:
P(0 < n ≤ 0.7) = 0.7 = P(Lecture = good)
P(0.7 < n ≤ 0.8) = 0.1 = P(Lecture = bad)
P(0.8 < n ≤ 1) = 0.2 = P(Lecture = soso)
Samples as Probabilities
• Count the total number of samples, m
• Count the number ni of samples with value xi
• The frequency of value xi is ni / m
• This frequency is your estimated probability of xi
Sampling for Bayesian Networks
OK, but how can we use all this for probabilistic inference in Bayesian networks?
As we said earlier, if we can’t use exact algorithms to update the network, we need to resort to samples and frequencies to compute the probabilities we are interested in
We generate these samples by relying on the mechanism we just described
Sampling for Bayesian Networks (cont.)

Suppose we have the following BN with two binary variables, A → B, with CPTs:

P(A=1) = 0.3
P(B=1|A): A=1: 0.7; A=0: 0.1

It corresponds to the joint probability distribution
• P(A,B) = P(B|A)P(A)

To sample from this distribution
• we first sample from P(A). Suppose we get A = 0.
• In this case, we then sample from P(B|A=0).
• If we had sampled A = 1, then in the second step we would have sampled from P(B|A=1).
Forward (or Prior) Sampling

In a BN
• we can order parents before children (topological order), and
• we have CPTs available.

If no variables are instantiated (i.e., there is no evidence), this allows a simple algorithm: forward sampling.
• Just sample variables in some fixed topological order, using the previously sampled values of the parents to select the correct distribution to sample from.
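Forward sampling can be sketched for the sprinkler network used in the example that follows (CPTs as given there), sampling in the topological order C, S, R, W. This is an illustrative sketch, not the textbook's Prior-Sample pseudocode.

```python
import random

def prior_sample(rng):
    """One forward sample of (Cloudy, Sprinkler, Rain, WetGrass)."""
    c = rng.random() < 0.5                       # P(C=T) = 0.5
    s = rng.random() < (0.1 if c else 0.5)       # P(S=T|C)
    r = rng.random() < (0.8 if c else 0.2)       # P(R=T|C)
    p_w = {(True, True): 0.99, (True, False): 0.9,
           (False, True): 0.9, (False, False): 0.1}
    w = rng.random() < p_w[(s, r)]               # P(W=T|S,R)
    return c, s, r, w

rng = random.Random(0)
n = 10000
samples = [prior_sample(rng) for _ in range(n)]
# Estimate P(Rain = T) as a frequency; the true value is
# 0.5 * 0.8 + 0.5 * 0.2 = 0.5.
p_rain = sum(1 for _, _, r, _ in samples if r) / n
print(p_rain)
```

Each variable is drawn from its CPT row selected by the already-sampled parent values, which is all the algorithm requires.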
Forward (or Prior) Sampling Example

Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler and Rain → Wet Grass.

P(C=T) = 0.5
P(S=T|C): C=T: 0.1; C=F: 0.5
P(R=T|C): C=T: 0.8; C=F: 0.2

P(W=T|S,R):
S R | P(W=T|S,R)
T T | 0.99
T F | 0.9
F T | 0.9
F F | 0.1

Sampling trace:
Random => 0.4  Sample => Cloudy = T      (0.4 ≤ 0.5)
Random => 0.8  Sample => Sprinkler = F   (0.8 > P(S=T|C=T) = 0.1)
Random => 0.4  Sample => Rain = T        (0.4 ≤ P(R=T|C=T) = 0.8)
Random => 0.7  Sample => Wet Grass = T   (0.7 ≤ P(W=T|S=F,R=T) = 0.9)
Example

sample # | Cloudy | Sprinkler | Rain | Wet Grass
1        | T      | F         | T    | T
2        | …      | …         | …    | …
3        | …      | …         | …    | …
…
n        | …      | …         | …    | …

We generate as many samples as we can afford. Then we can use them to compute the probability of any partially specified event,
• e.g., the probability of Rain = T in my Sprinkler network,
by computing the frequency of that event in the sample set. So, if we generate 1000 samples from the Sprinkler network, and 511 of them have Rain = T, then the estimated probability of rain is 511/1000 = 0.511.
Sampling: why does it work?• Because of the Law of Large Numbers (LLN)
– Theorem in probability that describes the long-term stability of a random variable.
• Given repeated, independent trials with the same probability p of success in each trial, – the percentage of successes approaches p as the number of trials
approaches ∞
• Example: tossing a coin a large number of times, where the probability of heads on any toss is p.
– Let Sn be the number of heads that come up after n tosses.
Simulating Coin Tosses
• Here’s a graph of Sn/n against n, compared with p, for three different sequences of simulated coin tosses, each of length 200.
• Remember that P(head) = 0.5
Simulating Tosses with a Biased Coin
• Let's change P(head) to 0.8
http://socr.ucla.edu/htmls/SOCR_Experiments.html
Forward Sampling

This algorithm generates samples from the prior joint distribution represented by the network.

Let SPS(x1, …, xn) be the probability that a specific event (x1, …, xn) is generated by the algorithm (the sampling probability of the event).

Just by looking at the sampling process, we have

SPS(x1, …, xn) = Π_{i=1..n} P(xi | parents(Xi))

because each sampling step depends only on the parents' values. But this is also the probability of event (x1, …, xn) according to the BN representation of the JPD.

• Thus: SPS(x1, …, xn) = P(x1, …, xn)
Forward Sampling

Because of this result, we can answer any query on the distribution for X1, …, Xn represented by our Bnet using the samples generated with Prior-Sample(Bnet).

Let N be the total number of samples generated, and NPS(x1, …, xn) be the number of samples generated for event x1, …, xn.

We expect the frequency of a given event to converge to its expected value, according to its sampling probability:

lim_{N→∞} NPS(x1, …, xn) / N = SPS(x1, …, xn) = P(x1, …, xn)

so for large N,

NPS(x1, …, xn) / N ≈ P(x1, …, xn)

That is, the event frequency in a sample set generated via forward sampling is a consistent estimate of that event’s probability.

For example, given (cloudy, ⌐sprinkler, rain, wet-grass), with sampling probability 0.324, we expect, for large sample sets, about 32% of the samples to correspond to this event.
Sampling: why does it work?
• The LLN is important because it "guarantees" stable long-term results for random events.
• However, it is important to remember that the LLN only applies (as the name indicates) when a large number of observations are considered.
• There is no principle ensuring that a small number of observations will converge to the expected value or that a streak of one value will immediately be "balanced" by the others.
• So, how many samples are enough?
Hoeffding’s inequality

Suppose p is the true probability and s is the sample average from n independent samples. Then

P(|s − p| > ε) ≤ 2e^(−2nε²)     (1)

p above can be the probability of any event for random variables X = {X1, …, Xn} described by a Bayesian network.

If you want an infinitely small probability of having an error greater than ε, you need infinitely many samples. But if you settle on something less than infinitely small, let’s say δ, then you just need

δ ≥ 2e^(−2nε²)

So you pick
• the error ε you can tolerate,
• the frequency δ with which you can tolerate it,
and solve (1) for n, i.e., the number of samples that can ensure this performance.

Examples:
• You can tolerate an error greater than 0.1 only in 5% of your cases
  • Set ε = 0.1, δ = 0.05
  • Equation (1) gives you n > 184
• If you can tolerate the same error (0.1) only in 1% of the cases, then you need 265 samples
• If you want an error of 0.01 in no more than 5% of the cases, you need 18,445 samples
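Solving the bound for n can be written as a one-line helper, which reproduces the three worked examples above (the function name is my own).

```python
import math

def samples_needed(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps**2) <= delta,
    from Hoeffding's inequality."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

print(samples_needed(0.1, 0.05))   # 185, matching the slides' n > 184
print(samples_needed(0.1, 0.01))   # 265
print(samples_needed(0.01, 0.05))  # 18445
```

Note the quadratic dependence on 1/ε: tightening the error from 0.1 to 0.01 multiplies the sample count by 100, while tightening δ costs only a logarithmic factor.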
Rejection Sampling
• Used specifically to compute conditional probabilities P(X|e) given some evidence e
  – It generates samples from the prior distribution specified by the Bnet (e.g., by using the Prior-Sample algorithm we saw earlier)
  – It rejects all those samples that do not match the evidence e
  – Then, for every value x of X, it estimates P(X = x | e) by looking at the frequency with which X = x occurs in the remaining samples (i.e., those consistent with e)
Rejection Sampling Example
• Estimate P(Rain|sprinkler) using 100 samples
• 27 samples have sprinkler
• Of these, 8 have rain and 19 have ⌐rain
• Estimated P(Rain|sprinkler) = P*(Rain|sprinkler) = <8/27, 19/27>
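The procedure above can be sketched for the sprinkler network: draw prior samples, keep only those consistent with the evidence Sprinkler = T, and use frequencies among the survivors. An illustrative sketch using the CPTs from the forward-sampling example.

```python
import random

def prior_sample(rng):
    """One forward sample of (Cloudy, Sprinkler, Rain)."""
    c = rng.random() < 0.5                      # P(C=T) = 0.5
    s = rng.random() < (0.1 if c else 0.5)      # P(S=T|C)
    r = rng.random() < (0.8 if c else 0.2)      # P(R=T|C)
    return c, s, r                              # Wet Grass not needed here

rng = random.Random(1)
n = 20000
# Keep the Rain value only for samples consistent with the evidence S = T.
kept = [r for c, s, r in (prior_sample(rng) for _ in range(n)) if s]
accept_rate = len(kept) / n        # near P(S=T) = 0.5*0.1 + 0.5*0.5 = 0.3
p_rain_given_s = sum(kept) / len(kept)
print(accept_rate, p_rain_given_s)
```

The acceptance rate makes the inefficiency concrete: roughly 70% of the work is discarded even with a single, fairly likely evidence variable.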
Analysis of Rejection Sampling
• Main problem: it rejects too many samples when the evidence is not very likely.
• Consider our previous example, A → B, with:

P(A=T) = 0.3
P(B=T|A): A=T: 0.7; A=F: 0.1

If we are interested in P(B|A=T),
• we can only use samples with A = T,
• but I only have 0.3 probability of getting A = T from Prior-Sample,
• so I likely have to reject 70% of my samples.

Things get exponentially worse as the number of evidence variables grows.
Analysis of Rejection Sampling

Now add to the network an independent variable C with P(C=T) = 0.1.

If we are interested in P(B|A=T, C=T),
• we can only use samples (A=T, B, C=T), but the probability of getting them is P(A=T)P(C=T) = 0.03
• I should expect to reject on the order of 97% of my samples!
Analysis of Rejection Sampling

Note that rejection sampling resembles how probabilities are estimated from observations in the real world.
• You need the right type of event to happen in order to have a valid observation
  e.g., P(rains tomorrow | redSky tonight)
• If the event is not common, I have to wait a long time to gather enough relevant events
  e.g., nights with red sky
Likelihood Weighting (LW)

LW avoids the inefficiency of rejection sampling by generating only events that are consistent with the evidence e:
• It fixes the values for the evidence variables E
• It samples only the remaining variables Z, i.e., the query X and the hidden variables Y

But it still needs to account for the influence of the given evidence on the probability of the samples; otherwise it would not be sampling the correct distribution.

If a sample comes from the correct distribution (e.g., the prior distribution of the Bnet in forward sampling or rejection sampling),
• we can simply count the number of samples with the desired values for the query variable X.

In LW, before counting, each sample is instead weighted to account for the actual likelihood of the corresponding event given the original probability distribution and the evidence. Basically, the point is to give events that are unlikely given the actual evidence less weight than others.
Example: P(Rain | sprinkler, wet-grass)

Network and CPTs as in the forward sampling example:
P(C=T) = 0.5
P(S=T|C): C=T: 0.1; C=F: 0.5
P(R=T|C): C=T: 0.8; C=F: 0.2
P(W=T|S,R): S=T,R=T: 0.99; S=T,R=F: 0.9; S=F,R=T: 0.9; S=F,R=F: 0.1

Trace (initial weight w1 = 1):
Random => 0.4  Sample => cloudy
Sprinkler is fixed: no sampling, but adjust the weight:
  w2 = w1 × P(sprinkler|cloudy) = 1 × 0.1 = 0.1
Random => 0.4  Sample => rain
Wet Grass is fixed: no sampling, but adjust the weight:
  w3 = w2 × P(wet-grass|sprinkler, rain) = 0.1 × 0.99 = 0.099
Example: P(Rain | sprinkler, wet-grass)

Basically, LW generated the sample <cloudy, sprinkler, rain, wet-grass> by “cheating”: we did not sample the evidence variables, we just set their values to what we wanted.

But LW makes up for the cheating by giving a relatively low weight to this sample (0.099), reflecting the fact that it is not very likely.

LW uses the weighted values when counting through all generated samples to compute the frequencies of the events of interest.

Likelihood Weighting (LW), in short: fix the evidence variables, sample only the non-evidence variables, and weight each sample by the likelihood it accords the evidence.
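The trace above generalizes into the following sketch for P(Rain | Sprinkler=T, WetGrass=T), again an illustrative implementation using the sprinkler CPTs rather than the textbook's pseudocode.

```python
import random

def weighted_sample(rng):
    """One LW sample: return the sampled Rain value and its weight."""
    w = 1.0
    c = rng.random() < 0.5                    # sample Cloudy from P(C)
    w *= 0.1 if c else 0.5                    # Sprinkler fixed to T: weight by P(S=T|C)
    r = rng.random() < (0.8 if c else 0.2)    # sample Rain from P(R|C)
    w *= 0.99 if r else 0.9                   # WetGrass fixed to T: weight by P(W=T|S=T,R)
    return r, w

rng = random.Random(0)
num = den = 0.0
for _ in range(50000):
    r, w = weighted_sample(rng)
    den += w
    if r:
        num += w
estimate = num / den
print(estimate)   # estimate of P(Rain=T | sprinkler, wet-grass); exact value is about 0.32
```

Every sample contributes; the weights replace the wholesale rejections of the previous algorithm.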
Likelihood Weighting: why does it work?
Remember that we have fixed evidence variables E, and we are sampling only all the other variables Z = X U Y• The “right” distribution to sample from would be P(Z|e) but we have
seen with rejection sampling that this is really inefficient
So which distribution is Weighted-Sample using?
By looking at the algorithm, we see that to get an event (z,e) the algorithm samples each variable in Z given its parents
• SWS(z , e) = ∏i P(zi|parent(Zi))
Likelihood Weighting: why does it work?
P(zi|parent(Zi)) can include both hidden variables Y and evidence variables E, so SWS does pay some attention to evidence
The sampled values for each Zi will be influenced by evidence among Zi ancestors.
• Better than sampling from P(z) in this respect (i.e. by completely ignoring E)
However, they won’t be influenced by any evidence that is not an ancestor of Zi.
• SWS(z , e) pays less attention to evidence than the true distribution P(z|e)
The weights are there to make up for the difference
How? The weight for sample x = (z, e) is the product of the likelihoods of each evidence variable given its parents:

• w(z, e) = ∏j P(ej | parent(Ej))

So the weighted probability of a sample (used by Likelihood Weighting instead of simple counts) can be expressed as

• SWS(z, e) w(z, e) = ∏i P(zi | parent(Zi)) × ∏j P(ej | parent(Ej)) = P(z, e)   (2)
• by definition of Bayesian networks, because the two products cover all the variables in the network
And it can be shown that Likelihood weighting generates consistent estimates for P(X|e)
Analysis of LW

Contrary to rejection sampling, LW uses all the samples that it generates. It simulates the effect of many rejections via weighting.

Example: A → B, with

P(A=T) = 0.3
P(B=T|A): A=T: 0.003; A=F: 0.63

• Suppose B = true, and we are looking for samples with A = true.
• If there are 1000 of them, likely only 3 will have B = true:
  the only ones that would not be rejected by Rejection Sampling.
• With LW, we fix B = true, and give the very first sample with A = true a weight of P(B=true|A=true) = 0.003.
• Thus LW simulates the effect of many rejections with only one sample.
Efficiency of LW

Likelihood weighting is good:
• It uses all the samples that it generates, thus it is much more efficient than Rejection Sampling
• It takes evidence into account to generate the sample
  • E.g., here, W’s value will get picked based on the evidence values of S and R

But it doesn’t solve all our problems:
• Evidence influences the choice of downstream variables, but not upstream ones (C isn’t more likely to get a value matching the evidence)
• Performance still degrades as the number of evidence variables increases
  • The more evidence variables, the more samples will have small weight
  • If one does not generate enough samples, the weighted estimate will be dominated by the small fraction of samples with substantial weight

Possible approaches:
• Sample some of the variables, while using exact inference to generate the posterior probability for others
• Use Monte Carlo simulations (beyond the scope of this course), or Particle Filtering (may see in next classes)
Learning Goals For Inference in Bnets

• Variable elimination
  • Understand factors and their operations
  • Carry out variable elimination by using factors and the related operations
  • Use techniques to simplify variable elimination
• Approximate inference
  • Explain why we need approximate algorithms in Bayesian networks
  • Define what sampling is and how it works at the general level
  • Describe how to generate samples from a distribution and be able to apply the process to a given distribution
  • Demonstrate how to apply sampling to the distribution represented by a given Bayesian network
  • Explain/write in pseudocode/implement/trace/debug different sampling algorithms: forward sampling, rejection sampling, likelihood weighting (NOT importance sampling from the textbook)
  • Describe when each algorithm is appropriate, as well as its drawbacks. Recommend the most appropriate sampling algorithm for a specific problem
  • Explain Hoeffding's inequality and describe what it allows us to compute. Apply Hoeffding's inequality to specific examples