Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)


Page 1: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Computer Science CPSC 502

Lecture 9

(up-to Ch. 6.4.2.4)

Page 2: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Lecture Overview

• Exact Inference: Variable Elimination
  • Factors
  • Algorithm

• Approximate Inference: sampling methods
  • Forward (prior) sampling
  • Rejection sampling
  • Likelihood weighting

Page 3: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Bayesian Networks: Types of Inference

• Diagnostic: People are leaving (L=t). Query: P(F|L=t) = ?

• Predictive: Fire happens (F=t). Query: P(L|F=t) = ?

• Intercausal: The alarm goes off (A=t) and a person smokes next to the sensor (S=t). Query: P(F|A=t,S=t) = ?

• Mixed: There is no fire (F=f) and people are leaving (L=t). Query: P(A|F=f,L=t) = ?

[Slide diagrams: four small networks over Fire, Alarm, and Leaving; the intercausal case adds Smoking at Sensor as a second parent of Alarm.]

We will use the same reasoning procedure for all of these types

Page 4: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Inference

By the definition of conditional probability:

P(Y | E1=e1, …, Ej=ej) = P(Y, E1=e1, …, Ej=ej) / P(E1=e1, …, Ej=ej)

We need to compute the numerator for each value yi of Y, marginalizing over all the variables Z1,…,Zk not involved in the query:

P(Y=yi, E1=e1, …, Ej=ej) = Σ_{Z1} … Σ_{Zk} P(Z1, …, Zk, Y=yi, E1=e1, …, Ej=ej)

To compute the denominator, marginalize over Y:

P(E1=e1, …, Ej=ej) = Σ_Y P(Y, E1=e1, …, Ej=ej)

It has the same value for every P(Y=yi): it is the normalization constant ensuring that Σ_{yi} P(Y=yi | E1=e1, …, Ej=ej) = 1.

• Variable Elimination is an algorithm that efficiently performs these operations by casting them as operations between factors

• Y is the query variable;

• E1=e1, …, Ej=ej are the observed variables (with their values)

• Z1, …,Zk are the remaining variables

Page 5: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

X Y Z val

t t t 0.1

t t f 0.9

t f t 0.2

t f f 0.8

f t t 0.4

f t f 0.6

f f t 0.3

f f f 0.7

Factors

• A factor is a function from a tuple of random variables to the real numbers R

• We write a factor on variables X1,…,Xj as f(X1,…,Xj)

• A factor denotes one or more (possibly partial) distributions over the given tuple of variables, e.g.,

  • P(X1, X2) is a factor f(X1, X2): a distribution

  • P(Z | X,Y) is a factor f(Z,X,Y): a set of distributions, one for each combination of values for X and Y

  • P(Z=f | X,Y) is a factor f(X,Y): a set of partial distributions, one for Z = f

• Note: Factors do not have to sum to one
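To make the factor operations on the next slides concrete, here is a minimal, hedged Python sketch of a factor as a lookup table. The Factor class and the True/False encoding of t/f are illustrative assumptions for these notes, not code from the textbook or AIspace.

```python
# A factor as a table mapping tuples of variable values to real numbers.
# Binary variables are encoded as True/False (t/f on the slides).
class Factor:
    def __init__(self, variables, table):
        self.variables = tuple(variables)   # e.g. ("X", "Y", "Z")
        self.table = dict(table)            # e.g. {(True, True, True): 0.1, ...}

# The f(X,Y,Z) table used on this and the following slides
f_xyz = Factor(("X", "Y", "Z"), {
    (True, True, True): 0.1,   (True, True, False): 0.9,
    (True, False, True): 0.2,  (True, False, False): 0.8,
    (False, True, True): 0.4,  (False, True, False): 0.6,
    (False, False, True): 0.3, (False, False, False): 0.7,
})
```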

Page 6: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Operation 1: assigning a variable

• We can make new factors out of an existing factor

• Our first operation: we can assign some or all of the variables of a factor

f(X,Y,Z):

X Y Z val
t t t 0.1
t t f 0.9
t f t 0.2
t f f 0.8
f t t 0.4
f t f 0.6
f f t 0.3
f f f 0.7

What is the result of assigning X = t ?

f(X=t,Y,Z) = f(X,Y,Z)_{X=t}, a factor of Y,Z:

Y Z val
t t 0.1
t f 0.9
f t 0.2
f f 0.8

Page 7: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

More examples of assignment

f(X,Y,Z):

X Y Z val
t t t 0.1
t t f 0.9
t f t 0.2
t f f 0.8
f t t 0.4
f t f 0.6
f f t 0.3
f f f 0.7

f(X=t,Y,Z), a factor of Y,Z:

Y Z val
t t 0.1
t f 0.9
f t 0.2
f f 0.8

f(X=t,Y,Z=f), a factor of Y:

Y val
t 0.9
f 0.8

f(X=t,Y=f,Z=f), a number: 0.8
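A hedged sketch of Operation 1, continuing the Factor class from the earlier example: assigning a variable drops it from the factor's domain and keeps only the matching rows. Function and variable names are illustrative assumptions.

```python
def assign(factor, var, value):
    """Operation 1: return the factor with `var` fixed to `value` (var leaves the domain)."""
    idx = factor.variables.index(var)
    new_vars = factor.variables[:idx] + factor.variables[idx + 1:]
    new_table = {vals[:idx] + vals[idx + 1:]: p
                 for vals, p in factor.table.items() if vals[idx] == value}
    return Factor(new_vars, new_table)

f_yz = assign(f_xyz, "X", True)      # f(X=t,Y,Z): a factor of Y,Z
f_y = assign(f_yz, "Z", False)       # f(X=t,Y,Z=f): a factor of Y
f_num = assign(f_y, "Y", False)      # f(X=t,Y=f,Z=f): a number
print(f_y.table)                     # {(True,): 0.9, (False,): 0.8}
print(f_num.table[()])               # 0.8
```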

Page 8: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Operation 2: Summing out a variable

• Our second operation on factors: we can marginalize out (or sum out) a variable
  – Exactly as before. Only difference: factors don't sum to 1
  – Marginalizing out a variable X1 from a factor f(X1,…,Xj) yields a new factor defined on {X2,…,Xj}:

(Σ_{X1} f)(X2,…,Xj) = Σ_{x1 ∈ dom(X1)} f(X1=x1, X2,…,Xj)

f3(B,A,C):

B A C val
t t t 0.03
t t f 0.07
f t t 0.54
f t f 0.36
t f t 0.06
t f f 0.14
f f t 0.48
f f f 0.32

(Σ_B f3)(A,C):

A C val
t t
t f
f t
f f

Page 9: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Operation 2: Summing out a variable

• Our second operation on factors: we can marginalize out (or sum out) a variable
  – Exactly as before. Only difference: factors don't sum to 1
  – Marginalizing out a variable X1 from a factor f(X1,…,Xj) yields a new factor defined on {X2,…,Xj}:

(Σ_{X1} f)(X2,…,Xj) = Σ_{x1 ∈ dom(X1)} f(X1=x1, X2,…,Xj)

f3(B,A,C):

B A C val
t t t 0.03
t t f 0.07
f t t 0.54
f t f 0.36
t f t 0.06
t f f 0.14
f f t 0.48
f f f 0.32

(Σ_B f3)(A,C):

A C val
t t 0.57
t f 0.43
f t 0.54
f f 0.46
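A hedged sketch of Operation 2, continuing the same Factor sketch; summing out B from the slide's f3(B,A,C) reproduces the filled-in (Σ_B f3)(A,C) table.

```python
def sum_out(factor, var):
    """Operation 2: marginalize `var` out of the factor."""
    idx = factor.variables.index(var)
    new_vars = factor.variables[:idx] + factor.variables[idx + 1:]
    new_table = {}
    for vals, p in factor.table.items():
        key = vals[:idx] + vals[idx + 1:]
        new_table[key] = new_table.get(key, 0.0) + p
    return Factor(new_vars, new_table)

f3 = Factor(("B", "A", "C"), {
    (True, True, True): 0.03,   (True, True, False): 0.07,
    (False, True, True): 0.54,  (False, True, False): 0.36,
    (True, False, True): 0.06,  (True, False, False): 0.14,
    (False, False, True): 0.48, (False, False, False): 0.32,
})
f_ac = sum_out(f3, "B")
print(f_ac.table[(True, True)])   # 0.03 + 0.54 = 0.57 (up to floating point)
```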

Page 10: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Operation 3: multiplying factors

f1(A,B):

A B val
t t 0.1
t f 0.9
f t 0.2
f f 0.8

f2(B,C):

B C val
t t 0.3
t f 0.7
f t 0.6
f f 0.4

f1(A,B) × f2(B,C) is a factor f(A,B,C):

A B C val
t t t
t t f
t f t
t f f
f t t
f t f
f f t
f f f
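A hedged sketch of Operation 3, continuing the Factor sketch and assuming binary variables; the product's domain is the union of the two factors' domains.

```python
from itertools import product

def multiply(f_a, f_b):
    """Operation 3: pointwise product; defined on the union of the two domains."""
    new_vars = f_a.variables + tuple(v for v in f_b.variables if v not in f_a.variables)
    new_table = {}
    for vals in product([True, False], repeat=len(new_vars)):
        assignment = dict(zip(new_vars, vals))
        a_key = tuple(assignment[v] for v in f_a.variables)
        b_key = tuple(assignment[v] for v in f_b.variables)
        new_table[vals] = f_a.table[a_key] * f_b.table[b_key]
    return Factor(new_vars, new_table)

f1 = Factor(("A", "B"), {(True, True): 0.1, (True, False): 0.9,
                         (False, True): 0.2, (False, False): 0.8})
f2 = Factor(("B", "C"), {(True, True): 0.3, (True, False): 0.7,
                         (False, True): 0.6, (False, False): 0.4})
f12 = multiply(f1, f2)                  # f(A,B,C)
print(f12.table[(True, True, True)])    # 0.1 * 0.3 = 0.03
```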

Page 11: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Recap: Factors and Operations on Them

• If we assign variable A=a in factor f7(A,B), what is the correct form for the resulting factor?
  – f(B). When we assign variable A we remove it from the factor's domain

• If we marginalize variable A out from factor f7(A,B), what is the correct form for the resulting factor?
  – f(B). When we marginalize out variable A we remove it from the factor's domain

• If we multiply factors f4(X,Y) and f6(Z,Y), what is the correct form for the resulting factor?
  – f(X,Y,Z)
  – When multiplying factors, the resulting factor's domain is the union of the multiplicands' domains

• What is the correct form for Σ_B f5(A,B) × f6(B,C)?
  – As usual, product before sum: Σ_B ( f5(A,B) × f6(B,C) )
  – Result of multiplication: f(A,B,C). Then marginalize out B: f'(A,C)

Page 12: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Remember our goal

By the definition of conditional probability:

P(Y | E=e) = P(Y, E=e) / P(E=e)

• All we need to compute is the numerator: the joint probability of the query variable(s) and the evidence! Variable Elimination is an algorithm that efficiently performs this operation by casting it as operations between factors

We need to compute the numerator for each value yi of Y, marginalizing over all the variables Z1,…,Zk not involved in the query:

P(Y=yi, E=e) = Σ_{Z1} … Σ_{Zk} P(Z1, …, Zk, Y=yi, E=e)

To compute the denominator, marginalize over Y:

P(E=e) = Σ_Y P(Y, E=e)

It has the same value for every P(Y=yi): it is the normalization constant ensuring that Σ_{yi} P(Y=yi | E=e) = 1.

• Y: the subset of variables that is queried

• E: the subset of variables that are observed, with E = e

• Z1, …, Zk: the remaining variables in the JPD

Page 13: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Lecture Overview

• Exact Inference: Variable Elimination
  • Factors
  • Algorithm

• Approximate Inference: sampling methods
  • Forward (prior) sampling
  • Rejection sampling
  • Likelihood weighting

Page 14: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Variable Elimination: Intro (1)

• We can express the joint probability as a factor
  – f(Y, E1,…,Ej, Z1,…,Zk)

• We can compute P(Y, E1=e1, …, Ej=ej) by
  – Assigning E1=e1, …, Ej=ej (the observed variables)
  – Marginalizing out the variables Z1, …, Zk (the other variables, not involved in the query), one at a time
    • the order in which we do this is called our elimination ordering

P(Y, E1=e1, …, Ej=ej) = Σ_{Z1} … Σ_{Zk} f(Y, E1, …, Ej, Z1, …, Zk)_{E1=e1, …, Ej=ej}

• Are we done?

No, this still represents the whole JPD (as a single factor)! Need to exploit the compactness of Bayesian networks

Page 15: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Variable Elimination Intro (2)

• We can express the joint factor as a product of factors, one for each conditional probability

Recall the JPD of a Bayesian network:

P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Xi-1, …, X1) = ∏_{i=1}^{n} P(Xi | pa(Xi))

Each conditional probability is itself a factor:

P(Xi | pa(Xi)) = f_i(Xi, pa(Xi))

So:

P(Y, E1=e1, …, Ej=ej) = Σ_{Z1} … Σ_{Zk} f(Y, E1, …, Ej, Z1, …, Zk)_{E1=e1, …, Ej=ej}
                      = Σ_{Z1} … Σ_{Zk} ( ∏_{i=1}^{n} f_i )_{E1=e1, …, Ej=ej}

Page 16: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Computing sums of products

• Inference in Bayesian networks thus reduces to computing sums of products:

P(Y, E1=e1, …, Ej=ej) = Σ_{Z1} … Σ_{Zk} ( ∏_{i=1}^{n} f_i )_{E1=e1, …, Ej=ej}

• To compute efficiently
  – Factor out those terms that don't involve Zk, e.g.:

Σ_{Zk} f1(Zk) f2(Y) f3(Zk,Y) f4(X,Y) = f2(Y) f4(X,Y) Σ_{Zk} f1(Zk) f3(Zk,Y)

Page 17: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Decompose sum of products

General case: to sum out Z1, partition the factors into those that do not contain Z1 (f_{i1},…,f_{ih}) and those that do (f_{ih+1},…,f_{in}):

Σ_{Zk} … Σ_{Z1} f1 ⋯ fn = Σ_{Zk} … Σ_{Z2} ( f_{i1} ⋯ f_{ih} Σ_{Z1} f_{ih+1} ⋯ f_{in} )

Summing out Z1 produces a new factor. Then repeat for Z2, partitioning the remaining factors into those that contain Z2 and those that contain neither Z2 nor Z1, and so on.

Etc., continue given a predefined simplification ordering of the variables: the variable elimination ordering

Page 18: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

The variable elimination algorithm

To compute P(Y=yi | E=e):

1. Construct a factor for each conditional probability.

2. In each factor, assign the observed variables E to their observed values.

3. Given an elimination ordering, decompose the sum of products.

4. Sum out all variables Zi not involved in the query.

5. Multiply the remaining factors (which only involve Y).

6. Normalize: divide the resulting factor f(Y) by Σ_y f(y).

See the algorithm VE_BN in the P&M text, Section 6.4.1, Figure 6.8, p. 254.
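For reference, a compact Python sketch of the procedure above, built on the assign, sum_out, and multiply helpers sketched earlier. It is an illustrative sketch, not the textbook's VE_BN code, and it assumes elim_order lists exactly the non-query, non-evidence variables.

```python
from functools import reduce

def variable_elimination(factors, evidence, elim_order):
    # Step 2: assign observed values in every factor that mentions them
    for var, val in evidence.items():
        factors = [assign(f, var, val) if var in f.variables else f for f in factors]
    # Steps 3-4: following the elimination ordering, sum out each variable
    for z in elim_order:
        relevant = [f for f in factors if z in f.variables]
        if not relevant:
            continue
        factors = [f for f in factors if z not in f.variables]
        factors.append(sum_out(reduce(multiply, relevant), z))
    # Step 5: multiply the remaining factors (all defined on the query variable)
    result = reduce(multiply, factors)
    # Step 6: normalize
    total = sum(result.table.values())
    return {vals: p / total for vals, p in result.table.items()}
```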

Page 19: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Variable elimination example

P(G,H) = Σ_{A,B,C,D,E,F,I} P(A,B,C,D,E,F,G,H,I)

       = Σ_{A,B,C,D,E,F,I} P(A)P(B|A)P(C)P(D|B,C)P(E|C)P(F|D)P(G|F,E)P(H|G)P(I|G)

Compute P(G | H=h1 ).

Page 20: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Step 1: Construct a factor for each cond. probability

P(G,H) = Σ_{A,B,C,D,E,F,I} P(A)P(B|A)P(C)P(D|B,C)P(E|C)P(F|D)P(G|F,E)P(H|G)P(I|G)

P(G,H) = Σ_{A,B,C,D,E,F,I} f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f7(H,G) f8(I,G)

• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

Compute P(G | H=h1 ).

Page 21: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Previous state:

P(G,H) = Σ_{A,B,C,D,E,F,I} f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f7(H,G) f8(I,G)

Observe H :

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

Step 2: assign to observed variables their observed values.

P(G,H=h1) = Σ_{A,B,C,D,E,F,I} f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f9(G) f8(I,G)

Compute P(G | H=h1 ).

H=h1

Page 22: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Step 3: Decompose sum of products

Previous state: P(G,H=h1) = Σ_{A,B,C,D,E,F,I} f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f9(G) f8(I,G)

Elimination ordering A, C, E, I, B, D, F: P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C) Σ_A f0(A) f1(B,A)

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

Compute P(G | H=h1 ).

Page 23: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Step 4: sum out non query variables (one at a time)

Previous state:

P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C) Σ_A f0(A) f1(B,A)

Eliminate A: perform the product and sum out A in Σ_A f0(A) f1(B,A), creating a new factor f10(B):

P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C)

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

• f10(B)

Elimination order: A,C,E,I,B,D,FCompute P(G | H=h1 ).

f10(B) does not depend on C, E, or I, so we can push it outside of those sums.

Page 24: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Step 4: sum out non query variables (one at a time)

Previous state:

P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C)

Eliminate C: perform the product and sum out C in Σ_C f2(C) f3(D,B,C) f4(E,C), creating a new factor f11(B,D,E):

P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) f11(B,D,E)

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

• f10(B)

Compute P(G | H=h1 ). Elimination order: A,C,E,I,B,D,F

• f11(B,D,E)

Page 25: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Step 4: sum out non query variables (one at a time)

Previous state:

P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) f11(B,D,E)

Eliminate E: perform the product and sum out E in Σ_E f6(G,F,E) f11(B,D,E), creating a new factor f12(B,D,F,G):

P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G) Σ_I f8(I,G)

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

• f10(B)

Compute P(G | H=h1 ). Elimination order: A,C,E,I,B,D,F

• f11(B,D,E)

• f12(B,D,F,G)

Page 26: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Previous state:

P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G) Σ_I f8(I,G)

Eliminate I: perform the product and sum out I in Σ_I f8(I,G), creating a new factor f13(G):

P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G)

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

• f10(B)

Elimination order: A,C,E,I,B,D,F

• f11(B,D,E)

• f12(B,D,F,G)

• f13(G)

Step 4: sum out non query variables (one at a time)

Compute P(G | H=h1 ).

Page 27: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Previous state:

P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G)

Eliminate B: perform the product and sum out B in Σ_B f10(B) f12(B,D,F,G), creating a new factor f14(D,F,G):

P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) f14(D,F,G)

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

• f10(B)

Elimination order: A,C,E,I,B,D,F

• f11(B,D,E)

• f12(B,D,F,G)

• f13(G)

• f14(D,F,G)

Step 4: sum out non query variables (one at a time)

Compute P(G | H=h1 ).

Page 28: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Previous state:

P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) f14(D,F,G)

Eliminate D: perform the product and sum out D in Σ_D f5(F,D) f14(D,F,G), creating a new factor f15(F,G):

P(G,H=h1) = f9(G) f13(G) Σ_F f15(F,G)

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

• f10(B)

Elimination order: A,C,E,I,B,D,F

• f11(B,D,E)

• f12(B,D,F,G)

• f13(G)

• f14(D,F,G)

• f15(F,G)

Step 4: sum out non query variables (one at a time)

Compute P(G | H=h1 ).

Page 29: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Previous state:

P(G,H=h1) = f9(G) f13(G) Σ_F f15(F,G)

Eliminate F: perform the product and sum out F in Σ_F f15(F,G), creating a new factor f16(G):

P(G,H=h1) = f9(G) f13(G) f16(G)

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

• f10(B)

Elimination order: A,C,E,I,B,D,F

• f11(B,D,E)

• f12(B,D,F,G)

• f13(G)

• f14(D,F,G)

• f15(F,G)

• f16(G)

Step 4: sum out non query variables (one at a time)

Compute P(G | H=h1 ).

Page 30: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)


Step 5: Multiply remaining factors

Previous state:

P(G,H=h1) = f9(G) f13(G) f16(G)

Multiply remaining factors (all in G):

P(G,H=h1) = f17(G)

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

• f10(B)

Compute P(G | H=h1 ). Elimination order: A,C,E,I,B,D,F

• f11(B,D,E)

• f12(B,D,F,G)

• f13(G)

• f14(D,F,G)

• f15(F,G)

• f16(G)

f17(G)

Page 31: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Step 6: Normalize

• f9(G)• f0(A)

• f1(B,A)

• f2(C)

• f3(D,B,C)

• f4(E,C)

• f5(F, D)

• f6(G,F,E)

• f7(H,G)

• f8(I,G)

• f10(B)

Compute P(G | H=h1 ).

• f11(B,D,E)

• f12(B,D,F,G)

• f13(G)

• f14(D,F,G)

• f15(F,G)

• f16(G)

f17(G)

P(G=g | H=h1) = P(G=g, H=h1) / P(H=h1)
             = P(G=g, H=h1) / Σ_{g' ∈ dom(G)} P(G=g', H=h1)
             = f17(g) / Σ_{g' ∈ dom(G)} f17(g')

Page 32: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

VE and conditional independence

• So far, we haven't used conditional independence!
  – Before running VE, we can prune all variables Z that are conditionally independent of the query Y given evidence E: Z ╨ Y | E
  – They cannot change the belief over Y given E!


• Example: which variables can we prune for the query P(G=g| C=c1, F=f1, H=h1) ?

Page 33: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

VE and conditional independence

• So far, we haven't used conditional independence!
  – Before running VE, we can prune all variables Z that are conditionally independent of the query Y given evidence E: Z ╨ Y | E
  – They cannot change the belief over Y given E!


• Example: which variables can we prune for the query P(G=g| C=c1, F=f1, H=h1) ?

  – A, B, and D. Both paths from these nodes to G are blocked:
    • F is an observed node in a chain structure
    • C is an observed common parent
  – Thus, we only need to consider this subnetwork

Page 34: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Variable elimination: pruning


Thus, if the query is

P(G=g| C=c1, F=f1, H=h1)

we only need to consider this

subnetwork

• We can also prune unobserved leaf nodes
  • Since they are unobserved and not predecessors of the query nodes, they cannot influence the posterior probability of the query nodes

Page 35: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

One last trick

• We can also prune unobserved leaf nodes
  – And we can do so recursively

• E.g., which nodes can we prune if the query is P(A)?

• Recursively prune unobserved leaf nodes: we can prune all nodes other than A!

Page 36: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Complexity of Variable Elimination (VE)

• A factor over n binary variables has to store 2^n numbers
  – The initial factors are typically quite small (variables typically have only a few parents in Bayesian networks)
  – But variable elimination constructs larger factors by multiplying factors together

• The complexity of VE is exponential in the maximum number of variables in any factor during its execution
  – This number is called the treewidth of the graph (along an ordering)
  – The elimination ordering influences the treewidth

• Finding the best ordering is NP-complete
  – I.e., the ordering that generates the minimum treewidth
  – Heuristics work well in practice (e.g., least connected variables first)
  – Even with the best ordering, inference is sometimes infeasible

• In those cases, we need approximate inference.

Page 37: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

VE in AISpace

• To see how variable elimination works in the AIspace applet
  • Select "Network options -> Query Models -> verbose"
  • Compare what happens when you select "Prune Irrelevant variables" or not in the VE window that pops up when you query a node
  • Try different heuristics for elimination ordering

Page 38: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Lecture Overview

• Exact Inference: Variable Elimination
  • Factors
  • Algorithm

• Approximate Inference: sampling methods
  • Forward (prior) sampling
  • Rejection sampling
  • Likelihood weighting

Page 39: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Sampling: What is it?

• Problem: how to estimate probability distributions that are hard to compute via exact methods

• Idea: estimate probabilities from sample data (samples) drawn from the (unknown) probability distribution

• Use the frequency of each event in the sample data to approximate its probability
  – Frequencies are good approximations only if they are based on large samples

• But these samples are often not easy to obtain from real-world observations. How do we get the samples?

Page 40: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

We use Sampling

• Sampling is a process to obtain samples adequate to estimate an unknown probability

• The samples are generated from a known probability distribution


Page 41: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Generating Samples from a Distribution

• For a random variable X with
  – values {x1,…,xk}
  – probability distribution P(X) = {P(x1),…,P(xk)}

• Partition the interval (0, 1] into k intervals pi, one for each xi, with length P(xi)

• To generate one sample
  • Randomly generate a value y in (0, 1] (i.e., generate a value from a uniform distribution over (0, 1])
  • Select the value of the sample based on the interval pi that includes y

• From probability theory:

P(y ∈ pi) = Length(pi) = P(xi)

Page 42: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Example

• Consider a random variable Lecture with
  • 3 values <good, bad, soso>
  • with probabilities 0.7, 0.1 and 0.2 respectively

• We can build a sampler for this distribution by:
  – Using a random number generator that outputs numbers over (0, 1]
  – Partitioning (0, 1] into 3 intervals corresponding to the probabilities of the three Lecture values: (0, 0.7], (0.7, 0.8] and (0.8, 1]
  – To obtain a sample, generating a random number n and picking the value for Lecture based on which interval n falls into:
    • P(0 < n ≤ 0.7) = 0.7 = P(Lecture = good)
    • P(0.7 < n ≤ 0.8) = 0.1 = P(Lecture = bad)
    • P(0.8 < n ≤ 1) = 0.2 = P(Lecture = soso)
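A minimal Python sketch of this sampler (illustrative code, not from the textbook): it partitions the unit interval according to the probabilities and returns the value whose interval contains the random draw.

```python
import random

def sample_discrete(dist):
    """dist: list of (value, probability) pairs that sum to 1."""
    n = random.random()          # uniform draw; note random() returns values in [0, 1)
    cumulative = 0.0
    for value, p in dist:
        cumulative += p
        if n < cumulative:       # n falls in this value's interval
            return value
    return dist[-1][0]           # guard against floating-point round-off

lecture = [("good", 0.7), ("bad", 0.1), ("soso", 0.2)]
samples = [sample_discrete(lecture) for _ in range(10000)]
print(samples.count("good") / len(samples))   # close to 0.7 for large sample sizes
```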

Page 43: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Example

Random n | Sample
0.73     | bad
0.2      | good
0.87     | soso
0.1      | good
0.9      | soso
0.5      | good
0.3      | good

If we generate enough samples, the frequencies of the three values will get close to their probabilities.

P(0 < n ≤ 0.7) = 0.7 = P(Lecture = good)
P(0.7 < n ≤ 0.8) = 0.1 = P(Lecture = bad)
P(0.8 < n ≤ 1) = 0.2 = P(Lecture = soso)

Page 44: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Samples as Probabilities

• Count the total number of samples, m

• Count the number ni of samples with value xi

• Compute the frequency of xi as ni / m

• This frequency is your estimated probability of xi

Page 45: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Sampling for Bayesian Networks

OK, but how can we use all this for probabilistic inference in Bayesian networks?

As we said earlier, if we can’t use exact algorithms to update the network, we need to resort to samples and frequencies to compute the probabilities we are interested in

We generate these samples by relying on the mechanism we just described

Page 46: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Sampling for Bayesian Networks (cont.)

Suppose we have the following BN with two binary variables

It corresponds to the joint probability distribution • P(A,B) =P(B|A)P(A)

To sample from this distribution• we first sample from P(A). Suppose we get A = 0.

• In this case, we then sample from P(B|A = 0).

• If we had sampled A = 1, then in the second step we would have sampled from P(B|A = 1).

Network: A → B

P(A=1) = 0.3

P(B=1|A):  A=1: 0.7,  A=0: 0.1

Page 47: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Forward (or Prior) Sampling

In a BN
• we can order parents before children (topological order), and
• we have CPTs available.

If no variables are instantiated (i.e., there is no evidence), this allows a simple algorithm: forward sampling.
• Just sample variables in some fixed topological order, using the previously sampled values of the parents to select the correct distribution to sample from.

Page 48: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Forward (or Prior) Sampling
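The slide shows the Prior-Sample pseudocode as an algorithm box that did not survive extraction. Below is a hedged Python sketch of the same idea; the (variable, parents, cpt) encoding of the network, with each CPT giving P(variable = true) per parent assignment, is an assumption of these examples, not the textbook's code.

```python
import random

def prior_sample(network):
    """Forward sampling: network is a topologically ordered list of (var, parents, cpt),
    where cpt maps tuples of parent values to P(var = True)."""
    sample = {}
    for var, parents, cpt in network:          # parents come before children
        p_true = cpt[tuple(sample[p] for p in parents)]
        sample[var] = random.random() < p_true
    return sample

# The two-variable network from the previous slide, in the assumed encoding
net = [
    ("A", (), {(): 0.3}),                            # P(A=1) = 0.3
    ("B", ("A",), {(True,): 0.7, (False,): 0.1}),    # P(B=1|A)
]
print(prior_sample(net))     # e.g. {'A': False, 'B': False}
```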

Page 49: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Example

Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler and Rain → Wet Grass

P(C=T) = 0.5

P(S=T|C):  C=T: 0.1,  C=F: 0.5

P(R=T|C):  C=T: 0.8,  C=F: 0.2

P(W=T|S,R):  S=T,R=T: 0.99;  S=T,R=F: 0.9;  S=F,R=T: 0.9;  S=F,R=F: 0.1

Random => 0.4   Sample => Cloudy =
Random => 0.8   Sample => Sprinkler =
Random => 0.4   Sample => Rain =
Random => 0.7   Sample => Wet Grass =

Page 50: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Example

sample # | Cloudy | Rain | Sprinkler | Wet Grass
1        | T      | T    | F         | T
2        | ...    | ...  | ...       | ...
3        | ...    | ...  | ...       | ...
...
n        | ...    | ...  | ...       | ...

We generate as many samples as we can afford. Then we can use them to compute the probability of any partially specified event,
• e.g., the probability of Rain = T in the Sprinkler network
by computing the frequency of that event in the sample set. So, if we generate 1000 samples from the Sprinkler network, and 511 of them have Rain = T, then the estimated probability of rain is 511/1000 = 0.511.
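As a hedged usage sketch, the Sprinkler network above can be encoded for the prior_sample helper from the earlier forward-sampling example and used to estimate P(Rain = T) by frequency (the encoding is an assumption of these notes):

```python
sprinkler_net = [
    ("Cloudy", (), {(): 0.5}),
    ("Sprinkler", ("Cloudy",), {(True,): 0.1, (False,): 0.5}),
    ("Rain", ("Cloudy",), {(True,): 0.8, (False,): 0.2}),
    ("WetGrass", ("Sprinkler", "Rain"), {(True, True): 0.99, (True, False): 0.9,
                                         (False, True): 0.9, (False, False): 0.1}),
]
samples = [prior_sample(sprinkler_net) for _ in range(1000)]
# Frequency of Rain = T; the true value is 0.5, so a run like 511/1000 = 0.511 is typical
print(sum(s["Rain"] for s in samples) / len(samples))
```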

Page 51: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Sampling: why does it work?

• Because of the Law of Large Numbers (LLN)
  – A theorem in probability that describes the long-term stability of a random variable.

• Given repeated, independent trials with the same probability p of success in each trial, the percentage of successes approaches p as the number of trials approaches ∞

• Example: tossing a coin a large number of times, where the probability of heads on any toss is p.

– Let Sn be the number of heads that come up after n tosses.

Page 52: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Simulating Coin Tosses

• Here’s a graph of Sn/n against p for three different sequences of simulated coin tosses, each of length 200.

• Remember that P(head) = 0.5

Page 53: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Simulating Tosses with a Biased Coin

• Let's change P(head) = 0.8

http://socr.ucla.edu/htmls/SOCR_Experiments.html

Page 54: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Forward Sampling

This algorithm generates samples from the prior joint distribution represented by the network.

Let S_PS(x1,…,xn) be the probability that a specific event (x1,…,xn) is generated by the algorithm (the sampling probability of the event).

Just by looking at the sampling process, we have

S_PS(x1,…,xn) = ∏_{i=1}^{n} P(xi | parents(Xi))

because each sampling step depends only on the parents' values. But this is also the probability of event (x1,…,xn) according to the BN representation of the JPD.

• Thus: S_PS(x1,…,xn) = P(x1,…,xn)

Page 55: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Forward Sampling

Because of this result, we can answer any query on the distribution over X1,…,Xn represented by our Bnet using the samples generated with Prior-Sample(Bnet).

Let N be the total number of samples generated, and N_PS(x1,…,xn) be the number of samples generated for event (x1,…,xn).

We expect the frequency of a given event to converge to its expected value, according to its sampling probability:

lim_{N→∞} N_PS(x1,…,xn) / N = S_PS(x1,…,xn) = P(x1,…,xn)

That is, for large N, N_PS(x1,…,xn) / N ≈ P(x1,…,xn): the event frequency in a sample set generated via forward sampling is a consistent estimate of that event's probability.

For example, given (cloudy, ⌐sprinkler, rain, wet-grass) with sampling probability 0.324, we expect, for large sample sets, about 32% of the samples to correspond to this event.

Page 56: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Sampling: why does it work?

• The LLN is important because it "guarantees" stable long-term results for random events.

• However, it is important to remember that the LLN only applies (as the name indicates) when a large number of observations are considered.

• There is no principle ensuring that a small number of observations will converge to the expected value or that a streak of one value will immediately be "balanced" by the others.

• So, how many samples are enough?

Page 57: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Hoeffding's inequality

Suppose p is the true probability of an event and s is the sample average (frequency) from n independent samples. Then

P(|s − p| > ε) ≤ 2e^(−2nε²)    (1)

p above can be the probability of any event for random variables X = {X1,…,Xn} described by a Bayesian network.

If you want an infinitely small probability of having an error greater than ε, you need infinitely many samples. But if you settle on something less than infinitely small, say δ, then you just need

2e^(−2nε²) ≤ δ

So you pick
• the error ε you can tolerate,
• the frequency δ with which you can tolerate it,
and solve (1) for n, i.e., the number of samples that can ensure this performance.

Page 58: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Hoeffding's inequality: Examples

• You can tolerate an error greater than 0.1 only in 5% of your cases
  • Set ε = 0.1, δ = 0.05
  • Equation (1) gives you n > 184

• If you can tolerate the same error (0.1) only in 1% of the cases, then you need 265 samples

• If you want an error of 0.01 in no more than 5% of the cases, you need 18,445 samples
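A small Python sketch (assumed, illustrative code) that solves the bound 2·e^(−2nε²) ≤ δ for n and reproduces the numbers above:

```python
from math import ceil, log

def samples_needed(eps, delta):
    """Smallest n with 2*exp(-2*n*eps**2) <= delta."""
    return ceil(log(2.0 / delta) / (2.0 * eps ** 2))

print(samples_needed(0.1, 0.05))    # 185  (the slide's n > 184)
print(samples_needed(0.1, 0.01))    # 265
print(samples_needed(0.01, 0.05))   # 18445
```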

Page 59: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Rejection Sampling

• Used specifically to compute conditional probabilities P(X|e) given some evidence e
  – It generates samples from the prior distribution specified by the Bnet (e.g., by using the Prior-Sample algorithm we saw earlier)
  – It rejects all those samples that do not match the evidence e
  – Then, for every value x of X, it estimates P(X = x | e) by the frequency with which X = x occurs in the remaining samples (i.e., those consistent with e)

Page 60: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Rejection Sampling
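The slide shows the Rejection-Sampling pseudocode as an algorithm box that did not survive extraction. A hedged sketch, reusing the prior_sample helper and network encoding from the forward-sampling example:

```python
def rejection_sampling(network, query_var, evidence, num_samples):
    """Estimate P(query_var | evidence) by discarding samples inconsistent with the evidence."""
    counts = {}
    for _ in range(num_samples):
        sample = prior_sample(network)
        if all(sample[var] == val for var, val in evidence.items()):
            value = sample[query_var]
            counts[value] = counts.get(value, 0) + 1
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}

# e.g. estimate P(Rain | Sprinkler = True) on the Sprinkler network from earlier:
# print(rejection_sampling(sprinkler_net, "Rain", {"Sprinkler": True}, 100))
```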

Page 61: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Example

• Estimate P(Rain|sprinkler) using 100 samples
• 27 samples have sprinkler
• Of these, 8 have rain and 19 have ⌐rain
• Estimated P(Rain|sprinkler) = P*(Rain|sprinkler) = <8/27, 19/27>

Page 62: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Analysis of Rejection Sampling

• Main problem: it rejects too many samples when the evidence is not very likely.

• Consider our previous example, the network A → B with

P(A=T) = 0.3

P(B=T|A):  A=T: 0.7,  A=F: 0.1

If we are interested in P(B|A=T),

• we can only use samples with A = T,

• but we only have 0.3 probability of getting A = T from Prior-Sample,

• so we will likely have to reject about 70% of our samples

Things get exponentially worse as the number of evidence variables grows

Page 63: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Analysis of Rejection Sampling

The network now also includes a variable C with P(C=T) = 0.1 (and P(A=T) = 0.3 as before).

If we are interested in P(B|A=T, C=T),
• we can only use samples of the form (A=T, B, C=T), but the probability of getting them is
• P(A=T)P(C=T) = 0.03
• so we should expect to reject on the order of 97% of the samples!

Page 64: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Analysis of Rejection Sampling

Note that rejection sampling resembles how probabilities are estimated from observations in the real world.

• You need the right type of event to happen in order to have a valid observation
  e.g., P(rains tomorrow | redSky tonight)

• If the event is not common, you have to wait a long time to gather enough relevant events
  e.g., nights with red sky

Page 65: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Likelihood Weighting (LW)

Avoids the inefficiency of rejection sampling by generating only events that are consistent with the evidence e

• Fixes the values for the evidence variables E

• Samples only the remaining variables Z, i.e., the query X and hidden variables Y

But it still needs to account for the influence of the given evidence on the probability of the samples, otherwise it would not be sampling the correct distribution

Page 66: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Likelihood Weighting (LW)

If the samples come from the correct distribution (e.g., the prior distribution of the Bnet in forward sampling or rejection sampling),
• we simply count the number of samples with the desired values for the query variable X

In LW, before counting, each sample is weighted to account for the actual likelihood of the corresponding event given the original probability distribution and the evidence

Basically, the point is to make events which are unlikely given the actual evidence have less weight than others

Page 67: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Example: P(Rain|sprinkler, wet-grass)

Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler and Rain → Wet Grass

P(C) = 0.5

P(S|C):  C=T: 0.1,  C=F: 0.5

P(R|C):  C=T: 0.8,  C=F: 0.2

P(W|S,R):  S=T,R=T: 0.99;  S=T,R=F: 0.9;  S=F,R=T: 0.9;  S=F,R=F: 0.1

Start with weight w1 = 1

Random => 0.4   Sample => cloudy

Sprinkler is fixed (evidence): no sampling, but adjust the weight
w2 = w1 * P(sprinkler|cloudy) = 1 * 0.1 = 0.1

Random => 0.4   Sample => rain

Wet Grass is fixed (evidence): no sampling, but adjust the weight
w3 = w2 * P(wet-grass|sprinkler, rain) = 0.1 * 0.99 = 0.099

Page 68: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Example: P(Rain|sprinkler, wet-grass)

Basically, LW generated the sample

<cloudy, sprinkler, rain, wet-grass>

by cheating, because we did not sample the evidence variables, we just set their values to what we wanted

But LW makes up for the cheating by giving a relatively low weight to this sample (0.099), reflecting the fact that it is not very likely

LW uses the weighted value when counting through all generated samples to compute the frequencies of the events of interest

Page 69: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Likelihood Weighting (LW)

Fix the evidence variables, sample only the non-evidence variables, and weight each sample by the likelihood it accords the evidence
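The corresponding Weighted-Sample / Likelihood-Weighting pseudocode appears on the slide as an algorithm box that did not survive extraction. A hedged sketch in the same assumed network encoding as the earlier sampling examples:

```python
import random

def weighted_sample(network, evidence):
    """Sample non-evidence variables; weight by the likelihood of the evidence values."""
    sample, weight = {}, 1.0
    for var, parents, cpt in network:
        p_true = cpt[tuple(sample[p] for p in parents)]
        if var in evidence:                   # evidence: fix the value, adjust the weight
            sample[var] = evidence[var]
            weight *= p_true if evidence[var] else 1.0 - p_true
        else:                                 # non-evidence: sample as in forward sampling
            sample[var] = random.random() < p_true
    return sample, weight

def likelihood_weighting(network, query_var, evidence, num_samples):
    totals = {}
    for _ in range(num_samples):
        sample, weight = weighted_sample(network, evidence)
        value = sample[query_var]
        totals[value] = totals.get(value, 0.0) + weight
    norm = sum(totals.values())
    return {value: w / norm for value, w in totals.items()} if norm else {}

# e.g. estimate P(Rain | Sprinkler = True, WetGrass = True) on the Sprinkler network:
# print(likelihood_weighting(sprinkler_net, "Rain", {"Sprinkler": True, "WetGrass": True}, 1000))
```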

Page 70: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Likelihood Weighting: why does it work?

Remember that we have fixed the evidence variables E, and we are sampling only the other variables Z = X ∪ Y
• The "right" distribution to sample from would be P(Z|e), but we have seen with rejection sampling that this is really inefficient

So which distribution is Weighted-Sample using?

By looking at the algorithm, we see that to get an event (z,e) the algorithm samples each variable in Z given its parents:

• S_WS(z, e) = ∏_i P(zi | parents(Zi))

Page 71: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Likelihood Weighting: why does it work?

The parents of Zi can include both hidden variables Y and evidence variables E, so S_WS does pay some attention to the evidence

The sampled values for each Zi will be influenced by evidence among Zi ancestors.

• Better than sampling from P(z) in this respect (i.e. by completely ignoring E)

However, they won’t be influenced by any evidence that is not an ancestor of Zi.

• S_WS(z, e) pays less attention to the evidence than the true distribution P(z|e)

The weights are there to make up for the difference

Page 72: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

How? The weight for sample x = (z, e) is the product of the likelihood of each evidence variable given its parents:

• w(z, e) = ∏_j P(ej | parents(Ej))

So the weighted probability of a sample (used by Likelihood Weighting instead of simple counts) can be expressed as

• S_WS(z, e) w(z, e) = ∏_i P(zi | parents(Zi)) * ∏_j P(ej | parents(Ej)) = P(z, e)    (2)

• by the definition of Bayesian networks, because the two products cover all the variables in the network

And it can be shown that Likelihood weighting generates consistent estimates for P(X|e)

Page 73: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Analysis of LW

Contrary to Rejection Sampling, LW uses all the samples that it generates.

It simulates the effect of many rejections via weighting.

Example: the network A → B with

P(A=T) = 0.3

P(B=T|A):  A=T: 0.003,  A=F: 0.63

Page 74: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Analysis of LW

Same network: A → B, with P(A=T) = 0.3 and P(B=T|A): A=T: 0.003, A=F: 0.63.

• Suppose B = true, and we are looking for samples with A = true.

• If there are 1000 of them, likely only 3 will have B = true,

• These are the only ones that would not be rejected by Rejection Sampling.

• With LW, we fix B = true, and give the very first sample with A = true a weight of P(B=true|A=true) = 0.003

• Thus LW simulates the effect of many rejections with only one sample

Page 75: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Efficiency of LW

Likelihood weighting is good
• It uses all the samples that it generates, so it is much more efficient than Rejection Sampling
• It takes evidence into account when generating the sample
  • E.g., here, W's value will get picked based on the evidence values of S and R

But it doesn't solve all our problems
• Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence)
• Performance still degrades when the number of evidence variables increases
  • The more evidence variables, the more samples will have small weight
  • If one does not generate enough samples, the weighted estimate will be dominated by the small fraction of samples with substantial weight

Possible approaches
• Sample some of the variables, while using exact inference to generate the posterior probability for others
• Use Monte Carlo simulations (beyond the scope of this course), or Particle Filtering (may see in next classes)

[Slide diagram: the Cloudy → Sprinkler/Rain → Wet Grass network, with S and R observed.]

Page 76: Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)

Learning Goals For Inference in Bnets

• Variable elimination
  • Understanding factors and their operations
  • Carry out variable elimination by using factors and the related operations
  • Use techniques to simplify variable elimination

• Approximate Inference
  • Explain why we need approximate algorithms in Bayesian networks
  • Define what sampling is and how it works at the general level
  • Describe how to generate samples from a distribution and be able to apply the process to a given distribution
  • Demonstrate how to apply sampling to the distribution represented by a given Bayesian Network
  • Explain/write in pseudocode/implement/trace/debug different sampling algorithms: forward sampling, rejection sampling, likelihood weighting (NOT Importance sampling from textbook)
  • Describe when each algorithm is appropriate, as well as its drawbacks. Recommend the most appropriate sampling algorithm for a specific problem
  • Explain Hoeffding's inequality and describe what it allows us to compute. Apply Hoeffding's inequality to specific examples