Computer Science CPSC 502
Lecture 9
(up-to Ch. 6.4.2.4)
Lecture Overview
• Exact Inference: Variable Elimination
  • Factors
  • Algorithm
• Approximate Inference: sampling methods
  • Forward (prior) sampling
  • Rejection sampling
  • Likelihood weighting
Bayesian Networks: Types of Inference

Network: Fire → Alarm → Leaving (plus, for intercausal reasoning, Smoking at Sensor → Alarm)

• Diagnostic: People are leaving (L=t). P(F|L=t) = ?
• Predictive: Fire happens (F=t). P(L|F=t) = ?
• Intercausal: Alarm goes off (P(a) = 1.0) and a person smokes next to the sensor (S=t). P(F|A=t,S=t) = ?
• Mixed: People are leaving (L=t) and there is no fire (F=f). P(A|F=f,L=t) = ?

We will use the same reasoning procedure for all of these types
Inference

Def. of conditional probability:

P(Y | E1=e1, …, Ej=ej) = P(Y, E1=e1, …, Ej=ej) / P(E1=e1, …, Ej=ej)

• Y is the query variable;
• E1=e1, …, Ej=ej are the observed variables (with their values)
• Z1, …, Zk are the remaining variables

We need to compute the numerator for each value yi of Y, by marginalizing over all the variables Z1, …, Zk not involved in the query:

P(Y=yi, E1=e1, …, Ej=ej) = Σ_Z1 … Σ_Zk P(Y=yi, Z1, …, Zk, E1=e1, …, Ej=ej)

To compute the denominator, marginalize over Y. It has the same value for every P(Y=yi): the normalization constant ensuring that Σ_i P(Y=yi) = 1.

• Variable Elimination is an algorithm that efficiently performs these operations by casting them as operations between factors
Factors
• A factor is a function from a tuple of random variables to the real numbers R
• We write a factor on variables X1, …, Xj as f(X1, …, Xj)
• A factor denotes one or more (possibly partial) distributions over the given tuple of variables, e.g.,
  • P(X1, X2) is a factor f(X1, X2): a distribution
  • P(Z | X,Y) is a factor f(Z,X,Y): a set of distributions, one for each combination of values for X and Y
  • P(Z=f | X,Y) is a factor f(X,Y)Z=f: a set of partial distributions
• Note: Factors do not have to sum to one

Example factor f(X,Y,Z):

X Y Z | val
t t t | 0.1
t t f | 0.9
t f t | 0.2
t f f | 0.8
f t t | 0.4
f t f | 0.6
f f t | 0.3
f f f | 0.7
Operation 1: assigning a variable
• We can make new factors out of an existing factor
• Our first operation: we can assign some or all of the variables of a factor.

f(X,Y,Z):

X Y Z | val
t t t | 0.1
t t f | 0.9
t f t | 0.2
t f f | 0.8
f t t | 0.4
f t f | 0.6
f f t | 0.3
f f f | 0.7

What is the result of assigning X = t?

f(X=t,Y,Z) = f(X,Y,Z)X=t is a factor of Y and Z:

Y Z | val
t t | 0.1
t f | 0.9
f t | 0.2
f f | 0.8
More examples of assignment

f(X,Y,Z):

X Y Z | val
t t t | 0.1
t t f | 0.9
t f t | 0.2
t f f | 0.8
f t t | 0.4
f t f | 0.6
f f t | 0.3
f f f | 0.7

f(X=t,Y,Z) — a factor of Y, Z:

Y Z | val
t t | 0.1
t f | 0.9
f t | 0.2
f f | 0.8

f(X=t,Y,Z=f) — a factor of Y:

Y | val
t | 0.9
f | 0.8

f(X=t,Y=f,Z=f) — a number: 0.8
Operation 2: Summing out a variable
• Our second operation on factors: we can marginalize out (or sum out) a variable
  – Exactly as before. Only difference: factors don’t sum to 1
  – Marginalizing out a variable X from a factor f(X, X1, …, Xj) yields a new factor defined on {X1, …, Xj}:

    (Σ_X f)(X1, …, Xj) = Σ_{x ∈ dom(X)} f(X=x, X1, …, Xj)

f3:

B A C | val
t t t | 0.03
t t f | 0.07
f t t | 0.54
f t f | 0.36
t f t | 0.06
t f f | 0.14
f f t | 0.48
f f f | 0.32

What is (Σ_B f3)(A,C)?

A C | val
t t |
t f |
f t |
f f |
Answer:

(Σ_B f3)(A,C):

A C | val
t t | 0.57
t f | 0.43
f t | 0.54
f f | 0.46
Operation 3: multiplying factors

f1(A,B):

A B | val
t t | 0.1
t f | 0.9
f t | 0.2
f f | 0.8

f2(B,C):

B C | val
t t | 0.3
t f | 0.7
f t | 0.6
f f | 0.4

f1(A,B) × f2(B,C):

A B C | val
t t t |
t t f |
t f t |
t f f | …
f t t |
f t f |
f f t |
f f f |
Recap: Factors and Operations on Them
• If we assign variable A=a in factor f7(A,B), what is the correct form for the resulting factor?
  – f(B). When we assign variable A we remove it from the factor’s domain
• If we marginalize variable A out from factor f7(A,B), what is the correct form for the resulting factor?
  – f(B). When we marginalize out variable A we remove it from the factor’s domain
• If we multiply factors f4(X,Y) and f6(Z,Y), what is the correct form for the resulting factor?
  – f(X,Y,Z)
  – When multiplying factors, the resulting factor’s domain is the union of the multiplicands’ domains
• What is the correct form for Σ_B f5(A,B) × f6(B,C)?
  – As usual, product before sum: Σ_B ( f5(A,B) × f6(B,C) )
  – Result of multiplication: f(A,B,C). Then marginalize out B: f’(A,C)
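The three factor operations recapped above can be sketched in code. This is a minimal illustrative implementation (the dict-of-tuples representation is my own choice, not prescribed by the slides), using the f(X,Y,Z) and f1 × f2 tables from the previous slides.

```python
from itertools import product

def assign(factor, var, value):
    """Operation 1: set var = value and drop var from the factor's domain."""
    vars_, table = factor
    i = vars_.index(var)
    new_vars = vars_[:i] + vars_[i + 1:]
    new_table = {k[:i] + k[i + 1:]: v for k, v in table.items() if k[i] == value}
    return (new_vars, new_table)

def sum_out(factor, var):
    """Operation 2: marginalize (sum) var out of the factor."""
    vars_, table = factor
    i = vars_.index(var)
    new_vars = vars_[:i] + vars_[i + 1:]
    new_table = {}
    for k, v in table.items():
        key = k[:i] + k[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + v
    return (new_vars, new_table)

def multiply(f_a, f_b):
    """Operation 3: pointwise product; the result's domain is the
    union of the two domains (binary variables assumed here)."""
    vars_a, t_a = f_a
    vars_b, t_b = f_b
    new_vars = vars_a + [v for v in vars_b if v not in vars_a]
    new_table = {}
    for vals in product([True, False], repeat=len(new_vars)):
        asg = dict(zip(new_vars, vals))
        new_table[vals] = t_a[tuple(asg[v] for v in vars_a)] * \
                          t_b[tuple(asg[v] for v in vars_b)]
    return (new_vars, new_table)

# The f(X,Y,Z) table from the slides (t = True, f = False).
f = (['X', 'Y', 'Z'], {
    (True, True, True): 0.1, (True, True, False): 0.9,
    (True, False, True): 0.2, (True, False, False): 0.8,
    (False, True, True): 0.4, (False, True, False): 0.6,
    (False, False, True): 0.3, (False, False, False): 0.7,
})

g = assign(f, 'X', True)    # f(X=t,Y,Z): a factor of Y, Z
h = sum_out(g, 'Z')         # a factor of Y; h(Y=t) = 0.1 + 0.9 = 1.0

# The f1(A,B) x f2(B,C) example from the multiplication slide.
f1 = (['A', 'B'], {(True, True): 0.1, (True, False): 0.9,
                   (False, True): 0.2, (False, False): 0.8})
f2 = (['B', 'C'], {(True, True): 0.3, (True, False): 0.7,
                   (False, True): 0.6, (False, False): 0.4})
p = multiply(f1, f2)        # f(A,B,C); e.g. p(t,t,t) = 0.1 * 0.3 = 0.03
print(g[1][(True, True)], h[1][(True,)], p[1][(True, True, True)])
```

Note that `multiply` fills in the table that the exercise slide above leaves blank.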
Remember our goal

Def. of conditional probability:

P(Y | E=e) = P(Y, E=e) / P(E=e)

• Y: subset of variables that is queried
• E: subset of variables that are observed, E = e
• Z1, …, Zk: remaining variables in the JPD

• All we need to compute is the numerator: the joint probability of the query variable(s) and the evidence!

We need to compute this numerator for each value yi of Y, by marginalizing over all the variables Z1, …, Zk not involved in the query:

P(Y=yi, E=e) = Σ_Z1 … Σ_Zk P(Y=yi, Z1, …, Zk, E=e)

To compute the denominator, marginalize over Y. It has the same value for every P(Y=yi): the normalization constant ensuring that Σ_i P(Y=yi | E=e) = 1.

• Variable Elimination is an algorithm that efficiently performs this operation by casting it as operations between factors
Lecture Overview
• Exact Inference: Variable Elimination
  • Factors
  • Algorithm
• Approximate Inference: sampling methods
  • Forward (prior) sampling
  • Rejection sampling
  • Likelihood weighting
Variable Elimination: Intro (1)
• We can express the joint probability as a factor
  – f(Y, E1, …, Ej, Z1, …, Zk)
• We can compute P(Y, E1=e1, …, Ej=ej) by
  – Assigning the observed values E1=e1, …, Ej=ej
  – Marginalizing out the other variables Z1, …, Zk (those not involved in the query), one at a time
    • the order in which we do this is called our elimination ordering

P(Y, E1=e1, …, Ej=ej) = Σ_Zk … Σ_Z1 f(Y, E1, …, Ej, Z1, …, Zk)E1=e1, …, Ej=ej

• Are we done? No, this still represents the whole JPD (as a single factor)! We need to exploit the compactness of Bayesian networks.
Variable Elimination: Intro (2)

Recall the JPD of a Bayesian network:

P(X1, …, Xn) = Π_{i=1..n} P(Xi | X1, …, Xi-1) = Π_{i=1..n} P(Xi | pa(Xi))

• So we can express the joint factor as a product of factors, one for each conditional probability:

P(Xi | pa(Xi)) = f(Xi, pa(Xi))

P(Y, E1=e1, …, Ej=ej) = Σ_Zk … Σ_Z1 f(Y, E1, …, Ej, Z1, …, Zk)E1=e1, …, Ej=ej
                      = Σ_Zk … Σ_Z1 (Π_{i=1..n} fi)E1=e1, …, Ej=ej
Computing sums of products
• Inference in Bayesian networks thus reduces to computing sums of products:

P(Y, E1=e1, …, Ej=ej) = Σ_Zk … Σ_Z1 (Π_{i=1..n} fi)E1=e1, …, Ej=ej

• To compute these efficiently
  – Factor out those terms that don't involve Zk, e.g.:

Σ_Zk f1(Zk) f2(Y) f3(Zk,Y) f4(X,Y) = f2(Y) f4(X,Y) Σ_Zk f1(Zk) f3(Zk,Y)
Decompose sum of products

General case — separate the factors that contain Z1 from those that do not:

Σ_Zk … Σ_Z1 f1 × … × fn = Σ_Zk … Σ_Z2 (f1 × … × fi) × Σ_Z1 (fi+1 × … × fn)

where fi+1, …, fn are the factors that contain Z1. Then repeat, separating the factors that contain Z2 from those that contain neither Z2 nor Z1:

= Σ_Zk … Σ_Z3 (f1 × … × fj) × Σ_Z2 (fj+1 × … × fi) × Σ_Z1 (fi+1 × … × fn)

Etc., continuing given a predefined simplification ordering of the variables: the variable elimination ordering.
The variable elimination algorithm

To compute P(Y=yi | E=e):
1. Construct a factor for each conditional probability.
2. For each factor, assign the observed variables E to their observed values.
3. Given an elimination ordering, decompose the sum of products.
4. Sum out all variables Zi not involved in the query.
5. Multiply the remaining factors (which only involve Y).
6. Normalize: divide the resulting factor f(Y) by Σ_y f(y).

See the algorithm VE_BN in the P&M text, Section 6.4.1, Figure 6.8, p. 254.
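The six steps above can be traced end to end on a tiny hypothetical two-node network A → B, computing P(A | B=true). This is an illustrative sketch; the CPT numbers are invented, not taken from the slides.

```python
P_A = {True: 0.3, False: 0.7}                            # f0(A) = P(A)
P_B_given_A = {(True, True): 0.7, (True, False): 0.3,    # f1(B,A), keyed
               (False, True): 0.1, (False, False): 0.9}  # (a, b) -> P(B=b|A=a)

# Step 1: one factor per conditional probability: f0(A) and f1(B,A) above.
# Step 2: assign the observed value B = true in f1(B,A), leaving a factor of A.
f2 = {a: P_B_given_A[(a, True)] for a in (True, False)}
# Steps 3-4: there are no other variables Z to decompose over or sum out here.
# Step 5: multiply the remaining factors, all defined on A only.
f3 = {a: P_A[a] * f2[a] for a in (True, False)}          # f3(A) = P(A, B=true)
# Step 6: normalize by the sum over the query variable.
z = sum(f3.values())
posterior = {a: f3[a] / z for a in (True, False)}
print(posterior[True])   # 0.21 / (0.21 + 0.07) = 0.75
```

With more variables, steps 3 and 4 do the real work; here they are vacuous, which keeps the trace short.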
Variable elimination example

Compute P(G | H=h1).

P(G,H) = Σ_A,B,C,D,E,F,I P(A,B,C,D,E,F,G,H,I)
       = Σ_A,B,C,D,E,F,I P(A) P(B|A) P(C) P(D|B,C) P(E|C) P(F|D) P(G|F,E) P(H|G) P(I|G)

Step 1: Construct a factor for each conditional probability:

P(G,H) = Σ_A,B,C,D,E,F,I f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f7(H,G) f8(I,G)

Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G)
Compute P(G | H=h1).

Step 2: Assign the observed variables their observed values.

Previous state:
P(G,H) = Σ_A,B,C,D,E,F,I f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f7(H,G) f8(I,G)

Observe H = h1: f7(H,G) becomes the new factor f9(G):
P(G,H=h1) = Σ_A,B,C,D,E,F,I f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f9(G) f8(I,G)
Compute P(G | H=h1).

Step 3: Decompose the sum of products.

Previous state:
P(G,H=h1) = Σ_A,B,C,D,E,F,I f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f9(G) f8(I,G)

Elimination ordering A, C, E, I, B, D, F:
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C) Σ_A f0(A) f1(B,A)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4: Sum out non-query variables (one at a time).

Previous state:
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C) Σ_A f0(A) f1(B,A)

Eliminate A: perform the product f0(A) f1(B,A) and sum out A, yielding the new factor f10(B):
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C)

f10(B) does not depend on C, E, or I, so we can push it outside of those sums.
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) Σ_C f2(C) f3(D,B,C) f4(E,C)

Eliminate C: perform the product f2(C) f3(D,B,C) f4(E,C) and sum out C, yielding the new factor f11(B,D,E):
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) f11(B,D,E)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) Σ_I f8(I,G) Σ_E f6(G,F,E) f11(B,D,E)

Eliminate E: perform the product f6(G,F,E) f11(B,D,E) and sum out E, yielding the new factor f12(B,D,F,G):
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G) Σ_I f8(I,G)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G) Σ_I f8(I,G)

Eliminate I: sum out I from f8(I,G), yielding the new factor f13(G):
P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) Σ_B f10(B) f12(B,D,F,G)

Eliminate B: perform the product f10(B) f12(B,D,F,G) and sum out B, yielding the new factor f14(D,F,G):
P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) f14(D,F,G)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) f13(G) Σ_F Σ_D f5(F,D) f14(D,F,G)

Eliminate D: perform the product f5(F,D) f14(D,F,G) and sum out D, yielding the new factor f15(F,G):
P(G,H=h1) = f9(G) f13(G) Σ_F f15(F,G)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F.

Step 4 (continued):

Previous state:
P(G,H=h1) = f9(G) f13(G) Σ_F f15(F,G)

Eliminate F: sum out F from f15(F,G), yielding the new factor f16(G):
P(G,H=h1) = f9(G) f13(G) f16(G)
Compute P(G | H=h1).

Step 5: Multiply the remaining factors.

Previous state:
P(G,H=h1) = f9(G) f13(G) f16(G)

Multiply the remaining factors (all defined on G only), yielding f17(G):
P(G,H=h1) = f17(G)
Compute P(G | H=h1).

Step 6: Normalize.

P(G=g | H=h1) = P(G=g, H=h1) / P(H=h1) = f17(g) / Σ_{g' ∈ dom(G)} f17(g')
VE and conditional independence
• So far, we haven’t used conditional independence!
  – Before running VE, we can prune all variables Z that are conditionally independent of the query Y given evidence E: Z ╨ Y | E
  – They cannot change the belief over Y given E!
• Example: which variables can we prune for the query P(G=g | C=c1, F=f1, H=h1)?
• Answer: A, B, and D. Both paths from these nodes to G are blocked:
  • F is an observed node in a chain structure
  • C is an observed common parent
• Thus, we only need to consider this subnetwork
Variable elimination: pruning

Thus, if the query is P(G=g | C=c1, F=f1, H=h1), we only need to consider this subnetwork.

• We can also prune unobserved leaf nodes
  • Since they are unobserved and not predecessors of the query nodes, they cannot influence the posterior probability of the query nodes
One last trick
• We can also prune unobserved leaf nodes, and we can do so recursively
• E.g., which nodes can we prune if the query is P(A)?
  • Recursively prune unobserved leaf nodes: we can prune all nodes other than A!
Complexity of Variable Elimination (VE)
• A factor over n binary variables has to store 2^n numbers
  – The initial factors are typically quite small (variables typically have only a few parents in Bayesian networks)
  – But variable elimination constructs larger factors by multiplying factors together
• The complexity of VE is exponential in the maximum number of variables in any factor during its execution
  – This number is called the treewidth of the graph (along an ordering)
  – The elimination ordering influences the treewidth
• Finding the best ordering, i.e., the ordering that generates the minimum treewidth, is NP-complete
  – Heuristics work well in practice (e.g., least connected variables first)
  – Even with the best ordering, inference is sometimes infeasible
• In those cases, we need approximate inference.
VE in AISpace
• To see how variable elimination works in the AISpace applet:
  • Select “Network options -> Query Models -> verbose”
  • Compare what happens when you select “Prune Irrelevant variables” or not in the VE window that pops up when you query a node
  • Try different heuristics for elimination ordering
Lecture Overview
• Exact Inference: Variable Elimination
  • Factors
  • Algorithm
• Approximate Inference: sampling methods
  • Forward (prior) sampling
  • Rejection sampling
  • Likelihood weighting
Sampling: What is it?
• Problem: how to estimate probability distributions that are hard to compute via exact methods.
• Idea: estimate probabilities from sample data (samples) of the (unknown) probability distribution
• Use the frequency of each event in the sample data to approximate its probability
  – Frequencies are good approximations only if based on large samples
• But these samples are often not easy to obtain from real-world observations. How do we get the samples?

We use Sampling
• Sampling is a process to obtain samples adequate to estimate an unknown probability
• The samples are generated from a known probability distribution P(x1), …, P(xn)
Generating Samples from a Distribution
• For a random variable X with
  – values {x1, …, xk}
  – probability distribution P(X) = {P(x1), …, P(xk)}
• Partition the interval (0, 1] into k intervals pi, one for each xi, with length P(xi)
• To generate one sample
  • Randomly generate a value y in (0, 1] (i.e., generate a value from a uniform distribution over (0, 1]).
  • Select the value of the sample based on the interval pi that includes y
• From probability theory: P(y ∈ pi) = Length(pi) = P(xi)
Example
• Consider a random variable Lecture with
  • 3 values <good, bad, soso>
  • with probabilities 0.7, 0.1 and 0.2 respectively.
• We can build a sampler for this distribution by:
  – Using a random number generator that outputs numbers over (0, 1]
  – Partitioning (0, 1] into 3 intervals corresponding to the probabilities of the three Lecture values: (0, 0.7], (0.7, 0.8] and (0.8, 1]
  – To obtain a sample, generate a random number n and pick the value for Lecture based on which interval n falls into:
    • P(0 < n ≤ 0.7) = 0.7 = P(Lecture = good)
    • P(0.7 < n ≤ 0.8) = 0.1 = P(Lecture = bad)
    • P(0.8 < n ≤ 1) = 0.2 = P(Lecture = soso)
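The interval-partition sampler above can be sketched as follows, applied to the Lecture variable. This is an illustrative implementation; the function name `make_sampler` is my own.

```python
import random

def make_sampler(values, probs):
    """Partition (0, 1] into one interval per value, with length P(x_i);
    a uniform draw selects the value whose interval it lands in."""
    cumulative = []
    total = 0.0
    for p in probs:
        total += p
        cumulative.append(total)
    def sample():
        y = random.random()
        for v, c in zip(values, cumulative):
            if y < c:
                return v
        return values[-1]   # guard against floating-point round-off
    return sample

random.seed(0)
lecture = make_sampler(['good', 'bad', 'soso'], [0.7, 0.1, 0.2])
counts = {'good': 0, 'bad': 0, 'soso': 0}
n = 10000
for _ in range(n):
    counts[lecture()] += 1
print(counts['good'] / n)   # frequency approaches P(Lecture = good) = 0.7
```

With enough draws the frequencies converge to the probabilities, which is exactly the point of the next slide.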
Example

Random n | sample
0.73     | bad
0.2      | good
0.87     | soso
0.1      | good
0.9      | soso
0.5      | good
0.3      | good

If we generate enough samples, the frequencies of the three values will get close to their probabilities:
P(0 < n ≤ 0.7) = 0.7 = P(Lecture = good)
P(0.7 < n ≤ 0.8) = 0.1 = P(Lecture = bad)
P(0.8 < n ≤ 1) = 0.2 = P(Lecture = soso)
Samples as Probabilities
• Count the total number of samples, m
• Count the number ni of samples with value xi
• The frequency of value xi is ni / m
• This frequency is your estimated probability of xi
Sampling for Bayesian Networks
OK, but how can we use all this for probabilistic inference in Bayesian networks?
As we said earlier, if we can’t use exact algorithms to update the network, we need to resort to samples and frequencies to compute the probabilities we are interested in
We generate these samples by relying on the mechanism we just described
Sampling for Bayesian Networks (cont.)

Suppose we have the following BN with two binary variables, A → B, with CPTs:

P(A=1) = 0.3
P(B=1|A): A=1: 0.7; A=0: 0.1

It corresponds to the joint probability distribution
• P(A,B) = P(B|A)P(A)

To sample from this distribution
• we first sample from P(A). Suppose we get A = 0.
• In this case, we then sample from P(B|A=0).
• If we had sampled A = 1, then in the second step we would have sampled from P(B|A=1).
Forward (or Prior) Sampling

In a BN
• we can order parents before children (topological order), and
• we have CPTs available.

If no variables are instantiated (i.e., there is no evidence), this allows a simple algorithm: forward sampling.
• Just sample variables in some fixed topological order, using the previously sampled values of the parents to select the correct distribution to sample from.
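Forward sampling can be sketched for the sprinkler network used in the example that follows (CPTs as given there), sampling in the topological order C, S, R, W. This is an illustrative sketch, not the textbook's Prior-Sample pseudocode.

```python
import random

def prior_sample(rng):
    """One forward sample of (Cloudy, Sprinkler, Rain, WetGrass)."""
    c = rng.random() < 0.5                       # P(C=T) = 0.5
    s = rng.random() < (0.1 if c else 0.5)       # P(S=T|C)
    r = rng.random() < (0.8 if c else 0.2)       # P(R=T|C)
    p_w = {(True, True): 0.99, (True, False): 0.9,
           (False, True): 0.9, (False, False): 0.1}
    w = rng.random() < p_w[(s, r)]               # P(W=T|S,R)
    return c, s, r, w

rng = random.Random(0)
n = 10000
samples = [prior_sample(rng) for _ in range(n)]
# Estimate P(Rain = T) as a frequency; the true value is
# 0.5 * 0.8 + 0.5 * 0.2 = 0.5.
p_rain = sum(1 for _, _, r, _ in samples if r) / n
print(p_rain)
```

Each variable is drawn from its CPT row selected by the already-sampled parent values, which is all the algorithm requires.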
Forward (or Prior) Sampling Example

Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler and Rain → Wet Grass.

P(C=T) = 0.5
P(S=T|C): C=T: 0.1; C=F: 0.5
P(R=T|C): C=T: 0.8; C=F: 0.2

P(W=T|S,R):
S R | P(W=T|S,R)
T T | 0.99
T F | 0.9
F T | 0.9
F F | 0.1

Sampling trace:
Random => 0.4  Sample => Cloudy = T      (0.4 ≤ 0.5)
Random => 0.8  Sample => Sprinkler = F   (0.8 > P(S=T|C=T) = 0.1)
Random => 0.4  Sample => Rain = T        (0.4 ≤ P(R=T|C=T) = 0.8)
Random => 0.7  Sample => Wet Grass = T   (0.7 ≤ P(W=T|S=F,R=T) = 0.9)
Example

sample # | Cloudy | Sprinkler | Rain | Wet Grass
1        | T      | F         | T    | T
2        | …      | …         | …    | …
3        | …      | …         | …    | …
…
n        | …      | …         | …    | …

We generate as many samples as we can afford. Then we can use them to compute the probability of any partially specified event,
• e.g., the probability of Rain = T in my Sprinkler network,
by computing the frequency of that event in the sample set. So, if we generate 1000 samples from the Sprinkler network, and 511 of them have Rain = T, then the estimated probability of rain is 511/1000 = 0.511.
Sampling: why does it work?• Because of the Law of Large Numbers (LLN)
– Theorem in probability that describes the long-term stability of a random variable.
• Given repeated, independent trials with the same probability p of success in each trial, – the percentage of successes approaches p as the number of trials
approaches ∞
• Example: tossing a coin a large number of times, where the probability of heads on any toss is p.
– Let Sn be the number of heads that come up after n tosses.
Simulating Coin Tosses
• Here’s a graph of Sn/n against n, compared with p, for three different sequences of simulated coin tosses, each of length 200.
• Remember that P(head) = 0.5
Simulating Tosses with a Biased Coin
• Let's change P(head) to 0.8
http://socr.ucla.edu/htmls/SOCR_Experiments.html
Forward Sampling

This algorithm generates samples from the prior joint distribution represented by the network.

Let SPS(x1, …, xn) be the probability that a specific event (x1, …, xn) is generated by the algorithm (the sampling probability of the event).

Just by looking at the sampling process, we have

SPS(x1, …, xn) = Π_{i=1..n} P(xi | parents(Xi))

because each sampling step depends only on the parents' values. But this is also the probability of event (x1, …, xn) according to the BN representation of the JPD.

• Thus: SPS(x1, …, xn) = P(x1, …, xn)
Forward Sampling

Because of this result, we can answer any query on the distribution for X1, …, Xn represented by our Bnet using the samples generated with Prior-Sample(Bnet).

Let N be the total number of samples generated, and NPS(x1, …, xn) be the number of samples generated for event x1, …, xn.

We expect the frequency of a given event to converge to its expected value, according to its sampling probability:

lim_{N→∞} NPS(x1, …, xn) / N = SPS(x1, …, xn) = P(x1, …, xn)

so for large N,

NPS(x1, …, xn) / N ≈ P(x1, …, xn)

That is, the event frequency in a sample set generated via forward sampling is a consistent estimate of that event’s probability.

For example, given (cloudy, ⌐sprinkler, rain, wet-grass), with sampling probability 0.324, we expect, for large sample sets, about 32% of the samples to correspond to this event.
Sampling: why does it work?
• The LLN is important because it "guarantees" stable long-term results for random events.
• However, it is important to remember that the LLN only applies (as the name indicates) when a large number of observations are considered.
• There is no principle ensuring that a small number of observations will converge to the expected value or that a streak of one value will immediately be "balanced" by the others.
• So, how many samples are enough?
Hoeffding’s inequality

Suppose p is the true probability and s is the sample average from n independent samples. Then

P(|s − p| > ε) ≤ 2e^(−2nε²)     (1)

p above can be the probability of any event for random variables X = {X1, …, Xn} described by a Bayesian network.

If you want an infinitely small probability of having an error greater than ε, you need infinitely many samples. But if you settle on something less than infinitely small, let’s say δ, then you just need

δ ≥ 2e^(−2nε²)

So you pick
• the error ε you can tolerate,
• the frequency δ with which you can tolerate it,
and solve (1) for n, i.e., the number of samples that can ensure this performance.

Examples:
• You can tolerate an error greater than 0.1 only in 5% of your cases
  • Set ε = 0.1, δ = 0.05
  • Equation (1) gives you n > 184
• If you can tolerate the same error (0.1) only in 1% of the cases, then you need 265 samples
• If you want an error of 0.01 in no more than 5% of the cases, you need 18,445 samples
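Solving the bound for n can be written as a one-line helper, which reproduces the three worked examples above (the function name is my own).

```python
import math

def samples_needed(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps**2) <= delta,
    from Hoeffding's inequality."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

print(samples_needed(0.1, 0.05))   # 185, matching the slides' n > 184
print(samples_needed(0.1, 0.01))   # 265
print(samples_needed(0.01, 0.05))  # 18445
```

Note the quadratic dependence on 1/ε: tightening the error from 0.1 to 0.01 multiplies the sample count by 100, while tightening δ costs only a logarithmic factor.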
Rejection Sampling
• Used specifically to compute conditional probabilities P(X|e) given some evidence e
  – It generates samples from the prior distribution specified by the Bnet (e.g., by using the Prior-Sample algorithm we saw earlier)
  – It rejects all those samples that do not match the evidence e
  – Then, for every value x of X, it estimates P(X = x | e) by looking at the frequency with which X = x occurs in the remaining samples (i.e., those consistent with e)
Rejection Sampling Example
• Estimate P(Rain|sprinkler) using 100 samples
• 27 samples have sprinkler
• Of these, 8 have rain and 19 have ⌐rain
• Estimated P(Rain|sprinkler) = P*(Rain|sprinkler) = <8/27, 19/27>
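The procedure above can be sketched for the sprinkler network: draw prior samples, keep only those consistent with the evidence Sprinkler = T, and use frequencies among the survivors. An illustrative sketch using the CPTs from the forward-sampling example.

```python
import random

def prior_sample(rng):
    """One forward sample of (Cloudy, Sprinkler, Rain)."""
    c = rng.random() < 0.5                      # P(C=T) = 0.5
    s = rng.random() < (0.1 if c else 0.5)      # P(S=T|C)
    r = rng.random() < (0.8 if c else 0.2)      # P(R=T|C)
    return c, s, r                              # Wet Grass not needed here

rng = random.Random(1)
n = 20000
# Keep the Rain value only for samples consistent with the evidence S = T.
kept = [r for c, s, r in (prior_sample(rng) for _ in range(n)) if s]
accept_rate = len(kept) / n        # near P(S=T) = 0.5*0.1 + 0.5*0.5 = 0.3
p_rain_given_s = sum(kept) / len(kept)
print(accept_rate, p_rain_given_s)
```

The acceptance rate makes the inefficiency concrete: roughly 70% of the work is discarded even with a single, fairly likely evidence variable.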
Analysis of Rejection Sampling
• Main problem: it rejects too many samples when the evidence is not very likely.
• Consider our previous example, A → B, with:

P(A=T) = 0.3
P(B=T|A): A=T: 0.7; A=F: 0.1

If we are interested in P(B|A=T),
• we can only use samples with A = T,
• but I only have 0.3 probability of getting A = T from Prior-Sample,
• so I likely have to reject 70% of my samples.

Things get exponentially worse as the number of evidence variables grows.
Analysis of Rejection Sampling

Now add to the network an independent variable C with P(C=T) = 0.1.

If we are interested in P(B|A=T, C=T),
• we can only use samples (A=T, B, C=T), but the probability of getting them is P(A=T)P(C=T) = 0.03
• I should expect to reject on the order of 97% of my samples!
Analysis of Rejection Sampling

Note that rejection sampling resembles how probabilities are estimated from observations in the real world.
• You need the right type of event to happen in order to have a valid observation
  e.g., P(rains tomorrow | redSky tonight)
• If the event is not common, I have to wait a long time to gather enough relevant events
  e.g., nights with red sky
Likelihood Weighting (LW)

LW avoids the inefficiency of rejection sampling by generating only events that are consistent with the evidence e:
• It fixes the values for the evidence variables E
• It samples only the remaining variables Z, i.e., the query X and the hidden variables Y

But it still needs to account for the influence of the given evidence on the probability of the samples; otherwise it would not be sampling the correct distribution.

If a sample comes from the correct distribution (e.g., the prior distribution of the Bnet in forward sampling or rejection sampling),
• we can simply count the number of samples with the desired values for the query variable X.

In LW, before counting, each sample is instead weighted to account for the actual likelihood of the corresponding event given the original probability distribution and the evidence. Basically, the point is to give events that are unlikely given the actual evidence less weight than others.
Example: P(Rain | sprinkler, wet-grass)

Network and CPTs as in the forward sampling example:
P(C=T) = 0.5
P(S=T|C): C=T: 0.1; C=F: 0.5
P(R=T|C): C=T: 0.8; C=F: 0.2
P(W=T|S,R): S=T,R=T: 0.99; S=T,R=F: 0.9; S=F,R=T: 0.9; S=F,R=F: 0.1

Trace (initial weight w1 = 1):
Random => 0.4  Sample => cloudy
Sprinkler is fixed: no sampling, but adjust the weight:
  w2 = w1 × P(sprinkler|cloudy) = 1 × 0.1 = 0.1
Random => 0.4  Sample => rain
Wet Grass is fixed: no sampling, but adjust the weight:
  w3 = w2 × P(wet-grass|sprinkler, rain) = 0.1 × 0.99 = 0.099
Example: P(Rain | sprinkler, wet-grass)

Basically, LW generated the sample <cloudy, sprinkler, rain, wet-grass> by “cheating”: we did not sample the evidence variables, we just set their values to what we wanted.

But LW makes up for the cheating by giving a relatively low weight to this sample (0.099), reflecting the fact that it is not very likely.

LW uses the weighted values when counting through all generated samples to compute the frequencies of the events of interest.

Likelihood Weighting (LW), in short: fix the evidence variables, sample only the non-evidence variables, and weight each sample by the likelihood it accords the evidence.
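The trace above generalizes into the following sketch for P(Rain | Sprinkler=T, WetGrass=T), again an illustrative implementation using the sprinkler CPTs rather than the textbook's pseudocode.

```python
import random

def weighted_sample(rng):
    """One LW sample: return the sampled Rain value and its weight."""
    w = 1.0
    c = rng.random() < 0.5                    # sample Cloudy from P(C)
    w *= 0.1 if c else 0.5                    # Sprinkler fixed to T: weight by P(S=T|C)
    r = rng.random() < (0.8 if c else 0.2)    # sample Rain from P(R|C)
    w *= 0.99 if r else 0.9                   # WetGrass fixed to T: weight by P(W=T|S=T,R)
    return r, w

rng = random.Random(0)
num = den = 0.0
for _ in range(50000):
    r, w = weighted_sample(rng)
    den += w
    if r:
        num += w
estimate = num / den
print(estimate)   # estimate of P(Rain=T | sprinkler, wet-grass); exact value is about 0.32
```

Every sample contributes; the weights replace the wholesale rejections of the previous algorithm.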
Likelihood Weighting: why does it work?
Remember that we have fixed evidence variables E, and we are sampling only all the other variables Z = X U Y• The “right” distribution to sample from would be P(Z|e) but we have
seen with rejection sampling that this is really inefficient
So which distribution is Weighted-Sample using?
By looking at the algorithm, we see that to get an event (z,e) the algorithm samples each variable in Z given its parents
• SWS(z , e) = ∏i P(zi|parent(Zi))
Likelihood Weighting: why does it work?
P(zi|parent(Zi)) can include both hidden variables Y and evidence variables E, so SWS does pay some attention to evidence
The sampled values for each Zi will be influenced by evidence among Zi ancestors.
• Better than sampling from P(z) in this respect (i.e. by completely ignoring E)
However, they won’t be influenced by any evidence that is not an ancestor of Zi.
• SWS(z , e) pays less attention to evidence than the true distribution P(z|e)
The weights are there to make up for the difference
How? The weight for sample x = (z, e) is the product of the likelihoods of each evidence variable given its parents:

• w(z, e) = ∏j P(ej | parent(Ej))

So the weighted probability of a sample (used by Likelihood Weighting instead of simple counts) can be expressed as

• SWS(z, e) w(z, e) = ∏i P(zi | parent(Zi)) × ∏j P(ej | parent(Ej)) = P(z, e)   (2)
• by definition of Bayesian networks, because the two products cover all the variables in the network
And it can be shown that Likelihood weighting generates consistent estimates for P(X|e)
Analysis of LW

Contrary to rejection sampling, LW uses all the samples that it generates. It simulates the effect of many rejections via weighting.

Example: A → B, with

P(A=T) = 0.3
P(B=T|A): A=T: 0.003; A=F: 0.63

• Suppose B = true, and we are looking for samples with A = true.
• If there are 1000 of them, likely only 3 will have B = true:
  the only ones that would not be rejected by Rejection Sampling.
• With LW, we fix B = true, and give the very first sample with A = true a weight of P(B=true|A=true) = 0.003.
• Thus LW simulates the effect of many rejections with only one sample.
Efficiency of LW

Likelihood weighting is good:
• It uses all the samples that it generates, thus it is much more efficient than Rejection Sampling
• It takes evidence into account to generate the sample
  • E.g., here, W’s value will get picked based on the evidence values of S and R

But it doesn’t solve all our problems:
• Evidence influences the choice of downstream variables, but not upstream ones (C isn’t more likely to get a value matching the evidence)
• Performance still degrades as the number of evidence variables increases
  • The more evidence variables, the more samples will have small weight
  • If one does not generate enough samples, the weighted estimate will be dominated by the small fraction of samples with substantial weight

Possible approaches:
• Sample some of the variables, while using exact inference to generate the posterior probability for others
• Use Monte Carlo simulations (beyond the scope of this course), or Particle Filtering (may see in next classes)
Learning Goals For Inference in Bnets

• Variable elimination
  • Understand factors and their operations
  • Carry out variable elimination by using factors and the related operations
  • Use techniques to simplify variable elimination
• Approximate inference
  • Explain why we need approximate algorithms in Bayesian networks
  • Define what sampling is and how it works at the general level
  • Describe how to generate samples from a distribution and be able to apply the process to a given distribution
  • Demonstrate how to apply sampling to the distribution represented by a given Bayesian network
  • Explain/write in pseudocode/implement/trace/debug different sampling algorithms: forward sampling, rejection sampling, likelihood weighting (NOT importance sampling from the textbook)
  • Describe when each algorithm is appropriate, as well as its drawbacks. Recommend the most appropriate sampling algorithm for a specific problem
  • Explain Hoeffding's inequality and describe what it allows us to compute. Apply Hoeffding's inequality to specific examples