CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Monte Carlo Methods for Probabilistic Inference
AGENDA
Monte Carlo methods: $O(1/\sqrt{N})$ standard deviation
For Bayesian inference: likelihood weighting, Gibbs sampling
MONTE CARLO INTEGRATION
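Estimate large integrals/sums: $I = \int f(x)\,p(x)\,dx$ or $I = \sum_x f(x)\,p(x)$
Using N i.i.d. samples $x^{(1)},\dots,x^{(N)}$ from $p(x)$: $I \approx \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$
Examples:
$\int_{[a,b]} f(x)\,dx \approx \frac{b-a}{N}\sum_{i} f(x^{(i)})$, with the $x^{(i)}$ drawn uniformly from $[a,b]$
$E[X] = \int x\,p(x)\,dx \approx \frac{1}{N}\sum_{i} x^{(i)}$
Volume of a set in $\mathbb{R}^n$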
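The two example estimators above can be tried out with a few lines of Python. This is a minimal sketch, not part of the original slides: the helper names `mc_expectation` and `mc_integral`, the test integrand `sin` on [0, π], and the Gaussian with mean 3 are all assumptions chosen for illustration.

```python
import math
import random

def mc_expectation(f, sampler, n, rng):
    """Estimate E[f(X)] as the average of f over n i.i.d. draws from sampler."""
    return sum(f(sampler(rng)) for _ in range(n)) / n

def mc_integral(f, a, b, n, rng):
    """Estimate the integral of f over [a, b] as (b - a)/n * sum f(x_i),
    with the x_i drawn uniformly from [a, b]."""
    return (b - a) * mc_expectation(f, lambda r: r.uniform(a, b), n, rng)

rng = random.Random(0)

# Integral of sin over [0, pi]; the exact value is 2.
integral_estimate = mc_integral(math.sin, 0.0, math.pi, 100_000, rng)

# Mean of a Gaussian with mean 3 and standard deviation 1; the exact value is 3.
mean_estimate = mc_expectation(lambda x: x, lambda r: r.gauss(3.0, 1.0), 100_000, rng)

print(integral_estimate, mean_estimate)
```

Both estimates should land close to their exact values, with an error that shrinks as the number of samples grows.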
MEAN & VARIANCE OF ESTIMATE
Let $I_N$ be the random variable denoting the estimate of the integral with N samples
What is the bias (mean error) $E[I - I_N]$?
$E[I - I_N] = I - E[I_N]$ (linearity of expectation)
$= E[f(x)] - \frac{1}{N}\sum_i E[f(x^{(i)})]$ (definition of $I$ and $I_N$)
$= \frac{1}{N}\sum_i \left(E[f(x)] - E[f(x^{(i)})]\right)$
$= \frac{1}{N}\sum_i 0$ ($x$ and $x^{(i)}$ are both distributed according to $p(x)$)
$= 0$
MEAN & VARIANCE OF ESTIMATE
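Let $I_N$ be the random variable denoting the estimate of the integral with N samples
What is the bias (mean error) $E[I - I_N]$? Zero: $I_N$ is an unbiased estimator
What is the variance $\mathrm{Var}[I_N]$?
$\mathrm{Var}[I_N] = \mathrm{Var}\left[\frac{1}{N}\sum_i f(x^{(i)})\right]$ (definition)
$= \frac{1}{N^2}\,\mathrm{Var}\left[\sum_i f(x^{(i)})\right]$ (scaling of variance)
$= \frac{1}{N^2}\sum_i \mathrm{Var}[f(x^{(i)})]$ (variance of a sum of independent variables)
$= \frac{1}{N}\,\mathrm{Var}[f(x)]$ (i.i.d. sample)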
MEAN & VARIANCE OF ESTIMATE
Summary for the estimate $I_N$ of the integral with N samples:
Bias (mean error) $E[I - I_N]$: 0 (unbiased estimator)
Variance: $\mathrm{Var}[I_N] = \frac{1}{N}\,\mathrm{Var}[f(x)]$
Standard deviation: $O(1/\sqrt{N})$
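The $1/\sqrt{N}$ scaling can be checked empirically. The sketch below is an illustration under assumed choices (the integrand $x^2$ with X uniform on [0, 1], sample sizes 100 and 400, 2000 repetitions); it repeats the estimate many times and compares the spread at the two sample sizes.

```python
import random
import statistics

def estimate(n, rng):
    """Monte Carlo estimate of E[X^2] for X ~ Uniform(0, 1); the exact value is 1/3."""
    return sum(rng.random() ** 2 for _ in range(n)) / n

def empirical_std(n, trials, rng):
    """Standard deviation of the estimate across independent repetitions."""
    return statistics.pstdev([estimate(n, rng) for _ in range(trials)])

rng = random.Random(1)
std_small = empirical_std(100, 2000, rng)   # N = 100
std_large = empirical_std(400, 2000, rng)   # N = 400

# Quadrupling N should roughly halve the standard deviation: sqrt(400/100) = 2.
ratio = std_small / std_large
print(std_small, std_large, ratio)
```

Since Var[X^2] = 1/5 - 1/9 = 4/45 here, the theory predicts a standard deviation of about 0.030 at N = 100 and about 0.015 at N = 400.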
APPROXIMATE INFERENCE THROUGH SAMPLING
Unconditional simulation: To estimate the probability of a coin flipping heads, I can flip it a huge number of times and count the fraction of heads observed
Conditional simulation: To estimate the probability P(H) that a coin picked out of a randomly chosen bucket B flips heads:
Repeat for i=1,…,N:
1. Pick a coin C out of a random bucket $b^{(i)}$ chosen with probability P(B)
2. $h^{(i)}$ = flip C according to probability $P(H \mid b^{(i)})$
3. Sample $(h^{(i)}, b^{(i)})$ comes from distribution P(H,B)
Result approximates P(H,B); the fraction of samples in which the coin lands heads approximates P(H)
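A minimal sketch of the bucket-and-coin procedure follows. The two buckets and all probabilities in `P_BUCKET` and `P_HEADS_GIVEN_BUCKET` are hypothetical numbers invented for illustration; they do not come from the slides.

```python
import random

# Hypothetical numbers, chosen only for illustration.
P_BUCKET = {"b1": 0.3, "b2": 0.7}               # P(B)
P_HEADS_GIVEN_BUCKET = {"b1": 0.9, "b2": 0.2}   # P(H = heads | B)

def sample_pair(rng):
    """Draw (h, b) from the joint P(H, B): sample the bucket, then flip its coin."""
    b = rng.choices(list(P_BUCKET), weights=list(P_BUCKET.values()))[0]
    h = 1 if rng.random() < P_HEADS_GIVEN_BUCKET[b] else 0
    return h, b

rng = random.Random(2)
samples = [sample_pair(rng) for _ in range(100_000)]

# The fraction of heads approximates the marginal P(H).
# Exact value for these numbers: 0.3 * 0.9 + 0.7 * 0.2 = 0.41
p_heads = sum(h for h, _ in samples) / len(samples)

# The fraction of (heads, b1) pairs approximates the joint entry 0.3 * 0.9 = 0.27
p_heads_and_b1 = sum(1 for h, b in samples if h == 1 and b == "b1") / len(samples)
print(p_heads, p_heads_and_b1)
```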
MONTE CARLO INFERENCE IN BAYES NETS
BN over variables X
Repeat for i=1,…,N:
In top-down order, generate $x^{(i)}$ as follows: sample $x_j^{(i)} \sim P(X_j \mid pa_{X_j}^{(i)})$
(RHS is taken by putting parent values in the sample into the CPT for $X_j$)
Samples $x^{(1)},\dots,x^{(N)}$ approximate the distribution over X
APPROXIMATE INFERENCE: MONTE-CARLO SIMULATION
Sample from the joint distribution
Network: Burglary -> Alarm <- Earthquake; Alarm -> JohnCalls; Alarm -> MaryCalls
P(B) = 0.001
P(E) = 0.002
B  E  P(A|B,E)
T  T  0.95
T  F  0.94
F  T  0.29
F  F  0.001
A  P(J|A)
T  0.90
F  0.05
A  P(M|A)
T  0.70
F  0.01
Sample: B=0 E=0 A=0 J=1 M=0
APPROXIMATE INFERENCE: MONTE-CARLO SIMULATION
As more samples are generated, the distribution of the samples approaches the joint distribution
B=0 E=0 A=0 J=1 M=0
B=0 E=0 A=0 J=0 M=0
B=0 E=0 A=0 J=0 M=0
B=1 E=0 A=1 J=1 M=0
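Top-down sampling of the burglary network can be sketched as follows. The CPT values are the ones on the slides; the function names `flip` and `forward_sample`, the 0/1 encoding of truth values, and the sample count are assumptions made for this example.

```python
import random

# CPTs from the slides, stored as the probability that the variable equals 1.
P_B = 0.001
P_E = 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}  # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}  # P(M=1 | A)

def flip(p, rng):
    """Return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

def forward_sample(rng):
    """Sample every variable in top-down order, parents before children."""
    b = flip(P_B, rng)
    e = flip(P_E, rng)
    a = flip(P_A[(b, e)], rng)   # look up the CPT row for the sampled parents
    j = flip(P_J[a], rng)
    m = flip(P_M[a], rng)
    return {"B": b, "E": e, "A": a, "J": j, "M": m}

rng = random.Random(3)
samples = [forward_sample(rng) for _ in range(200_000)]

# Any marginal can be read off as a sample fraction, e.g. P(J=1).
p_john = sum(s["J"] for s in samples) / len(samples)
print(p_john)
```

For these CPTs the exact marginal is P(J=1) ≈ 0.052, so the sample fraction should land near that value.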
BASIC METHOD FOR HANDLING EVIDENCE
Inference: given evidence E=e (e.g., J=1), approximate P(X\E | E=e)
Remove the samples that conflict with the evidence
B=0 E=0 A=0 J=1 M=0
B=0 E=0 A=0 J=0 M=0 (conflicts with J=1, removed)
B=0 E=0 A=0 J=0 M=0 (conflicts with J=1, removed)
B=1 E=0 A=1 J=1 M=0
Distribution of remaining samples approximates the conditional distribution
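The sample-and-discard approach can be sketched as below, using the slide's example evidence J=1. The CPT values come from the slides; the helper names, the query P(A=1 | J=1), and the sample count are assumptions for illustration.

```python
import random

# CPTs from the slides, stored as the probability that the variable equals 1.
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}  # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}  # P(M=1 | A)

def flip(p, rng):
    return 1 if rng.random() < p else 0

def forward_sample(rng):
    """One joint sample, generated parents-first."""
    b, e = flip(P_B, rng), flip(P_E, rng)
    a = flip(P_A[(b, e)], rng)
    return {"B": b, "E": e, "A": a, "J": flip(P_J[a], rng), "M": flip(P_M[a], rng)}

def rejection_sample(evidence, n, rng):
    """Generate n joint samples and keep only those that agree with the evidence."""
    kept = []
    for _ in range(n):
        s = forward_sample(rng)
        if all(s[var] == val for var, val in evidence.items()):
            kept.append(s)
    return kept

rng = random.Random(6)
n_total = 200_000
kept = rejection_sample({"J": 1}, n_total, rng)

acceptance_rate = len(kept) / n_total             # approximates P(J=1), about 0.052
p_alarm = sum(s["A"] for s in kept) / len(kept)   # approximates P(A=1 | J=1), about 0.043
print(acceptance_rate, p_alarm)
```

Only about 5% of the samples survive here; with rarer evidence the surviving fraction shrinks further, and most of the sampling effort is wasted.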
RARE EVENT PROBLEM:
What if some events are really rare (e.g., burglary & earthquake)?
The number of samples must be huge to get a reasonable estimate
Solution: likelihood weighting
Enforce that each sample agrees with the evidence
While generating a sample, keep track of the ratio
(how likely the sampled value is to occur in the real world) / (how likely you were to generate the sampled value)
LIKELIHOOD WEIGHTING
Suppose the evidence is Alarm=1 and MaryCalls=1. Sample B and E from a proposal that gives each value probability 0.5
w=1 (initial weight, before any variable is sampled)
New sample: B=0 E=1
w = (0.999/0.5) × (0.002/0.5) ≈ 0.008
B=0 E=1 A=1
A=1 is enforced, and the weight is updated to reflect the likelihood that this occurs:
w = 0.008 × P(A=1|B=0,E=1) = 0.008 × 0.29 ≈ 0.0023
B=0 E=1 A=1 M=1 J=1
M=1 is enforced; J is not evidence, so it is sampled from P(J|A=1) and leaves the weight unchanged:
w = 0.0023 × P(M=1|A=1) = 0.0023 × 0.70 ≈ 0.0016
New sample: B=0 E=0
w = (0.999/0.5) × (0.998/0.5) ≈ 3.988
B=0 E=0 A=1
w = 3.988 × P(A=1|B=0,E=0) = 3.988 × 0.001 ≈ 0.004
B=0 E=0 A=1 M=1 J=1
w = 0.004 × P(M=1|A=1) = 0.004 × 0.70 ≈ 0.0028
New sample: B=1 E=0 A=1
w = (0.001/0.5) × (0.998/0.5) × P(A=1|B=1,E=0) = 0.003992 × 0.94 ≈ 0.00375
B=1 E=0 A=1 M=1 J=1
w = 0.00375 × P(M=1|A=1) = 0.00375 × 0.70 ≈ 0.0026
New sample: B=1 E=1 A=1 M=1 J=1
w = (0.001/0.5) × (0.002/0.5) × 0.95 × 0.70 ≈ 5.3e-6
LIKELIHOOD WEIGHTING
Weighted samples (N=4):
B=0 E=1 A=1 M=1 J=1, w=0.0016
B=0 E=0 A=1 M=1 J=1, w=0.0028
B=1 E=0 A=1 M=1 J=1, w=0.0026
B=1 E=1 A=1 M=1 J=1, w≈0
N=4 gives P(B|A,M) ≈ 0.0026 / (0.0016 + 0.0028 + 0.0026) ≈ 0.371
Exact inference gives P(B|A,M) ≈ 0.374
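The weighting scheme of this example can be sketched in Python. The sketch follows the slides in proposing B and E with probability 0.5 per value and enforcing the evidence A=1, M=1; note that textbook likelihood weighting usually samples non-evidence variables from their own CPTs instead, in which case only the evidence terms enter the weight. The names `weight`, `weighted_sample`, and `estimate_burglary` are assumptions for illustration.

```python
import random

# CPTs from the slides.
P_B = {1: 0.001, 0: 0.999}
P_E = {1: 0.002, 0: 0.998}
P_A1 = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J1 = {1: 0.90, 0: 0.05}  # P(J=1 | A)
P_M1 = {1: 0.70, 0: 0.01}  # P(M=1 | A)
Q = 0.5  # proposal probability for each value of B and E, as on the slides

def weight(b, e):
    """Weight of a sample with B=b, E=e once A=1 and M=1 have been enforced."""
    w = 1.0
    w *= P_B[b] / Q       # true probability / proposal probability
    w *= P_E[e] / Q
    w *= P_A1[(b, e)]     # evidence A=1: multiply by its likelihood
    w *= P_M1[1]          # evidence M=1 given A=1
    return w

def weighted_sample(rng):
    """One sample that agrees with the evidence, together with its weight."""
    b = 1 if rng.random() < Q else 0
    e = 1 if rng.random() < Q else 0
    j = 1 if rng.random() < P_J1[1] else 0  # J is not evidence: sampled, weight unchanged
    return {"B": b, "E": e, "A": 1, "M": 1, "J": j}, weight(b, e)

def estimate_burglary(n, rng):
    """Weighted fraction of samples with B=1, approximating P(B=1 | A=1, M=1)."""
    num = den = 0.0
    for _ in range(n):
        s, w = weighted_sample(rng)
        num += w * s["B"]
        den += w
    return num / den

rng = random.Random(4)
p_burglary = estimate_burglary(50_000, rng)
print(weight(0, 1), weight(0, 0), weight(1, 0), weight(1, 1))
print(p_burglary)
```

The four weights reproduce the values traced by hand (about 0.0016, 0.0028, 0.0026, and 5.3e-6), and the estimate should settle near the exact posterior of roughly 0.374.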
ANOTHER RARE-EVENT PROBLEM
B=b given as evidence
Each observed value bi is improbable under all but one setting of Ai (say, Ai=1)
Chance of sampling all 1's for the Ai is very low => most likelihood weights will be very low
Problem: evidence is not being used to sample the A's effectively (i.e., near P(Ai|b))
[Figure: nodes A1, A2, …, A10 in one row and B1, B2, …, B10 in a second row]
GIBBS SAMPLING
Idea: reduce the computational burden of sampling from a multidimensional distribution $P(x)=P(x_1,\dots,x_n)$ by doing repeated draws of individual attributes
Cycle through j=1,…,n:
Sample $x_j \sim P(x_j \mid x_1,\dots,x_{j-1},x_{j+1},\dots,x_n)$
Over the long run, the random walk taken by x approaches the true distribution P(x)
GIBBS SAMPLING IN BNS
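Each Gibbs sampling step: 1) pick a variable $X_i$, 2) sample $x_i \sim P(X_i \mid X \setminus X_i)$
Look at values of the "Markov blanket" of $X_i$:
Parents $Pa_{X_i}$
Children $Y_1,\dots,Y_k$
Parents of children (excluding $X_i$): $Pa_{Y_1}\setminus X_i, \dots, Pa_{Y_k}\setminus X_i$
$X_i$ is independent of the rest of the network given its Markov blanket
Sample $x_i \sim P(X_i \mid Pa_{X_i}, Y_1, Pa_{Y_1}\setminus X_i, \dots, Y_k, Pa_{Y_k}\setminus X_i) = \frac{1}{Z}\, P(X_i \mid Pa_{X_i})\, P(Y_1 \mid Pa_{Y_1}) \cdots P(Y_k \mid Pa_{Y_k})$
This is the product of $X_i$'s factor and the factors of its children, normalized by Z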
HANDLING EVIDENCE
Simply set each evidence variable to its observed value and do not resample it
Resulting walk approximates the distribution $P(X \setminus E \mid E=e)$
Uses evidence more efficiently than likelihood weighting
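A Gibbs sampler with clamped evidence can be sketched on the burglary network from the earlier slides. The `NETWORK` data layout, the function names, the all-zeros initialization, and the step and burn-in counts are assumptions made for illustration; the evidence A=1, M=1 matches the likelihood weighting example.

```python
import random

# Each entry: variable -> (parents, table of P(variable=1 | parent values)).
# CPT values are the ones from the burglary network on the slides.
NETWORK = {
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}),
    "J": (("A",), {(1,): 0.90, (0,): 0.05}),
    "M": (("A",), {(1,): 0.70, (0,): 0.01}),
}

def factor(var, state):
    """P(var = state[var] | parents of var as set in state)."""
    parents, table = NETWORK[var]
    p_one = table[tuple(state[p] for p in parents)]
    return p_one if state[var] == 1 else 1.0 - p_one

def conditional_p_one(var, state):
    """P(var=1 | Markov blanket): var's factor times its children's factors, normalized."""
    children = [v for v, (parents, _) in NETWORK.items() if var in parents]
    scores = []
    for value in (0, 1):
        trial = dict(state, **{var: value})
        score = factor(var, trial)
        for child in children:
            score *= factor(child, trial)
        scores.append(score)
    return scores[1] / (scores[0] + scores[1])   # division by Z

def gibbs(evidence, query, steps, burn_in, rng):
    """Fraction of post-burn-in states in which the query variable equals 1."""
    state = {v: 0 for v in NETWORK}
    state.update(evidence)                        # evidence variables stay fixed
    free = [v for v in NETWORK if v not in evidence]
    hits = 0
    for t in range(steps):
        for var in free:                          # cycle through non-evidence variables
            state[var] = 1 if rng.random() < conditional_p_one(var, state) else 0
        if t >= burn_in:
            hits += state[query]
    return hits / (steps - burn_in)

rng = random.Random(5)
p_burglary = gibbs({"A": 1, "M": 1}, "B", steps=60_000, burn_in=1_000, rng=rng)
print(p_burglary)
```

With this evidence the chain mixes quickly, and the estimate should land near the exact posterior P(B=1 | A=1, M=1) ≈ 0.374. The burn-in length used here is a guess, since the number of steps needed is hard to determine in advance.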
GIBBS SAMPLING ISSUES
Demonstrating correctness & convergence requires examining the Markov chain random walk (more later)
Need to take many steps before the effects of poor initialization wear off (mixing time)
Difficult to tell a priori how many steps are needed
Numerous variants, known as Markov Chain Monte Carlo (MCMC) techniques
NEXT TIME
Continuous and hybrid distributions