Sampling Techniques for Probabilistic and Deterministic Graphical models
Bozhena Bidyuk
Vibhav Gogate
Rina Dechter
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Cutset-based Variance Reduction
6. AND/OR importance sampling
Probabilistic Reasoning / Graphical models
• Graphical models:
– Bayesian network, constraint networks, mixed network
• Queries
• Exact algorithms:
– inference
– search and hybrids
• Graph parameters:
– tree-width, cycle-cutset, w-cutset
Bayesian Networks (Pearl, 1988)
P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

Nodes: Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), Dyspnoea (D)
CPTs: P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)

Example CPT for P(D|C,B):
C B | D=0  D=1
0 0 | 0.1  0.9
0 1 | 0.7  0.3
1 0 | 0.8  0.2
1 1 | 0.9  0.1

In general, P(X) = ∏_{i=1}^{n} P(X_i | pa(X_i)); the CPTs P(X_i | pa(X_i)) are the network parameters.

Belief Updating: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
Probability of evidence: P(smoking = no, dyspnoea = yes) = ?
Queries
• Probability of evidence (or partition function):
P(e) = Σ_{X \ E} ∏_{i=1}^{n} P(x_i | pa_i)|_{E=e}        (for Markov networks, Z = Σ_X ∏_i C_i)
• Posterior marginal (beliefs):
P(x_i | e) = P(x_i, e) / P(e) = [ Σ_{X \ {X_i, E}} ∏_j P(x_j | pa_j)|_{E=e} ] / [ Σ_{X \ E} ∏_j P(x_j | pa_j)|_{E=e} ]
• Most Probable Explanation:
x* = argmax_x P(x, e)
Constraint Networks

Example: map coloring
Variables: countries (A, B, C, etc.)
Values: colors (red, green, blue)
Constraints: A ≠ B, A ≠ D, D ≠ E, etc.

A possible constraint relation between A and B:
A      B
red    green
red    yellow
green  red
green  yellow
yellow green
yellow red

Constraint graph: nodes A, B, C, D, E, F, G connected by the binary constraints.

Task: find a solution; count solutions; find a good one
Propositional Satisfiability
φ = {(¬C), (A ∨ B ∨ C), (¬A ∨ B ∨ E), (¬B ∨ C ∨ D)}
Mixed Networks: Mixing Belief and Constraints

Belief (Bayesian) network:
Variables: A, B, C, D, E, F
Domains: {0, 1}
CPTs: P(A), P(B|A), P(C|A), P(D|B,C), P(E|A,B), P(F|A)

Constraint network:
Variables: A, B, C, D, E, F
Domains: {0, 1}
Constraints: R1(A,B,C), R2(A,C,F), R3(B,C,D), R4(A,E)
Expresses the set of solutions sol(R).

Example CPT P(D|B,C) containing zeros:
B C | D=0  D=1
0 0 | 0    1
0 1 | .1   .9
1 0 | .3   .7
1 1 | 1    0

B=0, C=0, D=0 is not allowed; B=1, C=1, D=1 is not allowed, which is equivalent to a constraint R_BCD(B,C,D).

Constraints could be specified externally or may occur as zeros in the belief network.

Same queries (e.g., weighted counts):
M = Σ_{x ∈ sol(R)} P_B(x)
Belief Updating by Variable Elimination

"Moral" graph over A, B, C, D, E.

P(a | e=0) ∝ P(a, e=0) = Σ_{b,c,d,e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
= P(a) Σ_c P(c|a) Σ_d Σ_b P(b|a) P(d|b,a) P(e=0|b,c)
= P(a) Σ_c P(c|a) Σ_d h_B(a, d, c, e),  where h_B(a, d, c, e) = Σ_b P(b|a) P(d|b,a) P(e|b,c)
Bucket Elimination: algorithm elim-bel (Dechter 1996)

Ordering (buckets processed from B down to A):
bucket B:  P(b|a), P(d|b,a), P(e|b,c)   → eliminate B:  h_B(a, d, c, e)
bucket C:  P(c|a), h_B(a, d, c, e)      → h_C(a, d, e)
bucket D:  h_C(a, d, e)                 → h_D(a, e)
bucket E:  e = 0, h_D(a, e)             → h_E(a)
bucket A:  P(a), h_E(a)                 → P(a | e=0)

Induced width w* = 4; time and space exp(w*).
Bucket Elimination

Query: P(a | e=0) ∝ P(a, e=0).  Elimination order: d, e, b, c.

P(a, e=0) = Σ_{b,c,d,e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
= P(a) Σ_c P(c|a) Σ_b P(b|a) Σ_{e=0} P(e|b,c) Σ_d P(d|b,a)

bucket D:  P(d|b,a)                     → f_D(a, b) = Σ_d P(d|b,a)
bucket E:  P(e|b,c)                     → f_E(b, c) = P(e=0|b,c)
bucket B:  P(b|a), f_D(a,b), f_E(b,c)   → f_B(a, c) = Σ_b P(b|a) f_D(a,b) f_E(b,c)
bucket C:  P(c|a), f_B(a,c)             → f_C(a) = Σ_c P(c|a) f_B(a,c)
bucket A:  P(a), f_C(a)                 → P(a, e=0) = P(a) f_C(a)

Bucket tree (original functions and messages): clusters {D,A,B}, {E,B,C}, {B,A,C}, {C,A}, {A} with messages f_D(a,b), f_E(b,c), f_B(a,c), f_C(a).

Time and space exp(w*).
Complexity of Elimination

O(n exp(w*(d))),  where w*(d) is the induced width of the moral graph along ordering d.

The effect of the ordering (on the example "moral" graph over A, B, C, D, E):
w*(d1) = 4 for ordering d1 = (A, E, D, C, B);  w*(d2) = 2 for ordering d2 = (A, B, C, D, E).
Cutset-Conditioning
[Figure: a graphical model over variables A through P; conditioning on A, then B, then C removes all of the cycles.]

Cycle cutset = {A, B, C}
Search Over the Cutset

• Inference may require too much memory
• Condition on some of the variables

[Figure: a graph coloring problem; the search tree branches on cutset assignments (A = yellow/green, B = red/blue/green/yellow), leaving a tree-structured subproblem over the remaining variables C, D, ..., M for each cutset assignment.]

Space: exp(w), where w is a user-controlled parameter.  Time: exp(w + c(w)).
Linkage Analysis

[Figure: a pedigree of 6 individuals with genotypes such as A|a and B|b; haplotypes are known for individuals 2 and 3, the genotype is known for individual 6, and the rest are unknown.]

[Figure: the corresponding Bayesian network for 6 people and 3 markers, with locus variables L_ijm/L_ijf, phenotype variables X_ij, and selector variables S_ijm for each person i and marker j.]

Linkage Analysis: 6 People, 3 Markers
Applications

• Determinism: more ubiquitous than you may think!
• Transportation Planning (Liao et al. 2004, Gogate et al. 2005)
  – Predicting and inferring car travel activity of individuals
• Genetic Linkage Analysis (Fishelson and Geiger, 2002)
  – Associating the functionality of genes with their location on chromosomes
• Functional/Software Verification (Bergeron, 2000)
  – Generating random test programs to check the validity of hardware
• First Order Probabilistic Models (Domingos et al. 2006, Milch et al. 2005)
  – Citation matching
Inference vs. Conditioning-Search

Inference: exp(w*) time and space. Example: a tree decomposition of the graph over A-M with clusters ABC, BDEF, DGF, EFH, FHK, HJ, KLM.

Search: exp(n) time, O(n) space. Example: a full 0/1 search tree over the variables A, B, C, D, E, F.

Search + inference (cutset conditioning): branch on the cutset (e.g., A and B in the graph coloring example), then solve each remaining tree-structured subproblem exactly.
Space: exp(w).  Time: exp(w + c(w)).
Approximation

• Inference, search, and hybrids are too expensive when the graph is dense (high treewidth), so we turn to:
• Bounding inference:
  – mini-bucket and mini-clustering
  – belief propagation
• Bounding search:
  – sampling
• Goal: an anytime scheme
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Outline
• Definitions and Background on Statistics
• Theory of importance sampling
• Likelihood weighting
• State-of-the-art importance sampling techniques
23
A sample
• Given a set of variables X = {X1, ..., Xn}, a sample, denoted S^t, is an instantiation of all the variables:
S^t = (x_1^t, x_2^t, ..., x_n^t)
How to draw a sample? Univariate distribution

• Example: Given a random variable X with domain {0, 1} and distribution P(X) = (0.3, 0.7).
• Task: Generate samples of X from P.
• How?
  – draw a random number r ∈ [0, 1]
  – if r < 0.3 then set X = 0
  – else set X = 1
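As a concrete illustration, here is a minimal Python sketch of this inverse-CDF draw for a binary variable; the function name and the sample count are illustrative, not part of the slides.

```python
import random

def sample_x(p0=0.3):
    """Draw one value of X from P(X) = (p0, 1 - p0)."""
    r = random.random()          # r uniform in [0, 1)
    return 0 if r < p0 else 1    # X = 0 with probability p0, else X = 1

samples = [sample_x() for _ in range(10000)]
print(sum(s == 0 for s in samples) / len(samples))  # close to 0.3
```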
How to draw a sample? Multivariate distribution

• Let X = {X1, ..., Xn} be a set of variables
• Express the distribution in product form:
P(X) = P(X1) P(X2 | X1) ... P(Xn | X1, ..., Xn-1)
• Sample the variables one by one, from left to right, along the ordering dictated by the product form.
• In the Bayesian network literature this is called logic sampling.
Logic sampling (example)

Network: X1 → X2, X1 → X3, (X2, X3) → X4, with
P(X1, X2, X3, X4) = P(X1) P(X2|X1) P(X3|X1) P(X4|X2, X3)

No evidence. To generate sample k:
1. Sample x1 from P(X1)
2. Sample x2 from P(X2 | X1 = x1)
3. Sample x3 from P(X3 | X1 = x1)
4. Sample x4 from P(X4 | X2 = x2, X3 = x3)
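A minimal Python sketch of logic (forward) sampling for this four-variable network follows; the CPT numbers are made up for illustration and are not from the slides.

```python
import random

def draw(p1):
    """Sample a binary variable with P(X = 1) = p1."""
    return 1 if random.random() < p1 else 0

def logic_sample():
    x1 = draw(0.6)                                  # from P(X1)
    x2 = draw(0.7 if x1 else 0.2)                   # from P(X2 | x1)
    x3 = draw(0.9 if x1 else 0.4)                   # from P(X3 | x1)
    x4 = draw([0.1, 0.5, 0.6, 0.95][2 * x2 + x3])   # from P(X4 | x2, x3)
    return (x1, x2, x3, x4)

samples = [logic_sample() for _ in range(1000)]
```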
Expected value and Variance

Expected value: Given a probability distribution P(X) and a function g(X) defined over a set of variables X = {X1, X2, ..., Xn}, the expected value of g w.r.t. P is
E_P[g(x)] = Σ_x g(x) P(x)

Variance: The variance of g w.r.t. P is
Var_P[g(x)] = Σ_x (g(x) - E_P[g(x)])² P(x)
Monte Carlo Estimate
• Estimator:
– An estimator is a function of the samples.
– It produces an estimate of the unknown parameter of the sampling distribution.
Given i.i.d. samples S^1, S^2, ..., S^T drawn from P, the Monte Carlo estimate of E_P[g(x)] is given by:
ĝ = (1/T) Σ_{t=1}^{T} g(S^t)
Example: Monte Carlo estimate

• Given:
  – A distribution P(X) = (0.3, 0.7)
  – g(X) = 40 if X equals 0, 50 if X equals 1
• Exact value: E_P[g(x)] = 40×0.3 + 50×0.7 = 47
• Generate k = 10 samples from P: 0, 1, 1, 1, 0, 1, 1, 0, 1, 0

ĝ = (40 · #samples(X=0) + 50 · #samples(X=1)) / #samples = (40·4 + 50·6) / 10 = 46
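The same computation as a short Python sketch, assuming the distribution and payoffs above:

```python
import random

def sample_x():
    return 0 if random.random() < 0.3 else 1

g = {0: 40, 1: 50}
T = 100_000
estimate = sum(g[sample_x()] for _ in range(T)) / T
print(estimate)   # close to the exact value 47 for large T
```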
Outline
• Definitions and Background on Statistics
• Theory of importance sampling
• Likelihood weighting
• State-of-the-art importance sampling techniques
31
Importance sampling: Main idea
• Transform the probabilistic inference problem into the problem of computing the expected value of a random variable w.r.t. a distribution Q.
• Generate random samples from Q.
• Estimate the expected value from the generated samples.
32
Importance sampling for P(e)
Let Z = X \ E. Let Q(Z) be a (proposal) distribution satisfying P(z, e) > 0 ⇒ Q(z) > 0.

Then we can rewrite P(e) as:
P(e) = Σ_z P(z, e) = Σ_z [P(z, e) / Q(z)] Q(z) = E_Q[w(z)],  where w(z) = P(z, e) / Q(z)

Monte Carlo estimate:
P̂(e) = (1/T) Σ_{t=1}^{T} w(z^t),  z^t ~ Q(Z)
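A minimal Python sketch of this estimator follows; sample_q and joint are hypothetical placeholders that the user must supply for a concrete model.

```python
def importance_sampling_estimate(sample_q, joint, e, T=10000):
    """Importance sampling estimate of P(e).

    sample_q() draws z ~ Q and returns (z, Q(z)); joint(z, e) returns P(z, e).
    Q must be positive wherever P(z, e) > 0.
    """
    total = 0.0
    for _ in range(T):
        z, qz = sample_q()            # z ~ Q(Z)
        total += joint(z, e) / qz     # weight w(z) = P(z, e) / Q(z)
    return total / T                  # unbiased estimate of P(e)
```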
Properties of IS estimate of P(e)
• Convergence: by the law of large numbers,
P̂(e) = (1/T) Σ_{i=1}^{T} w(z^i) → P(e)  a.s. as T → ∞
• Unbiased:  E_Q[P̂(e)] = P(e)
• Variance:
Var_Q[P̂(e)] = Var_Q[(1/T) Σ_i w(z^i)] = Var_Q[w(z)] / T
Properties of IS estimate of P(e)
• Mean Squared Error of the estimator:
MSE_Q[P̂(e)] = E_Q[(P̂(e) - P(e))²]
            = Var_Q[P̂(e)] + (E_Q[P̂(e)] - P(e))²
            = Var_Q[P̂(e)] = Var_Q[w(z)] / T
The squared bias term is zero because the estimator is unbiased, i.e. its expected value equals P(e).
Estimating P(Xi|e)
Let δ_{x_i}(z) be a Dirac-delta function, which is 1 if z contains x_i and 0 otherwise.

P(x_i | e) = P(x_i, e) / P(e) = Σ_z δ_{x_i}(z) P(z, e) / Σ_z P(z, e)
           = E_Q[δ_{x_i}(z) P(z, e) / Q(z)] / E_Q[P(z, e) / Q(z)]

Idea: estimate the numerator and the denominator by importance sampling.

Ratio estimate:
P̂(x_i | e) = P̂(x_i, e) / P̂(e) = Σ_{k=1}^{T} δ_{x_i}(z^k) w(z^k) / Σ_{k=1}^{T} w(z^k)

The estimate is biased: E[P̂(x_i | e)] ≠ P(x_i | e).
Properties of the IS estimator for P(Xi|e)
• Convergence: By Weak law of large numbers
• Asymptotically unbiased
• Variance
– Harder to analyze
– Liu suggests a measure called “Effective sample size”
P̂(x_i | e) → P(x_i | e)  as T → ∞
lim_{T→∞} E[P̂(x_i | e)] = P(x_i | e)
Effective Sample size
Define:  ESS(Q, T) = T / (1 + Var_Q[w(z)])

Given samples drawn from P(z|e), we could estimate P(x_i|e) using the ideal estimator
P̂_P(x_i | e) = (1/T) Σ_{t=1}^{T} g(z^t),  with g(z) = δ_{x_i}(z)

Approximately,
Var[P̂_Q(x_i | e)] ≈ Var[P̂_P(x_i | e)] · T / ESS(Q, T)

Thus T samples from P are worth ESS(Q, T) samples from Q; the effective sample size measures how much the estimator deviates from the ideal one. Therefore, the variance of the weights must be as small as possible.
Outline
• Definitions and Background on Statistics
• Theory of importance sampling
• Likelihood weighting
• State-of-the-art importance sampling techniques
39
Likelihood Weighting: Proposal Distribution
Proposal:  Q(X \ E) = ∏_{X_i ∈ X\E} P(X_i | pa_i, E = e)

Given a sample x = (x_1, ..., x_n), the weight is
w(x) = P(x, e) / Q(x)
     = [ ∏_{X_i ∈ X\E} P(x_i | pa_i, e) · ∏_{E_j ∈ E} P(e_j | pa_j) ] / ∏_{X_i ∈ X\E} P(x_i | pa_i, e)
     = ∏_{E_j ∈ E} P(e_j | pa_j)

Example: Given a Bayesian network P(X1, X2, X3) = P(X1) P(X2|X1) P(X3|X1, X2) with evidence X2 = x2, the proposal is Q(X1, X3) = P(X1) P(X3|X1, x2).
Likelihood Weighting: Sampling
[Figure: a chain network in which evidence nodes are interleaved with sampled nodes.]

Sample in topological order over X!
Clamp the evidence; sample x_i ~ P(X_i | pa_i), which is a look-up in the CPT.
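A minimal Python sketch of likelihood weighting for binary-variable networks, under an illustrative network representation (a topologically ordered list of (name, parents, cpt) triples); this is a sketch of the scheme above, not the authors' implementation.

```python
import random

def likelihood_weighting(bn, evidence, T=10000):
    """Estimate P(e) by likelihood weighting.

    bn: list of (name, parents, cpt) in topological order, where cpt maps a
    tuple of parent values to P(X = 1 | parents); all variables are binary.
    """
    total = 0.0
    for _ in range(T):
        sample, w = {}, 1.0
        for name, parents, cpt in bn:              # sample in topological order
            p1 = cpt[tuple(sample[p] for p in parents)]
            if name in evidence:                   # clamp evidence, multiply its likelihood
                sample[name] = evidence[name]
                w *= p1 if evidence[name] == 1 else 1.0 - p1
            else:                                  # otherwise a CPT look-up and a draw
                sample[name] = 1 if random.random() < p1 else 0
        total += w
    return total / T                               # estimate of P(e)

# Example network A -> B with illustrative numbers:
bn = [("A", (), {(): 0.3}), ("B", ("A",), {(0,): 0.2, (1,): 0.8})]
print(likelihood_weighting(bn, {"B": 1}))          # close to P(B=1) = 0.38
```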
Outline
• Definitions and Background on Statistics
• Theory of importance sampling
• Likelihood weighting
• State-of-the-art importance sampling techniques
42
Proposal selection
• One should try to select a proposal that is as close as possible to the posterior distribution.
Recall:  Var_Q[P̂(e)] = Var_Q[w(z)] / T,  with w(z) = P(z, e) / Q(z).

To have a zero-variance estimator we would need Var_Q[w(z)] = 0, i.e.
Q(z) = P(z, e) / P(e) = P(z | e)

so the ideal proposal is the posterior distribution itself.
Proposal Distributions used in Literature
• AIS-BN (adaptive proposal) – Cheng and Druzdzel, 2000
• Iterative Belief Propagation – Changhe and Druzdzel, 2003
• Iterative Join Graph Propagation (IJGP) and variable ordering – Gogate and Dechter, 2005
Perfect sampling using Bucket Elimination
• Algorithm:
– Run Bucket elimination on the problem along an ordering o=(XN,..,X1).
– Sample along the reverse ordering: (X1,..,XN)
– At each variable Xi, recover the probability P(Xi|x1,...,xi-1) by referring to the bucket.
Bucket elimination (BE): algorithm elim-bel (Dechter 1996), computing P(e)

bucket B:  P(B|A), P(D|B,A), P(e|B,C)  → h_B(A, D, C, e)
bucket C:  P(C|A), h_B(A, D, C, e)     → h_C(A, D, e)
bucket D:  h_C(A, D, e)                → h_D(A, e)
bucket E:  h_D(A, e)                   → h_E(A)
bucket A:  P(A), h_E(A)                → P(e)
Sampling from the output of BE (Dechter 2002)

bucket A:  Q(A) ∝ P(A) · h_E(A);  sample A = a
Evidence bucket E:  ignore
bucket D:  set A = a in the bucket;  sample D = d from Q(D | a, e) ∝ h_C(a, D, e)
bucket C:  set A = a, D = d in the bucket;  sample C = c from Q(C | a, d, e) ∝ P(C|a) · h_B(a, d, C, e)
bucket B:  set A = a, D = d, C = c in the bucket;  sample B = b from Q(B | a, c, d, e) ∝ P(B|a) P(d|B,a) P(e|B,c)
Mini-buckets: “local inference”
• Computation in a bucket is time and space exponential in the number of variables involved
• Therefore, partition the functions in a bucket into “mini-buckets” over smaller numbers of variables
• Can control the size of each “mini-bucket”, yielding polynomial complexity.
Mini-Bucket Elimination

Space and time constraint: the maximum scope size of any new function generated must be bounded by 2. BE would generate a function of scope size 3, so it cannot be used.

bucket B is split into two mini-buckets:
  mini-bucket 1:  P(B|A), P(D|B,A)  → Σ_B gives h_B(A, D)
  mini-bucket 2:  P(e|B,C)          → Σ_B gives h_B(C, e)
bucket C:  P(C|A), h_B(C, e)        → h_C(A, e)
bucket D:  h_B(A, D)                → h_D(A)
bucket E:  h_C(A, e)                → h_E(A)
bucket A:  P(A), h_D(A), h_E(A)     → approximation of P(e)
Sampling from the output of MBE

bucket B:  P(B|A), P(D|B,A)  |  P(e|B,C)
bucket C:  P(C|A), h_B(C, e)
bucket D:  h_B(A, D)
bucket E:  h_C(A, e)
bucket A:  P(A), h_D(A), h_E(A)

Sampling is the same as in BE-sampling except that Q is now constructed from a randomly selected “mini-bucket”.
IJGP-Sampling (Gogate and Dechter, 2005)
• Iterative Join Graph Propagation (IJGP)
– A Generalized Belief Propagation scheme (Yedidiaet al., 2002)
• IJGP yields better approximations of P(X|E) than MBE
– (Dechter, Kask and Mateescu, 2002)
• Output of IJGP is same as mini-bucket “clusters”
• Currently the best performing IS scheme!
Adaptive Importance Sampling

Initial proposal:  Q¹(Z) = Q¹(Z_1) Q¹(Z_2 | pa(Z_2)) ... Q¹(Z_n | pa(Z_n))
P̂(E = e) ← 0
For i = 1 to k do:
    Generate samples z^1, ..., z^N from Q^i
    P̂(E = e) ← P̂(E = e) + (1/N) Σ_{j=1}^{N} w_i(z^j)
    Update Q^{i+1} from Q^i and the weighted samples
Return P̂(E = e) / k
• General case
• Given k proposal distributions
• Take N samples out of each distribution
• Approximate P(e)
P̂(e) = (1/k) Σ_{j=1}^{k} (average weight under the j-th proposal)
Estimating Q'(z)
Q'(Z) = Q'(Z_1) Q'(Z_2 | pa(Z_2)) ... Q'(Z_n | pa(Z_n))
where each Q'(Z_i | Z_1, ..., Z_{i-1}) is estimated by importance sampling.
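A minimal Python sketch of this adaptive loop, with sample_q, weight, and update_q as hypothetical placeholders for the model-specific pieces:

```python
def adaptive_is(sample_q, weight, update_q, q0, k=10, N=1000):
    """Adaptive importance sampling estimate of P(e).

    sample_q(q) draws z from the current proposal q; weight(z, q) returns
    P(z, e)/q(z); update_q(q, batch, weights) fits the next proposal.
    """
    q, estimate = q0, 0.0
    for _ in range(k):
        batch = [sample_q(q) for _ in range(N)]        # z^1 ... z^N ~ Q^i
        weights = [weight(z, q) for z in batch]        # importance weights
        estimate += sum(weights) / N                   # per-round estimate of P(e)
        q = update_q(q, batch, weights)                # Q^{i+1} from the weighted samples
    return estimate / k                                # average over the k proposals
```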
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Markov Chain
• A Markov chain is a discrete random process with the property that the next state depends only on the
current state (Markov Property):
[Figure: a chain x1 → x2 → x3 → x4]

P(x_t | x_{t-1}, ..., x_1) = P(x_t | x_{t-1})

• If P(X_t | x_{t-1}) does not depend on t (time-homogeneous) and the state space is finite, then it is often expressed as a transition function (aka transition matrix) whose rows sum to 1: Σ_x P(X_t = x | x_{t-1}) = 1
Example: Drunkard’s Walk

• A random walk on the number line where, at each step, the position may change by +1 or −1 with equal probability.

D(X) = {0, ±1, ±2, ...}
Transition function: P(X_{t+1} = n+1 | X_t = n) = P(X_{t+1} = n−1 | X_t = n) = 0.5
Example: Weather Model
D(X) = {rainy, sunny}

Transition matrix P(X_{t+1} | X_t):
         rainy  sunny
rainy     0.9    0.1
sunny     0.5    0.5

Example state sequence: rain, rain, rain, rain, sun
Multi-Variable System
• state is an assignment of values to all the variables
X = {X1, X2, X3, ...},  D(X_i) discrete and finite
State at time t:  x^t = {x_1^t, x_2^t, ..., x_n^t}

[Figure: state variables x_1^t, x_2^t, x_3^t transition jointly to x_1^{t+1}, x_2^{t+1}, x_3^{t+1}.]
Bayesian Network System
• Bayesian Network is a representation of the joint probability distribution over 2 or more variables
X = {X1, X2, X3},  x^t = {x_1^t, x_2^t, x_3^t}

[Figure: a three-variable network X1 → X2, X1 → X3; the chain moves from state x^t to x^{t+1}.]
61
Stationary Distribution: Existence

• If the Markov chain is time-homogeneous, then the vector π(X) is a stationary distribution (aka invariant or equilibrium distribution, aka “fixed point”) if its entries sum up to 1 and satisfy:
• Finite state space Markov chain has a unique stationary distribution if and only if:– The chain is irreducible
– All of its states are positive recurrent
π(x_j) = Σ_{x_i ∈ D(X)} π(x_i) P(x_j | x_i)
62
Irreducible
• A state x is irreducible if under the transition rule one has nonzero probability of moving from x to any other state and then coming back in a finite number of steps
• If one state is irreducible, then all the states must be irreducible
(Liu, Ch. 12, pp. 249, Def. 12.1.1)
63
Recurrent
• A state x is recurrent if the chain returns to x
with probability 1
• Let M(x) be the expected number of steps to return to state x
• State x is positive recurrent if M(x) is finite. The recurrent states in a finite-state chain are positive recurrent.
Stationary Distribution Convergence

• Consider the infinite Markov chain: P^(n) = P(x_n | x_0) = π_0 P^n
• Initial state is not important in the limit: “The most useful feature of a ‘good’ Markov chain is its fast forgetfulness of its past…” (Liu, Ch. 12.1)
• If the chain is both irreducible and aperiodic, then:
π = lim_{n→∞} P^(n)
Aperiodic
• Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for all i, then the chain is aperiodic.
• Positive recurrent, aperiodic states are ergodic
65
Markov Chain Monte Carlo
• How do we estimate P(X), e.g., P(X|e) ?
66
• Generate samples that form a Markov chain with stationary distribution π = P(X|e)
• Estimate from the samples (observed states): the visited states x^0, ..., x^n can be viewed as “samples” from distribution π
π̂(x) = (1/T) Σ_{t=1}^{T} δ(x, x^t),    π(x) = lim_{T→∞} π̂(x)
MCMC Summary
• Convergence is guaranteed in the limit
• Samples are dependent, not i.i.d.
• Convergence (mixing rate) may be slow
• The stronger the correlation between states, the slower the convergence!
• Initial state is not important, but… typically, we throw away first K samples - “burn-in”
67
Gibbs Sampling (Geman&Geman,1984)
• Gibbs sampler is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables
• Sample new variable value one variable at a time from the variable’s conditional distribution:
• Samples form a Markov chain with stationary distribution P(X|e)
x_i^{t+1} ~ P(X_i | x_1^{t+1}, ..., x_{i-1}^{t+1}, x_{i+1}^t, ..., x_n^t) = P(X_i | x^t \ x_i)
Gibbs Sampling: Illustration
The process of Gibbs sampling can be understood as a random walk in the space of all instantiations of X=x (remember drunkard’s walk):
In one step we can reach instantiations that differ from current one by value assignment to at most one variable (assume randomized choice of variables Xi).
Ordered Gibbs Sampler
Generate sample xt+1 from xt :
X_1^{t+1} ~ P(X_1 | x_2^t, x_3^t, ..., x_N^t, e)
X_2^{t+1} ~ P(X_2 | x_1^{t+1}, x_3^t, ..., x_N^t, e)
X_3^{t+1} ~ P(X_3 | x_1^{t+1}, x_2^{t+1}, x_4^t, ..., x_N^t, e)
...
X_N^{t+1} ~ P(X_N | x_1^{t+1}, x_2^{t+1}, ..., x_{N-1}^{t+1}, e)

In short, for i = 1 to N (processing all variables in some order):
x_i^{t+1} is sampled from P(X_i | x^t \ x_i, e)
Transition Probabilities in a BN

Markov blanket:
markov_i = pa_i ∪ ch_i ∪ (∪_{X_j ∈ ch_i} pa_j)

P(x_i | x^t \ x_i) = P(x_i | markov_i^t):
P(x_i | markov_i) ∝ P(x_i | pa_i) ∏_{X_j ∈ ch_i} P(x_j | pa_j)

Given its Markov blanket (parents, children, and their parents), X_i is independent of all other nodes.
Computation is linear in the size of the Markov blanket!
Ordered Gibbs Sampling Algorithm(Pearl,1988)
Input: X, E=e
Output: T samples {xt }
Fix evidence E = e, initialize x^0 at random.
1. For t = 1 to T (compute samples)
2.   For i = 1 to N (loop through variables)
3.     x_i^{t+1} ← sampled from P(X_i | markov_i^t)
4.   End For
5. End For
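A minimal Python sketch of the algorithm above, assuming binary variables and a hypothetical markov_blanket_dist routine that returns P(X_i = 1 | markov_i) by multiplying the CPTs of X_i and its children and normalizing:

```python
import random

def gibbs_samples(bn, evidence, variables, T, markov_blanket_dist):
    """Ordered Gibbs sampling; returns T full samples (one per sweep)."""
    state = dict(evidence)
    for v in variables:                        # random initialization of non-evidence variables
        state.setdefault(v, random.choice([0, 1]))
    samples = []
    for _ in range(T):
        for v in variables:
            if v in evidence:
                continue
            p1 = markov_blanket_dist(bn, v, state)   # P(X_v = 1 | Markov blanket)
            state[v] = 1 if random.random() < p1 else 0
        samples.append(dict(state))            # one full sweep = one sample x^{t+1}
    return samples
```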
Gibbs Sampling Example - BN

X = {X1, X2, ..., X9},  E = {X9}

[Figure: a nine-variable network over X1, ..., X9 with X9 observed.]

Initialize: X1 = x1^0, X2 = x2^0, ..., X8 = x8^0

First sweep:
x1^1 ~ P(X1 | x2^0, ..., x8^0, x9)
x2^1 ~ P(X2 | x1^1, x3^0, ..., x8^0, x9)
...
Answering Queries: P(x_i | e) = ?

• Method 1: count the number of samples where X_i = x_i (histogram estimator):
P̂(X_i = x_i) = (1/T) Σ_{t=1}^{T} δ(x_i, x^t)        (δ is the Dirac delta function)

• Method 2: average the conditional probability (mixture estimator):
P̂(X_i = x_i) = (1/T) Σ_{t=1}^{T} P(x_i | markov_i^t)

• The mixture estimator converges faster (consider the estimates for the unobserved values of X_i; prove via the Rao-Blackwell theorem).
Rao-Blackwell Theorem

Rao-Blackwell Theorem: Let the random variable set X be composed of two groups of variables, R and L. Then, for the joint distribution π(R, L) and a function of interest g (e.g., the mean or covariance), the following holds (Casella & Robert, 1996; Liu et al., 1995):

Var[E{g(R) | L}] ≤ Var[g(R)]

• The theorem makes a weak promise, but works well in practice!
• The improvement depends on the choice of R and L.
Importance vs. Gibbs

Gibbs:
x^t ~ P̂(X | e),  where P̂(X | e) converges to P(X | e)
ĝ(X) = (1/T) Σ_{t=1}^{T} g(x^t)

Importance:
x^t ~ Q(X | e)
ĝ = (1/T) Σ_{t=1}^{T} g(x^t) P(x^t) / Q(x^t)        (the ratio is the weight w^t)
Gibbs Sampling: Convergence

• Sample from P̂(X|e), which converges to P(X|e)
• Converges iff the chain is irreducible and ergodic
• Intuition: the chain must be able to explore all states:
  – if X_i and X_j are strongly correlated, e.g. X_i = 0 ⇔ X_j = 0, then we cannot explore states with X_i = 1 and X_j = 1
• All conditions are satisfied when all probabilities are positive
• Convergence rate can be characterized by the second eigenvalue of the transition matrix
Gibbs: Speeding Convergence
Reduce dependence between samples (autocorrelation)
• Skip samples
• Randomize Variable Sampling Order
• Employ blocking (grouping)
• Multiple chains
Reduce variance (cover in the next section)
79
Blocking Gibbs Sampler
• Sample several variables together, as a block
• Example: Given three variables X,Y,Z, with domains of size 2, group Y and Z together to form a variable W={Y,Z} with domain size 4. Then, given sample (xt,yt,zt), compute next sample:
+ Can improve convergence greatly when two variables are strongly correlated!
- Domain of the block variable grows exponentially with the #variables in a block!
x^{t+1} ~ P(X | y^t, z^t)
(y^{t+1}, z^{t+1}) = w^{t+1} ~ P(Y, Z | x^{t+1})
Gibbs: Multiple Chains
• Generate M chains of size K
• Each chain produces independent estimate Pm:
P̂_m(x_i | e) = (1/K) Σ_{t=1}^{K} P(x_i | x^t \ x_i, e)

• Treat the P̂_m as independent random variables.
• Estimate P(x_i | e) as the average of the P̂_m(x_i | e):
P̂ = (1/M) Σ_{m=1}^{M} P̂_m
Gibbs Sampling Summary
• Markov Chain Monte Carlo method(Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994)
• Samples are dependent, form Markov Chain
• Sample from P̂(X|e), which converges to P(X|e)
• Guaranteed to converge when all P > 0
• Methods to improve convergence:
  – Blocking
  – Rao-Blackwellisation
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Outline
• Rejection problem
• Backtrack-free distribution
– Constructing it in practice
• SampleSearch
– Construct the backtrack-free distribution on the fly.
• Approximate estimators
• Experiments
Rejection problem
• Importance sampling requirement: P(z, e) > 0 ⇒ Q(z) > 0
• When P(z,e)=0 but Q(z) > 0, the weight of the sample is zero and it is rejected.
• The probability of generating a rejected sample should be very small.– Otherwise the estimate will be zero.
P̂(e) = (1/N) Σ_{i=1}^{N} P(z^i, e) / Q(z^i)
Rejection Problem
All blue leaves correspond to solutions, i.e. g(x) > 0; all red leaves correspond to non-solutions, i.e. g(x) = 0.

Importance sampling:
P̂(e) = (1/N) Σ_{i=1}^{N} P(z^i, e) / Q(z^i),  and P(z^i, e) = 0 if z^i is not a solution.

[Figure: a full search tree over A, B, C with branch probabilities under Q (e.g. Q(A=0) = 0.8, Q(A=1) = 0.2); Q is positive on every branch.]

Constraints: A ≠ B, A ≠ C.
If z violates either A ≠ B or A ≠ C then P(z, e) = 0.
Outline
• Rejection problem
• Backtrack-free distribution
– Constructing it in practice
• SampleSearch
– Construct the backtrack-free distribution on the fly.
• Approximate estimators
• Experiments
Backtrack-free distribution: a rejection-free distribution

All blue leaves correspond to solutions, i.e. g(x) > 0; all red leaves correspond to non-solutions, i.e. g(x) = 0.

[Figure: the same search tree over A, B, C; branches leading only to non-solutions get probability 0 and the remaining branches are renormalized.]

Q^F(branch) = 0 if there are no solutions under it
Q^F(branch) ∝ Q(branch) otherwise

Constraints: A ≠ B, A ≠ C.
Generating samples from Q^F

Q^F(branch) = 0 if there are no solutions under it
Q^F(branch) ∝ Q(branch) otherwise

• Invoke an oracle at each branch. The oracle returns:
  – True if there is a solution under the branch
  – False otherwise

[Figure: sampling walks down the tree, asking the oracle “solution?” below each candidate branch before committing to it.]

Constraints: A ≠ B, A ≠ C.
Generating samples from Q^F

• Oracles:
  – Adaptive consistency as a pre-processing step
    • Constant-time table look-up
    • Exponential in the treewidth of the constraint portion
  – A complete CSP solver
    • Needs to be run at each assignment

Constraints: A ≠ B, A ≠ C.

Gogate et al., UAI 2005; Gogate and Dechter, UAI 2005
Outline
• Rejection problem
• Backtrack-free distribution
– Constructing it in practice
• SampleSearch
– Construct the backtrack-free distribution on the fly.
• Approximate estimators
• Experiments
Algorithm SampleSearch

P̂(e) = (1/N) Σ_{j=1}^{N} P(z^j, e) / Q(z^j)

[Figure: the search tree over A, B, C with branch probabilities under Q; constraints A ≠ B, A ≠ C.]

Gogate and Dechter, AISTATS 2007, AAAI 2007
[Figures: SampleSearch in action. Variables are sampled from Q top-down; when an assignment violates a constraint (here A ≠ B or A ≠ C), that value is rejected, the search backtracks, and sampling resumes from the remaining values until P(sample, e) > 0.]
Generate more Samples

[Figure: the process is repeated from the root to generate additional samples; every completed sample is a solution.]
Traces of SampleSearch
[Figure: four different traces of SampleSearch on the same problem; each trace is the partial search tree actually explored while producing one sample.]

Constraints: A ≠ B, A ≠ C.
Gogate and Dechter, AISTATS 2007, AAAI 2007
SampleSearch: Sampling Distribution

• Problem: because of the search, the samples are no longer i.i.d. from Q:
P̂(e) = (1/N) Σ_{j=1}^{N} P(z^j, e) / Q(z^j),  but E_Q[P̂(e)] ≠ P(e)

• Theorem: SampleSearch generates i.i.d. samples from the backtrack-free distribution Q^F:
P̂(e) = (1/N) Σ_{j=1}^{N} P(z^j, e) / Q^F(z^j),  and E_{Q^F}[P̂(e)] = P(e)

Gogate and Dechter, AISTATS 2007, AAAI 2007
The sampling distribution Q^F of SampleSearch

[Figure: the explored tree, with dead branches assigned probability 0 and the surviving branches renormalized.]

What is the probability of generating A = 0?
  Q^F(A = 0) = 0.8.  Why? SampleSearch is systematic.
What is the probability of generating (A = 0, B = 1)?
  Q^F(B = 1 | A = 0) = 1.  Why? SampleSearch is systematic.
What is the probability of generating (A = 0, B = 0)?
  Simple: Q^F(B = 0 | A = 0) = 0. All samples generated by SampleSearch are solutions.

This is exactly the backtrack-free distribution.

Constraints: A ≠ B, A ≠ C.
Gogate and Dechter, AISTATS 2007, AAAI 2007
Outline
• Rejection problem
• Backtrack-free distribution
– Constructing it in practice
• SampleSearch
– Construct the backtrack-free distribution on the fly.
• Approximate estimators
• Experiments
Asymptotic approximations of Q^F

[Figure: a partially explored tree; for an unexplored sibling branch (a “hole”) we do not know whether it contains solutions.]

• IF there is a hole THEN
  – U^F = Q, i.e. assume there is a solution under the other branch
  – L^F = 0, i.e. assume there is no solution under the other branch

Gogate and Dechter, AISTATS 2007, AAAI 2007
Approximations: Convergence in the limit

• Store all possible traces.

[Figure: the traces of individual samples are merged into a combined sample tree; unexplored branches remain marked “?” (holes).]

Gogate and Dechter, AISTATS 2007, AAAI 2007
Approximations: Convergence in the limit

• From the combined sample tree, update U and L:
  IF there is a hole THEN U^F_N = Q and L^F_N = 0

Bounding: both approximations converge to the exact value,
lim_{N→∞} E[P(z, e) / U^F_N(z)] = P(e) = lim_{N→∞} E[P(z, e) / L^F_N(z)]
and for a finite number of samples they bound the Q^F-based sample mean from below and above:
P̂_L(e) ≤ P̂_{Q^F}(e) ≤ P̂_U(e)

Asymptotically unbiased.

Gogate and Dechter, AISTATS 2007, AAAI 2007
Upper and Lower Approximations
• Asymptotically unbiased.
• Upper and lower bound on the unbiased sample mean
• Linear time and space overhead
• Bias versus variance tradeoff
– Bias = difference between the upper and lower approximation.
Improving Naive SampleSearch

• Better search strategy
  – Can use any state-of-the-art CSP/SAT solver, e.g. minisat (Een and Sorrenson, 2006)
  – All theorems and results still hold
• Better importance function
  – Use the output of generalized belief propagation to compute the initial importance function Q (Gogate and Dechter, 2005)

Gogate and Dechter, AISTATS 2007, AAAI 2007
Experiments

• Tasks
  – Weighted counting
  – Marginals
• Benchmarks
  – Satisfiability problems (counting solutions)
  – Linkage networks
  – Relational instances (first-order probabilistic networks)
  – Grid networks
  – Logistics planning instances
• Algorithms
  – SampleSearch/UB, SampleSearch/LB
  – SampleCount (Gomes et al. 2007, SAT)
  – ApproxCount (Wei and Selman, 2007, SAT)
  – RELSAT (Bayardo and Pehoushek, 2000, SAT)
  – Edge Deletion Belief Propagation (Choi and Darwiche, 2006)
  – Iterative Join Graph Propagation (Dechter et al., 2002)
  – Variable Elimination and Conditioning (VEC)
  – EPIS (Changhe and Druzdzel, 2006)
Results: Solution Counts (Langford instances). Time bound: 10 hrs.
Results: Probability of Evidence (Linkage instances, UAI 2006 evaluation). Time bound: 3 hrs.
Results: Probability of Evidence (Linkage instances, UAI 2008 evaluation). Time bound: 3 hrs.
Results on Marginals

• Evaluation criterion: Hellinger distance
  – Always bounded between 0 and 1
  – Lower bounds the KL distance
  – When probabilities close to zero are present, the KL distance may tend to infinity

Hellinger distance = (1/n) Σ_{i=1}^{n} (1/2) Σ_{x_i ∈ D_i} ( √(Exact P(x_i)) − √(Approximate A(x_i)) )²
Results: Posterior Marginals (Linkage instances, UAI 2006 evaluation). Time bound: 3 hrs. Distance measure: Hellinger distance.
Summary: SampleSearch

• Manages the rejection problem while sampling
  – Systematic backtracking search
• The sampling distribution of SampleSearch is the backtrack-free distribution Q^F
  – Expensive to compute
• An approximation of Q^F based on storing all traces yields an asymptotically unbiased estimator
  – Linear time and space overhead
  – Bounds the sample mean from above and below
• Empirically, when a substantial number of zero probabilities are present, SampleSearch-based schemes dominate their pure sampling counterparts and generalized belief propagation.
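A minimal Python sketch of the SampleSearch idea (a naive backtracking version, not using a SAT solver): values whose subtrees are discovered to be dead are pruned and the variable is resampled from the surviving values, so every completed sample is a solution. The routine assumes at least one solution exists; consistent() and the proposal table q are placeholders.

```python
import random

def sample_search(domains, q, consistent):
    """Return one sample that satisfies all constraints.

    domains[i]: list of values for variable i; q[i][v]: proposal Q(X_i = v);
    consistent(partial): True iff the partial assignment violates no constraint.
    """
    n = len(domains)
    assign = []
    pruned = [set() for _ in range(n)]          # values proven dead at each depth
    while len(assign) < n:
        i = len(assign)
        allowed = [v for v in domains[i] if v not in pruned[i]]
        if not allowed:                          # dead end: backtrack (a solution is assumed to exist)
            pruned[i] = set()
            pruned[i - 1].add(assign.pop())
            continue
        total = sum(q[i][v] for v in allowed)
        r, acc, choice = random.random() * total, 0.0, allowed[-1]
        for v in allowed:                        # draw from Q restricted to surviving values
            acc += q[i][v]
            if r <= acc:
                choice = v
                break
        if consistent(assign + [choice]):        # keep the value if no constraint is violated
            assign.append(choice)
        else:
            pruned[i].add(choice)                # otherwise prune it and resample this variable
    return assign                                # a full sample that is guaranteed to be a solution
```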
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Sampling: Performance
• Gibbs sampling
– Reduce dependence between samples
• Importance sampling
– Reduce variance
• Achieve both by sampling a subset of variables and integrating out the rest (reduce dimensionality), aka Rao-Blackwellisation
• Exploit graph structure to manage the extra cost
119
Smaller Subset State-Space
• Smaller state-space is easier to cover
X = {X1, X2, X3, X4}  vs.  C = {X1, X2}
|D(X)| = 64  vs.  |D(C)| = 16
Smoother Distribution
[Figure: the joint distribution P(X1, X2, X3, X4) is spiky, while the marginal P(X1, X2) over the sampled subset is smoother; probabilities range roughly from 0 to 0.26.]
Speeding Up Convergence
• Mean Squared Error of the estimator:
MSE_Q[P̂] = Var_Q[P̂] + BIAS²,   where Var_Q[P̂] = E_Q[(P̂ − E_Q[P̂])²]
• In case of an unbiased estimator, BIAS = 0
• Reducing the variance speeds up convergence!
Rao-Blackwellisation

Let X = R ∪ L. Compare the plain estimator with the Rao-Blackwellised one (Liu, Ch. 2.3):

ĝ(x) = (1/T) Σ_{t=1}^{T} g(x^t)
g̃(x) = (1/T) Σ_{t=1}^{T} E[g(x) | l^t]

Since Var{g(x)} = Var{E[g(x) | l]} + E{Var[g(x) | l]} ≥ Var{E[g(x) | l]},
it follows that Var{g̃(x)} ≤ Var{ĝ(x)}.
Rao-Blackwellisation

• X = R ∪ L
• Importance Sampling (Liu, Ch. 2.5.5):
Var_Q{ P(R, L) / Q(R, L) } ≥ Var_Q{ P(R) / Q(R) }
• Gibbs Sampling:
  – autocovariances are lower (less correlation between samples)
  – if X_i and X_j are strongly correlated (e.g. X_i = 0 ⇔ X_j = 0), include only one of them in the sampling set
• “Carry out analytical computation as much as possible” - Liu
Blocking Gibbs Sampler vs. Collapsed

Consider three variables X, Y, Z:

• Standard Gibbs: sample from P(x | y, z), P(y | x, z), P(z | x, y)      (1)
• Blocking: sample from P(x | y, z), P(y, z | x)                          (2)
• Collapsed (sum out Z): sample from P(x | y), P(y | x)                   (3)

Faster convergence: (1) → (2) → (3).
Collapsed Gibbs Sampling: Generating Samples

Generate sample c^{t+1} from c^t. In short, for i = 1 to K:
C_1^{t+1} ~ P(C_1 | c_2^t, c_3^t, ..., c_K^t, e)
C_2^{t+1} ~ P(C_2 | c_1^{t+1}, c_3^t, ..., c_K^t, e)
C_3^{t+1} ~ P(C_3 | c_1^{t+1}, c_2^{t+1}, c_4^t, ..., c_K^t, e)
...
C_K^{t+1} ~ P(C_K | c_1^{t+1}, c_2^{t+1}, ..., c_{K-1}^{t+1}, e)

i.e. c_i^{t+1} is sampled from P(C_i | c^t \ c_i, e).
Collapsed Gibbs Sampler
Input: C ⊆ X, E = e
Output: T samples {c^t}
Fix evidence E = e, initialize c^0 at random.
1. For t = 1 to T (compute samples)
2.   For i = 1 to K (loop through the cutset variables)
3.     c_i^{t+1} ← sampled from P(C_i | c^t \ c_i, e)
4.   End For
5. End For
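A minimal Python sketch of the collapsed (w-cutset) Gibbs sweep above, with joint_with_cutset as a placeholder for exact inference (e.g., bucket elimination) over the non-cutset variables:

```python
import random

def cutset_gibbs(cutset_domains, evidence, joint_with_cutset, T):
    """Collapsed Gibbs sampling over the cutset variables only.

    cutset_domains: dict variable -> list of values;
    joint_with_cutset(assignment, e): returns P(c, e), summing out all
    non-cutset variables exactly (this is the expensive, exact-inference step).
    """
    c = {v: random.choice(dom) for v, dom in cutset_domains.items()}   # random c^0
    samples = []
    for _ in range(T):
        for v, dom in cutset_domains.items():
            # P(C_v | c \ c_v, e) obtained by normalizing the exact joints P(c_v, c \ c_v, e)
            joints = [joint_with_cutset({**c, v: val}, evidence) for val in dom]
            z = sum(joints)
            r, acc = random.random() * z, 0.0
            for val, p in zip(dom, joints):
                acc += p
                if r <= acc:
                    c[v] = val
                    break
        samples.append(dict(c))
    return samples
```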
Calculation Time
• Computing P(ci| ct\ci,e) is more expensive (requires inference)
• Trading #samples for smaller variance:
– generate more samples with higher covariance
– generate fewer samples with lower covariance
• Must control the time spent computing sampling probabilities in order to be time-effective!
128
Exploiting Graph Properties

Recall: computation time is exponential in the adjusted induced width of the graph.

• A w-cutset is a subset of variables such that, when they are observed, the induced width of the graph is w
• When the sampled variables form a w-cutset, inference is exp(w) (e.g., using Bucket Tree Elimination)
• A cycle-cutset is a special case of a w-cutset (w = 1)

Sampling over a w-cutset ⇒ w-cutset sampling!
What If C = Cycle-Cutset?

c^0 = {x_2^0, x_5^0},  E = {X9}

[Figure: a network over X1, ..., X9; removing the cycle cutset {X2, X5} leaves a tree over the remaining variables.]

P(x_2, x_5, x_9) can be computed using Bucket Elimination; the computation complexity is O(N).
Computing Transition Probabilities

[Figure: the network over X1, ..., X9 with cutset {X2, X5} and evidence X9.]

Compute the joint probabilities (via Bucket Elimination):
BE: P(x_2 = 0, x_5, x_9)
BE: P(x_2 = 1, x_5, x_9)

Normalize:
P(x_2 = 0 | x_5, x_9) ∝ P(x_2 = 0, x_5, x_9)
P(x_2 = 1 | x_5, x_9) ∝ P(x_2 = 1, x_5, x_9)
Cutset Sampling - Answering Queries

• Query: c_i ∈ C, P(c_i | e) = ?  Same as Gibbs:
P̂(c_i | e) = (1/T) Σ_{t=1}^{T} P(c_i | c^t \ c_i, e)      (computed while generating sample t, using bucket tree elimination)

• Query: x_i ∈ X \ C, P(x_i | e) = ?
P̂(x_i | e) = (1/T) Σ_{t=1}^{T} P(x_i | c^t, e)             (computed after generating sample t, using bucket tree elimination)
Cutset Sampling vs. Cutset Conditioning

• Cutset Conditioning (exact):
P(x_i | e) = Σ_{c ∈ D(C)} P(x_i | c, e) P(c | e)

• Cutset Sampling:
P̂(x_i | e) = (1/T) Σ_{t=1}^{T} P(x_i | c^t, e)
           = Σ_{c ∈ D(C)} P(x_i | c, e) · count(c) / T
           → Σ_{c ∈ D(C)} P(x_i | c, e) P(c | e)
Cutset Sampling Example

Estimating P(x_2 | e) for the sampled node X2 (cutset {X2, X5}, evidence X9):

Sample 1:  P(x_2 | x_5^1, x_9)
Sample 2:  P(x_2 | x_5^2, x_9)
Sample 3:  P(x_2 | x_5^3, x_9)

P̂(x_2 | x_9) = (1/3) [ P(x_2 | x_5^1, x_9) + P(x_2 | x_5^2, x_9) + P(x_2 | x_5^3, x_9) ]
Cutset Sampling Example

Estimating P(x_3 | e) for the non-sampled node X3:

c^1 = {x_2^1, x_5^1}:  P(x_3 | x_2^1, x_5^1, x_9)
c^2 = {x_2^2, x_5^2}:  P(x_3 | x_2^2, x_5^2, x_9)
c^3 = {x_2^3, x_5^3}:  P(x_3 | x_2^3, x_5^3, x_9)

P̂(x_3 | x_9) = (1/3) [ P(x_3 | x_2^1, x_5^1, x_9) + P(x_3 | x_2^2, x_5^2, x_9) + P(x_3 | x_2^3, x_5^3, x_9) ]
CPCS54 Test Results
[Plots: MSE vs. #samples (left) and vs. time (right), comparing cutset sampling and Gibbs sampling.]
Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 3
Exact time = 30 sec using Cutset Conditioning
CPCS179 Test Results
[Plots: MSE vs. #samples (left) and vs. time (right), comparing cutset sampling and Gibbs sampling.]
Non-ergodic (1 deterministic CPT entry), |X| = 179, |C| = 8, 2 ≤ D(Xi) ≤ 4, |E| = 35
Exact time = 122 sec using Cutset Conditioning
CPCS360b Test Results
[Plots: MSE vs. #samples (left) and vs. time (right), comparing cutset sampling and Gibbs sampling.]
Ergodic, |X| = 360, D(Xi) = 2, |C| = 21, |E| = 36
Exact time > 60 min using Cutset Conditioning; exact values obtained via Bucket Elimination
Random Networks
[Plots: MSE vs. #samples (left) and vs. time (right), comparing cutset sampling and Gibbs sampling.]
|X| = 100, D(Xi) = 2, |C| = 13, |E| = 15-20
Exact time = 30 sec using Cutset Conditioning
Coding Networks: Cutset Transforms a Non-Ergodic Chain to an Ergodic One

[Plot: MSE vs. time, comparing IBP, Gibbs, and cutset sampling on coding networks, n = 100, |C| = 12-14.]
Non-ergodic, |X| = 100, D(Xi) = 2, |C| = 13-16, |E| = 50
Sampling the cutset yields an ergodic subspace U = {U1, U2, ..., Uk}
Exact time = 50 sec using Cutset Conditioning
Non-Ergodic Hailfinder
[Plots: MSE vs. #samples (left) and vs. time (right), comparing cutset sampling and Gibbs sampling.]
Non-ergodic, |X| = 56, |C| = 5, 2 ≤ D(Xi) ≤ 11, |E| = 0
Exact time = 2 sec using Loop-Cutset Conditioning
CPCS360b - MSE
[Plot: MSE vs. time on cpcs360b, N = 360, |E| = 20-34, w* = 20, comparing Gibbs, IBP, and cutset sampling with |C| = 26 (fw = 3) and |C| = 48 (fw = 2).]
Ergodic, |X| = 360, |C| = 26, D(Xi) = 2
Exact time = 50 min using BTE
Cutset Importance Sampling

• Apply importance sampling over the cutset C:
P̂(e) = (1/T) Σ_{t=1}^{T} P(c^t, e) / Q(c^t) = (1/T) Σ_{t=1}^{T} w^t
P̂(c_i | e) = (1/T) Σ_{t=1}^{T} δ(c_i, c^t) w^t
P̂(x_i | e) = (1/T) Σ_{t=1}^{T} P(x_i | c^t, e) w^t

where P(c^t, e) is computed using Bucket Elimination.
(Gogate & Dechter, 2005) and (Bidyuk & Dechter, 2006)
Likelihood Cutset Weighting (LCS)

• Z = topological order over {C, E}
• Generating sample t+1:

For each Z_i ∈ Z do:
    If Z_i ∈ E
        z_i^{t+1} ← observed value of Z_i
    Else
        z_i^{t+1} ← sampled from P(Z_i | z_1^{t+1}, ..., z_{i-1}^{t+1})
    End If
End For

• P(Z_i | z_1, ..., z_{i-1}) is computed while generating sample t using bucket tree elimination
• It can be memoized for some number of instances K (based on the memory available)

KL[P(C|e), Q(C)] ≤ KL[P(X|e), Q(X)]
Pathfinder 1
Pathfinder 2
Link
[Results plots for these benchmarks are not reproduced.]
Summary

Importance Sampling:
• i.i.d. samples
• Unbiased estimator
• Generates samples fast
• Samples from Q
• Rejects samples with zero weight
• Improves on cutset

Gibbs Sampling:
• Dependent samples
• Biased estimator
• Generates samples slower
• Samples from an approximation of P(X|e)
• Does not converge in presence of constraints
• Improves on cutset
CPCS360b
[Plot: MSE vs. time on cpcs360b, N = 360, |LC| = 26, w* = 21, |E| = 15, comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW - likelihood weighting; LCS - likelihood weighting on a cutset
CPCS422b
[Plot: MSE vs. time on cpcs422b, N = 422, |LC| = 47, w* = 22, |E| = 28, comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW - likelihood weighting; LCS - likelihood weighting on a cutset
Coding Networks
[Plot: MSE vs. time on coding networks, N = 200, P = 3, |LC| = 26, w* = 21, comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW - likelihood weighting; LCS - likelihood weighting on a cutset
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Motivation

Expected value of the number on the face of a die:
(1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5

What is the expected value of the product of the numbers on the faces of k dice?  (3.5)^k

Monte Carlo estimate:
• Perform the following experiment N times:
  – Toss all k dice.
  – Record the product of the numbers on the top face of each die.
• Report the average over the N runs:
Ẑ = (1/N) Σ_{i=1}^{N} ∏_{j=1}^{k} (number on the face of the j-th die in the i-th run)
How does the sample average converge?
[Plot: the running average for 10 dice; the exact answer is (3.5)^10.]
But This is Really Dumb?

The dice are independent.

A better Monte Carlo estimate:
1. Perform the experiment N times.
2. For each die, record the average of the numbers it showed.
3. Take the product of the k averages:
Ẑ_new = ∏_{j=1}^{k} (1/N) Σ_{i=1}^{N} (number on the face of the j-th die in the i-th run)

• Conventional estimate: average of the products:
Ẑ = (1/N) Σ_{i=1}^{N} ∏_{j=1}^{k} (number on the face of the j-th die in the i-th run)
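A minimal Python sketch comparing the two estimators on k dice:

```python
import random
from math import prod

def estimates(k=10, N=1000):
    rolls = [[random.randint(1, 6) for _ in range(k)] for _ in range(N)]
    # Conventional estimate: average of the per-run products.
    avg_of_products = sum(prod(r) for r in rolls) / N
    # Estimate exploiting independence: product of the per-die averages.
    product_of_avgs = prod(sum(r[j] for r in rolls) / N for j in range(k))
    return avg_of_products, product_of_avgs

print(estimates(), 3.5 ** 10)   # the second estimate is typically far closer to 3.5^10
```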
How does the sample average converge?
[Plot: convergence of the two estimates for 10 dice.]
Moral of the story
• Make use of (conditional) independence to get better results
• Used extensively for exact inference:
  – Bucket Elimination (Dechter, 1996)
  – Junction tree (Lauritzen and Spiegelhalter, 1988)
  – Value Elimination (Bacchus et al., 2004)
  – Recursive Conditioning (Darwiche, 2001)
  – BTD (Jegou et al., 2002)
  – AND/OR search (Dechter and Mateescu, 2007)
• How to use it for sampling?
  – AND/OR importance sampling
Background: AND/OR search spaces

[Figure: a constraint problem over A, B, C with domains {0, 1, 2} and constraints A ≠ B, A ≠ C; its pseudo-tree has A as the root with children B and C, while the chain pseudo-tree is A - B - C.]

[Figure: the OR search tree branches on A, then B, then C in sequence, while the AND/OR search tree branches on A (OR node) and, below each value of A, the subproblems over B and C appear as independent children of an AND node.]
AND/OR sampling: Example

A Bayesian network with structure A → B → D and A → C → F, and CPTs P(A), P(B|A), P(C|A), P(D|B), P(F|C).

P(d, f) = Σ_{a,b,c} P(a) P(c|a) P(b|a) P(d|b) P(f|c)
AND/OR Importance Sampling (General Idea)

• Decompose the expectation. With pseudo-tree A → {B, C} and proposal
Q(A, B, C) = Q(A) Q(B|A) Q(C|A):

P(d, f) = Σ_{a,b,c} P(a) P(c|a) P(b|a) P(d|b) P(f|c)
        = Σ_{a,b,c} [ P(a) P(c|a) P(b|a) P(d|b) P(f|c) / (Q(a) Q(b|a) Q(c|a)) ] Q(a) Q(b|a) Q(c|a)
        = E_Q[ P(a) P(c|a) P(b|a) P(d|b) P(f|c) / (Q(a) Q(b|a) Q(c|a)) ]

Gogate and Dechter, UAI 2008, CP 2008
AND/OR Importance Sampling (General Idea)

• Decompose the expectation further, pushing the sums inside:

P(d, f) = Σ_a [P(a)/Q(a)] Q(a) · [ Σ_b (P(b|a) P(d|b) / Q(b|a)) Q(b|a) ] · [ Σ_c (P(c|a) P(f|c) / Q(c|a)) Q(c|a) ]
        = E_Q[ P(a)/Q(a) · E_Q[ P(b|a) P(d|b) / Q(b|a) | a ] · E_Q[ P(c|a) P(f|c) / Q(c|a) | a ] ]
AND/OR Importance Sampling (General Idea)

• Compute all the expectations separately. How?
  – Record all samples.
  – For each sample that has A = a, estimate the conditional expectations
    E_Q[ P(b|a) P(d|b)/Q(b|a) | a ]  and  E_Q[ P(c|a) P(f|c)/Q(c|a) | a ]
    separately from the generated samples.
  – Combine the results.
AND/OR Importance Sampling

Sample #   A   B   C
   1       0   1   0
   2       0   2   1
   3       1   1   1
   4       1   2   0

[Figure: the corresponding AND/OR sample tree rooted at A, with the sampled B and C values hanging below each value of A.]

Estimate of E_Q[ P(b|A=0) P(d|b)/Q(b|A=0) | A=0 ]: the average weight of the samples of B having A = 0, i.e. (1/2) [ w(B=1, A=0) + w(B=2, A=0) ].
AND/OR Importance Sampling

• At every AND node: separate (independent) components below it; operator: product.
• At every OR node: conditional expectation given the assignment above it; operator: weighted average.

[Figure: the AND/OR sample tree for the four samples above, with an average taken at each OR node and a product at each AND node.]

Gogate and Dechter, UAI 2008, CP 2008
Algorithm AND/OR Importance Sampling

1. Construct a pseudo-tree.
2. Construct a proposal distribution Q along the pseudo-tree.
3. Generate samples x^1, ..., x^N from Q along the ordering O.
4. Build an AND/OR sample tree for the samples x^1, ..., x^N along the ordering O.
5. FOR all leaf nodes i of the AND/OR tree: IF AND-node, v(i) = 1, ELSE v(i) = 0.
6. FOR every node n from the leaves to the root:
   IF AND-node, v(n) = product of its children's values;
   IF OR-node, v(n) = average of its children's values (weighted by how often each child was sampled).
7. Return v(root-node).

Gogate and Dechter, UAI 2008, CP 2008
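A minimal Python sketch of the value computation (steps 5-7) on an AND/OR sample tree, using an illustrative node representation rather than the authors' data structures: an OR node is {'or': [AND children]}, and an AND node carries its local importance weight and the number of samples that reached it.

```python
def value_and(node):
    """AND node: product of its local weight and its (independent) OR children."""
    v = node['w']                      # local weight P(.)/Q(.) for this value assignment
    for child_or in node['and']:       # independent components below an AND node
        v *= value_or(child_or)
    return v

def value_or(node):
    """OR node: average of its AND children, weighted by how often each was sampled."""
    children = node['or']
    total = sum(c['count'] for c in children)
    return sum(c['count'] * value_and(c) for c in children) / total
```

Calling value_or on the root of the AND/OR sample tree returns the AND/OR importance sampling estimate.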
# Samples in AND/OR vs. Conventional

Sample #   A   B   C
   1       0   1   0
   2       0   2   1
   3       1   1   1
   4       1   2   0

• 8 samples in the AND/OR space versus 4 samples in conventional importance sampling.
• Example: A=0, B=2, C=0 is never generated, but it is still accounted for in the AND/OR space.

Gogate and Dechter, UAI 2008, CP 2008
Why AND/OR Importance Sampling?

• AND/OR estimates have smaller variance.
• Variance reduction:
  – Easy to prove for the case of complete independence (Goodman, 1960)
  – More complicated to prove for the general conditional independence case (see Vibhav Gogate's thesis)

For two independent quantities x and y estimated from N samples:
V[avg(x·y)]       = V[x]E[y]²/N + V[y]E[x]²/N + V[x]V[y]/N        (not independent estimation)
V[avg(x)·avg(y)]  = V[x]E[y]²/N + V[y]E[x]²/N + V[x]V[y]/N²       (independent estimation)
Note the squared term in the last denominator.

Gogate and Dechter, UAI 2008, CP 2008
AND/OR Graph Sampling

[Figure: a problem over A, B, C, D with domains {0, 1, 2} and inequality constraints; merging identical subtrees turns the AND/OR sample tree (8 samples) into an AND/OR sample graph (12 samples).]
Combining AND/OR sampling and w-cutset sampling
• Reduce the variance of the weights
  – Rao-Blackwellised w-cutset sampling (Bidyuk and Dechter, 2007)
• Increase the (effective) number of samples
  – AND/OR tree and graph sampling (Gogate and Dechter, 2008)
• Combine the two

Var_Q[P̂(e)] = Var_Q[(1/N) Σ_{i=1}^{N} w(z^i)] = Var_Q[w(z)] / N
Algorithm AND/OR w-cutset sampling

Given an integer constant w:
1. Partition the set of variables into K and R, such that the treewidth of R is bounded by w.
2. AND/OR sampling on K:
   2.1 Construct a pseudo-tree of K and compute Q(K) consistent with it.
   2.2 Generate samples from Q(K) and store them on an AND/OR tree.
3. Rao-Blackwellisation (exact inference) at each leaf:
   3.1 For each leaf node of the tree compute Z(R | g), where g is the assignment from the leaf to the root.
4. Value computation, recursively from the leaves to the root:
   4.1 At each AND node compute the product of the values of its children.
   4.2 At each OR node compute a weighted average over the values of its children.
5. Return the value of the root node.
AND/OR w-cutset sampling. Step 1: Partition the set of variables

[Figure: a graphical model over A, B, C, D, E, F, G; the cutset is K = {A, B, C} and the remainder is R = {D, E, F, G}.]

Practical constraint: we can only perform exact inference if the treewidth of R is bounded (here, bounded by 1).
AND/OR w-cutset sampling. Step 2: AND/OR sampling over {A, B, C}

Pseudo-tree over the cutset: C is the root with children A and B.

Samples: (C=0, A=0, B=1), (C=0, A=1, B=1), (C=1, A=0, B=0), (C=1, A=1, B=0)
[Figure: the AND/OR sample tree over C (OR node), its values 0 and 1 (AND nodes), and below each, OR nodes for A and B with the sampled values.]
AND/OR w-cutset sampling. Step 3: Exact inference at each leaf

Value of B = 1 given C = 0:
v(B=1 | C=0) = Σ_{g ∈ G} Σ_{f ∈ F} H(C=0, g) H(B=1, g) H(g, f) H(B=1, f)
AND/OR w-cutset sampling. Step 4: Value computation

[Figure: the same AND/OR sample tree, with values propagated from the leaves to the root.]

Value of B = 1 given C = 0: v(B=1 | C=0), as computed in Step 3.
Value of B given C = 0: the estimate Ê[B, F, G | C=0] (a weighted average over the sampled values of B).
Value of A given C = 0: the estimate Ê[A, D, E | C=0].
Value of C: the estimate of the partition function.
Properties and Improvements
• Basic underlying scheme for sampling remains the same
– The only thing that changes is what you estimate from the samples
– Can be combined with any state-of-the-art importance sampling technique
• Graph vs Tree sampling
– Take full advantage of the conditional independence properties uncovered from the primal graph
178
AND/OR w-cutset sampling: Advantages and Disadvantages

• Advantages
  – Variance reduction
  – Relatively fewer calls to the Rao-Blackwellisation step due to efficient caching (lazy Rao-Blackwellisation)
  – Dynamic Rao-Blackwellisation when context-specific or logical dependencies are present
    • Particularly suitable for Markov logic networks (Richardson and Domingos, 2006)
• Disadvantages
  – Increases time and space complexity, and therefore fewer samples may be generated
179
Take-away Figure: Variance Hierarchy and Complexity

[Figure: a hierarchy of schemes ordered by decreasing variance: IS, w-cutset IS, AND/OR Tree IS, AND/OR w-cutset Tree IS, AND/OR Graph IS, AND/OR w-cutset Graph IS. Annotated time/space costs range from O(nN) time and O(1) space for plain IS up to O(nNt·exp(w)) time and O(nN + n·exp(w)) space for AND/OR w-cutset graph IS.]
Experiments
• Benchmarks
– Linkage analysis
– Graph coloring
• Algorithms
– OR tree sampling
– AND/OR tree sampling
– AND/OR graph sampling
– w-cutset versions of the three schemes above
Results: Probability of Evidence (Linkage instances, UAI 2006 evaluation). Time bound: 1 hr.
Results: Probability of Evidence (Linkage instances, UAI 2008 evaluation). Time bound: 1 hr.
Results: Solution Counting (Graph coloring instances). Time bound: 1 hr.
Summary: AND/OR Importance sampling
• AND/OR sampling: A general scheme to exploit conditional independence in sampling
• Theoretical guarantees: lower sampling error than conventional sampling
• Variance reduction orthogonal to Rao-Blackwellised sampling.
• Better empirical performance than conventional sampling.
187