Sampling Techniques for Probabilistic and Deterministic Graphical models
Bozhena Bidyuk
Vibhav Gogate
Rina Dechter
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Cutset-based Variance Reduction
6. AND/OR importance sampling
Probabilistic Reasoning / Graphical models
• Graphical models:
– Bayesian network, constraint networks, mixed network
• Queries
• Exact algorithms:
– inference
– search and hybrids
• Graph parameters:
– tree-width, cycle-cutset, w-cutset
Bayesian Networks (Pearl, 1988)
P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

Nodes: Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), Dyspnoea (D)
CPTs: P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)

Example CPT for P(D|C,B):
C B | D=0  D=1
0 0 | 0.1  0.9
0 1 | 0.7  0.3
1 0 | 0.8  0.2
1 1 | 0.9  0.1

In general, P(X) = ∏_{i=1}^{n} P(X_i | pa(X_i)); the CPTs P(X_i | pa(X_i)) are the network parameters.

Belief Updating: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
Probability of evidence: P(smoking = no, dyspnoea = yes) = ?
Queries
• Probability of evidence (or partition function):
P(e) = Σ_{X \ E} ∏_{i=1}^{n} P(x_i | pa_i)|_{E=e}        (for Markov networks, Z = Σ_X ∏_i C_i)
• Posterior marginal (beliefs):
P(x_i | e) = P(x_i, e) / P(e) = [ Σ_{X \ {X_i, E}} ∏_j P(x_j | pa_j)|_{E=e} ] / [ Σ_{X \ E} ∏_j P(x_j | pa_j)|_{E=e} ]
• Most Probable Explanation:
x* = argmax_x P(x, e)
Constraint Networks

Example: map coloring
Variables: countries (A, B, C, etc.)
Values: colors (red, green, blue)
Constraints: A ≠ B, A ≠ D, D ≠ E, etc.

A possible constraint relation between A and B:
A      B
red    green
red    yellow
green  red
green  yellow
yellow green
yellow red

Constraint graph: nodes A, B, C, D, E, F, G connected by the binary constraints.

Task: find a solution; count solutions; find a good one
Propositional Satisfiability
φ = {(¬C), (A ∨ B ∨ C), (¬A ∨ B ∨ E), (¬B ∨ C ∨ D)}
Mixed Networks: Mixing Belief and Constraints

Belief (Bayesian) network:
Variables: A, B, C, D, E, F
Domains: {0, 1}
CPTs: P(A), P(B|A), P(C|A), P(D|B,C), P(E|A,B), P(F|A)

Constraint network:
Variables: A, B, C, D, E, F
Domains: {0, 1}
Constraints: R1(A,B,C), R2(A,C,F), R3(B,C,D), R4(A,E)
Expresses the set of solutions sol(R).

Example CPT P(D|B,C) containing zeros:
B C | D=0  D=1
0 0 | 0    1
0 1 | .1   .9
1 0 | .3   .7
1 1 | 1    0

B=0, C=0, D=0 is not allowed; B=1, C=1, D=1 is not allowed, which is equivalent to a constraint R_BCD(B,C,D).

Constraints could be specified externally or may occur as zeros in the belief network.

Same queries (e.g., weighted counts):
M = Σ_{x ∈ sol(R)} P_B(x)
Belief Updating by Variable Elimination

"Moral" graph over A, B, C, D, E.

P(a | e=0) ∝ P(a, e=0) = Σ_{b,c,d,e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
= P(a) Σ_c P(c|a) Σ_d Σ_b P(b|a) P(d|b,a) P(e=0|b,c)
= P(a) Σ_c P(c|a) Σ_d h_B(a, d, c, e),  where h_B(a, d, c, e) = Σ_b P(b|a) P(d|b,a) P(e|b,c)
Bucket Elimination: algorithm elim-bel (Dechter 1996)

Ordering (buckets processed from B down to A):
bucket B:  P(b|a), P(d|b,a), P(e|b,c)   → eliminate B:  h_B(a, d, c, e)
bucket C:  P(c|a), h_B(a, d, c, e)      → h_C(a, d, e)
bucket D:  h_C(a, d, e)                 → h_D(a, e)
bucket E:  e = 0, h_D(a, e)             → h_E(a)
bucket A:  P(a), h_E(a)                 → P(a | e=0)

Induced width w* = 4; time and space exp(w*).
Bucket Elimination

Query: P(a | e=0) ∝ P(a, e=0).  Elimination order: d, e, b, c.

P(a, e=0) = Σ_{b,c,d,e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
= P(a) Σ_c P(c|a) Σ_b P(b|a) Σ_{e=0} P(e|b,c) Σ_d P(d|b,a)

bucket D:  P(d|b,a)                     → f_D(a, b) = Σ_d P(d|b,a)
bucket E:  P(e|b,c)                     → f_E(b, c) = P(e=0|b,c)
bucket B:  P(b|a), f_D(a,b), f_E(b,c)   → f_B(a, c) = Σ_b P(b|a) f_D(a,b) f_E(b,c)
bucket C:  P(c|a), f_B(a,c)             → f_C(a) = Σ_c P(c|a) f_B(a,c)
bucket A:  P(a), f_C(a)                 → P(a, e=0) = P(a) f_C(a)

Bucket tree (original functions and messages): clusters {D,A,B}, {E,B,C}, {B,A,C}, {C,A}, {A} with messages f_D(a,b), f_E(b,c), f_B(a,c), f_C(a).

Time and space exp(w*).
Complexity of Elimination

O(n exp(w*(d))),  where w*(d) is the induced width of the moral graph along ordering d.

The effect of the ordering (on the example "moral" graph over A, B, C, D, E):
w*(d1) = 4 for ordering d1 = (A, E, D, C, B);  w*(d2) = 2 for ordering d2 = (A, B, C, D, E).
Cutset-Conditioning
[Figure: a graphical model over variables A through P; conditioning on A, then B, then C removes all of the cycles.]

Cycle cutset = {A, B, C}
Search Over the Cutset

• Inference may require too much memory
• Condition on some of the variables

[Figure: a graph coloring problem; the search tree branches on cutset assignments (A = yellow/green, B = red/blue/green/yellow), leaving a tree-structured subproblem over the remaining variables C, D, ..., M for each cutset assignment.]

Space: exp(w), where w is a user-controlled parameter.  Time: exp(w + c(w)).
Linkage Analysis

[Figure: a pedigree of 6 individuals with genotypes such as A|a and B|b; haplotypes are known for individuals 2 and 3, the genotype is known for individual 6, and the rest are unknown.]

[Figure: the corresponding Bayesian network for 6 people and 3 markers, with locus variables L_ijm/L_ijf, phenotype variables X_ij, and selector variables S_ijm for each person i and marker j.]

Linkage Analysis: 6 People, 3 Markers
Applications

• Determinism: more ubiquitous than you may think!
• Transportation Planning (Liao et al. 2004, Gogate et al. 2005)
  – Predicting and inferring car travel activity of individuals
• Genetic Linkage Analysis (Fishelson and Geiger, 2002)
  – Associating the functionality of genes with their location on chromosomes
• Functional/Software Verification (Bergeron, 2000)
  – Generating random test programs to check the validity of hardware
• First Order Probabilistic Models (Domingos et al. 2006, Milch et al. 2005)
  – Citation matching
Inference vs. Conditioning-Search

Inference: exp(w*) time and space. Example: a tree decomposition of the graph over A-M with clusters ABC, BDEF, DGF, EFH, FHK, HJ, KLM.

Search: exp(n) time, O(n) space. Example: a full 0/1 search tree over the variables A, B, C, D, E, F.

Search + inference (cutset conditioning): branch on the cutset (e.g., A and B in the graph coloring example), then solve each remaining tree-structured subproblem exactly.
Space: exp(w).  Time: exp(w + c(w)).
Approximation

• Inference, search, and hybrids are too expensive when the graph is dense (high treewidth), so we turn to:
• Bounding inference:
  – mini-bucket and mini-clustering
  – belief propagation
• Bounding search:
  – sampling
• Goal: an anytime scheme
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Outline
• Definitions and Background on Statistics
• Theory of importance sampling
• Likelihood weighting
• State-of-the-art importance sampling techniques
23
A sample
• Given a set of variables X = {X1, ..., Xn}, a sample, denoted S^t, is an instantiation of all the variables:
S^t = (x_1^t, x_2^t, ..., x_n^t)
How to draw a sample? Univariate distribution

• Example: Given a random variable X with domain {0, 1} and distribution P(X) = (0.3, 0.7).
• Task: Generate samples of X from P.
• How?
  – draw a random number r ∈ [0, 1]
  – if r < 0.3 then set X = 0
  – else set X = 1
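As a concrete illustration, here is a minimal Python sketch of this inverse-CDF draw for a binary variable; the function name and the sample count are illustrative, not part of the slides.

```python
import random

def sample_x(p0=0.3):
    """Draw one value of X from P(X) = (p0, 1 - p0)."""
    r = random.random()          # r uniform in [0, 1)
    return 0 if r < p0 else 1    # X = 0 with probability p0, else X = 1

samples = [sample_x() for _ in range(10000)]
print(sum(s == 0 for s in samples) / len(samples))  # close to 0.3
```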
How to draw a sample? Multivariate distribution

• Let X = {X1, ..., Xn} be a set of variables
• Express the distribution in product form:
P(X) = P(X1) P(X2 | X1) ... P(Xn | X1, ..., Xn-1)
• Sample the variables one by one, from left to right, along the ordering dictated by the product form.
• In the Bayesian network literature this is called logic sampling.
Logic sampling (example)

Network: X1 → X2, X1 → X3, (X2, X3) → X4, with
P(X1, X2, X3, X4) = P(X1) P(X2|X1) P(X3|X1) P(X4|X2, X3)

No evidence. To generate sample k:
1. Sample x1 from P(X1)
2. Sample x2 from P(X2 | X1 = x1)
3. Sample x3 from P(X3 | X1 = x1)
4. Sample x4 from P(X4 | X2 = x2, X3 = x3)
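A minimal Python sketch of logic (forward) sampling for this four-variable network follows; the CPT numbers are made up for illustration and are not from the slides.

```python
import random

def draw(p1):
    """Sample a binary variable with P(X = 1) = p1."""
    return 1 if random.random() < p1 else 0

def logic_sample():
    x1 = draw(0.6)                                  # from P(X1)
    x2 = draw(0.7 if x1 else 0.2)                   # from P(X2 | x1)
    x3 = draw(0.9 if x1 else 0.4)                   # from P(X3 | x1)
    x4 = draw([0.1, 0.5, 0.6, 0.95][2 * x2 + x3])   # from P(X4 | x2, x3)
    return (x1, x2, x3, x4)

samples = [logic_sample() for _ in range(1000)]
```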
Expected value and Variance

Expected value: Given a probability distribution P(X) and a function g(X) defined over a set of variables X = {X1, X2, ..., Xn}, the expected value of g w.r.t. P is
E_P[g(x)] = Σ_x g(x) P(x)

Variance: The variance of g w.r.t. P is
Var_P[g(x)] = Σ_x (g(x) - E_P[g(x)])² P(x)
Monte Carlo Estimate
• Estimator:
– An estimator is a function of the samples.
– It produces an estimate of the unknown parameter of the sampling distribution.
Given i.i.d. samples S^1, S^2, ..., S^T drawn from P, the Monte Carlo estimate of E_P[g(x)] is given by:
ĝ = (1/T) Σ_{t=1}^{T} g(S^t)
Example: Monte Carlo estimate

• Given:
  – A distribution P(X) = (0.3, 0.7)
  – g(X) = 40 if X equals 0, 50 if X equals 1
• Exact value: E_P[g(x)] = 40×0.3 + 50×0.7 = 47
• Generate k = 10 samples from P: 0, 1, 1, 1, 0, 1, 1, 0, 1, 0

ĝ = (40 · #samples(X=0) + 50 · #samples(X=1)) / #samples = (40·4 + 50·6) / 10 = 46
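The same computation as a short Python sketch, assuming the distribution and payoffs above:

```python
import random

def sample_x():
    return 0 if random.random() < 0.3 else 1

g = {0: 40, 1: 50}
T = 100_000
estimate = sum(g[sample_x()] for _ in range(T)) / T
print(estimate)   # close to the exact value 47 for large T
```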
Outline
• Definitions and Background on Statistics
• Theory of importance sampling
• Likelihood weighting
• State-of-the-art importance sampling techniques
31
Importance sampling: Main idea
• Transform the probabilistic inference problem into the problem of computing the expected value of a random variable w.r.t. a distribution Q.
• Generate random samples from Q.
• Estimate the expected value from the generated samples.
32
Importance sampling for P(e)
Let Z = X \ E. Let Q(Z) be a (proposal) distribution satisfying P(z, e) > 0 ⇒ Q(z) > 0.

Then we can rewrite P(e) as:
P(e) = Σ_z P(z, e) = Σ_z [P(z, e) / Q(z)] Q(z) = E_Q[w(z)],  where w(z) = P(z, e) / Q(z)

Monte Carlo estimate:
P̂(e) = (1/T) Σ_{t=1}^{T} w(z^t),  z^t ~ Q(Z)
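A minimal Python sketch of this estimator follows; sample_q and joint are hypothetical placeholders that the user must supply for a concrete model.

```python
def importance_sampling_estimate(sample_q, joint, e, T=10000):
    """Importance sampling estimate of P(e).

    sample_q() draws z ~ Q and returns (z, Q(z)); joint(z, e) returns P(z, e).
    Q must be positive wherever P(z, e) > 0.
    """
    total = 0.0
    for _ in range(T):
        z, qz = sample_q()            # z ~ Q(Z)
        total += joint(z, e) / qz     # weight w(z) = P(z, e) / Q(z)
    return total / T                  # unbiased estimate of P(e)
```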
Properties of IS estimate of P(e)
• Convergence: by the law of large numbers,
P̂(e) = (1/T) Σ_{i=1}^{T} w(z^i) → P(e)  a.s. as T → ∞
• Unbiased:  E_Q[P̂(e)] = P(e)
• Variance:
Var_Q[P̂(e)] = Var_Q[(1/T) Σ_i w(z^i)] = Var_Q[w(z)] / T
Properties of IS estimate of P(e)
• Mean Squared Error of the estimator:
MSE_Q[P̂(e)] = E_Q[(P̂(e) - P(e))²]
            = Var_Q[P̂(e)] + (E_Q[P̂(e)] - P(e))²
            = Var_Q[P̂(e)] = Var_Q[w(z)] / T
The squared bias term is zero because the estimator is unbiased, i.e. its expected value equals P(e).
Estimating P(Xi|e)
Let δ_{x_i}(z) be a Dirac-delta function, which is 1 if z contains x_i and 0 otherwise.

P(x_i | e) = P(x_i, e) / P(e) = Σ_z δ_{x_i}(z) P(z, e) / Σ_z P(z, e)
           = E_Q[δ_{x_i}(z) P(z, e) / Q(z)] / E_Q[P(z, e) / Q(z)]

Idea: estimate the numerator and the denominator by importance sampling.

Ratio estimate:
P̂(x_i | e) = P̂(x_i, e) / P̂(e) = Σ_{k=1}^{T} δ_{x_i}(z^k) w(z^k) / Σ_{k=1}^{T} w(z^k)

The estimate is biased: E[P̂(x_i | e)] ≠ P(x_i | e).
Properties of the IS estimator for P(Xi|e)
• Convergence: By Weak law of large numbers
• Asymptotically unbiased
• Variance
– Harder to analyze
– Liu suggests a measure called “Effective sample size”
P̂(x_i | e) → P(x_i | e)  as T → ∞
lim_{T→∞} E[P̂(x_i | e)] = P(x_i | e)
Effective Sample size
Define:  ESS(Q, T) = T / (1 + Var_Q[w(z)])

Given samples drawn from P(z|e), we could estimate P(x_i|e) using the ideal estimator
P̂_P(x_i | e) = (1/T) Σ_{t=1}^{T} g(z^t),  with g(z) = δ_{x_i}(z)

Approximately,
Var[P̂_Q(x_i | e)] ≈ Var[P̂_P(x_i | e)] · T / ESS(Q, T)

Thus T samples from P are worth ESS(Q, T) samples from Q; the effective sample size measures how much the estimator deviates from the ideal one. Therefore, the variance of the weights must be as small as possible.
Outline
• Definitions and Background on Statistics
• Theory of importance sampling
• Likelihood weighting
• State-of-the-art importance sampling techniques
39
Likelihood Weighting: Proposal Distribution
Proposal:  Q(X \ E) = ∏_{X_i ∈ X\E} P(X_i | pa_i, E = e)

Given a sample x = (x_1, ..., x_n), the weight is
w(x) = P(x, e) / Q(x)
     = [ ∏_{X_i ∈ X\E} P(x_i | pa_i, e) · ∏_{E_j ∈ E} P(e_j | pa_j) ] / ∏_{X_i ∈ X\E} P(x_i | pa_i, e)
     = ∏_{E_j ∈ E} P(e_j | pa_j)

Example: Given a Bayesian network P(X1, X2, X3) = P(X1) P(X2|X1) P(X3|X1, X2) with evidence X2 = x2, the proposal is Q(X1, X3) = P(X1) P(X3|X1, x2).
Likelihood Weighting: Sampling
[Figure: a chain network in which evidence nodes are interleaved with sampled nodes.]

Sample in topological order over X!
Clamp the evidence; sample x_i ~ P(X_i | pa_i), which is a look-up in the CPT.
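A minimal Python sketch of likelihood weighting for binary-variable networks, under an illustrative network representation (a topologically ordered list of (name, parents, cpt) triples); this is a sketch of the scheme above, not the authors' implementation.

```python
import random

def likelihood_weighting(bn, evidence, T=10000):
    """Estimate P(e) by likelihood weighting.

    bn: list of (name, parents, cpt) in topological order, where cpt maps a
    tuple of parent values to P(X = 1 | parents); all variables are binary.
    """
    total = 0.0
    for _ in range(T):
        sample, w = {}, 1.0
        for name, parents, cpt in bn:              # sample in topological order
            p1 = cpt[tuple(sample[p] for p in parents)]
            if name in evidence:                   # clamp evidence, multiply its likelihood
                sample[name] = evidence[name]
                w *= p1 if evidence[name] == 1 else 1.0 - p1
            else:                                  # otherwise a CPT look-up and a draw
                sample[name] = 1 if random.random() < p1 else 0
        total += w
    return total / T                               # estimate of P(e)

# Example network A -> B with illustrative numbers:
bn = [("A", (), {(): 0.3}), ("B", ("A",), {(0,): 0.2, (1,): 0.8})]
print(likelihood_weighting(bn, {"B": 1}))          # close to P(B=1) = 0.38
```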
Outline
• Definitions and Background on Statistics
• Theory of importance sampling
• Likelihood weighting
• State-of-the-art importance sampling techniques
42
Proposal selection
• One should try to select a proposal that is as close as possible to the posterior distribution.
Recall:  Var_Q[P̂(e)] = Var_Q[w(z)] / T,  with w(z) = P(z, e) / Q(z).

To have a zero-variance estimator we would need Var_Q[w(z)] = 0, i.e.
Q(z) = P(z, e) / P(e) = P(z | e)

so the ideal proposal is the posterior distribution itself.
Proposal Distributions used in Literature
• AIS-BN (adaptive proposal) – Cheng and Druzdzel, 2000
• Iterative Belief Propagation – Changhe and Druzdzel, 2003
• Iterative Join Graph Propagation (IJGP) and variable ordering – Gogate and Dechter, 2005
Perfect sampling using Bucket Elimination
• Algorithm:
– Run Bucket elimination on the problem along an ordering o=(XN,..,X1).
– Sample along the reverse ordering: (X1,..,XN)
– At each variable Xi, recover the probability P(Xi|x1,...,xi-1) by referring to the bucket.
Bucket elimination (BE): algorithm elim-bel (Dechter 1996), computing P(e)

bucket B:  P(B|A), P(D|B,A), P(e|B,C)  → h_B(A, D, C, e)
bucket C:  P(C|A), h_B(A, D, C, e)     → h_C(A, D, e)
bucket D:  h_C(A, D, e)                → h_D(A, e)
bucket E:  h_D(A, e)                   → h_E(A)
bucket A:  P(A), h_E(A)                → P(e)
Sampling from the output of BE (Dechter 2002)

bucket A:  Q(A) ∝ P(A) · h_E(A);  sample A = a
Evidence bucket E:  ignore
bucket D:  set A = a in the bucket;  sample D = d from Q(D | a, e) ∝ h_C(a, D, e)
bucket C:  set A = a, D = d in the bucket;  sample C = c from Q(C | a, d, e) ∝ P(C|a) · h_B(a, d, C, e)
bucket B:  set A = a, D = d, C = c in the bucket;  sample B = b from Q(B | a, c, d, e) ∝ P(B|a) P(d|B,a) P(e|B,c)
Mini-buckets: “local inference”
• Computation in a bucket is time and space exponential in the number of variables involved
• Therefore, partition the functions in a bucket into “mini-buckets” over smaller numbers of variables
• Can control the size of each “mini-bucket”, yielding polynomial complexity.
Mini-Bucket Elimination

Space and time constraint: the maximum scope size of any new function generated must be bounded by 2. BE would generate a function of scope size 3, so it cannot be used.

bucket B is split into two mini-buckets:
  mini-bucket 1:  P(B|A), P(D|B,A)  → Σ_B gives h_B(A, D)
  mini-bucket 2:  P(e|B,C)          → Σ_B gives h_B(C, e)
bucket C:  P(C|A), h_B(C, e)        → h_C(A, e)
bucket D:  h_B(A, D)                → h_D(A)
bucket E:  h_C(A, e)                → h_E(A)
bucket A:  P(A), h_D(A), h_E(A)     → approximation of P(e)
Sampling from the output of MBE

bucket B:  P(B|A), P(D|B,A)  |  P(e|B,C)
bucket C:  P(C|A), h_B(C, e)
bucket D:  h_B(A, D)
bucket E:  h_C(A, e)
bucket A:  P(A), h_D(A), h_E(A)

Sampling is the same as in BE-sampling except that Q is now constructed from a randomly selected “mini-bucket”.
IJGP-Sampling (Gogate and Dechter, 2005)
• Iterative Join Graph Propagation (IJGP)
– A Generalized Belief Propagation scheme (Yedidiaet al., 2002)
• IJGP yields better approximations of P(X|E) than MBE
– (Dechter, Kask and Mateescu, 2002)
• Output of IJGP is same as mini-bucket “clusters”
• Currently the best performing IS scheme!
Adaptive Importance Sampling

Initial proposal:  Q¹(Z) = Q¹(Z_1) Q¹(Z_2 | pa(Z_2)) ... Q¹(Z_n | pa(Z_n))
P̂(E = e) ← 0
For i = 1 to k do:
    Generate samples z^1, ..., z^N from Q^i
    P̂(E = e) ← P̂(E = e) + (1/N) Σ_{j=1}^{N} w_i(z^j)
    Update Q^{i+1} from Q^i and the weighted samples
Return P̂(E = e) / k
• General case
• Given k proposal distributions
• Take N samples out of each distribution
• Approximate P(e)
P̂(e) = (1/k) Σ_{j=1}^{k} (average weight under the j-th proposal)
Estimating Q'(z)
Q'(Z) = Q'(Z_1) Q'(Z_2 | pa(Z_2)) ... Q'(Z_n | pa(Z_n))
where each Q'(Z_i | Z_1, ..., Z_{i-1}) is estimated by importance sampling.
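A minimal Python sketch of this adaptive loop, with sample_q, weight, and update_q as hypothetical placeholders for the model-specific pieces:

```python
def adaptive_is(sample_q, weight, update_q, q0, k=10, N=1000):
    """Adaptive importance sampling estimate of P(e).

    sample_q(q) draws z from the current proposal q; weight(z, q) returns
    P(z, e)/q(z); update_q(q, batch, weights) fits the next proposal.
    """
    q, estimate = q0, 0.0
    for _ in range(k):
        batch = [sample_q(q) for _ in range(N)]        # z^1 ... z^N ~ Q^i
        weights = [weight(z, q) for z in batch]        # importance weights
        estimate += sum(weights) / N                   # per-round estimate of P(e)
        q = update_q(q, batch, weights)                # Q^{i+1} from the weighted samples
    return estimate / k                                # average over the k proposals
```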
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Markov Chain
• A Markov chain is a discrete random process with the property that the next state depends only on the
current state (Markov Property):
[Figure: a chain x1 → x2 → x3 → x4]

P(x_t | x_{t-1}, ..., x_1) = P(x_t | x_{t-1})

• If P(X_t | x_{t-1}) does not depend on t (time-homogeneous) and the state space is finite, then it is often expressed as a transition function (aka transition matrix) whose rows sum to 1: Σ_x P(X_t = x | x_{t-1}) = 1
Example: Drunkard’s Walk

• A random walk on the number line where, at each step, the position may change by +1 or −1 with equal probability.

D(X) = {0, ±1, ±2, ...}
Transition function: P(X_{t+1} = n+1 | X_t = n) = P(X_{t+1} = n−1 | X_t = n) = 0.5
Example: Weather Model
D(X) = {rainy, sunny}

Transition matrix P(X_{t+1} | X_t):
         rainy  sunny
rainy     0.9    0.1
sunny     0.5    0.5

Example state sequence: rain, rain, rain, rain, sun
Multi-Variable System
• state is an assignment of values to all the variables
X = {X1, X2, X3, ...},  D(X_i) discrete and finite
State at time t:  x^t = {x_1^t, x_2^t, ..., x_n^t}

[Figure: state variables x_1^t, x_2^t, x_3^t transition jointly to x_1^{t+1}, x_2^{t+1}, x_3^{t+1}.]
Bayesian Network System
• Bayesian Network is a representation of the joint probability distribution over 2 or more variables
X = {X1, X2, X3},  x^t = {x_1^t, x_2^t, x_3^t}

[Figure: a three-variable network X1 → X2, X1 → X3; the chain moves from state x^t to x^{t+1}.]
61
Stationary Distribution: Existence

• If the Markov chain is time-homogeneous, then the vector π(X) is a stationary distribution (aka invariant or equilibrium distribution, aka “fixed point”) if its entries sum up to 1 and satisfy:
• Finite state space Markov chain has a unique stationary distribution if and only if:– The chain is irreducible
– All of its states are positive recurrent
π(x_j) = Σ_{x_i ∈ D(X)} π(x_i) P(x_j | x_i)
62
Irreducible
• A state x is irreducible if under the transition rule one has nonzero probability of moving from x to any other state and then coming back in a finite number of steps
• If one state is irreducible, then all the states must be irreducible
(Liu, Ch. 12, pp. 249, Def. 12.1.1)
63
Recurrent
• A state x is recurrent if the chain returns to x
with probability 1
• Let M(x) be the expected number of steps to return to state x
• State x is positive recurrent if M(x) is finite. The recurrent states in a finite-state chain are positive recurrent.
Stationary Distribution Convergence

• Consider the infinite Markov chain: P^(n) = P(x_n | x_0) = π_0 P^n
• Initial state is not important in the limit: “The most useful feature of a ‘good’ Markov chain is its fast forgetfulness of its past…” (Liu, Ch. 12.1)
• If the chain is both irreducible and aperiodic, then:
π = lim_{n→∞} P^(n)
Aperiodic
• Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for all i, then the chain is aperiodic.
• Positive recurrent, aperiodic states are ergodic
65
Markov Chain Monte Carlo
• How do we estimate P(X), e.g., P(X|e) ?
66
• Generate samples that form a Markov chain with stationary distribution π = P(X|e)
• Estimate from the samples (observed states): the visited states x^0, ..., x^n can be viewed as “samples” from distribution π
π̂(x) = (1/T) Σ_{t=1}^{T} δ(x, x^t),    π(x) = lim_{T→∞} π̂(x)
MCMC Summary
• Convergence is guaranteed in the limit
• Samples are dependent, not i.i.d.
• Convergence (mixing rate) may be slow
• The stronger the correlation between states, the slower the convergence!
• Initial state is not important, but… typically, we throw away first K samples - “burn-in”
67
Gibbs Sampling (Geman&Geman,1984)
• Gibbs sampler is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables
• Sample new variable value one variable at a time from the variable’s conditional distribution:
• Samples form a Markov chain with stationary distribution P(X|e)
x_i^{t+1} ~ P(X_i | x_1^{t+1}, ..., x_{i-1}^{t+1}, x_{i+1}^t, ..., x_n^t) = P(X_i | x^t \ x_i)
Gibbs Sampling: Illustration
The process of Gibbs sampling can be understood as a random walk in the space of all instantiations of X=x (remember drunkard’s walk):
In one step we can reach instantiations that differ from current one by value assignment to at most one variable (assume randomized choice of variables Xi).
Ordered Gibbs Sampler
Generate sample xt+1 from xt :
X_1^{t+1} ~ P(X_1 | x_2^t, x_3^t, ..., x_N^t, e)
X_2^{t+1} ~ P(X_2 | x_1^{t+1}, x_3^t, ..., x_N^t, e)
X_3^{t+1} ~ P(X_3 | x_1^{t+1}, x_2^{t+1}, x_4^t, ..., x_N^t, e)
...
X_N^{t+1} ~ P(X_N | x_1^{t+1}, x_2^{t+1}, ..., x_{N-1}^{t+1}, e)

In short, for i = 1 to N (processing all variables in some order):
x_i^{t+1} is sampled from P(X_i | x^t \ x_i, e)
Transition Probabilities in a BN

Markov blanket:
markov_i = pa_i ∪ ch_i ∪ (∪_{X_j ∈ ch_i} pa_j)

P(x_i | x^t \ x_i) = P(x_i | markov_i^t):
P(x_i | markov_i) ∝ P(x_i | pa_i) ∏_{X_j ∈ ch_i} P(x_j | pa_j)

Given its Markov blanket (parents, children, and their parents), X_i is independent of all other nodes.
Computation is linear in the size of the Markov blanket!
Ordered Gibbs Sampling Algorithm(Pearl,1988)
Input: X, E=e
Output: T samples {xt }
Fix evidence E = e, initialize x^0 at random.
1. For t = 1 to T (compute samples)
2.   For i = 1 to N (loop through variables)
3.     x_i^{t+1} ← sampled from P(X_i | markov_i^t)
4.   End For
5. End For
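A minimal Python sketch of the algorithm above, assuming binary variables and a hypothetical markov_blanket_dist routine that returns P(X_i = 1 | markov_i) by multiplying the CPTs of X_i and its children and normalizing:

```python
import random

def gibbs_samples(bn, evidence, variables, T, markov_blanket_dist):
    """Ordered Gibbs sampling; returns T full samples (one per sweep)."""
    state = dict(evidence)
    for v in variables:                        # random initialization of non-evidence variables
        state.setdefault(v, random.choice([0, 1]))
    samples = []
    for _ in range(T):
        for v in variables:
            if v in evidence:
                continue
            p1 = markov_blanket_dist(bn, v, state)   # P(X_v = 1 | Markov blanket)
            state[v] = 1 if random.random() < p1 else 0
        samples.append(dict(state))            # one full sweep = one sample x^{t+1}
    return samples
```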
Gibbs Sampling Example - BN

X = {X1, X2, ..., X9},  E = {X9}

[Figure: a nine-variable network over X1, ..., X9 with X9 observed.]

Initialize: X1 = x1^0, X2 = x2^0, ..., X8 = x8^0

First sweep:
x1^1 ~ P(X1 | x2^0, ..., x8^0, x9)
x2^1 ~ P(X2 | x1^1, x3^0, ..., x8^0, x9)
...
Answering Queries: P(x_i | e) = ?

• Method 1: count the number of samples where X_i = x_i (histogram estimator):
P̂(X_i = x_i) = (1/T) Σ_{t=1}^{T} δ(x_i, x^t)        (δ is the Dirac delta function)

• Method 2: average the conditional probability (mixture estimator):
P̂(X_i = x_i) = (1/T) Σ_{t=1}^{T} P(x_i | markov_i^t)

• The mixture estimator converges faster (consider the estimates for the unobserved values of X_i; prove via the Rao-Blackwell theorem).
Rao-Blackwell Theorem

Rao-Blackwell Theorem: Let the random variable set X be composed of two groups of variables, R and L. Then, for the joint distribution π(R, L) and a function of interest g (e.g., the mean or covariance), the following holds (Casella & Robert, 1996; Liu et al., 1995):

Var[E{g(R) | L}] ≤ Var[g(R)]

• The theorem makes a weak promise, but works well in practice!
• The improvement depends on the choice of R and L.
Importance vs. Gibbs

Gibbs:
x^t ~ P̂(X | e),  where P̂(X | e) converges to P(X | e)
ĝ(X) = (1/T) Σ_{t=1}^{T} g(x^t)

Importance:
x^t ~ Q(X | e)
ĝ = (1/T) Σ_{t=1}^{T} g(x^t) P(x^t) / Q(x^t)        (the ratio is the weight w^t)
Gibbs Sampling: Convergence

• Sample from P̂(X|e), which converges to P(X|e)
• Converges iff the chain is irreducible and ergodic
• Intuition: the chain must be able to explore all states:
  – if X_i and X_j are strongly correlated, e.g. X_i = 0 ⇔ X_j = 0, then we cannot explore states with X_i = 1 and X_j = 1
• All conditions are satisfied when all probabilities are positive
• Convergence rate can be characterized by the second eigenvalue of the transition matrix
Gibbs: Speeding Convergence
Reduce dependence between samples (autocorrelation)
• Skip samples
• Randomize Variable Sampling Order
• Employ blocking (grouping)
• Multiple chains
Reduce variance (cover in the next section)
79
Blocking Gibbs Sampler
• Sample several variables together, as a block
• Example: Given three variables X,Y,Z, with domains of size 2, group Y and Z together to form a variable W={Y,Z} with domain size 4. Then, given sample (xt,yt,zt), compute next sample:
+ Can improve convergence greatly when two variables are strongly correlated!
- Domain of the block variable grows exponentially with the #variables in a block!
x^{t+1} ~ P(X | y^t, z^t)
(y^{t+1}, z^{t+1}) = w^{t+1} ~ P(Y, Z | x^{t+1})
Gibbs: Multiple Chains
• Generate M chains of size K
• Each chain produces independent estimate Pm:
P̂_m(x_i | e) = (1/K) Σ_{t=1}^{K} P(x_i | x^t \ x_i, e)

• Treat the P̂_m as independent random variables.
• Estimate P(x_i | e) as the average of the P̂_m(x_i | e):
P̂ = (1/M) Σ_{m=1}^{M} P̂_m
Gibbs Sampling Summary
• Markov Chain Monte Carlo method(Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994)
• Samples are dependent, form Markov Chain
• Sample from P̂(X|e), which converges to P(X|e)
• Guaranteed to converge when all P > 0
• Methods to improve convergence:
  – Blocking
  – Rao-Blackwellisation
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Outline
• Rejection problem
• Backtrack-free distribution
– Constructing it in practice
• SampleSearch
– Construct the backtrack-free distribution on the fly.
• Approximate estimators
• Experiments
Rejection problem
• Importance sampling requirement: P(z, e) > 0 ⇒ Q(z) > 0
• When P(z,e)=0 but Q(z) > 0, the weight of the sample is zero and it is rejected.
• The probability of generating a rejected sample should be very small.– Otherwise the estimate will be zero.
P̂(e) = (1/N) Σ_{i=1}^{N} P(z^i, e) / Q(z^i)
Rejection Problem
All blue leaves correspond to solutions, i.e. g(x) > 0; all red leaves correspond to non-solutions, i.e. g(x) = 0.

Importance sampling:
P̂(e) = (1/N) Σ_{i=1}^{N} P(z^i, e) / Q(z^i),  and P(z^i, e) = 0 if z^i is not a solution.

[Figure: a full search tree over A, B, C with branch probabilities under Q (e.g. Q(A=0) = 0.8, Q(A=1) = 0.2); Q is positive on every branch.]

Constraints: A ≠ B, A ≠ C.
If z violates either A ≠ B or A ≠ C then P(z, e) = 0.
Outline
• Rejection problem
• Backtrack-free distribution
– Constructing it in practice
• SampleSearch
– Construct the backtrack-free distribution on the fly.
• Approximate estimators
• Experiments
Backtrack-free distribution: a rejection-free distribution

All blue leaves correspond to solutions, i.e. g(x) > 0; all red leaves correspond to non-solutions, i.e. g(x) = 0.

[Figure: the same search tree over A, B, C; branches leading only to non-solutions get probability 0 and the remaining branches are renormalized.]

Q^F(branch) = 0 if there are no solutions under it
Q^F(branch) ∝ Q(branch) otherwise

Constraints: A ≠ B, A ≠ C.
Generating samples from Q^F

Q^F(branch) = 0 if there are no solutions under it
Q^F(branch) ∝ Q(branch) otherwise

• Invoke an oracle at each branch. The oracle returns:
  – True if there is a solution under the branch
  – False otherwise

[Figure: sampling walks down the tree, asking the oracle “solution?” below each candidate branch before committing to it.]

Constraints: A ≠ B, A ≠ C.
Generating samples from Q^F

• Oracles:
  – Adaptive consistency as a pre-processing step
    • Constant-time table look-up
    • Exponential in the treewidth of the constraint portion
  – A complete CSP solver
    • Needs to be run at each assignment

Constraints: A ≠ B, A ≠ C.

Gogate et al., UAI 2005; Gogate and Dechter, UAI 2005
Outline
• Rejection problem
• Backtrack-free distribution
– Constructing it in practice
• SampleSearch
– Construct the backtrack-free distribution on the fly.
• Approximate estimators
• Experiments
Algorithm SampleSearch

P̂(e) = (1/N) Σ_{j=1}^{N} P(z^j, e) / Q(z^j)

[Figure: the search tree over A, B, C with branch probabilities under Q; constraints A ≠ B, A ≠ C.]

Gogate and Dechter, AISTATS 2007, AAAI 2007
[Figures: SampleSearch in action. Variables are sampled from Q top-down; when an assignment violates a constraint (here A ≠ B or A ≠ C), that value is rejected, the search backtracks, and sampling resumes from the remaining values until P(sample, e) > 0.]
Generate more Samples

[Figure: the process is repeated from the root to generate additional samples; every completed sample is a solution.]
Traces of SampleSearch
[Figure: four different traces of SampleSearch on the same problem; each trace is the partial search tree actually explored while producing one sample.]

Constraints: A ≠ B, A ≠ C.
Gogate and Dechter, AISTATS 2007, AAAI 2007
SampleSearch: Sampling Distribution

• Problem: because of the search, the samples are no longer i.i.d. from Q:
P̂(e) = (1/N) Σ_{j=1}^{N} P(z^j, e) / Q(z^j),  but E_Q[P̂(e)] ≠ P(e)

• Theorem: SampleSearch generates i.i.d. samples from the backtrack-free distribution Q^F:
P̂(e) = (1/N) Σ_{j=1}^{N} P(z^j, e) / Q^F(z^j),  and E_{Q^F}[P̂(e)] = P(e)

Gogate and Dechter, AISTATS 2007, AAAI 2007
The sampling distribution Q^F of SampleSearch

[Figure: the explored tree, with dead branches assigned probability 0 and the surviving branches renormalized.]

What is the probability of generating A = 0?
  Q^F(A = 0) = 0.8.  Why? SampleSearch is systematic.
What is the probability of generating (A = 0, B = 1)?
  Q^F(B = 1 | A = 0) = 1.  Why? SampleSearch is systematic.
What is the probability of generating (A = 0, B = 0)?
  Simple: Q^F(B = 0 | A = 0) = 0. All samples generated by SampleSearch are solutions.

This is exactly the backtrack-free distribution.

Constraints: A ≠ B, A ≠ C.
Gogate and Dechter, AISTATS 2007, AAAI 2007
Outline
• Rejection problem
• Backtrack-free distribution
– Constructing it in practice
• SampleSearch
– Construct the backtrack-free distribution on the fly.
• Approximate estimators
• Experiments
Asymptotic approximations of Q^F

[Figure: a partially explored tree; for an unexplored sibling branch (a “hole”) we do not know whether it contains solutions.]

• IF there is a hole THEN
  – U^F = Q, i.e. assume there is a solution under the other branch
  – L^F = 0, i.e. assume there is no solution under the other branch

Gogate and Dechter, AISTATS 2007, AAAI 2007
Approximations: Convergence in the limit

• Store all possible traces.

[Figure: the traces of individual samples are merged into a combined sample tree; unexplored branches remain marked “?” (holes).]

Gogate and Dechter, AISTATS 2007, AAAI 2007
Approximations: Convergence in the limit

• From the combined sample tree, update U and L:
  IF there is a hole THEN U^F_N = Q and L^F_N = 0

Bounding: both approximations converge to the exact value,
lim_{N→∞} E[P(z, e) / U^F_N(z)] = P(e) = lim_{N→∞} E[P(z, e) / L^F_N(z)]
and for a finite number of samples they bound the Q^F-based sample mean from below and above:
P̂_L(e) ≤ P̂_{Q^F}(e) ≤ P̂_U(e)

Asymptotically unbiased.

Gogate and Dechter, AISTATS 2007, AAAI 2007
Upper and Lower Approximations
• Asymptotically unbiased.
• Upper and lower bound on the unbiased sample mean
• Linear time and space overhead
• Bias versus variance tradeoff
– Bias = difference between the upper and lower approximation.
Improving Naive SampleSearch

• Better search strategy
  – Can use any state-of-the-art CSP/SAT solver, e.g. minisat (Een and Sorrenson, 2006)
  – All theorems and results still hold
• Better importance function
  – Use the output of generalized belief propagation to compute the initial importance function Q (Gogate and Dechter, 2005)

Gogate and Dechter, AISTATS 2007, AAAI 2007
Experiments

• Tasks
  – Weighted counting
  – Marginals
• Benchmarks
  – Satisfiability problems (counting solutions)
  – Linkage networks
  – Relational instances (first-order probabilistic networks)
  – Grid networks
  – Logistics planning instances
• Algorithms
  – SampleSearch/UB, SampleSearch/LB
  – SampleCount (Gomes et al. 2007, SAT)
  – ApproxCount (Wei and Selman, 2007, SAT)
  – RELSAT (Bayardo and Pehoushek, 2000, SAT)
  – Edge Deletion Belief Propagation (Choi and Darwiche, 2006)
  – Iterative Join Graph Propagation (Dechter et al., 2002)
  – Variable Elimination and Conditioning (VEC)
  – EPIS (Changhe and Druzdzel, 2006)
Results: Solution Counts (Langford instances). Time bound: 10 hrs.
Results: Probability of Evidence (Linkage instances, UAI 2006 evaluation). Time bound: 3 hrs.
Results: Probability of Evidence (Linkage instances, UAI 2008 evaluation). Time bound: 3 hrs.
Results on Marginals

• Evaluation criterion: Hellinger distance
  – Always bounded between 0 and 1
  – Lower bounds the KL distance
  – When probabilities close to zero are present, the KL distance may tend to infinity

Hellinger distance = (1/n) Σ_{i=1}^{n} (1/2) Σ_{x_i ∈ D_i} ( √(Exact P(x_i)) − √(Approximate A(x_i)) )²
Results: Posterior Marginals (Linkage instances, UAI 2006 evaluation). Time bound: 3 hrs. Distance measure: Hellinger distance.
Summary: SampleSearch

• Manages the rejection problem while sampling
  – Systematic backtracking search
• The sampling distribution of SampleSearch is the backtrack-free distribution Q^F
  – Expensive to compute
• An approximation of Q^F based on storing all traces yields an asymptotically unbiased estimator
  – Linear time and space overhead
  – Bounds the sample mean from above and below
• Empirically, when a substantial number of zero probabilities are present, SampleSearch-based schemes dominate their pure sampling counterparts and generalized belief propagation.
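A minimal Python sketch of the SampleSearch idea (a naive backtracking version, not using a SAT solver): values whose subtrees are discovered to be dead are pruned and the variable is resampled from the surviving values, so every completed sample is a solution. The routine assumes at least one solution exists; consistent() and the proposal table q are placeholders.

```python
import random

def sample_search(domains, q, consistent):
    """Return one sample that satisfies all constraints.

    domains[i]: list of values for variable i; q[i][v]: proposal Q(X_i = v);
    consistent(partial): True iff the partial assignment violates no constraint.
    """
    n = len(domains)
    assign = []
    pruned = [set() for _ in range(n)]          # values proven dead at each depth
    while len(assign) < n:
        i = len(assign)
        allowed = [v for v in domains[i] if v not in pruned[i]]
        if not allowed:                          # dead end: backtrack (a solution is assumed to exist)
            pruned[i] = set()
            pruned[i - 1].add(assign.pop())
            continue
        total = sum(q[i][v] for v in allowed)
        r, acc, choice = random.random() * total, 0.0, allowed[-1]
        for v in allowed:                        # draw from Q restricted to surviving values
            acc += q[i][v]
            if r <= acc:
                choice = v
                break
        if consistent(assign + [choice]):        # keep the value if no constraint is violated
            assign.append(choice)
        else:
            pruned[i].add(choice)                # otherwise prune it and resample this variable
    return assign                                # a full sample that is guaranteed to be a solution
```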
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Sampling: Performance
• Gibbs sampling
– Reduce dependence between samples
• Importance sampling
– Reduce variance
• Achieve both by sampling a subset of variables and integrating out the rest (reduce dimensionality), aka Rao-Blackwellisation
• Exploit graph structure to manage the extra cost
119
Smaller Subset State-Space
• Smaller state-space is easier to cover
X = {X1, X2, X3, X4}  vs.  C = {X1, X2}
|D(X)| = 64  vs.  |D(C)| = 16
Smoother Distribution
[Figure: the joint distribution P(X1, X2, X3, X4) is spiky, while the marginal P(X1, X2) over the sampled subset is smoother; probabilities range roughly from 0 to 0.26.]
Speeding Up Convergence
• Mean Squared Error of the estimator:
MSE_Q[P̂] = Var_Q[P̂] + BIAS²,   where Var_Q[P̂] = E_Q[(P̂ − E_Q[P̂])²]
• In case of an unbiased estimator, BIAS = 0
• Reducing the variance speeds up convergence!
Rao-Blackwellisation

Let X = R ∪ L. Compare the plain estimator with the Rao-Blackwellised one (Liu, Ch. 2.3):

ĝ(x) = (1/T) Σ_{t=1}^{T} g(x^t)
g̃(x) = (1/T) Σ_{t=1}^{T} E[g(x) | l^t]

Since Var{g(x)} = Var{E[g(x) | l]} + E{Var[g(x) | l]} ≥ Var{E[g(x) | l]},
it follows that Var{g̃(x)} ≤ Var{ĝ(x)}.
Rao-Blackwellisation

• X = R ∪ L
• Importance Sampling (Liu, Ch. 2.5.5):
Var_Q{ P(R, L) / Q(R, L) } ≥ Var_Q{ P(R) / Q(R) }
• Gibbs Sampling:
  – autocovariances are lower (less correlation between samples)
  – if X_i and X_j are strongly correlated (e.g. X_i = 0 ⇔ X_j = 0), include only one of them in the sampling set
• “Carry out analytical computation as much as possible” - Liu
Blocking Gibbs Sampler vs. Collapsed

Consider three variables X, Y, Z:

• Standard Gibbs: sample from P(x | y, z), P(y | x, z), P(z | x, y)      (1)
• Blocking: sample from P(x | y, z), P(y, z | x)                          (2)
• Collapsed (sum out Z): sample from P(x | y), P(y | x)                   (3)

Faster convergence: (1) → (2) → (3).
Collapsed Gibbs Sampling: Generating Samples

Generate sample c^{t+1} from c^t. In short, for i = 1 to K:
C_1^{t+1} ~ P(C_1 | c_2^t, c_3^t, ..., c_K^t, e)
C_2^{t+1} ~ P(C_2 | c_1^{t+1}, c_3^t, ..., c_K^t, e)
C_3^{t+1} ~ P(C_3 | c_1^{t+1}, c_2^{t+1}, c_4^t, ..., c_K^t, e)
...
C_K^{t+1} ~ P(C_K | c_1^{t+1}, c_2^{t+1}, ..., c_{K-1}^{t+1}, e)

i.e. c_i^{t+1} is sampled from P(C_i | c^t \ c_i, e).
Collapsed Gibbs Sampler
Input: C ⊆ X, E = e
Output: T samples {c^t}
Fix evidence E = e, initialize c^0 at random.
1. For t = 1 to T (compute samples)
2.   For i = 1 to K (loop through the cutset variables)
3.     c_i^{t+1} ← sampled from P(C_i | c^t \ c_i, e)
4.   End For
5. End For
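A minimal Python sketch of the collapsed (w-cutset) Gibbs sweep above, with joint_with_cutset as a placeholder for exact inference (e.g., bucket elimination) over the non-cutset variables:

```python
import random

def cutset_gibbs(cutset_domains, evidence, joint_with_cutset, T):
    """Collapsed Gibbs sampling over the cutset variables only.

    cutset_domains: dict variable -> list of values;
    joint_with_cutset(assignment, e): returns P(c, e), summing out all
    non-cutset variables exactly (this is the expensive, exact-inference step).
    """
    c = {v: random.choice(dom) for v, dom in cutset_domains.items()}   # random c^0
    samples = []
    for _ in range(T):
        for v, dom in cutset_domains.items():
            # P(C_v | c \ c_v, e) obtained by normalizing the exact joints P(c_v, c \ c_v, e)
            joints = [joint_with_cutset({**c, v: val}, evidence) for val in dom]
            z = sum(joints)
            r, acc = random.random() * z, 0.0
            for val, p in zip(dom, joints):
                acc += p
                if r <= acc:
                    c[v] = val
                    break
        samples.append(dict(c))
    return samples
```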
Calculation Time
• Computing P(ci| ct\ci,e) is more expensive (requires inference)
• Trading #samples for smaller variance:
– generate more samples with higher covariance
– generate fewer samples with lower covariance
• Must control the time spent computing sampling probabilities in order to be time-effective!
128
Exploiting Graph Properties

Recall: computation time is exponential in the adjusted induced width of the graph.

• A w-cutset is a subset of variables such that, when they are observed, the induced width of the graph is w
• When the sampled variables form a w-cutset, inference is exp(w) (e.g., using Bucket Tree Elimination)
• A cycle-cutset is a special case of a w-cutset (w = 1)

Sampling over a w-cutset ⇒ w-cutset sampling!
What If C = Cycle-Cutset?

c^0 = {x_2^0, x_5^0},  E = {X9}

[Figure: a network over X1, ..., X9; removing the cycle cutset {X2, X5} leaves a tree over the remaining variables.]

P(x_2, x_5, x_9) can be computed using Bucket Elimination; the computation complexity is O(N).
Computing Transition Probabilities

[Figure: the network over X1, ..., X9 with cutset {X2, X5} and evidence X9.]

Compute the joint probabilities (via Bucket Elimination):
BE: P(x_2 = 0, x_5, x_9)
BE: P(x_2 = 1, x_5, x_9)

Normalize:
P(x_2 = 0 | x_5, x_9) ∝ P(x_2 = 0, x_5, x_9)
P(x_2 = 1 | x_5, x_9) ∝ P(x_2 = 1, x_5, x_9)
Cutset Sampling - Answering Queries

• Query: c_i ∈ C, P(c_i | e) = ?  Same as Gibbs:
P̂(c_i | e) = (1/T) Σ_{t=1}^{T} P(c_i | c^t \ c_i, e)      (computed while generating sample t, using bucket tree elimination)

• Query: x_i ∈ X \ C, P(x_i | e) = ?
P̂(x_i | e) = (1/T) Σ_{t=1}^{T} P(x_i | c^t, e)             (computed after generating sample t, using bucket tree elimination)
Cutset Sampling vs. Cutset Conditioning

• Cutset Conditioning (exact):
P(x_i | e) = Σ_{c ∈ D(C)} P(x_i | c, e) P(c | e)

• Cutset Sampling:
P̂(x_i | e) = (1/T) Σ_{t=1}^{T} P(x_i | c^t, e)
           = Σ_{c ∈ D(C)} P(x_i | c, e) · count(c) / T
           → Σ_{c ∈ D(C)} P(x_i | c, e) P(c | e)
Cutset Sampling Example

Estimating P(x_2 | e) for the sampled node X2 (cutset {X2, X5}, evidence X9):

Sample 1:  P(x_2 | x_5^1, x_9)
Sample 2:  P(x_2 | x_5^2, x_9)
Sample 3:  P(x_2 | x_5^3, x_9)

P̂(x_2 | x_9) = (1/3) [ P(x_2 | x_5^1, x_9) + P(x_2 | x_5^2, x_9) + P(x_2 | x_5^3, x_9) ]
Cutset Sampling Example

Estimating P(x_3 | e) for the non-sampled node X3:

c^1 = {x_2^1, x_5^1}:  P(x_3 | x_2^1, x_5^1, x_9)
c^2 = {x_2^2, x_5^2}:  P(x_3 | x_2^2, x_5^2, x_9)
c^3 = {x_2^3, x_5^3}:  P(x_3 | x_2^3, x_5^3, x_9)

P̂(x_3 | x_9) = (1/3) [ P(x_3 | x_2^1, x_5^1, x_9) + P(x_3 | x_2^2, x_5^2, x_9) + P(x_3 | x_2^3, x_5^3, x_9) ]
CPCS54 Test Results
[Plots: MSE vs. #samples (left) and vs. time (right), comparing cutset sampling and Gibbs sampling.]
Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 3
Exact time = 30 sec using Cutset Conditioning
CPCS179 Test Results
[Plots: MSE vs. #samples (left) and vs. time (right), comparing cutset sampling and Gibbs sampling.]
Non-ergodic (1 deterministic CPT entry), |X| = 179, |C| = 8, 2 ≤ D(Xi) ≤ 4, |E| = 35
Exact time = 122 sec using Cutset Conditioning
CPCS360b Test Results
[Plots: MSE vs. #samples (left) and vs. time (right), comparing cutset sampling and Gibbs sampling.]
Ergodic, |X| = 360, D(Xi) = 2, |C| = 21, |E| = 36
Exact time > 60 min using Cutset Conditioning; exact values obtained via Bucket Elimination
Random Networks
[Plots: MSE vs. #samples (left) and vs. time (right), comparing cutset sampling and Gibbs sampling.]
|X| = 100, D(Xi) = 2, |C| = 13, |E| = 15-20
Exact time = 30 sec using Cutset Conditioning
Coding Networks: Cutset Transforms a Non-Ergodic Chain to an Ergodic One

[Plot: MSE vs. time, comparing IBP, Gibbs, and cutset sampling on coding networks, n = 100, |C| = 12-14.]
Non-ergodic, |X| = 100, D(Xi) = 2, |C| = 13-16, |E| = 50
Sampling the cutset yields an ergodic subspace U = {U1, U2, ..., Uk}
Exact time = 50 sec using Cutset Conditioning
Non-Ergodic Hailfinder
[Plots: MSE vs. #samples (left) and vs. time (right), comparing cutset sampling and Gibbs sampling.]
Non-ergodic, |X| = 56, |C| = 5, 2 ≤ D(Xi) ≤ 11, |E| = 0
Exact time = 2 sec using Loop-Cutset Conditioning
CPCS360b - MSE
[Plot: MSE vs. time on cpcs360b, N = 360, |E| = 20-34, w* = 20, comparing Gibbs, IBP, and cutset sampling with |C| = 26 (fw = 3) and |C| = 48 (fw = 2).]
Ergodic, |X| = 360, |C| = 26, D(Xi) = 2
Exact time = 50 min using BTE
Cutset Importance Sampling

• Apply importance sampling over the cutset C:
P̂(e) = (1/T) Σ_{t=1}^{T} P(c^t, e) / Q(c^t) = (1/T) Σ_{t=1}^{T} w^t
P̂(c_i | e) = (1/T) Σ_{t=1}^{T} δ(c_i, c^t) w^t
P̂(x_i | e) = (1/T) Σ_{t=1}^{T} P(x_i | c^t, e) w^t

where P(c^t, e) is computed using Bucket Elimination.
(Gogate & Dechter, 2005) and (Bidyuk & Dechter, 2006)
Likelihood Cutset Weighting (LCS)

• Z = topological order over {C, E}
• Generating sample t+1:

For each Z_i ∈ Z do:
    If Z_i ∈ E
        z_i^{t+1} ← observed value of Z_i
    Else
        z_i^{t+1} ← sampled from P(Z_i | z_1^{t+1}, ..., z_{i-1}^{t+1})
    End If
End For

• P(Z_i | z_1, ..., z_{i-1}) is computed while generating sample t using bucket tree elimination
• It can be memoized for some number of instances K (based on the memory available)

KL[P(C|e), Q(C)] ≤ KL[P(X|e), Q(X)]
Pathfinder 1
Pathfinder 2
Link
[Results plots for these benchmarks are not reproduced.]
Summary

Importance Sampling:
• i.i.d. samples
• Unbiased estimator
• Generates samples fast
• Samples from Q
• Rejects samples with zero weight
• Improves on cutset

Gibbs Sampling:
• Dependent samples
• Biased estimator
• Generates samples slower
• Samples from an approximation of P(X|e)
• Does not converge in presence of constraints
• Improves on cutset
CPCS360b
[Plot: MSE vs. time on cpcs360b, N = 360, |LC| = 26, w* = 21, |E| = 15, comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW - likelihood weighting; LCS - likelihood weighting on a cutset
CPCS422b
[Plot: MSE vs. time on cpcs422b, N = 422, |LC| = 47, w* = 22, |E| = 28, comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW - likelihood weighting; LCS - likelihood weighting on a cutset
Coding Networks
[Plot: MSE vs. time on coding networks, N = 200, P = 3, |LC| = 26, w* = 21, comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW - likelihood weighting; LCS - likelihood weighting on a cutset
Overview
1. Probabilistic Reasoning/Graphical models
2. Importance Sampling
3. Markov Chain Monte Carlo: Gibbs Sampling
4. Sampling in presence of Determinism
5. Rao-Blackwellisation
6. AND/OR importance sampling
Motivation

Expected value of the number on the face of a die:
(1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5

What is the expected value of the product of the numbers on the faces of k dice?  (3.5)^k

Monte Carlo estimate:
• Perform the following experiment N times:
  – Toss all k dice.
  – Record the product of the numbers on the top face of each die.
• Report the average over the N runs:
Ẑ = (1/N) Σ_{i=1}^{N} ∏_{j=1}^{k} (number on the face of the j-th die in the i-th run)
How does the sample average converge?
[Plot: the running average for 10 dice; the exact answer is (3.5)^10.]
But This is Really Dumb?

The dice are independent.

A better Monte Carlo estimate:
1. Perform the experiment N times.
2. For each die, record the average of the numbers it showed.
3. Take the product of the k averages:
Ẑ_new = ∏_{j=1}^{k} (1/N) Σ_{i=1}^{N} (number on the face of the j-th die in the i-th run)

• Conventional estimate: average of the products:
Ẑ = (1/N) Σ_{i=1}^{N} ∏_{j=1}^{k} (number on the face of the j-th die in the i-th run)
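A minimal Python sketch comparing the two estimators on k dice:

```python
import random
from math import prod

def estimates(k=10, N=1000):
    rolls = [[random.randint(1, 6) for _ in range(k)] for _ in range(N)]
    # Conventional estimate: average of the per-run products.
    avg_of_products = sum(prod(r) for r in rolls) / N
    # Estimate exploiting independence: product of the per-die averages.
    product_of_avgs = prod(sum(r[j] for r in rolls) / N for j in range(k))
    return avg_of_products, product_of_avgs

print(estimates(), 3.5 ** 10)   # the second estimate is typically far closer to 3.5^10
```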
How does the sample average converge?
[Plot: convergence of the two estimates for 10 dice.]
Moral of the story
• Make use of (conditional) independence to get better results
• Used extensively for exact inference:
  – Bucket Elimination (Dechter, 1996)
  – Junction tree (Lauritzen and Spiegelhalter, 1988)
  – Value Elimination (Bacchus et al., 2004)
  – Recursive Conditioning (Darwiche, 2001)
  – BTD (Jegou et al., 2002)
  – AND/OR search (Dechter and Mateescu, 2007)
• How to use it for sampling?
  – AND/OR importance sampling
Background: AND/OR search spaces

[Figure: a constraint problem over A, B, C with domains {0, 1, 2} and constraints A ≠ B, A ≠ C; its pseudo-tree has A as the root with children B and C, while the chain pseudo-tree is A - B - C.]

[Figure: the OR search tree branches on A, then B, then C in sequence, while the AND/OR search tree branches on A (OR node) and, below each value of A, the subproblems over B and C appear as independent children of an AND node.]
AND/OR sampling: Example

A Bayesian network with structure A → B → D and A → C → F, and CPTs P(A), P(B|A), P(C|A), P(D|B), P(F|C).

P(d, f) = Σ_{a,b,c} P(a) P(c|a) P(b|a) P(d|b) P(f|c)
AND/OR Importance Sampling (General Idea)

• Decompose the expectation. With pseudo-tree A → {B, C} and proposal
Q(A, B, C) = Q(A) Q(B|A) Q(C|A):

P(d, f) = Σ_{a,b,c} P(a) P(c|a) P(b|a) P(d|b) P(f|c)
        = Σ_{a,b,c} [ P(a) P(c|a) P(b|a) P(d|b) P(f|c) / (Q(a) Q(b|a) Q(c|a)) ] Q(a) Q(b|a) Q(c|a)
        = E_Q[ P(a) P(c|a) P(b|a) P(d|b) P(f|c) / (Q(a) Q(b|a) Q(c|a)) ]

Gogate and Dechter, UAI 2008, CP 2008
AND/OR Importance Sampling (General Idea)

• Decompose the expectation further, pushing the sums inside:

P(d, f) = Σ_a [P(a)/Q(a)] Q(a) · [ Σ_b (P(b|a) P(d|b) / Q(b|a)) Q(b|a) ] · [ Σ_c (P(c|a) P(f|c) / Q(c|a)) Q(c|a) ]
        = E_Q[ P(a)/Q(a) · E_Q[ P(b|a) P(d|b) / Q(b|a) | a ] · E_Q[ P(c|a) P(f|c) / Q(c|a) | a ] ]
AND/OR Importance Sampling (General Idea)

• Compute all the expectations separately. How?
  – Record all samples.
  – For each sample that has A = a, estimate the conditional expectations
    E_Q[ P(b|a) P(d|b)/Q(b|a) | a ]  and  E_Q[ P(c|a) P(f|c)/Q(c|a) | a ]
    separately from the generated samples.
  – Combine the results.
AND/OR Importance Sampling

Sample #   A   B   C
   1       0   1   0
   2       0   2   1
   3       1   1   1
   4       1   2   0

[Figure: the corresponding AND/OR sample tree rooted at A, with the sampled B and C values hanging below each value of A.]

Estimate of E_Q[ P(b|A=0) P(d|b)/Q(b|A=0) | A=0 ]: the average weight of the samples of B having A = 0, i.e. (1/2) [ w(B=1, A=0) + w(B=2, A=0) ].
AND/OR Importance Sampling

• At every AND node: separate (independent) components below it; operator: product.
• At every OR node: conditional expectation given the assignment above it; operator: weighted average.

[Figure: the AND/OR sample tree for the four samples above, with an average taken at each OR node and a product at each AND node.]

Gogate and Dechter, UAI 2008, CP 2008
Algorithm AND/OR Importance Sampling

1. Construct a pseudo-tree.
2. Construct a proposal distribution Q along the pseudo-tree.
3. Generate samples x^1, ..., x^N from Q along the ordering O.
4. Build an AND/OR sample tree for the samples x^1, ..., x^N along the ordering O.
5. FOR all leaf nodes i of the AND/OR tree: IF AND-node, v(i) = 1, ELSE v(i) = 0.
6. FOR every node n from the leaves to the root:
   IF AND-node, v(n) = product of its children's values;
   IF OR-node, v(n) = average of its children's values (weighted by how often each child was sampled).
7. Return v(root-node).

Gogate and Dechter, UAI 2008, CP 2008
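A minimal Python sketch of the value computation (steps 5-7) on an AND/OR sample tree, using an illustrative node representation rather than the authors' data structures: an OR node is {'or': [AND children]}, and an AND node carries its local importance weight and the number of samples that reached it.

```python
def value_and(node):
    """AND node: product of its local weight and its (independent) OR children."""
    v = node['w']                      # local weight P(.)/Q(.) for this value assignment
    for child_or in node['and']:       # independent components below an AND node
        v *= value_or(child_or)
    return v

def value_or(node):
    """OR node: average of its AND children, weighted by how often each was sampled."""
    children = node['or']
    total = sum(c['count'] for c in children)
    return sum(c['count'] * value_and(c) for c in children) / total
```

Calling value_or on the root of the AND/OR sample tree returns the AND/OR importance sampling estimate.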
# Samples in AND/OR vs. Conventional

Sample #   A   B   C
   1       0   1   0
   2       0   2   1
   3       1   1   1
   4       1   2   0

• 8 samples in the AND/OR space versus 4 samples in conventional importance sampling.
• Example: A=0, B=2, C=0 is never generated, but it is still accounted for in the AND/OR space.

Gogate and Dechter, UAI 2008, CP 2008
Why AND/OR Importance Sampling?

• AND/OR estimates have smaller variance.
• Variance reduction:
  – Easy to prove for the case of complete independence (Goodman, 1960)
  – More complicated to prove for the general conditional independence case (see Vibhav Gogate's thesis)

For two independent quantities x and y estimated from N samples:
V[avg(x·y)]       = V[x]E[y]²/N + V[y]E[x]²/N + V[x]V[y]/N        (not independent estimation)
V[avg(x)·avg(y)]  = V[x]E[y]²/N + V[y]E[x]²/N + V[x]V[y]/N²       (independent estimation)
Note the squared term in the last denominator.

Gogate and Dechter, UAI 2008, CP 2008
AND/OR Graph Sampling

[Figure: a problem over A, B, C, D with domains {0, 1, 2} and inequality constraints; merging identical subtrees turns the AND/OR sample tree (8 samples) into an AND/OR sample graph (12 samples).]
Combining AND/OR sampling and w-cutset sampling
• Reduce the variance of the weights
  – Rao-Blackwellised w-cutset sampling (Bidyuk and Dechter, 2007)
• Increase the (effective) number of samples
  – AND/OR tree and graph sampling (Gogate and Dechter, 2008)
• Combine the two

Var_Q[P̂(e)] = Var_Q[(1/N) Σ_{i=1}^{N} w(z^i)] = Var_Q[w(z)] / N
Algorithm AND/OR w-cutset sampling

Given an integer constant w:
1. Partition the set of variables into K and R, such that the treewidth of R is bounded by w.
2. AND/OR sampling on K:
   2.1 Construct a pseudo-tree of K and compute Q(K) consistent with it.
   2.2 Generate samples from Q(K) and store them on an AND/OR tree.
3. Rao-Blackwellisation (exact inference) at each leaf:
   3.1 For each leaf node of the tree compute Z(R | g), where g is the assignment from the leaf to the root.
4. Value computation, recursively from the leaves to the root:
   4.1 At each AND node compute the product of the values of its children.
   4.2 At each OR node compute a weighted average over the values of its children.
5. Return the value of the root node.
AND/OR w-cutset sampling. Step 1: Partition the set of variables

[Figure: a graphical model over A, B, C, D, E, F, G; the cutset is K = {A, B, C} and the remainder is R = {D, E, F, G}.]

Practical constraint: we can only perform exact inference if the treewidth of R is bounded (here, bounded by 1).
AND/OR w-cutset sampling. Step 2: AND/OR sampling over {A, B, C}

Pseudo-tree over the cutset: C is the root with children A and B.

Samples: (C=0, A=0, B=1), (C=0, A=1, B=1), (C=1, A=0, B=0), (C=1, A=1, B=0)
[Figure: the AND/OR sample tree over C (OR node), its values 0 and 1 (AND nodes), and below each, OR nodes for A and B with the sampled values.]
AND/OR w-cutset sampling. Step 3: Exact inference at each leaf

Value of B = 1 given C = 0:
v(B=1 | C=0) = Σ_{g ∈ G} Σ_{f ∈ F} H(C=0, g) H(B=1, g) H(g, f) H(B=1, f)
AND/OR w-cutset sampling. Step 4: Value computation

[Figure: the same AND/OR sample tree, with values propagated from the leaves to the root.]

Value of B = 1 given C = 0: v(B=1 | C=0), as computed in Step 3.
Value of B given C = 0: the estimate Ê[B, F, G | C=0] (a weighted average over the sampled values of B).
Value of A given C = 0: the estimate Ê[A, D, E | C=0].
Value of C: the estimate of the partition function.
Properties and Improvements
• Basic underlying scheme for sampling remains the same
– The only thing that changes is what you estimate from the samples
– Can be combined with any state-of-the-art importance sampling technique
• Graph vs Tree sampling
– Take full advantage of the conditional independence properties uncovered from the primal graph
178
AND/OR w-cutset sampling: Advantages and Disadvantages

• Advantages
  – Variance reduction
  – Relatively fewer calls to the Rao-Blackwellisation step due to efficient caching (lazy Rao-Blackwellisation)
  – Dynamic Rao-Blackwellisation when context-specific or logical dependencies are present
    • Particularly suitable for Markov logic networks (Richardson and Domingos, 2006)
• Disadvantages
  – Increases time and space complexity, and therefore fewer samples may be generated
179
Take-away Figure: Variance Hierarchy and Complexity

[Figure: a hierarchy of schemes ordered by decreasing variance: IS, w-cutset IS, AND/OR Tree IS, AND/OR w-cutset Tree IS, AND/OR Graph IS, AND/OR w-cutset Graph IS. Annotated time/space costs range from O(nN) time and O(1) space for plain IS up to O(nNt·exp(w)) time and O(nN + n·exp(w)) space for AND/OR w-cutset graph IS.]
Experiments
• Benchmarks
– Linkage analysis
– Graph coloring
• Algorithms
– OR tree sampling
– AND/OR tree sampling
– AND/OR graph sampling
– w-cutset versions of the three schemes above
Results: Probability of Evidence (Linkage instances, UAI 2006 evaluation). Time bound: 1 hr.
Results: Probability of Evidence (Linkage instances, UAI 2008 evaluation). Time bound: 1 hr.
Results: Solution Counting (Graph coloring instances). Time bound: 1 hr.
Summary: AND/OR Importance sampling
• AND/OR sampling: A general scheme to exploit conditional independence in sampling
• Theoretical guarantees: lower sampling error than conventional sampling
• Variance reduction orthogonal to Rao-Blackwellised sampling.
• Better empirical performance than conventional sampling.
187