Graphical Models: An Introduction
Lise Getoor, Computer Science Dept, University of Maryland
http://www.cs.umd.edu/~getoor

Page 1: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Graphical Models: An Introduction

Lise Getoor, Computer Science Dept, University of Maryland

http://www.cs.umd.edu/~getoor

Page 2: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Reading List for Next Lecture

• Learning Probabilistic Relational Models, L. Getoor, N. Friedman, D. Koller, A. Pfeffer. Invited contribution to the book Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001.

http://www.cs.umd.edu/~getoor/Publications/lprm-ch.ps http://www.cs.umd.edu/class/spring2005/cmsc828g/Readings/lprm-ch.pdf

• Probabilistic Models for Relational Data, David Heckerman, Christopher Meek and Daphne Koller

http://www.cs.umd.edu/projects/srl2004/Papers/heckerman.pdf
ftp://ftp.research.microsoft.com/pub/tr/TR-2004-30.pdf

Page 3: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Graphical Models

• e.g. Bayesian networks, Bayes nets, Belief nets, Markov networks, HMMs, Dynamic Bayes nets, etc.

• Themes:
  – representation
  – reasoning
  – learning

• Materials based on upcoming book by Nir Friedman and Daphne Koller.

• Slides based on material from Nir Friedman.

Page 4: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Probability Distributions

• Let X1,…,Xp be discrete random variables

• Let P be a joint distribution over X1,…,Xp

• If the variables are binary, then we need O(2^p) parameters to describe P

• Can we do better?
• Key idea: use properties of independence

Page 5: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Independent Random Variables

• Two variables X and Y are independent if
  – P(X = x | Y = y) = P(X = x) for all values x, y
  – That is, learning the value of Y does not change the prediction of X

• If X and Y are independent then
  – P(X,Y) = P(X|Y)P(Y) = P(X)P(Y)

• In general, if X1,…,Xp are independent, then
  – P(X1,…,Xp) = P(X1)…P(Xp)
  – Requires only O(p) parameters

Page 6: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Conditional Independence

• Unfortunately, most random variables of interest are not independent of each other

• A more suitable notion is that of conditional independence

• Two variables X and Y are conditionally independent given Z if
  – P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all values x, y, z
  – That is, learning the value of Y does not change the prediction of X once we know the value of Z
  – Notation: I(X , Y | Z)

Page 7: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Example: Naïve Bayesian Model

• A common model in early diagnosis:
  – Symptoms are conditionally independent given the disease (or fault)

• Thus, if
  – X1,…,Xp denote the symptoms exhibited by the patient (headache, high fever, etc.) and
  – H denotes the hypothesis about the patient's health

• then, P(X1,…,Xp,H) = P(H)P(X1|H)…P(Xp|H)  (see the sketch below)

• This naïve Bayesian model allows a compact representation
  – It does, however, embody strong independence assumptions
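To make the factorization concrete, here is a minimal numeric sketch (all probabilities below are made-up illustration values, not from the slides): it computes the posterior over a two-valued hypothesis H by multiplying the prior with the per-symptom terms P(Xj | H) and renormalizing.

```python
import numpy as np

# Hypothetical numbers: P(H) over {healthy, sick} and P(X_j = 1 | H) for three symptoms.
prior = np.array([0.9, 0.1])                    # P(H)
p_sym = np.array([[0.1, 0.05, 0.2],             # P(X_j = 1 | H = healthy)
                  [0.8, 0.60, 0.7]])            # P(X_j = 1 | H = sick)

def posterior(x):
    """P(H | x) under the naive Bayes factorization P(H) * prod_j P(x_j | H)."""
    x = np.asarray(x)
    lik = np.prod(np.where(x == 1, p_sym, 1 - p_sym), axis=1)   # P(x | H)
    joint = prior * lik                                          # P(H, x) for each h
    return joint / joint.sum()

print(posterior([1, 1, 0]))   # posterior over {healthy, sick}
```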

Page 8: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Graphical Models

• A graph is a language for representing independencies
  – Directed Acyclic Graph → Bayesian Network
  – Undirected Graph → Markov Network

Page 9: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

DAGS: Markov Assumption

• We now make this independence assumption more precise for directed acyclic graphs (DAGs)

• Each random variable X is independent of its non-descendants, given its parents Pa(X)

• Formally: I(X, NonDesc(X) | Pa(X))

[Diagram: a node X with a parent, an ancestor, a descendant, and non-descendants Y1, Y2]

Page 10: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Markov Assumption Example

• In this example:
  – I(E, B)
  – I(B, {E, R})
  – I(R, {A, B, C} | E)
  – I(A, R | B, E)
  – I(C, {B, E, R} | A)

[Network: Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]

Page 11: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

I-Maps

• A DAG G is an I-Map of a distribution P if all the Markov assumptions implied by G are satisfied by P
  (Assuming G and P both use the same set of random variables)

Examples:

X Y
x y | P(x,y)
0 0 | 0.25
0 1 | 0.25
1 0 | 0.25
1 1 | 0.25

X Y
x y | P(x,y)
0 0 | 0.2
0 1 | 0.3
1 0 | 0.4
1 1 | 0.1

Page 12: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Factorization

• Given that G is an I-Map of P, can we simplify the representation of P?

• Example:

• Since I(X,Y), we have that P(X|Y) = P(X)
• Applying the chain rule:

P(X,Y) = P(X|Y) P(Y) = P(X) P(Y)

• Thus, we have a simpler representation of P(X,Y)

X Y

Page 13: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Factorization Theorem

Thm: if G is an I-Map of P, then
  P(X1,…,Xp) = ∏i P(Xi | Pa(Xi))

Proof:
• wlog. X1,…,Xp is an ordering consistent with G
• By the chain rule: P(X1,…,Xp) = ∏i P(Xi | X1,…,Xi-1)
• From the assumption: Pa(Xi) ⊆ {X1,…,Xi-1} and {X1,…,Xi-1} - Pa(Xi) ⊆ NonDesc(Xi)
• Since G is an I-Map, I(Xi, NonDesc(Xi) | Pa(Xi))
• Hence, I(Xi, {X1,…,Xi-1} - Pa(Xi) | Pa(Xi))
• We conclude, P(Xi | X1,…,Xi-1) = P(Xi | Pa(Xi))

Page 14: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Factorization Example

P(C,A,R,E,B) = P(B)P(E|B)P(R|E,B)P(A|R,B,E)P(C|A,R,B,E)

Earthquake

Radio

Burglary

Alarm

Call

versus

P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
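As a sketch of what the compact factorization buys, the snippet below (with made-up CPT values, not taken from the slides) evaluates the factored joint for one assignment; note it needs only 1+1+2+4+2 = 10 parameters, versus 2^5 - 1 = 31 for an explicit joint table over five binary variables.

```python
# Hypothetical CPTs for the alarm network factorization
# P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
P_B = {1: 0.01, 0: 0.99}
P_E = {1: 0.001, 0: 0.999}
P_R = {e: {1: p, 0: 1 - p} for e, p in [(1, 0.9), (0, 0.01)]}               # P(R|E)
P_A = {be: {1: p, 0: 1 - p} for be, p in [((1, 1), 0.95), ((1, 0), 0.94),
                                          ((0, 1), 0.29), ((0, 0), 0.001)]}  # P(A|B,E)
P_C = {a: {1: p, 0: 1 - p} for a, p in [(1, 0.7), (0, 0.05)]}               # P(C|A)

def joint(b, e, r, a, c):
    """Joint probability of one full assignment under the factored form."""
    return P_B[b] * P_E[e] * P_R[e][r] * P_A[(b, e)][a] * P_C[a][c]

print(joint(b=1, e=0, r=0, a=1, c=1))
```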

Page 15: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Consequences

• We can write P in terms of “local” conditional probabilities

If G is sparse,
  – that is, |Pa(Xi)| ≤ k,
then each conditional probability can be specified compactly
  – e.g. for binary variables, these require O(2^k) params

and the representation of P is compact
  – linear in the number of variables

Page 16: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

DAGS: Summary

• The Markov Independences of a DAG G– I (Xi , NonDesc(Xi) | Pai )

• G is an I-Map of a distribution P
  – if P satisfies the Markov independencies implied by G

• If G is an I-Map of P, then
  P(X1,…,Xn) = ∏i P(Xi | Pai)

Page 17: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Conditional Independencies

• Let Markov(G) be the set of Markov independencies implied by G

• The factorization theorem shows:
  G is an I-Map of P  ⟹  P(X1,…,Xn) = ∏i P(Xi | Pai)

• We can also show the opposite:

Thm:
  P(X1,…,Xn) = ∏i P(Xi | Pai)  ⟹  G is an I-Map of P

Page 18: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Implied Independencies

• Does a graph G imply additional independencies as a consequence of Markov(G)?

• We can define a logic of independence statements

• Some axioms:
  – I(X ; Y | Z) ⟹ I(Y ; X | Z)
  – I(X ; Y1, Y2 | Z) ⟹ I(X ; Y1 | Z)

Page 19: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

d-separation

• d-sep(X; Y | Z, G) is a procedure that, given a DAG G and sets X, Y, and Z, returns either yes or no

• Goal: d-sep(X; Y | Z, G) = yes iff I(X; Y | Z) follows from Markov(G)

Page 20: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Paths

• Intuition: dependency must “flow” along paths in the graph

• A path is a sequence of neighboring variables

Examples (in the alarm network):
• R ← E → A ← B
• C ← A ← E → R

Earthquake

Radio

Burglary

Alarm

Call

Page 21: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Paths

• We want to know when a path is
  – active – creates a dependency between the end nodes
  – blocked – cannot create a dependency between the end nodes

• We want to classify the situations in which paths are active.

Page 22: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Path Blockage

Three cases:
– Common cause (R ← E → A)
  • Blocked when the common cause E is in the evidence; active (unblocked) otherwise

[Diagrams: R ← E → A shown with E observed (blocked) and E not observed (active)]

Page 23: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Path Blockage

Three cases:
– Common cause
– Intermediate cause (E → A → C)
  • Blocked when the intermediate node A is in the evidence; active (unblocked) otherwise

[Diagrams: E → A → C shown with A observed (blocked) and A not observed (active)]

Page 24: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Path Blockage

Three cases:
– Common cause
– Intermediate cause
– Common effect (E → A ← B, with descendant C)
  • Blocked when neither the common effect A nor any of its descendants is in the evidence; active when A or a descendant (e.g. C) is observed

[Diagrams: E → A ← B with A → C, shown blocked and unblocked]

Page 25: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Path Blockage -- General Case

A path is active, given evidence Z, if
• Whenever we have a common-effect configuration A → B ← C along the path, B or one of its descendants is in Z
• No other node along the path is in Z

A path is blocked, given evidence Z, if it is not active.

Page 26: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Example

[Network: E → A ← B, E → R, A → C]

– d-sep(R, B)?

Page 27: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Example

[Network: E → A ← B, E → R, A → C]

– d-sep(R, B) = yes
– d-sep(R, B | A)?

Page 28: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Example

[Network: E → A ← B, E → R, A → C]

– d-sep(R, B) = yes
– d-sep(R, B | A) = no
– d-sep(R, B | E, A)?

Page 29: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

d-Separation

• X is d-separated from Y, given Z, if all paths from a node in X to a node in Y are blocked, given Z.

• Checking d-separation can be done efficiently (linear time in the number of edges)
  – Bottom-up phase: mark all nodes whose descendants are in Z
  – X-to-Y phase: traverse (BFS) all edges on paths from X to Y and check whether they are blocked
  (a sketch of this procedure follows)
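The sketch below implements the two-phase check just described (phase 1 marks Z and its ancestors; phase 2 does a breadth-first traversal of active trails). It follows the standard reachability formulation of d-separation; the encoding of the graph as parent/child dictionaries is our own choice for illustration.

```python
def d_separated(parents, children, X, Y, Z):
    """Return True iff X is d-separated from Y given Z in the DAG."""
    Z = set(Z)
    # Phase 1 (bottom-up): A = Z together with all ancestors of Z
    A, frontier = set(), list(Z)
    while frontier:
        n = frontier.pop()
        if n not in A:
            A.add(n)
            frontier += list(parents[n])
    # Phase 2: traverse active trails starting from X; states are (node, direction)
    visited, frontier = set(), [(x, 'up') for x in X]
    while frontier:
        n, d = frontier.pop()
        if (n, d) in visited:
            continue
        visited.add((n, d))
        if n in Y and n not in Z:
            return False                      # found an active trail into Y
        if d == 'up' and n not in Z:          # arrived from a child
            frontier += [(p, 'up') for p in parents[n]]
            frontier += [(c, 'down') for c in children[n]]
        elif d == 'down':                     # arrived from a parent
            if n not in Z:
                frontier += [(c, 'down') for c in children[n]]
            if n in A:                        # v-structure with observed descendant
                frontier += [(p, 'up') for p in parents[n]]
    return True

# Alarm example: E -> A <- B, E -> R, A -> C
parents  = {'E': [], 'B': [], 'R': ['E'], 'A': ['E', 'B'], 'C': ['A']}
children = {'E': ['R', 'A'], 'B': ['A'], 'R': [], 'A': ['C'], 'C': []}
print(d_separated(parents, children, {'R'}, {'B'}, set()))        # True
print(d_separated(parents, children, {'R'}, {'B'}, {'A'}))        # False
print(d_separated(parents, children, {'R'}, {'B'}, {'E', 'A'}))   # True
```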

Page 30: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Soundness

Thm: • If

– G is an I-Map of P– d-sep( X; Y | Z, G ) = yes

• then– P satisfies I( X; Y | Z )

Informally,• Any independence reported by d-separation is

satisfied by underlying distribution

Page 31: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Completeness

Thm: • If d-sep( X; Y | Z, G ) = no• then there is a distribution P such that

– G is an I-Map of P– P does not satisfy I( X; Y | Z )

Informally,• Any independence not reported by d-separation

might be violated by the underlying distribution• We cannot determine this by examining the

graph structure alone

Page 32: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

I-Maps revisited

• The fact that G is an I-Map of P might not be that useful

• For example, complete DAGs
  – A DAG G is complete if we cannot add an arc without creating a cycle

• These DAGs do not imply any independencies
• Thus, they are I-Maps of any distribution

[Diagrams: two complete DAGs over X1, X2, X3, X4]

Page 33: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Minimal I-Maps

A DAG G is a minimal I-Map of P if
• G is an I-Map of P
• If G' ⊂ G, then G' is not an I-Map of P

That is, removing any arc from G introduces (conditional) independencies that do not hold in P

Page 34: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Minimal I-Map Example

• If the DAG shown is a minimal I-Map

• Then, these are not I-Maps:

[Diagrams: a DAG over X1, X2, X3, X4 and four subgraphs, each obtained by removing one arc]

Page 35: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Constructing minimal I-Maps

The factorization theorem suggests an algorithm

• Fix an ordering X1,…,Xn

• For each i,
  – select Pai to be a minimal subset of {X1,…,Xi-1} such that I(Xi ; {X1,…,Xi-1} - Pai | Pai)

• Clearly, the resulting graph is a minimal I-Map (see the sketch below).
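A brute-force sketch of this construction, assuming a conditional-independence oracle indep(X, Ys, Zs) for the target distribution is available (in practice a statistical test or an exact computation); the exhaustive subset search is exponential and only for illustration.

```python
from itertools import combinations

def minimal_imap(order, indep):
    """For each variable, pick a smallest predecessor set Pa with I(X; prev - Pa | Pa)."""
    parents = {}
    for i, x in enumerate(order):
        prev = list(order[:i])
        pa = prev                        # fallback: all predecessors
        done = False
        for k in range(len(prev) + 1):   # try candidate parent sets of increasing size
            for cand in combinations(prev, k):
                rest = [v for v in prev if v not in cand]
                if indep(x, rest, list(cand)):
                    pa, done = list(cand), True
                    break
            if done:
                break
        parents[x] = pa
    return parents
```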

Page 36: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Non-uniqueness of minimal I-Map

• Unfortunately, there may be several minimal I-Maps for the same distribution
  – Applying the I-Map construction procedure with different orders can lead to different structures

Original I-Map: [E → A ← B, E → R, A → C]

Order C, R, A, E, B: [a different minimal I-Map over the same variables]

Page 37: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Choosing Ordering & Causality

• The choice of order can have drastic impact on the complexity of minimal I-Map

• Heuristic argument: construct I-Map using causal ordering among variables

• Justification?– It is often reasonable to assume that graphs of causal

influence should satisfy the Markov properties.

Page 38: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

P-Maps

• A DAG G is a P-Map (perfect map) of a distribution P if
  – I(X; Y | Z) if and only if d-sep(X; Y | Z, G) = yes

Notes:
• A P-Map captures all the independencies in the distribution
• P-Maps are unique, up to DAG equivalence

Page 39: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

P-Maps

• Unfortunately, some distributions do not have a P-Map

Page 40: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Bayesian Networks

• A Bayesian network specifies a probability distribution via two components:
  – A DAG G
  – A collection of conditional probability distributions P(Xi | Pai)

• The joint distribution P is defined by the factorization
  P(X1,…,Xn) = ∏i P(Xi | Pai)

• Additional requirement: G is a minimal I-Map of P

Page 41: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

DAGs and BNs

• DAGs as a representation of conditional independencies:
  – Markov independencies of a DAG
  – Tight correspondence between Markov(G) and the factorization defined by G
  – d-separation, a sound & complete procedure for computing the consequences of the independencies
  – Notion of minimal I-Map
  – P-Maps

• This theory is the basis for defining Bayesian networks

Page 42: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Undirected Graphs: Markov Networks

• An alternative representation of conditional independencies

• Let U be an undirected graph
• Let Ni be the set of neighbors of Xi
• Define Markov(U) to be the set of independencies
  I(Xi ; {X1,…,Xn} - Ni - {Xi} | Ni)

• U is an I-Map of P if P satisfies Markov(U)

Page 43: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Example

This graph implies that
• I(A; C | B, D)
• I(B; D | A, C)

• Note: this example does not have a directed P-Map

[Diagram: the four-cycle A – B – C – D – A]

Page 44: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Markov Network Factorization

Thm: if
• P is strictly positive, that is, P(x1,…,xn) > 0 for all assignments

then
• U is an I-Map of P
if and only if
• there is a factorization
  P(X1,…,Xn) = (1/Z) ∏i fi(Ci)
  where C1,…,Ck are the maximal cliques in U

Alternative (log-linear) form:
  P(X1,…,Xn) = (1/Z) exp( Σi gi(Ci) )

Page 45: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Relationship between Directed & Undirected Models

[Diagram: chain graphs contain both directed graphs and undirected graphs as special cases]

Page 46: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

CPDs

• So far, we focused on how to represent independencies using DAGs

• The “other” component of a Bayesian network is the specification of the conditional probability distributions (CPDs)

• Here, we'll just discuss the simplest representation of CPDs

Page 47: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Tabular CPDs

• When the variables of interest are all discrete, the common representation is a table:

• For example, P(C | A, B) can be represented by

A B | P(C = 0 | A, B) | P(C = 1 | A, B)
0 0 | 0.25 | 0.75
0 1 | 0.50 | 0.50
1 0 | 0.12 | 0.88
1 1 | 0.33 | 0.67
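A tabular CPD is just a multi-dimensional array indexed by the parent values; a minimal sketch using the table above:

```python
import numpy as np

# P(C | A, B) stored as an array indexed [a, b, c]; each row over c sums to 1.
cpd_C = np.array([
    [[0.25, 0.75],    # a=0, b=0
     [0.50, 0.50]],   # a=0, b=1
    [[0.12, 0.88],    # a=1, b=0
     [0.33, 0.67]],   # a=1, b=1
])
assert np.allclose(cpd_C.sum(axis=-1), 1.0)   # valid conditional distributions
print(cpd_C[1, 0, 1])                         # P(C=1 | A=1, B=0) = 0.88
```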

Page 48: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Tabular CPDs

Pros:• Very flexible, can capture any CPD of discrete

variables• Can be easily stored and manipulated

Cons:• Representation size grows exponentially with

the number of parents!• Unwieldy to assess probabilities for more than

few parents

Page 49: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Continuous CPDs

• When X is a continuous variable, we need to represent the density of X given any value of its parents
  – Gaussian
  – Conditional Gaussian

Page 50: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

CPDs: Summary

• Many choices for representing CPDs• Any “statistical” model of conditional

distribution can be used– e.g., any regression model

• Representing structure in CPDs can have implications on independencies among variables

Page 51: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Inference in Bayesian Networks

Page 52: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Inference

• We now have compact representations of probability distributions:
  – Bayesian networks
  – Markov networks

• The network describes a unique probability distribution P

• How do we answer queries about P?

• Inference is the name for the process of computing answers to such queries

Page 53: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Queries: Likelihood

• There are many types of queries we might ask.
• Most of these involve evidence
  – Evidence e is an assignment of values to a set E of variables in the domain
  – Without loss of generality E = {Xk+1,…,Xn}

• Simplest query: compute the probability of the evidence
  P(e) = Σx1 ⋯ Σxk P(x1,…,xk, e)

• This is often referred to as computing the likelihood of the evidence

Page 54: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Queries: A posteriori belief

• Often we are interested in the conditional probability of a variable given the evidence
  P(X | e) = P(X, e) / P(e)

• This is the a posteriori belief in X, given evidence e

• A related task is computing the term P(X, e)
  – i.e., the likelihood of e and X = x for each value x of X
  – we can recover the a posteriori belief by
    P(X = x | e) = P(X = x, e) / Σx' P(X = x', e)

Page 55: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

A posteriori belief

This query is useful in many cases:
• Prediction: what is the probability of an outcome given the starting condition?
  – The target is a descendant of the evidence

• Diagnosis: what is the probability of a disease/fault given the symptoms?
  – The target is an ancestor of the evidence

• Note: the direction of the edges between variables does not restrict the direction of the queries
  – Probabilistic inference can combine evidence from all parts of the network

Page 56: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Queries: MAP

• In this query we want to find the maximum a posteriori assignment for some variable of interest (say X1,…,Xl )

• That is, x1,…,xl maximize the probability

P(x1,…,xl | e)

• Note that this is equivalent to maximizing

P(x1,…,xl, e)

Page 57: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Queries: MAP

We can use MAP for:• Classification

– find most likely label, given the evidence

• Explanation – What is the most likely scenario, given the evidence

Page 58: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Queries: MAP

Cautionary note:
• The MAP depends on the set of variables
• Example (table below):
  – MAP of X is x = 1, since P(x=1) = 0.6
  – MAP of (X, Y) is (0, 0), with probability 0.35, which assigns x = 0

x y | P(x,y)
0 0 | 0.35
0 1 | 0.05
1 0 | 0.3
1 1 | 0.3

Page 59: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Complexity of Inference

Thm: Computing P(X = x) in a Bayesian network is NP-hard

Not surprising, since we can simulate Boolean gates.

Page 60: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Hardness

• Hardness does not mean we cannot solve inference– It implies that we cannot find a general procedure

that works efficiently for all networks– For particular families of networks, we can have

provably efficient procedures

Page 61: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Approaches to inference

• Exact inference
  – Inference in simple chains
  – Variable elimination
  – Clustering / join tree algorithms

• Approximate inference
  – Stochastic simulation / sampling methods
  – Markov chain Monte Carlo methods
  – Mean field theory

Page 62: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Inference in Simple Chains

How do we compute P(X2)?

X1 → X2

P(x2) = Σx1 P(x1, x2) = Σx1 P(x1) P(x2 | x1)

Page 63: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Inference in Simple Chains (cont.)

How do we compute P(X3)?
• we already know how to compute P(X2)...

X1 → X2 → X3

P(x2) = Σx1 P(x1, x2) = Σx1 P(x1) P(x2 | x1)
P(x3) = Σx2 P(x2, x3) = Σx2 P(x2) P(x3 | x2)

Page 64: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Inference in Simple Chains (cont.)

How do we compute P(Xn)?

X1 → X2 → X3 → … → Xn

• Compute P(X1), P(X2), P(X3), …
• We compute each term by using the previous one:
  P(xi+1) = Σxi P(xi) P(xi+1 | xi)

Complexity:
• Each step costs O(|Val(Xi)| · |Val(Xi+1)|) operations
• Compare to naïve evaluation, which requires summing over the joint values of n-1 variables
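A minimal sketch of this forward pass, representing P(X1) as a vector and each P(Xi+1 | Xi) as a matrix (the numbers below are made up): each step is one vector-matrix product, i.e. O(|Val(Xi)| · |Val(Xi+1)|) work.

```python
import numpy as np

def chain_marginal(p1, transitions):
    """p1: P(X1); transitions: list of matrices T with T[a, b] = P(X_{i+1}=b | X_i=a)."""
    p = np.asarray(p1, dtype=float)
    for T in transitions:
        p = p @ np.asarray(T, dtype=float)   # sum_{x_i} P(x_i) P(x_{i+1} | x_i)
    return p

# Three binary variables X1 -> X2 -> X3 (illustrative numbers)
p1  = [0.6, 0.4]
T12 = [[0.9, 0.1], [0.2, 0.8]]
T23 = [[0.7, 0.3], [0.5, 0.5]]
print(chain_marginal(p1, [T12, T23]))        # P(X3)
```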

Page 65: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Inference in Simple Chains (cont.)

• Suppose that we observe the value X2 = x2

• How do we compute P(X1 | x2)?
  – Recall that it suffices to compute P(X1, x2):

X1 → X2

P(x1, x2) = P(x2 | x1) P(x1)

Page 66: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Inference in Simple Chains (cont.)

• Suppose that we observe the value X3 = x3

X1 → X2 → X3

• How do we compute P(X1, x3)?
  P(x1, x3) = P(x1) P(x3 | x1)

• How do we compute P(x3 | x1)?
  P(x3 | x1) = Σx2 P(x2, x3 | x1) = Σx2 P(x2 | x1) P(x3 | x2, x1) = Σx2 P(x2 | x1) P(x3 | x2)

Page 67: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Inference in Simple Chains (cont.)

• Suppose that we observe the value Xn = xn

X1 → X2 → X3 → … → Xn

• How do we compute P(X1, xn)?
  P(x1, xn) = P(x1) P(xn | x1)

• We compute P(xn | xn-1), P(xn | xn-2), … iteratively:
  P(xn | xi) = Σxi+1 P(xi+1, xn | xi) = Σxi+1 P(xi+1 | xi) P(xn | xi+1)

Page 68: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Inference in Simple Chains (cont.)

• Suppose that we observe the value Xn = xn
• We want to find P(Xk | xn)

X1 → X2 → … → Xk → … → Xn

• How do we compute P(Xk, xn)?
  P(xk, xn) = P(xk) P(xn | xk)

• We compute P(Xk) by forward iterations
• We compute P(xn | Xk) by backward iterations

Page 69: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Elimination in Chains

• We now try to understand the simple chain example from first principles

• Using the definition of probability, we have

A → B → C → D → E

P(e) = Σd Σc Σb Σa P(a, b, c, d, e)

Page 70: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Elimination in Chains

• By the chain decomposition, we get

A → B → C → D → E

P(e) = Σd Σc Σb Σa P(a, b, c, d, e)
     = Σd Σc Σb Σa P(a) P(b|a) P(c|b) P(d|c) P(e|d)

Page 71: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Elimination in Chains

• Rearranging terms ...

A → B → C → D → E

P(e) = Σd Σc Σb Σa P(a) P(b|a) P(c|b) P(d|c) P(e|d)
     = Σd P(e|d) Σc P(d|c) Σb P(c|b) Σa P(b|a) P(a)

Page 72: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Elimination in Chains

• Now we can perform the innermost summation

A → B → C → D → E

P(e) = Σd P(e|d) Σc P(d|c) Σb P(c|b) Σa P(b|a) P(a)
     = Σd P(e|d) Σc P(d|c) Σb P(c|b) p(b)        where p(b) = Σa P(b|a) P(a)

• This summation is exactly the first step in the forward iteration we described before

Page 73: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Elimination in Chains

• Rearranging and then summing again, we get

A → B → C → D → E

P(e) = Σd P(e|d) Σc P(d|c) Σb P(c|b) p(b)
     = Σd P(e|d) Σc P(d|c) p(c)                  where p(c) = Σb P(c|b) p(b)

Page 74: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Variable Elimination

General idea:
• Write the query in the form
  P(X1, e) = Σxk ⋯ Σx3 Σx2 ∏i P(xi | pai)

• Iteratively
  – Move all irrelevant terms outside of the innermost sum
  – Perform the innermost sum, getting a new term
  – Insert the new term into the product

(a sketch of this procedure follows)
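A compact sketch of this procedure for discrete factors (restricted to binary variables to keep it short; the representation of a factor as a (variables, table) pair is our own choice): each elimination step multiplies the factors mentioning the variable and sums it out, inserting the new factor back into the pool.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors; a factor is (vars, {assignment: value})."""
    fv, ft = f
    gv, gt = g
    vars_ = tuple(fv) + tuple(v for v in gv if v not in fv)
    table = {}
    for assign in product([0, 1], repeat=len(vars_)):          # binary domains
        a = dict(zip(vars_, assign))
        table[assign] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return vars_, table

def sum_out(f, var):
    """Marginalize var out of factor f."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for assign, val in ft.items():
        key = tuple(x for v, x in zip(fv, assign) if v != var)
        table[key] = table.get(key, 0.0) + val
    return keep, table

def eliminate(factors, elim_order):
    for var in elim_order:
        involved = [f for f in factors if var in f[0]]
        if not involved:
            continue
        rest = [f for f in factors if var not in f[0]]
        prod_f = involved[0]
        for f in involved[1:]:
            prod_f = multiply(prod_f, f)
        factors = rest + [sum_out(prod_f, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Tiny usage example: chain A -> B, compute P(B) by eliminating A
P_A = (('A',), {(0,): 0.6, (1,): 0.4})
P_B_given_A = (('A', 'B'), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
print(eliminate([P_A, P_B_given_A], ['A']))    # marginal over B
```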

Page 75: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

A More Complex Example

• The “Asia” network:

[Nodes: Visit to Asia (V), Smoking (S), Tuberculosis (T), Lung Cancer (L), Bronchitis (B), Abnormality in Chest (A), X-Ray (X), Dyspnea (D); edges V→T, S→L, S→B, T→A, L→A, A→X, A→D, B→D]

Page 76: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b

• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

Page 77: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b

• Initial factors:
  P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

Eliminate: v
Compute: fv(t) = Σv P(v) P(t|v)

⟹ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

Note: fv(t) = P(t)
In general, the result of elimination is not necessarily a probability term

Page 78: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

• We want to compute P(d)
• Need to eliminate: s, x, t, l, a, b

• Current factors:
  fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

Eliminate: s
Compute: fs(b,l) = Σs P(s) P(b|s) P(l|s)

⟹ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)

Summing over s results in a factor with two arguments, fs(b,l)
In general, the result of elimination may be a function of several variables

Page 79: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

• We want to compute P(d)
• Need to eliminate: x, t, l, a, b

• Current factors:
  fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)

Eliminate: x
Compute: fx(a) = Σx P(x|a)

⟹ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)

Note: fx(a) = 1 for all values of a!

Page 80: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

• We want to compute P(d)
• Need to eliminate: t, l, a, b

• Current factors:
  fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)

Eliminate: t
Compute: ft(a,l) = Σt fv(t) P(a|t,l)

⟹ fs(b,l) fx(a) ft(a,l) P(d|a,b)

Page 81: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

• We want to compute P(d)
• Need to eliminate: l, a, b

• Current factors:
  fs(b,l) fx(a) ft(a,l) P(d|a,b)

Eliminate: l
Compute: fl(a,b) = Σl fs(b,l) ft(a,l)

⟹ fx(a) fl(a,b) P(d|a,b)

Page 82: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

• We want to compute P(d)
• Need to eliminate: a, b

• Current factors:
  fx(a) fl(a,b) P(d|a,b)

Eliminate: a, b
Compute: fa(b,d) = Σa fx(a) fl(a,b) P(d|a,b),   fb(d) = Σb fa(b,d)

⟹ fb(d) = P(d)

Page 83: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Variable Elimination

• We now understand variable elimination as a sequence of rewriting operations

• Actual computation is done in elimination step

• Exactly the same computation procedure applies to Markov networks

• Computation depends on order of elimination

Page 84: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Complexity of Variable Elimination

• Suppose in one elimination step we compute
  fx(y1,…,yk) = Σx f'x(x, y1,…,yk)
  where f'x(x, y1,…,yk) = ∏i=1..m fi(x, yi,1,…,yi,li)

This requires
• m · |Val(X)| · ∏i |Val(Yi)| multiplications
  – For each value of x, y1,…,yk, we do m multiplications
• |Val(X)| · ∏i |Val(Yi)| additions
  – For each value of y1,…,yk, we do |Val(X)| additions

Complexity is exponential in the number of variables in the intermediate factor!

Page 85: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Understanding Variable Elimination

• We want to select “good” elimination orderings that reduce complexity

• We start by attempting to understand variable elimination via the graph we are working with

• This will reduce the problem of finding good ordering to graph-theoretic operation that is well-understood

Page 86: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Undirected graph representation

• At each stage of the procedure, we have an algebraic term that we need to evaluate

• In general this term is of the form
  P(x1,…,xn) = Σy1 ⋯ Σyk ∏i fi(Zi)
  where the Zi are sets of variables

• We now plot a graph where there is an undirected edge X–Y if X and Y are arguments of some factor
  – that is, if X and Y are in some Zi

• Note: this is the Markov network that describes the probability distribution over the variables we have not yet eliminated

Page 87: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Chordal Graphs

• elimination ordering ⟹ undirected chordal graph

Graph:
• Maximal cliques are factors in the elimination
• Factors in the elimination are cliques in the graph
• Complexity is exponential in the size of the largest clique in the graph

[Diagrams: the Asia network and the chordal graph induced by an elimination ordering]

Page 88: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Induced Width

• The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination

• This quantity is called the induced width of a graph according to the specified ordering

• Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph

Page 89: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

General Networks

• From graph theory:
Thm: Finding an ordering that minimizes the induced width is NP-Hard

However,
• There are reasonable heuristics for finding “relatively” good orderings
• There are provable approximations to the best induced width
• If the graph has a small induced width, there are algorithms that find it in polynomial time
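One common such heuristic is greedy min-degree with fill-in; a small sketch, assuming the graph is given as a symmetric adjacency dictionary:

```python
def min_degree_order(adj):
    """Greedy min-degree elimination ordering.

    adj: dict mapping node -> set of neighbors (symmetric).
    Returns (ordering, largest clique size formed), the latter being
    1 + the induced width of this ordering.
    """
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    order, width = [], 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))   # node with fewest current neighbors
        nbrs = adj[v]
        width = max(width, len(nbrs) + 1)
        for a in nbrs:                            # fill in: connect v's neighbors
            adj[a] |= nbrs - {a}
            adj[a].discard(v)
        del adj[v]
        order.append(v)
    return order, width
```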

Page 90: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Elimination on Trees

• Formally, for any tree, there is an elimination ordering with induced width = 1

Thm:
• Inference on trees is linear in the number of variables

Page 91: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

PolyTrees

• A polytree is a network where there is at most one path from one variable to another

Thm:
• Inference in a polytree is linear in the representation size of the network
  – This assumes a tabular CPT representation

[Diagram: a polytree over A, B, C, D, E, F, G, H]

Page 92: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Approaches to inference

• Exact inference
  – Inference in simple chains
  – Variable elimination
  – Clustering / join tree algorithms

• Approximate inference
  – Stochastic simulation / sampling methods
  – Markov chain Monte Carlo methods
  – Mean field theory

Page 93: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Learning Bayesian Networks

Page 94: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Learning Bayesian networks

Data + Prior information ⟶ Inducer ⟶ Bayesian network

[Output: the alarm network over E, B, R, A, C together with its CPTs; e.g. P(A | E, B) is a table with one row of probabilities (.9/.1, .7/.3, .99/.01, .8/.2) per configuration of E and B]

Page 95: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Known Structure -- Complete Data

E, B, A
<Y,N,N>
<Y,Y,Y>
<N,N,Y>
<N,Y,Y>
 . .
<N,Y,Y>

⟶ Inducer ⟶

[Network E → A ← B with CPT P(A | E, B): the initially unknown “?” entries are filled in with estimates]

• Network structure is specified
  – Inducer needs to estimate parameters
• Data does not contain missing values

Page 96: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Unknown Structure -- Complete Data

E, B, A
<Y,N,N>
<Y,Y,Y>
<N,N,Y>
<N,Y,Y>
 . .
<N,Y,Y>

⟶ Inducer ⟶

[A network structure over E, B, A together with estimated CPTs]

• Network structure is not specified
  – Inducer needs to select arcs & estimate parameters
• Data does not contain missing values

Page 97: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Known Structure -- Incomplete Data

E, B, A
<Y,N,N>
<Y,?,Y>
<N,N,Y>
<N,Y,?>
 . .
<?,Y,Y>

⟶ Inducer ⟶

[Network E → A ← B with estimated CPT P(A | E, B)]

• Network structure is specified
• Data contains missing values
  – We consider assignments to the missing values

Page 98: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Known Structure / Complete Data

• Given a network structure G
  – and a choice of parametric family for P(Xi | Pai)

• Learn parameters for the network

Goal:
• Construct a network that is “closest” to the probability distribution that generated the data

Page 99: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Learning Parameters for a Bayesian Network

[Network: E → A ← B, A → C]

• Training data has the form:
  D = { <E[1], B[1], A[1], C[1]>, …, <E[M], B[M], A[M], C[M]> }

Page 100: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Learning Parameters for a Bayesian Network

[Network: E → A ← B, A → C]

• Since we assume i.i.d. samples, the likelihood function is

  L(Θ : D) = ∏m P(E[m], B[m], A[m], C[m] : Θ)

Page 101: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Learning Parameters for a Bayesian Network

[Network: E → A ← B, A → C]

• By the definition of the network, we get

  L(Θ : D) = ∏m P(E[m], B[m], A[m], C[m] : Θ)
           = ∏m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | B[m], E[m] : Θ) P(C[m] | A[m] : Θ)

Page 102: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Learning Parameters for a Bayesian Network

[Network: E → A ← B, A → C]

• Rewriting terms, we get

  L(Θ : D) = ∏m P(E[m], B[m], A[m], C[m] : Θ)
           = [∏m P(E[m] : Θ)] · [∏m P(B[m] : Θ)] · [∏m P(A[m] | B[m], E[m] : Θ)] · [∏m P(C[m] | A[m] : Θ)]

Page 103: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

General Bayesian Networks

Generalizing for any Bayesian network:

  L(Θ : D) = ∏m P(x1[m],…,xn[m] : Θ)
           = ∏m ∏i P(xi[m] | Pai[m] : Θi)      (i.i.d. samples)
           = ∏i [∏m P(xi[m] | Pai[m] : Θi)]     (network factorization)
           = ∏i L(Θi : D)

• The likelihood decomposes according to the structure of the network.

Page 104: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

General Bayesian Networks (Cont.)

Decomposition ⟹ independent estimation problems

If the parameters for each family are not related, then they can be estimated independently of each other.

Page 105: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

From Binomial to Multinomial

• For example, suppose X can take the values 1, 2, …, K
• We want to learn the parameters θ1, θ2, …, θK

Sufficient statistics:
• N1, N2, …, NK – the number of times each outcome is observed

Likelihood function:
  L(Θ : D) = ∏k=1..K θk^Nk

MLE:
  θ̂k = Nk / N

Page 106: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Likelihood for Multinomial Networks

• When we assume that P(Xi | Pai) is multinomial, we get a further decomposition:

  L(Θi : D) = ∏m P(xi[m] | pai[m] : Θi)
            = ∏pai ∏xi P(xi | pai : Θi)^N(xi, pai)
            = ∏pai ∏xi θxi|pai^N(xi, pai)

  where N(xi, pai) is the number of samples in which Xi = xi and Pai = pai.

Page 107: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Likelihood for Multinomial Networks

• When we assume that P(Xi | Pai) is multinomial, we get a further decomposition:

  L(Θi : D) = ∏pai ∏xi θxi|pai^N(xi, pai)

• For each value pai of the parents of Xi we get an independent multinomial problem

• The MLE is
  θ̂xi|pai = N(xi, pai) / N(pai)

Page 108: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Bayesian Approach: Dirichlet Priors

• Recall that the likelihood function is
  L(Θ : D) = ∏k=1..K θk^Nk

• A Dirichlet prior with hyperparameters α1,…,αK is defined as
  P(Θ) ∝ ∏k=1..K θk^(αk - 1)        for legal θ1,…,θK

• Then the posterior has the same form, with hyperparameters α1+N1, …, αK+NK:
  P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ ∏k θk^(αk - 1) ∏k θk^Nk = ∏k θk^(αk + Nk - 1)

Page 109: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Dirichlet Priors (cont.)

• We can compute the prediction on a new event in closed form:

• If P(Θ) is Dirichlet with hyperparameters α1,…,αK, then
  P(X[1] = k) = ∫ θk P(Θ) dΘ = αk / Σl αl

• Since the posterior is also Dirichlet, we get
  P(X[M+1] = k | D) = ∫ θk P(Θ | D) dΘ = (αk + Nk) / Σl (αl + Nl)

Page 110: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Prior Knowledge

• The hyperparameters α1,…,αK can be thought of as “imaginary” counts from our prior experience

• Equivalent sample size = α1 + … + αK

• The larger the equivalent sample size, the more confident we are in our prior

Page 111: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Bayesian Prediction(cont.)

• Given these observations, we can compute the posterior for each multinomial Θ_{Xi|pai} independently
  – The posterior is Dirichlet with parameters
    α(Xi=1 | pai) + N(Xi=1 | pai), …, α(Xi=k | pai) + N(Xi=k | pai)

• The predictive distribution is then represented by the parameters
  θ̃xi|pai = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))

Page 112: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Learning Parameters: Summary

• Estimation relies on sufficient statistics
  – For multinomials these are of the form N(xi, pai)
  – Parameter estimation:

    MLE:                  θ̂xi|pai = N(xi, pai) / N(pai)
    Bayesian (Dirichlet): θ̃xi|pai = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))

• Bayesian methods also require a choice of priors
• Both MLE and Bayesian estimates are asymptotically equivalent and consistent
• Both can be implemented in an on-line manner by accumulating sufficient statistics

(a small numeric sketch follows)
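A small numeric sketch of the two estimators for one family, using made-up counts and a uniform Dirichlet prior:

```python
import numpy as np

# Sufficient statistics N(x_i, pa_i) for a binary X with a binary parent (made-up counts):
N = np.array([[30., 10.],    # counts of (X=0, X=1) when Pa = 0
              [ 5., 15.]])   # counts of (X=0, X=1) when Pa = 1
alpha = np.ones_like(N)      # Dirichlet hyperparameters ("imaginary counts")

mle      = N / N.sum(axis=1, keepdims=True)                      # N(x,pa) / N(pa)
bayesian = (N + alpha) / (N + alpha).sum(axis=1, keepdims=True)  # (a+N)/(a(pa)+N(pa))
print(mle)
print(bayesian)
```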

Page 113: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Learning Structure from Complete Data

Page 114: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Why Struggle for Accurate Structure?

Adding an arc:
• Increases the number of parameters to be fitted
• Wrong assumptions about causality and domain structure

Missing an arc:
• Cannot be compensated for by accurate fitting of parameters
• Also misses causality and domain structure

[Diagrams: the network Earthquake → Alarm Set ← Burglary, Alarm Set → Sound, next to a variant with an added arc and a variant with a missing arc]

Page 115: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Approaches to Learning Structure

• Constraint based
  – Perform tests of conditional independence
  – Search for a network that is consistent with the observed dependencies and independencies

• Pros & Cons
  – Intuitive, follows closely the construction of BNs
  – Separates structure learning from the form of the independence tests
  – Sensitive to errors in individual tests

Page 116: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Approaches to Learning Structure

• Score based
  – Define a score that evaluates how well the (in)dependencies in a structure match the observations
  – Search for a structure that maximizes the score

• Pros & Cons
  – Statistically motivated
  – Can make compromises
  – Takes the structure of conditional probabilities into account
  – Computationally hard

Page 117: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Likelihood Score for Structures

First cut approach:
  – Use the likelihood function

• Recall, the likelihood score for a network structure and parameters is

  L(G, ΘG : D) = ∏m P(x1[m],…,xn[m] : ΘG, G) = ∏m ∏i P(xi[m] | PaiG[m] : ΘG,i, G)

• Since we know how to maximize parameters, from now on we assume

  L(G : D) = maxΘG L(G, ΘG : D)

Page 118: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Avoiding Overfitting

“Classic” issue in learning. Approaches:
• Restricting the hypothesis space
  – Limits the overfitting capability of the learner
  – Example: restrict # of parents or # of parameters
• Minimum description length
  – Description length measures complexity
  – Prefer models that compactly describe the training data
• Bayesian methods
  – Average over all possible parameter values
  – Use prior knowledge

Page 119: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Bayesian Inference

• Bayesian reasoning --- compute the expectation over the unknown G

  P(x[M+1] | D) = ΣG P(x[M+1] | G, D) P(G | D)

• Assumption: Gs are mutually exclusive and exhaustive

• We know how to compute P(x[M+1] | G, D)
  – Same as prediction with a fixed structure

• How do we compute P(G | D)?

Page 120: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Using Bayes rule:

  P(G | D) = P(D | G) P(G) / P(D)

  – P(D | G): marginal likelihood
  – P(G): prior over structures
  – P(D): probability of the data
  – P(G | D): posterior score

P(D) is the same for all structures G, so it can be ignored when comparing structures.

Page 121: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Marginal Likelihood

• By introducing the parameters Θ and integrating them out, we have that

  P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ

  – P(D | G, Θ): likelihood
  – P(Θ | G): prior over parameters

• This integral measures sensitivity to the choice of parameters

Page 122: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Marginal Likelihood for General Network

The marginal likelihood has the form:

  P(D | G) = ∏i ∏paiG [ Γ(α(paiG)) / Γ(α(paiG) + N(paiG)) ] ∏xi [ Γ(α(xi, paiG) + N(xi, paiG)) / Γ(α(xi, paiG)) ]

where
• N(..) are the counts from the data
• α(..) are the hyperparameters for each family given G

Each inner term is the Dirichlet marginal likelihood for the sequence of values of Xi observed when Xi's parents have a particular value.

Page 123: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Priors

• We need prior counts α(..) for each network structure G

• This can be a formidable task
  – There are exponentially many structures…

Page 124: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

BDe Score

Possible solution: the BDe prior

• Represent the prior using two elements M0, B0
  – M0: equivalent sample size
  – B0: a network representing the prior probability of events

Page 125: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

BDe Score

Intuition: M0 prior examples distributed according to B0

• Set α(xi, paiG) = M0 · P(xi, paiG | B0)
  – Note that paiG are not necessarily the same as the parents of Xi in B0
  – Compute P(xi, paiG | B0) using standard inference procedures

• Such priors have desirable theoretical properties
  – Equivalent networks are assigned the same score

Page 126: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Bayesian Score: Asymptotic Behavior

Theorem: If the prior P(Θ | G) is “well-behaved”, then

  log P(D | G) = ℓ(G : D) - (log M / 2) dim(G) + O(1)

where ℓ(G : D) is the log-likelihood of G with maximum-likelihood parameters, dim(G) is the number of parameters, and M is the number of samples.

Page 127: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Asymptotic Behavior: Consequences

• The Bayesian score is consistent
  – As M → ∞, the “true” structure G* maximizes the score (almost surely)
  – For sufficiently large M, the maximal-scoring structures are equivalent to G*

• Observed data eventually overrides prior information
  – Assuming that the prior assigns positive probability to all cases

  log P(D | G) = ℓ(G : D) - (log M / 2) dim(G) + O(1)

Page 128: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Asymptotic Behavior

• This score can also be justified by the Minimum Description Length (MDL) principle

  Score(G : D) = ℓ(G : D) - (log M / 2) dim(G)

• This equation explicitly shows the tradeoff between
  – Fitness to data --- the likelihood term
  – Penalty for complexity --- the regularization term

Page 129: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Scores -- Summary

• Likelihood, MDL, and (log) BDe scores all have the form

  Score(G : D) = Σi Score(Xi | PaiG : N(Xi, PaiG))

• BDe requires assessing a prior network. It can naturally incorporate prior knowledge and previous experience

• BDe is consistent and asymptotically equivalent (up to a constant) to MDL

• All are score-equivalent
  – G equivalent to G' ⟹ Score(G) = Score(G')

Page 130: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Optimization Problem

Input:
  – Training data
  – Scoring function (including priors, if needed)
  – Set of possible structures
    • Including prior knowledge about structure

Output:
  – A network (or networks) that maximize the score

Key Property:
  – Decomposability: the score of a network is a sum of terms.

Page 131: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Heuristic Search

We address the problem by using heuristic search

• Define a search space:
  – nodes are possible structures
  – edges denote adjacency of structures
• Traverse this space looking for high-scoring structures

Search techniques:
  – Greedy hill-climbing
  – Best-first search
  – Simulated annealing
  – ...

Page 132: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Heuristic Search (cont.)

• Typical operations on a structure over S, C, E, D:
  – Add an arc (e.g. add C → D)
  – Delete an arc (e.g. delete C → E)
  – Reverse an arc (e.g. reverse C → E)

[Diagrams: a network over S, C, E, D and the three neighboring networks produced by these operations]

Page 133: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Exploiting Decomposability in Local Search

• Caching: To update the score after a local change, we only need to re-score the families that were changed in the last move

[Diagrams: a network over S, C, E, D and its neighbors; only the changed family is re-scored]

Page 134: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Greedy Hill-Climbing

• Simplest heuristic local search
  – Start with a given network
    • empty network
    • best tree
    • a random network
  – At each iteration
    • Evaluate all possible changes
    • Apply the change that leads to the best improvement in score
    • Reiterate
  – Stop when no modification improves the score

• Each step requires evaluating approximately n new changes

(a sketch follows below)
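A sketch of greedy hill-climbing over edge operations (add/delete/reverse), assuming a scoring function score(edges, data) is supplied (e.g. BIC or BDe); for simplicity it re-scores whole candidate structures rather than caching family scores.

```python
import itertools

def creates_cycle(edges, new_edge):
    """Would adding new_edge = (u, v) create a directed cycle (i.e. is u reachable from v)?"""
    u, v = new_edge
    frontier, seen = [v], set()
    while frontier:
        n = frontier.pop()
        if n == u:
            return True
        if n not in seen:
            seen.add(n)
            frontier += [b for (a, b) in edges if a == n]
    return False

def neighbors(edges, nodes):
    """All structures reachable by one add, delete, or reverse operation."""
    for u, v in itertools.permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                              # delete u -> v
            if not creates_cycle(edges - {(u, v)}, (v, u)):
                yield (edges - {(u, v)}) | {(v, u)}              # reverse u -> v
        elif not creates_cycle(edges, (u, v)):
            yield edges | {(u, v)}                               # add u -> v

def hill_climb(nodes, data, score, start=None):
    current = set(start or [])
    current_score = score(current, data)
    while True:
        best, best_score = None, current_score
        for cand in neighbors(current, nodes):
            s = score(cand, data)
            if s > best_score:
                best, best_score = cand, s
        if best is None:                 # no single-edge change improves the score
            return current
        current, current_score = best, best_score
```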

Page 135: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Greedy Hill-Climbing: Possible Pitfalls

• Greedy hill-climbing can get stuck in:
  – Local maxima:
    • All one-edge changes reduce the score
  – Plateaus:
    • Some one-edge changes leave the score unchanged
    • Happens because equivalent networks receive the same score and are neighbors in the search space

• Both occur during structure search
• Standard heuristics can escape both
  – Random restarts
  – TABU search

Page 136: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Search: Summary

• Discrete optimization problem
• In general, NP-Hard
  – Need to resort to heuristic search
  – In practice, search is relatively fast (~100 vars in ~10 min) thanks to:
    • Decomposability
    • Sufficient statistics

• In some cases, we can reduce the search problem to an easy optimization problem
  – Example: learning trees

Page 137: Graphical Models: An Introduction Lise Getoor Computer Science Dept University of Maryland getoor

Graphical Models Intro Summary

• Representations
  – Graphs are a cool way to put constraints on distributions, so that you can say lots of things without even looking at the numbers!

• Inference
  – Graphical models let you compute all kinds of different probabilities efficiently

• Learning
  – You can even learn them auto-magically!