Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012
Le Song
Lecture 17, Oct 25, 2012
Reading: Chap 8, C. Bishop Book
Undirected Graphical Models or Markov Random Fields

Reading conditional independence from a UGM:
Global Markov independence: A ⊥ B | C, independence based on graph separation.
Local Markov independence: X ⊥ X_rest | A, B, C, D, where {A, B, C, D} is the Markov blanket of X.

[Figure: an undirected graph in which X is surrounded by its Markov blanket {A, B, C, D}, and a chain A-C-B in which C separates A from B.]

Edges encode dependence between variables, but carry no causal meaning, and generating samples is more complicated than in a directed model.
Maximal Cliques

For G = (V, E), a complete subgraph (clique) is a subgraph G' = (V' ⊆ V, E' ⊆ E) such that the nodes in V' are fully connected.
A maximal clique is a complete subgraph such that any strict superset V'' ⊃ V' is not fully connected.
Example:
Maximal cliques = {A, B, C}, {A, B, D}
Sub-cliques = {A}, {B}, {A, B}, {A, C}, ... (all edges and singletons)

[Figure: graph on A, B, C, D with edges A-B, A-C, B-C, A-D, B-D.]
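Enumerating maximal cliques can be sketched with the basic Bron-Kerbosch algorithm; the adjacency dict below is an assumed encoding of the example graph (edges A-B, A-C, B-C, A-D, B-D):

```python
def maximal_cliques(adj):
    """Basic Bron-Kerbosch: grow clique R from candidates P, excluding X."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))   # R is maximal: nothing can extend it
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# The slide's example graph, encoded as an adjacency dict (an assumption).
adj = {"A": {"B", "C", "D"}, "B": {"A", "C", "D"},
       "C": {"A", "B"}, "D": {"A", "B"}}
cliques = maximal_cliques(adj)   # the two maximal cliques {A,B,C} and {A,B,D}
```

Running this on the example recovers exactly the two maximal cliques listed above; every edge and singleton is a sub-clique but is absorbed into one of them.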
MN: Conditional independence vs. distribution factorization

An MN encodes the global Markov assumptions I(G). Two directions relate them to a distribution P:

If the global conditional independencies of the MN are a subset of the conditional independencies of P (I(G) ⊆ I(P), i.e., the MN is an I-map of P), can the joint probability be written as
  P(X1, ..., Xn) = (1/Z) ∏_i Ψ_i(D_i),
i.e., does P factorize according to the MN?

Conversely, if the joint probability can be written as
  P(X1, ..., Xn) = (1/Z) ∏_i Ψ_i(D_i)
(P factorizes according to the MN), are the global conditional independencies of the MN a subset of those of P (I(G) ⊆ I(P), i.e., the MN is an I-map of P)?

Every P has at least one MN structure G; we then read independencies of P from the MN structure G.
Counter Example

X1, ..., X4 are binary, and only eight assignments have positive probability (each with probability 1/8):
(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0), (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)

This distribution satisfies the independencies of the 4-cycle MN, e.g. X1 ⊥ X3 | X2, X4.
But the distribution does not factorize over the cycle! E.g., P(0,0,1,0) = 0 would force
  (1/Z) Ψ12(0,0) Ψ23(0,1) Ψ34(1,0) Ψ14(0,0) = 0,
so some factor entry must be zero, contradicting the positive-probability assignments that use the same entries.

[Figure: 4-cycle MN X1-X2-X3-X4-X1.]

P(X2, X4 | X1, X3):
  X1X3 \ X2X4:   00    01    10    11
  00             1/2   1/2   0     0
  01             0     1/2   0     1/2
  10             1/2   0     1/2   0
  11             0     0     1/2   1/2

P(X1 | X2, X4):
  X2X4:          00    01    10    11
  X1 = 0         1/2   1     0     1/2
  X1 = 1         1/2   0     1     1/2

P(X3 | X2, X4):
  X2X4:          00    01    10    11
  X3 = 0         1     1/2   1/2   0
  X3 = 1         0     1/2   1/2   1
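A brute-force sketch (using the distribution above) confirms that X1 ⊥ X3 | X2, X4 holds even though the distribution does not factorize over the cycle:

```python
from itertools import product

# The eight equiprobable assignments (x1, x2, x3, x4) from the slide.
support = [(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0),
           (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)]
P = {x: (1/8 if x in support else 0.0) for x in product((0, 1), repeat=4)}

def marg(dist, keep):
    """Marginal distribution over the variable indices in `keep`."""
    out = {}
    for x, p in dist.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# X1 _|_ X3 | X2, X4 iff P(x) * P(x2,x4) == P(x1,x2,x4) * P(x2,x3,x4) for all x.
p24, p124, p234 = marg(P, (1, 3)), marg(P, (0, 1, 3)), marg(P, (1, 2, 3))
indep = all(abs(P[(a, b, c, d)] * p24[(b, d)]
                - p124[(a, b, d)] * p234[(b, c, d)]) < 1e-12
            for (a, b, c, d) in P)
```

The check passes, so the independence really does hold; the contradiction lies only in trying to write P as a product of edge potentials.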
Markov Network Representation

For a strictly positive distribution P, i.e. P(X = x) > 0 for all x:
If the global conditional independencies of the MN are a subset of the conditional independencies of P (I(G) ⊆ I(P), i.e., the MN is an I-map of P), then the joint probability can be written as
  P(X1, ..., Xn) = (1/Z) ∏_i Ψ_i(D_i),
i.e., P factorizes according to the MN.
This is known as the Hammersley-Clifford theorem.
Minimal I-maps and Markov networks

A fully connected graph is an I-map for every distribution.
Recall minimal I-maps: deleting any edge makes the graph no longer an I-map.
For a Bayesian network, the minimal I-map is not unique.
For strictly positive distributions and Markov networks, the minimal I-map is unique!
There are many ways to find the minimal I-map, e.g.:
Take the pairwise Markov assumption X_i ⊥ X_j | X_rest; if P does not entail it, add the edge (X_i, X_j).
How about a perfect map?
Perfect maps?

A perfect map (P-map): the independencies in the graph are exactly those in P.
For Bayesian networks, a P-map does not always exist.
Counterexample: the swinging couple of variables.
How about for Markov networks? A P-map does not always exist either.
Counterexample: the v-structure.

[Figure: BN v-structure A → C ← F encodes A ⊥ F and ¬(A ⊥ F | C); its minimal I-map MN is the triangle over A, C, F, which is not a P-map: it cannot encode A ⊥ F.]
Some common Markov networks

Pairwise Markov networks
Exponential family (or log-linear) models
Factor graphs
Pairwise Markov Networks

All factors are over single variables or pairs of variables:
Node potentials Ψ_i(X_i)
Edge potentials Ψ_ij(X_i, X_j)
Factorization:
  P(X) = (1/Z) ∏_{i ∈ V} Ψ_i(X_i) ∏_{(i,j) ∈ E} Ψ_ij(X_i, X_j)
Note that there may be bigger cliques in the graph, but we only use pairwise potentials.

[Figure: graph on A, B, C, D with edges A-B, A-C, B-C, A-D, B-D.]
An example

Maximal clique specification (two 3-way tables):
  P(A, B, C, D) = (1/Z) Ψ1(A, B, C) Ψ2(A, B, D)
  Z = ∑_{A,B,C,D} Ψ1(A, B, C) Ψ2(A, B, D)

Pairwise specification (five 2-way tables):
  P'(A, B, C, D) = (1/Z) Ψ(A,C) Ψ(B,C) Ψ(A,B) Ψ(A,D) Ψ(B,D)

What is the relation between I(P) and I(P')?
  I(fully connected) ⊆ I(P) ⊆ I(P') ⊆ I(disconnected)
Both specifications are consistent with the same graph.

[Figure: graph on A, B, C, D with edges A-B, A-C, B-C, A-D, B-D, shown once for the maximal-clique specification and once for the pairwise specification.]
Applications of Pairwise Markov Networks

Image segmentation: separate foreground (fg) from background (bg).
Graph structure: a grid with one node per pixel.
Pairwise Markov network:
Node potentials (background color vs. foreground color):
  Ψ(X_i = fg, I_i) = exp(−||I_i − μ_fg||²)
  Ψ(X_i = bg, I_i) = exp(−||I_i − μ_bg||²)
Edge potentials (neighbors likely have the same label):
  Ψ(X_i, X_j):
              X_j = Fg   X_j = Bg
  X_i = Fg       10          1
  X_i = Bg        1         10
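A minimal sketch of these potentials on a hypothetical two-pixel "image" (the intensities and class means are made up; the edge table is the 10/1 table above):

```python
import math

mu_fg, mu_bg = 0.9, 0.1        # hypothetical foreground/background mean colors
pixels = [0.85, 0.2]           # hypothetical intensities for a 2-pixel image
edge_pot = {("fg", "fg"): 10.0, ("bg", "bg"): 10.0,
            ("fg", "bg"): 1.0, ("bg", "fg"): 1.0}

def node_pot(label, intensity):
    """Psi(X_i = label, I_i) = exp(-||I_i - mu_label||^2)."""
    mu = mu_fg if label == "fg" else mu_bg
    return math.exp(-(intensity - mu) ** 2)

def score(labels):
    """Unnormalized probability of a joint labeling of the two pixels."""
    s = node_pot(labels[0], pixels[0]) * node_pot(labels[1], pixels[1])
    return s * edge_pot[(labels[0], labels[1])]

labelings = [(a, b) for a in ("fg", "bg") for b in ("fg", "bg")]
best = max(labelings, key=score)
```

With these numbers the smoothing edge potential pulls the ambiguous second pixel to the same label as the first, which is exactly the behavior the 10-vs-1 table is meant to encode.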
Exponential Form

Standard model:
  P(X1, ..., Xn) = (1/Z) ∏_i Ψ_i(D_i)
Assuming strictly positive potentials:
  P(X1, ..., Xn) = (1/Z) ∏_i Ψ_i(D_i)
               = (1/Z) ∏_i exp(log Ψ_i(D_i))
               = (1/Z) exp(∑_i log Ψ_i(D_i))
               = (1/Z) exp(∑_i Φ_i(D_i))      where Φ_i = log Ψ_i
We can maintain tables Φ_i(D_i) (which can have negative entries) rather than tables Ψ_i(D_i) (strictly positive entries).
Exponential Form: Log-linear Models

Features are functions f(D) of a subset of variables D.
A log-linear model over a Markov network G:
A set of features f1(D1), ..., fk(Dk)
  Each D_i is a subset of a clique in G
  E.g., in a pairwise model, D_i = {X_i, X_j}
  Two features can be over the same variables (D_i = D_j is allowed)
A set of weights w1, ..., wk, usually learned from data

  P(X1, ..., Xn) = (1/Z) exp( ∑_{i=1}^k w_i f_i(D_i) )
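A tiny log-linear model can be sketched with made-up agreement features on a 3-variable chain (the features, weights, and chain structure are assumptions for illustration):

```python
import math
from itertools import product

# Hypothetical features over binary X1, X2, X3 on the chain X1 - X2 - X3:
# one agreement indicator per edge, both with the same weight.
def f1(x): return 1.0 if x[0] == x[1] else 0.0    # scope D1 = {X1, X2}
def f2(x): return 1.0 if x[1] == x[2] else 0.0    # scope D2 = {X2, X3}
weights = [(f1, 1.5), (f2, 1.5)]

def unnorm(x):
    """exp(sum_i w_i f_i(D_i)): the unnormalized probability."""
    return math.exp(sum(w * f(x) for f, w in weights))

Z = sum(unnorm(x) for x in product((0, 1), repeat=3))   # partition function
P = {x: unnorm(x) / Z for x in product((0, 1), repeat=3)}
```

Agreeing assignments such as (0,0,0) and (1,1,1) fire both features and so receive the highest probability; the brute-force Z is feasible only because the model is tiny.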
MN: Gaussian Graphical Models

A Gaussian distribution can be represented by an MN with pairwise edge potentials over continuous variable nodes (in general a fully connected graph).
The overall exponential form is:
  P(X1, ..., Xn) ∝ exp( −(1/2) ∑_{i,j} (X_i − μ_i) (Σ⁻¹)_ij (X_j − μ_j) )
               = exp( −(1/2) (X − μ)ᵀ Σ⁻¹ (X − μ) )
Also known as Gaussian graphical models (GGM).
Sparse precision vs. sparse covariance in GGM

Chain GGM: X1 - X2 - X3 - X4 - X5

Σ⁻¹ (tridiagonal, sparse):
  1 6 0 0 0
  6 2 7 0 0
  0 7 3 8 0
  0 0 8 4 9
  0 0 0 9 5

Σ (dense):
   0.10  0.15 −0.13 −0.08  0.15
   0.15 −0.03  0.02  0.01 −0.03
  −0.13  0.02  0.10  0.07 −0.12
  −0.08  0.01  0.07 −0.04  0.07
   0.15 −0.03 −0.12  0.07  0.08

(Σ⁻¹)_15 = 0 ⇔ X1 ⊥ X5 | X_rest (zeros in the precision matrix encode conditional independencies)
Σ_15 = 0 ⇔ X1 ⊥ X5 (zeros in the covariance matrix encode marginal independencies)
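The precision/covariance contrast can be reproduced numerically; the tridiagonal numbers below are made up so that the matrix is positive definite (the slide's entries are only illustrative):

```python
import numpy as np

# Chain GGM X1 - X2 - X3 - X4 - X5: the precision matrix is tridiagonal.
K = np.diag([2.0] * 5) + np.diag([-0.8] * 4, k=1) + np.diag([-0.8] * 4, k=-1)
Sigma = np.linalg.inv(K)

# K[0, 4] == 0 encodes X1 _|_ X5 | X2, X3, X4, yet the covariance is dense:
# Sigma[0, 4] != 0, so X1 and X5 remain marginally dependent.
```

This is the chain-GGM phenomenon on the slide: a sparse precision matrix (the graph) inverts to a dense covariance, so conditional independencies are visible in Σ⁻¹ but not in Σ.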
Factor Graph

The same graph supports different factorizations:
Maximal clique specification:
  P(A, B, C, D) = (1/Z) Ψ1(A, B, C) Ψ2(A, B, D)
Pairwise Markov network:
  P'(A, B, C, D) = (1/Z) Ψ(A,C) Ψ(B,C) Ψ(A,B) Ψ(A,D) Ψ(B,D)
We cannot look at the graph alone and tell which factorization is being used.
A factor graph makes this explicit in graphical form.

[Figure: graph on A, B, C, D with edges A-B, A-C, B-C, A-D, B-D.]
Factor Graph

Makes factor dependency explicit; useful for later inference.
A bipartite graph:
  Variable nodes (circles) for X1, ..., Xn
  Factor nodes (squares) for Ψ1, ..., Ψm
  Edge X_i - Ψ_j iff X_i ∈ D_j (the scope of Ψ_j(D_j))

For (1/Z) Ψ1(A, B, C) Ψ2(A, B, D): variable nodes A, B, C, D and factor nodes Ψ1 (joined to A, B, C) and Ψ2 (joined to A, B, D).
For (1/Z) Ψ1(A,B) Ψ2(A,C) Ψ3(B,C) Ψ4(A,D) Ψ5(B,D): variable nodes A, B, C, D and one factor node per pairwise potential.
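The bipartite structure can be sketched as a plain adjacency map from variables to the factors whose scope contains them (factor names follow the pairwise example above):

```python
# One factor node per pairwise potential, with its scope of variable nodes.
factors = {"Psi1": ("A", "B"), "Psi2": ("A", "C"), "Psi3": ("B", "C"),
           "Psi4": ("A", "D"), "Psi5": ("B", "D")}

# Edge X - Psi_j iff X is in the scope D_j of Psi_j.
var_to_factors = {}
for name, scope in factors.items():
    for v in scope:
        var_to_factors.setdefault(v, set()).add(name)
```

Message-passing algorithms covered later operate on exactly this kind of variable-to-factor adjacency.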
Conditional Random Fields

Focus on the conditional distribution P(Y1, ..., Yn | X1, ..., Xn).
Do not explicitly model dependence among X1, ..., Xn; only model the relations between X and Y and among the Y's.
Example (chain CRF):
  P(Y1, Y2, Y3 | X1, X2, X3) = (1/Z(X1, X2, X3)) Ψ(Y1, Y2, X1, X2) Ψ(Y2, Y3, X2, X3)
  Z(X1, X2, X3) = ∑_{y1,y2,y3} Ψ(y1, y2, X1, X2) Ψ(y2, y3, X2, X3)
Note that the partition function depends on the observations X.

[Figure: chain of label nodes Y1 - Y2 - Y3, connected to observation nodes X1, X2, X3.]
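A brute-force sketch of the observation-dependent partition function Z(X) for this chain CRF; the potential below is a made-up example rewarding label/observation and label/label agreement:

```python
import math
from itertools import product

def psi(y_a, y_b, x_a, x_b):
    """Hypothetical potential Psi(Y_a, Y_b, X_a, X_b) over binary values."""
    return math.exp((y_a == x_a) + (y_b == x_b) + 0.5 * (y_a == y_b))

def conditional(y, x):
    """P(y | x) = psi(...) psi(...) / Z(x); note that Z depends on x."""
    def unnorm(yy):
        return psi(yy[0], yy[1], x[0], x[1]) * psi(yy[1], yy[2], x[1], x[2])
    z = sum(unnorm(yy) for yy in product((0, 1), repeat=3))
    return unnorm(y) / z

probs = {y: conditional(y, (1, 1, 0)) for y in product((0, 1), repeat=3)}
```

Because only the conditional is normalized, Z must be recomputed for each observation x; here the most probable labeling simply copies the observations.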
Bayesian vs. Markov Networks

Conditional Independence Assumptions

Markov networks:
  Global Markov assumption: A ⊥ B | C whenever C separates A from B in the graph (sep_G(A, B; C)).
  Derived local and pairwise assumptions:
    X ⊥ X_rest | MB(X)
    X ⊥ Y | X_rest when there is no edge X-Y

Bayesian networks:
  Local Markov assumption: X ⊥ NonDescendants_X | Pa_X
  Independencies are read off via d-separation and active trails.
  E.g., v-structure A → C ← F: A ⊥ F but ¬(A ⊥ F | C).

[Figures: canonical BN trail types (chain, common parent H, v-structure) with their (in)dependence statements; an MN chain A-C-B where C separates A and B; an MN in which MB(X) = {A, B, C, D}.]
Distribution Factorization

Bayesian Networks (Directed Graphical Models), G an I-map of P:
  I_ℓ(G) ⊆ I(P)
  ⇔
  P(X1, ..., Xn) = ∏_{i=1}^n P(X_i | Pa_{X_i})
  The factors are Conditional Probability Tables (CPTs).

Markov Networks (Undirected Graphical Models), for strictly positive P, G an I-map of P:
  I(G) ⊆ I(P)
  ⇔
  P(X1, ..., Xn) = (1/Z) ∏_{i=1}^m Ψ_i(D_i)
  Z = ∑_{x1,...,xn} ∏_{i=1}^m Ψ_i(D_i)
  The factors are (maximal) clique potentials; Z is the normalization (partition function).
Representation Power

Bayesian networks:
  Minimal I-map is not unique.
  Do not always have a P-map; e.g., the 4-cycle MN with X1 ⊥ X3 | X2, X4 and X2 ⊥ X4 | X1, X3 has no BN P-map.
Markov networks:
  Minimal I-map is unique (for strictly positive P).
  Do not always have a P-map; e.g., the v-structure with A ⊥ F and ¬(A ⊥ F | C) has no MN P-map.

Can we convert between the two representations?
Is there a BN that is a P-map for a given MN?

The MN for the swinging couple of variables does not have a P-map as a BN.
The 4-cycle MN over X1, X2, X3, X4 encodes X1 ⊥ X3 | X2, X4 and X2 ⊥ X4 | X1, X3.

[Figure: candidate BN orientations of the 4-cycle; each fails to encode both X1 ⊥ X3 | X2, X4 and X2 ⊥ X4 | X1, X3 simultaneously.]
Is there an MN that is a P-map for a given BN?

The BN for the v-structure does not have a P-map as an MN.
The v-structure A → C ← F encodes A ⊥ F and ¬(A ⊥ F | C).

[Figure: candidate MNs over A, C, F; each either wrongly encodes A ⊥ F | C or fails to encode A ⊥ F.]
Conversion using Minimal I-maps instead

Instead of attempting P-maps between BNs and MNs, we can use minimal I-maps for conversion.
Recall: G is a minimal I-map for P if
  I(G) ⊆ I(P), and
  removal of any single edge in G renders it no longer an I-map.
Note: if G is a minimal I-map of P, G need not satisfy all conditional independence relations in P.
Conversion from BN to MN

In an MN, the Markov blanket MB(X_i) of X_i is the set of immediate neighbors of X_i in the graph:
  X_i ⊥ X − {X_i} − MB(X_i) | MB(X_i), for all i
What is the Markov blanket in a BN?

[Figure: a BN where X has parents A, B and children C, D; the children have other parents E and F; nodes G, H, I, J lie further away. In the analogous MN, MB(X) = {A, B, C, D}; in the BN, ¬(X ⊥ {E, F} | A, B, C, D), so parents and children alone do not suffice.]

Strategy: go outward from X and try to block all active trails to X.
Markov blanket for BN

MB(X) = {A, B, C, D, E, F}: parents, children, and the children's other parents.
  X ⊥ X − {X} − MB(X) | MB(X)

[Figure: the same BN, with the active trails from X blocked one by one by conditioning on the blanket nodes.]
Markov Blanket for BN

MB(X) in a BN is the set of nodes consisting of X's parents, X's children, and the other parents of X's children.
The moral graph M(G) of a BN G is an undirected graph that contains an undirected edge between X and Y if:
  there is a directed edge between them in either direction, or
  X and Y are parents of a common child.
The moral graph ensures that MB(X) is exactly the set of neighbors of X in the undirected graph M(G).

[Figure: a BN G over A, B, C, D and its moralization M(G), which adds an edge between the common parents.]
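Moralization and the resulting Markov blankets can be sketched directly; the DAG below is an assumed encoding of the slide's example (parents A, B → X; children X → C, D; co-parents E → C, F → D):

```python
# DAG given as a child -> parents map (an assumed encoding of the example).
parents = {"X": {"A", "B"}, "C": {"X", "E"}, "D": {"X", "F"},
           "A": set(), "B": set(), "E": set(), "F": set()}

def moralize(parents):
    """Moral graph: keep every edge (undirected) and marry co-parents."""
    nbrs = {v: set() for v in parents}
    for child, ps in parents.items():
        for p in ps:                       # each directed edge p -> child
            nbrs[p].add(child)
            nbrs[child].add(p)
        for a in ps:                       # marry parents of a common child
            for b in ps:
                if a != b:
                    nbrs[a].add(b)
    return nbrs

moral = moralize(parents)
# Neighbors of X in M(G) = parents + children + children's other parents.
```

Reading off moral["X"] gives {A, B, C, D, E, F}, matching the Markov blanket derived on the previous slide.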
Minimal I-map from BNs to MNs

The moral graph M(G) of any BN G is a minimal I-map for G.
Moralization turns each family {X_i} ∪ Pa_{X_i} into a fully connected component.
The CPTs of the BN can be used directly as clique potentials.
The moral graph loses some independence relations:
  e.g., the BN encodes marginal independencies such as A ⊥ B, which cannot be read from the moralized MN.

[Figure: a BN G over A, B, C, D, E; moralization to M(G) adds edges between common parents.]
Minimal I-maps from MNs to BNs

Any BN that is an I-map for an MN must add triangulating edges to the graph.
Intuition:
  V-structures in a BN introduce immoralities.
  These immoralities were not present in the Markov network.
  Triangulation eliminates immoralities.

[Figure: the 4-cycle X1-X2-X3-X4 is triangulated by adding a chord before it can be oriented as a BN.]
Chordal graphs

Let X1 - X2 - ... - Xk - X1 be a loop in a graph. A chord of the loop is an edge connecting two non-consecutive nodes X_i and X_j.
An undirected graph G is chordal if every loop X1 - X2 - ... - Xk - X1 with k ≥ 4 has a chord.
A directed graph G is chordal if its underlying undirected graph is chordal.

[Figure: a chordal graph over A, B, C, D, E, F.]
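The definition suggests a simple check (a sketch): a graph is chordal iff one can repeatedly delete a simplicial vertex, i.e. one whose neighbors form a clique, until no vertices remain:

```python
def is_chordal(adj):
    """Chordality test via simplicial (perfect) elimination ordering."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while adj:
        simplicial = next((v for v, ns in adj.items()
                           if all(b in adj[a] for a in ns for b in ns if a != b)),
                          None)
        if simplicial is None:
            return False        # a chordless cycle has no simplicial vertex
        for n in adj[simplicial]:
            adj[n].discard(simplicial)
        del adj[simplicial]
    return True

# The 4-cycle X1 - X2 - X3 - X4 - X1 has no chord ...
cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
# ... while adding the chord X2 - X4 triangulates it.
chorded = {1: {2, 4}, 2: {1, 3, 4}, 3: {2, 4}, 4: {1, 2, 3}}
```

This works because every chordal graph has a simplicial vertex and every induced subgraph of a chordal graph is chordal, so the greedy elimination never gets stuck on a chordal input.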
Minimal I-maps from MNs to BNs (continued)

Let H be an MN, and let G be any BN minimal I-map for H. Then G can have no immoralities.
  Intuitive reason: immoralities introduce additional independencies that are not in the original MN.
Let G be any BN minimal I-map for H. Then G is necessarily chordal!
  Because any non-triangulated loop of length at least 4 in a Bayesian network necessarily contains an immorality.
The process of adding edges is called triangulation.

[Figure: the v-structure A → C ← F with A ⊥ F, ¬(A ⊥ F | C); a chordal graph over A, B, C, D.]
Summary

[Figure: Venn diagram of the families representable by BNs and MNs; their intersection corresponds to chordal graphs, which include undirected trees. Moralize to convert a BN to an MN; triangulate to convert an MN to a BN.]