
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012

Le Song

Lecture 17, Oct 25, 2012

Reading: Chapter 8, C. Bishop, Pattern Recognition and Machine Learning

Undirected Graphical Models or Markov Random Fields

Reading conditional independence from a UGM:

Global Markov independence: $A \perp B \mid C$, independence based on graph separation.

Local Markov independence: $X \perp \text{TheRest} \mid \{A, B, C, D\}$, where $\{A, B, C, D\}$ is the Markov blanket of $X$.


[Figure: a node $X$ whose Markov blanket is $\{A, B, C, D\}$, and a chain $A - C - B$ illustrating separation.]

Edges encode dependence between variables, but carry no causal meaning, and generating samples is more complicated than in a directed model.

Maximal Cliques

For $G = (V, E)$, a complete subgraph (clique) is a subgraph $G' = (V' \subseteq V, E' \subseteq E)$ such that the nodes in $V'$ are fully connected.

A maximal clique is a complete subgraph such that no superset $V''$ with $V' \subset V''$ is fully connected.

Example:

Maximal cliques = {A,B,C}, {A,B,D}

Sub-cliques = {A}, {B}, {A,B}, {A,C}, ... (all edges and singletons)

[Figure: graph on $A, B, C, D$ with edges $A-B$, $A-C$, $B-C$, $A-D$, $B-D$.]

MN: Conditional independence vs distribution factorization

MN encodes the global Markov assumptions $I_\ell(G)$.

If the global conditional independencies of the MN are a subset of the conditional independencies of $P$, i.e. $I_\ell(G) \subseteq I(P)$ (this MN is an I-map of $P$), then the joint probability $P$ can be written as
$$P(X_1, \dots, X_n) = \frac{1}{Z} \prod_i \Psi_i(D_i)$$
($P$ factorizes according to the MN). As the counterexample below shows, this direction actually requires strict positivity.

Conversely, if the joint probability $P$ can be written as $P(X_1, \dots, X_n) = \frac{1}{Z} \prod_i \Psi_i(D_i)$ ($P$ factorizes according to the MN), then the global conditional independencies of the MN are a subset of the conditional independencies of $P$: $I_\ell(G) \subseteq I(P)$, i.e. this MN is an I-map of $P$.

Every $P$ has at least one MN structure $G$.

Reading the independencies of $P$ from the MN structure $G$:

Counter Example

$X_1, \dots, X_4$ are binary, and only eight assignments have positive probability (each with probability 1/8):

(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0), (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)

E.g., $X_1 \perp X_3 \mid X_2, X_4$ (see the tables below).

But the distribution does not factorize! E.g., $P(0,0,1,0) = 0$, yet a factorization over the diamond graph would give $P(0,0,1,0) = \frac{1}{Z}\, \Psi_{12}(0,0)\, \Psi_{23}(0,1)\, \Psi_{34}(1,0)\, \Psi_{14}(0,0) > 0$, since each of these factor values must be positive (each pairwise configuration occurs in some positive-probability assignment).

[Figure: diamond Markov network $X_1 - X_2 - X_3 - X_4 - X_1$.]

$P(X_1 X_3 \mid X_2 X_4)$ (rows $X_1 X_3$, columns $X_2 X_4$):

X1X3 \ X2X4   00   01   10   11
00            Β½    Β½    0    0
01            0    Β½    0    Β½
10            Β½    0    Β½    0
11            0    0    Β½    Β½

$P(X_1 \mid X_2 X_4)$:

X1 \ X2X4     00   01   10   11
0             Β½    1    0    Β½
1             Β½    0    1    Β½

$P(X_3 \mid X_2 X_4)$:

X3 \ X2X4     00   01   10   11
0             1    Β½    Β½    0
1             0    Β½    Β½    1

Each column of the first table is the product of the corresponding entries of the other two, which verifies $X_1 \perp X_3 \mid X_2, X_4$.
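The failure of factorization is easy to check numerically. Below is a small Python verification (not from the lecture) that this distribution satisfies $X_1 \perp X_3 \mid X_2, X_4$, yet assigns probability zero to $(0,0,1,0)$ even though every pairwise configuration of that assignment occurs in the support:

```python
import itertools

# The eight equiprobable assignments (X1, X2, X3, X4) from the slide.
support = {(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0),
           (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)}
P = {x: (1/8 if x in support else 0.0)
     for x in itertools.product([0, 1], repeat=4)}

def marginal(idx):
    """Marginal distribution over the variables at positions idx."""
    m = {}
    for x, p in P.items():
        key = tuple(x[i] for i in idx)
        m[key] = m.get(key, 0.0) + p
    return m

# X1 _|_ X3 | X2,X4  <=>  P(x) * P(x2,x4) == P(x1,x2,x4) * P(x2,x3,x4)
p24, p124, p234 = marginal((1, 3)), marginal((0, 1, 3)), marginal((1, 2, 3))
ci = all(abs(p * p24[(x[1], x[3])]
             - p124[(x[0], x[1], x[3])] * p234[(x[1], x[2], x[3])]) < 1e-12
         for x, p in P.items())
print("X1 _|_ X3 | X2,X4:", ci)   # True

# Yet P(0,0,1,0) = 0, although each pairwise configuration it uses occurs in
# some positive-probability assignment, so every pairwise factor would have
# to be positive -- no pairwise factorization can represent P.
print("P(0,0,1,0) =", P[(0, 0, 1, 0)])   # 0.0
```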

Markov Network Representation

Hammersley-Clifford Theorem: assume $P$ is strictly positive, i.e. for all $x$, $P(X = x) > 0$.

If the global conditional independencies in the MN are a subset of the conditional independencies in a strictly positive $P$, $I_\ell(G) \subseteq I(P)$ (this MN is an I-map of $P$), then the joint probability can be written as
$$P(X_1, \dots, X_n) = \frac{1}{Z} \prod_i \Psi_i(D_i)$$
i.e. $P$ factorizes according to the MN.

Every $P$ has at least one MN structure $G$.

Minimal I-maps and Markov networks

A fully connected graph is an I-map for any distribution.

Remember minimal I-maps: deleting any single edge makes the graph no longer an I-map.

For Bayesian networks, the minimal I-map is not unique.

For strictly positive distributions and Markov networks, the minimal I-map is unique!

There are many ways to find the minimal I-map, e.g. (a brute-force sketch follows):

Take the pairwise Markov assumption $X \perp Y \mid \text{TheRest}$; if $P$ does not entail it, add the edge $X - Y$.
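As a concrete illustration, here is a brute-force Python sketch (my own, exponential in the number of variables, so only for tiny examples) that recovers the minimal I-map edges of a discrete distribution by testing the pairwise Markov assumption:

```python
import itertools
import math

def minimal_imap_edges(P, n):
    """Recover the (unique) minimal Markov-network I-map of a strictly
    positive distribution P over n binary variables: test the pairwise
    Markov assumption X _|_ Y | TheRest per pair; add an edge if it fails."""
    edges = []
    for i, j in itertools.combinations(range(n), 2):
        rest = [k for k in range(n) if k not in (i, j)]
        entailed = True
        for r in itertools.product([0, 1], repeat=len(rest)):
            def prob(vi=None, vj=None):
                # Sum P over assignments with rest = r (and optional X_i, X_j).
                total = 0.0
                for x in itertools.product([0, 1], repeat=n):
                    if any(x[k] != v for k, v in zip(rest, r)):
                        continue
                    if (vi is not None and x[i] != vi) or \
                       (vj is not None and x[j] != vj):
                        continue
                    total += P[x]
                return total
            for a, b in itertools.product([0, 1], repeat=2):
                # X_i _|_ X_j | rest  <=>  P(a,b,r) P(r) = P(a,r) P(b,r)
                if abs(prob(a, b) * prob() - prob(a) * prob(b)) > 1e-12:
                    entailed = False
        if not entailed:            # pairwise assumption fails: edge needed
            edges.append((i, j))
    return edges

# Example: a strictly positive chain X0 - X1 - X2.
P = {x: math.exp((x[0] == x[1]) + (x[1] == x[2]))
     for x in itertools.product([0, 1], repeat=3)}
Z = sum(P.values())
P = {x: p / Z for x, p in P.items()}
print(minimal_imap_edges(P, 3))     # [(0, 1), (1, 2)]
```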


How about a perfect map?

Perfect maps?

The independencies in the graph are exactly the same as those in $P$.

For Bayesian networks, a perfect map does not always exist; counterexample: the swinging couple of variables.

How about for Markov networks? Counterexample: the V-structure.


[Figure: the V-structure BN $A \to S \leftarrow F$ encodes $A \perp F$ but $\neg(A \perp F \mid S)$; its minimal I-map MN (the triangle over $A$, $S$, $F$) is not a P-map, since it cannot encode $A \perp F$. Venn diagram: distributions $P$ representable by BNs vs. MNs.]

Some common Markov networks

Pairwise Markov network

Exponential family (or log-linear model)

Factor graphs


Pairwise Markov Networks

All factors are over single variables or pairs of variables:

Node potentials $\Psi_i(X_i)$

Edge potentials $\Psi_{ij}(X_i, X_j)$

Factorization:

$$P(X) = \frac{1}{Z} \prod_{i \in V} \Psi_i(X_i) \prod_{(i,j) \in E} \Psi_{ij}(X_i, X_j)$$

Note that there may be bigger cliques in the graph, but only pairwise potentials are considered (see the sketch below).

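To make the pairwise factorization concrete, here is a small Python sketch; the graph and the potential values are made-up assumptions, and the partition function $Z$ is computed by brute-force enumeration, which is only feasible for tiny graphs:

```python
import itertools
import numpy as np

nodes = ["A", "B", "C", "D"]
edges = [("A","B"), ("A","C"), ("B","C"), ("A","D"), ("B","D")]
node_pot = {v: np.array([1.0, 2.0]) for v in nodes}                  # Psi_i(x_i)
edge_pot = {e: np.array([[10.0, 1.0], [1.0, 10.0]]) for e in edges}  # Psi_ij

def unnormalized(assign):
    """Product of all node and edge potentials for one joint assignment."""
    p = 1.0
    for v in nodes:
        p *= node_pot[v][assign[v]]
    for u, v in edges:
        p *= edge_pot[(u, v)][assign[u], assign[v]]
    return p

# Partition function Z by enumerating all 2^4 assignments.
Z = sum(unnormalized(dict(zip(nodes, x)))
        for x in itertools.product([0, 1], repeat=len(nodes)))

def P(assign):
    return unnormalized(assign) / Z

print(P({"A": 0, "B": 0, "C": 0, "D": 0}))
```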

An example

$$P(A, B, C, D) = \frac{1}{Z} \Psi_1(A, B, C)\, \Psi_2(A, B, D), \qquad Z = \sum_{a,b,c,d} \Psi_1(a, b, c)\, \Psi_2(a, b, d)$$

This uses two 3-way tables.

$$P'(A, B, C, D) = \frac{1}{Z} \Psi(A, C)\, \Psi(B, C)\, \Psi(A, B)\, \Psi(A, D)\, \Psi(B, D)$$

This uses five 2-way tables.

What is the relation between $I(P)$ and $I(P')$?

$$I(\text{fully connected}) \subseteq I(P) \subseteq I(P') \subseteq I(\text{disconnected})$$

[Figure: the same graph on $A, B, C, D$, shown once with the maximal-clique specification ($P$) and once with the pairwise Markov network specification ($P'$).]

Applications of Pairwise Markov Networks

[Figure: an example image with foreground (fg) and background (bg) regions.]

Image segmentation: separate foreground from background

Graph structure: Grid with one node per pixel

Pairwise Markov networks

Node potential: background color vs. foreground color, where $m_i$ is the color of pixel $i$:

$$\Psi(X_i = \text{fg}, m_i) = \exp\big(-\|m_i - \mu_{\text{fg}}\|^2\big)$$
$$\Psi(X_i = \text{bg}, m_i) = \exp\big(-\|m_i - \mu_{\text{bg}}\|^2\big)$$

Edge potential: neighbors likely have the same label, e.g.

$\Psi(X_i, X_j)$:

X_i \ X_j    fg    bg
fg           10     1
bg            1    10
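A rough Python sketch of these potentials on a toy grid follows; the image, the class means $\mu_{\text{fg}}, \mu_{\text{bg}}$, and the grid size are made up, and only the 10/1 edge table comes from the slide:

```python
import numpy as np

H, W = 4, 4
rng = np.random.default_rng(0)
pixels = rng.random((H, W, 3))     # m_i: an RGB color per pixel (random here)
mu_fg = np.array([0.9, 0.2, 0.2])  # assumed foreground mean color
mu_bg = np.array([0.2, 0.2, 0.9])  # assumed background mean color

def node_potential(i, j):
    """[Psi(X_ij = fg, m_ij), Psi(X_ij = bg, m_ij)] for pixel (i, j)."""
    m = pixels[i, j]
    return np.array([np.exp(-np.sum((m - mu_fg) ** 2)),
                     np.exp(-np.sum((m - mu_bg) ** 2))])

edge_potential = np.array([[10.0, 1.0],    # rows/cols: fg, bg
                           [1.0, 10.0]])

def log_score(labels):
    """Unnormalized log-probability of a labeling (0 = fg, 1 = bg)."""
    s = 0.0
    for i in range(H):
        for j in range(W):
            s += np.log(node_potential(i, j)[labels[i, j]])
            if i + 1 < H:          # vertical grid edge
                s += np.log(edge_potential[labels[i, j], labels[i + 1, j]])
            if j + 1 < W:          # horizontal grid edge
                s += np.log(edge_potential[labels[i, j], labels[i, j + 1]])
    return s

print(log_score(np.zeros((H, W), dtype=int)))   # the all-foreground labeling
```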

Exponential Form

Standard model: $P(X_1, \dots, X_n) = \frac{1}{Z} \prod_i \Psi(D_i)$

Assuming strictly positive potentials:

$$P(X_1, \dots, X_n) = \frac{1}{Z} \prod_i \Psi(D_i) = \frac{1}{Z} \exp\Big(\log \prod_i \Psi(D_i)\Big) = \frac{1}{Z} \exp\Big(\sum_i \log \Psi(D_i)\Big) = \frac{1}{Z} \exp\Big(\sum_i \Phi(D_i)\Big)$$

We can maintain the table $\Phi(D_i) = \log \Psi(D_i)$ (which can have negative entries) rather than the table $\Psi(D_i)$ (strictly positive entries).
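In code, moving from $\Psi$ to $\Phi = \log \Psi$ is a one-line change, and products of potentials become sums of log-tables, which is the usual trick for avoiding numerical underflow. A minimal sketch:

```python
import numpy as np

Psi = np.array([[10.0, 1.0], [1.0, 10.0]])   # strictly positive potential table
Phi = np.log(Psi)                             # log-table: entries may be negative

# Multiplying potentials corresponds to adding their log-tables.
x_i, x_j = 0, 1
log_score = Phi[x_i, x_j] + Phi[x_j, x_i]
assert np.isclose(np.exp(log_score), Psi[x_i, x_j] * Psi[x_j, x_i])
```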

Exponential Formβ€”Log-linear Models

Features are functions $f(D)$ of a subset of variables $D$.

A log-linear model over a Markov network $G$:

A set of features $f_1(D_1), \dots, f_k(D_k)$

Each $D_i$ is a subset of a clique in $G$, e.g. a pairwise model has $D_i = \{X_i, X_j\}$

Two features can be over the same variables: it is OK for $D_i = D_j$

A set of weights $w_1, \dots, w_k$, usually learned from data

$$P(X_1, \dots, X_n) = \frac{1}{Z} \exp\Big(\sum_{i=1}^{k} w_i f_i(D_i)\Big)$$


MN: Gaussian Graphical Models

A Gaussian distribution can be represented by a fully connected graph with pairwise edge potentials over continuous variable nodes.

The overall exponential form is:

$$P(X_1, \dots, X_n) \propto \exp\Big(-\frac{1}{2} \sum_{i,j} (X_i - \mu_i)\, \Sigma^{-1}_{ij}\, (X_j - \mu_j)\Big) = \exp\Big(-\frac{1}{2} (X - \mu)^\top \Sigma^{-1} (X - \mu)\Big)$$

with a pairwise potential for each pair $(i, j)$ with $\Sigma^{-1}_{ij} \neq 0$.

This is also known as a Gaussian graphical model (GGM).


Sparse precision vs. sparse covariance in GGM

A chain $X_1 - X_2 - X_3 - X_4 - X_5$:

$$\Sigma^{-1} = \begin{pmatrix} 1 & 6 & 0 & 0 & 0 \\ 6 & 2 & 7 & 0 & 0 \\ 0 & 7 & 3 & 8 & 0 \\ 0 & 0 & 8 & 4 & 9 \\ 0 & 0 & 0 & 9 & 5 \end{pmatrix} \qquad \Sigma = \begin{pmatrix} 0.10 & 0.15 & -0.13 & -0.08 & 0.15 \\ 0.15 & -0.03 & 0.02 & 0.01 & -0.03 \\ -0.13 & 0.02 & 0.10 & 0.07 & -0.12 \\ -0.08 & 0.01 & 0.07 & -0.04 & 0.07 \\ 0.15 & -0.03 & -0.12 & 0.07 & 0.08 \end{pmatrix}$$

$$\Sigma^{-1}_{15} = 0 \iff X_1 \perp X_5 \mid \text{TheRest} \qquad\qquad \Sigma_{15} = 0 \iff X_1 \perp X_5$$
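The precision/covariance contrast is easy to reproduce numerically. A small Python sketch (it builds its own positive-definite chain precision matrix, since the numbers above are only illustrative):

```python
import numpy as np

# Chain X1 - X2 - X3 - X4 - X5: tridiagonal precision matrix.
n = 5
prec = np.diag(np.full(n, 4.0))
for i in range(n - 1):
    prec[i, i + 1] = prec[i + 1, i] = 1.0   # nonzeros only on chain edges

cov = np.linalg.inv(prec)
print(np.round(prec, 2))  # sparse: prec[0, 4] == 0  <=>  X1 _|_ X5 | TheRest
print(np.round(cov, 2))   # dense: cov[0, 4] != 0, X1 and X5 marginally dependent
```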

Factor Graph

Maximal clique specification:

$$P(A, B, C, D) = \frac{1}{Z} \Psi_1(A, B, C)\, \Psi_2(A, B, D)$$

Pairwise Markov network:

$$P'(A, B, C, D) = \frac{1}{Z} \Psi(A, C)\, \Psi(B, C)\, \Psi(A, B)\, \Psi(A, D)\, \Psi(B, D)$$

We cannot look at the graph alone and tell which potentials it is using. The factor graph makes this explicit in graphical form.


Factor Graph

Makes the factor dependency explicit; useful for later inference.

Bipartite graph:

Variable nodes (circles) for $X_1, \dots, X_n$

Factor nodes (squares) for $\Psi_1, \dots, \Psi_m$

Edge $X_i - \Psi_j$ if $X_i \in D_j$ (the scope of $\Psi_j(D_j)$)

[Figure: the graph on $A, B, C, D$ with two factor graphs. Left: factor nodes $\Psi_1, \Psi_2$ for $\frac{1}{Z} \Psi_1(A, B, C)\, \Psi_2(A, B, D)$. Right: factor nodes $\Psi_1, \dots, \Psi_5$ for $\frac{1}{Z} \Psi_1(A, B)\, \Psi_2(A, C)\, \Psi_3(B, C)\, \Psi_4(A, D)\, \Psi_5(B, D)$.]
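Since a factor graph is just a bipartite structure, a dictionary from factor names to scopes is enough to represent it. A minimal sketch using the maximal-clique specification above:

```python
# Factor nodes mapped to their scopes D_j (maximal-clique specification).
factors = {
    "Psi1": ("A", "B", "C"),
    "Psi2": ("A", "B", "D"),
}

def factor_neighbors(v):
    """Factor nodes adjacent to variable v, i.e. those with v in their scope
    (this adjacency is what message passing iterates over later)."""
    return [f for f, scope in factors.items() if v in scope]

print(factor_neighbors("A"))   # ['Psi1', 'Psi2']
print(factor_neighbors("C"))   # ['Psi1']
```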

Conditional Random Fields

Focus on the conditional distribution $P(Y_1, \dots, Y_n \mid X_1, \dots, X_n, X)$.

Do not explicitly model the dependence between $X_1, \dots, X_n, X$; only model the $X - Y$ and $Y - Y$ relations.

$$P(Y_1, Y_2, Y_3 \mid X_1, X_2, X_3, X) = \frac{1}{Z(X_1, X_2, X_3, X)}\, \Psi(Y_1, Y_2, X_1, X_2, X)\, \Psi(Y_2, Y_3, X_2, X_3, X)$$

$$Z(X_1, X_2, X_3, X) = \sum_{y_1, y_2, y_3} \Psi(y_1, y_2, X_1, X_2, X)\, \Psi(y_2, y_3, X_2, X_3, X)$$

Note that the normalization $Z$ depends on the observations.

[Figure: chain CRF with label nodes $Y_1, Y_2, Y_3$, observations $X_1, X_2, X_3$, and a global observation $X$.]
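A brute-force Python sketch of such a chain CRF with binary labels; the potential function is a made-up positive function, and note how the partition function is recomputed for each observation sequence:

```python
import itertools
import numpy as np

def psi(y1, y2, x1, x2):
    """A made-up positive potential over (Y_t, Y_t+1) and local observations."""
    return np.exp(0.8 * (y1 == y2) + 0.5 * (y1 == x1) + 0.5 * (y2 == x2))

def conditional(ys, xs):
    """P(y1, y2, y3 | x1, x2, x3) by brute-force normalization over labels."""
    def score(y):
        return psi(y[0], y[1], xs[0], xs[1]) * psi(y[1], y[2], xs[1], xs[2])
    Z = sum(score(y) for y in itertools.product([0, 1], repeat=3))  # Z(x)
    return score(ys) / Z

xs = (1, 1, 0)
print(conditional((1, 1, 0), xs))   # the labeling matching x gets high mass
```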

Bayesian vs. Markov Networks

Conditional independence assumptions:

BN: local Markov assumption $X \perp \text{NonDescendants}_X \mid \text{Pa}_X$; further independencies are read off via d-separation (active trails). E.g., V-structure $A \to S \leftarrow F$: $A \perp F$ but $\neg(A \perp F \mid S)$; cascade $A \to S \to H$: $\neg(A \perp H)$ but $A \perp H \mid S$; common parent $H \leftarrow S \to N$: $N \perp H \mid S$ but $\neg(N \perp H)$.

MN: global Markov assumption $A \perp B \mid C$ whenever $\text{sep}_G(A, B; C)$; derived local and pairwise assumptions $X \perp \text{TheRest} \mid MB_X$ (e.g. $MB_X = \{A, B, C, D\}$) and $X \perp Y \mid \text{TheRest}$ when there is no edge $X - Y$.

Distribution Factorization

Bayesian networks (directed graphical models), I-map $I_\ell(G) \subseteq I(P)$:
$$\Leftrightarrow \quad P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Pa}_{X_i})$$
The factors are conditional probability tables (CPTs).

Markov networks (undirected graphical models), strictly positive $P$, I-map $I(G) \subseteq I(P)$:
$$\Leftrightarrow \quad P(X_1, \dots, X_n) = \frac{1}{Z} \prod_{i=1}^{m} \Psi_i(D_i), \qquad Z = \sum_{x_1, \dots, x_n} \prod_{i=1}^{m} \Psi_i(D_i)$$
The factors are (maximal) clique potentials; $Z$ is the normalization constant (partition function).

Representation Power

BN: minimal I-map not unique; does not always have a P-map. Counterexample: the diamond MN with $X_1 \perp X_3 \mid X_2, X_4$ and $X_2 \perp X_4 \mid X_1, X_3$.

MN: minimal I-map unique (for strictly positive distributions); does not always have a P-map. Counterexample: the V-structure with $A \perp F$ but $\neg(A \perp F \mid S)$.

Can we convert between the two representations?

Is there a BN that is a P-map for a given MN?

The MN for the swinging couple of variables does not have a P-map as a BN.

[Figure: the diamond MN on $X_1, X_2, X_3, X_4$ encodes both $X_1 \perp X_3 \mid X_2, X_4$ and $X_2 \perp X_4 \mid X_1, X_3$; each candidate BN orientation of the diamond fails to encode both independencies simultaneously.]

Is there an MN that is a P-map for a given BN?

The BN with a V-structure does not have a P-map as an MN.

[Figure: the V-structure $A \to S \leftarrow F$ encodes $A \perp F$ and $\neg(A \perp F \mid S)$; each candidate MN over $A, F, S$ either wrongly encodes $A \perp F \mid S$ or cannot encode $A \perp F$.]

Conversion using Minimal I-map instead

Instead of attempting P-maps between BNs and MNs, we can try minimal I-maps for conversion

Recall: $G$ is a minimal I-map for $P$ if

$I(G) \subseteq I(P)$, and

the removal of any single edge in $G$ renders it no longer an I-map.

Note: if $G$ is a minimal I-map of $P$, $G$ need not satisfy all conditional independence relations in $P$.


Conversion from BN to MN

MN: the Markov blanket $MB_{X_i}$ of $X_i$ is the set of immediate neighbors of $X_i$ in the graph, and $X_i \perp V \setminus \{X_i\} \setminus MB_{X_i} \mid MB_{X_i}$ for all $i$.

What is the Markov blanket for a BN?

[Figure: left, an MN where $X$'s neighbors are $A, B, C, D$, so $MB_X = \{A, B, C, D\}$; right, a BN over $X, A, \dots, J$ where conditioning on $A, B, C, D$ alone does not block all active trails to $X$.]

Strategy: go outward from 𝑋, try to block all active trails to 𝑋

Markov blanket for BN

$MB_X = \{A, B, C, D, E, F\}$, and $X \perp V \setminus \{X\} \setminus MB_X \mid MB_X$.

[Figure: the BN over $X, A, \dots, J$, with outward steps 1-4 blocking all active trails: include $X$'s parents, its children, and its children's other parents.]

Markov Blanket for BN

$MB_X$ in a BN is the set of nodes consisting of $X$'s parents, $X$'s children, and the other parents of $X$'s children (see the sketch below).

The moral graph $M(G)$ of a BN $G$ is the undirected graph that contains an undirected edge between $X$ and $Y$ if

there is a directed edge between them in either direction, or

$X$ and $Y$ are parents of a common child.

The moral graph ensures that $MB_X$ is exactly the set of neighbors of $X$ in the undirected graph $M(G)$.

[Figure: a BN $G$ over $A, B, C, D$ and its moral graph $M(G)$: moralization marries common parents and drops edge directions.]
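Both constructions are mechanical. A short Python sketch (on a toy DAG of my own) that computes a Markov blanket and the moral-graph edge set:

```python
from itertools import combinations

# DAG as a dict mapping each node to its list of parents (toy example).
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["B"], "E": ["C", "D"]}

def children(g, x):
    return [v for v, ps in g.items() if x in ps]

def markov_blanket(g, x):
    """Parents, children, and the children's other parents of x."""
    mb = set(g[x]) | set(children(g, x))
    for c in children(g, x):
        mb |= set(g[c])
    mb.discard(x)
    return mb

def moralize(g):
    """Edges of M(G): original edges (directions dropped) plus 'marriages'
    between parents of a common child."""
    edges = {frozenset((p, v)) for v, ps in g.items() for p in ps}
    for ps in g.values():
        for a, b in combinations(ps, 2):
            edges.add(frozenset((a, b)))
    return edges

print(markov_blanket(parents, "C"))                    # {'A', 'B', 'D', 'E'}
print(sorted(tuple(sorted(e)) for e in moralize(parents)))
```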

Minimal I-map from BNs to MNs

The moral graph $M(G)$ of any BN $G$ is a minimal I-map for $G$.

Moralization turns each family $\{X\} \cup \text{Pa}_X$ into a fully connected component.

The CPTs associated with the BN can be used as clique potentials.

The moral graph loses some independence relations: e.g. the BN below encodes $A \perp E$ and $A \perp B$, but such marginal independencies cannot be read from the MN.

[Figure: a BN $G$ over $A, B, C, D, E$ and its moral graph $M(G)$ after moralization.]

Minimal I-maps from MNs to BNs

Any BN I-map for an MN must add triangulating edges to the graph.

Intuition:

V-structures in the BN introduce immoralities.

These immoralities were not present in the Markov network.

Triangulation eliminates immoralities.

[Figure: the diamond MN on $X_1, X_2, X_3, X_4$ is triangulated by adding a chord before a BN I-map can be constructed.]

Chordal graphs

Let $X_1 - X_2 - \dots - X_k - X_1$ be a loop in a graph. A chord of the loop is an edge connecting two non-consecutive nodes $X_i$ and $X_j$.

An undirected graph $G$ is chordal if every loop $X_1 - X_2 - \dots - X_k - X_1$ with $k \ge 4$ has a chord.

A directed graph G is chordal if its underlying undirected graph is chordal

[Figure: example graph on $A, B, C, D, E, F$ illustrating loops and chords.]
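Chordality can be checked directly, e.g. with networkx's is_chordal. A minimal sketch using the diamond example:

```python
import networkx as nx

# The 4-loop X1 - X2 - X3 - X4 - X1 has no chord, so it is not chordal.
G = nx.cycle_graph(["X1", "X2", "X3", "X4"])
print(nx.is_chordal(G))    # False

G.add_edge("X1", "X3")     # triangulate: add a chord
print(nx.is_chordal(G))    # True
```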

Minimal I-maps from MNs to BNs

Let $H$ be an MN, and let $G$ be any BN minimal I-map for $H$. Then $G$ can have no immoralities.

Intuitive reason: immoralities introduce additional independencies that are not in the original MN.

Consequently, any BN minimal I-map $G$ for $H$ is necessarily chordal, because any non-triangulated loop of length at least 4 in a Bayesian network necessarily contains an immorality.

The process of adding edges is called triangulation.

[Figure: the V-structure $A \to S \leftarrow F$ (encoding $A \perp F$, $\neg(A \perp F \mid S)$) and a loop over $A, B, C, D$.]

Summary


[Figure: Venn diagram of distributions $P$: those representable by BNs and by MNs overlap, and undirected trees and undirected chordal graphs lie in the intersection. Conversions: moralize a BN to obtain an MN I-map; triangulate an MN to obtain a BN I-map.]

Recommended