Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012
Le Song
Lecture 17, Oct 25, 2012
Reading: Chap 8, C. Bishop Book
Undirected Graphical Models or Markov Random Fields

Reading conditional independence from a UGM:
Global Markov independence: A ⊥ B | C, independence based on graph separation.
Local Markov independence: X ⊥ X_rest | A, B, C, D, where {A, B, C, D} is the Markov blanket of X.

[Figure: an undirected graph in which X is surrounded by its Markov blanket {A, B, C, D}, and a chain A-C-B in which C separates A from B.]

Edges encode dependence between variables, but carry no causal meaning, and generating samples is more complicated than in a directed model.
Maximal Cliques

For G = (V, E), a complete subgraph (clique) is a subgraph G' = (V' ⊆ V, E' ⊆ E) such that the nodes in V' are fully connected.
A maximal clique is a complete subgraph such that any strict superset V'' ⊃ V' is not fully connected.
Example:
Maximal cliques = {A, B, C}, {A, B, D}
Sub-cliques = {A}, {B}, {A, B}, {A, C}, ... (all edges and singletons)

[Figure: graph on A, B, C, D with edges A-B, A-C, B-C, A-D, B-D.]
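Enumerating maximal cliques can be sketched with the basic Bron-Kerbosch algorithm; the adjacency dict below is an assumed encoding of the example graph (edges A-B, A-C, B-C, A-D, B-D):

```python
def maximal_cliques(adj):
    """Basic Bron-Kerbosch: grow clique R from candidates P, excluding X."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))   # R is maximal: nothing can extend it
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# The slide's example graph, encoded as an adjacency dict (an assumption).
adj = {"A": {"B", "C", "D"}, "B": {"A", "C", "D"},
       "C": {"A", "B"}, "D": {"A", "B"}}
cliques = maximal_cliques(adj)   # the two maximal cliques {A,B,C} and {A,B,D}
```

Running this on the example recovers exactly the two maximal cliques listed above; every edge and singleton is a sub-clique but is absorbed into one of them.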
MN: Conditional independence vs. distribution factorization

An MN encodes the global Markov assumptions I(G). Two directions relate them to a distribution P:

If the global conditional independencies of the MN are a subset of the conditional independencies of P (I(G) ⊆ I(P), i.e., the MN is an I-map of P), can the joint probability be written as
  P(X1, ..., Xn) = (1/Z) ∏_i Ψ_i(D_i),
i.e., does P factorize according to the MN?

Conversely, if the joint probability can be written as
  P(X1, ..., Xn) = (1/Z) ∏_i Ψ_i(D_i)
(P factorizes according to the MN), are the global conditional independencies of the MN a subset of those of P (I(G) ⊆ I(P), i.e., the MN is an I-map of P)?

Every P has at least one MN structure G; we then read independencies of P from the MN structure G.
Counter Example

X1, ..., X4 are binary, and only eight assignments have positive probability (each with probability 1/8):
(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0), (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)

This distribution satisfies the independencies of the 4-cycle MN, e.g. X1 ⊥ X3 | X2, X4.
But the distribution does not factorize over the cycle! E.g., P(0,0,1,0) = 0 would force
  (1/Z) Ψ12(0,0) Ψ23(0,1) Ψ34(1,0) Ψ14(0,0) = 0,
so some factor entry must be zero, contradicting the positive-probability assignments that use the same entries.

[Figure: 4-cycle MN X1-X2-X3-X4-X1.]

P(X2, X4 | X1, X3):
  X1X3 \ X2X4:   00    01    10    11
  00             1/2   1/2   0     0
  01             0     1/2   0     1/2
  10             1/2   0     1/2   0
  11             0     0     1/2   1/2

P(X1 | X2, X4):
  X2X4:          00    01    10    11
  X1 = 0         1/2   1     0     1/2
  X1 = 1         1/2   0     1     1/2

P(X3 | X2, X4):
  X2X4:          00    01    10    11
  X3 = 0         1     1/2   1/2   0
  X3 = 1         0     1/2   1/2   1
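A brute-force sketch (using the distribution above) confirms that X1 ⊥ X3 | X2, X4 holds even though the distribution does not factorize over the cycle:

```python
from itertools import product

# The eight equiprobable assignments (x1, x2, x3, x4) from the slide.
support = [(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0),
           (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)]
P = {x: (1/8 if x in support else 0.0) for x in product((0, 1), repeat=4)}

def marg(dist, keep):
    """Marginal distribution over the variable indices in `keep`."""
    out = {}
    for x, p in dist.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# X1 _|_ X3 | X2, X4 iff P(x) * P(x2,x4) == P(x1,x2,x4) * P(x2,x3,x4) for all x.
p24, p124, p234 = marg(P, (1, 3)), marg(P, (0, 1, 3)), marg(P, (1, 2, 3))
indep = all(abs(P[(a, b, c, d)] * p24[(b, d)]
                - p124[(a, b, d)] * p234[(b, c, d)]) < 1e-12
            for (a, b, c, d) in P)
```

The check passes, so the independence really does hold; the contradiction lies only in trying to write P as a product of edge potentials.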
Markov Network Representation

For a strictly positive distribution P, i.e. P(X = x) > 0 for all x:
If the global conditional independencies of the MN are a subset of the conditional independencies of P (I(G) ⊆ I(P), i.e., the MN is an I-map of P), then the joint probability can be written as
  P(X1, ..., Xn) = (1/Z) ∏_i Ψ_i(D_i),
i.e., P factorizes according to the MN.
This is known as the Hammersley-Clifford theorem.
Minimal I-maps and Markov networks

A fully connected graph is an I-map for every distribution.
Recall minimal I-maps: deleting any edge makes the graph no longer an I-map.
For a Bayesian network, the minimal I-map is not unique.
For strictly positive distributions and Markov networks, the minimal I-map is unique!
There are many ways to find the minimal I-map, e.g.:
Take the pairwise Markov assumption X_i ⊥ X_j | X_rest; if P does not entail it, add the edge (X_i, X_j).
How about a perfect map?
Perfect maps?

A perfect map (P-map): the independencies in the graph are exactly those in P.
For Bayesian networks, a P-map does not always exist.
Counterexample: the swinging couple of variables.
How about for Markov networks? A P-map does not always exist either.
Counterexample: the v-structure.

[Figure: BN v-structure A → C ← F encodes A ⊥ F and ¬(A ⊥ F | C); its minimal I-map MN is the triangle over A, C, F, which is not a P-map: it cannot encode A ⊥ F.]
Some common Markov networks

Pairwise Markov networks
Exponential family (or log-linear) models
Factor graphs
Pairwise Markov Networks

All factors are over single variables or pairs of variables:
Node potentials Ψ_i(X_i)
Edge potentials Ψ_ij(X_i, X_j)
Factorization:
  P(X) = (1/Z) ∏_{i ∈ V} Ψ_i(X_i) ∏_{(i,j) ∈ E} Ψ_ij(X_i, X_j)
Note that there may be bigger cliques in the graph, but we only use pairwise potentials.

[Figure: graph on A, B, C, D with edges A-B, A-C, B-C, A-D, B-D.]
An example

Maximal clique specification (two 3-way tables):
  P(A, B, C, D) = (1/Z) Ψ1(A, B, C) Ψ2(A, B, D)
  Z = ∑_{A,B,C,D} Ψ1(A, B, C) Ψ2(A, B, D)

Pairwise specification (five 2-way tables):
  P'(A, B, C, D) = (1/Z) Ψ(A,C) Ψ(B,C) Ψ(A,B) Ψ(A,D) Ψ(B,D)

What is the relation between I(P) and I(P')?
  I(fully connected) ⊆ I(P) ⊆ I(P') ⊆ I(disconnected)
Both specifications are consistent with the same graph.

[Figure: graph on A, B, C, D with edges A-B, A-C, B-C, A-D, B-D, shown once for the maximal-clique specification and once for the pairwise specification.]
Applications of Pairwise Markov Networks

Image segmentation: separate foreground (fg) from background (bg).
Graph structure: a grid with one node per pixel.
Pairwise Markov network:
Node potentials (background color vs. foreground color):
  Ψ(X_i = fg, I_i) = exp(−||I_i − μ_fg||²)
  Ψ(X_i = bg, I_i) = exp(−||I_i − μ_bg||²)
Edge potentials (neighbors likely have the same label):
  Ψ(X_i, X_j):
              X_j = Fg   X_j = Bg
  X_i = Fg       10          1
  X_i = Bg        1         10
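A minimal sketch of these potentials on a hypothetical two-pixel "image" (the intensities and class means are made up; the edge table is the 10/1 table above):

```python
import math

mu_fg, mu_bg = 0.9, 0.1        # hypothetical foreground/background mean colors
pixels = [0.85, 0.2]           # hypothetical intensities for a 2-pixel image
edge_pot = {("fg", "fg"): 10.0, ("bg", "bg"): 10.0,
            ("fg", "bg"): 1.0, ("bg", "fg"): 1.0}

def node_pot(label, intensity):
    """Psi(X_i = label, I_i) = exp(-||I_i - mu_label||^2)."""
    mu = mu_fg if label == "fg" else mu_bg
    return math.exp(-(intensity - mu) ** 2)

def score(labels):
    """Unnormalized probability of a joint labeling of the two pixels."""
    s = node_pot(labels[0], pixels[0]) * node_pot(labels[1], pixels[1])
    return s * edge_pot[(labels[0], labels[1])]

labelings = [(a, b) for a in ("fg", "bg") for b in ("fg", "bg")]
best = max(labelings, key=score)
```

With these numbers the smoothing edge potential pulls the ambiguous second pixel to the same label as the first, which is exactly the behavior the 10-vs-1 table is meant to encode.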
Exponential Form

Standard model:
  P(X1, ..., Xn) = (1/Z) ∏_i Ψ_i(D_i)
Assuming strictly positive potentials:
  P(X1, ..., Xn) = (1/Z) ∏_i Ψ_i(D_i)
               = (1/Z) ∏_i exp(log Ψ_i(D_i))
               = (1/Z) exp(∑_i log Ψ_i(D_i))
               = (1/Z) exp(∑_i Φ_i(D_i))      where Φ_i = log Ψ_i
We can maintain tables Φ_i(D_i) (which can have negative entries) rather than tables Ψ_i(D_i) (strictly positive entries).
Exponential Form: Log-linear Models

Features are functions f(D) of a subset of variables D.
A log-linear model over a Markov network G:
A set of features f1(D1), ..., fk(Dk)
  Each D_i is a subset of a clique in G
  E.g., in a pairwise model, D_i = {X_i, X_j}
  Two features can be over the same variables (D_i = D_j is allowed)
A set of weights w1, ..., wk, usually learned from data

  P(X1, ..., Xn) = (1/Z) exp( ∑_{i=1}^k w_i f_i(D_i) )
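A tiny log-linear model can be sketched with made-up agreement features on a 3-variable chain (the features, weights, and chain structure are assumptions for illustration):

```python
import math
from itertools import product

# Hypothetical features over binary X1, X2, X3 on the chain X1 - X2 - X3:
# one agreement indicator per edge, both with the same weight.
def f1(x): return 1.0 if x[0] == x[1] else 0.0    # scope D1 = {X1, X2}
def f2(x): return 1.0 if x[1] == x[2] else 0.0    # scope D2 = {X2, X3}
weights = [(f1, 1.5), (f2, 1.5)]

def unnorm(x):
    """exp(sum_i w_i f_i(D_i)): the unnormalized probability."""
    return math.exp(sum(w * f(x) for f, w in weights))

Z = sum(unnorm(x) for x in product((0, 1), repeat=3))   # partition function
P = {x: unnorm(x) / Z for x in product((0, 1), repeat=3)}
```

Agreeing assignments such as (0,0,0) and (1,1,1) fire both features and so receive the highest probability; the brute-force Z is feasible only because the model is tiny.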
MN: Gaussian Graphical Models

A Gaussian distribution can be represented by an MN with pairwise edge potentials over continuous variable nodes (in general a fully connected graph).
The overall exponential form is:
  P(X1, ..., Xn) ∝ exp( −(1/2) ∑_{i,j} (X_i − μ_i) (Σ⁻¹)_ij (X_j − μ_j) )
               = exp( −(1/2) (X − μ)ᵀ Σ⁻¹ (X − μ) )
Also known as Gaussian graphical models (GGM).
Sparse precision vs. sparse covariance in GGM

Chain GGM: X1 - X2 - X3 - X4 - X5

Σ⁻¹ (tridiagonal, sparse):
  1 6 0 0 0
  6 2 7 0 0
  0 7 3 8 0
  0 0 8 4 9
  0 0 0 9 5

Σ (dense):
   0.10  0.15 −0.13 −0.08  0.15
   0.15 −0.03  0.02  0.01 −0.03
  −0.13  0.02  0.10  0.07 −0.12
  −0.08  0.01  0.07 −0.04  0.07
   0.15 −0.03 −0.12  0.07  0.08

(Σ⁻¹)_15 = 0 ⇔ X1 ⊥ X5 | X_rest (zeros in the precision matrix encode conditional independencies)
Σ_15 = 0 ⇔ X1 ⊥ X5 (zeros in the covariance matrix encode marginal independencies)
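The precision/covariance contrast can be reproduced numerically; the tridiagonal numbers below are made up so that the matrix is positive definite (the slide's entries are only illustrative):

```python
import numpy as np

# Chain GGM X1 - X2 - X3 - X4 - X5: the precision matrix is tridiagonal.
K = np.diag([2.0] * 5) + np.diag([-0.8] * 4, k=1) + np.diag([-0.8] * 4, k=-1)
Sigma = np.linalg.inv(K)

# K[0, 4] == 0 encodes X1 _|_ X5 | X2, X3, X4, yet the covariance is dense:
# Sigma[0, 4] != 0, so X1 and X5 remain marginally dependent.
```

This is the chain-GGM phenomenon on the slide: a sparse precision matrix (the graph) inverts to a dense covariance, so conditional independencies are visible in Σ⁻¹ but not in Σ.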
Factor Graph

The same graph supports different factorizations:
Maximal clique specification:
  P(A, B, C, D) = (1/Z) Ψ1(A, B, C) Ψ2(A, B, D)
Pairwise Markov network:
  P'(A, B, C, D) = (1/Z) Ψ(A,C) Ψ(B,C) Ψ(A,B) Ψ(A,D) Ψ(B,D)
We cannot look at the graph alone and tell which factorization is being used.
A factor graph makes this explicit in graphical form.

[Figure: graph on A, B, C, D with edges A-B, A-C, B-C, A-D, B-D.]
Factor Graph

Makes factor dependency explicit; useful for later inference.
A bipartite graph:
  Variable nodes (circles) for X1, ..., Xn
  Factor nodes (squares) for Ψ1, ..., Ψm
  Edge X_i - Ψ_j iff X_i ∈ D_j (the scope of Ψ_j(D_j))

For (1/Z) Ψ1(A, B, C) Ψ2(A, B, D): variable nodes A, B, C, D and factor nodes Ψ1 (joined to A, B, C) and Ψ2 (joined to A, B, D).
For (1/Z) Ψ1(A,B) Ψ2(A,C) Ψ3(B,C) Ψ4(A,D) Ψ5(B,D): variable nodes A, B, C, D and one factor node per pairwise potential.
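The bipartite structure can be sketched as a plain adjacency map from variables to the factors whose scope contains them (factor names follow the pairwise example above):

```python
# One factor node per pairwise potential, with its scope of variable nodes.
factors = {"Psi1": ("A", "B"), "Psi2": ("A", "C"), "Psi3": ("B", "C"),
           "Psi4": ("A", "D"), "Psi5": ("B", "D")}

# Edge X - Psi_j iff X is in the scope D_j of Psi_j.
var_to_factors = {}
for name, scope in factors.items():
    for v in scope:
        var_to_factors.setdefault(v, set()).add(name)
```

Message-passing algorithms covered later operate on exactly this kind of variable-to-factor adjacency.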
Conditional Random Fields

Focus on the conditional distribution P(Y1, ..., Yn | X1, ..., Xn).
Do not explicitly model dependence among X1, ..., Xn; only model the relations between X and Y and among the Y's.
Example (chain CRF):
  P(Y1, Y2, Y3 | X1, X2, X3) = (1/Z(X1, X2, X3)) Ψ(Y1, Y2, X1, X2) Ψ(Y2, Y3, X2, X3)
  Z(X1, X2, X3) = ∑_{y1,y2,y3} Ψ(y1, y2, X1, X2) Ψ(y2, y3, X2, X3)
Note that the partition function depends on the observations X.

[Figure: chain of label nodes Y1 - Y2 - Y3, connected to observation nodes X1, X2, X3.]
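A brute-force sketch of the observation-dependent partition function Z(X) for this chain CRF; the potential below is a made-up example rewarding label/observation and label/label agreement:

```python
import math
from itertools import product

def psi(y_a, y_b, x_a, x_b):
    """Hypothetical potential Psi(Y_a, Y_b, X_a, X_b) over binary values."""
    return math.exp((y_a == x_a) + (y_b == x_b) + 0.5 * (y_a == y_b))

def conditional(y, x):
    """P(y | x) = psi(...) psi(...) / Z(x); note that Z depends on x."""
    def unnorm(yy):
        return psi(yy[0], yy[1], x[0], x[1]) * psi(yy[1], yy[2], x[1], x[2])
    z = sum(unnorm(yy) for yy in product((0, 1), repeat=3))
    return unnorm(y) / z

probs = {y: conditional(y, (1, 1, 0)) for y in product((0, 1), repeat=3)}
```

Because only the conditional is normalized, Z must be recomputed for each observation x; here the most probable labeling simply copies the observations.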
Bayesian vs. Markov Networks

Conditional Independence Assumptions

Markov networks:
  Global Markov assumption: A ⊥ B | C whenever C separates A from B in the graph (sep_G(A, B; C)).
  Derived local and pairwise assumptions:
    X ⊥ X_rest | MB(X)
    X ⊥ Y | X_rest when there is no edge X-Y

Bayesian networks:
  Local Markov assumption: X ⊥ NonDescendants_X | Pa_X
  Independencies are read off via d-separation and active trails.
  E.g., v-structure A → C ← F: A ⊥ F but ¬(A ⊥ F | C).

[Figures: canonical BN trail types (chain, common parent H, v-structure) with their (in)dependence statements; an MN chain A-C-B where C separates A and B; an MN in which MB(X) = {A, B, C, D}.]
Distribution Factorization

Bayesian Networks (Directed Graphical Models), G an I-map of P:
  I_ℓ(G) ⊆ I(P)
  ⇔
  P(X1, ..., Xn) = ∏_{i=1}^n P(X_i | Pa_{X_i})
  The factors are Conditional Probability Tables (CPTs).

Markov Networks (Undirected Graphical Models), for strictly positive P, G an I-map of P:
  I(G) ⊆ I(P)
  ⇔
  P(X1, ..., Xn) = (1/Z) ∏_{i=1}^m Ψ_i(D_i)
  Z = ∑_{x1,...,xn} ∏_{i=1}^m Ψ_i(D_i)
  The factors are (maximal) clique potentials; Z is the normalization (partition function).
Representation Power

Bayesian networks:
  Minimal I-map is not unique.
  Do not always have a P-map; e.g., the 4-cycle MN with X1 ⊥ X3 | X2, X4 and X2 ⊥ X4 | X1, X3 has no BN P-map.
Markov networks:
  Minimal I-map is unique (for strictly positive P).
  Do not always have a P-map; e.g., the v-structure with A ⊥ F and ¬(A ⊥ F | C) has no MN P-map.

Can we convert between the two representations?
Is there a BN that is a P-map for a given MN?

The MN for the swinging couple of variables does not have a P-map as a BN.
The 4-cycle MN over X1, X2, X3, X4 encodes X1 ⊥ X3 | X2, X4 and X2 ⊥ X4 | X1, X3.

[Figure: candidate BN orientations of the 4-cycle; each fails to encode both X1 ⊥ X3 | X2, X4 and X2 ⊥ X4 | X1, X3 simultaneously.]
Is there an MN that is a P-map for a given BN?

The BN for the v-structure does not have a P-map as an MN.
The v-structure A → C ← F encodes A ⊥ F and ¬(A ⊥ F | C).

[Figure: candidate MNs over A, C, F; each either wrongly encodes A ⊥ F | C or fails to encode A ⊥ F.]
Conversion using Minimal I-maps instead

Instead of attempting P-maps between BNs and MNs, we can use minimal I-maps for conversion.
Recall: G is a minimal I-map for P if
  I(G) ⊆ I(P), and
  removal of any single edge in G renders it no longer an I-map.
Note: if G is a minimal I-map of P, G need not satisfy all conditional independence relations in P.
Conversion from BN to MN

In an MN, the Markov blanket MB(X_i) of X_i is the set of immediate neighbors of X_i in the graph:
  X_i ⊥ X − {X_i} − MB(X_i) | MB(X_i), for all i
What is the Markov blanket in a BN?

[Figure: a BN where X has parents A, B and children C, D; the children have other parents E and F; nodes G, H, I, J lie further away. In the analogous MN, MB(X) = {A, B, C, D}; in the BN, ¬(X ⊥ {E, F} | A, B, C, D), so parents and children alone do not suffice.]

Strategy: go outward from X and try to block all active trails to X.
Markov blanket for BN

MB(X) = {A, B, C, D, E, F}: parents, children, and the children's other parents.
  X ⊥ X − {X} − MB(X) | MB(X)

[Figure: the same BN, with the active trails from X blocked one by one by conditioning on the blanket nodes.]
Markov Blanket for BN

MB(X) in a BN is the set of nodes consisting of X's parents, X's children, and the other parents of X's children.
The moral graph M(G) of a BN G is an undirected graph that contains an undirected edge between X and Y if:
  there is a directed edge between them in either direction, or
  X and Y are parents of a common child.
The moral graph ensures that MB(X) is exactly the set of neighbors of X in the undirected graph M(G).

[Figure: a BN G over A, B, C, D and its moralization M(G), which adds an edge between the common parents.]
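Moralization and the resulting Markov blankets can be sketched directly; the DAG below is an assumed encoding of the slide's example (parents A, B → X; children X → C, D; co-parents E → C, F → D):

```python
# DAG given as a child -> parents map (an assumed encoding of the example).
parents = {"X": {"A", "B"}, "C": {"X", "E"}, "D": {"X", "F"},
           "A": set(), "B": set(), "E": set(), "F": set()}

def moralize(parents):
    """Moral graph: keep every edge (undirected) and marry co-parents."""
    nbrs = {v: set() for v in parents}
    for child, ps in parents.items():
        for p in ps:                       # each directed edge p -> child
            nbrs[p].add(child)
            nbrs[child].add(p)
        for a in ps:                       # marry parents of a common child
            for b in ps:
                if a != b:
                    nbrs[a].add(b)
    return nbrs

moral = moralize(parents)
# Neighbors of X in M(G) = parents + children + children's other parents.
```

Reading off moral["X"] gives {A, B, C, D, E, F}, matching the Markov blanket derived on the previous slide.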
Minimal I-map from BNs to MNs

The moral graph M(G) of any BN G is a minimal I-map for G.
Moralization turns each family {X_i} ∪ Pa_{X_i} into a fully connected component.
The CPTs of the BN can be used directly as clique potentials.
The moral graph loses some independence relations:
  e.g., the BN encodes marginal independencies such as A ⊥ B, which cannot be read from the moralized MN.

[Figure: a BN G over A, B, C, D, E; moralization to M(G) adds edges between common parents.]
Minimal I-maps from MNs to BNs

Any BN that is an I-map for an MN must add triangulating edges to the graph.
Intuition:
  V-structures in a BN introduce immoralities.
  These immoralities were not present in the Markov network.
  Triangulation eliminates immoralities.

[Figure: the 4-cycle X1-X2-X3-X4 is triangulated by adding a chord before it can be oriented as a BN.]
Chordal graphs

Let X1 - X2 - ... - Xk - X1 be a loop in a graph. A chord of the loop is an edge connecting two non-consecutive nodes X_i and X_j.
An undirected graph G is chordal if every loop X1 - X2 - ... - Xk - X1 with k ≥ 4 has a chord.
A directed graph G is chordal if its underlying undirected graph is chordal.

[Figure: a chordal graph over A, B, C, D, E, F.]
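The definition suggests a simple check (a sketch): a graph is chordal iff one can repeatedly delete a simplicial vertex, i.e. one whose neighbors form a clique, until no vertices remain:

```python
def is_chordal(adj):
    """Chordality test via simplicial (perfect) elimination ordering."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while adj:
        simplicial = next((v for v, ns in adj.items()
                           if all(b in adj[a] for a in ns for b in ns if a != b)),
                          None)
        if simplicial is None:
            return False        # a chordless cycle has no simplicial vertex
        for n in adj[simplicial]:
            adj[n].discard(simplicial)
        del adj[simplicial]
    return True

# The 4-cycle X1 - X2 - X3 - X4 - X1 has no chord ...
cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
# ... while adding the chord X2 - X4 triangulates it.
chorded = {1: {2, 4}, 2: {1, 3, 4}, 3: {2, 4}, 4: {1, 2, 3}}
```

This works because every chordal graph has a simplicial vertex and every induced subgraph of a chordal graph is chordal, so the greedy elimination never gets stuck on a chordal input.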
Minimal I-maps from MNs to BNs (continued)

Let H be an MN, and let G be any BN minimal I-map for H. Then G can have no immoralities.
  Intuitive reason: immoralities introduce additional independencies that are not in the original MN.
Let G be any BN minimal I-map for H. Then G is necessarily chordal!
  Because any non-triangulated loop of length at least 4 in a Bayesian network necessarily contains an immorality.
The process of adding edges is called triangulation.

[Figure: the v-structure A → C ← F with A ⊥ F, ¬(A ⊥ F | C); a chordal graph over A, B, C, D.]
Summary

[Figure: Venn diagram of the families representable by BNs and MNs; their intersection corresponds to chordal graphs, which include undirected trees. Moralize to convert a BN to an MN; triangulate to convert an MN to a BN.]