Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012
Le Song
Lecture 17, Oct 25, 2012
Reading: Chap 8, C. Bishop Book
Undirected Graphical Models or Markov Random Fields
Read conditional independence from UGM
Global Markov Independence 𝐴 ⊥ 𝐵 | 𝐶
Independence based on separation
Local Markov Independence 𝑋 ⊥ 𝑇ℎ𝑒𝑅𝑒𝑠𝑡 | 𝐴, 𝐵, 𝐶, 𝐷
{𝐴, 𝐵, 𝐶, 𝐷} is the Markov blanket of 𝑋
[Figure: 𝑋 surrounded by its Markov blanket 𝐴, 𝐵, 𝐶, 𝐷; chain 𝐴—𝐶—𝐵 illustrating separation]
Edges encode dependence between variables, but no causal relations, and generating samples is more complicated than in a BN
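The separation criterion behind the global Markov property can be checked mechanically: 𝐴 ⊥ 𝐵 | 𝐶 holds in the graph whenever every path from 𝐴 to 𝐵 passes through 𝐶. A minimal sketch (the adjacency encoding and helper name are my own, not from the lecture):

```python
from collections import deque

def separated(adj, A, B, C):
    """Graph separation: no node of B is reachable from A once the
    separating set C is removed from the graph."""
    blocked = set(C)
    frontier = deque(a for a in A if a not in blocked)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v not in seen and v not in blocked:
                seen.add(v)
                frontier.append(v)
    return not (seen & set(B))

# Chain A - C - B from the slide: C separates A and B.
adj = {"A": ["C"], "C": ["A", "B"], "B": ["C"]}
print(separated(adj, {"A"}, {"B"}, {"C"}))   # True
print(separated(adj, {"A"}, {"B"}, set()))   # False
```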
Maximal Cliques
For 𝐺 = (𝑉, 𝐸), a complete subgraph (clique) is a subgraph 𝐺′ = (𝑉′ ⊆ 𝑉, 𝐸′ ⊆ 𝐸) such that the nodes in 𝑉′ are fully connected
A (maximal) clique is a complete subgraph such that no superset 𝑉′′ ⊃ 𝑉′ is fully connected
Example:
Maximal cliques = {A,B,C}, {A,B,D}
Sub-cliques = {A}, {B}, {A,B}, {A,C}, … (all edges and singletons)
[Figure: graph with edges 𝐴—𝐵, 𝐴—𝐶, 𝐵—𝐶, 𝐴—𝐷, 𝐵—𝐷]
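Maximal cliques can be enumerated with the standard Bron-Kerbosch algorithm (not covered in the lecture); on the example graph it recovers exactly {A,B,C} and {A,B,D}:

```python
def bron_kerbosch(R, P, X, adj, out):
    """Enumerate maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    if not P and not X:
        out.append(frozenset(R))
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P.remove(v)
        X.add(v)

# The slide's example graph: edges A-B, A-C, B-C, A-D, B-D.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C", "D"},
       "C": {"A", "B"}, "D": {"A", "B"}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
print(sorted(sorted(c) for c in cliques))  # [['A', 'B', 'C'], ['A', 'B', 'D']]
```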
MN: Conditional independence vs distribution factorization
MN encodes global Markov assumptions 𝐼(𝐺)

If the global conditional independencies in the MN are a subset of the conditional independencies in 𝑃 (𝐼(𝐺) ⊆ 𝐼(𝑃), i.e. the MN is an I-map of 𝑃), can we obtain the factorization
𝑃(𝑋1, … , 𝑋𝑛) = (1/𝑍) ∏𝑖 Ψ(𝐷𝑖) ?

Conversely, if the joint probability 𝑃 can be written as
𝑃(𝑋1, … , 𝑋𝑛) = (1/𝑍) ∏𝑖 Ψ(𝐷𝑖)
(𝑃 factorizes according to the MN), then we obtain that the global conditional independencies in the MN are a subset of the conditional independencies in 𝑃: 𝐼(𝐺) ⊆ 𝐼(𝑃) (the MN is an I-map of 𝑃)

Every 𝑃 has at least one MN structure 𝐺
Read independencies of 𝑃 from the MN structure 𝐺
Counter Example
𝑋1, … , 𝑋4 are binary, and only eight assignments have positive probability (each with 1/8)
(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0), (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)
E.g., 𝑋1 ⊥ 𝑋3 | 𝑋2, 𝑋4
But the distribution does not factorize!
e.g., 𝑃(0,0,1,0) = 0, yet any factorization would give (1/𝑍) Ψ12(0,0) Ψ23(0,1) Ψ34(1,0) Ψ14(0,0) > 0: each of these factor values occurs in one of the eight positive assignments, which forces it to be positive
[Figure: 4-cycle 𝑋1—𝑋2—𝑋3—𝑋4—𝑋1]

𝑃(𝑋2𝑋4 | 𝑋1𝑋3) (rows 𝑋1𝑋3, columns 𝑋2𝑋4):

           00   01   10   11
  00       ½    ½    0    0
  01       0    ½    0    ½
  10       ½    0    ½    0
  11       0    0    ½    ½

𝑃(𝑋1 | 𝑋2𝑋4) (rows 𝑋1, columns 𝑋2𝑋4):

           00   01   10   11
  0        ½    1    0    ½
  1        ½    0    1    ½

𝑃(𝑋3 | 𝑋2𝑋4) (rows 𝑋3, columns 𝑋2𝑋4):

           00   01   10   11
  0        1    ½    ½    0
  1        0    ½    ½    1
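The counterexample can be verified numerically from the eight equiprobable assignments alone: the conditional 𝑃(𝑋1, 𝑋3 | 𝑋2, 𝑋4) factorizes for every conditioning value, yet 𝑃(0,0,1,0) = 0. A sketch (helper names are my own):

```python
from itertools import product

# The eight equiprobable assignments (x1, x2, x3, x4) from the slide.
support = {(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0),
           (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)}
P = {x: (1/8 if x in support else 0.0) for x in product((0, 1), repeat=4)}

def cond(p_joint, x2, x4):
    """P(X1, X3 | X2=x2, X4=x4) as a dict over (x1, x3)."""
    marg = sum(p_joint[(a, x2, b, x4)] for a in (0, 1) for b in (0, 1))
    return {(a, b): p_joint[(a, x2, b, x4)] / marg for a in (0, 1) for b in (0, 1)}

# Verify X1 _|_ X3 | X2, X4: the conditional factorizes for every (x2, x4).
for x2, x4 in product((0, 1), repeat=2):
    c = cond(P, x2, x4)
    p1 = {a: c[(a, 0)] + c[(a, 1)] for a in (0, 1)}   # P(X1 | x2, x4)
    p3 = {b: c[(0, b)] + c[(1, b)] for b in (0, 1)}   # P(X3 | x2, x4)
    for a, b in product((0, 1), repeat=2):
        assert abs(c[(a, b)] - p1[a] * p3[b]) < 1e-12

print(P[(0, 0, 1, 0)])  # 0.0 -- yet a pairwise factorization would force this > 0
```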
Markov Network Representation
Assume 𝑃 is strictly positive: for all 𝑥, 𝑃(𝑋 = 𝑥) > 0
Known as the Hammersley-Clifford Theorem:

If the global conditional independencies in the MN are a subset of the conditional independencies in a strictly positive 𝑃 (𝐼(𝐺) ⊆ 𝐼(𝑃), i.e. the MN is an I-map of 𝑃), then we obtain that the joint probability 𝑃 can be written as
𝑃(𝑋1, … , 𝑋𝑛) = (1/𝑍) ∏𝑖 Ψ(𝐷𝑖)
(𝑃 factorizes according to the MN)

Every strictly positive 𝑃 has at least one MN structure 𝐺
Minimal I-maps and Markov networks
A fully connected graph is an I-map for all distributions
Remember minimal I-maps:
Deleting any edge makes it no longer an I-map
In a Bayesian network, the minimal I-map is not unique
For strictly positive distributions and Markov networks, the minimal I-map is unique!
Many ways to find a minimal I-map, e.g.:
For each pair 𝑋, 𝑌, take the pairwise Markov assumption 𝑋 ⊥ 𝑌 | 𝑇ℎ𝑒𝑅𝑒𝑠𝑡
If 𝑃 does not entail it, add the edge 𝑋—𝑌
How about a perfect map?
Perfect maps?
Independencies in the graph are exactly the same as those in 𝑃
For Bayesian networks, a perfect map does not always exist
Counterexample: swinging couple of variables
How about for Markov networks?
Counterexample: V-structure
[Figure: V-structure 𝐴 → 𝑆 ← 𝐹 with 𝐴 ⊥ 𝐹 but ¬(𝐴 ⊥ 𝐹 | 𝑆); its minimal I-map MN 𝐴—𝑆—𝐹 is not a P-map: it cannot encode 𝐴 ⊥ 𝐹. Venn diagram of distributions 𝑃 representable by BNs vs MNs]
Some common Markov networks
Pairwise Markov network
Exponential family (or log-linear model)
Factor graphs
Pairwise Markov Networks
All factors over single variables or pairs of variables
Node potentials Ψ𝑖(𝑋𝑖)
Edge potentials Ψ𝑖𝑗(𝑋𝑖, 𝑋𝑗)
Factorization
𝑃(𝑋) = (1/𝑍) ∏𝑖∈𝑉 Ψ𝑖(𝑋𝑖) ∏(𝑖,𝑗)∈𝐸 Ψ𝑖𝑗(𝑋𝑖, 𝑋𝑗)
Note that there may be bigger cliques in the graph, but we only consider pairwise potentials
[Figure: graph on 𝐴, 𝐵, 𝐶, 𝐷]
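The pairwise factorization can be evaluated by brute force on a small model, computing the partition function 𝑍 by enumeration. The potential values below are illustrative, not from the lecture:

```python
from itertools import product

# A minimal pairwise MN on three binary variables (a chain A - B - C).
V = ["A", "B", "C"]
E = [("A", "B"), ("B", "C")]
node_pot = {v: {0: 1.0, 1: 2.0} for v in V}            # psi_i(x_i)
edge_pot = {e: {(0, 0): 3.0, (0, 1): 1.0,
                (1, 0): 1.0, (1, 1): 3.0} for e in E}  # psi_ij(x_i, x_j)

def unnorm(x):
    """Product of node and edge potentials for assignment dict x."""
    p = 1.0
    for v in V:
        p *= node_pot[v][x[v]]
    for (u, w) in E:
        p *= edge_pot[(u, w)][(x[u], x[w])]
    return p

states = [dict(zip(V, vals)) for vals in product((0, 1), repeat=len(V))]
Z = sum(unnorm(x) for x in states)                     # partition function
P = {tuple(x[v] for v in V): unnorm(x) / Z for x in states}
assert abs(sum(P.values()) - 1.0) < 1e-12
print(max(P, key=P.get))   # most probable assignment: (1, 1, 1)
```

The agreement-favoring edge potentials plus node potentials preferring state 1 make the all-ones assignment most probable.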
An example
𝑃(𝐴, 𝐵, 𝐶, 𝐷) = (1/𝑍) Ψ1(𝐴, 𝐵, 𝐶) Ψ2(𝐴, 𝐵, 𝐷)
𝑍 = Σ𝑎,𝑏,𝑐,𝑑 Ψ1(𝑎, 𝑏, 𝑐) Ψ2(𝑎, 𝑏, 𝑑)
Uses two 3-way tables (maximal clique specification)

𝑃′(𝐴, 𝐵, 𝐶, 𝐷) = (1/𝑍) Ψ(𝐴, 𝐶) Ψ(𝐵, 𝐶) Ψ(𝐴, 𝐵) Ψ(𝐴, 𝐷) Ψ(𝐵, 𝐷)
Uses five 2-way tables (pairwise Markov network)

What is the relation between 𝐼(𝑃) and 𝐼(𝑃′)?
𝐼(fully connected) ⊆ 𝐼(𝑃) ⊆ 𝐼(𝑃′) ⊆ 𝐼(disconnected)
[Figure: the same graph on 𝐴, 𝐵, 𝐶, 𝐷 under the maximal-clique specification and the pairwise specification]
Applications of Pairwise Markov Networks
[Figure: image with foreground (fg) and background (bg) regions]
Image segmentation: separate foreground from background
Graph structure: Grid with one node per pixel
Pairwise Markov networks
Node potential
Background color vs. foreground color
Ψ(𝑋𝑖 = 𝑓𝑔, 𝑚𝑖) = exp(−||𝑚𝑖 − 𝜇𝑓𝑔||²)
Ψ(𝑋𝑖 = 𝑏𝑔, 𝑚𝑖) = exp(−||𝑚𝑖 − 𝜇𝑏𝑔||²)
Edge potential
Neighbors likely have the same label
Ψ(𝑋𝑖, 𝑋𝑗):

  𝑋𝑖 \ 𝑋𝑗   fg   bg
  fg         10    1
  bg          1   10
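A brute-force MAP sketch for a tiny 1-D "image" using potentials of exactly this form; the pixel intensities and the class means 𝜇𝑓𝑔, 𝜇𝑏𝑔 below are made up for illustration:

```python
from itertools import product
from math import exp

# Hypothetical 1-D "image": three pixel intensities and assumed class means.
pixels = [0.9, 0.8, 0.2]
mu = {"fg": 1.0, "bg": 0.0}

def node_pot(label, m):
    # psi(X_i = label, m_i) = exp(-||m_i - mu_label||^2)
    return exp(-(m - mu[label]) ** 2)

# The slide's edge potential: neighbors likely share a label.
edge_pot = {("fg", "fg"): 10.0, ("bg", "bg"): 10.0,
            ("fg", "bg"): 1.0, ("bg", "fg"): 1.0}

def score(labels):
    """Unnormalized probability of a labeling of the pixel chain."""
    s = 1.0
    for lab, m in zip(labels, pixels):
        s *= node_pot(lab, m)
    for a, b in zip(labels, labels[1:]):
        s *= edge_pot[(a, b)]
    return s

best = max(product(("fg", "bg"), repeat=len(pixels)), key=score)
print(best)  # ('fg', 'fg', 'fg')
```

With the strong 10-vs-1 edge potential, the third pixel is pulled to fg by its neighbors even though its intensity looks like background: the smoothing effect of the pairwise terms.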
Exponential Form
Standard model:
𝑃(𝑋1, … , 𝑋𝑛) = (1/𝑍) ∏𝑖 Ψ(𝐷𝑖)
Assuming strictly positive potentials:
𝑃(𝑋1, … , 𝑋𝑛) = (1/𝑍) ∏𝑖 Ψ(𝐷𝑖) = (1/𝑍) ∏𝑖 exp(log Ψ(𝐷𝑖)) = (1/𝑍) exp(Σ𝑖 log Ψ(𝐷𝑖)) = (1/𝑍) exp(Σ𝑖 Φ(𝐷𝑖)), where Φ(𝐷𝑖) = log Ψ(𝐷𝑖)
We can maintain the table Φ(𝐷𝑖) (which can have negative entries) rather than the table Ψ(𝐷𝑖) (strictly positive entries)
Exponential Form—Log-linear Models
Features are functions 𝑓(𝐷) of a subset of variables 𝐷
Log-linear model over a Markov network 𝐺:
A set of features 𝑓1(𝐷1), … , 𝑓𝑘(𝐷𝑘)
Each 𝐷𝑖 is a subset of a clique in 𝐺
E.g., pairwise model: 𝐷𝑖 = {𝑋𝑖, 𝑋𝑗}
Two features can be over the same variables: it is OK for 𝐷𝑖 = 𝐷𝑗
A set of weights 𝑤1, … , 𝑤𝑘, usually learned from data
𝑃(𝑋1, … , 𝑋𝑛) = (1/𝑍) exp(Σ𝑖=1..𝑘 𝑤𝑖 𝑓𝑖(𝐷𝑖))
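The log-linear form can be evaluated directly with brute-force normalization. The two features and their weights below are made up for illustration:

```python
from itertools import product
from math import exp

# Log-linear model on two binary variables with two illustrative features:
# f1 fires when X1 = X2 (an "agreement" feature), f2 when X1 = 1.
features = [lambda x: float(x[0] == x[1]), lambda x: float(x[0] == 1)]
weights = [1.5, -0.5]

def unnorm(x):
    # exp(sum_i w_i f_i(x)), the unnormalized log-linear score
    return exp(sum(w * f(x) for w, f in zip(weights, features)))

states = list(product((0, 1), repeat=2))
Z = sum(unnorm(x) for x in states)          # partition function
P = {x: unnorm(x) / Z for x in states}
assert abs(sum(P.values()) - 1.0) < 1e-12
# Positive weight on agreement makes matching assignments more likely:
assert P[(0, 0)] > P[(0, 1)]
```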
MN: Gaussian Graphical Models
A Gaussian distribution can be represented by a fully connected graph with pairwise edge potentials over continuous variable nodes
The overall exponential form is:
𝑃(𝑋1, … , 𝑋𝑛) ∝ exp(−½ Σ𝑖,𝑗 (𝑋𝑖 − 𝜇𝑖) (Σ⁻¹)𝑖𝑗 (𝑋𝑗 − 𝜇𝑗)) = exp(−½ (𝑋 − 𝜇)⊤ Σ⁻¹ (𝑋 − 𝜇))
Also known as a Gaussian graphical model (GGM)
Sparse precision vs. sparse covariance in GGM
[Figure: chain 𝑋1—𝑋2—𝑋3—𝑋4—𝑋5]

Σ⁻¹ =
  1 6 0 0 0
  6 2 7 0 0
  0 7 3 8 0
  0 0 8 4 9
  0 0 0 9 5

Σ =
   0.10  0.15 -0.13 -0.08  0.15
   0.15 -0.03  0.02  0.01 -0.03
  -0.13  0.02  0.10  0.07 -0.12
  -0.08  0.01  0.07 -0.04  0.07
   0.15 -0.03 -0.12  0.07  0.08
(Σ⁻¹)15 = 0 ⇔ 𝑋1 ⊥ 𝑋5 | 𝑇ℎ𝑒𝑅𝑒𝑠𝑡
Σ15 = 0 ⇔ 𝑋1 ⊥ 𝑋5
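This duality is easy to check numerically. A sketch with a hypothetical chain-structured precision matrix (chosen positive definite; the values are my own, not the lecture's 5×5 example):

```python
import numpy as np

# Chain GGM: tridiagonal precision matrix over X1..X5.
Q = np.diag([2.0] * 5) + np.diag([-0.9] * 4, 1) + np.diag([-0.9] * 4, -1)
Sigma = np.linalg.inv(Q)

# Zero in the precision <=> conditional independence given the rest:
assert Q[0, 4] == 0.0                 # X1 _|_ X5 | the rest
# ...but the covariance is dense: X1 and X5 are marginally dependent.
assert abs(Sigma[0, 4]) > 1e-6
print(np.round(Sigma, 2))
```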
Factor Graph
Maximal clique specification:
𝑃(𝐴, 𝐵, 𝐶, 𝐷) = (1/𝑍) Ψ1(𝐴, 𝐵, 𝐶) Ψ2(𝐴, 𝐵, 𝐷)
Pairwise Markov network:
𝑃′(𝐴, 𝐵, 𝐶, 𝐷) = (1/𝑍) Ψ(𝐴, 𝐶) Ψ(𝐵, 𝐶) Ψ(𝐴, 𝐵) Ψ(𝐴, 𝐷) Ψ(𝐵, 𝐷)
One cannot look at the undirected graph alone and tell which potentials are being used
The factor graph makes this clear in graphical form
[Figure: graph on 𝐴, 𝐵, 𝐶, 𝐷]
Factor Graph
Make factor dependency explicit
Useful for later inference
Bipartite graph:
Variable nodes (circles) for 𝑋1, … , 𝑋𝑛
Factor nodes (squares) for Ψ1, … , Ψ𝑚
Edge 𝑋𝑖 — Ψ𝑗 if 𝑋𝑖 ∈ 𝐷𝑗 (the scope of Ψ𝑗(𝐷𝑗))
[Figure: factor graphs for (1/𝑍) Ψ1(𝐴, 𝐵, 𝐶) Ψ2(𝐴, 𝐵, 𝐷) and for (1/𝑍) Ψ1(𝐴, 𝐵) Ψ2(𝐴, 𝐶) Ψ3(𝐵, 𝐶) Ψ4(𝐴, 𝐷) Ψ5(𝐵, 𝐷)]
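The bipartite structure can be represented directly as an adjacency set; the factor names follow the pairwise example above (a sketch, not from the lecture):

```python
# A factor graph as a bipartite structure: factor nodes with their scopes.
factors = {"psi1": ("A", "B"), "psi2": ("A", "C"), "psi3": ("B", "C"),
           "psi4": ("A", "D"), "psi5": ("B", "D")}

# Edge X_i - psi_j whenever X_i is in the scope D_j of psi_j.
edges = {(v, f) for f, scope in factors.items() for v in scope}
neighbors = {v: sorted(f for (u, f) in edges if u == v)
             for v in {"A", "B", "C", "D"}}

print(neighbors["A"])   # ['psi1', 'psi2', 'psi4']
```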
Conditional Random Fields
Focus on the conditional distribution
𝑃(𝑌1, … , 𝑌𝑛 | 𝑋1, … , 𝑋𝑛, 𝑋)
Do not explicitly model dependence between 𝑋1, … , 𝑋𝑛, 𝑋
Only model the relations between 𝑋–𝑌 and 𝑌–𝑌 pairs
𝑃(𝑌1, 𝑌2, 𝑌3 | 𝑋1, 𝑋2, 𝑋3, 𝑋) = (1 / 𝑍(𝑋1, 𝑋2, 𝑋3, 𝑋)) Ψ(𝑌1, 𝑌2, 𝑋1, 𝑋2, 𝑋) Ψ(𝑌2, 𝑌3, 𝑋2, 𝑋3, 𝑋)
𝑍(𝑋1, 𝑋2, 𝑋3, 𝑋) = Σ𝑦1,𝑦2,𝑦3 Ψ(𝑦1, 𝑦2, 𝑋1, 𝑋2, 𝑋) Ψ(𝑦2, 𝑦3, 𝑋2, 𝑋3, 𝑋)
[Figure: chain CRF with labels 𝑌1, 𝑌2, 𝑌3, local observations 𝑋1, 𝑋2, 𝑋3, and a global input 𝑋]
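A toy chain CRF can be normalized by brute force; note that 𝑍 depends on the observed inputs, unlike in an ordinary MN. The factor form and weights below are made up for illustration:

```python
from itertools import product
from math import exp

def psi(y_a, y_b, x_a, x_b):
    # Toy factor: favor label agreement and labels matching positive inputs.
    return exp(1.0 * (y_a == y_b) + 0.5 * y_a * x_a + 0.5 * y_b * x_b)

def crf_conditional(x):
    """Return P(y | x) over all label vectors y in {0, 1}^3 for a chain CRF."""
    def unnorm(y):
        return psi(y[0], y[1], x[0], x[1]) * psi(y[1], y[2], x[1], x[2])
    ys = list(product((0, 1), repeat=3))
    Z = sum(unnorm(y) for y in ys)      # Z(x): depends on the observations
    return {y: unnorm(y) / Z for y in ys}

P = crf_conditional((2.0, 2.0, -1.0))
assert abs(sum(P.values()) - 1.0) < 1e-12
print(max(P, key=P.get))  # (1, 1, 1)
```

Even though the third observation is negative, the agreement term pulls 𝑌3 toward its neighbors' label.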
Bayesian vs. Markov Networks
Conditional Independence Assumptions

MN:
Global Markov Assumption: 𝐴 ⊥ 𝐵 | 𝐶 whenever 𝑠𝑒𝑝𝐺(𝐴, 𝐵; 𝐶)
Derived local and pairwise assumptions:
𝑋 ⊥ 𝑇ℎ𝑒𝑅𝑒𝑠𝑡 | 𝑀𝐵𝑋 (e.g., 𝑀𝐵𝑋 = {𝐴, 𝐵, 𝐶, 𝐷})
𝑋 ⊥ 𝑌 | 𝑇ℎ𝑒𝑅𝑒𝑠𝑡 (when there is no edge 𝑋—𝑌)

BN:
Local Markov Assumption: 𝑋 ⊥ 𝑁𝑜𝑛𝑑𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡𝑠𝑋 | 𝑃𝑎𝑋
D-separation, active trails, e.g.:
𝐴 ⊥ 𝐹 but ¬(𝐴 ⊥ 𝐹 | 𝑆) (V-structure 𝐴 → 𝑆 ← 𝐹)
¬(𝐴 ⊥ 𝐻) but 𝐴 ⊥ 𝐻 | 𝑆 (cascade through 𝑆)
𝑁 ⊥ 𝐻 | 𝑆 but ¬(𝑁 ⊥ 𝐻) (common parent 𝑆)
Distribution Factorization
Bayesian Networks (Directed Graphical Models):
𝐼-map: 𝐼(𝐺) ⊆ 𝐼(𝑃)
⇔
𝑃(𝑋1, … , 𝑋𝑛) = ∏𝑖=1..𝑛 𝑃(𝑋𝑖 | 𝑃𝑎𝑋𝑖)
Factors: Conditional Probability Tables (CPTs)

Markov Networks (Undirected Graphical Models):
strictly positive 𝑃, 𝐼-map: 𝐼(𝐺) ⊆ 𝐼(𝑃)
⇔
𝑃(𝑋1, … , 𝑋𝑛) = (1/𝑍) ∏𝑖=1..𝑚 Ψ𝑖(𝐷𝑖), with 𝑍 = Σ𝑥1,…,𝑥𝑛 ∏𝑖=1..𝑚 Ψ𝑖(𝐷𝑖)
Factors: maximal clique potentials; normalization 𝑍 (the partition function)
Representation Power

BN: minimal I-map not unique; does not always have a P-map
MN: minimal I-map unique; does not always have a P-map
[Figure: Venn diagram of distributions 𝑃 representable by BNs vs MNs: the V-structure (𝐴 ⊥ 𝐹, ¬(𝐴 ⊥ 𝐹 | 𝑆)) lies only on the BN side; the 4-cycle (𝑋1 ⊥ 𝑋3 | 𝑋2, 𝑋4 and 𝑋2 ⊥ 𝑋4 | 𝑋1, 𝑋3) only on the MN side]
Can we convert between the two representations?
Is there a BN that is a P-map for a given MN?
The MN for the swinging couple of variables does not have a P-map as a BN
[Figure: candidate BNs over the 4-cycle 𝑋1, 𝑋2, 𝑋3, 𝑋4; each fails to encode both 𝑋1 ⊥ 𝑋3 | 𝑋2, 𝑋4 and 𝑋2 ⊥ 𝑋4 | 𝑋1, 𝑋3]
Is there an MN that is P-map for a given BN?
The BN with a V-structure does not have a P-map as an MN
[Figure: candidate MNs over 𝐴, 𝑆, 𝐹; each fails to encode both 𝐴 ⊥ 𝐹 and ¬(𝐴 ⊥ 𝐹 | 𝑆)]
Conversion using Minimal I-map instead
Instead of attempting P-maps between BNs and MNs, we can use minimal I-maps for conversion
Recall: 𝐺 is a minimal I-map for 𝑃 if
𝐼(𝐺) ⊆ 𝐼(𝑃)
Removal of any single edge from 𝐺 renders it no longer an I-map
Note: if 𝐺 is a minimal I-map of 𝑃, 𝐺 need not satisfy all conditional independence relations in 𝑃
Conversion from BN to MN
MN: the Markov blanket 𝑀𝐵𝑋𝑖 of 𝑋𝑖 is the set of immediate neighbors of 𝑋𝑖 in the graph
𝑋𝑖 ⊥ 𝑉 – {𝑋𝑖} – 𝑀𝐵𝑋𝑖 | 𝑀𝐵𝑋𝑖, for all 𝑖
What is the Markov blanket for a BN?
[Figure: MN where 𝑀𝐵𝑋 = {𝐴, 𝐵, 𝐶, 𝐷}; BN over 𝑋 and 𝐴–𝐽 where 𝑀𝐵𝑋 = ? — conditioning on the neighbors alone is not enough, e.g., ¬(𝑋 ⊥ 𝐸, 𝐹 | 𝐴, 𝐵, 𝐶, 𝐷)]
Strategy: go outward from 𝑋, try to block all active trails to 𝑋
Markov blanket for BN
[Figure: BN over 𝑋 and 𝐴–𝐽, with parents 𝐴, 𝐵, children 𝐶, 𝐷, and co-parents 𝐸, 𝐹 of the children; blocking proceeds outward in steps 1–4]
𝑀𝐵𝑋 = {𝐴, 𝐵, 𝐶, 𝐷, 𝐸, 𝐹}
𝑋 ⊥ 𝑉 – {𝑋} – 𝑀𝐵𝑋 | 𝑀𝐵𝑋
Markov Blanket for BN
𝑀𝐵𝑋 in a BN is the set of nodes consisting of 𝑋's parents, 𝑋's children, and the other parents of 𝑋's children
The moral graph 𝑀(𝐺) of a BN 𝐺 is an undirected graph that contains an undirected edge between 𝑋 and 𝑌 if
there is a directed edge between them in either direction, or
𝑋 and 𝑌 are parents of a common child
The moral graph ensures that 𝑀𝐵𝑋 is the set of neighbors of 𝑋 in the undirected graph 𝑀(𝐺)
[Figure: BN 𝐺 over 𝐴, 𝐵, 𝐶, 𝐷 and its moral graph 𝑀(𝐺), obtained by moralization]
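Moralization itself is mechanical: drop edge directions and marry all co-parents of every child. A sketch on a made-up BN (the parent lists are illustrative):

```python
from itertools import combinations

def moralize(parents):
    """Moral graph of a BN given as a child -> list-of-parents mapping."""
    edges = set()
    for child, pas in parents.items():
        for p in pas:                         # undirected version of each arc
            edges.add(frozenset((p, child)))
        for p, q in combinations(pas, 2):     # marry co-parents
            edges.add(frozenset((p, q)))
    return edges

# Example BN: A and B are both parents of C; D has parent C.
G = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
M = moralize(G)
assert frozenset(("A", "B")) in M   # moral edge between C's parents
print(sorted(tuple(sorted(e)) for e in M))
```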
Minimal I-map from BNs to MNs
The moral graph 𝑀(𝐺) of any BN 𝐺 is a minimal I-map for 𝐺
Moralization turns each {𝑋, 𝑃𝑎𝑋} into a fully connected component
The CPTs associated with the BN can be used as clique potentials
The moral graph loses some independence relations
e.g., BN: 𝐴 ⊥ 𝐵, 𝐴 ⊥ 𝐸; MN: cannot read such marginal independencies
[Figure: BN 𝐺 over 𝐴, 𝐵, 𝐶, 𝐷, 𝐸 and its moral graph 𝑀(𝐺)]
Minimal I-maps from MNs to BNs
Any BN I-map for an MN must add triangulating edges to the graph
Intuition:
V-structures in a BN introduce immoralities
These immoralities were not present in the Markov network
Triangulation eliminates immoralities
[Figure: 4-cycle 𝑋1—𝑋2—𝑋3—𝑋4 triangulated by adding a chord]
Chordal graphs
Let 𝑋1 − 𝑋2 − ⋯ − 𝑋𝑘 − 𝑋1 be a loop in a graph. A chord of the loop is an edge connecting non-consecutive 𝑋𝑖 and 𝑋𝑗
An undirected graph 𝐺 is chordal if every loop 𝑋1 − 𝑋2 − ⋯ − 𝑋𝑘 − 𝑋1 with 𝑘 ≥ 4 has a chord
A directed graph 𝐺 is chordal if its underlying undirected graph is chordal
[Figure: a chordal graph on 𝐴, 𝐵, 𝐶, 𝐷, 𝐸, 𝐹]
Let 𝐻 be an MN, and 𝐺 be any BN minimal I-map for 𝐻. Then 𝐺 can have no immoralities
Intuitive reason: immoralities introduce additional independencies that are not in the original MN
Hence any BN minimal I-map 𝐺 for 𝐻 is necessarily chordal!
Because any non-triangulated loop of length at least 4 in a Bayesian network necessarily contains an immorality
The process of adding edges is called triangulation
[Figure: V-structure 𝐴 → 𝑆 ← 𝐹 (an immorality); chordal graph on 𝐴, 𝐵, 𝐶, 𝐷]
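Chordality can be tested with maximum cardinality search (a standard algorithm, not covered in the lecture): a graph is chordal iff the reverse of an MCS ordering is a perfect elimination ordering. A sketch:

```python
def is_chordal(adj):
    """Chordality test: run maximum cardinality search, then check that the
    reverse MCS order is a perfect elimination ordering."""
    nodes = set(adj)
    weight = {v: 0 for v in nodes}
    order, numbered = [], set()
    while len(order) < len(nodes):
        v = max(nodes - numbered, key=lambda u: weight[u])
        order.append(v)
        numbered.add(v)
        for u in adj[v]:
            if u not in numbered:
                weight[u] += 1
    order.reverse()                     # candidate elimination order
    pos = {v: i for i, v in enumerate(order)}
    for i, v in enumerate(order):
        later = {u for u in adj[v] if pos[u] > i}
        if later:
            w = min(later, key=lambda u: pos[u])
            if not ((later - {w}) <= adj[w]):   # later neighbors must be a clique
                return False
    return True

square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}          # 4-cycle
triangulated = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}  # + chord 1-3
print(is_chordal(square), is_chordal(triangulated))  # False True
```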
Summary
[Figure: Venn diagram of distributions 𝑃: the BN and MN families overlap on undirected trees and undirected chordal graphs. Convert BN → MN by moralizing; convert MN → BN by triangulating]