Free Energy Approximation
Solmaz Torabi
Dept. of Electrical and Computer Engineering, Drexel [email protected]
Advisor: Dr. John M. Walsh
June 19, 2014
1/101
References
[1] M. Opper and D. Saad, “Advanced Mean Field Methods: Theory and Practice,” MIT Press, 2001.
[2] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free-energy approximations and generalized belief propagation algorithms,” IEEE Transactions on Information Theory, vol. 51, 2005.
[3] M. Welling and Y. W. Teh, “Approximate inference in Boltzmann machines,” Artificial Intelligence, vol. 143, pp. 19–50, 2003.
[4] A. Montanari, “Lecture notes, inference in graphical models,” 2011.
Outline
- Basics of graphical models
- Basics of message-passing algorithms
- Variational free energy
- Mean field approximation
- TAP (Thouless, Anderson and Palmer)
- Region-based approximation
- Bethe free energy
- Kikuchi approximation
Undirected graphical model, Markov random field
Undirected graphical model with random vector $X = (X_1, \dots, X_n)$
- Given an undirected graph $G = (V, E)$, each node $s$ has an associated random variable $X_s$.
- A clique $C \subseteq V$ is a fully connected subset of $V$.
- The distribution $p$ factorizes according to $G$ if it can be expressed as a product over cliques:
$$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)$$
For example:
$$p(x) = \frac{1}{Z}\, \psi_1(x_1, x_2, x_3)\, \psi_2(x_3, x_4, x_5)\, \psi_3(x_4, x_5, x_6)\, \psi_4(x_4, x_7)$$
Graphical model, factor graph
- A factor graph is a bipartite graph $G = (V, F, E)$, where $V$ is the original set of vertices, $F$ is the set of factor nodes, and $(s, a) \in E$ if $x_s$ participates in the factor indexed by $a \in F$.
- We assume that the functions $f_a(x_a)$ are non-negative and finite.
$$P(x) = \frac{1}{Z} \prod_{a \in F} f_a(x_a)$$
For example:
$$P(x) = \frac{1}{Z}\, f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, f_C(x_4)$$
Graphical model: undirected graph, factor graph
- Maximal cliques: $\mathcal{C} = \{\{1, 2, 3, 4\}, \{4, 5, 6\}, \{6, 7\}\}$
- Vertex set $V = \{1, \dots, 7\}$, factor set $F = \{a, b, c\}$
$$P(x) = \frac{1}{Z}\, f_a(x_1, x_2, x_3, x_4)\, f_b(x_4, x_5, x_6)\, f_c(x_6, x_7)$$
Pairwise graphical model
- A commonly encountered subclass of Markov networks
  - Ising model, Boltzmann machines
  - Computer vision
$$P(x_1, x_2, \dots, x_N) = \frac{1}{Z} \prod_{(ij)} \psi_{ij}(x_i, x_j) \prod_i \psi_i(x_i)$$
where $\psi_{ij}(x_i, x_j)$ is the compatibility function and $\psi_i(x_i)$ is the evidence at node $i$, with
$\psi_i : \mathcal{X} \to \mathbb{R}_+$ for each $i \in V$ and $\psi_{ij} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ for each $(i, j) \in E$.
Boltzmann distribution
- Physicists focus on the class of distributions $P$ known as Boltzmann (Gibbs) distributions:
$$P(X) = \frac{e^{-H(X)}}{Z}$$
- $H(X)$ is the energy of state $X$.
- $Z = \sum_X e^{-H(X)}$ is the normalizing partition function.
- A pairwise Markov random field can be written in this form:
$$P(X) = \frac{1}{Z} \prod_{(ij)} \psi_{ij}(x_i, x_j) \prod_i \psi_i(x_i) = \frac{e^{-H(X)}}{Z}$$
where the energy is
$$H(X) = -\sum_{(ij)} \ln \psi_{ij}(x_i, x_j) - \sum_i \ln \psi_i(x_i)$$
Ising model
- An example of a pairwise model with $\psi_{ij}(x_i, x_j) = \exp\{J_{ij} x_i x_j\}$, $\psi_i(x_i) = \exp\{\theta_i x_i\}$.
- It is a mathematical model of ferromagnetism in statistical mechanics.
- $x_i \in \{+1, -1\}$ represents the magnetic dipole moment of an atomic spin; any two adjacent sites $i, j$ have an interaction $J_{ij}$.
- Each site $i$ is subject to an external magnetic field $\theta_i$.
- The energy of each configuration is
$$H(X) = -\sum_{(ij)} J_{ij} x_i x_j - \sum_i \theta_i x_i$$
- The configuration probability is
$$P(X) = \frac{e^{-H(X)}}{Z} = \frac{1}{Z} \exp\Big( \sum_{(ij)} J_{ij} x_i x_j + \sum_i \theta_i x_i \Big)$$
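For a handful of spins the Boltzmann distribution above can be evaluated by direct enumeration. The following is a minimal sketch, not part of the original slides; the function name `ising_distribution`, the edge-dictionary format, and the 3-spin chain are illustrative assumptions.

```python
import itertools
import math

def ising_distribution(J, theta):
    """Exact Boltzmann distribution of a small Ising model by enumeration.

    J: dict mapping an edge (i, j) with i < j to the coupling J_ij.
    theta: list of external fields, one per spin.
    Returns (Z, probs), where probs maps each configuration x to P(x).
    """
    n = len(theta)
    weights = {}
    for x in itertools.product((-1, +1), repeat=n):
        # H(x) = -sum_{(ij)} J_ij x_i x_j - sum_i theta_i x_i
        energy = -sum(Jij * x[i] * x[j] for (i, j), Jij in J.items())
        energy -= sum(t * xi for t, xi in zip(theta, x))
        weights[x] = math.exp(-energy)
    Z = sum(weights.values())
    return Z, {x: w / Z for x, w in weights.items()}

# Hypothetical 3-spin chain with ferromagnetic couplings and no external field.
Z, probs = ising_distribution({(0, 1): 1.0, (1, 2): 1.0}, [0.0, 0.0, 0.0])
```

The cost grows as $2^n$, which is exactly why the approximations in the rest of the talk are needed.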
Inference tasks
- Computing the marginal distribution $p(x_A)$ over a particular subset $A \subset V$ of nodes.
- Computing the conditional distribution $P(x_A \mid x_B)$.
- Computing the most probable configuration (MAP):
$$\hat{x} = \arg\max_{x \in \mathcal{X}^m} P(x)$$
Belief propagation
- BP is a method for computing marginal probability functions.
- The computed marginals are exact if the factor graph has no cycles.
$$m_{i \to a}(x_i) = \prod_{c \in N(i) \setminus a} m_{c \to i}(x_i)$$
$$m_{a \to i}(x_i) = \sum_{x_a \setminus x_i} f_a(x_a) \prod_{j \in N(a) \setminus i} m_{j \to a}(x_j)$$
- $i$ is used as a general index over variables, $a$ over factors.
Belief propagation
If this iteration converges, the marginals are approximated by
$$b_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i)$$
$$b_a(x_a) \propto f_a(x_a) \prod_{i \in N(a)} m_{i \to a}(x_i)$$
- In general, loopy BP (LBP) may not converge.
- If it does, $b_i(x_i)$ may not be close to the true marginal $P(x_i)$.
- The set of pseudomarginals $b$ may not be realizable, i.e. there may be no joint distribution with these marginals.
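The two message updates and the belief formula can be sketched on a small factor graph as follows. This is an illustrative implementation with a parallel (flooding) schedule, not the author's code; the function name `sum_product`, the `FACTORS` layout, and the factor values are assumptions made for the example. On a tree, such as the graph $f_A(x_1,x_2) f_B(x_2,x_3,x_4) f_C(x_4)$ used here, the beliefs should match the exact marginals once the messages have propagated across the graph.

```python
import itertools

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def sum_product(factors, n_states=2, n_iters=10):
    """Sum-product with a flooding schedule.

    factors: dict mapping a factor name to (variables, function), where
    `variables` is a tuple of variable ids and `function` returns a positive
    number given one state per variable.
    Returns beliefs b_i as a dict: variable id -> list of probabilities.
    """
    edges = [(a, v) for a, (vs, _) in factors.items() for v in vs]
    var_to_fac = {(v, a): [1.0] * n_states for a, v in edges}
    fac_to_var = {(a, v): [1.0] * n_states for a, v in edges}
    for _ in range(n_iters):
        new_var_to_fac = {}
        for a, v in edges:
            # m_{i->a}(x_i) = product over c in N(i)\a of m_{c->i}(x_i)
            msg = [1.0] * n_states
            for c, u in edges:
                if u == v and c != a:
                    msg = [p * q for p, q in zip(msg, fac_to_var[(c, u)])]
            new_var_to_fac[(v, a)] = normalize(msg)
        new_fac_to_var = {}
        for a, v in edges:
            # m_{a->i}(x_i) = sum over x_a\x_i of f_a(x_a) prod_{j != i} m_{j->a}(x_j)
            vs, f = factors[a]
            msg = [0.0] * n_states
            for xa in itertools.product(range(n_states), repeat=len(vs)):
                weight = f(*xa)
                for u, xu in zip(vs, xa):
                    if u != v:
                        weight *= var_to_fac[(u, a)][xu]
                msg[xa[vs.index(v)]] += weight
            new_fac_to_var[(a, v)] = normalize(msg)
        var_to_fac, fac_to_var = new_var_to_fac, new_fac_to_var
    beliefs = {}
    for v in {v for _, v in edges}:
        # b_i(x_i) is proportional to the product of all incoming factor messages
        b = [1.0] * n_states
        for a, u in edges:
            if u == v:
                b = [p * q for p, q in zip(b, fac_to_var[(a, u)])]
        beliefs[v] = normalize(b)
    return beliefs

# Hypothetical positive factors on binary variables {0, 1}.
FACTORS = {
    "A": ((1, 2), lambda x1, x2: 1.0 + x1 + 2.0 * x1 * x2),
    "B": ((2, 3, 4), lambda x2, x3, x4: 1.0 + x2 * x3 + 0.5 * x4),
    "C": ((4,), lambda x4: 0.5 + x4),
}
beliefs = sum_product(FACTORS)
```

Messages are normalized at every step only for numerical convenience; the beliefs are defined up to a constant anyway.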
Write down the energy function
Construct an approximation
Find the stationary condition
Variational free energy
- A variational method approximates an intractable distribution $P(X)$ of random variables $X = (S_1, \dots, S_N)$ by a tractable distribution $Q(X)$.
- $Q$ is chosen to minimize a distance measure, here the Kullback-Leibler divergence:
$$KL(Q \| P) = \sum_X Q(X) \ln \frac{Q(X)}{P(X)} = \Big\langle \ln \frac{Q}{P} \Big\rangle_Q$$
where $\langle \cdot \rangle_Q$ denotes the expectation with respect to $Q$.
Variational free energy
To find the best approximation to $P = \frac{e^{-H(X)}}{Z}$, write
$$KL(Q \| P) = \ln Z + E[Q] - S[Q]$$
where
- $S[Q] = -\sum_X Q(X) \ln Q(X)$ is the entropy of $Q$
- $E[Q] = \sum_X Q(X) H(X)$ is called the average energy
$$\Longrightarrow \min_Q KL(Q \| P) = \ln Z + \min_Q \underbrace{(E[Q] - S[Q])}_{\text{variational free energy}}$$
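The decomposition $KL(Q\|P) = \ln Z + E[Q] - S[Q]$ can be checked numerically on a toy system. The sketch below is illustrative only; the two-spin energy table and the trial distribution `Q` are made-up values.

```python
import itertools
import math

def kl_decomposition(energy, Q):
    """Return (KL(Q||P), ln Z, E[Q], S[Q]) for P(x) = exp(-energy[x]) / Z."""
    Z = sum(math.exp(-h) for h in energy.values())
    P = {x: math.exp(-h) / Z for x, h in energy.items()}
    kl = sum(q * math.log(q / P[x]) for x, q in Q.items() if q > 0)
    avg_energy = sum(q * energy[x] for x, q in Q.items())
    entropy = -sum(q * math.log(q) for q in Q.values() if q > 0)
    return kl, math.log(Z), avg_energy, entropy

# Hypothetical example: two spins with H(x) = -0.5 * x1 * x2.
STATES = list(itertools.product((-1, +1), repeat=2))
ENERGY = {x: -0.5 * x[0] * x[1] for x in STATES}
Q = {(-1, -1): 0.4, (-1, 1): 0.1, (1, -1): 0.2, (1, 1): 0.3}
kl, log_Z, avg_energy, entropy = kl_decomposition(ENERGY, Q)
```

Because $\ln Z$ does not depend on $Q$, minimizing the KL divergence and minimizing $E[Q] - S[Q]$ select the same $Q$.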
Variational free energy for Ising model
- The model under consideration is a Boltzmann machine:
$$P(X) = \frac{e^{-H(X)}}{Z} = \frac{1}{Z} \exp\Big( \sum_{(ij)} J_{ij} x_i x_j + \sum_i \theta_i x_i \Big)$$
- For binary variables it is convenient to reparametrize the marginals as follows:
$$p_i(x_i = 1) = \frac{1 + m_i}{2}$$
Mean Field approximation
Find the factorized distribution that best describes the true distribution.
- For binary variables the most general factorized distribution has the form
$$Q_{MF}(x) = \prod_i Q_i(x_i) = \prod_i \frac{1 + x_i m_i}{2}$$
- $KL(Q_{MF} \| P) = E(Q_{MF}) - S(Q_{MF}) + \log Z$
- $E(Q_{MF}) = \sum_x Q_{MF}(x) H(x) = -\sum_{(ij)} J_{ij} m_i m_j - \sum_i \theta_i m_i$
- $S(Q_{MF}) = -\sum_x Q_{MF}(x) \ln Q_{MF}(x) = -\sum_i \Big( \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \Big)$
Mean Field approximation
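How to solve
$$\min_{m} KL(Q_{MF} \| P) \; ?$$
- Take the derivative with respect to $m_i$:
$$\frac{\partial}{\partial m_i} \Big\{ -\sum_{(ij)} J_{ij} m_i m_j - \sum_i \theta_i m_i + \sum_i \Big( \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \Big) + \log Z \Big\}$$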
Mean Field fixed points
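$$\frac{\partial KL}{\partial m_i} = -\sum_{j \in N(i)} J_{ij} m_j - \theta_i + \frac{1}{2} \log \frac{1 + m_i}{1 - m_i}$$
- Fixed points of the MF approximation: setting the derivative to zero gives
$$m_i = \frac{\exp\big(\sum_j J_{ij} m_j + \theta_i\big) - \exp\big(-\sum_j J_{ij} m_j - \theta_i\big)}{\exp\big(\sum_j J_{ij} m_j + \theta_i\big) + \exp\big(-\sum_j J_{ij} m_j - \theta_i\big)}$$
$$\Rightarrow m_i = \tanh\Big( \sum_j J_{ij} m_j + \theta_i \Big), \quad i = 1, \dots, N$$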
Mean Field
$$m_i = \tanh\Big( \sum_j J_{ij} m_j + \theta_i \Big), \quad i = 1, \dots, N \qquad (1)$$
Note
- The intractable task of computing marginals has been replaced by the problem of solving a set of nonlinear equations.
- These MF equations can be run sequentially, i.e. we fix all $m_j$ except $m_i$.
- At each step the MF free energy is convex in $m_i$, and Equation (1) finds its minimum in one step.
- This procedure can be interpreted as coordinate descent in the $m_i$.
- Alternatively, all parameters $m_i$ can be updated in parallel.
  - This does not guarantee that the cost function decreases at each iteration.
- There might be many solutions to (1).
- Some of the solutions may not be local minima.
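The sequential update of Equation (1) can be sketched as below, with a brute-force routine for comparison on small models. This is a minimal illustration, assuming a symmetric coupling matrix with zero diagonal; the function names and the 3-spin parameters are made up for the example.

```python
import itertools
import math

def mean_field_magnetizations(J, theta, sweeps=200):
    """Coordinate-wise iteration of m_i = tanh(sum_j J_ij m_j + theta_i)."""
    n = len(theta)
    m = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # All m_j with j != i are held fixed while m_i is updated.
            field = theta[i] + sum(J[i][j] * m[j] for j in range(n) if j != i)
            m[i] = math.tanh(field)
    return m

def exact_magnetizations(J, theta):
    """<x_i> under P(x) proportional to exp(sum_{i<j} J_ij x_i x_j + sum_i theta_i x_i)."""
    n = len(theta)
    Z = 0.0
    totals = [0.0] * n
    for x in itertools.product((-1, +1), repeat=n):
        log_w = sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
        log_w += sum(t * xi for t, xi in zip(theta, x))
        w = math.exp(log_w)
        Z += w
        for i in range(n):
            totals[i] += w * x[i]
    return [t / Z for t in totals]

# Hypothetical weakly coupled 3-spin model.
J = [[0.0, 0.2, 0.2], [0.2, 0.0, 0.2], [0.2, 0.2, 0.0]]
theta = [0.3, -0.1, 0.2]
m_mf = mean_field_magnetizations(J, theta)
m_exact = exact_magnetizations(J, theta)
```

A fixed number of sweeps is used here for simplicity; a practical version would more likely stop when the largest change in any $m_i$ falls below a tolerance.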
Mean Field
- In the $d$-dimensional Ising model without an external magnetic field ($\theta = 0$) and with the same interaction $J_{ij} = \alpha$ on every edge:
$$m^{(t+1)} = \tanh\big( 2 d \alpha\, m^{(t)} \big)$$
- For $\alpha < \frac{1}{2d}$, the iteration converges to $\lim_{t \to \infty} m^{(t)} = 0$ (left figure).
- For $\alpha > \frac{1}{2d}$, if $m^{(0)} \lessgtr 0$ then $\lim_{t \to \infty} m^{(t)} = \mp m^*$.
[4] A. Montanari, Lecture notes for inference in graphical models, 2011
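The two regimes of the scalar recursion are easy to reproduce numerically. The sketch below is illustrative; the function name and the parameter values ($d = 2$, $\alpha = 0.2$ and $\alpha = 0.5$) are arbitrary choices on either side of the threshold $1/(2d) = 0.25$.

```python
import math

def iterate_magnetization(alpha, d, m0, steps=500):
    """Iterate m <- tanh(2 * d * alpha * m) starting from m0."""
    m = m0
    for _ in range(steps):
        m = math.tanh(2 * d * alpha * m)
    return m

# Below the threshold the magnetization decays to zero.
m_high_temp = iterate_magnetization(alpha=0.2, d=2, m0=0.5)

# Above the threshold the sign of the starting point selects +m* or -m*.
m_plus = iterate_magnetization(alpha=0.5, d=2, m0=0.1)
m_minus = iterate_magnetization(alpha=0.5, d=2, m0=-0.1)
```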
Mean Field
- MF neglects the dependencies between the random variables.
However,
- We get an upper bound on the exact free energy:
$$KL(Q_{MF} \| P) = \underbrace{E(Q_{MF}) - S(Q_{MF})}_{= F[Q_{MF}],\ \text{variational MF free energy}} - \underbrace{(-\log Z)}_{\text{exact free energy}}$$
Since $KL(Q_{MF} \| P) \geq 0$,
$$F[Q_{MF}] \geq -\log Z$$
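The bound $F[Q_{MF}] \geq -\log Z$ holds for every choice of magnetizations, not only at the fixed point, which can be checked on a small model. The code below is a sketch under assumed names and parameter values; it requires $|m_i| < 1$ so that the logarithms are defined.

```python
import itertools
import math

def negative_entropy_term(m):
    """Negative entropy of one +/-1 spin with mean m, valid for |m| < 1."""
    p, q = (1 + m) / 2, (1 - m) / 2
    return p * math.log(p) + q * math.log(q)

def mf_free_energy(J, theta, m):
    """F[Q_MF] = E(Q_MF) - S(Q_MF) for a product distribution with means m."""
    energy = -sum(Jij * m[i] * m[j] for (i, j), Jij in J.items())
    energy -= sum(t * mi for t, mi in zip(theta, m))
    return energy + sum(negative_entropy_term(mi) for mi in m)

def exact_free_energy(J, theta):
    """-log Z by enumeration over all spin configurations."""
    Z = 0.0
    for x in itertools.product((-1, +1), repeat=len(theta)):
        log_w = sum(Jij * x[i] * x[j] for (i, j), Jij in J.items())
        log_w += sum(t * xi for t, xi in zip(theta, x))
        Z += math.exp(log_w)
    return -math.log(Z)

# Hypothetical 3-spin chain, couplings given per edge (i, j).
J_EDGES = {(0, 1): 0.8, (1, 2): -0.5}
THETA = [0.2, 0.0, -0.3]
```

With no couplings the true distribution is itself a product, so the bound should be tight at $m_i = \tanh\theta_i$.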
Mean Field Method in general
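- $P(x) = \frac{1}{Z} \prod_{a \in F} f_a(x_a)$ is the true distribution.
- $Q(x) = \prod_i q_i(x_i)$ is the approximate distribution.
$$F_{MF}(Q) = -\sum_i S(q_i) - \sum_{a \in F} \sum_{x_a} \prod_{i \in N(a)} q_i(x_i) \log f_a(x_a)$$
- The number of free parameters drops from $|\mathcal{X}|^n - 1$ to $n(|\mathcal{X}| - 1)$.
- $F_{MF}$ is no longer convex.
$$\min_Q F_{MF}(Q) \quad \text{subject to} \quad \sum_{x_i} q_i(x_i) = 1$$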
Mean Field Method in general
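- Add a Lagrange multiplier $\lambda_i$ for each normalization constraint.
- Find the stationary condition from $\frac{\partial L(Q, \lambda)}{\partial q_i(x_i)} = 0$:
$$q_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i)$$
where
$$m_{a \to i}(x_i) = \exp\Big( \sum_{x_j : j \in N(a) \setminus i} \log f_a(x_a) \prod_{j \in N(a) \setminus i} q_j(x_j) \Big)$$
- A simple greedy algorithm for finding a stationary point consists of updating the $q_i$ by iterating the above equations until convergence.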
TAP approximation
The Legendre Transform and Plefka’s Expansion
Plefka Expansion
- Do not restrict the approximate distribution $Q$ to product distributions.
- Minimize the free energy in two steps:
  - Constrained minimization over the family of distributions satisfying $\langle X \rangle_Q = m$ for fixed $m$:
$$G(m) = \min_Q \{ F[Q] = E[Q] - S[Q] \mid \langle X \rangle_Q = m \}$$
  - Minimize $G(m)$ with respect to $m$.
Plefka Expansion
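$$G(m) = \min_Q \{ F[Q] \mid \langle X \rangle_Q = m \}$$
Adding Lagrange multipliers $\lambda$ gives the Lagrangian
$$G(m, \lambda) = E[Q] - S[Q] - \sum_i \lambda_i \big( \langle x_i \rangle_Q - m_i \big)$$
$$G(m, \lambda) = \sum_X Q(X) H(X) - S[Q] - \sum_X \sum_i \lambda_i x_i Q(X) + \sum_i \lambda_i m_i$$
This has the form of a variational free energy in which $H(X)$ is replaced by $H(X) - \sum_i \lambda_i x_i$. We can construct such a Gibbs free energy by adding a set of external auxiliary fields:
$$\Rightarrow Q_\lambda(X) = \frac{1}{Z(\lambda)}\, e^{-H(X) + \sum_i \lambda_i x_i}$$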
Plefka Expansion
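The dual function is
$$G(m) = \max_{\lambda} \Big\{ \sum_i \lambda_i m_i - \log Z(\lambda) \Big\}$$
- This equation is known as the Legendre transform between $\{\lambda_i\}$ and $\{m_i\}$.
- $Z(\lambda)$ is the normalizing constant of the Gibbs distribution
$$Q_\lambda(X) = \frac{1}{Z(\lambda)}\, e^{-H(X) + \sum_i \lambda_i x_i} = \frac{1}{Z(\lambda)} \exp\Big( \sum_{(ij)} J_{ij} x_i x_j + \sum_i \theta_i x_i + \sum_i \lambda_i x_i \Big)$$
- Set $\theta \to 0$ by shifting the Lagrange multipliers $\lambda_i \to \lambda_i - \theta_i$.
- $Z(\lambda) = \sum_X \exp\Big( \sum_{(ij)} J_{ij} x_i x_j + \sum_i \lambda_i x_i \Big)$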
Plefka Expansion
$$G(m) = \max_{\lambda} \Big\{ \sum_i \lambda_i m_i - \log \sum_X \exp\Big( \beta \sum_{(ij)} J_{ij} x_i x_j + \sum_i \lambda_i x_i \Big) \Big\}$$
- The Plefka expansion is obtained by replacing $J_{ij} \to \beta J_{ij}$ and Taylor expanding the Gibbs free energy around $\beta = 0$, where $\beta$ plays the role of an inverse temperature in physics.
Notice
- For each term in the Taylor expansion, one has to expand the maximizing Lagrange multipliers $\lambda_i(\beta)$ as well as $\log Z$.
- The auxiliary field is temperature dependent.
Plefka Expansion
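- With $G_n = \frac{\partial^n}{\partial \beta^n} G(m) \big|_{\beta = 0}$:
$$G(m) = G_0(m) + \beta G_1(m) + \frac{\beta^2}{2!} G_2(m) + \dots$$
- $G_0(m) = \sum_i \Big\{ \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \Big\}$. At $\beta = 0$ the spins are entirely controlled by the auxiliary field.
- $G_1(m) = -\sum_{i < j} J_{ij} m_i m_j$
- $G_2(m) = -\frac{1}{2} \sum_{ij} J_{ij}^2 (1 - m_i^2)(1 - m_j^2)$
- ...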
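The zeroth-order term can be obtained in a few lines, which may help in seeing where the entropy-like expression comes from. The derivation below is a worked sketch (with $\theta$ already absorbed into $\lambda$), not taken from the slides: at $\beta = 0$ the spins decouple, so the Legendre transform can be evaluated in closed form.

```latex
\begin{aligned}
Z(\lambda)\big|_{\beta=0} &= \sum_X \exp\Big(\sum_i \lambda_i x_i\Big) = \prod_i 2\cosh\lambda_i, \\
\frac{\partial}{\partial \lambda_i}\Big[\sum_k \lambda_k m_k - \log Z(\lambda)\Big] = 0
  &\;\Longrightarrow\; m_i = \tanh\lambda_i, \qquad \lambda_i = \tfrac{1}{2}\ln\frac{1+m_i}{1-m_i}, \\
G_0(m) &= \sum_i \Big[ m_i \lambda_i - \ln\big(2\cosh\lambda_i\big) \Big] \\
       &= \sum_i \Big[ \tfrac{m_i}{2}\ln\frac{1+m_i}{1-m_i} - \ln 2 + \tfrac{1}{2}\ln\big(1-m_i^2\big) \Big] \\
       &= \sum_i \Big[ \frac{1+m_i}{2}\ln\frac{1+m_i}{2} + \frac{1-m_i}{2}\ln\frac{1-m_i}{2} \Big].
\end{aligned}
```

The third line uses $\cosh(\operatorname{artanh} m) = (1 - m^2)^{-1/2}$.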
Plefka Expansion
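With $G_n = \frac{\partial^n}{\partial \beta^n} G(m) \big|_{\beta = 0}$:
$$G(m) = G_0(m) + \beta G_1(m) + \frac{\beta^2}{2!} G_2(m) + \dots$$
- $G_0(m) = \sum_i \Big\{ \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \Big\}$ $\Rightarrow$ MF variational entropy
- $G_1(m) = -\sum_{i < j} J_{ij} m_i m_j$ $\Rightarrow$ MF variational energy
- $G_2(m) = -\frac{1}{2} \sum_{ij} J_{ij}^2 (1 - m_i^2)(1 - m_j^2)$
- ... $\Rightarrow$ takes into account the higher-order dependencies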
TAP approximation
TAP approximation = minimizing $G(m)$ at $\beta = 1$, keeping only terms up to second order:
$$G_{TAP}(m) = -\sum_{(ij)} J_{ij} m_i m_j + \sum_i \Big\{ \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \Big\} \underbrace{- \frac{1}{2} \sum_{(ij)} J_{ij}^2 (1 - m_i^2)(1 - m_j^2)}_{\text{dependencies between r.v.s}}$$
- TAP takes into account the dependencies between random variables.
- It is exact in the high-temperature regime for certain classes of models (SK models).
TAP approximation
Fixed points of the TAP approximation (setting $\partial G_{TAP} / \partial m_i = 0$ for $x_i \in \{+1, -1\}$):
$$m_i = \tanh\Big( \sum_{j \in N(i)} J_{ij} m_j - m_i \sum_{j \in N(i)} J_{ij}^2 (1 - m_j^2) \Big)$$
- Running these equations does not guarantee that the TAP Gibbs free energy decreases ($m_i$ appears on both sides).
- There is a danger that the radius of convergence of the Taylor expansion will be too small to obtain results for the values of $\beta$ we are interested in.
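A damped iteration of the TAP equations for $\pm 1$ spins might look as follows. This is only a sketch: the damping factor, the zero initialization, the explicit field term `theta` (which the slides absorb into the Lagrange multipliers), and the example parameters are all assumptions, and convergence is not guaranteed in general.

```python
import math

def tap_magnetizations(J, theta, sweeps=500, damping=0.5):
    """Damped iteration of
    m_i = tanh(theta_i + sum_j J_ij m_j - m_i sum_j J_ij^2 (1 - m_j^2)).

    J is a symmetric coupling matrix (list of lists) with zero diagonal.
    """
    n = len(theta)
    m = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            others = [j for j in range(n) if j != i]
            mean_field = sum(J[i][j] * m[j] for j in others)
            # Second-order (Onsager reaction) correction from the G_2 term.
            reaction = m[i] * sum(J[i][j] ** 2 * (1 - m[j] ** 2) for j in others)
            target = math.tanh(theta[i] + mean_field - reaction)
            m[i] = (1 - damping) * m[i] + damping * target
    return m

# Hypothetical weakly coupled 3-spin model.
J = [[0.0, 0.2, 0.2], [0.2, 0.0, 0.2], [0.2, 0.2, 0.0]]
theta = [0.3, -0.1, 0.2]
m_tap = tap_magnetizations(J, theta)
```

Damping is used because $m_i$ appears on both sides of the update; with `damping=1.0` the loop reduces to plain substitution.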
Outline
- Standard BP algorithm
- Junction tree algorithm
- Region-based free energy
  - Different types of region graph
- Special case: Bethe free energy
- Stationary points of Bethe free energy = BP fixed points
- Generalized belief propagation (GBP)
  - Stationary points of the region-based free energy approximation
Message Passing - Computing the marginals
$$p(x_1, x_2, x_3, x_4) = \frac{1}{Z}\, f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, f_C(x_4)$$
$$b_1(x_1) = p(x_1) = \; ?$$
Message Passing
Unrolling the messages from variable 1 back toward the leaves, one step at a time:
$$\begin{aligned}
b_1(x_1) &\propto m_{A \to 1}(x_1) \\
&= \sum_{x_2} f_A(x_1, x_2)\, m_{2 \to A}(x_2) \\
&= \sum_{x_2} f_A(x_1, x_2)\, m_{B \to 2}(x_2) \\
&= \sum_{x_2, x_3, x_4} f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, m_{3 \to B}(x_3)\, m_{4 \to B}(x_4) \\
&= \sum_{x_2, x_3, x_4} f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, m_{4 \to B}(x_4) \\
&= \sum_{x_2, x_3, x_4} f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, m_{C \to 4}(x_4) \\
&= \sum_{x_2, x_3, x_4} f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, f_C(x_4)
\end{aligned}$$
Here $m_{3 \to B}(x_3) = 1$ because variable 3 is a leaf, and a variable with a single other neighbor simply forwards the incoming message.
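The chain of substitutions can be replayed numerically: compute the messages from the leaves inward and compare the result with the marginal obtained by summing the full joint. The factor tables below are hypothetical positive functions chosen only for illustration.

```python
import itertools

STATES = (0, 1)

# Hypothetical positive factors on binary variables.
def f_A(x1, x2):
    return 1.0 + x1 + 2.0 * x1 * x2

def f_B(x2, x3, x4):
    return 1.0 + x2 * x3 + 0.5 * x4

def f_C(x4):
    return 0.5 + x4

def b1_by_messages():
    m_C4 = {x4: f_C(x4) for x4 in STATES}      # m_{C->4}
    m_4B = m_C4                                 # variable 4 forwards its only other message
    m_3B = {x3: 1.0 for x3 in STATES}           # leaf variable sends the constant message
    m_B2 = {x2: sum(f_B(x2, x3, x4) * m_3B[x3] * m_4B[x4]
                    for x3 in STATES for x4 in STATES)
            for x2 in STATES}                   # m_{B->2}
    m_2A = m_B2                                 # variable 2 forwards m_{B->2}
    m_A1 = {x1: sum(f_A(x1, x2) * m_2A[x2] for x2 in STATES) for x1 in STATES}
    Z = sum(m_A1.values())
    return {x1: m_A1[x1] / Z for x1 in STATES}

def b1_by_enumeration():
    totals = {x1: 0.0 for x1 in STATES}
    for x1, x2, x3, x4 in itertools.product(STATES, repeat=4):
        totals[x1] += f_A(x1, x2) * f_B(x2, x3, x4) * f_C(x4)
    Z = sum(totals.values())
    return {x1: t / Z for x1, t in totals.items()}
```

The message version performs the same sums as the enumeration, only grouped so that each factor is summed over its own variables once.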
Junction Tree algorithm
- Works for general graphs
  - Tree-shaped graphs
  - Graphs with cycles
  - Directed graphs
  - Undirected graphs
- Remove cycles by clustering nodes into cliques.
- Perform belief propagation on the cliques.
- Exact inference of (clique) marginals.
Junction Tree algorithm - Moralization
- We first moralize the graph by connecting all unconnected parents of each node, then drop the edge directions to obtain an undirected graph.
Junction Tree algorithm - Triangulation
- Triangulation: add edges so that every cycle of length four or more has a chord, i.e. an edge between two non-successive nodes in the cycle.
Junction Tree algorithm
Clique potentials:
$$\psi_{C_1}(x_A, x_B) = \psi_{A,B}(x_A, x_B)$$
$$\psi_{C_2}(x_B, x_C, x_F) = \psi_{B,C}(x_B, x_C)\, \psi_{C,F}(x_C, x_F)$$
$$\psi_{C_3}(x_C, x_F, x_G) = \psi_{C,F}(x_C, x_F)\, \psi_{F,G}(x_F, x_G)$$
$$\psi_{C_4}(x_C, x_D, x_G, x_H) = \psi_{C,D,H}(x_C, x_D, x_H)\, \psi_{D,G,H}(x_D, x_G, x_H)$$
$$\psi_{C_5}(x_C, x_E, x_H) = \psi_{C,E,H}(x_C, x_E, x_H)$$
Independence in junction tree
- Suppose
  - $T$ is a junction tree for graph $G$.
  - Consider cliques $C_i$ and $C_j$ with separator $S_{ij} = C_i \cap C_j$.
  - Variables $X$ and $Y$ are on opposite sides of the separator.
- Then $X$ and $Y$ are independent given $S_{ij}$.
Junction Tree algorithm
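Given a junction tree and potentials on the cliques, the message from clique $C_i$ to $C_j$ is
$$m_{ij}(x_{S_{ij}}) = \sum_{x_{C_i \setminus S_{ij}}} \psi_{C_i}(x_{C_i}) \prod_{k \in N(i) \setminus j} m_{ki}(x_{S_{ki}})$$
- $S_{ij}$: nodes shared by cliques $i$ and $j$
- $N(i)$: neighboring cliques of $i$
- The marginal distributions of the cliques and separators are
$$p(x_{C_i}) \propto \psi_{C_i}(x_{C_i}) \prod_{k \in N(i)} m_{ki}(x_{S_{ki}})$$
$$p(x_{S_{ij}}) \propto m_{ij}(x_{S_{ij}})\, m_{ji}(x_{S_{ij}})$$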
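The clique and separator formulas can be checked on the smallest possible junction tree, two cliques sharing one variable. The sketch below is illustrative; the cliques $C_1 = \{A, B\}$, $C_2 = \{B, C\}$ and the potential tables are hypothetical and unrelated to the five-clique example in the slides.

```python
import itertools

STATES = (0, 1)

# Hypothetical positive clique potentials with separator S_12 = {B}.
def psi_1(a, b):
    return 1.0 + 2.0 * a * b + 0.5 * b

def psi_2(b, c):
    return 2.0 - b * c + 0.5 * c

def normalize(table):
    total = sum(table.values())
    return {k: v / total for k, v in table.items()}

def messages():
    m12 = {b: sum(psi_1(a, b) for a in STATES) for b in STATES}  # sum out C_1 \ S_12
    m21 = {b: sum(psi_2(b, c) for c in STATES) for b in STATES}  # sum out C_2 \ S_12
    return m12, m21

def separator_marginal():
    m12, m21 = messages()
    return normalize({b: m12[b] * m21[b] for b in STATES})

def clique_1_marginal():
    _, m21 = messages()
    return normalize({(a, b): psi_1(a, b) * m21[b] for a in STATES for b in STATES})

def joint():
    return normalize({(a, b, c): psi_1(a, b) * psi_2(b, c)
                      for a, b, c in itertools.product(STATES, repeat=3)})
```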
Junction Tree algorithm
- $m_{12}(x_B) = \sum_{x_A} \psi_{C_1}(x_A, x_B)$
- $m_{23}(x_C, x_F) = \sum_{x_B} \psi_{C_2}(x_B, x_C, x_F)\, m_{12}(x_B)$
- $m_{34}(x_C, x_G) = \sum_{x_F} \psi_{C_3}(x_C, x_F, x_G)\, m_{23}(x_C, x_F)$
- $m_{45}(x_C, x_H) = \sum_{x_D, x_G} \psi_{C_4}(x_C, x_D, x_G, x_H)\, m_{34}(x_C, x_G)$
Variational free energy
To find the best approximation to $P = \frac{1}{Z} \prod_{c \in \text{cliques}} \phi_c(x_c)$, write
$$KL(Q \| P) = \sum_x Q(x) \ln Q(x) - \sum_x Q(x) \ln P(x)$$
where
- $H[Q] = -\sum_x Q(x) \ln Q(x)$ is the entropy of $Q$
- $U[Q] = -\sum_{c \in \text{cliques}} \sum_{x_c} Q(x_c) \ln \phi_c(x_c)$ is called the average energy
$$\Longrightarrow \min_Q KL(Q \| P) = \ln Z + \min_Q \underbrace{(U[Q] - H[Q])}_{\text{variational free energy}}$$
Variational Free energy
- Two approaches to solving
$$\min_Q F[Q]$$
  - Approximate $F[Q]$
    - Region-based approximation $\Longrightarrow F_R(q_R)$
  - Choose a simpler form of $Q$
    - Mean field approximation $\Longrightarrow Q = \prod_i q_i$
Region Based free energy
- We decompose the system into subsystems and then approximate the free energy by combining the free energies of the subsystems.
- Group nodes into (possibly overlapping) clusters, called regions.
- In each region, all variable nodes connected to any included factor node are included.
- The sets of nodes $\{1, 2\}$ and $\{B, C, 2, 3, 4\}$ could be regions.
- $\{B, 3\}$ could not be a region.
Region Based free energy
- The overall free energy is the sum of the free energies of all the regions.
- If some of the large regions overlap, subtract out the free energies of the overlap regions.
- Each factor node and variable node should be counted exactly once.
- For every factor node $a$ and every variable node $i$, the counting numbers $c_R$ of a set of regions $\mathcal{R}$ must satisfy
$$\sum_{R \in \mathcal{R}} c_R\, \mathbb{I}(a \in F_R) = \sum_{R \in \mathcal{R}} c_R\, \mathbb{I}(i \in V_R) = 1$$
where $\mathbb{I}(x \in S) = 1$ if $x \in S$ and $0$ otherwise.
Region Based free energy
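- The region-based free energy for a set of regions $\mathcal{R}$ is
$$F_{\mathcal{R}}(\{b_R\}) = U_{\mathcal{R}}(\{b_R\}) - H_{\mathcal{R}}(\{b_R\})$$
- Count every node once.
- $U_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R U_R(b_R) \Longrightarrow$ region-based average energy
- $H_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R H_R(b_R) \Longrightarrow$ region-based approximate entropy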
Region Based free energy
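If
$$\sum_{R \in \mathcal{R}} c_R\, \mathbb{I}(a \in F_R) = 1 \quad \text{for all } a \in F$$
and $b_R(x_R) = p_R(x_R)$, then the average energy becomes exact:
$$U_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R U_R(b_R) = -\sum_{R \in \mathcal{R}} c_R \sum_{x_R} b_R(x_R) \sum_{a \in F_R} \ln f_a(x_a)$$
Exact energy:
$$U = \sum_{x \in S} p(x) E(x) = -\sum_a \sum_{x_a} p_a(x_a) \ln f_a(x_a)$$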
Region Based free energy
- Counting each variable node and factor node exactly once makes the average energy exact.
- However, the region-based entropy is still an approximation:
$$H_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R H_R(b_R) = -\sum_{R \in \mathcal{R}} c_R \sum_{x_R} b_R(x_R) \ln b_R(x_R)$$
- We are interested in the accuracy of $H_{\mathcal{R}}$ near its maximum:
$$\min_{b_R} F_{\mathcal{R}}(\{b_R\}) = \min_{b_R} \{ U_{\mathcal{R}}(\{b_R\}) - H_{\mathcal{R}}(\{b_R\}) \}$$
- $H_{\mathcal{R}}$ should achieve its maximum when all beliefs $b_R(x_R)$ are uniform (maxent-normal).
Bethe Free energy
Regions are $\mathcal{R} = \{R_i, R_a : i \in V, a \in F\}$
- $R_i = (\{i\}, \emptyset, \emptyset)$
- $R_a = (N(a), \{a\}, \{(i, a) : i \in N(a)\})$
- Large regions contain a single factor node $a$ and all attached variable nodes: $c_R = 1$.
- Small regions contain a single variable node: $c_R = 1 - d_i$, where $d_i = |N(i)|$.
- $R_1$ is a subregion of $R_2$ if $R_1 \subset R_2$.
Bethe Free energy
- Bethe region graph for the following factor graph (figure omitted)
Bethe Free energy
$$c_r = 1 \quad \text{for } r \in R_a$$
$$c_r = 1 - d_i \quad \text{for } r \in R_i$$
- Assigning counting numbers to the regions.
- Every variable node and factor node is counted once.
Bethe Free energy
- Bethe free energy:
$$F_{Bethe} = U_{Bethe} - H_{Bethe}$$
- Bethe average energy:
$$U_{Bethe} = -\sum_a \sum_{x_a} b_a(x_a) \ln f_a(x_a)$$
- Bethe entropy:
$$H_{Bethe} = -\sum_a \sum_{x_a} b_a(x_a) \ln b_a(x_a) + \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i)$$
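On a factor graph without cycles, plugging the exact marginals into these formulas should reproduce the exact free energy $-\ln Z$. The sketch below checks this on the tree $f_A(x_1,x_2) f_B(x_2,x_3,x_4) f_C(x_4)$; the factor values and helper names are hypothetical.

```python
import itertools
import math

VARIABLES = (1, 2, 3, 4)
# Hypothetical positive factors on binary variables: name -> (variables, function).
FACTORS = {
    "A": ((1, 2), lambda x1, x2: 1.0 + x1 + 2.0 * x1 * x2),
    "B": ((2, 3, 4), lambda x2, x3, x4: 1.0 + x2 * x3 + 0.5 * x4),
    "C": ((4,), lambda x4: 0.5 + x4),
}

def exact_joint():
    weights = {}
    for x in itertools.product((0, 1), repeat=len(VARIABLES)):
        value = dict(zip(VARIABLES, x))
        w = 1.0
        for vs, f in FACTORS.values():
            w *= f(*(value[v] for v in vs))
        weights[x] = w
    Z = sum(weights.values())
    return Z, {x: w / Z for x, w in weights.items()}

def marginal(p, vs):
    out = {}
    for x, px in p.items():
        key = tuple(x[VARIABLES.index(v)] for v in vs)
        out[key] = out.get(key, 0.0) + px
    return out

def bethe_free_energy(p):
    """F_Bethe = U_Bethe - H_Bethe with beliefs set to the marginals of p."""
    U, H = 0.0, 0.0
    for vs, f in FACTORS.values():
        b_a = marginal(p, vs)
        U -= sum(b * math.log(f(*xa)) for xa, b in b_a.items())
        H -= sum(b * math.log(b) for b in b_a.values())
    for v in VARIABLES:
        d = sum(v in vs for vs, _ in FACTORS.values())  # degree d_i = |N(i)|
        b_i = marginal(p, (v,))
        H += (d - 1) * sum(b * math.log(b) for b in b_i.values())
    return U - H

Z, p = exact_joint()
F_bethe = bethe_free_energy(p)
```

On graphs with cycles the same function still evaluates, but the entropy term is then only an approximation.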
Bethe Free energy - Maxent normal
- The global maximum of the Bethe entropy is achieved when the beliefs $b_i(x_i)$, $b_a(x_a)$ are uniform.
$$H_{Bethe} = \sum_i H(b_i) - \sum_a I(b_a)$$
where
$$H(b_i) = -\sum_{x_i} b_i(x_i) \ln b_i(x_i)$$
$$I(b_a) = \sum_{x_a} b_a(x_a) \ln b_a(x_a) + \sum_{i \in N(a)} H(b_i)$$
- The maximum of $H(b_i)$ is achieved when $b_i(x_i)$ is the uniform distribution.
- $I(b_a) \geq 0$, and $I(b_a) = 0$ when the beliefs are uniform.
Constrained Bethe free energy
The constrained Bethe free energy requires the beliefs to obey:
- Normalization constraints:
$$\sum_{x_i} b_i(x_i) = 1, \qquad \sum_{x_a} b_a(x_a) = 1$$
- Consistency constraints:
$$\sum_{x_a \setminus x_i} b_a(x_a) = b_i(x_i)$$
- Inactive constraints ($\Rightarrow$ complementary slackness):
$$0 \leq b_i(x_i) \leq 1, \qquad 0 \leq b_a(x_a) \leq 1$$
Minimizing Constrained Bethe free energy
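Theorem: Stationary points of the constrained Bethe free energy are BP fixed points.
$$\begin{aligned}
\underset{b}{\text{minimize}} \quad & F_{Bethe} \\
\text{subject to} \quad & \sum_{x_i} b_i(x_i) = 1, \quad \sum_{x_a} b_a(x_a) = 1, \\
& \sum_{x_a \setminus x_i} b_a(x_a) = b_i(x_i), \\
& b_a(x_a),\ b_i(x_i) \geq 0
\end{aligned}$$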
Minimizing Constrained Bethe free energy
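- Lagrangian:
$$L = F_{Bethe} + \sum_i \gamma_i \Big\{ \sum_{x_i} b_i(x_i) - 1 \Big\} + \sum_a \sum_{i \in N(a)} \sum_{x_i} \lambda_{ai}(x_i) \Big\{ b_i(x_i) - \sum_{x_a \setminus x_i} b_a(x_a) \Big\}$$
- $\frac{\partial L}{\partial b_i(x_i)} = 0 \Longrightarrow b_i(x_i) \propto \exp\Big( \frac{1}{d_i - 1} \sum_{a \in N(i)} \lambda_{ai}(x_i) \Big)$
- $\frac{\partial L}{\partial b_a(x_a)} = 0 \Longrightarrow b_a(x_a) \propto \exp\Big( -E_a(x_a) + \sum_{i \in N(a)} \lambda_{ai}(x_i) \Big)$
where $E_a(x_a) = -\ln f_a(x_a)$; the multipliers $\gamma_i$ and the additive constants are absorbed into the normalization.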
Bethe Fixed points
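Define
$$\lambda_{ai}(x_i) = \ln \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i)$$
to obtain the BP equations:
$$b_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i)$$
$$b_a(x_a) \propto f_a(x_a) \prod_{i \in N(a)} \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i)$$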
Unrealizable beliefs
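- $b_A(x_1, x_2) = \begin{pmatrix} 0.4 & 0.1 \\ 0.1 & 0.4 \end{pmatrix}$
- $b_B(x_2, x_3) = \begin{pmatrix} 0.4 & 0.1 \\ 0.1 & 0.4 \end{pmatrix}$
- $b_C(x_1, x_3) = \begin{pmatrix} 0.1 & 0.4 \\ 0.4 & 0.1 \end{pmatrix}$
- $b_1(x_1) = b_2(x_2) = b_3(x_3) = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}$
- There is no joint distribution $b(x_1, x_2, x_3)$ with these marginals!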
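One way to see that no joint distribution exists, offered here as a supporting argument rather than something stated on the slide: under any joint, the event $x_1 \neq x_3$ implies $x_1 \neq x_2$ or $x_2 \neq x_3$, so $P(x_1 \neq x_3) \leq P(x_1 \neq x_2) + P(x_2 \neq x_3)$. The beliefs above are locally consistent yet violate this inequality ($0.8 > 0.2 + 0.2$). The helper names in the sketch are illustrative.

```python
# Pairwise beliefs from the slide, indexed as b[first variable][second variable].
b_A = [[0.4, 0.1], [0.1, 0.4]]   # b_A[x1][x2]
b_B = [[0.4, 0.1], [0.1, 0.4]]   # b_B[x2][x3]
b_C = [[0.1, 0.4], [0.4, 0.1]]   # b_C[x1][x3]

def row_marginal(b):
    return [sum(row) for row in b]

def col_marginal(b):
    return [sum(col) for col in zip(*b)]

def prob_disagree(b):
    """Probability that the two variables of a pairwise belief differ."""
    return b[0][1] + b[1][0]

def locally_consistent():
    """All single-variable marginals implied by the pairwise beliefs equal (0.5, 0.5)."""
    singles = [row_marginal(b_A), col_marginal(b_A),
               row_marginal(b_B), col_marginal(b_B),
               row_marginal(b_C), col_marginal(b_C)]
    return all(abs(p - 0.5) < 1e-12 for s in singles for p in s)

def violates_union_bound():
    """True if P(x1 != x3) exceeds P(x1 != x2) + P(x2 != x3)."""
    return prob_disagree(b_C) > prob_disagree(b_A) + prob_disagree(b_B) + 1e-12
```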
Region based energy
- How to select a set of regions $\mathcal{R}$ and counting numbers $c_R$?
- Some methods are:
  - Bethe method
  - Junction graph method
  - Cluster variation method
  - Region graph method
Region Graph
- A region graph is a directed acyclic graph in which an edge $R \to R'$ implies $R' \subseteq R$.
- If there is a directed path from $R$ to $R'$, we say $R$ is an ancestor of $R'$, $R \in A(R')$, and $R'$ is a descendant of $R$, $R' \in D(R)$.
- In a region graph the counting numbers satisfy
$$c_R = 1 - \sum_{R' \in A(R)} c_{R'} \quad \text{for all } R \in \mathcal{R}$$
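The recursion for the counting numbers can be evaluated directly from the parent relation of the region graph. The sketch below uses the two-level Bethe region graph of the factor graph $f_A(x_1,x_2) f_B(x_2,x_3,x_4) f_C(x_4)$ as a hypothetical input; the dictionaries `PARENTS` and `REGION_VARS` and the function names are assumptions for illustration.

```python
def counting_numbers(parents):
    """c_R = 1 - sum of c_R' over all ancestors R' of R.

    parents: dict mapping each region to the set of its direct parents.
    """
    def ancestors(region):
        seen, stack = set(), list(parents[region])
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(parents[p])
        return seen

    counts = {}

    def count(region):
        if region not in counts:
            counts[region] = 1 - sum(count(a) for a in ancestors(region))
        return counts[region]

    for region in parents:
        count(region)
    return counts

def times_counted(variable, region_vars, counts):
    """Sum of c_R over all regions R that contain the variable."""
    return sum(c for r, c in counts.items() if variable in region_vars[r])

# Large regions A, B, C (one per factor) and small regions "1".."4" (one per variable).
PARENTS = {"A": set(), "B": set(), "C": set(),
           "1": {"A"}, "2": {"A", "B"}, "3": {"B"}, "4": {"B", "C"}}
REGION_VARS = {"A": {1, 2}, "B": {2, 3, 4}, "C": {4},
               "1": {1}, "2": {2}, "3": {3}, "4": {4}}
COUNTS = counting_numbers(PARENTS)
```

For this two-level graph the recursion reduces to the Bethe rule $c_R = 1 - d_i$ for the small regions.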
Region Graph Condition
- Every node is counted once:
$$\sum_{R \in \mathcal{R}} c_R\, \mathbb{I}(a \in F_R) = \sum_{R \in \mathcal{R}} c_R\, \mathbb{I}(i \in V_R) = 1$$
  $\Rightarrow$ ensures that the region graph average energy is exact.
- The regions containing a particular variable node form a connected subgraph $\Rightarrow$ the marginal probabilities are consistent.
Example of an invalid region graph
- This is not a valid region graph: variable 5 is not counted once.
Example of valid region graph
- Bethe region graph for the following factor graph (figure omitted)
Region graph
$$c_R = 1 - \sum_{R' \in A(R)} c_{R'} \quad \text{for all } R \in \mathcal{R}$$
Region graph
- Valid region graph (every node is counted once)
Generalized Belief Propagation
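- Theorem: The stationary points of the constrained region-based free energy for a valid region graph are the fixed points of generalized belief propagation for that region graph.
$$\begin{aligned}
\text{Stationary points of} \quad & F_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R F_R(b_R) \\
\text{subject to} \quad & \sum_{x_R} b_R(x_R) = 1 \quad \text{for all } R \in \mathcal{R}, \\
& \sum_{x_P \setminus x_C} b_P(x_P) = b_C(x_C) \quad \text{for all parent and child regions } P, C \in \mathcal{R}, \\
& b_R(x_R) \geq 0
\end{aligned}$$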
Generalized Belief Propagation
- The belief in a region is the product of:
  - Local information (factors in the region)
  - Messages from parent regions
  - Messages into descendant regions from parents that are not descendants.
- Message-update rules are obtained by enforcing the marginalization constraints.
Generalized Belief Propagation
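The belief in a region is:
$$b_R(x_R) \propto \prod_{a \in A_R} f_a(x_a) \times \underbrace{\Big( \prod_{P \in P(R)} m_{P \to R}(x_R) \Big)}_{\text{messages from parent regions}} \times \underbrace{\Big( \prod_{D \in D(R)} \prod_{P' \in P(D) \setminus \varepsilon(R)} m_{P' \to D}(x_D) \Big)}_{\text{messages into descendant regions from parents that are not descendants}}$$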
Generalized Belief Propagation
- Bethe region graph for the following graph (figure omitted)
[2] J. S. Yedidia, W. T. Freeman, and Y. Weiss, Constructing free-energy approximations and generalized belief propagation algorithms, 2005
Generalized Belief propagation
Use marginalization constraints to derive message-update rules
Thanks
Questions?