
Free Energy Approximation

Solmaz Torabi

Dept. of Electrical and Computer Engineering, Drexel University

Advisor: Dr. John M. Walsh

June 19, 2014


References

[1] M. Opper and D. Saad, "Advanced Mean Field Methods: Theory and Practice," MIT Press, 2001.

[2] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Constructing free-energy approximations and generalized belief propagation algorithms," IEEE Transactions on Information Theory, vol. 51, 2005.

[3] M. Welling and Y. W. Teh, "Approximate inference in Boltzmann machines," Artificial Intelligence, vol. 143, pp. 19–50, 2003.

[4] A. Montanari, "Lecture notes, inference in graphical models," 2011.

Outline

- Basics of graphical models
- Basics of the message passing algorithm
- Variational free energy
- Mean field approximation
- TAP (Thouless, Anderson and Palmer)
- Region-based approximation
- Bethe free energy
- Kikuchi approximation

Undirected graphical model, Markov random field

Undirected graphical model with random vector $X = (X_1, \dots, X_n)$

- Given an undirected graph $G = (V, E)$, each node $s$ has an associated random variable $X_s$.
- A clique $C \subseteq V$ is a fully connected subset of $V$.
- The distribution $p$ factorizes according to $G$ if it can be expressed as a product over cliques:

$$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)$$

Example:

$$p(x) = \frac{1}{Z}\, \psi_1(x_1, x_2, x_3)\, \psi_2(x_3, x_4, x_5)\, \psi_3(x_4, x_5, x_6)\, \psi_4(x_4, x_7)$$

Graphical model, factor graph

- A factor graph is a bipartite graph $G = (V, F, E)$, where $V$ is the original set of vertices, $F$ is the set of factors, and $(s, a) \in E$ if $x_s$ participates in the factor indexed by $a \in F$.
- We assume that the functions $f_a(x_a)$ are non-negative and finite.

$$P(x) = \frac{1}{Z} \prod_{a} f_a(x_a)$$

Example:

$$P(x) = \frac{1}{Z}\, f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, f_C(x_4)$$
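The factorization $P(x) = \frac{1}{Z} f_A(x_1, x_2) f_B(x_2, x_3, x_4) f_C(x_4)$ can be made concrete by brute-force enumeration. The sketch below assumes binary variables and uses made-up factor tables (the slides do not specify any); it only illustrates how $Z$ normalizes the product of factors.

```python
from itertools import product

# Hypothetical non-negative factors over binary variables (values are assumptions).
def f_A(x1, x2):
    return 2.0 if x1 == x2 else 1.0

def f_B(x2, x3, x4):
    return 1.0 + x2 + x3 * x4

def f_C(x4):
    return 3.0 if x4 == 1 else 1.0

def unnormalized(x):
    """Product of all factors for one joint configuration x = (x1, x2, x3, x4)."""
    x1, x2, x3, x4 = x
    return f_A(x1, x2) * f_B(x2, x3, x4) * f_C(x4)

states = list(product([0, 1], repeat=4))
Z = sum(unnormalized(x) for x in states)        # partition function
P = {x: unnormalized(x) / Z for x in states}    # normalized joint distribution
```

Enumeration costs $|\mathcal{X}|^n$ operations, which is why the rest of the deck looks for cheaper exact or approximate alternatives.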

Graphical model: undirected graph and factor graph

- Maximal cliques: $\mathcal{C} = \{\{1, 2, 3, 4\}, \{4, 5, 6\}, \{6, 7\}\}$
- Vertex set $V = \{1, \dots, 7\}$, factor set $F = \{a, b, c\}$

$$P(x) = \frac{1}{Z}\, f_a(x_1, x_2, x_3, x_4)\, f_b(x_4, x_5, x_6)\, f_c(x_6, x_7)$$

Pairwise graphical model

- Subclass of Markov networks commonly encountered in:
  - Ising model, Boltzmann machines
  - Computer vision

$$P(x_1, x_2, \dots, x_N) = \frac{1}{Z} \prod_{(ij)} \psi_{ij}(x_i, x_j) \prod_i \psi_i(x_i)$$

where $\psi_{ij}(x_i, x_j)$ is the compatibility function and $\psi_i(x_i)$ is the evidence of node $i$:

- $\psi_i : \mathcal{X} \to \mathbb{R}_+$ for each $i \in V$
- $\psi_{ij} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ for each $(i, j) \in E$

Boltzmann distribution

- Physicists focus on the class of distributions $P$ known as the Boltzmann distribution (Gibbs distribution):

$$P(X) = \frac{e^{-H[X]}}{Z}$$

- $H[X]$ is the energy of state $X$.
- $Z = \sum_X e^{-H[X]}$ is the normalizing partition function.
- Pairwise Markov random field:

$$P(X) = \frac{1}{Z} \prod_{(ij)} \psi_{ij}(x_i, x_j) \prod_i \psi_i(x_i) = \frac{e^{-H[X]}}{Z}$$

so the energy is

$$H[X] = -\sum_{(ij)} \ln \psi_{ij}(x_i, x_j) - \sum_i \ln \psi_i(x_i)$$

Ising model

- An example of a pairwise model with $\psi_{ij}(x_i, x_j) = \exp\{J_{ij} x_i x_j\}$ and $\psi_i(x_i) = \exp\{\theta_i x_i\}$.
- It is a mathematical model of ferromagnetism in statistical mechanics.
- $x_i \in \{+1, -1\}$ represents the magnetic dipole moment of an atomic spin; any two adjacent sites $i, j$ have an interaction $J_{ij}$.
- Each site $i$ has an external magnetic field $\theta_i$.
- The energy of each configuration is

$$H(X) = -\sum_{(ij)} J_{ij} x_i x_j - \sum_i \theta_i x_i$$

- The configuration probability is

$$P(X) = \frac{e^{-H(X)}}{Z} = \frac{1}{Z} \exp\Big(\sum_{(ij)} J_{ij} x_i x_j + \sum_i \theta_i x_i\Big)$$
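As a small worked instance of the Ising energy and its Boltzmann distribution, the sketch below enumerates a three-spin chain. The couplings `J` and fields `theta` are illustrative assumptions, and each edge is counted once in the energy.

```python
import math
from itertools import product

# Hypothetical 3-spin chain: one coupling per edge (i, j) with i < j, plus local fields.
J = {(0, 1): 0.5, (1, 2): -0.3}
theta = [0.1, 0.0, -0.2]

def energy(x):
    """H(x) = -sum_{(ij)} J_ij x_i x_j - sum_i theta_i x_i, each edge counted once."""
    pair = sum(Jij * x[i] * x[j] for (i, j), Jij in J.items())
    field = sum(t * xi for t, xi in zip(theta, x))
    return -pair - field

states = list(product([-1, 1], repeat=3))
Z = sum(math.exp(-energy(x)) for x in states)          # partition function
P = {x: math.exp(-energy(x)) / Z for x in states}      # Boltzmann probabilities
```

Low-energy configurations receive the largest probability, which is the only link between the physics vocabulary and the graphical-model one.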

Inference tasks

- Computing the marginal distribution $p(x_A)$ over a particular subset $A \subset V$ of nodes.
- Computing the conditional distribution $P(x_A \mid x_B)$.
- Computing the most probable configuration (MAP):

$$\hat{x} = \arg\max_{x \in \mathcal{X}^m} P(x)$$
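The three inference tasks can all be answered by enumeration on a tiny model, which gives a reference against which approximations can later be checked. This is a minimal sketch on a hypothetical three-spin Ising chain; the helper names `marginal` and `conditional` are illustrative, not from the slides.

```python
import math
from itertools import product

J = {(0, 1): 1.0, (1, 2): 1.0}      # assumed couplings
theta = [0.5, 0.0, 0.0]             # assumed fields
states = list(product([-1, 1], repeat=3))

def weight(x):
    s = sum(Jij * x[i] * x[j] for (i, j), Jij in J.items())
    s += sum(t * xi for t, xi in zip(theta, x))
    return math.exp(s)

Z = sum(weight(x) for x in states)
P = {x: weight(x) / Z for x in states}

def marginal(idx):
    """p(x_A) for the tuple of node indices idx."""
    out = {}
    for x, p in P.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def conditional(idx_a, idx_b, xb):
    """P(x_A | x_B = xb) via p(x_A, x_B) / p(x_B)."""
    joint = marginal(idx_a + idx_b)
    pb = marginal(idx_b)[tuple(xb)]
    n = len(idx_a)
    return {k[:n]: p / pb for k, p in joint.items() if k[n:] == tuple(xb)}

x_map = max(P, key=P.get)           # most probable configuration (MAP)
```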


Belief propagation

- BP is a method for computing marginal probability functions.
- The computed marginal probability is exact if the factor graph has no cycles.

$$m_{i \to a}(x_i) = \prod_{c \in N(i) \setminus a} m_{c \to i}(x_i)$$

$$m_{a \to i}(x_i) = \sum_{x_a \setminus x_i} f_a(x_a) \prod_{j \in N(a) \setminus i} m_{j \to a}(x_j)$$

- $i$ and $j$ are used as general indices over variables, $a$ and $c$ over factors.

Belief propagation

If this iteration converges, the marginals are approximated by

$$b_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i)$$

$$b_a(x_a) \propto f_a(x_a) \prod_{i \in N(a)} m_{i \to a}(x_i)$$

- In general, loopy BP (LBP) may not converge.
  - If it does, $b_i(x_i)$ may not be close to the true marginal $P(x_i)$.
- The set of pseudomarginals $b$ may not be realizable.


Write down the energy function

Construct an approximation

Find the stationary condition


Variational free energy

- The variational method approximates an intractable distribution $P(X)$ of random variables $X = (S_1, \dots, S_N)$ by a tractable distribution $Q(X)$.
- $Q$ is chosen to minimize a distance measure, here the Kullback-Leibler divergence:

$$KL(Q \| P) = \sum_X Q(X) \ln \frac{Q(X)}{P(X)} = \Big\langle \ln \frac{Q}{P} \Big\rangle_Q$$

where $\langle \cdot \rangle_Q$ denotes the expectation with respect to $Q$.


To find the best approximation to $P = \frac{e^{-H(X)}}{Z}$:

$$KL(Q \| P) = \ln Z + E[Q] - S[Q]$$

where

- $S[Q] = -\sum_X Q(X) \ln Q(X)$ is the entropy of $Q$
- $E[Q] = \sum_X Q(X) H[X]$ is called the average energy

$$\implies \min_Q KL(Q \| P) = \ln Z + \min_Q \underbrace{(E[Q] - S[Q])}_{\text{Variational free energy}}$$
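The decomposition $KL(Q \| P) = \ln Z + E[Q] - S[Q]$ can be checked numerically. The sketch below uses a hypothetical two-spin energy and an arbitrary trial distribution `Q`; both are assumptions chosen only to exercise the identity.

```python
import math
from itertools import product

states = list(product([-1, 1], repeat=2))

def H(x):
    """Hypothetical two-spin energy (coupling 0.8, field 0.3 on the first spin)."""
    return -0.8 * x[0] * x[1] - 0.3 * x[0]

Z = sum(math.exp(-H(x)) for x in states)
P = {x: math.exp(-H(x)) / Z for x in states}

# Arbitrary trial distribution Q over the four states (assumed values summing to 1).
Q = dict(zip(states, [0.1, 0.2, 0.3, 0.4]))

kl = sum(q * math.log(q / P[x]) for x, q in Q.items())
avg_energy = sum(q * H(x) for x, q in Q.items())        # E[Q]
entropy = -sum(q * math.log(q) for q in Q.values())     # S[Q]
free_energy = avg_energy - entropy                      # variational free energy
```

Because $\ln Z$ does not depend on $Q$, minimizing `free_energy` over `Q` is the same as minimizing `kl`, without ever needing $Z$.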

Variational free energy for Ising model

- The model under consideration is a Boltzmann machine:

$$P(X) = \frac{e^{-H(X)}}{Z} = \frac{1}{Z} \exp\Big(\sum_{(ij)} J_{ij} x_i x_j + \sum_i \theta_i x_i\Big)$$

- For binary variables it is convenient to reparametrize the marginals as follows:

$$p_i(x_i = 1) = \frac{1 + m_i}{2}$$

Mean Field approximation

Find a factorized distribution that best describes the true distribution.

- For binary variables the most general factorized distribution has the form

$$Q_{MF}(x) = \prod_i Q_i(x_i) = \prod_i \frac{1 + x_i m_i}{2}$$

- $KL(Q_{MF} \| P) = E(Q_{MF}) - S(Q_{MF}) + \log Z$
- $E(Q_{MF}) = \sum_x Q_{MF}(x) H(x) = -\sum_{(ij)} J_{ij} m_i m_j - \sum_i \theta_i m_i$
- $S(Q_{MF}) = -\sum_x Q_{MF}(x) \ln Q_{MF}(x) = -\sum_i \Big( \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \Big)$

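Mean Field approximation

How to solve?

$$\min_{m_i} KL(Q_{MF} \| P)$$

- By taking the derivative with respect to $m_i$:

$$\frac{\partial}{\partial m_i} \Big\{ -\sum_{(ij)} J_{ij} m_i m_j - \sum_i \theta_i m_i + \sum_i \Big[ \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \Big] + \log Z \Big\}$$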

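Mean Field fixed points

$$\frac{\partial KL}{\partial m_i} = -\sum_{j \in N(i)} J_{ij} m_j - \theta_i + \frac{1}{2} \ln \frac{1 + m_i}{1 - m_i}$$

- Setting the derivative to zero gives the fixed points of the MF approximation:

$$m_i = \frac{\exp\big(\sum_j J_{ij} m_j + \theta_i\big) - \exp\big(-\sum_j J_{ij} m_j - \theta_i\big)}{\exp\big(\sum_j J_{ij} m_j + \theta_i\big) + \exp\big(-\sum_j J_{ij} m_j - \theta_i\big)}$$

$$\implies m_i = \tanh\Big(\sum_j J_{ij} m_j + \theta_i\Big), \quad i = 1, \dots, N$$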

Mean Field

$$m_i = \tanh\Big(\sum_j J_{ij} m_j + \theta_i\Big), \quad i = 1, \dots, N \qquad (1)$$

Note

- The intractable task of computing marginals has been replaced by the problem of solving a set of nonlinear equations.
- These MF equations can be run sequentially, i.e. we fix all $m_j$ except $m_i$ and update $m_i$.
  - In each step the MF free energy is convex in $m_i$, so equation (1) finds its minimum in one step.
  - This procedure can be interpreted as coordinate descent in the $m_i$.
- Alternatively, all parameters $m_i$ can be updated in parallel.
  - Parallel updates do not guarantee a decrease of the cost function at each iteration.
- There might be many solutions to (1).
- Some of the solutions may not be local minima.
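The sequential (coordinate-descent) form of the mean-field equations $m_i = \tanh(\sum_j J_{ij} m_j + \theta_i)$ can be sketched as follows. The function name `mean_field`, the stopping rule, and the example couplings are assumptions made for illustration; `J` is taken to be a symmetric matrix with zero diagonal.

```python
import math

def mean_field(J, theta, sweeps=200, tol=1e-10):
    """Sequentially update each m_i with all other m_j held fixed."""
    n = len(theta)
    m = [0.0] * n                      # start from the unmagnetized state
    for _ in range(sweeps):
        largest_change = 0.0
        for i in range(n):
            local_field = theta[i] + sum(J[i][j] * m[j] for j in range(n) if j != i)
            new_mi = math.tanh(local_field)
            largest_change = max(largest_change, abs(new_mi - m[i]))
            m[i] = new_mi              # the new value is used immediately
        if largest_change < tol:
            break
    return m

# Hypothetical 3-spin chain with weak couplings.
J = [[0.0, 0.2, 0.0],
     [0.2, 0.0, 0.2],
     [0.0, 0.2, 0.0]]
theta = [0.3, 0.0, -0.1]
m = mean_field(J, theta)
```

With weak couplings the iteration is a contraction and converges quickly; with strong couplings the result can depend on the starting point, in line with the remark that (1) may have many solutions.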


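Mean Field

- Consider the $d$-dimensional Ising model without external magnetic field ($\theta = 0$) and with the same interaction $J_{ij} = \alpha$ on every edge. Each site has $2d$ neighbors, so for a uniform magnetization $m$ the update becomes

$$m^{(t+1)} = \tanh\big(2 d \alpha\, m^{(t)}\big)$$

- For $\alpha < \frac{1}{2d}$, the iteration converges to $\lim_{t \to \infty} m^{(t)} = 0$ (left figure).
- For $\alpha > \frac{1}{2d}$, if $m^{(0)} \gtrless 0$ then $\lim_{t \to \infty} m^{(t)} = \pm m^*$.

[4] A. Montanari, Lecture notes for inference in graphical models, 2011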
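The two regimes of the scalar recursion $m^{(t+1)} = \tanh(2 d \alpha\, m^{(t)})$ are easy to reproduce. The sketch below assumes $d = 2$, so the threshold is $\alpha = 1/(2d) = 0.25$; the chosen values of $\alpha$, the starting points and the iteration count are illustrative.

```python
import math

def iterate(alpha, d, m0, steps=500):
    """Run m <- tanh(2 * d * alpha * m) for a fixed number of steps."""
    m = m0
    for _ in range(steps):
        m = math.tanh(2 * d * alpha * m)
    return m

d = 2
m_high_temp = iterate(alpha=0.2, d=d, m0=0.5)    # alpha < 1/(2d): decays to 0
m_plus = iterate(alpha=0.4, d=d, m0=0.1)         # alpha > 1/(2d), m0 > 0: +m*
m_minus = iterate(alpha=0.4, d=d, m0=-0.1)       # alpha > 1/(2d), m0 < 0: -m*
```

Below the threshold the slope of the map at the origin is less than one, so $m = 0$ is the only fixed point; above it the origin becomes unstable and the sign of the starting point selects $\pm m^*$.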

Mean Field

- MF neglects the dependencies between the random variables.

However,

- We get an upper bound on the exact free energy:

$$KL(Q_{MF} \| P) = \underbrace{E(Q_{MF}) - S(Q_{MF})}_{= F[Q_{MF}],\ \text{variational MF free energy}} - \underbrace{(-\log Z)}_{\text{exact free energy}}$$

Since $KL(Q_{MF} \| P) \ge 0$,

$$F(Q_{MF}) \ge -\log Z$$
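The bound $F(Q_{MF}) \ge -\log Z$ holds for every choice of magnetizations, not only at the mean-field solution. The sketch below evaluates the MF free energy at a few arbitrary points on a hypothetical three-spin model and compares it with the exact value obtained by enumeration; the couplings, fields and trial points are assumptions.

```python
import math
from itertools import product

J = {(0, 1): 0.6, (1, 2): -0.4, (0, 2): 0.3}   # assumed couplings, one entry per edge
theta = [0.2, -0.1, 0.0]                       # assumed fields
n = len(theta)

def energy(x):
    return (-sum(Jij * x[i] * x[j] for (i, j), Jij in J.items())
            - sum(t * xi for t, xi in zip(theta, x)))

def exact_free_energy():
    """-log Z by summing over all 2^n spin configurations."""
    return -math.log(sum(math.exp(-energy(x)) for x in product([-1, 1], repeat=n)))

def negative_entropy(mi):
    p, q = (1 + mi) / 2, (1 - mi) / 2
    return p * math.log(p) + q * math.log(q)

def mf_free_energy(m):
    """E(Q_MF) - S(Q_MF) for the product distribution with means m."""
    return energy(m) + sum(negative_entropy(mi) for mi in m)

exact = exact_free_energy()
trial_points = [[0.0, 0.0, 0.0], [0.5, -0.2, 0.3], [-0.9, 0.9, 0.1]]
gaps = [mf_free_energy(m) - exact for m in trial_points]   # each gap equals KL(Q_MF || P)
```

`energy(m)` reuses the spin energy with means in place of spins, which is valid here because the energy is linear in each $x_i$ and $Q_{MF}$ factorizes.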

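Mean Field Method in general

- $P(x) = \frac{1}{Z} \prod_{a \in F} f_a(x_a)$ is the true distribution.
- $Q(x) = \prod_i q_i(x_i)$ is the approximate distribution.

$$F_{MF}(Q) = -\sum_i S(q_i) - \sum_{a \in F} \sum_{x_a} \prod_{i \in N(a)} q_i(x_i) \log f_a(x_a)$$

- The number of free parameters drops from $|\mathcal{X}|^n - 1$ to $n(|\mathcal{X}| - 1)$.
- $F_{MF}$ is no longer convex.

$$\min_Q F_{MF}(Q) \quad \text{subject to} \quad \sum_{x_i} q_i(x_i) = 1$$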

Mean Field Method in general

- Add Lagrange multipliers $\lambda_i$.
- Find the stationary condition from $\frac{\partial L(Q, \lambda)}{\partial q_i(x_i)} = 0$:

$$q_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i)$$

where

$$m_{a \to i}(x_i) = \exp\Big( \sum_{x_j :\, j \in N(a) \setminus i} \log f_a(x_a) \prod_{j \in N(a) \setminus i} q_j(x_j) \Big)$$

- A simple greedy algorithm for finding a stationary point consists of updating the $q_i$ by iterating the above equations until convergence.


TAP approximation

The Legendre Transform and Plefka’s Expansion


Plefka Expansion

- Do not restrict the approximate distribution $Q$ to be a product distribution.
- Minimize the free energy in two steps:
  - Constrained minimization in the family of distributions satisfying $\langle X \rangle_Q = m$ for fixed $m$:

$$G(m) = \min_Q \{ F[Q] = E[Q] - S[Q] \mid \langle X \rangle_Q = m \}$$

  - Minimize $G(m)$ with respect to $m$.

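Plefka Expansion

$$G(m) = \min_Q \{ F[Q] \mid \langle X \rangle_Q = m \}$$

Adding Lagrange multipliers $\lambda_i$ gives the Lagrangian

$$G(m, \lambda) = E[Q] - S[Q] - \sum_i \lambda_i \big( \langle x_i \rangle_Q - m_i \big)$$

$$G(m, \lambda) = \sum_X Q(X) H[X] - S[Q] - \sum_X \sum_i \lambda_i x_i Q(X) + \sum_i \lambda_i m_i$$

This has the form of a variational free energy in which $H[X]$ is replaced by $H[X] - \sum_i \lambda_i x_i$. We can construct such a Gibbs free energy by adding a set of external auxiliary fields:

$$\implies Q_\lambda(X) = \frac{1}{Z(\lambda)} \exp\Big( -H[X] + \sum_i \lambda_i x_i \Big)$$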

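Plefka Expansion

The dual function is

$$G(m) = \max_{\lambda} \Big\{ \sum_i \lambda_i m_i - \log Z(\lambda) \Big\}$$

- This equation is known as the Legendre transform between $\{\lambda_i\}$ and $\{m_i\}$.
- $Z(\lambda)$ is the normalizing constant of the Gibbs distribution

$$Q_\lambda(X) = \frac{1}{Z(\lambda)} \exp\Big( -H[X] + \sum_i \lambda_i x_i \Big) = \frac{1}{Z(\lambda)} \exp\Big( \sum_{(ij)} J_{ij} x_i x_j + \sum_i \theta_i x_i + \sum_i \lambda_i x_i \Big)$$

- Set $\theta \to 0$ by shifting the Lagrange multipliers, $\lambda_i \to \lambda_i - \theta_i$.
- $Z(\lambda) = \sum_x \exp\Big( \sum_{(ij)} J_{ij} x_i x_j + \sum_i \lambda_i x_i \Big)$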

Plefka Expansion

$$G(m) = \max_{\lambda} \Big\{ \sum_i \lambda_i m_i - \log \sum_x \exp\Big( \beta \sum_{(ij)} J_{ij} x_i x_j + \sum_i \lambda_i x_i \Big) \Big\}$$

- The Plefka expansion is derived by substituting $J_{ij} \to \beta J_{ij}$ and Taylor expanding the Gibbs free energy around $\beta = 0$, where $\beta$ plays the role of the inverse temperature in physics.

Notice

- For each term in the Taylor expansion, one has to expand the maximizing Lagrange multipliers $\lambda_i$ as well as $\log Z$.
- The auxiliary field is temperature dependent.

Plefka Expansion

With $G_n = \left. \frac{\partial^n}{\partial \beta^n} G(m) \right|_{\beta = 0}$,

$$G(m) = G_0(m) + \beta G_1(m) + \frac{\beta^2}{2!} G_2(m) + \dots$$

- $G_0(m) = \sum_i \Big\{ \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \Big\}$ ⇒ MF variational entropy (with a minus sign). At $\beta = 0$ the spins are entirely controlled by the auxiliary field.
- $G_1(m) = -\sum_{i<j} J_{ij} m_i m_j$ ⇒ MF variational energy
- $G_2(m) = -\frac{1}{2} \sum_{i \neq j} J_{ij}^2 (1 - m_i^2)(1 - m_j^2)$
- Higher-order terms ⇒ take into account the higher-order dependencies

TAP approximation

TAP approximation = minimizing $G(m)$ for $\beta = 1$ and keeping only terms up to second order:

$$G_{TAP}(m) = -\sum_{(ij)} J_{ij} m_i m_j + \sum_i \Big\{ \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \Big\} \underbrace{- \frac{1}{2} \sum_{(ij)} J_{ij}^2 (1 - m_i^2)(1 - m_j^2)}_{\text{dependencies between rvs}}$$

- TAP takes into account the dependencies between random variables.
- It is exact in the high-temperature regime for certain classes of models (SK models).
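To see what the second-order term buys, the sketch below compares the mean-field and TAP free energies with the exact value $-\log Z$ on a hypothetical weakly coupled three-spin chain with no external field. With $\theta = 0$ and weak couplings, $m = 0$ is a stationary point of both approximations, so both are evaluated there; the coupling value 0.1 is an assumption.

```python
import math
from itertools import product

J = {(0, 1): 0.1, (1, 2): 0.1}    # assumed weak couplings, one entry per edge
n = 3

def exact_free_energy():
    """-log Z by enumeration of all spin configurations."""
    Z = sum(math.exp(sum(Jij * x[i] * x[j] for (i, j), Jij in J.items()))
            for x in product([-1, 1], repeat=n))
    return -math.log(Z)

def negative_entropy(m):
    total = 0.0
    for mi in m:
        for p in ((1 + mi) / 2, (1 - mi) / 2):
            if p > 0:                  # 0 * log 0 is taken as 0
                total += p * math.log(p)
    return total

def g_mf(m):
    """Zeroth- and first-order terms: mean-field free energy."""
    return -sum(Jij * m[i] * m[j] for (i, j), Jij in J.items()) + negative_entropy(m)

def g_tap(m):
    """Mean-field free energy plus the second-order correction."""
    second_order = sum(Jij ** 2 * (1 - m[i] ** 2) * (1 - m[j] ** 2)
                       for (i, j), Jij in J.items())
    return g_mf(m) - 0.5 * second_order

m0 = [0.0] * n
exact, mf, tap = exact_free_energy(), g_mf(m0), g_tap(m0)
```

In this weak-coupling example the TAP value is much closer to the exact free energy than the mean-field value; the gap need not shrink this way when the couplings are strong.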

TAP approximation

Fixed points of the TAP approximation (stationarity of $G_{TAP}$ with respect to $m_i$):

$$m_i = \tanh\Big( \sum_{j \in N(i)} J_{ij} m_j - m_i \sum_{j \in N(i)} J_{ij}^2 (1 - m_j^2) \Big)$$

- Running these equations does not guarantee that the TAP Gibbs free energy decreases ($m_i$ appears on both sides).
- There is a danger that the radius of convergence of the Taylor expansion will be too small to obtain results for the values of $\beta$ we are interested in.

Outline

- Standard BP algorithm
- Junction tree algorithm
- Region-based free energy
  - Different types of region graph
- Special case: Bethe free energy
  - Stationary points of Bethe free energy = BP fixed points
- Generalized belief propagation (GBP)
  - Stationary points of the region-based free energy approximation


Message Passing - Computing the marginals

$$p(x_1, x_2, x_3, x_4) = \frac{1}{Z}\, f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, f_C(x_4)$$

$$b_1(x_1) = p(x_1) = \,?$$

Message Passing

Unrolling the messages one step at a time, up to normalization:

- $b_1(x_1) \propto m_{A \to 1}(x_1)$
- $= \sum_{x_2} f_A(x_1, x_2)\, m_{2 \to A}(x_2)$
- $= \sum_{x_2} f_A(x_1, x_2)\, m_{B \to 2}(x_2)$
- $= \sum_{x_2, x_3, x_4} f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, m_{3 \to B}(x_3)\, m_{4 \to B}(x_4)$
- $= \sum_{x_2, x_3, x_4} f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, m_{4 \to B}(x_4)$, since the leaf variable 3 sends $m_{3 \to B}(x_3) = 1$
- $= \sum_{x_2, x_3, x_4} f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, m_{C \to 4}(x_4)$
- $= \sum_{x_2, x_3, x_4} f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)\, f_C(x_4)$
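The message schedule for $b_1(x_1)$ on the factor graph $f_A(x_1, x_2) f_B(x_2, x_3, x_4) f_C(x_4)$ can be run directly and compared with brute-force marginalization. The sketch assumes binary variables and hypothetical factor tables; the message names mirror the subscripts used in the derivation.

```python
from itertools import product

# Hypothetical factor tables over binary variables (values are assumptions).
def f_A(x1, x2):
    return 2.0 if x1 == x2 else 1.0

def f_B(x2, x3, x4):
    return 1.0 + x2 + x3 * x4

def f_C(x4):
    return 3.0 if x4 == 1 else 1.0

vals = [0, 1]

# Messages flow from the leaves towards variable 1.
m_C_to_4 = {x4: f_C(x4) for x4 in vals}      # leaf factor sends its own table
m_4_to_B = dict(m_C_to_4)                    # variable 4 has no other neighbor
m_3_to_B = {x3: 1.0 for x3 in vals}          # leaf variable sends the constant 1
m_B_to_2 = {x2: sum(f_B(x2, x3, x4) * m_3_to_B[x3] * m_4_to_B[x4]
                    for x3 in vals for x4 in vals)
            for x2 in vals}
m_2_to_A = dict(m_B_to_2)                    # variable 2 forwards its only other message
m_A_to_1 = {x1: sum(f_A(x1, x2) * m_2_to_A[x2] for x2 in vals) for x1 in vals}

norm = sum(m_A_to_1.values())
b1 = {x1: v / norm for x1, v in m_A_to_1.items()}

# Reference: marginal of x1 by summing the full joint.
brute = {x1: 0.0 for x1 in vals}
for x1, x2, x3, x4 in product(vals, repeat=4):
    brute[x1] += f_A(x1, x2) * f_B(x2, x3, x4) * f_C(x4)
Z = sum(brute.values())
brute = {x1: v / Z for x1, v in brute.items()}
```

Because this factor graph is a tree, the belief `b1` equals the exact marginal, and the normalizer of the final message equals $Z$.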


Junction Tree algorithm

- Works for general graphs:
  - Tree-shaped graphs
  - Graphs with cycles
  - Directed graphs
  - Undirected graphs
- Remove cycles by clustering nodes into cliques.
- Perform belief propagation on cliques.
- Exact inference of (clique) marginals.

Junction Tree algorithm - Moralization

- For a directed graph, we first moralize it by connecting all unconnected parents of each node. After this we drop the edge directions to obtain an undirected graph.

Junction Tree algorithm - Triangulation

- Triangulation: add edges so that every cycle of length four or more has a chord, i.e. an edge between two non-successive nodes of the cycle.


Clique potentials:

- $\psi_{C_1}(x_A, x_B) = \psi_{A,B}(x_A, x_B)$
- $\psi_{C_2}(x_B, x_C, x_F) = \psi_{B,C}(x_B, x_C)\, \psi_{C,F}(x_C, x_F)$
- $\psi_{C_3}(x_C, x_F, x_G) = \psi_{C,F}(x_C, x_F)\, \psi_{F,G}(x_F, x_G)$
- $\psi_{C_4}(x_C, x_D, x_G, x_H) = \psi_{C,D,H}(x_C, x_D, x_H)\, \psi_{D,G,H}(x_D, x_G, x_H)$
- $\psi_{C_5}(x_C, x_E, x_H) = \psi_{C,E,H}(x_C, x_E, x_H)$

Independence in junction tree

- Suppose:
  - $T$ is a junction tree for graph $G$.
  - Cliques $C_i$ and $C_j$ have separator $S_{ij} = C_i \cap C_j$.
  - Variables $X$ and $Y$ are on opposite sides of the separator.
- Then $X$ and $Y$ are independent given $S_{ij}$.


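Given a junction tree and potentials on the cliques, the message from clique $C_i$ to $C_j$ is

$$m_{ij}(x_{S_{ij}}) = \sum_{x_{C_i \setminus S_{ij}}} \psi_{C_i}(x_{C_i}) \prod_{k \in N(i) \setminus j} m_{ki}(x_{S_{ki}})$$

- $S_{ij}$: nodes shared by $C_i$ and $C_j$
- $N(i)$: neighboring cliques of $C_i$
- The marginal distributions of the cliques and separators are

$$p(x_{C_i}) \propto \psi_{C_i}(x_{C_i}) \prod_{k \in N(i)} m_{ki}(x_{S_{ki}})$$

$$p(x_{S_{ij}}) \propto m_{ij}(x_{S_{ij}})\, m_{ji}(x_{S_{ij}})$$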


- $m_{12}(x_B) = \sum_{x_A} \psi_{C_1}(x_A, x_B)$
- $m_{23}(x_C, x_F) = \sum_{x_B} \psi_{C_2}(x_B, x_C, x_F)\, m_{12}(x_B)$
- $m_{34}(x_C, x_G) = \sum_{x_F} \psi_{C_3}(x_C, x_F, x_G)\, m_{23}(x_C, x_F)$
- $m_{45}(x_C, x_H) = \sum_{x_D, x_G} \psi_{C_4}(x_C, x_D, x_G, x_H)\, m_{34}(x_C, x_G)$



To find the best approximation to $P = \frac{1}{Z} \prod_{c \in \text{cliques}} \phi_c(x_c)$:

$$KL(Q \| P) = \sum_x Q(x) \ln Q(x) - \sum_x Q(x) \ln p(x)$$

where

- $H[Q] = -\sum_x Q(x) \ln Q(x)$ is the entropy of $Q$
- $U[Q] = -\sum_{c \in \text{cliques}} \sum_{x_c} Q(x_c) \log \phi_c(x_c)$ is called the average energy

$$\implies \min_Q KL(Q \| P) = \ln Z + \min_Q \underbrace{(U[Q] - H[Q])}_{\text{Variational free energy}}$$

Variational Free energy

- Two solution methods for $\min_Q F[Q]$:
  - Approximate $F[Q]$
    - Region-based approximation $\implies F_{\mathcal{R}}(q_R)$
  - Choose a simpler form of $Q$
    - Mean field approximation $\implies Q = \prod_i q_i$

Region Based free energy

- We decompose the system into subsystems and then approximate the free energy by combining the free energies of the subsystems.
- Group nodes into (possibly overlapping) clusters.
- In each region, all variable nodes connected to any included factor node are included.
- The sets of nodes $\{1, 2\}$ and $\{B, C, 2, 3, 4\}$ could be regions.
- $\{B, 3\}$ could not be a region, because factor $B$ is also connected to variables 2 and 4.


- The overall free energy is the sum of the free energies of all the regions.
- If some of the large regions overlap, subtract out the free energies of these overlap regions.
- Each factor and variable node should be counted exactly once.
- For every factor node $a$ and every variable node $i$, the counting numbers $c_R$ of a set of regions $\mathcal{R}$ satisfy

$$\sum_{R \in \mathcal{R}} c_R\, \mathbb{I}(a \in F_R) = \sum_{R \in \mathcal{R}} c_R\, \mathbb{I}(i \in V_R) = 1$$

where $\mathbb{I}(x \in S) = 1$ if $x \in S$ and $0$ otherwise.


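- The region-based free energy for a set of regions $\mathcal{R}$ is

$$F_{\mathcal{R}}(\{b_R\}) = U_{\mathcal{R}}(\{b_R\}) - H_{\mathcal{R}}(\{b_R\})$$

- Count every node once.
- $U_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R U_R(b_R) \implies$ region-based average energy
- $H_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R H_R(b_R) \implies$ region-based approximate entropy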


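If

$$\sum_{R \in \mathcal{R}} c_R\, \mathbb{I}(a \in F_R) = 1 \quad \text{for all } a \in F$$

and $b_R(x_R) = p_R(x_R)$, then the average energy becomes exact:

$$U_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R U_R(b_R) = -\sum_{R \in \mathcal{R}} c_R \sum_{x_R} b_R(x_R) \sum_{a \in F_R} \ln f_a(x_a)$$

$$\text{Exact energy} \implies U = \sum_{x \in S} p(x) E(x) = -\sum_a \sum_{x_a} p_a(x_a) \ln f_a(x_a)$$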


- Counting each variable node and factor node exactly once results in exactness of the average energy.
- However, the region-based entropy is still an approximation:

$$H_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R H_R(b_R) = -\sum_{R \in \mathcal{R}} c_R \sum_{x_R} b_R(x_R) \ln b_R(x_R)$$

- We are interested in the accuracy of $H_{\mathcal{R}}$ near its maximum:

$$\min_{b_R} F_{\mathcal{R}}(\{b_R\}) = \min_{b_R} \{ U_{\mathcal{R}}(\{b_R\}) - H_{\mathcal{R}}(\{b_R\}) \}$$

- $H_{\mathcal{R}}$ should achieve its maximum when all beliefs $b_R(x_R)$ are uniform (maxent-normal).


Bethe Free energy

Regions are $\mathcal{R} = \{R_i, R_a : i \in V, a \in F\}$

- $R_i = (\{i\}, \emptyset, \emptyset)$
- $R_a = (N(a), \{a\}, \{(i, a) : i \in N(a)\})$
- Large regions contain a single factor node $a$ and all attached variable nodes: $c_R = 1$.
- Small regions contain a single variable node: $c_R = 1 - d_i$, where $d_i = |N(i)|$.
- $R_1$ is a subregion of $R_2$ if $R_1 \subset R_2$.

Bethe Free energy

[Figure: Bethe region graph for the example factor graph]

- Assigning counting numbers to the regions:
  - $c_R = 1$ for the large regions $R_a$
  - $c_R = 1 - d_i$ for the small regions $R_i$
- Every variable node and factor node is counted once.
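The claim that the Bethe counting numbers count every node once can be checked mechanically. The sketch below builds the Bethe regions for the factor graph $f_A(x_1, x_2) f_B(x_2, x_3, x_4) f_C(x_4)$ used earlier in the deck; the data layout (tuples of variable set, factor set, counting number) is an assumption made for illustration.

```python
# Factor scopes of the example factor graph f_A(x1,x2) f_B(x2,x3,x4) f_C(x4).
factor_scopes = {"A": {1, 2}, "B": {2, 3, 4}, "C": {4}}
variables = sorted(set().union(*factor_scopes.values()))

# d_i = number of factors attached to variable i.
degree = {i: sum(i in scope for scope in factor_scopes.values()) for i in variables}

# Regions stored as (variable set, factor set, counting number).
regions = [(scope, {a}, 1) for a, scope in factor_scopes.items()]     # large regions R_a
regions += [({i}, set(), 1 - degree[i]) for i in variables]           # small regions R_i

variable_count = {i: sum(c for V, F, c in regions if i in V) for i in variables}
factor_count = {a: sum(c for V, F, c in regions if a in F) for a in factor_scopes}
```

Variable $i$ appears in $d_i$ large regions with weight 1 and in one small region with weight $1 - d_i$, so its total is always 1 regardless of the graph.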

Bethe Free energy

- Bethe free energy:

$$F_{Bethe} = U_{Bethe} - H_{Bethe}$$

- Bethe average energy:

$$U_{Bethe} = -\sum_a \sum_{x_a} b_a(x_a) \ln f_a(x_a)$$

- Bethe entropy:

$$H_{Bethe} = -\sum_a \sum_{x_a} b_a(x_a) \ln b_a(x_a) + \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i)$$
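On a factor graph without cycles, the Bethe free energy evaluated at the true marginals equals the exact free energy $-\ln Z$. The sketch below checks this on a hypothetical chain $x_1 - f_A - x_2 - f_B - x_3$ with made-up positive factors; the exact marginals are obtained by enumeration.

```python
import math
from itertools import product

# Hypothetical strictly positive factors on a chain (values are assumptions).
def f_A(x1, x2):
    return 2.0 if x1 == x2 else 1.0

def f_B(x2, x3):
    return 1.0 + x2 + 2.0 * x3

states = list(product([0, 1], repeat=3))
weights = {x: f_A(x[0], x[1]) * f_B(x[1], x[2]) for x in states}
Z = sum(weights.values())
p = {x: w / Z for x, w in weights.items()}

def marginal(idx):
    out = {}
    for x, px in p.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + px
    return out

factors = {"A": ((0, 1), f_A), "B": ((1, 2), f_B)}   # scope (variable indices) and table
degree = {0: 1, 1: 2, 2: 1}                          # d_i for each variable

U = 0.0   # Bethe average energy
H = 0.0   # Bethe entropy
for scope, f in factors.values():
    b_a = marginal(scope)
    U -= sum(b * math.log(f(*xa)) for xa, b in b_a.items())
    H -= sum(b * math.log(b) for b in b_a.values())
for i, d in degree.items():
    b_i = marginal((i,))
    H += (d - 1) * sum(b * math.log(b) for b in b_i.values())

F_bethe = U - H
```

On graphs with cycles the same expression is only an approximation, which is where the entropy term stops being exact.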

Bethe Free energy - Maxent normal

- The global maximum of the Bethe entropy is achieved when the beliefs $b_i(x_i)$, $b_a(x_a)$ are uniform.

$$H_{Bethe} = \sum_i H(b_i) - \sum_a I(b_a)$$

where

$$H(b_i) = -\sum_{x_i} b_i(x_i) \ln b_i(x_i)$$

$$I(b_a) = \sum_{x_a} b_a(x_a) \ln b_a(x_a) + \sum_{i \in N(a)} H(b_i)$$

- The maximum of $H(b_i)$ is achieved when $b_i(x_i)$ is the uniform distribution.
- $I(b_a) \ge 0$, and when the beliefs are uniform, $I(b_a) = 0$.

Constrained Bethe free energy

The constrained Bethe free energy forces the beliefs to obey:

- The normalization constraints:

$$\sum_{x_i} b_i(x_i) = 1, \qquad \sum_{x_a} b_a(x_a) = 1$$

- The consistency constraints:

$$\sum_{x_a \setminus x_i} b_a(x_a) = b_i(x_i)$$

- Inactive constraints (their multipliers vanish by complementary slackness):

$$0 \le b_i(x_i) \le 1, \qquad 0 \le b_a(x_a) \le 1$$

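Minimizing Constrained Bethe free energy

Theorem: Stationary points of the constrained Bethe free energy are BP fixed points.

$$\begin{aligned} \underset{b}{\text{minimize}} \quad & F_{Bethe} \\ \text{subject to} \quad & \sum_{x_i} b_i(x_i) = 1, \quad \sum_{x_a} b_a(x_a) = 1 \\ & \sum_{x_a \setminus x_i} b_a(x_a) = b_i(x_i) \\ & b_a(x_a),\ b_i(x_i) \ge 0 \end{aligned}$$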


- Lagrangian:

$$L = F_{Bethe} + \sum_i \gamma_i \Big\{ 1 - \sum_{x_i} b_i(x_i) \Big\} + \sum_a \sum_{i \in N(a)} \sum_{x_i} \lambda_{ai}(x_i) \Big\{ b_i(x_i) - \sum_{x_a \setminus x_i} b_a(x_a) \Big\}$$

- $\frac{\partial L}{\partial b_i(x_i)} = 0 \implies b_i(x_i) = \exp\Big( \frac{1}{d_i - 1} \Big\{ -\gamma_i + \sum_{a \in N(i)} \lambda_{ai}(x_i) \Big\} - 1 \Big)$
- $\frac{\partial L}{\partial b_a(x_a)} = 0 \implies b_a(x_a) = \exp\Big( -E_a(x_a) + \sum_{i \in N(a)} \lambda_{ai}(x_i) - 1 \Big)$, where $E_a(x_a) = -\ln f_a(x_a)$

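Bethe Fixed points

Define

$$\lambda_{ai}(x_i) = \ln \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i)$$

Obtain the BP equations:

$$b_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i)$$

$$b_a(x_a) \propto f_a(x_a) \prod_{i \in N(a)} \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i)$$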

Unrealizable beliefs

- $b_A(x_1, x_2) = \begin{pmatrix} 0.4 & 0.1 \\ 0.1 & 0.4 \end{pmatrix}$
- $b_B(x_2, x_3) = \begin{pmatrix} 0.4 & 0.1 \\ 0.1 & 0.4 \end{pmatrix}$
- $b_C(x_1, x_3) = \begin{pmatrix} 0.1 & 0.4 \\ 0.4 & 0.1 \end{pmatrix}$
- $b_1(x_1) = b_2(x_2) = b_3(x_3) = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}$
- There is no joint distribution $b(x_1, x_2, x_3)$ with these marginals!
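One way to see why these beliefs are unrealizable: they are locally consistent, yet they violate a condition every joint distribution must satisfy. If $x_1 \neq x_3$ then $x_1 \neq x_2$ or $x_2 \neq x_3$, so $P(x_1 \neq x_3) \le P(x_1 \neq x_2) + P(x_2 \neq x_3)$. The sketch below checks both facts; the helper names are illustrative.

```python
b_A = [[0.4, 0.1], [0.1, 0.4]]   # b_A(x1, x2)
b_B = [[0.4, 0.1], [0.1, 0.4]]   # b_B(x2, x3)
b_C = [[0.1, 0.4], [0.4, 0.1]]   # b_C(x1, x3)

def first_marginal(b):
    """Sum out the second argument."""
    return [b[0][0] + b[0][1], b[1][0] + b[1][1]]

def second_marginal(b):
    """Sum out the first argument."""
    return [b[0][0] + b[1][0], b[0][1] + b[1][1]]

def close(u, v, tol=1e-12):
    return all(abs(a - c) <= tol for a, c in zip(u, v))

# Local consistency: every pairwise belief marginalizes to (0.5, 0.5).
half = [0.5, 0.5]
locally_consistent = all(close(marg(b), half)
                         for b in (b_A, b_B, b_C)
                         for marg in (first_marginal, second_marginal))

def prob_differ(b):
    """Probability that the two arguments take different values."""
    return b[0][1] + b[1][0]

# Necessary condition for a joint distribution to exist (union bound).
lhs = prob_differ(b_C)                       # P(x1 != x3)
rhs = prob_differ(b_A) + prob_differ(b_B)    # P(x1 != x2) + P(x2 != x3)
passes_necessary_condition = lhs <= rhs + 1e-12
```

Here `lhs` is 0.8 while `rhs` is 0.4, so no joint distribution can have these three pairwise marginals even though each pair looks valid on its own.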

Region based energy

- How to select a set of regions $\mathcal{R}$ and counting numbers $c_R$?
- Some methods are:
  - Bethe method
  - Junction graph method
  - Cluster variation method
  - Region graph method

Region Graph

- A region graph is a directed acyclic graph in which $R \to R'$ implies $R' \subseteq R$.
- If there is a directed path from $R$ to $R'$, we say $R$ is an ancestor of $R'$, $R \in \mathcal{A}(R')$, and $R'$ is a descendant of $R$, $R' \in \mathcal{D}(R)$.
- In a region graph the counting numbers satisfy

$$c_R = 1 - \sum_{R' \in \mathcal{A}(R)} c_{R'} \quad \text{for all } R \in \mathcal{R}$$
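The recursion $c_R = 1 - \sum_{R' \in \mathcal{A}(R)} c_{R'}$ can be evaluated from the top of the region graph downwards. The sketch below uses a hypothetical two-level region graph (two large regions sharing one child); the region names and variable sets are assumptions, not the example from the slides.

```python
# Hypothetical region graph: each region lists its direct parents.
parents = {
    "R1": [],              # variables {1, 2, 4, 5}
    "R2": [],              # variables {2, 3, 5, 6}
    "R12": ["R1", "R2"],   # variables {2, 5}, the overlap of R1 and R2
}
region_vars = {"R1": {1, 2, 4, 5}, "R2": {2, 3, 5, 6}, "R12": {2, 5}}

def ancestors(r):
    """All regions reachable by following parent links upwards from r."""
    found = set()
    stack = list(parents[r])
    while stack:
        p = stack.pop()
        if p not in found:
            found.add(p)
            stack.extend(parents[p])
    return found

def counting_number(r):
    """c_R = 1 - sum of the counting numbers of all ancestors of R."""
    return 1 - sum(counting_number(a) for a in ancestors(r))

c = {r: counting_number(r) for r in parents}
all_vars = set().union(*region_vars.values())
times_counted = {i: sum(c[r] for r in parents if i in region_vars[r]) for i in all_vars}
```

The shared variables 2 and 5 appear in both large regions, and the negative counting number of the overlap region removes the double count.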

Region Graph Condition

- Every node is counted once:

$$\sum_{R \in \mathcal{R}} c_R\, \mathbb{I}(i \in V_R) = 1$$

  ⇒ ensures that the region graph average energy is exact.
- The regions containing a particular variable node form a connected subgraph ⇒ the marginal probability is consistent.

Example of an invalid region graph

- This is not a valid region graph: variable 5 is not counted exactly once.

Example of valid region graph

- Counting numbers: $c_R = 1 - \sum_{R' \in \mathcal{A}(R)} c_{R'}$ for all $R \in \mathcal{R}$
- Valid region graph (every node is counted once)

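Generalized Belief Propagation

- Theorem: The stationary points of the constrained region-based free energy for a valid region graph are the fixed points of generalized belief propagation (GBP) for that region graph.

$$\begin{aligned} \text{Stationary points of} \quad & F_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R F_R(b_R) \\ \text{subject to} \quad & \sum_{x_R} b_R(x_R) = 1 \quad \text{for all } R \in \mathcal{R} \\ & \sum_{x_P \setminus x_C} b_P(x_P) = b_C(x_C) \quad \text{for parent and child regions } P, C \in \mathcal{R} \\ & b_R(x_R) \ge 0 \end{aligned}$$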


- The belief in a region is the product of:
  - Local information (factors in the region)
  - Messages from parent regions
  - Messages into descendant regions from parents that are not themselves descendants of the region
- Message update rules are obtained by enforcing the marginalization constraints.


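Belief in a region is:

$$b_R(x_R) \propto \prod_{a \in A_R} f_a(x_a) \times \underbrace{\prod_{P \in \mathcal{P}(R)} m_{P \to R}(x_R)}_{\text{messages from parent regions}} \times \underbrace{\prod_{D \in \mathcal{D}(R)} \prod_{P' \in \mathcal{P}(D) \setminus \varepsilon(R)} m_{P' \to D}(x_D)}_{\text{messages into descendant regions from parents that are not descendants}}$$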


- Bethe region graph for the following graph

[Figure: example graph and its Bethe region graph]

[2] J. S. Yedidia, Constructing free-energy approximations, 2005

Generalized Belief propagation

Use marginalization constraints to derive message-update rules

Thanks

Questions?
