Clamping Variables and Approximate Inference
Adrian Weller, University of Cambridge
MSR Cambridge, Mar 18, 2016
Work with Tony Jebara and Justin Domke
For more information, see http://mlg.eng.cam.ac.uk/adrian/
1 / 21
Motivation: undirected graphical models
Powerful way to represent relationships across variables
Many applications including: computer vision, social network analysis, deep belief networks, protein folding...
In this talk, focus on binary pairwise (Ising) models
Example: Grid for computer vision (attractive)
2 / 21
Motivation: undirected graphical models
Example: Part of epinions social network (mixed)
Figure courtesy of N. Ruozzi
3 / 21
Motivation: undirected graphical models
Example: Restricted Boltzmann machine (mixed)
A fundamental problem is marginal inference
Estimate marginal probability distribution of one variable
p(x1) = Σ_{x2,…,xn} p(x1, x2, …, xn)
Closely related to computing the partition function
Computationally intractable, focus on approximate methods
Our theme: combining approximate inference with clamping can be very fruitful as a proof technique, and in practice
4 / 21
Background: Binary pairwise models
Binary variables X1, …, Xn ∈ {0, 1}
Singleton and pairwise potentials θ
Write θ · x for the total score of a complete configuration
Probability distribution given by
p(x) = (1/Z) exp(θ · x)

To ensure probabilities sum to 1, need the normalizing constant

Z = Σ_x exp(θ · x)

Z is the partition function, a fundamental quantity we'd like to compute or approximate
5 / 21
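The definitions above can be checked numerically. A minimal brute-force sketch (the model size and all potential values below are made up for illustration): enumerate all 2^n states, sum exp(θ · x) to get Z, and confirm the resulting distribution sums to 1.

```python
import itertools, math

n = 4
theta_i = [0.3, -0.5, 0.2, 0.1]                                   # singleton potentials (made up)
theta_ij = {(0, 1): 1.0, (1, 2): -0.8, (2, 3): 0.6, (3, 0): 0.4}  # pairwise potentials (made up)

def score(x):
    # theta . x : sum of singleton and pairwise terms for configuration x
    s = sum(theta_i[i] * x[i] for i in range(n))
    s += sum(w * x[i] * x[j] for (i, j), w in theta_ij.items())
    return s

states = list(itertools.product([0, 1], repeat=n))
Z = sum(math.exp(score(x)) for x in states)       # partition function
p = [math.exp(score(x)) / Z for x in states]      # normalized distribution

print(Z)
print(sum(p))  # ~1.0
```

This is exactly the computation that becomes intractable as n grows, motivating the approximate methods in the rest of the talk.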
Background: A variational approximation
Recall p(x) = (1/Z) exp(θ · x)
Exact inference may be viewed as optimization,
log Z = max_{µ∈M} [ θ · µ + S(µ) ]

M is the space of marginals that are globally consistent; S is the (Shannon) entropy
Bethe makes two pairwise approximations,
log ZB = max_{q∈L} [ θ · q + SB(q) ]

L is the space of marginals that are pairwise consistent; SB is the Bethe entropy approximation
Loopy Belief Propagation finds stationary points of Bethe
For models with no cycles (acyclic), Bethe is exact: ZB = Z

6 / 21
Background: When is Bethe a good approximation?
We know that Bethe is exact for acyclic models, ZB = Z
When else does Bethe perform well?
‘Tree-like models’: models with long cycles or weak potentials
Also: attractive models (all edges attractive)
Sudderth, Wainwright and Willsky (NIPS 2007) used loop series to show that for a subclass of attractive binary pairwise models, ZB ≤ Z
Conjectured ZB ≤ Z for all attractive binary pairwise models
Proved true by Ruozzi (NIPS 2012) using graph covers
Here we provide a separate proof building from first principles, and also derive an upper bound for Z in terms of ZB
We use the idea of clamping variables
7 / 21
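To make the Bethe quantities concrete, here is a toy sketch (not the authors' implementation; the small models and all weights are made up): damped loopy BP on a binary pairwise model, with log ZB evaluated as θ · q + SB(q) at the resulting beliefs. On a chain this matches the exact log Z; on a small attractive cycle it respects ZB ≤ Z.

```python
import itertools, math

def exact_log_z(n, theta, W):
    # Brute-force log Z by enumerating all 2^n states.
    Z = 0.0
    for x in itertools.product([0, 1], repeat=n):
        s = sum(theta[i] * x[i] for i in range(n))
        s += sum(w * x[i] * x[j] for (i, j), w in W.items())
        Z += math.exp(s)
    return math.log(Z)

def bethe_log_z(n, theta, W, iters=3000, damp=0.5):
    # Damped loopy BP, then the Bethe value at the beliefs:
    # log ZB = E_b[log potentials] + sum_e H(b_ij) - sum_i (d_i - 1) H(b_i)
    edges = list(W)
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    phi = {i: (1.0, math.exp(theta[i])) for i in range(n)}
    psi = lambda i, j, xi, xj: math.exp(W[(i, j)] * xi * xj) if (i, j) in W \
        else math.exp(W[(j, i)] * xi * xj)
    m = {(i, j): [0.5, 0.5] for a, b in edges for i, j in ((a, b), (b, a))}
    for _ in range(iters):
        new = {}
        for i, j in m:
            vec = []
            for xj in (0, 1):
                tot = 0.0
                for xi in (0, 1):
                    p = phi[i][xi] * psi(i, j, xi, xj)
                    for k in nbrs[i]:
                        if k != j:
                            p *= m[(k, i)][xi]
                    tot += p
                vec.append(tot)
            s = vec[0] + vec[1]
            new[(i, j)] = [damp * m[(i, j)][x] + (1 - damp) * vec[x] / s
                           for x in (0, 1)]
        m = new
    H = lambda ps: -sum(p * math.log(p) for p in ps if p > 0)
    logzb = 0.0
    for i in range(n):
        v = [phi[i][x] * math.prod(m[(k, i)][x] for k in nbrs[i])
             for x in (0, 1)]
        bi = [u / sum(v) for u in v]
        logzb += bi[1] * theta[i] - (len(nbrs[i]) - 1) * H(bi)
    for i, j in edges:
        b2 = {(xi, xj): phi[i][xi] * phi[j][xj] * psi(i, j, xi, xj)
              * math.prod(m[(k, i)][xi] for k in nbrs[i] if k != j)
              * math.prod(m[(k, j)][xj] for k in nbrs[j] if k != i)
              for xi in (0, 1) for xj in (0, 1)}
        tot = sum(b2.values())
        b2 = {k: v / tot for k, v in b2.items()}
        logzb += sum(b * math.log(psi(i, j, xi, xj))
                     for (xi, xj), b in b2.items()) + H(b2.values())
    return logzb

# On a tree (a 3-chain), Bethe is exact:
theta3, W3 = [0.3, -0.4, 0.5], {(0, 1): 0.8, (1, 2): -0.6}
print(abs(bethe_log_z(3, theta3, W3) - exact_log_z(3, theta3, W3)))  # ~0

# On an attractive 4-cycle, ZB <= Z (the conjecture proved by Ruozzi):
theta4 = [0.2, -0.1, 0.3, 0.0]
W4 = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0}
print(bethe_log_z(4, theta4, W4) <= exact_log_z(4, theta4, W4) + 1e-6)  # True
```

Note the cycle check only demonstrates the attractive lower bound on one example; the value at any pairwise-consistent beliefs is at most the Bethe optimum, which in turn is at most Z for attractive models.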
Background: What is clamping?
Example model: a graph on variables x1, …, x10
To compute the partition function Z, we can enumerate all states and sum
x1 x2 … x10   score   exp(score)
 0  0 …  0     1       2.7
 0  0 …  1     2       7.4
 …             …       …
 0  1 …  1     1.3     3.7
 1  0 …  0    −1       0.4
 1  0 …  1     0.2     1.2
 …             …       …
 1  1 …  1     1.8     6.0

Total Z = 47.1
8 / 21
Background: What is clamping?
[Figure: the same example model, with x1 singled out for clamping]
Can split Z in two: clamp variable X1 to each of {0, 1}, then add the two sub-partition functions:
Z = Z |X1=0 + Z |X1=1
After we clamp a variable, it may be removed
x1 x2 … x10   score   exp(score)
 0  0 …  0     1       2.7
 0  0 …  1     2       7.4
 …             …       …
 0  1 …  1     1.3     3.7    → Z|X1=0 = 27.5
 1  0 …  0    −1       0.4
 1  0 …  1     0.2     1.2
 …             …       …
 1  1 …  1     1.8     6.0    → Z|X1=1 = 19.6

Total Z = 47.1
p(X1 = 1) = Z|X1=1 / Z
9 / 21
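The clamping identity on this slide is easy to verify numerically on a small made-up model: the two sub-partition functions sum to Z, and their ratio gives the marginal.

```python
import itertools, math

n = 5
theta = [0.5, -0.3, 0.2, 0.4, -0.1]                                # made-up potentials
W = {(0, 1): 1.0, (1, 2): -0.7, (2, 3): 0.5, (3, 4): 0.9, (4, 0): -0.4}

def sub_partition(clamp_val=None):
    # Sum exp(score) over all states, optionally with X1 (index 0) clamped.
    total = 0.0
    for x in itertools.product([0, 1], repeat=n):
        if clamp_val is not None and x[0] != clamp_val:
            continue
        s = sum(theta[i] * x[i] for i in range(n))
        s += sum(w * x[i] * x[j] for (i, j), w in W.items())
        total += math.exp(s)
    return total

Z, Z0, Z1 = sub_partition(), sub_partition(0), sub_partition(1)
print(abs(Z - (Z0 + Z1)))        # ~0: Z = Z|X1=0 + Z|X1=1
print(Z1 / Z)                    # p(X1 = 1)
```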
Background: What is clamping?
[Figure: the same example model, with x1 singled out for clamping]
Can split Z in two: clamp variable X1 to each of {0, 1}, then add the two sub-partition functions:

Z = Z|X1=0 + Z|X1=1

After we clamp a variable, it may be removed

After removing the clamped variable, if the remaining sub-models are acyclic then we can find the sub-partition functions efficiently (BP; the Bethe approximation is exact on trees)

If not, we can:
repeat, clamping and removing variables until acyclic, or
settle for approximate inference on the sub-models

ZB^(i) := ZB|Xi=0 + ZB|Xi=1

Will this lead to a better estimate than approximate inference on the original model? Always? Often but not always
10 / 21
A variational perspective on clamping
Bethe approximation
log ZB = max_{q∈L} [ θ · q + SB(q) ]

Observe that when Xi is clamped, we optimize over a subset

log ZB|Xi=0 = max_{q∈L: qi=0} [ θ · q + SB(q) ]

⇒ ZB|Xi=0 ≤ ZB, and similarly ZB|Xi=1 ≤ ZB
Recap of Notation

Z        true partition function
ZB       Bethe optimum partition function
ZB^(i) := ZB|Xi=0 + ZB|Xi=1 ≤ 2 ZB
         the approximation obtained when we clamp and sum approximate sub-partition functions
11 / 21
Clamping variables: an upper bound on Z
From before,

ZB^(i) := ZB|Xi=0 + ZB|Xi=1 ≤ 2 ZB

Repeat: clamp and remove variables until the remaining model is acyclic, where Bethe is exact

For example, if we must delete 2 variables Xi, Xj, we obtain

ZB^(ij) := Σ_{a,b∈{0,1}} ZB|Xi=a,Xj=b ≤ 2² ZB

But these sub-partition functions are exact, hence the LHS = Z
[Figure: the example model on x1, …, x10, shown before clamping, after clamping x1, and after clamping x1 and x2]
12 / 21
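The k = 1 case of this construction can be sketched directly: a single cycle has feedback vertex set {x0}, and after clamping x0 the rest is a chain, where each sub-partition function is computed exactly by a transfer-matrix sweep. The model and weights below are made up.

```python
import itertools, math

n = 6
theta = [0.2, -0.4, 0.3, 0.1, -0.2, 0.5]
W = {(i, (i + 1) % n): 0.8 for i in range(n)}    # single cycle x0-x1-...-x5-x0

def z_given_x0(a):
    # After clamping X0 = a, variables x1..x5 form a chain; sweep along it,
    # carrying f[x] = total weight of all partial configurations ending in x.
    f = [math.exp(theta[1] * x + W[(0, 1)] * a * x) for x in (0, 1)]
    for i in range(2, n):
        f = [sum(f[xp] * math.exp(theta[i] * x + W[(i - 1, i)] * xp * x)
                 for xp in (0, 1)) for x in (0, 1)]
    # close the cycle: x0's own potential and the (x5, x0) edge
    return math.exp(theta[0] * a) * sum(
        f[x] * math.exp(W[(n - 1, 0)] * x * a) for x in (0, 1))

def z_brute():
    tot = 0.0
    for x in itertools.product([0, 1], repeat=n):
        s = sum(theta[i] * x[i] for i in range(n))
        s += sum(w * x[i] * x[j] for (i, j), w in W.items())
        tot += math.exp(s)
    return tot

Z = z_given_x0(0) + z_given_x0(1)    # clamp-and-sum with exact sub-models
print(abs(Z - z_brute()) < 1e-9)     # True: recovers the exact Z
```

Each clamped sweep costs O(n) rather than O(2^n), which is the point of clamping down to an acyclic remainder.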
Clamping variables: an upper bound on Z
ZB^(i) := ZB|Xi=0 + ZB|Xi=1 ≤ 2 ZB

Repeat: clamp and remove variables until the remaining model is acyclic, where Bethe is exact

Let k(G) be the minimum size of a feedback vertex set

Theorem (result is tight in a sense)

Z ≤ 2^k ZB
13 / 21
Attractive models: a lower bound on Z
An attractive model is one with all edges attractive
Recall the definition

ZB^(i) := ZB|Xi=0 + ZB|Xi=1

Theorem (we actually show a stronger result; ask if interested)

For an attractive binary pairwise model and any Xi, ZB ≤ ZB^(i)

Repeat as before: ZB ≤ ZB^(i) ≤ ZB^(ij) ≤ ⋯ ≤ Z
Corollary (similar proof to earlier result; first proved Ruozzi, 2012)
For an attractive binary pairwise model, ZB ≤ Z
⇒ each clamp and sum can only improve ZB
15 / 21
Recap of results so far
We have used clamping as a proof technique
Derived lower and upper bounds on Z for attractive models
ZB ≤ Z ≤ 2^k ZB
(the lower bound ZB ≤ Z holds for attractive models only; the upper bound Z ≤ 2^k ZB holds for attractive and mixed models)

⇔ Z / 2^k ≤ ZB ≤ Z
(again, ZB ≤ Z for attractive models only)
We also proved that for attractive models, clamping and summing (optimum) Bethe sub-partition functions can only improve the estimate
How about for mixed models?
16 / 21
Example: here clamping any variable worsens ZB estimate
[Figure: a four-cycle on x1, x2, x3, x4]

Blue edges are attractive with edge weight +2
Red edges are repulsive with edge weight −2
No singleton potentials
(performance is only slightly worse with clamping)
In practice, if we pick a good variable to clamp, then clampingis usually helpful
17 / 21
New work: what does clamping do for MF and TRW?
Mean field (MF) approximation assumes independent variables; yields a lower bound, ZM ≤ Z

Tree-reweighted (TRW) is a pairwise approximation similar to Bethe but allows a convex optimization and yields an upper bound, Z ≤ ZT. Together: ZM ≤ Z ≤ ZT

Earlier, we showed that for Bethe, clamping always improves the approximation for attractive models, and often but not always improves it for mixed models

How about for MF and TRW? ZM ≤ ZB ≤ ZT
Theorem
For both MF and TRW, for attractive and mixed models, clampingand summing approximate sub-partition functions can only improvethe respective approximation and bound (any number of labels).
18 / 21
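As a concrete illustration of the MF lower bound (a naive mean-field sketch on a made-up model, not the paper's implementation): coordinate ascent on independent Bernoulli q_i, then the Jensen bound log ZM = E_q[θ · x] + Σ_i H(q_i) ≤ log Z, which holds for any product distribution.

```python
import itertools, math

n = 5
theta = [0.4, -0.2, 0.6, -0.5, 0.1]                               # made-up potentials
W = {(0, 1): 1.5, (0, 2): -1.0, (1, 3): 0.8, (2, 4): 1.2, (3, 4): -0.6}

nbrs = {i: {} for i in range(n)}
for (i, j), w in W.items():
    nbrs[i][j] = w
    nbrs[j][i] = w

q = [0.5] * n
for _ in range(200):                          # coordinate-ascent sweeps
    for i in range(n):
        field = theta[i] + sum(w * q[j] for j, w in nbrs[i].items())
        q[i] = 1.0 / (1.0 + math.exp(-field))  # q_i = sigmoid(local field)

def H(p):                                     # Bernoulli entropy
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# log ZM = E_q[theta . x] + sum_i H(q_i)  (Jensen lower bound on log Z)
log_zm = sum(theta[i] * q[i] for i in range(n))
log_zm += sum(w * q[i] * q[j] for (i, j), w in W.items())
log_zm += sum(H(qi) for qi in q)

log_z = math.log(sum(
    math.exp(sum(theta[i] * x[i] for i in range(n)) +
             sum(w * x[i] * x[j] for (i, j), w in W.items()))
    for x in itertools.product([0, 1], repeat=n)))

print(log_zm <= log_z)  # True: mean field lower-bounds the true log Z
```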
Error in log Z vs number of clamps: grids

[Figure: four panels of error in log Z vs number of clamps (0 to 5); rows: large (9x9) and small (5x5) grids; columns: attractive grids with weights in [0, 6] and mixed grids with weights in [−6, 6]; curves compare best, worst, pseudo, and greedy clamp choices]

19 / 21
Conclusions for practitioners
Typically Bethe performs very well
Clamping can be very helpful, more so for denser models with stronger edge weights, a setting where inference is often hard
We provide fast methods to select a good variable to clamp
MF and TRW provide useful bounds on Z and ZB
Thank you
For more information, see http://mlg.eng.cam.ac.uk/adrian/
20 / 21
References
N. Ruozzi. The Bethe partition function of log-supermodular graphical models. In NIPS, 2012.

E. Sudderth, M. Wainwright, and A. Willsky. Loop series and Bethe variational bounds in attractive graphical models. In NIPS, 2007.

A. Weller and J. Domke. Clamping improves TRW and mean field approximations. To appear in AISTATS, 2016.

A. Weller and T. Jebara. Bethe bounds and approximating the global optimum. In AISTATS, 2013.

A. Weller and T. Jebara. Clamping variables and approximate inference. In NIPS, 2014.

J. Yedidia, W. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In IJCAI, Distinguished Lecture Track, 2001.
21 / 21
Supplementary material
Extra slides for questions or further explanation
22 / 21
Error in log Z vs number of clamps: complete graphs

[Figure: error in log Z vs number of clamps (0 to 5); attractive K15 with weights in [0, 6] (left) and mixed K15 with weights in [−6, 6] (right); curves: best, worst, pseudo, greedy]
For dense mixed models (many edges),
MF can be better than Bethe
What happens if we increase edge strength?
23 / 21
Error in log Z vs number of clamps: complete graphs

[Figure: error in log Z vs number of clamps (0 to 5); mixed K15 with weights in [−6, 6] (left) and in [−12, 12] (right); curves: best, worst, pseudo, greedy]
With stronger edges, MF is much better than Bethe!
But MF assumes variables are independent, what’s going on?
Frustrated cycles cause Bethe to overestimate by a lot
TRW is even worse
MF behaves much better (in the marginal polytope)
24 / 21
Time (secs) vs error in log Z for various methods

Mixed models, Wij ∼ U[−6, 6]; time shown on a log scale

[Figure: time (log scale) vs error in log Z for a 7x7 grid (left) and complete K10 (right); methods: TRW, B (Bethe), MF, maxW+c+TRE, pseudo, greedy]
Clamping can make the subsequent optimization problems easier, hence sometimes the total time with clamping is lower while also being more accurate
25 / 21
Clamping variables: strongest result for attractive models
log ZB = max_{q∈L} [ θ · q + SB(q) ]

For any variable Xi and x ∈ [0, 1], let qi = q(Xi = 1) and

log ZBi(x) = max_{q∈L: qi=x} [ θ · q + SB(q) ]

ZBi(x) is the 'Bethe partition function constrained to qi = x'

Note: ZBi(0) = ZB|Xi=0, ZBi(x*) = ZB (where x* is the unconstrained optimum of qi), ZBi(1) = ZB|Xi=1

Define the new function

Ai(x) := log ZBi(x) − Si(x)
Theorem (implies all other results for attractive models)
For an attractive binary pairwise model, Ai (x) is convex
Builds on derivatives of Bethe free energy from [WJ13]
26 / 21
Experiments: Which variable to clamp?
Compare the error | log Z − log ZB^(i) | to the original error | log Z − log ZB |
for various ways to choose which variable Xi to clamp:

best: clamp for the best improvement in error of Z in hindsight
worst: clamp for the worst improvement in error of Z in hindsight
avg: average performance over clamp choices
maxW: max sum of incident edge weights, Σ_{j∈N(i)} |Wij|
Mpower: more sophisticated, based on powers of a related matrix
27 / 21
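The maxW rule above is simple enough to sketch in a few lines (the graph and its edge weights here are made up): clamp the variable with the largest total absolute incident edge weight.

```python
# maxW selection: argmax_i of sum_{j in N(i)} |W_ij| over a made-up graph
W = {(0, 1): 2.0, (0, 2): -1.5, (1, 2): 0.5, (2, 3): -3.0, (3, 4): 1.0}

strength = {}
for (i, j), w in W.items():
    strength[i] = strength.get(i, 0.0) + abs(w)
    strength[j] = strength.get(j, 0.0) + abs(w)

best = max(strength, key=strength.get)
print(best, strength[best])  # variable 2: |−1.5| + |0.5| + |−3.0| = 5.0
```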
Experiments: attractive random graph n = 10, p = 0.5
unary θi ∼ U[−2, 2], edge Wij ∼ U[0, Wmax]

Observe:
Clamping any variable helps significantly
Our selection methods perform well

[Figure: error of estimate of log Z (left) and avg ℓ1 error of singleton marginals (right) vs max interaction strength W ∈ {2, 4, 8, 12, 16}, using Frank-Wolfe to optimize the Bethe free energy; curves: Original, all Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower]
28 / 21
Experiments: mixed random graph n = 10, p = 0.5
unary θi ∼ U[−2, 2], edge Wij ∼ U[−Wmax, Wmax]

Results remain promising for higher n

[Figure: error of estimate of log Z (left) and avg ℓ1 error of singleton marginals (right) vs max interaction strength W, using Frank-Wolfe to optimize the Bethe free energy; curves: Original, avg Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower]
29 / 21
Experiments: attractive complete graph n = 10, TRW
unary θi ∼ U[−0.1, 0.1], edge Wij ∼ U[−Wmax, Wmax]

Note the low unary potentials

Clamping a variable 'breaks symmetry' and overcomes the TRW advantage

[Figure: error of estimate of log Z (left) and avg ℓ1 error of singleton marginals (right) vs max interaction strength W; curves: Original, all Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower, TRW]
30 / 21
Experiments: mixed complete graph n = 10, TRW
unary θi ∼ U[−2, 2], edge Wij ∼ U[0, Wmax]

Note the regular singleton potentials

[Figure: error of estimate of log Z (left) and avg ℓ1 error of singleton marginals (right) vs max interaction strength W; curves: Original, avg Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower, TRW]
31 / 21
Experiments: attractive random graph n = 50, p = 0.1
unary θi ∼ U[−2, 2], edge Wij ∼ U[0, Wmax]

'worst Clamp' performs worse here due to suboptimal solutions found by Frank-Wolfe

[Figure: error of estimate of log Z (left) and avg ℓ1 error of singleton marginals (right) vs max interaction strength W]
32 / 21
Experiments: mixed random graph n = 50, p = 0.1
unary θi ∼ U[−2, 2], edge Wij ∼ U[−Wmax, Wmax]

Performance is still good when clamping just one variable

[Figure: error of estimate of log Z (left) and avg ℓ1 error of singleton marginals (right) vs max interaction strength W; curves: Original, avg Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower]
33 / 21
Experiments: attractive ‘lamp’ graph
unary θi ∼ U[−2, 2], edge Wij ∼ U[0, Wmax]

Mpower performs well, significantly better than maxW

[Figure: the 'lamp' graph on x1, …, x10; error of estimate of log Z (left) and avg ℓ1 error of singleton marginals (right) vs max interaction strength W; curves: Original, all Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower]
34 / 21
Experiments: mixed ‘lamp’ graph
unary θi ∼ U[−2, 2], edge Wij ∼ U[−Wmax, Wmax]

Mpower performs well, significantly better than maxW

[Figure: the 'lamp' graph on x1, …, x10; error of estimate of log Z (left) and avg ℓ1 error of singleton marginals (right) vs max interaction strength W; curves: Original, avg Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower]
35 / 21