Inference by randomly perturbing max-solvers
Tamir Hazan, University of Haifa
Inference in machine learning
• 20 years ago: does an image contain a person?
• 10 years ago: which object is in the image?
• Today's challenge: exponentially many options.
  - For each pixel: decide if it is foreground or background.
  - The space of possible structures is exponential.
• Interactive annotation.
Outline
• Random perturbation - why and how?
  - Sampling likely structures as fast as finding the most likely one.
• Connections and alternatives to the Gibbs distribution:
  - the marginal polytope
  - non-MCMC sampling for Gibbs with perturb-max
• Application: interactive annotation.
  - New entropy bounds for perturb-max models.
Inference in machine learning
• Machine learning applications are characterized by:
  - complex structures y = (y_1, ..., y_n), e.g., y ∈ {0, 1}^n
  - a potential function that scores these structures:
    θ(y_1, ..., y_n) = Σ_{i∈V} θ_i(y_i) + Σ_{i,j∈E} θ_{i,j}(y_i, y_j)
(figure: example segmentations with high and low scores)
• For machine learning we need to efficiently infer from distributions over complex structures.
Gibbs distribution
  p(y_1, ..., y_n) = (1/Z) exp( Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j) )
• MCMC samplers: Gibbs sampling, Metropolis-Hastings, Swendsen-Wang.
• Many efficient sampling algorithms for special cases:
  - Ising models (Jerrum 93)
  - counting bipartite matchings in planar graphs (Kasteleyn 61)
  - approximating the permanent (Jerrum 04)
  - many others…
• Sampling from the Gibbs distribution is provably hard in AI applications (Goldberg 05, Jerrum 93).
Sampling likely structures
  p(y) ∝ exp( Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j) )
• Data term: θ_i(y_i) = log p(y_i | x_i), where x_i is the RGB color of pixel i.
• Smoothness term: θ_{i,j}(y_i, y_j) = 1 if y_i = y_j, −1 otherwise.
• Sampling from the Gibbs distribution is provably hard in AI applications (Goldberg 05, Jerrum 93).
• Recall: sampling from the Gibbs distribution is easy in Ising models (Jerrum 93), i.e., when θ_i(y_i) = 0.
• Data terms (signals) that are important in AI applications significantly change the complexity of sampling.
• Instead of sampling, it may be significantly faster to find the most likely structure.
Most likely structure
• The most likely structure:
  y* = argmax_{y_1,...,y_n} Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j)
• Maximum a-posteriori (MAP) inference.
• Many efficient optimization algorithms for special cases:
  - belief propagation: trees (Pearl 88), perfect graphs (Jebara 10)
  - graph-cuts for image segmentation
  - branch and bound (Rother 09), branch and cut (Gurobi)
  - linear programming relaxations (Schlesinger 76, Wainwright 05, Kolmogorov 06, Werner 07, Sontag 08, Hazan 10, Batra 10, Nowozin 10, Pletscher 12, Kappes 13, Savchynskyy 13, Tarlow 13, Kohli 13, Jancsary 13, Schwing 13)
  - CKY for parsing
  - many others…
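For intuition, here is a minimal sketch (mine, not from the talk) of one of the efficient special cases listed above: MAP inference on a binary chain by dynamic programming (max-product / Viterbi), checked against brute-force enumeration. The random potentials are purely illustrative.

```python
import itertools

import numpy as np

# MAP on a binary chain by dynamic programming (max-product / Viterbi).
# The random potentials below are illustrative, not from the talk.
n = 5
rng = np.random.default_rng(0)
unary = rng.normal(size=(n, 2))        # theta_i(y_i)
pair = rng.normal(size=(n - 1, 2, 2))  # theta_{i,i+1}(y_i, y_{i+1})

# Forward pass: m[i, b] = best score of y_1..y_i with y_i = b.
m = unary.copy()
back = np.zeros((n, 2), dtype=int)
for i in range(1, n):
    cand = m[i - 1][:, None] + pair[i - 1]  # (previous value, current value)
    back[i] = np.argmax(cand, axis=0)
    m[i] = unary[i] + np.max(cand, axis=0)

# Backtrack the maximizing assignment y*.
y = np.zeros(n, dtype=int)
y[-1] = np.argmax(m[-1])
for i in range(n - 1, 0, -1):
    y[i - 1] = back[i, y[i]]
map_score = m[-1].max()

# Check against brute-force enumeration of all 2^n structures.
def score(cfg):
    s = sum(unary[i, cfg[i]] for i in range(n))
    return s + sum(pair[i, cfg[i], cfg[i + 1]] for i in range(n - 1))

best = max(itertools.product([0, 1], repeat=n), key=score)
assert abs(map_score - score(best)) < 1e-9
```

The same two-pass structure underlies belief propagation on trees and CKY on parse forests; only the semiring and the index set change.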
The challenge
Sampling likely high-dimensional structures (with millions of variables, e.g., image segmentation with 12 million pixels) as efficiently as optimizing.
Most likely structure
  θ(y) = Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j),  y = (y_1, ..., y_n)
• Selecting the maximizing structure y* is appropriate when one structure (e.g., segmentation / parse) dominates the others.
(figure: scores over structures with a single dominant peak at y*)
• The maximizing structure is not robust when there are multiple high-scoring alternatives.
(figure: scores over structures with several near-maximal peaks)
Probabilistic vs. rule-based
• Rule-based grammars do not generalize well across domains and languages.
Random perturbations
• Randomly perturbing the system reveals its complexity:
  - little effect when the maximizing structure is "evident"
  - substantial effect when there are alternative high-scoring structures
(figure: perturbed scores over structures in the two cases, each with its maximizer y*)
• Related work:
  - McFadden 74 (discrete choice theory)
  - Talagrand 94, Barvinok 07 (canonical processes)
Random perturbations
• Notation:
  - scores (potential): θ(y)
  - perturbations: γ(y)
  - perturbed score: θ(y) + γ(y)
(figure: scores θ(y) over structures, each shifted by a perturbation γ(y))
• For every structure y, the perturbation value γ(y) is a random variable (y is an index; the traditional notation is γ_y).
• Perturb-max models ask how stable the maximal structure is to random changes in the potential function.
Perturb-max models
• Theorem: Let γ(y) be i.i.d. with the Gumbel distribution with zero mean,
  F(t) := P[γ(y) ≤ t] = exp(−exp(−t)),  f(t) = F′(t) = exp(−t) F(t).
  Then the perturb-max model is the Gibbs distribution:
  (1/Z) exp(θ(y)) = P_{γ∼Gumbel}[ y = argmax_ŷ { θ(ŷ) + γ(ŷ) } ]
Perturb-max models
• Why the Gumbel distribution? Because the maximum of Gumbel variables is itself a Gumbel variable.
• Lemma: Let γ(y) be i.i.d. Gumbel, P[γ(y) ≤ t] = F(t) = exp(−exp(−t)). Then max_y { θ(y) + γ(y) } has a Gumbel distribution whose mean is log Z, where Z = Σ_y exp(θ(y)).
• Proof:
  P_γ[ max_y { θ(y) + γ(y) } ≤ t ] = Π_y F(t − θ(y)) = exp( −Σ_y exp(−(t − θ(y))) ) = F(t − log Z)
Perturb-max models
• Max stability:
  (1/Z) exp(θ(y)) = P_{γ∼Gumbel}[ y = argmax_ŷ { θ(ŷ) + γ(ŷ) } ]
• Implications (taking gradients):
  log( Σ_y exp(θ(y)) ) = E_{γ∼Gumbel}[ max_y { θ(y) + γ(y) } ]
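The max-stability theorem is easy to verify numerically. A small NumPy sketch (mine, not from the talk): adding i.i.d. zero-mean Gumbel noise to the scores and taking the argmax reproduces the Gibbs distribution, and the expected perturbed maximum estimates log Z. The four scores are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up scores theta(y) over four structures y.
theta = np.array([1.0, 0.5, -0.3, 2.0])

# Gibbs distribution p(y) = exp(theta(y)) / Z.
gibbs = np.exp(theta) / np.exp(theta).sum()
log_Z = np.log(np.exp(theta).sum())

# Perturb-max: add i.i.d. zero-mean Gumbel noise gamma(y), take the argmax.
n_samples = 200_000
gamma = rng.gumbel(size=(n_samples, theta.size)) - np.euler_gamma
perturbed = theta + gamma
winners = np.argmax(perturbed, axis=1)
perturb_max = np.bincount(winners, minlength=theta.size) / n_samples

# Max stability: argmax frequencies match the Gibbs distribution,
# and the expected perturbed maximum matches log Z.
print(perturb_max, gibbs)
print(perturbed.max(axis=1).mean(), log_Z)
```

Subtracting `np.euler_gamma` centers the standard Gumbel at zero mean, matching the theorem's normalization.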
Perturb-max models
• Representing the Gibbs distribution with perturb-max models may require an exponential number of perturbations, one γ(y) per structure y = (y_1, ..., y_n):
  P_γ[ y = argmax_ŷ { θ(ŷ) + γ(ŷ) } ]
• Instead, use low-dimensional perturbations:
  P_γ[ y = argmax_ŷ { θ(ŷ) + Σ_{i=1}^n γ_i(ŷ_i) } ]
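Low-dimensional perturbations can be sketched on a toy chain (my example, not from the talk): perturb only the unary terms γ_i(y_i), solve the perturbed maximization (brute force here; a MAP solver in general), and compare the resulting marginals with the exact Gibbs marginals. They agree only approximately, since the low-dimensional model is not the Gibbs distribution.

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)

# Toy binary chain y = (y1, y2, y3); the potentials are made up.
n = 3
unary = np.array([[0.5, -0.5], [0.0, 0.0], [-0.6, 0.6]])  # theta_i(y_i)
c = 0.8  # theta_{i,j}(y_i, y_j) = c if y_i == y_j else -c

def score(y):
    s = sum(unary[i, y[i]] for i in range(n))
    return s + sum(c if y[i] == y[i + 1] else -c for i in range(n - 1))

configs = list(itertools.product([0, 1], repeat=n))

# Exact Gibbs marginals p(y_i = 1) by enumeration.
w = np.array([np.exp(score(y)) for y in configs])
p = w / w.sum()
gibbs_marg = np.array([p[[y[i] == 1 for y in configs]].sum() for i in range(n)])

# Low-dimensional perturb-max: perturb only the unary terms gamma_i(y_i)
# and take the argmax (brute force here; a MAP solver in general).
n_samples = 20_000
counts = np.zeros(n)
for _ in range(n_samples):
    g = rng.gumbel(size=(n, 2)) - np.euler_gamma
    y_star = max(configs, key=lambda y: score(y) + sum(g[i, y[i]] for i in range(n)))
    counts += np.array(y_star)
pm_marg = counts / n_samples

print(gibbs_marg, pm_marg)  # close, but not identical in general
```

The point of the construction is that each sample costs one call to a max-solver, even when enumeration is infeasible.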
The marginal polytope
  θ(y_1, ..., y_n) = Σ_{i∈V} θ_i(y_i) + Σ_{i,j∈E} θ_{i,j}(y_i, y_j)
(figure: a chain y_1 - y_2 - y_3 annotated with unary potentials θ_i(0), θ_i(1) and pairwise potentials θ_{i,j}(y_i, y_j))
The marginal polytope
• The local marginals are collected into a vector μ ∈ M:
  μ = ( μ_1(0), μ_1(1), μ_2(0), μ_2(1), μ_3(0), μ_3(1),
        μ_{1,2}(0,0), μ_{1,2}(0,1), μ_{1,2}(1,0), μ_{1,2}(1,1),
        μ_{2,3}(0,0), μ_{2,3}(0,1), μ_{2,3}(1,0), μ_{2,3}(1,1) )
• μ ∈ M iff there exists p(y_1, y_2, y_3) such that
  μ_1(y_1) = Σ_{y_2,y_3} p(y_1, y_2, y_3), ...,  μ_{1,2}(y_1, y_2) = Σ_{y_3} p(y_1, y_2, y_3), ...
The marginal polytope
• The Gibbs distribution
  p(y) ∝ exp( Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j) )
  realizes the marginals μ ∈ M (a minimal representation).
• So does the perturb-max model with low-dimensional perturbations:
  p(y) = P_γ[ y = argmax_ŷ { Σ_i θ_i(ŷ_i) + Σ_{i,j} θ_{i,j}(ŷ_i, ŷ_j) + Σ_i γ_i(ŷ_i) } ]
• Proof idea: the marginals are derivatives of the expected perturbed maximum,
  μ_i(y_i) = ∂ E_γ[ max_y { Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j) + Σ_i γ_i(y_i) } ] / ∂ θ_i(y_i)
  μ_{i,j}(y_i, y_j) = ∂ E_γ[ max_y { Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j) + Σ_i γ_i(y_i) } ] / ∂ θ_{i,j}(y_i, y_j)
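The derivative identity behind the proof idea can be checked numerically in the full-perturbation case, where the expected maximum equals log Z. A sketch (mine, with made-up scores): by Danskin's theorem, the gradient of the expected max with respect to θ(y) is the probability that y attains the argmax, which a finite difference with common random numbers confirms.

```python
import numpy as np

rng = np.random.default_rng(2)

# Full (one perturbation per structure) case on four made-up scores,
# so the perturb-max distribution is exactly the Gibbs distribution.
theta = np.array([0.2, 1.1, -0.4, 0.6])
n_samples = 100_000
gamma = rng.gumbel(size=(n_samples, theta.size)) - np.euler_gamma
mx = (theta + gamma).max(axis=1)

# Danskin: d/d theta(y) of max{theta + gamma} is the indicator that y wins,
# so the gradient of the expected max is the argmax probability.
winners = np.argmax(theta + gamma, axis=1)
mu_indicator = np.bincount(winners, minlength=theta.size) / n_samples

# The same gradient by finite differences with common random numbers.
eps = 1e-3
mu_fd = np.empty(theta.size)
for y in range(theta.size):
    bump = theta.copy()
    bump[y] += eps
    mu_fd[y] = ((bump + gamma).max(axis=1) - mx).mean() / eps

print(mu_indicator, mu_fd)
```

Reusing the same `gamma` for both sides of the finite difference keeps the Monte Carlo noise common to the two terms, so the estimate is sharp even with a tiny `eps`.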
Non-MCMC sampling
• Perturb-max models provably approximate the Gibbs marginals on trees [H., Maji, Jaakkola 13], and do so in practice on general graphs.
• Perturb-max models sample from tree-shaped Gibbs distributions [Gane, H., Jaakkola 14].
Image annotation
• Image annotation is a time-consuming (and tedious) task. Can computers do it for us?
• Why not use the most likely annotation instead?
• The most likely annotation is inaccurate around:
  - "thin" areas
  - clutter
Interactive image annotation
• Perturb-max models show the boundary of decision.
• Interactive annotation directs the human annotator to areas of uncertainty and significantly reduces annotation time [Maji, H., Jaakkola 14].
Uncertainty
• Entropy: H(p_θ) = −Σ_y p_θ(y) log p_θ(y)
• Entropy measures uncertainty:
  - it is a nonnegative function over probability distributions
  - it attains its maximal value at the uniform distribution
  - it attains its minimal value at a zero-one distribution
• Computing the entropy requires summing over exponentially many configurations y = (y_1, ..., y_n).
• Can we bound it with the perturb-max approach?
Uncertainty
• Perturb-max model:
  p_θ(y) := P_γ[ y = argmax_ŷ { θ(ŷ) + Σ_{i=1}^n γ_i(ŷ_i) } ]
• Entropy: H(p_θ) = −Σ_y p_θ(y) log p_θ(y)
• Perturb-max entropy bound:
  H(p_θ) ≤ E_γ[ Σ_{i=1}^n γ_i(y*_i) ],  where  y* = argmax_y { θ(y) + Σ_{i=1}^n γ_i(y_i) }
• Standard entropy independence bound:
  H(p_θ) ≤ Σ_i H(p_θ(y_i)),  where p_θ(y_i) = P_γ[ y*_i = y_i ] is the marginal of the perturb-max model.
• The perturb-max bound E[ Σ_i γ_i(y*_i) ] = Σ_i E[ γ_i(y*_i) ] requires fewer samples to estimate, since the tail of the sampled average decreases exponentially [Orabona, H., Sarwate, Jaakkola 14].
• How does it compare to standard entropy bounds?
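The entropy bound can be sanity-checked by Monte Carlo on a toy chain (my example, not from the talk): estimate the perturb-max distribution and its entropy from samples, and compare with the sampled average of Σ_i γ_i(y*_i).

```python
import itertools

import numpy as np

rng = np.random.default_rng(3)

# Toy binary chain; the potentials are made up.
n = 3
unary = np.array([[0.4, -0.4], [0.1, -0.1], [-0.2, 0.2]])  # theta_i(y_i)
c = 0.5  # theta_{i,j}(y_i, y_j) = c if y_i == y_j else -c

def score(y):
    s = sum(unary[i, y[i]] for i in range(n))
    return s + sum(c if y[i] == y[i + 1] else -c for i in range(n - 1))

configs = list(itertools.product([0, 1], repeat=n))

# Sample y* = argmax{theta(y) + sum_i gamma_i(y_i)} with zero-mean Gumbel
# perturbations; record the winner and the bound term sum_i gamma_i(y*_i).
n_samples = 20_000
counts = dict.fromkeys(configs, 0)
bound_terms = np.empty(n_samples)
for t in range(n_samples):
    g = rng.gumbel(size=(n, 2)) - np.euler_gamma
    y_star = max(configs, key=lambda y: score(y) + sum(g[i, y[i]] for i in range(n)))
    counts[y_star] += 1
    bound_terms[t] = sum(g[i, y_star[i]] for i in range(n))

# Entropy of the (estimated) perturb-max model vs. the perturb-max bound.
p = np.array([counts[y] for y in configs]) / n_samples
entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
bound = bound_terms.mean()
print(entropy, bound)  # the entropy should not exceed the bound
```

At θ = 0 the bound is tight (both sides equal n log 2); with non-trivial potentials it stays an upper bound.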
Perturb-max entropy bounds
• Spin glass, 10x10 grid:
  Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j)
(figure: entropy estimate vs. λ on a 3 x 3 grid panel, E = Σ_i θ_i y_i − Σ_{i,j} λ θ_{i,j} y_i y_j; curves: exact entropy, marginal entropy bound, perturb-max entropy bound)
Interactive image annotation
(figure: annotation progressing from initial boundary to final boundary; at each step the current boundary and its refinement are shown)
Open problems
• Perturb-max models:
  - When does fixing variables in the max-function amount to statistical conditioning?
  - When do perturb-max models preserve the most likely assignment?
  - How does the perturbation dimension affect the model's properties?
  - In what ways do higher-dimensional perturbations reveal complex structures in the model?
  - How can perturbations be applied in restricted spaces, e.g., supermodular potential functions?
  - How to encourage diverse sampling?
  - Perturb-max models stabilize the prediction. Do they connect computational and statistical stability?
Thank you
Uncertainty*
• Theorem:
  H(p_θ) ≤ E[ Σ_i γ_i(y*_i) ],  where  y* = argmax_y { θ(y) + Σ_{i=1}^n γ_i(y_i) }
• Proof idea: conjugate duality. For any distribution p,
  H(p) = min_θ̂ { log Z(θ̂) − Σ_y θ̂(y) p(y) }
• The flashback slide (max stability):
  log Z(θ̂) = log( Σ_y exp(θ̂(y)) ) = E_{γ∼Gumbel}[ max_y { θ̂(y) + γ(y) } ]
• With low-dimensional perturbations, the expected maximum upper-bounds the log-partition function:
  log Z(θ̂) ≤ E_γ[ max_y { θ̂(y) + Σ_i γ_i(y_i) } ]
• Therefore
  H(p) ≤ min_θ̂ { E_γ[ max_y { θ̂(y) + Σ_i γ_i(y_i) } ] − Σ_y θ̂(y) p(y) }
• Evaluating at p = p_θ and θ̂ = θ:
  H(p_θ) ≤ E_γ[ max_y { θ(y) + Σ_i γ_i(y_i) } ] − Σ_y θ(y) p_θ(y)
  Since p_θ is the distribution of y*, the right-hand side equals E_γ[ θ(y*) + Σ_i γ_i(y*_i) ] − E_γ[ θ(y*) ] = E_γ[ Σ_i γ_i(y*_i) ].