Inference by randomly perturbing max-solvers
Tamir Hazan, University of Haifa
Inference in machine learning
• 20 years ago: does an image contain a person?
• 10 years ago: which object is in the image?
• Today's challenge: exponentially many options.
  - For each pixel: decide if it is foreground or background.
  - The space of possible structures is exponential.
• Interactive annotation.
Outline
• Random perturbation - why and how?
  - Sampling likely structures as fast as finding the most likely one.
• Connections and alternatives to the Gibbs distribution:
  - the marginal polytope
  - non-MCMC sampling for Gibbs with perturb-max
• Application: interactive annotation.
  - New entropy bounds for perturb-max models.
Inference in machine learning
• Machine learning applications are characterized by:
  - complex structures y = (y_1, ..., y_n), e.g., y ∈ {0, 1}^n
  - a potential function that scores these structures:
    θ(y_1, ..., y_n) = Σ_{i∈V} θ_i(y_i) + Σ_{i,j∈E} θ_{i,j}(y_i, y_j)
(figure: example segmentations with high and low scores)
• For machine learning we need to efficiently infer from distributions over complex structures.
Gibbs distribution
  p(y_1, ..., y_n) = (1/Z) exp( Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j) )
• MCMC samplers: Gibbs sampling, Metropolis-Hastings, Swendsen-Wang.
• Many efficient sampling algorithms for special cases:
  - Ising models (Jerrum 93)
  - counting bipartite matchings in planar graphs (Kasteleyn 61)
  - approximating the permanent (Jerrum 04)
  - many others…
• Sampling from the Gibbs distribution is provably hard in AI applications (Goldberg 05, Jerrum 93).
Sampling likely structures
  p(y) ∝ exp( Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j) )
• Data term: θ_i(y_i) = log p(y_i | x_i), where x_i is the RGB color of pixel i.
• Smoothness term: θ_{i,j}(y_i, y_j) = 1 if y_i = y_j, −1 otherwise.
• Sampling from the Gibbs distribution is provably hard in AI applications (Goldberg 05, Jerrum 93).
• Recall: sampling from the Gibbs distribution is easy in Ising models (Jerrum 93), i.e., when θ_i(y_i) = 0.
• Data terms (signals) that are important in AI applications significantly change the complexity of sampling.
• Instead of sampling, it may be significantly faster to find the most likely structure.
Most likely structure
• The most likely structure:
  y* = argmax_{y_1,...,y_n} Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j)
• Maximum a-posteriori (MAP) inference.
• Many efficient optimization algorithms for special cases:
  - belief propagation: trees (Pearl 88), perfect graphs (Jebara 10)
  - graph-cuts for image segmentation
  - branch and bound (Rother 09), branch and cut (Gurobi)
  - linear programming relaxations (Schlesinger 76, Wainwright 05, Kolmogorov 06, Werner 07, Sontag 08, Hazan 10, Batra 10, Nowozin 10, Pletscher 12, Kappes 13, Savchynskyy 13, Tarlow 13, Kohli 13, Jancsary 13, Schwing 13)
  - CKY for parsing
  - many others…
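For intuition, here is a minimal sketch (mine, not from the talk) of one of the efficient special cases listed above: MAP inference on a binary chain by dynamic programming (max-product / Viterbi), checked against brute-force enumeration. The random potentials are purely illustrative.

```python
import itertools

import numpy as np

# MAP on a binary chain by dynamic programming (max-product / Viterbi).
# The random potentials below are illustrative, not from the talk.
n = 5
rng = np.random.default_rng(0)
unary = rng.normal(size=(n, 2))        # theta_i(y_i)
pair = rng.normal(size=(n - 1, 2, 2))  # theta_{i,i+1}(y_i, y_{i+1})

# Forward pass: m[i, b] = best score of y_1..y_i with y_i = b.
m = unary.copy()
back = np.zeros((n, 2), dtype=int)
for i in range(1, n):
    cand = m[i - 1][:, None] + pair[i - 1]  # (previous value, current value)
    back[i] = np.argmax(cand, axis=0)
    m[i] = unary[i] + np.max(cand, axis=0)

# Backtrack the maximizing assignment y*.
y = np.zeros(n, dtype=int)
y[-1] = np.argmax(m[-1])
for i in range(n - 1, 0, -1):
    y[i - 1] = back[i, y[i]]
map_score = m[-1].max()

# Check against brute-force enumeration of all 2^n structures.
def score(cfg):
    s = sum(unary[i, cfg[i]] for i in range(n))
    return s + sum(pair[i, cfg[i], cfg[i + 1]] for i in range(n - 1))

best = max(itertools.product([0, 1], repeat=n), key=score)
assert abs(map_score - score(best)) < 1e-9
```

The same two-pass structure underlies belief propagation on trees and CKY on parse forests; only the semiring and the index set change.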
The challenge
Sampling likely high-dimensional structures (with millions of variables, e.g., image segmentation with 12 million pixels) as efficiently as optimizing.
Most likely structure
  θ(y) = Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j),  y = (y_1, ..., y_n)
• Selecting the maximizing structure y* is appropriate when one structure (e.g., segmentation / parse) dominates the others.
(figure: scores over structures with a single dominant peak at y*)
• The maximizing structure is not robust when there are multiple high-scoring alternatives.
(figure: scores over structures with several near-maximal peaks)
Probabilistic vs. rule-based
• Rule-based grammars do not generalize well across domains and languages.
Random perturbations
• Randomly perturbing the system reveals its complexity:
  - little effect when the maximizing structure is "evident"
  - substantial effect when there are alternative high-scoring structures
(figure: perturbed scores over structures in the two cases, each with its maximizer y*)
• Related work:
  - McFadden 74 (discrete choice theory)
  - Talagrand 94, Barvinok 07 (canonical processes)
Random perturbations
• Notation:
  - scores (potential): θ(y)
  - perturbations: γ(y)
  - perturbed score: θ(y) + γ(y)
(figure: scores θ(y) over structures, each shifted by a perturbation γ(y))
• For every structure y, the perturbation value γ(y) is a random variable (y is an index; the traditional notation is γ_y).
• Perturb-max models ask how stable the maximal structure is to random changes in the potential function.
Perturb-max models
• Theorem: Let γ(y) be i.i.d. with the Gumbel distribution with zero mean,
  F(t) := P[γ(y) ≤ t] = exp(−exp(−t)),  f(t) = F′(t) = exp(−t) F(t).
  Then the perturb-max model is the Gibbs distribution:
  (1/Z) exp(θ(y)) = P_{γ∼Gumbel}[ y = argmax_ŷ { θ(ŷ) + γ(ŷ) } ]
Perturb-max models
• Why the Gumbel distribution? Because the maximum of Gumbel variables is itself a Gumbel variable.
• Lemma: Let γ(y) be i.i.d. Gumbel, P[γ(y) ≤ t] = F(t) = exp(−exp(−t)). Then max_y { θ(y) + γ(y) } has a Gumbel distribution whose mean is log Z, where Z = Σ_y exp(θ(y)).
• Proof:
  P_γ[ max_y { θ(y) + γ(y) } ≤ t ] = Π_y F(t − θ(y)) = exp( −Σ_y exp(−(t − θ(y))) ) = F(t − log Z)
Perturb-max models
• Max stability:
  (1/Z) exp(θ(y)) = P_{γ∼Gumbel}[ y = argmax_ŷ { θ(ŷ) + γ(ŷ) } ]
• Implications (taking gradients):
  log( Σ_y exp(θ(y)) ) = E_{γ∼Gumbel}[ max_y { θ(y) + γ(y) } ]
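The max-stability theorem is easy to verify numerically. A small NumPy sketch (mine, not from the talk): adding i.i.d. zero-mean Gumbel noise to the scores and taking the argmax reproduces the Gibbs distribution, and the expected perturbed maximum estimates log Z. The four scores are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up scores theta(y) over four structures y.
theta = np.array([1.0, 0.5, -0.3, 2.0])

# Gibbs distribution p(y) = exp(theta(y)) / Z.
gibbs = np.exp(theta) / np.exp(theta).sum()
log_Z = np.log(np.exp(theta).sum())

# Perturb-max: add i.i.d. zero-mean Gumbel noise gamma(y), take the argmax.
n_samples = 200_000
gamma = rng.gumbel(size=(n_samples, theta.size)) - np.euler_gamma
perturbed = theta + gamma
winners = np.argmax(perturbed, axis=1)
perturb_max = np.bincount(winners, minlength=theta.size) / n_samples

# Max stability: argmax frequencies match the Gibbs distribution,
# and the expected perturbed maximum matches log Z.
print(perturb_max, gibbs)
print(perturbed.max(axis=1).mean(), log_Z)
```

Subtracting `np.euler_gamma` centers the standard Gumbel at zero mean, matching the theorem's normalization.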
Perturb-max models
• Representing the Gibbs distribution with perturb-max models may require an exponential number of perturbations, one γ(y) per structure y = (y_1, ..., y_n):
  P_γ[ y = argmax_ŷ { θ(ŷ) + γ(ŷ) } ]
• Instead, use low-dimensional perturbations:
  P_γ[ y = argmax_ŷ { θ(ŷ) + Σ_{i=1}^n γ_i(ŷ_i) } ]
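Low-dimensional perturbations can be sketched on a toy chain (my example, not from the talk): perturb only the unary terms γ_i(y_i), solve the perturbed maximization (brute force here; a MAP solver in general), and compare the resulting marginals with the exact Gibbs marginals. They agree only approximately, since the low-dimensional model is not the Gibbs distribution.

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)

# Toy binary chain y = (y1, y2, y3); the potentials are made up.
n = 3
unary = np.array([[0.5, -0.5], [0.0, 0.0], [-0.6, 0.6]])  # theta_i(y_i)
c = 0.8  # theta_{i,j}(y_i, y_j) = c if y_i == y_j else -c

def score(y):
    s = sum(unary[i, y[i]] for i in range(n))
    return s + sum(c if y[i] == y[i + 1] else -c for i in range(n - 1))

configs = list(itertools.product([0, 1], repeat=n))

# Exact Gibbs marginals p(y_i = 1) by enumeration.
w = np.array([np.exp(score(y)) for y in configs])
p = w / w.sum()
gibbs_marg = np.array([p[[y[i] == 1 for y in configs]].sum() for i in range(n)])

# Low-dimensional perturb-max: perturb only the unary terms gamma_i(y_i)
# and take the argmax (brute force here; a MAP solver in general).
n_samples = 20_000
counts = np.zeros(n)
for _ in range(n_samples):
    g = rng.gumbel(size=(n, 2)) - np.euler_gamma
    y_star = max(configs, key=lambda y: score(y) + sum(g[i, y[i]] for i in range(n)))
    counts += np.array(y_star)
pm_marg = counts / n_samples

print(gibbs_marg, pm_marg)  # close, but not identical in general
```

The point of the construction is that each sample costs one call to a max-solver, even when enumeration is infeasible.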
The marginal polytope
  θ(y_1, ..., y_n) = Σ_{i∈V} θ_i(y_i) + Σ_{i,j∈E} θ_{i,j}(y_i, y_j)
(figure: a chain y_1 - y_2 - y_3 annotated with unary potentials θ_i(0), θ_i(1) and pairwise potentials θ_{i,j}(y_i, y_j))
The marginal polytope
• The local marginals are collected into a vector μ ∈ M:
  μ = ( μ_1(0), μ_1(1), μ_2(0), μ_2(1), μ_3(0), μ_3(1),
        μ_{1,2}(0,0), μ_{1,2}(0,1), μ_{1,2}(1,0), μ_{1,2}(1,1),
        μ_{2,3}(0,0), μ_{2,3}(0,1), μ_{2,3}(1,0), μ_{2,3}(1,1) )
• μ ∈ M iff there exists p(y_1, y_2, y_3) such that
  μ_1(y_1) = Σ_{y_2,y_3} p(y_1, y_2, y_3), ...,  μ_{1,2}(y_1, y_2) = Σ_{y_3} p(y_1, y_2, y_3), ...
The marginal polytope
• The Gibbs distribution
  p(y) ∝ exp( Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j) )
  realizes the marginals μ ∈ M (a minimal representation).
• So does the perturb-max model with low-dimensional perturbations:
  p(y) = P_γ[ y = argmax_ŷ { Σ_i θ_i(ŷ_i) + Σ_{i,j} θ_{i,j}(ŷ_i, ŷ_j) + Σ_i γ_i(ŷ_i) } ]
• Proof idea: the marginals are derivatives of the expected perturbed maximum,
  μ_i(y_i) = ∂ E_γ[ max_y { Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j) + Σ_i γ_i(y_i) } ] / ∂ θ_i(y_i)
  μ_{i,j}(y_i, y_j) = ∂ E_γ[ max_y { Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j) + Σ_i γ_i(y_i) } ] / ∂ θ_{i,j}(y_i, y_j)
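The derivative identity behind the proof idea can be checked numerically in the full-perturbation case, where the expected maximum equals log Z. A sketch (mine, with made-up scores): by Danskin's theorem, the gradient of the expected max with respect to θ(y) is the probability that y attains the argmax, which a finite difference with common random numbers confirms.

```python
import numpy as np

rng = np.random.default_rng(2)

# Full (one perturbation per structure) case on four made-up scores,
# so the perturb-max distribution is exactly the Gibbs distribution.
theta = np.array([0.2, 1.1, -0.4, 0.6])
n_samples = 100_000
gamma = rng.gumbel(size=(n_samples, theta.size)) - np.euler_gamma
mx = (theta + gamma).max(axis=1)

# Danskin: d/d theta(y) of max{theta + gamma} is the indicator that y wins,
# so the gradient of the expected max is the argmax probability.
winners = np.argmax(theta + gamma, axis=1)
mu_indicator = np.bincount(winners, minlength=theta.size) / n_samples

# The same gradient by finite differences with common random numbers.
eps = 1e-3
mu_fd = np.empty(theta.size)
for y in range(theta.size):
    bump = theta.copy()
    bump[y] += eps
    mu_fd[y] = ((bump + gamma).max(axis=1) - mx).mean() / eps

print(mu_indicator, mu_fd)
```

Reusing the same `gamma` for both sides of the finite difference keeps the Monte Carlo noise common to the two terms, so the estimate is sharp even with a tiny `eps`.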
Non-MCMC sampling
• Perturb-max models provably approximate the Gibbs marginals on trees [H., Maji, Jaakkola 13], and do so in practice on general graphs.
• Perturb-max models sample from tree-shaped Gibbs distributions [Gane, H., Jaakkola 14].
Image annotation
• Image annotation is a time-consuming (and tedious) task. Can computers do it for us?
• Why not use the most likely annotation instead?
• The most likely annotation is inaccurate around:
  - "thin" areas
  - clutter
Interactive image annotation
• Perturb-max models show the boundary of decision.
• Interactive annotation directs the human annotator to areas of uncertainty and significantly reduces annotation time [Maji, H., Jaakkola 14].
Uncertainty
• Entropy: H(p_θ) = −Σ_y p_θ(y) log p_θ(y)
• Entropy measures uncertainty:
  - it is a nonnegative function over probability distributions
  - it attains its maximal value at the uniform distribution
  - it attains its minimal value at a zero-one distribution
• Computing the entropy requires summing over exponentially many configurations y = (y_1, ..., y_n).
• Can we bound it with the perturb-max approach?
Uncertainty
• Perturb-max model:
  p_θ(y) := P_γ[ y = argmax_ŷ { θ(ŷ) + Σ_{i=1}^n γ_i(ŷ_i) } ]
• Entropy: H(p_θ) = −Σ_y p_θ(y) log p_θ(y)
• Perturb-max entropy bound:
  H(p_θ) ≤ E_γ[ Σ_{i=1}^n γ_i(y*_i) ],  where  y* = argmax_y { θ(y) + Σ_{i=1}^n γ_i(y_i) }
• Standard entropy independence bound:
  H(p_θ) ≤ Σ_i H(p_θ(y_i)),  where p_θ(y_i) = P_γ[ y*_i = y_i ] is the marginal of the perturb-max model.
• The perturb-max bound E[ Σ_i γ_i(y*_i) ] = Σ_i E[ γ_i(y*_i) ] requires fewer samples to estimate, since the tail of the sampled average decreases exponentially [Orabona, H., Sarwate, Jaakkola 14].
• How does it compare to standard entropy bounds?
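The entropy bound can be sanity-checked by Monte Carlo on a toy chain (my example, not from the talk): estimate the perturb-max distribution and its entropy from samples, and compare with the sampled average of Σ_i γ_i(y*_i).

```python
import itertools

import numpy as np

rng = np.random.default_rng(3)

# Toy binary chain; the potentials are made up.
n = 3
unary = np.array([[0.4, -0.4], [0.1, -0.1], [-0.2, 0.2]])  # theta_i(y_i)
c = 0.5  # theta_{i,j}(y_i, y_j) = c if y_i == y_j else -c

def score(y):
    s = sum(unary[i, y[i]] for i in range(n))
    return s + sum(c if y[i] == y[i + 1] else -c for i in range(n - 1))

configs = list(itertools.product([0, 1], repeat=n))

# Sample y* = argmax{theta(y) + sum_i gamma_i(y_i)} with zero-mean Gumbel
# perturbations; record the winner and the bound term sum_i gamma_i(y*_i).
n_samples = 20_000
counts = dict.fromkeys(configs, 0)
bound_terms = np.empty(n_samples)
for t in range(n_samples):
    g = rng.gumbel(size=(n, 2)) - np.euler_gamma
    y_star = max(configs, key=lambda y: score(y) + sum(g[i, y[i]] for i in range(n)))
    counts[y_star] += 1
    bound_terms[t] = sum(g[i, y_star[i]] for i in range(n))

# Entropy of the (estimated) perturb-max model vs. the perturb-max bound.
p = np.array([counts[y] for y in configs]) / n_samples
entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
bound = bound_terms.mean()
print(entropy, bound)  # the entropy should not exceed the bound
```

At θ = 0 the bound is tight (both sides equal n log 2); with non-trivial potentials it stays an upper bound.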
Perturb-max entropy bounds
• Spin glass, 10x10 grid:
  Σ_i θ_i(y_i) + Σ_{i,j} θ_{i,j}(y_i, y_j)
(figure: entropy estimate vs. λ on a 3 x 3 grid panel, E = Σ_i θ_i y_i − Σ_{i,j} λ θ_{i,j} y_i y_j; curves: exact entropy, marginal entropy bound, perturb-max entropy bound)
Interactive image annotation
(figure: annotation progressing from initial boundary to final boundary; at each step the current boundary and its refinement are shown)
Open problems
• Perturb-max models:
  - When does fixing variables in the max-function amount to statistical conditioning?
  - When do perturb-max models preserve the most likely assignment?
  - How does the perturbation dimension affect the model's properties?
  - In what ways do higher-dimensional perturbations reveal complex structures in the model?
  - How can perturbations be applied in restricted spaces, e.g., supermodular potential functions?
  - How to encourage diverse sampling?
  - Perturb-max models stabilize the prediction. Do they connect computational and statistical stability?
Thank you
Uncertainty*
• Theorem:
  H(p_θ) ≤ E[ Σ_i γ_i(y*_i) ],  where  y* = argmax_y { θ(y) + Σ_{i=1}^n γ_i(y_i) }
• Proof idea: conjugate duality. For any distribution p,
  H(p) = min_θ̂ { log Z(θ̂) − Σ_y θ̂(y) p(y) }
• The flashback slide (max stability):
  log Z(θ̂) = log( Σ_y exp(θ̂(y)) ) = E_{γ∼Gumbel}[ max_y { θ̂(y) + γ(y) } ]
• With low-dimensional perturbations, the expected maximum upper-bounds the log-partition function:
  log Z(θ̂) ≤ E_γ[ max_y { θ̂(y) + Σ_i γ_i(y_i) } ]
• Therefore
  H(p) ≤ min_θ̂ { E_γ[ max_y { θ̂(y) + Σ_i γ_i(y_i) } ] − Σ_y θ̂(y) p(y) }
• Evaluating at p = p_θ and θ̂ = θ:
  H(p_θ) ≤ E_γ[ max_y { θ(y) + Σ_i γ_i(y_i) } ] − Σ_y θ(y) p_θ(y)
  Since p_θ is the distribution of y*, the right-hand side equals E_γ[ θ(y*) + Σ_i γ_i(y*_i) ] − E_γ[ θ(y*) ] = E_γ[ Σ_i γ_i(y*_i) ].