Learning with Intractable Inference and Partial Supervision
Jacob Steinhardt
Stanford University
September 8, 2015
J. Steinhardt (Stanford) Learning and Inference September 8, 2015 1 / 31
Motivation
An Example
Company officials refused to comment. → 公司官员拒绝对此发表评论。
He said the company would appeal. → 他表示该公司将提出上诉。
Statistical reasoning: aggregate data across sentences to reach conclusions.
Computational reasoning: focus on easily disambiguated words first.
Tension: statistics wants to expose information (aggregation), while computer science wants to hide it (abstraction, adaptivity).
Statistical inference is computationally intractable.
How can we bring these two paradigms together?
Formal Setting
1 Motivation
2 Formal Setting
3 Reified Context Models
4 Relaxed Supervision
5 Open Questions
Formal Setting
Setting: Structured Prediction
input x :
output y : v o l c a n i c
Goal: learn θ to maximize E_{x,y∼D}[log p_θ(y | x)]
Structured output space Y — requires inference
Formal Setting
Supervised Learning is Easy
Recall: want to maximize E[log p_θ(y | x)].
Suppose p_θ(y | x) ∝ exp(θ⊤ φ(x, y)). Then:
∇_θ log p_θ(y | x) = φ(x, y) [given] − E_{y′∼p_θ(·|x)}[φ(x, y′)] [inference].
Inference errors will be corrected by the supervision signal φ(x, y) over the course of learning.
In practice, anything reasonable (MCMC, beam search) works.
Conceptually, can use Searn (Daumé III et al., 2009) or pseudolikelihood (Besag, 1975) to obviate the need for inference.
Approximate inference is easy in supervised settings.
Unless we care about estimating uncertainty (calibration, precision/recall).
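For a small output space, the gradient above can be computed exactly. A minimal sketch (the label set and block-structured feature map are illustrative, not from the talk):

```python
import numpy as np

# Toy instantiation: Y is a small label set and phi(x, y) places the input
# vector x in the feature block belonging to label y.
Y = [0, 1, 2]
d = 4

def phi(x, y):
    f = np.zeros(len(Y) * d)
    f[y * d:(y + 1) * d] = x
    return f

def grad_log_lik(theta, x, y):
    # phi(x, y) is given; the expectation term is the "inference" part,
    # done here by exact enumeration over Y.
    feats = np.array([phi(x, yp) for yp in Y])
    logits = feats @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return phi(x, y) - p @ feats

rng = np.random.default_rng(0)
x = rng.normal(size=d)
g = grad_log_lik(np.zeros(len(Y) * d), x, y=1)
```

At θ = 0 the model distribution is uniform over Y, so the gradient is φ(x, 1) minus the average feature vector.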
Formal Setting
Partially Supervised Structured Prediction
input x : Company officials refused to comment.
latent z :
output y : 公司官员拒绝对此发表评论。
Goal: learn θ to maximize E_{x,y∼D}[log p_θ(y | x)]
where p_θ(y | x) = Σ_z p_θ(y, z | x)
Again assume p_θ(y, z | x) ∝ exp(θ⊤ φ(x, z, y)). Then
∇_θ log p_θ(y | x) = E_{z∼p_θ(·|x,y)}[φ(x, z, y)] [inference on z] − E_{z′,y′∼p_θ(·|x)}[φ(x, z′, y′)] [inference on z, y].
Inference errors on z get reinforced during learning.
Inference is often hardest (and most consequential) at the beginning of learning!
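With a small latent space both expectation terms above can again be enumerated exactly; a toy sketch (latent/output spaces and feature map are illustrative):

```python
import numpy as np

# Toy instantiation: small latent space Z and output space Yl, so both
# the posterior term and the joint term are exact sums.
Z, Yl = [0, 1], [0, 1, 2]
d = 3

def phi(x, z, y):
    f = np.zeros(len(Z) * len(Yl) * d)
    k = (z * len(Yl) + y) * d
    f[k:k + d] = x
    return f

def grad_marginal_log_lik(theta, x, y):
    feats = {(z, yp): phi(x, z, yp) for z in Z for yp in Yl}
    logits = {k: f @ theta for k, f in feats.items()}
    m = max(logits.values())
    p = {k: np.exp(v - m) for k, v in logits.items()}
    tot = sum(p.values())
    p = {k: v / tot for k, v in p.items()}            # joint p_theta(z, y' | x)
    py = sum(p[(z, y)] for z in Z)                    # marginal of observed y
    pos = {z: p[(z, y)] / py for z in Z}              # posterior p_theta(z | x, y)
    term1 = sum(pos[z] * feats[(z, y)] for z in Z)    # inference on z
    term2 = sum(p[k] * feats[k] for k in feats)       # inference on z, y
    return term1 - term2

rng = np.random.default_rng(0)
x = rng.normal(size=d)
g = grad_marginal_log_lik(np.zeros(len(Z) * len(Yl) * d), x, y=1)
```

The key difference from the supervised case: the first term is itself an expectation (over the latent z), so errors in approximating it feed directly into the parameter update.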
Formal Setting
This Work
Two thrusts:
1. How can we reify computation as part of a statistical model?
2. How can we relax the supervision signal to aid computation while still maintaining consistent parameter estimates?
Formal Setting
Related Work
Learning tractable models / accounting for approximations
sum-product networks (Poon & Domingos, 2011)
max-violation perceptron (Huang, Fayong, & Guo, 2012; Zhang et al., 2013; Yu et al., 2013)
fast-mixing Markov chains (S. & Liang, 2015)
many others (Barbu, 2009; Daumé III, Langford, & Marcu, 2009; Domke, 2011; Stoyanov, Ropson, & Eisner, 2011; Niepert & Domingos, 2014; Li & Zemel, 2014; Shi, S., & Liang, 2015)
Improving expressivity of variational inference
combining with MCMC (Salimans, Kingma, & Welling, 2015)
using neural networks (Kingma & Welling, 2013; Mnih & Gregor, 2014)
Computational-statistical tradeoffs
huge body of recent work (Berthet & Rigollet, 2013; Chandrasekaran & Jordan, 2013; Zhang et al., 2013; Zhang, Wainwright, & Jordan, 2014; Christiano, 2014; Daniely, Linial, & Shalev-Shwartz, 2014; Garg, Ma, & Nguyen, 2014; Shamir, 2014; Braverman et al., 2015; S. & Duchi, 2015; S., Valiant, & Wager, 2015)
Reified Context Models
1 Motivation
2 Formal Setting
3 Reified Context Models
4 Relaxed Supervision
5 Open Questions
Reified Context Models
Structured Prediction Task
input x :
output y : v o l c a n i c
Reified Context Models
Contexts Are Key
Partial outputs for "v o l c a":
DP: v, *o, **l, ***c
beam search: v, vo, vol, volc
Key idea: contexts!
*o def= {ao, bo, co, . . .}
Reified Context Models
Desiderata
r   *o   **l   ***c
v   *a   **i   ***r
coverage (short contexts)
better uncertainty estimates (precision)
stabler partially supervised learning updates

r   ro   rol   rolc
v   ra   ral   ralc
expressivity (long contexts)
capture complex dependencies

r   ro   rol   *olc
v   ra   ral   ***c   ← best of both worlds
y   *o   *ol   ***r
*   **   ***   ****
Reified Context Models
Reifying Contexts
input x :
output y : v o l c a n i c
context c: v  *o  *ol  *olc  · · ·

r    ro   rol  *olc
v    ra   ral  ***c   ← "context sets"
y    *o   *ol  ***r
*    **   ***  ****
C1   C2   C3   C4

Challenge: how to trade off contexts of different lengths?
=⇒ Reify contexts as part of the model!
Reified Context Models
Reified Context Models
Given:
context sets C1, . . . , CL
features φ_i(c_{i−1}, y_i)
Define the model
p_θ(y_{1:L}, c_{1:L−1}) ∝ exp( Σ_{i=1}^{L} θ⊤ φ_i(c_{i−1}, y_i) ) · κ(y, c)
where κ(y, c) enforces consistency between y and c.
Graphical model structure: a chain over outputs Y1, . . . , Y5 with contexts C1, . . . , C4 below, linked by consistency factors κ and feature factors φ2, . . . , φ5; inference via forward-backward!
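In the simplest case, where each context is just the previous output symbol, the chain reduces to a Markov model and the normalizer falls out of a forward pass. A runnable sketch with made-up scores (not the talk's implementation), checked against brute-force enumeration:

```python
import itertools
import numpy as np

# Minimal sketch: contexts = previous character, so forward dynamic
# programming over contexts computes log Z exactly.
ALPHABET = "abc"
L = 4
rng = np.random.default_rng(0)
score = rng.normal(size=(len(ALPHABET) + 1, len(ALPHABET)))  # row 0: start context

def idx(c):
    return 0 if c is None else 1 + ALPHABET.index(c)

def forward_log_Z():
    alpha = {None: 0.0}                      # log-domain mass per context
    for _ in range(L):
        new = {}
        for c, a in alpha.items():
            for j, y in enumerate(ALPHABET):
                new[y] = np.logaddexp(new.get(y, -np.inf), a + score[idx(c), j])
        alpha = new
    return np.logaddexp.reduce(np.array(list(alpha.values())))

def brute_log_Z():
    total = []
    for ys in itertools.product(range(len(ALPHABET)), repeat=L):
        s, c = 0.0, None
        for j in ys:
            s += score[idx(c), j]
            c = ALPHABET[j]
        total.append(s)
    return np.logaddexp.reduce(np.array(total))
```

The RCM generalizes this: the state space per position is the (adaptively chosen) context set Ci rather than the full alphabet, and the κ factor zeroes out inconsistent (y, c) pairs.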
Reified Context Models
Adaptive Context Selection
Select context sets Ci during the forward pass of inference.
Greedily select the contexts with the largest mass.
Example: from candidates {a, b, c, d, e, . . .}, keep the high-mass contexts c and e and collapse the rest into a wildcard, giving C1 = {c, e, ?}; extending these yields candidates {ca, cb, . . . , ea, eb, . . . , ?a, . . .}, from which keeping ca and ?a gives C2 = {ca, ?a, ??}; etc.
Biases towards short contexts unless there is high confidence.
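One greedy step can be sketched in a few lines. This is a simplification: in the slides the collapsed contexts (like "?a") still share a suffix, whereas here all leftover mass goes to a single catch-all "?":

```python
# Simplified sketch of one greedy selection step: keep the K highest-mass
# contexts from the forward pass and collapse the remainder into a wildcard.
def select_contexts(mass, K):
    # mass: dict mapping context -> probability mass from the forward pass
    top = sorted(mass, key=mass.get, reverse=True)[:K]
    kept = {c: mass[c] for c in top}
    leftover = sum(v for c, v in mass.items() if c not in kept)
    if leftover > 0:
        kept["?"] = leftover                 # wildcard absorbs the rest
    return kept

C1 = select_contexts({"a": 0.05, "b": 0.05, "c": 0.5, "d": 0.1, "e": 0.3}, K=2)
```

Because low-mass long contexts are merged rather than discarded, no probability mass is lost; the model merely becomes coarser where it is uncertain.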
Reified Context Models
Precision
input x :
output y : v o l c a n i c
Model assigns a probability to each prediction, so we can predict on the most confident subset.
Measure precision (# of correct words) vs. recall (# of words predicted).
Comparison: beam search.
[Figure: precision vs. recall curves for word recognition, comparing beam search and RCM; precision axis from 0.86 to 1.00]
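A precision-recall curve of this kind is traced by emitting predictions in order of decreasing model confidence; a small illustrative helper (the data below is made up):

```python
# Precision at each recall level when predictions are emitted in order of
# decreasing model confidence.
def precision_recall(confidence, correct):
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    curve, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += correct[i]                    # correct[i] is 0 or 1
        curve.append((k / len(order), hits / k))  # (recall, precision)
    return curve

curve = precision_recall([0.9, 0.4, 0.8, 0.6], [1, 0, 1, 1])
```

A well-calibrated model puts its mistakes late in this ordering, which is exactly what the slide's precision axis measures.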
Reified Context Models
Partially Supervised Learning
Decipherment task:
cipher am 7→ 5, I 7→ 13, what 7→ 54, . . .
latent z I am what I amoutput y 13 5 54 13 5
Goal: determine cipher
Fit 2nd-order HMM with EM, using RCMs for approximate E-step.use learned emissions to determine cipher.again compare to beam search (Nuhn et al., 2013)
Fraction of correctly mapped words:
[Figure: mapping accuracy (0.0–0.8) vs. training passes (0–20) on Decipherment, comparing RCM and beam search.]
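After EM, the cipher is read off from the learned emission table by mapping each plaintext word to its most probable code, and mapping accuracy compares that against the true cipher. A minimal sketch with made-up emission probabilities (the actual model is a 2nd-order HMM fit with RCM-approximated E-steps):

```python
# Toy learned emission probabilities p(code | word) after EM (invented values).
emissions = {
    "I":    {5: 0.1, 13: 0.8, 54: 0.1},
    "am":   {5: 0.7, 13: 0.2, 54: 0.1},
    "what": {5: 0.2, 13: 0.1, 54: 0.7},
}
true_cipher = {"I": 13, "am": 5, "what": 54}

# Each word maps to its highest-probability code.
learned = {w: max(p, key=p.get) for w, p in emissions.items()}
accuracy = sum(learned[w] == c for w, c in true_cipher.items()) / len(true_cipher)
```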
Reified Context Models
Contexts During Training
Context lengths increase smoothly during training:
[Figure: average context length (≈1.5 to ≈4.5) vs. number of passes (0–20) on Decipherment, annotated with example contexts ****** → ***ing → idding.]
Start of training: little information, short contexts.
End of training: lots of information, long contexts.
Reified Context Models
Discussion
RCMs provide both expressivity and coverage, which enable:
More accurate uncertainty estimates (precision)
Better partially supervised learning updates
Reproducible experiments on CodaLab: codalab.org/worksheets
Relaxed Supervision
1 Motivation
2 Formal Setting
3 Reified Context Models
4 Relaxed Supervision
5 Open Questions
Relaxed Supervision
Intractable Supervision
Sometimes, even supervision is intractable:
input x : What is the largest city in California?
latent z: argmax(λx. CITY(x) ∧ LOC(x, CA), λx. POPULATION(x))
output y : Los Angeles
Intractable no matter how simple the model is!
…but there are likely statistical relationships (e.g. between CITY and Los Angeles)
Need a way to relax the likelihood.
while maintaining good statistical properties (asymptotic consistency)
Relaxed Supervision
Approach
[Figure: schematic of the (θ, β) parameter space, partitioned into tractable and intractable regions.]
Start with intractable likelihood q(y | z), model family pθ (z | x).
Replace q(y | z) with family of likelihoods qβ (y | z) (some very easy).
Derive constraints on (θ ,β ) that ensure tractability.
Learn within the tractable region.
Relaxed Supervision
Relaxed Supervision: Example
input x : Company officials refused to comment.
latent z:
output y : 公司官员拒绝对此发表评论。
Idea: instead of requiring ỹ to match the observed output y, penalize based on some weighted distance distβ(ỹ, y).
ℓ(θ, β; x, y) = −log ( ∑_{z, ỹ} pθ(z, ỹ | x) exp(−distβ(ỹ, y)) )
As β → ∞, recover original objective.
…but optimizing will send β → 0!
Two questions:
How to create natural pressure to increase β?
How to define distances for general problems?
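For a tiny discrete model, the relaxed objective can be computed by brute force; here `dist_beta` is a hypothetical Hamming-style distance with weight β per mismatched position, and the sums over z and ỹ are made explicit (a sketch, not the paper's model):

```python
import math

def relaxed_nll(p_zy_given_x, dist_beta, y_obs):
    """-log sum_{z, y~} p(z, y~ | x) * exp(-dist_beta(y~, y_obs))."""
    total = sum(p * math.exp(-dist_beta(y_tilde, y_obs))
                for (z, y_tilde), p in p_zy_given_x.items())
    return -math.log(total)

def make_dist(beta):
    # Hamming distance with weight beta per mismatched position.
    return lambda a, b: beta * sum(u != v for u, v in zip(a, b))

# Toy joint over (z, y~): two latent derivations, two candidate outputs.
p = {("z1", "ab"): 0.6, ("z2", "ac"): 0.4}
loss_soft = relaxed_nll(p, make_dist(0.0), "ab")   # beta = 0: every y~ accepted
loss_hard = relaxed_nll(p, make_dist(50.0), "ab")  # large beta: ~exact match only
```

At β = 0 the loss ignores the supervision entirely; at large β it approaches the exact (intractable in general) objective −log p("ab" | x).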
Relaxed Supervision
Relaxed Supervision: Formal Framework
Assume (WLOG) that z → y is deterministic: y = f(z).
Let S(z, y) ∈ {0, 1} encode the constraint [f(z) = y].
Take projections πj : Y → Yj, j = 1, …, k.
Let Sj(z, y) = [πj(f(z)) = πj(y)] be the projected constraint.
Define distance function:
distβ(z, y) = ∑_{j=1}^{k} βj · (1 − Sj(z, y)).
Note: can featurize distβ as −β⊤ψ(z, y), where ψj = Sj − 1.
Lemma
Suppose that π1 × ··· × πk is injective. Then S(z, y) = ∧_{j=1}^{k} Sj(z, y).
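The distance and its featurization can be written down directly; the `f` and projections below are hypothetical stand-ins (here y = f(z) reverses a string and πj picks out position j):

```python
def dist_beta(z, y, f, projections, beta):
    """dist_beta(z, y) = sum_j beta_j * (1 - S_j(z, y)), where
    S_j(z, y) = [pi_j(f(z)) == pi_j(y)].  Equivalently -beta . psi
    with psi_j = S_j - 1 <= 0."""
    S = [int(pi(f(z)) == pi(y)) for pi in projections]
    return sum(b * (1 - s) for b, s in zip(beta, S))

f = lambda z: z[::-1]                       # deterministic z -> y
projections = [lambda y: y[0], lambda y: y[1], lambda y: y[2]]
beta = [1.0, 2.0, 3.0]

d_ok   = dist_beta("abc", "cba", f, projections, beta)  # all S_j hold
d_miss = dist_beta("abc", "cbz", f, projections, beta)  # third projection violated
```

Since π1 × π2 × π3 is injective here (together they recover the whole string), dist_beta is zero exactly when y = f(z), matching the lemma.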
Relaxed Supervision
Example: Unordered Supervision
input x : a b a a
latent z: d c d d
output y : {c : 1, d : 3}
Let count(·, j) count the number of occurrences of character j.
Decomposition:
[y = multiset(z)] ⇐⇒ ∧_{j=1}^{V} [count(z, j) = count(y, j)]
where the left side is S(z, y) with f(z) = multiset(z), and the j-th conjunct is Sj(z, y) with πj(y) = count(y, j).
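The decomposition is easy to check by brute force: multiset equality holds exactly when every per-character count matches. A small sketch over a toy vocabulary:

```python
from collections import Counter

VOCAB = "abcd"

def S(z, y):
    """Exact constraint: y = multiset(z)."""
    return Counter(z) == y

def S_j(z, y, j):
    """Projected constraint: count(z, j) = count(y, j)."""
    return z.count(j) == y[j]

z, y = "dcdd", Counter({"d": 3, "c": 1})
ok  = S(z, y) == all(S_j(z, y, j) for j in VOCAB)            # both True
bad = S("dcdc", y) == all(S_j("dcdc", y, j) for j in VOCAB)  # both False
```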
Relaxed Supervision
Example: Conjunctive Semantic Parsing
Side information: predicates {Q1, …, Qm}.
e.g. Q6 = [DOG] = set of all dogs
input x : brown dog (input utterance)
latent z: (Q11, Q6) (set of all brown objects, set of all dogs)
output y : Q11 ∩ Q6 (denotation, observed as a set)
For z = (Q_{j1}, …, Q_{jL}), define the denotation ⟦z⟧ = Q_{j1} ∩ ··· ∩ Q_{jL}.
Decomposition:
[y = ⟦z⟧] ⇐⇒ ∧_{j=1}^{m} [ I[⟦z⟧ ⊆ Qj] = I[y ⊆ Qj] ]
where the left side is S(z, y) and the j-th conjunct is Sj(z, y), with πj(y) = I[y ⊆ Qj].
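With finite sets as denotations, the decomposition can be verified exhaustively; the predicate contents below are invented for illustration:

```python
from functools import reduce

# Hypothetical predicates over a small universe of objects.
Q = {
    6:  {"rex", "fido", "spot"},         # [DOG]
    11: {"rex", "fido", "mud", "bear"},  # [BROWN]
    3:  {"mud", "lake"},                 # [WET]
}

def denote(z):
    """[[z]] = Q_{j1} ∩ ... ∩ Q_{jL} for z = (j1, ..., jL)."""
    return reduce(set.intersection, (Q[j] for j in z))

def S(z, y):
    return denote(z) == y                      # exact constraint

def S_j(z, y, j):
    return (denote(z) <= Q[j]) == (y <= Q[j])  # projected constraint

z, y = (11, 6), {"rex", "fido"}                # "brown dog" and its denotation
holds = S(z, y) == all(S_j(z, y, j) for j in Q)
```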
Relaxed Supervision
Normalization Constant
Create pressure to increase β by adding normalization constant:
qβ(y | z) = exp(β⊤ψ(z, y) − A(β))    (recall β⊤ψ(z, y) = −distβ(z, y))
ℓ(θ, β; x, y) = −log ( ∑_z pθ(z | x) qβ(y | z) ).
Lemma
Given π1, …, πk, let A(β) := ∑_{j=1}^{k} log(1 + (|Yj| − 1) exp(−βj)). Then log ∑_y exp(−distβ(z, y)) ≤ A(β) for all z.
Lemma
Jointly minimizing L(θ, β) = E[ℓ(θ, β; x, y)] yields a consistent estimate of the true parameters θ*.
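The first lemma can be checked numerically when the output space factorizes as Y = Y1 × Y2 and distβ counts weighted coordinate mismatches (a toy setting in which the bound happens to be tight):

```python
import math
from itertools import product

def A(beta, sizes):
    """A(beta) = sum_j log(1 + (|Y_j| - 1) * exp(-beta_j))."""
    return sum(math.log(1 + (n - 1) * math.exp(-b))
               for b, n in zip(beta, sizes))

Y1, Y2 = ["a", "b", "c"], [0, 1]
beta, y = [0.5, 1.5], ("a", 0)

# Brute-force sum of exp(-dist_beta(y~, y)) over all y~ in Y1 x Y2.
total = sum(math.exp(-(beta[0] * (y1 != y[0]) + beta[1] * (y2 != y[1])))
            for y1, y2 in product(Y1, Y2))
bound_holds = math.log(total) <= A(beta, [len(Y1), len(Y2)]) + 1e-9
```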
Relaxed Supervision
Constraints for Efficient Inference
Inference task:
∇θ log pθ(y | x) = E_{z ∼ pθ(· | x, y)}[φ(x, z, y)]  (sample z given x, y)
− E_{z, ỹ ∼ pθ(· | x)}[φ(x, z, ỹ)]  (sample z given x).
pθ,β(z | x, y) ∝ pθ(z | x) qβ(y | z) ∝ pθ(z | x) exp(β⊤ψ(z, y)).
Rejection sampler:
sample z from pθ(z | x)
accept with probability exp(β⊤ψ(z, y)).
Bound expected number of samples:
∑_{x, y ∈ Data} ( ∑_z pθ(z | x) exp(β⊤ψ(z, y)) )^{−1} ≤ τ. (1)
Ratio of normalization constants: can optimize subject to (1) (similar to CCCP).
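The rejection sampler itself is a few lines: propose z from pθ(z | x) and accept with probability exp(β⊤ψ(z, y)), which is at most 1 because ψj ≤ 0. The toy distribution and constraint features below are made up; the returned try count is the cost that constraint (1) bounds in expectation:

```python
import math, random

def rejection_sample(p_z, psi, beta, rng, max_tries=100_000):
    """Draw z ~ p(z | x, y), proportional to p(z | x) * exp(beta . psi(z))."""
    zs, probs = zip(*p_z.items())
    for tries in range(1, max_tries + 1):
        z = rng.choices(zs, weights=probs)[0]   # propose z ~ p(z | x)
        accept = math.exp(sum(b * p for b, p in zip(beta, psi(z))))
        if rng.random() < accept:
            return z, tries                     # accepted sample + cost
    raise RuntimeError("acceptance rate too low")

# Toy posterior: z1 satisfies both constraints, z2 violates the second.
p_z = {"z1": 0.3, "z2": 0.7}
psi = lambda z: [0, 0] if z == "z1" else [0, -1]
rng = random.Random(0)
samples = [rejection_sample(p_z, psi, [1.0, 2.0], rng)[0] for _ in range(500)]
```

Reweighting shifts mass toward z1 (posterior ≈ 0.76 vs. prior 0.3); raising β lowers the acceptance rate, which is exactly the cost constraint (1) caps.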
Relaxed Supervision
Experiments
Conjunctive semantic parsing:
[Figure: accuracy vs. iteration (left) and number of samples per iteration on a log scale, 10^0–10^5 (right), comparing AdaptBeta(500) with FixedBeta(0.5), FixedBeta(0.2), and FixedBeta(0.1).]
Open Questions
1 Motivation
2 Formal Setting
3 Reified Context Models
4 Relaxed Supervision
5 Open Questions
Open Questions
Scale up to larger tasks: semantic parsing, reinforcement learning, program induction
Extend to Bayesian models
Understand non-convex optimization
Metacomputation using Reified Context Models?
Probabilistic abstract interpretation
Statistics & Computation: still a long way to go
Thanks! 谢谢