Learning with Intractable Inference and Partial Supervision
Jacob Steinhardt
Stanford University
September 8, 2015
J. Steinhardt (Stanford) Learning and Inference September 8, 2015 1 / 31
Motivation
An Example
Company officials refused to comment. → 公司官员拒绝对此发表评论。
He said the company would appeal. → 他表示该公司将提出上诉。
Statistical reasoning: aggregate data across sentences to reach conclusions.
Computational reasoning: focus on easily disambiguated words first.
Tension: statistics wants to expose information (aggregation), while computer science wants to hide it (abstraction, adaptivity).
Statistical inference is computationally intractable.
How can we bring these two paradigms together?
Formal Setting
1 Motivation
2 Formal Setting
3 Reified Context Models
4 Relaxed Supervision
5 Open Questions
Formal Setting
Setting: Structured Prediction
input x :
output y : v o l c a n i c
Goal: learn θ to maximize E_{x,y∼D}[log p_θ(y | x)]
Structured output space Y — requires inference
Formal Setting
Supervised Learning is Easy
Recall: want to maximize E[log p_θ(y | x)].
Suppose p_θ(y | x) ∝ exp(θ⊤ φ(x, y)). Then:
∇_θ log p_θ(y | x) = φ(x, y) [given] − E_{y′∼p_θ(·|x)}[φ(x, y′)] [inference].
Inference errors will be corrected by the supervision signal φ(x, y) over the course of learning.
In practice, anything reasonable (MCMC, beam search) works.
Conceptually, can use Searn (Daumé III et al., 2009) or pseudolikelihood (Besag, 1975) to obviate the need for inference.
Approximate inference is easy in supervised settings.
Unless we care about estimating uncertainty (calibration, precision/recall).
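For a small output space, the gradient above can be computed exactly. A minimal sketch (the label set and block-structured feature map are illustrative, not from the talk):

```python
import numpy as np

# Toy instantiation: Y is a small label set and phi(x, y) places the input
# vector x in the feature block belonging to label y.
Y = [0, 1, 2]
d = 4

def phi(x, y):
    f = np.zeros(len(Y) * d)
    f[y * d:(y + 1) * d] = x
    return f

def grad_log_lik(theta, x, y):
    # phi(x, y) is given; the expectation term is the "inference" part,
    # done here by exact enumeration over Y.
    feats = np.array([phi(x, yp) for yp in Y])
    logits = feats @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return phi(x, y) - p @ feats

rng = np.random.default_rng(0)
x = rng.normal(size=d)
g = grad_log_lik(np.zeros(len(Y) * d), x, y=1)
```

At θ = 0 the model distribution is uniform over Y, so the gradient is φ(x, 1) minus the average feature vector.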
Formal Setting
Partially Supervised Structured Prediction
input x : Company officials refused to comment.
latent z :
output y : 公司官员拒绝对此发表评论。
Goal: learn θ to maximize E_{x,y∼D}[log p_θ(y | x)]
where p_θ(y | x) = Σ_z p_θ(y, z | x)
Again assume p_θ(y, z | x) ∝ exp(θ⊤ φ(x, z, y)). Then
∇_θ log p_θ(y | x) = E_{z∼p_θ(·|x,y)}[φ(x, z, y)] [inference on z] − E_{z′,y′∼p_θ(·|x)}[φ(x, z′, y′)] [inference on z, y].
Inference errors on z get reinforced during learning.
Inference is often hardest (and most consequential) at the beginning of learning!
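With a small latent space both expectation terms above can again be enumerated exactly; a toy sketch (latent/output spaces and feature map are illustrative):

```python
import numpy as np

# Toy instantiation: small latent space Z and output space Yl, so both
# the posterior term and the joint term are exact sums.
Z, Yl = [0, 1], [0, 1, 2]
d = 3

def phi(x, z, y):
    f = np.zeros(len(Z) * len(Yl) * d)
    k = (z * len(Yl) + y) * d
    f[k:k + d] = x
    return f

def grad_marginal_log_lik(theta, x, y):
    feats = {(z, yp): phi(x, z, yp) for z in Z for yp in Yl}
    logits = {k: f @ theta for k, f in feats.items()}
    m = max(logits.values())
    p = {k: np.exp(v - m) for k, v in logits.items()}
    tot = sum(p.values())
    p = {k: v / tot for k, v in p.items()}            # joint p_theta(z, y' | x)
    py = sum(p[(z, y)] for z in Z)                    # marginal of observed y
    pos = {z: p[(z, y)] / py for z in Z}              # posterior p_theta(z | x, y)
    term1 = sum(pos[z] * feats[(z, y)] for z in Z)    # inference on z
    term2 = sum(p[k] * feats[k] for k in feats)       # inference on z, y
    return term1 - term2

rng = np.random.default_rng(0)
x = rng.normal(size=d)
g = grad_marginal_log_lik(np.zeros(len(Z) * len(Yl) * d), x, y=1)
```

The key difference from the supervised case: the first term is itself an expectation (over the latent z), so errors in approximating it feed directly into the parameter update.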
Formal Setting
This Work
Two thrusts:
1. How can we reify computation as part of a statistical model?
2. How can we relax the supervision signal to aid computation while still maintaining consistent parameter estimates?
Formal Setting
Related Work
Learning tractable models / accounting for approximations
sum-product networks (Poon & Domingos, 2011)
max-violation perceptron (Huang, Fayong, & Guo, 2012; Zhang et al., 2013; Yu et al., 2013)
fast-mixing Markov chains (S. & Liang, 2015)
many others (Barbu, 2009; Daumé III, Langford, & Marcu, 2009; Domke, 2011; Stoyanov, Ropson, & Eisner, 2011; Niepert & Domingos, 2014; Li & Zemel, 2014; Shi, S., & Liang, 2015)
Improving expressivity of variational inference
combining with MCMC (Salimans, Kingma, & Welling, 2015)
using neural networks (Kingma & Welling, 2013; Mnih & Gregor, 2014)
Computational-statistical tradeoffs
huge body of recent work (Berthet & Rigollet, 2013; Chandrasekaran & Jordan, 2013; Zhang et al., 2013; Zhang, Wainwright, & Jordan, 2014; Christiano, 2014; Daniely, Linial, & Shalev-Shwartz, 2014; Garg, Ma, & Nguyen, 2014; Shamir, 2014; Braverman et al., 2015; S. & Duchi, 2015; S., Valiant, & Wager, 2015)
Reified Context Models
1 Motivation
2 Formal Setting
3 Reified Context Models
4 Relaxed Supervision
5 Open Questions
Reified Context Models
Structured Prediction Task
input x :
output y : v o l c a n i c
Reified Context Models
Contexts Are Key
Partial outputs for "v o l c a":
DP: v, *o, **l, ***c
beam search: v, vo, vol, volc
Key idea: contexts!
*o def= {ao, bo, co, . . .}
Reified Context Models
Desiderata
r   *o   **l   ***c
v   *a   **i   ***r
coverage (short contexts)
better uncertainty estimates (precision)
stabler partially supervised learning updates

r   ro   rol   rolc
v   ra   ral   ralc
expressivity (long contexts)
capture complex dependencies

r   ro   rol   *olc
v   ra   ral   ***c   ← best of both worlds
y   *o   *ol   ***r
*   **   ***   ****
Reified Context Models
Reifying Contexts
input x :
output y : v o l c a n i c
context c: v  *o  *ol  *olc  · · ·

r    ro   rol  *olc
v    ra   ral  ***c   ← "context sets"
y    *o   *ol  ***r
*    **   ***  ****
C1   C2   C3   C4

Challenge: how to trade off contexts of different lengths?
=⇒ Reify contexts as part of the model!
Reified Context Models
Reified Context Models
Given:
context sets C1, . . . , CL
features φ_i(c_{i−1}, y_i)
Define the model
p_θ(y_{1:L}, c_{1:L−1}) ∝ exp( Σ_{i=1}^{L} θ⊤ φ_i(c_{i−1}, y_i) ) · κ(y, c)
where κ(y, c) enforces consistency between y and c.
Graphical model structure: a chain over outputs Y1, . . . , Y5 with contexts C1, . . . , C4 below, linked by consistency factors κ and feature factors φ2, . . . , φ5; inference via forward-backward!
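In the simplest case, where each context is just the previous output symbol, the chain reduces to a Markov model and the normalizer falls out of a forward pass. A runnable sketch with made-up scores (not the talk's implementation), checked against brute-force enumeration:

```python
import itertools
import numpy as np

# Minimal sketch: contexts = previous character, so forward dynamic
# programming over contexts computes log Z exactly.
ALPHABET = "abc"
L = 4
rng = np.random.default_rng(0)
score = rng.normal(size=(len(ALPHABET) + 1, len(ALPHABET)))  # row 0: start context

def idx(c):
    return 0 if c is None else 1 + ALPHABET.index(c)

def forward_log_Z():
    alpha = {None: 0.0}                      # log-domain mass per context
    for _ in range(L):
        new = {}
        for c, a in alpha.items():
            for j, y in enumerate(ALPHABET):
                new[y] = np.logaddexp(new.get(y, -np.inf), a + score[idx(c), j])
        alpha = new
    return np.logaddexp.reduce(np.array(list(alpha.values())))

def brute_log_Z():
    total = []
    for ys in itertools.product(range(len(ALPHABET)), repeat=L):
        s, c = 0.0, None
        for j in ys:
            s += score[idx(c), j]
            c = ALPHABET[j]
        total.append(s)
    return np.logaddexp.reduce(np.array(total))
```

The RCM generalizes this: the state space per position is the (adaptively chosen) context set Ci rather than the full alphabet, and the κ factor zeroes out inconsistent (y, c) pairs.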
Reified Context Models
Adaptive Context Selection
Select context sets Ci during the forward pass of inference.
Greedily select the contexts with the largest mass.
Example: from candidates {a, b, c, d, e, . . .}, keep the high-mass contexts c and e and collapse the rest into a wildcard, giving C1 = {c, e, ?}; extending these yields candidates {ca, cb, . . . , ea, eb, . . . , ?a, . . .}, from which keeping ca and ?a gives C2 = {ca, ?a, ??}; etc.
Biases towards short contexts unless there is high confidence.
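One greedy step can be sketched in a few lines. This is a simplification: in the slides the collapsed contexts (like "?a") still share a suffix, whereas here all leftover mass goes to a single catch-all "?":

```python
# Simplified sketch of one greedy selection step: keep the K highest-mass
# contexts from the forward pass and collapse the remainder into a wildcard.
def select_contexts(mass, K):
    # mass: dict mapping context -> probability mass from the forward pass
    top = sorted(mass, key=mass.get, reverse=True)[:K]
    kept = {c: mass[c] for c in top}
    leftover = sum(v for c, v in mass.items() if c not in kept)
    if leftover > 0:
        kept["?"] = leftover                 # wildcard absorbs the rest
    return kept

C1 = select_contexts({"a": 0.05, "b": 0.05, "c": 0.5, "d": 0.1, "e": 0.3}, K=2)
```

Because low-mass long contexts are merged rather than discarded, no probability mass is lost; the model merely becomes coarser where it is uncertain.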
Reified Context Models
Precision
input x :
output y : v o l c a n i c
Model assigns a probability to each prediction, so we can predict on the most confident subset.
Measure precision (# of correct words) vs. recall (# of words predicted).
Comparison: beam search.
[Figure: precision vs. recall curves for word recognition, comparing beam search and RCM; precision axis from 0.86 to 1.00]
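A precision-recall curve of this kind is traced by emitting predictions in order of decreasing model confidence; a small illustrative helper (the data below is made up):

```python
# Precision at each recall level when predictions are emitted in order of
# decreasing model confidence.
def precision_recall(confidence, correct):
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    curve, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += correct[i]                    # correct[i] is 0 or 1
        curve.append((k / len(order), hits / k))  # (recall, precision)
    return curve

curve = precision_recall([0.9, 0.4, 0.8, 0.6], [1, 0, 1, 1])
```

A well-calibrated model puts its mistakes late in this ordering, which is exactly what the slide's precision axis measures.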
Reified Context Models
Partially Supervised Learning
Decipherment task:
cipher am 7→ 5, I 7→ 13, what 7→ 54, . . .
latent z I am what I amoutput y 13 5 54 13 5
Goal: determine cipher
Fit 2nd-order HMM with EM, using RCMs for approximate E-step.use learned emissions to determine cipher.again compare to beam search (Nuhn et al., 2013)
Fraction of correctly mapped words:
[Figure: mapping accuracy (0.0–0.8) vs. training passes (0–20) on Decipherment, comparing RCM and beam search.]
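After EM, the cipher is read off from the learned emission table by mapping each plaintext word to its most probable code, and mapping accuracy compares that against the true cipher. A minimal sketch with made-up emission probabilities (the actual model is a 2nd-order HMM fit with RCM-approximated E-steps):

```python
# Toy learned emission probabilities p(code | word) after EM (invented values).
emissions = {
    "I":    {5: 0.1, 13: 0.8, 54: 0.1},
    "am":   {5: 0.7, 13: 0.2, 54: 0.1},
    "what": {5: 0.2, 13: 0.1, 54: 0.7},
}
true_cipher = {"I": 13, "am": 5, "what": 54}

# Each word maps to its highest-probability code.
learned = {w: max(p, key=p.get) for w, p in emissions.items()}
accuracy = sum(learned[w] == c for w, c in true_cipher.items()) / len(true_cipher)
```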
Reified Context Models
Contexts During Training
Context lengths increase smoothly during training:
[Figure: average context length (≈1.5 to ≈4.5) vs. number of passes (0–20) on Decipherment, annotated with example contexts ****** → ***ing → idding.]
Start of training: little information, short contexts.
End of training: lots of information, long contexts.
Reified Context Models
Discussion
RCMs provide both expressivity and coverage, which enable:
More accurate uncertainty estimates (precision)
Better partially supervised learning updates
Reproducible experiments on CodaLab: codalab.org/worksheets
Relaxed Supervision
1 Motivation
2 Formal Setting
3 Reified Context Models
4 Relaxed Supervision
5 Open Questions
Relaxed Supervision
Intractable Supervision
Sometimes, even supervision is intractable:
input x : What is the largest city in California?
latent z: argmax(λx. CITY(x) ∧ LOC(x, CA), λx. POPULATION(x))
output y : Los Angeles
Intractable no matter how simple the model is!
…but there are likely statistical relationships (e.g. between CITY and Los Angeles)
Need a way to relax the likelihood.
while maintaining good statistical properties (asymptotic consistency)
Relaxed Supervision
Approach
[Figure: schematic of the (θ, β) parameter space, partitioned into tractable and intractable regions.]
Start with intractable likelihood q(y | z), model family pθ (z | x).
Replace q(y | z) with family of likelihoods qβ (y | z) (some very easy).
Derive constraints on (θ ,β ) that ensure tractability.
Learn within the tractable region.
Relaxed Supervision
Relaxed Supervision: Example
input x : Company officials refused to comment.
latent z:
output y : 公司官员拒绝对此发表评论。
Idea: instead of requiring ỹ to match the observed output y, penalize based on some weighted distance distβ(ỹ, y).
ℓ(θ, β; x, y) = −log ( ∑_{z, ỹ} pθ(z, ỹ | x) exp(−distβ(ỹ, y)) )
As β → ∞, recover original objective.
…but optimizing will send β → 0!
Two questions:
How to create natural pressure to increase β?
How to define distances for general problems?
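For a tiny discrete model, the relaxed objective can be computed by brute force; here `dist_beta` is a hypothetical Hamming-style distance with weight β per mismatched position, and the sums over z and ỹ are made explicit (a sketch, not the paper's model):

```python
import math

def relaxed_nll(p_zy_given_x, dist_beta, y_obs):
    """-log sum_{z, y~} p(z, y~ | x) * exp(-dist_beta(y~, y_obs))."""
    total = sum(p * math.exp(-dist_beta(y_tilde, y_obs))
                for (z, y_tilde), p in p_zy_given_x.items())
    return -math.log(total)

def make_dist(beta):
    # Hamming distance with weight beta per mismatched position.
    return lambda a, b: beta * sum(u != v for u, v in zip(a, b))

# Toy joint over (z, y~): two latent derivations, two candidate outputs.
p = {("z1", "ab"): 0.6, ("z2", "ac"): 0.4}
loss_soft = relaxed_nll(p, make_dist(0.0), "ab")   # beta = 0: every y~ accepted
loss_hard = relaxed_nll(p, make_dist(50.0), "ab")  # large beta: ~exact match only
```

At β = 0 the loss ignores the supervision entirely; at large β it approaches the exact (intractable in general) objective −log p("ab" | x).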
Relaxed Supervision
Relaxed Supervision: Formal Framework
Assume (WLOG) that z → y is deterministic: y = f(z).
Let S(z, y) ∈ {0, 1} encode the constraint [f(z) = y].
Take projections πj : Y → Yj, j = 1, …, k.
Let Sj(z, y) = [πj(f(z)) = πj(y)] be the projected constraint.
Define distance function:
distβ(z, y) = ∑_{j=1}^{k} βj · (1 − Sj(z, y)).
Note: can featurize distβ as −β⊤ψ(z, y), where ψj = Sj − 1.
Lemma
Suppose that π1 × ··· × πk is injective. Then S(z, y) = ∧_{j=1}^{k} Sj(z, y).
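The distance and its featurization can be written down directly; the `f` and projections below are hypothetical stand-ins (here y = f(z) reverses a string and πj picks out position j):

```python
def dist_beta(z, y, f, projections, beta):
    """dist_beta(z, y) = sum_j beta_j * (1 - S_j(z, y)), where
    S_j(z, y) = [pi_j(f(z)) == pi_j(y)].  Equivalently -beta . psi
    with psi_j = S_j - 1 <= 0."""
    S = [int(pi(f(z)) == pi(y)) for pi in projections]
    return sum(b * (1 - s) for b, s in zip(beta, S))

f = lambda z: z[::-1]                       # deterministic z -> y
projections = [lambda y: y[0], lambda y: y[1], lambda y: y[2]]
beta = [1.0, 2.0, 3.0]

d_ok   = dist_beta("abc", "cba", f, projections, beta)  # all S_j hold
d_miss = dist_beta("abc", "cbz", f, projections, beta)  # third projection violated
```

Since π1 × π2 × π3 is injective here (together they recover the whole string), dist_beta is zero exactly when y = f(z), matching the lemma.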
Relaxed Supervision
Example: Unordered Supervision
input x : a b a a
latent z: d c d d
output y : {c : 1, d : 3}
Let count(·, j) count the number of occurrences of character j.
Decomposition:
[y = multiset(z)] ⇐⇒ ∧_{j=1}^{V} [count(z, j) = count(y, j)]
where the left side is S(z, y) with f(z) = multiset(z), and the j-th conjunct is Sj(z, y) with πj(y) = count(y, j).
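The decomposition is easy to check by brute force: multiset equality holds exactly when every per-character count matches. A small sketch over a toy vocabulary:

```python
from collections import Counter

VOCAB = "abcd"

def S(z, y):
    """Exact constraint: y = multiset(z)."""
    return Counter(z) == y

def S_j(z, y, j):
    """Projected constraint: count(z, j) = count(y, j)."""
    return z.count(j) == y[j]

z, y = "dcdd", Counter({"d": 3, "c": 1})
ok  = S(z, y) == all(S_j(z, y, j) for j in VOCAB)            # both True
bad = S("dcdc", y) == all(S_j("dcdc", y, j) for j in VOCAB)  # both False
```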
Relaxed Supervision
Example: Conjunctive Semantic Parsing
Side information: predicates {Q1, …, Qm}.
e.g. Q6 = [DOG] = set of all dogs
input x : brown dog (input utterance)
latent z: (Q11, Q6) (set of all brown objects, set of all dogs)
output y : Q11 ∩ Q6 (denotation, observed as a set)
For z = (Q_{j1}, …, Q_{jL}), define the denotation ⟦z⟧ = Q_{j1} ∩ ··· ∩ Q_{jL}.
Decomposition:
[y = ⟦z⟧] ⇐⇒ ∧_{j=1}^{m} [ I[⟦z⟧ ⊆ Qj] = I[y ⊆ Qj] ]
where the left side is S(z, y) and the j-th conjunct is Sj(z, y), with πj(y) = I[y ⊆ Qj].
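With finite sets as denotations, the decomposition can be verified exhaustively; the predicate contents below are invented for illustration:

```python
from functools import reduce

# Hypothetical predicates over a small universe of objects.
Q = {
    6:  {"rex", "fido", "spot"},         # [DOG]
    11: {"rex", "fido", "mud", "bear"},  # [BROWN]
    3:  {"mud", "lake"},                 # [WET]
}

def denote(z):
    """[[z]] = Q_{j1} ∩ ... ∩ Q_{jL} for z = (j1, ..., jL)."""
    return reduce(set.intersection, (Q[j] for j in z))

def S(z, y):
    return denote(z) == y                      # exact constraint

def S_j(z, y, j):
    return (denote(z) <= Q[j]) == (y <= Q[j])  # projected constraint

z, y = (11, 6), {"rex", "fido"}                # "brown dog" and its denotation
holds = S(z, y) == all(S_j(z, y, j) for j in Q)
```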
Relaxed Supervision
Normalization Constant
Create pressure to increase β by adding normalization constant:
qβ(y | z) = exp(β⊤ψ(z, y) − A(β))    (recall β⊤ψ(z, y) = −distβ(z, y))
ℓ(θ, β; x, y) = −log ( ∑_z pθ(z | x) qβ(y | z) ).
Lemma
Given π1, …, πk, let A(β) := ∑_{j=1}^{k} log(1 + (|Yj| − 1) exp(−βj)). Then log ∑_y exp(−distβ(z, y)) ≤ A(β) for all z.
Lemma
Jointly minimizing L(θ, β) = E[ℓ(θ, β; x, y)] yields a consistent estimate of the true parameters θ*.
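The first lemma can be checked numerically when the output space factorizes as Y = Y1 × Y2 and distβ counts weighted coordinate mismatches (a toy setting in which the bound happens to be tight):

```python
import math
from itertools import product

def A(beta, sizes):
    """A(beta) = sum_j log(1 + (|Y_j| - 1) * exp(-beta_j))."""
    return sum(math.log(1 + (n - 1) * math.exp(-b))
               for b, n in zip(beta, sizes))

Y1, Y2 = ["a", "b", "c"], [0, 1]
beta, y = [0.5, 1.5], ("a", 0)

# Brute-force sum of exp(-dist_beta(y~, y)) over all y~ in Y1 x Y2.
total = sum(math.exp(-(beta[0] * (y1 != y[0]) + beta[1] * (y2 != y[1])))
            for y1, y2 in product(Y1, Y2))
bound_holds = math.log(total) <= A(beta, [len(Y1), len(Y2)]) + 1e-9
```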
Relaxed Supervision
Constraints for Efficient Inference
Inference task:
∇θ log pθ(y | x) = E_{z ∼ pθ(· | x, y)}[φ(x, z, y)]  (sample z given x, y)
− E_{z, ỹ ∼ pθ(· | x)}[φ(x, z, ỹ)]  (sample z given x).
pθ,β(z | x, y) ∝ pθ(z | x) qβ(y | z) ∝ pθ(z | x) exp(β⊤ψ(z, y)).
Rejection sampler:
sample z from pθ(z | x)
accept with probability exp(β⊤ψ(z, y)).
Bound expected number of samples:
∑_{x, y ∈ Data} ( ∑_z pθ(z | x) exp(β⊤ψ(z, y)) )^{−1} ≤ τ. (1)
Ratio of normalization constants: can optimize subject to (1) (similar to CCCP).
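The rejection sampler itself is a few lines: propose z from pθ(z | x) and accept with probability exp(β⊤ψ(z, y)), which is at most 1 because ψj ≤ 0. The toy distribution and constraint features below are made up; the returned try count is the cost that constraint (1) bounds in expectation:

```python
import math, random

def rejection_sample(p_z, psi, beta, rng, max_tries=100_000):
    """Draw z ~ p(z | x, y), proportional to p(z | x) * exp(beta . psi(z))."""
    zs, probs = zip(*p_z.items())
    for tries in range(1, max_tries + 1):
        z = rng.choices(zs, weights=probs)[0]   # propose z ~ p(z | x)
        accept = math.exp(sum(b * p for b, p in zip(beta, psi(z))))
        if rng.random() < accept:
            return z, tries                     # accepted sample + cost
    raise RuntimeError("acceptance rate too low")

# Toy posterior: z1 satisfies both constraints, z2 violates the second.
p_z = {"z1": 0.3, "z2": 0.7}
psi = lambda z: [0, 0] if z == "z1" else [0, -1]
rng = random.Random(0)
samples = [rejection_sample(p_z, psi, [1.0, 2.0], rng)[0] for _ in range(500)]
```

Reweighting shifts mass toward z1 (posterior ≈ 0.76 vs. prior 0.3); raising β lowers the acceptance rate, which is exactly the cost constraint (1) caps.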
Relaxed Supervision
Experiments
Conjunctive semantic parsing:
[Figure: accuracy vs. iteration (left) and number of samples per iteration on a log scale, 10^0–10^5 (right), comparing AdaptBeta(500) with FixedBeta(0.5), FixedBeta(0.2), and FixedBeta(0.1).]
Open Questions
1 Motivation
2 Formal Setting
3 Reified Context Models
4 Relaxed Supervision
5 Open Questions
Open Questions
Scale up to larger tasks: semantic parsing, reinforcement learning, program induction
Extend to Bayesian models
Understand non-convex optimization
Metacomputation using Reified Context Models?
Probabilistic abstract interpretation
Statistics & Computation: still a long way to go
Thanks! 谢谢