Joint work with Sameer Singh, Michael Wick, Karl Schultz, Sebastian Riedel, Limin Yao, Aron Culotta.
Andrew McCallum
Department of Computer Science, University of Massachusetts Amherst
Probabilistic Programming and Probabilistic Databases with Imperatively-Defined Factor Graphs
Goal
Build models that mine actionable knowledge
from unstructured text.
Entities and Relations

[Diagram: entities and relations among Research Papers, Grants, People, Venues, Universities, and Groups, with edges such as Cites and Expertise.]
Extracted Database
• 8 million research papers
• 2 million authors
• 400k grants, 90k institutions, 10k venues
Pipeline: gather raw data → Extraction (text → mentions) → Resolution (mentions → entities) → answer queries.
Combining ambiguous evidence: applying probabilistic modeling to large data.

Computational Statistics intersects with: scalability, algorithms, data structures, software engineering, parallelism, bio/medical informatics, information extraction, computer vision, scientific data modeling, ...

Rich model structure: spatio-temporal, hierarchical, relational, infinite. Implementing the new model is a significant task.
Bayesian Network: Directed, Generative Graphical Models

[Directed graph over variables a, b, c, d, e]

p(a, b, c, d, e) = p(a) p(b|a) p(c|a) p(d|a, c) p(e|b, c)
Markov Random Field: Undirected Graphical Model, aka Markov Network

[Undirected graph over variables a, b, c, d, e]

With clique potentials:
p(a, b, c, d, e) = (1/Z) φ(a, c, d) φ(a, b) φ(b, e) φ(d, e)

Or with pairwise potentials:
p(a, b, c, d, e) = (1/Z) φ(a, c) φ(a, d) φ(c, d) φ(a, b) φ(b, e) φ(d, e)
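As a minimal illustration (not from the slides), the unnormalized product of pairwise potentials and the partition function Z can be computed by brute-force enumeration over binary variables; the potential table here is made-up:

```python
import itertools

# Made-up pairwise potential phi(u, v) over binary values, shared by every edge.
phi = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}

# Edges of the pairwise MRF above: phi(a,c) phi(a,d) phi(c,d) phi(a,b) phi(b,e) phi(d,e)
edges = [("a", "c"), ("a", "d"), ("c", "d"), ("a", "b"), ("b", "e"), ("d", "e")]
variables = ["a", "b", "c", "d", "e"]

def unnormalized(assign):
    score = 1.0
    for u, v in edges:
        score *= phi[(assign[u], assign[v])]
    return score

# Partition function: sum of unnormalized scores over all 2^5 assignments.
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in itertools.product([0, 1], repeat=len(variables)))

def p(assign):
    return unnormalized(assign) / Z

# Sanity check: normalized probabilities sum to 1.
total = sum(p(dict(zip(variables, vals)))
            for vals in itertools.product([0, 1], repeat=len(variables)))
print(round(total, 6))
```

This brute-force Z is exactly what the later Metropolis-Hastings slides avoid computing.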
Factor Graph: Can represent both directed and undirected graphical models

[Factor graph over variables a, b, c, d, e]

p(a, b, c, d, e) = (1/Z) φ(a, c, d) φ(a, b) φ(b, e) φ(d, e)
p(a, b, c, d, e) = (1/Z) φ(a, c) φ(a, d) φ(c, d) φ(a, b) φ(b, e) φ(d, e)
Conditional Random Field (CRF): Undirected graphical model, conditioned on some data variables

[Graph: output (predicted) variables y; input (observed) variables x]

p(y|x) = (1/Z_x) ∏_f φ(x∈f, y∈f)

where each factor is log-linear:
φ(x_t, y_t) = exp( Σ_k λ_k f_k(x_t, y_t) )

+ Tremendous freedom to use arbitrary features of the input.
+ Predict multiple dependent variables ("structured output").

[Lafferty, McCallum, Pereira 2001]
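Concretely, a log-linear factor scores a (input, output) pair as exp of a weighted feature sum; a small sketch with made-up features and weights (the feature names and values are illustrative only):

```python
import math

# Hypothetical binary features f_k(x_t, y_t); any feature functions could be used.
def features(x_t, y_t):
    return {
        "capitalized_and_person": float(x_t[0].isupper() and y_t == "PER"),
        "lowercase_and_background": float(x_t[0].islower() and y_t == "O"),
    }

# Made-up weights lambda_k.
weights = {"capitalized_and_person": 1.2, "lowercase_and_background": 0.8}

def factor_score(x_t, y_t):
    # phi(x_t, y_t) = exp( sum_k lambda_k f_k(x_t, y_t) )
    return math.exp(sum(weights[k] * v for k, v in features(x_t, y_t).items()))

# A capitalized word scores higher with the person label than with background.
print(factor_score("Smith", "PER") > factor_score("Smith", "O"))  # True
```

The "tremendous freedom" claim is this: features may inspect any part of the observed input, since x is conditioned on, never generated.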
Relational Graphical Model: Relational = repeated structure of data & factors

[Graph over variables a, b, d, e, f, g, h, i. Factor Template #1 relates each x1 to its y1; Factor Template #2 relates neighboring y1, y2.]
Information Extraction with Linear-chain CRFs

Finite state model / graphical model: a state sequence s1 ... s8 over the observation sequence
"Today Morgan Stanley Inc announced Mr. Friday's appointment."
States label tokens as person name, organization name, or background.

State-of-the-art predictive accuracy on many tasks. The logistic-regression analogue of a hidden Markov model.
Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work
Information Integration

Database A (Schema A):
First Name | Last Name | Contact
J.         | Smith     | 222-444-1337
J.         | Smith     | 444 1337
John       | Smith     | (1) 4321115555

Database B (Schema B):
Name          | Phone
John Smith    | U.S. 222-444-1337
John D. Smith | 444 1337
J Smiht       | 432-111-5555

Schema Matching: align Schema A's First Name / Last Name with Schema B's Name, and Contact with Phone.

Coreference: cluster the mentions {J. Smith, J. Smith, John Smith, John Smith, John D. Smith, J Smiht} into entities John #1 and John #2.

Canonicalization: Normalized DB
Entity# | Name          | Phone
523     | John Smith    | 222-444-1337
524     | John D. Smith | 432-111-5555
…       | …             | …
Schema Matching, Coreference and Canonicalization

[Factor graph: clusters of mentions/attributes x1 ... x8; pairwise match variables y_ij with factors f_ij; per-cluster variables y_i with factors f_i.]

P(Y|X) = (1/Z_X) ∏_{yi∈Y} ψw(yi, xi) ∏_{yi,yj∈Y} ψb(yij, xij)

ψ(yi, xi) = exp( Σ_k λ_k f_k(yi, xi) )
• x1 is a set of mentions {J. Smith, John, John Smith}
• x2 is a set of mentions {Amanda, A. Jones}
• f12 is a factor between x1/x2
• y12 is a binary variable indicating a match (no)
• f1 is a factor over cluster x1
• y1 is a binary variable indicating match (yes)
• Entity/attribute factors omitted for clarity
Schema Matching and Coreference (same factor graph and model as above):
• x6 is a set of attributes {phone, contact, telephone}
• x7 is a set of attributes {last name, last name}
• f67 is a factor between x6/x7
• y67 is a binary variable indicating a match (no)
• f7 is a factor over cluster x7
• y7 is a binary variable indicating match (yes)
(The same factor graph, now with an additional factor f43 jointly linking a schema-matching decision and a coreference decision.)

Really Hairy! How to do:
• parameter estimation
• inference
• Most methods require calculating the gradient of the log-likelihood, P(y1, y2, y3, ... | x1, x2, x3, ...) ...
• ... which in turn requires "expectations of marginals," P(y1 | x1, x2, x3, ...)
• But getting marginal distributions can be difficult.
• Alternative: Perceptron. Approximate the gradient from the difference between the true output and the model's predicted best output.
• But even finding the model's predicted best output is expensive.
• We propose "SampleRank" [Culotta, Wick, Hall, McCallum, HLT 2007]: learn to rank intermediate solutions, e.g. P(y1=1, y2=0, y3=1, ... | ...) > P(y1=0, y2=0, y3=1, ... | ...)
Parameter Estimation in Large State Spaces
Metropolis-Hastings + SampleRank
Metropolis-Hastings for MAP

Maximum a posteriori (MAP) inference: argmax_{y∈F} P(Y = y | x) [1]

... over a model: P(y|x) = (1/Z_X) ∏_i ψ(x, y_i)

... using a proposal distribution q(y'|y) : F × F → [0, 1]

MH for MAP:
1. Begin with some initial configuration y0 ∈ F.
2. For i = 1, 2, 3, ... draw a local modification y' ∈ F from q.
3. Probabilistically accept the jump as a Bernoulli draw with parameter
   α = min( 1, [p(y') / p(y)] · [q(y|y') / q(y'|y)] )

[1] F is the feasible region defined by deterministic constraints, e.g. clusterings, non-projective trees.

(UMass, Amherst) Sample Rank Vs. Contrastive Divergence IESL 3 / 27
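A minimal sketch of the MH-for-MAP loop above, with a made-up scoring function and a symmetric proposal (so the q ratio is 1); annealing a temperature on the p ratio, as mentioned later in the slides, pushes the chain toward the MAP configuration:

```python
import math
import random

random.seed(0)

def score(y):
    # Made-up unnormalized log-score: maximized at the all-ones configuration.
    return -sum((v - 1) ** 2 for v in y)

def propose(y):
    # Symmetric local modification: flip one coordinate, so q(y|y')/q(y'|y) = 1.
    y2 = list(y)
    i = random.randrange(len(y2))
    y2[i] = 1 - y2[i]
    return y2

y = [0, 0, 0, 0]
temperature = 2.0
for step in range(2000):
    y_prime = propose(y)
    # alpha = min(1, (p(y')/p(y))^(1/T)), computed in log space.
    log_ratio = (score(y_prime) - score(y)) / temperature
    if math.log(random.random()) < min(0.0, log_ratio):
        y = y_prime
    temperature = max(0.01, temperature * 0.995)  # anneal toward MAP

print(y)  # converges to the maximizer [1, 1, 1, 1]
```

Improving moves are always accepted; worsening moves are accepted with probability that shrinks as the temperature falls, which is what distinguishes MAP search from plain sampling.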
(Setting: a factor graph with target variables y and observed variables x.)
We can do MAP inference by decreasing a temperature on the ratio of p(y)'s.
M-H Natural Efficiencies

1. The partition function cancels:

p(y') / p(y) = p(Y = y'|x; θ) / p(Y = y|x; θ)
             = [ (1/Z_X) ∏_{y'_i ∈ y'} ψ(x, y'_i) ] / [ (1/Z_X) ∏_{y_i ∈ y} ψ(x, y_i) ]
             = ∏_{y'_i ∈ y'} ψ(x, y'_i) / ∏_{y_i ∈ y} ψ(x, y_i)

(UMass, Amherst) Sample Rank Vs. Contrastive Divergence IESL 5 / 27
2. Unchanged factors cancel:

∏_{y'_i ∈ y'} ψ(x, y'_i) / ∏_{y_i ∈ y} ψ(x, y_i)
  = [ ∏_{y'_i ∈ δy'} ψ(x, y'_i) · ∏_{y_i ∈ y'\δy'} ψ(x, y_i) ] / [ ∏_{y_i ∈ δy} ψ(x, y_i) · ∏_{y_i ∈ y\δy} ψ(x, y_i) ]
  = ∏_{y'_i ∈ δy'} ψ(x, y'_i) / ∏_{y_i ∈ δy} ψ(x, y_i)

δy is the "diff", i.e. the variables in y that have changed.

(UMass, Amherst) Sample Rank Vs. Contrastive Divergence IESL 6 / 27
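The cancellation means only factors touching the changed variables (the diff δy) need rescoring. A sketch with a made-up log-factor (the factor definition is illustrative, not from the slides):

```python
import math

# Hypothetical log factor: reward agreement among the variables it touches.
def log_factor(vars_touched, assign):
    vals = [assign[v] for v in vars_touched]
    return 1.0 if len(set(vals)) == 1 else -1.0

# Each factor is identified by the variables it touches (a chain a-b-c-d).
factors = [("a", "b"), ("b", "c"), ("c", "d")]

def log_score_ratio(assign, assign_new, changed):
    # Only factors touching a changed variable contribute; the rest cancel.
    touched = [f for f in factors if any(v in changed for v in f)]
    old = sum(log_factor(f, assign) for f in touched)
    new = sum(log_factor(f, assign_new) for f in touched)
    return new - old

assign = {"a": 0, "b": 0, "c": 1, "d": 1}
assign_new = dict(assign, b=1)  # the proposal changed only b

# The delta-based ratio equals the full-score difference.
full_old = sum(log_factor(f, assign) for f in factors)
full_new = sum(log_factor(f, assign_new) for f in factors)
print(math.isclose(log_score_ratio(assign, assign_new, {"b"}), full_new - full_old))
```

With local proposals, the cost of scoring a jump is proportional to the size of the diff, not the size of the model.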
How to learn the parameters?

Ranking Intermediate Solutions: Example

1. (initial configuration)
2. ∆Model = -23, ∆Truth = -0.2
3. ∆Model = 10, ∆Truth = -0.1   → UPDATE
4. ∆Model = -10, ∆Truth = -0.1
5. ∆Model = -3, ∆Truth = 0.3    → UPDATE

• Like Perceptron: proof of convergence under marginal separability.
• More constrained than maximum likelihood: parameters must correctly rank incorrect solutions!
• Very fast to train.
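One way to sketch the SampleRank update illustrated above (our reading of the slide, not the exact FACTORIE code; feature vectors and truth scores are hypothetical): when the model's score difference between two configurations disagrees in sign with the truth (objective) difference, take a perceptron-style step on the feature difference.

```python
def dot(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def sample_rank_update(weights, feat_a, feat_b, truth_a, truth_b, lr=1.0):
    """If the model ranks configurations a and b differently than the truth does,
    nudge the weights toward the truth-preferred configuration."""
    delta_model = dot(weights, feat_a) - dot(weights, feat_b)
    delta_truth = truth_a - truth_b
    keys = set(feat_a) | set(feat_b)
    if delta_truth > 0 and delta_model <= 0:    # truth prefers a, model does not
        for k in keys:
            weights[k] = weights.get(k, 0.0) + lr * (feat_a.get(k, 0.0) - feat_b.get(k, 0.0))
    elif delta_truth < 0 and delta_model >= 0:  # truth prefers b, model does not
        for k in keys:
            weights[k] = weights.get(k, 0.0) - lr * (feat_a.get(k, 0.0) - feat_b.get(k, 0.0))
    return weights

w = {"match": 0.0}
feat_good, feat_bad = {"match": 1.0}, {"match": 0.0}
w = sample_rank_update(w, feat_good, feat_bad, truth_a=0.3, truth_b=0.0)
print(w["match"])  # 1.0 after one corrective update
```

Because updates happen per proposed jump, the learner never needs marginals or even the model's global best output, which is why it trains so fast.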
Comparison to Contrastive Divergence

Contrastive Divergence, n=2 [Hinton 2002]: sufficient statistics for the update come from the proposal vs. the truth.
Persistent Contrastive Divergence [Tieleman 2008].
Sample Rank: Comparison to LASO [Daumé, Langford, Marcu 2005]

Sample Rank scores possible worlds; LASO scores "actions" (transitions). Sample Rank has no concern about the generation ordering of output variables, defines a standard factor-graph score on a possible world, and can produce marginal probability distributions.
SampleRank on Coreference

• ACE 2004, all nouns: 28,122 mentions, 14,047 entities
  (e.g. he, the President, Clinton, Mrs. Clinton, Washington)

B3 scores:
2005 Ng: 69.5%
2007 Culotta, Wick, Hall, McCallum: 79.3%
2008 Bengston, Roth: 80.8%
2009 Wick, McCallum (MCMC + SampleRank): 81.5%
     Contrastive Divergence: 75.1%
     Persistent Contrastive Divergence: 74.9%
     Perceptron: 76.3%
Dataset

• Faculty and alumni listings from university websites, plus an IE system
• 9 different database schemas
• ~1400 mentions, 294 coreferent

Example schemas:
DEX IE: First Name, Middle Name, Last Name, Title, Department, Company Name, Home Phone, Office Phone, Fax Number
Northwestern Fac: Name, Title, PhD Alma Mater, Research Interests, Office Address, E-mail
UPenn Fac: Name, First Name, Last Name, Job+Department
Coreference Results

                 Pair F1 / Prec / Recall    MUC F1 / Prec / Recall
No Canon ISO:    72.7 / 88.9 / 61.5         75.0 / 88.9 / 64.9
No Canon CASC:   64.0 / 66.7 / 61.5         65.7 / 66.7 / 64.9
No Canon JOINT:  76.5 / 89.7 / 66.7         78.8 / 89.7 / 70.3
Canon ISO:       78.3 / 90.0 / 69.2         80.6 / 90.0 / 73.0
Canon CASC:      65.8 / 67.6 / 64.1         67.6 / 67.6 / 67.6
Canon JOINT:     81.7 / 90.6 / 74.4         84.1 / 90.6 / 74.4

~15% error reduction from the joint model
ISO = isolated, CASC = cascade, JOINT = joint inference
Schema Matching Results

                 Pair F1 / Prec / Recall    MUC F1 / Prec / Recall
No Canon ISO:    50.9 / 40.9 / 67.5         69.2 / 81.8 / 60.0
No Canon CASC:   50.9 / 40.9 / 67.5         69.2 / 81.8 / 60.0
No Canon JOINT:  68.9 / 100 / 52.5          69.6 / 100 / 53.3
Canon ISO:       50.9 / 40.9 / 67.5         69.2 / 81.8 / 60.0
Canon CASC:      52.3 / 41.8 / 70.0         74.1 / 83.3 / 66.7
Canon JOINT:     71.0 / 100 / 55.0          75.0 / 100 / 60.0

~40% error reduction from the joint model
ISO = isolated, CASC = cascade, JOINT = joint inference
(Recall the joint schema-matching / coreference factor graph, with the cross-task factor f43. Really Hairy!)

How to do:
• parameter estimation ✔
• inference ✔
• software engineering
Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work
Probabilistic Programming Languages

• Make it easy to specify rich, complex models, using the full power of programming languages:
  - data structures
  - control mechanisms
  - abstraction
• Inference implementation comes for free

Provides a language to easily create new models.
Small Sampling of Probabilistic Programming Languages

• Explicit directed graph: BUGS
• Functional: IBAL, Church
• Object-oriented: Figaro, Infer.NET
• Logic-based: Markov logic, BLOG, PRISM
Declarative Model Specification

• One of the biggest advances in the Artificial Intelligence community
• Gone too far? Much domain knowledge is also procedural.
• Logic + Probability → Imperative + Probability
  - Rising interest: Church, Infer.NET, ...

→ Imperative tools for creating a declarative model specification
Our approach: "Imperatively-Defined Factor Graphs"

• Preserve the declarative statistical semantics of factor graphs
• Provide imperative hooks to define structure, parameterization, inference, estimation.

[McCallum, Rohanimanesh, Wick, Schultz, Singh, NIPS 2008]
Our Design Goals

• Represent factor graphs
  - emphasis on conditional random fields
• Scalability
  - input data, output configuration, factors, tree-width
  - observed data that cannot fit in memory
  - super-exponential number of factors
• Leverage object-oriented benefits
  - modularity, encapsulation, inheritance, ...
• Integrate declarative & procedural knowledge
  - natural, easy to use
  - upcoming slides: 2 examples of injecting imperativ-ism into factor graphs
FACTORIE

• "Factor graphs, Imperative, Extensible"
• Implemented as a library in Scala [Martin Odersky]
  - object-oriented & functional
  - type inference
  - runs in the JVM (complete interoperation with Java)
  - fast, JIT-compiled, but also a command-line interpreter
• Library, not a new "little language"
  - integrate data pre-processing & evaluation with model specification
  - leverage OO design: modularity, encapsulation, inheritance
• Scalable
  - billions of variables, super-exponential #factors, DB back-end
  - fast parameter estimation through SampleRank [2009]

http://code.google.com/p/factorie
Stages of FACTORIE programming

1. Define "templates for data" (i.e. classes)
   - Use data structures just like in deterministic programming.
   - Only special requirement: "undo" capability for changes.
   - (A variable holds a single possible value, not a distribution.)
2. Define "templates for factors"
   - Distinct from the data representation above; makes it easy to modify model scoring independently.
   - Leverage the data's natural relations to define the factors' relations.
3. Select inference (MCMC, variational)
   - Optionally, define MCMC proposal functions that leverage domain knowledge.
4. Read the data, creating variables. Then inference / parameter estimation is often a one-liner!
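The "undo" requirement in step 1 can be sketched as follows (our illustration, not FACTORIE's actual classes): a variable records each change on a diff list so a rejected MCMC proposal can be rolled back.

```python
class Variable:
    """Holds a single value (not a distribution); changes are recorded for undo."""
    def __init__(self, value):
        self.value = value

    def set(self, new_value, difflist):
        # Record how to undo this change before applying it.
        difflist.append((self, self.value))
        self.value = new_value

def undo(difflist):
    # Roll back a rejected proposal by restoring old values in reverse order.
    for var, old in reversed(difflist):
        var.value = old
    difflist.clear()

v = Variable("PER")
diff = []
v.set("O", diff)   # a proposal changes the variable...
undo(diff)         # ...the proposal is rejected, so roll back
print(v.value)  # PER
```

The diff list doubles as the record of "which variables changed" used when scoring a proposal.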
Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work
Scala

• New variable: var myHometown : String
• New constant: val myName = "Andrew"
• New method: def climb(increment:Double) = myAltitude += increment
• New class: class Skier extends Person
• New trait (like a Java interface with implementations): trait FirstAid { def applyBandage = ... }
• New class with trait: class BackcountrySkier extends Skier with FirstAid
• Generics in square brackets: new ArrayList[Skier]
Example: Linear-Chain CRF for Segmentation

Labels:  T    F     F      T    F     F
Words:   Bill loves skiing Tom  loves snowshoeing
class Label(isBeg:Boolean) extends BooleanVariable(isBeg) with VarInSeq {
  val token : Token
}
class Token(word:String) extends CategoricalVariable(word) with VarInSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

(The VarInSeq trait provides label.prev and label.next.)

Avoid representing relations by indices; do it directly with member pointers... an arbitrary data structure.
val model = new Model(
  new TemplateWithStatistics1[Label],
  new TemplateWithStatistics2[Label,Token],
  new TemplateWithStatistics2[Label,Label])
Key Operation: Scoring a Proposal

• Acceptance probability ~ ratio of model scores. Scores of factors that didn't change cancel.
• To efficiently score:
  - The proposal method runs.
  - Automatically build a list of variables that changed.
  - Find factors that touch changed variables.
  - Find other (unchanged) variables needed to calculate those factors' scores.
• How to find factors from variables & vice versa?
  - In BLOG, a rich, highly-indexed data structure stores the mapping variables ←→ factors.
  - But it is complex to maintain as the structure changes.
  - Factors consume memory.
Imperativ-ism #1: Model Structure

• Maintain no map structure between factors and variables.
• Finding factor templates is easy; usually # templates < 50.
  - Given a changed variable, query each template.
• What's hard: given a factor template and one changed variable, find the other variables.
• Primitive operation: in the factor Template object, define imperative methods that do this.
  - unroll1(v1) returns (v1, v2, v3)
  - unroll2(v2) returns (v1, v2, v3)
  - unroll3(v3) returns (v1, v2, v3)
  - i.e., use a Turing-complete language to determine structure on the fly.
• Other nice attributes
  - Easy to do value-conditioned structure: Case Factor Diagrams, etc.
  - Not only avoid the super-exponential blowup; don't even allocate all factors for the current configuration.
  - FACTORIE provides several simpler mechanisms that build on this primitive.
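The unroll idea can be sketched in plain Python (hypothetical names, not FACTORIE's API): given a changed variable, each template imperatively enumerates the factors touching it by following the data's own pointers, so no variable-to-factor map is ever stored.

```python
# Hypothetical sequence of label variables; each knows its neighbors directly.
class Label:
    def __init__(self, index):
        self.index = index
        self.prev = None
        self.next = None

labels = [Label(i) for i in range(4)]
for a, b in zip(labels, labels[1:]):
    a.next, b.prev = b, a

class TransitionTemplate:
    """Given one changed Label, imperatively enumerate the factors touching it."""
    def unroll(self, label):
        factors = []
        if label.next is not None:
            factors.append((label, label.next))   # factor where label is first neighbor
        if label.prev is not None:
            factors.append((label.prev, label))   # factor where label is second neighbor
        return factors

t = TransitionTemplate()
print([(a.index, b.index) for a, b in t.unroll(labels[1])])  # [(1, 2), (0, 1)]
```

Factors exist only while being scored; nothing is allocated for the parts of the configuration that did not change.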
Example: Linear-Chain CRF for Segmentation (with imperative structure definition)

class Label(isBeg:Boolean) extends BooleanVariable(isBeg) with VarInSeq {
  val token : Token
}
class Token(word:String) extends CategoricalVariable(word) with VarInSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

val model = new Model(
  new TemplateWithStatistics1[Label],
  new TemplateWithStatistics2[Label,Token] {
    def unroll1(label:Label) = Factor(label, label.token)
    def unroll2(token:Token) = throw new Error // Tokens shouldn't change
  },
  new TemplateWithStatistics2[Label,Label] {
    def unroll1(label:Label) = Factor(label, label.next)
    def unroll2(label:Label) = Factor(label.prev, label)
  })
Imperativ-ism #2: Neighbor-Sufficient Map

• "Neighbor variables" of a factor: the values of the variables touching the factor.
• "Sufficient statistics" of a factor: a vector whose dot product with the weights of a log-linear factor gives the factor's score.
• Usually conflated. Separate them with a user-defined function!
• Skip-chain NER: instead of 5x5 parameters, just 2:
  (label1, label2) → label1 == label2

Labels:  PER  O     LOC   PER  O   O
Words:   Bill loves Paris Bill the painter ...

[Sutton & McCallum 2006]
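The neighbor-to-statistics map can be sketched as follows (weights are made-up): the factor's neighbors are two labels, but the sufficient statistic is just the boolean "same label?", so 2 parameters replace the 5x5 table.

```python
import math

# Two weights instead of a 5x5 table over label pairs (illustrative values).
weights = {True: 1.5, False: -0.5}

def sufficient_statistic(label1, label2):
    # User-defined map from neighbor values to sufficient statistics:
    # the only thing this factor cares about is whether the two labels agree.
    return label1 == label2

def factor_score(label1, label2):
    return math.exp(weights[sufficient_statistic(label1, label2)])

# Matching labels on repeated tokens score higher than mismatched ones.
print(factor_score("PER", "PER") > factor_score("PER", "LOC"))  # True
```

Fewer parameters also means less training data needed to estimate this factor reliably.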
Example: Skip-Chain CRF for Segmentation

(Label and Token class definitions as before.)

val model = new Model(
  new TemplateWithStatistics1[Label],
  new TemplateWithStatistics2[Label,Token] {
    def unroll1(label:Label) = Factor(label, label.token)
    def unroll2(token:Token) = throw new Error // Tokens shouldn't change
  },
  new TemplateWithStatistics2[Label,Label] {
    def unroll1(label:Label) = Factor(label, label.next)
    def unroll2(label:Label) = Factor(label.prev, label)
  },
  new Template2[Label,Label] with Statistics1[BooleanVariable] {
    def unroll1(label:Label) =
      for (other <- label.seq; if (label.token == other.token))
        yield Factor(label, other)
    def statistics(label1:Label, label2:Label) = Stat(label1 == label2)
  })

val labels:Collection[Label] = readData()
val inferencer = new GibbsSampler(model)
for (i <- 1 to numIterations) inferencer.process(labels)
Example Run: CoNLL 2003 NER
Example: Dependency Parsing

class Word(str:String) extends CategoricalVariable(str)
class Node(word:Word, parent:Node) extends RefVariable(parent)

object ChildParentTemplate extends Template1[Node] with Statistics2[Word,Word] {
  def statistics(n:Node) = Stat(n.word, n.parent.word)
}

object NearestVerbTemplate extends Template1[Node] with Statistics2[Word,Word] {
  def statistics(n:Node) = Stat(n.word, closestVerb(n).word)
  def closestVerb(n:Node) = if (isVerb(n.word)) n else closestVerb(n.parent)
  def unroll1(n:Node) = n.selfAndDescendants
}
Example: Alternative Template Spec
val model = new Model( new TemplateWithStatistics1[Label], new TemplateWithStatistics2[Label,Token] { def unroll1(label:Label) = Factor(label, label.token) def unroll2(token:Token) = throw new Error }, new TemplateWithStatistics2[Label,Label] { def unroll1(label:Label) = Factor(label, label.next) def unroll2(label:Label) = Factor(label.prev, label) },)
Instead of previous:
Higher-level "Entity-Relationship" Specification:

val model = new Model(
  Foreach[Label] { label => Score(label) },
  Foreach[Label] { label => Score(label, label.token) },
  Foreach[Label] { label => Score(label.prev, label) })
Example: First-order Logic Templates

val model = new Model(
  Forany[Person] { p => p.cancer } * 0.1,
  Forany[Person] { p => p.smokes ==> p.cancer } * 2.0,
  Forany[Person] { p => p.friends.smokes <==> p.smokes } * 1.5)
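Each weighted formula above contributes its weight once per satisfied grounding, giving a Markov-logic-style log-linear score. A minimal plain-Scala sketch of that semantics (the Person case class and helper below are illustrative assumptions, not FACTORIE API):

```scala
// Sum the weights of all satisfied groundings of the three formulas above.
case class Person(smokes: Boolean, cancer: Boolean, friends: List[Person] = Nil)

object WorldScore {
  def implies(a: Boolean, b: Boolean): Boolean = !a || b

  def logScore(people: List[Person]): Double = people.map { p =>
    (if (p.cancer) 0.1 else 0.0) +                                  // p.cancer * 0.1
    (if (implies(p.smokes, p.cancer)) 2.0 else 0.0) +               // smokes ==> cancer * 2.0
    p.friends.map(f => if (f.smokes == p.smokes) 1.5 else 0.0).sum  // friends.smokes <==> smokes * 1.5
  }.sum
}
```

For a lone smoker with cancer the score is 0.1 + 2.0 = 2.1, and the probability of a world is proportional to exp(logScore).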
Example: Latent Dirichlet Allocation

class Z(p:Proportions, value:Int) extends MixtureChoice(p, value)
class Word(ps:FiniteMixture[Proportions], z:MixtureChoiceVariable, value:String)
  extends CategoricalMixture[String](ps, z, value)
class Document(val file:String) extends ArrayBuffer[Word] {
  var theta: DirichletMultinomial = null
}

val phis = FiniteMixture(numTopics)(new GrowableDenseDirichletMultinomial(0.01))
val documents = new ArrayBuffer[Document]
for (directory <- directories) {
  for (file <- new File(directory).listFiles; if (file.isFile)) {
    val doc = new Document(file.toString)
    doc.theta = new DenseDirichletMultinomial(numTopics, 0.01)
    for (word <- lexer.findAllIn(file.mkString).map(_.toLowerCase)) {
      val z = new Z(doc.theta, random.nextInt(numTopics))
      doc += new Word(phis, z, word)
    }
    documents += doc
  }
}

val sampler = new CollapsedGibbsSampler(phis ++ documents.map(_.theta))
val zs = documents.flatMap(document => document.map(word => word.choice))
sampler.processAll(zs)
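The collapsed Gibbs update behind CollapsedGibbsSampler can also be written out directly. A self-contained sketch on a toy corpus (the corpus, vocabulary size, and hyperparameters are illustrative assumptions):

```scala
import scala.util.Random

// Collapsed Gibbs sampling for LDA: resample each topic assignment z from
// P(z = k | rest) ∝ (n_dk + α) · (n_kw + β) / (n_k + V·β)
object ToyLda {
  def run(iters: Int, seed: Long): Array[Array[Int]] = {
    val rng = new Random(seed)
    val docs = Array(Array(0, 1, 0, 1), Array(2, 3, 2, 3), Array(0, 1, 2, 3))
    val V = 4; val K = 2; val alpha = 0.5; val beta = 0.01
    val nDK = Array.ofDim[Int](docs.length, K)  // document-topic counts
    val nKW = Array.ofDim[Int](K, V)            // topic-word counts
    val nK = Array.ofDim[Int](K)                // topic totals
    val z = docs.map(_.map(_ => rng.nextInt(K)))
    for (d <- docs.indices; i <- docs(d).indices) {
      nDK(d)(z(d)(i)) += 1; nKW(z(d)(i))(docs(d)(i)) += 1; nK(z(d)(i)) += 1
    }
    for (_ <- 1 to iters; d <- docs.indices; i <- docs(d).indices) {
      val w = docs(d)(i); val old = z(d)(i)
      nDK(d)(old) -= 1; nKW(old)(w) -= 1; nK(old) -= 1    // remove current assignment
      val p = Array.tabulate(K) { k =>
        (nDK(d)(k) + alpha) * (nKW(k)(w) + beta) / (nK(k) + V * beta)
      }
      var u = rng.nextDouble() * p.sum
      var k = 0
      while (k < K - 1 && u > p(k)) { u -= p(k); k += 1 } // draw from the conditional
      z(d)(i) = k
      nDK(d)(k) += 1; nKW(k)(w) += 1; nK(k) += 1
    }
    z
  }
}
```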
Experimental Comparison
• Joint Segmentation & Coreference of research paper citations (Cora data)
  - 1295 mentions, 134 entities, 36487 tokens
• Compare with Markov Logic Networks (Alchemy)
- Same observable features
• FACTORIE results:
- ~25% reduction in error (segmentation & coref)
- 3-20x faster
- coref results:
Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work
founded(Bill Gates,Microsoft)
nationality(Steve Jobs,USA)
...(...,...)
Microsoft was founded by Bill Gates
With Microsoft chairman Bill Gates soon relinquishing ...
Paul Porter, a founder of Industry Ears
334K entities, 10 types 488K relation instances, 54 types
2 years216K articles
Joint model of entities & relations
founded(Paul Porter,Industry Ears)
founded(D.L.Sifry,Technorati)
founded(D. L. Sifry, Technorati)?
Knowledge Base Augmentation [Yao, Riedel, McCallum 2010]
Cross-document Joint Inference
Figure 1: Factor Graph for joint relation mention prediction and relation type identification.
[Equation slides: p(y; x) = (1/Z_x) Ψ_1(y; x) · … · Ψ_n(y; x), with Ψ_i(y; x) = exp(θ_i φ_i(y; x)) and μ_i = E[φ_i(y; x)].]
nationality. When testing our model we then encounter a sentence such as

(3) Arrest Warrant Issued for Richard Gere in India.

that leads us to extract that RICHARD GERE is a citizen of INDIA.
2.6 Global Consistency of Facts
As discussed above, distant supervision can lead to noisy extractions. However, such noise can often be easily identified by testing how compatible the extracted facts are with each other. In this work we are concerned with a particular type of compatibility: selectional preferences.

Relations require, or prefer, their arguments to be of certain types. For example, the nationality relation requires the first argument to be a person, and the second to be a country. On inspection, we find that these preferences are often not satisfied in a baseline distant supervision system akin to Mintz et al. (2009). This often results from patterns such as "<Entity1> in <Entity2>" that fire in many cases where <Entity2> is a location, but not a country.

3 Model
Our observations in the previous section suggest that we should (a) explicitly model compatibility between extracted facts, and (b) integrate evidence from several documents to exploit redundancy. In this work we choose a Conditional Random Field (CRF) to achieve this. CRFs are a natural fit for this task: they allow us to capture correlations in an explicit fashion, and to incorporate overlapping input features from multiple documents.
The hidden output variables of our model are Y = (Y_c)_{c∈C}; that is, we have one variable Y_c for each candidate tuple c ∈ C. This variable can take as value any relation with the same arity as c. See the example relation variables in Figure 1.

The observed input variables X consist of a family of variables X_c = (X^1_c, ..., X^m_c) for each candidate tuple c. Here X^i_c stores the relevant observations we make for the i-th candidate mention tuple of c in the corpus. For example, X^1_{BILL GATES, MICROSOFT} in Figure 1 would contain, among others, the pattern "[M2] was founded by [M1]".
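Concretely, this variable layout can be sketched as plain data (the class names below are illustrative, not the paper's implementation):

```scala
// One hidden relation variable per candidate tuple, plus pooled
// per-mention observations gathered from across the corpus.
case class CandidateTuple(args: Seq[String])
case class MentionObs(patterns: Set[String])                  // one X^i_c
case class RelationVar(c: CandidateTuple, var value: String)  // Y_c

object TupleDemo {
  val c = CandidateTuple(Seq("BILL GATES", "MICROSOFT"))
  val x1 = MentionObs(Set("[M2] was founded by [M1]"))
  val y = RelationVar(c, "founded")
}
```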
3.1 Factor Templates
Our conditional probability distribution over the variables X and Y is defined using a set \mathcal{T} of factor templates. Each template T_j ∈ \mathcal{T} defines a set of factors {(\mathbf{y}_i, \mathbf{x}_i)}, a set K_j of feature indices, parameters (\theta_{jk})_{k∈K_j}, and feature functions (f_{jk})_{k∈K_j}. Together they define the following conditional distribution:

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_\mathbf{x}} \prod_{T_j \in \mathcal{T}} \prod_{(\mathbf{y}_i, \mathbf{x}_i) \in T_j} \exp\Big( \sum_{k \in K_j} \theta_{jk} f_{jk}(\mathbf{y}_i, \mathbf{x}_i) \Big)   (4)

In our case the set \mathcal{T} consists of four templates we will describe below. We construct this graphical model using FACTORIE (McCallum et al., 2009), a probabilistic programming language that simplifies the construction process, as well as inference and learning.
3.1.1 Bias Template
We use a bias template T^{Bias} that prefers certain relations a priori over others. When the template is unrolled, it creates one factor per variable Y_c for candidate tuple c ∈ C. The template also consists of one weight \theta^{Bias}_r and feature function f^{Bias}_r for each possible relation r; f^{Bias}_r fires if the relation associated with tuple c is r.
3.1.2 Mention Template
In order to extract relations from text, we need to model the correlation between relation instances and their mentions in text. For this purpose we define the template T^{Mention}, which connects each relation instance variable Y_c with its observed mention variables X_c. Crucially, this template gathers mentions from multiple documents, which enables us to exploit redundancy.

The feature functions of this template are taken from Mintz et al. (2009). They include features that inspect the lexical content between entity mentions in the same sentence, and the syntactic path between them. One example is

f^{Men}_{101}(y_c, \mathbf{x}_c) := 1 if y_c = founded ∧ ∃i with "M2 was founded by M1" ∈ x^i_c; 0 otherwise.
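Rendered as code, such a mention feature is just a predicate over the pooled cross-document patterns of a candidate tuple (a hypothetical sketch, not the actual feature extractor):

```scala
// Fires iff the candidate tuple is labelled "founded" and any of its
// mentions matched the pattern "M2 was founded by M1".
object MentionFeature {
  def fMen101(yc: String, pooledPatterns: Set[String]): Double =
    if (yc == "founded" && pooledPatterns.contains("M2 was founded by M1")) 1.0
    else 0.0
}
```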
Figure 1: Factor graph of our model, which captures selectional preferences and functionality constraints. For readability we label only a subset of equivalent variables and factors. Note that the graph shows an example assignment to the variables.
It tests whether for any mentions of the candidate tuple the phrase "founded by" appears between the mentions of the argument entities.
3.1.3 Selectional Preference Templates
To capture the correlations between entity types and the relations the entities participate in, we introduce the template T^{Joint}. It connects a relation instance variable Y_{e_1,...,e_n} to the individual entity type variables Y_{e_1}, ..., Y_{e_n}. To measure the compatibility between relation and entity variables, we use one feature f^{Joint}_{r,t_1...t_a} (and weight \theta^{Joint}_{r,t_1...t_a}) for each combination of relation and entity types r, t_1...t_a. f^{Joint}_{r,t_1...t_a} fires when the factor variables are in the state r, t_1...t_a. For example, f^{Joint}_{founded,person,company} fires if Y_{e_1} is in state person, Y_{e_2} in state company, and Y_{e_1,e_2} in state founded.

We also add a template T^{Pair} that measures the pairwise compatibility between the relation variable Y_{e_1,...,e_a} and each entity variable Y_{e_i} in isolation. Here we use features f^{Pair}_{i,r,t} that fire if e_i is the i-th argument of c, has the entity type t, and the candidate tuple c is labelled as an instance of relation r. For example, f^{Pair}_{1,founded,person} fires if Y_{e_1} (argument i = 1) is in state person and Y_{e_1,e_2} in state founded, regardless of the state of Y_{e_2}.
3.2 Inference
There are two types of inference we have to perform: sampling from the posterior during training (see Section 3.3), and finding the most likely configuration (aka MAP inference). In both settings we employ a Gibbs sampler (Geman and Geman, 1990) that randomly picks a variable Y_c and samples its relation value conditioned on its Markov blanket. At test time we decrease the temperature of our sampler in order to find an approximation of the MAP solution.
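The annealed sampling step for a single relation variable can be sketched as follows (the relation inventory and scores are toy assumptions; subtracting the max score keeps exp from overflowing as the temperature drops):

```scala
import scala.util.Random

// Gibbs step with annealing: sample a value ∝ exp(score/T); as T → 0
// the draw concentrates on the argmax, approximating MAP inference.
object AnnealedGibbs {
  def run(scores: Map[String, Double], sweeps: Int, rng: Random): String = {
    var t = 1.0
    var state = scores.keys.head
    val m = scores.values.max
    for (_ <- 1 to sweeps) {
      val unnorm = scores.map { case (r, s) => r -> math.exp((s - m) / t) }
      var u = rng.nextDouble() * unnorm.values.sum
      for ((r, w) <- unnorm) { if (u >= 0 && u < w) state = r; u -= w }
      t *= 0.9 // cool the chain toward the MAP configuration
    }
    state
  }
}
```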
3.3 Training
Most learning methods need to calculate the model expectations (Lafferty et al., 2001) or the MAP configuration (Collins, 2002) before making an update to the parameters. This step of inference is usually the bottleneck for learning, even when performed approximately.

SampleRank (Wick et al., 2009) is a rank-based learning framework that alleviates this problem by performing parameter updates within MCMC inference. Every pair of consecutive samples in the MCMC chain is ranked according to the model and the ground truth, and the parameters are updated when the rankings disagree. This update can follow different schemes; here we use MIRA (Crammer and Singer, 2003). This allows the learner to acquire more supervision per instance, and has led to efficient training for models in which inference is expensive.
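The SampleRank loop itself is compact. A toy sketch with a perceptron-style update in place of MIRA (the chain-labeling objective and features below are illustrative assumptions):

```scala
import scala.util.Random

// SampleRank: propose a local change, rank the two configurations by model
// score and by ground-truth loss, and update weights when the two disagree.
object SampleRankSketch {
  private def features(y: Array[Int]): Array[Double] = Array(
    y.count(_ == 1).toDouble,                        // how many 1s
    y.sliding(2).count(p => p(0) == p(1)).toDouble)  // adjacent agreements

  def train(truth: Array[Int], iters: Int, rng: Random): Array[Double] = {
    val w = Array(0.0, 0.0)
    val y = Array.fill(truth.length)(rng.nextInt(2))
    def score(f: Array[Double]) = f.zip(w).map { case (a, b) => a * b }.sum
    def loss(c: Array[Int]) = c.zip(truth).count { case (a, b) => a != b }
    for (_ <- 1 to iters) {
      val prop = y.clone()
      prop(rng.nextInt(y.length)) = 1 - prop(rng.nextInt(0) + 0) // see note below
      w
    }
    w
  }
}
```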
Entities and Relations [Yao, Riedel, McCallum 2010]

Relation Extraction Experiments: Manual Evaluation, Precision@50
Training Set: 2 years (vs. 1 year)

Isolated   Pipeline   Joint
0.78       0.82       0.94
Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work
Information Extraction into DB

Documents → Extraction & Matching → DB → query processing → answer
Information Extraction into Pr DB

Documents → Extraction & Matching → probabilistic DB → query processing → answer
Information Extraction into Pr DB: The MCMC Alternative [Wick, McCallum, Miklau 2010]

Documents → Extraction & Matching → DB ⇄ MH inference
(The DB contains only one possible world at a time.)
query → SQL → answer
Particle Filtering, with compact representation [Schultz, McCallum, Miklau 2010]

Documents → Extraction & Matching → query → answer
Ongoing Work
• Factored particle filtering
• Query-specific MCMC
• Inference caching wrt E[query]
• Learned proposal distributions
• Bayesian inference in distributed systems
• Probabilistic Database of all of Wikipedia, automatically growing by reading.
Thank you!
Version 0.9 at http://code.google.com/p/factorie