Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

Modeling Missing Data in Distant Supervision for Information Extraction

Alan Ritter (CMU), Luke Zettlemoyer (University of Washington), Mausam (University of Washington), Oren Etzioni (Vulcan Inc.)

TACL, 1, 367-378, 2013.

Presented by Naoaki Okazaki (Tohoku University)

2014-09-05 Modeling Missing Data in Distant Supervision 1

Relation instance extraction

Input: Steven Spielberg’s film Saving Private Ryan is loosely based on the brothers’ story.

Extractor output (film-director relation): Film = Saving Private Ryan, Director = Steven Spielberg

• Fully-supervised learning (Zhou+ 05, …)
  • Uses ACE corpora to build relation-instance classifiers
  • Suffers from the limited amount of training data
• Unsupervised information extraction (Banko+ 07, …)
  • Extracts relational patterns between entities, and clusters the patterns into relations
  • Difficult to map clusters into relations of interest
• Bootstrap learning (Brin 98, …)
  • Uses seed instances to extract a new set of relational patterns
  • Often suffers from low precision (semantic drift)
• Distant supervision (Mintz+ 09, …)
  • Combines the advantages of the above approaches


Distant supervision (Mintz+, 09)

Knowledge base (Person, Birthplace): (Edwin Hubble, Marshfield), …

Automatic annotation: Astronomer Edwin Hubble was born in Marshfield, Missouri.

Feature extraction*

* Each row presents a single feature. Features from different sentences containing the same entity pair are concatenated.

Mintz et al. (2009) Distant supervision for relation extraction without labeled data. ACL-2009, pages 1003–1011.

Problem: An entity pair cannot have multiple relations. E.g., Founded(Jobs, Apple) and CEO-of(Jobs, Apple) are both true.
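The automatic annotation step can be sketched as follows. This is only a minimal illustration, assuming exact string matching of entity names; the relation name `born-in`, the function `annotate`, and the "words between the entities" feature are hypothetical choices for this example, not the feature set of Mintz et al.

```python
# Toy knowledge base: (person, birthplace) -> relation name (hypothetical label).
KB = {("Edwin Hubble", "Marshfield"): "born-in"}

def annotate(sentences, kb):
    """Label every sentence that mentions both entities of a KB fact."""
    labeled = []
    for sent in sentences:
        for (e1, e2), rel in kb.items():
            if e1 in sent and e2 in sent:
                start = sent.index(e1) + len(e1)
                end = sent.index(e2)
                # Simple lexical feature: the words between the two mentions.
                between = sent[start:end].strip() if start < end else ""
                labeled.append({"sentence": sent, "pair": (e1, e2),
                                "relation": rel, "between": between})
    return labeled

SENTENCES = [
    "Astronomer Edwin Hubble was born in Marshfield, Missouri.",
    "Edwin Hubble studied distant galaxies.",
]
EXAMPLES = annotate(SENTENCES, KB)
```

Only the first sentence is labeled, because it is the only one that mentions both entities of the KB fact.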


MultiR (Hoffmann+, 11)

Introduces latent variables ($z_i$) to indicate the relation expressed by sentence $x_i$

For entity pair (Steve Jobs, Apple):

• $\mathbf{y}$: $y_{\text{born-in}} = 0$, $y_{\text{founder}} = 1$, $y_{\text{CEO-of}} = 1$, $y_{\text{capital-of}} = 0$
• $z_1$ = Founder, $x_1$: Steve Jobs was founder of Apple.
• $z_2$ = Founder, $x_2$: Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.
• $z_3$ = CEO-of, $x_3$: Steve Jobs is CEO of Apple.

$$p(\mathbf{y}, \mathbf{z} \mid \mathbf{x}) = \frac{1}{Z_x} \prod_r \Phi^{\text{join}}(y_r, \mathbf{z}) \prod_i \Phi^{\text{extract}}(z_i, x_i)$$

• $x_i$: a sentence containing the entity pair
• $y_r \in \{0, 1\}$: 1 if the knowledge base includes the pair with relation $r$, 0 otherwise
• $z_i \in R$: the relation expressed by sentence $x_i$

$$\Phi^{\text{extract}}(z_i, x_i) = \exp\Big(\sum_j \theta_j \phi_j(z_i, x_i)\Big)$$

(The same as (Mintz+ 09))

$$\Phi^{\text{join}}(y_r, \mathbf{z}) = 1(\neg y_r \lor \exists i: r = z_i)$$

(Deterministic OR)

• $\Phi^{\text{join}}$ ensures that a sentence $x_i$ expressing the relation $r$ exists if $r$ is true
• Allows multiple relations for the same entity pair
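As a rough sketch, the unnormalized joint score (the product of the join and extract factors, without $1/Z_x$) can be computed as below. The feature representation (a set of strings per sentence), the weight dictionary `theta`, and all function names are assumptions made for illustration only.

```python
import math

def phi_extract(theta, z_i, x_i):
    """exp(sum_j theta_j * phi_j(z_i, x_i)) with indicator features (relation, feature)."""
    return math.exp(sum(theta.get((z_i, f), 0.0) for f in x_i))

def phi_join(y_r, r, z):
    """1 if (not y_r) or some sentence expresses r, else 0, as written on the slide."""
    return 1.0 if (not y_r) or (r in z) else 0.0

def unnormalized_joint(theta, y, z, x):
    score = 1.0
    for r, y_r in y.items():
        score *= phi_join(y_r, r, z)
    for z_i, x_i in zip(z, x):
        score *= phi_extract(theta, z_i, x_i)
    return score

# Hypothetical weights and features for two sentences about (Steve Jobs, Apple).
theta = {("founder", "founded"): 1.0, ("CEO-of", "is CEO of"): 2.0}
x = [{"founded"}, {"is CEO of"}]
y = {"founder": 1, "CEO-of": 1}

consistent = unnormalized_joint(theta, y, ["founder", "CEO-of"], x)
violating = unnormalized_joint(theta, y, ["founder", "founder"], x)
```

The second assignment gets score 0 because no sentence expresses CEO-of although $y_{\text{CEO-of}} = 1$; this is the hard constraint that the later slides relax.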


MultiR: Training

• Loop for passes over the training data
  • Loop for entity pairs in the KB
    • Predict sentence-level and KB-level relations (ignoring the facts in the KB)
    • Find an optimal assignment of sentence-level relations consistent with the facts in the KB
    • Update feature weights similarly to the perceptron algorithm

We need two kinds of inference.

Hoffmann et al. (2011) Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. ACL-2011, pages 541–550.
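The loop above can be sketched as a perceptron-style trainer. This is a toy version under several assumptions: the label set `RELATIONS`, the extra `NA` label, the string features, the greedy constrained inference, and the +1/-1 updates are all illustrative choices, not the implementation of Hoffmann et al.

```python
from collections import defaultdict

RELATIONS = ["founder", "CEO-of", "NA"]  # hypothetical label set; "NA" means no relation

def score(w, rel, feats):
    return sum(w[(rel, f)] for f in feats)

def predict(w, sentences):
    """Inference 1: label each sentence independently, ignoring the KB."""
    return [max(RELATIONS, key=lambda r: score(w, r, feats)) for feats in sentences]

def constrained(w, sentences, kb_rels):
    """Inference 2 (greedy): labels restricted to KB facts, each fact covered if possible."""
    allowed = sorted(kb_rels) + ["NA"]
    z = [max(allowed, key=lambda r: score(w, r, feats)) for feats in sentences]
    for r in sorted(kb_rels):
        if r in z:
            continue
        # Sentences that can be relabeled without uncovering another KB fact.
        movable = [i for i in range(len(z)) if z[i] == "NA" or z.count(z[i]) > 1]
        if movable:
            i = min(movable, key=lambda k: score(w, z[k], sentences[k]) - score(w, r, sentences[k]))
            z[i] = r
    return z

def train(data, passes=5):
    w = defaultdict(float)
    for _ in range(passes):
        for sentences, kb_rels in data:
            z_pred = predict(w, sentences)
            if set(z_pred) - {"NA"} == set(kb_rels):
                continue  # KB-level prediction already matches the KB
            z_gold = constrained(w, sentences, kb_rels)
            for feats, zp, zg in zip(sentences, z_pred, z_gold):
                for f in feats:  # perceptron update toward the constrained labels
                    w[(zg, f)] += 1.0
                    w[(zp, f)] -= 1.0
    return w

# Toy data: (sentences as feature sets, relations listed in the KB for the pair).
DATA = [
    ([{"was founder of"}, {"founded"}, {"is CEO of"}], {"founder", "CEO-of"}),
    ([{"founded"}], {"founder"}),
    ([{"is CEO of"}], {"CEO-of"}),
    ([{"visited"}], set()),
]
WEIGHTS = train(DATA)
```

Weights change only when the KB-level prediction disagrees with the KB, which is why both kinds of inference are needed in every iteration.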


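MultiR: Inference 1: $\arg\max_{\mathbf{y}, \mathbf{z}} p(\mathbf{y}, \mathbf{z} \mid \mathbf{x})$

For entity pair (Steve Jobs, Apple), all $y_r$ and $z_i$ are unknown.

Sentence-level scores for each relation (rows) and sentence (columns):

relation      x1     x2     x3
born-in       0.5    8.0    7.0
founder      16.0   11.0    8.0
CEO-of        9.0    6.0    7.0
capital-of    0.1    0.1    0.2

• Predict a relation label for each sentence independently
• Aggregate sentence-level predictions into global-level predictions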


MultiR: Inference 1: $\arg\max_{\mathbf{y}, \mathbf{z}} p(\mathbf{y}, \mathbf{z} \mid \mathbf{x})$

• Predict a relation label for each sentence independently: $\mathbf{z}$ = (founder, founder, founder)
• Aggregate sentence-level predictions into global-level predictions: $\mathbf{y}$ = (0, 1, 0, 0) for (born-in, founder, CEO-of, capital-of)

Very easy to find! Computational cost: $O(|R||\mathbf{x}|)$
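A minimal sketch of this inference, using the scores from the slide's figure. The table layout (one list per relation, one entry per sentence) is my reading of the figure, and the names `SCORES` and `inference1` are hypothetical.

```python
# Sentence-level scores; rows = relations, columns = sentences x1..x3
# (assumed reading of the slide's figure).
SCORES = {
    "born-in":    [0.5, 8.0, 7.0],
    "founder":    [16.0, 11.0, 8.0],
    "CEO-of":     [9.0, 6.0, 7.0],
    "capital-of": [0.1, 0.1, 0.2],
}

def inference1(scores):
    """Pick the best relation per sentence, then OR the labels into y."""
    n = len(next(iter(scores.values())))
    z = [max(scores, key=lambda r: scores[r][i]) for i in range(n)]
    y = {r: int(r in z) for r in scores}
    return y, z

Y_PRED, Z_PRED = inference1(SCORES)
```

Each sentence looks at every relation once, which gives the $O(|R||\mathbf{x}|)$ cost stated on the slide.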


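MultiR: Inference 2: $\arg\max_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \mathbf{y})$

Given $\mathbf{y}$ = (0, 1, 1, 0) for (born-in, founder, CEO-of, capital-of), the $z_i$ are unknown.

Edge weights between relation nodes $y_r$ (rows) and sentence nodes $z_i$ (columns):

relation      z1     z2     z3
born-in       0.5    8      7
founder      16     11      8
CEO-of        9      6      7
capital-of    0.1    0.1    0.2

• Define an edge weight: $w(y_r, z_i) = \Phi^{\text{extract}}(r, x_i)$
• A node with $y_r = 1$ must have at least one edge connecting to a $z_i$
• Each node $z_i$ must have exactly one edge connecting to a $y_r$
• Find a set of edges that maximizes the sum of weights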


MultiR: Inference 2: $\arg\max_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \mathbf{y})$

Given $\mathbf{y}$ = (0, 1, 1, 0), only the edges of founder (16, 11, 8) and CEO-of (9, 6, 7) remain.

Solution: $\mathbf{z}$ = (founder, founder, CEO-of)

• Exact solution in polynomial time
• In practice, an approximate solution by greedy search (assigning a $z_i$ to each node with $y_r = 1$) is sufficient
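One possible greedy approximation is sketched below. It assumes at least one relation has $y_r = 1$ and reuses the scores read from the slide's figure; the repair strategy (relabel the sentence that loses the least weight) is an illustrative choice, not necessarily the exact procedure of Hoffmann et al.

```python
# Edge weights; rows = relations, columns = sentences (assumed reading of the figure).
SCORES = {
    "born-in":    [0.5, 8.0, 7.0],
    "founder":    [16.0, 11.0, 8.0],
    "CEO-of":     [9.0, 6.0, 7.0],
    "capital-of": [0.1, 0.1, 0.2],
}

def inference2(scores, y):
    """Greedy constrained inference; assumes at least one relation has y_r = 1."""
    active = [r for r in scores if y[r] == 1]
    n = len(next(iter(scores.values())))
    # Step 1: each sentence takes its best relation among those with y_r = 1.
    z = [max(active, key=lambda r: scores[r][i]) for i in range(n)]
    # Step 2: cover every active relation that no sentence expresses yet.
    for r in active:
        if r in z:
            continue
        movable = [i for i in range(n) if z.count(z[i]) > 1]
        if not movable:
            break  # fewer sentences than facts: cannot cover everything
        i = min(movable, key=lambda k: scores[z[k]][k] - scores[r][k])
        z[i] = r
    return z

Y_KB = {"born-in": 0, "founder": 1, "CEO-of": 1, "capital-of": 0}
Z_GOLD = inference2(SCORES, Y_KB)
```

On this example the third sentence is relabeled as CEO-of because that switch costs the least weight (8 to 7), matching the assignment shown on the slide.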

Contribution of this work
• MultiR makes two assumptions (hard constraints):
  • If a fact is not found in the database, it cannot be mentioned in the text
  • If a fact is in the database, it must be mentioned in at least one sentence
• Relax MultiR to handle the situation where:
  • A fact is not mentioned in text (MIT)
  • A fact mentioned in text is missing in database (MID)
• Side effect of this relaxation
  • Incorporates the tendency that the knowledge base is likely to include popular entities and relations

Distant Supervision with Data Not Missing at Random (DNMAR)

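Introduce a layer of latent variables ($t_r$) to handle missing cases

For entity pair (Steve Jobs, Apple), with relations (born-in, founder, CEO-of, visit):

• $\mathbf{y}$ (knowledge base): 0 1 1 0
• $\mathbf{t}$ (text): 0 1 0 1
• $\mathbf{z}$ (sentences): Founder, Founder, visit (e.g., "Steve Jobs visited Apple store…")

Introduce a new factor:

$$\phi^{\text{miss}}(y_r, t_r) = \begin{cases} -\alpha_{\text{MIT}} & (y_r = 1 \land t_r = 0) \text{ (missing in text)} \\ -\alpha_{\text{MID}} & (y_r = 0 \land t_r = 1) \text{ (missing in DB)} \\ 0 & \text{(otherwise)} \end{cases}$$

Relaxing the two hard constraints in MultiR into soft ones with penalty factors $-\alpha_{\text{MIT}}$ and $-\alpha_{\text{MID}}$

Training algorithm is the same as the one used in MultiR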
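The new factor is small enough to write out directly. In this sketch the penalty values 2.0 and 3.0 are arbitrary placeholders (the slides tune them on a development set), and `phi_miss` is a hypothetical function name.

```python
def phi_miss(y_r, t_r, alpha_mit, alpha_mid):
    """Soft penalty for disagreement between the KB (y_r) and the text (t_r)."""
    if y_r == 1 and t_r == 0:
        return -alpha_mit  # fact in the KB but missing in text
    if y_r == 0 and t_r == 1:
        return -alpha_mid  # fact in the text but missing in the KB
    return 0.0

# Slide example for (Steve Jobs, Apple): born-in, founder, CEO-of, visit.
y = [0, 1, 1, 0]
t = [0, 1, 0, 1]
ALPHA_MIT, ALPHA_MID = 2.0, 3.0  # placeholder values, not from the paper
penalties = [phi_miss(y_r, t_r, ALPHA_MIT, ALPHA_MID) for y_r, t_r in zip(y, t)]
total_penalty = sum(penalties)
```

Here CEO-of is penalized as missing in text and visit as missing in the database, while born-in and founder agree and cost nothing.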


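Constrained inference: $\arg\max_{\mathbf{z}}$

For entity pair (Steve Jobs, Apple): $\mathbf{y}$ = (0, 1, 1, 0) for (born-in, founder, CEO-of, visit) is given; $\mathbf{z}$ and $\mathbf{t}$ are unknown.

$$\mathbf{z}^* = \arg\max_{\mathbf{z}} \sum_{i=1}^{n} \boldsymbol{\theta} \cdot \boldsymbol{\phi}(z_i, x_i) + \sum_r \Big( \alpha_{\text{MIT}} \cdot 1(y_r \land \exists i: r = z_i) - \alpha_{\text{MID}} \cdot 1(\neg y_r \land \exists i: r = z_i) \Big)$$

Here $\boldsymbol{\theta} \cdot \boldsymbol{\phi}(z_i, x_i) = \log \Phi^{\text{extract}}(z_i, x_i)$.

• Became more challenging
• A* search can find an exact solution, but is not scalable with many variables
• Present a greedy hill climbing approach for the inference:
  1. Initialize $z_i$ at random
  2. Obtain neighborhoods of the current solution
  3. Move to the neighbor yielding the highest score
  4. Repeat this process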
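The four steps can be sketched as follows. This is a simplified illustration: the neighborhood (change the label of one sentence) is my assumption, the figure's scores from the MultiR slides are reused as log-scores, and the penalties of 5.0 are arbitrary. The `init` argument is a hypothetical convenience so that a run can be reproduced.

```python
import random

# Sentence-level log-scores; rows = relations, columns = sentences (assumed values).
SCORES = {
    "born-in":    [0.5, 8.0, 7.0],
    "founder":    [16.0, 11.0, 8.0],
    "CEO-of":     [9.0, 6.0, 7.0],
    "capital-of": [0.1, 0.1, 0.2],
}

def objective(z, scores, y, alpha_mit, alpha_mid):
    total = sum(scores[r][i] for i, r in enumerate(z))
    for r in scores:
        if r in z:  # relation r is mentioned by some sentence
            total += alpha_mit if y[r] else -alpha_mid
    return total

def hill_climb(scores, y, alpha_mit, alpha_mid, init=None, seed=0):
    relations = list(scores)
    n = len(scores[relations[0]])
    rng = random.Random(seed)
    # Step 1: random initialization unless a starting point is given.
    z = list(init) if init is not None else [rng.choice(relations) for _ in range(n)]
    best = objective(z, scores, y, alpha_mit, alpha_mid)
    while True:
        # Step 2: neighbors differ from z in the label of exactly one sentence.
        neighbors = [z[:i] + [r] + z[i + 1:]
                     for i in range(n) for r in relations if r != z[i]]
        scored = [(objective(c, scores, y, alpha_mit, alpha_mid), c) for c in neighbors]
        top_score, top = max(scored, key=lambda s: s[0])
        if top_score <= best:
            return z, best  # no neighbor improves the score
        z, best = top, top_score  # Steps 3-4: move and repeat

Y_KB = {"born-in": 0, "founder": 1, "CEO-of": 1, "capital-of": 0}
Z_A, SCORE_A = hill_climb(SCORES, Y_KB, 5.0, 5.0, init=["founder"] * 3)
Z_B, SCORE_B = hill_climb(SCORES, Y_KB, 5.0, 5.0, init=["founder", "CEO-of", "founder"])
```

The second run shows why the starting point matters: (founder, CEO-of, founder) has no better single-change neighbor, so the search stops there with a lower score than the first run. Restarting from several random initializations is one common way to reduce this effect.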


Incorporating popularity in KB
• We tune the penalty factors $\alpha_{\text{MIT}}$ and $\alpha_{\text{MID}}$ on a development set
• We can take into account how likely each fact is to be observed in the text and the knowledge base
  • Facts about Barack Obama are likely to exist
  • Facts about Naoaki Okazaki are unlikely to exist
• Control the penalty factor for each entity pair
  • Popularity of entities: $\alpha_{\text{MID}}^{(e_1, e_2)} = -\gamma \min(c(e_1), c(e_2))$
    • A larger penalty if the model predicts that a fact about a popular entity does not exist in KB
  • Well-aligned relations: assign 3 kinds of values of $\alpha_{\text{MIT}}^{r}$
    • A larger penalty if a popular relation such as contains, place_lived, and nationality does not exist in text
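A small sketch of the entity-popularity formula. The counts below are invented for illustration, $c(e)$ is assumed to be some popularity count of entity $e$, and the sign convention follows the formula exactly as written on the slide.

```python
def alpha_mid(e1, e2, counts, gamma):
    """alpha_MID for an entity pair: -gamma * min(c(e1), c(e2)), as on the slide."""
    return -gamma * min(counts.get(e1, 0), counts.get(e2, 0))

# Invented popularity counts c(e); real values would come from the knowledge base.
COUNTS = {"Barack Obama": 1000, "Hawaii": 400, "Naoaki Okazaki": 2, "Sendai": 50}
GAMMA = 0.5  # placeholder scaling factor

popular = alpha_mid("Barack Obama", "Hawaii", COUNTS, GAMMA)
obscure = alpha_mid("Naoaki Okazaki", "Sendai", COUNTS, GAMMA)
```

Taking the minimum means the pair is treated as popular only when both entities are, so the magnitude for the pair with an obscure entity stays small.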


Experiments
• Binary relation extraction
  • The standard setting (Riedel+, 10)
    • Knowledge base: Freebase relations
    • Text corpus: 1.8 million New York Times articles
  • Two kinds of evaluation
    • Sentence-level extractions using the dataset (Hoffmann+, 11)
    • Hold-out evaluation on Freebase knowledge
• Unary relation extraction (NE categorization)
  • Twitter NE categorization dataset (Ritter+, 11)
    • Knowledge base: Freebase (instances and their categories)
    • Text corpus: tweets
  • Hold-out evaluation


Results

• 17% increase in area under the curve.
• Incorporating popularity yielded a 27% increase over the baseline.
• The hold-out evaluation underestimates precision because many facts correctly extracted from text are missing in the database.
• DNMAR doubled the recall.

Ritter et al. (2013) Modeling Missing Data in Distant Supervision for Information Extraction, TACL(1), 367-378.


Conclusion
• Investigated the problem of missing data in distant supervision
• Presented an extension of MultiR to handle missing data
  • Could incorporate the popularity of facts to be included in the knowledge base and text
• Presented a scalable inference algorithm based on greedy hill-climbing
• Demonstrated the effectiveness of the modeling


References
• Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, Daniel S. Weld. (2011) Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. ACL-2011, pages 541–550.
  http://raphaelhoffmann.com/publications/acl2011.pdf
  • Slides and code
    http://raphaelhoffmann.com/publications/acl2011-slides.pptx
    http://www.cs.washington.edu/ai/raphaelh/mr
• Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky. (2009) Distant supervision for relation extraction without labeled data. ACL-2009, pages 1003–1011.
  http://web.stanford.edu/%7Ejurafsky/mintz.pdf
