
Open Information Extraction

Mausam

“The Internet is the world’s largest library. It’s just that all the books are on the floor.”

- John Allen Paulos

> 1 Trillion URLs (Google, 2008)


Information Overload


Paradigm Shift: from retrieval to reading

Who won Bigg Boss 7?  →  Gauhar Khan

What sports teams are based in Arizona?  →  Phoenix Suns, Arizona Cardinals, …

(questions answered by reading the World Wide Web)

Paradigm Shift: from retrieval to reading

Quick view of today’s news (from the World Wide Web)

Science Report
  Finding: beer that doesn’t give a hangover
  Researcher: Ben Desbrow
  Country: Australia
  Organization: Griffith Health Institute

Paradigm Shift: from retrieval to reading

Which US West Coast companies are hiring for a software engineer position?  →  Google, Microsoft, Facebook, …

(answered by reading the World Wide Web)

What is Machine Reading?

Text → Assertions → Inferences

Information Extraction (IE)

IE(sentence) = (relation instance, probability)

“Edison was the inventor of the phonograph.”  →  InventorOf(Edison, phonograph), 0.9
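Read as an interface, that definition is just: sentence in, scored relation tuples out. A minimal Python sketch, where the names Extraction and ie are illustrative rather than taken from any real system:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    """One relation instance plus the extractor's confidence."""
    arg1: str
    rel: str
    arg2: str
    confidence: float

def ie(sentence: str) -> list[Extraction]:
    """IE(sentence) = relation instance, probability.
    A real extractor analyzes the sentence; this stub only recognizes
    the slide's example."""
    if "inventor of the phonograph" in sentence:
        return [Extraction("Edison", "InventorOf", "phonograph", 0.9)]
    return []
```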

“You shall know a word by the company it keeps” (Firth, 1957)


Context Clues

• …Baltimore mayor…

• …Baltimore international airport…

• cities such as Chicago, Baltimore, and..

Where do clues come from?


How to Scale IE?

1970s-1980s: heuristic, hand-crafted clues

• Facts from earnings announcements

• Narrow domains; brittle clues

1990s: IE as supervised learning

“Mary was named to the post of CFO, succeeding Joe who retired abruptly.”


Does “IE as supervised learning” scale to reading the Web?

No.

Critique of IE=supervised learning

• Relation specific

• Genre specific

• Hand-crafted training examples

Does not scale to the Web!


Semi-Supervised Learning

• Few hand-labeled examples per relation
• Limit on the number of relations
  – relations are pre-specified
• Still does not scale to the Web!

Lessons from DB/KR Research

• Declarative KR is expensive & difficult

• Formal semantics is at odds with

– Broad scope

– Distributed authorship

• KBs are brittle: “can only be used for tasks whose knowledge needs have been anticipated in advance” (Halevy IJCAI ‘03)


Machine Reading at Web Scale

• A “universal schema” is impossible

• Global consistency is like world peace

• Ontological “glass ceiling”

– Limited vocabulary

– Pre-determined predicates

– Swamped by reading at scale!


Motivation

• General purpose
  – hundreds of thousands of relations
  – thousands of domains
• Scalable: computationally efficient
  – huge body of text on Web and elsewhere
• Scalable: minimal manual effort
  – large-scale human input impractical
• Knowledge needs not anticipated in advance
  – rapidly retargetable

Open IE Guiding Principles

• Domain independence

– Training for each domain/fact type not feasible

• Coherence

– Readability important for human interactions

• Scalability

– Ability to process a large number of documents fast

Open Information Extraction

“When Saddam Hussein invaded Kuwait in 1990, the international …”
→ (Saddam Hussein, invaded, Kuwait)

Open IE: extracting information from natural language text for all relations in all domains in a few passes.

(Google, acquired, YouTube)
(Oranges, contain, Vitamin C)
(Edison, invented, phonograph)
…


Open Information Extraction

“Edison was the inventor of the phonograph.”
→ (Edison, was the inventor of, phonograph)


Open IE

• Avoid hand-labeling sentences

• Single pass over corpus

• No pre-specified vocabulary
  – Challenge: map relation phrase to canonical relation
  – E.g., “was the inventor of” → invented


Open vs. Traditional IE

              Traditional IE               Open IE
Input:        Corpus + hand-labeled data   Corpus + existing resources
Relations:    Specified in advance         Discovered automatically
Complexity:   O(D * R)                     O(D)
              (D documents, R relations)   (D documents)
Output:       Relation-specific            Relation-independent

TextRunner

First Web-scale Open IE system (Banko, IJCAI ‘07)

1,000,000,000 distinct extractions

Peak of 0.9 precision (but low recall)


Trajectory of Open IE

2003     KnowItAll “web reading” project
2007     TextRunner: 1,000,000,000 extractions
2008-9   Synonymy, horn-clause inference
2010-11  ReVerb, ontology mapping, relation properties
2012     OLLIE, event templates, multi-doc summarization

Demo

• http://openie.cs.washington.edu


Open IE Example

The chapter was founded by FANHS, which is headquartered in Seattle.

1. Identify Candidate Args
2. Identify Relation Phrase

Labeled examples ← heuristics ← unlabeled text
  – Patterns [Banko & Etzioni 07], Wikipedia [Wu & Weld 10]
  – Learners: CRF [Banko & Etzioni 07, Wu & Weld 10], Markov Logic Network [Zhu et al. 09]

TR Problem #1: Incoherent Extractions

The guide contains dead links and omits sites.
→ (The guide, contains omits, sites)

Extendicare agreed to buy Arbor for about $432M in assumed debt.
→ (Arbor, for assumed, debt)

≈ 15% of TextRunner’s extractions

TR Problem #2: Uninformative Extractions

Homer made a deal with the devil.
→ (Homer, made, deal)

Existing systems miss verb + noun constructions:

  Jane is an expert in physics     →  is      (should be: is an expert in)
  Robocop takes place in Detroit   →  takes   (should be: takes place in)
  Obama gave a speech on energy    →  gave    (should be: gave a speech on)

Relation Frequency in TextRunner

[Bar chart: % of extractions per relation, for is, has, makes, gives, takes, gets]

Paris is the capital of France
Iran has a role in Afghan talks
Apple made a deal with Google
TCP/IP gave rise to the internet

Syntactic Constraint

Relation phrase must start with a verb and match the pattern:

  V         discovered            V = verb | particle | adv
  V P       died from             P = prep | particle | inf. marker
  V W* P    played a role in      W = noun | adj | adv | det | pron

or multiple contiguous matches to the pattern:

  wants to find a solution for

This rules out incoherent relation phrases such as:

  The guide contains dead links and omits sites.
  ✗ (The guide, contains omits, sites)

ReVerb: Identify Relations from Verbs

1. Find the longest phrase matching a simple syntactic constraint.
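A minimal sketch of this constraint as a regular expression over coarse POS tags. The tag names, the one-class-per-tag simplification (the slide lets adverbs and particles belong to two classes), and the function name are illustrative assumptions, not ReVerb's actual implementation:

```python
import re

# Coarse tag classes from the slide, each tag assigned to a single class:
#   V = verb | particle | adv      P = prep | particle | inf. marker
#   W = noun | adj | adv | det | pron
CLASS = {"VERB": "V", "ADV": "V", "PRT": "P", "ADP": "P", "PART": "P",
         "NOUN": "W", "ADJ": "W", "DET": "W", "PRON": "W"}

# Relation phrase = V | V P | V W* P, allowing multiple contiguous matches.
PATTERN = re.compile(r"(?:VW*P|V)+")

def longest_relation_span(pos_tags):
    """Encode each POS tag as V/P/W ('.' otherwise) and return the
    (start, end) token span of the longest matching run, or None."""
    coded = "".join(CLASS.get(tag, ".") for tag in pos_tags)
    spans = [(m.start(), m.end()) for m in PATTERN.finditer(coded)]
    return max(spans, key=lambda s: s[1] - s[0], default=None)

# "Obama gave a speech on energy" tagged NOUN VERB DET NOUN ADP NOUN
# -> span (1, 5), i.e. "gave a speech on" rather than just "gave".
```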

Lexical Constraint

Problem: “overspecified” relation phrases

  Obama is offering only modest greenhouse gas reduction targets at the conference.

Solution: a relation phrase must take many distinct args in a large corpus

  is offering only modest …     ≈ 1 distinct arg pair: (Obama, the conference)
  is the patron saint of        ≈ 100s of distinct pairs: (Anne, mothers), (George, England), (Hubbins, quality footwear), …
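A sketch of the lexical constraint as a corpus-level filter; the threshold and the function name are illustrative choices, not ReVerb's actual settings:

```python
from collections import defaultdict

def lexically_valid_relations(extractions, min_distinct_arg_pairs=20):
    """Keep relation phrases that take many distinct argument pairs in a
    large corpus. Overspecified phrases such as
    "is offering only modest greenhouse gas reduction targets at"
    occur with roughly one pair and are dropped."""
    pairs_by_rel = defaultdict(set)
    for arg1, rel, arg2 in extractions:
        pairs_by_rel[rel].add((arg1, arg2))
    return {rel for rel, pairs in pairs_by_rel.items()
            if len(pairs) >= min_distinct_arg_pairs}
```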

Sample of ReVerb Relations

inhibits tumor growth in       has a PhD in                  joined forces with
is a person who studies        voted in favor of             won an Oscar for
has a maximum speed of         died from complications of    mastered the art of
gained fame as                 granted political asylum to   is the patron saint of
was the first person to        identified the cause of       wrote the book on

Number of Relations

DARPA MR Domains                 < 50
NYU, Yago                        < 100
NELL                             ~500
DBpedia 3.2                      940
PropBank                         3,600
VerbNet                          5,000
Wikipedia Infoboxes (f > 10)     ~5,000
TextRunner                       100,000+
ReVerb                           1,500,000+

Coverage of the ReVerb Model

Sampled 300 Web sentences from (Wu & Weld): 85% of verb-based relations satisfy the ReVerb constraint.

Limitations of the model:

  8% Non-contiguous
     X was founded in 1995 by Y
     X is produced and hosted by Y
     X shut Y down

  4% Not between args
     Discovered by Y, X …
     … the Y that X discovered

  3% Does not match POS pattern
     X has a lot of faith in Y
     X to attack Y

ReVerb Extraction Algorithm

1. Identify the longest relation phrases satisfying the constraints.
2. Heuristically identify arguments for each relation phrase.

Hudson was born in Hampstead, which is a suburb of London.
→ (Hudson, was born in, Hampstead)
→ (Hampstead, is a suburb of, London)
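Putting the two steps together, a sketch of the loop over one sentence. It reuses longest_relation_span from the earlier sketch; noun_phrases is assumed to be a position-sorted list of NP chunk spans, and the nearest-noun-phrase heuristic only approximates ReVerb's actual argument rules:

```python
def reverb_extract(tokens, pos_tags, noun_phrases):
    """Step 1: longest relation phrase; step 2: nearest noun phrase on
    each side of it. `noun_phrases` is a list of (start, end) spans
    sorted by position."""
    rel = longest_relation_span(pos_tags)                    # step 1
    if rel is None:
        return None
    before = [np for np in noun_phrases if np[1] <= rel[0]]  # NPs left of rel
    after = [np for np in noun_phrases if np[0] >= rel[1]]   # NPs right of rel
    if not before or not after:
        return None
    text = lambda span: " ".join(tokens[span[0]:span[1]])
    return (text(before[-1]), text(rel), text(after[0]))     # step 2
```

For the Hudson sentence above, with NP spans for “Hudson” and “Hampstead”, this would return (Hudson, was born in, Hampstead); a full implementation extracts one tuple per relation phrase, which also yields the second extraction.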

Experiments: Relation Phrases (Etzioni, Fader, Christensen, Soderland, Mausam – IJCAI ‘11)

[Figures: ReVerb experimental results and error analysis]

ReVerb: Summary

• Semantically tractable subset of language

• ReVerb: simple model of relation phrases

• Superfast, highly scalable

• Code/data available at openie.cs.washington.edu

Motivating Examples

“The assassination of Franz Ferdinand, improbable as it may seem, began WWI.”
→ (it, began, WWI)

“Republicans in the Senate filibustered an effort to begin debate on the jobs bill.”
→ (the Senate, filibustered, an effort)

“The plan would reduce the number of teenagers who begin smoking.”
→ (The plan, would reduce the number of, teenagers)

Analysis – arg1 substructure

• Basic Noun Phrases (NN, JJ NN, etc.): 65%
  Chicago was founded in 1833
• Prepositional Attachments (NP PP NP): 19%
  The forest in Brazil is threatened by ranching.
• List (NP, (NP,)* CC NP): 15%
  Google and Apple are headquartered in Silicon Valley.
• Relative Clause (NP (that|WP|WDT)? NP? VP NP): <1%
  Chicago, which is located in Illinois, has three million residents.

Analysis – arg2 substructure

• Basic Noun Phrases (NN, JJ NN, etc.): 60%
  Calcium prevents osteoporosis
• Prepositional Attachments (NP PP NP): 18%
  Barack Obama is one of the presidents of the United States
• List (NP, (NP,)* CC NP): 15%
  A galaxy consists of stars and stellar remnants
• Independent Clause ((that|WP|WDT)? NP? VP NP): 8%
  Scientists estimate that 80% of oil remains a threat.
• Relative Clause (NP (that|WP|WDT)? NP? VP NP): 6%
  The shooter killed a woman who was running from the scene.

ArgLearner: Argument Extraction Methodology

• Break the problem into four parts, each handled by a classifier over the token sequence (… TOK TOK TOK rel TOK TOK TOK …):
  – Identify arg1 right bound
  – Identify arg1 left bound
  – Identify arg2 left bound
  – Identify arg2 right bound
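A sketch of the decomposition; the feature extractor and the classifier interface are placeholders rather than ArgLearner's actual models:

```python
def featurize(tokens, rel_span):
    """Placeholder feature extractor; ArgLearner's real features
    (POS windows, capitalization, punctuation, etc.) are not reproduced."""
    return {"tokens": tokens, "rel_start": rel_span[0], "rel_end": rel_span[1]}

def extract_arguments(tokens, rel_span, classifiers):
    """Each argument bound is predicted by its own classifier.
    `classifiers` maps a bound name to any model exposing
    predict(features) -> token index (a hypothetical interface)."""
    feats = featurize(tokens, rel_span)
    a1_left = classifiers["arg1_left"].predict(feats)
    a1_right = classifiers["arg1_right"].predict(feats)   # often rel_span[0]
    a2_left = classifiers["arg2_left"].predict(feats)     # often rel_span[1]
    a2_right = classifiers["arg2_right"].predict(feats)
    return (" ".join(tokens[a1_left:a1_right]),
            " ".join(tokens[a2_left:a2_right]))
```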

Evaluation on Web Text

[Figure: yield comparison]

Processing time per sentence: ReVerb 0.015 sec; ArgLearner 1.6x; Speed Parse 3.2x; Accuracy Parse 31x.


ReVerb + ArgLearner: Error Analysis

• Last night at CES (Consumer Electronics Show), Steve Balmer, the CEO of Microsoft, held a press conference.

• The first in our list is Stephen Googleheim, born in Virginia in 1953 to Swedish parents.

• After winning the Superbowl, the Giants are now the top dogs of the NFL.

• …is that it makes Judaism different from Christianity and Islam

• Ahmadinejad was elected as the new President of Iran.

OLLIE: Open Language Learning for Information Extraction

OLLIE Architecture

Learning:    ReVerb → Seed Tuples → Bootstrapper → Training Data → Open Pattern Learning → Pattern Templates
Extraction:  Sentence → Pattern Matching → Tuples → Context Analysis → Ext. Tuples


Bootstrapping Approach

[Diagram: ReVerb’s verb-based relations ⊂ verb-based relations ⊂ other syntactic rels ⊂ semantic rels]

The same fact can be expressed in many syntactic forms:

  Federer is coached by Paul Annacone.
  Now coached by Paul Annacone, Federer has …
  Paul Annacone, the coach of Federer, …
  Federer hired Annacone as his new coach.

Bootstrapping

High-quality ReVerb extractions are reduced to extraction lemmas (seeds) and matched against ClueWeb sentences:

(Ahmadinejad, is the current president of, Iran)
→ seed lemmas: ahmadinejad, president, iran
→ matching sentence: “Ahmadinejad, who is the president of Iran, is a puppet for the Ayatollahs.”
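A sketch of the seed-matching step under simplifying assumptions: lowercasing instead of real lemmatization, and bag-of-words containment instead of OLLIE's full matching and filtering:

```python
def bootstrap_training_sentences(seed_tuples, sentences, stopwords):
    """Reduce each high-confidence ReVerb tuple to its content-word lemmas
    and collect corpus sentences containing all of them as candidate
    training sentences for that fact."""
    training = []
    for arg1, rel, arg2 in seed_tuples:
        lemmas = {w.lower() for w in f"{arg1} {rel} {arg2}".split()} - stopwords
        for sent in sentences:
            words = {w.lower().strip(".,") for w in sent.split()}
            if lemmas <= words:
                training.append((sent, (arg1, rel, arg2)))
    return training

# Seed (Ahmadinejad, is the current president of, Iran) reduces to
# {ahmadinejad, president, iran} after stopword removal, which matches
# "Ahmadinejad, who is the president of Iran, is a puppet for the Ayatollahs."
```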


Pattern Templates

{arg1} ↓rcmod↓ {rel:NN:President} ↓prep_of↓ {arg2}
→ (arg1, be President of, arg2)

Open Pattern Templates

Can we generalize this pattern to all relations (beyond Presidents)?

Example: (Obama, is President of, US)   “US President Obama gives us hope.”

{arg2} ↑nn↑ {arg1} ↓nn↓ {rel:NN}
→ (arg1, be {rel} of, arg2)

But applied blindly, the generalized pattern also produces bad extractions:
✗ (Department, be Police of, NY)
✗ (Department, be NY of, Police)
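A sketch of how one such template could be matched against a dependency parse. The edge-list representation and the function are illustrative; real OLLIE templates also encode edge direction, slot nodes, and lexical or type constraints:

```python
def match_rcmod_prep_of(edges, pos_of):
    """Match the template {arg1} -rcmod-> {rel:NN} -prep_of-> {arg2},
    yielding (arg1, "be {rel} of", arg2). `edges` is a list of
    (head, label, dependent) triples; `pos_of` maps tokens to POS tags."""
    results = []
    for head, label, dep in edges:
        if label == "rcmod" and pos_of.get(dep) == "NN":
            arg1, rel = head, dep
            for head2, label2, dep2 in edges:
                if head2 == rel and label2 == "prep_of":
                    results.append((arg1, f"be {rel} of", dep2))
    return results

# "Ahmadinejad, who is the president of Iran, ..." parses (roughly) with
# rcmod(Ahmadinejad, president) and prep_of(president, Iran), giving
# (Ahmadinejad, "be president of", Iran).
```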

Open Pattern Templates (Syntactic)

• Syntactic checks
  – Relation in the middle of the pattern
  – No NN/amod edges
  – Prepositions in relation/pattern match

Syntactic Generalization:

{arg1} ↓rcmod↓ {rel:NN:President} ↓prep_of↓ {arg2}   →   (arg1, be President of, arg2)
generalizes to
{arg1} ↓rcmod↓ {rel:NN} ↓prep↓ {arg2}   →   (arg1, be {rel} {prep}, arg2)
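A sketch of the generalization step as a string rewrite over an invented flat pattern format; OLLIE works on dependency-path patterns, so this only illustrates dropping the lexical item and the specific preposition:

```python
def generalize_pattern(pattern, rel_word, prep):
    """Turn a learned, relation-specific pattern into an open template by
    replacing the concrete relation noun and preposition with slots."""
    open_pattern = (pattern
                    .replace("{rel:NN:" + rel_word + "}", "{rel:NN}")
                    .replace("prep_" + prep, "prep"))
    return open_pattern, "(arg1, be {rel} {prep}, arg2)"

# generalize_pattern("{arg1} rcmod {rel:NN:President} prep_of {arg2}",
#                    "President", "of")
# -> ("{arg1} rcmod {rel:NN} prep {arg2}", "(arg1, be {rel} {prep}, arg2)")
```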


Open Pattern Templates (Semantic)

• Other patterns are not always applicable … however, they are not completely useless

(Obama, is President of, US)   “US President Obama gives us hope.”

{arg2} ↑nn↑ {arg1} ↓nn↓ {rel:NN}   →   (arg1, be {rel} of, arg2)
{arg2} ↑nn↑ {arg1} ↓nn↓ {rel:NN}   →   (arg1, be {rel} of, arg2), rel in {president, chairman, CEO, …}
{arg2} ↑nn↑ {arg1} ↓nn↓ {rel:NN}   →   (arg1, be {rel} of, arg2), rel in Person-Nouns

Type Generalization

Skipping over Nodes

(Ahmadinejad, is President of, Iran)
“Ahmadinejad was elected as the president of Iran”

{arg1} ↑nsubjpass↑ {slot:VBN} ↓prep_as↓ {rel:NNP} ↓prep_of↓ {arg2}
→ (arg1, be {rel} of, arg2), slot in {elect, select, choose, …}

Goal

“I learned that the 2012 Sasquatch music festival is scheduled for May 25th until May 28th—all day.”
→ (the 2012 Sasquatch music festival, is scheduled for, May 25th)

Pattern Extractor

Learned pattern templates are matched against new sentences, e.g.:
{arg1} <nsubjpass< {rel:VBN} >prep> {arg2}


Motivating Examples

“Early astronomers believed that the earth is the center of the universe.”

(earth, is the center of, universe)

“If he wins five key states, Romney will be elected President.”

(Romney, will be elected, President)

Context Analysis

“Early astronomers believed that the earth is the center of the universe.”

[(earth, is the center of, universe) Attribution: early astronomers]

“If he wins five key states, Romney will be elected President.”

[(Romney, will be elected, President) Modifier: if he wins five key states]
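A sketch of context analysis as a post-processing step. The verb and marker lists, and the way the two inputs would be read off the dependency parse, are assumptions for illustration:

```python
ATTRIBUTION_VERBS = {"believe", "say", "claim", "suggest"}    # illustrative
CONDITIONAL_MARKERS = {"if", "unless", "when"}

def analyze_context(extraction, governor_lemma=None, clause_marker=None):
    """Rather than asserting a tuple extracted from an embedded or
    conditional clause, attach its attribution or modifier. The governing
    verb and clause marker would come from the dependency parse."""
    fields = {"tuple": extraction}
    if governor_lemma in ATTRIBUTION_VERBS:
        fields["attribution"] = governor_lemma
    if clause_marker in CONDITIONAL_MARKERS:
        fields["modifier"] = clause_marker
    return fields

# analyze_context(("earth", "is the center of", "universe"),
#                 governor_lemma="believe")
# -> {"tuple": (...), "attribution": "believe"}
```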

Ranking Function

• Supervised learning

• Features

– Frequency of pattern in training set

– Lexical/POS features

– Length/coverage features

– …
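An illustrative assembly of the listed feature types; the concrete features are assumptions, not OLLIE's actual feature set:

```python
def confidence_features(pattern_freq, rel_tokens, sent_tokens):
    """Feature vector mirroring the slide's list: pattern frequency in the
    training set, a lexical cue, and length/coverage features."""
    return [
        pattern_freq,                                        # pattern frequency
        int(bool(rel_tokens) and rel_tokens[0].lower() in {"is", "was", "be"}),  # lexical cue
        len(rel_tokens),                                     # length
        len(rel_tokens) / max(len(sent_tokens), 1),          # coverage
    ]
```

Such a vector would feed a supervised classifier (for example logistic regression), whose output probability serves as the extraction confidence.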

Evaluation (Mausam, Schmitz, Bart, Soderland, Etzioni – EMNLP ’12)

[Figure: precision vs. yield curves for OLLIE, ReVerb, and WOE-parse]

Noun-based Relations

Relation             OLLIE    ReVerb   Increase
is the capital of     8,566      146     59x
is president of      21,306    1,970     11x
is professor at       8,334      400     21x
is scientist of         730        5    146x

Semantic Patterns

Context Analysis

Summary

• Bootstrapping based on ReVerb
  – Look for args as well as relations when bootstrapping
• Generalization
  – Syntactic and semantic generalizations of learned patterns
• Context around an extraction
  – Obtains higher precision than ReVerb
• Syntactically different ways of expressing a relation
  – Obtains much higher recall than ReVerb
• Code
  – Available at http://ollie.cs.washington.edu

Demo

• www.cse.iitd.ac.in/nlpdemo
