
Information Extraction

Mausam (based on slides of Heng Ji, Dan Jurafsky, Chris Manning, Ray Mooney, Alan Ritter, Alex Yates)

Extracting information from text

• Company report: “International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…”

• Extracted Complex Relation: Company-Founding

Company: IBM
Location: New York
Date: June 16, 1911
Original-Name: Computing-Tabulating-Recording Co.

• But we will focus on the simpler task of extracting relation triples:

Founding-year(IBM, 1911)

Founding-location(IBM, New York)

Extracting Relation Triples from Text

The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford…founded the university in 1891

Stanford EQ Leland Stanford Junior University

Stanford LOC-IN California

Stanford IS-A research university

Stanford LOC-NEAR Palo Alto

Stanford FOUNDED-IN 1891

Stanford FOUNDER Leland Stanford

Why Information Extraction?

• Create new structured knowledge bases, useful for any app

• Augment current knowledge bases
• Adding words to WordNet thesaurus, facts to FreeBase or DBPedia

• Support question answering
• The granddaughter of which actor starred in the movie “E.T.”?

(acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)

• But which relations should we extract?


Early Information Extraction

• FRUMP (DeJong, 1979) was an early information extraction system that processed news stories and identified various types of events (e.g. earthquakes, terrorist attacks, floods).

• Used “sketchy scripts” of various events to identify specific pieces of information about such events.

• Able to summarize articles in multiple languages.

• Relied on “brittle” hand-built symbolic knowledge structures that were hard to build and not very robust.


MUC

• DARPA funded significant efforts in IE in the early-to-mid 1990s.

• Message Understanding Conference (MUC) was an annual event/competition where results were presented.

• Focused on extracting information from news articles:
• Terrorist events

• Industrial joint ventures

• Company management changes

• Information extraction of particular interest to the intelligence community (CIA, NSA).

• Established standard evaluation methodology using development (training) and test data and metrics: precision, recall, F-measure.

Automated Content Extraction (ACE)

The six major relation classes and their subtypes:

PHYSICAL: Located, Near
PART-WHOLE: Geographical, Subsidiary
PERSON-SOCIAL: Business, Family, Lasting Personal
GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
ORG AFFILIATION: Employment, Membership, Ownership, Student-Alum, Investor, Founder, Sports-Affiliation
ARTIFACT: User-Owner-Inventor-Manufacturer

17 relations from the 2008 “Relation Extraction Task”

Automated Content Extraction (ACE)

• Physical-Located PER-GPE

He was in Tennessee

• Part-Whole-Subsidiary ORG-ORG

XYZ, the parent company of ABC

• Person-Social-Family PER-PER

John’s wife Yoko

• Org-AFF-Founder PER-ORG

Steve Jobs, co-founder of Apple…


UMLS: Unified Medical Language System

• 134 entity types, 54 relations

Injury disrupts Physiologic Function
Bodily Location location-of Biologic Function
Anatomical Structure part-of Organism
Pharmacologic Substance causes Pathologic Function
Pharmacologic Substance treats Pathologic Function

Extracting UMLS relations from a sentence

Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes

Echocardiography, Doppler DIAGNOSES Acquired stenosis


Databases of Wikipedia Relations


Relations extracted from the Wikipedia Infobox:

Stanford state California
Stanford motto “Die Luft der Freiheit weht”
…

Relation databases that draw from Wikipedia

• Resource Description Framework (RDF) triples: subject predicate object

Golden Gate Park location San Francisco

dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco

• DBPedia: 1 billion RDF triples, 385 million from English Wikipedia

• Frequent Freebase relations:
people/person/nationality, location/location/contains
people/person/profession, people/person/place-of-birth
biology/organism_higher_classification, film/film/genre


Ontological relations

• IS-A (hypernym): subsumption between classes

• Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal…

• Instance-of: relation between individual and class

• San Francisco instance-of city

Examples from the WordNet Thesaurus

How to build relation extractors

1. Hand-written patterns

2. Supervised machine learning

3. Semi-supervised and unsupervised
• Bootstrapping (using seeds)

• Distant supervision

• Unsupervised learning from the web

Hand-Written Patterns


Rules for extracting IS-A relation

Early intuition from Hearst (1992)

• “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”

• What does Gelidium mean?

• How do you know?


Hearst’s Patterns for extracting IS-A relations

(Hearst, 1992): Automatic Acquisition of Hyponyms

“Y such as X ((, X)* (, and|or) X)”

“such Y as X”

“X or other Y”

“X and other Y”

“Y including X”

“Y, especially X”

Hearst’s Patterns for extracting IS-A relations

Hearst pattern Example occurrences

X and other Y ...temples, treasuries, and other important civic buildings.

X or other Y Bruises, wounds, broken bones or other injuries...

Y such as X The bow lute, such as the Bambara ndang...

Such Y as X ...such authors as Herrick, Goldsmith, and Shakespeare.

Y including X ...common-law countries, including Canada and England...

Y , especially X European countries, especially France, England, and Spain...
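A minimal sketch of how a few of these patterns can be applied with regular expressions (the pattern set is deliberately simplified: hypernyms are limited to one or two lowercase words, and list constructions like “X, X, and X” are not handled):

```python
import re

# A few Hearst-style patterns as regexes over plain text.
# Group "hyper" captures the hypernym (Y), group "hypo" the hyponym (X).
HEARST_PATTERNS = [
    re.compile(r"(?P<hyper>[a-z]+(?: [a-z]+)?),? such as (?P<hypo>[A-Z]\w+)"),   # "Y such as X"
    re.compile(r"(?P<hypo>[A-Z]\w+) and other (?P<hyper>[a-z]+(?: [a-z]+)?)"),   # "X and other Y"
    re.compile(r"(?P<hyper>[a-z]+(?: [a-z]+)?), including (?P<hypo>[A-Z]\w+)"),  # "Y, including X"
    re.compile(r"(?P<hyper>[a-z]+(?: [a-z]+)?), especially (?P<hypo>[A-Z]\w+)"), # "Y, especially X"
]

def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs found by any pattern."""
    pairs = []
    for pattern in HEARST_PATTERNS:
        for m in pattern.finditer(sentence):
            pairs.append((m.group("hypo"), m.group("hyper")))
    return pairs

print(extract_isa("Agar is a substance prepared from a mixture of red algae, "
                  "such as Gelidium, for laboratory or industrial use"))
# -> [('Gelidium', 'red algae')]
```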

Extracting Richer Relations Using Rules

• Intuition: relations often hold between specific entities

• located-in (ORGANIZATION, LOCATION)

• founded (PERSON, ORGANIZATION)

• cures (DRUG, DISEASE)

• Start with Named Entity tags to help extract relation!

Named Entities aren’t quite enough.
Which relations hold between 2 entities?

Drug Disease

Cure?

Prevent?

Cause?

What relations hold between 2 entities?

PERSON ORGANIZATION

Founder?

Investor?

Member?

Employee?

President?

Extracting Richer Relations Using Rules and Named Entities

Who holds what office in what organization?

PERSON, POSITION of ORG

• George Marshall, Secretary of State of the United States

PERSON (named|appointed|chose|etc.) PERSON Prep? POSITION

• Truman appointed Marshall Secretary of State

PERSON [be]? (named|appointed|etc.) Prep? ORG POSITION

• George Marshall was named US Secretary of State
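A small sketch of this idea, assuming the sentence has already been run through an NER tagger that marks entities inline (the tag format and the tiny position list are illustrative assumptions, not part of any standard toolkit):

```python
import re

# Assume an upstream NER step has wrapped entity mentions in inline tags
# like [PER Truman]; this tag format and the POSITION list are illustrative.
TAGGED = "[PER Truman] appointed [PER Marshall] Secretary of State"

POSITIONS = r"(?P<position>Secretary of State|President|CEO|Prime Minister)"

# PERSON (named|appointed|chose) PERSON Prep? POSITION
APPOINT = re.compile(
    r"\[PER (?P<who>[^\]]+)\] (?:named|appointed|chose) "
    r"\[PER (?P<whom>[^\]]+)\] (?:as |to )?" + POSITIONS
)

m = APPOINT.search(TAGGED)
if m:
    print(m.group("who"), "appointed", m.group("whom"), "as", m.group("position"))
# -> Truman appointed Marshall as Secretary of State
```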

Hand-built patterns for relations

• Plus:

• Human patterns tend to be high-precision

• Can be tailored to specific domains

• Minus

• Human patterns are often low-recall

• A lot of work to think of all possible patterns!

• Don’t want to have to do this for every relation!

• We’d like better accuracy

Supervised Algorithms


Supervised machine learning for relations

• Choose a set of relations we’d like to extract

• Choose a set of relevant named entities

• Find and label data
• Choose a representative corpus

• Label the named entities in the corpus

• Hand-label the relations between these entities

• Break into training, development, and test

• Train a classifier on the training set


How to do classification in supervised relation extraction

1. Find all pairs of named entities (usually in same sentence)

2. Decide if 2 entities are related

3. If yes, classify the relation

• Why the extra step?

• Faster classification training by eliminating most pairs

• Can use distinct feature-sets appropriate for each task.


Automated Content Extraction (ACE)


17 sub-relations of 6 relations from 2008 “Relation Extraction Task”

Relation Extraction

Classify the relation between two entities in a sentence

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

SUBSIDIARY

FAMILY

EMPLOYMENT

NIL

FOUNDER

CITIZEN

INVENTOR…

Word Features for Relation Extraction

• Headwords of M1 and M2, and their combination: Airlines, Wagner, Airlines-Wagner

• Bag of words and bigrams in M1 and M2: {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}

• Words or bigrams in particular positions left and right of M1/M2:
M2 -1: spokesman
M2 +1: said

• Bag of words or bigrams between the two entities: {a, AMR, of, immediately, matched, move, spokesman, the, unit}

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
(Mention 1 = American Airlines, Mention 2 = Tim Wagner)
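A sketch of how these surface features might be collected into a feature dictionary for one candidate mention pair (the tokenization and the helper are illustrative simplifications, e.g. the headword is approximated by the mention's last token):

```python
def word_features(tokens, m1_span, m2_span):
    """Bag-of-words style features for a candidate mention pair.
    m1_span/m2_span are (start, end) token indices, end exclusive."""
    feats = {}
    m1 = tokens[m1_span[0]:m1_span[1]]
    m2 = tokens[m2_span[0]:m2_span[1]]
    # Headwords (approximated as the last token of each mention) and their combination.
    feats["head_m1=" + m1[-1]] = 1
    feats["head_m2=" + m2[-1]] = 1
    feats["heads=" + m1[-1] + "-" + m2[-1]] = 1
    # Bag of words inside the mentions.
    for w in m1 + m2:
        feats["mention_word=" + w] = 1
    # Words immediately left/right of M2.
    if m2_span[0] > 0:
        feats["m2_minus1=" + tokens[m2_span[0] - 1]] = 1
    if m2_span[1] < len(tokens):
        feats["m2_plus1=" + tokens[m2_span[1]]] = 1
    # Bag of words between the two mentions.
    for w in tokens[m1_span[1]:m2_span[0]]:
        feats["between=" + w] = 1
    return feats

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said .").split()
print(word_features(tokens, (0, 2), (14, 16)))
```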

Named Entity Type and Mention Level Features for Relation Extraction

• Named-entity types
• M1: ORG
• M2: PERSON

• Concatenation of the two named-entity types
• ORG-PERSON

• Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
• M1: NAME [it or he would be PRONOUN]
• M2: NAME [the company would be NOMINAL]

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
(Mention 1 = American Airlines, Mention 2 = Tim Wagner)

Parse Features for Relation Extraction

• Base syntactic chunk sequence from one to the other: NP NP PP VP NP NP

• Constituent path through the tree from one to the other: NP NP S S NP

• Dependency path: Airlines matched Wagner said

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
(Mention 1 = American Airlines, Mention 2 = Tim Wagner)
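One way to compute such a dependency path in practice is with an off-the-shelf parser; the sketch below uses spaCy (assuming the en_core_web_sm model has been downloaded) and a simple lowest-common-ancestor walk. The exact path depends on the parse the model returns:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
doc = nlp("American Airlines, a unit of AMR, immediately matched the move, "
          "spokesman Tim Wagner said.")

def dependency_path(tok1, tok2):
    """Surface form of the dependency path between two tokens."""
    up = [tok1] + list(tok1.ancestors)        # tok1 up to the root
    down = [tok2] + list(tok2.ancestors)      # tok2 up to the root
    shared = set(up) & set(down)
    lca = next(t for t in up if t in shared)  # lowest common ancestor
    left = up[: up.index(lca) + 1]            # tok1 ... lca
    right = down[: down.index(lca)]           # tok2 ... (excluding lca)
    return [t.text for t in left] + [t.text for t in reversed(right)]

airlines = next(t for t in doc if t.text == "Airlines")
wagner = next(t for t in doc if t.text == "Wagner")
print(dependency_path(airlines, wagner))   # e.g. ['Airlines', 'matched', 'said', 'Wagner']
```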

Gazetteer and trigger word features for relation extraction

• Trigger list for family: kinship terms
• parent, wife, husband, grandparent, etc. [from WordNet]

• Gazetteer:
• Lists of useful geo or geopolitical words

• Country name list

• Other sub-entities

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Classifiers for supervised methods

• Now you can use any classifier you like

• MaxEnt

• Naïve Bayes

• SVM

• ...

• Train it on the training set, tune on the dev set, test on the test set

• Can also cast as sequence labeling – use CRFs
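For example, feature dictionaries like those above can be vectorized and fed to any off-the-shelf classifier; a minimal scikit-learn sketch with toy data standing in for a hand-labeled corpus (logistic regression here plays the role of MaxEnt):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: one feature dict per candidate mention pair, with a gold label.
X_train = [
    {"between=unit": 1, "between=of": 1, "types=ORG-ORG": 1},
    {"between=spokesman": 1, "types=ORG-PER": 1},
    {"between=wife": 1, "types=PER-PER": 1},
]
y_train = ["SUBSIDIARY", "EMPLOYMENT", "FAMILY"]

# DictVectorizer turns feature dicts into sparse vectors; LogisticRegression is
# one choice of classifier; an SVM or naive Bayes would slot in the same way.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print(model.predict([{"between=unit": 1, "types=ORG-ORG": 1}]))  # likely ['SUBSIDIARY']
```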


Using Background Knowledge (Chan and Roth, 2010)

• Features employed are usually restricted to being defined on the various representations of the target sentences

• Humans rely on background knowledge to recognize relations

• Overall aim of this work
• Propose methods of using knowledge or resources that exist beyond the sentence

• Wikipedia, word clusters, hierarchy of relations, entity type constraints, coreference

• As additional features, or under the Constraint Conditional Model (CCM) framework with Integer Linear Programming (ILP)

Using Background Knowledge

Example sentence: “David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team”


Using Background Knowledge

Wikipedia excerpts used as background knowledge:

“David Brian Cone (born January 2, 1963) is a former Major League Baseball pitcher. He compiled an 8–3 postseason record over 21 postseason starts and was a part of five World Series championship teams (1992 with the Toronto Blue Jays and 1996, 1998, 1999 & 2000 with the New York Yankees). He had a career postseason ERA of 3.80. He is the subject of the book A Pitcher's Story: Innings With David Cone by Roger Angell. Fans of David are known as "Cone-Heads." Cone lives in Stamford, Connecticut, and is formerly a color commentator for the Yankees on the YES Network.”

“Partly because of the resulting lack of leadership, after the 1994 season the Royals decided to reduce payroll by trading pitcher David Cone and outfielder Brian McRae, then continued their salary dump in the 1995 season. In fact, the team payroll, which was always among the league's highest, was sliced in half from $40.5 million in 1994 (fourth-highest in the major leagues) to $18.5 million in 1996 (second-lowest in the major leagues)”


Using Background Knowledge

fine-grained                      coarse-grained
Employment:Staff        0.20      Employment    0.35
Employment:Executive    0.15
Personal:Family         0.10      Personal      0.40
Personal:Business       0.10
Affiliation:Citizen     0.20      Affiliation   0.25
Affiliation:Based-in    0.25



Knowledge1: Word Class Information (as additional feature)

• Supervised systems face an issue of data sparseness (of lexical features)

• Use class information of words to support generalization better: instantiated as word clusters in our work
• Automatically generated from unlabeled texts using the algorithm of (Brown et al., 1992)

[Figure: a binary word-cluster tree; leaves such as apple, pear, Apple, IBM and bought, run, of, in; a word’s cluster is the bit string of 0/1 branching decisions from the root, e.g. IBM = 011.]


Knowledge1: Word Class Information

• All lexical features consisting of single words will be duplicated with its corresponding bit-string representation

apple pear Apple IBM

0 1 0 1

0 1

bought run of in

0 1 0 1

0 1

0 1

50

00 01 10 11
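A minimal sketch of this duplication step, assuming a precomputed word-to-bit-string table (the cluster paths below are invented for illustration; real ones come from running Brown clustering over unlabeled text):

```python
# Hypothetical Brown-cluster bit strings (real ones come from unlabeled text).
BROWN_CLUSTERS = {
    "apple": "0100", "pear": "0101", "Apple": "0110", "IBM": "0111",
    "bought": "1100", "run": "1101", "of": "1110", "in": "1111",
}

def add_cluster_features(feats, prefixes=(2, 4)):
    """For every single-word lexical feature, add copies that replace the word
    with prefixes of its cluster bit string (a coarse-to-fine generalization)."""
    new_feats = dict(feats)
    for name, value in feats.items():
        key, _, word = name.partition("=")
        bits = BROWN_CLUSTERS.get(word)
        if bits:
            for p in prefixes:
                new_feats[f"{key}_cluster{p}={bits[:p]}"] = value
    return new_feats

print(add_cluster_features({"head_m2=IBM": 1, "between=of": 1}))
# adds e.g. head_m2_cluster2=01, head_m2_cluster4=0111, between_cluster2=11, ...
```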


Knowledge2: Wikipedia (as additional feature)

• We use a Wikifier system (Ratinov et al., 2010) which performs context-sensitive mapping of mentions to Wikipedia pages

• Introduce a new feature based on:

w1(mi, mj) = 1 if A_mj(mi) or A_mi(mj), and 0 otherwise

• Introduce a new feature by combining the above with the coarse-grained entity types of mi, mj

[Figure: mention pair mi, mj with unknown relation r]

Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)

argmax_y  w · φ(x, y)  −  Σ_k ρ_k · d(y, 1_{C_k(x)})

• w · φ(x, y): weight vector for the “local” models (a collection of classifiers)
• ρ_k: penalty for violating constraint k
• d(y, 1_{C_k(x)}): how far y is from a “legal” assignment


Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)

• Wikipedia
• word clusters
• hierarchy of relations
• entity type constraints
• coreference


Constraint Conditional Models (CCMs)


• Key steps
• Write down a linear objective function
• Write down constraints as linear inequalities
• Solve using integer linear programming (ILP) packages


Knowledge3: Hierarchy of Relations

• The target relations are themselves related: the labels form a two-level hierarchy. Each coarse-grained relation (e.g. personal, employment, …) has fine-grained children (personal: family, biz; employment: executive, staff; …).

• A coarse-grained classifier and a fine-grained classifier each make predictions for the mention pair (mi, mj); the hierarchy is used to keep their predictions consistent.

Knowledge3: Hierarchy of Relations
Write down a linear objective function

max Σ_R Σ_{rc ∈ L_Rc} p(rc) · x_{R,rc}  +  Σ_R Σ_{rf ∈ L_Rf} p(rf) · y_{R,rf}

• p(rc), p(rf): coarse-grained and fine-grained prediction probabilities
• x_{R,rc}, y_{R,rf}: coarse-grained and fine-grained indicator variables (an indicator variable == a relation assignment)


Knowledge3: Hierarchy of Relations
Write down constraints

• If a relation R is assigned a coarse-grained label rc, then we must also assign to R a fine-grained relation rf which is a child of rc:

x_{R,rc} ≤ y_{R,rf1} + y_{R,rf2} + … + y_{R,rfn}

• (Capturing the inverse relationship) If we assign rf to R, then we must also assign to R the parent of rf, which is the corresponding coarse-grained label:

y_{R,rf} ≤ x_{R,parent(rf)}
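Chan and Roth solve the full problem jointly with an ILP solver; for a single mention pair that receives exactly one coarse and one fine label, the same optimum can be found by enumerating consistent (coarse, fine) pairs. A small sketch under that assumption, using the toy probabilities from the David Cone example:

```python
# Toy prediction probabilities from the example above.
P_COARSE = {"Employment": 0.35, "Personal": 0.40, "Affiliation": 0.25}
P_FINE = {
    "Employment:Staff": 0.20, "Employment:Executive": 0.15,
    "Personal:Family": 0.10, "Personal:Business": 0.10,
    "Affiliation:Citizen": 0.20, "Affiliation:Based-in": 0.25,
}
PARENT = {rf: rf.split(":")[0] for rf in P_FINE}   # fine label -> its coarse parent

def best_consistent_assignment():
    """Maximize p(rc)*x_{R,rc} + p(rf)*y_{R,rf} subject to the hierarchy
    constraints: the chosen fine label must be a child of the chosen coarse label."""
    scored = [(rf, PARENT[rf], round(P_FINE[rf] + P_COARSE[PARENT[rf]], 2))
              for rf in P_FINE]
    return max(scored, key=lambda t: t[2])

print(best_consistent_assignment())   # ('Employment:Staff', 'Employment', 0.55)
```

Note that the unconstrained predictions disagree (Personal has the highest coarse score, Affiliation:Based-in the highest fine score); enforcing the hierarchy constraints instead selects the consistent pair Employment / Employment:Staff.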

Knowledge4: Entity Type Constraints (Roth and Yih, 2004, 2007)

• Entity types are useful for constraining the possible labels that a relation R between (mi, mj) can assume, e.g.:

Employment:Staff        (per, org)
Employment:Executive    (per, org)
Personal:Family         (per, per)
Personal:Business       (per, per)
Affiliation:Citizen     (per, gpe)
Affiliation:Based-in    (org, gpe)

• We gather information on entity type constraints from the ACE-2004 documentation and impose them on the coarse-grained relations
• By improving the coarse-grained predictions and combining them with the hierarchical constraints defined earlier, the improvements propagate to the fine-grained predictions

Knowledge5: Coreference

• In this work, we assume that we are given the coreference information, which is available from the ACE annotation

• If the two mentions mi and mj corefer, then no relation should hold between them: the relation is labeled null

Experiment Results

F1% improvement from using each knowledge source

              All nwire    10% of nwire
BasicRE       50.5%        31.0%

Summary: Supervised Relation Extraction

+ Can get high accuracies with enough hand-labeled training data, if the test set is similar enough to the training set

- Labeling a large training set is expensive

- Supervised models are brittle, and don’t generalize well to different genres

Bootstrapping Approaches


Rule Learning

• Thinking up some patterns for hyponyms might not be too hard, but what about some new relationship?
• E.g., enzymes and the molecular pathway(s) they’re involved in?
• Cities and their mayors? Films and their directors?

• Can we automate the process of identifying patterns?

• Rule learning automates this process, if it is given some examples of the relationship of interest.
• For instance, some example enzyme names and the names of the pathways they’re involved in.

Seed-based or bootstrapping approaches to relation extraction

• No training set? Maybe you have:

• A few seed tuples or

• A few high-precision patterns

• Can you use those seeds to do something useful?

• Bootstrapping: use the seeds to directly learn to populate a relation

Relation Bootstrapping (Hearst 1992)

• Gather a set of seed pairs that have relation R

• Iterate:

1. Find sentences with these pairs

2. Look at the context between or around the pair and generalize the context to create patterns

3. Use the patterns to grep for more pairs

Bootstrapping for Relation Extraction

Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table (and repeat)

Bootstrapping

• <Mark Twain, Elmira> Seed tuple
• Grep (google) for the environments of the seed tuple

“Mark Twain is buried in Elmira, NY.”

X is buried in Y

“The grave of Mark Twain is in Elmira”

The grave of X is in Y

“Elmira is Mark Twain’s final resting place”

Y is X’s final resting place.

• Use those patterns to grep for new tuples

• Iterate
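A toy sketch of this loop; the corpus and the pattern generalization (keep the literal context between the two seed strings, assuming X comes before Y) are deliberately simplified, and a real system would also score patterns and tuples for confidence:

```python
import re

CORPUS = [
    "Mark Twain is buried in Elmira, NY.",
    "The grave of Mark Twain is in Elmira.",
    "Will Rogers is buried in Claremore, Oklahoma.",
    "The grave of Emily Dickinson is in Amherst.",
]
seeds = {("Mark Twain", "Elmira")}

def patterns_from(pairs):
    """Generalize the context between the two seed strings into a pattern."""
    pats = set()
    for x, y in pairs:
        for sent in CORPUS:
            if x in sent and y in sent and sent.index(x) < sent.index(y):
                middle = sent[sent.index(x) + len(x): sent.index(y)]
                pats.add(r"(?P<X>[A-Z][a-z]+ [A-Z][a-z]+)"
                         + re.escape(middle) + r"(?P<Y>[A-Z]\w+)")
    return pats

def tuples_from(pats):
    """Grep the corpus with the learned patterns to find new pairs."""
    return {(m.group("X"), m.group("Y"))
            for p in pats for s in CORPUS if (m := re.search(p, s))}

for _ in range(2):      # patterns -> tuples -> patterns -> ...
    seeds |= tuples_from(patterns_from(seeds))
print(seeds)            # now also contains (Will Rogers, Claremore), (Emily Dickinson, Amherst)
```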

Bootstrapping

Seed Examples:
Philadelphia – Michael Nutter
New York – Michael Bloomberg

Rule Learning → Extraction Rules:
X is mayor of Y
X, mayor of Y
X runs City Hall in Y

→ High-confidence Extractions

Bootstrapping (next iteration)

Seed Examples:
Philadelphia – Michael Nutter
New York – Michael Bloomberg
San Diego – Jerry Sanders
Belgrade – Dragan Đilas

Rule Learning → Extraction Rules:
X is mayor of Y
X, mayor of Y
X runs City Hall in Y
Social Democrat X is new mayor of Y

→ High-confidence Extractions

Occurrences of seed tuples:

Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond-based Microsoft fell…
The Armonk-based IBM introduced a new line…
The combined company will operate from Boeing’s headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.

ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA

Bootstrapping for Relation Extraction

Learned Patterns:
• <STRING1>’s headquarters in <STRING2>
• <STRING2>-based <STRING1>
• <STRING1>, <STRING2>

Bootstrapping for Relation Extraction

Generate new seed tuples; start new iteration:

ORGANIZATION    LOCATION
AG EDWARDS      ST LUIS
157TH STREET    MANHATTAN
7TH LEVEL       RICHARDSON
3COM CORP       SANTA CLARA
3DO             REDWOOD CITY
JELLIES         APPLE
MACWEEK         SAN FRANCISCO

Main Issue: Semantic Drift

• US Presidents: “presidents such as…” drifts to company presidents

• States: “states such as…” drifts to liquid, solid, gas, Canada, …
• “Microfinance institutions (MFIs) in the United States such as Accion”

• …

Doubly Anchored Patterns [Kozareva, ACL 08]

• “X such as Y and Z”
• Fix two, look for the third

• “presidents such as Clinton and *”
• Will likely not drift to company presidents
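A sketch of a doubly anchored query as a plain regex (the snippet of text is made up for illustration):

```python
import re

text = ("He admired presidents such as Clinton and Roosevelt, while presidents "
        "such as Clinton and Obama were discussed elsewhere.")

# Both the class name and one known instance are fixed; only the third slot is open,
# which makes it much harder for the pattern to drift to, e.g., company presidents.
pattern = re.compile(r"presidents such as Clinton and (?P<new>[A-Z]\w+)")
print(pattern.findall(text))   # ['Roosevelt', 'Obama']
```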


Pattern-Match Rule Learning

• Writing accurate patterns for each slot for each application requires laborious software engineering.

• An alternative is to use rule induction methods.

• The RAPIER system (Califf & Mooney, 1999) learns three regex-style patterns for each slot:
• Pre-filler pattern

• Filler pattern

• Post-filler pattern

• RAPIER allows use of POS and WordNet categories in patterns to generalize over lexical items.


RAPIER Pattern Induction Example

• If the goal is to extract the name of the city in which a posted job is located, the least-general-generalization constructed by RAPIER is:

“…located in Atlanta, Georgia…”   “…offices in Kansas City, Missouri…”

Induced rule:
Pre-filler: “in” as Prep
Filler: 1 to 2 PropNouns
Post-filler: PropNoun which is a State

Distant Supervision


Distant Supervision

• Combine bootstrapping with supervised learning

• Instead of 5 seeds, use a large database to get a huge # of seed examples

• Create lots of features from all these examples

• Combine in a supervised classifier

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007.
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL 2009.

Distant supervision paradigm

• Like supervised classification:

• Uses a classifier with lots of features

• Supervised by detailed hand-created knowledge

• Doesn’t require iteratively expanding patterns

• Like unsupervised classification:

• Uses very large amounts of unlabeled data

• Not sensitive to genre issues in training corpus

Distantly supervised learning of relation extraction patterns

1. For each relation
2. For each tuple in big database
3. Find sentences in large corpus with both entities
4. Extract frequent features (parse, words, etc.)
5. Train supervised classifier using thousands of patterns

Example for the Born-In relation:

Seed tuples (from the database): <Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>

Matching sentences:
Hubble was born in Marshfield
Einstein, born (1879), Ulm
Hubble’s birthplace in Marshfield

Extracted patterns/features:
PER was born in LOC
PER, born (XXXX), LOC
PER’s birthplace in LOC

Classifier: P(born-in | f1, f2, f3, …, f70000)
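A compressed sketch of steps 1-5; the "database" and "corpus" below are toy stand-ins for Freebase-scale seed tuples and a web-scale corpus, and only the words between the two mentions are used as features:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

DB = {"born-in": [("Edwin Hubble", "Marshfield"), ("Albert Einstein", "Ulm")],
      "founded": [("Steve Jobs", "Apple")]}
CORPUS = [
    "Edwin Hubble was born in Marshfield .",
    "Albert Einstein , born ( 1879 ) , Ulm .",
    "Steve Jobs , co-founder of Apple , died in 2011 .",
]

def between_features(sentence, e1, e2):
    """Words between the two entity mentions, as a feature dict."""
    toks = sentence.split()
    i = toks.index(e1.split()[-1])     # last token of the first mention
    j = toks.index(e2.split()[0])      # first token of the second mention
    return {"between=" + w: 1 for w in toks[i + 1: j]}

# Steps 1-3: for each relation and DB tuple, find corpus sentences containing both
# entities and turn each match into a (features, relation) training example.
X, y = [], []
for rel, pairs in DB.items():
    for e1, e2 in pairs:
        for sent in CORPUS:
            if e1 in sent and e2 in sent:
                X.append(between_features(sent, e1, e2))
                y.append(rel)

# Steps 4-5: train an ordinary supervised classifier on the distantly labeled examples.
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict([{"between=was": 1, "between=born": 1, "between=in": 1}]))  # likely ['born-in']
```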

Heuristics for Labeling Training Data  e.g. [Mintz et al. 2009]

Person             Birth Location
Barack Obama       Honolulu
Mitt Romney        Detroit
Albert Einstein    Ulm
Nikola Tesla       Smiljan
…                  …

Matching sentences for (Barack Obama, Honolulu):
“Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...”
“… site between Downtown Honolulu and Waikiki is proposed site for Barack Obama's presidential library.”
“Born in Honolulu, Barack Obama went on to become…”

Labeled pairs: (Barack Obama, Honolulu), (Mitt Romney, Detroit), (Albert Einstein, Ulm)

Problem 1: not all sentences are positive training data!

Multi-instance Learning (MultiR)

Sentences: s1, s2, s3, …, sn (the sentences mentioning the pair, e.g. (Barack Obama, Honolulu))
Relation mentions: z1, z2, z3, …, zn, one per sentence, from local extractors with P(zi = r | si) ∝ exp(θ · f(si, r))
Aggregate relations: d1, d2, …, dk (Born-In, Lived-In, children, etc…), obtained from the zi by a deterministic OR
Learning maximizes the conditional likelihood P(z, d | s; θ)


Problem 2: not all databases are complete (neither is text)

Missing Data Problems…

• 2 assumptions drive learning:
• Not in DB -> not mentioned in text
• In DB -> must be mentioned at least once

• Leads to errors in training data:
• False positives
• False negatives


Modeling Missing Data [Ritter et al. TACL 2013]

Sentences s1 … sn and relation mentions z1 … zn as before
t1, …, tk: relation mentioned in text
d1, …, dk: relation mentioned in DB
Soft constraints encourage agreement between the “mentioned in text” and “mentioned in DB” variables

Side Information

• Entity coverage in database
• Popular entities
• Good coverage in Freebase / Wikipedia
• Unlikely to extract new facts

Experiments

[Plot comparing the models]
• Red: MultiR [Hoffmann et al. 2011]
• Black: Soft Constraints
• Green: Missing Data Model
