62
1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and the Graduate Center City University of New York [email protected] December 3, 2010

1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

Embed Size (px)

Citation preview

Page 1: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

1/55

Web-Scale Knowledge Discovery and Population

from Unstructured Data

ACLCLP-IR2010 Workshop

Heng Ji

Computer Science DepartmentQueens College and the Graduate Center

City University of New York

[email protected] 3, 2010

Page 2: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

2/55

Outline Motivation of Knowledge Base Population (KBP) KBP2010 Task Overview Data Annotation and Analysis Evaluation Metrics A Glance of Evaluation Results CUNY-BLENDER team @ KBP2010 Discussions and Lessons Preview of KBP2011

Cross-lingual (Chinese-English) KBP Temporal KBP

Page 3: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

3/55

Limitations of Traditional IE/QA Tracks Traditional Information Extraction (IE) Evaluations (e.g.

Message Understanding Conference /Automatic Content Extraction programs)

Most IE systems operate one document a time; MUC-style Event Extraction hit the 60% ‘performance ceiling’

Look back at the initial goal of IE Create a database of relations and events from the entire corpus Within-doc/Within-Sent IE was an artificial constraint to simplify the

task and evaluation

Traditional Question Answering (QA) Evaluations Limited efforts on disambiguating entities in queries Limited use of relation/event extraction in answer search

Page 4: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

4/55

The Goal of KBP Hosted by the U.S. NIST, started from 2009, supported by DOD, coordinated

by Heng Ji and Ralph Grishman in 2010, 55 teams registered, 23 teams participated

Our Goal Bridge IE and QA communities Promote research in discovering facts about entities and expanding a

knowledge source

What’s New & Valuable Extraction at large scale (> 1 million documents) ; Using a representative collection (not selected for relevance); Cross-document entity resolution (extending the limited effort in ACE); Linking the facts in text to a knowledge base; Distant (and noisy) supervision through Infoboxes; Rapid adaptation to new relations; Support multi-lingual information fusion (KBP2011); Capture temporal information (KBP2011)

All of these raise interesting and important research issues

Page 5: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

5/55

Knowledge Base Population (KBP2010) Task Overview

Page 6: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

6/55

KBP Setup

Knowledge Base (KB) Attributes (a.k.a., “slots”) derived from Wikipedia

infoboxes are used to create the reference KB

Source Collection A large corpus of newswire and web documents

(>1.3 million docs) is provided for systems to discover information to expand and populate KB

Page 7: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

7/55

Entity Linking: Create Wiki Entry?

Query = “James Parsons”

NIL

Page 8: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

8/55

Entity Linking Task Definition Involve Three Entity Types

Person, Geo-political, Organization Regular Entity Linking

Names must be aligned to entities in the KB; can use Wikipedia texts

Optional Entity linking Without using Wikipedia texts, can use Infobox values

Query Example <query id="EL000304"> <name>Jim Parsons</name> <docid>eng-NG-31-100578-11879229</docid> </query>

Page 9: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

9/55

Slot Filling: Create Wiki Infoboxes?

School Attended: University of Houston

<query id="SF114"> <name>Jim Parsons</name> <docid>eng-WL-11-174592-12943233</docid> <enttype>PER</enttype> <nodeid>E0300113</nodeid> <ignore>per:date_of_birth per:age per:country_of_birth per:city_of_birth</ignore> </query>

Page 10: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

10/55

Regular Slot FillingPerson Organization

per:alternate_names per:title org:alternate_names

per:date_of_birth per:member_of org:political/religious_affiliation

per:age per:employee_of org:top_members/employees

per:country_of_birth per:religion org:number_of_employees/members

per:stateorprovince_of_birth per:spouse org:members

per:city_of_birth per:children org:member_of

per:origin per:parents org:subsidiaries

per:date_of_death per:siblings org:parents

per:country_of_death per:other_family org:founded_by

per:stateorprovince_of_death per:charges org:founded

per:city_of_death org:dissolved

per:cause_of_death org:country_of_headquarters

per:countries_of_residence org:stateorprovince_of_headquarters

per:stateorprovinces_of_residence org:city_of_headquarters

per:cities_of_residence org:shareholders

per:schools_attended org:website

Page 11: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

11/55

Data Annotation and Analysis

Page 12: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

12/55

Data Annotation Overview

Entity LinkingCorpus

Genre/Source Size (entity mentions)

Person Organization GPE

Training 2009 Training 627 2710 567

2010 Web data 500 500 500

Evaluation Newswire 500 500 500

Web data 250 250 250

Slot Filling Corpus

Task Source Size (entities)

Person Organization

Training Regular Task 2009 Evaluation 17 31

2010 Participants 25 25

2010 LDC 25 25

Surprise Task 2010 LDC 16 16

Evaluation Regular Task LDC 50 50

Surprise Task LDC 20 20

Source collection: about 1.3 million newswire docs and 500K web docs, a few speech transcribed docs

Page 13: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

13/55

Entity Linking Inter-Annotator Agreement

Annotator 1

Annotator 3

Annotator 2

Entity Type #Total Queries Agreement Rate Genre #Disagreed Queries

Person 59 91.53% Newswire 4

Web Text 1

Geo-political 64 87.5% Newswire 3

Web Text 5

Organization 57 92.98% Newswire 3

Web Text 1

Page 14: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

14/5514/35

Slot Filling Human Annotation Performance Evaluation assessment of LDC Hand Annotation

Performance P(%) R(%) F(%)

All Slots 70.14 54.06 61.06

All except per:top-employee, per:member_of, per:title

71.63 57.6 63.86

Why is the precision only 70%? 32 responses were judged as inexact and 200 as wrong answers A third annotator’s assessment on 20 answers marked as wrong:

65% incorrect; 15% correct; 20% uncertain Some annotated answers are not explicitly stated in the document

… some require a little world knowledge and reasoning Ambiguities and underspecification in the annotation guideline Confusion about acceptable answers Updates to KBP2010 annotation guideline for assessment

Page 15: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

15/5515/23

Slot Filling Annotation Bottleneck The overlap rates between two participant annotators in community

are generally lower than 30% Keep adding more human annotators help? No

Page 16: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

16/5516/23

Can Amazon Mechanical Turk Help?

Given a q, a and supporting context sentence, Turk should judge if the answer is

Y: correct; N: incorrect; U: unsure

Result Distribution for 1690 instances

Useful Annotations (41.8%) Useless Annotations (58.2%)

Cases Number Cases Number

Y Y Y Y Y 230 Y Y Y N N 164

N N N N N 16 Y Y Y N U 165

Y Y Y Y U 151 N N N Y Y 158

N N N N U 24 Y Y N N U 171

Y Y Y Y N 227 Y Y N U U 77

N N N N Y 46 Y Y U U U 17

Y Y Y U U 13 N N N Y U 72

N N N U U 59 N N Y U U 57

Y N U U U 22

Y U U U U 8

N N U U U 11

N U U U U 1

U U U U U 1

Page 17: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

17/5517/23

Why is Annotation so hard for Non-Experts?

Even for all-agreed cases, some annotations are incorrect…

Require quality control Training difficulties

Query Slot Answer Context

Citibank

org:

top_members

/employees

Tim Sullivan

He and Tim Sullivan, Citibank's Boston area manager, said they still to plan seek advice from activists going forward.

International Monetary

Fund

org:subsidiaries

World Bank

President George W. Bush said Saturday that a summit of world leaders agreed to make reforms to the World Bank and International Monetary Fund.

Page 18: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

18/5518/35

Evaluation Metrics

Page 19: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

19/5519/35

Entity Linking Scoring Metric Micro-averaged Accuracy (official metric)

Mean accuracy across all queries

Macro-averaged Accuracy Mean accuracy across all KB entries

Page 20: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

20/5520/35

Slot Filling Scoring Metric

Each response is rated as correct, inexact, redundant, or wrong (credit only given for correct responses)

Redundancy: (1) response vs. KB; (2) among responses: build equivalence class, credit only for one member of each class

Correct = # (non-NIL system output slots judged correct) System = # (non-NIL system output slots) Reference =

# (single-valued slots with a correct non-NIL response) + # (equivalence classes for all list-valued slots)

Standard Precision, Recall, F-measure

Page 21: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

21/5521/35

Evaluation Results

Page 22: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

22/5522/35

Top-10 Regular Entity Linking Systems

<0.8 correlation between overall vs. Non-NIL performance

Page 23: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

23/5523/35

Human/System Entity Linking Comparison (subset of 200 queries)

Average among three annotators

Page 24: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

24/5524/35

Top-10 Regular Slot Filling Systems

Page 25: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

25/55

CUNY-BLENDER Team @ KBP2010

Page 26: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

26/5526/23

Free Base

Wikipedia

Answers

Query Expansion

Query

QAIE Pattern Matching

Answer Validation

AnswerFiltering

Statistical Answer Re-ranking

Cross-System & Cross-Slot Reasoning

External KBs

Answer Validation

TextMining

Inexact & RedundantAnswer Removal

Priority-based Combination

System Overview

Page 27: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

27/5527/23

IE Pipeline

KBP 2010 slots ACE2005 relations/ eventsper:date_of_birth,

per:country_of_birth,per:stateorprovince_of_birth,per:city_of_birth

event: be-born

per:countries_of_residence,per:stateorprovinces_of_residence,per:cities_of_residence,per:religion

relation:citizen-resident-religion-ethnicity

per:school_attended relation:student-alumper:member_of relation:membership,

relation:sports-affiliationper:employee_of relation:employmentper:spouse, per:children, per:parents,per:siblings, per:other_family

relation:family, event: marry,event:divorce

per:charges event:charge-indict, event:convict

Apply ACE Cross-document IE (Ji et al., 2009) Mapping ACE to KBP, examples:

Page 28: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

28/5528/23

Pattern Learning Pipeline

Selection of query-answer pairs from Wikipedia Infobox

split into two sets Pattern extraction

For each {q,a} pair, generalize patterns by entity tagging and regular expressions e.g. <q> died at the age of <a>

Pattern assessment Evaluate and filter based on matching rate

Pattern matching Combine with coreference resolution

Answer Filtering based on entity type checking, dictionary checking and dependency parsing constraint filtering

Page 29: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

29/5529/23

QA Pipeline

Apply open domain QA system, OpenEphyra (Schlaefer et al., 2007)

Relevance metric related to PMI and CCP Answer pattern probability:

P (q, a) = P (q NEAR a): NEAR within the same sentence boundary

Limited by occurrence based confidence and recall issues

sentencesafreqqfreq

aNEARqfreqaqR #

)()(

)(),(

Page 30: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

30/5530/23

More Queries and Fewer Answers Query Template expansion

Generated 68 question templates for organizations and 68 persons Who founded <org>? Who established <org>? <org> was created by who?

Query Name expansion Wikipedia redirect links

Heuristic rules for Answer Filtering Format validation Gazetteer based validation Regular expression based filtering Structured data identification and answer filtering

Page 31: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

31/5531/23

Motivation of Statistical Re-Ranking Union and voting are too sensitive to the performance of

baseline systems Union guarantees highest recall

requires comparable performance Voting

assumes more frequent answers are more likely true (FALSE) Priority-based combination

voting with weights assumes system performance does not vary by slot (FALSE)

Slot IE QA PL

org:country_of_headquarters 75.0 15.8 16.7

org:founded - 46.2 -

per:date_of_birth 100 33.3 76.9

per:origin - 22.6 40

Page 32: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

32/5532/23

Statistical Re-Ranking Maximum Entropy (MaxEnt) based supervised re-

ranking model to re-rank candidate answers for the same slot

Features Baseline Confidence Answer Name Type Slot Type X System Number of Tokens X Slot Type Gazetteer constraints Data format Context sentence annotation (dependency parsing, …) …

Page 33: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

33/5533/23

MLN-based Cross-Slot Reasoning Motivation

each slot is often dependent on other slots can construct new ‘revertible’ queries to verify candidate

answers X is per:children of Y Y is per:parents of X; X was born on date Y age of X is approximately (the current

year – Y)

Use Markov Logic Networks (MLN) to encode cross-slot reasoning rules

Heuristic inferences are highly dependent on the order of applying rules

MLN can adds a weight to each inference rule integrates soft rules and hard rules

Page 34: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

34/55

Name Error Examples

<PER>Faisalabad</PER>'s <PER>Catholic Bishop</PER> <PER>John Joseph</PER>, who had been campaigning against the law, shot himself in the head outside a court in Sahiwal district when the judge convicted Christian Ayub Masih under the law in 1998.

Nominal Missing Error Examples supremo/shepherd/prophet/sheikh/Imam/overseer/oligarchs/

Shiites…

Intuitions of using lexical knowledge discovered from ngrams Each person has a Gender (he, she…) and is Animate (who…)

Error Analysis on Supervised Model

classification errors spurious errors

missing errors

Page 35: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

35/55

Motivations of Using Web-scale Ngrams

Data is Power Web is one of the largest text corpora: however,

web search is slooooow (if you have a million queries).

N-gram data: compressed version of the web Already proven to be useful for language modeling Google N-gram: 1 trillion token corpus

(Ji and Lin, 2009)

Page 36: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

36/55

car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399, boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281, mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187, swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132, diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73, logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing 67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59, horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, <UNK> 48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40, electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, 0000 35, shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30, wagon 27, two-vehicle 27, street 27, glider 26, " 25, sawmill 25, horse 25, bomb-making 25, bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24, trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20, drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16, motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15, scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler 14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13, horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12, single 12, jousting 12, ferry 12, airplane 12, unrelated 11, transporter 11, tram 11, scuba 11, common 11, canoe 11, skateboarding 10, ship 10, paragliding 10, paddock 10, moped 10, factory 10

Page 37: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

37/55

Discovery Patterns (Bergsma et al., 2005, 2008) (tag=N.*|word=[A-Z].*) tag=CC.* (word=his|her|its|their) (tag=N.*|word=[A-Z].*) tag=V.* (word=his|her|its|their) …

If a mention indicates male and female with high confidence, it’s likely to be a person mention

Gender Discovery from Ngrams

Patterns for candidate mentions

male female neutral plural

John Joseph bought/… his/… 32 0 0 0

Haifa and its/… 21 19 92 15

screenwriter published/… his/…

144 27 0 0

it/… is/… fish 22 41 1741 1186

Page 38: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

38/55

Discovery Patterns Count the relative pronoun after nouns not (tag=(IN|[NJ].*) tag=[NJ].* (? (word=,)) (word=who|which|where|

when)

If a mention indicates animacy with high confidence, it’s likely to be a person mention

Animacy Discovery from Ngrams

Patterns for candidate mentions

Animate

Non-Animate

who when where which

supremo 24 0 0 0

shepherd 807 24 0 56

prophet 7372 1066 63 1141

imam 910 76 0 57

oligarchs 299 13 0 28

sheikh 338 11 0 0

Page 39: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

39/55

Candidate mention detection Name: capitalized sequence of <=3 words; filter stop words,

nationality words, dates, numbers and title words Nominal: un-capitalized sequence of <=3 words without stop

words

Margin Confidence Estimation freq (best property) – freq (second best property) freq (second best property)

Confidence (candidate, Male/Female/Animate) > Full Matching: John Joseph (M:32) Composite Matching: Ayub (M:87) Masih (M:117) Relaxed Matching: Mahmoud (M:159 F:13) Hamadan(N:19) Salim(F:13 M:188)

Qawasmi(M:0 F:0)

Unsupervised Mention Detection Using Gender and Animacy Statistics

Page 40: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

40/55

Mention Detection Performance

Methods P(%) R(%) F(%)

NameMention

Detection

Supervised Model 88.24 81.08 84.51

Unsupervised Methods

Using Ngrams

87.05 82.34 84.63

Nominal

Mention

Detection

Supervised Model 85.93 70.56 77.49

Unsupervised Methods

Using Ngrams

71.20 85.18 77.57

• Apply the parameters optimized on dev set directly on the blind test set• Blind test on 50 ACE05 newswire documents, 555 person name mentions and 900 person nominal mentions

Page 41: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

41/5541/23

Impact of Statistical Re-Ranking

Pipelines Precision Recall F-measure

Bottom-up Supervised IE 0.2416 0.1421 0.1789Pattern Matching 0.2186 0.3769 0.2767

Top-down QA 0.2668 0.1730 0.2099

Priority based Combination 0.3048 0.2658 0.2840

Re-Ranking based Combination 0.2797 0.4433 0.3430

5-fold cross-validation on training set Mitigate the impact of errors produced by scoring based on

co-occurrence (slot type x sys feature) e.g. the query “Moro National Liberation Front” and answer

“1976”did not have a high co-occurrence, but was bumped up by the re-ranker based on the slot type feature org:founded

Page 42: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

42/5542/23

Impact of Cross-Slot Reasoning

Operations Total Correct(%) Incorrect(%)

Removal 277 88% 12%

Adding 16 100% 0%

Brian McFadden | per:title | singers | “She had two daughter

with one of the MK’d Westlife singers, Brian McFadden, calling

them Molly Marie and Lilly Sue”

Page 43: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

43/5543/35

Slot-Specific Analysis A few slots account for a large fraction of the answers:

per:title, per:employee_of, per:member_of, and org:top_members/employees account for 37% of correct responses

For a few slots, delimiting exact answer is difficult …result is ‘inexact’ slot fills

per:charges, per:title (“rookie driver”; “record producer”)

For a few slots, equivalent-answer detection is importantto avoid redundant answers

per:title again accounts for the largest number of cases. e.g., “defense minister” and “defense chief” are equivalent.

Page 44: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

44/5544/35

How much Inference is Needed?

Page 45: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

45/5545/35

Why KBP is more difficult than ACE

Cross-sentence Inference – non-identity coreference(per:children) Lahoud is married to an Armenian and the couple have three children. Eldest

son Emile Emile Lahoud was a member of parliament between 2000 and 2005.

Cross-slot Inference (per:children) People Magazine has confirmed that actress Julia Roberts has given birth to her

third child a boy named Henry Daniel Moder. Henry was born Monday in Los Angeles and weighed 8? lbs. Roberts, 39, and husband Danny Moder, 38, are already parents to twins Hazel and Phinnaeus who were born in November 2006.

Page 46: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

46/5546/23

Statistical Re-Ranking based Active Learning

Page 47: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

47/55

Preview of KBP2011

Page 48: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

48/55

Cross-lingual Entity Linking

Query = “吉姆 .帕森斯”

Page 49: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

49/55

Cross-lingual Slot Filling

Other family: Todd Spiewak

Query = “James Parsons”

Page 50: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

50/55

Cross-lingual Slot Filling

Two Possible Strategies 1. Entity Translation (ET) + Chinese KBP 2. Machine Translation (MT) + English KBP

Stimulate Research on Information-aware Machine Translation Translation-aware Information Extraction Foreign Language KBP, Cross-lingual Distant Learning

Page 51: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

51/55

Query: Elizabeth II Slot type: per:cities_of_residence Answer: Gulf XIN20030511.0130.0011 | Xinhua News Agency,

London , may 10 -according to British media ten, British Queen Elizabeth II did not favour in the Gulf region to return British unit to celebrate the victory in the war.

Error Example of SF on MT output

Page 52: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

52/55

Query Name in Document Not Translated

Query: Celine Dion Answer: PER:Origin = Canada British singer , Clinton's plan to Caesar Palace of the ( Central

news of UNAMIR in Los Angeles , 15th (Ta Kung Pao) - consider British singer , Clinton ( ELT on John ) today, according to the Canadian and the seats of the Matignon Accords , the second to Las Vegas in the international arena heavyweight.

Page 53: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

53/55

Answer in Document Not Translated Query: David Kelly Answer: per:schools_attended = Oxford University MT: The 59-year-old Kelly is the flea basket for trapping

fish microbiology and internationally renowned biological and chemical weapons experts. He had participated in the United Nations Iraq weapons verification work, and the British Broadcasting Corporation ( BBC ) the British Government for the use of force on Iraq evidence the major sources of information. On , Kelly in the nearby slashed his wrist, and public opinion holds that he &quot; cannot endure the enormous psychological pressure &quot; to commit suicide.

Page 54: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

54/55

Temporal KBP (Slot Filling)

Page 55: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

55/55

Temporal KBP Many attributes such as a person’s title and

employer, and spouse change over time Time-stamped data is more valuable Distinguish static attributes and dynamic attributes Address the multiple answer problem

What representation to require? <start_date, end_date>

Such explicit info rarely provided <<earliest_start_date, latest_start_date>,

<earliest_end_date, latest_end_date>> Captures wider range of information

Page 56: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

56/55

Temporal KBP: scoring Score each element of 4-tuple separately, then

combine scores Smoothed score to handle +∞ and -∞ Need rules for granularity mismatches

Year vs month vs day Possible Formula (constraint based validation) key = <t1, t2, t3, t4> ; answer = <x1, x2, x3, x4>

if xi is judged as incorrect then

otherwise ( ) 0iS x

1 1( ) ( )

4 1 | |ii i

S xt x m

Page 57: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

57/55

Need Cross-document Aggregation Query: Ali Larijani; Answer: Iran

Doc1:

Ali Larijani had held the post for over two years but resigned after reportedly falling out with the hardline Ahmadinejad over the handling of Iran's nuclear case.

Doc2:

The new speaker, Ali Larijani, who resigned as the country's nuclear negotiator in October over differences with Ahmadinejad, is a conservative and an ardent advocate of Iran's nuclear program, but is seen as more pragmatic in his approach and perhaps willing to engage in diplomacy with the West.

Page 58: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

58/55

Same Relation Repeat Over Time Query: Mark Buse; Answer: McCain

Doc1: NYT_ENG_20080220.0185.LDC2009T13.sgm

(seven years, P7Y); (2001, 2001) In his case, it was a round trip through the revolving door: Buse had directed McCain's committee staff for seven years before leaving in 2001 to lobby for telecommunications companies.

Doc2:LTW_ENG_20081003.0118.LDC2009T13.sgm

(this year, 2008) Buse returned to McCain's office this year as chief of staff.

Page 59: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

59/55

Require Paraphrase Discovery

Query: During when was R. Nicholas Burns a member of the U.S. State Department? Answer: 1995-2008 <DOCID> APW_ENG_19950112.0477.LDC2007T07 </DOCID>

R. Nicholas Burns, a career foreign service officer in charge of Russian affairs at the National Security Council, is due to be named the new spokesman at the U.S. State Department, a senior U.S. official said Thursday.

[APW_ENG_20070324.0924.LDC2009T13 and many other DOCS]

The United States is "very pleased by the strength of this resolution" after two years of diplomacy, said R. Nicholas Burns, undersecretary for political affairs at the State Department.

<DOCID> NYT_ENG_20080118.0161.LDC2009T13 </DOCID>

R. Nicholas Burns, the country's third-ranking diplomat and Secretary of State Condoleezza Rice's right-hand man, is retiring for personal reasons, the State Department said Friday.

<DOCID> NYT_ENG_20080302.0157.LDC2009T13 </DOCID>

The chief U.S. negotiator, R. Nicholas Burns, who left his job on Friday, countered that the sanctions were all about Iran's refusal to stop enriching uranium, not about weapons. But that argument was a tough sell.

Page 60: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

60/5560/23

Related Work Extracting slots for persons and organizations (Bikel et al., 2009; Li et

al., 2009; Artiles et al., 2008) Distant Learning (Mintz et al., 2009)

Re-ranking techniques (e.g. Collins et al., 2002; Zhai et al., 2004; Ji et al., 2006)

Answer validation for QA (e.g. Magnini et al., 2002; Peatas et al., 2007; Ravichandran et al., 2003; Huang et al., 2009)

Inference for Slot Filling (Bikel et al., 2009; Castelli et al., 2010)

Page 61: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

61/55

Conclusions KBP proves a much more challenging task than traditional IE/QA Brings great opportunity to stimulate research and collaborations

across communities An adventure to promote IE to web-scale processing and higher

quality Encourage research on cross-document cross-lingual IE Big gains from statistical re-ranking combining 3 pipelines

Information Extraction Pattern Learning Question-Answering

Further gains from MLN cross-slot reasoning Automatic profiles from SF dramatically improve EL Human-system combination provides efficient answer-key

generation Faster, better, cheaper!

Page 62: 1/55 Web-Scale Knowledge Discovery and Population from Unstructured Data ACLCLP-IR2010 Workshop Heng Ji Computer Science Department Queens College and

62/55

Thank you and

Join us:http://nlp.cs.qc.cuny.edu/kbp/2010

62