Inductive Approaches to the Detection and Classification of Semantic Relation Mentions
Depth Report Examination Presentation
Gabor Melli
August 27, 2007
http://www.gabormelli.com/2007/2007_DepthReport_Melli_Presentation.ppt
Overview
• Introduction (~5 mins.)
• Task Description (~5 mins.)
• Predictive Features (~10 mins.)
• Inductive Algorithms (~10 mins.)
• Benchmark Tasks (~5 mins.)
• Research Directions (~5 mins.)
Simple examples of the “shallow” semantics sought
• “E. coli is a bacteria.” → RTypeOf(E. coli, bacteria)
• “An organism has proteins.” → RPartOf(proteins, organism)
• “IBM is based in Armonk, NY.” → RHeadquarterLocation(IBM, Armonk, NY)
Motivations
• Information Retrieval
  – Researchers could retrieve scientific papers based on relations.
    • E.g. “all papers that report localization experiments on V. cholera’s outer membrane proteins”
  – Judges could retrieve legal cases.
    • E.g. “all Supreme Court cases involving third party liability claims”
• Information Fusion
  – Researchers could populate a database with semantic relations in research articles.
    • E.g. SubcellularLocalization(Organism, Protein, Location)
  – Activists could save resources when compiling statistics from newspaper reports.
• Document Summarization, Question Answering, …
State-of-the-Art
• Current focus is to automatically induce predictive patterns/classifiers.
  – Can be more quickly applied to a new domain than an engineered solution.
• Human levels of competency are nearby.
  – F-measure:
    • 76% on the ACE-2004 benchmark task (Zhou et al, 2007)
    • 75% on a protein/gene interaction task (Fundel et al, 2007)
    • 72% on the SemEval-2007 task (Beamer et al, 2007)
  – Though under simplified conditions:
    • binary relations within a single sentence
    • perfectly classified entity mentions.
Shallow semantic analysis is challenging
• Many ways to say the same thing
  – O is based in L.; L-based O …; Headquartered in L, O …; From its L headquarters, O …
• Many relations to disambiguate among.
“The pilus(location) of V. cholerae(organism) is essential for intestinal(location) colonization.
TcpC(protein) is an outer membrane(location) lipoprotein required for pilus(location) biogenesis.”
Next Section
• Introduction
• Task Description
• Predictive Features
• Inductive Algorithms
• Benchmark Tasks
• Research Directions
Task Description
• Documents, Tokens, Sentences
• Entity Mentions: Detected and Classified
• Semantic Relation Cases and Mentions
• Performance Metrics
• Comparison with Information Extraction Task
• What name for the task?
• General Pipelined Process
• Subtask: Relation Case Generation
• Subtask: Relation Case Labeling
• Naïve Baseline Algorithms
Document, Tokens, Sentences
Entity Mentions are pre-Detected (and pre-Classified)
Semantic Relations
• A relation with a fixed set of two or more arguments.
  Ri(Arg1, …, Arga) → {TRUE, FALSE}
• Examples:
  – TypeOf(E.coli, Bacteria) → TRUE
  – OrgLocation(IBM, Jupiter) → FALSE
  – SCL(V.cholerae, TcpC, Extracellular) → TRUE
Semantic Relation Cases
• Some permutation of distinct entity mentions within the document.
• D1: “E.coli1 is a bacteria2. As with all bacteria3, E.coli4 has a cytoplasm5”
C(Ri, D1, E1, E2)
C(Ri, D1, E2, E1)
…
C(Rj, D1, E4, E3, E5)
C(Rj, D1, E3, E4, E5)
• With e entity mentions and relations of arity amax, the number of relation cases c is:
  c = e! / (e − amax)!
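The case count above can be checked directly. A minimal helper (illustrative, not from the report):

```python
from math import factorial

def relation_case_count(e: int, a_max: int) -> int:
    """Number of candidate relation cases: ordered selections of
    a_max distinct entity mentions from the e mentions in a
    document, i.e. e! / (e - a_max)!."""
    return factorial(e) // factorial(e - a_max)
```

For the example document D1 with five entity mentions and binary relations (amax = 2), this gives 5!/3! = 20 candidate cases, which is why the case space grows so quickly with arity.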
Semantic Relation Detection vs. Classification
C(R, Di, Ej, …, Ek) → ?
• Relation Detection: {True, False}
  – Predict whether this is a true mention of some semantic relation.
• Relation Classification: {1, 2, …, r}
  – Predict the semantic relation Rj associated with a relation mention.
Test and Training Sets
• Training set (labeled):
  C(R1, D1, E1, E2) → F
  C(R1, D1, E1, E3) → T
  …
  C(Rr, Dd, E2, E3, E5) → F
  C(Rr, Dd, E3, E4, E5) → F
• Test set (unlabeled):
  C(R?, Dd+1, E1, E2) → ?
  …
  C(R?, Dd+k, Ex, …, Ey) → ?
Performance Metrics
• Precision (P): probability that a test case that is predicted to have label True is tp.
• Recall (R): probability that a True test case will be tp.
• F-measure (F1): harmonic mean of the Precision and Recall estimates.
  P = TP / (TP + FP)
  R = TP / (TP + FN)
  F1 = 2PR / (P + R)
• Accuracy: proportion of predictions with correct labels, True or False.
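The three metrics follow mechanically from the outcome counts; a small sketch (guarding the empty-denominator cases, which the slide formulas leave implicit):

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, Recall and F1 from prediction-outcome counts.
    Each ratio falls back to 0.0 when its denominator is empty."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

E.g. with 8 true positives, 2 false positives and 8 false negatives: P = 0.8, R = 0.5, F1 = 2·0.8·0.5/1.3 ≈ 0.615.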
Pipelined Process Framework
• Training Phase: Training Documents → Natural Language Processing → Relation Case Generation (unlabeled relation cases) → Feature Generation → Case Labeling (labeled relation cases) → Model Induction → Classifier.
• Testing Phase: Testing Documents → Natural Language Processing → Relation Case Generation → Feature Generation → Prediction Generation → Predictions.
Next Section
• Introduction
• Task Description
• Predictive Features
• Inductive Algorithms
• Benchmark Tasks
• Research Directions
Predictive Feature Categories
1. Token-based
2. Entity Mention Argument-based
3. Chunking-based
4. Shallow Phrase-Structure Parse Tree-based
5. Phrase-Structure Parse Tree-based
6. Dependency Parse Tree-based
7. Semantic Role Label-based
[Figure: relation-case feature space — each row is a relation case C(R, D, Ej, Ek) followed by binary feature columns, grouped by category: Token (Tk), Entity Mention (Em), Chunking (Ch), Shallow PTree (SPS), PS PTree (PS), Dep. PTree (Dep), SRL (SRL).]
Vector of Feature Information
“Protein1 is a Location1 lipoprotein required for Location2 biogenesis.”
Token-based Features  “Protein1 is a Location1 ...”
• Token Distance
  – 2 intervening tokens
• Token Sequence(s)
  – Unigrams: the, of, ., ,, and, in, a, …, pyelonephritis → (0 0 0 0 0 0 1 … 0)
  – Bigrams: of the, and the, is the, is a, …, causes pyelonephritis → (0 0 0 1 … 0)
Token-based Features (cont.)
• Stemmed Word Sequences
  – “banks” → “bank”
  – “scheduling” → “schedule”
• Disambiguated Word-Sense (WordNet)
  – “bank” → river’s edge; financial inst.; row of objects
• Token Part-of-Speech Role Sequences
  – Unigrams: NN, IN, JJ, DT, COMMA, …, WP → (0 0 0 1 0 … 0)
  – Bigrams: DT IN, IN NN, JJ DT, NN JJ, AUX DT, …, AUX RBS → (0 0 0 1 … 0)
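The token-based features above can be sketched for a single relation case; the function and feature names here are illustrative, not from a specific system:

```python
def token_features(tokens, i, j, vocab_uni, vocab_bi):
    """Sparse token-based features for a relation case whose two
    argument mentions sit at token positions i and j (i < j):
    token distance plus unigram/bigram indicator features over
    the intervening tokens."""
    between = tokens[i + 1:j]
    feats = {"tok_dist": len(between)}
    for t in between:                        # unigram indicators
        if t in vocab_uni:
            feats["uni=" + t] = 1
    for a, b in zip(between, between[1:]):   # bigram indicators
        bigram = a + " " + b
        if bigram in vocab_bi:
            feats["bi=" + bigram] = 1
    return feats
```

On the slide’s example, the pair (Protein1, Location1) in “Protein1 is a Location1 …” yields a token distance of 2 and fires the indicators for “is”, “a” and “is a”.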
Entity Mention-based Features
• Entity Mention Tokens
  – IBM → 1 token; Tierra del Fuego → 3 tokens; …
• Entity Mention’s Semantic Type– Semantic Class
• Organization• Location
– Subclass• Company; University; Charity• Country; Province; Region; City
Entity Mention Features (cont.)
• Entity Mention Type
  – Name: John Doe, E. coli, periplasm, …
  – Nominal: the president, the country, …
  – Pronominal: he, she, they, it, …
• Entity Mention’s Ontology Id
  – secreted; extracellular → GO:0005576
  – E. coli; Escherichia coli → 571 (NCBI tax_id)
Phrase-Structure Parse Tree
http://lfg-demo.computing.dcu.ie/lfgparser.html
Shortest-Path Enclosed Tree
• Loss of context?

Two types of subtrees proposed
• Elementary subtrees vs. general subtrees
• Both approaches lead to an exponential number of subtree features!
Now we have a populated feature space
[Figure: the populated relation-case feature space — rows of relation cases with binary feature columns (Token, Entity Mention, Chunking, Shallow PTree, PS PTree, Dep. PTree, SRL) and a True/False label column.]
Next Section
• Introduction
• Task Description
• Predictive Features
• Inductive Algorithms
• Benchmark Tasks
• Research Directions
Inductive Approaches Available
• Supervised Algorithms
  – Require a training set.
• Semi-supervised Algorithms
  – Also accept an unlabeled set.
• Unsupervised Algorithms
  – Do not use a training set.
• Most solutions restrict themselves to the task of detecting and classifying binary relation cases that are intra-sentential.
Supervised Algorithms
• Discriminative model
  – Feature-based (state of the art)
    • E.g. k-Nearest Neighbor, Logistic Regression, …
  – Kernel-based (state of the art)
    • E.g. Support Vector Machine
• Generative model
  – E.g. Probabilistic Context Free Grammars and Hidden Markov Models
Feature-based Algorithms
• Kambhatla, 2004
  – Early proposal to use a broad set of features.
• Liu et al, 2007
  – Proposed the use of features previously found to be predictive for the task of Semantic Role Labeling.
• Jiang and Zhai, 2007
  – Used bigram and trigram PS parse tree subtree features (and dependency parse tree subtrees).
  – Adding trigram-based features produced only marginal improvement in performance; therefore only marginal improvement is likely from adding higher-order subtrees.
Kernel-based Induction
• Zelenko et al, 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhao and Grishman, 2005; Zhang et al, 2006.
• Require a kernel function, K(C1,C2) → [0,∞], that maps any two feature vectors to a similarity score from within some transformed space.
• If symmetric and positive definite then comparison between vectors can often be performed efficiently in a high-dimensional space.
• If cases are separable in that space then the kernel attains the benefit of the high-dimensional space without explicitly generating the feature space.
Kernel by Zhang et al, 2006
• Applies the Convolution Tree Kernel proposed in (Collins and Duffy, 2001; Haussler, 1999)
• Number of common subtrees:
  KC(T1, T2) = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Δ(n1, n2)
  – Nj is the set of parent nodes in tree Tj
  – Δ(n1, n2) evaluates the common sub-trees rooted at n1 and n2
Kernel computed recursively in O(|N1| · |N2|)
  – Δ(n1, n2) = 0 if the productions at n1 and n2 differ
  – Δ(n1, n2) = 1 if n1 and n2 are POS nodes
  – Otherwise,
    Δ(n1, n2) = λ ∏_{k=1}^{#ch(n1)} (1 + Δ(ch(n1, k), ch(n2, k)))
  • #ch(ni) is the number of children of node ni
  • ch(n, k) is the kth child of node n
  • λ (0 < λ < 1) is a decay factor
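The recursion above can be sketched over trees encoded as nested tuples, (label, child, …), with word leaves as plain strings; the encoding is an assumption for illustration, and the recursion follows the slide’s statement of the kernel:

```python
def production(node):
    """A node is (label, child, ...); leaves are plain word strings.
    The production is the node label plus its children's labels."""
    return (node[0],) + tuple(c if isinstance(c, str) else c[0] for c in node[1:])

def is_pos_node(node):
    """POS (preterminal) node: a single word-string child."""
    return len(node) == 2 and isinstance(node[1], str)

def delta(n1, n2, lam=0.5):
    """Delta(n1, n2): 0 if productions differ, 1 for matching POS
    nodes, otherwise the decayed product over paired children."""
    if production(n1) != production(n2):
        return 0.0
    if is_pos_node(n1):
        return 1.0
    out = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        out *= 1.0 + delta(c1, c2, lam)
    return out

def nodes(t):
    """All tree nodes (word leaves excluded)."""
    if isinstance(t, str):
        return []
    result = [t]
    for c in t[1:]:
        result += nodes(c)
    return result

def tree_kernel(t1, t2, lam=0.5):
    """K_C(T1, T2): sum of Delta over all node pairs."""
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))
```

For a tiny tree ("S", ("NP", ("DT", "a"), ("NN", "dog")), ("VP", ("VB", "runs"))) and λ = 0.5, the kernel of the tree with itself works out to 9.0, since only identically-produced node pairs contribute.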
Generative Models Approaches
• Earliest approach (Leek, 1997; Miller et al, 1998).
• Instead of directly estimating model parameters for the conditional probability P(Y | X):
  – Estimate model parameters for P(X | Y) and P(Y) from the training set.
  – Then apply Bayes’ rule to decide which label has the highest posterior probability.
• If the model fits the data then the resulting likelihood ratio estimate is known to be optimal.
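The Bayes decision described above reduces to picking the label maximizing P(x | y)·P(y); a minimal sketch, where the likelihood and prior tables are illustrative stand-ins for trained model parameters:

```python
def bayes_decide(x, likelihood, prior):
    """Return the label y maximizing the (unnormalized) posterior
    P(y | x) ∝ P(x | y) * P(y).  `likelihood[y]` is a callable
    giving P(x | y); `prior[y]` gives P(y)."""
    return max(prior, key=lambda y: likelihood[y](x) * prior[y])
```

E.g. a case that the positive model scores highly is labeled True even under a modest prior, while an ambiguous case falls back to the prior-favored label.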
Two Approaches Surveyed
• Probabilistic Context Free Grammars
  – Miller et al, 1998; Miller et al, 2000
• Hidden Markov Models
  – Leek, 1997
  – McCallum et al, 2000
  – Ray and Craven, 2001; Skounakis, Craven, and Ray, 2003
PCFG-based Model
Miller et al, 1998/2000
• From the augmented representation, learn a PCFG based on these trees.
• Infer the maximum likelihood estimates of the probabilities based on the frequencies in the training corpus, along with an interpolated adjustment of lower order estimates to handle the (increased) challenge of data sparsity.
• Parses of test cases that contain the semantic labels are predicted to be relation mentions.
Semi-Supervised Approaches
• (Brin, 1998; Agichtein and Gravano, 2000)
  – Use token-based features.
  – Apply resampling with replacement.
  – Assume that relations in the training set are redundantly present and restated in the test set.
• (Shi et al, 2007)
  – Uses the (Miller et al, 1998/2000) approach.
  – Uses a naïve baseline to convert unlabelled cases to true training cases.
Snowball’s Bootstrapping
(Xia, 2006)
Unsupervised Use of Lexico-Syntactic Patterns
• Suggested initially by (Hearst, 1992).
• Applied to relation detection by (Pantel et al, 2004; Etzioni et al, 2005).
• Sample patterns:
  – <Class> such as <Member1>, …, <Memberi>
  – <Class> like <Member1> and <Member2>
  – <Member> is a <Class>
  – <Class>, including <Member>
• Suited for the detection of TypeOf() subsumption relations over large corpora.
Next Section
• Introduction
• Task Description
• Predictive Features
• Inductive Algorithms
• Benchmark Tasks
• Research Directions
Benchmark Tasks
• Message Understanding Conference (MUC)
  – DARPA, (1989–1997), Newswire
  – TR task: Location_Of(ORG, LOC); Employee_of(PER, ORG); and Product_Of(ARTIFACT, ORG)
• Automatic Content Extraction (ACE)
  – NIST, (2002–…), Newswire
  – Relation Mention Detection: ~5 major, ~24 minor rels
  – Physical(E1, E2); Social(Personx, Persony); Employ(Org, Person); …
• Protein Localization Relation Extraction
  – SFU, (2006–…)
  – SubcellularLocation(Organism, Protein, Location)
Message Understanding Conference 1997 (Miller et al, 1998)

ACE-2003 Results
                                                                     5 Major Types       24 Subtypes
Paper              Method   Type            Features                 P     R     F1      P     R     F1
ZhouZJZ, 2007      Kernel   Composite       PS+Em+Ch+Dep+Ext         80.8  68.4  74.1    65.2  54.9  59.6
ZhouZJZ, 2007      Kernel   Convolution     PS                       80.1  63.8  71.0    63.4  51.9  57.1
ZhangZSZ, 2006     Kernel   Composite+poly  Em+Ch+PS.SPET+Dep.SP     77.3  65.6  70.9    64.9  51.2  57.2
ZhangZSZ, 2006     Kernel   Composite       Em+Ch+PS.SPET+Dep.SP     76.3  63.0  69.0    N/A
ZhangZS, 2006      Kernel   Composite       Em+PS+Ext                76.3  63.0  69.0    64.6  50.8  56.8
ZhangZS, 2006      Kernel   Composite       Em+PS                    76.1  62.9  68.9    N/A
ZhouSZZ, 2005      Feature  SVM             Tok(1)+Em+Ch+PS.SPET+Dep.SP  77.2  60.7  68.0    63.1  49.5  55.5
ZhangZS, 2006      Kernel   Convolution     PS Ptree                 72.8  53.8  61.9    N/A
HarabagiuBM, 2005  Kernel   Kernel          Tok(1+n)+Em+SRL          72.2  44.5  55.1    N/A
Kambhatla, 2004    Feature  MaxEnt          Tok(1+n)+Em+Dep.         63.5  45.2  52.8    N/A
ZhangZSZ, 2006     Kernel   Linear          Em                       79.5  34.6  48.2    N/A
CulottaS, 2004     Kernel   Tree            Em+Dep.SP                67.1  35.0  45.8    N/A
HarabagiuBM, 2005  Kernel   Kernel          Tok(1+n)+Em              60.5  20.3  30.4    N/A
Prokaryote Protein Localization Relation Extraction (PPLRE) Task

Paper  Approach    Method    Features              P     R     F1
Shi07  Baseline    All True  n/a                   14.1  61.5  23.0
Shi07  Discrim.    Snowball  NER+Tokens(uni)       66.6  18.2  26.2
Shi07  Generative  LPCFG     PS PTree+NER          58.6  26.2  36.2
Shi07                        PS PTree+NER          59.4  29.2  39.0
Shi07                        PS PTree+NER+Coref.   64.7  33.8  44.4
“The pilus(location) of V. cholerae(organism) is essential for intestinal colonization.
The pilus(location) biogenesis apparatus is composed of nine proteins.
TcpC(protein) is an outer membrane(location) lipoprotein required for pilus(location) biogenesis.”
Next Section
• Introduction
• Task Description
• Predictive Features
• Inductive Algorithms
• Benchmark Tasks
• Research Directions
Research Directions
1. Additional Features/Knowledge
2. Inter-sentential Relation Cases
3. Relations with More than Two Arguments
4. Grounding Entity Mentions to an Ontology
5. Qualifying the Certainty of a Relation Case
Additional Features/Knowledge
• Expose additional features that can identify the more esoteric ways of expressing a relation.
• Features from outside of the “shortest path”.
  – Challenge: past open-ended attempts have reduced performance (Jiang and Zhai, 2007).
  – (Zhou et al, 2007) add heuristics for five common situations.
• Use domain-specific background knowledge.
  – E.g. Gram-positive bacteria (such as M. tuberculosis) do not have a periplasm, therefore do not predict periplasm.
Inter-sentential Relation Cases
• Challenge: current approaches focus on syntactic features, which cannot be extended beyond the sentence boundary.
  – Idea: apply Centering Theory (Hirano et al, 2007).
  – Idea: create a text graph and apply graph mining.
• Challenge: a significant increase in the proportion of false relation cases.
  – Idea: a threshold on the number of pairings any one entity mention can take.
Relations with > Two Arguments
• Idea: decompose the problem into a set of (n − 1) binary relations and then join relation cases that share an entity mention (Shi et al, 2007; Liu et al, 2007).
  – How to pick the ‘shared’ entity mention?
  – How much information is lost?
• Idea: create a unified feature vector with features associated with each entity mention pair.
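The decompose-and-join idea above can be sketched for a ternary relation R(A, B, C) joined on the shared middle argument; the function and data shapes are illustrative assumptions:

```python
def join_binary_cases(pairs_ab, pairs_bc):
    """Join binary relation cases R1(A, B) and R2(B, C) on the shared
    entity mention B to recover ternary cases R(A, B, C)."""
    by_b = {}
    for a, b in pairs_ab:
        by_b.setdefault(b, []).append(a)
    return [(a, b, c) for (b, c) in pairs_bc for a in by_b.get(b, [])]
```

E.g. joining Organism-Protein and Protein-Location cases on the protein mention reassembles SubcellularLocalization triples; information is lost whenever the two binary classifiers disagree on a shared mention.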
Shortened Reference List
ACE Project. (2002-2007). http://www.ldc.upenn.edu/Projects/ACE/.
E. Agichtein, and L. Gravano. (2000). Snowball: Extracting Relations from Large Plain-Text Collections. In Proc. of DL-2000.
D. E. Appelt, J. R. Hobbs, J. Bear, D. J. Israel, and M. Tyson. (1993). FASTUS: A Finite-state Processor for Information Extraction from Real-world Text. In Proc. of IJCAI-1993.
B. Beamer, S. Bhat, B. Chee, A. Fister, A. Rozovskaya, and R. Girju. (2007). UIUC: A Knowledge-rich Approach to Identifying Semantic Relations between Nominals. In Proc. of the Fourth International Workshop on Semantic Evaluations (SemEval-2007).
R. Bunescu, and R. J. Mooney. (2005). A Shortest Path Dependency Kernel for Relation Extraction. In Proc. of HLT/EMNLP-2005.
C. Cardie. (1997). Empirical Methods in Information Extraction. AI Magazine, 18(4).
M. Craven, and J. Kumlien. (1999). Constructing Biological Knowledge-bases by Extracting Information from Text Sources. In Proc. of the International Conference on Intelligent Systems for Molecular Biology.
A. Culotta, and J. S. Sorensen. (2004). Dependency Tree Kernels for Relation Extraction. In Proc. of ACL-2004.
O. Etzioni, M. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. (2005). Unsupervised Named-Entity Extraction from the Web: An Experimental Study. Artificial Intelligence, 165(1).
K. Fundel, R. Kuffner, and R. Zimmer. (2007). RelEx--Relation Extraction Using Dependency Parse Trees. Bioinformatics, 23(3).
R. Grishman, and B. Sundheim. (1996). Message Understanding Conference - 6: A Brief History. In Proc. of COLING-1996.
S. M. Harabagiu, C. A. Bejan and P. Morarescu. (2005). Shallow Semantics for Relation Extraction. In Proc. of IJCAI-2005.
T. Hasegawa, S. Sekine, and R. Grishman. (2004). Discovering Relations among Named Entities from Large Corpora. In Proc. of ACL-2004.
J. Jiang and C. Zhai. (2007). A Systematic Exploration of the Feature Space for Relation Extraction. In Proc. of NAACL/HLT-2007.
N. Kambhatla. (2004). Combining Lexical, Syntactic, and Semantic Features with Maximum Entropy Models for Extracting Relations. In Proc. of ACL-2004.
T. R. Leek. (1997). Information Extraction Using Hidden Markov Models. M.Sc. Thesis, University of California, San Diego.
Y. Liu, Z. Shi and A. Sarkar. (2007). Exploiting Rich Syntactic Information for Relation Extraction from Biomedical Articles. In Proc. of NAACL/HLT-2007.
S. Miller, H. Fox, L. Ramshaw, and R. Weischedel. (2000). A Novel Use of Statistical Parsing to Extract Information from Text. In Proc. of NAACL-2000.
S. Ray, and M. Craven. (2001). Representing Sentence Structure in Hidden Markov Models for Information Extraction. In Proc. of IJCAI-2001.
D. Roth, and W. Yih. (2002). Probabilistic Reasoning for Entity & Relation Recognition. In Proc. of COLING-2002.
Z. Shi. (2007). Ph.D. Thesis. Forthcoming.
Z. Shi, A. Sarkar and F. Popowich. (2007). Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques. In Proc. of NAACL/HLT-2007.
M. Skounakis, M. Craven and S. Ray. (2003). Hierarchical Hidden Markov Models for Information Extraction. In Proc. of IJCAI-2003.
F. M. Suchanek, G. Ifrim and G. Weikum. (2006). Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents. In Proc. of KDD-2006.
D. Zelenko, C. Aone, and A. Richardella. (2003). Kernel Methods for Relation Extraction. Journal of Machine Learning Research, Vol. 3.
M. Zhang, J. Su, D. Wang, G. Zhou and C. Lim. (2005). Discovering Relations between Named Entities from a Large Raw Corpus Using Tree Similarity-based Clustering. In Proc. of IJCNLP-2005.
M. Zhang, J. Zhang, and J. Su. (2006). Exploring Syntactic Features for Relation Extraction Using a Convolution Tree Kernel. In Proc. of HLT-2006.
S. Zhao, and R. Grishman. (2005). Extracting Relations with Integrated Information Using Kernel Methods. In Proc. of ACL-2005.
G. Zhou, M. Zhang, D. Ji and Q. Zhu. (2007). Tree Kernel-Based Relation Extraction with Context-Sensitive Structured Parse Tree Information. In Proc. of ACL-2007.
The End
Backup Slides for Questions
Entity Mentions pre-Detected
Typed Semantic Relations
• require that the semantic relation’s arguments also be associated with a semantic class.
• For example, argument A1,2 may be associated with the semantic class ORGANIZATION.
Information Extraction vs. Relation Detection and Classification
• Some of the surveyed algorithms such as (Brin, 1998; Miller et al, 1998; Agichtein and Gravano, 2000, Suchanek et al, 2006) are presented in the literature as information extraction algorithms, not as relation detection and classification algorithms.
• They are included in the survey nonetheless because they can naturally be applied to the task of relation detection and classification. This situation is to be expected because the identification of relation mentions can be a natural preprocessing step to information extraction (ACE, 2002-2007).
• IE = “populate a relational database table” (or fill in the slots of a template), where each record represents an instance of an entity or semantic relation in the domain.
Information Extraction
• Example corpus:
  D1: “IBM1 is based in Armonk, NY2.”
  D2: “Armonk1-based International Business Machines Inc.2 is expanding its central office space3. IBM4’s decision to retain headquarters5 in New York6 came despite the many offers7 presented by the other states8 in the region9.”

Information Extraction detects duplicate relation cases

Organization                                Location          Mentions
International Business Machines Inc. (IBM)  Armonk, New York  D1(E1, E2), D2(E2, E1), D2(E4, E6)

OrgHeadquarterLocation mentions:
Document  A1                                         A2
D1        E1 (IBM)                                   E2 (Armonk, NY)
D2        E2 (International Business Machines Inc.)  E1 (Armonk)
D2        E4 (IBM)                                   E6 (New York)
Relation Detection and Classification → Information Extraction
What Task Name?
• Relation Extraction: Culotta and Sorensen, 2004; Harabagiu et al, 2005; Bunescu and Mooney, 2005; Zhang et al, 2005 and 2006; Jiang and Zhai, 2007; Xu et al, 2007; and Zhou et al, 2007.
• Relation Mention Detection (RMD): ACE Project, 2002 – 2007.
• Semantic Relation Identification: Beamer et al, 2007.
• Semantic Relation Classification: Girju et al, 2007.• Relation Detection: Zhao and Grishman, 2005• Relation Discovery: Hasegawa et al, 2004.• Relation Recognition: Roth and Yih, 2002.
Relation Case Generation
• Input (D, R): a text document D and a set of semantic relations R with a arguments.
• Output (C): a set of unlabelled semantic relation cases.
• Method:
  1. Identify all e entity mentions Ei in D.
  2. Create every combination of a entity mentions from the e mentions in the document (without replacement).
     – For intra-sentential semantic relation detection and classification tasks, limit the entity mentions to be from the same sentence.
     – For typed semantic relation detection and classification tasks, limit the combinations to those where there is a match between the semantic classes of each of the entity mentions Ei and the semantic class of their corresponding relation argument Ai.
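The generation method above can be sketched directly; the data shapes (mention id/class pairs, a sentence-index map, a type signature) are illustrative assumptions:

```python
from itertools import permutations

def generate_cases(mentions, arity, sentence_of=None, type_sig=None):
    """Unlabelled relation-case generation: every ordered selection of
    `arity` distinct entity mentions, optionally restricted to a single
    sentence and/or a matching semantic-type signature.
    `mentions` is a list of (id, semantic_class) pairs; `sentence_of`
    maps a mention id to its sentence index."""
    cases = []
    for combo in permutations(mentions, arity):
        if sentence_of and len({sentence_of[m[0]] for m in combo}) > 1:
            continue  # intra-sentential restriction
        if type_sig and tuple(m[1] for m in combo) != type_sig:
            continue  # typed-relation restriction
        cases.append(tuple(m[0] for m in combo))
    return cases
```

Without restrictions, three mentions and arity 2 yield all 3!/1! = 6 ordered cases; each restriction prunes that set.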
Relation Case Labeling
Naïve Baseline Algorithms
• Predict True: Always predicts “True” regardless of the contents of the relation case – Attains the maximum Recall by any algorithm on the task.– Attains the maximum F1 by any naïve algorithm.– Most commonly used naïve baseline.
• Predict Majority: Predicts the most prevalent class label in the training set.
  – Maximizes accuracy.
  – Degenerates to a “Predict False” algorithm on this task.
• Predict (Biased) Random: Randomly predicts "True" with probability matching the distribution of "True" cases in the testing dataset, “False” otherwise.– Trades-off some Precision and Recall for additional Accuracy.
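The three naïve baselines above can be sketched in a few lines; the function names are illustrative:

```python
import random

def predict_true(cases):
    """'Predict True' baseline: maximal Recall (1.0)."""
    return ["True"] * len(cases)

def predict_majority(train_labels, cases):
    """'Predict Majority': most frequent training label; on this task
    it usually degenerates to predicting 'False' everywhere."""
    majority = max(set(train_labels), key=train_labels.count)
    return [majority] * len(cases)

def predict_biased_random(train_labels, cases, seed=0):
    """'Predict (Biased) Random': 'True' at the training-set rate."""
    p_true = train_labels.count("True") / len(train_labels)
    rng = random.Random(seed)
    return ["True" if rng.random() < p_true else "False" for _ in cases]
```

Comparing a learned classifier against all three makes clear whether it is buying Precision, Recall, or mere Accuracy.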
Prediction Outcome Labels
• true positive (tp)
  – predicted to have the label True and whose label is indeed True.
• false positive (fp)
  – predicted to have the label True but whose label is instead False.
• true negative (tn)
  – predicted to have the label False and whose label is indeed False.
• false negative (fn)
  – predicted to have the label False but whose label is instead True.
Shallow Parse Tree
Chunking-based Features
• A shallow syntactic analysis of a sentence that is fast and somewhat domain robust. (Abney, 1989)
• Within a Phrase (Ch.Phr)
  – Flag whether the two entity mentions are inside the same noun phrase, verb phrase or prepositional phrase.

http://www.cnts.ua.ac.be/conll2000/chunking/
http://l2r.cs.uiuc.edu/~cogcomp/shallow_parse_demo.php

“Extracellular TcpQ is required for TCP biogenesis.”
[NP Extracellular TcpQ] [VP is required] [PP for] [NP TCP biogenesis] .
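The within-phrase flag above reduces to a span containment check; the chunk encoding (start, exclusive end, tag) is an illustrative assumption:

```python
def same_chunk(chunks, i, j):
    """Ch.Phr flag: return the chunk tag if token positions i and j
    fall inside the same chunk span, else None.  `chunks` is a list
    of (start, end, tag) spans with `end` exclusive, as a
    CoNLL-2000-style chunker might produce."""
    for start, end, tag in chunks:
        if start <= i < end and start <= j < end:
            return tag
    return None
```

For the example sentence chunked as [NP Extracellular TcpQ] [VP is required] [PP for] [NP TCP biogenesis], the mentions “TcpQ” (position 1) and “TCP” (position 5) are not in the same chunk, so the flag stays off.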
Shallow Parse Tree Features
Subsequences within the SPS.LCS
• Inform the classifier about the subsequences.
• Two versions (Zelenko et al, 2003):
  – Contiguous: based on all the subtrees with n edges.
  – Non-contiguous (sparse): based on subtrees that allow gaps.
Dependency Parse Features (Dep)

[Figure: dependency parse of “PROTEIN1 is a LOCATION1 lipoprotein required for LOCATION2 biogenesis” — head “lipoprotein” with edges nsubj, cop, det, num, partmod, prep_for and nn.]
Semantic Role Labeling
Overlap with SRL Structures
• Features extracted from a sentence’s semantic role labeling (Harabagiu et al, 2005):
  – The predicate argument number associated with the entity mention (A0, A1, A2, …).
    • E.g. is an entity mention associated with role A1?
  – The verb associated with the argument (e.g. be, require).
    • E.g. the verb “be” is associated with the entity mention Ei.
[Figure: example semantic role labelings — predicates “located”, “show” and “make” with arguments A0, A1, A2, AM-LOC and AM-MNR over mentions including LcnC(PROTEIN), LacZ(PROTEIN), PhoA(PROTEIN), alkaline phosphate(PROTEIN), galactosidase(PROTEIN), “it”, and Cytoplasm(CELL_REGION).]
One Classifier or Many?
• Which is better:
  – One classifier for detection and classification, or at least two?
  – If at least two, then one multi-class classifier for all relations, or many binary classifiers (one per relation)?
  – Current empirical evidence suggests:
    • One classifier for detection.
    • One classifier per relation, for classification.
Miller et al’s example of semantic annotation
Hidden Markov Model-based
• One of the first statistical approaches applied to the task.
• Akin to a stochastic version of the finite state automata successfully used in the FASTUS system.
• Efficient algorithms exist for:
  – learning the model’s parameters from word sequences
  – computing a sequence’s probability given the model
  – finding the highest probability path through the model’s states.
• The challenge has been to include more features in the models:
  – (McCallum et al, 2000) include capitalization, formatting, and POS.
  – (Ray and Craven, 2001) added shallow-parse tree features.
  – (Skounakis et al, 2003) use hierarchical HMMs to represent syntax.
(Brin, 1998 and Agichtein and Gravano, 2000)
• DIPRE and Snowball use resampling with replacement.
• Snowball:
  1. Uses NER and classification to better restrict the relation cases considered.
  2. Uses word unigrams instead of the single feature of the word sequence.
  3. Uses a discriminative algorithm.
  4. Stops iterating based on a threshold on the Precision.
• Advantages: the word-based patterns that make up its classifier can be inspected by a domain expert.
  – E.g. <ORG> {<'s 0.7> <in 0.3>, <headquarters 0.7>} <LOC>.
• Challenges:
  – more than six thresholds need to be manually set.
  – experimental evidence does not support its bootstrapping approach.
Hidden Markov Model-based (cont.)
• Train two HMM models:
  – A model (λ) from positive cases.
  – A null model (λ̄) from negative ones.
• Given a test case sequence S, the probabilities P(λ | S) and P(λ̄ | S) are computed.
• Once the log-odds of the prior probability of the relation sequence is calculated, each test case’s label is decided based on the log of the ratio of the probabilities:
  log odds(S) = log [ P(λ | S) / P(λ̄ | S) ]
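The decision rule above reduces to comparing a log-ratio against a threshold derived from the prior; a minimal sketch, with the probability inputs standing in for the two HMMs’ scores:

```python
import math

def log_odds(p_pos, p_neg, log_prior=0.0):
    """Label a sequence True when the log-ratio of the positive
    model's score to the null model's score, shifted by the prior
    log-odds, exceeds zero."""
    return math.log(p_pos) - math.log(p_neg) + log_prior

def decide(p_pos, p_neg, log_prior=0.0):
    return "True" if log_odds(p_pos, p_neg, log_prior) > 0 else "False"
```

E.g. a case the positive model scores ten times higher than the null model is accepted for any prior log-odds above −log(10).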
Hasegawa et al, 2004
• Detect and classify all relation cases that require the same two argument types.
• E.g. R(PERSON, GEO-POLITICAL ENTITY)
  – CitizenOf(), PresidentOf(), EnemyOf()
• Approach:
  – Use hierarchical clustering and a cosine similarity function.
    • Clusters correspond to cases of the same relation.
    • A cluster can be described by a small set of words that frequently appear in the cluster.
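The clustering approach above can be sketched with sparse context-word vectors and a greedy single-link grouping; the data shapes and the single-link choice are illustrative assumptions, not Hasegawa et al’s exact procedure:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors
    (dicts mapping context words to counts)."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(pairs, threshold=0.5):
    """Greedy single-link clustering of (name, context-vector) items:
    entity pairs with similar intervening contexts are grouped as
    mentions of the same (unnamed) relation."""
    clusters = []
    for name, vec in pairs:
        for c in clusters:
            if any(cosine(vec, v) >= threshold for _, v in c):
                c.append((name, vec))
                break
        else:
            clusters.append([(name, vec)])
    return clusters
```

Entity pairs whose contexts share words like “president of” land in one cluster, while pairs described by “enemy” form another; each cluster’s frequent words then serve as its relation label.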
Global Inference Approaches
• An alternative to the pipelined approach.
• Globally model all of the decisions in order to capture the mutual influences that exist with downstream decisions.
• An opportunity exists to repair incorrectly labeled entity mentions.
• For example, a typed relation detection algorithm could predict that an entity mention currently labeled as “GENE” is likely incorrect because a relation case it participates in requires that the argument be a “PROTEIN”.
• (Roth and Yih, 2002) propose to use the dependencies between relation and entity mentions to repair their labels.
  – First induce separate classifiers for entity detection and classification and for relation detection and classification. Any state-of-the-art supervised algorithm presented above could be used.
  – Next, they perform global inference based on the conditional distributions of the two classifiers.
• (Miller et al, 2000) and (Shi et al, 2007) also perform global inference and report repairing a noticeable number of mislabeled entity mentions.
ACE-2004 Results
                                                                          5 Major Types       24 Subtypes
Paper           Method   Type            Features                         P     R     F1      P     R     F1
ZhouZJZ, 2007   Kernel   Composite       Tok(1)+Em+Ch+PS+Dep              82.2  70.2  75.8    70.3  62.2  66.0
JiangZ, 2007    Feature  Logistic Reg.   Tok(1+2)+Em(-Ext)+PS.SPET(1+2+3) 74.6  71.3  72.9    N/A
ZhangZSZ, 2006  Kernel   Composite       Em+Ch+PS.SPET+Dep.SP             76.1  68.4  72.1    68.6  59.3  63.6
JiangZ, 2007    Feature  Logistic Reg.   PS.SPET(1+2+3)                   72.6  68.8  70.7    N/A
ZhaoG, 2005     Kernel   Composite+poly  Tok(1)+Em(-Ext)+Dep.SP           69.2  70.5  70.4    N/A
ZhangZSZ, 2006  Kernel   Composite       Em+Ch+PS.SPET+Dep.SP             73.5  67.0  70.1    N/A
JiangZ, 2007    Feature  Logistic Reg.   Tok(1+2+3)                       71.7  65.3  68.3    N/A
JiangZ, 2007    Feature  Logistic Reg.   Tok(1+2)                         66.2  70.1  68.1    N/A
ZhangZSZ, 2006  Kernel   Convolution     PS.SPET                          72.5  56.7  63.6    N/A
JiangZ, 2007    Feature  Logistic Reg.   Tok(1)                           64.7  61.4  63.0    N/A
ZhangZSZ, 2006  Kernel   Linear          Em                               75.1  42.7  54.4    N/A
Grounding Entity Mentions to an Ontology
• This step is essential for information extraction, but can be cumbersome and difficult to automate.
• E.g. a biologist would likely require the protein sequence (e.g. MKQSTIALAL…), not the protein name (e.g. “alkaline phosphatase”). The sequence can be found in a master database such as Swiss-Prot, but at least two organisms have proteins with the same name.
• Idea: Use the relation information to disambiguate between ontology entries.
Qualifying the Certainty of a Relation Case
• It would be useful to qualify the certainty that can be assigned to a relation mention.
• E.g. In the news domain, distinguish relation mentions based on first hand information versus those based on hearsay.
• Idea: Add an additional label to each relation case that qualifies the certainty of the statement. E.g. in the PPLRE task label cases with: “directly validated”, “indirectly validated”, “hypothesized”, and “assumed”.