Inductive Approaches to the Detection and Classification of Semantic Relation Mentions
Depth Report Examination Presentation
Gabor Melli
August 27, 2007
http://www.gabormelli.com/2007/2007_DepthReport_Melli_Presentation.ppt
Overview
• Introduction (~5 mins.)
• Task Description (~5 mins.)
• Predictive Features (~10 mins.)
• Inductive Algorithms (~10 mins.)
• Benchmark Tasks (~5 mins.)
• Research Directions (~5 mins.)
Simple examples of the “shallow” semantics sought
• “E. coli is a bacteria.” → RTypeOf(E. coli, bacteria)
• “An organism has proteins.” → RPartOf(proteins, organism)
• “IBM is based in Armonk, NY.” → RHeadquarterLocation(IBM, Armonk, NY)
Motivations
• Information Retrieval
  – Researchers could retrieve scientific papers based on relations.
    • E.g. “all papers that report localization experiments on V. cholera’s outer membrane proteins”
  – Judges could retrieve legal cases.
    • E.g. “all Supreme Court cases involving third party liability claims”
• Information Fusion
  – Researchers could populate a database with semantic relations in research articles.
    • E.g. SubcellularLocalization(Organism, Protein, Location)
  – Activists could save resources when compiling statistics from newspaper reports.
• Document Summarization, Question Answering, …
State-of-the-Art
• Current focus is to automatically induce predictive patterns/classifiers.
  – Can be more quickly applied to a new domain than an engineered solution.
• Human levels of competency are nearby.
  – F-measure:
    • 76% on the ACE-2004 benchmark task (Zhou et al, 2007)
    • 75% on a protein/gene interaction task (Fundel et al, 2007)
    • 72% on the SemEval-2007 task (Beamer et al, 2007)
  – Though under simplified conditions:
    • binary relations within a single sentence
    • perfectly classified entity mentions.
Shallow semantic analysis is challenging
• Many ways to say the same thing
  – O is based in L.; L-based O …; Headquartered in L, O …; From its L headquarters, O …
• Many relations to disambiguate among.
“The pilus(location) of V. cholerae(organism) is essential for intestinal(location) colonization.
TcpC(protein) is an outer membrane(location) lipoprotein required for pilus(location) biogenesis.”
Next Section
• Introduction
• Task Description
• Predictive Features
• Inductive Algorithms
• Benchmark Tasks
• Research Directions
Task Description
• Documents, Tokens, Sentences
• Entity Mentions: Detected and Classified
• Semantic Relation Cases and Mentions
• Performance Metrics
• Comparison with Information Extraction Task
• What name for the task?
• General Pipelined Process
• Subtask: Relation Case Generation
• Subtask: Relation Case Labeling
• Naïve Baseline Algorithms
Document, Tokens, Sentences
Entity Mentions are pre-Detected (and pre-Classified)
Semantic Relations
• A relation with a fixed set of two or more arguments.
  Ri(Arg1, …, Arga) → {TRUE, FALSE}
• Examples:
  – TypeOf(E.coli, Bacteria) → TRUE
  – OrgLocation(IBM, Jupiter) → FALSE
  – SCL(V.cholerae, TcpC, Extracellular) → TRUE
Semantic Relation Cases
• Some permutation of distinct entity mentions within the document.
• D1: “E.coli1 is a bacteria2. As with all bacteria3, E.coli4 has a cytoplasm5”
C(Ri, D1, E1, E2)
C(Ri, D1, E2, E1)
…
C(Rj, D1, E4, E3, E5)
C(Rj, D1, E3, E4, E5)
• With e entity mentions and relations of arity amax, the number of relation cases c is:
  c = e! / (e − amax)!
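The case count above can be checked directly. A minimal helper (illustrative, not from the report):

```python
from math import factorial

def relation_case_count(e: int, a_max: int) -> int:
    """Number of candidate relation cases: ordered selections of
    a_max distinct entity mentions from the e mentions in a
    document, i.e. e! / (e - a_max)!."""
    return factorial(e) // factorial(e - a_max)
```

For the example document D1 with five entity mentions and binary relations (amax = 2), this gives 5!/3! = 20 candidate cases, which is why the case space grows so quickly with arity.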
Semantic Relation Detection vs. Classification
C(R, Di, Ej, …, Ek) → ?
• Relation Detection: {True, False}
  – Predict whether this is a true mention of some semantic relation.
• Relation Classification: {1, 2, …, r}
  – Predict the semantic relation Rj associated with a relation mention.
Test and Training Sets
• Training set (labeled):
  C(R1, D1, E1, E2) → F
  C(R1, D1, E1, E3) → T
  …
  C(Rr, Dd, E2, E3, E5) → F
  C(Rr, Dd, E3, E4, E5) → F
• Test set (unlabeled):
  C(R?, Dd+1, E1, E2) → ?
  …
  C(R?, Dd+k, Ex, …, Ey) → ?
Performance Metrics
• Precision (P): probability that a test case that is predicted to have label True is tp.
• Recall (R): probability that a True test case will be tp.
• F-measure (F1): harmonic mean of the Precision and Recall estimates.
  P = TP / (TP + FP)
  R = TP / (TP + FN)
  F1 = 2PR / (P + R)
• Accuracy: proportion of predictions with correct labels, True or False.
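The three metrics follow mechanically from the outcome counts; a small sketch (guarding the empty-denominator cases, which the slide formulas leave implicit):

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, Recall and F1 from prediction-outcome counts.
    Each ratio falls back to 0.0 when its denominator is empty."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

E.g. with 8 true positives, 2 false positives and 8 false negatives: P = 0.8, R = 0.5, F1 = 2·0.8·0.5/1.3 ≈ 0.615.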
Pipelined Process Framework
• Training Phase: Training Documents → Natural Language Processing → Relation Case Generation (unlabeled relation cases) → Feature Generation → Case Labeling (labeled relation cases) → Model Induction → Classifier.
• Testing Phase: Testing Documents → Natural Language Processing → Relation Case Generation → Feature Generation → Prediction Generation → Predictions.
Next Section
• Introduction
• Task Description
• Predictive Features
• Inductive Algorithms
• Benchmark Tasks
• Research Directions
Predictive Feature Categories
1. Token-based
2. Entity Mention Argument-based
3. Chunking-based
4. Shallow Phrase-Structure Parse Tree-based
5. Phrase-Structure Parse Tree-based
6. Dependency Parse Tree-based
7. Semantic Role Label-based
[Figure: relation-case feature space — each row is a relation case C(R, D, Ej, Ek) followed by binary feature columns, grouped by category: Token (Tk), Entity Mention (Em), Chunking (Ch), Shallow PTree (SPS), PS PTree (PS), Dep. PTree (Dep), SRL (SRL).]
Vector of Feature Information
“Protein1 is a Location1 lipoprotein required for Location2 biogenesis.”
Token-based Features  “Protein1 is a Location1 ...”
• Token Distance
  – 2 intervening tokens
• Token Sequence(s)
  – Unigrams: the, of, ., ,, and, in, a, …, pyelonephritis → (0 0 0 0 0 0 1 … 0)
  – Bigrams: of the, and the, is the, is a, …, causes pyelonephritis → (0 0 0 1 … 0)
Token-based Features (cont.)
• Stemmed Word Sequences
  – “banks” → “bank”
  – “scheduling” → “schedule”
• Disambiguated Word-Sense (WordNet)
  – “bank” → river’s edge; financial inst.; row of objects
• Token Part-of-Speech Role Sequences
  – Unigrams: NN, IN, JJ, DT, COMMA, …, WP → (0 0 0 1 0 … 0)
  – Bigrams: DT IN, IN NN, JJ DT, NN JJ, AUX DT, …, AUX RBS → (0 0 0 1 … 0)
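The token-based features above can be sketched for a single relation case; the function and feature names here are illustrative, not from a specific system:

```python
def token_features(tokens, i, j, vocab_uni, vocab_bi):
    """Sparse token-based features for a relation case whose two
    argument mentions sit at token positions i and j (i < j):
    token distance plus unigram/bigram indicator features over
    the intervening tokens."""
    between = tokens[i + 1:j]
    feats = {"tok_dist": len(between)}
    for t in between:                        # unigram indicators
        if t in vocab_uni:
            feats["uni=" + t] = 1
    for a, b in zip(between, between[1:]):   # bigram indicators
        bigram = a + " " + b
        if bigram in vocab_bi:
            feats["bi=" + bigram] = 1
    return feats
```

On the slide’s example, the pair (Protein1, Location1) in “Protein1 is a Location1 …” yields a token distance of 2 and fires the indicators for “is”, “a” and “is a”.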
Entity Mention-based Features
• Entity Mention Tokens
  – IBM → 1 token; Tierra del Fuego → 3 tokens; …
• Entity Mention’s Semantic Type– Semantic Class
• Organization• Location
– Subclass• Company; University; Charity• Country; Province; Region; City
Entity Mention Features (cont.)
• Entity Mention Type
  – Name: John Doe, E. coli, periplasm, …
  – Nominal: the president, the country, …
  – Pronominal: he, she, they, it, …
• Entity Mention’s Ontology Id
  – secreted; extracellular → GO:0005576
  – E. coli; Escherichia coli → 571 (NCBI tax_id)
Phrase-Structure Parse Tree
http://lfg-demo.computing.dcu.ie/lfgparser.html
Shortest-Path Enclosed Tree
• Loss of context?

Two types of subtrees proposed
• Elementary subtrees vs. general subtrees
• Both approaches lead to an exponential number of subtree features!
Now we have a populated feature space
[Figure: the populated relation-case feature space — rows of relation cases with binary feature columns (Token, Entity Mention, Chunking, Shallow PTree, PS PTree, Dep. PTree, SRL) and a True/False label column.]
Next Section
• Introduction
• Task Description
• Predictive Features
• Inductive Algorithms
• Benchmark Tasks
• Research Directions
Inductive Approaches Available
• Supervised Algorithms
  – Require a training set.
• Semi-supervised Algorithms
  – Also accept an unlabeled set.
• Unsupervised Algorithms
  – Do not use a training set.
• Most solutions restrict themselves to the task of detecting and classifying binary relation cases that are intra-sentential.
Supervised Algorithms
• Discriminative model
  – Feature-based (state of the art)
    • E.g. k-Nearest Neighbor, Logistic Regression, …
  – Kernel-based (state of the art)
    • E.g. Support Vector Machine
• Generative model
  – E.g. Probabilistic Context Free Grammars and Hidden Markov Models
Feature-based Algorithms
• Kambhatla, 2004
  – Early proposal to use a broad set of features.
• Liu et al, 2007
  – Proposed the use of features previously found to be predictive for the task of Semantic Role Labeling.
• Jiang and Zhai, 2007
  – Used bigram and trigram PS parse tree subtree features (and dependency parse tree subtrees).
  – Adding trigram-based features produced only marginal improvement in performance; therefore only marginal improvement is likely from adding higher-order subtrees.
Kernel-based Induction
• Zelenko et al, 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhao and Grishman, 2005; Zhang et al, 2006.
• Require a kernel function, K(C1,C2) → [0,∞], that maps any two feature vectors to a similarity score from within some transformed space.
• If symmetric and positive definite then comparison between vectors can often be performed efficiently in a high-dimensional space.
• If cases are separable in that space then the kernel attains the benefit of the high-dimensional space without explicitly generating the feature space.
Kernel by Zhang et al, 2006
• Applies the Convolution Tree Kernel proposed in (Collins and Duffy, 2001; Haussler, 1999)
• Number of common subtrees:
  KC(T1, T2) = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Δ(n1, n2)
  – Nj is the set of parent nodes in tree Tj
  – Δ(n1, n2) evaluates the common sub-trees rooted at n1 and n2
Kernel computed recursively in O(|N1| · |N2|)
  – Δ(n1, n2) = 0 if the productions at n1 and n2 differ
  – Δ(n1, n2) = 1 if n1 and n2 are POS nodes
  – Otherwise,
    Δ(n1, n2) = λ ∏_{k=1}^{#ch(n1)} (1 + Δ(ch(n1, k), ch(n2, k)))
  • #ch(ni) is the number of children of node ni
  • ch(n, k) is the kth child of node n
  • λ (0 < λ < 1) is a decay factor
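The recursion above can be sketched over trees encoded as nested tuples, (label, child, …), with word leaves as plain strings; the encoding is an assumption for illustration, and the recursion follows the slide’s statement of the kernel:

```python
def production(node):
    """A node is (label, child, ...); leaves are plain word strings.
    The production is the node label plus its children's labels."""
    return (node[0],) + tuple(c if isinstance(c, str) else c[0] for c in node[1:])

def is_pos_node(node):
    """POS (preterminal) node: a single word-string child."""
    return len(node) == 2 and isinstance(node[1], str)

def delta(n1, n2, lam=0.5):
    """Delta(n1, n2): 0 if productions differ, 1 for matching POS
    nodes, otherwise the decayed product over paired children."""
    if production(n1) != production(n2):
        return 0.0
    if is_pos_node(n1):
        return 1.0
    out = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        out *= 1.0 + delta(c1, c2, lam)
    return out

def nodes(t):
    """All tree nodes (word leaves excluded)."""
    if isinstance(t, str):
        return []
    result = [t]
    for c in t[1:]:
        result += nodes(c)
    return result

def tree_kernel(t1, t2, lam=0.5):
    """K_C(T1, T2): sum of Delta over all node pairs."""
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))
```

For a tiny tree ("S", ("NP", ("DT", "a"), ("NN", "dog")), ("VP", ("VB", "runs"))) and λ = 0.5, the kernel of the tree with itself works out to 9.0, since only identically-produced node pairs contribute.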
Generative Models Approaches
• Earliest approach (Leek, 1997; Miller et al, 1998).
• Instead of directly estimating model parameters for the conditional probability P(Y | X):
  – Estimate model parameters for P(X | Y) and P(Y) from the training set.
  – Then apply Bayes’ rule to decide which label has the highest posterior probability.
• If the model fits the data then the resulting likelihood ratio estimate is known to be optimal.
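The Bayes decision described above reduces to picking the label maximizing P(x | y)·P(y); a minimal sketch, where the likelihood and prior tables are illustrative stand-ins for trained model parameters:

```python
def bayes_decide(x, likelihood, prior):
    """Return the label y maximizing the (unnormalized) posterior
    P(y | x) ∝ P(x | y) * P(y).  `likelihood[y]` is a callable
    giving P(x | y); `prior[y]` gives P(y)."""
    return max(prior, key=lambda y: likelihood[y](x) * prior[y])
```

E.g. a case that the positive model scores highly is labeled True even under a modest prior, while an ambiguous case falls back to the prior-favored label.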
Two Approaches Surveyed
• Probabilistic Context Free Grammars
  – Miller et al, 1998; Miller et al, 2000
• Hidden Markov Models
  – Leek, 1997
  – McCallum et al, 2000
  – Ray and Craven, 2001; Skounakis, Craven, and Ray, 2003
PCFG-based Model
Miller et al, 1998/2000
• From the augmented representation, learn a PCFG based on these trees.
• Infer the maximum likelihood estimates of the probabilities based on the frequencies in the training corpus, along with an interpolated adjustment of lower order estimates to handle the (increased) challenge of data sparsity.
• Parses of test cases that contain the semantic labels are predicted to be relation mentions.
Semi-Supervised Approaches
• (Brin, 1998; Agichtein and Gravano, 2000)
  – Use token-based features.
  – Apply resampling with replacement.
  – Assume that relations in the training set are redundantly present and restated in the test set.
• (Shi et al, 2007)
  – Uses the (Miller et al, 1998/2000) approach.
  – Uses a naïve baseline to convert unlabelled cases to true training cases.
Snowball’s Bootstrapping
(Xia, 2006)
Unsupervised Use of Lexico-Syntactic Patterns
• Suggested initially by (Hearst, 1992).
• Applied to relation detection by (Pantel et al, 2004; Etzioni et al, 2005).
• Sample patterns:
  – <Class> such as <Member1>, …, <Memberi>
  – <Class> like <Member1> and <Member2>
  – <Member> is a <Class>
  – <Class>, including <Member>
• Suited for the detection of TypeOf() subsumption relations over large corpora.
Next Section
• Introduction
• Task Description
• Predictive Features
• Inductive Algorithms
• Benchmark Tasks
• Research Directions
Benchmark Tasks
• Message Understanding Conference (MUC)
  – DARPA, (1989–1997), Newswire
  – TR task: Location_Of(ORG, LOC); Employee_of(PER, ORG); and Product_Of(ARTIFACT, ORG)
• Automatic Content Extraction (ACE)
  – NIST, (2002–…), Newswire
  – Relation Mention Detection: ~5 major, ~24 minor rels
  – Physical(E1, E2); Social(Personx, Persony); Employ(Org, Person); …
• Protein Localization Relation Extraction
  – SFU, (2006–…)
  – SubcellularLocation(Organism, Protein, Location)
Message Understanding Conference 1997 (Miller et al, 1998)

ACE-2003 Results
                                                                     5 Major Types       24 Subtypes
Paper              Method   Type            Features                 P     R     F1      P     R     F1
ZhouZJZ, 2007      Kernel   Composite       PS+Em+Ch+Dep+Ext         80.8  68.4  74.1    65.2  54.9  59.6
ZhouZJZ, 2007      Kernel   Convolution     PS                       80.1  63.8  71.0    63.4  51.9  57.1
ZhangZSZ, 2006     Kernel   Composite+poly  Em+Ch+PS.SPET+Dep.SP     77.3  65.6  70.9    64.9  51.2  57.2
ZhangZSZ, 2006     Kernel   Composite       Em+Ch+PS.SPET+Dep.SP     76.3  63.0  69.0    N/A
ZhangZS, 2006      Kernel   Composite       Em+PS+Ext                76.3  63.0  69.0    64.6  50.8  56.8
ZhangZS, 2006      Kernel   Composite       Em+PS                    76.1  62.9  68.9    N/A
ZhouSZZ, 2005      Feature  SVM             Tok(1)+Em+Ch+PS.SPET+Dep.SP  77.2  60.7  68.0    63.1  49.5  55.5
ZhangZS, 2006      Kernel   Convolution     PS Ptree                 72.8  53.8  61.9    N/A
HarabagiuBM, 2005  Kernel   Kernel          Tok(1+n)+Em+SRL          72.2  44.5  55.1    N/A
Kambhatla, 2004    Feature  MaxEnt          Tok(1+n)+Em+Dep.         63.5  45.2  52.8    N/A
ZhangZSZ, 2006     Kernel   Linear          Em                       79.5  34.6  48.2    N/A
CulottaS, 2004     Kernel   Tree            Em+Dep.SP                67.1  35.0  45.8    N/A
HarabagiuBM, 2005  Kernel   Kernel          Tok(1+n)+Em              60.5  20.3  30.4    N/A
Prokaryote Protein Localization Relation Extraction (PPLRE) Task

Paper  Approach    Method    Features              P     R     F1
Shi07  Baseline    All True  n/a                   14.1  61.5  23.0
Shi07  Discrim.    Snowball  NER+Tokens(uni)       66.6  18.2  26.2
Shi07  Generative  LPCFG     PS PTree+NER          58.6  26.2  36.2
Shi07                        PS PTree+NER          59.4  29.2  39.0
Shi07                        PS PTree+NER+Coref.   64.7  33.8  44.4
“The pilus(location) of V. cholerae(organism) is essential for intestinal colonization.
The pilus(location) biogenesis apparatus is composed of nine proteins.
TcpC(protein) is an outer membrane(location) lipoprotein required for pilus(location) biogenesis.”
Next Section
• Introduction
• Task Description
• Predictive Features
• Inductive Algorithms
• Benchmark Tasks
• Research Directions
Research Directions
1. Additional Features/Knowledge
2. Inter-sentential Relation Cases
3. Relations with More than Two Arguments
4. Grounding Entity Mentions to an Ontology
5. Qualifying the Certainty of a Relation Case
Additional Features/Knowledge
• Expose additional features that can identify the more esoteric ways of expressing a relation.
• Features from outside of the “shortest path”.
  – Challenge: past open-ended attempts have reduced performance (Jiang and Zhai, 2007).
  – (Zhou et al, 2007) add heuristics for five common situations.
• Use domain-specific background knowledge.
  – E.g. Gram-positive bacteria (such as M. tuberculosis) do not have a periplasm, therefore do not predict periplasm.
Inter-sentential Relation Cases
• Challenge: current approaches focus on syntactic features, which cannot be extended beyond the sentence boundary.
  – Idea: apply Centering Theory (Hirano et al, 2007).
  – Idea: create a text graph and apply graph mining.
• Challenge: a significant increase in the proportion of false relation cases.
  – Idea: a threshold on the number of pairings any one entity mention can take.
Relations with > Two Arguments
• Idea: decompose the problem into a set of (n − 1) binary relations and then join relation cases that share an entity mention (Shi et al, 2007; Liu et al, 2007).
  – How to pick the ‘shared’ entity mention?
  – How much information is lost?
• Idea: create a unified feature vector with features associated with each entity mention pair.
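The decompose-and-join idea above can be sketched for a ternary relation R(A, B, C) joined on the shared middle argument; the function and data shapes are illustrative assumptions:

```python
def join_binary_cases(pairs_ab, pairs_bc):
    """Join binary relation cases R1(A, B) and R2(B, C) on the shared
    entity mention B to recover ternary cases R(A, B, C)."""
    by_b = {}
    for a, b in pairs_ab:
        by_b.setdefault(b, []).append(a)
    return [(a, b, c) for (b, c) in pairs_bc for a in by_b.get(b, [])]
```

E.g. joining Organism-Protein and Protein-Location cases on the protein mention reassembles SubcellularLocalization triples; information is lost whenever the two binary classifiers disagree on a shared mention.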
Shortened Reference List
ACE Project. (2002-2007). http://www.ldc.upenn.edu/Projects/ACE/.
E. Agichtein, and L. Gravano. (2000). Snowball: Extracting Relations from Large Plain-Text Collections. In Proc. of DL-2000.
D. E. Appelt, J. R. Hobbs, J. Bear, D. J. Israel, and M. Tyson. (1993). FASTUS: A Finite-state Processor for Information Extraction from Real-world Text. In Proc. of IJCAI-1993.
B. Beamer, S. Bhat, B. Chee, A. Fister, A. Rozovskaya, and R. Girju. (2007). UIUC: A Knowledge-rich Approach to Identifying Semantic Relations between Nominals. In Proc. of the Fourth International Workshop on Semantic Evaluations (SemEval-2007).
R. Bunescu, and R. J. Mooney. (2005). A Shortest Path Dependency Kernel for Relation Extraction. In Proc. of HLT/EMNLP-2005.
C. Cardie. (1997). Empirical Methods in Information Extraction. AI Magazine, 18(4).
M. Craven, and J. Kumlien. (1999). Constructing Biological Knowledge-bases by Extracting Information from Text Sources. In Proc. of the International Conference on Intelligent Systems for Molecular Biology.
A. Culotta, and J. S. Sorensen. (2004). Dependency Tree Kernels for Relation Extraction. In Proc. of ACL-2004.
O. Etzioni, M. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. (2005). Unsupervised Named-Entity Extraction from the Web: An Experimental Study. Artificial Intelligence, 165(1).
K. Fundel, R. Kuffner, and R. Zimmer. (2007). RelEx--Relation Extraction Using Dependency Parse Trees. Bioinformatics, 23(3).
R. Grishman, and B. Sundheim. (1996). Message Understanding Conference - 6: A Brief History. In Proc. of COLING-1996.
S. M. Harabagiu, C. A. Bejan and P. Morarescu. (2005). Shallow Semantics for Relation Extraction. In Proc. of IJCAI-2005.
T. Hasegawa, S. Sekine, and R. Grishman. (2004). Discovering Relations among Named Entities from Large Corpora. In Proc. of ACL-2004.
J. Jiang and C. Zhai. (2007). A Systematic Exploration of the Feature Space for Relation Extraction. In Proc. of NAACL/HLT-2007.
N. Kambhatla. (2004). Combining Lexical, Syntactic, and Semantic Features with Maximum Entropy Models for Extracting Relations. In Proc. of ACL-2004.
T. R. Leek. (1997). Information Extraction Using Hidden Markov Models. M.Sc. Thesis, University of California, San Diego.
Y. Liu, Z. Shi and A. Sarkar. (2007). Exploiting Rich Syntactic Information for Relation Extraction from Biomedical Articles. In Proc. of NAACL/HLT-2007.
S. Miller, H. Fox, L. Ramshaw, and R. Weischedel. (2000). A Novel Use of Statistical Parsing to Extract Information from Text. In Proc. of NAACL-2000.
S. Ray, and M. Craven. (2001). Representing Sentence Structure in Hidden Markov Models for Information Extraction. In Proc. of IJCAI-2001.
D. Roth, and W. Yih. (2002). Probabilistic Reasoning for Entity & Relation Recognition. In Proc. of COLING-2002.
Z. Shi. (2007). Ph.D. Thesis. Forthcoming.
Z. Shi, A. Sarkar and F. Popowich. (2007). Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques. In Proc. of NAACL/HLT-2007.
M. Skounakis, M. Craven and S. Ray. (2003). Hierarchical Hidden Markov Models for Information Extraction. In Proc. of IJCAI-2003.
F. M. Suchanek, G. Ifrim and G. Weikum. (2006). Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents. In Proc. of KDD-2006.
D. Zelenko, C. Aone, and A. Richardella. (2003). Kernel Methods for Relation Extraction. Journal of Machine Learning Research, Vol. 3.
M. Zhang, J. Su, D. Wang, G. Zhou and C. Lim. (2005). Discovering Relations between Named Entities from a Large Raw Corpus Using Tree Similarity-based Clustering. In Proc. of IJCNLP-2005.
M. Zhang, J. Zhang, and J. Su. (2006). Exploring Syntactic Features for Relation Extraction Using a Convolution Tree Kernel. In Proc. of HLT-2006.
S. Zhao, and R. Grishman. (2005). Extracting Relations with Integrated Information Using Kernel Methods. In Proc. of ACL-2005.
G. Zhou, M. Zhang, D. Ji and Q. Zhu. (2007). Tree Kernel-Based Relation Extraction with Context-Sensitive Structured Parse Tree Information. In Proc. of ACL-2007.
The End
Backup Slides for Questions
Entity Mentions pre-Detected
Typed Semantic Relations
• require that the semantic relation’s arguments also be associated with a semantic class.
• For example, argument A1,2 may be associated with the semantic class ORGANIZATION.
Information Extraction vs. Relation Detection and Classification
• Some of the surveyed algorithms such as (Brin, 1998; Miller et al, 1998; Agichtein and Gravano, 2000, Suchanek et al, 2006) are presented in the literature as information extraction algorithms, not as relation detection and classification algorithms.
• They are included in the survey nonetheless because they can naturally be applied to the task of relation detection and classification. This situation is to be expected because the identification of relation mentions can be a natural preprocessing step to information extraction (ACE, 2002-2007).
• IE = “populate a relational database table” (or fill in the slots of a template), where each record represents an instance of an entity or semantic relation in the domain.
Information Extraction
• Example corpus:
  D1: “IBM1 is based in Armonk, NY2.”
  D2: “Armonk1-based International Business Machines Inc.2 is expanding its central office space3. IBM4’s decision to retain headquarters5 in New York6 came despite the many offers7 presented by the other states8 in the region9.”

Information Extraction detects duplicate relation cases

Organization                                Location          Mentions
International Business Machines Inc. (IBM)  Armonk, New York  D1(E1, E2), D2(E2, E1), D2(E4, E6)

OrgHeadquarterLocation mentions:
Document  A1                                         A2
D1        E1 (IBM)                                   E2 (Armonk, NY)
D2        E2 (International Business Machines Inc.)  E1 (Armonk)
D2        E4 (IBM)                                   E6 (New York)
Relation Detection and Classification → Information Extraction
What Task Name?
• Relation Extraction: Culotta and Sorensen, 2004; Harabagiu et al, 2005; Bunescu and Mooney, 2005; Zhang et al, 2005 and 2006; Jiang and Zhai, 2007; Xu et al, 2007; and Zhou et al, 2007.
• Relation Mention Detection (RMD): ACE Project, 2002 – 2007.
• Semantic Relation Identification: Beamer et al, 2007.
• Semantic Relation Classification: Girju et al, 2007.• Relation Detection: Zhao and Grishman, 2005• Relation Discovery: Hasegawa et al, 2004.• Relation Recognition: Roth and Yih, 2002.
Relation Case Generation
• Input (D, R): a text document D and a set of semantic relations R with a arguments.
• Output (C): a set of unlabelled semantic relation cases.
• Method:
  1. Identify all e entity mentions Ei in D.
  2. Create every combination of a entity mentions from the e mentions in the document (without replacement).
     – For intra-sentential semantic relation detection and classification tasks, limit the entity mentions to be from the same sentence.
     – For typed semantic relation detection and classification tasks, limit the combinations to those where there is a match between the semantic classes of each of the entity mentions Ei and the semantic class of their corresponding relation argument Ai.
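The generation method above can be sketched directly; the data shapes (mention id/class pairs, a sentence-index map, a type signature) are illustrative assumptions:

```python
from itertools import permutations

def generate_cases(mentions, arity, sentence_of=None, type_sig=None):
    """Unlabelled relation-case generation: every ordered selection of
    `arity` distinct entity mentions, optionally restricted to a single
    sentence and/or a matching semantic-type signature.
    `mentions` is a list of (id, semantic_class) pairs; `sentence_of`
    maps a mention id to its sentence index."""
    cases = []
    for combo in permutations(mentions, arity):
        if sentence_of and len({sentence_of[m[0]] for m in combo}) > 1:
            continue  # intra-sentential restriction
        if type_sig and tuple(m[1] for m in combo) != type_sig:
            continue  # typed-relation restriction
        cases.append(tuple(m[0] for m in combo))
    return cases
```

Without restrictions, three mentions and arity 2 yield all 3!/1! = 6 ordered cases; each restriction prunes that set.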
Relation Case Labeling
Naïve Baseline Algorithms
• Predict True: Always predicts “True” regardless of the contents of the relation case – Attains the maximum Recall by any algorithm on the task.– Attains the maximum F1 by any naïve algorithm.– Most commonly used naïve baseline.
• Predict Majority: Predicts the most prevalent class label in the training set.
  – Maximizes accuracy.
  – Degenerates to a “Predict False” algorithm on this task.
• Predict (Biased) Random: Randomly predicts "True" with probability matching the distribution of "True" cases in the testing dataset, “False” otherwise.– Trades-off some Precision and Recall for additional Accuracy.
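The three naïve baselines above can be sketched in a few lines; the function names are illustrative:

```python
import random

def predict_true(cases):
    """'Predict True' baseline: maximal Recall (1.0)."""
    return ["True"] * len(cases)

def predict_majority(train_labels, cases):
    """'Predict Majority': most frequent training label; on this task
    it usually degenerates to predicting 'False' everywhere."""
    majority = max(set(train_labels), key=train_labels.count)
    return [majority] * len(cases)

def predict_biased_random(train_labels, cases, seed=0):
    """'Predict (Biased) Random': 'True' at the training-set rate."""
    p_true = train_labels.count("True") / len(train_labels)
    rng = random.Random(seed)
    return ["True" if rng.random() < p_true else "False" for _ in cases]
```

Comparing a learned classifier against all three makes clear whether it is buying Precision, Recall, or mere Accuracy.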
Prediction Outcome Labels
• true positive (tp)
  – predicted to have the label True and whose label is indeed True.
• false positive (fp)
  – predicted to have the label True but whose label is instead False.
• true negative (tn)
  – predicted to have the label False and whose label is indeed False.
• false negative (fn)
  – predicted to have the label False but whose label is instead True.
Shallow Parse Tree
Chunking-based Features
• A shallow syntactic analysis of a sentence that is fast and somewhat domain robust. (Abney, 1989)
• Within a Phrase (Ch.Phr)
  – Flag whether the two entity mentions are inside the same noun phrase, verb phrase or prepositional phrase.

http://www.cnts.ua.ac.be/conll2000/chunking/
http://l2r.cs.uiuc.edu/~cogcomp/shallow_parse_demo.php

“Extracellular TcpQ is required for TCP biogenesis.”
[NP Extracellular TcpQ] [VP is required] [PP for] [NP TCP biogenesis] .
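The within-phrase flag above reduces to a span containment check; the chunk encoding (start, exclusive end, tag) is an illustrative assumption:

```python
def same_chunk(chunks, i, j):
    """Ch.Phr flag: return the chunk tag if token positions i and j
    fall inside the same chunk span, else None.  `chunks` is a list
    of (start, end, tag) spans with `end` exclusive, as a
    CoNLL-2000-style chunker might produce."""
    for start, end, tag in chunks:
        if start <= i < end and start <= j < end:
            return tag
    return None
```

For the example sentence chunked as [NP Extracellular TcpQ] [VP is required] [PP for] [NP TCP biogenesis], the mentions “TcpQ” (position 1) and “TCP” (position 5) are not in the same chunk, so the flag stays off.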
Shallow Parse Tree Features
Subsequences within the SPS.LCS
• Inform the classifier about the subsequences.
• Two versions (Zelenko et al, 2003):
  – Contiguous: based on all the subtrees with n edges.
  – Non-contiguous (sparse): based on subtrees that allow gaps.
Dependency Parse Features (Dep)

[Figure: dependency parse of “PROTEIN1 is a LOCATION1 lipoprotein required for LOCATION2 biogenesis” — head “lipoprotein” with edges nsubj, cop, det, num, partmod, prep_for and nn.]
Semantic Role Labeling
Overlap with SRL Structures
• Features extracted from a sentence’s semantic role labeling (Harabagiu et al, 2005):
  – The predicate argument number associated with the entity mention (A0, A1, A2, …).
    • E.g. is an entity mention associated with role A1?
  – The verb associated with the argument (e.g. be, require).
    • E.g. the verb “be” is associated with the entity mention Ei.
[Figure: example semantic role labelings — predicates “located”, “show” and “make” with arguments A0, A1, A2, AM-LOC and AM-MNR over mentions including LcnC(PROTEIN), LacZ(PROTEIN), PhoA(PROTEIN), alkaline phosphate(PROTEIN), galactosidase(PROTEIN), “it”, and Cytoplasm(CELL_REGION).]
One Classifier or Many?
• Which is better:
  – One classifier for detection and classification, or at least two?
  – If at least two, then one multi-class classifier for all relations, or many binary classifiers (one per relation)?
  – Current empirical evidence suggests:
    • One classifier for detection.
    • One classifier per relation, for classification.
Miller et al’s example of semantic annotation
Hidden Markov Model-based
• One of the first statistical approaches applied to the task.
• Akin to a stochastic version of the finite state automata successfully used in the FASTUS system.
• Efficient algorithms exist for:
  – learning the model’s parameters from word sequences
  – computing a sequence’s probability given the model
  – finding the highest probability path through the model’s states.
• The challenge has been to include more features in the models:
  – (McCallum et al, 2000) include capitalization, formatting, and POS.
  – (Ray and Craven, 2001) added shallow-parse tree features.
  – (Skounakis et al, 2003) use hierarchical HMMs to represent syntax.
(Brin, 1998 and Agichtein and Gravano, 2000)
• DIPRE and Snowball use resampling with replacement.
• Snowball:
  1. Uses NER and classification to better restrict the relation cases considered.
  2. Uses word unigrams instead of the single feature of the word sequence.
  3. Uses a discriminative algorithm.
  4. Stops iterating based on a threshold on the Precision.
• Advantages: the word-based patterns that make up its classifier can be inspected by a domain expert.
  – E.g. <ORG> {<'s 0.7> <in 0.3>, <headquarters 0.7>} <LOC>.
• Challenges:
  – more than six thresholds need to be manually set.
  – experimental evidence does not support its bootstrapping approach.
Hidden Markov Model-based (cont.)
• Train two HMM models:
  – A model (λ) from positive cases.
  – A null model (λ̄) from negative ones.
• Given a test case sequence S, the probabilities P(λ | S) and P(λ̄ | S) are computed.
• Once the log-odds of the prior probability of the relation sequence is calculated, each test case’s label is decided based on the log of the ratio of the probabilities:
  log odds(S) = log [ P(λ | S) / P(λ̄ | S) ]
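The decision rule above reduces to comparing a log-ratio against a threshold derived from the prior; a minimal sketch, with the probability inputs standing in for the two HMMs’ scores:

```python
import math

def log_odds(p_pos, p_neg, log_prior=0.0):
    """Label a sequence True when the log-ratio of the positive
    model's score to the null model's score, shifted by the prior
    log-odds, exceeds zero."""
    return math.log(p_pos) - math.log(p_neg) + log_prior

def decide(p_pos, p_neg, log_prior=0.0):
    return "True" if log_odds(p_pos, p_neg, log_prior) > 0 else "False"
```

E.g. a case the positive model scores ten times higher than the null model is accepted for any prior log-odds above −log(10).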
Hasegawa et al, 2004
• Detect and classify all relation cases that require the same two argument types.
• E.g. R(PERSON, GEO-POLITICAL ENTITY)
  – CitizenOf(), PresidentOf(), EnemyOf()
• Approach:
  – Use hierarchical clustering and a cosine similarity function.
    • Clusters correspond to cases of the same relation.
    • A cluster can be described by a small set of words that frequently appear in the cluster.
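The clustering approach above can be sketched with sparse context-word vectors and a greedy single-link grouping; the data shapes and the single-link choice are illustrative assumptions, not Hasegawa et al’s exact procedure:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors
    (dicts mapping context words to counts)."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(pairs, threshold=0.5):
    """Greedy single-link clustering of (name, context-vector) items:
    entity pairs with similar intervening contexts are grouped as
    mentions of the same (unnamed) relation."""
    clusters = []
    for name, vec in pairs:
        for c in clusters:
            if any(cosine(vec, v) >= threshold for _, v in c):
                c.append((name, vec))
                break
        else:
            clusters.append([(name, vec)])
    return clusters
```

Entity pairs whose contexts share words like “president of” land in one cluster, while pairs described by “enemy” form another; each cluster’s frequent words then serve as its relation label.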
Global Inference Approaches
• An alternative to the pipelined approach.
• Globally model all of the decisions in order to capture the mutual influences that exist with downstream decisions.
• An opportunity exists to repair incorrectly labeled entity mentions.
• For example, a typed relation detection algorithm could predict that an entity mention currently labeled as “GENE” is likely incorrect because a relation case it participates in requires that the argument be a “PROTEIN”.
• (Roth and Yih, 2002) propose to use the dependencies between relation and entity mentions to repair their labels.
  – First induce separate classifiers for entity detection and classification and for relation detection and classification. Any state-of-the-art supervised algorithm presented above could be used.
  – Next, they perform global inference based on the conditional distributions of the two classifiers.
• (Miller et al, 2000) and (Shi et al, 2007) also perform global inference and report repairing a noticeable number of mislabeled entity mentions.
ACE-2004 Results
                                                                          5 Major Types       24 Subtypes
Paper           Method   Type            Features                         P     R     F1      P     R     F1
ZhouZJZ, 2007   Kernel   Composite       Tok(1)+Em+Ch+PS+Dep              82.2  70.2  75.8    70.3  62.2  66.0
JiangZ, 2007    Feature  Logistic Reg.   Tok(1+2)+Em(-Ext)+PS.SPET(1+2+3) 74.6  71.3  72.9    N/A
ZhangZSZ, 2006  Kernel   Composite       Em+Ch+PS.SPET+Dep.SP             76.1  68.4  72.1    68.6  59.3  63.6
JiangZ, 2007    Feature  Logistic Reg.   PS.SPET(1+2+3)                   72.6  68.8  70.7    N/A
ZhaoG, 2005     Kernel   Composite+poly  Tok(1)+Em(-Ext)+Dep.SP           69.2  70.5  70.4    N/A
ZhangZSZ, 2006  Kernel   Composite       Em+Ch+PS.SPET+Dep.SP             73.5  67.0  70.1    N/A
JiangZ, 2007    Feature  Logistic Reg.   Tok(1+2+3)                       71.7  65.3  68.3    N/A
JiangZ, 2007    Feature  Logistic Reg.   Tok(1+2)                         66.2  70.1  68.1    N/A
ZhangZSZ, 2006  Kernel   Convolution     PS.SPET                          72.5  56.7  63.6    N/A
JiangZ, 2007    Feature  Logistic Reg.   Tok(1)                           64.7  61.4  63.0    N/A
ZhangZSZ, 2006  Kernel   Linear          Em                               75.1  42.7  54.4    N/A
Grounding Entity Mentions to an Ontology
• This step is essential for information extraction, but can be cumbersome and difficult to automate.
• E.g. a biologist would likely require the protein sequence (e.g. MKQSTIALAL…), not the protein name (e.g. “alkaline phosphatase”). The sequence can be found in a master database such as Swiss-Prot, but at least two organisms have proteins with the same name.
• Idea: Use the relation information to disambiguate between ontology entries.
Qualifying the Certainty of a Relation Case
• It would be useful to qualify the certainty that can be assigned to a relation mention.
• E.g. In the news domain, distinguish relation mentions based on first hand information versus those based on hearsay.
• Idea: Add an additional label to each relation case that qualifies the certainty of the statement. E.g. in the PPLRE task label cases with: “directly validated”, “indirectly validated”, “hypothesized”, and “assumed”.