View
218
Download
2
Tags:
Embed Size (px)
Citation preview
Applying Corpus Based Approaches using Syntactic Patterns and Predicate Argument Relations to Hypernym Recognition for Question Answering
Kieran White and Richard Sutcliffe
Contents
Motivation Objectives Experimental Framework P-System Classifications Larger Evaluation Comparison of Three Models Conclusions
Question answering and the DLT system TREC, CLEF and NTCIR
Four stages to answering a factoid query in a standard question answering system Identification of answer type required Document retrieval Named Entity recognition Answer selection
Motivation
Example from TREC 2003 How long is a quarter in an NBA game? Identity type of answer required
length_of_time
Document retrieval Boolean query: quarter AND NBA AND game Relax query if no documents returned
Motivation
Example from TREC 2003 Named Entity recognition
Locate instances of length_of_time Named Entities in documents
Answer selection Select one length_of_time Named Entity (e.g. 12 minutes)
using a scoring function
Motivation
Comparing query terms with those in supporting sentences White and Sutcliffe (2004) Compared terms in
50 TREC question answering factoid queries from 2003 Supporting sentences
Morphological relationships Identical terms (e.g. Washington Monument and
Washington Monument) Different inflections of terms (e.g. New York and New
York's Terms with different parts of speech (e.g. France and
French)
Motivation
Comparing query terms with those in supporting sentences Semantic relationships
Synonyms (e.g. orca and killer whale) Terms linked by a causal relationship (e.g. die and typhus) Word chains (e.g. Oscar and best-actress) Hypernyms (e.g. city and Berlin) Hyponyms (e.g. Titanic and ship) Meronyms (e.g. Death Valley and California Desert) Holonyms (e.g. 20th century and 1945) Attributes and units to quantify them (e.g. hot and degrees) Co-occurrence (e.g. old and 18)
Motivation
Comparing query terms with those in supporting sentences Hypernyms / Hyponyms most common semantic
relationship type TREC Examples
What stadium was the first televised MLB game played in?
In 1939, the first televised major league baseball games were shown on experimental station W2XBS when the Cincinnati Reds and the Brooklyn Dodgers split a doubleheader at Ebbets Field.
Motivation
TREC Examples What actress has received the most Oscar nominations?
Oscar perennial Meryl Streep is up for best actress for the film, tying Katharine Hepburn for most acting nominations with 12.
What ancient tribe of Mexico left behind huge stone heads standing 6-11 feet tall?
The company's founder, geophysicist Sheldon Breiner, is a Stanford University graduate who used the first cesium magnetometer to discover two colossal and ancient Olmec heads in Mexico in the 1960s.
Motivation
TREC Examples When was the Titanic built?
The techniques used today to analyze the defects in the metal did not exist back in 1910 when the ship was being built, he said.
Motivation
How to classify? Common nouns in ontologies (e.g. WordNet) Proper nouns???
Feature co-occurrence Labelling clusters of semantically related terms (Pantel
and Ravichandran, 2004) Responding with the hypernym of similar previously
classified hyponyms (Alfonseca and Manandhar, 2002; Pekar and Staab, 2003; Takenobu et al., 1997)
Motivation
Search patterns (Fleischman et al., 2003; Girju, 2001; Hearst, 1992; 1998; Mann, 2002; Moldovan et al., 2000)
Feature co-occurrence and search patterns (Hahn and Schattinger,1998)
Motivation
Objectives
Create one or more hyponym classifiers for use as a component in a question answering system
Evaluate accuracy of classifiers when identifying the occupations of people
Experimental Framework
P-System Takes a name as input and attempts to respond with the
person's occupation Predicate-argument co-occurrence frequencies
A-System Takes a name as input and attempts to respond with the
person's occupation Search pattern
H-System Takes a name as input and attempts to respond with the
correct occupation sense P-System and A-System hybrid
Experimental Framework
AQUAINT corpus (Graff, 2002) List of 364 occupations 257 names from 2002-2004 TREC Question
Answering track queries 250 were classifiable by a person
Experimental Framework
P-System Classifications
Minipar (Lin, 1998) Subject‑verb and verb‑object pairs Frequencies passed to Okapi's BM25 matching
function Candidate hypernyms (occupations) indexed as
documents Hyponyms (names) presented as queries
Ordered list of hypernyms returned in response to a hyponym
Top-ranking hypernym selected as answer
50 names from TREC queries No co-reference resolution
0.30 accuracy Full names substituted for partial names
0.44 accuracy
P-System Classifications
Accuracy increases with name co-occurrence frequency
Occupation co-occurrence frequency was not a limiting factor in our experiments
Grouping similar occupations allowed us to perform a 194-way classification experiment Accuracy increased from 0.44 to 0.56
P-System Classifications
Tuning constants provide some control over the specificity of occupations returned Best assignment of constants penalises occupations
occurring in fewer than 1,000 predicate-argument pairs Some occupations could be classified better than
others Which could P-System classify accurately? How accurately? Test set too small
P-System Classifications
Larger Evaluation
Apposition pattern Provides reference judgements
Ontology of occupations and an ontological similarity measure Quantifies similarity between returned occupation and
nearest occupation in reference judgements Threshold for the ontological similarity measure
A value greater than or equal to this indicates that the response of P-System is correct
Apposition pattern Search pattern
occupation,? Capitalised Word Sequence Examples
For the past year, actor Aaron Eckhart has been receiving hate mail.
The landlord, Jon Mendelson, said he would consider any offer from Simmons.
Larger Evaluation
Apposition pattern 107,958 distinct capitalised word sequences found in
apposition with an occupation In a random sample of 1,000 instances
801 were correct 93 attributed some role to a person rather than their
occupation (e.g. referred to the leader of an organisation as a chief)
56 indicated an incorrect occupation 50 capitalised word sequence did not refer to the complete
name of a person
Larger Evaluation
Ontology of occupations Manually constructed ontology of hypernyms Internal nodes comprise filler nodes that provide
structure and occupations from the list of 364 Leaf nodes are all taken from the list of occupations
Larger Evaluation
Similarity Measure Semantic Association Measure (SAM) between two
nodes is calculated by1)Assigning a weight, w, to each edge where if c
1 represents
the number of successors of a node in the ontology and c2 is
the number of successors of one of its children then
2)Summing the weights of all edges between the two nodes to determine the distance, d
3)Finally, SAM = 11d
w=1−c21
c11
Larger Evaluation
Similarity Threshold Identified the best from a range of candidate thresholds
between 0.20 and 0.40 Compared a manual evaluation of P-System over 200
names in apposition with an occupation... ...to automated evaluation method using a candidate
threshold to produce a binary judgements
Larger Evaluation
Similarity Threshold If the SAM between the occupation returned by P-
System and the nearest occupation in apposition was...
>= candidate threshold Right by automated evaluation
< candidate threshold Wrong by automated evaluation
Larger Evaluation
Similarity Threshold Calculated proportion of times the automatic and
manual evaluations agreed in their judgements of a response
Selected candidate threshold with largest agreement level
Threshold 0.28 Agreement level of 0.872 Or 0.848 where unusual but otherwise correct classifications
was also considered right. High agreement levels validate evaluation method
Larger Evaluation
P-System was tested on the 3,177 names That exists in apposition with at least one
occupation Which are present in at least 100 predicate-
argument pairs Responses were automatically evaluated
Larger Evaluation
Classification accuracy for actors was 0.955 1.00 > accuracy >= 0.75
Actor, author, quarterback, prosecutor, singer, boxer, premier, coach, attorney, lawyer, politician
0.75 > accuracy>= 0.50 Spokesman, minister, senator, governor, president,
baseman, fielder
Larger Evaluation
0.50 > accuracy >= 0.25 Writer, runner, expert, killer, guard, executive, leader,
hero, brother, player, captain 0.25 > accuracy >= 0.00
General, officer, driver, chief, veteran, chairman, director, manager, agent, host
Larger Evaluation
Comparison of Three Models
A-System Uses apposition instances from previous experiment Returns occupation that occurs most frequently in
apposition with input name
H-System P-System and A-System hybrid If P and A both return an occupation
Returns the occupation sense that occurs in apposition that is closest to the response of P
If only P returns an occupation Returns a sense of the response of P
If only A returns an occupation Returns a sense of the response of A
Comparison of Three Models
Three-way comparison between P, A and H 250 classifiable TREC names Manual evaluation Compared the three models
In both a strict and lenient evaluation Where all names were classified and also where just
those names occurring in at least 100 predicate-argument pairs were classified
Controlled for the ability of A to classify a name
Comparison of Three Models
H is most accurate across all names Significantly better in lenient evaluation
Accuracies: H 0.584, A 0.492, P 0.424
In strict evaluation H is also the most accurate A only attempted to classify 0.632 of names
H and P attempted 0.904 and 0.892 of names
Comparison of Three Models
On names that were found in apposition with an occupation In the lenient evaluation H was most accurate
Accuracies: H 0.797, A 0.778, P 0.544
In the strict evaluation A was the best Accuracies: H 0.722, A 0.728, P 0.462
Comparison of Three Models
H-System returns more general occupations than A-System An advantage for it in the lenient evaluation A disadvantage in the strict evaluation
The principle of combining two very different approaches to classification has been validated
Comparison of Three Models
Combining two classification models such as in H-System allowed us to Respond with high accuracy Increase Recall beyond that of component classifiers
Experiments demonstrate that we can Identify hypernyms of proper nouns such as people's
names In the context of question answering
Conclusions