Noun Countability
Timothy Baldwin and Dominic Widdows
1 Noun Countability
Background
• Countability is a syntactic property of the noun phrase in
languages such as English, Dutch, Albanian and Tagalog
• In generation, it is used to decide between:
a cake, cake, a piece of cake
• In analysis, it helps to resolve ambiguity:
  - I need a paper by this evening (academic/newspaper)
  - I need some paper by this evening (material)
  - I need the paper by this evening (ambiguous)
31 October, 2003
Noun Phrase Countability
• Semantically motivated:
  - bounded, indivisible individuals (+b)
    prototypically countable: a dog, two dogs
  - unbounded, divisible substances (−b)
    prototypically uncountable: gold
Countability Classes
• countable: book, button, person (one book, two books)
• uncountable: equipment, gold, wood (*one equipment, much equipment, *two equipments)
• plural only: clothes, manners, outskirts (*one clothes, clothes horse)
• bipartite: glasses, scissors, trousers (*one scissors, scissor kick, pair of scissors)
Applications
• Determination of countability of unknown nouns (e.g.
acyclovir, coagulopathy)
• Detection of countability anomalies in multiword
expressions (e.g. public relations, cat’s cradle)
• Extraction of English determinerless PPs (e.g. by bus, at sea)
• Key component of noun type hierarchy in precision
grammars (e.g. ERG, Alpino)
Learning the Countability of English Nouns from Corpus Data
Timothy Baldwin and Francis Bond
ACL 2003
Learning Countability
• Observation: the countability properties of a noun type
are reflected in its corpus token occurrences:
... Cezanne snarling like a dog and then ...
... doing an impression of a rabid dog.
... with a pack of dogs running beside them.
Amnesty International has received information ...
Recent information from former detainees ...
... researchers often uncover information ...
Case in Point
Acyclovir is a specifically anti-viral drug ...
Acyclovir has been developed and marketed by ...
Acyclovir given intravenously, ...
Coagulopathy is a well recognised complication ...
... may explain why coagulopathy after shunting is ...
... could stimulate a coagulopathy ...
... is also probably responsible for a coagulopathy ...
... a patient with a coagulopathy.
Methodology
• Identify lexical and/or constructional features associated
with each countability class
• Determine the relative corpus occurrence of the features
for each noun
• Use the noun feature vectors to classify the noun as
a member of each of the countability classes, training
from gold-standard countability data
Feature Clusters
Head noun number (1D): target noun number as head of
NP (e.g. a shaggy dog = SINGULAR)
Modifier noun number (1D): target noun number as
modifier in NP (e.g. dog food = SINGULAR)
Subject–verb agreement (2D): target noun number as
subject vs. verb number agreement (e.g. the dog barks
= 〈SINGULAR,SINGULAR〉)
Coordinate noun number (2D): target noun number vs.
the number of the head nouns of conjuncts (e.g. dogs
and mud = 〈PLURAL,SINGULAR〉)
N of N constructions (2D): number of N vs. type of N
(e.g. the type of dog = 〈TYPE,SINGULAR〉); total of 11
N types for use in this feature cluster (e.g. COLLECTIVE,
LACK, TEMPORAL).
Occurrence in PPs (2D): the presence or absence of a
determiner (±DET) in singular head complement of PP
(e.g. per dog = 〈per,−DET〉).
Pronoun co-occurrence (2D): what pronouns occur in the
same sentence as singular and plural instances (e.g. The
dog ate its dinner = 〈its,SINGULAR〉). Approximation of
pronoun co-indexation.
Singular determiners (1D): singular-selecting determiners
(e.g. a dog = a). Two types: countable (e.g. another,
each), uncountable (e.g. much, little).
Plural determiners (1D): plural-selecting determiners (e.g.
few dogs = few).
Non-bounded determiners (2D): non-bounded determiner
vs. noun number (e.g. more dogs = 〈more,PLURAL〉).
Feature Values
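1D:
\[ \mathrm{corpfreq}(f_s, w) = \frac{\mathrm{freq}(f_s \mid w)}{\mathrm{freq}(*)} \tag{1} \]
\[ \mathrm{wordfreq}(f_s, w) = \frac{\mathrm{freq}(f_s \mid w)}{\mathrm{freq}(w)} \tag{2} \]
\[ \mathrm{featfreq}(f_s, w) = \frac{\mathrm{freq}(f_s \mid w)}{\sum_i \mathrm{freq}(f_i \mid w)} \tag{3} \]
2D:
\[ \mathrm{featdimfreq}(f_{s,t}, w) = \frac{\mathrm{freq}(f_{s,t} \mid w)}{\sum_i \mathrm{freq}(f_{i,t} \mid w)} \tag{4} \]
\[ \mathrm{featdimfreq}(f_{s,t}, w) = \frac{\mathrm{freq}(f_{s,t} \mid w)}{\sum_j \mathrm{freq}(f_{s,j} \mid w)} \tag{5} \]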
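The feature values can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `one_d_values` and `featdimfreq`, the `Counter` layout and the toy counts are hypothetical and do not come from the slides.

```python
from collections import Counter

def one_d_values(counts, feature, word_freq, corpus_freq):
    """Return (corpfreq, wordfreq, featfreq) for one 1-D feature of a noun.

    counts maps each feature value f_i in the cluster to freq(f_i | w);
    word_freq is freq(w) and corpus_freq is freq(*).
    """
    f = counts[feature]
    total = sum(counts.values())  # sum over i of freq(f_i | w)
    return (f / corpus_freq, f / word_freq, f / total if total else 0.0)

def featdimfreq(counts, s, t):
    """Return the two featdimfreq values (eqs. 4 and 5) for cell (s, t).

    counts maps (s, t) pairs to freq(f_{s,t} | w).
    """
    f = counts[(s, t)]
    over_i = sum(v for (i, j), v in counts.items() if j == t)  # t held fixed
    over_j = sum(v for (i, j), v in counts.items() if i == s)  # s held fixed
    return (f / over_i if over_i else 0.0, f / over_j if over_j else 0.0)

# Toy counts (invented for illustration only).
singular_dets = Counter({"a": 3, "another": 1})
agreement = Counter({("SINGULAR", "SINGULAR"): 6,
                     ("SINGULAR", "PLURAL"): 1,
                     ("PLURAL", "PLURAL"): 3})
```

With these toy counts, `one_d_values(singular_dets, "a", word_freq=10, corpus_freq=1000)` normalises the same raw count of 3 by the corpus size, the noun frequency and the cluster total respectively.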
[Figures omitted: illustrations of corpfreq, wordfreq and featfreq in the 1-D case, and of corpfreq, wordfreq, featfreq and featdimfreq in the 2-D case.]
Feature Value Extraction
• POS tagging and templates
  - extract features with regexp-based templates
• Full text chunking
  - conservative inter-chunk attachment disambiguation
• Robust parsing (RASP)
• Concatenated feature values from three systems
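As a rough illustration of the first approach, a regexp template over POS-tagged text might look like the sketch below. The `word/TAG` input format, the Penn-style tags (`DT`, `JJ`, `NN`, `NNS`) and the names `TEMPLATE` and `determiner_features` are all assumptions made for this example; the slides do not specify the tagset or the actual templates.

```python
import re

# Hypothetical template: determiner, optional adjectives, then a common noun.
# Assumes "word/TAG" tokens separated by single spaces and Penn-style tags.
TEMPLATE = re.compile(r"(\w+)/DT (?:\w+/JJ )*(\w+)/(NNS?)\b")

def determiner_features(tagged):
    """Return (determiner, noun, number) triples found in a tagged sentence."""
    return [
        (det.lower(), noun.lower(), "PLURAL" if tag == "NNS" else "SINGULAR")
        for det, noun, tag in TEMPLATE.findall(tagged)
    ]
```

A call such as `determiner_features("a/DT shaggy/JJ dog/NN barks/VBZ")` yields one triple pairing the determiner with the head noun and its number, which is the kind of evidence the determiner feature clusters count.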
Classifier architecture
• Training data: generated from combination of ALT-J/E
and COMLEX (5,943 common nouns in BNC)
  - positive examples in both ALT-J/E and COMLEX
  - negative examples in neither ALT-J/E nor COMLEX
• Test data: nouns with ≥ 10 BNC instances for all 3
methods (20,530 common nouns)
• Four binary supervised classifiers, one per countability
class (learned using TiMBL and k-NN)
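The training-data rule can be sketched as follows. This is a minimal illustration, assuming each lexicon is available as a dictionary from noun to a set of countability classes; the name `training_labels` and the decision to return `None` (noun left out for that class) when the two lexicons disagree are assumptions of this sketch.

```python
CLASSES = ("countable", "uncountable", "plural only", "bipartite")

def training_labels(noun, lex_a, lex_b):
    """Return one label per countability class for a noun.

    True  = positive example (class listed in both lexicons)
    False = negative example (class listed in neither lexicon)
    None  = lexicons disagree (assumed here to be excluded from training)
    """
    labels = {}
    for c in CLASSES:
        in_a = c in lex_a.get(noun, set())
        in_b = c in lex_b.get(noun, set())
        if in_a and in_b:
            labels[c] = True
        elif not in_a and not in_b:
            labels[c] = False
        else:
            labels[c] = None
    return labels
```

Each of the four binary classifiers would then be trained only on the nouns whose label for its class is `True` or `False`.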
Supervised Classifiers: Basic Overview
[Diagram: training data is passed to a learner, which produces a classifier; the classifier assigns a class (A or B) to each test instance from the test data.]
k-NN in TiMBL
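• Distance between feature vectors X and Y based on the "overlap metric":
\[ \Delta(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i) \]
\[ \delta(x_i, y_i) =
\begin{cases}
\left| \dfrac{x_i - y_i}{\max_i - \min_i} \right| & \text{if numeric, else} \\
0 & \text{if } x_i = y_i \\
1 & \text{if } x_i \neq y_i
\end{cases} \]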
• Retrieve the neighbours at the k closest distances and
classify according to the most common class amongst
them
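The overlap metric and the neighbour vote can be sketched in plain Python. This is not TiMBL's implementation or API; the function names, the `ranges` argument (one `(min, max)` pair per feature, ignored for symbolic features) and the toy data in the usage note are assumptions for illustration.

```python
from collections import Counter

def delta(x, y, lo, hi):
    """Per-feature distance: scaled difference if numeric, else 0/1 overlap."""
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y) / (hi - lo) if hi != lo else 0.0
    return 0.0 if x == y else 1.0

def distance(X, Y, ranges):
    """Sum of per-feature distances between vectors X and Y."""
    return sum(delta(x, y, lo, hi) for x, y, (lo, hi) in zip(X, Y, ranges))

def knn_classify(query, training, ranges, k=1):
    """Vote among all training items lying at the k closest distances.

    training is a list of (vector, label) pairs.
    """
    scored = [(distance(query, vec, ranges), label) for vec, label in training]
    closest = sorted({d for d, _ in scored})[:k]  # k distinct distance values
    votes = Counter(label for d, label in scored if d in closest)
    return votes.most_common(1)[0][0]
```

Note that k counts distinct distances rather than individual neighbours, so several training items tied at the same distance all take part in the vote.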
Cross Validation: Input
• Take training data:
Cross Validation: Partitioning
• Split up into N equal-sized (optionally stratified)
partitions P_i:
[Figure: the training data split into ten partitions]
Cross Validation: Folds
• For each i = 1...N, take P_i as the test data and
{P_j : j ≠ i} as the training data
[Figures: folds 1, 2 and 3, each drawn over the ten partitions]
• And so on for the remaining folds
Cross Validation: Evaluate
• Calculate classification accuracy, precision, recall,
F-score, ... averaged across the N iterations
• Effective method of minimising training bias and test
variance
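The cross-validation procedure can be sketched as below. It is a minimal, unstratified version that assumes the data has at least as many items as folds; `cross_validate`, `majority_learner` and `accuracy` are illustrative names, and the toy majority-class learner merely stands in for a real classifier.

```python
from collections import Counter

def cross_validate(data, n_folds, train_fn, eval_fn):
    """Average eval_fn over n_folds train/test splits of data."""
    # Strided slicing gives near-equal partitions P_1..P_N (not stratified).
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test in enumerate(folds):
        # Training data is every partition except the held-out P_i.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(eval_fn(train_fn(train), test))
    return sum(scores) / n_folds

def majority_learner(train):
    """Toy learner: always predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def accuracy(model, test):
    """Fraction of test items whose label equals the constant prediction."""
    return sum(1 for _, label in test if label == model) / len(test)
```

Any learner and metric with the same call shapes can be passed in as `train_fn` and `eval_fn`; a stratified variant would build the partitions per class before merging them.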
Cross-validated Countability Results
• Good results (particularly for countable and uncountable
nouns), well above the baseline accuracy in each case
• Best results for combined method (concatenation of
three pre-processors)
Manual Evaluation over Open Data
• Classifier precision of 94.6% relative to lexicons
• Manually annotated 100 nouns from the test data:
  - Agreement between classified and hand-annotated
    countabilities 92.4%
  - Agreement between classified and dictionary
    countabilities 92.4%
• Classifiers agree with corpus as well as lexicons
Reflections
• Impressive results, but still room for improvement
(particularly for the less-populated countability classes)
• Boundary between motivated countabilities and
conversions (e.g. chicken vs. elephant vs. dog)
• Difficulties caused by MWEs (e.g. cat’s cradle)
• Sense and frequency effects (e.g. information)
Using an ontology to determine English countability
Francis Bond and Caitlin Vatikiotis-Bateson
COLING 2002
Semantic Predictability of Countability
• How far is English countability predictable from
meaning?
• Countability is to some degree deterministic given the
semantics of a word:
dog, pooch, canine, mongrel, ...
BUT suitcases vs. luggage, leaves vs. foliage, etc.
Case in Point
Coagulopathy: group of conditions of the blood clotting
(coagulation) system in which bleeding is prolonged and
excessive, a bleeding disorder
Acyclovir: antiviral drug
Word Denotation and Countability
• Knowing the referent is not enough, e.g. scales
1. Thought of as being made of two arms: (British)
a pair of scales
2. Thought of as a set of numbers: (Australian)
a set of scales
3. Thought of as discrete whole objects: (American)
one scale/two scales
Methodology
• Take an existing ontology and determine the default
countability for each synset (semantic class)
• Test how reliably defaults predict the countability of
members of each synset
• Base experimentation on the ALT-J/E semantic transfer
lexicon and ontology
Lexicon
• ALT-J/E’s semantic transfer lexicon
• 71,833 linked Japanese-English noun pairs
The Goi-Taikei Ontology
• A rich ontology with wide coverage of Japanese
• Used in many NLP applications such as MT
• 2,710 semantic classes (12-level tree structure) for
common nouns
• Constructed from translation pairs (without countability
in mind)
Top Four Levels of Ontology
Noun Countability Preferences in ALT-J/E
Noun Countability    Code  Example    Default  Default     #       %
Preference                            Number   Classifier
fully countable      CO    knife      sg       -           47,255  65.8
strongly countable   BC    cake       sg       -            3,110   4.3
weakly countable     BU    beer       sg       -            3,377   4.7
uncountable          UC    furniture  sg       piece       15,435  21.5
plural only          PT    scissors   pl       pair         2,107   2.9
Experiment
• Treat every combination of semantic classes as a
different semantic class.
• Most frequent NCP is assigned to all members of a
class.
  - Ties are resolved as follows: fully countable beats
    strongly countable beats weakly countable
    beats uncountable beats plural only.
• Baseline (all fully countable = 65.8%)
Example
• Semantic Class 910:tableware
  - crockery ⇔ toukirui (UC)
  - dinner set ⇔ youshokki (CO)
  - tableware ⇔ shokki (UC)
  - Western-style tableware ⇔ youshokki (UC)
• The most common NCP is UC, so uncountable is associated with 910:tableware.
• This predicts the NCP correctly 75% of the time.
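The default assignment and its tie-breaking order can be sketched as follows, using the tableware example. The names `PRIORITY`, `default_ncp` and `default_accuracy` are illustrative and not taken from the original system; only the codes and the ordering come from the slides.

```python
from collections import Counter

# Tie-break order from the Experiment slide: earlier codes beat later ones.
PRIORITY = ["CO", "BC", "BU", "UC", "PT"]

def default_ncp(ncps):
    """Most frequent NCP in a class, ties resolved by PRIORITY order."""
    counts = Counter(ncps)
    return max(PRIORITY, key=lambda c: (counts[c], -PRIORITY.index(c)))

def default_accuracy(ncps):
    """Share of class members whose NCP matches the class default."""
    default = default_ncp(ncps)
    return sum(1 for n in ncps if n == default) / len(ncps)

# NCPs of the four entries listed under 910:tableware.
tableware = ["UC", "CO", "UC", "UC"]
```

For `tableware`, `default_ncp` picks UC and `default_accuracy` gives 0.75, matching the 75% quoted on the slide; a two-member class with one CO and one UC entry would default to CO under the tie-break.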
Results
Conditions                 %     Range      Baseline
Training=Test              77.9  76.8–78.6  65.8
10-fold Cross Validation   71.2  69.8–72.1  65.8
• 11.6% given default value (fully countable)
Discussion
• Semantics predicts countability around 78% of the time
(training = test; 71% under 10-fold cross validation):
supports the hypothesis that countability is semantically
motivated
• Less successful than corpus-based countability learning
• Problems of granularity/translation-orientation of
lexicon
• Problems with noise in lexicon
The Ins and Outs of Dutch Noun Countability Classification
Timothy Baldwin and Leonoor van der Beek
ALTW 2003
Crosslinguistic Predictability of Countability
• In linguistically-related languages such as English and
Dutch, countability generally patterns the same way:
  - same basic behaviour of translation-equivalent
    lexical/syntactic markers of countability (e.g. one
    dog ⇌ een hond, some rice ⇌ een beetje rijst)
  - translation pairs often have same countability: car ⇌
    auto [countable], food ⇌ eten [uncountable], BUT
    thunderstorm [countable] vs. onweer [uncountable]
Out-of- vs. In-language Classification
• Given high-quality training data in a closely-related
language (English – COMLEX + ALT-J/E) and medium-quality
data in the target language (Dutch – Alpino lexicon):
  - which generates the best classifier?
  - what is the best form of crosslingual mapping?
• Focus on the task of Dutch noun countability
classification
Approaches to Monolingual Classification
• Evidence-based classification: base classification on
token evidence for each countability class
• Distribution-based classification: same as for the EN-EN
classification task (Baldwin and Bond, 2003)
Approaches to Crosslingual Classification
• Corpus occurrence-based classification (binary vs.
multiclass):
  - cluster-to-cluster classification: EN and ND feature
    clusters pattern the same
  - feature-to-feature classification: EN and ND
    features pattern the same (all features vs. partitions
    of feature space)
• Translation-based classification: countability
is preserved under translation (e.g. car ⇌ auto
[countable])
• Transliteration-based classification: countability
is preserved under transliteration (e.g. paranoia ⇌
paranoia [uncountable])
• System combination: classify according to combined
outputs of individual methods
  - crosslingual + unsupervised monolingual
  - crosslingual + monolingual
Results
• Better results for crosslingual than monolingual
classification (!)
• Classifiers produce countability results more consistent
with corpus occurrence than Alpino lexicon
• Translation and transliteration are excellent predictors
of countability
• Semantics in crosslingual countability classification?
A Preview of Results from EuroWordNet
                      Dutch Alpino      English dict+learned
                      count  uncount    count  uncount
Dutch Alpino          0.75   0.37       0.87   0.47
Dutch annotated       0.76   0.44       0.90   0.75
English annotated     0.64   0.49       0.58   0.47
English dict+learned  0.63   0.33       0.62   0.47
Final Reflections
• Demonstration of types of methods that can be used to
determine noun type countability
  - distribution-based
  - semantics/sense-based
  - translation/transliteration-based
• Where next? Watch this space!
Acknowledgements
• Thanks to Francis Bond and Caitlin Vatikiotis-Bateson
for sharing their wonderful slides and graphics!