Timothy Baldwin and Dominic Widdows

Noun Countability

Background

• Countability is a syntactic property of the noun phrase in languages such as English, Dutch, Albanian and Tagalog
• In generation, used to decide between:
  a cake, cake, a piece of cake
• In analysis, helps to resolve ambiguity:
  - I need a paper by this evening (academic/newspaper)
  - I need some paper by this evening (material)
  - I need the paper by this evening (ambiguous)

31 October, 2003

Noun Phrase Countability

• Semantically motivated:
  - bounded, indivisible individuals (+b); prototypically countable: a dog, two dogs
  - unbounded, divisible substances (−b); prototypically uncountable: gold

Countability Classes

• countable: book, button, person (one book, two books)

• uncountable: equipment, gold, wood (*one equipment, much equipment, *two equipments)
• plural only: clothes, manners, outskirts (*one clothes, clothes horse)
• bipartite: glasses, scissors, trousers (*one scissors, scissor kick, pair of scissors)

Applications

• Determination of countability of unknown nouns (e.g. acyclovir, coagulopathy)
• Detection of countability anomalies in multiword expressions (e.g. public relations, cat’s cradle)
• Extraction of English determinerless PPs (e.g. by bus, at sea)
• Key component of noun type hierarchy in precision grammars (e.g. ERG, Alpino)

Learning the Countability of English Nouns from Corpus Data

Timothy Baldwin and Francis Bond
ACL 2003

Learning Countability

• Observation: the countability properties of a noun type are reflected in its corpus token occurrences:

... Cezanne snarling like a dog and then ...

... doing an impression of a rabid dog.

... with a pack of dogs running beside them.

Amnesty International has received information ...

Recent information from former detainees ...

... researchers often uncover information ...

Case in Point

Acyclovir is a specifically anti-viral drug ...

Acyclovir has been developed and marketed by ...

Acyclovir given intravenously, ...

Coagulopathy is a well recognised complication ...

... may explain why coagulopathy after shunting is ...

... could stimulate a coagulopathy ...

... is also probably responsible for a coagulopathy ...

... a patient with a coagulopathy.

Methodology

• Identify lexical and/or constructional features associated with each countability class
• Determine the relative corpus occurrence of the features for each noun
• Use the noun feature vectors to classify the noun as a member of each of the countability classes, training from gold-standard countability data

Feature Clusters

Head noun number (1D): target noun number as head of NP (e.g. a shaggy dog = SINGULAR)

Modifier noun number (1D): target noun number as modifier in NP (e.g. dog food = SINGULAR)

Subject–verb agreement (2D): target noun number as subject vs. verb number agreement (e.g. the dog barks = 〈SINGULAR,SINGULAR〉)

Coordinate noun number (2D): target noun number vs. the number of the head nouns of conjuncts (e.g. dogs and mud = 〈PLURAL,SINGULAR〉)

N1 of N2 constructions (2D): number of N2 vs. type of N1 (e.g. the type of dog = 〈TYPE,SINGULAR〉); total of 11 N1 types for use in this feature cluster (e.g. COLLECTIVE, LACK, TEMPORAL)

Occurrence in PPs (2D): the presence or absence of a determiner (±DET) in the singular head complement of a PP (e.g. per dog = 〈per,−DET〉)

Pronoun co-occurrence (2D): what pronouns occur in the same sentence as singular and plural instances (e.g. The dog ate its dinner = 〈its,SINGULAR〉). Approximation of pronoun co-indexation.

Singular determiners (1D): singular-selecting determiners (e.g. a dog = a). Two types: countable (e.g. another, each), uncountable (e.g. much, little).

Plural determiners (1D): plural-selecting determiners (e.g. few dogs = few)

Non-bounded determiners (2D): non-bounded determiner vs. noun number (e.g. more dogs = 〈more,PLURAL〉)

Feature Values

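1D:

\[ \mathrm{corpfreq}(f_s, w) = \frac{\mathrm{freq}(f_s \mid w)}{\mathrm{freq}(*)} \tag{1} \]

\[ \mathrm{wordfreq}(f_s, w) = \frac{\mathrm{freq}(f_s \mid w)}{\mathrm{freq}(w)} \tag{2} \]

\[ \mathrm{featfreq}(f_s, w) = \frac{\mathrm{freq}(f_s \mid w)}{\sum_i \mathrm{freq}(f_i \mid w)} \tag{3} \]

2D:

\[ \mathrm{featdimfreq}(f_{s,t}, w) = \frac{\mathrm{freq}(f_{s,t} \mid w)}{\sum_i \mathrm{freq}(f_{i,t} \mid w)} \tag{4} \]

\[ \mathrm{featdimfreq}(f_{s,t}, w) = \frac{\mathrm{freq}(f_{s,t} \mid w)}{\sum_j \mathrm{freq}(f_{s,j} \mid w)} \tag{5} \]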
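A minimal sketch of how feature values (1) to (5) could be computed from raw counts. The function names, the dictionary layout and the toy counts for "dog" are assumptions made for illustration; freq(∗) is taken here to be a corpus-wide total, which the slides do not spell out.

```python
def one_d_values(counts, s, word_freq, corpus_freq):
    """Feature values (1)-(3) for feature s of one 1D cluster.

    counts      -- hypothetical dict: feature -> freq(f_i | w) for noun w
    word_freq   -- freq(w), the corpus frequency of the noun
    corpus_freq -- freq(*), assumed here to be a corpus-wide total
    """
    f = counts.get(s, 0)
    cluster_total = sum(counts.values())
    return {
        "corpfreq": f / corpus_freq,
        "wordfreq": f / word_freq,
        "featfreq": f / cluster_total if cluster_total else 0.0,
    }


def two_d_values(counts, s, t):
    """Feature values (4)-(5) for cell (s, t) of one 2D cluster.

    counts -- hypothetical dict: (row, column) -> freq(f_{i,j} | w)
    """
    f = counts.get((s, t), 0)
    # (4): normalise over the first index i with the column t held fixed
    column_total = sum(v for (_, j), v in counts.items() if j == t)
    # (5): normalise over the second index j with the row s held fixed
    row_total = sum(v for (i, _), v in counts.items() if i == s)
    return {
        "featdimfreq_4": f / column_total if column_total else 0.0,
        "featdimfreq_5": f / row_total if row_total else 0.0,
    }


# Toy counts for the noun "dog" (invented for illustration).
singular_dets = {"a": 6, "each": 2}
one_d = one_d_values(singular_dets, "a", word_freq=20, corpus_freq=1000)

non_bounded = {("more", "PLURAL"): 3, ("more", "SINGULAR"): 1, ("most", "PLURAL"): 3}
two_d = two_d_values(non_bounded, "more", "PLURAL")
print(one_d, two_d)
```

Each 1D feature contributes three values and each 2D cell two, so a noun's feature vector is the concatenation of these values over all clusters.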

1-D case

[Figures: corpfreq(f_s, w), wordfreq(f_s, w) and featfreq(f_s, w) illustrated for a 1D feature cluster]

2-D case

[Figures: corpfreq, wordfreq and featfreq illustrated for a 2D feature cluster, over a single cell f_{s,t} and over the totals f_{∗,t} and f_{s,∗}; featdimfreq(f_{s,t}, w) illustrated for a single cell]

Feature Value Extraction

• POS tagging and templates
  - extract features with regexp-based templates
• Full text chunking
  - conservative inter-chunk attachment disambiguation
• Robust parsing (RASP)
• Concatenated feature values from the three systems
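A rough sketch of what a regexp-based template over POS-tagged text might look like. The word/TAG input format, the Penn-style tags (DT, JJ, NN, NNS) and the function name are assumptions for illustration; the templates actually used are not shown on the slides.

```python
import re
from collections import Counter


def determiner_features(tagged_text, noun):
    """Count (determiner, noun number) pairs for one target noun.

    Template (assumed): determiner, optional adjectives, then the
    target noun in singular (NN) or plural (NNS) form.
    """
    template = re.compile(
        r"(\w+)/DT\s+(?:\w+/JJ\s+)*" + re.escape(noun) + r"s?/(NNS?)\b",
        re.IGNORECASE,
    )
    features = Counter()
    for match in template.finditer(tagged_text):
        determiner = match.group(1).lower()
        number = "PLURAL" if match.group(2).upper() == "NNS" else "SINGULAR"
        features[(determiner, number)] += 1
    return features


tagged = ("a/DT shaggy/JJ dog/NN barked/VBD at/IN the/DT dogs/NNS "
          "and/CC another/DT dog/NN")
dog_features = determiner_features(tagged, "dog")
print(dog_features)
```

Counts of this kind would then be normalised into feature values before being passed to the classifiers.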

Classifier architecture

• Training data: generated from combination of ALT-J/E and COMLEX (5,943 common nouns in BNC)
  - positive examples in both ALT-J/E and COMLEX
  - negative examples in neither ALT-J/E nor COMLEX
• Test data: nouns with ≥ 10 BNC instances for all 3 methods (20,530 common nouns)
• Four binary supervised classifiers, one per countability class (learned using TiMBL and k-NN)
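One possible reading of the training-data construction, sketched below: for a given countability class, a noun is a positive example if both lexicons list it in that class, a negative example if neither does, and is left out if they disagree. The dictionary layout, the function name and the toy entries are assumptions, not the actual ALT-J/E or COMLEX formats.

```python
CLASSES = ("countable", "uncountable", "plural only", "bipartite")


def training_labels(lexicon_a, lexicon_b, cls):
    """Binary training labels for one countability class.

    lexicon_a, lexicon_b -- hypothetical dicts: noun -> set of classes
    Returns noun -> True (positive) / False (negative); nouns on which
    the two lexicons disagree are omitted.
    """
    labels = {}
    for noun in lexicon_a.keys() & lexicon_b.keys():
        in_a = cls in lexicon_a[noun]
        in_b = cls in lexicon_b[noun]
        if in_a and in_b:
            labels[noun] = True
        elif not in_a and not in_b:
            labels[noun] = False
    return labels


# Toy lexicon entries (invented for illustration).
altje = {"dog": {"countable"}, "cake": {"countable", "uncountable"}, "gold": {"uncountable"}}
comlex = {"dog": {"countable"}, "cake": {"countable"}, "gold": {"uncountable"}}

datasets = {cls: training_labels(altje, comlex, cls) for cls in CLASSES}
print(datasets["uncountable"])
```

Building one label set per class mirrors the four binary classifiers: a noun such as cake can be positive for more than one class.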

Supervised Classifiers: Basic Overview

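[Diagram: training data is fed to a learner, which produces a classifier; each test instance from the test data is passed to the classifier, which outputs a classification (A or B)]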

k-NN in TiMBL

• Distance between feature vectors X and Y based on “overlap metric”:

\[ \Delta(X, Y) = \sum_{i} \delta(x_i, y_i) \]

\[ \delta(x_i, y_i) = \begin{cases} \left| \dfrac{x_i - y_i}{\max_i - \min_i} \right| & \text{if numeric, else} \\ 0 & \text{if } x_i = y_i \\ 1 & \text{if } x_i \neq y_i \end{cases} \]

• Retrieve the neighbours at the k closest distances and classify according to the most common class amongst them
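The distance and the voting step above can be sketched in plain Python. This is an illustration of the overlap metric as written on the slide, not TiMBL's own code or API; the function names and the toy training vectors are assumptions.

```python
from collections import Counter


def delta(x, y, lo, hi):
    """Per-feature distance: scaled difference if numeric, else 0/1 overlap."""
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y) / (hi - lo) if hi > lo else 0.0
    return 0.0 if x == y else 1.0


def distance(X, Y, ranges):
    """Sum of per-feature distances; ranges holds (min, max) per feature."""
    return sum(delta(x, y, lo, hi) for x, y, (lo, hi) in zip(X, Y, ranges))


def knn_classify(train, query, ranges, k=1):
    """Vote among all training items at the k closest distinct distances."""
    scored = [(distance(vec, query, ranges), label) for vec, label in train]
    closest = sorted({d for d, _ in scored})[:k]
    votes = Counter(label for d, label in scored if d in closest)
    return votes.most_common(1)[0][0]


# Toy vectors (invented): one numeric and one symbolic feature.
ranges = [(0.0, 1.0), (None, None)]
train = [
    ((0.9, "a"), "countable"),
    ((0.8, "a"), "countable"),
    ((0.1, "much"), "uncountable"),
]
print(knn_classify(train, (0.7, "a"), ranges))
print(knn_classify(train, (0.2, "much"), ranges))
```

Note that k counts distinct distances rather than individual neighbours, so several equidistant training items can all vote at k = 1.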

Cross Validation: Input

• Take training data

Cross Validation: Partitioning

• Split up into N equal-sized (optionally stratified) partitions P_i

[Figure: the training data split into ten partitions P]

Cross Validation: Folds 1, 2, 3, ..., i

• For each i = 1...N, take P_i as the test data and {P_j : j ≠ i} as the training data

[Figures: the ten partitions P, shown once each for folds 1, 2 and 3]

• And so on ...

Cross Validation: Evaluate

• Calculate classification accuracy, precision, recall, F-score, ... according to the average across the N iterations
• Effective method of minimising training bias and test variance
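The cross-validation procedure can be sketched as follows. The round-robin (unstratified) partitioning, the function names and the majority-class stand-in learner are assumptions for illustration; only accuracy is averaged here.

```python
from collections import Counter


def partitions(data, n):
    """Split data into n near-equal partitions (round-robin, unstratified)."""
    return [data[i::n] for i in range(n)]


def cross_validate(data, n, train_fn, predict_fn):
    """Mean accuracy over n folds; assumes n <= len(data)."""
    parts = partitions(data, n)
    accuracies = []
    for i, test in enumerate(parts):
        # Every partition except P_i is used for training
        train = [item for j, part in enumerate(parts) if j != i for item in part]
        model = train_fn(train)
        correct = sum(predict_fn(model, vec) == label for vec, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n


# Stand-in learner (an assumption): always predict the majority training label.
def train_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]


def predict_majority(model, vec):
    return model


data = [((i,), "countable") for i in range(8)] + [((i,), "uncountable") for i in (8, 9)]
mean_accuracy = cross_validate(data, 5, train_majority, predict_majority)
print(mean_accuracy)
```

Any learner with the same train/predict shape, such as a k-NN classifier, could be swapped in for the majority-class stand-in.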

Cross-validated Countability Results

• Good results (particularly for countable and uncountable nouns), well above the baseline accuracy in each case
• Best results for combined method (concatenation of three pre-processors)

Manual Evaluation over Open Data

• Classifier precision of 94.6% relative to lexicons
• Manually annotated 100 nouns from the test data:
  - Agreement between classified and hand-annotated countabilities 92.4%
  - Agreement between classified and dictionary countabilities 92.4%
• Classifiers agree with the corpus as well as the lexicons

Reflections

• Impressive results, but still room for improvement (particularly for the less-populated countability classes)
• Boundary between motivated countabilities and conversions (e.g. chicken vs. elephant vs. dog)

• Difficulties caused by MWEs (e.g. cat’s cradle)

• Sense and frequency effects (e.g. information)

Using an ontology to determine English countability

Francis Bond and Caitlin Vatikiotis-Bateson
COLING 2002

Semantic Predictability of Countability

• How far is English countability predictable from meaning?
• Countability is to some degree deterministic given the semantics of a word:
  dog, pooch, canine, mongrel, ...
  BUT suitcases vs. luggage, leaves vs. foliage, etc.

Case in Point

Coagulopathy: group of conditions of the blood clotting (coagulation) system in which bleeding is prolonged and excessive; a bleeding disorder

Acyclovir: antiviral drug

Word Denotation and Countability

• Knowing the referent is not enough, e.g. scales

1. Thought of as being made of two arms: (British)

a pair of scales

2. Thought of as a set of numbers: (Australian)

a set of scales

3. Thought of as discrete whole objects: (American)

one scale/two scales

Methodology

• Take an existing ontology and determine the default countability for each synset (semantic class)
• Test how reliably defaults predict the countability of members of each synset
• Base experimentation on the ALT-J/E semantic transfer lexicon and ontology

Lexicon

• ALT-J/E’s semantic transfer lexicon

• 71,833 linked Japanese-English noun pairs

The Goi-Taikei Ontology

• A rich ontology with wide coverage of Japanese
• Used in many NLP applications such as MT
• 2,710 semantic classes (12-level tree structure) for common nouns
• Constructed from translation pairs (without countability in mind)

Top Four Levels of Ontology

Noun Countability Preferences in ALT-J/E

Noun Countability Preference   Code   Example     Default Number   Default Classifier        #      %
fully countable                CO     knife       sg               -                    47,255   65.8
strongly countable             BC     cake        sg               -                     3,110    4.3
weakly countable               BU     beer        sg               -                     3,377    4.7
uncountable                    UC     furniture   sg               piece                15,435   21.5
plural only                    PT     scissors    pl               pair                  2,107    2.9

Experiment

• Treat every combination of semantic classes as a different semantic class.
• Most frequent NCP is assigned to all members of a class.
  - Ties are resolved as follows: fully countable beats strongly countable beats weakly countable beats uncountable beats plural only.

• Baseline (all fully countable = 65.8%)

Example

• Semantic Class 910:tableware
  - crockery ⇔ toukirui (UC)
  - dinner set ⇔ youshokki (CO)
  - tableware ⇔ shokki (UC)
  - Western-style tableware ⇔ youshokki (UC)
• The most common NCP is UC, so uncountable is associated with 910:tableware.
• This predicts the NCP correctly 75% of the time.
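The majority-vote assignment and the tie-breaking order can be sketched as below, using the tableware example. The function name and data layout are assumptions; the NCP codes and the four entries come from the slides.

```python
from collections import Counter

# Tie-break order from the experiment: earlier codes beat later ones.
PRIORITY = ["CO", "BC", "BU", "UC", "PT"]


def default_ncp(ncps):
    """Most frequent NCP in a class; ties go to the earlier code in PRIORITY.

    max() returns the first maximal element, so iterating over PRIORITY
    in order implements the tie-break. An empty class falls back to CO.
    """
    counts = Counter(ncps)
    return max(PRIORITY, key=lambda code: counts[code])


tableware = {
    "crockery": "UC",
    "dinner set": "CO",
    "tableware": "UC",
    "Western-style tableware": "UC",
}

class_default = default_ncp(tableware.values())
accuracy = sum(ncp == class_default for ncp in tableware.values()) / len(tableware)
print(class_default, accuracy)
```

Three of the four entries match the class default, which gives the 75% figure quoted for this class.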

Results

Conditions                 %      Range       Baseline
Training=Test              77.9   76.8–78.6   65.8
10-fold Cross Validation   71.2   69.8–72.1   65.8

• 11.6% given default value (fully countable)

Discussion

• Semantics predicts countability around 78% of the time: supports hypothesis that countability is semantically motivated
• Less successful than corpus-based countability learning
• Problems of granularity/translation-orientation of lexicon

• Problems with noise in lexicon

The Ins and Outs of Dutch Noun Countability Classification

Timothy Baldwin and Leonoor van der Beek
ALTW 2003

Crosslinguistic Predictability of Countability

• In linguistically-related languages such as English and Dutch, countability generally patterns the same way:
  - same basic behaviour of translation-equivalent lexical/syntactic markers of countability (e.g. one dog ⇌ een hond, some rice ⇌ een beetje rijst)
  - translation pairs often have same countability: car ⇌ auto [countable], food ⇌ eten [uncountable], BUT thunderstorm [countable] vs. onweer [uncountable]

Out-of- vs. In-language Classification

• Given high-quality training data in a closely-related language (English – COMLEX + ALT-J/E) and medium-quality data in the target language (Dutch – Alpino lexicon):
  - which generates the best classifier?
  - what is the best form of crosslingual mapping?
• Focus on the task of Dutch noun countability classification

Approaches to Monolingual Classification

• Evidence-based classification: base classification on token evidence for each countability class
• Distribution-based classification: same as for EN-EN classification task (Baldwin and Bond, 2003)

Approaches to Crosslingual Classification

• Corpus occurrence-based classification (binary vs. multiclass):
  - cluster-to-cluster classification: EN and ND feature clusters pattern the same
  - feature-to-feature classification: EN and ND features pattern the same (all features vs. partitions of feature space)
• Translation-based classification: countability is preserved under translation (e.g. car ⇌ auto [countable])
• Transliteration-based classification: countability is preserved under transliteration (e.g. paranoia ⇌ paranoia [uncountable])
• System combination: classify according to combined outputs of individual methods
  - crosslingual + unsupervised monolingual
  - crosslingual + monolingual
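A hedged sketch of translation-based classification: a Dutch noun inherits the countability of its English translations. The majority vote over translations, the function name and the tiny dictionaries are assumptions for illustration; the slides do not say how multiple translations are reconciled.

```python
from collections import Counter


def classify_by_translation(noun, translations, en_countability):
    """Guess a Dutch noun's countability from its English translations.

    translations    -- hypothetical dict: Dutch noun -> list of English nouns
    en_countability -- hypothetical dict: English noun -> set of classes
    Returns the most frequent class, or None if nothing is known.
    """
    votes = Counter()
    for english in translations.get(noun, []):
        for cls in en_countability.get(english, ()):
            votes[cls] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy dictionaries built from the examples on the slides.
translations = {"auto": ["car"], "eten": ["food"], "onweer": ["thunderstorm"]}
en_countability = {
    "car": {"countable"},
    "food": {"uncountable"},
    "thunderstorm": {"countable"},
}

for dutch in ("auto", "eten", "onweer", "fiets"):
    print(dutch, classify_by_translation(dutch, translations, en_countability))
```

The onweer case comes out as countable here, which is exactly the kind of mismatch (thunderstorm vs. onweer) that motivates combining this method with corpus-based evidence.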

Results

• Better results for crosslingual than monolingual classification (!)
• Classifiers produce countability results more consistent with corpus occurrence than Alpino lexicon
• Translation and transliteration are excellent predictors of countability

• Semantics in crosslingual countability classification?

A Preview of Results from EuroWordNet

                        Dutch Alpino        English dict+learned
                        count   uncount     count   uncount
Dutch Alpino            0.75    0.37        0.87    0.47
Dutch annotated         0.76    0.44        0.90    0.75
English annotated       0.64    0.49        0.58    0.47
English dict+learned    0.63    0.33        0.62    0.47

Final Reflections

• Demonstration of types of methods that can be used to determine noun type countability
  - distribution-based
  - semantics/sense-based
  - translation/transliteration-based

• Where next? Watch this space!

Acknowledgements

• Thanks to Francis Bond and Caitlin Vatikiotis-Bateson for sharing their wonderful slides and graphics!
